Forum OpenACS Development: Re: Hash value for files

4: Re: Hash value for files (response to 1)
Posted by Nis Jørgensen on
DISCLAIMER:
I have not worked with file-storage, nor with OACS >= 5, so some of what I say may be out of date or meaningless.

I have been thinking about hashing/deduplicating files uploaded to the CR lately myself. My focus is mainly on large multimedia files and storage space, which seems to be quite a different angle from yours. To me the problem ties in quite closely with the internationalisation of content, if for no other reason than that both touch the relations between cr_items in very basic ways.

The problem I see (in the current Greenpeace implementation) is that the same files, especially multimedia files, are

a) revisioned, each revision carrying its own identical copy of the file.

b) "localized" (without actual API) into different content_items, each including translated metadata and an identical copy of the file.

c) used with different metadata in different contexts within the site (i.e. a different title for the same image in different places). In our case, this is implemented outside the content repository, but it is ugly.

d) uploaded by different people independently.

A quick test shows that an estimated 2.4 GB in our database are duplicate files in the content repository. You might understand why I am keen to get this right 😊
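
For anyone who wants to run a similar check: here is a minimal Python sketch of how such an estimate could be made. It is not necessarily how the figure above was obtained. The function name `duplicate_bytes` is made up for illustration, and it assumes the file contents are available as byte strings (in practice you would read them from the CR storage area or the database).

```python
import hashlib


def duplicate_bytes(blobs):
    """Return the number of bytes taken up by second and later copies.

    `blobs` is any iterable of bytes objects (hypothetical input; in a real
    check these would be the stored file contents).
    """
    seen = set()
    wasted = 0
    for data in blobs:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            # Identical content was already counted once; this copy is waste.
            wasted += len(data)
        else:
            seen.add(digest)
    return wasted
```

Only the first copy of each distinct content is treated as necessary, so three identical copies of a 4-byte file count as 8 wasted bytes.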

The "clean" way to solve this, IMO, is to separate file data from content metadata. I am not sure how this best translates to our current concepts (cr_items, file-storage, acs_objects and child_rels), but my best guess would be something like this:

BEGIN my solution

(taking an image file as an example)

The file itself is a non-revisioned and immutable cr_item (= only one revision) of type "file". You can store properties about the file (NOT about the content) as properties of the revision, but this should normally be done automatically, not through user input. This includes the hash, file size, content-type etc.

Then we have another content_type "image" which contains metadata about the image itself and the item_id of the file.

When a file is uploaded, either a new "file" item is created, or an existing one is reused if an item with the same hash already exists. In both cases a new revision of the "image" is created, linking to the (old or new) "file".

END my solution
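
To make the idea above a bit more concrete, here is a rough in-memory sketch in Python. It is not OpenACS code and uses none of the real content repository API. All names (`FileItem`, `ImageRevision`, `Repository`, `store_file`, `add_image_revision`) are hypothetical, and SHA-256 is just one possible choice of hash.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FileItem:
    """Immutable "file" item; its properties are derived from the bytes."""
    item_id: int
    sha256: str
    size: int
    mime_type: str


@dataclass
class ImageRevision:
    """One revision of an "image": user-facing metadata plus a file link."""
    title: str
    file_item_id: int


@dataclass
class Repository:
    """Toy stand-in for the content repository (assumption, not the real CR)."""
    files_by_hash: dict = field(default_factory=dict)    # hash -> FileItem
    image_revisions: dict = field(default_factory=dict)  # image id -> revisions
    next_id: int = 1

    def store_file(self, data: bytes, mime_type: str) -> FileItem:
        """Create a new "file" item, or reuse the one with the same hash."""
        digest = hashlib.sha256(data).hexdigest()
        existing = self.files_by_hash.get(digest)
        if existing is not None:
            return existing
        item = FileItem(self.next_id, digest, len(data), mime_type)
        self.next_id += 1
        self.files_by_hash[digest] = item
        return item

    def add_image_revision(self, image_id: str, title: str,
                           data: bytes, mime_type: str) -> ImageRevision:
        """Add a new "image" revision linking to the (old or new) file."""
        file_item = self.store_file(data, mime_type)
        revision = ImageRevision(title, file_item.item_id)
        self.image_revisions.setdefault(image_id, []).append(revision)
        return revision
```

With this, uploading the same bytes under an English and a Danish title gives two "image" revisions that point at one "file" item, while uploading different bytes creates a second "file" item.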

Looking at file-storage, it seems to match the description for my "file" objects pretty closely (with some extra features such as organisation into folders). The only big difference is that file-storage allows file revisions, whereas I want to make them immutable items to avoid problems with sharing them between different users.

To be discussed: Permissions