Forum OpenACS Development: Hash value for files

Posted by Malte Sussdorff on
During the Heidelberg gathering some thought was invested in how to
circumvent the upload of identical files multiple times. Here is a
proposal for a solution:

The content repository will be extended to handle HASH values that will be
calculated on the file being uploaded and stored in the cr_revisions
table. Additionally, the 1:n relationship between cr_items and cr_revisions
has to be amended to support an n:m relationship (one revision can belong to
multiple items). You could alternatively use cr_item_rels, but this would
mean losing the ability to add a new description to the object along
with the other information stored in cr_revisions but not in acs_objects.
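
A rough sketch of the data model side (PostgreSQL; content_hash and
cr_item_revision_map are invented names, nothing that exists in the CR today):

    -- hash of the revision's content, computed at upload time
    alter table cr_revisions add column content_hash varchar(64);
    create index cr_revisions_hash_idx on cr_revisions (content_hash);

    -- hypothetical mapping table giving the n:m relationship
    -- (one revision shared by several items)
    create table cr_item_revision_map (
        item_id      integer
                     constraint cr_irm_item_fk
                     references cr_items (item_id),
        revision_id  integer
                     constraint cr_irm_rev_fk
                     references cr_revisions (revision_id),
        constraint cr_irm_pk primary key (item_id, revision_id)
    );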

Some functions:

- When uploading a file, the hash value will be calculated and looked up in
the cr_revisions table. If the value is already there, use the same
revision_id for the new item that is being uploaded; if not, create a new
revision (see the SQL sketch after this list).

- Uploading a new revision of an item will only affect that item, not the
other items that are using the revision being replaced.

- A view should be generated to show where the item is being used in the
system (community and/or URL) and by whom. Additionally, some calculations
might happen on these relationships that could be useful for collaborative
filtering and other ways of linking knowledge.

- The calculation shall happen while uploading and it should be possible to
do it on only part of the file. This should prevent a user from uploading a
5 MB file only to discover that the bandwidth was wasted because the file
was already there. I imagine a scheduled proc running over the /tmp
directory (where the intermediate uploads are stored by AOLserver) and doing
the calculation once the minimum needed amount of the file has been
received, aborting the upload if the HASH matches a revision already in the
system.

- The HASH value shall be the one implemented in a widely used peer-to-peer
network. Benefits:

-- The file can be retrieved using P2P methods instead of having to
download it directly from the server. This is highly useful if the server
has limited bandwidth (I can think of the E-Lane project here as a primary
user).
-- Files can be shared in a closed user group without the need to store the
file on the server. This is useful, e.g., for sensitive content that you do
not want to store on the server but instead make available only from your
own computer to a trusted group of people.
-- It is possible to upload a file by submitting only the hash value, with
the (OpenACS) server acting as a client in the P2P network to retrieve the
file (unless that is not wanted; see the point above).
-- It is easy to store the same file in multiple OpenACS instances.

- If an item gets a new revision, inform the owners of the other items
sharing the old revision about the new revision as well. Obviously, some
thought has to be invested in how to do this with regard to permissions and
the user interface.
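
For the upload-time lookup from the first point in the list, something along
these lines should be enough (again using the invented content_hash column
and mapping table from the sketch further up):

    -- is the uploaded content already known?
    select revision_id
    from   cr_revisions
    where  content_hash = :content_hash
    limit  1;

    -- if a revision_id comes back, map the new item onto it
    -- instead of creating a new revision
    insert into cr_item_revision_map (item_id, revision_id)
    values (:item_id, :revision_id);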

Conditions the HASH calculation has to take into account:

- Has to be identical to the hash used by a decentralized P2P network.
- Must provide a high likelihood of uniqueness (so that only identical files
get the same HASH value).
- It should be possible to calculate the value on only a chunk of the
file.

Questions:

- How would I make sure that only the people who have permission on the
revision can access the file from my P2P shared drive? Obviously they would
only get access to the HASH value if they have permission to view the file,
but what happens if the HASH is distributed or sniffed?
- What P2P networks fulfill the above requirements?
- How could this be incorporated with WebDAV?
- What would be a good solution to prevent copyrighted material from
reaching the server? (If users could just insert the hash value and have the
server retrieve the file, copyrighted material might make it onto the server
fairly easily.)
- Any comments on this idea, extensions you would want to see, and whether
you see a need for this as well.

Obviously this should make it into the new Tcl API for storing (and
retrieving) content. Do we have a draft API description somewhere already?

My thanks go to Eduardo and Alvaro for sitting down in Heidelberg and
starting the discussion and to Al for pointing out the idea of using P2P for
not storing files on the central OpenACS server.

2: Re: Hash value for files (response to 1)
Posted by Don Baccus on
"Additionally the 1:n relationship between cr_items and cr_revisions
has to be amended to support a n:m relationship (one revision can belong to
multiple items)."

How do you propose to do this without breaking everything that already exists?

Simpler would be to allow multiple revisions to point to the same file (if LOBs are stored in the file system).  The PG BLOB hack already allows for multiple columns to point to the same BLOB ... not sure whether Oracle's LOBs support this.
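
A rough sketch of what I mean, using the hypothetical content_hash column
from the first post and assuming LOB storage (in practice the insert would
go through content_revision__new, not raw SQL):

    -- create a new revision for the item but reuse the existing LOB,
    -- so the actual bytes are stored only once
    insert into cr_revisions
           (revision_id, item_id, title, mime_type, content_length, lob, content_hash)
    select :revision_id, :item_id, :title, r.mime_type, r.content_length, r.lob, r.content_hash
    from   cr_revisions r
    where  r.content_hash = :content_hash
    limit  1;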

I still question whether this functionality is useful enough, for enough people, for us to implement it as part of our core functionality.  File systems the world over, on a large variety of operating systems, manage without it, and indeed, if this were of high importance I would think the file system would be the place to implement it.  If the application layer, not the filesystem, is the "right" place to implement it because it is of limited use, then the same argument applies to our core "file system" (content repository) service, doesn't it?

An additional point:

"The calculation shall happen while uploading and should be possible to be
done on only part of the file. This shall prevent a 5MB file to be uploaded
by the user just to realize the whole bandwidth was not necessary as the
file was already there."

Wouldn't this require support for partial file upload from AOLserver?  AFAIK when you push "submit" on a file upload form, AOLserver slurps the entire file onto your server before OpenACS can intervene.

3: Re: Hash value for files (response to 2)
Posted by Jeff Davis on
I do think storing binary content keyed by hash is a good idea, but I would rather implement trigger-maintained reference counting than do some complicated view for garbage collection.
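
A minimal sketch of the trigger-maintained counting I have in mind
(PL/pgSQL; cr_files and content_hash are invented names, just to illustrate
the idea):

    -- one row per distinct stored file, keyed by its hash
    create table cr_files (
        content_hash varchar(64) primary key,
        ref_count    integer not null default 0
    );

    create or replace function cr_files__update_ref_count () returns trigger as '
    begin
        if TG_OP = ''INSERT'' then
            update cr_files set ref_count = ref_count + 1
            where  content_hash = new.content_hash;
        elsif TG_OP = ''DELETE'' then
            update cr_files set ref_count = ref_count - 1
            where  content_hash = old.content_hash;
        end if;
        return null;
    end;' language 'plpgsql';

    create trigger cr_revisions_ref_count_tr
    after insert or delete on cr_revisions
    for each row execute procedure cr_files__update_ref_count();

A sweeper could then garbage-collect the rows (and the files on disk) whose
ref_count has dropped to zero.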

I think the idea that you could upload only a little bit of a file by checking the hash on the beginning is a bad one. What if I have a large DocBook document where I edit the afterword: the beginning will match and the end will be different. And as Don mentioned, there is no way to stop an upload like that without changing AOLserver itself. The only real way to implement this is to provide client-side software which computes the hash on the whole document locally and then uploads only if it has changed. Something like that would be very nice, especially in the context of something like photobook, where you might want to sync your local photo collection to the server periodically.

Oh, and I think we should call the modules to do all this "OpenrsyncACS" and "OpentorrentACS".

4: Re: Hash value for files (response to 1)
Posted by Nis Jørgensen on
DISCLAIMER:
I have not worked with file-storage, nor with OACS >= 5, so some of what I say may be out of date or meaningless.

I have been thinking about hashing/deduplicating files uploaded to the CR lately myself. My focus is mainly on large multimedia files and storage space, which seems to be quite a different angle from yours. To me the problem ties in quite closely with the internationalisation of content - if nothing else, then because both touch relations between cr_items in very basic ways.

The problem I see (in the current Greenpeace implementation) is that the same files, especially multimedia files, are

a) revisioned, each revision carrying its own identical copy of the file.

b) "localized" (without actual API) into different content_items, each including translated metadata and an identical copy of the file.

c) used with different metadata in different contexts within the site (i.e. different titles for an image in different places). In our case this is implemented outside the content repository, but it is ugly.

d) uploaded by different people independently.

A quick test shows that an estimated 2.4 GB in our database are duplicate files in the content repository. You might understand why I am keen to get this right 😊

The "clean" way to solve this, IMO, is to separate file data from content metadata. I am not sure how this best translates to our current concepts (cr_items, file-storage, acs_objects and child_rels), but my best guess would be something like this:

BEGIN my solution

(taking an image file as an example)

The file itself is a non-revisioned and immutable cr_item (= only one revision) of type "file". You can store properties about the file (NOT about the content) as properties of the revision, but this should normally be done automatically, not through user input. This includes the hash, file size, content-type etc.

Then we have another content_type, "image", which contains metadata about the image itself and the item_id of the file.

When a new file is uploaded, either a new "file" item is created, or an existing one is reused if its hash already exists. In either case, a
new revision of the "image" is created, linking to the (old or new) "file".

END my solution
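
Translated into a rough data model (all names invented; a real
implementation would register these as CR content types rather than creating
raw tables):

    -- attribute table for the immutable "file" type: exactly one revision
    -- per item, describing the stored bytes (mime_type and content_length
    -- already live on cr_revisions)
    create table demo_file_revisions (
        file_revision_id integer primary key
                         references cr_revisions (revision_id),
        content_hash     varchar(64) not null unique
    );

    -- attribute table for the revisioned "image" type: user-editable
    -- metadata plus a pointer to the immutable "file" item
    create table demo_image_revisions (
        image_revision_id integer primary key
                          references cr_revisions (revision_id),
        caption           varchar(1000),
        file_item_id      integer references cr_items (item_id)
    );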

Looking at file-storage, it seems to match the description for my "file" objects pretty closely (with some extra features such as organisation into folders). The only big difference is that file-storage allows file revisions, whereas I want to make them immutable items to avoid problems with sharing them between different users.

To be discussed: Permissions