Forum OpenACS Improvement Proposals (TIPs): TIP #94 (rejected): Add sha1 hash to cr_revisions

We currently have no method to detect whether a file has been uploaded to the system before or not.

To achieve this I would like to store the sha1 hash value, as computed by ns_sha1, in cr_revisions. This saves the time of recalculating it for all existing content whenever a new file is added.

The reason for sha1 lies in the fact that bittorrent is using it and we therefore have the option to use this value to create a "download with bittorrent" link to file storage in the future.

Once approved I'd add the column "sha1_value" to the cr_revisions table and write an upgrade script. This upgrade script would *not* scan existing files, as this could prove disastrous (take a long time).

Furthermore, I'd add functionality to file-storage to calculate the ns_sha1 value when a file is uploaded.

Further changes could include:

- Change the upload workflow for files. If a file already exists in the system *and* the user has read permissions on it, offer to create a symlink instead of uploading the file anew.
- Write a bittorrent file generator for files, displayable in folder-chunk.tcl
- Write a closeness calculator for communities (the more files are identical in the communities, the closer they seem to be related). Obviously this involves considerably more than just files, but we need to cover the basics.
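
To make the last idea concrete, here is a minimal sketch of one possible closeness measure. It assumes each community can be reduced to the set of sha1 values of its files and uses the Jaccard index (shared hashes divided by all distinct hashes); the TIP does not specify a formula, so both the measure and the function name `closeness` are illustrative assumptions.

```python
def closeness(hashes_a, hashes_b):
    """Jaccard similarity of two collections of sha1 values (0.0 to 1.0).

    Hypothetical helper: the TIP only says communities sharing more
    identical files are more closely related, not how to score it.
    """
    a, b = set(hashes_a), set(hashes_b)
    union = a | b
    if not union:
        # Two communities without any files share nothing measurable.
        return 0.0
    return len(a & b) / len(union)
```

A score of 1.0 would mean both communities hold exactly the same set of files, and 0.0 that they share none.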

For a previous discussion that was broader in scope than this TIP, see https://openacs.org/forums/message-view?message_id=179437

Posted by Jade Rubick on
You don't need the upgrade script to compute all the sha1 values, but you could do it in the file storage package. It could create one in the background whenever a file is accessed.

Or you could have the file storage package check once a day and update 5-10 of them in the background.

Posted by Jeff Davis on
If you do the sha1 in tcl you need to be very careful about not disturbing the encoding. In fact, I think using ns_sha1 is not workable, since it requires the file to be in memory, and things like video will be too large for that. It would be necessary to either add an ns_sha1_file command or use an external program for this to work at all. Adding the function to nssha1 would not be that hard, but it would mean a dependency on the new nssha1 version (although you could introspect whether the function is available and just not generate the hash if it is not).
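
As an illustration of the file-based approach, here is a minimal sketch in Python rather than Tcl, using the standard hashlib module. It is not the proposed ns_sha1_file command; the function name `sha1_of_file` and the 1 MB chunk size are assumptions. It hashes the file in fixed-size chunks so the whole file never has to be in memory, and opens it in binary mode so no encoding translation touches the bytes.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of the file at `path`, read chunk by chunk."""
    digest = hashlib.sha1()
    # Binary mode: the bytes are hashed exactly as stored on disk.
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # end of file
            digest.update(chunk)
    return digest.hexdigest()
```

Memory use stays bounded by the chunk size regardless of how large the file is.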

Also, you can't compute it until the user has uploaded the file, so you can't avoid the transfer (your "further changes" point 1), although once the file is transferred you could offer to symlink.
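
A rough sketch of that post-upload check, again in Python for illustration. The data structures are assumptions: `items_by_sha1` stands in for a lookup on the proposed sha1_value column, and `readable_item_ids` for the result of a permission check; neither reflects the actual OpenACS API.

```python
def symlink_candidates(sha1_value, items_by_sha1, readable_item_ids):
    """Return existing items with the same hash that the user may read.

    items_by_sha1: dict mapping a sha1 value to a list of item ids
                   (hypothetical stand-in for a cr_revisions query).
    readable_item_ids: set of item ids the uploading user can read
                   (hypothetical stand-in for a permission check).
    """
    matches = items_by_sha1.get(sha1_value, [])
    return [item_id for item_id in matches if item_id in readable_item_ids]
```

If the list comes back empty, the upload is simply stored as a new file; otherwise the user could be offered a symlink to one of the returned items.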

Andrew and I looked at how much duplicate content there was in the sloan file storage, and it looked like less than 20% (by byte count) was duped, so I am not entirely sure it's that big a win.

Running sha1's over all the cr content would probably be pretty fast even for a reasonably large site. I have done it for 100 GB or so on my desktop machine and it only took about 4 hours, iirc.

Posted by Malte Sussdorff on
As this TIP was not seconded it seems to have been rejected according to TIP rules. I just hope we did not commit our working patch by accident 😊.