If you do the sha1 in Tcl you need to be very careful not to disturb the
encoding. In fact I think using ns_sha1 is not workable, since it requires the
file to be in memory, and things like video will be too large. It would be
necessary to either add an ns_sha1_file command or use an external program for
this to work at all. Adding the function to nssha1 would not be that hard,
but it would mean a dependency on the new nssha1 version (although you
could introspect whether the function is available and just not generate the hash if not).
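
To make the memory point concrete, here is a rough sketch of what a file-based hash would do: read the file in binary chunks and feed each chunk to the digest, so memory use stays flat no matter how big the upload is. It is in Python purely for illustration; sha1_file and chunk_size are names I made up, not an existing nssha1 or AOLserver API.

```python
import hashlib


def sha1_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 of the file at path, reading chunk_size bytes at a time.

    Hypothetical helper for illustration only. The file is opened in binary
    mode so no encoding translation touches the bytes being hashed.
    """
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

An ns_sha1_file command would presumably do the same thing in C, and an external program such as sha1sum gives the same result at the cost of a process per file.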
Also, you can't compute it until the user uploads the file, so you can't
avoid the transfer (your "further changes" point 1), although once transferred you could offer to symlink.
Andrew and I looked at how much duplicate content there was in the Sloan
file storage and it looked like less than 20% (by byte count) was duplicated, so I
am not entirely sure it's that big a win.
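
For anyone who wants to check their own site, a measurement along these lines can be sketched as below. This is not the script we used, just an assumed approach: hash every file under a directory, treat the first file with a given digest as the original, and count the bytes of every later file with the same digest as duplicated. duplicate_byte_fraction and sha1_file are hypothetical names.

```python
import hashlib
import os
from collections import defaultdict


def sha1_file(path, chunk_size=1024 * 1024):
    """Hex SHA-1 of a file, read in binary chunks (hypothetical helper)."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def duplicate_byte_fraction(root):
    """Return the fraction of bytes under root that belong to redundant copies."""
    sizes_by_digest = defaultdict(list)
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlinked copy costs no extra storage
            size = os.path.getsize(path)
            total_bytes += size
            sizes_by_digest[sha1_file(path)].append(size)
    # Keep one copy per digest; everything after the first is duplicated bytes.
    duplicate_bytes = sum(sum(sizes[1:]) for sizes in sizes_by_digest.values())
    return duplicate_bytes / total_bytes if total_bytes else 0.0
```

Calling duplicate_byte_fraction on the content repository's file root gives a number between 0 and 1; something under 0.2 would match what we saw.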
Running sha1s over all the cr content would probably be pretty fast even for a reasonably large site. I have
done it for 100 GB or so on my desktop machine and it only took about 4 hours iirc.