Forum OpenACS CMS: content repository physical folders keep growing

Hello everyone,

On a customer's server I ran out of disk space. I thought this was strange, because the files in the content repository are few compared to the disk space used: a sum over the file sizes in fs_files returned around 3GB, but the size of content-repository-content-files is over 17GB.

Is it possible that files were not deleted when issuing a file_delete command?

I checked the count of revisions grouped by item_id in content_repository, because I suspected leftover revisions from file deletes/adds, but each item_id shows a count of 1, so there are no multiple revisions...

It seems really strange to me... How can I find the files in content-repository-content-files that are not related to any content? I need this to identify the folders I can safely remove.

Posted by Antonio Pisano on
I managed to identify every folder containing cr files. The query is this:

   -- content looks like '/16/04/28/160430'; field 2 is the top-level folder
   select split_part(content, '/', 2)
     from cr_revisions
    where revision_id in (select live_revision from fs_files)
    group by split_part(content, '/', 2);

I then removed folders not in content-repository-content-files not belonging to that set.

What could have caused this behaviour?

Posted by Antonio Pisano on
*I then removed the folders in content-repository-content-files not belonging to that set.

Posted by Gustaf Neumann on
Can it be that this machine is turned off every day before 10pm? At that time cr_delete_scheduled_files is called to delete the files physically....

Have you checked the content of the table "cr_files_to_delete"?

best regards
-gustaf neumann

Posted by Antonio Pisano on
The server is never turned off. I also often operate on it during those hours, so I am quite certain about that.

The table "cr_files_to_delete" is empty right now, but that should be correct in the current situation... I should try removing a file and check it again.

I could find the init tcl file where the proc is scheduled.

One question: besides checking the logs, the "ad_schedule_proc" docs say I should be able to see the currently scheduled procs at the "/admin/monitoring/schedule-procs.tcl" location on the server, but it seems this is no longer true. I've also tried to grep for a "schedule-procs" file, but I couldn't find any. What is the best current way to look at the currently scheduled procs on a server?

Posted by Gustaf Neumann on
> I could find the init tcl file where the proc is scheduled.

It should be here:
acs-content-repository/tcl/acs-content-repository-init.tcl

> Which is the best current method to look at currently
> scheduled procs on a server?

Good question. I use nsstats for this purpose: ..../nsstats?@page=sched
When OpenACS is installed with the install-script from https://openacs.org/xowiki/naviserver-openacs, nsstats is installed under /admin/

Posted by Antonio Pisano on
Thanks for the pointer!

From that page I can see that "cr_delete_scheduled_files" is correctly scheduled, and that when a file is deleted, the "cr_files_to_delete" table is updated correctly.

I think I have found my problem: as my application manages music playlists composed of many files, I developed a script which allows uploading a big zip file containing many mp3s to the server.

Something can go wrong while the songs are inserted (because of a wrong format or other issues), so this operation can fail when some of the files have already been put into the content repository. As the operation is performed inside a db_transaction, all changes to the db are rolled back, but the same doesn't happen to the physical files in the content repository, which will remain there forever (no explicit deletion is issued, so there is no entry in "cr_files_to_delete").

Would it be reasonable to change "cr_delete_scheduled_files" so that it sweeps every file that has no entry in the content repository tables? It would be more aggressive, but we could make it configurable via a parameter...
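
The failure mode described above can be reproduced outside OpenACS. The following is a minimal standalone sketch in Python (not OpenACS code): sqlite3 stands in for the database, and `import_files` and the `revisions` table are hypothetical names chosen for illustration. It shows that a rollback leaves already written files behind unless the caller removes them explicitly.

```python
import os
import sqlite3


def import_files(conn, cr_root, files):
    """Store files on disk and register them in the DB, all or nothing.

    `files` is a list of (name, data) pairs; data=None simulates a bad upload.
    `conn` is a sqlite3 connection with a table revisions(content text).
    """
    written = []  # paths touched so far, so that a failure can undo them
    try:
        with conn:  # sqlite3: commit on success, roll back on exception
            for name, data in files:
                if data is None:
                    raise ValueError(f"cannot import {name}")
                path = os.path.join(cr_root, name)
                written.append(path)
                with open(path, "wb") as fh:
                    fh.write(data)
                conn.execute(
                    "insert into revisions(content) values (?)", (name,)
                )
    except Exception:
        # The DB rollback does not touch the filesystem, so the files
        # written before the failure have to be removed by hand.
        for path in written:
            if os.path.exists(path):
                os.remove(path)
        raise
```

Without the `except` branch, the first files of a failed upload would stay on disk with no row pointing at them, which is the kind of orphan discussed in this thread.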

Posted by Gustaf Neumann on
Dear Antonio,

I've added an additional helper proc "cr_check_orphaned_files" with an optional "-delete" flag to address this problem: http://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-8%3Agustafn%3A20131225124013

The test might return false positives for non-orphaned files, but that is quite unlikely. The bigger problem is that the lookup on large repositories can be prohibitively slow, even as a scheduled procedure. On sites like openacs.org, this query is fine. However, on one of our systems we have e.g. 2.5 million files in the cr. On this system a single lookup of whether a file is referenced from cr_revisions is already very slow:

   select count(*) from cr_revisions where content = '/16/04/28/160430';
   Total runtime: 1904.185 ms

Multiplying 1.9 secs by 2.5 million entries gives 54+ days (SQL time)!

The same slow lookup also happens in the SQL query cr_delete_scheduled_files.fetch_paths we discussed above. Some time ago, I fixed this on one of our systems by adding the following index:

   create index cr_revisions_content_idx on cr_revisions (substring(content for 100));

This helps a lot:

   select count(*) from cr_revisions where substring(content, 1, 100) = substring('/16/04/28/160430', 1, 100);
   Total runtime: 0.062 ms

With this, the accumulated SQL time boils down to 155 secs. During these 2.5 minutes there will be quite some stress on the database, so I am not sure whether one wants to run this every day. Also, the Tcl time to compute the 2.5 million entries using the tcllib function is very slow (I've just measured 22 minutes). .... I will commit something better (faster, more configurable) soon.
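
As a quick check of the two estimates above, here is a small standalone Python calculation. It uses the rounded 1.9 s figure and the measured 0.062 ms per lookup, and assumes exactly one lookup per file; the names are only illustrative.

```python
ENTRIES = 2_500_000   # files in the content repository
SLOW_MS = 1900.0      # per-lookup time without the index (rounded, as above)
FAST_MS = 0.062       # per-lookup time with the expression index


def total_seconds(per_lookup_ms, entries=ENTRIES):
    """Accumulated SQL time if every file needs one lookup."""
    return per_lookup_ms * entries / 1000.0


slow_days = total_seconds(SLOW_MS) / 86_400  # seconds per day
fast_secs = total_seconds(FAST_MS)

print(f"without index: {slow_days:.2f} days")
print(f"with index:    {fast_secs:.0f} seconds")
```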

-g

Posted by Gustaf Neumann on
I've committed a much faster version of cr_check_orphaned_files, which
  • uses the indexing as sketched above
  • uses the "find" command instead of the function from the tcllib (more than 10 times faster).
One can now also use "-mtime ..." to check not all files, but just the files added e.g. in the last week. So one can run the "big cleanup" once, and then run the cleanup of orphaned files over shorter time periods.
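
The idea of restricting the scan to recently modified files can be illustrated with a standalone Python sketch. This is not the committed Tcl code: `find_orphans` and its parameters are hypothetical, and the set `known_paths` stands in for the `content` column of cr_revisions.

```python
import os
import re
import time

NUMERIC_PATH = re.compile(r"^[0-9/]+$")


def find_orphans(cr_root, known_paths, max_age_days=None, now=None):
    """Return files under cr_root whose relative path is not in known_paths.

    known_paths holds paths in the stored form, e.g. "/16/04/28/160430".
    With max_age_days set, only files modified within that many days are
    examined, similar in spirit to restricting "find" with -mtime.
    """
    cr_root = os.path.abspath(cr_root)  # also drops any trailing slash
    now = time.time() if now is None else now
    cutoff = None if max_age_days is None else now - max_age_days * 86400
    orphans = []
    # followlinks=True so that symlinked directories are traversed as well
    for dirpath, _dirs, files in os.walk(cr_root, followlinks=True):
        for fname in files:
            full = os.path.join(dirpath, fname)
            name = full[len(cr_root):].replace(os.sep, "/")
            if not NUMERIC_PATH.match(name):
                continue  # not a numeric content-repository path
            if cutoff is not None and os.path.getmtime(full) < cutoff:
                continue  # outside the time window, left to the big cleanup
            if name not in known_paths:
                orphans.append(full)
    return sorted(orphans)
```

A call such as `find_orphans(root, known, max_age_days=7)` corresponds to the weekly cleanup, while omitting `max_age_days` corresponds to the one-time full scan.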

all the best
-g

Posted by Antonio Pisano on
Dear Gustaf,

Many thanks for taking the time to fix this on Christmas! I will put the new proc on my server and give you some feedback soon!

Posted by Antonio Pisano on
I did some testing: I found out that ::fileutil::find cannot retrieve files located in a symlinked directory. In my installation, which comes from the new installation scripts, this is unfortunately the case for "content-repository-content-files" and other folders.

At http://wiki.tcl.tk/776 they say the "globfind" command could overcome some limitations of ::fileutil::find and achieve better performance. I was thinking about trying that command instead, but it is not part of the Tcl that OpenACS requires right now, so I would need your opinion.

For my current situation, I fixed the proc by using "exec" and the "find" command on Linux. It isn't cross-platform, so I am leaving it here only for reference.

ad_proc cr_check_orphaned_files {-delete:boolean} {

    Check for orphaned files in the content repository directory, and
    delete such files if required. Orphaned files might be created when
    files are added to the content repository but the transaction is aborted.

    @param -delete delete the orphaned files

} {
    # Normalize the root to "no trailing slash" before measuring it, so that
    # the names cut from the paths below always keep their leading "/"
    # (the form stored in cr_revisions.content, e.g. /16/04/28/160430).
    set cr_root [string trimright [nsv_get CR_LOCATIONS CR_FILES] /]
    set root_length [string length $cr_root]

    # "find" needs the trailing slash to search into a symlinked directory.
    append cr_root /

    set result ""
    # Get every non-hidden file in the cr directory; split on newlines so
    # that paths containing whitespace are not broken apart...
    foreach f [split [exec find $cr_root -type f ! -name ".*"] \n] {
        set name [string range $f $root_length end]
        # ...skip names which are not numerical...
        if {![regexp {^[0-9/]+$} $name]} continue
        # ...check if we have a revision under this filename...
        if {[db_0or1row _ {
            select 1 from cr_revisions
            where content = :name limit 1}]} continue

        # ...otherwise it is an orphan.
        lappend result $f

        if {$delete_p} {
            file delete -- $f
        }
    }

    return $result
}

All the best

Posted by Gustaf Neumann on
Hi Antonio,

It seems that you did not use the version from 12/25/13 05:18, which predates your last two messages:

http://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-8%3Agustafn%3A20131225161031

That version uses the external "find" command and also accepts "-mtime" as an argument to avoid scans over everything (see the site with 2.5 million files mentioned above). Maybe one can pass "mtime" as filtercmd to globfind, and maybe we could incorporate it into OpenACS, but for large sites I doubt it can reach the speed of find.

For Unix and Mac OS X/ports, "find" is readily available; for Windows, the findutils are contained in MSYS.

We could incorporate both approaches (Tcl globfind and find), but that means more code bloat. The easier approach is to register cr_check_orphaned_files as a scheduled proc only when "find" is available, and to behave in the other cases as OpenACS did over the last 10 years.

How many files do you have on your system that grew to 17GB? How long does a full run of cr_check_orphaned_files take on that system?

-g

Posted by Antonio Pisano on
Ooops... it seems I messed up a little while looking for the latest changes to the proc in the cvs browser. You had already gone for the Unix "find" command.

I tried the latest version. I saw that "find" needs the folder to end with "/" to search into it (in my installation it is a symlink). I added this little check to the proc and committed. The proc seems to work properly.

Bye!

Posted by Antonio Pisano on
The leftovers can grow to as much as 17GB because playlists are at least 500MB each. It became a problem after at least 3 months of usage, which included the early test days of the feature, but my customer is really averse to technology (a very good tester), and there is no day he doesn't mess something up (I must admit that sometimes the blame is on me though)...

On my production server, which is very limited in resources, the proc takes around 1.5 seconds to return from /ds/shell. I have around 1000 files and it returns around 250 orphans.

Posted by Gustaf Neumann on
For large content repositories even the improved version is not satisfactory. Testing e.g. 2.5+ million entries in the file system and/or in the database has a negative impact on other interactions happening at the same time, no matter how fast the machine is.

Therefore, I've implemented a much more scalable approach based on a file-creation log that keeps track of freshly created files (by default, the files created on one day). These files can be checked efficiently together with "cr_delete_scheduled_files" without putting much stress on either the file system or the database.

see: http://cvs.openacs.org/changelog/OpenACS?cs=oacs-5-8%3Agustafn%3A20131231162110
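
To illustrate the general idea of a creation log (not the committed implementation, whose details are in the changeset above), here is a minimal standalone Python sketch. The class `FileCreationLog`, its methods, and the grace period `min_age_seconds` are assumptions made for this example.

```python
import os
import time


class FileCreationLog:
    """Remember recently created files (hypothetical helper)."""

    def __init__(self):
        self._entries = []  # list of (creation_time, path)

    def record(self, path, when=None):
        """Note that `path` was just created."""
        self._entries.append((time.time() if when is None else when, path))

    def sweep(self, is_referenced, min_age_seconds=0.0, now=None):
        """Delete logged files that nothing references; return their paths.

        Only logged files are checked, so the cost depends on the number
        of new files, not on the size of the repository. Entries younger
        than min_age_seconds stay in the log (an assumption of this sketch:
        their transaction might still be open).
        """
        now = time.time() if now is None else now
        keep, removed = [], []
        for created, path in self._entries:
            if now - created < min_age_seconds:
                keep.append((created, path))
            elif not is_referenced(path):
                if os.path.exists(path):
                    os.remove(path)
                removed.append(path)
        self._entries = keep
        return removed
```

Here `is_referenced` would be a lookup against the revisions table; in this sketch it is any callable that takes a path and returns true when a revision points at it.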

The function "cr_check_orphaned_files" can still be used for cleaning up orphaned files from times before the cr-file creation log was in place (that is from the birth of OpenACS until now).

Please test on your systems as well. Note that the change requires an update of acs-tcl and a restart of nsd.

all the best for 2014!
-gustaf neumann

Posted by Antonio Pisano on
Happy new year Gustaf!

I'm testing the new changes: after updating acs-content-repository and acs-tcl to the latest version and restarting the server, pages now complain about a missing variable ::acs::rootdir

I've failed to locate it in config.tcl and other packages. I think I'm missing some upgraded init script or something... can you give me a pointer?

Posted by Antonio Pisano on
Ok, I've found it: I needed to upgrade the acs-bootstrap-installer package.

I will fiddle a bit with file insertion/deletion and let you know!

Thanks a lot for this fine solution to the issue!