Forum OpenACS Development: Scalability of get_path (PG) and the file-storage package

As part of ramping up for the 5.0 release I agreed to do some scalability testing on the improved (thanks to Jun) get_path proc for Postgres.

I created a folder/file hierarchy 15 levels deep, with one folder and 100 files at each level.  I did this because get_path ascends the folder tree to build a full path (URL) to a file or folder.  At level 15 that implies a URL with 15 "/" characters in it, just to be super-clear.

My timings on my P4 2.4 MHz machine for the main query used to  create a folder listing displayed by "folder-chunk":

No call to get_path: 0.04 seconds

Level 1: 0.30 seconds

Level 5: 0.48 seconds

Level 10: 0.83 seconds

Level 15: 1.38 seconds

This isn't terrible - 100 files in a folder is an unusual number I think, and 15 levels is extreme.

On the other hand this is a fairly high penalty for "smart URLs".

(NOTE to Ola and/or DaveB or anyone else working on file storage: folders themselves don't have "smart URLs" in the 5.0 file storage package, just the files themselves)

Long-term - hopefully 5.1 - we'll be storing the path in cr_items or a "named object" parent or in objects themselves, and "smart URLs" will be cost-free as far as building lists of URLs goes.

But for 5.0 ... is this cost acceptible?  It is of concern for heavy .LRN users.  If it is not acceptible, might we parameterize it so the system administrator can decide whether or not to enable them?  (I kinda hate that since we'll get rid of the penalty in 5.1).

Discuss, please, I'm the data generator, not the decision maker (not alone, anyway!)

Don,

You made me realize (not exactly realize perhaps - you've made me think - it's getting late here you see) that there shouldn't necessarily be a need to calculate the path for each file that is listed in "folder chunk". Why calculate it a hundred times (in your example) when it probably could be done just once: when you click on a file.

It ought to suffice to have the listed file's url contain the version_id, or better, item_id of the file. Then feed that to get__path and do an internal redirect or whatever is needed to the returned path to make the finishing page have the proper url in the address field of the browser ...

Is that what you figured, Don?

I think we could save a lot of precious cycles by rethinking this.

Oh well, I will have to think about this in the morning.

(Hmm. Don "the data generator" Baccus. That is a cool nickname 😊

/Ola

That would certainly speed stuff up but then if someone puts their mouse over the link and clicks "bookmark this link" they don't get the pretty URL.

They'll only get the pretty URL if they open the link.

Is this acceptable?  What do other people think?  If so I'm sure you could implement it quickly and it's very easy to test so it could be easily put into 5.0...

(I think we're looking at another beta at the end of the week or Monday then a shot at a release candidate the week after, with luck)

Don,

Is the scalability a factor of the total number of content folders that exist, or of the depth of a particular file-storage hierarchy? I think each file-storage package instance has it's own hierarchy.

Perhaps we can get a sample of the depth of file-storage folders in use.

Hmmm, the ultimate solution is pretty simple, and gives pretty URLs for folders also. Make each folder a package instance/site-node. Obviously this won't be done for 5.0.

Because the file-storage UI only shows one folder at a time, not an expanded folder-tree like Windows Explorer, would it be possible to calculate the path for the containing folder, and prepend that to each file name?

Another idea:

If content_item__get_path is defined as stable, and we calculate the path to the parent, a query for all items ina  folder should only incur the calculation cost once for the parent_folder, and even the 15 level deep query would only incur the cost 15 times instead of 1500.

Dave, yes, the last idea would work fine actually ... an interesting idea might be to define get_path as stable recursively rather than loop up the hierarchy the way it does.

That way after the first file path calculation the recursive call to get the parent path would return the cached value ...

But still the call overheads themselves will get you to some degree.

Best is your suggestion of calling get_path on the folder before doing the folder chunk query then concatenating the folder name and file name in the big select rather than call get_path.

That would give us speed nearly identical to the 4.6.3 query I believe.  And this would work for Oracle, too (which doesn't have the equivalent of "stable" attributes for PL/SQL functions that I'm aware off).

BTW the cost is mostly proportionate to the depth of the tree, not the total number of folders.

If you or Ola want to work on this that would be great!  I said I'd measure it, not fix it :)

If you guys whack at it, though, I'll be more than willing to rerun my tests using whatever you guys do come up with.
  Just remember it needs doing for both DBs.

Hello,

I am still phased out.  But maybe this would help.

http://cvs.openacs.org/cvs/*checkout*/openacs-4/contrib/packages/bcms/sql/postgresql/bcms-create.sql?rev=1.9

I have a modified content_item__get_path for pg (last part of the file).  Because of this 2 things:

1.  Its a bit slow.
2.  Hammering it real hard makes my pg go up.  I have to turn restart postmaster.

Those plsql are not in bugtracker patches because:

1.  Its not fully tested.
2.  Some changes the behaviour, I have still no time to hunt down code that relies on the old incorrect behaviour.

Anyway not much help here, also good understanding of the topic.  But if it helps, please pick it up.  Thanks.

Opppsss.  Looking again at Don's first post.  It seems he is already testing the newer get_path.  Hehehehe.  Sorry...
I thought about similar stuff as well. There are three options:

* Use a redirect page (/file) - similar to what Timo and I have been talking about for ages.

* Calculate everything from scratch per request

* Precalculate everything and keep the redundant data up2date by triggers

Eventually our URLs are cool and we should be careful about giving them away. Look at how cool Google's URL usage on the result pages is. I want this for my search package :)

How big could a table_get_path get?

Best is your suggestion of calling get_path on the folder before doing the folder chunk query then concatenating the folder name and file name in the big select rather than call get_path.
Yes, that is by far the best solution.

I'm implementing this right now and will post when I'm done. It's great that you're willing to test the changes, Don. I don't currently have an oracle install at hand but hopefully you do ...

Thanks Dave, your thinking makes this easy.

/Ola

I think it's done now on oacs-5-0/dotlrn-2-0.

Please let me know if something ain't working ...

Thanks.

/Ola

OK it seems to work just about as fast as the query with no file path at all.  The file path itself causes more output to /dev/null and for 100 files 15 folders deep that's a long string, and my average execution time is more like 0.05 than 0.04 seconds without the file path at all.

But 40-50 milliseconds is pretty damned fast!

Good job.

I didn't test the .LRN portlet nor Oracle at this point - I have Oracle running on my Mac OS X laptop but haven't got it running with AOLserver yet.  Maybe it's time to dive into that.  Meanwhile, can someone else test Ola's change with Oracle?