Forum OpenACS Development: News aggregator: Purging brainstorming

Posted by Guan Yang on

Currently the news aggregator (NA) package (which I'm working on in HEAD) is a bit of a mess, but I'm working to clean it up. I would like some feedback from the community on one important aspect of current NA design, purging.

Current Design

The current purging mechanism is inspired by Radio UserLand and Python Desktop Server (PyDS). All items are displayed on one page in blog format. (NA expands on this by allowing the user to have multiple aggregators, though the current template doesn't display the additional aggregators. I have an old master template around that does this, and I will integrate it into the package soon.) When you're done reading all the items on the page, simply click the "Purge this page of news" button, and the items are no longer visible.

There is a separate table for saved items, na_saved_items. Here is a copy of the central NA query along with its execution plan. When the user has 5 purges for a given aggregator, those 5 purges are deleted and the aggregator_bottom field is updated. No items are displayed (by default) that have an item_id value below the user's aggregator_bottom.
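
To make the mechanism concrete, here is a minimal Python sketch of the purge bookkeeping described above. It is not the NA code: the `Aggregator` class, the idea of storing each purge as a (top, bottom) range of item_ids, and the rule for computing the new aggregator_bottom are assumptions made for illustration. Only the threshold of 5 purges and the aggregator_bottom field come from the description.

```python
from dataclasses import dataclass, field

PURGE_LIMIT = 5  # from the description: 5 purges trigger the cleanup


@dataclass
class Aggregator:
    # Items with an item_id below this value are hidden by default.
    aggregator_bottom: int = 0
    # Assumed layout: one (top, bottom) item_id range per purge.
    purges: list = field(default_factory=list)

    def purge(self, visible_ids):
        """Record one purge covering the item_ids currently on the page."""
        if not visible_ids:
            return
        self.purges.append((max(visible_ids), min(visible_ids)))
        if len(self.purges) >= PURGE_LIMIT:
            # Assumed rule: fold the stored purges into a single marker
            # just above the highest purged item_id, then delete them.
            self.aggregator_bottom = max(top for top, _ in self.purges) + 1
            self.purges.clear()

    def is_visible(self, item_id):
        """True if the item is neither below the bottom nor in a purge."""
        if item_id < self.aggregator_bottom:
            return False
        return not any(bottom <= item_id <= top for top, bottom in self.purges)
```

The point of the fold is that the number of stored purge rows per aggregator stays small, which is where the space saving comes from.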

Problems

This design, which was thought to be scalable, has some problems (including scalability!). Imagine that Alice uses NA a lot and has a lot of purges. The last time Alice used NA and purged her aggregator was 9:00, so all items that were scanned before 9:00 are not visible.

Now Alice creates a new subscription for the New York Times. Lots of other users are subscribed to the New York Times's RSS feed, so there's no new scan of that feed. All the items in the NYT feed were scanned before 9:00. As a result, it appears to Alice that nothing has happened: no items appear unless Alice clicks the "Purge Off" link.
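
The scenario can be reproduced with a few lines of Python. This is only an illustration under assumptions: the item_ids, source names and the `visible_items` helper are made up, and it assumes item_ids grow in the order items are scanned, so that a single bottom value hides everything scanned earlier.

```python
def visible_items(items, subscribed_sources, aggregator_bottom):
    """Return item_ids from subscribed sources at or above the bottom."""
    return [
        item_id
        for item_id, source in items
        if source in subscribed_sources and item_id >= aggregator_bottom
    ]


# (item_id, source) pairs; the NYT items were scanned earlier for other users.
items = [(41, "nyt"), (42, "nyt"), (97, "slashdot"), (98, "slashdot")]

alice_sources = {"slashdot"}
alice_bottom = 99          # Alice purged at 9:00, after item 98 was scanned

alice_sources.add("nyt")   # new subscription, but no new scan happens
print(visible_items(items, alice_sources, alice_bottom))  # []
print(visible_items(items, alice_sources, 0))             # [41, 42, 97, 98]
```

The second call ignores the bottom, which is roughly what the "Purge Off" link does.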

Solutions

How should this problem be fixed? One solution would be to give up saving space and simply map every single item to each aggregator instance. When Alice has purged a page, a row is inserted into the table for each item that was visible.
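
A rough Python sketch of this first option, with an in-memory set standing in for the mapping table. The `ItemPurgeMap` class and its method names are hypothetical, not part of NA.

```python
class ItemPurgeMap:
    """One row per (aggregator, item) that the user has purged."""

    def __init__(self):
        self._purged = set()  # stands in for a table of (aggregator_id, item_id)

    def purge_page(self, aggregator_id, visible_ids):
        """Insert one row for every item that was visible on the page."""
        for item_id in visible_ids:
            self._purged.add((aggregator_id, item_id))

    def visible(self, aggregator_id, candidate_ids):
        """Keep only the candidates this aggregator has not purged."""
        return [i for i in candidate_ids if (aggregator_id, i) not in self._purged]

    def row_count(self):
        return len(self._purged)


purge_map = ItemPurgeMap()
purge_map.purge_page(aggregator_id=1, visible_ids=[97, 98])
# Older items from a newly added source were never purged, so they show up.
print(purge_map.visible(1, [41, 42, 97, 98]))  # [41, 42]
```

The cost is visible in `row_count()`: storage grows with the number of items each user reads, rather than with the number of purges.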

Another solution could be to implement purges at the subscription level (the mapping between aggregators and sources). Would this solve all the problems?

I'd like some ideas!

Posted by Malte Sussdorff on
I like the way Bloglines (http://www.bloglines.com) handles things (purge after reading). This would mean you add a "last read / purged on" column to each subscription (I guess...:)) and only display items (from the sources) whose source timestamp (if there is such a thing...) is newer. For new subscriptions the "last read / purged on" value will (obviously) be NULL, so all items for that subscription should be displayed.
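
Something like this, as a small Python sketch. The function name, the tuple layout and the sample data are made up for illustration, and None plays the role of the NULL column.

```python
from datetime import datetime


def unread_items(items, last_purged):
    """Return titles newer than each subscription's last purge.

    items: list of (source, title, timestamp) tuples.
    last_purged: dict mapping each subscribed source to a datetime,
                 or to None if the subscription was never purged.
    """
    shown = []
    for source, title, stamp in items:
        if source not in last_purged:
            continue  # not subscribed to this source
        cutoff = last_purged[source]
        if cutoff is None or stamp > cutoff:
            shown.append(title)
    return shown


nine = datetime(2004, 1, 1, 9, 0)
items = [
    ("nyt", "old NYT story", datetime(2004, 1, 1, 8, 0)),
    ("slashdot", "old Slashdot story", datetime(2004, 1, 1, 8, 30)),
    ("slashdot", "new Slashdot story", datetime(2004, 1, 1, 9, 30)),
]
# Slashdot was purged at 9:00; the NYT subscription is brand new.
print(unread_items(items, {"slashdot": nine, "nyt": None}))
# ['old NYT story', 'new Slashdot story']
```

Because the cutoff lives on the subscription, adding a new source does not inherit the purge state of the rest of the aggregator.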

I have not looked at NA: does it support folders, and if not, would you consider adding them (so we can store subscriptions in folders)? Also, does it interface with categories? (I imagine being able to subscribe to a new source and add a category to it, if the default category given by most users does not suit my (personal) needs.)

Posted by Guan Yang on

NA does support folders in the form of "aggregators". Each user can create multiple aggregators. You can visit my old 4.6.3 dev site (screenshot) to see a template that actually displays them 😉

In any case, as long as we retain the view where items from all sources are gathered on one page (unlike Bloglines, which displays each source on a separate page), I'd like to retain explicit purging.

/Guan

Posted by Nima Mazloumi on
Guan...this might be more dotLRN specific but:

- if a group has more than one admin and both create new news feeds, then as an admin you are not able to figure out who owns the other feeds, and you have no chance to administer them

- as soon as I create more than one feed, it is not clear why or which feed is shown. If I follow the RSS XML links, all of them contain news, but not all are shown in the portlet.

Greetings,
Nima

Posted by Kjell Wooding on
Bloglines supports all-on-one-page. Just click on the folder holding the various feeds.

In the same vein, you can arrange the feeds into subfolders, and view all of a particular subfolder at once (e.g. News, blogs, geek).

This is a nice compromise between the two main feedreader styles, methinks, and the purge-after-read behavior in this way is very useful.

Posted by Carl Robert Blesius on
How did the old "what's new" package work? We need a working "what's new" for forums and file storage; maybe it could be used in this case as well (purged = old).

Carl

P.S. Wanted to look and see how it worked myself, but it seems someone has gone in and cleaned out all our wonderful skeletons: OpenACS 3.x is not in the download section, sdm.openacs.org is not up (I want to salvage some other stuff from there too), and I do not know how to pull the bones out of cvs (if they are there at all).

Posted by Ola Hansson on
Using "News Aggregator" as the general front-end for a new incarnation of the 3.x "What's New?" feature is really exciting, I think. Good idea Carl!

Hm, keeping track of your own internal feeds, so to speak ...

We just need to make rss-feeds for all the important packages and perhaps make one or several "unified" feeds. It might also be a good idea to implement a mechanism which lets a forums instance (say) subscribe automatically to its own feed (via news-aggregator) and pop up a link within the forum instance.

Should it also be possible to purge news items based on the time you last visited the "source" of a feed (a forum say)?

Talk about eating your own dog food 😉

Posted by Jeff Davis on
I am working on making most content feedable by modifying how the search indexer works so you could get a sitewide feed or feeds by subsite (or feeds from a search).

I have already done a "what's new" thing w/o feeds for the community of practice stuff as well.

Posted by Ola Hansson on
Jeff, that is brilliant!

When will this end up in the toolkit? 6.0 or even earlier?

Posted by Jeff Davis on
It should be there for 5.2. The work for showing "what's new" is done (although some packages need to be fixed to make it work) but I am just starting on the stuff to make rss feeds for anything that is searchable.
Posted by Ola Hansson on
Jeff, will this work so that if your package has a working search service contract implementation (or uses the CR's impl.) it will also produce a feed for your package?
Posted by Jeff Davis on
You need to change the datasource method so that it returns the requisite data for the rss feed. I am still fiddling with the code to see what I think works best; once I get closer I will post some concrete examples and get some feedback.

I think the idea that things should be feeds at the package level is probably the wrong approach in the sense that you probably want a single feed for a site or subsite with the ability to restrict to categories (where the category might include package type and content type).

I also want to think about how to make your notifications into an rss feed: either rss would be a notification type, or anything you have notifications turned on for via email would also be available as rss. (This was an idea from Andrew Grumet.)

Posted by Ola Hansson on
Thanks a lot!

I take back my previous question. The answer should be obvious considering what you said in your first response ...

(Feeds for free... Geez, talk about increased incentive to make your (non CR) package searchable.)

Posted by Mark Aufflick on
Jeff that's fantastic. And as said by others, unifying the data sources for feeds, what's new and searching is brilliant. It could also allow for the unified access url we always talk about (/o?object_id=123).
Posted by Jeff Davis on
"It could also allow for the unified access url we always talk about (/o?object_id=123)."

Yeah, I did that too :)

You have to do it for performance for things like that. I ended up with subsite/o/OBJID (since for users for example you want to link to display in the context of the subsite).

Posted by Andrew Grumet on
Guan, one thing you could do to reduce the need for purging is to copy the way Radio does timestamps.  Namely, put an absolute timestamp in the source chunk header indicating when the new items in that chunk were first discovered by the scanner.  The timestamp needs to be absolute, not relative, so that it stays the same when you refresh the page.  This UI facilitates a "hit refresh, scan down to last familiar timestamp" workflow that I use in lieu of purging ( http://grumet.net/weblog/archives/2003/08/19/more_on_deletionless_aggregator_reading.html ).
Posted by Andrew Grumet on
In order for this to work, there needs to be an automatic mechanism by which old items slip off the page.  In Radio you set the number of items to save, independent of source.  I think other aggregators let you specify a maximum age.
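
As a rough sketch of how the two pieces could fit together (absolute timestamps on each source chunk, plus a cap on the number of items kept), here is some Python. It is not how Radio or NA implements this: the `render_chunks` function, the tuple layout and the text output format are assumptions for illustration.

```python
from datetime import datetime
from itertools import groupby


def render_chunks(items, max_items):
    """Render the newest items as source chunks with absolute timestamps.

    items: list of (source, title, discovered_at) tuples, where
           discovered_at is when the scanner first saw the item.
    max_items: total number of items to keep, independent of source.
    """
    # Newest first; items found in the same scan of a source stay adjacent.
    newest_first = sorted(items, key=lambda it: (it[2], it[0]), reverse=True)
    kept = newest_first[:max_items]  # older items slip off the page

    lines = []
    for (source, discovered_at), chunk in groupby(kept, key=lambda it: (it[0], it[2])):
        # Absolute header, so it reads the same on every refresh.
        lines.append(f"{source} [{discovered_at:%Y-%m-%d %H:%M}]")
        lines.extend(f"  {title}" for _, title, _ in chunk)
    return lines
```

With headers like these, reading becomes "refresh, then scan down to the last timestamp you recognize", and the `max_items` cap takes the place of an explicit purge.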