Forum OpenACS Q&A: Parsing RSS

1: Parsing RSS

Posted by Simon Carstensen on 01/13/03 09:40 PM

I am building a package that will depend heavily on parsing RSS feeds. Before getting carried away, I would like to ask whether a standardized way of dealing with RSS parsing already exists.

When I first started investigating this matter I was hoping to find a simple proc taking care of all RSS parsing. Perhaps something along the lines of:

rss_parser -url http://pinds.com/blog/rss.xml -channel_metadata {title link description} -all_items

Or something similar. I haven't been able to find anything about RSS parsing in connection with OpenACS, so I would like to ask the community:

1. Has anyone worked on RSS parsing using OpenACS or Tcl?

2. Would anyone be interested in a proc like the one specified above? Or perhaps there's a better way out of this?

/Simon
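To make the proposed interface concrete, here is a rough sketch of a parser with that shape. It is written in Python purely for illustration; the name `parse_rss`, its `channel_metadata` parameter, and the sample feed are assumptions and not an existing OpenACS or Tcl API. It only handles well-formed RSS 0.9x/2.0 documents with a plain `<channel>` element.

```python
import xml.etree.ElementTree as ET


def parse_rss(xml_text, channel_metadata=("title", "link", "description")):
    """Return (channel_dict, list_of_item_dicts) for a well-formed RSS document.

    Hypothetical helper: mirrors the "-channel_metadata ... -all_items" idea.
    """
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("no <channel> element found")
    # Pick out only the requested channel-level fields.
    meta = {name: (channel.findtext(name) or "").strip()
            for name in channel_metadata}
    # Every item becomes a dict of its direct child elements.
    items = [{child.tag: (child.text or "").strip() for child in item}
             for item in channel.findall("item")]
    return meta, items


# Made-up sample feed, used only to exercise the sketch.
SAMPLE = """<rss version="2.0"><channel>
<title>Example blog</title><link>http://example.com/</link>
<description>Sample feed</description>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""

meta, items = parse_rss(SAMPLE)
print(meta["title"], len(items))
```

A strict parser like this rejects any feed that is not well-formed XML, which is the limitation the "forgiving" parser discussion later in the thread is about.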

2: Re: Parsing RSS (response to 1)

Posted by Don Baccus on 01/13/03 09:49 PM

Jerry Asher has worked on this ...

Check out the rss-support package in OpenACS 4.6.

I don't think anyone would call this a polished piece of work yet (am I wrong about that?) but I know Jerry's syndicated content from various sources on his own personal website.

3: Re: Parsing RSS (response to 1)

Posted by Simon Carstensen on 01/13/03 09:59 PM

Ah, sounds perfect. Thanks, Don!

Grabbing "rss-support" from HEAD with:

cvs -z3 -d :pserver:anonymous@openacs.org:/cvsroot co -r oacs-4-6 rss-support

gives me an error, though. Do you know what might be wrong?

On an entirely different note, I wonder where Jerry's gone. His website (http://theashergroup.com) has been down for a long time.

4: Re: Parsing RSS (response to 2)

Posted by Jeff Davis on 01/13/03 10:01 PM

rss-support is more generation than parsing. It works fine, although I think it could be made more usable. That's sort of beside the point, though, since you care about parsing.

We implemented an RSS parser in London (Lee Denison did most of the work, I think) and I can probably dig up the code. It was a while ago, so I am not sure it will handle 1.0 or 2.0 feeds, for example. I would be interested in seeing a good "forgiving" parser to work with lars-blogger and do aggregation of the feeds I read.

5: Re: Parsing RSS (response to 1)

Posted by Simon Carstensen on 01/13/03 10:04 PM

Grabbing "rss-support" from HEAD with

I meant the oacs-4-6 branch, of course, not the HEAD.

7: Re: Parsing RSS (response to 5)

Posted by Jeff Davis on 01/13/03 10:09 PM

rss-support was not defined in modules. I added it.

6: Re: Parsing RSS (response to 1)

Posted by Simon Carstensen on 01/13/03 10:14 PM

We implemented an rss parser in london (Lee Denison did most of the work I think) and I can probably dig up the code.

That would be great, Jeff. Feel free to email it to me if you want.

I would be interested in seeing a good "forgiving" parser to work with lars-blogger and do aggregation of the feeds I read.

Hey, exactly what I'm working on :). A simple News Aggregator for use with Lars Blogger and the CMS. And a "forgiving" one is exactly what I had in mind. I was thinking of perhaps porting Mark Pilgrim's Ultra Liberal RSS Parser (http://www.diveintomark.org/projects/misc/rssparser.py.txt) to Tcl (it's written in Python). As far as I know, it's the most forgiving one of the bunch.

Any ideas or comments?

8: Re: Parsing RSS (response to 1)

Posted by Lars Pind on 01/13/03 10:31 PM

Hey Simon,

This would be super-duper-amazingly cool. Let me know how things progress.

If you really get into the swing and want to fix the current rss-generator, so it can generate better RSS feeds, e.g. ones where my news aggregator can actually figure out how to read the correct posting time, so much the better :)

/Lars
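One common way for a feed to carry a machine-readable posting time is an RFC 822 style date in an item's `<pubDate>`, which is the format RSS 2.0 uses for dates. Whether that is what the current generator lacks is an assumption; the thread does not say. A minimal Python sketch, with `rss_pub_date` as a hypothetical helper name:

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def rss_pub_date(posted_at):
    """Format a timezone-aware datetime as an RFC 822 style date in GMT."""
    return format_datetime(posted_at.astimezone(timezone.utc), usegmt=True)


# Example posting time (made up), expressed in UTC.
posted = datetime(2003, 1, 13, 22, 31, tzinfo=timezone.utc)
pub_date = rss_pub_date(posted)
print("<pubDate>%s</pubDate>" % pub_date)

# An aggregator can turn the string back into a datetime.
round_trip = parsedate_to_datetime(pub_date)
```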

9: Re: Parsing RSS (response to 1)

Posted by Dave Bauer on 01/13/03 11:16 PM

Simon,

I agree that the ultra-liberal rss parser is probably the best code to start with.

Lars, it would probably be best to use ns_xml or tDOM or whatever to generate XML instead of appending a huge string when generating RSS.

10: Re: Parsing RSS (response to 9)

Posted by Jeff Davis on 01/13/03 11:22 PM

I don't know why ns_xml or tDOM would be that important for output. We want to generate XHTML from adp files, and that is more or less XML as well, and I don't think we would ever consider constructing it in ns_xml rather than by just appending a huge string.

11: Re: Parsing RSS (response to 1)

Posted by Bjorn Thor Jonsson on 01/14/03 01:51 AM

rssticker "is a perl CGI script for converting one or more RSS streams to HTML".

The script contains the comment:
# This is a really bad RSS parser. 😊
but it may be interesting.

Sample configuration file here: http://rss.molar.is/verkfaeri/rssticker.cfg.txt

12: Re: Parsing RSS (response to 1)

Posted by Don Baccus on 01/14/03 04:02 AM

Oops, Jeff's right, Jerry's stuff didn't parse but rather generated RSS. Thanks for setting me straight.

I also don't see any reason to use tDOM or ns_xml to generate known XML content. There's no manipulation to do to the document - where's the justification for the extra step? It would consume memory and cycles and the code to do so would probably be less readable.

XML is supposed to be human-readable, after all, and that means it's easily written by code-writing humans ...

14: Yeti (response to 1)

Posted by Andrew Piskorski on 01/14/03 09:15 PM

On the subject of parsers in Tcl, I've heard that Yeti (Wiki page) is useful, although I haven't tried it myself yet.

13: Re: Parsing RSS (response to 1)

Posted by Simon Carstensen on 01/14/03 09:25 PM

I'm planning on using ns_xml to parse the RSS feeds. The aggregator should only grab a feed if it has been modified within a specifiable amount of time. Additionally, it should check whether the actual file has changed, using conditional GETs, before downloading it.

Does anyone know whether it is possible to use ns_xml with conditional GETs?

15: Re: Parsing RSS (response to 1)

Posted by Simon Carstensen on 01/14/03 10:01 PM

Since [ns_xml parse ?-persist? $string] parses the XML document in a $string and not from a URL, my question is invalid...

I found the proc util_httpget url [ headers ] [ timeout ] [ depth ].

Which would work fine with:

util_httpget http://scripting.com/rss.xml {If-Modified-Since: Tue, 14 Jan 2003 21:43:31 GMT}

Sorry for asking before looking!
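The conditional GET idea can be sketched independently of util_httpget. The Python below (chosen only for illustration) builds a request carrying `If-Modified-Since` and, optionally, `If-None-Match`, and treats a 304 response as "nothing new". The helper names `build_conditional_request` and `fetch_if_changed` are assumptions, not part of any OpenACS API.

```python
import urllib.error
import urllib.request


def build_conditional_request(url, last_modified=None, etag=None):
    """Build a GET request that asks the server to skip unchanged content."""
    request = urllib.request.Request(url)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(request, timeout=30):
    """Return the body as bytes, or None if the server answers 304."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep using the cached copy
            return None
        raise


# Building the request does not touch the network.
request = build_conditional_request(
    "http://scripting.com/rss.xml",
    last_modified="Tue, 14 Jan 2003 21:43:31 GMT",
)
print(request.header_items())
```

Calling `fetch_if_changed(request)` would perform the actual download; the parser then only has to run when it returns a body.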

16: Re: Parsing RSS (response to 12)

Posted by John Sequeira on 01/15/03 03:31 PM

I had to write a middle tier component to generate an XML site tree for a .NET project early this year, and started out by assembling a big string.

The front tier folks would occasionally get errors when things like angle brackets showed up in my text. I found switching to using MS's XML DOM builder useful in avoiding these and similar errors, because that was what the front tier folks were using to parse it. After I switched, when there was an error parsing the file, I was very confident that it wasn't my code (it never was).

With XML, since the whole point is data interoperability, I would say that maximizing the odds of this by using something like tDOM would be worthwhile.

Also, I'm currently using both Amphetadesk and Radio Userland to parse RSS feeds (yes, I'm a news junkie) - they die all the time on people's feeds that aren't valid XML. From experience, generating valid XML by design is a very good thing.
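The angle-bracket problem described above is easy to reproduce. The Python sketch below (illustrative only; the function names are made up) builds the same item three ways: naive string appending, string appending with explicit escaping, and a DOM-style builder. Only the first produces output that fails to parse.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def item_by_concatenation(title):
    """Naive string appending: special characters pass through untouched."""
    return "<item><title>" + title + "</title></item>"


def item_by_escaped_concatenation(title):
    """String appending, but with &, < and > escaped first."""
    return "<item><title>" + escape(title) + "</title></item>"


def item_by_builder(title):
    """Let an XML library build and serialize the element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    return ET.tostring(item, encoding="unicode")


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


title = "if a < b & b > c"
for build in (item_by_concatenation, item_by_escaped_concatenation, item_by_builder):
    print(build.__name__, is_well_formed(build(title)))
```

So appending strings can work, as argued earlier in the thread, provided every piece of text is escaped; a builder simply makes that step impossible to forget.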

17: Re: Parsing RSS (response to 1)

Posted by Ryan Lee on 01/19/03 07:08 AM

I have a basic RSS aggregator running on my OpenACS 3.2.5 site. I've worked on it off and on (you're welcome to look at the somewhat patchy code if you'd like). I went straight to XSLT for converting RSS to XHTML (different transformation depending on what version of RSS the feed advertises itself as / if it's RDF). As I use conditional GETs and caching of the XSLT transformations, I find this doesn't present much of a resource problem (then again, I don't get much traffic - I'm the only user of my aggregator :). Of course, you have to depend on the producer of the feed to generate valid XML; I found a couple of choice regsubs corrected most errors.
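Choosing a transformation by the version a feed advertises could look roughly like the Python sketch below. The function `detect_feed_format` and the stylesheet file names are hypothetical; this is not the code running on the site described above, and it assumes the feed is already well-formed XML.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical stylesheet names, one per feed flavour.
STYLESHEETS = {
    "rdf": "rss10-to-xhtml.xsl",
    "rss-0.91": "rss091-to-xhtml.xsl",
    "rss-2.0": "rss20-to-xhtml.xsl",
}


def detect_feed_format(xml_text):
    """Return 'rdf', 'rss-<version>' or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text)
    if root.tag == "{%s}RDF" % RDF_NS:
        return "rdf"
    if root.tag == "rss":
        return "rss-" + root.get("version", "unknown")
    return "unknown"


RSS_091 = '<rss version="0.91"><channel/></rss>'
RDF_10 = ('<rdf:RDF xmlns:rdf="%s" xmlns="http://purl.org/rss/1.0/">'
          '<channel/></rdf:RDF>' % RDF_NS)

for feed in (RSS_091, RDF_10):
    flavour = detect_feed_format(feed)
    print(flavour, STYLESHEETS.get(flavour, "no stylesheet"))
```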

By the way, I also found I had to modify AOLserver's http.tcl to retrieve response headers out of ns_httpget (and make it understand HTTP 301 and 304), namely accessing ETag or Last-Modified information from an RSS feed. I didn't use util_httpget, but from the API it looks like it suffers from the same response headers weakness.
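Why the response headers matter can be shown with a small piece of bookkeeping. The Python sketch below takes a status code and response headers (however the HTTP client exposes them) and updates what the aggregator remembers about a feed. The `update_feed_state` function and the shape of the state dict are assumptions made for illustration.

```python
def update_feed_state(state, status, headers, body=None):
    """Return a new state dict after one fetch attempt.

    state holds 'url', 'etag', 'last_modified' and 'body' (hypothetical layout).
    """
    headers = {key.lower(): value for key, value in headers.items()}
    new_state = dict(state)
    if status == 304:
        # Not Modified: keep the cached body and validators as they are.
        return new_state
    if status == 301:
        # Moved Permanently: remember the new address for the next fetch.
        new_state["url"] = headers.get("location", state["url"])
        return new_state
    if status == 200:
        # Fresh content: store it with the validators for the next request.
        new_state["etag"] = headers.get("etag")
        new_state["last_modified"] = headers.get("last-modified")
        new_state["body"] = body
        return new_state
    raise ValueError("unhandled HTTP status: %d" % status)


state = {"url": "http://example.com/rss.xml", "etag": None,
         "last_modified": None, "body": None}
state = update_feed_state(
    state, 200,
    {"ETag": '"v1"', "Last-Modified": "Tue, 14 Jan 2003 21:43:31 GMT"},
    body=b"<rss/>",
)
print(state["etag"], state["last_modified"])
```

The stored `etag` and `last_modified` values are what the next conditional GET would send back as `If-None-Match` and `If-Modified-Since`.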