Forum OpenACS Q&A: Monitoring and Controlling Queries

Posted by Steve Manning on
Evening Chaps and Chapesses

I was at a meeting today to discuss implementing an e-commerce site using, amongst other things, OpenACS and PostgreSQL, and I was asked a question which I couldn't directly answer and have been pondering ever since.

Basically, is there any way to correlate a request from a user with a query running in Postgres, and can that query be killed if the user goes away?

This comes from an experience the customer had with a web app that fired off expensive queries: the user would get fed up with waiting and fire the query again, and again, and again, eventually bringing the db server to its knees (I should point out that the app server was Tomcat, not AOLserver).

I know I can kill a running Postgres process and can discover the query etc. using tools like pgMonitor, but obviously this doesn't indicate where the query originated, so you risk killing the wrong SELECT if two users are running the same query.
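(For the record, the sort of thing pgMonitor shows you can also get straight from SQL. A rough sketch; note the pg_stat_activity column names vary by PostgreSQL version, so check yours:)

    -- List the running backends and what they're executing.  These are
    -- the column names in recent PostgreSQL releases; older ones call
    -- them procpid and current_query instead.
    SELECT pid, usename, query
      FROM pg_stat_activity
     WHERE state = 'active';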

The other obvious answer is don't have expensive queries which could cripple the server, or use hardware capable of supporting the worst-case scenario, but this isn't always possible. For instance, if I fire several searches at the OpenACS site without waiting for the results, would that upset things?

The question of automatically killing queries where the user has gone away is interesting given the stateless nature of web interfaces. Has anyone ever coded anything which does this?

Am I worrying unnecessarily? Your thoughts or rants would be appreciated.

    Steve

Posted by Steve Manning on
(Wow the form posting speed is amazing now - well done guys)
Posted by Don Baccus on
I can't think of any easy way to do this ... if PG provides a way to select the PID of the currently running backend, you could use an nsv array to store the PID of the handle used when serving an expensive page, keying on user_id and a unique page identifier. This would require manually allocating the handle rather than using our db_* API.

You'd then clear the array entry when you're done.

On entry you'd check whether the page is already being processed for the user by looking in the nsv array and, if so, issue a SIGHUP to the handle's PID that you've stored.

Then go set up the bookkeeping once again.
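In AOLserver Tcl the bookkeeping half might look roughly like this (untested sketch; pg_backend_pid() does exist in PostgreSQL, but the nsv array name, user_id, and page_id are made up):

    # Rough, untested sketch.  pg_backend_pid() is a real PostgreSQL
    # function; the nsv array name, user_id, and page_id are made up.
    set key "${user_id},${page_id}"

    # Allocate the handle manually instead of going through the db_* API.
    set db [ns_db gethandle]
    set row [ns_db 1row $db "select pg_backend_pid()"]
    set backend_pid [ns_set value $row 0]

    # Record which backend is serving this user's expensive page.
    nsv_set expensive_queries $key $backend_pid

    # ... run the expensive query on $db here ...

    # Clear the entry and return the handle when done.
    nsv_unset expensive_queries $key
    ns_db releasehandle $db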

Sounds like a hassle.  Better, IMO, to write fast queries, buy adequate hardware, and adjust the AOLserver configuration parameters (max and min threads, mostly) appropriately.

Posted by Don Baccus on
Of course there's this small problem that you don't own the backend process, so you can't SIGHUP it, and you can't run AOLserver as root :)
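(Unless the PG version is new enough to have pg_cancel_backend(), which sends the query-cancel signal from inside the server, so AOLserver needs no OS-level privileges at all. A hypothetical sketch, assuming it exists in your version and your role is allowed to call it:)

    # Hypothetical sketch: cancel the stored backend from a second
    # connection.  pg_cancel_backend() sends the query-cancel signal
    # (SIGINT) from inside the server -- no OS privileges needed, but it
    # must exist in your PG version and your role must be able to call it.
    if { [nsv_exists expensive_queries $key] } {
        set pid [nsv_get expensive_queries $key]
        db_string cancel_it "select pg_cancel_backend(:pid)"
    }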
Posted by Andrew Piskorski on
I'm not aware of any easy way to do it in Oracle, either, although it's been a long time since I really dug into it.

In Oracle you definitely can use sqlplus to connect with DBA privileges, find all the queries running, and kill sessions to make them stop. But that's nothing like automated, and it gives you no real way to track a query back to the OpenACS user_id that requested the page that ran it.
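(The mechanics, roughly, for anyone curious; the sid,serial# pair is of course hypothetical:)

    -- As a DBA in sqlplus: find the offending session ...
    select sid, serial#, username, status
      from v$session
     where status = 'ACTIVE';

    -- ... then kill it ('123,4567' is a hypothetical sid,serial# pair).
    alter system kill session '123,4567';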

Generally you never need to do this, not on most database-backed websites or other OLTP-style applications anyway. However, this sort of "kill the runaway query" functionality can definitely be useful in a large data warehouse, say. Especially when you're developing a new query and, oops, just did a Cartesian join of several tables, each with millions of rows...

But anyway, easier ways to kill runaway queries would be nice, but it's very unusual that you'd ever need to do so, or even really care enough to want to. Developing a new query on a dev server or in sqlplus is the only time I've ever really seen it. (If, on the other hand, you have a data warehouse type system that accepts arbitrary SQL queries from end users... ok, then you'd better worry about being able to kill runaway queries.)

And in Oracle I think if you kill the client process (AOLserver or sqlplus), the query should be aborted, so for development that's usually the quickest and easiest way by far to stop hosing your server. Oh wait. That could be completely wrong, I don't remember. At least sometimes the query keeps going even after the AOLserver or sqlplus client process is gone, but I don't remember whether that's the normal behavior or only happens when something goes wrong. Regardless, in practice, even during development, it's fairly unusual to need to kill a query manually (using your Oracle DBA sqlplus foo).

Posted by Andrew Piskorski on
But, um, getting back to Steve's original scenario:

For your case of "firing off the same slow query again and again, bringing the server to its knees": if there's a legitimate reason for that query to be slow (it really is doing hard, important work; it's not just written poorly), then what you want to do is prevent the firing off of queries 2 through N, not abort query N-1 each time query N is submitted.

Probably that means you need some UI to let the user go find the results of the query he submitted earlier, as he will have clicked away to another page in the meantime. In OpenACS, a simple smallish-scale system for that shouldn't be too hard to do. You already have every logged-in user uniquely identified, so only allow each user to run query Q once every N minutes, and cache the results of the query in an NSV for N or 2N minutes as well, or something like that.
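A rough, untested sketch of that sort of throttle (the nsv arrays, statement name, and query are all made up):

    # Each user can start the report at most once every $interval
    # seconds and gets the cached result in between.
    set user_id [ad_conn user_id]
    set interval 600   ;# N = 10 minutes, say
    set now [clock seconds]

    if { [nsv_exists report_last_run $user_id] &&
         $now - [nsv_get report_last_run $user_id] < $interval } {
        # Too soon: hand back the cached result instead of re-querying.
        set result [nsv_get report_cache $user_id]
    } else {
        nsv_set report_last_run $user_id $now
        set result [db_list run_report "select ..."]
        nsv_set report_cache $user_id $result
    }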

Of course, that's assuming that the expensive query is giving different results for each user. If all your users are running the same query with the same results, then none of that fancy stuff is necessary; all you need is a little caching in AOLserver and you're done. Your users will never see any difference except in speed. And you originally said this is for an e-commerce site... Most of the time, query tuning and maybe some caching is probably all they need.
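E.g., a shared cache can be as simple as this (hypothetical sketch; the memoization utility is spelled util_memoize in the OpenACS versions I have handy, and the proc wrapping the query is made up):

    # Hypothetical: best_sellers wraps the expensive query.
    proc best_sellers {} {
        return [db_list get_best_sellers "select ..."]
    }

    # Everyone shares one cached copy, refreshed at most every 600 s.
    set result [util_memoize {best_sellers} 600]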

Posted by Claudio Pasolini on
A (very) partial answer to the problem is the ad_return_if_another_copy_is_running API, which simply prevents the Tcl script from starting if another copy is already running.
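Called at the top of the page, e.g. (relying on the proc's defaults here; check its documentation for the optional arguments):

    # At the top of the expensive .tcl page, with default arguments --
    # see the proc's documentation for the optional ones.
    ad_return_if_another_copy_is_running

    # ... expensive query follows ...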
Posted by Steve Manning on
Thanks guys.

As I said, it's not a problem I envisaged having, but rather one which the client had encountered, and it obviously left a nasty taste in his mouth.

Assuming an expensive query can't be persuaded to complete in a reasonable amount of time with a bit of tweaking and a good dollop of hardware, I think I'd plump for a hybrid of these suggestions. That is, track requests to expensive pages by user and don't allow a user to access such a page more than once.

Following Andrew's suggestion, I had contemplated the idea of allowing the user to kick off a query and then sending them a notification via e-mail or SMS when it's ready to collect - probably with a link to a result page or a file to download. Another option is to send them the result as an e-mail, but this would only work well if they accepted HTML content e-mails, which I'd rather not force. I may pursue this out of curiosity if I get a mo.
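Something along these lines, perhaps (completely untested; the proc, nsv array, query, and URLs are made up, and while ns_schedule_proc and ns_sendmail are standard AOLserver commands, double-check the flags against your version):

    # Untested sketch: run the report in the background and mail the
    # user a link when it's done.
    proc run_report_and_notify { user_id email } {
        set result [db_list run_report "select ..."]
        # Stash the result where the collection page can find it.
        nsv_set report_results $user_id $result
        ns_sendmail $email "reports@example.com" \
            "Your report is ready" \
            "Collect it at http://example.com/report-result"
    }

    # Fire it off once, in its own thread, a second from now.
    ns_schedule_proc -thread -once 1 \
        [list run_report_and_notify [ad_conn user_id] $email]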

If you guys are not experiencing a problem, and you have been running OpenACS sites far longer than I have, then I suspect it's not the problem the customer makes it out to be. Of course, these words may come back to haunt me :o)

Thanks

    Steve

Posted by Andrew Piskorski on
Did the client actually explain to you in detail the real-life scenario that led him to think that aborting in-process queries was the best solution, or even a good solution? (If so, that sounds unusual and I'd be curious what about the site made it so.) Or was he simply mis-analyzing the problem, and some of the other solutions we mentioned above would have been more appropriate?

Unfortunately, in the real world, most of these performance and development problems probably tend to compound. If you have mediocre to poor developers writing lousy SQL queries, the data model underneath is probably substandard as well, any caching scheme they come up with in order to cope with the slow queries is probably also going to be lousy, the information and explanations they convey back to the client to help him make decisions will also be poor, etc...

Fortunately my personal observation of such phenomena has been limited and from a distance, but I have seen that a clueless client plus clueless and/or somewhat dishonest developers can easily conspire to turn the simplest project into a nearly insurmountable obstacle. (Note that this is just a general observation. I'm not saying that was the case with your client! In fact I doubt it could be, as his original question sounds too knowledgeable for that.)

Yeah, I agree with you, if the query really is going to stay slow for a good reason, you'll want to use or mix these different kinds of locking or caching strategies depending on your specific application.

Some of the infrastructure to make that easier is already there in OpenACS. I haven't checked lately so I can't tell you exactly what in detail, but people have definitely been improving those tools off and on. E.g.: the Developer Support page is really handy for query profiling. Simple one-off caching is often very easy and quick to do with just ad_memoize or the nsv commands. Don wrote a caching layer integrated into the db_* API for one of his clients, which I think is in the contrib CVS. Maybe more, I dunno.

Posted by Steve Manning on
Andrew

The client isn't an IT professional but is a very savvy finance/management bloke, and so is led by personal experience and information provided by 'professionals'. As I understand it, the particular report was running a query against a table of sales transactions, and it didn't scale well under Tomcat/Postgres. In the first few months it was fine, but now, after a couple of years, it can take several minutes to return and even, I'm told, cause IE to time out waiting for the page.

The problem is that they were under the illusion that if they hit the Stop button on the browser it would stop the request and kill the query. Not an unreasonable conclusion if you have no idea how the process works; after all, if you kill a report in Access then it's dead. So when they think the first request has died and isn't coming back, they press Stop and then request the report again. Now of course there are two queries running, so it takes even longer, so they hit Stop again, and so on, eventually bringing everything to a full stop.

I think he's OK now with what we've told him about the way OpenACS works, but the question got me thinking along these lines. Thankfully I've not needed the solutions, but it's good to know they're there just in case.

    Steve

Posted by Don Baccus on
If the query took minutes then either the data model has problems, they're trying to return far too many rows to reasonably display on a web page, they're missing indexes, or they forgot to ANALYZE their database, etc., etc.
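(The first things I'd check, with a hypothetical table and column name:)

    -- Refresh the planner's statistics (the "forgot to ANALYZE" case):
    VACUUM ANALYZE;

    -- Then see what the query is actually doing; sales and order_date
    -- are hypothetical:
    EXPLAIN ANALYZE
    SELECT * FROM sales WHERE order_date > '2003-01-01';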
Posted by Steve Manning on
Don

I think the answer is probably yes, yes, yes, and yes. 😊

Truth is I don't have access to it at the moment to see what's going on, but I suspect it's badly written, returns too much, and, from what I understand, maintenance is zero, so ANALYZE is probably a foreign word.

    Steve