Forum OpenACS Q&A: Re: My production site is down

Posted by Samir Joshi on

Something similar happened to our site in early beta, twice in a row, 20 minutes apart. I believe I should be able to reproduce it under heavy load, but I haven't had the opportunity to simulate it in the last two weeks. I could not post this earlier, but here is the access log excerpt:

Everything normal ...
....
203.163.150.198 - - [06/Jan/2003:19:49:30 +0530] "GET /dotlrn/admin/user-new-2?user_id=20202 HTTP/1.1" 200 5905 
203.163.150.198 - - [06/Jan/2003:20:00:58 +0530] "GET /dotlrn/classes/maths/calculus/f-y-b-sc-maths/?folder%5fid=21188&n%5fpast%5fdays=99999&page%5fnum=2 HTTP/1.1" 302 389
....
Everything normal again...
The server stopped responding after 19:49:30. After 8-9 minutes I rebooted the machine in a panic. Debug mode is on, but no error shows up there, and everything looked normal up to the last entry in the access log. The surprising entries in the error log (surprising not for their content but for their timestamps and the delay between them) are:
Everything normal ...
...
[06/Jan/2003:19:50:05][2554.3076][-conn0-] Debug: PgBindCmd: query with bind variables substituted =
    	select file_storage__new_file (
        	'NEW',           	-- title
        	'20281',          	-- parent_id
        	'19660',            	-- creation_user
        	'203.163.162.234',        	-- creation_ip
		true			-- indb_p
		);

[06/Jan/2003:19:50:05][2554.3076][-conn0-] Notice: Querying '
    	select file_storage__new_file (
        	'NEW',           	-- title
        	'20281',          	-- parent_id
        	'19660',            	-- creation_user
        	'203.163.162.234',        	-- creation_ip
		true			-- indb_p
		);'
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 stopping
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: nssock: triggering shutdown
[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: serv: stopping connection threads
[06/Jan/2003:20:00:06][1276.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 starting
....
Things back to normal again...

That is to say, although no new HTTP connection was logged after 19:49:30, the server's logging thread was still working at least up to 19:50:05. After that, despite many requests from clients, there is no entry in either the HTTP access log or the debug log. But when I rebooted the machine, AOLserver logged the 'AOLserver/3.3.1+ad13 stopping' message at 19:58:21.

No error is logged in the Postgres log. I cannot think of what might have gone wrong, or what happened between 19:50 and 19:58.
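
In case it helps anyone looking at a similar hang, a rough way to spot such silent windows is to scan a log for large jumps between consecutive timestamps. The sketch below is only an illustration, not anything that ships with AOLserver or OpenACS: the function name find_gaps and the 5-minute threshold are my own choices, and it assumes every line in a given file uses the same time zone, so the +0530 offset is ignored.

```python
import re
from datetime import datetime, timedelta

# Matches the leading bracketed timestamp of an access-log line
# ("[06/Jan/2003:19:49:30 +0530]") or an error-log line ("[06/Jan/2003:19:50:05]").
# The zone offset, where present, is deliberately not captured (assumption: one zone per file).
STAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Yield (previous, current, gap) for consecutive timestamps more than threshold apart.

    find_gaps and its threshold default are hypothetical helpers for this post.
    """
    previous = None
    for line in lines:
        match = STAMP.search(line)
        if match is None:
            continue  # continuation lines (e.g. the multi-line SQL) carry no timestamp
        current = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S")
        if previous is not None and current - previous > threshold:
            yield previous, current, current - previous
        previous = current


if __name__ == "__main__":
    # Error-log lines copied from the excerpt above (SQL body shortened to one line).
    sample = [
        "[06/Jan/2003:19:50:05][2554.3076][-conn0-] Notice: Querying '",
        "    select file_storage__new_file (",
        "[06/Jan/2003:19:58:21][2554.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 stopping",
        "[06/Jan/2003:20:00:06][1276.1024][-main-] Notice: nsmain: AOLserver/3.3.1+ad13 starting",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"silent from {before} to {after} ({gap})")
```

Running it over both the access log and the error log, and comparing where the gaps start, shows how long each log kept going after the other went quiet.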

Many thanks in advance for any clues, hints and help!