Forum OpenACS Q&A: Re: Fatal: received fatal signal 11

3: Re: Fatal: received fatal signal 11 - new error after years! (response to 1)

Posted by Torben Brosten on 08/19/09 08:44 PM

Signal 11 (SIGSEGV, a segmentation fault) can suggest bad RAM, or perhaps a process running out of memory.

Have there been any code or configuration changes on the server lately that may have reduced the memory available to AOLserver, or that might require AOLserver to run with a larger stacksize in config.tcl?

What stacksize is specified in the config.tcl?

Do the events happen at about the same time of day, or at a regular interval? Look at the timestamp on the line just above the error to get an idea of when the server stopped. Timing patterns can indicate whether a scheduled process is triggering the crash, and which one.

Check a few lines back in the error.log to get an idea of what was happening before each crash. The bracketed number after the timestamp at the beginning of each line identifies a thread/process, so numerous lines may be associated with the same activity.
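
If the log is large, a small script can pull out the entries leading up to each crash. This is only a rough sketch in Python: the line layout ([timestamp][pid.thread][thread name] message) is assumed from the log lines quoted in this thread, and the function name and the `keep` parameter are made up for illustration.

```python
import re
from collections import deque

# Assumed layout of an error.log header line, e.g.:
# [19/Aug/2009:17:28:49][19281.3074333616][-sched-] Fatal: received fatal signal 11
LINE_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\[(?P<pid>\d+)\.(?P<tid>\d+)\]\[(?P<thread>[^\]]*)\]\s*(?P<msg>.*)$"
)

def context_before_crashes(lines, marker="received fatal signal 11", keep=30):
    """Return one list per crash: up to `keep` parsed entries before it, then the crash entry."""
    recent = deque(maxlen=keep)
    crashes = []
    for raw in lines:
        m = LINE_RE.match(raw.rstrip("\n"))
        if m is None:
            continue  # continuation lines (SQL text, Tcl stack traces) have no header
        entry = m.groupdict()
        if marker in entry["msg"]:
            crashes.append(list(recent) + [entry])
        recent.append(entry)
    return crashes

sample = [
    '[19/Aug/2009:17:28:33][19281.2935122864][-sched:25-] Error: Transaction aborted',
    'SQL:',
    '[19/Aug/2009:17:28:49][19281.3074333616][-sched-] Fatal: received fatal signal 11',
]
for crash in context_before_crashes(sample):
    for e in crash:
        print(e["stamp"], e["tid"], e["thread"], e["msg"][:60])
```

To run it against a real log, pass an open file instead of `sample`. Comparing the timestamps and thread names across several crashes should show whether the same scheduled thread is always active just before the signal 11.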

Can you post more lines from the error.log, say 30+ lines?

Also, sometimes wildly large HTTPS requests (megabytes in size) are thrown at a form, apparently in an attempt to break the server process. I guess a crash is possible if the form's handler is not prepared for them. Check (but don't post) server.log to rule this out and to get another perspective on which user requests were in progress.

HTH,

Torben

4: Re: Fatal: received fatal signal 11 - new error after years! (response to 3)

Posted by Shahid Butt on 08/20/09 09:59 AM

Hi, I'm responding in place of Matthew, as he is away today. Thank you for your response, Torben.

Here are some of the lines from the error.log:

********************************************

[19/Aug/2009:17:28:33][19281.2935122864][-sched:25-] Error: Transaction aborted: Database operation "0or1row" failed (exception NSINT, "Query returned more than one row.")

SQL:

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = '<email removed from post>'

Database operation "0or1row" failed (exception NSINT, "Query returned more than one row.")

SQL:

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = '<email removed from post>'

while executing
"ns_pg_bind 0or1row nsdb0 {

select user_id, first_names || ' ' || last_name as user_name
from cc_users
where email = :email

..."
("uplevel" body line 1)
invoked from within
"uplevel $ulevel [list ns_pg_bind $type $db $sql]"
("postgresql" arm line 2)
invoked from within
"switch $driverkey {
oracle {
return [uplevel $ulevel [list ns_ora $type $db $sql] $args]
}
..."
invoked from within
"db_exec 0or1row $db $full_statement_name $sql"
invoked from within
"set selection [db_exec 0or1row $db $full_statement_name $sql]"
("uplevel" body line 2)
invoked from within
"uplevel 1 $code_block "
invoked from within
"db_with_handle -dbn $dbn db {
set selection [db_exec 0or1row $db $full_statement_name $sql]
}"
(procedure "db_0or1row" line 50)
invoked from within
"db_0or1row get_user_name_and_id """
(procedure "get_address_array" line 25)
invoked from within
"get_address_array -addresses [string map {\n "" \r ""} $to_addr]"
(procedure "acs_mail_lite::send" line 10)
invoked from within
"acs_mail_lite::send -to_addr $email -from_addr $from_email -subject $subject -body $content -extraheaders $extra_headers"
(procedure "notification::email::send" line 46)
invoked from within
"notification::email::send $from_user_id $to_user_id $reply_object_id $notification_type_id $subject $content_text $content_html"
(procedure "AcsSc.notificationdeliverymethod.send.notification_email" line 1)
invoked from within
"AcsSc.notificationdeliverymethod.send.notification_email 4000813 799119 {} 692 {File Storage Notification} {} {Notification for: File-Storage: New Fil..."
("uplevel" body line 1)
invoked from within
"uplevel $func_and_args"
(procedure "apply" line 3)
invoked from within
"apply $proc_name $arguments"
(procedure "acs_sc_call" line 6)
invoked from within
"acs_sc_call NotificationDeliveryMethod Send $args $impl_key"
(procedure "notification::delivery::send" line 16)
invoked from within
"notification::delivery::send -from_user_id [ns_set get $notif notif_user] -to_user_id [ns_set get $notif user_id] -notification_type_id [ns_set get..."
("uplevel" body line 3)
invoked from within
"uplevel 1 $transaction_code "
(procedure "db_transaction" line 39)
invoked from within
"db_transaction {
# Send it
notification::delivery::send -from_user_id [ns_set get $notif notif_user] -to_use..."
(procedure "notification::sweep::sweep_notifications" line 107)
invoked from within
"notification::sweep::sweep_notifications -interval_id 635 -batched_p 0"
("eval" body line 1)
invoked from within
"eval [concat [list $proc] $args]"
(procedure "ad_run_scheduled_proc" line 46)
invoked from within
"ad_run_scheduled_proc {t f 60 notification::sweep::sweep_notifications {-interval_id 635 -batched_p 0} 1250698816 0 f}"
[19/Aug/2009:17:28:49][19281.3074333616][-sched-] Fatal: received fatal signal 11

************************************

5: Re: Fatal: received fatal signal 11 - new error after years! (response to 3)

Posted by Shahid Butt on 08/20/09 10:13 AM

We have the following stacksize in our config.tcl:

**************************
ns_section ns/threads
ns_param mutexmeter true ;# measure lock contention
# The per-thread stack size must be a multiple of 8k for AOLServer to run under MacOS X
ns_param stacksize [expr 128 * 8192]
**************************

6: Re: Fatal: received fatal signal 11 - new error after years! (response to 4)

Posted by Torben Brosten on 08/20/09 11:19 AM

Shahid Butt,
Looks like a couple of things are happening.

First, there's a query error related to a scheduled notification, which is probably being retried repeatedly because it keeps failing.

Second, there may be a backlog of scheduled events that are using up available memory and causing the signal 11 (just a guess).

You'll want to fix the query error. Its cause appears to be two users with the same email address, where the query expects each user to have a unique email. (I edited out the particular email address that was posted here, but you can see it in your copy of error.log.)
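
To see which addresses are affected, a grouping query is one way to list them. The sketch below is only an illustration in Python: it runs the query against a throwaway in-memory SQLite table with made-up rows, and only the table and column names (cc_users, user_id, first_names, last_name, email) are taken from the logged query. The select itself is plain SQL, so I'd expect it to run in psql against the real database too, but that is an assumption.

```python
import sqlite3

# Plain SQL: list every email that appears on more than one row.
DUPLICATE_EMAILS_SQL = """
    select email, count(*) as n
    from cc_users
    group by email
    having count(*) > 1
    order by email
"""

def find_duplicate_emails(conn):
    """Return (email, count) pairs for every email that appears more than once."""
    return conn.execute(DUPLICATE_EMAILS_SQL).fetchall()

# Stand-in table with hypothetical rows, just to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table cc_users (user_id integer, first_names text, last_name text, email text)"
)
conn.executemany(
    "insert into cc_users values (?, ?, ?, ?)",
    [
        (1, "Ann", "Example", "ann@example.com"),
        (2, "Bob", "Example", "bob@example.com"),
        (3, "Ann", "Other", "ann@example.com"),
    ],
)
print(find_duplicate_emails(conn))
```

Any address this returns would make the 0or1row lookup in the trace fail in the same way, so it is worth checking all of them and not just the one in the log.
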
cheers,

Torben

7: Re: Fatal: received fatal signal 11 - new error after years! (response to 3)

Posted by Andy Black on 08/20/09 11:48 AM

I'm also replying in Matthew's absence.
Thanks for the reply Torben.

It is not happening at a particular time of day; it happens around 10 minutes after a server restart. The site is up and available during that 10-minute window.

We checked server.log while the site was up. I loaded the login page, and nothing out of the ordinary was happening: just requests for images and CSS, which completed successfully. The page had loaded fully and was sitting idle when the fatal error occurred (server.log showed no requests being made at the time).

We increased the stacksize from...
ns_param stacksize [expr 128 * 8192], to
ns_param stacksize [expr 512 * 8192]
and then restarted, but still no joy.
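
For reference, those two settings work out to 1 MiB and 4 MiB per thread. Here is a quick Python check of the arithmetic; the helper name is just for illustration, and the multiple-of-8k test simply mirrors the comment in our config.tcl.

```python
MIB = 1024 * 1024

def stacksize_bytes(multiplier, unit=8192):
    """Mirror the Tcl expression [expr multiplier * 8192] from config.tcl."""
    return multiplier * unit

for multiplier in (128, 512):
    size = stacksize_bytes(multiplier)
    # Show the byte count, the size in MiB, and whether it is a multiple of 8k.
    print(multiplier, size, size // MIB, "MiB", size % 8192 == 0)
```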

We will have a go at removing the duplicate email addresses and then restart. How do we clear the backlog of scheduled events?

Thanks,

Andy
