Forum OpenACS Development: Out of memory in AOLServer

Posted by Eduardo Santos on
I have to load a csv file of about 400 MB into the DB using the OpenACS interface. My first thought was to use the PostgreSQL COPY command, but I would have problems with the encoding. The file comes in a Windows encoding (which is a European one, I guess) and my DB has to be Unicode. I couldn't find any switch to convert the encoding at COPY time.

My second solution was to use the OpenACS interface through AOLServer, from which I can get the file in any encoding, perform some changes on it, and put it into the DB. So what I do is open the file, split it on newlines (\n), run a foreach over it, and then split each line on the delimiter character (;). That way I split out the columns and insert them into the DB. My commands are something like this:

set comdados_file [template::util::file::get_property tmp_filename $comdados]
db_transaction -dbn "marketing" {
    db_dml -dbn "marketing" cria_tabela {
        CREATE TABLE comdados_mensal (
            var1 character varying,
            var2 character varying,
            etc...
        );
    }

    # this reads the whole 400 MB file into memory at once
    set fd [open $comdados_file "r"]
    set arquivo [read $fd]
    set lista [split $arquivo "\n"]

    set i 0
    foreach var $lista {
        incr i
        set valores [split $var ";"]
        set var1 [string trim [lindex $valores 0] {"}]
        set var2 [string trim [lindex $valores 1] {"}]
        etc...
        # skip the header line
        if {$i > 1} {
            db_dml -dbn "marketing" insere_valores {
                INSERT INTO comdados_mensal (var1, var2, etc...)
                VALUES (:var1, :var2, etc...);
            }
        }
    }
}
close $fd
file delete $comdados_file

I know this is not the best way to do it, but it's the only option I could think of. The problem is in the foreach part of the code: the file is too big, and since I'm running this on a desktop using Windows (sic), AOLServer consumes all the available memory and Windows kills the application. Any ideas on how to make it better without buying more physical memory?

Posted by Patrick Giagnocavo on
My recommendation would be to write a file of the SQL commands.

Then use psql -f filename.sql to actually do the inserts.

I am pretty sure you are reading the entire file into memory at one time, which will not make Windows very happy.

Split the 400MB file into 4, 100MB files using a file split utility.

This is available on Unix (and is probably already installed) and searching for "Windows split file" on Google will turn up many freeware tools to do this on Windows.
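For example (an untested sketch; the file names are made up and only two columns are shown), you could stream the input with gets so the whole file never sits in memory, and escape the values as you go:

set in  [open "comdados.csv" r]
set out [open "comdados.sql" w]
puts $out "BEGIN;"
while {[gets $in line] >= 0} {
    set valores [split $line ";"]
    # strip the surrounding quotes and double any single quotes for SQL
    set var1 [string map {' ''} [string trim [lindex $valores 0] {"}]]
    set var2 [string map {' ''} [string trim [lindex $valores 1] {"}]]
    puts $out "INSERT INTO comdados_mensal (var1, var2) VALUES ('$var1', '$var2');"
}
puts $out "COMMIT;"
close $in
close $out

Then psql -f comdados.sql does the inserts without AOLServer in the loop at all.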

Posted by Tom Jackson on
My suggestion would be to figure out how to minimize memory use, but there are some issues involved. One is that a CSV record can contain embedded newlines (\n), so anything based on the idea that records are all on one line is doomed to fail, and with 400MB of data, it probably will fail.

So what to try? Look at ns_getcsv:

http://rmadilo.com/files/nsapi/ns_getcsv.html

This command can be used once a file is open and positioned correctly. It reads one csv record, which might span more than one text line.

When you open a file in Tcl, it does not read the entire file into memory. You can use seek to move to a point to begin reading. So look at the Tcl commands:

http://junom.com/document/tcl/open.htm
http://junom.com/document/tcl/seek.htm
http://junom.com/document/tcl/tell.htm
http://junom.com/document/tcl/close.htm

Using open and close is obvious. Seek moves to the specified character position before reading; the first move is to seek 0, but once you do an ns_getcsv you can then do a tell, which should give the next value to pass to seek (or you might have to add one or two to it, only testing will tell). Using these commands you avoid the need to parse char-by-char or to know too much about the details of csv. The command ns_getcsv exists because csv is a special format which cannot be handled with a simple loop of open, seek, gets, close, or whatever. Also note that ns_getcsv returns the number of columns found, and you can use the Tcl command [eof] to detect the end of file (like while {![eof $csvfile]}).

Yes, you will open and close the file for each record, but you are trading time for money: money you either don't have or don't want to spend. By processing one record at a time you will probably save programming time, because the result will work in every situation. You will lose time processing the data, but you could start the process and take a break; computers work pretty cheap. I think you will find that the file handling code executes much faster than whatever database code runs for each record, so there will probably be no downside.
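A rough sketch of that loop (untested; it assumes, per the docs above, that ns_getcsv takes an open channel and a variable name and returns the number of columns read, and it reuses the table from the original post):

set pos 0
set done 0
while {!$done} {
    set fd [open $comdados_file r]
    seek $fd $pos
    set ncols [ns_getcsv $fd linha]  ;# one csv record, however many lines it spans
    set pos [tell $fd]               ;# where the next record starts
    set done [eof $fd]
    close $fd
    if {$ncols > 0} {
        set var1 [lindex $linha 0]
        set var2 [lindex $linha 1]
        db_dml -dbn "marketing" insere_valores {
            INSERT INTO comdados_mensal (var1, var2)
            VALUES (:var1, :var2)
        }
    }
}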

Posted by Brian Fenton on
¡Hola Eduardo!

I think your first idea was probably the right way to go; there must be a way to tell Postgres the character set of the incoming data. Is this link any use? It talks about using the \encoding command in psql and SET client_encoding TO:
http://www.network-theory.co.uk/docs/postgresql/vol3/AutomaticCharacterSetConversionBetweenServerandClient.html
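
If so, something like this in a psql session might do it (untested, and WIN1252 is only a guess at the actual Windows encoding):

SET client_encoding TO 'WIN1252';
\copy comdados_mensal FROM 'comdados.csv' WITH DELIMITER AS ';' CSV HEADER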

Probably more useful to you are these 2 procs already in OpenACS: https://openacs.org/api-doc/proc-view?proc=db_load_sql_data
https://openacs.org/api-doc/proc-view?proc=oacs_util::csv_foreach

The second one uses ns_getcsv, which Tom Jackson mentioned above. They should give you a good start.
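
For the second one, the call would look roughly like this (untested; it assumes, per the api-doc, that the first csv line holds the column names and each row comes back in the named array):

oacs_util::csv_foreach -file $comdados_file -array_name row {
    set var1 $row(var1)
    set var2 $row(var2)
    db_dml -dbn "marketing" insere_valores {
        INSERT INTO comdados_mensal (var1, var2) VALUES (:var1, :var2)
    }
}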

¡Buena suerte!
Brian

Posted by Eduardo Santos on
Thank you all very much for the help. I was having a lot of trouble with this, and your contributions helped a lot.

Tom, I was surprised by these AOLServer commands I didn't know about. Where did you find them? I've spent a lot of time researching something like this and couldn't find anything, not even through Google. We should do something about the AOLServer docs: there's no easy information repository, and when we need to change them it's always very difficult.

It seems like Brian's contribution, oacs_util::csv_foreach, is going to save my life on this. I hadn't used it before because I didn't know how to deal with the ns_getcsv command. I still have some trouble with the delimiters, because the command uses , as the delimiter and my file uses ;. I'll find some workaround for it.

As for the db_load_sql_data procedure, it has the same charset problem, so I don't think it's going to help. Anyway, I'll try to solve the delimiter problem and tell you how I did it.

Thank you very much for the help again.

Posted by Eduardo Santos on
For the delimiter problem I found this:

https://openacs.org/forums/message-view?message_id=141943

Posted by Eduardo Santos on
Thank you very much, Torben. I found your forum post about this proc and used your solution in a hacked implementation of oacs_util::csv_foreach. Now I can open the file.

As Tom said, it became a little bit slow, but since I can't buy more hardware (and I also have to use Windows), it's the best I can do.