Forum OpenACS Q&A: Is Postgres/MySQL ready for VLDB solutions?

Should we stay away from Postgres or MySQL for VLDB problems?

I have some questions for all you db experts, especially regarding postgres..

Basically our team have been doing scientific marketing, and the load is getting too much for our current configuration. We wish to take sales, web access data, email clickthrough, personal profile data from hundreds of data sources for a dozen clients... plug them into a cheap 5+ server offline db farm.. and be able pull reports (doesn't have to be OLAP.. but that would be a bonus) on large data sets that increases by the gigabytes per week.
The required dataset that make up the input of the regular reports we generate are becoming more than a hundred million rows.

We'll be using major analytic tools to pull from these databases so it's necessary it supports ODBC and friends.

We don't have the budget of large companies. Who can easily justify entire multi-million Siebel installations. And we can't afford wonderful platforms like Oracle on distributed Solaris or SQL Server enterprise on clustered Xeons.

I have heard that Postgres recently acquired the ability to do replication.. that's very exciting.. but will I be able to do selects over partitioned tables spread over multiple databases? I like to learn about what you guys know about free database's limits and near future developments that we can look forward to.

Is there anything in the OSS community that resembles EMC's drive farms?

Any aid or pointers would be much appreciated.

And many apologies to those who feel this thread is off-topic.

Collapse
Posted by David C on
Hi, my coworkers has asked me to also ask about single server situations.. and to add SAPDB and Firebird/Interbase to the discussion.. if anyone can provide any explanation of the scalabilities and replication options of all of above mentioned databases that would be great! Thanks!
Collapse
Posted by Talli Somekh on
This isn't much of an answer for you, but if you need open source OLAP you might want to take a look at Mondrian

talli

Collapse
Posted by David C on
Any help is appreciated, thanks!
Collapse
Posted by Don Baccus on
I suggest you ask about Postgres on the Postgres general discussion list.

Dynamic database-driven websites tend not to accumulate gigabytes of new data per week.  You're not likely to find much practical experience-driven advice here on database systems of that size, and since we support Oracle as well as Postgres with our toolkit we often see our largest, most dynamic sites implemented using Oracle as the back-end.

But there are people in the Postgres community proper who have experience with datasets that grow as rapidly as you describe, so I'm quite sure you'll get some useful info if you ask there .  Go to postgresql.org and click on the "lists" link - you have to subscribe to post unfortunately but that's life!

Collapse
Posted by Andrew Piskorski on
David, if you have a chance to come back to this thread and summarize what you end up learning, that would be cool. :)

It's not clear to me whether you really need database replication at all, but since you mention it, from reading stuff on the web, I know:

PostgreSQL currently has production ready replication for one read/write master, many read-only slaves, in eRServer and/or RServ

PosgreSQL also has something very much cooler in Postgres-R, but it sounds like there's lots of work to get that refactored and brought up to date with the current version of PostgreSQL. I think it all came out of Bettina Kemme's PhD thesis research, so now that she's a professor at McGill maybe some of her students will do the work. :) Btw, her "Don't be lazy, be consistent: Postgres-R, a new way to implement Database Replication." paper is pretty interesting reading.

Collapse
Posted by Andrew Piskorski on
Btw, John Sequira recently pointed out this brief Yahoo article about Clusgres, a tool for clustering PostgreSQL for higher performance. In contrast to say, Backplane (which sounds like super-early-alpha code, not currently useful for anything real), Clustgres sounds like it might actually be useful now in some cases.

Aha, however, they require the (expensive, very high performance) Dolphin/SCI network interconnect. And they say they're using its shmem-style global shared memory hardware, so unfortunately, they are not just taking advantage of SCI's very low latency, they need that distributed shared memory support.

If they were using SCI just for the low latency, then presumably you could still run Clustgres and a cluster with Ethernet hardware (although with probably very bad performance), and could improve the performance somewhat adopting the GAMMA, M-VIA, or other codebases that let you come much closer to SCI latencies using cheap ethernet hardware, much closer than you can ever get using the typical Linux kernel and TCP/IP interface. But that isn't the case - oh well.

They say however that Clustgres currently allows writes, but is "optimized only for reads". Ah, and they're currently using NFS to share the filesystem. Ugh, how could that possibly support transactional and atomic writes? My guess is it can't, and that's why they're currently doing all writes only on the master node. Perhaps a real cluster distributed file-system with the right properties would let Clustgres scale for writes as well as reads, and maybe such a FS is do-able as long as they're already dependent on the SCI hardware.

Actually, I'm not sure how/why NFS would work for transactional reads either, but presumably they've made it do so. (I'm very foggy on what sort of actual locking and ACIDity primitives NFS provides, I just know they're supposed to suck.) Their "white paper" is pretty vague on how the magic "libOPUS" libraries used by Clustgres actually work.

Anyway, Clustgres seems interesting, but given the high cost of SCI hardware, and its current non-scalability for writes, I bet you'd have to have a large PostgreSQL database to make using Clustgres worthwhile. But, if you already had a one really big SMP server, as your database grows it might actually pay to offload the read requests to a bunch of (still expensive) smaller boxes with SCI via Clustgres, thus avoiding having to trade up to an even bigger (and probably much more expensive) SMP box.

I wonder how, if at all, this Clustgres stuff relates to Postgres-R. Since it depends on the SCI hardware I suspect it's unrelated. It would certainly be interesting to read a good compare and contrast of the two though...

Collapse
Posted by Andrew Piskorski on
The Lustre cluster file system sounds awfully complicated, but is in production use Lawrence Livermore and some other supercomputing centers. And it supports full POSIX file system semantics - which means you should be able to run an RDBMS on it. I've no idea whether anyone has actually done that or how well it works, though.
Collapse
Posted by Andrew Piskorski on
Some of the old info I posted in 2003 might be confusing now. My understanding is that since sometime in 2004, the primary PostgreSQL replication system has been Slony, replacing the older Rserver or eRServer stuff. Slony 1 currently does single-master multi-slave replication, but they have a roadmap for adding multi-master as well.
Collapse
Posted by Frank Bergmann on
Talli wrote:

This isn't much of an answer for you, but if you need
open source OLAP you might want to take a look at Mondrian

Hi,

I've been checking Mondrian a few days ago and it was quite a hassle to install. Not that much Mondrian itself, but the installer for JPivot is broken, the GUI frontend for Mondrian.

We actually decided to go with a "Pivot-Table" friendly CSV-Export for the moment for Project/Open.

Did you manage to install Mondrian successfully?

Bests,
Frank

Collapse
Posted by Priya Sachdev on
Hi,
I was trying the process of replicating a postgress DB using Slony. I got stuckup in the process.
1.installed postgres
2.Installed Slony
3.created a DB called "test" and inserted rows.
4.created another DB called "duplicate
5.cluster_setup.sh

#!/bin/sh

slonik <<EOF

cluster name = sql_cluster;

node 1 admin conninfo = 'dbname=test host=localhost user=postgres';
node 2 admin conninfo = 'dbname=duplicate host=localhost user=postgres';

try {
echo ' Initializing the cluster';
init cluster(id = 1, comment = 'Node 1');
}

on error {
echo 'Could not initialize the cluster';
exit -1;
}

create set (id = 1, origin = 1, comment = 'testint');

set add table (set id = 1, origin = 1, id = 1, full qualified name = 'public.test', comment = ' Testing');

set add table (set id = 1, origin = 1, id = 2, full qualified name = 'public.duplicate', comment = ' duplicate Testing');

store node (id = 2, comment = 'Node 2');
store path (server=1, client = 2, conninfo = 'dbname=test host=localhost user=postgres' );

store path (server=2, client = 1, conninfo = 'dbname=duplicate host=localhost user=postgres' );

store listen (origin = 1, provider = 1, receiver = 2);
store listen (origin = 2, provider = 2, receiver = 1);

EOF

6.

[root@ref_deg02 root]$ slon sql_cluster "dbname=test user=postgres" &
[1] 16137
CONFIG main: slon version 1.0.5 starting up
ERROR  cannot get sl_local_node_id - ERROR:  Namespace "_sql_cluster" does not exist
FATAL  main: Node is not initialized properly

[1]+  Exit 255                slon sql_cluster "dbname=test user=postgres"
[root@ref_deg02 root]$

I dont understand what this error means...can anyone explain...what am i doing wrong

Thanks
Priya