Forum OpenACS Development: Response to Query Dispatcher initial specs posted

Collapse
Posted by Kapil Thangavelu on
making a developer's life easier - i've uploaded the next iteration
of the acs sql extractor to
zope.org/Members/k_vertigo/acs-sql-extractor.tgz

major changes include a lot of refactoring (i took a few minutes and
designed this time:). all parsing code in the parser (really just
regexs), configuration via commandline, config file, or environment
variables, somewhat configurable outputing, use of container objects
to model package->directory->file->content.

this release has less true functionality than the original when
operating on sql files, but it is much better designed and thought
out in terms of flexibility. the biggest todo thing now is rewriting
the parser (the writer needs some work, but that will be easier to
figure out once this spec finalizes (at this point it looks like i
need to use dom to insert in a db's specific version of query to
query xml file). the current one uses regex, which are not my forte,
but i believe that i've reached the limit of what i can do
realistically with regexs. if anyone here is a regex guru/expert and
wants to help please email me. at this point i believe there is too
much possible contextual information thats possible in the body of a
query, and a great deal more of the acs4 uses dynamic queries than i
had previously thought. there just seems to be too much irregularity
in the possible sql calls to use regexes for everything. a brief
contrived example in case someone can come up with a solution (btw
python has perl style regexs), to my current problem of retrieving
the query portion of a db api call.

db_string foo "
select user_name from [AdParameter "user_table"]
where user_id = ${user_id}
"
(or the alternate syntax with {select...})

at this point i'm going to try and hand roll a context based
descent parser (i no jack about parser terminology, but i believe
this is correct, and it looks pretty easily doable). the basic idea
behind the new parser is to use a simple regex to identify db calls
and use a parser on a window around the context of that call to
parse out the query properly for statistics analysis, and warnings
about dynamic sql.

the main reason i'm putting this out right now in this state, is to
invite comments and to reassure anyone who's interested that this is
being worked on. feedback welcome.

xml query files - all db versions in one files, i'm obviously coming
from a generative POV, but i'm also thinking about it from a
developers perspective of going through all the various db versions
of a call to add your own. i was originally thinking that with a
generative approach, the queries could be separted from the web
directory and be further subdivided into directories based on db
type (including ansi). after the server tastes the db pool it could
process the appriopiate directory (plus the ansi for anything not
found). after realizing how much dynamic sql there is i realize that
this is not feasible and that a developer will need to do some hand
editing (maybe... more on dynamic sql later).

this would also have the added benefit of making the parsing easier
and less memory intensive (read faster), the current spec requires
reading all of the lower level nodes (of a file with all db types)
to figure which db a query is for, storing this at a higher level in
the xml format would make vast reductions in the numbers of nodes
that need to visited. i'm granting that any xml parsing is a pretty
trivial concern, but i don't want my server taking 5 minutes to
startup (dom is slow, sax would be ideal).

speaking of which what tcl libraries were people thinking about
using for xml processing, i haven't looked at tcl's xml capabilities
for a while, and the last time i did they were all dom based written
in tcl. does tcl have a front end to expat?, a fast dom (level2?),
etc (links, please).

the bottom line concern here is def. whats easiest for developers to
work with?. is having a buffer to another directory open a huge
hassle for developers?

the acs-4 already aggregates content and logic for that content in
the www. but the whole query dispatching system seems to be moving
towards something more along the lines of system resources (esp if
query reuse comes into play), and i don't think its a good idea to
stick system resources in www. (aesthestics).

query reuse - is a good thing, and should help matienance, but i'm
clueless on the best way to go about it. btw. afaik there is no hard
enforcement of unique names built into acs, its all convention
based. in fact i see some duplication of names in the same file.
grep -rn db_* * > ../out.sql

tcl proc - embedding tcl is a not an elegant solution nor is
rengineering tcl code in another language. i still think this is a
very interesting idea, but its not language indepedent. and since
i'm considering doing a zope or perhaps a java port i would really
prefer an xml based format.

three files - i don't see how its possible, to remove the tcl files
all toghether. there are a lot of dynamically generated calls, much
more than i had originally thought, this (dynamics queries) is
something that needs to be discussed in the context of the query
dispatcher. also figure in things like redirects, ad_page_contract,
etc... eliminating the tcl would basically mean recreating it in
another form, why bother?

dtd - if no one else gets around to it, i'll write one up when the
spec finalizes (i'm going to be using it for optional validation of
generated xml files).

metadata - pascal brings up alot cool ideas about increasing the
metadata associated with a query. i'm all for this in a generative
world, in a developer working by hand world this looks like it might
be a little too tedious. while i think having the information
available is a good thing. i think that keeping the dispatcher as
light as possible is more important than increasing its features (at
least until some real world tests are possible). IMO (i have no hard
evidence, but i can't possibly see how it could be otherwise) the
acs4 is slower than acs3x because of all the extra joins and stuff,
(permissions, and privilege heirarchy stuff alone  result isome slow
very common calls), i think streamlining the db abstraction layer to
keep its performance impact to a minimum is good thing. but adding
in metadata seems like a perfectly reasonable thing to do that
should be useful. i think that return types can be safely
encompassed by including the name of the db_api call in the metadata
(db_string, db_1row, db_multirow, etc). typing in tcl:), everything
is a string. comments are a cool. pre and post conditions on queries
sounds way too expensive, IMO, this is application specific
information and responsibility not the systems (unless you're using
a language like effiel, which has native support for this type of
thing).

dynamic sql - this seems to be the white horse that hasn't been
addressed yet. i was originally very concerned by this, but now i'm
not so sure if it really makes much impact in terms of
functionality. the original query is just being stored as string,
and when its retrieved by the dispatcher the string will be
evaluated in context of the pages where these variables presumably
exist. so in the sense of functionality its not a huge issue. by
dynamics sql i'm including all variants of non-hardcoded queries
including fully dynamic queries, those with embedded vars, and
function calls. it does become issue wrt to being able to quantify a
queries db abstraction, since the code that generates the query
could possible be in the tcl page. i really think that this issues
needs some discussion. 95%+ of the dynamic sql calls i currently see
in the acs4 are not of this variety and just use a simple function
call or variable to add a value to a comparison clause or what not.
but the remaining cases could be problematic.

sql92- ben could you elaborate a little more on what envisioned for
adding sql92 support. its sounds a little open ended at the moment.
from what you wrote it looks like your identifying a queries
constructs and comparing them to an rdbms capabilities wrt sql92
compliance... why not just take a simpler approach and use the sql92
query unless an rdbms version of the query is found.