Forum OpenACS Q&A: Clickstream solutions

Posted by Talli Somekh on
Hey guys,

Among the features most frequently requested by marketroids is the
always-horrible-to-build clickstream solution. This feature is more
robust than what Analog or Webalizer bring to the table, primarily
because it requires rather detailed data stored in a DB. aD gave it a
shot but never finished it, releasing a 0.1d version that, from what
I've heard, is almost totally useless for production environments.

It would make sense for OpenACS to have a clickstream solution, since
it's DB-backed and a reasonably complete web toolkit, but clickstream
is a *very* hard problem. I believe the clickstream module and the
article at the ASJ (http://www.arsdigita.com/asj/clickstream/) came
out of Richard Li's Master's thesis, and it might be a good starting
point. However, those I've spoken to have said there's no reason to
think clickstream *has* to be an ACS module.

We've recently purchased the NetTracker Enterprise version, which has
its own DB and sells for $1K. I know that furfly runs NetTracker, and
I have generally heard good things about it. NetTracker does offer a
version that connects to Oracle, but it costs $25K.

One suggestion that has been brought up is to do some customization
to our cookies so that NetTracker can gather information from them.
That's a fine suggestion, and one that is surely worth pursuing since
it keeps the work relatively minimal. NetTracker Enterprise's cost is
not so prohibitive, but it's not Free and Open Source either.

The moral of this long-winded tale is that I've not found a good
answer that is both Free and a complete solution. I was wondering
whether anyone has experience implementing clickstream solutions, how
they went about it, and whether there are Grand Visions for solving
this for OpenACS.

talli

Posted by carl garland on

I can tell you what I implemented, though it may be too long to really post here. First off, I only do clickstream for user sessions after the user has logged into the site. I have built and tested it but not launched it yet; the site is in beta and should be live soon, so you can get a better feel for what I am describing. I've implemented two separate forms of clickstream: lite and heavy. First, the lite version.

The lite version is only concerned with user page paths and is as follows:

  • When the server starts up I create a server session id and a directory which holds all sorts of session forensic data, including this clickstream data. At startup I create a user_session_path file in this directory. As a side note, I also put the access logs, periodic telemetry dumps, url profiling data, and nsv variable dumps into this directory.
  • When the user logs in I add an nsv variable (user_session_id) to my nsv array user_session_paths.
  • At each pageload (I run a preauth filter on all pages; a sketch of such a filter follows this list) I determine which page the user is on and map it to a page/id hash that is also loaded at server startup. I then lappend (page_id, ns_time) to this user's session path nsv variable.
  • When the user logs out or is auto-logged-out I append their individual nsv session path to the file that I created in the server session dir, and unset the corresponding nsv var.
  • At server shutdown, dump any remaining vars from a shutdown proc.
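In rough outline, the pageload filter is something like this (simplified; page_ids is the page/id hash loaded at startup, and my_get_session_id stands in for the real session-cookie lookup):

proc track_page_path {why} {
    set url [ns_conn url]
    # Only record urls that were mapped to a page id at startup.
    if {![nsv_exists page_ids $url]} {
        return filter_ok
    }
    set page_id [nsv_get page_ids $url]
    # my_get_session_id stands in for your own session lookup.
    set session_id [my_get_session_id]
    if {[nsv_exists web_session_path $session_id]} {
        nsv_lappend web_session_path $session_id "$page_id [ns_time]"
    }
    return filter_ok
}

# Register on all GET requests, before authentication.
ns_register_filter preauth GET /* track_page_path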

You can then parse the session path file into a graph which represents user paths; a rough sketch of that parse follows. I keep a database copy of the individual graphs and a cumulative graph that holds prev/next data. You can then see all pages before/after a given page, and the time between pages can be delved into. This is not necessarily true link-to/link-from data, because a user can have multiple browser windows open and may switch between them.
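The parse itself can be as simple as something like this (again simplified; it assumes the file format from the snippets below: a server start time on the first line, then one line per session of session_id followed by alternating page_id/time pairs):

set fId [open $sessionpathfile r]
gets $fId server_start_time   ;# first line holds the server session start time
array set edge_count {}
while {[gets $fId line] >= 0} {
    set prev ""
    # Skip the leading session id, walk the page/time pairs.
    foreach {page_id time} [lrange $line 1 end] {
        if {$prev != ""} {
            # Count the prev -> next transition for the cumulative graph.
            if {[info exists edge_count($prev,$page_id)]} {
                incr edge_count($prev,$page_id)
            } else {
                set edge_count($prev,$page_id) 1
            }
        }
        set prev $page_id
    }
}
close $fId
# edge_count(a,b) now counts how often page b immediately followed page a.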

That's the lite version. The heavy version is much more detailed and contains about 40 different db tables. If you want, I can post a short description of it or email you. Best Regards,
Posted by Tom Jackson on

Carl,

I'm interested in your use of nsv. It is great for logging, but I have never spent the time to flush it to disk. Can you send me, or post, your code for the 'lite' version? Thanks.

Posted by carl garland on
At server startup, make the file with:

if {[catch {open $sessionpathfile w} fId]} {
    ns_log Notice "Could not create session path file"
} else {
    puts $fId [ns_time]
    close $fId
}

At user session login:

nsv_set web_session_path $session_id ""

At pageload filter:

nsv_lappend web_session_path $session_id "$page_id [ns_time]"

At session logout:

if {[catch {open $sessionpathfile a} fId]} {
    ns_log Notice "Unable to append to session path file"
} else {
    puts $fId "$session_id [nsv_get web_session_path $session_id]"
    close $fId
    # Drop the session's entry now that it is on disk.
    nsv_unset web_session_path $session_id
}

At server shutdown:

if {[catch {open $sessionpathfile a} fId]} {
    ns_log Notice "Unable to append to session path file"
} else {
    foreach s_id [nsv_array names web_session_path] {
        puts $fId "$s_id [nsv_get web_session_path $s_id]"
    }
    close $fId
}
Posted by Nick Carroll on
Hi Guys,

I know this thread is a little outdated, but I was wondering if anyone has been working on Richard Li's clickstream module?  If so, would you be able to tell me about your experiences?  I am interested in any responses, as I am a newbie to OpenACS and am working on a clickstream data warehouse module for OpenACS.  The following is a brief development plan of what I intend to do.

1. Port the aD clickstream module to OpenACS.  Currently this would involve porting functions, procedures and queries to PostgreSQL.  I am not sure if I'll need to hook the data model into the OACS object system.

The data warehouse creation process is hidden from the end user, and it's probably best if I keep it that way.  Should this process be fired off from the create-packagename.sql file?

The data warehouse data model is a standard star join.  There's a central fact table, and there are many dimension tables that the fact table references.  I can't see why I would need OACS objects for this data model, as the data warehouse creation process is more of a service than an application.  Is that distinction correct, that OACS objects are generally used for applications rather than services?

2. Create an interface to the clickstream data warehouse.  For this step I believe I will have to look into using OACS objects, so that my application will inherit permissions, etc.  This part of the project will be more visible to a user, such as an administrator.

I too believe that there is no reason to include the clickstream data warehouse as an ACS module.  In fact I think it should be an independent PostgreSQL application, as I am concerned that a large data warehouse could slow down the web server dramatically.  However, I would like to propose the idea of using OACS to develop a web interface to manage, as well as to extract information from, the clickstream data warehouse.  Does this idea sound feasible?

Any guidance or feedback will be much appreciated.

Thanks,
Nick.

Posted by Janine Ohmer on
I haven't used the clickstream module at all, so I can't help you there.  But I remember reading somewhere that someone (I can't recall who) had ported it to Postgres for OpenForce.  It has never been released as far as I know, but you might want to ask them about it.  It could be that it wasn't released because it turned out not to be useful, as Talli says at the top of this thread.
Posted by Nick Carroll on
Thanks Janine, I believe you were referring to Roberto Mello's PostgreSQL porting guide at http://www.brasileiro.net/postgres/porting/index.html.  The guide is based on his port of the aD clickstream module.

Nick.

Posted by Nick Carroll on
Hi,

Is there anyone out there who is familiar with Richard Li's clickstream data warehouse module?

There seems to be a Tcl file missing from the arsdigita repository at http://eveander.com/arsdigita/acs-repository/clickstream-0.2.  The file I'm after is cs-avail.tcl (http://$host/SYSTEM/cs-avail.tcl).  I think it works out whether clickstream data is available to download from a host site.  It appears to be an important file, needed for the clickstream module to populate the data warehouse.

Thanks,
Nick.

Posted by Jeff Davis on
here is the copy I have:
# $Id: cs-avail.tcl,v 3.1 2000/04/10 15:17:21 jsalz Exp $
# File:        cs-avail.tcl
# Date:        6 Apr 2000
# Author:      jsalz@mit.edu
# Description: Returns a list of times in the past week for which clickstreaming data are
#              available.

ReturnHeaders "text/plain"

set time [cs_get_current_chunk]

set log_file [cs_log_file]
set files [glob "$log_file.*"]

set avail [list]

for { set i 0 } { $i < 24*7 } { incr i } {
    if { [lsearch $files "$log_file.[cs_format_log_file_date $time]"] >= 0 } {
        lappend avail $time
    }
    incr time -3600
}
ns_write $avail
Posted by Nick Carroll on
Hi Jeff,

Thanks for that posting.  Would you by any chance have a copy of the following file as well?

cs-data.tcl

I just noticed it was missing, along with cs-one-day.tcl.  However, I was able to get cs-one-day.tcl from Roberto.

Thanks,
Nick.

Posted by Jeff Davis on
For completeness' sake, here are the other two:
cs-data.tcl
# $Id: cs-data.tcl,v 3.1 2000/04/10 15:17:24 jsalz Exp $
# File:        cs-data.tcl
# Date:        6 Apr 2000
# Author:      jsalz@mit.edu
# Description: Returns clickstreaming data for a particular chunk.
# Inputs:      chunk_start

set_the_usual_form_variables

proc_doc cs_data_write_chunk { time chunk_start } { Write all lines in the chunk beginning with $chunk_start from the log file for $time. } {
    set filename [cs_log_file $time]
    if { ![file exists $filename] } {
	return
    }
    set file [open $filename "r"]

    set out ""
    while { [gets $file line] >= 0 } {
	set time [string range $line 0 [expr { [string first "\t" $line] - 1 }]]
	if { $time >= $chunk_start && $time < $chunk_start + 3600 } {
	    append out $line
	    append out "
"
	    if { [string length $out] >= 8192 } {
		ns_write $out
		set out ""
	    }
	}
    }
    ns_write $out

    close $file
}

ReturnHeaders "text/plain"

# Note that we need to look in two log files - the one we'd expect, plus the next one (in case
# any requests weren't logged until the next chunk, i.e., the request was issued at 10:59:59 but
# the request wasn't written to the log until 11:00:01).
foreach time [list $chunk_start [expr { $chunk_start + 3600 }]] {
    cs_data_write_chunk $time $chunk_start
}
cs-one-day.tcl

# /SYSTEM/cs-one-day.tcl
#
# returns compressed log file
#
# richardl@arsdigita.com, jsalz@mit.edu 5 June 2000
#
# $Id: cs-one-day.tcl,v 3.4 2000/06/26 17:21:02 iwashima Exp $

set host_ip [ns_conn peeraddr]
set hosts_approved [ad_parameter_all_values_as_list Approved_ip cs/logging]
if {[lsearch -exact $hosts_approved $host_ip] == -1} {
    ns_log "Notice" "$host_ip not in $hosts_approved ... not approved IP to access clickstreaming log"
    # The person isn't allowed in here
    ad_return_complaint 1 "Your IP $host_ip is not on the allowed list of IPs to see this page."
    return
}

ad_page_variables {
    start_time
}

ns_log Notice "starting downloads...[cs_archive_file $start_time]"

ns_returnfile 200 application/x-gzip-compressed [cs_archive_file $start_time]


Posted by Nick Carroll on
Is there a file in the clickstream data warehouse that parses a web server log file and creates another, specially formatted file that can be used to load data into the database with the following SQL*Loader control file?
OPTIONS (DIRECT=true)
LOAD DATA
INFILE *
TRUNCATE
INTO TABLE cs_event_log
FIELDS TERMINATED BY X'09'
(event_id, event_time, end_time, date_id, url, instance_id, user_ip, user_id, query,
    content_type, bytes, session_id, browser_id, user_agent_string, accept_language, referring_url,
    method, status, secure_p)
BEGINDATA
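Something along these lines is what I'm imagining, with parse_event_line standing in for whatever the module uses to split a raw log line into those columns (the proc and the file variables here are my invention, not from the module):

set in  [open $access_log r]
set out [open $loader_file w]
set event_id 0
while {[gets $in line] >= 0} {
    # parse_event_line is hypothetical: it would split one raw log line
    # into the remaining columns, in the order the control file expects.
    set fields [parse_event_line $line]
    incr event_id
    puts $out [join [concat [list $event_id] $fields] "\t"]
}
close $in
close $out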