Forum OpenACS Q&A: Data migration

Posted by Andrew Grumet on
As part of the Sloan effort we've begun writing scripts to move data
from an ACES/ACS3x-backed site to dotLRN/OpenACS4x.

I remember a thread from months ago where we talked about this, but
I don't remember whether a plan was established to collect and publish
migration resources.

In any case, I'd love to see all this effort get collected and
shared if it isn't already.

Your thoughts?

2: Response to Data migration (response to 1)
Posted by Jade Rubick on
Our company will eventually need to do this, so I might be able to
help out a little. Unfortunately, we're not planning on doing it for a
while, and I'm still not up to speed enough on OpenACS 4x to be of
much help.

The portions we're most interested in are the Intranet, ticket
tracker, and calendar.

I think our company will eventually need to move to OpenACS,
but that means I'll need to do a lot of porting for our custom code.

It's unfortunate that this upgrade is so difficult -- I really wish the
Intranet code I'm developing could be used by more people, and
improved upon collaboratively. We're starting to do some major
changes in the Intranet code soon, which should result in some
major improvements in usability, I think.

What modules are you concentrating on? I could probably help
out with testing or documentation.

We didn't agree upon a place to collect and publish resources. I
think the thread a while back collected some of it -- but it might
be that it was just a reference to the work you were doing! 😊

3: Response to Data migration (response to 1)
Posted by Andrew Grumet on
Here is the older thread I was referring to: https://openacs.org/bboard/q-and-a-fetch-msg.tcl?msg_id=000357&topic_id=OpenACS&topic=11.

Given the amount of customization that takes place in a typical ACS3x site, I think there are limits to the amount of generality one really wants to strive for. Perhaps the best model is to be more informal, collecting code that is targeted at certain modules, with hints from the author about generality and extension. Though we'll try as best we can to prevent wasted effort and duplication, we'll mostly be interested in collecting useful code. Sound reasonable (or did I just set myself up for major flamage)?

Also, I forgot to mention: on the Sloan side, Tracy Adams and I are working on this. Either of us can dedicate some time to the sharing effort.

4: Response to Data migration (response to 1)
Posted by Tracy Adams on
To answer Jade's question, this is our evolving plan:
  • Overall, we are converting the current installation at Sloan (an ACS 3.x-based ACES) to the new OpenACS 4.5-based .LRN.
  • We tried the straight PL/SQL route for a day, as this had worked well for the WorldBank conversion. We found that there were enough data model and application-level changes that porting the data directly from one schema to another would be challenging. We concluded that a better path was to write Tcl scripts and take advantage of the Tcl procedures to handle the more complex steps, things like creating dotLRN communities and classes, and we are now staying with that approach in general (a rough sketch follows this list).
  • We're currently doing the basics of ACS and dotLRN: users, dotLRN users, dotLRN communities, dotLRN classes, ...
  • Also on our list: file storage, survey (once written), faq, bboard, calendar, news, spam.
  • On our list depending on difficulty: portals.
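To give a flavor of the approach, here is a minimal sketch. It assumes a separate "acs3x" database pool pointing at the old installation, and a community-creation proc along the lines of dotlrn_community::new; check your dotLRN sources for the exact proc name and switches:

  # Sketch only: read old ACES groups through a dedicated handle and
  # create dotLRN communities via the Tcl API. The dotlrn_community::new
  # name and its switches are assumptions; verify before use.
  set db [ns_db gethandle acs3x]
  set selection [ns_db select $db "select group_name from user_groups"]

  while { [ns_db getrow $db $selection] } {
      set_variables_after_query
      dotlrn_community::new \
          -community_type dotlrn_community \
          -pretty_name $group_name \
          -description $group_name
  }
  ns_db releasehandle $db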
5: Response to Data migration (response to 1)
Posted by Don Baccus on
There are scripts available for migrating OpenACS 3 bboard data to OpenACS 4 (thanks to Musea) but ... ACES bboards are unrelated to the [Open]ACS 3x standard bboard.

So those scripts won't help.

Just thought I'd point this out before someone not realizing this misinforms Tracy that such scripts are available ...

I do think the Tcl script approach is best.  The bboard migration scripts work very well.

6: Response to Data migration (response to 1)
Posted by Michael Bryzek on
Although I'm not currently using OpenACS, I am in the middle of a data migration from a fully-custom legacy app to an ACS data model (in this case ACS Java...).

We approached data migration by first looking at existing tools. If you're lucky enough to do a straight column-to-column migration, there are plenty of tools out there to help you. One example for Oracle is the Oracle Migration Workbench. Unfortunately, I doubt these kinds of tools will be useful when migrating to the object-oriented data models we use today.

We divided our migration work into two steps:

  1. Export data from the legacy application to an intermediate XML format.
  2. Import data from the XML document using the APIs available in our version of the ACS.
The use of XML as an intermediate representation of the data had some very real advantages for us. Namely,
  • We expect to do future data migrations. Next time, some of the work will already be done for us. I do not yet have practical experience showing what percentage of our migration code will be reusable, but parts of the XML-to-ACS process definitely will be.
  • Two of us could work independently on the migration process. One person needs to understand the legacy data in depth, creating an XML document to represent it. The other person never has to think about the legacy app - only about parsing the XML and calling the correct APIs.
  • The import process takes full advantage of our existing API.
The only major drawbacks we've seen with using XML so far are speed and maintenance. It is much slower to migrate data this way than in direct migrations (e.g. using Oracle's sqlldr). In our particular case, we do not have a lot of data, so we do not have to worry about the performance of the migration. The maintenance issue will come up next time we do a data migration: as APIs evolve, the importer must also be updated. This can be mitigated to some extent if your importer is written generically, but it is yet another package that needs to be maintained.

What do people think about using XML to help with data migration? It would be very useful if we had an XML Schema for OACS and a corresponding utility which knew how to instantiate objects based on the data in the XML.
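To make this concrete, the sort of document I have in mind might look something like the snippet below. The element and attribute names are invented for illustration; they are not an existing OACS schema:

  <!-- invented example: elements map to OACS object types,
       attributes map to object attributes -->
  <objects>
    <user email="janet@example.com" first-names="Janet" last-name="User"/>
    <community key="physics-101" pretty-name="Physics 101">
      <member email="janet@example.com" role="student"/>
    </community>
  </objects>

A loader would walk this tree and call the appropriate object-creation API for each element.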

7: Response to Data migration (response to 1)
Posted by John Sequeira on
Michael,

In your case, what advantages do the intermediary steps of XML serialization/deserialization give you?

The way I've done that before (and the way I suspect Musea has done it) is:

1) grab data from 3.x (TCL)
2) emit plpgsql or pl/sql loader script (TCL)
3) load loader script into 4.x data model (pgsql/sql*plus)

If I want to repeat the migration, or the 4.x API evolves, I change the emitter in TCL.
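For the record, a toy version of that emitter might look like the sketch below. The 4.x function and column names are placeholders, and a real script would also need to escape quotes in the data:

  # Toy emitter (step 2 above): read 3.x users through a handle on the
  # old database and print SQL that calls a 4.x creation function.
  # Placeholder names throughout; substitute the real 4.x calls.
  set db [ns_db gethandle acs3x]
  set selection [ns_db select $db \
      "select email, first_names, last_name from users"]

  while { [ns_db getrow $db $selection] } {
      set_variables_after_query
      puts "select acs_user__new('$email', '$first_names', '$last_name');"
  }
  ns_db releasehandle $db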

What you're proposing, I think, is:

1) grab data from 3.x (TCL)
1x) emit XML (TCL)
2x) parse XML (TCL)
2) emit plpgsql or pl/sql loader script (TCL)
3) load loader script into 4.x data model (pgsql/sql*plus)

Yes, you've improved the prospect of working on the project in parallel by presenting a data interface, but I would argue that in this simple case of 3.x->4.x data migration it won't be worth the overhead of XML. I've done this kind of thing a lot, both using XML (when dictated by an XML-stored data source), and more typically the way I've described above. My hunch is that XML pays off only when complexity grows (a lot), when a different team of people or people with a different toolset will do the loading, or when it's already a project requirement and the overhead is marginal.

8: Response to Data migration (response to 1)
Posted by Michael Bryzek on
John - Your perspective makes a lot of sense to me. Let me add a few more details about our particular migration. Most importantly, we were migrating from a flat, non-relational schema that we had never seen before the migration started. There were several distinct problems we had to solve:
  1. Transfer data from the original database to somewhere we could import it into our new database. In our case we were migrating from a SQL Server database, but the source doesn't really matter.
  2. Completely understand the old schema so that we could actually figure out what objects we would need in the ACS schema. This probably took 50% of the total time to do the migration.
  3. Export the old data and generate some kind of import script.
  4. Import the data.
For us, using XML as an intermediate representation allowed us to entirely focus on transforming the original data into a hierarchy that matches that in the ACS. We could do this without worrying about the actual import process (e.g. I could create a user XML tag without worrying about additional methods to call for password encryption).

Our actual process looked like:

  1. Grab data from SQL server
  2. Massage data and map to appropriate ACS Objects
  3. Emit XML
  4. Parse XML
  5. Load data
In our case, the last 2 steps are now done for us and will likely require only minor modifications the next time we migrate data, even though our APIs and data model are constantly changing.

You make a good point that by the time we've organized our data into a logical XML structure, we could have emitted an appropriate script to load our data. In our case, the code that parses the XML and instantiates the actual objects is written fairly generically. We can add attributes to our object types, create new types of relations between our objects, and in some cases add entirely new object types without having to modify the actual code that imports the data. This is a big win for us as we expect to do many more migrations from disparate sources.
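Our implementation is ACS Java, but in Tcl terms the shape of the idea is roughly the following; all names here are invented:

  # Generic dispatch: a map from XML tag to importer proc. Attributes
  # pass through as a name/value list, so adding a new attribute to an
  # object type needs no change here. Invented names throughout.
  array set importer {
      user      import_user
      community import_community
  }

  proc import_node {tag attributes} {
      global importer
      if { ![info exists importer($tag)] } {
          error "no importer registered for <$tag>"
      }
      $importer($tag) $attributes
  }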

Quoting John: "I would argue in this simple case of 3.x->4.x data migration it won't be worth the overhead of XML."
I would agree with you. I have never done a 3.x -> 4.x migration, but I have done migrations with other versions of ACS. These data models share enough similarities that an additional data interface would have little benefit for a one-time migration.

One other angle to consider is the frequency with which you intend to migrate data. A one-time migration has different requirements from a weekly/monthly/yearly migration. My limited experience so far has shown that it is easier to think about generating an XML document than the correct sequence of actual API calls. If you are going to be migrating data more than once, from different sources, an intermediate data structure may make more sense.

And this is what led me to post initially. With OACS, we should expect lots of people to do lots of data migrations as they adopt OACS. It would be easier for users to migrate data to OACS if the process was well-defined and part of it was even implemented (the XML Reader and loader). This would allow users familiar with an old schema to migrate correctly to OACS without understanding every API in detail.

9: Response to Data migration (response to 1)
Posted by John Sequeira on
Michael,

I didn't catch the opportunities for reuse from your first post, but I think I got it. It does sound like a no-brainer that giving people a DTD to publish to, and having a single import routine for that DTD, would go a long way towards easing the data-model learning-curve/migration issues for OpenACS newcomers. That's something I would definitely find helpful in approaching a new system.

FWIW, here's a project I discovered a while back but haven't looked into that appears to reflect a lot of thinking on this topic:

  • XML-DBMS

    It's a Perl port of a Java project that performs this mapping between persistence formats. I'm not suggesting it as a platform, just that it seems well documented and potentially of interest.

Posted by Andrew Grumet on
In addition to everything above, we've also contemplated an intermediate XML representation for purposes of content syndication and reuse (cf http://www.imsproject.org/). After talking to Michael B. about this, and given our time constraints, that approach is going to have to wait.

On the other hand, it seems like there could be significant value in separating the process of getting data out of the old system from the process of putting it into the new system. I've been pondering this a bit and think the following design pattern could be useful later without adding much overhead to our task:

1. Create a separate database pool for the ACS3x system and call the pool "acs3x".

2. Keep this pool out of the db API's hands by adding the following to our config file:

     ns_section ns/server/${server}/acs/database
     ns_param AvailablePool pool1
     ns_param AvailablePool pool2
     ns_param AvailablePool pool3

3. Push any content-fetching code out into a proc that returns a database cursor:

     set db [ns_db gethandle acs3x]

     # should return a cursor containing the following columns:
     #    email, first_names, last_name, screen_name, password,
     #    member_state

     set selection [migration_user_data $db]

     while { [ns_db getrow $db $selection] } {
         set_variables_after_query
         # do stuff with the data
     }

     ns_db flush $db

4. Content-fetching procs look like this:

     proc migration_user_data db {

         set sql "select ... from ... where ..."

         set selection [ns_db select $db $sql]
         return $selection

     }

5. Migration teams customize the content-fetching procs only.

This is perhaps not quite as pretty as the .NET "DataSource" abstraction, because it ties you to the ns_db API when your underlying data might actually be XML. On the other hand, it uses native AOLserver API calls (== no overhead to us) and saves you from having to ram everything into memory before doing the INSERTs.

We've actually come quite a way since my last post, and have draft scripts to import: users, dotlrn-users, communities, departments, terms, subjects, classes, subgroups, faqs, news. We're in the process of refining these scripts, adding new ones, and starting to tune various queries now that we have about 70,000 acs_objects in the system.

We're happy to work with anyone who wants to help, and to hand out copies of our buggy unfinished scripts ;) This being somewhat different than designing a new toolkit, I suspect folks won't feel too locked out if we wait until we're finished (late June at the latest) to release. But speak up if you'd like this effort to be more open.

Posted by Don Baccus on
What you're describing is very similar to what the Musea scripts for moving OpenACS 3 content to OpenACS 4 do. You may want to take a look at them. They may be in new-file-storage; if not, e-mail Talli or someone else at Musea directly to get a copy.

It's actually quite easy ... it is only hard in the sense that deleting content without "on delete cascade" is hard: you have to figure out the dependencies and migrate everything in the right order.
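Concretely, the top level of such a script is just the per-module steps run in dependency order, parents before children (the proc names here are hypothetical):

  # Referential integrity forces parents before children, so the
  # order of this list is the whole trick.
  foreach step {
      migrate_users
      migrate_communities
      migrate_memberships
      migrate_bboard_forums
      migrate_bboard_messages
  } {
      ns_log Notice "migration: running $step"
      $step
  }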

Posted by Andrew Grumet on
Great, thanks. I found the oacs-bboard script ... yes, looks good. We'll probably add a few twists (i.e. the pushing-out-into-procs part) but I think we have a winner in terms of approach.

To keep things simple I think we'll post the results to new-file-storage as well.

Posted by Hamilton Chua on
Hello,

I am in the process of migrating a project, which we started with OACS Alpha, to the Beta code.

I am uncertain as to how to proceed with regard to migrating users. In particular, I am having problems with the password and salt.

If I were migrating from OACS 3.x, it would be a simple matter of getting the password from the OACS 3 db and using the add-user API on the OACS 4 db.

However, I am migrating from OACS 4 alpha to OACS 4 beta, where the password is already hashed. I don't think inserting the hashed password and the salt into the new db will do it. Will it?

Thanks

Posted by Andrew Grumet on
I would give it a whirl if you haven't already. I think it will work fine. Rationale: the key piece of relevant code is in ad_check_password (packages/acs-tcl/tcl/security-procs.tcl):

  if { [string compare $password [ns_sha1 "$password_from_form$salt"]] } {
      return 0
  }

The ns_sha1 function is deterministic (its output is always the same for a given input), so if you preserve the salt and the output you should be all set.
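Concretely, after creating the user you could copy the stored values across with something like the sketch below. I'm assuming the relevant columns on the users table are named password and salt; double-check against your schema:

  # Sketch: copy the old hash and salt verbatim into the new users
  # table. Assumes "password" and "salt" columns; verify first.
  db_dml preserve_password {
      update users
      set    password = :old_hashed_password,
             salt     = :old_salt
      where  user_id  = :new_user_id
  }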
15: Re: Data migration (response to 1)
Posted by Tracy Adams on
I went to send the scripts that Sloan used to migrate from ACES to dotLRN. I realized that though I had uploaded them to this site, I couldn't navigate to them.

The files have been deleted from the file-storage.

16: Re: Data migration (response to 1)
Posted by Venkatesh Goteti on
Tilmann worked on migrating AIESEC.net from ACES to dotLRN as well, and he has uploaded some migration scripts to the OpenACS site.

His scripts add a framework for running interdependent tasks in parallel, and they are started automatically, without web interaction.
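Roughly, the idea (this is a sketch, not Tilmann's actual code) is to declare each task's prerequisites and start a task as soon as they have all finished:

  # Sketch: declare prerequisites per task, then repeatedly start
  # every task whose prerequisites are done. Assumes the dependency
  # graph is acyclic; task procs are hypothetical placeholders.
  array set depends_on {
      migrate_users       {}
      migrate_communities {migrate_users}
      migrate_memberships {migrate_users migrate_communities}
  }

  set done [list]
  while { [array size depends_on] > 0 } {
      foreach task [array names depends_on] {
          set ready 1
          foreach dep $depends_on($task) {
              if { [lsearch -exact $done $dep] == -1 } { set ready 0 }
          }
          if { $ready } {
              $task   ;# run in-line; a parallel version could schedule this in a background thread
              lappend done $task
              unset depends_on($task)
          }
      }
  }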