Forum OpenACS Development: OpenACS development history via Git

Posted by Tom Jackson on
I just imported OpenACS into a git repository. It is available at: http://rmadilo.com/gitweb/gitweb.perl?p=openacs.git;a=summary

The import took about 12 hours. The .git directory is about 134 MB and is available as a tar.gz file at:
http://rmadilo.com/projects/openacs.tgz

Browsing commits is very easy, and you can grab snapshots from any point. The snapshots nicely avoid including the CVS directories.

Posted by Malte Sussdorff on
Tom, this is interesting work. Could you give some ideas on Git and what benefits it has over subversion in the context of OpenACS and the OpenACS developers and companies?

Some references I found while digging into this:

http://en.wikipedia.org/wiki/Git_(software)
http://video.google.com/videoplay?docid=-3999952944619245780

Posted by Tom Jackson on
From what I have read/watched subversion adds the concept of an atomic commit. This is an advance over cvs. However, git adds in the really difficult part: easy merging.

With subversion, branching and tagging are essentially the same operation. Of course tags don't change: a branch is a copy of something else which can then vary independently, while a tag is a named, fixed snapshot.

So you might reason that it is easy to make a copy or snapshot (backup). What is difficult is to bring different lines of development back together. This difficulty is at the root of all other problems of shared development. Essentially any branch, even your local, private branch, is a future source of pain if you want to merge it back in to the main line of development.

The fact that Git makes the merging process a 'no brainer' is what is interesting to me. The ability to easily merge back in to a main line of development is critical to supporting the typical feature of open source development: individuals hacking away on their own copy. But without the ability to easily merge in your work, what happens is that developers either don't commit their work very often, or they commit to a central repository, which usually means that the HEAD is of unknown stability.

I'm not a guru or anything, but this concept has always seriously impacted my contributions. I don't want to commit code until it has been accepted, but uncommitted code is in serious jeopardy of being lost. Since commits should have a logical scope, any delay in making a commit for a particular logical change ends up obscuring the purpose of a commit, and making it more difficult to document the commit.

Anyway, the choices with cvs/svn are to commit often and risk instability, or to commit late and risk not documenting the purpose of the changes, and in addition, late commits have an even higher potential for conflicts.

The benefits of easy merging are first and foremost: distributed repositories. The impact of this is really difficult to explain. If you want to try out some new feature, you create a branch. If it works, you merge it back in. But the really weird thing with Git is that you do all of this in the same place. With Git, the toplevel directory has a .git directory. So for openacs-4, there is an openacs-4/.git directory which contains all the information for everything in openacs-4. If you want to change from one branch to another branch, you do:

$ git checkout otherbranch

Git changes all the files, nearly instantly, from one branch to another. In effect, everything about every version of the project is in the .git directory. The files below the parent of the .git directory are mere conveniences that aid editing.
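The branch switch described above can be tried in a throwaway repository. This is only a minimal sketch: the directory, file name, and identity are placeholders, not anything from OpenACS, and the first branch may be called master or main depending on local git settings.

```shell
#!/bin/sh
# Sketch: one working directory, two branches, all history kept under .git.
set -eu

work=$(mktemp -d)                       # throwaway directory, not a real checkout
git init -q "$work/demo"
cd "$work/demo"
git config user.name  "Demo User"       # placeholder identity for the commits
git config user.email "demo@example.invalid"

echo "stable code" > feature.txt
git add feature.txt
git commit -q -m "Initial commit"
mainbranch=$(git symbolic-ref --short HEAD)   # name of the branch git created first

git checkout -q -b otherbranch          # create a branch and switch to it
echo "experimental code" > feature.txt
git commit -q -a -m "Try out a new feature"

git checkout -q "$mainbranch"           # working files are rewritten in place
on_main=$(cat feature.txt)
git checkout -q otherbranch
on_other=$(cat feature.txt)

echo "$mainbranch: $on_main"
echo "otherbranch: $on_other"
```

Both versions of feature.txt live in the single .git directory; the checkout only swaps which one is visible for editing.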

How can this benefit OpenACS?

First I have to caution that Git is a new tool, and is not very well documented. But just because cvs is old and well documented doesn't mean that you can't totally screw up. Actually, with cvs, this seems to be the expected outcome of any operation.

OpenACS can benefit only to the extent that it embraces the core concepts of Git. These are 1. distributed development, 2. hierarchical development, and 3. independent development.

Some of this has been realized in OpenACS for a long time. The separation of core and 'non-core' development is a good example of each of the three concepts.

However, Git extends the potential for each of these concepts.

Suddenly, if merging is easy, my core is just as good as the HEAD core. Why would I complain about another developer's innovations, since I don't have to use them unless I merge them in to my line of work? Distributed does not mean that everything ends up in one toplevel repository, it means that you can combine any line of development into your own. Instead of an 'intersection' we end up with a 'multiplication'.

One problem with the OpenACS cvs is that it is huge, really huge. However, git was able to import the whole thing and at least the last few commits sync up with the last commits I saw. I haven't yet figured out exactly how to keep up with the cvs commits, this is part of the lack of documentation. I need to try on some of my own cvs conversions and see what works best. To do the conversion, you first do a cvs checkout. Then git-cvsimport uses another independent tool cvsps to create a series of patches and commits for the entire life of the project. This means checking out each file a number of times, but it gets all the commit messages and git produces a compact repository that has many advantages over cvs. Even if OpenACS never switches to git or svn, an occasional conversion of this sort would be very useful for developers. Above I reference a single file which contains this conversion, so it only needs to be done by one person, and then they can distribute it as a tar file.
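As a rough sketch of the conversion step, the command line might look like the following. The CVS root, module name, and target directory are assumptions for illustration and were not given in this thread. This form passes the CVS root and module explicitly; git cvsimport can also read them from an existing checkout's CVS/Root and CVS/Repository files. The script only prints the command unless DRY_RUN=0 is set, because a real import needs cvs and cvsps installed and can run for hours.

```shell
#!/bin/sh
# Sketch of the CVS-to-git conversion. Nothing is imported unless DRY_RUN=0.
set -eu

# Assumptions, not taken from the thread: adjust all three for your setup.
CVSROOT_URL=":pserver:anonymous@cvs.openacs.org:/cvsroot"   # assumed anonymous CVS root
MODULE="openacs-4"                                          # CVS module to import
TARGET="openacs-4.git-import"                               # hypothetical output directory

# -v: verbose, -d: CVS root, -C: directory that will hold the git repository.
cmd="git cvsimport -v -d $CVSROOT_URL -C $TARGET $MODULE"

if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $cmd"
else
    # Requires the cvs and cvsps programs to be installed.
    $cmd
fi
```

The git-cvsimport documentation describes the command as incremental when pointed at an existing target directory, which is the basis for re-running it to pick up new commits.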

If OpenACS were to switch to git, it would probably be best to divide up the repository into the core, utility, and end user packages. It might seem that this only increases the work for the maintainer, but it turns out not to be so. You can quickly create a configuration to combine all of these things automatically. Instead of having to hunt around and remember how you do something, you just look up your shortcut command. For instance, my external copy of openacs-4 is pushed into place using ssh. Somehow it creates a tiny set of deltas and sends those. The command to do it is:

$ git push openacs-rmadilo

The configuration setting is also pretty simple:

[remote "openacs-rmadilo"]
    url = rmadilo.com:/web/rmadilo/servers/rmadilo/pages/projects/openacs.git
    push = master

You can push any number of heads in one command, or, even with this configuration, you can name a subset of heads on the git-push command line.
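The remote setup and the two ways of pushing can be reproduced locally. In this sketch a bare repository on disk stands in for the ssh-hosted copy, and the remote name openacs-demo, the branch experiment, and the file are all made up for the example.

```shell
#!/bin/sh
# Sketch: a named remote with a default push head, plus pushing an explicit head.
set -eu

base=$(mktemp -d)
git init -q --bare "$base/public.git"          # stands in for the ssh-hosted repository

git init -q "$base/work"
cd "$base/work"
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
git config user.name  "Demo User"              # placeholder identity
git config user.email "demo@example.invalid"
echo one > file.txt
git add file.txt
git commit -q -m "First commit"
git branch experiment                          # a second head, to push selectively

# Creates a [remote "openacs-demo"] section in .git/config (url plus a fetch line),
# then adds the push setting like the one in the configuration above.
git remote add openacs-demo "$base/public.git"
git config remote.openacs-demo.push master

git push -q openacs-demo                       # no head named: uses the push setting
after_default=$(git ls-remote --heads openacs-demo | wc -l)

git push -q openacs-demo experiment            # head named explicitly on the command line
after_explicit=$(git ls-remote --heads openacs-demo | wc -l)

echo "heads on remote after default push:  $after_default"
echo "heads on remote after explicit push: $after_explicit"
```

Only master arrives with the plain push; the experiment head shows up on the remote after it is named explicitly.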

Once you have a git repository you can search it! You can search the source, the logs, everything. Probably this works well at a command line, but the gitweb cgi module (a single Perl script) works very well. One reason I became somewhat sour on svn was that to get a web interface you have to install the Apache Runtime. This appears to be nearly impossible to install on older versions of Linux, and is a huge monster. Nobody screws up stuff like Apache, they really have a lock on this. But git just uses a Perl script. I installed a modern version of Perl and set up nscgi.
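For the command-line side of searching, a small sketch follows. It uses a scratch repository, and the Tcl proc name ad_demo_hello is invented for the example. It shows three different searches: tracked source, commit messages, and commits that added or removed a string.

```shell
#!/bin/sh
# Sketch: searching source and history from the command line.
set -eu

base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git config user.name  "Demo User"              # placeholder identity
git config user.email "demo@example.invalid"

# Hypothetical Tcl proc, only here to give the searches something to find.
printf 'proc ad_demo_hello {} {\n    return "hello"\n}\n' > demo-procs.tcl
git add demo-procs.tcl
git commit -q -m "Add ad_demo_hello proc"

echo "# unrelated notes" > notes.txt
git add notes.txt
git commit -q -m "Add notes file"

src_hits=$(git grep -n "ad_demo_hello")                  # search tracked files
log_hits=$(git log --format=%s --grep="notes")           # search commit messages
pickaxe=$(git log --format=%s -S"ad_demo_hello")         # commits that added/removed the string

echo "$src_hits"
echo "$log_hits"
echo "$pickaxe"
```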

One very helpful feature is that you can grab a snapshot of anything, HEAD, or just a module. These are not stored up files, they are generated when you ask for them. So you won't be telling your users: we haven't released in a while, just fire up cvs and download xyz... You have to do the same thing with svn. To make a release, I have to do a svn copy, download the copy, tgz it up and then upload it.
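Outside gitweb, the same kind of on-demand snapshot can be produced with git archive. The sketch below builds a placeholder tree whose directory names only mimic the OpenACS package layout; the prefixes and output file names are arbitrary.

```shell
#!/bin/sh
# Sketch: generate release tarballs on demand with git archive.
set -eu

base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git config user.name  "Demo User"              # placeholder identity
git config user.email "demo@example.invalid"

# Placeholder tree; the contents are not real OpenACS code.
mkdir -p packages/acs-kernel packages/forums
echo "kernel" > packages/acs-kernel/README
echo "forums" > packages/forums/README
git add packages
git commit -q -m "Add two placeholder packages"

# Whole tree at HEAD, unpacking into openacs-snapshot/.
git archive --format=tar --prefix=openacs-snapshot/ HEAD | gzip > "$base/full.tar.gz"

# Only one module: limit the archive to a path.
git archive --format=tar --prefix=forums-snapshot/ HEAD packages/forums | gzip > "$base/forums.tar.gz"

full_list=$(tar -tzf "$base/full.tar.gz")
forums_list=$(tar -tzf "$base/forums.tar.gz")
echo "$forums_list"
```

Nothing is stored ahead of time; each tarball is generated from the commit named on the command line, so any tag or commit id can be used in place of HEAD.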

Posted by Malte Sussdorff on
Thanks a lot for this explanation. It basically seems to fulfill all my needs:

1) I can use OpenACS Core as the main directory
2) I can use additional packages easily as they come by
3) I can create my own "cognovis" distribution which is easy to merge back to OpenACS (at the moment it is a little bit of a pain to merge between subversion and CVS, but it works)
4) I can provide certain code not officially as a package in OpenACS, but as something that I "control" on our servers. Very useful for packages that mainly I use and others might be interested to test, but should not be officially in OpenACS yet
5) I could use code developed by someone I trust who does not want to keep his code completely in OpenACS CVS
6) In the EU project I am coordinating, each partner could have their own repository for their software (which runs on different platforms) at their office and we could merge it once they punch a hole through their firewall 😊.

What I am struggling with most at the moment in subversion is changes made to client installations. I usually use cognovis-core as the subversion checkout (pretty much a mix of HEAD and 5.3) on which to base the client code. Then I make a change at the client. Once it works at the client, I need to merge it back to cognovis-core, then commit it to all clients, and then commit it back to OpenACS. Even with automated scripts it is a painful experience. Is this merging going to be easier in Git? I assume I could just copy the Git repository for cognovis-core and use it in the client's place and, if the code works, merge it back either to OpenACS directly or just to cognovis-core (e.g. if it is a change which needs a TIP).

My assumption is that my scenario is not so different from that of any other OpenACS company, but correct me if I'm wrong. And note that I am not thinking about changing our Tipped decision to go with subversion, but trying to figure out what best to use for our (cognovis) scenario, hoping Tom will find a way to merge back to CVS or SVN for that matter 😊.

Posted by Tom Jackson on
The biggest fly in the ointment is that most of the documentation and tutorials are directed at starting a new project with git. Everything is clean and easy. Pulling in a large cvs project is not documented, and I have no idea how to check that everything is done correctly. Maybe the cvsps tool has more documentation, since this is what is doing the difficult task, and is applicable to more than just git, maybe it would help the conversion to subversion?

My typical method of development with OpenACS is to download the most current stable version and then develop on that. But I seldom make any core changes. The problem for me is not pushing changes back, but merging in new updates from OpenACS, but git or svn may not make things any better because this isn't just source code, but also datamodel and data. And there is the potential for lots of one-off edits to local files that you don't normally see in code development.

Pushing and pulling into a production environment probably shouldn't extend to any shared repository, this is the point of git: everyone shares what they want and pulls from these shares what they want (into their own repository). At some point, an official version can come out, but if your specialized code isn't in the official version, it doesn't mean you are cut off. So if you develop a feature which only you find useful, maybe it stays in your local branch. Then if someone else picks it up, maybe generalizes it, it might work its way up into the main branch. If the point is to maintain a monolithic repository, there is no need to move to git, it isn't really directed at shared repositories.

So what you might do as a developer with multiple clients is to maintain a development repository with branches for each client, or a private repository for each client, but the private repositories would still reference your local repository, which would eventually reference the openacs official version/branch/tag.
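One possible shape for that layout, sketched with local directories: upstream stands in for the official OpenACS repository, vendor for the developer's own repository, and client-a is a hypothetical per-client branch. All names and file contents are invented for the example.

```shell
#!/bin/sh
# Sketch: a vendor repository with a per-client branch that follows upstream.
set -eu

base=$(mktemp -d)
ident() {                                      # placeholder identity for commits
    git config user.name  "Demo User"
    git config user.email "demo@example.invalid"
}

# "upstream" plays the role of the official repository.
git init -q "$base/upstream"
cd "$base/upstream"
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is named master
ident
echo "core v1" > core.txt
git add core.txt
git commit -q -m "Core v1"

# The developer's own repository, with one branch per client.
git clone -q "$base/upstream" "$base/vendor"
cd "$base/vendor"
ident
git checkout -q -b client-a
echo "client A tweak" > client-a.txt
git add client-a.txt
git commit -q -m "Client A customization"

# Upstream moves on independently.
cd "$base/upstream"
echo "core v2" > core.txt
git commit -q -a -m "Core v2"

# Pull the new upstream work into the client branch.
cd "$base/vendor"
git fetch -q origin
git merge -q -m "Merge upstream into client-a" origin/master

cat core.txt
ls
```

The client branch ends up with both the upstream change and its own customization. A change that touches the same lines on both sides would still need a manual conflict resolution.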

As far as merging stuff from git back to cvs, I don't know if that is possible. What I was interested in doing is pulling down new cvs commits. I have heard that developers are pulling from svn, merging and then pushing back into svn, so maybe you can do the same with cvs.

Posted by Gustaf Neumann on
Malte wrote: "Thanks a lot for this explanation. It basically seems to fulfill all my needs:"

i started using git in spring and i am very happy with it. currently, i have 7 git repositories for different projects, and everything develops neatly. Git reminds me of TeX/LaTeX: the simple commands do most of what you need, and for more complex situations there are a lot of clever gadgets available; many of those i have not needed so far. For me the biggest asset is the ease of creating branches and its "sane" behavior (no troubles with renaming, moving files etc.). Git makes a lot of sense for me even without having a shared repository in mind, simply to keep track of the multiple versions of software i have in use or development. git is also quite fast.

Git can be used like cvs/svn, but it offers a huge array of ways to collaborate. This freedom might be the biggest problem with git. One needs more coordination and clear policies about naming branches, tags etc. within a project, since everybody can do whatever they want in their repositories. In reality, this seems to work quite well (see e.g. the Linux kernel project).

For the openacs kernel i am not sure whether it is a good idea to leave the centralized development model. At least for packages, it is a great way to share stuff which is for some reason not in the central repository. Also for OpenACS companies, seeing every customer installation as a separate branch/repository helps to factor out commonalities and make maintenance easier.

Tom wrote: "I haven't yet figured out exactly how to keep up with the cvs commits, this is part of the lack of documentation."

check out for the full cycle: http://issaris.blogspot.com/2005/11/cvs-to-git-and-back.html
Posted by Tom Jackson on
Good comments. I would tend to agree that OpenACS should not ditch the centralized development model, but what exactly that means would be important to clear up. Obviously the kernel development ends up with releases, so in the end there are the same decision-making processes going on. This is critical for any project, and OpenACS has such a structure already. The point of using git would be to allow developers at least two things: first, to commit (and thus comment and protect) their work as they go along, and second, to more easily keep up with the central development. Right now, sharing code means committing directly to the HEAD, then pulling stuff back in, but that could easily mess up what you are doing. Overall, it looks like git makes the flow of code, up, down, across much easier, which in the end will probably have the effect of encouraging more sharing, experimentation and testing. At the moment it is impractical to check out a new feature without making a new checkout of everything in some scratch directory.

But this is still a new tool, it will take lots of interest and testing for the community to get comfortable with it.

I also looked at the suggested Bazaar. Apparently this is limited to around 10k files, which is probably too low for OpenACS. I haven't done a count, but there were over 200k objects in my conversion.

The author of the articles was also somewhat confused about Git. He thinks that Git should support renaming files, even though this is completely unnecessary in Git. Git tracks content, and even sub-content. If you move a file somewhere else, or have two copies of a gif with different names (or maybe some empty placeholder files), these are considered the same content, there is only one copy, and only the pointers change. You can also track a function which moves from one file to another file.
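The "one copy, only the pointers change" point can be checked by comparing object ids. This is a small sketch with made-up file names: two files with identical bytes resolve to the same blob, and moving a file leaves its blob id unchanged.

```shell
#!/bin/sh
# Sketch: identical content is stored once, and a move keeps the same blob.
set -eu

base=$(mktemp -d)
git init -q "$base/repo"
cd "$base/repo"
git config user.name  "Demo User"              # placeholder identity
git config user.email "demo@example.invalid"

echo "pretend these are image bytes" > logo.gif
cp logo.gif spacer.gif                         # same content under a second name
git add logo.gif spacer.gif
git commit -q -m "Add two identical files"

blob_logo=$(git rev-parse HEAD:logo.gif)       # object id of each file's content
blob_spacer=$(git rev-parse HEAD:spacer.gif)

mkdir images
git mv logo.gif images/logo.gif                # move the file elsewhere
git commit -q -m "Move logo into images/"
blob_moved=$(git rev-parse HEAD:images/logo.gif)

echo "logo:   $blob_logo"
echo "spacer: $blob_spacer"
echo "moved:  $blob_moved"

# History of the moved file, followed across the rename.
git --no-pager log --follow --format=%s -- images/logo.gif
```

Git does not record the rename itself; tools such as git log --follow work it out afterwards by comparing content. For code that moves between files, git blame has a -C option that looks for lines moved or copied from other files.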

Personally, this is exactly how I develop. I might use a testing file to develop a function, then move it into the correct file days or months later, long after I have forgotten all the development issues and problems. With svn, I leave the test file untracked and uncommitted.

I also often split code into separate files if they get too long and contain multiple namespaces. You shouldn't be thinking of this stuff in advance of the need to do it, so it would be a big help to capture this development history automatically.

Gustaf, I'm going to read the article about cvs-to-git-and-back. Hopefully you can share your typical workflow for using git with OpenACS, or other projects.

Posted by Tom Jackson on
Tom wrote

I haven't yet figured out exactly how to keep up with the cvs commits, this is part of the lack of documentation.

check out for the full cycle: http://issaris.blogspot.com/2005/11/cvs-to-git-and-back.html

This looks great, thanks for the reference. One question is about the huge OpenACS cvsroot. A comment in the above suggests using rsync (for sourceforge) to copy everything in cvsroot. When I did my cvs-to-git it took about 12 hours and I was mostly concerned with placing stress on the OpenACS servers. Is there any possibility of making a similar service available, or is the stress not too bad? I assume that rsync is used so that things don't change under your feet. If there is any interest I could rsync to one of my well connected hosts and folks could rsync from there. Maybe there are better ways to do this, suggestions?

Mark Shuttleworth (founder of Ubuntu) wrote four short articles about version control tools back in June (2007). Among other things, he argues that you are much better off starting with Bazaar, and later switching to Git if for some reason that turns out to be necessary, rather than vice versa.
Posted by Tom Jackson on
I've moved my git repository to http://junom.com/gitweb/gitweb.perl but you can still access it at the previous location for now.

I finally figured out how to get the new commits from cvs, so about once a day I run git-cvsimport and then run git-push.