Git foundations

“Foundations” is a little joke on a religious theme; our page borrows heavily from the git parable - so - why not a foundation myth?

On the first day - the repository and the working tree

I’m young and the world is fresh and I have lots of time.

I have so much time, that I decide that I want to write a book. The project I am working on is a modest history and explanation of everything, with provisional title “The Book”. It has a table of contents contents.txt and a single book chapter:

.
├── chapter1.txt
└── contents.txt

As I start to write, I begin to think I should keep track of my changes. I need some sort of version control system. How hard can it be?

I start with some names. I’m going to call this set of files that I’m working on, the working tree. Because I currently lack any shame about body issues, I will call my new versioning system ahole [1].

I decide that I need to store the state of The Book at the end of each day. To do this, I make a new directory in my working tree, called .ahole. This directory will store The Book as it evolves into a world-wide best-seller. I will use the name repository for the contents of .ahole.

At the end of the day, I make a copy of all the files in the working tree, and save it in my new .ahole repository. In fact, what I will do is, make a new subdirectory in .ahole named for today’s date, then store a copy of the book files in there. On unix, that might look like this:

mkdir .ahole/year0-jan-01
mkdir .ahole/year0-jan-01/files
cp * .ahole/year0-jan-01/files

So I’ve still got the contents of The Book in the working tree, but now, in the repository, I have a copy of the files that is a snapshot of The Book as of today:

.
├── .ahole
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── chapter1.txt
└── contents.txt

On the second day - staging and commits

Today I do some more work on the book. I start work on chapter 2, and, while I’m thinking about things, I find that I am also writing some notes to myself about this character “Eve” that I have seen wandering around. I save those notes in a file called something_about_eve.txt. When I get to the end of the day, I get ready to store my work. At the moment, my directory looks like this:

.
├── .ahole
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

For some reason I can’t put my finger on, I don’t want to put something_about_eve.txt into the repository at the moment. In fact, in general, I want to choose which changes I back up into the repository, and which changes I leave for another day. In the end I come up with an idea. I’ll make a directory in .ahole called staging_area. When I start work at the beginning of the day, I copy the previous backed-up version of my files from the repository, into staging_area. These files are now ready for storing in the next snapshot:

mkdir .ahole/staging_area
cp .ahole/year0-jan-01/files/* .ahole/staging_area

I now have:

├── .ahole
│   ├─── staging_area
│   │    ├── chapter1.txt
│   │    └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

As I work, I decide what I’m going to put into tonight’s snapshot. For example, maybe I changed chapter1.txt and I think it’s ready to back up. I copy my modified version of chapter1.txt from the working tree to staging_area. I’ll call that stage-ing the file. I’ll also ‘stage’ the new chapter2.txt file (copy it to the staging area). I’m not going to stage something_about_eve.txt at the moment.

Now I’ve done that, all the stuff I want to store in the backup is ready. I just need to put it into its own backup snapshot directory. To do that, I just do something like (Unix again):

mkdir .ahole/year0-jan-02
mkdir .ahole/year0-jan-02/files
cp .ahole/staging_area/* .ahole/year0-jan-02/files

I end up with a directory that looks like this:

.
├── .ahole
│   ├── year0-jan-02
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├─── staging_area
│   │    ├── chapter2.txt
│   │    ├── chapter1.txt
│   │    └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

I decide that I’ll use the name commit for each of the daily snapshot directories (year0-jan-01 and year0-jan-02). The action of adding files to the staging area, I will call staging files for the commit. I will use the term committing for the action of making the snapshot directory, and copying the files from the staging area to the snapshot directory.
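
The two actions can be sketched in Python. This is only an illustration of the copying described above, under some assumptions: the function names ahole_stage and ahole_commit_dated and the repo argument are made up for this sketch, and the working tree is assumed to be flat (no subdirectories).

```python
import shutil
from pathlib import Path


def ahole_stage(filename, repo='.ahole'):
    """Stage a file: copy it from the working tree to the staging area."""
    staging = Path(repo) / 'staging_area'
    staging.mkdir(parents=True, exist_ok=True)
    shutil.copy(filename, staging)


def ahole_commit_dated(date_name, repo='.ahole'):
    """Commit: copy the staging area into a new snapshot directory."""
    files_dir = Path(repo) / date_name / 'files'
    # copytree creates the commit directory and its 'files' subdirectory
    shutil.copytree(Path(repo) / 'staging_area', files_dir)
    return files_dir
```

With these, the second day’s work would be ahole_stage('chapter1.txt'), ahole_stage('chapter2.txt'), then ahole_commit_dated('year0-jan-02'). Nothing ever stages something_about_eve.txt, so it stays out of the snapshot.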

On the third day - history

As a result of certain events yesterday evening, I have a new friend, Eve. She wants to help out. Of course Eve has her own computer, and I send her my .ahole directory. I thank myself for my wisdom in not adding something_about_eve.txt to the repository.

Eve checks out our book (reconstructs my working tree) with something like:

cp .ahole/year0-jan-02/files/* .

Now she’s got the book files as I committed them last night. She also copies the last commit files into the staging area, as I did:

├── .ahole
│   ├─── staging_area
│   │    ├── chapter2.txt
│   │    ├── chapter1.txt
│   │    └── contents.txt

She works hard on a new file chapter1_discussion.txt. It’s good to see she’s enjoying the work. As the afternoon turns to evening, she gets ready to save her work, so she copies chapter1_discussion.txt to .ahole/staging_area. Now she is ready to do a commit:

mkdir .ahole/year0-jan-03
mkdir .ahole/year0-jan-03/files
cp .ahole/staging_area/* .ahole/year0-jan-03/files

That is what Eve was going to do, but Eve is smart, and she immediately realizes that there is a problem. After she has done her commit, both of us will likely have a commit directory .ahole/year0-jan-03 - but they will have different contents. If she later wants to share work with me, that could get confusing.

The two of us are a little tired after all our work, and we meet for a beer. We talk about it for a while. At first we think we can just add the time to the date, because that’s likely to be unique for each of us. Then we realize that that’s going to get messy too, because, if Eve does a commit on her computer, then I do a commit on mine, and she does another one on hers, the times will say that these are all in one sequence, but in fact there are two sequences, mine, and Eve’s. We need some other way to keep track of the sequence of commits, that will work even if two of us are working independently.

In the end we decide that we are going to give the commits some unique identifier string instead of the date. We might have a problem in making sure that the unique identifier string is actually unique, but let’s assume we can solve that somehow. We’ll store the contents of the working tree in the same way as we have done up till now, in the files subdirectory, but we’ll add a new file to each commit, called info.txt, that will tell us who did the commit, and when, and, most importantly, what the previous commit was. We’ll call the previous commit the parent.

Eve was right to predict that I had made my own commit today. I’ve been happily working on chapter 3. So, before our conversation, my directory looked like this:

.
├── .ahole
│   ├── year0-jan-03
│   │   └── files
│   │       ├── chapter3.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├─── staging_area
│   │    ├── chapter3.txt
│   │    ├── chapter2.txt
│   │    ├── chapter1.txt
│   │    └── contents.txt
│   ├── year0-jan-02
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter3.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

but now we’ve worked out the new way, it looks like this:

.
├── .ahole
│   ├── 5d89f8
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter3.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├─── staging_area
│   │    ├── chapter3.txt
│   │    ├── chapter2.txt
│   │    ├── chapter1.txt
│   │    └── contents.txt
│   ├── 7ef41f
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── 6438a4
│       ├── info.txt
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter3.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

and .ahole/5d89f8/info.txt looks like this:

committer = Adam
message = Third day
date = year0-jan-03
parent = 7ef41f

Meanwhile, Eve’s directory looks like this:

.
├── .ahole
│   ├── 0a01a0
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter1_discussion.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├─── staging_area
│   │    ├── chapter1_discussion.txt
│   │    ├── chapter2.txt
│   │    ├── chapter1.txt
│   │    └── contents.txt
│   ├── 7ef41f
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── 6438a4
│       ├── info.txt
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── chapter1_discussion.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

and Eve’s .ahole/0a01a0/info.txt looks like this:

committer = Eve
message = Eve day 3
date = year0-jan-03
parent = 7ef41f

After a little thought, Eve and I realize that, when we make our new commit, we are going to have to know what the current commit is, so we can use that as the parent. When we make a new commit, we store the commit identifier in a file. We’ll call this file .ahole/HEAD, so, after my last commit above, the file .ahole/HEAD will have the contents 5d89f8. We use the contents of .ahole/HEAD to identify the last (current) commit. And of course, when we make a new commit, we can get the parent of the new commit, from the current commit in .ahole/HEAD.

So now, we have a new procedure for our commit. In outline it looks like this (now in python syntax) [2]

import os
import shutil
import uuid
from datetime import date

def ahole_commit(committer, message):
    # Make a unique identifier for this commit somehow; a random
    # 6-character hex string is only a stand-in (an assumption)
    new_id = uuid.uuid4().hex[:6]
    # Make a new directory in ahole with the new unique name
    commit_dir = '.ahole/' + new_id
    os.mkdir(commit_dir)
    # Copy the files from the staging area to the new snapshot directory
    # (copytree creates the 'files' subdirectory itself)
    shutil.copytree('.ahole/staging_area', commit_dir + '/files')
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as head_file:
        head_id = head_file.read().strip()
    # Make info with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    info_str += 'date = ' + date.today().isoformat() + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write info to info.txt file
    with open(commit_dir + '/info.txt', 'w') as info_file:
        info_file.write(info_str)
    # Set .ahole/HEAD to contain new commit id
    with open('.ahole/HEAD', 'w') as head_file:
        head_file.write(new_id)

When we want to go back to an earlier state of the book, we can do a checkout, with something like:

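import os
import shutil

def ahole_checkout(commit_id):
    commit_dir = '.ahole/' + commit_id
    # Empty the working tree, but keep the .ahole repository itself
    for name in os.listdir('.'):
        if os.path.isdir(name) and name != '.ahole':
            shutil.rmtree(name)
        elif os.path.isfile(name):
            os.remove(name)
    # Copy .ahole/<commit_id>/files into working tree (needs Python 3.8+)
    shutil.copytree(commit_dir + '/files', '.', dirs_exist_ok=True)
    # Make .ahole/HEAD contain commit_id
    with open('.ahole/HEAD', 'w') as head_file:
        head_file.write(commit_id)
    # Copy commit snapshot into staging area
    shutil.rmtree('.ahole/staging_area', ignore_errors=True)
    shutil.copytree(commit_dir + '/files', '.ahole/staging_area')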

So, when we run ahole_checkout('7ef41f') we will get the copy of the working tree corresponding to 7ef41f, and .ahole/HEAD will just contain the string 7ef41f.

In our excitement, we immediately realize that it’s really easy to see the history of the book now. We can easily fetch out info.txt from the current commit, print it, then find its parent, and fetch info.txt from the parent, print it, and so on.
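
Walking the history might look something like the sketch below. It assumes the info.txt layout shown earlier (one "key = value" pair per line) and that the very first commit has no parent entry, or an empty one. The names read_info and ahole_log are invented for this sketch.

```python
from pathlib import Path


def read_info(commit_id, repo='.ahole'):
    """Parse a commit's info.txt into a dict of its 'key = value' lines."""
    info = {}
    text = (Path(repo) / commit_id / 'info.txt').read_text()
    for line in text.splitlines():
        key, sep, value = line.partition(' = ')
        if sep:
            info[key] = value
    return info


def ahole_log(repo='.ahole'):
    """Return (commit id, info) pairs from the current commit backwards."""
    commit_id = (Path(repo) / 'HEAD').read_text().strip()
    history = []
    while commit_id:
        info = read_info(commit_id, repo)
        history.append((commit_id, info))
        # The first commit has no parent, which ends the walk
        commit_id = info.get('parent', '')
    return history
```

Printing each pair from ahole_log() gives the story of the book, newest commit first.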

Now we are tired, but happy, and we rest.

On the fourth day - references

We wake with a strange excitement. The idea of keeping a reference to the current commit in .ahole/HEAD seems like it could be more general. I talk to Eve over breakfast (she stayed in her own place of course, but she came over for work). Together we work out the concept of references. A reference is:

Reference
Something that points to a commit

So, .ahole/HEAD is a reference - to the current commit. But what if I decide that I want to give out some preliminary version of our book? Let’s say I want to release the book stored in .ahole/7ef41f/files as ‘release-0.1’. I’m going to send this out to all my friends (to be honest, I don’t have many friends just yet, but still). I want to be able to remember what version of the book I sent out. I can make a reference to this commit. I’ll call this a tag. I make a new directory in .ahole called refs, and another directory in refs, called tags, and then, in .ahole/refs/tags/release-0.1 I just put 7ef41f - a reference to the release commit. That way, if I ever need to go back to the version of the book I released, I just have to read the release-0.1 file to find the commit, and then checkout that commit.
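
Making and reading a tag takes very little code. Here is a small sketch, where the function names ahole_tag and read_tag are hypothetical:

```python
from pathlib import Path


def ahole_tag(tag_name, commit_id, repo='.ahole'):
    """Record a tag: a file in refs/tags that holds a commit id."""
    tags_dir = Path(repo) / 'refs' / 'tags'
    tags_dir.mkdir(parents=True, exist_ok=True)
    (tags_dir / tag_name).write_text(commit_id)


def read_tag(tag_name, repo='.ahole'):
    """Return the commit id that a tag points to."""
    return (Path(repo) / 'refs' / 'tags' / tag_name).read_text().strip()
```

So ahole_tag('release-0.1', '7ef41f') records the release, and read_tag('release-0.1') tells me later which commit it was.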

Wait, but, there’s a problem. If I checkout the commit in release-0.1, I will overwrite .ahole/HEAD, and I will lose track of what commit I was working on before.

Let’s store that in another reference. Let’s use the name ‘master’ for my main line of development. I store where this is, by making a new file .ahole/refs/heads/master that is a reference to the last commit. It just contains the text ‘5d89f8’. So that I know that I am working on ‘master’, I make .ahole/HEAD have the text ref: refs/heads/master. Now, when I make a new commit, I first check .ahole/HEAD; if I see ref: refs/heads/master, then first, I get the commit id in .ahole/refs/heads/master - and I use that as the parent id for the commit. When I’ve saved the new commit, I set .ahole/refs/heads/master to have the new commit id. So, I need to modify my commit procedure slightly:

import os
import shutil
import uuid
from datetime import date

def ahole_commit(committer, message):
    # *** this stuff down to the next *** line is new
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as head_file:
        head_contents = head_file.read().strip()
    # Check if this is a reference, de-reference if so
    # Also, get file into which to write the new commit id
    if head_contents.startswith('ref: '):
        head_ref = head_contents[len('ref: '):]
        head_ref_file = '.ahole/' + head_ref
        with open(head_ref_file) as ref_file:
            head_id = ref_file.read().strip()
    else:
        head_ref_file = '.ahole/HEAD'
        head_id = head_contents
    # *** the stuff below you've seen before (until *** again)
    # Make a unique identifier for this commit somehow; a random
    # 6-character hex string is only a stand-in (an assumption)
    new_id = uuid.uuid4().hex[:6]
    # Make a new directory in ahole with the new unique name
    commit_dir = '.ahole/' + new_id
    os.mkdir(commit_dir)
    # Copy the files from the staging area to the new snapshot directory
    # (copytree creates the 'files' subdirectory itself)
    shutil.copytree('.ahole/staging_area', commit_dir + '/files')
    # Make info.txt with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    info_str += 'date = ' + date.today().isoformat() + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write info to info.txt file
    with open(commit_dir + '/info.txt', 'w') as info_file:
        info_file.write(info_str)
    # Set the file that points to the current commit, to point to our commit
    # *** a little new, in that we might be writing to .ahole/HEAD, or
    # something like .ahole/refs/heads/master, depending on what .ahole/HEAD
    # contained at the top of this routine
    with open(head_ref_file, 'w') as ref_file:
        ref_file.write(new_id)

So, let’s say that I’m currently on commit ‘5d89f8’. .ahole/HEAD contains ref: refs/heads/master. .ahole/refs/heads/master contains 5d89f8. I run my commit procedure:

ahole_commit('Adam', 'Night follows day')

The commit procedure has made a new commit ‘dfbeda’; .ahole/HEAD continues to have text ref: refs/heads/master, but now .ahole/refs/heads/master contains dfbeda. In this way, we keep track of which commit we are on, by constantly updating ‘master’.

Ok - now let’s return to me checking out the released version of the book. I first get the contents of .ahole/refs/tags/release-0.1 - it’s ‘7ef41f’. Then I checkout the working tree for that version, using my nice ahole_checkout procedure:

ahole_checkout('7ef41f')

The checkout procedure will make .ahole/HEAD contain the text 7ef41f.

Now I want to go back to working on my current version of the book. That’s the set of files pointed to by .ahole/refs/heads/master. I can check the contents of .ahole/refs/heads/master - it is dfbeda. Then I get the current version with the normal checkout procedure:

ahole_checkout('dfbeda')

Finally, I’ll have to set .ahole/HEAD to be ref: refs/heads/master. All good.

Of course, I could automate this, by modifying my checkout procedure slightly:

import os
import shutil

def ahole_checkout(commit_reference):
    # If this is a reference, dereference
    if commit_reference in os.listdir('.ahole/refs/heads'):
        # it's a head reference, maybe 'master'
        head_reference = True
        fname = '.ahole/refs/heads/' + commit_reference
        with open(fname) as ref_file:
            commit_id = ref_file.read().strip()
    elif commit_reference in os.listdir('.ahole/refs/tags'):
        # it's a tag reference
        head_reference = False
        fname = '.ahole/refs/tags/' + commit_reference
        with open(fname) as ref_file:
            commit_id = ref_file.read().strip()
    else:  # Just a standard commit id
        head_reference = False
        commit_id = commit_reference
    commit_dir = '.ahole/' + commit_id
    # Empty the working tree, but keep the .ahole repository itself
    for name in os.listdir('.'):
        if os.path.isdir(name) and name != '.ahole':
            shutil.rmtree(name)
        elif os.path.isfile(name):
            os.remove(name)
    # copy .ahole/<commit_id>/files into working tree (needs Python 3.8+)
    shutil.copytree(commit_dir + '/files', '.', dirs_exist_ok=True)
    # make .ahole/HEAD point to commit id
    with open('.ahole/HEAD', 'w') as head_file:
        if head_reference:
            # Point HEAD at head reference; the head reference file
            # already contains the commit id, so it needs no rewriting
            head_file.write('ref: refs/heads/' + commit_reference)
        else:
            head_file.write(commit_id)
    # copy commit snapshot into staging area
    shutil.rmtree('.ahole/staging_area', ignore_errors=True)
    shutil.copytree(commit_dir + '/files', '.ahole/staging_area')

What then, is the difference, between a tag - like our release - and the moving target like ‘master’? The ‘tag’ is a static reference - it does not change when we do a commit and always points to the same commit. ‘master’ is a dynamic reference - in particular, it’s a head reference:

Head
A head is a reference that updates when we do a commit

My head is hurting a little, after Eve explains all this, but after a little while and a nice apple pie, I’m feeling positive about ahole.

On the fifth day - branches, merges and remotes

Yesterday was a little exhausting, so today there was some time for reflection.

As Eve and I relax with the other animals, who are all getting on very well with each other, we begin to realize that this head thing could be very useful.

For example, what if one of my very small number of friends tells me that there’s a serious conceptual error in the version of the book that I released - ‘release-0.1’? What if I want to go back and fix it - that is - do another commit on top of the released book, instead of the version of the book that I’m currently working on? I can just make a new head. I’ll do it like this:

cp .ahole/refs/tags/release-0.1 .ahole/refs/heads/working-on-0.1

Then, I look at what commit working-on-0.1 contains - of course it’s 7ef41f. I get that state of the book with my new checkout procedure:

ahole_checkout('working-on-0.1')

This changes .ahole/HEAD to be ref: refs/heads/working-on-0.1. Now, when I do a commit with ahole_commit, that will update the file .ahole/refs/heads/working-on-0.1 to have the new commit identifier. Despite the apple pie being a bit bitter last night, we’re feeling good.

As we think about this, we come to think of ‘master’ and ‘working-on-0.1’ as branches - because they can each be thought of as identifying a tree or graph of commits, which can grow. All I need, to make a new branch, is make a new head reference to a commit. For example, if I want to make a new branch starting at the current position of ‘master’, all I need is:

cp .ahole/refs/heads/master .ahole/refs/heads/my-new-branch

If I want to work on this branch, I need to check it out, with:

ahole_checkout('my-new-branch')

That will get the commit identifier in .ahole/refs/heads/my-new-branch, unpack the commit tree into the working tree, and set .ahole/HEAD to contain the text ref: refs/heads/my-new-branch.

I’ve got my branches, but Eve will have her own branches, and this will help us know where each of us is working.

That’s good, because Eve is now asking me if I can have a look at her changes, and whether I’ll include them in my version of the book. Unwisely I end up suggesting that women don’t contribute to books, and ask her why her hair isn’t covered with an as-yet not-invented headscarf. In the end we patch it up, and I agree to go back and try and put in her changes.

Luckily, despite the lack of basics like clothing, there is an excellent local network, so I can see the contents of her version of the book at /eves_computer/our_book/.ahole. She wants me to look at her ‘master’ branch. Just because the network might fail, I need to fetch what I need from her computer to mine. So, to keep track of things, I’ll make a new directory, called .ahole/refs/remotes/eve, and I’ll copy all her heads - in this case just master - to that directory. So now, I’ve got .ahole/refs/remotes/eve/master, and in fact, it points to the commit that she did on the third day; this was commit ‘0a01a0’. I don’t have this commit in my .ahole directory, so I’ll copy that from /eves_computer/our_book/.ahole/0a01a0. I look in the info.txt file for that commit, and check what the parent is. It is ‘7ef41f’. I check if I have that, and yes, I have, so I can stop copying stuff from Eve’s directory.

So, what I just did was:

  • Copy Eve’s head references from /eves_computer/our_book/.ahole/refs/heads to my .ahole/refs/remotes/eve.
  • For each of the references in .ahole/refs/remotes/eve, I check whether I have the referenced commit, and the parents of that commit, and, if not, I copy them to .ahole.

We decide to call that two-step sequence - a fetch.
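
The two steps of a fetch could be sketched like this. It assumes that Eve’s repository is reachable as an ordinary directory, that each commit has a single "parent = " line in its info.txt, and (as in my reasoning above) that once I find a commit I already have, I also have all of its ancestors. The names ahole_fetch and parent_of are made up for the sketch.

```python
import shutil
from pathlib import Path


def parent_of(commit_dir):
    """Return the parent id recorded in a commit's info.txt, or ''."""
    for line in (Path(commit_dir) / 'info.txt').read_text().splitlines():
        if line.startswith('parent = '):
            return line[len('parent = '):].strip()
    return ''


def ahole_fetch(remote_repo, remote_name, repo='.ahole'):
    """Copy a remote's heads, then the commits we are missing."""
    remote_repo, repo = Path(remote_repo), Path(repo)
    remote_refs = repo / 'refs' / 'remotes' / remote_name
    remote_refs.mkdir(parents=True, exist_ok=True)
    # Step 1: copy the remote's head references
    for head in (remote_repo / 'refs' / 'heads').iterdir():
        shutil.copy(head, remote_refs / head.name)
    # Step 2: copy missing commits, walking back through the parents
    fetched = []
    for ref in remote_refs.iterdir():
        commit_id = ref.read_text().strip()
        while commit_id and not (repo / commit_id).exists():
            shutil.copytree(remote_repo / commit_id, repo / commit_id)
            fetched.append(commit_id)
            commit_id = parent_of(repo / commit_id)
    return fetched
```

For Eve’s book, the call would be ahole_fetch('/eves_computer/our_book/.ahole', 'eve').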

Now I want to look at her version of the book. I have her head references and the commits they point to, so I can checkout her latest version. I first get the commit identifier from .ahole/refs/remotes/eve/master - ‘0a01a0’. Then:

ahole_checkout('0a01a0')

This will put ‘0a01a0’ into .ahole/HEAD. I can look at her version of the book, and decide if I like it. If I do, then I can do a merge.

What is a merge? It’s the join of two commits. First I work out where Eve’s tree diverged from mine, by going back in her history, following the parents of the commits. In this case it’s easy, because the parent commit (‘7ef41f’) of this commit (‘0a01a0’) is one that is also in my history (the history for my ‘master’ branch). This most recent shared commit I will call the common ancestor. Then I work out the difference between the common ancestor commit (‘7ef41f’) and this commit (‘0a01a0’) - let’s call that eves_diff.
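
Finding the common ancestor can be sketched as a walk back through the parents. The sketch assumes that every parent of a commit is recorded on its own "parent = " line in info.txt, and that both histories are already in my .ahole directory (after a fetch). The function names are hypothetical.

```python
from pathlib import Path


def parents_of(commit_id, repo='.ahole'):
    """Return the parent ids listed in a commit's info.txt."""
    text = (Path(repo) / commit_id / 'info.txt').read_text()
    parents = []
    for line in text.splitlines():
        key, sep, value = line.partition(' = ')
        if key == 'parent' and value.strip():
            parents.append(value.strip())
    return parents


def ancestors(commit_id, repo='.ahole'):
    """Return the commit and all of its ancestors, nearest first."""
    seen, queue = [], [commit_id]
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents_of(current, repo))
    return seen


def common_ancestor(mine, theirs, repo='.ahole'):
    """Return the nearest commit in their history that is also in mine."""
    my_history = set(ancestors(mine, repo))
    for commit_id in ancestors(theirs, repo):
        if commit_id in my_history:
            return commit_id
    return None
```

With my ‘master’ at ‘dfbeda’ and Eve’s at ‘0a01a0’, common_ancestor('dfbeda', '0a01a0') walks back from Eve’s commit and stops at the first commit that is also in my history.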

I go back to my own ‘master’, which (according to .ahole/refs/heads/master) currently points to ‘dfbeda’:

ahole_checkout('master')

This will change .ahole/HEAD to be ref: refs/heads/master - and I will have just got the working tree from .ahole/dfbeda/files. Then I take eves_diff and apply it to my current working tree. If there were any conflicts, I resolve them, but in my world, there are no conflicts. I have a feeling there may be some later. That apple pie is making me feel a little funny.

Finally, I make a new commit, with a new unique ID - say ‘80cc85’, with the merged working tree. But, there’s a trick: here the new commit ‘80cc85’ - has two parents, first - ‘dfbeda’ - the previous commit in my ‘master’, and second ‘0a01a0’ - the last commit in Eve’s master. Now, the next time I look at Eve’s tree, I will be able to see that I’ve got her ‘0a01a0’ commit in my own history, and won’t need to apply it again.
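
To make eves_diff a little more concrete, here is a deliberately crude sketch that works on whole files. It assumes a flat files directory, ignores deleted files, and makes no attempt to combine changes within one file or to detect conflicts; a real merge would compare line by line. The names snapshot_diff and apply_diff are invented for the sketch.

```python
import filecmp
import shutil
from pathlib import Path


def snapshot_diff(ancestor_files, their_files):
    """Return names of files that are new or changed since the ancestor."""
    ancestor_files, their_files = Path(ancestor_files), Path(their_files)
    changed = []
    for their_file in sorted(their_files.iterdir()):
        old_file = ancestor_files / their_file.name
        if not old_file.exists() or not filecmp.cmp(
                old_file, their_file, shallow=False):
            changed.append(their_file.name)
    return changed


def apply_diff(changed, their_files, working_tree='.'):
    """Copy each new or changed file into the working tree."""
    for name in changed:
        shutil.copy(Path(their_files) / name, Path(working_tree) / name)
```

For Eve’s work, eves_diff would be snapshot_diff('.ahole/7ef41f/files', '.ahole/0a01a0/files'), and apply_diff(eves_diff, '.ahole/0a01a0/files') would bring her new files into my working tree, ready for the merge commit.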

On the sixth day - saving time and space with objects

I am now very happy with ahole, but Eve clearly doesn’t think we’ve got it right yet.

As she’s thinking, she decides to make a couple of illustrations for The Book, so she adds some photos to her working tree:

.
├── .ahole
│   ...
├── images
│   ├── adam_with_apple.jpg
│   └── lion_with_lamb.jpg
├── chapter1_discussion.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt

As soon as she does this, she realizes what’s wrong with ahole. The photos are large files. At the moment, every time we make a commit, we’re copying all the files into the commit files directory to make the snapshot. With big files, this is going to lead to many identical copies and lots of wasted space.

Eve realizes that what we need to do, is to make the commit use references to files, rather than the files themselves. That way, when the commit has files that have not changed, it can just point to the unchanged file, rather than carrying a wasteful copy of the file.

If the commits just store references, we need a way to store the contents of the files, so they can be referenced. Maybe we could store the files for our snapshots in a directory, and use some sort of unique filename so that the commits can reference that filename? For example, maybe we could make a directory in .ahole like this:

mkdir .ahole/objects

and use this directory to store the contents of the files for our snapshots. Then we could store the commits as something like a table, where the entries would tell us how to get the matching files from the .ahole/objects directory.

We could have some structure for the commits like this:

├── .ahole
│   ├── 5d89f8
│   │   ├── info.txt
│   │   └── file_list

where .ahole/5d89f8/file_list would be a list of references to files in the .ahole/objects directory, along with the filename that the contents has when reconstructed back into the snapshot. For example, maybe file_list would have a series of (object reference, filename) pairs like this:

contents_version1            contents.txt
chapter1_version1            chapter1.txt
chapter2_version2            chapter2.txt
chapter3_version1            chapter3.txt
chapter1_discussion_version1 chapter1_discussion.txt

These references in the first column could match filenames in the .ahole/objects directory:

│   ├── objects
│   │   ├── chapter1_version1
│   │   ├── chapter2_version1
│   │   ├── chapter2_version2
│   │   ├── chapter3_version1
│   │   ├── chapter1_discussion_version1
│   │   └── contents_version1

We could think of the .ahole/objects directory as a very simple form of database, where the keys are the filenames, and the file contents are the values.

We think about this for a while and realize that it’s going to be annoying trying to find unique names to use as filenames in .ahole/objects, because there will be many versions of many files. For example chapter1_version2, chapter1_version3 and so on is clearly not going to work, because when Eve and I work independently, at some point we’re both going to have something like a chapter1_version3 in our respective .ahole/objects directories, but they will be different, and that will be confusing.

At this stage, Eve reveals that she has some training in computer science. Of course I have no idea what that is, or who did the training, but she’s in too much of a rush to explain that now. She proposes that we make the filenames (database keys) by doing hashes of the file contents. It turns out that hashing algorithms can take a stream of bytes such as the contents of a file, and create a string that is near-enough unique to that stream of bytes. That’s really good, because it means that, if Eve and I have an object with the same filename (hash) that means it almost certainly contains the exact same contents.

Eve recommends the ‘SHA1’ hashing algorithm, and I’m in no position to disagree with her. Now we’ve got a unique string to use as a key for each file. For example, we run the SHA1 algorithm over the current book files and we get these:

Filename                 SHA1 hash
chapter1.txt             9e398c7cf8d56e960aa7769839cc0c38b8e12f11
chapter2.txt             65735b3705284cdf4a66c2e4812ca13cbaa7cd5d
chapter1_discussion.txt  3c2e09cc43568f13444c075c84b047957f7995a5
contents.txt             f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04

If we change the file at all, then the hash changes, and we have a new unique string and therefore we have a new unique filename with which to store the new contents. For example, the original version of chapter 2 was a bit shorter, and had a hash of ‘1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f’.
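
In Python, the standard hashlib module can compute such a hash. A minimal sketch follows; the function name content_hash is hypothetical, and the sketch hashes the raw bytes only, which may not be exactly how the hashes in the table above were made.

```python
import hashlib


def content_hash(data):
    """Return the SHA1 hash of some bytes as a 40-character hex string."""
    return hashlib.sha1(data).hexdigest()


# The same bytes always give the same hash; any change gives a new one
first_draft = content_hash(b'In the beginning')
second_draft = content_hash(b'In the beginning.')
assert first_draft == content_hash(b'In the beginning')
assert first_draft != second_draft
```

For a file on disk, content_hash(open('chapter1.txt', 'rb').read()) would give its key.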

Now we have got a nice way of making the references that will go into .ahole/5d89f8/file_list. First we store the file versions in our .ahole/objects directory, using their hash values as filenames:

│   ├── objects
│   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)
│   │   ├── 1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f (chapter2 version 1)
│   │   ├── 65735b3705284cdf4a66c2e4812ca13cbaa7cd5d (chapter2 version 2)
│   │   ├── 3c2e09cc43568f13444c075c84b047957f7995a5 (chapter1_discussion version 1)
│   │   └── f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 (contents version 1)

Next we create .ahole/5d89f8/file_list with one row per file in our directory. Each row contains first - the hash value (and therefore filename in .ahole/objects) which allows me to get the file contents, then the type of thing this is - here a file - and lastly, the filename as it was in the snapshot:

9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
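
Storing the files and building the listing could look roughly like this. The sketch assumes plain SHA1 over the raw file bytes and a flat working tree; store_object and make_file_list are made-up names.

```python
import hashlib
from pathlib import Path


def store_object(data, repo='.ahole'):
    """Store bytes in the objects directory, named by their SHA1 hash."""
    objects = Path(repo) / 'objects'
    objects.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha1(data).hexdigest()
    # Identical contents give the same key, so they are stored only once
    (objects / key).write_bytes(data)
    return key


def make_file_list(filenames, repo='.ahole'):
    """Store each file and return the 'hash file name' listing text."""
    rows = []
    for name in filenames:
        key = store_object(Path(name).read_bytes(), repo)
        rows.append(key + ' file ' + Path(name).name)
    return '\n'.join(rows) + '\n'
```

An unchanged file hashes to a key that is already in .ahole/objects, so a new commit that includes it costs one row in the listing rather than another copy of the file.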

Now, what about Eve’s new working tree with the photos in it? The photos are in the images subdirectory, and we don’t have a way of storing subdirectories yet. Aha - why not store directories in the object database too? Directories can just be tree files like file_list. Tree files are lists, one entry per row, where each row contains the hash reference for the file contents, the type of thing it is (tree or file), and the filename as it was in the snapshot. So, for Eve’s new commit, we’d first store the contents of the two photo files in the .ahole/objects directory:

│   ├── objects
│   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
│   │   ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
│   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)

etc. Then we make a new tree file called - say - ‘images_listing’ like this:

82e6792faa893070dcd6fe3e614b6f147be1a0a9 file adam_with_apple.jpg
e8b23357995db47e70906d4c7a08114c0c0ba376 file lion_with_lamb.jpg

and we make a hash for that tree file too, and put that into .ahole/objects:

│   ├── objects
│   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
│   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
│   │   ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
│   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)

etc. Now maybe our whole commit listing can include files and directories for the root directory of our project, something like:

9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
be242dba385bc0689be16454e959f4b64c87abce tree images

Oh - but wait - that’s just a tree listing too, let’s make a hash for that, and put it into the .ahole/objects directory:

│   ├── objects
│   │   ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)
│   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
│   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)

Right - so now our whole commit boils down to our info.txt file, and the hash for the root tree (the one starting ‘e52dc’ above). We can get rid of the old files subdirectory in the commit, and add the hash for the root tree instead - something like:

committer = Eve
message = Adding funny pictures
date = year0-jan-06
root_tree = e52dc9dbe358c549df65307652ff2709322812b3
parent = 0a01a0

Now we can solve the annoying problem of finding a unique commit id for each commit. We just make a hash for the info.txt file, and put that into the .ahole/objects directory too, as a commit file:

│   ├── objects
│   │   ├── 7e0cda8c145b300b519ed28998a31f801b6d626f (latest commit)
│   │   ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)
│   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)

The unique id for the commit is the hash for its contents. In this case the commit id is ‘7e0cda8c145b300b519ed28998a31f801b6d626f’. Don’t forget that the hash is more or less unique to the contents, so this commit will have an id that is unique to the combination of the committer, message, date, root tree hash and commit parent. The root tree hash is unique to the contents of the root tree listing, and the root tree listing contains file hashes, which are in turn unique to the file contents, so the root tree hash will be unique to the file contents of the commit. Thus, the commit id is unique to all the things that go into the commit, including the contents. It’s clever isn’t it?
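
To see that chain of dependence in action, here is a small sketch that builds the commit text in the layout shown above and hashes it. The function name commit_id is made up for this example, and I am not claiming that it reproduces the particular hash values printed in this story.

```python
import hashlib


def commit_id(committer, message, date_str, root_tree, parent):
    # Build the commit text, one 'key = value' row per field, then hash it
    info_str = ''.join([
        'committer = ' + committer + '\n',
        'message = ' + message + '\n',
        'date = ' + date_str + '\n',
        'root_tree = ' + root_tree + '\n',
        'parent = ' + parent + '\n',
    ])
    return hashlib.sha1(info_str.encode('utf-8')).hexdigest()
```

Changing any one argument, including a single character of the root tree hash, gives a different commit id, while the same five values always give the same id.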

We can now have three types of files in the .ahole/objects directory - files, trees, and commits.

OK - so things are now a little more complicated than our previous setup with file copies, but lots of things have just got much easier. For example, we can now get rid of the staging_area directory. The staging area can just be a single file containing the root tree listing of the snapshot. Let’s call that file .ahole/index. Now Eve has done her new commit, that file can just be the root directory listing of the previous commit (the commit we have just done):

9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
be242dba385bc0689be16454e959f4b64c87abce tree images

When Eve makes an edit to chapter1.txt, instead of copying the file to the staging_area directory, she makes a hash for the new chapter1.txt contents, she stores the new chapter1.txt contents in the .ahole/objects directory using the hash as a filename, and then she edits the .ahole/index file to point to her new chapter 1 contents instead of the old. She might automate this with a small command like ahole_stage [3]

def ahole_stage(fname):
    # Get the hash for the file contents (read as bytes)
    with open(fname, 'rb') as fobj:
        file_contents = fobj.read()
    file_hash = sha1_hash(file_contents)
    # (assuming that the new file is going in the root directory)
    new_root_entry = file_hash + ' file ' + fname
    with open('.ahole/index') as fobj:
        rows = fobj.read().splitlines()
    if new_root_entry in rows:
        # This exact file contents and filename already present
        return
    # Make an entry for these file contents in the objects database
    database_fname = '.ahole/objects/' + file_hash
    with open(database_fname, 'wb') as fobj:
        fobj.write(file_contents)
    # Drop any older entry for this filename, so that the index
    # points to the new contents instead of the old
    rows = [row for row in rows if row.split(' ', 2)[2] != fname]
    rows.append(new_root_entry)
    # Write index listing with new entry
    with open('.ahole/index', 'w') as fobj:
        fobj.write('\n'.join(rows) + '\n')

Making a new commit involves taking the contents of .ahole/index and using it to make a new commit file in .ahole/objects. Using the structure of our previous ahole_commit routine, that might look like:

def ahole_commit(committer, message):
    # *** this stuff is the same as before ***
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as fobj:
        head_contents = fobj.read()
    # Check if this is a reference, de-reference if so
    # Also, get file into which to write the new commit id
    if head_contents.startswith('ref: '):
        head_ref = head_contents.replace('ref: ', '')
        head_ref_file = '.ahole/' + head_ref
        with open(head_ref_file) as fobj:
            head_id = fobj.read()
    else:
        head_ref_file = '.ahole/HEAD'
        head_id = head_contents
    # *** the stuff below is different ***
    # Make root tree entry in objects database from .ahole/index
    with open('.ahole/index', 'rb') as fobj:
        index_contents = fobj.read()
    index_hash = sha1_hash(index_contents)
    with open('.ahole/objects/' + index_hash, 'wb') as fobj:
        fobj.write(index_contents)
    # Make commit information with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    info_str += 'date = ' + str(date.today()) + '\n'
    info_str += 'root_tree = ' + index_hash + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write commit file into objects database, with hash
    commit_bytes = info_str.encode('utf-8')
    commit_hash = sha1_hash(commit_bytes)
    with open('.ahole/objects/' + commit_hash, 'wb') as fobj:
        fobj.write(commit_bytes)
    # Set the current commit file to contain new id
    with open(head_ref_file, 'w') as fobj:
        fobj.write(commit_hash)

How about doing a merge? Remember that, in the bad old days, we had to compare lots of files between the two branches and their common ancestor? No more. Now that we are using hash references, all we need to do is look at the tree listings. If two tree listings have the same entry (filename and hash), the file is identical in the two trees, and we don’t have to load the contents to check. That makes it very fast to do comparisons between trees that haven’t changed much.
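
As a sketch of that comparison, assuming listings in the row format used above, the functions below find the names that differ between two listings, and the names that would still need a proper look at the contents in a three-way merge. All three function names are invented for this example.

```python
def parse_tree(listing):
    # Map filename -> (hash, kind) for each "<hash> <kind> <name>" row
    entries = {}
    for row in listing.splitlines():
        obj_hash, kind, name = row.split(' ', 2)
        entries[name] = (obj_hash, kind)
    return entries


def changed_names(listing_a, listing_b):
    # Names whose entry differs, or that exist in only one listing
    tree_a = parse_tree(listing_a)
    tree_b = parse_tree(listing_b)
    all_names = set(tree_a) | set(tree_b)
    return sorted(name for name in all_names
                  if tree_a.get(name) != tree_b.get(name))


def needs_content_merge(base, ours, theirs):
    # Names changed on both branches since the common ancestor,
    # where the two branches did not end up with the same entry
    tree_base = parse_tree(base)
    tree_ours = parse_tree(ours)
    tree_theirs = parse_tree(theirs)
    all_names = set(tree_base) | set(tree_ours) | set(tree_theirs)
    return sorted(
        name for name in all_names
        if tree_ours.get(name) != tree_base.get(name)
        and tree_theirs.get(name) != tree_base.get(name)
        and tree_ours.get(name) != tree_theirs.get(name))
```

No file contents are read here at all. When a tree entry differs, the same comparison can be repeated on the two subdirectory listings, and only files named by needs_content_merge need to be opened.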

Eve was right, of course. Now, if we make a new commit when one file has changed, all we store is the contents of the changed file, a new tree listing with the updated hash for that file, and the small commit file itself. That makes the storage for lots and lots of similar trees very efficient.

Someone ought to write this up and give it to the world. Wait, that’s just us.

On the seventh day - there was git

The seventh day is for resting. You are all done now, and the hard stuff is over. In a state of deep inner peace, you can think about all that you’ve discovered in ahole:

  • A commit refers to a snapshot of the complete set of files for your project
  • The staging area (index) defines what will change between your upcoming commit and the previous commit
  • A branch is just a pointer to a commit, that moves when you do another commit.
  • Version control is very easy to understand

You remind yourself that life is very good, because you don’t have to use a version control system called ahole, you can use a very similar system called git.

If you use git, you’ll notice that you have lots of ahole friends. You’ll see git creates a .git subdirectory that contains the repository. You’ll recognize the .git/objects directory containing filenames with SHA1 hashes. You’ll see that commits have SHA1 hashes. You’ll recognize the .git/HEAD file and .git/refs/heads and .git/refs/tags and .git/refs/heads/master. There is a .git/index file, and it is the staging area. .git/index is a little more complicated than .ahole/index because it’s adapted to helping with difficult merges, but it’s the same idea.
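
One small difference is worth knowing if you go looking for your ahole hashes in a git repository. In a standard SHA1 repository, git does not hash the bare file contents. It first puts a short header in front: the word blob, a space, the size of the contents in bytes, and a null byte. A minimal sketch of that calculation, with git_blob_hash being a name I made up:

```python
import hashlib


def git_blob_hash(data):
    # git hashes a header ("blob <size>" plus a null byte) then the contents
    header = b'blob ' + str(len(data)).encode('ascii') + b'\0'
    return hashlib.sha1(header + data).hexdigest()
```

For a file on disk, this should give the same value as running git hash-object on that file. git also compresses each object before storing it, and uses the first two characters of the hash as a subdirectory name inside .git/objects, so the files there are not quite as easy to read as the ones in .ahole/objects.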

You now live in the garden of Eden of version control. Remember to stay away from that apple tree.

Footnotes

[1]ahole might seem a bit rude to you, but I was born in the UK, and, where I come from, ‘ahole’ is roughly as rude as ‘git’.
[2]

In case you are interested, for the commit and checkout code to actually run, you would need some Python definitions. First some standard Python imports:

from datetime import date
from os import mkdir, listdir

Then we need some simple custom commands for deleting our working tree, and for copying files into the working tree:

from os import remove
from os.path import isfile, isdir
from shutil import copyfile, copytree, rmtree

def delete_tree(path):
    # Delete everything in path unless it's an '.ahole' directory
    for name in listdir(path):
        full_name = path + '/' + name
        # Test the full path, not the bare name, so this works
        # whatever the current working directory is
        if isfile(full_name):
            remove(full_name)
        elif isdir(full_name):
            if name != '.ahole':
                rmtree(full_name)

def copy_tree(src_path, dst_path):
    # Copy everything in src_path to dst_path
    for name in listdir(src_path):
        src_name = src_path + '/' + name
        dst_name = dst_path + '/' + name
        if isfile(src_name):
            copyfile(src_name, dst_name)
        elif isdir(src_name):
            copytree(src_name, dst_name)

We also need some definition of make_unique_id().

[3]

Now you need to add:

import hashlib

def sha1_hash(contents):
    # contents must be bytes; returns the 40-character hex digest
    return hashlib.sha1(contents).hexdigest()
