.. _git-foundation:

===============
Git foundations
===============

"Foundations" is a little joke on a religious theme; our page borrows heavily
from the `git parable`_ - so - why not a foundation myth?

On the first day - the repository and the working tree
======================================================

I'm young and the world is fresh and I have lots of time.

I have so much time, that I decide that I want to write a book.  The project I
am working on is a modest history and explanation of everything, with
provisional title "The Book".  It has a table of contents ``contents.txt`` and a
single book chapter::

    .
    ├── chapter1.txt
    └── contents.txt

As I start to write, I begin to think I should keep track of my changes.  I need
some sort of version control system.  How hard can it be?

I start with some names.  I'm going to call this set of files that I'm working
on, the *working tree*. Because I currently lack any shame about body issues, I
will call my new versioning system ``ahole`` [#ahole_git]_ .

I decide that I need to store the state of The Book at the end of each
day.  To do this, I make a new directory in my working tree, called ``.ahole``.
This directory will store The Book as it evolves into a world-wide best-seller.
I will use the name *repository* for the contents of ``.ahole``. 

At the end of the day, I make a copy of all the files in the working tree, and
save it in my new ``.ahole`` repository.  In fact, what I will do is, make a new
subdirectory in ``.ahole`` named for today's date, then store a copy of the book
files in there.  On unix, that might look like this::

    mkdir .ahole/year0-jan-01
    mkdir .ahole/year0-jan-01/files
    cp * .ahole/year0-jan-01/files 

So I've still got the contents of The Book in the working tree, but now, in the
repository, I have a copy of the files that is a snapshot of The Book as of
today::

    .
    ├── .ahole
    │   └── year0-jan-01
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── chapter1.txt
    └── contents.txt

On the second day - staging and commits
=======================================

Today I do some more work on the book.   I start work on chapter 2, and, while
I'm thinking about things, I find that I am also writing some notes to myself
about this character "Eve" that I have seen wandering around.  I save those
notes in a file called ``something_about_eve.txt``.  When I get to the end of
the day, I get ready to store my work.  At the moment, my directory looks like
this::

    .
    ├── .ahole
    │   └── year0-jan-01
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── something_about_eve.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

For some reason I can't put my finger on, I don't want to put
``something_about_eve.txt`` into the repository at the moment.  In fact, in
general, I want to choose which changes I back up into the repository, and which
changes I leave for another day.  In the end I come up with an idea.  I'll make
a directory in ``.ahole`` called ``staging_area``.  When I start work at the
beginning of the day, I copy the previous backed-up version of my files from the
repository, into ``staging_area``. These files are now ready for storing in the
next snapshot::

    mkdir .ahole/staging_area
    cp .ahole/year0-jan-01/files/* .ahole/staging_area

I now have::

    ├── .ahole
    │   ├─── staging_area
    │   │    ├── chapter1.txt
    │   │    └── contents.txt
    │   └── year0-jan-01
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── something_about_eve.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt


As I work, I decide what I'm going to put into tonight's snapshot.  For example,
maybe I changed ``chapter1.txt`` and I think it's ready to back up. I copy my
modified version of ``chapter1.txt`` from the working tree to ``staging_area``.
I'll call that *stage*-ing the file.  I'll also 'stage' the new ``chapter2.txt``
file (copy it to the staging area).  I'm not going to stage
``something_about_eve.txt`` at the moment.
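
Staging a file could be sketched in Python like this.  The function name
``ahole_stage`` and the ``ahole_dir`` argument are my own inventions for
illustration, and the sketch assumes the working tree has no subdirectories
yet:

```python
import os
import shutil


def ahole_stage(filename, ahole_dir='.ahole'):
    """Copy one working-tree file into the staging area."""
    staging = os.path.join(ahole_dir, 'staging_area')
    # Create the staging area the first time it is needed
    os.makedirs(staging, exist_ok=True)
    # Overwrite any older staged copy of the same file
    # (assumption: a flat working tree, so the base name is enough)
    shutil.copy(filename, os.path.join(staging, os.path.basename(filename)))
```

With this, ``ahole_stage('chapter2.txt')`` stages the new chapter, and
``something_about_eve.txt`` stays out of the snapshot simply because I never
stage it.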

Now I've done that, all the stuff I want to store in the backup is ready.  I
just need to put it into its own backup snapshot directory.   To do that, I just
do something like (Unix again)::

    mkdir .ahole/year0-jan-02
    mkdir .ahole/year0-jan-02/files
    cp .ahole/staging_area/* .ahole/year0-jan-02/files 

I end up with a directory that looks like this::

    .
    ├── .ahole
    │   ├── year0-jan-02
    │   │   └── files
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   ├─── staging_area
    │   │    ├── chapter2.txt
    │   │    ├── chapter1.txt 
    │   │    └── contents.txt
    │   └── year0-jan-01
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── something_about_eve.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

I decide that I'll use the name *commit* for each of the daily snapshot
directories (``year0-jan-01`` and ``year0-jan-02``).  The action of adding files
to the staging area, I will call *staging* files for the commit.  I will use
the term *committing* for the action of making the snapshot directory, and
copying the files from the staging area to the snapshot directory.

On the third day - history
==========================

As a result of certain events yesterday evening, I have a new friend, Eve.  She
wants to help out.  Of course Eve has her own computer, and I send her my
``.ahole`` directory.  I thank myself for my wisdom in not adding
``something_about_eve.txt`` to the repository.

Eve checks out our book (reconstructs my working tree) with something like::

    cp .ahole/year0-jan-02/files/* .

Now she's got the book files as I committed them last night.  She also copies
the last commit files into the staging area, as I did::

    ├── .ahole
    │   ├─── staging_area
    │   │    ├── chapter2.txt 
    │   │    ├── chapter1.txt
    │   │    └── contents.txt

She works hard on a new file ``chapter1_discussion.txt``. It's good to see she's
enjoying the work.  As the afternoon turns to evening, she gets ready to save
her work, so she copies ``chapter1_discussion.txt`` to ``.ahole/staging_area``.
Now she is ready to do a commit::

    mkdir .ahole/year0-jan-03
    mkdir .ahole/year0-jan-03/files
    cp .ahole/staging_area/* .ahole/year0-jan-03/files

That is what Eve was going to do, but Eve is smart, and she immediately realizes
that there is a problem.  After she has done her commit, both of us will likely
have a commit directory ``.ahole/year0-jan-03`` - but they will have different
contents.   If she later wants to share work with me, that could get confusing.

The two of us are a little tired after all our work, and we meet for a beer.  We
talk about it for a while.  At first we think we can just add the time to the
date, because that's likely to be unique for each of us.  Then we realize that
that's going to get messy too, because, if Eve does a commit on her computer,
then I do a commit on mine, and she does another one on hers, the times will say
that these are all in one sequence, but in fact there are two sequences, mine,
and Eve's.  We need some other way to keep track of the sequence of commits, that
will work even if two of us are working independently.

In the end we decide that we are going to give the commits some unique
identifier string instead of the date.  We might have a problem in making sure
that the unique identifier string is actually unique, but let's assume we can
solve that somehow.  We'll store the contents of the working tree in the same
way as we have done up till now, in the ``files`` subdirectory, but we'll add a
new file to each commit, called ``info.txt``, that will tell us who did the
commit, and when, and, most importantly, what the previous commit was.  We'll
call the previous commit the *parent*.

Eve was right to predict that I had made my own commit today.  I've been happily
working on chapter 3.  So, before our conversation, my directory looked like
this::

    .
    ├── .ahole
    │   ├── year0-jan-03
    │   │   └── files
    │   │       ├── chapter3.txt
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   ├─── staging_area
    │   │    ├── chapter3.txt
    │   │    ├── chapter2.txt
    │   │    ├── chapter1.txt
    │   │    └── contents.txt
    │   ├── year0-jan-02
    │   │   └── files
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   └── year0-jan-01
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── something_about_eve.txt
    ├── chapter3.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

but now we've worked out the new way, it looks like this::

    .
    ├── .ahole
    │   ├── 5d89f8
    │   │   ├── info.txt
    │   │   └── files
    │   │       ├── chapter3.txt
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   ├─── staging_area
    │   │    ├── chapter3.txt
    │   │    ├── chapter2.txt
    │   │    ├── chapter1.txt
    │   │    └── contents.txt
    │   ├── 7ef41f
    │   │   ├── info.txt
    │   │   └── files
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   └── 6438a4
    │       ├── info.txt
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── something_about_eve.txt
    ├── chapter3.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

and ``.ahole/5d89f8/info.txt`` looks like this::

    committer = Adam
    message = Third day
    date = year0-jan-03
    parent = 7ef41f

Meanwhile, Eve's directory looks like this::

    .
    ├── .ahole
    │   ├── 0a01a0
    │   │   ├── info.txt
    │   │   └── files
    │   │       ├── chapter1_discussion.txt
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   ├─── staging_area
    │   │    ├── chapter1_discussion.txt
    │   │    ├── chapter2.txt
    │   │    ├── chapter1.txt
    │   │    └── contents.txt
    │   ├── 7ef41f
    │   │   ├── info.txt
    │   │   └── files
    │   │       ├── chapter2.txt
    │   │       ├── chapter1.txt
    │   │       └── contents.txt
    │   └── 6438a4
    │       ├── info.txt
    │       └── files
    │           ├── chapter1.txt
    │           └── contents.txt
    ├── chapter1_discussion.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

and Eve's ``.ahole/0a01a0/info.txt`` looks like this::

    committer = Eve
    message = Eve day 3
    date = year0-jan-03
    parent = 7ef41f

After a little thought, Eve and I realize that, when we make our new commit, we
are going to have to know what the current commit is, so we can use that as the
parent.  When we make a new commit, we store the commit identifier in a file.
We'll call this file ``.ahole/HEAD``, so, after my last commit above, the file
``.ahole/HEAD`` will have the contents ``5d89f8``. We use the contents of
``.ahole/HEAD`` to identify the last (current) commit.  And of course, when we
make a new commit, we can get the parent of the new commit, from the current
commit in ``.ahole/HEAD``. 

So now, we have a new procedure for our commit.  In outline it looks like this
(now in python_ syntax) [#commit_imports]_ ::

   def ahole_commit(committer, message):
       # Make a unique identifier for this commit somehow
       new_id = make_unique_id()
       # Make a new directory in .ahole with the new unique name
       commit_dir = '.ahole/' + new_id
       mkdir(commit_dir)
       mkdir(commit_dir + '/files')
       # Copy the files from the staging area to the new snapshot directory
       copy_tree('.ahole/staging_area', commit_dir + '/files')
       # Get previous (parent) commit id from .ahole/HEAD
       with open('.ahole/HEAD') as head_file:
           head_id = head_file.read().strip()
       # Make info with parent set to HEAD
       info_str = 'committer = ' + committer + '\n'
       info_str += 'message = ' + message + '\n'
       info_str += 'date = ' + str(date.today()) + '\n'
       info_str += 'parent = ' + head_id + '\n'
       # Write info to info.txt file
       with open(commit_dir + '/info.txt', 'w') as info_file:
           info_file.write(info_str)
       # Set .ahole/HEAD to contain new commit id
       with open('.ahole/HEAD', 'w') as head_file:
           head_file.write(new_id)

When we want to go back to an earlier state of the book, we can do a
*checkout*, with something like::

   def ahole_checkout(commit_id):
       commit_dir = '.ahole/' + commit_id
       # copy .ahole/$commit_id/files into working tree
       # (assumes delete_tree leaves the .ahole directory itself in place)
       delete_tree('.')
       copy_tree(commit_dir + '/files', '.')
       # make .ahole/HEAD contain commit_id
       with open('.ahole/HEAD', 'w') as head_file:
           head_file.write(commit_id)
       # copy commit snapshot into staging area
       delete_tree('.ahole/staging_area')
       copy_tree(commit_dir + '/files', '.ahole/staging_area')

So, when we run ``ahole_checkout('7ef41f')`` we will get the copy of the working
tree corresponding to ``7ef41f``, and ``.ahole/HEAD`` will just contain the
string ``7ef41f``.

In our excitement, we immediately realize that it's really easy to see the
history of the book now.  We can easily fetch out ``info.txt`` from the current
commit, print it, then find its parent, and fetch ``info.txt`` from the parent,
print it, and so on.
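
Walking the history in this way might look something like the sketch below.
The names ``read_info`` and ``ahole_log``, and the ``ahole_dir`` argument, are
made up for illustration.  The sketch also assumes that the very first commit
has no ``parent`` line in its ``info.txt`` (or an empty one), which is
something we have not actually decided yet:

```python
import os


def read_info(commit_id, ahole_dir='.ahole'):
    """Parse a commit's info.txt into a dict of its 'key = value' lines."""
    info = {}
    with open(os.path.join(ahole_dir, commit_id, 'info.txt')) as info_file:
        for line in info_file:
            if '=' in line:
                key, value = line.split('=', 1)
                info[key.strip()] = value.strip()
    return info


def ahole_log(commit_id, ahole_dir='.ahole'):
    """Return (commit id, info) pairs from commit_id back to the first commit."""
    history = []
    while commit_id:
        info = read_info(commit_id, ahole_dir)
        history.append((commit_id, info))
        # Assumption: the first commit has no parent entry, or an empty one
        commit_id = info.get('parent')
    return history
```

Starting from the commit id stored in ``.ahole/HEAD``, each step prints (or
here, collects) one ``info.txt`` and then moves on to its parent.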

Now we are tired, but happy, and we rest.

On the fourth day - references
==============================

We wake with a strange excitement.  The idea of keeping a reference to the
current commit in ``.ahole/HEAD`` seems like it could be more general.  I
talk to Eve over breakfast (she stayed in her own place of course, but she came
over for work).  Together we work out the concept of *references*. A reference
is:

Reference
    Something that points to a commit

So, ``.ahole/HEAD`` is a reference - to the current commit.  But what if I decide
that I want to give out some preliminary version of our book?  Let's say I want
to release the book stored in ``.ahole/7ef41f/files`` as 'release-0.1'.   I'm
going to send this out to all my friends (to be honest, I don't have many
friends just yet, but still).  I want to be able to remember what version of the
book I sent out.  I can make a *reference* to this commit.  I'll call this a
*tag*.   I make a new directory in ``.ahole`` called ``refs``, and another
directory in ``refs``, called ``tags``, and then, in
``.ahole/refs/tags/release-0.1`` I just put ``7ef41f`` - a reference to the
release commit.   That way, if I ever need to go back to the version of the book
I released, I just have to read the ``release-0.1`` file to find the commit, and
then checkout that commit. 
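
Making and reading a tag could be sketched as below.  The helper names
``ahole_tag`` and ``read_tag`` and the ``ahole_dir`` argument are hypothetical,
chosen only for this illustration:

```python
import os


def ahole_tag(tag_name, commit_id, ahole_dir='.ahole'):
    """Record commit_id under .ahole/refs/tags/<tag_name>."""
    tags_dir = os.path.join(ahole_dir, 'refs', 'tags')
    # Make .ahole/refs and .ahole/refs/tags if they are not there yet
    os.makedirs(tags_dir, exist_ok=True)
    with open(os.path.join(tags_dir, tag_name), 'w') as tag_file:
        tag_file.write(commit_id)


def read_tag(tag_name, ahole_dir='.ahole'):
    """Return the commit id that a tag points to."""
    with open(os.path.join(ahole_dir, 'refs', 'tags', tag_name)) as tag_file:
        return tag_file.read().strip()
```

So ``ahole_tag('release-0.1', '7ef41f')`` makes the tag, and
``read_tag('release-0.1')`` gives back the commit id to check out.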

Wait, but, there's a problem.  If I checkout the commit in ``release-0.1``, I
will overwrite ``.ahole/HEAD``, and I will lose track of what commit I was
working on before.

Let's store that in another reference.  Let's use the name 'master' for my main
line of development.  I store where this is, by making a new file
``.ahole/refs/heads/master`` that is a reference to the last commit.  It just
contains the text '5d89f8'.  So that I know that I am working on 'master', I
make ``.ahole/HEAD`` have the text ``ref: refs/heads/master``.  Now, when I make
a new commit, I first check ``.ahole/HEAD``; if I see ``ref:
refs/heads/master``, then first, I get the commit id in
``.ahole/refs/heads/master`` - and I use that as the parent id for the commit.
When I've saved the new commit, I set ``.ahole/refs/heads/master`` to have the
new commit id.  So, I need to modify my commit procedure slightly::

   def ahole_commit(committer, message):
       # *** this stuff down to the next *** line is new
       # Get previous (parent) commit id from .ahole/HEAD
       with open('.ahole/HEAD') as head_file:
           head_contents = head_file.read().strip()
       # Check if this is a reference, de-reference if so
       # Also, get file into which to write the new commit id
       if head_contents.startswith('ref: '):
           head_ref = head_contents.replace('ref: ', '')
           head_ref_file = '.ahole/' + head_ref
           with open(head_ref_file) as ref_file:
               head_id = ref_file.read().strip()
       else:
           head_ref_file = '.ahole/HEAD'
           head_id = head_contents
       # *** the stuff below you've seen before (until *** again)
       # Make a unique identifier for this commit somehow
       new_id = make_unique_id()
       # Make a new directory in .ahole with the new unique name
       commit_dir = '.ahole/' + new_id
       mkdir(commit_dir)
       mkdir(commit_dir + '/files')
       # Copy the files from the staging area to the new snapshot directory
       copy_tree('.ahole/staging_area', commit_dir + '/files')
       # Make info.txt with parent set to HEAD
       info_str = 'committer = ' + committer + '\n'
       info_str += 'message = ' + message + '\n'
       info_str += 'date = ' + str(date.today()) + '\n'
       info_str += 'parent = ' + head_id + '\n'
       # Write info to info.txt file
       with open(commit_dir + '/info.txt', 'w') as info_file:
           info_file.write(info_str)
       # Set the file that points to the current commit, to point to our commit
       # *** a little new, in that we might be writing to .ahole/HEAD, or
       # something like .ahole/refs/heads/master, depending on what .ahole/HEAD
       # contained at the top of this routine
       with open(head_ref_file, 'w') as ref_file:
           ref_file.write(new_id)

So, let's say that I'm currently on commit '5d89f8'.  ``.ahole/HEAD`` contains
``ref: refs/heads/master``.  ``.ahole/refs/heads/master`` contains ``5d89f8``.
I run my commit procedure::

   ahole_commit('Adam', 'Night follows day')

The commit procedure has made a new commit 'dfbeda'; ``.ahole/HEAD`` continues
to have text ``ref: refs/heads/master``, but now ``.ahole/refs/heads/master``
contains ``dfbeda``.  In this way, we keep track of which commit we are on, by
constantly updating 'master'.

Ok - now let's return to me checking out the released version of the book.  I
first get the contents of ``.ahole/refs/tags/release-0.1`` - it's '7ef41f'.
Then I checkout the working tree for that version, using my nice
``ahole_checkout`` procedure::

    ahole_checkout('7ef41f')

The checkout procedure will make ``.ahole/HEAD`` contain the text ``7ef41f``.

Now I want to go back to working on my current version of the book.  That's the
set of files pointed to by ``.ahole/refs/heads/master``.  I can
check the contents of ``.ahole/refs/heads/master`` - it is ``dfbeda``.  Then I
get the current version with the normal checkout procedure::

    ahole_checkout('dfbeda')

Finally, I'll have to set ``.ahole/HEAD`` to be ``ref: refs/heads/master``.  All
good.

Of course, I could automate this, by modifying my checkout procedure slightly::

   def ahole_checkout(commit_reference):
      # If this is a reference, dereference
      if commit_reference in listdir('.ahole/refs/heads'):
          # it's a head reference, maybe 'master'
          head_reference = True
          fname = '.ahole/refs/heads/' + commit_reference
          with open(fname) as ref_file:
              commit_id = ref_file.read().strip()
      elif commit_reference in listdir('.ahole/refs/tags'):
          # it's a tag reference
          head_reference = False
          fname = '.ahole/refs/tags/' + commit_reference
          with open(fname) as ref_file:
              commit_id = ref_file.read().strip()
      else: # Just a standard commit id
          head_reference = False
          commit_id = commit_reference
      commit_dir = '.ahole/' + commit_id
      # copy .ahole/$commit_id/files into working tree
      delete_tree('.')
      copy_tree(commit_dir + '/files', '.')
      # make .ahole/HEAD point to commit id
      if head_reference:
          # Point HEAD at head reference
          with open('.ahole/HEAD', 'w') as head_file:
              head_file.write('ref: refs/heads/' + commit_reference)
          # Write commit id into head reference file
          with open('.ahole/refs/heads/' + commit_reference, 'w') as ref_file:
              ref_file.write(commit_id)
      else:
          with open('.ahole/HEAD', 'w') as head_file:
              head_file.write(commit_id)
      # copy commit snapshot into staging area
      delete_tree('.ahole/staging_area')
      copy_tree(commit_dir + '/files', '.ahole/staging_area')

What then, is the difference, between a *tag* - like our release - and the
moving target like 'master'?  The 'tag' is a *static* reference - it does not
change when we do a commit and always points to the same commit.  'master' is a
dynamic reference - in particular, it's a *head* reference:

Head
    A head is a reference that updates when we do a commit

My head is hurting a little, after Eve explains all this, but after a little
while and a nice apple pie, I'm feeling positive about ``ahole``.

On the fifth day - branches, merges and remotes
===============================================

Yesterday was a little exhausting, so today there was some time for reflection.

As Eve and I relax with the other animals, who are all getting on very well with
each other, we begin to realize that this *head* thing could be very useful.

For example, what if one of my very small number of friends tells me that
there's a serious conceptual error in the version of the book that I released -
'release-0.1'.  What if I want to go back and fix it - that is - do another
commit on top of the *released* book, instead of the version of the book that
I'm currently working on?  I can just make a new *head*.  I'll do it like this::

   cp .ahole/refs/tags/release-0.1 .ahole/refs/heads/working-on-0.1

Then, I look at what commit ``working-on-0.1`` contains - of course it's
``7ef41f``.  I get that state of the book with my new checkout procedure::

    ahole_checkout('working-on-0.1')

This changes ``.ahole/HEAD`` to be ``ref: refs/heads/working-on-0.1``.  Now,
when I do a commit with ``ahole_commit``,  that will update the file
``.ahole/refs/heads/working-on-0.1`` to have the new commit identifier.  Despite the
apple pie being a bit bitter last night, we're feeling good.

As we think about this, we come to think of 'master' and 'working-on-0.1' as
*branches* - because they can each be thought of as identifying a tree or graph
of commits, which can grow.  All I need, to make a new branch, is make a
new head reference to a commit.  For example, if I want to make a new branch
starting at the current position of 'master', all I need is::

   cp .ahole/refs/heads/master .ahole/refs/heads/my-new-branch

If I want to work on this branch, I need to check it out, with::

    ahole_checkout('my-new-branch')

That will get the commit identifier in ``.ahole/refs/heads/my-new-branch``, unpack
the commit tree into the working tree, and set ``.ahole/HEAD`` to contain the
text ``ref: refs/heads/my-new-branch``.
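
Making a branch from wherever I am right now could be sketched like this.
``resolve_head`` and ``ahole_branch`` are names I have made up for the
illustration, as is the ``ahole_dir`` argument:

```python
import os


def resolve_head(ahole_dir='.ahole'):
    """Return the commit id that .ahole/HEAD currently points to."""
    with open(os.path.join(ahole_dir, 'HEAD')) as head_file:
        contents = head_file.read().strip()
    if contents.startswith('ref: '):
        # HEAD names a head reference; follow it to the commit id
        ref_path = os.path.join(ahole_dir, contents[len('ref: '):])
        with open(ref_path) as ref_file:
            return ref_file.read().strip()
    # Otherwise HEAD holds a commit id directly
    return contents


def ahole_branch(branch_name, ahole_dir='.ahole'):
    """Create a new head reference pointing at the current commit."""
    heads_dir = os.path.join(ahole_dir, 'refs', 'heads')
    os.makedirs(heads_dir, exist_ok=True)
    commit_id = resolve_head(ahole_dir)
    with open(os.path.join(heads_dir, branch_name), 'w') as ref_file:
        ref_file.write(commit_id)
    return commit_id
```

Note that this only creates the new head; ``.ahole/HEAD`` is left alone, so I
still need a checkout to start working on the new branch.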

I've got my branches, but Eve will have her own branches, and this will help us
know where each of us is working.

That's good, because Eve is now asking me if I can have a look at her changes,
and whether I'll include them in my version of the book.  Unwisely I end up
suggesting that women don't contribute to books, and ask her why her hair isn't
covered with an as-yet not-invented headscarf.  In the end we patch it up, and I
agree to go back and try and put in her changes. 

Luckily, despite the lack of basics like clothing, there is an excellent local
network, so I can see the contents of her version of the book at
``/eves_computer/our_book/.ahole``.  She wants me to look at her 'master'
branch.  Just because the network might fail, I need to fetch what I need from
her computer to mine.  So, to keep track of things, I'll make a new directory,
called ``.ahole/refs/remotes/eve``, and I'll copy all her *heads* - in this case
just ``master`` - to that directory.   So now, I've got
``.ahole/refs/remotes/eve/master``, and in fact, it points to the commit that
she did on the third day; this was commit '0a01a0'.  I don't have this
commit in my ``.ahole`` directory, so I'll copy that from
``/eves_computer/our_book/.ahole/0a01a0``.  I look in the ``info.txt`` file
for that commit, and check what the parent is.  It is '7ef41f'.  I check if I
have that, and yes, I have, so I can stop copying stuff from Eve's directory.

So, what I just did was:

* Copy Eve's *head* references from
  ``/eves_computer/our_book/.ahole/refs/heads`` to my
  ``.ahole/refs/remotes/eve``. 
* For each of the references in ``.ahole/refs/remotes/eve``, I check whether
  I have the referenced commit, and the parents of that commit, and, if not, I
  copy them to ``.ahole``.

We decide to call that two-step sequence - a *fetch*. 
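
The two steps of a fetch might be sketched as follows.  The names
``read_parents`` and ``ahole_fetch`` and their arguments are hypothetical.  The
sketch also assumes, as I did above, that if I already have a commit then I
already have all of its ancestors too:

```python
import os
import shutil


def read_parents(commit_dir):
    """Return the parent ids listed in a commit's info.txt."""
    parents = []
    with open(os.path.join(commit_dir, 'info.txt')) as info_file:
        for line in info_file:
            key, _, value = line.partition('=')
            if key.strip() == 'parent' and value.strip():
                parents.append(value.strip())
    return parents


def ahole_fetch(remote_name, remote_ahole, ahole_dir='.ahole'):
    """Copy a remote's head references, then any commits we lack."""
    remote_heads = os.path.join(remote_ahole, 'refs', 'heads')
    local_refs = os.path.join(ahole_dir, 'refs', 'remotes', remote_name)
    os.makedirs(local_refs, exist_ok=True)
    to_check = []
    # Step 1: copy each of the remote's head references
    for head in os.listdir(remote_heads):
        shutil.copy(os.path.join(remote_heads, head), local_refs)
        with open(os.path.join(local_refs, head)) as ref_file:
            to_check.append(ref_file.read().strip())
    # Step 2: copy missing commits, following parents until we reach
    # a commit we already have
    copied = []
    while to_check:
        commit_id = to_check.pop()
        local_commit = os.path.join(ahole_dir, commit_id)
        if os.path.isdir(local_commit):
            continue  # assumption: its ancestors are here already
        shutil.copytree(os.path.join(remote_ahole, commit_id), local_commit)
        copied.append(commit_id)
        to_check.extend(read_parents(local_commit))
    return copied
```

For Eve's book this would be
``ahole_fetch('eve', '/eves_computer/our_book/.ahole')``, which copies her
``master`` reference and then the single commit '0a01a0'.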

Now I want to look at her version of the book.  I have her head references and
the commits they point to, so I can checkout her latest version. I first get the
commit identifier from ``.ahole/refs/remotes/eve/master`` - '0a01a0'.  Then::

    ahole_checkout('0a01a0')

This will put '0a01a0' into ``.ahole/HEAD``.  I can look at her version of the
book, and decide if I like it.  If I do, then I can do a *merge*.  

What is a merge?  It's the join of two commits.  First I work out where Eve's
tree diverged from mine, by going back in her history, following the parents of
the commits.  In this case it's easy, because the parent commit ('7ef41f') of
this commit ('0a01a0') is one that is also in my history (the history for my
'master' branch).  This most recent shared commit I will call the *common
ancestor*.  Then I work out the difference between the common ancestor commit
('7ef41f') and this commit ('0a01a0') - let's call that ``eves_diff``.  
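
Finding the common ancestor by following parents could be sketched like this.
The function names and the ``ahole_dir`` argument are made up for the
illustration, and the search is a simple one that suits small histories like
ours rather than anything clever:

```python
import os


def parents_of(commit_id, ahole_dir='.ahole'):
    """Return the parent ids recorded in a commit's info.txt."""
    parents = []
    with open(os.path.join(ahole_dir, commit_id, 'info.txt')) as info_file:
        for line in info_file:
            key, _, value = line.partition('=')
            if key.strip() == 'parent' and value.strip():
                parents.append(value.strip())
    return parents


def ancestors(commit_id, ahole_dir='.ahole'):
    """List commit_id and everything reachable through parents, nearest first."""
    seen, queue = [], [commit_id]
    while queue:
        current = queue.pop(0)
        if current not in seen:
            seen.append(current)
            queue.extend(parents_of(current, ahole_dir))
    return seen


def common_ancestor(mine, theirs, ahole_dir='.ahole'):
    """Return the first commit in their history that is also in mine."""
    my_history = set(ancestors(mine, ahole_dir))
    for commit_id in ancestors(theirs, ahole_dir):
        if commit_id in my_history:
            return commit_id
    return None
```

This needs Eve's commits to be in my own ``.ahole`` directory already, which is
exactly what the fetch gave me.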

I go back to my own 'master', which (according to
``.ahole/refs/heads/master``) is 'dfbeda'::

    ahole_checkout('master')

This will change ``.ahole/HEAD`` to be ``ref: refs/heads/master`` - and I will
have just got the working tree from ``.ahole/dfbeda/files``.  Then I take
``eves_diff`` and apply it to my current working tree.  If there were any
conflicts, I resolve them, but in my world, there are no conflicts.  I have a
feeling there may be some later.   That apple pie is making me feel a little
funny.

Finally, I make a new commit, with a new unique ID - say '80cc85', with the
merged working tree.  But, there's a trick: here the new commit '80cc85' - has
*two* parents, first - 'dfbeda' - the previous commit in my 'master', and second
'0a01a0' - the last commit in Eve's master.  Now, the next time I look at Eve's
tree, I will be able to see that I've got her '0a01a0' commit in my own history,
and won't need to apply it again.

On the sixth day - saving time and space with objects
=====================================================

I am now very happy with ``ahole``, but Eve clearly doesn't think we've got it
right yet.

As she's thinking, she decides to make a couple of illustrations for The Book,
so she adds some photos to her working tree::

    .
    ├── .ahole
    │   ...
    ├── images
    │   ├── adam_with_apple.jpg
    │   └── lion_with_lamb.jpg
    ├── chapter1_discussion.txt
    ├── chapter2.txt
    ├── chapter1.txt
    └── contents.txt

As soon as she does this, she realizes what's wrong with ``ahole``.  The photos
are large files.  At the moment, every time we make a commit, we're copying all
the files into the commit ``files`` directory to make the snapshot.  With big
files, this is going to lead to many identical copies and lots of wasted space. 

Eve realizes that what we need to do, is to make the commit use *references* to
files, rather than the files themselves.  That way, when the commit has files
that have not changed, it can just point to the unchanged file, rather than
carrying a wasteful copy of the file.

If the commits just store references, we need a way to store the contents of the
files, so they can be referenced.  Maybe we could store the files for our
snapshots in a directory, and use some sort of unique filename so that the
commits can reference that filename?  For example, maybe we could make a
directory in ``.ahole`` like this::

   mkdir .ahole/objects

and use this directory to store the contents of the files for our snapshots.
Then we could store the commits as something like a table, where the entries
would tell us how to get the matching files from the ``.ahole/objects``
directory. 

We could have some structure for the commits like this::

    ├── .ahole
    │   ├── 5d89f8
    │   │   ├── info.txt
    │   │   └── file_list


where ``.ahole/5d89f8/file_list`` would be a list of references to files in the
``.ahole/objects`` directory, along with the filename that the contents has when
reconstructed back into the snapshot.  For example, maybe ``file_list`` would
have a series of (object reference, filename) pairs like this::

    contents_version1            contents.txt
    chapter1_version1            chapter1.txt
    chapter2_version2            chapter2.txt
    chapter3_version1            chapter3.txt
    chapter1_discussion_version1 chapter1_discussion.txt

These references in the first column could match filenames in the
``.ahole/objects`` directory::

    │   ├── objects
    │   │   ├── chapter1_version1
    │   │   ├── chapter2_version1
    │   │   ├── chapter2_version2
    │   │   ├── chapter3_version1
    │   │   ├── chapter1_discussion_version1
    │   │   └── contents_version1

We could think of the ``.ahole/objects`` directory as a very simple form of
database, where the keys are the filenames, and the file contents are the
values.

We think about this for a while and realize that it's going to be annoying
trying to find unique names to use as filenames in ``.ahole/objects``, because
there will be many versions of many files.  For example ``chapter1_version2``,
``chapter1_version3`` and so on is clearly not going to work, because when Eve
and I work independently, at some point we're both going to have something like
a ``chapter1_version3`` in our respective ``.ahole/objects`` directories, but
they will be different, and that will be confusing.

At this stage, Eve reveals that she has some training in computer science.  Of
course I have no idea what that is, or who did the training, but she's in too
much of a rush to explain that now.   She proposes that we make the filenames
(database keys) by doing *hashes* of the file contents.  It turns out that
hashing algorithms can take a stream of bytes such as the contents of a file,
and create a string that is near-enough unique to that stream of bytes. That's
really good, because it means that, if Eve and I have an object with the same
filename (hash) that means it almost certainly contains the exact same contents.

Eve recommends the 'SHA1' hashing algorithm, and I'm in no position to disagree
with her.  Now we've got a unique string to use as a key for each file.  For
example, we run the SHA1 algorithm over the current book files and we get these:

========================  ========================================
Filename                  SHA1 hash
========================  ========================================
chapter1.txt              9e398c7cf8d56e960aa7769839cc0c38b8e12f11
chapter2.txt              65735b3705284cdf4a66c2e4812ca13cbaa7cd5d
chapter1_discussion.txt   3c2e09cc43568f13444c075c84b047957f7995a5
contents.txt              f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04
========================  ========================================

If we change the file at all, then the hash changes, and we have a new unique
string and therefore we have a new unique filename with which to store the new
contents. For example, the original version of chapter 2 was a bit shorter, and
had a hash of '1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f'. 
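
Hashing a file's contents and storing it under that hash could be sketched as
below, using Python's standard ``hashlib`` module.  The function name
``store_object`` and the ``ahole_dir`` argument are hypothetical, and the
sketch hashes the raw file bytes and nothing else:

```python
import hashlib
import os


def store_object(filename, ahole_dir='.ahole'):
    """Store a file's contents in .ahole/objects under its SHA1 hash."""
    with open(filename, 'rb') as source:
        contents = source.read()
    # The hash of the contents becomes the key (the object filename)
    key = hashlib.sha1(contents).hexdigest()
    objects_dir = os.path.join(ahole_dir, 'objects')
    os.makedirs(objects_dir, exist_ok=True)
    object_path = os.path.join(objects_dir, key)
    # Identical contents give an identical key, so there is nothing to
    # write if the object is already there
    if not os.path.exists(object_path):
        with open(object_path, 'wb') as target:
            target.write(contents)
    return key
```

Storing the same contents twice gives the same key and only one file in
``.ahole/objects``, which is where the space saving comes from.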

Now we have got a nice way of making the references that will go into
``.ahole/5d89f8/file_list``.  First we store the file versions in our
``.ahole/objects`` directory, using their hash values as filenames::

    │   ├── objects
    │   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)
    │   │   ├── 1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f (chapter2 version 1)
    │   │   ├── 65735b3705284cdf4a66c2e4812ca13cbaa7cd5d (chapter2 version 2)
    │   │   ├── 3c2e09cc43568f13444c075c84b047957f7995a5 (chapter1_discussion version 1)
    │   │   └── f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 (contents version 1)

Next we create ``.ahole/5d89f8/file_list`` with one row per file in our
directory.  Each row contains first - the hash value (and therefore filename in
``.ahole/objects``) which allows me to get the file contents, then the type of
thing this is - here a file - and lastly, the filename as it was in the
snapshot::

    9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt 
    65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt 
    3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt 
    f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt 

Now, what about Eve's new working tree with the photos in it?  The photos are in
the ``images`` subdirectory, and we don't have a way of storing subdirectories
yet.  Aha - why not store directories in the object database too?  Directories
can just be *tree* files like ``file_list``.  *tree* files are lists, one entry
per row, where each row contains the hash reference for the file contents, the
type of thing it is (tree or file), and the filename as it was in the snapshot.
So, for Eve's new commit, we'd first store the contents of the two photo files
in the ``.ahole/objects`` directory::

    │   ├── objects
    │   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
    │   │   ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
    │   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)

etc.  Then we make a new *tree* file called - say - 'images_listing' like this::

    82e6792faa893070dcd6fe3e614b6f147be1a0a9 file adam_with_apple.jpg 
    e8b23357995db47e70906d4c7a08114c0c0ba376 file lion_with_lamb.jpg  

and we make a hash for that tree file too, and put that into
``.ahole/objects``::

    │   ├── objects
    │   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
    │   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
    │   │   ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
    │   │   ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)

etc.  Now maybe our whole commit listing can include files and directories for
the root directory of our project, something like::

    9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
    65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
    3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
    f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
    be242dba385bc0689be16454e959f4b64c87abce tree images      

Oh - but wait - that's just a tree listing too, let's make a hash for that, and
put it into the ``.ahole/objects`` directory::

    │   ├── objects
    │   │   ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)  
    │   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
    │   │   ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
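
As a rough sketch of how this could be automated, the function below stores a
directory in the object database, calling itself for each subdirectory.  The
names ``store_object`` and ``store_tree`` are made up for this illustration,
and sorting the entries by name is an assumption, made so that the same
directory always gives the same listing::

    import hashlib
    import os

    def store_object(contents, objects_dir):
        # Write bytes into the object database, named by their SHA1 hash
        obj_hash = hashlib.sha1(contents).hexdigest()
        with open(os.path.join(objects_dir, obj_hash), 'wb') as fobj:
            fobj.write(contents)
        return obj_hash

    def store_tree(dir_path, objects_dir):
        # Store every file and subdirectory, then store the listing itself
        rows = []
        for name in sorted(os.listdir(dir_path)):
            if name == '.ahole':
                continue  # never store the repository inside itself
            path = os.path.join(dir_path, name)
            if os.path.isdir(path):
                # A subdirectory becomes a tree object of its own
                rows.append(store_tree(path, objects_dir) + ' tree ' + name)
            else:
                with open(path, 'rb') as fobj:
                    file_hash = store_object(fobj.read(), objects_dir)
                rows.append(file_hash + ' file ' + name)
        listing = '\n'.join(rows) + '\n'
        # The return value is the hash of this directory's listing
        return store_object(listing.encode('utf-8'), objects_dir)

Assuming the ``.ahole/objects`` directory already exists, calling
``store_tree('.', '.ahole/objects')`` would return the hash of the root listing,
and store the ``images`` listing along the way.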

Right - so now our whole commit boils down to our ``info.txt`` file, and the
hash for the root tree (the one starting 'e52dc' above). We can get rid of the
old ``files`` subdirectory in the commit, and add the hash for the root tree
instead - something like::

    committer = Eve
    message = Adding funny pictures
    date = year0-jan-06
    root_tree = e52dc9dbe358c549df65307652ff2709322812b3 
    parent = 0a01a0

Now we can solve the annoying problem of finding a unique commit id for each
commit.   We just make a hash for the ``info.txt`` file, and put that into the
``.ahole/objects`` directory too, as a *commit* file::

    │   ├── objects
    │   │   ├── 7e0cda8c145b300b519ed28998a31f801b6d626f (latest commit)
    │   │   ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)  
    │   │   ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)

The unique id for the commit is the hash for its contents. In this case the
commit id is '7e0cda8c145b300b519ed28998a31f801b6d626f'.  Don't forget that the
hash is more or less unique to the contents, so this commit will have an id that
is unique to the combination of the committer, message, date, root tree hash and
commit parent.  The root tree hash is unique to the contents of the root tree
listing, and the root tree listing contains file hashes, which are in turn
unique to the file contents, so the root tree hash will be unique to the file
contents of the commit.  Thus, the commit id is unique to all the things that go
into the commit, including the contents.  It's clever, isn't it?
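
A small sketch makes this chain of dependencies concrete.  It builds the text of
``info.txt`` from its fields and hashes it.  ``commit_id`` is a made-up helper
name, and the all-``f`` root tree hash is invented, standing in for "some file
in the snapshot changed"::

    import hashlib

    def commit_id(committer, message, date, root_tree, parent):
        # Build the commit text in the info.txt layout, then hash it
        info_str = 'committer = ' + committer + '\n'
        info_str += 'message = ' + message + '\n'
        info_str += 'date = ' + date + '\n'
        info_str += 'root_tree = ' + root_tree + '\n'
        info_str += 'parent = ' + parent + '\n'
        return hashlib.sha1(info_str.encode('utf-8')).hexdigest()

    original = commit_id('Eve', 'Adding funny pictures', 'year0-jan-06',
                         'e52dc9dbe358c549df65307652ff2709322812b3', '0a01a0')
    # The same fields give the same id
    repeat = commit_id('Eve', 'Adding funny pictures', 'year0-jan-06',
                       'e52dc9dbe358c549df65307652ff2709322812b3', '0a01a0')
    # A different root tree hash gives a different id
    changed = commit_id('Eve', 'Adding funny pictures', 'year0-jan-06',
                        'ffffffffffffffffffffffffffffffffffffffff', '0a01a0')
    print(original == repeat)   # True
    print(original == changed)  # False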

We can now have three types of files in the ``.ahole/objects`` directory -
files, trees, and commits.  

OK - so things are now a little more complicated than our previous setup with
file copies, but lots of things have just got much easier.   For example, we can
now get rid of the ``staging_area`` directory.  The staging area can just be a
single file containing the root tree listing of the snapshot.  Let's call that
file ``.ahole/index``.  Now Eve has done her new commit, that file can just be
the root directory listing of the previous commit (the commit we have just
done)::

    9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
    65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
    3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
    f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
    be242dba385bc0689be16454e959f4b64c87abce tree images      

When Eve makes an edit to ``chapter1.txt``, instead of copying the file to the
``staging_area`` directory, she makes a hash for the new ``chapter1.txt``
contents, she stores the new ``chapter1.txt`` contents in the ``.ahole/objects``
directory using the hash as a filename, and then she edits the ``.ahole/index``
file to point to her new chapter 1 contents instead of the old.  She might
automate this with a small command like ``ahole_stage`` [#need_hashlib]_ ::

    def ahole_stage(fname):
        # Get the hash for the file contents (read as bytes, so photos work too)
        with open(fname, 'rb') as fobj:
            file_contents = fobj.read()
        file_hash = sha1_hash(file_contents)
        # (assuming that the new file is going in the root directory)
        new_root_entry = file_hash + ' file ' + fname
        with open('.ahole/index') as fobj:
            rows = [row.strip() for row in fobj.read().splitlines() if row.strip()]
        if new_root_entry in rows:
            # This exact file contents and filename already present
            return
        # Make an entry for these file contents in the objects database
        with open('.ahole/objects/' + file_hash, 'wb') as fobj:
            fobj.write(file_contents)
        # Drop any old entry for this filename, so the index points to the new
        # contents instead of the old
        rows = [row for row in rows if row.split(' ', 2)[-1] != fname]
        rows.append(new_root_entry)
        # Write index listing with new entry
        with open('.ahole/index', 'w') as fobj:
            fobj.write('\n'.join(rows) + '\n')

Making a new commit involves taking the contents of ``.ahole/index`` and using
it to make a new commit file in ``.ahole/objects``.  Using the structure of our
previous ``ahole_commit`` routine, that might look like::

   def ahole_commit(committer, message):
       # *** this stuff is the same as before ***
       # Get previous (parent) commit id from .ahole/HEAD
       with open('.ahole/HEAD') as fobj:
           head_contents = fobj.read().strip()
       # Check if this is a reference, de-reference if so
       # Also, get file into which to write the new commit id
       if head_contents.startswith('ref: '):
           head_ref = head_contents.replace('ref: ', '')
           head_ref_file = '.ahole/' + head_ref
           with open(head_ref_file) as fobj:
               head_id = fobj.read().strip()
       else:
           head_ref_file = '.ahole/HEAD'
           head_id = head_contents
       # *** the stuff below is different ***
       # Make root tree entry in objects database from .ahole/index
       with open('.ahole/index') as fobj:
           index_contents = fobj.read()
       index_hash = sha1_hash(index_contents)
       with open('.ahole/objects/' + index_hash, 'w') as fobj:
           fobj.write(index_contents)
       # Make commit information with parent set to HEAD
       info_str = 'committer = ' + committer + '\n'
       info_str += 'message = ' + message + '\n'
       # date.today() is a date object; convert it to text before joining
       info_str += 'date = ' + str(date.today()) + '\n'
       info_str += 'root_tree = ' + index_hash + '\n'
       info_str += 'parent = ' + head_id + '\n'
       # Write commit file into objects database, with hash
       commit_hash = sha1_hash(info_str)
       with open('.ahole/objects/' + commit_hash, 'w') as fobj:
           fobj.write(info_str)
       # Set the current commit file to contain new id
       with open(head_ref_file, 'w') as fobj:
           fobj.write(commit_hash)

How about doing a merge?  Remember that, in the bad old days, we had to compare
lots of files between the branches and the common ancestor?  No more.  Now that
we are using hash references, all we need to do is look at the tree listings.
If two tree listings have the same entry (filename and hash), the file is
identical in both trees, and we don't have to load the contents to check. That
makes it very fast to do comparisons between trees that haven't changed much.
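
Here is a minimal sketch of that comparison.  It works on the text of two tree
listings in the row format we have been using.  ``parse_listing`` and
``changed_names`` are names invented for this example, and the sketch assumes
that filenames may contain spaces but hashes and types do not::

    def parse_listing(listing):
        # Map each filename to its (hash, type) pair
        entries = {}
        for row in listing.splitlines():
            if row.strip():
                obj_hash, obj_type, name = row.strip().split(' ', 2)
                entries[name] = (obj_hash, obj_type)
        return entries

    def changed_names(listing_a, listing_b):
        # Names whose entry differs, or that exist in only one of the trees
        tree_a = parse_listing(listing_a)
        tree_b = parse_listing(listing_b)
        all_names = set(tree_a) | set(tree_b)
        return sorted(name for name in all_names
                      if tree_a.get(name) != tree_b.get(name))

    ours = ('9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt\n'
            '65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt\n')
    theirs = ('9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt\n'
              '1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f file chapter2.txt\n')
    print(changed_names(ours, theirs))  # ['chapter2.txt']

The same test works for ``tree`` entries.  If the hash for a subdirectory is the
same in both listings, everything below that subdirectory is identical, and we
never need to open it.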

Eve was right of course.  Now, if we make a new commit, when one file is
changed, all we store is the contents of the file that has changed and a new
tree listing with the updated hash for the changed file.  That makes the storage
for lots and lots of similar trees very efficient.

Someone ought to write this up and give it to the world.  Wait, that's just us.

On the seventh day - there was git
==================================

The seventh day is for resting.   You are all done now, and the hard stuff is
over.  In a state of deep inner peace, you can think about all that you've
discovered in *ahole*:

* A commit refers to a snapshot of the complete set of files for your project
* The staging area (index) defines what will change between your upcoming commit
  and the previous commit
* A branch is just a pointer to a commit; the pointer moves when you do another
  commit
* Version control is very easy to understand

You remind yourself that life is very good, because you don't have to use a
version control system called *ahole*, you can use a very similar system called
git_.

If you use git_, you'll notice that you have lots of *ahole* friends.  You'll
see git creates a ``.git`` subdirectory that contains the repository.  You'll
recognize the ``.git/objects`` directory containing filenames with SHA1 hashes.
You'll see that commits have SHA1 hashes.  You'll recognize the ``.git/HEAD``
file and ``.git/refs/heads`` and ``.git/refs/tags`` and
``.git/refs/heads/master``. There is a ``.git/index`` file, and it is the
staging area. ``.git/index`` is a little more complicated than ``.ahole/index``
because it's adapted to helping with difficult merges, but it's the same idea. 
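
One difference you might notice if you hash a file yourself: git does not hash
the bare file contents as *ahole* does.  For a file (a *blob* in git's terms),
git hashes a short header, made of the word ``blob``, the length of the contents
and a null byte, followed by the contents.  A minimal sketch, where
``git_blob_hash`` is a name made up for this example::

    import hashlib

    def git_blob_hash(contents):
        # git hashes "blob <size>\0" followed by the file contents
        header = b'blob ' + str(len(contents)).encode('ascii') + b'\0'
        return hashlib.sha1(header + contents).hexdigest()

    # Compare with the output of: echo 'hello world' | git hash-object --stdin
    print(git_blob_hash(b'hello world\n'))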

You now live in the garden of Eden of version control.  Remember to stay away
from that apple tree.

.. rubric:: Footnotes

.. [#ahole_git] ``ahole`` might seem a bit rude to you, but I was born in the UK, and,
     where I come from, 'ahole' is roughly as rude as 'git'. 

.. [#commit_imports] In case you are interested, for the commit and checkout
   code to actually run, you would need some python definitions.  First some
   standard python imports::

      from datetime import date
      from os import mkdir, listdir

   Then we need some simple custom commands for deleting our working tree, and
   for copying files into the working tree::

      from os import remove
      from os.path import isfile, isdir
      from shutil import copyfile, copytree, rmtree

      def delete_tree(path):
          # Delete everything in path unless it's an '.ahole' directory
          for name in listdir(path):
              full_name = path + '/' + name
              # Test the full path, not the bare name, so this also works
              # when path is not the current directory
              if isfile(full_name):
                  remove(full_name)
              elif isdir(full_name):
                  if name != '.ahole':
                      rmtree(full_name)

      def copy_tree(src_path, dst_path):
          # Copy everything in src_path to dst_path
          for name in listdir(src_path):
              src_name = src_path + '/' + name
              dst_name = dst_path + '/' + name
              if isfile(src_name):
                  copyfile(src_name, dst_name)
              elif isdir(src_name):
                  copytree(src_name, dst_name)

   We also need some definition of ``make_unique_id()``.

.. [#need_hashlib] Now you need to add::

       import hashlib

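       def sha1_hash(contents):
           # Accept text or bytes; return the hash as 40 hex characters
           if not isinstance(contents, bytes):
               contents = contents.encode('utf-8')
           return hashlib.sha1(contents).hexdigest()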

    .

.. include:: links_names.inc