Git foundations
“Foundations” is a little joke on a religious theme; our page borrows heavily from the git parable - so - why not a foundation myth?
On the first day - the repository and the working tree
I’m young and the world is fresh and I have lots of time.
I have so much time, that I decide that I want to write a book. The project I
am working on is a modest history and explanation of everything, with
provisional title “The Book”. It has a table of contents, contents.txt, and a
single book chapter:
.
├── chapter1.txt
└── contents.txt
As I start to write, I begin to think I should keep track of my changes. I need some sort of version control system. How hard can it be?
I start with some names. I’m going to call this set of files that I’m working
on, the working tree. Because I currently lack any shame about body issues, I
will call my new versioning system ahole
1 .
I decide that I need to store the state of The Book at the end of each
day. To do this, I make a new directory in my working tree, called .ahole
.
This directory will store The Book as it evolves into a world-wide best-seller.
I will use the name repository for the contents of .ahole
.
At the end of the day, I make a copy of all the files in the working tree, and
save it in my new .ahole
repository. In fact, what I will do is, make a new
subdirectory in .ahole
named for today’s date, then store a copy of the book
files in there. On unix, that might look like this:
mkdir .ahole/year0-jan-01
mkdir .ahole/year0-jan-01/files
cp * .ahole/year0-jan-01/files
So I’ve still got the contents of The Book in the working tree, but now, in the repository, I have a copy of the files that is a snapshot of The Book as of today:
.
├── .ahole
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── chapter1.txt
└── contents.txt
On the second day - staging and commits
Today I do some more work on the book. I start work on chapter 2, and, while
I’m thinking about things, I find that I am also writing some notes to myself
about this character “Eve” that I have seen wandering around. I save those
notes in a file called something_about_eve.txt
. When I get to the end of
the day, I get ready to store my work. At the moment, my directory looks like
this:
.
├── .ahole
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
For some reason I can’t put my finger on, I don’t want to put
something_about_eve.txt
into the repository at the moment. In fact, in
general, I want to choose which changes I back up into the repository, and which
changes I leave for another day. In the end I come up with an idea. I’ll make
a directory in .ahole
called staging_area
. When I start work at the
beginning of the day, I copy the previous backed-up version of my files from the
repository, into staging_area
. These files are now ready for storing in the
next snapshot:
mkdir .ahole/staging_area
cp .ahole/year0-jan-01/files/* .ahole/staging_area
I now have:
├── .ahole
│   ├── staging_area
│   │   ├── chapter1.txt
│   │   └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
As I work, I decide what I’m going to put into tonight’s snapshot. For example,
maybe I changed chapter1.txt
and I think it’s ready to back up. I copy my
modified version of chapter1.txt
from the working tree to staging_area
.
I’ll call that stage-ing the file. I’ll also ‘stage’ the new chapter2.txt
file (copy it to the staging area). I’m not going to stage
something_about_eve.txt
at the moment.
Now I’ve done that, all the stuff I want to store in the backup is ready. I just need to put it into its own backup snapshot directory. To do that, I just do something like (Unix again):
mkdir .ahole/year0-jan-02
mkdir .ahole/year0-jan-02/files
cp .ahole/staging_area/* .ahole/year0-jan-02/files
I end up with a directory that looks like this:
.
├── .ahole
│   ├── year0-jan-02
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├── staging_area
│   │   ├── chapter2.txt
│   │   ├── chapter1.txt
│   │   └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
I decide that I’ll use the name commit for each of the daily snapshot
directories (year0-jan-01
and year0-jan-02
). The action of adding files
to the staging area, I will call staging files for the commit. I will use
the term committing for the action of making the snapshot directory, and
copying the files from the staging area to the snapshot directory.
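The staging and committing steps could be sketched in Python roughly as follows. This is only an illustration: the function names stage and commit, and the root parameter, are assumptions made for this sketch and are not part of the story so far.

```python
import shutil
from pathlib import Path

REPO = Path(".ahole")            # repository directory, as in the text
STAGING = REPO / "staging_area"  # staging area, as in the text


def stage(filename, root="."):
    """Copy one file from the working tree into the staging area."""
    root = Path(root)
    staging = root / STAGING
    staging.mkdir(parents=True, exist_ok=True)
    shutil.copy2(root / filename, staging / filename)


def commit(name, root="."):
    """Copy everything in the staging area to .ahole/<name>/files."""
    root = Path(root)
    target = root / REPO / name / "files"
    # copytree creates the snapshot directory and any missing parents
    shutil.copytree(root / STAGING, target)
    return target
```

With this sketch, stage('chapter2.txt') followed by commit('year0-jan-02') would leave something_about_eve.txt out of the snapshot, because it was never staged.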
On the third day - history
As a result of certain events yesterday evening, I have a new friend, Eve. She
wants to help out. Of course Eve has her own computer, and I send her my
.ahole
directory. I thank myself for my wisdom in not adding
something_about_eve.txt
to the repository.
Eve checks out our book (reconstructs my working tree) with something like:
cp .ahole/year0-jan-02/files/* .
Now she’s got the book files as I committed them last night. She also copies the last commit files into the staging area, as I did:
├── .ahole
│   ├── staging_area
│   │   ├── chapter2.txt
│   │   ├── chapter1.txt
│   │   └── contents.txt
She works hard on a new file chapter1_discussion.txt
. It’s good to see she’s
enjoying the work. As the afternoon turns to evening, she gets ready to save
her work, so she copies chapter1_discussion.txt
to .ahole/staging_area
.
Now she is ready to do a commit:
mkdir .ahole/year0-jan-03
mkdir .ahole/year0-jan-03/files
cp .ahole/staging_area/* .ahole/year0-jan-03/files
That is what Eve was going to do, but Eve is smart, and she immediately realizes
that there is a problem. After she has done her commit, both of us will likely
have a commit directory .ahole/year0-jan-03
- but they will have different
contents. If she later wants to share work with me, that could get confusing.
The two of us are a little tired after all our work, and we meet for a beer. We talk about it for a while. At first we think we can just add the time to the date, because that’s likely to be unique for each of us. Then we realize that that’s going to get messy too: if Eve does a commit on her computer, then I do a commit on mine, and she does another one on hers, the times will say that these are all in one sequence, but in fact there are two sequences, mine and Eve’s. We need some other way to keep track of the sequence of commits, one that will work even if the two of us are working independently.
In the end we decide that we are going to give the commits some unique
identifier string instead of the date. We might have a problem in making sure
that the unique identifier string is actually unique, but let’s assume we can
solve that somehow. We’ll store the contents of the working tree in the same
way as we have done up till now, in the files
subdirectory, but we’ll add a
new file to each commit, called info.txt
, that will tell us who did the
commit, and when, and, most importantly, what the previous commit was. We’ll
call the previous commit the parent.
Eve was right to predict that I had made my own commit today. I’ve been happily working on chapter 3. So, before our conversation, my directory looked like this:
.
├── .ahole
│   ├── year0-jan-03
│   │   └── files
│   │       ├── chapter3.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├── staging_area
│   │   ├── chapter3.txt
│   │   ├── chapter2.txt
│   │   ├── chapter1.txt
│   │   └── contents.txt
│   ├── year0-jan-02
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── year0-jan-01
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter3.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
but now we’ve worked out the new way, it looks like this:
.
├── .ahole
│   ├── 5d89f8
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter3.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├── staging_area
│   │   ├── chapter3.txt
│   │   ├── chapter2.txt
│   │   ├── chapter1.txt
│   │   └── contents.txt
│   ├── 7ef41f
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── 6438a4
│       ├── info.txt
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── something_about_eve.txt
├── chapter3.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
and .ahole/5d89f8/info.txt
looks like this:
committer = Adam
message = Third day
date = year0-jan-03
parent = 7ef41f
Meanwhile, Eve’s directory looks like this:
.
├── .ahole
│   ├── 0a01a0
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter1_discussion.txt
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   ├── staging_area
│   │   ├── chapter1_discussion.txt
│   │   ├── chapter2.txt
│   │   ├── chapter1.txt
│   │   └── contents.txt
│   ├── 7ef41f
│   │   ├── info.txt
│   │   └── files
│   │       ├── chapter2.txt
│   │       ├── chapter1.txt
│   │       └── contents.txt
│   └── 6438a4
│       ├── info.txt
│       └── files
│           ├── chapter1.txt
│           └── contents.txt
├── chapter1_discussion.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
and Eve’s .ahole/0a01a0/info.txt
looks like this:
committer = Eve
message = Eve day 3
date = year0-jan-03
parent = 7ef41f
After a little thought, Eve and I realize that, when we make our new commit, we
are going to have to know what the current commit is, so we can use that as the
parent. When we make a new commit, we store the commit identifier in a file.
We’ll call this file .ahole/HEAD
, so, after my last commit above, the file
.ahole/HEAD
will have the contents 5d89f8
. We use the contents of
.ahole/HEAD
to identify the last (current) commit. And of course, when we
make a new commit, we can get the parent of the new commit, from the current
commit in .ahole/HEAD
.
So now, we have a new procedure for our commit. In outline it looks like this (now in python syntax) 2
import os
import shutil
from datetime import date

def ahole_commit(committer, message):
    # Make a unique identifier for this commit somehow
    # (make_unique_id is a placeholder; we have not worked out how to write it yet)
    new_id = make_unique_id()
    # Make a new directory in .ahole with the new unique name
    commit_dir = '.ahole/' + new_id
    os.makedirs(commit_dir)
    # Copy the files from the staging area to the new snapshot directory
    shutil.copytree('.ahole/staging_area', commit_dir + '/files')
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as head_file:
        head_id = head_file.read().strip()
    # Make info with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    info_str += 'date = ' + str(date.today()) + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write info to info.txt file
    with open(commit_dir + '/info.txt', 'w') as info_file:
        info_file.write(info_str)
    # Set .ahole/HEAD to contain new commit id
    with open('.ahole/HEAD', 'w') as head_file:
        head_file.write(new_id)
When we want to go back to an earlier state of the book, we can do a checkout, with something like:
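import os
import shutil

def ahole_checkout(commit_id):
    commit_dir = '.ahole/' + commit_id
    # Clear the working tree, taking care to keep the .ahole repository
    for name in os.listdir('.'):
        if name == '.ahole':
            continue
        if os.path.isdir(name):
            shutil.rmtree(name)
        else:
            os.remove(name)
    # Copy .ahole/<commit_id>/files into the working tree (Python 3.8+)
    shutil.copytree(commit_dir + '/files', '.', dirs_exist_ok=True)
    # Make .ahole/HEAD contain commit_id
    with open('.ahole/HEAD', 'w') as head_file:
        head_file.write(commit_id)
    # Copy the commit snapshot into the staging area
    shutil.rmtree('.ahole/staging_area', ignore_errors=True)
    shutil.copytree(commit_dir + '/files', '.ahole/staging_area')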
So, when we run ahole_checkout('7ef41f') we will get the copy of the working
tree corresponding to 7ef41f, and .ahole/HEAD will just contain the
string 7ef41f.
In our excitement, we immediately realize that it’s really easy to see the
history of the book now. We can easily fetch out info.txt
from the current
commit, print it, then find its parent, and fetch info.txt
from the parent,
print it, and so on.
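That walk back through the parents might look something like the sketch below. It assumes that info.txt keeps the "key = value" layout shown earlier, that .ahole/HEAD holds a bare commit id, and that the very first commit has no parent entry (or an empty one). The names read_info and ahole_log are invented for this example.

```python
from pathlib import Path


def read_info(repo, commit_id):
    """Parse .ahole/<commit_id>/info.txt into a dict of 'key = value' fields."""
    info = {}
    text = (Path(repo) / commit_id / "info.txt").read_text()
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            info[key.strip()] = value.strip()
    return info


def ahole_log(repo=".ahole"):
    """Return (commit id, info) pairs from the current commit back to the first."""
    history = []
    commit_id = (Path(repo) / "HEAD").read_text().strip()
    while commit_id:
        info = read_info(repo, commit_id)
        history.append((commit_id, info))
        # A commit without a parent is assumed to be the first one,
        # which ends the walk.
        commit_id = info.get("parent", "")
    return history
```

Printing each pair from ahole_log() in order gives the history of the book, newest commit first.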
Now we are tired, but happy, and we rest.
On the fourth day - references
We wake with a strange excitement. The idea of keeping a reference to the
current commit in .ahole/HEAD seems like it could be more general. I
talk to Eve over breakfast (she stayed in her own place of course, but she came
over for work). Together we work out the concept of references. A reference
is:
- Reference
Something that points to a commit
So, .ahole/HEAD
is a reference - to the current commit. But what if I decide
that I want to give out some preliminary version of our book. Let’s say I want
to release the book stored in .ahole/7ef41f/files
as ‘release-0.1’. I’m
going to send this out to all my friends (to be honest, I don’t have many
friends just yet, but still). I want to be able to remember what version of the
book I sent out. I can make a reference to this commit. I’ll call this a
tag. I make a new directory in .ahole
called refs
, and another
directory in refs
, called tags
, and then, in
.ahole/refs/tags/release-0.1
I just put 7ef41f
- a reference to the
release commit. That way, if I ever need to go back to the version of the book
I released, I just have to read the release-0.1
file to find the commit, and
then checkout that commit.
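Making and reading a tag could be sketched like this. The helper names ahole_tag and read_tag are assumptions for illustration; the layout .ahole/refs/tags/<name> is the one described above.

```python
from pathlib import Path


def ahole_tag(name, commit_id, repo=".ahole"):
    """Write commit_id into .ahole/refs/tags/<name>."""
    tag_file = Path(repo) / "refs" / "tags" / name
    # Create .ahole/refs and .ahole/refs/tags if they are not there yet
    tag_file.parent.mkdir(parents=True, exist_ok=True)
    tag_file.write_text(commit_id)
    return tag_file


def read_tag(name, repo=".ahole"):
    """Return the commit id that the tag points to."""
    return (Path(repo) / "refs" / "tags" / name).read_text().strip()
```

So ahole_tag('release-0.1', '7ef41f') records the release, and read_tag('release-0.1') later returns the id of the commit to check out.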
Wait, but, there’s a problem. If I checkout the commit in release-0.1
, I
will overwrite .ahole/HEAD
, and I will lose track of what commit I was
working on before.
Let’s store that in another reference. Let’s use the name ‘master’ for my main
line of development. I store where this is, by making a new file
.ahole/refs/heads/master
that is a reference to the last commit. It just
contains the text ‘5d89f8’. So that I know that I am working on ‘master’, I
make .ahole/HEAD
have the text ref: refs/heads/master
. Now, when I make
a new commit, I first check .ahole/HEAD
; if I see ref:
refs/heads/master
, then first, I get the commit id in
.ahole/refs/heads/master
- and I use that as the parent id for the commit.
When I’ve saved the new commit, I set .ahole/refs/heads/master
to have the
new commit id. So, I need to modify my commit procedure slightly:
import os
import shutil
from datetime import date

def ahole_commit(committer, message):
    # *** this stuff down to the next *** line is new
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as head_file:
        head_contents = head_file.read().strip()
    # Check if this is a reference, de-reference if so
    # Also, get file into which to write the new commit id
    if head_contents.startswith('ref: '):
        head_ref = head_contents[len('ref: '):]
        head_ref_file = '.ahole/' + head_ref
        with open(head_ref_file) as ref_file:
            head_id = ref_file.read().strip()
    else:
        head_ref_file = '.ahole/HEAD'
        head_id = head_contents
    # *** the stuff below you've seen before (until *** again)
    # Make a unique identifier for this commit somehow
    # (make_unique_id is still a placeholder)
    new_id = make_unique_id()
    # Make a new directory in .ahole with the new unique name
    commit_dir = '.ahole/' + new_id
    os.makedirs(commit_dir)
    # Copy the files from the staging area to the new snapshot directory
    shutil.copytree('.ahole/staging_area', commit_dir + '/files')
    # Make info.txt with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    info_str += 'date = ' + str(date.today()) + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write info to info.txt file
    with open(commit_dir + '/info.txt', 'w') as info_file:
        info_file.write(info_str)
    # Set the file that points to the current commit, to point to our commit
    # *** a little new, in that we might be writing to .ahole/HEAD, or
    # something like .ahole/refs/heads/master, depending on what .ahole/HEAD
    # contained at the top of this routine
    with open(head_ref_file, 'w') as ref_file:
        ref_file.write(new_id)
So, let’s say that I’m currently on commit ‘5d89f8’. .ahole/HEAD
contains
ref: refs/heads/master
. .ahole/refs/heads/master
contains 5d89f8
.
I run my commit procedure:
ahole_commit('Adam', 'Night follows day')
The commit procedure has made a new commit ‘dfbeda’; .ahole/HEAD
continues
to have text ref: refs/heads/master
, but now .ahole/refs/heads/master
contains dfbeda
. In this way, we keep track of which commit we are on, by
constantly updating ‘master’.
Ok - now let’s return to me checking out the released version of the book. I
first get the contents of .ahole/refs/tags/release-0.1 - it’s ‘7ef41f’.
Then I checkout the working tree for that version, using my nice
ahole_checkout procedure:
ahole_checkout('7ef41f')
The checkout procedure will make .ahole/HEAD contain the text 7ef41f.
Now I want to go back to working on my current version of the book. That’s the
set of files pointed to by .ahole/refs/heads/master
. I can
check the contents of .ahole/refs/heads/master
- it is dfbeda
. Then I
get the current version with the normal checkout procedure:
ahole_checkout('dfbeda')
Finally, I’ll have to set .ahole/HEAD
to be ref: refs/heads/master
. All
good.
Of course, I could automate this, by modifying my checkout procedure slightly:
import os
import shutil

def ahole_checkout(commit_reference):
    # If this is a reference, dereference
    if commit_reference in os.listdir('.ahole/refs/heads'):
        # it's a head reference, maybe 'master'
        head_reference = True
        fname = '.ahole/refs/heads/' + commit_reference
    elif commit_reference in os.listdir('.ahole/refs/tags'):
        # it's a tag reference
        head_reference = False
        fname = '.ahole/refs/tags/' + commit_reference
    else:  # Just a standard commit id
        head_reference = False
        fname = None
    if fname is None:
        commit_id = commit_reference
    else:
        with open(fname) as ref_file:
            commit_id = ref_file.read().strip()
    commit_dir = '.ahole/' + commit_id
    # Clear the working tree, taking care to keep the .ahole repository
    for name in os.listdir('.'):
        if name == '.ahole':
            continue
        if os.path.isdir(name):
            shutil.rmtree(name)
        else:
            os.remove(name)
    # copy .ahole/<commit_id>/files into working tree (Python 3.8+)
    shutil.copytree(commit_dir + '/files', '.', dirs_exist_ok=True)
    # make .ahole/HEAD point to commit id
    with open('.ahole/HEAD', 'w') as head_file:
        if head_reference:
            # Point HEAD at the head reference; the reference file already
            # holds commit_id, so it does not need rewriting
            head_file.write('ref: refs/heads/' + commit_reference)
        else:
            head_file.write(commit_id)
    # copy commit snapshot into staging area
    shutil.rmtree('.ahole/staging_area', ignore_errors=True)
    shutil.copytree(commit_dir + '/files', '.ahole/staging_area')
What then, is the difference, between a tag - like our release - and the moving target like ‘master’? The ‘tag’ is a static reference - it does not change when we do a commit and always points to the same commit. ‘master’ is a dynamic reference - in particular, it’s a head reference:
- Head
A head is a reference that updates when we do a commit
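To make the two cases concrete, here is a small sketch of how the current commit might be looked up whether .ahole/HEAD holds a bare commit id or a "ref: ..." line pointing at a head. The function name resolve_head is an assumption for this example.

```python
from pathlib import Path


def resolve_head(repo=".ahole"):
    """Return the commit id that .ahole/HEAD currently refers to."""
    repo = Path(repo)
    contents = (repo / "HEAD").read_text().strip()
    if contents.startswith("ref: "):
        # HEAD names a head such as refs/heads/master; that file holds the id
        ref_path = repo / contents[len("ref: "):]
        return ref_path.read_text().strip()
    # Otherwise HEAD holds the commit id directly
    return contents
```

When HEAD points at a head, a commit changes what resolve_head returns because the head file is updated; a tag file, by contrast, is never rewritten by a commit.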
My head is hurting a little, after Eve explains all this, but after a little
while and a nice apple pie, I’m feeling positive about ahole
.
On the fifth day - branches, merges and remotes
Yesterday was a little exhausting, so today there was some time for reflection.
As Eve and I relax with the other animals, who are all getting on very well with each other, we begin to realize that this head thing could be very useful.
For example, what if one of my very small number of friends tells me that there’s a serious conceptual error in the version of the book that I released - ‘release-0.1’. What if I want to go back and fix it - that is - do another commit on top of the released book, instead of the version of the book that I’m currently working on? I can just make a new head. I’ll do it like this:
cp .ahole/refs/tags/release-0.1 .ahole/refs/heads/working-on-0.1
Then, I look at what commit working-on-0.1
contains - of course it’s
7ef41f
. I get that state of the book with my new checkout procedure:
ahole_checkout('working-on-0.1')
This changes .ahole/HEAD
to be ref: refs/heads/working-on-0.1
. Now,
when I do a commit with ahole_commit
, that will update the file
.ahole/refs/heads/working-on-0.1
to have the new commit identifier. Despite the
apple pie being a bit bitter last night, we’re feeling good.
As we think about this, we come to think of ‘master’ and ‘working-on-0.1’ as branches - because they can each be thought of as identifying a tree or graph of commits, which can grow. All I need, to make a new branch, is to make a new head reference to a commit. For example, if I want to make a new branch starting at the current position of ‘master’, all I need is:
cp .ahole/refs/heads/master .ahole/refs/heads/my-new-branch
If I want to work on this branch, I need to check it out, with:
ahole_checkout('my-new-branch')
That will get the commit identifier in .ahole/refs/heads/my-new-branch
, unpack
the commit tree into the working tree, and set .ahole/HEAD
to contain the
text ref: refs/heads/my-new-branch.
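Since a branch is just a head file holding a commit id, creating one could be sketched as below. The name ahole_branch and its parameters are assumptions for illustration; it does the same job as the cp command, starting from an existing head.

```python
from pathlib import Path


def ahole_branch(new_branch, start_head="master", repo=".ahole"):
    """Create refs/heads/<new_branch> pointing at the same commit as start_head."""
    heads = Path(repo) / "refs" / "heads"
    commit_id = (heads / start_head).read_text().strip()
    (heads / new_branch).write_text(commit_id)
    return commit_id
```

The new branch and the head it started from point at the same commit until a commit is made on one of them.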
I’ve got my branches, but Eve will have her own branches, and this will help us know where each of us is working.
That’s good, because Eve is now asking me if I can have a look at her changes, and whether I’ll include them in my version of the book. Unwisely I end up suggesting that women don’t contribute to books, and ask her why her hair isn’t covered with an as-yet not-invented headscarf. In the end we patch it up, and I agree to go back and try and put in her changes.
Luckily, despite the lack of basics like clothing, there is an excellent local
network, so I can see the contents of her version of the book at
/eves_computer/our_book/.ahole
. She wants me to look at her ‘master’
branch. Just because the network might fail, I need to fetch what I need from
her computer to mine. So, to keep track of things, I’ll make a new directory,
called .ahole/refs/remotes/eve
, and I’ll copy all her heads - in this case
just master
- to that directory. So now, I’ve got
.ahole/refs/remotes/eve/master
, and in fact, it points to the commit that
she did on the third day; this was commit ‘0a01a0’. I don’t have this
commit in my .ahole
directory, so I’ll copy that from
/eves_computer/our_book/.ahole/0a01a0
. I look in the info.txt
file
for that commit, and check what the parent is. It is ‘7ef41f’. I check if I
have that, and yes, I have, so I can stop copying stuff from Eve’s directory.
So, what I just did was:
- Copy Eve’s head references from /eves_computer/our_book/.ahole/refs/heads to my .ahole/refs/remotes/eve.
- For each of the references in .ahole/refs/remotes/eve, check whether I have the referenced commit, and the parents of that commit, and, if not, copy them to .ahole.
We decide to call that two-step sequence - a fetch.
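The two steps of a fetch might be sketched as follows, for the commit-directory layout used so far. The names ahole_fetch and parent_ids are invented for this sketch, and it assumes, as in the story, that if I already have a commit then I also have all of its ancestors.

```python
import shutil
from pathlib import Path


def parent_ids(commit_dir):
    """Return the parent ids listed in a commit's info.txt (possibly none)."""
    parents = []
    for line in (commit_dir / "info.txt").read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "parent" and value.strip():
            parents.append(value.strip())
    return parents


def ahole_fetch(remote_repo, remote_name, repo=".ahole"):
    """Copy remote heads and any missing commits; return the ids copied."""
    remote_repo, repo = Path(remote_repo), Path(repo)
    # Step 1: copy the remote head references to refs/remotes/<remote_name>
    remote_refs = repo / "refs" / "remotes" / remote_name
    remote_refs.mkdir(parents=True, exist_ok=True)
    for head in (remote_repo / "refs" / "heads").iterdir():
        shutil.copy2(head, remote_refs / head.name)
    # Step 2: copy any commits we lack, walking back through the parents
    copied = []
    todo = [ref.read_text().strip() for ref in remote_refs.iterdir()]
    while todo:
        commit_id = todo.pop()
        if (repo / commit_id).exists():
            continue  # we have this commit, so (by assumption) its ancestors too
        shutil.copytree(remote_repo / commit_id, repo / commit_id)
        copied.append(commit_id)
        todo.extend(parent_ids(repo / commit_id))
    return copied
```

In the story, ahole_fetch('/eves_computer/our_book/.ahole', 'eve') would copy only ‘0a01a0’, because its parent ‘7ef41f’ is already in my repository.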
Now I want to look at her version of the book. I have her head references and
the commits they point to, so I can checkout her latest version. I first get the
commit identifier from .ahole/refs/remotes/eve/master
- ‘0a01a0’. Then:
ahole_checkout('0a01a0')
This will put ‘0a01a0’ into .ahole/HEAD
. I can look at her version of the
book, and decide if I like it. If I do, then I can do a merge.
What is a merge? It’s the join of two commits. First I work out where Eve’s
tree diverged from mine, by going back in her history, following the parents of
the commits. In this case it’s easy, because the parent commit (‘7ef41f’) of
this commit (‘0a01a0’) is one that is also in my history (the history for my
‘master’ branch). This most recent shared commit I will call the common
ancestor. Then I work out the difference between the common ancestor commit
(‘7ef41f’) and this commit (‘0a01a0’) - let’s call that eves_diff
.
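Finding the common ancestor could be sketched like this. For brevity the sketch takes the parent links as an in-memory dictionary rather than reading info.txt files; the function names are assumptions, and the nearest-first walk is a simplification, not a claim about how any real tool chooses a merge base.

```python
from collections import deque


def ancestors(commit_id, parents):
    """Return the set of commits reachable from commit_id, including itself."""
    seen = set()
    todo = [commit_id]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(parents.get(current, []))
    return seen


def common_ancestor(mine, theirs, parents):
    """Walk back from `theirs`, nearest commits first, until we reach
    a commit that is also in the history of `mine`."""
    my_history = ancestors(mine, parents)
    queue = deque([theirs])
    visited = set()
    while queue:
        current = queue.popleft()
        if current in my_history:
            return current
        if current not in visited:
            visited.add(current)
            queue.extend(parents.get(current, []))
    return None


# Parent links from the story so far
parents = {
    "6438a4": [],
    "7ef41f": ["6438a4"],
    "5d89f8": ["7ef41f"],
    "dfbeda": ["5d89f8"],
    "0a01a0": ["7ef41f"],
}
print(common_ancestor("dfbeda", "0a01a0", parents))  # prints 7ef41f
```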
I go back to my own ‘master’, which (according to .ahole/refs/heads/master) is
at commit ‘dfbeda’:
ahole_checkout('master')
This will change .ahole/HEAD
to be ref: refs/heads/master
- and I will
have just got the working tree from .ahole/dfbeda/files
. Then I take
eves_diff
and apply it to my current working tree. If there were any
conflicts, I resolve them, but in my world, there are no conflicts. I have a
feeling there may be some later. That apple pie is making me feel a little
funny.
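As a very rough sketch of working out eves_diff and applying it, treat each snapshot as a dictionary from filename to contents and compare whole files. This is a deliberate simplification: it does not look inside files, and it does not detect conflicts at all. The names snapshot_diff and apply_diff are made up for this example.

```python
def snapshot_diff(base, other):
    """Return the files that differ between two snapshots.

    Each snapshot maps filename -> contents. In the result, a value of
    None means the file was deleted in `other`.
    """
    changes = {}
    for name in base.keys() | other.keys():
        if base.get(name) != other.get(name):
            changes[name] = other.get(name)
    return changes


def apply_diff(tree, changes):
    """Return a copy of `tree` with the changes applied (no conflict checks)."""
    merged = dict(tree)
    for name, contents in changes.items():
        if contents is None:
            merged.pop(name, None)
        else:
            merged[name] = contents
    return merged
```

Here eves_diff would be snapshot_diff(common_ancestor_files, eves_files), and the merged working tree would be apply_diff(my_files, eves_diff).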
Finally, I make a new commit, with a new unique ID - say ‘80cc85’, with the merged working tree. But, there’s a trick: here the new commit ‘80cc85’ - has two parents, first - ‘dfbeda’ - the previous commit in my ‘master’, and second ‘0a01a0’ - the last commit in Eve’s master. Now, the next time I look at Eve’s tree, I will be able to see that I’ve got her ‘0a01a0’ commit in my own history, and won’t need to apply it again.
On the sixth day - saving time and space with objects
I am now very happy with ahole
, but Eve clearly doesn’t think we’ve got it
right yet.
As she’s thinking, she decides to make a couple of illustrations for The Book, so she adds some photos to her working tree:
.
├── .ahole
│   ...
├── images
│   ├── adam_with_apple.jpg
│   └── lion_with_lamb.jpg
├── chapter1_discussion.txt
├── chapter2.txt
├── chapter1.txt
└── contents.txt
As soon as she does this, she realizes what’s wrong with ahole
. The photos
are large files. At the moment, every time we make a commit, we’re copying all
the files into the commit files
directory to make the snapshot. With big
files, this is going to lead to many identical copies and lots of wasted space.
Eve realizes that what we need to do, is to make the commit use references to files, rather than the files themselves. That way, when the commit has files that have not changed, it can just point to the unchanged file, rather than carrying a wasteful copy of the file.
If the commits just store references, we need a way to store the contents of the
files, so they can be referenced. Maybe we could store the files for our
snapshots in a directory, and use some sort of unique filename so that the
commits can reference that filename? For example, maybe we could make a
directory in .ahole
like this:
mkdir .ahole/objects
and use this directory to store the contents of the files for our snapshots.
Then we could store the commits as something like a table, where the entries
would tell us how to get the matching files from the .ahole/objects
directory.
We could have some structure for the commits like this:
├── .ahole
│   ├── 5d89f8
│   │   ├── info.txt
│   │   └── file_list
where .ahole/5d89f8/file_list
would be a list of references to files in the
.ahole/objects
directory, along with the filename that the contents has when
reconstructed back into the snapshot. For example, maybe file_list
would
have a series of (object reference, filename) pairs like this:
contents_version1 contents.txt
chapter1_version1 chapter1.txt
chapter2_version2 chapter2.txt
chapter3_version1 chapter3.txt
chapter1_discussion_version1 chapter1_discussion.txt
These references in the first column could match filenames in the
.ahole/objects
directory:
│ ├── objects
│ │ ├── chapter1_version1
│ │ ├── chapter2_version1
│ │ ├── chapter2_version2
│ │ ├── chapter3_version1
│ │ ├── chapter1_discussion_version1
│ │ └── contents_version1
We could think of the .ahole/objects
directory as a very simple form of
database, where the keys are the filenames, and the file contents are the
values.
We think about this for a while and realize that it’s going to be annoying
trying to find unique names to use as filenames in .ahole/objects
, because
there will be many versions of many files. For example chapter1_version2
,
chapter1_version3
and so on is clearly not going to work, because when Eve
and I work independently, at some point we’re both going to have something like
a chapter1_version3
in our respective .ahole/objects
directories, but
they will be different, and that will be confusing.
At this stage, Eve reveals that she has some training in computer science. Of course I have no idea what that is, or who did the training, but she’s in too much of a rush to explain that now. She proposes that we make the filenames (database keys) by doing hashes of the file contents. It turns out that hashing algorithms can take a stream of bytes such as the contents of a file, and create a string that is near-enough unique to that stream of bytes. That’s really good, because it means that, if Eve and I have an object with the same filename (hash) that means it almost certainly contains the exact same contents.
Eve recommends the ‘SHA1’ hashing algorithm, and I’m in no position to disagree with her. Now we’ve got a unique string to use as a key for each file. For example, we run the SHA1 algorithm over the current book files and we get these:
| Filename | SHA1 hash |
|---|---|
| chapter1.txt | 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 |
| chapter2.txt | 65735b3705284cdf4a66c2e4812ca13cbaa7cd5d |
| chapter1_discussion.txt | 3c2e09cc43568f13444c075c84b047957f7995a5 |
| contents.txt | f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 |
If we change the file at all, then the hash changes, and we have a new unique string and therefore we have a new unique filename with which to store the new contents. For example, the original version of chapter 2 was a bit shorter, and had a hash of ‘1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f’.
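Computing such a hash takes only a few lines with Python's standard hashlib module. The helper names sha1_of_bytes and sha1_of_file are assumptions for this sketch; the hashes in the table above come from the story's files, which are not reproduced here.

```python
import hashlib


def sha1_of_bytes(data):
    """Return the SHA1 hash of a bytes object as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


def sha1_of_file(path, chunk_size=65536):
    """Hash a file in chunks, so large photos need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# The same bytes always give the same hash; any change gives a different one
print(sha1_of_bytes(b"Chapter 2, short version"))
print(sha1_of_bytes(b"Chapter 2, longer version"))
```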
Now we have got a nice way of making the references that will go into
.ahole/5d89f8/file_list
. First we store the file versions in our
.ahole/objects
directory, using their hash values as filenames:
│ ├── objects
│ │ ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)
│ │ ├── 1cf01a1dfbe135b6132362fa8e17eaefcaf00a7f (chapter2 version 1)
│ │ ├── 65735b3705284cdf4a66c2e4812ca13cbaa7cd5d (chapter2 version 2)
│ │ ├── 3c2e09cc43568f13444c075c84b047957f7995a5 (chapter1_discussion version 1)
│ │ └── f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 (contents version 1)
Next we create .ahole/5d89f8/file_list
with one row per file in our
directory. Each row contains first - the hash value (and therefore filename in
.ahole/objects
) which allows me to get the file contents, then the type of
thing this is - here a file - and lastly, the filename as it was in the
snapshot:
9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
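Storing file contents under their hash, and building a listing like the one above, could be sketched as follows. The names store_object and make_file_list are assumptions, and sorting the rows by filename is a choice made here so that the same files always give the same listing.

```python
import hashlib
from pathlib import Path


def store_object(data, repo=".ahole"):
    """Store bytes in .ahole/objects under their SHA1 hash; return the hash.

    Identical contents always get the same filename, so they are stored once.
    """
    key = hashlib.sha1(data).hexdigest()
    objects = Path(repo) / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    path = objects / key
    if not path.exists():
        path.write_bytes(data)
    return key


def make_file_list(filenames, root=".", repo=".ahole"):
    """Store each file as an object and return 'hash file name' rows as text."""
    rows = []
    for name in sorted(filenames):
        key = store_object((Path(root) / name).read_bytes(), repo)
        rows.append(f"{key} file {name}")
    return "\n".join(rows) + "\n"
```

A file that has not changed between two commits hashes to the same key, so the second commit adds a row to its listing but no new copy to .ahole/objects.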
Now, what about Eve’s new working tree with the photos in it? The photos are in
the images
subdirectory, and we don’t have a way of storing subdirectories
yet. Aha - why not store directories in the object database too? Directories
can just be tree files like file_list
. tree files are lists, one entry
per row, where each row contains the hash reference for the file contents, the
type of thing it is (tree or file), and the filename as it was in the snapshot.
So, for Eve’s new commit, we’d first store the contents of the two photo files
in the .ahole/objects
directory:
│ ├── objects
│ │ ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
│ │ ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
│ │ ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)
etc. Then we make a new tree file called - say - ‘images_listing’ like this:
82e6792faa893070dcd6fe3e614b6f147be1a0a9 file adam_with_apple.jpg
e8b23357995db47e70906d4c7a08114c0c0ba376 file lion_with_lamb.jpg
and we make a hash for that tree file too, and put that into
.ahole/objects
:
│ ├── objects
│ │ ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
│ │ ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
│ │ ├── e8b23357995db47e70906d4c7a08114c0c0ba376 (lion_with_lamb.jpg)
│ │ ├── 9e398c7cf8d56e960aa7769839cc0c38b8e12f11 (chapter1 version 1)
etc. Now maybe our whole commit listing can include files and directories for the root directory of our project, something like:
9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
be242dba385bc0689be16454e959f4b64c87abce tree images
Oh - but wait - that’s just a tree listing too, let’s make a hash for that, and
put it into the .ahole/objects
directory:
│ ├── objects
│ │ ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)
│ │ ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
│ │ ├── 82e6792faa893070dcd6fe3e614b6f147be1a0a9 (adam_with_apple.jpg)
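Because a directory listing is itself stored as an object, the whole working tree can be stored with one recursive function. The sketch below assumes the "hash type name" row format shown above; the function names are invented for the example, and rows are sorted by name so that identical directories give identical tree hashes.

```python
import hashlib
from pathlib import Path


def store_object(data, repo):
    """Store bytes in <repo>/objects under their SHA1 hash; return the hash."""
    key = hashlib.sha1(data).hexdigest()
    objects = Path(repo) / "objects"
    objects.mkdir(parents=True, exist_ok=True)
    (objects / key).write_bytes(data)
    return key


def write_tree(directory, repo):
    """Store a directory as a tree object and return the tree's hash."""
    rows = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.name == ".ahole":
            continue  # the repository is not part of the snapshot
        if entry.is_dir():
            # Store the subdirectory first, then refer to it by its hash
            rows.append(f"{write_tree(entry, repo)} tree {entry.name}")
        else:
            rows.append(f"{store_object(entry.read_bytes(), repo)} file {entry.name}")
    listing = "\n".join(rows) + "\n"
    return store_object(listing.encode("utf-8"), repo)
```

Calling write_tree('.', '.ahole') would store every file, then the images listing, then the root listing, and return the hash of the root listing.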
Right - so now our whole commit boils down to our info.txt
file, and the
hash for the root tree (the one starting ‘e52dc’ above). We can get rid of the
old files
subdirectory in the commit, and add the hash for the root tree
instead - something like:
committer = Eve
message = Adding funny pictures
date = year0-jan-06
root_tree = e52dc9dbe358c549df65307652ff2709322812b3
parent = 0a01a0
Now we can solve the annoying problem of finding a unique commit id for each
commit. We just make a hash for the info.txt
file, and put that into the
.ahole/objects
directory too, as a commit file:
│ ├── objects
│ │ ├── 7e0cda8c145b300b519ed28998a31f801b6d626f (latest commit)
│ │ ├── e52dc9dbe358c549df65307652ff2709322812b3 (root listing)
│ │ ├── be242dba385bc0689be16454e959f4b64c87abce (images_listing)
The unique id for the commit is the hash for its contents. In this case the commit id is ‘7e0cda8c145b300b519ed28998a31f801b6d626f’. Don’t forget that the hash is more or less unique to the contents, so this commit will have an id that is unique to the combination of the committer, message, date, root tree hash and commit parent. The root tree hash is unique to the contents of the root tree listing, and the root tree listing contains file hashes, which are in turn unique to the file contents, so the root tree hash will be unique to the file contents of the commit. Thus, the commit id is unique to all the things that go into the commit, including the contents. It’s clever, isn’t it?
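To see the chain of hashes at work, here is a small sketch that builds a one-file commit entirely in memory. The function names make_commit_info and commit_id_for are made up for this illustration, and the chapter text is a placeholder. The commit fields copy the example info.txt above.

```python
import hashlib


def sha1_hash(text):
    # Hex SHA1 digest of a text string
    return hashlib.sha1(text.encode('utf-8')).hexdigest()


def make_commit_info(committer, message, date, root_tree, parent):
    # Same fields as the example info.txt in the text
    return ('committer = ' + committer + '\n'
            + 'message = ' + message + '\n'
            + 'date = ' + date + '\n'
            + 'root_tree = ' + root_tree + '\n'
            + 'parent = ' + parent + '\n')


def commit_id_for(chapter_text):
    # file contents -> file hash -> root listing -> root tree hash
    # -> commit info -> commit id
    file_hash = sha1_hash(chapter_text)
    root_listing = file_hash + ' file chapter1.txt\n'
    root_tree = sha1_hash(root_listing)
    info = make_commit_info('Eve', 'Adding funny pictures',
                            'year0-jan-06', root_tree, '0a01a0')
    return sha1_hash(info)


id_a = commit_id_for('In the beginning\n')
id_b = commit_id_for('In the beginning\n')
id_c = commit_id_for('In the beginning!\n')

print(id_a == id_b)  # same inputs give the same commit id
print(id_a == id_c)  # one changed character gives a different id
```

Changing a single character in the chapter changes the file hash, which changes the root listing, which changes the root tree hash, which changes the commit id.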
We can now have three types of files in the .ahole/objects directory - files, trees, and commits.
OK - so things are now a little more complicated than our previous setup with file copies, but lots of things have just got much easier. For example, we can now get rid of the staging_area directory. The staging area can just be a single file containing the root tree listing of the snapshot. Let’s call that file .ahole/index. Now that Eve has done her new commit, that file can just be the root directory listing of the previous commit (the commit we have just done):
9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt
65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt
3c2e09cc43568f13444c075c84b047957f7995a5 file chapter1_discussion.txt
f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt
be242dba385bc0689be16454e959f4b64c87abce tree images
When Eve makes an edit to chapter1.txt, instead of copying the file to the staging_area directory, she makes a hash for the new chapter1.txt contents, she stores the new chapter1.txt contents in the .ahole/objects directory using the hash as a filename, and then she edits the .ahole/index file to point to her new chapter 1 contents instead of the old. She might automate this with a small command like ahole_stage [3]:
def ahole_stage(fname):
    # Get the hash for the file contents (read as bytes, so images work too)
    with open(fname, 'rb') as f:
        file_contents = f.read()
    file_hash = sha1_hash(file_contents)
    # (assuming that the new file is going in the root directory)
    new_root_entry = file_hash + ' file ' + fname
    with open('.ahole/index') as f:
        entries = [line for line in f.read().splitlines() if line]
    if new_root_entry in entries:
        # This exact file contents and filename already present
        return
    # Make an entry for these file contents in the objects database
    database_fname = '.ahole/objects/' + file_hash
    with open(database_fname, 'wb') as f:
        f.write(file_contents)
    # Drop any old entry for this filename, so the index points to the
    # new contents instead of the old
    entries = [e for e in entries if e.split(' ', 2)[2] != fname]
    entries.append(new_root_entry)
    # Write index listing with new entry
    with open('.ahole/index', 'w') as f:
        f.write('\n'.join(entries) + '\n')
Making a new commit involves taking the contents of .ahole/index and using it to make a new commit file in .ahole/objects. Using the structure of our previous ahole_commit routine, that might look like:
def ahole_commit(committer, message):
    # *** this stuff is the same as before ***
    # Get previous (parent) commit id from .ahole/HEAD
    with open('.ahole/HEAD') as f:
        head_contents = f.read().strip()
    # Check if this is a reference, de-reference if so
    # Also, get file into which to write the new commit id
    if head_contents.startswith('ref: '):
        head_ref = head_contents.replace('ref: ', '')
        head_ref_file = '.ahole/' + head_ref
        with open(head_ref_file) as f:
            head_id = f.read().strip()
    else:
        head_ref_file = '.ahole/HEAD'
        head_id = head_contents
    # *** the stuff below is different ***
    # Make root tree entry in objects database from .ahole/index
    with open('.ahole/index') as f:
        index_contents = f.read()
    index_hash = sha1_hash(index_contents)
    with open('.ahole/objects/' + index_hash, 'w') as f:
        f.write(index_contents)
    # Make commit information with parent set to HEAD
    info_str = 'committer = ' + committer + '\n'
    info_str += 'message = ' + message + '\n'
    # date.today() is a date object, so convert it to a string first
    info_str += 'date = ' + str(date.today()) + '\n'
    info_str += 'root_tree = ' + index_hash + '\n'
    info_str += 'parent = ' + head_id + '\n'
    # Write commit file into objects database, with hash
    commit_hash = sha1_hash(info_str)
    with open('.ahole/objects/' + commit_hash, 'w') as f:
        f.write(info_str)
    # Set the current commit file to contain new id
    with open(head_ref_file, 'w') as f:
        f.write(commit_hash)
How about doing a merge? Remember that, in the bad old days, we had to compare lots of files between the two branches and their common ancestor? No more. Now that we are using hashes to refer to files, all we need to do is look at the tree listings. If two tree listings have the same entry (filename and hash), that means that the file is identical between the two trees, and we don’t have to load the contents to check. That makes it very fast to do comparisons between trees that haven’t changed much.
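As a sketch of that comparison, the following code finds the names that differ between two root listings without opening a single file. The helper names parse_listing and changed_names are invented here, and the hash made of repeated ‘a’ characters is a made-up stand-in for an edited chapter 1. The other hashes are the ones from the listings above.

```python
def parse_listing(listing):
    # Map each name to its (hash, kind) pair
    entries = {}
    for line in listing.splitlines():
        obj_hash, kind, name = line.split(' ', 2)
        entries[name] = (obj_hash, kind)
    return entries


def changed_names(listing_a, listing_b):
    # Names whose entry differs between the listings, including
    # names that are present in only one of them
    a = parse_listing(listing_a)
    b = parse_listing(listing_b)
    return sorted(name for name in a.keys() | b.keys()
                  if a.get(name) != b.get(name))


ours = (
    '9e398c7cf8d56e960aa7769839cc0c38b8e12f11 file chapter1.txt\n'
    'f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt\n'
    'be242dba385bc0689be16454e959f4b64c87abce tree images\n'
)
theirs = (
    # Made-up hash, standing in for an edited chapter 1
    'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa file chapter1.txt\n'
    '65735b3705284cdf4a66c2e4812ca13cbaa7cd5d file chapter2.txt\n'
    'f31bfa1225f9e0eb6741a0ab1122f8cd2cbedc04 file contents.txt\n'
    'be242dba385bc0689be16454e959f4b64c87abce tree images\n'
)

print(changed_names(ours, theirs))  # ['chapter1.txt', 'chapter2.txt']
```

The images entry has the same hash in both listings, so the whole images directory must be identical, and we never need to look inside it.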
Eve was right of course. Now, if we make a new commit when only one file has changed, all we store is the contents of the changed file, a new tree listing with the updated hash for that file, and the small commit file itself. That makes the storage for lots and lots of similar trees very efficient.
Someone ought to write this up and give it to the world. Wait, that’s just us.
On the seventh day - there was git¶
The seventh day is for resting. You are all done now, and the hard stuff is over. In a state of deep inner peace, you can think about all that you’ve discovered in ahole:
A commit refers to a snapshot of the complete set of files for your project
The staging area (index) defines what will change between your upcoming commit and the previous commit
A branch is just a pointer to a commit; the pointer moves when you make another commit
Version control is very easy to understand
You remind yourself that life is very good, because you don’t have to use a version control system called ahole, you can use a very similar system called git.
If you use git, you’ll notice that you have lots of ahole friends. You’ll see git creates a .git subdirectory that contains the repository. You’ll recognize the .git/objects directory containing filenames with SHA1 hashes. You’ll see that commits have SHA1 hashes. You’ll recognize the .git/HEAD file and .git/refs/heads and .git/refs/tags and .git/refs/heads/master. There is a .git/index file, and it is the staging area. .git/index is a little more complicated than .ahole/index because it’s adapted to helping with difficult merges, but it’s the same idea.
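One small difference you may notice if you compare hashes: git does not hash the bare file contents as ahole does. For a file, git first puts a short header in front of the contents, made of the word ‘blob’, a space, the length of the contents in bytes, and a zero byte, and then takes the SHA1 of the whole thing. The sketch below follows that recipe. The function names are my own, and this is an illustration rather than git’s actual code, but the result should agree with what the git hash-object command prints for the same contents.

```python
import hashlib


def ahole_hash(contents):
    # ahole style: SHA1 of the bare contents
    return hashlib.sha1(contents).hexdigest()


def git_blob_hash(contents):
    # git style: SHA1 of 'blob <size in bytes>', a zero byte,
    # then the contents
    header = b'blob ' + str(len(contents)).encode('ascii') + b'\0'
    return hashlib.sha1(header + contents).hexdigest()


empty = b''
print(ahole_hash(empty))     # da39a3ee5e6b4b0d3255bfef95601890afd80709
print(git_blob_hash(empty))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The idea is the same as in ahole: identical contents always give the same name in the objects directory, so git only needs to store them once.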
You now live in the garden of Eden of version control. Remember to stay away from that apple tree.
Footnotes
- 1
ahole might seem a bit rude to you, but I was born in the UK, and, where I come from, ‘ahole’ is roughly as rude as ‘git’.
- 2
In case you are interested, for the commit and checkout code to actually run, you would need some python definitions. First some standard python imports:
from datetime import date
from os import mkdir, listdir
Then we need some simple custom commands for deleting our working tree, and for copying files into the working tree:
from os import remove
from os.path import isfile, isdir
from shutil import copyfile, copytree, rmtree

def delete_tree(path):
    # Delete everything in path unless it's an '.ahole' directory
    for name in listdir(path):
        full_name = path + '/' + name
        # Test the full path, so this works when path is not the
        # current directory
        if isfile(full_name):
            remove(full_name)
        elif isdir(full_name):
            if name != '.ahole':
                rmtree(full_name)

def copy_tree(src_path, dst_path):
    # Copy everything in src_path to dst_path
    for name in listdir(src_path):
        src_name = src_path + '/' + name
        dst_name = dst_path + '/' + name
        if isfile(src_name):
            copyfile(src_name, dst_name)
        elif isdir(src_name):
            copytree(src_name, dst_name)
We also need some definition of make_unique_id().
- 3
Now you need to add:
import hashlib

def sha1_hash(contents):
    # Accept text or bytes; return the 40-character hex digest
    if isinstance(contents, str):
        contents = contents.encode('utf-8')
    return hashlib.sha1(contents).hexdigest()