##############
A curious tale
##############

Here is a story where we develop a very simple system for storing file
snapshots.  We soon find it starts to look just like git.

********************
The end of the story
********************

To understand why git does what it does, we first need to think about what a
content manager should do, and why we would want one.

If you've read the `git parable`_ (please do), then you'll recognize many of
the ideas.  Why?  Because they are good ideas, worthy of re-use.

As in the `git parable`_, we will try and design our own content manager, and
then see what git has to say.

(If you don't mind reading some Python code, and more jokes, then also try my
`git foundation`_ page).

While we are designing our own content management system, we will do a lot of
stuff longhand, to show how things work.  When we get to git, we will find it
does these tasks for us.

****************
The story begins
****************

You are writing a breakthrough paper showing that you can explain how the
brain works by careful processing of some interesting data.  You've got the
analysis script, the data file and a figure for the paper.  These are all in a
directory modestly named ``nobel_prize``.

You can get this, the first draft, by downloading and unzipping
:download:`nobel_prize </np-versions/nobel_prize.zip>`.

.. include:: reset_env.inc

Here's the current contents of our ``nobel_prize`` directory:

.. prizeout::

    # Show directory contents as tree
    {{ np_tree }}

**********************
The dog ate my results
**********************

You've been working on this study for a while.

At first, you were very excited with the results. You ran the script, made the
figure, and the figure looked good.  That's the figure you currently have in
``nobel_prize`` directory. You took this figure to your advisor, Josephine.
She was excited too. You get ready to publish in Science.

You've done a few changes to the script and figure since then.  Today you
finished cleaning up for the Science paper, and reran the analysis, and it
doesn't look quite the same. You go to see Josephine. She says "It used to
look better than that". That's what you think too. But:

* **Did it really look different before?**
* If it did, **what caused the change in the figure?**

**********************
Deja vu all over again
**********************

Given you are so clever and you have discovered how the brain works, it is
really easy for you to leap in your time machine, and go back two weeks to
start again.

What are you going to do differently this time?

**********************************
Gitwards 1: make regular snapshots
**********************************

You decide to make your own content management system.  It's the simplest
thing that could possibly work, so you call it the "Simple As Possible"
system, or SAP for short.

Every time you finish doing some work on your paper, you make a snapshot
of all the files for the paper.

The snapshot is a copy of all the files in the working directory.

First you make a directory called ``working``, and move your files to that
directory:

.. nprun::
    :hide:
    :allow-fail:

    mkdir working
    mv * working

.. prizeout::

    {{ np_tree }}

When you've finished work for the day, you make a snapshot of the directory
containing the files you are working on.  The snapshot is just a copy of your
working directory:

.. nprun::
    :hide:

    cp -r working snapshot_1

.. prizeout::

    {{ np_tree }}

You are going to do this every day you work on the project.

On the second day, you add your first draft of the paper, ``nobel_prize.md``.
You can download this ground-breaking work at :download:`nobel_prize.md
</np-versions/work2/nobel_prize.md>`.

.. nprun::
    :hide:

    cp {{ np_versions }}/work2/nobel_prize.md working

.. prizeout::

    {{ np_tree }}

At the end of the day you make your second snapshot:

.. nprun::
    :hide:

    cp -r working snapshot_2

.. prizeout::

    {{ np_tree }}

On the third day, you did some edits to the analysis script, and refreshed the
figure by running the script.  You did a third snapshot.

.. nprun::
    :hide:

    cp {{ np_versions }}/work3/* working
    cp -r working snapshot_3

.. prizeout::

    {{ np_tree }}

To make the directory listing more compact, I'll sometimes show only the
number of files / directories in a subdirectory.  For example, here's a
listing of the three snapshots, but only showing the contents of the third
snapshot:

.. prizeout::

    {{ np_tree }} --elide snapshot --unelide snapshot_3

Finally, on the fourth day, you make some more edits to the script, and you
add some references for the paper.

.. nprun::
    :hide:

    cp {{ np_versions }}/work4/* working
    cp -r working snapshot_4

.. prizeout::

    {{ np_tree }} --elide snapshot --unelide snapshot_4

You are ready for your fateful meeting with Josephine.  Again she notices that
the figure is different from the first time you showed her.  This time you can
go and look in ``nobel_prize/snapshot_1`` to see if the figure really is
different.  Then you can go through the snapshots to see where the figure
changed.

You've already got a useful content management system, but you are going to
make it better.

.. note::

    We are already at the stage where we can define some `terms
    <https://www.kernel.org/pub/software/scm/git/docs/gitglossary.html>`_ that
    apply to our system and that will later apply to git:

    Commit
        A completed snapshot. For example, ``snapshot_1`` contains one commit.

    Working tree
        The files you are working on in ``nobel_prize/working``.

**********************************************
Gitwards 2: reminding yourself of what you did
**********************************************

.. Add message.txt

Your experience tracking down the change in the figure makes you think that it
would be good to save a message with each snapshot (commit) to record the
commit date and some text giving a summary of the changes you made.  Next time
you need to track down when and why something changed, you can look at the
message to give yourself an idea of the changes in the commit.  That might
save you time when you want to narrow down where to look for problems.

So, for each commit, you write write a file called ``message.txt``. The
message for the first commit looks like this:

.. prizewrite:: snapshot_1/message.txt

    Date: April 1 2012, 14.30
    Author: I. M. Awesome
    Notes: First backup of my amazing idea

.. prizewrite:: snapshot_2/message.txt
    :hide:

    Date: April 2 2012, 18.03
    Author: I. M. Awesome
    Notes: Add first draft of paper

There is a similar ``messsage.txt`` file for each commit. For example,
here's the message for the third commit:

.. prizewrite:: snapshot_3/message.txt

    Date: April 3 2012, 11.20
    Author: I. M. Awesome
    Notes: Add another fudge factor

This third message is useful because it gives you a hint that this was where
you made the important change to the script and figure.

.. note::

    Commit message
        Information about a commit, including the author, date, time, and some
        information about the changes in the commit, compared to the previous
        commits.

****************************************
Gitwards 3: breaking up work into chunks
****************************************

.. the staging area

Now you are used to your new system, you find that you like to break your
changes up into self-contained chunks of work, each with their own commit, and
a matching commit message.  When you look back at your fourth commit, it looks
like you included two separate chunks of work into the same commit.  You've
even confessed to this in your commit message:

.. prizewrite:: snapshot_4/message.txt

    Date: April 4 2012, 01.40
    Author: I. M. Awesome
    Notes: Change analysis and add references

You realize that you will often be in the situation where you have made
several changes in the working tree, and you want to split those changes up
into different commits, with their own commit messages.  How can you adapt SAP
to deal with that situation?

To help yourself think about this problem, you decide to scrap your last
commit, and go back to the situation where your working tree has the changes,
but the snapshots (commits) do not.  All you have to do to get there, is
delete the ``snapshot_4`` directory:

.. nprun::
    :hide:

    rm -rf snapshot_4

.. prizeout::

    {{ np_tree }} --hasta snapshot_2

    rm -rf staging
    cp -r snapshot_3 staging
    rm staging/message.txt

You still have your changes in the working tree.  You have changed the
analysis script and figure, and you have the new ``references.bib`` file.

You want to break these changes up into two separate commits:

* a commit with the changes to the analysis script and figure, but without the
  references;
* another commit to add the references.

To do this kind of thing, you are going to use a new directory called
``staging``.  The ``staging`` directory starts off with the files from the
last commit.  When you want to add some changes that will go into your next
commit, you copy the changes from the working tree to the ``staging``
directory.  You make the commit by copying the contents of ``staging`` to a
new snapshot directory, and adding a commit message.

To get started, you make the new ``staging`` directory by copying the contents
of the last commit (except the commit message):

.. nprun::
    :hide:

    rm -rf snapshot_4
    rm -rf staging
    cp -r snapshot_3 staging
    rm staging/message.txt

.. prizeout::

    {{ np_tree }} --hasta snapshot_2

Call the ``staging`` directory |--| the **staging area**.  Your new sequence
for making a commit is:

* copy any changes for the next commit from the working tree to the staging
  area;
* make the commit by copying the contents of the staging area to a snapshot
  directory, and adding a commit message.

You are doing this by hand, but later git will make this much more automatic.

Now you are ready to make the first of your two new commits.  You copy the
changed analysis script and figure from the working tree to the staging area:

.. nprun::

    cp working/clever_analysis.py staging
    cp working/fancy_figure.png staging

The staging directory (staging area) now contains the right files for the
first of your two commits.

Next you make a commit by copying the staging area to ``snapshot_4`` and
adding a message:

.. nprun::
    :hide:

    cp -r staging snapshot_4

.. prizewrite:: snapshot_4/message.txt

    Date: April 4 2012, 01.40
    Author: I. M. Awesome
    Notes: Change parameters of analysis

This gives:

.. prizeout::

    {{ np_tree }} --hasta snapshot_3

To finish, you make the second of the two commits.  Remember the sequence:

* copy any changes for the next commit from the working tree to the staging
  area;
* make the commit by taking a snapshot of the staging area.

You copy the rest of the changes to the staging area:

.. nprun::

    cp working/references.bib staging

Finally, you do the commit by copying the contents of ``staging`` to a new
directory ``snapshot_5``, and adding a commit message:

.. nprun::
    :hide:

    cp -r staging snapshot_5

.. prizewrite:: snapshot_5/message.txt

    Date: April 4 2012, 02.10
    Author: I. M. Awesome
    Notes: Add references

Now you have:

.. prizeout::

    {{ np_tree }} --hasta snapshot_4

We can add a couple of new terms to our vocabulary:

.. note::

    Staging area
        Temporary area that contains the contents of the next commit.  We copy
        changes from the working tree to the staging area to **stage** those
        changes.  We make the new **commit** from the contents of the
        **staging area**.

***********************************************
Gitwards 4: getting files from previous commits
***********************************************

Remember that you found the figure had changed?

You also found that the problem was in the third commit.

Now you look back over the commits, you realize that your first draft of the
analysis script was correct, and you decide to restore that.

To do that, you will **checkout** the script from the first commit
(``snapshot_1``).  You also want to checkout the generated figure.

Following our new standard staging workflow, that means:

* get the files you want from the old commit into the working directory, and
  the staging area;
* make a new commit from the staging area.

For our simple SAP system, that looks like this:

.. nprun::

    # Copy files from old commit to working tree
    cp snapshot_1/clever_analysis.py working
    cp snapshot_1/fancy_figure.png working

.. nprun::

    # Copy files from working tree to staging area
    cp working/clever_analysis.py staging
    cp working/fancy_figure.png staging

.. nprun::
    :hide:

    cp -r staging snapshot_6

Then do the commit by copying ``staging``, and add a message:

.. prizewrite:: snapshot_6/message.txt

    Date: April 5 2012, 18.40
    Author: I. M. Awesome
    Notes: Revert to original script & figure

This gives:

.. prizeout::

    {{ np_tree }} --elide snapshot_ --unelide "snapshot_(1|6)"

.. note::

    Checkout (a file)
        To **checkout** a file is to restore the copy of a file as stored in a
        particular commit.

***********************************************
Gitwards 5: two people working at the same time
***********************************************

.. How to have unique ids for the commits / snapshots

One reason that git is so powerful is that it works very well when more than
one person is working on the files in parallel.

Josephine is impressed with your SAP content management system, and wants to
use it to make some edits to the paper.  She takes a copy of your
``nobel_prize`` directory to put on her laptop.

She goes away for a conference.

While she is away, you do some work on the analysis script, and regenerate the
figure, to make ``shapshot_7``:

.. nprun::
    :hide:

    cp {{ np_versions }}/work7m/* working
    cp working/* staging
    cp -r staging snapshot_7

.. prizewrite:: snapshot_7/message.txt
    :hide:

    Date: April 6 2012, 11.03
    Author: I. M. Awesome
    Notes: More fun with fudge

.. prizeout::

    {{ np_tree }} --elide staging --hasta snapshot_6

Meanwhile, Josephine decides to work on the paper.  Following your procedure,
she makes a commit herself.

What should Josephine's commit directory be called?

She could call it ``snapshot_7``, but then, when she gets back to the lab, and
gives you her ``nobel_prize`` directory, her copy of ``nobel_prize`` and yours
will both have a ``snapshot_7`` directory, but they will be different.  It
would be easy to copy Josephine's directory over yours or yours over
Josephine's, and lose the work.

For the moment, you decide that Josephine will attach her name to the commit
directory, to make it clear this is her snapshot.  So, she makes her commit
into the directory ``snapshot_7_josephine``.  When she comes back from the
conference, you copy her ``snapshot_7_josephine`` into your ``nobel_prize``
directory:

.. nprun::
    :hide:

    cp -r snapshot_6 snapshot_7_josephine
    cp -r {{ np_versions }}/work7j/* snapshot_7_josephine

.. prizewrite:: snapshot_7_josephine/message.txt
    :hide:

    Date: April 6 2012, 14.30
    Author: J. S. Rightway
    Notes: Expand the introduction

.. prizeout::

    {{ np_tree }} --elide staging --hasta snapshot_6

After the copy, you still have your own copy of the working tree, without
Josephine's changes to the paper. You want to combine your changes with her
changes.  To do this you do a **merge** by copying her changes to the paper
into the working directory.

.. nprun::

    # Get Josephine's changes to the paper
    cp snapshot_7_josephine/nobel_prize.md working

Now you do a commit with these merged changes, by copying them into the
staging area, and thence to ``snapshot_8``, with a suitable message:

.. nprun::
    :hide:

    cp working/* staging
    cp -r staging snapshot_8

.. prizewrite:: snapshot_8/message.txt
    :hide:

    Date: April 7 2012, 15.03
    Author: I. M. Awesome
    Notes: Merged Josephine's changes

.. prizeout::

    {{ np_tree }} --hasta "snapshot_7$"

This new commit is the result of a merge, and therefore it is a **merge
commit**.

.. note::

    Merge
        To make a new **merge commit** by combining changes from two (or
        more) commits.

********************************************************
Gitwards 6: how should you name your commit directories?
********************************************************

You like your new system, and so does Josephine, but you don't much like your
solution of adding Josephine's name to the commit directory |--| as in
``snapshot_7_josephine``.  There might be lots of people working on this
paper.  With your naming system, you have to give out a unique name to each
person working on ``nobel_prize``.  As you think about this problem, you begin
to realize that what you want is a system for giving each commit directory a
unique name, that comes from the contents of the commit.  This is where you
starting thinking about **hashes**.

***********************************
A diversion on cryptographic hashes
***********************************

This section describes "Cryptographic hashes". These will give us an excellent
way to name our snapshots.  Later we will see that they are central to the way
that git works.

See : `Wikipedia on hash
functions <http://en.wikipedia.org/wiki/Cryptographic_hash_function>`__.

A *hash* is the result of running a *hash function* over a block of
data. The hash is a fixed length string that is the characteristic
*fingerprint* of that exact block of data.  One common hash function is called
SHA1.  Let's run this via the command line:

.. workrun::

    # Make a file with a single line of text
    echo "git is a rude word in UK English" > git_is_rude
    # Show the SHA1 hash
    shasum git_is_rude

Not too exciting so far. However, the rather magical nature of this string is
not yet apparent. This SHA1 hash is a *cryptographic* hash because:

* the hash value is (almost) unique to this exact file contents, and
* it is (almost) impossible to find some other file contents with the same
  hash.

By "almost impossible" I mean that finding a file with the same hash is
roughly the same level of difficulty as trying something like $16^{40}$
different files (where 16 is the number of different hexadecimal digits, and
40 is the length of the SHA1 string).

In other words, there is no practical way for you to find another file with
different contents that will give the same hash.

For example, a tiny change in the string makes the hash completely different.
Here I've just added a full stop at the end:

.. workrun::

    echo "git is a rude word in UK English." > git_is_rude_stop
    shasum git_is_rude_stop

So, if you give me some data, and I calculate the SHA1 hash value, and it
comes out as ``30ad6c360a692c1fe66335bb00d00e0528346be5``, then I can be very
sure that the data you gave me was exactly the ASCII string "git is a rude
word in UK English".

.. _naming-from-hashes:

**************************************
Gitwards 7: naming commits from hashes
**************************************

Now you have hashing under your belt, maybe it would be a good way of making a
unique name for the commits.  You could take the SHA1 hash for the
``message.txt`` for each commit, and use that SHA1 hash as the name for the
commit directory.  Because each message has the date and time and author and
notes, it's very unlikely that any two ``message.txt`` files will be the same.
Here are the hashes for the current ``message.txt`` files:

.. nprun::

    # Show the SHA1 hash values for each message.txt
    shasum snapshot*/message.txt

.. prizevar:: snapshot_1_sha

    shasum snapshot_1/message.txt | awk '{print $1}'

.. prizevar:: snapshot_2_orig_sha

    shasum snapshot_2/message.txt | awk '{print $1}'

For example you could rename the ``snapshot_1`` directory to |snapshot_1_sha|,
then rename ``snapshot_2`` to |snapshot_2_orig_sha| and so on.

.. prizeout::

    {{ np_tools }}/mv_shas.sh
    snapshot_1=$({{ np_tools }}/name2sha.sh snapshot_1)
    {{ np_tree }} --elide "\S+"

The problem you have now is that the directory names no longer tell you the
sequence of the commits, so you can't tell that ``snapshot_2`` (now
|snapshot_2_orig_sha|) follows ``snapshot_1`` (now |snapshot_1_sha|).

OK |--| you scratch the renaming for now while you have a rethink.

.. prizeout::

    {{ np_tools }}/unmv_shas.sh
    {{ np_tree }} --elide "\S+"

You still want to rename the commit directories, from the ``message.txt``
hashes, but you need a way to store the sequence of commits, after you have
done this.

After some thought, you come up with a quite brilliant idea.  Each
``message.txt`` will point back to the previous commit in the sequence.  You
add a new field to ``messsage.txt`` called ``Parents``.
``snapshot_1/message.txt`` stays the same, because it has no parents:

.. nprun::

    cat snapshot_1/message.txt

``snapshot_2/message.txt`` does change, because it now points back to
``snapshot_1``.  But, you're going to rename the snapshot directories, so you
want ``snapshot_2/message.txt`` to point back to the hash for
``snapshot_1/message.txt``, which you know is |snapshot_1_sha|:

.. nprun::
    :hide:

    {{ np_tools }}/link_commits.py

.. nprun::

    cat snapshot_2/message.txt

Now we've changed the contents and therefore the hash for
``snapshot_2/message.txt``.  The hash was |snapshot_2_orig_sha|, but now it
is:

.. nprun::

    shasum snapshot_2/message.txt

You keep doing this procedure, for all the commits, modifying ``message.txt``
and recalculating the hash, until you come to ``snapshot_8``, the merge
commit.  This commit is the result of merging two commits: ``snapshot_7`` and
``snapshot_7_josephine``.  You can record this information by putting *two*
parents into the ``Parents`` field of ``snapshot_8/message.txt``, being the
new hashes for ``snapshot_7/message.txt`` and
``snapshot_7_josephine/message.txt``:

.. nprun::

    shasum snapshot_7/message.txt

.. nprun::

    shasum snapshot_7_josephine/message.txt

.. nprun::

    cat snapshot_8/message.txt

With the new ``Parents`` field, you have new hashes for all the
``message.txt`` files, except ``snapshot_1`` (that has no parent):

.. nprun::

    shasum snapshot_*/message.txt

You can now rename your snapshot directories with the hash values, safe in the
knowledge that the ``message.txt`` files have the information about the commit
sequence.


.. nprun::
    :hide:

    {{ np_tools }}/mv_shas.sh

.. prizeout::

    {{ np_tree }} --elide "\S+"

Now the commit directories are hash names, it is harder to see which commit is
which, so here's the directory listing where the commit directories have a
label from the ``Notes:`` field in ``message.txt``:

.. prizeout::

    {{ np_tree }} --elide "\S+" --label

.. note::

    Commit hash
        The hash value for the file containing the **commit message**.

**********************************************
Gitwards 8: the development history is a graph
**********************************************

The commits are linked by the "Parents" field in the ``message.txt`` file.  We
can think of the commits in a graph, where the commits are the nodes, and the
links between the nodes come from the hashes in the "Parents" field.

.. workrun::
    :hide:

    cd ../generated
    ../np-tools/make_dot.py > snapshot_graph1.dot
    dot -Tpng -o snapshot_graph1.png snapshot_graph1.dot
    dot -Tpdf -o snapshot_graph1.pdf snapshot_graph1.dot

.. figure:: /generated/snapshot_graph1.*

    Graph of development history for your SAP content management system.  The
    most recent commit is at the top, the first commit is at the bottom.  Your
    commits are in blue, Josephine's are in pink.  Each commit label has the
    hash for the commit message, and the note in the ``message.txt`` file.

*****************************************
Gitwards 9: saving space with file hashes
*****************************************

While you've been working on your system, you've noticed that your snapshots
are not efficient on disk space.  For example, every commit / snapshot has an
identical copy of the data ``expensive_data.csv``.  If you had bigger files or
a longer development history, this could be a problem.

.. prizevar:: snapshot_2_sha

    echo $({{ np_tools }}/name2sha.sh snapshot_2)

.. prizevar:: snapshot_3_sha

    echo $({{ np_tools }}/name2sha.sh snapshot_3)

.. prizevar:: snapshot_6_sha

    echo $({{ np_tools }}/name2sha.sh snapshot_6)

.. prizevar:: snapshot_8_sha

    echo $({{ np_tools }}/name2sha.sh snapshot_8)

Likewise, ``fancy_figure.png`` and ``clever_analysis.py`` are the same for the
first two commits, and then again when you reverted to that copy in
``snapshot_6`` (that is now commit |snapshot_6_sha|).

You can show these files are the same by checking their hash strings.  If
their hash strings are different, the files must be different.  All copies of
``expensive_data.csv`` have the same hash, and are therefore identical:

.. prizevar:: asterisk
    :omit_link:

    # Because * as in file system glob messes up syntax highlighting in vim
    echo "*"

.. nprun::

    shasum {{ asterisk }}/expensive_data.csv

``fancy_figure.png`` is the same for the first two commits, changes for the
third commit, and reverts back to the same contents at the 6th commit:

.. nprun::

    # First commit
    shasum {{ snapshot_1_sha }}/fancy_figure.png

.. nprun::

    # Second commit
    shasum {{ snapshot_2_sha }}/fancy_figure.png

.. nprun::

    # Third commit
    shasum {{ snapshot_3_sha }}/fancy_figure.png

.. nprun::

    # Sixth commit
    shasum {{ snapshot_6_sha }}/fancy_figure.png

You wonder if there is a way to store each unique version of the file just
once, and make the commits point to the matching version.

First you make a new directory to store files generated from your commits:

.. nprun::

    mkdir repo

Next you make a sub-directory to store the unique copies of the files in
commits:

.. nprun::

    mkdir repo/objects

You play with the idea of calling these unique versions something like
``repo/objects/fancy_figure.png.v1``, ``repo/objects/fancy_figure.png.v2`` and
so on.  You would then need something like a text file called
``directory_listing.txt`` in the first commit directory to say that the file
``fancy_figure.png`` for this commit is available at
``repo/objects/fancy_figure.png.v1``.  This could be something like::

    # directory_listing.txt in first commit
    fancy_figure.png -> repo/objects/fancy_figure.png.v1

``directory_listing.txt`` for the second commit would point to the same file,
but the third commit would have something like::

    # directory_listing.txt in third commit
    fancy_figure.png -> repo/objects/fancy_figure.png.v2

You quickly realize this is going to get messy when you are working with other
people, because you may store ``repo/objects/fancy_figure.png.v3`` while
Josephine is also working on the figure, and is storing her own
``repo/objects/fancy_figure.png.v3``.  You need a unique file name for each
version of the file.

Now you have your second quite brilliant hashing idea.  Why not use the
**hash** of the file to make a unique file name?

For example, here are the hash values for the files in the first commit:

.. nprun::

    shasum {{ snapshot_1_sha }}/*

.. prizevar:: fancy_figure_v1_sha

    shasum {{ snapshot_1_sha }}/fancy_figure.png | awk '{print $1}'

.. prizevar:: clever_analysis_v1_sha

    shasum {{ snapshot_1_sha }}/clever_analysis.py | awk '{print $1}'

.. prizevar:: expensive_data_sha

    shasum {{ snapshot_1_sha }}/expensive_data.csv | awk '{print $1}'

To store the unique copies, you copy each file in the first commit to
``repo/objects`` with a unique file name.  **The file name is the hash of the
file contents**.  For example, the hash for ``fancy_figure.png`` is
|fancy_figure_v1_sha|.  So, you do:

.. nprun::

    cp {{ snapshot_1_sha }}/fancy_figure.png repo/objects/{{ fancy_figure_v1_sha }}

The hash values for ``clever_analysis.py`` and ``expensive_data.csv`` are
|clever_analysis_v1_sha| and |expensive_data_sha| respectively, so:

.. nprun::

    cp {{ snapshot_1_sha }}/clever_analysis.py repo/objects/{{ clever_analysis_v1_sha }}
    cp {{ snapshot_1_sha }}/expensive_data.csv repo/objects/{{ expensive_data_sha }}

These hash values become the ``directory_listing.txt`` for the first commit:

.. nprun::
    :hide:

    cd {{ snapshot_1_sha }}
    shasum * | grep -v 'message.txt' > directory_listing.txt

.. nprun::

    cat {{ snapshot_1_sha }}/directory_listing.txt

Finally, you can delete ``fancy_figure.png``, ``clever_analysis.py`` and
``expensive_data.csv`` in the first commit directory, because you have them
backed up in ``repo/objects``.

So far you haven't gained anything much except some odd-looking filenames.
The payoff comes when you apply the same procedure to the second commit.  Here
are the hashes for the files in the second commit:

.. nprun::

    shasum {{ snapshot_2_sha }}/*

.. prizevar:: nobel_prize_v1_sha

    shasum {{ snapshot_2_sha }}/nobel_prize.md | awk '{print $1}'

Remember that, in the second commit, all you did was add the first draft of
the paper as ``nobel_prize.md``.  So, all the other files in the second commit
(apart from ``message.txt`` that you are not storing) are the same as for the
first commit, and therefore have the same hash.  You already have these files
backed up in ``repo/objects`` so all you need to do is point
``directory_listing.txt`` at the original copies in ``repo/objects``.

For example, the hash for ``fancy_figure.png`` in the second commit is
|fancy_figure_v1_sha|.  When you are storing the files for the second commit
in ``repo/objects``, you notice that you already have a file
named |fancy_figure_v1_sha| in ``repo/objects``, so you do not copy it a
second time.  By checking the hashes for each file in the commit, you find
that the only file you are missing is the new file ``nobel_prize.md``.  This
has hash |nobel_prize_v1_sha|, so you do a single copy to ``repo/objects``:

.. nprun::

    # Only one copy needed to store files in second commit
    cp {{ snapshot_2_sha }}/nobel_prize.md repo/objects/{{ nobel_prize_v1_sha }}

As before, you can make ``directory_listing.txt`` for the second commit by
recording the hashes of the files:

.. nprun::
    :hide:

    cd {{ snapshot_2_sha }}
    shasum * | grep -v 'message.txt' > directory_listing.txt

.. nprun::

    cat {{ snapshot_2_sha }}/directory_listing.txt

Before you start this procedure of moving the unique copies into
``repo/objects``, your whole ``nobel_prize`` directory is size:

.. nprun::
    :hide:

    rm -rf repo/objects

.. nprun::

    # Size of the contents of nobel_prize before moving to repo/objects
    du -hs .

When you run the procedure above on every commit, moving files to
``repo/objects``, you have this:

.. nprun::
    :hide:

    {{ np_tools }}/to_repo_objects.py

.. prizeout::

    {{ np_tree }} --elide ize/working --elide staging --label

The whole ``nobel_prize`` directory is now smaller because you have no
duplicated files:

.. nprun::

    # Size of the contents of nobel_prize after moving to repo/objects
    du -hs .

The advantage in size gets larger as your system grows, and you have more
duplicated files.

**************************************
Gitwards 10: making the commits unique
**************************************

.. hashing the directory listing; including hashes in the commit

Up in :ref:`naming-from-hashes` you used the hash of ``message.txt`` as a
nearly unique directory name for the commit.  Your thinking was that it was
very unlikely that any two commits would have the same author, date, time, and
note.  You have since added the ``Parents`` field to ``message.txt`` to make
it even more unlikely.  But |--| it could still happen.  You might be careless
and make another commit very quickly after the previous, and without a note.
You could even point back to the same parent.

You would like to be even more confident that the commit message is unique to
the commit, including the contents of the files in the commit.

You now have a way of doing this.   The ``directory_listing.txt`` files
contain a list of hashes and corresponding file names for this commit
(snapshot).  For example, here is ``directory_listing.txt`` for the first
commit:

.. nprun::

    cat {{ snapshot_1_sha }}/directory_listing.txt

The contents of this file are (very nearly) unique to the contents of the
files in the snapshot.  If any of the files changed, then the hash of the file
would change and the corresponding line in ``directory_listing.txt`` would
change.  If you renamed the file, the name of the file would change and the
corresponding line in ``directory_listing.txt`` would change.

Now you know what to do.  You take a hash of the ``directory_listing.txt``
file:

.. nprun::

    shasum {{ snapshot_1_sha }}/directory_listing.txt

.. nprun::
    :hide:

    {{ np_tools }}/add_tree.py

You put this has into a new field in ``message.txt`` called ``Directory
hash:``:

.. prizeout::

    cat {{ snapshot_1_sha }}/message.txt

Now, if any file in the commit changes, ``directory_listing.txt`` will change,
and so its hash will change, and so ``message.txt`` will change.

Now you've added the ``Directory hash`` field to ``messsage.txt`` you have
also changed the hash values of the ``message.txt`` files.  Because you've
changed the hashes of the ``message.txt`` files, you go back through your
commits updating the parent hashes to the new ones, and renaming the commit
directories with the new hashes.  You end up with this:

.. nprun::
    :hide:

    {{ np_tools }}/mv_shas.sh

.. prizeout::

    {{ np_tree }} --elide ".*" --label

With your new system, if any two commits have the same ``message.txt`` then
they also have the same date, author, note, parents and file contents.  They
are therefore exactly the same commit.

.. note::

    The commit message is unique to the contents of the files in the snapshot
    (because of the directory hash) and unique to its previous history
    (because of the parent hash(es)).

***********************************************
Gitwards 11: away with the snapshot directories
***********************************************

.. hashing the commits

You are reflecting on your idea about hashing the directory listing, and your
eye falls idly on the current directory tree of ``nobel_prize``:

.. prizevar:: snapshot_1_with_tree_sha

    echo $({{ np_tools }}/name2sha.sh {{ snapshot_1_sha }})

.. prizeout::

    {{ np_tree }} --elide ize/working --elide staging --elide repo/objects --label

It occurs to you that you can move the ``directory_listing.txt`` and
``message.txt`` files into your ``repo/objects`` directory.  When you have
done that, you can get rid of the commit directories entirely.

First you take the hash of each ``directory_listing.txt`` and move it into the
``repo/objects`` directory as you did for the other files:

.. prizevar:: snapshot_1_tree_hash

    shasum {{ snapshot_1_with_tree_sha }}/directory_listing.txt | awk '{print $1}'

.. nprun::

    shasum {{ snapshot_1_with_tree_sha }}/directory_listing.txt

.. nprun::

    cp {{ snapshot_1_with_tree_sha }}/directory_listing.txt repo/objects/{{ snapshot_1_tree_hash }}

Then you do the same for the ``message.txt`` file:

.. nprun::

    cp {{ snapshot_1_with_tree_sha }}/message.txt repo/objects/{{ snapshot_1_with_tree_sha }}

.. prizevar:: n_commits
    :not-literal:

    wc .names2sha | awk '{print $1}'

There are |n_commits| commits, so there are |n_commits| x 2 new files with
hash filenames in ``repo/objects`` (a hashed copy of ``directory_listing.txt``
and ``message.txt`` for each commit).

Now you don't need the snapshot directories at all, because the hashed files
in ``repo/objects`` have all the information about the snapshots.

.. nprun::
    :hide:

    ../../np-tools/move_snapshots.py

.. prizeout::

    {{ np_tree }} --elide ize/working --elide staging --elide repo/objects

.. note::

    In git as in your SAP content management system, a **repository
    directory** stores all the data from the snapshots.  In your case that
    directory is ``repo``.  For git, it will be a directory called ``.git``.

************************
Gitwards 12: where am I?
************************

You have one last problem to face |--| where is your latest commit?

When your snapshot directory names had numbers, like ``snapshot_8``, you could
use the numbers to find the most recent commit.  Now all you have is a
directory called ``repo/objects`` with unhelpful file names made from hashes.
Which of these files has your latest commit?

You could write down the latest commit hash on a piece of paper, after you
make the commit, but this sounds like a job better done by a computer.

.. prizevar:: snapshot_8_with_tree_sha

    echo $({{ np_tools }}/name2sha.sh {{ snapshot_8_sha }})

.. nprun::
    :hide:

    echo {{ snapshot_8_with_tree_sha }} > repo/my_bookmark

So, when you make a new commit, you store the hash for that commit in a file
called ``repo/my_bookmark``.  It is a text file with the hash string as
contents.  Your last commit was |snapshot_8_with_tree_sha|, so
``repo/my_bookmark`` has contents:

.. nprun::

    cat repo/my_bookmark

You can imagine that, when Josephine is working on the same set of files, she
might want her own bookmark, maybe in a file called ``josephines-bookmark``.

.. note::

    You keep track of the latest commit in a particular sequence by storing
    the latest **commit hash** in a bookmark file.  In git this bookmark is
    called a **branch**.

**********************
You are on the on-ramp
**********************

You now know all the main ideas in git.  Follow me then to :doc:`curious_git`
to see these ideas come to life in your actual git.

.. include:: links_names.inc
.. include:: working/object_names.inc