###########################
How do git submodules work?
###########################

Git submodules can be a little confusing.

This page explains how git *stores* submodules.  My hope is that this will make
it easier to understand how to *use* submodules.

If you've read :doc:`curious_git` you will recognize this way of thinking.

***************
Why submodules?
***************

Submodules are useful when you have a project that is under git version
control, and you want to include a copy of another project that is also under
git version control.

**************
Worked example
**************

We will call the project that we need to use ``myproject``, and the project
that is using ``myproject`` we will call ``super``.

We are expecting that ``myproject`` will continue to develop.

``super`` is going to start using some version of ``myproject``.  In the
spirit of version control, we want to keep track of exactly which
``myproject`` version ``super`` is using.

.. prizevar:: np_tools
    :omit_link:

    echo "../../np-tools"

.. prizevar:: my_tree
    :omit_link:

    echo "../../tools/mytree.py"

.. nprun::
    :hide:

    git config --global user.name "Matthew Brett"
    git config --global user.email "matthew.brett@gmail.com"

``myproject``
=============

We make a little ``myproject`` to start:

.. workrun::
    :hide:

    rm -rf myproject
    rm -rf super
    rm -rf super-cloned

.. workrun::

    mkdir myproject
    cd myproject
    git init

.. projectcommit:: proj-init 2012-05-01 11:13:13

    echo "Important code and data" > some_data.txt
    git add some_data.txt
    git commit -m "Initial commit on myproject"

Back to the working directory containing the repositories:

.. workrun::

    cd ..

``super``
=========

Now a ``super`` project:

.. workrun::

    mkdir super
    cd super
    git init

Remember (from :doc:`curious_git`) that doing ``git add`` on a file adds a new
copy of that file to the ``.git/objects`` directory.  So, ``.git/objects``
starts off empty:

.. superout::

   {{ my_tree }} .git/objects

When we ``git add`` a file, there is one new file in ``.git/objects``:

.. superrun::

    echo "This project will use ``myproject``" > README.txt
    git add README.txt

.. superout::

   {{ my_tree }} .git/objects

Now do the first commit for ``super``:

.. supercommit:: super-init 2012-05-01 12:12:12

    git commit -m "Initial commit on super"

The commit made two new objects in the ``.git/objects`` directory:

* a *tree* object giving the directory listing of the root directory;
* a *commit* object giving information about the commit itself.

So, we now have three files in ``.git/objects``:

.. superout::

   {{ my_tree }} .git/objects

Adding ``myproject`` as a submodule of ``super``
================================================

We use a git submodule to put ``myproject`` inside ``super``.  We will use the
name ``subproject`` for the submodule copy of ``myproject``, to make clear
that it is the submodule copy:

.. superrun::

    git submodule add ../myproject subproject

What just happened?:

.. superrun::

    git status

Notice that ``git submodule`` has already staged its changes, so we need the
``--staged`` flag to ``git diff`` to see what has changed:

.. superrun::

    git diff --staged

As you saw, the output from ``git submodule`` says ``Cloning into
subproject``, and sure enough, if we look in the new ``subproject`` directory,
there is a clone of ``myproject`` there:

.. superout::

   {{ my_tree }} subproject

So, ``git submodule`` has:

#. cloned ``myproject`` to ``super`` subdirectory ``subproject``;
#. created and staged a small text file called ``.gitmodules`` that records
   the relationship of the ``subproject`` subdirectory to the original
   ``myproject`` repository;
#. claimed to have made a new *file* in the ``super`` repository that records
   the ``myproject`` commit that the submodule contains.

It's the last of these three that is a little strange, so we will explore.

Storing the current commit of ``myproject``
===========================================

Why do I say that git "claimed" to have made a new file to record the
``myproject`` commit?

Remember that we had three files in the ``.git/objects`` directory of
``super`` after the first commit.  After ``git submodule add`` we have four:

.. superout::

    {{ my_tree }} .git/objects

.. workvar:: gitmodules-object

    cd super
    git rev-parse :0:.gitmodules

The new object has hash |gitmodules-object|, and it contains the contents of
the new ``.gitmodules`` file:

.. superrun::

    git cat-file -p {{ gitmodules-object }}

There is only one new object in ``.git/objects``, and that is for
``.gitmodules``.  Therefore there is no new git object corresponding to the
``myproject`` repository.  In fact what has happened, is that git records the
commit for ``myproject`` in the *directory listing*, instead of recording the
``subproject`` directory as a subdirectory (tree object) or a file (blob
object).  That is a bit difficult to see at the moment, because the directory
listing is in the git staging area and not yet written into a tree object.  To
write the tree object, we do a commit:

.. supercommit:: add-module 2012-05-01 13:22:10

    git commit -m "Adding the submodule"

The exotic ``git ls-tree`` command shows us the contents of the new root tree
object (directory listing) for this commit:

.. superrun::

    git ls-tree main

As you can see, the two real files |--| ``.gitmodules`` and ``README.txt``
|--| are listed as type ``blob``, with the hashes of their file contents. This
is the usual way git refers to a file in a directory listing (see
:doc:`curious_git`, and :ref:`git-object-types`). The new entry for
``subproject`` is of type ``commit``.  The hash is the hash for current commit
of the ``myproject`` repository, in the ``subproject`` copy:

.. superrun::

    cd subproject
    git log

Updating submodules from their source repositories
==================================================

How do we keep the ``subproject`` copy of ``myproject`` up to date with the
original ``myproject`` repository?

To show this in action, we start by going back to the original ``myproject``
repository to make another commit:

.. superrun::

    cd ../myproject

.. projectcommit:: myproject-more-data 2012-05-01 13:33:21

    # Now in the original "myproject" directory
    echo "More data" > some_more_data.txt
    git add some_more_data.txt
    git commit -m "Add some more data"

.. projectrun::

    git branch -v

Of course ``super`` has not changed, because we haven't updated the submodule
clone:

.. projectrun::

    cd ../super

.. superrun::

    git status

The ``subproject`` directory is a full git repository clone of the original
``myproject``.  Remember that ``git submodule add`` created the directory by
cloning. The ``myproject`` clone has a remote from the URL we gave to ``git
submodule add``.

.. superrun::

    # We're in the "super" directory
    cd subproject
    # Now we're in the submodule clone of "myproject"
    git remote -v

We can do a ``fetch`` / ``merge`` to get the new commit:

.. subprojectrun::

    # This is the same as "git pull"
    git fetch origin
    git merge origin/main

Now what do we see in ``super``?

.. subprojectrun::

    cd ..

.. superrun::

    # Now we are in the "super" directory
    git status

.. superrun::

    git diff

Git is not tracking the *contents* of the ``subproject`` directory, but the
*git state* of the directory.  In this case, all ``super`` sees is that the
commit has changed.

.. superrun::

    git add subproject

As when we first added the submodule, a ``git add`` of the ``subproject``
directory has the effect of updating the commit that the ``super`` tree is
pointing to in the staging area, but adds no new files to ``.git/objects``.

If we do the commit, we can see the root tree listing now points
``subproject`` to the new commit of ``myproject``:

.. supercommit:: super-more-data 2012-05-01 13:44:32

    git commit -m "Update myproject with more data"

.. superrun::

    git ls-tree main

Cloning a repository with submodules
====================================

What happens if we clone the ``super`` project?

.. superrun::

    cd ..

.. workrun::

    # In directory below "super"
    git clone super super-cloned

.. workrun::

    cd super-cloned
    ls

What is in the new ``subproject`` directory?

.. superclonedout::

    {{ my_tree }} subproject

Nothing.  When you ``git clone`` a project with submodules, git does not clone
the submodules.

Getting the submodule repository clone takes two steps.  These are:

* initialize with ``git submodule init``;
* clone with ``git submodule update``.

Initializing the submodule copies the repository submodule information in
``.gitmodules`` to the repository ``.git/config`` file.  Having this as a
separate step is useful when you want to use a different clone URL from the
one recorded in ``.gitmodules``. This might happen if you want to use a local
repository to clone from instead of a slower internet repository.  In this
case, you can do ``git submodule init``, edit ``.git/config``, and then do the
cloning with ``git submodule update``.

Here's ``.git/config`` before the ``init`` step:

.. superclonedrun::

    # .git/config before submodule init
    cat .git/config

.. superclonedrun::

    git submodule init

``.git/config`` after the ``init``:

.. superclonedrun::

    # .git/config after submodule init
    cat .git/config

We have done ``init``, but not ``update``. The submodule directory is still
empty:

.. superclonedout::

    {{ my_tree }} subproject

To do the submodule clone, use ``git submodule update`` after ``git submodule
init``:

.. superclonedrun::

    git submodule update

If you are happy to clone from the clone URL recorded in ``.gitmodules``, then
you can do both ``init`` and ``update`` in one step with:

.. superclonedrun::

    git submodule update --init

.. include:: links_names.inc
.. include:: working/object_names.inc