In the first post of this series, I explained the data structures Git uses
to store files and working directory history in its database. A string of commit
objects is used to keep track of your progress. But I did not explain how you
can actually access a commit from outside without knowing its internal SHA1
identifier. This is one of the mysteries that will be revealed in this post,
in which I talk about the structure of the .git directory and about what
exactly a branch is and how branches work.
As before, I want you to go into a repository of your choice and poke around a bit yourself while you're reading my explanations.
The .git directory
In your working directory, run cd .git to visit the .git directory.
Here, Git stores everything it needs to run: configuration, the database,
hooks and refs. I want to explain every subdirectory briefly, before going
into the details of the more interesting parts.
jan@mops $ ls .git
COMMIT_EDITMSG
FETCH_HEAD
HEAD
ORIG_HEAD
config
description
hooks
index
info
logs
objects
packed-refs
refs
The configuration file config
This file stores options you have configured directly via git config or
automatically through other commands. For example, git clone populates this
file with the default "origin" remote:
[remote "origin"]
url = user@server:path
fetch = +refs/heads/*:refs/remotes/origin/*
You can edit this file in any text editor. This is sometimes easier than using
git config.
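To convince yourself that git config is just an editor for this file, here is a small sketch. It works in a throwaway repository created with mktemp, and the remote name and URL are the placeholder values from above, not a real server.

```shell
#!/bin/sh
set -eu

# Work in a throwaway repository so nothing real is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Writing options through git config ...
git config remote.origin.url user@server:path
git config remote.origin.fetch '+refs/heads/*:refs/remotes/origin/*'

# ... simply adds lines to the plain-text file .git/config.
cat .git/config

# Reading goes through the same file.
url=$(git config --get remote.origin.url)
echo "origin url: $url"
```

Editing the file by hand and running git config --get afterwards should show the edited value in the same way.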
The uppercase files
You will notice some files in the .git directory that are named in uppercase letters. Let's keep this brief; I will get into more detail later.
COMMIT_EDITMSG - Used to pass the commit message to your text editor
FETCH_HEAD - Git stores the last fetched branches in here
HEAD - Points to the branch you're currently working on
ORIG_HEAD - Used to back up the value of HEAD before a potentially dangerous operation
MERGE_HEAD, CHERRY_PICK_HEAD - Used temporarily during merging or cherry-picking
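As a quick illustration of ORIG_HEAD, the following sketch performs a hard reset in a scratch repository and then looks at the file. The identity, file name and commit messages are made up for the example, and it assumes Git's default file-based ref storage.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "first version" > file.txt
git add file.txt
git commit -q -m "first"
echo "second, longer version" > file.txt
git commit -q -am "second"
second=$(git rev-parse HEAD)

# A hard reset moves the branch; Git saves the old tip in ORIG_HEAD first.
git reset -q --hard HEAD~1

cat .git/HEAD        # still a symbolic reference to the branch
cat .git/ORIG_HEAD   # the commit the branch pointed to before the reset
orig=$(git rev-parse ORIG_HEAD)
```

Running git reset --hard ORIG_HEAD afterwards would bring the branch back to the second commit.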
description, hooks and info
The description file contains a description of your repository. You'll likely never use this
unless you plan to publish your repository through gitweb.
The hooks directory contains callback scripts
that are executed by Git every time a certain event occurs (like a commit or a rebase).
These can be used, for example, to send out emails every time someone pushes a commit to a server.
Inside the info directory, the only file you'll probably ever touch is the
exclude file, which contains your private excludes. You can use it to prevent
temporary files from showing up in git status without adding them to .gitignore.
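A minimal sketch of a private exclude, run in a scratch repository. The file name notes.tmp and the pattern *.tmp are arbitrary examples, and the sketch assumes no global ignore rule already covers them.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

touch notes.tmp
before=$(git status --porcelain)   # "?? notes.tmp": the file is untracked

# Private ignore rule: it lives only in this clone and is never committed.
mkdir -p .git/info
echo '*.tmp' >> .git/info/exclude
after=$(git status --porcelain)    # empty: the file is now ignored

echo "before: $before"
echo "after:  $after"
```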
index and logs
The index is a central mechanism of Git. Basically it contains the content of
your next commit. I like to call the index an unborn commit. By adding
and removing files through git add and git rm you shape its content to
your liking and then store it in the database as a proper commit through git commit.
The logs directory contains special files known as reflogs. Each of the files
here corresponds to a branch. Whenever the branch is moved to a different commit,
an entry is appended to its reflog. This makes it possible to see which commit your
branch was pointing to at any given moment in time, using git reflog <branchname>.
I will talk a bit more about the reflog in the next part of the series.
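Both mechanisms can be observed directly. The sketch below stages a file, prints the index entry, commits, and then reads the reflog files. The repository, identity and file name are invented for the example.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo hello > greeting.txt
git add greeting.txt

# The index now lists the file with its mode, blob id and stage number.
staged=$(git ls-files --stage)
echo "$staged"

git commit -q -m "add greeting"

# The commit appended one line to the reflog of HEAD and of the branch.
branch=$(git symbolic-ref --short HEAD)
cat .git/logs/HEAD
entries=$(grep -c '' ".git/logs/refs/heads/$branch")
git reflog "$branch"
```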
The objects directory
Now it gets interesting. The objects directory contains the actual database of
all the objects in the repository. The objects are stored in files and directories
that are based on the objects' SHA1 ids. The first two characters of the SHA1 form
a directory, the rest is the filename. If you cat any of the files in there,
you'll see the binary contents of the object, compressed with zlib. To see the
uncompressed content, use the following command, replacing the last argument with the path of an object file (you'll obviously need Ruby for this):
ruby -rzlib -e 'puts Zlib::Inflate.inflate(File.binread(ARGV[0]))' path/to/object-file
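If Ruby is not at hand, Git's own plumbing commands show the same thing. This sketch stores a small blob in a scratch repository, derives the file path from its id, and lets git cat-file do the inflating. The blob content is just an example string.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

# Store a blob in the database; Git prints the id it was filed under.
id=$(echo 'hello objects' | git hash-object -w --stdin)

# First two characters: directory. Remaining characters: file name.
dir=$(printf '%s' "$id" | cut -c1-2)
file=$(printf '%s' "$id" | cut -c3-)
path=".git/objects/$dir/$file"
ls -l "$path"

# git cat-file decompresses the object for you.
type=$(git cat-file -t "$id")      # object type
content=$(git cat-file -p "$id")   # pretty-printed content
echo "$type: $content"
```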
Remember what I told you at the end of part one? That git stores all of its
objects as actual files? Here you see them. Also, remember that I told you that
it didn't actually do that all of the time?
Well, run git gc and list the contents of the objects directory again. Most
of the directories should be gone now. They went into one of the files in objects/pack.
These are compressed archives that allow for much more efficient storage of the objects.
But it helps to still think of them as the actual files we've seen before.
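To watch the objects move into a pack, you can compare the output of git count-objects before and after a garbage collection. The sketch below does that in a scratch repository with a single made-up commit.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
id=$(git rev-parse HEAD)

git count-objects -v      # "count" = loose objects, "in-pack" = packed ones
git gc --quiet
git count-objects -v

inpack=$(git count-objects -v | sed -n 's/^in-pack: //p')
ls .git/objects/pack

# The commit can still be looked up by id, now served from the pack.
type=$(git cat-file -t "$id")
```

The commit, its tree and the blob account for the three packed objects reported after the collection.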
The refs directory
The refs directory sits at the interface between the user and the object database.
Here, branches and tags are stored, enabling you to access commits by an
easy-to-remember name instead of the SHA1. Inside refs you'll see several subdirectories:
heads and remotes store branches for the local and remote repositories respectively, tags contains tags.
If you've used git bisect or git stash before, you'll also find corresponding
files for them here.
You can take a look at what your refs are pointing to by just looking at their content. They simply store the hash of the object they're referencing in plain text.
You might be wondering why git branch -av shows quite a lot more
branches than there are files in the refs directory. That's because only branches
you're actually working with are listed here as files. The rest can be found in
the file packed-refs in your .git directory.
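The move from a loose ref file to packed-refs can be triggered by hand with git pack-refs. A small sketch, assuming Git's default file-based ref storage and an invented branch called topic:

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
git branch topic

# A loose ref is a one-line text file holding a commit id.
loose=$(cat .git/refs/heads/topic)
echo "refs/heads/topic -> $loose"

# Packing moves the refs into the single file .git/packed-refs.
git pack-refs --all
cat .git/packed-refs

# The name still resolves although the loose file has been removed.
resolved=$(git rev-parse topic)
```

As soon as the branch is updated again, Git writes a fresh loose file for it, which then takes precedence over the packed entry.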
Working with branches
Now that you know how branches are stored, you can probably imagine how some of Git's common operations are implemented. Let's take a simple commit, for example.
- Let's assume you're working in the master branch. Your HEAD will point to that branch. Run cat .git/HEAD and you'll see a reference to the master branch:
  ref: refs/heads/master
  master itself might point to a commit:
  jan@mops$ cat .git/refs/heads/master
  05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
  jan@mops$ git rev-parse master
  05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
  HEAD can either point to a branch, as shown, or directly to a commit (that's called a detached HEAD, a term you might have encountered already). Git has no problem resolving HEAD to a commit in either case:
  jan@mops$ git rev-parse HEAD
  05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
  This situation is displayed in the illustration.
- Before you start editing, your working tree, your index and the tree object of the commit that HEAD points to have identical content. This is situation 1 in the illustration.
- You now edit a file. The git status command will report a difference between your working directory and the index and list the file under "Changed but not updated" (newer Git versions say "Changes not staged for commit"). This is situation 2 in the illustration.
- After adding your changes to the index with git add, git status will report a difference between the index and HEAD under "Changes to be committed". We're now at situation 3.
- When you're done with your work, you finally call git commit. Git then takes your index and creates a tree object from it. A commit object is created, containing the commit message, your name and the current time. The commit's parent is set to the commit referenced by the current HEAD, and its tree reference points to the tree that was just created. This is the transition from situation 4 to situation 5.
- Finally, to treat the newly created commit as the new tip of your development history, Git updates HEAD to point to it. If HEAD references a branch, the branch is updated and HEAD keeps pointing to the branch. At every step you can see the pointers changing by looking into your HEAD and refs/* files. You're now at situation 6 and your repository is in a clean state again.
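The steps above can be replayed with Git's low-level commands. The following is only a sketch of what git commit roughly does, not its actual implementation: it uses git write-tree, git commit-tree and git update-ref in a scratch repository, and the identity, file name and messages are made up.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo "first version" > file.txt
git add file.txt
git commit -q -m "first"
parent=$(git rev-parse HEAD)

echo "second, longer version" > file.txt
git status --short      # " M file.txt": working tree differs from the index
git add file.txt
git status --short      # "M  file.txt": index differs from HEAD

# Roughly what git commit does, spelled out step by step:
tree=$(git write-tree)                                    # index -> tree object
commit=$(echo "second" | git commit-tree "$tree" -p "$parent")   # commit object
git update-ref HEAD "$commit"    # follows HEAD to the branch and moves it

git status --short      # no output: the repository is clean again
branch=$(git symbolic-ref HEAD)
echo "$branch now points to $(cat ".git/$branch")"
```

Note that update-ref leaves the HEAD file untouched here. Only the branch file it refers to receives the new commit id.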
By now, you can probably already imagine how branches are created. Git simply
places a file with the name of the branch in refs/heads and lets it point to
the commit you provided to git branch.
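Here is a small sketch of that claim, assuming Git's default file-based ref storage. It creates one branch with git branch and a second one by writing the file by hand. The branch names topic and by-hand are invented, and the hand-written variant is for illustration only; in real work git branch or git update-ref is the safer choice.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
tip=$(git rev-parse HEAD)

# The usual way: git branch writes a file containing the commit id.
git branch topic
cat .git/refs/heads/topic

# The same effect by hand (illustration only).
echo "$tip" > .git/refs/heads/by-hand

git branch --list
byhand=$(git rev-parse by-hand)
```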
Checkouts are a little more interesting. If you instruct Git to check out a branch, three things happen:
- The index is set to the same contents as the commit you're checking out
- The working directory is also adjusted to the same contents
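- If you're checking out an actual branch (as opposed to, say, a tag or a SHA1-identified commit), Git updates HEAD to point to that branch. Otherwise HEAD is set to point directly to the commit, which is the detached HEAD mentioned earlier.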
Now you know what the HEAD file I introduced in the "uppercase files" section is used for. Just as HEAD stores the pointer to your current branch, the other uppercase files point to other branches, or other commits that are interesting in some situations like a merge or fetch operation.
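The difference between checking out a branch and checking out a bare commit shows up directly in the HEAD file. A minimal sketch in a scratch repository, with an invented branch called topic:

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
git branch topic
tip=$(git rev-parse topic)

# Checking out a branch: HEAD becomes a symbolic reference to it.
git checkout -q topic
attached=$(cat .git/HEAD)

# Checking out the commit itself: HEAD holds the bare commit id.
git checkout -q --detach topic
detached=$(cat .git/HEAD)

echo "on branch: $attached"
echo "detached:  $detached"
```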
Summary
The previous part of the series described the data structures behind Git's object database.
This part toured the contents of the .git directory, so you should now understand what Git does to organize the content in the object database and to create branches.
Knowing these files, you should have a clear idea of how Git implements its commands.
In the next part of the series, I want to take a closer look at some of them, especially the dreaded rebase.
Really enjoying this series so far. I’ve been using Git for about two years now, and very comfortable with rebasing, merging, and editing history, but it’s very interesting to get a better understanding of what’s going on behind the scenes. The concept of HEAD has always been particularly fuzzy, but you’ve made it clear!
Really looking forward to part 3 – any idea when it’ll be up?
Thanks.
Part 3 should be ready around end of June.
Hi, have you released part 3?
No. There was a noticeable lack of feedback to the series (a few comments to the first entry, a single one to the second) and writing these posts really requires effort.
Also, there are already very good articles on this topic. They cover Git's internals much more comprehensively than I could ever hope to. If you want to know more, I recommend taking a look at
Sorry for my delay, many thanks for sharing your knowledge with us :) and thank you for the links