At Barcamp Ruhr 4 this year I gave an intermediate-level talk about one of my favorite tools of all time: Git. After a very successful introductory presentation two years ago, I wanted to help people gain a deeper understanding of Git so they can use it better.

If you have used Git before and kinda like it, but feel unsure about using some of its advanced commands because you don’t completely understand what’s going on under Git’s hood; if you like what rebase can do for you but are afraid to use it because you’ve read somewhere that the sky will fall on your head if you make a mistake, then this article is for you. Git only reveals its true, awesome power if you use it to its fullest potential. And to do that, it is essential to understand how Git works internally.

The talk I gave at the Barcamp was roughly about four topics:

  1. Data structures in Git
  2. The layout of the .git directory
  3. The benefits of using rebase and why rebase isn’t nearly as harmful as everyone thinks
  4. Several Tips and Tricks for making day-to-day tasks easier

After the talk I decided to write it down here for the benefit of everyone who has ever struggled to understand what exactly Git does when you tell it to pull, merge or commit. I will write a series of four posts, based on the topics of the talk.

Data Structures in Git

Git’s core database is a directed object graph with four different types of objects. Each object has an identifier that is calculated as the SHA1 hash of its contents (strictly speaking, of the contents prefixed with a short header naming the object’s type and size). SHA1 is a cryptographic hash function that returns a 160-bit key, usually written as 40 hexadecimal characters. The fundamental property of a hash function in this context is that it’s a true function in the mathematical sense: the same input always yields the same output. This ensures that identical objects are always assigned the same identifier, so identical content is never stored twice. I will describe this principle in more detail in the following paragraphs.
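To make the "same input, same identifier" idea concrete, here is a minimal Python sketch of how an object's SHA1 can be computed. It assumes Git's usual scheme of hashing a header of the form "<type> <size>" plus a NUL byte, followed by the raw content; the function name git_object_id is my own and not part of any Git API.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the 40-character hex SHA1 that Git would assign to an object."""
    # Git hashes a short header ("<type> <size>" + NUL byte) followed by the content.
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


first = git_object_id("blob", b"hello\n")
second = git_object_id("blob", b"hello\n")
changed = git_object_id("blob", b"hello!\n")

print(first)             # 160 bits, written as 40 hex characters
print(first == second)   # True: identical content, identical identifier
print(first == changed)  # False: any change in content yields a new identifier
```

If you have Git at hand, the result for the same content should match what git hash-object prints for a file containing it.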

It will help your understanding to refer back to the following graphic when reading these paragraphs:

[Figure: Illustration of Git's graph database]

This is a representation of Git’s object graph. For brevity I focused on the structural properties of each object, omitting the content (the binary content in case of a blob, the commit message in commits etc.). I also shortened the SHA1s to three characters.

The four object types in Git are:

Blobs

Blobs are simply chunks of binary data with no other properties: no metadata, no nothing. Just the pure data. They are used to store the content of files in the repository. They do not correspond 1:1 to files, however; they correspond to file content. Two files in your repository with different names or at different locations, but with identical content, will use the same blob object to represent that content in the database. The blob is identified by the SHA1 hash of its content.
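The following toy store is a sketch of that idea in Python, not how Git is actually implemented. The class BlobStore and the file names are made up for illustration; the only thing taken from Git is the way the blob identifier is derived from the content.

```python
import hashlib


class BlobStore:
    """Toy content-addressed store: blobs are keyed by the hash of their content."""

    def __init__(self):
        self.blobs = {}  # blob id (hex SHA1) -> content bytes

    def add(self, content: bytes) -> str:
        header = f"blob {len(content)}".encode("ascii") + b"\x00"
        blob_id = hashlib.sha1(header + content).hexdigest()
        # Storing identical content a second time just hits the same key again.
        self.blobs[blob_id] = content
        return blob_id


store = BlobStore()

# Hypothetical files; note that the names never reach the store.
files = {
    "README": b"Hello Git\n",
    "doc/README.copy": b"Hello Git\n",
    "TODO.txt": b"Write more posts\n",
}
path_to_blob = {path: store.add(content) for path, content in files.items()}

print(len(files), "files,", len(store.blobs), "blobs")  # 3 files, 2 blobs
print(path_to_blob["README"] == path_to_blob["doc/README.copy"])  # True
```

The file names live somewhere else entirely, namely in trees, which are the next object type.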

Trees

Trees represent directory structures. This is a tree object:

100644 blob 159202af1c0374e33374f2a0e20b5e0ecbc0c19e    .gitignore
100644 blob 37e1207dab85993425ee5f4ceb2a59055dccfc77    .gitmodules
100644 blob e04728e8d391f57a6fa0c3325118750c602ef5ef    Capfile
100644 blob 2af0fb1133d03dcedf1f2bbca9a9b04444ef84f0    README
100644 blob 3bb0e8592a41ae3185ee32266c860714980dbed7    Rakefile
100644 blob 70d0345e4619e790993e852fe0ed1946d8d53afc    TODO.txt
040000 tree 875c4668c815306dcb1de23407973e2f1fb9d3a8    app
040000 tree 942fb533688aa713f5302b525cf0b8cfeb245d8b    config
040000 tree da543b1ab388687f5612e6fb7c06fc778b8026bc    db
040000 tree 0269300738b048a5cc34769d1436d9f228499018    doc
040000 tree 5a86b1e544e01c8951edafc39a3b0ca7bf09c2e9    lib
040000 tree 0289883d028de7e3c8c54a7fa09c2851fda8346f    public
040000 tree 5ecf890b2a8c6d1e6b76b7d2ac25a4e40cf2cc67    script
040000 tree c900b82e1d3f53af6392341f4ecf2a271961c26a    spec
040000 tree 3d5fc32106bf1848bcb79ef8a9f0fbf06e858fed    vendor

Where does this representation come from? Well, Git offers a command for that, git cat-file. Its most common usage is git cat-file -p <object>. I got the above printout by passing the SHA1 of a tree as <object>.

You can see that a tree is simply a listing of a directory. It consists of links to other objects, blobs (for files) and trees (for subdirectories), together with metadata (file permissions and filenames). What this says, for example, is that there’s a file named “Capfile”, whose content is stored in the blob with the SHA1 e04728e8d391f57a6fa0c3325118750c602ef5ef:

$ git cat-file -p e04728e8d391f57a6fa0c3325118750c602ef5ef
load 'deploy' if respond_to?(:namespace) # cap2 differentiator
Dir['vendor/plugins/*/recipes/*.rb'].each { |plugin| load(plugin) }

load 'config/deploy' # remove this line to skip loading any of the default tasks

Or that “app” is a subdirectory whose content is found in the tree 875c4668c815306dcb1de23407973e2f1fb9d3a8:

$ git cat-file -p 875c4668c815306dcb1de23407973e2f1fb9d3a8
040000 tree 3e6fae3a140890d75eb9d51ce0974f7969194661    controllers
040000 tree 77b99dd8afcf55ad613d51e9e18a3df1aafa3f62    helpers
040000 tree 41c9c92d8f11afe27f2f25fa6bad867d6427cfbe    models
040000 tree c958c76cd2ccda49b3d514a1a12b2236974300c2    sweepers
040000 tree 850a76c0d5d056c43d56bc5d987836ea296584f4    views
040000 tree 4c713208aee9f3bfb172424e3a68f2a1f10d715a    workers
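The listings above are the pretty-printed form. As a rough sketch of what gets hashed, the Python below builds the raw payload of a tree, assuming the on-disk format of one "<mode> <name>" + NUL + 20-byte binary SHA1 record per entry, sorted by name. The helper names and the sample file contents are my own inventions.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries: list) -> bytes:
    """Serialize (mode, name, hex_id) tuples into the raw body of a tree object."""

    def normalized(mode: str) -> str:
        # cat-file -p prints "040000", the stored form drops the leading zero.
        return mode.lstrip("0")

    def sort_key(entry):
        mode, name, _ = entry
        # Git orders entries by name, treating directories as if they ended in "/".
        suffix = "/" if normalized(mode) == "40000" else ""
        return (name + suffix).encode("utf-8")

    payload = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        payload += normalized(mode).encode("ascii") + b" "
        payload += name.encode("utf-8") + b"\x00"
        payload += bytes.fromhex(hex_id)  # 20 raw bytes, not 40 hex characters
    return payload


capfile = object_id("blob", b"load 'deploy'\n")


def root_tree(readme_content: bytes) -> str:
    """Id of a root tree holding Capfile and a doc/ directory with one README."""
    readme = object_id("blob", readme_content)
    doc = object_id("tree", tree_payload([("100644", "README", readme)]))
    entries = [("100644", "Capfile", capfile), ("040000", "doc", doc)]
    return object_id("tree", tree_payload(entries))


before = root_tree(b"first version\n")
after = root_tree(b"second version\n")
print(before == after)  # False
```

Changing a single file deep in the hierarchy changes the id of every tree above it, while everything that stayed the same (here the Capfile blob) keeps its id and is reused.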

Commits

Until now, our trees and blobs have been floating around in the database with no way of getting at any object without knowing its SHA1. Also, we’ve seen the information that blobs and trees can store but there was no discernible way of storing the history of anything. Pretty useless for a version control system, you say?

This is where commits come into play. Their job is to record history. Let’s start by looking at the commit on top of our master branch with git cat-file -p master:

tree dcad9007245d68ff56d90fcf96af38f686eb61c1
parent 4d9cb9b0d6248bb5c0868261039ef7f56ce47494
author Jan Varwig <jan@varwig.org> 1300882710 +0100
committer Jan Varwig <jan@varwig.org> 1300882710 +0100

Wrote Helper methods in User to aid with taking down accounts

Here you can see what kind of information is stored in a commit:

  • An author and a time of authoring as well as a committer and the date the commit was created. This distinction is made because Git supports patches that are authored by one person but committed by someone else, something that’s not uncommon in big open source projects.

    This can also occur when all developers have commit rights: Whenever you cherry-pick a commit, you become the committer, but the original author remains the same. However, for the sake of discussing Git’s data structure this distinction has no relevance.

  • A reference to a tree object that represents the state of the index at the time the commit was created.
  • Zero or more references to parent commits. This is what actually builds the history of your repository. Each regular commit has one parent, representing the previous state of the project; only the very first commit in a repository has none. When you perform a merge, the resulting commit has two or more parents, pointing to the different branches of development that have been merged.

Taking the example from above, if we inspect the parent we see such a merge commit:

$ git cat-file -p 4d9cb9b0d6248bb5c0868261039ef7f56ce47494
tree c7e8830f84488241ee185842b5e226c70e629653
parent 49951c31b899631e974deb15766389978603b47a
parent bfefebbfcdf1dec22eca969a91ac9bbf1d3d499e
author Jan Varwig <jan@varwig.org> 1300445073 +0100
committer Jan Varwig <jan@varwig.org> 1300445073 +0100

Merge branch 'stage' of dev.9elements.de:imgly into stage
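Since the printout is just a few header lines, a blank line and the message, it is easy to take apart. Here is a small Python sketch that parses the merge commit shown above; the Commit class and parse_commit function are illustrative names of my own, and the sketch ignores any additional headers a real commit may carry.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    tree: str = ""
    parents: list = field(default_factory=list)
    author: str = ""
    committer: str = ""
    message: str = ""


def parse_commit(text: str) -> Commit:
    """Parse the output of `git cat-file -p <commit>` into a Commit."""
    # Headers and message are separated by the first blank line.
    headers, _, message = text.partition("\n\n")
    commit = Commit(message=message.strip())
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit.parents.append(value)  # may occur zero, one or several times
        elif key in ("tree", "author", "committer"):
            setattr(commit, key, value)
    return commit


MERGE_COMMIT = """\
tree c7e8830f84488241ee185842b5e226c70e629653
parent 49951c31b899631e974deb15766389978603b47a
parent bfefebbfcdf1dec22eca969a91ac9bbf1d3d499e
author Jan Varwig <jan@varwig.org> 1300445073 +0100
committer Jan Varwig <jan@varwig.org> 1300445073 +0100

Merge branch 'stage' of dev.9elements.de:imgly into stage
"""

commit = parse_commit(MERGE_COMMIT)
print(commit.tree)          # c7e8830f84488241ee185842b5e226c70e629653
print(len(commit.parents))  # 2
```

Following the parent ids from commit to commit is all it takes to walk the history.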

This explains how Git strings a series of commits together to form a history, but I haven’t told you how to actually get at an object without knowing its SHA1. This is where branches come into play. They are essentially readable aliases for SHA1s that get updated every time you perform certain actions (like committing, merging, etc.). I will explain this in more detail in the next post of this series.

Tags

The last type of object is the tag. To be more specific, the annotated tag. Simple tags are not objects (more on that later), but annotated tags are. You get an annotated tag if you use the -a option when creating a tag:

$ git tag -a test_tag
$ git cat-file -p test_tag
object 4a06c46ee6d58ce4be09954ee054921b18269cd6
type commit
tag test_tag
tagger Jan Varwig <jan@varwig.org> Fri Apr 15 19:42:02 2011 +0200

This is the message for the test tag

A tag consists of an object reference, but what’s special about it is that it can refer to any kind of object. What exactly the tag is referring to can be seen in the type field. The tag we’re seeing here points to a commit with the SHA1 4a06c46ee6d58ce4be09954ee054921b18269cd6. The tag also has a name, given in the tag field, a tagger and a message.

To be honest, I have never worked with annotated tags and most of you probably never will either. Their main use case over regular tags is that they can be cryptographically signed (as can commits).
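Because a tag can point at any kind of object, including another tag, finding the object it ultimately refers to means following references until you hit something that is not a tag. A toy Python sketch of that lookup, using an invented in-memory object table with made-up three-character ids in the spirit of the illustration:

```python
# Toy object table: id -> (type, payload). All ids and payloads are made up.
objects = {
    "aaa": ("blob", b"file content"),
    "bbb": ("tree", [("100644", "README", "aaa")]),
    "ccc": ("commit", {"tree": "bbb", "parents": []}),
    "ddd": ("tag", {"object": "ccc", "type": "commit", "tag": "test_tag"}),
    "eee": ("tag", {"object": "ddd", "type": "tag", "tag": "tag_of_a_tag"}),
}


def peel(object_id: str):
    """Follow tag objects until a non-tag object is reached."""
    obj_type, payload = objects[object_id]
    while obj_type == "tag":
        object_id = payload["object"]
        obj_type, payload = objects[object_id]
    return object_id, obj_type


print(peel("ddd"))  # ('ccc', 'commit')
print(peel("eee"))  # ('ccc', 'commit')
print(peel("aaa"))  # ('aaa', 'blob')
```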

Summary

That was it. Four very simple types of objects.

  • Blobs - Storing the content of files
  • Trees - Storing the structure of your working directory
  • Commits - Putting trees into a sequence to preserve history
  • Tags - Reliable mechanism to point to objects in the database

Maybe you should take a look at one of your own repositories now, starting with git cat-file -p master and poking around a bit.

These objects and their references to each other are the absolute core of Git, and understanding their structure and relationships is essential to working well with Git. As soon as you start thinking of your repository as this object graph, you’ll realize that the Git toolchain is nothing but a set of manipulations on that graph database, creating new objects all the time, pointing to other objects.

There are two additional implementation details that you should be aware of:

  1. Even though my description and the results of git cat-file make it seem like you’re dealing with full-fledged objects, the reality is that Git uses very efficient compression algorithms to reduce the amount of actual data stored in its database. Trees or blobs aren’t usually stored in full but described as differences to other similar objects. But for reasoning about objects, you can and should think of them as being self-contained and independent.
  2. On the other hand, objects often actually _are_ stored in full in the database: a freshly created object lives in its own file and is not yet described as a difference to anything else. Also, Git never deletes objects right away! If you remove a file from a tree, the blob for that file will still exist. If you **lose a commit** through rebasing or merge problems or by accidentally deleting a branch, **as long as you know its SHA1, you can get back to it**. The only time Git actually deletes objects, or packs them into the compact form described above, is during its garbage collection run. If you clone a new project, you will retrieve packed objects from the remote server, but objects you create in your local repository will be unpacked at first. Git repacks and garbage collects on its own from time to time, but you can also trigger this process manually by calling `git gc`. Do this when you notice that working with your repository becomes slow or that the repository becomes too large.
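Locally created objects are what Git calls loose objects: each one sits in full in its own file, deflated with zlib but not expressed as a difference to any other object. The Python sketch below imitates that, assuming the usual layout of a two-character directory plus a 38-character file name under the objects directory. The function names are my own, and the sketch writes to a temporary directory rather than a real repository.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_loose_object(objects_dir: Path, obj_type: str, content: bytes) -> str:
    """Store one object in full as its own zlib-deflated file and return its id."""
    data = f"{obj_type} {len(content)}".encode("ascii") + b"\x00" + content
    object_id = hashlib.sha1(data).hexdigest()
    path = objects_dir / object_id[:2] / object_id[2:]
    if not path.exists():  # identical content is already there, nothing to do
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(zlib.compress(data))
    return object_id


def read_loose_object(objects_dir: Path, object_id: str):
    """Inflate a loose object file and split it into (type, content)."""
    path = objects_dir / object_id[:2] / object_id[2:]
    data = zlib.decompress(path.read_bytes())
    header, _, content = data.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


with tempfile.TemporaryDirectory() as tmp:
    objects_dir = Path(tmp) / "objects"  # stand-in for .git/objects
    blob_id = write_loose_object(objects_dir, "blob", b"hello\n")
    stored_files = [p for p in objects_dir.rglob("*") if p.is_file()]
    round_trip = read_loose_object(objects_dir, blob_id)

print(blob_id)
print(len(stored_files))  # 1
print(round_trip)         # ('blob', b'hello\n')
```

Nothing in this layout ever removes a file, which is the point: until a garbage collection run decides an object is no longer needed, knowing its SHA1 is enough to read it back.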

Next post

In the next post of this series, I will explain the structure of the .git directory. This is where you will find your branches and regular tags, as well as the actual object database files. You will learn what a branch actually is and how to manipulate branches effectively.