Blog Archive

Advanced Git Part 2

In the first post of this series, I explained the data structures Git uses to store files and working directory history in its database. A string of commit objects is used to keep track of your progress. But I did not explain how you can actually access a commit from outside without knowing its internal SHA1 identifier. This is one of the mysteries that will be revealed in this post, in which I talk about the structure of the .git directory and about what exactly a branch is and how branches work.

As before, I want you to go into a repository of your choice and poke around a bit yourself while you’re reading my explanations.

The .git directory

In your working directory, run cd .git to visit the .git directory. Here, Git stores everything it needs to run: configuration, the database, hooks and refs. I want to explain every subdirectory briefly, before going into the details of the more interesting parts.


jan@mops $ ls .git
COMMIT_EDITMSG
FETCH_HEAD
HEAD
ORIG_HEAD
config
description
hooks
index
info
logs
objects
packed-refs
refs
 

The configuration file config

This file stores options you have configured directly via git config or automatically through other commands. For example, git clone populates this file with the default “origin” remote:


[remote "origin"]
  url = user@server:path
  fetch = +refs/heads/*:refs/remotes/origin/*  
 

You can edit this file in any text editor. This is sometimes easier than using git config.
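The fetch line above is a refspec: it tells Git under which local name to store each branch it finds on the remote. Here is a minimal Python sketch of that mapping. The function name apply_refspec is made up for this illustration, and it only handles the simple single-wildcard case, not everything Git's own refspec handling supports.

```python
def apply_refspec(refspec, remote_ref):
    """Map a ref name on the remote to its local name.

    Hypothetical helper for illustration only. Returns None when the
    refspec does not match the given ref.
    """
    # A leading '+' allows non-fast-forward updates; it does not change the mapping.
    spec = refspec[1:] if refspec.startswith("+") else refspec
    src, dst = spec.split(":", 1)

    if "*" not in src:
        return dst if remote_ref == src else None

    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None

    # The part matched by '*' on the left is substituted for '*' on the right.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


spec = "+refs/heads/*:refs/remotes/origin/*"
print(apply_refspec(spec, "refs/heads/master"))  # refs/remotes/origin/master
print(apply_refspec(spec, "refs/tags/v1.0"))     # None
```

So the master branch on the server ends up as refs/remotes/origin/master in your repository, while tags are not covered by this particular refspec.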

The uppercase files

You will notice some files in the .git directory that are named in uppercase letters. I’ll keep this brief here and go into more detail later.


COMMIT_EDITMSG – Used to pass the commit message to your text editor
FETCH_HEAD     – Git stores the last fetched branches in here
HEAD           – Points to the branch you’re currently working on
ORIG_HEAD      – Used to back up the value of HEAD before a potentially dangerous operation
MERGE_HEAD, CHERRY_PICK_HEAD – Used temporarily during merging or cherry-picking
 

description, hooks and info

The description file contains a description of your repository. You’ll likely never use this unless you plan to publish your repository through gitweb.

The hooks directory contains callback scripts that Git executes every time a certain event occurs (like a commit or a rebase). These can be used, for example, to send out emails every time someone pushes a commit to a server.

Inside the info directory, the only file you’ll probably ever touch is the exclude file, which contains your private excludes. You can use it to prevent temporary files from showing up in git status without adding them to .gitignore.

index and logs

The index is a central mechanism of Git. Basically, it contains the content of your next commit. I like to call the index an unborn commit. By adding and removing files through git add and git rm, you shape its content to your liking and then store it in the database as a proper commit through git commit.

The logs directory contains special files known as reflogs. Each of the files here corresponds to a branch (there is also one for HEAD itself). Whenever a branch is moved to a different commit, for example by a commit, a merge or a reset, an entry is appended to its reflog. This makes it possible to see which commit your branch was pointing to at any given moment in time using git reflog <branchname>. I will talk a bit more about the reflog in the next part of the series.
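If you open one of the reflog files, you will find one line per entry: the old SHA1, the new SHA1, who made the change, a Unix timestamp with time zone and, after a tab, a short message. The following Python sketch parses such a line. The sample line is made up for illustration (only the SHA1 is borrowed from the examples in this post), and parse_reflog_line is a hypothetical helper, not part of Git.

```python
import re

# old-sha new-sha name <email> unix-time tz <TAB> message
REFLOG_LINE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<name>.*) <(?P<email>[^>]*)> (?P<time>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<message>.*))?$"
)


def parse_reflog_line(line):
    """Split one reflog line into its fields (hypothetical helper)."""
    match = REFLOG_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("not a reflog line: %r" % line)
    entry = match.groupdict()
    entry["time"] = int(entry["time"])
    return entry


# Made-up sample entry; an all-zero old SHA1 means the ref did not exist before.
sample = (
    "0000000000000000000000000000000000000000 "
    "05c80116a36bbbdd7a453255aee5a1d2c7b01fd7 "
    "Jan Example <jan@example.com> 1300000000 +0100"
    "\tcommit (initial): first commit\n"
)

entry = parse_reflog_line(sample)
print(entry["old"][:7], "->", entry["new"][:7], "|", entry["message"])
```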

The objects directory

Now it gets interesting. The objects directory contains the actual database of all the objects in the repository. The objects are stored in files and directories named after the objects’ SHA1 IDs. The first two characters of the SHA1 form the directory name, the rest is the filename. If you cat any of the files in there, you’ll only see binary garbage, because the contents of the object are compressed with zlib. To see the uncompressed content, use the following command (you’ll obviously need Ruby for this):


ruby -rzlib -e 'puts Zlib::Inflate.inflate(File.binread(ARGV[0]))' <PATH_TO_OBJECT_FILE>
 

Remember what I told you at the end of part one? That git stores all of its objects as actual files? Here you see them. Also, remember that I told you that it didn’t actually do that all of the time? Well, run git gc and list the contents of the objects directory again. Most of the directories should be gone now. They went into one of the files in objects/pack. These are compressed archives that allow for much more efficient storage of the objects. But it helps to still think of them as the actual files we’ve seen before.
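To make the loose object layout more tangible, here is a minimal Python sketch that writes a blob the way Git stores an unpacked object and reads it back. It works on a temporary directory, not on a real repository, and all function names are my own. It relies on the fact that an object consists of a small header (type, a space, the content length, a NUL byte) followed by the content, and that the SHA1 is computed over header plus content.

```python
import hashlib
import os
import tempfile
import zlib


def encode_object(obj_type, content):
    """Build the uncompressed object: '<type> <size>' + NUL + content."""
    header = ("%s %d" % (obj_type, len(content))).encode("ascii") + b"\0"
    return header + content


def loose_object_path(objects_dir, sha):
    """The first two hex characters form the directory, the rest the filename."""
    return os.path.join(objects_dir, sha[:2], sha[2:])


def write_loose_object(objects_dir, obj_type, content):
    raw = encode_object(obj_type, content)
    sha = hashlib.sha1(raw).hexdigest()
    path = loose_object_path(objects_dir, sha)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(raw))
    return sha


def read_loose_object(objects_dir, sha):
    with open(loose_object_path(objects_dir, sha), "rb") as f:
        raw = zlib.decompress(f.read())
    header, _, content = raw.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


# Demo on a throwaway directory standing in for .git/objects.
with tempfile.TemporaryDirectory() as objects_dir:
    sha = write_loose_object(objects_dir, "blob", b"hello\n")
    print("directory:", sha[:2], "file:", sha[2:])
    print(read_loose_object(objects_dir, sha))
```

The sketch ignores pack files entirely; after git gc, most objects can no longer be read this way.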

The refs directory

The refs directory sits at the interface between the user and the object database. Here, branches and tags are stored, enabling you to access commits by an easy-to-remember name instead of the SHA1. Inside refs you’ll see several subdirectories: heads and remotes store branches for the local and remote repositories respectively, and tags contains tags. If you’ve used git bisect or git stash before, you’ll also find corresponding files for them here.

You can take a look at what your refs are pointing to by just looking at their content. They simply store the hash of the object they’re referencing in plain text.

You might be wondering why git branch -av shows quite a lot more branches than there are files in the refs directory. That’s because only branches you’re actually working with are listed here as files. The rest can be found in the file packed-refs in your .git directory.
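The lookup order described here (first the file under refs, then packed-refs) can be sketched in a few lines of Python. This is a simplified illustration on a temporary directory with made-up SHA1s; resolve_ref and read_packed_refs are hypothetical helpers, and real Git handles more cases, such as symbolic refs.

```python
import os
import tempfile


def read_packed_refs(git_dir):
    """Parse packed-refs into a {ref name: sha} dict."""
    refs = {}
    path = os.path.join(git_dir, "packed-refs")
    if not os.path.isfile(path):
        return refs
    with open(path) as f:
        for line in f:
            line = line.strip()
            # '#' starts a comment line, '^' marks the peeled value of the tag above it.
            if not line or line.startswith("#") or line.startswith("^"):
                continue
            sha, name = line.split(" ", 1)
            refs[name] = sha
    return refs


def resolve_ref(git_dir, name):
    """Return the SHA1 a ref points to: loose file first, then packed-refs."""
    loose = os.path.join(git_dir, *name.split("/"))
    if os.path.isfile(loose):
        with open(loose) as f:
            return f.read().strip()
    return read_packed_refs(git_dir).get(name)


# Demo on a throwaway directory standing in for .git (SHA1s are made up).
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "packed-refs"), "w") as f:
        f.write("# pack-refs with: peeled\n")
        f.write("a" * 40 + " refs/heads/master\n")
        f.write("b" * 40 + " refs/heads/old-feature\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("c" * 40 + "\n")

    print(resolve_ref(git_dir, "refs/heads/master")[:7])       # ccccccc (loose file wins)
    print(resolve_ref(git_dir, "refs/heads/old-feature")[:7])  # bbbbbbb (only in packed-refs)
```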

Working with branches

Now that you know how branches are stored, you can probably imagine how some of Git’s common operations are implemented. Let’s take a simple commit for example.

  • Let’s assume you’re working in the master branch. Your HEAD will point to that branch. Execute a cat .git/HEAD and you’ll see a reference to the master branch:

    ref: refs/heads/master
     
    Master itself might point to a commit:

    jan@mops$ cat .git/refs/heads/master
    05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
    jan@mops$ git rev-parse master
    05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
     
    HEAD can either point to a branch, as shown, or directly to a commit (that’s called a detached HEAD, a term you might have encountered already). Git has no problems resolving HEAD to a commit in any case:

    jan@mops$ git rev-parse HEAD
    05c80116a36bbbdd7a453255aee5a1d2c7b01fd7
     
    This situation is displayed in the illustration.
  • Before you start editing, your working tree, your index and the tree object that belongs to the current commit that HEAD points to have identical content. This is situation 1 in the illustration.
  • You will now edit a file. The git status command will report that there’s a difference between your working directory and the index and list the file under “Changed but not updated” (newer versions of Git word this as “Changes not staged for commit”). This is situation 2 in the illustration.
  • After adding our changes to the index with git add, git status will now report a difference between the index and HEAD under “Changes to be committed”. We’re now at situation 3.
  • Once you’re done with your work, you finally call git commit. Git then takes your index and creates a tree object from it. A commit object is created, containing the commit message, your name and the current time. The commit’s parent will be set to the commit that is referenced by the current HEAD and its tree reference will point to the tree that was just created. This is the transition from situation 4 to situation 5.
  • Finally, to treat that newly created commit as the new tip of your development history, git updates HEAD to point to it. In case HEAD references a branch, the branch is updated. At every step you can see the pointers changing by looking into your HEAD and refs/* files. You’re now at situation 6 and your repository is in a clean state again.
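The pointer updates from the last step can be sketched in Python like this. It is a simplified model that only looks at HEAD and the loose files under refs (no packed-refs, no locking, no reflog), it runs against a temporary directory, and the function names as well as the "new" SHA1 are made up for illustration.

```python
import os
import tempfile


def read_head(git_dir):
    """Raw content of HEAD: 'ref: refs/heads/<branch>' or a bare SHA1."""
    with open(os.path.join(git_dir, "HEAD")) as f:
        return f.read().strip()


def ref_file(git_dir, ref_name):
    return os.path.join(git_dir, *ref_name.split("/"))


def resolve_head(git_dir):
    """Follow HEAD to a commit, whether it is attached to a branch or detached."""
    head = read_head(git_dir)
    if head.startswith("ref: "):
        with open(ref_file(git_dir, head[len("ref: "):])) as f:
            return f.read().strip()
    return head


def advance_after_commit(git_dir, new_commit):
    """Make new_commit the tip: update the branch if HEAD names one, else HEAD itself."""
    head = read_head(git_dir)
    if head.startswith("ref: "):
        target = ref_file(git_dir, head[len("ref: "):])
    else:
        target = os.path.join(git_dir, "HEAD")
    with open(target, "w") as f:
        f.write(new_commit + "\n")


# Demo on a throwaway directory standing in for .git.
with tempfile.TemporaryDirectory() as git_dir:
    os.makedirs(os.path.join(git_dir, "refs", "heads"))
    with open(os.path.join(git_dir, "HEAD"), "w") as f:
        f.write("ref: refs/heads/master\n")
    with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as f:
        f.write("05c80116a36bbbdd7a453255aee5a1d2c7b01fd7\n")

    print(resolve_head(git_dir))             # the old tip of master
    advance_after_commit(git_dir, "1" * 40)  # made-up id of the new commit
    print(read_head(git_dir))                # still 'ref: refs/heads/master'
    print(resolve_head(git_dir))             # now the new commit
```

Note that with HEAD attached to a branch, the HEAD file itself never changes during a commit; only the branch file it names does.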

By now, you can probably already imagine how branches are created. Git simply places a file with the name of the branch in refs/heads and lets it point to the commit you provided to git branch.
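As a rough Python sketch of that idea, assuming the branch does not exist yet and ignoring everything else git branch takes care of (name validation, reflog, tracking configuration), creating a branch boils down to this. create_branch is a hypothetical helper, and the demo runs on a temporary directory.

```python
import os
import tempfile


def create_branch(git_dir, name, commit):
    """Create refs/heads/<name> containing the SHA1 of the given commit."""
    path = os.path.join(git_dir, "refs", "heads", *name.split("/"))
    if os.path.exists(path):
        raise FileExistsError("branch %r already exists" % name)
    # Branch names containing slashes end up in subdirectories.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(commit + "\n")
    return path


with tempfile.TemporaryDirectory() as git_dir:
    path = create_branch(git_dir, "feature/login", "05c80116a36bbbdd7a453255aee5a1d2c7b01fd7")
    print(os.path.relpath(path, git_dir))
    with open(path) as f:
        print(f.read().strip())
```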

Checkouts are a little more interesting. If you instruct Git to check out a branch, three things happen:

  • The index is set to the same contents as the commit you’re checking out
  • The working directory is also adjusted to the same contents
  • If you’re checking out an actual branch (as opposed to, say, a tag or a SHA1-identified commit), git updates HEAD to point to that branch.
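The three steps can be modelled with a toy repository held entirely in memory. This is only a sketch of the bookkeeping: the dictionary layout, the commit ids c1 and c2 and the checkout function are all invented for illustration, and real Git additionally takes care not to overwrite local modifications.

```python
def checkout(repo, target):
    """Toy checkout: target is either a branch name or a commit id."""
    if target in repo["branches"]:
        commit = repo["branches"][target]
        new_head = "ref: refs/heads/" + target
    else:
        # Not a branch: HEAD will point directly at the commit (detached HEAD).
        commit = target
        new_head = commit

    tree = repo["commits"][commit]
    repo["index"] = dict(tree)     # 1. the index matches the commit
    repo["worktree"] = dict(tree)  # 2. so does the working directory
    repo["HEAD"] = new_head        # 3. HEAD points to the branch (or the commit)


repo = {
    # commit id -> {path: content}; stands in for the commit's tree object
    "commits": {
        "c1": {"README": "first version"},
        "c2": {"README": "second version", "notes.txt": "todo"},
    },
    "branches": {"master": "c2", "old": "c1"},
    "index": {},
    "worktree": {},
    "HEAD": "ref: refs/heads/master",
}

checkout(repo, "old")
print(repo["HEAD"], sorted(repo["worktree"]))

checkout(repo, "c2")
print(repo["HEAD"], sorted(repo["worktree"]))
```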

Now you know what the HEAD file I introduced in the “uppercase files” section is used for. Just as HEAD stores the pointer to your current branch, the other uppercase files point to other branches, or other commits that are interesting in some situations like a merge or fetch operation.

Summary

The last part of the series described the data structures behind Git’s object database. Having gone through the contents of the .git directory, you now understand the operations that Git performs to organize the content in the object database and to create branches. With this knowledge about these files, you should have a clear idea of how Git implements its commands.

In the next part of the series, I want to take a closer look at some of them, especially the dreaded rebase.

Advanced Git

At the Barcamp Ruhr 4 this year I held an intermediate level talk about one of my favorite tools of all time: Git. After a very successful introductory presentation two years ago, I wanted to help people to get a deeper understanding of Git so they can use it better.

If you have used Git before and kinda like it but feel unsure about using some of its advanced commands because you think you don’t completely understand what’s going on under Git’s hood, if you like what rebase can do for you but are afraid to use it because you’ve read somewhere that the sky will fall on your head if you make a mistake, then this article is for you. Git only reveals its true, awesome power if you use it to its fullest potential. And to do that it is essential to understand how Git works internally.


Advanced Git slides from Barcamp Ruhr 4

At the Barcamp Ruhr 4 this year I held a session about Git for advanced users. I’m currently preparing the content of that session as a series of blog posts, but in the meantime, here are the slides:

Download/View Slides

The talk has also been recorded by Oliver Überholz, who promised to send me a copy but so far hasn’t replied to my messages. Please give him a nudge :)

UPDATE: First post of the Advanced Git series is online.

REST in Place now on Github

After using Mercurial for 7 months we at 9elements have finally given in to the internet peer pressure and switched to git (Well, to be honest, several shortcomings in Mercurial played an important role too). Since then I’ve become accustomed to git and today ported over REST in Place from Subversion to Github.
The Github project page is located at http://github.com/janv/rest_in_place/, the repository can be found at git://github.com/janv/rest_in_place.git.

I’ve updated the README and the project page with the new information.

I’ve also published my dbserialize plugin at Github.

One of those nights

Last night Zed Shaw’s Rails is a Ghetto Shitstorm was brought to my attention. Zed’s rant provides enough meat for a post of its own but it’s not what I want to write about today. Following some comments on Zed’s article on Technorati I stumbled (again) into one of those evenings full of great discoveries.

Git

Shortly before Git became really popular some months ago, I had become interested in darcs and distributed revision control systems in general. The topic is kinda difficult though and none of the texts I was reading at the time could really communicate the benefits of DRCS to me. I always had some gripes about svn but it wasn’t clear to me how DRCS were able to solve them.

I lost interest, following posts about git only loosely until last night a colleague pointed me to Randal Schwartz’ Git presentation at Google Tech Talks. Holy crap, I need to check this out. What appeals most to me:

  • The ability to have the entire repository available locally
    I was extremely sceptical when I first heard about this, but when Schwartz claimed that the entire repository of the Linux kernel is half the size of a checkout I was sold.
  • Subversion interoperability
    Didn’t know about this before. Makes the transition much easier.
  • Having local-only repositories inside your working dir
    I have many smaller projects that I’d love to keep locally contained. In Subversion I always had to create a repository on my server for everything.
  • Other small things
    High compression, the simple database system behind git, the optimization for speed, staged commits, the ability to completely erase files from a repository (e.g. stuff not intended for publication, something that was very hard to do in Subversion), the placement of all metadata in a single directory (as opposed to littering every dir in the working copy with .svn directories)

Despite some shortcomings that was enough to make me install git on my mac (sudo port install git-core). I’m eager to check it out later today.

Smalltalk and Seaside

Ever since watching Evan Phoenix’s Rubinius presentation at RubyConf 2007 and listening to Avi Bryant’s Smalltalk’s Lessons for Ruby keynote from RailsConf 2007, I’ve been curious about Smalltalk. I mean, I was curious about it before; after all, it’s probably the language that has the most influence on what I’m doing today (through its promotion of object-orientation and through providing key principles behind Ruby), but since listening to Avi and Evan I’ve become really interested in VM implementations (see Smalltalk-80: The Language and Its Implementation for an excellent in-depth description of the original Smalltalk-80 interpreter) and real-world usage of Smalltalk.

To be honest, as much as I love Ruby as a language, its implementations all suck. And Evan explained why: implementing most of the base language on another platform (C for MRI, Java for JRuby) turns out to be a leaky abstraction when you want to extend the language. Additionally, as pure and beautiful as the Ruby language is in concept, its implementation is just as ugly. On the one hand, what I like so much about Ruby is its conceptual purity, its very limited set of axioms, syntax and exceptions from its own rules; on the other hand, this purity is not present in the interpreter when high-level data structures (like arrays and hashes) are implemented in C for performance reasons. Smalltalk has always had a strong philosophy of implementing as much as possible in Smalltalk itself and only resorting to C for a minimal subset (“Turtles all the way down”).

Rubinius aims to implement a Ruby interpreter on the design principles of Smalltalk. I love the project and there seem to be only the most brilliant people working on it (Evan Phoenix, Eric Hodel, Ryan Davis and others, full time). As many others have said already, Rubinius is likely to become the main Ruby implementation if they manage to take off (and they will undoubtedly).

Yet, something was bugging me: If Ruby and Smalltalk are so similar, if Smalltalk has been around, specified and stable for 25 years in many different, compatible implementations, commercial and open source, why take the long route and bend Ruby to look like Smalltalk? Why not use Smalltalk directly? These questions became even more nagging after reading Randal Schwartz’ (yeah, the guy who sold me on Git earlier) Transcript show: ‘Hello, world!’, cr and Fabio Akita’s excellent Interview with Avi Bryant (part 1, part 2). Listening to yet another chat with Avi (linked in the second part of Fabio’s interview) at Floss Weekly (with Randal Schwartz again, highly recommended) before falling asleep, I finally decided to check out Squeak, the (most popular?) open source Smalltalk implementation. I was amazed at the simplicity of the installation: download the VM, download the Squeak image, load the image into the VM, done.

I have read about Seaside, Avi’s web development framework, before, mainly in regard to its clever use of continuations, but some of the stuff he described at Floss sounds almost too good to be true. Live debugging of your app in the browser? With hot code swapping over the wire? And I thought Rails’ rdebug integration was great.

Well, at 2am I was finally falling asleep but the stuff I’ve been discovering will probably keep me occupied for quite some time. As I explore and discover more about the topics mentioned, I’ll report my findings here on my blog.