From Zero to Git

Daniel Tashjian
Published in Tapad Engineering
Oct 18, 2019

This guide will teach you git by starting with fundamental concepts. From there, it will build upon that foundation, introducing each layer of abstraction in terms of the previous ones. This is in contrast to many other git guides I have seen that start with the basic commands you would use day to day. Instead, this guide will introduce some commands only after teaching the underlying concepts.

For those of you who feel you have some level of understanding about git, I ask that you pretend to forget everything about git. Forget what it does, who uses it, what they use it for, commits, branches — everything. To start with, I am going to assume you only know about directories and files.

A seemingly natural starting point would be git init. However, every command in git is defined in terms of lower-level concepts which I haven’t introduced yet. Right now we only have directories and files. So instead, imagine you have a single root directory, and there is an empty directory within it called .git. This new directory will store everything needed to make git work; it is your git repository. You should not need to directly access anything in .git except in the most advanced use cases or for educational purposes. Everything else under root but outside of the .git directory is called the working tree, and you control all of it directly. You are the originator of everything in the working tree.

An empty .git directory is neither interesting nor useful. Before we put anything in there, we need to have a motivation to guide us. To provide that motivation, I will describe git’s primary purpose and function: git stores snapshots of your working tree so that the snapshots can be restored back to your working tree at a later time. A snapshot of your working tree will include the names of your files with a relative path from root, and the contents of those files. Interestingly, snapshots do not record directories on their own: a directory appears only as part of some file's path, which is why git cannot track an empty directory. We now have our first thing to put in our .git directory: snapshots.
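To make the idea of a snapshot concrete, here is a minimal Python sketch. It is not how git builds snapshots internally; the function name take_snapshot and the dictionary representation are my own assumptions for illustration. It only captures the description above: relative file paths mapped to file contents, with nothing recorded for directories.

```python
import os

def take_snapshot(root):
    """Return {relative path: file contents} for every file under root."""
    snapshot = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip any directory named .git: the repository is not part of the working tree.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Store paths relative to root, with forward slashes on every platform.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            with open(full, "rb") as f:
                snapshot[rel] = f.read()
    return snapshot
```

Notice that an empty directory contributes nothing to the result, because only files produce entries.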

We may have snapshots, but they are completely unorganized. Just having them floating around will not be very useful to anyone. To organize them better, we will next talk about the content-addressable filesystem. If you don’t know what that is, you’re in luck because you’re about to learn something new.

A content-addressable filesystem is a content store with the following properties:

  1. Ability to access content from a key: key -> content.
  2. A function H to determine the key for a given content: H(content) == key.
  3. H(content) != H(different-content).
  4. H(content) == H(same-content).

In practice, git uses a hash algorithm for H. The particular algorithm is not important; what’s important is that the resulting key is, for all practical purposes, unique to the content used to generate it. The motivation for generating such a key is that content can be arbitrarily large and unwieldy, whereas the key is a small fixed size no matter how large the content is. Therefore, you can pass this key around and use it as a pointer to the content to avoid unnecessary duplication and wasteful processing.
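The four properties can be sketched in a few lines of Python. This is a toy in-memory model, not git's storage code; the class name ContentStore and the choice of SHA-1 for H are assumptions made for illustration.

```python
import hashlib

def H(content: bytes) -> str:
    """Derive a small fixed-size key from content (SHA-1 is an illustrative choice)."""
    return hashlib.sha1(content).hexdigest()

class ContentStore:
    """A toy in-memory content-addressable store."""

    def __init__(self):
        self._objects = {}  # key -> content

    def put(self, content: bytes) -> str:
        """Store content under the key derived from it and return that key."""
        key = H(content)
        self._objects[key] = content  # storing identical content again changes nothing
        return key

    def get(self, key: str) -> bytes:
        """Look up content by its key."""
        return self._objects[key]

store = ContentStore()
key = store.put(b"hello")
```

Here store.get(key) returns b"hello", and the key has the same length whether the content is five bytes or five megabytes.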

The ability to go from key to content and back again has an extremely important implication: immutability is required. Imagine what would happen if we modified content in our content-addressable filesystem. The new content would generate a different key from H and thus would be pointed at by the wrong key. Our filesystem would no longer be content-addressable. As such, everything that is stored in the content-addressable filesystem is necessarily immutable. A lot gets stored in the content-addressable filesystem as you will see, including our snapshots. Whenever git creates a snapshot, the snapshot is run through the H function to generate the key, and the snapshot is stored at the location of the key.
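For the curious, here is a sketch of what "stored at the location of the key" looks like for the contents of a single file. As far as I know, this mirrors git's default SHA-1 format for loose objects; the function names are my own, and git also has other storage forms (such as packed objects) that this does not cover.

```python
import hashlib
import zlib

def git_blob_key(content: bytes) -> str:
    """Key for file contents: SHA-1 over a short header followed by the bytes."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def object_location(key: str) -> str:
    """Where the object lives, relative to the .git directory."""
    return "objects/" + key[:2] + "/" + key[2:]

def object_bytes(content: bytes) -> bytes:
    """What gets written there: the header plus content, zlib-compressed."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)
```

Because the location is derived from the content, "editing" a stored object cannot work: the edited bytes belong at a different location, and the original stays where it was.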

Things are starting to shape up: we have snapshots of our working tree, and we have pointers we can use to reference our snapshots. Unfortunately, our keys in practice tend to look like arbitrary hexadecimal strings and so are not memorable. This does not make for a good user experience. Git provides ways to reference snapshots by an alias. However, before we get there, we first have to introduce the most important layer of abstraction in git: the commit.

Commits are the core abstraction in git and deserve the most attention.

They are the fundamental abstraction upon which almost everything else you do is built, directly or indirectly. In this guide, I talk a lot about snapshots, but in practice, you don’t need to. In practice, commits are the lowest level concept you will ever need to talk about. So it may surprise you to learn that there isn’t a whole lot to commits. A commit is little more than some bookkeeping about a particular snapshot. Like snapshots, they are also stored in the content-addressable filesystem. As such, they get their own unique key from H(commit).

Among other things, commits contain:

  • A reference to exactly one snapshot.
  • The name of who authored the snapshot.
  • A timestamp of when the snapshot was authored.
  • The name of who created the commit (can differ from author).
  • A timestamp of when the commit was created.

However, the most important thing that commits bring to the picture is they can contain any number of keys that point to other commits. Think: H(other-commit)*. Before, our snapshots were completely isolated and had no relation to each other. Now that we have introduced commits, we created the ability to encode ordering. We can express that this commit “came after” this other commit by having a new commit point to an existing one.
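Here is a minimal sketch of commits pointing at other commits. The JSON encoding and the names make_commit and history are assumptions for illustration (git uses its own text format and records more fields, such as the timestamps listed above), but the shape is the same: a commit holds one snapshot key plus the keys of the commits it comes after.

```python
import hashlib
import json

objects = {}  # toy content-addressable store: key -> bytes

def put(content: bytes) -> str:
    """Store content under its own hash and return that key."""
    key = hashlib.sha1(content).hexdigest()
    objects[key] = content
    return key

def make_commit(snapshot_key, author, parents=()):
    """Store a commit: bookkeeping about one snapshot plus keys of earlier commits."""
    commit = {"snapshot": snapshot_key, "author": author, "parents": list(parents)}
    # sort_keys makes the encoding deterministic, so equal commits get equal keys.
    return put(json.dumps(commit, sort_keys=True).encode())

def history(commit_key):
    """Follow the first listed commit key backwards until a commit points at nothing."""
    keys = []
    while commit_key is not None:
        keys.append(commit_key)
        parents = json.loads(objects[commit_key])["parents"]
        commit_key = parents[0] if parents else None
    return keys

first = make_commit(put(b"snapshot 1"), "alice")
second = make_commit(put(b"snapshot 2"), "alice", parents=[first])
```

Calling history(second) yields the key of second followed by the key of first: the ordering is recovered purely from the keys stored inside the commits.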

There you go — you have all the tools you need to track the history of your working tree. It’s still a poor user experience, but it won’t be hard to add things to make it much nicer to work with.

It’s worth taking a moment to step back and talk about what git is not. Most of my initial frustration with learning git was due to a disconnect between what I thought git was and what git actually is. Now is a good time to address some common misconceptions.

Misconception 1: Commits store diffs from the previous commits

As we have talked about, git stores a whole snapshot under each commit. These are not diffs; they are entire representations of the working tree, each referenced from one commit. This is why shallow clones of remote repositories are possible: git does not need all historical commits in order to “play a sequence of diffs” to generate your current working tree. How it stores all of this in a small amount of space is interesting but beyond the scope of this guide.

Misconception 2: You can undo a commit

Git’s choice of using the word ‘commit’ is very apt. When you create a commit, you are truly committing it to the repository. Remember, the content-addressable filesystem requires immutability. You cannot go back and change your mind. What you can do is decide to create a new commit with different content. Git provides tools to make that easy, but remember you are neither editing nor unmaking the undesired commit. It will still exist.

Misconception 3: You can change history

Whether this is a misconception is a bit of a philosophical dispute. I consider it one because it implies that things are mutable when the vast majority of what git stores is not. I have seen this create fear in people, making them worry they are going to lose data if they run a dangerous command the wrong way. The only data that is realistically at risk of being lost forever is uncommitted content in your working tree. Once your data is committed, it is extremely difficult to lose it by accident. What is really happening when people talk about “changing history” is simply switching to a different history, not modifying the current one.

Misconception 4: Git has a DAG

DAG is an acronym for Directed Acyclic Graph. People colloquially will talk about “The DAG” as if it is a necessary component of git. It creates an inaccurate perception that git contains an internal DAG it treats specially. The ubiquity of the term is understandable. In practice, nearly all git repositories will have exactly one DAG in them, encoded by the commits. The truth is that a DAG is merely the artifact of highly common git usage patterns. Note that so far in the git built up in this guide, I have made no mention of any DAG. That is precisely because git does not care about DAGs. However, you do have all the tools necessary to create “The DAG” that people talk about. It is good to realize that not only are you creating the DAG, but it is totally optional. A mildly popular git usage pattern results in a linear history, no graphs necessary.

Aside: I know a line is a special case of graph. However, if you count special cases, then just about everything has a graph in which case talking about graphs is kind of meaningless.

Although I have yet to witness it in the wild, you could even use git in a way that there is no history, not even a linear one. It would just be completely disjoint individual commits that you arbitrarily check out. You could even use git in a way that you have two DAGs or more. They would not share any history, but they can happily cohabit the same git repository. I am not bringing these things up as useful options, but rather to help you exercise your understanding so far.

Let us get back to building up our model of git. We have commits now so you may be antsy to start looking at commands. However, we need one more concept before we can have a command that creates commits. What do commits need? Snapshots. Therefore, in order to create commits, we need to create snapshots. That is what the index (also called the staging area) is for: building up the new snapshot that you want to commit.

The process goes:

  1. Add everything you want in your snapshot to the index.
  2. Commit everything in the index.

This is our first stateful component. Fortunately, it doesn’t have a very complicated workflow. It only supports adding and removing things. That brings us to our first two commands:

  • git add file-or-dir adds files to the index.
  • git reset file-or-dir removes files from the index.

In both cases, specifying a directory adds or removes all files in that directory recursively. We also now have everything we need to create commits:

  • git commit -m message creates a commit with a snapshot from the index and the given message.

The message is another piece of bookkeeping in a commit. It can be useful for documenting why you are creating the commit.
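The add, reset, and commit steps can be modeled with a small amount of state. This is a toy sketch under my own assumptions (the ToyRepo class, its method names, and the JSON encoding are all hypothetical), and it follows the simplified description above rather than everything the real commands do.

```python
import hashlib
import json

objects = {}  # toy content-addressable store: key -> bytes

def put(content: bytes) -> str:
    """Store content under its own hash and return that key."""
    key = hashlib.sha1(content).hexdigest()
    objects[key] = content
    return key

class ToyRepo:
    def __init__(self):
        self.index = {}  # path -> key of the staged file contents

    def add(self, path: str, content: bytes) -> None:
        """Like `git add path`: stage this file's current contents."""
        self.index[path] = put(content)

    def reset(self, path: str) -> None:
        """Like `git reset path` as described above: drop the file from the index."""
        self.index.pop(path, None)

    def commit(self, message: str) -> str:
        """Like `git commit -m message`: freeze the index into a snapshot inside a commit."""
        snapshot_key = put(json.dumps(self.index, sort_keys=True).encode())
        commit = {"snapshot": snapshot_key, "message": message}
        return put(json.dumps(commit, sort_keys=True).encode())
```

The index is the only mutable piece here. Everything that commit produces goes into the immutable store and is identified by its key.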

Next, let us start giving our commits names we can remember them by. The simplest way to do this is with tags. A git tag is an alias to a specific commit. This means that whenever you run a git command where you specify a commit, you can instead reference the tag by name, usually to the same effect. Tags are not stored in the content-addressable filesystem. They are referenced by name, not a content-addressable key, so they are simply stored in a file of the same name in a special directory for tags.

Tags also have a variant called annotated tags. Essentially, they add a layer of bookkeeping between the tag and the commit. The tag now refers to a piece of bookkeeping stored in the content-addressable filesystem, and that bookkeeping contains the reference to the commit. Annotated tags let you attach additional information to the tag, such as who created it, when, and a message, and because that information lives in the content-addressable filesystem it travels with the tag when you share it with other git repositories.

Tags are great for keeping a record of a specific commit in the past. However, it is very common for git users to want to keep track of the most recent update in a line of work. Tags are not meant to be updated; this is where branches come in. Branches are almost exactly the same as the basic tag. They have their own special directory where they are stored, they are stored by name, and they only contain a single reference to a commit and nothing else. The only thing that differentiates them from tags is the semantics in how they are handled. Most notably, branches can be updated to change the commit they point to. There is also some metadata about branches for working with remote repositories, but the branch itself stores nothing more than a single key to a commit.
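A sketch of the difference, again as a toy model: both tags and branches are nothing but a name mapped to one commit key, and only the handling differs. The Refs class and its method names are hypothetical, and the commit keys are shortened placeholders. In an actual repository, as far as I know, these pointers live as small files under .git/refs/tags and .git/refs/heads.

```python
class Refs:
    """Toy model of named pointers: each name maps to exactly one commit key."""

    def __init__(self):
        self.tags = {}      # name -> commit key, set once
        self.branches = {}  # name -> commit key, free to move

    def create_tag(self, name, commit_key):
        """Tags are write-once in this model."""
        if name in self.tags:
            raise ValueError(f"tag {name!r} already exists")
        self.tags[name] = commit_key

    def set_branch(self, name, commit_key):
        """Branches may be created or moved to a different commit at any time."""
        self.branches[name] = commit_key

    def resolve(self, name):
        """Translate a branch or tag name into the commit key it points to."""
        if name in self.branches:
            return self.branches[name]
        return self.tags[name]

refs = Refs()
refs.create_tag("v1.0", "aaa111")
refs.set_branch("master", "aaa111")
refs.set_branch("master", "bbb222")  # the branch moves; the tag stays put
```

After this runs, resolving "v1.0" still gives the first key while resolving "master" gives the second.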

Misconception 5: Git branches are like the branches of a tree

When trying to visualize a git repository, it is inevitable that you will encounter examples that show diverging paths of commits that look a lot like a tree. The ends of these paths will often be pointed at by a branch. Given the name ‘branch’ and this visualization, it is very easy to mistake the alternative path as the branch itself. It is very important to understand that while you may refer to that path as a branch colloquially, git does not think of branches in that way. To git, a branch is just a named pointer to a commit. To further the confusion, if you try to merge two branches, git does look at the history as part of performing the merge. However, git would look at the history anyway even if branches were not the direct targets of the merge. Git behaves that way because those are the semantics of merging commits.

Now that we have branches, we are ready for a new command. Branches are yet another stateful component that is very simple, so there is not a whole lot you can do with it. The only thing you can really do to a branch is set the commit it points to:

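  • git reset commit-or-tag-or-branch sets the pointer of the checked-out branch to the given commit, or to the commit referenced by the given name. By default it also resets the index to match that commit, but leaves the working tree alone.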

This is deemed one of the most dangerous commands in git, yet what is it really doing?

It is updating a single value: the pointer of the checked-out branch. How can such an innocuous-seeming operation be thought to be so dangerous? There are a couple of reasons:

  1. As you may have noticed, we’ve seen the reset command before, but with a different form. This means that users often mistakenly use a form they did not intend to use.
  2. A variant of this command which adds the --hard flag can delete things from the working tree. As I said, the only data realistically at risk of being lost is uncommitted content in the working tree.
  3. Reset can cause you to create “dangling commits” which can only be referenced by their key. Keys are not memorable so users often have trouble finding those commits again.
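The third point can be illustrated with a toy model in which commit keys are short made-up strings and each commit lists the commits it points to. The function name unreachable is my own; it finds the commits that no named pointer leads to, which is what this guide calls dangling.

```python
commits = {"c1": [], "c2": ["c1"], "c3": ["c2"]}  # commit key -> keys it points to
branches = {"master": "c3"}                        # branch name -> commit key

def unreachable(refs, commits):
    """Commits that cannot be reached from any named pointer."""
    seen = set()
    stack = list(refs.values())
    while stack:
        key = stack.pop()
        if key not in seen:
            seen.add(key)
            stack.extend(commits[key])  # keep walking towards older commits
    return set(commits) - seen

before = unreachable(branches, commits)
branches["master"] = "c1"  # the effect of `git reset c1` on the checked-out branch
after = unreachable(branches, commits)
```

Before the reset nothing is unreachable; afterwards c2 and c3 are. They are still in the store, untouched, but no name leads to them anymore. In real git, the reflog (see git reflog) keeps a record of where the branch pointed before, which is usually how such commits are found again.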

You may be wondering, “What does it mean for a branch to be checked out?” That’s good, because I have not yet given you that piece to the puzzle. For that you need: HEAD.

HEAD is probably the most complex concept in git. By convention, it is written in all caps. I suppose that is to make it clear we are referring to something unique. HEAD is kind of like a branch, but it differs in these ways:

  • HEAD can point to a branch or directly to a commit; checking out a tag points HEAD at the commit the tag names.
  • There is only one HEAD, which ensures there is no ambiguity and makes it a suitable default value.
  • HEAD sort of represents the current state of the git repository due to how git commands make use of it.

Whatever HEAD points to is described as “checked out”. So when I referred to the checked-out branch previously, that meant the branch pointed at by HEAD. In most cases HEAD will be pointing to a branch, but there are good reasons to have it point directly to a commit at times, for example the commit named by a tag.

Perhaps the best way to understand HEAD is to look at how commands make use of it:

  • git init dir-name will create your root directory, your largely empty git repository, and a branch named master (the default name is configurable in newer versions of git), with HEAD pointing at master.
  • git commit -m message will create a new commit whose snapshot is the content of the index and which points back to the commit HEAD is pointing at directly or indirectly. The index starts out matching that commit's snapshot, so anything you did not stage carries over unchanged. If HEAD is pointing at a branch, it will update that branch to point at the new commit; otherwise it will update HEAD to point at the new commit. In both cases, HEAD ends up pointing at the new commit either directly or indirectly through a branch.
  • git checkout branch-or-tag-or-commit sets HEAD to the given value and loads the associated snapshot to the working tree.
  • git diff HEAD prints the differences between the snapshot pointed at by HEAD and the content of the working tree. (Plain git diff compares the working tree against the index instead.)
  • git branch branch-name creates a branch with the given name pointing at the commit HEAD is pointing at directly or indirectly.
  • git checkout -b branch-name also creates a branch with the given name pointing at the commit HEAD is pointing at directly or indirectly. However, it takes the additional step of updating HEAD to point at the newly created branch.
  • etc.
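The behaviors in the list above can be sketched as a small state machine. As before, this is a toy model under my own assumptions: the ToyRepo class is hypothetical, commit keys are passed in as short made-up strings rather than computed, and snapshots and the working tree are left out so that only the pointer movements remain.

```python
class ToyRepo:
    def __init__(self):
        # Mirrors `git init`: one branch named master, with HEAD pointing at it.
        self.commits = {}                 # commit key -> key of the commit it points to
        self.branches = {"master": None}  # branch name -> commit key
        self.head = ("branch", "master")  # or ("commit", key) when pointing directly

    def head_commit(self):
        """The commit HEAD points at, directly or through a branch."""
        kind, value = self.head
        return self.branches[value] if kind == "branch" else value

    def commit(self, key):
        """Record a new commit after the current one, then advance the branch or HEAD."""
        self.commits[key] = self.head_commit()
        kind, value = self.head
        if kind == "branch":
            self.branches[value] = key
        else:
            self.head = ("commit", key)

    def branch(self, name):
        """Like `git branch name`: a new pointer at the current commit; HEAD is unchanged."""
        self.branches[name] = self.head_commit()

    def checkout(self, target):
        """Like `git checkout target`: point HEAD at a branch, or directly at a commit."""
        if target in self.branches:
            self.head = ("branch", target)
        else:
            self.head = ("commit", target)
```

For example, after committing c1 and c2 on master, checking out c1 and committing c3 moves only HEAD: master still points at c2, and c3 points back to c1.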

As you can see, there are a lot of commands that interact with HEAD. Just about any command that creates commits (commit, merge, rebase, cherry-pick, etc.) will end up reading and/or changing HEAD. This is to facilitate an easy workflow of always having the most recent change be what’s checked out unless you specify otherwise.

There are a whole lot of commands that git provides, but you should now have the fundamentals which make understanding them easier.

I did gloss over some details that are not relevant for understanding how git works but are relevant for using git. For example, I made it seem as if git add creates a whole snapshot at a time. Usually, when working with git add, you will incrementally build up your desired snapshot to commit file by file. As such, I do not recommend taking this guide as a reference or as a prescription for how to use git. Instead, I recommend using this guide as a basis to provide context to your typical workflow.

I am sure my own understanding of git is not perfect; it is only enough for me to use git effectively.

If you notice any errors in this guide, or if you have any thoughts to share, please let me know in the comments.

Daniel Tashjian is a software engineer at Tapad. Follow Tapad Engineering on Twitter.
