Part 10: Git Internals — The Object Model & DAG

Introduction — Porcelain vs Plumbing

Git divides its commands into two categories: porcelain (user-facing commands like git commit, git branch, git merge) and plumbing (low-level commands like git hash-object, git cat-file, git update-ref). Porcelain commands are designed for humans. Plumbing commands reveal what's actually happening underneath.

Why Understanding Internals Matters

When you understand Git's internals, several things happen:

You stop fearing Git — Merge conflicts, detached HEAD, and "rebase gone wrong" become solvable puzzles, not mysterious catastrophes
You debug faster — When something goes wrong, you know exactly where to look (.git/refs/, .git/objects/, git reflog)
You make better decisions — Knowing that branches are 41-byte files makes you comfortable creating 50 branches. Knowing that rebase rewrites SHA-1s explains why it's dangerous on shared branches
You can recover from anything — Almost nothing in Git is truly lost. Understanding the object store and reflog means you can recover from almost any mistake

                            
Key Insight: Git is fundamentally a content-addressable filesystem with a version control user interface built on top. Once you understand the filesystem, the user interface (porcelain commands) becomes completely predictable.
                        

Content-Addressable Storage

At its core, Git is a key-value store. You give it content, it returns a unique key (a SHA-1 hash). Later, you give it the key, it returns the content. Every piece of data in Git — every file version, every directory listing, every commit — is stored as an object identified by its SHA-1 hash.

SHA-1 Hashing

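SHA-1 produces a 160-bit digest, written as a 40-character hexadecimal string, from any input. The same input always produces the same hash, and for practical purposes different inputs produce different hashes: accidental collisions are astronomically unlikely. Git hashes each object's content together with a short header that records the object's type and size, so the key depends on both. This gives Git three powerful properties: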

Integrity — If a single byte changes in any object, its hash changes. Corruption is immediately detectable
Deduplication — If two files have identical content, they produce the same hash and Git stores them only once
Immutability — You cannot modify an object without changing its hash. Objects are permanent once created

# Demonstrate content-addressable storage
# Hash a string without storing it
echo "Hello, Git internals!" | git hash-object --stdin
# Output: the blob's 40-character SHA-1 hash (nothing is written to disk yet)

# Hash and store it in the Git object database
echo "Hello, Git internals!" | git hash-object -w --stdin
# Output: same hash, but now the object is stored in .git/objects/

# Verify the object exists
# The hash is split: first 2 chars = directory, rest = filename
find .git/objects -type f | head -5

echo "Content-addressable storage demonstrated"
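To make the "content in, key out" idea concrete, the sketch below recomputes a blob ID without calling Git's hashing at all. It relies on the fact that Git hashes the header `blob <size>` plus a NUL byte, followed by the raw content. The helper name `git_blob_id` and the sample text are illustrative assumptions, and `sha1sum` is assumed to be available (on macOS, `shasum` would be the usual substitute).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: compute the ID Git would assign to a file's content.
# Git hashes "blob <size-in-bytes>" + NUL + the raw bytes of the file.
git_blob_id() {
  local file="$1" size
  size=$(wc -c < "$file" | tr -d '[:space:]')
  { printf 'blob %s\0' "$size"; cat "$file"; } | sha1sum | cut -d' ' -f1
}

sample=$(mktemp)
printf 'test content\n' > "$sample"

manual_id=$(git_blob_id "$sample")
echo "manual: $manual_id"

# If git is on PATH, it should report the same ID for the same bytes
if command -v git >/dev/null 2>&1; then
  echo "git:    $(git hash-object "$sample")"
fi
```

Because the header includes the size and the type, a blob and a tree can never share an ID by accident, even if their payload bytes happened to match.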

Immutability & Deduplication

# Deduplication in action
# Create two files with identical content
echo "shared content" > file1.txt
echo "shared content" > file2.txt

# Stage both files
git add file1.txt file2.txt

# Check: both files point to the SAME blob object
git ls-files --stage
# Output shows both have the same SHA-1 hash
# Git stores the content only ONCE, regardless of filename

echo "Deduplication demonstrated — same content = same hash"

The Four Object Types

Git's entire data model consists of just four object types. Everything Git does — every commit, every branch, every tag, every file version — is built from these four primitives.

Git Object Model — How Objects Reference Each Other

flowchart TD
    C["Commit Object<br/>author, message, timestamp"] --> T["Tree Object<br/>directory listing"]
    C --> CP[Parent Commit]
    T --> B1["Blob<br/>file1.txt content"]
    T --> B2["Blob<br/>file2.txt content"]
    T --> ST["Subtree<br/>src/ directory"]
    ST --> B3["Blob<br/>main.js content"]
    TAG["Tag Object<br/>v1.0.0"] --> C

Blob — File Content

A blob (binary large object) stores the raw content of a file. Crucially, it does not store the filename, permissions, or any metadata — just the bytes. The filename is stored in the tree object that references the blob.

# Create a file and commit it
echo "print('hello world')" > hello.py
git add hello.py
git commit -m "add hello.py"

# Find the blob for hello.py
# First, get the tree of the latest commit
git cat-file -p HEAD
# Shows: tree <tree-hash>, parent <parent-hash> (if any), author, committer, message

# Look at the tree
git cat-file -p 'HEAD^{tree}'
# Shows: 100644 blob <blob-hash>    hello.py

# Look at the blob itself (just raw file content)
git cat-file -p <blob-hash>
# Output: print('hello world')

# Check the type and size
git cat-file -t <blob-hash>    # Output: blob
git cat-file -s <blob-hash>    # Output: 21 (bytes: 20 characters plus the trailing newline)

echo "Blob exploration complete"

Tree — Directory Listing

A tree object represents a directory. It contains a list of entries, each mapping a filename (plus permissions) to either a blob (file) or another tree (subdirectory). Trees are how Git reconstructs the full directory structure from content-addressed blobs.

# Create a directory structure
mkdir -p src/utils
echo "export const add = (a, b) => a + b;" > src/utils/math.js
echo "import { add } from './utils/math';" > src/index.js
git add .
git commit -m "add src directory structure"

# Inspect the top-level tree
git ls-tree HEAD
# Output (any other top-level files are listed too):
# 040000 tree <tree-hash>    src

# Inspect the src/ subtree
git ls-tree HEAD:src
# Output:
# 100644 blob <blob-hash>    index.js
# 040000 tree <tree-hash>    utils

# Inspect the src/utils/ subtree
git ls-tree HEAD:src/utils
# Output:
# 100644 blob <blob-hash>    math.js

echo "Tree hierarchy explored"

Commit — The Snapshot Record

A commit object ties everything together. It contains: (1) a pointer to a tree object (the project snapshot), (2) pointers to parent commit(s), (3) author and committer with timestamps, and (4) the commit message. The commit is what makes Git a version control system rather than just a filesystem.

# Inspect a commit object in full detail
git cat-file -p HEAD

# Output:
# tree <tree-hash>
# parent a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
# author Wasil Zafar <email> 1715600000 +0100
# committer Wasil Zafar <email> 1715600000 +0100
#
# add src directory structure

# Key observations:
# - "tree" points to the root tree (entire project state at this commit)
# - "parent" points to the previous commit (absent for initial commit)
# - Merge commits have TWO parent lines (or more, for octopus merges)
# - author = who wrote the code; committer = who applied it

echo "Commit object structure examined"

Tag — Named Reference with Metadata

Git supports two types of tags: lightweight (just a pointer, like a branch that doesn't move) and annotated (a full object with tagger, date, message, and optional GPG signature).

# Create a lightweight tag (just a ref, no object)
git tag v0.1.0

# Create an annotated tag (creates a tag object)
git tag -a v1.0.0 -m "First stable release"

# Inspect the annotated tag object
git cat-file -p v1.0.0
# Output:
# object <commit-hash>   (the commit being tagged)
# type commit
# tag v1.0.0
# tagger Wasil Zafar <email> 1715600000 +0100
#
# First stable release

# See where the tag refs are stored
cat .git/refs/tags/v0.1.0     # Contains: commit hash directly
cat .git/refs/tags/v1.0.0     # Contains: tag object hash

echo "Tag types demonstrated"
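The difference between the two tag kinds can also be checked by asking Git what type of object each tag name resolves to. This is a minimal sketch in a scratch repository; the file name and identity are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "v1" > app.txt
git add app.txt
git commit -q -m "initial commit"

git tag v0.1.0                               # lightweight: only a ref
git tag -a v1.0.0 -m "First stable release"  # annotated: a tag object plus a ref

# A lightweight tag resolves straight to a commit; an annotated tag to a tag object
light_type=$(git cat-file -t "$(git rev-parse v0.1.0)")
annotated_type=$(git cat-file -t "$(git rev-parse v1.0.0)")
echo "v0.1.0 resolves to a: $light_type"
echo "v1.0.0 resolves to a: $annotated_type"

# The ^{commit} suffix "peels" the tag object to the commit it wraps
peeled=$(git rev-parse 'v1.0.0^{commit}')
head_commit=$(git rev-parse HEAD)
echo "peeled tag: $peeled"
echo "HEAD:       $head_commit"
```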

The DAG (Directed Acyclic Graph)

Commits in Git form a Directed Acyclic Graph (DAG). Each commit points to its parent(s). "Directed" means edges have direction (child → parent). "Acyclic" means you can never follow parent pointers in a circle back to where you started. This structure enables Git's powerful branching and merging.

A Git DAG with Branches and a Merge

gitGraph
    commit id: "A" tag: "initial"
    commit id: "B"
    commit id: "C"
    branch feature
    commit id: "D"
    commit id: "E"
    checkout main
    commit id: "F"
    commit id: "G"
    merge feature id: "H" tag: "v1.0"
    commit id: "I"

Branching & Merging in the DAG

In the DAG model:

A branch is simply a named pointer to one node (commit) in the graph
Creating a branch adds a new pointer — no new nodes are created
Committing adds a new node with an edge to the current node, then moves the branch pointer forward
Merging (when it is not a fast-forward) creates a new node with two parent edges (two outgoing edges in the DAG, because edges point from child to parent)
Reachability — A commit is "on" a branch if you can reach it by following parent pointers from the branch tip

# Visualise the DAG in your terminal
git log --oneline --graph --all --decorate

# Example output:
# *   H (HEAD -> main, tag: v1.0) Merge feature into main
# |\
# | * E (feature) Add search results page
# | * D Add search input component
# * | G Fix header padding
# * | F Update navigation links
# |/
# * C Add user authentication
# * B Set up project structure
# * A Initial commit

echo "DAG visualisation demonstrated"
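The reachability rule and the two-parent merge node can be checked mechanically. The sketch below builds a small history in a scratch repository and uses `git merge-base --is-ancestor` to ask whether the feature tip is reachable from main before and after the merge. The helper `make_commit`, the branch names, and the one-letter messages are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical helper: one small file per commit
make_commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

make_commit A
make_commit B
git branch feature            # adds a pointer only; no new commit is created
git checkout -q feature
make_commit D
git checkout -q main
make_commit F

feature_tip=$(git rev-parse feature)

# Before merging, D cannot be reached from main by following parent pointers
if git merge-base --is-ancestor "$feature_tip" main; then before=yes; else before=no; fi

git merge -q --no-ff -m "H" feature

# After merging, main's tip has two parents and D is reachable from main
parent_count=$(git cat-file -p main | grep -c '^parent ')
if git merge-base --is-ancestor "$feature_tip" main; then after=yes; else after=no; fi

echo "reachable before merge:  $before"
echo "reachable after merge:   $after"
echo "parents of merge commit: $parent_count"
```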

                            
Key Insight: The "acyclic" property is what makes Git's history meaningful. Because a commit's hash is computed from content that includes its parents' hashes, a commit can never be its own ancestor, so following parent pointers from any commit always ends at a root commit (normally the initial commit). This gives Git a clear, unambiguous notion of "what came before what."
                        

References — Human-Friendly Names for Hashes

Nobody wants to remember a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2. References (refs) are human-readable names that point to object hashes, almost always commits. Branches, tags, and HEAD are all references.

Branches as Pointers

# A branch is literally a file containing a commit hash
cat .git/refs/heads/main
# Output: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2

# That's it. A branch is a 41-byte file (40 hex chars + newline)
# When you commit, Git just overwrites this file with the new commit's hash

# List all branch references
ls .git/refs/heads/
# Output: main plus a feature/ directory (a branch named feature/auth is the file auth inside it)

echo "Branches are just files containing commit hashes"

HEAD — Where You Are

# HEAD tells Git which branch you're on
cat .git/HEAD
# Normal output: ref: refs/heads/main
# This means HEAD points to the "main" branch

# When you checkout a branch, HEAD changes to point to that branch
git checkout feature/auth
cat .git/HEAD
# Output: ref: refs/heads/feature/auth

# Detached HEAD: when you checkout a specific commit (not a branch)
git checkout a1b2c3d
cat .git/HEAD
# Output: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
# HEAD points directly to a commit, not a branch reference

echo "HEAD pointer states demonstrated"
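Rather than reading `.git/HEAD` by hand, a script can ask Git whether HEAD is symbolic. The following is a small sketch, assuming a scratch repository with two commits; `git symbolic-ref -q HEAD` exits non-zero when HEAD is detached, which makes it usable as a test.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# Attached: HEAD is a symbolic ref naming a branch
attached_ref=$(git symbolic-ref -q HEAD)
echo "attached HEAD -> $attached_ref"

# Detach by checking out a commit instead of a branch
git checkout -q --detach HEAD~1

if git symbolic-ref -q HEAD >/dev/null; then state=attached; else state=detached; fi
head_file=$(cat .git/HEAD)
echo "state after checkout: $state"
echo ".git/HEAD now holds:  $head_file"
```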

The refs/ Directory Structure

# Explore the full refs directory structure
find .git/refs -type f

# Typical structure:
# .git/refs/
# ├── heads/           ← Local branches
# │   ├── main
# │   └── feature/auth
# ├── remotes/         ← Remote-tracking branches
# │   └── origin/
# │       ├── main
# │       └── feature/auth
# ├── tags/            ← Tags
# │   ├── v1.0.0
# │   └── v0.1.0
# └── stash            ← Stash reference

# When there are many refs, Git consolidates them into one plain-text packed-refs file
# (the file may not exist yet in a fresh repository)
cat .git/packed-refs 2>/dev/null

echo "Refs directory structure explored"
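One practical consequence of packed-refs is that a branch's loose file can disappear while the branch itself stays perfectly valid. The sketch below forces that situation with `git pack-refs --all`; it assumes the default file-based ref storage, and the branch names are illustrative.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo data > file.txt
git add file.txt
git commit -q -m "initial commit"
git branch feature/auth

# Right after creation, each branch is a loose file under refs/heads/
if [ -f .git/refs/heads/main ]; then loose_before=yes; else loose_before=no; fi

# Move every ref into .git/packed-refs and remove the loose files
git pack-refs --all

if [ -f .git/refs/heads/main ]; then loose_after=yes; else loose_after=no; fi
packed_line=$(grep ' refs/heads/main$' .git/packed-refs)

# Git commands still resolve the branch, wherever the ref is stored
resolved=$(git rev-parse refs/heads/main)

echo "loose file before packing: $loose_before"
echo "loose file after packing:  $loose_after"
echo "packed-refs entry:         $packed_line"
```

This is why scripts are usually better off calling `git rev-parse` or `git for-each-ref` than reading files under `.git/refs/` directly.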

The .git Directory — A Complete Tour

The .git directory is the repository. Everything else in your project folder is just the working directory (a checked-out version of one commit). If you delete everything except .git/, you can reconstruct the entire project from any point in history.

# List the .git directory contents
ls -la .git/

# Key files and directories:
# HEAD          Points to the current branch (or directly to a commit when detached)
# config        Repository-specific configuration
# description   Used by GitWeb (rarely relevant)
# hooks/        Client-side and server-side hook scripts
# index         The staging area (binary file)
# info/         Repository-local excludes in info/exclude (like .gitignore, but never committed)
# objects/      ALL content (blobs, trees, commits, tags)
# refs/         Branch and tag pointers
# logs/         Reflog history (for recovery)
# packed-refs   Many refs consolidated into one plain-text file (not compressed)

echo ".git directory overview shown"

# The objects directory — where ALL data lives
ls .git/objects/
# You'll see directories named with 2-character hex prefixes:
# 0a/ 0b/ 1f/ 2c/ ... a1/ b2/ ff/ info/ pack/

# Each object is stored as: .git/objects/XX/YYYYYY...
# Where XX = first 2 chars of SHA-1, YYYYYY = remaining 38 chars

# Count total objects in the repository
git count-objects
# Output: X objects, Y kilobytes

# More detailed stats
git count-objects -v

echo "Objects directory explored"

The Index (Staging Area) File

# The index is a binary file — use git commands to inspect it
# List everything in the staging area
git ls-files --stage

# Output format:
# <mode> <object-hash> <stage> <path>
# 100644 a1b2c3... 0 README.md
# 100644 d4e5f6... 0 src/index.js
# 100755 0a7b8c... 0 scripts/deploy.sh

# The stage number (0) means no conflict
# During a merge conflict, you'll see stages 1, 2, 3:
# Stage 1 = common ancestor, Stage 2 = ours, Stage 3 = theirs

echo "Index file inspection demonstrated"
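The three conflict stages are easiest to understand by provoking a conflict and reading the index. The sketch below does so in a scratch repository; the file name `note.txt` and its contents are assumptions for the example. The `:1:`, `:2:` and `:3:` prefixes are Git's revision syntax for "this path at that stage".

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo "base" > note.txt
git add note.txt
git commit -q -m "base"

git checkout -q -b feature
echo "theirs" > note.txt
git commit -q -am "feature edit"

git checkout -q main
echo "ours" > note.txt
git commit -q -am "main edit"

# Both branches changed the same line, so this merge stops with a conflict
git merge feature >/dev/null 2>&1 || true

# The index now holds three entries for the one path
git ls-files --stage note.txt
stages=$(git ls-files --stage note.txt | awk '{print $3}' | paste -sd' ' -)

base=$(git cat-file -p :1:note.txt)    # stage 1: common ancestor
ours=$(git cat-file -p :2:note.txt)    # stage 2: current branch
theirs=$(git cat-file -p :3:note.txt)  # stage 3: branch being merged
echo "stages present: $stages"
echo "base=$base ours=$ours theirs=$theirs"
```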

Deep Dive

The .git/hooks/ Directory

The hooks directory contains sample scripts for Git lifecycle events: pre-commit, pre-push, commit-msg, post-merge, and more. These are local to each clone (not shared via push/pull). Teams use tools like Husky (Node.js) or pre-commit (Python) to share hook configurations via the repository itself. Hooks enable powerful automation: running linters before commit, validating commit message format, running tests before push, and sending notifications after merge.
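As an illustration of how small a hook can be, here is a sketch of a commit-msg hook installed into a scratch repository. The policy it enforces (rejecting subjects that start with "WIP") is a made-up example, not a Git convention; Git passes the path of the message file as the first argument, and a non-zero exit aborts the commit.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# Install a hypothetical commit-msg hook: refuse subjects starting with "WIP"
mkdir -p .git/hooks
cat > .git/hooks/commit-msg <<'HOOK'
#!/bin/sh
# $1 is the path of the file that holds the proposed commit message
subject=$(head -n 1 "$1")
case "$subject" in
  WIP*) echo "commit-msg: WIP commits are not allowed" >&2; exit 1 ;;
esac
HOOK
chmod +x .git/hooks/commit-msg

echo "data" > file.txt
git add file.txt

# The hook exits non-zero, so this commit is aborted
if git commit -q -m "WIP: half done" 2>/dev/null; then wip_result=accepted; else wip_result=rejected; fi

# A normal subject passes the hook
if git commit -q -m "Add data file" 2>/dev/null; then ok_result=accepted; else ok_result=rejected; fi

echo "WIP commit:    $wip_result"
echo "normal commit: $ok_result"
```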


Packfiles & Garbage Collection

Storing every version of every file as a separate compressed object works for small repositories, but would be enormously wasteful for large ones. Git solves this with packfiles — a compressed format that stores objects using delta compression (storing only the differences between similar objects).

Delta Compression

When Git packs objects, it finds objects that are similar (e.g., different versions of the same file) and stores one full copy plus deltas (binary diffs) for the others. This dramatically reduces storage. For example, a 10 MB text file that changes by one line per commit costs roughly one full copy (about 10 MB before zlib compression) plus a small delta for each other version, instead of 10 MB per version.

# Check repository size before packing
git count-objects -v
# Look for: size-pack (kilobytes in packfiles)
# And: count (number of loose objects)

# Manually trigger garbage collection (packing)
git gc

# Check size after packing — typically much smaller
git count-objects -v

# List packfile contents
git verify-pack -v .git/objects/pack/pack-*.idx | head -20
# Shows: hash type size size-in-pack offset [depth base-hash]
# Objects with a "depth" > 0 are stored as deltas

echo "Packfiles and gc demonstrated"

When Does Git GC Run?

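Git runs garbage collection when any of the following happens. The first two conditions are checked automatically by some everyday commands, such as git commit and git fetch:

The number of loose objects exceeds roughly 6,700 (the default of gc.auto)
The number of packfiles exceeds 50 (the default of gc.autoPackLimit)
You explicitly run git gc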

GC also removes unreachable objects — objects that no branch, tag, or reflog entry points to. By default, these are kept for 2 weeks (gc.pruneExpire) before deletion, giving you time to recover from mistakes.
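The recovery window described above can be tried safely in a scratch repository. In this sketch a commit is "lost" with a hard reset and then brought back through the reflog; the branch name `recovered` and the file contents are assumptions for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

echo one > f.txt
git add f.txt
git commit -q -m "first"
echo two > f.txt
git commit -q -am "second"
lost=$(git rev-parse HEAD)

# Throw away "second": no branch points at it any more
git reset -q --hard HEAD~1

# The object is still in the store, and the reflog remembers where HEAD was
still_there=$(git cat-file -t "$lost")
git reflog -n 3

# HEAD@{1} is the position of HEAD one move ago, i.e. the "lost" commit
git branch recovered 'HEAD@{1}'
recovered=$(git rev-parse recovered)
recovered_subject=$(git log -1 --format=%s recovered)

echo "object type still stored: $still_there"
echo "recovered commit:         $recovered ($recovered_subject)"
```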

How Operations Work Internally

Now that you understand the object model, let's see what actually happens under the hood when you run common Git commands.

git add — What Really Happens

# When you run: git add myfile.txt
# Git does the following internally:

# 1. Compute the SHA-1 of the file content (prefixed with a "blob <size>" header)
git hash-object myfile.txt
# Output: <blob-hash>
# (an empty file, for instance, always yields e69de29bb2d1d6434b8b29ae775ad8c2e48c5391)

# 2. Compress and store the content as a blob object
git hash-object -w myfile.txt
# Now exists: .git/objects/<first 2 hex chars>/<remaining 38 hex chars>

# 3. Update the index (staging area) with the new blob hash
git update-index --add --cacheinfo 100644 <blob-hash> myfile.txt

# That's it! 'git add' = hash content + store blob + update index

echo "git add internals demonstrated"

git commit — What Really Happens

# When you run: git commit -m "my message"
# Git does the following internally:

# 1. Create a tree object from the current index
git write-tree
# Output: <tree-hash>

# 2. Create a commit object pointing to that tree
# (with parent = current HEAD commit, author, committer, message)
echo "my message" | git commit-tree <tree-hash> -p HEAD
# Output: <commit-hash>

# 3. Update the current branch ref to point to the new commit
git update-ref refs/heads/main <commit-hash>

# Summary: git commit = write-tree + commit-tree + update-ref

echo "git commit internals demonstrated"

git branch — What Really Happens

# When you run: git branch feature/new-thing
# Git literally creates a file:

# Roughly equivalent to (git update-ref is the safe plumbing way to do this):
mkdir -p .git/refs/heads/feature
git rev-parse HEAD > .git/refs/heads/feature/new-thing

# That's ALL a branch is: a file containing a commit hash
# It takes microseconds and almost no disk space (41 bytes)

# When you switch branches:
# git checkout feature/new-thing
# Git does:
# 1. Update HEAD to reference the new branch
#    (equivalent to: git symbolic-ref HEAD refs/heads/feature/new-thing)
echo "ref: refs/heads/feature/new-thing" > .git/HEAD
# 2. Update the index and working directory to match that commit's tree

echo "git branch internals demonstrated"

git merge — What Really Happens

# When you run: git merge feature/done
# Git does the following:

# 1. Find the merge base (common ancestor)
git merge-base main feature/done
# Output: <merge-base-hash>

# 2. Create a three-way diff:
#    - ancestor → main (what changed on main)
#    - ancestor → feature/done (what changed on feature)

# 3. Apply both sets of changes to create a merged tree
#    - If changes don't overlap: automatic merge
#    - If changes conflict: pause and ask for manual resolution

# 4. Create a merge commit with TWO parents
# git commit-tree <merged-tree-hash> -p <main-tip-hash> -p <feature-tip-hash>

# 5. Update the branch pointer
# The result: a new commit with two parent edges in the DAG

echo "git merge internals demonstrated"

Experiment

Build a Commit Manually with Plumbing Commands

You can create a commit without ever using porcelain commands. Try this sequence: (1) echo "content" | git hash-object -w --stdin to create a blob, (2) git update-index --add --cacheinfo 100644 <hash> file.txt to add it to the index, (3) git write-tree to create a tree from the index, (4) echo "message" | git commit-tree <tree-hash> to create a commit, (5) git update-ref refs/heads/main <commit-hash> to update the branch. Congratulations — you just did what git add + git commit does, one plumbing command at a time.
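The five steps above can be sketched as one script that runs in a scratch repository, so nothing in a real project is touched. The identity values are placeholders; everything else follows the sequence described in the paragraph.

```shell
#!/usr/bin/env bash
set -euo pipefail

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example User"
git config user.email "user@example.com"

# (1) Store the content as a blob
blob=$(echo "content" | git hash-object -w --stdin)

# (2) Register the blob in the index under a file name
git update-index --add --cacheinfo 100644 "$blob" file.txt

# (3) Turn the index into a tree object
tree=$(git write-tree)

# (4) Wrap the tree in a commit (no -p: this is a root commit)
commit=$(echo "message" | git commit-tree "$tree")

# (5) Point the branch at the new commit
git update-ref refs/heads/main "$commit"

# Verify with read-only commands
git log --oneline
git cat-file -p "$commit"
stored=$(git cat-file -p HEAD:file.txt)
echo "file.txt in the commit contains: $stored"
```

Note that no file was ever written to the working directory here: the commit exists, but `git status` would report file.txt as deleted until you run something like `git checkout -- file.txt`.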


Exercises

Exercise 1

Manual Object Creation

Without using git add or git commit, create a commit using only plumbing commands: git hash-object -w, git update-index, git write-tree, git commit-tree, and git update-ref. Verify the result with git log and git cat-file -p.


Exercise 2

Explore .git

Create a new repository with at least 5 commits across 2 branches. Then explore the .git directory: (1) Read HEAD and verify it matches your current branch. (2) Read the branch ref file and verify it matches git log -1. (3) Use git cat-file -p to walk from the latest commit → tree → blobs. (4) Count objects before and after git gc.


Exercise 3

Trace a Commit's Full Object Chain

Pick any commit in a repository. Using only git cat-file -p, trace the full chain: commit → tree → subtrees → blobs. Draw a diagram showing all objects and their relationships. How many total objects does a single commit reference?


Exercise 4

Draw the DAG

Create a repository with this history: (1) 3 commits on main, (2) branch "feature-a" from commit 2 with 2 commits, (3) branch "feature-b" from commit 3 with 1 commit, (4) merge feature-a into main, (5) merge feature-b into main. Draw the DAG by hand showing all commits with arrows to their parents. Verify your drawing matches git log --oneline --graph --all.


Conclusion & Next Steps

You've now seen behind the curtain. Git is not magic — it's an elegant content-addressable filesystem with a version control interface. The four object types (blob, tree, commit, tag), the DAG structure, and the reference system explain everything Git does. Every porcelain command is just a convenient wrapper around creating objects and updating references.

Key takeaways:

Everything is an object — Identified by SHA-1, stored in .git/objects/, immutable once created
Four types explain all — Blob (content), Tree (directory), Commit (snapshot + metadata), Tag (named annotation)
The DAG is the history — Commits form a directed acyclic graph via parent pointers
Branches are just files — 41 bytes pointing to a commit hash. Cheap and disposable
HEAD tells Git where you are — Points to a branch (normal) or a commit (detached)
Packfiles optimise storage — Delta compression for similar objects; gc automates packing
git add = hash + store + index; git commit = write-tree + commit-tree + update-ref
Almost nothing is truly lost — The reflog and GC grace period mean recovery is almost always possible

Next in the Series

In Part 11: Git Workflows — Trunk-Based, GitFlow & Beyond, we'll apply this knowledge to team collaboration. You'll learn how teams organise their branching strategies, the trade-offs between GitFlow and trunk-based development, pull request workflows, and how to choose the right model for your team size and release cadence.
