Git Series (1) — The Architecture & Internals!

Since the dawn of software engineering, managing source code has been a great challenge for engineers. A plethora of source code management (SCM) systems and version control systems (VCS) have been developed in the search for the best one.

In my short career, I have used the useless Microsoft VSS, the ubiquitous CVS, the slightly better SVN, and the proprietary IBM ClearCase. But none provided as comprehensive a solution as the one and only Git.

The Evolution

The quest for the best SCM ended in 2005 when Linus Torvalds gave another open-source gift to the world. It was a source code management (SCM) system that he created to manage the distributed development of the Linux kernel's huge codebase. He named the SCM Git.

The industry embraced Git with open arms. SaaS providers like GitHub, Bitbucket & GitLab added more bells and whistles on top, and within a few years, Git became the ubiquitous source control solution.

Git Architecture

Git is different from all other SCMs in many aspects. To understand it better, let's look at a 30,000-foot view of the architecture of SCMs.

The Terminology

Before we start, here is an oversimplified explanation of a few SCM-related terms.

Tree
A codebase is usually a collection of files & folders. It can be represented as a Tree data structure, having the Project folder as the root node and files (blobs) as leaves with subfolders as intermediate nodes. For our further discussion, we will refer to the codebase as a Tree.
Working Tree
The source code that we work on and make changes to. All the code that we change or add is saved in the working tree.
Repo Tree
This is the source code that is kept for later reference. The data in the repo tree is uneditable. To make any change, you have to check out, edit, and then commit the file.
Commit
An action to put the code in the repo tree.
Checkout
An action to bring the code from the repo tree to the working tree.

Two Tree Architecture

This is how most SCMs worked. We make changes in a working tree and commit them to the repo. The repo is usually somewhere on a remote server. Right after the code is committed, the new changes become available to everyone.

Three Tree Architecture

Linus introduced many distinctive elements in Git's design, making it different from all other systems. The two most prominent elements were the local repository and the staging area.

Local Repository
Git maintains a full repository on our machines, called a local repository. Developers commit to and checkout from the local repository. If needed, these commits can be pushed to remote servers. But each machine works as a full repository, containing all information, such as branches, commits, logs, tags, etc.
Staging Area / Cache
Linus also created a cache that holds the code that a developer intends to commit to the repo tree.
Any change for local development can be made in the working tree without worrying about mixing it with the code to be committed. Once ready, the changes that are needed can be explicitly added to this cache; only the changes added to the cache will be committed to the local repo.

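So a typical workflow in Git looks like this: edit files in the working tree, add the chosen changes to the staging area, and then commit the staged changes to the local repository.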
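To make the three trees a little more concrete, here is a toy model in Python. It is only a sketch for illustration: the class ToyRepo and its methods are made-up names, files are plain strings in dictionaries, and none of this is how Git is actually implemented.

```python
class ToyRepo:
    """Toy model of the three trees; file contents are plain strings."""

    def __init__(self):
        self.working_tree = {}  # path -> content we are currently editing
        self.staging_area = {}  # path -> content proposed for the next commit
        self.commits = []       # local repo: a list of stored snapshots

    def edit(self, path, content):
        # Edits only ever touch the working tree.
        self.working_tree[path] = content

    def add(self, path):
        # Explicitly copy one file from the working tree into the staging area.
        self.staging_area[path] = self.working_tree[path]

    def commit(self, message):
        # Only what is in the staging area is recorded in the snapshot.
        self.commits.append({"message": message, "files": dict(self.staging_area)})
        return len(self.commits) - 1  # position in the list acts as a commit id

    def checkout(self, commit_id):
        # Bring a stored snapshot back into the working tree and staging area.
        files = self.commits[commit_id]["files"]
        self.working_tree = dict(files)
        self.staging_area = dict(files)


repo = ToyRepo()
repo.edit("app.py", "print('v1')")
repo.edit("notes.txt", "scratch notes")  # local-only change, never staged
repo.add("app.py")
first = repo.commit("initial commit")
print(sorted(repo.commits[first]["files"]))  # only the staged file is listed
```

Notice that notes.txt stays in the working tree and never reaches the repo, because it was never added to the staging area.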

Adding Remote Tier

Usually, the goal of working with an SCM is to store the code on a reliable server machine, not on a developer machine. Therefore, we finally push the local repo code to a remote server. Thus, the complete workflow is: edit in the working tree, add to the staging area, commit to the local repo, and push to the remote repo.

Additional Key components

If you require just a high-level overview, you can very well stop reading here. However, for developers who use Git for day-to-day work, below are a few more useful concepts.

SHA-1 Hash
This is the key concept that all things in Git revolve around. Think of it as a 40-character hexadecimal string generated by passing “some value” to the SHA-1 cryptographic hash function.

- A file hash is calculated when the file is staged with git add (not at commit time), by hashing a short header followed by the uncompressed file content.
- A commit hash, however, covers more information than just the files. It is calculated over the commit object, which records the hash of the root tree (the snapshot of all files), the parent commit(s), the author, the committer, the commit time, and the commit message. The commit hash is unique within the repository.
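Here is a small Python sketch of how such a hash can be reproduced outside Git. It relies on Git's object format, where the hashed bytes are the object type, a space, the content length, a NUL byte, and then the content itself. The function name git_object_hash is my own and is not part of Git.

```python
import hashlib


def git_object_hash(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 that Git would assign to an object of this type and content."""
    # Git hashes "<type> <length>\0" followed by the raw, uncompressed content.
    header = obj_type.encode("ascii") + b" " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


file_hash = git_object_hash("blob", b"hello world\n")
print(file_hash)
print(len(file_hash))  # always 40 hexadecimal characters
```

If Git is installed, the result can be compared with what `git hash-object` prints for a file with the same content.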
Blob
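The content of a file is stored as a blob. A blob holds only the file's content; the file name and mode are recorded in the tree object that refers to the blob. When a file is staged, its hash is calculated on the uncompressed content (prefixed with a short header). The data is then compressed using zlib and stored under the .git/objects/ directory, with the first two characters of the hash as the subdirectory name and the remaining 38 characters as the filename.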
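The storage step can be sketched in Python as below. This is a simplified illustration, not Git's own code: the helper names write_loose_object and read_loose_object are made up, and a temporary scratch directory stands in for .git/objects so that no real repository is touched.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, content: bytes) -> str:
    """Store content as a blob the way a loose Git object is laid out on disk."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    digest = hashlib.sha1(store).hexdigest()  # hash of the uncompressed data
    folder = os.path.join(objects_dir, digest[:2])  # first 2 hex characters
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, digest[2:]), "wb") as fh:  # remaining 38
        fh.write(zlib.compress(store))
    return digest


def read_loose_object(objects_dir: str, digest: str) -> bytes:
    """Read a blob back: decompress it and strip the header."""
    with open(os.path.join(objects_dir, digest[:2], digest[2:]), "rb") as fh:
        store = zlib.decompress(fh.read())
    _header, _sep, body = store.partition(b"\0")
    return body


objects_dir = tempfile.mkdtemp()  # scratch stand-in for .git/objects
digest = write_loose_object(objects_dir, b"hello world\n")
print(digest[:2], digest[2:])  # subdirectory name and file name
print(read_loose_object(objects_dir, digest))
```

Because the path is derived from the content hash, two files with identical content end up as one and the same object on disk.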
HEAD Pointer
It is a pointer that directly or indirectly points to a commit hash in the repo tree.
Indirect pointer: When we check out a branch, the HEAD pointer holds a reference to the branch, and the branch ref in turn points to the commit hash at the tip of the branch. This makes HEAD point indirectly to that commit hash.
Direct pointer: Git allows us to check out a commit hash instead of a branch name. When we do this, our tree goes into the detached HEAD state. In this scenario, the HEAD pointer points directly to a commit hash.
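The two cases can be told apart by reading the HEAD file, as in the Python sketch below. The function resolve_head is a hypothetical helper, the directory is a throwaway stand-in for .git, and the commit hash is invented for the demo. The sketch only looks at loose ref files and ignores packed refs, which a real repository may also use.

```python
import os
import re
import tempfile


def resolve_head(git_dir: str):
    """Return (branch_ref, commit_hash); branch_ref is None when HEAD is detached."""
    with open(os.path.join(git_dir, "HEAD")) as fh:
        head = fh.read().strip()
    if head.startswith("ref: "):
        # Indirect: HEAD names a branch ref, and the ref file holds the hash.
        ref = head[len("ref: "):]
        with open(os.path.join(git_dir, *ref.split("/"))) as fh:
            return ref, fh.read().strip()
    if re.fullmatch(r"[0-9a-f]{40}", head):
        # Direct: HEAD itself holds a commit hash (detached HEAD).
        return None, head
    raise ValueError("unrecognised HEAD content: " + head)


# Build a tiny fake .git directory for the demo; the hash is made up.
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
fake_commit = "0123456789abcdef0123456789abcdef01234567"
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as fh:
    fh.write(fake_commit + "\n")

with open(os.path.join(git_dir, "HEAD"), "w") as fh:
    fh.write("ref: refs/heads/master\n")
attached = resolve_head(git_dir)

with open(os.path.join(git_dir, "HEAD"), "w") as fh:
    fh.write(fake_commit + "\n")
detached = resolve_head(git_dir)

print(attached)
print(detached)
```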
Repo Metadata
As said above, Git brings the whole repository to every developer machine. It does this by creating a .git folder right under the project's root folder. All Git data is kept in this directory.

.git folder content (Advanced)

Below is an example of the content inside a .git directory.
This was a repo with 1 file and 1 commit (no tags).

.git
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
│ 
├── hooks
│   ├── hook scripts 
│ 
├── index
│ 
├── info
│   └── exclude
│ 
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
│ 
├── objects 
│   ├── 05
│   │   └── f683xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx2774
│   ├── 1b
│   │   └── 0f93xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4e21
│   ├── 37
│   │   └── 40b8xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5510
│   ├── info
│   └── pack
│ 
└── refs
    ├── heads
    │   └── master
    └── tags

Here is a high-level overview of the content.

description- This file contains the basic description of the repo.
config- This file contains the Git configuration specific to this repository.
HEAD- This file contains either a commit hash or a branch ref.
COMMIT_EDITMSG- This file contains the message from the last commit.
hooks- This folder contains the pre- and post-event scripts that you want to execute on various events, like commit/push/merge/rebase, etc.
refs- This folder contains the files storing the commit hashes
- for local branches under the heads folder,
- for remote branches under the remotes folder (for example, remotes/origin),
- for releases under the tags folder.
logs- This folder contains one log file for HEAD and one for each local and remote branch. Each entry records the hash, time, user, and other important information for a Git event that moved that ref.
objects- This folder contains the blob, tree, and commit objects of the repository.
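As a small exercise on this layout, the Python sketch below rebuilds the full hashes from the objects folder by joining each two-character subfolder name with the file names inside it. The function list_loose_objects is a made-up helper, and the demo runs against a scratch directory with invented file names rather than a real repository.

```python
import os
import tempfile


def list_loose_objects(git_dir: str):
    """Return the full hashes of the loose objects stored under git_dir/objects."""
    objects_dir = os.path.join(git_dir, "objects")
    hashes = []
    for prefix in sorted(os.listdir(objects_dir)):
        folder = os.path.join(objects_dir, prefix)
        if len(prefix) != 2 or not os.path.isdir(folder):
            continue  # skips the info and pack folders
        for name in sorted(os.listdir(folder)):
            hashes.append(prefix + name)  # 2 + 38 characters = full hash
    return hashes


# Scratch layout that mimics the tree shown above; the file names are invented.
git_dir = tempfile.mkdtemp()
for prefix, rest in [("05", "a" * 38), ("1b", "b" * 38)]:
    os.makedirs(os.path.join(git_dir, "objects", prefix))
    open(os.path.join(git_dir, "objects", prefix, rest), "wb").close()
os.makedirs(os.path.join(git_dir, "objects", "info"))
os.makedirs(os.path.join(git_dir, "objects", "pack"))

found = list_loose_objects(git_dir)
print(found)
```

Pointing git_dir at a real .git folder should list the same hashes that appear as folder and file names under objects, although objects that Git has already moved into pack files will not show up here.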

This is all for the first post in the Git series. Let me know your response in the comments or clap. In the next post, we will discuss all the commands needed to perform day-to-day operations on git.
