
Git Series (1) — The Architecture & Internals!

Since the dawn of software engineering, managing source code has been a great challenge for engineers. A plethora of source code management (SCM) systems and version control systems (VCS) have been developed in the search for the best one.

In my short career, I have used the useless Microsoft VSS, the ubiquitous CVS, the slightly better SVN, and the proprietary IBM ClearCase. But none provided as comprehensive a solution as the one and only Git.

The Evolution

The quest for the best SCM ended in 2005, when Linus Torvalds gave another open-source gift to the world: an SCM that he created to manage the distributed development of the Linux kernel's huge codebase. He named it Git.

The industry embraced Git with open arms. SaaS providers like GitHub, Bitbucket & GitLab added more bells and whistles on top, and within a few years, Git became the ubiquitous source control solution.

Git Architecture

Git differs from most other SCMs in many respects. To understand it better, let's take a 30,000-foot view of SCM architecture.

The Terminology

Before we start, here is an oversimplified explanation of a few SCM-related terms.

  • Tree
    A codebase is usually a collection of files & folders. It can be represented as a Tree data structure, having the Project folder as the root node and files (blobs) as leaves with subfolders as intermediate nodes. For our further discussion, we will refer to the codebase as a Tree.
  • Working Tree
    The source code that we work on and change. Every change or addition we make is saved in the working tree.
  • Repo Tree 
    This is the source code that is kept for later reference. The data in the repo tree cannot be edited in place. To make any change, you have to check out, edit, and then commit the file.
  • Commit 
    An action to put the code in the repo tree.
  • Checkout 
    An action to bring the code from the repo tree to the working tree.

Two Tree Architecture

This is how most SCMs worked. We make changes in a working tree and commit them to the repo. The repo usually lives on a remote server, so right after a commit, the new changes become available to everyone.

Three Tree Architecture

Linus introduced many distinctive elements in Git's design that set it apart from the systems described above. The two most prominent are the local repository and the staging area.

  • Local Repository 
    Git maintains a full repository on our machines, called a local repository. Developers commit to and checkout from the local repository. If needed, these commits can be pushed to remote servers. But each machine works as a full repository, containing all information, such as branches, commits, logs, tags, etc.
  • Staging Area / Cache
    Linus also created a cache that holds the code a developer intends to commit to the repo tree. Git calls it the index.
    Any change for local development can be made in the working tree without worrying about mixing it with code meant for commit. Once ready, the needed changes are explicitly added to this cache, and only the changes added to the cache are committed to the local repo.

So a typical workflow in Git looks like this.

Git 3 Tree design
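
To make the three trees concrete, here is a minimal toy model in Python. It is only a sketch: the class ToyRepo and its methods are made-up names for illustration, and real Git stores hashed objects rather than plain dictionaries.

```python
class ToyRepo:
    """Toy model of the three trees (hypothetical, not how Git stores data)."""

    def __init__(self):
        self.working_tree = {}  # path -> content; freely editable
        self.index = {}         # staging area: what the next commit will contain
        self.commits = []       # local repo: list of (message, snapshot) pairs

    def edit(self, path, content):
        # Editing touches only the working tree.
        self.working_tree[path] = content

    def add(self, path):
        # Staging copies the current working-tree content into the index.
        self.index[path] = self.working_tree[path]

    def commit(self, message):
        # A commit snapshots the index, never the working tree directly.
        snapshot = dict(self.index)
        self.commits.append((message, snapshot))
        return len(self.commits) - 1


repo = ToyRepo()
repo.edit("app.py", "print('v1')")
repo.edit("notes.txt", "scratch, not ready yet")
repo.add("app.py")                    # stage only one of the two files
first = repo.commit("first commit")

print(sorted(repo.commits[first][1]))  # ['app.py']
print(sorted(repo.working_tree))       # ['app.py', 'notes.txt']
```

The unstaged notes.txt stays in the working tree and never reaches the repo, which is exactly the separation the staging area is meant to provide.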

Adding Remote Tier

Usually, the goal of working with an SCM is to store the code on a reliable server, not on a developer's machine. Therefore, we finally push the commits in the local repo to a remote server. The complete workflow then looks like this:

Git 4-tier design
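
The push step can be sketched in the same toy style. This assumes a strictly linear history held in a plain list, which is a simplification (real Git history is a graph of commits), and the function name push is just an illustrative choice.

```python
def push(local_commits, remote_commits):
    """Copy to the remote the commits it lacks (toy model, linear history only)."""
    # Refuse if the remote history is not a prefix of ours: in that case
    # someone else pushed first and we would overwrite their work.
    if local_commits[:len(remote_commits)] != remote_commits:
        raise ValueError("remote has commits we do not have; fetch first")
    new_commits = local_commits[len(remote_commits):]
    remote_commits.extend(new_commits)
    return new_commits


local = ["c1", "c2", "c3"]   # commits made on the developer machine
remote = ["c1"]              # what the server has seen so far

print(push(local, remote))   # ['c2', 'c3']
print(remote == local)       # True
```

Until push runs, c2 and c3 exist only on the developer machine, which is why the remote tier matters for reliability.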

Additional Key components

If you only need a high-level overview, you can stop reading here. However, for developers who use Git for day-to-day work, below are a few more useful concepts.

  • SHA-1 Hash
    This is the key concept that all things in Git revolve around. Think of it as a 40-character hexadecimal string generated by passing “some value” to the SHA-1 cryptographic hash function.

    - A file hash is calculated when the file is staged (git add), by hashing a short header (the object type and the content size) followed by the uncompressed file content.
    - A commit hash, however, covers more than just the files. It is calculated over the hash of the commit's root tree (a snapshot of the whole codebase), the parent commit hash(es), the author, the committer, the timestamps, and the commit message. For all practical purposes, the commit hash is unique within the repository.
  • Blob 
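    Git stores the content of a file as a blob. A blob holds only the content; the file's name and mode are recorded in the tree object that references it. When a file is staged, its hash is calculated on the uncompressed content. The content is then compressed using zlib and stored under the .git/objects/ directory, with the first two characters of the hash as a subdirectory name and the remaining 38 characters as the filename.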
  • HEAD Pointer
    It is a pointer that directly or indirectly points to a commit hash in the repo tree.
    Indirect pointer: when we check out a branch, HEAD holds a reference to the branch, and the branch ref points to the commit hash at the tip of the branch. HEAD thus points to that commit hash indirectly.
    Direct pointer: Git also allows us to check out a commit hash instead of a branch name. When we do this, the working tree goes into the detached HEAD state, and HEAD points directly to a commit hash.
  • Repo Metadata
    As said above, Git brings the whole repository to every developer's machine. It does this by creating a .git folder right under the project’s root folder. All Git data is kept in this directory.
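
The hashing and storage of objects described above can be reproduced in a few lines of Python. This is a minimal sketch assuming the default SHA-1 object format and "loose" (unpacked) objects; the helper names object_id and write_loose_object are mine, not Git's.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def _encode(kind: str, body: bytes) -> bytes:
    # Git prefixes the content with "<type> <size>" and a NUL byte.
    return f"{kind} {len(body)}".encode() + b"\0" + body


def object_id(kind: str, body: bytes) -> str:
    """Return the 40-character SHA-1 id Git would assign to this object."""
    return hashlib.sha1(_encode(kind, body)).hexdigest()


def write_loose_object(objects_dir: Path, kind: str, body: bytes) -> Path:
    """Store the object the way a loose object is laid out on disk."""
    oid = object_id(kind, body)
    path = objects_dir / oid[:2] / oid[2:]   # 2-char folder, 38-char filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(_encode(kind, body)))
    return path


content = b"hello world\n"
print(object_id("blob", content))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

with tempfile.TemporaryDirectory() as tmp:   # stand-in for .git/objects
    stored = write_loose_object(Path(tmp), "blob", content)
    print(stored.parent.name, stored.name)   # 3b 18e512dba79e4c8300dd08aeb37f8e728b8dad
```

The printed id should match what git hash-object reports for a file containing the same bytes, since the id depends only on the object type and content, not on the file name.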

.git folder content (Advanced)

Below is an example of the content inside a .git directory. 
This was a repo with 1 file and 1 commit (no tags).

.git
├── COMMIT_EDITMSG
├── HEAD
├── config
├── description
├── hooks
│   └── hook scripts
├── index
├── info
│   └── exclude
├── logs
│   ├── HEAD
│   └── refs
│       └── heads
│           └── master
├── objects
│   ├── 05
│   │   └── f683xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx2774
│   ├── 1b
│   │   └── 0f93xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx4e21
│   ├── 37
│   │   └── 40b8xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx5510
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

Here is a high-level overview of the content.

  • description - This file contains a basic description of the repo.
  • config - This file contains the Git configuration specific to this repository.
  • HEAD - This file contains either a commit hash or a branch ref.
  • COMMIT_EDITMSG - This file contains the message from the last commit.
  • hooks - This folder contains scripts that run before or after various events, like commit, push, merge, rebase, etc.
  • refs - This folder contains files that each store a commit hash:
    - for local branches, under the heads folder,
    - for remote branches, under the remotes folder (for example, remotes/origin),
    - for tags (typically releases), under the tags folder.
  • logs - This folder contains one file for HEAD and one file per local and remote branch. Each file records the hashes, time, user, and other important information for every Git event that moved that ref.
  • objects - This folder contains the blob, tree, and commit objects that make up every commit.
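
Since HEAD and the branch refs are plain text files, the direct and indirect HEAD pointer can be followed with simple file reads. The sketch below builds a throwaway .git-like folder to stay self-contained; the function name resolve_head and the all-"a"/all-"b" hashes are made up for illustration, and the code ignores refs that Git has moved into a packed-refs file.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def resolve_head(git_dir: Path) -> Tuple[str, Optional[str]]:
    """Return (commit_hash, branch_ref); branch_ref is None when HEAD is detached."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Indirect pointer: HEAD names a branch ref, which holds the commit hash.
        ref = head[len("ref: "):]
        return (git_dir / ref).read_text().strip(), ref
    # Direct pointer (detached HEAD): the file holds the commit hash itself.
    return head, None


with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    (git_dir / "refs" / "heads").mkdir(parents=True)
    (git_dir / "refs" / "heads" / "master").write_text("a" * 40 + "\n")  # fake hash

    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    print(resolve_head(git_dir)[1])   # refs/heads/master

    (git_dir / "HEAD").write_text("b" * 40 + "\n")  # simulate a detached HEAD
    print(resolve_head(git_dir)[1])   # None
```

Pointing resolve_head at a real repository's .git folder should work the same way, as long as the current branch ref still exists as a loose file.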

This is all for the first post in the Git series. Let me know your response in the comments or clap. In the next post, we will discuss all the commands needed to perform day-to-day operations on git.
