Git Commit

How commits are handled.

Git objects

At the core of Git is a simple key-value data store. You can insert any kind of content into a Git repository, and Git will hand you back a unique key (an object ID, or OID for short) that you can use later to retrieve that content[1]. The OID is a 40-character SHA-1 hash in hexadecimal: a checksum of the content you're storing plus a header.

Git objects are blobs, trees and commits. Git references each of its objects by its OID.

Commits, trees, and blobs are immutable, meaning you can't change their contents. If you change the contents, you get a different hash and thus a new OID referring to a new object.
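
To make the "content plus a header" idea concrete, here is a minimal sketch that computes a blob's OID by hand and compares it with what Git reports. It assumes a repository using the default SHA-1 object format and that the sha1sum utility is available; the sample content is made up for illustration.

```shell
# Hypothetical sample content; any bytes would do.
content='hello'

# What Git computes for a blob with this content (nothing is written to disk).
by_git=$(printf '%s\n' "$content" | git hash-object --stdin)

# The same value by hand: SHA-1 over the header "blob <size>" + NUL byte,
# followed by the content itself.
size=$(( $(printf '%s\n' "$content" | wc -c) ))
by_hand=$( { printf 'blob %d\000' "$size"; printf '%s\n' "$content"; } | sha1sum | cut -d' ' -f1 )

echo "git hash-object: $by_git"
echo "by hand:         $by_hand"
```

Changing even one byte of the content changes the hash, which is why an object can never be edited in place.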

Blobs

At the bottom of the object model, blobs contain file contents, but not the file names.
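
A small sketch of what "contents, but not the file names" means in practice: two files with different names and identical contents hash to the same blob OID. The file names and the temporary directory are hypothetical.

```shell
# Two hypothetical files with different names but identical contents.
tmp=$(mktemp -d)
printf 'same content\n' > "$tmp/README"
printf 'same content\n' > "$tmp/copy-of-readme.txt"

# hash-object only looks at the bytes, so both OIDs come out the same.
oid_a=$(git hash-object "$tmp/README")
oid_b=$(git hash-object "$tmp/copy-of-readme.txt")

echo "README:             $oid_a"
echo "copy-of-readme.txt: $oid_b"
```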

Trees

The names come from Git's representation of directories: trees. All the content is stored as tree and blob objects, with trees corresponding to directories and blobs corresponding to file contents. A tree is an ordered list of entries, each pairing a path name with a file mode, an object type (blob or tree), and the OID of the object at that path. Subdirectories are also represented as trees, so trees can point to other trees as well as blobs.

For example:

$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib
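
To reproduce a listing like the one above without an existing project, the following sketch builds a tiny scratch repository and prints its root tree and one subtree. The directory layout and file names are assumptions chosen to resemble the example; the OIDs you see will differ from the ones shown above.

```shell
# Scratch repository in a temporary directory (hypothetical layout).
repo=$(mktemp -d)
cd "$repo" && git init -q
mkdir lib
printf 'docs\n' > README
printf 'puts 1\n' > lib/example.rb
git add README lib

# write-tree turns the staged files into tree objects and prints the root tree OID.
root_tree=$(git write-tree)
listing=$(git cat-file -p "$root_tree")
echo "$listing"

# The "lib" entry is itself a tree; print what it points to.
sub_tree=$(printf '%s\n' "$listing" | awk '$2 == "tree" { print $3 }')
git cat-file -p "$sub_tree"
```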
                            
                        

Empty tree SHA

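The tree referenced by <ref>^{tree} (for example master^{tree}) is a special tree: the root tree of that commit.

Another special tree is the empty tree, which has no entries. If you don't remember the empty tree's SHA-1, you can always derive it with: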

git hash-object -t tree /dev/null

or

git hash-object -t tree --stdin < /dev/null

or

git mktree </dev/null
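
Because the empty tree has no entries, its OID is the same in every repository that uses SHA-1. A minimal sketch that derives it and compares it with the well-known value:

```shell
# Derive the empty tree OID; nothing is written to the object database.
empty_tree=$(git hash-object -t tree /dev/null)

# The well-known SHA-1 of the empty tree.
known='4b825dc642cb6eb9a060e54bf8d69288fbee4904'

echo "derived: $empty_tree"
echo "known:   $known"
```

One common use is as the "before" side of a diff, for example git diff followed by this OID and HEAD, which shows the whole snapshot as additions.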

Branches

Branches are significantly different from the Git objects described so far: a branch is not an object but a named pointer to a commit.

The special reference HEAD points to the current branch. When we add a commit to HEAD, it automatically updates that branch to the new commit.

Branches point to commits, commits point to other commits and their root trees, trees point to blobs and other trees, and blobs don't point to anything.
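
The sketch below shows HEAD pointing at a branch and the branch moving when a commit is added. It runs in a scratch repository; the author identity is a placeholder, and the branch name depends on your Git configuration.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q

# Placeholder identity so the example does not depend on global config.
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

git commit -q --allow-empty -m 'first'
branch_ref=$(git symbolic-ref HEAD)   # e.g. refs/heads/master or refs/heads/main
before=$(git rev-parse "$branch_ref")

git commit -q --allow-empty -m 'second'
after=$(git rev-parse "$branch_ref")
head_now=$(git rev-parse HEAD)
parent_of_head=$(git rev-parse 'HEAD^')

echo "HEAD points to:  $branch_ref"
echo "branch before:   $before"
echo "branch after:    $after"
echo "parent of HEAD:  $parent_of_head"
```

The branch now points at the second commit, and that commit points back at the first one as its parent.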

Commits

A commit is a snapshot in time, not a diff. Commits also contain metadata describing the snapshot, such as the author and committer (each with name, email address, and date) and a commit message.

Each commit has a pointer to its root tree, representing the state of the working directory at that time. The commit has a list of parent commits corresponding to the previous snapshots. A commit with no parents is a root commit. A commit with multiple parents is a merge commit.
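
To see these fields directly, git cat-file -p prints the raw content of a commit object. The sketch below uses a scratch repository with a placeholder identity and a hypothetical file name.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

printf 'v1\n' > file.txt && git add file.txt && git commit -q -m 'first snapshot'
printf 'v2\n' > file.txt && git add file.txt && git commit -q -m 'second snapshot'

# Raw commit object: a tree line, a parent line, author, committer, then the message.
commit_text=$(git cat-file -p HEAD)
echo "$commit_text"

# The tree line names the root tree of this snapshot.
root_tree=$(git rev-parse 'HEAD^{tree}')
```

The first commit in this repository would show no parent line at all, which is what makes it a root commit.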

How to find the initial commit in a repo

There can exist more than one root commit (parentless commit) in a repository. This is usually the result of joining separate projects in one, or using subtree merge of separately developed subprojects.

For example, the Git repository itself has 6 root commits, including those of git-gui, gitk (subtree-merged), gitweb (merged in, no longer developed separately), git mail tools (merged very early in project history), and p4-fast-export (perhaps accidental). That is not counting the roots of the 'html' and 'man' branches ("convenience" branches that contain pre-generated documentation) or of the 'todo' branch with a TODO list and scripts.

To list the root commits reachable from HEAD, which include the initial commit, we can use the --max-parents option:

$ git rev-list --max-parents=0 HEAD
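
As a sketch of how a repository ends up with more than one root, the following joins two unrelated histories in a scratch repository and then lists the roots. The branch name project-b and the commit messages are made up, and --allow-unrelated-histories assumes a reasonably recent Git.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# First history.
git commit -q --allow-empty -m 'root of project A'
main_branch=$(git symbolic-ref --short HEAD)

# Second, unrelated history on an orphan branch (hypothetical name).
git checkout -q --orphan project-b
git commit -q --allow-empty -m 'root of project B'

# Join the two histories with a merge commit.
git checkout -q "$main_branch"
git merge -q --allow-unrelated-histories -m 'join project B' project-b

# Both parentless commits are now reachable from HEAD.
roots=$(git rev-list --max-parents=0 HEAD)
root_count=$(printf '%s\n' "$roots" | grep -c .)
echo "$roots"
echo "number of root commits: $root_count"
```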

Author and committer

The author is the person who originally wrote the code. The committer is the person who committed the code on behalf of the original author. The distinction matters because Git allows you to rewrite history or apply patches on behalf of another person. The free online Pro Git book explains it like this:

You may be wondering what the difference is between author and committer. The author is the person who originally wrote the patch, whereas the committer is the person who last applied the patch. So, if you send in a patch to a project and one of the core members applies the patch, both of you get credit — you as the author and the core member as the committer.

By default, git log shows only the author data. To see both author and committer data, use git log --format=fuller.
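
A minimal sketch of a commit whose author and committer differ, as happens when a maintainer applies someone else's patch. Both identities are invented for the example.

```shell
repo=$(mktemp -d)
cd "$repo" && git init -q

# The person running "git commit" (hypothetical maintainer).
export GIT_COMMITTER_NAME='Carol Committer' GIT_COMMITTER_EMAIL='carol@example.com'

# --author records a different person as the one who wrote the change.
git commit -q --allow-empty --author='Alice Author <alice@example.com>' -m 'apply patch from Alice'

# %an is the author name, %cn the committer name.
names=$(git log -1 --format='%an | %cn')
echo "$names"

# The fuller format prints Author, AuthorDate, Commit and CommitDate.
git log -1 --format=fuller
```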

Is there a case when two git commit ids in two different projects could be identical?

The SHA-1 hash is not used to generate the contents of any file; Git stores the contents separately. The hash is just the key for a commit. Commits can't simply be numbered from 1 upwards, because with Git different people can work on different branches of the same project and make commits without knowing about each other. When those branches are merged, every commit still needs a unique key. Deriving the key from the content with a hash function such as SHA-1 achieves this: the probability of two different commits getting the same key is almost zero.

Since Git commit IDs are SHA-1 hashes, we’re essentially looking for SHA-1 hash collisions.

SHA-1 hashes consist of 20 bytes or 160 bits, allowing for 2^160 = 1.4615e+48 combinations.

The most likely cause of identical commit IDs is not a SHA-1 collision but an exact match in the input data. And that seems highly unlikely, given that author details and timestamps are in there as well.
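
To illustrate the "exact match in the input data" case, the sketch below creates the same commit in separate scratch repositories by pinning every input: tree, message, identity, and dates. The identity and the timestamps are made up; the dates use Git's internal "<unix timestamp> <timezone>" format.

```shell
# Create an empty initial commit in a fresh repository with fully fixed inputs
# and print its commit ID. $1 is the date used for both author and committer.
make_commit() {
  dir=$(mktemp -d)
  (
    cd "$dir" && git init -q &&
    GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com' \
    GIT_COMMITTER_NAME='Example Author' GIT_COMMITTER_EMAIL='author@example.com' \
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" \
    git commit -q --allow-empty -m 'initial commit' &&
    git rev-parse HEAD
  )
}

first=$(make_commit '1577836800 +0000')
second=$(make_commit '1577836800 +0000')   # identical inputs
later=$(make_commit '1577836801 +0000')    # one second later

echo "first:  $first"
echo "second: $second"
echo "later:  $later"
```

The first two IDs match because every byte of the commit object is the same; shifting the timestamp by one second is enough to produce a different ID.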

All in all, using commit hashes to identify commits seems sufficiently identifying to use across different projects without any real risk of trouble.

There are three cases to consider.

  • Two different non-malicious commits that happen to have the same commit ID.
  • Two different commits deliberately constructed to have the same commit ID
  • Two projects that contain the same commit.

Two different non-malicious commits that happen to have the same commit ID

This is known as the birthday problem: the probability that 2 (or more) people in a group of N people share a birthday. It is analogous to the probability that 2 (or more) commits in a repository with N commits in total share the same hash prefix of length X. http://www.solipsys.co.uk/new/TheBirthdayParadox.html?HN2

Written out in full, 2^160 is exactly 1,461,501,637,330,902,918,203,684,832,716,283,019,655,932,542,976 possible hashes.

Because of the birthday problem, it takes only roughly the square root of this number (about 2^80 hashes) to reach a 50% chance of a collision, but that is still enormous. Note, however, that the inputs are not uniformly random: each hash is computed over the commit data. https://en.wikipedia.org/wiki/Birthday_problem#Probability_table

The birthday paradox only gives you the probability that, in a random set of hashes, two are the same.

Of course the fun part is that after precisely calculating that probability and proving that we don't have anything to worry about, the next commit we make could collide... and it will be the only git collision within the history of the human race.

It is very important to note that this is the probability that any two hashes collide in a set of hashes. It is not the probability to find a second plaintext that hashes to the same hash as a given one. It's far harder to find a colliding value to a given one than to just grab a group of values and have two of them hash to the same hash. https://news.ycombinator.com/item?id=4753014

Applying the formula to 160-bit SHA-1, you need about 1.7e23 objects to get a 1% chance of a collision. Here's an example to give you an idea of what it would take to get a SHA-1 collision.
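
The 1.7e23 figure can be checked with the usual birthday approximation, n ≈ sqrt(2 * N * ln(1 / (1 - p))), where N = 2^160 is the number of possible hashes and p is the target collision probability. The sketch below evaluates it with awk; it assumes uniformly distributed hashes, and the function name is made up for the example.

```shell
# Number of objects needed to reach collision probability p among N = 2^160
# equally likely hashes, using the approximation n ~ sqrt(2 * N * ln(1/(1-p))).
birthday_threshold() {
  awk -v p="$1" 'BEGIN { N = 2 ^ 160; printf "%.2e\n", sqrt(2 * N * log(1 / (1 - p))) }'
}

n_1pct=$(birthday_threshold 0.01)
n_50pct=$(birthday_threshold 0.5)

echo "objects for a 1% chance:  $n_1pct"
echo "objects for a 50% chance: $n_50pct"
```

The 50% threshold comes out on the order of 2^80, in line with the square-root rule of thumb.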

If all 6.5 billion humans on Earth were programming, and every second, each one was producing code that was the equivalent of the entire Linux kernel history (6.5 million Git objects) and pushing it into one enormous Git repository, it would take roughly 2 years until that repository contained enough objects to have a 50% probability of a single SHA-1 object collision. Thus, an organic SHA-1 collision is less likely than every member of your programming team being attacked and killed by wolves in unrelated incidents on the same night. https://git-scm.com/book/en/v2/Git-Tools-Revision-Selection#A-SHORT-NOTE-ABOUT-SHA-1

For some wacky perspective, that's 10 million kernel-sized contributions for every man, woman and child on earth together in a single repository. Even storing 1 trillion (10^12) objects gives a collision probability of only about 3e-25. It would seem Git will reach plenty of other bottlenecks before SHA-1 becomes a problem...

Not as likely as you'd imagine. But it's still non-zero.

Two different commits deliberately constructed to have the same commit ID

That is possible now that SHA-1 collision techniques have been demonstrated: https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html

Git v2.13.0 and later moved to a hardened SHA-1 implementation by default, which isn't vulnerable to the SHAttered attack. https://git-scm.com/docs/hash-function-transition

Example: https://gist.github.com/masak/2415865#file-explanation-md

How about also guessing the tree hash?

https://gist.github.com/masak/2415865#file-explanation-md

Two projects that contain the same commit.

That is actually the most likely case, because many projects are forks of other projects. However, it is not really a collision: if the projects are forks, it is the same commit.

Why is Git marking my file as binary?

Git marks a file as binary when it sees a NUL (0) byte somewhere within the first 8000 bytes of the file. Typically, that happens because the file is saved in an encoding other than UTF-8, most likely UCS-2, UCS-4, UTF-16, or UTF-32. All of those embed NUL bytes when encoding ASCII characters.
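
The check can be approximated outside Git to find out which files would trip it. The sketch below is not Git's own code, just the same heuristic expressed with standard tools: look for a NUL byte in the first 8000 bytes. The function name and the sample files are hypothetical, and head -c is assumed to be available.

```shell
# Succeeds if the first 8000 bytes of file $1 contain at least one NUL byte.
has_nul_in_first_8000() {
  total=$(( $(head -c 8000 "$1" | wc -c) ))
  without_nul=$(( $(head -c 8000 "$1" | tr -d '\000' | wc -c) ))
  [ "$without_nul" -lt "$total" ]
}

tmp=$(mktemp -d)
# "hello" as plain ASCII/UTF-8, and the same text as UTF-16LE (a NUL after each letter).
printf 'hello\n' > "$tmp/utf8.txt"
printf 'h\000e\000l\000l\000o\000\n\000' > "$tmp/utf16le.txt"

for f in "$tmp/utf8.txt" "$tmp/utf16le.txt"; do
  if has_nul_in_first_8000 "$f"; then
    echo "$f: contains NUL, would be treated as binary"
  else
    echo "$f: no NUL, would be treated as text"
  fi
done
```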

Push the same commit to many remotes

A local repository can be linked to multiple remote repositories, each with a different short name. From our point of view a remote repository has two names. One is the name it has on GitHub or on another remote server, which works like a project name; in our case that is 'amazing-project'. The other is the short name it has in our local repository, which is tied to the repository's URL. This short name is what we use whenever we push to or fetch from that remote. It acts as an alias for the URL, so we don't have to type the full URL each time.

Origin is the default short name that Git uses for a remote repository when you clone that remote repository. So it's just the default.

We can have as many remotes as we want, but only one of those links can be called origin. The rest of the links need to have different names.

It is possible to have origin push to more than one git repository server at a time.

Here's an example: a project with multiple remotes, GitHub and GitLab:

  1. Add a remote for GitHub:
    $ git remote add github https://github.com/Company_Name/repository_name.git
  2. Add a remote for GitLab:
    $ git remote add gitlab https://gitlab.com/Company_Name/repository_name.git
  3. Now the project has multiple remotes. We can list them with git remote -v:
    $ git remote -v
    github https://github.com/Company_Name/repository_name.git (fetch)
    github https://github.com/Company_Name/repository_name.git (push)
    gitlab https://gitlab.com/Company_Name/repository_name.git (fetch)
    gitlab https://gitlab.com/Company_Name/repository_name.git (push)
  4. To push to both repositories, push to each remote in turn:
    $ git push github && git push gitlab
  5. To inspect the remotes configured for the local repository, open its configuration file:
    $ git config -e
  6. To create a merged remote for "github" and "gitlab", add the following after the existing remote sections:
    [remote "origin"]
    url = https://github.com/Company_Name/repository_name.git
    url = https://gitlab.com/Company_Name/repository_name.git
    [branch "master"]
    remote = origin
    merge = refs/heads/master
  7. Once we've done this, executing
    $ git push origin master
    pushes master to both the GitHub and GitLab repositories sequentially, making life a little easier.
  8. Since remote = origin is set under [branch "master"],
    $ git pull
    pulls from origin as well, but only from the first url listed (GitHub here). When a remote has several URLs, Git fetches from the first and pushes to all of them.
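
The same effect can be had without editing the configuration file by hand, using git remote set-url --add --push. Note that this is a slightly different technique from the one in the steps above: it records push URLs (pushurl entries) rather than repeating url. The sketch below uses two local bare repositories as stand-ins for GitHub and GitLab so that it can run anywhere; all paths and the identity are placeholders.

```shell
tmp=$(mktemp -d)

# Two bare repositories standing in for the GitHub and GitLab servers.
git init -q --bare "$tmp/github.git"
git init -q --bare "$tmp/gitlab.git"

git init -q "$tmp/work"
cd "$tmp/work"
export GIT_AUTHOR_NAME='Example Author' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

# origin fetches from the first repository ...
git remote add origin "$tmp/github.git"
# ... and pushes to both. Once a push URL is set, the fetch URL is no longer
# used for pushing, so both targets are added explicitly.
git remote set-url --add --push origin "$tmp/github.git"
git remote set-url --add --push origin "$tmp/gitlab.git"

git commit -q --allow-empty -m 'first'
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# Both stand-in servers should now have the same commit on that branch.
local_oid=$(git rev-parse HEAD)
github_oid=$(git --git-dir="$tmp/github.git" rev-parse "refs/heads/$branch")
gitlab_oid=$(git --git-dir="$tmp/gitlab.git" rev-parse "refs/heads/$branch")

git remote -v
```

With real servers, the two local paths would be replaced by the GitHub and GitLab URLs from the steps above.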

References

1. Git Internals - Git Objects
