Implementing Git in JavaScript

January, 2019

This article shows how Gitfred was built: a library that provides a Git-like experience for storing content in JavaScript. Poet.codes, the site you are reading right now, uses that library to optimize how data is stored and transferred. This material presents some of the ideas behind Gitfred.


Introduction

I use Git all the time and I love it. There was a time when I used SVN, and I can definitely say I wasn't as happy or excited back then. Git is a wonderful piece of software that makes writing code easier. I think most people take it for granted and don't realize how much easier our lives are because of this tool.

I spent a couple of months building Poet, and one of the problems I tackled was data management and transfer with the back-end. Poet gives us that interactive experience where we write code in the browser and see it working instantly. This is all fine and can be seen in many places. My problem with those places, though, is the lack of history of my changes, especially when it comes to technical writing. I want to develop an example and show the reader how I progress through it, and Git fits this idea perfectly. Imagine generating commits while writing the code, with those commits then becoming part of the whole story. That is what Poet is all about: quickly mocking up an example, explaining it, and sharing it with others.

That is all cool, but to make it persistent we have to save every single change to a database. This means a constant stream of requests to an API. Apps of this type can become really expensive in terms of data transfer and storage.

So I decided to solve the problem in a similar fashion and designed Poet with a Git-like experience.

The raw interface

Let's start with the basics. Git has three states and we need to represent them in our implementation.

const data = {
  working: {},
  staging: {},
  commits: {}
}

working will represent the modified state, staging the staged state and commits will play the role of our local database.

In Git we have the concept of HEAD: simply said, a pointer to the tip of our current branch. In our case, HEAD will point to a specific commit in the commits field. Every commit must also have a unique identifier, which Git computes as a hash. We will simplify that and use a counter variable i. With those two things we end up with the following data object:

const data = {
  i: 0,
  head: null,
  working: {},
  staging: {},
  commits: {}
}

We will continue by wrapping it in a function: a function that returns a git object with a couple of methods.

const createGit = function () {
  const data = {
    i: 0,
    head: null,
    working: {},
    staging: {},
    commits: {}
  }

  return {
    save(filepath, content) {},
    get() {},
    add() {},
    commit(message) {},
    checkout(hash) {}
  }
}

const git = createGit();

save will add something to our working directory, and get will return the content of the same field. add will stage our changes; in Git it is possible to stage only some of the changes, but here we will assume that the developer wants to stage everything. commit will take whatever is in the staging field and form a commit, which will be stored in the commits map. Finally, checkout will let us jump to a specific record by taking the content of the commit and setting it back into the working field, so we can use get to read it.

Saving and retrieving files from the working directory

Because we decided to use an object as the working directory field, we will use the filepath as a key and the content as a value.

save(filepath, content) {
  data.working[filepath] = content;
}

Reading is just returning data.working:

get() {
  return data.working;
}

And we can test these changes with the following example:

git.save('app.js', 'const answer = 42;');
console.log(JSON.stringify(git.get(), null, 2));
/* 
  results in:
  {
    "app.js": "const answer = 42;"
  }
*/

Staging our changes

For convenience we will define one more method called export. It will return the whole data object so we can monitor what is going on from the outside.

export() {
  return data;
}

As we said above, our staging process takes whatever is in the working directory and copies it to the staging area.

add() {
  data.staging = JSON.parse(JSON.stringify(data.working));
}

We deep-clone the object with JSON.stringify followed by JSON.parse, one of the simplest ways to copy plain data in JavaScript (not the fastest, but perfectly fine for string content like ours). Now if we extend our example a little bit we will see the effect.

git.save('app.js', 'const answer = 42;');
git.add();
console.log(JSON.stringify(git.export(), null, 2));

The result is as follows:

{
  "i": 0,
  "head": null,
  "working": {
    "app.js": "const answer = 42;"
  },
  "staging": {
    "app.js": "const answer = 42;"
  },
  "commits": {}
}

The same file with the same content now exists in both places.
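The deep clone in add matters: later edits to the working directory must not leak into the already staged snapshot. Here is a standalone check, re-declaring just the pieces we have built so far:

```javascript
// Minimal re-declaration of the methods built so far.
const createGit = function () {
  const data = {
    i: 0,
    head: null,
    working: {},
    staging: {},
    commits: {}
  };

  return {
    save(filepath, content) {
      data.working[filepath] = content;
    },
    add() {
      // Deep clone so working and staging don't share references.
      data.staging = JSON.parse(JSON.stringify(data.working));
    },
    export() {
      return data;
    }
  };
};

const git = createGit();
git.save('app.js', 'const answer = 42;');
git.add();

// Edit the working directory *after* staging.
git.save('app.js', 'const answer = 43;');

console.log(git.export().staging['app.js']); // "const answer = 42;"
console.log(git.export().working['app.js']); // "const answer = 43;"
```

If add assigned data.working directly instead of cloning it, both logs would print the edited content, because working and staging would be the same object.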

Committing to our local database

A couple of things should happen here. First, we generate a unique hash for our commit. Second, we take the content of the staging area and, together with the commit message, store it in the commits field. We should also empty the staging area so we are in a good position for further changes. Finally, to stick to what Git does, at the end head should point to the new commit.

commit(message) {
  const hash = '_' + (++data.i);

  data.commits[hash] = {
    content: data.staging,
    message
  };
  data.staging = {};
  data.head = hash;
}

Let's use the commit method in our example and see what our data object looks like afterwards:

git.save('app.js', 'const answer = 42;');
git.add();
git.commit('first commit');
console.log(JSON.stringify(git.export(), null, 2));

And the result is:

{
  "i": 1,
  "head": "_1",
  "working": {
    "app.js": "const answer = 42;"
  },
  "staging": {},
  "commits": {
    "_1": {
      "content": {
        "app.js": "const answer = 42;"
      },
      "message": "first commit"
    }
  }
}

Notice how our counter i is now increased to 1, which means the second commit will get the hash _2. The staging area is empty again, and there is one commit registered. head points to the right place as well. Let's move on to the wonderful checkout method.

Checking out

To illustrate what the checkout method does we need at least two commits. So let's add another file, foo.js, to the database and see the final state of the data object.

git.save('app.js', 'const answer = 42;');
git.add();
git.commit('first commit');
git.save('foo.js', 'const bar = "zar";');
git.add();
git.commit('second commit');
console.log(JSON.stringify(git.export(), null, 2));

We should now have two commits, with hashes _1 and _2, the second of which contains both app.js and foo.js. And indeed, that's what we see when we print out data:

{
  "i": 2,
  "head": "_2",
  "working": {
    "app.js": "const answer = 42;",
    "foo.js": "const bar = \"zar\";"
  },
  "staging": {},
  "commits": {
    "_1": {
      "content": {
        "app.js": "const answer = 42;"
      },
      "message": "first commit"
    },
    "_2": {
      "content": {
        "app.js": "const answer = 42;",
        "foo.js": "const bar = \"zar\";"
      },
      "message": "second commit"
    }
  }
}

At this point head points to the latest commit we have made, _2. Checking out the first one means updating the value of head, but also updating our working directory.

checkout(hash) {
  data.head = hash;
  data.working = JSON.parse(JSON.stringify(data.commits[hash].content));
}

We have to clone again here, because otherwise every save to the working directory would amend the commit in the commits field. With that done, our implementation is ready. We are now able to store information, retrieve it, create a history of the changes, and travel through them. If we call git.checkout('_1'), the export method shows the following:

{
  "i": 2,
  "head": "_1",
  "working": {
    "app.js": "const answer = 42;"
  },
  "staging": {},
  "commits": {
    "_1": {
      "content": {
        "app.js": "const answer = 42;"
      },
      "message": "first commit"
    },
    "_2": {
      "content": {
        "app.js": "const answer = 42;",
        "foo.js": "const bar = \"zar\";"
      },
      "message": "second commit"
    }
  }
}
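As with add, the clone in checkout is load-bearing: without it, a save after checking out would mutate the stored commit itself. A standalone sketch of just commit and checkout shows the guarantee:

```javascript
// Minimal re-declaration: only the pieces needed to show the guarantee.
const createGit = function () {
  const data = {
    i: 0,
    head: null,
    working: {},
    staging: {},
    commits: {}
  };

  return {
    save(filepath, content) {
      data.working[filepath] = content;
    },
    add() {
      data.staging = JSON.parse(JSON.stringify(data.working));
    },
    commit(message) {
      const hash = '_' + (++data.i);
      data.commits[hash] = { content: data.staging, message };
      data.staging = {};
      data.head = hash;
    },
    checkout(hash) {
      data.head = hash;
      // Clone, so edits to working can't reach back into the commit.
      data.working = JSON.parse(JSON.stringify(data.commits[hash].content));
    },
    export() {
      return data;
    }
  };
};

const git = createGit();
git.save('app.js', 'const answer = 42;');
git.add();
git.commit('first commit');

git.checkout('_1');
git.save('app.js', 'const answer = 43;'); // edit after checkout

// The commit is untouched; only the working directory changed.
console.log(git.export().commits['_1'].content['app.js']); // "const answer = 42;"
console.log(git.export().working['app.js']);               // "const answer = 43;"
```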

Going further

If you open the source code of Gitfred you'll see that there is a lot more than 40 lines of code. To make the library actually usable, I had to build a bunch of features on top of what we have here. Most of them mimic what Git actually does. One thing, however, is I think interesting and worth mentioning: the scalability of the solution. Imagine we have dozens of files and we push commit after commit for every change. Our collection of files gets copied many times over, which is definitely not scalable; we can't afford to keep all the files in every commit, because the payload becomes too big. What I ended up using is the diff-match-patch library by Google, a small, compact JavaScript implementation of the Myers diff algorithm. It allowed me to store only the changes between commits and significantly decrease the data stored in Poet's database.

Here is a simple example of two strings compared by diff-match-patch and what the diff looks like:

const str1 = 'Hello world';
const str2 = 'Goodbye world';

const dmp = new diff_match_patch();
const diff = dmp.diff_main(str1, str2);
dmp.diff_cleanupSemantic(diff);
console.log(diff);
// outputs: [[-1, "Hello"], [1, "Goodbye"], [0, " world"]]
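Gitfred relies on diff-match-patch for character-level patches. As a much cruder illustration of the same idea, a commit could store only the files that changed relative to its parent, rather than a full snapshot. This sketch is not Gitfred's actual implementation, just the shape of such a delta (file deletions are deliberately ignored here):

```javascript
// A file-level delta: keep only entries that differ from the parent snapshot.
// (diff-match-patch goes further and diffs the file contents themselves.)
const delta = function (parent, snapshot) {
  const changes = {};
  for (const file of Object.keys(snapshot)) {
    if (parent[file] !== snapshot[file]) {
      changes[file] = snapshot[file];
    }
  }
  return changes;
};

// Rebuilding a snapshot = parent plus its delta.
const apply = function (parent, changes) {
  return Object.assign({}, parent, changes);
};

const first = { 'app.js': 'const answer = 42;' };
const second = {
  'app.js': 'const answer = 42;',
  'foo.js': 'const bar = "zar";'
};

const changes = delta(first, second);
console.log(changes); // { 'foo.js': 'const bar = "zar";' }
console.log(apply(first, changes)); // same content as `second`
```

The payload per commit is then proportional to what changed, not to the size of the project.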

Conclusion

I had lots of fun creating Gitfred. I would love to see it battle-tested, which you can help with by doing one of two things: (a) use the library in one of your projects, or (b) start using Poet. Feel free to post your feedback here.
