The road to fully native Mercurial in Heptapod
Posted on Mon 08 March 2021 in news by Georges Racinet
What are native Mercurial projects? What does this "fully native" qualifier you've seen here or there in our issues and merge requests mean? How will Heptapod get there?
This is about getting rid of the side conversions to Git that Heptapod performs under the hood, and the Heptapod 1.0 landmark, actually.
Read below for perspective and to learn how we will ship these important long-term features.
Historical perspective and ultimate goals
The original Heptapod idea was to build on the fact that Git and Mercurial aren't conceptually that different when accessing data at the commit level. One-to-one conversion between Git commits and Mercurial changesets is a long solved problem.
Note: in Mercurial terminology, what Git users usually call a commit is called a changeset, whereas commit is a command that creates a changeset.
From a user's perspective, the main differences lie in how to address commits symbolically (branches, tags, topics…) and what happens in pulls and pushes. The case of a forge, even as rich as GitLab, should not be much different.
All it took for the early Heptapod prototype to start displaying Mercurial content was to expose it through an internal Git repository, with a set of conventions to map Mercurial concepts to Git branches. The application needs some entry points to address the changesets, and of course GitLab expects Git branches and tags to play that part. In truth, any web application would need something similar anyway. Git refs in general provide exactly that.
An important point is that the auxiliary Git repository does not lie in the data path between the client and server-side Mercurial repositories. The best way to think about it is as a side view meant for the web application, not that different in spirit from data exchange performed in any serialization format.
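As a rough illustration of what such a set of conventions could look like, the sketch below maps a Mercurial named branch and an optional topic to a Git branch name. The `branch/...` and `topic/.../...` naming scheme and the function name are assumptions made for this example; the article does not spell out the actual mapping.

```python
from typing import Optional


def git_branch_name(hg_branch: str, topic: Optional[str] = None) -> str:
    """Return a Git branch name for a Mercurial (branch, topic) pair.

    Hypothetical convention, for illustration only.
    """
    if topic:
        # A topic lives on a named branch, so both parts are kept.
        return f"topic/{hg_branch}/{topic}"
    return f"branch/{hg_branch}"


print(git_branch_name("default"))
print(git_branch_name("default", "fix-import"))
```

Whatever the exact scheme, the point is that the web application only ever sees Git refs, while the Mercurial concepts are encoded in their names.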
This was good enough for a prototype, and we've been refining it a lot since then, but it has several drawbacks:
- Mercurial to Git conversion is slow and does not scale with the repository size. This can especially be felt in imports of large repositories, but we also pay a small price at every push.
- If the web application does nothing more than displaying a Git repository, we can expect all the hashes in the Web UI and the API to be Git commit hashes. We could in theory replace them by Mercurial changeset hashes in the outer layers of the application, and we actually did it in a few cases, but trying to be comprehensive would be an unbearable maintenance burden. See heptapod#6 for more details.
- Bad perception: some people believe that if Heptapod converts to Git, they might as well use Git. Even the message that there is no back-and-forth conversion to Git, and that server-side Mercurial is the source of truth, is a bit complicated to convey efficiently.
- In some cases, Mercurial is more flexible than Git, requiring, e.g., less locking, but we get the worst of both worlds.
It has been a major long-term goal of the Heptapod project since then to switch to a "native" way of handling Mercurial repositories, in which no conversion to Git would be necessary.
To finish on this historical perspective, our early experiments happened right before GitLab launched their Gitaly project with the goal to provide all access to Git content in a separate server component. Because Gitaly can be remote from the web application and sharded, it solves in particular the problem of server-side horizontal scaling.
Gitaly is by no means meant for VCS independence, but it makes for a neat separation of concerns, and it would have been easier for us to start right away with the Gitaly project already completed, as it was in GitLab 11.
Our plan is simple: implement a Gitaly server for Mercurial, a project we called HGitaly. By doing this, the other GitLab components should happily expose Mercurial content without many modifications. Of course, we need to identify Mercurial projects and dispatch requests to HGitaly instead of Gitaly for them, but that is well encapsulated in a finite set of mostly stable inner application layers.
If we go as far as letting all the server-side writes be performed by HGitaly, then Heptapod will also gain the nice horizontal scaling properties that were the original goal of Gitaly, as well as so-called container native deployments, which can be of interest for small scale deployments as well. This is represented by the HGitaly3 milestone, and is beyond the scope of this article.
HGitaly1: native Mercurial projects
Several GitLab subsystems actually store Git commit hashes in the database. For instance, a Merge Request keeps track of the commits for its diff view. Moreover, user and system comments often reference commits by hash directly. This is what happens, e.g., when a commit cross-references an issue: the issue gets a system comment containing the commit hash. Finally, any external system can keep a link to a given commit, with the full hash in the URL, of course.
All of this means that we cannot simply make existing Mercurial projects go through HGitaly and remove the auxiliary Git repository. Instead, we needed to make a difference between "native" Mercurial projects, internally exposed through HGitaly, and the traditional ones, internally exposed through the Git conversion. For simplicity, we decided to make that distinction a different VCS type, which was introduced in Heptapod 0.17.
The piece of good news, though, was that all this coupling to persistent data is limited to the commit hashes. On the other hand, the main user experience problem in Heptapod is precisely that Git hashes appear in various places of the interface, a condition worsened by the fact that users can't predict them. Finally, it is good software engineering practice to break down long implementation endeavors into smaller chunks of work and ship them early. That was a nice convergence.
So we defined an intermediate step, the HGitaly1 milestone, in which we introduced the "Mercurial (native)" VCS type with these properties:
- all content is still converted to Git
- all commit hashes seen by the web application are actually Mercurial changeset hashes
- internal exposition is provided by HGitaly in some cases, e.g., branch resolution, and by Gitaly in other cases, e.g., file contents and directory listing.
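The split described in the last item can be sketched as a small routing function. This is only an illustration under stated assumptions: the `VcsType` labels, the operation names and the backend strings are hypothetical, and the set of operations served by HGitaly contains just the example given above.

```python
from enum import Enum


class VcsType(Enum):
    GIT = "git"
    HG_CONVERTED = "hg_converted"  # hypothetical label: non-native Mercurial
    HG_NATIVE = "hg_native"        # hypothetical label: "Mercurial (native)"


# Hypothetical operation names. Only the example from the list above
# (branch resolution) is assumed to be served by HGitaly at this stage.
HGITALY1_OPERATIONS = {"resolve_branch"}


def backend_for(vcs_type: VcsType, operation: str) -> str:
    """Pick the internal server for one request at the HGitaly1 level."""
    if vcs_type is VcsType.HG_NATIVE and operation in HGITALY1_OPERATIONS:
        return "hgitaly"
    # Git projects, non-native Mercurial projects, and the remaining
    # operations of native projects all go through the Git side.
    return "gitaly"


print(backend_for(VcsType.HG_NATIVE, "resolve_branch"))
print(backend_for(VcsType.HG_NATIVE, "file_contents"))
```

In this picture, moving towards the fully native mode amounts to growing the set of operations answered by HGitaly until nothing is left for Gitaly on native projects.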
In other words, Mercurial native projects aren't really native yet (as of Heptapod 0.20) from an internal point of view, but as far as persistent data and user experience are concerned, they actually are. Also, newly created native projects shouldn't need to undergo heavy data migration in the future.
We will naturally need heavy migrations and fallbacks for existing non-native Mercurial projects. These are also under development. The tracking issues are heptapod#420 and heptapod#421.
Today, all newly created Mercurial projects of foss.heptapod.net are native.
HGitaly2: fully native mode
Since we started to ship HGitaly1 and let users create native projects, we've been working on the next step: having all Mercurial content served through HGitaly, which will eventually lead us to drop the conversion to Git entirely.
This is a bit trickier than what we did for HGitaly1, which was mostly reimplementing in HGitaly the conventional branch mappings we already were doing in the conversion to Git.
Indeed, the Gitaly protocol is, perhaps unsurprisingly, Git centric in that it expects to address directory and file contents in terms of Git object ids. So we needed to find our way around that. It was one of the motivations for the split in the HGitaly1 and HGitaly2 milestones, the other one being the coincidence with user goodness and data migration properties provided by HGitaly1.
The good news is that we now have prototype implementations for all the needed Gitaly calls: a few days ago, a development instance passed all the functional tests in fully native mode for the first time. This was a look-ahead attempt that combined all the current experimental code, to gather insights and help us decide on a release plan.
In Heptapod 0.19 and 0.20, native Mercurial projects are operating at the HGitaly1 level. This means that they won't need any data migration to work with the future fully native mode.
Even better, as long as we keep converting to Git behind the scenes, we can switch back and forth between the fully native mode of HGitaly2, in which the converted Git content is not actually used, and the current partially native mode of HGitaly1. We will even be able to do that without instance restarts.
Now we know that no matter the amount of automated testing, a piece of software cannot achieve production readiness without extensive testing by real users. So we will ship in several steps and provide our early adopters with means to use the new modes without incurring much risk.
Step 1: feature gating and transparent fallbacks
In an upcoming release, we will introduce the fully native mode as a development feature flag, perhaps even at the Project or Group level.
On instances running with the feature activated, all read operations on Mercurial native projects will then involve Mercurial only, while writes will still be converted to Git for easy fallback to the more mature HGitaly1. Non-native Mercurial projects will be entirely unaffected.
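The behavior planned for this step can be summarized in a few lines. This is a minimal sketch, assuming a single boolean flag; the `Plan` type, the function name and the backend strings are hypothetical and only restate the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    read_backend: str            # which internal server(s) answer reads
    convert_writes_to_git: bool  # is the auxiliary Git repository kept current?


def step1_plan(native_project: bool, fully_native_flag: bool) -> Plan:
    """Describe Step 1 behavior for one Mercurial project (hypothetical API)."""
    if not native_project:
        # Non-native Mercurial projects are entirely unaffected by the flag.
        return Plan(read_backend="gitaly", convert_writes_to_git=True)
    if fully_native_flag:
        # Reads involve Mercurial only; writes are still converted to Git
        # so that falling back to HGitaly1 stays easy.
        return Plan(read_backend="hgitaly", convert_writes_to_git=True)
    # Partially native mode (HGitaly1): reads are split between both servers.
    return Plan(read_backend="hgitaly+gitaly", convert_writes_to_git=True)


print(step1_plan(native_project=True, fully_native_flag=True))
```

Note that `convert_writes_to_git` is true in every branch: at this step the flag only changes where reads are served from, which is what makes switching back and forth cheap.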
We will of course activate the fully native mode on the Heptapod instances that we manage ourselves, notably foss.heptapod.net. After a while, we will encourage our closest partners to do the same, and provide them support in return.
This could happen as early as Heptapod 0.21, but we won't hesitate to postpone it to 0.22, because of the equally appealing goal of releasing 0.21 at about the same time as GitLab 13.10, which is scheduled for March 22nd.
Step 2: unplug the conversion to Git
At this point, instances running with the fully native feature flag will stop converting new content to Git for native Mercurial projects.
Falling back to the partially native mode will then involve bringing the auxiliary Git repository back on par with the Mercurial repository. We will provide a way to do that, probably as a Rake task.
Some of our users are actually taking advantage of the Git conversion to mirror their repositories to Git-centric external systems. We consider that an interesting feature, even if it happened without planning on our side. Hence this will be the time to properly support it.
Since the conversion to Git is a major performance bottleneck for large repositories, this step will allow us to conduct experiments to assess what Heptapod can handle. We are currently hosting a couple of repositories with about 100 000 changesets (PyPy and Heptapod itself), and have been considering this order of magnitude to be the biggest reasonable one. At this step, we will be able to test repositories in the ballpark of a million changesets.
Step 3: fully native by default
This is just changing the default value of the feature flag. At this point, an explicit configuration will be needed for native Mercurial projects to run in partially native mode.
If you've been following so far, you know that the type of new Mercurial projects (native or not) is completely orthogonal to the fully native mode of operation. Nevertheless, if all new Mercurial projects are already native by default by the time we reach this step, we can (and should!) label it Heptapod 1.0.
Note: depending on how testing at Step 1 goes, we may actually decide to perform Step 3 before Step 2. But we'll still need both to claim the 1.0 milestone.
Thank you for reading this long article, we hope you enjoyed it.
We have pretty exciting times ahead, looking forward to Heptapod 1.0, with fully native Mercurial projects. It seems we can be there by the summer.