DVCS blessed repo replication among geographically distributed teams

My company is exploring the move from Perforce to a DVCS and we currently use lots of Perforce proxies because the software development teams are spread over Germany, China, USA and Mexico and sometimes bandwidth from one place to another is not that great.

Speaking with IT, we started looking for a way to keep things running smoothly from the geographically distributed perspective, so that everyone gets the latest and greatest without having to determine which repo server is the source of truth (i.e. replicating transparently).

I thought that maybe we could emulate the DNS mechanism through pre-push and pre-pull hooks. For example, consider countries A, B, and C. Upon a pull from blessed A, A itself will pull changes from B, which in turn will pull changes from C. If B and C have new changes, they will flow towards A. Conversely, when there is a push, it could be propagated to all blessed repositories.

I’m aware that generally you only have one blessed repo, however this may not scale globally and each blessed repository would just be applicable to the teams from a single country.

My question is: is DVCS repo replication something that is used in practice? Has anyone done it successfully? If so, what is the correct way to do it?

This question asks about transparent replication, and I suspect there are no answers yet because people might be getting hung up on transparency. I’ll take the liberty of setting aside transparency for the moment to focus on replication. I’ll deal with (or finesse) transparency later, and in fact I don’t actually think it’s all that important in a DVCS.

First, let me run down a few key points about the way repositories work in a DVCS. (I’m most familiar with Mercurial, so that’s what I’ll use for examples, but I believe everything I say is also true of git.)

A. In a DVCS, any clone contains the same file contents and history as the original.

Provided you keep the repos properly in sync, this means you can use ordinary DVCS change propagation operations (clone, push, pull) and ordinary repos to build a replication system.

B. New changes don’t have to be propagated to where they came from.

In particular, if I were to get changes from a particular repo, and add some changes of my own, my changes don’t have to go back to that particular repo. They can go elsewhere. The utility of this should become clear from examples I’ll show below.

C. Changes can be propagated via push or pull.

In a centralized system, new changes can pretty much (I think) only be pushed into the repo. In a DVCS, it’s possible to set up a variety of change propagation topologies, some of which involve only pulling. This affords more flexibility in the setup.

Examples

For the sake of discussion let’s say your distributed teams use systems in the domains duke.de, duke.us, duke.cn, and duke.mx, and further that duke.de is where we want to have the “blessed” repo. Given these assumptions, let me lay out a number of examples of different topologies you could set up, bearing in mind the three key DVCS points above.

0. Centralized Push Model

Have a single repo at hg.duke.de and have the developers in all locations clone and pull from here and push changes here. This might work for the folks in Germany, but it would probably be a problem for the people in the rest of the world. All clone, pull, and push operations would go across slow long-haul network links. This is using a DVCS just like a centralized system. This is the problem you’re trying to solve.

1. Centralized Push with Replication

Have the blessed repo at hg.duke.de and have replicas at hg.duke.cn, hg.duke.mx, and hg.duke.us. Developers clone from their local replica and push changes to hg.duke.de. Whenever new changes appear in hg.duke.de, a hook runs that propagates them to the replicas. The replicas are otherwise read-only, thus there will never be any merges or conflicts.
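As a rough sketch of the propagation hook, the blessed repo's own .hg/hgrc could use Mercurial's changegroup hook, which runs after incoming changesets have been added to a repo. The path aliases and hook names below (replica-cn, replicate-cn, and so on) are made up for illustration, and the URLs are abbreviated the same way as elsewhere in this answer; a real URL would also include the path to the repository on each host.

```ini
# .hg/hgrc of the blessed repo at hg.duke.de (illustrative sketch)

[paths]
# Hypothetical aliases for the read-only replicas
replica-cn = ssh://hg.duke.cn
replica-mx = ssh://hg.duke.mx
replica-us = ssh://hg.duke.us

[hooks]
# Each entry runs after new changesets arrive in this repo
# and forwards them to one replica.
changegroup.replicate-cn = hg push --quiet replica-cn
changegroup.replicate-mx = hg push --quiet replica-mx
changegroup.replicate-us = hg push --quiet replica-us
```

One thing to watch with this arrangement is that the hook runs as part of handling the incoming push, so the developer who pushed may end up waiting on the long-haul links to the replicas. If that turns out to be a problem, having each replica pull from hg.duke.de on a schedule is an alternative that keeps the replicas read-only in the same way.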

If I’m a developer in Mexico, for instance, I’ll clone from hg.duke.mx but push changes to hg.duke.de. If other changes are pushed into hg.duke.de before I can push my changes, my push will be blocked. The other changes will be replicated to hg.duke.mx, so I’ll pull these changes locally, merge, and then attempt to push to hg.duke.de again.

This should provide some advantages, since the big clone operations are all done locally. Pushing to the central repo in another location might not be too bad, since changes are pushed relatively infrequently and incremental changes are generally fairly small. (Mercurial in particular essentially sends compressed diffs, not entire files and their histories.)

In Mercurial, you can set up a local repo to pull from one location and push to another by putting something like the following in the .hg/hgrc file:

[paths]
default = ssh://hg.duke.mx
default-push = ssh://hg.duke.de

2. Simple Pull Model

Continuing with hg.duke.de as the blessed repo and the others as replicas, we can propagate changes via pull instead of push. Developers clone and pull from their local replica as usual. When a change is ready, a developer submits a pull request to some central service, which pulls from the developer’s repo into hg.duke.de. A policy will need to be established for merges. For example, if there are merge conflicts, the request might be rejected, requiring the developer to pull (from the local replica), merge, and resubmit the pull request.

This approach has the advantage of not making the developer wait while changes are being propagated. Of course, the dev still has to wait for the pull request to be acted upon, but at least he or she can work on additional changes during that time.

Variations

There are a bunch of variations that can be applied.

The submission of a pull request is a perfect time for code review. The changes are published, in the sense that they’re available to everyone, but they haven’t yet been integrated into the blessed repo.

Pull requests can be acted upon manually or by some automated system. Processing a pull request might not merge changes directly into the blessed repo, but instead into a temporary staging area where a build and test cycle is done. Only after passing all the tests would the changeset be integrated into the blessed repo.

Those more comfortable with a push model might want to set up a local staging repo in each location, alongside the replica, e.g. hg-stage.duke.mx, hg-stage.duke.cn, etc. This requires a bit more work, though, as developers not only have to merge against other local changes, but somebody has to be responsible for merging changes from the staging repos into the blessed repo. This can work under the right circumstances, though, and can be aided by automation.
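For this staging variation, a developer in Mexico might use a .hg/hgrc along the following lines, so that clone and pull use the local replica while push goes to the local staging repo. This is only a sketch using the example host names from this answer; as before, the URLs are abbreviated and a real one would include the repository path.

```ini
# .hg/hgrc in a developer's local clone (Mexico), illustrative sketch

[paths]
# Pull from the local read-only replica
default = ssh://hg.duke.mx
# Push to the local staging repo instead of the blessed repo
default-push = ssh://hg-stage.duke.mx
```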

“Transparency”

Now to the issue of transparent replication.

Given the above scenarios I don’t really see the need for transparent replication. All the repos are visible to everybody, and there are conventions for pulling/cloning from the local replica and pushing to a blessed repo or a local staging area.

If you want transparency, you could have people set up DNS search domains according to their location. The local replica and staging repos would simply be referred to as “hg” and “hg-stage” and the DNS setup would resolve these to hg.duke.cn and hg-stage.duke.cn for developers in China, and correspondingly for developers in other locations. But this is a bit of magic and can be confusing, and I really don’t think it adds much.

I hope this answers your question. I took a number of liberties with the response, but it seems to me that your situation could be remedied through the use of techniques that I’ve described above.

Filed under: softwareengineering - @ 14:47

Tags: data-replication, distributed-development, dvcs
