Forge Federation Design Discussion

Since I recently returned to working on forge federation, I’ve slightly forgotten some of the rationales and reasons for the design choices made earlier. This post is mostly written for myself, but if you have any feedback or comments, just let me know.

From a technical point of view (so ignoring aspects like moderation), forge federation is a synchronization problem. You have two people who want to collaborate on a project, but without a centralized forge, there will be two copies of the data that must be kept in sync. Sure, you could have a forge with some sort of single sign-on to make it easy to create accounts, but that would basically be a centralized forge.

One solution would be to put projects on a distributed or peer-to-peer network to handle the synchronization for us, which is what git-ssb already does. However, this introduces a lot of inefficiency and complexity that we would like to avoid.

That leaves a federated network as the only remaining option. For the problem of how to do the synchronization, the main options are primary-secondary replication or a consensus algorithm like Paxos or Raft. We decided to go with the first option because consensus algorithms are much more complicated to implement and debug. Of course, the main flaw with primary-secondary replication is that it can’t automatically recover from the primary server failing. However, this isn’t a huge issue for forge federation, because if this happens, we can allow any secondary server to declare itself the new primary. Basically, we can solve this problem using manual intervention. This does result in the split-brain problem if multiple secondary servers declare themselves primary, but this is fine for forges.
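To make the replication model concrete, here is a minimal in-memory sketch of primary-secondary replication with manual promotion. The `Instance` and `Project` classes and their method names are hypothetical and only model the idea; a real forge would do the copying over the network.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One forge instance holding a copy of a project's data (hypothetical model)."""
    name: str
    is_primary: bool = False
    data: dict = field(default_factory=dict)


class Project:
    """A project replicated from one primary instance to several secondaries."""

    def __init__(self, primary, secondaries):
        primary.is_primary = True
        self.primary = primary
        self.secondaries = list(secondaries)

    def write(self, key, value):
        # All writes go through the primary, then get copied outward.
        self.primary.data[key] = value
        self.replicate()

    def replicate(self):
        # The primary is authoritative: secondaries are overwritten, not merged.
        for secondary in self.secondaries:
            secondary.data = dict(self.primary.data)

    def promote(self, new_primary):
        # Manual failover: a person decides that this secondary takes over.
        # The old primary is assumed to be gone, so it is not kept as a secondary.
        self.secondaries.remove(new_primary)
        self.primary.is_primary = False
        new_primary.is_primary = True
        self.primary = new_primary
```

Nothing here stops two secondaries from each being promoted in separate `Project` objects, which is exactly the split-brain case mentioned above.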

The other big problem is what to synchronize. I’ve had multiple people tell me to just store the issues, pull requests, and so on in Git and use Git for pushing and pulling data around. I also like this idea, but there’s another idea that I like even better: using JSON (specifically Activity Streams).

Now at first you might think that’s a terrible idea, but hear me out. Forgejo currently stores issues and stuff in its database, so we can continue doing that and expose them as JSON at some API endpoints. In addition, any programming language can easily process JSON, which is ideal because the purpose of ForgeFed is interoperability. Finally, JSON also makes it easy to selectively sync stuff. For instance, say Bob wants to open an issue on Alice’s project on a different instance. Then we really only need to synchronize that single issue thread between the two instances instead of the entire project, which could be gigantic. Sure, you could probably do the same with git subtrees or something, but this idea is way easier. Another small benefit of using Activity Streams is that ForgeFed can be interoperable with other ActivityPub software, so you can do things like star a repository using your Mastodon account.

So how do you actually sync JSON? What about conflict resolution? Well, the nice thing is that with primary-secondary replication, we don’t need conflict resolution. The primary server is always correct, so we can overwrite data on the secondary server with stuff from the primary server.
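Syncing a single issue thread by overwriting the secondary's copy could look roughly like this. It is only a sketch: the objects are loosely modeled on Activity Streams (`Ticket` is the ForgeFed type for issues, `Note` the Activity Streams type used for comments), but the exact properties, the URLs, and the `export_thread` / `apply_thread` helpers are made up for illustration.

```python
import json


def export_thread(store, issue_id):
    """On the primary: serialize one issue and its comments as JSON."""
    issue = store[issue_id]
    comments = [obj for obj in store.values() if obj.get("context") == issue_id]
    return json.dumps({"issue": issue, "comments": comments})


def apply_thread(store, payload):
    """On the secondary: overwrite local copies with the primary's version.

    There is no merging or conflict resolution; the primary always wins.
    """
    doc = json.loads(payload)
    for obj in [doc["issue"], *doc["comments"]]:
        store[obj["id"]] = obj


# Hypothetical data: Alice's instance is the primary for her project.
ISSUE = "https://alice.example/project/issues/1"
COMMENT = ISSUE + "/comments/1"
OTHER = "https://alice.example/project/issues/2"

primary = {
    ISSUE: {"id": ISSUE, "type": "Ticket", "summary": "Crash on startup"},
    COMMENT: {"id": COMMENT, "type": "Note", "context": ISSUE,
              "content": "I can reproduce this."},
    OTHER: {"id": OTHER, "type": "Ticket", "summary": "Unrelated issue"},
}

# Bob's instance has a stale copy of the one issue he cares about.
secondary = {ISSUE: {"id": ISSUE, "type": "Ticket", "summary": "Stale title"}}

apply_thread(secondary, export_thread(primary, ISSUE))
```

After the call, Bob's instance holds the current issue and its comment, and never sees the unrelated issue or the rest of the project.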

There are actually two different JSON formats right now for representing forge data, F3 and the representation in ForgeFed, but I’m going to work on merging those two into a single specification this month.

One last thing: what about object capabilities? I haven’t really thought too much about this, but I think they aren’t necessary for Forgejo at least. Any activity by a user is signed with HTTP signatures, which provides some security. For more complicated scenarios, we probably need OCAPs, but I’m going to work on other stuff first.
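For reference, here is a rough sketch of how an HTTP signature covers a request, following the signing-string layout of the HTTP Signatures draft. One big assumption: fediverse software commonly signs with a per-actor RSA key pair, while this sketch uses HMAC-SHA256 with a shared secret purely so that it runs on the Python standard library. All function names, the key id, and the request values are hypothetical.

```python
import base64
import hashlib
import hmac


def digest_header(body):
    """Digest header value that ties the request body to the signature."""
    return "SHA-256=" + base64.b64encode(hashlib.sha256(body).digest()).decode()


def signing_string(method, path, headers):
    """One 'name: value' line per covered header, request target first."""
    lines = [f"(request-target): {method.lower()} {path}"]
    lines += [f"{name.lower()}: {value}" for name, value in headers.items()]
    return "\n".join(lines)


def sign(secret, method, path, headers):
    """Return the base64 signature over the signing string (HMAC stand-in for RSA)."""
    message = signing_string(method, path, headers).encode()
    return base64.b64encode(hmac.new(secret, message, hashlib.sha256).digest()).decode()


def signature_header(key_id, headers, signature):
    """Format the Signature header that would be sent with the request."""
    covered = " ".join(["(request-target)", *(name.lower() for name in headers)])
    return (f'keyId="{key_id}",algorithm="hmac-sha256",'
            f'headers="{covered}",signature="{signature}"')


def verify(secret, signature, method, path, headers):
    """Recompute the signature on the receiving side and compare in constant time."""
    message = signing_string(method, path, headers).encode()
    expected = hmac.new(secret, message, hashlib.sha256).digest()
    return hmac.compare_digest(expected, base64.b64decode(signature))
```

Because the `Digest` header is part of the signed headers, changing the activity body after signing makes verification fail.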