In this talk out of last month’s Gophercon India, Matthew Campbell, co-founder of Errplane (now open source metrics and analytics start-up InfluxDB), speaks about his experiences in switching over to Go, monitoring large-scale Go servers and how one can build a fully distributed project on a large platform using remote teams.
Based out of New York and Bangkok, Campbell, a former Rails developer who helped build projects for Bloomberg, recounts how he decided to start using Go in a project with Thomson Reuters, rebuilding their instant messaging platform for stock traders, which was originally written in Java and C++.
Thomson Reuters’ Eikon Messenger is one of the older Go applications outside of Google, and has been in production for over two years, with over 300,000 users worldwide. It’s the largest financial instant messenger in Southeast Asia and the EU, and the second largest in the Americas, and runs a very large Go server.
Campbell describes how the project originally began in New York, but because they could not find Go developers locally, they ended up hiring all around the world, eventually building it as a fully distributed project.
Eikon uses a web-based platform that includes a full stock-trading platform for commodities and financial consulting. Java is used for the front-end messaging system, but the back-end is in Go. Campbell lists the few protocols that the platform uses, including Microsoft SIP, a necessary component since every bank uses Microsoft instant messaging.
Campbell also notes the unique needs of the project’s context in the realities of the financial world: “We have a very sensitive set-up, more than most web apps. A lot of web apps can go down, they can have problems, but in instant messaging if you crash, users lose their messages or they get disconnected from your servers, then they get very upset.”
“Our customers are particularly sensitive since they are doing thousands or millions of dollars of deals on this platform. So when we have downtime, people are literally losing money every minute. So we have to be very good.”
Campbell then goes on to describe what he believes are the missing features of Go, and what workarounds the team used. First, it lacks a “decent” XML DOM parser; the team adapted by using XML2, a C library, but encountered issues integrating C with Go. Second, it lacks a regular expression (regex) engine with good performance; the Go-Redis library wasn’t enough, so the team uses PCRE (Perl Compatible Regular Expressions). Serialization is another big issue in Go: because of Go’s lack of generics, one ends up doing a lot of reflection, which puts extra load on the garbage collector and lowers performance.
Expanding on the problem of garbage collection, Campbell explains that because this is an instant messaging system, users get frustrated with any latency, “which ruins the experience. So garbage collection performance is one of our number one key metrics.” He also notes that heap size affects the speed of garbage collection, so in the face of slow response times, the team had to cut down on their heap sizes significantly, and they test performance using stringent methods.
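To make the idea of tracking garbage collection as a key metric concrete, here is a minimal sketch of sampling heap size and pause times from the Go runtime. It is not taken from the talk: the `GCSnapshot` type and the `ReadGC` and `ForceAndRead` helpers are hypothetical names chosen for illustration, and only the standard library's `runtime.ReadMemStats` is assumed.

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// GCSnapshot is a hypothetical container for the few collector
// numbers this sketch looks at.
type GCSnapshot struct {
	HeapAlloc  uint64        // bytes of allocated heap objects
	NumGC      uint32        // completed GC cycles
	LastPause  time.Duration // most recent stop-the-world pause
	TotalPause time.Duration // cumulative pause time since program start
}

// ReadGC samples the runtime's memory statistics.
func ReadGC() GCSnapshot {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	s := GCSnapshot{
		HeapAlloc:  m.HeapAlloc,
		NumGC:      m.NumGC,
		TotalPause: time.Duration(m.PauseTotalNs),
	}
	if m.NumGC > 0 {
		// PauseNs is a circular buffer; the runtime documents the most
		// recent pause as living at index (NumGC+255)%256.
		s.LastPause = time.Duration(m.PauseNs[(m.NumGC+255)%256])
	}
	return s
}

// ForceAndRead triggers a collection and then takes a sample.
// Forcing a GC is only for demonstration; a real monitor would
// sample periodically instead.
func ForceAndRead() GCSnapshot {
	runtime.GC()
	return ReadGC()
}

func main() {
	// Allocate some short-lived garbage so the collector has work to do.
	for i := 0; i < 1000; i++ {
		_ = make([]byte, 64<<10)
	}
	s := ForceAndRead()
	fmt.Printf("heap=%d bytes, cycles=%d, last pause=%v, total pause=%v\n",
		s.HeapAlloc, s.NumGC, s.LastPause, s.TotalPause)
}
```

A monitoring loop could call `ReadGC` on a timer and ship the values to a metrics system; watching how the pause figures move as the heap grows is one way to observe the relationship between heap size and collection speed that Campbell describes.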
But Campbell points out that every application, no matter which language it is written in, will always have IO performance issues. In this case, the system has to handle tens of thousands of users logging in at the same time. To cope, the team employs a strategy that Campbell calls a “Russian doll approach,” starting with a core server called Nitro, which has a super-fast in-memory cache. IO is broken down into permanent information (roster lists, friends lists and history) and ephemeral information (current location, logon server), and is handled using Redis, with about four Redis instances running on every blade server, so that there is no bottleneck when cached data needs to be retrieved. In the rare case that Redis does run out, the system falls back on MySQL. With consistent hashing, data is distributed evenly across all Redis instances, and the system scales horizontally with ease, needing no coordination between individual servers and little operational overhead.
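The consistent hashing mentioned above can be sketched in a few lines of Go. This is an illustrative ring under stated assumptions, not the project's code: the `Ring` type, the FNV-1a hash, the use of 100 virtual points per node, and instance names such as `redis-0` are all hypothetical choices.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring maps keys to nodes using consistent hashing with virtual points.
type Ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // hash position -> node name
}

// hashKey hashes a string with FNV-1a (an arbitrary choice for this sketch).
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual points per node on the ring.
func NewRing(nodes []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(n + "#" + strconv.Itoa(i))
			if _, taken := r.owner[p]; taken {
				continue // rare hash collision: the earlier node keeps the point
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Lookup returns the node that owns the first point at or after the
// key's hash, wrapping around to the start of the ring if needed.
// It returns "" for an empty ring.
func (r *Ring) Lookup(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}

func main() {
	// Hypothetical instance names; the talk mentions about four per blade.
	ring := NewRing([]string{"redis-0", "redis-1", "redis-2", "redis-3"}, 100)
	for _, user := range []string{"alice", "bob", "carol"} {
		fmt.Printf("%s -> %s\n", user, ring.Lookup(user))
	}
}
```

Because every client computes the same ring from the same node list, no server has to coordinate with another to decide where a key lives. Removing an instance only moves the keys that instance owned; keys on the remaining instances stay where they were.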
Campbell praises Go’s consistent compiler, saying,
“Every time a new release comes out, we compile the same day, switch our code, and within a week, it’s in production. We’ve never once had an issue with the Go compiler messing up our code. It’s one of the best things we’ve ever had.”
On the other hand, Campbell indicates that so far there are no “good integration test suites” for Go. To overcome this, the team spawns a cluster of instant messaging servers together with Redis, MySQL and Elasticsearch, pre-populates them with data, and runs bots against them through the public interfaces to complete the integration test.
Though Go doesn’t support dynamic linking, the project required the use of plugins. As a workaround, the team used TCP to link plugins back to the main process. Sometimes processes run on the same box and sometimes, when more performance is needed, they are moved onto their own servers. “But the code doesn’t change,” says Campbell, “because it just uses TCP, so it does not matter where the box is, and so it gives a nice scaling strategy, and you can deploy things independently.”
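As a rough illustration of linking a plugin to a main process over TCP, the sketch below uses Go's standard `net/rpc` package. The talk does not say which wire protocol the team used, so `net/rpc` is an assumption here, and the `Shout` plugin, the `Message` type, and the `ServePlugin` and `CallPlugin` helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
	"strings"
)

// Message is the hypothetical payload exchanged with the plugin.
type Message struct{ Body string }

// Shout is a hypothetical plugin that upper-cases message bodies.
type Shout struct{}

// Handle follows the method shape net/rpc requires: two exported
// argument types, the second a pointer, and an error return.
func (Shout) Handle(in Message, out *Message) error {
	out.Body = strings.ToUpper(in.Body)
	return nil
}

// ServePlugin exposes the plugin on addr and returns the bound address.
// In a real deployment this would run in its own process.
func ServePlugin(addr string) (net.Addr, error) {
	srv := rpc.NewServer()
	if err := srv.Register(Shout{}); err != nil {
		return nil, err
	}
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, err
	}
	go srv.Accept(ln) // serve connections in the background
	return ln.Addr(), nil
}

// CallPlugin is what the main process would do: dial the plugin's
// address and invoke it. Only the address changes when the plugin
// moves to another machine.
func CallPlugin(addr, body string) (string, error) {
	c, err := rpc.Dial("tcp", addr)
	if err != nil {
		return "", err
	}
	defer c.Close()
	var out Message
	if err := c.Call("Shout.Handle", Message{Body: body}, &out); err != nil {
		return "", err
	}
	return out.Body, nil
}

func main() {
	// Port 0 asks the OS for any free port on the local machine.
	addr, err := ServePlugin("127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	reply, err := CallPlugin(addr.String(), "hello")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(reply)
}
```

The caller only knows an address string, which is the property Campbell highlights: the same code works whether the plugin runs on the same box or on its own server.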