Technology

The New Stack Makers: DAT’s Max Ogden, Building a GitHub for Data

22 Jul 2014 3:39pm

It’s OSCON, and so it seems to make sense to post an interview I did with Max Ogden, one of the original Code for America fellows. He is now building DAT, a “GitHub for data” project supported by the Alfred P. Sloan Foundation.

Alex: So, essentially just trying to get a picture of Max Ogden, where you’re from, where did you grow up?

Max: Cool. I’m from Oregon, but I’m from the Oregon Coast, and the town I grew up in had 2,000 people, the town I was born in had 200, and I thought 2,000 was big. I used to go to the town of 8,000 which had a Taco Bell and I was like, “Big city.”

What town?

Max: Newport was the big town with a Taco Bell, and I was in Waldport, in Yachats. And Yachats is this weird little hippie village, basically. My parents moved there after they went to San Francisco in the ’70s. They weren’t into it at all – they found it to be too crazy – so they moved up to Oregon and lived in an A-frame in the woods until they started having kids, and then they settled in Yachats. So I grew up close to the ocean, close to the mountain range, but as soon as I got a car I started going to Portland all the time, and I moved here. I went to Portland State for two months, doing computer science, but I kind of felt like it was a breeding ground for IBM…

Why did you feel that?

Max: All the classes were enterprisey-feeling, and they weren’t doing cutting-edge stuff. You can’t really find cutting-edge stuff in computer science programs generally, but they just rubbed me the wrong way. Looking back, I missed out on some of the computer science fundamentals early on, but what I did instead was basically teach myself Ruby on Rails, and I got a job at a software company here in town as an intern.

Which company?

Max: They’re called Revelation. They’re a Rails marketing company, basically.

What year was that?

Max: 2008? 2007? It would have been late 2007, I think. At that point I was working at the Apple store in the mall, but then I got this internship – I was like a junior developer, basically – and I was able to quit the Apple store. I was doing computer training there, which is actually a pretty sweet job for someone with no college credentials. It was, like, the “teach people how to use Photoshop” instructor. I was teaching people how to do web design and stuff, but I didn’t know programming at that point.

But then I basically spent three years, every day, at this Rails company, just learning the ins and outs of everything: software development, agile testing, team-based software, how startups work, how a company works internally, how office politics work, how to hire people, how to train people. It was a really good immersive experience – the kind of work that drains your brain, but in a good way, where you feel totally exhausted because you just spent all day doing new stuff.

Why did you want to understand so much about the way all those things work?

Max: Well, I was just kind of – aside from the fact that I was working there – it happened really quick. I went from, “Oh, I should be a programmer,” to, within a few months, having a job at a software company somehow and doing all the things. And I learned really quick that way. They also did pair programming at that company, which is kind of extreme in my opinion, looking back on it. They don’t let you write code unless you’re in a pair, so you’re constantly working with another person, and it was a really good way to get up to speed on…

Did that help you learn faster?

Max: Oh yeah. I felt like I got more out of it than the other person did, honestly, because it’s like you soak it up like a sponge if you are the lesser-skilled in a pair…

That’s kind of a universal thing.

Max: It was cool, and they see that as an investment in the team – instead of having one person who, if they get hit by a bus, you’re screwed, pairing helps spread the knowledge around. So I eventually got to a point where I was more or less up to speed. But then, in 2008, Sam Adams got elected, and Obama as well. And Obama did this thing where he declared all data open – passed a mandate, or Presidential order, or whatever. He said, “We should open all the federal data,” and then all these mayors basically said, “Hey, let’s copy Obama’s directive and just open up a bunch of city data.” But I don’t think anybody really knew what that meant at the time. It was all kind of theoretical, like, “Oh yeah, that’s a good idea, let’s open up data sets,” but nobody had really figured out which data sets were interesting. This was right around the time that Google started to do transit – bus directions and stuff – and that came out of TriMet, so Portland already had a little bit of geek credit around open data, because it was the first city to have real-time bus arrivals on Google Maps. So in my free time, because I was kind of getting bored of the marketing stuff, I started messing around with all the data that the city was publishing, and was super into it. I think I had a naïve fascination with public service.

What was that naïve fascination?

Max: Oh, as in the reality of bureaucracy, and how demoralizing it can be, and how frustrating it is not being able to get things done efficiently. But I was on the outside, so it was really fun, because I got to take advantage of all the produced data; I didn’t have to see how the sausage was made, so to speak. The first major open source project I did was taking the City of Portland’s data sets – they were in these crazy formats that were either government vendors’, like Esri map data, or Microsoft SQL Server databases, stuff that you have to buy licenses for to read – and I just converted them all to open web formats, so that you could put them up on an API and people could access them from a web browser.

How many years in programming have you had by this point?

Max: One or two.

Wow. How did you learn it so quick? Was it the Pair Programming, what was it? Did you just feel like you had a natural tendency for it?

Max: I’m not one of those prodigies, I think. When you talk about programming education, I kind of feel like there are some people who are naturally going to be good at really low-level programming, and I think you can teach a little bit of programming to anybody, but there’s always a spectrum, and I’m probably in the top 50% of the spectrum of people who can be really good programmers. The combination of being able to think in abstract logic, or whatever.
But, for instance, I was never really good at math. I’ve taken precalculus, but – it might just be because it was never applied in an interesting way – I wasn’t a math genius or anything like that. And there are some programmers who just eat and sleep data structures and algorithms, and I’m not quite that extreme. But I was lucky, because I had really good mentors, and I was the same age as I would have been taking college courses, except I was doing 100% software. So I was focused on it for a couple of years.

Where was that?

Max: This was at Revelation Software. The way that I got hired at that company was basically, I read the Rails book – it teaches you how to make a shopping cart app – and then I wanted to try out everything else, so I checked out a Django book, because Django is like the Rails of Python. I was like, “Okay, I’ll see what’s up over there,” and I made a site called PortlandSmells.com. Because I basically read the Django book and I was like, “Okay, I’m also really interested in maps,” and the Google Maps API was the coolest thing you could do at that time.

So I just made a database-driven Django app that was just a map of Portland – empty at the beginning – and you could add points and say if they smelled good or bad, and then describe the smells and give them a title. It was pretty simple and straightforward. But I did the design; I taught myself Python; it wasn’t anything super-technical – there were no algorithms or anything. But I think the reason I got the job was that they could see, from how I’d taught myself to get to that point, that I could apply myself. Because it took a little bit of everything. It took a little bit of style sheets to make the site look good, and I made a logo in Photoshop and I designed the map markers, and it wasn’t beautiful but it was at least functional.

At that point, I was like, “Oh, this is interesting, you can… programming is like…” – I don’t want to use the word “meritocracy” because that has negative connotations when you talk about equality, but for lack of a better term, programming is something that, if you can get past the privilege that you do or do not have, you don’t have to get accredited, you don’t have to pass the bar before you can get work. You don’t have to go through six years of education training and all these certifications. I liked that it was very “all you need is a laptop and the time,” and I thought that was really cool. It kept me going because I saw, if I invested in it, I saw a lot of immediate returns, as opposed to, “Oh, I should get a master’s degree” – who can plan seven years in the future effectively? Everybody switches majors a lot and all that, so… it just seemed logical.

I wonder how that pairs with the interest in open data, and how that propelled your own skill set and your exposure to lots of other people who could help you grow.

Max: Yeah, yeah. So, I totally lucked out timing-wise. Had I started this in 2004, I might not have found something like open data, which was getting popular and which I was highly interested in at the same time. So I was basically playing around with a bunch of data sets. I made a map of all the bike racks in Portland so you could see where the most bicyclists are – crime data, too – data visualizations that nobody had done using public open data… There were a couple of city-wide hack-a-thons where they basically say, “Here’s all our data, who can make the coolest thing?” I won two of those in a row just making what I thought were interesting little visualizations of stuff. One was: you could draw your commute on a map with lines, and then you’d get text messages when any of those roads was going to be closed for construction. That was the plan; I actually never could get the City Department of Public Works to release the schedules, because they didn’t want to be held accountable to them. They’d give quarters; they’d say, “In spring we will close this road.” They wouldn’t actually give dates. But the concept was promising. I was like, “Maybe if I build it, I can get them to release the data.”

There are a lot of roads that are closed that you come across in your biking.

Max: With all the streetcars stuff going on…

And the big sewer projects… we see it every month in Portland.

Max: So, that basically turned into the Mayor giving me an award in front of a crowd – and at one of those events, it was during OSCON, Tim O’Reilly was in the crowd. And I didn’t know. And then he comes up to me afterwards and he hands me a card with the lemur thing on it. He’s like, “Hey, I’m starting a new non-profit called Code for America. Right now I’m recruiting the first year of fellows – do you want to be a fellow?” He just asked me straight up. “It’s a year-long program; you could spend a year doing what you’re doing, but nationally.” And I was like, “Oh, yeah” – I was sold at that second. And he’s like, “Come by OSCON – I just comped you a $1,600 ticket, you can go to all the sessions.” So I took vacation days from work to go to OSCON and meet everybody else. Then I basically put in my notice at work.

What year was that?

Max: That was June of 2010. And then I started Code for America in January 2011 and did that all year. So I was in the first class of fellows – the first year of Code for America, their first year of operation. They spun up that summer of 2010 and actually opened the doors in January.

What did you do the first day?

Max: The first day? We basically… It was funny. They kind of didn’t know if people would show up, and were just like… Because basically they got twenty people to move to San Francisco for… I actually took a pay cut to move to San Francisco. I was making more money here in Portland than I was in San Francisco, which is crazy, because they pay you a stipend. I was living on a stipend in San Francisco, and San Francisco rent is probably double now what it was four years ago. I didn’t realize it at the time; only when I started getting my paychecks was I like, wow, this is less than I had… I guess I didn’t realize that. But I actually saved more money, ironically, in San Francisco, because I was so busy that I literally never went out. I don’t drink, I don’t have a car… Basically all I was doing was paying the rent, that’s pretty much it, eating cheap burritos in the Mission. So the first day we all show up and then they’re all like, “Ah, everybody showed up, good.” You know, twenty new fresh faces that had just moved to San Francisco the week before. I loaded up my brother’s Prius – he lives on the coast of Oregon still – and we drove down the 101, he unloaded me in San Francisco and said, “See you later,” and I showed up at Code for America.

It was awesome. They basically spend the first month training you. They get a bunch of speakers to come in and talk about the challenges of government and bureaucracy, how to optimize government, how technology works, and also how politics work and how different city halls are structured – some of them very mayor-driven, some city-council-driven – basically an all-around primer on technology and politics and community organizing. I think they did a good job for the first year, and I’ve kind of watched it evolve over the years. It started with three staff members and twenty fellows, and now it’s like forty staff members, thirty fellows a year, and almost 150 alumni. They also have an incubator program, because Google gives them a bunch of money every year now, so they have a startup accelerator/incubator, as well as a volunteer network, with meetups in a couple of hundred cities. So, it’s a lot bigger than… they exploded, they got really popular, non-profit-wise.

And that first year, basically, it was bittersweet moving away from Portland, because I love the developer community here – it’s so hobby-focused. And it was a shock going to San Francisco specifically, seeing how all the meetups were recruiting-focused; they were all kind of cheap-pizza-and-cheap-beer, formulaic, at some startup office, and they’re all just talks from developer evangelists, all like, “Here’s the API to use our paid service.” Whereas in Portland you get talks like, “I wrote a functional programming language in Lisp that’s based on Haskell…” You know, some crazy research… Or, “I turned a garden trowel into an instrument” – really cool, weird, not business-driven at all.

And that was what I identified with, because people have so much free time here to do their passion projects, but in California it was very work-hard-play-hard, and, “If I do go give a talk it’s because I’m promoting my startup.” And there wasn’t really a community, in my opinion – I haven’t really found the San Francisco tech community. I think the Portland one might be bigger. There’s a low signal-to-noise ratio in San Francisco. So, as a community-driven open source person, I was a little bit like, “Oh, this is weird.” It’s a different breed of open source, where you might get to work on open source through your company. Whereas people in Portland can do open source all the time, because they don’t need that much money to pay the bills, so you can kind of do what you think is right all of the time.

Yeah.

Max: Less of the “Gold Rush” effect. So that first year was interesting – like I said, bittersweet, because I missed Portland, but also I was getting to work with cities… I was working with Boston mostly, and the Boston Public Schools system, so I was splitting my time between Boston and San Francisco and basically worked on a variety of stuff for either teachers or students. I started working with data at that point – so the stuff I’m working on now started during my fellowship year, but it really started up here, when I was working on PDX API, the Portland API for government data I was talking about.

Let’s follow that progression from PDX API to what you did in Code for America to this new project.

Max: Yeah. So the thing I did when I was at Code for America – obviously the Portland API is for Portland, but I thought, why not genericize it so that anybody who has open data can publish it? The idea behind the PDX API is, if you have some data set – say the boundary of Forest Park – the Forest Service, or maybe Oregon Parks and Rec or whoever, will go out and GPS-track the boundary, so they know what the park boundary is. But say they buy new land and the park expands: they would be the ones who have to update that, because it’s all an internal process. All you get to do as a citizen is fetch the updated version. That’s not how OpenStreetMap works. With OpenStreetMap, whoever first notices the change edits it and sends a patch – that’s more of an open source model. So I kind of viewed that as a fundamentally broken system. There’s no way to contribute to the government’s database.

That latter example, OpenStreetMap, is an example of participatory democracy.

Max: Sure, definitely. It’s like giving people the tools to do the work for you. And I think there are a variety of reasons why government doesn’t work that way – mainly I think it’s cultural, like risk aversion. Having worked with multiple Code for America projects, risk aversion is almost the biggest issue. It’s easier to say “no” than it is to try something new and put your neck out… There are also reasons like knowing they can ensure data quality standards if they have employees collecting data, because those employees are more accountable than random people from the Internet, and they don’t want people putting in fake data or low-quality data.

But I think those issues are more theoretical than anything. If you gave people a way to effectively help you manage your data sets, you’d get really meaningful contributions. It’s just like code on GitHub: you obviously get some low-quality contributions, but you don’t have to merge those if you don’t want to. And when you get some really high-quality contributions, that’s the payoff of having an open source project: somebody else with a different use case makes your thing a lot better, and you can both work on it, and it becomes a greater-than-the-sum-of-its-parts kind of thing.

So, I was kind of frustrated with government data because it was so read-only, it was so thrown-over-the-wall at you from government and you just get to use it. But if you improve it, the best thing you could do is just host it yourself. And that’s what a lot of people do. Take the census for example. Reading the census data directly from the federal government is horrible because it’s in weird formats. So, USA Today – or there’s some magazine or something – they provide basically better census data. It’s the same data, it’s just they did all the work and cleaned it up.

enigma.io

Max: Yeah, yeah, yeah, exactly. There’s a series of these – there’s one for the Federal Reserve, there are all sorts of them for public transit feeds. There are all these people who, because the government data is so bad, just wrap it in better versions, or host better versions. And so that’s kind of the problem that I saw. Ideally, the government would adopt better data practices, but at the very least it would be cool if the government – when I say “the government,” let’s just take a city for example – if the City of Portland had a GitHub account. They’d put all their data sets on GitHub. Then at least you’d get their raw version in some sort of version control system, and then I could at least fork it and have my version, and there would at least be links between them.

So at least if somebody goes to the City of Portland’s GitHub account, they can see there are different versions available – as opposed to, how do they Google it? If you’re on the census website, how do you get to enigma.io from there? It’s not like the census site is going to link me there. So there are a lot of discoverability problems. But there are also cases where you want two versions of a data set. For instance, the census is only going to follow their guidelines for what a census is, but say you wanted to do your own census and mix it in – if you want to do a local census on your block and compare it to federal census data, for the people who care about that… It’s like Wikipedia. Wikipedia doesn’t care about Portland-specific neighborhood stuff, but that’s why we have PortlandWiki. We have a local version that’s all about Portland, and then there’s Wikipedia for stuff that’s on a global scale. It makes sense to have two different levels.

So generally, I just kind of saw that there are all these issues around data management that we have for source code that we didn’t have for data sets.

And the version control system on GitHub is obviously ideal for tracking updates to code, but you’re saying there’s a difference when it comes to tracking updates to data?

Max: Yeah. If you put text data on GitHub it works okay – small CSVs and things that you could imagine opening up in a text editor and editing every once in a while, maybe some JSON or something – but it’s not great for files that update really often, like GPS trails. I just wrote a thing the other day to get all the bus movement in the East Bay. Their TriMet down there is called AC Transit, and they publish where all the buses are, and I just grab it every few seconds and get a snapshot of where all the buses in the system are. I ran it for about six hours and got 125,000 points – 125,000 records of where buses were over a six-hour window – and even just the buses in Oakland was too much. That would be too much data to put in a single Git repository, because the time spent adding every bus update would just stack up and then the thing would crash, because Git isn’t built to be a database.

So you hit limits when you try to work with data sets that either update a lot or are really big. Because if you try to add a 125,000-row CSV, Git will sit there and crunch for a while – it’s really not built for that kind of speed. I think GitHub is the best place to publish open data for the majority of use cases, but it’s not great once you get into large data sets. Of the data sets that Portland releases, maybe 60% of them are small enough to put on GitHub, and then there’s the other 40% – the building footprints data is like a gigabyte, every building footprint for the whole city. There are some data sets that are really high-accuracy GPS kinds of data. Basically, stuff that’s hand-curated, hand-edited, is usually pretty small, but stuff that comes from devices, like sensors, tends to get really big.

Log data…

Max: Yeah, exactly. So, once you start getting into gigabytes, especially terabytes, Git isn’t very scalable. But basically what I wanted was something to build a GitHub for data on top of, and that was the project I worked on at Code for America. It was called DataCouch, and it was built on CouchDB at the time. It was a very early prototype of what I’m working on now – this was like two-and-a-half or three years ago. The whole point of DataCouch was that I was literally ripping off GitHub, but making it for spreadsheets instead of for code. So it was aware of row-based changes… I was just trying to build what I thought would be an awesome set of tools for managing spreadsheets. Almost like a GitHub for Google Docs, or GitHub for Google Spreadsheets.
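Row-awareness like that can be sketched in a few lines. The following JavaScript toy is purely illustrative – not DataCouch’s actual code – but it shows the kind of change a row-based tool has to detect: diffing two versions of a table keyed by a unique `id`, rather than diffing lines of text the way Git does.

```javascript
// Illustrative sketch: row-level diffing between two versions of a table,
// keyed by a unique id. Not DataCouch's real implementation.
function diffRows(oldRows, newRows) {
  const oldById = new Map(oldRows.map(r => [r.id, r]));
  const newById = new Map(newRows.map(r => [r.id, r]));
  const changes = [];
  for (const [id, row] of newById) {
    if (!oldById.has(id)) {
      changes.push({ type: 'add', id, row });
    } else if (JSON.stringify(oldById.get(id)) !== JSON.stringify(row)) {
      changes.push({ type: 'update', id, row });
    }
  }
  for (const id of oldById.keys()) {
    if (!newById.has(id)) changes.push({ type: 'delete', id });
  }
  return changes;
}
```

The payoff of thinking in rows instead of lines is that two edits to different rows never conflict, which is what makes merging spreadsheet-style data tractable.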

Because, if you imagine – Google Spreadsheets is not public by default. You can go to my GitHub account and see all my activity and all my repositories – a lot of recruiters go to GitHub accounts now, because it’s a great way to see how active a developer is, what languages they use, and what their expertise is – but I can’t go to a company and look through their Google Docs. Google Docs is meant to be private; you share one document at a time. It’s not something you publish to the world.

And so the cool thing about GitHub – the social part, the “social coding” slogan that GitHub uses – is in how everything is open, and they incentivize openness by making you pay to go private. Whereas with Google Docs you get private for free, with GitHub you have to pay to make things private, so there’s more of an incentive to be open. That way you get a lot more collaboration, and you make it so that people can take your code, add features to it, and send you pull requests. It’s a really cool network effect if you have a system where people can be proud of what they put into it and show it off and have a profile. And I kind of wanted that for data, and I wanted it to hook into developer tools, too.

So, in data, now most – like your example of open government data – most of it is closed, and people have been trying to open it up.

Max: Yeah.

Your philosophy is to keep it open…

Max: Yeah.

…and what if you want to make it private?

Max: So there are use cases for private data that has sensitive information in it, but the majority of data that a city collects is just for, like, tax reasons. They just need to know where all the stop signs they bought are; they need to know where all their assets are located. So from a government perspective, I think most of it is pretty boring. And GPS tracking – they don’t necessarily want you to know where all the cop cars are, but they’re okay with you knowing where all the fire trucks are. And they might not want you to know where the 911 calls come in, but I know Portland puts like half an hour of buffer on it, so you can’t know where the 911 calls are immediately, but you can find out where they were as soon as it’s died down a little bit.

It’s hard to find patterns.

Max: Yeah. You don’t want people showing up. So there are definitely cases where you can’t just say, “Open everything,” but I haven’t seen a city where the majority of the data should be private. Most of it is either boring or there are no concerns. And there’s definitely metadata analysis and stuff, and there are lots of questions about anonymizing data. If you do have data with sensitive information and you scrub it, people can do network analysis and figure things out – you know, like how people detect that there are terrorists in whatever, think tanks, all these kinds of crazy network graph analysis things you can do – but usually that’s just theoretical stuff that comes up. I’ve never really seen it be a major issue.

So GitHub is built on Git, and it lets you visualize Git.

Max: Yeah.

Explain what your service is and what it’s based upon and we’ll go from there.

Max: So, the project is called “DAT,” d-a-t. I named it that because three letters ending with “t” makes it seem like git – a very conscious decision – and it also copies a lot of Git’s design, but instead of being optimized for source code, it’s optimized for tables of data. So it’s more like a database – MySQL, PostgreSQL or something – but one that’s really fast at synchronization. Because Git is really good at git push and git pull: you can push to GitHub, you can pull from GitHub, and you can make a full copy of the entire history of the code archive. But there aren’t many databases that can make full copies of themselves efficiently.

I have a conspiracy theory about this. It’s that database companies never get paid by their clients to give their data away. Clients are never like, “Hey, we want to give our users all our data.” So there’s this bias: databases are usually really good at things like business analytics or statistics or stuff that’s powering applications, but it’s never really a first-class feature of…

The querying capability?

Max: Well, the querying is always really good, but replicating it is never a first-class feature. Unless you’re talking about scaling the database across data centers, but that’s a different kind of replication that doesn’t really apply here. The kind of replication I’m talking about is, like, “Give me a full copy of a data set – give me everything, not just a little window.” There are a lot of companies that treat data as an asset that they want to keep their claws on. So they don’t want to go out of their way to make it really easy to give away all their data, but that’s exactly what I want.

You want to clone it?

Max: Yeah, I want you to have all of it, and I want you to have the full history and everything I’ve ever edited in it. Or I want you to have options to say things like, “Okay, I might have the whole census, but you only want the 2000 Census, so you only get that, instead of getting 1980, 1990, 2000 and 2010. Or I could just give you 2010 in Oregon.” You should have the ability to get what you want out of it in as few steps as possible. There’s not really a standard way of synchronizing data like that, other than just emailing attachments around. And that’s not very specific, because attachments could be any format.

I started realizing that I needed a package manager, so that people can publish a data set and give it a name and a description and a version number. So, on behalf of the federal government, I could publish the 2010 census data as CSV, and it would have a name, and then anybody else could just say, “Install the census data,” and it would install it and get the full copy. And then it’s automated. The whole goal is automated, end-to-end workflows, so that people can use the same tools for installing the census as they use to install a genome, or the Oregon park boundaries or whatever. Because data is data, and if you’re just trying to install a data set, then we should have a standard way of installing a data set – just like there are standard ways to install packages if you’re a JavaScript programmer or a Ruby programmer: you use RubyGems, or you use npm, and on Python there are standard ways to install third-party code.
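As a rough illustration of the idea, here is a hypothetical JavaScript sketch of what a data-package registry might look like. The manifest fields and the `resolve` helper are invented for this example – they are not dat’s real format or API – but they show the name-plus-version lookup that any package manager, for code or for data, is built around.

```javascript
// Hypothetical data-package manifest, in the spirit of an npm package.json.
// All field names here are illustrative, not dat's actual format.
const manifest = {
  name: 'us-census',
  version: '2010.0.0',
  format: 'csv',
  files: ['census-2010.csv'],
};

// Resolve a request like "us-census@2010" against the published packages.
// With no version requested, take the newest one available.
function resolve(registry, name, version) {
  const candidates = registry.filter(p => p.name === name);
  if (version) return candidates.find(p => p.version.startsWith(version)) || null;
  return candidates.sort((a, b) => b.version.localeCompare(a.version))[0] || null;
}
```

Given that primitive, “install the census data” reduces to resolving a name to a manifest and then fetching the files it lists – the same two steps npm or RubyGems perform for code.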

These are issues that data journalists are facing right now.

Max: Oh yeah, definitely. Yeah, and scientists: reproducibility. Like, what version of the reference genome did you use to produce your analysis?

And it gets very complicated to keep the data updated.

Max: Yeah, yeah, exactly. Well, and to consume it too. Because if you have to write custom code to consume somebody else’s data set, then you’re probably going to have a disincentive to switch to another data set if it means redoing all that work. It would be nice if you could just take that code and publish it as a part of the data package, so that the first person that writes the code could just share it. This is exactly how package managers in programming work. It’s like, the whole point of it is that you figure out some generic functionality, that you think of and find useful, you give it a name and you publish it and other people can just use your solution. But people haven’t figured that out for data sets yet. So that’s basically the problem domain I’m working in.

So what is the service infrastructure that you’re using for this? You’re using CouchDB?

Max: That was at the beginning, and I couldn't make Couch do what I wanted. In order to make it do what I wanted, I would have had to become an Erlang programmer, and I just kind of hit technical barriers. It was too much like a magic box that I couldn't dive into and figure out how it was working. But what I'm doing now is all based on Node.js for the server, for the thing that actually talks to the network and talks to the database, does the different formats, and handles synchronizing many clients pushing and pulling.

Node is great because it's relatively fast, it's a nice tradeoff of fast and easy to write, but it also works the same on Windows, Mac and Linux. So people on Windows can use it, and there are a lot of people in the academic community on Windows computers; the cross-platform part is nice. But it also gives me built-in fundamentals for streaming data in and out, so processing larger data sets is a lot easier, because the community builds a lot of streaming modules that handle large data sets – data sets that don't fit in memory. So it supports really big data sets.

Is it a distributed infrastructure?

Max: Kind of. It depends. I'm also using this database from Google, the one that came out of BigTable.

BigQuery?

Max: Well, that's their SQL kind of thing on top of it, but their database was a distributed database called BigTable, where each node was a thing called a "tablet." Later they wrote a version of it for the Chrome browser – LevelDB is the name of that database. So LevelDB was written by the same person that wrote BigTable, but for Chrome, and it's based on the BigTable design. It's kind of like a little building-block database. It's the super-primitive component, written in C++, that just saves data to disk for you and handles the actual sorting of data and reading it in and out. It's nice because a lot of people are switching to it. Minecraft just switched to it this week, so they were saying that now anybody can read the Minecraft level format by using LevelDB to open it. And they use it in Chrome, obviously – they're switching from the SQLite database to LevelDB as their back-end…

When you think database – you're wearing a Riak T-shirt, or a sweatshirt – Riak actually uses LevelDB, but they have their own fork that they optimize for their use case. LevelDB is the thing at the very bottom of the stack, and Riak is the API and the server and the distributed layer. When you think of a database, it's really multiple things all stacked up into one product. CouchDB is similar: it's both a REST server and an API, they even have JavaScript for running queries on the database in there, and then they have their server written in Erlang, they have their admin dashboard…

So a database is usually like six or seven components that people call a database. But what I like about LevelDB is that it's very UNIX-y, a component that doesn't have a server built in. It's not very easy to use, but it's really low-level and easy to embed in a program, and you build the rest of the stuff on top. So I like LevelDB because it's really fast, basically, and it runs on all operating systems: Google needed it to work in Chrome, so anywhere Chrome runs, it will run. And it gave me the flexibility that Couch didn't, because it was so simple it was a lot easier to reconfigure to do what I wanted. What I do with it is basically track changes to data. So I track all the data in LevelDB. Say you import a .csv into DAT, then you edit a couple of columns in Excel and re-import the .csv into DAT. DAT would know the difference between the first import and the second import, and because it knows the difference, it can synchronize more efficiently.

That goes back to the issues with open data in government, and data in general: the issues you have when you add information, and replication…

Max: Yeah. It makes it a lot faster. You don't have to download the whole data set every time; you can just get the stuff that changes. So I've been focused on tables. Not all data is in tables, but the majority of government data is; anything that's in a database is usually a table. I just got this funding from the Alfred P. Sloan Foundation. They're a big philanthropic science funder; they fund a lot of academic science, but lately also open science, and they did like a 40-million-dollar investment in a couple of universities, like Berkeley and maybe Harvard, basically just to hire data scientists in colleges.

Because there are too many scientists that code, but they don't get paid to write code; they get paid to write papers. So there's no incentive for them to invest in better software, and they always write the worst software: one-off analysis scripts that they never publish, so it's hard to reproduce their results, and science is kind of getting held back by the lack of open source, basically. And it's not that they're not willing to make it open. It's that they won't get their grant if they don't move on to writing the next paper, because they're in this vicious grant cycle. So Sloan is trying to put some money into the system, so that a couple of colleges can afford to have scientists that focus on open source infrastructure and reproducibility, not just scientists that are trying to get tenure and don't have time to work on non-essential things.

It’s interesting. We’re working on a story on archaeologists who are becoming developers.

Max: Woah.

The archaeologists were resistant to being developers for a long time, but then a few of them started making discoveries that had never been made before – like a group of archaeologist-developers who used software tools to discover seven pyramids that had never been known about before. And so now it's just taking off.

Max: Oh, yeah, cool.

So that seems like a similar spread of issues, one that kind of converges with DAT.

Max: Yeah, yeah, yeah. So, I don't have an academic background, obviously – I dropped out of college after two months – but I was in astronomy club in school, I've always been a geeky kid, and I've always appreciated the scientific method and computer science. But I've never written a paper, never done PhD-level peer review or whatever. Now I get to work with these people because Sloan basically came to me and said, "Hey, we like what you're doing, we want to make it so that you're biased towards science." Whereas I was approached by a couple of start-ups about DAT because they wanted it for their API products. They were basically like, "Hey, if people could sync their data to us, that would keep us from having to go and scrape their websites all the time, and it would be cool if we had more real-time integration with third-party services." I can imagine getting either hired or corporate sponsorship to take DAT in different directions. Even within open data there's open science, open government, other weird niches, some businesses…

Our goal is to be an open media company.

Max: Right – yeah, yeah. There's lots of journalism stuff that has data implications. So I was kind of sitting there – coming from Portland, I've always been a mission-driven person – and I was like, "Well, I like that you want to give me money, but I extra-like that you want me to support science, which I have a soft spot in my heart for."

So now I've basically shifted the focus of the project. Instead of solving governments' data problems, which is what I had been doing for Code for America, I am now focusing on academic research and collaborative science: open science, reproducibility, and all these issues.

And what is cool is that they gave me funding to hire a team, so I hired a guy who's a bioinformatician doing his PhD in London, and another guy who graduated from Stanford last year and is really into bringing more peer-to-peer distribution to science. Some of the scientific data sets are really big, and BitTorrent is really great for big files, but it's not good when big files change a lot. With BitTorrent as it exists – if you install a BitTorrent client on your computer and you want to download Beyoncé's new album or whatever – well, that album is shipped and it's never going to change. But in science, data sets have versions, and new builds of data sets get published all the time.

So what are some of the challenges that this team will help to overcome?

Max: Oh, yeah. So there's technical and there's cultural. And the cultural ones are that we need to find the early adopters in academia. I was just talking to somebody about this last May, and they were saying, "Well, it's funny, because in government you might think the young ones are the ones with passion, but the longer somebody has been in a position, the more job security they have, and the more likely they are to be risk-averse. In science, though, when you're in your PhD, you'll try anything."

PhD students are really experimental. But then, if they're doing a postdoc, or they're trying to get tenure, or are on a tenure track, they're so conservative. And that's in their late 20's, early 30's, on average. You'd think that at the peak of their career, writing the most papers, they'd have the coolest ideas for moving science forward, but actually they don't have any time for any bullshit; they just want to get tenure. So for like ten years they keep their nose to the grindstone. But then it's the professors who are never going to get kicked out, because they can't get fired – Nobel laureates are especially crazy; once they're there, they basically can't be told what to do. So it's the really early-career and the really late-career people who are the most innovative in science.

So I'm trying to learn the culture, both at that high level and within specific fields. Bioinformatics is really interesting because there's a lot of genome research where people share data. Then there's also astrophysics, where there's a lot of data that somebody will collect and somebody else wants to download and analyze somewhere else in the world. So I'm trying to find pilot communities right now, basically.

We're doing a story on the Square Kilometre Array sites in Australia and Africa…

Max: Oh!

…and looking at the kinds of stacks that they're building. They're using thousands of radio telescopes, and these radio telescopes are going to collect all this data – there'll be more data than the Internet has, very quickly. And they're talking about clusters of ten million nodes (how they define nodes is a question mark), but that raises questions about how you're going to develop your own infrastructure, and what kind of people you need to do that.

Max: So, my philosophy, especially now that it's very science-focused – and my grant advisor at Sloan, and everybody I've talked to, has said this – is "Don't host data." I obviously can't monetize hosting most of the world's scientific research. There's probably a sub-section. The Large Hadron Collider collects 200 petabytes a day, and that's an entire Google data center; they're going to have their own in-house solution. They might be able to use my software for sharing subsets of it.

I've been trying to get a ball-park number for the average-sized data set I'm talking about, and I think it's terabytes. The low-terabytes range is the sweet spot for me, because for $10 a month on Google Drive you can now host a terabyte, and that's pretty cheap – like ten times cheaper than S3. Google just dropped prices a few months ago to $10 a terabyte, and that's pretty low, considering that buying a terabyte hard drive is still like $80 from Western Digital. So pricing is getting lower. Then it's just a matter of: how do you write the synchronization software so that it can effectively transmit that stuff around? Since we're not going to host data, we want to support different data back-ends, basically.

So one would be: if you're a researcher at a university and you're going to put your data set up on the university's FTP server, why not publish it to our package manager, which at the very least just means having a link to the FTP server? What would be even cooler is if we could give you options. Say, once you've published your package to the repository and linked it to your one official data source, we could do a deal with Amazon and get them to donate some number of petabytes a year to make backups of scientific data. So we could automatically download a copy of your data and put it on Amazon's Glacier cold-storage service – it won't cost Amazon very much, it makes them look good, good PR for them – and then we have a backup of the dataset in case the university goes down. Or even better, we're investing in peer-to-peer distribution strategies, BitTorrent-style stuff, so that if your university publishes it, you can run software that basically seeds the data set, and then we kind of turn it into a sort of BitTorrent tracker, so we can see that, oh, there are five seeds, and maybe four other universities or people. I really want to make like a study-at-home (?) screen-saver where you're hosting the world's research data…

There are issues, because obviously you don't want a situation where there are zero seeds, so we need to invest in basically data-center companies hosting data, and maybe it's our users paying for it. I actually think that if we charge what Amazon charges, and we make it easy to do, we would probably get a fair number of scientists publishing their data and paying for it themselves. Because if you factor in the cost plus the time, it's not worth it, but if you remove all the time that it would take them…

I'll tell you our challenge. We're going to do data journalism projects, one a month, starting small – like exploring the top Docker projects on GitHub. We look at the top ten Docker projects, we look at all the users, we try to understand who the biggest committers are and what relationships they have, and we're already running into issues. We're running into issues with GitHub because of the data that we're using, and we're not sure how that plays out. We're running into issues with the use of the graph database itself. All these kinds of things we're going to figure out. What we really need is more access to data, and places that almost become channels for it. To me, you'd be a great channel: I know what's in this, I know it's here, and I can sync to it, and then I may be able to sync that data to a graph database and show it in different kinds of ways.

Max: Do you want to see a cool demo that I just got working? One of my collaborators, the guy I just hired – have you heard of Popcorn Time? It got pretty big in the last month or two. It's basically an app that looks like Netflix, but it pirates the movies on demand. So it's the guy that wrote Popcorn Time – the UI was written by a different person, but he wrote the BitTorrent part that streams the videos from the torrent. He's in Copenhagen right now, and he just put this server up. This is a DAT server: he put all of npm into DAT – the Node package manager, which has almost 70,000 modules. He pulled a copy of it down at some point and made a backup in DAT, basically. And so there are like 66,000 things in here.

He just deployed it to his server, so he's basically hosting the data set, and it's a live data set. He's working on a thing that will hook up to actual npm, so as soon as anybody publishes a new module, the newest version will go in here too, and this number will keep increasing. And what's cool is… it goes and fetches his remote database and then streams all the data from his copy down to mine. But this is using our slow stream – we're working on a faster one. This is almost 100,000 documents, at about a thousand a second or something. But what's cool about this is that I can then subscribe mine to his, so my copy on my laptop will just live-update. As soon as he gets the whole thing working, whenever somebody publishes to npm, it will go to his copy, and then mine will update from his immediately. So it's like a chain – the data is flowing from npm to his copy to mine.

Why is yours important in that chain, then?

Max: Well, because mine is on my laptop, and then I can build my stuff locally. If I go offline, I have the full dataset, and I can run computations locally. The next step in that chain is: it goes from npm to his copy on the Internet to mine on my laptop – and then what if I wanted mine to go into Postgres? So I can have a Postgres table, or a graph database, or something.

And this data is on Popcorn?

Max: No. Well, he wrote Popcorn Time, but this is data he put in DAT. So now I have a full copy of it in here, and I can run my own local one on my laptop that has the same database, and now they match. I made a replica of his onto my laptop – I just cloned the data set down. And I could do things like export a .csv–

What is he using DAT for exactly?

Max: In this case he's using it to make a back-up of a package manager: all the packages on npm. Most of the packages are things like some module with a read-me and some code, so he put about 65,000-70,000 modules into DAT, and the cool thing about it is that it makes it really easy to clone down a copy. With GitHub, what you have to do is hit their API like a million times to get all the data out of it, but with this, you basically say, "Give me the full database."

You just do it once.

Max: Yeah. You just get the full thing and then you have it all. And then you can say tomorrow, "Give me everything since yesterday," and you'll just get the new stuff. This is how I think it should work: companies should give their data away, and then it empowers people…
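The "give me everything since yesterday" model is essentially a sequence-numbered change log. Here is a toy sketch of that general idea; it's an assumption about the design, not DAT's actual wire protocol, and all the names in it are made up for illustration.

```javascript
// Toy sketch of sequence-based incremental sync (not DAT's real protocol):
// every write gets a monotonically increasing sequence number, and a clone
// asks for all changes after the last sequence number it has seen.
var log = [];      // append-only change log on the "server"
var seq = 0;

function publish(doc) {
  log.push({ seq: ++seq, doc: doc });
}

// A client pulls only the entries newer than its local high-water mark.
function pullSince(lastSeq) {
  return log.filter(function (e) { return e.seq > lastSeq; });
}

publish({ name: 'left-pad' });
publish({ name: 'request' });

var mine = pullSince(0);                          // first clone: full history
publish({ name: 'express' });
var delta = pullSince(mine[mine.length - 1].seq); // later: just the new stuff
console.log(delta.length); // prints 1
```

This is also why the chain Max describes (npm to his server to his laptop) works: each downstream copy only ever needs the entries past its own high-water mark.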

Well, our whole goal is to be an open data set provider. So we could use this.

Max: Oh, totally, yeah. Yeah, any data set you wanted to publish, yeah. So what we’re working on now is basically the tools…

We want to keep our data sets updated; this allows for that synchronization.

Max: Yeah. What he did was basically write some code that makes an empty DAT database, serves it online and then starts filling it with data. It can subscribe to some upstream – in this case he's subscribing to npm – so every time npm changes, he puts it in DAT, and once it's in DAT, it's tracked and it's sync-able, so then I can get a clone of it.

Just in conclusion, where do you want to take it, where do you want to go from here with it?

Max: So my goal is that by the end of the year we have a package manager website. We have an initial version up – it's called Data Decks, and DataDecks.io is where it's at now – built by one of our team members. It's a very early version, but, for example, this is a data set on Data Decks, and this is a username. You can click on somebody, see their profile and their data sets, view all the data sets that they've published, and then see all the versions of a data set, which you can download or clone.

So this is like the web front-end, right? We want to have it so that you, as an organization or an individual, can create an account, upload all your data sets with DAT, and push them to the website. And then, like I said, we're not going to host data, but we're going to make it so that you could either host it on your own server, or click a button and put it on your Google Drive – if you exceed the limits of the free Google Drive, you could buy their terabyte plan for $10 a month – or integrate with Digital Ocean, which gives you SSDs so it's really fast to clone, and there it's $5 for 20 gigs.

So we want to integrate with the most modern cloud services. We don't host any data, so our costs are really low. And then if it's a data set that's for research – if you can prove that you're a scientist – we want to be able to do things like make free back-ups of your data on Amazon. So we're going to work on relationships with cloud host providers and try to get donations. Google actually just donated a petabyte to climate research like two weeks ago, so there's precedent. We want a petabyte for science so that we can give it out to researchers on behalf of Google: put the data on Google, but they wouldn't have to use the clunky Google interfaces to use it, because we can make really nice, easy-to-use things.

So what I'm hoping is that we can build inroads into the research community and get some scientists using it, probably in bioinformatics – I was just learning about genomes today. I want to get different popular genomes on here so that people can download them more easily for their bioinformatics research. So by the end of the year I hope to have an active community of scientific data sharing. That's my goal.

Great. Well, thank you, Max, for taking the time. It’s been a lot of fun to hear about this.

Max: Yeah, it’s cool to see what you’re planning too.

Great, thanks.

Special thanks to Luke Lefler who edited the interview with Max.

Feature image courtesy of The Knight Foundation.
