Rejoining forces: a new (old) partnership with the Harvard Center for Geographic Analysis

Today, with great pride, MapD is announcing a partnership with the Harvard Center for Geographic Analysis (CGA). We will be working with CGA to build new geospatial functionality into the MapD Core database, to improve hydrological modeling and visualization and to allow for more accurate flood prediction and water-supply estimation. By combining CGA’s geospatial expertise with our team’s knowledge of high-performance analytics, we hope to build MapD into an even better platform for analyzing data at scale. More details of the collaboration can be found in the press release; in this post, I want to focus on the story behind the partnership.

In many ways, this announcement means coming full circle for the MapD project. More than five years ago, in the spring of 2012, while in my final semester at Harvard, I started work on the first iteration of MapD (then an acronym for Massively Parallel Database). Needing to analyze and visualize hundreds of millions of tweets for my thesis on the Arab Spring, and encouraged by MIT professor (and now board member) Sam Madden, whose database course I was cross-enrolled in, I envisioned a system that would enable analysts and data scientists to query and visualize massive datasets instantly.

Much of the data I was working with had a spatiotemporal component, which led me to seek out the advice and guidance of Harvard CGA. There I quickly struck up a friendship with Ben Lewis, who was spearheading WorldMap, a project that enables researchers worldwide to share and collaborate on open geospatial datasets.

Ben and I believed that GPUs, by virtue of their massive computational throughput and their native visualization capabilities, could give geospatial analysts the ability to explore massive datasets interactively and in real time. Then, as now, analysts had access to a myriad of GIS tools that, while functionally powerful, could not scale to the data volumes required by Internet of Things, telematics, wireless handset, and social media use cases.

Harvard CGA was the first to buy a dedicated server to run the technology, a Supermicro system with four NVIDIA GTX 680 cards. The cards had a relatively paltry 4GB of VRAM each (compared to cards with up to 32GB of VRAM today), but it was enough for us to power the first MapD Tweetmap, enabling interactive analysis of hundreds of millions of tweets, and to serve up all digital media for the Japan Disaster Archive, an online museum memorializing the Fukushima tragedy.

Even after I started as a research fellow at MIT CSAIL, I continued my collaboration with Harvard CGA. Ben and I would dream together of a future era of GPU ubiquity, one where analysts and data scientists could explore unfathomably large datasets across tens or hundreds of GPU servers. Five years later, that dream is fast becoming reality.

I want to end by noting that since the beginning of our work together, Ben has persistently advocated for open-sourcing the MapD database. It is thus fitting to announce this partnership five years later at the FOSS4G conference in Boston alongside him and the entire CGA team, with MapD finally open source. Open source, as Ben would note, enables collaborations that are simply not possible with proprietary software, and he was completely right. I look forward to a fruitful future of such collaborations with the academic community and beyond, with Harvard CGA once again paving the way.