MapD is a next-generation database and visualization platform that harnesses the parallel power of GPUs to explore multi-billion row datasets in milliseconds.
By combining a purpose-built GPU database with a GPU-powered visualization layer, MapD is able to deliver immersive, instantaneous analytics on data sets previously considered too large to explore interactively.
As both the volume and velocity of data increase, organizations need increasingly performant database and analytic solutions that can keep up. Roughly ten years ago, aided by a rapid drop in the price of CPU RAM, databases started to be built that could cache significant portions of the working dataset in memory. The upside was one to two orders of magnitude speed increases, roughly mirroring the relative difference in bandwidth of CPU RAM compared to disk.
Today, there is a concerted movement of technology pioneers toward GPU-based analytics, motivated by the numerous advantages GPUs have over CPUs for database and analytic workloads.
To start with, processing is often bottlenecked by available memory bandwidth, and a server full of GPUs can possess aggregate memory bandwidth of nearly 6 TB/sec, over 40X that of similarly configured CPU servers. For complex queries (or machine learning algorithms) that require greater computational throughput, GPUs have similar performance advantages. For example, a server with 8 of Nvidia’s new P100 cards has 84.8 Teraflops of Single Precision performance (42.4 Teraflops Double Precision), a 40X improvement over a dual-socket Xeon CPU server.
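The aggregate figures above are simply per-card peaks multiplied by the number of cards. A minimal sketch of that arithmetic follows. The per-card numbers are not given in the text and are assumed here to be those published for the SXM2 variant of the P100: 10.6 Teraflops single precision, 5.3 Teraflops double precision, and 732 GB/sec of memory bandwidth.

```python
# Per-card peak figures ASSUMED for an Nvidia P100 (SXM2 variant);
# they are not stated in the surrounding text.
P100_SP_TFLOPS = 10.6       # single precision
P100_DP_TFLOPS = 5.3        # double precision
P100_BANDWIDTH_GBS = 732    # memory bandwidth in GB/sec


def server_totals(num_gpus):
    """Return aggregate (SP Teraflops, DP Teraflops, TB/sec) for one server."""
    sp = round(num_gpus * P100_SP_TFLOPS, 1)
    dp = round(num_gpus * P100_DP_TFLOPS, 1)
    bandwidth_tbs = round(num_gpus * P100_BANDWIDTH_GBS / 1000, 3)
    return sp, dp, bandwidth_tbs


sp, dp, bandwidth = server_totals(8)
print(sp, dp, bandwidth)
```

With 8 cards this reproduces the 84.8 and 42.4 Teraflops figures, and gives about 5.9 TB/sec of aggregate bandwidth, which is the "nearly 6 TB/sec" quoted above.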
Between the increase in memory bandwidth and compute throughput, GPU analytics solutions like MapD promise up to two orders of magnitude better performance than even the fastest CPU solutions. MapD can scale to billions of rows while maintaining query response times in milliseconds, enabling multiple analysts to simultaneously query and visualize today’s massive datasets at interactive speeds.
And that’s not all.
In addition to executing SQL queries, MapD can leverage the GPUs for what they were originally designed for, i.e. rendering large amounts of data nearly instantaneously. When the results of a query get large (imagine billions of geolocated tweets or points on a scatterplot), MapD can visualize the query results by leveraging the native graphics pipeline of the GPUs.
Not only does this give the database a large amount of graphics horsepower, it also means that the results of queries executed on the GPUs do not even have to be copied back to CPU, much less the client, to visualize the results (copying large result sets to a remote client for visualization is a common bottleneck of conventional BI systems).
Given suitable tuning, GPUs can be significantly faster than CPUs at a wide range of computational tasks for both floating point and integer workloads. That said, GPUs perform best on algorithms that can be parallelized with minimal branching - that is, algorithms in which each core does the same operation in lock-step. For example, running a word processor would likely not be suitable for a GPU due to the highly serialized nature of the problem. Other problems, such as text grep over many documents, can be trickier on GPUs due to the varying length of the documents and the conditional statements needed to handle matches. However, given the right approach, even problems such as these can be parallelized effectively on GPUs to achieve large speedups over CPUs.
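To make "minimal branching" concrete, here is a small sketch in plain Python, standing in for GPU code. It is an illustration only, not MapD's implementation, and the function names are hypothetical. The first version takes a different path per element. The second turns the predicate into a 0/1 multiplier so that every element goes through the same sequence of operations, which is the shape that suits cores running in lock-step.

```python
def filtered_sum_branching(values, threshold):
    """Sum values above threshold; each element may take a different path."""
    total = 0
    for v in values:
        if v > threshold:
            total += v
    return total


def filtered_sum_predicated(values, threshold):
    """Same result, but every element performs identical operations."""
    total = 0
    for v in values:
        mask = int(v > threshold)  # 1 if the row qualifies, else 0
        total += v * mask          # same instruction sequence for every row
    return total


rows = [3, 8, 1, 9, 4]
print(filtered_sum_branching(rows, 3), filtered_sum_predicated(rows, 3))
```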
A key advantage of using MapD is that we let you interface with the database with the SQL you know (and our Immerse visualization frontend) while letting us focus on the behind-the-scenes optimizations necessary to ensure that a wide variety of workloads run lightning fast.
MapD can utilize as many GPUs as can fit in a server. Most of the major server vendors have 3U or 4U chassis that can fit up to 8 Nvidia K80s, which, since each K80 comprises two GPUs, actually means 16 GPUs per server.
For even further scale, MapD can use One Stop Systems’ High Density Compute Accelerator, which accommodates up to 16 K80s, or 32 GPUs, in a single chassis.
Some visualization types, such as bar or pie charts, typically represent a small set of grouped values, and hence can be easily rendered on CPU by even the least performant of visualization clients. However, other charts like scatter plots, point maps, network graphs, detailed choropleths or heat maps often depict millions of non-aggregated records. Here a GPU will be able to render such data much faster than any CPU-based system. Furthermore, rendering millions of records on the frontend requires transferring such data over a network, which can be a huge bottleneck with such large result sets. MapD, by both executing the SQL and rendering result sets on the GPU, not only avoids copying the data to the client but even avoids transferring it to the CPU, instead only sending a small (hundreds of kilobytes) PNG back to the client. This means that MapD can produce even the most complex visualizations at nearly 30fps.
The MapD Immerse visualization system will instruct the MapD backend to render result sets for certain chart types when the result set is big, while rendering less demanding charts in-browser using D3.
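The choice between backend and in-browser rendering can be pictured as a simple rule on chart type and result size. The sketch below is hypothetical: the function name, the set of chart types and the row cutoff are all assumptions made for illustration, since the text only says "certain chart types" and "big".

```python
# HYPOTHETICAL chart types that tend to depict millions of raw records.
BACKEND_CAPABLE_CHARTS = {"scatter plot", "point map", "heat map"}

# HYPOTHETICAL cutoff; the text only says "when the result set is big".
BACKEND_ROW_THRESHOLD = 100_000


def choose_renderer(chart_type, estimated_rows):
    """Return "backend" for large results of heavy chart types, else "browser"."""
    is_heavy_chart = chart_type in BACKEND_CAPABLE_CHARTS
    is_big_result = estimated_rows >= BACKEND_ROW_THRESHOLD
    if is_heavy_chart and is_big_result:
        return "backend"   # rendered on the GPUs, returned as an image
    return "browser"       # small enough to draw client-side


print(choose_renderer("point map", 50_000_000))
print(choose_renderer("bar chart", 12))
```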
MapD treats the combined memory (VRAM) of all the GPUs on a server as its lowest level cache, i.e. it tries to keep compressed versions of hot partitions of hot columns in GPU memory whenever possible. This approach can easily scale to working sets in the terabytes on a single node. For those cases when the working set cannot entirely fit in GPU memory, MapD caches a greater subset of the data in CPU memory, which although slower is typically significantly larger than available GPU memory. It will then stream data from CPU to GPU. Although this method is slower than if the needed data fits entirely in GPU RAM, we have still found MapD to be faster than other in-memory solutions in such situations due to the highly optimized nature of the database.
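The tiering described above can be sketched as a placement pass that gives the hottest column chunks the fastest memory first. This is a simplified illustration under stated assumptions, not MapD's actual buffer manager: the function name and the (name, size, access count) chunk format are hypothetical, and each chunk is assigned to exactly one tier, whereas in the description above CPU memory holds a larger subset that also covers what is on the GPU.

```python
def place_chunks(chunks, gpu_capacity, cpu_capacity):
    """Assign each chunk to "gpu", "cpu" or "disk", hottest chunks first.

    chunks: list of (name, size, access_count) tuples (hypothetical format).
    Capacities and sizes share one arbitrary unit, e.g. GB.
    """
    placement = {}
    gpu_free, cpu_free = gpu_capacity, cpu_capacity
    for name, size, _hits in sorted(chunks, key=lambda c: c[2], reverse=True):
        if size <= gpu_free:
            placement[name] = "gpu"
            gpu_free -= size
        elif size <= cpu_free:
            placement[name] = "cpu"   # streamed to the GPU at query time
            cpu_free -= size
        else:
            placement[name] = "disk"
    return placement


chunks = [("a", 4, 100), ("b", 4, 50), ("c", 4, 10), ("d", 4, 1)]
print(place_chunks(chunks, gpu_capacity=8, cpu_capacity=4))
```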
By virtue of being able to execute queries incredibly fast (typically tens of milliseconds over multi-billion record datasets), MapD can scale to many simultaneous users. If the users are just executing hand-written SQL queries, this could be hundreds of users per server. If the users are analysts working with MapD Immerse, which can issue several queries for every filter applied (one per chart), then you can expect tens of simultaneous users without experiencing any performance degradation.
MapD can support very fast inserts for two reasons. First, it does not have to index the data as it comes in (typically a costly operation in other databases), since it relies on the brute force parallelism of multiple GPUs to ensure queries run in milliseconds. Second, since it primarily uses GPUs for query execution, there is ample leftover capacity on a server’s CPUs to perform the parsing and other tasks associated with data import.
When the system receives a read query (i.e. a SELECT statement), it first executes on the data already in GPU RAM. While that part executes, it can asynchronously transfer to the GPU any new data inserted since the last SELECT query, and it executes on that new data last.
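The overlap of execution and transfer can be sketched with a background task. This is an illustration only: `transfer_to_gpu` is a hypothetical stand-in for a host-to-device copy (here it just copies a list), and the "query" is a simple filtered count rather than real SQL execution.

```python
from concurrent.futures import ThreadPoolExecutor


def transfer_to_gpu(new_rows):
    """Hypothetical stand-in for a host-to-device copy."""
    return list(new_rows)


def run_count_query(resident_rows, pending_rows, predicate):
    """Count matching rows, overlapping the copy of new rows with execution."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start moving freshly inserted rows in the background.
        transfer = pool.submit(transfer_to_gpu, pending_rows)
        # Meanwhile, execute on the data that is already resident.
        count = sum(1 for row in resident_rows if predicate(row))
        # Finally, execute on the newly arrived rows and combine the results.
        arrived = transfer.result()
        count += sum(1 for row in arrived if predicate(row))
    return count, resident_rows + arrived


count, resident = run_count_query([1, 5, 9], [7, 2], lambda row: row > 4)
print(count, resident)
```

The second return value is the new resident set, so the next query would start with the inserted rows already in place.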
The result is incredibly fast performance while keeping queries up to date with even the fastest insert streams.
Once a database is faster than an analyst can type queries in a SQL console, further speed increases have diminishing usefulness (except perhaps when attempting to support many simultaneous users). However, for systems that generate queries visually, i.e. visual analytics/BI platforms, analysts can easily generate multiple queries per second. Having a fast backend in this case is imperative to maintaining speed-of-thought performance, particularly when confronted with multiple users.
Sadly, most existing BI platforms are built with the expectation of having a slow backend and thus throttle themselves in various ways. However, since we know our Immerse frontend could tap into the immense power of the MapD database, we were free to fully embrace the crossfilter paradigm where any filter applied with a click or drag on any of the charts would fire new queries to update the other charts, often many times a second. MapD can of course power other BI frontends like Tableau - but we find that our Immerse frontend is more interactive than other tools simply due to being single-mindedly built for such a purpose.
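As a rough picture of the crossfilter paradigm, the sketch below builds one query per chart, applying to each chart the filters selected on every other chart. It is a hypothetical illustration, not Immerse's actual query generator: the function name, the `tweets` table and the column names are made up, and the SQL is assembled by plain string formatting purely for readability.

```python
def crossfilter_queries(table, charts, filters):
    """Build one GROUP BY query per chart.

    charts:  {chart_name: group_by_column}
    filters: {chart_name: SQL predicate selected on that chart}
    """
    queries = {}
    for name, column in charts.items():
        # A chart is filtered by every selection except its own.
        predicates = [p for other, p in filters.items() if other != name]
        where = " WHERE " + " AND ".join(predicates) if predicates else ""
        queries[name] = (
            f"SELECT {column}, COUNT(*) AS n FROM {table}{where} GROUP BY {column}"
        )
    return queries


charts = {"by_lang": "lang", "by_hour": "hour"}
filters = {"by_lang": "lang = 'en'"}
for sql in crossfilter_queries("tweets", charts, filters).values():
    print(sql)
```

Every click or drag changes `filters`, so a dashboard with several charts issues several such queries per interaction, which is why backend latency matters so much here.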
Perhaps more important still is the ability of the MapD backend to work hand-in-glove with our Immerse frontend to render visualizations directly on the GPUs when needed. Traditional “client-only” BI solutions tend to fall over when asked to render complex visualizations of lots of data - even if they can manage to render it on the frontend, they still have to transfer all of the necessary data over the network, a huge bottleneck.
Of course there are other uses for a super-fast backend, like performing on-line machine learning, clustering and anomaly detection. Stay tuned…