Open Source Analytical Database for the GPU

MapD Core is the foundation of the MapD Extreme Analytics™ Platform. MapD Core is SQL-based, relational, columnar and specifically developed to harness the parallel processing power of graphics processing units (GPUs). MapD Core can query up to billions of rows in milliseconds, and is capable of unprecedented ingestion speeds, making it the ideal SQL engine for the era of big, high-velocity data.

Features

Advanced Memory Management

MapD Core optimizes the memory and compute layers to deliver unprecedented performance. MapD Core was designed to keep hot data in GPU memory for the fastest access possible. Other GPU database systems store the data in CPU memory and move it to the GPU only at query time, trading the gains they receive from GPU parallelism for transfer overhead over the PCIe bus.

MapD Core avoids this transfer inefficiency by caching the most recently touched data in High Bandwidth Memory on the GPU, which offers up to 10x the bandwidth of CPU DRAM and far lower latency. MapD Core is also designed to exploit efficient inter-GPU communication infrastructure such as NVIDIA NVLink when available.
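The caching idea described above can be pictured with a small least-recently-used (LRU) cache. This is only an illustrative sketch, not MapD Core's actual memory manager: the class name HotChunkCache, the load_chunk callback (standing in for a CPU-to-GPU transfer) and the byte-based capacity are all assumptions made for the example.

```python
from collections import OrderedDict


class HotChunkCache:
    """Toy LRU cache standing in for GPU memory (hypothetical, not MapD code)."""

    def __init__(self, capacity_bytes, load_chunk):
        self.capacity_bytes = capacity_bytes
        self.load_chunk = load_chunk   # called on a miss; models a PCIe transfer
        self.chunks = OrderedDict()    # chunk key -> bytes, least recent first
        self.used_bytes = 0
        self.transfers = 0             # number of simulated transfers so far

    def get(self, key):
        # Hit: mark the chunk as most recently touched and return it.
        if key in self.chunks:
            self.chunks.move_to_end(key)
            return self.chunks[key]

        # Miss: fetch the chunk from "CPU memory".
        data = self.load_chunk(key)
        self.transfers += 1
        if len(data) > self.capacity_bytes:
            return data                # too large to cache at all

        # Evict the least recently touched chunks until the new one fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self.chunks.popitem(last=False)
            self.used_bytes -= len(evicted)

        self.chunks[key] = data
        self.used_bytes += len(data)
        return data


if __name__ == "__main__":
    cache = HotChunkCache(8, lambda key: b"data")  # every chunk is 4 bytes
    for key in ["a", "b", "a", "c"]:
        cache.get(key)
    print(list(cache.chunks), cache.transfers)
```

In this sketch, repeated queries over the same chunks are served without another transfer, and only the chunk that has gone untouched the longest is dropped when space runs out.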

Native SQL

MapD Core natively supports standard SQL and returns query results hundreds of times faster than CPU-only analytical database platforms. Analysts and data scientists can still rely on their existing SQL knowledge, querying data using industry-standard SQL.

MapD can operate as a standalone SQL engine using the command line tool mapdql, or the SQL editor that is part of the MapD Immerse visual analytics interface. MapD query results can be output to MapD Immerse or to third-party software such as BIRT, Power BI, Qlik or Tableau, via a variety of connectors.

Rapid Query Compilation

A key component of MapD Core’s innovation advantage is the JIT (Just-In-Time) compilation framework built on LLVM (Low-level Virtual Machine). By pre-generating compiled code for the query, MapD avoids many of the memory bandwidth and cache-space inefficiencies of traditional virtual machine or transpiler approaches.

With LLVM, compilation is fast, generally under 30 milliseconds for entirely new SQL queries. Furthermore, the system can cache templated versions of compiled query plans for reuse. This is important in situations where users are leveraging MapD Immerse to cross-filter billions of rows over multiple correlated visualizations.
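One way to picture the reuse of templated plans is to strip the literal values out of a query, so that queries differing only in their filter values share one compiled plan. The sketch below is an assumption-laden illustration, not MapD's implementation: template_of, PlanCache and the compile_fn callback (a stand-in for LLVM code generation) are hypothetical names, and the regular expression handles only simple numeric and single-quoted literals.

```python
import re

# Matches single-quoted strings and standalone numbers (simplified on purpose).
_LITERAL = re.compile(r"'(?:[^']|'')*'|\b\d+(?:\.\d+)?\b")


def template_of(sql):
    """Return (template, literals): the SQL with literals replaced by '?'."""
    literals = _LITERAL.findall(sql)
    return _LITERAL.sub("?", sql), literals


class PlanCache:
    """Caches one 'compiled plan' per query template (hypothetical sketch)."""

    def __init__(self, compile_fn):
        self.compile_fn = compile_fn   # stand-in for the expensive JIT step
        self.plans = {}
        self.compilations = 0

    def plan_for(self, sql):
        template, literals = template_of(sql)
        if template not in self.plans:
            self.plans[template] = self.compile_fn(template)
            self.compilations += 1
        # The cached plan is reused; only the literal values change per query.
        return self.plans[template], literals


if __name__ == "__main__":
    cache = PlanCache(compile_fn=lambda template: "plan for: " + template)
    cache.plan_for("SELECT COUNT(*) FROM flights WHERE delay > 15")
    cache.plan_for("SELECT COUNT(*) FROM flights WHERE delay > 60")
    print(cache.compilations)
```

A cross-filtering dashboard tends to issue many queries of the same shape with different filter values, which is the case where a cache like this saves the most compilation work.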

Native Support of Standard Geo Data Types

MapD Core can store and query data using native Open Geospatial Consortium (OGC) types, including POINT, LINESTRING, POLYGON, and MULTIPOLYGON. With native geo type support, analysts can query geo data at scale using a growing number of special geospatial functions. This opens up a wide range of new use cases for geospatial analysts, who can use the full power of GPU parallel processing to quickly and interactively calculate distances between two points and intersections between objects. For example, analysts can find all points that fall within a building footprint, or search for intersections between geometries.
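To make the "points within a building footprint" case concrete, here is a minimal sketch of the test that such a query evaluates for every point. It uses the standard ray-casting method in plain Python; it is not MapD's GPU implementation, and the function name and the sample footprint are made up for illustration.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the polygon given by `ring`?

    `ring` is a list of (x, y) vertices; the last vertex is implicitly
    connected back to the first. Points exactly on an edge are not
    handled specially in this sketch.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # one more crossing to the right
    return inside


if __name__ == "__main__":
    footprint = [(0, 0), (4, 0), (4, 4), (0, 4)]   # hypothetical building
    points = [(2, 2), (5, 2), (1, 3)]
    print([p for p in points if point_in_polygon(p[0], p[1], footprint)])
```

Each point is tested independently of the others, which is why this kind of predicate maps well onto thousands of parallel GPU threads.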

Hybrid Execution

A key component of MapD’s performance advantage is the hybrid, or parallelized, execution of queries. Parallelized code allows a processor to compute multiple data items simultaneously. This is necessary to achieve optimal performance on GPUs, which contain thousands of execution units.

Optimizing hybrid execution also translates well to CPUs, which increasingly have “wide” execution units capable of processing multiple data items at once. MapD Core parallelizes computation across multiple GPUs and CPUs, and even improves query performance on CPU-only systems.

Distributed Architecture

The MapD scale-out configuration allows single queries to span more than one physical host when data is too large to fit on a single machine. Across nodes, MapD uses a shared-nothing architecture between GPUs. When a query is launched, each GPU processes a slice of data independently from other GPUs. Even when multiple GPUs reside within a single machine, the data is fanned out from the CPU to the GPUs and then gathered back together on the CPU.
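The fan-out and gather pattern described above can be sketched in a few lines. This is a simplified illustration that uses Python threads in place of GPUs; the function names and the choice of an average as the example aggregate are assumptions, not MapD's API.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(rows):
    """Work done by one worker on its own slice, with no shared state."""
    total = 0.0
    count = 0
    for value in rows:
        total += value
        count += 1
    return total, count


def fan_out_average(values, num_workers):
    """Fan slices out to workers, then gather and merge partial results."""
    # Fan out: each worker gets a disjoint slice of the data.
    slices = [values[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum_count, slices))

    # Gather: combine the per-worker partial aggregates on the coordinator.
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


if __name__ == "__main__":
    print(fan_out_average([1, 2, 3, 4, 5, 6], num_workers=4))
```

Because an average cannot be merged directly, each worker returns a sum and a count, and only the final division happens after the gather step.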

A distributed architecture also provides faster data load times. Import times speed up linearly with the number of nodes because loading can be done concurrently across multiple nodes. Reads from disk also benefit from similar acceleration in a scale-out configuration.

High Availability (HA)

The goal of MapD's High Availability (HA) is to meet an organization’s service level agreements for performance and uptime. If a MapD server becomes unavailable, the load balancer redirects traffic to a different available MapD server, preserving availability. High Availability configurations allow a set of MapD servers that are running together in a High Availability Group to be synchronized in a guaranteed way.

As HA group members receive updates, backend synchronization orchestrates and manages replication, then updates the MapD servers in the HA group using Kafka topics as a distributed resilient logging system. While multiple servers are active in an HA group, average response times also tend to improve, due to the efficient distribution of query load across the members. A load balancer distributes users across the available MapD servers, improving concurrency and throughput as more servers are added.
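The load balancer behavior described above (spreading users across servers and skipping any server that becomes unavailable) can be sketched as follows. The round-robin policy, the class name and the server names are assumptions for illustration; the source does not say which balancing policy is used.

```python
class RoundRobinBalancer:
    """Toy load balancer over an HA group (hypothetical, round-robin policy)."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0   # index of the next server to try

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, skipping unavailable ones."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers in the HA group")


if __name__ == "__main__":
    balancer = RoundRobinBalancer(["mapd-1", "mapd-2", "mapd-3"])
    print([balancer.pick() for _ in range(3)])
    balancer.mark_down("mapd-2")          # simulate a server failure
    print([balancer.pick() for _ in range(3)])
```

After the simulated failure, requests keep flowing to the remaining servers, which is the availability property the HA group is meant to preserve.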

Open Source Code

MapD open sourced MapD Core to put it on the fastest path to innovation by building a global community of users and developers. MapD Core is available on GitHub under the Apache 2.0 license, alongside other components like a Python interface (pymapd) and JavaScript infrastructure (mapd-connector, mapd-charting), making MapD the leader in open source analytics.

Apache Arrow for Ecosystem Integration

Apache Arrow is a project in the Apache big data ecosystem that will play a pivotal role in the evolution of in-memory computing. Arrow addresses an emerging problem in building data pipelines: the cost of data exchange between different tools, including analytics platforms such as MapD, and machine learning tools such as TensorFlow, PyTorch and H2O.ai. MapD is working to integrate Arrow deeply within our own product, both as an open source SQL engine and as an integrated part of end-to-end machine learning workflows.

Get the MapD Whitepaper

Learn more about the fastest open-source SQL engine and how you can use it to accelerate big data analytics.