Back to Blog

Scaling Up and Out in a GPU World

Scaling Up and Out in a GPU World

A common question faced in the petabyte economy is when, and how, to embrace a distributed, scale out architecture. I will argue here that it makes sense to push for the simplest and cheapest solution that will solve the problem.

This seems like an obvious statement, but I’ve encountered a surprising number of companies that do otherwise, shifting to large clusters long before they are necessary.

Here’s why that’s not always a correct strategy.

First, some basic facts. The speed of light in a vacuum is 3 x 10^8 meters per second and I don’t see that changing. This (and thermodynamics) governs the basic architectural guidelines for the computing hierarchy. Signal velocity through a semi­conductor or wires are a bit slower, from 30-­70% of C, but you get the idea.

The best performance for a given problem comes from putting the necessary data together with deployable computational capacity in the smallest possible space, subject to constraints on power and heat dissipation. If you are adding a hundred numbers, it’s faster (and cheaper) to complete this computation on one machine than to combine the results of ten machines running in parallel. This concept applies at all scales, from how processing is done on a single chip, all the way up to clusters of thousands of machines.

Let’s examine first the two principal flavors of computation building blocks today in common use - CPUs and GPUs.

CPUs are designed for general purpose computing, with a relatively small number of powerful computing cores coupled with a moderate amount of quickly accessible memory. Intel’s Xeon v4 e7 Haswell processor is the flagship of this line, with up to 24 dual-threaded CPU cores, clocked at 2.2 Ghz and 60 megabytes of memory accessible within a few clock cycles. Much larger quantities of memory are available, up to 3 terabytes, but latencies of around 300 clock cycles.

GPUs, designed originally for graphics processing, employ a much higher number of less powerful computing cores, with a large quantity of quickly accessible memory. Nvidia’s p100 graphics processor package has 3584 CUDA cores, clocked at 1.3Ghz. Cache memory on this processor chip is a bit smaller than the above CPU -- 18 megabytes, but with another 16 Gigabytes of high-bandwidth memory less than a centimeter away. Much larger quantities of memory are available, but today that path goes through CPU and is accordingly slower and more complicated. Expect vendors of GPU servers to migrate eventually to direct memory access from GPU’s to RAM and SSD storage.

Both CPU and GPU architectures accommodate scale out to multiple processors within a single node, and to multiple nodes. While there are today differences in configurations and technologies available today from different vendors (QPI vs NVLink, NVMe vs SATA, HBM vs DDR), these factors will equalize over time.

Per my rule from above - if your problem fits in <100MB and isn’t amenable to parallelism, then a CPU is a perfect solution.

For example, if your application is an OLTP debit-credit system for a small number of users, running thousands of parallel threads gives no benefit. If, however, the problem is larger and possible to parallelize -- for example searching or aggregating a large data set, the massive parallelism available with GPU’s can run >100x faster. The same line of reasoning extends to multiple sockets, and to multiple nodes. If you can parallelize your code across the 28,000 cores in a single Nvidia DGX1, that will be significantly faster than a few dozen quad-socket CPU-­based servers for a fraction of the price.

Now let’s work through a practical example.

I’ll focus on analytics, because that’s where I see this mistake made most often. Perhaps the worst common mistake is to take a transactional data engine not built for analytics and deploy lots of nodes to scale it up for parallel analytic processing.

How many of us have seen MySQL deployed for analytic applications? Don’t get me wrong, I think very highly of MySQL. ­­ It’s a simple, sensible database to backstop a small web site, ­­but it was never designed for analytics. The volcano­-style iteration processing model ensures that while a processor might be kept busy, very little of that time is actually spent performing the requested calculations. The only way to scale this sort of product to larger data sets is to shard the data over multiple nodes, limiting the size of each node to commodity equipment to keep costs contained.

Each commodity box might seem individually inexpensive, but a few hundred nodes adds up, and any solution with more than a few dozen nodes will in practice need 2-3x redundancy, plus overhead for interconnect, plus a highly-paid team to operate all of this.

In practical operation, all of this equipment and staff are kept busy, but precious little of that investment goes to accomplishing the task at hand. If this were the best available technology, we’d put up with it, but happily the next generation came onto the market with a better solution.

The early part of the century saw the introduction of purpose-built analytic databases, designed for big data, parallelism and made available on commodity hardware. Impala, Redshift, Exasol, and Hana are excellent examples of current products that do an effective job of finding parallelism inherent in analytic queries, both coarse-grained parallelism by sharding data, and for some of these products with fine-grained pipelined parallelism within each thread. It’s not unreasonable to expect these products to outperform their OLTP-based counterparts by 10-100x. These products have enabled those several hundred MySQL nodes to be replaced by one or a small handful of analytic db nodes -- a huge improvement. But at the same time, the data volume has grown seventeen-fold, and so we see now clusters of dozens or hundreds of instances of an analytic database. Again, if these was the best available solution, we’d live with it or limit our expectations.

But what’s the best solution?

The first generation of purpose-built analytic databases was built for the CPU-based equipment available a decade ago when they were designed. That equipment has seen incremental performance gains, but even when combined with the first generation of analytic databases performance has failed to keep pace with the growth in data.

Luckily, there is one technology -- Graphics Processors -- with a power growth curve similar to data growth.

With GPU-based servers available now -- at every scale point from small server to supercomputer -- the best available analytic databases deliver 10-100x improvement over CPU-generation analytic databases. These products (MapD included) are maturing quickly, with significant opportunity for adding both functionality and performance, but from industry commentators this is quickly becoming seen as the dominant technology for analytics for the next decade.

The simplest, most-performant, most economical solution are these next generation GPU servers.

They are able to reduce a thousand nodes of a poor product, or a hundred nodes of a good product, to one or a handful of nodes of an outstanding GPU-oriented analytic database. And from there, move forward to large clusters of these GPU servers to solve problems for which there is today no solution at all.

These are exciting times from a compute and database perspective and we are delighted to be on the cutting edge of these critical developments.

If these are the types of problems you would like to work on, please don’t hesitate to look at our engineering openings.

Newer Post

The Deep Learning Enabled Cloud: Google's Play

Older Post

Follow the Money: Political Donations in America

Please fill out the following to receive our whitepaper via email.