SQL is the lingua-franca of data big and small, but SQL is a language, not a platform—it serves as the conceptual framework for data tasks on many platforms, ranging from blog content management with MySQL, to high-frequency online transaction processing (OLTP) systems, to heavy-duty batch processing on Hadoop and other big-data platforms.
If you’re looking for help with Hive programming, this might be the wrong page for you, but you may find some useful pointers to documents at the end of this article. This article is more about orientation. The first section is about where relational databases and Hive come from historically. The second section is a catalog of Hive best-practices and optimizations. If you’re impatient, feel free to skip section one and jump right into the catalog.The last section is a selection of links to useful material from various sources.
SQL interfaces exist for numerous databases and models of computation, but online transaction processing (OLTP) systems probably account for the largest number of users. MySQL, SQL Server, Postgres, and Oracle are prime examples of OLTP systems, but there are plenty of others, including both in-memory systems and hybrids. These systems excel at allowing many users to execute transactional queries concurrently even as data is added, modified, and deleted from the database. Their strength is quickly extracting narrow, but often multi-dimensional slices of complex datasets, as opposed to bulk analytics across a wide range of a data set. A billion rows is a very big table in the OLTP world, and they tend to get difficult to manage and expensive when they exceed that range.
Online Analytic Processing (OLAP) systems, which includes data warehouses, and data appliances, e.g., Teradata, occupy the next echelon in capacity. These systems are not designed for agile, high-frequency transactional loads, but for big analytic computations on highly structured data. They typically support fewer, but larger queries, and handle aggregations across a wide range of large datasets much more efficiently than an OLTP system. While they do complex analytics efficiently, they tend to be very expensive. Several billion rows is a large table for an OLAP or data warehouse appliance.
MPP (which stands for massively parallel processing) systems are an important subcategory OLAP systems. MPP’s have as many as a couple of hundred processors and may either appliances or software that installs on general purpose machines. Neteeza and Greenplum are examples, with the former being an appliance and the latter pure software. These systems tend to be quite mature and polished and offer a lot of support for the enterprise in terms of security, built-in connection software, etc. MPP’s tend to be closer to a “share nothing” architecture than non-MPP OLAP systems, but their designs retain deep roots in the relational database world.
Hadoop-based systems, of which Hive is currently the most important, are far less nimble with transactional updates than OLTP or even OLAP systems, but in the Hive world, a few billion rows is a test data set—Hive can handle tables with trillions, or even tens of trillions of rows, and scales more or less linearly almost indefinitely in terms of cost per GB of storage. Scaling in terms of computation time has a more complex story, but suffice it to say that computation scales approximately linearly for most query types over a range that dwarfs the capacity of OLAP systems, and for many query types, it scales linearly almost indefinitely. A big MPP system fills a rack, or sometimes more than one rack, but the largest Hadoop systems fill hundreds of racks.
Hive’s roots are not in the relational world, but in the MapReduce batch processing world, and for this reason it tends to excel also at tasks involving the transformation of data at large scale, as opposed to aggregation of information. Hive’s underlying computational mechanisms are shared with a number of non-relational processing idioms, for instance, Pig, that are geared toward the second kind of processing.
The familiar RDBMS was a radical idea when the first commercial implementations came out in the late 1970’s. Databases were not a new idea at the time, but until the relational model came along, querying a database usually required extensive custom programming, both for the query itself, and for the bespoke structuring of the data necessary to support the query. The RDBMS was revolutionary primarily because it stored data in such a way that queries could be constructed on the fly, with an unlimited range of queries possible against the same underlying data set.
Hadoop MapReduce came along a generation later, in a data-processing world turned upside down by the unprecedented explosion of data coming from the Internet. By the early 2000’s, Internet applications were routinely generating datasets that dwarfed the capacities of relational systems, and MapReduce was developed as a simplified, general purpose framework for doing batch processing on these huge data sets. In other words, Hadoop MapReduce did for distributed batch processing what the RDBMS had done for querying: it replaced complex, low-level coding and customized data structures for storage with a relatively simple and abstract programming model that could work for a wide range of processing demands.
The MapReduce paradigm is powerful, but coding to it demands considerable specialized Hadoop knowledge as well as an understanding of Java programming—skills that are not common among data analysts. Hive SQL was born of the observation that much of the arcane MapReduce code that Java programmers were writing was boiler-plate that could easily be machine-generated from higher-level, declarative SQL code. Thus, Hive was originally developed not so much to compete with data warehouses as to put a pretty face on the MapReduce batch programming model, and in the early days it retained that sensibility, operating with batch-processing latencies, and lacking transactions and many other RDBMS features.
Things have gotten a little confusing since the early days of Hadoop, because OLAP systems have continued to grow increasingly capable of handling larger data sets, even as Hive’s version of SQL has subsumed more and more of the full ANSI SQL, becoming increasingly nimble and relational database-like in terms of latency. (Hive query latencies are now often in the single digit second and even sub-second range.)
Yet, while there is significant overlap in the capabilities of the three classes of systems, at heavier performance demands, both Hive and OLTP systems remain unchallenged in in their core domains, and for reasons we’ll touch on below, this situation is unlikely to change in the near future. The future of OLAP systems is less clear. Hive is squeezing them hard from the top, not only because of its nearly unlimited capacity, but because it is increasingly encroaching on the performance envelope of specialized appliances at a much lower price.
The Classic RDBMS Model
Let’s do a quick review of what Hadoop is not. RDBMS’s are designed primarily to answer questions about limited, but often many-dimensional slices of a database quickly, while allowing numerous users to read and update the database simultaneously without tripping over each other.
The essence of the classic relational database model is that data is stored on disk pre-structured into tables consisting of rows and columns. Subsets, ideally small subsets, of the rows from one or more tables are pulled into a central processor where the desired result is computed, typically by “joining” multiple tables. Centralized processing* is critical, because it makes it possible to implement low-latency transactional behavior, allowing many users to interact with a complex dataset in a disciplined, logically consistent way.
Because tables can be very large, and all data passes through the central processor bottleneck, table scans are anathema in the RDBMS world. To avoid them, RDBMS’s put tremendous emphasis on indexing so that the number of data blocks that have to be located and pulled from disk is held to a minimum. In most cases, this approach reduces the quantity of data that must be read to a minute fraction of the total in the tables that it accesses. Queries that scan entire tables usually indicate a design pathology.
Data normalization in an RDBMS has several functions. It both reduces the quantity of data that needs to flow through the centralized processor, and makes it easier to maintain consistency of the data. Within a normalized data set, each table only contains one narrow vertical slice of information and ideally, all of the data in a given row is unique in the database. If a column in table X refers to values that are not inherently part of table X’s vertical slice of the data, it does so through a foreign key pointing to table Y that holds data of that other kind (or to a linking table that connects X to Y indirectly.) This tends to confine the impact of an update to a small number of rows, and makes it much easier to maintain consistency, as well as reducing the amount of data that needs to be locked for transactions.
The Hive and Hadoop Model
Because Hive was originally conceived as way to simplify writing MapReduce batch jobs, let’s do a quick review of how Hadoop MapReduce works (although in practice, Hive now usually uses MapReduce’s more modern cousin, Tez.)
The Hadoop Distributed File System (HDFS), which supports Hadoop (including Hive,) isn’t actually a true file system in the sense that Windows NTFS or Linux EXT3 or EXT4 are. Instead HDFS is a layer of abstraction across any number of physical file systems running on any number of machines. It provides a simple API to access datasets of almost any size distributed over as many as thousands of servers. HDFS deals in large blocks, typically 64MB to 256MB, represented as separate files on the host file systems, rather than the 4K blocks that typically underlie files in lower level file systems. It supplies built-in redundancy at the block level to support both high-availability of the data as well as flexibility to Hadoop applications in choosing which node will process a given data block.
Keeping distributed data in sync is notoriously difficult if arbitrary updates are allowed (a key RDBMS strength.) Therefore, HDFS finesses the problem of consistent updates by forbidding updates entirely, allowing only appends. In other words, data is never updated after being written. (We’ll see how Hadoop gets around this later.)
MapReduce is very much geared toward streaming reads. Each of the many mappers in a MapReduce job consumes whole HDFS blocks, and there is very little indexing or other random access involved—it’s all about streaming in blocks of data. It is designed this way because streaming from disk is extremely efficient. “Seeks,” i.e., moving the reader head across the disk, are the slow part of disk access, taking several milliseconds each. This means that a disk drive can only execute at most hundreds of seeks per second, and for commodity SATA disks, usually substantially less than one hundred. However, once the disk head is positioned, data can stream from the spinning disk to the CPU at a sustained rate of from tens of MB/second to more than a hundred MB/second.
Think of MapReduce as being almost the opposite of RDBMS processing. Whereas SQL is built around precisely targeted lookups of carefully structured and indexed data, the classic use-case for MapReduce is a batch job that slams end-to-end though a giant dataset that is too big to be processed, or even hosted, by a single machine. To accomplish this, MapReduce uses anywhere from a few to several thousand machines, each of which has many disks that can be read in parallel via HDFS. Consider a typical modern server with dual 4-core processors, and 24 local disks. Even a small Hadoop cluster with 100 such machines dwarfs the raw processing power of the largest RDBMS, having a total of 3200 hardware cores, and reading 2400 disk drives running in parallel. Such a cluster might typically execute SQL on several gigabytes of data per second with a commensurate amount of output. A big cluster can be up to 50x that size, and chew through many hundreds of GB/second. Moreover, MapReduce is not limited to row/column oriented data; it can process JSON, XML, binary files, images, machine data, unstructured text, and many other formats.
Hive translates a SQL query into a series of MapReduce jobs, which lets a query run on almost arbitrarily large datasets. While modern Hive has a number of clever ways to minimize the quantity of data it reads, in spirit it continues to favor the table scan, and indexes play such a minor part that many Hive users are not even aware of where or how they are employed. (I’ll tell you later!)
With its MapReduce roots, Hive greatly reduces the need for normalization and working out elaborate data schemas in advance. In fact, the physical representation of the data on disk doesn’t even have to be tabular, because, at MapReduce processing scale, there are so many processors at work that it is possible to simply impose an appropriate logical table structure on a raw dataset on the fly as it is read in from the HDFS file system. (The technique is known as “schema on read.”)
In the old days, this batch/table-scan orientation meant that for queries against small data sets, or for queries that only dealt with a narrow slice of the data, the original Hive-over-MapReduce could be frustratingly slow compared to conventional RDBMS systems. This was in part because smaller queries tended to be dominated by the setup time, and in part because queries had to read the entire dataset even if they only cared about a very restricted subset of the data. However, the larger the dataset and the more table-scan like the query, the more efficient even the early Hive was (and remains) compared to conventional RDBMS’s.
Today (2015) much of this has changed. Hive has found ways minimize startup time, avoid reading irrelevant data, and optimize processing, with the result that the latency of many queries has been reduced to values formerly associated with OLTP systems, often beating out even highly optimized OLAP systems. And of course, Hive can still be applied to datasets far beyond the size range of any OLAP system.
Then, as now, Hive really came into its own when storing and processing on a large scale. A billion rows may seem like a lot in the OLTP world, but Hive is undaunted by datasets with multiple trillions of rows—petabytes of data—a scale a thousand or more times bigger than OLAP systems are meant for.
Storage Formats: One of the most critical things to remember about how Hive differs from an RDBMS is that while RDBMS tables correspond directly to the row and column structure of the data on disk, Hive “tables” are purely metadata constructs. In fact, any number of different Hive tables can overlay a single HDFS data file at the same time, and none of them need reflect the physical structure of columns in the file. Indeed, the underlying data on disk may not even be stored as rows and columns. The physical storage may be almost anything: delimited columns, JSON, XML, specialized Hadoop formats such as ORC or Sequence Files, or custom semi-structured format. Hive can make any of these formats look like plain-vanilla SQL tables. We’ll see later that this is a key part of Hive’s usefulness because it lets you store data in bulk without committing to an elaborate ELT process in advance of understanding how you’ll use it.
Money: Henry Ford once said “An engineer is someone who can do for a nickel what any damn fool can do for two bits.” Engineering is always about money, and the economics of Hadoop open up a world of possibilities that are out of range with conventional technologies because of cost. Most obviously, Hadoop storage can cost as little as 1% of the cost of storage in a high-end appliance, which means that datasets that would be absurdly expensive to store on an appliance can be insignificant on a Hadoop cluster. Less obviously, schema-on-read means those large data sets don’t need a heavy-weight ETL process in order to be useful. Organizations using Hadoop often have a policy of saving virtually all data by default in raw format, doing ETL (or in Hadoop jargon, ELT) only on special cases.
The Stinger Initiative, Tez, YARN and ORC
Hadoop 2.0, released in late 2013, changed Hadoop and Hive profoundly by inserting a new component—YARN—between HDFS and MapReduce. Prior to 2.0, MapReduce was a central, privileged component of Hadoop, which was specifically built to support it. In the pre-YARN Hadoop, the algorithm for allocating resources was primitive, and might be summarized as, “Give the first job all it wants, and offer whatever is left over to the next job until all resources are spoken for.” In other words, MapReduce, and with it, Hive, was almost a batch processing environment, where jobs waited for resources on a single queue.
YARN can be thought of as a multi-tasking operating system for Hadoop. Under YARN, the special purpose machinery for allocating containers to MapReduce was replaced by a general mechanism for allocating containers that could be used by any application, and MapReduce (now MapReduce2) became just one of any number of applications that run in YARN containers. The new YARN containers are capable of running arbitrary code, whether it be Java MapReduce tasks, Tez task, other Java, or even applications written in other languages. YARN also provided a much improved algorithm for resource allocation that allows any number of processing queues to be defined, each with a guaranteed proportion of cluster capacity. These capabilities alone make the system far more responsive, but YARN also facilitated numerous performance optimizations.
Concurrently with the development of YARN, a number of cooperative initiatives, including efforts known as “The Stinger Initiative” and later, “Stinger.next,” set their sights on improving Hive performance by 100X. These efforts, which continue today, have been wildly successful, and include:
- Enhancements related to the processing graph:
- Tez running over YARN has largely replaced MapReduce with a more flexible processing model.
- Pre-warmed containers, and container re-use.
- Better resource allocation and job scheduling
- YARN’s CapacityScheduler allows flexible resource allocation strategies that can ensure that interactive users aren’t stuck behind massive batch jobs that hog the cluster for extended periods.
- Internal improvements to query planning and execution:
- Cost-Based Optimization (CBO) of the execution plan.
- Vectorized Processing
- Predicate Push-down
- Advances in how queries are broken down into jobs—e.g., broadcast joins.
- Storage improvements, often working together with specific processing enhancements:
- Optimized Row Column (ORC) file format, which provides:
- Separately accessible columns with custom compression by type
- Internal indexing for fast location of data
- Pre-computed metadata about the data in the files
- Uses of partitioning and bucketing
- Optimized Row Column (ORC) file format, which provides: