Hadoop, Hadoop Hive, Ingestion, not-hadoop, Unicode

No Fluff Unicode Summary for Hadoop

Developers might not want to read all the background on Unicode included in this earlier blog entry. Here is a quick distillation of how Unicode and the UTF encodings are relevant to a Hadoop user—just the facts and the warnings.

Continue reading →

Hadoop, Hadoop Hive

Lipwig for Hive Is The Greatest!

OK, this is the coolest thing this Hive user has seen all day.

As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects of such variations as code changes, the use of analyze, turning the cost-based optimizer (CBO) on or off, and so on. It’s an essential tool for optimizing Hive.

The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.

Continue reading →

Uncategorized

Choosing a YARN Scheduler

The Apache documentation on the YARN schedulers is good, but it covers how to configure them, not how to choose one or the other. Here’s the background on why the schedulers are designed the way they are and how to choose the right one.

Continue reading →

Hadoop

What’s So Important About Compression?

HDFS storage is cheap: about 1% of the cost of storage on a data appliance such as Teradata. It takes some doing to use up all the disk space on even a small cluster of, say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 360 to 720 TB of storage space. If there’s no shortage of space, why bother wasting cycles on compression and decompression?

With Hadoop, that’s the wrong way to look at it because saving space is not the main reason to use compression in Hadoop clusters—minimizing disk and network I/O is usually more important. In a fully-used cluster, MapReduce and Tez, which do most of the work, tend to saturate disk-I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily consume all available network I/O.
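To make the I/O argument concrete, here is a minimal sketch in Python using only the standard library's gzip module. The log-like sample records are made up for illustration, and the ratio you get on real data will vary with the codec and the content. The point is only that fewer bytes on disk means fewer bytes to read, write, and ship across the network.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return the raw size divided by the gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Hypothetical, repetitive log-style records (not taken from a real cluster).
sample = "".join(
    f"2015-06-01 12:00:{i % 60:02d} INFO node{i % 30:02d} block replicated\n"
    for i in range(10_000)
).encode("utf-8")

raw_bytes = len(sample)
compressed_bytes = len(gzip.compress(sample))

# Every byte saved here is a byte that never touches the disk or the network.
print(f"raw: {raw_bytes} bytes")
print(f"gzip: {compressed_bytes} bytes")
print(f"ratio: {compression_ratio(sample):.1f}x")
```

The trade is CPU time for I/O: the job spends cycles compressing and decompressing, but moves a fraction of the bytes.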

Continue reading →

Hadoop, YARN

The YARN Revolution

YARN—the data operating system for Hadoop. Bored yet? They should call it YAWN, right?

Not really. YARN is turning out to be the biggest thing to hit big data since Hadoop itself, despite the fact that it runs down in the plumbing somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.

Continue reading →

not-hadoop

Not Hadoop: All about Unicode

Unicode is a subject that trips up even experienced programmers. It’s one of those places where computer science and engineering bump hard into human diversity.

Continue reading →

Hadoop hardware

Understanding Hadoop Hardware Requirements

I want my big-data applications to run as fast as possible. So why do the engineers who designed Hadoop specify “commodity hardware” for Hadoop clusters? Why go out of your way to tell people to run on mediocre machines?

Continue reading →

Hadoop Hive

Shifting to Hive Part II: Best Practices and Optimizations

This is part two of an extended article. See part one here.

A full listing of Hive best practices and optimizations would fill a book. All we’ll do here is skim over the topics that best indicate the spirit of Hive and how it is used most successfully. There’s plenty of detail available in the documentation and on the Web at large. Hopefully, these quick run-downs will provide enough background and keywords for a rewarding Google search.

Continue reading →

Hadoop Hive

Shifting to Hive Part I: Origins

SQL is the lingua franca of data big and small, but SQL is a language, not a platform. It serves as the conceptual framework for data tasks on many platforms, ranging from blog content management with MySQL, to high-frequency online transaction processing (OLTP) systems, to heavy-duty batch processing on Hadoop and other big-data platforms.

I hope this page will help people who are experienced with conventional RDBMSs and OLTP systems make the jump to working with big data using Apache Hive, the most important of the SQL big-data platforms.

Continue reading →

hadoopoopadoop

Big Data with Hortonworks Hadoop
