Hadoop and Ambari usually run over Linux, but please don’t fall into thinking of your cluster as a collection of Linux boxes; for stability and efficiency, you need to treat it like an appliance dedicated to Hadoop. Here’s why.
YARN—the data operating system for Hadoop. Bored yet? They should call it YAWN, right?
Not really—YARN is turning out be the biggest thing to hit big-data since Hadoop itself, despite the fact that it runs down in the plumbing of somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.
I want my big-data applications to run as fast as possible. So why do the engineers who designed Hadoop specify “commodity hardware” for Hadoop clusters? Why go out of your way to tell people to run on mediocre machines?
This is part two of an extended article. See part one here.
A full listing of Hive best practices and optimization would fill a book. All we’ll do here is skim over the topics that best indicate the spirit of Hive, and how it is used most successfully. There’s plenty of detail available in the documentation and on the Web at large. Hopefully, these quick run-downs will provide enough background and keywords for a rewarding Google search.