Hadoop and Ambari usually run over Linux, but please don’t fall into thinking of your cluster as a collection of Linux boxes; for stability and efficiency, you need to treat it like an appliance dedicated to Hadoop. Here’s why.
When you add nodes to a cluster, they start out empty. They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.
After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
I want my big-data applications to run as fast as possible. So why do the engineers who designed Hadoop specify “commodity hardware” for Hadoop clusters? Why go out of your way to tell people to run on mediocre machines?