When you add nodes to a cluster, they start out empty. They can run tasks, but the data they need to work on isn’t co-located with them, so the work is inefficient. Therefore, you want to tell HDFS to rebalance.
After adding new racks to our 70-node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
It didn’t take long to discover the cause: the configuration parameter dfs.datanode.balance.bandwidthPerSec controls how much bandwidth each node is allowed to use for rebalancing, and it defaults to a conservative value of 10Mb/sec/node, which is 1.25MB/sec. If you have 70 nodes (the number we started with before adding new ones), that’s 87.5MB/second. One terabyte, i.e., a million MB, divided by 87.5MB/sec, equals 11,428 sec, or 3.17 hours per TB. The more nodes in the original cluster, the faster the rebalance goes.
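To make the arithmetic easy to check, here is the back-of-the-envelope calculation written out as a small Python sketch. Nothing in it is Hadoop code; the names and constants are just the figures from above.

```python
# Back-of-the-envelope: rebalancing throughput at the default setting.
DEFAULT_MBIT_PER_NODE = 10           # 10 Mb/sec/node, the default discussed above
NODES = 70                           # size of the original cluster

mb_per_sec_per_node = DEFAULT_MBIT_PER_NODE / 8        # 1.25 MB/sec
cluster_mb_per_sec = mb_per_sec_per_node * NODES       # 87.5 MB/sec

seconds_per_tb = 1_000_000 / cluster_mb_per_sec        # ~11,428 sec
hours_per_tb = seconds_per_tb / 3600                   # ~3.17 hours

print(f"{cluster_mb_per_sec:.1f} MB/sec in aggregate, "
      f"{hours_per_tb:.2f} hours per TB")
```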
With dfs.datanode.balance.bandwidthPerSec set to the default, it should take at least 3.17 hours/TB on an idle 70-node cluster, so the cluster was actually behaving as configured.
If the racks and nodes are of uniform type, the proportion of existing data that must be moved in this kind of rebalancing situation is equal to the number of new racks divided by the total number of racks, both new and old. In our case, this works out to 330TB to move. At 3.17 hours per TB, it would take more than 43 days to fully rebalance. Yikes!
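Continuing the sketch, the total time follows directly. The little helper below only restates the rule of thumb from the previous paragraph, and the example call uses made-up rack counts; the 330TB figure is the one from our cluster.

```python
def fraction_to_move(new_racks: int, total_racks: int) -> float:
    """Rough share of existing data that must move when whole racks of
    uniform nodes are added to a cluster."""
    return new_racks / total_racks

# Purely an example: two new racks added to ten old ones means 2/12 of the data moves.
print(fraction_to_move(2, 12))    # ~0.167

tb_to_move = 330                  # what the proportion worked out to for us
hours_per_tb = 3.17               # from the calculation above
days_to_rebalance = tb_to_move * hours_per_tb / 24
print(f"{days_to_rebalance:.1f} days to rebalance at the default setting")  # ~43.6
```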
With this in mind, one of the guys upped the value from 10Mb/sec to 500Mb/sec, which increased the transfer rate by well over 10X. That’s great, but why wasn’t the data moving 50X faster? What happened?
To understand this, consider that the 10GbE network between the racks carries at most 1.25 GB/sec. Divided by 70 nodes, that is about 17.9MB/sec each; in other words, it would take only 143Mb/sec per node for 70 nodes to saturate the network between the racks. Therefore, the biggest meaningful increase possible for our cluster is 14.3X the default, not 50X. The “more than 10X” we saw was about what we should have expected.
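Here is that ceiling worked out the same way; again, this is just arithmetic, not Hadoop code.

```python
# Ceiling imposed by the 10GbE network between the racks.
BACKBONE_GBIT = 10
NODES = 70

backbone_mb_per_sec = BACKBONE_GBIT * 1000 / 8      # 1250 MB/sec
per_node_mb_per_sec = backbone_mb_per_sec / NODES   # ~17.9 MB/sec
per_node_mbit = per_node_mb_per_sec * 8             # ~143 Mb/sec

max_useful_multiple = per_node_mbit / 10            # relative to the 10 Mb/sec default
print(f"{per_node_mbit:.0f} Mb/sec per node saturates the backbone, "
      f"only {max_useful_multiple:.1f}X the default")
```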
Would we really want to set the value to 143Mb/sec for daily use? Usually not. Rebalancing traffic does not defer to application traffic, so the cost in performance would be crippling for other network-intensive jobs. Running at 14.3X the default would saturate the network between the racks for the three or four days the rebalance would take!
You can easily work out how much bandwidth will go to rebalancing using these principles. After that, it’s a matter of taste how much you want to let rebalancing cut in on the cluster’s real work. The flip side is that as rebalancing progresses, the effective processing power goes up, because more of the data is colocated with the nodes that process it.
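If you want the principle as a reusable rule rather than a one-off calculation, an illustrative helper like the following (ours, not part of any Hadoop tool) captures it: multiply the per-node cap by the number of nodes and compare it with the bandwidth between the racks.

```python
def backbone_share(nodes: int, per_node_mbit: float, backbone_gbit: float = 10.0) -> float:
    """Fraction of the inter-rack network that rebalancing is allowed to consume,
    given the per-node cap in Mb/sec and the backbone speed in Gb/sec."""
    return min(1.0, nodes * per_node_mbit / (backbone_gbit * 1000))

print(backbone_share(70, 10))     # 0.07, the default uses about 7% of the backbone
print(backbone_share(70, 50))     # 0.35, about a third of the backbone
print(backbone_share(70, 143))    # 1.0, saturated
```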
The default value of dfs.datanode.balance.bandwidthPerSec is a modest rate, so that if you do nothing to change it, rebalancing will chip away at the problem without hogging the network. If you want to go faster, increase the value, but for best results, choose a value that won’t use too much of the available network bandwidth. How much bandwidth you can afford depends on how busy your cluster is and what the workload is like.
Some kinds of jobs use more network than others (jobs with a lot of output tend to be hard on the network). If your workload has a lot of sorting or ETL transformations, be conservative, and save the bump up to the maximum for nights and weekends when the cluster will be idle.
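As a practical aside on how you would make that kind of temporary bump: the steady-state cap is the dfs.datanode.balance.bandwidthPerSec property in hdfs-site.xml, given in bytes per second, and most Hadoop versions also let you change it on the running datanodes without a restart (check the documentation for your release, since defaults and exact behavior have varied):

```
# Temporarily raise the per-datanode balancer cap to roughly 143 Mb/sec
# (the value is given in bytes per second); run it again with the old
# value to drop back down when the weekend is over.
hdfs dfsadmin -setBalancerBandwidth 17875000
```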
Beware that different rebalancing situations can have very different network implications. In the case above, entire racks of new nodes were added to an existing cluster. But if you are adding nodes to existing racks across the cluster, or adding disks to existing nodes across the cluster, the situation is different, because most of the block movement will occur within the racks, using far less bandwidth on the main network. Therefore, you may be able to afford a higher rate, because the load on each of the top-of-rack (ToR) switches will be roughly 1/Rth of the load computed above, where R is the number of racks.
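To put rough numbers on that last point, here is the same kind of sketch with hypothetical rack counts (the layout is made up for illustration, not our actual cluster): with R racks, traffic that stays inside a rack only loads that rack’s own switch.

```python
# Hypothetical layout: R = 10 racks of 7 nodes each, balancer capped at 143 Mb/sec/node.
RACKS, NODES_PER_RACK, PER_NODE_MBIT = 10, 7, 143

# If the block movement stays inside the racks, each ToR switch carries only
# its own rack's share, roughly 1/R of the ~10 Gb/sec aggregate computed above,
# and the network between the racks is left largely alone.
per_tor_mbit = NODES_PER_RACK * PER_NODE_MBIT
print(f"~{per_tor_mbit} Mb/sec through each ToR switch")   # ~1001 Mb/sec
```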