Hadoop and Ambari usually run over Linux, but please don’t fall into thinking of your cluster as a collection of Linux boxes; for stability and efficiency, you need to treat it like an appliance dedicated to Hadoop. Here’s why.
Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other. The LD of cat and hat is one. The LD of hats and cat is two.
LD is precise and has an intuitive meaning but in practice it is used mostly for short strings because the run-time of the algorithm to compute LD is quadratic, i.e, proportional to the product of the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second. Comparing two books with LD is serious computing–many of minutes if you have a computer with sufficient memory, which you almost certainly do not.
That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.
This heuristic described here is a way around that problem. It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD. The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size-range for estimated LD is in the megabyte range (and much larger for binary data.)
Equally importantly, the LD estimates are made from pre-computed signatures, not from the the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.
The signatures can also provide insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication, for instance, is one document embedded in the other, or are two documents different versions with many small difference sprinkled throughout?
When you add nodes to a cluster, they start out empty. They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.
After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
Z-Filters is a technique for listening to what millions of people are talking about. Twitter is the data source, but the idea isn’t Twitter specific. The same problems are inherent no matter how you eavesdrop on a population.
If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.
What is Erasure Code?
Hadoop 2.7 isn’t out yet, but it’s scheduled to include something called “erasure code.” What the heck is that, you ask? Here’s a quick preview.
The short answer is that erasure code is another name for Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.
The longer answer follows.
Ok, this is the coolest thing this Hive user has seen all day.
As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects such variations as code changes, the use of analyze, turning on/off the cost-based optimizer (CBO), and so on. It’s an essential tool for optimizing Hive.
The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.
The Apache documentation on the YARN schedulers is good, but it covers how to configure them, not how to choose one or the other. Here’s the background on why the schedulers are designed the way they are and how to choose the right one.
HDFS storage is cheap—about 1% of the cost of storage on a data appliance such as Teradata. It takes some doing to use up all the disk space on even a small cluster of say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 720 to 1440 TB of storage space. If there’s no shortage of space, why bother wasting cycles on compression and decompression?
With Hadoop, that’s the wrong way to look at it because saving space is not the main reason to use compression in Hadoop clusters—minimizing disk and network I/O is usually more important. In a fully-used cluster, MapReduce and Tez, which do most of the work, tend to saturate disk-I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily consume all available network I/O.
YARN—the data operating system for Hadoop. Bored yet? They should call it YAWN, right?
Not really—YARN is turning out be the biggest thing to hit big-data since Hadoop itself, despite the fact that it runs down in the plumbing of somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.