Uncategorized

Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.

elephant-in-cloud

 

No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications. Continue reading

Standard
Uncategorized

Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.

Sumatran-elephant

This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.

Continue reading

Standard
Hadoop, Hadoop Hive, YARN

Live Long and Process

One great thing about working for Hortonworks is that you get to try out new features before they are released, with real feature engineers as tour guides—features like LLAP, which stands for Live Long and Process. LLAP is coming to Hortonworks with Hive 2.0 and (spoiler alert) it looks like it will be worth the wait.Live-Long-and-Prosper-Shirt

Originally a pure batch processing platform, Hive has speeded up enormously over the last couple of years with Tez, the cost-based optimizer (CBA), ORC files, and a host of other improvements. Queries that once took minutes now (1st quarter 2016) run at interactive speeds, and LLAP aims to push latencies into the sub-second range.

Continue reading

Standard
algorithms, not-hadoop

Super Fast Estimates of Levenshtein Distance

This simple heuristic estimates the Levenshtein Distance (LD) of large sequences as much as tens of thousands of times faster than computing the true LD. In exchange for a certain loss of precision, this extends the size range over which LD is practical by a factor of a few hundred X, making it useful for comparing large text documents such as articles, Web pages, and books.

I612H

Equally importantly, the estimates are computed from relatively small signatures, which means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed.

The signatures can also be used to estimate approximately where and how the sequences differ. This allows finer distinctions to be made about near duplication, for instance, is one document embedded in the other, or are many small difference sprinkled throughout?

Continue reading

Standard
Hadoop, Hadoop hardware, Uncategorized

A Question of Balance

When you add nodes to a cluster, they start out empty.  They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.

teeter_totter.png

After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

Continue reading

Standard
algorithms, not-hadoop, twitter, Uncategorized

Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

bubble-view.png Continue reading

Standard