Haxe

Haxe is a most unusual language. So far, nobody I’ve enthused about it to has heard of it, which is a shame. I’m loving it. But before jumping into it, I want to give you some setup.

We had a problem. The company I’m with wants to flush data from hundreds of different kinds of IoT devices to the AWS Cloud. There are also Linux-powered gateways, a ton of code on the Cloud side, plus Web browser applications. Among them, they use Python, C/C++, Java, JS, and PHP, and run on Linux, Mongoose, Windows, OS X, Android, and even bare metal (embedded controller-based devices such as Arduino and ESP32).

Despite all these exotica, our problem is humble. At some point, the messages the components send are almost all represented as JSON, so we need a way to define that JSON centrally, ensure that every participant conforms to the same schema, and make it testable. The best way to do this is to provide developers with standard objects (beans, in the Java world) that emit and accept the JSON. But we don’t want to write and maintain the bean code in five languages as things evolve. How do we get around that?
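
To make the problem concrete, here is the kind of bean we mean, sketched in Python (the class and field names are invented for illustration; this shows the problem, not the Haxe solution). Writing it once is easy; the pain is that an equivalent class has to exist, and stay in sync, in every other language on the list.

```python
import json
from dataclasses import dataclass, asdict

# A hypothetical device message, invented for illustration; the real schema
# would be defined centrally and every component would carry an equivalent object.
@dataclass
class TelemetryMessage:
    device_id: str
    timestamp: int        # epoch seconds
    temperature_c: float

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, text: str) -> "TelemetryMessage":
        return cls(**json.loads(text))

# Round trip: a device emits the JSON, the cloud side accepts it.
msg = TelemetryMessage(device_id="dev-42", timestamp=1700000000, temperature_c=21.5)
assert TelemetryMessage.from_json(msg.to_json()) == msg
```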

Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about it. The cloud is a complex environment that differs in many ways from the data center and is full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.

No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications.

Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.

This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years, but the processing load it describes remains representative of the general state of things today. A back-of-the-envelope analysis of the data presented in the article yields some interesting insights.

A Question of Balance

When you add nodes to a cluster, they start out empty.  They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.

After adding new racks to our 70-node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10 GbE network in under half an hour with SCP, so why should HDFS take several hours?
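
That half-hour figure is easy to sanity-check. Here is a rough sketch with illustrative numbers only; it considers nothing but the raw link speed:

```python
# Rough wire-time estimate for 1 TB over 10 GbE (illustrative numbers only).
link_bits_per_sec = 10e9                     # 10 GbE
link_bytes_per_sec = link_bits_per_sec / 8   # ~1.25 GB/s
terabyte = 1e12                              # bytes

raw_minutes = terabyte / link_bytes_per_sec / 60
print(f"Raw wire time: {raw_minutes:.0f} minutes")           # ~13 minutes

# Even if SCP only achieves half of the line rate (encryption, TCP, disk),
# the copy still lands under half an hour, so a rebalance rate of several
# hours per terabyte is being throttled by something other than the network.
print(f"At 50% efficiency: {raw_minutes * 2:.0f} minutes")    # ~27 minutes
```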

Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

Erasure Code in Hadoop

What is Erasure Code?

Hadoop 3.0 isn’t out yet, but it’s scheduled to include something called “erasure code.” What the heck is that, you ask? Here’s a quick preview.

The short answer is that, in Hadoop, erasure code means Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.
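
A quick back-of-the-envelope comparison shows where the savings come from. This is a sketch only; the RS(6,3) layout (six data blocks plus three parity blocks) is one commonly discussed HDFS configuration and is used here purely for illustration:

```python
# Storage overhead and fault tolerance: 3x replication vs. Reed-Solomon erasure coding.
# The RS(6,3) figures (6 data + 3 parity blocks) are illustrative, not a statement
# about which layout Hadoop will ship with.

def replication(replicas=3):
    # Each block is stored 'replicas' times; up to replicas - 1 copies can be lost.
    return replicas, replicas - 1

def reed_solomon(data=6, parity=3):
    # Any 'data' surviving blocks out of data + parity are enough to reconstruct,
    # so up to 'parity' blocks can be lost.
    return (data + parity) / data, parity

for name, (overhead, losses) in [("3x replication", replication()),
                                 ("RS(6,3)", reed_solomon())]:
    print(f"{name:15s}: {overhead:.1f}x raw storage, tolerates {losses} lost blocks")

# 3x replication : 3.0x raw storage, tolerates 2 lost blocks
# RS(6,3)        : 1.5x raw storage, tolerates 3 lost blocks
```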

The longer answer follows.
