Go Go Go

I’m new to Go—just a few months in. I’ve spent a lot more time with Java, C++, even Python, but the Gopher is an interesting critter so far.  It’s not just a better version of <your favorite language here>.

Every language is a commitment to a particular way way of looking at programming but rarely more so than with Go, which is often politely described as “opinionated.”

Some Informal History

Go’s ancestor, the C language, which was invented in 1972, spelled the end of the age of assembler for systems programming. In the words of its author, Dennis Ritchie, C is portable assembler. Until C came along, you had to hand-port operating systems and other systems code to each platform you wanted to run on. There weren’t any IDE’s—it was just you and your editor back then.


A vast superstructure of software development tooling has evolved since C was young but less of it than one might imagine is about telling machines what to do.

There used to be a thing called “the software crisis” back in the 80’s and 90’s. Younger programmers may never have heard of it, but back then the majority of large projects were said to fail as the size of the problems we faced began to outstrip our ability to write commensurately large software.

Yet today, only the occasional overblown or ill-conceived project fails and success is the expectation. What happened?

It’s not the languages that have changed—I was a CS student in the late eighties and I have yet to encounter a significant language feature that did not already exist when I was an undergraduate.

It is tools and management techniques to facilitate people working together that beat down the software crisis, not high-powered syntax. Widespread adoption of Object Orientation allowed data models of unprecedented size to be developed and managed by large teams as they evolved over time. Open-source created a universe of high-quality computational Lego that let individuals or small teams produce incredibly powerful systems with little more than glue code. Agile and other management practices, powerful source control, documentation, tools like Maven, Jenkins, continuous integration systems, Jira, and of course, Unix for everyone.  These things are all about people, not machines. If you left it up to the machines, they’d write everything in C for portability and they’d skip the tabs or newlines.

Continue reading



Haxe is a most unusual language. So far, nobody I’ve enthused about it to has heard of it, which is a shame. I’m loving it. But before jumping into it, I want to give you some setup.

We had a problem. The company I’m with wants to flush data from hundreds of different kinds of IoT devices to the AWS Cloud. There are also Linux-powered gateways, a ton of code on the Cloud side plus Web browser applications. Among them, they use Python, C/C++, Java, JS, and PHP, and run on Linux, Mongoose, Microsoft, OSX, Android and even bare metal (the embedded controller-based devices, e.g. Arduino and ESP32, etc.)

Despite all these exotica, our problem is humble.  The messages the components send, at some point, are almost all represented in JSON, so we need some way to define that JSON centrally to ensure that all participants conform to the same schema and to make it testable. The best way to do this is to provide developers with standard objects—beans, in the Java world—that emit and accept the JSON.   But we don’t want to write and maintain the bean code in five languages as things evolve. How do we get around that?

Continue reading


Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.



No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications. Continue reading


Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.


This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.

Continue reading

Hadoop, Hadoop Hive, YARN

Live Long and Process

One great thing about working for Hortonworks is that you get to try out new features before they are released, with real feature engineers as tour guides—features like LLAP, which stands for Live Long and Process. LLAP is coming to Hortonworks with Hive 2.0 and (spoiler alert) it looks like it will be worth the wait.Live-Long-and-Prosper-Shirt

Originally a pure batch processing platform, Hive has speeded up enormously over the last couple of years with Tez, the cost-based optimizer (CBA), ORC files, and a host of other improvements. Queries that once took minutes now (1st quarter 2016) run at interactive speeds, and LLAP aims to push latencies into the sub-second range.

Continue reading

algorithms, not-hadoop

Super Fast Estimates of Levenshtein Distance

Levenshtein Distance is an elegant measure of the dissimilarity of two strings. Given a pair of strings, say, “hat” and “cat”, the LD is the number of single-character edits that are required to turn one into the other.  The LD of cat and hat is one. The LD of hats and cat is two.

LD is precise and has an intuitive meaning but in practice it is used mostly for short strings because the run-time of the algorithm to compute LD is quadratic, i.e, proportional to the product of  the lengths of the two strings. On a reasonable computer you can compare strings as long as a line in this article in a few microseconds, but comparing a full page of this article to another full page would take a good chunk of a second.  Comparing two books with LD is serious computing–many of minutes if you have a computer with sufficient memory, which you almost certainly do not.

That’s why, as elegant as LD is for describing how different two chunks of text are, people rarely consider using it as a way to compare documents such as Web pages, articles, or books.

This heuristic described here is a way around that problem.  It turns out that you can compute a decent estimate of the LD of two large strings many thousands of times faster than you could compute the true LD.  The other way to say this is that whatever amount of time you deem tolerable for a comparison computation, using estimates increases the size of strings you can compare within that time limit by a factor of as much as a few hundred. The practical size-range for estimated LD is in the megabyte range (and much larger for binary data.)


Equally importantly, the LD estimates are made from pre-computed signatures, not from the the original documents. This means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed, which is a tremendous advantage when you need to compare documents across a network.

The signatures can also provide insight into approximately where and how two sequences differ. This allows finer distinctions to be made about near-duplication, for instance, is one document embedded in the other, or are many small difference sprinkled throughout?

Continue reading

Hadoop, Hadoop hardware, Uncategorized

A Question of Balance

When you add nodes to a cluster, they start out empty.  They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.


After adding new racks to our 70 node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?

Continue reading

algorithms, not-hadoop, twitter, Uncategorized

Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about. Twitter is the data source, but the idea isn’t Twitter specific. The same problems are inherent no matter how you eavesdrop on a population.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

bubble-view.png Continue reading


Erasure Code in Hadoop

What is Erasure Code?

Hadoop 2.7 isn’t out yet, but it’s scheduled to include something called “erasure code.”  What the heck is that, you ask? Here’s a quick preview.


The short answer is that erasure code is another name for Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.

The longer answer follows.

Continue reading