We had a problem. The company I’m with wants to move data from hundreds of different kinds of IoT devices up to the AWS cloud. There are also Linux-powered gateways, a ton of code on the cloud side, plus Web browser applications. Between them, these components use Python, C/C++, Java, JavaScript, and PHP, and run on Linux, Mongoose OS, Windows, OSX, Android, and even bare metal (embedded microcontroller devices such as Arduino and ESP32).

Despite all these exotica, our problem is humble.  The messages the components send, at some point, are almost all represented in JSON, so we need some way to define that JSON centrally to ensure that all participants conform to the same schema and to make it testable. The best way to do this is to provide developers with standard objects—beans, in the Java world—that emit and accept the JSON.   But we don’t want to write and maintain the bean code in five languages as things evolve. How do we get around that?

IDL In The Middle?

There are many frameworks that use some kind of Interface Definition Language (IDL) to define data objects generically so that they can be serialized to and from a generic wire format in a language-neutral way. These frameworks use the IDL document to generate equivalent beans in multiple languages. The beans know how to emit and reconstitute the data at either end, and in between, the data is usually serialized to some efficient binary wire format to conserve bandwidth. CORBA, Protocol Buffers, Avro, and Thrift all do something like this.

IDL seems like the right general idea, but those frameworks don’t quite fit our needs because they aren’t JSON-oriented, and wire formats and communication aren’t really our problems. For us, it’s just a question of keeping the JSON consistent. Writing such beans isn’t a big deal—we just don’t want to write and maintain everything in five languages for the rest of eternity.


Which brings me to Haxe—the coolest language you’ve never heard of.

Forget IDL. Haxe is a feature-rich, high-level programming language. It’s very generic, somewhat Java-like, but it also feels somewhat Pythonish at times. It’s got all the basic stuff with plenty of modern bells and whistles like closures and generics. Nothing too exciting there, but the point is, it’s not a specialized framework. It’s a real programming language suitable for complex projects.

The unique thing about Haxe is that there is no Haxe compiler that turns out an executable, and no virtual machine, either. Huh?! Instead, you run your code through the Haxe cross-compiler (called haxe) with a flag naming a target language, and it rewrites your Haxe program in the language of your choice and even compiles it for you. I’m not sure whether it does the final compile for every compiled target, but it does for Java. The Python just comes out as Python. If you name a “main” on the command line, the result is executable.

This solves our problem perfectly. I’ve written the JSON-oriented code in Haxe. There’s a bean with getters and setters for all the fields, plus methods to write the JSON and constructors to go in the reverse direction. There’s also a convenience class to run some standard known data through each object and convert it to JSON so we can verify reference output with automated tests.
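I can’t reproduce our actual message classes here, so the class and field names below are invented for illustration, but a hand-written Python equivalent of one of these beans (the kind of thing the Haxe build now generates for us) behaves roughly like this:

```python
import json

# Hypothetical example: a minimal hand-written Python equivalent of the kind
# of bean Haxe generates. Class and field names are invented for illustration.
class DeviceReading:
    def __init__(self, device_id, temperature, timestamp):
        self.device_id = device_id
        self.temperature = temperature
        self.timestamp = timestamp

    def to_json(self):
        # Emit the canonical JSON form every language binding must agree on.
        # Sorted keys make the reference output deterministic for tests.
        return json.dumps({
            "deviceId": self.device_id,
            "temperature": self.temperature,
            "timestamp": self.timestamp,
        }, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        # Reconstitute a bean from the wire form.
        d = json.loads(text)
        return cls(d["deviceId"], d["temperature"], d["timestamp"])

# Round-trip check of the kind our reference tests perform.
r = DeviceReading("dev-42", 21.5, 1700000000)
assert DeviceReading.from_json(r.to_json()).to_json() == r.to_json()
```

The point is not the ten lines of Python, which anyone could write, but that the same source also becomes the Java, C++, JavaScript, and PHP versions, so all five stay in lockstep.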

Now, the developers can be confident that they are writing to the same message model and they don’t have to code the JSON at all—that’s done just the one time in Haxe and updates will go to all the libraries automatically when it’s compiled.

It took a couple of hours to get Haxe set up, figure out how it worked, and establish the mechanics, but after that it was easy, and it looks like it will reduce the time spent on maintaining this aspect by 80%, forever. It’s so successful and easy that we’re looking to fold other boilerplate functionality into the Haxe build. It’s pretty amazing.

Some More Details

Once more, I’m no expert (yet), but there are some other points worth keeping in mind, so I’ll just dump some things I’ve come across and you can investigate them yourself.

Limitations of space and my own inexpertise allow me to touch on only the high points, but Haxe’s excellent Website is very complete, and there is a book, HaXe 2 Beginner’s Guide by Benjamin Dasnois, available online.

Type Checking and Binding

Haxe has an odd model for type checking.

Languages vary in two major ways with respect to type checking: how strict they are, and when a given variable is bound to a type. For instance, in a C++ program, you always have to state the type of a variable or function when you declare it, and thereafter the type cannot change. This is called compile-time binding, or early binding. On the other hand, you do not declare types in Python and similar languages because the runtime figures out what type a variable is on the fly. This is known as late binding. In fact, in Python, a given variable can hold different types at various points during execution, which strikes many C++ and Java people as flat-out depraved. (Not all late-bound languages allow this.)
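The depraved behavior in question takes two lines of Python to demonstrate: the same name rebinding to values of different types as the program runs.

```python
# Late binding in Python: the name `x` is not tied to any one type.
x = 42            # x currently holds an int
assert isinstance(x, int)

x = "forty-two"   # now it holds a str; perfectly legal Python
assert isinstance(x, str)

# The equivalent reassignment in C++ or Java would be a compile-time error,
# because there the *variable* has a type, not just the value.
```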

Superficially, Haxe’s type rules sometimes feel like Python’s in that you can simply ignore types much of the time, but there are times when Haxe insists that you give a variable or a function’s return value a type. A language expert may correct me on this, but despite the fact that you don’t always have to declare the type, the underlying model of Haxe seems to be strictly compile-time bound, in that by the time the code hits the compiler, every type must be either explicitly stated or inferable. That would make sense, because if the language you compile to doesn’t need the information, you can throw it away, but if it weren’t there, you could not compile to early-bound languages.

How Hard Is It?

If you are a user of any major programming language other than Lisp, much of Haxe will feel familiar. You can just start using it and figure out the fine points as you muddle through your first project. I’m a complete noob myself, having done exactly one non-trivial project in Haxe, but I got several hundred lines of working code without too much trouble.

How Complete Is It?

Most of the features of modern languages are included in Haxe. It’s kind of weirdly generic that way. If you’re used to Java or Python, you’ll barely notice that it’s not whatever language you’re used to. The big exception is that Haxe does not expose memory directly with pointers the way C and C++ do, so that style of programming won’t be directly available to you. Just as cross-compiling from an early-bound model to a late-bound model is logically straightforward but going from a late-bound model to an early-bound one would not be, translating from a language without pointers to a language with pointers is relatively easy, but going from a language with pointers to one without would be a very good trick.


It feels like magic—you write your code one time in this generic programming language, and in seconds you can have your code in Java, Python, C/C++/CPPIA, Lua, Neko, PHP, JavaScript, C#, or specialized targets such as Flash.

But it can’t really be that simple. Languages have libraries that aren’t strictly part of the language proper but are very much a part of the language culture. Also, a few obvious things are mysteriously left out. For instance, when the target is C/C++, C#, Java, Neko, or PHP, you have the Haxe Sys library, which deals with command-line arguments and several other important things, but it’s not available for Python. I had to work around a few things, but very few so far.

It goes the other way, too. A lot of the general-purpose stuff is built into the Haxe libraries, but they can’t be expected to include the union of all languages’ library functionality, if only because the models are often different. So for each language, there is an additional set of idiosyncratic libraries. The PHP library has things like cookies and HTML, the Flash libraries obviously have Flash, etc.

To cover the inevitable complexities, there are compilation features, such as conditional compilation, that let you continue to maintain one code base. The ins and outs of compilation are not trivial, but they don’t seem like a big deal compared to C++.

The Black Arts

If you are truly hardcore (and my core is pretty much that of a Hostess Twinkie) you can start messing with macros. This feature lets you jump into the middle of the translation/code generation path and insert your own custom functionality.  So in theory, you can make it do pretty much anything.

What Is The Correct Pronunciation?

It doesn’t matter how you pronounce it because nobody you will talk to has ever heard of it before. As the unquestioned authority, your pronunciation will be established locally as correct.

If you find yourself talking to someone who knows the correct pronunciation, rely on chutzpah. Eccentrics of independent spirit are still mispronouncing vi to rhyme with bye after more than 40 years and others snobbishly stick to saying Line-ux instead of Lin-ux because supposedly Linus is pronounced Line-us. Be your own person, goddammit.

That said, so far I’ve seen arguments for Hax-eh, Hax, and Hex. To judge from the Internet, Hex seems to be winning the popular vote but I’ve never heard anyone other than me actually say any of these. I started out saying Hax-eh, but now I’m going with hax. Like axe.

Bottom Line

I could not be more tickled so far. It’s not perfect, because languages have differences that are more than cosmetic. Non-removable, as they say in calculus. But for the kind of situation we’re in, it couldn’t be better. Word on the street (I can’t verify it) is that it’s popular with people who write multi-platform games, Web applications, and desktop applications. I can see why.



Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and is full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.



No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications.


Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.


This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years, but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.



Live Long and Process

One great thing about working for Hortonworks is that you get to try out new features before they are released, with real feature engineers as tour guides—features like LLAP, which stands for Live Long and Process. LLAP is coming to Hortonworks with Hive 2.0 and (spoiler alert) it looks like it will be worth the wait.

Originally a pure batch-processing platform, Hive has sped up enormously over the last couple of years with Tez, the cost-based optimizer (CBO), ORC files, and a host of other improvements. Queries that once took minutes now (1st quarter 2016) run at interactive speeds, and LLAP aims to push latencies into the sub-second range.



Super Fast Estimates of Levenshtein Distance

This simple heuristic estimates the Levenshtein Distance (LD) of large sequences as much as tens of thousands of times faster than computing the true LD. In exchange for a certain loss of precision, this extends the size range over which LD is practical by a factor of a few hundred, making it useful for comparing large text documents such as articles, Web pages, and books.
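For reference, this is the quantity being estimated: the textbook dynamic-programming computation of exact LD, whose O(m×n) running time is what makes it impractical for very large sequences.

```python
def levenshtein(a, b):
    """Exact Levenshtein distance via the standard O(len(a)*len(b)) DP.

    Keeps only one previous row at a time, so memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                # deletion from a
                cur[j - 1] + 1,             # insertion into a
                prev[j - 1] + (ca != cb),   # substitution (free if chars match)
            ))
        prev = cur
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```

For two 100,000-character documents, that inner loop runs ten billion times, which is why a fast signature-based estimate is attractive.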


Equally importantly, the estimates are computed from relatively small signatures, which means that it is not necessary to have the documents to be compared on hand at the time the estimate is computed.

The signatures can also be used to estimate approximately where and how the sequences differ. This allows finer distinctions to be made about near duplication: for instance, is one document embedded in the other, or are many small differences sprinkled throughout?



A Question of Balance

When you add nodes to a cluster, they start out empty.  They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.


After adding new racks to our 70-node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
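The arithmetic behind that half-hour figure is pure line rate, ignoring protocol overhead and disk speeds, so it is a lower bound rather than a measurement:

```python
# Back-of-the-envelope: time to push one terabyte across a 10GbE link
# at full line rate (no protocol overhead, no disk bottlenecks).
TB_BITS = 1e12 * 8    # one decimal terabyte, in bits
LINK_BPS = 10e9       # 10GbE line rate, bits per second

seconds = TB_BITS / LINK_BPS
minutes = seconds / 60

assert seconds == 800.0   # about 13.3 minutes, comfortably under half an hour
```

Even at half the theoretical rate you are well under thirty minutes, which is what makes multi-hour rebalancing times look so suspicious.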
