Haxe

We had a problem. The company I’m with wants to flush data from hundreds of different kinds of IoT devices to the AWS Cloud. There are also Linux-powered gateways, a ton of code on the Cloud side, plus Web browser applications. Among them, these components use Python, C/C++, Java, JS, and PHP, and run on Linux, Mongoose, Windows, OSX, Android, and even bare metal (embedded controller-based devices such as Arduino and ESP32).

Despite all these exotica, our problem is humble. The messages the components send are, at some point, almost all represented as JSON, so we need some way to define that JSON centrally, to ensure that all participants conform to the same schema, and to make conformance testable. The best way to do this is to provide developers with standard objects—beans, in the Java world—that emit and accept the JSON. But we don’t want to write and maintain the bean code in five languages as things evolve. How do we get around that?

IDL In The Middle?

There are many frameworks that use some kind of Interface Definition Language (IDL) to define data objects generically so that they can be converted to and from a wire format in a language-neutral way. These frameworks use the IDL document to generate equivalent beans in multiple languages. The beans know how to emit and reconstitute the data at either end, and in between, the data is usually serialized to some efficient binary wire format to conserve bandwidth. CORBA, Protocol Buffers, Avro, and Thrift all do something like this.

IDL seems like the right general idea, but those frameworks don’t quite fit our needs because they aren’t JSON-oriented, and wire formats and communication aren’t really our problems. For us, it’s just a question of keeping the JSON consistent. Writing such beans isn’t a big deal—we just don’t want to write and maintain everything in five languages for the rest of eternity.

Haxe

Which brings me to Haxe—the coolest language you’ve never heard of.

Forget IDL. Haxe is a feature-rich, high-level programming language. It’s very generic: somewhat Java-like, but it also feels somewhat Pythonish at times. It’s got all the basic stuff plus plenty of modern bells and whistles like closures and generics. Nothing too exciting there, but the point is, it’s not a specialized framework. It’s a real programming language suitable for complex projects.

The unique thing about Haxe is that there is no Haxe compiler that turns out an executable, and no virtual machine either. Huh?! Instead, you run your code through the Haxe cross-compiler (called haxe) with a flag naming a target language, and it rewrites your Haxe program in the language of your choice and can even compile the result for you. I’m not sure whether it runs that final compile for every compiled target, but it does for Java. The Python just comes out as Python. If you name a “main” on the command line, the result is executable.

This solves our problem perfectly. I’ve written the JSON-oriented code in Haxe. There’s a bean with getters and setters for all the fields, plus methods to write the JSON and constructors to go in the reverse direction. There’s also a convenience class to run some standard known data through each object and convert it to JSON, so we can verify reference output with automated tests.
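To make the shape of this concrete, here is a rough hand-written Python sketch of how one of those generated beans behaves. The class and field names are invented for illustration, and the real Haxe-generated code will of course look different:

```python
import json

class TemperatureReading:
    """Hypothetical message bean: one class per message type,
    with symmetric JSON emit/reconstitute methods."""

    def __init__(self, device_id, celsius):
        self.device_id = device_id
        self.celsius = celsius

    def to_json(self):
        # Emit the canonical JSON form that all components agree on.
        # Sorted keys keep the reference output stable for the tests.
        return json.dumps({"celsius": self.celsius, "deviceId": self.device_id},
                          sort_keys=True)

    @classmethod
    def from_json(cls, text):
        # Reconstitute a bean from the wire form (the "reverse direction").
        obj = json.loads(text)
        return cls(obj["deviceId"], obj["celsius"])

# Round-trip some standard known data, as the test harness does.
ref = TemperatureReading("sensor-42", 21.5)
assert TemperatureReading.from_json(ref.to_json()).celsius == 21.5
```

The point is that each developer only ever touches the bean’s getters and setters; the JSON wording lives in exactly one place.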

Now, the developers can be confident that they are writing to the same message model, and they don’t have to code the JSON at all—that’s done just the one time in Haxe, and updates propagate to all the language libraries automatically when it’s recompiled.

It took a couple of hours to get Haxe set up, figure out how it worked, and establish the mechanics, but after that it was easy, and it looks like it will reduce the time spent on maintaining this aspect by 80%, forever. It’s so successful and easy that we’re looking to fold other boilerplate functionality into the Haxe build. It’s pretty amazing.

Some More Details

Once more, I’m no expert (yet), but there are some other points worth keeping in mind, so I’ll just dump some things I’ve come across and you can investigate them yourself.

Limitations of space and my own inexpertise allow me to touch only on the high points, but Haxe’s excellent Website is very complete, and there is a book, haXe 2 Beginner’s Guide by Benjamin Dasnois, available online.

Type Checking and Binding

Haxe has an odd model for type checking.

Languages vary in two major ways with respect to type checking: how strict they are, and when a given variable is bound to a type. For instance, in a C++ program, you always have to state the type of a variable or function when you declare it, and thereafter the type cannot change. This is called compile-time binding, or early binding. On the other hand, you do not declare types in Python and similar languages because the runtime figures out what type a variable is on the fly. This is known as late binding. In fact, in Python, a given variable can hold different types at various points during execution, which strikes many C++ and Java people as flat-out depraved. (Not all late-bound languages allow this.)
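A two-line Python example makes the late-bound behavior concrete:

```python
# In a late-bound language, a name carries no fixed type; the runtime
# discovers the type of the current value as the program executes.
x = 42                       # x currently holds an int
assert isinstance(x, int)

x = "forty-two"              # the same name now holds a str: legal in
assert isinstance(x, str)    # Python, a compile-time error in C++ or Java
```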

Superficially, Haxe’s type rules sometimes feel like Python’s, in that you can simply ignore types much of the time, but there are times when Haxe insists that you give a variable or function return value a type. A language expert may correct me on this, but despite the fact that you don’t always have to declare the type, the underlying model of Haxe seems to be strictly compile-time bound, in that by the time the code hits the compiler, every type must be either explicitly stated or inferable. That would make sense, because if the target language doesn’t need the information, you can throw it away, but if it weren’t there, you could not compile to early-bound languages.

How Hard Is It?

If you are a user of any major programming language other than Lisp, much of Haxe will feel familiar. You can just start using it and figure out the fine points as you muddle through your first project. I’m a complete newbie myself, having done exactly one non-trivial project in Haxe, but I got several hundred lines of working code without too much trouble.

How Complete is It?

Most of the features of modern languages are included in Haxe. It’s kind of weirdly generic that way. If you’re used to Java and Python, you’ll barely notice that it’s not whatever language you’re used to. The big exception is that Haxe does not expose memory directly with pointers the way C and C++ do, so that style of programming won’t be directly available to you. Just as cross-compiling from an early-bound model to a late-bound model is logically straightforward while going the other way would not be, translating from a language without pointers to one with pointers is relatively easy, but going from a language with pointers to one without would be a very good trick.

Compiling

It feels like magic—you write your code one time in this generic programming language, and in seconds you can have it in Java, Python, C/C++/CPPIA, Lua, Neko, PHP, JavaScript, C#, or specialized targets such as Flash.

But it can’t really be that simple. Languages have libraries that aren’t strictly part of the language proper but are very much a part of the language culture. Also, a few obvious things are mysteriously left out. For instance, when the target is C/C++, C#, Java, Neko, or PHP, you have the Haxe Sys library, which deals with command-line arguments and several other important things, but it’s not available for Python. I had to work around a few things, but very few so far.

It goes the other way, too. A lot of the general-purpose stuff is built into the Haxe libraries, but they can’t be expected to include the union of every language’s library functionality, if only because the models are often different. So for each language, there is an additional set of idiosyncratic libraries. The PHP library has things like cookies and HTML, the Flash libraries obviously have Flash, and so on.

To cover the inevitable complexities, there are compiler features such as conditional compilation that let you continue to maintain one code base. The ins and outs of compilation are not trivial, but they don’t seem like a big deal compared to C++.

The Black Arts

If you are truly hardcore (and my core is pretty much that of a Hostess Twinkie) you can start messing with macros. This feature lets you jump into the middle of the translation/code generation path and insert your own custom functionality.  So in theory, you can make it do pretty much anything.

What Is The Correct Pronunciation?

It doesn’t matter how you pronounce it because nobody you will talk to has ever heard of it before. As the unquestioned authority, your pronunciation will be established locally as correct.

If you find yourself talking to someone who knows the correct pronunciation, rely on chutzpah. Eccentrics of independent spirit are still mispronouncing vi to rhyme with bye after more than 40 years and others snobbishly stick to saying Line-ux instead of Lin-ux because supposedly Linus is pronounced Line-us. Be your own person, goddammit.

That said, so far I’ve seen arguments for Hax-eh, Hax, and Hex. To judge from the Internet, Hex seems to be winning the popular vote but I’ve never heard anyone other than me actually say any of these. I started out saying Hax-eh, but now I’m going with hax. Like axe.

Bottom Line

I could not be more tickled so far. It’s not perfect, because languages have differences that are more than cosmetic. Non-removable, as they say in calculus. But for the kind of situation we’re in, it couldn’t be better. Word on the street (I can’t verify it) is that it’s popular with people who write multi-platform games, Web applications, and desktop applications. I can see why.

 


Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and is full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.


No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications. Continue reading


Big Jobs Little Jobs

You’ve probably heard the well-known Hadoop paradox that even on the biggest clusters, most jobs are small, and the monster jobs that Hadoop is designed for are actually the exception.


This is true, but it’s not the whole story. It isn’t easy to find detailed numbers on how clusters are used in the wild, but I recently came across some decent data on a 2011 production analytics cluster at Microsoft. Technology years are like dog years but the processing load it describes remains representative of the general state of things today, and back-of-the-envelope analysis of the data presented in the article yields some interesting insights.

Continue reading


A Question of Balance

When you add nodes to a cluster, they start out empty.  They work, but the data for them to work on isn’t co-located, so it’s not very efficient. Therefore, you want to tell HDFS to rebalance.


After adding new racks to our 70-node cluster, we noticed that it was taking several hours per terabyte to rebalance the nodes. You can copy a terabyte of data across a 10GbE network in under half an hour with SCP, so why should HDFS take several hours?
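The back-of-the-envelope arithmetic behind that half-hour figure, assuming the full nominal 10 Gb/s line rate (which a real transfer won’t quite hit):

```python
# Ideal time to move 1 TB over a 10 GbE link at nominal line rate.
link_bits_per_sec = 10e9            # 10 GbE
terabyte_bits = 1e12 * 8            # 1 TB expressed in bits
seconds = terabyte_bits / link_bits_per_sec
print(seconds / 60)                 # ~13.3 minutes, well under half an hour
```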

Continue reading


Z-Filters: How to Listen to a Hundred Million Voices

Z-Filters is a technique for listening to what millions of people are talking about.

If a hundred million people were to talk about a hundred million different things, making sense of it would be a hopeless task, but that’s not the way society operates. The number of things people are talking about at any given moment is much, much smaller: thousands, not millions.

Continue reading


Erasure Code in Hadoop

What is Erasure Code?

Hadoop 2.7 isn’t out yet, but it’s scheduled to include something called “erasure code.”  What the heck is that, you ask? Here’s a quick preview.


The short answer is that erasure code is, in this context, another name for Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.
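To see where the space savings come from, compare the storage overheads. The RS(6,3) layout below is the commonly cited default for HDFS erasure coding, so treat the exact parameters as illustrative:

```python
# Triple replication: every block is stored three times,
# so two extra copies accompany each block of real data.
replication_overhead = (3 - 1) / 1          # 200% extra space

# Reed-Solomon RS(6,3): 6 data blocks plus 3 parity blocks,
# and any 6 of the 9 suffice to reconstruct the data.
data_blocks, parity_blocks = 6, 3
ec_overhead = parity_blocks / data_blocks   # 50% extra space

print(replication_overhead, ec_overhead)    # 2.0 vs 0.5
```

Both schemes survive the loss of any two replicas/blocks, but the erasure-coded layout pays a quarter of the space penalty.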

The longer answer follows.

Continue reading
