Hadoop Cloud Clusters

If experience with Hadoop in the cloud has taught me anything, it’s that it is very hard to get straight answers about Hadoop in the cloud. The cloud is a complex environment that differs in many ways from the data center and is full of surprises for Hadoop. Hopefully, these notes will lay out all the major issues.



No Argument Here

Before getting into Hadoop, let’s be clear that there is no real question anymore that the cloud kicks the data center’s ass on cost for most business applications. Yet, we need to look closely at why, because Hadoop usage patterns are very different from those of typical business applications. Interestingly, purely in terms of raw cycles per dollar’s worth of hardware, it is not at all clear that the cloud is cheap—you have to be precise about what you mean. As we’ll see below, in the extreme case, were you to run large numbers of machines indefinitely at high utilization you’d find that cloud cycles are actually quite expensive. Likewise with certain data usage patterns.

Nevertheless, cloud economics crush the data center for most business applications for several reasons:

  • Only relatively rarely does cloud vs. data center boil down to cycles/dollar. Cloud providers are primarily selling VM’s, not raw cycles. The goal of beating data center cost is a sitting duck for providers because data center utilization rates notoriously vary from abysmal to not very good. Ganging up many VM’s per server allows the cloud provider to hold the cost per VM down while still renting VM’s relatively cheaply. This is possible precisely because cycles are not the main product.
  • Even if the raw cycles work out to be expensive, the cost of the hardware is only part of the total cost of deployment. Where many enterprises spend the big money is the myriad extras such as rack space, networking, security, hardware maintenance and replacement, administration, durable backups, managing cool and cold storage, IT department overhead, management overhead, etc., to say nothing of electricity, AC, and connectivity. Cloud providers throw most of this in for free, and the services are better quality than even some of the best IT departments can achieve.
  • Even if the total cost of VM’s weren’t lower than running your own (and for general applications it usually is) that’s still not all there is to it.  AWS and the other clouds are incredibly nimble–you can spin up a new server in minutes, add load-balanced instances, or set up an entire virtual private cloud before lunch. You can shut them down just as fast. Getting that done in a data center through your IT department can take months and costs a fortune.

Yet for all that, there do remain classes of computing applications for which the cloud cannot be assumed to be competitive. For instance:

  • Applications where communication latency is critical. For example, round-trip times from earth to the cloud are orders of magnitude too long to be competitive for securities trading. A round-trip to the cloud takes several tens of milliseconds, which is an eternity for traders, who can win or lose trades by a microsecond.
  • Applications that need close control of hardware resources, e.g., CPU, network, or storage-I/O. Sharing doesn’t simply reduce the resources a given user gets proportionately. It also reduces the total amount available because of inefficiencies it induces such as paging delays, cache flushing, extra disk seeks, breaking up smooth I/O streams, etc.
  • Extremely resource-intensive applications.  If your need for raw cycles and data is great enough, for long enough, it pays to cut out the middleman. This is especially true if the platform is highly consistent, which minimizes administrative complexity per node.

All of these to some extent, but especially the second and third, apply to Hadoop.

It Definitely Works

Hadoop clearly works at scale in the cloud. Netflix, for instance, reports that they operate thousands of cloud nodes that process staggering loads. At this writing, they process something like 350 billion events and petabytes of reads daily, numbers they expect to increase rapidly in the next few years.

The caveat is that it is far from clear that such a deployment is particularly efficient in terms of bytes-processed-per-node or events-per-dollar when compared to smaller earth-bound clusters. It might be or it might not be; it’s extremely hard to say.  You can’t judge a technology from the success of a behemoth like Netflix because some jobs are so big that they can only be done inefficiently. That may sound like a perverse statement, but it’s actually common at many scales. Think of MapReduce itself: my 2010 laptop smokes any Hadoop cluster at sorting a megabyte of data—it’s not a contest. MR and Tez don’t come into their own until you’re into many gigabytes, but after that, their advantage increases relentlessly forever. Also, operating on that scale often involves major geographic issues that may not come into play at smaller scales.

Monster deployments also tend to be very complex, with many more moving parts than mere Hadoop. The companies operating at that scale usually work closely with the provider’s engineers as well as the staffs of their software vendors. Deciding whether the cloud will be cost-effective for your 30-node starter cluster involves an entirely different set of variables.

Apples and Oranges

For a typical business application, the burden of proof is on anyone who says don’t deploy in the cloud, but the decision is maddeningly complex for Hadoop.

There are two main reasons why it is difficult to predict in advance the cost-effectiveness of the cloud for Hadoop. Pretty much the same goes for other clouds, but I’ll talk about this in terms of AWS.

Firstly, Amazon instance capacity is defined in terms of a unit called vCPU, which has no clear definition and varies with the CPU type. Basically, you can think of it as equivalent to about 1/2 of the compute capacity of a hardware core on a processor of the type that underlies the instance. That’s a lame description, but you can talk all day and not come up with a better one—there are too many variables. Using vCPU to describe processing power is like using horsepower to describe vehicles: a Ferrari and an 18-wheel semi have approximately the same horsepower.

This is by far the best explanation of what a vCPU is that I have seen anywhere (courtesy of Marc Fielding on the Pythian Web site.) It’s three years old, but I think it is still very sound.

For practical purposes, you can guesstimate that to duplicate the power (whatever that means) of an 8-core data center server, you need 16 vCPU’s running on the same family of hardware.

The second reason is that even given a rough measure of processing power equivalence, the properties of a data center cluster and a supposedly equivalent cloud cluster will differ in many dimensions.

  • It’s not just raw cycles. As mentioned above:
    • Workload type is a significant factor in hyperthread performance.
    • Hardware resources (CPU, memory, disk-swapping, etc.) are shared with an unpredictable number and variety of users.
  • The mounted block-storage models are quite different from disk.
  • The S3 model, for better and worse,  is completely different from anything in classic Hadoop.
  • Cloud environments aren’t comparable to the data center counterparts.
    • The relationship of the VM to the hardware is abstracted in the cloud.
    • The inter-instance LAN is shared with users outside the cluster.
    • S3 is an HTTP service and is not at all like disks.
    • The devices backing EBS volumes are shared with other users and EBS itself is a complex service that does more than serving blocks.
    • Important physical components that Hadoop uses in optimization are entirely abstracted away (e.g. racks.)
    • The cloud supplies many things with an instance that you must take care of yourself in a data center deployment.

Even if the computational power you need in your cluster, over the duration of time you need it, is not cost effective in the cloud, it  may still be the right choice because:

  • No matter how good your IT department is, deploying in the cloud is faster. You can deploy a major cloud cluster in hours. Hardware takes months just to buy and install.
  • Managerial bias: there is a massive shift to the cloud underway, and expanding in the data center moves against the tide. Expect resistance to any project targeted to the data center. It’s a fact of business life.
  • Hadoop doesn’t live in a vacuum. The economics of the cloud for the many other applications that use it or feed it may compensate for marginal economics in the cluster proper.
  • AWS provides a very capable model for related items such as backing up data, resource availability, fail-over, etc.

But for all these good things, beware that it can require more ongoing engineering and user support to get good performance in the cloud.

Characteristics and Quirks

EBS is a NAS, not a DAS—reads and writes traverse a network. This affects performance in multiple ways:

  • Reading across a network causes increased latency, which means the amount of time it takes to get the first byte is greater, regardless of the streaming bandwidth. Compared to streaming, latency is a secondary concern for Hadoop, but be aware that magnetic-backed EBS has a latency of 10-40 ms, while SSD EBS is in the single-digit milliseconds.  On a data center machine, SATA disk latency might be 10-15 ms and SSD latency would be a tiny fraction of a millisecond.
  • The underlying storage device that supports an EBS volume is shared among multiple volumes serving others within and possibly without your cluster. The number of users and what they are doing definitely affects both average latency and throughput, as well as the variance seen in those measures, especially if you or others sharing the resource are using transparent encryption on the back end. Latency and variance in latency subtly disrupt smooth processing in countless ways.

Bandwidth and latency issues are hard to get a handle on.

Note that we’re only talking about AWS EBS-optimized nodes in this section. Optimized nodes provide a guaranteed amount of bandwidth to EBS. Without optimization, all the EBS traffic goes over the same network that your instances use to talk to each other, which limits you to a toy cluster for almost all realistic workloads.

Here are the main points you need to consider.

  • The guarantee we’re talking about is for bandwidth to/from EBS, not for the amount of data you can actually fetch in a given time or with a given latency. These can vary for the worse by a significant amount depending on usage pattern, whether you use encryption,  how many volumes you are accessing, etc.
  • Maximum bandwidth to/from a given volume is limited by the volume type. (These types have very different prices.)
    • Old-school magnetic-backed EBS is limited to about 40-90MB/sec
    • SSD is higher–up to about 160MB/sec (this is usually the default choice)
    • Provisioned IOPS SSD can go higher–up to 320 MB/sec per volume.
  • Aggregate bandwidth from a given instance to all of the EBS volumes that it has mounted is limited by the combination of instance type and vCPU count. Here is a chart that gives the numbers. For example, a typical server choice for Hadoop might be an m4.4xlarge with 16 vCPU’s. This would be equivalent to a pretty decent 8-core server and has a guarantee of 2000Mbps=250MB/sec aggregate bandwidth to EBS.

Note that 250MB/sec is approximately equal to the I/O bandwidth that a data center server will see from five commodity SATA drives. Such a machine might host twenty or more data disks so you can get about four times the raw I/O bandwidth on the data center machine at somewhat lower latency.
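The comparison above is just unit conversion and multiplication, sketched here in Python. The 50 MB/sec per-drive figure is my assumption, inferred from the "five commodity SATA drives" equivalence; the function and variable names are made up for illustration.

```python
def mbps_to_mb_per_sec(megabits_per_sec):
    """Convert network megabits/sec to megabytes/sec (8 bits per byte)."""
    return megabits_per_sec / 8


# Figure from the text: an m4.4xlarge guarantees 2000 Mbps to EBS.
ebs_aggregate = mbps_to_mb_per_sec(2000)

# Assumption: about 50 MB/sec sustained per commodity SATA drive, which is
# what "250 MB/sec is about five drives" implies.
SATA_MB_PER_SEC = 50
data_center_aggregate = 20 * SATA_MB_PER_SEC  # twenty data disks

ratio = data_center_aggregate / ebs_aggregate
print(f"EBS guarantee: {ebs_aggregate:.0f} MB/sec")
print(f"Twenty local disks: {data_center_aggregate} MB/sec ({ratio:.0f}x)")
```

Swap in your own per-drive number; the ratio moves linearly with it.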

The latency differences between EBS and DAS disks are large and the source of much confusion. The key thing that surprises people is that SSD-backed EBS is nothing like having SSD on the machine itself. (You can get this too but it is very expensive.) The latency of SSD-backed EBS is much better than magnetic-backed EBS, but it’s closer to that of a local SAS disk drive than it is to that of a local SSD device, which for most purposes is negligible. The latency of magnetic-backed EBS is similar to that of an HTTP GET call—figure three to five times the latency of a commodity SATA disk.

In fairness, EBS is a service and not simply a mounted file system. It does more than blindly serve blocks. It optimizes by merging sequential reads, does caching and pre-loading, implements transparent encryption, and provides higher availability than a disk.

On the other hand, EBS requires that a backlog of operations be enqueued in order to reach maximum throughput. In other words, getting the highest throughput depends upon a factor that tends to increase the latency. Some of the other features mentioned above also apply more to typical applications than to Hadoop. For instance, Hadoop reads and writes big blocks, so merging reads is less of an advantage. Higher availability is also moot, as Hadoop replication already provides much higher data availability.
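One way to see why throughput and latency pull against each other is Little's Law: the number of requests in flight equals the request rate times the time each request takes. The sketch below is a rough model, not a description of how EBS works internally. The I/O size and latency values are illustrative assumptions, and the function name is hypothetical.

```python
def outstanding_ios_needed(target_mb_per_sec, io_size_kib, latency_ms):
    """Little's Law: requests in flight = request rate x time per request."""
    ios_per_sec = target_mb_per_sec * 1_000_000 / (io_size_kib * 1024)
    return ios_per_sec * (latency_ms / 1000)


# Assumed values: 256 KiB requests against the 160 MB/sec SSD volume limit.
for latency_ms in (5, 10, 20):
    depth = outstanding_ios_needed(160, 256, latency_ms)
    print(f"{latency_ms} ms per request -> about {depth:.1f} requests in flight")
```

The longer each request takes, the deeper the queue has to be to keep the pipe full, and a deeper queue means each individual request waits longer.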

S3 is a strange beast and not to be confused with a file system.

  • S3 is a REST service accessed over HTTP, not a mounted block file system. The latency is large compared to either DAS or EBS as the data makes many network hops to get to you.
  • S3 has unbounded storage capacity.
  • Performance can depend upon naming conventions in several ways.
  • Compared to EBS, the cost ranges from cheap to extremely cheap. At its most expensive S3 is an order of magnitude cheaper than EBS.
  • S3 is optimized for IOPS, which is best for coarse-grained lookups for substantial data such as, for instance, fetching Web pages from a huge set. But it has poor streaming capabilities and it is not a good way to access a random small bit of data (both of which are important to Hadoop.)

Noisy neighbors are a factor for many components including EBS, S3, the LAN, the CPU, etc.

Economic Pros and Cons of the Cloud

The economics of the cloud relative to the data center are complicated and highly dependent upon what you’re trying to accomplish even within the restricted concerns of Hadoop. Some of the differences are good, some are bad, and some are both.

Raw Instance Cost (Con)

Whatever the advantages of the cloud, contrary to its reputation, it is definitely not cheap in terms of raw CPU cycles/dollar. We’ll see below that $/CPU/hour is too crude a measure, but consider the following simplified example to get started.

As alluded to before, it’s hard to compare hardware with VM’s in the cloud, but a 16-vCPU m4.4xlarge costs $0.862/hour, or about $20.69/day, which is $7,551/year. EBS volumes cost $0.10/GB/month (for the recommended default SSD-backed storage) so if your big server has 8TB of EBS, that’s $9,600/year for disk, for a total of $17,151/year. Note, that’s only about 2.7TB of usable storage per node after three-way HDFS replication, which is quite small for Hadoop, small enough to greatly affect how you will choose to use the cluster, but we’ll assume you can adjust, perhaps by using a more complex cold-storage policy than you might have on a data center cluster.
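Here is the same arithmetic as a small Python sketch so you can plug in current prices. The prices are the ones quoted above, 8TB is counted as 8,000 GB as the text does, and the function names are mine.

```python
HOURS_PER_YEAR = 24 * 365


def annual_instance_cost(dollars_per_hour):
    """Cost of one instance running around the clock for a year."""
    return dollars_per_hour * HOURS_PER_YEAR


def annual_ebs_cost(gigabytes, dollars_per_gb_month=0.10):
    """Cost of provisioned EBS capacity for a year."""
    return gigabytes * dollars_per_gb_month * 12


instance = annual_instance_cost(0.862)  # m4.4xlarge hourly price from the text
storage = annual_ebs_cost(8_000)        # 8TB of SSD-backed EBS
total = instance + storage

print(f"instance ${instance:,.0f} + EBS ${storage:,.0f} = ${total:,.0f} per year")
```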

Less powerful machines are cheaper. For instance, an m4.2xlarge with 8 vCPU’s costs $0.431/hour, but you need twice as many to get equal power and the same level of guaranteed bandwidth. The vCPU’s for a given type generally work out to about the same price.

That same $17,151 will buy a very nice commodity server outright, including the server’s share of a rack, cabling, switching, etc, so the raw cost of the hardware on AWS is something like 5X as much as the purchase price of an approximately equivalent machine with many times as much storage. Note that this is before working in the depreciation, which can offset a large part of the nominal purchase price, or the residual value of the machines after their nominal lifetime.

HDFS Storage Cost (Con)

This one was a surprise to me.

It’s typical to spend more on EBS volumes than on instances—much more. In the above example, we’re paying $9,600/server/year for the disk space alone.

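Commodity SATA drives cost less than $100/TB. Not $100/year, but $100, period. A terabyte of EBS at $0.10/GB/month costs about $100 every month, or roughly $6,000 over five years. Even at the full $100/TB, the drive comes to under 2% of the cost of EBS, and the 0.00833 ratio (less than 1%) corresponds to a drive price of about $50/TB.  Even if you go for higher performance, more reliable SAS drives, it would only reduce that advantage by half, and SAS is roughly as fast as SSD-EBS on the cloud.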

For a concrete example, take TrueCar’s often published claims about storage cost on their terrestrial clusters. They reported three years ago that they pay $0.23/GB for storage in their data center Hadoop cluster. That’s replicated, so it’s  $0.015/GB/year for raw data, compared to AWS’s current price of $1.20/GB/year, or about .0125 the cost of cloud storage—not too different from the estimate above even if you don’t allow for three years of technology improvements.

Data center nodes typically have a lot more storage, too—as of 2017, a Hadoop server might have 24 two-TB drives each. You could go considerably higher—three or four TB drives and/or dedicated disk-servers for cold storage, but let’s call it two-TB. The equivalent of 24 such drives would cost $57,600/year on AWS.

Regular S3 costs $0.023/GB/month, but TrueCar’s reported cost for SATA works out to $0.0038, which is about 1/6 the cost. Not until you get to their Glacial S3 storage, at $0.004 does AWS approach being competitive with SATA disks in the data center. S3 Glacier is only semi-online so the name may refer to the speed of access as much as to the data temperature.
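The TrueCar figures above can be reproduced with a few lines of Python. The five-year amortization period and the replication factor of three are my assumptions (they are what make the quoted numbers come out); the function name is hypothetical.

```python
def amortized_cost(purchase_per_gb, years=5, replication=1):
    """Return (dollars per GB-year, dollars per GB-month) for purchased disk."""
    per_year = purchase_per_gb / years / replication
    return per_year, per_year / 12


# $0.23/GB, spread over five years and divided by three-way replication.
raw_per_year, _ = amortized_cost(0.23, replication=3)
ebs_ratio = raw_per_year / 1.20          # EBS at $1.20/GB/year

# The same $0.23/GB expressed per month, for comparison with S3 pricing.
_, per_month = amortized_cost(0.23)
s3_multiple = 0.023 / per_month          # regular S3 at $0.023/GB/month

# 24 two-TB drives' worth of EBS at $1.20/GB/year.
big_node_ebs = 24 * 2 * 1000 * 1.20

print(f"raw disk: ${raw_per_year:.4f}/GB/year, {ebs_ratio:.4f} of EBS")
print(f"per month: ${per_month:.4f}/GB, S3 costs {s3_multiple:.1f}x that")
print(f"48TB on EBS: ${big_node_ebs:,.0f}/year")
```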

As mentioned above, the data center box has about four times the throughput, so that data can be served up by less capable servers as well because there is less need for compression and other CPU intensive steps to conserve I/O  bandwidth.

Power and Rack Space (Pro)

AWS is supplying the power for cloud deployments, as well as the cooling, rack space, hardware installation and physical security, probably with higher reliability than any lesser organization can hope to achieve.

Power and Rack Space (Con)

Say your server uses 250W. At $0.20/kWh, that’s $438/year. Double that to cover cooling and it’s only $876/server/year, which is not a large number compared to the hardware cost.
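The power estimate is easy to redo for your own wattage and utility rate. This is a minimal sketch; treating cooling as a simple multiplier on the power bill is the same simplification the paragraph above makes, and the function name is mine.

```python
def annual_power_cost(watts, dollars_per_kwh, overhead_factor=1.0):
    """Yearly electricity cost for a server drawing constant power.

    overhead_factor scales the bill to cover cooling (2.0 doubles it).
    """
    kwh_per_year = watts / 1000 * 24 * 365
    return kwh_per_year * dollars_per_kwh * overhead_factor


power_only = annual_power_cost(250, 0.20)
with_cooling = annual_power_cost(250, 0.20, overhead_factor=2.0)
print(f"power ${power_only:,.0f}/year, with cooling ${with_cooling:,.0f}/year")
```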

Even if rack space cost in your own data center is equal to the entire hardware cost, which would be a lot, the total cost of hardware, power and rack space is still under $13,000/year.

Management, Administrative Skills, User Skills (Pro)

Managers often forget the overhead of management itself when assessing relative cluster costs. Getting a data center project approved and then shepherding through the decisions on machine types, space, work schedules, etc. can consume an enormous amount of time for many months. Management hassles from up the hierarchy are also often much larger for a data center project than the analogous process for a cloud project.

A new hardware/software platform also takes hardware skills, networking expertise, cluster security skills, Linux admin skills, conceptual understanding of the platform and its goals, etc. You need to be pretty big for the necessary expertise to be a commodity item that is amortized across many machines.

Say the 12-node cloud cluster for your pilot is costing you $200K/year in AWS bills. That sounds like a lot, but that’s less than the net cost of one person, be it a manager or an engineer.

All in all, the human cost of setting up a small cluster can easily exceed the concrete costs.

Time to Production and Opportunity Cost (Pro)

This is the cloud’s silver lining. Setting up your own hardware cluster takes a lot of money and time up front. Just getting buy-in from the IT folks about what machines to purchase can be an ordeal, particularly because the ideal machine for Hadoop is quite different from what you’d choose to support typical corporate applications. It’s not just conservatism; there are significant up-front and ongoing costs to IT for each platform type it has to support.

In the cloud, a much narrower set of skills can spin up a Hadoop cluster almost on demand. If you’ve done it once or twice before, you can set up a 30 node cluster in an afternoon. Not just the machines—I’m talking about going from a standing start to executing Hive queries.

For pilots and smaller clusters, the ease of deployment and the savings in opportunity cost can outweigh all other considerations, especially in the startup world where days may count.

Finding Experts (Pro and Con)

You can get handy with the cloud pretty fast, so there are lots of experts. On the other hand, the mix of Hadoop and AWS is so complex, with so many variables, that relatively few of the experts actually know much.

Technology Pros and Cons

The VM Model (Con)

Hadoop was developed for deployment over Linux running on bare metal. Cloud deployment implies virtual machines, and for Hadoop, it’s a huge difference.

As detailed elsewhere in this blog (for instance  Your Cluster Is An Appliance or Understanding Hadoop Hardware Requirements) bare-metal deployments have inherent advantages over virtual machine deployments. The biggest of these is that they can use direct attached storage, i.e., local disks.

Not every Hadoop workload is storage I/O bound, but most are, and even when Hadoop seems to be CPU bound, much of the CPU activity is often either directly in service of I/O, e.g., marshaling, unmarshaling, compression, etc., or in service of avoiding I/O, e.g., building in-memory tables for map-side joins.

In the old days, NAS was almost unusable for Hadoop, but modern virtualization platforms do better. AWS, in particular, does an amazing job of mitigating one of the biggest shortcomings of NAS by providing dedicated bandwidth from instance to EBS on alternate networks. The pipes, however, are still limited in size, and fall far short of the bandwidth achievable across the server bus.

The second big disadvantage of VM’s (for Hadoop) is that they abstract away many of the details that would be used for optimizations in a terrestrial cluster. Hadoop is designed to take advantage of the predictability of a block-oriented workload to avoid paging and GC delays, keep pipelines and caches full, keep TLB’s from flushing, etc. Shared virtual environments undermine much of this under the normally reasonable assumption that the resources are grossly under-used anyway, which unfortunately does not apply to Hadoop.

Sister Applications (Pro)

Even if Hadoop is marginal in the cloud in your particular case, the cloud is often a superb environment for many of the other applications that probably go with it. For many of these applications, it has easy, almost unlimited scalability and cloud storage models that have excellent properties for those applications (e.g. S3.)  Having some services in the cloud and some on the ground can have large costs for data transfer as well as being a performance hit.

Ancillary Services (Pro)

Site-security, network security, transparent encryption, backup to other regions, cool and cold storage modes, communication among geographically remote VPC’s, networking among VPC’s, seamless extension to a physical data center, and innumerable similar capabilities would be hard to match at any price in a local data center.

Memory, Machine, and Rack Awareness (Con)

There are a number of other ways than I/O in which Hadoop is tuned for conventional deployments on bare metal. Rack awareness is one. Hadoop considers the racks that nodes reside on both when reading and when writing/replicating data in order to take advantage of the top-of-rack (TOR) switch in preference to the globally shared LAN. Rack-awareness also matters when idle machine capacity and the data-mounts are not co-located because data must move over the shared LAN.
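For context, Hadoop learns which rack a node is on from an administrator-supplied topology script, named by the net.topology.script.file.name property. Hadoop passes node addresses as arguments and reads back one rack path per address. Below is a minimal sketch of such a script in Python. The subnets and rack names are invented, and whether a subnet or availability zone is a meaningful stand-in for a rack in a cloud deployment is an assumption on my part, since the real racks are hidden from you.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop topology script: map node addresses to rack paths."""
import ipaddress
import sys

# Assumption: each subnet stands in for a "rack". These CIDR blocks and
# rack names are invented for illustration.
SUBNET_TO_RACK = {
    ipaddress.ip_network("10.0.1.0/24"): "/zone-a/rack1",
    ipaddress.ip_network("10.0.2.0/24"): "/zone-a/rack2",
    ipaddress.ip_network("10.0.3.0/24"): "/zone-b/rack1",
}
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Return the rack path for one IP address, or the default rack."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return DEFAULT_RACK  # hostname or malformed input
    for network, rack in SUBNET_TO_RACK.items():
        if ip in network:
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # One rack path per argument, separated by spaces, on standard output.
    print(" ".join(rack_for(arg) for arg in sys.argv[1:]))
```

On bare metal the mapping reflects real switches. In the cloud it is at best a guess about locality, which is the point of this section.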

Likewise, in a conventional deployment, the number of mappers and the size of the containers are chosen on the assumption that very little other than Hadoop will be happening on the node. This keeps the data pipelines flowing and minimizes cache and TLB misses, page faults, etc. None of this works smoothly with noisy neighbors on shared hardware.

S3 (Pro and Con)

One thing about S3 that often comes as a surprise to Hadoop users is that S3 does not underlie HDFS, in the way that SATA, SAS, SSD, or even NAS storage does. S3 is a drop-in replacement for HDFS. (You can use S3 and HDFS side by side, of course.)

S3 isn’t really a file system at all. It is a sharded key-value data store that is accessed via a REST API. It is optimized for fast lookup of random data items of medium size. What looks like path names in S3 is actually just a naming convention applied to keys that are in reality just strings. The apparently hierarchical naming does not correspond to the underlying storage in the sense that it does in a file system. There are no chains of inodes.
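To make the "keys are just strings" point concrete, here is a toy model in Python. It is not the S3 API, just a dictionary with flat keys, plus a listing function that fakes directories using a prefix and a delimiter (the real list operation takes similar parameters). The key names are invented.

```python
# A toy stand-in for a bucket: a flat mapping from key strings to bytes.
bucket = {
    "logs/2017/01/part-0000.orc": b"...",
    "logs/2017/01/part-0001.orc": b"...",
    "logs/2017/02/part-0000.orc": b"...",
    "readme.txt": b"...",
}


def list_keys(store, prefix="", delimiter="/"):
    """Emulate a directory listing over a flat key space.

    Returns (keys, common_prefixes): keys directly "under" the prefix, and
    the distinct next-level prefixes that merely look like subdirectories.
    """
    keys, common_prefixes = [], set()
    for key in sorted(store):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter in rest:
            common_prefixes.add(prefix + rest.split(delimiter, 1)[0] + delimiter)
        else:
            keys.append(key)
    return keys, sorted(common_prefixes)


print(list_keys(bucket))
print(list_keys(bucket, prefix="logs/2017/"))
```

Notice that nothing named "logs/" is stored anywhere. The "directory" exists only because some keys happen to share a prefix, so operations that are cheap on a file system, such as renaming a directory, have no direct equivalent.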

S3 can do far more I/O operations per second than a disk can, which makes it an excellent backing store for say, a large Web site serving pages,  but it is not built for streaming data. Even if it were, as an HTTP service, an application interacts with it over the LAN that is shared with all the instances, traffic to the Internet, etc.  That alone would cap it at a modest overall rate cluster-wide.

One consequence of its REST-service nature is that it doesn’t mount on any particular machine. Your entire cluster sees the same buckets. This is good and bad, because, while it makes S3 data’s lifespan independent of any particular node, the cluster, or even the VPC, inserted data is not guaranteed to be instantly consistent to all users.  A node can write data that another user won’t see for an indeterminate amount of time.

The maximum bandwidth an individual reader can get from S3 is tiny by Hadoop standards—single digit or low two-digit megabytes per second. You can have lots of readers, but it’s still very limited for streaming from the point of view of any particular reader.
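To get a feel for what that means, here is a back-of-the-envelope sketch in Python. The 10 MB/sec per-reader rate is an assumed value in the range quoted above, and the model optimistically assumes readers scale perfectly, which the shared LAN will not actually allow. The function name is mine.

```python
def hours_to_read(terabytes, mb_per_sec_per_reader, readers=1):
    """Hours to stream a data set, assuming perfect scaling across readers."""
    total_mb = terabytes * 1_000_000
    return total_mb / (mb_per_sec_per_reader * readers) / 3600


one_reader = hours_to_read(1, 10)
hundred_readers = hours_to_read(1, 10, readers=100)
print(f"1TB, one reader: {one_reader:.1f} hours")
print(f"1TB, 100 readers: {hundred_readers * 60:.0f} minutes")
```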

We touched on costs above. The good news is that the raw cost per GB is about 1/3 that of EBS. The better news is that there is no HDFS replication for S3 data, so the crude cost is more like 1/9 that of EBS.

One other peculiarity is that S3 never loses data. You can do something stupid and lose it yourself, but the service is reported to have never lost data due to an internal failure in the entire history of the product.

Unfortunately, while S3 is great for storing massive amounts of data that will rarely be accessed, the severe bandwidth limitations make it a poor fit for analytics at scale or for low-latency queries. It tends to get used for landing data for ETL, and for long-term storage.

The other thing about cost is that while it looks great compared to EBS, EBS is about 100x more expensive than general purpose SATA data center disks. So contrary to what one hears, S3, even at 1/9 the cost of EBS, is actually rather expensive compared to disk. And while it’s free to write into it from the world, you pay to read if you aren’t in the cloud. The price to read is currently around $50/TB.

Data Formats (Con)

Some of the most powerful optimizations in the Hadoop universe are built around the optimized row-column format, aka ORC files, and ORC plays relatively poorly with S3. It works better with HDFS over EBS, but not as well as over a native file system and non-virtual machines because of the many low-level optimizations that go with it.  Parquet is said to work better in the cloud. Apache and Hortonworks are hard at work on improving the performance of ORC in the cloud, particularly over S3.

Limits on Node EBS-I/O Capacity (Con)

As mentioned above, price limits the amount of EBS one can use. Beyond that, nodes cannot mount an infinite number of EBS volumes. The number varies with the OS, but figure eight volumes for RHEL. How much it matters is debatable, because you can reach the guaranteed bandwidth of the biggest instances with fewer than this.

Not Paying for Idle Time (Pro)

This is one of the big selling points for AWS. The cost arguments above implicitly assume you’re using the VM’s around the clock 365 days per year as if they were machines in the data center that you are paying for whether you use them or not. In practice, not many computers are used that way in the typical enterprise. In theory, you can reduce the number of compute nodes you use seasonally, on weekends,  or even daily for the after-work hours.
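As a rough illustration of what part-time operation could save, here is a sketch using the per-node figures from the cost section. It assumes only the instance charge stops when a node is shut down and that its EBS volumes stay allocated and billed, which is what you would want if HDFS data lives on them. The schedule and function names are made up.

```python
def always_on_fraction(hours_per_day, days_per_week):
    """Share of the week an instance is actually running."""
    return hours_per_day * days_per_week / (24 * 7)


def annual_cost(instance_per_year, ebs_per_year, hours_per_day, days_per_week):
    """Yearly cost when instances run part-time but volumes are kept."""
    running = always_on_fraction(hours_per_day, days_per_week)
    return instance_per_year * running + ebs_per_year


full_time = annual_cost(7551, 9600, 24, 7)
weekdays_only = annual_cost(7551, 9600, 12, 5)  # 12 hours a day, Monday-Friday
saving = 1 - weekdays_only / full_time
print(f"${full_time:,.0f} -> ${weekdays_only:,.0f} per node-year ({saving:.0%} saved)")
```

Even running barely a third of the time, the bill drops by well under a third, because the storage charge does not go away.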

Not Paying for Idle Time (Con)

This strategy is probably only practical if your data is backed by S3, where the data is not local to any particular node. Such nodes have minimal state and can be shut down and spun up again easily.  S3 isn’t going to work well for general purpose Hive and Pig, so this probably is moot.

For HDFS, assuming that in most use-cases you want to keep all of the data, there will usually be far too much data to move around if you are removing a significant number of nodes. (See A Question of Balance.) In a nutshell, keeping enough free space on remaining nodes so that you can simply move the data over from the deleted nodes defeats the purpose of the exercise because the volumes cost more than the instances. You’d be spending more than a dollar for each dollar you save.

Schemes to shuffle the data from the decommissioned nodes to the remaining nodes with minimal extra space work logically, but incur a great deal of data movement, which ferociously consumes the two most precious and limiting resources in the cluster: I/O bandwidth to/from EBS and I/O bandwidth on the LAN that connects the instances. You’ll be devoting a substantial amount of your cluster to saving money on your cluster for a considerable period of time.
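A quick estimate shows the scale of the problem. The sketch below is a deliberately crude model with invented cluster numbers: every node holds the same amount of data, the removed nodes drain in parallel, and each is limited only by the 250 MB/sec EBS guarantee mentioned earlier. Treat the time as a lower bound.

```python
def shrink_plan(nodes, tb_per_node, fill_fraction, nodes_removed,
                node_mb_per_sec=250):
    """Estimate what removing HDFS nodes costs in space and copy time."""
    data_tb = nodes * tb_per_node * fill_fraction
    remaining = nodes - nodes_removed
    # How full the surviving volumes end up (above 1.0 means it cannot fit).
    fill_after = data_tb / (remaining * tb_per_node)
    moved_tb = nodes_removed * tb_per_node * fill_fraction
    # Assumption: each removed node drains its own data at its EBS limit,
    # with no other work competing for the bandwidth.
    hours = tb_per_node * fill_fraction * 1_000_000 / node_mb_per_sec / 3600
    return {"fill_after": fill_after, "moved_tb": moved_tb, "hours": hours}


plan = shrink_plan(nodes=30, tb_per_node=8, fill_fraction=0.6, nodes_removed=10)
print(f"move {plan['moved_tb']:.0f}TB, survivors {plan['fill_after']:.0%} full, "
      f"at least {plan['hours']:.1f} hours of saturated I/O")
```

In this example the cluster has to run at 60% full just so that dropping a third of the nodes leaves the rest at 90%, and you pay for that headroom in EBS every month.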

For some use cases, it is possible to keep most of the data on S3 after the ETL step and move just the working set down to HDFS for a limited time. This can work, but the transfer times to/from S3 are large and may cut in on cluster performance and/or availability.

There are other possible strategies, but only in exceptional cases will it make economic sense to do this.

End Run

One great way to get started is, rather than fiddling around with tuning your own cluster, try your ideas out on Amazon Elastic MapReduce (EMR). This will be engineered for you about as well as it could be so you can get a good feel for what the real resource requirements are, how S3 and EBS compare, etc.

Personal Experience

What I have seen in practice follows. None of this is authoritative and it’s all anecdotal but I think my experience is consistent with what a lot of people report for modest size clusters (< 100 nodes.)

Even as a non-admin, I find it unbelievably easy to deploy clusters in the cloud. After you’ve done it once, you can deploy a basic Hadoop cluster in a couple of hours. Even with more advanced enterprise security such as Kerberos integration, deploying is still not difficult for a system administrator.

No doubt about it, S3 query performance is very poor and HDFS over EBS isn’t great. There is no way that an N-node EBS-backed cluster will compete with N-nodes on the ground. Not even close. I have never seen anyone get decent performance from S3, and I cannot reconcile this with what I read.

Throughput can be adequate for batch, ETL, etc., but I haven’t seen good responsiveness for interactive users. (Down the road, LLAP may mitigate much of this problem for many users.)

Whatever the number of nodes it takes to reach your desired performance level, they will be expensive per node if you plan to run the cluster indefinitely. Costs almost always come as a surprise, and it will probably be EBS, rather than the nodes that dominate the bill. Storage really adds up, and even S3 isn’t as cheap as it’s cracked up to be.


For startup companies and new projects, you can’t beat the low barrier to entry. For POC’s and experimenting, it’s a dream. Per hour, a good sized cluster can be had for a lot less than you’d pay an architect to noodle around, so a lot of arguments can be settled by setting up a cluster and trying it.

The multi-dimensional nature of the price/performance comparison makes it hard to define equivalent clusters exactly, but by any reasonable measure, for a given amount of computational power, a large Hadoop deployment in the cloud is very expensive.

Bottom line on hardware: AWS is very expensive for raw hardware. Any savings on a long-term deployment are going to be in things like time-to-deploy, administrative costs, and whether the money comes out of capital or operating expenses. You will probably not save on the cost of the processing power and certainly not in the cost of data storage.

Bottom line on performance: Hadoop in the cloud tends to perform better for traditional Hadoop workloads than for interactive responsiveness, where the latency and slower transfer rates slow down job turnaround. This is bad for internal politics because the most visible processing is probably going to be the area where it is least impressive.  Compared to the data center, it takes a lot more instances to get a given level of performance.

Bottom line on engineering: It takes much less time and expertise to set up a cloud cluster than a data center cluster. But it takes more skill to use the cloud effectively. Not just back-end expertise, either. Your ordinary user programs may need significantly more tuning and support to get the performance that the cluster is capable of.

Overall:  For new lines of business, temporary clusters, etc., the cloud can’t be beat, but anyone deploying larger clusters (fifty data center node equivalent and up) that will be around a while should give the cost in the cloud a long hard look before committing. Be particularly careful about anticipated data storage needs and what your reluctance to pay for more storage will cost you in versatility.


One thought on “Hadoop Cloud Clusters”

  1. RobbieTheK says:

    Do you know if the performance is similar with MS Azure? Also do you have a playbook you follow that you’d be willing to share on how you quickly configure HDFS in the cloud? Have you tried adding Apache Spark and/or TensorFlow?

