A new series of execution frameworks has recently been added to the Hadoop ecosystem, including Spark, Tez, Flink, Storm and Samza, each with its own strengths and weaknesses. As a result, it’s increasingly challenging to choose the best execution framework for the problems you need to solve, and one size doesn’t fit all.
Newly developed execution frameworks are initially strong only for the use cases they were built for, whether it’s small data or big data, speed or accuracy. Traditionally, Hadoop was developed as a special-purpose infrastructure for big data, with MapReduce handling massive scalability across hundreds or thousands of servers in a cluster. However, Hadoop 2.0, and most notably YARN, recently joined the fray, enabling Hadoop to support more varied data processing approaches and a broader array of execution engines such as Tez and Spark.
For example, Tez provides a powerful and extensible application programming interface (API) for independent software vendors to build YARN applications. This can be used to build both high-performance batch-processing engines that handle terabyte- or petabyte-scale datasets and low-latency YARN applications that deal well with smaller data volumes and leverage in-memory processing. Tez also has the advantage of backward compatibility, yielding strong performance gains even with unmodified MapReduce-based applications. Tez has also become the new execution engine behind Hive, making it a painless swap-in replacement to accelerate SQL-on-Hadoop applications.
Spark has a more end-user-facing API that can be used to build new data processing pipelines very easily. Spark excels at more targeted tasks like machine learning and stream processing (a la Storm), though it will likely be adapted to a broader set of use cases in the future (similar to how MapReduce evolved into MapReduce 2.0, YARN and Tez). Spark’s integration into YARN, however, is still very young and will have to prove its production readiness at large scale. A lot of development resources are also currently being dedicated to hardening the technology. As Spark continues to develop and mature, it may suit other, or additional, use cases.
In the meantime, customers should remember that compatibility and peaceful coexistence are big deals in the world of multitenant, multipurpose data infrastructure that often represents a big, shared investment by different divisions of a large organization.
You may choose Spark or Tez because it works best with the data set you are currently using or wins the benchmark you performed today, but keep in mind that those variables (and use cases) are constantly evolving, usually faster than the technology itself can keep up. Organizations need to be prepared to leverage multiple execution frameworks, and switch between them as necessary.
Accommodate changing data volumes and future use cases
Data volumes are on an upward trajectory. With increased data volume, velocity, and variety as the new reality, businesses must be flexible in how they manage data. Furthermore, businesses increasingly need to develop a near-real-time analytics capability to support intelligent, proactive and predictive business processes. To do so, they need the option to employ different big data execution frameworks and incorporate new ones as they emerge in the big data landscape.
Combine execution frameworks for most efficient performance
When it comes to data analytics, a hybrid solution is often best. Rather than choosing one framework for the complete process, the best solution may be using different frameworks for different parts of the process to optimize performance.
When you initially press on the gas in a hybrid car, you rely on the gasoline engine to accelerate, but once you start cruising, the car can switch to the electric motor to be more efficient. With the ability to use two engines, you get to the desired destination in the most effective and efficient way possible.
Analytics workflows are similar. You start by cleaning, transforming, and rolling up massive amounts of data, whether it’s raw logs, text or transactions. Those are then divided into summaries by features such as time or product. Once you’ve aggregated and reduced it to the most relevant data, you can switch engines and begin deeper analysis, whether that’s path analysis on clickstreams or using data-mining algorithms to understand churn.
In these last steps involving slicing and dicing smaller amounts of data, organizations are best served by computation frameworks that leverage either in-memory processing or single-node processing, because they make the best use of computing resources, and also get the job done fastest. Like the hybrid car example, you reach the desired destination while maximizing both efficiency and performance.
The smartest workflow without manual intervention
To best support these hybrid models, the workflows need an automatic switch, triggered in the middle of the workflow. Doing so eliminates the need to test multiple execution frameworks, port work from one framework to another, or manually fine-tune a specific framework. This automatic switching logic helps ensure that when your needs change, so does your analytics capability. And it does so seamlessly and without compromising performance or integrity.
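One way to picture such a switch is as a per-stage decision driven by the estimated size of the data entering each step. The Python sketch below is only an illustration under stated assumptions: the engine labels ("batch", "in-memory", "single-node"), the size thresholds, and the Stage, choose_engine and plan names are all hypothetical and do not correspond to any particular product or framework API.

```python
from dataclasses import dataclass

# Hypothetical thresholds; real cut-offs depend on the cluster and node memory.
SINGLE_NODE_LIMIT_BYTES = 8 * 1024**3    # assumed: fits comfortably on one node
IN_MEMORY_LIMIT_BYTES = 512 * 1024**3    # assumed: fits in aggregate cluster memory


@dataclass
class Stage:
    name: str
    input_bytes: int  # estimated size of the data entering this stage


def choose_engine(input_bytes: int) -> str:
    """Pick an engine class for a stage based only on its estimated input size."""
    if input_bytes <= SINGLE_NODE_LIMIT_BYTES:
        return "single-node"
    if input_bytes <= IN_MEMORY_LIMIT_BYTES:
        return "in-memory"
    return "batch"


def plan(stages):
    """Return (stage name, engine) pairs, deciding independently for each stage."""
    return [(stage.name, choose_engine(stage.input_bytes)) for stage in stages]


# Illustrative workflow: data shrinks as it is rolled up, so the engine changes.
workflow = [
    Stage("clean and roll up raw logs", 40 * 1024**4),    # about 40 TiB
    Stage("summarize by time and product", 200 * 1024**3),  # about 200 GiB
    Stage("churn analysis on summaries", 2 * 1024**3),    # about 2 GiB
]

for name, engine in plan(workflow):
    print(f"{name}: {engine}")
```

With these assumed thresholds, the first stage is assigned to the batch engine, the second to the in-memory engine, and the third to single-node processing. A real switch would likely weigh more than input size, such as cluster load or the type of operation, but the point is that the decision is made by the workflow rather than by the analyst.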
True innovation comes with enabling not just the “now” but what could and will be in the future.
Peter Voss is chief technology officer of big data analytics startup Datameer.