The secrets of designing and building big data apps

Aaron Kimball co-founded WibiData in 2010. He has worked with Hadoop since 2007 and is a committer on the Apache Hadoop project.

Software applications have traditionally been perceived as a unit of computation designed and used to solve a problem. Whether an application is a CRM tool that helps manage customer information or a complex supply-chain management system, the problem it solves is often rather specific. Applications are also frequently designed with a relatively static set of input and output interfaces, and communication to and from the application uses specially designed (or chosen) protocols.

Applications are also designed around data. The data that an application uses to solve a problem is stored using a data platform. This underlying data platform has historically been designed to enable optimal data storage and retrieval. Somewhere in the process of storing and retrieving data, the application applies computation to produce its results.

One unfortunate side effect of this optimized data storage and retrieval design is that it requires data to be structured in a predefined way (both on disk and during information design and retrieval). In the world of big data, applications must draw on data from rigidly structured elements, such as names, addresses, quantities, and birthdays, as well as from loose and unstructured data such as images and free-form text.

Defining and building a big data application can be perplexing given the lack of rigidity in the underlying data. This lack of structure makes it more difficult to precisely define what a big data application will do. This applies to communication interfaces, computation on unstructured or semi-structured data and even communication with other applications.

While the traditional application may have solved a specific problem, the big data application doesn't limit itself to a highly specific or targeted problem. Its objective is to provide a framework to solve many problems. A big data application manages life-cycles of data in a pragmatic and predictable way. Big data applications may include a batch or high-latency component, a low-latency (or real-time) component, or even an in-stream component. Big data applications do not replace traditional single-problem applications, but complement them.

Let's use a CRM tool as an example. The traditional CRM tool might store information about customers, their purchase history and customer loyalty level. Given a finite resource such as a customer call center, during peak loads the CRM must determine which customers should receive standard versus priority service. Typically, higher-loyalty customers will receive the priority service, with the levels of loyalty usually being pre-determined. Those levels might be driven by spend, spend ranges, or other rules, but the determination is dependent on the data, which is typically rigidly structured.

However, if the CRM tool has the ability to predict whether a given customer, even if she is not within the pre-determined loyalty range, is exhibiting behaviors known to precede high loyalty, it can make a smarter decision about how to allocate resources and suggest prioritizing her call.

Traditional applications can only operate on pre-determined formulas and data. The CRM tool needs to know in advance whether a customer is a preferred customer and should receive resource priority or not.

By operating on datasets from all facets of the business, a big data application makes it possible to join datasets in ways that were not previously possible. It provides the ability to create a feedback loop to existing applications to help make them smarter. In the example above, a CRM vendor could use a big data application to compute and analyze trends that lead to preferred customers and identify those customers earlier than previously possible. The big data application will continuously re-evaluate the predictive model score based on changes in information about each customer, as that customer interacts with the traditional applications.
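The contrast between the two approaches in the CRM example can be sketched in a few lines of Python. This is only an illustration: the `Customer` fields, the spend threshold, and the score cut-off are all hypothetical values chosen for the example, and the `predicted_loyalty` score is assumed to be computed elsewhere by the big data application and fed back to the CRM.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    annual_spend: float       # rigidly structured field the CRM already stores
    predicted_loyalty: float  # hypothetical model score in [0, 1], fed back to the CRM


SPEND_THRESHOLD = 5_000.0  # assumed pre-determined loyalty rule
SCORE_THRESHOLD = 0.8      # assumed cut-off for "likely future high-loyalty customer"


def rule_based_priority(customer: Customer) -> bool:
    """Traditional CRM: priority depends only on a pre-determined rule."""
    return customer.annual_spend >= SPEND_THRESHOLD


def prediction_assisted_priority(customer: Customer) -> bool:
    """CRM with feedback: the rule still applies, but a high predicted score also qualifies."""
    return rule_based_priority(customer) or customer.predicted_loyalty >= SCORE_THRESHOLD


# A customer below the spend threshold whose behavior resembles that of loyal customers.
alice = Customer("alice", annual_spend=1_200.0, predicted_loyalty=0.91)
print(rule_based_priority(alice), prediction_assisted_priority(alice))
```

The traditional rule stays in place; the predicted score only widens the set of customers who can receive priority, which keeps the existing application's behavior intact while making it smarter.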

So how should you approach designing and building a big data application? What are some of the considerations, decisions, or cautions to be taken when building a big data application? Here are a few recommended steps for building a sophisticated big data application:

Define objectives with an open mind

Define the tangible results of the application. Will the application merely provide the ability to join datasets across technologies? Find high-value customers earlier than existing systems? Regardless of the defined goal, keep in mind that joining data may yield insights beyond the scope of that goal. Don’t throw away data, because you may find nuggets of gold in unexpected places.

Understand volumes, sources and integration points

Define the existing applications you want to stream data to and from. Document the volumes of data and how frequently the data changes. Involve the teams that manage those datasets early on to understand the best method(s) of exchanging data from various sources. Missing key customer information can make a world of difference.

Determine the platform

Understanding the types and volume of data your big data application will operate on will dictate the platform required. Hadoop is the de facto platform since it allows for all forms of ingest and analysis, ranging from real-time, to batch, to low-latency analytics. Choose the platform that is the most versatile and fits your needs. Partner with a vendor who can support you throughout your application's lifespan.

Start with batch; graduate slowly

Big data platforms today offer a variety of ways to analyze your data, whether it be batch, real-time, or in-stream processing. Begin with offline, or batch, processing. This allows you to process and analyze your data in a manner unobtrusive to existing applications. As phases of the application mature, transition to more real-time integration points between your big data application and existing applications.

Create a process for data collection

Batch loading and processing is the right way to start a big data application project. Once you perform your initial analysis, work on creating a process that allows for incremental refresh of data. You should be able to copy just the updated changes from your source systems into your big data platform. A continuous refresh of business data can then provide up-to-date analysis.
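One common way to copy only the updated changes is to track a "high-water mark": the timestamp of the newest row already copied. The sketch below assumes each source row carries an `id` and an `updated_at` timestamp, and uses a plain dictionary as a stand-in for the big data platform; these names are illustrative, not part of any particular product.

```python
from datetime import datetime


def incremental_refresh(source_rows, platform, last_sync):
    """Copy rows changed after last_sync into platform and return the new high-water mark.

    source_rows: iterable of dicts with hypothetical "id" and "updated_at" keys.
    platform:    dict keyed by row id, standing in for the big data platform.
    last_sync:   datetime of the newest change copied by the previous run.
    """
    newest = last_sync
    for row in source_rows:
        if row["updated_at"] > last_sync:
            platform[row["id"]] = row  # insert or overwrite the changed row
            newest = max(newest, row["updated_at"])
    return newest


source = [
    {"id": 1, "updated_at": datetime(2020, 1, 1), "spend": 10},
    {"id": 2, "updated_at": datetime(2020, 1, 5), "spend": 20},
]
platform = {}
mark = incremental_refresh(source, platform, last_sync=datetime(2020, 1, 2))
print(sorted(platform), mark)
```

Storing the returned mark and passing it to the next run means unchanged rows are never copied again. This approach does not detect rows deleted at the source; that would need a separate mechanism.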

Create a 360-degree feedback loop

Once a process for data collection and updates has been established, begin creating datasets that can feed the existing applications and systems to make them smarter. Establish a 360-degree process that will schedule data and feed it into your big data application for analysis. Create smaller, incremental datasets to be consumed back by the originating application. Using this method, all the existing applications benefit from the intelligence, analysis and up-to-date information of systems that they need not know exist, or that they were unable to benefit from previously.
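A small, incremental dataset for the originating application might contain only the customers whose analysis result actually changed since the last run. The following is a minimal sketch under that assumption; the score dictionaries and the `min_change` threshold are hypothetical.

```python
def feedback_delta(previous_scores, current_scores, min_change=0.05):
    """Return only the scores worth sending back to the originating application.

    A customer is included if they are new, or if their score moved by at
    least min_change since the previous run.
    """
    delta = {}
    for customer_id, score in current_scores.items():
        old = previous_scores.get(customer_id)
        if old is None or abs(score - old) >= min_change:
            delta[customer_id] = score
    return delta


previous = {"a": 0.50, "b": 0.20}
current = {"a": 0.52, "b": 0.60, "c": 0.90}
print(feedback_delta(previous, current))  # "a" barely moved, so it is left out
```

Sending only the delta keeps the load on the existing application small, and the application does not need to know how or where the scores were computed.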

Use data mining and predictive analytics

You’re well on your way to a big data application once you create a large feedback loop that bridges all the various data sources within the business. Go further by doing deep data mining and behavioral analysis on your data to help predict where best to optimize resources. As big data technologies mature, newer, higher-level open source frameworks, such as Kiji, can lower the barrier to entry for advanced machine learning and deep data mining analytics.

Evaluate real-time, predictive models

Given previous data mining, or other analytics insights, augment your applications further by providing a real-time interface that can dynamically rescore your predictive models based on up-to-the-second changes to data. Use knowledge learned from batch/iterative research and analysis to build better predictive models that can be executed in real-time, on a per-user level. Open source frameworks such as KijiREST and KijiScoring allow this to be done nearly at the speed of thought. Being able to iterate quickly and deploy new models as trends shift allows you to capitalize on trends as they occur instead of lagging customer trends by weeks or months.
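To make the idea of per-user rescoring concrete, here is a minimal sketch in plain Python. It is not the Kiji API: it assumes a simple logistic model whose weights were learned offline in batch, and it recomputes one user's score each time an event updates that user's features. The feature names, weights, and bias are all made up for illustration.

```python
import math

# Hypothetical weights, assumed to come from earlier batch analysis.
WEIGHTS = {"purchases_30d": 0.4, "support_calls_30d": -0.3, "site_visits_7d": 0.1}
BIAS = -1.0


def score(features):
    """Logistic score in (0, 1) computed from one user's current features."""
    z = BIAS + sum(weight * features.get(name, 0.0) for name, weight in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def on_event(profiles, user_id, feature, increment=1.0):
    """Update a single user's feature as an event arrives and return the fresh score."""
    features = profiles.setdefault(user_id, {})
    features[feature] = features.get(feature, 0.0) + increment
    return score(features)


profiles = {}
first = on_event(profiles, "u1", "purchases_30d")
second = on_event(profiles, "u1", "purchases_30d")
print(first < second)  # each additional purchase raises this user's score
```

Because the weights live apart from the scoring function, a newly trained model can be deployed by replacing `WEIGHTS` and `BIAS` without touching the event-handling path.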

Continually refine the process

A successful big data application requires continuous evaluation and refinement. Is the right data being collected? Are there new data sources that need to be integrated? Are there any data sources that are stale or should be removed? Are there better predictive models we can use to provide a more accurate real-time experience? These are questions that should be continually evaluated so that the data streaming into your big data application remains relevant to the business.

Measure results and reap the rewards

This should really be included in every step above: measure, measure, measure. The only way to know the impact and performance of your big data application is by measuring the results. It might be as simple as comparing how much data is collected daily versus the number of records sent back to existing traditional applications, or as involved as implementing advanced A/B testing on behavioral models. Either way, the point is the same: the best way to know whether your big data application is achieving its objectives is to measure the results against expectations.
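For the A/B testing case, one standard way to compare two behavioral models is a two-proportion z-test on their conversion rates. The sketch below uses only the Python standard library; the counts are invented for illustration, and what counts as a "conversion" is left to the business.

```python
import math


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    conv_a, conv_b: number of conversions in groups A and B.
    n_a, n_b:       number of users in groups A and B.
    Returns (z, p_value); z is positive when B converts better than A.
    """
    pooled = (conv_a + conv_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        raise ValueError("test is undefined when nobody or everybody converted")
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Hypothetical counts: model A converts 100 of 1000 users, model B converts 130 of 1000.
z, p = two_proportion_z(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value suggests the difference between the two models is unlikely to be noise. Deciding in advance what threshold and sample size to use is part of measuring the results against expectations.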