Operating on data-in-motion means processing data immediately as it arrives, without requiring that it first be stored in some kind of database; it is an alternative to batch processing. It is typically implemented as an in-memory technology with extremely low end-to-end latency. Depending on the operations being performed and the available processing hardware, it is not unusual for the overall latency, from the moment a new data element arrives to the moment the results have been updated, to be in the sub-millisecond range.
Note that the performance described above is not just the ability to ingest large volumes of data at high rates; it also includes completing all of the processing necessary to produce the “answer” or to update a predicted value. This capability allows the organization to react immediately to current situations or changing conditions. Because the results are updated continuously, an understanding of the exact current state is always available. Where predictive models exist, the system can also update both the models and the results instant by instant according to the latest available data. This moves the insight provided from what has happened previously to what is happening now, while providing more accurate predictions of future situations by taking into account up-to-the-moment information. It also enables the agility necessary to accommodate evolving problem spaces.
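The instant-by-instant updating described above can be sketched with a simple incremental statistic. This is an illustrative example, not any particular product's API: each arriving value updates the running state in constant time, so no history needs to be stored and the current result is always available.

```python
# Hypothetical sketch of updating a result event by event, rather than
# recomputing it from stored history. All names here are illustrative.

class OnlineMean:
    """Maintains a running mean that is updated as each event arrives."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Incremental update: O(1) work per event, no stored history,
        # so the latest answer is ready the instant the event arrives.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

estimator = OnlineMean()
current = 0.0
for reading in [10.0, 12.0, 11.0]:
    current = estimator.update(reading)
# current now reflects all readings seen so far (here, a mean of 11.0)
```

The same pattern generalizes to richer models: anything that can be updated from its previous state plus the newest event fits the data-in-motion style.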
These characteristics are especially appropriate for applications that emphasize the immediate, such as feeding a real-time dashboard or triggering automated alerts or behaviors. One of many things to consider when evaluating whether data-in-motion processing would be useful for your application is how quickly your business needs to, or can, respond to the results. If a daily or monthly report is sufficient, you may not benefit from a low-latency solution.
In addition to ultra-low latency, data-in-motion systems can provide a different kind of capability: serving as a front-end processor for data-at-rest systems. Used in this way, a stream processing system can perform operations such as cleansing, normalizing, filtering, and aggregation, so that higher-value and lower-quantity data is presented to the downstream systems.
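The four front-end operations named above can be sketched as a single pass over an event stream. This is a minimal illustration, assuming hypothetical sensor events with `sensor`, `temp`, and `unit` fields; a real deployment would express the same steps in the stream platform's own operators.

```python
# Illustrative front-end that cleanses, normalizes, filters, and
# aggregates raw events before they reach a data-at-rest system.
# Field names, units, and thresholds are hypothetical.

from collections import defaultdict

def front_end(raw_events):
    """Reduce raw readings to one average per sensor for downstream storage."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for event in raw_events:
        # Cleanse: drop malformed records that lack required fields.
        if "sensor" not in event or "temp" not in event:
            continue
        # Normalize: convert Fahrenheit readings to Celsius.
        temp = event["temp"]
        if event.get("unit") == "F":
            temp = (temp - 32) * 5.0 / 9.0
        # Filter: ignore readings outside a plausible physical range.
        if not (-40.0 <= temp <= 125.0):
            continue
        # Aggregate: keep one summary per sensor instead of every reading.
        agg = totals[event["sensor"]]
        agg["count"] += 1
        agg["sum"] += temp
    return {sensor: agg["sum"] / agg["count"] for sensor, agg in totals.items()}
```

The downstream system then receives one clean, normalized row per sensor rather than every raw reading, which is the "higher value, lower quantity" effect described above.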
As capable as it is, the data-in-motion concept is often best deployed as a complement to data-at-rest systems. Understanding the immediate state in the context of historical behavior is a very powerful combination, made even more so when this interaction can be handled automatically. The combination leverages the low latency of the in-memory data-in-motion system and the data-at-rest system's ability to hold long periods of historical information. While there is certainly some crossover, for many use cases it is appropriate to use the data-at-rest system for data mining and model creation (i.e., determining what to be on the lookout for) and the data-in-motion system for real-time scoring (i.e., doing the looking).
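That division of labor can be sketched in a few lines. In this hedged example, the "model" mined from data at rest is simply an anomaly threshold derived from historical values, and the data-in-motion side scores each arriving value against it; the data and the model form are illustrative stand-ins for whatever a real deployment would mine.

```python
# Sketch of the data-at-rest / data-in-motion split described above.
# The history, the threshold model, and the stream are all hypothetical.

import statistics

def build_model(history):
    """Offline (data at rest): mine history for what to look out for."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    # Flag anything more than three standard deviations above normal.
    return mean + 3 * stdev

def score(stream, threshold):
    """Online (data in motion): do the looking, one event at a time."""
    for value in stream:
        yield (value, value > threshold)
```

The model is rebuilt periodically from the historical store, while scoring runs continuously against live data, so each system does what it is best at.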
As understanding of the benefits of data-in-motion stream processing becomes more widespread, the availability of platforms that support the paradigm is also growing. The initial entry into the space was a technology developed by IBM in cooperation with the US Government that was eventually productized as InfoSphere Streams. This product shares some similarities with Complex Event Processing (CEP) systems, although there are also substantial differences. Also falling into this space are relatively new open source stream processing systems such as S4, Storm, and Spark. Splunk is a somewhat different offering that also has substantial overlap for operating on certain forms of machine data, such as system logs.
Many considerations need to be taken into account in order to choose a stream processing platform wisely. A growing number of platforms claim to accommodate the three V's of big data (volume, variety, and velocity), and that may be sufficient for applications that are not mission critical. But there are many use cases for which other factors should also be considered. Examples include:
In our experience, evaluating the points listed above, along with many others, has led to the selection of the InfoSphere Streams platform as the obvious choice for the majority of our clients' mission-critical solutions.
Resources to learn more
In addition to contacting us for a personalized conversation, you can find more information on stream processing in general and the InfoSphere Streams platform at the resources below.
Product website – includes two forms of a quick start edition and links to whitepapers and other resources.
Streamsdev – Developer community website for Streams
Streams Developer Ed (7 part series of short topics)