Sharpe Engineering has been working on various aspects of a hybrid hierarchical analytical framework for situational awareness and decision support since the early 90’s. The primary limiting factor has always been a suitable platform on which to implement it. The recent availability of InfoSphere Streams from IBM has finally removed that roadblock. One important aspect of this project is to leverage this emerging commercial technology to provide the execution framework upon which, guided by the UPP, we will build a suite of modular and reusable components out of which we will then create specific applications. As delivered by IBM, the Streams platform is analogous to a database or programming language. It is critical enabling technology and innovative in its own right, but the value lies in its creative application. By combining concepts of the UPP with this execution platform we can produce solutions that meet all the goals identified earlier. This approach is general purpose and the resulting implementation would be sufficiently modular and flexible to support its use in many different problem spaces. We will now describe in a bit more detail the characteristics of Streams which we feel are necessary requirements for creating a system capable of producing better knowledge faster.
Note that we use the convention of having ”streams” with a lower case leading ‘s’ to refer to the generic concept of streaming data and “Streams” with an upper case leading ‘s’ as shorthand for the IBM product InfoSphere Streams.
Even with all the other characteristics listed later, the platform would not be sufficient for our needs if it didn’t support the computational expressiveness required to implement sophisticated processing elements. InfoSphere Streams applications are implemented using SPL (Stream Processing Language) which is a language and framework designed from the ground up for writing stream-processing applications. Among its goals is to ease the burden associated with designing and implementing complex, large-scale, and scalable applications running on the distributed computing platform supplied by InfoSphere Streams.
SPL’s syntax and structure is centered on exposing the controls to the main tasks associated with designing applications while, at the same time, hiding the complexities associated with:
In addition to numerous built-in functional operators, the SPL language supports several mechanisms for extensions and the Streams platform provides two powerful analytical toolkits and enables the creation of others.
While the computational expressiveness illustrated above is an absolute requirement for the analytical platform, it is not sufficient. One of the differentiators between InfoSphere Streams and other commercial complex event processing systems is the unprecedented speed at which it performs. Recent benchmark experiments have demonstrated that throughput rates of 5-10 million messages per second can be handled on commodity hardware with latencies measured in the microseconds. Even a small Streams system can easily handle hundreds of thousands of messages per second. These values correspond favorably with highly tuned custom coded solutions running on specialized hardware. But keep in mind that this is an easily programmed, agile, general purpose platform. The two tables below illustrate performance on a Quad core Intel Xeon system running Linux RHEL 5 update 3.
Round Trip Latency (microseconds) |
||
Network |
Message size (bytes) |
Latency (Microseconds) |
Ethernet (1GbE) |
512 |
820 |
Native InfiniBand |
512 |
70 |
Table 1 Streams Latency
Throughput |
||
Message size (bytes) |
Ethernet Rate (MB/s) |
InfiniBand Rate (MB/s) |
32 |
32.5 |
49.2 |
128 |
105 |
238 |
1,024 |
117 |
380 |
16,384 |
118 |
355 |
Table 2 Streams Throughput
Obviously the actual throughput of each application will be dependent on the details of environment in which it is run, the size of the stream tuples, nature of the calculations occurring in each processing element, the topology of the analytical framework, degree of parallelism, and a myriad of other criteria. However, the benchmark numbers and our conversations with other Streams users lead us to believe that total response times for even challenging problems present in the use cases we are evaluating will likely have end to end latency times of a few seconds. This level of computational performance provides several benefits.
A very short end-to-end computation time results in a current worldview rather than one representing some point in the past. Furthermore, since the system is operating in real-time on data in motion rather than relying on having to first place it in structured storage (e.g. database or data warehouse) the world view is constantly kept current, and requirements for storage for analytical purposes is reduced or eliminated. However, we expect to do even better than that. Using some of the capabilities present in the data mining toolkit described later, certain kinds of predictions can also be made in real-time, producing results seconds, hours, or days before the actual results will be known thereby allowing additional time to formulate and optimize a response. Furthermore even the predictions can be constantly updated as new information comes into the system. For certain situations a very low value for absolute latency is the direct primary benefit, but for human-in-the-loop situations the real-time capability of the solution will facilitate interactive tuning and experimentation as well as simply giving them more time to contemplate a result or generate a response.
Low latency by itself doesn’t guarantee high throughput. Another important factor is for the system to accommodate large volumes of data. The highly efficient execution of the Streams runtime contributes to its ability to deal with very large data volumes (hundreds of MB per second) but Streams also easily supports horizontal scaling wherein additional processing elements can be deployed in parallel either to optimally balance the compute resources or to expand to additional nodes.
The previous two performance characterizations relate to being able to tackle increasingly larger and more complex problems. However, another aspect of the high execution efficiency of the Streams platform is that it allows scaling down as well. By this we mean it will enable us to address problems that would have been too compute intensive to exist on a given platform using a less efficient technology. This is particularly valuable when we consider distributed decentralized deployment to field nodes that won’t necessarily have the resources of even a modestly sized server. While a less capable execution platform certainly can’t accomplish the same level of computation of a larger system, having a more efficient runtime does allow for proportionally greater analytical capacity for a given size machine. Streams can comfortably run on a laptop (although it currently requires one of several versions of Red Hat Linux).
Many problems we need to be able to address are constantly evolving. New behaviors must be accounted for, additional inputs incorporated, alternate algorithms evaluated, etc. In many cases the development, test, deployment cycle of custom software is simply either too time consuming, expensive, or both. The Streams platform has a number of characteristics that reduce the time and effort required to produce, modify or enhance a solution. Some of these characteristics work to enhance the productivity of the development/test/deploy cycle, and others allow the system to dynamically self-evolve. Both are valuable but it is the latter category that we are particularly excited about.
Agility also facilitates human-in-the-loop interaction, where appropriate, for things like validation, tuning, experimentation, etc. A system that expresses this kind of agility means we don’t have to depend on a silver bullet algorithm or technique, rather it supports efficient incremental evolution which is important for incorporating ongoing adjustments required by dynamic problem spaces.
One of several characteristics relating to agility is the ability to rapidly produce or enhance an application. Things that can contribute to this are a development language that supports modularization, a service oriented architecture, and components that are sufficiently configurable either in code or at runtime to allow them to be reused with minimal effort. The Streams platform and SPL programming language provide features support these concepts.
In addition to having a rich set of built in functionality for operating on streaming data, SPL supports mechanisms for creating user defined functions and operators as well as user defined built in operators. Toolkits comprised of computational modules and adapters built using these extensions can foster reuse and provide an efficient mechanism to add or replace functionality in an application.
SPL operators have controlled access to the InfoSphere Streams runtime for introspecting and obtaining information about the runtime environment as well as selected information on other applications. This access is important to create solutions that are reactive to workload changes as well as changes in the runtime environment. As we’ll see later, SPL also has support for dynamically modifying the flowgraph which allows for runtime application composition. Support for introspection is a key requirement to leverage that capability to produce applications that can collaboratively inter-operate at runtime.
In addition to introspection, there are additional services for retrieving performance metrics from other applications as well as from the runtime components themselves and there is a powerful capability to expand on those metrics programmatically thru the use of API's. While these extensions could probably have been built manually, it is convenient that they are natively supported. In addition to supporting the creation of various kinds of operational dashboards, these performance metrics can be used in some instances to automatically address performance issues. For example if the system detects that a particular layer of the processing has become slower than allowed by a service level limit (set either manually or computed dynamically) then some corrective action could be taken such as automatically adding more processing elements to temporarily increase the parallelism of that portion of the process or modifying a parameter to reduce the computational depth of the algorithm.
When considering a distributed and decentralized system we believe it is important to have a mechanism for conveying metadata between collaborating nodes. Streams and the stream paradigm allow for the embedding of this metadata directly into the data stream. This capability allows the information being communicated to be transactionally complete, in that no other out of band communication is required to effectively and accurately work with the data. Examples of metadata that would be useful to embed include:
This also helps address the problem of making the output of modules visible and understandable since the exposed content includes both the information and metadata. This in turn supports things like multiple dissimilar algorithms and distribution/decentralization. By embedding this information in the streams it allows us to effectively expose all parts of the unified model because even processes that might otherwise have been considered internal, produce and consume self contained input and output streams. This concept of embedded metadata is one of the key enablers of the UPP.
When we were first exploring the capabilities of the Streams runtime platform, one characteristic that we found particularly exciting was the high level of support for dynamically modifying an operationally live system. For example, if a new algorithm for obtaining a particular value becomes available, it is a fairly simple matter to split the input stream(s) and send a duplicate copy to the new operator. Then downstream processing elements could be deployed to compare the results of the original implementation and the new one. This ability to quickly and safely alter a live system in such a fundamental way is an incredible asset for providing runtime application agility.
for more information on InfoSphere Sterams and Stream processing in general here are some relevant links.
http://www-01.ibm.com/software/data/infosphere/streams/
http://www-07.ibm.com/innovation/au/ideas/streamcomputing/