Sharpe Engineering Inc.

Stream Processing

Stream processing, which is somewhat related to complex event processing (CEP), is a relatively new data handling paradigm for dealing with data in motion. In other words, it extracts the relevant knowledge from the data without first having to store it in a database.
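
As a rough illustration of processing data in motion, the sketch below computes a sliding-window mean over readings as they arrive and forwards only the values that cross a threshold, holding nothing but the current window in memory. The names `sliding_mean` and `alerts`, the window size, and the threshold are assumptions made for this example; they are not part of the project's design or of any particular streams platform.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_mean(readings: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` readings as each one arrives."""
    recent = deque(maxlen=window)  # only the current window is held in memory
    total = 0.0
    for value in readings:
        if len(recent) == window:
            total -= recent[0]  # oldest value is about to be evicted by append
        recent.append(value)
        total += value
        yield total / len(recent)


def alerts(means: Iterable[float], threshold: float) -> Iterator[float]:
    """Pass through only the window means that exceed the threshold."""
    return (m for m in means if m > threshold)


if __name__ == "__main__":
    sensor = iter([1.0, 2.0, 3.0, 10.0, 11.0])  # stands in for a live feed
    for mean in alerts(sliding_mean(sensor, window=3), threshold=4.0):
        print(mean)
```

Because both stages are generators, each reading flows through the whole pipeline before the next one is pulled from the source, which is the property that distinguishes this style from a store-then-query approach.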

Much of the data being produced today is underutilized. The knowledge it contains is either not being recognized or not producing actionable insights. Effectively processing the ever-increasing volumes of data requires not just one technique but a cohesive capability that combines fusion (bottom-up analysis), resource management (top-down analysis), and data mining (both a priori and predictive methods) with human interaction. To that list we also add a common semantic model, metadata-enhanced communication, and an execution platform that is capable, performant, distributed, scalable, and agile.
The purpose of this project is to design such a system and demonstrate the feasibility of the approach to address complex problems that exist in today’s decentralized and distributed net-centric environment.  Our term for this comprehensive collection of concepts is a "unified processing paradigm" (UPP).  This report outlines key aspects of the approach, why we believe each is important and how the design applies this combination of concepts to produce a solution that is not so much a silver bullet, as an agile technology for incrementally addressing complex problems in an evolving world.

Problems

In order to more effectively frame the descriptions of the technical approach and work completed on this project we will first briefly describe several characteristics of the problem we are addressing and why they are important.

Drowning in Data

There are many ways to drown in data, but the proliferation of sensors and the rate at which they output data are common causes in the fusion and data mining domains, and they present both challenges and opportunities for resource management. We have found that accommodating the increasingly large amounts of data being produced involves two fundamental ideas.

First, there is an unmet need for greater automated coverage over the complete fusion and resource management model (i.e. comprehensive analytical capability). A more complete unified analytical framework can reduce the load on knowledge workers while taking advantage of a greater amount of contextual information.

Second, higher performance solutions are needed that can operate in real time on data in motion without the requirement to first store it in some repository. Reducing the overall latency of the analytical solution to milliseconds or seconds allows the generated world view to represent the current situation, rather than what it was minutes, hours, or even days in the past. Improved execution speed also allows us to take on more challenging problems while still meeting performance requirements. Furthermore, in some situations the same execution efficiency that produces extremely high performance in larger computing environments can allow more of the analytical processing to migrate closer to the source. For example, in the case of a group of unmanned surface vehicles, a more computationally efficient solution will allow more of the fusion and collaboration capability to be deployed to the individual vehicles, even when onboard computing power is limited.

Evolving Problem Spaces

While many problems can be solved today with custom software solutions, the time it takes to define, implement, test, and deploy the solution often exceeds the window of opportunity, and there is frequently insufficient support for experimentation and tuning. Some aspects of agility relate to the execution platform (e.g. language, architecture, runtime environment) while others relate to the analytical model (e.g. algorithm design, semantic model, metadata). There is an unmet need for agility through the ability to alter, at runtime, both the overall topology and the configuration of the individual processing elements of an analytical system, either as a result of operator direction or automatically in response to a detected condition.

Geographically Dispersed Data Sources/Sinks

Many of today’s challenging data processing problems do not have the luxury of existing in a single location where all the required data is conveniently aggregated.  Often the sources of the data as well as the computing resources to operate on it are geographically and organizationally separated.  There is an unmet need to enable decision support and resource management functionality for time sensitive, collaborative, net-centric operations among distributed computational environments.

In addition to operating in a distributed capacity, we also need to accommodate a decentralized deployment strategy. The distinction is that distributed simply refers to geographically separated processing centers, whereas decentralized implies a horizontal or peer-based control capability. While there are many features either required or helpful to accomplish this, a few of the major ones are a service oriented architecture, publish and subscribe notification capability, common shared semantic models, efficient interface capability for data sources and sinks, and modular analytical elements. Each of these is a characteristic of the architecture or platform. Additionally, the analytical model and application design must accommodate other considerations, such as a modular approach for collaboration and not being dependent on perfectly reliable communications.
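
The publish and subscribe capability mentioned above can be sketched in a few lines. This is a minimal in-process illustration only, assuming a hypothetical `Broker` class and topic names; a real deployment would sit on top of a networked messaging layer and would need to handle delivery failures between nodes, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], None]


class Broker:
    """In-process topic broker: publishers and subscribers share only topic names."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a callable to be invoked for every message on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> int:
        """Deliver the message to every subscriber; return the delivery count."""
        delivered = 0
        for handler in list(self._subscribers.get(topic, [])):
            try:
                handler(message)
                delivered += 1
            except Exception:
                # A failing subscriber must not block delivery to the others.
                continue
        return delivered


if __name__ == "__main__":
    broker = Broker()
    broker.subscribe("tracks", lambda msg: print("received", msg))
    broker.publish("tracks", {"id": 1})
```

The publisher never holds a reference to any subscriber, so either side can be added or removed without changing the other, which is the decoupling that peer-based control depends on.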

Requirements

By the conclusion of Phase I we need to have demonstrated feasibility in several critical technology areas.  Specifically we need to show that the proposed approach is capable, performant, agile, distributable, modular and generalizable, and able to interface with a wide range of external net-centric systems. Below we elaborate on these characteristics and later discuss techniques that our research has identified for addressing each of them.

Capable

First and foremost, the solution must be functionally capable of producing the required result. In the context of the sample use cases, this typically means turning data into knowledge and actions. It is our belief that, because these domains are interrelated, such a system needs to provide unified and simultaneous coverage of fusion, data mining, resource management, and human interaction.
In order to be actionable, the knowledge produced by the system for use both internally and externally must have consistent and well understood semantic meaning. The overall solution must be reliable, meaning that it continues to operate under all expected conditions. It must be trustworthy, and in situations where it fails it must communicate the degree of reliability. Techniques such as confidence metrics, collaboration algorithms, and redundancy need to be pervasively exploited. Finally, it needs to be able to handle a mix of classified and open information in a secure manner.
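
One simple way to combine redundancy with a confidence metric is inverse-variance weighting, sketched below. This is a textbook technique chosen for illustration, not a method named in this report. The function name `fuse` is hypothetical, and the sketch assumes that the redundant estimates are independent and that each comes with a known, positive variance.

```python
from typing import Iterable, Tuple


def fuse(estimates: Iterable[Tuple[float, float]]) -> Tuple[float, float]:
    """Combine (value, variance) pairs by inverse-variance weighting.

    Returns the fused value and its variance. A lower variance indicates
    higher confidence in the fused result.
    """
    weight_sum = 0.0
    weighted_total = 0.0
    for value, variance in estimates:
        if variance <= 0:
            raise ValueError("variance must be positive")
        weight = 1.0 / variance  # more certain estimates count for more
        weight_sum += weight
        weighted_total += weight * value
    if weight_sum == 0.0:
        raise ValueError("at least one estimate is required")
    return weighted_total / weight_sum, 1.0 / weight_sum


if __name__ == "__main__":
    # Two equally trusted sensors reporting the same quantity.
    print(fuse([(10.0, 1.0), (12.0, 1.0)]))
```

The returned variance is never larger than the smallest input variance, so a downstream consumer can see how much the redundancy improved confidence rather than receiving a bare value.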

Performant

Speed is a currency with which we can purchase a number of beneficial system characteristics. The most obvious is that being able to respond faster than an adversary is always a desirable trait. Beyond that, speed allows us to execute more operations within a given amount of time to perform more complex analysis, or to execute the same number of operations on lesser hardware, both of which can provide significant value. In the scope of this project, speed comes in two forms. First and foremost, the execution platform must be inherently fast. Second, the analytical techniques and deployment approaches employed for each part of the overall problem must be chosen to appropriately balance functionality and performance.

Agile

A custom coded, single purpose, monolithic solution could probably be created for most problems. Unfortunately, by the time the solution was finished it might no longer be optimal, or even relevant. By creating modular and configurable processing elements and an environment guided by a service oriented architecture (SOA), on a platform that supports dynamic runtime modification, we will be able to rapidly evolve a solution to keep pace with changing requirements.

Distributable and Decentralized

As we mentioned previously, distributed and decentralized are related but distinct descriptors and present different challenges. While parts of this requirement relate primarily to the platform architecture, the way the application is mapped to the processing model, the analytical techniques that are chosen, and the way that knowledge is communicated, are also important to enable operation in this kind of deployment scenario.

Modular, Generalizable, and Tuneable

Modularity allows individual portions of a streams application to be exposed thereby supporting net-centricity, experimentation, incremental evolution and operational adaptability. It also supports the potential to separate how information is gathered from its structure and content.
Decomposition of solutions into modular elements presents the opportunity that some of those components could be reusable in other situations.  This concept can be combined with support for external configuration to further the potential to generalize their applicability.  If some of the configuration can be adjusted at runtime (live parameters) it provides for tuning either under the control of another component or by a human-in-the-loop.  The latter case is one form of experimentation.
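
The idea of a live parameter can be sketched as a processing element whose configuration is read on every item rather than fixed at construction. The class name `ThresholdFilter` and its `set_threshold` method are assumptions made for this example, and the lock is included only because the tuning call is assumed to come from another thread, such as an operator console or a supervising component.

```python
import threading
from typing import Iterable, Iterator


class ThresholdFilter:
    """Processing element whose threshold can be changed while it runs."""

    def __init__(self, threshold: float) -> None:
        self._threshold = threshold
        self._lock = threading.Lock()

    def set_threshold(self, value: float) -> None:
        """Update the live parameter; takes effect on the next item processed."""
        with self._lock:
            self._threshold = value

    def process(self, stream: Iterable[float]) -> Iterator[float]:
        """Yield items at or above the current threshold."""
        for item in stream:
            with self._lock:
                limit = self._threshold  # re-read for every item
            if item >= limit:
                yield item


if __name__ == "__main__":
    element = ThresholdFilter(5.0)
    output = element.process(iter([1.0, 6.0, 3.0, 4.0]))
    print(next(output))          # filtered with the initial threshold
    element.set_threshold(3.0)   # tuned while the stream is still open
    print(list(output))          # remaining items use the new threshold
```

Whether the caller of `set_threshold` is a human-in-the-loop or another component makes no difference to the element itself, which is what allows the same module to serve both tuning and experimentation.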
Modular components support the notion of toolkits, wherein groups of reusable functionality common to a particular domain can be aggregated. As a sufficient collection of modules and toolkits becomes available, it enables net-centric composition, bringing different algorithms into play as required by changing information needs.
It is a design goal to define the majority of the processing capabilities in terms of modular elements that can be deployed in a service oriented architecture, thereby exposing all categories of the unified processing paradigm components as services. Furthermore, by constructing each in either a model driven, or at least configurable, way we can maximize the opportunity for reuse either in other processing nodes, other parts of the problem space, or even in entirely different applications.

Net-Centric

The magnitude and complexity of current and future problems often require a cooperative effort that spans organizations and sometimes even governments. A net-centric analytical system must be able to easily accommodate receiving information from, and distributing it to, many different sources. This information will vary in format, quality, and reliability, and can carry varying security restrictions. In order to participate effectively in this environment, it is imperative that the platform and application design be able to integrate efficiently and effectively with this broad range of systems that are outside of our direct control.

 


 

©2010 Sharpe Engineering Inc