A little over two years ago I wrote a series of blog posts introducing Insight-as-a-Service. My idea of how companies can provide insight as a service started with observing my SaaS portfolio companies. Like all SaaS companies, they collect and store application usage data in addition to each customer's operational data used by their SaaS applications. As a result, they have the capacity to benchmark the performance of their customers and help them improve their corporate and application performance. I had then determined that insight delivered as a service can be applied not only to benchmarking but also to other analytic- and data-driven systems. Over the intervening time I came across several companies that started developing products and services that build upon the idea of insight generation and of providing insight as a service.
However, the more I thought about insight-as-a-service, the more I came to understand that we didn't really have a good enough understanding of what constitutes insight. In today's environment, where corporate marketing overhypes everything associated with big data and analytics, the word "insight" is being used very loosely, most of the time to indicate any type of data analysis or prediction. For this reason, I felt it was important to attempt to define the concept of insight. Once we define it, we can then determine whether we can deliver it as a service. During the past several months I have been interacting with colleagues such as Nikos Anerousis of IBM, Bill Mark of SRI, Ashok Srivastava of Verizon and Ben Lorica of O'Reilly in an effort to define "insight."
An insight is the identification of cause-and-effect relations among elements of a data set that leads to the formation of an action plan, which in turn results in an improvement as measured by a set of KPIs. Insights are discovered by reasoning over the output of analytic models and techniques. This output can take the form of predictions, correlations, benchmarks, outlier identifications and optimizations.
The evaluation of a set of established relations to identify an insight and the creation of an action plan associated with that insight both need to be done within a particular context, and both necessitate the use of domain knowledge.
Most analytic model outputs do not provide insights. There are two reasons for this. First, the models don't suggest a meaning for each of their findings. Second, they don't put each finding in an actionable context (even if the meaning were known). Finding a pattern doesn't imply that you automatically find its meaning or that you understand it. It just implies that you have found a correlation within a data set. Moreover, finding causality is not, by itself, sufficient for generating an insight. One needs to be able to derive an action plan that can be applied successfully and effectively, i.e., with impact, in a particular context. This requirement implies that even knowing the meaning of a finding doesn't tell me how to generalize it and use it in the context I am trying to impact. That step requires knowledge of my environment (business, social, education, etc.), my strengths and weaknesses, other forces that may enhance or diminish my efforts, etc.
An insight must be:
- Stable. This means that an insight must not vary depending on the relation-identification algorithm or model being used. For example, if I use two different samples from the same data set to create predictive models with the same model-creation method, then the resulting models have to produce identical results when given the same new input data.
- Reproducible. This means that regardless of how many times I feed a particular data set through an insight-generation system, the same insight will be produced.
- Robust. This means that a certain amount of noise in the input data will not diminish the quality of the insight. This is a particularly important requirement in big data environments. Insight-generation systems must be able to organize noisy data and focus on the data that makes "sense," based on a particular context.
- Enduring. This means that the insight is valid for an amount of time that is related to the underlying data's "half life."
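The stability and reproducibility requirements above can be turned into simple checks. What follows is only a minimal sketch under stated assumptions: the least-squares line stands in for whatever model-creation method is actually used, the data is synthetic, and the names `fit_line`, `is_stable`, `is_reproducible` and the `tolerance` parameter are hypothetical. In particular, "identical result" is relaxed here to "within a tolerance," which is my assumption rather than part of the definition.

```python
import random
from statistics import mean


def fit_line(points):
    """Ordinary least squares for y = a*x + b on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def predict(model, x):
    a, b = model
    return a * x + b


def is_stable(data, new_inputs, tolerance, seed=0):
    """Fit the same method on two random samples of the same data set and
    check that both models agree (within tolerance) on the same new inputs."""
    rng = random.Random(seed)
    half = len(data) // 2
    model_a = fit_line(rng.sample(data, half))
    model_b = fit_line(rng.sample(data, half))
    return all(
        abs(predict(model_a, x) - predict(model_b, x)) <= tolerance
        for x in new_inputs
    )


def is_reproducible(data):
    """Feeding the same data set through twice must give the same model."""
    return fit_line(data) == fit_line(data)


# Synthetic, illustrative data: y = 2x + 1 plus a little Gaussian noise.
noise = random.Random(42)
data = [(x, 2.0 * x + 1.0 + noise.gauss(0, 0.1)) for x in range(200)]

stable = is_stable(data, new_inputs=[10, 50, 150], tolerance=0.5)
reproducible = is_reproducible(data)
```

A robustness check could follow the same pattern: refit on a copy of the data with extra noise injected and compare the predictions against the original model.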
Because of the above requirements, insight-generation necessitates deeper analysis, including causal analysis, of the underlying relation-identification models, rather than just testing each model's accuracy, as is typically done in predictive analytics tasks. Such causal analysis implies that when trying to generate insights it is preferable to utilize machine learning techniques that describe patterns declaratively, e.g., decision trees, rather than black box approaches, e.g., neural nets and genetic algorithms. As a result of this requirement, one may need to sacrifice prediction accuracy and speed for expressiveness. Therefore, one needs to identify the domains where insight-generation may be more important than predictive accuracy. Moreover, because the models themselves need to be analyzed, simpler models may be preferred to more complex ones.
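To illustrate what "describing a pattern declaratively" can look like, here is a small sketch of a one-level decision tree (a decision stump) whose learned pattern can be printed as a readable rule. Everything in it is an assumption made for illustration: the feature names, the toy churn data, and the helper names `fit_stump` and `describe` are hypothetical, and a real system would use a full decision-tree learner rather than this exhaustive search.

```python
def fit_stump(rows, feature_names):
    """Search every (feature, threshold) split and keep the one with the
    fewest misclassifications. rows is a list of (features, label) pairs,
    where features is a tuple of numbers and label is a bool."""
    best = None
    for j, name in enumerate(feature_names):
        for threshold in sorted({features[j] for features, _ in rows}):
            for label_if_le in (True, False):
                errors = sum(
                    (label_if_le if features[j] <= threshold else not label_if_le)
                    != label
                    for features, label in rows
                )
                if best is None or errors < best["errors"]:
                    best = {
                        "feature": name,
                        "threshold": threshold,
                        "label_if_le": label_if_le,
                        "errors": errors,
                    }
    return best


def describe(stump, positive="churn", negative="no churn"):
    """Render the learned split as a human-readable IF/THEN rule."""
    then_label = positive if stump["label_if_le"] else negative
    else_label = negative if stump["label_if_le"] else positive
    return (
        f"IF {stump['feature']} <= {stump['threshold']} "
        f"THEN {then_label} ELSE {else_label}"
    )


# Toy, made-up data: (logins_per_week, support_tickets) -> churned?
rows = [
    ((1, 0), True),
    ((2, 3), True),
    ((0, 1), True),
    ((5, 2), False),
    ((7, 0), False),
    ((6, 4), False),
    ((3, 1), False),
]

stump = fit_stump(rows, ["logins_per_week", "support_tickets"])
rule = describe(stump)
```

The rule string is something a domain expert can read, question and place in context, which is the property that makes such models easier to analyze than a black box, even when the black box predicts more accurately.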
Insight-generation is not a single-shot process. Once an insight is generated and the associated action plan is created, it is important to apply the plan in the particular context and measure its impact. The collected data must then be compared to the set of established KPIs in order to determine whether the particular insight/action-plan pair led to an improvement. Depending on this analysis, the system must then decide whether to attempt to improve the action plan, create a completely new plan (assuming that alternatives can be found), or try to create a brand new insight. This means that from a set of initial input data the insight-generation system must seek to derive all possible predictions, based on the set of available models.
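The iterative process described above could be sketched roughly as follows. This is not a definitive design: the `Insight` class, the `run_insight_loop` function, the `apply_plan` callback and the simulated KPI numbers are all hypothetical, and the fixed order of fallbacks (refine the plan, then try another plan, then try another insight) is an assumption I make for the sake of the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Insight:
    description: str
    action_plans: List[str]  # candidate plans, in order of preference


def run_insight_loop(insights, apply_plan, kpi_target, max_refinements=1):
    """Apply each insight/action-plan pair, measure the KPI, and decide
    whether to refine the plan, switch plans, or move to another insight."""
    log = []
    for insight in insights:  # last resort: a brand new insight
        for plan in insight.action_plans:  # next: a completely new plan
            for refinement in range(max_refinements + 1):  # first: improve the plan
                measured = apply_plan(insight, plan, refinement)
                log.append((insight.description, plan, refinement, measured))
                if measured >= kpi_target:
                    return (insight.description, plan, refinement), log
    return None, log


def simulated_kpi(insight, plan, refinement):
    """Stand-in for applying a plan and measuring its impact (made-up numbers:
    KPI lift in percentage points, improving by one point per refinement)."""
    base_lift = {"discount": 2, "onboarding email": 4}
    return base_lift[plan] + refinement


insights = [
    Insight("price-sensitive accounts churn", ["discount"]),
    Insight("accounts with few logins churn", ["onboarding email"]),
]

winner, log = run_insight_loop(insights, simulated_kpi, kpi_target=5)
```

In this toy run the first insight never reaches the KPI target, so the loop falls through to the second insight, whose plan meets the target after one refinement. The `log` keeps every measurement so that each insight/action-plan pair can be compared against the established KPIs afterwards.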