Our research is focused on constructing condensed representations from data streams. These representations are updated with every new instance that arrives and are explicitely designed to support reasoning over the entire data stream. One of their main building blocks are online density estimators that can be manipulated by inference algorithms and, therefore, provide a means to extract relevant information from it. By combining these inference algorithms, one can also perform more complex tasks such as pattern mining.

Density Estimation

In order to model discrete joint probability distributions, we employed so-called classifier chains and a set of probabilistic online classifiers. The overall chain aims to model the dependencies among the features, and the classifiers in a chain aim to model the probabilities of a particular feature, where the individual estimates are combined using the product rule. To increase the robustness of the estimate, we also proposed variants that use ensembles of classifier chains and ensembles of weighted classifier chains, where the estimate is computed from a set of classifier chains (usually 10).

The density estimates, however, are not only designed to estimate the probability distribution of the data stream, but enable inference tasks to manipulate the estimate and, thereby, extracting relevant information from it. This is useful to answer questions about the data distribution of the stream and can also be used to perform complex data mining tasks such as pattern mining (see below).

If the data stream consists of many variables, this approach is easily pushed to its limits. Therefore, we also designed an online density estimator for data of higher dimensionality. The main idea is to project the data stream into a lower dimensional space, estimating the density there, and then performing a back translation.

Read more: Online Estimation of Discrete Densities, Online Density Estimation of Heterogeneous Data Streams in Higher Dimensions

Pattern Mining

Performing complex data mining tasks on the density estimate without having access to the raw data is one of the main motivations behind our current research. To show that this can be achieved with online density estimates, we designed itemset and association rule mining algorithms that only operate on the density estimate of data stream. The presented approach randomly modifies the estimate and then samples instances with a high probability.

Read more: A Probabilistic Condensed Representation of Data for Stream Mining

Recurrent Data Distributions

As with many data stream applications, the distribution of the stream is usually not fixed but subject to changes. Although the online density estimates that we proposed are able to adapt to these changes, they can still only represent the current state. All information originating from historical data is simply lost, which unfortunately prevents a detailed analysis of the data stream. To overcome this limitation, we extended the current framework to model all the distributions of the data stream. Since data distribution can also be recurrent, we additionally provided mechanisms to identify recurrent density estimates and recurrent parts of densities.

Read more: Modeling Recurrent Distributions in Streams using Possible Worlds