Our research focuses on constructing condensed representations of data streams. These representations are updated with every new instance that arrives and are explicitly designed to support reasoning over the entire data stream. One of their main building blocks is a family of online density estimators that can be manipulated by inference algorithms and thus provide a means to extract relevant information from the stream. By combining these inference algorithms, one can also perform more complex tasks such as pattern mining.

## Density Estimation

To model discrete joint probability distributions, we employed classifier chains built from a set of probabilistic online classifiers. The chain as a whole models the dependencies among the features, while each classifier in the chain models the probability of one particular feature given the features that precede it in the chain. The individual estimates are combined using the product rule. To increase the robustness of the estimate, we also proposed variants that use ensembles of classifier chains and ensembles of weighted classifier chains, where the estimate is computed from a set of classifier chains (usually 10).
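
The product-rule combination can be illustrated with a minimal sketch. It assumes that every feature takes integer values `0..k-1` and replaces the probabilistic online classifiers with Laplace-smoothed frequency tables. The class `ChainDensityEstimator` and its parameters are hypothetical names chosen for this illustration and are not taken from the publications.

```python
from collections import Counter
from itertools import product


class ChainDensityEstimator:
    """Toy online estimator of a discrete joint density along one chain.

    Assumption: each "classifier" is a smoothed frequency table for
    P(x_i | x_1, ..., x_{i-1}); real chains would use online classifiers.
    """

    def __init__(self, domain_sizes, alpha=1.0):
        self.domain_sizes = domain_sizes  # number of values per feature
        self.alpha = alpha                # Laplace smoothing constant
        # counts[i][(prefix, value)]: how often `value` followed `prefix`
        self.counts = [Counter() for _ in domain_sizes]
        # prefix_totals[i][prefix]: how often `prefix` was seen at position i
        self.prefix_totals = [Counter() for _ in domain_sizes]

    def update(self, instance):
        """Consume one instance from the stream."""
        for i, value in enumerate(instance):
            prefix = tuple(instance[:i])
            self.counts[i][(prefix, value)] += 1
            self.prefix_totals[i][prefix] += 1

    def conditional(self, i, value, prefix):
        """Smoothed estimate of P(x_i = value | preceding features = prefix)."""
        num = self.counts[i][(prefix, value)] + self.alpha
        den = self.prefix_totals[i][prefix] + self.alpha * self.domain_sizes[i]
        return num / den

    def density(self, instance):
        """Product rule: multiply the conditional estimates along the chain."""
        p = 1.0
        for i, value in enumerate(instance):
            p *= self.conditional(i, value, tuple(instance[:i]))
        return p


# Usage: feed a small stream, then query the joint estimate.
estimator = ChainDensityEstimator(domain_sizes=[2, 2, 3])
for x in [(0, 1, 2), (0, 1, 2), (1, 0, 0), (0, 1, 1)]:
    estimator.update(x)

all_instances = product(range(2), range(2), range(3))
total_mass = sum(estimator.density(x) for x in all_instances)
```

Because every conditional table is normalized, the product is a proper distribution: `total_mass` equals 1 up to floating-point error. An ensemble variant could average the `density` values of several such chains that use different feature orders.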

The density estimates, however, are designed not only to estimate the probability distribution of the data stream but also to let inference tasks manipulate the estimate and thereby extract relevant information from it. This is useful for answering questions about the data distribution of the stream and can also be used to perform complex data mining tasks such as pattern mining (see below).
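
To make the idea of inference on an estimate concrete, the following sketch answers marginal and conditional queries against an arbitrary joint density function. It uses brute-force enumeration, which is exponential in the number of free variables, so it only illustrates the kind of question that can be asked and is not the inference machinery from the publications. The function names and the toy table `TABLE` are assumptions made for this example.

```python
from itertools import product


def marginal(density, domain_sizes, evidence):
    """P(evidence): sum the joint density over all values of the free variables.

    `evidence` maps a feature index to the value it is fixed to.
    """
    n = len(domain_sizes)
    free = [i for i in range(n) if i not in evidence]
    total = 0.0
    for values in product(*(range(domain_sizes[i]) for i in free)):
        assignment = dict(evidence)
        assignment.update(zip(free, values))
        total += density(tuple(assignment[i] for i in range(n)))
    return total


def conditional(density, domain_sizes, query, evidence):
    """P(query | evidence) as a ratio of two marginals."""
    joint = marginal(density, domain_sizes, {**evidence, **query})
    return joint / marginal(density, domain_sizes, evidence)


# Toy joint density over two binary features (stands in for a learned estimate).
TABLE = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}


def toy_density(instance):
    return TABLE[instance]


p_first_is_one = marginal(toy_density, [2, 2], {0: 1})
p_second_given_first = conditional(toy_density, [2, 2], {1: 1}, {0: 1})
```

Here `p_first_is_one` sums the two table entries whose first feature is 1, and `p_second_given_first` divides the entry for `(1, 1)` by that marginal. Neither query needs access to the raw stream.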

If the data stream consists of many variables, this approach quickly reaches its limits. Therefore, we also designed an online density estimator for data of higher dimensionality. The main idea is to project the data stream into a lower-dimensional space, estimate the density there, and then translate the estimate back to the original space.
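
The three steps (project, estimate, translate back) can be sketched under strong simplifying assumptions. The toy below handles binary features only, projects by OR-ing randomly formed feature groups, and translates back by spreading the low-dimensional mass uniformly over all original instances that share the same projection. Both choices and the class name `ProjectedDensityEstimator` are assumptions made for illustration and should not be read as the projection or back translation used in the publication.

```python
import random
from collections import Counter


class ProjectedDensityEstimator:
    """Toy estimator: project binary instances, count, translate back."""

    def __init__(self, n_features, n_groups, seed=0):
        if not 1 <= n_groups <= n_features:
            raise ValueError("need 1 <= n_groups <= n_features")
        rng = random.Random(seed)
        features = list(range(n_features))
        rng.shuffle(features)
        # Round-robin split of the shuffled features; every group is non-empty.
        self.groups = [features[g::n_groups] for g in range(n_groups)]
        self.counts = Counter()
        self.n_seen = 0

    def project(self, x):
        """Map an instance to one bit per group (logical OR of the group)."""
        return tuple(int(any(x[i] for i in group)) for group in self.groups)

    def update(self, x):
        """Consume one instance: only the projected point is counted."""
        self.counts[self.project(x)] += 1
        self.n_seen += 1

    def low_dim_density(self, z):
        """Laplace-smoothed relative frequency over the 2**k projected points."""
        return (self.counts[z] + 1) / (self.n_seen + 2 ** len(self.groups))

    def density(self, x):
        """Back translation: share the projected mass among all preimages."""
        z = self.project(x)
        preimages = 1
        for bit, group in zip(z, self.groups):
            if bit:
                # Any non-zero pattern of the group projects to 1.
                preimages *= 2 ** len(group) - 1
        return self.low_dim_density(z) / preimages


# Usage: six binary features projected onto three bits.
estimator = ProjectedDensityEstimator(n_features=6, n_groups=3)
for x in [(1, 0, 0, 0, 1, 0), (1, 0, 0, 0, 1, 0), (0, 0, 0, 0, 0, 0)]:
    estimator.update(x)
p = estimator.density((1, 0, 0, 0, 1, 0))
```

The estimator stores at most `2 ** n_groups` counts regardless of the original dimensionality, and the back-translated values still sum to 1 over the original space. The price is that instances with the same projection become indistinguishable.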

Read more: Online Estimation of Discrete Densities, Online Density Estimation of Heterogeneous Data Streams in Higher Dimensions

## Pattern Mining

Performing complex data mining tasks on the density estimate without having access to the raw data is one of the main motivations behind our current research. To show that this can be achieved with online density estimates, we designed itemset and association rule mining algorithms that operate only on the density estimate of the data stream. The presented approach randomly modifies the estimate and then samples high-probability instances.
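
A heavily simplified sketch of mining from an estimate alone: instances are drawn from a joint density, and itemsets are counted in the sample instead of in the raw stream. The sketch omits the random modification of the estimate, represents the density as a plain dictionary, and uses hypothetical names (`sample_instances`, `frequent_itemsets`, `TOY_JOINT`), so it shows the general principle and not the published algorithm.

```python
import random
from collections import Counter
from itertools import combinations


def sample_instances(joint, n, rng):
    """Draw n instances from a joint density given as {instance: probability}."""
    instances = list(joint)
    weights = [joint[x] for x in instances]
    return rng.choices(instances, weights=weights, k=n)


def frequent_itemsets(joint, min_support, n_samples=2000, max_size=2, seed=0):
    """Estimate itemset supports from samples of the density only."""
    rng = random.Random(seed)
    counts = Counter()
    for x in sample_instances(joint, n_samples, rng):
        items = [i for i, bit in enumerate(x) if bit == 1]
        for size in range(1, max_size + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    supports = {s: c / n_samples for s, c in counts.items()}
    return {s: sup for s, sup in supports.items() if sup >= min_support}


# Toy density over three binary items (stands in for a learned estimate).
TOY_JOINT = {
    (1, 1, 0): 0.60,
    (1, 0, 0): 0.20,
    (0, 0, 1): 0.15,
    (0, 1, 1): 0.05,
}

result = frequent_itemsets(TOY_JOINT, min_support=0.4)
```

Under `TOY_JOINT`, the true supports of `{0}`, `{1}`, and `{0, 1}` are 0.80, 0.65, and 0.60, while item 2 has support 0.20, so with a threshold of 0.4 the sample-based estimate should keep the first three itemsets and drop every itemset containing item 2. Association rules could then be derived from ratios of these supports.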

Read more: A Probabilistic Condensed Representation of Data for Stream Mining

## Recurrent Data Distributions

As in many data stream applications, the distribution of the stream is usually not fixed but subject to change. Although the online density estimates that we proposed are able to adapt to these changes, they can only represent the current state. All information originating from historical data is lost, which prevents a detailed analysis of the data stream. To overcome this limitation, we extended the framework to model all distributions that occur in the data stream. Since data distributions can also recur, we additionally provided mechanisms to identify recurrent density estimates and recurrent parts of densities.

Read more: Modeling Recurrent Distributions in Streams using Possible Worlds

## Publications

- Michael Geilke, Andreas Karwath, and Stefan Kramer

  *Online Density Estimation of Heterogeneous Data Streams in Higher Dimensions*

  In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-PKDD 2016), pages 65-80, Springer 2016. doi:10.1007/978-3-319-46128-1_5

- Michael Geilke, Andreas Karwath, and Stefan Kramer

  *Modeling Recurrent Distributions in Streams using Possible Worlds*

  In: Proceedings of the International Conference on Data Science and Advanced Analytics 2015 (DSAA 2015), pages 1-9, IEEE 2015. doi:10.1109/DSAA.2015.7344814

- Michael Geilke, Andreas Karwath, and Stefan Kramer

  *A Probabilistic Condensed Representation of Data for Stream Mining*

  In: Proceedings of the International Conference on Data Science and Advanced Analytics 2014 (DSAA 2014), pages 297-303, IEEE 2014. doi:10.1109/DSAA.2014.7058088

- Michael Geilke, Andreas Karwath, Eibe Frank, and Stefan Kramer

  *Online Estimation of Discrete Densities*

  In: Proceedings of the 13th IEEE International Conference on Data Mining (ICDM 2013), pages 191-200, IEEE 2013. doi:10.1109/ICDM.2013.91

- Michael Geilke and Sandra Zilles

  *Polynomial-Time Algorithms for Learning Typed Pattern Languages*

  In: Proceedings of the 6th International Conference on Language and Automata Theory and Applications (LATA 2012), Lecture Notes in Computer Science 7183, pages 277-288, Springer 2012. doi:10.1007/978-3-642-28332-1_24

- Michael Geilke and Sandra Zilles

  *Learning Relational Patterns*

  In: Proceedings of the 22nd International Conference on Algorithmic Learning Theory (ALT 2011), Lecture Notes in Artificial Intelligence 6925, pages 84-98, Springer 2011. doi:10.1007/978-3-642-24412-4

## (Co-)Reviewing

- Journal of Computer and System Sciences
- ECML/PKDD: European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (2012 - 2014)
- KDD: ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2013 - 2016)
- ICDM: IEEE International Conference on Data Mining (2013 - 2016)
- DSAA: International Conference on Data Science and Advanced Analytics (2014)
- AAAI Conference on Artificial Intelligence (2013)
- EDBT: International Conference on Extending Database Technology (2012)