iSOUP-Tree: trees and ensemble methods for multi-target prediction for data streams

Incremental Structured Output Prediction Tree (iSOUP-Tree) is an online multi-target regressor that implements the global approach to multi-target regression. All of the targets are predicted using a single model, which lightens the resource load and in some cases increases predictive performance over models that use the local approach, in which a separate model is learned for each target. iSOUP-Tree can also be combined with ensemble methods, such as online bagging and online random forest, to improve its predictive performance. An online random forest of iSOUP-Trees generally provides the best trade-off between resource load and predictive performance.
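
The combination of a global multi-target model with online bagging can be sketched as follows. This is a minimal illustration under stated assumptions, not the iSOUP-Tree implementation: `RunningMeanRegressor` is a hypothetical stand-in for the tree learner that only tracks a per-target mean, and the Poisson(1) reweighting follows the standard online bagging scheme of Oza and Russell.

```python
import math
import random


class RunningMeanRegressor:
    """Hypothetical stand-in for a global multi-target learner:
    a single model object predicts every target at once."""

    def __init__(self, n_targets):
        self.n = 0.0
        self.sums = [0.0] * n_targets

    def learn_one(self, x, y, weight=1.0):
        # x is ignored by this toy learner; a tree would route the example on it.
        self.n += weight
        for i, value in enumerate(y):
            self.sums[i] += weight * value

    def predict_one(self, x):
        if self.n == 0:
            return [0.0] * len(self.sums)
        return [s / self.n for s in self.sums]


def poisson1(rng):
    """Draw from Poisson(lambda=1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


class OnlineBagging:
    """Online bagging: each member sees each example k ~ Poisson(1) times."""

    def __init__(self, n_models, n_targets, seed=0):
        self.rng = random.Random(seed)
        self.models = [RunningMeanRegressor(n_targets) for _ in range(n_models)]

    def learn_one(self, x, y):
        for model in self.models:
            k = poisson1(self.rng)
            if k > 0:
                model.learn_one(x, y, weight=k)

    def predict_one(self, x):
        # Average the members' multi-target predictions, target by target.
        preds = [m.predict_one(x) for m in self.models]
        return [sum(col) / len(preds) for col in zip(*preds)]


ensemble = OnlineBagging(n_models=10, n_targets=2, seed=42)
for _ in range(50):
    ensemble.learn_one({"feature": 1.0}, [1.0, 2.0])
prediction = ensemble.predict_one({"feature": 1.0})  # one value per target
```

A local approach would instead keep one such model per target; the global approach shares a single structure across all targets, which is where the resource savings come from.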

Clus+: An enhanced version of CLUS included within ClowdFlows

Clus is decision tree and rule induction software that implements the predictive clustering framework. This framework combines clustering with predictive modelling and allows for a natural extension to more complex prediction tasks. Clus is able to solve tasks such as (hierarchical) multi-label classification, (hierarchical) multi-target regression, network regression, and tasks with even more complex output types such as tuples and sets. Ensemble learning, feature ranking and semi-supervised learning methods are also included. The software is available as a standalone tool and as an integrated building block within ClowdFlows, a cloud-based interactive machine learning and data mining platform.

ProTraits – Database of predictions for phenotypic traits

The ProTraits atlas of prokaryotic traits describes environmental preferences of microbes, interactions with other organisms (including pathogenicity), biochemical phenotypes, resistance to chemicals and other stressors, and utility in industrial applications. ProTraits recognizes 424 phenotypic traits and covers 3,046 bacterial or archaeal species. Overall, it provides 545,081 annotations (less than 10% FDR, tallying both the positive and the negative labels), of which 503,308 are novel.

InterSet – Exploration and visualization of redescription sets

The main purpose of InterSet is to allow interactive, comprehensive exploration of redescription sets. On this page you can test the features of the tool by exploring redescriptions created on two different datasets. The first dataset contains attributes describing world countries using general country information and country trading patterns for the year 2012 ([2,7,8]). The second dataset contains attributes describing a co-authorship graph and an author-conference bipartite graph ([1,10]). The tool is described in more detail in the paper “InterSet: Interactive redescription set exploration”, published in the proceedings of the Discovery Science Conference (DS’16).

Multi-Plant Photovoltaic Energy Forecasting Challenge at ECML PKDD 2017

The urgent need to reduce pollutant emissions has made renewable energy a strategic sector for the European Union (EU) and internationally. This has resulted in an increasing presence of renewable energy sources and, thus, significant distributed power generation. The main challenges faced by this new energy market are grid integration, load balancing and energy trading. To meet these challenges, it is of paramount importance to monitor the production and consumption of energy at both the local and global level, to store historical data and to design new, reliable prediction tools. In this challenge, we focus on photovoltaic (PV) power plants because of their wide distribution in Europe. In recent years, forecasting PV energy production has received significant attention, since photovoltaics are becoming a major source of renewable energy worldwide. A forecast may apply to a single renewable power generation system or to an aggregation of large numbers of systems spread over an extended geographic area.

ComiRNet – The Database of Predicted miRNAs Regulatory Networks

ComiRNet is a database of miRNA target predictions and predicted miRNA regulatory networks. ComiRNet stores approximately 5 million predicted interactions between 934 human miRNAs and 30,875 gene transcripts (mRNAs) which are exploited in the construction of the hierarchies of overlapping biclusters representing potential miRNA regulatory networks.

Clus: A Predictive Clustering System

Clus is a decision tree and rule induction system that implements the predictive clustering framework. This framework unifies unsupervised clustering and predictive modelling and allows for a natural extension to more complex prediction settings such as multi-task learning and multi-label classification. Clus is co-developed by the Declarative Languages and Artificial Intelligence group of the Katholieke Universiteit Leuven, Belgium, and the Department of Knowledge Technologies of the Jožef Stefan Institute, Ljubljana, Slovenia.


A theoretical framework that unifies different data mining tasks on different types of data can help to formalize knowledge about the domain of data mining and provide a basis for future research, unification and standardization. It can directly support the development of a general framework for data mining, support the representation of the process of mining structured data, and allow the representation of the complete process of knowledge discovery.

Web-based system for retrieval and modality classification of medical images

The system uses multimodal features, both textual and visual. For the visual features we used state-of-the-art opponent SIFT features, whereas for the textual features we used the standard bag-of-words representation. We applied query expansion to further improve the text-based retrieval. Finally, we included the medical modality of the images as input to the retrieval.
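
The text-based part of such a pipeline, bag-of-words matching with query expansion, can be sketched as follows. This is an illustrative sketch only, not the system's code: the `EXPANSIONS` table, the function names and the cosine scoring are assumptions, since the description does not say how expansion terms are chosen or how matches are ranked.

```python
import math
from collections import Counter

# Hypothetical expansion table (an assumption for illustration only).
EXPANSIONS = {"xray": ["radiograph"], "mri": ["magnetic", "resonance"]}


def bag_of_words(text):
    """Represent a text as term counts, ignoring word order."""
    return Counter(text.lower().split())


def expand_query(text, expansions=EXPANSIONS):
    """Append related terms to the query before matching."""
    terms = text.lower().split()
    extra = [t for term in terms for t in expansions.get(term, [])]
    return " ".join(terms + extra)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, captions):
    """Rank captions by similarity to the expanded query; drop non-matches."""
    q = bag_of_words(expand_query(query))
    scored = [(cosine(q, bag_of_words(c)), c) for c in captions]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [caption for score, caption in ranked if score > 0]


captions = ["chest radiograph of adult patient", "brain scan axial view"]
results = search("xray", captions)
```

Without expansion, the query "xray" shares no term with either caption; with it, the first caption is retrieved through the added term "radiograph".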

DiatomSearch: Diatom identification system

DiatomSearch is a hierarchical multi-label classification (HMC) system for diatom image classification. Our approach to HMC exploits the classification hierarchy by building a single predictive clustering tree (PCT) that can simultaneously predict all different levels in the hierarchy of taxonomic ranks: genus, species, variety, and form. The system can be used by taxonomists to annotate new diatom images.
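
One way to picture how a single model can predict every level of the hierarchy at once is to encode each image's taxon as a set of labels, one per rank on the path from genus down. The sketch below shows only that target encoding under assumed names (`RANKS`, `hierarchy_labels`, `to_target_vector` and the example taxa are illustrative); it does not reproduce the PCT learner itself.

```python
RANKS = ["genus", "species", "variety", "form"]


def hierarchy_labels(taxon):
    """Expand a taxon into one label per rank along its path,
    stopping at the first rank that is not specified."""
    labels = []
    path = []
    for rank in RANKS:
        name = taxon.get(rank)
        if name is None:
            break
        path.append(name)
        labels.append("/".join(path))
    return labels


def to_target_vector(taxon, label_index):
    """Binary multi-label target: 1 for every node on the taxon's path."""
    vector = [0] * len(label_index)
    for label in hierarchy_labels(taxon):
        vector[label_index[label]] = 1
    return vector


# Illustrative label space; a real system would derive it from its taxonomy.
label_index = {"Navicula": 0, "Navicula/radiosa": 1, "Navicula/gregaria": 2}
target = to_target_vector({"genus": "Navicula", "species": "radiosa"}, label_index)
```

Because a label for a species always implies the label for its genus, a model trained on such vectors predicts all ranks in one pass rather than requiring a separate classifier per level.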

Interpolative Clustering Tree Learner

Interpolative Clustering Tree (ICT) is a data mining algorithm that summarizes data sampled over space for a number of geophysical variables by leveraging the power of a spatially aware clustering algorithm. ICT determines a descriptive and interpolative model of georeferenced data sampled for the set of fields under examination.

TweetViz: Twitter Data Visualization

TweetViz is a web tool for visualizing Twitter data. TweetViz offers several different kinds of visualizations that can pertain to a Twitter user or to any keyword or hashtag entered through the interface. TweetViz also includes a so-called Streamgraph visualization that represents the topic distribution in a set of tweets. The topic distributions are created using LDA (Latent Dirichlet Allocation).

NewsTweetSentiment – system for sentiment analysis of news-related social media responses

NewsTweetSentiment is a system for sentiment analysis of social media responses on Twitter and for matching them with news articles from several sources. In this way, users can get a better understanding of the reactions a news article is receiving. Currently, we support news feeds from BBC and The Guardian on several topics, including worldwide news, business, sport and technology. The articles presented in the news feed are no older than 24 hours.

HOCCLUS: Hierarchical and Overlapping Co-Clustering of mRNA:miRNA Interactions

HOCCLUS is a method for the extraction of co-clusters of miRNAs and messenger RNAs (mRNAs). Unlike several existing co-clustering algorithms, our approach efficiently extracts a set of possibly overlapping, exhaustive and hierarchically organized co-clusters.

AMRules: Rules for regression data streams

The volume and velocity of data are increasing at astonishing rates. To extract knowledge from this huge amount of information, efficient online learning algorithms are needed. Three rule-based algorithms offer state-of-the-art results for mining regression streams. The algorithms are implemented in and available from MOA (Massive Online Analysis).

DAMRules: Distributed Adaptive Model Rules for Regression

This is the first distributed streaming algorithm to learn decision rules for regression tasks. The algorithm is available in SAMOA (Scalable Advanced Massive Online Analysis), an open-source platform for mining big data streams. It uses a hybrid of vertical and horizontal parallelism to distribute Adaptive Model Rules (AMRules) on a cluster.