Topological data analysis with Mapper
Topological data analysis (TDA) is a recent branch of data analysis that uses topology and in particular persistent homology. TDA is used to extract meaning from the shape of data.
In this article we will explore a particular technique of TDA called the Mapper algorithm introduced by Singh, Memoli and Carlsson in their seminal paper Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition (2007).
Mapper is a combination of dimensionality reduction, clustering and graph networks techniques used to get higher level understanding of the structure of data. It is mainly used for:
- visualising the shape of data through a particular lens
- detecting clusters and interesting topological structures which traditional methods fail to find
- selecting features that best discriminate data and for model interpretability
A flourishing literature shows its high potential in real data applications and especially in biology and medicine. Created by some founders of the celebrated silicon valley start-up Ayasdi, Mapper is applied in business to customer segmentation, hot spot analysis, fraud detection, churn prediction, drug discovery, socio-economic, sensor or text data as well.
In this article we will briefly describe what Mapper technically is, reveal the advantages that it carries and show some applications as well as an exploratory case study. In a forthcoming article we will explore how TDA and persistent homology can be used effectively with Machine Learning.
Given a dataset of points, the basic steps behind Mapper are as follows:
- Map to a lower-dimensional space using a filter function , or lens. Common choices for the filter function include projection onto one or more axes via PCA or density-based methods.
- Construct a cover of the projected space typically in the form of a set of overlapping intervals which have constant length.
- For each interval cluster the points in the preimage into sets .
- Construct the graph whose vertices are the cluster sets and an edge exists between two vertices if two clusters share some points in common.
The result is a compressed graph representation of the dataset representing well its structure. High flexibility is left in this procedure to the data scientist for fine tuning: the choice of the lens, the cover and the clustering algorithm.
Lenses, coverings and clustering
The choice of the lens is of great importance since the output graph will sensibly depend on this choice. In the one-dimensional case, one could choose some height function (like in the figure above), eccentricity or centrality.
With more dimensions typical choices are standard dimensionality reduction algorithms like PCA, Isomap, MDS, t-SNE or UMAP. Density-based methods such as distance to the first k-nearest neighbors are often used in biological applications (see this book for several examples). More subtle results can be achieved by projecting to some latent space coming from a variational autoencoder.
Depending on what type of dataset you’re analyzing, the most meaningful lenses can be applied. For example, for outlier detection it will make sense to apply a lens projecting to an isolation forest score coupled with L2 norm or first PCA component (see here for one of such examples).
Concerning the cover, a standard choice is a collection of d-dimensional intervals (i.e. a Cartesian product of d one-dimensional intervals) of the same length pairwise overlapping with a specified percentage. This is another crucial parameter since the overlaps will determine the edge creation in the Mapper’s graph.
Lastly you are left with the choice of your favourite clustering technique. DBSCAN or hierarchical clustering are good picks in this context. It could be tricky to choose the number of clusters as hyperparameter globally: clustering is applied to every set of points resulting from the pull-back of an element of the cover, thus there may be sets of points with different numbers of clusters to select. One solution is given by stopping the agg