Topological Data Analysis (TDA) applies topology to data science, offering a distinctive toolkit for analyzing data. A number of software libraries, available in several languages, help compute TDA constructions such as the Rips filtration, persistent homology, and circular parametrization.
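As a rough illustration of what these constructions compute, the sketch below finds the 0-dimensional persistent homology (the connected components) of a Rips filtration in plain Python. It is not how Dionysus works internally. The function name h0_persistence is hypothetical, and the code assumes the convention that an edge enters the filtration at a scale equal to its length.

```python
from itertools import combinations
from math import dist  # requires Python 3.8+


def h0_persistence(points):
    """Return (birth, death) pairs for the connected components of a Rips filtration.

    Assumption: every point is born at scale 0, and an edge appears at a
    scale equal to the distance between its endpoints.
    """
    n = len(points)
    parent = list(range(n))  # union-find forest over the points

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges in order of appearance in the filtration.
    edges = sorted(
        (dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )

    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Two components merge, so one of them dies at this scale.
            parent[root_i] = root_j
            pairs.append((0.0, length))

    if n:
        # The last surviving component never dies.
        pairs.append((0.0, float("inf")))
    return pairs


print(h0_persistence([(0, 0), (1, 0), (5, 0)]))
# prints [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

The two finite pairs say that the points at 0 and 1 join at scale 1, and that the point at 5 joins them at scale 4. Libraries such as Dionysus compute the same kind of diagram in higher dimensions and far more efficiently.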

In this post, we briefly explain how to install the Dionysus library on a Databricks cluster. As a simple use case, consider needing to compute persistent homology on a large corpus of datasets. Databricks is a convenient notebook-like environment optimized for large datasets. It offers easy access to on-demand clusters and to PySpark kernels running on those clusters, which makes it a convenient test bed for our purposes.

As one might expect, the first step is to augment the kernels with the necessary packages. Start a new Python notebook and attach it to a cluster. In a cell, add the following:

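dbutils.fs.put("dbfs:/databricks/init//.sh", """#!/bin/bash
# Build dependencies for Dionysus: the CMake build tool and the Boost C++ libraries.
sudo apt-get install -y cmake
sudo apt-get install -y libboost-all-dev
""", True)  # True overwrites any existing script at this path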

Run the cell after filling in the path: the directory under dbfs:/databricks/init/ should be the name of your notebook's cluster, and the file name before .sh can be whatever you wish to name the initialization script. This places a bash script in the Databricks file system that runs whenever the cluster is initialized. Next, restart the cluster; this re-initializes it and runs the script we just created. The script installs CMake (a build tool) and Boost (a collection of C++ libraries), both of which are needed to build the Dionysus library code. The cell only needs to be run once; after that, the script runs every time the cluster initializes.
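To make the substitution concrete, here is a minimal sketch that assembles the script path and contents before writing them. The helper names init_script_path and init_script_body are hypothetical, as are the cluster and script names "my-cluster" and "install-dionysus-deps"; the path layout (cluster name as the directory, script name as the file) is an assumption based on the description above.

```python
def init_script_path(cluster_name, script_name):
    """Build the DBFS path for an init script (assumed layout, see above)."""
    return "dbfs:/databricks/init/{}/{}.sh".format(cluster_name, script_name)


def init_script_body(packages):
    """Build a bash script that installs each package with apt-get."""
    lines = ["#!/bin/bash", ""]
    lines += ["sudo apt-get install -y {}".format(p) for p in packages]
    return "\n".join(lines) + "\n"


# Hypothetical names; substitute your own cluster and script names.
path = init_script_path("my-cluster", "install-dionysus-deps")
body = init_script_body(["cmake", "libboost-all-dev"])

try:
    # `dbutils` is predefined inside a Databricks notebook.
    dbutils.fs.put(path, body, True)  # True overwrites an existing file
except NameError:
    # Outside Databricks, just show what would be written.
    print(path)
    print(body)
```

Building the strings separately makes it easy to inspect the script before it is written to the file system.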

Once the cluster has been initialized with the above script, we can use the GUI to install the Dionysus package. Navigate to the Create Library page. Under the Language drop-down menu, select Upload Python Egg or PyPI, enter dionysus in the PyPI Name text field, and click Install Library. This installs the Dionysus library on the cluster and makes it available in the attached notebook. Finally, in the notebook, we can run import dionysus as d.
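A quick way to confirm that the installation worked is to check that the module can be found before importing it. The following is a minimal sketch; the helper name is_importable is hypothetical, and the Simplex call assumes the Dionysus 2 Python API.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("dionysus"):
    import dionysus as d

    # Build a single triangle as a smoke test (assumes the Dionysus 2 API).
    s = d.Simplex([0, 1, 2])
    print("Dionysus is available; built a simplex of dimension", s.dimension())
else:
    print("Dionysus was not found; check the init script and the library install.")
```

If the check fails, the usual suspects are a cluster that was not restarted after the init script was created, or a library that was not attached to the cluster.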

In a subsequent post, we will detail Dionysus' usage with Spark to compute persistent homology.