Topological Data Analysis (TDA) applies topology to data science and analysis, offering a distinctive toolkit for studying the shape of data. A number of software libraries help compute the constructions used in TDA, including the Rips filtration, persistent homology, and circular parametrization. These libraries are available in several languages:
- Dionysus2 (C++ with Python bindings)
- Ripser (C++ with Python bindings)
- TDA (R)
- Javaplex (Java)
- GUDHI (C++ with Python bindings)
- PHAT (C++ with Python bindings)
- DIPHA (C++)
- Perseus (C++)
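To make the constructions named above a little more concrete, here is a minimal pure-Python sketch of a Rips filtration together with zero-dimensional persistent homology (connected components). It does not use any of the listed libraries; the function names `rips_filtration` and `h0_persistence` and the sample points are illustrative assumptions, and the brute-force enumeration is only suitable for tiny inputs.

```python
from itertools import combinations
from math import dist


def rips_filtration(points, max_dim=2, max_radius=float("inf")):
    """Return (simplex, value) pairs sorted by filtration value, then dimension."""
    n = len(points)
    simplices = []
    for k in range(1, max_dim + 2):  # k vertices -> (k-1)-dimensional simplex
        for simplex in combinations(range(n), k):
            # A simplex enters when its longest edge does (0 for single vertices).
            value = max(
                (dist(points[i], points[j]) for i, j in combinations(simplex, 2)),
                default=0.0,
            )
            if value <= max_radius:
                simplices.append((simplex, value))
    simplices.sort(key=lambda sv: (sv[1], len(sv[0]), sv[0]))
    return simplices


def h0_persistence(filtration, n_vertices):
    """Birth/death pairs for connected components, via union-find over edges."""
    parent = list(range(n_vertices))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    pairs = []
    for simplex, value in filtration:
        if len(simplex) != 2:
            continue  # only edges can merge components
        a, b = find(simplex[0]), find(simplex[1])
        if a != b:
            parent[a] = b
            pairs.append((0.0, value))  # one component dies at this edge length
    # Components that never merge persist forever.
    roots = {find(i) for i in range(n_vertices)}
    pairs.extend((0.0, float("inf")) for _ in roots)
    return pairs


# Illustrative data: a small cluster of three points plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
filtration = rips_filtration(points, max_dim=2, max_radius=2.0)
diagram = h0_persistence(filtration, len(points))
print(diagram)
```

Every vertex is born at value 0, so each merge simply records a death at the length of the merging edge. The libraries above do the same bookkeeping in higher dimensions and far more efficiently.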
and run it, replacing the cluster-name placeholder in the script path with the name of your notebook's cluster, and the script-name placeholder (just before .sh) with whatever you wish to name the initialization script:

    / .sh", """
    #!/bin/bash
    sudo apt-get install -y cmake
    sudo apt-get install -y libboost-all-dev
    """, True)

This places a bash script in the Databricks file system that runs whenever the cluster is initialized. Next, restart the cluster; this re-initializes it and runs the script we just created. The script installs CMake, a build tool, and Boost, a collection of C++ libraries, both of which are needed to build the Dionysus library code. The cell containing the above code only needs to be run once; after that, the script runs every time the cluster initializes.
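As a rough sketch of the step just described, the init script contents and its destination path can be assembled in plain Python before being written to the Databricks file system. The helper `init_script_path`, the `<init-dir>` placeholder, and the cluster and script names are hypothetical and should be replaced with your own values. The `dbutils.fs.put` call is left commented out because `dbutils` is only defined inside a Databricks notebook.

```python
# Contents of the init script, as given in the text above.
INIT_SCRIPT = """#!/bin/bash
sudo apt-get install -y cmake
sudo apt-get install -y libboost-all-dev
"""


def init_script_path(init_dir, cluster_name, script_name):
    """Join an init-script directory, a cluster name, and a script name."""
    return f"{init_dir.rstrip('/')}/{cluster_name}/{script_name}.sh"


# All three arguments are placeholders (assumptions); substitute your own values.
path = init_script_path("<init-dir>", "my-cluster", "install-dionysus-deps")

# Inside a Databricks notebook, where `dbutils` is predefined, the script
# could then be written with overwrite enabled:
# dbutils.fs.put(path, INIT_SCRIPT, True)
print(path)
```

Starting the string directly with the shebang line keeps `#!/bin/bash` on the first line of the file, which is where the shell expects it.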
Once the cluster has been initialized with the above initialization script, we can use the GUI to install the Dionysus package. Navigate to the Create Library page. Under the Language drop-down menu, select Upload Python Egg or PyPI; in the PyPI Name text field, enter dionysus, and click Install Library. This installs the Dionysus library on the cluster and makes it available in the attached notebook. Finally, in the notebook we can run
import dionysus as d
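Before relying on the import, it can help to confirm that the library actually landed on the cluster. The following is a minimal sketch using only the Python standard library; the helper name `package_status` is an assumption made for illustration, and it presumes the installed distribution is named `dionysus`.

```python
from importlib import metadata
from importlib.util import find_spec


def package_status(import_name, dist_name=None):
    """Return (importable, version); version is None if no distribution is found."""
    importable = find_spec(import_name) is not None
    try:
        version = metadata.version(dist_name or import_name)
    except metadata.PackageNotFoundError:
        version = None
    return importable, version


importable, version = package_status("dionysus")
print(f"dionysus importable: {importable}, version: {version}")
```

If the first value is False, the library has most likely not been attached to the cluster, or the cluster was not restarted after the init script was created.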
In a subsequent post, we will detail Dionysus' usage with Spark to compute persistent homology.