I tend to focus on clustering algorithms that use distance and density.
OPTICS¶
OPTICS, or ‘Ordering Points To Identify Clustering Structure’, is a density based clustering algorithm that I developed for segmentation of LiDAR points. It was also my first pull request.
My implementation of the OPTICS algorithm was merged into sklearn for release 0.21, and is available in all current versions. You can find the module documentation here, with more descriptive and narrative documentation here; there’s also demo code.
Posts in this section¶
OPTICS Benchmark
For these benchmarks I’m going to use real data-- the data is heterogeneous, large, and publicly available. More specifically, the benchmark data is a set of ~8.5 million filtered LiDAR data points, collected by the National Ecological Observation Network (NEON) in 2013 over the D17 study domain site. More information about the field campaign and data can be found here. The data is spread over a rectangle that is slightly larger than a kilometer and half North/South by 0.75 kilometers East/West. A (very) coarse view of the of the points is shown below, with arbitrary units of relative density:
Comparing OPTICS Implementations
The OPTICS algorithm is a density based segmentation algorithm that orders objects into a hierarchical ordering. It has a number of useful properties, such working with convex/concave clusters, requiring few input parameters, and not requiring a priori knowledge of the number of clusters that are to be segmented. The output from OPTICS is strictly speaking just an ordering of points, but this ordering can be ‘scanned’ to produce output that is essentially identical to DBSCAN.
The OPTICS algorithm uses pairwise distances to order objects (usually points) by a variety of distance and density metrics. The handling of these pairwise distances are the primary bottleneck for both CPU and memory utilization.