Cluster Estimation
Overview
The Cluster Estimation worker receives input landing pad data from the geolocation worker and outputs a list of most probable landing locations. This step is important because there is inaccuracy in landing pad detections in the form of outliers and a distribution of detections instead of a singular point.
The worker uses a clustering algorithm to predict a ‘cluster center’ (or centroid) from groupings of data points. Each data point comes from a singular detection with a (x, y) position, which is gradually passed into the worker as time passes and more landing pads are detected.
Visualization
These data points can be plotted on a 2D scatter plot, and should form a cluster of decreasing density surrounding the true landing pad location.
Model Choice
Why the Variational Gaussian Mixture Model instead of another clustering algorithm like the K-means algorithm?
Both VGMM and K-means require an input for number of expected centroids (components), but the VGMM can automatically limit itself to a lower number of components required for the data.
This is important because there will be a practical max number of landing pads in the competition, but the number of already detected landing pads will be varying & lowering than this maximum amount throughout the worker lifetime.
A k-means ensemble was considered but VGMM model was chosen for ease of implementation.
How it works
The VGMM model comes from the Scikit-learn library BayesianGaussianMixture. After fitting data onto the model, it gives back a list of centroids and their corresponding weights (confidence of the model that this is a centroid), and co-variance (the ‘radius' in which points belong to this centroid).
Centroid post-processing
Invalid centroids are then eliminated from this list through these steps
Filtering by point ownership
Since the VGMM can choose to not ‘use’ the entire amount of specified clusters, the ‘unused’ cluster centroids have no point belonging to them but are still returned in the list. This step removes these centroids without points.
Filtering by confidence
Outlier detections should occur less often than true detection, and thus if any cluster centroids are placed on these outlier detections, they should be filtered out.
Less point belonging to a center generally result in lower weight (confidence) for that center.
Filtering is done by sorting the list of centroids in descending confidence, then checking the drop in confidence between subsequent elements in the list. Once a large drop occurs, that center and any cluster at lower confidence is removed.
Filtering by covariance
Sometimes outlier detections are grouped together, and assigned a cluster center, depending on how close they are to the real detections.
Since outlier detections are unlikely to be tightly grouped together, its center ‘radius’ should be larger than real centroids.
This step filters out any center with covariance (radius) much larger than the minimum covariance amongst all the centroids.
The remaining valid clusters are outputted with their corresponding position, weight, and covariance.
Useful Links
2.1. Gaussian mixture models Overview + Examples
BayesianGaussianMixture Documentation for the Scikit-learn model used