What they are
DBSCAN and HDBSCAN are density-based clustering algorithms. They group points that lie in high-density regions and label points that belong to no group as noise.
intuition —> define a neighborhood radius —> dense neighborhoods grow into clusters —> sparse points become noise
- DBSCAN needs two knobs: `eps` (neighborhood size) and `min_samples` (minimum neighbors to be dense)
- HDBSCAN removes the `eps` guesswork by building a hierarchy of densities and extracting stable clusters automatically; it also labels noise
Why security teams use them
- Campaign and infrastructure discovery: clusters appear in natural shapes without forcing k
- Noise handling: obvious one-offs are flagged as noise instead of being forced into a cluster
- Variable density: HDBSCAN handles tight and loose groups better than k-means
How DBSCAN works step by step
- For each point, count the neighbors within `eps`
- If neighbors >= `min_samples`, mark the point as a core point
- Grow a cluster by visiting all points density reachable from any core point; non-core points reached this way join as border points
- Points not assigned to any cluster become noise
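The steps above can be sketched in plain Python. This is a minimal, brute-force illustration (quadratic neighbor search, Euclidean distance), not a production implementation. The function names and toy points are made up for the example, and the neighbor count includes the point itself, which is one common convention.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)            # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:     # not a core point
            labels[i] = NOISE                # may be relabeled as a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:           # border point reached from a core point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:   # j is core too, so keep expanding
                queue.extend(j_neighbors)
    return labels

# Toy data: two tight groups and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 0.0)]
labels = dbscan(points, eps=0.3, min_samples=3)
print(labels)
```

In practice a library implementation with an indexed neighbor search is the better choice; the sketch only shows how core points, border points, and noise fall out of the two knobs.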
How HDBSCAN works step by step
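- Transform distances into mutual reachability distances, which push sparse points away from everything else
- Build a minimum spanning tree over the points using those distances
- Turn the tree into a cluster hierarchy and condense it using `min_cluster_size`
- Extract the most stable clusters and label points in low-stability branches as noise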
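The first two steps are small enough to sketch directly. The code below is an illustration under stated assumptions: the core distance is taken as the distance to the `min_samples`-th nearest neighbor with the point itself excluded (implementations differ on this detail), and the tree is built with a naive Prim's algorithm. Condensing the hierarchy and extracting stable clusters are left out.

```python
from math import dist

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbor (self excluded here)."""
    out = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[min_samples - 1])
    return out

def mutual_reachability(points, core, a, b):
    """max(core(a), core(b), d(a, b)): a sparse point ends up far from everything."""
    return max(core[a], core[b], dist(points[a], points[b]))

def minimum_spanning_tree(points, core):
    """Naive Prim's algorithm on the complete mutual reachability graph.

    Returns a list of (weight, a, b) edges in the order they were added.
    """
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, a, b = min(
            (mutual_reachability(points, core, a, b), a, b)
            for a in in_tree for b in range(n) if b not in in_tree
        )
        in_tree.add(b)
        edges.append((w, a, b))
    return edges

# Three nearby points and one far away, in one dimension for readability
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
core = core_distances(points, min_samples=2)
edges = minimum_spanning_tree(points, core)
print(core)
print(edges)
```

Removing the heaviest tree edges first is what produces the hierarchy: here the edge to the isolated point carries by far the largest weight, so it is the first to split off.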
Security examples that click
- Phishing campaign clustering
  features —> URL character n-grams with TF-IDF, domain age, brand tokens
  output —> organically shaped clusters by campaign, plus noise for one-offs
- Infrastructure grouping
  features —> TLS JA3 or JA4 fingerprints, cert issuer, ASN, hosting region
  output —> clusters per operator infra even if density varies
- Binary or process family discovery
  features —> section entropy, import counts, tokenized commands after PCA
  output —> families without pre-setting k; scattered experiments go to noise
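As an illustration of the first feature set, character n-gram TF-IDF for URLs can be sketched in a few lines. This uses the plain `tf * log(N / df)` weighting with no smoothing, which is only one of several common variants, and the URLs are invented examples.

```python
from collections import Counter
from math import log

def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf(docs, n=3):
    """Sparse TF-IDF dicts over character n-grams (tf * log(N / df), no smoothing)."""
    grams = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in grams for g in c)      # number of docs containing each n-gram
    total = len(docs)
    return [{g: tf * log(total / df[g]) for g, tf in c.items()} for c in grams]

# Hypothetical URLs: two from one lookalike campaign, one unrelated
urls = [
    "paypa1-login.example/verify",
    "paypa1-secure.example/verify",
    "cdn.example/static.js",
]
vectors = tfidf(urls)
print(vectors[0]["pay"], vectors[0]["exa"])
```

N-grams shared by every URL (such as those inside `example`) get zero weight, while campaign-specific fragments keep a positive weight and pull related URLs together.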
Feature engineering and scaling
- Scale features: standardize numeric columns, or use cosine distance on L2-normalized TF-IDF
- High dimensional: reduce with PCA or Truncated SVD to 20 to 100 components
- Distance choice: Euclidean for dense numeric features, cosine for text such as URLs
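A small sketch of the two scaling options, using only the standard library. It also checks a useful identity: on L2-normalized vectors, squared Euclidean distance equals twice the cosine distance, so Euclidean clustering on normalized TF-IDF ranks neighbors the same way cosine does. The sample values are made up.

```python
from math import sqrt
from statistics import mean, pstdev

def standardize(column):
    """Zero mean, unit variance for one numeric feature column."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma if sigma else 0.0 for x in column]

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine_distance(a, b):
    """1 - cosine similarity."""
    a, b = l2_normalize(a), l2_normalize(b)
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Standardizing keeps a wide-range feature from dominating the metric
domain_age_days = [1, 3, 2, 900, 1200]
scaled = standardize(domain_age_days)

# On unit vectors: squared Euclidean distance == 2 * cosine distance
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])
cos_d = cosine_distance(u, v)
sq_euclid = sum((x - y) ** 2 for x, y in zip(u, v))
print(cos_d, sq_euclid)
```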
Choosing parameters
- DBSCAN
  `eps`: set via a k-distance plot; choose the elbow of the k-nearest-neighbor distance curve
  `min_samples`: 5 to 15 is typical; start higher for noisy data
- HDBSCAN
  `min_cluster_size`: the smallest group you care about operationally
  `min_samples`: higher gives more noise and stricter cores; defaults to `min_cluster_size`
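The k-distance curve behind the `eps` choice is straightforward to compute. The sketch below builds the sorted curve by brute force and picks a candidate with a deliberately simple "largest jump" rule. That rule is a hypothetical stand-in for reading the elbow off a plot, and the toy points are invented.

```python
from math import dist

def k_distance_curve(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    curve = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        curve.append(d[k - 1])
    return sorted(curve)

def elbow_by_largest_jump(curve):
    """Hypothetical heuristic: the value just before the largest jump in the sorted curve."""
    jumps = [curve[i + 1] - curve[i] for i in range(len(curve) - 1)]
    return curve[jumps.index(max(jumps))]

# Two tight groups and one isolated point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 0.0)]
curve = k_distance_curve(points, k=2)
eps_guess = elbow_by_largest_jump(curve)
print(curve)
print(eps_guess)
```

A common starting point is to set k close to the intended `min_samples`. On real data the curve is rarely this clean, so treat the heuristic's output as a first guess to confirm visually.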
Evaluation that matches operations
- Internal cluster quality: silhouette with your chosen metric, Davies–Bouldin
- Stability: overlap of clusters across resamples or time windows
- Analyst utility: percent of alerts deduplicated, time saved, precision of sampled clusters
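One way to put a number on stability is the mean best-match Jaccard overlap between the clusters of two runs. This is a sketch of that one measure, not the only reasonable definition. The item ids and labels are invented, and it assumes both runs cover the same items.

```python
def clusters_from_labels(ids, labels):
    """Group item ids by cluster label, dropping noise (-1)."""
    groups = {}
    for item, label in zip(ids, labels):
        if label != -1:
            groups.setdefault(label, set()).add(item)
    return list(groups.values())

def jaccard(a, b):
    """Size of the intersection over size of the union of two sets."""
    return len(a & b) / len(a | b)

def stability(run_a, run_b):
    """Mean best-match Jaccard overlap of each cluster in run_a against run_b."""
    if not run_a or not run_b:
        return 0.0
    return sum(max(jaccard(c, d) for d in run_b) for c in run_a) / len(run_a)

ids = ["u1", "u2", "u3", "u4", "u5", "u6", "u7"]
week1 = clusters_from_labels(ids, [0, 0, 0, 1, 1, 1, -1])
week2 = clusters_from_labels(ids, [5, 5, 5, 5, 2, 2, -1])   # label values differ; membership matters
score = stability(week1, week2)
print(round(score, 3))
```

Because the measure compares membership rather than label values, it is unaffected by cluster ids being renumbered between runs.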
Practical workflow
- Define goal: campaign map, infra map, or dedup map
- Encode and scale features: lock scalers to the training window
- Sweep parameters
  - DBSCAN: sweep `eps` using the k-distance elbow
  - HDBSCAN: sweep `min_cluster_size` across a small grid
- Check quality and stability across random seeds and weeks
- Name clusters: top terms, exemplar URLs, common certs
- Integrate: cluster id —> ticket, rule routing, dashboard
- Monitor drift: cluster counts, size distribution, stability
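The drift monitoring step can start as a per-run summary of the label vector compared against the previous run. The sketch below assumes noise is labeled -1. The threshold values are placeholders chosen for illustration, not recommendations.

```python
from collections import Counter

def cluster_summary(labels):
    """Cluster count, noise rate, and sorted cluster sizes for one run (noise is -1)."""
    labels = list(labels)
    sizes = Counter(l for l in labels if l != -1)
    noise = sum(1 for l in labels if l == -1)
    return {
        "n_clusters": len(sizes),
        "noise_rate": noise / len(labels) if labels else 0.0,
        "sizes": sorted(sizes.values(), reverse=True),
    }

def drift_flags(prev, curr, max_noise_delta=0.10, max_count_ratio=1.5):
    """Names of the metrics that moved more than the (hypothetical) thresholds allow."""
    flags = []
    if abs(curr["noise_rate"] - prev["noise_rate"]) > max_noise_delta:
        flags.append("noise_rate")
    lo, hi = sorted([prev["n_clusters"], curr["n_clusters"]])
    if (lo == 0 and hi > 0) or (lo > 0 and hi / lo > max_count_ratio):
        flags.append("n_clusters")
    return flags

last_week = cluster_summary([0, 0, 0, 1, 1, 1, -1, -1])
this_week = cluster_summary([0, 0, 1, 1, 2, 2, 3, 3, -1, -1, -1, -1])
flags = drift_flags(last_week, this_week)
print(last_week, this_week, flags)
```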
Pitfalls and fixes
- Curse of dimensionality: distances flatten
  fix —> PCA or SVD before clustering
- Bad scaling: one dominant feature overpowers the metric
  fix —> standardize or normalize first
- Parameter sensitivity: a tiny `eps` splits clusters, a large `eps` merges everything
  fix —> use the k-distance elbow or prefer HDBSCAN
- Mixed densities: DBSCAN struggles
  fix —> HDBSCAN handles this case
Common hyperparameters
- DBSCAN: `eps`, `min_samples`, `metric` (euclidean or cosine)
- HDBSCAN: `min_cluster_size`, `min_samples`, `metric`, `cluster_selection_method` (eom or leaf)
Security focused testing checklist
- Verify that scaling and dimensionality reduction are fitted on past data only
- For DBSCAN, plot the k-distance curve to select `eps`
- For HDBSCAN, sweep `min_cluster_size` and inspect stability
- Sample each cluster for analyst sanity checks and naming
- Track noise rate, cluster sizes, and stability over time
- Guard against leakage: remove post-verdict or future-only fields
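Two of the checks above, fitting the scaler on the past window only and dropping leaky fields, can be sketched as follows. The field names (`verdict`, `analyst_notes`, `takedown_time`, `domain_age`) are hypothetical; substitute whatever your pipeline actually records after the fact.

```python
from statistics import mean, pstdev

# Hypothetical names of fields that only exist after a verdict or in the future
LEAKY_FIELDS = {"verdict", "analyst_notes", "takedown_time"}

def drop_leaky(record):
    """Remove fields that would not be available at clustering time."""
    return {k: v for k, v in record.items() if k not in LEAKY_FIELDS}

def fit_scaler(rows, fields):
    """Learn mean and standard deviation per field from the training window only."""
    params = {}
    for f in fields:
        col = [r[f] for r in rows]
        params[f] = (mean(col), pstdev(col) or 1.0)   # avoid dividing by zero
    return params

def apply_scaler(row, params):
    """Scale a row with frozen parameters; never refit on the row being scored."""
    return {f: (row[f] - mu) / sigma for f, (mu, sigma) in params.items()}

past = [drop_leaky(r) for r in [
    {"domain_age": 1, "verdict": "phish"},
    {"domain_age": 3, "verdict": "benign"},
]]
params = fit_scaler(past, fields=["domain_age"])

new_record = drop_leaky({"domain_age": 10, "verdict": "phish"})
scaled_new = apply_scaler(new_record, params)
print(scaled_new)
```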
Threats and mitigations
- Feature gaming: attacker nudges features toward a benign cluster core
  - mitigate —> include hard-to-fake features (domain age, ASN, cert lineage) and backstop with supervised checks
- Poisoning: many injected samples pull density
  - mitigate —> rate limit training contributions, outlier screens, rolling windows
- Concept drift: shapes move as campaigns evolve
  - mitigate —> scheduled refits, compare cluster stability, and rename or merge
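For the poisoning mitigation, a rolling window combined with a per-source cap limits how much density any single contributor can add before clustering. This is a minimal sketch; the record fields (`id`, `source`, `ts`) and the cap value are assumptions made for the example.

```python
from collections import defaultdict

def cap_per_source(samples, window_start, max_per_source):
    """Keep samples inside the rolling window, at most max_per_source per source, newest first."""
    kept, counts = [], defaultdict(int)
    for s in sorted(samples, key=lambda s: s["ts"], reverse=True):
        if s["ts"] < window_start:
            continue                      # outside the rolling window
        if counts[s["source"]] < max_per_source:
            counts[s["source"]] += 1
            kept.append(s)
    return kept

# Hypothetical submissions: one source floods the window, another is mostly quiet
samples = [{"id": i, "source": "feed_a", "ts": 100 + i} for i in range(5)]
samples += [
    {"id": 10, "source": "feed_b", "ts": 103},
    {"id": 11, "source": "feed_b", "ts": 10},    # too old, falls out of the window
]
kept = cap_per_source(samples, window_start=50, max_per_source=2)
print(sorted(s["id"] for s in kept))
```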
Takeaways
Use DBSCAN when you can set a good neighborhood scale and want noise labeling. Use HDBSCAN when densities vary or you do not want to guess eps. Always scale, usually reduce, and validate cluster stability and utility.