At a glance

  • K-means → fast clustering when groups are compact and similar in size
  • DBSCAN and HDBSCAN → clusters of any shape with built-in noise labeling
  • Gaussian Mixture Models → soft clusters plus a density score
  • Agglomerative clustering → hierarchical view you can cut at any level
  • Spectral clustering → non-convex structure when a good similarity graph exists
  • PCA and Truncated SVD → compress features for speed and clarity
  • UMAP and t-SNE → 2D maps for analyst triage, not for policy
  • Isolation Forest → strong default for point anomalies
  • Local Outlier Factor → find locally sparse points
  • One-Class SVM → kernel-based frontier around normal behavior
  • Autoencoder anomaly detection → reconstruction error for complex signals
  • Graph communities and embeddings → find operator cohorts in indicator graphs
  • BIRCH and k-medoids → very large data or outlier-robust clustering

Comparison table

| Algorithm | Best for | Strengths | Watch outs | Typical security uses |
| --- | --- | --- | --- | --- |
| K-means | Compact, similar-size clusters | Very fast, scalable, simple | Needs k; Euclidean scale sensitive; outliers pull centroids | Phishing campaign grouping from TF-IDF after PCA; alert dedup buckets |
| DBSCAN | Arbitrary shapes with a single density | Finds noise; no k | Choose eps well; mixed densities hurt | Infra clustering from TLS or URL features with noise points |
| HDBSCAN | Variable-density clusters | Auto-selects clusters; labels noise | Few knobs, but still metric sensitive | Campaign discovery when densities vary across families |
| GMM | Elliptical soft clusters and density | Probabilities per cluster; log-likelihood for anomaly | Choose K; scale features; covariance issues | Auth cohorts and low-likelihood session alerts |
| Agglomerative | Hierarchical exploration | Dendrogram insights; flexible linkage | Needs a cut level; O(n²) on large sets | Group similar alerts or binaries at different granularities |
| Spectral | Non-convex structure on graphs | Separates intertwined shapes | Must build and tune a similarity graph | Domain similarity graph clustering from n-grams and WHOIS |
| PCA | Numeric compression | Fast, stable, improves distances | Max variance ≠ max separation | Reduce URL/email features 100 → 20 for faster models |
| Truncated SVD | Sparse text compression | Works on TF-IDF directly | Dense output still needs scaling | Compress URL or email n-grams before clustering or SVM |
| UMAP / t-SNE | Visual maps | Great analyst triage views | Not for policies; thresholds unstable | Map alerts to see campaign islands and outliers |
| Isolation Forest | General point anomalies | Few assumptions; robust; fast | Thresholding and contamination choice | Login, DNS, and process anomalies as ranked leads |
| Local Outlier Factor | Local density anomalies | Captures neighborhood rarity | Sensitive to k and scale | Rare device/geo combos in auth streams |
| One-Class SVM | Boundary around normal | Kernel flexibility | Needs scaling; tuning nu and gamma | Baseline normal per tenant, then score sessions |
| Autoencoder | Complex reconstruction anomalies | Learns nonlinear structure | More tuning and compute; less explainable | Unusual process trees, command lines, network bursts |
| Graph communities and embeddings | Relationships matter | Operator cohort mapping; inductive embeddings | Hubs and stale edges can mislead | Domain/IP/cert communities; node2vec then HDBSCAN |
| BIRCH / k-medoids | Massive data or outlier robustness | Stream friendly (BIRCH) or real exemplars (medoids) | Coarse splits for BIRCH; slower for medoids | Large-scale alert dedup; keep a real exemplar per cluster |

How to choose in practice

  • Campaign and infrastructure maps

    • Start with K-means after PCA/SVD → if shapes are irregular or densities vary, switch to HDBSCAN
    • On graphs, use Louvain/Leiden communities or node2vec → cluster the embeddings
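A minimal sketch of the first recipe with scikit-learn, on synthetic data (the cluster count, feature sizes, and component counts are illustrative, not prescriptive):

```python
# Sketch: campaign/infra map via scale -> PCA -> K-means.
# Synthetic data stands in for per-alert or per-indicator numeric features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three compact "campaigns" in a 50-D feature space.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 50)) for c in (0.0, 3.0, 6.0)])

X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=10, random_state=0).fit_transform(X_scaled)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_pca)
print("silhouette:", round(silhouette_score(X_pca, km.labels_), 2))
# A low silhouette or stringy clusters is the cue to switch to HDBSCAN.
```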
  • Anomaly alert feeds

    • Start with Isolation Forest → compare against LOF and PCA reconstruction residual → consider One-Class SVM if the boundary is curved
    • For sequences or rich signals, try an Autoencoder if you can afford the tuning
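A sketch of the comparison step: score the same feed with Isolation Forest and LOF and check how much the top-k lead lists agree (synthetic data; the sizes, seeds, and k=10 cutoff are illustrative):

```python
# Sketch: rank anomalies with Isolation Forest, sanity-check against LOF.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
normal = rng.normal(0, 1, size=(500, 8))   # baseline sessions
odd = rng.normal(6, 1, size=(5, 8))        # a few clearly odd sessions
X = np.vstack([normal, odd])

iso = IsolationForest(n_estimators=200, random_state=1).fit(X)
iso_scores = -iso.score_samples(X)          # higher = more anomalous

lof = LocalOutlierFactor(n_neighbors=20).fit(X)
lof_scores = -lof.negative_outlier_factor_  # higher = more anomalous

# Compare the top-10 lead lists from both detectors before picking one.
top_iso = set(np.argsort(iso_scores)[-10:])
top_lof = set(np.argsort(lof_scores)[-10:])
print("overlap in top 10 leads:", len(top_iso & top_lof))
```

Low overlap between the lists usually means local density matters; that is the signal to weigh LOF or a curved-boundary model more heavily.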
  • Text and URL tokens (high dimensional and sparse)

    • Truncated SVD for compression → K-means or HDBSCAN for campaigns → Isolation Forest for anomalies
  • Mixed tabular metadata

    • PCA to 10–30 components → K-means or GMM for cohorts → Isolation Forest for outliers
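A sketch of the GMM branch: soft cohort membership plus a log-likelihood anomaly score, on synthetic tabular data (the 1% lead budget and component counts are illustrative):

```python
# Sketch: cohorts plus a density-based outlier score from a GMM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0, 1, size=(300, 12)),   # cohort A
    rng.normal(5, 1, size=(300, 12)),   # cohort B
])

X_red = PCA(n_components=5, random_state=0).fit_transform(
    StandardScaler().fit_transform(X))

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_red)
probs = gmm.predict_proba(X_red)    # soft membership per cohort
loglik = gmm.score_samples(X_red)   # low log-likelihood = candidate anomaly

# Flag the lowest-likelihood 1% as leads for review.
threshold = np.quantile(loglik, 0.01)
leads = np.where(loglik < threshold)[0]
print(len(leads), "leads")
```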
  • Visualization for triage

    • Build UMAP or t-SNE maps from PCA/SVD inputs → use for human sensemaking only
  • Operational constraints

    • Tight latency and simplicity → K-means or Isolation Forest
    • Heavy scale → MiniBatch K-means or BIRCH
    • Need soft membership and density → GMM
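For the heavy-scale constraint, MiniBatchKMeans can consume the feed in chunks via `partial_fit` so the full dataset never sits in memory. A sketch on a synthetic stream (chunk sizes and cluster count are illustrative):

```python
# Sketch: stream-friendly clustering with MiniBatchKMeans.partial_fit.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(4)
mbk = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

# Feed the stream chunk by chunk; each chunk mixes three cohorts.
for _ in range(50):
    chunk = np.vstack([rng.normal(c, 0.5, size=(100, 16))
                       for c in (0.0, 4.0, 8.0)])
    mbk.partial_fit(chunk)

# The fitted model assigns fresh points without retraining.
probes = np.vstack([np.full((1, 16), c) for c in (0.0, 4.0, 8.0)])
labels = mbk.predict(probes)
print(len(set(labels)), "distinct clusters hit")
```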

Metrics that match operations

  • Clustering quality: silhouette and Davies–Bouldin; higher silhouette and lower DBI → better cohesion and separation
  • Anomaly usefulness: precision at top k on a weekly labeled sample, time to detect, analyst effort saved
  • Compression adequacy: cumulative explained variance and downstream model performance
  • Stability over time: cluster overlap across weeks, centroid drift, community modularity
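The first two metric families can be computed directly; a sketch on synthetic data (the labels standing in for analyst verdicts and the k=10 cutoff are illustrative):

```python
# Sketch: clustering scores plus precision-at-k on a labeled sample.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(c, 0.6, size=(80, 6)) for c in (0.0, 4.0, 8.0)])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", round(silhouette_score(X, labels), 2))          # higher is better
print("davies-bouldin:", round(davies_bouldin_score(X, labels), 2))  # lower is better

# Precision at top k for an anomaly feed: fraction of the k highest-scored
# items that analysts confirmed (scores and verdicts are synthetic here).
scores = rng.random(100)
verdicts = (scores > 0.7).astype(int)  # stand-in for analyst confirmations
k = 10
top_k = np.argsort(scores)[-k:]
print("precision@10:", verdicts[top_k].mean())
```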

Simple starting playbook

  • Email and URL campaign discovery
    TF-IDF → Truncated SVD → HDBSCAN → name clusters with top terms → route tickets by cluster id
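A sketch of this pipeline on a toy URL corpus. K-means stands in for HDBSCAN so the example runs on any scikit-learn version; `sklearn.cluster.HDBSCAN` (scikit-learn ≥ 1.3) or the `hdbscan` package drops in at the same step. The URLs, n-gram range, and component count are illustrative:

```python
# Sketch: TF-IDF -> Truncated SVD -> clustering on character n-grams.
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

urls = [
    "login-secure-paypai.com/verify/account",
    "secure-login-paypai.net/verify/update",
    "cdn.example.com/static/app.js",
    "cdn.example.org/static/site.css",
    "invoice-download-free.xyz/doc.exe",
    "free-invoice-download.xyz/file.exe",
]

tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = tfidf.fit_transform(urls)

# Truncated SVD works on the sparse matrix directly; re-normalize after.
lsa = make_pipeline(TruncatedSVD(n_components=4, random_state=0), Normalizer())
X_lsa = lsa.fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_lsa)
print(labels)  # look-alike URLs land in the same cluster
```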

  • Auth anomaly surfacing
    Engineer user-normalized features → Isolation Forest → set the threshold from a daily alert budget → review precision at k weekly
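A sketch of the budget-driven threshold: pick the cutoff so today's feed yields roughly the number of alerts analysts can review (synthetic sessions; the 50-alert budget is illustrative):

```python
# Sketch: convert Isolation Forest scores into a fixed daily alert budget.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(6)
X_train = rng.normal(0, 1, size=(5000, 10))  # past sessions (fit window)
X_today = rng.normal(0, 1, size=(5000, 10))  # current day's sessions

iso = IsolationForest(n_estimators=100, random_state=6).fit(X_train)
scores = -iso.score_samples(X_today)         # higher = more anomalous

budget = 50                                  # alerts analysts can review per day
threshold = np.sort(scores)[-budget]         # budget-th highest score
alerts = np.where(scores >= threshold)[0]
print(len(alerts), "alerts today")
```

Tracking precision at k against this budget each week tells you whether the threshold, the features, or the model needs adjusting.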

  • Infra grouping from TLS or DNS
    Handcraft fingerprints and ages → PCA → DBSCAN (pick eps via the k-distance elbow) → noise points are probes and one-offs
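A sketch of the eps selection step: sort each point's distance to its k-th neighbor and read eps off the curve's elbow. Here a high quantile stands in for the manual elbow read, and the two infra "families" plus scattered probes are synthetic:

```python
# Sketch: pick DBSCAN's eps from the k-distance curve, then cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
X = np.vstack([
    rng.normal(0, 0.3, size=(100, 4)),  # infra family A
    rng.normal(3, 0.3, size=(100, 4)),  # infra family B
    rng.uniform(-2, 5, size=(10, 4)),   # scattered probes / one-offs
])

k = 5  # matches min_samples below
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kdist = np.sort(dists[:, -1])           # sorted distance to the k-th neighbor
# In practice, plot kdist and read eps at the elbow; a high quantile is a
# rough stand-in for that manual read.
eps = float(np.quantile(kdist, 0.90))

labels = DBSCAN(eps=eps, min_samples=k).fit_predict(X)
print("clusters:", len(set(labels) - {-1}),
      "| noise points:", int((labels == -1).sum()))
```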

  • Analyst triage map
    PCA → UMAP → plot clusters and the anomaly scatter → link points back to raw evidence
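A sketch of the map step. UMAP lives in the third-party `umap-learn` package, so t-SNE from scikit-learn is shown instead; `umap.UMAP` exposes a similar `fit_transform` and slots into the same place. Data and parameters are illustrative:

```python
# Sketch: 2-D triage map via PCA then t-SNE (for plotting only, never policy).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Two alert populations in a 30-D feature space.
X = np.vstack([rng.normal(c, 0.5, size=(60, 30)) for c in (0.0, 4.0)])

X_pca = PCA(n_components=10, random_state=0).fit_transform(X)
X_2d = TSNE(n_components=2, perplexity=20, random_state=0).fit_transform(X_pca)
print(X_2d.shape)  # → (120, 2): one (x, y) point per alert, ready to plot
```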


Guardrails

  • Always scale features, and fit scalers on past data only
  • For distance-based methods, reduce dimensionality first (PCA/SVD)
  • Tune parameters with time-aware validation
  • Treat anomalies as leads, not verdicts; keep a human in the loop
  • Monitor drift and refresh models on a cadence
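The first and last guardrails can be sketched together: fit the scaler on the past window only, apply it to the current window, and watch the scaled mean for drift (synthetic "days"; the drift heuristic is illustrative, not a standard test):

```python
# Sketch: time-aware scaling with a simple drift check.
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
past = rng.normal(0, 1, size=(1000, 6))      # e.g. last 30 days of features
today = rng.normal(0.2, 1.1, size=(200, 6))  # current day, slightly drifted

scaler = StandardScaler().fit(past)          # never fit on today's data
today_scaled = scaler.transform(today)

# If today's scaled per-feature mean moves far from 0, the baseline is
# stale: schedule a model refresh.
drift = float(np.abs(today_scaled.mean(axis=0)).max())
print("max per-feature drift:", round(drift, 2))
```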

