What is K-means

K-means is an unsupervised algorithm that groups data into k clusters so that items inside a cluster are close to one another and far from other clusters.
Intuition: place k points as tentative centers —> assign each sample to its nearest center —> move centers to the mean of their assigned samples —> repeat until nothing changes much.

Why security teams use it

  • Campaign views: cluster similar emails or URLs to see active phishing waves
  • Infra mapping: cluster domains or hosts that look alike to reveal operator footprints
  • Alert dedup: cluster near-duplicate alerts so analysts work cases, not copies
  • User cohorts: cluster similar behavior to set sensible anomaly thresholds per group

How K-means works step by step

  1. Choose k, the number of clusters you want.
  2. Initialize centroids, usually with k-means++, which spreads the seeds sensibly.
  3. Assign each point to its nearest centroid by Euclidean distance.
  4. Update each centroid to the mean of the points assigned to it.
  5. Repeat assign —> update until assignments stop changing or the improvement is tiny (a minimal sketch of this loop follows).
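
The loop in steps 2–5 is short enough to sketch directly. A toy numpy version, with plain random seeding standing in for k-means++ (assumed data shapes, not a production implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))   # toy data; swap in real features
k = 3

# Step 2: seed centroids (plain random pick here; k-means++ spreads them better)
centroids = X[rng.choice(len(X), size=k, replace=False)]

for _ in range(100):
    # Step 3: assign every point to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of its assigned points
    # (a production loop would also guard against empty clusters)
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 5: stop when nothing moves
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
```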

Objective minimized

Inertia = Σ_j Σ_{x ∈ C_j} ‖x − μ_j‖², where C_j is cluster j and μ_j is its centroid.

Lower inertia —> tighter clusters.
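
scikit-learn (assumed here and in the later sketches) exposes this quantity on a fitted model as `inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 2))   # toy data
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_)   # sum of squared distances to each point's nearest centroid
```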


Security examples that click

Phishing campaign clustering

  • Features: URL character n-gram TF-IDF, domain age, sender org indicator
  • Output: clusters of similar emails —> act at the campaign level (sketch below)
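
A minimal sketch of the URL half of this, with invented toy URLs; in practice you would scale numeric enrichments like domain age and stack them alongside the TF-IDF block:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

urls = [  # toy stand-ins for observed phishing URLs
    "http://secure-login.acme-update.example/verify",
    "http://acme-update-secure.example/login/verify",
    "http://track.shipping-notice.example/pkg?id=1",
]

# Character n-gram TF-IDF captures look-alike URL structure;
# rows come out L2-normalized by default
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(urls)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```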

DGA family rough grouping

  • Features: domain length, digit ratio, character n-grams reduced by PCA
  • Output: clusters of DGAs that behave alike —> block rules per group (sketch below)
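
A hedged sketch with invented toy domains; TruncatedSVD stands in for PCA here because TF-IDF matrices are sparse:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

domains = ["xkqjwz.com", "qwpzmr.net", "k2v9xq.info"]   # toy stand-ins

# Handcrafted features: length and digit ratio
lengths = np.array([[len(d)] for d in domains], dtype=float)
digit_ratio = np.array([[sum(c.isdigit() for c in d) / len(d)] for d in domains])

# Character n-grams, reduced to a few dense components
ngrams = TfidfVectorizer(analyzer="char", ngram_range=(2, 3)).fit_transform(domains)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(ngrams)

# In practice, standardize these blocks so no one block dominates the distance
X = np.hstack([lengths, digit_ratio, reduced])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```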

Login behavior cohorts

  • Features: hour-of-day histogram, device mix, geo dispersion, session length
  • Output: user groups —> per-cohort anomaly thresholds (sketch below)
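
A sketch of the thresholding idea with random stand-in features (the column meanings are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Rows are users; columns are hypothetical behavioral features such as
# night-login fraction, device count, geo dispersion, mean session length
X = np.random.default_rng(0).normal(size=(500, 4))

Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Xs)

# Per-cohort threshold: e.g. the 95th percentile of distance to the centroid;
# a user drifting past their cohort's threshold becomes an anomaly candidate
dists = np.linalg.norm(Xs - km.cluster_centers_[km.labels_], axis=1)
thresholds = {c: np.percentile(dists[km.labels_ == c], 95) for c in range(5)}
```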

Alert deduplication

  • Features: normalized rule id, asset role, tokenized process command
  • Output: clusters of near-duplicates —> one ticket per cluster (sketch below)
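
One way to realize the one-ticket-per-cluster step, with random stand-in features: take the alert nearest each centroid as the exemplar:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 8))   # toy encoded alert features
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X)

# Exemplar per cluster: the alert closest to its centroid becomes the ticket
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
exemplars = {
    c: np.where(km.labels_ == c)[0][dists[km.labels_ == c].argmin()]
    for c in range(km.n_clusters)
}
```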

Feature engineering and scaling

  • Scale features: standardize or min–max so one field does not dominate the distance.
  • Text and URLs: use TF-IDF, then L2 normalize; if you need cosine similarity, apply spherical k-means or normalize first.
  • High dimensional: reduce with PCA to 20–100 components before K-means to improve cluster quality and speed.
  • Categoricals: one-hot encode or use embeddings first.
  • Outliers: these can pull centroids —> trim extremes or use robust variants like k-medoids for very noisy sets. A pipeline sketch follows this list.
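
The scale-then-reduce-then-cluster steps compose naturally into a scikit-learn pipeline; a sketch with assumed sizes (60 input features, 20 components, 8 clusters):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(0).normal(size=(400, 60))   # toy high-dimensional data

pipe = make_pipeline(
    StandardScaler(),      # no single field dominates the distance
    PCA(n_components=20),  # keep dimensions lean
    KMeans(n_clusters=8, n_init=10, random_state=0),
)
labels = pipe.fit_predict(X)
```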

Choosing k

  • Elbow: plot k vs inertia and pick the bend.
  • Silhouette: average per-point score in [-1, 1]; higher is better (a sweep sketch follows this list).
  • Stability: run with different seeds and check whether clusters persist.
  • Operational fit: pick a k that yields a reviewable number of clusters for analysts.
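
A sketch of the sweep on toy data, printing both inertia (for the elbow) and silhouette per k:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(400, 5))   # toy data

for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    score = silhouette_score(X, km.labels_)
    print(f"k={k:2d}  inertia={km.inertia_:9.1f}  silhouette={score:.3f}")
```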

Evaluation that matches operations

  • Internal metrics: silhouette, Davies–Bouldin, within-cluster spread.
  • External sanity: have analysts spot-check a few samples per cluster.
  • Cluster utility: dedup rate, time saved per incident, coverage of major campaigns.
  • Drift checks: cluster counts by week, centroid movement, silhouette over time (a drift sketch follows).
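
One sketch of a centroid-movement check between weekly fits; cluster labels are arbitrary across runs, so this matches centroids first (the Hungarian matching via scipy is an assumption of this sketch, not a prescribed method):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_week1 = rng.normal(size=(300, 4))
X_week2 = X_week1 + rng.normal(scale=0.1, size=(300, 4))   # simulated drift

c1 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_week1).cluster_centers_
c2 = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_week2).cluster_centers_

# Match centroids across fits, then report the mean movement of matched pairs
cost = cdist(c1, c2)
rows, cols = linear_sum_assignment(cost)
print(f"mean centroid movement: {cost[rows, cols].mean():.3f}")
```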

Practical workflow

  1. Define the goal: campaign view, dedup view, or cohorting.
  2. Assemble features: human-readable first, then enrich.
  3. Scale and reduce: standardize, and apply PCA if high dimensional.
  4. Run K-means with k-means++ initialization.
  5. Sweep k and compute silhouette and stability.
  6. Name clusters with simple rules: tokens, top terms, centroid summaries.
  7. Integrate: cluster id —> ticket title, rule routing, dashboard.
  8. Monitor weekly metrics and refresh centroids on a rolling window.

Pitfalls and fixes

  • Assumes spherical clusters of similar size
    fix —> try PCA, engineer features, or switch to DBSCAN or HDBSCAN if shapes are irregular.
  • Sensitive to scale
    fix —> standardize features before computing distances.
  • Sensitive to seeds
    fix —> use k-means++ and multiple initializations, then pick the best inertia or silhouette.
  • Outliers drag centroids
    fix —> trim, winsorize, or use k-medoids or mini-batch variants with robust preprocessing.
  • Empty clusters when data is sparse
    fix —> reinitialize the empty centroid at a far point or lower k.
  • Curse of dimensionality: distances lose meaning
    fix —> PCA or feature selection; keep dimensions lean.

Variants to know

  • MiniBatch K-means: faster on streams or very large corpora (streaming sketch below).
  • Spherical K-means: cosine-friendly for TF-IDF text and URL tokens.
  • K-medoids (PAM): uses actual points as centers; more robust to outliers.
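
A streaming sketch with `MiniBatchKMeans.partial_fit`, assuming invented batch shapes:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)

rng = np.random.default_rng(0)
for _ in range(50):                       # e.g. one batch per ingest interval
    batch = rng.normal(size=(1024, 16))   # stand-in for streamed features
    mbk.partial_fit(batch)                # centroids update incrementally

# New arrivals are scored against the current centroids
labels = mbk.predict(rng.normal(size=(10, 16)))
```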

Common hyperparameters

  • n_clusters: how many groups
  • init: k-means++ or random
  • n_init: number of different seeds to try
  • max_iter: cap on iterations per run
  • tol: improvement threshold to stop early
  • batch_size: for the mini-batch variant (constructor sketch below)
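
These map one-to-one onto the scikit-learn constructors:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans

km = KMeans(
    n_clusters=8,       # how many groups
    init="k-means++",   # seeding: "k-means++" or "random"
    n_init=10,          # independent seeded runs; best inertia wins
    max_iter=300,       # cap on assign/update iterations per run
    tol=1e-4,           # stop early when improvement falls below this
)

mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024)   # mini-batch variant
```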

Security focused testing checklist

  • Confirm scaling and PCA steps are fit on past data only and reused for future scoring (sketch after this list)
  • Sweep k and seeds; measure silhouette and stability
  • Inspect top terms or feature means per cluster: are they coherent and nameable?
  • Verify the dedup compression ratio: tickets per day should drop without missing real cases
  • Track centroid drift and cluster counts by week; retrain when drift exceeds a bound
  • Guard against leakage: remove post-verdict fields and future-only signals
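
A sketch of the first item: fit on the past window only, then score the future with the frozen transforms (toy data, assumed sizes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_past = rng.normal(size=(500, 30))   # historical window
X_new = rng.normal(size=(50, 30))     # events to score later

# Fit every preprocessing step on the past window only
scaler = StandardScaler().fit(X_past)
pca = PCA(n_components=20).fit(scaler.transform(X_past))
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X_past))
)

# Score the future with the frozen transforms; never refit on X_new
new_labels = km.predict(pca.transform(scaler.transform(X_new)))
```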

Threats and mitigations

  • Feature gaming: an attacker crafts inputs to land in a benign cluster
    • Mitigate: add hard-to-fake features (e.g., domain age, hosting ASN, cert fingerprints) and keep a supervised backstop.
  • Data poisoning: injected samples pull centroids
    • Mitigate: robust preprocessing, outlier screening, capped influence per source, rolling retrains.
  • Concept drift: campaigns evolve and clusters go stale
    • Mitigate: periodic re-fits, silhouette comparisons, and renaming or merging split clusters.

Takeaways

  • Use K-means when you want fast, scalable grouping with clear, nameable clusters.
  • Make scaling and PCA standard, pick k with silhouette and operational needs, and keep a light drift watch.
  • Treat clusters as maps for campaigns, infrastructure, cohorts, and dedup — they guide analysts and reduce noise.