What is K-means
K-means is an unsupervised algorithm that groups data into k clusters so that items inside a cluster are close to one another and far from other clusters.
Intuition: place k points as tentative centers —> assign each sample to its nearest center —> move centers to the mean of their assigned samples —> repeat until nothing changes much.
Why security teams use it
- Campaign views: cluster similar emails or URLs to see active phishing waves
- Infra mapping: cluster domains or hosts that look alike to reveal operator footprints
- Alert dedup: cluster near-duplicate alerts so analysts work cases, not copies
- User cohorts: cluster similar behavior to set sensible anomaly thresholds per group
How K-means works step by step
- Choose k, the number of clusters you want.
- Initialize centroids, usually with k-means++, which spreads seeds sensibly.
- Assign each point to its nearest centroid by Euclidean distance.
- Update each centroid to the mean of points assigned to it.
- Repeat assign —> update until assignments stop changing or improvement is tiny.
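To make the loop concrete, here is a minimal NumPy sketch of the assign —> update iteration (random seeding for brevity; k-means++ is the better default):

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-4, seed=0):
    """Minimal Lloyd's algorithm: assign points, move centroids, repeat."""
    rng = np.random.default_rng(seed)
    # Random init for brevity; k-means++ seeding spreads seeds more sensibly.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.linalg.norm(new_centroids - centroids) < tol:
            break  # improvement is tiny -> stop
        centroids = new_centroids
    return labels, centroids
```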
Objective minimized
Inertia = Σ over clusters Σ over points in cluster || x - centroid ||^2
Lower inertia —> tighter clusters.
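As a sanity check, the formula can be recomputed by hand and compared with scikit-learn's `inertia_` attribute; the toy data below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(500, 8))  # toy data
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Inertia as defined above: sum of squared distances to assigned centroids.
manual = sum(
    np.sum((X[km.labels_ == j] - c) ** 2)
    for j, c in enumerate(km.cluster_centers_)
)
assert np.isclose(manual, km.inertia_)
```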
Security examples that click
Phishing campaign clustering
- Features: URL character n-grams (TF-IDF), domain age, sender-org indicator
- Output: clusters of similar emails —> act at the campaign level
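A compact sketch of such a pipeline; the URLs and hyperparameters are hypothetical. TF-IDF rows are L2-normalized by default, which keeps Euclidean K-means close to a cosine comparison:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

urls = [  # illustrative phishing URLs
    "http://secure-login.paypa1-verify.com/a",
    "http://secure-login.paypa1-verify.com/b",
    "http://invoice-dl.example.biz/pay",
]
# Character 3-5 grams capture lookalike tokens across campaign variants.
pipe = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
labels = pipe.fit_predict(urls)
```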
DGA family rough grouping
- Features: domain length, digit ratio, character n-grams reduced by PCA
- Output: clusters of DGA domains that behave alike —> block rules per group
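A sketch under toy assumptions; TruncatedSVD stands in for PCA here because the n-gram matrix is sparse (real data would use 20–100 components, per the scaling notes below):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

domains = ["xjw9qkz3.com", "kq8vn2rp.com", "update-cdn.net"]  # illustrative

# Simple numeric features alongside character n-grams.
length = np.array([[len(d)] for d in domains], dtype=float)
digit_ratio = np.array([[sum(c.isdigit() for c in d) / len(d)] for d in domains])
ngrams = TfidfVectorizer(analyzer="char", ngram_range=(2, 4)).fit_transform(domains)

X = hstack([ngrams, csr_matrix(length), csr_matrix(digit_ratio)])
# Reduce the sparse matrix before clustering (scale numeric columns in practice).
X_red = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_red)
```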
Login behavior cohorts
- Features: hour-of-day histogram, device mix, geo dispersion, session length
- Output: user groups —> per-cohort anomaly thresholds
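One way to derive per-cohort thresholds, assuming a synthetic feature matrix (24 hour-of-day bins plus three behavioral columns):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 27))  # hypothetical per-user behavior features

Xs = StandardScaler().fit_transform(X)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xs)

# Per-cohort anomaly threshold: e.g., 95th percentile of distance to the
# cohort centroid; users beyond it get flagged relative to their peers.
dists = np.linalg.norm(Xs - km.cluster_centers_[km.labels_], axis=1)
thresholds = {j: np.percentile(dists[km.labels_ == j], 95) for j in range(4)}
```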
Alert deduplication
- Features: normalized rule id, asset role, tokenized process command
- Output: clusters of near-duplicates —> one ticket per cluster
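A minimal dedup sketch over hypothetical normalized alert strings, grouping alert ids so each cluster yields one ticket:

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

alerts = [  # illustrative normalized alert strings
    "rule:4625 role:dc proc:lsass.exe dump",
    "rule:4625 role:dc proc:lsass.exe dump retry",
    "rule:7045 role:web proc:svc install",
]
X = TfidfVectorizer().fit_transform(alerts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# One ticket per cluster: group alert ids, open a single case each.
tickets = defaultdict(list)
for alert_id, label in enumerate(labels):
    tickets[label].append(alert_id)
```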
Feature engineering and scaling
- Scale features: standardize or min–max so one field does not dominate distance.
- Text and URLs: use TF-IDF, then L2-normalize; if cosine similarity is what matters, apply spherical k-means or normalize first.
- High-dimensional data: reduce with PCA to 20–100 components before K-means to improve cluster quality and speed.
- Categoricals: one-hot encode or use embeddings first.
- Outliers can pull centroids —> trim extremes or use robust variants like k-medoids for very noisy sets.
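For the cosine case above, a common stand-in for spherical k-means is to L2-normalize rows first: on unit vectors, squared Euclidean distance is a monotone function of cosine distance. This is an approximation (the centroid means are not re-normalized); toy vectors below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

X = np.random.default_rng(0).normal(size=(300, 50))  # stand-in for TF-IDF rows
X_unit = normalize(X, norm="l2")  # unit length: ||a - b||^2 = 2 - 2*cos(a, b)
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_unit)
```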
Choosing k
- Elbow: plot k vs. inertia and pick the bend.
- Silhouette: average per-point score in [-1, 1]; higher is better.
- Stability: run with different seeds and check whether clusters persist.
- Operational fit: pick a k that yields a reviewable number of clusters for analysts.
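A typical sweep combining the elbow and silhouette checks, on placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(400, 10))  # placeholder features

scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)  # sanity-check against the elbow too
```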
Evaluation that matches operations
- Internal metrics: silhouette, Davies–Bouldin, within-cluster spread.
- External sanity: have analysts spot-check a few samples per cluster.
- Cluster utility: dedup rate, time saved per incident, coverage of major campaigns.
- Drift checks: cluster counts by week, centroid movement, silhouette over time.
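A sketch of a weekly centroid-drift check, with synthetic stand-ins for last week's and this week's features; one way to measure movement is to match old centroids to new ones by minimum-cost assignment first:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_last_week, X_this_week = rng.normal(size=(300, 10)), rng.normal(size=(300, 10))

old = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_last_week)
new = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_this_week)

# Match each old centroid to its nearest new counterpart, then measure
# how far each moved; retrain when the mean shift exceeds a chosen bound.
cost = np.linalg.norm(old.cluster_centers_[:, None] - new.cluster_centers_[None], axis=2)
rows, cols = linear_sum_assignment(cost)
mean_shift = cost[rows, cols].mean()
```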
Practical workflow
- Define the goal: campaign view, dedup view, or cohorting.
- Assemble features: human-readable first, then enrich.
- Scale and reduce: standardize, then apply PCA if high-dimensional.
- Run K-means with k-means++ initialization.
- Sweep k and compute silhouette and stability.
- Name clusters with simple rules: tokens, top terms, centroid summaries.
- Integrate: cluster id —> ticket title, rule routing, dashboard.
- Monitor weekly metrics and refresh centroids on a rolling window.
Pitfalls and fixes
- Assumes spherical clusters with similar size
  fix —> try PCA, engineer features, or switch to HDBSCAN or DBSCAN if shapes are irregular.
- Sensitive to scale
  fix —> standardize features before distance.
- Sensitive to seeds
  fix —> k-means++, multiple initializations, pick the best inertia or silhouette.
- Outliers drag centroids
  fix —> trim, winsorize, or use k-medoids or mini-batch with robust preprocessing.
- Empty clusters when data is sparse
  fix —> reinitialize the empty centroid to a far point, or lower k.
- Curse of dimensionality: distances lose meaning
  fix —> PCA or feature selection; keep dimensions lean.
Variants to know
- MiniBatch K-means: faster on streams or very large corpora.
- Spherical K-means: cosine-friendly for TF-IDF text and URL tokens.
- K-medoids (PAM): uses actual points as centers; more robust to outliers.
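A MiniBatch sketch that streams batches via `partial_fit`, with synthetic data standing in for a large corpus:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.default_rng(0).normal(size=(100_000, 20))  # large-corpus stand-in

# partial_fit lets you stream batches instead of holding everything at once.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)
for start in range(0, len(X), 1024):
    mbk.partial_fit(X[start : start + 1024])
labels = mbk.predict(X)
```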
Common hyperparameters
- n_clusters: how many groups
- init: "k-means++" or "random"
- n_init: number of different seeds to try
- max_iter: cap on iterations per run
- tol: improvement threshold to stop early
- batch_size: batch size for the mini-batch variant
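In scikit-learn terms (values here are illustrative, not recommendations):

```python
from sklearn.cluster import KMeans

km = KMeans(
    n_clusters=8,      # how many groups
    init="k-means++",  # spread the seeds sensibly
    n_init=10,         # try 10 seedings, keep the best inertia
    max_iter=300,      # cap assign/update iterations per run
    tol=1e-4,          # stop early when improvement is tiny
    random_state=0,    # reproducibility
)
```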
Security focused testing checklist
- Confirm scaling and PCA steps are fit on past data only and reused for future scoring
- Sweep k and seeds; measure silhouette and stability
- Inspect top terms or feature means per cluster: are they coherent and nameable?
- Verify the dedup compression ratio: tickets per day should drop without missing real cases
- Track centroid drift and cluster count by week; retrain when drift exceeds a bound
- Guard against leakage: remove post-verdict fields or future-only signals
Threats and mitigations
- Feature gaming: attacker crafts inputs to land in a benign cluster
- Mitigate: add features that are hard to fake (e.g., domain age, hosting ASN, cert fingerprints) and use a supervised backstop.
- Data poisoning: injected samples pull centroids
- Mitigate: robust preprocessing, outlier screening, capped influence per source, rolling retrains.
- Concept drift: campaigns evolve and clusters go stale
- Mitigate: periodic re-fits, silhouette comparisons, renaming or merging split clusters.
Takeaways
- Use K-means when you want fast, scalable grouping with clear, nameable clusters.
- Make scaling and PCA standard, pick k with silhouette and operational needs, and keep a light drift watch.
- Treat clusters as maps for campaigns, infrastructure, cohorts, and dedup — they guide analysts and reduce noise.