Linear and Logistic regression

What is Linear Regression

Linear regression is a fundamental supervised learning algorithm that predicts continuous values by finding a linear relationship between inputs and an output. In cybersecurity, it is a simple and explainable baseline that helps you reason about more complex models you will test, attack, or defend.

Think of it as drawing the best fitting straight line through data points to make predictions. The model learns from historical data and assumes changes in input variables lead to proportional changes in the output.

Core Concepts

Simple Linear Regression

y = m x + c

Where

y = predicted value
x = input feature
m = slope
c = intercept

Multiple Linear Regression

y = b0 + b1 x1 + b2 x2 + … + bn xn

Practical Examples in a Security Context

Keep targets continuous for linear regression. If the target is a class like phish vs not phish, use logistic regression.

Network baseline
- Inputs: time of day, weekday flag, active sessions
- Output: expected Mbps
- Use: deviations from the baseline —> anomaly investigation
Password strength time to crack on log scale
- Inputs: length, character sets, common patterns
- Output: log seconds to crack
- Use: policy guidance and risk communication
Alert priority scoring
- Inputs: MITRE technique, asset criticality, failed login streak
- Output: risk score 0 —> 100
- Use: rank alerts and set triage threshold
Remediation time forecasting
- Inputs: severity, team load, system type
- Output: days to fix
- Use: staffing and SLA planning

Why This Matters for AI Security Testing

Data poisoning: crafted outliers or flipped labels shift coefficients
Model extraction: coefficients can often be inferred from queries
Operational evasion: small feature tweaks push a continuous score across your action threshold

Example: a score based spam filter can be nudged by adjusting URL count and wording so the score falls just under the quarantine threshold.

Testing Linear Regression Models

Range and Robustness

Validate input ranges and units
Inject missing values and unexpected categories
Probe outliers and confirm robust behavior

Feature Manipulation

Identify controllable features
Measure how incremental changes shift the score
Map which changes cross the action threshold

Model Extraction

Send crafted probes, solve for weights
Once extracted, predict and evade consistently

Key Assumptions and How They Break in Security

Linearity additivity and proportionality
- Many security signals are nonlinear —> engineer features or switch models
Independence of errors no autocorrelation
- Logs are time dependent —> use time aware splits and diagnostics
Homoscedasticity constant residual variance
- Variance grows with traffic —> use robust regression or transform targets
Normality of residuals for exact intervals
- Heavy tails are common —> use robust metrics like MAE
Multicollinearity correlated features inflate variance
- Remove or merge features, or use Ridge or Elastic Net

Practical Security Testing Checklist

List model features and identify which an attacker can control
Check input validation, ranges, missing values, outliers
Test monotonicity changes in features move the score as expected
Attempt poisoning on update paths batch or online
Probe for coefficient extraction
Verify assumptions with residual plots and diagnostics
Use time ordered train —> validate —> test splits
Track drift and recalibrate thresholds

Conclusion

Linear regression is a transparent baseline for scoring, forecasting, and capacity planning. Know where it works, where it breaks, and how attackers might steer it. That foundation makes you better at both defending and testing more complex AI systems.

What is Logistic Regression

Logistic regression is the standard baseline for binary classification. It learns a weighted sum of features and converts it to a probability between 0 and 1. You then apply a threshold to turn that probability into a decision.

Pipeline in one line
features —> weighted sum —> sigmoid —> probability —> threshold —> class

Output is P class equals 1
Threshold 0.5 is common, but you should tune it for your costs and SOC capacity

Core Concepts

Weights and log odds: positive weight pushes probability up, negative pushes it down
Decision boundary: where the model is indifferent probability near threshold
Calibration: align predicted probabilities with reality so 0.7 really means about 70 percent positives

Practical Examples in a Security Context

Spam or not spam
- Inputs: has URL, count of links, words like urgent or password, unknown sender
- Output: probability of spam
- Use: quarantine if probability —> threshold, otherwise deliver or tag
Phishing email detection
- Inputs: sender reputation, URL count, brand lookalike score, domain age
- Output: probability of phishing
- Use: choose threshold to balance analyst time vs missed attacks
Malware vs benign file
- Inputs: imported API counts, section entropy, packer flags
- Output: probability malicious
- Use: block when probability is high, sandbox when borderline
DGA domain detection
- Inputs: length, digit ratio, n gram stats, NXDOMAIN rate
- Output: probability DGA
- Use: sinkhole or apply further checks above threshold
Login anomaly
- Inputs: geo velocity, device fingerprint, hour of day, failed attempt streak
- Output: probability suspicious
- Use: step up auth if above threshold

Evaluation for Security Reality

Precision and Recall: pick the balance that fits operations
F1 and PR AUC: better for imbalanced classes common in attacks
ROC AUC: useful for overall ranking, less informative under heavy imbalance
Confusion matrix by week: spot drift and threshold problems over time

Attacks and Weaknesses

Data poisoning: attacker flips labels or injects crafted samples that shift weights
Feature gaming: attacker modifies controllable features to slip under the threshold
Model extraction: with enough probes, recover weights and intercept
Label leakage: features that encode the verdict after the fact make the model fragile

Mitigations

Use regularization to prevent extreme weights
Apply time aware splits and grouped cross validation by user or host
Calibrate probabilities and tune thresholds to analyst capacity
Monitor feature distributions and retrain on drift

Testing Checklist

Verify class definitions and labeling quality
Inspect coefficients largest positive and negative drivers
Test threshold sweeps Precision Recall trade offs
Probe controllable features to map evasion paths
Check calibration reliability diagrams and Brier score
Validate on future time windows not just random splits
Recheck on new campaigns families and tenants

Regularization and Calibration

L2 Ridge stabilizes weights and reduces variance
L1 Lasso drives some weights to zero for sparse, readable models
Elastic Net blends both for stability and sparsity
Calibration options Platt scaling or isotonic on a validation set
Class imbalance tools class weights, focal loss like weighting, stratified sampling

Assumptions and Practical Notes

Relationship in the log odds is approximately linear
Features are reasonably independent given the target heavy collinearity hurts stability
Use careful preprocessing and avoid leakage fit encoders and scalers only on training folds
Prefer interpretable features so analysts can understand and trust decisions

Step by Step using spam

Collect: labeled emails from past weeks
Prepare: extract simple clues has URL, urgent in subject, link count
Split: train on earlier weeks —> validate on the next week —> test on a later sealed week
Train: logistic regression with L2 regularization
Tune: threshold for desired Precision or Recall and calibrate probabilities
Test: confirm metrics on the sealed test window
Deploy: set operational threshold and monitor weekly

Conclusion

Logistic regression turns clear human rules into consistent machine decisions. It is fast, explainable, and easy to audit. When paired with good features, proper calibration, and time aware validation, it becomes a reliable baseline for spam filtering, phishing detection, DGA spotting, file triage, and login anomaly checks.

Final takeaway

Use linear regression when your target is a number continuous
Use logistic regression when your target is a class like phish vs not phish
Keep features simple and human readable
Validate on future data and tune thresholds to real operational costs
Watch for drift and retrain on a cadence

[Original Source](No response)