Continuous Changes and Testing Cadence
Point-in-time security assessments work for static systems. AI systems aren’t static. Models get updated, prompts change, new integrations get added, training data evolves. Testing in January says nothing about security in July if the system changed six times in between.
The AI System Change Rate
Traditional applications change through code deployments. AI systems change through multiple mechanisms.
Model Updates
Using third-party APIs:
# January
response = openai.chat.completions.create(model="gpt-4-0613", ...)
# March - provider updates model
response = openai.chat.completions.create(model="gpt-4-0613", ...)
# Same model name, different behavior
Provider updates the model behind the API. Your code didn’t change. Model behavior changed. Your January testing is obsolete.
Custom models:
# January deployment
model = load("customer-support-v1.pkl")
# March - retrained with new data
model = load("customer-support-v2.pkl")
# June - fine-tuned for new product
model = load("customer-support-v3.pkl")
Three different models in six months. Each has different attack surface. Testing v1 doesn’t validate v2 or v3.
Prompt Engineering Changes
# January
system_prompt = "You are a helpful customer service agent."
# March - added constraints
system_prompt = "You are a helpful customer service agent. Never discuss pricing."
# June - added capabilities
system_prompt = "You are a helpful customer service agent. You can access customer records and update account status."
Same model. Different system prompts. Different capabilities. Different attack surface.
Each prompt change potentially introduces new vulnerabilities. March’s constraint might be bypassable. June’s capabilities might lack authorization.
Integration Changes
# January - chat only
def chat(message):
return model.generate(message)
# March - added database access
def chat(message):
response = model.generate(message, tools=[query_database])
return response
# June - added email sending
def chat(message):
response = model.generate(message, tools=[query_database, send_email])
return response
Integration complexity increases. New tools add attack surface. Testing the January chat-only version doesn’t cover database queries or email sending added later.
Training Data Updates
Organizations continuously collect new training data:
# Weekly retraining
def retrain_model():
last_week_data = get_data(last_7_days)
current_model = load("production_model.pkl")
updated_model = fine_tune(current_model, last_week_data)
deploy(updated_model)
Every retraining cycle uses new data. If adversaries poison data in week 10, week 10’s model is compromised. Testing week 1’s model didn’t detect week 10’s vulnerability.
Why Point-in-Time Testing Fails
Organizations schedule annual security assessments:
January 2024: Security assessment completed
Finding: No critical issues
Status: System approved for production
December 2024: Security breach
Root cause: Vulnerability in feature added in August
The August feature was never tested. Annual testing cadence missed eight months of changes.
Configuration Drift
# January config
{
"rate_limit": 100,
"enable_tools": false,
"log_level": "info"
}
# June config - gradual changes
{
"rate_limit": 1000, # Increased for performance
"enable_tools": true, # Added function calling
"log_level": "error" # Reduced logging
}
Configuration changed incrementally. Each change seemed minor. Cumulatively they significantly altered security posture. Point-in-time testing in January captured the January config, not the June reality.
Dependency Updates
# requirements.txt January
openai==1.0.0
langchain==0.1.0
# requirements.txt June
openai==1.5.0 # Security patches, new features
langchain==0.2.0 # Breaking changes, new vulnerabilities
Dependency updates change application behavior. New library versions might have new vulnerabilities. June’s dependencies weren’t tested in January.
What Triggers Should Require Testing
Scope documents need to define what changes trigger security revalidation.
Model Changes
Trigger testing when:
- Base model updated (gpt-4-0613 -> gpt-4-1106)
- Custom model retrained
- Fine-tuning applied
- Model serving infrastructure changed
Example scope:
Model change testing requirements:
- Test within 48 hours of model deployment
- Regression test previous vulnerabilities
- Test new capabilities added in update
- Verify safety mechanisms remain effective
System Prompt Modifications
Trigger testing when:
- System prompt content changes
- Instructions added or removed
- Capability descriptions modified
- Constraints added or removed
Even minor prompt changes affect behavior:
# Before
"You are a helpful assistant."
# After
"You are a helpful assistant. Be concise."
“Be concise” changes response patterns. Does it affect safety? Does it change how the model handles edge cases? Unknown without testing.
Integration Changes
Trigger testing when:
- New tools/functions added
- API integrations added
- Database access modified
- External service connections added
Each new integration expands attack surface:
# Added in June
tools = [
existing_tool1,
existing_tool2,
new_tool_file_access # New attack surface
]
new_tool_file_access needs testing. But if June’s change doesn’t trigger testing, it goes to production untested.
Training Data Updates
Trigger testing when:
- Training data sources change
- Data collection processes modified
- New data categories included
- Data volume increases significantly
Significant data changes warrant testing:
# January: 10k customer service conversations
train_data_v1 = load_conversations(count=10000)
# June: 100k conversations including social media
train_data_v2 = load_conversations(count=100000, sources=["tickets", "social"])
10x data increase plus new sources. Different attack surface. Needs testing.
Testing Cadence Options
Different approaches for different risk levels.
Event-Driven Testing
Test when changes occur:
Pipeline:
1. Code commit
2. CI runs security tests
3. Deploy if tests pass
Triggers:
- Model update: Run model-specific tests
- Prompt change: Run prompt injection tests
- New function: Run function calling tests
This catches changes immediately but requires automation.
Scheduled Testing
Regular intervals:
Quarterly assessment:
- Full security review
- New vulnerability testing
- Regression testing
- Compliance verification
Monthly sanity checks:
- Smoke tests for critical functions
- Automated vulnerability scans
- Configuration validation
More predictable than event-driven but might miss issues between cycles.
Risk-Based Testing
Frequency based on risk level:
High-risk systems (financial, healthcare):
- Full assessment: Quarterly
- Automated testing: Daily
- Manual review: Monthly
Medium-risk systems (customer service):
- Full assessment: Biannually
- Automated testing: Weekly
- Manual review: Quarterly
Low-risk systems (internal tools):
- Full assessment: Annually
- Automated testing: Monthly
- Manual review: As needed
Hybrid Approach
Combine methods:
Continuous:
- Automated tests in CI/CD
- Daily smoke tests
- Real-time monitoring
Scheduled:
- Monthly: Automated comprehensive scan
- Quarterly: Manual assessment
- Annually: Full red team engagement
Event-driven:
- Major model updates: Full regression
- Minor updates: Automated testing
- Config changes: Targeted testing
Baseline and Regression Testing
Testing new versions requires baseline:
# Establish baseline (v1)
baseline_results = {
"prompt_injection": 0_detected,
"data_leakage": 0_detected,
"function_abuse": 0_detected
}
# Test new version (v2)
new_results = run_security_tests(model_v2)
# Compare
regressions = []
for test, result in new_results.items():
if result > baseline_results[test]:
regressions.append(test)
Regression testing verifies new versions don’t reintroduce old vulnerabilities or create new ones.
Version Control for AI Systems
Track what changed between versions:
Model v1 -> v2:
- Base model: gpt-4-0613 (same)
- System prompt: Added "Never discuss pricing" (changed)
- Functions: Added send_email (new)
- Training data: +5000 conversations (changed)
Test focus for v2:
- Pricing discussion bypass attempts
- send_email authorization
- Training data poisoning in new conversations
Knowing what changed guides testing priorities.
Scope Language for Continuous Testing
Bad scope:
Perform annual security assessment of AI system.
This implies one-time testing regardless of changes.
Better scope:
Testing cadence and triggers:
1. Continuous automated testing:
- Run on every code commit
- Test suite: Prompt injection, output validation, function authorization
- Block deployment if critical tests fail
2. Scheduled assessments:
- Monthly: Automated comprehensive scan
- Quarterly: Manual penetration test
- Annually: Full red team engagement with report
3. Event-driven testing:
- Model update: Full regression within 48 hours
- System prompt change: Prompt security testing
- New function added: Function calling security testing
- Training data source change: Data validation testing
4. Baseline requirements:
- Maintain baseline test results for each version
- Compare new versions against baseline
- Document any regressions
- Require approval for degraded security metrics
5. Version tracking:
- Document what changed between versions
- Link test results to specific versions
- Maintain test history for compliance
Out of scope:
- Testing of changes made after assessment period ends
- Continuous monitoring implementation (separate contract)
Automation Requirements
Frequent testing requires automation:
# CI/CD integration
def security_gate():
results = {
"prompt_injection": test_prompt_injection(),
"output_validation": test_output_handling(),
"function_auth": test_function_authorization(),
"rate_limits": test_rate_limiting()
}
critical_failures = [k for k, v in results.items() if v["severity"] == "critical"]
if critical_failures:
block_deployment()
return results
Scope needs to specify: “Automated testing must complete in under 15 minutes to fit CI/CD pipeline. Manual testing scheduled separately.”
Change Documentation
Effective continuous testing requires change tracking:
Change log format:
Date: 2024-06-15
Change type: System prompt modification
Old value: "You are a helpful assistant."
New value: "You are a helpful assistant with access to customer records."
Risk assessment: High - new data access capability
Testing required: Yes
Testing completed: 2024-06-16
Results: 2 findings (see report #456)
Without change documentation, continuous testing becomes reactive instead of proactive.
Cost Considerations
Frequent testing has cost implications:
Annual testing: $50k once per year
Quarterly testing: $15k four times = $60k per year
Monthly testing: $5k twelve times = $60k per year
Continuous automated: $20k setup + $2k/month = $44k per year
Scope needs to balance thoroughness with budget:
Proposed approach:
- Continuous automated testing: Daily ($2k/month)
- Monthly manual review: Targeted ($5k/month)
- Quarterly deep assessment: Comprehensive ($15k/quarter)
Total annual cost: $104k
Alternative lower-cost approach:
- Continuous automated testing: Daily ($2k/month)
- Quarterly assessment: Comprehensive ($15k/quarter)
Total annual cost: $84k
Compliance Requirements
Regulations might mandate testing frequency:
EU AI Act (high-risk systems):
- Continuous monitoring required
- Regular testing and validation
- Documentation of all changes
PCI DSS (payment systems):
- Quarterly vulnerability scans
- Annual penetration tests
- Testing after significant changes
HIPAA (healthcare):
- Regular risk assessments
- Testing when environment changes
- Continuous monitoring
Scope must align with regulatory requirements: “Testing frequency meets EU AI Act requirements for high-risk AI systems.”
Conclusion
AI systems change constantly. Point-in-time security testing creates false confidence that expires with the first change after testing.
Effective AI security requires:
- Clear triggers for when testing is required
- Baseline establishment and regression testing
- Mix of continuous, scheduled, and event-driven testing
- Version control and change documentation
- Automation for frequent testing
- Budget allocation for ongoing testing
Organizations that test AI security once and declare victory are accumulating risk with every untested change. Scope documents need to explicitly address testing cadence, change triggers, and continuous validation requirements.
[Original Source](No response)