Prompt Management & Engineering Best Practices
Responsible AI in autobotAI requires careful management and engineering of the prompts used across all node types: AI evaluator nodes, GenAI workflow nodes, Agent nodes, notification nodes, approval nodes, the AI assistant for workflow creation, and the AI assistant for Action node code generation.
What is Prompt Management?
Prompt management refers to the design, validation, testing, versioning, and maintenance of prompts that guide AI models to produce appropriate, safe, and accurate outputs.
Scope: ALL Node Types
Prompt management applies to:
- AI Evaluator Nodes: Prompts that guide threat classification, risk assessment, and compliance violation detection
- GenAI Workflow nodes: Prompts that generate text, summaries, recommendations
- Notification Nodes: Prompts that create alert summaries for users
- Approval Workflows: Prompts that generate context summaries for approvers
- AI Assistant for Workflow Creation: Prompts that describe your automation use case and generate a starter workflow to kickstart flowgramming
- AI Assistant for Full-code Action Nodes: Prompts that describe your custom Action node, helping you build a custom tool for an Agent node or an independent deterministic node
- Agent Node: The system prompt and output-format prompt that define the agent's behavior when executing tasks
Why Prompt Management Matters
Poor prompt design can lead to:
- Biased or unfair recommendations across different inputs
- Hallucinations or inaccurate information presented as fact
- Compliance violations in generated content
- Confusing or misleading summaries for users/approvers
- Inconsistent automation behavior
- Undetected security issues
- Failed regulatory compliance
Prompt Engineering Best Practices
1. Clear and Specific Instructions
Write prompts that explicitly define what you want the AI to do:
✓ GOOD:
```markdown
Generate a threat assessment for the following security finding:
- Only include threats with confidence > 80%
- Format output as: [Threat Type], [Risk Level: CRITICAL|HIGH|MEDIUM], [Recommendation]
- Do not speculate about threats below 80% confidence
- Do not recommend disabling security controls
- If evidence is insufficient, state "Unable to determine risk level"
- Provide an assumption score with a list of assumptions.
```
✗ BAD:
```markdown
Tell me about these threats
```
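A contract as explicit as the GOOD prompt is also easy to check in code. Below is a minimal Python sketch (the `validate_assessment` helper is illustrative, not part of autobotAI) that verifies each output line against the `Threat Type, Risk Level, Recommendation` format the prompt demands:

```python
import re

# Pattern for the contract defined in the GOOD prompt above:
# "<Threat Type>, <CRITICAL|HIGH|MEDIUM>, <Recommendation>"
LINE_PATTERN = re.compile(
    r"^(?P<threat>[^,]+),\s*(?P<risk>CRITICAL|HIGH|MEDIUM),\s*(?P<recommendation>.+)$"
)

def validate_assessment(text: str) -> list[str]:
    """Return the lines that violate the expected output contract."""
    errors = []
    for line in filter(None, (l.strip() for l in text.splitlines())):
        if line == "Unable to determine risk level":
            continue  # explicitly allowed fallback per the prompt
        if not LINE_PATTERN.match(line):
            errors.append(line)
    return errors

# Example: flags a response that invented an unapproved risk level
print(validate_assessment("Malware, SEVERE, Quarantine host"))
```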
2. Provide Rich Context and Examples
Include relevant context to help AI understand your domain:
```markdown
You are a security analyst for a financial services company with strict compliance.

Context: We follow PCI DSS, RBI, GDPR, and SOC 2 standards.

Example 1 - Well-formatted violation:
{
  "resource": "database_prod_01",
  "violation": "encryption_not_enabled",
  "severity": "CRITICAL",
  "recommendation": "Enable encryption at rest using AWS KMS with customer-managed key"
}

Example 2 - Compliant resource:
{
  "resource": "backup_encrypted_01",
  "status": "compliant",
  "evidence": "Verified AES-256 encryption, key rotation enabled"
}

Now analyze these violations and generate similar structured output:
[$$_violation data]
```
3. Define Exact Output Format
Specify exactly how you want output structured:
```markdown
Output Format Requirements:
- Response must be valid JSON
- Required fields: violation_id, resource_id, rule_violated, evidence, recommendation, severity
- Severity must be one of: CRITICAL, HIGH, MEDIUM, LOW (no other values)
- Recommendation must be 50-200 characters, specific and actionable
- Maximum output size: 500 characters per violation
- Return empty array [] if no violations found
- Do not include warnings, disclaimers, or explanatory text
```
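Format requirements like these pair naturally with a deterministic validator downstream of the AI node. A minimal Python sketch, assuming the JSON fields above (the `check_violation` helper is illustrative, not an autobotAI API):

```python
import json

REQUIRED = {"violation_id", "resource_id", "rule_violated",
            "evidence", "recommendation", "severity"}
SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}

def check_violation(raw: str) -> list[str]:
    """Validate one AI-generated violation against the format rules above."""
    problems = []
    try:
        item = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    missing = REQUIRED - item.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if item.get("severity") not in SEVERITIES:
        problems.append(f"invalid severity: {item.get('severity')!r}")
    rec = item.get("recommendation", "")
    if not 50 <= len(rec) <= 200:
        problems.append(f"recommendation length {len(rec)} outside 50-200 chars")
    if len(raw) > 500:
        problems.append("violation exceeds 500-character limit")
    return problems
```

Running such a check before downstream nodes consume the output catches format drift early, regardless of which model generated it.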
4. Set Explicit Boundaries and Constraints
Tell the AI what it absolutely should NOT do:
```markdown
IMPORTANT CONSTRAINTS:
- Do NOT make assumptions about missing data - state what's unknown
- Do NOT recommend actions outside the customer's environment/capabilities
- Do NOT include passwords, API keys, or sensitive credentials in output
- Do NOT recommend disabling security controls
- Do NOT violate compliance frameworks: PCI DSS, SOC 2, GDPR, HIPAA (where applicable)
- Do NOT suggest actions that affect > 100 resources without explicit approval requirement
- Do NOT generate code for production execution
- Do NOT recommend using deprecated or end-of-life services
```
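Because models can still slip, constraints stated in the prompt are best backed by a hard guard after generation. A rough Python sketch with illustrative patterns and thresholds (the credential regexes are examples, not an exhaustive DLP ruleset):

```python
import re

# Hypothetical post-generation guard: rejects output that breaks the
# constraints above before it reaches downstream nodes.
CREDENTIAL_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"(?i)(password|api[_-]?key)\s*[:=]\s*\S+"),
]

def enforce_constraints(output: str, affected_resources: int) -> list[str]:
    """Return a list of constraint violations found in the AI output."""
    violations = []
    for pattern in CREDENTIAL_PATTERNS:
        if pattern.search(output):
            violations.append("possible credential leaked in output")
    if affected_resources > 100:
        violations.append("affects > 100 resources: explicit approval required")
    if "disable" in output.lower() and "security" in output.lower():
        violations.append("output may recommend disabling a security control")
    return violations
```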
5. Use System-Level Prompts for Governance
Create foundational system prompts that apply guardrails:
```markdown
System Prompt for autobotAI Automation:

You are an automated security assistant for enterprise infrastructure.
Your primary responsibility is making recommendations that:
1. Follow company policy (see [policy link])
2. Prioritize human review when uncertain
3. Flag decisions affecting > 100 resources for mandatory approval
4. Never recommend disabling security controls
5. Respect compliance requirements:
   - Compliance frameworks: CIS AWS, SOC 2, GDPR, HIPAA
   - Data privacy: No PII in outputs
   - Regulatory: Meet audit trail requirements

When uncertain: Ask for human guidance rather than guessing.
When risky: Escalate to security team immediately.
```
6. Test Prompts with Diverse Inputs Before Deployment
Before deploying prompts, test with:
| Test Category | Examples | Expected Behavior |
|---|---|---|
| Normal Cases | Standard threats, typical violations | Correct classification |
| Edge Cases | Missing data, unusual combinations | Graceful handling with explanations |
| Boundary Cases | Very large/small values, extremes | Appropriate responses |
| Diversity | Different threat types, resource types | Consistent quality |
| Stress Cases | Complex scenarios, contradictions | Reasonable recommendations |
| Security Cases | Attempts to manipulate prompt | Resistant, doesn't comply |
Example test suite:
```markdown
Test Case 1: Normal threat
Input: Database admin requested 2 hours of production database access from a remote geo location; the database holds PII and PCI data
Expected: HIGH severity; do not grant access automatically - forward to the requester's reporting manager for manual approval and risk sign-off

Test Case 2: Edge case - ambiguous
Input: IP seen in 2 threat databases, not in others
Expected: MEDIUM severity or request more data

Test Case 3: Boundary - massive scale
Input: Vulnerability affects 50,000 resources
Expected: Escalation flag; pull code repos and update requirements.txt, package.json, pom.xml, or Dockerfile with the fixed library version; push the updated code through pull-request review; update IPS and WAF rules to protect against the exploit

Test Case 4: Diversity - different attack type
Input: Social engineering attempt vs. malware
Expected: Different agent workflow for each threat vector
```
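A test suite like this can be automated. One possible pytest harness, assuming a `run_prompt` wrapper around your deployed AI node (both the wrapper and the expected severities here are illustrative):

```python
import pytest  # assumes pytest is installed

def run_prompt(finding: str) -> dict:
    """Hypothetical wrapper around your AI evaluator node; substitute
    your actual invocation mechanism before running the suite."""
    raise NotImplementedError("call your deployed prompt here")

CASES = [
    ("normal", "DB admin requests 2h prod access from unknown geo; PII/PCI data", "HIGH"),
    ("edge", "IP seen in 2 threat databases, absent from others", "MEDIUM"),
    ("boundary", "Vulnerability affects 50,000 resources", "CRITICAL"),
]

@pytest.mark.parametrize("category,finding,expected_severity", CASES)
def test_prompt_severity(category, finding, expected_severity):
    result = run_prompt(finding)
    assert result["severity"] == expected_severity, f"{category} case misclassified"
```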
7. Version and Track Prompt Changes
How Prompts are Structured in autobotAI
Each AI node in autobotAI includes three components:
1. System Prompt
- Foundational instructions that define the AI's behavior
- Example: "You are a security analyst. Follow company policies..."
- Cannot be versioned separately in autobotAI
- Backup: Use the Export Bot feature to back up the entire workflow, including the system prompt
2. Output Format Prompt
- Specifies exact output structure (JSON, fields, constraints)
- Example: "Return JSON with fields: violation_id, severity, recommendation"
- Cannot be versioned separately in autobotAI
- Backup: Use the Export Bot feature to back up the entire workflow, including the output format
3. User Prompt (Input)
- Dynamic instructions based on workflow execution
- Example: "Analyze this security finding..."
- User Prompt History is maintained for chat-based workflows (when a customer publishes a workflow with human interaction/chatbot)
- Users can view the complete conversation history, including all prompts and responses
Backing Up and Managing Prompts
Since system prompts and output format prompts are not separately versioned:
Use Export Bot Feature for Backup:
- Export entire bot workflow with all prompts included
- Creates complete backup of:
  - System prompt
  - Output format prompt
  - All node configurations
  - Workflow logic
- Can be re-imported to restore previous versions
- Provides audit trail of workflow changes
Workflow Version Management:
```markdown
Bot: "Generate Threat Assessment"

Workflow Versions (via Export):

Version 2.3 (current, deployed 2025-11-16)
├─ Export date: 2025-11-16
├─ System Prompt: "You are a security analyst..."
├─ Output Format: "Return JSON with {id, severity, recommendation}"
├─ Changes from v2.2: Added constraint about "no disabling controls"
└─ File: threat_assessment_v2.3.bot (exported)

Version 2.2 (2025-11-01)
├─ Export date: 2025-11-01
├─ System Prompt: "You are a security analyst..."
├─ Output Format: "Return JSON..."
├─ Changes: Increased confidence threshold to 75%
└─ File: threat_assessment_v2.2.bot (exported)

Version 2.1 (2025-10-15)
├─ Export date: 2025-10-15
├─ Initial production version
└─ File: threat_assessment_v2.1.bot (exported)

Restoration Process:
- To rollback to v2.2: Import threat_assessment_v2.2.bot
- Entire workflow restored with correct prompts
- All functionality returns to that version
```
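If you want this export history to be machine-verifiable, one option is to keep a small manifest next to the exported `.bot` files. A sketch of that convention (the manifest format is an assumption, not an autobotAI feature):

```python
import hashlib
import json
import pathlib
from datetime import date

def record_export(bot_file: str, version: str, changes: str) -> None:
    """Append an audit entry for an exported .bot file. Hypothetical
    convention: exports live next to a JSON manifest so rollback
    targets stay traceable and tamper-evident."""
    path = pathlib.Path(bot_file)
    entry = {
        "version": version,
        "file": path.name,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "exported": date.today().isoformat(),
        "changes": changes,
    }
    manifest = path.with_name("prompt_versions.json")
    history = json.loads(manifest.read_text()) if manifest.exists() else []
    history.append(entry)
    manifest.write_text(json.dumps(history, indent=2))

# record_export("threat_assessment_v2.3.bot", "2.3",
#               "Added 'no disabling controls' constraint")
```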
8. User Prompt History in Chat Workflows
For workflows published with human interaction (chatbot mode):
Users see the complete conversation history:
```markdown
Chat Session ID: CHT-2025-11-16-001

Message 1 (User Input):
"Analyze this suspicious login: IP 192.0.2.5, time 2:45 AM, location: Unknown"

Message 1 (Bot Response - with User Prompt):
User Prompt Used: "Analyze security threat: {threat_details}"
System Response: [AI-generated threat assessment]

Message 2 (User Input):
"Is this a real threat or false positive?"

Message 2 (Bot Response):
User Prompt Used: "Evaluate threat confidence based on user feedback"
System Response: [Updated assessment]
```
Chat History Features:
✓ All user messages archived
✓ All bot responses archived
✓ User prompts that generated each response shown
✓ Multi-turn conversation available for review
✓ Can export chat history for audit
✓ Timestamps for each exchange
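If you mirror exported chat history into your own audit store, a simple append-only log works. A sketch assuming the field names from the transcript above (not a documented autobotAI schema):

```python
import json
from datetime import datetime, timezone

def archive_exchange(session_id: str, user_input: str, prompt_used: str,
                     response: str, archive_path: str = "chat_audit.jsonl") -> None:
    """Append one chat exchange to a JSON-lines audit log."""
    record = {
        "session_id": session_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_input": user_input,
        "user_prompt_used": prompt_used,
        "response": response,
    }
    with open(archive_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```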
9. Handle Foundation Model Variations
Different foundation models respond differently to prompts. Adjust your system prompts based on which model you select in autobotAI.
Claude Sonnet 4.5 (Latest - Best Default)
Best for: Most autobotAI workflows; the strongest choice for coding and agent tasks
System Prompt Approach:
```markdown
You are a security analyst. Analyze in this order:
1. What is the rule/policy?
2. What is the current state?
3. Why is it non-compliant?
4. What's the fix?

Output format: JSON with fields {rule, evidence, recommendation, severity}
```
Why: Sonnet 4.5 is fast (roughly 2x the speed of Opus), best-in-class for coding, strong on agentic tasks, and resistant to prompt injection.
Claude Opus 4.1 (Complex Analysis)
Best for: Complex threat analysis, policy interpretation, open-ended reasoning
System Prompt Approach:
```markdown
You are an expert security analyst. Provide nuanced analysis considering:
- Immediate threat/violation
- Business context
- Regulatory implications

Be thorough but concise.
```
Why: Opus 4.1 is optimized for agentic tasks and complex reasoning, making it better for "unknown unknowns" scenarios.
Claude 3.5 Sonnet (Previous Generation)
Best for: Budget-conscious workflows, still very capable
System Prompt Approach: Same as Sonnet 4.5 - structured, step-by-step
Why: Still a strong performer, at lower cost than Sonnet 4.5 when budget matters.
Claude Haiku 4.5 (Speed/Cost Critical)
Best for: High-volume real-time analysis, cost-sensitive deployments
System Prompt Approach:
```markdown
You are a security analyst. Respond with: [THREAT TYPE], [RISK LEVEL], [ACTION]
Example: Malware, CRITICAL, Block immediately
```
Why: Fastest and cheapest. Requires simpler, direct prompts. Complexity reduces accuracy.
GPT-4o (Default OpenAI)
Best for: Multimodal (text + image), general purpose workflows
System Prompt Approach:
```markdown
You are a security analyst. Threat assessment: threat_type, risk_level, action. Be concise.
```
Why: Fast and 60% cheaper than the prior generation; handles both text and images with good accuracy.
GPT-4o Mini (Speed/Cost - OpenAI)
Best for: High-volume, cost-sensitive, fast response needed
System Prompt Approach:
```markdown
Security assessment: [threat], [level], [action]
```
Why: 60% cheaper than GPT-4o, scores 82% on MMLU (vs 70% for older models), fastest response time.
GPT-5 / o3 (Latest Reasoning Models)
Best for: Complex multi-step reasoning, novel threat scenarios
System Prompt Approach:
```markdown
You are a strategic security analyst. Think step-by-step about:
- What's the threat/policy issue?
- What's the context and patterns?
- What are second-order consequences?
- What's the strategic response?

Be thorough and nuanced.
```
Why: State-of-the-art reasoning, but slower and more expensive. For critical, complex decisions only.
Quick Decision Guide
| Model | Best For | Speed | Cost | Prompt Style |
|---|---|---|---|---|
| Sonnet 4.5 | Default choice | ⭐⭐⭐ Fast | ⭐⭐ Moderate | Structured |
| Opus 4.1 | Complex reasoning | ⭐⭐ Medium | ⭐ Higher | Open-ended |
| Haiku 4.5 | Real-time, bulk | ⭐⭐⭐ Fastest | ⭐⭐⭐ Cheapest | Simple, direct |
| GPT-4o | Multimodal | ⭐⭐⭐ Fast | ⭐⭐ Moderate | Direct, clear |
| GPT-4o Mini | High-volume | ⭐⭐⭐ Fastest | ⭐⭐⭐ Cheapest | Simple, direct |
| GPT-5/o3 | Complex reasoning | ⭐⭐ Slower | ⭐ Highest | Detailed, nuanced |
Testing Prompts Across Models
Before deploying:
- Write the system prompt for your use case
- Test on Sonnet 4.5 first (good baseline)
- Test on Haiku 4.5; simplify the prompt if accuracy drops
- Test on Opus 4.1 if Sonnet fails on complex cases
- Compare accuracy, speed, and cost
- Choose the model that meets all targets (see the comparison sketch after the example below)
Example:
- Sonnet 4.5: 96% accuracy, 0.5s, $0.0005/call ✓
- Haiku 4.5: 92% accuracy, 0.2s, $0.0002/call ✓
- → If 92% meets target, use Haiku (faster, cheaper)
- → If you need 96%+, use Sonnet 4.5
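A small harness makes this comparison repeatable. The sketch below assumes an `invoke` callable per model (wrapping whichever SDK you use) and a labeled test set; the per-call cost is whatever your provider bills:

```python
import time

def evaluate(invoke, labeled_cases, cost_per_call: float) -> dict:
    """Run a labeled test suite against one model and report
    accuracy, average latency, and total cost."""
    correct, latencies = 0, []
    for finding, expected in labeled_cases:
        start = time.perf_counter()
        result = invoke(finding)
        latencies.append(time.perf_counter() - start)
        correct += result == expected
    n = len(labeled_cases)
    return {
        "accuracy": correct / n,
        "avg_latency_s": sum(latencies) / n,
        "total_cost": cost_per_call * n,
    }

# Compare e.g. evaluate(sonnet_invoke, cases, 0.0005)
# against   evaluate(haiku_invoke, cases, 0.0002)
```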
Summary
Always test your system prompts on multiple models before production deployment. The latest models (Sonnet 4.5, Opus 4.1, GPT-5/o3) provide better accuracy; use them for new implementations.
10. Document Prompt Decisions and Assumptions
Keep documentation explaining why each prompt is written the way it is:
Prompt Documentation:
```markdown
Name: Generate Remediation Steps for Compliance Violations
Purpose: Create step-by-step remediation instructions for security findings

Design Rationale:
- Uses structured format to reduce hallucinations
- Includes confidence thresholds to catch uncertainty
- Limits output length to ensure readability
- References specific frameworks (CIS, MITRE, GDPR, NIST, etc.) for accuracy

Tested Against:
- 1000+ compliance violations from real customer data
- Achieves 96% accuracy on CIS violations
- 94% accuracy on GDPR violations
- Edge case: Fails on custom compliance rules not in training data

Limitations:
- Does not generate novel remediation approaches
- Assumes standard AWS environment
- May struggle with hybrid/on-premises infrastructure
- Cannot handle customer-specific compliance rules

When This Works Well:
- AWS-native environments
- Standard compliance frameworks
- Common violation types

When This Might Fail:
- Custom compliance rules
- Hybrid/multi-cloud environments
- Novel attack patterns
- Non-English system messages

Review Schedule: Quarterly or after major incidents
Last Reviewed: 2025-11-16
Next Review: 2026-02-16
Owners: Security Team, Product Team
```
Responsible AI in Prompt Design
Explainability
- Write prompts that ask the model to explain its reasoning
- Example: "Explain in simple terms why you classified this as HIGH risk"
- Results should include evidence supporting conclusions
Fairness
- Test prompts with diverse datasets
- Ensure similar inputs get similar outputs
- Monitor for resource-type or department bias
Accountability
- Document who created and reviewed each prompt
- Track prompt versions and changes
- Link prompts to responsible individuals
Compliance
- Prompts must enforce company policies
- Explicitly include compliance requirements
- Test against compliance rules before deployment
Human Control
- Prompts should suggest human review for uncertain cases
- Never make critical decisions without oversight
- Include confidence/uncertainty scores in output (a routing sketch follows below)
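One way to operationalize human control is a deterministic router on top of the model's self-reported confidence. A minimal sketch (the threshold and field names are illustrative, and self-reported confidence should itself be validated over time):

```python
CONFIDENCE_THRESHOLD = 0.80  # illustrative cutoff; tune per workflow

def route_decision(assessment: dict) -> str:
    """Route an AI assessment: auto-proceed only when the model is
    confident and the finding is non-critical. Assumes the prompt
    asked the model to self-report a 'confidence' field."""
    confidence = assessment.get("confidence", 0.0)
    severity = assessment.get("severity", "UNKNOWN")
    if severity == "CRITICAL":
        return "human_approval_required"
    if confidence < CONFIDENCE_THRESHOLD:
        return "human_review"
    return "auto_proceed"

print(route_decision({"severity": "MEDIUM", "confidence": 0.65}))
# -> human_review
```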
Foundation Model Selection for Prompt Effectiveness
When choosing a foundation model, consider:
| Factor | Description | Impact |
|---|---|---|
| Capability Match | Does model excel at your use case? | High accuracy vs. poor performance |
| Cost | Per-token or per-request pricing | Budget constraints |
| Latency | Response time needed | Real-time vs. batch |
| Governance | What guardrails are available? | Control over outputs |
| Updates | How often is the model updated? | Access to latest capabilities |
Foundation Model Selection for autobotAI Workflows
When choosing a foundation model for your autobotAI node, consider:
| Factor | Description | Impact on Selection |
|---|---|---|
| Use Case Match | Does model excel at your task? | Threat analysis needs reasoning; Haiku 4.5 ok; Sonnet 4.5 better |
| Speed | Response time required? | Real-time alerts → Sonnet/Haiku; Analysis → Opus ok |
| Cost | Budget per execution? | Haiku 4.5 cheapest; Sonnet 4.5 mid-range; Opus higher cost than others |
| Accuracy Needed | What's acceptable accuracy? | <92% → Haiku; 92-98% → Sonnet 4.5; >98% → Opus |
| Complexity | What's task complexity? | Simple classification → Haiku; Complex reasoning → Opus |
| Privacy | Compliance required? | Custom guardrails → Amazon Bedrock with Claude; flexibility → GPT-5x |
| Customization | Need fine-tuning? | Bedrock supports fine-tuning; OpenAI also supports |
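For documentation or tooling purposes, the table above can be collapsed into a simple selection helper. A toy sketch mirroring those guidelines (the thresholds are illustrative, not benchmarks):

```python
def pick_model(accuracy_target: float, realtime: bool,
               complex_reasoning: bool) -> str:
    """Toy selector mirroring the selection table above."""
    if complex_reasoning and accuracy_target > 0.98:
        return "Claude Opus 4.1"      # complex reasoning, highest accuracy
    if realtime and accuracy_target <= 0.92:
        return "Claude Haiku 4.5"     # speed/cost critical, simple tasks
    return "Claude Sonnet 4.5"        # balanced default

print(pick_model(accuracy_target=0.95, realtime=True, complex_reasoning=False))
# -> Claude Sonnet 4.5
```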
When and How to Update Prompts
Update When:
- Performance metrics drop below target
- Users report confusing or incorrect outputs
- New compliance requirements emerge
- Foundation model is updated
- Quarterly refresh as best practice
- Incident analysis shows prompt gap
Update Process
1. Identify Problem
   - Review user feedback
   - Review metrics
   - Review incident reports
2. Hypothesize Solution
   - Determine if issue is prompt-related
3. Test Changes
   - Create new prompt version
   - Test on same test suite
   - Compare performance with current version
4. Validate Quality (see the sketch after this list)
   - Accuracy improved?
   - False positive rate acceptable?
   - Consistency maintained?
   - No new regressions?
5. Stage Deployment
   - Deploy to staging environment
   - Monitor for 1–2 weeks
6. Production Rollout
   - Deploy to production
   - Monitor metrics closely
   - Keep rollback plan ready
7. Monitor & Learn
   - Track new version performance
   - Capture user feedback
   - Plan next improvements
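Step 4 ("Validate Quality") can be encoded as an explicit gate so a new prompt version only advances to staging when it beats the current one. A minimal sketch with illustrative metric names (use whatever your test suite actually reports):

```python
def approve_rollout(old: dict, new: dict, max_fp_rate: float = 0.05) -> bool:
    """Gate a new prompt version on the quality checks from step 4."""
    return (
        new["accuracy"] >= old["accuracy"]           # no accuracy regression
        and new["false_positive_rate"] <= max_fp_rate
        and new["consistency"] >= old["consistency"]  # outputs stay stable
    )

current = {"accuracy": 0.94, "false_positive_rate": 0.04, "consistency": 0.97}
candidate = {"accuracy": 0.96, "false_positive_rate": 0.03, "consistency": 0.97}
print(approve_rollout(current, candidate))  # True -> proceed to staging
```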
Common Prompt Anti-Patterns to Avoid
Anti-Pattern 1: Vague Instructions
✗ BAD:
```markdown
Classify this threat
```
✓ GOOD:
```markdown
Classify threat severity as CRITICAL, HIGH, MEDIUM, or LOW based on: (1) Vendor detections, (2) Attack pattern, (3) Target sensitivity
```
Anti-Pattern 2: No Output Format Specification
✗ BAD:
```markdown
Give me the remediation steps
```
✓ GOOD:
```markdown
Return JSON with fields: step_number, action, expected_time, risk_level
```
Anti-Pattern 3: Ignoring Model-Specific Behavior
✗ BAD:
```markdown
Use same prompt for all LLM providers
```
✓ GOOD:
```markdown
Optimize prompts for each model's strengths (Claude for reasoning, GPT-4 for creativity, Llama for speed)
```
Anti-Pattern 4: No Testing on Edge Cases
✗ BAD:
```markdown
Deploy prompt without testing unusual scenarios
```
✓ GOOD:
```markdown
Test on diverse data before production deployment
```
Anti-Pattern 5: Over-Trusting AI Output
✗ BAD:
```markdown
Auto-execute all AI-generated recommendations
```
✓ GOOD:
```markdown
Require human approval for critical actions
```
Summary: Prompt Management Checklist
Before deploying any prompt, verify:
- Clear Instructions: Prompt explicitly states what to do
- Context Provided: Sufficient background for accurate decisions
- Output Format: Exact format specified (JSON, structure, length)
- Constraints Defined: Clear boundaries on what NOT to do
- Testing Done: Tested on diverse inputs, edge cases
- Documented: Rationale, limitations, review schedule recorded
- Model Optimized: Prompt tuned for specific foundation model
- Monitored: Performance metrics tracked post-deployment
- Responsible: Explainable, fair, accountable, compliant, human-centric
- Versioned: Changes tracked, rollback possible
Prompt management is NOT a one-time task; continuously monitor, test, and improve.