How It Works
CloudSigma’s pipeline transforms unstructured threat intelligence into structured, validated Sigma detection rules. The entire process is automated and typically completes in 10–30 seconds.
Pipeline Overview
Step-by-Step
1. Ingestion
The pipeline accepts three input types:
- URL — Fetches the page, extracts main content, strips navigation and ads
- CVE — Looks up the CVE in NVD and MITRE, fetches up to 2 linked references
- Text — Accepts raw text directly (max 50,000 characters)
All inputs are subject to SSRF protection (private IP blocking, DNS validation) and size limits (5 MB for URLs).
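The ingestion guards described above can be sketched roughly as follows. This is a minimal illustration, not CloudSigma's actual implementation: the function names (`is_blocked_address`, `check_url`, `check_text`) are hypothetical, and the caller is assumed to have already resolved the hostname (for example with `socket.getaddrinfo`) so that the check applies to the addresses that will actually be contacted.

```python
import ipaddress
from typing import Iterable
from urllib.parse import urlparse

MAX_TEXT_CHARS = 50_000          # text input limit stated above
MAX_URL_BYTES = 5 * 1024 * 1024  # 5 MB limit for fetched URLs, enforced while downloading


def is_blocked_address(ip: str) -> bool:
    """Return True if the address is internal or otherwise unsafe to fetch."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def check_url(url: str, resolved_ips: Iterable[str]) -> None:
    """Raise ValueError if the URL or any address it resolves to is unsafe."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError("only absolute http(s) URLs are accepted")
    ips = list(resolved_ips)
    if not ips:
        raise ValueError("hostname did not resolve")
    for ip in ips:
        if is_blocked_address(ip):
            raise ValueError(f"blocked address: {ip}")


def check_text(text: str) -> None:
    """Raise ValueError if raw text exceeds the character limit."""
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError("text input exceeds 50,000 characters")
```

Checking every resolved address, instead of only the hostname string, is what stops a public-looking domain from pointing at an internal service.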
2. Classification
An AI model classifies the content to determine whether it contains actionable threat intelligence. Content that is purely marketing, news without technical indicators, or unrelated to cybersecurity is flagged and may produce fewer or no rules.
3. TTP Extraction
Using an AI model, the pipeline identifies MITRE ATT&CK techniques mentioned or implied in the text. Each TTP is extracted with:
- Technique ID — e.g., T1098.001
- Technique name — e.g., “Account Manipulation: Additional Cloud Credentials”
- Tactic — e.g., Persistence
- Confidence — high, medium, or low
The extraction is grounded against the ATT&CK framework to minimize hallucination.
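One way to picture the extracted record and the grounding step is the sketch below. It assumes grounding means "accept only technique IDs found in a catalog and take the name and tactic from the catalog, not from the model"; the source does not spell out the mechanism. `TTP`, `ground`, and `KNOWN_TECHNIQUES` are hypothetical names, and the one-entry catalog stands in for real ATT&CK data.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Tiny stand-in for the ATT&CK catalog (assumption: a real catalog would be
# loaded from published ATT&CK data and contain every technique).
KNOWN_TECHNIQUES = {
    "T1098.001": ("Account Manipulation: Additional Cloud Credentials", "Persistence"),
}

# ATT&CK technique IDs look like T1098 or, for sub-techniques, T1098.001
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")


@dataclass(frozen=True)
class TTP:
    technique_id: str
    name: str
    tactic: str
    confidence: str  # "high", "medium", or "low"


def ground(technique_id: str, confidence: str) -> Optional[TTP]:
    """Build a TTP from catalog data, or return None if the ID is not recognized."""
    if not TECHNIQUE_ID.match(technique_id):
        return None
    entry = KNOWN_TECHNIQUES.get(technique_id)
    if entry is None:
        return None
    name, tactic = entry
    return TTP(technique_id, name, tactic, confidence)
```

Taking the name and tactic from the catalog means a model that invents a plausible-looking ID produces no record at all, instead of a record with made-up metadata.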
4. Filtering
Several filters remove TTPs that cannot produce useful detection rules:
- Host-level filter — Removes techniques that require endpoint visibility when targeting cloud platforms
- Unknown filter — Removes unrecognized technique IDs
- Low confidence filter — Removes TTPs below the confidence threshold
- Non-cloud-detectable filter — Removes techniques that have no observable cloud log artifacts
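The four filters can be thought of as a chain of predicates applied in order, as in the sketch below. Everything concrete here is an assumption made for illustration: the `CATALOG` flags, the `medium` threshold, and the function names are hypothetical and not taken from CloudSigma.

```python
from typing import Callable, Dict, List, Tuple

Ttp = Dict[str, str]  # e.g. {"id": "T1098.001", "confidence": "high"}

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}
MIN_CONFIDENCE = "medium"  # assumed threshold; the real value is not documented here

# Hypothetical catalog describing where each known technique can be observed
CATALOG = {
    "T1098.001": {"host_only": False, "cloud_observable": True},
    "T1055": {"host_only": True, "cloud_observable": False},   # needs endpoint visibility
    "T1589": {"host_only": False, "cloud_observable": False},  # leaves no cloud log artifacts
}


def not_host_level(t: Ttp) -> bool:
    return not CATALOG.get(t["id"], {}).get("host_only", False)


def is_known(t: Ttp) -> bool:
    return t["id"] in CATALOG


def meets_confidence(t: Ttp) -> bool:
    return CONFIDENCE_RANK[t["confidence"]] >= CONFIDENCE_RANK[MIN_CONFIDENCE]


def cloud_detectable(t: Ttp) -> bool:
    return CATALOG.get(t["id"], {}).get("cloud_observable", False)


# Same order as the list above; each predicate returns True to keep the TTP
FILTERS: List[Tuple[str, Callable[[Ttp], bool]]] = [
    ("host-level", not_host_level),
    ("unknown", is_known),
    ("low confidence", meets_confidence),
    ("non-cloud-detectable", cloud_detectable),
]


def apply_filters(ttps: List[Ttp]) -> Tuple[List[Ttp], List[str]]:
    """Return the surviving TTPs and one notice per removed TTP."""
    kept, notices = [], []
    for t in ttps:
        failed = next((name for name, keep in FILTERS if not keep(t)), None)
        if failed is None:
            kept.append(t)
        else:
            notices.append(f"{t['id']} removed by {failed} filter")
    return kept, notices
```

Recording which filter removed each TTP makes it easier to explain why an input produced fewer rules than expected.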
5. Rule Generation
For each surviving TTP and target platform, the pipeline generates Sigma rules using AI with static grounding. The gold corpus — 475+ curated, validated rules — provides examples that anchor the AI output to known-good patterns.
Behavioral rules detect adversary techniques (e.g., “unusual IAM role assumption”). IOC rules detect specific indicators extracted from the text (e.g., known malicious IP addresses).
6. Deduplication
Functionally identical rules (same detection logic, different metadata) are merged to avoid noise.
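A simple way to approximate "same detection logic, different metadata" is to fingerprint only the `logsource` and `detection` sections of each rule, as sketched below. This is an assumption about how such a check could work, not the pipeline's actual algorithm; `detection_fingerprint` and `deduplicate` are hypothetical names, rules are assumed to be already parsed into dictionaries, and the sketch keeps the first rule of each group instead of merging metadata.

```python
import hashlib
import json
from typing import Dict, List


def detection_fingerprint(rule: dict) -> str:
    """Hash only the parts of a rule that decide what it matches."""
    payload = {
        "logsource": rule.get("logsource"),
        "detection": rule.get("detection"),
    }
    # sort_keys makes the hash independent of dictionary key order
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(rules: List[dict]) -> List[dict]:
    """Keep the first rule seen for each distinct detection fingerprint."""
    unique: Dict[str, dict] = {}
    for rule in rules:
        unique.setdefault(detection_fingerprint(rule), rule)
    return list(unique.values())
```

Note that list order inside `detection` still affects the hash, so two rules that list the same values in a different order would not be treated as duplicates by this sketch.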
7. Validation
Every generated rule is validated by pySigma:
- YAML syntax correctness
- Required fields present (title, logsource, detection, level)
- Field names valid for the target platform
- Detection logic well-formed
Rules that fail validation are excluded from the output with a notice in pipelineNotices.
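To illustrate the shape of this step, the sketch below checks the required fields on an already-parsed rule and collects a notice for each rejected rule. It is a simplified standard-library stand-in and does not use or reproduce pySigma's API; `validation_errors` and `partition` are hypothetical names, and the platform-specific field name check is omitted.

```python
from typing import List, Tuple

REQUIRED_FIELDS = ("title", "logsource", "detection", "level")


def validation_errors(rule: dict) -> List[str]:
    """Return a list of problems; an empty list means the rule passed."""
    errors = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    detection = rule.get("detection")
    if detection is not None:
        # Sigma detections are mappings that include a 'condition' expression
        if not isinstance(detection, dict) or "condition" not in detection:
            errors.append("detection must be a mapping with a condition")
    return errors


def partition(rules: List[dict]) -> Tuple[List[dict], List[str]]:
    """Split rules into valid ones and human-readable notices for the rest."""
    valid, pipeline_notices = [], []
    for rule in rules:
        errors = validation_errors(rule)
        if errors:
            title = rule.get("title", "<untitled>")
            pipeline_notices.append(f"{title}: " + "; ".join(errors))
        else:
            valid.append(rule)
    return valid, pipeline_notices
```

Returning notices alongside the valid rules mirrors the behavior described above, where a failed rule is dropped from the output but still reported.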
8. Conversion
Validated Sigma rules are converted to SIEM-native query languages:
| Backend | Backend ID | Output Format |
|---|---|---|
| Splunk | splunk | SPL queries |
| Microsoft Sentinel | sentinel | KQL queries |
| Elasticsearch | elasticsearch | Lucene queries |
| Google Chronicle | chronicle | UDM Search queries |
| OpenSearch | opensearch | Lucene/DQL queries |
| Google SecOps | google_secops | YARA-L queries |
See SIEM Backends for full details and example output.
Gold Corpus
The gold corpus is a curated collection of 475+ Sigma rules covering all 13 platforms. These rules are:
- Written and validated by detection engineers
- Tested against pySigma
- Organized by ATT&CK technique and platform
- Used as grounding examples during rule generation
The corpus ensures that AI-generated rules follow correct Sigma conventions, use proper field names, and produce valid detection logic for each platform.
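Because the corpus is organized by ATT&CK technique and platform, selecting grounding examples can be pictured as a keyed lookup, as in the sketch below. The `GoldCorpus` class, the `limit` parameter, the platform name `aws`, and the fallback from a sub-technique to its parent technique are all assumptions made for illustration, not documented behavior.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class GoldCorpus:
    """In-memory index of curated rules keyed by (technique ID, platform)."""

    def __init__(self) -> None:
        self._index: DefaultDict[Tuple[str, str], List[str]] = defaultdict(list)

    def add(self, technique_id: str, platform: str, rule_yaml: str) -> None:
        self._index[(technique_id, platform)].append(rule_yaml)

    def examples(self, technique_id: str, platform: str, limit: int = 3) -> List[str]:
        """Return up to `limit` rules to use as grounding examples."""
        exact = self._index.get((technique_id, platform), [])
        if exact:
            return exact[:limit]
        # Assumed fallback: T1098.001 falls back to rules filed under T1098
        parent = technique_id.split(".")[0]
        return self._index.get((parent, platform), [])[:limit]
```

For example, after `corpus.add("T1098", "aws", rule_text)`, a request for `corpus.examples("T1098.001", "aws")` would return that parent-level rule when no sub-technique rule exists.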
Performance
| Metric | Typical Value |
|---|---|
| Pipeline duration | 10–30 seconds |
| Rules per execution | 3–15 (depends on input) |
| TTPs extracted | 5–20 |