How It Works

CloudSigma’s pipeline transforms unstructured threat intelligence into structured, validated Sigma detection rules. The entire process is automated and typically completes in 10–30 seconds.

Pipeline Overview

[Diagram: Detection Pipeline, from Threat Intelligence Input to Sigma YAML + SIEM Queries]

Step-by-Step

1. Ingestion

The pipeline accepts three input types:

URL — Fetches the page, extracts main content, strips navigation and ads
CVE — Looks up the CVE in NVD and MITRE, fetches up to 2 linked references
Text — Accepts raw text directly (max 50,000 characters)

All inputs are subject to SSRF protection (private IP blocking, DNS validation) and size limits (5 MB for URLs).
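The private IP blocking and DNS validation described above could look roughly like the following. This is a minimal sketch, not CloudSigma's implementation: the function names, the `resolve` parameter, and the exact set of blocked address classes are assumptions made for illustration. Only the two limits come from the text.

```python
import ipaddress
import socket
from urllib.parse import urlparse

MAX_URL_BYTES = 5 * 1024 * 1024  # 5 MB limit for URL inputs, from the text above
MAX_TEXT_CHARS = 50_000          # raw text limit, from the text above


def is_blocked_address(ip: str) -> bool:
    """Return True for addresses a fetcher should refuse to contact (assumed policy)."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified
    )


def is_url_allowed(url: str, resolve=socket.getaddrinfo) -> bool:
    """Hypothetical pre-fetch check: scheme, hostname, and every resolved address."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        infos = resolve(parsed.hostname, None)
    except OSError:
        return False  # unresolvable hosts are rejected
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    # A single private or loopback record is enough to block the request.
    return bool(infos) and all(not is_blocked_address(info[4][0]) for info in infos)
```

A check like this only helps if the fetch then connects to the address that was validated. Resolving the name a second time at fetch time would reopen the door to DNS rebinding.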

2. Classification

An AI model classifies the content to determine whether it contains actionable threat intelligence. Content that is purely marketing, news without technical indicators, or unrelated to cybersecurity is flagged and may produce fewer or no rules.

3. TTP Extraction

Using an AI model, the pipeline identifies MITRE ATT&CK techniques mentioned or implied in the text. Each TTP is extracted with:

Technique ID — e.g., T1098.001
Technique name — e.g., “Account Manipulation: Additional Cloud Credentials”
Tactic — e.g., Persistence
Confidence — high, medium, or low

The extraction is grounded against the ATT&CK framework to minimize hallucination.
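One way to picture the extracted record and the grounding step is sketched below. The `TTP` class, the `ground` function, and the one-entry `KNOWN_TECHNIQUES` catalogue are hypothetical stand-ins. A real catalogue would be loaded from ATT&CK data, and how unmatched IDs are handled at this stage is an assumption.

```python
import re
from dataclasses import dataclass, replace
from typing import Optional

# ATT&CK technique IDs look like T1098 or T1098.001 (technique plus optional sub-technique).
TECHNIQUE_ID = re.compile(r"^T\d{4}(\.\d{3})?$")

# Tiny illustrative catalogue; a stand-in for the full ATT&CK technique list.
KNOWN_TECHNIQUES = {
    "T1098.001": "Account Manipulation: Additional Cloud Credentials",
}


@dataclass(frozen=True)
class TTP:
    technique_id: str
    name: str
    tactic: str
    confidence: str  # "high", "medium", or "low"


def ground(ttp: TTP) -> Optional[TTP]:
    """Return the TTP with its canonical ATT&CK name, or None if the ID is not in the catalogue."""
    if not TECHNIQUE_ID.match(ttp.technique_id):
        return None
    canonical = KNOWN_TECHNIQUES.get(ttp.technique_id)
    if canonical is None:
        return None
    # Overwrite whatever name the model produced with the catalogue's name.
    return replace(ttp, name=canonical)
```

Replacing the model-supplied name with the catalogue entry means a correct ID with a slightly wrong label still comes out consistent, while an ID that does not exist cannot pass as a real technique.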

4. Filtering

Several filters remove TTPs that cannot produce useful detection rules:

Host-level filter — Removes techniques that require endpoint visibility when targeting cloud platforms
Unknown filter — Removes unrecognized technique IDs
Low confidence filter — Removes TTPs below the confidence threshold
Non-cloud-detectable filter — Removes techniques that have no observable cloud log artifacts
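The four filters above can be sketched as a chain of predicates that records why each TTP was dropped. Everything here is illustrative: the `requires_endpoint` and `cloud_observable` flags, the `KNOWN_IDS` set, and the default threshold of `medium` are assumptions, since the text does not state how these properties are determined or where the threshold sits.

```python
from dataclasses import dataclass
from typing import List, Tuple

CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}

# Illustrative subset of recognized technique IDs (assumption).
KNOWN_IDS = {"T1098.001", "T1059", "T1078.004"}


@dataclass(frozen=True)
class TTP:
    technique_id: str
    confidence: str
    requires_endpoint: bool = False  # hypothetical flag: needs host-level telemetry
    cloud_observable: bool = True    # hypothetical flag: leaves cloud log artifacts


def apply_filters(ttps: List[TTP], min_confidence: str = "medium") -> Tuple[list, list]:
    """Split TTPs into (kept, dropped); dropped entries carry the first failing filter's name."""
    threshold = CONFIDENCE_RANK[min_confidence]
    checks = [
        ("host-level", lambda t: not t.requires_endpoint),
        ("unknown", lambda t: t.technique_id in KNOWN_IDS),
        ("low-confidence", lambda t: CONFIDENCE_RANK.get(t.confidence, -1) >= threshold),
        ("non-cloud-detectable", lambda t: t.cloud_observable),
    ]
    kept, dropped = [], []
    for ttp in ttps:
        reason = next((name for name, passes in checks if not passes(ttp)), None)
        if reason is None:
            kept.append(ttp)
        else:
            dropped.append((ttp, reason))
    return kept, dropped
```

Keeping the drop reason alongside each removed TTP makes it possible to explain to the user why an input produced fewer rules than expected.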

5. Rule Generation

For each surviving TTP and target platform, the pipeline generates Sigma rules using AI with static grounding. The gold corpus — 475+ curated, validated rules — provides examples that anchor the AI output to known-good patterns.

Behavioral rules detect adversary techniques (e.g., “unusual IAM role assumption”). IOC rules detect specific indicators extracted from the text (e.g., known malicious IP addresses).

6. Deduplication

Functionally identical rules (same detection logic, different metadata) are merged to avoid noise.
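A common way to detect rules with the same logic but different metadata is to fingerprint only the logic-bearing fields. The sketch below assumes rules are already parsed into dictionaries and treats `logsource` plus `detection` as the logic. Both choices, and keeping the first rule of each group rather than merging metadata, are assumptions for illustration.

```python
import hashlib
import json
from typing import Dict, List


def logic_fingerprint(rule: dict) -> str:
    """Hash the fields that define what a rule matches, ignoring title, id, tags, etc."""
    core = {"logsource": rule.get("logsource"), "detection": rule.get("detection")}
    # sort_keys makes the serialization independent of key order in the source YAML.
    canonical = json.dumps(core, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def deduplicate(rules: List[dict]) -> List[dict]:
    """Keep the first rule seen for each distinct fingerprint, preserving input order."""
    unique: Dict[str, dict] = {}
    for rule in rules:
        unique.setdefault(logic_fingerprint(rule), rule)
    return list(unique.values())
```

This only catches rules whose logic is structurally identical. Two rules that express the same condition differently (for example, reordered list values) would need further normalization before hashing.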

7. Validation

Every generated rule is validated by pySigma:

YAML syntax correctness
Required fields present (title, logsource, detection, level)
Field names valid for the target platform
Detection logic well-formed

Rules that fail validation are excluded from the output with a notice in pipelineNotices.
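To illustrate the exclude-and-notify behavior, the sketch below performs a simplified version of the required-fields and well-formedness checks on rules already parsed into dictionaries. It does not use pySigma's API and is far less thorough than pySigma. The function names and the notice format are hypothetical; only the required field list and the idea of collecting notices come from the text.

```python
from typing import List, Tuple

REQUIRED_FIELDS = ("title", "logsource", "detection", "level")


def check_rule(rule: dict) -> List[str]:
    """Return a list of problems found in one parsed rule (empty list means it passed)."""
    problems = [f"missing required field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    detection = rule.get("detection")
    if detection is not None:
        if not isinstance(detection, dict) or "condition" not in detection:
            problems.append("detection has no condition")
        elif len(detection) < 2:
            problems.append("detection defines no selections")
    return problems


def validate_all(rules: List[dict]) -> Tuple[List[dict], List[str]]:
    """Split rules into those that pass and a list of notices for those that fail."""
    valid, notices = [], []
    for rule in rules:
        problems = check_rule(rule)
        if problems:
            notices.append(f"{rule.get('title', '<untitled>')}: {'; '.join(problems)}")
        else:
            valid.append(rule)
    return valid, notices
```

Returning notices instead of raising on the first failure lets one bad rule be dropped without discarding the rest of the batch.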

8. Conversion

Validated Sigma rules are converted to SIEM-native query languages:

Backend	Backend ID	Output Format
Splunk	`splunk`	SPL queries
Microsoft Sentinel	`sentinel`	KQL queries
Elasticsearch	`elasticsearch`	Lucene queries
Google Chronicle	`chronicle`	UDM Search queries
OpenSearch	`opensearch`	Lucene/DQL queries
Google SecOps	`google_secops`	YARA-L queries

See SIEM Backends for full details and example output.
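For client code that selects a backend by ID, the table above can be mirrored as a small lookup. This is a sketch of how a caller might validate a backend ID before requesting a conversion. The `BACKENDS` mapping restates the table, and the `output_format` helper is hypothetical.

```python
# Backend ID -> (display name, output format), restating the table above.
BACKENDS = {
    "splunk": ("Splunk", "SPL"),
    "sentinel": ("Microsoft Sentinel", "KQL"),
    "elasticsearch": ("Elasticsearch", "Lucene"),
    "chronicle": ("Google Chronicle", "UDM Search"),
    "opensearch": ("OpenSearch", "Lucene/DQL"),
    "google_secops": ("Google SecOps", "YARA-L"),
}


def output_format(backend_id: str) -> str:
    """Return the query language produced for a backend ID, or raise on an unknown ID."""
    try:
        return BACKENDS[backend_id][1]
    except KeyError:
        supported = ", ".join(sorted(BACKENDS))
        raise ValueError(f"unknown backend {backend_id!r}; supported: {supported}") from None
```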

Gold Corpus

The gold corpus is a curated collection of 475+ Sigma rules covering all 13 platforms. These rules are:

Written and validated by detection engineers
Tested against pySigma
Organized by ATT&CK technique and platform
Used as grounding examples during rule generation

The corpus ensures that AI-generated rules follow correct Sigma conventions, use proper field names, and produce valid detection logic for each platform.

Performance

Metric	Typical Value
Pipeline duration	10–30 seconds
Rules per execution	3–15 (depends on input)
TTPs extracted	5–20