SynthScan

May 17, 2026

This file defines the detection patterns used by SynthScan.

The patterns focus exclusively on AI slop: phrases, vocabulary, structural tells, and hallucination markers that indicate AI-generated code. General code-quality issues (linting, security, style) are intentionally excluded to avoid false positives.

Each pattern is defined in a fenced block under its category. To add new patterns, append them to the appropriate section or create a new ## Category.

Severity tags: prefix a pattern line with [CRITICAL], [HIGH], [MEDIUM], or [LOW] to override the default severity of its category. If the tag is omitted, the category default applies.

Severity → score mapping:

| Tag | Points |
| --- | --- |
| CRITICAL | 10 |
| HIGH | 5 |
| MEDIUM | 2 |
| LOW | 1 |
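
As a rough illustration of the severity mapping, the sketch below turns a list of matched severities into a score. The name `score_matches` is hypothetical, and summing the points is an assumption: this file does not say how SynthScan combines per-match points.

```python
# Points per severity tag, copied from the severity mapping above.
SEVERITY_POINTS = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}


def score_matches(severities):
    """Sum the points for a list of matched severity tags.

    Hypothetical helper; plain summation is an assumption.
    """
    return sum(SEVERITY_POINTS[s] for s in severities)


total = score_matches(["HIGH", "MEDIUM", "MEDIUM", "LOW"])
print(total)  # 5 + 2 + 2 + 1 = 10
```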

Slop Phrases

Default severity: MEDIUM

Classic filler phrases and clichés that AI code assistants inject into comments, docstrings, and string literals. Humans rarely write these.

# Direct AI self-references
As an AI language model
As a language model
I cannot provide
I'm unable to
# Filler / hedging phrases AI over-produces
It's worth noting that
Note that this is a simplified
This is a basic implementation
For demonstration purposes
Let me know if you need
Feel free to modify
Feel free to adjust
Feel free to customize
Here's a simple example
Here is a simple example
As mentioned earlier
As discussed above
Here's how you can
Here is how you can
This should work for most cases
You can modify this to
You may want to adjust
[LOW] Make sure to replace
[LOW] Don't forget to
# Instructional tone (AI talks to the user, not the reader)
regex:#.*\byou can\s+(also\s+)?(use|try|add|change|modify|adjust|replace)\b
[LOW] regex:#.*\bmake sure (to|you)\b
[LOW] regex:#.*\bdon'?t forget to\b

AI Slop Vocabulary

Default severity: MEDIUM

Distinctive words and phrases LLMs disproportionately overuse in comments, docstrings, and string literals. Individually low signal, but clusters are a strong AI tell.

# High-frequency AI slop words in comments
regex:#.*\b(delve|tapestry|multifaceted|nuanced|streamlined)\b
regex:#.*\b(leverage|utilize|facilitate|comprehensive)\b
regex:#.*\b(robust|seamless|cutting-edge|state-of-the-art|paradigm)\b
regex:#.*\b(aforementioned|henceforth|pertaining to|in conjunction with)\b
regex:#.*\b(endeavor|pivotal|intricate|meticulous|holistic)\b
regex:#.*\b(unleash|empower|elevate|harness|supercharge)\b
regex:#.*\b(game-?changer|best practices|synergy|scalable solution)\b
# Same words in docstrings / multi-line strings
regex:""".*\b(delve into|it's important to note|in order to)\b
regex:""".*\b(at the end of the day|a testament to|serves as a)\b
regex:""".*\b(leverage|utilize|robust|seamless|comprehensive|facilitate)\b
# Overly enthusiastic adverbs in comments
regex:#.*\b(Certainly|Absolutely|Definitely|Essentially|Fundamentally)\b
# "Simply" / "just" — oversimplification markers
[LOW] regex:#.*\b(simply|just)\s+(call|use|add|set|pass|create|return)\b
# Phrases in // comments (JS/Go/Java/C++)
regex://.*\b(delve|tapestry|multifaceted|nuanced|leverage|utilize|robust|seamless)\b
regex://.*\b(Certainly|Absolutely|Definitely|Essentially|Fundamentally)\b
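
Since a single vocabulary hit is weak evidence, one way to look for clusters is to count how many lines in a file match. The sketch below uses two patterns copied from this section. The helper `count_vocab_hits` and the idea of counting matching lines are assumptions for illustration, not a description of how SynthScan scores clusters.

```python
import re

# Two vocabulary patterns copied verbatim from this section.
VOCAB = [
    re.compile(r"#.*\b(leverage|utilize|facilitate|comprehensive)\b"),
    re.compile(r"#.*\b(robust|seamless|cutting-edge|state-of-the-art|paradigm)\b"),
]


def count_vocab_hits(source: str) -> int:
    """Count lines that match at least one vocabulary pattern (hypothetical helper)."""
    return sum(
        1
        for line in source.splitlines()
        if any(p.search(line) for p in VOCAB)
    )


sample = """\
# Uses a robust cache to facilitate lookups
cache = {}
# A seamless, comprehensive wrapper
def get(key):
    return cache.get(key)
"""
print(count_vocab_hits(sample))  # 2: both comment lines match
```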

Synthetic Comment Markers

Default severity: HIGH

Comments that explicitly reveal AI authorship or templated generation.

# Direct AI attribution
Generated by AI
Generated by GPT
Generated by ChatGPT
Generated by Copilot
Generated by Claude
Generated by Gemini
Generated by Llama
Generated by Bard
Generated by OpenAI
Auto-generated code
This code was generated
regex:#.*\bAI[- ]generated\b
regex://.*\bAI[- ]generated\b
regex:#.*\bwritten by (an )?AI\b
regex:#.*\bcreated by (an )?AI\b
regex:#.*\bproduced by AI\b
regex://.*\bwritten by (an )?AI\b
# Prompt leakage (AI echoing the user's prompt)
regex:#.*\b(as requested|as you asked|as per your request|per your instructions)\b
regex://.*\b(as requested|as you asked|as per your request)\b

Self-Referential Comments

Default severity: MEDIUM

Comments that narrate what the code is rather than why it exists are a strong AI tell. Humans comment intent; AI describes structure.

# "This X does Y" tautologies
regex:#\s*This\s+(class|function|method|module|file)\s+(is|provides|represents|implements|handles|contains|defines)
regex:#\s*The\s+(following|above|below)\s+(class|function|method|code|block|section)
regex:"""This\s+(class|function|method|module)\s+(is|provides|represents|implements)
# Narrating the obvious
regex:#\s*(Import|Importing)\s+(the\s+)?(necessary|required|needed)\s+(modules|libraries|packages|dependencies)
regex:#\s*(Define|Defining|Create|Creating)\s+(the\s+)?(main|a|an|the)\s+\w+
regex:#\s*(Initialize|Initializing)\s+(the\s+)?\w+\s+(variable|object|instance|class)

Redundant / Tautological Comments

Default severity: LOW

Comments that restate the code verbatim — a hallmark of LLM generation.

# Increment / assignment restatements
regex:#\s*(Set|Assign)\s+\w+\s+to\s+
regex:#\s*(Increment|Decrement)\s+\w+(\s+by\s+\d+)?\s*$
regex:#\s*Return\s+(the\s+)?(result|value|output|data)\s*$
regex:#\s*(Loop|Iterate)\s+(through|over)\s+(the\s+)?(list|array|items|elements|data)
regex:#\s*(Check|Verify)\s+if\s+
regex:#\s*(Print|Display|Output)\s+(the\s+)?(result|value|output|message)
regex:#\s*(Open|Close|Read|Write)\s+(the\s+)?file
regex:#\s*(Add|Append|Push|Insert)\s+(the\s+)?\w+\s+(to|into)\s+(the\s+)?(list|array|queue|stack)
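
Several of these patterns end in `\s*$`, which limits them to comments that stop right after the restatement. The check below, a sketch that assumes patterns are applied per line with `re.search`, shows that a comment which goes on to explain intent is not matched.

```python
import re

# Patterns copied verbatim from this section.
increment = re.compile(r"#\s*(Increment|Decrement)\s+\w+(\s+by\s+\d+)?\s*$")
return_value = re.compile(r"#\s*Return\s+(the\s+)?(result|value|output|data)\s*$")

# Pure restatements of the code are flagged.
assert increment.search("count += 1  # Increment count by 1")
assert return_value.search("# Return the result")

# Comments that continue with a reason fall outside the `$` anchor.
assert not increment.search("# Increment retries so the backoff doubles")
assert not return_value.search("# Return the result as a frozen copy")
```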

Verbosity Indicators

Default severity: LOW

Overly explanatory phrases that signal machine-generated text.

# Over-explanation in comments
This line initializes
This variable stores
We need to check if
The purpose of this function is
The following code block
This section handles
# Numbered step narration
regex:#\s*Step\s+\d+\s*:
regex://\s*Step\s+\d+\s*:

Example Usage Blocks

Default severity: LOW

AI assistants often append an "Example usage:" block to the bottom of generated code.

# Example-usage header comments
regex:#\s*(Example\s+usage|Usage\s+example|Sample\s+usage|How\s+to\s+use)\s*:?\s*$
regex://\s*(Example\s+usage|Usage\s+example)\s*:?\s*$
regex:#\s*Usage:\s*$

Fake / Example Data

Default severity: LOW

Hardcoded placeholder data that AI models insert as "examples" and developers forget to replace. Severity reduced to LOW because placeholder emails and names like "John Doe" appear frequently in legitimate pre-ChatGPT sample code, tutorials, and documentation.

# Canonical placeholder names / emails — LOW to reduce false positives on old tutorial code
[LOW] regex:['"]John\s+Doe['"]
[LOW] regex:['"]Jane\s+Doe['"]
[LOW] regex:['"]user@example\.com['"]
[LOW] regex:['"]admin@example\.com['"]
[LOW] regex:['"]test@test\.com['"]
[LOW] regex:['"]foo@bar\.com['"]
[LOW] regex:['"]123\s+Main\s+St(reet)?['"]
[LOW] regex:['"]Acme\s+(Corp|Inc|Ltd)['"]
# Lorem ipsum / placeholder text — strong AI signal when in code (not docs)
Lorem ipsum
dolor sit amet
# Phone number placeholders
[LOW] regex:['"]555-\d{4}['"]

Cross-Language Confusion

Default severity: HIGH

Applies to: .py

AI models trained on many languages frequently emit idioms from the wrong language. These are strong AI tells because experienced human developers rarely make these mistakes.

# Wrong-language method calls in Python files
regex:\w+\.push\(
regex:\w+\.length\(\)
regex:\w+\.equals\(
regex:\w+\.toString\(\)
# null / undefined in Python (should be None)
regex:\b(null|undefined)\s*[;)}\],]
regex:\bif\s+\w+\s*(==|!=|is)\s*null\b
# true/false lowercase in Python (should be True/False)
regex:\breturn\s+(true|false)\s*$
# Logical operators from C/JS used in Python files (should be and/or/not)
# Only match when preceded by a Python-like variable/expression context
regex:^[^#]*\b\w+\s+&&\s+\w+
regex:^[^#]*\b\w+\s+\|\|\s+\w+
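
A quick way to sanity-check these patterns is to run them against one offending line and one idiomatic Python line. The sketch assumes each pattern is applied per line with `re.search` and without `re.IGNORECASE`; the sample lines are made up.

```python
import re

# Patterns copied verbatim from this section.
push_call = re.compile(r"\w+\.push\(")
null_check = re.compile(r"\bif\s+\w+\s*(==|!=|is)\s*null\b")
and_operator = re.compile(r"^[^#]*\b\w+\s+&&\s+\w+")

# JavaScript-style list append vs. the Python equivalent.
assert push_call.search("items.push(value)")
assert not push_call.search("items.append(value)")

# null comparison vs. the Python None check.
assert null_check.search("if user == null:")
assert not null_check.search("if user is None:")

# `&&` in code is flagged; the leading [^#]* keeps a line that
# starts as a comment from matching.
assert and_operator.search("if ready && enabled:")
assert not and_operator.search("# use a && b in shell")
```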

Hallucination Indicators

Default severity: CRITICAL

Patterns that suggest hallucinated APIs, phantom imports, or invented function signatures — among the strongest signals of AI-generated code.

# Suspicious deeply-nested import paths (common AI hallucinations)
regex:from\s+\w+\.utils\.helpers\s+import\s+\w+
regex:from\s+\w+\.core\.exceptions\s+import\s+\w+Error
# Hallucinated long chained attribute access
regex:\w+\.\w+\.\w+\.\w+\.\w+\.\w+\(
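
The chained-access pattern requires six dot-separated names before the opening parenthesis, so ordinary calls such as `os.path.join(` are not matched. The sketch below checks that, using made-up module and attribute names (`myapp`, `client.api.v2...`) and assuming per-line `re.search`.

```python
import re

# Patterns copied verbatim from this section.
helpers_import = re.compile(r"from\s+\w+\.utils\.helpers\s+import\s+\w+")
long_chain = re.compile(r"\w+\.\w+\.\w+\.\w+\.\w+\.\w+\(")

# `myapp` is a hypothetical package name.
assert helpers_import.search("from myapp.utils.helpers import slugify")
assert not helpers_import.search("from myapp.utils import slugify")

# Six chained names followed by a call are flagged; three are not.
assert long_chain.search("client.api.v2.users.profile.fetch(user_id)")
assert not long_chain.search("os.path.join(base, name)")
```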

Overly Generic Function Names

Default severity: LOW

Applies to: .py, .js, .ts, .jsx, .tsx

Function names so generic they indicate AI-generated scaffolding rather than domain-specific design. Severity is LOW because experienced humans also write these as stubs — cluster scoring carries the signal.

[LOW] regex:def\s+(process_data|handle_request|do_something|do_stuff)\s*\(
[LOW] regex:def\s+(run_task|execute_task|perform_action|main_function)\s*\(
[LOW] regex:def\s+(helper|my_function|my_method|test_function)\s*\(
[LOW] regex:function\s+(processData|handleRequest|doSomething|getData)\s*\(
[LOW] regex:func\s+(processData|handleRequest|doSomething)\s*\(

Excessive Try-Catch Wrapping

Default severity: MEDIUM

AI models tend to wrap every operation in try/except with generic AI-typical error messages.

# Bare "Error:" prefix (AI-typical phrasing)
regex:print\s*\(\s*f?['"]Error:?\s
regex:print\s*\(\s*f?['"]An error occurred
regex:print\s*\(\s*f?['"]Something went wrong
# Bare except Exception catch-alls (AI uses these excessively)
[LOW] regex:^\s*except\s+Exception(\s+as\s+\w+)?:

Decorative Section Separators

Default severity: MEDIUM

AI assistants love inserting visually decorated section headers with Unicode box-drawing characters or long dash/equals lines. Humans occasionally do this, but AI does it systematically throughout a file.

# Unicode box-drawing section headers (── Title ──────)
regex:#.*[─━═╌╍┄┅]{5,}
regex://.*[─━═╌╍┄┅]{5,}
# Long dash/equals separator lines (10+ chars)
regex:#\s*-{10,}\s*$
regex:#\s*={10,}\s*$

Magic Placeholder Names

Default severity: HIGH

Hardcoded API key and token placeholders that AI models insert as stand-ins. Near-certain AI artifacts when found in source code.

regex:\byour[_-]?api[_-]?key\b
regex:\bYOUR[_-]?API[_-]?KEY\b
regex:\bYOUR[_-]?TOKEN[_-]?HERE\b
regex:\bYOUR[_-]?SECRET[_-]?HERE\b
regex:\bINSERT[_-]?YOUR[_-]?(KEY|TOKEN|SECRET|PASSWORD)\b
regex:['"]?<YOUR[_-]?(API[_-]?KEY|TOKEN|SECRET)>['"]?
regex:\byour[_-]?database[_-]?url\b
regex:\bYOUR[_-]?DATABASE[_-]?URL\b
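
The `[_-]?` groups let one pattern cover underscore, hyphen, and run-together spellings of a placeholder. A small check, assuming per-line `re.search` and using made-up sample lines:

```python
import re

# Patterns copied verbatim from this section.
api_key = re.compile(r"\bYOUR[_-]?API[_-]?KEY\b")
angle = re.compile(r"""['"]?<YOUR[_-]?(API[_-]?KEY|TOKEN|SECRET)>['"]?""")

# Underscore, hyphen, and unseparated variants all match.
assert api_key.search('API_KEY = "YOUR_API_KEY"')
assert api_key.search("key = 'YOUR-API-KEY'")
assert api_key.search("key = YOURAPIKEY")

# A key read from the environment is not a placeholder.
assert not api_key.search('API_KEY = os.environ["SERVICE_API_KEY"]')

# Angle-bracket placeholders are covered by a separate pattern.
assert angle.search('token = "<YOUR_TOKEN>"')
```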

Hyper-Verbose Identifiers

Default severity: LOW

Function names so long they describe implementation rather than domain intent. AI models consistently produce identifiers like calculateTotalAmountOfAllItems where a human writes total_price.

regex:def\s+[a-z_]{25,}\s*\(
regex:function\s+[a-zA-Z]{25,}\s*\(
regex:def\s+[a-z_]*(process|calculate|compute|validate|handle|get|set)(And|Or|Then)[A-Z]
regex:function\s+[a-zA-Z]*(process|calculate|compute|validate|handle|get|set)(And|Or|Then)[A-Z]
regex:class\s+\w*(DataManager|DataProcessor|DataHandler|RequestHandler|ResponseHandler)\b
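
To see where the 25-character threshold falls, the sketch below runs two of these patterns against a long and a short definition. The sample identifiers are illustrative and per-line `re.search` is assumed.

```python
import re

# Patterns copied verbatim from this section.
long_def = re.compile(r"def\s+[a-z_]{25,}\s*\(")
manager_class = re.compile(
    r"class\s+\w*(DataManager|DataProcessor|DataHandler|RequestHandler|ResponseHandler)\b"
)

# A 35-character snake_case name crosses the threshold; a short one does not.
assert long_def.search("def calculate_total_amount_of_all_items(items):")
assert not long_def.search("def total_price(items):")

# Any prefix followed by one of the generic suffixes is flagged.
assert manager_class.search("class UserDataManager:")
assert not manager_class.search("class Invoice:")
```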

Cross-Language Confusion (JS/TS)

Default severity: HIGH

Applies to: .js, .ts, .jsx, .tsx

Python idioms incorrectly used in JavaScript/TypeScript files. Experienced JS/TS developers rarely write these; AI models do so frequently.

regex:\breturn\s+None\b
regex:\bif\s+\w+\s*(==|===|!=|!==)\s*None\b
regex:\breturn\s+(True|False)\b
regex:^\s*elif\s+
regex:^\s*print\s*\(
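
The `Applies to:` line keeps this category away from Python files, where `print(` and `return None` are perfectly normal. A minimal sketch of such a filter follows. The function name `category_applies` and the case-insensitive extension comparison are assumptions, not SynthScan's actual behaviour.

```python
from pathlib import PurePath
from typing import Optional


def category_applies(filename: str, applies_to: Optional[str]) -> bool:
    """Return True if a category should be run against `filename`.

    `applies_to` is the text after "Applies to:", e.g. ".js, .ts, .jsx, .tsx".
    None means the category has no restriction (hypothetical convention).
    """
    if applies_to is None:
        return True
    extensions = {ext.strip().lower() for ext in applies_to.split(",")}
    return PurePath(filename).suffix.lower() in extensions


print(category_applies("src/app.tsx", ".js, .ts, .jsx, .tsx"))  # True
print(category_applies("tool.py", ".js, .ts, .jsx, .tsx"))      # False
```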

How to Add New Patterns

  1. Create a new ## Category heading and optionally state a default severity.
  2. Optionally add an Applies to: .py, .js line to restrict the category to specific file extensions.
  3. Add a fenced code block tagged as ```patterns.
  4. Put one pattern per line.
    • Plain text lines are matched as case-insensitive substrings.
    • Lines starting with regex: are compiled as Python regular expressions.
    • Prepend [CRITICAL], [HIGH], [MEDIUM], or [LOW] to override the category default.
  5. Comment lines starting with # inside the block are ignored.
  6. Commit and push — the action will pick up new patterns automatically.
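
The line rules in step 4 can be sketched as a small parser. This is an illustration under stated assumptions, not SynthScan's implementation: `Pattern` and `parse_pattern_line` are hypothetical names, and using `re.search` (rather than `re.match`) is a guess based on how the patterns in this file are written.

```python
import re
from typing import Callable, NamedTuple, Optional

SEVERITY_TAG = re.compile(r"^\[(CRITICAL|HIGH|MEDIUM|LOW)\]\s*")


class Pattern(NamedTuple):
    severity: str
    matches: Callable[[str], bool]


def parse_pattern_line(line: str, default_severity: str) -> Optional[Pattern]:
    """Parse one line of a patterns block; return None for blanks and comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None  # comment lines inside the block are ignored

    # An optional [SEVERITY] prefix overrides the category default.
    severity = default_severity
    tag = SEVERITY_TAG.match(text)
    if tag:
        severity = tag.group(1)
        text = text[tag.end():]

    # "regex:" lines are compiled as Python regular expressions.
    if text.startswith("regex:"):
        compiled = re.compile(text[len("regex:"):])
        return Pattern(severity, lambda s: compiled.search(s) is not None)

    # Everything else is a case-insensitive substring.
    needle = text.lower()
    return Pattern(severity, lambda s: needle in s.lower())


p = parse_pattern_line(r"[LOW] regex:#.*\bmake sure (to|you)\b", "MEDIUM")
print(p.severity, p.matches("f.close()  # make sure to close the file"))  # LOW True
```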