manchittlab/TheCrawler

high

Open-source web scraper + LLM-powered structured extraction. PDF/DOCX, markdown, JSON-LD, microdata, commerce data, forms, 16 analytics-tracker detection. Structured errors with retryable flags. Adaptive Cheerio->Playwright. CLI, npm, REST API, and MCP server. AGPL-3.0.

TheCrawler is an AI-ready web scraper that provides validated extraction contracts, LLM-powered structured extraction, and diagnostic readiness scorin...

purpose: TheCrawler is an AI-ready web scraper that provide...

threat: network exposed

TypeScript · ★ 0 · ◷ May 20, 2026 · ⚙ May 20, 2026 · GITHUB

agpl, apify, cheerio, crawler, llm, markdown, mcp, mcp-server, model-context-protocol, nodejs, playwright, rag, scraper, typescript, web-scraping

◆ Vulnerability Analysis [ 3 findings in 3 blocks ]

◷ 5/20/2026

high: 1 finding

src/main.ts

const input = (await Actor.getInput<ActorInput>()) ?? ({} as ActorInput);

if (!input.urls?.length && !input.searchQuery && !input.sitemapUrl) {
    throw new Error('Input must contain "urls" (non-empty array), "searchQuery", or "sitemapUrl".');
}

src/main.ts:1-14

// Exploitable if MCP is exposed to untrusted prompts (network_exposed).

The input accepts arbitrary URLs, search queries, and sitemap URLs without any validation or sanitization. These are passed directly to crawlStream and extract functions, which will fetch those URLs. An attacker could provide internal network addresses (e.g., 169.254.169.254 for cloud metadata, localhost, or internal services) to perform SSRF attacks.

Impact: An attacker could use the MCP server to scan internal networks, access cloud metadata endpoints, or interact with internal services that are not intended to be exposed. This could lead to information disclosure or further compromise.

Fix: Implement URL validation to block private IP ranges (e.g., 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, 127.0.0.0/8, 169.254.0.0/16) and restrict to allowed domains if possible. Also validate searchQuery and sitemapUrl to prevent injection of malicious URLs.
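A minimal sketch of the URL check described in the fix, assuming a hypothetical helper named isUrlAllowed that is not part of TheCrawler. It only inspects the literal host in the URL, so a DNS name that resolves to a private address would still pass; a fuller guard would also resolve the host and re-check the resulting addresses, and would re-validate every redirect target. IPv6 literals are rejected outright here as a conservative simplification.

```typescript
// Hypothetical helper, not part of TheCrawler: rejects URLs whose host is a
// literal loopback, private, or link-local IPv4 address.
const BLOCKED_V4_CIDRS: Array<[string, number]> = [
  ["10.0.0.0", 8],
  ["172.16.0.0", 12],
  ["192.168.0.0", 16],
  ["127.0.0.0", 8],
  ["169.254.0.0", 16],
];

// Convert a dotted-quad IPv4 string to an unsigned integer, or null if it is not IPv4.
function ipv4ToInt(host: string): number | null {
  const parts = host.split(".");
  if (parts.length !== 4) return null;
  let value = 0;
  for (const part of parts) {
    if (!/^\d{1,3}$/.test(part)) return null;
    const octet = Number(part);
    if (octet > 255) return null;
    value = value * 256 + octet;
  }
  return value;
}

// True when ip falls inside base/prefix. Uses division instead of bit shifts
// to avoid signed 32-bit overflow in JavaScript.
function inCidr(ip: number, base: string, prefix: number): boolean {
  const baseInt = ipv4ToInt(base);
  if (baseInt === null) return false;
  const size = 2 ** (32 - prefix);
  return Math.floor(ip / size) === Math.floor(baseInt / size);
}

function isUrlAllowed(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch (_err) {
    return false; // not a parseable absolute URL
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return false;
  if (host.startsWith("[")) return false; // IPv6 literal: rejected conservatively

  const ip = ipv4ToInt(host);
  if (ip === null) return true; // DNS name: literal check passes, resolution not checked here
  return !BLOCKED_V4_CIDRS.some(([base, prefix]) => inCidr(ip, base, prefix));
}
```

Under this sketch, the same check would be applied to each entry in urls and to sitemapUrl before any fetch is issued, for example by filtering the input with isUrlAllowed and failing the run if any entry is rejected.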

high: 1 finding

src/main.ts

const baseUrl = input.llmBaseUrl || process.env.THECRAWLER_LLM_BASEURL || '';
const model = input.llmModel || process.env.THECRAWLER_LLM_MODEL || '';
if (!baseUrl || !model) {
    throw new Error('extractMode requires llmBaseUrl + llmModel input fields, or THECRAWLER_LLM_BASEURL + THECRAWLER_LLM_MODEL Actor environment variables.');
}
const apiKey = process.env.THECRAWLER_LLM_API_KEY || undefined;

src/main.ts:1-14

// Exploitable if MCP is exposed to untrusted prompts (network_exposed).

The LLM base URL and model are taken from user input or environment variables without any validation. An attacker could provide a malicious LLM endpoint (e.g., an attacker-controlled server) to intercept API keys or manipulate extraction results. Additionally, the API key is read from environment variables and passed to the LLM provider, which could be a malicious endpoint.

Impact: An attacker could exfiltrate the API key by pointing the LLM base URL to their own server, or inject malicious responses that could lead to further compromise (e.g., prompt injection).

Fix: Validate the LLM base URL against an allowlist of known providers, or at minimum ensure it uses HTTPS and is not a private IP. Consider not allowing user-supplied LLM endpoints in production.
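One way the allowlist check could look, as a sketch under stated assumptions: validateLlmBaseUrl and ALLOWED_LLM_HOSTS are hypothetical names, and the single host in the set is only an illustrative entry, not a TheCrawler default. The idea is that the API key is attached to a request only after the base URL has passed this check.

```typescript
// Hypothetical allowlist of LLM provider hosts; the entry is illustrative only.
const ALLOWED_LLM_HOSTS: ReadonlySet<string> = new Set(["api.openai.com"]);

// Returns a normalized base URL, or throws if the URL is not acceptable.
function validateLlmBaseUrl(
  raw: string,
  allowedHosts: ReadonlySet<string> = ALLOWED_LLM_HOSTS,
): string {
  let url: URL;
  try {
    url = new URL(raw);
  } catch (_err) {
    throw new Error("llmBaseUrl is not a valid absolute URL.");
  }
  if (url.protocol !== "https:") {
    throw new Error("llmBaseUrl must use HTTPS.");
  }
  if (url.username !== "" || url.password !== "") {
    throw new Error("llmBaseUrl must not embed credentials.");
  }
  const host = url.hostname.toLowerCase();
  if (!allowedHosts.has(host)) {
    throw new Error(`llmBaseUrl host "${host}" is not on the allowlist.`);
  }
  // Drop query and fragment, and trim trailing slashes from the path.
  return url.origin + url.pathname.replace(/\/+$/, "");
}
```

An exact-match host allowlist makes a separate private-IP check unnecessary for this field, because any IP literal or unknown host is rejected by default.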

medium: 1 finding

src/main.ts

const contract = input.extractContract ? getExtractionContract(input.extractContract) : null;
if (!contract && !input.extractJsonSchema && !input.extractPrompt) {
    throw new Error('extractMode requires extractContract, extractJsonSchema, or extractPrompt.');
}

src/main.ts:1-14

// Exploitable if MCP is exposed to untrusted prompts (network_exposed).

The extract mode accepts arbitrary JSON schemas and prompts from user input. While this is part of the intended functionality (LLM-powered extraction), it allows an attacker to craft malicious schemas or prompts that could cause the LLM to perform unintended actions (e.g., prompt injection, data exfiltration). This is an excessive capability if the MCP is exposed to untrusted users.

Impact: An attacker could use prompt injection to manipulate the LLM into revealing sensitive information, executing arbitrary commands (if the LLM has tool access), or generating harmful content. The JSON schema could also be used to extract data in unexpected ways.

Fix: Restrict the ability to supply arbitrary schemas and prompts to trusted users only. Consider using predefined contracts with limited customization. Implement input sanitization and validation for prompts.
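The gating described in the fix might be sketched as follows. Everything here is hypothetical: resolveExtraction, PREDEFINED_CONTRACTS, the callerIsTrusted flag, and the length cap are illustrative names and values, and the registry stands in for whatever getExtractionContract actually looks up. A length cap alone does not stop prompt injection; it only bounds what a trusted caller can send.

```typescript
// Field names mirror the input fields shown in the finding above.
interface ExtractInput {
  extractContract?: string;
  extractJsonSchema?: unknown;
  extractPrompt?: string;
}

// Hypothetical fixed registry of named contracts available to every caller.
const PREDEFINED_CONTRACTS: Record<string, { prompt: string }> = {
  product: { prompt: "Extract the product name, price, and currency." },
  article: { prompt: "Extract the title, author, and publication date." },
};

const MAX_PROMPT_LENGTH = 2000; // assumed limit, chosen for illustration

type ResolvedExtraction =
  | { kind: "contract"; name: string; prompt: string }
  | { kind: "custom"; prompt?: string; schema?: unknown };

function resolveExtraction(input: ExtractInput, callerIsTrusted: boolean): ResolvedExtraction {
  const name = input.extractContract;
  if (name !== undefined) {
    // Own-property check so names like "__proto__" cannot match inherited keys.
    if (!Object.prototype.hasOwnProperty.call(PREDEFINED_CONTRACTS, name)) {
      throw new Error(`Unknown extraction contract "${name}".`);
    }
    return { kind: "contract", name, prompt: PREDEFINED_CONTRACTS[name].prompt };
  }
  if (input.extractJsonSchema === undefined && input.extractPrompt === undefined) {
    throw new Error("extractMode requires extractContract, extractJsonSchema, or extractPrompt.");
  }
  // Free-form schemas and prompts are the risky path: trusted callers only.
  if (!callerIsTrusted) {
    throw new Error("Custom schemas and prompts are limited to trusted callers.");
  }
  if (input.extractPrompt !== undefined && input.extractPrompt.length > MAX_PROMPT_LENGTH) {
    throw new Error(`extractPrompt exceeds ${MAX_PROMPT_LENGTH} characters.`);
  }
  return { kind: "custom", prompt: input.extractPrompt, schema: input.extractJsonSchema };
}
```

With this shape, an untrusted MCP client can still request a named contract such as "product", while a request carrying extractPrompt or extractJsonSchema is refused unless the deployment marks the caller as trusted.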

◆ Heuristic Signals

shell.exec, browser.automation, env.exposure, filesystem.read, filesystem.write, network.http

◆ Risk Score

LLM-based

high findings: +50

medium findings: +15