v11.0.0: AI-native scraping for .NET is here

Scrape any site.
Feed your AI.

WebReaper is an AI-native web scraper for .NET. One ~12 MB binary turns any site into clean Markdown or structured data, with an LLM layer when you need it. No Docker, no signup, MIT licensed.

$ brew install pavlovtech/webreaper/webreaper
zsh
# Hacker News
- Show HN: An AI-native web scraper in .NET
- Ask HN: Best way to feed a docs site into an LLM?
- Launch HN: WebReaper, a single-binary crawler
- Scraping behind Cloudflare without the headache

Works with the tools you already use

.NET, OpenAI, Anthropic, Ollama, Azure OpenAI, Playwright, Redis, MongoDB

Everything a modern scraper needs

Batteries included, nothing locked in. Compose exactly the pipeline you want.

Drop on PATH, run

A single ~12 MB native binary. No Docker, no Postgres, no signup. Install and you're scraping in seconds.

AI-native by composition

Markdown by default. Add schema extraction, an LLM fallback, self-healing selectors, or an autonomous agent with one .With… call.

Bring any LLM

OpenAI, Anthropic, Ollama, Azure OpenAI, llamafile: any IChatClient via Microsoft.Extensions.AI. Never locked in.

Bot-checks handled automatically

Detects Cloudflare, DataDome, and PerimeterX challenges and escalates from plain HTTP to a browser to a stealth browser. Escalation is decided per page and sticks per host. Blocked pages are dropped, never returned as data.

Distributed when needed

Swap the scheduler, tracker, and sink to Redis, MongoDB, SQLite, Azure Service Bus, or Cosmos. Same code, many workers.

MIT, not AGPL

Embed it in commercial software, fork it, redistribute it. No copyleft, no service-source obligations, no license tax.

AI-native

Deterministic where you can, AI where you must

Start with fast, free selectors. Reach for an LLM only when the page fights back.

Markdown by default

Any page to clean, LLM-ready Markdown

No schema, no selectors. Point WebReaper at a URL and get back tidy Markdown you can pipe straight into a prompt or a vector store.

Program.cs
using WebReaper.Builders;

var engine = await ScraperEngineBuilder
    .Crawl("https://news.ycombinator.com")
    .AsMarkdown()
    .WriteToConsole()
    .BuildAsync();

await engine.RunAsync();
Typed extraction

Structured data with compile-time schemas

Declare fields once on a POCO. A Roslyn source generator emits a static schema and a reflection-free materializer that is AOT-clean, with no runtime guessing.

Program.cs
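using WebReaper.Builders;

// Top-level statements must precede type declarations in Program.cs.
var engine = await ScraperEngineBuilder
    .Crawl("https://example.com/post")
    .Extract(Article.Schema)
    .BuildAsync();

await engine.RunAsync();

// Article.Schema is emitted by the source generator for this partial class.
[ScrapeSchema]
public partial class Article
{
    [ScrapeField("h1")] public string? Title { get; set; }
    [ScrapeField(".score", Type = SchemaFieldType.Integer)]
    public int Points { get; set; }
    [ScrapeField(".tag", IsList = true)]
    public List<string> Tags { get; set; } = new();
}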
Deterministic first, LLM as rescue

Self-healing extraction that costs nothing when it works

Cheap CSS selectors run first. If a field comes back empty, the LLM fills it and caches the fix. Stable pages cost zero LLM calls.

Program.cs
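using WebReaper.AI;
using WebReaper.Builders;

// chatClient is any Microsoft.Extensions.AI IChatClient you have already constructed.
var engine = await ScraperEngineBuilder
    .Crawl("https://example.com")
    .Extract(Article.Schema)
    .WithLlmFallback(chatClient)   // OpenAI, Anthropic, Ollama…
    .WriteToJsonFile("articles.jsonl")
    .BuildAsync();

await engine.RunAsync();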
Command line

The whole toolkit, one command away

Scrape a page, map a site, or crawl everything to JSON Lines. The CLI is compiled with Native AOT, is bot-check aware, and ships a Claude Code skill.

  • scrape: one page to Markdown or JSON
  • map: discover the URLs on a site
  • crawl: every on-domain page to JSON Lines
  • init: wire the Claude Code skill
Terminal
# One page as Markdown
webreaper scrape https://example.com

# Discover URLs on a site
webreaper map https://example.com --search /blog/ --max-urls 50

# Crawl a whole site to JSON Lines
webreaper crawl https://example.com > pages.jsonl

# Bot-protected? A plain scrape auto-climbs to a browser; --stealth starts at the top tier
webreaper scrape https://example.com --stealth
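
Because crawl writes JSON Lines, one JSON object per line, ordinary line-oriented tools can work on the output. A minimal sketch, assuming a small stand-in pages.jsonl: the "url" and "markdown" field names are hypothetical, since the real record shape is not shown here. Swap in the file produced by the crawl command above.

```shell
# Stand-in for the output of `webreaper crawl https://example.com > pages.jsonl`.
# The "url" and "markdown" field names are assumptions for illustration only.
cat > pages.jsonl <<'EOF'
{"url":"https://example.com/","markdown":"# Home"}
{"url":"https://example.com/blog/a","markdown":"# Post A"}
{"url":"https://example.com/blog/b","markdown":"# Post B"}
EOF

# JSON Lines holds one record per line, so counting lines counts pages.
page_count=$(wc -l < pages.jsonl | tr -d ' ')

# Keep only lines that mention /blog/ (a plain text match, not a JSON query).
grep '/blog/' pages.jsonl > blog.jsonl
blog_count=$(wc -l < blog.jsonl | tr -d ' ')

echo "pages: $page_count, blog pages: $blog_count"
# prints: pages: 3, blog pages: 2
```

A text match can also hit URLs quoted inside page content, so a JSON-aware tool such as jq is the safer choice for field-level filtering.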

How WebReaper compares

Local-first and MIT licensed, with the AI features people reach for the cloud to get.

Feature                           | WebReaper | Firecrawl     | Crawl4AI      | Crawlee
Single self-contained binary      | Supported | Not supported | Not supported | Not supported
MIT licensed                      | Supported | Not supported | Supported     | Supported
LLM extraction + autonomous agent | Supported | Supported     | Partial       | Not supported
Auto bot-check stealth            | Supported | Partial       | Partial       | Partial
Pluggable distributed backends    | Supported | Supported     | Not supported | Supported
Runs natively in .NET / C#        | Supported | Not supported | Not supported | Not supported

Built for real work

From LLM data pipelines to price monitoring and autonomous agents.

All use cases

Free to run. Pay only to scale.

The open-source core does everything locally. Hosted tiers add scheduling, managed infrastructure, and a team UI.

Open Source

Free

The library, CLI, and Claude Code skill. MIT, self-hosted, forever.

Install now

Cloud

Early access

Hosted scheduled crawls, managed proxies and stealth, a team dashboard.

Join the waitlist

Enterprise

Custom

SSO, SLAs, on-prem, private satellites, and dedicated support.

Contact sales

Frequently asked questions

Is WebReaper really free?

Yes. The library, the CLI, and the Claude Code skill are MIT licensed and free forever. You only pay if you later choose the optional hosted Cloud or Enterprise tiers.

Do I have to use an LLM?

No. WebReaper is deterministic by default: CSS/XPath selectors and clean Markdown need no model. The AI features are opt-in and bring-your-own LLM, so you only pay for tokens when you ask for them.

How is it different from Firecrawl?

Firecrawl is a hosted, AGPL-licensed cloud service. WebReaper is a local-first, MIT-licensed binary and .NET library. You run it yourself, embed it in commercial code, and bring any LLM.

Can it handle JavaScript and bot protection?

Yes. Swap the HTTP transport for Playwright or raw CDP for JS rendering, and pass --auto-stealth to escalate to a stealth Chromium backend on Cloudflare, DataDome, or PerimeterX challenges.

Does it scale to large crawls?

The crawl loop is parallel by design. Swap the scheduler, visited-link tracker, and result sink to Redis, MongoDB, SQLite, Azure Service Bus, or Cosmos and run many workers against shared state.

Start scraping in 30 seconds

Install the binary, run one command, and pipe clean data into whatever comes next.

$ brew install pavlovtech/webreaper/webreaper