22,700 stars on GitHub. A parser that adapts to structural changes. And a native MCP server that connects directly with Claude.
Every developer who’s ever built a web scraper knows that feeling. You spend an afternoon writing a clean script, selectors perfectly tuned — and two weeks later the site rotates its CSS classes, reorganizes the layout, or migrates to a new framework. Your scraper breaks. You start from scratch.
Scrapling, a Python framework created by Karim Shoair (D4Vinci), tackles this problem head-on. Its parser doesn’t just find elements by selector — it learns where those elements live in the page structure, and when the layout changes, it automatically relocates them using similarity algorithms. The scraper keeps running.
This month Scrapling entered GitHub Trending with over 22,700 stars. For developers building data pipelines and RAG architectures for AI systems, the timing is relevant.
The core problem: scrapers are fragile
Traditional scraping tools — BeautifulSoup, Scrapy, plain requests — are excellent. But they all share the same fundamental assumption: the page structure is stable. The moment a site rotates its CSS classes or restructures its DOM, the selectors stop working.
The usual solution is manual maintenance: detect the breakage, reverse-engineer the new structure, update the selectors, redeploy. For one-off scripts it’s annoying. For production pipelines feeding a data lake or RAG system, where stale or missing data has real downstream consequences, it’s a serious operational burden.
Scrapling’s approach: mark elements with auto_save=True and the framework stores a structural fingerprint. On subsequent runs, even if the exact selectors no longer work, the adaptive engine finds the element by structural similarity. The scraper degrades gracefully instead of failing silently.
from scrapling.fetchers import StealthyFetcher
StealthyFetcher.adaptive = True
page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
# auto_save stores a structural fingerprint of each element
products = page.css('.product', auto_save=True)
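To make the idea of a structural fingerprint more concrete, here is a minimal sketch that uses only the standard library. It is not Scrapling's actual algorithm, which this article does not detail. The `ElementCollector`, `fingerprint`, and `relocate` names, the fingerprint layout, and the use of `difflib` for scoring are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects every element with its tag, attributes, ancestor path, and own text."""

    def __init__(self):
        super().__init__()
        self.stack = []     # elements that are currently open
        self.elements = []  # every element seen, in document order

    def handle_starttag(self, tag, attrs):
        element = {
            "tag": tag,
            "attrs": dict(attrs),
            "path": "/".join(e["tag"] for e in self.stack),
            "text": "",
        }
        self.stack.append(element)
        self.elements.append(element)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack:
            self.stack[-1]["text"] += data.strip()


def parse(html):
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements


def fingerprint(element):
    """Hypothetical fingerprint: ancestor path, tag, sorted attributes, and text."""
    attrs = " ".join(f"{k}={v}" for k, v in sorted(element["attrs"].items()))
    return f'{element["path"]}/{element["tag"]} {attrs} {element["text"]}'


def relocate(saved, html, threshold=0.5):
    """Return the element whose fingerprint is most similar to `saved`, or None."""
    best, best_score = None, 0.0
    for element in parse(html):
        score = SequenceMatcher(None, saved, fingerprint(element)).ratio()
        if score > best_score:
            best, best_score = element, score
    return best if best_score >= threshold else None


# First run: the selector still works, so we store the fingerprint.
old_html = '<div class="list"><span class="price" data-id="7">19.99</span></div>'
target = next(e for e in parse(old_html) if e["attrs"].get("class") == "price")
saved = fingerprint(target)

# Later run: the class name and the wrapper element have both changed.
new_html = '<section class="grid"><p>Sale</p><span class="amt-x1" data-id="7">19.99</span></section>'
match = relocate(saved, new_html)
```

In this toy case the renamed `span` is still the closest candidate, because its tag, `data-id` attribute, and text survived the redesign. A real implementation would weigh many more signals; the threshold of 0.5 here is arbitrary.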
Three-tier fetcher system
Scrapling organizes its fetching capabilities into three tiers. You start at the lowest cost and scale only when necessary:
Fetcher / AsyncFetcher — Plain HTTP requests without browser overhead. Fastest, lowest resource usage. Works for any site that doesn’t require JavaScript execution.
DynamicFetcher — Browser-based fetching via Playwright for sites that depend on JavaScript. Handles dynamic content, lazy-loaded sections, and single-page applications.
StealthyFetcher — Browser fetching with active anti-bot evasion. Bypasses Cloudflare Turnstile and similar protections out of the box, using a completely spoofed Chromium fingerprint.
All three return the same Response object, which inherits from Selector. This means any extraction code you write for a static request with Fetcher works identically if you later migrate to DynamicFetcher — no changes to extraction logic.
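The practical payoff of a shared response type is that you can escalate between tiers without touching extraction code. The sketch below shows that pattern with stand-in fetchers so it runs offline. `FakeResponse`, `fetch_with_escalation`, and the three fetch functions are hypothetical placeholders, not part of Scrapling; in a real project each callable would wrap the corresponding Scrapling fetcher.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Stand-in for the shared response type; holds only what this sketch needs."""
    status: int
    titles: list[str]


def extract_titles(response):
    # Extraction is written once and never checks which tier produced the response.
    return [t.strip() for t in response.titles if t.strip()]


def fetch_with_escalation(url, tiers, is_good):
    """Try each (name, fetch) pair in order, cheapest first; return the first usable result."""
    for name, fetch in tiers:
        try:
            response = fetch(url)
        except Exception:
            continue  # a failing tier simply means "try the next one"
        if is_good(response):
            return name, response
    raise RuntimeError(f"no tier produced a usable response for {url}")


calls = []  # records which tiers were actually invoked


def plain_http(url):
    calls.append("http")
    # Simulates a page that needs JavaScript: the request succeeds but yields no data.
    return FakeResponse(status=200, titles=[])


def browser(url):
    calls.append("browser")
    return FakeResponse(status=200, titles=["  Widget ", "Gadget"])


def stealth(url):
    calls.append("stealth")
    return FakeResponse(status=200, titles=["Widget", "Gadget"])


def has_titles(response):
    return response.status == 200 and bool(extract_titles(response))


tiers = [("http", plain_http), ("browser", browser), ("stealth", stealth)]
tier_used, response = fetch_with_escalation("https://example.com", tiers, has_titles)
titles = extract_titles(response)
```

Here the plain HTTP tier returns an empty page, the browser tier succeeds, and the most expensive tier is never invoked.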
Spider API for large-scale crawls
For larger workloads, Scrapling includes a Scrapy-style spider framework. You define a spider with start_urls and an async parse callback, and the engine handles concurrent crawling with configurable concurrency limits, domain throttling, download delays, pause/resume, and automatic proxy rotation.
from scrapling import Spider

class ProductSpider(Spider):
    start_urls = ['https://example.com/catalogo']

    async def parse(self, response):
        for product in response.css('.product', auto_save=True):
            yield {
                'name': product.css('.name').text,
                'price': product.css('.price').text,
            }
        next_page = response.css('a.next-page')
        if next_page:
            yield next_page.follow()
The concurrent crawling engine offers real-time stats and streaming — useful for monitoring long-running jobs without needing to poll.
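If you have not worked with a concurrency-limited crawler before, the two knobs mentioned above (a concurrency limit and a download delay) can be sketched with plain `asyncio`. This is a generic illustration, not Scrapling's engine: `fake_fetch`, `crawl`, and their parameters are hypothetical, and the fetch is simulated so the example runs offline.

```python
import asyncio


async def fake_fetch(url):
    """Stand-in for a real network request."""
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, max_concurrency=2, download_delay=0.0):
    semaphore = asyncio.Semaphore(max_concurrency)
    in_flight = 0  # requests currently holding a slot
    peak = 0       # highest number of simultaneous requests observed
    results = {}

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:  # at most max_concurrency workers get past this line
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results[url] = await fake_fetch(url)
                # Politeness delay, taken while still holding the slot.
                await asyncio.sleep(download_delay)
            finally:
                in_flight -= 1

    await asyncio.gather(*(worker(url) for url in urls))
    return results, peak


urls = [f"https://example.com/page/{i}" for i in range(6)]
results, peak = asyncio.run(crawl(urls, max_concurrency=2, download_delay=0.01))
```

All six pages are fetched, but never more than two at a time. Holding the slot during the delay is one design choice among several; a per-domain throttle would need a separate semaphore or timer per host.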
MCP Server: connect your scraper with Claude
One detail that stands out: Scrapling comes with a native MCP server. This means you can configure it as a tool in Claude Code, Cursor, or any MCP-compatible environment.
The integration is designed specifically for AI workflows — instead of dumping raw HTML into your LLM’s context, Scrapling preprocesses and extracts the target content before passing it to the model. The result: fewer tokens consumed, lower API costs, faster responses.
For developers building RAG pipelines or data acquisition layers for AI systems, this is a concrete efficiency gain. Fetching a raw page and cleaning it in the LLM’s context is wasteful; having a dedicated extraction layer that solves it before tokens reach the model is the cleaner architecture.
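To see why extracting before prompting saves tokens, consider this small standard-library sketch. It is not how Scrapling's MCP server works internally; `MainTextExtractor`, `extract_main_text`, and the list of skipped tags are assumptions for illustration, and character counts are used only as a rough proxy for tokens.

```python
from html.parser import HTMLParser


class MainTextExtractor(HTMLParser):
    """Keeps visible text and drops scripts, styles, and page chrome."""

    SKIP = {"script", "style", "nav", "header", "footer", "noscript"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def extract_main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


raw_html = """<html><head><style>body{margin:0}</style>
<script>var tracking = {id: 1};</script></head>
<body><nav><a href="/">Home</a><a href="/shop">Shop</a></nav>
<main><h1>Blue Widget</h1><p>Price: $19.99</p></main>
<footer>Copyright Example Inc.</footer></body></html>"""

clean = extract_main_text(raw_html)
saved_chars = len(raw_html) - len(clean)
```

Only the product heading and price line survive, so the model receives a fraction of the original payload. On real pages, where markup and scripts dominate, the gap is typically much larger than in this toy input.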
Note for LatAm teams: For teams operating with tight API budgets — which describes most dev shops in the region — token reduction is a concrete cost benefit, not just a performance improvement.
Installation
# Parser only (without fetchers)
pip install scrapling
# Full installation with all fetchers
pip install "scrapling[fetchers]"
scrapling install # Downloads browsers and fingerprint dependencies
Requires Python 3.10+. Docker image available via docker pull pyd4vinci/scrapling.
For projects where you don’t want to manage browser dependencies, the Fetcher / AsyncFetcher tier works with just pip install scrapling — no browser installation.
Caveats worth knowing
auto_save is not magic. The adaptive engine relocates elements by structural similarity, so if a redesign is radical enough (a complete DOM restructuring rather than just new CSS class names), the stored fingerprints may no longer match anything. It reduces maintenance; it does not eliminate it.
Browser dependencies are heavy. scrapling install downloads Playwright browsers and fingerprint manipulation libraries. In resource-constrained environments (CI containers with little disk space, lean VMs), consider the additional footprint.
Anti-bot bypass has limits. StealthyFetcher handles Cloudflare Turnstile well. For enterprise protections (Akamai, DataDome, Kasada), the README itself points to external API services — Scrapling alone isn’t sufficient.
Legal and ethical use. Web scraping is subject to each site’s terms of service and applicable regulation. Scrapling is a technical tool — responsibility for appropriate use rests with the developer.
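Because adaptive relocation can still miss after a radical redesign (the first caveat above), it is worth validating every batch before it reaches your pipeline. The following is a minimal sketch of such a check; `validate_items` and its parameters are hypothetical and independent of Scrapling.

```python
def validate_items(items, required_fields, min_items=1):
    """Return a list of human-readable problems; an empty list means the batch looks sane."""
    problems = []
    if len(items) < min_items:
        problems.append(f"expected at least {min_items} items, got {len(items)}")
    for index, item in enumerate(items):
        # A field counts as missing if it is absent, None, or blank.
        missing = [f for f in required_fields if not str(item.get(f) or "").strip()]
        if missing:
            problems.append(f"item {index} is missing {', '.join(missing)}")
    return problems


good_batch = [{"name": "Blue Widget", "price": "19.99"}]
broken_batch = [{"name": "Blue Widget", "price": ""}, {"name": None, "price": "5.00"}]

good_report = validate_items(good_batch, ["name", "price"])
broken_report = validate_items(broken_batch, ["name", "price"])
empty_report = validate_items([], ["name", "price"])
```

A non-empty report is the signal to alert someone or skip the load step, rather than letting empty fields flow silently into a data lake or vector store.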
Who is this for
Scrapling fits naturally in some specific scenarios:
- Data pipelines for AI: feeding vector stores, knowledge bases, or RAG retrieval layers with structured data
- Price and inventory monitoring: long-running scrapers that break every time retailers update their frontend
- Research and competitive intelligence: periodic extraction from multiple sources with varying stability
- Developers migrating from BeautifulSoup or Scrapy: the API is deliberately familiar, so the learning curve is shallow
For one-off extractions where you control the target URL and don’t need resilience, simpler tools are sufficient. Scrapling justifies its weight in production pipelines where uptime matters.
Resources
- GitHub: github.com/D4Vinci/Scrapling — 22,700+ stars
- Documentation: scrapling.readthedocs.io
- License: BSD-3-Clause
