2025 · Self-hosted, AI-assisted web scraping

ScrapeGPT

Paste a URL and an LLM proposes the extraction fields and selectors; the app re-validates and self-heals them against the real HTML, then crawls and exports clean CSV/JSON/XLSX.

The problem

Scrapers built on hand-written CSS selectors are brittle: one layout change and they silently return nothing. ScrapeGPT has an LLM discover the selectors from the page, then makes extraction resilient. It re-checks every selector against the real HTML, self-heals the ones that miss, and falls back through multiple strategies, so even a wrong guess still yields records.

My role

Designed and built the full stack — a FastAPI + PostgreSQL backend, a provider-agnostic LLM layer (LiteLLM) with encrypted bring-your-own keys, and a React/TypeScript UI.

How it works

01 URL

02 Safe fetch

03 DOM summary

04 LLM selectors

05 Validate + heal

06 Crawl + export

Engineering highlights

AI proposes, code verifies

An LLM reads a distilled DOM summary (not raw HTML) and proposes the repeated-item and per-field selectors — and the system never trusts them blindly.
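
To make the idea of a distilled DOM summary concrete, here is a minimal sketch using only the Python standard library. It is not ScrapeGPT's actual summarizer: the `DomSummarizer` class, the `summarize` function, and the "tag.class xN" output format are all assumptions chosen for illustration. The point is that repeated tag-and-class signatures are a compact hint about where the repeated items live, and they are far smaller than the raw HTML.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags whose start tags carry no useful structure for selector discovery (assumed list).
SKIP_TAGS = {"script", "style", "noscript"}


class DomSummarizer(HTMLParser):
    """Counts how often each tag+class signature opens in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.signatures: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Sort classes so "a b" and "b a" produce the same signature.
        signature = tag + "".join("." + name for name in sorted(classes))
        self.signatures[signature] += 1


def summarize(html: str, top: int = 10) -> list[str]:
    """Return the most frequent repeated signatures as short text lines."""
    parser = DomSummarizer()
    parser.feed(html)
    return [
        f"{signature} x{count}"
        for signature, count in parser.signatures.most_common(top)
        if count > 1  # a signature seen once is unlikely to be a repeated item
    ]


if __name__ == "__main__":
    page = (
        '<ul><li class="item"><span class="price">1</span></li>'
        '<li class="item"><span class="price">2</span></li></ul>'
    )
    print("\n".join(summarize(page)))
```

A summary like this would be what goes into the prompt in place of the page source; the selectors that come back are still only proposals.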

Self-healing selectors

Every selector is re-run against the real page; ones that match nothing are relaxed or demoted, with a table-structure fallback, so an imperfect guess still extracts data.
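
The relax-then-demote step can be sketched as follows. This is an illustration under stated assumptions, not the project's real healing logic: the function names `relaxations` and `heal`, the specific relaxation order, and the status labels "ok", "relaxed", and "demoted" are all hypothetical. The page is abstracted behind a `count_matches` callable so the sketch stays dependency-free, and the table-structure fallback is not shown.

```python
import re
from typing import Callable, Optional

# Positional pseudo-classes that tend to break when a layout shifts (assumed list).
POSITIONAL = re.compile(r":nth-(?:child|of-type)\([^)]*\)|:(?:first|last)-child")


def relaxations(selector: str) -> list[str]:
    """Return the selector followed by progressively looser variants.

    Simplification: assumes plain tag/class selectors with no spaces or '>'
    inside attribute values.
    """
    original = selector.strip()
    # 1. Drop positional pseudo-classes such as :nth-child(2).
    no_positional = POSITIONAL.sub("", original)
    # 2. Turn child combinators (>) into descendant combinators (space).
    loose = re.sub(r"\s*>\s*", " ", no_positional)
    candidates = [original, no_positional, loose]
    # 3. Drop leading ancestors one at a time, keeping the rightmost parts.
    parts = loose.split()
    for start in range(1, len(parts)):
        candidates.append(" ".join(parts[start:]))
    # Deduplicate while preserving order; discard empty strings.
    return list(dict.fromkeys(c for c in candidates if c))


def heal(
    selector: str, count_matches: Callable[[str], int]
) -> tuple[Optional[str], str]:
    """Return (working selector, status) after checking against the real page."""
    for index, candidate in enumerate(relaxations(selector)):
        if count_matches(candidate) > 0:
            return candidate, ("ok" if index == 0 else "relaxed")
    # Nothing matched: keep the field but flag it so a fallback can take over.
    return None, "demoted"
```

In practice `count_matches` would run the selector against the fetched document, for example `lambda s: len(soup.select(s))` if the page were parsed with BeautifulSoup (an assumption; the source does not name the parser).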

Dynamic, interaction-aware

Detects on-page controls (toggles, dropdowns like Metric/Imperial), drives a real browser to flip them, and captures each variant — not just static HTML.
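
The detection half of this can be sketched with the standard library alone. The code below is a hypothetical illustration, not the app's detector: `Control`, `ControlFinder`, `find_controls`, and `variants` are invented names, and only `<select>` dropdowns and checkbox inputs are recognised. Driving the browser to apply each state is deliberately left out.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from itertools import product
from typing import Optional


@dataclass
class Control:
    kind: str  # "select" or "checkbox"
    name: str
    options: list[str] = field(default_factory=list)


class ControlFinder(HTMLParser):
    """Collects dropdowns (with option labels) and checkbox toggles."""

    def __init__(self) -> None:
        super().__init__()
        self.controls: list[Control] = []
        self._select: Optional[Control] = None
        self._in_option = False

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        name = attr.get("name") or attr.get("id") or ""
        if tag == "select":
            self._select = Control("select", name)
            self.controls.append(self._select)
        elif tag == "option" and self._select is not None:
            self._in_option = True
            self._select.options.append("")  # label text arrives in handle_data
        elif tag == "input" and attr.get("type") == "checkbox":
            self.controls.append(Control("checkbox", name, ["on", "off"]))

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None
        elif tag == "option":
            self._in_option = False

    def handle_data(self, data):
        if self._in_option and self._select is not None:
            self._select.options[-1] += data.strip()


def find_controls(html: str) -> list[Control]:
    finder = ControlFinder()
    finder.feed(html)
    return finder.controls


def variants(controls: list[Control]) -> list[dict[str, str]]:
    """Every combination of control states that a browser would need to visit."""
    names = [c.name for c in controls]
    return [dict(zip(names, combo)) for combo in product(*(c.options for c in controls))]
```

Each dictionary returned by `variants` describes one page state to capture. A browser automation layer would apply it, wait for the page to settle, and hand the resulting HTML to the extractor.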

Secure by default

SSRF hardening (per-redirect-hop checks and connected-IP pinning against DNS rebinding) plus bring-your-own provider keys stored Fernet-encrypted.
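
A rough sketch of the per-hop check and IP pinning, using only the standard library. It illustrates the general technique rather than ScrapeGPT's code: `UnsafeURL`, `is_public`, `system_resolver`, `pin_target`, and `pin_each_hop` are hypothetical names, and the exact set of blocked ranges is an assumption. The key idea is to resolve the hostname once, reject the URL if any resolved address is non-public, and then connect to the validated IP itself, so a second DNS lookup cannot swap in an internal address.

```python
import ipaddress
import socket
from typing import Callable
from urllib.parse import urlsplit


class UnsafeURL(ValueError):
    """Raised when a URL must not be fetched."""


def is_public(ip: str) -> bool:
    """True only for addresses outside private, loopback, and similar ranges."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return False  # unparseable address: treat as unsafe
    return not (
        addr.is_private
        or addr.is_loopback
        or addr.is_link_local
        or addr.is_multicast
        or addr.is_reserved
        or addr.is_unspecified
    )


def system_resolver(host: str) -> list[str]:
    """Resolve a hostname to IP strings via the OS resolver."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def pin_target(url: str, resolve: Callable[[str], list[str]] = system_resolver) -> str:
    """Validate one URL and return the IP the connection should be pinned to."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise UnsafeURL(f"unsupported URL: {url!r}")
    ips = resolve(parts.hostname)
    # Reject if ANY answer is internal, so a mixed DNS response cannot slip through.
    if not ips or not all(is_public(ip) for ip in ips):
        raise UnsafeURL(f"{parts.hostname!r} resolves to a non-public address")
    return ips[0]


def pin_each_hop(
    hops: list[str], resolve: Callable[[str], list[str]] = system_resolver
) -> list[str]:
    """Re-run the check for every redirect hop, not just the first URL."""
    return [pin_target(url, resolve) for url in hops]
```

A real fetcher would disable automatic redirects, call `pin_target` on each `Location` it receives, and open the socket to the returned IP while still sending the original `Host` header. That wiring depends on the HTTP client and is not shown here.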

Outcomes

Turns any page into clean, typed CSV / JSON / XLSX through a guided analyze → review → extract flow.
Degrades gracefully instead of returning nothing when the AI's selectors aren't perfect.
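
As an illustration of what "clean, typed" output might involve, the sketch below coerces scraped strings into numbers and serialises the records to CSV and JSON with the standard library. The helper names (`coerce`, `typed`, `to_csv`, `to_json`) and the coercion rules are assumptions, not the app's actual export code, and XLSX is omitted because it needs a third-party library.

```python
import csv
import io
import json
import re

# Plain decimal numbers, optionally signed, with optional thousands separators.
NUMBER = re.compile(r"[+-]?\d[\d,]*(\.\d+)?")


def coerce(value: str):
    """Turn a scraped string into None, int, float, or a trimmed string."""
    text = value.strip()
    if not text:
        return None
    if NUMBER.fullmatch(text):
        digits = text.replace(",", "")
        return float(digits) if "." in digits else int(digits)
    return text


def typed(record: dict[str, str]) -> dict:
    """Apply coerce() to every field of one record."""
    return {key: coerce(value) for key, value in record.items()}


def to_csv(records: list[dict]) -> str:
    """Serialise records to CSV text; columns follow first-seen field order."""
    fieldnames = list(dict.fromkeys(key for record in records for key in record))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)  # None values and missing fields become empty cells
    return buffer.getvalue()


def to_json(records: list[dict]) -> str:
    return json.dumps(records, indent=2)
```

Coercing before export means a price like "1,299" lands in the file as a number rather than a string, which is the difference between a spreadsheet that sorts correctly and one that does not.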