ScrapeGraphAI Proxy Setup with Mobile IPs (2026)
A complete, hands-on guide to pairing ScrapeGraphAI with Coronium mobile proxies for production-grade natural-language web scraping. Covers SmartScraper, Search Scraper, Markdownify, self-hosting the Python library, BYOP integration with the hosted Cloud API, and a direct head-to-head with Firecrawl and Crawl4AI.
Quick Facts: ScrapeGraphAI in 2026
Open-source Python library (pip install scrapegraphai) + hosted Cloud API at scrapegraphai.com
What is ScrapeGraphAI?
ScrapeGraphAI is a Python-first, LLM-native web scraping framework that replaces brittle CSS selectors and XPath with natural-language prompts. You tell it what you want in English, not where it lives in the DOM, and an LLM extracts structured JSON for you. Under the hood, ScrapeGraphAI models each scrape as a directed graph of nodes (fetch, parse, prompt, validate) so the pipeline is observable, composable, and resilient to layout drift.
Open-Source Library
pip install scrapegraphai ships the full graph engine, node library, and LLM adapters. Apache 2.0 licensed, 18K+ GitHub stars, active monthly releases.
- Full control over fetcher (Playwright, HTTPX, ChromeDriver)
- Pay only for your own LLM tokens + proxy egress
- Can run fully offline with Ollama
Hosted Cloud API
scrapegraphai.com offers a REST API with 50 free credits, Starter plan from $17-19/month, and 15% off annual. Dashboard includes live test/preview, run history, and webhook delivery.
- No server infrastructure to maintain
- LLM tokens bundled into credit price
- BYOP (bring your own proxy) parameter supported
How a ScrapeGraphAI pipeline works (high level)
The genius of the graph design is that each node is swappable: you can drop in a stealth browser fetcher, a custom parser, or a different LLM without rewriting the pipeline. For Coronium users, this means configuring the proxy once at the fetch node and letting SmartScraper, Search Scraper, or Markdownify route every request through your mobile endpoint.
SmartScraper Explained
SmartScraper is ScrapeGraphAI's flagship single-page extraction endpoint. Give it one URL and a natural-language prompt describing what you want, and it returns structured JSON. At 10 credits per call (about $0.04/page on the Starter plan), it's the workhorse for product catalogs, article extraction, profile scraping, and anything where you know the URL and want structured data back.
SmartScraper: minimal Python example (OSS)
from scrapegraphai.graphs import SmartScraperGraph
graph_config = {
"llm": {
"api_key": "sk-...", # your OpenAI key
"model": "openai/gpt-4o-mini",
},
"verbose": True,
"headless": True,
"loader_kwargs": {
"proxy": {
"server": "http://us.coronium.io:10001",
"username": "your_user",
"password": "your_pass",
}
},
}
smart_scraper_graph = SmartScraperGraph(
prompt="Extract all product names, prices in USD, and star ratings.",
source="https://example-shop.com/category/laptops",
config=graph_config,
)
result = smart_scraper_graph.run()
print(result)
# {"products": [{"name": "...", "price": 1299.00, "rating": 4.5}, ...]}
SmartScraper: Cloud API call (BYOP with Coronium)
import requests
resp = requests.post(
"https://api.scrapegraphai.com/v1/smartscraper",
headers={
"SGAI-APIKEY": "sgai-...", # your ScrapeGraphAI key
"Content-Type": "application/json",
},
json={
"website_url": "https://example-shop.com/category/laptops",
"user_prompt": "Extract all product names, prices in USD, and star ratings.",
"proxy": "http://your_user:your_pass@us.coronium.io:10001",
# optional Pydantic-style schema
"output_schema": {
"type": "object",
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"},
"rating": {"type": "number"},
},
},
}
},
},
},
timeout=120,
)
print(resp.json())
Search Scraper: Multi-Source Aggregation
Search Scraper is where ScrapeGraphAI gets genuinely interesting. Instead of one URL, you give it a question. It issues a web search, fetches the top results in parallel, extracts structured data from each, and returns an aggregated JSON object with source attribution per field. At 30 credits per query it's three times the cost of SmartScraper but does the work of multiple pipelines.
Search Scraper: competitive-intelligence example
from scrapegraphai.graphs import SearchGraph
graph_config = {
"llm": {
"api_key": "sk-ant-...",
"model": "anthropic/claude-sonnet-4-5",
},
"max_results": 5,
"loader_kwargs": {
"proxy": {
"server": "http://us.coronium.io:10002", # sticky 5-min
"username": "your_user",
"password": "your_pass",
}
},
}
search_graph = SearchGraph(
prompt=(
"Compare pricing and key features of the top 5 AI web scraping "
"frameworks as of 2026. Return name, pricing_usd_per_month, "
"free_tier, primary_language, and one_line_summary."
),
config=graph_config,
)
result = search_graph.run()
# {"frameworks": [{"name": "...", "pricing_usd_per_month": ..., ...}, ...],
#  "sources": ["https://...", "https://..."]}
When to reach for Search Scraper
- Market research where you don't know the URLs up front
- Competitive pricing sweeps across 5-10 competitors
- News aggregation on a breaking topic
- RAG seed data: collect citations for a grounded answer
- Brand monitoring across forums and review sites
When to stick with SmartScraper
- Scheduled scrapes of known URLs (catalogs, listings)
- Scrapes where you already have the target link
- Cost-sensitive jobs (3x cheaper at 10 credits)
- Authenticated scrapes that need specific sessions
- Single-source ground-truth extractions
Proxy note: Search Scraper fans out to multiple domains in parallel, so rotating mobile IPs work well: each sub-request gets a fresh, high-trust IP. If a target site needs several requests to load its full content (dynamic pagination, async content loading), switch that particular run to a sticky session so cookies stay coherent across requests.
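That rotating-vs-sticky decision can be captured in a small helper. This is a sketch: the `coronium_proxy` name is ours, and the default ports mirror the examples in this guide (10001 rotating, 10002 sticky); substitute your own assigned endpoints.

```python
import os

def coronium_proxy(sticky: bool = False) -> dict:
    """Return a Playwright-style proxy dict for loader_kwargs."""
    host = os.getenv("CORONIUM_HOST", "us.coronium.io")
    # Rotating port for fan-out runs, sticky port for multi-request flows.
    port = os.getenv("CORONIUM_STICKY_PORT", "10002") if sticky \
        else os.getenv("CORONIUM_PORT", "10001")
    return {
        "server": f"http://{host}:{port}",
        "username": os.getenv("CORONIUM_USER", ""),
        "password": os.getenv("CORONIUM_PASS", ""),
    }

# Rotating for Search Scraper fan-out, sticky for paginated single-site runs:
rotating = coronium_proxy(sticky=False)
paginated = coronium_proxy(sticky=True)
```

Pass the returned dict as `loader_kwargs["proxy"]` in any of the graph configs shown in this guide.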
Markdownify: URLs to Clean Markdown
Markdownify is the RAG-friendly endpoint: feed it a URL, get back clean Markdown with headings, lists, and links preserved, ready to embed into a vector store. It strips nav, ads, cookie banners, and script noise and returns just the content. Cheaper per call than SmartScraper because there's no LLM extraction step on the ScrapeGraphAI side - just fetch, clean, convert.
Markdownify: building a RAG corpus
import requests
import time
urls = [
"https://docs.example.com/getting-started",
"https://docs.example.com/api-reference",
"https://docs.example.com/guides/deployment",
]
corpus = []
for url in urls:
r = requests.post(
"https://api.scrapegraphai.com/v1/markdownify",
headers={"SGAI-APIKEY": "sgai-...", "Content-Type": "application/json"},
json={
"website_url": url,
"proxy": "http://user:pass@us.coronium.io:10001",
},
timeout=60,
)
corpus.append({"url": url, "markdown": r.json()["markdown"]})
time.sleep(1) # gentle rate limit
# Now feed corpus to your embedding model
# e.g. OpenAI text-embedding-3-large, Voyage voyage-3, etc.
What Markdownify keeps
- Hierarchical headings (H1-H6)
- Ordered and unordered lists
- Links with anchor text preserved
- Tables converted to GFM Markdown tables
- Code blocks with language hints when available
What Markdownify strips
- Navigation, header, footer boilerplate
- Script, style, and iframe tags
- Cookie banners and consent overlays
- Ad slots and tracking pixels
- Social-share widgets and related-content rails
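Before the cleaned Markdown goes into an embedding model, it usually needs chunking. A minimal heading-based splitter is sketched below; the `chunk_markdown` helper is illustrative, and production pipelines often prefer token-aware splitters from their embedding vendor.

```python
import re

def chunk_markdown(md: str, max_chars: int = 2000) -> list[str]:
    """Split Markdown on H1-H3 headings, then cap chunk size."""
    # Zero-width lookahead keeps each heading attached to its body.
    sections = re.split(r"(?m)^(?=#{1,3} )", md)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fall back to paragraph boundaries for oversized sections.
        while len(section) > max_chars:
            cut = section.rfind("\n\n", 0, max_chars)
            cut = cut if cut > 0 else max_chars
            chunks.append(section[:cut].strip())
            section = section[cut:].strip()
        if section:
            chunks.append(section)
    return chunks

doc = "# Intro\nSome text.\n\n## Setup\nMore text."
print(chunk_markdown(doc))
# → ['# Intro\nSome text.', '## Setup\nMore text.']
```

Because Markdownify preserves the heading hierarchy (see the list above), heading boundaries are a reasonable first approximation of semantic boundaries.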
Self-Hosting with Coronium Proxies
Running the OSS Python library gives you maximum control: your own Playwright browser, your own LLM keys, your own proxies, and no per-credit cloud pricing. Below is the full setup for a production self-hosted pipeline routing every fetch through Coronium mobile IPs.
Step 1: Install and set up
# Python 3.10+ recommended
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install scrapegraphai playwright python-dotenv pydantic
# Playwright browser binaries (headless Chromium)
playwright install chromium
Note: On Debian/Ubuntu hosts you may also need playwright install-deps to pull native libraries.
Step 2: Configure secrets
# .env
OPENAI_API_KEY=sk-...
CORONIUM_HOST=us.coronium.io
CORONIUM_PORT=10001 # rotating
CORONIUM_STICKY_PORT=10002 # sticky 5 min
CORONIUM_USER=your_user
CORONIUM_PASS=your_pass
Step 3: Production-grade SmartScraper with schema
import os
from typing import List, Optional
from pydantic import BaseModel, Field
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph
load_dotenv()
class Product(BaseModel):
name: str
price_usd: float
rating: Optional[float] = Field(None, ge=0, le=5)
in_stock: bool = True
class ProductList(BaseModel):
products: List[Product]
def build_graph(sticky: bool = False) -> SmartScraperGraph:
port = os.getenv("CORONIUM_STICKY_PORT") if sticky else os.getenv("CORONIUM_PORT")
return SmartScraperGraph(
prompt="Extract every product with its USD price, rating (0-5), and in-stock flag.",
source="https://example-shop.com/category/laptops",
schema=ProductList,
config={
"llm": {
"api_key": os.getenv("OPENAI_API_KEY"),
"model": "openai/gpt-4o-mini",
"temperature": 0.0,
},
"verbose": False,
"headless": True,
"loader_kwargs": {
"proxy": {
"server": f"http://{os.getenv('CORONIUM_HOST')}:{port}",
"username": os.getenv("CORONIUM_USER"),
"password": os.getenv("CORONIUM_PASS"),
}
},
},
)
if __name__ == "__main__":
graph = build_graph(sticky=False)
result = graph.run()
print(result)
Operational tips for self-hosted ScrapeGraphAI
1. Always pass a schema. Pydantic validation catches hallucinated fields before they poison downstream pipelines. Add a retry-on-validation-error loop at the call site.
2. Cache parsed Markdown. LLM tokens are the expensive part. Hash (url, date) and skip the LLM call if you've already extracted today.
3. Use sticky ports for flows. Any scrape with more than one HTTP request to the same domain should use a sticky session (5-10 minutes) to keep the IP coherent.
4. Run headless with a realistic UA. Playwright's default UA flags as automation. Override it with a current Chrome/Safari mobile UA that matches the 4G IP's carrier region.
5. Log graph output JSON at every node. When extraction fails, the bug is almost always at the parse node (bad chunking) or the prompt node (ambiguous prompt), not the LLM.
Cloud API Integration with BYOP
If you'd rather not manage Playwright, LLM keys, and retry logic yourself, the ScrapeGraphAI Cloud API wraps everything behind a single REST endpoint. Credits cover the LLM tokens and the fetch. The BYOP (Bring Your Own Proxy) parameter lets you swap the default proxy for your Coronium endpoint so you keep control of IP trust and sticky sessions while offloading orchestration.
REST request structure
POST https://api.scrapegraphai.com/v1/smartscraper
Headers:
SGAI-APIKEY: sgai-xxxxxxxxxxxxxxxxxxxx
Content-Type: application/json
Body:
{
"website_url": "https://example.com/product/123",
"user_prompt": "Extract title, price, SKU, stock status, image URLs",
"proxy": "http://USER:PASS@us.coronium.io:10002",
"output_schema": { ... optional JSON Schema ... },
"render_heavy_js": true,
"total_timeout": 90
}
Node.js client with BYOP and retry
// npm i axios zod
import axios from "axios";
import { z } from "zod";
const Product = z.object({
title: z.string(),
price_usd: z.number(),
sku: z.string().nullable(),
in_stock: z.boolean(),
});
async function smartScrape(url: string, retries = 3) {
const proxy = process.env.CORONIUM_PROXY_URL!; // http://u:p@us.coronium.io:10002
for (let attempt = 1; attempt <= retries; attempt++) {
try {
const { data } = await axios.post(
"https://api.scrapegraphai.com/v1/smartscraper",
{
website_url: url,
user_prompt: "Extract title, price in USD, SKU, and stock boolean.",
proxy,
render_heavy_js: true,
total_timeout: 120,
},
{
headers: {
"SGAI-APIKEY": process.env.SGAI_API_KEY!,
"Content-Type": "application/json",
},
timeout: 125_000,
}
);
return Product.parse(data.result);
} catch (err: any) {
if (attempt === retries) throw err;
await new Promise(r => setTimeout(r, 2000 * attempt));
}
}
}
Cloud API pricing at scale (2026)
| Plan | Monthly credits | SmartScraper pages | Price | Notes |
|---|---|---|---|---|
| Free | 50 | ~5 | $0 | Evaluate + prototype |
| Starter | ~5,000 | ~500 | $17-19/mo | Small production workloads |
| Growth | Higher | Several thousand | See dashboard | Team + scheduled jobs |
| Annual | Any plan | Same | -15% | Best $/credit for committed workloads |
SmartScraper = 10 credits, Search Scraper = 30 credits. Annual billing cuts 15%. Exact monthly credit allotments and higher-tier prices are set on the ScrapeGraphAI pricing page; always confirm live numbers before building a unit-economics model.
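A quick back-of-envelope on those numbers, sketched below. The credit allotment and midpoint price come from the table above; confirm live figures on the pricing page before relying on them.

```python
def cost_per_page(monthly_price_usd: float, monthly_credits: int,
                  credits_per_call: int = 10) -> tuple[int, float]:
    """Return (calls per month, effective $ per call) for a plan."""
    calls = monthly_credits // credits_per_call
    return calls, monthly_price_usd / calls

# Starter plan, midpoint price: ~5,000 credits for ~$18/mo.
pages, per_page = cost_per_page(18.0, 5000)        # SmartScraper, 10 cr
queries, per_q = cost_per_page(18.0, 5000, 30)     # Search Scraper, 30 cr
print(pages, round(per_page, 3))   # → 500 0.036 (matches ~$0.04/page)
print(queries, round(per_q, 3))    # → 166 0.108
```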
LLM Model Selection: GPT-4o vs Claude vs Gemini vs Ollama
ScrapeGraphAI is model-agnostic by design. The right choice depends on three axes: accuracy on your target pages, cost per 1M tokens, and data residency (sometimes you legally cannot send content to a third-party cloud).
| Model | Best for | Accuracy | Cost (relative) | Context | On-prem? |
|---|---|---|---|---|---|
| GPT-4o | Balanced default, most site types | Excellent | $$$ | 128K | No |
| GPT-4o-mini | High-volume, cost-sensitive SmartScraper | Good | $ | 128K | No |
| Claude Sonnet 4.5+ | Long dense pages, legal/medical copy | Excellent | $$$ | 200K+ | No |
| Claude Haiku | Fast+cheap SmartScraper runs | Good | $ | 200K | No |
| Gemini 2.5 Pro/Flash | Search Scraper aggregation, multimodal | Very good | $-$$ | 1M | No |
| Llama 3.3 70B (Ollama) | On-prem, privacy-sensitive extraction | OK-Good | HW only | 128K | Yes |
| Qwen 2.5 / Mistral (Ollama) | Lighter local runs on consumer GPUs | OK | HW only | 32K-128K | Yes |
Swapping models in ScrapeGraphAI
# OpenAI
{"model": "openai/gpt-4o-mini", "api_key": os.getenv("OPENAI_API_KEY")}
# Anthropic
{"model": "anthropic/claude-sonnet-4-5", "api_key": os.getenv("ANTHROPIC_API_KEY")}
# Google Gemini
{"model": "google_genai/gemini-2.5-flash", "api_key": os.getenv("GEMINI_API_KEY")}
# Local Ollama (Llama 3.3 70B)
{
"model": "ollama/llama3.3",
"model_tokens": 128000,
"base_url": "http://localhost:11434",
}
A pragmatic model-selection heuristic
- Start with GPT-4o-mini or Gemini Flash for 80% of jobs. Cheap, fast, accurate enough.
- Escalate to GPT-4o or Claude Sonnet when validation failures exceed 3-5% of runs.
- Pick Claude Sonnet for long pages (news longforms, legal docs, SEC filings) where the 200K+ context matters.
- Pick Gemini Flash for Search Scraper when you're aggregating 5-10 pages per query and cost dominates.
- Pick Ollama only when legal/privacy requirements forbid sending content to a third party.
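The escalate-on-failure rule above can be wired up as a tiny controller. This is a sketch: the `ModelEscalator` class is ours, the model names follow this guide's config snippets, and the 4% threshold sits inside the article's 3-5% band.

```python
class ModelEscalator:
    """Track schema-validation failures; escalate past a threshold."""

    LADDER = ["openai/gpt-4o-mini", "openai/gpt-4o"]

    def __init__(self, threshold: float = 0.04, window: int = 100):
        self.threshold = threshold
        self.window = window
        self.results: list[bool] = []  # True = validation passed
        self.level = 0

    def record(self, passed: bool) -> str:
        """Log one run's outcome; return the model to use next."""
        self.results.append(passed)
        recent = self.results[-self.window:]
        fail_rate = recent.count(False) / len(recent)
        if fail_rate > self.threshold and self.level < len(self.LADDER) - 1:
            self.level += 1
            self.results.clear()  # fresh window for the new model
        return self.LADDER[self.level]

esc = ModelEscalator()
for _ in range(95):
    esc.record(True)
model = esc.record(False)   # 1/96 fails ≈ 1% → stay on gpt-4o-mini
for _ in range(5):
    model = esc.record(False)  # failures push the rate past 4%
print(model)  # → openai/gpt-4o
```

The same ladder extends naturally to Claude or Gemini tiers; only the `LADDER` list changes.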
ScrapeGraphAI vs Firecrawl vs Crawl4AI
All three tools sit in the modern "AI-native scraping" category and all three play well with Coronium mobile proxies. They differ in surface area, pricing model, and default workflow.
| Dimension | ScrapeGraphAI | Firecrawl | Crawl4AI |
|---|---|---|---|
| Primary metaphor | Natural-language graph extraction | Crawl + convert to Markdown | Async Python crawler with LLM strategies |
| License | Apache 2.0 OSS + Cloud | AGPL OSS + Cloud | Apache 2.0 fully OSS |
| Hosted API | Yes (scrapegraphai.com) | Yes (firecrawl.dev) | No (self-host only) |
| Free tier | 50 credits | 500 credits | Unlimited (self-host) |
| Entry paid plan | $17-19/mo | ~$16-19/mo | N/A |
| Single-page extraction | SmartScraper (10 cr) | /scrape + extract | arun + LLM strategy |
| Multi-source search | Search Scraper (30 cr) | /search | Manual w/ SearxNG |
| URL to Markdown | Markdownify | Native output | Built-in |
| Full-site crawler | Basic | Excellent (core feature) | Excellent |
| LLM-flexible | OpenAI, Anthropic, Gemini, Ollama | Multiple via extract endpoint | Any LangChain LLM |
| BYOP (custom proxy) | Yes (proxy param) | Yes (proxy config) | Full control |
| Natural-language UX | Best-in-class | Prompt on /extract | Strategy-based |
Pick ScrapeGraphAI when
- You want the cleanest natural-language prompt interface
- Multi-source Search Scraper with source attribution matters
- You want the choice between OSS and Cloud without a rewrite
Pick Firecrawl when
- You need to crawl entire documentation sites into Markdown
- Building a RAG knowledge base as a primary goal
- The generous free tier (500 cr) helps you prototype
Pick Crawl4AI when
- You must self-host with no third-party cloud
- You need deep programmatic control over crawler strategies
- Cost discipline (hardware-only) beats developer velocity
Whether you pick ScrapeGraphAI, Firecrawl, or Crawl4AI, the IP layer is the same problem: your fetcher needs to look like a real user. Coronium's 4G/5G mobile pools, CGNAT shared IPs, automatic rotation and sticky sessions plug into all three via their respective proxy parameters. You can even A/B two tools against the same Coronium endpoint to see which gives better extraction quality on your specific target domains.
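That A/B test can be as simple as running both extractors over the same URL list through the same Coronium endpoint and comparing field coverage. A sketch, with hypothetical stand-ins for your two tool clients:

```python
from typing import Callable

def field_coverage(result: dict, required: set[str]) -> float:
    """Fraction of required fields present and non-null."""
    return sum(1 for f in required if result.get(f) is not None) / len(required)

def ab_test(urls: list[str], tool_a: Callable, tool_b: Callable,
            required: set[str]) -> dict[str, float]:
    """Average field coverage per tool over the same URL set."""
    scores = {"a": 0.0, "b": 0.0}
    for url in urls:
        scores["a"] += field_coverage(tool_a(url), required)
        scores["b"] += field_coverage(tool_b(url), required)
    return {k: v / len(urls) for k, v in scores.items()}

# Hypothetical clients; both would route through the same proxy in practice.
tool_a = lambda url: {"title": "x", "price": 9.99, "sku": None}
tool_b = lambda url: {"title": "x", "price": 9.99, "sku": "A1"}
print(ab_test(["https://example.com/p/1"], tool_a, tool_b,
              {"title", "price", "sku"}))
```

Holding the IP layer constant is what makes the comparison fair: any coverage gap is then attributable to the extractor, not the proxy.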
Power Your ScrapeGraphAI Pipelines with Mobile Proxies
CGNAT-shared 4G and 5G IPs raise trust scores at Cloudflare, DataDome, and PerimeterX, keeping SmartScraper, Search Scraper, and Markdownify delivering structured JSON, not 403s.