Overview

Web Scraping is built for content pipelines, research tools, and agents that need the substance of a web page without the surrounding noise. Submit a URL and receive the main article content as clean Markdown, raw or rendered HTML, extracted links, page images, or a screenshot. Request one format or several in a single call.

For multi-page needs, the async batch endpoint accepts up to 100 URLs at once and returns a job ID you poll for results. The async crawl endpoint explores an entire site from a seed URL with configurable depth and path filters. The synchronous map endpoint rapidly discovers all URLs on a domain using robots.txt, sitemaps, and HTML link extraction, which is useful for seeding crawls or auditing site structure.
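The batch flow described above (split the URL list, submit, then poll the job ID) can be sketched as follows. This is a minimal sketch, not the service's client: the `status` field and its terminal values "completed" and "failed" are assumptions, since the source documents only the poll paths, and `fetch_status` is a hypothetical callable that performs the actual GET on `/api/v1/batch/scrape/:id`.

```python
import time
from typing import Callable, Iterator

MAX_BATCH_SIZE = 100  # documented upper bound for the `urls` array


def chunk_urls(urls: list, size: int = MAX_BATCH_SIZE) -> Iterator[list]:
    """Yield slices of at most `size` URLs, one slice per batch job."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def poll_job(fetch_status: Callable[[], dict],
             interval_s: float = 2.0,
             max_attempts: int = 30,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch_status` until the job reports a terminal state.

    ASSUMPTION: the poll response has a `status` field whose terminal
    values are "completed" and "failed"; the source does not specify this.
    """
    for _ in range(max_attempts):
        job = fetch_status()
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval_s)
    raise TimeoutError("job did not finish within the polling budget")
```

Passing `sleep` and `fetch_status` in as arguments keeps the loop testable without network access; a real caller would wrap an HTTP GET in `fetch_status`.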

Every request is protected by an SSRF policy that blocks private networks, link-local addresses, and cloud-metadata endpoints. The shared browser pool imposes a capacity limit; requests exceeding it receive a 429 with a retry hint.
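To illustrate the kind of targets the SSRF policy rejects, here is a client-side pre-check built on Python's standard `ipaddress` module. It is only a sketch under stated assumptions: the helper name `looks_blocked` is hypothetical, the server's actual rules are not documented beyond the three categories above, and this check inspects IP literals only, so a DNS name that resolves to a private address would pass it.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_blocked(url: str) -> bool:
    """Return True for URLs the documented SSRF policy would plausibly reject."""
    parts = urlsplit(url)
    # Only public http/https URLs are accepted by the API.
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return True
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # A DNS name: only the server can judge what it resolves to.
        return False
    # Private ranges, plus link-local (which covers the common
    # cloud-metadata address 169.254.169.254).
    return ip.is_private or ip.is_link_local
```

A check like this only saves a round trip; the server-side policy remains the authority and will still answer with a 400 for anything it blocks.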

API URL

POST /api/v1/scrape (scrape a single URL, as in the example request)
GET /api/v1/batch/scrape/:id (poll batch job)
GET /api/v1/crawl/:id (poll crawl job)
POST /health
GET /metrics

Headers

Content-Type: application/json

Body Parameters

| Param | Required | Type | Description |
| --- | --- | --- | --- |
| url | required | string | Public http/https URL to scrape, crawl, or map (SSRF-guarded) |
| urls | required | array | Array of public URLs for async batch scrape (1-100 items) |
| formats | optional | array | Output formats: markdown, html, rawHtml, links, images, screenshot. Default ["markdown"] |
| format | optional | string | Legacy single-format alias: json, markdown, html. json maps to markdown |
| waitUntil | optional | string | Puppeteer navigation wait event: load, domcontentloaded, networkidle0, networkidle2. Default domcontentloaded |
| waitFor | optional | integer | Extra wait in ms after navigation (0-10000). Default 0 |
| onlyMainContent | optional | boolean | Apply Mozilla Readability to isolate main article content. Default true |
| includeTags | optional | array | CSS selectors to keep exclusively in extracted content |
| excludeTags | optional | array | CSS selectors to remove before extraction |
| headers | optional | object | Custom HTTP headers forwarded with the browser request |
| mobile | optional | boolean | Emulate mobile viewport. Default false |
| blockAds | optional | boolean | Block ad-related network requests. Default true |
| maxAge | optional | integer | Redis cache TTL in ms for this result (0 = bypass). Default 0 |
| timeout | optional | integer | Browser operation timeout in ms (1000-300000). Default 35000. Scrape only |
| actions | optional | array | Ordered browser actions run before extraction (scrape only, max 50): wait, click, write, press, scroll, screenshot, execute-javascript |
| limit | optional | integer | Max pages for crawl (1-10000, default 100) or max URLs for map (1-100000, default 5000) |
| maxDepth | optional | integer | Max BFS discovery depth for crawl. Omit for unlimited |
| includePaths | optional | array | Regex patterns URL paths must match (crawl and map) |
| excludePaths | optional | array | Regex patterns URL paths must not match (crawl and map) |
| allowSubdomains | optional | boolean | Follow subdomain links during crawl. Default false |
| allowExternalLinks | optional | boolean | Follow external domain links during crawl. Default false |
| scrapeOptions | optional | object | Per-page scrape settings for crawl: formats, waitUntil, onlyMainContent, blockAds |
| search | optional | string | Keyword to filter and relevance-sort map results by URL, title, or description |
| includeMetadata | optional | boolean | Enrich map link entries with title and description. Default false |
| skip | optional | integer | Pagination offset for job result arrays (batch and crawl poll). Default 0 |
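Several parameters above carry documented ranges and enumerations, so a caller can reject bad values before spending a request on a 400. The following is a minimal sketch of such a check for a scrape body; the function names `build_scrape_body` and `_check_range` are hypothetical, and only a handful of the documented parameters are covered.

```python
# Allowed values copied from the parameter descriptions above.
ALLOWED_FORMATS = {"markdown", "html", "rawHtml", "links", "images", "screenshot"}
ALLOWED_WAIT_UNTIL = {"load", "domcontentloaded", "networkidle0", "networkidle2"}


def _check_range(name, value, low, high):
    """Raise ValueError when `value` falls outside the inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")


def build_scrape_body(url, formats=("markdown",), wait_until="domcontentloaded",
                      wait_for=0, timeout=35000):
    """Return a scrape request body, rejecting values outside documented limits."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    if wait_until not in ALLOWED_WAIT_UNTIL:
        raise ValueError(f"unsupported waitUntil: {wait_until}")
    _check_range("waitFor", wait_for, 0, 10000)      # ms after navigation
    _check_range("timeout", timeout, 1000, 300000)   # ms, scrape only
    return {
        "url": url,
        "formats": list(formats),
        "waitUntil": wait_until,
        "waitFor": wait_for,
        "timeout": timeout,
    }
```

The server still performs its own validation; this only mirrors the limits listed in the table so that mistakes surface locally.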

Example Request

POST /api/v1/scrape {"url":"https://example.com/article","formats":["markdown"]}
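The request above could be issued from Python with only the standard library, roughly as sketched here. The host in `BASE_URL` is a placeholder, not a real endpoint, and no authentication header is set because the source does not name one; a deployment behind an API-key gateway would need to add it.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # hypothetical host, replace with your own


def make_scrape_request(url, formats=("markdown",), base_url=BASE_URL):
    """Build (but do not send) the POST /api/v1/scrape request."""
    body = json.dumps({"url": url, "formats": list(formats)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape(url, formats=("markdown",), timeout_s=40.0):
    """Send the request and return the decoded JSON response body."""
    request = make_scrape_request(url, formats)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

The client-side `timeout_s` is set slightly above the documented 35 s scrape route timeout so that the server's own 504 arrives before the socket gives up.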

Successful Response

Status: 200 OK

Body

{"success":true,"url":"https://example.com/article","title":"Article Title","text":"Plain text content...","excerpt":"Summary...","byline":null,"siteName":"Example","length":1200,"content":"<article>...</article>","markdown":"# Article Title\n\nPlain text content...","metadata":{"timingMs":3500,"cached":false},"correlationId":"req_abc123"}
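A consumer usually needs only a few of these fields. The sketch below decodes a response shaped like the example into a small dataclass; the `ScrapeResult` type and `parse_scrape_response` function are hypothetical, and the field names are taken from the sample body rather than from a documented schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapeResult:
    url: str
    title: Optional[str]
    markdown: Optional[str]
    cached: bool
    timing_ms: Optional[int]
    correlation_id: Optional[str]


def parse_scrape_response(raw: str) -> ScrapeResult:
    """Decode a scrape response, keeping the fields shown in the example body."""
    payload = json.loads(raw)
    if not payload.get("success"):
        raise ValueError("scrape response did not report success")
    meta = payload.get("metadata") or {}
    return ScrapeResult(
        url=payload["url"],
        title=payload.get("title"),
        markdown=payload.get("markdown"),
        cached=bool(meta.get("cached", False)),
        timing_ms=meta.get("timingMs"),
        correlation_id=payload.get("correlationId"),
    )
```

Keeping `correlationId` around is useful when reporting a failed or odd result, since it identifies the request on the server side.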

Field

Type

No output fields documented.

Error Response

| Status | Error | Cause | Message |
| --- | --- | --- | --- |
| 400 | ValidationError | Joi validation failure, SSRF-blocked URL, or invalid URL | Request rejected before navigation |
| 401 | AuthenticationError | Missing or invalid API key at gateway layer | Authentication required |
| 404 | NotFound | Unknown endpoint path or async job ID not found | Resource not found |
| 422 | ExtractionError | Page loaded but content not extractable, or scrape failed after navigation | Extraction failed after navigation |
| 429 | RateLimitError | Rate limit exceeded or browser pool saturated (POOL_SATURATED) | Retry after retryAfter seconds |
| 500 | InternalError | Unexpected failure (message in development only) | Internal server error |
| 504 | GatewayTimeout | Route-level timeout exceeded (35s scrape, 60s map) | Operation timed out |
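Of the errors listed, only the 429 is explicitly retryable, and it comes with a wait hint. One way to honor that hint is sketched below. It assumes the hint arrives as a `retryAfter` number of seconds in the JSON error body (the source names the value but not where it is carried), and `send` is a hypothetical callable that performs the request and returns a `(status, body)` pair.

```python
import time


def call_with_retry(send, max_attempts=4, default_wait_s=1.0, sleep=time.sleep):
    """Repeat `send` while it returns 429, waiting `retryAfter` seconds each time.

    ASSUMPTION: `retryAfter` is a field of the JSON error body.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # Return on anything other than 429, or when the budget is spent.
        if status != 429 or attempt == max_attempts:
            return status, body
        sleep(float(body.get("retryAfter", default_wait_s)))
```

Other statuses are returned to the caller unchanged, since retrying a 400 or 422 with the same input would be expected to fail the same way.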

Last Updated: May 4, 2026
