Basic scraping with Firecrawl
To scrape a single page and get clean markdown content, you can use the /scrape endpoint.
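A minimal sketch (fc-YOUR_API_KEY and the target URL are placeholders):

```bash
curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"]
  }'
```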
Scraping PDFs
Firecrawl supports PDFs. Use the parsers option (e.g., parsers: ["pdf"]) when you want to ensure PDF parsing.
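For example (a sketch; swap in a real PDF URL and your key):

```bash
curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/whitepaper.pdf",
    "parsers": ["pdf"]
  }'
```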
Scrape options
When using the /scrape endpoint, you can customize scraping with the options below.
Formats (formats)
- Type: array
- Strings: ["markdown", "links", "html", "rawHtml", "summary", "images"]
- Object formats:
  - JSON: { type: "json", prompt, schema }
  - Screenshot: { type: "screenshot", fullPage?, quality?, viewport? }
  - Change tracking: { type: "changeTracking", modes?, prompt?, schema?, tag? } (requires markdown)
- Default: ["markdown"]
Full page content vs main content (onlyMainContent)
- Type: boolean
- Description: By default the scraper returns only the main content. Set to false to return the full page content.
- Default: true
Include tags (includeTags)
- Type: array
- Description: HTML tags/classes/ids to include in the scrape.
Exclude tags (excludeTags)
- Type: array
- Description: HTML tags/classes/ids to exclude from the scrape.
Wait for page readiness (waitFor)
- Type: integer
- Description: Milliseconds of extra wait time before scraping (use sparingly). This waiting time is in addition to Firecrawl’s smart wait feature.
- Default: 0
Freshness and cache (maxAge)
- Type: integer (milliseconds)
- Description: If a cached version of the page is newer than maxAge, Firecrawl returns it instantly; otherwise it scrapes fresh and updates the cache. Set 0 to always fetch fresh.
- Default: 172800000 (2 days)
Request timeout (timeout)
- Type: integer
- Description: Max duration in milliseconds before aborting.
- Default: 30000 (30 seconds)
PDF parsing (parsers)
- Type: array
- Description: Control parsing behavior. To parse PDFs, set parsers: ["pdf"].
- Cost: PDF parsing costs 1 credit per PDF page. To skip PDF parsing and receive the file as base64 (1 credit flat), set parsers: [].
- Limit pages: To limit PDF parsing to a specific number of pages, use parsers: [{"type": "pdf", "maxPages": 10}].
Actions (actions)
When using the /scrape endpoint, Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
- Type: array
- Description: Sequence of browser steps to run before scraping.
- Supported actions:
  - wait - Wait for page to load: { type: "wait", milliseconds: number } or { type: "wait", selector: string }
  - click - Click an element: { type: "click", selector: string, all?: boolean }
  - write - Type text into a field: { type: "write", text: string } (element must be focused first with click)
  - press - Press a keyboard key: { type: "press", key: string }
  - scroll - Scroll the page: { type: "scroll", direction: "up" | "down", selector?: string }
  - screenshot - Capture screenshot: { type: "screenshot", fullPage?: boolean, quality?: number, viewport?: { width: number, height: number } }
  - scrape - Scrape sub-element: { type: "scrape" }
  - executeJavascript - Run JS code: { type: "executeJavascript", script: string }
  - pdf - Generate PDF: { type: "pdf", format?: string, landscape?: boolean, scale?: number }
Action Execution Notes
- Write action: You must first focus the element using a click action before using write. The text is typed character by character to simulate keyboard input.
- Scroll selector: If you want to scroll a specific element instead of the whole page, provide the selector parameter to scroll.
- Wait with selector: You can wait for a specific element to be visible using wait with a selector parameter, or wait for a fixed duration using milliseconds.
- Actions are sequential: Actions are executed in order, and Firecrawl waits for page interactions to complete before moving to the next action.
Advanced Action Examples
Taking a screenshot:
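A sketch of a scrape request that waits briefly, then captures a full-page screenshot via actions (key and URL are placeholders):

```bash
curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "actions": [
      { "type": "wait", "milliseconds": 2000 },
      { "type": "screenshot", "fullPage": true, "quality": 80 }
    ]
  }'
```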
Example Usage
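A sketch of a single request combining several options (key and URL are placeholders); the bullets below describe what it does:

```bash
curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "rawHtml", "html", "links", { "type": "screenshot" }],
    "includeTags": ["h1", "p", "a", ".main-content"],
    "excludeTags": ["#ad", "#footer"],
    "onlyMainContent": false,
    "waitFor": 1000,
    "timeout": 15000,
    "parsers": ["pdf"]
  }'
```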
- Return the full page content as markdown.
- Include the markdown, raw HTML, HTML, links, and a screenshot in the response.
- Include only the HTML tags <h1>, <p>, <a>, and elements with the class .main-content, while excluding any elements with the IDs #ad and #footer.
- Wait for 1000 milliseconds (1 second) before scraping to allow the page to load.
- Set the maximum duration of the scrape request to 15000 milliseconds (15 seconds).
- Parse PDFs explicitly via parsers: ["pdf"].
JSON extraction via formats
Use the JSON format object in formats to extract structured data in one pass:
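A sketch (the prompt and schema are illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v2/scrape \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": [{
      "type": "json",
      "prompt": "Extract the product name and price.",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "price": { "type": "string" }
        },
        "required": ["name", "price"]
      }
    }]
  }'
```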
Extract endpoint
Use the dedicated extract job API when you want asynchronous extraction with status polling.
Crawling multiple pages
To crawl multiple pages, use the /v2/crawl endpoint.
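A sketch (key and URL are placeholders); the response includes a job id you can poll, as shown below:

```bash
curl -X POST https://api.firecrawl.dev/v2/crawl \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "limit": 100
  }'
```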
Check Crawl Job
Used to check the status of a crawl job and get its result.
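A sketch, where JOB_ID stands for the id returned when the crawl was started:

```bash
curl -X GET https://api.firecrawl.dev/v2/crawl/JOB_ID \
  -H "Authorization: Bearer fc-YOUR_API_KEY"
```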
Pagination/Next URL
If the content is larger than 10MB or if the crawl job is still running, the response may include a next parameter, a URL to the next page of results.
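When next is present, fetch it with the same authorization header until it no longer appears (a sketch; NEXT_URL stands for the returned value):

```bash
# NEXT_URL is the "next" value from the previous response
curl -X GET "$NEXT_URL" \
  -H "Authorization: Bearer fc-YOUR_API_KEY"
```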
Crawl prompt and params preview
You can provide a natural-language prompt to let Firecrawl derive crawl settings. Preview them first:
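A sketch, assuming a params-preview route under /v2/crawl (the prompt text is illustrative):

```bash
curl -X POST https://api.firecrawl.dev/v2/crawl/params-preview \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "prompt": "Crawl only the blog and docs sections"
  }'
```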
Crawler options
When using the /v2/crawl endpoint, you can customize the crawling behavior with:
includePaths
- Type: array
- Description: Regex patterns to include.
- Example: ["^/blog/.*$", "^/docs/.*$"]
excludePaths
- Type: array
- Description: Regex patterns to exclude.
- Example: ["^/admin/.*$", "^/private/.*$"]
maxDiscoveryDepth
- Type: integer
- Description: Max discovery depth for finding new URLs.
limit
- Type: integer
- Description: Max number of pages to crawl.
- Default: 10000
crawlEntireDomain
- Type: boolean
- Description: Explore across siblings/parents to cover the entire domain.
- Default: false
allowExternalLinks
- Type: boolean
- Description: Follow links to external domains.
- Default: false
allowSubdomains
- Type: boolean
- Description: Follow subdomains of the main domain.
- Default: false
delay
- Type: number
- Description: Delay in seconds between scrapes.
- Default: undefined
scrapeOptions
- Type: object
- Description: Options for the scraper (see Formats above).
- Example: { "formats": ["markdown", "links", {"type": "screenshot", "fullPage": true}], "includeTags": ["h1", "p", "a", ".main-content"], "excludeTags": ["#ad", "#footer"], "onlyMainContent": false, "waitFor": 1000, "timeout": 15000 }
- Defaults: formats: ["markdown"]; caching enabled by default (maxAge ~ 2 days)
Example Usage
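A sketch of a crawl combining the options above (key, URL, and patterns are placeholders):

```bash
curl -X POST https://api.firecrawl.dev/v2/crawl \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "includePaths": ["^/blog/.*$"],
    "excludePaths": ["^/admin/.*$"],
    "limit": 100,
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    }
  }'
```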
Mapping website links
The /v2/map endpoint identifies URLs related to a given website.
Usage
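A sketch (key and URL are placeholders; search and limit are optional and described below):

```bash
curl -X POST https://api.firecrawl.dev/v2/map \
  -H "Authorization: Bearer fc-YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "search": "docs",
    "limit": 100
  }'
```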
Map Options
search
- Type: string
- Description: Filter links containing text.
limit
- Type: integer
- Description: Maximum number of links to return.
- Default: 100
sitemap
- Type: "only" | "include" | "skip"
- Description: Control sitemap usage during mapping.
- Default: "include"
includeSubdomains
- Type: boolean
- Description: Include subdomains of the website.
- Default: true
Whitelisting Firecrawl
Allowing Firecrawl to scrape your website
If you want Firecrawl to scrape your own website and need to whitelist the crawler:
- User Agent: Firecrawl identifies itself with the user agent FirecrawlAgent. Allow this user agent string in your firewall or security rules.
- IP Addresses: Firecrawl does not use a fixed set of IP addresses for outbound scraping requests.
Allowing your application to call the Firecrawl API
If your firewall blocks outbound requests from your application to external services, whitelist 35.245.250.27 to allow calls to Firecrawl’s API.
