
Playwright Web Scraping Tutorial for 2025 (Node.js)

Learn how to use Playwright for web scraping in 2025. This guide covers installation, basic scraping, intercepting requests, and using proxies.

Web Scraping with Playwright in 2025

In 2025, robust tools like Playwright are more important than ever for developers and businesses alike. Whether you’re conducting market research, performing price comparisons, or simply gathering news articles, Playwright’s cross-browser capabilities and modern architecture make it an excellent option for both small-scale and enterprise-level scraping projects.

You’ll learn how to use Playwright in a Node.js environment, from installation to more advanced tasks like intercepting network requests and configuring proxies. You’ll also discover best practices for locating and extracting text, images, and other valuable data, plus tips on whether to choose Playwright over its primary competitors, Puppeteer and Selenium.

Introduction to Playwright

Playwright is an open-source library created by Microsoft, designed to automate web browsers—Chromium, Firefox, and WebKit (Safari)—using a single unified API. While it initially gained traction in the realm of automated testing (much like Puppeteer and Selenium before it), it has quickly become a top choice for web scraping and automation due to its speed, reliability, and flexibility.

Why Playwright for Scraping?

  • Multi-Browser Support: Supports Chromium, Firefox, and WebKit out of the box.
  • Consistent API: Write the same code for all supported browsers.
  • Powerful Network Controls: Intercept, block, or mock requests.
  • Robust Community: Backed by Microsoft with rapid development and community support.

Setting Up Node.js and Installing Playwright

Before diving into the code, we need a working environment that includes Node.js and Playwright. If you already have Node.js installed, feel free to skip ahead to the Playwright installation steps.

Step 1: Install or Update Node.js

Download and install Node.js from nodejs.org and verify the installation:

node -v

Step 2: Initialize Your Project

mkdir playwright-scraping-tutorial
cd playwright-scraping-tutorial
npm init -y

Step 3: Install Playwright

npm install playwright

Depending on your Playwright version and environment, the browser binaries may need to be installed separately:

npx playwright install

Basic Web Scraping with Playwright

Now that your environment is ready, let’s write our first web scraping script to fetch basic data from a webpage. Playwright’s commands follow a straightforward pattern: launch a browser, open a page, navigate to a URL, interact with the page, and close the browser.

Create a file basic-scraping.js:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const pageTitle = await page.title();
  console.log('Page Title:', pageTitle);
  await browser.close();
})();

Run with:

node basic-scraping.js

You should see the page title printed in your console. This simple demonstration might not seem like much, but it establishes the foundation for more complex scraping tasks. In real-world scenarios, you may want to:

  • Extract text content from specific elements (e.g., h1, p, or .class-name).
  • Follow links or simulate button clicks to navigate multi-page flows.
  • Capture screenshots or PDF snapshots of specific pages for offline analysis.

Each of these use cases builds upon the same pattern we see above: open a page, wait for content, extract or interact, and then close the browser.

Locating Elements: CSS vs. XPath

Once you’ve navigated to a webpage, you need to locate the specific elements containing the data you want. Playwright offers a locator API that accepts both CSS selectors and XPath expressions, among other strategies (text-based locators, etc.).

CSS Example:

CSS selectors are typically more intuitive to use. They’re also widely utilized in front-end web development, so if you’re familiar with CSS, you can quickly target elements by class, id, or other attributes.

// .first() avoids a strict-mode error when the selector matches several elements
const paragraph = await page.locator('p').first().textContent();

XPath Example:

XPath is a query language originally designed for XML documents but also applicable to HTML. While it can be more powerful for intricate document structures, it’s typically less user-friendly compared to CSS selectors. Nonetheless, many developers still prefer XPath for advanced or legacy web scraping tasks.

const headline = await page.locator('//h1').textContent();

Best Practice: Whenever possible, use CSS selectors for simpler, more maintainable code. Switch to XPath for extremely specific or dynamic cases where CSS selectors do not suffice.

Scraping Text with Playwright

Scraping textual data—such as product names, blog posts, or article headlines—is likely the most common task. Playwright makes this straightforward:

  • Use a locator (CSS or XPath) to target the element.
  • Use either .textContent() or .innerText() to extract the text.

Create a file text-scraping.js:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  const headline = await page.locator('h1').textContent();
  // .first() avoids a strict-mode error when multiple <p> elements match
  const firstParagraph = await page.locator('p').first().textContent();

  console.log('Headline:', headline);
  console.log('First Paragraph:', firstParagraph);

  await browser.close();
})();

Wait for Dynamic Content

Modern websites often load data asynchronously. If the text you need is not immediately available (e.g., it’s fetched via an AJAX call), you may need to wait for it to appear:

await page.waitForSelector('.dynamic-text');
const dynamicText = await page.locator('.dynamic-text').textContent();

Scraping Images with Playwright

Images can be particularly valuable for e-commerce research, data analysis, or competitor monitoring. Scraping images typically involves extracting the src attribute of an <img> tag, though some websites might store images as background images in CSS or even use data URIs.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');

  // .first() avoids a strict-mode error when the page has several images
  const firstImageSrc = await page.locator('img').first().getAttribute('src');
  console.log('First Image URL:', firstImageSrc);

  const allImages = await page.locator('img').elementHandles();
  for (let i = 0; i < allImages.length; i++) {
    const src = await allImages[i].getAttribute('src');
    console.log(`Image ${i} URL:`, src);
  }

  await browser.close();
})();

Downloading Images

To actually download images (rather than just collecting their URLs), you can use Playwright’s built-in interception (covered later) or a dedicated HTTP request library like Axios or node-fetch. Simply pass the image URL to the HTTP client and save the returned data as a file.

Intercepting Requests with Playwright

Intercepting network requests is crucial in advanced scraping scenarios. It enables you to:

  • Monitor all requests made by the page.
  • Block specific requests (e.g., ads, analytics scripts) for faster scraping.
  • Modify or mock responses (useful in testing or specialized scenarios).
  • Analyze API calls to find direct endpoints that might be easier to scrape than the rendered page.

The following script simply logs every request and response the page makes:

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  page.on('request', request => {
    console.log('Request URL:', request.url());
  });

  page.on('response', response => {
    console.log('Response URL:', response.url(), 'Status:', response.status());
  });

  await page.goto('https://example.com');
  await browser.close();
})();

Blocking Unwanted Requests

For websites heavy with images, advertisements, or large style files, you may want to block certain resource types to speed up your data collection. With Playwright, you can intercept requests before they’re made:

await page.route('**/*', (route) => {
  const request = route.request();
  const resourceType = request.resourceType();
  if (['image', 'stylesheet', 'font'].includes(resourceType)) {
    route.abort();
  } else {
    route.continue();
  }
});

Extracting JSON from APIs

Often, modern sites fetch data from JSON-based APIs. If you can identify these API calls, you might skip HTML parsing entirely and scrape data directly from the source:

page.on('response', async (response) => {
  if (response.request().url().includes('/api/products')) {
    const jsonData = await response.json();
    console.log('API Data:', jsonData);
  }
});

Using Proxies with Playwright

Proxies are essential in any serious scraping project, especially when dealing with high-volume or region-specific data. A proxy acts as an intermediary between your scraping script and the target website, masking your real IP address and potentially rotating through different IPs to prevent blocking.

const { chromium } = require('playwright');

(async () => {
  const browser = await chromium.launch({
    headless: true,
    proxy: {
      server: 'http://your-proxy-server.com:8000',
      username: 'user123',
      password: 'pass123',
    }
  });

  const page = await browser.newPage();
  await page.goto('https://whatismyipaddress.com/');
  // Note: '.your-ip' is a placeholder selector; adjust it to the target page's markup.
  const ipInfo = await page.locator('.your-ip').textContent();
  console.log('Detected IP Address:', ipInfo);

  await browser.close();
})();

Selecting the Right Proxy Provider

  • Residential Proxies: Tend to be more reputable to websites (less likely to be blacklisted) but are often slower and more expensive.
  • Datacenter Proxies: Faster and cheaper, but more prone to blocks.
  • Rotating/Sticky Sessions: Choose rotating proxies for wide-scale scraping; choose sticky sessions for tasks where session continuity is essential (like logging into a site).

If you need high-quality proxy solutions, you can explore services like ScrapingForge.com. Such services often come with specialized infrastructure designed for web scraping, including rotating IPs, geo-targeting, and built-in CAPTCHA handling.

Playwright vs. Puppeteer vs. Selenium

| Feature | Playwright | Puppeteer | Selenium |
| --- | --- | --- | --- |
| Browser Support | Chromium, Firefox, WebKit | Chromium only | All major browsers |
| Modern APIs | Yes | Yes | Partial (WebDriver-based) |
| Community | Strong (Microsoft) | Good (Google) | Mature, legacy |
| Best For | Modern scraping, testing | Chromium automation | Enterprise legacy |

Conclusion

Playwright offers a fast, reliable, and cross-browser solution for web scraping. By mastering the techniques in this guide, you can build powerful and resilient scraping tools for modern websites.

Next Steps

  • Optimize for Scale: Schedule jobs, store in databases.
  • Add Resilience: Handle errors, rotate proxies.
  • Data Cleaning: Use libraries like cheerio or lodash.