Web Scraping Steam Store with JavaScript and Node.js

1. Introduction — Web Scraping with JavaScript and Node.js in 2025
If you build scrapers for a living, you already know the pattern: you start with a quick script, it works for five minutes, then Steam (or any modern site) changes something and your code implodes at 2 AM.
The goal of this post isn’t another “Hello world” scraping demo.
We’re going to build a realistic, reproducible Steam Store scraper in plain JavaScript — the kind of tool you could drop into a job queue or analytics pipeline tomorrow.
Why Node.js
Node.js still dominates real‑time data collection for a reason:
- Native async I/O lets you keep thousands of requests open without threads.
- npm’s ecosystem (e.g. cheerio, undici, playwright) covers 95% of what you need.
- Same language on backend and scraper means less context switching.
Python remains great for quick notebooks. But when you want a scraper that runs forever, Node.js gives you event‑loop control, streams, and observability hooks that scale.
Why Steam
The Steam Store is the perfect benchmark site: public pages, predictable markup, light dynamic loading, and data everyone actually cares about — game names, prices, discounts, and genres.
We’ll scrape a few categories (Action, RPG, and Free‑to‑Play), handle pagination, normalize the data into JSON, and keep it polite with rate limits.
By the end, you’ll have a clean, maintainable scraper that reflects the ScrapingForge philosophy:
build once, debug rarely, scale easily.
2. Setting Up Your Node.js Web Scraping Project
We’ll keep things minimal but organized — structure first, hacks later.
Create the project
mkdir steam-scraper && cd steam-scraper
npm init -y
npm install node-fetch cheerio dotenv
We’re sticking with native ES modules, so set "type": "module" in package.json.
{
"name": "steam",
"version": "1.0.0",
"main": "index.js",
"scripts": {
"test": "echo \"Error: no test specified\" && exit 1"
},
"author": "",
"license": "ISC",
"keywords": [],
"description": "",
"dependencies": {
"cheerio": "^1.1.2",
"dotenv": "^17.2.3",
"node-fetch": "^3.3.2"
}
}
Folder layout
steam-scraper/
├── src/
│ ├── index.js # entry point
│ ├── lib/
│ │ └── helpers.js # delay, logging, normalization
│ └── targets/
│ └── steam.js # main scraping logic
├── data/
│ └── output/
├── .env
└── package.json
Environment configuration
Add your .env file to define basic runtime variables:
CATEGORY_URL=https://store.steampowered.com/genre/Action/
REQUEST_DELAY=2000
MAX_PAGES=3
Load it in your code:
import 'dotenv/config'
const CATEGORY_URL = process.env.CATEGORY_URL
const REQUEST_DELAY = Number(process.env.REQUEST_DELAY || 2000)
Minimal utility
Let’s create a tiny delay helper (we’ll need it later for rate limiting):
// src/lib/helpers.js
export const sleep = ms => new Promise(res => setTimeout(res, ms))
Test the baseline
Before you touch Cheerio, confirm your fetch works:
// src/index.js
import "dotenv/config"
import fetch from "node-fetch"
import { sleep } from "./lib/helpers.js"
const url = process.env.CATEGORY_URL
async function main() {
console.log(`Fetching ${url}`)
const res = await fetch(url)
console.log(`Status: ${res.status}`)
await sleep(1000)
}
main()
Run it:
node src/index.js
If you see Status: 200, you’re ready to parse HTML in the next step.
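Optionally, you can confirm Cheerio is wired up as well. A small addition to the baseline script (nothing Steam-specific yet; it just reads the page title and counts links on whatever HTML came back):
// at the top of src/index.js:
import { load } from "cheerio"

// inside main(), after `const res = await fetch(url)`:
const html = await res.text()
const $ = load(html)
console.log(`Page title: ${$("title").text()}`)
console.log(`Links found: ${$("a").length}`)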
3. Understanding the Steam Store Structure (and Why “/category/action” Doesn’t Work)
If you open https://store.steampowered.com/category/action, you’ll notice something strange: there’s no visible pagination in the URL. Scroll down, and new games just appear. That’s because this page is a dynamic “content hub” — the data loads asynchronously through internal API calls, not through static links like ?page=2.
At first glance, that’s bad news for us scrapers…
But Steam also provides a hidden gem: its search endpoint.
🔍 The Real Endpoint for Reliable Data
Under the hood, Steam powers most category listings through:
https://store.steampowered.com/search/results/
This endpoint supports parameters like:
- start – offset index (0, 50, 100, …)
- count – number of items per slice (max ≈ 50)
- tags – category or genre ID (e.g., 19 = Action)
- force_infinite=1 – returns a JSON payload instead of full HTML
A real request looks like this:
https://store.steampowered.com/search/results/?start=0&count=50&tags=19&force_infinite=1&l=english&cc=US
The response includes two keys:
{
"results_html": "<a class='search_result_row' ...> ... </a>",
"total_count": 4872
}
So instead of scraping the dynamically rendered /category page, we’ll call this endpoint directly — it’s cleaner, faster, and gives us built-in pagination metadata.
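Before wiring it into the project, you can sanity-check the endpoint with a few lines (assuming Node 18+ so the global fetch is available; tag 19 = Action, as above):
// probe.mjs — quick check of the search/results endpoint (run: node probe.mjs)
const url =
  "https://store.steampowered.com/search/results/?start=0&count=50&tags=19&force_infinite=1&l=english&cc=US"

const res = await fetch(url, {
  headers: { Accept: "application/json,text/javascript,*/*;q=0.9" }
})
const payload = await res.json()

console.log("total_count:", payload.total_count)
console.log("results_html length:", payload.results_html.length)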
🧩 Why This Is the “ScrapingForge Way”
We’re not trying to hack around JavaScript rendering when there’s a clean, JSON-backed alternative.
The ScrapingForge mindset is:
“Find the layer that machines actually use, not what browsers paint.”
Steam’s search/results endpoint is that layer. It’s structured, efficient, and consistent — perfect for automation.
⚙️ What We’ll Build
We’ll create a Node 22 scraper that:
- Hits the search/results endpoint with proper query params
- Parses results_html with Cheerio
- Automatically paginates until it hits the total count or an empty page
- Saves everything as JSON (and later CSV, if we want)
4. Building the Node 22 Steam Scraper (with Auto-Stop Pagination)
This version is lean, modern, and production-proof:
- Uses Node 22’s native fetch (no node-fetch)
- Handles pagination via start/count
- Stops automatically when no new results appear
- Works in ESM mode ("type": "module")
🧱 Project Structure Recap
steam-scraper/
├── src/
│ ├── index.js
│ ├── lib/helpers.js
│ └── targets/steam_search.js
├── data/output/
├── .env
└── package.json
⚙️ .env
STEAM_TAG_ID=19 # Action
COUNT_PER_PAGE=50 # items per slice
MAX_PAGES=20 # safety cap
MAX_ITEMS=0 # 0 = unlimited
MAX_EMPTY_PAGES=1 # stop after N empty slices
REQUEST_DELAY=2000 # ms between slices
LOCALE=en
COUNTRY=US
OUT_DIR=./data/output
USER_AGENT=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36
⚙️ src/lib/helpers.js
export const sleep = (ms) => new Promise((res) => setTimeout(res, ms))
export function ensureDirSync(fs, dir) {
if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true })
}
export function normText(s) {
return (s ?? "").replace(/\s+/g, " ").trim()
}
⚙️ src/targets/steam_search.js
import { load } from "cheerio"
import { sleep, normText } from "../lib/helpers.js"
const TAG_ID = process.env.STEAM_TAG_ID
const PER_PAGE = Number(process.env.COUNT_PER_PAGE || 50)
const MAX_PAGES = Number(process.env.MAX_PAGES || 2)
const MAX_ITEMS = Number(process.env.MAX_ITEMS || 0)
const MAX_EMPTY = Number(process.env.MAX_EMPTY_PAGES || 1)
const LOCALE = process.env.LOCALE || "en"
const CC = process.env.COUNTRY || "US"
const UA = process.env.USER_AGENT ||
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
function buildSearchUrl(start, tagId = TAG_ID) {
const u = new URL("https://store.steampowered.com/search/results/")
u.searchParams.set("start", String(start))
u.searchParams.set("count", String(PER_PAGE))
u.searchParams.set("force_infinite", "1")
u.searchParams.set("infinite", "1")
u.searchParams.set("dynamic_data", "")
u.searchParams.set("sort_by", "_ASC")
u.searchParams.set("l", LOCALE)
u.searchParams.set("cc", CC)
if (tagId) u.searchParams.set("tags", String(tagId))
return u.toString()
}
async function fetchSlice(start, tagId) {
const url = buildSearchUrl(start, tagId)
const res = await fetch(url, {
headers: {
"User-Agent": UA,
"Accept": "application/json,text/javascript,*/*;q=0.9",
"Accept-Language": "en-US,en;q=0.9",
"Cache-Control": "no-cache",
"Pragma": "no-cache"
}
})
const contentType = res.headers.get("content-type") || ""
if (!contentType.includes("application/json")) {
const text = await res.text()
throw new Error(`Non-JSON response (${res.status}): ${text.slice(0, 100)}...`)
}
return res.json()
}
function parseResultsHtml(resultsHtml) {
const $ = load(resultsHtml)
const items = []
$(".search_result_row").each((_, el) => {
// IMPORTANT: re-check these selectors against the live response HTML from time to time; Steam's markup changes.
const $el = $(el)
const title = normText($el.find(".title").text())
const link = $el.attr("href") || ""
const discountPct = normText($el.find(".search_discount span").text()) || "0%"
const finalPrice = normText(
$el.find(".discounted, .search_price").text()
) || "N/A"
if (title) items.push({ title, price: finalPrice, discount: discountPct, link })
})
return items
}
// tagId defaults to the env value but can be passed explicitly (used for multi-tag runs later)
export async function scrapeByTag(tagId = TAG_ID) {
if (!tagId) throw new Error("No tag ID passed and STEAM_TAG_ID is not set")
let results = []
let start = 0
let empties = 0
for (let page = 1; page <= MAX_PAGES; page++) {
let payload
try {
payload = await fetchSlice(start, tagId)
} catch (err) {
console.error(`Fetch error at page ${page}: ${err.message}`)
console.log("Retrying after 3s...")
await sleep(3000)
continue // retry the same slice on the next iteration (start is not advanced)
}
const html = payload?.results_html || ""
const total = Number(payload?.total_count || 0)
const batch = parseResultsHtml(html)
console.log(`Page ${page}: start=${start}, got=${batch.length}, total=${total || "?"}`)
if (batch.length === 0) {
empties++
if (empties >= MAX_EMPTY) {
console.log(`Auto-stop: ${empties} empty page(s).`)
break
}
} else {
empties = 0
results = results.concat(batch)
}
if (MAX_ITEMS > 0 && results.length >= MAX_ITEMS) {
results = results.slice(0, MAX_ITEMS)
console.log(`Auto-stop: reached MAX_ITEMS=${MAX_ITEMS}.`)
break
}
start += PER_PAGE
if (total && start >= total) {
console.log(`Auto-stop: reached total_count=${total}.`)
break
}
await sleep(Number(process.env.REQUEST_DELAY || 1000))
}
return results
}
⚙️ src/index.js
import "dotenv/config"
import fs from "fs"
import path from "path"
import { fileURLToPath } from "url"
import { ensureDirSync } from "./lib/helpers.js"
import { scrapeByTag } from "./targets/steam_search.js"
const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const OUT_DIR = process.env.OUT_DIR || "./data/output"
const OUT_FILE = path.join(__dirname, "..", OUT_DIR, `steam_tag_${process.env.STEAM_TAG_ID}.json`)
async function main() {
ensureDirSync(fs, path.join(__dirname, "..", OUT_DIR))
const items = await scrapeByTag()
fs.writeFileSync(OUT_FILE, JSON.stringify(items, null, 2))
console.log(`✅ Saved ${items.length} records → ${OUT_FILE}`)
}
main().catch((e) => {
console.error("Fatal:", e)
process.exit(1)
})
🧪 Run It
node src/index.js
Example output (with MAX_PAGES=2 for a quick test run):
Page 1: start=0, got=50, total=4872
Page 2: start=50, got=50, total=4872
✅ Saved 100 records → ./data/output/steam_tag_19.json
💾 Example JSON Output
[
{
"title": "Deep Rock Galactic",
"price": "$14.99",
"discount": "-50%",
"link": "https://store.steampowered.com/app/548430/"
},
{
"title": "Cyberpunk 2077",
"price": "$29.99",
"discount": "-50%",
"link": "https://store.steampowered.com/app/1091500/"
}
]
5. Handling Dynamic Pages and JavaScript Rendering
The Steam Store doesn’t use a single template for all its pages. While the search endpoint works great for most categories, certain sections such as New & Trending, Specials, or Coming Soon rely heavily on client-side JavaScript to render data.
If you try to scrape those pages with simple HTTP requests, you’ll often end up with an empty HTML response or placeholders like <script>InitPage()</script> instead of real data.
Understanding When HTML Isn’t Enough
When inspecting a page in the browser’s Developer Tools, check the Network tab:
- If you only see .js and .json requests loading after the initial page, the content is being rendered dynamically.
- If Ctrl+F for a game title returns nothing in the raw HTML, it’s definitely JavaScript-driven.
In such cases, you need a way to execute the page’s JavaScript and capture the resulting HTML — something Cheerio alone can’t do. This is where headless browsers like Playwright or Puppeteer come in.
Using Playwright for Dynamic Scraping
Playwright is ideal for modern web scraping because it supports Chromium, Firefox, and WebKit, with automatic waiting for network idle states and page rendering.
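Playwright isn’t part of the dependency list from Section 2, so install it first (the second command fetches the Chromium binary in case the package install didn’t already):
npm install playwright
npx playwright install chromium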
import { chromium } from "playwright"
async function scrapeDynamic(url) {
const browser = await chromium.launch({ headless: true })
const page = await browser.newPage()
await page.goto(url, { waitUntil: "networkidle" })
const games = await page.$$eval(".tab_item_name", els =>
els.map(e => e.textContent.trim())
)
await browser.close()
return games
}
const data = await scrapeDynamic("https://store.steampowered.com/explore/new/")
console.log(data)
This approach works for pages where Steam renders content after load. It’s slower than the JSON scraping method, but it’s bulletproof for smaller, dynamically loaded sections.
If you’re building at scale, you can also use ScrapingForge’s built-in browser rendering API, which provides the same capability via API calls — no local browsers to manage, no Playwright setup.
6. Cleaning and Structuring the Scraped Data
After scraping multiple pages or categories, your data will likely contain inconsistencies — mixed currencies, varying whitespace, missing prices, or discounts formatted differently. A small normalization layer ensures your output is consistent and easy to use.
Common Data Issues
Typical problems you’ll find in raw Steam data:
- Extra whitespace or newline characters in titles
- Prices like “Was $59.99 Now $29.99”
- Empty strings for games with no discounts
- Currency symbols depending on region ($, €, £)
To make analysis easier, clean and structure everything into predictable fields.
Normalizing Price and Discount Fields
Create a helper to clean each record:
// src/lib/normalize.js
export function normalizeGame(game) {
const priceValue = parseFloat(game.price.replace(/[^0-9.]/g, ""))
const discountValue = parseInt(game.discount.replace(/[^0-9-]/g, "")) || 0
return {
...game,
price_usd: isNaN(priceValue) ? null : priceValue,
discount_percent: discountValue,
}
}
You can then apply this normalization step after scraping:
import fs from "fs"
import { normalizeGame } from "../lib/normalize.js"

// make sure data/cleaned/ exists before writing (e.g., with ensureDirSync)
const normalized = items.map(normalizeGame)
fs.writeFileSync("data/cleaned/steam_games_clean.json", JSON.stringify(normalized, null, 2))
This gives you structured JSON that’s easy to query or convert to other formats.
Enriching Data with Additional Fields
You can also extract and compute:
- Discount range categories (e.g., “Small”, “Medium”, “Big Deal”), as sketched after the example below
- Derived fields like is_free or on_sale
- A timestamp to track when the data was collected
Example:
export function enrichGame(game) {
return {
...game,
on_sale: game.discount_percent < 0,
scraped_at: new Date().toISOString(),
}
}
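For the discount range categories mentioned above, a minimal sketch (the thresholds and labels are arbitrary choices, not Steam’s):
// Bucket a discount percentage (e.g. -50) into a coarse category.
// Thresholds are illustrative; tune them to your analysis.
export function discountBucket(discountPercent) {
  const off = Math.abs(discountPercent)
  if (off === 0) return "None"
  if (off < 25) return "Small"
  if (off < 60) return "Medium"
  return "Big Deal"
}

// Usage: { ...game, deal_size: discountBucket(game.discount_percent) }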
Structured and enriched data isn’t just cleaner — it’s more valuable for downstream systems or dashboards.
7. Exporting Results and Using the Data
JSON is great for developers, but most users prefer working with data in spreadsheets or analytics tools. Exporting your results to CSV or SQLite makes it easier to filter, sort, and visualize game data.
Exporting to CSV
Install a converter like json2csv:
npm install json2csv
Then use it in your script:
// src/lib/export.js
import { Parser } from "json2csv"
import fs from "fs"
export function saveAsCSV(data, path) {
const parser = new Parser()
const csv = parser.parse(data)
fs.writeFileSync(path, csv)
console.log(`✅ Saved CSV: ${path}`)
}
Usage:
import { saveAsCSV } from "../lib/export.js"
saveAsCSV(normalized, "data/output/steam_games.csv")
You can then open the resulting file in Excel, Google Sheets, or import it into tools like Tableau or Metabase.
Integrating with Other Systems
Once exported, your data can easily feed:
- Dashboards for price trends and discounts
- Game recommendation bots
- Marketing automation systems tracking top-sellers
With minimal tweaks, you can even stream the JSON results to an API endpoint or store them in MongoDB or PostgreSQL.
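As a sketch, here’s what pushing the normalized records to an ingest endpoint could look like (the URL and INGEST_TOKEN variable are placeholders, not a real service):
// Push scraped records to a downstream API (hypothetical endpoint and token).
export async function pushToApi(records) {
  const res = await fetch("https://example.com/api/steam-games", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.INGEST_TOKEN || ""}`
    },
    body: JSON.stringify(records)
  })
  if (!res.ok) throw new Error(`Ingest failed with status ${res.status}`)
}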
8. Scraping Multiple Categories Automatically
Once your single-category scraper is stable, expanding to multiple categories is straightforward.
Each Steam category (Action, RPG, Indie, Simulation, etc.) has its own tag ID.
Instead of running the script manually for each, you can automate the process in a loop.
Example: Multi-Tag Scraper
You can define all target tag IDs in your .env file or directly in your script:
STEAM_TAG_IDS=19,122,492,597
Then update your scraper entry point to loop over the tags and pass each one to scrapeByTag (no need to mutate process.env; the function falls back to STEAM_TAG_ID when called without an argument):
import "dotenv/config"
import fs from "fs"
import path from "path"
import { fileURLToPath } from "url"
import { ensureDirSync } from "./lib/helpers.js"
import { scrapeByTag } from "./targets/steam_search.js"
const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const OUT_DIR = process.env.OUT_DIR || "./data/output"
ensureDirSync(fs, path.join(__dirname, "..", OUT_DIR))
const tags = (process.env.STEAM_TAG_IDS || "").split(",").filter(Boolean)
for (const tag of tags) {
  const tagId = tag.trim()
  console.log(`\n--- Scraping tag: ${tagId} ---`)
  const items = await scrapeByTag(tagId)
  const outPath = path.join(__dirname, "..", OUT_DIR, `steam_tag_${tagId}.json`)
  fs.writeFileSync(outPath, JSON.stringify(items, null, 2))
  console.log(`✅ Saved ${items.length} records for tag ${tagId}`)
}
This setup can handle multiple genres in one run and store each as a separate dataset.
Expanding the Usefulness
With multiple JSON files, you can:
- Compare genres side by side
- Aggregate price ranges or discount averages
- Build datasets for machine learning or recommendation systems
This turns your scraper from a one-off script into a small data pipeline.
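For example, a short script can read the per-tag files produced above and compare average discounts (file names follow the steam_tag_<id>.json pattern from the loop):
// aggregate.js — average discount per tag, using the per-tag JSON output
import fs from "fs"

const tags = ["19", "122", "492", "597"]

for (const tag of tags) {
  const games = JSON.parse(fs.readFileSync(`./data/output/steam_tag_${tag}.json`, "utf8"))
  const onSale = games.filter(g => g.discount && g.discount !== "0%")
  const avg = onSale.reduce((sum, g) => sum + Math.abs(parseInt(g.discount, 10) || 0), 0) /
    (onSale.length || 1)
  console.log(`Tag ${tag}: ${games.length} games, avg discount on sale ≈ ${avg.toFixed(1)}%`)
}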
9. Error Handling and Rate Limiting
Real-world scraping doesn’t always go smoothly.
Pages change, servers rate-limit requests, and network timeouts can occur.
Building resilience into your scraper keeps it reliable and professional.
1. Detecting and Handling Non-JSON Responses
As seen earlier, Steam sometimes responds with HTML instead of JSON (for example, when it temporarily blocks automated requests).
Your scraper should handle that gracefully:
async function safeFetch(url) {
try {
const res = await fetch(url)
const type = res.headers.get("content-type") || ""
if (!type.includes("application/json")) {
const text = await res.text()
console.warn(`Non-JSON response (${res.status}), skipping slice: ${text.slice(0, 80)}...`)
return null
}
return await res.json()
} catch (e) {
console.error("Network error:", e.message)
return null
}
}
This prevents one failed page from crashing the entire job.
2. Rate Limiting and Retry Logic
Always include delays between requests.
A 1–2 second delay is enough to prevent throttling.
Add exponential backoff when consecutive errors occur:
async function fetchWithRetry(url) {
  let delay = 1000
  for (let attempt = 1; attempt <= 5; attempt++) {
    const data = await safeFetch(url)
    if (data) return data
    console.log(`Retrying in ${delay} ms...`)
    await sleep(delay)
    delay *= 2 // exponential backoff: 1s, 2s, 4s, 8s, 16s
  }
  return null // give up after 5 attempts
}
This kind of structured retry system helps maintain stability even when API limits change.
3. Logging and Debugging
For long-running scrapers, keep logs:
import fs from "fs"
function logMessage(message) {
const ts = new Date().toISOString()
fs.appendFileSync("scraper.log", `[${ts}] ${message}\n`)
}
Logs help trace failures, detect pattern changes, and debug silently failed scrapes.
10. Optimizing and Maintaining Your Scraper
A good scraper doesn’t just work once — it stays reliable as websites evolve.
Here’s how to keep your Steam scraper robust and high-performing.
Minimize Redundant Requests
Steam data doesn’t change every minute.
Use caching or If-Modified-Since headers to avoid unnecessary downloads:
const headers = {
"If-Modified-Since": new Date(Date.now() - 86400000).toUTCString(), // 1 day ago
"User-Agent": process.env.USER_AGENT
}
This reduces bandwidth and avoids flagging from frequent polling.
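A minimal sketch of acting on the response, assuming the endpoint honors conditional requests (a 304 means the cached copy is still current; CACHE_FILE is an arbitrary location, not part of the project above):
import fs from "fs"

const CACHE_FILE = "./data/output/last_page.html" // hypothetical cache location

async function fetchIfChanged(url, headers) {
  const res = await fetch(url, { headers })
  if (res.status === 304 && fs.existsSync(CACHE_FILE)) {
    console.log("Not modified, reusing cached copy")
    return fs.readFileSync(CACHE_FILE, "utf8")
  }
  const body = await res.text()
  fs.writeFileSync(CACHE_FILE, body)
  return body
}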
Handle Data Changes Gracefully
When new HTML structures appear, Cheerio selectors might break.
Build a small diagnostic step to detect when expected fields are missing:
if (!title || !price) {
console.warn("Incomplete record detected:", link)
}
This lets you adapt early instead of silently producing bad data.
Schedule Regular Runs
Once the scraper works well, automate it:
- Cron job (Linux/macOS): 0 */6 * * * node src/index.js >> scraper.log 2>&1
- PM2 for continuous jobs (see the ecosystem sketch below)
- A lightweight CI/CD runner for reproducible datasets
Routine scheduling ensures fresh data, useful for trend tracking or price alerts.
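If you go the PM2 route, a minimal ecosystem file sketch (worth verifying against your PM2 version; the .cjs extension keeps it CommonJS since the project uses "type": "module"):
// ecosystem.config.cjs — re-run the scraper every 6 hours under PM2
module.exports = {
  apps: [
    {
      name: "steam-scraper",
      script: "src/index.js",
      autorestart: false,          // the script exits when a run finishes
      cron_restart: "0 */6 * * *"  // PM2 restarts (re-runs) it on this schedule
    }
  ]
}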
Keep It Ethical and Maintainable
Even with a solid scraper, always respect:
- Robots.txt and Terms of Service
- Reasonable request rates
- Avoiding data misuse or personal information
Professional scrapers succeed long-term because they balance technical excellence with responsible use.
Wrapping Up
At this point, you’ve built a robust, production-grade Steam Store web scraper in Node.js that can:
- Handle static and dynamic pages
- Clean and normalize structured data
- Export to CSV or JSON for analytics
- Recover from errors and scale to multiple categories
You’ve also seen how a thoughtful scraping architecture — built on modular helpers, retries, and structured output — saves time and keeps your data pipelines maintainable.
This approach isn’t limited to Steam.
You can apply the same structure to scrape e-commerce sites, marketplaces, or product APIs safely and efficiently.
A well-engineered scraper is not about hacking websites — it’s about building resilient data pipelines that keep up with the web’s evolution.