Common HTTP Status Codes in Web Scraping & How to Handle Them

Table of Contents
- Why Status Codes Matter for Web Scraping
  - How HTTP responses gate your crawl budget, throughput, and data quality
  - Mapping failures to fixes (traffic shaping, backoff, proxies)
- Quick Primer: HTTP Status Code Families (1xx–5xx)
  - What client vs. server errors signal for scrapers
  - When to retry, when to slow down, when to switch IPs
- 429 Too Many Requests (Rate Limiting)
  - What 429 means and how Retry-After works
  - Tactics: exponential backoff, request pacing, rotating proxies, header tuning
  - Python patterns to honor cool-down windows
  - 📖 Detailed Guide: 429 Error: How to Handle Rate Limits When Scraping Websites
- 503 Service Unavailable (Temporary Server Issues)
  - Causes: maintenance, overload, transient upstream failures
  - Tactics: jittered retries, circuit breakers, fallback mirrors, smarter schedules
  - When to widen timeouts vs. reduce concurrency
  - 📖 Detailed Guide: 503 Error: Why Servers Block Scrapers and How to Avoid It
- 520 Unknown Error (Edge/Proxy Intermediary Issues)
  - Why error code 520 appears with reverse proxies/CDNs
  - Tactics: validate origin responses, trim headers, stabilize TLS, try alternate IP pools
  - Monitoring origin vs. edge to isolate faults
- Other High-Signal Errors for Scrapers
  - 499 Client Closed Request (non-standard; client aborted): diagnosing client-side timeouts, long-tail retries, streaming responses
  - 500/502/504: upstream instability, queue pressure, and graceful degradation plans
  - 📖 Detailed Guides:
- Proxy Strategy: Residential vs. Datacenter vs. Rotating Pools
  - Matching best proxy sites and pool types to targets and anti-bot posture
  - IP reputation, session pinning, and cost/performance tradeoffs
- Header, Session, and Fingerprint Hygiene
  - User-Agent rotation, cookie management, TLS/JA3 considerations
  - Avoiding patterns that trigger 403/429 throttles
- Python Implementation Playbook
  - Unified retry handler for 429/503/520 with exponential backoff + jitter
  - Respecting Retry-After, idempotent-safe retries, and fail-fast rules
  - Integrating rotating proxies and per-target rate limits
- Observability for Crawlers
  - Structured logs for status-code histograms and heatmaps
  - Alert thresholds for spikes in 429/503/520
  - Feedback loops to auto-tune concurrency and delay
- Sustainable, Site-Friendly Scraping
  - Crawl politeness policies that reduce bans and 4xx/5xx
  - Scheduling windows and cache-aware fetches
- Appendix: Status Code Reference & Test Checklist
  - Links to canonical references (MDN, Cloudflare docs)
  - Pre-launch tests for new targets (timeouts, headers, proxy health)
1) Why Status Codes Matter for Web Scraping
HTTP status codes are the control signals of every web scraping run. They decide whether a request brings back usable HTML/JSON, needs to be retried, or must be rerouted through rotating proxies. At scale, these codes directly influence crawl throughput, cost (proxy spend, compute), and the reliability of your datasets. A spike in 429 status code responses means the target is throttling you; a wave of 503 status code responses points to server-side unavailability; repeated error code 520 events suggest a CDN or reverse proxy is receiving an invalid or unexpected response from the origin.
Treat status codes as actionable telemetry rather than generic failures. Map each class of code to a mitigation: 2xx should flow into parsing; 3xx needs redirect handling; 4xx often signals client-side adjustments (headers, sessions, pacing); 5xx typically warrants cautious, jittered retries or alternate IP pools. When your pipeline reacts intelligently—respecting Retry-After, slowing request rates, or switching IPs—success rates climb and ban rates drop.
Effective observability turns status codes into an early-warning system. Dashboards that chart error rates by target domain, path, and hour of day reveal when to reschedule jobs, when to test different proxy types (residential vs. datacenter), and when to rotate user agents. If your web scraping python stack (requests/HTTPX/asyncio) emits structured logs, you can auto-tune concurrency from live feedback, preventing 429 bursts before they cascade into full blocks.
2) Quick Primer: HTTP Status Code Families (1xx–5xx)
HTTP groups responses into five families:
- 1xx Informational: Provisional responses; scrapers rarely act on these.
- 2xx Success: Payload is good; move to parsing and extraction.
- 3xx Redirection: Follow Location headers; watch for infinite loops and domain shifts.
- 4xx Client errors: Often your request profile (headers, cookies, pace) or permissions.
- 5xx Server errors: Target instability or intermediary failures; retry with care and jitter.
This taxonomy clarifies when to retry vs. redesign. Many 4xx codes are non-retryable without a change (e.g., fixing headers for 403 or pacing for 429 status code), while many 5xx codes are transient and benefit from backoff. MDN’s canonical references provide concise definitions and edge cases, useful when you’re tuning crawler behavior per target.
Two key protocol details guide scraper logic:
- Idempotency: Retrying GET is generally safe; retrying POST may not be, unless explicitly idempotent on the server. Build your retry policy with method awareness.
- Retry-After: When present (notably with 429 and 503), this header specifies how long to wait before retrying. Honor it exactly to preserve IP reputation and reduce bans.
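As a minimal illustration (the helper name and status set below are assumptions, not from any particular library), a method-aware retry predicate can gate every retry decision:

RETRYABLE_STATUS = {429, 500, 502, 503, 504, 520}   # transient codes worth retrying
IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}      # safe to repeat without side effects

def should_retry(method: str, status: int, explicitly_idempotent: bool = False) -> bool:
    """Retry only idempotent (or explicitly marked) requests, and only on transient statuses."""
    if status not in RETRYABLE_STATUS:
        return False
    return method.upper() in IDEMPOTENT_METHODS or explicitly_idempotent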
3) 429 Too Many Requests (Rate Limiting) — What It Means and How to Fix It

What it indicates
A 429 status code signals that the target has received too many requests from your client/IP within a window. It’s an explicit rate limit. The response may include Retry-After with either a delay in seconds or an HTTP date after which you may try again (see MDN 429).
Why it happens in scraping
- High concurrency from a single IP or narrow pool
- Aggressive request cadence (no pacing or backoff)
- Fingerprinting signals (static User-Agent, missing cookies, identical header order) that mark automation
- Crawling sensitive paths or APIs where business rules enforce strict quotas
Core tactics to reduce 429
- Honor Retry-After precisely: Parse the header and schedule the next attempt after the given time. Where it is missing, use conservative backoff defaults to avoid re-triggering limits.
- Exponential backoff with jitter: Replace linear sleeps with capped exponential backoff (e.g., base=1.5–2.0, max=60–120s) plus random jitter. This spreads retries and reduces thundering herds during partial outages.
- Rotating proxies and wider IP pools: Route requests through larger, reputationally diverse pools (residential or premium datacenter) and pin sessions for sites that dislike IP churn. Maintain health checks; demote IPs that attract repeated 429s.
- Request pacing and concurrency caps: Rate-limit per host and per path. Use token buckets or leaky buckets so bursts smooth into steady flow (see the pacing sketch after this list). Calibrate QPS to stay below each target’s informal threshold.
- Fingerprint hygiene: Rotate realistic User-Agent strings, accept-encoding, and languages; preserve cookie jars; vary header order where appropriate. For JS-heavy sites, render via Playwright/Selenium with human-like timings.
- Path-aware strategy: Space requests across distinct sections of a site; avoid hammering one endpoint or large list pages without delays or pagination awareness.
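The pacing sketch referenced above, assuming a simple per-host token bucket (rate and capacity are illustrative and should be calibrated per target):

import time, threading

class TokenBucket:
    """Simple token bucket: allows short bursts while capping the sustained request rate."""
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second (sustained QPS)
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        """Block until one token is available, then consume it."""
        while True:
            with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                needed = (1 - self.tokens) / self.rate
            time.sleep(needed)

# e.g., one bucket per host: ~2 requests/second sustained, bursts of up to 5
bucket = TokenBucket(rate=2.0, capacity=5)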
Python sketch: honoring Retry-After and backoff
Below is an outline you can adapt for python web scraping (requests or HTTPX), followed by a concrete httpx sketch. It catches 429 responses, extracts Retry-After, and applies jittered exponential backoff, leaving room to plug in rotating proxies.
- Send request with a per-host rate limiter.
- If 429 and Retry-After present → compute wait; sleep or reschedule the task.
- If 429 without the header → back off exponentially (cap + jitter).
- After N failures on a single IP → rotate proxy identity; cool down that IP.
- Record metrics: 429_count, average wait, success-after-retry ratio.
Community practices reinforce these approaches: sleep or reschedule on 429, respect Retry-After, and avoid “dodging” the limit by brute force.
When to escalate beyond backoff
- If 429 persists across multiple reputable IPs and realistic headers, lower your per-target concurrency and increase intervals.
- If a specific path is quota-protected (e.g., search endpoints), redesign the crawl: cache results, use incremental updates, or switch to server-side feeds if available.
- If your pool is too small, upgrade capacity or change mix (more residential for tougher sites).
Metrics that prove you’ve fixed it
- 429 rate < 1–2% of total requests across a job
- Mean time-to-success after a 429 < 2× base latency
- Declining number of retries per successful page
- Stable session durations for authenticated paths
Integrate these controls early, especially if you’re growing from curl to python scripts and on into full crawlers. Building the retry logic, proxy rotation, and header hygiene into a shared client library keeps behavior consistent across jobs.
A concrete httpx version of the outline above (note that httpx binds proxies to the Client, not to individual requests):

import httpx, random, time
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def _retry_after_seconds(resp) -> float | None:
    """Return the Retry-After delay in seconds, or None if absent or unparseable."""
    ra = resp.headers.get("Retry-After")
    if not ra:
        return None
    ra = ra.strip()
    if ra.isdigit():                                   # delay-seconds form
        return float(ra)
    try:                                               # HTTP-date form
        dt = parsedate_to_datetime(ra)
        return max(0.0, (dt - datetime.now(timezone.utc)).total_seconds())
    except Exception:
        return None

def get_with_429_handling(url: str, client: httpx.Client, max_retries: int = 5,
                          base: float = 1.6, cap: float = 60.0):
    tries = 0
    while True:
        r = client.get(url, timeout=20)
        if r.status_code != 429:
            r.raise_for_status()                       # fail fast on other 4xx/5xx
            return r
        # Honor Retry-After when present; otherwise use capped exponential backoff
        wait = _retry_after_seconds(r)
        if wait is None:
            wait = min(cap, base ** tries) + random.random()   # full jitter
        time.sleep(wait)
        tries += 1
        if tries > max_retries:
            raise RuntimeError(f"Exceeded retries due to 429 for {url}")
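A usage sketch under the assumption that proxies are configured on the client itself (newer httpx takes proxy=...; older versions use proxies=...); the proxy URL is a placeholder:

# One client per proxy identity; rotate by constructing a new client.
with httpx.Client(proxy="http://user:pass@proxy-host:8000", timeout=20) as client:
    page = get_with_429_handling("https://example.com/listing?page=1", client)
    print(page.status_code, len(page.text))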
4) 503 Service Unavailable — Causes, Diagnostics, and Robust Fixes

The sketch below pairs jittered 503 retries with a simple sliding-window circuit breaker; the causes and tactics it encodes are unpacked in the rest of this section.

import time, random, httpx
from collections import deque

class CircuitBreaker:
    """Open the circuit (pause a target) when the recent failure ratio gets too high."""
    def __init__(self, window=60, threshold=0.5, min_samples=10, cooldown=90):
        self.window = window              # seconds of history to keep
        self.threshold = threshold        # failure ratio that opens the circuit
        self.min_samples = min_samples
        self.cooldown = cooldown          # seconds to stay open
        self.events = deque()
        self.open_until = 0

    def record(self, success: bool):
        now = time.time()
        self.events.append((now, success))
        # drop events that fall outside the window
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()
        # open the circuit if the failure ratio is too high
        if len(self.events) >= self.min_samples:
            fail = sum(1 for _, s in self.events if not s)
            if fail / len(self.events) >= self.threshold:
                self.open_until = now + self.cooldown

    def allow(self):
        return time.time() >= self.open_until

breaker = CircuitBreaker()

def get_with_503_handling(url: str, client: httpx.Client, max_retries=4, base=1.7, cap=120.0):
    tries = 0
    while True:
        if not breaker.allow():
            time.sleep(1)                 # cool-down window is open
            continue
        resp = client.get(url, timeout=30)
        if resp.status_code != 503:
            breaker.record(True)
            resp.raise_for_status()
            return resp
        breaker.record(False)
        time.sleep(min(cap, base ** tries) + random.random())   # capped backoff + jitter
        tries += 1
        if tries > max_retries:
            raise RuntimeError(f"Exceeded 503 retry budget for {url}")
What it indicates
A 503 status code means the server can’t handle the request right now—often due to maintenance windows, overload, or upstream dependency issues. Unlike client errors, 503s are typically transient and good candidates for retry with backoff. When present, a Retry-After header tells you exactly when to try again.
Why it happens in scraping
- Overload: Your concurrency outpaces the target’s capacity.
- Maintenance / deploys: Routine downtime or rolling restarts.
- Rate shaping at the edge: CDNs or reverse proxies apply temporary throttles.
- Upstream failures: Database or internal API slowness that bubbles up as 503.
Triage checklist
- Confirm transience: Sample a few requests over 60–120 seconds; look for mixed 2xx/503 results.
- Check edge vs. origin: Compare responses with and without CDN (if possible) or across IP pools to isolate the failure plane.
- Inspect headers: Look for Retry-After, cache directives, or vendor headers that hint at edge throttling.
- Plot error bursts: If 503s cluster at specific hours, reschedule jobs to off-peak windows.
Mitigations
- Exponential backoff + jitter: Capped exponential (e.g., 1s, 2s, 4s, … up to 120s) with random jitter to desynchronize retries.
- Circuit breaker: If the failure ratio exceeds a threshold (e.g., 50% within 60s), pause the target for a cool-down period instead of hammering it.
- Adaptive concurrency: Reduce concurrent requests per host dynamically when 503 spikes appear; restore slowly when stability returns (see the sketch after this list).
- Edge-aware strategy: If a CDN is rate shaping, widen your rotating proxies pool to diversify ingress IPs and lower local per-IP pressure.
- Timeout + retry budget: Use generous read timeouts and a per-request retry budget so you don’t starve the job queue.
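The adaptive-concurrency sketch referenced above, using a simple AIMD rule (halve on 503, add one otherwise); the start/floor/ceiling values are assumptions to tune per target:

class AdaptiveConcurrency:
    """AIMD controller: shrink the per-host concurrency cap on 503s, grow it back slowly."""
    def __init__(self, start=8, floor=1, ceiling=32):
        self.limit = start
        self.floor = floor
        self.ceiling = ceiling

    def on_response(self, status: int) -> int:
        if status == 503:
            self.limit = max(self.floor, self.limit // 2)    # back off hard
        else:
            self.limit = min(self.ceiling, self.limit + 1)   # recover gently
        return self.limit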
Operational signals that you’re winning
- Declining 503 ratio after concurrency drops.
- Success-after-retry above 70–80%.
- Retries converge quickly (e.g., ≤2 attempts) with health restored within minutes.
5) 520 Unknown Error — Working Around Edge/Proxy Intermediaries

What it indicates
Error code 520 is a non-standard, CDN-originated response—commonly from Cloudflare—indicating the edge proxy received an invalid, unexpected, or empty response from the origin. It’s not a classic HTTP status code from the origin server; it’s the edge saying “something went wrong upstream” (see Cloudflare’s 520 documentation).
Why it happens in scraping
- Protocol quirks: TLS handshake anomalies, unsupported ciphers, or partial connections.
- Header/path sensitivity: Oversized headers, unusual header order, or malformed requests rejected by origin or WAF.
- Origin instability: The origin responds inconsistently (timeouts, resets) under load.
- Edge misclassification: Edge security rules treat your pattern as suspicious.
Diagnostics
- Compare pools: Send identical requests via different rotating proxies to see if certain ASN/IP ranges correlate with 520s.
- Trim and normalize: Remove non-essential headers, shrink cookie payloads, and standardize header casing/order.
- TLS hygiene: Try a mainstream TLS client profile; ensure SNI is correct and disable outdated protocol versions.
- Origin echo test: Fetch lightweight, cacheable endpoints (robots.txt, a small static asset) to separate origin health from path-specific issues.
- Latency triage: If 520s co-occur with long TTFB, raise read timeouts and test slower pacing to reduce server stress.
Mitigations
- Header minimization: Keep headers small and conventional (User-Agent, Accept, Accept-Language). Avoid exotic or duplicated fields.
- Session stickiness: For JS-heavy sites behind anti-bot edges, keep session cookies stable across a short window; flapping identity can look suspicious.
- Mixed proxy strategy: Blend residential and premium datacenter IPs to improve reputation and reduce false positives at the edge.
- Resilient retry policy: Treat 520 like a transient 5xx: retry with jitter; after N failures on an IP, rotate identity and cool down that IP.
- Origin fallback: If applicable and permitted, test an alternative hostname or mirror to prove whether the issue is edge-local.
A resilience sketch that rotates identity between attempts (httpx binds a proxy to the client, so a fresh client is built per attempt; http2=True requires the httpx[http2] extra):

import httpx, random, time

MIN_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}

def rotate_proxy():
    # Plug in your provider here; this example returns None (direct) or a proxy URL.
    pools = [None, "http://user:pass@res-ip:8000", "http://user:pass@dc-ip:8000"]
    return random.choice(pools)

def get_with_520_resilience(url: str, max_retries: int = 5):
    for attempt in range(max_retries + 1):
        proxy = rotate_proxy()
        # Newer httpx uses proxy=...; older versions use proxies=...
        with httpx.Client(http2=True, headers=MIN_HEADERS, proxy=proxy,
                          verify=True, timeout=30) as client:
            r = client.get(url)
        if r.status_code != 520:
            r.raise_for_status()
            return r
        # 520 from the edge: pause briefly, then retry with a different identity
        time.sleep(1 + random.random())
    raise RuntimeError(f"Exceeded 520 retry budget for {url}")
When to redesign
- Persistent 520s despite normalized headers and healthy concurrency suggest an edge rule conflict. Shift the crawl to off-peak hours, slow QPS, and maintain longer-lived sessions to look less bursty.
6) Other High-Signal Errors for Scrapers: 499, 500, 502, 504
499 Client Closed Request (non-standard)
Meaning: Your client terminated the connection before the server sent a response (or the server interpreted it that way). Some NGINX setups emit 499 when the client times out or disconnects early.
Common triggers
- Client-side timeouts too aggressive for the page’s generation time.
- Connection reuse issues in keep-alive pools under high concurrency.
- Premature aborts from the scraper (cancelling futures when queues reshuffle).
Fixes
- Increase read timeout modestly (e.g., 10 → 20–30s) for heavier pages.
- Retry once with a fresh connection (no keep-alive) to rule out pooled-socket quirks (see the sketch after this list).
- Apply backpressure: cap in-flight requests per host and use queues that avoid yanking requests mid-flight.
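A sketch of the fresh-connection retry, assuming httpx; the timeout values are examples, not recommendations:

import httpx

def fetch_with_fresh_connection_fallback(url: str, client: httpx.Client):
    """Retry once on timeout with a brand-new client (new sockets) and a wider read timeout."""
    try:
        return client.get(url, timeout=httpx.Timeout(10.0, read=20.0))
    except httpx.TimeoutException:
        # A fresh client avoids reused keep-alive sockets that may be in a bad state.
        with httpx.Client(headers=dict(client.headers)) as fresh:
            return fresh.get(url, timeout=httpx.Timeout(10.0, read=30.0))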
500 Internal Server Error
Meaning: Unhandled exception at the server or a generic upstream failure.
Triage
- Path sampling: Do multiple paths fail, or just one? If it’s localized, treat it like target data variance rather than global downtime.
- Payload sensitivity: Try the same path with slightly different query parameters or headers; look for patterns that trigger 500.
- Method awareness: Retry GET with backoff; avoid blind retries of non-idempotent POST calls unless you’ve confirmed idempotency.
Fixes
- Jittered retries with a small cap (1–3) and rising delay.
- Slow-lane mode: Temporarily reduce concurrency for the affected path.
- Cache assist: If you can use ETags or If-Modified-Since, you may avoid heavy recomputation on the server.
502 Bad Gateway
Meaning: A proxy or gateway got an invalid response from the upstream server.
Triage
- Edge vs. origin: Compare behavior across IPs and try a simple origin resource.
- Burst correlation: If 502s appear during traffic spikes, treat them as overload symptoms.
Fixes
- Retry with jitter, widen timeouts slightly.
- Staggered concurrency: Spread requests evenly over time; avoid synchronized bursts from multiple workers.
- Alternate routes: If the site has multiple POPs or mirrors, test them to bypass a localized upstream issue.
504 Gateway Timeout
Meaning: A gateway/proxy didn’t receive a timely response from the upstream server.
Triage
- Measure TTFB: If responses are slow but successful sometimes, increase read timeout and dial back QPS.
- Payload sizing: Very heavy pages or large API responses need longer ceilings.
Fixes
- Timeout tuning: Lift read timeout (e.g., from 10s to 25–40s) for specific domains.
- Adaptive slowdown: Reduce per-domain concurrency until 504s vanish, then slowly ramp back up.
- Conditional retries: A small number of retries with jitter is safe; if 504 persists, defer the target for a scheduled re-run.
Cross-cutting Patterns for 499/500/502/504
1) Idempotent-aware retries
Only GET, HEAD, and safe OPTIONS should be retried by default; POST/PUT/PATCH/DELETE require explicit idempotency guarantees to avoid duplicate writes.
2) Bucketed rate limits
Token buckets per host + per path allow burst tolerance while capping sustained QPS. Log token starvation events; they often precede 429/5xx spikes.
3) Smart proxy rotation
Tie proxy rotation to observability: demote IPs associated with rising error rates (429/403/5xx) and promote stable ones. Track IP-reputation scores if your provider exposes them. A pool-scoring sketch follows this list.
4) Health-first scheduling
Maintain a calendar of target stability: if certain domains degrade during business hours, schedule crawls for early morning or night in the target’s local timezone.
5) Evidence-based caps
Set per-target caps from real measurements: median successful TTFB, p95 latency, and the lowest concurrency that yields ≤2% combined error rate. Bake these caps into your crawler config so every job starts with proven defaults.
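The pool-scoring sketch mentioned in pattern 3; the scoring and quarantine rules here are assumptions, not any provider's API:

import random, time

class ScoredProxyPool:
    """Pick proxies weighted by recent success; quarantine ones whose score drops too low."""
    def __init__(self, proxies, quarantine_s=300):
        self.scores = {p: 1.0 for p in proxies}   # exponentially weighted success rate
        self.quarantined = {}                     # proxy -> release timestamp
        self.quarantine_s = quarantine_s

    def pick(self):
        now = time.time()
        live = [p for p in self.scores if self.quarantined.get(p, 0) <= now]
        if not live:                              # everything quarantined: fall back to the full pool
            live = list(self.scores)
        return random.choices(live, weights=[self.scores[p] for p in live], k=1)[0]

    def report(self, proxy, ok: bool):
        self.scores[proxy] = 0.9 * self.scores[proxy] + 0.1 * (1.0 if ok else 0.0)
        if self.scores[proxy] < 0.3:              # demote: cool this IP down for a while
            self.quarantined[proxy] = time.time() + self.quarantine_s
            self.scores[proxy] = 0.5              # neutral score once it returns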
7) Proxy Strategy: Residential vs. Datacenter vs. Rotating Pools
Selecting the right proxy mix determines how often you hit 429 status code, how well you sidestep soft blocks, and how much you spend. Think in terms of reputation, session stability, and cost per successful fetch rather than headline QPS.
Datacenter proxies
- Pros: Fast, inexpensive, abundant.
- Cons: Lower reputation with anti-bot systems; more likely to trigger 403/429 when patterns are obvious.
- Use: Bulk fetching of static assets and lightweight pages where risk of blocking is low and cost-per-request matters most.
Residential proxies
- Pros: Higher reputation and diversity; better at blending in and reducing blocks like 503 status code under load.
- Cons: Costly; variable latency.
- Use: Tighter targets, authenticated flows, and JS-heavy pages where appearance matters more than raw speed.
Rotating proxies (pool-based rotation)
- Pros: Automatic identity churn reduces per-IP pressure; useful against rate-limit windows and error code 520 flurries at the edge.
- Cons: Session-dependent sites may dislike frequent IP changes; maintain cookie continuity.
- Use: Broad crawls where session affinity is not required and you need resilience against transient throttling.
Session pinning vs. rotation
- Pin a session (cookie + IP) for flows that depend on consistency (carts, dashboards, multi-step forms).
- Rotate identities for list pages and public content where state does not matter.
- Track a per-target session-success rate and time-to-first-error; promote pools that extend session longevity.
Practical selection guide
- If the site shows early 429 spikes at low QPS, switch to a residential-first pool and pin sessions for a while.
- If the site returns 520 from a CDN, blend in reputable residential IPs and trim headers.
- For massive catalogs with tolerant defenses, datacenter + smart pacing often outperforms pricey pools.
- Build failover: if block signals rise, auto-migrate to “best proxy sites” in your allowlist and back down QPS.
Cost control
Measure cost per successful 2xx page. A pool that looks expensive per request can be cheaper per success if it slashes retries and bans.
8) Header, Session, and Fingerprint Hygiene
A large portion of avoidable 4xx/5xx errors comes from machine-like request signatures. Good hygiene reduces bans without burning proxy budget.
User-Agent and hint headers
- Rotate realistic User-Agent strings across OS/browser families; avoid rare or outdated versions.
- Match Accept-Language and Accept-Encoding to your UA family.
- Keep the header set short and conventional; extra or duplicated fields look synthetic.
Cookie and session management
- Persist cookie jars per domain; authenticate or consent once and reuse the session for a short window.
- Refresh sessions predictably to avoid suspicious long-lived identities.
- For rotating proxies, either pin IP for the session duration or tolerate short sessions with higher retry budgets.
Connection behavior
- Respect keep-alive and connection reuse but be ready to open a fresh connection after failures like 499 or 520.
- Calibrate TCP connect, TLS handshake, and read timeouts to domain baselines; timeouts set too low masquerade as 499-like failures.
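In httpx, for example, each phase gets its own ceiling; the values below are placeholders to replace with per-domain baselines:

import httpx

# Separate ceilings for the connect, read, write, and pool-acquire phases.
timeouts = httpx.Timeout(connect=5.0, read=30.0, write=10.0, pool=5.0)
client = httpx.Client(timeout=timeouts)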
Request pacing and navigation realism
- Add micro-delays between sequential requests from the same session.
- Avoid perfect periodicity; introduce jitter.
- Respect robots.txt and crawl-delay when present; even if not enforced, politeness reduces bans.
Advanced fingerprints
- For dynamic targets, emulate headless browsers with Playwright or Selenium; rotate viewport sizes, timezones, and input timings.
- If you hit CAPTCHAs, measure their density; it’s often cheaper to slow down than to solve them at scale.
Minimal viable header template
Start from a compact profile (User-Agent, Accept, Accept-Language, Accept-Encoding, Connection, Referer when applicable). Add only what improves success rates.
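Two illustrative profiles along those lines (browser-like and API-like); treat the exact values as assumptions to validate against each target:

BROWSER_PROFILE = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    # Add "Referer" only where it matches plausible navigation on the target.
}

API_PROFILE = {
    "User-Agent": BROWSER_PROFILE["User-Agent"],   # keep the UA family consistent
    "Accept": "application/json",
    "Accept-Language": "en-US,en;q=0.9",
}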
9) Python Implementation Playbook
This section outlines implementation patterns you can drop into web scraping python clients or reuse when migrating from curl to python.
Unified retry policy
- Handle 429, 503, 520 as retryable with exponential backoff + full jitter and a hard cap (e.g., 5 attempts).
- Honor Retry-After precisely when provided; compute delay in seconds or parse RFC 1123 date.
- Treat 4xx codes such as 403/404 as non-retryable unless your logic changes something (headers, cookies, proxies).
Rate limiting
- Use a token-bucket per host and, optionally, per path. Track tokens consumed and starvation events to tune QPS.
- Apply per-session caps; if a single session sees rising error rates, slow that session first before global slowdown.
Proxy orchestration
- Abstract a “proxy provider” that yields identities and tracks health. Demote IPs tied to high 4xx/5xx rates; promote stable ones.
- Support both pool rotation and session pinning.
- Keep a small quarantine list for IPs recently associated with 429 or 520 spikes.
Idempotency-aware retries
- Automatically retry safe methods (GET, HEAD) only.
- Require explicit flags for POST/PUT/PATCH/DELETE, or add request IDs to ensure server-side idempotency.
Resilience primitives
- Circuit breakers to pause targets with high failure ratios.
- Bulkhead isolation to prevent one flaky domain from starving the queue.
- Dead-letter queues for requests that exceeded retry budgets, with metadata for later analysis.
Observability hooks
- Emit structured logs: domain, path, method, status code, attempt, proxy ID, latency buckets (see the sketch after this list).
- Produce status-code histograms and error heatmaps for dashboards.
- Capture Retry-After values and success-after-retry metrics for each domain.
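A minimal structured-log emitter in that spirit; field names and latency buckets are illustrative:

import json, logging, time

log = logging.getLogger("crawler")

def log_fetch(domain, path, method, status, attempt, proxy_id, elapsed_s, retry_after=None):
    """Emit one JSON line per fetch so dashboards can build status-code histograms and heatmaps."""
    log.info(json.dumps({
        "ts": time.time(),
        "domain": domain,
        "path": path,
        "method": method,
        "status": status,
        "attempt": attempt,
        "proxy_id": proxy_id,
        "latency_bucket": "<1s" if elapsed_s < 1 else "<5s" if elapsed_s < 5 else ">=5s",
        "retry_after": retry_after,
    }))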
Testing and dry runs
- Smoke-test a target with a tiny concurrency and generous timeouts to learn baselines.
- Validate redirects, cookies, and header echo endpoints before high-QPS runs.
10) Observability for Crawlers
Without telemetry you fly blind. Observability reduces cost and shortens fix cycles when 429 status code, 503 status code, or error code 520 spike.
Core metrics
- Status-code distribution per domain and path.
- TTFB and total latency (median, p95).
- Retry attempts per success and success-after-retry%.
- Proxy health: error rates per IP/ASN, session length, ban incidents.
- Throughput: pages per minute per target and per worker.
- Cost per successful page including proxy and compute.
Dashboards that matter
- A per-target panel with status-code stacks over time; anomalies stand out quickly.
- An IP/ASN leaderboard to demote problem pools.
- A Retry-After histogram to set backoff expectations by target.
- Concurrency vs. error rate scatterplots to find safe operating points.
Alerting strategy
- Alerts on sustained deviations, not single spikes: e.g., 15-minute windows with error rate > threshold.
- Separate alerts for 4xx vs. 5xx; response actions differ.
- Route severe alerts to a circuit breaker that cools down the target automatically.
Auto-tuning loops
- If 429 rises above N%, reduce per-host tokens by a step and evaluate after a few minutes.
- If 503/520 surge, widen jitter, extend timeouts, and rotate identities more aggressively.
- Promote proxy pools that deliver longer sessions and lower error ratios; retire poor performers.
Post-incident reviews
- For every major spike, record the before/after metrics, changes deployed, and the smallest config change that restored stability.
- Convert lessons into defaults in your shared client library so every new job starts smarter.
Data retention
- Keep raw logs long enough to investigate seasonal patterns and rate-limit windows.
- Sample high-cardinality fields (User-Agent, header order) to study fingerprint impacts without exploding storage.
11) Sustainable, Site‑Friendly Scraping
Long‑term access relies on minimizing your footprint while maintaining throughput. Sustainable tactics reduce bans, lower the rate of 429 status code responses, and cut proxy spend.
Politeness and pacing
- Respect robots.txt and implied crawl delays even when not enforced.
- Use adaptive concurrency targeting a low steady error rate rather than maximum QPS.
- Distribute requests across paths and subdomains to avoid hot‑spotting a single endpoint.
Cache and conditional requests
- Honor ETag and Last-Modified headers to send If-None-Match and If-Modified-Since (see the sketch after this list).
- Prefer delta updates: fetch only what changed since the last run.
- Cache intermediate results (e.g., category pages) to reduce deep‑link queries.
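A conditional-fetch sketch with httpx; the in-memory cache dict stands in for whatever store you actually use:

import httpx

cache = {}   # url -> {"etag": ..., "last_modified": ..., "body": ...}

def conditional_get(url: str, client: httpx.Client) -> str:
    """Send validators from the previous fetch; reuse the cached body on 304 Not Modified."""
    entry = cache.get(url, {})
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    r = client.get(url, headers=headers)
    if r.status_code == 304:
        return entry["body"]                      # unchanged since the last run
    r.raise_for_status()
    cache[url] = {
        "etag": r.headers.get("ETag"),
        "last_modified": r.headers.get("Last-Modified"),
        "body": r.text,
    }
    return r.text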
Workload shaping
- Schedule heavy jobs during the target’s low‑traffic hours (use the site’s local timezone).
- Batch requests and stagger worker start times to avoid synchronized waves.
- For pages that frequently trigger 503 status code, move them to a slow lane with wider jitter.
Data quality without pressure
- Sample periodically to confirm content consistency; over‑fetching the same data wastes budget and invites throttling.
- Use pre‑flight HEAD requests on large media or APIs to validate sizes and ranges.
Compliance and respect
- Keep clear contact and identification where appropriate; honor removal requests.
- Avoid sensitive endpoints and authenticated data you’re not permitted to access.
- Monitor for content‑owner signals (CAPTCHAs, WAF challenges) and back off accordingly.
12) Appendix: Status Code Reference & Pre‑Launch Test Checklist
Quick status code reference (scraper‑centric)
- 429 Too Many Requests — You hit a rate limit; respect Retry‑After, slow down, use rotating proxies.
- 503 Service Unavailable — Temporary server or edge issue; retry with exponential backoff and jitter.
- 520 Unknown Error — Edge/CDN saw an invalid upstream response; normalize headers, improve TLS hygiene, rotate identity.
- 499 Client Closed Request — Client aborted/timeout; tune timeouts and reconsider connection reuse.
- 500/502/504 — Origin or gateway instability; retry safely, tune timeouts, and lower concurrency.
Pre‑launch checklist (per target)
- Fetch robots.txt; decide on crawl cadence and allowed paths.
- Measure baseline latency (TTFB, total) and successful QPS at tiny concurrency.
- Verify redirects, cookies, and session establishment on login/consent flows.
- Establish a minimal, realistic header set; confirm server echoes expected values.
- Choose proxy mix (datacenter/residential) and decide on session pinning vs rotation.
- Implement retry policy for 429, 503, 520 with exponential backoff + jitter; honor Retry‑After.
- Set per‑domain token buckets and global caps; add circuit breakers for failure bursts.
- Instrument logs: domain, path, status, attempt, proxy ID/ASN, latency, payload size.
- Define alert thresholds (4xx and 5xx separately) and auto‑tuning responses.
- Run a 15–30 minute canary: confirm steady success rate, low error ratios, and predictable costs.
- Store config and learned safe limits with the job so future runs start from proven defaults.
- Document remediation steps for each error class, including who/what changes the config and how success is verified.
Handy templates to include in your repo
- Header profiles (browser‑like, API‑like) with notes on where each is safe.
- Retry policy constants (base, cap, jitter), per‑domain overrides, and method‑aware rules.
- Proxy provider interface with health scoring and quarantine lists.
- Grafana dashboard JSON for status‑code histograms, Retry‑After distributions, and concurrency vs error scatterplots.
Where the keywords fit
- Use the appendix to interlink content around web scraping python, curl to python, best proxy sites, and the error‑focused guides (429 status code, 503 status code, error code 520). Internal links here boost topical authority across your SEO cluster.
FAQs: Web Scraping HTTP Status Codes (429, 503, 520)
What does the 429 status code mean in web scraping?
It signals rate limiting. Honor Retry-After, use exponential backoff with jitter, and rotate IPs via rotating proxies to lower request pressure.
How do I fix a 503 status code during crawling?
Treat 503 as transient: retry with jitter, lower per-host concurrency, and add a circuit breaker. If bursts align with peak hours, reschedule jobs.
What is error code 520 and why is it common behind CDNs?
It’s an edge-side error (e.g., Cloudflare) indicating an invalid or empty response from origin. Trim headers, ensure correct TLS/SNI, widen your proxy mix, and retry.
Should I use residential or datacenter proxies to avoid bans?
Residential proxies improve reputation and reduce soft blocks, while datacenter proxies are faster and cheaper. Use a mixed, health-scored pool and pin sessions where needed. See your best proxy sites guide.
How can I implement Retry-After handling in web scraping python?
Parse the header (seconds or HTTP date), sleep for that duration, then retry with a capped exponential backoff and full jitter. Apply per-domain rate limits.
When is it better to rotate proxies vs. keep sessions sticky?
Rotate for public, stateless pages; keep sessions sticky for authenticated or multi-step flows. Switch strategies if 429/403 spikes correlate with IP churn.
Can I migrate a cURL one-liner to Python for better reliability?
Yes—convert curl to python and add retries, timeouts, and proxy support (HTTPX/requests). This reduces 429/5xx failures at scale.
What metrics show my fixes are working?
Falling 429/503/520 rates, fewer retries per success, stable session lifetime, and lower cost per successful page.