Bypassing Akamai Bot Manager with curl_cffi

How to scrape Akamai-protected pages using Chrome TLS impersonation — without a headless browser.

What Akamai Detects

Akamai Bot Manager scores requests on a 0–100 scale starting with the very first request. The score combines signals from three gates: protocol-level fingerprint, IP/session reputation, and request pattern. Documentation often presents these as co-equal "all three must pass" requirements, but in practice session trust dominates once you have a warm cookie jar — a session that has built up an ak_bmsc/bm_sv history through legitimate-looking navigation rides through subsequent requests largely independent of the IP's baseline reputation. The warm-pool architecture below is what makes a cheap-residential proxy viable on Premier targets.

Gate 1 — Protocol-level fingerprint

Gate 2 — IP / session reputation

Akamai operates Client Reputation, a global IP scoring system shared across all Akamai customers, but session-level state (ak_bmsc, bm_sv, _abck) accumulates trust on top of the IP baseline and dominates the score for any request that already has a warm cookie jar.

Gate 3 — Request pattern (velocity + clustering)

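Akamai aggregates request counts across multiple keys, not per-IP alone. The Request Pattern section below gives the composite key as (ASN, fingerprint-hash, URL-template, time-window).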

Rotating IPs per request does not defeat cluster detection because the cluster key is multi-dimensional.

Real-World Success Rate Bands

For a pure-HTTP scraper (curl_cffi + residential proxies, no JS execution):

Path type                                                            Expected sustained rate
Bot Manager Standard, no _abck validation                            85–95% with good config
Bot Manager Premier with _abck validation, no sensor forgery         50–80%
Content Protector enabled (Akamai's 2024 scraper-specific product)   30–60%

Any documentation claiming a "<5% block rate" for cold, unwarmed sessions is outdated, was measured against unprotected paths, or predates Akamai's recent rule updates. eBay-tier targets running Premier + Content Protector are at the harder end of the range.

The dominant variable for sustained success rate is session warmth, not IP pool quality or fingerprint freshness. A perfect TLS impersonation with a fresh, unwarmed session through a clean residential pool still bottoms out at 30–60% on a Premier target. The same fingerprint through the cheapest $1/GB residential, but riding a warm ak_bmsc/bm_sv from a homepage→category warmup, sustains 95%+ on eBay-tier traffic — measured live in our prod fleet over the last 48h. Proxy quality matters for the cold mint; the pool architecture matters for everything after.

curl_cffi: Chrome TLS Impersonation

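curl_cffi is a Python binding for curl-impersonate (a libcurl fork) that impersonates real browsers at the TLS level. Setting impersonate="chrome146" (or the current latest) reproduces that Chrome version's TLS ClientHello (cipher suites, extensions, and their order), its HTTP/2 settings, and its default request headers. A typical session setup: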

from curl_cffi.requests import Session as CurlSession
from curl_cffi.const import CurlOpt
import random

_ACCEPT_LANGUAGES = [
    "fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7",
    "fr-FR,fr;q=0.9,en-US;q=0.5,en;q=0.3",
    "fr-FR,fr;q=0.9",
]

session = CurlSession(
    impersonate="chrome",   # alias resolves to the latest installed target
    timeout=15,
    allow_redirects=True,
    headers={"Accept-Language": random.choice(_ACCEPT_LANGUAGES)},
    proxy="http://user:pass@gate.provider.com:port",
    curl_options={
        CurlOpt.TCP_KEEPALIVE: 1,
        CurlOpt.TCP_KEEPIDLE: 60,
        CurlOpt.TCP_KEEPINTVL: 30,
        CurlOpt.DNS_CACHE_TIMEOUT: 300,
        CurlOpt.MAXCONNECTS: 10,
        CurlOpt.PIPEWAIT: 1,            # HTTP/2 multiplexing
        CurlOpt.CONNECTTIMEOUT_MS: 3000,
        CurlOpt.IPRESOLVE: 1,           # IPv4-only — skip AAAA + Happy Eyeballs
    },
)

resp = session.get("https://target.example.com/search?q=test")

Picking the impersonate target

The chrome alias auto-tracks the latest target curl_cffi ships. As of curl_cffi==0.15.1b1 that resolves to chrome148. Pinning the explicit version (impersonate="chrome148") means your scraper's wire image only changes when you upgrade the library — convenient for stability but easy to forget.

The "best" impersonate target rotates over time. Akamai's per-tenant ML auto-tunes its scoring; an impersonate that passed 100% last week may drop to 20% next week. Specific patterns observed in the field:

For sustained operation, run a small daily probe (20–30 requests across candidate impersonates against a cheap public path) and pin the day's winner. The list of candidates worth probing:

CANDIDATES = [
    "chrome",          # alias — latest stable Chrome
    "chrome131",       # older pinned version, sometimes survives longer
    "firefox",         # for tenants with significant Firefox user base
    "chrome_android",  # mobile path
    "safari_ios",      # mobile path
]

In CollectValue prod we've run that probe and the outcome (2026-05-12 A/B) was that single-target chrome outperforms any rotation we tested against eBay.fr. The pool is currently pinned via EBAY_IMPERSONATE_POOL=('chrome',) with weighted-sampling support left in for future re-tuning (EBAY_IMPERSONATE_PRIMARY_WEIGHTS, empty by default). The probe pattern above is still the right method to use when the success rate drifts — we just settled on a single-target outcome this round.
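The selection step of the daily probe can be sketched as a pure function over per-candidate counts. This is a minimal illustration, not the production probe: pick_winner, the min_requests cutoff, and the sample counts are all assumptions made for the example.

```python
def pick_winner(results, min_requests=20):
    """Return the candidate with the highest success rate, or None.

    `results` maps an impersonate target to (successes, attempts).
    Candidates with fewer than `min_requests` attempts are ignored
    because their rate is too noisy to trust.
    """
    best, best_rate = None, -1.0
    for target, (ok, total) in results.items():
        if total < min_requests:
            continue
        rate = ok / total
        if rate > best_rate:
            best, best_rate = target, rate
    return best


# Made-up counts, for illustration only.
probe = {"chrome": (24, 25), "chrome131": (19, 25), "firefox": (3, 5)}
winner = pick_winner(probe)
```

Ties go to whichever candidate appears first in the mapping, so listing the currently pinned target first avoids needless churn.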

What NOT to set manually

curl_cffi's impersonate= already handles User-Agent, Sec-CH-UA, Sec-CH-UA-Mobile, Sec-CH-UA-Platform, Accept, and Accept-Encoding for the impersonated browser. Overriding these breaks the fingerprint:

The general rule: if curl_cffi doesn't set a header for a given impersonate, do not invent values for it. Real Firefox doesn't send Sec-CH-UA; if you add it to a Firefox-impersonated request, you create the mismatch you were trying to avoid.

The only header consistently worth setting manually is Accept-Language, because curl_cffi doesn't localize this.

IP Reputation: A Cost You Pay At Session Mint, Not Per Request

For a target with strong protocol-level scoring, IP pool quality determines how expensive each session mint is — i.e. how often the homepage→category warmup gets blocked before producing a usable ak_bmsc/bm_sv. It does not determine the steady-state success rate of warm-session requests, which is dominated by session trust (see below).

In practice, with a warm pool maintaining N pre-warmed sessions:

Provider tiers — relevant for cold-mint cost, not sustained rate

Sticky-session lifetime

When the target serves an ak_bmsc or bm_sv cookie, reusing the same IP for multiple requests lets that cookie's session state accumulate trust. 10–30 minutes per sticky IP is the typical sweet spot — long enough to amortize cookie warming, short enough to limit damage if Akamai escalates scoring mid-session.

In CollectValue prod we run longer: EBAY_POOL_PROXY_SESSTTL_MIN=120 (2h) paired with EBAY_POOL_SESSION_TTL_S=7200, after verifying by probe that DataImpulse honors sessttl.120 on port 823. The longer window lets a single warmed cookie jar service 15–25 requests (the per-session retire bound) without the IP rotating mid-life. The TTL jitter of up to +20% (cm_pool.py/akamai_pool.py) handles the synchronous-expiration risk that would otherwise come with 2h sessions.

Most rotating-residential providers offer sticky modes:

IP cleanup heuristic

When a sticky IP returns a 403, don't reuse it within the next hour. Akamai's per-IP score doesn't immediately recover, and burning more requests through a flagged IP only worsens the cluster signal for the same fingerprint+ASN combination.
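One way to enforce the one-hour rule is a small cooldown registry. The sketch below is an assumption about how this could be tracked, not the production code: the IpCooldown class and its method names are hypothetical, and with a sticky-session provider the key may be the proxy session ID rather than a literal IP.

```python
import time


class IpCooldown:
    """Remember when each sticky IP last returned a 403."""

    def __init__(self, cooldown_s=3600, clock=time.time):
        self.cooldown_s = cooldown_s
        self.clock = clock              # injectable so tests can fake time
        self._flagged_at = {}

    def flag(self, ip):
        """Record a 403 seen on this IP."""
        self._flagged_at[ip] = self.clock()

    def is_usable(self, ip):
        """True if the IP was never flagged or its cooldown has elapsed."""
        flagged = self._flagged_at.get(ip)
        if flagged is None:
            return True
        if self.clock() - flagged >= self.cooldown_s:
            del self._flagged_at[ip]    # cooldown elapsed, forget the flag
            return True
        return False
```

Call flag() when a response comes back 403 and check is_usable() before assigning a sticky IP to a new session.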

Request Pattern: The Velocity + Cluster Gate

The cluster-detection gate is the most counterintuitive of the three. Even with per-request IP rotation, 300 requests in 5 minutes from one proxy ASN, with the same TLS fingerprint, against the same URL template, trips Akamai's rate policy because the rate is keyed on (ASN, fingerprint-hash, URL-template, time-window), not on the IP alone.
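A toy model makes the point concrete. The sketch below counts requests under a composite key that omits the IP; the field names, the fixed 5-minute bucket, and the counting scheme are assumptions for illustration only, since Akamai's actual rate-policy implementation is not public.

```python
from collections import Counter
from typing import NamedTuple


class Request(NamedTuple):
    ip: str
    asn: int
    fp_hash: str        # hypothetical TLS fingerprint hash
    url_template: str   # URL with its variable parts replaced
    ts: float           # seconds since epoch


def cluster_counts(requests, window_s=300):
    """Count requests per (ASN, fingerprint, URL template, time bucket)."""
    counts = Counter()
    for req in requests:
        bucket = int(req.ts // window_s)
        counts[(req.asn, req.fp_hash, req.url_template, bucket)] += 1
    return counts


# 300 requests inside one 5-minute bucket, each from a different IP in one ASN.
burst = [
    Request(ip=f"10.0.{i // 256}.{i % 256}", asn=64500, fp_hash="abc123",
            url_template="/search?q={q}", ts=float(i))
    for i in range(300)
]
counts = cluster_counts(burst)
per_ip = Counter(req.ip for req in burst)
```

Here per_ip never exceeds 1 while the single cluster key reaches 300, which is why per-request IP rotation leaves the rate signal untouched.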

Pacing

For a ~1,000-requests-per-day scraper:

import random
import time

# Inter-request delay — random jitter prevents pattern detection
time.sleep(random.uniform(10.0, 60.0))

Retry pacing

When a request returns 403, immediately retrying against the same target with a new IP looks like a bot's retry loop. Sleep 30–120s (random) before the retry. This both lets the proxy pool rotate and avoids the burst-retry pattern.
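The two pacing rules above can be combined into one helper, along with the arithmetic for what a single serial worker can sustain. This is a minimal sketch; next_delay and max_daily_requests are hypothetical names, and the capacity figure ignores response time.

```python
import random


def next_delay(last_status, rng=random):
    """Seconds to sleep before the next request.

    After a 403, back off 30-120s; otherwise use the normal 10-60s jitter.
    """
    if last_status == 403:
        return rng.uniform(30.0, 120.0)
    return rng.uniform(10.0, 60.0)


def max_daily_requests(mean_delay_s):
    """Upper bound on requests per day for one serial worker."""
    return int(86_400 / mean_delay_s)
```

The mean of a uniform 10–60s delay is 35s, so max_daily_requests(35.0) gives 2468: a single worker has headroom for a ~1,000-requests-per-day budget even after retry back-offs.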

Cookie discard on 403

If the session received an ak_bmsc or bm_sv cookie before the 403, Akamai has flagged that session as Strict. Continued requests on the same session — even from a new IP — will fail. Discard the cookie jar after any 403 and start fresh.

Pool Architecture: Sustained Multi-Worker Operation

The naive "fresh session per scrape" pattern pays the homepage→category warmup cost on every request and produces fingerprint+cookie trails that get flagged quickly. For sustained operation across a fleet — multiple gunicorn workers, cron jobs, batch backfills — a pre-warmed session pool is the right primitive.

The model

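Maintain a Redis-backed pool of N pre-warmed sessions. Each session carries the fields of the per-session hash in the key layout below: its cookie jar (cookies_json), sticky proxy session (proxy_session), impersonate target, created_at and expires_at timestamps, a request_count, and a status.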

Workers LPOP a session, do their work, then RPUSH it back on success or move it to a sick set on failure. A background maintainer keeps the pool topped up to target size.

ebay:pool:warm           LIST     SIDs ready for use
ebay:pool:sick           SET      SIDs awaiting GC
ebay:pool:session:{sid}  HASH     cookies_json, proxy_session, impersonate,
                                  created_at, expires_at, request_count, status
ebay:pool:mint_lock      STRING   global mint serialization
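To show the checkout/return flow without a Redis server, the sketch below mirrors the key layout with in-memory structures. It is a stand-in for illustration only: InMemoryPool and its method names are hypothetical, and it has none of the cross-process guarantees the real Redis-backed pool provides.

```python
from collections import deque


class InMemoryPool:
    """Stand-in for the Redis layout: warm LIST, sick SET, per-session HASH."""

    def __init__(self):
        self.warm = deque()     # ebay:pool:warm
        self.sick = set()       # ebay:pool:sick
        self.sessions = {}      # ebay:pool:session:{sid}

    def add(self, sid, meta):
        self.sessions[sid] = dict(meta, request_count=0, status="warm")
        self.warm.append(sid)                   # RPUSH

    def checkout(self):
        if not self.warm:
            return None                         # pool miss
        sid = self.warm.popleft()               # LPOP
        return sid, self.sessions[sid]

    def give_back(self, sid, success):
        if success:
            self.sessions[sid]["request_count"] += 1
            self.warm.append(sid)               # RPUSH
        else:
            self.sessions[sid]["status"] = "sick"
            self.sick.add(sid)                  # SADD
```

Because checkout takes from the head and a successful return appends to the tail, sessions are used round-robin and no single cookie jar absorbs all the traffic.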

Mint serialization across processes

threading.Lock is per-process. In a fleet of 4 gunicorn workers + a cron container, each worker has its own lock — they can all mint simultaneously when the pool drains, and overshoot the target by Nx.

Use a Redis-level lock with SET ebay:pool:mint_lock 1 NX EX 30. Acquire with a bounded wait (3s), release in a finally: block:

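import time

# `r` (a redis-py client), KEY_MINT_LOCK, KEY_WARM and TARGET_SIZE are module-level.

def mint_one(*, respect_ceiling: bool = False, skip_category: bool = False) -> str | None:
    # Bounded wait (3s) for the global mint lock.
    deadline = time.time() + 3.0
    while time.time() < deadline:
        if r.set(KEY_MINT_LOCK, '1', nx=True, ex=30):
            break
        time.sleep(0.1)
    else:
        return None  # contention timeout

    try:
        # Maintainer callers pass respect_ceiling=True so they no-op if
        # another worker has already filled the pool. Hitchhiker callers
        # (a real scrape waiting on a session) leave this False.
        if respect_ceiling and r.llen(KEY_WARM) >= TARGET_SIZE:
            return None
        # homepage GET → dwell → category GET. Hitchhiker callers pass
        # skip_category=True to stop after the homepage; warmup_chain is
        # assumed to accept this flag.
        meta = warmup_chain(skip_category=skip_category)
        store_session(meta)
        r.rpush(KEY_WARM, meta['sid'])
        return meta['sid']
    finally:
        # The lock's 30s expiry must exceed the worst-case warmup time;
        # otherwise this delete can release a lock another process now holds.
        r.delete(KEY_MINT_LOCK)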

The maintainer's iteration becomes a while LLEN(warm) < target: mint_one(respect_ceiling=True) loop that re-reads the count between mints.
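A sketch of that maintainer iteration, with the Redis calls injected as plain callables so the loop can be exercised in isolation. The top_up name and the max_failures guard are assumptions added here; the guard stops the loop from spinning when mints keep returning None.

```python
def top_up(warm_len, mint, target, max_failures=3):
    """Mint until the warm list reaches `target`; return sessions minted.

    `warm_len` and `mint` are injected callables standing in for
    LLEN(warm) and mint_one(respect_ceiling=True).
    """
    minted = failures = 0
    # Re-read the count on every pass: other workers may have minted too.
    while warm_len() < target and failures < max_failures:
        if mint() is None:
            failures += 1       # lock contention, ceiling hit, or failed warmup
        else:
            minted += 1
    return minted
```

In the real maintainer, warm_len would wrap r.llen(KEY_WARM) and mint would wrap mint_one(respect_ceiling=True).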

TTL jitter: avoid the synchronous expiration stampede

When the pool's sessions are minted in a burst (after deploy, after a Redis flush, or at cold boot), a uniform TTL means they all expire within seconds of each other. The pool drains faster than serial mints can refill, and concurrent requests fall through to the no-pool path → captcha cascade.

Store a per-session jittered expires_at at mint time and verify staleness against it:

expires_at = created_at + TTL + random.uniform(0, TTL * 0.2)

def is_stale(meta):
    # Values read back from a Redis hash arrive as strings, so cast first.
    if int(meta['request_count']) >= MAX_REQUESTS_PER_SESSION:
        return True
    return time.time() >= float(meta.get('expires_at', 0))

Up to +20% jitter on a 7200s TTL spreads expirations across a ~24-minute window (0 to 0.2×7200s = 1440s of added jitter), so a burst-minted pool expires gradually and the maintainer's serial mints can keep pace.
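The spread can be checked with a short simulation of a burst mint. This is an illustrative sketch: jittered_expiry is a hypothetical helper wrapping the expires_at formula above, and the pool size of 20 and the seed are arbitrary.

```python
import random


def jittered_expiry(created_at, ttl, jitter_frac=0.2, rng=random):
    """expires_at with between 0 and jitter_frac of the TTL added."""
    return created_at + ttl + rng.uniform(0, ttl * jitter_frac)


rng = random.Random(42)          # seeded so the run is repeatable
TTL = 7200
# 20 sessions all minted at t=0, as after a deploy or a Redis flush.
expiries = [jittered_expiry(0.0, TTL, rng=rng) for _ in range(20)]
spread_s = max(expiries) - min(expiries)
```

Every expiry lands between 7200s and 8640s after the mint, so spread_s can never exceed 1440s, compared with a spread of zero under a uniform TTL.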

Pool-miss policy

When checkout() returns None (pool empty), there are two paths:

Frontend / SLA-bound paths should default to inline_mint: a 5–10s slow page beats a captcha error. Background batch can use either; inline_mint is also recommended there since the worker has nothing better to do.

Hitchhiker mints

The dedicated category-page warmup costs bandwidth. When a real request is already waiting (pool was empty when the caller hit checkout), skip the synthetic category GET — the real scrape will serve as the second warmup step:

def execute_leg(scrape_call):
    meta = checkout()
    if meta is None:
        mint_one(skip_category=True)            # hitchhiker: homepage only
        meta = checkout()
    return scrape_call(meta)

Cuts ~50% of warmup bandwidth on the hitchhiker path. The session arrives with the homepage cookies; the real eBay request picks up the rest.

Stream-aborted warmup

Akamai sets its cookies in the initial response headers. The full category-page body (often 1–2 MB) is wasted bandwidth on the warmup. Abort after ~64 KB:

with session.stream('GET', category_url) as r:
    total = 0
    for chunk in r.iter_content(chunk_size=4096):
        total += len(chunk)
        if total >= 65_536:
            break

Real measurement on eBay's category search: full-page warmup ~1.5 MB; stream-aborted warmup ~150 KB per mint. Proxy providers bill the wire bytes — this directly cuts proxy spend.
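To put the saving in money terms, the sketch below applies the two measured sizes to a hypothetical volume. The 10,000-mint figure and the $1/GB price are assumptions for the arithmetic, and providers differ on whether a GB is billed as 10^9 or 2^30 bytes.

```python
def warmup_cost_usd(mints, bytes_per_mint, usd_per_gb=1.0):
    """Proxy spend on warmups, billing 1 GB as 10**9 bytes."""
    return mints * bytes_per_mint / 1e9 * usd_per_gb


full = warmup_cost_usd(10_000, 1_500_000)      # ~1.5 MB per mint
aborted = warmup_cost_usd(10_000, 150_000)     # ~150 KB per mint
```

Under these assumptions the full-page warmup costs about $15 per 10,000 mints and the stream-aborted one about $1.50, a tenfold reduction that scales linearly with mint volume and per-GB price.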

Cookie roll-forward

bm_sv rotates on most protected requests. On return_session(success=True, cookies=live), persist the post-scrape cookie jar back into the session's hash so the next caller starts from the current server-side session state:

def return_session(sid, success, cookies=None):
    if not success:
        mark_sick(sid)
        return
    if cookies:
        r.hset(session_key(sid), 'cookies_json', json.dumps(cookies))
    r.hincrby(session_key(sid), 'request_count', 1)
    r.rpush(KEY_WARM, sid)

Without roll-forward, sessions degrade as their stored cookies drift out of sync.

Pre-warm at process boot

A cron or batch process running outside the maintainer-running fleet starts with an empty (or stale) local view of the pool. The first scrapes serially trigger hitchhiker mints — a cold-start tax of ~5–10s × N for the first N requests.

Front-load it: call prewarm_pool() once synchronously at process start. The mint lock serializes globally with the worker fleet's maintainer, so there's no double-mint risk.

def main():
    args = parser.parse_args()
    prewarm_pool()                              # blocks ~30–60s cold, no-op when warm
    for item in items:
        scrape(item)

Per-session telemetry

Persist these fields on every scrape's metrics row:

Pool-wide gauges to graph: LLEN warm over time, SCARD sick, TTL mint_lock (>0 means a mint is in progress). Alert when pool_status_after='no_pool' rate exceeds 1% — the pool is draining faster than the maintainer can refill, indicating the target size is too low or the TTL jitter is too narrow for the current burst pattern.
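The alert condition reduces to a rate over recent metrics rows. A minimal sketch, assuming the rows' pool_status_after values have been collected into a list; the function names and the "ok" placeholder status are hypothetical, and only 'no_pool' comes from the telemetry described above.

```python
def no_pool_rate(statuses):
    """Fraction of scrapes whose pool_status_after was 'no_pool'."""
    if not statuses:
        return 0.0
    return sum(s == "no_pool" for s in statuses) / len(statuses)


def should_alert(statuses, threshold=0.01):
    """True when the no_pool rate strictly exceeds the threshold."""
    return no_pool_rate(statuses) > threshold
```

Evaluating this over a sliding window (for example the last few hundred scrapes) rather than over all time keeps the alert responsive to a pool that has only recently started draining.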

Block Detection

import re

def detect_akamai_block(html):
    if 'Pardon Our Interruption' in html:
        return 'pardon'
    if 'Access Denied' in html and len(html) < 10_000:
        return 'access-denied'
    if 'Nous sommes' in html[:500]:
        return 'nous-sommes'                     # locale-specific Akamai deny
    if 'pageError' in html or 'page-error' in html:
        return 'rate_limit'                      # eBay app-level limiter
    if len(html) < 10_000 and re.search(r'Reference #\d+\.\w+', html):
        return 'akamai-ref'
    if len(html) < 30_000 and 'splashui' in html:
        return 'splashui'
    if len(html) < 5_000 and 'sensor_data' in html:
        return 'sensor-challenge'
    if len(html) < 5_000 and 'sec-cpt-if' in html:
        return 'crypto-challenge'
    return None

The len(html) guards prevent false positives: a real results page is 500 KB+, while block pages are typically under 10 KB (the splashui interstitial gets a looser 30 KB bound). The rate_limit class (eBay's app-level limiter, distinct from an Akamai 403) drives an extra cooldown in the pool layer before the next checkout (EBAY_POOL_RATE_LIMIT_BACKOFF_MIN_S / _MAX_S, default 60–120s).

Bot scoring cookie classes

Akamai's _abck cookie encodes the session's bot-score state. The cookie value ends in a suffix that signals the current state:

Tracking the _abck suffix class per response is the single most useful telemetry for understanding why a scraper is degrading.

Operational Telemetry

For a scraper running at scale, log the following per request to make degradations visible:

Alert thresholds worth tuning:

When Pure HTTP Hits Its Ceiling

Three signals indicate the pure-HTTP path can't be tuned further on a given target:

  1. The target serves _abck and rejects requests without a server-validated sensor_data POST (visible as: every protected-path response returns the _abck=...~0~-1~-1~-1~-1 invalidated form, regardless of fingerprint or IP).
  2. The target ships Content Protector — symptoms include tarpitting (200 with slowed response body), deterministic 403 on a previously-mixed pattern, or first appearance of sec-cpt / sbsd cookies.
  3. Sustained success rate sits below 30% across multiple proxy providers, multiple impersonate targets, and multiple pacing strategies — meaning the gate isn't any of the variables you control.

At that point the options are sensor_data forgery (Hyper Solutions paid API, or self-hosted port of the open-source glizzykingdreko/akamai-v3-sensor-data-helper encryption primitives plus a daily-updated payload generator) or migrating the cold-path requests to a stealth browser pool (Camoufox or Patchright). Both are significantly more expensive than the pure-HTTP path.

Created 2026-04-09T11:53:49+02:00, updated 2026-05-21T13:46:28+02:00