AWS Lambda is remarkable technology. The ability to run arbitrary code without provisioning or managing servers, scaling automatically from zero to thousands of concurrent executions, billed to the millisecond — it’s a genuine engineering achievement.
The question isn’t whether Lambda works. It’s whether Lambda is the right abstraction for data extraction and enrichment workers. For most teams, it isn’t.
## What serverless is actually good at
AWS Lambda and Google Cloud Functions shine for:
- Event-driven compute: An image is uploaded → trigger a function to resize it
- API backend handlers: Stateless request → response without sustained traffic
- Lightweight transformations: Parse a webhook payload and route to a queue
- Burst workloads: Unpredictable traffic spikes that don’t justify reserved capacity
The Lambda model works when:
- Your function is effectively stateless (it keeps nothing between invocations)
- Cold starts are acceptable (or mitigated by provisioned concurrency)
- External dependencies are minimal
- Execution time is short (under 15 minutes for Lambda)
## Why Lambda struggles with scraping workers
### Cold starts with heavy dependencies
A web scraping worker typically needs Playwright, Chromium, and related libraries. The Chromium binary alone is ~200MB. Adding it to a Lambda deployment package or container image means cold start times of 3–8 seconds before any business logic runs.
For a worker called infrequently, this is fine. For a worker processing 100 jobs in quick succession, you’re paying 3–8 seconds of overhead per cold start — or you pay for provisioned concurrency ($50–$200/month per function) to keep instances warm.
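The standard mitigation is to pay the Chromium launch once per container: initialize the browser lazily at module scope so warm invocations reuse it. A minimal sketch of the pattern, with the real Playwright/Chromium startup stubbed behind a `launch` parameter:

```python
_browser = None  # module-level: survives across warm invocations of the same container

def get_browser(launch):
    """Launch the browser on the first (cold) invocation, reuse it afterwards.

    `launch` stands in for the real Playwright/Chromium startup, which is
    where the 3-8 second cold-start penalty is paid.
    """
    global _browser
    if _browser is None:
        _browser = launch()
    return _browser
```

Warm invocations skip the launch entirely, so the penalty is paid once per container rather than once per job; provisioned concurrency removes even that first hit, at the monthly cost noted above.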
### Memory requirements
A headless Chromium instance requires 1–2 GB of RAM to run reliably. Lambda pricing is memory × time:
- 1024 MB × 30 seconds = 30 GB-seconds ≈ $0.0005 per execution
- 2048 MB × 30 seconds = 60 GB-seconds ≈ $0.001 per execution
At 100,000 executions/month: roughly $50–$100 in compute. Manageable. But the true cost includes:
- Cold start mitigation (provisioned concurrency)
- Container image storage
- VPC configuration for proxy access
- Monitoring and alerting infrastructure
- Development overhead
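The compute math above can be reproduced from AWS's published x86 Lambda rates (us-east-1 at the time of writing; verify against current pricing):

```python
GB_SECOND_RATE = 0.0000166667  # $ per GB-second, x86 Lambda (us-east-1)
REQUEST_RATE = 0.0000002       # $0.20 per million requests

def monthly_compute_cost(memory_mb, avg_seconds, executions):
    """Raw Lambda compute + request charges; excludes NAT, storage, monitoring."""
    gb_seconds = (memory_mb / 1024) * avg_seconds * executions
    return gb_seconds * GB_SECOND_RATE + executions * REQUEST_RATE

print(round(monthly_compute_cost(1024, 30, 100_000), 2))  # ≈ 50.02
print(round(monthly_compute_cost(2048, 30, 100_000), 2))  # ≈ 100.02
```

Note that the compute line is only the starting point; the items above are what push the real bill higher.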
### 15-minute execution limit
Lambda’s maximum execution time is 15 minutes. For most scraping tasks this isn’t a constraint. But for workers that process multi-page documents, handle complex navigation flows, or need to wait for anti-bot challenges to clear, hitting the limit kills the invocation mid-task and discards partial progress.
Google Cloud Functions (2nd gen) allows up to 60 minutes for HTTP-triggered functions, which is better. But the 15-minute Lambda limit is a gotcha that tends to reveal itself only in production.
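The practical defense is a deadline guard: check `context.get_remaining_time_in_millis()` (part of the standard Lambda context object) between units of work and checkpoint before the hard kill. A sketch, with the per-page work stubbed out:

```python
SAFETY_MARGIN_MS = 60_000  # bail out a minute before the 15-minute hard stop

def process_page(page):
    """Placeholder for the real per-page scraping work."""
    return page

def handler(event, context):
    """Process pages until the deadline nears, then return the remainder
    so the caller can re-enqueue it instead of losing partial progress."""
    pages = event["pages"]
    done = []
    for page in pages:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break
        done.append(process_page(page))
    return {"done": done, "remaining": pages[len(done):]}
```

This turns a silent mid-task kill into an explicit partial result the caller can resume, at the cost of making every worker deadline-aware.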
### VPC + proxy complexity
Serious scraping requires proxy rotation. Lambda running inside a VPC needs a NAT gateway to access the internet — roughly $32/month in hourly charges plus a per-GB data processing fee, regardless of how little you use it. Configuring proxy rotation through Lambda requires either:
- A proxy provider accessible over HTTPS (simple but expensive per request)
- A proxy pool running on dedicated infrastructure (negates “serverless” simplicity)
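Either way, the worker itself still has to rotate through the pool per request. The rotation logic is trivial — the operational burden is everything around it. A minimal round-robin rotator (the endpoints below are placeholders):

```python
import itertools

class ProxyRotator:
    """Round-robin over a proxy pool. Real pools also need health checks,
    per-target stickiness, and ban detection on top of this."""
    def __init__(self, proxies):
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

rotator = ProxyRotator([
    "http://user:pass@proxy-1.example.com:8080",  # placeholder endpoints
    "http://user:pass@proxy-2.example.com:8080",
])
```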
### Operational overhead
Building a “scraping Lambda” still requires:
- Container image with Chromium
- IAM roles and policies
- CloudWatch logging configuration
- Error handling and retry logic
- Dead letter queues for failed invocations
- VPC configuration if using proxies
- A queue (SQS) if processing batches asynchronously
This is not “zero ops.” This is a moderate amount of ops, paid in setup time and architectural complexity even if not in server management.
## The Seek API worker model vs Lambda
When you use a Seek API worker instead of a Lambda function:
| Concern | Lambda | Seek API Worker |
|---|---|---|
| Cold start | 3–8s (unless provisioned) | None (platform manages warmth) |
| Memory config | You choose, you pay | Managed |
| Proxy setup | VPC + NAT or external service | Included |
| Chromium bundling | You manage | Included |
| Retry logic | You build | Platform-provided |
| Anti-bot updates | Your responsibility | Worker maintainer’s responsibility |
| IAM/security config | You configure | N/A |
| Monitoring | CloudWatch config | Dashboard included |
The trade: you lose flexibility (you can’t run arbitrary code) but you gain operational simplicity and zero infrastructure management.
## When Lambda is still the right choice
Lambda makes sense for worker infrastructure when:
- Proprietary logic: The worker does something that no managed platform covers, and you need to run it on your own infrastructure
- Data residency: Compliance requires data never leaves your AWS account
- Tight integration with existing AWS services: If your data pipeline is deeply embedded in AWS (S3, RDS, SQS), keeping the worker in Lambda reduces latency and cross-service data transfer costs
- Extreme scale with cost optimization: At millions of executions/month, a highly optimized Lambda function can be cheaper than per-job pricing
## The architectural principle
Lambda solves for: “I need to run arbitrary code reactively without managing servers.”
Seek API workers solve for: “I need structured data from a source without maintaining extraction infrastructure.”
These are different problems. Lambda is compute infrastructure. Workers are data infrastructure. For data extraction and enrichment — the vast majority of API worker use cases — the worker platform model is simpler, cheaper to operate, and ready to use immediately.
Use Lambda for the glue between systems. Use workers for the data acquisition. Combine them as needed.
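In code, that split looks like this: the Lambda stays thin glue, the worker owns the extraction. The worker URL and payload shape here are hypothetical — substitute your platform's actual API — and the HTTP call is injected so the glue logic stays testable:

```python
import json
import urllib.request

WORKER_URL = "https://api.example.com/v1/workers/enrich"  # hypothetical endpoint

def _http_post(url, payload):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def handle(event, post=_http_post):
    """Lambda glue: delegate data acquisition to a worker, then route onward."""
    record = post(WORKER_URL, {"domain": event["domain"]})  # worker does the scraping
    record["pipeline"] = "enrichment"  # glue-level bookkeeping stays in Lambda
    return record  # e.g. hand off to SQS or S3 from here
```

The Lambda here has no Chromium, no proxies, no 15-minute anxiety — just orchestration, which is what it's good at.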