You have a spreadsheet with 10,000 URLs. You need to:
- Visit each one
- Extract specific data
- Store the result
If you’ve done this before, you know the usual path: spin up a queue (Redis + Bull, or SQS), deploy a worker process, handle retries and failures, monitor progress, scale up compute… and then tear it all down when you’re done.
There’s a much simpler model.
## The problem with DIY batch processing
Large batch jobs require:
- Queue infrastructure to distribute work
- Worker processes to consume from the queue
- Concurrency management (don’t hammer the source)
- Retry logic for transient failures
- Dead-letter handling for permanent failures
- Monitoring to know when you’re done
- Compute capacity that can handle the peak
For a one-off analysis, this is massively over-engineered. For a recurring job, it still requires ongoing ops. And none of this is your core product.
## Async workers: a simpler model
Seek API runs workers in a distributed job execution platform. When you submit a job:
- It enters a queue managed by the platform
- It executes with appropriate concurrency
- It retries on transient failures
- The result is stored and available when complete
You submit thousands of jobs simultaneously and check for results. No infrastructure.
## Submitting 10,000 jobs
```javascript
import fetch from 'node-fetch';
import { readFileSync } from 'fs';

const urls = readFileSync('urls.txt', 'utf8').trim().split('\n');
const API_KEY = process.env.SEEKAPI_KEY;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitJob(url) {
  const res = await fetch('https://api.seek-api.com/v1/workers/webpage-extractor/jobs', {
    method: 'POST',
    headers: { 'X-Api-Key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });
  const data = await res.json();
  return data.job_uuid;
}

// Rate-limit submission to 50 req/s (platform limit)
async function submitBatch(urls, ratePerSecond = 50) {
  const jobIds = [];
  for (let i = 0; i < urls.length; i++) {
    const uuid = await submitJob(urls[i]);
    jobIds.push({ url: urls[i], uuid });
    if ((i + 1) % ratePerSecond === 0) await sleep(1000);
    if ((i + 1) % 1000 === 0) console.log(`Submitted ${i + 1} / ${urls.length}`);
  }
  return jobIds;
}

const jobs = await submitBatch(urls);
console.log(`All ${jobs.length} jobs submitted`);
```
10,000 submissions at 50/s takes ~200 seconds (3.3 minutes).
## Polling for results
Don’t poll each job individually in sequence — that defeats the purpose. Instead, poll in batches and collect results as they complete:
```javascript
async function collectResults(jobs) {
  const pending = new Map(jobs.map(j => [j.uuid, j.url]));
  const results = [];
  while (pending.size > 0) {
    // Check every pending job each cycle, in parallel batches of 100
    const uuids = [...pending.keys()];
    for (let i = 0; i < uuids.length; i += 100) {
      await Promise.all(
        uuids.slice(i, i + 100).map(async (uuid) => {
          const res = await fetch(`https://api.seek-api.com/v1/jobs/${uuid}`, {
            headers: { 'X-Api-Key': API_KEY }
          }).then(r => r.json());
          if (res.status === 'completed') {
            results.push({ url: pending.get(uuid), data: res.result });
            pending.delete(uuid);
          } else if (res.status === 'failed') {
            results.push({ url: pending.get(uuid), error: res.error });
            pending.delete(uuid);
          }
        })
      );
    }
    console.log(`${pending.size} jobs remaining...`);
    if (pending.size > 0) await sleep(5000);
  }
  return results;
}

const results = await collectResults(jobs);
```
## Using webhooks instead of polling
For 10K jobs, polling is manageable. For 100K+, use webhooks to avoid unnecessary API calls:
```javascript
// Include a webhook URL in each job submission
body: JSON.stringify({
  url,
  webhook: 'https://your-server.com/hooks/job-complete'
})
```
Your webhook endpoint receives the result as soon as each job finishes. No polling. No missed completions. Process each result as it arrives.
```javascript
import express from 'express';

const app = express();
app.use(express.json()); // parse JSON webhook payloads

app.post('/hooks/job-complete', (req, res) => {
  const { job_uuid, status, result, original_input } = req.body;
  if (status === 'completed') processResult(original_input.url, result); // your storage logic
  res.sendStatus(200); // acknowledge so the platform doesn't re-deliver
});
```
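Webhook platforms typically deliver at-least-once, so the same completion event can arrive more than once. A minimal in-memory dedupe guard is enough for a single-process receiver (the `seen` set and `handleEvent` helper here are illustrative, not part of the Seek API; a persistent store would be needed across restarts):

```javascript
// Track job UUIDs already processed so a redelivered webhook
// doesn't write the same result twice.
const seen = new Set();

function handleEvent(event) {
  // Returns true for the first delivery of a job's event,
  // false for a duplicate that should be ignored.
  if (seen.has(event.job_uuid)) return false;
  seen.add(event.job_uuid);
  return true;
}
```

Call `handleEvent(req.body)` at the top of the webhook handler and skip processing when it returns false.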
## Throughput expectations
| Job count | Avg job time | Approx. wall time after submission (parallel) |
|---|---|---|
| 100 | 5s | ~10s |
| 1,000 | 5s | ~30s |
| 10,000 | 5s | ~2–3 min |
| 100,000 | 5s | ~15–20 min |
The platform handles parallelism. You just submit and wait.
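As a rough mental model, total wall time is submission time (bounded by the 50 req/s limit above) plus execution time at the platform's effective concurrency. A back-of-the-envelope estimator, where the `concurrency` default is an assumption you'd replace with your plan's actual limit:

```javascript
// Rough wall-time estimate for a batch run.
// jobs: number of jobs; avgJobSeconds: mean job duration;
// submitRate: submissions/sec (platform limit);
// concurrency: parallel workers (assumed figure, not a documented value).
function estimateWallTimeSeconds(jobs, avgJobSeconds, submitRate = 50, concurrency = 500) {
  const submission = jobs / submitRate;
  const execution = Math.ceil(jobs / concurrency) * avgJobSeconds;
  return submission + execution;
}

console.log(estimateWallTimeSeconds(10000, 5)); // 300: 200s submitting + 100s executing
```

The model ignores polling overhead and queueing jitter, so treat it as a lower bound.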
## Saving results
Stream results to a file or database as they come in:
```javascript
import { createWriteStream } from 'fs';

const output = createWriteStream('results.jsonl');

// In collectResults, write each record as soon as it completes:
if (res.status === 'completed') {
  output.write(JSON.stringify({ url: pending.get(uuid), data: res.result }) + '\n');
  pending.delete(uuid);
}
```
JSONL (one JSON object per line) is ideal for large result sets — it’s streamable, appendable, and every major data tool reads it.
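Reading the file back is just as straightforward. A small streaming reader using Node's `readline` (the `readJsonl` helper and filename are illustrative), which never loads the whole file into memory:

```javascript
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

// Stream a JSONL file one line at a time, invoking onRecord for
// each parsed object; blank lines are skipped.
async function readJsonl(path, onRecord) {
  const rl = createInterface({ input: createReadStream(path) });
  for await (const line of rl) {
    if (line.trim()) onRecord(JSON.parse(line));
  }
}

// Usage: await readJsonl('results.jsonl', (r) => console.log(r.url));
```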
## What this replaces
With this approach, you eliminate:
- A Redis instance
- A queue worker service
- An autoscaling policy
- A dead-letter queue handler
- A monitoring dashboard for queue depth
- A retry configuration
For a batch analysis that runs once a month, maintaining that infrastructure doesn't make sense. For one that runs every day, the ops overhead compounds. Async workers are the simpler, cheaper path in both cases.
## Complete script
A complete, ready-to-run script for bulk processing is in the Seek API GitHub examples repository. It includes chunked submission, webhook support, and progress tracking.