
How to Process 10,000 URLs in Parallel Without Running a Server

Run batch jobs against massive URL lists using async workers — no queue infrastructure, no worker processes, no ops overhead. Just API calls.

Seek API Team

You have a spreadsheet with 10,000 URLs. You need to:

  • Visit each one
  • Extract specific data
  • Store the result

If you’ve done this before, you know the usual path: spin up a queue (Redis + Bull, or SQS), deploy a worker process, handle retries and failures, monitor progress, scale up compute… and then tear it all down when you’re done.

There’s a much simpler model.

The problem with DIY batch processing

Large batch jobs require:

  1. Queue infrastructure to distribute work
  2. Worker processes to consume from the queue
  3. Concurrency management (don’t hammer the source)
  4. Retry logic for transient failures
  5. Dead-letter handling for permanent failures
  6. Monitoring to know when you’re done
  7. Compute capacity that can handle the peak

For a one-off analysis, this is massively over-engineered. For a recurring job, it still requires ongoing ops. And none of this is your core product.

Async workers: a simpler model

Seek API runs workers in a distributed job execution platform. When you submit a job:

  • It enters a queue managed by the platform
  • It executes with appropriate concurrency
  • It retries on transient failures
  • The result is stored and available when complete

You submit thousands of jobs simultaneously and check for results. No infrastructure.

Submitting 10,000 jobs

import fetch from 'node-fetch';
import { readFileSync } from 'fs';

const urls = readFileSync('urls.txt', 'utf8').trim().split('\n');
const API_KEY = process.env.SEEKAPI_KEY;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitJob(url) {
  const res = await fetch('https://api.seek-api.com/v1/workers/webpage-extractor/jobs', {
    method: 'POST',
    headers: { 'X-Api-Key': API_KEY, 'Content-Type': 'application/json' },
    body: JSON.stringify({ url })
  });
  if (!res.ok) throw new Error(`Submit failed (${res.status}) for ${url}`);
  const data = await res.json();
  return data.job_uuid;
}

// Rate limit submission to 50 req/s (platform limit)
async function submitBatch(urls, ratePerSecond = 50) {
  const jobIds = [];
  for (let i = 0; i < urls.length; i++) {
    const uuid = await submitJob(urls[i]);
    jobIds.push({ url: urls[i], uuid });
    if ((i + 1) % ratePerSecond === 0) await sleep(1000);
    if ((i + 1) % 1000 === 0) console.log(`Submitted ${i + 1} / ${urls.length}`);
  }
  return jobIds;
}

const jobs = await submitBatch(urls);
console.log(`All ${jobs.length} jobs submitted`);

At 50 submissions per second, 10,000 jobs take ~200 seconds (about 3.3 minutes) to submit.
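The loop above also serializes the HTTP round-trips themselves, so each second's 50 submissions wait on one another. If the rate limit applies to request starts rather than to strictly sequential calls (an assumption worth confirming against the docs), a sketch that fires each second's quota in parallel looks like this — `submitJob` is passed in as a parameter to keep the sketch self-contained:

```javascript
// Submit in parallel chunks of `ratePerSecond`, pausing 1s between chunks.
// `submitJob(url)` is assumed to resolve to a job UUID, as in the script above.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function submitBatchParallel(urls, submitJob, ratePerSecond = 50) {
  const jobIds = [];
  for (let i = 0; i < urls.length; i += ratePerSecond) {
    const chunk = urls.slice(i, i + ratePerSecond);
    const uuids = await Promise.all(chunk.map(submitJob));
    chunk.forEach((url, j) => jobIds.push({ url, uuid: uuids[j] }));
    if (i + ratePerSecond < urls.length) await sleep(1000);
  }
  return jobIds;
}
```

In the script above you would call `submitBatchParallel(urls, submitJob)`; the sequential version remains the conservative choice if you'd rather stay safely under the limit.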

Polling for results

Don’t poll each job individually in sequence — that defeats the purpose. Instead, poll in batches and collect results as they complete:

async function collectResults(jobs) {
  const pending = new Map(jobs.map(j => [j.uuid, j.url]));
  const results = [];

  while (pending.size > 0) {
    // Check every pending job each cycle, 100 at a time,
    // so jobs beyond the first 100 aren't starved
    const uuids = [...pending.keys()];
    for (let i = 0; i < uuids.length; i += 100) {
      const batch = uuids.slice(i, i + 100);

      await Promise.all(
        batch.map(async (uuid) => {
          const res = await fetch(`https://api.seek-api.com/v1/jobs/${uuid}`, {
            headers: { 'X-Api-Key': API_KEY }
          }).then(r => r.json());

          if (res.status === 'completed') {
            results.push({ url: pending.get(uuid), data: res.result });
            pending.delete(uuid);
          } else if (res.status === 'failed') {
            results.push({ url: pending.get(uuid), error: res.error });
            pending.delete(uuid);
          }
        })
      );
    }

    console.log(`${pending.size} jobs remaining...`);
    if (pending.size > 0) await sleep(5000);
  }

  return results;
}

const results = await collectResults(jobs);
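Failed jobs land in the same array with an `error` field instead of `data`. A small helper (illustrative, not part of the API) splits them out so you can log the failures or feed their URLs back into a second submission pass:

```javascript
// Partition collected results into successes and failures.
// A result is a failure if it carries an `error` field.
function partitionResults(results) {
  const ok = results.filter((r) => !r.error);
  const failed = results.filter((r) => r.error);
  return { ok, failed };
}
```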

Using webhooks instead of polling

For 10K jobs, polling is manageable. For 100K+, use webhooks to avoid unnecessary API calls:

// Include webhook URL in each job submission
body: JSON.stringify({ 
  url,
  webhook: 'https://your-server.com/hooks/job-complete'
})

Your webhook endpoint receives the result as soon as each job finishes. No polling. No missed completions. Process each result as it arrives.

import express from 'express';
const app = express();
app.use(express.json()); // required so req.body is parsed JSON

app.post('/hooks/job-complete', (req, res) => {
  const { job_uuid, status, result, original_input } = req.body;
  if (status === 'completed') processResult(original_input.url, result);
  res.sendStatus(200);
});
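A webhook endpoint is a public URL, so it's worth verifying that requests actually come from the platform. If Seek API signs webhook payloads with a shared secret (the HMAC-SHA256 scheme below is an assumption — check the webhook docs for the actual header name and mechanism), verification might look like:

```javascript
import { createHmac, timingSafeEqual } from 'crypto';

// Verify an HMAC-SHA256 signature over the raw request body.
// The signing scheme here is an assumption, not a documented API.
function verifySignature(rawBody, signatureHex, secret) {
  const expected = createHmac('sha256', secret).update(rawBody).digest();
  const given = Buffer.from(signatureHex || '', 'hex');
  // Constant-time compare to avoid leaking the signature via timing
  return given.length === expected.length && timingSafeEqual(given, expected);
}
```

Reject the request with a 401 when verification fails, before doing any work with the payload.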

Throughput expectations

Job count    Avg job time    Total wall time (parallel)
100          5s              ~10s
1,000        5s              ~30s
10,000       5s              ~2–3 min
100,000      5s              ~15–20 min

The platform handles parallelism. You just submit and wait.

Saving results

Stream results to a file or database as they come in:

import { createWriteStream } from 'fs';
const output = createWriteStream('results.jsonl');

// Inside collectResults, write each result as it completes:
if (res.status === 'completed') {
  output.write(JSON.stringify({ url: pending.get(uuid), data: res.result }) + '\n');
  pending.delete(uuid);
}

JSONL (one JSON object per line) is ideal for large result sets — it’s streamable, appendable, and every major data tool reads it.

What this replaces

With this approach, you eliminate:

  • A Redis instance
  • A queue worker service
  • An autoscaling policy
  • A dead-letter queue handler
  • A monitoring dashboard for queue depth
  • A retry configuration

For a batch analysis that runs once a month, maintaining that infrastructure doesn’t make sense. For one that runs every day, the ops overhead compounds. Workers are the simpler, cheaper path in both cases.

Complete script

A complete, ready-to-run script for bulk processing is in the Seek API GitHub examples repository. It includes chunked submission, webhook support, and progress tracking.