Retry Strategies

When a job fails, Zeridion Flare automatically retries it with exponential backoff and jitter. You control how many times a job is retried and what happens when all attempts are exhausted.

How retries work

  1. A worker picks up a job and calls your ExecuteAsync method
  2. If ExecuteAsync throws an unhandled exception, the worker reports status: "failed" to the API via POST /v1/workers/ack
  3. The server checks whether AttemptNumber < MaxAttempts
  4. If retries remain: the job returns to Pending with a RunAt delay (exponential backoff + jitter)
  5. If retries are exhausted: the job moves to DeadLetter
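
The decision in steps 3 to 5 can be sketched as a single comparison. This is an illustrative sketch only, not the server's actual code: the `JobState` enum and `RetryDecision` class are hypothetical names, and only the `AttemptNumber < MaxAttempts` check comes from the steps above.

```csharp
using System;

// Hypothetical subset of job states, for illustration only.
public enum JobState { Pending, DeadLetter }

public static class RetryDecision
{
    // After a failed attempt: retry while AttemptNumber < MaxAttempts,
    // otherwise move the job to DeadLetter.
    public static JobState AfterFailure(int attemptNumber, int maxAttempts) =>
        attemptNumber < maxAttempts ? JobState.Pending : JobState.DeadLetter;
}
```

With MaxAttempts = 3, failures on attempts 1 and 2 return the job to Pending, and a failure on attempt 3 dead-letters it.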

Exponential backoff with jitter

The retry delay doubles with each attempt, starting at 15 seconds. A random jitter of 0–3 seconds is added to avoid a thundering herd when many jobs fail at the same time.

Formula: delay = 15s × 2^(attempt - 1) + random(0–3000ms)

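| Attempt | Base delay | Actual range |
|---|---|---|
| 1 | 15s | 15–18s |
| 2 | 30s | 30–33s |
| 3 | 60s | 60–63s |
| 4 | 120s | 120–123s |
| 5 | 240s (4 min) | 240–243s |
| 6 | 480s (8 min) | 480–483s |
| 7 | 960s (16 min) | 960–963s |
| 8 | 1920s (32 min) | 1920–1923s |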
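
To reproduce the table values yourself, the formula can be written out as a small helper. This is a minimal sketch that assumes the formula above is applied exactly as stated; `RetryBackoff` and its members are hypothetical names, not part of the Zeridion Flare SDK.

```csharp
using System;

public static class RetryBackoff
{
    private const double BaseSeconds = 15;
    private const int MaxJitterMs = 3000;

    // Deterministic part of the formula: 15s × 2^(attempt - 1), with a 1-based attempt.
    public static TimeSpan BaseDelay(int attempt)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        return TimeSpan.FromSeconds(BaseSeconds * Math.Pow(2, attempt - 1));
    }

    // Full delay: base delay plus a uniform random jitter of 0–3000 ms (inclusive).
    public static TimeSpan Delay(int attempt, Random rng) =>
        BaseDelay(attempt) + TimeSpan.FromMilliseconds(rng.Next(0, MaxJitterMs + 1));
}
```

For example, `RetryBackoff.BaseDelay(4)` yields 120 seconds, matching the fourth row of the table, and `RetryBackoff.Delay(4, new Random())` falls somewhere in the 120–123s range.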

With the default MaxAttempts = 3, a job gets three tries. Two backoff delays separate them (15s after the first failure and 30s after the second), so the job spends roughly 45–51 seconds in backoff before dead-lettering.

Configuring MaxAttempts

You can set the maximum number of attempts per class or per call, and the server then applies its own range limit. More specific settings override less specific ones.

Per-class default

Apply [JobConfig] to set a default for all enqueues of this job type:

[JobConfig(MaxAttempts = 5)]
public class SendWelcomeEmail : IJob<NewUserPayload>
{
    public async Task ExecuteAsync(NewUserPayload payload, JobContext ctx)
    {
        // Up to 5 attempts before dead letter
    }
}

Per-call override

Pass JobOptions when enqueuing to override the class default for a specific enqueue:

await jobs.EnqueueAsync<SendWelcomeEmail>(payload, new JobOptions
{
    MaxAttempts = 10
});

Precedence

| Level | How to set | Default |
|---|---|---|
| Per-call | new JobOptions { MaxAttempts = N } | |
| Per-class | [JobConfig(MaxAttempts = N)] | 3 |
| Server-side clamp | | 1–100 |

Resolution order: JobOptions (per-call) > [JobConfig] (per-class) > 3 (hardcoded default).

The server restricts the final value to the range 1–100. Values outside this range are reset to the default of 3.
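
The resolution order and range check can be sketched together in a few lines. This is an illustration under stated assumptions, not the server's implementation: `MaxAttemptsResolver` is a hypothetical name, a `null` argument stands for "not set at this level", and out-of-range values are reset to 3 as the text describes.

```csharp
using System;

public static class MaxAttemptsResolver
{
    public const int HardcodedDefault = 3;

    // perCall:  value from JobOptions, or null if none was passed.
    // perClass: value from [JobConfig], or null if the attribute is absent.
    public static int Resolve(int? perCall, int? perClass)
    {
        // Per-call wins over per-class, which wins over the hardcoded default.
        int requested = perCall ?? perClass ?? HardcodedDefault;

        // Values outside 1–100 are reset to 3.
        return (requested >= 1 && requested <= 100) ? requested : HardcodedDefault;
    }
}
```

Under these assumptions, `Resolve(10, 5)` returns 10, `Resolve(null, 5)` returns 5, and `Resolve(500, 5)` returns 3.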

AttemptNumber tracking

AttemptNumber starts at 0 on the job entity and is incremented to 1 when a worker first claims the job. Inside your ExecuteAsync, ctx.AttemptNumber gives the current attempt number.

Use it for conditional logic:

public async Task ExecuteAsync(PaymentPayload payload, JobContext ctx)
{
    if (ctx.AttemptNumber == ctx.MaxAttempts)
    {
        ctx.Logger.LogWarning("Final attempt for job {JobId}, alerting ops", ctx.JobId);
        await _alertService.NotifyAsync($"Job {ctx.JobId} on final attempt");
    }

    await ProcessPayment(payload, ctx.CancellationToken);
}

Dead letter

When AttemptNumber >= MaxAttempts after a failure, the job moves to DeadLetter:

  • State is set to DeadLetter
  • CompletedAt is set to the current time
  • Error details (ErrorType, ErrorMessage, ErrorStackTrace) are preserved from the last failure
  • Any child continuation jobs in Scheduled state are cancelled

Dead-lettered jobs remain in the database for inspection. They are not deleted or cleaned up automatically.

Querying dead letter jobs

GET /v1/jobs?state=dead_letter&limit=50
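
From .NET code, the same request could be issued with a plain HttpClient. The sketch below is hedged: `DeadLetterQuery` is a hypothetical helper, it assumes the caller passes an HttpClient whose BaseAddress and authentication headers are already configured for your Flare API, and it returns the raw response body because the response schema is not described on this page.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class DeadLetterQuery
{
    // Builds the relative request URI shown above for a given page size.
    public static string BuildUri(int limit = 50) =>
        $"/v1/jobs?state=dead_letter&limit={limit}";

    // Sends the GET request and returns the unparsed response body.
    // Assumption: http.BaseAddress and any auth headers are set by the caller.
    public static async Task<string> FetchAsync(HttpClient http, int limit = 50)
    {
        using var response = await http.GetAsync(BuildUri(limit));
        response.EnsureSuccessStatusCode();
        return await response.Content.ReadAsStringAsync();
    }
}
```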

Manual retry from dead letter

You can requeue a dead-lettered job via the API, SDK, or dashboard:

API

POST /v1/jobs/{id}/retry

This resets the job to Pending, clears error/worker/timing fields, and bumps MaxAttempts if the current AttemptNumber has already reached it.

SDK

var retried = await jobs.RetryAsync(jobId);
// returns true if requeued, false if job is not in a retryable state

Tip: RetryAsync returns false (instead of throwing) when the job is in a state that cannot be retried (e.g., Processing or Succeeded). No try/catch needed.

Dashboard

Click the Retry button on the job detail page to requeue a dead-lettered job with one click.

HTTP client retries (SDK to API)

The job-level retries described above are separate from the SDK's HTTP transport retries. The SDK registers its HTTP client with AddStandardResilienceHandler() from Microsoft.Extensions.Http.Resilience, which provides:

  • Retry — automatic retry with exponential backoff for transient HTTP failures (5xx, timeouts)
  • Circuit breaker — stops sending requests when the API is consistently failing
  • Timeout — per-request and total timeout enforcement

These transport-level retries protect against network blips and temporary API outages. They happen transparently before your code sees the response.

Best practices

  1. Keep jobs idempotent — since jobs may execute more than once, design ExecuteAsync so that re-running with the same payload produces the same result. Use database upserts, check-before-write, or idempotency keys on downstream calls.

  2. Use ctx.AttemptNumber for logging — always include the attempt number in your log messages so you can trace the retry history:

    ctx.Logger.LogInformation(
        "Attempt {Attempt}/{Max} for job {JobId}",
        ctx.AttemptNumber, ctx.MaxAttempts, ctx.JobId);

  3. Set reasonable timeouts — jobs without timeouts can run indefinitely and block the worker. Use [JobConfig(TimeoutSeconds = 300)] to cap execution time. The worker sends heartbeats at TimeoutSeconds / 3 intervals; if no heartbeat arrives, the job is marked stuck and reclaimed.

  4. Don't catch and swallow all exceptions — let unexpected exceptions bubble up so the retry engine can do its job. Only catch exceptions when you need to prevent retries (e.g., invalid input data that will never succeed).

  5. Monitor dead letter counts — use GET /v1/metrics/summary to track dead letter accumulation. A rising dead letter count signals a systemic issue.
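
As an illustration of the first practice, a check-before-write guard keyed on an idempotency key might look like the sketch below. `IdempotentRunner` is a hypothetical helper, and the in-memory set is an assumption made to keep the example self-contained; a real job would record completed keys in durable storage, such as a database table with a unique constraint.

```csharp
using System;
using System.Collections.Generic;

public sealed class IdempotentRunner
{
    // In-memory stand-in for a durable store of completed idempotency keys.
    private readonly HashSet<string> _completedKeys = new HashSet<string>();

    // Runs the side effect only if this key has not completed before.
    // The key is recorded after the side effect succeeds, so an exception
    // leaves the key unrecorded and a later retry runs the side effect again.
    public bool RunOnce(string idempotencyKey, Action sideEffect)
    {
        if (_completedKeys.Contains(idempotencyKey)) return false;
        sideEffect();
        _completedKeys.Add(idempotencyKey);
        return true;
    }
}
```

Calling `RunOnce` twice with the same key performs the side effect once: the first call returns true and the second returns false.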
