Retry Strategies
When a job fails, Zeridion Flare automatically retries it with exponential backoff and jitter. You control how many times a job is retried and what happens when all attempts are exhausted.
How retries work
- A worker picks up a job and calls your ExecuteAsync method
- If ExecuteAsync throws an unhandled exception, the worker reports status: "failed" to the API via POST /v1/workers/ack
- The server checks whether AttemptNumber < MaxAttempts
- If retries remain: the job returns to Pending with a RunAt delay (exponential backoff + jitter)
- If retries are exhausted: the job moves to DeadLetter
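The decision in the last three steps can be sketched as follows. This is a minimal illustration, not Flare's server code: the JobRecord type, the FailureHandling class, and the OnFailed method are hypothetical names, and only the state names, AttemptNumber, MaxAttempts, and RunAt come from the description above.

```csharp
using System;

// Hypothetical model of a job row; only the fields the retry decision needs.
public enum JobState { Pending, Processing, Succeeded, DeadLetter }

public sealed class JobRecord
{
    public JobState State { get; set; } = JobState.Processing;
    public int AttemptNumber { get; set; }
    public int MaxAttempts { get; set; } = 3;
    public DateTimeOffset? RunAt { get; set; }
}

public static class FailureHandling
{
    // Hypothetical handler for a "failed" ack from a worker.
    // retryDelay is the already-computed backoff + jitter for this attempt.
    public static void OnFailed(JobRecord job, DateTimeOffset now, TimeSpan retryDelay)
    {
        if (job.AttemptNumber < job.MaxAttempts)
        {
            // Retries remain: requeue, but not before the delay has elapsed.
            job.State = JobState.Pending;
            job.RunAt = now + retryDelay;
        }
        else
        {
            // Retries exhausted: park the job for inspection.
            job.State = JobState.DeadLetter;
        }
    }
}
```

With MaxAttempts = 3, a failure on attempt 1 or 2 requeues the job, and a failure on attempt 3 dead-letters it.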
Exponential backoff with jitter
The retry delay doubles with each attempt, starting at 15 seconds. A random jitter of 0–3 seconds is added to each delay so that jobs which fail at the same moment do not all retry at the same moment (the thundering herd problem).
Formula: delay = 15s × 2^(attempt - 1) + random(0–3000ms)
| Attempt | Base delay | Actual range |
|---|---|---|
| 1 | 15s | 15–18s |
| 2 | 30s | 30–33s |
| 3 | 60s | 60–63s |
| 4 | 120s | 120–123s |
| 5 | 240s (4 min) | 240–243s |
| 6 | 480s (8 min) | 480–483s |
| 7 | 960s (16 min) | 960–963s |
| 8 | 1920s (32 min) | 1920–1923s |
With the default MaxAttempts = 3, a job gets three tries spanning approximately 1.5 minutes of total backoff before dead-lettering.
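The formula and table above can be reproduced with a short sketch. The RetryBackoff class and its method names are hypothetical; only the 15-second base, the doubling, and the 0–3000 ms jitter come from the text.

```csharp
using System;

public static class RetryBackoff
{
    private const double BaseSeconds = 15;
    private const int MaxJitterMs = 3000;

    // Base delay for a 1-based attempt number: 15s * 2^(attempt - 1).
    public static TimeSpan BaseDelayFor(int attempt)
    {
        if (attempt < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(attempt), "Attempt numbers start at 1.");
        }

        return TimeSpan.FromSeconds(BaseSeconds * Math.Pow(2, attempt - 1));
    }

    // Base delay plus a random jitter of 0..3000 ms (upper bound inclusive).
    public static TimeSpan DelayFor(int attempt, Random rng)
    {
        // Random.Next's upper bound is exclusive, hence the + 1.
        var jitter = TimeSpan.FromMilliseconds(rng.Next(0, MaxJitterMs + 1));
        return BaseDelayFor(attempt) + jitter;
    }
}
```

The text does not say whether the delay is capped, so this sketch applies no cap. It is only meaningful for small attempt numbers such as those in the table; very large attempt numbers would overflow TimeSpan.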
Configuring MaxAttempts
You can set the maximum number of attempts per class or per call. More specific settings override less specific ones, and the server enforces an allowed range on the final value.
Per-class default
Apply [JobConfig] to set a default for all enqueues of this job type:
[JobConfig(MaxAttempts = 5)]
public class SendWelcomeEmail : IJob<NewUserPayload>
{
    public async Task ExecuteAsync(NewUserPayload payload, JobContext ctx)
    {
        // Up to 5 attempts before dead letter
    }
}
Per-call override
Pass JobOptions when enqueuing to override the class default for a specific enqueue:
await jobs.EnqueueAsync<SendWelcomeEmail>(payload, new JobOptions
{
    MaxAttempts = 10
});
Precedence
| Level | How to set | Default |
|---|---|---|
| Per-call | new JobOptions { MaxAttempts = N } | — |
| Per-class | [JobConfig(MaxAttempts = N)] | 3 |
| Server-side clamp | — | 1–100 |
Resolution order: JobOptions (per-call) > [JobConfig] (per-class) > 3 (hardcoded default).
The server enforces the range 1–100 on the final value. A value outside this range is reset to the default of 3, not to the nearest bound.
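The resolution order and range check can be sketched in a few lines. MaxAttemptsResolver and Resolve are hypothetical names used for illustration; the nullable parameters stand in for "not set" at the per-call and per-class levels.

```csharp
public static class MaxAttemptsResolver
{
    public const int DefaultMaxAttempts = 3;
    public const int MinAllowed = 1;
    public const int MaxAllowed = 100;

    // perCall:  value from JobOptions, or null when the caller did not set one.
    // perClass: value from [JobConfig], or null when the class has no attribute value.
    public static int Resolve(int? perCall, int? perClass)
    {
        // Most specific setting wins; fall back to the hardcoded default.
        int requested = perCall ?? perClass ?? DefaultMaxAttempts;

        // Out-of-range values are reset to the default, as described above.
        if (requested < MinAllowed || requested > MaxAllowed)
        {
            return DefaultMaxAttempts;
        }

        return requested;
    }
}
```

For example, Resolve(10, 5) yields 10, Resolve(null, 5) yields 5, and Resolve(500, 5) yields 3 because 500 is outside the allowed range.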
AttemptNumber tracking
AttemptNumber starts at 0 on the job entity and is incremented to 1 when a worker first claims the job. Inside your ExecuteAsync, ctx.AttemptNumber gives the current attempt number.
Use it for conditional logic:
public async Task ExecuteAsync(PaymentPayload payload, JobContext ctx)
{
    // Last attempt: a failure here sends the job to DeadLetter, so alert ops first.
    if (ctx.AttemptNumber == ctx.MaxAttempts)
    {
        ctx.Logger.LogWarning("Final attempt for job {JobId}, alerting ops", ctx.JobId);
        await _alertService.NotifyAsync($"Job {ctx.JobId} on final attempt");
    }

    await ProcessPayment(payload, ctx.CancellationToken);
}
Dead letter
When AttemptNumber >= MaxAttempts after a failure, the job moves to DeadLetter:
- State is set to DeadLetter
- CompletedAt is set to the current time
- Error details (ErrorType, ErrorMessage, ErrorStackTrace) are preserved from the last failure
- Any child continuation jobs in Scheduled state are cancelled
Dead-lettered jobs remain in the database for inspection. They are not deleted or cleaned up automatically.
Querying dead letter jobs
GET /v1/jobs?state=dead_letter&limit=50
Manual retry from dead letter
You can requeue a dead-lettered job via the API, SDK, or dashboard:
API
POST /v1/jobs/{id}/retry
This resets the job to Pending, clears error/worker/timing fields, and bumps MaxAttempts if the current AttemptNumber has already reached it.
SDK
var retried = await jobs.RetryAsync(jobId);
// returns true if requeued, false if job is not in a retryable state
RetryAsync returns false (instead of throwing) when the job is in a state that cannot be retried (e.g., Processing or Succeeded), so you do not need a try/catch to handle that case.
Dashboard
Click the Retry button on the job detail page to requeue a dead-lettered job with one click.
HTTP client retries (SDK to API)
The job-level retries described above are separate from the SDK's HTTP transport retries. The SDK registers its HTTP client with AddStandardResilienceHandler() from Microsoft.Extensions.Http.Resilience, which provides:
- Retry — automatic retry with exponential backoff for transient HTTP failures (5xx, timeouts)
- Circuit breaker — stops sending requests when the API is consistently failing
- Timeout — per-request and total timeout enforcement
These transport-level retries protect against network blips and temporary API outages. They happen transparently before your code sees the response.
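For reference, this is roughly what registering an HTTP client with the standard resilience handler looks like in your own code. It is a sketch of the general pattern, not the SDK's actual registration: the client name "flare-api", the FlareHttpSetup class, and the AddFlareHttpClient method are hypothetical, and the code needs the Microsoft.Extensions.Http.Resilience package.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;

public static class FlareHttpSetup
{
    // Hypothetical helper: registers a named HttpClient and attaches the
    // standard resilience pipeline (retry, circuit breaker, timeouts).
    public static IServiceCollection AddFlareHttpClient(
        this IServiceCollection services, Uri baseAddress)
    {
        services
            .AddHttpClient("flare-api", client => client.BaseAddress = baseAddress)
            .AddStandardResilienceHandler();

        return services;
    }
}
```

Because the handler sits inside the HTTP pipeline, a call that succeeds on the second transport attempt looks like a single successful call to the code that issued it.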
Best practices
- Keep jobs idempotent: since jobs may execute more than once, design ExecuteAsync so that re-running with the same payload produces the same result. Use database upserts, check-before-write, or idempotency keys on downstream calls.
- Use ctx.AttemptNumber for logging: always include the attempt number in your log messages so you can trace the retry history:
      ctx.Logger.LogInformation(
          "Attempt {Attempt}/{Max} for job {JobId}",
          ctx.AttemptNumber, ctx.MaxAttempts, ctx.JobId);
- Set reasonable timeouts: jobs without timeouts can run indefinitely and block the worker. Use [JobConfig(TimeoutSeconds = 300)] to cap execution time. The worker sends heartbeats at TimeoutSeconds / 3 intervals; if no heartbeat arrives, the job is marked stuck and reclaimed.
- Don't catch and swallow all exceptions: let unexpected exceptions bubble up so the retry engine can do its job. Only catch exceptions when you need to prevent retries (e.g., invalid input data that will never succeed).
- Monitor dead letter counts: use GET /v1/metrics/summary to track dead letter accumulation. A rising dead letter count signals a systemic issue.
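The idempotency advice can be illustrated with a check-before-write sketch. Everything here is hypothetical: IdempotentChargeHandler is not part of the SDK, and the in-memory dictionary stands in for a durable store such as a table with a unique key. In a real job, a value like ctx.JobId is one plausible choice of idempotency key.

```csharp
using System.Collections.Concurrent;
using System.Threading;

public sealed class IdempotentChargeHandler
{
    // In-memory stand-in for a durable store keyed by idempotency key.
    private readonly ConcurrentDictionary<string, decimal> _chargesByKey = new();
    private int _sideEffects;

    // Number of times the side effect actually ran.
    public int SideEffects => Volatile.Read(ref _sideEffects);

    // Returns true if the charge was performed, false if an earlier
    // attempt with the same key already performed it.
    public bool Charge(string idempotencyKey, decimal amount)
    {
        // TryAdd is atomic: only the first caller for a given key succeeds.
        if (!_chargesByKey.TryAdd(idempotencyKey, amount))
        {
            return false;
        }

        // The real side effect (e.g. calling a payment provider) would go here.
        Interlocked.Increment(ref _sideEffects);
        return true;
    }
}
```

Calling Charge twice with the same key performs the side effect once, which is the behaviour you want when a retry re-runs a job that had partly succeeded.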
See also
- Error Handling — exception types and catch patterns
- Idempotency — preventing duplicate work across retries
- JobConfigAttribute — class-level MaxAttempts and TimeoutSeconds
- JobOptions — per-call MaxAttempts override