Backup Flow
The full backup flow is coordinated by BackupManager, but data ownership stays inside each module.
BackupManager does not read module storage directly. It stores backup configuration, starts runs, writes manifests, validates produced artifacts, classifies run status, and exposes run history through TCP methods. Modules register backup callbacks and are responsible for flushing their own pending state before producing a dump.
Lifecycle
- Modules initialize and register backup participants through the backup registry.
- `BackupManager` loads backup configuration and existing run manifests during server initialization.
- If stale directories exist under `backups/.running`, they are treated as failed runs and moved to `backups/failed`.
- A full run starts from cron or from `StartFullBackup`.
- The manager creates `backups/.running/<run_id>` and writes the initial `manifest.json` with status `running`.
- Participants are executed sequentially in deterministic name order.
- Each participant flushes or drains its own cache/queue state and writes one or more artifacts into the run directory.
- The manager validates every artifact: file exists, file is not empty, SHA-256 is calculated, and storage integrity checks pass where applicable.
- The manifest is updated after every participant.
- The manager classifies the run as `completed`, `partial`, or `failed`.
- The final run directory is published under `backups/completed/<run_id>` or `backups/failed/<run_id>`.
- Validated artifacts and the manifest are published to enabled remote backup endpoints.
- Run history is available through `GetBackupRuns` and `GetBackupRun`.
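The local part of the lifecycle above can be sketched roughly as follows. This is a minimal illustration, not the actual `BackupManager` code: the `run_backup` function and the `Participant` callback signature are hypothetical, every participant is treated as critical, and artifact validation and remote publishing are left out.

```python
import json
import shutil
from pathlib import Path
from typing import Callable, Dict

# Hypothetical callback shape: a participant receives the run directory,
# writes its artifacts there, and raises on failure.
Participant = Callable[[Path], None]


def run_backup(root: Path, run_id: str, participants: Dict[str, Participant]) -> Path:
    """Run participants in name order and publish the run directory."""
    run_dir = root / ".running" / run_id
    run_dir.mkdir(parents=True)
    manifest = {"run_id": run_id, "status": "running", "participants": []}
    manifest_path = run_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))

    for name in sorted(participants):  # deterministic name order
        entry = {"name": name, "status": "completed", "error": ""}
        try:
            participants[name](run_dir)
        except Exception as exc:
            entry["status"] = "failed"
            entry["error"] = str(exc)
        manifest["participants"].append(entry)
        # The manifest is rewritten after every participant.
        manifest_path.write_text(json.dumps(manifest, indent=2))

    # Simplification: any failure fails the run (all participants critical).
    any_failed = any(p["status"] == "failed" for p in manifest["participants"])
    manifest["status"] = "failed" if any_failed else "completed"
    manifest_path.write_text(json.dumps(manifest, indent=2))

    # Publish by moving the run directory out of .running.
    final_dir = root / manifest["status"] / run_id
    final_dir.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(run_dir), str(final_dir))
    return final_dir
```

Keeping the run under `.running` until it is classified means an interrupted process leaves an identifiable stale directory behind, which matches the stale-run handling described in the list.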
During the run, the manager also updates a runtime progress snapshot. Clients
can read it with GetBackupProgress or subscribe to the backup.progress
channel for stream updates.
Manifest
Every run writes manifest.json.
```json
{
  "run_id": "20260629-121045-000001",
  "type": "full",
  "backup_format_version": 1,
  "trigger": "manual",
  "status": "completed",
  "started_at": 1782727845,
  "finished_at": 1782727852,
  "path": "backups/completed/20260629-121045-000001",
  "server_version": "2.0.0",
  "error_summary": "",
  "requested_participants": [],
  "participants": [
    {
      "name": "accounts",
      "critical": true,
      "status": "completed",
      "started_at": 1782727845,
      "finished_at": 1782727846,
      "error": "",
      "artifacts": [
        {
          "logical_name": "accounts.snapshot",
          "file": "accounts.snapshot",
          "path": "backups/completed/20260629-121045-000001/accounts.snapshot",
          "size_bytes": 327680,
          "sha256": "f2a1...",
          "integrity_ok": true,
          "integrity_error": "",
          "validation_ok": true,
          "validation_error": ""
        }
      ]
    }
  ]
}
```
The TCP run response uses the same data model, except that the artifact file name is represented through `path` and `logical_name`.
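As an illustration of how a client or operator script might consume this data model, the sketch below walks a parsed manifest and collects artifacts whose checks did not pass. The `failed_artifacts` helper is hypothetical; it relies only on the field names shown in the example manifest.

```python
import json
from typing import List


def failed_artifacts(manifest: dict) -> List[str]:
    """Return paths of artifacts whose validation or integrity check failed."""
    bad = []
    for participant in manifest.get("participants", []):
        for artifact in participant.get("artifacts", []):
            ok = artifact.get("validation_ok") and artifact.get("integrity_ok")
            if not ok:
                # Fall back to logical_name when no path is present.
                bad.append(artifact.get("path") or artifact.get("logical_name", ""))
    return bad


if __name__ == "__main__":
    sample = json.loads(
        '{"participants": [{"name": "accounts", "artifacts": ['
        '{"logical_name": "accounts.snapshot", "path": "x/accounts.snapshot",'
        ' "integrity_ok": true, "validation_ok": false}]}]}'
    )
    print(failed_artifacts(sample))
```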
Critical Participants
Participants can be critical or optional. Current production participants are all critical:
| Participant | Artifact | Critical |
|---|---|---|
| `accounts` | `accounts.snapshot` | Yes |
| `symbols` | `symbols.snapshot` | Yes |
| `staff` | `staff.snapshot` | Yes |
| `trades` | `trades.snapshot` | Yes |
If a critical participant fails, the run status becomes failed. If only optional participants fail, the run status becomes partial.
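The classification rule can be expressed as a small function. This is a sketch of the rule as stated above, with a hypothetical `classify_run` name and input shape; it is not taken from the manager's source.

```python
from typing import Iterable, Tuple


def classify_run(results: Iterable[Tuple[bool, bool]]) -> str:
    """Classify a run from (critical, succeeded) pairs, one per participant."""
    status = "completed"
    for critical, succeeded in results:
        if succeeded:
            continue
        if critical:
            # Any critical failure fails the whole run.
            return "failed"
        # Only optional participants have failed so far.
        status = "partial"
    return status
```

Because all current production participants are critical, a run in the current setup ends as either `completed` or `failed`; `partial` only becomes reachable once optional participants are registered.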
Scoped Backups
StartFullBackup accepts an optional participants array. When omitted, the run is a full backup and executes all currently registered participants. When provided, the run is scoped and only the requested participants are executed.
Example scoped request:
```json
{
  "command": "StartFullBackup",
  "extID": "1",
  "data": {
    "participants": ["trades"]
  }
}
```
A scoped run writes type: "scoped" and requested_participants in the run response and manifest. Scoped backups are intended for module-level operational workflows, such as taking a trades snapshot before import or before a module-scoped restore. They are not full disaster restore points.
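A client might build the full and scoped variants of the request with one helper, as in the sketch below. The `build_start_full_backup` function is hypothetical, and it assumes that a full backup can be requested by sending an empty `data` object; transport framing over TCP/WS is not covered.

```python
import json
from typing import Optional, Sequence


def build_start_full_backup(ext_id: str,
                            participants: Optional[Sequence[str]] = None) -> str:
    """Serialize a StartFullBackup request; omit participants for a full run."""
    data = {}
    if participants is not None:
        # Scoped run: only the listed participants are executed.
        data["participants"] = list(participants)
    request = {"command": "StartFullBackup", "extID": ext_id, "data": data}
    return json.dumps(request)


if __name__ == "__main__":
    print(build_start_full_backup("1", ["trades"]))
```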
Validation
For every artifact, the manager stores:
| Field | Description |
|---|---|
| `size_bytes` | Artifact size in bytes |
| `sha256` | SHA-256 checksum |
| `integrity_ok` | Participant-specific integrity check result |
| `integrity_error` | Integrity check error text |
| `validation_ok` | Overall artifact validation result |
| `validation_error` | File, checksum, or validation error text |
Large runtime stores are copied in bounded chunks. This avoids copying a whole multi-gigabyte artifact in one step and allows the process to release internal cache and allocator memory after the dump and validation phases.
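The file-level checks (exists, not empty, SHA-256) can be done in bounded chunks along the lines of the sketch below. The `validate_artifact` helper and the 1 MiB chunk size are illustrative assumptions; participant-specific integrity checks are not modeled.

```python
import hashlib
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB; illustrative, not the manager's actual value


def validate_artifact(path: Path) -> dict:
    """Check that a file exists and is non-empty, hashing it chunk by chunk."""
    result = {"size_bytes": 0, "sha256": "",
              "validation_ok": False, "validation_error": ""}
    if not path.is_file():
        result["validation_error"] = "file does not exist"
        return result
    size = path.stat().st_size
    if size == 0:
        result["validation_error"] = "file is empty"
        return result
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read bounded chunks so memory use stays flat for large artifacts.
        for chunk in iter(lambda: fh.read(CHUNK_SIZE), b""):
            digest.update(chunk)
    result.update(size_bytes=size, sha256=digest.hexdigest(), validation_ok=True)
    return result
```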
Runtime Progress
Backup progress is designed for long-running dumps and UI progress bars.
The progress phases are:
| Phase | Description |
|---|---|
| `idle` | No active run has published progress yet |
| `backup` | A participant is creating artifacts |
| `validate` | The manager is validating participant artifacts |
| `finalize` | The local run is being classified and published |
| `publish_remote` | Validated artifacts are being sent to configured backup servers |
| `done` | Backup finished successfully or partially |
| `failed` | Backup failed |
Use:
- `GetBackupProgress` for polling.
- `Subscribe` with `chanels: ["backup.progress"]` for push updates. The current TCP API field name is `chanels`; `channels` is not accepted.
Use a separate TCP/WS session for progress while another session waits for the
synchronous StartFullBackup command. The stream payload uses the same fields
as GetBackupProgress, including overall_percent, participants_done,
participants_total, remote_done, and remote_total.
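A progress client could build the subscription request and render updates roughly as sketched below. Several details are assumptions rather than documented behavior: that `Subscribe` uses the same `command`/`extID`/`data` envelope as `StartFullBackup`, that `chanels` sits inside `data`, and that the current phase is carried in a field named `phase`. Both helper functions are hypothetical.

```python
import json


def build_progress_subscribe(ext_id: str) -> str:
    """Serialize a Subscribe request for the backup.progress channel."""
    # Assumption: same envelope as StartFullBackup. The field name is
    # "chanels" (sic), as required by the current TCP API.
    request = {"command": "Subscribe", "extID": ext_id,
               "data": {"chanels": ["backup.progress"]}}
    return json.dumps(request)


def format_progress(payload: dict) -> str:
    """Render a progress payload as one status line for a UI or log."""
    # Assumption: the phase is exposed under the key "phase".
    return "{}: {}% (participants {}/{}, remote {}/{})".format(
        payload.get("phase", "idle"),
        payload.get("overall_percent", 0),
        payload.get("participants_done", 0),
        payload.get("participants_total", 0),
        payload.get("remote_done", 0),
        payload.get("remote_total", 0),
    )
```

The same `format_progress` helper works for both polling and stream updates, since both are described as carrying the same fields.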
Restore Flow
Restore is participant-based like backup. BackupManager validates the selected
backup run first, then calls restore callbacks registered by each module.
Current restore flow:
- Client calls `ValidateBackupRun` to check manifest and artifacts.
- Client calls `StartRestoreRun` with `run_id`, optional `dry_run`, and optional `participants`.
- `BackupManager` blocks concurrent backup/restore operations.
- For actual restore, `BackupManager` requires maintenance/update mode unless `force` is explicitly set.
- The source run is validated again before restore starts. Only `completed` backup runs are restorable; `partial`, `failed`, `running`, `cancelled`, and `timeout` runs are rejected.
- For actual restore, the manager creates a pre-restore safety snapshot under `backups/pre-restore/<restore_id>`.
- If any selected critical participant cannot produce a valid safety snapshot, restore does not start.
- Registered participants are executed in deterministic name order.
- Each participant must restore its own data and rebuild or refresh its own runtime state.
- If a participant does not provide a restore callback, the restore run is marked failed.
- Restore progress and final status are written to `backups/restores/<restore_id>/manifest.json`.
BackupManager must not directly replace module storage files because modules own their writers, queues, caches, and domain invariants.
Current restore callback coverage:
| Participant | Artifact | Restore state |
|---|---|---|
| `staff` | `staff.snapshot` | Implemented through module-controlled restore and cache refresh |
| `symbols` | `symbols.snapshot` | Implemented through module-controlled restore and cache refresh |
| `accounts` | `accounts.snapshot` | Implemented through runtime drain, module-controlled restore, and cache refresh |
| `trades` | `trades.snapshot` | Implemented through runtime drain, module-controlled restore, and cache refresh |
Use participants when only part of the backup should be restored, for example
["staff"], ["symbols"], ["accounts"], or a combination of participants.
Omitting participants attempts full restore for all currently registered
participants.
dry_run is allowed in any server mode. Actual restore requires
liveupdate_mode != 0, or an explicit force: true request for controlled
maintenance operations.
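The gating rules for starting a restore can be summarized in a small predicate. This is a sketch with a hypothetical `restore_rejection_reason` helper; it assumes that the `completed`-only rule also applies to dry runs, which the text above does not state explicitly.

```python
def restore_rejection_reason(run_status: str, dry_run: bool,
                             liveupdate_mode: int, force: bool) -> str:
    """Return an empty string if the restore may start, else the reason."""
    # Assumption: non-completed runs are rejected for dry runs as well.
    if run_status != "completed":
        return "run status '{}' is not restorable".format(run_status)
    if dry_run:
        # Dry runs are allowed in any server mode.
        return ""
    if liveupdate_mode != 0 or force:
        return ""
    return "actual restore requires maintenance/update mode or force"
```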
Restore history is available through GetRestoreRuns and GetRestoreRun.
The restore response and manifest include pre_restore_path; participant
results include safety_snapshot_path and safety_artifacts.
Remote Restore Flow
Remote backup servers are used as offsite storage. Restore still runs through the local backup catalog:
- Call `GetRemoteBackupRuns` for the configured endpoint.
- Select a remote `run_id`.
- Call `ImportRemoteBackupRun` to download the manifest and artifacts into `backups/completed/<run_id>`.
- Call `ValidateBackupRun`.
- Call `StartRestoreRun`.
Operational Notes
- `StartFullBackup` is synchronous in the current TCP handler: the response is returned after the run finishes or fails.
- Backup progress is available during the synchronous call through `GetBackupProgress` and `backup.progress`; use a separate TCP/WS client for progress while the caller waits for `StartFullBackup`.
- After a successful run, `backup.progress` emits the final `done` state and `GetBackupProgress` returns idle state on later polling.
- Only one full backup run can execute at a time.
- `StartRestoreRun` is also synchronous in the current TCP handler.
- `StopBackupServer` is currently a compatibility method and returns HTTP `501`.
- Additional participant coverage, retention, runtime cancellation, and object-level fast backups are future architecture work.