Storage

The storage layer persists every data type to Parquet with nanosecond timestamps, provenance and per-type deduplication, and keeps an append-only run history in SQLite. ParquetStore is the unified read/write interface; you rarely touch it directly — dccd.Client and the operations use it for you.

Directory layout

{data_path}/{exchange}/ohlc/{pair}/{span}/YYYY.parquet
{data_path}/{exchange}/trades/{pair}/YYYY-MM-DD.parquet
{data_path}/{exchange}/orderbook/{pair}/YYYY-MM-DD.parquet

exchange — lowercase (binance, kraken, …).
pair — BTC-USDT (slash replaced by hyphen).
span — seconds label (3600s); OHLC only.
OHLC files are annual; trades and order-book files are daily.
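
The layout rules above can be sketched as a small path builder. This is only an illustration: `dataset_file` and its arguments are hypothetical names, not part of dccd, and the real store may assemble paths differently.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def dataset_file(data_path, exchange, data_type, pair, ts_ns, span=None):
    """Hypothetical helper: map a nanosecond UTC timestamp to its file path."""
    # Whole seconds are enough to pick the year or the day of the file.
    when = datetime.fromtimestamp(ts_ns // 1_000_000_000, tz=timezone.utc)
    # Exchange is lowercased and the slash in the pair becomes a hyphen.
    base = PurePosixPath(data_path) / exchange.lower() / data_type / pair.replace("/", "-")
    if data_type == "ohlc":
        # OHLC adds a span directory ("3600s") and uses one file per year.
        return base / f"{span}s" / f"{when:%Y}.parquet"
    # Trades and order books use one file per day.
    return base / f"{when:%Y-%m-%d}.parquet"


ts_ns = 1_700_000_000 * 1_000_000_000  # 2023-11-14 22:13:20 UTC
ohlc_path = dataset_file("/data/crypto", "Binance", "ohlc", "BTC/USDT", ts_ns, span=3600)
trades_path = dataset_file("/data/crypto", "Binance", "trades", "BTC/USDT", ts_ns)
print(ohlc_path)    # /data/crypto/binance/ohlc/BTC-USDT/3600s/2023.parquet
print(trades_path)  # /data/crypto/binance/trades/BTC-USDT/2023-11-14.parquet
```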

Schema & integrity

All timestamps are TS — nanoseconds UTC (int64). Each write merges into the existing file and deduplicates on the natural key per data type:

Data type	Dedup key	Notes
OHLC	`TS`	One bar per span window.
trades	`tid` (else `TS, price, amount, side`)	Many trades share a `TS` — keying on `TS` alone would lose them.
order book	`TS, side, price`	A snapshot’s levels all share one `TS`.
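
To see why the key matters, here is a minimal sketch of merge-and-deduplicate over plain dicts. `merge_dedup` is a hypothetical helper, and keeping the most recent row per key is an assumption; the real store works on Parquet data and may resolve duplicates differently.

```python
def merge_dedup(existing, incoming, key):
    """Merge two lists of row dicts, keeping the last row seen for each key."""
    merged = {}
    for row in [*existing, *incoming]:
        # Later rows overwrite earlier rows that share the same key (assumption).
        merged[tuple(row[k] for k in key)] = row
    return sorted(merged.values(), key=lambda r: r["TS"])


# Two distinct trades that happen to share one timestamp and carry no tid.
trades = [
    {"TS": 100, "price": 10.0, "amount": 1.0, "side": "buy"},
    {"TS": 100, "price": 10.5, "amount": 2.0, "side": "sell"},
]

by_ts = merge_dedup([], trades, key=("TS",))
by_fallback = merge_dedup([], trades, key=("TS", "price", "amount", "side"))
print(len(by_ts))        # 1: keying on TS alone drops a trade
print(len(by_fallback))  # 2: the fallback key keeps both
```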

Writes are atomic (temp file + os.replace) and serialised per file, so a reader never sees a half-written Parquet and concurrent writers can’t corrupt it.
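
The temp file plus os.replace pattern can be sketched as follows. `atomic_write` and the lock registry are hypothetical names for illustration, the payload is arbitrary bytes rather than real Parquet, and the in-process lock shown here only serialises threads within one process.

```python
import os
import tempfile
import threading

_registry_guard = threading.Lock()
_file_locks = {}  # hypothetical registry: absolute path -> lock


def _lock_for(path):
    """Return the lock dedicated to one target file, creating it on first use."""
    with _registry_guard:
        return _file_locks.setdefault(path, threading.Lock())


def atomic_write(path, payload):
    """Write payload to path so readers see the old file or the new one, never a mix."""
    target = os.path.abspath(path)
    with _lock_for(target):
        # The temp file lives in the target directory so os.replace stays on one filesystem.
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target), suffix=".tmp")
        try:
            with os.fdopen(fd, "wb") as fh:
                fh.write(payload)
                fh.flush()
                os.fsync(fh.fileno())
            os.replace(tmp, target)  # atomic swap of the finished file into place
        except BaseException:
            if os.path.exists(tmp):
                os.unlink(tmp)
            raise


demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "2024.parquet")
atomic_write(demo_file, b"first version")
atomic_write(demo_file, b"second version")
```

After both calls the directory holds only the target file, containing the second payload; no temp file is left behind.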

Reading data

from dccd.storage.parquet import ParquetStore
from dccd.domain.dataset import DatasetId
from dccd.domain.symbol import Symbol
from dccd.domain.types import DataType

store = ParquetStore("/data/crypto")
ds = DatasetId(exchange="binance", symbol=Symbol(base="BTC", quote="USDT"),
               data_type=DataType.OHLC, span=3600)
df = store.load(ds)          # a polars.DataFrame, sorted by TS

ParquetStore

class ParquetStore(data_path)

Read/write interface for a single DatasetId.

All timestamps (TS) are nanoseconds UTC (int64).

Parameters:

data_path (str or Path): Root directory for all data files.

Examples

>>> import pathlib, tempfile
>>> from dccd.domain.dataset import DatasetId
>>> from dccd.domain.symbol import Symbol
>>> from dccd.domain.types import DataType
>>> from dccd.storage.parquet import ParquetStore
>>> store = ParquetStore('/tmp/data')

directory(ds): Return the directory for ds, creating it if needed.

inventory()

Return list of dataset info dicts for all stored data.

Each entry includes min_ts / max_ts (nanoseconds UTC) and rows so the UI can display the actual data time range.

last_timestamp(ds): Return last TS in ns, or None if no data.

load(ds, start_ns=None, end_ns=None): Load data for ds in the given nanosecond range.
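
Because the range arguments are integer nanoseconds, a conversion helper is handy when starting from datetimes. `to_ns` is a hypothetical helper, not part of dccd; it uses integer arithmetic to avoid float rounding. Whether end_ns is inclusive is not stated here, so treat the bounds as approximate until checked.

```python
from datetime import datetime, timezone

_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_ns(dt):
    """Convert a timezone-aware datetime to integer nanoseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = dt - _EPOCH
    # timedelta stores days, seconds and microseconds as exact integers.
    return (delta.days * 86_400 + delta.seconds) * 1_000_000_000 + delta.microseconds * 1_000


start_ns = to_ns(datetime(2024, 1, 1, tzinfo=timezone.utc))
end_ns = to_ns(datetime(2024, 2, 1, tzinfo=timezone.utc))
print(start_ns)  # 1704067200000000000

# With `store` and `ds` built as in "Reading data":
# df = store.load(ds, start_ns=start_ns, end_ns=end_ns)
```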

missing_intervals(ds, start_ns, end_ns): Return gaps as (start_ns, end_ns) pairs within [start_ns, end_ns].
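
As a rough illustration of gap detection on fixed-span data, the sketch below walks a set of timestamps on a regular grid. `find_gaps` is hypothetical, and it assumes grid-aligned timestamps and inclusive gap bounds; the actual method may define gaps differently, especially for trades and order books.

```python
def find_gaps(timestamps, start_ns, end_ns, step_ns):
    """Return inclusive (gap_start, gap_end) pairs with no timestamp on the step_ns grid."""
    gaps = []
    cursor = start_ns  # next timestamp we expect to see
    for ts in sorted(t for t in timestamps if start_ns <= t <= end_ns):
        if ts > cursor:
            # Everything from cursor up to the slot just before ts is missing.
            gaps.append((cursor, ts - step_ns))
        cursor = max(cursor, ts + step_ns)
    if cursor <= end_ns:
        gaps.append((cursor, end_ns))
    return gaps


# Small integers stand in for nanosecond timestamps; the step is 10.
gaps = find_gaps([0, 10, 40], start_ns=0, end_ns=60, step_ns=10)
print(gaps)  # [(20, 30), (50, 60)]
```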

static read_provenance(file_path): Return the Provenance stored in a Parquet file, if any.

property root: Root directory of the local store.

save(ds, records, provenance=None)

Write records to Parquet, merging with existing data.

Parameters:

ds (DatasetId)
records (list): OHLCBar, Trade, or OrderBookSnapshot objects.
provenance (Provenance or None)

Returns:

int: Number of rows written.

Run history

class RunsStore(db_path)

Append-only SQLite store for JobRun records.

Parameters:

db_path (str or Path): Path to the SQLite database file. Created if absent.

Examples

>>> import tempfile, os
>>> with tempfile.NamedTemporaryFile(suffix='.db', delete=False) as f:
...     path = f.name
>>> store = RunsStore(path)
>>> store.create_run('r1', 'backfill:binance:BTC/USDT:ohlc', 'backfill', 'binance', 'BTC/USDT', 'ohlc')
>>> os.unlink(path)

active_runs(): Runs currently running or reconnecting.

append_log(run_id, msg, max_lines=100): Append a log line to the run’s bounded log_tail.
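
A bounded tail can be kept by trimming on every append. The sketch below assumes log_tail is stored as newline-joined text, which is a guess about the storage format; `append_tail` is a hypothetical helper.

```python
def append_tail(log_tail, msg, max_lines=100):
    """Append msg to a newline-joined log and keep only the last max_lines lines."""
    lines = log_tail.splitlines() if log_tail else []
    lines.append(msg)
    return "\n".join(lines[-max_lines:])


tail = ""
for i in range(5):
    tail = append_tail(tail, f"line {i}", max_lines=3)
print(tail.splitlines())  # ['line 2', 'line 3', 'line 4']
```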

create_run(run_id, spec_id, operation, exchange, symbol, data_type, started_at=None): Insert a new run row in the running state.

finish_run(run_id, state, ended_at=None, rows_written=0, error=None): Mark a run finished with its final state, row count and optional error.

get_run(run_id): Return one run as a dict, or None if unknown.

list_runs(spec_id=None, state=None, limit=50): List recent runs, most recent first, optionally filtered.

mark_stale_running()

Transition all running rows to stale at daemon boot.

Runs left in state running after a daemon crash or SIGKILL pollute active_runs and the Dashboard forever. Calling this once during daemon startup corrects the DB without deleting any history rows.

Parameters:

None

Returns:

int: Number of rows updated (0 when the store is clean).

Notes

The ended_at timestamp is set to now (nanoseconds UTC) and error is set to 'orphaned by daemon restart' so the run history clearly attributes the state change to a restart rather than a normal completion or a user-visible error.

This method must only be called from the daemon boot path, before any new runs are started:

cmd_start for dccd start, called before the scheduler starts stream workers.
The FastAPI lifespan for standalone dccd ui, called before the standalone scheduler is created.

Calling it while workers are already running would incorrectly mark their legitimate active runs as stale.
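
The transition described above boils down to a single UPDATE. The sketch below runs it against an in-memory database; the `runs` table and its columns are a hypothetical schema chosen for illustration, not the store's actual one.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema.
conn.execute(
    "CREATE TABLE runs (run_id TEXT PRIMARY KEY, state TEXT, ended_at INTEGER, error TEXT)"
)
conn.executemany(
    "INSERT INTO runs (run_id, state) VALUES (?, ?)",
    [("r1", "running"), ("r2", "succeeded"), ("r3", "running")],
)
conn.commit()


def mark_stale_running(conn):
    """Flip every running row to stale and return how many rows changed."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE runs SET state = 'stale', ended_at = ?, error = ? WHERE state = 'running'",
            (time.time_ns(), "orphaned by daemon restart"),
        )
    return cur.rowcount


updated = mark_stale_running(conn)
again = mark_stale_running(conn)
print(updated, again)  # 2 0
```

No rows are deleted: finished runs keep their state, and a second call on a clean table reports 0.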

prune_old_runs(retention_days)

Delete terminal non-failed runs older than retention_days days.

Runs in states succeeded, stale, and cancelled that started more than retention_days days ago are removed. failed rows are intentionally kept as the long-term error journal. The database is VACUUM-ed after any deletion to reclaim disk space.

Parameters:

retention_days (int): Number of days to retain terminal non-failed runs. Pass 0 (or any value <= 0) to disable pruning; the method returns 0 immediately without touching the database.

Returns:

int: Number of rows deleted (0 when pruning is disabled or when no rows match the cutoff).

Notes

VACUUM cannot run inside a transaction. This method opens a separate connection (outside the _conn context manager) for the VACUUM statement, which is executed only when at least one row was deleted.

This method must be called from the daemon boot path after mark_stale_running so that freshly-staled orphans age normally rather than being immediately pruned on the next boot.

update_progress(run_id, progress): Persist the latest progress dict for run_id (polled by the UI).

Coverage manifest

class CoverageStore(db_path)

Per-dataset coverage manifest (min/max ts + row count).

Parameters:

db_path (str or Path): Path to the SQLite database file. Created if absent.

get_max_ts(ds): Return the recorded max_ts for ds, or None if unknown.

list_all(): Return every coverage row (most recently updated first).

record(ds, *, min_ts, max_ts, rows_added=0)

Upsert coverage for ds, widening the [min_ts, max_ts] envelope.

min_ts/max_ts are merged with any existing row (min of mins, max of maxes), so a later backfill never narrows the recorded extent; rows_added accumulates (an approximate stored-row tally for display).
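
The widening behaviour maps naturally onto a SQLite upsert. The sketch below is one way to express it; the `coverage` table, its columns and the string dataset key are hypothetical, and the upsert syntax needs SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per dataset, keyed by a string id.
conn.execute(
    "CREATE TABLE coverage ("
    "dataset TEXT PRIMARY KEY, min_ts INTEGER, max_ts INTEGER, rows INTEGER)"
)


def record(conn, dataset, *, min_ts, max_ts, rows_added=0):
    """Insert coverage for dataset, or widen the existing envelope and add to the tally."""
    with conn:
        conn.execute(
            """
            INSERT INTO coverage (dataset, min_ts, max_ts, rows) VALUES (?, ?, ?, ?)
            ON CONFLICT(dataset) DO UPDATE SET
                min_ts = MIN(coverage.min_ts, excluded.min_ts),
                max_ts = MAX(coverage.max_ts, excluded.max_ts),
                rows = coverage.rows + excluded.rows
            """,
            (dataset, min_ts, max_ts, rows_added),
        )


key = "binance:BTC-USDT:ohlc:3600"  # hypothetical dataset key
record(conn, key, min_ts=100, max_ts=200, rows_added=10)
record(conn, key, min_ts=150, max_ts=300, rows_added=5)  # narrower start, later end
row = conn.execute("SELECT min_ts, max_ts, rows FROM coverage WHERE dataset = ?", (key,)).fetchone()
print(row)  # (100, 300, 15)
```

The second call starts later than the first, yet min_ts stays at 100: the envelope only ever grows.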