Oninit® Log Ripper — Grafana & Prometheus

The ripper exposes capture health as Prometheus metrics. Drop-in provisioning files + curated dashboards ship in the package so an operator can wire the ripper into Grafana with no JSON authoring. This page covers the metric set, the install path for two datasource flavors (Prometheus or the Oninit Grafana plugin), the panels in each dashboard, and the operator playbook for the SLO signals.

The /metrics endpoint

An embedded HTTP server inside the ripper serves /metrics in Prometheus text-exposition format v0.0.4 and /healthz for orchestrator liveness probes. The endpoint is default-disabled: the ripper opens no listening sockets unless monitoring.prometheus.port is non-zero. Enable in the YAML:

monitoring:
  prometheus:
    port: 9091
    bind: "0.0.0.0"   # 127.0.0.1 to keep loopback-only

Default bind is 127.0.0.1 — a fresh enable carries no external attack surface. Set 0.0.0.0 to allow a Prometheus server on a different host to scrape; wrap with a reverse proxy (nginx / haproxy / Caddy) if the link crosses an untrusted network. The embedded server is plain HTTP, no TLS, no authentication — tunnel it.
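The enable/bind rules above can be sketched as a small pre-flight check. This is an illustrative helper, not part of the ripper: the function name endpoint_exposure is hypothetical, and it assumes the YAML has already been parsed into a plain dict by whatever loader you use.

```python
import ipaddress


def endpoint_exposure(config: dict) -> str:
    """Classify the metrics listener as 'disabled', 'loopback' or 'external'."""
    prom = (config.get("monitoring") or {}).get("prometheus") or {}
    port = int(prom.get("port", 0))        # 0 or absent: no listening socket
    if port == 0:
        return "disabled"
    bind = prom.get("bind", "127.0.0.1")   # documented default bind
    if ipaddress.ip_address(bind).is_loopback:
        return "loopback"
    # Reachable from other hosts; the server is plain HTTP, so proxy or tunnel it.
    return "external"


if __name__ == "__main__":
    cfg = {"monitoring": {"prometheus": {"port": 9091, "bind": "0.0.0.0"}}}
    print(endpoint_exposure(cfg))
```

A deployment script could refuse to continue on "external" unless a reverse proxy is known to be in front of the port.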

Metric families

Five families ship in v1, all named with the oni_logripper_ prefix per CNCF Prometheus naming convention:

oni_logripper_build_info — gauge, labels version. Constant 1, advertises the running ripper version. Use as the info metric in Grafana’s build-info stat.
oni_logripper_records_total — counter, labels worker, op. Records emitted, broken down per worker per op ∈ insert / update / delete / truncate / discard. The discard bucket is the early-warning signal: any non-zero rate means CDC is falling behind log recycling.
oni_logripper_lag_seconds — gauge, labels worker. Source wall-clock minus this worker’s last seen transaction timestamp. 0 if no transaction has been captured yet. The primary SLO signal.
oni_logripper_recovery_count — counter, labels worker. How many times this worker has restarted its CDC session via cdc_activatesess after an ifx_lo_read failure. Sustained non-zero rate points at network instability between ripper and source.
oni_logripper_worker_running — gauge, labels worker. 1 if the worker thread is actively running, 0 if stopped. Sum across workers for the “expected vs actual” comparison.
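The arithmetic behind two of these families can be written out as a worked example. This is a sketch of the definitions as stated above, not the ripper's source; the function names and the expected-worker count are assumptions made for illustration.

```python
from typing import Dict, Optional


def lag_seconds(source_now: float, last_tx_ts: Optional[float]) -> float:
    """Source wall-clock minus the worker's last seen transaction timestamp."""
    if last_tx_ts is None:      # no transaction captured yet: reported as 0
        return 0.0
    return source_now - last_tx_ts


def workers_missing(expected: int, running: Dict[str, int]) -> int:
    """Expected worker count minus the sum of worker_running across workers."""
    return expected - sum(running.values())


if __name__ == "__main__":
    print(lag_seconds(1000.0, 997.0))              # lag for one worker
    print(workers_missing(2, {"0": 1, "1": 0}))    # one worker is down
```

Because a worker with no captured transaction also reports 0, a lag of 0 on its own does not prove the worker is caught up; read it together with records_total.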

Sample scrape:

$ curl -s http://ripper-host:9091/metrics
# HELP oni_logripper_build_info Constant 1 with build metadata labels.
# TYPE oni_logripper_build_info gauge
oni_logripper_build_info{version="1.0.0"} 1
# HELP oni_logripper_records_total Records emitted per worker per op.
# TYPE oni_logripper_records_total counter
oni_logripper_records_total{worker="0",op="insert"} 1234
oni_logripper_records_total{worker="0",op="update"} 567
oni_logripper_records_total{worker="0",op="delete"} 12
oni_logripper_records_total{worker="0",op="truncate"} 0
oni_logripper_records_total{worker="0",op="discard"} 0
oni_logripper_records_total{worker="1",op="insert"} 845
oni_logripper_records_total{worker="1",op="update"} 230
# HELP oni_logripper_lag_seconds Real-time capture lag in seconds.
# TYPE oni_logripper_lag_seconds gauge
oni_logripper_lag_seconds{worker="0"} 3
oni_logripper_lag_seconds{worker="1"} 2
# HELP oni_logripper_recovery_count Worker recovery attempts.
# TYPE oni_logripper_recovery_count counter
oni_logripper_recovery_count{worker="0"} 0
oni_logripper_recovery_count{worker="1"} 0
# HELP oni_logripper_worker_running 1 if worker is running, 0 otherwise.
# TYPE oni_logripper_worker_running gauge
oni_logripper_worker_running{worker="0"} 1
oni_logripper_worker_running{worker="1"} 1
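For scripting against the scrape output without a Prometheus server, a minimal parser along these lines may be enough. It is a sketch that handles the simple label values shown in the sample (no braces or escaped quotes inside values); the names parse_metrics and discard_total are hypothetical. For general-purpose parsing, a full parser such as the one in the prometheus_client package is the safer choice.

```python
import re

# name, optional {labels}, then the sample value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def discard_total(samples):
    """Sum the discard bucket of records_total across all workers."""
    return sum(
        value for name, labels, value in samples
        if name == "oni_logripper_records_total" and labels.get("op") == "discard"
    )
```

Feeding it the output of the curl command above and checking discard_total gives a quick command-line version of the early-warning signal.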

/healthz liveness probe

/healthz returns HTTP 200 with body OK when every worker reports error == 0, otherwise HTTP 503 with body UNHEALTHY. Standard shape for orchestrator liveness probes.

# Kubernetes liveness probe
livenessProbe:
  httpGet:
    path: /healthz
    port: 9091
  initialDelaySeconds: 10
  periodSeconds: 30

# systemd readiness check (via curl); retries cover the window before the listener is up
ExecStartPost=/usr/bin/curl --fail --silent --retry 5 --retry-connrefused http://127.0.0.1:9091/healthz

# Datadog Agent http_check
init_config:
instances:
  - name: oni_ripper
    url: http://ripper-host:9091/healthz
    timeout: 5
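Where none of the probes above fit (a cron job, a custom supervisor), the same contract can be checked from a script. The sketch below uses only the Python standard library; is_healthy and probe are hypothetical names, and the URL is whatever host and port you configured.

```python
import urllib.error
import urllib.request


def is_healthy(status: int, body: str) -> bool:
    """Documented contract: HTTP 200 with body OK means every worker is fine."""
    return status == 200 and body.strip() == "OK"


def probe(url: str, timeout: float = 5.0) -> bool:
    """Return True only for a healthy response; any failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return is_healthy(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:       # 503 UNHEALTHY is raised as an error
        return is_healthy(err.code, err.read().decode("utf-8", "replace"))
    except (urllib.error.URLError, OSError):    # refused, timed out, DNS failure
        return False


if __name__ == "__main__":
    print(probe("http://127.0.0.1:9091/healthz"))
```

Treating connection failures as unhealthy is a design choice of this sketch: a ripper whose endpoint is unreachable is indistinguishable from a stopped one as far as the caller is concerned.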

What ships in share/grafana/

The package places provisioning + dashboards under share/grafana/ (typically /usr/share/oni_ripper/grafana/ when installed via the RPM / DEB):

share/grafana/
├── provisioning/
│   ├── datasources/
│   │   ├── oni_logripper_prometheus.yaml   # Prometheus DS
│   │   └── oni_logripper_oninit.yaml       # Oninit Grafana plugin DS
│   └── dashboards/
│       └── oni_ripper.yaml                  # dashboard provider
└── dashboards/
    ├── oni_logripper_capture_health.json       # main board
    └── oni_logripper_capture_drilldown.json    # per-worker drilldown

Install — Prometheus path

The canonical CNCF setup. A Prometheus server scrapes the ripper’s /metrics on a 15-30s cadence and rolls up across all instances. Grafana queries Prometheus.

Enable the ripper’s metrics endpoint (monitoring.prometheus.port non-zero, see above).

Point your existing Prometheus server at the ripper:

scrape_configs:
  - job_name: oni_ripper
    static_configs:
      - targets: ["ripper-host:9091"]

Drop the datasource provisioning file into Grafana:

cp share/grafana/provisioning/datasources/oni_logripper_prometheus.yaml \
   /etc/grafana/provisioning/datasources/

Drop the dashboard provisioning manifest:

cp share/grafana/provisioning/dashboards/oni_ripper.yaml \
   /etc/grafana/provisioning/dashboards/

Copy the dashboard JSON to the location the manifest points at:

mkdir -p /var/lib/grafana/dashboards/oni_ripper
cp share/grafana/dashboards/*.json \
   /var/lib/grafana/dashboards/oni_ripper/

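Set PROMETHEUS_URL in the Grafana service’s environment, for example in /etc/default/grafana-server (DEB) or /etc/sysconfig/grafana-server (RPM), or edit the YAML’s url: directly. A variable set only in an interactive shell is not visible to the systemd unit. Then restart Grafana:
```
# in Grafana's environment file
PROMETHEUS_URL=http://prom.local:9090

# then, as root
systemctl restart grafana-server
```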
Open Grafana → Dashboards → Oninit Log Ripper folder. Both dashboards are now provisioned.
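If the folder does not appear, a quick check that every file landed where the steps above put it can save a round of log reading. This is a sketch under the assumption that you used the same target paths as the cp commands; missing_files is a hypothetical helper.

```python
from pathlib import Path

# Relative to Grafana's config root (/etc/grafana in the steps above)
PROVISIONING = [
    "provisioning/datasources/oni_logripper_prometheus.yaml",
    "provisioning/dashboards/oni_ripper.yaml",
]
# File names shipped under share/grafana/dashboards/
DASHBOARDS = [
    "oni_logripper_capture_health.json",
    "oni_logripper_capture_drilldown.json",
]


def missing_files(grafana_etc="/etc/grafana",
                  dashboard_dir="/var/lib/grafana/dashboards/oni_ripper"):
    """Return the expected paths that do not exist as regular files."""
    wanted = [Path(grafana_etc) / rel for rel in PROVISIONING]
    wanted += [Path(dashboard_dir) / name for name in DASHBOARDS]
    return [str(path) for path in wanted if not path.is_file()]


if __name__ == "__main__":
    for path in missing_files():
        print("missing:", path)
```

An empty result only confirms the files are present; Grafana must still be able to read them and reach the datasource URL.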

Install — Oninit Datasource path

For shops already running the Oninit Grafana plugin across the Oninit product family (InformixAnalyser, snooper, etc.). The plugin queries the source Informix directly via its own protocol — no separate Prometheus deployment required, and lag / DML rates surface alongside your existing Informix dashboards.

Install the Oninit Grafana plugin on the Grafana host (separate deliverable from the Oninit product team; not shipped in this package).

Drop the Oninit datasource provisioning file:

cp share/grafana/provisioning/datasources/oni_logripper_oninit.yaml \
   /etc/grafana/provisioning/datasources/

Set the plugin connection environment in Grafana:

ONINIT_DS_PLUGIN_TYPE=oninit-datasource    # whatever the local plugin id is
ONINIT_DS_URL=https://informix.local:9088
ONINIT_DS_USER=monitor
ONINIT_DS_PASSWORD=<...>
INFORMIXSERVER=ol_informix1410

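The remaining steps are the same as in the Prometheus path: copy the dashboard provisioning manifest, copy the dashboard JSON, restart Grafana, and open the Oninit Log Ripper folder. PROMETHEUS_URL is not needed on this path.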

Both can coexist. Drop both provisioning files and the dashboard’s ${DS} variable lets users pick at view time which datasource backs the panel queries.

Dashboard: Capture Health

Top-of-funnel board. Eight panels arranged for an at-a-glance read on the entire capture pipeline.

Build info (stat) — oni_logripper_build_info. Single-line version label across all scraped instances.
Workers running (stat) — sum(oni_logripper_worker_running). Aggregate worker count. Threshold red < 1, green ≥ 1.
Recovery attempts (total) (stat) — sum(oni_logripper_recovery_count). Lifetime recovery count. Threshold green 0, yellow ≥ 1, red ≥ 5.
Capture lag (seconds, per worker) (timeseries) — oni_logripper_lag_seconds{worker=~"$worker"}. Per-worker line. Threshold green < 30s, yellow 30–120s, red > 120s drives the area fill colour.
Records emitted (rate, per op) (timeseries, stacked) — sum by (op) (rate(oni_logripper_records_total[1m])). Stacked area, one band per op. The discard band stacking on top is the alarm shape.
Records cumulative (per worker, per op) (timeseries) — oni_logripper_records_total{worker=~"$worker"}. Monotonic counter view; useful for sanity-checking the rate panel.
Recovery attempts (rate) (timeseries, bars) — rate(oni_logripper_recovery_count[5m]). Bars per worker. Sustained bars = network instability between ripper and source.
Worker state (table) — joined instant query of worker_running + lag_seconds + recovery_count. One row per worker. Value-mapping turns worker_running into colour-coded RUNNING / STOPPED.

Two template variables: ${DS} picks Prometheus vs. Oninit DS; ${worker} filters every panel to a worker subset (default All). Refresh defaults to 30s, time range to now-1h.
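The panel thresholds listed above can be reproduced outside Grafana, for example in a status script. The sketch below encodes them as stated; the function names are hypothetical, and treating exactly 120 s as yellow follows the "30–120s" wording rather than anything read from the dashboard JSON.

```python
def lag_colour(lag_s: float) -> str:
    """Capture-lag thresholds: green < 30s, yellow 30-120s, red > 120s."""
    if lag_s < 30:
        return "green"
    if lag_s <= 120:
        return "yellow"
    return "red"


def recovery_colour(total_recoveries: int) -> str:
    """Recovery-attempt thresholds: green 0, yellow >= 1, red >= 5."""
    if total_recoveries >= 5:
        return "red"
    if total_recoveries >= 1:
        return "yellow"
    return "green"


if __name__ == "__main__":
    for lag in (3, 45, 300):
        print(lag, lag_colour(lag))
```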

Dashboard: Worker Drilldown

Linked from the health board’s “Drilldown” link. Same metrics, narrowed to one $worker at a time. Five panels:

Lag (seconds) — this worker’s oni_logripper_lag_seconds with the same 30/120s thresholds.
Records / sec — stacked-by-op rate, one worker.
Records cumulative (per op) — counter lines.
Recovery attempts (rate) — bar chart.
Worker state — large stat panel, RUNNING green / STOPPED red.

Operator playbook

What each panel signal means and the first thing to check.

lag_seconds climbing across reports on every worker. Source DML faster than the ripper can drain. Either the source side has a write spike or the target write path is the bottleneck. First check: the records_total{op="discard"} counter. If discards are firing, the source log has already recycled past records the ripper hasn’t read — widen the source log file or add ripper workers. If discards are zero, look at target latency.
lag_seconds climbing on one worker only. That worker’s tables are bottlenecked — typically a hot table on a worker that’s also handling several other tables, or a target index that’s write-amplifying. First check: re-shard sessions_per_thread / max_threads so the hot table lives on its own worker, or move it to a target dialect with cheaper inserts.
recovery_count rate > 0 sustained. The source ↔ ripper TCP path is dropping; every recovery costs a round-trip and re-DESCRIBE. First check: run a sustained onstat against sysmaster from the ripper host to confirm the path isn’t flapping. Tune NETTYPE / SOC_*KEEPALIVE on the source.
worker_running drops to 0 on one worker. That worker exited — either a hard error or a clean shutdown. First check: read the ripper’s [STATUS] / [CRITICAL] log lines. The /healthz endpoint flips to 503 immediately.
records_total{op="discard"} incrementing. CDC fell behind log recycling on the source — the most serious signal. Some captures were never seen. First check: read the configured alerting.discard_command output (rate-limited per discard_alert_min_sec). Audit for missing rows on the target. Long-term: widen the source log file or increase ripper throughput.
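The playbook above can be condensed into a triage routine that orders the first checks by severity. This is an illustrative sketch only: the function name, the input shapes (per-worker dicts keyed by worker id) and the message wording are assumptions, and deciding what counts as "climbing" or "sustained" is left to the caller.

```python
def triage(lag_rising, running, recovery_rate, discards_increasing):
    """Return first-check hints, most serious signal first.

    lag_rising:          {worker: bool}  lag is climbing on this worker
    running:             {worker: 0|1}   worker_running gauge
    recovery_rate:       {worker: float} sustained recovery rate
    discards_increasing: bool            records_total{op="discard"} is moving
    """
    actions = []
    if discards_increasing:
        actions.append("discards: read alerting.discard_command output, "
                       "audit the target for missing rows")
    for worker, up in sorted(running.items()):
        if not up:
            actions.append(f"worker {worker} stopped: read [STATUS] / [CRITICAL] log lines")
    for worker, rate in sorted(recovery_rate.items()):
        if rate > 0:
            actions.append(f"worker {worker} recovering: check the source-to-ripper network path")
    rising = sorted(w for w, climbing in lag_rising.items() if climbing)
    if rising and len(rising) == len(lag_rising):
        if not discards_increasing:
            actions.append("lag rising on every worker, no discards: check target latency")
    else:
        for worker in rising:
            actions.append(f"lag rising on worker {worker} only: "
                           "move its hot table to its own worker")
    return actions
```

Discards are checked first because they are the only signal in the list that implies data was already missed; the other hints describe conditions that can still be recovered from.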

Ad-hoc PromQL

For one-off questions outside the curated dashboards. All queries below assume the Prometheus datasource.

Total inserts/sec across all workers:

sum(rate(oni_logripper_records_total{op="insert"}[1m]))

Per-worker DML mix:

sum by (worker, op) (rate(oni_logripper_records_total[5m]))

Worst-case lag across the fleet:

max(oni_logripper_lag_seconds)

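Workers currently behind > 30s (the or vector(0) makes the query return 0 rather than an empty result when no worker is behind):

count(oni_logripper_lag_seconds > 30) or vector(0)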

Recoveries per worker over the last hour (a count, not a per-second rate):

increase(oni_logripper_recovery_count[1h])

Discards in the last 24 hours (any non-zero is a problem):

sum(increase(oni_logripper_records_total{op="discard"}[24h]))
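Any of the queries above can also be run from a script through the Prometheus HTTP API's instant-query endpoint (/api/v1/query). The sketch below uses only the standard library; the helper names are hypothetical, the base URL reuses the example Prometheus address from the install steps, and parse_vector handles vector results only, which is what these queries return.

```python
import json
import urllib.parse
import urllib.request


def query_url(base: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return base.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_vector(payload: dict):
    """Turn an instant-vector response into [(labels, value), ...]."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    # Each sample is {"metric": {...}, "value": [unix_ts, "value-as-string"]}
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def instant_query(base: str, promql: str, timeout: float = 10.0):
    """Run the query against a live Prometheus server."""
    with urllib.request.urlopen(query_url(base, promql), timeout=timeout) as resp:
        return parse_vector(json.load(resp))


if __name__ == "__main__":
    print(instant_query("http://prom.local:9090", "max(oni_logripper_lag_seconds)"))
```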

Out of scope (today)

TLS / authentication on the metrics endpoint — operator wraps with a reverse-proxy that terminates TLS and applies the auth check. Treated as a deployment concern; the ripper’s embedded server stays minimal so the attack surface is small.
Push-model gateways (Prometheus Pushgateway, OTLP push) — the ripper is a long-running daemon, so the pull-model is canonical. Push is for short-lived jobs.
Histogram metrics (per-tx latency distributions) — v1 ships counters + gauges only; histogram families may be added in v2 if a customer asks.
Alerting rules — Grafana / Prometheus Alertmanager rules ship as a separate deliverable when SLO definitions are agreed with the customer.
Buffer + lag-bytes metrics — the target-offline buffering subsystem and the LSN-distance lag in bytes (distinct from the wall-clock lag-seconds shipped today) are not yet exposed as Prometheus metrics.

See config.html for the YAML knobs and reference.html for the per-key reference.