Oninit® Log Ripper — Source Locale Verification

This page describes how each captured construct behaves under different Informix source DB_LOCALE settings. The CDC capture pipe is byte-oriented: column data arrives as raw bytes from the logical log and is never transcoded on the wire, so character data is locale-agnostic by construction. Numeric and temporal literals are formatted by the Informix CSDK formatters, and the ripper forces their output to a locale-neutral encoding at its process boundary so that captured SQL is portable to any target dialect regardless of source locale.

The matrix below shows the verified behaviour for each (source-locale × captured construct) cell. Every cell was exercised live on this build: INSERT, UPDATE (with WHERE-image) and DELETE statements were driven against an Informix source running the named locale, and the captured SQL was compared with the source bytes hex-for-hex. See the SQL Mapping page for the per-target dialect rewriter and the Schema Translation page for end-to-end target-side replay examples.
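The hex-for-hex comparison can be illustrated with a minimal Python sketch. It is not the verification harness used for this build; the table name t1, the sample value and the helper names are hypothetical and exist only to show what "the source bytes appear verbatim in the captured SQL" means.

```python
def hex_bytes(data: bytes) -> str:
    """Render bytes as space-separated upper-case hex pairs for inspection."""
    return data.hex(" ").upper()


def value_preserved(source_value: bytes, captured_sql: bytes) -> bool:
    """True when the source column bytes appear verbatim in the captured statement."""
    return source_value in captured_sql


# Hypothetical sample: "café" as stored under a Latin-1 (819) source locale.
source_value = "caf\u00e9".encode("latin-1")  # 63 61 66 E9
captured_sql = b"INSERT INTO t1 (name) VALUES ('" + source_value + b"')"

# For contrast: a transcoding step would turn 0xE9 into the UTF-8 pair C3 A9,
# so the original byte sequence would no longer be found in the statement.
transcoded_sql = captured_sql.decode("latin-1").encode("utf-8")

print(hex_bytes(source_value))
print(value_preserved(source_value, captured_sql))
print(value_preserved(source_value, transcoded_sql))
```

The check is a plain byte-substring test on purpose: decoding either side before comparing would hide exactly the kind of drift the matrix is meant to rule out.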

Source-locale matrix

| Captured construct | en_US.819 (Latin-1, baseline) | en_US.utf8 | de_de.819 | ja_jp.utf8 |
| --- | --- | --- | --- | --- |
| NCHAR / NVARCHAR multi-byte content | single-byte high-bit (e.g. 0xE9=é) | byte-perfect (Cyrillic, Greek, Japanese pass through unchanged) | single-byte high-bit (German umlauts äöüß) | byte-perfect (Hiragana / Katakana / Kanji pass through unchanged) |
| VARCHAR with high-bit / multi-byte chars | byte-preserved | byte-preserved (multi-byte UTF-8 sequences emit verbatim in INSERT / UPDATE / DELETE) | byte-preserved | byte-preserved |
| DECIMAL value emission (decimal separator) | ASCII dot (1234.56) | ASCII dot | ASCII dot (forced via CLIENT_LOCALE=en_us.819 at process startup; the ripper overrides the operator's env so a comma-decimal locale on the source never leaks into the captured SQL) | ASCII dot |
| DECIMAL emission, large precision (32+ digits) | full digits preserved | full digits preserved | full digits preserved with dot separator | full digits preserved |
| DATETIME literal | 'YYYY-MM-DD HH:MM:SS[.fffff]' (ISO) | ISO | ISO (no comma drift in the date string under DE locale) | ISO (no Japanese-era drift; '2026-04-01 09:15:30' not '令和8') |
| UPDATE / DELETE WHERE-image (full row reconstruction) | preserved | preserved (multi-byte WHERE values match source bytes verbatim) | preserved (umlauts and dot-decimal both clean) | preserved (multi-byte WHERE values match source bytes verbatim) |
| NCHAR length semantics | byte-padded | byte-padded (server-side octet_length matches captured byte count) | byte-padded | byte-padded |

The capture path is locale-agnostic for column data because the CDC log records carry raw byte sequences, not character strings: no client-side decoding or re-encoding happens between the source log and the captured SQL stream. Numeric and temporal formatting goes through the Informix CSDK functions dectoasc and dttoasc, which honour the GLS locale system. The ripper therefore forces CLIENT_LOCALE=en_us.819 at process startup, so those formatters always emit an ASCII-dot decimal separator and ISO DATETIME literals regardless of the operator's shell environment or the source database's locale. Other locale categories (collation, character classification) stay at the operator's setting; only the formatter output is normalised.
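The effect of this normalisation can be sketched in Python. This is an illustration only: the ripper itself formats through the CSDK functions named above, and the helper names below (neutral_env, decimal_literal, datetime_literal) are hypothetical stand-ins that show the intended output shape, not the actual implementation.

```python
from datetime import datetime
from decimal import Decimal

# Value taken from the text above; everything else in this sketch is assumed.
NEUTRAL_CLIENT_LOCALE = "en_us.819"


def neutral_env(env: dict) -> dict:
    """Return a copy of env with CLIENT_LOCALE overridden; other variables are untouched."""
    forced = dict(env)
    forced["CLIENT_LOCALE"] = NEUTRAL_CLIENT_LOCALE
    return forced


def decimal_literal(value: Decimal) -> str:
    """Fixed-point text with an ASCII dot, independent of the process locale."""
    return format(value, "f")


def datetime_literal(value: datetime) -> str:
    """ISO-style 'YYYY-MM-DD HH:MM:SS' literal, independent of the process locale."""
    return value.isoformat(sep=" ", timespec="seconds")


# An operator shell that asks for a comma-decimal German client locale.
operator_env = {"CLIENT_LOCALE": "de_de.819", "INFORMIXSERVER": "src1"}
forced_env = neutral_env(operator_env)

print(forced_env["CLIENT_LOCALE"])
print(decimal_literal(Decimal("1234.56")))
print(datetime_literal(datetime(2026, 4, 1, 9, 15, 30)))
```

In a real process the override would have to be applied to the environment before the client library initialises its locale, which is why the text places it at process startup.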

Cross-locale — source vs target column charset

The same byte-oriented capture path that makes same-locale matches trivial leaves the cross-locale match to the operator: the captured INSERT carries whatever bytes the source column held, and the target accepts or rejects them according to its own column-charset declaration. The ripper does not transcode.

| Source DB_LOCALE encoding | Target column charset | Result |
| --- | --- | --- |
| Latin-1 (en_US.819, de_de.819, …) | latin1 (MySQL/MariaDB), SQL_Latin1_General_CP1_CI_AS (MSSQL), WE8ISO8859P1 (Oracle), 819 (Db2) | byte-perfect: high-bit chars (0xE9=é) land verbatim |
| UTF-8 (en_US.utf8, ja_jp.utf8, …) | utf8mb4 (MySQL/MariaDB), UTF-8 (PG), AL32UTF8 (Oracle), 1208 (Db2) | byte-perfect: multi-byte sequences pass through unchanged |
| Latin-1 | utf8mb4 / UTF-8 | requires a per-target charset workaround: a lone Latin-1 high-bit byte (e.g. 0xE9) is not a valid UTF-8 sequence, so the target rejects it (MySQL/MariaDB report Incorrect string value). The ripper's connector init issues SET NAMES binary on MySQL/MariaDB and SET client_encoding TO LATIN1 on PG so the target accepts the bytes (as opaque bytes on MySQL/MariaDB; on PG the server converts them from LATIN1 to the database encoding). The column should be declared with a CHARACTER SET latin1 attribute (or the equivalent per dialect) for the bytes to render correctly to applications reading the target. |
| UTF-8 | latin1 / single-byte | multi-byte source content does not fit a single-byte target column; the target either rejects it or silently truncates. The operator must widen the target column to a multi-byte charset (utf8mb4 / UTF-8) before pointing the ripper at it. |

The ripper itself does not narrow or widen captured bytes; the target column's declared charset and the connection's session charset (SET NAMES on MySQL family, client_encoding on PG) determine whether the bytes land cleanly. For mixed environments, declare the target column with a charset matching the source's storage encoding and let the connection-level SET NAMES binary path (already wired into the MySQL/MariaDB connectors) carry the bytes opaquely.
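The four source/target combinations above can be reproduced with Python's built-in codecs. This is a sketch for reasoning about the byte-level outcome, not a model of any particular target database; the function name fits_target and its three result labels are hypothetical.

```python
def fits_target(raw: bytes, source_codec: str, target_codec: str) -> str:
    """Classify what happens when untranscoded source bytes meet a target charset.

    'verbatim'         the raw bytes already mean the same text in the target charset
    'needs-workaround' the text is representable, but not with these exact bytes
    'unrepresentable'  the target charset cannot hold the text at all
    """
    text = raw.decode(source_codec)
    try:
        encoded = text.encode(target_codec)
    except UnicodeEncodeError:
        return "unrepresentable"
    return "verbatim" if encoded == raw else "needs-workaround"


latin1_bytes = "caf\u00e9".encode("latin-1")          # ends in the single byte E9
utf8_bytes = "\u306b\u307b\u3093".encode("utf-8")     # three 3-byte Hiragana sequences

print(fits_target(latin1_bytes, "latin-1", "latin-1"))
print(fits_target(utf8_bytes, "utf-8", "utf-8"))
print(fits_target(latin1_bytes, "latin-1", "utf-8"))
print(fits_target(utf8_bytes, "utf-8", "latin-1"))

# The Latin-1 bytes are not valid UTF-8 as they stand, which is why a
# UTF-8 target refuses them when no session-level directive is in place.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False
print(latin1_is_valid_utf8)
```

Pure-ASCII content classifies as verbatim in every combination, which matches the observation that only high-bit and multi-byte values are at risk.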

Target-side column-charset probe

At startup, after each direct-DB connector has issued its connection-level charset directive, the ripper also probes the declared charset of every captured-table column on the target and compares it against the source's DB_LOCALE encoding family. The probe is informational and does not block startup, but it surfaces cross-family mismatches the operator may not have noticed when the target schema was built.

| Target | Probe query | Granularity |
| --- | --- | --- |
| postgres | SHOW server_encoding | per database (PG has one encoding per database, not per column, so the answer is the same for every captured column) |
| mysql, mariadb | SELECT CHARACTER_SET_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA = DATABASE() AND ... | per column (MySQL/MariaDB carry a charset attribute on every text column, so the probe surfaces per-column mismatches) |

For each captured column the probe classifies both the source DB_LOCALE family and the target column's declared family (Latin-1 / UTF-8 / Shift-JIS / multi-byte-other / unknown). On a cross-family split (e.g. Latin-1 source + UTF-8 target column) the ripper emits a one-time WARN per column naming the table.column, the source family, the target family, and a brief remediation hint (set the target column's charset to match, or rely on the connection-level binary-transit directive plus a per-application re-decode). Same-family pairs produce no log line.
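The classification and one-time warning logic can be sketched as follows. This is a hedged illustration, not the ripper's code: the class name CharsetProbe, the lookup tables, the message wording and the choice to stay silent when either side classifies as unknown are all assumptions made for the example.

```python
# Assumed codeset-suffix and charset-name lookups; the real probe may know more.
SOURCE_FAMILIES = {"819": "Latin-1", "utf8": "UTF-8", "57372": "UTF-8", "sjis": "Shift-JIS"}
TARGET_FAMILIES = {
    "latin1": "Latin-1",
    "utf8": "UTF-8", "utf8mb3": "UTF-8", "utf8mb4": "UTF-8",
    "sjis": "Shift-JIS", "cp932": "Shift-JIS",
}


def source_family(db_locale: str) -> str:
    """Classify a DB_LOCALE such as 'en_US.819' by its codeset suffix."""
    codeset = db_locale.rsplit(".", 1)[-1].lower()
    return SOURCE_FAMILIES.get(codeset, "unknown")


def target_family(charset: str) -> str:
    """Classify a target charset name such as 'utf8mb4' or 'LATIN1'."""
    return TARGET_FAMILIES.get(charset.lower(), "unknown")


class CharsetProbe:
    """Emits at most one warning per table.column on a cross-family split."""

    def __init__(self, db_locale: str):
        self.src = source_family(db_locale)
        self.warned = set()

    def check(self, column: str, target_charset: str):
        tgt = target_family(target_charset)
        # Same-family pairs produce no log line. Skipping 'unknown' is an
        # assumption of this sketch, not documented behaviour.
        if tgt == self.src or "unknown" in (tgt, self.src):
            return None
        if column in self.warned:
            return None
        self.warned.add(column)
        return (f"WARN {column}: source family {self.src}, target family {tgt}; "
                "set the target column charset to match, or rely on binary transit")


probe = CharsetProbe("en_US.819")
first = probe.check("orders.note", "utf8mb4")
second = probe.check("orders.note", "utf8mb4")
same = probe.check("orders.code", "latin1")
print(first)
```

Keeping the warned set on the probe object is what makes the warning one-time per column, even if the same column is checked again later in the startup sequence.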

The probe can be disabled by setting skip_charset_check: true at the top level of the YAML (the same knob disables the connection-level verify-back covered on the Schema Translation page). Db2 / Oracle / MSSQL targets do not yet carry a per-column probe: their charset model is either single-database-level (Db2) or a fixed CSDK-controlled configuration (Oracle NLS_CHARACTERSET), which is better verified by the connection-level directive check.

The SQL Mapping page covers the per-target dialect rewrite once the captured SQL is in hand; the Schema Translation page covers per-dialect target column types and the connection-level charset directive verify-back; the Configuration page covers the YAML knobs the operator sets per target.
