Overview
The CH research dataset is a single coherent OMOP CDM 5.4 instance, extended with eight tables specific to the Master Equation. Every patient, every encounter, every observation lives in this schema. There is no parallel "internal" schema — researchers see the same model our own analysts use, with sensitive fields deidentified per the policy on this page.
Standards we follow
We don't invent vocabularies when something credible already exists. Where we extend, we extend transparently — the CH-prefixed namespaces below are the only places we add anything custom.
Entities
The 28 tables in the registry. OMOP standard tables are kept as-is (so existing OHDSI tooling works); CH-prefixed tables are extensions documented on this page.
| Table | Description | Purpose | Source | Fields | Rows (live) |
|---|---|---|---|---|---|
| person | Demographic anchor for every participant. | One row per participant | OMOP | 18 | 1.2k |
| observation_period | When a person was actively observed. | Spans of follow-up | OMOP | 5 | 3.4k |
| visit_occurrence | Encounter-level rollup (visit, telehealth, ER, etc.) | One row per encounter | OMOP | 17 | 8.7k |
| condition_occurrence | Coded conditions (SNOMED + ICD-10). | Diagnoses, problem-list entries | OMOP | 15 | 22.1k |
| drug_exposure | Medications dispensed or administered. | Rx + admin records | OMOP | 22 | 18.4k |
| procedure_occurrence | Procedures performed (CPT/HCPCS/SNOMED). | Procedure records | OMOP | 14 | 11.2k |
| measurement | Quantitative results: labs, vitals, instruments. | Numeric or coded results | OMOP | 23 | 142.8k |
| observation | Qualitative or non-result clinical facts. | Notes, social history, qualifiers | OMOP | 20 | 38.6k |
| death | Date and cause of death. | One row per decedent | OMOP | 8 | 14 |
| device_exposure | Implants, wearables, durable medical equipment. | Device-level records | OMOP | 12 | 2.8k |
| specimen | Specimens collected (blood, tissue, etc.) | One per draw | OMOP | 11 | 9.4k |
| cost | Encounter and item-level cost records. | Charges, paid amounts | OMOP | 14 | 28.3k |
| payer_plan_period | Insurance coverage spans. | One per coverage span | OMOP | 9 | 2.1k |
| care_site | Clinic, hospital, or virtual visit site. | Reference table | OMOP | 8 | 147 |
| provider | Clinicians and care-team members. | Reference table | OMOP | 11 | 312 |
| location | Geographic location (county-level only post-deid). | Reference table | OMOP | 6 | 847 |
| note | Clinical notes (free text, redacted). | Encounter notes | OMOP | 10 | 8.1k |
| note_nlp | NLP-derived structured facts from notes. | Auto-extracted | OMOP | 14 | 42.6k |
| fact_relationship | Cross-table relationships (e.g. drug↔condition). | Relationship graph | OMOP | 5 | 18.9k |
| concept / vocabulary / etc | OMOP standardized vocabulary tables. | Reference | OMOP | — | — |
| ch_axis_snapshot | Per-day 8-axis vector for every participant. | Daily axis values | CH | 14 | 438k |
| ch_axis_event | Discrete events that moved an axis. | Event log (CHAX-coded) | CH | 12 | 86.4k |
| ch_consent | Active consent record (research / sharing / data classes). | Consent state | CH | 16 | 3.2k |
| ch_consent_event | Consent-state change log (chain-pinned). | Audit trail | CH | 11 | 11.8k |
| ch_data_share | Per-study data-share grants and revocations. | One per share | CH | 13 | 412 |
| ch_token_event | HCR/HCC issuance to participants (deidentified). | Earnings ledger refs | CH | 9 | 94.2k |
| ch_protocol | Care-protocol catalog (versioned, signed). | Reference | CH | 10 | 187 |
| ch_chain_event | Chain-pinned audit events (notes, consent, prescribing). | Tamper-evident log | CH | 7 | 218k |
person
The demographic anchor. Every participant has exactly one row. Direct identifiers (name, SSN, MRN, address, phone, email) are not in this table — they live in the operational system, never in the research enclave.
person — 18 fields
| Field | Type | Source / vocabulary | Deid | Notes |
|---|---|---|---|---|
| person_id | int64 | Synthetic surrogate | SHIFT | Stable across queries within one study; rotated between studies. |
| gender_concept_id | int | OMOP Gender | PASS | Pass-through. 5 standard concepts. |
| year_of_birth | int | YYYY | SHIFT | Date-shifted ±90 days per patient (year stable for ages ≤ 89). |
| month_of_birth | int | 1-12 | BIN | Quarter only after deid (1, 4, 7, 10). |
| day_of_birth | int | — | SUPPR | Suppressed (set to 15 of birth month). |
| birth_datetime | timestamp | — | SUPPR | Suppressed entirely. Use year_of_birth. |
| race_concept_id | int | OMB Race & Ethnicity | PASS | Pass-through. Self-reported. |
| ethnicity_concept_id | int | OMB Race & Ethnicity | PASS | Pass-through. Self-reported. |
| location_id | int | FK → location | BIN | County-level only. ZIP code suppressed. |
| provider_id | int | FK → provider | SHIFT | Provider IDs surrogated per study. |
| care_site_id | int | FK → care_site | SHIFT | Care-site IDs surrogated per study. |
| person_source_value | string | — | SUPPR | Original MRN suppressed. |
| gender_source_value | string | — | PASS | Self-described, no PII. |
| race_source_value | string | — | PASS | Self-reported text. |
| ethnicity_source_value | string | — | PASS | Self-reported text. |
| ch_consent_state | enum | CHCO | PASS | CH ext. active / paused / withdrawn / pending. |
| ch_axis_baseline_id | int | FK → ch_axis_snapshot | PASS | CH ext. Baseline 8-axis vector at enrollment. |
| ch_data_classes | enum[] | CHCO | PASS | CH ext. Which data classes patient consented to share. |
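The birth-date rules in the table can be sketched in a few lines. This is a minimal illustration, not the enclave's ETL: the function names `quarter_start_month` and `deid_birth_fields` are hypothetical, and only the quarter binning (1, 4, 7, 10), the fixed day 15, and the suppressed `birth_datetime` come from the table above.

```python
def quarter_start_month(month: int) -> int:
    """Map a calendar month (1-12) to the first month of its quarter."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return ((month - 1) // 3) * 3 + 1


def deid_birth_fields(month: int) -> dict:
    """Return deidentified birth fields as the person table describes them.

    Hypothetical helper: month is binned to its quarter, day is fixed at 15,
    and birth_datetime is suppressed (None).
    """
    return {
        "month_of_birth": quarter_start_month(month),
        "day_of_birth": 15,
        "birth_datetime": None,
    }
```

For example, a participant born in August would surface with `month_of_birth` 7 and `day_of_birth` 15, so any age calculation in the enclave is accurate only to the quarter.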
observation
Qualitative facts that aren't measurements, conditions, or procedures — social history, family history, qualifiers, structured questionnaire responses. CHAX-coded axis events also surface here as observations of class CH-AXIS.
observation — selected fields (20 total)
| Field | Type | Source / vocabulary | Deid | Notes |
|---|---|---|---|---|
| observation_id | int64 | Synthetic surrogate | SHIFT | Per-study surrogate. |
| person_id | int64 | FK → person | SHIFT | Stable within study. |
| observation_concept_id | int | SNOMED, LOINC, CHAX | PASS | Standardized concept. |
| observation_date | date | — | SHIFT | Patient-level date shift ±90 days. |
| observation_datetime | timestamp | — | SHIFT | Same shift as date; time-of-day binned to nearest hour. |
| value_as_string | text | — | REDACT | NER-redacted free text. Names, locations, MRNs removed. |
| value_as_number | numeric | — | PASS | Numeric values pass through. |
| value_as_concept_id | int | Vocab | PASS | Coded value. |
| qualifier_concept_id | int | SNOMED qualifier | PASS | — |
| unit_concept_id | int | UCUM | PASS | — |
| provider_id | int | FK | SHIFT | — |
| visit_occurrence_id | int64 | FK | SHIFT | — |
| ch_axis | enum | PO/NM/ER/SC/RS/ES/TA/PV | PASS | CH ext. Which axis this observation moved. |
| ch_axis_delta | numeric | — | PASS | CH ext. Signed delta on that axis. |
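Because `ch_axis_delta` is signed, the net movement on each axis is a simple sum over a participant's observation rows. The sketch below assumes rows arrive as plain dictionaries keyed by the field names in the table; the function name `net_axis_delta` and the sample rows are illustrative, not part of any CH SDK.

```python
from typing import Iterable, Mapping

# Axis codes as listed in the ch_axis enum above.
AXES = ("PO", "NM", "ER", "SC", "RS", "ES", "TA", "PV")


def net_axis_delta(rows: Iterable[Mapping]) -> dict:
    """Sum signed ch_axis_delta values per axis, skipping rows with no axis."""
    totals = {axis: 0.0 for axis in AXES}
    for row in rows:
        axis = row.get("ch_axis")
        delta = row.get("ch_axis_delta")
        if axis is None or delta is None:
            continue  # ordinary observation that moved no axis
        if axis not in totals:
            raise ValueError(f"unknown axis code: {axis}")
        totals[axis] += float(delta)
    return totals


# Made-up rows for one participant.
sample = [
    {"person_id": 1, "ch_axis": "SC", "ch_axis_delta": 1.5},
    {"person_id": 1, "ch_axis": "SC", "ch_axis_delta": -0.5},
    {"person_id": 1, "ch_axis": "RS", "ch_axis_delta": 2.0},
    {"person_id": 1, "ch_axis": None, "ch_axis_delta": None},
]
totals = net_axis_delta(sample)
```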
measurement
Quantitative test results: labs, vital signs, calculated indices, instrument scores. Largest OMOP table by row count. Values pass through unchanged; dates are patient-shifted; provider/care-site IDs are surrogated.
measurement — abbreviated
| Field | Type | Source | Deid | Notes |
|---|---|---|---|---|
| measurement_concept_id | int | LOINC | PASS | Standard LOINC concept. |
| measurement_date | date | — | SHIFT | ±90 days, patient-stable. |
| value_as_number | numeric | — | PASS | Pass-through. |
| unit_concept_id | int | UCUM | PASS | — |
| range_low / range_high | numeric | — | PASS | Lab-reported reference ranges. |
| measurement_source_value | string | — | REDACT | Source-system text; NER-redacted. |
| ch_axis_attribution | enum | PO/NM/ER/SC/RS/ES/TA/PV | PASS | CH ext. Which axis (if any) this measurement updates. |
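Since `value_as_number`, `range_low`, and `range_high` all pass through unchanged, a result can be compared against its lab-reported reference range directly. The following is a minimal sketch; the function name and the returned labels are assumptions chosen for illustration, and it treats a one-sided range as in range when the available bound is satisfied.

```python
from typing import Optional


def classify_against_range(
    value: Optional[float],
    range_low: Optional[float],
    range_high: Optional[float],
) -> str:
    """Label a measurement relative to its lab-reported reference range."""
    if value is None:
        return "missing"
    if range_low is None and range_high is None:
        return "no_range"  # the lab reported no reference range
    if range_low is not None and value < range_low:
        return "low"
    if range_high is not None and value > range_high:
        return "high"
    return "in_range"
```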
drug_exposure
Medications dispensed, prescribed, or administered. Includes inpatient administration records, outpatient prescriptions, and pharmacy fill events. Standard RxNorm coding throughout.
drug_exposure — abbreviated
| Field | Type | Source | Deid | Notes |
|---|---|---|---|---|
| drug_concept_id | int | RxNorm RxCUI | PASS | — |
| drug_exposure_start_date | date | — | SHIFT | Patient date shift. |
| drug_exposure_end_date | date | — | SHIFT | Same shift. |
| days_supply | int | — | PASS | — |
| quantity | numeric | — | PASS | — |
| refills | int | — | PASS | — |
| drug_type_concept_id | int | OMOP type | PASS | Rx written / dispensed / administered / inferred. |
| stop_reason | string | — | REDACT | NER-redacted free text. |
| ch_42cfr2_flag | bool | CHCO | SUPPR | CH ext. SUD-treatment Rx is suppressed unless the study has Tier-2+ IREB approval and 42 CFR Part 2 redisclosure consent is on file. |
condition_occurrence
Diagnoses recorded by clinicians, problem-list entries, claims-coded conditions. SNOMED CT is the standard concept; ICD-10-CM source is preserved.
condition_occurrence — abbreviated
| Field | Type | Source | Deid | Notes |
|---|---|---|---|---|
| condition_concept_id | int | SNOMED CT | PASS | — |
| condition_start_date | date | — | SHIFT | Patient date shift. |
| condition_end_date | date | — | SHIFT | Same shift. |
| condition_status_concept_id | int | OMOP status | PASS | Active, resolved, ruled-out, etc. |
| condition_source_value | string | ICD-10-CM | PASS | Original code preserved. |
| stop_reason | string | — | REDACT | NER-redacted text. |
| ch_sensitive_class | enum | CHCO | PASS | CH ext. none / BH / SUD / HIV / genomic / repro — drives suppression rules. |
ch_axis_snapshot
The signature CH extension. One row per participant per day, holding their 8-axis vector and overall CH score. This is what makes CH research data structurally different from claims and chart data — every patient has a longitudinal time-series of their lived health, not just billing events.
- Physical Origin (PO): Anchor work, baseline movement, restorative habits.
- Nourishment (NM): Diet quality, meal cadence, hydration.
- Effort & Recovery (ER): Training load, sleep, recovery markers.
- Social Connection (SC): Quality and frequency of meaningful contact.
- Rhythm & Sleep (RS): Circadian alignment, sleep-stage architecture.
- Emotional State (ES): Mood, affect, stress regulation.
- Thought & Attention (TA): Cognitive engagement, focus, learning.
- Purpose & Vision (PV): Sense of meaning, agency, longitudinal direction.
ch_axis_snapshot — 14 fields
| Field | Type | Range | Deid | Notes |
|---|---|---|---|---|
| snapshot_id | int64 | — | SHIFT | Surrogate. |
| person_id | int64 | FK | SHIFT | — |
| snapshot_date | date | — | SHIFT | ±90 day patient shift. |
| axis_po | numeric | 0–100 | PASS | Physical Origin score. |
| axis_nm | numeric | 0–100 | PASS | Nourishment score. |
| axis_er | numeric | 0–100 | PASS | Effort & Recovery score. |
| axis_sc | numeric | 0–100 | PASS | Social Connection score. |
| axis_rs | numeric | 0–100 | PASS | Rhythm & Sleep score. |
| axis_es | numeric | 0–100 | PASS | Emotional State score. |
| axis_ta | numeric | 0–100 | PASS | Thought & Attention score. |
| axis_pv | numeric | 0–100 | PASS | Purpose & Vision score. |
| ch_score | numeric | 0–100 | PASS | Composite Master-Equation score. |
| data_completeness | numeric | 0–1 | PASS | Fraction of expected signals present that day. |
| computation_version | string | semver | PASS | ME computation version (e.g. 2.4.0). Pin to this for reproducibility. |
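Two columns in this table matter for reproducible analysis: `computation_version` (scores from different ME versions should not be mixed) and `data_completeness` (sparse days may be unreliable). The sketch below shows one way to apply both filters before averaging `ch_score`. The 0.8 threshold, the function name, and the sample rows are assumptions for illustration only.

```python
from statistics import mean


def usable_scores(snapshots, version, min_completeness=0.8):
    """Return ch_score values from one computation version with enough signal.

    min_completeness=0.8 is an illustrative threshold, not a CH recommendation.
    """
    return [
        s["ch_score"]
        for s in snapshots
        if s["computation_version"] == version
        and s["data_completeness"] >= min_completeness
    ]


# Made-up daily snapshots for a single participant.
snapshots = [
    {"person_id": 7, "ch_score": 62.0, "data_completeness": 0.95, "computation_version": "2.4.0"},
    {"person_id": 7, "ch_score": 64.0, "data_completeness": 0.40, "computation_version": "2.4.0"},
    {"person_id": 7, "ch_score": 70.0, "data_completeness": 0.90, "computation_version": "2.3.1"},
    {"person_id": 7, "ch_score": 66.0, "data_completeness": 0.85, "computation_version": "2.4.0"},
]

scores = usable_scores(snapshots, version="2.4.0")
average = mean(scores)  # mean of the two rows that pass both filters
```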
Deidentification methods
Five methods cover every field. The deid column on every entity table tells you which one applies. Combined methods comply with HIPAA Safe Harbor (45 CFR § 164.514(b)(2)) and meet the more stringent Expert Determination standard for the cohorts we publish.
- PASS: Field is non-identifying by construction. Coded vocabularies, numeric values without obvious dates, axis scores. Passed unchanged.
- SHIFT: Dates shifted by a patient-stable random offset of ±90 days. IDs rotated to per-study surrogates. Time intervals between events preserved.
- BIN: Continuous values bucketed (e.g. ZIP → county; age > 89 → "90+"; hour-of-day → nearest 4-hour bin). Reduces identifiability while preserving research value.
- REDACT: NER-based PHI redaction across 18 HIPAA Safe Harbor identifiers + 12 CH-specific. Replaced with placeholder tokens ([NAME-1], [LOC-1]). Reviewed by output-review for k-anonymity.
- SUPPR: Field is removed entirely or set to a fixed null value. Used for direct identifiers (MRN, full DOB), 42 CFR Part 2-protected fields without explicit redisclosure consent, and other high-risk classes.
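The date-shift method (SHIFT in the Deid columns) preserves intervals because every date for a given patient moves by the same offset. The sketch below illustrates that property under stated assumptions: it derives the offset from a keyed hash of a patient key rather than a stored random draw, and the names `patient_offset_days`, `shift_date`, and `study_secret` are hypothetical. It is not a description of how CH actually generates or stores offsets.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 90  # the ±90 day window described above


def patient_offset_days(person_key: str, study_secret: bytes) -> int:
    """Derive a stable offset in [-90, 90] days for one patient.

    Assumption: a keyed hash stands in for the stored random offset.
    """
    digest = hmac.new(study_secret, person_key.encode("utf-8"), hashlib.sha256).digest()
    span = 2 * MAX_SHIFT_DAYS + 1  # 181 possible offsets
    return int.from_bytes(digest[:8], "big") % span - MAX_SHIFT_DAYS


def shift_date(d: date, person_key: str, study_secret: bytes) -> date:
    """Apply the patient's offset to a single date."""
    return d + timedelta(days=patient_offset_days(person_key, study_secret))


# Two events for the same (made-up) patient keep their 51-day gap after shifting.
secret = b"demo-secret"
start, end = date(2024, 1, 10), date(2024, 3, 1)
gap_before = end - start
gap_after = shift_date(end, "p1", secret) - shift_date(start, "p1", secret)
```

The practical consequence for analysis is that durations and orderings within one patient are trustworthy, while calendar dates and cross-patient date comparisons are not.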
Re-identification risk
Deidentified ≠ anonymized. We treat re-identification risk as ongoing. Three layers protect against attempts:
1 · Output review (every result)
Every result you export from the enclave is reviewed before release. Cells with cohort < 11 are auto-suppressed. Counts within ±2 of small cohorts are jittered. Anything that violates k-anon for any combination of demographic axes gets held for human review.
2 · Linkage attack monitoring
We monitor for query patterns consistent with linkage attempts (e.g. repeatedly slicing by ZIP + DOB + sex). The IREB is alerted and the study can be paused for review.
3 · Annual re-identification audit
Each year, an external auditor (currently Privacy Analytics) attempts re-identification on a sample of our published outputs using public datasets. Last audit (2025-08): 0 of 47 records re-identifiable at 95% confidence. Report posted on the security page.
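The small-cell rule from layer 1 (cells with cohort < 11 are suppressed) can be pre-applied to your own result tables before you request export. The following is a minimal sketch of the threshold check only; it does not reproduce the jittering or the k-anonymity review, and the function name `suppress_small_cells` is an assumption.

```python
MIN_CELL_SIZE = 11  # cells with cohort < 11 are suppressed


def suppress_small_cells(counts: dict, min_size: int = MIN_CELL_SIZE) -> dict:
    """Replace any count below min_size with None, leaving other cells intact."""
    return {
        cell: (n if n >= min_size else None)
        for cell, n in counts.items()
    }


# Made-up cohort counts by stratum.
released = suppress_small_cells({"A": 25, "B": 10, "C": 11})
```

Here stratum "B" (10 participants) is withheld while "C" (exactly 11) passes, matching the strict less-than in the rule.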
What you cannot do, even in the enclave
The enclave software environment enforces these prohibitions. Violation attempts are logged and may result in study termination.
- Cross-link to external identifying datasets you bring in (file uploads are scanned).
- Generate outputs that uniquely identify cohorts smaller than 11.
- Use unprotected free-text fields to derive direct identifiers.
- Combine date-shifted dates with external timeline data to reverse the shift.
- Attempt to fingerprint participants via rare-condition combinations beyond what your protocol justifies.
Provenance & lineage
Every record in the registry has a provenance chain. You can trace any row back through ETL → operational system → originating clinic / device / patient self-report. The provenance fields are visible in the enclave and are required for any publication.
Provenance fields (on every fact table)
Each table includes a standard set of provenance columns: source_system (which clinic, device, or app produced the row), extraction_timestamp (when ETL ran), etl_version (semver of the ETL job), chain_event_id (FK into ch_chain_event for chain-pinned operations), and quality_flags (bitset for known data-quality issues).
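A bitset column such as `quality_flags` is typically decoded by testing individual bits. The page does not list the actual bit assignments, so the flag names and positions below are entirely hypothetical placeholders; only the decoding pattern is the point of this sketch.

```python
from enum import IntFlag


class QualityFlag(IntFlag):
    """Hypothetical bit assignments; the real ones are not documented here."""
    MISSING_UNIT = 1
    OUT_OF_RANGE = 2
    DUPLICATE_SUSPECTED = 4


def decode_quality_flags(bits: int) -> list:
    """Return the names of all known flags set in a quality_flags value."""
    return [flag.name for flag in QualityFlag if bits & flag.value]


# A value of 5 has bits 1 and 4 set.
flags = decode_quality_flags(5)
```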
Chain-pinned events
A subset of high-trust events — consent changes, prescription writes, signed clinical notes, AI override decisions, axis-token issuance — are pinned to the CH Chain. Their chain_event_id resolves to a public hash you can verify independently. We use this for the kind of facts where "we say it happened" is not enough.
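Independent verification generally means recomputing a digest over the event payload and comparing it with the published hash. The page does not specify the hash algorithm or the serialization, so this sketch assumes SHA-256 over a canonical JSON encoding; both choices, the function names, and the sample event are assumptions, and the real CH Chain format may differ.

```python
import hashlib
import hmac
import json


def canonical_hash(event: dict) -> str:
    """Hash an event payload as SHA-256 over key-sorted, compact JSON.

    Assumption: the chain's real serialization and algorithm are not documented here.
    """
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def matches_published_hash(event: dict, published_hex: str) -> bool:
    """Compare the recomputed digest with a published hex digest in constant time."""
    return hmac.compare_digest(canonical_hash(event), published_hex.lower())


# Made-up event: any change to the payload changes the digest.
event = {"chain_event_id": 42, "kind": "consent_change"}
published = canonical_hash(event)
tampered = dict(event, kind="prescription_write")
```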
Downloads & SDKs
All artifacts are free; no application is required.
pip install ch-research