Intelligent Health Data Harmonization

Transform chaotic raw health data into structured, analysis-ready formats with complete reproducibility, transparency, and AI-assisted intelligence. Built for national-scale datasets and trusted by leading scientific publications.

100% Reproducible
AI-Powered Mapping
Production Ready
config.yaml
mapping:
  source_column: "icd10_code"
  sources:
    - type: "ai"
      threshold: 0.85

# AI-assisted harmonization
# with human-in-the-loop

Enterprise-Grade Features

Built for researchers, ministries of health, and academic institutions worldwide

100% Reproducible

Every transformation is logged. Run the same config with the same data, get identical results. Perfect for peer review and scientific publication.
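One way to make "same config, same data, identical results" checkable is to record a fingerprint of the inputs alongside each run. The sketch below is an illustration only, not AutoGBD's actual logging code: the function name `run_fingerprint` and the sample data are assumptions introduced here.

```python
import hashlib
import json


def run_fingerprint(config: dict, data_bytes: bytes) -> str:
    """Return a SHA-256 digest identifying one (config, input data) pair.

    Hypothetical helper for illustration; not part of AutoGBD's API.
    """
    # Serialize with sorted keys so that key order cannot change the hash.
    canonical_config = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(canonical_config.encode("utf-8"))
    digest.update(b"\x00")  # separator between the config and the data
    digest.update(data_bytes)
    return digest.hexdigest()


# Made-up sample inputs.
config = {
    "mapping": {
        "source_column": "icd10_code",
        "sources": [{"type": "ai", "threshold": 0.85}],
    }
}
data = b"icd10_code,deaths\nI21,120\nJ18,45\n"

first = run_fingerprint(config, data)
second = run_fingerprint(config, data)
changed = run_fingerprint({**config, "quality": {"enabled": True}}, data)

print(first == second)   # same inputs give the same fingerprint
print(first == changed)  # a config change gives a different fingerprint
```

Storing such a digest in a report would let a reviewer confirm that a rerun used exactly the same config and input file.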

Publication-Ready Reports

Methodological appendix reports are generated automatically in Markdown, with a complete audit trail for every decision made during harmonization.

AI-Assisted Mapping

Transformer models suggest GBD cause mappings with confidence scores. Human-in-the-loop ensures trust and accuracy.
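The interaction between confidence scores and human review can be pictured as a simple routing step: suggestions at or above the configured threshold are accepted, and the rest are queued for a person to check. This is a minimal sketch under that assumption. The `Suggestion` class, the `route_suggestions` function, and the scores are all made up for illustration, and the real pipeline may handle review differently.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One model-proposed mapping (hypothetical structure)."""
    source_code: str
    target_cause: str
    confidence: float


def route_suggestions(suggestions, threshold=0.85):
    """Split suggestions into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for suggestion in suggestions:
        if suggestion.confidence >= threshold:
            accepted.append(suggestion)
        else:
            review.append(suggestion)
    return accepted, review


# Illustrative scores only; they are not output from any real model.
suggestions = [
    Suggestion("I21", "Ischemic heart disease", 0.93),
    Suggestion("J18", "Lower respiratory infections", 0.71),
]

accepted, review = route_suggestions(suggestions, threshold=0.85)
print([s.source_code for s in accepted])
print([s.source_code for s in review])
```

Raising the threshold sends more suggestions to human review; lowering it accepts more mappings automatically.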

Configuration-Driven

A single YAML file controls the entire pipeline, so no code changes are needed and non-technical users can configure complex workflows.
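Validating a parsed config before running anything might look like the sketch below. It uses plain Python on an already-parsed dictionary as a stand-in for the Pydantic validation listed in the technology stack. The function name `validate_config` and the specific rules are assumptions made for illustration; only the keys mirror the quick-start example.

```python
def validate_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid.

    Hypothetical validator for illustration, not AutoGBD's schema.
    """
    errors = []

    io_section = cfg.get("io", {})
    for key in ("input_file", "output_file"):
        value = io_section.get(key)
        if not isinstance(value, str) or not value:
            errors.append(f"io.{key} must be a non-empty string")

    mapping = cfg.get("mapping", {})
    if not isinstance(mapping.get("source_column"), str):
        errors.append("mapping.source_column must be a string")

    for i, source in enumerate(mapping.get("sources", [])):
        threshold = source.get("threshold")
        if threshold is None:
            continue  # assume the threshold is optional
        if not isinstance(threshold, (int, float)) or not 0.0 <= threshold <= 1.0:
            errors.append(
                f"mapping.sources[{i}].threshold must be a number between 0 and 1"
            )

    return errors


good = {
    "io": {"input_file": "data/raw_data.csv", "output_file": "output/harmonized.csv"},
    "mapping": {
        "source_column": "icd10_code",
        "sources": [{"type": "ai", "threshold": 0.85}],
    },
    "quality": {"enabled": True, "checks": []},
}
bad = {**good, "mapping": {"source_column": "icd10_code",
                           "sources": [{"type": "ai", "threshold": 1.5}]}}

print(validate_config(good))
print(validate_config(bad))
```

Returning every problem at once, rather than stopping at the first, suits users who edit the config by hand.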

Data Quality Assurance

Comprehensive quality checks with customizable validation rules. Quality scores and issue reporting are built in.
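As a rough picture of how validation rules, issue reporting, and a quality score can fit together, the sketch below runs two simple checks over a few rows and scores the share of (row, check) pairs that passed. The check names, the scoring formula, and the sample rows are all assumptions for illustration; the source does not specify how AutoGBD computes its scores.

```python
def check_missing(rows, column):
    """Flag rows where `column` is absent or empty."""
    return [
        f"row {i}: missing {column}"
        for i, row in enumerate(rows)
        if row.get(column) in (None, "")
    ]


def check_non_negative(rows, column):
    """Flag rows where a numeric `column` is below zero."""
    return [
        f"row {i}: negative {column}"
        for i, row in enumerate(rows)
        if isinstance(row.get(column), (int, float)) and row[column] < 0
    ]


def quality_report(rows, checks):
    """Run each (check, column) pair and summarize the result."""
    issues = []
    for check, column in checks:
        issues.extend(check(rows, column))
    total = len(rows) * len(checks)
    # Assumed scoring rule: share of (row, check) pairs without an issue.
    score = 1.0 if total == 0 else 1 - len(issues) / total
    return {"score": score, "issues": issues}


# Made-up sample rows.
rows = [
    {"icd10_code": "I21", "deaths": 120},
    {"icd10_code": "", "deaths": 45},
    {"icd10_code": "J18", "deaths": -3},
    {"icd10_code": "C34", "deaths": 10},
]

report = quality_report(
    rows, [(check_missing, "icd10_code"), (check_non_negative, "deaths")]
)
print(report["score"])
print(report["issues"])
```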

Extensible Architecture

Plugin system for custom data sources, mapping targets, and quality checks. Designed for community contributions.
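A plugin system for data sources is often built as a registry keyed by file extension. The sketch below shows that general pattern only. The names `READERS`, `register_reader`, and `load` are hypothetical and are not taken from AutoGBD's code, whose plugin interface may look quite different.

```python
import csv
import io
import os

# Hypothetical registry mapping a file extension to a reader function.
READERS = {}


def register_reader(extension):
    """Decorator that registers a reader for one file extension."""
    def decorator(func):
        if extension in READERS:
            raise ValueError(f"a reader for {extension} is already registered")
        READERS[extension] = func
        return func
    return decorator


@register_reader(".tsv")
def read_tsv(text):
    """Parse tab-separated text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def load(filename, text):
    """Dispatch to the reader registered for the file's extension."""
    extension = os.path.splitext(filename)[1].lower()
    if extension not in READERS:
        raise ValueError(f"no reader registered for {extension!r}")
    return READERS[extension](text)


rows = load("deaths.tsv", "icd10_code\tdeaths\nI21\t120\n")
print(rows)
```

With this pattern, a contributor adds support for a new format by writing one decorated function, without touching the dispatch code.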

Robust Architecture

Modular, tested, and designed for scale

I/O Handlers

CSV, Excel, Parquet support with plugin architecture

→

Data Cleaning

8 built-in rules with configurable parameters

→

Mapping Engine

Direct, fuzzy, and AI-powered mapping

→

Quality Checks

9 validation checks with scoring

→

Reporting

Publication-ready Markdown reports
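The Mapping Engine stage in the pipeline above lists direct and fuzzy mapping among its strategies. One plausible way to combine them is a cascade: try an exact lookup first, then fall back to the closest fuzzy match above a cutoff. The sketch below uses the standard library's `difflib` as a stand-in for RapidFuzz so that it runs without extra dependencies. The function `map_label`, the 0.85 cutoff, and the lookup table are assumptions for illustration.

```python
import difflib


def map_label(label, lookup, cutoff=0.85):
    """Map a free-text label to a target cause.

    Returns (target, method, score), where method is "direct",
    "fuzzy", or "unmapped". Hypothetical helper for illustration.
    """
    key = label.strip().lower()

    # Step 1: direct lookup on the normalized label.
    if key in lookup:
        return lookup[key], "direct", 1.0

    # Step 2: closest fuzzy match at or above the cutoff, if any.
    matches = difflib.get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    if matches:
        score = difflib.SequenceMatcher(None, key, matches[0]).ratio()
        return lookup[matches[0]], "fuzzy", score

    # Step 3: leave it for a later stage or for human review.
    return None, "unmapped", 0.0


# Made-up lookup table keyed by lowercase label.
lookup = {
    "ischemic heart disease": "Ischemic heart disease",
    "lower respiratory infections": "Lower respiratory infections",
}

for label in ["Ischemic Heart Disease", "ischaemic heart disease", "road injuries"]:
    target, method, score = map_label(label, lookup)
    print(label, "->", target, method, round(score, 2))
```

Labels that stay unmapped after both steps are natural candidates for the AI-assisted stage or for manual review.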

Technology Stack

Python 3.10+: Modern Python with type hints
Pandas: Data manipulation & analysis
Pydantic: Configuration validation
RapidFuzz: Fuzzy string matching
Sentence Transformers: AI-powered embeddings
Pytest: Comprehensive test suite

Visual Configuration Builder

Create config files without writing YAML

Streamlit Web Interface

Our visual config builder makes it easy for non-technical users to create complex configurations:

  • Visual form-based editor
  • Live YAML preview
  • Real-time validation
  • Download config files
  • Upload and edit existing configs

# Launch the config builder
streamlit run autogbd/app.py

# Or use the CLI
autogbd config-builder

[Screenshot: AutoGBD Config Builder with Editor, Preview, and About tabs, showing Remove Duplicates and Normalize Sex Values options]

Get Started in Minutes

Installation and quick start guide

Terminal
# Install AutoGBD
pip install "autogbd[app]"

# Or install from source
git clone https://github.com/m-aljasem/autogbd.git
cd autogbd
pip install -e ".[app]"
config.yaml
io:
  input_file: "data/raw_data.csv"
  output_file: "output/harmonized.csv"

mapping:
  source_column: "icd10_code"
  sources:
    - type: "ai"
      threshold: 0.85

quality:
  enabled: true
  checks: []
Terminal
# Validate the configuration first
autogbd validate --config config.yaml

# Run the harmonization pipeline
autogbd run --config config.yaml

# Launch visual config builder
autogbd config-builder

Trusted by Researchers Worldwide

Production-ready for real-world applications

Ministries of Health

Harmonize national mortality data for GBD studies. Handle millions of records with complete audit trails for regulatory compliance.

Academic Research

Reproducible data pipelines for peer-reviewed publications. Complete methodological transparency meets journal requirements.

Multi-Country Studies

Harmonize data from diverse sources across countries. Support for multiple ICD versions and local coding systems.

Comprehensive Documentation

Everything you need to succeed