GJQu/predoc-coding-sample

predoc-coding-sample

Clean, reproducible research pipeline for an applied microeconomics (health and labor) project. The repo supports both synthetic data (for demonstration) and real data from the UK Household Longitudinal Study (UKHLS).

Setup

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Run — Real Data Pipeline

The real-data pipeline ingests the UKHLS frailty panel, clusters individuals into health types via K-Means, and generates tables and figures.

Pipeline steps:

  1. ingest_real — reads data/raw/frailty_long_panel.parquet, renames columns, computes lagged frailty
  2. cluster — K-Means on per-individual mean frailty (k=3 by default)
  3. report — generates summary tables and figures

Run all three steps with either command:

./run_real.sh
# or
.venv/bin/python3 -m src.cli_real --config configs/config_real.yaml
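
The ingest and cluster steps above can be sketched without any dependencies. This is only an illustration, not the repo's implementation: the column names `pid`, `wave`, and `frailty`, the helper names, and the toy panel are all assumptions, and a small hand-written 1-D Lloyd's algorithm stands in for whatever K-Means routine the pipeline actually calls.

```python
from collections import defaultdict
from statistics import mean


def add_lagged_frailty(rows):
    """Attach each person's frailty from their previous observed wave (None for the first)."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[row["pid"]].append(row)
    out = []
    for person_rows in by_person.values():
        prev = None
        for row in sorted(person_rows, key=lambda r: r["wave"]):
            out.append({**row, "frailty_lag": prev})
            prev = row["frailty"]
    return out


def person_means(rows):
    """Mean frailty per individual, the single feature used for clustering."""
    by_person = defaultdict(list)
    for row in rows:
        by_person[row["pid"]].append(row["frailty"])
    return {pid: mean(vals) for pid, vals in by_person.items()}


def assign(values, centers):
    """Index of the nearest center for every value."""
    return [min(range(len(centers)), key=lambda j: (v - centers[j]) ** 2) for v in values]


def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on scalars with a deterministic, quantile-style start."""
    ordered = sorted(values)
    centers = [ordered[(2 * j + 1) * len(ordered) // (2 * k)] for j in range(k)]
    for _ in range(n_iter):
        labels = assign(values, centers)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(values, centers), centers


# Toy long panel (hypothetical values): six people observed in two waves each.
panel = [
    {"pid": p, "wave": w, "frailty": f}
    for p, w, f in [
        (1, 1, 0.05), (1, 2, 0.07), (2, 1, 0.04), (2, 2, 0.06),
        (3, 1, 0.20), (3, 2, 0.22), (4, 1, 0.19), (4, 2, 0.23),
        (5, 1, 0.50), (5, 2, 0.54), (6, 1, 0.48), (6, 2, 0.52),
    ]
]

lagged = add_lagged_frailty(panel)
means = person_means(panel)
labels, centers = kmeans_1d(list(means.values()), k=3)
health_type = dict(zip(means, labels))  # pid -> cluster index
print(health_type)
```

Clustering on one number per person means each individual gets a single, time-invariant health type, which is what lets the later tables split every wave by type.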

Outputs

Tables:

  • output/tables/tab01_summary_stats.csv — summary stats by health type
  • output/tables/tab02_frailty_by_wave.csv — frailty by wave and health type

Figures:

  • output/figures/fig01_frailty_trajectories.png — frailty trajectories by health type
  • output/figures/fig02_frailty_distribution.png — frailty distributions per cluster
  • output/figures/fig03_cluster_diagnostics.png — elbow and silhouette plots

Metrics: output/metrics/metrics.json

Log: output/logs/pipeline_real.log
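
For readers unfamiliar with the diagnostics in fig03, the sketch below computes the two quantities that elbow and silhouette plots conventionally show: within-cluster sum of squares (inertia) and the mean silhouette coefficient. It is a dependency-free illustration on made-up values with hand-assigned labels; the function names are hypothetical and the repo may compute these differently.

```python
def _group(values, labels):
    """Collect values by cluster label."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    return clusters


def inertia(values, labels):
    """Within-cluster sum of squared distances to each cluster mean (elbow-plot y-axis)."""
    total = 0.0
    for members in _group(values, labels).values():
        center = sum(members) / len(members)
        total += sum((v - center) ** 2 for v in members)
    return total


def silhouette_score(values, labels):
    """Mean silhouette coefficient for scalar data, with absolute difference as distance."""
    clusters = _group(values, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(v - u) for u in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)


# Hypothetical per-individual mean frailty values with two candidate labelings.
values = [0.05, 0.06, 0.21, 0.22, 0.50, 0.52]
labels_k3 = [0, 0, 1, 1, 2, 2]
labels_k2 = [0, 0, 0, 0, 1, 1]

for name, labels in [("k=2", labels_k2), ("k=3", labels_k3)]:
    print(name, inertia(values, labels), silhouette_score(values, labels))
```

Inertia always falls as k grows, so the elbow plot is read for the point where the drop flattens, while the silhouette score can be compared directly across values of k.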

Run — Synthetic Data Pipeline

Generates synthetic data and runs the full pipeline including OLS regressions.

./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml

Fast mode

FAST=1 ./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml --fast
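
The two invocations above suggest that an environment variable and a command-line flag switch on the same mode. One way a CLI could reconcile them is sketched below with `argparse`. This is an assumption about the design, not the code in `src/cli.py`: the `parse_args` helper and the rule that `FAST=1` is equivalent to `--fast` are hypothetical.

```python
import argparse
import os


def parse_args(argv=None, env=None):
    """Parse --config and --fast, also honoring FAST=1 from the environment."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="src.cli")
    parser.add_argument("--config", required=True, help="path to the YAML config file")
    parser.add_argument("--fast", action="store_true", help="enable fast mode")
    args = parser.parse_args(argv)
    # Assumption: FAST=1 in the environment is treated the same as passing --fast.
    args.fast = args.fast or env.get("FAST") == "1"
    return args


args = parse_args(["--config", "configs/config.yaml"], env={"FAST": "1"})
print(args.config, args.fast)
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.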

Tests

Run the test suite (uses a 200-individual subsample from the real data):

.venv/bin/python3 -m pytest tests/test_real_pipeline.py -v
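
A subsample of individuals, rather than of rows, keeps each sampled person's full wave history intact, which matters for anything that depends on lagged values. The sketch below shows one reproducible way to draw such a subsample. The `pid` column name, the seed, and the helper are assumptions for illustration, not the fixture used in tests/test_real_pipeline.py.

```python
import random


def subsample_individuals(rows, n_individuals=200, seed=0, id_key="pid"):
    """Keep every row belonging to a seeded random sample of individuals."""
    ids = sorted({row[id_key] for row in rows})  # sort so the draw is reproducible
    rng = random.Random(seed)
    keep = set(rng.sample(ids, min(n_individuals, len(ids))))
    return [row for row in rows if row[id_key] in keep]


# Toy panel (hypothetical): 1000 people observed in three waves each.
panel = [{"pid": p, "wave": w} for p in range(1000) for w in (1, 2, 3)]
sample = subsample_individuals(panel, n_individuals=200, seed=0)
print(len({row["pid"] for row in sample}), len(sample))
```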

Data

  • data/raw/ — real UKHLS frailty panel (parquet) and death records. Not committed to git.
  • data/sample/ — synthetic data generated by the pipeline.
  • data/derived/ — intermediate pipeline outputs (parquet).

Notes

  • Notebooks are archived in notebooks/archive/ for provenance only; the pipeline does not depend on them.
  • Synthetic data do not match the distribution of the actual frailty data, so results from the synthetic pipeline are for demonstration only.

Paper Assets

  • LaTeX sources: paper/tex/
  • Compiled outputs: paper/final/

About

Reproducible data pipeline (Python) for my thesis on health types and labor market outcomes.
