Clean, reproducible research pipeline for an applied micro / health & labor project. The repo supports both synthetic data (for demonstration) and real UKHLS data.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

The real-data pipeline ingests the UKHLS frailty panel, clusters individuals into health types via K-Means, and generates tables and figures.
Pipeline steps:
- ingest_real: reads data/raw/frailty_long_panel.parquet, renames columns, computes lagged frailty
- cluster: K-Means on per-individual mean frailty (k=3 by default)
- report: generates summary tables and figures
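The ingest and cluster steps can be sketched with the standard library alone. This is a minimal illustration, not the repository's implementation: the toy rows, the function names `add_lagged_frailty` and `kmeans_1d`, and the choice of initial centers are all assumptions, and the lag is taken from each person's previous observed wave, which may differ from how the pipeline handles gaps.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format rows: (person id, wave, frailty index).
rows = [
    ("a", 1, 0.05), ("a", 2, 0.07), ("a", 3, 0.06),
    ("b", 1, 0.20), ("b", 2, 0.22), ("b", 3, 0.24),
    ("c", 1, 0.45), ("c", 2, 0.50), ("c", 3, 0.55),
    ("d", 1, 0.04), ("d", 2, 0.06),
]


def add_lagged_frailty(rows):
    """Return (pid, wave, frailty, lag) tuples; lag is None in a person's first wave."""
    by_person = defaultdict(list)
    for pid, wave, frailty in rows:
        by_person[pid].append((wave, frailty))
    out = []
    for pid, obs in by_person.items():
        obs.sort()  # order each person's observations by wave
        prev = None
        for wave, frailty in obs:
            out.append((pid, wave, frailty, prev))
            prev = frailty
    return out


def assign(values, centers):
    """Index of the nearest center for each value."""
    return [min(range(len(centers)), key=lambda j: abs(v - centers[j])) for v in values]


def kmeans_1d(values, k=3, n_iter=100):
    """Lloyd's algorithm on scalars (assumes k >= 2 and len(values) >= k)."""
    ordered = sorted(values)
    # Assumed initialization: evenly spaced order statistics.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    for _ in range(n_iter):
        labels = assign(values, centers)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(values, centers), centers


lagged = add_lagged_frailty(rows)

# Per-individual mean frailty is the single clustering feature.
frailty_by_person = defaultdict(list)
for pid, _, frailty in rows:
    frailty_by_person[pid].append(frailty)
person_means = {pid: mean(vals) for pid, vals in frailty_by_person.items()}

pids = list(person_means)
labels, centers = kmeans_1d([person_means[p] for p in pids], k=3)
health_type = dict(zip(pids, labels))
print(health_type, centers)
```

Because the feature is a single number per person, the clusters are just intervals of mean frailty, which is why they can be read as ordered health types.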
./run_real.sh
# or
.venv/bin/python3 -m src.cli_real --config configs/config_real.yaml

Tables:
- output/tables/tab01_summary_stats.csv: summary statistics by health type
- output/tables/tab02_frailty_by_wave.csv: frailty by wave and health type
Figures:
- output/figures/fig01_frailty_trajectories.png: frailty trajectories by health type
- output/figures/fig02_frailty_distribution.png: frailty distributions per cluster
- output/figures/fig03_cluster_diagnostics.png: elbow and silhouette plots
Metrics: output/metrics/metrics.json
Log: output/logs/pipeline_real.log
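For readers unfamiliar with the silhouette diagnostic in fig03, the sketch below computes a mean silhouette score for scalar data with absolute distance. It is illustrative only: the function name `silhouette_1d`, the toy values, and the labelings are assumptions, and the pipeline may compute the score differently.

```python
from statistics import mean


def silhouette_1d(values, labels):
    """Mean silhouette coefficient for scalar data, using absolute distance."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so summing over `own` is safe).
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean([abs(v - u) for u in other])
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical per-individual mean frailty values.
values = [0.05, 0.06, 0.22, 0.24, 0.50, 0.55]
well_separated = silhouette_1d(values, [0, 0, 1, 1, 2, 2])
scrambled = silhouette_1d(values, [0, 1, 2, 0, 1, 2])
print(well_separated, scrambled)
```

Scores near 1 indicate compact, well-separated clusters; comparing the score across candidate values of k is one common way to check whether the default k=3 is reasonable.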
The synthetic pipeline generates synthetic data and runs the full pipeline, including OLS regressions.
./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml

Fast mode:
FAST=1 ./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml --fast

Run the test suite (uses a 200-individual subsample from the real data):
.venv/bin/python3 -m pytest tests/test_real_pipeline.py -v

- data/raw/: real UKHLS frailty panel (parquet) and death records. Not committed to git.
- data/sample/: synthetic data generated by the pipeline.
- data/derived/: intermediate pipeline outputs (parquet).
- Notebooks are archived in notebooks/archive/ for provenance only; the pipeline does not depend on them.
- Synthetic data do not match the distribution of the actual frailty data.
- LaTeX sources: paper/tex/
- Compiled outputs: paper/final/