predoc-coding-sample

A reproducible research pipeline for an applied microeconomics project in health and labor economics. The repo runs on either synthetic data (for demonstration) or real UKHLS (UK Household Longitudinal Study) data.

Setup

python3 -m venv .venv
source .venv/bin/activate
python -m pip install -e .

Run — Real Data Pipeline

The real-data pipeline ingests the UKHLS frailty panel, clusters individuals into health types via K-Means, and generates tables and figures.

Pipeline steps:

ingest_real — reads data/raw/frailty_long_panel.parquet, renames columns, computes lagged frailty
cluster — K-Means on per-individual mean frailty (k=3 by default)
report — generates summary tables and figures

./run_real.sh
# or
.venv/bin/python3 -m src.cli_real --config configs/config_real.yaml
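
The ingest and cluster steps listed above can be illustrated with a minimal sketch. It is not the repo's code: it assumes pandas and scikit-learn, and the column names (`pid`, `wave`, `frailty`) and function names are hypothetical stand-ins for whatever `ingest_real` and `cluster` actually use.

```python
import pandas as pd
from sklearn.cluster import KMeans


def add_lagged_frailty(panel: pd.DataFrame) -> pd.DataFrame:
    """Add each individual's previous-wave frailty as `frailty_lag`."""
    # Column names are assumptions; the real panel may name them differently.
    out = panel.sort_values(["pid", "wave"]).copy()
    out["frailty_lag"] = out.groupby("pid")["frailty"].shift(1)
    return out


def assign_health_types(panel: pd.DataFrame, k: int = 3, seed: int = 0) -> pd.Series:
    """Cluster individuals on mean frailty; label 0 is the least frail type."""
    means = panel.groupby("pid")["frailty"].mean()
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(means.to_frame())
    # K-Means labels are arbitrary, so relabel by ascending cluster centre.
    order = km.cluster_centers_.ravel().argsort()
    relabel = {int(old): new for new, old in enumerate(order)}
    return pd.Series(km.labels_, index=means.index, name="health_type").map(relabel)


# Toy panel: three individuals observed in two waves each.
toy = pd.DataFrame({
    "pid": [1, 1, 2, 2, 3, 3],
    "wave": [1, 2, 1, 2, 1, 2],
    "frailty": [0.05, 0.07, 0.20, 0.24, 0.50, 0.56],
})
lagged = add_lagged_frailty(toy)
types = assign_health_types(toy)
```

Relabelling by cluster centre keeps the health-type labels stable across runs, which matters when tables and figures are keyed on the type.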

Outputs

Tables:

output/tables/tab01_summary_stats.csv — summary stats by health type
output/tables/tab02_frailty_by_wave.csv — frailty by wave and health type

Figures:

output/figures/fig01_frailty_trajectories.png — frailty trajectories by health type
output/figures/fig02_frailty_distribution.png — frailty distributions per cluster
output/figures/fig03_cluster_diagnostics.png — elbow and silhouette plots

Metrics: output/metrics/metrics.json

Log: output/logs/pipeline_real.log
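
As a rough illustration of what an elbow and silhouette diagnostic such as fig03 typically summarizes, the sketch below fits K-Means for several values of k and records the inertia (used for the elbow plot) and the mean silhouette score. It assumes scikit-learn; the function name and the toy values are hypothetical and not taken from the repo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_diagnostics(x, k_values, seed=0):
    """Return {k: (inertia, silhouette)} for K-Means fits on a 1-D feature."""
    X = np.asarray(x, dtype=float).reshape(-1, 1)
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    return results


# Hypothetical per-individual mean frailty with three well-separated groups.
mean_frailty = [0.04, 0.05, 0.06, 0.24, 0.25, 0.26, 0.54, 0.55, 0.56]
diag = cluster_diagnostics(mean_frailty, k_values=range(2, 6))

# The k with the highest mean silhouette is one common selection rule.
best_k = max(diag, key=lambda k: diag[k][1])
```

Inertia always tends to fall as k grows, so the elbow is read as the point where the decline flattens; the silhouette score gives a second, scale-free check on the same choice.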

Run — Synthetic Data Pipeline

Generates synthetic data and runs the full pipeline, including OLS regressions.

./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml
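
For readers unfamiliar with the idea, here is a minimal sketch of generating synthetic data and fitting an OLS regression to it. The specification (frailty on lagged frailty and age), the coefficients, and all variable names are assumptions made for illustration; the README does not state which regressions the pipeline runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process; not the repo's actual one.
lagged = rng.uniform(0.0, 0.6, n)
age = rng.uniform(50, 90, n)
frailty = 0.02 + 0.8 * lagged + 0.001 * (age - 50) + rng.normal(0, 0.02, n)

# OLS via least squares: columns are intercept, lagged frailty, age - 50.
X = np.column_stack([np.ones(n), lagged, age - 50])
beta, *_ = np.linalg.lstsq(X, frailty, rcond=None)
```

Because the true coefficients are known by construction, recovering them is a quick check that the estimation code is wired up correctly.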

Fast mode

FAST=1 ./run.sh
# or
.venv/bin/python3 -m src.cli --config configs/config.yaml --fast

Tests

Run the test suite (uses a 200-individual subsample from the real data):

.venv/bin/python3 -m pytest tests/test_real_pipeline.py -v
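
One way to draw a subsample of individuals from a long panel is to sample IDs and keep every row for the sampled IDs, so each person's wave history stays intact. The sketch below assumes pandas; the function name, the `pid` column, and the toy panel are hypothetical, and the test suite may build its subsample differently.

```python
import numpy as np
import pandas as pd


def subsample_individuals(panel, n_individuals=200, id_col="pid", seed=0):
    """Keep all rows for a random sample of individuals."""
    ids = panel[id_col].drop_duplicates()
    keep = ids.sample(n=min(n_individuals, len(ids)), random_state=seed)
    return panel[panel[id_col].isin(keep)]


# Toy panel: 1,000 individuals observed in three waves each.
panel = pd.DataFrame({
    "pid": np.repeat(np.arange(1000), 3),
    "wave": np.tile([1, 2, 3], 1000),
})
sub = subsample_individuals(panel)
```

Sampling at the individual level rather than the row level keeps lagged variables well defined in the subsample.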

Data

data/raw/ — real UKHLS frailty panel (parquet) and death records. Not committed to git.
data/sample/ — synthetic data generated by the pipeline.
data/derived/ — intermediate pipeline outputs (parquet).

Notes

Notebooks are archived in notebooks/archive/ for provenance only; the pipeline does not depend on them.
The synthetic data do not match the distribution of the real frailty data, so results from the synthetic pipeline are illustrative only.

Paper Assets

LaTeX sources: paper/tex/
Compiled outputs: paper/final/
