Head and Neck Malignant Neoplasms in Chile: In-Hospital Mortality and Length of Stay Analysis (GRD 2019–2024)

Neoplasias Malignas de Cabeza y Cuello en Chile: Análisis de Mortalidad Hospitalaria y Duración de Estadía (GRD 2019–2024)

Author

Amaru Simón Agüero Jiménez

Published

April 6, 2026

NoteResumen

FACTORES PRONÓSTICOS DE MORTALIDAD INTRAHOSPITALARIA EN CÁNCER DE CABEZA Y CUELLO EN CHILE MEDIANTE ECUACIONES DE ESTIMACIÓN GENERALIZADAS, 2019-2024

Introducción: La mortalidad intrahospitalaria por neoplasias malignas de cabeza y cuello varía sustancialmente según localización tumoral, pero los factores pronósticos específicos por sitio han sido escasamente estudiados en Chile considerando la agrupación natural de los pacientes dentro de centros hospitalarios.

Materiales y Métodos: Cohorte retrospectiva de 9.411 egresos del sistema de Grupos Relacionados de Diagnóstico (GRD) 2019–2024 con diagnóstico principal de neoplasia maligna de cabeza y cuello (CIE-10 C00–C14, C30–C32), agrupados en ocho subsitios clínicos. Se extrajeron comorbilidades de los diagnósticos secundarios y se ajustaron ecuaciones de estimación generalizadas (GEE) con correlación intercambiable y hospital como clúster: familia binomial para mortalidad y binomial negativa para estadía. Edad, sexo, severidad, riesgo de mortalidad y comorbilidades seleccionadas por significancia bivariada (p<0,05) por sitio se incorporaron como predictores. Complementariamente se entrenaron cinco algoritmos de aprendizaje automático (GradientBoosting, RandomForest, SVM, MLP y regresión logística) con seis estrategias de balanceo de clases, optimizando el umbral mediante el índice de Youden.

Resultados: La mortalidad global fue de 6,4% y la mediana de estadía de 5 días, con marcada heterogeneidad entre subsitios: faringe otra (10,7%), hipofaringe (8,2%) y laringe (7,5%) presentaron las mayores tasas, mientras sinonasal (4,6%) y glándulas salivales (5,1%) las menores. Los modelos GEE por sitio alcanzaron áreas bajo la curva entre 0,86 y 0,94 para mortalidad intrahospitalaria, con mejor desempeño en glándulas salivales (AUC 0,935), sinonasal y faringe otra (0,898), cavidad oral (0,897), orofaringe (0,888), hipofaringe (0,862) y laringe (0,859); en nasofaringe el modelo no convergió. Edad, severidad y comorbilidades cardiometabólicas y respiratorias resultaron predictores robustos. El mejor aprendizaje automático para mortalidad alcanzó AUC entre 0,650 y 0,843, sin superar a los modelos jerárquicos.

Conclusiones: Los modelos GEE con agrupación hospitalaria mostraron capacidad discriminativa elevada para mortalidad por cáncer de cabeza y cuello en Chile. La marcada variabilidad entre subsitios respalda el uso de modelos independientes con selección diferenciada de comorbilidades.

Palabras clave: neoplasias de cabeza y cuello; mortalidad hospitalaria; ecuaciones de estimación generalizadas; aprendizaje automático; Chile.

NoteAbstract

PROGNOSTIC FACTORS FOR IN-HOSPITAL MORTALITY IN HEAD AND NECK CANCER IN CHILE USING GENERALIZED ESTIMATING EQUATIONS, 2019-2024

Introduction: In-hospital mortality from head and neck malignancies varies substantially by tumor site, yet site-specific prognostic factors have been scarcely studied in Chile accounting for the natural clustering of patients within hospital centers.

Materials and Methods: Retrospective cohort of 9,411 discharges from the Diagnosis-Related Groups (DRG) system 2019–2024 with primary diagnosis of head and neck malignancy (ICD-10 C00–C14, C30–C32), grouped into eight clinical subsites. Comorbidities were extracted from secondary diagnoses, and Generalized Estimating Equations (GEE) with exchangeable correlation and hospital as cluster were fitted: binomial family for mortality and negative binomial for length of stay. Age, sex, severity, mortality risk, and comorbidities selected by bivariate significance (p<0.05) per site were included as predictors. Five machine learning algorithms (GradientBoosting, RandomForest, SVM, MLP, and logistic regression) were trained complementarily with six class balancing strategies, optimizing the cutoff via Youden’s J index.

Results: Overall in-hospital mortality was 6.4% and median length of stay 5 days, with marked heterogeneity across subsites: other pharynx (10.7%), hypopharynx (8.2%), and larynx (7.5%) showed the highest rates, while sinonasal (4.6%) and salivary glands (5.1%) the lowest. Site-specific GEE models reached areas under the curve between 0.86 and 0.94 for in-hospital mortality, with best performance in salivary glands (AUC 0.935), sinonasal and other pharynx (0.898), oral cavity (0.897), oropharynx (0.888), hypopharynx (0.862), and larynx (0.859); the nasopharynx model did not converge. Age, severity, and cardiometabolic and respiratory comorbidities emerged as robust predictors. Best machine learning AUC for mortality ranged 0.650–0.843, without outperforming hierarchical models.

Conclusions: GEE models with hospital clustering showed high discriminative capacity for head and neck cancer mortality in Chile. The marked variability across subsites supports independent models with differentiated comorbidity selection.

Keywords: head and neck neoplasms; hospital mortality; generalized estimating equations; machine learning; Chile.

1 1. Methods

1.1 1.1 Study Design and Data Source

This is a retrospective cohort study based on hospital discharge records from the GRD (Diagnosis-Related Groups) system in Chile, covering the period 2019–2024. The GRD system captures comprehensive data on public hospital discharges including:

  • 35 diagnostic fields (DIAGNOSTICO1 through DIAGNOSTICO35)
  • Severity indices coded 1–3 (IR_29301_SEVERIDAD)
  • Mortality risk indices coded 1–3 (IR_29301_MORTALIDAD)
  • Admission and discharge dates
  • Sex, age (derived from birth date), and service of admission
  • Discharge type (alive or deceased)

Cases of head and neck malignancies (ICD-10 codes C00–C14, C30–C32) are identified by screening all 35 diagnostic fields for these codes, with the cancer site determined by the first (earliest) occurrence position.

1.2 1.2 Cancer Case Identification

All patients with an ICD-10 code between C00–C14 and C30–C32 in any of the 35 diagnostic fields (DIAGNOSTICO1–DIAGNOSTICO35) are included. The primary cancer site is classified based on the first diagnostic field containing a relevant code:

Oral Cavity (C00–C06) - Lip (C00): ICD-10 code C00 - Base of tongue (C01): ICD-10 code C01 - Other tongue (C02): ICD-10 code C02 - Gum (C03): ICD-10 code C03 - Floor of mouth (C04): ICD-10 code C04 - Palate (C05): ICD-10 code C05 - Other mouth (C06): ICD-10 code C06

Salivary Glands (C07–C08) - Parotid gland (C07): ICD-10 code C07 - Other salivary glands (C08): ICD-10 code C08

Pharynx (C09–C14) - Tonsil (C09): ICD-10 code C09 - Oropharynx (C10): ICD-10 code C10 - Nasopharynx (C11): ICD-10 code C11 - Piriform sinus (C12): ICD-10 code C12 - Hypopharynx (C13): ICD-10 code C13 - Other lip, oral, pharynx (C14): ICD-10 code C14

Sinonasal and Larynx (C30–C32) - Nasal cavity and middle ear (C30): ICD-10 code C30 - Accessory sinuses (C31): ICD-10 code C31 - Larynx (C32): ICD-10 code C32

For analysis, sites are grouped into clinical subsites: Oral Cavity (C00–C06), Salivary Glands (C07–C08), Oropharynx (C09–C10), Nasopharynx (C11), Hypopharynx (C12–C13), Other Pharynx (C14), Sinonasal (C30–C31), and Larynx (C32).

1.3 1.3 Comorbidity Classification

Comorbidities are identified from secondary diagnoses (DIAGNOSTICO2 through DIAGNOSTICO35) and classified into 48 detailed clinical groups organized in 10 systemic categories, using the same scheme as the comorbidity network analysis (neoplasias_cabeza_cuello_redes.qmd) to ensure consistency across analyses.

Systemic Category Detailed Groups Representative ICD-10 Codes
Cardiovascular (CV) — 7 groups Essential Hypertension; Complicated Hypertension; Coronary Disease; Heart Failure; Arrhythmias; Cerebrovascular Disease; Peripheral Vascular Disease I10; I11–I15; I20–I25; I50; I44–I49; I60–I69; I70–I77
Metabolic (MET) — 7 groups Type 1 Diabetes; Type 2 Diabetes; Complicated Diabetes; Obesity; Dyslipidemia; Metabolic Syndrome; Thyroid Disorders E10; E11; E12–E14; E66; E78; E88; E00–E07
Respiratory (RESP) — 7 groups COPD; Asthma; Chronic Bronchitis; Emphysema; Pulmonary Fibrosis; Pulmonary Hypertension; Sleep Apnea J44; J45–J46; J41–J42; J43; J84; I27; G47
Renal (REN) — 4 groups Acute Renal Failure; Chronic Renal Failure; End Stage Renal; Diabetic Nephropathy N17; N18; N18.5–N18.6; E10.2–E14.2
Hepatic (HEP) — 4 groups Steatosis; Alcoholic Cirrhosis; NonAlcoholic Cirrhosis; Viral Hepatitis K76; K70; K74; B15–B19
Gastrointestinal (GI) — 4 groups Reflux; Peptic Ulcer; IBD; IBS K21; K25–K28; K50–K51; K58
Neurological (NEURO) — 4 groups Dementia; Parkinson; Epilepsy; Stroke Sequelae F00–F03, G30; G20–G22; G40–G41; I69
Psychiatric (PSI) — 3 groups Major Depression; Bipolar; Anxiety F32–F33; F30–F31; F41
Substance Use (SUST) — 3 groups Alcohol; Tobacco; Opioids F10; F17, Z72; F11
Infectious (INF) — 4 groups HIV; Tuberculosis; Hepatitis C; Hepatitis B B20–B24; A15–A19; B18; B16

Each of the 48 detailed groups is coded as a binary indicator (presence = 1, absence = 0). Additionally, 10 aggregated systemic-category indicators (cat_CV, cat_MET, cat_RESP, cat_REN, cat_HEP, cat_GI, cat_NEURO, cat_PSI, cat_SUST, cat_INF) are derived as the disjunction (any group present) within each category. Code matching uses both 3-character ICD-10 prefixes and 4-character extended codes (e.g., N18.5 for End Stage Renal disease).

1.4 1.4 Statistical Analyses

1.4.1 1.4.1 Descriptive Statistics

Continuous variables (age, length of stay) are reported as mean, median, standard deviation, and range. Categorical variables (sex, service category, comorbidities, discharge status) are reported as counts and proportions. Mortality rates and length of stay distributions are stratified by cancer site and temporal trends.

1.4.2 1.4.2 Bivariate Associations

Independence between each categorical covariate and the binary mortality outcome was tested with the Pearson chi-square statistic

\[ \chi^{2} \;=\; \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}, \]

where:

  • \(O_{ij}\) — observed count in row \(i\) and column \(j\) of the \(R \times C\) contingency table;
  • \(E_{ij} = (R_i \cdot C_j)/N\) — count expected under the null hypothesis of independence, with \(R_i\), \(C_j\) the row/column marginals and \(N\) the table total;
  • under \(H_0\), \(\chi^{2} \;\dot\sim\; \chi^{2}_{(R-1)(C-1)}\).

The Wilcoxon rank-sum (Mann–Whitney \(U\)) test compared the length-of-stay distribution between patients with and without each comorbidity, without assuming normality or homoscedasticity.

1.4.3 1.4.3 Mixed-Effects Logistic Regression for Mortality

In-hospital mortality \(Y_{ij} \in \{0,1\}\) (death = 1) is modeled with a mixed-effects logistic regression with a random intercept for the cluster (year of admission \(j\)):

\[ Y_{ij} \,\mid\, u_j \;\sim\; \text{Bernoulli}(\pi_{ij}), \qquad \mathrm{logit}(\pi_{ij}) \;=\; \log\!\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) \;=\; \beta_0 \,+\, \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \,+\, u_j, \]

with \(u_j \sim \mathcal{N}(0,\sigma^{2}_{u})\) independent across years. Symbols:

  • \(\pi_{ij}\) — probability of in-hospital death for patient \(i\) admitted in year \(j\);
  • \(\beta_0\) — fixed intercept (log-odds of death in the reference category, year-average);
  • \(\mathbf{x}_{ij}\) — covariate vector for patient \(i,j\): age, sex, service, GRD severity, GRD mortality risk, and the binary comorbidity indicators;
  • \(\boldsymbol{\beta}\) — fixed-effect log-odds ratios; the exponentiated coefficient \(e^{\beta_k}\) is the OR for a one-unit change in \(x_k\);
  • \(u_j\) — random intercept absorbing unobserved year-level heterogeneity; \(\sigma^{2}_{u}\) quantifies between-year variability of baseline mortality risk.

Models were fit by maximum likelihood with Laplace approximation, separately for each of the eight primary cancer subsites. Estimates are reported as odds ratios with 95% Wald confidence intervals.

1.4.4 1.4.4 Negative Binomial Mixed Models for Length of Stay

Length of stay is a non-negative count with overdispersion relative to Poisson. We use a negative-binomial mixed model with a log link and a random intercept for year:

\[ Y_{ij} \,\mid\, u_j \;\sim\; \mathrm{NB}\!\left(\mu_{ij}, \theta\right), \qquad \log(\mu_{ij}) \;=\; \beta_0 \,+\, \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \,+\, u_j, \]

with \(u_j \sim \mathcal{N}(0,\sigma^{2}_{u})\) and the NB\((\mu,\theta)\) variance

\[ \mathrm{Var}(Y_{ij}\mid u_j) \;=\; \mu_{ij} \,+\, \mu_{ij}^{2}/\theta. \]

Symbols:

  • \(Y_{ij}\) — length of stay in days for patient \(i,j\);
  • \(\mu_{ij}\) — conditional mean stay given covariates and the year effect;
  • \(\theta\) — dispersion parameter; smaller \(\theta\) implies greater overdispersion, and \(\theta \to \infty\) recovers the Poisson model;
  • \(\mathbf{x}_{ij},\boldsymbol{\beta},u_j\) — as in the logistic model.

Coefficients are reported as incidence rate ratios \(\mathrm{IRR}_k = e^{\beta_k}\) (multiplicative effect on expected stay) with 95% CIs.

1.4.5 1.4.5 Area Under the Receiver Operating Characteristic Curve (AUC)

Discrimination of the mortality model is summarized by the area under the ROC curve,

\[ \mathrm{AUC} \;=\; \int_{0}^{1} \mathrm{TPR}\!\left(\mathrm{FPR}^{-1}(u)\right)\,du \;=\; \Pr\!\big(\hat{p}_{i} > \hat{p}_{j} \,\big|\, y_i = 1,\, y_j = 0\big), \]

where for every classification threshold \(c \in [0,1]\) on the predicted probability \(\hat{p}\):

  • \(\mathrm{TPR}(c) = \Pr(\hat{p} > c \mid Y = 1)\) — true positive rate (sensitivity);
  • \(\mathrm{FPR}(c) = \Pr(\hat{p} > c \mid Y = 0)\) — false positive rate (1 − specificity);
  • \(\mathrm{AUC} = 0.5\) — no discrimination; \(\mathrm{AUC} = 1\) — perfect ranking of cases above controls.

1.4.6 1.4.6 Hosmer–Lemeshow Goodness-of-Fit

Calibration of the logistic model was assessed with the Hosmer–Lemeshow statistic, which compares observed and expected events across \(G\) deciles of predicted risk:

\[ \mathrm{HL} \;=\; \sum_{g=1}^{G} \frac{\big(O_g - n_g\bar{p}_g\big)^{2}}{n_g\,\bar{p}_g\,(1 - \bar{p}_g)}, \]

where:

  • \(G\) — number of risk groups (we use \(G = 10\), i.e., deciles of \(\hat{p}\));
  • \(n_g\) — number of patients in group \(g\);
  • \(O_g\) — observed deaths in group \(g\);
  • \(\bar{p}_g\) — mean predicted probability in group \(g\) (so \(E_g = n_g \bar{p}_g\)).

Under \(H_0\) (model well-calibrated), \(\mathrm{HL} \;\dot\sim\; \chi^{2}_{G-2}\); \(p > 0.05\) supports adequate calibration.

1.4.7 1.4.7 McFadden’s Pseudo-\(R^{2}\)

\[ R^{2}_{\mathrm{McF}} \;=\; 1 \,-\, \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell(\mathbf{0})}, \]

where:

  • \(\ell(\hat{\boldsymbol{\beta}})\) — log-likelihood of the fitted model;
  • \(\ell(\mathbf{0})\) — log-likelihood of the intercept-only (“null”) model.

Both log-likelihoods are negative, so \(R^{2}_{\mathrm{McF}} \in [0,1]\); values of 0.2–0.4 are typically considered indicative of good fit in logistic models.

1.4.8 1.4.8 Matthews Correlation Coefficient (MCC)

For a binary classifier at a fixed threshold, the Matthews correlation coefficient is

\[ \mathrm{MCC} \;=\; \frac{TP\cdot TN \;-\; FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \]

where \(TP, TN, FP, FN\) are the cells of the confusion matrix (true/false positives/negatives). \(\mathrm{MCC} \in [-1, 1]\): \(+1\) perfect agreement, \(0\) no better than chance, \(-1\) perfect disagreement. Unlike accuracy, MCC stays meaningful with class imbalance.

1.4.9 1.4.9 Brier Score

The Brier score is the mean squared error between predicted probabilities and observed binary outcomes,

\[ \mathrm{BS} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{p}_i\big)^{2}, \]

where:

  • \(N\) — number of patients in the validation set;
  • \(y_i \in \{0,1\}\) — observed outcome;
  • \(\hat{p}_i \in [0,1]\) — predicted probability of the event for patient \(i\).

\(\mathrm{BS} = 0\) is perfect; \(\mathrm{BS} = 0.25\) is the score of a non-informative model that predicts \(\hat{p}_i \equiv 0.5\) for everyone. It rewards both calibration and discrimination.

1.4.10 1.4.10 RMSE and MAE for Length-of-Stay

Predictive accuracy of the continuous length-of-stay model is summarized by

\[ \mathrm{RMSE} \;=\; \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}_i - y_i\big)^{2}}, \qquad \mathrm{MAE} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big|\hat{y}_i - y_i\big|, \]

where \(y_i\) is the observed length of stay (days), \(\hat{y}_i\) its prediction, and \(N\) the validation sample size. RMSE penalizes large errors more strongly because of the square; MAE is in the same units as the outcome and is more robust to outliers.

1.4.11 1.4.11 Youden’s J Statistic

For threshold selection on the ROC curve we use Youden’s \(J\):

\[ J(c) \;=\; \mathrm{TPR}(c) \,+\, \mathrm{TNR}(c) \,-\, 1 \;=\; \mathrm{Sensitivity}(c) \,+\, \mathrm{Specificity}(c) \,-\, 1, \]

where \(c\) is the probability cut-off. \(J \in [0,1]\); the optimal cut-off is \(c^{*} = \arg\max_c J(c)\), the point that maximizes the vertical distance between the ROC curve and the chance diagonal.


2 2. Data Loading and Processing

Code
import pandas as pd
import numpy as np
import os
import warnings
from datetime import datetime
from pathlib import Path
import pickle
from IPython.display import display

warnings.filterwarnings('ignore')

# Configuration
DATA_DIR = "../data"
OUTPUT_DIR = "../data/output_files"
CACHE_DIR = "../data"

# Create output directory if needed
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)
Code
# Function to load GRD data
def load_grd_data(data_dir, years=[2019, 2020, 2021, 2022, 2023, 2024]):
    """Load GRD data from multiple CSV files (pipe-separated, UTF-8)"""
    dfs = []

    for year in years:
        filepath = os.path.join(data_dir, f"GRD_PUBLICO_{year}.csv")
        if not os.path.exists(filepath):
            continue

        try:
            df = pd.read_csv(filepath, sep="|", dtype=str, low_memory=False, encoding='utf-8')
            df['year'] = str(year)
            dfs.append(df)
        except Exception as e:
            pass

    if dfs:
        return pd.concat(dfs, ignore_index=True)
    else:
        return pd.DataFrame()

# Load all GRD data
df_grd = load_grd_data(DATA_DIR)
if len(df_grd) == 0:
    display(HTML("<p><strong>Warning:</strong> No GRD data found.</p>"))
Code
# Load catalogs
catalog_file = os.path.join(DATA_DIR, "códigos grd.xlsx")

df_grd_codes = None
df_hospitals = None

try:
    df_grd_codes = pd.read_excel(catalog_file, sheet_name="IR - GRD")
except Exception as e:
    pass

try:
    df_hospitals = pd.read_excel(catalog_file, sheet_name="Hospitales")
except Exception as e:
    pass
Code
# Function to parse dates flexibly
def parse_date_flexible(date_str):
    """Parse dates that may be in DD-MM-YYYY or YYYY-MM-DD format"""
    if pd.isna(date_str) or date_str == '':
        return pd.NaT

    if isinstance(date_str, (int, float)):
        return pd.NaT

    date_str = str(date_str).strip()

    # Try different formats
    for fmt in ["%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue

    return pd.NaT

# Apply date normalization
date_cols = ['FECHA_NACIMIENTO', 'FECHA_INGRESO', 'FECHAALTA']
for col in date_cols:
    if col in df_grd.columns:
        df_grd[col] = df_grd[col].apply(parse_date_flexible)
        valid_count = df_grd[col].notna().sum()
Code
# Calculate derived variables

# Age in years
df_grd['edad'] = (df_grd['FECHA_INGRESO'] - df_grd['FECHA_NACIMIENTO']).dt.days / 365.25

# Length of stay in days
df_grd['dias_estancia'] = (df_grd['FECHAALTA'] - df_grd['FECHA_INGRESO']).dt.days

# Mortality status (discharge type)
df_grd['egresar_fallecido'] = (df_grd['TIPOALTA'] == 'FALLECIDO').astype(int)
df_grd['estado_alta'] = df_grd['TIPOALTA'].apply(
    lambda x: "Deceased" if x == "FALLECIDO" else "Discharged"
)
Code
# Filter valid records

# Remove unknown sex
df_clean = df_grd[df_grd['SEXO'] != 'DESCONOCIDO'].copy()

# Standardize sex values to English
sex_map = {'FEMENINO': 'Female', 'MASCULINO': 'Male', 'MUJER': 'Female', 'HOMBRE': 'Male',
           'Female': 'Female', 'Male': 'Male', 'F': 'Female', 'M': 'Male'}
df_clean['SEXO'] = df_clean['SEXO'].map(sex_map).fillna(df_clean['SEXO'])

# Remove invalid ages
df_clean = df_clean[(df_clean['edad'].notna()) & (df_clean['edad'] > 0) & (df_clean['edad'] < 150)].copy()

# Remove invalid length of stay
df_clean = df_clean[(df_clean['dias_estancia'].notna()) & (df_clean['dias_estancia'] >= 0)].copy()
Code
# Function to identify head and neck cancer sites from diagnostic fields
def determine_cancer_site(row):
    """
    Scans all DIAGNOSTICO1-35 fields and returns the first C00-C14, C30-C32 code found.
    Returns: (cancer_site_name, cancer_code, diagnostic_field_position)
    """
    cancer_sites = {
        'C00': 'Lip',
        'C01': 'Base of tongue',
        'C02': 'Other tongue',
        'C03': 'Gum',
        'C04': 'Floor of mouth',
        'C05': 'Palate',
        'C06': 'Other mouth',
        'C07': 'Parotid gland',
        'C08': 'Other salivary glands',
        'C09': 'Tonsil',
        'C10': 'Oropharynx',
        'C11': 'Nasopharynx',
        'C12': 'Piriform sinus',
        'C13': 'Hypopharynx',
        'C14': 'Other pharynx',
        'C30': 'Nasal cavity',
        'C31': 'Accessory sinuses',
        'C32': 'Larynx'
    }

    for pos in range(1, 36):
        col_name = f'DIAGNOSTICO{pos}'
        if col_name not in row.index:
            continue

        diag = row[col_name]
        if pd.isna(diag) or diag == '':
            continue

        diag_str = str(diag).strip().upper()
        if len(diag_str) < 3:
            continue

        icd_code = diag_str[:3]
        if icd_code in cancer_sites:
            return cancer_sites[icd_code], icd_code, pos

    return None, None, None

# Filter for head and neck neoplasms

cancer_results = df_clean.apply(determine_cancer_site, axis=1)
df_clean['cancer_site'] = cancer_results.apply(lambda x: x[0])
df_clean['cancer_code'] = cancer_results.apply(lambda x: x[1])
df_clean['diagnostic_position'] = cancer_results.apply(lambda x: x[2])

# Keep only records with cancer diagnosis
df_head_neck = df_clean[df_clean['cancer_site'].notna()].copy()

vc_site = df_head_neck['cancer_site'].value_counts().sort_values(ascending=False).reset_index()
vc_site.columns = ['Cancer Site', 'N']
vc_site['%'] = (vc_site['N'] / vc_site['N'].sum() * 100).round(1)
Code
# Create clinical subsite grouping
def classify_head_neck_subsite(site):
    """Group head and neck sites into clinical subsites"""
    oral_cavity = ['Lip', 'Base of tongue', 'Other tongue', 'Gum', 'Floor of mouth', 'Palate', 'Other mouth']
    salivary = ['Parotid gland', 'Other salivary glands']
    oropharynx_group = ['Tonsil', 'Oropharynx']
    hypopharynx_group = ['Piriform sinus', 'Hypopharynx']
    sinonasal = ['Nasal cavity', 'Accessory sinuses']

    if site in oral_cavity:
        return 'Oral Cavity'
    elif site in salivary:
        return 'Salivary Glands'
    elif site in oropharynx_group:
        return 'Oropharynx'
    elif site == 'Nasopharynx':
        return 'Nasopharynx'
    elif site in hypopharynx_group:
        return 'Hypopharynx'
    elif site == 'Other pharynx':
        return 'Other Pharynx'
    elif site in sinonasal:
        return 'Sinonasal'
    elif site == 'Larynx':
        return 'Larynx'
    else:
        return site

df_head_neck['cancer_site_group'] = df_head_neck['cancer_site'].apply(classify_head_neck_subsite)

vc = df_head_neck['cancer_site_group'].value_counts().reset_index()
vc.columns = ['Cancer Site', 'N']
vc['%'] = (vc['N'] / vc['N'].sum() * 100).round(1)
Code
# Classify service category

def classify_service_category(servicio):
    """Classify admission service into categories"""
    if pd.isna(servicio) or servicio == '':
        return 'Other'

    servicio = str(servicio).upper()

    if 'CUIDADOS INTENSIVOS' in servicio or 'CUIDADOS' in servicio:
        return 'ICU'
    elif 'TRATAMIENTO INTERMEDIO' in servicio or 'UTI' in servicio:
        return 'ITU'
    elif any(x in servicio for x in ['AGUDOS', 'CIRUGÍA', 'CIRUGÍA', 'RECUPERACIÓN', 'MEDICO-QUIRURGICO']):
        return 'MS'
    else:
        return 'Other'

df_head_neck['service_category'] = df_head_neck['SERVICIOINGRESO'].apply(classify_service_category)

sc = df_head_neck['service_category'].value_counts().reset_index()
sc.columns = ['Service Category', 'N']
sc['%'] = (sc['N'] / sc['N'].sum() * 100).round(1)
Code
# Detect comorbidities from secondary diagnoses
# Uses the same comprehensive 48-group / 10-category scheme as the network analysis (redes)

comorbidity_groups = {
    # CARDIOVASCULAR (7 groups)
    "CV Essential Hypertension": ["I10"],
    "CV Complicated Hypertension": ["I11", "I12", "I13", "I15"],
    "CV Coronary Disease": ["I20", "I21", "I22", "I23", "I24", "I25"],
    "CV Heart Failure": ["I50"],
    "CV Arrhythmias": ["I44", "I45", "I46", "I47", "I48", "I49"],
    "CV Cerebrovascular Disease": ["I60", "I61", "I62", "I63", "I64", "I65", "I66", "I67", "I68", "I69"],
    "CV Peripheral Vascular Disease": ["I70", "I71", "I73", "I74", "I77"],

    # METABOLIC (7 groups)
    "MET Type1 Diabetes": ["E10"],
    "MET Type2 Diabetes": ["E11"],
    "MET Complicated Diabetes": ["E12", "E13", "E14"],
    "MET Obesity": ["E66"],
    "MET Dyslipidemia": ["E78"],
    "MET Metabolic Syndrome": ["E88"],
    "MET Thyroid Disorders": ["E00", "E01", "E02", "E03", "E04", "E05", "E06", "E07"],

    # RESPIRATORY (7 groups)
    "RESP COPD": ["J44"],
    "RESP Asthma": ["J45", "J46"],
    "RESP Chronic Bronchitis": ["J41", "J42"],
    "RESP Emphysema": ["J43"],
    "RESP Pulmonary Fibrosis": ["J84"],
    "RESP Pulmonary Hypertension": ["I27"],
    "RESP Sleep Apnea": ["G47"],

    # RENAL (4 groups)
    "REN Acute Renal Failure": ["N17"],
    "REN Chronic Renal Failure": ["N18"],
    "REN End Stage Renal": ["N185", "N186"],
    "REN Diabetic Nephropathy": ["E102", "E112", "E122", "E132", "E142"],

    # HEPATIC (4 groups)
    "HEP Steatosis": ["K76"],
    "HEP Alcoholic Cirrhosis": ["K70"],
    "HEP NonAlcoholic Cirrhosis": ["K74"],
    "HEP Viral Hepatitis": ["B15", "B16", "B17", "B18", "B19"],

    # GASTROINTESTINAL (4 groups)
    "GI Reflux": ["K21"],
    "GI Peptic Ulcer": ["K25", "K26", "K27", "K28"],
    "GI IBD": ["K50", "K51"],
    "GI IBS": ["K58"],

    # NEUROLOGICAL (4 groups)
    "NEURO Dementia": ["F00", "F01", "F02", "F03", "G30"],
    "NEURO Parkinson": ["G20", "G21", "G22"],
    "NEURO Epilepsy": ["G40", "G41"],
    "NEURO Stroke Sequelae": ["I69"],

    # PSYCHIATRIC (3 groups)
    "PSI Major Depression": ["F32", "F33"],
    "PSI Bipolar": ["F30", "F31"],
    "PSI Anxiety": ["F41"],

    # SUBSTANCE (3 groups)
    "SUST Alcohol": ["F10"],
    "SUST Tobacco": ["F17", "Z72"],
    "SUST Opioids": ["F11"],

    # INFECTIOUS (4 groups)
    "INF HIV": ["B20", "B21", "B22", "B23", "B24"],
    "INF Tuberculosis": ["A15", "A16", "A17", "A18", "A19"],
    "INF Hepatitis C": ["B18"],
    "INF Hepatitis B": ["B16"],
}

# Backwards-compatible alias used downstream
life_groups = comorbidity_groups

def get_icd_code(diag):
    """Extract normalized ICD code, returning both 3-char and 4-char (no dot) forms."""
    if pd.isna(diag) or diag == '':
        return None
    s = str(diag).strip().upper().replace('.', '')
    return s

def check_comorbidity_groups(row):
    """Detect comorbidities for a record. Matches both 3-char and 4-char codes
    so groups like REN End Stage Renal (N185, N186) are correctly captured."""
    diag_cols = [f'DIAGNOSTICO{i}' for i in range(2, 36)]
    codes_full = set()
    codes_3 = set()

    for col in diag_cols:
        if col in row.index:
            icd = get_icd_code(row[col])
            if icd:
                codes_full.add(icd)
                codes_3.add(icd[:3])

    comorbidities = {}
    for group_name, icd_codes in comorbidity_groups.items():
        match = False
        for code in icd_codes:
            if len(code) == 3:
                if code in codes_3:
                    match = True
                    break
            else:
                if any(c.startswith(code) for c in codes_full):
                    match = True
                    break
        comorbidities[group_name] = 1 if match else 0

    return comorbidities

# Reset index to ensure clean alignment after the boolean filter above
df_head_neck = df_head_neck.reset_index(drop=True)

# Apply comorbidity detection
comorbidity_data = df_head_neck.apply(check_comorbidity_groups, axis=1)
comorbidity_df = pd.DataFrame(comorbidity_data.tolist(), index=df_head_neck.index)

# Add to main dataframe (index-aligned)
for col in comorbidity_df.columns:
    df_head_neck[col] = comorbidity_df[col].astype(int)

comorbidity_cols = list(comorbidity_groups.keys())
all_comorbidity_cols = list(comorbidity_groups.keys())

# Aggregated systemic category indicators (CV, MET, RESP, REN, HEP, GI, NEURO, PSI, SUST, INF)
systemic_categories = ['CV', 'MET', 'RESP', 'REN', 'HEP', 'GI', 'NEURO', 'PSI', 'SUST', 'INF']
for cat in systemic_categories:
    cat_cols = [c for c in comorbidity_cols if c.startswith(cat + ' ')]
    df_head_neck[f'cat_{cat}'] = (df_head_neck[cat_cols].sum(axis=1) > 0).astype(int)

# Create total comorbidities and any comorbidity columns
df_head_neck['total_comorbidities'] = df_head_neck[comorbidity_cols].sum(axis=1)
df_head_neck['any_comorbidity'] = (df_head_neck['total_comorbidities'] > 0).astype(int)

# Prepare clustering/random effect variables for mixed models
# COD_HOSPITAL as cluster (random intercept), year as fixed effect
if 'COD_HOSPITAL' in df_head_neck.columns:
    df_head_neck['hospital_id'] = df_head_neck['COD_HOSPITAL'].astype(str)
else:
    # Fallback: use first available hospital-like column
    for hcol in ['ESTABLECIMIENTO', 'HOSPITAL', 'cod_hospital']:
        if hcol in df_head_neck.columns:
            df_head_neck['hospital_id'] = df_head_neck[hcol].astype(str)
            break
    else:
        df_head_neck['hospital_id'] = '1'  # single cluster fallback

df_head_neck['year_factor'] = df_head_neck['year'].astype(str)
Code
from IPython.display import display, HTML

# Helper function for clean table display
def _safe_format(x):
    """Format a value: NaN → '—', Inf → 'Inf', large abs → scientific, else 2 decimals"""
    if isinstance(x, float):
        if np.isnan(x):
            return '—'
        if np.isinf(x):
            return 'Inf' if x > 0 else '-Inf'
        if abs(x) > 1e8:
            return f'{x:.2e}'
        return f'{x:,.2f}'
    return str(x)

def sanitize_df(df):
    """Replace NaN with '—' and Inf with 'Inf' across entire DataFrame"""
    df_out = df.copy()
    for col in df_out.columns:
        if df_out[col].dtype in ['float64', 'float32']:
            df_out[col] = df_out[col].apply(_safe_format)
        else:
            df_out[col] = df_out[col].fillna('—')
    # Also handle index if it contains NaN
    if df_out.index.dtype in ['float64', 'float32']:
        df_out.index = [_safe_format(v) for v in df_out.index]
    return df_out

def show_table(df, caption="", index=False, float_format=None):
    """Display a DataFrame as a properly formatted HTML table with NaN/Inf handling"""
    df_clean = sanitize_df(df)
    html = f'<h4>{caption}</h4>' if caption else ''
    html += df_clean.to_html(index=index, classes='table table-striped table-hover table-sm',
                              escape=False)
    display(HTML(html))

# Process severity and mortality risk as factors

severity_cols = ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']
for col in severity_cols:
    if col in df_head_neck.columns:
        df_head_neck[col] = pd.to_numeric(df_head_neck[col], errors='coerce')
        df_head_neck[col] = df_head_neck[col].fillna(df_head_neck[col].median())
        df_head_neck[col] = df_head_neck[col].astype('category')

# Create main sites dataset
main_sites = ['Larynx', 'Oropharynx', 'Oral Cavity', 'Nasopharynx']
df_main_sites = df_head_neck[df_head_neck['cancer_site_group'].isin(main_sites)].copy()

# Save processed dataframe to cache
output_cache_path = os.path.join(CACHE_DIR, "df_head_neck_processed.pkl")
os.makedirs(os.path.dirname(output_cache_path), exist_ok=True)
with open(output_cache_path, 'wb') as f:
    pickle.dump(df_head_neck, f)

# General summary
info_df = pd.DataFrame({
    'Metric': ['Total GRD records', 'Head and neck malignancy cases (C00-C14, C30-C32)',
                'Period', 'Mortality rate', 'Median length of stay'],
    'Value': [f'{len(df_head_neck):,}', f'{len(df_head_neck):,}',
              f'{df_head_neck["year"].min()}-{df_head_neck["year"].max()}',
              f'{df_head_neck["egresar_fallecido"].mean()*100:.1f}%',
              f'{df_head_neck["dias_estancia"].median():.0f} days']
})
show_table(info_df, "Table 5: Overall Summary Statistics", index=False)

Table 5: Overall Summary Statistics

Metric Value
Total GRD records 9,411
Head and neck malignancy cases (C00-C14, C30-C32) 9,411
Period 2019-2024
Mortality rate 6.4%
Median length of stay 5 days

3 3. Display Name Formatting

Code
# Define formatted display names for all variables
DISPLAY_NAMES = {
    'edad': 'Age',
    'SEXO_coded': 'Sex (Female)',
    'dias_estancia': 'Length of Stay',
    'egresar_fallecido': 'In-Hospital Mortality',
    'cancer_site_group': 'Cancer Site',
    'service_category': 'Service Category',
    'service_category_coded': 'Service Category',
    'year': 'Year',
    'year_factor': 'Year',
    'IR_29301_SEVERIDAD': 'Severity',
    'IR_29301_MORTALIDAD': 'Mortality Risk',
    # Detailed comorbidity groups (48 groups, 10 systemic categories)
    **{name: name for name in comorbidity_cols},
    # Aggregated systemic categories
    'cat_CV': 'Cardiovascular (any)',
    'cat_MET': 'Metabolic (any)',
    'cat_RESP': 'Respiratory (any)',
    'cat_REN': 'Renal (any)',
    'cat_HEP': 'Hepatic (any)',
    'cat_GI': 'Gastrointestinal (any)',
    'cat_NEURO': 'Neurological (any)',
    'cat_PSI': 'Psychiatric (any)',
    'cat_SUST': 'Substance Use (any)',
    'cat_INF': 'Infectious (any)',
    'total_comorbidities': 'Total Comorbidities',
    'any_comorbidity': 'Any Comorbidity',
    'service_ICU': 'Service ICU',
    'service_MS': 'Service MS',
    'service_ITU': 'Service ITU',
    'service_Other': 'Service Other',
    'severity_2': 'Severity Medium',
    'severity_3': 'Severity High',
    'mort_risk_2': 'Mortality Risk Medium',
    'mort_risk_3': 'Mortality Risk High',
    'const': 'Intercept',
    'Intercept': 'Intercept',
    'sex_female': 'Sex Female',
}

def format_name(name):
    """Convert variable names to formatted display names"""
    if name in DISPLAY_NAMES:
        return DISPLAY_NAMES[name]
    if name.startswith('year_'):
        return f'Year {name.split("_")[1]}'
    if name.startswith('severity_'):
        level = name.split('_')[1]
        labels = {'2': 'Medium', '3': 'High'}
        return f'Severity {labels.get(level, level)}'
    if name.startswith('mort_risk_'):
        level = name.split('_')[2] if len(name.split('_')) > 2 else name.split('_')[1]
        labels = {'2': 'Medium', '3': 'High'}
        return f'Mortality Risk {labels.get(level, level)}'
    if name.startswith('service_'):
        svc = name.replace('service_', '')
        return f'Service {svc}'
    if name.startswith('C('):
        return name.replace('C(', '').replace(')', '').replace('[T.', ': ').replace(']', '')
    return name.replace('_', ' ').title()

# Create mapping for comorbidity display names
comorb_names = {col: format_name(col) for col in comorbidity_cols}

4 4. Descriptive Statistics

Code
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.facecolor'] = 'white'
sns.set_style("whitegrid")

# Create age groups
df_head_neck['age_group'] = pd.cut(df_head_neck['edad'],
                                    bins=[0, 18, 35, 50, 65, 80, 150],
                                    labels=['<18', '18-34', '35-49', '50-64', '65-79', '80+'])

# Table 1: Patient Distribution by Cancer Site
table1_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    site_df = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(site_df)
    n_discharged = (site_df['egresar_fallecido'] == 0).sum()
    n_deceased = site_df['egresar_fallecido'].sum()
    mortality_pct = (n_deceased / n_total * 100) if n_total > 0 else 0
    mean_age = site_df['edad'].mean()
    age_range = f"{site_df['edad'].min():.0f}-{site_df['edad'].max():.0f}"
    female_pct = (site_df['SEXO'] == 'Female').sum() / n_total * 100
    mean_los = site_df['dias_estancia'].mean()
    median_los = site_df['dias_estancia'].median()
    sd_los = site_df['dias_estancia'].std()

    table1_data.append({
        'Site': site,
        'N': n_total,
        'Discharged': n_discharged,
        'Deceased': n_deceased,
        'Mortality %': mortality_pct,
        'Mean Age': mean_age,
        'Age Range': age_range,
        'Female %': female_pct,
        'Mean LOS': mean_los,
        'Median LOS': median_los,
        'SD LOS': sd_los
    })

table1_df = pd.DataFrame(table1_data)
show_table(table1_df, "Table 1: Patient Distribution by Head and Neck Cancer Site", index=False)

Table 1: Patient Distribution by Head and Neck Cancer Site

Site N Discharged Deceased Mortality % Mean Age Age Range Female % Mean LOS Median LOS SD LOS
Hypopharynx 400 367 33 8.25 68.60 9-97 23.50 13.71 7.00 19.00
Larynx 2625 2427 198 7.54 67.63 6-98 15.05 14.49 7.00 21.86
Nasopharynx 368 345 23 6.25 50.19 3-92 31.79 8.97 4.00 18.19
Oral Cavity 3001 2825 176 5.86 64.32 0-99 39.09 9.61 4.00 16.41
Oropharynx 951 896 55 5.78 62.16 8-96 26.60 8.39 4.00 13.66
Other Pharynx 281 251 30 10.68 57.49 0-96 32.38 10.80 5.00 15.99
Salivary Glands 898 852 46 5.12 60.54 2-97 52.00 6.95 4.00 11.65
Sinonasal 887 846 41 4.62 55.47 1-102 41.83 6.68 3.00 11.31
Code
# Table 2: Trends by Year and Site
table2_data = []
for year in sorted(df_head_neck['year'].unique()):
    for site in sorted(df_head_neck['cancer_site_group'].unique()):
        subset = df_head_neck[(df_head_neck['year'] == year) & (df_head_neck['cancer_site_group'] == site)]
        if len(subset) > 0:
            n = len(subset)
            mortality_pct = (subset['egresar_fallecido'].sum() / n * 100) if n > 0 else 0
            mean_age = subset['edad'].mean()
            mean_los = subset['dias_estancia'].mean()
            any_comorb_pct = subset['any_comorbidity'].mean() * 100 if 'any_comorbidity' in subset.columns else 0

            table2_data.append({
                'Year': year,
                'Site': site,
                'N': n,
                'Mortality %': mortality_pct,
                'Mean Age': mean_age,
                'Mean LOS': mean_los,
                'Any Comorbidity %': any_comorb_pct
            })

table2_df = pd.DataFrame(table2_data)
show_table(table2_df, "Table 2: Trends by Year and Site", index=False)

Table 2: Trends by Year and Site

Year Site N Mortality % Mean Age Mean LOS Any Comorbidity %
2019 Hypopharynx 63 3.17 69.85 11.84 65.08
2019 Larynx 450 8.22 67.61 10.87 68.00
2019 Nasopharynx 61 1.64 51.20 9.13 54.10
2019 Oral Cavity 571 4.38 62.50 7.34 58.49
2019 Oropharynx 183 6.56 62.18 6.81 47.54
2019 Other Pharynx 50 10.00 65.85 11.80 60.00
2019 Salivary Glands 173 2.89 59.66 6.52 60.69
2019 Sinonasal 165 3.03 53.91 5.14 49.70
2020 Hypopharynx 55 9.09 68.24 13.29 67.27
2020 Larynx 343 7.87 67.47 15.12 72.89
2020 Nasopharynx 48 6.25 58.31 9.65 43.75
2020 Oral Cavity 387 4.39 64.82 11.89 70.80
2020 Oropharynx 108 7.41 61.00 8.35 63.89
2020 Other Pharynx 35 5.71 56.63 14.40 60.00
2020 Salivary Glands 126 7.14 57.89 7.60 57.14
2020 Sinonasal 132 6.82 52.98 6.28 40.15
2021 Hypopharynx 60 13.33 65.86 15.50 55.00
2021 Larynx 408 9.07 66.90 15.57 76.72
2021 Nasopharynx 47 6.38 43.42 6.02 55.32
2021 Oral Cavity 418 7.89 65.41 9.92 71.05
2021 Oropharynx 126 4.76 60.39 6.49 57.94
2021 Other Pharynx 54 14.81 55.85 13.00 53.70
2021 Salivary Glands 135 5.93 56.48 8.59 56.30
2021 Sinonasal 122 3.28 55.92 6.88 57.38
2022 Hypopharynx 71 9.86 66.15 15.85 64.79
2022 Larynx 407 6.39 66.92 17.27 77.15
2022 Nasopharynx 45 6.67 59.07 9.93 62.22
2022 Oral Cavity 488 6.35 63.20 10.19 69.26
2022 Oropharynx 160 6.25 63.57 8.95 60.62
2022 Other Pharynx 49 6.12 50.27 7.00 73.47
2022 Salivary Glands 155 5.16 63.86 6.53 70.32
2022 Sinonasal 130 7.69 55.26 8.46 59.23
2023 Hypopharynx 77 6.49 70.62 14.64 74.03
2023 Larynx 504 7.34 68.16 13.00 79.96
2023 Nasopharynx 50 6.00 52.95 5.92 64.00
2023 Oral Cavity 505 5.74 65.01 9.69 71.68
2023 Oropharynx 164 7.32 63.42 10.55 68.29
2023 Other Pharynx 38 15.79 61.21 6.50 65.79
2023 Salivary Glands 125 3.20 60.55 5.90 62.40
2023 Sinonasal 160 3.75 55.75 6.66 65.00
2024 Hypopharynx 74 8.11 70.29 11.18 72.97
2024 Larynx 513 6.63 68.36 15.66 82.85
2024 Nasopharynx 117 8.55 44.44 10.72 57.26
2024 Oral Cavity 632 6.49 65.27 9.55 74.53
2024 Oropharynx 210 3.33 61.73 8.83 67.62
2024 Other Pharynx 55 10.91 55.94 11.82 65.45
2024 Salivary Glands 184 6.52 63.34 6.78 72.28
2024 Sinonasal 178 3.93 58.34 6.99 56.18
Code
# Table 3: Comorbidity Prevalence - All Patients
table3_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    site_df = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table3_data.append(row)

table3_df = pd.DataFrame(table3_data)
show_table(table3_df, "Table 3: Comorbidity Prevalence - All Patients (%)", index=False)

Table 3: Comorbidity Prevalence - All Patients (%)

Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.75 0.25 4.00 1.50 3.75 0.75 1.50 0.25 14.25 0.25 1.50 9.50 0.25 16.00 6.25 1.25 0.00 2.00 1.00 0.00 0.25 1.50 4.50 1.75 0.00 1.00 1.00 0.00 0.75 0.00 0.75 0.00 0.00 1.25 0.50 1.75 0.50 2.00 0.00 1.00 6.50 16.75 0.00 1.50 0.50 0.50 0.00
Larynx 45.79 0.50 5.79 3.50 6.93 2.93 3.09 0.34 19.81 0.23 4.99 8.42 0.46 11.16 13.79 3.62 0.04 1.75 1.03 0.46 0.99 6.10 5.18 1.10 0.30 0.91 1.30 0.08 0.15 0.57 1.14 0.00 0.04 1.33 0.72 1.56 2.10 2.32 0.27 0.88 4.53 24.27 0.04 0.65 0.27 0.11 0.04
Nasopharynx 28.26 0.00 2.17 1.63 2.99 1.63 4.08 0.00 12.50 0.27 7.07 8.97 0.54 9.51 0.82 3.53 0.00 0.54 1.09 0.00 0.27 5.43 3.26 0.54 0.27 0.27 0.27 0.00 0.00 0.27 0.27 0.27 0.00 0.00 0.27 1.09 0.82 0.54 0.00 3.80 4.08 7.88 0.00 1.36 0.00 0.00 0.00
Oral Cavity 43.85 0.50 3.80 2.17 5.63 2.43 2.10 0.37 18.66 0.43 5.20 11.43 0.60 12.26 5.13 1.77 0.13 0.73 0.73 0.27 0.30 4.53 4.57 1.03 0.23 1.37 0.63 0.30 0.17 0.43 0.53 0.13 0.07 1.73 0.67 1.20 1.90 2.97 0.23 0.70 4.77 14.83 0.00 0.73 0.07 0.00 0.17
Oropharynx 37.01 0.11 2.42 1.05 3.15 2.31 2.73 0.11 13.67 0.32 4.21 7.05 0.11 8.73 3.68 1.79 0.11 1.05 0.74 0.21 0.32 3.05 2.52 0.63 0.21 1.47 0.84 0.00 0.00 0.21 0.32 0.11 0.00 0.42 0.63 2.21 2.00 2.31 0.63 1.16 4.84 13.25 0.11 2.10 0.21 0.00 0.00
Other Pharynx 33.45 0.36 2.14 2.49 4.98 3.56 3.91 0.71 15.30 0.00 2.85 6.05 0.00 9.25 4.98 1.07 0.00 1.42 1.07 0.00 3.20 3.56 4.27 1.42 0.00 1.07 0.36 0.00 0.00 1.07 1.78 0.36 0.36 0.71 0.00 3.56 1.78 2.14 0.36 0.71 1.78 11.03 0.00 1.42 0.36 0.00 0.00
Salivary Glands 37.53 0.89 3.01 2.23 4.68 1.67 0.89 0.33 20.04 0.22 8.02 10.47 0.67 11.36 4.12 3.79 0.00 0.22 0.78 0.22 0.78 4.23 4.23 0.78 0.11 0.78 0.22 0.11 0.00 0.22 0.45 0.22 0.00 0.78 1.22 1.78 1.34 3.12 0.45 1.00 2.00 9.24 0.00 0.56 0.56 0.00 0.00
Sinonasal 33.03 0.56 1.24 1.35 1.69 2.14 0.79 0.45 16.01 0.23 6.43 7.33 0.23 8.46 1.80 2.59 0.00 0.00 0.56 0.00 0.68 1.69 2.03 0.79 0.23 0.90 0.23 0.00 0.00 0.34 0.34 0.00 0.00 0.90 0.68 1.58 1.13 2.03 0.11 1.24 3.27 9.47 0.00 0.45 0.00 0.00 0.00
Code
# Table 4: Comorbidity Prevalence - Discharged
table4_data = []
discharged = df_head_neck[df_head_neck['egresar_fallecido'] == 0]

for site in sorted(discharged['cancer_site_group'].unique()):
    site_df = discharged[discharged['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table4_data.append(row)

table4_df = pd.DataFrame(table4_data)
show_table(table4_df, "Table 4: Comorbidity Prevalence - Discharged (%)", index=False)

Table 4: Comorbidity Prevalence - Discharged (%)

Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.78 0.27 3.54 1.09 2.72 0.54 1.63 0.27 14.44 0.27 1.63 9.54 0.27 15.53 5.72 1.09 0.00 1.91 1.09 0.00 0.27 1.36 4.90 1.91 0.00 1.09 0.82 0.00 0.82 0.00 0.54 0.00 0.00 1.36 0.27 1.63 0.54 1.91 0.00 0.82 7.08 17.17 0.00 1.63 0.27 0.54 0.00
Larynx 45.78 0.45 5.64 3.09 5.48 2.76 2.68 0.37 19.45 0.25 5.23 8.61 0.37 11.08 13.39 3.67 0.04 1.48 0.91 0.37 0.99 4.70 4.61 0.95 0.29 0.82 1.03 0.00 0.16 0.62 0.91 0.00 0.00 1.28 0.74 1.44 2.02 2.22 0.29 0.95 4.41 24.64 0.04 0.66 0.25 0.12 0.04
Nasopharynx 27.83 0.00 2.03 1.74 2.61 1.16 3.48 0.00 12.46 0.29 7.54 9.57 0.29 9.28 0.87 3.77 0.00 0.29 1.16 0.00 0.29 4.64 3.48 0.58 0.29 0.29 0.29 0.00 0.00 0.29 0.29 0.29 0.00 0.00 0.00 1.16 0.87 0.58 0.00 4.06 4.35 7.25 0.00 1.45 0.00 0.00 0.00
Oral Cavity 43.61 0.46 3.68 2.02 4.71 2.27 1.77 0.35 18.62 0.46 5.35 11.54 0.50 11.96 5.03 1.63 0.14 0.71 0.57 0.21 0.28 3.82 4.28 0.92 0.25 1.20 0.57 0.28 0.18 0.46 0.50 0.11 0.07 1.63 0.67 1.17 1.77 2.94 0.21 0.71 4.71 14.73 0.00 0.71 0.07 0.00 0.18
Oropharynx 36.83 0.11 2.46 0.78 2.46 2.12 2.68 0.11 13.39 0.33 4.24 7.14 0.00 9.04 3.24 1.79 0.11 1.12 0.56 0.22 0.33 2.23 2.23 0.56 0.22 1.12 0.78 0.00 0.00 0.22 0.33 0.11 0.00 0.45 0.67 2.23 1.90 2.12 0.45 1.23 4.91 13.39 0.11 2.01 0.11 0.00 0.00
Other Pharynx 31.87 0.00 2.39 1.99 3.98 3.19 3.59 0.40 15.54 0.00 2.79 6.77 0.00 9.96 4.38 1.20 0.00 1.20 1.20 0.00 3.59 1.99 3.98 1.20 0.00 1.20 0.00 0.00 0.00 1.20 1.59 0.40 0.40 0.80 0.00 3.19 1.59 1.59 0.00 0.80 1.59 10.36 0.00 1.59 0.40 0.00 0.00
Salivary Glands 37.79 0.82 2.70 1.88 3.76 1.41 0.94 0.35 20.31 0.12 8.33 10.68 0.70 11.38 4.23 3.99 0.00 0.12 0.82 0.23 0.82 3.29 3.99 0.70 0.12 0.70 0.12 0.12 0.00 0.23 0.47 0.23 0.00 0.70 1.29 1.88 1.29 3.17 0.47 1.06 1.88 9.27 0.00 0.59 0.59 0.00 0.00
Sinonasal 32.86 0.47 1.30 1.30 1.65 1.77 0.83 0.35 15.72 0.24 6.26 7.33 0.12 8.51 1.77 2.72 0.00 0.00 0.59 0.00 0.71 1.54 2.01 0.83 0.24 0.83 0.24 0.00 0.00 0.35 0.35 0.00 0.00 0.95 0.71 1.54 1.06 2.01 0.12 1.30 3.31 9.34 0.00 0.47 0.00 0.00 0.00
Code
# Table 5: Comorbidity Prevalence - Deceased
table5_data = []
deceased = df_head_neck[df_head_neck['egresar_fallecido'] == 1]

for site in sorted(deceased['cancer_site_group'].unique()):
    site_df = deceased[deceased['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table5_data.append(row)

table5_df = pd.DataFrame(table5_data)
show_table(table5_df, "Table 5: Comorbidity Prevalence - Deceased (%)", index=False)

Table 5: Comorbidity Prevalence - Deceased (%)

Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.36 0.00 9.09 6.06 15.15 3.03 0.00 0.00 12.12 0.00 0.00 9.09 0.00 21.21 12.12 3.03 0.00 3.03 0.00 0.00 0.00 3.03 0.00 0.00 0.00 0.00 3.03 0.00 0.00 0.00 3.03 0.00 0.00 0.00 3.03 3.03 0.00 3.03 0.00 3.03 0.00 12.12 0.00 0.00 3.03 0.00 0.00
Larynx 45.96 1.01 7.58 8.59 24.75 5.05 8.08 0.00 24.24 0.00 2.02 6.06 1.52 12.12 18.69 3.03 0.00 5.05 2.53 1.52 1.01 23.23 12.12 3.03 0.51 2.02 4.55 1.01 0.00 0.00 4.04 0.00 0.51 2.02 0.51 3.03 3.03 3.54 0.00 0.00 6.06 19.70 0.00 0.51 0.51 0.00 0.00
Nasopharynx 34.78 0.00 4.35 0.00 8.70 8.70 13.04 0.00 13.04 0.00 0.00 0.00 4.35 13.04 0.00 0.00 0.00 4.35 0.00 0.00 0.00 17.39 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.35 0.00 0.00 0.00 0.00 0.00 0.00 17.39 0.00 0.00 0.00 0.00 0.00
Oral Cavity 47.73 1.14 5.68 4.55 20.45 5.11 7.39 0.57 19.32 0.00 2.84 9.66 2.27 17.05 6.82 3.98 0.00 1.14 3.41 1.14 0.57 15.91 9.09 2.84 0.00 3.98 1.70 0.57 0.00 0.00 1.14 0.57 0.00 3.41 0.57 1.70 3.98 3.41 0.57 0.57 5.68 16.48 0.00 1.14 0.00 0.00 0.00
Oropharynx 40.00 0.00 1.82 5.45 14.55 5.45 3.64 0.00 18.18 0.00 3.64 5.45 1.82 3.64 10.91 1.82 0.00 0.00 3.64 0.00 0.00 16.36 7.27 1.82 0.00 7.27 1.82 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.82 3.64 5.45 3.64 0.00 3.64 10.91 0.00 3.64 1.82 0.00 0.00
Other Pharynx 46.67 3.33 0.00 6.67 13.33 6.67 6.67 3.33 13.33 0.00 3.33 0.00 0.00 3.33 10.00 0.00 0.00 3.33 0.00 0.00 0.00 16.67 6.67 3.33 0.00 0.00 3.33 0.00 0.00 0.00 3.33 0.00 0.00 0.00 0.00 6.67 3.33 6.67 3.33 0.00 3.33 16.67 0.00 0.00 0.00 0.00 0.00
Salivary Glands 32.61 2.17 8.70 8.70 21.74 6.52 0.00 0.00 15.22 2.17 2.17 6.52 0.00 10.87 2.17 0.00 0.00 2.17 0.00 0.00 0.00 21.74 8.70 2.17 0.00 2.17 2.17 0.00 0.00 0.00 0.00 0.00 0.00 2.17 0.00 0.00 2.17 2.17 0.00 0.00 4.35 8.70 0.00 0.00 0.00 0.00 0.00
Sinonasal 36.59 2.44 0.00 2.44 2.44 9.76 0.00 2.44 21.95 0.00 9.76 7.32 2.44 7.32 2.44 0.00 0.00 0.00 0.00 0.00 0.00 4.88 2.44 0.00 0.00 2.44 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.44 2.44 2.44 0.00 0.00 2.44 12.20 0.00 0.00 0.00 0.00 0.00
Code
# Table 6: Summary by Discharge Status
table6_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    for status in [0, 1]:
        status_label = 'Discharged' if status == 0 else 'Deceased'
        subset = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == status)]

        if len(subset) > 0:
            table6_data.append({
                'Site': site,
                'Status': status_label,
                'N': len(subset),
                'Mean Age': subset['edad'].mean(),
                'Mean LOS': subset['dias_estancia'].mean(),
                'Median LOS': subset['dias_estancia'].median()
            })

table6_df = pd.DataFrame(table6_data)
show_table(table6_df, "Table 6: Summary by Discharge Status", index=False)

Table 6: Summary by Discharge Status

Site Status N Mean Age Mean LOS Median LOS
Hypopharynx Discharged 367 68.30 12.88 7.00
Hypopharynx Deceased 33 71.94 23.00 13.00
Larynx Discharged 2427 67.30 13.67 7.00
Larynx Deceased 198 71.64 24.63 12.50
Nasopharynx Discharged 345 49.49 7.73 4.00
Nasopharynx Deceased 23 60.67 27.48 13.00
Oral Cavity Discharged 2825 64.06 9.22 4.00
Oral Cavity Deceased 176 68.54 15.80 6.00
Oropharynx Discharged 896 61.89 7.87 4.00
Oropharynx Deceased 55 66.57 16.96 9.00
Other Pharynx Discharged 251 55.71 10.04 5.00
Other Pharynx Deceased 30 72.43 17.23 5.50
Salivary Glands Discharged 852 60.34 6.41 3.00
Salivary Glands Deceased 46 64.11 16.93 9.00
Sinonasal Discharged 846 55.32 6.27 3.00
Sinonasal Deceased 41 58.38 15.20 11.00
Code
# Table 7: LOS by Site and Discharge Status
table7_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    for status in [0, 1]:
        status_label = 'Discharged' if status == 0 else 'Deceased'
        subset = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == status)]

        if len(subset) > 0:
            los_vals = subset['dias_estancia']
            q1 = los_vals.quantile(0.25)
            q3 = los_vals.quantile(0.75)
            iqr = q3 - q1

            table7_data.append({
                'Site': site,
                'Status': status_label,
                'N': len(subset),
                'Mean LOS': los_vals.mean(),
                'Median LOS': los_vals.median(),
                'IQR': iqr
            })

table7_df = pd.DataFrame(table7_data)
show_table(table7_df, "Table 7: Length of Stay by Site and Discharge Status", index=False)

Table 7: Length of Stay by Site and Discharge Status

Site Status N Mean LOS Median LOS IQR
Hypopharynx Discharged 367 12.88 7.00 14.00
Hypopharynx Deceased 33 23.00 13.00 27.00
Larynx Discharged 2427 13.67 7.00 16.00
Larynx Deceased 198 24.63 12.50 25.50
Nasopharynx Discharged 345 7.73 4.00 6.00
Nasopharynx Deceased 23 27.48 13.00 14.50
Oral Cavity Discharged 2825 9.22 4.00 9.00
Oral Cavity Deceased 176 15.80 6.00 12.25
Oropharynx Discharged 896 7.87 4.00 8.00
Oropharynx Deceased 55 16.96 9.00 18.00
Other Pharynx Discharged 251 10.04 5.00 9.50
Other Pharynx Deceased 30 17.23 5.50 23.00
Salivary Glands Discharged 852 6.41 3.00 6.00
Salivary Glands Deceased 46 16.93 9.00 20.25
Sinonasal Discharged 846 6.27 3.00 6.00
Sinonasal Deceased 41 15.20 11.00 19.00
Code
# Table 8: Yearly Trends Aggregate
table8_data = []
for year in sorted(df_head_neck['year'].unique()):
    year_df = df_head_neck[df_head_neck['year'] == year]
    n_total = len(year_df)
    mortality_pct = (year_df['egresar_fallecido'].sum() / n_total * 100) if n_total > 0 else 0
    mean_los = year_df['dias_estancia'].mean()
    mean_age = year_df['edad'].mean()

    row = {
        'Year': year,
        'Total': n_total,
        'Mortality %': mortality_pct,
        'Mean Age': mean_age,
        'Mean LOS': mean_los
    }

    # Add per-site counts
    for site in sorted(df_head_neck['cancer_site_group'].unique()):
        site_count = len(year_df[year_df['cancer_site_group'] == site])
        row[f'{site} N'] = site_count

    table8_data.append(row)

table8_df = pd.DataFrame(table8_data)
show_table(table8_df, "Table 8: Yearly Trends (Aggregate)", index=False)

Table 8: Yearly Trends (Aggregate)

Year Total Mortality % Mean Age Mean LOS Hypopharynx N Larynx N Nasopharynx N Oral Cavity N Oropharynx N Other Pharynx N Salivary Glands N Sinonasal N
2019 1716 5.36 62.66 8.27 63 450 61 571 183 50 173 165
2020 1234 6.48 62.91 11.49 55 343 48 387 108 35 126 132
2021 1370 7.81 62.56 11.12 60 408 47 418 126 54 135 122
2022 1505 6.51 63.22 11.60 71 407 45 488 160 49 155 130
2023 1623 6.28 64.38 10.26 77 504 50 505 164 38 125 160
2024 1963 6.27 63.57 10.77 74 513 117 632 210 55 184 178
Code
# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# 1. Age distribution
axes[0, 0].hist(df_head_neck['edad'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df_head_neck['edad'].mean(), color='red', linestyle='--', linewidth=2,
                    label=f'Mean: {df_head_neck["edad"].mean():.1f}')
axes[0, 0].set_xlabel('Age (years)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# 2. Length of stay distribution
axes[0, 1].hist(df_head_neck['dias_estancia'], bins=50, color='forestgreen', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(df_head_neck['dias_estancia'].mean(), color='red', linestyle='--', linewidth=2,
                    label=f'Mean: {df_head_neck["dias_estancia"].mean():.1f}')
axes[0, 1].set_xlabel('Length of stay (days)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Length of Stay Distribution')
axes[0, 1].legend()

# 3. Mortality by sex
mortality_by_sex = df_head_neck.groupby('SEXO')['egresar_fallecido'].mean() * 100
axes[0, 2].bar(mortality_by_sex.index, mortality_by_sex.values, color=['#FF6B6B', '#4ECDC4'])
axes[0, 2].set_ylabel('Mortality rate (%)')
axes[0, 2].set_title('Mortality by Sex')
for i, v in enumerate(mortality_by_sex.values):
    axes[0, 2].text(i, v + 0.5, f'{v:.1f}%', ha='center', fontweight='bold')

# 4. Records by year
year_counts = df_head_neck['year'].value_counts().sort_index()
axes[1, 0].plot(year_counts.index, year_counts.values, marker='o', linewidth=2, markersize=8, color='darkviolet')
axes[1, 0].fill_between(year_counts.index, year_counts.values, alpha=0.3, color='darkviolet')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Number of records')
axes[1, 0].set_title('Records by Year')
axes[1, 0].grid(True, alpha=0.3)

# 5. Cancer site distribution
site_counts = df_head_neck['cancer_site_group'].value_counts()
axes[1, 1].barh(site_counts.index, site_counts.values, color='coral')
axes[1, 1].set_xlabel('Number of cases')
axes[1, 1].set_title('Cases by Cancer Site')
for i, v in enumerate(site_counts.values):
    axes[1, 1].text(v + 50, i, f'{v:,.0f}', va='center')

# 6. Service category
service_counts = df_head_neck['service_category'].value_counts()
colors_pie = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']
axes[1, 2].pie(service_counts.values, labels=service_counts.index, autopct='%1.1f%%', startangle=90)
axes[1, 2].set_title('Service Category Distribution')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '01_descriptive_stats.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-descriptive-stats.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 1: Figure 1: Descriptive Distribution of Head and Neck Cancer Hospitalizations

5 4.1 Age Distribution by Cancer Site

Code
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Violin plot
ax = axes[0]
for i, site in enumerate(df_head_neck['cancer_site_group'].unique()):
    data = df_head_neck[df_head_neck['cancer_site_group'] == site]['edad']
    positions = [i]
    ax.violinplot([data], positions=positions, widths=0.7, showmeans=True, showmedians=True)
ax.set_xticks(range(len(df_head_neck['cancer_site_group'].unique())))
ax.set_xticklabels(df_head_neck['cancer_site_group'].unique(), rotation=30, ha='right', fontsize=9)
ax.set_ylabel('Age (years)')
ax.set_title('Age Distribution by Cancer Site (Violin Plot)')
ax.grid(True, alpha=0.3, axis='y')

# Box plot
ax = axes[1]
data_by_site = [df_head_neck[df_head_neck['cancer_site_group'] == site]['edad'].values
                 for site in df_head_neck['cancer_site_group'].unique()]
bp = ax.boxplot(data_by_site, labels=df_head_neck['cancer_site_group'].unique(), patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
ax.set_ylabel('Age (years)')
ax.set_title('Age Distribution by Cancer Site (Box Plot)')
plt.setp(ax.get_xticklabels(), rotation=30, ha='right', fontsize=9)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '02_age_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-age-by-site.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 2: Figure 2. Age Distribution by Cancer Site

6 4.2 Comorbidity Prevalence Tables

Code
# Table: Comorbidity prevalence - ALL patients
comorb_all = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100)
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_all.append(row)

comorb_all_df = pd.DataFrame(comorb_all)
show_table(comorb_all_df, "Table 13: Comorbidity Prevalence - All Patients (%)", index=False)

Table 13: Comorbidity Prevalence - All Patients (%)

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.8% 0.2% 4.0% 1.5% 3.8% 0.8% 1.5% 0.2% 14.2% 0.2% 1.5% 9.5% 0.2% 16.0% 6.2% 1.2% 0.0% 2.0% 1.0% 0.0% 0.2% 1.5% 4.5% 1.8% 0.0% 1.0% 1.0% 0.0% 0.8% 0.0% 0.8% 0.0% 0.0% 1.2% 0.5% 1.8% 0.5% 2.0% 0.0% 1.0% 6.5% 16.8% 0.0% 1.5% 0.5% 0.5% 0.0%
Larynx 45.8% 0.5% 5.8% 3.5% 6.9% 2.9% 3.1% 0.3% 19.8% 0.2% 5.0% 8.4% 0.5% 11.2% 13.8% 3.6% 0.0% 1.8% 1.0% 0.5% 1.0% 6.1% 5.2% 1.1% 0.3% 0.9% 1.3% 0.1% 0.2% 0.6% 1.1% 0.0% 0.0% 1.3% 0.7% 1.6% 2.1% 2.3% 0.3% 0.9% 4.5% 24.3% 0.0% 0.6% 0.3% 0.1% 0.0%
Nasopharynx 28.3% 0.0% 2.2% 1.6% 3.0% 1.6% 4.1% 0.0% 12.5% 0.3% 7.1% 9.0% 0.5% 9.5% 0.8% 3.5% 0.0% 0.5% 1.1% 0.0% 0.3% 5.4% 3.3% 0.5% 0.3% 0.3% 0.3% 0.0% 0.0% 0.3% 0.3% 0.3% 0.0% 0.0% 0.3% 1.1% 0.8% 0.5% 0.0% 3.8% 4.1% 7.9% 0.0% 1.4% 0.0% 0.0% 0.0%
Oral Cavity 43.9% 0.5% 3.8% 2.2% 5.6% 2.4% 2.1% 0.4% 18.7% 0.4% 5.2% 11.4% 0.6% 12.3% 5.1% 1.8% 0.1% 0.7% 0.7% 0.3% 0.3% 4.5% 4.6% 1.0% 0.2% 1.4% 0.6% 0.3% 0.2% 0.4% 0.5% 0.1% 0.1% 1.7% 0.7% 1.2% 1.9% 3.0% 0.2% 0.7% 4.8% 14.8% 0.0% 0.7% 0.1% 0.0% 0.2%
Oropharynx 37.0% 0.1% 2.4% 1.1% 3.2% 2.3% 2.7% 0.1% 13.7% 0.3% 4.2% 7.0% 0.1% 8.7% 3.7% 1.8% 0.1% 1.1% 0.7% 0.2% 0.3% 3.0% 2.5% 0.6% 0.2% 1.5% 0.8% 0.0% 0.0% 0.2% 0.3% 0.1% 0.0% 0.4% 0.6% 2.2% 2.0% 2.3% 0.6% 1.2% 4.8% 13.2% 0.1% 2.1% 0.2% 0.0% 0.0%
Other Pharynx 33.5% 0.4% 2.1% 2.5% 5.0% 3.6% 3.9% 0.7% 15.3% 0.0% 2.8% 6.0% 0.0% 9.3% 5.0% 1.1% 0.0% 1.4% 1.1% 0.0% 3.2% 3.6% 4.3% 1.4% 0.0% 1.1% 0.4% 0.0% 0.0% 1.1% 1.8% 0.4% 0.4% 0.7% 0.0% 3.6% 1.8% 2.1% 0.4% 0.7% 1.8% 11.0% 0.0% 1.4% 0.4% 0.0% 0.0%
Salivary Glands 37.5% 0.9% 3.0% 2.2% 4.7% 1.7% 0.9% 0.3% 20.0% 0.2% 8.0% 10.5% 0.7% 11.4% 4.1% 3.8% 0.0% 0.2% 0.8% 0.2% 0.8% 4.2% 4.2% 0.8% 0.1% 0.8% 0.2% 0.1% 0.0% 0.2% 0.4% 0.2% 0.0% 0.8% 1.2% 1.8% 1.3% 3.1% 0.4% 1.0% 2.0% 9.2% 0.0% 0.6% 0.6% 0.0% 0.0%
Sinonasal 33.0% 0.6% 1.2% 1.4% 1.7% 2.1% 0.8% 0.5% 16.0% 0.2% 6.4% 7.3% 0.2% 8.5% 1.8% 2.6% 0.0% 0.0% 0.6% 0.0% 0.7% 1.7% 2.0% 0.8% 0.2% 0.9% 0.2% 0.0% 0.0% 0.3% 0.3% 0.0% 0.0% 0.9% 0.7% 1.6% 1.1% 2.0% 0.1% 1.2% 3.3% 9.5% 0.0% 0.5% 0.0% 0.0% 0.0%
Code
# Table: Comorbidity prevalence - DISCHARGED patients
comorb_discharged = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 0)]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_discharged.append(row)

comorb_discharged_df = pd.DataFrame(comorb_discharged)
show_table(comorb_discharged_df, "Table 14: Comorbidity Prevalence - Discharged Patients (%)", index=False)

Table 14: Comorbidity Prevalence - Discharged Patients (%)

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.8% 0.3% 3.5% 1.1% 2.7% 0.5% 1.6% 0.3% 14.4% 0.3% 1.6% 9.5% 0.3% 15.5% 5.7% 1.1% 0.0% 1.9% 1.1% 0.0% 0.3% 1.4% 4.9% 1.9% 0.0% 1.1% 0.8% 0.0% 0.8% 0.0% 0.5% 0.0% 0.0% 1.4% 0.3% 1.6% 0.5% 1.9% 0.0% 0.8% 7.1% 17.2% 0.0% 1.6% 0.3% 0.5% 0.0%
Larynx 45.8% 0.5% 5.6% 3.1% 5.5% 2.8% 2.7% 0.4% 19.4% 0.2% 5.2% 8.6% 0.4% 11.1% 13.4% 3.7% 0.0% 1.5% 0.9% 0.4% 1.0% 4.7% 4.6% 0.9% 0.3% 0.8% 1.0% 0.0% 0.2% 0.6% 0.9% 0.0% 0.0% 1.3% 0.7% 1.4% 2.0% 2.2% 0.3% 0.9% 4.4% 24.6% 0.0% 0.7% 0.2% 0.1% 0.0%
Nasopharynx 27.8% 0.0% 2.0% 1.7% 2.6% 1.2% 3.5% 0.0% 12.5% 0.3% 7.5% 9.6% 0.3% 9.3% 0.9% 3.8% 0.0% 0.3% 1.2% 0.0% 0.3% 4.6% 3.5% 0.6% 0.3% 0.3% 0.3% 0.0% 0.0% 0.3% 0.3% 0.3% 0.0% 0.0% 0.0% 1.2% 0.9% 0.6% 0.0% 4.1% 4.3% 7.2% 0.0% 1.4% 0.0% 0.0% 0.0%
Oral Cavity 43.6% 0.5% 3.7% 2.0% 4.7% 2.3% 1.8% 0.4% 18.6% 0.5% 5.3% 11.5% 0.5% 12.0% 5.0% 1.6% 0.1% 0.7% 0.6% 0.2% 0.3% 3.8% 4.3% 0.9% 0.2% 1.2% 0.6% 0.3% 0.2% 0.5% 0.5% 0.1% 0.1% 1.6% 0.7% 1.2% 1.8% 2.9% 0.2% 0.7% 4.7% 14.7% 0.0% 0.7% 0.1% 0.0% 0.2%
Oropharynx 36.8% 0.1% 2.5% 0.8% 2.5% 2.1% 2.7% 0.1% 13.4% 0.3% 4.2% 7.1% 0.0% 9.0% 3.2% 1.8% 0.1% 1.1% 0.6% 0.2% 0.3% 2.2% 2.2% 0.6% 0.2% 1.1% 0.8% 0.0% 0.0% 0.2% 0.3% 0.1% 0.0% 0.4% 0.7% 2.2% 1.9% 2.1% 0.4% 1.2% 4.9% 13.4% 0.1% 2.0% 0.1% 0.0% 0.0%
Other Pharynx 31.9% 0.0% 2.4% 2.0% 4.0% 3.2% 3.6% 0.4% 15.5% 0.0% 2.8% 6.8% 0.0% 10.0% 4.4% 1.2% 0.0% 1.2% 1.2% 0.0% 3.6% 2.0% 4.0% 1.2% 0.0% 1.2% 0.0% 0.0% 0.0% 1.2% 1.6% 0.4% 0.4% 0.8% 0.0% 3.2% 1.6% 1.6% 0.0% 0.8% 1.6% 10.4% 0.0% 1.6% 0.4% 0.0% 0.0%
Salivary Glands 37.8% 0.8% 2.7% 1.9% 3.8% 1.4% 0.9% 0.4% 20.3% 0.1% 8.3% 10.7% 0.7% 11.4% 4.2% 4.0% 0.0% 0.1% 0.8% 0.2% 0.8% 3.3% 4.0% 0.7% 0.1% 0.7% 0.1% 0.1% 0.0% 0.2% 0.5% 0.2% 0.0% 0.7% 1.3% 1.9% 1.3% 3.2% 0.5% 1.1% 1.9% 9.3% 0.0% 0.6% 0.6% 0.0% 0.0%
Sinonasal 32.9% 0.5% 1.3% 1.3% 1.7% 1.8% 0.8% 0.4% 15.7% 0.2% 6.3% 7.3% 0.1% 8.5% 1.8% 2.7% 0.0% 0.0% 0.6% 0.0% 0.7% 1.5% 2.0% 0.8% 0.2% 0.8% 0.2% 0.0% 0.0% 0.4% 0.4% 0.0% 0.0% 0.9% 0.7% 1.5% 1.1% 2.0% 0.1% 1.3% 3.3% 9.3% 0.0% 0.5% 0.0% 0.0% 0.0%
Code
# Table: Comorbidity prevalence - DECEASED patients
comorb_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 1)]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_deceased.append(row)

comorb_deceased_df = pd.DataFrame(comorb_deceased)
show_table(comorb_deceased_df, "Table 15: Comorbidity Prevalence - Deceased Patients (%)", index=False)

Table 15: Comorbidity Prevalence - Deceased Patients (%)

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 36.4% 0.0% 9.1% 6.1% 15.2% 3.0% 0.0% 0.0% 12.1% 0.0% 0.0% 9.1% 0.0% 21.2% 12.1% 3.0% 0.0% 3.0% 0.0% 0.0% 0.0% 3.0% 0.0% 0.0% 0.0% 0.0% 3.0% 0.0% 0.0% 0.0% 3.0% 0.0% 0.0% 0.0% 3.0% 3.0% 0.0% 3.0% 0.0% 3.0% 0.0% 12.1% 0.0% 0.0% 3.0% 0.0% 0.0%
Larynx 46.0% 1.0% 7.6% 8.6% 24.7% 5.1% 8.1% 0.0% 24.2% 0.0% 2.0% 6.1% 1.5% 12.1% 18.7% 3.0% 0.0% 5.1% 2.5% 1.5% 1.0% 23.2% 12.1% 3.0% 0.5% 2.0% 4.5% 1.0% 0.0% 0.0% 4.0% 0.0% 0.5% 2.0% 0.5% 3.0% 3.0% 3.5% 0.0% 0.0% 6.1% 19.7% 0.0% 0.5% 0.5% 0.0% 0.0%
Nasopharynx 34.8% 0.0% 4.3% 0.0% 8.7% 8.7% 13.0% 0.0% 13.0% 0.0% 0.0% 0.0% 4.3% 13.0% 0.0% 0.0% 0.0% 4.3% 0.0% 0.0% 0.0% 17.4% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 4.3% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 17.4% 0.0% 0.0% 0.0% 0.0% 0.0%
Oral Cavity 47.7% 1.1% 5.7% 4.5% 20.5% 5.1% 7.4% 0.6% 19.3% 0.0% 2.8% 9.7% 2.3% 17.0% 6.8% 4.0% 0.0% 1.1% 3.4% 1.1% 0.6% 15.9% 9.1% 2.8% 0.0% 4.0% 1.7% 0.6% 0.0% 0.0% 1.1% 0.6% 0.0% 3.4% 0.6% 1.7% 4.0% 3.4% 0.6% 0.6% 5.7% 16.5% 0.0% 1.1% 0.0% 0.0% 0.0%
Oropharynx 40.0% 0.0% 1.8% 5.5% 14.5% 5.5% 3.6% 0.0% 18.2% 0.0% 3.6% 5.5% 1.8% 3.6% 10.9% 1.8% 0.0% 0.0% 3.6% 0.0% 0.0% 16.4% 7.3% 1.8% 0.0% 7.3% 1.8% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 1.8% 3.6% 5.5% 3.6% 0.0% 3.6% 10.9% 0.0% 3.6% 1.8% 0.0% 0.0%
Other Pharynx 46.7% 3.3% 0.0% 6.7% 13.3% 6.7% 6.7% 3.3% 13.3% 0.0% 3.3% 0.0% 0.0% 3.3% 10.0% 0.0% 0.0% 3.3% 0.0% 0.0% 0.0% 16.7% 6.7% 3.3% 0.0% 0.0% 3.3% 0.0% 0.0% 0.0% 3.3% 0.0% 0.0% 0.0% 0.0% 6.7% 3.3% 6.7% 3.3% 0.0% 3.3% 16.7% 0.0% 0.0% 0.0% 0.0% 0.0%
Salivary Glands 32.6% 2.2% 8.7% 8.7% 21.7% 6.5% 0.0% 0.0% 15.2% 2.2% 2.2% 6.5% 0.0% 10.9% 2.2% 0.0% 0.0% 2.2% 0.0% 0.0% 0.0% 21.7% 8.7% 2.2% 0.0% 2.2% 2.2% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 2.2% 0.0% 0.0% 2.2% 2.2% 0.0% 0.0% 4.3% 8.7% 0.0% 0.0% 0.0% 0.0% 0.0%
Sinonasal 36.6% 2.4% 0.0% 2.4% 2.4% 9.8% 0.0% 2.4% 22.0% 0.0% 9.8% 7.3% 2.4% 7.3% 2.4% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 4.9% 2.4% 0.0% 0.0% 2.4% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 0.0% 2.4% 2.4% 2.4% 0.0% 0.0% 2.4% 12.2% 0.0% 0.0% 0.0% 0.0% 0.0%
Code
# Table: Difference (Deceased - Discharged) with chi-square
from scipy.stats import chi2_contingency

comorb_diff = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_discharged = df_site[df_site['egresar_fallecido'] == 0]
        df_deceased = df_site[df_site['egresar_fallecido'] == 1]

        prev_discharged = (df_discharged[col].mean() * 100) if len(df_discharged) > 0 else 0
        prev_deceased = (df_deceased[col].mean() * 100) if len(df_deceased) > 0 else 0

        diff = prev_deceased - prev_discharged

        # Chi-square test
        contingency_table = pd.crosstab(df_site['egresar_fallecido'], df_site[col])
        try:
            chi2, pval, dof, expected = chi2_contingency(contingency_table)
            sig_stars = ''
            if pval < 0.001:
                sig_stars = '***'
            elif pval < 0.01:
                sig_stars = '**'
            elif pval < 0.05:
                sig_stars = '*'
        except:
            sig_stars = ''

        row[format_name(col)] = f'{diff:+.1f}%{sig_stars}'

    comorb_diff.append(row)

comorb_diff_df = pd.DataFrame(comorb_diff)
show_table(comorb_diff_df, "Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %)\n* p<0.05, ** p<0.01, *** p<0.001", index=False)

Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %) * p<0.05, ** p<0.01, *** p<0.001

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx -0.4% -0.3% +5.5% +5.0% +12.4%** +2.5% -1.6% -0.3% -2.3% -0.3% -1.6% -0.4% -0.3% +5.7% +6.4% +1.9% +0.0% +1.1% -1.1% +0.0% -0.3% +1.7% -4.9% -1.9% +0.0% -1.1% +2.2% +0.0% -0.8% +0.0% +2.5% +0.0% +0.0% -1.4% +2.8% +1.4% -0.5% +1.1% +0.0% +2.2% -7.1% -5.0% +0.0% -1.6% +2.8% -0.5% +0.0%
Larynx +0.2% +0.6% +1.9% +5.5%*** +19.3%*** +2.3% +5.4%*** -0.4% +4.8% -0.2% -3.2% -2.6% +1.1% +1.0% +5.3%* -0.6% -0.0% +3.6%*** +1.6% +1.1% +0.0% +18.5%*** +7.5%*** +2.1%* +0.2% +1.2% +3.5%*** +1.0%*** -0.2% -0.6% +3.1%*** +0.0% +0.5% +0.7% -0.2% +1.6% +1.0% +1.3% -0.3% -0.9% +1.7% -4.9% -0.0% -0.2% +0.3% -0.1% -0.0%
Nasopharynx +7.0% +0.0% +2.3% -1.7% +6.1% +7.5% +9.6% +0.0% +0.6% -0.3% -7.5% -9.6% +4.1% +3.8% -0.9% -3.8% +0.0% +4.1% -1.2% +0.0% -0.3% +12.8%* -3.5% -0.6% -0.3% -0.3% -0.3% +0.0% +0.0% -0.3% -0.3% -0.3% +0.0% +0.0% +4.3% -1.2% -0.9% -0.6% +0.0% -4.1% -4.3% +10.1% +0.0% -1.4% +0.0% +0.0% +0.0%
Oral Cavity +4.1% +0.7% +2.0% +2.5%* +15.7%*** +2.8%* +5.6%*** +0.2% +0.7% -0.5% -2.5% -1.9% +1.8%* +5.1% +1.8% +2.3%* -0.1% +0.4% +2.8%*** +0.9% +0.3% +12.1%*** +4.8%** +1.9%* -0.2% +2.8%** +1.1% +0.3% -0.2% -0.5% +0.6% +0.5% -0.1% +1.8% -0.1% +0.5% +2.2% +0.5% +0.4% -0.1% +1.0% +1.8% +0.0% +0.4% -0.1% +0.0% -0.2%
Oropharynx +3.2% -0.1% -0.6% +4.7%** +12.1%*** +3.3% +1.0% -0.1% +4.8% -0.3% -0.6% -1.7% +1.8% -5.4% +7.7%* +0.0% -0.1% -1.1% +3.1% -0.2% -0.3% +14.1%*** +5.0% +1.3% -0.2% +6.2%** +1.0% +0.0% +0.0% -0.2% -0.3% -0.1% +0.0% -0.4% -0.7% -0.4% +1.7% +3.3% +3.2%* -1.2% -1.3% -2.5% -0.1% +1.6% +1.7% +0.0% +0.0%
Other Pharynx +14.8% +3.3% -2.4% +4.7% +9.3% +3.5% +3.1% +2.9% -2.2% +0.0% +0.5% -6.8% +0.0% -6.6% +5.6% -1.2% +0.0% +2.1% -1.2% +0.0% -3.6% +14.7%*** +2.7% +2.1% +0.0% -1.2% +3.3% +0.0% +0.0% -1.2% +1.7% -0.4% -0.4% -0.8% +0.0% +3.5% +1.7% +5.1% +3.3% -0.8% +1.7% +6.3% +0.0% -1.6% -0.4% +0.0% +0.0%
Salivary Glands -5.2% +1.4% +6.0% +6.8%* +18.0%*** +5.1%* -0.9% -0.4% -5.1% +2.1% -6.2% -4.2% -0.7% -0.5% -2.1% -4.0% +0.0% +2.1% -0.8% -0.2% -0.8% +18.5%*** +4.7% +1.5% -0.1% +1.5% +2.1% -0.1% +0.0% -0.2% -0.5% -0.2% +0.0% +1.5% -1.3% -1.9% +0.9% -1.0% -0.5% -1.1% +2.5% -0.6% +0.0% -0.6% -0.6% +0.0% +0.0%
Sinonasal +3.7% +2.0% -1.3% +1.1% +0.8% +8.0%** -0.8% +2.1% +6.2% -0.2% +3.5% -0.0% +2.3% -1.2% +0.7% -2.7% +0.0% +0.0% -0.6% +0.0% -0.7% +3.3% +0.4% -0.8% -0.2% +1.6% -0.2% +0.0% +0.0% -0.4% -0.4% +0.0% +0.0% -0.9% -0.7% +0.9% +1.4% +0.4% -0.1% -1.3% -0.9% +2.9% +0.0% -0.5% +0.0% +0.0% +0.0%
Code
# Prepare data for heatmap
heatmap_data_all = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = []
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100)
        row.append(prev)
    heatmap_data_all.append(row)

heatmap_df_all = pd.DataFrame(heatmap_data_all,
                               index=sorted(df_head_neck['cancer_site_group'].unique()),
                               columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_df_all, annot=True, fmt='.1f', cmap='viridis',
            cbar_kws={'label': 'Prevalence (%)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Comorbidity Prevalence by Cancer Site (All Patients)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '03_comorb_heatmap_all.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-comorb-heatmap-all.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 3: Figure 3. Heatmap: Comorbidity Prevalence (All Patients)
Code
# Prepare data for heatmap
heatmap_data_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 1)]
    row = []
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row.append(prev)
    heatmap_data_deceased.append(row)

heatmap_df_deceased = pd.DataFrame(heatmap_data_deceased,
                                    index=sorted(df_head_neck['cancer_site_group'].unique()),
                                    columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_df_deceased, annot=True, fmt='.1f', cmap='viridis',
            cbar_kws={'label': 'Prevalence (%)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Comorbidity Prevalence by Cancer Site (Deceased Patients)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '04_comorb_heatmap_deceased.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-comorb-heatmap-deceased.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 4: Figure 4. Heatmap: Comorbidity Prevalence (Deceased Patients)

7 4.3 Length of Stay Analysis

Code
# Table: LOS by cancer site and discharge status
los_summary = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]

    for status in ['Discharged', 'Deceased']:
        status_code = 0 if status == 'Discharged' else 1
        df_status = df_site[df_site['egresar_fallecido'] == status_code]

        if len(df_status) > 0:
            los_summary.append({
                'Cancer Site': site,
                'Status': status,
                'N': len(df_status),
                'Mean': df_status['dias_estancia'].mean(),
                'Median': df_status['dias_estancia'].median(),
                'SD': df_status['dias_estancia'].std(),
                'Min': df_status['dias_estancia'].min(),
                'Max': df_status['dias_estancia'].max()
            })

los_summary_df = pd.DataFrame(los_summary)
show_table(los_summary_df, "Table 17: Length of Stay by Cancer Site and Discharge Status", index=False)

Table 17: Length of Stay by Cancer Site and Discharge Status

Cancer Site Status N Mean Median SD Min Max
Hypopharynx Discharged 367 12.88 7.00 17.16 0.00 171.00
Hypopharynx Deceased 33 23.00 13.00 32.18 0.00 164.00
Larynx Discharged 2427 13.67 7.00 20.08 0.00 317.00
Larynx Deceased 198 24.63 12.50 35.82 0.00 250.00
Nasopharynx Discharged 345 7.73 4.00 14.79 0.00 155.00
Nasopharynx Deceased 23 27.48 13.00 41.43 1.00 138.00
Oral Cavity Discharged 2825 9.22 4.00 14.98 0.00 216.00
Oral Cavity Deceased 176 15.80 6.00 30.88 0.00 271.00
Oropharynx Discharged 896 7.87 4.00 12.48 0.00 105.00
Oropharynx Deceased 55 16.96 9.00 24.96 0.00 141.00
Other Pharynx Discharged 251 10.04 5.00 14.70 0.00 96.00
Other Pharynx Deceased 30 17.23 5.50 23.66 0.00 106.00
Salivary Glands Discharged 852 6.41 3.00 10.75 0.00 136.00
Salivary Glands Deceased 46 16.93 9.00 20.34 0.00 100.00
Sinonasal Discharged 846 6.27 3.00 10.99 0.00 151.00
Sinonasal Deceased 41 15.20 11.00 14.19 0.00 51.00
Code
# Table: Mean LOS by cancer site and comorbidity presence
los_by_comorb = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        row[format_name(col)] = f'{los_with:.1f}'

    los_by_comorb.append(row)

los_by_comorb_df = pd.DataFrame(los_by_comorb)
show_table(los_by_comorb_df, "Table 18: Mean Length of Stay When Comorbidity Present (days)", index=False)

Table 18: Mean Length of Stay When Comorbidity Present (days)

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx 15.1 8.0 19.1 18.3 15.3 8.0 45.0 4.0 14.6 1.0 10.2 14.2 30.0 19.0 19.6 17.4 0.0 41.1 32.2 0.0 108.0 20.7 21.8 32.3 0.0 12.2 18.0 0.0 14.0 0.0 22.3 0.0 0.0 15.0 18.5 25.4 5.5 33.2 0.0 13.8 24.3 19.4 0.0 12.7 48.0 5.5 0.0
Larynx 15.1 23.3 18.2 18.4 24.7 25.3 23.6 9.9 14.2 17.0 16.2 12.3 51.0 13.7 19.7 13.6 5.0 34.6 11.1 17.4 12.3 24.1 19.9 24.6 16.6 16.5 21.0 11.0 7.8 23.0 37.3 0.0 2.0 15.5 10.8 18.9 19.4 25.3 19.4 25.6 21.8 17.9 14.0 16.9 18.9 5.7 14.0
Nasopharynx 11.1 0.0 4.0 8.8 9.7 36.7 19.8 0.0 15.4 7.0 7.2 8.8 10.5 6.7 23.0 4.2 0.0 8.5 4.0 0.0 0.0 17.0 7.8 7.0 1.0 2.0 12.0 0.0 0.0 3.0 1.0 7.0 0.0 0.0 6.0 10.5 18.7 78.5 0.0 4.9 4.1 17.0 0.0 4.8 0.0 0.0 0.0
Oral Cavity 10.0 12.7 11.7 12.2 15.9 10.9 17.9 11.3 10.4 6.0 12.7 9.1 16.9 10.4 12.8 10.7 24.5 20.5 21.4 8.4 38.1 17.6 13.9 14.9 9.3 11.5 13.2 16.3 6.2 4.8 13.6 6.0 8.0 9.7 12.6 12.2 9.6 16.2 21.3 25.3 13.2 12.7 0.0 17.5 4.5 0.0 6.2
Oropharynx 9.7 12.0 14.8 18.3 14.7 13.2 19.3 7.0 9.0 7.3 8.6 6.9 27.0 8.5 20.9 17.3 12.0 15.2 21.7 5.0 11.0 18.7 8.2 10.2 6.0 25.7 28.8 0.0 0.0 12.0 26.3 1.0 0.0 37.0 13.3 12.3 12.9 21.6 21.5 15.6 12.5 11.8 55.0 5.2 44.5 0.0 0.0
Other Pharynx 11.5 26.0 16.7 18.1 28.7 17.7 24.8 19.0 13.1 0.0 8.8 7.6 0.0 12.3 14.6 2.7 0.0 24.2 6.0 0.0 5.8 13.7 18.5 10.2 0.0 6.7 4.0 0.0 0.0 5.7 10.6 2.0 7.0 26.0 0.0 17.2 13.4 15.2 6.0 36.0 22.2 16.5 0.0 3.5 12.0 0.0 0.0
Salivary Glands 7.8 12.8 13.5 12.6 17.2 13.5 11.2 7.0 7.3 13.5 6.0 6.3 3.7 6.4 11.1 7.1 0.0 18.5 14.0 20.0 6.9 15.5 10.8 6.4 6.0 13.1 19.5 8.0 0.0 5.0 19.0 2.0 0.0 10.6 12.8 4.9 12.0 10.2 2.0 2.3 7.8 9.7 0.0 10.6 4.8 0.0 0.0
Sinonasal 8.2 17.6 6.9 15.1 18.3 16.2 16.4 14.5 7.7 0.0 7.8 6.1 16.0 6.7 6.8 6.8 0.0 0.0 8.6 0.0 7.2 38.1 12.0 11.3 2.0 10.1 10.5 0.0 0.0 10.3 1.3 0.0 0.0 7.1 3.8 9.3 9.1 7.2 61.0 6.6 9.8 6.9 0.0 3.5 0.0 0.0 0.0
Code
# Table: LOS impact (difference WITH - WITHOUT comorbidity)
los_impact = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        df_without = df_site[df_site[col] == 0]

        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        los_without = df_without['dias_estancia'].mean() if len(df_without) > 0 else 0

        difference = los_with - los_without
        row[format_name(col)] = f'{difference:+.1f}'

    los_impact.append(row)

los_impact_df = pd.DataFrame(los_impact)
show_table(los_impact_df, "Table 19: Length of Stay Impact by Comorbidity (Days Difference: WITH - WITHOUT)", index=False)

Table 19: Length of Stay Impact by Comorbidity (Days Difference: WITH - WITHOUT)

Cancer Site CV Essential Hypertension CV Complicated Hypertension CV Coronary Disease CV Heart Failure CV Arrhythmias CV Cerebrovascular Disease CV Peripheral Vascular Disease MET Type1 Diabetes MET Type2 Diabetes MET Complicated Diabetes MET Obesity MET Dyslipidemia MET Metabolic Syndrome MET Thyroid Disorders RESP COPD RESP Asthma RESP Chronic Bronchitis RESP Emphysema RESP Pulmonary Fibrosis RESP Pulmonary Hypertension RESP Sleep Apnea REN Acute Renal Failure REN Chronic Renal Failure REN End Stage Renal REN Diabetic Nephropathy HEP Steatosis HEP Alcoholic Cirrhosis HEP NonAlcoholic Cirrhosis HEP Viral Hepatitis GI Reflux GI Peptic Ulcer GI IBD GI IBS NEURO Dementia NEURO Parkinson NEURO Epilepsy NEURO Stroke Sequelae PSI Major Depression PSI Bipolar PSI Anxiety SUST Alcohol SUST Tobacco SUST Opioids INF HIV INF Tuberculosis INF Hepatitis C INF Hepatitis B
Hypopharynx +2.2 -5.7 +5.6 +4.7 +1.7 -5.8 +31.8 -9.7 +1.0 -12.7 -3.6 +0.6 +16.3 +6.3 +6.3 +3.7 -13.7 +28.0 +18.7 -13.7 +94.5 +7.1 +8.5 +18.9 -13.7 -1.5 +4.3 -13.7 +0.3 -13.7 +8.7 -13.7 -13.7 +1.3 +4.8 +11.9 -8.3 +19.9 -13.7 +0.0 +11.3 +6.8 -13.7 -1.1 +34.5 -8.3 -13.7
Larynx +1.2 +8.9 +4.0 +4.0 +11.0 +11.1 +9.3 -4.6 -0.4 +2.5 +1.8 -2.3 +36.7 -0.9 +6.0 -0.9 -9.5 +20.5 -3.4 +2.9 -2.2 +10.2 +5.7 +10.2 +2.1 +2.0 +6.6 -3.5 -6.8 +8.6 +23.1 -14.5 -12.5 +1.1 -3.7 +4.5 +5.0 +11.1 +4.9 +11.2 +7.7 +4.5 -0.5 +2.4 +4.4 -8.8 -0.5
Nasopharynx +3.0 -9.0 -5.1 -0.1 +0.8 +28.2 +11.3 -9.0 +7.4 -2.0 -2.0 -0.1 +1.5 -2.5 +14.1 -5.0 -9.0 -0.5 -5.0 -9.0 -9.0 +8.5 -1.2 -2.0 -8.0 -7.0 +3.0 -9.0 -9.0 -6.0 -8.0 -2.0 -9.0 -9.0 -3.0 +1.5 +9.8 +69.9 -9.0 -4.3 -5.1 +8.8 -9.0 -4.2 -9.0 -9.0 -9.0
Oral Cavity +0.6 +3.1 +2.2 +2.7 +6.6 +1.3 +8.4 +1.7 +0.9 -3.6 +3.2 -0.6 +7.3 +0.9 +3.3 +1.1 +14.9 +11.0 +11.9 -1.2 +28.6 +8.3 +4.5 +5.3 -0.3 +1.9 +3.6 +6.7 -3.4 -4.8 +4.0 -3.6 -1.6 +0.1 +3.0 +2.6 -0.0 +6.8 +11.7 +15.8 +3.7 +3.7 -9.6 +7.9 -5.1 -9.6 -3.4
Oropharynx +2.0 +3.6 +6.5 +10.0 +6.5 +4.9 +11.2 -1.4 +0.7 -1.1 +0.2 -1.6 +18.6 +0.1 +13.0 +9.1 +3.6 +6.9 +13.4 -3.4 +2.6 +10.7 -0.1 +1.8 -2.4 +17.6 +20.5 -8.4 -8.4 +3.6 +18.0 -7.4 -8.4 +28.7 +5.0 +4.0 +4.6 +13.5 +13.2 +7.3 +4.3 +3.9 +46.7 -3.3 +36.2 -8.4 -8.4
Other Pharynx +1.0 +15.2 +6.0 +7.5 +18.8 +7.2 +14.6 +8.3 +2.7 -10.8 -2.1 -3.4 -10.8 +1.6 +4.0 -8.2 -10.8 +13.6 -4.9 -10.8 -5.2 +3.0 +8.0 -0.6 -10.8 -4.2 -6.8 -10.8 -10.8 -5.2 -0.2 -8.8 -3.8 +15.3 -10.8 +6.6 +2.6 +4.5 -4.8 +25.4 +11.6 +6.4 -10.8 -7.4 +1.2 -10.8 -10.8
Salivary Glands +1.3 +5.9 +6.8 +5.8 +10.7 +6.7 +4.3 +0.1 +0.5 +6.6 -1.0 -0.7 -3.3 -0.6 +4.3 +0.1 -6.9 +11.6 +7.1 +13.1 -0.1 +8.9 +4.0 -0.5 -1.0 +6.2 +12.6 +1.1 -6.9 -2.0 +12.1 -5.0 -6.9 +3.6 +5.9 -2.1 +5.1 +3.4 -5.0 -4.7 +0.9 +3.0 -6.9 +3.7 -2.2 -6.9 -6.9
Sinonasal +2.2 +11.0 +0.2 +8.5 +11.9 +9.7 +9.8 +7.9 +1.3 -6.7 +1.1 -0.6 +9.3 +0.0 +0.1 +0.1 -6.7 -6.7 +1.9 -6.7 +0.5 +32.0 +5.4 +4.6 -4.7 +3.5 +3.8 -6.7 -6.7 +3.7 -5.4 -6.7 -6.7 +0.4 -2.9 +2.6 +2.4 +0.5 +54.4 -0.0 +3.3 +0.2 -6.7 -3.2 -6.7 -6.7 -6.7
Code
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Box plot
ax = axes[0]
data_discharged = []
data_deceased = []
labels = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    data_discharged.append(df_site[df_site['egresar_fallecido'] == 0]['dias_estancia'].values)
    data_deceased.append(df_site[df_site['egresar_fallecido'] == 1]['dias_estancia'].values)
    labels.append(site)

positions_discharged = np.arange(len(labels)) * 2.5
positions_deceased = positions_discharged + 1
bp1 = ax.boxplot(data_discharged, positions=positions_discharged, widths=0.8, patch_artist=True, label='Discharged')
bp2 = ax.boxplot(data_deceased, positions=positions_deceased, widths=0.8, patch_artist=True, label='Deceased')

for patch in bp1['boxes']:
    patch.set_facecolor('lightgreen')
for patch in bp2['boxes']:
    patch.set_facecolor('lightcoral')

ax.set_xticks((positions_discharged + positions_deceased) / 2)
ax.set_xticklabels(labels, rotation=30, ha='right', fontsize=9)
ax.set_ylabel('Length of Stay (days)')
ax.set_title('LOS Distribution by Site and Discharge Status')
ax.set_ylim([0, 100])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Bar plot of means
ax = axes[1]
means_discharged = []
means_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    means_discharged.append(df_site[df_site['egresar_fallecido'] == 0]['dias_estancia'].mean())
    means_deceased.append(df_site[df_site['egresar_fallecido'] == 1]['dias_estancia'].mean())

x = np.arange(len(labels))
width = 0.35
ax.bar(x - width/2, means_discharged, width, label='Discharged', color='lightgreen')
ax.bar(x + width/2, means_deceased, width, label='Deceased', color='lightcoral')
ax.set_ylabel('Mean Length of Stay (days)')
ax.set_xlabel('Cancer Site')
ax.set_title('Mean LOS by Site and Discharge Status')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=30, ha='right', fontsize=9)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '05_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-los-by-site.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 5: Figure 5. Length of Stay by Cancer Site and Discharge Status
Code
heatmap_los_impact = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = []
    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        df_without = df_site[df_site[col] == 0]
        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        los_without = df_without['dias_estancia'].mean() if len(df_without) > 0 else 0
        difference = los_with - los_without
        row.append(difference)
    heatmap_los_impact.append(row)

heatmap_los_impact_df = pd.DataFrame(heatmap_los_impact,
                                      index=sorted(df_head_neck['cancer_site_group'].unique()),
                                      columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_los_impact_df, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            cbar_kws={'label': 'LOS Difference (days)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Length of Stay Impact: WITH - WITHOUT Comorbidity (days)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '06_los_impact_heatmap.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-los-impact-heatmap.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 6: Figure 6. Heatmap: LOS Impact by Comorbidity and Cancer Site

9 5. Bivariate Analysis

Code
from scipy.stats import chi2_contingency, ranksums

# Bivariate analysis: chi-square for categorical association with mortality
all_site_names = sorted(df_head_neck['cancer_site_group'].unique())
df_bivariate = []

for site in all_site_names:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]

    if len(df_site) < 30:
        continue

    for col in comorbidity_cols:
        # Chi-square test for mortality association
        contingency = pd.crosstab(df_site[col], df_site['egresar_fallecido'])
        try:
            chi2, pval_mort, dof, expected = chi2_contingency(contingency)
        except:
            pval_mort = np.nan

        # Wilcoxon rank-sum test (Mann-Whitney U) for LOS difference
        group_with = df_site[df_site[col] == 1]['dias_estancia'].dropna()
        group_without = df_site[df_site[col] == 0]['dias_estancia'].dropna()
        try:
            if len(group_with) >= 5 and len(group_without) >= 5:
                stat, pval_los = ranksums(group_with, group_without)
            else:
                pval_los = np.nan
        except:
            pval_los = np.nan

        df_bivariate.append({
            'Cancer Site': site,
            'Comorbidity': format_name(col),
            'Chi-square p-value': pval_mort,
            'LOS p-value (Wilcoxon)': pval_los,
            'N with': len(group_with),
            'N without': len(group_without)
        })

df_bivariate = pd.DataFrame(df_bivariate)
show_table(df_bivariate, "Table 20: Bivariate Analysis by Cancer Site", index=False)

Table 20: Bivariate Analysis by Cancer Site

Cancer Site Comorbidity Chi-square p-value LOS p-value (Wilcoxon) N with N without
Hypopharynx CV Essential Hypertension 1.00 1.00 147 253
Hypopharynx CV Complicated Hypertension 1.00 1 399
Hypopharynx CV Coronary Disease 0.27 0.55 16 384
Hypopharynx CV Heart Failure 0.13 0.44 6 394
Hypopharynx CV Arrhythmias 0.00 0.63 15 385
Hypopharynx CV Cerebrovascular Disease 0.59 3 397
Hypopharynx CV Peripheral Vascular Disease 1.00 0.04 6 394
Hypopharynx MET Type1 Diabetes 1.00 1 399
Hypopharynx MET Type2 Diabetes 0.92 0.67 57 343
Hypopharynx MET Complicated Diabetes 1.00 1 399
Hypopharynx MET Obesity 1.00 0.95 6 394
Hypopharynx MET Dyslipidemia 1.00 0.82 38 362
Hypopharynx MET Metabolic Syndrome 1.00 1 399
Hypopharynx MET Thyroid Disorders 0.55 0.41 64 336
Hypopharynx RESP COPD 0.28 0.00 25 375
Hypopharynx RESP Asthma 0.89 0.26 5 395
Hypopharynx RESP Chronic Bronchitis 1.00 0 400
Hypopharynx RESP Emphysema 1.00 0.01 8 392
Hypopharynx RESP Pulmonary Fibrosis 1.00 4 396
Hypopharynx RESP Pulmonary Hypertension 1.00 0 400
Hypopharynx RESP Sleep Apnea 1.00 1 399
Hypopharynx REN Acute Renal Failure 0.99 0.05 6 394
Hypopharynx REN Chronic Renal Failure 0.39 0.04 18 382
Hypopharynx REN End Stage Renal 0.91 0.02 7 393
Hypopharynx REN Diabetic Nephropathy 1.00 0 400
Hypopharynx HEP Steatosis 1.00 4 396
Hypopharynx HEP Alcoholic Cirrhosis 0.76 4 396
Hypopharynx HEP NonAlcoholic Cirrhosis 1.00 0 400
Hypopharynx HEP Viral Hepatitis 1.00 3 397
Hypopharynx GI Reflux 1.00 0 400
Hypopharynx GI Peptic Ulcer 0.59 3 397
Hypopharynx GI IBD 1.00 0 400
Hypopharynx GI IBS 1.00 0 400
Hypopharynx NEURO Dementia 1.00 0.44 5 395
Hypopharynx NEURO Parkinson 0.39 2 398
Hypopharynx NEURO Epilepsy 1.00 0.02 7 393
Hypopharynx NEURO Stroke Sequelae 1.00 2 398
Hypopharynx PSI Major Depression 1.00 0.24 8 392
Hypopharynx PSI Bipolar 1.00 0 400
Hypopharynx PSI Anxiety 0.76 4 396
Hypopharynx SUST Alcohol 0.23 0.00 26 374
Hypopharynx SUST Tobacco 0.62 0.02 67 333
Hypopharynx SUST Opioids 1.00 0 400
Hypopharynx INF HIV 1.00 0.84 6 394
Hypopharynx INF Tuberculosis 0.39 2 398
Hypopharynx INF Hepatitis C 1.00 2 398
Hypopharynx INF Hepatitis B 1.00 0 400
Larynx CV Essential Hypertension 1.00 0.00 1202 1423
Larynx CV Complicated Hypertension 0.58 0.01 13 2612
Larynx CV Coronary Disease 0.34 0.03 152 2473
Larynx CV Heart Failure 0.00 0.00 92 2533
Larynx CV Arrhythmias 0.00 0.00 182 2443
Larynx CV Cerebrovascular Disease 0.11 0.00 77 2548
Larynx CV Peripheral Vascular Disease 0.00 0.00 81 2544
Larynx MET Type1 Diabetes 0.82 0.95 9 2616
Larynx MET Type2 Diabetes 0.12 0.77 520 2105
Larynx MET Complicated Diabetes 1.00 0.23 6 2619
Larynx MET Obesity 0.07 0.14 131 2494
Larynx MET Dyslipidemia 0.27 0.38 221 2404
Larynx MET Metabolic Syndrome 0.08 0.00 12 2613
Larynx MET Thyroid Disorders 0.74 0.54 293 2332
Larynx RESP COPD 0.05 0.00 362 2263
Larynx RESP Asthma 0.79 0.99 95 2530
Larynx RESP Chronic Bronchitis 1.00 1 2624
Larynx RESP Emphysema 0.00 0.00 46 2579
Larynx RESP Pulmonary Fibrosis 0.07 0.84 27 2598
Larynx RESP Pulmonary Hypertension 0.08 0.12 12 2613
Larynx RESP Sleep Apnea 1.00 0.71 26 2599
Larynx REN Acute Renal Failure 0.00 0.00 160 2465
Larynx REN Chronic Renal Failure 0.00 0.01 136 2489
Larynx REN End Stage Renal 0.02 0.46 29 2596
Larynx REN Diabetic Nephropathy 1.00 0.86 8 2617
Larynx HEP Steatosis 0.19 0.72 24 2601
Larynx HEP Alcoholic Cirrhosis 0.00 0.00 34 2591
Larynx HEP NonAlcoholic Cirrhosis 0.00 2 2623
Larynx HEP Viral Hepatitis 1.00 4 2621
Larynx GI Reflux 0.54 0.17 15 2610
Larynx GI Peptic Ulcer 0.00 0.00 30 2595
Larynx GI IBD 1.00 0 2625
Larynx GI IBS 0.11 1 2624
Larynx NEURO Dementia 0.58 0.44 35 2590
Larynx NEURO Parkinson 1.00 0.53 19 2606
Larynx NEURO Epilepsy 0.15 0.01 41 2584
Larynx NEURO Stroke Sequelae 0.49 0.01 55 2570
Larynx PSI Major Depression 0.35 0.01 61 2564
Larynx PSI Bipolar 0.97 0.28 7 2618
Larynx PSI Anxiety 0.33 0.00 23 2602
Larynx SUST Alcohol 0.37 0.00 119 2506
Larynx SUST Tobacco 0.14 0.00 637 1988
Larynx SUST Opioids 1.00 1 2624
Larynx INF HIV 1.00 0.14 17 2608
Larynx INF Tuberculosis 1.00 0.38 7 2618
Larynx INF Hepatitis C 1.00 3 2622
Larynx INF Hepatitis B 1.00 1 2624
Nasopharynx CV Essential Hypertension 0.63 0.21 104 264
Nasopharynx CV Complicated Hypertension 1.00 0 368
Nasopharynx CV Coronary Disease 1.00 0.71 8 360
Nasopharynx CV Heart Failure 1.00 0.30 6 362
Nasopharynx CV Arrhythmias 0.30 0.01 11 357
Nasopharynx CV Cerebrovascular Disease 0.06 0.00 6 362
Nasopharynx CV Peripheral Vascular Disease 0.09 0.01 15 353
Nasopharynx MET Type1 Diabetes 1.00 0 368
Nasopharynx MET Type2 Diabetes 1.00 0.02 46 322
Nasopharynx MET Complicated Diabetes 1.00 1 367
Nasopharynx MET Obesity 0.34 0.38 26 342
Nasopharynx MET Dyslipidemia 0.24 0.12 33 335
Nasopharynx MET Metabolic Syndrome 0.27 2 366
Nasopharynx MET Thyroid Disorders 0.82 0.21 35 333
Nasopharynx RESP COPD 1.00 3 365
Nasopharynx RESP Asthma 0.72 0.90 13 355
Nasopharynx RESP Chronic Bronchitis 1.00 0 368
Nasopharynx RESP Emphysema 0.27 2 366
Nasopharynx RESP Pulmonary Fibrosis 1.00 4 364
Nasopharynx RESP Pulmonary Hypertension 1.00 0 368
Nasopharynx RESP Sleep Apnea 1.00 1 367
Nasopharynx REN Acute Renal Failure 0.03 0.01 20 348
Nasopharynx REN Chronic Renal Failure 0.76 0.20 12 356
Nasopharynx REN End Stage Renal 1.00 2 366
Nasopharynx REN Diabetic Nephropathy 1.00 1 367
Nasopharynx HEP Steatosis 1.00 1 367
Nasopharynx HEP Alcoholic Cirrhosis 1.00 1 367
Nasopharynx HEP NonAlcoholic Cirrhosis 1.00 0 368
Nasopharynx HEP Viral Hepatitis 1.00 0 368
Nasopharynx GI Reflux 1.00 1 367
Nasopharynx GI Peptic Ulcer 1.00 1 367
Nasopharynx GI IBD 1.00 1 367
Nasopharynx GI IBS 1.00 0 368
Nasopharynx NEURO Dementia 1.00 0 368
Nasopharynx NEURO Parkinson 0.07 1 367
Nasopharynx NEURO Epilepsy 1.00 4 364
Nasopharynx NEURO Stroke Sequelae 1.00 3 365
Nasopharynx PSI Major Depression 1.00 2 366
Nasopharynx PSI Bipolar 1.00 0 368
Nasopharynx PSI Anxiety 0.67 0.97 14 354
Nasopharynx SUST Alcohol 0.63 0.03 15 353
Nasopharynx SUST Tobacco 0.18 0.41 29 339
Nasopharynx SUST Opioids 1.00 0 368
Nasopharynx INF HIV 1.00 0.94 5 363
Nasopharynx INF Tuberculosis 1.00 0 368
Nasopharynx INF Hepatitis C 1.00 0 368
Nasopharynx INF Hepatitis B 1.00 0 368
Oral Cavity CV Essential Hypertension 0.32 0.00 1316 1685
Oral Cavity CV Complicated Hypertension 0.49 0.49 15 2986
Oral Cavity CV Coronary Disease 0.25 0.00 114 2887
Oral Cavity CV Heart Failure 0.05 0.00 65 2936
Oral Cavity CV Arrhythmias 0.00 0.00 169 2832
Oral Cavity CV Cerebrovascular Disease 0.03 0.01 73 2928
Oral Cavity CV Peripheral Vascular Disease 0.00 0.00 63 2938
Oral Cavity MET Type1 Diabetes 1.00 0.90 11 2990
Oral Cavity MET Type2 Diabetes 0.90 0.14 560 2441
Oral Cavity MET Complicated Diabetes 0.76 0.40 13 2988
Oral Cavity MET Obesity 0.20 0.00 156 2845
Oral Cavity MET Dyslipidemia 0.52 0.73 343 2658
Oral Cavity MET Metabolic Syndrome 0.01 0.04 18 2983
Oral Cavity MET Thyroid Disorders 0.06 0.01 368 2633
Oral Cavity RESP COPD 0.38 0.00 154 2847
Oral Cavity RESP Asthma 0.05 0.82 53 2948
Oral Cavity RESP Chronic Bronchitis 1.00 4 2997
Oral Cavity RESP Emphysema 0.85 0.00 22 2979
Oral Cavity RESP Pulmonary Fibrosis 0.00 0.94 22 2979
Oral Cavity RESP Pulmonary Hypertension 0.12 0.16 8 2993
Oral Cavity RESP Sleep Apnea 1.00 0.00 9 2992
Oral Cavity REN Acute Renal Failure 0.00 0.00 136 2865
Oral Cavity REN Chronic Renal Failure 0.01 0.00 137 2864
Oral Cavity REN End Stage Renal 0.04 0.00 31 2970
Oral Cavity REN Diabetic Nephropathy 1.00 0.23 7 2994
Oral Cavity HEP Steatosis 0.01 0.13 41 2960
Oral Cavity HEP Alcoholic Cirrhosis 0.17 0.11 19 2982
Oral Cavity HEP NonAlcoholic Cirrhosis 1.00 0.50 9 2992
Oral Cavity HEP Viral Hepatitis 1.00 0.46 5 2996
Oral Cavity GI Reflux 0.76 0.09 13 2988
Oral Cavity GI Peptic Ulcer 0.55 0.01 16 2985
Oral Cavity GI IBD 0.57 4 2997
Oral Cavity GI IBS 1.00 2 2999
Oral Cavity NEURO Dementia 0.14 0.95 52 2949
Oral Cavity NEURO Parkinson 1.00 0.55 20 2981
Oral Cavity NEURO Epilepsy 0.78 0.04 36 2965
Oral Cavity NEURO Stroke Sequelae 0.07 0.03 57 2944
Oral Cavity PSI Major Depression 0.90 0.01 89 2912
Oral Cavity PSI Bipolar 0.89 0.69 7 2994
Oral Cavity PSI Anxiety 1.00 0.00 21 2980
Oral Cavity SUST Alcohol 0.68 0.00 143 2858
Oral Cavity SUST Tobacco 0.60 0.00 445 2556
Oral Cavity SUST Opioids 1.00 0 3001
Oral Cavity INF HIV 0.85 0.01 22 2979
Oral Cavity INF Tuberculosis 1.00 2 2999
Oral Cavity INF Hepatitis C 1.00 0 3001
Oral Cavity INF Hepatitis B 1.00 0.46 5 2996
Oropharynx CV Essential Hypertension 0.74 0.00 352 599
Oropharynx CV Complicated Hypertension 1.00 1 950
Oropharynx CV Coronary Disease 1.00 0.02 23 928
Oropharynx CV Heart Failure 0.01 0.13 10 941
Oropharynx CV Arrhythmias 0.00 0.00 30 921
Oropharynx CV Cerebrovascular Disease 0.26 0.04 22 929
Oropharynx CV Peripheral Vascular Disease 1.00 0.00 26 925
Oropharynx MET Type1 Diabetes 1.00 1 950
Oropharynx MET Type2 Diabetes 0.42 0.13 130 821
Oropharynx MET Complicated Diabetes 1.00 3 948
Oropharynx MET Obesity 1.00 0.23 40 911
Oropharynx MET Dyslipidemia 0.84 0.47 67 884
Oropharynx MET Metabolic Syndrome 0.06 1 950
Oropharynx MET Thyroid Disorders 0.26 0.58 83 868
Oropharynx RESP COPD 0.01 0.00 35 916
Oropharynx RESP Asthma 1.00 0.92 17 934
Oropharynx RESP Chronic Bronchitis 1.00 1 950
Oropharynx RESP Emphysema 0.92 0.11 10 941
Oropharynx RESP Pulmonary Fibrosis 0.08 0.03 7 944
Oropharynx RESP Pulmonary Hypertension 1.00 2 949
Oropharynx RESP Sleep Apnea 1.00 3 948
Oropharynx REN Acute Renal Failure 0.00 0.00 29 922
Oropharynx REN Chronic Renal Failure 0.06 0.30 24 927
Oropharynx REN End Stage Renal 0.79 0.56 6 945
Oropharynx REN Diabetic Nephropathy 1.00 2 949
Oropharynx HEP Steatosis 0.00 0.00 14 937
Oropharynx HEP Alcoholic Cirrhosis 0.95 0.06 8 943
Oropharynx HEP NonAlcoholic Cirrhosis 1.00 0 951
Oropharynx HEP Viral Hepatitis 1.00 0 951
Oropharynx GI Reflux 1.00 2 949
Oropharynx GI Peptic Ulcer 1.00 3 948
Oropharynx GI IBD 1.00 1 950
Oropharynx GI IBS 1.00 0 951
Oropharynx NEURO Dementia 1.00 4 947
Oropharynx NEURO Parkinson 1.00 0.12 6 945
Oropharynx NEURO Epilepsy 1.00 0.06 21 930
Oropharynx NEURO Stroke Sequelae 0.69 0.08 19 932
Oropharynx PSI Major Depression 0.26 0.01 22 929
Oropharynx PSI Bipolar 0.04 0.03 6 945
Oropharynx PSI Anxiety 0.86 0.92 11 940
Oropharynx SUST Alcohol 0.92 0.03 46 905
Oropharynx SUST Tobacco 0.75 0.00 126 825
Oropharynx SUST Opioids 1.00 1 950
Oropharynx INF HIV 0.74 0.05 20 931
Oropharynx INF Tuberculosis 0.24 2 949
Oropharynx INF Hepatitis C 1.00 0 951
Oropharynx INF Hepatitis B 1.00 0 951
Other Pharynx CV Essential Hypertension 0.16 0.62 94 187
Other Pharynx CV Complicated Hypertension 0.20 1 280
Other Pharynx CV Coronary Disease 0.85 0.13 6 275
Other Pharynx CV Heart Failure 0.35 0.13 7 274
Other Pharynx CV Arrhythmias 0.08 0.00 14 267
Other Pharynx CV Cerebrovascular Disease 0.65 0.11 10 271
Other Pharynx CV Peripheral Vascular Disease 0.75 0.00 11 270
Other Pharynx MET Type1 Diabetes 0.51 2 279
Other Pharynx MET Type2 Diabetes 0.96 0.92 43 238
Other Pharynx MET Complicated Diabetes 1.00 0 281
Other Pharynx MET Obesity 1.00 0.57 8 273
Other Pharynx MET Dyslipidemia 0.29 0.91 17 264
Other Pharynx MET Metabolic Syndrome 1.00 0 281
Other Pharynx MET Thyroid Disorders 0.40 0.53 26 255
Other Pharynx RESP COPD 0.37 0.07 14 267
Other Pharynx RESP Asthma 1.00 3 278
Other Pharynx RESP Chronic Bronchitis 1.00 0 281
Other Pharynx RESP Emphysema 0.91 4 277
Other Pharynx RESP Pulmonary Fibrosis 1.00 3 278
Other Pharynx RESP Pulmonary Hypertension 1.00 0 281
Other Pharynx RESP Sleep Apnea 0.61 0.92 9 272
Other Pharynx REN Acute Renal Failure 0.00 0.12 10 271
Other Pharynx REN Chronic Renal Failure 0.83 0.03 12 269
Other Pharynx REN End Stage Renal 0.91 4 277
Other Pharynx REN Diabetic Nephropathy 1.00 0 281
Other Pharynx HEP Steatosis 1.00 3 278
Other Pharynx HEP Alcoholic Cirrhosis 0.20 1 280
Other Pharynx HEP NonAlcoholic Cirrhosis 1.00 0 281
Other Pharynx HEP Viral Hepatitis 1.00 0 281
Other Pharynx GI Reflux 1.00 3 278
Other Pharynx GI Peptic Ulcer 1.00 0.52 5 276
Other Pharynx GI IBD 1.00 1 280
Other Pharynx GI IBS 1.00 1 280
Other Pharynx NEURO Dementia 1.00 2 279
Other Pharynx NEURO Parkinson 1.00 0 281
Other Pharynx NEURO Epilepsy 0.65 0.13 10 271
Other Pharynx NEURO Stroke Sequelae 1.00 0.25 5 276
Other Pharynx PSI Major Depression 0.25 0.71 6 275
Other Pharynx PSI Bipolar 0.20 1 280
Other Pharynx PSI Anxiety 1.00 2 279
Other Pharynx SUST Alcohol 1.00 0.12 5 276
Other Pharynx SUST Tobacco 0.46 0.00 31 250
Other Pharynx SUST Opioids 1.00 0 281
Other Pharynx INF HIV 1.00 4 277
Other Pharynx INF Tuberculosis 1.00 1 280
Other Pharynx INF Hepatitis C 1.00 0 281
Other Pharynx INF Hepatitis B 1.00 0 281
Salivary Glands CV Essential Hypertension 0.58 0.00 337 561
Salivary Glands CV Complicated Hypertension 0.88 0.02 8 890
Salivary Glands CV Coronary Disease 0.06 0.03 27 871
Salivary Glands CV Heart Failure 0.01 0.00 20 878
Salivary Glands CV Arrhythmias 0.00 0.00 42 856
Salivary Glands CV Cerebrovascular Disease 0.04 0.00 15 883
Salivary Glands CV Peripheral Vascular Disease 1.00 0.58 8 890
Salivary Glands MET Type1 Diabetes 1.00 3 895
Salivary Glands MET Type2 Diabetes 0.52 0.16 180 718
Salivary Glands MET Complicated Diabetes 0.20 2 896
Salivary Glands MET Obesity 0.22 0.56 72 826
Salivary Glands MET Dyslipidemia 0.52 0.54 94 804
Salivary Glands MET Metabolic Syndrome 1.00 0.60 6 892
Salivary Glands MET Thyroid Disorders 1.00 0.60 102 796
Salivary Glands RESP COPD 0.76 0.01 37 861
Salivary Glands RESP Asthma 0.32 0.30 34 864
Salivary Glands RESP Chronic Bronchitis 1.00 0 898
Salivary Glands RESP Emphysema 0.20 2 896
Salivary Glands RESP Pulmonary Fibrosis 1.00 0.01 7 891
Salivary Glands RESP Pulmonary Hypertension 1.00 2 896
Salivary Glands RESP Sleep Apnea 1.00 0.85 7 891
Salivary Glands REN Acute Renal Failure 0.00 0.00 38 860
Salivary Glands REN Chronic Renal Failure 0.24 0.00 38 860
Salivary Glands REN End Stage Renal 0.81 0.50 7 891
Salivary Glands REN Diabetic Nephropathy 1.00 1 897
Salivary Glands HEP Steatosis 0.81 0.02 7 891
Salivary Glands HEP Alcoholic Cirrhosis 0.20 2 896
Salivary Glands HEP NonAlcoholic Cirrhosis 1.00 1 897
Salivary Glands HEP Viral Hepatitis 1.00 0 898
Salivary Glands GI Reflux 1.00 2 896
Salivary Glands GI Peptic Ulcer 1.00 4 894
Salivary Glands GI IBD 1.00 2 896
Salivary Glands GI IBS 1.00 0 898
Salivary Glands NEURO Dementia 0.81 0.31 7 891
Salivary Glands NEURO Parkinson 0.93 0.32 11 887
Salivary Glands NEURO Epilepsy 0.71 0.53 16 882
Salivary Glands NEURO Stroke Sequelae 1.00 0.01 12 886
Salivary Glands PSI Major Depression 1.00 0.19 28 870
Salivary Glands PSI Bipolar 1.00 4 894
Salivary Glands PSI Anxiety 1.00 0.06 9 889
Salivary Glands SUST Alcohol 0.53 0.98 18 880
Salivary Glands SUST Tobacco 1.00 0.07 83 815
Salivary Glands SUST Opioids 1.00 0 898
Salivary Glands INF HIV 1.00 0.27 5 893
Salivary Glands INF Tuberculosis 1.00 0.62 5 893
Salivary Glands INF Hepatitis C 1.00 0 898
Salivary Glands INF Hepatitis B 1.00 0 898
Sinonasal CV Essential Hypertension 0.74 0.06 293 594
Sinonasal CV Complicated Hypertension 0.57 0.11 5 882
Sinonasal CV Coronary Disease 0.99 0.81 11 876
Sinonasal CV Heart Failure 1.00 0.00 12 875
Sinonasal CV Arrhythmias 1.00 0.00 15 872
Sinonasal CV Cerebrovascular Disease 0.00 0.03 19 868
Sinonasal CV Peripheral Vascular Disease 1.00 0.01 7 880
Sinonasal MET Type1 Diabetes 0.45 4 883
Sinonasal MET Type2 Diabetes 0.40 0.02 142 745
Sinonasal MET Complicated Diabetes 1.00 2 885
Sinonasal MET Obesity 0.57 0.05 57 830
Sinonasal MET Dyslipidemia 1.00 0.93 65 822
Sinonasal MET Metabolic Syndrome 0.17 2 885
Sinonasal MET Thyroid Disorders 1.00 0.46 75 812
Sinonasal RESP COPD 1.00 0.80 16 871
Sinonasal RESP Asthma 0.57 0.73 23 864
Sinonasal RESP Chronic Bronchitis 1.00 0 887
Sinonasal RESP Emphysema 1.00 0 887
Sinonasal RESP Pulmonary Fibrosis 1.00 0.56 5 882
Sinonasal RESP Pulmonary Hypertension 1.00 0 887
Sinonasal RESP Sleep Apnea 1.00 0.42 6 881
Sinonasal REN Acute Renal Failure 0.32 0.00 15 872
Sinonasal REN Chronic Renal Failure 1.00 0.05 18 869
Sinonasal REN End Stage Renal 1.00 0.03 7 880
Sinonasal REN Diabetic Nephropathy 1.00 2 885
Sinonasal HEP Steatosis 0.83 0.09 8 879
Sinonasal HEP Alcoholic Cirrhosis 1.00 2 885
Sinonasal HEP NonAlcoholic Cirrhosis 1.00 0 887
Sinonasal HEP Viral Hepatitis 1.00 0 887
Sinonasal GI Reflux 1.00 3 884
Sinonasal GI Peptic Ulcer 1.00 3 884
Sinonasal GI IBD 1.00 0 887
Sinonasal GI IBS 1.00 0 887
Sinonasal NEURO Dementia 1.00 0.18 8 879
Sinonasal NEURO Parkinson 1.00 0.89 6 881
Sinonasal NEURO Epilepsy 1.00 0.10 14 873
Sinonasal NEURO Stroke Sequelae 0.95 0.95 10 877
Sinonasal PSI Major Depression 1.00 0.51 18 869
Sinonasal PSI Bipolar 1.00 1 886
Sinonasal PSI Anxiety 0.99 0.87 11 876
Sinonasal SUST Alcohol 1.00 0.08 29 858
Sinonasal SUST Tobacco 0.74 0.29 84 803
Sinonasal SUST Opioids 1.00 0 887
Sinonasal INF HIV 1.00 4 883
Sinonasal INF Tuberculosis 1.00 0 887
Sinonasal INF Hepatitis C 1.00 0 887
Sinonasal INF Hepatitis B 1.00 0 887
Code
# Build dictionaries of significant comorbidities per site (p < 0.05)
display_to_col = {format_name(c): c for c in comorbidity_cols}

sig_comorbidities_mortality = {}
sig_comorbidities_los = {}

for site in all_site_names:
    site_biv = df_bivariate[df_bivariate['Cancer Site'] == site]
    sig_mort = []
    sig_los = []
    for _, row in site_biv.iterrows():
        comorb_display = row['Comorbidity']
        col_name = display_to_col.get(comorb_display, None)
        if col_name is None:
            continue
        if not np.isnan(row['Chi-square p-value']) and row['Chi-square p-value'] < 0.05:
            sig_mort.append(col_name)
        if not np.isnan(row['LOS p-value (Wilcoxon)']) and row['LOS p-value (Wilcoxon)'] < 0.05:
            sig_los.append(col_name)
    sig_comorbidities_mortality[site] = sig_mort
    sig_comorbidities_los[site] = sig_los

sig_summary = []
for site in all_site_names:
    sig_summary.append({
        'Cancer Site': site,
        'Sig. Comorbidities (Mortality)': len(sig_comorbidities_mortality.get(site, [])),
        'Sig. Comorbidities (LOS)': len(sig_comorbidities_los.get(site, [])),
        'Mortality List': ', '.join([format_name(c) for c in sig_comorbidities_mortality.get(site, [])]) or '—',
        'LOS List': ', '.join([format_name(c) for c in sig_comorbidities_los.get(site, [])]) or '—'
    })

sig_summary_df = pd.DataFrame(sig_summary)
show_table(sig_summary_df, "Table 21: Significant Comorbidities by Cancer Site (p < 0.05 in Bivariate Analysis)", index=False)

Table 21: Significant Comorbidities by Cancer Site (p < 0.05 in Bivariate Analysis)

Cancer Site Sig. Comorbidities (Mortality) Sig. Comorbidities (LOS) Mortality List LOS List
Hypopharynx 1 9 CV Arrhythmias CV Peripheral Vascular Disease, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, NEURO Epilepsy, SUST Alcohol, SUST Tobacco
Larynx 11 20 CV Heart Failure, CV Arrhythmias, CV Peripheral Vascular Disease, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, HEP Alcoholic Cirrhosis, HEP NonAlcoholic Cirrhosis, GI Peptic Ulcer CV Essential Hypertension, CV Complicated Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Metabolic Syndrome, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, HEP Alcoholic Cirrhosis, GI Peptic Ulcer, NEURO Epilepsy, NEURO Stroke Sequelae, PSI Major Depression, PSI Anxiety, SUST Alcohol, SUST Tobacco
Nasopharynx 1 6 REN Acute Renal Failure CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Type2 Diabetes, REN Acute Renal Failure, SUST Alcohol
Oral Cavity 11 23 CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Metabolic Syndrome, RESP Asthma, RESP Pulmonary Fibrosis, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, HEP Steatosis CV Essential Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Obesity, MET Metabolic Syndrome, MET Thyroid Disorders, RESP COPD, RESP Emphysema, RESP Sleep Apnea, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, GI Peptic Ulcer, NEURO Epilepsy, NEURO Stroke Sequelae, PSI Major Depression, PSI Anxiety, SUST Alcohol, SUST Tobacco, INF HIV
Oropharynx 6 14 CV Heart Failure, CV Arrhythmias, RESP COPD, REN Acute Renal Failure, HEP Steatosis, PSI Bipolar CV Essential Hypertension, CV Coronary Disease, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, RESP COPD, RESP Pulmonary Fibrosis, REN Acute Renal Failure, HEP Steatosis, PSI Major Depression, PSI Bipolar, SUST Alcohol, SUST Tobacco, INF HIV
Other Pharynx 1 4 REN Acute Renal Failure CV Arrhythmias, CV Peripheral Vascular Disease, REN Chronic Renal Failure, SUST Tobacco
Salivary Glands 4 12 CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, REN Acute Renal Failure CV Essential Hypertension, CV Complicated Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, RESP COPD, RESP Pulmonary Fibrosis, REN Acute Renal Failure, REN Chronic Renal Failure, HEP Steatosis, NEURO Stroke Sequelae
Sinonasal 1 7 CV Cerebrovascular Disease CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Type2 Diabetes, REN Acute Renal Failure, REN End Stage Renal

10 6. GEE Mixed-Effects Logistic Regression for Mortality (All Cancer Sites)

Code
from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Binomial, Poisson, NegativeBinomial
from statsmodels.genmod.cov_struct import Exchangeable, Independence
import statsmodels.api as sm
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, brier_score_loss,
                             roc_auc_score, roc_curve, confusion_matrix,
                             classification_report)

# Use ALL cancer site groups
all_sites = sorted(df_head_neck['cancer_site_group'].unique())

# Store GEE model results
gee_mortality_models = {}
gee_mortality_results = {}

def find_optimal_threshold(y_true, y_proba):
    """Find optimal classification threshold using Youden's J statistic"""
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    j_scores = tpr - fpr
    best_idx = np.argmax(j_scores)
    return thresholds[best_idx]

def safe_fmt(val, fmt='.4f'):
    """Format a value safely, returning '—' for NaN/Inf"""
    if isinstance(val, float) and (np.isnan(val) or np.isinf(val)):
        return '—'
    try:
        return f'{val:{fmt}}'
    except:
        return '—'

def build_gee_mortality_model(df_site, site_name):
    """Build GEE logistic regression (mixed-effects approximation) for mortality.
    Uses hospital_id as cluster group, year_factor as fixed effect,
    and only significant comorbidities from bivariate analysis."""
    df_model = df_site.copy()

    # Code categorical variables
    df_model['SEXO_coded'] = (df_model['SEXO'] == 'Female').astype(int)

    # Create service dummies
    service_dummies = pd.get_dummies(df_model['service_category'], prefix='service', drop_first=True)
    for col in service_dummies.columns:
        df_model[col] = service_dummies[col]
    service_dummy_cols = list(service_dummies.columns)

    # Create year dummies (year_factor as fixed effect)
    year_dummies = pd.get_dummies(df_model['year_factor'], prefix='year', drop_first=True)
    for col in year_dummies.columns:
        df_model[col] = year_dummies[col]
    year_dummy_cols = list(year_dummies.columns)

    # Ensure numeric
    df_model['edad'] = pd.to_numeric(df_model['edad'], errors='coerce')

    # Base predictors: age, sex, service, year dummies
    predictor_cols = ['edad', 'SEXO_coded'] + service_dummy_cols + year_dummy_cols

    # Add severity and mortality risk as CATEGORICAL dummies (Low=ref, Medium, High)
    for col in ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce')
            df_model[col] = df_model[col].fillna(1)
            df_model[col] = df_model[col].clip(lower=1).astype(int)
            prefix = 'severity' if 'SEVERIDAD' in col else 'mort_risk'
            col_dummies = pd.get_dummies(df_model[col], prefix=prefix, drop_first=True)
            for dc in col_dummies.columns:
                df_model[dc] = col_dummies[dc]
                predictor_cols.append(dc)

    # Add ONLY significant comorbidities from bivariate analysis (p < 0.05)
    sig_comorbs = sig_comorbidities_mortality.get(site_name, [])
    used_comorbs = []
    for col in sig_comorbs:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce').fillna(0).astype(int)
            if df_model[col].sum() >= 3:
                predictor_cols.append(col)
                used_comorbs.append(col)

    # Prepare groups (hospital_id for clustering)
    df_model['_group'] = df_model['hospital_id']

    # Remove missing values
    cols_for_model = ['egresar_fallecido', '_group'] + predictor_cols
    df_gee = df_model[cols_for_model].dropna()

    # Need at least 2 clusters and sufficient data
    n_clusters = df_gee['_group'].nunique()
    if len(df_gee) < 50 or n_clusters < 2:
        return None, None, [], []

    # Build design matrix
    y = df_gee['egresar_fallecido'].astype(float)
    X = df_gee[predictor_cols].astype(float)
    X = sm.add_constant(X)
    groups = df_gee['_group']

    # Fit GEE with exchangeable correlation structure
    try:
        gee_model = GEE(y, X, groups=groups,
                        family=Binomial(),
                        cov_struct=Exchangeable())
        gee_result = gee_model.fit(maxiter=200)
    except Exception:
        # Fallback to independence correlation
        try:
            gee_model = GEE(y, X, groups=groups,
                            family=Binomial(),
                            cov_struct=Independence())
            gee_result = gee_model.fit(maxiter=200)
        except Exception:
            return None, None, [], []

    # Check for extreme coefficients and remove problematic variables
    max_retries = 5
    current_predictors = list(predictor_cols)
    for _ in range(max_retries):
        params = gee_result.params
        bad_vars = []
        for i, var in enumerate(current_predictors):
            idx = i + 1  # +1 for constant
            if idx < len(params) and (np.isinf(params.iloc[idx]) or abs(params.iloc[idx]) > 50):
                bad_vars.append(var)
        if not bad_vars:
            break
        current_predictors = [p for p in current_predictors if p not in bad_vars]
        if len(current_predictors) < 2:
            return None, None, [], []
        X = df_gee[current_predictors].astype(float)
        X = sm.add_constant(X)
        try:
            gee_model = GEE(y, X, groups=groups, family=Binomial(), cov_struct=Exchangeable())
            gee_result = gee_model.fit(maxiter=200)
        except:
            try:
                gee_model = GEE(y, X, groups=groups, family=Binomial(), cov_struct=Independence())
                gee_result = gee_model.fit(maxiter=200)
            except:
                return None, None, [], []

    return gee_result, df_gee, current_predictors, used_comorbs

def get_gee_fit_metrics(gee_result, y, family_type='binomial'):
    """Extract model fit metrics from a GEE result object."""
    metrics = {}
    metrics['N_obs'] = int(gee_result.nobs)
    metrics['N_clusters'] = int(len(set(gee_result.model.groups)))
    metrics['N_params'] = int(len(gee_result.params))

    # QIC / QICu (quasi-information criteria)
    try:
        qic, qicu = gee_result.qic()
        metrics['QIC'] = qic
        metrics['QICu'] = qicu
    except Exception:
        metrics['QIC'] = np.nan
        metrics['QICu'] = np.nan

    # Quasi log-likelihood (scale parameter)
    try:
        metrics['Scale'] = float(gee_result.scale)
    except Exception:
        metrics['Scale'] = np.nan

    # Pseudo R² (McFadden-style for logistic, deviance-based for count)
    try:
        y_pred = gee_result.fittedvalues
        if family_type == 'binomial':
            y_arr = np.asarray(y, dtype=float)
            p_hat = np.clip(np.asarray(y_pred, dtype=float), 1e-10, 1 - 1e-10)
            ll_model = np.sum(y_arr * np.log(p_hat) + (1 - y_arr) * np.log(1 - p_hat))
            p_null = np.mean(y_arr)
            ll_null = np.sum(y_arr * np.log(p_null) + (1 - y_arr) * np.log(1 - p_null))
            metrics['Log-Lik (model)'] = ll_model
            metrics['Log-Lik (null)'] = ll_null
            metrics['Pseudo R² (McFadden)'] = 1 - ll_model / ll_null if ll_null != 0 else np.nan
        else:
            y_arr = np.asarray(y, dtype=float)
            y_pred_arr = np.asarray(y_pred, dtype=float)
            ss_res = np.sum((y_arr - y_pred_arr) ** 2)
            ss_tot = np.sum((y_arr - np.mean(y_arr)) ** 2)
            metrics['Pseudo R²'] = 1 - ss_res / ss_tot if ss_tot > 0 else np.nan
    except Exception:
        if family_type == 'binomial':
            metrics['Log-Lik (model)'] = np.nan
            metrics['Log-Lik (null)'] = np.nan
            metrics['Pseudo R² (McFadden)'] = np.nan
        else:
            metrics['Pseudo R²'] = np.nan

    # Correlation structure
    try:
        cov_name = gee_result.cov_struct.__class__.__name__
        metrics['Correlation'] = cov_name
    except Exception:
        metrics['Correlation'] = '—'

    return metrics

# Fit GEE mortality models for EACH cancer site
for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()

    n_events = df_site['egresar_fallecido'].sum()
    n_total = len(df_site)
    if n_events < 10 or n_total < 50:
        continue

    gee_result, df_gee, used_cols, used_comorbs = build_gee_mortality_model(df_site, site)

    if gee_result is not None:
        gee_mortality_models[site] = (gee_result, df_gee, used_cols)
        gee_mortality_results[site] = gee_result

10.1 6.1 GEE Logistic Regression Results - OR [95% CI]

Code
# Table: GEE Logistic Regression - OR [95% CI] by Cancer Site
gee_or_table = {}

for site in all_sites:
    if site not in gee_mortality_results:
        continue

    result = gee_mortality_results[site]
    or_ci_list = {}

    param_names = list(result.params.index)
    ci = result.conf_int()

    def _fmt_or(v):
        if np.isinf(v) or abs(v) > 1e6:
            return 'Inf' if v > 0 else '-Inf'
        return f"{v:.3f}"

    for var in param_names:
        coef = result.params[var]
        or_val = np.exp(coef)
        ci_lower = np.exp(ci.loc[var, 0] if var in ci.index else ci.iloc[param_names.index(var), 0])
        ci_upper = np.exp(ci.loc[var, 1] if var in ci.index else ci.iloc[param_names.index(var), 1])
        pval = result.pvalues[var] if var in result.pvalues.index else result.pvalues.iloc[param_names.index(var)]

        if np.isnan(or_val) or np.isnan(pval):
            or_ci_list[format_name(var)] = '—'
            continue

        sig_stars = ''
        if pval < 0.001:
            sig_stars = '***'
        elif pval < 0.01:
            sig_stars = '**'
        elif pval < 0.05:
            sig_stars = '*'

        or_ci_str = f"{_fmt_or(or_val)} [{_fmt_or(ci_lower)}, {_fmt_or(ci_upper)}]{sig_stars}"
        or_ci_list[format_name(var)] = or_ci_str

    gee_or_table[site] = or_ci_list

gee_or_df = pd.DataFrame(gee_or_table).fillna('—')
show_table(gee_or_df, "Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site\n(Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis)\n* p<0.05, ** p<0.01, *** p<0.001", index=True)

Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site (Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis) * p<0.05, ** p<0.01, *** p<0.001

Hypopharynx Larynx Nasopharynx Oral Cavity Oropharynx Other Pharynx Salivary Glands Sinonasal
Intercept 0.000 [0.000, 0.025]*** 0.004 [0.001, 0.014]*** 0.002 [0.000, 0.007]*** 0.012 [0.000, 0.329]** 0.001 [0.000, 0.172]** 0.015 [0.001, 0.206]** 0.045 [0.007, 0.290]**
Age 1.032 [0.989, 1.078] 1.027 [1.013, 1.041]*** 1.018 [1.002, 1.035]* 1.022 [0.982, 1.064] 1.052 [1.005, 1.101]* 0.990 [0.969, 1.011] 1.002 [0.985, 1.018]
Sex (Female) 1.076 [0.447, 2.590] 0.982 [0.580, 1.662] 1.014 [0.701, 1.468] 0.927 [0.408, 2.108] 0.863 [0.353, 2.109] 0.485 [0.234, 1.005] 0.887 [0.440, 1.784]
Service ITU 2.796 [0.357, 21.901] 0.451 [0.209, 0.973]* 0.583 [0.259, 1.311] 0.463 [0.131, 1.641] 0.489 [0.053, 4.518] 0.969 [0.150, 6.276] 0.223 [0.043, 1.165]
Service MS 0.812 [0.172, 3.835] 0.819 [0.491, 1.366] 0.884 [0.524, 1.492] 0.367 [0.150, 0.899]* 0.670 [0.167, 2.681] 0.521 [0.164, 1.660] 0.284 [0.084, 0.957]*
Service Other 1.449 [0.417, 5.041] 0.786 [0.496, 1.244] 1.183 [0.655, 2.136] 0.276 [0.118, 0.648]** 0.443 [0.140, 1.405] 0.714 [0.218, 2.337] 0.335 [0.131, 0.852]*
Year 2020 3.417 [0.564, 20.696] 0.585 [0.347, 0.987]* 0.553 [0.304, 1.004] 0.830 [0.305, 2.257] 0.622 [0.100, 3.866] 1.476 [0.463, 4.707] 1.494 [0.430, 5.189]
Year 2021 5.211 [0.774, 35.084] 0.693 [0.380, 1.263] 1.251 [0.669, 2.340] 0.454 [0.142, 1.450] 2.348 [0.575, 9.596] 1.180 [0.399, 3.488] 0.594 [0.164, 2.152]
Year 2022 4.763 [0.770, 29.484] 0.392 [0.235, 0.654]*** 0.935 [0.514, 1.699] 0.383 [0.136, 1.084] 0.596 [0.156, 2.284] 1.001 [0.306, 3.278] 1.095 [0.290, 4.133]
Year 2023 2.285 [0.314, 16.609] 0.460 [0.275, 0.769]** 0.734 [0.411, 1.312] 0.309 [0.129, 0.744]** 1.883 [0.476, 7.445] 0.470 [0.106, 2.083] 0.444 [0.136, 1.451]
Year 2024 3.120 [0.391, 24.884] 0.332 [0.206, 0.534]*** 0.934 [0.514, 1.697] 0.232 [0.076, 0.711]* 1.430 [0.305, 6.697] 1.361 [0.393, 4.709] 0.720 [0.213, 2.435]
Severity Medium 1.532 [0.170, 13.757] 1.099 [0.597, 2.024] 2.686 [1.031, 7.000]* 3.020 [1.179, 7.733]* 0.329 [0.068, 1.588] 1.370 [0.103, 18.236] 3.409 [1.298, 8.952]*
Severity High 3.470 [0.545, 22.076] 3.300 [1.795, 6.067]*** 11.927 [4.378, 32.496]*** 8.455 [2.361, 30.286]** 0.481 [0.073, 3.168] 5.837 [0.353, 96.645] 12.691 [3.231, 49.846]***
Mortality Risk Medium 1.412 [0.300, 6.649] 1.296 [0.777, 2.161] 1.412 [0.644, 3.099] 0.769 [0.301, 1.961] 6.715 [0.469, 96.243] 1.272 [0.101, 16.017] 0.416 [0.150, 1.154]
Mortality Risk High 8.937 [2.100, 38.033]** 7.286 [4.452, 11.923]*** 7.506 [3.505, 16.075]*** 4.280 [1.524, 12.019]** 90.776 [5.197, 1585.558]** 16.732 [0.945, 296.340] 2.286 [0.568, 9.196]
CV Arrhythmias 2.703 [0.461, 15.855] 2.508 [1.640, 3.835]*** 1.878 [1.105, 3.193]* 2.270 [0.440, 11.709] 1.805 [0.597, 5.461]
CV Heart Failure 1.551 [0.765, 3.148] 0.680 [0.357, 1.297] 4.117 [0.670, 25.303] 0.755 [0.147, 3.868]
CV Peripheral Vascular Disease 1.384 [0.700, 2.736] 1.882 [0.948, 3.738]
RESP COPD 0.865 [0.516, 1.449] 1.142 [0.532, 2.453]
RESP Emphysema 1.350 [0.628, 2.905]
REN Acute Renal Failure 1.157 [0.745, 1.798] 0.465 [0.256, 0.843]* 1.225 [0.384, 3.910] 0.700 [0.134, 3.664] 0.574 [0.222, 1.487]
REN Chronic Renal Failure 0.801 [0.476, 1.349] 0.877 [0.377, 2.039]
REN End Stage Renal 1.210 [0.440, 3.328] 1.107 [0.233, 5.269]
HEP Alcoholic Cirrhosis 4.357 [2.123, 8.944]***
GI Peptic Ulcer 2.035 [0.554, 7.479]
CV Cerebrovascular Disease 1.038 [0.471, 2.287] 3.283 [0.465, 23.189] 1.566 [0.492, 4.982]
MET Metabolic Syndrome 2.562 [0.724, 9.068]
RESP Asthma 3.147 [1.268, 7.806]*
RESP Pulmonary Fibrosis 2.534 [0.609, 10.534]
HEP Steatosis 1.926 [1.002, 3.701]* 6.214 [1.237, 31.206]*
PSI Bipolar 1.445 [0.193, 10.842]
Code
# Table 10: Performance Metrics - GEE Logistic Models (ALL sites)

gee_perf = []
for site in all_sites:
    if site not in gee_mortality_results:
        continue

    result = gee_mortality_results[site]
    df_gee = gee_mortality_models[site][1]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()
    event_rate = n_events / n_total * 100 if n_total > 0 else 0

    try:
        y_pred_proba = result.fittedvalues
        y_actual = df_gee['egresar_fallecido'].values

        auc = roc_auc_score(y_actual, y_pred_proba) if len(np.unique(y_actual)) > 1 else np.nan
        brier = brier_score_loss(y_actual, y_pred_proba)

        opt_threshold = find_optimal_threshold(y_actual, y_pred_proba)
        y_pred = (y_pred_proba >= opt_threshold).astype(int)

        accuracy = accuracy_score(y_actual, y_pred)
        sensitivity = recall_score(y_actual, y_pred, zero_division=0)
        specificity = recall_score(y_actual, y_pred, pos_label=0, zero_division=0)
        ppv = precision_score(y_actual, y_pred, zero_division=0)
        tn = ((1 - y_actual) * (1 - y_pred)).sum()
        fn = (y_actual * (1 - y_pred)).sum()
        npv = tn / (tn + fn) if (tn + fn) > 0 else 0
        f1 = f1_score(y_actual, y_pred, zero_division=0)
        mcc = matthews_corrcoef(y_actual, y_pred)
        youdens_j = sensitivity + specificity - 1

    except Exception:
        accuracy = sensitivity = specificity = ppv = npv = f1 = mcc = brier = auc = np.nan
        opt_threshold = np.nan
        youdens_j = np.nan

    n_sig_comorbs = len(sig_comorbidities_mortality.get(site, []))
    fit_metrics = get_gee_fit_metrics(result, df_gee['egresar_fallecido'], family_type='binomial')

    gee_perf.append({
        'Cancer Site': site,
        'N obs': fit_metrics['N_obs'],
        'N clusters': fit_metrics['N_clusters'],
        'N params': fit_metrics['N_params'],
        'Events': f'{n_events} ({event_rate:.1f}%)',
        'Sig. Comorb.': n_sig_comorbs,
        'Correlation': fit_metrics['Correlation'],
        'QIC': safe_fmt(fit_metrics['QIC'], '.1f'),
        'QICu': safe_fmt(fit_metrics['QICu'], '.1f'),
        'Log-Lik': safe_fmt(fit_metrics['Log-Lik (model)'], '.1f'),
        'Pseudo R²': safe_fmt(fit_metrics['Pseudo R² (McFadden)']),
        'Scale': safe_fmt(fit_metrics['Scale']),
        'Threshold': safe_fmt(opt_threshold),
        'AUC': safe_fmt(auc),
        'Brier': safe_fmt(brier),
        'Sensitivity': safe_fmt(sensitivity),
        'Specificity': safe_fmt(specificity),
        'PPV': safe_fmt(ppv),
        'NPV': safe_fmt(npv),
        'Bal. Accuracy': safe_fmt((sensitivity + specificity) / 2 if not np.isnan(sensitivity) else np.nan),
        'F1': safe_fmt(f1),
        'MCC': safe_fmt(mcc),
        "Youden's J": safe_fmt(youdens_j),
    })

gee_perf_df = pd.DataFrame(gee_perf)
show_table(gee_perf_df, "Table 10: Performance & Fit Metrics - GEE Logistic Regression for Mortality\n(QIC = Quasi-Information Criterion, Pseudo R² = McFadden)", index=False)

Table 10: Performance & Fit Metrics - GEE Logistic Regression for Mortality (QIC = Quasi-Information Criterion, Pseudo R² = McFadden)

Cancer Site N obs N clusters N params Events Sig. Comorb. Correlation QIC QICu Log-Lik Pseudo R² Scale Threshold AUC Brier Sensitivity Specificity PPV NPV Bal. Accuracy F1 MCC Youden's J
Hypopharynx 400 49 16 33 (8.2%) 1 Exchangeable 174.8 202.4 -85.2 0.2523 1.0000 0.0556 0.8623 0.0624 0.9394 0.7193 0.2313 0.9925 0.8294 0.3713 0.3840 0.6587
Larynx 2625 64 25 198 (7.5%) 11 Exchangeable 1041.6 1084.5 -517.2 0.2633 1.0000 0.0915 0.8590 0.0554 0.7929 0.7973 0.2419 0.9793 0.7951 0.3707 0.3613 0.5902
Nasopharynx 368 50 16 23 (6.2%) 1 Exchangeable 1.0000
Oral Cavity 3001 68 26 176 (5.9%) 11 Exchangeable 915.3 959.3 -453.7 0.3228 1.0000 0.1031 0.8971 0.0426 0.7898 0.8683 0.2720 0.9851 0.8290 0.4047 0.4114 0.6581
Oropharynx 951 59 21 55 (5.8%) 6 Exchangeable 298.6 335.1 -146.5 0.3027 1.0000 0.1115 0.8876 0.0430 0.7636 0.8884 0.2958 0.9839 0.8260 0.4264 0.4271 0.6520
Other Pharynx 281 51 16 30 (10.7%) 1 Exchangeable 128.4 155.8 -61.9 0.3514 1.0000 0.2047 0.8981 0.0653 0.8000 0.8805 0.4444 0.9736 0.8402 0.5714 0.5333 0.6805
Salivary Glands 898 59 19 46 (5.1%) 4 Exchangeable 229.6 262.1 -112.1 0.3826 1.0000 0.0214 0.9350 0.0378 0.9783 0.8075 0.2153 0.9985 0.8929 0.3529 0.4099 0.7858
Sinonasal 887 58 16 41 (4.6%) 1 Exchangeable 235.2 263.7 -115.8 0.3025 1.0000 0.0453 0.8984 0.0360 0.8780 0.8298 0.2000 0.9929 0.8539 0.3258 0.3695 0.7078
Code
fig, ax = plt.subplots(figsize=(10, 7))
colors_roc = plt.cm.tab10(np.linspace(0, 1, len(gee_mortality_results)))

for idx, (site, result) in enumerate(gee_mortality_results.items()):
    try:
        y_pred_proba = result.fittedvalues
        df_gee = gee_mortality_models[site][1]
        y_actual = df_gee['egresar_fallecido'].values
        fpr, tpr, _ = roc_curve(y_actual, y_pred_proba)
        auc_val = roc_auc_score(y_actual, y_pred_proba)
        ax.plot(fpr, tpr, label=f'{site} (AUC={auc_val:.3f})', linewidth=2, color=colors_roc[idx])
    except:
        pass

ax.plot([0, 1], [0, 1], 'k--', label='Random', linewidth=1)
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('ROC Curves - GEE Logistic Regression by Cancer Site', fontsize=12, fontweight='bold')
ax.legend(loc='lower right', fontsize=9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'roc_gee_logistic_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-roc-gee-logistic.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 8: Figure 8. ROC Curves - GEE Logistic Regression Models by Cancer Site

10.2 6.2 Forest Plot — Odds Ratios for In-Hospital Mortality

Code
def _extract_or_ci(result, drop_intercept=True, cap=(0.01, 100.0)):
    """Return DataFrame with var, OR, lo, hi, p — finite & capped for plotting."""
    params = result.params
    ci = result.conf_int()
    pvals = result.pvalues
    rows = []
    for var in params.index:
        if drop_intercept and var.lower() in {'const', 'intercept'}:
            continue
        coef = params[var]
        lo, hi = (ci.loc[var, 0], ci.loc[var, 1]) if var in ci.index else (np.nan, np.nan)
        p = pvals[var] if var in pvals.index else np.nan
        or_v, or_lo, or_hi = np.exp(coef), np.exp(lo), np.exp(hi)
        # Skip non-finite
        if not np.isfinite(or_v) or not np.isfinite(or_lo) or not np.isfinite(or_hi):
            continue
        # Cap extreme CIs for display only
        or_v_d  = float(np.clip(or_v,  cap[0], cap[1]))
        or_lo_d = float(np.clip(or_lo, cap[0], cap[1]))
        or_hi_d = float(np.clip(or_hi, cap[0], cap[1]))
        rows.append({
            'var': format_name(var),
            'OR': or_v_d, 'lo': or_lo_d, 'hi': or_hi_d,
            'p': p,
        })
    return pd.DataFrame(rows)

# Build per-site dataframes
forest_data_logit = {site: _extract_or_ci(res) for site, res in gee_mortality_results.items()}
forest_data_logit = {s: d for s, d in forest_data_logit.items() if not d.empty}

n_sites = len(forest_data_logit)
n_cols = 3
n_rows = int(np.ceil(n_sites / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows), squeeze=False)

for ax_idx, (site, df_f) in enumerate(forest_data_logit.items()):
    r, c = divmod(ax_idx, n_cols)
    ax = axes[r][c]
    df_f = df_f.sort_values('OR').reset_index(drop=True)
    y = np.arange(len(df_f))
    sig_mask = (df_f['p'] < 0.05)
    # Colors: red OR>1 sig, blue OR<1 sig, gray ns
    colors = np.where(~sig_mask, '#9e9e9e',
              np.where(df_f['OR'] > 1, '#c0392b', '#1f6f8b'))
    # Error bars
    ax.hlines(y, df_f['lo'], df_f['hi'], color=colors, linewidth=1.5, alpha=0.9)
    ax.scatter(df_f['OR'], y,
               s=55, marker='o',
               facecolors=np.where(sig_mask, colors, 'white'),
               edgecolors=colors, linewidths=1.5, zorder=3)
    ax.axvline(1, color='#444', linestyle='--', linewidth=0.8, alpha=0.7)
    ax.set_xscale('log')
    ax.set_xlim(0.05, 50)
    ax.set_yticks(y)
    ax.set_yticklabels(df_f['var'], fontsize=8)
    ax.set_xlabel('Odds Ratio (log scale)', fontsize=9)
    ax.set_title(f'{site}  (n predictors = {len(df_f)})', fontsize=10, fontweight='bold')
    ax.grid(axis='x', linestyle=':', alpha=0.4)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Hide empty subplots
for ax_idx in range(n_sites, n_rows * n_cols):
    r, c = divmod(ax_idx, n_cols)
    axes[r][c].set_visible(False)

# Legend
from matplotlib.lines import Line2D
legend_elems = [
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='#c0392b', markeredgecolor='#c0392b', label='OR > 1, p < 0.05 (harmful)'),
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='#1f6f8b', markeredgecolor='#1f6f8b', label='OR < 1, p < 0.05 (protective)'),
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='white', markeredgecolor='#9e9e9e', label='Not significant (p ≥ 0.05)'),
    Line2D([0], [0], color='#444', linestyle='--', linewidth=1, label='OR = 1 (null)'),
]
fig.legend(handles=legend_elems, loc='lower center', ncol=4, frameon=False,
           bbox_to_anchor=(0.5, -0.01), fontsize=10)

fig.suptitle('Forest plot — GEE Logistic Mortality Models: Odds Ratios [95% CI] by Cancer Site',
             fontsize=13, fontweight='bold', y=1.00)
plt.tight_layout(rect=[0, 0.02, 1, 0.98])
plt.savefig(os.path.join(OUTPUT_DIR, 'forest_gee_logistic_mortality.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-forest-gee-logistic.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 9: Figure 9. Forest plot — Odds Ratios (95% CI) for in-hospital mortality from the GEE mixed-effects logistic models, by cancer site. Reference line at OR=1. Filled markers = p<0.05.

11 7. GEE Mixed-Effects Regression for Length of Stay (All Cancer Sites)

Code
# Store GEE LOS model results
gee_los_models = {}
gee_los_results = {}

def build_gee_los_model(df_site, site_name):
    """Build GEE Negative Binomial regression (mixed-effects approximation) for LOS.
    Uses hospital_id as cluster group, year_factor as fixed effect,
    and only significant comorbidities from bivariate analysis."""
    df_model = df_site[df_site['dias_estancia'] > 0].copy()

    df_model['SEXO_coded'] = (df_model['SEXO'] == 'Female').astype(int)

    service_dummies = pd.get_dummies(df_model['service_category'], prefix='service', drop_first=True)
    for col in service_dummies.columns:
        df_model[col] = service_dummies[col]
    service_dummy_cols = list(service_dummies.columns)

    year_dummies = pd.get_dummies(df_model['year_factor'], prefix='year', drop_first=True)
    for col in year_dummies.columns:
        df_model[col] = year_dummies[col]
    year_dummy_cols = list(year_dummies.columns)

    df_model['edad'] = pd.to_numeric(df_model['edad'], errors='coerce')

    predictor_cols = ['edad', 'SEXO_coded'] + service_dummy_cols + year_dummy_cols

    # Add severity and mortality risk as CATEGORICAL dummies
    for col in ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce')
            df_model[col] = df_model[col].fillna(1)
            df_model[col] = df_model[col].clip(lower=1).astype(int)
            prefix = 'severity' if 'SEVERIDAD' in col else 'mort_risk'
            col_dummies = pd.get_dummies(df_model[col], prefix=prefix, drop_first=True)
            for dc in col_dummies.columns:
                df_model[dc] = col_dummies[dc]
                predictor_cols.append(dc)

    # Add ONLY significant comorbidities from bivariate Wilcoxon test (p < 0.05)
    sig_comorbs = sig_comorbidities_los.get(site_name, [])
    used_comorbs = []
    for col in sig_comorbs:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce').fillna(0).astype(int)
            if df_model[col].sum() >= 3:
                predictor_cols.append(col)
                used_comorbs.append(col)

    df_model['_group'] = df_model['hospital_id']

    cols_for_model = ['dias_estancia', '_group'] + predictor_cols
    df_gee = df_model[cols_for_model].dropna()

    n_clusters = df_gee['_group'].nunique()
    if len(df_gee) < 50 or n_clusters < 2:
        return None, None, [], []

    y = df_gee['dias_estancia'].astype(float)
    X = df_gee[predictor_cols].astype(float)
    X = sm.add_constant(X)
    groups = df_gee['_group']

    # Fit GEE with Negative Binomial family and exchangeable correlation
    try:
        gee_model = GEE(y, X, groups=groups,
                        family=NegativeBinomial(alpha=1.0),
                        cov_struct=Exchangeable())
        gee_result = gee_model.fit(maxiter=200)
    except Exception:
        try:
            gee_model = GEE(y, X, groups=groups,
                            family=NegativeBinomial(alpha=1.0),
                            cov_struct=Independence())
            gee_result = gee_model.fit(maxiter=200)
        except Exception:
            try:
                gee_model = GEE(y, X, groups=groups,
                                family=Poisson(),
                                cov_struct=Exchangeable())
                gee_result = gee_model.fit(maxiter=200)
            except Exception:
                return None, None, [], []

    # Remove problematic variables
    max_retries = 5
    current_predictors = list(predictor_cols)
    for _ in range(max_retries):
        params = gee_result.params
        bad_vars = []
        for i, var in enumerate(current_predictors):
            idx = i + 1
            if idx < len(params) and (np.isinf(params.iloc[idx]) or abs(params.iloc[idx]) > 50):
                bad_vars.append(var)
        if not bad_vars:
            break
        current_predictors = [p for p in current_predictors if p not in bad_vars]
        if len(current_predictors) < 2:
            return None, None, [], []
        X = df_gee[current_predictors].astype(float)
        X = sm.add_constant(X)
        try:
            gee_model = GEE(y, X, groups=groups, family=NegativeBinomial(alpha=1.0), cov_struct=Exchangeable())
            gee_result = gee_model.fit(maxiter=200)
        except:
            try:
                gee_model = GEE(y, X, groups=groups, family=NegativeBinomial(alpha=1.0), cov_struct=Independence())
                gee_result = gee_model.fit(maxiter=200)
            except:
                try:
                    gee_model = GEE(y, X, groups=groups, family=Poisson(), cov_struct=Exchangeable())
                    gee_result = gee_model.fit(maxiter=200)
                except:
                    return None, None, [], []

    return gee_result, df_gee, current_predictors, used_comorbs

# Fit GEE LOS models for EACH cancer site
for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()

    if len(df_site) < 50:
        continue

    gee_result, df_gee, used_cols, used_comorbs = build_gee_los_model(df_site, site)

    if gee_result is not None:
        gee_los_models[site] = (gee_result, df_gee, used_cols)
        gee_los_results[site] = gee_result

11.1 7.1 GEE Regression Results - IRR [95% CI]

Code
# Table 11: GEE Regression - IRR [95% CI] by Cancer Site
gee_irr_table = {}

for site in all_sites:
    if site not in gee_los_results:
        continue

    result = gee_los_results[site]
    irr_ci_list = {}

    param_names = list(result.params.index)
    ci = result.conf_int()

    def _fmt_irr(v):
        if np.isinf(v) or abs(v) > 1e6:
            return 'Inf' if v > 0 else '-Inf'
        return f"{v:.3f}"

    for var in param_names:
        coef = result.params[var]
        irr_val = np.exp(coef)
        ci_lower = np.exp(ci.loc[var, 0] if var in ci.index else ci.iloc[param_names.index(var), 0])
        ci_upper = np.exp(ci.loc[var, 1] if var in ci.index else ci.iloc[param_names.index(var), 1])
        pval = result.pvalues[var] if var in result.pvalues.index else result.pvalues.iloc[param_names.index(var)]

        if np.isnan(irr_val) or np.isnan(pval):
            irr_ci_list[format_name(var)] = '—'
            continue

        sig_stars = ''
        if pval < 0.001:
            sig_stars = '***'
        elif pval < 0.01:
            sig_stars = '**'
        elif pval < 0.05:
            sig_stars = '*'

        irr_ci_str = f"{_fmt_irr(irr_val)} [{_fmt_irr(ci_lower)}, {_fmt_irr(ci_upper)}]{sig_stars}"
        irr_ci_list[format_name(var)] = irr_ci_str

    gee_irr_table[site] = irr_ci_list

gee_irr_df = pd.DataFrame(gee_irr_table).fillna('—')
show_table(gee_irr_df, "Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site\n(Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities)\n* p<0.05, ** p<0.01, *** p<0.001", index=True)

Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site (Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities) * p<0.05, ** p<0.01, *** p<0.001

Hypopharynx Larynx Nasopharynx Oral Cavity Oropharynx Other Pharynx Salivary Glands Sinonasal
Intercept 2.530 [1.002, 6.390]* 9.526 [6.003, 15.115]*** 6.734 [3.032, 14.953]*** 5.628 [3.551, 8.920]*** 5.804 [2.892, 11.649]*** 7.964 [4.455, 14.234]*** 3.986 [2.629, 6.045]*** 4.623 [2.875, 7.433]***
Age 1.009 [1.000, 1.019] 0.998 [0.993, 1.003] 1.005 [0.998, 1.013] 1.001 [0.997, 1.004] 1.001 [0.992, 1.009] 1.002 [0.997, 1.007] 1.002 [0.995, 1.008] 0.999 [0.994, 1.003]
Sex (Female) 0.828 [0.644, 1.065] 0.781 [0.684, 0.893]*** 0.760 [0.546, 1.057] 0.856 [0.784, 0.934]*** 0.953 [0.744, 1.222] 1.288 [0.952, 1.743] 0.819 [0.687, 0.975]* 0.848 [0.741, 0.970]*
Service ITU 2.636 [1.467, 4.734]** 1.069 [0.799, 1.429] 0.554 [0.342, 0.897]* 1.014 [0.753, 1.366] 0.850 [0.504, 1.433] 0.924 [0.286, 2.985] 0.988 [0.727, 1.342] 1.556 [0.887, 2.728]
Service MS 1.475 [0.887, 2.451] 0.731 [0.578, 0.924]** 1.132 [0.695, 1.845] 0.884 [0.645, 1.211] 0.663 [0.489, 0.899]** 0.645 [0.442, 0.943]* 1.053 [0.808, 1.371] 1.315 [0.922, 1.876]
Service Other 1.368 [0.825, 2.269] 0.707 [0.544, 0.919]** 0.985 [0.666, 1.459] 0.773 [0.550, 1.087] 0.842 [0.558, 1.269] 0.611 [0.419, 0.890]* 1.039 [0.811, 1.333] 1.168 [0.810, 1.684]
Year 2020 0.913 [0.618, 1.349] 1.199 [1.008, 1.425]* 0.878 [0.338, 2.282] 1.218 [0.894, 1.659] 1.122 [0.878, 1.434] 1.114 [0.725, 1.714] 0.987 [0.741, 1.316] 1.023 [0.784, 1.333]
Year 2021 1.111 [0.886, 1.395] 1.129 [0.939, 1.359] 0.530 [0.324, 0.865]* 1.096 [0.854, 1.407] 1.020 [0.797, 1.305] 0.949 [0.605, 1.487] 1.171 [0.814, 1.684] 1.035 [0.783, 1.367]
Year 2022 1.179 [0.735, 1.893] 1.248 [0.971, 1.604] 0.712 [0.404, 1.253] 1.047 [0.813, 1.347] 1.090 [0.840, 1.415] 0.610 [0.414, 0.900]* 0.822 [0.603, 1.121] 1.043 [0.829, 1.313]
Year 2023 1.053 [0.711, 1.558] 0.928 [0.776, 1.110] 0.611 [0.342, 1.092] 1.010 [0.743, 1.373] 1.097 [0.855, 1.407] 0.381 [0.279, 0.521]*** 0.866 [0.592, 1.267] 0.897 [0.653, 1.234]
Year 2024 0.695 [0.480, 1.006] 1.105 [0.912, 1.339] 0.843 [0.535, 1.327] 0.977 [0.802, 1.190] 1.191 [0.896, 1.582] 0.890 [0.614, 1.290] 0.896 [0.615, 1.306] 0.890 [0.690, 1.148]
Severity Medium 1.394 [1.120, 1.734]** 1.898 [1.466, 2.457]*** 1.410 [0.995, 1.997] 1.862 [1.555, 2.228]*** 1.679 [1.314, 2.146]*** 1.743 [1.241, 2.448]** 1.446 [1.121, 1.866]** 1.682 [1.286, 2.201]***
Severity High 1.982 [1.420, 2.766]*** 2.259 [1.732, 2.947]*** 3.194 [2.005, 5.087]*** 3.003 [2.408, 3.745]*** 2.435 [1.839, 3.225]*** 2.350 [1.641, 3.364]*** 3.575 [2.434, 5.250]*** 3.174 [2.028, 4.969]***
Mortality Risk Medium 1.153 [0.853, 1.559] 0.868 [0.770, 0.978]* 0.929 [0.587, 1.470] 1.113 [0.930, 1.333] 0.891 [0.732, 1.085] 0.850 [0.645, 1.120] 1.195 [0.963, 1.485] 0.931 [0.735, 1.180]
Mortality Risk High 1.463 [1.038, 2.062]* 1.306 [1.109, 1.539]** 0.905 [0.514, 1.591] 1.152 [0.877, 1.512] 1.033 [0.775, 1.376] 1.112 [0.731, 1.691] 1.254 [0.928, 1.695] 1.030 [0.721, 1.472]
CV Peripheral Vascular Disease 2.125 [0.977, 4.623] 1.069 [0.830, 1.378] 1.107 [0.613, 1.998] 1.371 [1.064, 1.768]* 1.361 [0.877, 2.112] 1.706 [0.990, 2.940] 0.995 [0.595, 1.663]
RESP COPD 1.070 [0.627, 1.825] 1.076 [0.911, 1.270] 1.078 [0.859, 1.354] 1.713 [1.009, 2.908]* 1.016 [0.638, 1.616]
RESP Emphysema 2.468 [1.125, 5.416]* 1.396 [1.120, 1.741]** 1.097 [0.597, 2.016]
REN Acute Renal Failure 0.989 [0.486, 2.014] 0.897 [0.719, 1.120] 0.667 [0.356, 1.248] 0.884 [0.690, 1.134] 1.179 [0.726, 1.916] 0.837 [0.523, 1.341] 2.194 [1.302, 3.698]**
REN Chronic Renal Failure 1.003 [0.596, 1.688] 1.044 [0.867, 1.256] 1.138 [0.921, 1.406] 1.549 [0.858, 2.794] 0.909 [0.631, 1.310]
REN End Stage Renal 0.870 [0.526, 1.440] 0.966 [0.615, 1.518] 1.025 [0.632, 1.664]
NEURO Epilepsy 2.212 [1.453, 3.366]*** 1.064 [0.748, 1.513] 0.949 [0.662, 1.362]
SUST Alcohol 1.378 [1.010, 1.882]* 1.223 [0.938, 1.593] 0.941 [0.405, 2.187] 1.218 [0.961, 1.544] 1.209 [0.946, 1.546]
SUST Tobacco 1.158 [0.778, 1.723] 1.176 [1.055, 1.312]** 1.153 [0.994, 1.338] 1.116 [0.872, 1.427] 1.745 [1.194, 2.550]**
CV Essential Hypertension 1.012 [0.913, 1.121] 0.933 [0.861, 1.011] 1.054 [0.933, 1.190] 1.045 [0.910, 1.200]
CV Complicated Hypertension 1.421 [1.026, 1.969]* 1.231 [0.805, 1.882]
CV Coronary Disease 0.909 [0.764, 1.081] 1.110 [0.839, 1.468] 1.229 [0.816, 1.851] 0.833 [0.487, 1.426]
CV Heart Failure 0.941 [0.675, 1.310] 0.727 [0.563, 0.938]* 0.935 [0.498, 1.753] 1.395 [0.878, 2.216]
CV Arrhythmias 1.104 [0.934, 1.305] 0.986 [0.636, 1.528] 1.099 [0.922, 1.311] 1.240 [0.881, 1.745] 2.644 [1.821, 3.838]*** 1.236 [0.802, 1.906] 1.563 [1.129, 2.163]**
CV Cerebrovascular Disease 1.311 [0.812, 2.116] 1.964 [0.821, 4.699] 0.900 [0.491, 1.647] 1.093 [0.637, 1.876] 1.972 [0.630, 6.169] 0.829 [0.562, 1.223]
MET Metabolic Syndrome 2.376 [1.386, 4.076]** 1.285 [0.750, 2.201]
HEP Alcoholic Cirrhosis 1.176 [0.806, 1.716]
GI Peptic Ulcer 1.693 [1.345, 2.130]*** 0.899 [0.632, 1.279]
NEURO Stroke Sequelae 0.855 [0.484, 1.509] 1.008 [0.520, 1.954] 0.579 [0.169, 1.984]
PSI Major Depression 1.374 [1.121, 1.684]** 1.383 [1.012, 1.888]* 1.804 [1.078, 3.019]*
PSI Anxiety 1.595 [1.007, 2.528]* 2.083 [1.335, 3.252]**
MET Type2 Diabetes 1.222 [0.759, 1.966] 1.110 [0.924, 1.332]
MET Obesity 1.264 [1.061, 1.507]**
MET Thyroid Disorders 1.029 [0.897, 1.181]
RESP Sleep Apnea 2.220 [1.306, 3.772]**
INF HIV 1.481 [0.786, 2.793] 0.521 [0.221, 1.230]
RESP Pulmonary Fibrosis 1.157 [0.629, 2.127] 0.939 [0.523, 1.687]
HEP Steatosis 1.687 [0.787, 3.618] 2.245 [0.851, 5.924]
PSI Bipolar 1.135 [0.551, 2.337]
Code
# Table 12: Performance Metrics - GEE LOS Models
gee_los_perf = []
for site in all_sites:
    if site not in gee_los_results:
        continue

    result = gee_los_results[site]
    df_gee = gee_los_models[site][1]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)

    try:
        y_pred = result.fittedvalues
        y_actual = df_gee['dias_estancia'].values

        rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
        mae = np.mean(np.abs(y_actual - y_pred))
        ss_res = np.sum((y_actual - y_pred) ** 2)
        ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
        pseudo_r2 = 1 - ss_res / ss_tot if ss_tot > 0 else np.nan

    except Exception:
        rmse = mae = pseudo_r2 = np.nan

    n_sig_comorbs = len(sig_comorbidities_los.get(site, []))
    fit_metrics = get_gee_fit_metrics(result, df_gee['dias_estancia'], family_type='count')

    gee_los_perf.append({
        'Cancer Site': site,
        'N obs': fit_metrics['N_obs'],
        'N clusters': fit_metrics['N_clusters'],
        'N params': fit_metrics['N_params'],
        'Mean LOS': f'{df_site["dias_estancia"].mean():.1f}',
        'Median LOS': f'{df_site["dias_estancia"].median():.0f}',
        'Sig. Comorb.': n_sig_comorbs,
        'Correlation': fit_metrics['Correlation'],
        'QIC': safe_fmt(fit_metrics['QIC'], '.1f'),
        'QICu': safe_fmt(fit_metrics['QICu'], '.1f'),
        'Pseudo R²': safe_fmt(fit_metrics['Pseudo R²']),
        'Scale': safe_fmt(fit_metrics['Scale']),
        'RMSE': f'{rmse:.2f}' if not np.isnan(rmse) else '—',
        'MAE': f'{mae:.2f}' if not np.isnan(mae) else '—',
    })

gee_los_perf_df = pd.DataFrame(gee_los_perf)
show_table(gee_los_perf_df, "Table 12: Performance & Fit Metrics - GEE Regression Models for LOS\n(QIC = Quasi-Information Criterion, Negative Binomial/Poisson family)", index=False)

Table 12: Performance & Fit Metrics - GEE Regression Models for LOS (QIC = Quasi-Information Criterion, Negative Binomial/Poisson family)

Cancer Site N obs N clusters N params Mean LOS Median LOS Sig. Comorb. Correlation QIC QICu Pseudo R² Scale RMSE MAE
Hypopharynx 388 49 24 13.7 7 9 Exchangeable 23454.4 384.7 0.2763 1.0000 16.26 10.26
Larynx 2398 64 35 14.5 7 20 Exchangeable 56825.0 2587.5 0.1611 1.0000 20.50 12.44
Nasopharynx 304 49 21 9.0 4 6 Exchangeable 16628.3 330.6 0.1315 1.0000 18.14 9.21
Oral Cavity 2575 67 38 9.6 4 23 Exchangeable 38797.2 2530.5 0.1196 1.0000 16.14 8.95
Oropharynx 822 59 29 8.4 4 14 Exchangeable 21087.4 767.1 0.1652 1.0000 13.02 7.56
Other Pharynx 252 48 19 10.8 5 4 Exchangeable 14079.0 235.0 0.1606 1.0000 15.03 9.22
Salivary Glands 800 58 27 6.9 4 12 Exchangeable 11993.7 615.4 0.1378 1.0000 11.20 5.84
Sinonasal 727 55 22 6.7 3 7 Exchangeable 10252.5 577.6 0.2372 1.0000 10.48 5.93

11.2 7.1 Forest Plot — Incidence Rate Ratios for Length of Stay

Code
# Reuses _extract_or_ci defined in §6.2 — coefficients exponentiate to IRRs for log-link NB models.
forest_data_los = {site: _extract_or_ci(res) for site, res in gee_los_results.items()}
forest_data_los = {s: d for s, d in forest_data_los.items() if not d.empty}

n_sites = len(forest_data_los)
n_cols = 3
n_rows = int(np.ceil(n_sites / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows), squeeze=False)

for ax_idx, (site, df_f) in enumerate(forest_data_los.items()):
    r, c = divmod(ax_idx, n_cols)
    ax = axes[r][c]
    df_f = df_f.sort_values('OR').reset_index(drop=True)  # OR field carries IRR here
    y = np.arange(len(df_f))
    sig_mask = (df_f['p'] < 0.05)
    # IRR > 1 means longer LOS (worse), IRR < 1 means shorter
    colors = np.where(~sig_mask, '#9e9e9e',
              np.where(df_f['OR'] > 1, '#b8860b', '#2e7d32'))
    ax.hlines(y, df_f['lo'], df_f['hi'], color=colors, linewidth=1.5, alpha=0.9)
    ax.scatter(df_f['OR'], y,
               s=55, marker='s',
               facecolors=np.where(sig_mask, colors, 'white'),
               edgecolors=colors, linewidths=1.5, zorder=3)
    ax.axvline(1, color='#444', linestyle='--', linewidth=0.8, alpha=0.7)
    ax.set_xscale('log')
    ax.set_xlim(0.3, 8)
    ax.set_yticks(y)
    ax.set_yticklabels(df_f['var'], fontsize=8)
    ax.set_xlabel('Incidence Rate Ratio (log scale)', fontsize=9)
    ax.set_title(f'{site}  (n predictors = {len(df_f)})', fontsize=10, fontweight='bold')
    ax.grid(axis='x', linestyle=':', alpha=0.4)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

for ax_idx in range(n_sites, n_rows * n_cols):
    r, c = divmod(ax_idx, n_cols)
    axes[r][c].set_visible(False)

from matplotlib.lines import Line2D
legend_elems = [
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='#b8860b', markeredgecolor='#b8860b', label='IRR > 1, p < 0.05 (longer stay)'),
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='#2e7d32', markeredgecolor='#2e7d32', label='IRR < 1, p < 0.05 (shorter stay)'),
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='white', markeredgecolor='#9e9e9e', label='Not significant (p ≥ 0.05)'),
    Line2D([0], [0], color='#444', linestyle='--', linewidth=1, label='IRR = 1 (null)'),
]
fig.legend(handles=legend_elems, loc='lower center', ncol=4, frameon=False,
           bbox_to_anchor=(0.5, -0.01), fontsize=10)

fig.suptitle('Forest plot — GEE Negative-Binomial LOS Models: IRR [95% CI] by Cancer Site',
             fontsize=13, fontweight='bold', y=1.00)
plt.tight_layout(rect=[0, 0.02, 1, 0.98])
plt.savefig(os.path.join(OUTPUT_DIR, 'forest_gee_nb_los.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-forest-gee-los.png", dpi=300, bbox_inches="tight")
plt.show()
Figure 10: Figure 10. Forest plot — Incidence Rate Ratios (95% CI) for length of stay from the GEE Negative Binomial mixed-effects models, by cancer site. Reference line at IRR=1. Filled markers = p<0.05.

12 8. Machine Learning Models for Mortality Prediction (Per Cancer Site, Multiple Class Balancing)

Machine learning models for per-site mortality prediction using multiple class balancing techniques.

Code
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression as LR_sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, roc_curve, confusion_matrix as cm_func,
    accuracy_score, f1_score, precision_score, recall_score,
    brier_score_loss, matthews_corrcoef
)

from collections import OrderedDict
try:
    from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTETomek
    IMBLEARN_AVAILABLE = True
except ImportError:
    IMBLEARN_AVAILABLE = False

# Feature columns: age, sex, service dummies, ALL comorbidities
ml_comorbidity_cols = [c for c in df_head_neck.columns if c in comorbidity_cols]

ALL_BALANCING_NAMES = ['None (Original)', 'SMOTE', 'ADASYN', 'SMOTETomek',
                       'Random Oversampling', 'Random Undersampling']

def get_balancing_methods(y_train):
    """Return ALL class balancing methods, adapting k_neighbors to minority class size."""
    methods = OrderedDict()
    methods['None (Original)'] = None

    if not IMBLEARN_AVAILABLE:
        for name in ALL_BALANCING_NAMES[1:]:
            methods[name] = 'imblearn not installed'
        return methods

    n_minority = int(y_train.sum())
    k = min(5, n_minority - 1) if n_minority > 1 else 0

    if k >= 1:
        methods['SMOTE'] = SMOTE(random_state=42, k_neighbors=k)
        methods['ADASYN'] = ADASYN(random_state=42, n_neighbors=min(k, n_minority - 1))
        methods['SMOTETomek'] = SMOTETomek(random_state=42, smote=SMOTE(random_state=42, k_neighbors=k))
    else:
        methods['SMOTE'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
        methods['ADASYN'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
        methods['SMOTETomek'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
    methods['Random Oversampling'] = RandomOverSampler(random_state=42)
    methods['Random Undersampling'] = RandomUnderSampler(random_state=42)

    return methods

def _failed_balancing_entry(reason):
    """Return a placeholder result dict for a failed balancing technique."""
    return {
        'model': None,
        'y_pred': None,
        'y_proba': None,
        'y_test': None,
        'opt_threshold': np.nan,
        'cm': None,
        'failed': True,
        'fail_reason': reason,
        'metrics': {
            'AUC': np.nan, 'Bal. Accuracy': np.nan, 'Sensitivity': np.nan,
            'Specificity': np.nan, 'PPV': np.nan, 'NPV': np.nan,
            'F1': np.nan, 'MCC': np.nan, 'Brier': np.nan,
            "Youden's J": np.nan, 'Threshold': np.nan,
            'Train Size': '—', 'Minority %': '—'
        }
    }

def run_ml_mortality_multi_balancing(df_site, site_name):
    """Train ML classifiers with multiple class balancing techniques + optimal threshold"""
    df_ml = df_site.copy()
    df_ml['sex_female'] = (df_ml['SEXO'] == 'Female').astype(int)
    df_ml['edad'] = pd.to_numeric(df_ml['edad'], errors='coerce')

    svc_dummies = pd.get_dummies(df_ml['service_category'], prefix='service', drop_first=True)
    for c in svc_dummies.columns:
        df_ml[c] = svc_dummies[c]

    feature_names = ['edad', 'sex_female'] + list(svc_dummies.columns) + ml_comorbidity_cols
    target = 'egresar_fallecido'

    df_ml = df_ml[feature_names + [target]].dropna()

    n_pos = int(df_ml[target].sum())
    n_neg = int((df_ml[target] == 0).sum())

    if len(df_ml) < 30 or n_pos < 5 or n_neg < 5:
        return None, None, None

    X = df_ml[feature_names].values
    y = df_ml[target].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    try:
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42, stratify=y
        )
    except ValueError:
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42
        )

    balancing_methods = get_balancing_methods(y_train)

    base_model_class = GradientBoostingClassifier
    base_model_params = {'n_estimators': 150, 'random_state': 42, 'max_depth': 4}

    balancing_results = OrderedDict()

    for bal_name, balancer in balancing_methods.items():
        if isinstance(balancer, str):
            balancing_results[bal_name] = _failed_balancing_entry(balancer)
            continue

        try:
            if balancer is not None:
                X_res, y_res = balancer.fit_resample(X_train, y_train)
            else:
                X_res, y_res = X_train, y_train

            model = base_model_class(**base_model_params)
            model.fit(X_res, y_res)
            y_proba = model.predict_proba(X_test)[:, 1]

            fpr_t, tpr_t, thresholds_t = roc_curve(y_test, y_proba)
            j_scores = tpr_t - fpr_t
            best_idx = np.argmax(j_scores)
            opt_threshold = thresholds_t[best_idx]
            y_pred = (y_proba >= opt_threshold).astype(int)

            tn, fp, fn, tp = cm_func(y_test, y_pred).ravel()
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0
            ppv_val = tp / (tp + fp) if (tp + fp) > 0 else 0
            npv_val = tn / (tn + fn) if (tn + fn) > 0 else 0

            balancing_results[bal_name] = {
                'model': model,
                'y_pred': y_pred,
                'y_proba': y_proba,
                'y_test': y_test,
                'opt_threshold': opt_threshold,
                'cm': cm_func(y_test, y_pred),
                'failed': False,
                'metrics': {
                    'AUC': roc_auc_score(y_test, y_proba) if len(np.unique(y_test)) > 1 else 0,
                    'Bal. Accuracy': (sens + spec) / 2,
                    'Sensitivity': sens,
                    'Specificity': spec,
                    'PPV': ppv_val,
                    'NPV': npv_val,
                    'F1': f1_score(y_test, y_pred, zero_division=0),
                    'MCC': matthews_corrcoef(y_test, y_pred),
                    'Brier': brier_score_loss(y_test, y_proba),
                    "Youden's J": sens + spec - 1,
                    'Threshold': opt_threshold,
                    'Train Size': len(y_res),
                    'Minority %': f'{y_res.sum()/len(y_res)*100:.1f}'
                }
            }
        except Exception as e:
            balancing_results[bal_name] = _failed_balancing_entry(str(e))

    successful_results = {k: v for k, v in balancing_results.items() if not v.get('failed', False)}
    if not successful_results:
        return balancing_results, None, None

    best_balancing = max(successful_results, key=lambda k: successful_results[k]['metrics']['AUC'])
    best_balancer = balancing_methods.get(best_balancing)

    if best_balancer is not None:
        try:
            X_best, y_best = best_balancer.fit_resample(X_train, y_train)
        except:
            X_best, y_best = X_train, y_train
    else:
        X_best, y_best = X_train, y_train

    all_models = {
        'Logistic Regression': LR_sklearn(max_iter=1000, random_state=42, class_weight='balanced'),
        'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1, class_weight='balanced'),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=150, random_state=42, max_depth=4),
        'SVM': SVC(kernel='rbf', probability=True, random_state=42, class_weight='balanced'),
        'MLP': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42)
    }

    model_results = {}
    for name, model in all_models.items():
        try:
            model.fit(X_best, y_best)
            y_proba = model.predict_proba(X_test)[:, 1]

            fpr_t, tpr_t, thresholds_t = roc_curve(y_test, y_proba)
            j_scores = tpr_t - fpr_t
            best_idx = np.argmax(j_scores)
            opt_threshold = thresholds_t[best_idx]
            y_pred = (y_proba >= opt_threshold).astype(int)

            tn, fp, fn, tp = cm_func(y_test, y_pred).ravel()
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0
            ppv_val = tp / (tp + fp) if (tp + fp) > 0 else 0
            npv_val = tn / (tn + fn) if (tn + fn) > 0 else 0

            model_results[name] = {
                'model': model,
                'y_pred': y_pred,
                'y_proba': y_proba,
                'y_test': y_test,
                'opt_threshold': opt_threshold,
                'cm': cm_func(y_test, y_pred),
                'metrics': {
                    'AUC': roc_auc_score(y_test, y_proba) if len(np.unique(y_test)) > 1 else 0,
                    'Bal. Accuracy': (sens + spec) / 2,
                    'Sensitivity': sens,
                    'Specificity': spec,
                    'PPV': ppv_val,
                    'NPV': npv_val,
                    'F1': f1_score(y_test, y_pred, zero_division=0),
                    'MCC': matthews_corrcoef(y_test, y_pred),
                    'Brier': brier_score_loss(y_test, y_proba),
                    "Youden's J": sens + spec - 1,
                    'Threshold': opt_threshold
                }
            }
        except:
            pass

    return balancing_results, model_results, (feature_names, X_test, y_test, best_balancing)

# Run ML mortality models for EACH cancer site
ml_balancing_by_site = {}
ml_models_by_site = {}
ml_meta_by_site = {}

for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()
    bal_results, model_results, meta = run_ml_mortality_multi_balancing(df_site, site)
    if bal_results is not None and len(bal_results) > 0:
        ml_balancing_by_site[site] = bal_results
        if model_results is not None:
            ml_models_by_site[site] = model_results
        if meta is not None:
            ml_meta_by_site[site] = meta

12.1 8.1 Class Balancing Technique Comparison by Cancer Site

Code
# Table: Class balancing technique comparison for each site (Gradient Boosting as base model)
for site in all_sites:
    if site not in ml_balancing_by_site:
        continue

    results = ml_balancing_by_site[site]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Balancing'

    status_col = []
    for name in metrics_df.index:
        entry = results[name]
        if entry.get('failed', False):
            reason = entry.get('fail_reason', 'Error')
            status_col.append(reason[:40])
        else:
            status_col.append('OK')
    metrics_df.insert(0, 'Status', status_col)

    for col in metrics_df.columns:
        if col in ['Train Size', 'Minority %', 'Status']:
            metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and np.isnan(x)) else x)
        else:
            metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"Class Balancing Comparison (Gradient Boosting): {site} (N={n_total:,}, Events={n_events}, {n_events/n_total*100:.1f}%)",
               index=True)

Class Balancing Comparison (Gradient Boosting): Hypopharynx (N=400, Events=33, 8.2%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.5978 0.6918 1.0000 0.3836 0.1346 1.0000 0.2373 0.2272 0.1059 0.3836 0.0043 320 8.1
SMOTE OK 0.6213 0.7094 0.8571 0.5616 0.1579 0.9762 0.2667 0.2370 0.1132 0.4188 0.0303 588 50.0
ADASYN OK 0.6204 0.6996 0.7143 0.6849 0.1786 0.9615 0.2857 0.2365 0.1121 0.3992 0.0892 584 49.7
SMOTETomek OK 0.6115 0.6996 0.7143 0.6849 0.1786 0.9615 0.2857 0.2365 0.1157 0.3992 0.0848 580 50.0
Random Oversampling OK 0.7006 0.7162 0.8571 0.5753 0.1622 0.9767 0.2727 0.2451 0.1124 0.4325 0.0272 588 50.0
Random Undersampling OK 0.4403 0.6233 1.0000 0.2466 0.1129 1.0000 0.2029 0.1669 0.4881 0.2466 0.0050 52 50.0

Class Balancing Comparison (Gradient Boosting): Larynx (N=2,625, Events=198, 7.5%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.6275 0.6255 0.3500 0.9010 0.2258 0.9438 0.2745 0.2064 0.0749 0.2510 0.1495 2100 7.5
SMOTE OK 0.5740 0.5933 0.5000 0.6866 0.1163 0.9433 0.1887 0.1055 0.0915 0.1866 0.1589 3884 50.0
ADASYN OK 0.5877 0.5872 0.9250 0.2495 0.0923 0.9758 0.1678 0.1090 0.0939 0.1745 0.0457 3875 49.9
SMOTETomek OK 0.5847 0.5955 0.5250 0.6660 0.1148 0.9444 0.1883 0.1063 0.0864 0.1910 0.1306 3744 50.0
Random Oversampling OK 0.5728 0.5729 0.5500 0.5959 0.1009 0.9414 0.1705 0.0785 0.1381 0.1459 0.2486 3884 50.0
Random Undersampling OK 0.6431 0.6343 0.4500 0.8186 0.1698 0.9475 0.2466 0.1775 0.2642 0.2686 0.8045 316 50.0

Class Balancing Comparison (Gradient Boosting): Nasopharynx (N=368, Events=23, 6.2%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.6391 0.6681 0.8000 0.5362 0.1111 0.9737 0.1951 0.1689 0.0792 0.3362 0.0028 294 6.1
SMOTE OK 0.7130 0.7130 0.6000 0.8261 0.2000 0.9661 0.3000 0.2660 0.0721 0.4261 0.0615 552 50.0
ADASYN OK 0.7130 0.7203 0.6000 0.8406 0.2143 0.9667 0.3158 0.2824 0.0788 0.4406 0.0720 552 50.0
SMOTETomek OK 0.4986 0.6130 0.4000 0.8261 0.1429 0.9500 0.2105 0.1449 0.0894 0.2261 0.0526 548 50.0
Random Oversampling OK 0.5855 0.6739 1.0000 0.3478 0.1000 1.0000 0.1818 0.1865 0.1080 0.3478 0.0028 552 50.0
Random Undersampling OK 0.7188 0.7203 0.6000 0.8406 0.2143 0.9667 0.3158 0.2824 0.2460 0.4406 0.9935 36 50.0

Class Balancing Comparison (Gradient Boosting): Oral Cavity (N=3,001, Events=176, 5.9%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.7222 0.6762 0.5714 0.7809 0.1389 0.9672 0.2235 0.1933 0.0602 0.3523 0.0448 2400 5.9
SMOTE OK 0.5746 0.6083 0.6000 0.6166 0.0882 0.9614 0.1538 0.1037 0.0974 0.2166 0.2029 4518 50.0
ADASYN OK 0.5835 0.6074 0.7714 0.4435 0.0789 0.9691 0.1432 0.1016 0.0952 0.2149 0.1380 4497 49.8
SMOTETomek OK 0.5723 0.6197 0.7429 0.4965 0.0836 0.9690 0.1503 0.1122 0.0953 0.2393 0.1244 4254 50.0
Random Oversampling OK 0.6285 0.6327 0.3714 0.8940 0.1781 0.9583 0.2407 0.1903 0.1181 0.2654 0.5121 4518 50.0
Random Undersampling OK 0.7183 0.6905 0.6000 0.7809 0.1448 0.9693 0.2333 0.2085 0.2441 0.3809 0.6778 282 50.0

Class Balancing Comparison (Gradient Boosting): Oropharynx (N=951, Events=55, 5.8%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.5856 0.6553 0.7273 0.5833 0.0964 0.9722 0.1702 0.1460 0.0679 0.3106 0.0104 760 5.8
SMOTE OK 0.5144 0.6152 0.3636 0.8667 0.1429 0.9571 0.2051 0.1517 0.0881 0.2303 0.3646 1432 50.0
ADASYN OK 0.4437 0.5568 0.3636 0.7500 0.0816 0.9507 0.1333 0.0606 0.0970 0.1136 0.2403 1439 50.2
SMOTETomek OK 0.5333 0.6141 0.2727 0.9556 0.2727 0.9556 0.2727 0.2283 0.0857 0.2283 0.5336 1366 50.0
Random Oversampling OK 0.5609 0.6348 0.6364 0.6333 0.0959 0.9661 0.1667 0.1293 0.0827 0.2697 0.0871 1432 50.0
Random Undersampling OK 0.6447 0.6323 0.9091 0.3556 0.0794 0.9846 0.1460 0.1301 0.3244 0.2646 0.0567 88 50.0

Class Balancing Comparison (Gradient Boosting): Other Pharynx (N=281, Events=30, 10.7%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.5605 0.6275 0.6667 0.5882 0.1600 0.9375 0.2581 0.1576 0.1279 0.2549 0.0028 224 10.7
SMOTE OK 0.5343 0.6275 0.6667 0.5882 0.1600 0.9375 0.2581 0.1576 0.1219 0.2549 0.0165 400 50.0
ADASYN OK 0.5245 0.6471 0.6667 0.6275 0.1739 0.9412 0.2759 0.1840 0.1247 0.2941 0.0138 401 50.1
SMOTETomek OK 0.5033 0.6373 0.6667 0.6078 0.1667 0.9394 0.2667 0.1706 0.1251 0.2745 0.0132 398 50.0
Random Oversampling OK 0.5605 0.6618 0.8333 0.4902 0.1613 0.9615 0.2703 0.1993 0.1570 0.3235 0.0134 400 50.0
Random Undersampling OK 0.6961 0.7255 0.6667 0.7843 0.2667 0.9524 0.3810 0.3143 0.3035 0.4510 0.9682 48 50.0

Class Balancing Comparison (Gradient Boosting): Salivary Glands (N=898, Events=46, 5.1%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.6556 0.6871 0.8889 0.4854 0.0833 0.9881 0.1524 0.1635 0.0597 0.3743 0.0048 718 5.2
SMOTE OK 0.6699 0.6696 0.8889 0.4503 0.0784 0.9872 0.1441 0.1492 0.0709 0.3392 0.0117 1362 50.0
ADASYN OK 0.6322 0.6696 0.5556 0.7836 0.1190 0.9710 0.1961 0.1748 0.0717 0.3392 0.0833 1362 50.0
SMOTETomek OK 0.6576 0.6550 0.6667 0.6433 0.0896 0.9735 0.1579 0.1397 0.0777 0.3099 0.0224 1342 50.0
Random Oversampling OK 0.6374 0.6462 0.4444 0.8480 0.1333 0.9667 0.2051 0.1710 0.0924 0.2924 0.2315 1362 50.0
Random Undersampling OK 0.7609 0.7632 0.8889 0.6374 0.1143 0.9909 0.2025 0.2353 0.3080 0.5263 0.5186 74 50.0

Class Balancing Comparison (Gradient Boosting): Sinonasal (N=887, Events=41, 4.6%)

Status AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold Train Size Minority %
Balancing
None (Original) OK 0.5537 0.6110 0.8750 0.3471 0.0593 0.9833 0.1111 0.0973 0.0572 0.2221 0.0040 709 4.7
SMOTE OK 0.6118 0.6191 0.7500 0.4882 0.0645 0.9765 0.1188 0.0988 0.0831 0.2382 0.0440 1352 50.0
ADASYN OK 0.5882 0.6199 0.8750 0.3647 0.0609 0.9841 0.1138 0.1039 0.0725 0.2397 0.0244 1346 49.8
SMOTETomek OK 0.5978 0.6110 0.8750 0.3471 0.0593 0.9833 0.1111 0.0973 0.0874 0.2221 0.0229 1306 50.0
Random Oversampling OK 0.6485 0.7441 1.0000 0.4882 0.0842 1.0000 0.1553 0.2028 0.0780 0.4882 0.0398 1352 50.0
Random Undersampling OK 0.6504 0.6838 0.7500 0.6176 0.0845 0.9813 0.1519 0.1556 0.3468 0.3676 0.7202 66 50.0

12.2 8.2 ML Model Comparison (Best Balancing per Site)

Code
# Table: ML model comparison with best balancing for each site
for site in all_sites:
    if site not in ml_models_by_site or not ml_models_by_site[site]:
        continue

    results = ml_models_by_site[site]
    meta = ml_meta_by_site.get(site)
    best_balancing = meta[3] if meta else 'Unknown'
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Model'

    for col in metrics_df.columns:
        metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"ML Models: {site} (N={n_total:,}, Events={n_events}, Best balancing: {best_balancing})",
               index=True)

ML Models: Hypopharynx (N=400, Events=33, Best balancing: Random Oversampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.4521 0.6135 0.8571 0.3699 0.1154 0.9643 0.2034 0.1345 0.2670 0.2270 0.1141
Random Forest 0.4843 0.5900 0.7143 0.4658 0.1136 0.9444 0.1961 0.1023 0.1359 0.1800 0.0300
Gradient Boosting 0.7006 0.7162 0.8571 0.5753 0.1622 0.9767 0.2727 0.2451 0.1124 0.4325 0.0272
SVM 0.6047 0.7025 0.8571 0.5479 0.1538 0.9756 0.2609 0.2290 0.1689 0.4051 0.1543
MLP 0.5616 0.6644 1.0000 0.3288 0.1250 1.0000 0.2222 0.2027 0.2396 0.3288 0.0000

ML Models: Larynx (N=2,625, Events=198, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.6627 0.6673 0.4500 0.8845 0.2432 0.9512 0.3158 0.2551 0.2306 0.3345 0.7886
Random Forest 0.6885 0.6921 0.6750 0.7093 0.1607 0.9636 0.2596 0.2186 0.2232 0.3843 0.5500
Gradient Boosting 0.6431 0.6343 0.4500 0.8186 0.1698 0.9475 0.2466 0.1775 0.2642 0.2686 0.8045
SVM 0.6565 0.6736 0.4750 0.8722 0.2346 0.9527 0.3140 0.2550 0.2144 0.3472 0.7175
MLP 0.6755 0.6771 0.5500 0.8041 0.1880 0.9559 0.2803 0.2258 0.3072 0.3541 0.9930

ML Models: Nasopharynx (N=368, Events=23, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.6551 0.7420 0.6000 0.8841 0.2727 0.9683 0.3750 0.3415 0.2441 0.4841 0.8362
Random Forest 0.7594 0.7826 1.0000 0.5652 0.1429 1.0000 0.2500 0.2842 0.2258 0.5652 0.4450
Gradient Boosting 0.7188 0.7203 0.6000 0.8406 0.2143 0.9667 0.3158 0.2824 0.2460 0.4406 0.9935
SVM 0.4145 0.6638 0.4000 0.9275 0.2857 0.9552 0.3333 0.2809 0.3762 0.3275 0.7654
MLP 0.8435 0.8551 1.0000 0.7101 0.2000 1.0000 0.3333 0.3769 0.4234 0.7101 0.9973

ML Models: Oral Cavity (N=3,001, Events=176, Best balancing: None (Original))

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.7013 0.6648 0.4286 0.9011 0.2113 0.9623 0.2830 0.2392 0.1958 0.3296 0.6417
Random Forest 0.6390 0.6323 0.4571 0.8074 0.1280 0.9601 0.2000 0.1527 0.0611 0.2646 0.0700
Gradient Boosting 0.7222 0.6762 0.5714 0.7809 0.1389 0.9672 0.2235 0.1933 0.0602 0.3523 0.0448
SVM 0.6413 0.6750 0.8571 0.4929 0.0946 0.9824 0.1705 0.1642 0.0543 0.3501 0.0524
MLP 0.4907 0.5626 0.1429 0.9823 0.3333 0.9488 0.2000 0.1879 0.0653 0.1252 0.7080

ML Models: Oropharynx (N=951, Events=55, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.7404 0.7098 0.6364 0.7833 0.1522 0.9724 0.2456 0.2287 0.2346 0.4197 0.6146
Random Forest 0.6662 0.6987 0.6364 0.7611 0.1400 0.9716 0.2295 0.2106 0.2210 0.3975 0.6050
Gradient Boosting 0.6447 0.6323 0.9091 0.3556 0.0794 0.9846 0.1460 0.1301 0.3244 0.2646 0.0567
SVM 0.8288 0.8341 0.8182 0.8500 0.2500 0.9871 0.3830 0.3980 0.2084 0.6682 0.6107
MLP 0.7449 0.7341 0.8182 0.6500 0.1250 0.9832 0.2169 0.2251 0.3003 0.4682 0.5443

ML Models: Other Pharynx (N=281, Events=30, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.6340 0.6863 1.0000 0.3725 0.1579 1.0000 0.2727 0.2425 0.2163 0.3725 0.2220
Random Forest 0.6830 0.7598 0.8333 0.6863 0.2381 0.9722 0.3704 0.3306 0.2156 0.5196 0.5350
Gradient Boosting 0.6961 0.7255 0.6667 0.7843 0.2667 0.9524 0.3810 0.3143 0.3035 0.4510 0.9682
SVM 0.1961 0.5392 1.0000 0.0784 0.1132 1.0000 0.2034 0.0942 0.3557 0.0784 0.4605
MLP 0.4248 0.5392 0.3333 0.7451 0.1333 0.9048 0.1905 0.0547 0.3116 0.0784 0.6873

ML Models: Salivary Glands (N=898, Events=46, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.6004 0.6637 0.5556 0.7719 0.1136 0.9706 0.1887 0.1661 0.3319 0.3275 0.8033
Random Forest 0.7840 0.7953 0.8889 0.7018 0.1356 0.9917 0.2353 0.2742 0.2239 0.5906 0.5550
Gradient Boosting 0.7609 0.7632 0.8889 0.6374 0.1143 0.9909 0.2025 0.2353 0.3080 0.5263 0.5186
SVM 0.6283 0.6345 1.0000 0.2690 0.0672 1.0000 0.1259 0.1344 0.2393 0.2690 0.3582
MLP 0.6244 0.6696 0.5556 0.7836 0.1190 0.9710 0.1961 0.1748 0.3154 0.3392 0.9770

ML Models: Sinonasal (N=887, Events=41, Best balancing: Random Undersampling)

AUC Bal. Accuracy Sensitivity Specificity PPV NPV F1 MCC Brier Youden's J Threshold
Model
Logistic Regression 0.5618 0.6360 0.6250 0.6471 0.0769 0.9735 0.1370 0.1171 0.2597 0.2721 0.4821
Random Forest 0.5688 0.7037 0.6250 0.7824 0.1190 0.9779 0.2000 0.1988 0.2729 0.4074 0.6750
Gradient Boosting 0.6504 0.6838 0.7500 0.6176 0.0845 0.9813 0.1519 0.1556 0.3468 0.3676 0.7202
SVM 0.5687 0.6132 0.2500 0.9765 0.3333 0.9651 0.2857 0.2600 0.2503 0.2265 0.6229
MLP 0.6037 0.6404 0.3750 0.9059 0.1579 0.9686 0.2222 0.1885 0.3521 0.2809 1.0000
Code
n_sites_ml = len(ml_models_by_site)
if n_sites_ml > 0:
    ncols = min(3, n_sites_ml)
    nrows = (n_sites_ml + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_sites_ml == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_models_by_site.items()):
        if not results:
            continue
        ax = axes[idx]
        best_balancing = ml_meta_by_site[site][3] if ml_meta_by_site.get(site) else ''
        for name, data in results.items():
            y_test = data['y_test']
            y_proba = data['y_proba']
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            auc_val = data['metrics']['AUC']
            ax.plot(fpr, tpr, label=f'{name} ({auc_val:.3f})', linewidth=2)

        ax.plot([0, 1], [0, 1], 'k--', linewidth=1)
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.set_title(f'{site}\n(Balancing: {best_balancing})', fontweight='bold', fontsize=10)
        ax.legend(fontsize=7, loc='lower right')
        ax.grid(True, alpha=0.3)

    for idx in range(len(ml_models_by_site), len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'roc_ml_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-roc-ml-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 11: Figure 9. ROC Curves - ML Mortality Models by Cancer Site (Best Class Balancing)
Code
if len(ml_balancing_by_site) > 0:
    comparison_data = []
    for site, results in ml_balancing_by_site.items():
        for bal_name, data in results.items():
            failed = data.get('failed', False)
            comparison_data.append({
                'Cancer Site': site,
                'Balancing': bal_name,
                'AUC': 0.0 if failed else data['metrics']['AUC'],
                'Bal. Accuracy': 0.0 if failed else data['metrics']['Bal. Accuracy'],
                'F1': 0.0 if failed else data['metrics']['F1'],
                'Failed': failed
            })

    comp_df = pd.DataFrame(comparison_data)

    bal_order = [b for b in ALL_BALANCING_NAMES if b in comp_df['Balancing'].unique()]
    comp_df['Balancing'] = pd.Categorical(comp_df['Balancing'], categories=bal_order, ordered=True)

    fig, axes = plt.subplots(1, 3, figsize=(14, 8))
    metrics_to_plot = ['AUC', 'Bal. Accuracy', 'F1']
    colors_resamp = plt.cm.Set2(np.linspace(0, 1, len(bal_order)))

    for ax_idx, metric in enumerate(metrics_to_plot):
        ax = axes[ax_idx]
        pivot = comp_df.pivot(index='Cancer Site', columns='Balancing', values=metric)
        pivot = pivot[[c for c in bal_order if c in pivot.columns]]
        pivot.fillna(0).plot(kind='bar', ax=ax, color=colors_resamp[:len(pivot.columns)])
        ax.set_title(metric, fontweight='bold')
        ax.set_ylabel(metric)
        ax.legend(fontsize=6, loc='lower right')
        ax.grid(True, alpha=0.3, axis='y')
        plt.setp(ax.get_xticklabels(), rotation=30, ha='right', fontsize=8)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'balancing_comparison_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-balancing-comparison.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 12: Figure 10. Class Balancing Technique AUC Comparison by Cancer Site
Code
n_sites_ml = len(ml_models_by_site)
if n_sites_ml > 0:
    ncols = min(4, n_sites_ml)
    nrows = (n_sites_ml + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4 * nrows))
    if n_sites_ml == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_models_by_site.items()):
        if not results:
            continue
        best_name = max(results, key=lambda k: results[k]['metrics']['AUC'])
        cm = results[best_name]['cm']
        ax = axes[idx]
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, cbar=False,
                    xticklabels=['Survived', 'Deceased'], yticklabels=['Survived', 'Deceased'])
        ax.set_title(f'{site}\n({best_name})', fontsize=10, fontweight='bold')
        ax.set_ylabel('True')
        ax.set_xlabel('Predicted')

    for idx in range(len(ml_models_by_site), len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'cm_ml_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-cm-ml-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 13: Figure 11. Confusion Matrices - Best ML Mortality Model per Site
Code
sites_with_gb = [s for s in ml_models_by_site if ml_models_by_site[s] and 'Gradient Boosting' in ml_models_by_site[s]]
n_gb = len(sites_with_gb)

if n_gb > 0:
    ncols = min(3, n_gb)
    nrows = (n_gb + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_gb == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, site in enumerate(sites_with_gb):
        model = ml_models_by_site[site]['Gradient Boosting']['model']
        feat_names = ml_meta_by_site[site][0]
        importances = model.feature_importances_
        formatted = [format_name(f) for f in feat_names]

        imp_df = pd.DataFrame({'Feature': formatted, 'Importance': importances})
        imp_df = imp_df.sort_values('Importance', ascending=True).tail(12)

        ax = axes[idx]
        ax.barh(imp_df['Feature'], imp_df['Importance'], color='steelblue')
        ax.set_xlabel('Importance')
        ax.set_title(f'{site}', fontweight='bold')
        ax.grid(True, alpha=0.3, axis='x')

    for idx in range(n_gb, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'feat_importance_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-feat-importance-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 14: Figure 12. Feature Importance - Gradient Boosting Mortality by Cancer Site (Best Class Balancing)

13 9. Machine Learning Models for Length of Stay Prediction (Per Cancer Site)

Code
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def run_ml_los_for_site(df_site, site_name):
    """Train 5 ML regressors for LOS on a single cancer site"""
    df_ml = df_site[df_site['dias_estancia'] > 0].copy()
    df_ml['sex_female'] = (df_ml['SEXO'] == 'Female').astype(int)
    df_ml['edad'] = pd.to_numeric(df_ml['edad'], errors='coerce')

    svc_dummies = pd.get_dummies(df_ml['service_category'], prefix='service', drop_first=True)
    for c in svc_dummies.columns:
        df_ml[c] = svc_dummies[c]

    feature_names = ['edad', 'sex_female'] + list(svc_dummies.columns) + ml_comorbidity_cols
    target = 'dias_estancia'

    df_ml = df_ml[feature_names + [target]].dropna()

    if len(df_ml) < 80:
        return None, None

    X = df_ml[feature_names].values
    y = df_ml[target].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42
    )

    models = {
        'Linear Regression': LinearRegression(),
        'Ridge': Ridge(alpha=1.0),
        'Lasso': Lasso(alpha=0.1, max_iter=5000),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
    }

    results = {}
    for name, model in models.items():
        try:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            results[name] = {
                'model': model,
                'y_pred': y_pred,
                'y_test': y_test,
                'metrics': {
                    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
                    'MAE': mean_absolute_error(y_test, y_pred),
                    'R2': r2_score(y_test, y_pred),
                    'MSE': mean_squared_error(y_test, y_pred)
                }
            }
        except:
            pass

    return results, feature_names

# Run ML LOS models for EACH cancer site
ml_los_by_site = {}
ml_los_features_by_site = {}

for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()
    results, feat_names = run_ml_los_for_site(df_site, site)
    if results and len(results) > 0:
        ml_los_by_site[site] = results
        ml_los_features_by_site[site] = feat_names

13.1 9.1 ML LOS Results by Cancer Site

Code
# Table: ML LOS Performance by Site
for site in all_sites:
    if site not in ml_los_by_site:
        continue

    results = ml_los_by_site[site]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    mean_los = df_site['dias_estancia'].mean()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Model'

    for col in metrics_df.columns:
        metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"ML LOS Performance: {site} (N={n_total:,}, Mean LOS={mean_los:.1f} days)",
               index=True)

ML LOS Performance: Hypopharynx (N=400, Mean LOS=13.7 days)

RMSE MAE R2 MSE
Model
Linear Regression 22.2782 13.9487 -0.4538 496.3178
Ridge 22.2354 13.9280 -0.4483 494.4112
Lasso 21.9121 13.6609 -0.4065 480.1414
Random Forest 19.4278 12.5932 -0.1056 377.4393
Gradient Boosting 20.6044 13.4355 -0.2436 424.5431

ML LOS Performance: Larynx (N=2,625, Mean LOS=14.5 days)

RMSE MAE R2 MSE
Model
Linear Regression 19.5764 13.3325 -0.0275 383.2339
Ridge 19.5748 13.3316 -0.0274 383.1739
Lasso 19.4355 13.2411 -0.0128 377.7388
Random Forest 21.3782 14.2926 -0.2254 457.0262
Gradient Boosting 20.3111 13.8083 -0.1061 412.5399

ML LOS Performance: Nasopharynx (N=368, Mean LOS=9.0 days)

RMSE MAE R2 MSE
Model
Linear Regression 10.8355 7.9235 -0.1720 117.4088
Ridge 10.7736 7.8920 -0.1587 116.0715
Lasso 10.2645 7.5827 -0.0517 105.3589
Random Forest 11.1671 7.6256 -0.2449 124.7033
Gradient Boosting 13.0054 8.0567 -0.6884 169.1396

ML LOS Performance: Oral Cavity (N=3,001, Mean LOS=9.6 days)

RMSE MAE R2 MSE
Model
Linear Regression 18.8474 10.6064 -0.0307 355.2235
Ridge 18.8466 10.6061 -0.0307 355.1952
Lasso 18.7448 10.5509 -0.0195 351.3677
Random Forest 20.5552 11.3571 -0.2260 422.5151
Gradient Boosting 19.0218 10.6592 -0.0499 361.8286

ML LOS Performance: Oropharynx (N=951, Mean LOS=8.4 days)

RMSE MAE R2 MSE
Model
Linear Regression 13.9983 8.5612 -0.1791 195.9520
Ridge 13.9887 8.5563 -0.1774 195.6842
Lasso 13.8397 8.4215 -0.1525 191.5368
Random Forest 13.5869 8.4933 -0.1108 184.6032
Gradient Boosting 13.9424 8.5364 -0.1697 194.3902

ML LOS Performance: Other Pharynx (N=281, Mean LOS=10.8 days)

RMSE MAE R2 MSE
Model
Linear Regression 12.2293 8.8816 -0.2828 149.5557
Ridge 12.1605 8.8576 -0.2684 147.8770
Lasso 11.7386 8.6341 -0.1819 137.7949
Random Forest 11.3697 7.8188 -0.1088 129.2695
Gradient Boosting 14.7723 9.3805 -0.8717 218.2223

ML LOS Performance: Salivary Glands (N=898, Mean LOS=6.9 days)

RMSE MAE R2 MSE
Model
Linear Regression 9.9316 6.2590 -0.2303 98.6370
Ridge 9.9229 6.2540 -0.2282 98.4633
Lasso 9.5267 6.0424 -0.1320 90.7571
Random Forest 9.3527 6.2639 -0.0911 87.4735
Gradient Boosting 9.3368 6.1751 -0.0874 87.1758

ML LOS Performance: Sinonasal (N=887, Mean LOS=6.7 days)

RMSE MAE R2 MSE
Model
Linear Regression 13.5113 6.8173 0.0791 182.5560
Ridge 13.5123 6.8185 0.0790 182.5820
Lasso 13.5369 6.8838 0.0756 183.2484
Random Forest 13.5982 6.7620 0.0672 184.9117
Gradient Boosting 13.7705 6.8266 0.0434 189.6273
Code
n_sites_los = len(ml_los_by_site)
if n_sites_los > 0:
    ncols = min(3, n_sites_los)
    nrows = (n_sites_los + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_sites_los == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_los_by_site.items()):
        best_name = max(results, key=lambda k: results[k]['metrics']['R2'])
        best = results[best_name]
        y_test = best['y_test']
        y_pred = best['y_pred']
        r2 = best['metrics']['R2']

        ax = axes[idx]
        ax.scatter(y_test, y_pred, alpha=0.4, s=15, color='steelblue')
        lims = [min(y_test.min(), y_pred.min()), max(np.percentile(y_test, 95), np.percentile(y_pred, 95))]
        ax.plot(lims, lims, 'r--', lw=2)
        ax.set_xlabel('Observed LOS (days)')
        ax.set_ylabel('Predicted LOS (days)')
        ax.set_title(f'{site}\n{best_name} (R²={r2:.3f})', fontweight='bold', fontsize=10)
        ax.grid(True, alpha=0.3)

    for idx in range(n_sites_los, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'pred_vs_obs_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-pred-vs-obs-los.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 15: Figure 12. Predicted vs Observed LOS - Best ML Model per Site
Code
sites_with_rf_los = [s for s in ml_los_by_site if 'Random Forest' in ml_los_by_site[s]]
n_rf_los = len(sites_with_rf_los)

if n_rf_los > 0:
    ncols = min(3, n_rf_los)
    nrows = (n_rf_los + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_rf_los == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, site in enumerate(sites_with_rf_los):
        model = ml_los_by_site[site]['Random Forest']['model']
        feat_names = ml_los_features_by_site[site]
        importances = model.feature_importances_
        formatted = [format_name(f) for f in feat_names]

        imp_df = pd.DataFrame({'Feature': formatted, 'Importance': importances})
        imp_df = imp_df.sort_values('Importance', ascending=True).tail(12)

        ax = axes[idx]
        ax.barh(imp_df['Feature'], imp_df['Importance'], color='forestgreen')
        ax.set_xlabel('Importance')
        ax.set_title(f'{site}', fontweight='bold')
        ax.grid(True, alpha=0.3, axis='x')

    for idx in range(n_rf_los, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'feat_importance_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-feat-importance-los.png", dpi=300, bbox_inches="tight")
    plt.show()
Figure 16: Figure 13. Feature Importance - Random Forest LOS by Cancer Site

14 10. Summary Comparison Table

Code
# Combined summary: best model per site for mortality and LOS
summary_rows = []
for site in all_sites:
    row = {'Cancer Site': site}
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row['N'] = f'{len(df_site):,}'
    row['Mortality %'] = f'{df_site["egresar_fallecido"].mean()*100:.1f}'
    row['Mean LOS'] = f'{df_site["dias_estancia"].mean():.1f}'

    # GEE Logistic regression AUC
    if site in gee_mortality_results:
        try:
            result = gee_mortality_results[site]
            df_gee = gee_mortality_models[site][1]
            y_proba = result.fittedvalues
            y_actual = df_gee['egresar_fallecido'].values
            row['GEE Mortality AUC'] = f'{roc_auc_score(y_actual, y_proba):.3f}'
        except:
            row['GEE Mortality AUC'] = '—'
    else:
        row['GEE Mortality AUC'] = '—'

    # Best ML mortality AUC (with class balancing)
    if site in ml_models_by_site and ml_models_by_site[site]:
        best_name = max(ml_models_by_site[site], key=lambda k: ml_models_by_site[site][k]['metrics']['AUC'])
        best_auc = ml_models_by_site[site][best_name]['metrics']['AUC']
        best_balancing = ml_meta_by_site[site][3] if ml_meta_by_site.get(site) else 'Unknown'
        row['Best ML Mortality AUC'] = f'{best_auc:.3f} ({best_name}, {best_balancing})'
    else:
        row['Best ML Mortality AUC'] = '—'

    # GEE Poisson/NB model RMSE
    if site in gee_los_results:
        try:
            result = gee_los_results[site]
            df_gee = gee_los_models[site][1]
            y_pred = result.fittedvalues
            y_actual = df_gee['dias_estancia'].values
            rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
            row['GEE LOS RMSE'] = f'{rmse:.2f}'
        except:
            row['GEE LOS RMSE'] = '—'
    else:
        row['GEE LOS RMSE'] = '—'

    # Best ML LOS R²
    if site in ml_los_by_site:
        best_name = max(ml_los_by_site[site], key=lambda k: ml_los_by_site[site][k]['metrics']['R2'])
        best_r2 = ml_los_by_site[site][best_name]['metrics']['R2']
        row['Best ML R² (LOS)'] = f'{best_r2:.3f} ({best_name})'
    else:
        row['Best ML R² (LOS)'] = '—'

    summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)
show_table(summary_df, "Table 13: Summary - GEE & ML Model Performance Across All Cancer Sites\n(GEE: Mixed-Effects with hospital clustering; ML: Multiple Class Balancing Techniques)", index=False)

Table 13: Summary - GEE & ML Model Performance Across All Cancer Sites (GEE: Mixed-Effects with hospital clustering; ML: Multiple Class Balancing Techniques)

Cancer Site N Mortality % Mean LOS GEE Mortality AUC Best ML Mortality AUC GEE LOS RMSE Best ML R² (LOS)
Hypopharynx 400 8.2 13.7 0.862 0.701 (Gradient Boosting, Random Oversampling) 16.26 -0.106 (Random Forest)
Larynx 2,625 7.5 14.5 0.859 0.688 (Random Forest, Random Undersampling) 20.50 -0.013 (Lasso)
Nasopharynx 368 6.2 9.0 0.843 (MLP, Random Undersampling) 18.14 -0.052 (Lasso)
Oral Cavity 3,001 5.9 9.6 0.897 0.722 (Gradient Boosting, None (Original)) 16.14 -0.020 (Lasso)
Oropharynx 951 5.8 8.4 0.888 0.829 (SVM, Random Undersampling) 13.02 -0.111 (Random Forest)
Other Pharynx 281 10.7 10.8 0.898 0.696 (Gradient Boosting, Random Undersampling) 15.03 -0.109 (Random Forest)
Salivary Glands 898 5.1 6.9 0.935 0.784 (Random Forest, Random Undersampling) 11.20 -0.087 (Gradient Boosting)
Sinonasal 887 4.6 6.7 0.898 0.650 (Gradient Boosting, Random Undersampling) 10.48 0.079 (Linear Regression)

15 11. Discussion and Conclusion

15.1 11.1 Key Findings

This analysis of head and neck malignancies in Chile (2019-2024) provides independent regression and machine learning models for each cancer site, enabling site-specific risk assessment for both in-hospital mortality and length of stay. Head and neck cancer comprises diverse anatomical subsites (oral cavity, pharynx, larynx, sinonasal) with markedly different epidemiology, treatment complexity, and clinical outcomes.

15.2 11.2 Regression Model Approach

Mixed-effects models were fit using Generalized Estimating Equations (GEE) to account for the hierarchical structure of observations within hospitals and years. For mortality, GEE logistic regression was used with Binomial family and exchangeable correlation structure. For length of stay, GEE negative binomial regression was used to account for overdispersion in count data, with Poisson as fallback. All models include age, sex, service category, severity/mortality indices, and only significant comorbidities from bivariate analysis (p < 0.05), improving parsimony and interpretability. Hospital was specified as the grouping variable to capture clustering effects.

15.3 11.3 Machine Learning Approach

To address class imbalance and improve generalization, multiple class balancing techniques were evaluated: SMOTE, ADASYN, SMOTETomek, random oversampling, random undersampling, and no balancing (original). For each cancer site, gradient boosting was used as the base model to identify the best-performing balancing method by AUC. Subsequently, five classifiers (Logistic Regression, Random Forest, Gradient Boosting, SVM, MLP) were trained using the optimal balancing strategy. All models used optimal decision thresholds derived from Youden’s J statistic rather than default 0.5 cutoff, improving balanced accuracy and clinical utility.

15.4 11.4 Clinical Implications

Head and neck cancer mortality and hospital resource utilization vary significantly by anatomical subsite and comorbidity burden. Treatment decisions (surgery vs. radiation vs. chemoradiation) and patient characteristics (age, performance status, comorbidities) directly influence hospital outcomes. These models provide quantitative tools for risk stratification and outcome prediction, supporting clinical decision-making and resource planning.


16 12. Conclusion

This comprehensive analysis of head and neck malignancies in Chilean public hospitals provides an epidemiological profile of in-hospital mortality and length of stay across multiple anatomical subsites (2019-2024). The results highlight the importance of site-specific modeling, the differential impact of comorbidities by location, and temporal patterns that may inform resource allocation and quality improvement initiatives in head and neck cancer care.

17 References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Charlson, Mary E., Peter Pompei, Kathy L. Ales, and C. Ronald MacKenzie. 1987. “A New Method of Classifying Prognostic Comorbidity in Longitudinal Studies: Development and Validation.” Journal of Chronic Diseases 40 (5): 373–83. https://doi.org/10.1016/0021-9681(87)90171-8.
Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57. https://doi.org/10.1613/jair.953.
Chow, Laura Q. M. 2020. “Head and Neck Cancer.” New England Journal of Medicine 382 (1): 60–72. https://doi.org/10.1056/NEJMra1715715.
Departamento de Estadı́sticas e Información de Salud (DEIS). 2024. Bases de Datos de Grupos Relacionados de Diagnóstico (GRD), 2019–2024. Https://deis.minsal.cl/.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29 (5): 1189–232.
Hashibe, Mia, Paul Brennan, Shu-Chun Chuang, et al. 2009. “Interaction Between Tobacco and Alcohol Use and the Risk of Head and Neck Cancer: Pooled Analysis in the INHANCE Consortium.” Cancer Epidemiology, Biomarkers & Prevention 18 (2): 541–50. https://doi.org/10.1158/1055-9965.EPI-08-0347.
Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. “Applied Logistic Regression.” Wiley Series in Probability and Statistics, ahead of print, 3rd ed. https://doi.org/10.1002/9781118548387.
Hubbard, Alan E., Jennifer Ahern, Nancy L. Fleischer, et al. 2010. “To GEE or Not to GEE: Comparing Population Average and Mixed Models for Estimating the Associations Between Neighborhood Risk Factors and Health.” Epidemiology 21 (4): 467–74. https://doi.org/10.1097/EDE.0b013e3181caeb90.
Hunter, John D. 2007. “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering 9 (3): 90–95. https://doi.org/10.1109/MCSE.2007.55.
Johnson, Daniel E., Barbara Burtness, C. René Leemans, et al. 2020. “Head and Neck Squamous Cell Carcinoma.” Nature Reviews Disease Primers 6 (1): 92. https://doi.org/10.1038/s41572-020-00224-3.
Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22. https://doi.org/10.1093/biomet/73.1.13.
Marur, Shanthi, and Arlene A. Forastiere. 2016. “Head and Neck Squamous Cell Carcinoma: Update on Epidemiology, Diagnosis, and Treatment.” Mayo Clinic Proceedings 91 (3): 386–96. https://doi.org/10.1016/j.mayocp.2015.12.017.
McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” Proceedings of the 9th Python in Science Conference, 56–61.
Organización Mundial de la Salud. 2019. Clasificación Estadı́stica Internacional de Enfermedades y Problemas Relacionados Con La Salud, décima Revisión (CIE-10). OMS.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.
Rettig, Eleni M., and Gypsyamber D’Souza. 2015. “Epidemiology of Head and Neck Cancer.” Surgical Oncology Clinics of North America 24 (3): 379–96. https://doi.org/10.1016/j.soc.2015.03.001.
Seabold, Skipper, and Josef Perktold. 2010. “Statsmodels: Econometric and Statistical Modeling with Python.” Proceedings of the 9th Python in Science Conference, 92–96.
Sung, Hyuna, Jacques Ferlay, Rebecca L. Siegel, et al. 2021. “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries.” CA: A Cancer Journal for Clinicians 71 (3): 209–49. https://doi.org/10.3322/caac.21660.
Valderas, José M., Barbara Starfield, Bonnie Sibbald, Chris Salisbury, and Martin Roland. 2009. “Defining Comorbidity: Implications for Understanding Health and Health Services.” Annals of Family Medicine 7 (4): 357–63. https://doi.org/10.1370/afm.983.
Youden, William J. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35. https://doi.org/10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.
Zeger, Scott L., and Kung-Yee Liang. 1986. “Longitudinal Data Analysis for Discrete and Continuous Outcomes.” Biometrics 42 (1): 121–30. https://doi.org/10.2307/2531248.