Head and Neck Malignant Neoplasms in Chile: In-Hospital Mortality and Length of Stay Analysis (GRD 2019–2024)

Neoplasias Malignas de Cabeza y Cuello en Chile: Análisis de Mortalidad Hospitalaria y Duración de Estadía (GRD 2019–2024)

Author

Amaru Simón Agüero Jiménez

Published

April 6, 2026

Resumen

FACTORES PRONÓSTICOS DE MORTALIDAD INTRAHOSPITALARIA EN CÁNCER DE CABEZA Y CUELLO EN CHILE MEDIANTE ECUACIONES DE ESTIMACIÓN GENERALIZADAS, 2019-2024

Introducción: La mortalidad intrahospitalaria por neoplasias malignas de cabeza y cuello varía sustancialmente según localización tumoral, pero los factores pronósticos específicos por sitio han sido escasamente estudiados en Chile considerando la agrupación natural de los pacientes dentro de centros hospitalarios.

Materiales y Métodos: Cohorte retrospectiva de 9.411 egresos del sistema de Grupos Relacionados de Diagnóstico (GRD) 2019–2024 con diagnóstico principal de neoplasia maligna de cabeza y cuello (CIE-10 C00–C14, C30–C32), agrupados en ocho subsitios clínicos. Se extrajeron comorbilidades de los diagnósticos secundarios y se ajustaron ecuaciones de estimación generalizadas (GEE) con correlación intercambiable y hospital como clúster: familia binomial para mortalidad y binomial negativa para estadía. Edad, sexo, severidad, riesgo de mortalidad y comorbilidades seleccionadas por significancia bivariada (p<0,05) por sitio se incorporaron como predictores. Complementariamente se entrenaron cinco algoritmos de aprendizaje automático (GradientBoosting, RandomForest, SVM, MLP y regresión logística) con seis estrategias de balanceo de clases, optimizando el umbral mediante el índice de Youden.

Resultados: La mortalidad global fue de 6,4% y la mediana de estadía de 5 días, con marcada heterogeneidad entre subsitios: faringe otra (10,7%), hipofaringe (8,2%) y laringe (7,5%) presentaron las mayores tasas, mientras sinonasal (4,6%) y glándulas salivales (5,1%) las menores. Los modelos GEE por sitio alcanzaron áreas bajo la curva entre 0,86 y 0,94 para mortalidad intrahospitalaria, con mejor desempeño en glándulas salivales (AUC 0,935), sinonasal y faringe otra (0,898), cavidad oral (0,897), orofaringe (0,888), hipofaringe (0,862) y laringe (0,859); en nasofaringe el modelo no convergió. Edad, severidad y comorbilidades cardiometabólicas y respiratorias resultaron predictores robustos. El mejor aprendizaje automático para mortalidad alcanzó AUC entre 0,650 y 0,843, sin superar a los modelos jerárquicos.

Conclusiones: Los modelos GEE con agrupación hospitalaria mostraron capacidad discriminativa elevada para mortalidad por cáncer de cabeza y cuello en Chile. La marcada variabilidad entre subsitios respalda el uso de modelos independientes con selección diferenciada de comorbilidades.

Palabras clave: neoplasias de cabeza y cuello; mortalidad hospitalaria; ecuaciones de estimación generalizadas; aprendizaje automático; Chile.

Abstract

PROGNOSTIC FACTORS FOR IN-HOSPITAL MORTALITY IN HEAD AND NECK CANCER IN CHILE USING GENERALIZED ESTIMATING EQUATIONS, 2019-2024

Introduction: In-hospital mortality from head and neck malignancies varies substantially by tumor site, yet site-specific prognostic factors have been scarcely studied in Chile accounting for the natural clustering of patients within hospital centers.

Materials and Methods: Retrospective cohort of 9,411 discharges from the Diagnosis-Related Groups (DRG) system 2019–2024 with primary diagnosis of head and neck malignancy (ICD-10 C00–C14, C30–C32), grouped into eight clinical subsites. Comorbidities were extracted from secondary diagnoses, and Generalized Estimating Equations (GEE) with exchangeable correlation and hospital as cluster were fitted: binomial family for mortality and negative binomial for length of stay. Age, sex, severity, mortality risk, and comorbidities selected by bivariate significance (p<0.05) per site were included as predictors. Five machine learning algorithms (GradientBoosting, RandomForest, SVM, MLP, and logistic regression) were trained complementarily with six class balancing strategies, optimizing the cutoff via Youden’s J index.

Results: Overall in-hospital mortality was 6.4% and median length of stay 5 days, with marked heterogeneity across subsites: other pharynx (10.7%), hypopharynx (8.2%), and larynx (7.5%) showed the highest rates, while sinonasal (4.6%) and salivary glands (5.1%) the lowest. Site-specific GEE models reached areas under the curve between 0.86 and 0.94 for in-hospital mortality, with best performance in salivary glands (AUC 0.935), sinonasal and other pharynx (0.898), oral cavity (0.897), oropharynx (0.888), hypopharynx (0.862), and larynx (0.859); the nasopharynx model did not converge. Age, severity, and cardiometabolic and respiratory comorbidities emerged as robust predictors. Best machine learning AUC for mortality ranged 0.650–0.843, without outperforming hierarchical models.

Conclusions: GEE models with hospital clustering showed high discriminative capacity for head and neck cancer mortality in Chile. The marked variability across subsites supports independent models with differentiated comorbidity selection.

Keywords: head and neck neoplasms; hospital mortality; generalized estimating equations; machine learning; Chile.

1 1. Methods

1.1 1.1 Study Design and Data Source

This is a retrospective cohort study based on hospital discharge records from the GRD (Diagnosis-Related Groups) system in Chile, covering the period 2019–2024. The GRD system captures comprehensive data on public hospital discharges including:

35 diagnostic fields (DIAGNOSTICO1 through DIAGNOSTICO35)
Severity indices coded 1–3 (IR_29301_SEVERIDAD)
Mortality risk indices coded 1–3 (IR_29301_MORTALIDAD)
Admission and discharge dates
Sex, age (derived from birth date), and service of admission
Discharge type (alive or deceased)

Cases of head and neck malignancies (ICD-10 codes C00–C14, C30–C32) are identified by screening all 35 diagnostic fields for these codes, with the cancer site determined by the first (earliest) occurrence position.

1.2 1.2 Cancer Case Identification

All patients with an ICD-10 code between C00–C14 and C30–C32 in any of the 35 diagnostic fields (DIAGNOSTICO1–DIAGNOSTICO35) are included. The primary cancer site is classified based on the first diagnostic field containing a relevant code:

Oral Cavity (C00–C06) - Lip (C00): ICD-10 code C00 - Base of tongue (C01): ICD-10 code C01 - Other tongue (C02): ICD-10 code C02 - Gum (C03): ICD-10 code C03 - Floor of mouth (C04): ICD-10 code C04 - Palate (C05): ICD-10 code C05 - Other mouth (C06): ICD-10 code C06

Salivary Glands (C07–C08) - Parotid gland (C07): ICD-10 code C07 - Other salivary glands (C08): ICD-10 code C08

Pharynx (C09–C14) - Tonsil (C09): ICD-10 code C09 - Oropharynx (C10): ICD-10 code C10 - Nasopharynx (C11): ICD-10 code C11 - Piriform sinus (C12): ICD-10 code C12 - Hypopharynx (C13): ICD-10 code C13 - Other lip, oral, pharynx (C14): ICD-10 code C14

Sinonasal and Larynx (C30–C32) - Nasal cavity and middle ear (C30): ICD-10 code C30 - Accessory sinuses (C31): ICD-10 code C31 - Larynx (C32): ICD-10 code C32

For analysis, sites are grouped into clinical subsites: Oral Cavity (C00–C06), Salivary Glands (C07–C08), Oropharynx (C09–C10), Nasopharynx (C11), Hypopharynx (C12–C13), Other Pharynx (C14), Sinonasal (C30–C31), and Larynx (C32).

1.3 1.3 Comorbidity Classification

Comorbidities are identified from secondary diagnoses (DIAGNOSTICO2 through DIAGNOSTICO35) and classified into 48 detailed clinical groups organized in 10 systemic categories, using the same scheme as the comorbidity network analysis (neoplasias_cabeza_cuello_redes.qmd) to ensure consistency across analyses.

Systemic Category	Detailed Groups	Representative ICD-10 Codes
Cardiovascular (CV) — 7 groups	Essential Hypertension; Complicated Hypertension; Coronary Disease; Heart Failure; Arrhythmias; Cerebrovascular Disease; Peripheral Vascular Disease	I10; I11–I15; I20–I25; I50; I44–I49; I60–I69; I70–I77
Metabolic (MET) — 7 groups	Type 1 Diabetes; Type 2 Diabetes; Complicated Diabetes; Obesity; Dyslipidemia; Metabolic Syndrome; Thyroid Disorders	E10; E11; E12–E14; E66; E78; E88; E00–E07
Respiratory (RESP) — 7 groups	COPD; Asthma; Chronic Bronchitis; Emphysema; Pulmonary Fibrosis; Pulmonary Hypertension; Sleep Apnea	J44; J45–J46; J41–J42; J43; J84; I27; G47
Renal (REN) — 4 groups	Acute Renal Failure; Chronic Renal Failure; End Stage Renal; Diabetic Nephropathy	N17; N18; N18.5–N18.6; E10.2–E14.2
Hepatic (HEP) — 4 groups	Steatosis; Alcoholic Cirrhosis; NonAlcoholic Cirrhosis; Viral Hepatitis	K76; K70; K74; B15–B19
Gastrointestinal (GI) — 4 groups	Reflux; Peptic Ulcer; IBD; IBS	K21; K25–K28; K50–K51; K58
Neurological (NEURO) — 4 groups	Dementia; Parkinson; Epilepsy; Stroke Sequelae	F00–F03, G30; G20–G22; G40–G41; I69
Psychiatric (PSI) — 3 groups	Major Depression; Bipolar; Anxiety	F32–F33; F30–F31; F41
Substance Use (SUST) — 3 groups	Alcohol; Tobacco; Opioids	F10; F17, Z72; F11
Infectious (INF) — 4 groups	HIV; Tuberculosis; Hepatitis C; Hepatitis B	B20–B24; A15–A19; B18; B16

Each of the 48 detailed groups is coded as a binary indicator (presence = 1, absence = 0). Additionally, 10 aggregated systemic-category indicators (cat_CV, cat_MET, cat_RESP, cat_REN, cat_HEP, cat_GI, cat_NEURO, cat_PSI, cat_SUST, cat_INF) are derived as the disjunction (any group present) within each category. Code matching uses both 3-character ICD-10 prefixes and 4-character extended codes (e.g., N18.5 for End Stage Renal disease).

1.4 1.4 Statistical Analyses

1.4.1 1.4.1 Descriptive Statistics

Continuous variables (age, length of stay) are reported as mean, median, standard deviation, and range. Categorical variables (sex, service category, comorbidities, discharge status) are reported as counts and proportions. Mortality rates and length of stay distributions are stratified by cancer site and temporal trends.

1.4.2 1.4.2 Bivariate Associations

Independence between each categorical covariate and the binary mortality outcome was tested with the Pearson chi-square statistic

\[ \chi^{2} \;=\; \sum_{i=1}^{R}\sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}, \]

where:

\(O_{ij}\) — observed count in row \(i\) and column \(j\) of the \(R \times C\) contingency table;
\(E_{ij} = (R_i \cdot C_j)/N\) — count expected under the null hypothesis of independence, with \(R_i\), \(C_j\) the row/column marginals and \(N\) the table total;
under \(H_0\), \(\chi^{2} \;\dot\sim\; \chi^{2}_{(R-1)(C-1)}\).

The Wilcoxon rank-sum (Mann–Whitney \(U\)) test compared the length-of-stay distribution between patients with and without each comorbidity, without assuming normality or homoscedasticity.

1.4.3 1.4.3 Mixed-Effects Logistic Regression for Mortality

In-hospital mortality \(Y_{ij} \in \{0,1\}\) (death = 1) is modeled with a mixed-effects logistic regression with a random intercept for the cluster (year of admission \(j\)):

\[ Y_{ij} \,\mid\, u_j \;\sim\; \text{Bernoulli}(\pi_{ij}), \qquad \mathrm{logit}(\pi_{ij}) \;=\; \log\!\left(\frac{\pi_{ij}}{1-\pi_{ij}}\right) \;=\; \beta_0 \,+\, \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \,+\, u_j, \]

with \(u_j \sim \mathcal{N}(0,\sigma^{2}_{u})\) independent across years. Symbols:

\(\pi_{ij}\) — probability of in-hospital death for patient \(i\) admitted in year \(j\);
\(\beta_0\) — fixed intercept (log-odds of death in the reference category, year-average);
\(\mathbf{x}_{ij}\) — covariate vector for patient \(i,j\): age, sex, service, GRD severity, GRD mortality risk, and the binary comorbidity indicators;
\(\boldsymbol{\beta}\) — fixed-effect log-odds ratios; the exponentiated coefficient \(e^{\beta_k}\) is the OR for a one-unit change in \(x_k\);
\(u_j\) — random intercept absorbing unobserved year-level heterogeneity; \(\sigma^{2}_{u}\) quantifies between-year variability of baseline mortality risk.

Models were fit by maximum likelihood with Laplace approximation, separately for each of the eight primary cancer subsites. Estimates are reported as odds ratios with 95% Wald confidence intervals.

1.4.4 1.4.4 Negative Binomial Mixed Models for Length of Stay

Length of stay is a non-negative count with overdispersion relative to Poisson. We use a negative-binomial mixed model with a log link and a random intercept for year:

\[ Y_{ij} \,\mid\, u_j \;\sim\; \mathrm{NB}\!\left(\mu_{ij}, \theta\right), \qquad \log(\mu_{ij}) \;=\; \beta_0 \,+\, \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} \,+\, u_j, \]

with \(u_j \sim \mathcal{N}(0,\sigma^{2}_{u})\) and the NB\((\mu,\theta)\) variance

\[ \mathrm{Var}(Y_{ij}\mid u_j) \;=\; \mu_{ij} \,+\, \mu_{ij}^{2}/\theta. \]

Symbols:

\(Y_{ij}\) — length of stay in days for patient \(i,j\);
\(\mu_{ij}\) — conditional mean stay given covariates and the year effect;
\(\theta\) — dispersion parameter; smaller \(\theta\) implies greater overdispersion, and \(\theta \to \infty\) recovers the Poisson model;
\(\mathbf{x}_{ij},\boldsymbol{\beta},u_j\) — as in the logistic model.

Coefficients are reported as incidence rate ratios \(\mathrm{IRR}_k = e^{\beta_k}\) (multiplicative effect on expected stay) with 95% CIs.

1.4.5 1.4.5 Area Under the Receiver Operating Characteristic Curve (AUC)

Discrimination of the mortality model is summarized by the area under the ROC curve,

\[ \mathrm{AUC} \;=\; \int_{0}^{1} \mathrm{TPR}\!\left(\mathrm{FPR}^{-1}(u)\right)\,du \;=\; \Pr\!\big(\hat{p}_{i} > \hat{p}_{j} \,\big|\, y_i = 1,\, y_j = 0\big), \]

where for every classification threshold \(c \in [0,1]\) on the predicted probability \(\hat{p}\):

\(\mathrm{TPR}(c) = \Pr(\hat{p} > c \mid Y = 1)\) — true positive rate (sensitivity);
\(\mathrm{FPR}(c) = \Pr(\hat{p} > c \mid Y = 0)\) — false positive rate (1 − specificity);
\(\mathrm{AUC} = 0.5\) — no discrimination; \(\mathrm{AUC} = 1\) — perfect ranking of cases above controls.

1.4.6 1.4.6 Hosmer–Lemeshow Goodness-of-Fit

Calibration of the logistic model was assessed with the Hosmer–Lemeshow statistic, which compares observed and expected events across \(G\) deciles of predicted risk:

\[ \mathrm{HL} \;=\; \sum_{g=1}^{G} \frac{\big(O_g - n_g\bar{p}_g\big)^{2}}{n_g\,\bar{p}_g\,(1 - \bar{p}_g)}, \]

where:

\(G\) — number of risk groups (we use \(G = 10\), i.e., deciles of \(\hat{p}\));
\(n_g\) — number of patients in group \(g\);
\(O_g\) — observed deaths in group \(g\);
\(\bar{p}_g\) — mean predicted probability in group \(g\) (so \(E_g = n_g \bar{p}_g\)).

Under \(H_0\) (model well-calibrated), \(\mathrm{HL} \;\dot\sim\; \chi^{2}_{G-2}\); \(p > 0.05\) supports adequate calibration.

1.4.7 1.4.7 McFadden’s Pseudo-\(R^{2}\)

\[ R^{2}_{\mathrm{McF}} \;=\; 1 \,-\, \frac{\ell(\hat{\boldsymbol{\beta}})}{\ell(\mathbf{0})}, \]

where:

\(\ell(\hat{\boldsymbol{\beta}})\) — log-likelihood of the fitted model;
\(\ell(\mathbf{0})\) — log-likelihood of the intercept-only (“null”) model.

Both log-likelihoods are negative, so \(R^{2}_{\mathrm{McF}} \in [0,1]\); values of 0.2–0.4 are typically considered indicative of good fit in logistic models.

1.4.8 1.4.8 Matthews Correlation Coefficient (MCC)

For a binary classifier at a fixed threshold, the Matthews correlation coefficient is

\[ \mathrm{MCC} \;=\; \frac{TP\cdot TN \;-\; FP\cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \]

where \(TP, TN, FP, FN\) are the cells of the confusion matrix (true/false positives/negatives). \(\mathrm{MCC} \in [-1, 1]\): \(+1\) perfect agreement, \(0\) no better than chance, \(-1\) perfect disagreement. Unlike accuracy, MCC stays meaningful with class imbalance.

1.4.9 1.4.9 Brier Score

The Brier score is the mean squared error between predicted probabilities and observed binary outcomes,

\[ \mathrm{BS} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{p}_i\big)^{2}, \]

where:

\(N\) — number of patients in the validation set;
\(y_i \in \{0,1\}\) — observed outcome;
\(\hat{p}_i \in [0,1]\) — predicted probability of the event for patient \(i\).

\(\mathrm{BS} = 0\) is perfect; \(\mathrm{BS} = 0.25\) is the score of a non-informative model that predicts \(\hat{p}_i \equiv 0.5\) for everyone. It rewards both calibration and discrimination.

1.4.10 1.4.10 RMSE and MAE for Length-of-Stay

Predictive accuracy of the continuous length-of-stay model is summarized by

\[ \mathrm{RMSE} \;=\; \sqrt{\frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}_i - y_i\big)^{2}}, \qquad \mathrm{MAE} \;=\; \frac{1}{N}\sum_{i=1}^{N}\big|\hat{y}_i - y_i\big|, \]

where \(y_i\) is the observed length of stay (days), \(\hat{y}_i\) its prediction, and \(N\) the validation sample size. RMSE penalizes large errors more strongly because of the square; MAE is in the same units as the outcome and is more robust to outliers.

1.4.11 1.4.11 Youden’s J Statistic

For threshold selection on the ROC curve we use Youden’s \(J\):

\[ J(c) \;=\; \mathrm{TPR}(c) \,+\, \mathrm{TNR}(c) \,-\, 1 \;=\; \mathrm{Sensitivity}(c) \,+\, \mathrm{Specificity}(c) \,-\, 1, \]

where \(c\) is the probability cut-off. \(J \in [0,1]\); the optimal cut-off is \(c^{*} = \arg\max_c J(c)\), the point that maximizes the vertical distance between the ROC curve and the chance diagonal.

2 2. Data Loading and Processing

Code

import pandas as pd
import numpy as np
import os
import warnings
from datetime import datetime
from pathlib import Path
import pickle
from IPython.display import display

warnings.filterwarnings('ignore')

# Configuration
DATA_DIR = "../data"
OUTPUT_DIR = "../data/output_files"
CACHE_DIR = "../data"

# Create output directory if needed
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(CACHE_DIR, exist_ok=True)

Code

# Function to load GRD data
def load_grd_data(data_dir, years=[2019, 2020, 2021, 2022, 2023, 2024]):
    """Load GRD data from multiple CSV files (pipe-separated, UTF-8)"""
    dfs = []

    for year in years:
        filepath = os.path.join(data_dir, f"GRD_PUBLICO_{year}.csv")
        if not os.path.exists(filepath):
            continue

        try:
            df = pd.read_csv(filepath, sep="|", dtype=str, low_memory=False, encoding='utf-8')
            df['year'] = str(year)
            dfs.append(df)
        except Exception as e:
            pass

    if dfs:
        return pd.concat(dfs, ignore_index=True)
    else:
        return pd.DataFrame()

# Load all GRD data
df_grd = load_grd_data(DATA_DIR)
if len(df_grd) == 0:
    display(HTML("<p><strong>Warning:</strong> No GRD data found.</p>"))

Code

# Load catalogs
catalog_file = os.path.join(DATA_DIR, "códigos grd.xlsx")

df_grd_codes = None
df_hospitals = None

try:
    df_grd_codes = pd.read_excel(catalog_file, sheet_name="IR - GRD")
except Exception as e:
    pass

try:
    df_hospitals = pd.read_excel(catalog_file, sheet_name="Hospitales")
except Exception as e:
    pass

Code

# Function to parse dates flexibly
def parse_date_flexible(date_str):
    """Parse dates that may be in DD-MM-YYYY or YYYY-MM-DD format"""
    if pd.isna(date_str) or date_str == '':
        return pd.NaT

    if isinstance(date_str, (int, float)):
        return pd.NaT

    date_str = str(date_str).strip()

    # Try different formats
    for fmt in ["%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d"]:
        try:
            return pd.to_datetime(date_str, format=fmt)
        except:
            continue

    return pd.NaT

# Apply date normalization
date_cols = ['FECHA_NACIMIENTO', 'FECHA_INGRESO', 'FECHAALTA']
for col in date_cols:
    if col in df_grd.columns:
        df_grd[col] = df_grd[col].apply(parse_date_flexible)
        valid_count = df_grd[col].notna().sum()

Code

# Calculate derived variables

# Age in years
df_grd['edad'] = (df_grd['FECHA_INGRESO'] - df_grd['FECHA_NACIMIENTO']).dt.days / 365.25

# Length of stay in days
df_grd['dias_estancia'] = (df_grd['FECHAALTA'] - df_grd['FECHA_INGRESO']).dt.days

# Mortality status (discharge type)
df_grd['egresar_fallecido'] = (df_grd['TIPOALTA'] == 'FALLECIDO').astype(int)
df_grd['estado_alta'] = df_grd['TIPOALTA'].apply(
    lambda x: "Deceased" if x == "FALLECIDO" else "Discharged"
)

Code

# Filter valid records

# Remove unknown sex
df_clean = df_grd[df_grd['SEXO'] != 'DESCONOCIDO'].copy()

# Standardize sex values to English
sex_map = {'FEMENINO': 'Female', 'MASCULINO': 'Male', 'MUJER': 'Female', 'HOMBRE': 'Male',
           'Female': 'Female', 'Male': 'Male', 'F': 'Female', 'M': 'Male'}
df_clean['SEXO'] = df_clean['SEXO'].map(sex_map).fillna(df_clean['SEXO'])

# Remove invalid ages
df_clean = df_clean[(df_clean['edad'].notna()) & (df_clean['edad'] > 0) & (df_clean['edad'] < 150)].copy()

# Remove invalid length of stay
df_clean = df_clean[(df_clean['dias_estancia'].notna()) & (df_clean['dias_estancia'] >= 0)].copy()

Code

# Function to identify head and neck cancer sites from diagnostic fields
def determine_cancer_site(row):
    """
    Scans all DIAGNOSTICO1-35 fields and returns the first C00-C14, C30-C32 code found.
    Returns: (cancer_site_name, cancer_code, diagnostic_field_position)
    """
    cancer_sites = {
        'C00': 'Lip',
        'C01': 'Base of tongue',
        'C02': 'Other tongue',
        'C03': 'Gum',
        'C04': 'Floor of mouth',
        'C05': 'Palate',
        'C06': 'Other mouth',
        'C07': 'Parotid gland',
        'C08': 'Other salivary glands',
        'C09': 'Tonsil',
        'C10': 'Oropharynx',
        'C11': 'Nasopharynx',
        'C12': 'Piriform sinus',
        'C13': 'Hypopharynx',
        'C14': 'Other pharynx',
        'C30': 'Nasal cavity',
        'C31': 'Accessory sinuses',
        'C32': 'Larynx'
    }

    for pos in range(1, 36):
        col_name = f'DIAGNOSTICO{pos}'
        if col_name not in row.index:
            continue

        diag = row[col_name]
        if pd.isna(diag) or diag == '':
            continue

        diag_str = str(diag).strip().upper()
        if len(diag_str) < 3:
            continue

        icd_code = diag_str[:3]
        if icd_code in cancer_sites:
            return cancer_sites[icd_code], icd_code, pos

    return None, None, None

# Filter for head and neck neoplasms

cancer_results = df_clean.apply(determine_cancer_site, axis=1)
df_clean['cancer_site'] = cancer_results.apply(lambda x: x[0])
df_clean['cancer_code'] = cancer_results.apply(lambda x: x[1])
df_clean['diagnostic_position'] = cancer_results.apply(lambda x: x[2])

# Keep only records with cancer diagnosis
df_head_neck = df_clean[df_clean['cancer_site'].notna()].copy()

vc_site = df_head_neck['cancer_site'].value_counts().sort_values(ascending=False).reset_index()
vc_site.columns = ['Cancer Site', 'N']
vc_site['%'] = (vc_site['N'] / vc_site['N'].sum() * 100).round(1)

Code

# Create clinical subsite grouping
def classify_head_neck_subsite(site):
    """Group head and neck sites into clinical subsites"""
    oral_cavity = ['Lip', 'Base of tongue', 'Other tongue', 'Gum', 'Floor of mouth', 'Palate', 'Other mouth']
    salivary = ['Parotid gland', 'Other salivary glands']
    oropharynx_group = ['Tonsil', 'Oropharynx']
    hypopharynx_group = ['Piriform sinus', 'Hypopharynx']
    sinonasal = ['Nasal cavity', 'Accessory sinuses']

    if site in oral_cavity:
        return 'Oral Cavity'
    elif site in salivary:
        return 'Salivary Glands'
    elif site in oropharynx_group:
        return 'Oropharynx'
    elif site == 'Nasopharynx':
        return 'Nasopharynx'
    elif site in hypopharynx_group:
        return 'Hypopharynx'
    elif site == 'Other pharynx':
        return 'Other Pharynx'
    elif site in sinonasal:
        return 'Sinonasal'
    elif site == 'Larynx':
        return 'Larynx'
    else:
        return site

df_head_neck['cancer_site_group'] = df_head_neck['cancer_site'].apply(classify_head_neck_subsite)

vc = df_head_neck['cancer_site_group'].value_counts().reset_index()
vc.columns = ['Cancer Site', 'N']
vc['%'] = (vc['N'] / vc['N'].sum() * 100).round(1)

Code

# Classify service category

def classify_service_category(servicio):
    """Classify admission service into categories"""
    if pd.isna(servicio) or servicio == '':
        return 'Other'

    servicio = str(servicio).upper()

    if 'CUIDADOS INTENSIVOS' in servicio or 'CUIDADOS' in servicio:
        return 'ICU'
    elif 'TRATAMIENTO INTERMEDIO' in servicio or 'UTI' in servicio:
        return 'ITU'
    elif any(x in servicio for x in ['AGUDOS', 'CIRUGÍA', 'CIRUGÍA', 'RECUPERACIÓN', 'MEDICO-QUIRURGICO']):
        return 'MS'
    else:
        return 'Other'

df_head_neck['service_category'] = df_head_neck['SERVICIOINGRESO'].apply(classify_service_category)

sc = df_head_neck['service_category'].value_counts().reset_index()
sc.columns = ['Service Category', 'N']
sc['%'] = (sc['N'] / sc['N'].sum() * 100).round(1)

Code

# Detect comorbidities from secondary diagnoses
# Uses the same comprehensive 48-group / 10-category scheme as the network analysis (redes)

comorbidity_groups = {
    # CARDIOVASCULAR (7 groups)
    "CV Essential Hypertension": ["I10"],
    "CV Complicated Hypertension": ["I11", "I12", "I13", "I15"],
    "CV Coronary Disease": ["I20", "I21", "I22", "I23", "I24", "I25"],
    "CV Heart Failure": ["I50"],
    "CV Arrhythmias": ["I44", "I45", "I46", "I47", "I48", "I49"],
    "CV Cerebrovascular Disease": ["I60", "I61", "I62", "I63", "I64", "I65", "I66", "I67", "I68", "I69"],
    "CV Peripheral Vascular Disease": ["I70", "I71", "I73", "I74", "I77"],

    # METABOLIC (7 groups)
    "MET Type1 Diabetes": ["E10"],
    "MET Type2 Diabetes": ["E11"],
    "MET Complicated Diabetes": ["E12", "E13", "E14"],
    "MET Obesity": ["E66"],
    "MET Dyslipidemia": ["E78"],
    "MET Metabolic Syndrome": ["E88"],
    "MET Thyroid Disorders": ["E00", "E01", "E02", "E03", "E04", "E05", "E06", "E07"],

    # RESPIRATORY (7 groups)
    "RESP COPD": ["J44"],
    "RESP Asthma": ["J45", "J46"],
    "RESP Chronic Bronchitis": ["J41", "J42"],
    "RESP Emphysema": ["J43"],
    "RESP Pulmonary Fibrosis": ["J84"],
    "RESP Pulmonary Hypertension": ["I27"],
    "RESP Sleep Apnea": ["G47"],

    # RENAL (4 groups)
    "REN Acute Renal Failure": ["N17"],
    "REN Chronic Renal Failure": ["N18"],
    "REN End Stage Renal": ["N185", "N186"],
    "REN Diabetic Nephropathy": ["E102", "E112", "E122", "E132", "E142"],

    # HEPATIC (4 groups)
    "HEP Steatosis": ["K76"],
    "HEP Alcoholic Cirrhosis": ["K70"],
    "HEP NonAlcoholic Cirrhosis": ["K74"],
    "HEP Viral Hepatitis": ["B15", "B16", "B17", "B18", "B19"],

    # GASTROINTESTINAL (4 groups)
    "GI Reflux": ["K21"],
    "GI Peptic Ulcer": ["K25", "K26", "K27", "K28"],
    "GI IBD": ["K50", "K51"],
    "GI IBS": ["K58"],

    # NEUROLOGICAL (4 groups)
    "NEURO Dementia": ["F00", "F01", "F02", "F03", "G30"],
    "NEURO Parkinson": ["G20", "G21", "G22"],
    "NEURO Epilepsy": ["G40", "G41"],
    "NEURO Stroke Sequelae": ["I69"],

    # PSYCHIATRIC (3 groups)
    "PSI Major Depression": ["F32", "F33"],
    "PSI Bipolar": ["F30", "F31"],
    "PSI Anxiety": ["F41"],

    # SUBSTANCE (3 groups)
    "SUST Alcohol": ["F10"],
    "SUST Tobacco": ["F17", "Z72"],
    "SUST Opioids": ["F11"],

    # INFECTIOUS (4 groups)
    "INF HIV": ["B20", "B21", "B22", "B23", "B24"],
    "INF Tuberculosis": ["A15", "A16", "A17", "A18", "A19"],
    "INF Hepatitis C": ["B18"],
    "INF Hepatitis B": ["B16"],
}

# Backwards-compatible alias used downstream
life_groups = comorbidity_groups

def get_icd_code(diag):
    """Extract normalized ICD code, returning both 3-char and 4-char (no dot) forms."""
    if pd.isna(diag) or diag == '':
        return None
    s = str(diag).strip().upper().replace('.', '')
    return s

def check_comorbidity_groups(row):
    """Detect comorbidities for a record. Matches both 3-char and 4-char codes
    so groups like REN End Stage Renal (N185, N186) are correctly captured."""
    diag_cols = [f'DIAGNOSTICO{i}' for i in range(2, 36)]
    codes_full = set()
    codes_3 = set()

    for col in diag_cols:
        if col in row.index:
            icd = get_icd_code(row[col])
            if icd:
                codes_full.add(icd)
                codes_3.add(icd[:3])

    comorbidities = {}
    for group_name, icd_codes in comorbidity_groups.items():
        match = False
        for code in icd_codes:
            if len(code) == 3:
                if code in codes_3:
                    match = True
                    break
            else:
                if any(c.startswith(code) for c in codes_full):
                    match = True
                    break
        comorbidities[group_name] = 1 if match else 0

    return comorbidities

# Reset index to ensure clean alignment after the boolean filter above
df_head_neck = df_head_neck.reset_index(drop=True)

# Apply comorbidity detection
comorbidity_data = df_head_neck.apply(check_comorbidity_groups, axis=1)
comorbidity_df = pd.DataFrame(comorbidity_data.tolist(), index=df_head_neck.index)

# Add to main dataframe (index-aligned)
for col in comorbidity_df.columns:
    df_head_neck[col] = comorbidity_df[col].astype(int)

comorbidity_cols = list(comorbidity_groups.keys())
all_comorbidity_cols = list(comorbidity_groups.keys())

# Aggregated systemic category indicators (CV, MET, RESP, REN, HEP, GI, NEURO, PSI, SUST, INF)
systemic_categories = ['CV', 'MET', 'RESP', 'REN', 'HEP', 'GI', 'NEURO', 'PSI', 'SUST', 'INF']
for cat in systemic_categories:
    cat_cols = [c for c in comorbidity_cols if c.startswith(cat + ' ')]
    df_head_neck[f'cat_{cat}'] = (df_head_neck[cat_cols].sum(axis=1) > 0).astype(int)

# Create total comorbidities and any comorbidity columns
df_head_neck['total_comorbidities'] = df_head_neck[comorbidity_cols].sum(axis=1)
df_head_neck['any_comorbidity'] = (df_head_neck['total_comorbidities'] > 0).astype(int)

# Prepare clustering/random effect variables for mixed models
# COD_HOSPITAL as cluster (random intercept), year as fixed effect
if 'COD_HOSPITAL' in df_head_neck.columns:
    df_head_neck['hospital_id'] = df_head_neck['COD_HOSPITAL'].astype(str)
else:
    # Fallback: use first available hospital-like column
    for hcol in ['ESTABLECIMIENTO', 'HOSPITAL', 'cod_hospital']:
        if hcol in df_head_neck.columns:
            df_head_neck['hospital_id'] = df_head_neck[hcol].astype(str)
            break
    else:
        df_head_neck['hospital_id'] = '1'  # single cluster fallback

df_head_neck['year_factor'] = df_head_neck['year'].astype(str)

Code

from IPython.display import display, HTML

# Helper function for clean table display
def _safe_format(x):
    """Format a value: NaN → '—', Inf → 'Inf', large abs → scientific, else 2 decimals"""
    if isinstance(x, float):
        if np.isnan(x):
            return '—'
        if np.isinf(x):
            return 'Inf' if x > 0 else '-Inf'
        if abs(x) > 1e8:
            return f'{x:.2e}'
        return f'{x:,.2f}'
    return str(x)

def sanitize_df(df):
    """Replace NaN with '—' and Inf with 'Inf' across entire DataFrame"""
    df_out = df.copy()
    for col in df_out.columns:
        if df_out[col].dtype in ['float64', 'float32']:
            df_out[col] = df_out[col].apply(_safe_format)
        else:
            df_out[col] = df_out[col].fillna('—')
    # Also handle index if it contains NaN
    if df_out.index.dtype in ['float64', 'float32']:
        df_out.index = [_safe_format(v) for v in df_out.index]
    return df_out

def show_table(df, caption="", index=False, float_format=None):
    """Display a DataFrame as a properly formatted HTML table with NaN/Inf handling"""
    df_clean = sanitize_df(df)
    html = f'<h4>{caption}</h4>' if caption else ''
    html += df_clean.to_html(index=index, classes='table table-striped table-hover table-sm',
                              escape=False)
    display(HTML(html))

# Process severity and mortality risk as factors

severity_cols = ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']
for col in severity_cols:
    if col in df_head_neck.columns:
        df_head_neck[col] = pd.to_numeric(df_head_neck[col], errors='coerce')
        df_head_neck[col] = df_head_neck[col].fillna(df_head_neck[col].median())
        df_head_neck[col] = df_head_neck[col].astype('category')

# Create main sites dataset
main_sites = ['Larynx', 'Oropharynx', 'Oral Cavity', 'Nasopharynx']
df_main_sites = df_head_neck[df_head_neck['cancer_site_group'].isin(main_sites)].copy()

# Save processed dataframe to cache
output_cache_path = os.path.join(CACHE_DIR, "df_head_neck_processed.pkl")
os.makedirs(os.path.dirname(output_cache_path), exist_ok=True)
with open(output_cache_path, 'wb') as f:
    pickle.dump(df_head_neck, f)

# General summary
info_df = pd.DataFrame({
    'Metric': ['Total GRD records', 'Head and neck malignancy cases (C00-C14, C30-C32)',
                'Period', 'Mortality rate', 'Median length of stay'],
    'Value': [f'{len(df_head_neck):,}', f'{len(df_head_neck):,}',
              f'{df_head_neck["year"].min()}-{df_head_neck["year"].max()}',
              f'{df_head_neck["egresar_fallecido"].mean()*100:.1f}%',
              f'{df_head_neck["dias_estancia"].median():.0f} days']
})
show_table(info_df, "Table 5: Overall Summary Statistics", index=False)

Table 5: Overall Summary Statistics

Metric	Value
Total GRD records	9,411
Head and neck malignancy cases (C00-C14, C30-C32)	9,411
Period	2019-2024
Mortality rate	6.4%
Median length of stay	5 days

3 3. Display Name Formatting

Code

# Define formatted display names for all variables
DISPLAY_NAMES = {
    'edad': 'Age',
    'SEXO_coded': 'Sex (Female)',
    'dias_estancia': 'Length of Stay',
    'egresar_fallecido': 'In-Hospital Mortality',
    'cancer_site_group': 'Cancer Site',
    'service_category': 'Service Category',
    'service_category_coded': 'Service Category',
    'year': 'Year',
    'year_factor': 'Year',
    'IR_29301_SEVERIDAD': 'Severity',
    'IR_29301_MORTALIDAD': 'Mortality Risk',
    # Detailed comorbidity groups (48 groups, 10 systemic categories)
    **{name: name for name in comorbidity_cols},
    # Aggregated systemic categories
    'cat_CV': 'Cardiovascular (any)',
    'cat_MET': 'Metabolic (any)',
    'cat_RESP': 'Respiratory (any)',
    'cat_REN': 'Renal (any)',
    'cat_HEP': 'Hepatic (any)',
    'cat_GI': 'Gastrointestinal (any)',
    'cat_NEURO': 'Neurological (any)',
    'cat_PSI': 'Psychiatric (any)',
    'cat_SUST': 'Substance Use (any)',
    'cat_INF': 'Infectious (any)',
    'total_comorbidities': 'Total Comorbidities',
    'any_comorbidity': 'Any Comorbidity',
    'service_ICU': 'Service ICU',
    'service_MS': 'Service MS',
    'service_ITU': 'Service ITU',
    'service_Other': 'Service Other',
    'severity_2': 'Severity Medium',
    'severity_3': 'Severity High',
    'mort_risk_2': 'Mortality Risk Medium',
    'mort_risk_3': 'Mortality Risk High',
    'const': 'Intercept',
    'Intercept': 'Intercept',
    'sex_female': 'Sex Female',
}

def format_name(name):
    """Convert variable names to formatted display names"""
    if name in DISPLAY_NAMES:
        return DISPLAY_NAMES[name]
    if name.startswith('year_'):
        return f'Year {name.split("_")[1]}'
    if name.startswith('severity_'):
        level = name.split('_')[1]
        labels = {'2': 'Medium', '3': 'High'}
        return f'Severity {labels.get(level, level)}'
    if name.startswith('mort_risk_'):
        level = name.split('_')[2] if len(name.split('_')) > 2 else name.split('_')[1]
        labels = {'2': 'Medium', '3': 'High'}
        return f'Mortality Risk {labels.get(level, level)}'
    if name.startswith('service_'):
        svc = name.replace('service_', '')
        return f'Service {svc}'
    if name.startswith('C('):
        return name.replace('C(', '').replace(')', '').replace('[T.', ': ').replace(']', '')
    return name.replace('_', ' ').title()

# Create mapping for comorbidity display names
comorb_names = {col: format_name(col) for col in comorbidity_cols}

4 4. Descriptive Statistics

Code

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.facecolor'] = 'white'
sns.set_style("whitegrid")

# Create age groups
df_head_neck['age_group'] = pd.cut(df_head_neck['edad'],
                                    bins=[0, 18, 35, 50, 65, 80, 150],
                                    labels=['<18', '18-34', '35-49', '50-64', '65-79', '80+'])

# Table 1: Patient Distribution by Cancer Site
table1_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    site_df = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(site_df)
    n_discharged = (site_df['egresar_fallecido'] == 0).sum()
    n_deceased = site_df['egresar_fallecido'].sum()
    mortality_pct = (n_deceased / n_total * 100) if n_total > 0 else 0
    mean_age = site_df['edad'].mean()
    age_range = f"{site_df['edad'].min():.0f}-{site_df['edad'].max():.0f}"
    female_pct = (site_df['SEXO'] == 'Female').sum() / n_total * 100
    mean_los = site_df['dias_estancia'].mean()
    median_los = site_df['dias_estancia'].median()
    sd_los = site_df['dias_estancia'].std()

    table1_data.append({
        'Site': site,
        'N': n_total,
        'Discharged': n_discharged,
        'Deceased': n_deceased,
        'Mortality %': mortality_pct,
        'Mean Age': mean_age,
        'Age Range': age_range,
        'Female %': female_pct,
        'Mean LOS': mean_los,
        'Median LOS': median_los,
        'SD LOS': sd_los
    })

table1_df = pd.DataFrame(table1_data)
show_table(table1_df, "Table 1: Patient Distribution by Head and Neck Cancer Site", index=False)

Table 1: Patient Distribution by Head and Neck Cancer Site

Site	N	Discharged	Deceased	Mortality %	Mean Age	Age Range	Female %	Mean LOS	Median LOS	SD LOS
Hypopharynx	400	367	33	8.25	68.60	9-97	23.50	13.71	7.00	19.00
Larynx	2625	2427	198	7.54	67.63	6-98	15.05	14.49	7.00	21.86
Nasopharynx	368	345	23	6.25	50.19	3-92	31.79	8.97	4.00	18.19
Oral Cavity	3001	2825	176	5.86	64.32	0-99	39.09	9.61	4.00	16.41
Oropharynx	951	896	55	5.78	62.16	8-96	26.60	8.39	4.00	13.66
Other Pharynx	281	251	30	10.68	57.49	0-96	32.38	10.80	5.00	15.99
Salivary Glands	898	852	46	5.12	60.54	2-97	52.00	6.95	4.00	11.65
Sinonasal	887	846	41	4.62	55.47	1-102	41.83	6.68	3.00	11.31

Code

# Table 2: Trends by Year and Site
table2_data = []
for year in sorted(df_head_neck['year'].unique()):
    for site in sorted(df_head_neck['cancer_site_group'].unique()):
        subset = df_head_neck[(df_head_neck['year'] == year) & (df_head_neck['cancer_site_group'] == site)]
        if len(subset) > 0:
            n = len(subset)
            mortality_pct = (subset['egresar_fallecido'].sum() / n * 100) if n > 0 else 0
            mean_age = subset['edad'].mean()
            mean_los = subset['dias_estancia'].mean()
            any_comorb_pct = subset['any_comorbidity'].mean() * 100 if 'any_comorbidity' in subset.columns else 0

            table2_data.append({
                'Year': year,
                'Site': site,
                'N': n,
                'Mortality %': mortality_pct,
                'Mean Age': mean_age,
                'Mean LOS': mean_los,
                'Any Comorbidity %': any_comorb_pct
            })

table2_df = pd.DataFrame(table2_data)
show_table(table2_df, "Table 2: Trends by Year and Site", index=False)

Table 2: Trends by Year and Site

Year	Site	N	Mortality %	Mean Age	Mean LOS	Any Comorbidity %
2019	Hypopharynx	63	3.17	69.85	11.84	65.08
2019	Larynx	450	8.22	67.61	10.87	68.00
2019	Nasopharynx	61	1.64	51.20	9.13	54.10
2019	Oral Cavity	571	4.38	62.50	7.34	58.49
2019	Oropharynx	183	6.56	62.18	6.81	47.54
2019	Other Pharynx	50	10.00	65.85	11.80	60.00
2019	Salivary Glands	173	2.89	59.66	6.52	60.69
2019	Sinonasal	165	3.03	53.91	5.14	49.70
2020	Hypopharynx	55	9.09	68.24	13.29	67.27
2020	Larynx	343	7.87	67.47	15.12	72.89
2020	Nasopharynx	48	6.25	58.31	9.65	43.75
2020	Oral Cavity	387	4.39	64.82	11.89	70.80
2020	Oropharynx	108	7.41	61.00	8.35	63.89
2020	Other Pharynx	35	5.71	56.63	14.40	60.00
2020	Salivary Glands	126	7.14	57.89	7.60	57.14
2020	Sinonasal	132	6.82	52.98	6.28	40.15
2021	Hypopharynx	60	13.33	65.86	15.50	55.00
2021	Larynx	408	9.07	66.90	15.57	76.72
2021	Nasopharynx	47	6.38	43.42	6.02	55.32
2021	Oral Cavity	418	7.89	65.41	9.92	71.05
2021	Oropharynx	126	4.76	60.39	6.49	57.94
2021	Other Pharynx	54	14.81	55.85	13.00	53.70
2021	Salivary Glands	135	5.93	56.48	8.59	56.30
2021	Sinonasal	122	3.28	55.92	6.88	57.38
2022	Hypopharynx	71	9.86	66.15	15.85	64.79
2022	Larynx	407	6.39	66.92	17.27	77.15
2022	Nasopharynx	45	6.67	59.07	9.93	62.22
2022	Oral Cavity	488	6.35	63.20	10.19	69.26
2022	Oropharynx	160	6.25	63.57	8.95	60.62
2022	Other Pharynx	49	6.12	50.27	7.00	73.47
2022	Salivary Glands	155	5.16	63.86	6.53	70.32
2022	Sinonasal	130	7.69	55.26	8.46	59.23
2023	Hypopharynx	77	6.49	70.62	14.64	74.03
2023	Larynx	504	7.34	68.16	13.00	79.96
2023	Nasopharynx	50	6.00	52.95	5.92	64.00
2023	Oral Cavity	505	5.74	65.01	9.69	71.68
2023	Oropharynx	164	7.32	63.42	10.55	68.29
2023	Other Pharynx	38	15.79	61.21	6.50	65.79
2023	Salivary Glands	125	3.20	60.55	5.90	62.40
2023	Sinonasal	160	3.75	55.75	6.66	65.00
2024	Hypopharynx	74	8.11	70.29	11.18	72.97
2024	Larynx	513	6.63	68.36	15.66	82.85
2024	Nasopharynx	117	8.55	44.44	10.72	57.26
2024	Oral Cavity	632	6.49	65.27	9.55	74.53
2024	Oropharynx	210	3.33	61.73	8.83	67.62
2024	Other Pharynx	55	10.91	55.94	11.82	65.45
2024	Salivary Glands	184	6.52	63.34	6.78	72.28
2024	Sinonasal	178	3.93	58.34	6.99	56.18

Code

# Table 3: Comorbidity Prevalence - All Patients
table3_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    site_df = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table3_data.append(row)

table3_df = pd.DataFrame(table3_data)
show_table(table3_df, "Table 3: Comorbidity Prevalence - All Patients (%)", index=False)

Table 3: Comorbidity Prevalence - All Patients (%)

Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	36.75	0.25	4.00	1.50	3.75	0.75	1.50	0.25	14.25	0.25	1.50	9.50	0.25	16.00	6.25	1.25	0.00	2.00	1.00	0.00	0.25	1.50	4.50	1.75	0.00	1.00	1.00	0.00	0.75	0.00	0.75	0.00	0.00	1.25	0.50	1.75	0.50	2.00	0.00	1.00	6.50	16.75	0.00	1.50	0.50	0.50	0.00
Larynx	45.79	0.50	5.79	3.50	6.93	2.93	3.09	0.34	19.81	0.23	4.99	8.42	0.46	11.16	13.79	3.62	0.04	1.75	1.03	0.46	0.99	6.10	5.18	1.10	0.30	0.91	1.30	0.08	0.15	0.57	1.14	0.00	0.04	1.33	0.72	1.56	2.10	2.32	0.27	0.88	4.53	24.27	0.04	0.65	0.27	0.11	0.04
Nasopharynx	28.26	0.00	2.17	1.63	2.99	1.63	4.08	0.00	12.50	0.27	7.07	8.97	0.54	9.51	0.82	3.53	0.00	0.54	1.09	0.00	0.27	5.43	3.26	0.54	0.27	0.27	0.27	0.00	0.00	0.27	0.27	0.27	0.00	0.00	0.27	1.09	0.82	0.54	0.00	3.80	4.08	7.88	0.00	1.36	0.00	0.00	0.00
Oral Cavity	43.85	0.50	3.80	2.17	5.63	2.43	2.10	0.37	18.66	0.43	5.20	11.43	0.60	12.26	5.13	1.77	0.13	0.73	0.73	0.27	0.30	4.53	4.57	1.03	0.23	1.37	0.63	0.30	0.17	0.43	0.53	0.13	0.07	1.73	0.67	1.20	1.90	2.97	0.23	0.70	4.77	14.83	0.00	0.73	0.07	0.00	0.17
Oropharynx	37.01	0.11	2.42	1.05	3.15	2.31	2.73	0.11	13.67	0.32	4.21	7.05	0.11	8.73	3.68	1.79	0.11	1.05	0.74	0.21	0.32	3.05	2.52	0.63	0.21	1.47	0.84	0.00	0.00	0.21	0.32	0.11	0.00	0.42	0.63	2.21	2.00	2.31	0.63	1.16	4.84	13.25	0.11	2.10	0.21	0.00	0.00
Other Pharynx	33.45	0.36	2.14	2.49	4.98	3.56	3.91	0.71	15.30	0.00	2.85	6.05	0.00	9.25	4.98	1.07	0.00	1.42	1.07	0.00	3.20	3.56	4.27	1.42	0.00	1.07	0.36	0.00	0.00	1.07	1.78	0.36	0.36	0.71	0.00	3.56	1.78	2.14	0.36	0.71	1.78	11.03	0.00	1.42	0.36	0.00	0.00
Salivary Glands	37.53	0.89	3.01	2.23	4.68	1.67	0.89	0.33	20.04	0.22	8.02	10.47	0.67	11.36	4.12	3.79	0.00	0.22	0.78	0.22	0.78	4.23	4.23	0.78	0.11	0.78	0.22	0.11	0.00	0.22	0.45	0.22	0.00	0.78	1.22	1.78	1.34	3.12	0.45	1.00	2.00	9.24	0.00	0.56	0.56	0.00	0.00
Sinonasal	33.03	0.56	1.24	1.35	1.69	2.14	0.79	0.45	16.01	0.23	6.43	7.33	0.23	8.46	1.80	2.59	0.00	0.00	0.56	0.00	0.68	1.69	2.03	0.79	0.23	0.90	0.23	0.00	0.00	0.34	0.34	0.00	0.00	0.90	0.68	1.58	1.13	2.03	0.11	1.24	3.27	9.47	0.00	0.45	0.00	0.00	0.00

Code

# Table 4: Comorbidity Prevalence - Discharged
table4_data = []
discharged = df_head_neck[df_head_neck['egresar_fallecido'] == 0]

for site in sorted(discharged['cancer_site_group'].unique()):
    site_df = discharged[discharged['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table4_data.append(row)

table4_df = pd.DataFrame(table4_data)
show_table(table4_df, "Table 4: Comorbidity Prevalence - Discharged (%)", index=False)

Table 4: Comorbidity Prevalence - Discharged (%)

Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	36.78	0.27	3.54	1.09	2.72	0.54	1.63	0.27	14.44	0.27	1.63	9.54	0.27	15.53	5.72	1.09	0.00	1.91	1.09	0.00	0.27	1.36	4.90	1.91	0.00	1.09	0.82	0.00	0.82	0.00	0.54	0.00	0.00	1.36	0.27	1.63	0.54	1.91	0.00	0.82	7.08	17.17	0.00	1.63	0.27	0.54	0.00
Larynx	45.78	0.45	5.64	3.09	5.48	2.76	2.68	0.37	19.45	0.25	5.23	8.61	0.37	11.08	13.39	3.67	0.04	1.48	0.91	0.37	0.99	4.70	4.61	0.95	0.29	0.82	1.03	0.00	0.16	0.62	0.91	0.00	0.00	1.28	0.74	1.44	2.02	2.22	0.29	0.95	4.41	24.64	0.04	0.66	0.25	0.12	0.04
Nasopharynx	27.83	0.00	2.03	1.74	2.61	1.16	3.48	0.00	12.46	0.29	7.54	9.57	0.29	9.28	0.87	3.77	0.00	0.29	1.16	0.00	0.29	4.64	3.48	0.58	0.29	0.29	0.29	0.00	0.00	0.29	0.29	0.29	0.00	0.00	0.00	1.16	0.87	0.58	0.00	4.06	4.35	7.25	0.00	1.45	0.00	0.00	0.00
Oral Cavity	43.61	0.46	3.68	2.02	4.71	2.27	1.77	0.35	18.62	0.46	5.35	11.54	0.50	11.96	5.03	1.63	0.14	0.71	0.57	0.21	0.28	3.82	4.28	0.92	0.25	1.20	0.57	0.28	0.18	0.46	0.50	0.11	0.07	1.63	0.67	1.17	1.77	2.94	0.21	0.71	4.71	14.73	0.00	0.71	0.07	0.00	0.18
Oropharynx	36.83	0.11	2.46	0.78	2.46	2.12	2.68	0.11	13.39	0.33	4.24	7.14	0.00	9.04	3.24	1.79	0.11	1.12	0.56	0.22	0.33	2.23	2.23	0.56	0.22	1.12	0.78	0.00	0.00	0.22	0.33	0.11	0.00	0.45	0.67	2.23	1.90	2.12	0.45	1.23	4.91	13.39	0.11	2.01	0.11	0.00	0.00
Other Pharynx	31.87	0.00	2.39	1.99	3.98	3.19	3.59	0.40	15.54	0.00	2.79	6.77	0.00	9.96	4.38	1.20	0.00	1.20	1.20	0.00	3.59	1.99	3.98	1.20	0.00	1.20	0.00	0.00	0.00	1.20	1.59	0.40	0.40	0.80	0.00	3.19	1.59	1.59	0.00	0.80	1.59	10.36	0.00	1.59	0.40	0.00	0.00
Salivary Glands	37.79	0.82	2.70	1.88	3.76	1.41	0.94	0.35	20.31	0.12	8.33	10.68	0.70	11.38	4.23	3.99	0.00	0.12	0.82	0.23	0.82	3.29	3.99	0.70	0.12	0.70	0.12	0.12	0.00	0.23	0.47	0.23	0.00	0.70	1.29	1.88	1.29	3.17	0.47	1.06	1.88	9.27	0.00	0.59	0.59	0.00	0.00
Sinonasal	32.86	0.47	1.30	1.30	1.65	1.77	0.83	0.35	15.72	0.24	6.26	7.33	0.12	8.51	1.77	2.72	0.00	0.00	0.59	0.00	0.71	1.54	2.01	0.83	0.24	0.83	0.24	0.00	0.00	0.35	0.35	0.00	0.00	0.95	0.71	1.54	1.06	2.01	0.12	1.30	3.31	9.34	0.00	0.47	0.00	0.00	0.00

Code

# Table 5: Comorbidity Prevalence - Deceased
table5_data = []
deceased = df_head_neck[df_head_neck['egresar_fallecido'] == 1]

for site in sorted(deceased['cancer_site_group'].unique()):
    site_df = deceased[deceased['cancer_site_group'] == site]
    n = len(site_df)
    row = {'Site': site}

    for col in comorbidity_cols:
        if col in site_df.columns:
            pct = (site_df[col].sum() / n * 100) if n > 0 else 0
            row[format_name(col)] = pct

    table5_data.append(row)

table5_df = pd.DataFrame(table5_data)
show_table(table5_df, "Table 5: Comorbidity Prevalence - Deceased (%)", index=False)

Table 5: Comorbidity Prevalence - Deceased (%)

Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	INF HIV	INF Tuberculosis
Hypopharynx	36.36	0.00	9.09	6.06	15.15	3.03	0.00	0.00	12.12	0.00	0.00	9.09	0.00	21.21	12.12	3.03	3.03	0.00	0.00	0.00	3.03	0.00	0.00	0.00	0.00	3.03	0.00	3.03	0.00	0.00	0.00	3.03	3.03	0.00	3.03	0.00	3.03	0.00	12.12	0.00	3.03
Larynx	45.96	1.01	7.58	8.59	24.75	5.05	8.08	0.00	24.24	0.00	2.02	6.06	1.52	12.12	18.69	3.03	5.05	2.53	1.52	1.01	23.23	12.12	3.03	0.51	2.02	4.55	1.01	4.04	0.00	0.51	2.02	0.51	3.03	3.03	3.54	0.00	0.00	6.06	19.70	0.51	0.51
Nasopharynx	34.78	0.00	4.35	0.00	8.70	8.70	13.04	0.00	13.04	0.00	0.00	0.00	4.35	13.04	0.00	0.00	4.35	0.00	0.00	0.00	17.39	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	4.35	0.00	0.00	0.00	0.00	0.00	0.00	17.39	0.00	0.00
Oral Cavity	47.73	1.14	5.68	4.55	20.45	5.11	7.39	0.57	19.32	0.00	2.84	9.66	2.27	17.05	6.82	3.98	1.14	3.41	1.14	0.57	15.91	9.09	2.84	0.00	3.98	1.70	0.57	1.14	0.57	0.00	3.41	0.57	1.70	3.98	3.41	0.57	0.57	5.68	16.48	1.14	0.00
Oropharynx	40.00	0.00	1.82	5.45	14.55	5.45	3.64	0.00	18.18	0.00	3.64	5.45	1.82	3.64	10.91	1.82	0.00	3.64	0.00	0.00	16.36	7.27	1.82	0.00	7.27	1.82	0.00	0.00	0.00	0.00	0.00	0.00	1.82	3.64	5.45	3.64	0.00	3.64	10.91	3.64	1.82
Other Pharynx	46.67	3.33	0.00	6.67	13.33	6.67	6.67	3.33	13.33	0.00	3.33	0.00	0.00	3.33	10.00	0.00	3.33	0.00	0.00	0.00	16.67	6.67	3.33	0.00	0.00	3.33	0.00	3.33	0.00	0.00	0.00	0.00	6.67	3.33	6.67	3.33	0.00	3.33	16.67	0.00	0.00
Salivary Glands	32.61	2.17	8.70	8.70	21.74	6.52	0.00	0.00	15.22	2.17	2.17	6.52	0.00	10.87	2.17	0.00	2.17	0.00	0.00	0.00	21.74	8.70	2.17	0.00	2.17	2.17	0.00	0.00	0.00	0.00	2.17	0.00	0.00	2.17	2.17	0.00	0.00	4.35	8.70	0.00	0.00
Sinonasal	36.59	2.44	0.00	2.44	2.44	9.76	0.00	2.44	21.95	0.00	9.76	7.32	2.44	7.32	2.44	0.00	0.00	0.00	0.00	0.00	4.88	2.44	0.00	0.00	2.44	0.00	0.00	0.00	0.00	0.00	0.00	0.00	2.44	2.44	2.44	0.00	0.00	2.44	12.20	0.00	0.00

Code

# Table 6: Summary by Discharge Status
table6_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    for status in [0, 1]:
        status_label = 'Discharged' if status == 0 else 'Deceased'
        subset = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == status)]

        if len(subset) > 0:
            table6_data.append({
                'Site': site,
                'Status': status_label,
                'N': len(subset),
                'Mean Age': subset['edad'].mean(),
                'Mean LOS': subset['dias_estancia'].mean(),
                'Median LOS': subset['dias_estancia'].median()
            })

table6_df = pd.DataFrame(table6_data)
show_table(table6_df, "Table 6: Summary by Discharge Status", index=False)

Table 6: Summary by Discharge Status

Site	Status	N	Mean Age	Mean LOS	Median LOS
Hypopharynx	Discharged	367	68.30	12.88	7.00
Hypopharynx	Deceased	33	71.94	23.00	13.00
Larynx	Discharged	2427	67.30	13.67	7.00
Larynx	Deceased	198	71.64	24.63	12.50
Nasopharynx	Discharged	345	49.49	7.73	4.00
Nasopharynx	Deceased	23	60.67	27.48	13.00
Oral Cavity	Discharged	2825	64.06	9.22	4.00
Oral Cavity	Deceased	176	68.54	15.80	6.00
Oropharynx	Discharged	896	61.89	7.87	4.00
Oropharynx	Deceased	55	66.57	16.96	9.00
Other Pharynx	Discharged	251	55.71	10.04	5.00
Other Pharynx	Deceased	30	72.43	17.23	5.50
Salivary Glands	Discharged	852	60.34	6.41	3.00
Salivary Glands	Deceased	46	64.11	16.93	9.00
Sinonasal	Discharged	846	55.32	6.27	3.00
Sinonasal	Deceased	41	58.38	15.20	11.00

Code

# Table 7: LOS by Site and Discharge Status
table7_data = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    for status in [0, 1]:
        status_label = 'Discharged' if status == 0 else 'Deceased'
        subset = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == status)]

        if len(subset) > 0:
            los_vals = subset['dias_estancia']
            q1 = los_vals.quantile(0.25)
            q3 = los_vals.quantile(0.75)
            iqr = q3 - q1

            table7_data.append({
                'Site': site,
                'Status': status_label,
                'N': len(subset),
                'Mean LOS': los_vals.mean(),
                'Median LOS': los_vals.median(),
                'IQR': iqr
            })

table7_df = pd.DataFrame(table7_data)
show_table(table7_df, "Table 7: Length of Stay by Site and Discharge Status", index=False)

Table 7: Length of Stay by Site and Discharge Status

Site	Status	N	Mean LOS	Median LOS	IQR
Hypopharynx	Discharged	367	12.88	7.00	14.00
Hypopharynx	Deceased	33	23.00	13.00	27.00
Larynx	Discharged	2427	13.67	7.00	16.00
Larynx	Deceased	198	24.63	12.50	25.50
Nasopharynx	Discharged	345	7.73	4.00	6.00
Nasopharynx	Deceased	23	27.48	13.00	14.50
Oral Cavity	Discharged	2825	9.22	4.00	9.00
Oral Cavity	Deceased	176	15.80	6.00	12.25
Oropharynx	Discharged	896	7.87	4.00	8.00
Oropharynx	Deceased	55	16.96	9.00	18.00
Other Pharynx	Discharged	251	10.04	5.00	9.50
Other Pharynx	Deceased	30	17.23	5.50	23.00
Salivary Glands	Discharged	852	6.41	3.00	6.00
Salivary Glands	Deceased	46	16.93	9.00	20.25
Sinonasal	Discharged	846	6.27	3.00	6.00
Sinonasal	Deceased	41	15.20	11.00	19.00

Code

# Table 8: Yearly Trends Aggregate
table8_data = []
for year in sorted(df_head_neck['year'].unique()):
    year_df = df_head_neck[df_head_neck['year'] == year]
    n_total = len(year_df)
    mortality_pct = (year_df['egresar_fallecido'].sum() / n_total * 100) if n_total > 0 else 0
    mean_los = year_df['dias_estancia'].mean()
    mean_age = year_df['edad'].mean()

    row = {
        'Year': year,
        'Total': n_total,
        'Mortality %': mortality_pct,
        'Mean Age': mean_age,
        'Mean LOS': mean_los
    }

    # Add per-site counts
    for site in sorted(df_head_neck['cancer_site_group'].unique()):
        site_count = len(year_df[year_df['cancer_site_group'] == site])
        row[f'{site} N'] = site_count

    table8_data.append(row)

table8_df = pd.DataFrame(table8_data)
show_table(table8_df, "Table 8: Yearly Trends (Aggregate)", index=False)

Table 8: Yearly Trends (Aggregate)

Year	Total	Mortality %	Mean Age	Mean LOS	Hypopharynx N	Larynx N	Nasopharynx N	Oral Cavity N	Oropharynx N	Other Pharynx N	Salivary Glands N	Sinonasal N
2019	1716	5.36	62.66	8.27	63	450	61	571	183	50	173	165
2020	1234	6.48	62.91	11.49	55	343	48	387	108	35	126	132
2021	1370	7.81	62.56	11.12	60	408	47	418	126	54	135	122
2022	1505	6.51	63.22	11.60	71	407	45	488	160	49	155	130
2023	1623	6.28	64.38	10.26	77	504	50	505	164	38	125	160
2024	1963	6.27	63.57	10.77	74	513	117	632	210	55	184	178

Code

# Create visualization
fig, axes = plt.subplots(2, 3, figsize=(16, 10))

# 1. Age distribution
axes[0, 0].hist(df_head_neck['edad'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0, 0].axvline(df_head_neck['edad'].mean(), color='red', linestyle='--', linewidth=2,
                    label=f'Mean: {df_head_neck["edad"].mean():.1f}')
axes[0, 0].set_xlabel('Age (years)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Age Distribution')
axes[0, 0].legend()

# 2. Length of stay distribution
axes[0, 1].hist(df_head_neck['dias_estancia'], bins=50, color='forestgreen', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(df_head_neck['dias_estancia'].mean(), color='red', linestyle='--', linewidth=2,
                    label=f'Mean: {df_head_neck["dias_estancia"].mean():.1f}')
axes[0, 1].set_xlabel('Length of stay (days)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Length of Stay Distribution')
axes[0, 1].legend()

# 3. Mortality by sex
mortality_by_sex = df_head_neck.groupby('SEXO')['egresar_fallecido'].mean() * 100
axes[0, 2].bar(mortality_by_sex.index, mortality_by_sex.values, color=['#FF6B6B', '#4ECDC4'])
axes[0, 2].set_ylabel('Mortality rate (%)')
axes[0, 2].set_title('Mortality by Sex')
for i, v in enumerate(mortality_by_sex.values):
    axes[0, 2].text(i, v + 0.5, f'{v:.1f}%', ha='center', fontweight='bold')

# 4. Records by year
year_counts = df_head_neck['year'].value_counts().sort_index()
axes[1, 0].plot(year_counts.index, year_counts.values, marker='o', linewidth=2, markersize=8, color='darkviolet')
axes[1, 0].fill_between(year_counts.index, year_counts.values, alpha=0.3, color='darkviolet')
axes[1, 0].set_xlabel('Year')
axes[1, 0].set_ylabel('Number of records')
axes[1, 0].set_title('Records by Year')
axes[1, 0].grid(True, alpha=0.3)

# 5. Cancer site distribution
site_counts = df_head_neck['cancer_site_group'].value_counts()
axes[1, 1].barh(site_counts.index, site_counts.values, color='coral')
axes[1, 1].set_xlabel('Number of cases')
axes[1, 1].set_title('Cases by Cancer Site')
for i, v in enumerate(site_counts.values):
    axes[1, 1].text(v + 50, i, f'{v:,.0f}', va='center')

# 6. Service category
service_counts = df_head_neck['service_category'].value_counts()
colors_pie = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']
axes[1, 2].pie(service_counts.values, labels=service_counts.index, autopct='%1.1f%%', startangle=90)
axes[1, 2].set_title('Service Category Distribution')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '01_descriptive_stats.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-descriptive-stats.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 1: Figure 1: Descriptive Distribution of Head and Neck Cancer Hospitalizations

5 4.1 Age Distribution by Cancer Site

Code

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Violin plot
ax = axes[0]
for i, site in enumerate(df_head_neck['cancer_site_group'].unique()):
    data = df_head_neck[df_head_neck['cancer_site_group'] == site]['edad']
    positions = [i]
    ax.violinplot([data], positions=positions, widths=0.7, showmeans=True, showmedians=True)
ax.set_xticks(range(len(df_head_neck['cancer_site_group'].unique())))
ax.set_xticklabels(df_head_neck['cancer_site_group'].unique(), rotation=30, ha='right', fontsize=9)
ax.set_ylabel('Age (years)')
ax.set_title('Age Distribution by Cancer Site (Violin Plot)')
ax.grid(True, alpha=0.3, axis='y')

# Box plot
ax = axes[1]
data_by_site = [df_head_neck[df_head_neck['cancer_site_group'] == site]['edad'].values
                 for site in df_head_neck['cancer_site_group'].unique()]
bp = ax.boxplot(data_by_site, labels=df_head_neck['cancer_site_group'].unique(), patch_artist=True)
for patch in bp['boxes']:
    patch.set_facecolor('lightblue')
ax.set_ylabel('Age (years)')
ax.set_title('Age Distribution by Cancer Site (Box Plot)')
plt.setp(ax.get_xticklabels(), rotation=30, ha='right', fontsize=9)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '02_age_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-age-by-site.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 2: Figure 2. Age Distribution by Cancer Site

6 4.2 Comorbidity Prevalence Tables

Code

# Table: Comorbidity prevalence - ALL patients
comorb_all = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100)
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_all.append(row)

comorb_all_df = pd.DataFrame(comorb_all)
show_table(comorb_all_df, "Table 13: Comorbidity Prevalence - All Patients (%)", index=False)

Table 13: Comorbidity Prevalence - All Patients (%)

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	36.8%	0.2%	4.0%	1.5%	3.8%	0.8%	1.5%	0.2%	14.2%	0.2%	1.5%	9.5%	0.2%	16.0%	6.2%	1.2%	0.0%	2.0%	1.0%	0.0%	0.2%	1.5%	4.5%	1.8%	0.0%	1.0%	1.0%	0.0%	0.8%	0.0%	0.8%	0.0%	0.0%	1.2%	0.5%	1.8%	0.5%	2.0%	0.0%	1.0%	6.5%	16.8%	0.0%	1.5%	0.5%	0.5%	0.0%
Larynx	45.8%	0.5%	5.8%	3.5%	6.9%	2.9%	3.1%	0.3%	19.8%	0.2%	5.0%	8.4%	0.5%	11.2%	13.8%	3.6%	0.0%	1.8%	1.0%	0.5%	1.0%	6.1%	5.2%	1.1%	0.3%	0.9%	1.3%	0.1%	0.2%	0.6%	1.1%	0.0%	0.0%	1.3%	0.7%	1.6%	2.1%	2.3%	0.3%	0.9%	4.5%	24.3%	0.0%	0.6%	0.3%	0.1%	0.0%
Nasopharynx	28.3%	0.0%	2.2%	1.6%	3.0%	1.6%	4.1%	0.0%	12.5%	0.3%	7.1%	9.0%	0.5%	9.5%	0.8%	3.5%	0.0%	0.5%	1.1%	0.0%	0.3%	5.4%	3.3%	0.5%	0.3%	0.3%	0.3%	0.0%	0.0%	0.3%	0.3%	0.3%	0.0%	0.0%	0.3%	1.1%	0.8%	0.5%	0.0%	3.8%	4.1%	7.9%	0.0%	1.4%	0.0%	0.0%	0.0%
Oral Cavity	43.9%	0.5%	3.8%	2.2%	5.6%	2.4%	2.1%	0.4%	18.7%	0.4%	5.2%	11.4%	0.6%	12.3%	5.1%	1.8%	0.1%	0.7%	0.7%	0.3%	0.3%	4.5%	4.6%	1.0%	0.2%	1.4%	0.6%	0.3%	0.2%	0.4%	0.5%	0.1%	0.1%	1.7%	0.7%	1.2%	1.9%	3.0%	0.2%	0.7%	4.8%	14.8%	0.0%	0.7%	0.1%	0.0%	0.2%
Oropharynx	37.0%	0.1%	2.4%	1.1%	3.2%	2.3%	2.7%	0.1%	13.7%	0.3%	4.2%	7.0%	0.1%	8.7%	3.7%	1.8%	0.1%	1.1%	0.7%	0.2%	0.3%	3.0%	2.5%	0.6%	0.2%	1.5%	0.8%	0.0%	0.0%	0.2%	0.3%	0.1%	0.0%	0.4%	0.6%	2.2%	2.0%	2.3%	0.6%	1.2%	4.8%	13.2%	0.1%	2.1%	0.2%	0.0%	0.0%
Other Pharynx	33.5%	0.4%	2.1%	2.5%	5.0%	3.6%	3.9%	0.7%	15.3%	0.0%	2.8%	6.0%	0.0%	9.3%	5.0%	1.1%	0.0%	1.4%	1.1%	0.0%	3.2%	3.6%	4.3%	1.4%	0.0%	1.1%	0.4%	0.0%	0.0%	1.1%	1.8%	0.4%	0.4%	0.7%	0.0%	3.6%	1.8%	2.1%	0.4%	0.7%	1.8%	11.0%	0.0%	1.4%	0.4%	0.0%	0.0%
Salivary Glands	37.5%	0.9%	3.0%	2.2%	4.7%	1.7%	0.9%	0.3%	20.0%	0.2%	8.0%	10.5%	0.7%	11.4%	4.1%	3.8%	0.0%	0.2%	0.8%	0.2%	0.8%	4.2%	4.2%	0.8%	0.1%	0.8%	0.2%	0.1%	0.0%	0.2%	0.4%	0.2%	0.0%	0.8%	1.2%	1.8%	1.3%	3.1%	0.4%	1.0%	2.0%	9.2%	0.0%	0.6%	0.6%	0.0%	0.0%
Sinonasal	33.0%	0.6%	1.2%	1.4%	1.7%	2.1%	0.8%	0.5%	16.0%	0.2%	6.4%	7.3%	0.2%	8.5%	1.8%	2.6%	0.0%	0.0%	0.6%	0.0%	0.7%	1.7%	2.0%	0.8%	0.2%	0.9%	0.2%	0.0%	0.0%	0.3%	0.3%	0.0%	0.0%	0.9%	0.7%	1.6%	1.1%	2.0%	0.1%	1.2%	3.3%	9.5%	0.0%	0.5%	0.0%	0.0%	0.0%

Code

# Table: Comorbidity prevalence - DISCHARGED patients
comorb_discharged = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 0)]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_discharged.append(row)

comorb_discharged_df = pd.DataFrame(comorb_discharged)
show_table(comorb_discharged_df, "Table 14: Comorbidity Prevalence - Discharged Patients (%)", index=False)

Table 14: Comorbidity Prevalence - Discharged Patients (%)

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	36.8%	0.3%	3.5%	1.1%	2.7%	0.5%	1.6%	0.3%	14.4%	0.3%	1.6%	9.5%	0.3%	15.5%	5.7%	1.1%	0.0%	1.9%	1.1%	0.0%	0.3%	1.4%	4.9%	1.9%	0.0%	1.1%	0.8%	0.0%	0.8%	0.0%	0.5%	0.0%	0.0%	1.4%	0.3%	1.6%	0.5%	1.9%	0.0%	0.8%	7.1%	17.2%	0.0%	1.6%	0.3%	0.5%	0.0%
Larynx	45.8%	0.5%	5.6%	3.1%	5.5%	2.8%	2.7%	0.4%	19.4%	0.2%	5.2%	8.6%	0.4%	11.1%	13.4%	3.7%	0.0%	1.5%	0.9%	0.4%	1.0%	4.7%	4.6%	0.9%	0.3%	0.8%	1.0%	0.0%	0.2%	0.6%	0.9%	0.0%	0.0%	1.3%	0.7%	1.4%	2.0%	2.2%	0.3%	0.9%	4.4%	24.6%	0.0%	0.7%	0.2%	0.1%	0.0%
Nasopharynx	27.8%	0.0%	2.0%	1.7%	2.6%	1.2%	3.5%	0.0%	12.5%	0.3%	7.5%	9.6%	0.3%	9.3%	0.9%	3.8%	0.0%	0.3%	1.2%	0.0%	0.3%	4.6%	3.5%	0.6%	0.3%	0.3%	0.3%	0.0%	0.0%	0.3%	0.3%	0.3%	0.0%	0.0%	0.0%	1.2%	0.9%	0.6%	0.0%	4.1%	4.3%	7.2%	0.0%	1.4%	0.0%	0.0%	0.0%
Oral Cavity	43.6%	0.5%	3.7%	2.0%	4.7%	2.3%	1.8%	0.4%	18.6%	0.5%	5.3%	11.5%	0.5%	12.0%	5.0%	1.6%	0.1%	0.7%	0.6%	0.2%	0.3%	3.8%	4.3%	0.9%	0.2%	1.2%	0.6%	0.3%	0.2%	0.5%	0.5%	0.1%	0.1%	1.6%	0.7%	1.2%	1.8%	2.9%	0.2%	0.7%	4.7%	14.7%	0.0%	0.7%	0.1%	0.0%	0.2%
Oropharynx	36.8%	0.1%	2.5%	0.8%	2.5%	2.1%	2.7%	0.1%	13.4%	0.3%	4.2%	7.1%	0.0%	9.0%	3.2%	1.8%	0.1%	1.1%	0.6%	0.2%	0.3%	2.2%	2.2%	0.6%	0.2%	1.1%	0.8%	0.0%	0.0%	0.2%	0.3%	0.1%	0.0%	0.4%	0.7%	2.2%	1.9%	2.1%	0.4%	1.2%	4.9%	13.4%	0.1%	2.0%	0.1%	0.0%	0.0%
Other Pharynx	31.9%	0.0%	2.4%	2.0%	4.0%	3.2%	3.6%	0.4%	15.5%	0.0%	2.8%	6.8%	0.0%	10.0%	4.4%	1.2%	0.0%	1.2%	1.2%	0.0%	3.6%	2.0%	4.0%	1.2%	0.0%	1.2%	0.0%	0.0%	0.0%	1.2%	1.6%	0.4%	0.4%	0.8%	0.0%	3.2%	1.6%	1.6%	0.0%	0.8%	1.6%	10.4%	0.0%	1.6%	0.4%	0.0%	0.0%
Salivary Glands	37.8%	0.8%	2.7%	1.9%	3.8%	1.4%	0.9%	0.4%	20.3%	0.1%	8.3%	10.7%	0.7%	11.4%	4.2%	4.0%	0.0%	0.1%	0.8%	0.2%	0.8%	3.3%	4.0%	0.7%	0.1%	0.7%	0.1%	0.1%	0.0%	0.2%	0.5%	0.2%	0.0%	0.7%	1.3%	1.9%	1.3%	3.2%	0.5%	1.1%	1.9%	9.3%	0.0%	0.6%	0.6%	0.0%	0.0%
Sinonasal	32.9%	0.5%	1.3%	1.3%	1.7%	1.8%	0.8%	0.4%	15.7%	0.2%	6.3%	7.3%	0.1%	8.5%	1.8%	2.7%	0.0%	0.0%	0.6%	0.0%	0.7%	1.5%	2.0%	0.8%	0.2%	0.8%	0.2%	0.0%	0.0%	0.4%	0.4%	0.0%	0.0%	0.9%	0.7%	1.5%	1.1%	2.0%	0.1%	1.3%	3.3%	9.3%	0.0%	0.5%	0.0%	0.0%	0.0%

Code

# Table: Comorbidity prevalence - DECEASED patients
comorb_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 1)]
    row = {'Cancer Site': site}
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row[format_name(col)] = f'{prev:.1f}%'
    comorb_deceased.append(row)

comorb_deceased_df = pd.DataFrame(comorb_deceased)
show_table(comorb_deceased_df, "Table 15: Comorbidity Prevalence - Deceased Patients (%)", index=False)

Table 15: Comorbidity Prevalence - Deceased Patients (%)

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	36.4%	0.0%	9.1%	6.1%	15.2%	3.0%	0.0%	0.0%	12.1%	0.0%	0.0%	9.1%	0.0%	21.2%	12.1%	3.0%	0.0%	3.0%	0.0%	0.0%	0.0%	3.0%	0.0%	0.0%	0.0%	0.0%	3.0%	0.0%	0.0%	0.0%	3.0%	0.0%	0.0%	0.0%	3.0%	3.0%	0.0%	3.0%	0.0%	3.0%	0.0%	12.1%	0.0%	0.0%	3.0%	0.0%	0.0%
Larynx	46.0%	1.0%	7.6%	8.6%	24.7%	5.1%	8.1%	0.0%	24.2%	0.0%	2.0%	6.1%	1.5%	12.1%	18.7%	3.0%	0.0%	5.1%	2.5%	1.5%	1.0%	23.2%	12.1%	3.0%	0.5%	2.0%	4.5%	1.0%	0.0%	0.0%	4.0%	0.0%	0.5%	2.0%	0.5%	3.0%	3.0%	3.5%	0.0%	0.0%	6.1%	19.7%	0.0%	0.5%	0.5%	0.0%	0.0%
Nasopharynx	34.8%	0.0%	4.3%	0.0%	8.7%	8.7%	13.0%	0.0%	13.0%	0.0%	0.0%	0.0%	4.3%	13.0%	0.0%	0.0%	0.0%	4.3%	0.0%	0.0%	0.0%	17.4%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	4.3%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	17.4%	0.0%	0.0%	0.0%	0.0%	0.0%
Oral Cavity	47.7%	1.1%	5.7%	4.5%	20.5%	5.1%	7.4%	0.6%	19.3%	0.0%	2.8%	9.7%	2.3%	17.0%	6.8%	4.0%	0.0%	1.1%	3.4%	1.1%	0.6%	15.9%	9.1%	2.8%	0.0%	4.0%	1.7%	0.6%	0.0%	0.0%	1.1%	0.6%	0.0%	3.4%	0.6%	1.7%	4.0%	3.4%	0.6%	0.6%	5.7%	16.5%	0.0%	1.1%	0.0%	0.0%	0.0%
Oropharynx	40.0%	0.0%	1.8%	5.5%	14.5%	5.5%	3.6%	0.0%	18.2%	0.0%	3.6%	5.5%	1.8%	3.6%	10.9%	1.8%	0.0%	0.0%	3.6%	0.0%	0.0%	16.4%	7.3%	1.8%	0.0%	7.3%	1.8%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	1.8%	3.6%	5.5%	3.6%	0.0%	3.6%	10.9%	0.0%	3.6%	1.8%	0.0%	0.0%
Other Pharynx	46.7%	3.3%	0.0%	6.7%	13.3%	6.7%	6.7%	3.3%	13.3%	0.0%	3.3%	0.0%	0.0%	3.3%	10.0%	0.0%	0.0%	3.3%	0.0%	0.0%	0.0%	16.7%	6.7%	3.3%	0.0%	0.0%	3.3%	0.0%	0.0%	0.0%	3.3%	0.0%	0.0%	0.0%	0.0%	6.7%	3.3%	6.7%	3.3%	0.0%	3.3%	16.7%	0.0%	0.0%	0.0%	0.0%	0.0%
Salivary Glands	32.6%	2.2%	8.7%	8.7%	21.7%	6.5%	0.0%	0.0%	15.2%	2.2%	2.2%	6.5%	0.0%	10.9%	2.2%	0.0%	0.0%	2.2%	0.0%	0.0%	0.0%	21.7%	8.7%	2.2%	0.0%	2.2%	2.2%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	2.2%	0.0%	0.0%	2.2%	2.2%	0.0%	0.0%	4.3%	8.7%	0.0%	0.0%	0.0%	0.0%	0.0%
Sinonasal	36.6%	2.4%	0.0%	2.4%	2.4%	9.8%	0.0%	2.4%	22.0%	0.0%	9.8%	7.3%	2.4%	7.3%	2.4%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	4.9%	2.4%	0.0%	0.0%	2.4%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	0.0%	2.4%	2.4%	2.4%	0.0%	0.0%	2.4%	12.2%	0.0%	0.0%	0.0%	0.0%	0.0%

Code

# Table: Difference (Deceased - Discharged) with chi-square
from scipy.stats import chi2_contingency

comorb_diff = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_discharged = df_site[df_site['egresar_fallecido'] == 0]
        df_deceased = df_site[df_site['egresar_fallecido'] == 1]

        prev_discharged = (df_discharged[col].mean() * 100) if len(df_discharged) > 0 else 0
        prev_deceased = (df_deceased[col].mean() * 100) if len(df_deceased) > 0 else 0

        diff = prev_deceased - prev_discharged

        # Chi-square test
        contingency_table = pd.crosstab(df_site['egresar_fallecido'], df_site[col])
        try:
            chi2, pval, dof, expected = chi2_contingency(contingency_table)
            sig_stars = ''
            if pval < 0.001:
                sig_stars = '***'
            elif pval < 0.01:
                sig_stars = '**'
            elif pval < 0.05:
                sig_stars = '*'
        except:
            sig_stars = ''

        row[format_name(col)] = f'{diff:+.1f}%{sig_stars}'

    comorb_diff.append(row)

comorb_diff_df = pd.DataFrame(comorb_diff)
show_table(comorb_diff_df, "Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %)\n* p<0.05, ** p<0.01, *** p<0.001", index=False)

Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %) * p<0.05, p<0.01, * p<0.001

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	-0.4%	-0.3%	+5.5%	+5.0%	+12.4%**	+2.5%	-1.6%	-0.3%	-2.3%	-0.3%	-1.6%	-0.4%	-0.3%	+5.7%	+6.4%	+1.9%	+0.0%	+1.1%	-1.1%	+0.0%	-0.3%	+1.7%	-4.9%	-1.9%	+0.0%	-1.1%	+2.2%	+0.0%	-0.8%	+0.0%	+2.5%	+0.0%	+0.0%	-1.4%	+2.8%	+1.4%	-0.5%	+1.1%	+0.0%	+2.2%	-7.1%	-5.0%	+0.0%	-1.6%	+2.8%	-0.5%	+0.0%
Larynx	+0.2%	+0.6%	+1.9%	+5.5%***	+19.3%***	+2.3%	+5.4%***	-0.4%	+4.8%	-0.2%	-3.2%	-2.6%	+1.1%	+1.0%	+5.3%*	-0.6%	-0.0%	+3.6%***	+1.6%	+1.1%	+0.0%	+18.5%***	+7.5%***	+2.1%*	+0.2%	+1.2%	+3.5%***	+1.0%***	-0.2%	-0.6%	+3.1%***	+0.0%	+0.5%	+0.7%	-0.2%	+1.6%	+1.0%	+1.3%	-0.3%	-0.9%	+1.7%	-4.9%	-0.0%	-0.2%	+0.3%	-0.1%	-0.0%
Nasopharynx	+7.0%	+0.0%	+2.3%	-1.7%	+6.1%	+7.5%	+9.6%	+0.0%	+0.6%	-0.3%	-7.5%	-9.6%	+4.1%	+3.8%	-0.9%	-3.8%	+0.0%	+4.1%	-1.2%	+0.0%	-0.3%	+12.8%*	-3.5%	-0.6%	-0.3%	-0.3%	-0.3%	+0.0%	+0.0%	-0.3%	-0.3%	-0.3%	+0.0%	+0.0%	+4.3%	-1.2%	-0.9%	-0.6%	+0.0%	-4.1%	-4.3%	+10.1%	+0.0%	-1.4%	+0.0%	+0.0%	+0.0%
Oral Cavity	+4.1%	+0.7%	+2.0%	+2.5%*	+15.7%***	+2.8%*	+5.6%***	+0.2%	+0.7%	-0.5%	-2.5%	-1.9%	+1.8%*	+5.1%	+1.8%	+2.3%*	-0.1%	+0.4%	+2.8%***	+0.9%	+0.3%	+12.1%***	+4.8%**	+1.9%*	-0.2%	+2.8%**	+1.1%	+0.3%	-0.2%	-0.5%	+0.6%	+0.5%	-0.1%	+1.8%	-0.1%	+0.5%	+2.2%	+0.5%	+0.4%	-0.1%	+1.0%	+1.8%	+0.0%	+0.4%	-0.1%	+0.0%	-0.2%
Oropharynx	+3.2%	-0.1%	-0.6%	+4.7%**	+12.1%***	+3.3%	+1.0%	-0.1%	+4.8%	-0.3%	-0.6%	-1.7%	+1.8%	-5.4%	+7.7%*	+0.0%	-0.1%	-1.1%	+3.1%	-0.2%	-0.3%	+14.1%***	+5.0%	+1.3%	-0.2%	+6.2%**	+1.0%	+0.0%	+0.0%	-0.2%	-0.3%	-0.1%	+0.0%	-0.4%	-0.7%	-0.4%	+1.7%	+3.3%	+3.2%*	-1.2%	-1.3%	-2.5%	-0.1%	+1.6%	+1.7%	+0.0%	+0.0%
Other Pharynx	+14.8%	+3.3%	-2.4%	+4.7%	+9.3%	+3.5%	+3.1%	+2.9%	-2.2%	+0.0%	+0.5%	-6.8%	+0.0%	-6.6%	+5.6%	-1.2%	+0.0%	+2.1%	-1.2%	+0.0%	-3.6%	+14.7%***	+2.7%	+2.1%	+0.0%	-1.2%	+3.3%	+0.0%	+0.0%	-1.2%	+1.7%	-0.4%	-0.4%	-0.8%	+0.0%	+3.5%	+1.7%	+5.1%	+3.3%	-0.8%	+1.7%	+6.3%	+0.0%	-1.6%	-0.4%	+0.0%	+0.0%
Salivary Glands	-5.2%	+1.4%	+6.0%	+6.8%*	+18.0%***	+5.1%*	-0.9%	-0.4%	-5.1%	+2.1%	-6.2%	-4.2%	-0.7%	-0.5%	-2.1%	-4.0%	+0.0%	+2.1%	-0.8%	-0.2%	-0.8%	+18.5%***	+4.7%	+1.5%	-0.1%	+1.5%	+2.1%	-0.1%	+0.0%	-0.2%	-0.5%	-0.2%	+0.0%	+1.5%	-1.3%	-1.9%	+0.9%	-1.0%	-0.5%	-1.1%	+2.5%	-0.6%	+0.0%	-0.6%	-0.6%	+0.0%	+0.0%
Sinonasal	+3.7%	+2.0%	-1.3%	+1.1%	+0.8%	+8.0%**	-0.8%	+2.1%	+6.2%	-0.2%	+3.5%	-0.0%	+2.3%	-1.2%	+0.7%	-2.7%	+0.0%	+0.0%	-0.6%	+0.0%	-0.7%	+3.3%	+0.4%	-0.8%	-0.2%	+1.6%	-0.2%	+0.0%	+0.0%	-0.4%	-0.4%	+0.0%	+0.0%	-0.9%	-0.7%	+0.9%	+1.4%	+0.4%	-0.1%	-1.3%	-0.9%	+2.9%	+0.0%	-0.5%	+0.0%	+0.0%	+0.0%

Code

# Prepare data for heatmap
heatmap_data_all = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = []
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100)
        row.append(prev)
    heatmap_data_all.append(row)

heatmap_df_all = pd.DataFrame(heatmap_data_all,
                               index=sorted(df_head_neck['cancer_site_group'].unique()),
                               columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_df_all, annot=True, fmt='.1f', cmap='viridis',
            cbar_kws={'label': 'Prevalence (%)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Comorbidity Prevalence by Cancer Site (All Patients)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '03_comorb_heatmap_all.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-comorb-heatmap-all.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 3: Figure 3. Heatmap: Comorbidity Prevalence (All Patients)

Code

# Prepare data for heatmap
heatmap_data_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[(df_head_neck['cancer_site_group'] == site) & (df_head_neck['egresar_fallecido'] == 1)]
    row = []
    for col in comorbidity_cols:
        prev = (df_site[col].mean() * 100) if len(df_site) > 0 else 0
        row.append(prev)
    heatmap_data_deceased.append(row)

heatmap_df_deceased = pd.DataFrame(heatmap_data_deceased,
                                    index=sorted(df_head_neck['cancer_site_group'].unique()),
                                    columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_df_deceased, annot=True, fmt='.1f', cmap='viridis',
            cbar_kws={'label': 'Prevalence (%)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Comorbidity Prevalence by Cancer Site (Deceased Patients)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '04_comorb_heatmap_deceased.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-comorb-heatmap-deceased.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 4: Figure 4. Heatmap: Comorbidity Prevalence (Deceased Patients)

7 4.3 Length of Stay Analysis

Code

# Table: LOS by cancer site and discharge status
los_summary = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]

    for status in ['Discharged', 'Deceased']:
        status_code = 0 if status == 'Discharged' else 1
        df_status = df_site[df_site['egresar_fallecido'] == status_code]

        if len(df_status) > 0:
            los_summary.append({
                'Cancer Site': site,
                'Status': status,
                'N': len(df_status),
                'Mean': df_status['dias_estancia'].mean(),
                'Median': df_status['dias_estancia'].median(),
                'SD': df_status['dias_estancia'].std(),
                'Min': df_status['dias_estancia'].min(),
                'Max': df_status['dias_estancia'].max()
            })

los_summary_df = pd.DataFrame(los_summary)
show_table(los_summary_df, "Table 17: Length of Stay by Cancer Site and Discharge Status", index=False)

Table 17: Length of Stay by Cancer Site and Discharge Status

Cancer Site	Status	N	Mean	Median	SD	Min	Max
Hypopharynx	Discharged	367	12.88	7.00	17.16	0.00	171.00
Hypopharynx	Deceased	33	23.00	13.00	32.18	0.00	164.00
Larynx	Discharged	2427	13.67	7.00	20.08	0.00	317.00
Larynx	Deceased	198	24.63	12.50	35.82	0.00	250.00
Nasopharynx	Discharged	345	7.73	4.00	14.79	0.00	155.00
Nasopharynx	Deceased	23	27.48	13.00	41.43	1.00	138.00
Oral Cavity	Discharged	2825	9.22	4.00	14.98	0.00	216.00
Oral Cavity	Deceased	176	15.80	6.00	30.88	0.00	271.00
Oropharynx	Discharged	896	7.87	4.00	12.48	0.00	105.00
Oropharynx	Deceased	55	16.96	9.00	24.96	0.00	141.00
Other Pharynx	Discharged	251	10.04	5.00	14.70	0.00	96.00
Other Pharynx	Deceased	30	17.23	5.50	23.66	0.00	106.00
Salivary Glands	Discharged	852	6.41	3.00	10.75	0.00	136.00
Salivary Glands	Deceased	46	16.93	9.00	20.34	0.00	100.00
Sinonasal	Discharged	846	6.27	3.00	10.99	0.00	151.00
Sinonasal	Deceased	41	15.20	11.00	14.19	0.00	51.00

Code

# Table: Mean LOS by cancer site and comorbidity presence
los_by_comorb = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        row[format_name(col)] = f'{los_with:.1f}'

    los_by_comorb.append(row)

los_by_comorb_df = pd.DataFrame(los_by_comorb)
show_table(los_by_comorb_df, "Table 18: Mean Length of Stay When Comorbidity Present (days)", index=False)

Table 18: Mean Length of Stay When Comorbidity Present (days)

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	15.1	8.0	19.1	18.3	15.3	8.0	45.0	4.0	14.6	1.0	10.2	14.2	30.0	19.0	19.6	17.4	0.0	41.1	32.2	0.0	108.0	20.7	21.8	32.3	0.0	12.2	18.0	0.0	14.0	0.0	22.3	0.0	0.0	15.0	18.5	25.4	5.5	33.2	0.0	13.8	24.3	19.4	0.0	12.7	48.0	5.5	0.0
Larynx	15.1	23.3	18.2	18.4	24.7	25.3	23.6	9.9	14.2	17.0	16.2	12.3	51.0	13.7	19.7	13.6	5.0	34.6	11.1	17.4	12.3	24.1	19.9	24.6	16.6	16.5	21.0	11.0	7.8	23.0	37.3	0.0	2.0	15.5	10.8	18.9	19.4	25.3	19.4	25.6	21.8	17.9	14.0	16.9	18.9	5.7	14.0
Nasopharynx	11.1	0.0	4.0	8.8	9.7	36.7	19.8	0.0	15.4	7.0	7.2	8.8	10.5	6.7	23.0	4.2	0.0	8.5	4.0	0.0	0.0	17.0	7.8	7.0	1.0	2.0	12.0	0.0	0.0	3.0	1.0	7.0	0.0	0.0	6.0	10.5	18.7	78.5	0.0	4.9	4.1	17.0	0.0	4.8	0.0	0.0	0.0
Oral Cavity	10.0	12.7	11.7	12.2	15.9	10.9	17.9	11.3	10.4	6.0	12.7	9.1	16.9	10.4	12.8	10.7	24.5	20.5	21.4	8.4	38.1	17.6	13.9	14.9	9.3	11.5	13.2	16.3	6.2	4.8	13.6	6.0	8.0	9.7	12.6	12.2	9.6	16.2	21.3	25.3	13.2	12.7	0.0	17.5	4.5	0.0	6.2
Oropharynx	9.7	12.0	14.8	18.3	14.7	13.2	19.3	7.0	9.0	7.3	8.6	6.9	27.0	8.5	20.9	17.3	12.0	15.2	21.7	5.0	11.0	18.7	8.2	10.2	6.0	25.7	28.8	0.0	0.0	12.0	26.3	1.0	0.0	37.0	13.3	12.3	12.9	21.6	21.5	15.6	12.5	11.8	55.0	5.2	44.5	0.0	0.0
Other Pharynx	11.5	26.0	16.7	18.1	28.7	17.7	24.8	19.0	13.1	0.0	8.8	7.6	0.0	12.3	14.6	2.7	0.0	24.2	6.0	0.0	5.8	13.7	18.5	10.2	0.0	6.7	4.0	0.0	0.0	5.7	10.6	2.0	7.0	26.0	0.0	17.2	13.4	15.2	6.0	36.0	22.2	16.5	0.0	3.5	12.0	0.0	0.0
Salivary Glands	7.8	12.8	13.5	12.6	17.2	13.5	11.2	7.0	7.3	13.5	6.0	6.3	3.7	6.4	11.1	7.1	0.0	18.5	14.0	20.0	6.9	15.5	10.8	6.4	6.0	13.1	19.5	8.0	0.0	5.0	19.0	2.0	0.0	10.6	12.8	4.9	12.0	10.2	2.0	2.3	7.8	9.7	0.0	10.6	4.8	0.0	0.0
Sinonasal	8.2	17.6	6.9	15.1	18.3	16.2	16.4	14.5	7.7	0.0	7.8	6.1	16.0	6.7	6.8	6.8	0.0	0.0	8.6	0.0	7.2	38.1	12.0	11.3	2.0	10.1	10.5	0.0	0.0	10.3	1.3	0.0	0.0	7.1	3.8	9.3	9.1	7.2	61.0	6.6	9.8	6.9	0.0	3.5	0.0	0.0	0.0

Code

# Table: LOS impact (difference WITH - WITHOUT comorbidity)
los_impact = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = {'Cancer Site': site}

    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        df_without = df_site[df_site[col] == 0]

        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        los_without = df_without['dias_estancia'].mean() if len(df_without) > 0 else 0

        difference = los_with - los_without
        row[format_name(col)] = f'{difference:+.1f}'

    los_impact.append(row)

los_impact_df = pd.DataFrame(los_impact)
show_table(los_impact_df, "Table 19: Length of Stay Impact by Comorbidity (Days Difference: WITH - WITHOUT)", index=False)

Table 19: Length of Stay Impact by Comorbidity (Days Difference: WITH - WITHOUT)

Cancer Site	CV Essential Hypertension	CV Complicated Hypertension	CV Coronary Disease	CV Heart Failure	CV Arrhythmias	CV Cerebrovascular Disease	CV Peripheral Vascular Disease	MET Type1 Diabetes	MET Type2 Diabetes	MET Complicated Diabetes	MET Obesity	MET Dyslipidemia	MET Metabolic Syndrome	MET Thyroid Disorders	RESP COPD	RESP Asthma	RESP Chronic Bronchitis	RESP Emphysema	RESP Pulmonary Fibrosis	RESP Pulmonary Hypertension	RESP Sleep Apnea	REN Acute Renal Failure	REN Chronic Renal Failure	REN End Stage Renal	REN Diabetic Nephropathy	HEP Steatosis	HEP Alcoholic Cirrhosis	HEP NonAlcoholic Cirrhosis	HEP Viral Hepatitis	GI Reflux	GI Peptic Ulcer	GI IBD	GI IBS	NEURO Dementia	NEURO Parkinson	NEURO Epilepsy	NEURO Stroke Sequelae	PSI Major Depression	PSI Bipolar	PSI Anxiety	SUST Alcohol	SUST Tobacco	SUST Opioids	INF HIV	INF Tuberculosis	INF Hepatitis C	INF Hepatitis B
Hypopharynx	+2.2	-5.7	+5.6	+4.7	+1.7	-5.8	+31.8	-9.7	+1.0	-12.7	-3.6	+0.6	+16.3	+6.3	+6.3	+3.7	-13.7	+28.0	+18.7	-13.7	+94.5	+7.1	+8.5	+18.9	-13.7	-1.5	+4.3	-13.7	+0.3	-13.7	+8.7	-13.7	-13.7	+1.3	+4.8	+11.9	-8.3	+19.9	-13.7	+0.0	+11.3	+6.8	-13.7	-1.1	+34.5	-8.3	-13.7
Larynx	+1.2	+8.9	+4.0	+4.0	+11.0	+11.1	+9.3	-4.6	-0.4	+2.5	+1.8	-2.3	+36.7	-0.9	+6.0	-0.9	-9.5	+20.5	-3.4	+2.9	-2.2	+10.2	+5.7	+10.2	+2.1	+2.0	+6.6	-3.5	-6.8	+8.6	+23.1	-14.5	-12.5	+1.1	-3.7	+4.5	+5.0	+11.1	+4.9	+11.2	+7.7	+4.5	-0.5	+2.4	+4.4	-8.8	-0.5
Nasopharynx	+3.0	-9.0	-5.1	-0.1	+0.8	+28.2	+11.3	-9.0	+7.4	-2.0	-2.0	-0.1	+1.5	-2.5	+14.1	-5.0	-9.0	-0.5	-5.0	-9.0	-9.0	+8.5	-1.2	-2.0	-8.0	-7.0	+3.0	-9.0	-9.0	-6.0	-8.0	-2.0	-9.0	-9.0	-3.0	+1.5	+9.8	+69.9	-9.0	-4.3	-5.1	+8.8	-9.0	-4.2	-9.0	-9.0	-9.0
Oral Cavity	+0.6	+3.1	+2.2	+2.7	+6.6	+1.3	+8.4	+1.7	+0.9	-3.6	+3.2	-0.6	+7.3	+0.9	+3.3	+1.1	+14.9	+11.0	+11.9	-1.2	+28.6	+8.3	+4.5	+5.3	-0.3	+1.9	+3.6	+6.7	-3.4	-4.8	+4.0	-3.6	-1.6	+0.1	+3.0	+2.6	-0.0	+6.8	+11.7	+15.8	+3.7	+3.7	-9.6	+7.9	-5.1	-9.6	-3.4
Oropharynx	+2.0	+3.6	+6.5	+10.0	+6.5	+4.9	+11.2	-1.4	+0.7	-1.1	+0.2	-1.6	+18.6	+0.1	+13.0	+9.1	+3.6	+6.9	+13.4	-3.4	+2.6	+10.7	-0.1	+1.8	-2.4	+17.6	+20.5	-8.4	-8.4	+3.6	+18.0	-7.4	-8.4	+28.7	+5.0	+4.0	+4.6	+13.5	+13.2	+7.3	+4.3	+3.9	+46.7	-3.3	+36.2	-8.4	-8.4
Other Pharynx	+1.0	+15.2	+6.0	+7.5	+18.8	+7.2	+14.6	+8.3	+2.7	-10.8	-2.1	-3.4	-10.8	+1.6	+4.0	-8.2	-10.8	+13.6	-4.9	-10.8	-5.2	+3.0	+8.0	-0.6	-10.8	-4.2	-6.8	-10.8	-10.8	-5.2	-0.2	-8.8	-3.8	+15.3	-10.8	+6.6	+2.6	+4.5	-4.8	+25.4	+11.6	+6.4	-10.8	-7.4	+1.2	-10.8	-10.8
Salivary Glands	+1.3	+5.9	+6.8	+5.8	+10.7	+6.7	+4.3	+0.1	+0.5	+6.6	-1.0	-0.7	-3.3	-0.6	+4.3	+0.1	-6.9	+11.6	+7.1	+13.1	-0.1	+8.9	+4.0	-0.5	-1.0	+6.2	+12.6	+1.1	-6.9	-2.0	+12.1	-5.0	-6.9	+3.6	+5.9	-2.1	+5.1	+3.4	-5.0	-4.7	+0.9	+3.0	-6.9	+3.7	-2.2	-6.9	-6.9
Sinonasal	+2.2	+11.0	+0.2	+8.5	+11.9	+9.7	+9.8	+7.9	+1.3	-6.7	+1.1	-0.6	+9.3	+0.0	+0.1	+0.1	-6.7	-6.7	+1.9	-6.7	+0.5	+32.0	+5.4	+4.6	-4.7	+3.5	+3.8	-6.7	-6.7	+3.7	-5.4	-6.7	-6.7	+0.4	-2.9	+2.6	+2.4	+0.5	+54.4	-0.0	+3.3	+0.2	-6.7	-3.2	-6.7	-6.7	-6.7

Code

fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Box plot
ax = axes[0]
data_discharged = []
data_deceased = []
labels = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    data_discharged.append(df_site[df_site['egresar_fallecido'] == 0]['dias_estancia'].values)
    data_deceased.append(df_site[df_site['egresar_fallecido'] == 1]['dias_estancia'].values)
    labels.append(site)

positions_discharged = np.arange(len(labels)) * 2.5
positions_deceased = positions_discharged + 1
bp1 = ax.boxplot(data_discharged, positions=positions_discharged, widths=0.8, patch_artist=True, label='Discharged')
bp2 = ax.boxplot(data_deceased, positions=positions_deceased, widths=0.8, patch_artist=True, label='Deceased')

for patch in bp1['boxes']:
    patch.set_facecolor('lightgreen')
for patch in bp2['boxes']:
    patch.set_facecolor('lightcoral')

ax.set_xticks((positions_discharged + positions_deceased) / 2)
ax.set_xticklabels(labels, rotation=30, ha='right', fontsize=9)
ax.set_ylabel('Length of Stay (days)')
ax.set_title('LOS Distribution by Site and Discharge Status')
ax.set_ylim([0, 100])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Bar plot of means
ax = axes[1]
means_discharged = []
means_deceased = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    means_discharged.append(df_site[df_site['egresar_fallecido'] == 0]['dias_estancia'].mean())
    means_deceased.append(df_site[df_site['egresar_fallecido'] == 1]['dias_estancia'].mean())

x = np.arange(len(labels))
width = 0.35
ax.bar(x - width/2, means_discharged, width, label='Discharged', color='lightgreen')
ax.bar(x + width/2, means_deceased, width, label='Deceased', color='lightcoral')
ax.set_ylabel('Mean Length of Stay (days)')
ax.set_xlabel('Cancer Site')
ax.set_title('Mean LOS by Site and Discharge Status')
ax.set_xticks(x)
ax.set_xticklabels(labels, rotation=30, ha='right', fontsize=9)
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '05_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-los-by-site.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 5: Figure 5. Length of Stay by Cancer Site and Discharge Status

Code

heatmap_los_impact = []
for site in sorted(df_head_neck['cancer_site_group'].unique()):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row = []
    for col in comorbidity_cols:
        df_with = df_site[df_site[col] == 1]
        df_without = df_site[df_site[col] == 0]
        los_with = df_with['dias_estancia'].mean() if len(df_with) > 0 else 0
        los_without = df_without['dias_estancia'].mean() if len(df_without) > 0 else 0
        difference = los_with - los_without
        row.append(difference)
    heatmap_los_impact.append(row)

heatmap_los_impact_df = pd.DataFrame(heatmap_los_impact,
                                      index=sorted(df_head_neck['cancer_site_group'].unique()),
                                      columns=[format_name(col) for col in comorbidity_cols])

fig, ax = plt.subplots(figsize=(22, 8))
sns.heatmap(heatmap_los_impact_df, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            cbar_kws={'label': 'LOS Difference (days)'}, ax=ax,
            annot_kws={'rotation': 90, 'fontsize': 8})
ax.set_title('Length of Stay Impact: WITH - WITHOUT Comorbidity (days)')
ax.set_xlabel('Comorbidity')
ax.set_ylabel('Cancer Site')
plt.setp(ax.get_xticklabels(), rotation=45, ha='right', rotation_mode='anchor', fontsize=8)
plt.setp(ax.get_yticklabels(), rotation=0)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '06_los_impact_heatmap.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-los-impact-heatmap.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 6: Figure 6. Heatmap: LOS Impact by Comorbidity and Cancer Site

8 4.4 Temporal Trends

Code

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

sites = sorted(df_head_neck['cancer_site_group'].unique())
colors = ['steelblue', 'forestgreen', 'coral', 'mediumpurple']

# Row 1: Cases by year per site
ax = axes[0, 0]
for site, color in zip(sites, colors):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    cases_by_year = df_site['year'].value_counts().sort_index()
    ax.plot(cases_by_year.index, cases_by_year.values, marker='o', linewidth=2, label=site, color=color)
ax.set_xlabel('Year')
ax.set_ylabel('Number of Cases')
ax.set_title('Trend: Cases by Year')
ax.legend()
ax.grid(True, alpha=0.3)

# Row 1: Mortality rate by year per site
ax = axes[0, 1]
for site, color in zip(sites, colors):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    mortality_by_year = df_site.groupby('year')['egresar_fallecido'].mean() * 100
    ax.plot(mortality_by_year.index, mortality_by_year.values, marker='o', linewidth=2, label=site, color=color)
ax.set_xlabel('Year')
ax.set_ylabel('Mortality Rate (%)')
ax.set_title('Trend: Mortality Rate by Year')
ax.legend()
ax.grid(True, alpha=0.3)

# Row 2: Mean LOS by year per site
ax = axes[1, 0]
for site, color in zip(sites, colors):
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    los_by_year = df_site.groupby('year')['dias_estancia'].mean()
    ax.plot(los_by_year.index, los_by_year.values, marker='o', linewidth=2, label=site, color=color)
ax.set_xlabel('Year')
ax.set_ylabel('Mean Length of Stay (days)')
ax.set_title('Trend: Mean LOS by Year')
ax.legend()
ax.grid(True, alpha=0.3)

# Row 2: Overall mortality trend
ax = axes[1, 1]
overall_mortality = df_head_neck.groupby('year')['egresar_fallecido'].mean() * 100
ax.plot(overall_mortality.index, overall_mortality.values, marker='o', linewidth=3, markersize=10, color='darkred')
ax.fill_between(overall_mortality.index, overall_mortality.values, alpha=0.3, color='darkred')
ax.set_xlabel('Year')
ax.set_ylabel('Mortality Rate (%)')
ax.set_title('Trend: Overall Mortality Rate')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, '07_temporal_trends.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-temporal-trends.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 7: Figure 7. Temporal Trends: Cases and Mortality by Year and Cancer Site

9 5. Bivariate Analysis

Code

from scipy.stats import chi2_contingency, ranksums

# Bivariate analysis: chi-square for categorical association with mortality
all_site_names = sorted(df_head_neck['cancer_site_group'].unique())
df_bivariate = []

for site in all_site_names:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]

    if len(df_site) < 30:
        continue

    for col in comorbidity_cols:
        # Chi-square test for mortality association
        contingency = pd.crosstab(df_site[col], df_site['egresar_fallecido'])
        try:
            chi2, pval_mort, dof, expected = chi2_contingency(contingency)
        except:
            pval_mort = np.nan

        # Wilcoxon rank-sum test (Mann-Whitney U) for LOS difference
        group_with = df_site[df_site[col] == 1]['dias_estancia'].dropna()
        group_without = df_site[df_site[col] == 0]['dias_estancia'].dropna()
        try:
            if len(group_with) >= 5 and len(group_without) >= 5:
                stat, pval_los = ranksums(group_with, group_without)
            else:
                pval_los = np.nan
        except:
            pval_los = np.nan

        df_bivariate.append({
            'Cancer Site': site,
            'Comorbidity': format_name(col),
            'Chi-square p-value': pval_mort,
            'LOS p-value (Wilcoxon)': pval_los,
            'N with': len(group_with),
            'N without': len(group_without)
        })

df_bivariate = pd.DataFrame(df_bivariate)
show_table(df_bivariate, "Table 20: Bivariate Analysis by Cancer Site", index=False)

Table 20: Bivariate Analysis by Cancer Site

Cancer Site	Comorbidity	Chi-square p-value	LOS p-value (Wilcoxon)	N with	N without
Hypopharynx	CV Essential Hypertension	1.00	1.00	147	253
Hypopharynx	CV Complicated Hypertension	1.00	—	1	399
Hypopharynx	CV Coronary Disease	0.27	0.55	16	384
Hypopharynx	CV Heart Failure	0.13	0.44	6	394
Hypopharynx	CV Arrhythmias	0.00	0.63	15	385
Hypopharynx	CV Cerebrovascular Disease	0.59	—	3	397
Hypopharynx	CV Peripheral Vascular Disease	1.00	0.04	6	394
Hypopharynx	MET Type1 Diabetes	1.00	—	1	399
Hypopharynx	MET Type2 Diabetes	0.92	0.67	57	343
Hypopharynx	MET Complicated Diabetes	1.00	—	1	399
Hypopharynx	MET Obesity	1.00	0.95	6	394
Hypopharynx	MET Dyslipidemia	1.00	0.82	38	362
Hypopharynx	MET Metabolic Syndrome	1.00	—	1	399
Hypopharynx	MET Thyroid Disorders	0.55	0.41	64	336
Hypopharynx	RESP COPD	0.28	0.00	25	375
Hypopharynx	RESP Asthma	0.89	0.26	5	395
Hypopharynx	RESP Chronic Bronchitis	1.00	—	0	400
Hypopharynx	RESP Emphysema	1.00	0.01	8	392
Hypopharynx	RESP Pulmonary Fibrosis	1.00	—	4	396
Hypopharynx	RESP Pulmonary Hypertension	1.00	—	0	400
Hypopharynx	RESP Sleep Apnea	1.00	—	1	399
Hypopharynx	REN Acute Renal Failure	0.99	0.05	6	394
Hypopharynx	REN Chronic Renal Failure	0.39	0.04	18	382
Hypopharynx	REN End Stage Renal	0.91	0.02	7	393
Hypopharynx	REN Diabetic Nephropathy	1.00	—	0	400
Hypopharynx	HEP Steatosis	1.00	—	4	396
Hypopharynx	HEP Alcoholic Cirrhosis	0.76	—	4	396
Hypopharynx	HEP NonAlcoholic Cirrhosis	1.00	—	0	400
Hypopharynx	HEP Viral Hepatitis	1.00	—	3	397
Hypopharynx	GI Reflux	1.00	—	0	400
Hypopharynx	GI Peptic Ulcer	0.59	—	3	397
Hypopharynx	GI IBD	1.00	—	0	400
Hypopharynx	GI IBS	1.00	—	0	400
Hypopharynx	NEURO Dementia	1.00	0.44	5	395
Hypopharynx	NEURO Parkinson	0.39	—	2	398
Hypopharynx	NEURO Epilepsy	1.00	0.02	7	393
Hypopharynx	NEURO Stroke Sequelae	1.00	—	2	398
Hypopharynx	PSI Major Depression	1.00	0.24	8	392
Hypopharynx	PSI Bipolar	1.00	—	0	400
Hypopharynx	PSI Anxiety	0.76	—	4	396
Hypopharynx	SUST Alcohol	0.23	0.00	26	374
Hypopharynx	SUST Tobacco	0.62	0.02	67	333
Hypopharynx	SUST Opioids	1.00	—	0	400
Hypopharynx	INF HIV	1.00	0.84	6	394
Hypopharynx	INF Tuberculosis	0.39	—	2	398
Hypopharynx	INF Hepatitis C	1.00	—	2	398
Hypopharynx	INF Hepatitis B	1.00	—	0	400
Larynx	CV Essential Hypertension	1.00	0.00	1202	1423
Larynx	CV Complicated Hypertension	0.58	0.01	13	2612
Larynx	CV Coronary Disease	0.34	0.03	152	2473
Larynx	CV Heart Failure	0.00	0.00	92	2533
Larynx	CV Arrhythmias	0.00	0.00	182	2443
Larynx	CV Cerebrovascular Disease	0.11	0.00	77	2548
Larynx	CV Peripheral Vascular Disease	0.00	0.00	81	2544
Larynx	MET Type1 Diabetes	0.82	0.95	9	2616
Larynx	MET Type2 Diabetes	0.12	0.77	520	2105
Larynx	MET Complicated Diabetes	1.00	0.23	6	2619
Larynx	MET Obesity	0.07	0.14	131	2494
Larynx	MET Dyslipidemia	0.27	0.38	221	2404
Larynx	MET Metabolic Syndrome	0.08	0.00	12	2613
Larynx	MET Thyroid Disorders	0.74	0.54	293	2332
Larynx	RESP COPD	0.05	0.00	362	2263
Larynx	RESP Asthma	0.79	0.99	95	2530
Larynx	RESP Chronic Bronchitis	1.00	—	1	2624
Larynx	RESP Emphysema	0.00	0.00	46	2579
Larynx	RESP Pulmonary Fibrosis	0.07	0.84	27	2598
Larynx	RESP Pulmonary Hypertension	0.08	0.12	12	2613
Larynx	RESP Sleep Apnea	1.00	0.71	26	2599
Larynx	REN Acute Renal Failure	0.00	0.00	160	2465
Larynx	REN Chronic Renal Failure	0.00	0.01	136	2489
Larynx	REN End Stage Renal	0.02	0.46	29	2596
Larynx	REN Diabetic Nephropathy	1.00	0.86	8	2617
Larynx	HEP Steatosis	0.19	0.72	24	2601
Larynx	HEP Alcoholic Cirrhosis	0.00	0.00	34	2591
Larynx	HEP NonAlcoholic Cirrhosis	0.00	—	2	2623
Larynx	HEP Viral Hepatitis	1.00	—	4	2621
Larynx	GI Reflux	0.54	0.17	15	2610
Larynx	GI Peptic Ulcer	0.00	0.00	30	2595
Larynx	GI IBD	1.00	—	0	2625
Larynx	GI IBS	0.11	—	1	2624
Larynx	NEURO Dementia	0.58	0.44	35	2590
Larynx	NEURO Parkinson	1.00	0.53	19	2606
Larynx	NEURO Epilepsy	0.15	0.01	41	2584
Larynx	NEURO Stroke Sequelae	0.49	0.01	55	2570
Larynx	PSI Major Depression	0.35	0.01	61	2564
Larynx	PSI Bipolar	0.97	0.28	7	2618
Larynx	PSI Anxiety	0.33	0.00	23	2602
Larynx	SUST Alcohol	0.37	0.00	119	2506
Larynx	SUST Tobacco	0.14	0.00	637	1988
Larynx	SUST Opioids	1.00	—	1	2624
Larynx	INF HIV	1.00	0.14	17	2608
Larynx	INF Tuberculosis	1.00	0.38	7	2618
Larynx	INF Hepatitis C	1.00	—	3	2622
Larynx	INF Hepatitis B	1.00	—	1	2624
Nasopharynx	CV Essential Hypertension	0.63	0.21	104	264
Nasopharynx	CV Complicated Hypertension	1.00	—	0	368
Nasopharynx	CV Coronary Disease	1.00	0.71	8	360
Nasopharynx	CV Heart Failure	1.00	0.30	6	362
Nasopharynx	CV Arrhythmias	0.30	0.01	11	357
Nasopharynx	CV Cerebrovascular Disease	0.06	0.00	6	362
Nasopharynx	CV Peripheral Vascular Disease	0.09	0.01	15	353
Nasopharynx	MET Type1 Diabetes	1.00	—	0	368
Nasopharynx	MET Type2 Diabetes	1.00	0.02	46	322
Nasopharynx	MET Complicated Diabetes	1.00	—	1	367
Nasopharynx	MET Obesity	0.34	0.38	26	342
Nasopharynx	MET Dyslipidemia	0.24	0.12	33	335
Nasopharynx	MET Metabolic Syndrome	0.27	—	2	366
Nasopharynx	MET Thyroid Disorders	0.82	0.21	35	333
Nasopharynx	RESP COPD	1.00	—	3	365
Nasopharynx	RESP Asthma	0.72	0.90	13	355
Nasopharynx	RESP Chronic Bronchitis	1.00	—	0	368
Nasopharynx	RESP Emphysema	0.27	—	2	366
Nasopharynx	RESP Pulmonary Fibrosis	1.00	—	4	364
Nasopharynx	RESP Pulmonary Hypertension	1.00	—	0	368
Nasopharynx	RESP Sleep Apnea	1.00	—	1	367
Nasopharynx	REN Acute Renal Failure	0.03	0.01	20	348
Nasopharynx	REN Chronic Renal Failure	0.76	0.20	12	356
Nasopharynx	REN End Stage Renal	1.00	—	2	366
Nasopharynx	REN Diabetic Nephropathy	1.00	—	1	367
Nasopharynx	HEP Steatosis	1.00	—	1	367
Nasopharynx	HEP Alcoholic Cirrhosis	1.00	—	1	367
Nasopharynx	HEP NonAlcoholic Cirrhosis	1.00	—	0	368
Nasopharynx	HEP Viral Hepatitis	1.00	—	0	368
Nasopharynx	GI Reflux	1.00	—	1	367
Nasopharynx	GI Peptic Ulcer	1.00	—	1	367
Nasopharynx	GI IBD	1.00	—	1	367
Nasopharynx	GI IBS	1.00	—	0	368
Nasopharynx	NEURO Dementia	1.00	—	0	368
Nasopharynx	NEURO Parkinson	0.07	—	1	367
Nasopharynx	NEURO Epilepsy	1.00	—	4	364
Nasopharynx	NEURO Stroke Sequelae	1.00	—	3	365
Nasopharynx	PSI Major Depression	1.00	—	2	366
Nasopharynx	PSI Bipolar	1.00	—	0	368
Nasopharynx	PSI Anxiety	0.67	0.97	14	354
Nasopharynx	SUST Alcohol	0.63	0.03	15	353
Nasopharynx	SUST Tobacco	0.18	0.41	29	339
Nasopharynx	SUST Opioids	1.00	—	0	368
Nasopharynx	INF HIV	1.00	0.94	5	363
Nasopharynx	INF Tuberculosis	1.00	—	0	368
Nasopharynx	INF Hepatitis C	1.00	—	0	368
Nasopharynx	INF Hepatitis B	1.00	—	0	368
Oral Cavity	CV Essential Hypertension	0.32	0.00	1316	1685
Oral Cavity	CV Complicated Hypertension	0.49	0.49	15	2986
Oral Cavity	CV Coronary Disease	0.25	0.00	114	2887
Oral Cavity	CV Heart Failure	0.05	0.00	65	2936
Oral Cavity	CV Arrhythmias	0.00	0.00	169	2832
Oral Cavity	CV Cerebrovascular Disease	0.03	0.01	73	2928
Oral Cavity	CV Peripheral Vascular Disease	0.00	0.00	63	2938
Oral Cavity	MET Type1 Diabetes	1.00	0.90	11	2990
Oral Cavity	MET Type2 Diabetes	0.90	0.14	560	2441
Oral Cavity	MET Complicated Diabetes	0.76	0.40	13	2988
Oral Cavity	MET Obesity	0.20	0.00	156	2845
Oral Cavity	MET Dyslipidemia	0.52	0.73	343	2658
Oral Cavity	MET Metabolic Syndrome	0.01	0.04	18	2983
Oral Cavity	MET Thyroid Disorders	0.06	0.01	368	2633
Oral Cavity	RESP COPD	0.38	0.00	154	2847
Oral Cavity	RESP Asthma	0.05	0.82	53	2948
Oral Cavity	RESP Chronic Bronchitis	1.00	—	4	2997
Oral Cavity	RESP Emphysema	0.85	0.00	22	2979
Oral Cavity	RESP Pulmonary Fibrosis	0.00	0.94	22	2979
Oral Cavity	RESP Pulmonary Hypertension	0.12	0.16	8	2993
Oral Cavity	RESP Sleep Apnea	1.00	0.00	9	2992
Oral Cavity	REN Acute Renal Failure	0.00	0.00	136	2865
Oral Cavity	REN Chronic Renal Failure	0.01	0.00	137	2864
Oral Cavity	REN End Stage Renal	0.04	0.00	31	2970
Oral Cavity	REN Diabetic Nephropathy	1.00	0.23	7	2994
Oral Cavity	HEP Steatosis	0.01	0.13	41	2960
Oral Cavity	HEP Alcoholic Cirrhosis	0.17	0.11	19	2982
Oral Cavity	HEP NonAlcoholic Cirrhosis	1.00	0.50	9	2992
Oral Cavity	HEP Viral Hepatitis	1.00	0.46	5	2996
Oral Cavity	GI Reflux	0.76	0.09	13	2988
Oral Cavity	GI Peptic Ulcer	0.55	0.01	16	2985
Oral Cavity	GI IBD	0.57	—	4	2997
Oral Cavity	GI IBS	1.00	—	2	2999
Oral Cavity	NEURO Dementia	0.14	0.95	52	2949
Oral Cavity	NEURO Parkinson	1.00	0.55	20	2981
Oral Cavity	NEURO Epilepsy	0.78	0.04	36	2965
Oral Cavity	NEURO Stroke Sequelae	0.07	0.03	57	2944
Oral Cavity	PSI Major Depression	0.90	0.01	89	2912
Oral Cavity	PSI Bipolar	0.89	0.69	7	2994
Oral Cavity	PSI Anxiety	1.00	0.00	21	2980
Oral Cavity	SUST Alcohol	0.68	0.00	143	2858
Oral Cavity	SUST Tobacco	0.60	0.00	445	2556
Oral Cavity	SUST Opioids	1.00	—	0	3001
Oral Cavity	INF HIV	0.85	0.01	22	2979
Oral Cavity	INF Tuberculosis	1.00	—	2	2999
Oral Cavity	INF Hepatitis C	1.00	—	0	3001
Oral Cavity	INF Hepatitis B	1.00	0.46	5	2996
Oropharynx	CV Essential Hypertension	0.74	0.00	352	599
Oropharynx	CV Complicated Hypertension	1.00	—	1	950
Oropharynx	CV Coronary Disease	1.00	0.02	23	928
Oropharynx	CV Heart Failure	0.01	0.13	10	941
Oropharynx	CV Arrhythmias	0.00	0.00	30	921
Oropharynx	CV Cerebrovascular Disease	0.26	0.04	22	929
Oropharynx	CV Peripheral Vascular Disease	1.00	0.00	26	925
Oropharynx	MET Type1 Diabetes	1.00	—	1	950
Oropharynx	MET Type2 Diabetes	0.42	0.13	130	821
Oropharynx	MET Complicated Diabetes	1.00	—	3	948
Oropharynx	MET Obesity	1.00	0.23	40	911
Oropharynx	MET Dyslipidemia	0.84	0.47	67	884
Oropharynx	MET Metabolic Syndrome	0.06	—	1	950
Oropharynx	MET Thyroid Disorders	0.26	0.58	83	868
Oropharynx	RESP COPD	0.01	0.00	35	916
Oropharynx	RESP Asthma	1.00	0.92	17	934
Oropharynx	RESP Chronic Bronchitis	1.00	—	1	950
Oropharynx	RESP Emphysema	0.92	0.11	10	941
Oropharynx	RESP Pulmonary Fibrosis	0.08	0.03	7	944
Oropharynx	RESP Pulmonary Hypertension	1.00	—	2	949
Oropharynx	RESP Sleep Apnea	1.00	—	3	948
Oropharynx	REN Acute Renal Failure	0.00	0.00	29	922
Oropharynx	REN Chronic Renal Failure	0.06	0.30	24	927
Oropharynx	REN End Stage Renal	0.79	0.56	6	945
Oropharynx	REN Diabetic Nephropathy	1.00	—	2	949
Oropharynx	HEP Steatosis	0.00	0.00	14	937
Oropharynx	HEP Alcoholic Cirrhosis	0.95	0.06	8	943
Oropharynx	HEP NonAlcoholic Cirrhosis	1.00	—	0	951
Oropharynx	HEP Viral Hepatitis	1.00	—	0	951
Oropharynx	GI Reflux	1.00	—	2	949
Oropharynx	GI Peptic Ulcer	1.00	—	3	948
Oropharynx	GI IBD	1.00	—	1	950
Oropharynx	GI IBS	1.00	—	0	951
Oropharynx	NEURO Dementia	1.00	—	4	947
Oropharynx	NEURO Parkinson	1.00	0.12	6	945
Oropharynx	NEURO Epilepsy	1.00	0.06	21	930
Oropharynx	NEURO Stroke Sequelae	0.69	0.08	19	932
Oropharynx	PSI Major Depression	0.26	0.01	22	929
Oropharynx	PSI Bipolar	0.04	0.03	6	945
Oropharynx	PSI Anxiety	0.86	0.92	11	940
Oropharynx	SUST Alcohol	0.92	0.03	46	905
Oropharynx	SUST Tobacco	0.75	0.00	126	825
Oropharynx	SUST Opioids	1.00	—	1	950
Oropharynx	INF HIV	0.74	0.05	20	931
Oropharynx	INF Tuberculosis	0.24	—	2	949
Oropharynx	INF Hepatitis C	1.00	—	0	951
Oropharynx	INF Hepatitis B	1.00	—	0	951
Other Pharynx	CV Essential Hypertension	0.16	0.62	94	187
Other Pharynx	CV Complicated Hypertension	0.20	—	1	280
Other Pharynx	CV Coronary Disease	0.85	0.13	6	275
Other Pharynx	CV Heart Failure	0.35	0.13	7	274
Other Pharynx	CV Arrhythmias	0.08	0.00	14	267
Other Pharynx	CV Cerebrovascular Disease	0.65	0.11	10	271
Other Pharynx	CV Peripheral Vascular Disease	0.75	0.00	11	270
Other Pharynx	MET Type1 Diabetes	0.51	—	2	279
Other Pharynx	MET Type2 Diabetes	0.96	0.92	43	238
Other Pharynx	MET Complicated Diabetes	1.00	—	0	281
Other Pharynx	MET Obesity	1.00	0.57	8	273
Other Pharynx	MET Dyslipidemia	0.29	0.91	17	264
Other Pharynx	MET Metabolic Syndrome	1.00	—	0	281
Other Pharynx	MET Thyroid Disorders	0.40	0.53	26	255
Other Pharynx	RESP COPD	0.37	0.07	14	267
Other Pharynx	RESP Asthma	1.00	—	3	278
Other Pharynx	RESP Chronic Bronchitis	1.00	—	0	281
Other Pharynx	RESP Emphysema	0.91	—	4	277
Other Pharynx	RESP Pulmonary Fibrosis	1.00	—	3	278
Other Pharynx	RESP Pulmonary Hypertension	1.00	—	0	281
Other Pharynx	RESP Sleep Apnea	0.61	0.92	9	272
Other Pharynx	REN Acute Renal Failure	0.00	0.12	10	271
Other Pharynx	REN Chronic Renal Failure	0.83	0.03	12	269
Other Pharynx	REN End Stage Renal	0.91	—	4	277
Other Pharynx	REN Diabetic Nephropathy	1.00	—	0	281
Other Pharynx	HEP Steatosis	1.00	—	3	278
Other Pharynx	HEP Alcoholic Cirrhosis	0.20	—	1	280
Other Pharynx	HEP NonAlcoholic Cirrhosis	1.00	—	0	281
Other Pharynx	HEP Viral Hepatitis	1.00	—	0	281
Other Pharynx	GI Reflux	1.00	—	3	278
Other Pharynx	GI Peptic Ulcer	1.00	0.52	5	276
Other Pharynx	GI IBD	1.00	—	1	280
Other Pharynx	GI IBS	1.00	—	1	280
Other Pharynx	NEURO Dementia	1.00	—	2	279
Other Pharynx	NEURO Parkinson	1.00	—	0	281
Other Pharynx	NEURO Epilepsy	0.65	0.13	10	271
Other Pharynx	NEURO Stroke Sequelae	1.00	0.25	5	276
Other Pharynx	PSI Major Depression	0.25	0.71	6	275
Other Pharynx	PSI Bipolar	0.20	—	1	280
Other Pharynx	PSI Anxiety	1.00	—	2	279
Other Pharynx	SUST Alcohol	1.00	0.12	5	276
Other Pharynx	SUST Tobacco	0.46	0.00	31	250
Other Pharynx	SUST Opioids	1.00	—	0	281
Other Pharynx	INF HIV	1.00	—	4	277
Other Pharynx	INF Tuberculosis	1.00	—	1	280
Other Pharynx	INF Hepatitis C	1.00	—	0	281
Other Pharynx	INF Hepatitis B	1.00	—	0	281
Salivary Glands	CV Essential Hypertension	0.58	0.00	337	561
Salivary Glands	CV Complicated Hypertension	0.88	0.02	8	890
Salivary Glands	CV Coronary Disease	0.06	0.03	27	871
Salivary Glands	CV Heart Failure	0.01	0.00	20	878
Salivary Glands	CV Arrhythmias	0.00	0.00	42	856
Salivary Glands	CV Cerebrovascular Disease	0.04	0.00	15	883
Salivary Glands	CV Peripheral Vascular Disease	1.00	0.58	8	890
Salivary Glands	MET Type1 Diabetes	1.00	—	3	895
Salivary Glands	MET Type2 Diabetes	0.52	0.16	180	718
Salivary Glands	MET Complicated Diabetes	0.20	—	2	896
Salivary Glands	MET Obesity	0.22	0.56	72	826
Salivary Glands	MET Dyslipidemia	0.52	0.54	94	804
Salivary Glands	MET Metabolic Syndrome	1.00	0.60	6	892
Salivary Glands	MET Thyroid Disorders	1.00	0.60	102	796
Salivary Glands	RESP COPD	0.76	0.01	37	861
Salivary Glands	RESP Asthma	0.32	0.30	34	864
Salivary Glands	RESP Chronic Bronchitis	1.00	—	0	898
Salivary Glands	RESP Emphysema	0.20	—	2	896
Salivary Glands	RESP Pulmonary Fibrosis	1.00	0.01	7	891
Salivary Glands	RESP Pulmonary Hypertension	1.00	—	2	896
Salivary Glands	RESP Sleep Apnea	1.00	0.85	7	891
Salivary Glands	REN Acute Renal Failure	0.00	0.00	38	860
Salivary Glands	REN Chronic Renal Failure	0.24	0.00	38	860
Salivary Glands	REN End Stage Renal	0.81	0.50	7	891
Salivary Glands	REN Diabetic Nephropathy	1.00	—	1	897
Salivary Glands	HEP Steatosis	0.81	0.02	7	891
Salivary Glands	HEP Alcoholic Cirrhosis	0.20	—	2	896
Salivary Glands	HEP NonAlcoholic Cirrhosis	1.00	—	1	897
Salivary Glands	HEP Viral Hepatitis	1.00	—	0	898
Salivary Glands	GI Reflux	1.00	—	2	896
Salivary Glands	GI Peptic Ulcer	1.00	—	4	894
Salivary Glands	GI IBD	1.00	—	2	896
Salivary Glands	GI IBS	1.00	—	0	898
Salivary Glands	NEURO Dementia	0.81	0.31	7	891
Salivary Glands	NEURO Parkinson	0.93	0.32	11	887
Salivary Glands	NEURO Epilepsy	0.71	0.53	16	882
Salivary Glands	NEURO Stroke Sequelae	1.00	0.01	12	886
Salivary Glands	PSI Major Depression	1.00	0.19	28	870
Salivary Glands	PSI Bipolar	1.00	—	4	894
Salivary Glands	PSI Anxiety	1.00	0.06	9	889
Salivary Glands	SUST Alcohol	0.53	0.98	18	880
Salivary Glands	SUST Tobacco	1.00	0.07	83	815
Salivary Glands	SUST Opioids	1.00	—	0	898
Salivary Glands	INF HIV	1.00	0.27	5	893
Salivary Glands	INF Tuberculosis	1.00	0.62	5	893
Salivary Glands	INF Hepatitis C	1.00	—	0	898
Salivary Glands	INF Hepatitis B	1.00	—	0	898
Sinonasal	CV Essential Hypertension	0.74	0.06	293	594
Sinonasal	CV Complicated Hypertension	0.57	0.11	5	882
Sinonasal	CV Coronary Disease	0.99	0.81	11	876
Sinonasal	CV Heart Failure	1.00	0.00	12	875
Sinonasal	CV Arrhythmias	1.00	0.00	15	872
Sinonasal	CV Cerebrovascular Disease	0.00	0.03	19	868
Sinonasal	CV Peripheral Vascular Disease	1.00	0.01	7	880
Sinonasal	MET Type1 Diabetes	0.45	—	4	883
Sinonasal	MET Type2 Diabetes	0.40	0.02	142	745
Sinonasal	MET Complicated Diabetes	1.00	—	2	885
Sinonasal	MET Obesity	0.57	0.05	57	830
Sinonasal	MET Dyslipidemia	1.00	0.93	65	822
Sinonasal	MET Metabolic Syndrome	0.17	—	2	885
Sinonasal	MET Thyroid Disorders	1.00	0.46	75	812
Sinonasal	RESP COPD	1.00	0.80	16	871
Sinonasal	RESP Asthma	0.57	0.73	23	864
Sinonasal	RESP Chronic Bronchitis	1.00	—	0	887
Sinonasal	RESP Emphysema	1.00	—	0	887
Sinonasal	RESP Pulmonary Fibrosis	1.00	0.56	5	882
Sinonasal	RESP Pulmonary Hypertension	1.00	—	0	887
Sinonasal	RESP Sleep Apnea	1.00	0.42	6	881
Sinonasal	REN Acute Renal Failure	0.32	0.00	15	872
Sinonasal	REN Chronic Renal Failure	1.00	0.05	18	869
Sinonasal	REN End Stage Renal	1.00	0.03	7	880
Sinonasal	REN Diabetic Nephropathy	1.00	—	2	885
Sinonasal	HEP Steatosis	0.83	0.09	8	879
Sinonasal	HEP Alcoholic Cirrhosis	1.00	—	2	885
Sinonasal	HEP NonAlcoholic Cirrhosis	1.00	—	0	887
Sinonasal	HEP Viral Hepatitis	1.00	—	0	887
Sinonasal	GI Reflux	1.00	—	3	884
Sinonasal	GI Peptic Ulcer	1.00	—	3	884
Sinonasal	GI IBD	1.00	—	0	887
Sinonasal	GI IBS	1.00	—	0	887
Sinonasal	NEURO Dementia	1.00	0.18	8	879
Sinonasal	NEURO Parkinson	1.00	0.89	6	881
Sinonasal	NEURO Epilepsy	1.00	0.10	14	873
Sinonasal	NEURO Stroke Sequelae	0.95	0.95	10	877
Sinonasal	PSI Major Depression	1.00	0.51	18	869
Sinonasal	PSI Bipolar	1.00	—	1	886
Sinonasal	PSI Anxiety	0.99	0.87	11	876
Sinonasal	SUST Alcohol	1.00	0.08	29	858
Sinonasal	SUST Tobacco	0.74	0.29	84	803
Sinonasal	SUST Opioids	1.00	—	0	887
Sinonasal	INF HIV	1.00	—	4	883
Sinonasal	INF Tuberculosis	1.00	—	0	887
Sinonasal	INF Hepatitis C	1.00	—	0	887
Sinonasal	INF Hepatitis B	1.00	—	0	887

Code

# Build dictionaries of significant comorbidities per site (p < 0.05)
display_to_col = {format_name(c): c for c in comorbidity_cols}

sig_comorbidities_mortality = {}
sig_comorbidities_los = {}

for site in all_site_names:
    site_biv = df_bivariate[df_bivariate['Cancer Site'] == site]
    sig_mort = []
    sig_los = []
    for _, row in site_biv.iterrows():
        comorb_display = row['Comorbidity']
        col_name = display_to_col.get(comorb_display, None)
        if col_name is None:
            continue
        if not np.isnan(row['Chi-square p-value']) and row['Chi-square p-value'] < 0.05:
            sig_mort.append(col_name)
        if not np.isnan(row['LOS p-value (Wilcoxon)']) and row['LOS p-value (Wilcoxon)'] < 0.05:
            sig_los.append(col_name)
    sig_comorbidities_mortality[site] = sig_mort
    sig_comorbidities_los[site] = sig_los

sig_summary = []
for site in all_site_names:
    sig_summary.append({
        'Cancer Site': site,
        'Sig. Comorbidities (Mortality)': len(sig_comorbidities_mortality.get(site, [])),
        'Sig. Comorbidities (LOS)': len(sig_comorbidities_los.get(site, [])),
        'Mortality List': ', '.join([format_name(c) for c in sig_comorbidities_mortality.get(site, [])]) or '—',
        'LOS List': ', '.join([format_name(c) for c in sig_comorbidities_los.get(site, [])]) or '—'
    })

sig_summary_df = pd.DataFrame(sig_summary)
show_table(sig_summary_df, "Table 21: Significant Comorbidities by Cancer Site (p < 0.05 in Bivariate Analysis)", index=False)

Table 21: Significant Comorbidities by Cancer Site (p < 0.05 in Bivariate Analysis)

Cancer Site	Sig. Comorbidities (Mortality)	Sig. Comorbidities (LOS)	Mortality List	LOS List
Hypopharynx	1	9	CV Arrhythmias	CV Peripheral Vascular Disease, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, NEURO Epilepsy, SUST Alcohol, SUST Tobacco
Larynx	11	20	CV Heart Failure, CV Arrhythmias, CV Peripheral Vascular Disease, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, HEP Alcoholic Cirrhosis, HEP NonAlcoholic Cirrhosis, GI Peptic Ulcer	CV Essential Hypertension, CV Complicated Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Metabolic Syndrome, RESP COPD, RESP Emphysema, REN Acute Renal Failure, REN Chronic Renal Failure, HEP Alcoholic Cirrhosis, GI Peptic Ulcer, NEURO Epilepsy, NEURO Stroke Sequelae, PSI Major Depression, PSI Anxiety, SUST Alcohol, SUST Tobacco
Nasopharynx	1	6	REN Acute Renal Failure	CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Type2 Diabetes, REN Acute Renal Failure, SUST Alcohol
Oral Cavity	11	23	CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Metabolic Syndrome, RESP Asthma, RESP Pulmonary Fibrosis, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, HEP Steatosis	CV Essential Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Obesity, MET Metabolic Syndrome, MET Thyroid Disorders, RESP COPD, RESP Emphysema, RESP Sleep Apnea, REN Acute Renal Failure, REN Chronic Renal Failure, REN End Stage Renal, GI Peptic Ulcer, NEURO Epilepsy, NEURO Stroke Sequelae, PSI Major Depression, PSI Anxiety, SUST Alcohol, SUST Tobacco, INF HIV
Oropharynx	6	14	CV Heart Failure, CV Arrhythmias, RESP COPD, REN Acute Renal Failure, HEP Steatosis, PSI Bipolar	CV Essential Hypertension, CV Coronary Disease, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, RESP COPD, RESP Pulmonary Fibrosis, REN Acute Renal Failure, HEP Steatosis, PSI Major Depression, PSI Bipolar, SUST Alcohol, SUST Tobacco, INF HIV
Other Pharynx	1	4	REN Acute Renal Failure	CV Arrhythmias, CV Peripheral Vascular Disease, REN Chronic Renal Failure, SUST Tobacco
Salivary Glands	4	12	CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, REN Acute Renal Failure	CV Essential Hypertension, CV Complicated Hypertension, CV Coronary Disease, CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, RESP COPD, RESP Pulmonary Fibrosis, REN Acute Renal Failure, REN Chronic Renal Failure, HEP Steatosis, NEURO Stroke Sequelae
Sinonasal	1	7	CV Cerebrovascular Disease	CV Heart Failure, CV Arrhythmias, CV Cerebrovascular Disease, CV Peripheral Vascular Disease, MET Type2 Diabetes, REN Acute Renal Failure, REN End Stage Renal

10 6. GEE Mixed-Effects Logistic Regression for Mortality (All Cancer Sites)

Code

from statsmodels.genmod.generalized_estimating_equations import GEE
from statsmodels.genmod.families import Binomial, Poisson, NegativeBinomial
from statsmodels.genmod.cov_struct import Exchangeable, Independence
import statsmodels.api as sm
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, brier_score_loss,
                             roc_auc_score, roc_curve, confusion_matrix,
                             classification_report)

# Use ALL cancer site groups
all_sites = sorted(df_head_neck['cancer_site_group'].unique())

# Store GEE model results
gee_mortality_models = {}
gee_mortality_results = {}

def find_optimal_threshold(y_true, y_proba):
    """Find optimal classification threshold using Youden's J statistic"""
    fpr, tpr, thresholds = roc_curve(y_true, y_proba)
    j_scores = tpr - fpr
    best_idx = np.argmax(j_scores)
    return thresholds[best_idx]

def safe_fmt(val, fmt='.4f'):
    """Format a value safely, returning '—' for NaN/Inf"""
    if isinstance(val, float) and (np.isnan(val) or np.isinf(val)):
        return '—'
    try:
        return f'{val:{fmt}}'
    except:
        return '—'

def build_gee_mortality_model(df_site, site_name):
    """Build GEE logistic regression (mixed-effects approximation) for mortality.
    Uses hospital_id as cluster group, year_factor as fixed effect,
    and only significant comorbidities from bivariate analysis."""
    df_model = df_site.copy()

    # Code categorical variables
    df_model['SEXO_coded'] = (df_model['SEXO'] == 'Female').astype(int)

    # Create service dummies
    service_dummies = pd.get_dummies(df_model['service_category'], prefix='service', drop_first=True)
    for col in service_dummies.columns:
        df_model[col] = service_dummies[col]
    service_dummy_cols = list(service_dummies.columns)

    # Create year dummies (year_factor as fixed effect)
    year_dummies = pd.get_dummies(df_model['year_factor'], prefix='year', drop_first=True)
    for col in year_dummies.columns:
        df_model[col] = year_dummies[col]
    year_dummy_cols = list(year_dummies.columns)

    # Ensure numeric
    df_model['edad'] = pd.to_numeric(df_model['edad'], errors='coerce')

    # Base predictors: age, sex, service, year dummies
    predictor_cols = ['edad', 'SEXO_coded'] + service_dummy_cols + year_dummy_cols

    # Add severity and mortality risk as CATEGORICAL dummies (Low=ref, Medium, High)
    for col in ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce')
            df_model[col] = df_model[col].fillna(1)
            df_model[col] = df_model[col].clip(lower=1).astype(int)
            prefix = 'severity' if 'SEVERIDAD' in col else 'mort_risk'
            col_dummies = pd.get_dummies(df_model[col], prefix=prefix, drop_first=True)
            for dc in col_dummies.columns:
                df_model[dc] = col_dummies[dc]
                predictor_cols.append(dc)

    # Add ONLY significant comorbidities from bivariate analysis (p < 0.05)
    sig_comorbs = sig_comorbidities_mortality.get(site_name, [])
    used_comorbs = []
    for col in sig_comorbs:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce').fillna(0).astype(int)
            if df_model[col].sum() >= 3:
                predictor_cols.append(col)
                used_comorbs.append(col)

    # Prepare groups (hospital_id for clustering)
    df_model['_group'] = df_model['hospital_id']

    # Remove missing values
    cols_for_model = ['egresar_fallecido', '_group'] + predictor_cols
    df_gee = df_model[cols_for_model].dropna()

    # Need at least 2 clusters and sufficient data
    n_clusters = df_gee['_group'].nunique()
    if len(df_gee) < 50 or n_clusters < 2:
        return None, None, [], []

    # Build design matrix
    y = df_gee['egresar_fallecido'].astype(float)
    X = df_gee[predictor_cols].astype(float)
    X = sm.add_constant(X)
    groups = df_gee['_group']

    # Fit GEE with exchangeable correlation structure
    try:
        gee_model = GEE(y, X, groups=groups,
                        family=Binomial(),
                        cov_struct=Exchangeable())
        gee_result = gee_model.fit(maxiter=200)
    except Exception:
        # Fallback to independence correlation
        try:
            gee_model = GEE(y, X, groups=groups,
                            family=Binomial(),
                            cov_struct=Independence())
            gee_result = gee_model.fit(maxiter=200)
        except Exception:
            return None, None, [], []

    # Check for extreme coefficients and remove problematic variables
    max_retries = 5
    current_predictors = list(predictor_cols)
    for _ in range(max_retries):
        params = gee_result.params
        bad_vars = []
        for i, var in enumerate(current_predictors):
            idx = i + 1  # +1 for constant
            if idx < len(params) and (np.isinf(params.iloc[idx]) or abs(params.iloc[idx]) > 50):
                bad_vars.append(var)
        if not bad_vars:
            break
        current_predictors = [p for p in current_predictors if p not in bad_vars]
        if len(current_predictors) < 2:
            return None, None, [], []
        X = df_gee[current_predictors].astype(float)
        X = sm.add_constant(X)
        try:
            gee_model = GEE(y, X, groups=groups, family=Binomial(), cov_struct=Exchangeable())
            gee_result = gee_model.fit(maxiter=200)
        except:
            try:
                gee_model = GEE(y, X, groups=groups, family=Binomial(), cov_struct=Independence())
                gee_result = gee_model.fit(maxiter=200)
            except:
                return None, None, [], []

    return gee_result, df_gee, current_predictors, used_comorbs

def get_gee_fit_metrics(gee_result, y, family_type='binomial'):
    """Extract model fit metrics from a GEE result object."""
    metrics = {}
    metrics['N_obs'] = int(gee_result.nobs)
    metrics['N_clusters'] = int(len(set(gee_result.model.groups)))
    metrics['N_params'] = int(len(gee_result.params))

    # QIC / QICu (quasi-information criteria)
    try:
        qic, qicu = gee_result.qic()
        metrics['QIC'] = qic
        metrics['QICu'] = qicu
    except Exception:
        metrics['QIC'] = np.nan
        metrics['QICu'] = np.nan

    # Quasi log-likelihood (scale parameter)
    try:
        metrics['Scale'] = float(gee_result.scale)
    except Exception:
        metrics['Scale'] = np.nan

    # Pseudo R² (McFadden-style for logistic, deviance-based for count)
    try:
        y_pred = gee_result.fittedvalues
        if family_type == 'binomial':
            y_arr = np.asarray(y, dtype=float)
            p_hat = np.clip(np.asarray(y_pred, dtype=float), 1e-10, 1 - 1e-10)
            ll_model = np.sum(y_arr * np.log(p_hat) + (1 - y_arr) * np.log(1 - p_hat))
            p_null = np.mean(y_arr)
            ll_null = np.sum(y_arr * np.log(p_null) + (1 - y_arr) * np.log(1 - p_null))
            metrics['Log-Lik (model)'] = ll_model
            metrics['Log-Lik (null)'] = ll_null
            metrics['Pseudo R² (McFadden)'] = 1 - ll_model / ll_null if ll_null != 0 else np.nan
        else:
            y_arr = np.asarray(y, dtype=float)
            y_pred_arr = np.asarray(y_pred, dtype=float)
            ss_res = np.sum((y_arr - y_pred_arr) ** 2)
            ss_tot = np.sum((y_arr - np.mean(y_arr)) ** 2)
            metrics['Pseudo R²'] = 1 - ss_res / ss_tot if ss_tot > 0 else np.nan
    except Exception:
        if family_type == 'binomial':
            metrics['Log-Lik (model)'] = np.nan
            metrics['Log-Lik (null)'] = np.nan
            metrics['Pseudo R² (McFadden)'] = np.nan
        else:
            metrics['Pseudo R²'] = np.nan

    # Correlation structure
    try:
        cov_name = gee_result.cov_struct.__class__.__name__
        metrics['Correlation'] = cov_name
    except Exception:
        metrics['Correlation'] = '—'

    return metrics

# Fit GEE mortality models for EACH cancer site
for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()

    n_events = df_site['egresar_fallecido'].sum()
    n_total = len(df_site)
    if n_events < 10 or n_total < 50:
        continue

    gee_result, df_gee, used_cols, used_comorbs = build_gee_mortality_model(df_site, site)

    if gee_result is not None:
        gee_mortality_models[site] = (gee_result, df_gee, used_cols)
        gee_mortality_results[site] = gee_result

10.1 6.1 GEE Logistic Regression Results - OR [95% CI]

Code

# Table: GEE Logistic Regression - OR [95% CI] by Cancer Site
gee_or_table = {}

for site in all_sites:
    if site not in gee_mortality_results:
        continue

    result = gee_mortality_results[site]
    or_ci_list = {}

    param_names = list(result.params.index)
    ci = result.conf_int()

    def _fmt_or(v):
        if np.isinf(v) or abs(v) > 1e6:
            return 'Inf' if v > 0 else '-Inf'
        return f"{v:.3f}"

    for var in param_names:
        coef = result.params[var]
        or_val = np.exp(coef)
        ci_lower = np.exp(ci.loc[var, 0] if var in ci.index else ci.iloc[param_names.index(var), 0])
        ci_upper = np.exp(ci.loc[var, 1] if var in ci.index else ci.iloc[param_names.index(var), 1])
        pval = result.pvalues[var] if var in result.pvalues.index else result.pvalues.iloc[param_names.index(var)]

        if np.isnan(or_val) or np.isnan(pval):
            or_ci_list[format_name(var)] = '—'
            continue

        sig_stars = ''
        if pval < 0.001:
            sig_stars = '***'
        elif pval < 0.01:
            sig_stars = '**'
        elif pval < 0.05:
            sig_stars = '*'

        or_ci_str = f"{_fmt_or(or_val)} [{_fmt_or(ci_lower)}, {_fmt_or(ci_upper)}]{sig_stars}"
        or_ci_list[format_name(var)] = or_ci_str

    gee_or_table[site] = or_ci_list

gee_or_df = pd.DataFrame(gee_or_table).fillna('—')
show_table(gee_or_df, "Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site\n(Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis)\n* p<0.05, ** p<0.01, *** p<0.001", index=True)

Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site (Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis) * p<0.05, p<0.01, * p<0.001

	Hypopharynx	Larynx	Nasopharynx	Oral Cavity	Oropharynx	Other Pharynx	Salivary Glands	Sinonasal
Intercept	0.000 [0.000, 0.025]***	0.004 [0.001, 0.014]***	—	0.002 [0.000, 0.007]***	0.012 [0.000, 0.329]**	0.001 [0.000, 0.172]**	0.015 [0.001, 0.206]**	0.045 [0.007, 0.290]**
Age	1.032 [0.989, 1.078]	1.027 [1.013, 1.041]***	—	1.018 [1.002, 1.035]*	1.022 [0.982, 1.064]	1.052 [1.005, 1.101]*	0.990 [0.969, 1.011]	1.002 [0.985, 1.018]
Sex (Female)	1.076 [0.447, 2.590]	0.982 [0.580, 1.662]	—	1.014 [0.701, 1.468]	0.927 [0.408, 2.108]	0.863 [0.353, 2.109]	0.485 [0.234, 1.005]	0.887 [0.440, 1.784]
Service ITU	2.796 [0.357, 21.901]	0.451 [0.209, 0.973]*	—	0.583 [0.259, 1.311]	0.463 [0.131, 1.641]	0.489 [0.053, 4.518]	0.969 [0.150, 6.276]	0.223 [0.043, 1.165]
Service MS	0.812 [0.172, 3.835]	0.819 [0.491, 1.366]	—	0.884 [0.524, 1.492]	0.367 [0.150, 0.899]*	0.670 [0.167, 2.681]	0.521 [0.164, 1.660]	0.284 [0.084, 0.957]*
Service Other	1.449 [0.417, 5.041]	0.786 [0.496, 1.244]	—	1.183 [0.655, 2.136]	0.276 [0.118, 0.648]**	0.443 [0.140, 1.405]	0.714 [0.218, 2.337]	0.335 [0.131, 0.852]*
Year 2020	3.417 [0.564, 20.696]	0.585 [0.347, 0.987]*	—	0.553 [0.304, 1.004]	0.830 [0.305, 2.257]	0.622 [0.100, 3.866]	1.476 [0.463, 4.707]	1.494 [0.430, 5.189]
Year 2021	5.211 [0.774, 35.084]	0.693 [0.380, 1.263]	—	1.251 [0.669, 2.340]	0.454 [0.142, 1.450]	2.348 [0.575, 9.596]	1.180 [0.399, 3.488]	0.594 [0.164, 2.152]
Year 2022	4.763 [0.770, 29.484]	0.392 [0.235, 0.654]***	—	0.935 [0.514, 1.699]	0.383 [0.136, 1.084]	0.596 [0.156, 2.284]	1.001 [0.306, 3.278]	1.095 [0.290, 4.133]
Year 2023	2.285 [0.314, 16.609]	0.460 [0.275, 0.769]**	—	0.734 [0.411, 1.312]	0.309 [0.129, 0.744]**	1.883 [0.476, 7.445]	0.470 [0.106, 2.083]	0.444 [0.136, 1.451]
Year 2024	3.120 [0.391, 24.884]	0.332 [0.206, 0.534]***	—	0.934 [0.514, 1.697]	0.232 [0.076, 0.711]*	1.430 [0.305, 6.697]	1.361 [0.393, 4.709]	0.720 [0.213, 2.435]
Severity Medium	1.532 [0.170, 13.757]	1.099 [0.597, 2.024]	—	2.686 [1.031, 7.000]*	3.020 [1.179, 7.733]*	0.329 [0.068, 1.588]	1.370 [0.103, 18.236]	3.409 [1.298, 8.952]*
Severity High	3.470 [0.545, 22.076]	3.300 [1.795, 6.067]***	—	11.927 [4.378, 32.496]***	8.455 [2.361, 30.286]**	0.481 [0.073, 3.168]	5.837 [0.353, 96.645]	12.691 [3.231, 49.846]***
Mortality Risk Medium	1.412 [0.300, 6.649]	1.296 [0.777, 2.161]	—	1.412 [0.644, 3.099]	0.769 [0.301, 1.961]	6.715 [0.469, 96.243]	1.272 [0.101, 16.017]	0.416 [0.150, 1.154]
Mortality Risk High	8.937 [2.100, 38.033]**	7.286 [4.452, 11.923]***	—	7.506 [3.505, 16.075]***	4.280 [1.524, 12.019]**	90.776 [5.197, 1585.558]**	16.732 [0.945, 296.340]	2.286 [0.568, 9.196]
CV Arrhythmias	2.703 [0.461, 15.855]	2.508 [1.640, 3.835]***	—	1.878 [1.105, 3.193]*	2.270 [0.440, 11.709]	—	1.805 [0.597, 5.461]	—
CV Heart Failure	—	1.551 [0.765, 3.148]	—	0.680 [0.357, 1.297]	4.117 [0.670, 25.303]	—	0.755 [0.147, 3.868]	—
CV Peripheral Vascular Disease	—	1.384 [0.700, 2.736]	—	1.882 [0.948, 3.738]	—	—	—	—
RESP COPD	—	0.865 [0.516, 1.449]	—	—	1.142 [0.532, 2.453]	—	—	—
RESP Emphysema	—	1.350 [0.628, 2.905]	—	—	—	—	—	—
REN Acute Renal Failure	—	1.157 [0.745, 1.798]	—	0.465 [0.256, 0.843]*	1.225 [0.384, 3.910]	0.700 [0.134, 3.664]	0.574 [0.222, 1.487]	—
REN Chronic Renal Failure	—	0.801 [0.476, 1.349]	—	0.877 [0.377, 2.039]	—	—	—	—
REN End Stage Renal	—	1.210 [0.440, 3.328]	—	1.107 [0.233, 5.269]	—	—	—	—
HEP Alcoholic Cirrhosis	—	4.357 [2.123, 8.944]***	—	—	—	—	—	—
GI Peptic Ulcer	—	2.035 [0.554, 7.479]	—	—	—	—	—	—
CV Cerebrovascular Disease	—	—	—	1.038 [0.471, 2.287]	—	—	3.283 [0.465, 23.189]	1.566 [0.492, 4.982]
MET Metabolic Syndrome	—	—	—	2.562 [0.724, 9.068]	—	—	—	—
RESP Asthma	—	—	—	3.147 [1.268, 7.806]*	—	—	—	—
RESP Pulmonary Fibrosis	—	—	—	2.534 [0.609, 10.534]	—	—	—	—
HEP Steatosis	—	—	—	1.926 [1.002, 3.701]*	6.214 [1.237, 31.206]*	—	—	—
PSI Bipolar	—	—	—	—	1.445 [0.193, 10.842]	—	—	—

Code

# Table 10: Performance Metrics - GEE Logistic Models (ALL sites)

gee_perf = []
for site in all_sites:
    if site not in gee_mortality_results:
        continue

    result = gee_mortality_results[site]
    df_gee = gee_mortality_models[site][1]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()
    event_rate = n_events / n_total * 100 if n_total > 0 else 0

    try:
        y_pred_proba = result.fittedvalues
        y_actual = df_gee['egresar_fallecido'].values

        auc = roc_auc_score(y_actual, y_pred_proba) if len(np.unique(y_actual)) > 1 else np.nan
        brier = brier_score_loss(y_actual, y_pred_proba)

        opt_threshold = find_optimal_threshold(y_actual, y_pred_proba)
        y_pred = (y_pred_proba >= opt_threshold).astype(int)

        accuracy = accuracy_score(y_actual, y_pred)
        sensitivity = recall_score(y_actual, y_pred, zero_division=0)
        specificity = recall_score(y_actual, y_pred, pos_label=0, zero_division=0)
        ppv = precision_score(y_actual, y_pred, zero_division=0)
        tn = ((1 - y_actual) * (1 - y_pred)).sum()
        fn = (y_actual * (1 - y_pred)).sum()
        npv = tn / (tn + fn) if (tn + fn) > 0 else 0
        f1 = f1_score(y_actual, y_pred, zero_division=0)
        mcc = matthews_corrcoef(y_actual, y_pred)
        youdens_j = sensitivity + specificity - 1

    except Exception:
        accuracy = sensitivity = specificity = ppv = npv = f1 = mcc = brier = auc = np.nan
        opt_threshold = np.nan
        youdens_j = np.nan

    n_sig_comorbs = len(sig_comorbidities_mortality.get(site, []))
    fit_metrics = get_gee_fit_metrics(result, df_gee['egresar_fallecido'], family_type='binomial')

    gee_perf.append({
        'Cancer Site': site,
        'N obs': fit_metrics['N_obs'],
        'N clusters': fit_metrics['N_clusters'],
        'N params': fit_metrics['N_params'],
        'Events': f'{n_events} ({event_rate:.1f}%)',
        'Sig. Comorb.': n_sig_comorbs,
        'Correlation': fit_metrics['Correlation'],
        'QIC': safe_fmt(fit_metrics['QIC'], '.1f'),
        'QICu': safe_fmt(fit_metrics['QICu'], '.1f'),
        'Log-Lik': safe_fmt(fit_metrics['Log-Lik (model)'], '.1f'),
        'Pseudo R²': safe_fmt(fit_metrics['Pseudo R² (McFadden)']),
        'Scale': safe_fmt(fit_metrics['Scale']),
        'Threshold': safe_fmt(opt_threshold),
        'AUC': safe_fmt(auc),
        'Brier': safe_fmt(brier),
        'Sensitivity': safe_fmt(sensitivity),
        'Specificity': safe_fmt(specificity),
        'PPV': safe_fmt(ppv),
        'NPV': safe_fmt(npv),
        'Bal. Accuracy': safe_fmt((sensitivity + specificity) / 2 if not np.isnan(sensitivity) else np.nan),
        'F1': safe_fmt(f1),
        'MCC': safe_fmt(mcc),
        "Youden's J": safe_fmt(youdens_j),
    })

gee_perf_df = pd.DataFrame(gee_perf)
show_table(gee_perf_df, "Table 10: Performance & Fit Metrics - GEE Logistic Regression for Mortality\n(QIC = Quasi-Information Criterion, Pseudo R² = McFadden)", index=False)

Table 10: Performance & Fit Metrics - GEE Logistic Regression for Mortality (QIC = Quasi-Information Criterion, Pseudo R² = McFadden)

Cancer Site	N obs	N clusters	N params	Events	Sig. Comorb.	Correlation	QIC	QICu	Log-Lik	Pseudo R²	Scale	Threshold	AUC	Brier	Sensitivity	Specificity	PPV	NPV	Bal. Accuracy	F1	MCC	Youden's J
Hypopharynx	400	49	16	33 (8.2%)	1	Exchangeable	174.8	202.4	-85.2	0.2523	1.0000	0.0556	0.8623	0.0624	0.9394	0.7193	0.2313	0.9925	0.8294	0.3713	0.3840	0.6587
Larynx	2625	64	25	198 (7.5%)	11	Exchangeable	1041.6	1084.5	-517.2	0.2633	1.0000	0.0915	0.8590	0.0554	0.7929	0.7973	0.2419	0.9793	0.7951	0.3707	0.3613	0.5902
Nasopharynx	368	50	16	23 (6.2%)	1	Exchangeable	—	—	—	—	1.0000	—	—	—	—	—	—	—	—	—	—	—
Oral Cavity	3001	68	26	176 (5.9%)	11	Exchangeable	915.3	959.3	-453.7	0.3228	1.0000	0.1031	0.8971	0.0426	0.7898	0.8683	0.2720	0.9851	0.8290	0.4047	0.4114	0.6581
Oropharynx	951	59	21	55 (5.8%)	6	Exchangeable	298.6	335.1	-146.5	0.3027	1.0000	0.1115	0.8876	0.0430	0.7636	0.8884	0.2958	0.9839	0.8260	0.4264	0.4271	0.6520
Other Pharynx	281	51	16	30 (10.7%)	1	Exchangeable	128.4	155.8	-61.9	0.3514	1.0000	0.2047	0.8981	0.0653	0.8000	0.8805	0.4444	0.9736	0.8402	0.5714	0.5333	0.6805
Salivary Glands	898	59	19	46 (5.1%)	4	Exchangeable	229.6	262.1	-112.1	0.3826	1.0000	0.0214	0.9350	0.0378	0.9783	0.8075	0.2153	0.9985	0.8929	0.3529	0.4099	0.7858
Sinonasal	887	58	16	41 (4.6%)	1	Exchangeable	235.2	263.7	-115.8	0.3025	1.0000	0.0453	0.8984	0.0360	0.8780	0.8298	0.2000	0.9929	0.8539	0.3258	0.3695	0.7078

Code

fig, ax = plt.subplots(figsize=(10, 7))
colors_roc = plt.cm.tab10(np.linspace(0, 1, len(gee_mortality_results)))

for idx, (site, result) in enumerate(gee_mortality_results.items()):
    try:
        y_pred_proba = result.fittedvalues
        df_gee = gee_mortality_models[site][1]
        y_actual = df_gee['egresar_fallecido'].values
        fpr, tpr, _ = roc_curve(y_actual, y_pred_proba)
        auc_val = roc_auc_score(y_actual, y_pred_proba)
        ax.plot(fpr, tpr, label=f'{site} (AUC={auc_val:.3f})', linewidth=2, color=colors_roc[idx])
    except:
        pass

ax.plot([0, 1], [0, 1], 'k--', label='Random', linewidth=1)
ax.set_xlabel('False Positive Rate', fontsize=11)
ax.set_ylabel('True Positive Rate', fontsize=11)
ax.set_title('ROC Curves - GEE Logistic Regression by Cancer Site', fontsize=12, fontweight='bold')
ax.legend(loc='lower right', fontsize=9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_DIR, 'roc_gee_logistic_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-roc-gee-logistic.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 8: Figure 8. ROC Curves - GEE Logistic Regression Models by Cancer Site

10.2 6.2 Forest Plot — Odds Ratios for In-Hospital Mortality

Code

def _extract_or_ci(result, drop_intercept=True, cap=(0.01, 100.0)):
    """Return DataFrame with var, OR, lo, hi, p — finite & capped for plotting."""
    params = result.params
    ci = result.conf_int()
    pvals = result.pvalues
    rows = []
    for var in params.index:
        if drop_intercept and var.lower() in {'const', 'intercept'}:
            continue
        coef = params[var]
        lo, hi = (ci.loc[var, 0], ci.loc[var, 1]) if var in ci.index else (np.nan, np.nan)
        p = pvals[var] if var in pvals.index else np.nan
        or_v, or_lo, or_hi = np.exp(coef), np.exp(lo), np.exp(hi)
        # Skip non-finite
        if not np.isfinite(or_v) or not np.isfinite(or_lo) or not np.isfinite(or_hi):
            continue
        # Cap extreme CIs for display only
        or_v_d  = float(np.clip(or_v,  cap[0], cap[1]))
        or_lo_d = float(np.clip(or_lo, cap[0], cap[1]))
        or_hi_d = float(np.clip(or_hi, cap[0], cap[1]))
        rows.append({
            'var': format_name(var),
            'OR': or_v_d, 'lo': or_lo_d, 'hi': or_hi_d,
            'p': p,
        })
    return pd.DataFrame(rows)

# Build per-site dataframes
forest_data_logit = {site: _extract_or_ci(res) for site, res in gee_mortality_results.items()}
forest_data_logit = {s: d for s, d in forest_data_logit.items() if not d.empty}

n_sites = len(forest_data_logit)
n_cols = 3
n_rows = int(np.ceil(n_sites / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows), squeeze=False)

for ax_idx, (site, df_f) in enumerate(forest_data_logit.items()):
    r, c = divmod(ax_idx, n_cols)
    ax = axes[r][c]
    df_f = df_f.sort_values('OR').reset_index(drop=True)
    y = np.arange(len(df_f))
    sig_mask = (df_f['p'] < 0.05)
    # Colors: red OR>1 sig, blue OR<1 sig, gray ns
    colors = np.where(~sig_mask, '#9e9e9e',
              np.where(df_f['OR'] > 1, '#c0392b', '#1f6f8b'))
    # Error bars
    ax.hlines(y, df_f['lo'], df_f['hi'], color=colors, linewidth=1.5, alpha=0.9)
    ax.scatter(df_f['OR'], y,
               s=55, marker='o',
               facecolors=np.where(sig_mask, colors, 'white'),
               edgecolors=colors, linewidths=1.5, zorder=3)
    ax.axvline(1, color='#444', linestyle='--', linewidth=0.8, alpha=0.7)
    ax.set_xscale('log')
    ax.set_xlim(0.05, 50)
    ax.set_yticks(y)
    ax.set_yticklabels(df_f['var'], fontsize=8)
    ax.set_xlabel('Odds Ratio (log scale)', fontsize=9)
    ax.set_title(f'{site}  (n predictors = {len(df_f)})', fontsize=10, fontweight='bold')
    ax.grid(axis='x', linestyle=':', alpha=0.4)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

# Hide empty subplots
for ax_idx in range(n_sites, n_rows * n_cols):
    r, c = divmod(ax_idx, n_cols)
    axes[r][c].set_visible(False)

# Legend
from matplotlib.lines import Line2D
legend_elems = [
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='#c0392b', markeredgecolor='#c0392b', label='OR > 1, p < 0.05 (harmful)'),
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='#1f6f8b', markeredgecolor='#1f6f8b', label='OR < 1, p < 0.05 (protective)'),
    Line2D([0], [0], marker='o', linestyle='None', markersize=7,
           markerfacecolor='white', markeredgecolor='#9e9e9e', label='Not significant (p ≥ 0.05)'),
    Line2D([0], [0], color='#444', linestyle='--', linewidth=1, label='OR = 1 (null)'),
]
fig.legend(handles=legend_elems, loc='lower center', ncol=4, frameon=False,
           bbox_to_anchor=(0.5, -0.01), fontsize=10)

fig.suptitle('Forest plot — GEE Logistic Mortality Models: Odds Ratios [95% CI] by Cancer Site',
             fontsize=13, fontweight='bold', y=1.00)
plt.tight_layout(rect=[0, 0.02, 1, 0.98])
plt.savefig(os.path.join(OUTPUT_DIR, 'forest_gee_logistic_mortality.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-forest-gee-logistic.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 9: Figure 9. Forest plot — Odds Ratios (95% CI) for in-hospital mortality from the GEE mixed-effects logistic models, by cancer site. Reference line at OR=1. Filled markers = p<0.05.

11 7. GEE Mixed-Effects Regression for Length of Stay (All Cancer Sites)

Code

# Store GEE LOS model results
gee_los_models = {}
gee_los_results = {}

def build_gee_los_model(df_site, site_name):
    """Build GEE Negative Binomial regression (mixed-effects approximation) for LOS.
    Uses hospital_id as cluster group, year_factor as fixed effect,
    and only significant comorbidities from bivariate analysis."""
    df_model = df_site[df_site['dias_estancia'] > 0].copy()

    df_model['SEXO_coded'] = (df_model['SEXO'] == 'Female').astype(int)

    service_dummies = pd.get_dummies(df_model['service_category'], prefix='service', drop_first=True)
    for col in service_dummies.columns:
        df_model[col] = service_dummies[col]
    service_dummy_cols = list(service_dummies.columns)

    year_dummies = pd.get_dummies(df_model['year_factor'], prefix='year', drop_first=True)
    for col in year_dummies.columns:
        df_model[col] = year_dummies[col]
    year_dummy_cols = list(year_dummies.columns)

    df_model['edad'] = pd.to_numeric(df_model['edad'], errors='coerce')

    predictor_cols = ['edad', 'SEXO_coded'] + service_dummy_cols + year_dummy_cols

    # Add severity and mortality risk as CATEGORICAL dummies
    for col in ['IR_29301_SEVERIDAD', 'IR_29301_MORTALIDAD']:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce')
            df_model[col] = df_model[col].fillna(1)
            df_model[col] = df_model[col].clip(lower=1).astype(int)
            prefix = 'severity' if 'SEVERIDAD' in col else 'mort_risk'
            col_dummies = pd.get_dummies(df_model[col], prefix=prefix, drop_first=True)
            for dc in col_dummies.columns:
                df_model[dc] = col_dummies[dc]
                predictor_cols.append(dc)

    # Add ONLY significant comorbidities from bivariate Wilcoxon test (p < 0.05)
    sig_comorbs = sig_comorbidities_los.get(site_name, [])
    used_comorbs = []
    for col in sig_comorbs:
        if col in df_model.columns:
            df_model[col] = pd.to_numeric(df_model[col], errors='coerce').fillna(0).astype(int)
            if df_model[col].sum() >= 3:
                predictor_cols.append(col)
                used_comorbs.append(col)

    df_model['_group'] = df_model['hospital_id']

    cols_for_model = ['dias_estancia', '_group'] + predictor_cols
    df_gee = df_model[cols_for_model].dropna()

    n_clusters = df_gee['_group'].nunique()
    if len(df_gee) < 50 or n_clusters < 2:
        return None, None, [], []

    y = df_gee['dias_estancia'].astype(float)
    X = df_gee[predictor_cols].astype(float)
    X = sm.add_constant(X)
    groups = df_gee['_group']

    # Fit GEE with Negative Binomial family and exchangeable correlation
    try:
        gee_model = GEE(y, X, groups=groups,
                        family=NegativeBinomial(alpha=1.0),
                        cov_struct=Exchangeable())
        gee_result = gee_model.fit(maxiter=200)
    except Exception:
        try:
            gee_model = GEE(y, X, groups=groups,
                            family=NegativeBinomial(alpha=1.0),
                            cov_struct=Independence())
            gee_result = gee_model.fit(maxiter=200)
        except Exception:
            try:
                gee_model = GEE(y, X, groups=groups,
                                family=Poisson(),
                                cov_struct=Exchangeable())
                gee_result = gee_model.fit(maxiter=200)
            except Exception:
                return None, None, [], []

    # Remove problematic variables
    max_retries = 5
    current_predictors = list(predictor_cols)
    for _ in range(max_retries):
        params = gee_result.params
        bad_vars = []
        for i, var in enumerate(current_predictors):
            idx = i + 1
            if idx < len(params) and (np.isinf(params.iloc[idx]) or abs(params.iloc[idx]) > 50):
                bad_vars.append(var)
        if not bad_vars:
            break
        current_predictors = [p for p in current_predictors if p not in bad_vars]
        if len(current_predictors) < 2:
            return None, None, [], []
        X = df_gee[current_predictors].astype(float)
        X = sm.add_constant(X)
        try:
            gee_model = GEE(y, X, groups=groups, family=NegativeBinomial(alpha=1.0), cov_struct=Exchangeable())
            gee_result = gee_model.fit(maxiter=200)
        except:
            try:
                gee_model = GEE(y, X, groups=groups, family=NegativeBinomial(alpha=1.0), cov_struct=Independence())
                gee_result = gee_model.fit(maxiter=200)
            except:
                try:
                    gee_model = GEE(y, X, groups=groups, family=Poisson(), cov_struct=Exchangeable())
                    gee_result = gee_model.fit(maxiter=200)
                except:
                    return None, None, [], []

    return gee_result, df_gee, current_predictors, used_comorbs

# Fit GEE LOS models for EACH cancer site
for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()

    if len(df_site) < 50:
        continue

    gee_result, df_gee, used_cols, used_comorbs = build_gee_los_model(df_site, site)

    if gee_result is not None:
        gee_los_models[site] = (gee_result, df_gee, used_cols)
        gee_los_results[site] = gee_result

11.1 7.1 GEE Regression Results - IRR [95% CI]

Code

# Table 11: GEE Regression - IRR [95% CI] by Cancer Site
gee_irr_table = {}

for site in all_sites:
    if site not in gee_los_results:
        continue

    result = gee_los_results[site]
    irr_ci_list = {}

    param_names = list(result.params.index)
    ci = result.conf_int()

    def _fmt_irr(v):
        if np.isinf(v) or abs(v) > 1e6:
            return 'Inf' if v > 0 else '-Inf'
        return f"{v:.3f}"

    for var in param_names:
        coef = result.params[var]
        irr_val = np.exp(coef)
        ci_lower = np.exp(ci.loc[var, 0] if var in ci.index else ci.iloc[param_names.index(var), 0])
        ci_upper = np.exp(ci.loc[var, 1] if var in ci.index else ci.iloc[param_names.index(var), 1])
        pval = result.pvalues[var] if var in result.pvalues.index else result.pvalues.iloc[param_names.index(var)]

        if np.isnan(irr_val) or np.isnan(pval):
            irr_ci_list[format_name(var)] = '—'
            continue

        sig_stars = ''
        if pval < 0.001:
            sig_stars = '***'
        elif pval < 0.01:
            sig_stars = '**'
        elif pval < 0.05:
            sig_stars = '*'

        irr_ci_str = f"{_fmt_irr(irr_val)} [{_fmt_irr(ci_lower)}, {_fmt_irr(ci_upper)}]{sig_stars}"
        irr_ci_list[format_name(var)] = irr_ci_str

    gee_irr_table[site] = irr_ci_list

gee_irr_df = pd.DataFrame(gee_irr_table).fillna('—')
show_table(gee_irr_df, "Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site\n(Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities)\n* p<0.05, ** p<0.01, *** p<0.001", index=True)

Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site (Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities) * p<0.05, p<0.01, * p<0.001

	Hypopharynx	Larynx	Nasopharynx	Oral Cavity	Oropharynx	Other Pharynx	Salivary Glands	Sinonasal
Intercept	2.530 [1.002, 6.390]*	9.526 [6.003, 15.115]***	6.734 [3.032, 14.953]***	5.628 [3.551, 8.920]***	5.804 [2.892, 11.649]***	7.964 [4.455, 14.234]***	3.986 [2.629, 6.045]***	4.623 [2.875, 7.433]***
Age	1.009 [1.000, 1.019]	0.998 [0.993, 1.003]	1.005 [0.998, 1.013]	1.001 [0.997, 1.004]	1.001 [0.992, 1.009]	1.002 [0.997, 1.007]	1.002 [0.995, 1.008]	0.999 [0.994, 1.003]
Sex (Female)	0.828 [0.644, 1.065]	0.781 [0.684, 0.893]***	0.760 [0.546, 1.057]	0.856 [0.784, 0.934]***	0.953 [0.744, 1.222]	1.288 [0.952, 1.743]	0.819 [0.687, 0.975]*	0.848 [0.741, 0.970]*
Service ITU	2.636 [1.467, 4.734]**	1.069 [0.799, 1.429]	0.554 [0.342, 0.897]*	1.014 [0.753, 1.366]	0.850 [0.504, 1.433]	0.924 [0.286, 2.985]	0.988 [0.727, 1.342]	1.556 [0.887, 2.728]
Service MS	1.475 [0.887, 2.451]	0.731 [0.578, 0.924]**	1.132 [0.695, 1.845]	0.884 [0.645, 1.211]	0.663 [0.489, 0.899]**	0.645 [0.442, 0.943]*	1.053 [0.808, 1.371]	1.315 [0.922, 1.876]
Service Other	1.368 [0.825, 2.269]	0.707 [0.544, 0.919]**	0.985 [0.666, 1.459]	0.773 [0.550, 1.087]	0.842 [0.558, 1.269]	0.611 [0.419, 0.890]*	1.039 [0.811, 1.333]	1.168 [0.810, 1.684]
Year 2020	0.913 [0.618, 1.349]	1.199 [1.008, 1.425]*	0.878 [0.338, 2.282]	1.218 [0.894, 1.659]	1.122 [0.878, 1.434]	1.114 [0.725, 1.714]	0.987 [0.741, 1.316]	1.023 [0.784, 1.333]
Year 2021	1.111 [0.886, 1.395]	1.129 [0.939, 1.359]	0.530 [0.324, 0.865]*	1.096 [0.854, 1.407]	1.020 [0.797, 1.305]	0.949 [0.605, 1.487]	1.171 [0.814, 1.684]	1.035 [0.783, 1.367]
Year 2022	1.179 [0.735, 1.893]	1.248 [0.971, 1.604]	0.712 [0.404, 1.253]	1.047 [0.813, 1.347]	1.090 [0.840, 1.415]	0.610 [0.414, 0.900]*	0.822 [0.603, 1.121]	1.043 [0.829, 1.313]
Year 2023	1.053 [0.711, 1.558]	0.928 [0.776, 1.110]	0.611 [0.342, 1.092]	1.010 [0.743, 1.373]	1.097 [0.855, 1.407]	0.381 [0.279, 0.521]***	0.866 [0.592, 1.267]	0.897 [0.653, 1.234]
Year 2024	0.695 [0.480, 1.006]	1.105 [0.912, 1.339]	0.843 [0.535, 1.327]	0.977 [0.802, 1.190]	1.191 [0.896, 1.582]	0.890 [0.614, 1.290]	0.896 [0.615, 1.306]	0.890 [0.690, 1.148]
Severity Medium	1.394 [1.120, 1.734]**	1.898 [1.466, 2.457]***	1.410 [0.995, 1.997]	1.862 [1.555, 2.228]***	1.679 [1.314, 2.146]***	1.743 [1.241, 2.448]**	1.446 [1.121, 1.866]**	1.682 [1.286, 2.201]***
Severity High	1.982 [1.420, 2.766]***	2.259 [1.732, 2.947]***	3.194 [2.005, 5.087]***	3.003 [2.408, 3.745]***	2.435 [1.839, 3.225]***	2.350 [1.641, 3.364]***	3.575 [2.434, 5.250]***	3.174 [2.028, 4.969]***
Mortality Risk Medium	1.153 [0.853, 1.559]	0.868 [0.770, 0.978]*	0.929 [0.587, 1.470]	1.113 [0.930, 1.333]	0.891 [0.732, 1.085]	0.850 [0.645, 1.120]	1.195 [0.963, 1.485]	0.931 [0.735, 1.180]
Mortality Risk High	1.463 [1.038, 2.062]*	1.306 [1.109, 1.539]**	0.905 [0.514, 1.591]	1.152 [0.877, 1.512]	1.033 [0.775, 1.376]	1.112 [0.731, 1.691]	1.254 [0.928, 1.695]	1.030 [0.721, 1.472]
CV Peripheral Vascular Disease	2.125 [0.977, 4.623]	1.069 [0.830, 1.378]	1.107 [0.613, 1.998]	1.371 [1.064, 1.768]*	1.361 [0.877, 2.112]	1.706 [0.990, 2.940]	—	0.995 [0.595, 1.663]
RESP COPD	1.070 [0.627, 1.825]	1.076 [0.911, 1.270]	—	1.078 [0.859, 1.354]	1.713 [1.009, 2.908]*	—	1.016 [0.638, 1.616]	—
RESP Emphysema	2.468 [1.125, 5.416]*	1.396 [1.120, 1.741]**	—	1.097 [0.597, 2.016]	—	—	—	—
REN Acute Renal Failure	0.989 [0.486, 2.014]	0.897 [0.719, 1.120]	0.667 [0.356, 1.248]	0.884 [0.690, 1.134]	1.179 [0.726, 1.916]	—	0.837 [0.523, 1.341]	2.194 [1.302, 3.698]**
REN Chronic Renal Failure	1.003 [0.596, 1.688]	1.044 [0.867, 1.256]	—	1.138 [0.921, 1.406]	—	1.549 [0.858, 2.794]	0.909 [0.631, 1.310]	—
REN End Stage Renal	0.870 [0.526, 1.440]	—	—	0.966 [0.615, 1.518]	—	—	—	1.025 [0.632, 1.664]
NEURO Epilepsy	2.212 [1.453, 3.366]***	1.064 [0.748, 1.513]	—	0.949 [0.662, 1.362]	—	—	—	—
SUST Alcohol	1.378 [1.010, 1.882]*	1.223 [0.938, 1.593]	0.941 [0.405, 2.187]	1.218 [0.961, 1.544]	1.209 [0.946, 1.546]	—	—	—
SUST Tobacco	1.158 [0.778, 1.723]	1.176 [1.055, 1.312]**	—	1.153 [0.994, 1.338]	1.116 [0.872, 1.427]	1.745 [1.194, 2.550]**	—	—
CV Essential Hypertension	—	1.012 [0.913, 1.121]	—	0.933 [0.861, 1.011]	1.054 [0.933, 1.190]	—	1.045 [0.910, 1.200]	—
CV Complicated Hypertension	—	1.421 [1.026, 1.969]*	—	—	—	—	1.231 [0.805, 1.882]	—
CV Coronary Disease	—	0.909 [0.764, 1.081]	—	1.110 [0.839, 1.468]	1.229 [0.816, 1.851]	—	0.833 [0.487, 1.426]	—
CV Heart Failure	—	0.941 [0.675, 1.310]	—	0.727 [0.563, 0.938]*	—	—	0.935 [0.498, 1.753]	1.395 [0.878, 2.216]
CV Arrhythmias	—	1.104 [0.934, 1.305]	0.986 [0.636, 1.528]	1.099 [0.922, 1.311]	1.240 [0.881, 1.745]	2.644 [1.821, 3.838]***	1.236 [0.802, 1.906]	1.563 [1.129, 2.163]**
CV Cerebrovascular Disease	—	1.311 [0.812, 2.116]	1.964 [0.821, 4.699]	0.900 [0.491, 1.647]	1.093 [0.637, 1.876]	—	1.972 [0.630, 6.169]	0.829 [0.562, 1.223]
MET Metabolic Syndrome	—	2.376 [1.386, 4.076]**	—	1.285 [0.750, 2.201]	—	—	—	—
HEP Alcoholic Cirrhosis	—	1.176 [0.806, 1.716]	—	—	—	—	—	—
GI Peptic Ulcer	—	1.693 [1.345, 2.130]***	—	0.899 [0.632, 1.279]	—	—	—	—
NEURO Stroke Sequelae	—	0.855 [0.484, 1.509]	—	1.008 [0.520, 1.954]	—	—	0.579 [0.169, 1.984]	—
PSI Major Depression	—	1.374 [1.121, 1.684]**	—	1.383 [1.012, 1.888]*	1.804 [1.078, 3.019]*	—	—	—
PSI Anxiety	—	1.595 [1.007, 2.528]*	—	2.083 [1.335, 3.252]**	—	—	—	—
MET Type2 Diabetes	—	—	1.222 [0.759, 1.966]	—	—	—	—	1.110 [0.924, 1.332]
MET Obesity	—	—	—	1.264 [1.061, 1.507]**	—	—	—	—
MET Thyroid Disorders	—	—	—	1.029 [0.897, 1.181]	—	—	—	—
RESP Sleep Apnea	—	—	—	2.220 [1.306, 3.772]**	—	—	—	—
INF HIV	—	—	—	1.481 [0.786, 2.793]	0.521 [0.221, 1.230]	—	—	—
RESP Pulmonary Fibrosis	—	—	—	—	1.157 [0.629, 2.127]	—	0.939 [0.523, 1.687]	—
HEP Steatosis	—	—	—	—	1.687 [0.787, 3.618]	—	2.245 [0.851, 5.924]	—
PSI Bipolar	—	—	—	—	1.135 [0.551, 2.337]	—	—	—

Code

# Table 12: Performance Metrics - GEE LOS Models
gee_los_perf = []
for site in all_sites:
    if site not in gee_los_results:
        continue

    result = gee_los_results[site]
    df_gee = gee_los_models[site][1]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)

    try:
        y_pred = result.fittedvalues
        y_actual = df_gee['dias_estancia'].values

        rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
        mae = np.mean(np.abs(y_actual - y_pred))
        ss_res = np.sum((y_actual - y_pred) ** 2)
        ss_tot = np.sum((y_actual - np.mean(y_actual)) ** 2)
        pseudo_r2 = 1 - ss_res / ss_tot if ss_tot > 0 else np.nan

    except Exception:
        rmse = mae = pseudo_r2 = np.nan

    n_sig_comorbs = len(sig_comorbidities_los.get(site, []))
    fit_metrics = get_gee_fit_metrics(result, df_gee['dias_estancia'], family_type='count')

    gee_los_perf.append({
        'Cancer Site': site,
        'N obs': fit_metrics['N_obs'],
        'N clusters': fit_metrics['N_clusters'],
        'N params': fit_metrics['N_params'],
        'Mean LOS': f'{df_site["dias_estancia"].mean():.1f}',
        'Median LOS': f'{df_site["dias_estancia"].median():.0f}',
        'Sig. Comorb.': n_sig_comorbs,
        'Correlation': fit_metrics['Correlation'],
        'QIC': safe_fmt(fit_metrics['QIC'], '.1f'),
        'QICu': safe_fmt(fit_metrics['QICu'], '.1f'),
        'Pseudo R²': safe_fmt(fit_metrics['Pseudo R²']),
        'Scale': safe_fmt(fit_metrics['Scale']),
        'RMSE': f'{rmse:.2f}' if not np.isnan(rmse) else '—',
        'MAE': f'{mae:.2f}' if not np.isnan(mae) else '—',
    })

gee_los_perf_df = pd.DataFrame(gee_los_perf)
show_table(gee_los_perf_df, "Table 12: Performance & Fit Metrics - GEE Regression Models for LOS\n(QIC = Quasi-Information Criterion, Negative Binomial/Poisson family)", index=False)

Table 12: Performance & Fit Metrics - GEE Regression Models for LOS (QIC = Quasi-Information Criterion, Negative Binomial/Poisson family)

Cancer Site	N obs	N clusters	N params	Mean LOS	Median LOS	Sig. Comorb.	Correlation	QIC	QICu	Pseudo R²	Scale	RMSE	MAE
Hypopharynx	388	49	24	13.7	7	9	Exchangeable	23454.4	384.7	0.2763	1.0000	16.26	10.26
Larynx	2398	64	35	14.5	7	20	Exchangeable	56825.0	2587.5	0.1611	1.0000	20.50	12.44
Nasopharynx	304	49	21	9.0	4	6	Exchangeable	16628.3	330.6	0.1315	1.0000	18.14	9.21
Oral Cavity	2575	67	38	9.6	4	23	Exchangeable	38797.2	2530.5	0.1196	1.0000	16.14	8.95
Oropharynx	822	59	29	8.4	4	14	Exchangeable	21087.4	767.1	0.1652	1.0000	13.02	7.56
Other Pharynx	252	48	19	10.8	5	4	Exchangeable	14079.0	235.0	0.1606	1.0000	15.03	9.22
Salivary Glands	800	58	27	6.9	4	12	Exchangeable	11993.7	615.4	0.1378	1.0000	11.20	5.84
Sinonasal	727	55	22	6.7	3	7	Exchangeable	10252.5	577.6	0.2372	1.0000	10.48	5.93

11.2 7.1 Forest Plot — Incidence Rate Ratios for Length of Stay

Code

# Reuses _extract_or_ci defined in §6.2 — coefficients exponentiate to IRRs for log-link NB models.
forest_data_los = {site: _extract_or_ci(res) for site, res in gee_los_results.items()}
forest_data_los = {s: d for s, d in forest_data_los.items() if not d.empty}

n_sites = len(forest_data_los)
n_cols = 3
n_rows = int(np.ceil(n_sites / n_cols))

fig, axes = plt.subplots(n_rows, n_cols, figsize=(16, 4 * n_rows), squeeze=False)

for ax_idx, (site, df_f) in enumerate(forest_data_los.items()):
    r, c = divmod(ax_idx, n_cols)
    ax = axes[r][c]
    df_f = df_f.sort_values('OR').reset_index(drop=True)  # OR field carries IRR here
    y = np.arange(len(df_f))
    sig_mask = (df_f['p'] < 0.05)
    # IRR > 1 means longer LOS (worse), IRR < 1 means shorter
    colors = np.where(~sig_mask, '#9e9e9e',
              np.where(df_f['OR'] > 1, '#b8860b', '#2e7d32'))
    ax.hlines(y, df_f['lo'], df_f['hi'], color=colors, linewidth=1.5, alpha=0.9)
    ax.scatter(df_f['OR'], y,
               s=55, marker='s',
               facecolors=np.where(sig_mask, colors, 'white'),
               edgecolors=colors, linewidths=1.5, zorder=3)
    ax.axvline(1, color='#444', linestyle='--', linewidth=0.8, alpha=0.7)
    ax.set_xscale('log')
    ax.set_xlim(0.3, 8)
    ax.set_yticks(y)
    ax.set_yticklabels(df_f['var'], fontsize=8)
    ax.set_xlabel('Incidence Rate Ratio (log scale)', fontsize=9)
    ax.set_title(f'{site}  (n predictors = {len(df_f)})', fontsize=10, fontweight='bold')
    ax.grid(axis='x', linestyle=':', alpha=0.4)
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)

for ax_idx in range(n_sites, n_rows * n_cols):
    r, c = divmod(ax_idx, n_cols)
    axes[r][c].set_visible(False)

from matplotlib.lines import Line2D
legend_elems = [
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='#b8860b', markeredgecolor='#b8860b', label='IRR > 1, p < 0.05 (longer stay)'),
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='#2e7d32', markeredgecolor='#2e7d32', label='IRR < 1, p < 0.05 (shorter stay)'),
    Line2D([0], [0], marker='s', linestyle='None', markersize=7,
           markerfacecolor='white', markeredgecolor='#9e9e9e', label='Not significant (p ≥ 0.05)'),
    Line2D([0], [0], color='#444', linestyle='--', linewidth=1, label='IRR = 1 (null)'),
]
fig.legend(handles=legend_elems, loc='lower center', ncol=4, frameon=False,
           bbox_to_anchor=(0.5, -0.01), fontsize=10)

fig.suptitle('Forest plot — GEE Negative-Binomial LOS Models: IRR [95% CI] by Cancer Site',
             fontsize=13, fontweight='bold', y=1.00)
plt.tight_layout(rect=[0, 0.02, 1, 0.98])
plt.savefig(os.path.join(OUTPUT_DIR, 'forest_gee_nb_los.png'), dpi=300, bbox_inches='tight', facecolor='white')
plt.savefig(_FIG / "fig-forest-gee-los.png", dpi=300, bbox_inches="tight")
plt.show()

Figure 10: Figure 10. Forest plot — Incidence Rate Ratios (95% CI) for length of stay from the GEE Negative Binomial mixed-effects models, by cancer site. Reference line at IRR=1. Filled markers = p<0.05.

12 8. Machine Learning Models for Mortality Prediction (Per Cancer Site, Multiple Class Balancing)

Machine learning models for per-site mortality prediction using multiple class balancing techniques.

Code

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression as LR_sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    roc_auc_score, roc_curve, confusion_matrix as cm_func,
    accuracy_score, f1_score, precision_score, recall_score,
    brier_score_loss, matthews_corrcoef
)

from collections import OrderedDict
try:
    from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTETomek
    IMBLEARN_AVAILABLE = True
except ImportError:
    IMBLEARN_AVAILABLE = False

# Feature columns: age, sex, service dummies, ALL comorbidities
ml_comorbidity_cols = [c for c in df_head_neck.columns if c in comorbidity_cols]

ALL_BALANCING_NAMES = ['None (Original)', 'SMOTE', 'ADASYN', 'SMOTETomek',
                       'Random Oversampling', 'Random Undersampling']

def get_balancing_methods(y_train):
    """Return ALL class balancing methods, adapting k_neighbors to minority class size."""
    methods = OrderedDict()
    methods['None (Original)'] = None

    if not IMBLEARN_AVAILABLE:
        for name in ALL_BALANCING_NAMES[1:]:
            methods[name] = 'imblearn not installed'
        return methods

    n_minority = int(y_train.sum())
    k = min(5, n_minority - 1) if n_minority > 1 else 0

    if k >= 1:
        methods['SMOTE'] = SMOTE(random_state=42, k_neighbors=k)
        methods['ADASYN'] = ADASYN(random_state=42, n_neighbors=min(k, n_minority - 1))
        methods['SMOTETomek'] = SMOTETomek(random_state=42, smote=SMOTE(random_state=42, k_neighbors=k))
    else:
        methods['SMOTE'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
        methods['ADASYN'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
        methods['SMOTETomek'] = f'Insufficient minority samples (n={n_minority}, need k>=1)'
    methods['Random Oversampling'] = RandomOverSampler(random_state=42)
    methods['Random Undersampling'] = RandomUnderSampler(random_state=42)

    return methods

def _failed_balancing_entry(reason):
    """Return a placeholder result dict for a failed balancing technique."""
    return {
        'model': None,
        'y_pred': None,
        'y_proba': None,
        'y_test': None,
        'opt_threshold': np.nan,
        'cm': None,
        'failed': True,
        'fail_reason': reason,
        'metrics': {
            'AUC': np.nan, 'Bal. Accuracy': np.nan, 'Sensitivity': np.nan,
            'Specificity': np.nan, 'PPV': np.nan, 'NPV': np.nan,
            'F1': np.nan, 'MCC': np.nan, 'Brier': np.nan,
            "Youden's J": np.nan, 'Threshold': np.nan,
            'Train Size': '—', 'Minority %': '—'
        }
    }

def run_ml_mortality_multi_balancing(df_site, site_name):
    """Train ML classifiers with multiple class balancing techniques + optimal threshold"""
    df_ml = df_site.copy()
    df_ml['sex_female'] = (df_ml['SEXO'] == 'Female').astype(int)
    df_ml['edad'] = pd.to_numeric(df_ml['edad'], errors='coerce')

    svc_dummies = pd.get_dummies(df_ml['service_category'], prefix='service', drop_first=True)
    for c in svc_dummies.columns:
        df_ml[c] = svc_dummies[c]

    feature_names = ['edad', 'sex_female'] + list(svc_dummies.columns) + ml_comorbidity_cols
    target = 'egresar_fallecido'

    df_ml = df_ml[feature_names + [target]].dropna()

    n_pos = int(df_ml[target].sum())
    n_neg = int((df_ml[target] == 0).sum())

    if len(df_ml) < 30 or n_pos < 5 or n_neg < 5:
        return None, None, None

    X = df_ml[feature_names].values
    y = df_ml[target].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    try:
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42, stratify=y
        )
    except ValueError:
        X_train, X_test, y_train, y_test = train_test_split(
            X_scaled, y, test_size=0.2, random_state=42
        )

    balancing_methods = get_balancing_methods(y_train)

    base_model_class = GradientBoostingClassifier
    base_model_params = {'n_estimators': 150, 'random_state': 42, 'max_depth': 4}

    balancing_results = OrderedDict()

    for bal_name, balancer in balancing_methods.items():
        if isinstance(balancer, str):
            balancing_results[bal_name] = _failed_balancing_entry(balancer)
            continue

        try:
            if balancer is not None:
                X_res, y_res = balancer.fit_resample(X_train, y_train)
            else:
                X_res, y_res = X_train, y_train

            model = base_model_class(**base_model_params)
            model.fit(X_res, y_res)
            y_proba = model.predict_proba(X_test)[:, 1]

            fpr_t, tpr_t, thresholds_t = roc_curve(y_test, y_proba)
            j_scores = tpr_t - fpr_t
            best_idx = np.argmax(j_scores)
            opt_threshold = thresholds_t[best_idx]
            y_pred = (y_proba >= opt_threshold).astype(int)

            tn, fp, fn, tp = cm_func(y_test, y_pred).ravel()
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0
            ppv_val = tp / (tp + fp) if (tp + fp) > 0 else 0
            npv_val = tn / (tn + fn) if (tn + fn) > 0 else 0

            balancing_results[bal_name] = {
                'model': model,
                'y_pred': y_pred,
                'y_proba': y_proba,
                'y_test': y_test,
                'opt_threshold': opt_threshold,
                'cm': cm_func(y_test, y_pred),
                'failed': False,
                'metrics': {
                    'AUC': roc_auc_score(y_test, y_proba) if len(np.unique(y_test)) > 1 else 0,
                    'Bal. Accuracy': (sens + spec) / 2,
                    'Sensitivity': sens,
                    'Specificity': spec,
                    'PPV': ppv_val,
                    'NPV': npv_val,
                    'F1': f1_score(y_test, y_pred, zero_division=0),
                    'MCC': matthews_corrcoef(y_test, y_pred),
                    'Brier': brier_score_loss(y_test, y_proba),
                    "Youden's J": sens + spec - 1,
                    'Threshold': opt_threshold,
                    'Train Size': len(y_res),
                    'Minority %': f'{y_res.sum()/len(y_res)*100:.1f}'
                }
            }
        except Exception as e:
            balancing_results[bal_name] = _failed_balancing_entry(str(e))

    successful_results = {k: v for k, v in balancing_results.items() if not v.get('failed', False)}
    if not successful_results:
        return balancing_results, None, None

    best_balancing = max(successful_results, key=lambda k: successful_results[k]['metrics']['AUC'])
    best_balancer = balancing_methods.get(best_balancing)

    if best_balancer is not None:
        try:
            X_best, y_best = best_balancer.fit_resample(X_train, y_train)
        except:
            X_best, y_best = X_train, y_train
    else:
        X_best, y_best = X_train, y_train

    all_models = {
        'Logistic Regression': LR_sklearn(max_iter=1000, random_state=42, class_weight='balanced'),
        'Random Forest': RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1, class_weight='balanced'),
        'Gradient Boosting': GradientBoostingClassifier(n_estimators=150, random_state=42, max_depth=4),
        'SVM': SVC(kernel='rbf', probability=True, random_state=42, class_weight='balanced'),
        'MLP': MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42)
    }

    model_results = {}
    for name, model in all_models.items():
        try:
            model.fit(X_best, y_best)
            y_proba = model.predict_proba(X_test)[:, 1]

            fpr_t, tpr_t, thresholds_t = roc_curve(y_test, y_proba)
            j_scores = tpr_t - fpr_t
            best_idx = np.argmax(j_scores)
            opt_threshold = thresholds_t[best_idx]
            y_pred = (y_proba >= opt_threshold).astype(int)

            tn, fp, fn, tp = cm_func(y_test, y_pred).ravel()
            sens = tp / (tp + fn) if (tp + fn) > 0 else 0
            spec = tn / (tn + fp) if (tn + fp) > 0 else 0
            ppv_val = tp / (tp + fp) if (tp + fp) > 0 else 0
            npv_val = tn / (tn + fn) if (tn + fn) > 0 else 0

            model_results[name] = {
                'model': model,
                'y_pred': y_pred,
                'y_proba': y_proba,
                'y_test': y_test,
                'opt_threshold': opt_threshold,
                'cm': cm_func(y_test, y_pred),
                'metrics': {
                    'AUC': roc_auc_score(y_test, y_proba) if len(np.unique(y_test)) > 1 else 0,
                    'Bal. Accuracy': (sens + spec) / 2,
                    'Sensitivity': sens,
                    'Specificity': spec,
                    'PPV': ppv_val,
                    'NPV': npv_val,
                    'F1': f1_score(y_test, y_pred, zero_division=0),
                    'MCC': matthews_corrcoef(y_test, y_pred),
                    'Brier': brier_score_loss(y_test, y_proba),
                    "Youden's J": sens + spec - 1,
                    'Threshold': opt_threshold
                }
            }
        except:
            pass

    return balancing_results, model_results, (feature_names, X_test, y_test, best_balancing)

# Run ML mortality models for EACH cancer site
ml_balancing_by_site = {}
ml_models_by_site = {}
ml_meta_by_site = {}

for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()
    bal_results, model_results, meta = run_ml_mortality_multi_balancing(df_site, site)
    if bal_results is not None and len(bal_results) > 0:
        ml_balancing_by_site[site] = bal_results
        if model_results is not None:
            ml_models_by_site[site] = model_results
        if meta is not None:
            ml_meta_by_site[site] = meta

12.1 8.1 Class Balancing Technique Comparison by Cancer Site

Code

# Table: Class balancing technique comparison for each site (Gradient Boosting as base model)
for site in all_sites:
    if site not in ml_balancing_by_site:
        continue

    results = ml_balancing_by_site[site]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Balancing'

    status_col = []
    for name in metrics_df.index:
        entry = results[name]
        if entry.get('failed', False):
            reason = entry.get('fail_reason', 'Error')
            status_col.append(reason[:40])
        else:
            status_col.append('OK')
    metrics_df.insert(0, 'Status', status_col)

    for col in metrics_df.columns:
        if col in ['Train Size', 'Minority %', 'Status']:
            metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and np.isnan(x)) else x)
        else:
            metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"Class Balancing Comparison (Gradient Boosting): {site} (N={n_total:,}, Events={n_events}, {n_events/n_total*100:.1f}%)",
               index=True)

Class Balancing Comparison (Gradient Boosting): Hypopharynx (N=400, Events=33, 8.2%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.5978	0.6918	1.0000	0.3836	0.1346	1.0000	0.2373	0.2272	0.1059	0.3836	0.0043	320	8.1
SMOTE	OK	0.6213	0.7094	0.8571	0.5616	0.1579	0.9762	0.2667	0.2370	0.1132	0.4188	0.0303	588	50.0
ADASYN	OK	0.6204	0.6996	0.7143	0.6849	0.1786	0.9615	0.2857	0.2365	0.1121	0.3992	0.0892	584	49.7
SMOTETomek	OK	0.6115	0.6996	0.7143	0.6849	0.1786	0.9615	0.2857	0.2365	0.1157	0.3992	0.0848	580	50.0
Random Oversampling	OK	0.7006	0.7162	0.8571	0.5753	0.1622	0.9767	0.2727	0.2451	0.1124	0.4325	0.0272	588	50.0
Random Undersampling	OK	0.4403	0.6233	1.0000	0.2466	0.1129	1.0000	0.2029	0.1669	0.4881	0.2466	0.0050	52	50.0

Class Balancing Comparison (Gradient Boosting): Larynx (N=2,625, Events=198, 7.5%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.6275	0.6255	0.3500	0.9010	0.2258	0.9438	0.2745	0.2064	0.0749	0.2510	0.1495	2100	7.5
SMOTE	OK	0.5740	0.5933	0.5000	0.6866	0.1163	0.9433	0.1887	0.1055	0.0915	0.1866	0.1589	3884	50.0
ADASYN	OK	0.5877	0.5872	0.9250	0.2495	0.0923	0.9758	0.1678	0.1090	0.0939	0.1745	0.0457	3875	49.9
SMOTETomek	OK	0.5847	0.5955	0.5250	0.6660	0.1148	0.9444	0.1883	0.1063	0.0864	0.1910	0.1306	3744	50.0
Random Oversampling	OK	0.5728	0.5729	0.5500	0.5959	0.1009	0.9414	0.1705	0.0785	0.1381	0.1459	0.2486	3884	50.0
Random Undersampling	OK	0.6431	0.6343	0.4500	0.8186	0.1698	0.9475	0.2466	0.1775	0.2642	0.2686	0.8045	316	50.0

Class Balancing Comparison (Gradient Boosting): Nasopharynx (N=368, Events=23, 6.2%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.6391	0.6681	0.8000	0.5362	0.1111	0.9737	0.1951	0.1689	0.0792	0.3362	0.0028	294	6.1
SMOTE	OK	0.7130	0.7130	0.6000	0.8261	0.2000	0.9661	0.3000	0.2660	0.0721	0.4261	0.0615	552	50.0
ADASYN	OK	0.7130	0.7203	0.6000	0.8406	0.2143	0.9667	0.3158	0.2824	0.0788	0.4406	0.0720	552	50.0
SMOTETomek	OK	0.4986	0.6130	0.4000	0.8261	0.1429	0.9500	0.2105	0.1449	0.0894	0.2261	0.0526	548	50.0
Random Oversampling	OK	0.5855	0.6739	1.0000	0.3478	0.1000	1.0000	0.1818	0.1865	0.1080	0.3478	0.0028	552	50.0
Random Undersampling	OK	0.7188	0.7203	0.6000	0.8406	0.2143	0.9667	0.3158	0.2824	0.2460	0.4406	0.9935	36	50.0

Class Balancing Comparison (Gradient Boosting): Oral Cavity (N=3,001, Events=176, 5.9%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.7222	0.6762	0.5714	0.7809	0.1389	0.9672	0.2235	0.1933	0.0602	0.3523	0.0448	2400	5.9
SMOTE	OK	0.5746	0.6083	0.6000	0.6166	0.0882	0.9614	0.1538	0.1037	0.0974	0.2166	0.2029	4518	50.0
ADASYN	OK	0.5835	0.6074	0.7714	0.4435	0.0789	0.9691	0.1432	0.1016	0.0952	0.2149	0.1380	4497	49.8
SMOTETomek	OK	0.5723	0.6197	0.7429	0.4965	0.0836	0.9690	0.1503	0.1122	0.0953	0.2393	0.1244	4254	50.0
Random Oversampling	OK	0.6285	0.6327	0.3714	0.8940	0.1781	0.9583	0.2407	0.1903	0.1181	0.2654	0.5121	4518	50.0
Random Undersampling	OK	0.7183	0.6905	0.6000	0.7809	0.1448	0.9693	0.2333	0.2085	0.2441	0.3809	0.6778	282	50.0

Class Balancing Comparison (Gradient Boosting): Oropharynx (N=951, Events=55, 5.8%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.5856	0.6553	0.7273	0.5833	0.0964	0.9722	0.1702	0.1460	0.0679	0.3106	0.0104	760	5.8
SMOTE	OK	0.5144	0.6152	0.3636	0.8667	0.1429	0.9571	0.2051	0.1517	0.0881	0.2303	0.3646	1432	50.0
ADASYN	OK	0.4437	0.5568	0.3636	0.7500	0.0816	0.9507	0.1333	0.0606	0.0970	0.1136	0.2403	1439	50.2
SMOTETomek	OK	0.5333	0.6141	0.2727	0.9556	0.2727	0.9556	0.2727	0.2283	0.0857	0.2283	0.5336	1366	50.0
Random Oversampling	OK	0.5609	0.6348	0.6364	0.6333	0.0959	0.9661	0.1667	0.1293	0.0827	0.2697	0.0871	1432	50.0
Random Undersampling	OK	0.6447	0.6323	0.9091	0.3556	0.0794	0.9846	0.1460	0.1301	0.3244	0.2646	0.0567	88	50.0

Class Balancing Comparison (Gradient Boosting): Other Pharynx (N=281, Events=30, 10.7%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.5605	0.6275	0.6667	0.5882	0.1600	0.9375	0.2581	0.1576	0.1279	0.2549	0.0028	224	10.7
SMOTE	OK	0.5343	0.6275	0.6667	0.5882	0.1600	0.9375	0.2581	0.1576	0.1219	0.2549	0.0165	400	50.0
ADASYN	OK	0.5245	0.6471	0.6667	0.6275	0.1739	0.9412	0.2759	0.1840	0.1247	0.2941	0.0138	401	50.1
SMOTETomek	OK	0.5033	0.6373	0.6667	0.6078	0.1667	0.9394	0.2667	0.1706	0.1251	0.2745	0.0132	398	50.0
Random Oversampling	OK	0.5605	0.6618	0.8333	0.4902	0.1613	0.9615	0.2703	0.1993	0.1570	0.3235	0.0134	400	50.0
Random Undersampling	OK	0.6961	0.7255	0.6667	0.7843	0.2667	0.9524	0.3810	0.3143	0.3035	0.4510	0.9682	48	50.0

Class Balancing Comparison (Gradient Boosting): Salivary Glands (N=898, Events=46, 5.1%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.6556	0.6871	0.8889	0.4854	0.0833	0.9881	0.1524	0.1635	0.0597	0.3743	0.0048	718	5.2
SMOTE	OK	0.6699	0.6696	0.8889	0.4503	0.0784	0.9872	0.1441	0.1492	0.0709	0.3392	0.0117	1362	50.0
ADASYN	OK	0.6322	0.6696	0.5556	0.7836	0.1190	0.9710	0.1961	0.1748	0.0717	0.3392	0.0833	1362	50.0
SMOTETomek	OK	0.6576	0.6550	0.6667	0.6433	0.0896	0.9735	0.1579	0.1397	0.0777	0.3099	0.0224	1342	50.0
Random Oversampling	OK	0.6374	0.6462	0.4444	0.8480	0.1333	0.9667	0.2051	0.1710	0.0924	0.2924	0.2315	1362	50.0
Random Undersampling	OK	0.7609	0.7632	0.8889	0.6374	0.1143	0.9909	0.2025	0.2353	0.3080	0.5263	0.5186	74	50.0

Class Balancing Comparison (Gradient Boosting): Sinonasal (N=887, Events=41, 4.6%)

	Status	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold	Train Size	Minority %
Balancing
None (Original)	OK	0.5537	0.6110	0.8750	0.3471	0.0593	0.9833	0.1111	0.0973	0.0572	0.2221	0.0040	709	4.7
SMOTE	OK	0.6118	0.6191	0.7500	0.4882	0.0645	0.9765	0.1188	0.0988	0.0831	0.2382	0.0440	1352	50.0
ADASYN	OK	0.5882	0.6199	0.8750	0.3647	0.0609	0.9841	0.1138	0.1039	0.0725	0.2397	0.0244	1346	49.8
SMOTETomek	OK	0.5978	0.6110	0.8750	0.3471	0.0593	0.9833	0.1111	0.0973	0.0874	0.2221	0.0229	1306	50.0
Random Oversampling	OK	0.6485	0.7441	1.0000	0.4882	0.0842	1.0000	0.1553	0.2028	0.0780	0.4882	0.0398	1352	50.0
Random Undersampling	OK	0.6504	0.6838	0.7500	0.6176	0.0845	0.9813	0.1519	0.1556	0.3468	0.3676	0.7202	66	50.0

12.2 8.2 ML Model Comparison (Best Balancing per Site)

Code

# Table: ML model comparison with best balancing for each site
for site in all_sites:
    if site not in ml_models_by_site or not ml_models_by_site[site]:
        continue

    results = ml_models_by_site[site]
    meta = ml_meta_by_site.get(site)
    best_balancing = meta[3] if meta else 'Unknown'
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    n_events = df_site['egresar_fallecido'].sum()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Model'

    for col in metrics_df.columns:
        metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"ML Models: {site} (N={n_total:,}, Events={n_events}, Best balancing: {best_balancing})",
               index=True)

ML Models: Hypopharynx (N=400, Events=33, Best balancing: Random Oversampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.4521	0.6135	0.8571	0.3699	0.1154	0.9643	0.2034	0.1345	0.2670	0.2270	0.1141
Random Forest	0.4843	0.5900	0.7143	0.4658	0.1136	0.9444	0.1961	0.1023	0.1359	0.1800	0.0300
Gradient Boosting	0.7006	0.7162	0.8571	0.5753	0.1622	0.9767	0.2727	0.2451	0.1124	0.4325	0.0272
SVM	0.6047	0.7025	0.8571	0.5479	0.1538	0.9756	0.2609	0.2290	0.1689	0.4051	0.1543
MLP	0.5616	0.6644	1.0000	0.3288	0.1250	1.0000	0.2222	0.2027	0.2396	0.3288	0.0000

ML Models: Larynx (N=2,625, Events=198, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.6627	0.6673	0.4500	0.8845	0.2432	0.9512	0.3158	0.2551	0.2306	0.3345	0.7886
Random Forest	0.6885	0.6921	0.6750	0.7093	0.1607	0.9636	0.2596	0.2186	0.2232	0.3843	0.5500
Gradient Boosting	0.6431	0.6343	0.4500	0.8186	0.1698	0.9475	0.2466	0.1775	0.2642	0.2686	0.8045
SVM	0.6565	0.6736	0.4750	0.8722	0.2346	0.9527	0.3140	0.2550	0.2144	0.3472	0.7175
MLP	0.6755	0.6771	0.5500	0.8041	0.1880	0.9559	0.2803	0.2258	0.3072	0.3541	0.9930

ML Models: Nasopharynx (N=368, Events=23, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.6551	0.7420	0.6000	0.8841	0.2727	0.9683	0.3750	0.3415	0.2441	0.4841	0.8362
Random Forest	0.7594	0.7826	1.0000	0.5652	0.1429	1.0000	0.2500	0.2842	0.2258	0.5652	0.4450
Gradient Boosting	0.7188	0.7203	0.6000	0.8406	0.2143	0.9667	0.3158	0.2824	0.2460	0.4406	0.9935
SVM	0.4145	0.6638	0.4000	0.9275	0.2857	0.9552	0.3333	0.2809	0.3762	0.3275	0.7654
MLP	0.8435	0.8551	1.0000	0.7101	0.2000	1.0000	0.3333	0.3769	0.4234	0.7101	0.9973

ML Models: Oral Cavity (N=3,001, Events=176, Best balancing: None (Original))

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.7013	0.6648	0.4286	0.9011	0.2113	0.9623	0.2830	0.2392	0.1958	0.3296	0.6417
Random Forest	0.6390	0.6323	0.4571	0.8074	0.1280	0.9601	0.2000	0.1527	0.0611	0.2646	0.0700
Gradient Boosting	0.7222	0.6762	0.5714	0.7809	0.1389	0.9672	0.2235	0.1933	0.0602	0.3523	0.0448
SVM	0.6413	0.6750	0.8571	0.4929	0.0946	0.9824	0.1705	0.1642	0.0543	0.3501	0.0524
MLP	0.4907	0.5626	0.1429	0.9823	0.3333	0.9488	0.2000	0.1879	0.0653	0.1252	0.7080

ML Models: Oropharynx (N=951, Events=55, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.7404	0.7098	0.6364	0.7833	0.1522	0.9724	0.2456	0.2287	0.2346	0.4197	0.6146
Random Forest	0.6662	0.6987	0.6364	0.7611	0.1400	0.9716	0.2295	0.2106	0.2210	0.3975	0.6050
Gradient Boosting	0.6447	0.6323	0.9091	0.3556	0.0794	0.9846	0.1460	0.1301	0.3244	0.2646	0.0567
SVM	0.8288	0.8341	0.8182	0.8500	0.2500	0.9871	0.3830	0.3980	0.2084	0.6682	0.6107
MLP	0.7449	0.7341	0.8182	0.6500	0.1250	0.9832	0.2169	0.2251	0.3003	0.4682	0.5443

ML Models: Other Pharynx (N=281, Events=30, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.6340	0.6863	1.0000	0.3725	0.1579	1.0000	0.2727	0.2425	0.2163	0.3725	0.2220
Random Forest	0.6830	0.7598	0.8333	0.6863	0.2381	0.9722	0.3704	0.3306	0.2156	0.5196	0.5350
Gradient Boosting	0.6961	0.7255	0.6667	0.7843	0.2667	0.9524	0.3810	0.3143	0.3035	0.4510	0.9682
SVM	0.1961	0.5392	1.0000	0.0784	0.1132	1.0000	0.2034	0.0942	0.3557	0.0784	0.4605
MLP	0.4248	0.5392	0.3333	0.7451	0.1333	0.9048	0.1905	0.0547	0.3116	0.0784	0.6873

ML Models: Salivary Glands (N=898, Events=46, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.6004	0.6637	0.5556	0.7719	0.1136	0.9706	0.1887	0.1661	0.3319	0.3275	0.8033
Random Forest	0.7840	0.7953	0.8889	0.7018	0.1356	0.9917	0.2353	0.2742	0.2239	0.5906	0.5550
Gradient Boosting	0.7609	0.7632	0.8889	0.6374	0.1143	0.9909	0.2025	0.2353	0.3080	0.5263	0.5186
SVM	0.6283	0.6345	1.0000	0.2690	0.0672	1.0000	0.1259	0.1344	0.2393	0.2690	0.3582
MLP	0.6244	0.6696	0.5556	0.7836	0.1190	0.9710	0.1961	0.1748	0.3154	0.3392	0.9770

ML Models: Sinonasal (N=887, Events=41, Best balancing: Random Undersampling)

	AUC	Bal. Accuracy	Sensitivity	Specificity	PPV	NPV	F1	MCC	Brier	Youden's J	Threshold
Model
Logistic Regression	0.5618	0.6360	0.6250	0.6471	0.0769	0.9735	0.1370	0.1171	0.2597	0.2721	0.4821
Random Forest	0.5688	0.7037	0.6250	0.7824	0.1190	0.9779	0.2000	0.1988	0.2729	0.4074	0.6750
Gradient Boosting	0.6504	0.6838	0.7500	0.6176	0.0845	0.9813	0.1519	0.1556	0.3468	0.3676	0.7202
SVM	0.5687	0.6132	0.2500	0.9765	0.3333	0.9651	0.2857	0.2600	0.2503	0.2265	0.6229
MLP	0.6037	0.6404	0.3750	0.9059	0.1579	0.9686	0.2222	0.1885	0.3521	0.2809	1.0000

Code

n_sites_ml = len(ml_models_by_site)
if n_sites_ml > 0:
    ncols = min(3, n_sites_ml)
    nrows = (n_sites_ml + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_sites_ml == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_models_by_site.items()):
        if not results:
            continue
        ax = axes[idx]
        best_balancing = ml_meta_by_site[site][3] if ml_meta_by_site.get(site) else ''
        for name, data in results.items():
            y_test = data['y_test']
            y_proba = data['y_proba']
            fpr, tpr, _ = roc_curve(y_test, y_proba)
            auc_val = data['metrics']['AUC']
            ax.plot(fpr, tpr, label=f'{name} ({auc_val:.3f})', linewidth=2)

        ax.plot([0, 1], [0, 1], 'k--', linewidth=1)
        ax.set_xlabel('FPR')
        ax.set_ylabel('TPR')
        ax.set_title(f'{site}\n(Balancing: {best_balancing})', fontweight='bold', fontsize=10)
        ax.legend(fontsize=7, loc='lower right')
        ax.grid(True, alpha=0.3)

    for idx in range(len(ml_models_by_site), len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'roc_ml_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-roc-ml-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 11: Figure 9. ROC Curves - ML Mortality Models by Cancer Site (Best Class Balancing)

Code

if len(ml_balancing_by_site) > 0:
    comparison_data = []
    for site, results in ml_balancing_by_site.items():
        for bal_name, data in results.items():
            failed = data.get('failed', False)
            comparison_data.append({
                'Cancer Site': site,
                'Balancing': bal_name,
                'AUC': 0.0 if failed else data['metrics']['AUC'],
                'Bal. Accuracy': 0.0 if failed else data['metrics']['Bal. Accuracy'],
                'F1': 0.0 if failed else data['metrics']['F1'],
                'Failed': failed
            })

    comp_df = pd.DataFrame(comparison_data)

    bal_order = [b for b in ALL_BALANCING_NAMES if b in comp_df['Balancing'].unique()]
    comp_df['Balancing'] = pd.Categorical(comp_df['Balancing'], categories=bal_order, ordered=True)

    fig, axes = plt.subplots(1, 3, figsize=(14, 8))
    metrics_to_plot = ['AUC', 'Bal. Accuracy', 'F1']
    colors_resamp = plt.cm.Set2(np.linspace(0, 1, len(bal_order)))

    for ax_idx, metric in enumerate(metrics_to_plot):
        ax = axes[ax_idx]
        pivot = comp_df.pivot(index='Cancer Site', columns='Balancing', values=metric)
        pivot = pivot[[c for c in bal_order if c in pivot.columns]]
        pivot.fillna(0).plot(kind='bar', ax=ax, color=colors_resamp[:len(pivot.columns)])
        ax.set_title(metric, fontweight='bold')
        ax.set_ylabel(metric)
        ax.legend(fontsize=6, loc='lower right')
        ax.grid(True, alpha=0.3, axis='y')
        plt.setp(ax.get_xticklabels(), rotation=30, ha='right', fontsize=8)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'balancing_comparison_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-balancing-comparison.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 12: Figure 10. Class Balancing Technique AUC Comparison by Cancer Site

Code

n_sites_ml = len(ml_models_by_site)
if n_sites_ml > 0:
    ncols = min(4, n_sites_ml)
    nrows = (n_sites_ml + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 4 * nrows))
    if n_sites_ml == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_models_by_site.items()):
        if not results:
            continue
        best_name = max(results, key=lambda k: results[k]['metrics']['AUC'])
        cm = results[best_name]['cm']
        ax = axes[idx]
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax, cbar=False,
                    xticklabels=['Survived', 'Deceased'], yticklabels=['Survived', 'Deceased'])
        ax.set_title(f'{site}\n({best_name})', fontsize=10, fontweight='bold')
        ax.set_ylabel('True')
        ax.set_xlabel('Predicted')

    for idx in range(len(ml_models_by_site), len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'cm_ml_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-cm-ml-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 13: Figure 11. Confusion Matrices - Best ML Mortality Model per Site

Code

sites_with_gb = [s for s in ml_models_by_site if ml_models_by_site[s] and 'Gradient Boosting' in ml_models_by_site[s]]
n_gb = len(sites_with_gb)

if n_gb > 0:
    ncols = min(3, n_gb)
    nrows = (n_gb + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_gb == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, site in enumerate(sites_with_gb):
        model = ml_models_by_site[site]['Gradient Boosting']['model']
        feat_names = ml_meta_by_site[site][0]
        importances = model.feature_importances_
        formatted = [format_name(f) for f in feat_names]

        imp_df = pd.DataFrame({'Feature': formatted, 'Importance': importances})
        imp_df = imp_df.sort_values('Importance', ascending=True).tail(12)

        ax = axes[idx]
        ax.barh(imp_df['Feature'], imp_df['Importance'], color='steelblue')
        ax.set_xlabel('Importance')
        ax.set_title(f'{site}', fontweight='bold')
        ax.grid(True, alpha=0.3, axis='x')

    for idx in range(n_gb, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'feat_importance_mortality_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-feat-importance-mortality.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 14: Figure 12. Feature Importance - Gradient Boosting Mortality by Cancer Site (Best Class Balancing)

13 9. Machine Learning Models for Length of Stay Prediction (Per Cancer Site)

Code

from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def run_ml_los_for_site(df_site, site_name):
    """Train 5 ML regressors for LOS on a single cancer site"""
    df_ml = df_site[df_site['dias_estancia'] > 0].copy()
    df_ml['sex_female'] = (df_ml['SEXO'] == 'Female').astype(int)
    df_ml['edad'] = pd.to_numeric(df_ml['edad'], errors='coerce')

    svc_dummies = pd.get_dummies(df_ml['service_category'], prefix='service', drop_first=True)
    for c in svc_dummies.columns:
        df_ml[c] = svc_dummies[c]

    feature_names = ['edad', 'sex_female'] + list(svc_dummies.columns) + ml_comorbidity_cols
    target = 'dias_estancia'

    df_ml = df_ml[feature_names + [target]].dropna()

    if len(df_ml) < 80:
        return None, None

    X = df_ml[feature_names].values
    y = df_ml[target].values

    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)

    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=0.2, random_state=42
    )

    models = {
        'Linear Regression': LinearRegression(),
        'Ridge': Ridge(alpha=1.0),
        'Lasso': Lasso(alpha=0.1, max_iter=5000),
        'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
        'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
    }

    results = {}
    for name, model in models.items():
        try:
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)

            results[name] = {
                'model': model,
                'y_pred': y_pred,
                'y_test': y_test,
                'metrics': {
                    'RMSE': np.sqrt(mean_squared_error(y_test, y_pred)),
                    'MAE': mean_absolute_error(y_test, y_pred),
                    'R2': r2_score(y_test, y_pred),
                    'MSE': mean_squared_error(y_test, y_pred)
                }
            }
        except:
            pass

    return results, feature_names

# Run ML LOS models for EACH cancer site
ml_los_by_site = {}
ml_los_features_by_site = {}

for site in all_sites:
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site].copy()
    results, feat_names = run_ml_los_for_site(df_site, site)
    if results and len(results) > 0:
        ml_los_by_site[site] = results
        ml_los_features_by_site[site] = feat_names

13.1 9.1 ML LOS Results by Cancer Site

Code

# Table: ML LOS Performance by Site
for site in all_sites:
    if site not in ml_los_by_site:
        continue

    results = ml_los_by_site[site]
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    n_total = len(df_site)
    mean_los = df_site['dias_estancia'].mean()

    metrics_df = pd.DataFrame({name: d['metrics'] for name, d in results.items()}).T
    metrics_df.index.name = 'Model'

    for col in metrics_df.columns:
        metrics_df[col] = metrics_df[col].apply(lambda x: '—' if (isinstance(x, float) and (np.isnan(x) or np.isinf(x))) else f'{x:.4f}')

    show_table(metrics_df,
               f"ML LOS Performance: {site} (N={n_total:,}, Mean LOS={mean_los:.1f} days)",
               index=True)

ML LOS Performance: Hypopharynx (N=400, Mean LOS=13.7 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	22.2782	13.9487	-0.4538	496.3178
Ridge	22.2354	13.9280	-0.4483	494.4112
Lasso	21.9121	13.6609	-0.4065	480.1414
Random Forest	19.4278	12.5932	-0.1056	377.4393
Gradient Boosting	20.6044	13.4355	-0.2436	424.5431

ML LOS Performance: Larynx (N=2,625, Mean LOS=14.5 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	19.5764	13.3325	-0.0275	383.2339
Ridge	19.5748	13.3316	-0.0274	383.1739
Lasso	19.4355	13.2411	-0.0128	377.7388
Random Forest	21.3782	14.2926	-0.2254	457.0262
Gradient Boosting	20.3111	13.8083	-0.1061	412.5399

ML LOS Performance: Nasopharynx (N=368, Mean LOS=9.0 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	10.8355	7.9235	-0.1720	117.4088
Ridge	10.7736	7.8920	-0.1587	116.0715
Lasso	10.2645	7.5827	-0.0517	105.3589
Random Forest	11.1671	7.6256	-0.2449	124.7033
Gradient Boosting	13.0054	8.0567	-0.6884	169.1396

ML LOS Performance: Oral Cavity (N=3,001, Mean LOS=9.6 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	18.8474	10.6064	-0.0307	355.2235
Ridge	18.8466	10.6061	-0.0307	355.1952
Lasso	18.7448	10.5509	-0.0195	351.3677
Random Forest	20.5552	11.3571	-0.2260	422.5151
Gradient Boosting	19.0218	10.6592	-0.0499	361.8286

ML LOS Performance: Oropharynx (N=951, Mean LOS=8.4 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	13.9983	8.5612	-0.1791	195.9520
Ridge	13.9887	8.5563	-0.1774	195.6842
Lasso	13.8397	8.4215	-0.1525	191.5368
Random Forest	13.5869	8.4933	-0.1108	184.6032
Gradient Boosting	13.9424	8.5364	-0.1697	194.3902

ML LOS Performance: Other Pharynx (N=281, Mean LOS=10.8 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	12.2293	8.8816	-0.2828	149.5557
Ridge	12.1605	8.8576	-0.2684	147.8770
Lasso	11.7386	8.6341	-0.1819	137.7949
Random Forest	11.3697	7.8188	-0.1088	129.2695
Gradient Boosting	14.7723	9.3805	-0.8717	218.2223

ML LOS Performance: Salivary Glands (N=898, Mean LOS=6.9 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	9.9316	6.2590	-0.2303	98.6370
Ridge	9.9229	6.2540	-0.2282	98.4633
Lasso	9.5267	6.0424	-0.1320	90.7571
Random Forest	9.3527	6.2639	-0.0911	87.4735
Gradient Boosting	9.3368	6.1751	-0.0874	87.1758

ML LOS Performance: Sinonasal (N=887, Mean LOS=6.7 days)

	RMSE	MAE	R2	MSE
Model
Linear Regression	13.5113	6.8173	0.0791	182.5560
Ridge	13.5123	6.8185	0.0790	182.5820
Lasso	13.5369	6.8838	0.0756	183.2484
Random Forest	13.5982	6.7620	0.0672	184.9117
Gradient Boosting	13.7705	6.8266	0.0434	189.6273

Code

n_sites_los = len(ml_los_by_site)
if n_sites_los > 0:
    ncols = min(3, n_sites_los)
    nrows = (n_sites_los + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_sites_los == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, (site, results) in enumerate(ml_los_by_site.items()):
        best_name = max(results, key=lambda k: results[k]['metrics']['R2'])
        best = results[best_name]
        y_test = best['y_test']
        y_pred = best['y_pred']
        r2 = best['metrics']['R2']

        ax = axes[idx]
        ax.scatter(y_test, y_pred, alpha=0.4, s=15, color='steelblue')
        lims = [min(y_test.min(), y_pred.min()), max(np.percentile(y_test, 95), np.percentile(y_pred, 95))]
        ax.plot(lims, lims, 'r--', lw=2)
        ax.set_xlabel('Observed LOS (days)')
        ax.set_ylabel('Predicted LOS (days)')
        ax.set_title(f'{site}\n{best_name} (R²={r2:.3f})', fontweight='bold', fontsize=10)
        ax.grid(True, alpha=0.3)

    for idx in range(n_sites_los, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'pred_vs_obs_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-pred-vs-obs-los.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 15: Figure 12. Predicted vs Observed LOS - Best ML Model per Site

Code

sites_with_rf_los = [s for s in ml_los_by_site if 'Random Forest' in ml_los_by_site[s]]
n_rf_los = len(sites_with_rf_los)

if n_rf_los > 0:
    ncols = min(3, n_rf_los)
    nrows = (n_rf_los + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(6 * ncols, 5 * nrows))
    if n_rf_los == 1:
        axes = np.array([axes])
    axes = axes.flatten() if hasattr(axes, 'flatten') else [axes]

    for idx, site in enumerate(sites_with_rf_los):
        model = ml_los_by_site[site]['Random Forest']['model']
        feat_names = ml_los_features_by_site[site]
        importances = model.feature_importances_
        formatted = [format_name(f) for f in feat_names]

        imp_df = pd.DataFrame({'Feature': formatted, 'Importance': importances})
        imp_df = imp_df.sort_values('Importance', ascending=True).tail(12)

        ax = axes[idx]
        ax.barh(imp_df['Feature'], imp_df['Importance'], color='forestgreen')
        ax.set_xlabel('Importance')
        ax.set_title(f'{site}', fontweight='bold')
        ax.grid(True, alpha=0.3, axis='x')

    for idx in range(n_rf_los, len(axes)):
        axes[idx].set_visible(False)

    plt.tight_layout()
    plt.savefig(os.path.join(OUTPUT_DIR, 'feat_importance_los_by_site.png'), dpi=300, bbox_inches='tight', facecolor='white')
    plt.savefig(_FIG / "fig-feat-importance-los.png", dpi=300, bbox_inches="tight")
    plt.show()

Figure 16: Figure 13. Feature Importance - Random Forest LOS by Cancer Site

14 10. Summary Comparison Table

Code

# Combined summary: best model per site for mortality and LOS
summary_rows = []
for site in all_sites:
    row = {'Cancer Site': site}
    df_site = df_head_neck[df_head_neck['cancer_site_group'] == site]
    row['N'] = f'{len(df_site):,}'
    row['Mortality %'] = f'{df_site["egresar_fallecido"].mean()*100:.1f}'
    row['Mean LOS'] = f'{df_site["dias_estancia"].mean():.1f}'

    # GEE Logistic regression AUC
    if site in gee_mortality_results:
        try:
            result = gee_mortality_results[site]
            df_gee = gee_mortality_models[site][1]
            y_proba = result.fittedvalues
            y_actual = df_gee['egresar_fallecido'].values
            row['GEE Mortality AUC'] = f'{roc_auc_score(y_actual, y_proba):.3f}'
        except:
            row['GEE Mortality AUC'] = '—'
    else:
        row['GEE Mortality AUC'] = '—'

    # Best ML mortality AUC (with class balancing)
    if site in ml_models_by_site and ml_models_by_site[site]:
        best_name = max(ml_models_by_site[site], key=lambda k: ml_models_by_site[site][k]['metrics']['AUC'])
        best_auc = ml_models_by_site[site][best_name]['metrics']['AUC']
        best_balancing = ml_meta_by_site[site][3] if ml_meta_by_site.get(site) else 'Unknown'
        row['Best ML Mortality AUC'] = f'{best_auc:.3f} ({best_name}, {best_balancing})'
    else:
        row['Best ML Mortality AUC'] = '—'

    # GEE Poisson/NB model RMSE
    if site in gee_los_results:
        try:
            result = gee_los_results[site]
            df_gee = gee_los_models[site][1]
            y_pred = result.fittedvalues
            y_actual = df_gee['dias_estancia'].values
            rmse = np.sqrt(np.mean((y_actual - y_pred) ** 2))
            row['GEE LOS RMSE'] = f'{rmse:.2f}'
        except:
            row['GEE LOS RMSE'] = '—'
    else:
        row['GEE LOS RMSE'] = '—'

    # Best ML LOS R²
    if site in ml_los_by_site:
        best_name = max(ml_los_by_site[site], key=lambda k: ml_los_by_site[site][k]['metrics']['R2'])
        best_r2 = ml_los_by_site[site][best_name]['metrics']['R2']
        row['Best ML R² (LOS)'] = f'{best_r2:.3f} ({best_name})'
    else:
        row['Best ML R² (LOS)'] = '—'

    summary_rows.append(row)

summary_df = pd.DataFrame(summary_rows)
show_table(summary_df, "Table 13: Summary - GEE & ML Model Performance Across All Cancer Sites\n(GEE: Mixed-Effects with hospital clustering; ML: Multiple Class Balancing Techniques)", index=False)

Table 13: Summary - GEE & ML Model Performance Across All Cancer Sites (GEE: Mixed-Effects with hospital clustering; ML: Multiple Class Balancing Techniques)

Cancer Site	N	Mortality %	Mean LOS	GEE Mortality AUC	Best ML Mortality AUC	GEE LOS RMSE	Best ML R² (LOS)
Hypopharynx	400	8.2	13.7	0.862	0.701 (Gradient Boosting, Random Oversampling)	16.26	-0.106 (Random Forest)
Larynx	2,625	7.5	14.5	0.859	0.688 (Random Forest, Random Undersampling)	20.50	-0.013 (Lasso)
Nasopharynx	368	6.2	9.0	—	0.843 (MLP, Random Undersampling)	18.14	-0.052 (Lasso)
Oral Cavity	3,001	5.9	9.6	0.897	0.722 (Gradient Boosting, None (Original))	16.14	-0.020 (Lasso)
Oropharynx	951	5.8	8.4	0.888	0.829 (SVM, Random Undersampling)	13.02	-0.111 (Random Forest)
Other Pharynx	281	10.7	10.8	0.898	0.696 (Gradient Boosting, Random Undersampling)	15.03	-0.109 (Random Forest)
Salivary Glands	898	5.1	6.9	0.935	0.784 (Random Forest, Random Undersampling)	11.20	-0.087 (Gradient Boosting)
Sinonasal	887	4.6	6.7	0.898	0.650 (Gradient Boosting, Random Undersampling)	10.48	0.079 (Linear Regression)

15 11. Discussion and Conclusion

15.1 11.1 Key Findings

This analysis of head and neck malignancies in Chile (2019-2024) provides independent regression and machine learning models for each cancer site, enabling site-specific risk assessment for both in-hospital mortality and length of stay. Head and neck cancer comprises diverse anatomical subsites (oral cavity, pharynx, larynx, sinonasal) with markedly different epidemiology, treatment complexity, and clinical outcomes.

15.2 11.2 Regression Model Approach

Mixed-effects models were fit using Generalized Estimating Equations (GEE) to account for the hierarchical structure of observations within hospitals and years. For mortality, GEE logistic regression was used with Binomial family and exchangeable correlation structure. For length of stay, GEE negative binomial regression was used to account for overdispersion in count data, with Poisson as fallback. All models include age, sex, service category, severity/mortality indices, and only significant comorbidities from bivariate analysis (p < 0.05), improving parsimony and interpretability. Hospital was specified as the grouping variable to capture clustering effects.

15.3 11.3 Machine Learning Approach

To address class imbalance and improve generalization, multiple class balancing techniques were evaluated: SMOTE, ADASYN, SMOTETomek, random oversampling, random undersampling, and no balancing (original). For each cancer site, gradient boosting was used as the base model to identify the best-performing balancing method by AUC. Subsequently, five classifiers (Logistic Regression, Random Forest, Gradient Boosting, SVM, MLP) were trained using the optimal balancing strategy. All models used optimal decision thresholds derived from Youden’s J statistic rather than default 0.5 cutoff, improving balanced accuracy and clinical utility.

15.4 11.4 Clinical Implications

Head and neck cancer mortality and hospital resource utilization vary significantly by anatomical subsite and comorbidity burden. Treatment decisions (surgery vs. radiation vs. chemoradiation) and patient characteristics (age, performance status, comorbidities) directly influence hospital outcomes. These models provide quantitative tools for risk stratification and outcome prediction, supporting clinical decision-making and resource planning.

16 12. Conclusion

This comprehensive analysis of head and neck malignancies in Chilean public hospitals provides an epidemiological profile of in-hospital mortality and length of stay across multiple anatomical subsites (2019-2024). The results highlight the importance of site-specific modeling, the differential impact of comorbidities by location, and temporal patterns that may inform resource allocation and quality improvement initiatives in head and neck cancer care.

17 References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Charlson, Mary E., Peter Pompei, Kathy L. Ales, and C. Ronald MacKenzie. 1987. “A New Method of Classifying Prognostic Comorbidity in Longitudinal Studies: Development and Validation.” Journal of Chronic Diseases 40 (5): 373–83. https://doi.org/10.1016/0021-9681(87)90171-8.

Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. “SMOTE: Synthetic Minority Over-Sampling Technique.” Journal of Artificial Intelligence Research 16: 321–57. https://doi.org/10.1613/jair.953.

Chow, Laura Q. M. 2020. “Head and Neck Cancer.” New England Journal of Medicine 382 (1): 60–72. https://doi.org/10.1056/NEJMra1715715.

Departamento de Estadı́sticas e Información de Salud (DEIS). 2024. Bases de Datos de Grupos Relacionados de Diagnóstico (GRD), 2019–2024. Https://deis.minsal.cl/.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29 (5): 1189–232.

Hashibe, Mia, Paul Brennan, Shu-Chun Chuang, et al. 2009. “Interaction Between Tobacco and Alcohol Use and the Risk of Head and Neck Cancer: Pooled Analysis in the INHANCE Consortium.” Cancer Epidemiology, Biomarkers & Prevention 18 (2): 541–50. https://doi.org/10.1158/1055-9965.EPI-08-0347.

Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. “Applied Logistic Regression.” Wiley Series in Probability and Statistics, ahead of print, 3rd ed. https://doi.org/10.1002/9781118548387.

Hubbard, Alan E., Jennifer Ahern, Nancy L. Fleischer, et al. 2010. “To GEE or Not to GEE: Comparing Population Average and Mixed Models for Estimating the Associations Between Neighborhood Risk Factors and Health.” Epidemiology 21 (4): 467–74. https://doi.org/10.1097/EDE.0b013e3181caeb90.

Hunter, John D. 2007. “Matplotlib: A 2D Graphics Environment.” Computing in Science & Engineering 9 (3): 90–95. https://doi.org/10.1109/MCSE.2007.55.

Johnson, Daniel E., Barbara Burtness, C. René Leemans, et al. 2020. “Head and Neck Squamous Cell Carcinoma.” Nature Reviews Disease Primers 6 (1): 92. https://doi.org/10.1038/s41572-020-00224-3.

Liang, Kung-Yee, and Scott L. Zeger. 1986. “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73 (1): 13–22. https://doi.org/10.1093/biomet/73.1.13.

Marur, Shanthi, and Arlene A. Forastiere. 2016. “Head and Neck Squamous Cell Carcinoma: Update on Epidemiology, Diagnosis, and Treatment.” Mayo Clinic Proceedings 91 (3): 386–96. https://doi.org/10.1016/j.mayocp.2015.12.017.

McKinney, Wes. 2010. “Data Structures for Statistical Computing in Python.” Proceedings of the 9th Python in Science Conference, 56–61.

Organización Mundial de la Salud. 2019. Clasificación Estadı́stica Internacional de Enfermedades y Problemas Relacionados Con La Salud, décima Revisión (CIE-10). OMS.

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12: 2825–30.

Rettig, Eleni M., and Gypsyamber D’Souza. 2015. “Epidemiology of Head and Neck Cancer.” Surgical Oncology Clinics of North America 24 (3): 379–96. https://doi.org/10.1016/j.soc.2015.03.001.

Seabold, Skipper, and Josef Perktold. 2010. “Statsmodels: Econometric and Statistical Modeling with Python.” Proceedings of the 9th Python in Science Conference, 92–96.

Sung, Hyuna, Jacques Ferlay, Rebecca L. Siegel, et al. 2021. “Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries.” CA: A Cancer Journal for Clinicians 71 (3): 209–49. https://doi.org/10.3322/caac.21660.

Valderas, José M., Barbara Starfield, Bonnie Sibbald, Chris Salisbury, and Martin Roland. 2009. “Defining Comorbidity: Implications for Understanding Health and Health Services.” Annals of Family Medicine 7 (4): 357–63. https://doi.org/10.1370/afm.983.

Youden, William J. 1950. “Index for Rating Diagnostic Tests.” Cancer 3 (1): 32–35. https://doi.org/10.1002/1097-0142(1950)3:1<32::AID-CNCR2820030106>3.0.CO;2-3.

Zeger, Scott L., and Kung-Yee Liang. 1986. “Longitudinal Data Analysis for Discrete and Continuous Outcomes.” Biometrics 42 (1): 121–30. https://doi.org/10.2307/2531248.

1 1. Methods

1.1 1.1 Study Design and Data Source

1.2 1.2 Cancer Case Identification

1.3 1.3 Comorbidity Classification

1.4 1.4 Statistical Analyses

1.4.1 1.4.1 Descriptive Statistics

1.4.2 1.4.2 Bivariate Associations

1.4.3 1.4.3 Mixed-Effects Logistic Regression for Mortality

1.4.4 1.4.4 Negative Binomial Mixed Models for Length of Stay

1.4.5 1.4.5 Area Under the Receiver Operating Characteristic Curve (AUC)

1.4.6 1.4.6 Hosmer–Lemeshow Goodness-of-Fit

1.4.7 1.4.7 McFadden’s Pseudo-\(R^{2}\)

1.4.8 1.4.8 Matthews Correlation Coefficient (MCC)

1.4.9 1.4.9 Brier Score

1.4.10 1.4.10 RMSE and MAE for Length-of-Stay

1.4.11 1.4.11 Youden’s J Statistic

2 2. Data Loading and Processing

Table 5: Overall Summary Statistics

3 3. Display Name Formatting

4 4. Descriptive Statistics

Table 1: Patient Distribution by Head and Neck Cancer Site

Table 2: Trends by Year and Site

Table 3: Comorbidity Prevalence - All Patients (%)

Table 4: Comorbidity Prevalence - Discharged (%)

Table 5: Comorbidity Prevalence - Deceased (%)

Table 6: Summary by Discharge Status

Table 7: Length of Stay by Site and Discharge Status

Table 8: Yearly Trends (Aggregate)

5 4.1 Age Distribution by Cancer Site

6 4.2 Comorbidity Prevalence Tables

Table 13: Comorbidity Prevalence - All Patients (%)

Table 14: Comorbidity Prevalence - Discharged Patients (%)

Table 15: Comorbidity Prevalence - Deceased Patients (%)

Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %) * p<0.05, ** p<0.01, *** p<0.001

7 4.3 Length of Stay Analysis

Table 17: Length of Stay by Cancer Site and Discharge Status

Table 18: Mean Length of Stay When Comorbidity Present (days)

Table 19: Length of Stay Impact by Comorbidity (Days Difference: WITH - WITHOUT)

8 4.4 Temporal Trends

9 5. Bivariate Analysis

Table 20: Bivariate Analysis by Cancer Site

Table 21: Significant Comorbidities by Cancer Site (p < 0.05 in Bivariate Analysis)

10 6. GEE Mixed-Effects Logistic Regression for Mortality (All Cancer Sites)

10.1 6.1 GEE Logistic Regression Results - OR [95% CI]

Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site (Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis) * p<0.05, ** p<0.01, *** p<0.001

Table 10: Performance & Fit Metrics - GEE Logistic Regression for Mortality (QIC = Quasi-Information Criterion, Pseudo R² = McFadden)

10.2 6.2 Forest Plot — Odds Ratios for In-Hospital Mortality

11 7. GEE Mixed-Effects Regression for Length of Stay (All Cancer Sites)

11.1 7.1 GEE Regression Results - IRR [95% CI]

Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site (Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities) * p<0.05, ** p<0.01, *** p<0.001

Table 12: Performance & Fit Metrics - GEE Regression Models for LOS (QIC = Quasi-Information Criterion, Negative Binomial/Poisson family)

11.2 7.1 Forest Plot — Incidence Rate Ratios for Length of Stay

12 8. Machine Learning Models for Mortality Prediction (Per Cancer Site, Multiple Class Balancing)

12.1 8.1 Class Balancing Technique Comparison by Cancer Site

Class Balancing Comparison (Gradient Boosting): Hypopharynx (N=400, Events=33, 8.2%)

Class Balancing Comparison (Gradient Boosting): Larynx (N=2,625, Events=198, 7.5%)

Class Balancing Comparison (Gradient Boosting): Nasopharynx (N=368, Events=23, 6.2%)

Class Balancing Comparison (Gradient Boosting): Oral Cavity (N=3,001, Events=176, 5.9%)

Class Balancing Comparison (Gradient Boosting): Oropharynx (N=951, Events=55, 5.8%)

Class Balancing Comparison (Gradient Boosting): Other Pharynx (N=281, Events=30, 10.7%)

Class Balancing Comparison (Gradient Boosting): Salivary Glands (N=898, Events=46, 5.1%)

Class Balancing Comparison (Gradient Boosting): Sinonasal (N=887, Events=41, 4.6%)

12.2 8.2 ML Model Comparison (Best Balancing per Site)

ML Models: Hypopharynx (N=400, Events=33, Best balancing: Random Oversampling)

ML Models: Larynx (N=2,625, Events=198, Best balancing: Random Undersampling)

ML Models: Nasopharynx (N=368, Events=23, Best balancing: Random Undersampling)

ML Models: Oral Cavity (N=3,001, Events=176, Best balancing: None (Original))

ML Models: Oropharynx (N=951, Events=55, Best balancing: Random Undersampling)

ML Models: Other Pharynx (N=281, Events=30, Best balancing: Random Undersampling)

ML Models: Salivary Glands (N=898, Events=46, Best balancing: Random Undersampling)

ML Models: Sinonasal (N=887, Events=41, Best balancing: Random Undersampling)

13 9. Machine Learning Models for Length of Stay Prediction (Per Cancer Site)

13.1 9.1 ML LOS Results by Cancer Site

ML LOS Performance: Hypopharynx (N=400, Mean LOS=13.7 days)

ML LOS Performance: Larynx (N=2,625, Mean LOS=14.5 days)

ML LOS Performance: Nasopharynx (N=368, Mean LOS=9.0 days)

ML LOS Performance: Oral Cavity (N=3,001, Mean LOS=9.6 days)

ML LOS Performance: Oropharynx (N=951, Mean LOS=8.4 days)

ML LOS Performance: Other Pharynx (N=281, Mean LOS=10.8 days)

ML LOS Performance: Salivary Glands (N=898, Mean LOS=6.9 days)

Table 16: Difference in Comorbidity Prevalence (Deceased - Discharged, %) * p<0.05, p<0.01, * p<0.001

Table 9: GEE Mixed-Effects Logistic Regression - Odds Ratios [95% CI] for In-Hospital Mortality by Cancer Site (Hospital as cluster, year as fixed effect, only significant comorbidities from bivariate analysis) * p<0.05, p<0.01, * p<0.001

Table 11: GEE Mixed-Effects Regression - IRR [95% CI] for Length of Stay by Cancer Site (Hospital as cluster, year as fixed effect, Negative Binomial family, only significant comorbidities) * p<0.05, p<0.01, * p<0.001