feat: Implement time-aware GPT-2 for patient event prediction
This commit introduces a complete framework for training a temporal GPT-2 model on sequential patient event data. Key components include:

- `models.py`:
  - `TimeAwareGPT2`: A custom GPT-2 model that incorporates temporal information through a time-based causal attention mask and a sinusoidal age encoding for positional information.
  - `AgeSinusoidalEncoding`: A module for creating time-based positional embeddings.
  - `CombinedLoss`: A two-part loss function combining cross-entropy for event prediction and a survival loss for event timing.
- `utils.py`:
  - `PatientEventDataset`: A PyTorch Dataset class to process, batch, and load patient event sequences, including imputation of "no event" gaps and padding/truncation.
- `train.py`:
  - A comprehensive training script that initializes the model, data loaders, and loss function.
  - Implements a training loop with a cosine annealing learning rate scheduler, validation, and early stopping based on validation loss.
- `prepare_data.py`:
  - Script for preprocessing raw UK Biobank data into a format suitable for the model.
- `GEMINI.md`:
  - Project documentation outlining the structure, coding style, and framework.
GEMINI.md (new file, 59 lines)
@@ -0,0 +1,59 @@
# DeepHealth Project

This is a deep learning project based on PyTorch. This project adheres to specific code style and file structure conventions to ensure clarity, maintainability, and reproducibility.

## 1. Project Structure

To maintain a clean and modular project, we adopt the following file organization:

DeepHealth/
|-train.py
|-models.py
|-utils.py
|-data/
|-requirements.txt
|-README.md

### File Descriptions

* **`train.py`**:
    * **Core training script**. It contains the control flow for the entire training process.
    * Responsible for initializing the model, optimizer, DataLoader, etc.
    * Executes the training and validation loops.
    * Handles saving and loading checkpoints, logging, and other related tasks.

* **`models.py`**:
    * **Model and Loss Function Definitions**. This file stores the architecture for all neural network models.
    * All subclasses of `torch.nn.Module` should be defined in this file.
    * Custom loss functions should also be implemented here.

* **`utils.py`**:
    * **Utility Functions Module**. It contains reusable helper functions for the project.
    * Primarily responsible for data I/O operations, data preprocessing, performance metric calculations, logger configuration, or other logic that doesn't belong in the core model or training framework.

* **`data/`**:
    * **Data Storage Directory**. Used to store the datasets required for the project.
    * `data/raw/` stores the original, unprocessed data.
    * `data/processed/` stores data after it has been preprocessed.

* **`requirements.txt`**:
    * **Project Dependencies**. Lists all the Python packages and their versions required to run this project.

* **`README.md`**:
    * **Project Documentation**. Provides a high-level overview of the project, setup instructions, and usage guidelines.

## 2. Core Framework

* **Deep Learning Framework**: `PyTorch`

## 3. Coding Style

This project uniformly adopts the **Google Python Style Guide**. All submitted code should adhere to this standard to ensure consistency and readability.

Key features include:

* Using `yapf` or `black` for automatic code formatting.
* Following detailed naming conventions (`module_name`, `package_name`, `ClassName`, `method_name`, `ExceptionName`, `function_name`, `GLOBAL_CONSTANT_NAME`).
* Using Google-style docstrings (see the sketch below).

Please refer to the official documentation: [Google Python Style Guide](http://google.github.io/styleguide/pyguide.html)
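To make the conventions above concrete, here is a small hedged illustration of a Google-style docstring with Google-style naming; the helper function is hypothetical and not part of this commit:

```python
def count_valid_events(event_codes: list[int], ignored_codes: set[int]) -> int:
    """Counts the events in a sequence, excluding ignored codes.

    Args:
        event_codes: Event codes for a single patient, in temporal order.
        ignored_codes: Codes to exclude from the count (e.g., padding).

    Returns:
        The number of events whose code is not in ignored_codes.
    """
    return sum(1 for code in event_codes if code not in ignored_codes)
```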
models.py (new file, 284 lines)
@@ -0,0 +1,284 @@
import torch
import torch.nn as nn
from torch.nn import functional as F
from typing import Tuple
import math


class CausalSelfAttention(nn.Module):
    """
    A vanilla multi-head masked self-attention layer with a projection at the end.
    """

    def __init__(self, n_embd: int, n_head: int, pdrop: float):
        super().__init__()
        assert n_embd % n_head == 0
        # key, query, value projections for all heads
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)
        # output projection
        self.c_proj = nn.Linear(n_embd, n_embd)
        # regularization
        self.attn_dropout = nn.Dropout(pdrop)
        self.resid_dropout = nn.Dropout(pdrop)
        self.n_head = n_head
        self.n_embd = n_embd

    def forward(self, x: torch.Tensor, custom_mask: torch.Tensor) -> torch.Tensor:
        B, L, D = x.size()  # batch size, sequence length, embedding dimensionality (n_embd)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, L, self.n_head, D // self.n_head).transpose(1, 2)  # (B, nh, L, hs)
        q = q.view(B, L, self.n_head, D // self.n_head).transpose(1, 2)  # (B, nh, L, hs)
        v = v.view(B, L, self.n_head, D // self.n_head).transpose(1, 2)  # (B, nh, L, hs)

        # causal self-attention; Self-attend: (B, nh, L, hs) x (B, nh, hs, L) -> (B, nh, L, L)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))

        # Apply the time-based causal mask
        att = att.masked_fill(custom_mask.unsqueeze(1) == 0, float('-inf'))

        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        y = att @ v  # (B, nh, L, L) x (B, nh, L, hs) -> (B, nh, L, hs)
        y = y.transpose(1, 2).contiguous().view(B, L, D)  # re-assemble all head outputs side by side

        # output projection
        y = self.resid_dropout(self.c_proj(y))
        return y


class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, n_embd: int, n_head: int, pdrop: float):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, pdrop)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.ModuleDict(dict(
            c_fc = nn.Linear(n_embd, 4 * n_embd),
            c_proj = nn.Linear(4 * n_embd, n_embd),
            act = nn.GELU(),
            dropout = nn.Dropout(pdrop),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.dropout(m.c_proj(m.act(m.c_fc(x))))  # MLP forward

    def forward(self, x: torch.Tensor, custom_mask: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln_1(x), custom_mask=custom_mask)
        x = x + self.mlpf(self.ln_2(x))
        return x


class AgeSinusoidalEncoding(nn.Module):
    """
    Encodes age using sinusoidal functions, similar to positional encodings
    in Transformers. This module creates a fixed-size embedding for an age
    value given in days.
    """

    def __init__(self, embedding_dim: int):
        """
        Initializes the AgeSinusoidalEncoding module.

        Args:
            embedding_dim (int): The dimensionality of the output embedding.
                Must be an even number.

        Raises:
            ValueError: If embedding_dim is not an even number.
        """
        super().__init__()
        if embedding_dim % 2 != 0:
            raise ValueError(f"Embedding dimension must be an even number, but got {embedding_dim}")

        self.embedding_dim = embedding_dim

        # Pre-calculate the divisor term for the sinusoidal formula.
        # The formula for the divisor is 10000^(2i/D), where D is the
        # embedding_dim and i is the index for each pair of dimensions.
        # i ranges from 0 to D/2 - 1.
        i = torch.arange(0, self.embedding_dim, 2, dtype=torch.float32)
        divisor = torch.pow(10000, i / self.embedding_dim)

        # Register the divisor as a non-trainable buffer. This ensures it is
        # moved to the correct device (e.g., GPU) along with the model.
        self.register_buffer('divisor', divisor)

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for the AgeSinusoidalEncoding.

        Args:
            t (torch.Tensor): A tensor of shape (batch_size, sequence_length)
                with dtype=torch.float32, representing age in days.

        Returns:
            torch.Tensor: The encoded age tensor of shape
                (batch_size, sequence_length, embedding_dim).
        """
        # 1. Unit Conversion: Convert age from days to years.
        # We use 365.25 to account for leap years.
        t_years = t / 365.25

        # 2. Argument Calculation: Calculate the arguments for the sin/cos functions.
        # The shapes are broadcast to (B, L, D/2).
        # Input t_years: (B, L) -> unsqueezed to (B, L, 1)
        # Divisor: (D/2) -> viewed as (1, 1, D/2)
        args = t_years.unsqueeze(-1) * self.divisor.view(1, 1, -1)

        # 3. Sinusoidal Application: Create the final output tensor.
        # Initialize an empty tensor to store the embeddings.
        output = torch.zeros(t.shape[0], t.shape[1], self.embedding_dim, device=t.device)

        # Assign cosine of the arguments to the even indices.
        output[:, :, 0::2] = torch.cos(args)

        # Assign sine of the arguments to the odd indices.
        output[:, :, 1::2] = torch.sin(args)

        return output


class TimeAwareGPT2(nn.Module):
    """
    A time-aware GPT-2 model with custom temporal features.
    """

    def __init__(self, vocab_size: int, n_embd: int, n_layer: int, n_head: int, pdrop: float, token_pdrop: float):
        super().__init__()
        self.token_pdrop = token_pdrop

        # Token and positional embeddings
        self.wte = nn.Embedding(vocab_size, n_embd)
        self.age_encoder = AgeSinusoidalEncoding(n_embd)
        self.drop = nn.Dropout(pdrop)

        # Transformer blocks
        self.blocks = nn.ModuleList([Block(n_embd, n_head, pdrop) for _ in range(n_layer)])

        # Final layer norm and linear head
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

        self.n_embd = n_embd

    def forward(self, event_seq: torch.Tensor, time_seq: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for the TimeAwareGPT2 model.

        Args:
            event_seq (torch.Tensor): Token indices of shape (B, L).
            time_seq (torch.Tensor): Timestamps for each event of shape (B, L).

        Returns:
            torch.Tensor: Logits of shape (B, L, vocab_size).
        """
        B, L = event_seq.size()

        # 1. Get token embeddings
        token_embeddings = self.wte(event_seq)

        # 2. Apply token dropout (only during training)
        if self.training and self.token_pdrop > 0:
            # Create a mask to randomly zero out entire token embedding vectors
            drop_mask = torch.rand(token_embeddings.shape[:2], device=token_embeddings.device) < self.token_pdrop
            token_embeddings[drop_mask] = 0.0

        # 3. Get positional embeddings from time sequence
        pos_embeddings = self.age_encoder(time_seq.float())

        # 4. Combine embeddings and apply dropout
        x = self.drop(token_embeddings + pos_embeddings)

        # 5. Generate attention mask
        # The attention mask combines two conditions:
        # a) Time-based causality: A token i can attend to a token j only if time_seq[j] < time_seq[i].
        # b) Padding mask: Do not attend to positions where the event token is 0.

        # a) Time-based causal mask
        t_i = time_seq.unsqueeze(-1)  # (B, L, 1)
        t_j = time_seq.unsqueeze(1)   # (B, 1, L)
        time_mask = (t_j < t_i)

        # b) Padding mask (prevents attending to key positions that are padding)
        padding_mask = (event_seq != 0).unsqueeze(1)  # Shape: (B, 1, L)

        # Combine the masks. A position (j) can be attended to by a query (i) only if
        # it's in the past (time_mask) AND it's not a padding token (padding_mask).
        combined_mask = time_mask & padding_mask

        # 6. Pass through transformer blocks
        for block in self.blocks:
            x = block(x, custom_mask=combined_mask)

        # 7. Final layer norm and projection to vocab size
        x = self.ln_f(x)
        logits = self.head(x)

        return logits

    def get_num_params(self) -> float:
        """
        Returns the number of trainable parameters in the model in millions.
        """
        return sum(p.numel() for p in self.parameters() if p.requires_grad) / 1e6


class CombinedLoss(nn.Module):
    """
    Computes a two-part loss: a standard cross-entropy loss for event type
    prediction and a survival analysis loss for event timing.
    """

    def __init__(self, ignored_token_ids: list[int]):
        """
        Initializes the CombinedLoss module.

        Args:
            ignored_token_ids (list[int]): A list of event type IDs to be
                excluded from all loss calculations.
        """
        super().__init__()
        self.ignored_token_ids = ignored_token_ids

    def forward(self, logits: torch.Tensor, x: torch.Tensor, t: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Calculates the combined cross-entropy and survival loss.

        Args:
            logits (torch.Tensor): Raw model outputs of shape (B, L, N).
            x (torch.Tensor): Ground-truth event labels of shape (B, L).
            t (torch.Tensor): True time duration for each event, shape (B, L).

        Returns:
            A tuple containing the two scalar loss tensors: (loss_ce, loss_survival).
        """
        # 1. Create a mask to filter out ignored token IDs from loss calculation.
        # An element is True if the corresponding label in x is NOT in the ignored list.
        mask = torch.ones_like(x, dtype=torch.bool)
        for token_id in self.ignored_token_ids:
            mask = mask & (x != token_id)

        # If the mask is all False (all tokens are ignored), return zero for both losses.
        if not mask.any():
            return torch.tensor(0.0, device=logits.device), torch.tensor(0.0, device=logits.device)

        # 2. Part 1: Cross-Entropy Loss (loss_ce)
        # Permute logits from (B, L, N) to (B, N, L) for F.cross_entropy.
        logits_for_ce = logits.permute(0, 2, 1)

        # Calculate per-element loss without reduction.
        per_element_ce = F.cross_entropy(logits_for_ce, x, reduction='none')

        # Apply the mask and compute the mean of valid elements.
        loss_ce = per_element_ce[mask].mean()

        # 3. Part 2: Survival Loss (loss_survival)
        # Calculate event intensity (lambda) as the sum of exponentiated logits.
        intensity = torch.sum(torch.exp(logits), dim=2)

        # Calculate per-element survival loss (negative log-likelihood of exponential dist).
        # We add a small epsilon for numerical stability with the log.
        per_element_survival = -(torch.log(intensity + 1e-8) - intensity * t)

        # Apply the mask and compute the mean of valid elements.
        loss_survival = per_element_survival[mask].mean()

        return loss_ce, loss_survival
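For orientation, a minimal sketch of how `AgeSinusoidalEncoding` and `CombinedLoss` can be exercised in isolation, assuming `models.py` is importable; the tensor sizes and values below are illustrative only, not taken from this commit:

```python
import torch
from models import AgeSinusoidalEncoding, CombinedLoss

# Encode ages (in days) for 2 patients x 3 events into 8-dimensional embeddings.
ages_days = torch.tensor([[0.0, 365.25, 3652.5],
                          [7305.0, 14610.0, 18262.5]])
embeddings = AgeSinusoidalEncoding(embedding_dim=8)(ages_days)
print(embeddings.shape)  # torch.Size([2, 3, 8])

# Evaluate the two-part loss on random logits; labels 0 and 1 are ignored,
# matching the padding / "no event" convention used in train.py.
logits = torch.randn(2, 3, 10)         # (B, L, vocab_size)
labels = torch.randint(2, 10, (2, 3))  # ground-truth event codes
waits = torch.rand(2, 3)               # time to the next event (arbitrary units)
loss_ce, loss_surv = CombinedLoss(ignored_token_ids=[0, 1])(logits, labels, waits)
print(loss_ce.item(), loss_surv.item())
```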
prepare_data.py (new file, 133 lines)
@@ -0,0 +1,133 @@
import pandas as pd
import tqdm
import numpy as np

label_files = 'labels.csv'
ukb_field_to_icd10_file = 'icd10_codes_mod.tsv'
ukb_basket_file = 'ukb_delphi.txt'
train_proportion = 0.8
output_prefix = 'ukb_real'

icdict = {}
icdcodes = []
with open(ukb_field_to_icd10_file) as f:
    for line in f:
        parts = line.strip().split()
        icdict[parts[0]] = parts[5]
        icdcodes.append(parts[5])

# Using enumerate for cleaner, safer label assignment starting from 0
label_dict = {}
with open(label_files) as f:
    for i, line in enumerate(f):
        label_dict[line.strip().split(' ')[0]] = i

icdict['f.31.0.0'] = "sex"
icdict['f.34.0.0'] = "YEAR"
icdict['f.52.0.0'] = "MONTH"
icdict['f.40000.0.0'] = "Death"

for j in range(17):
    icdict[f'f.40005.{j}.0'] = f'cancer_date_{j}'
    icdict[f'f.40006.{j}.0'] = f'cancer_type_{j}'

icdict['f.53.0.0'] = "assessment_date"
icdict['f.21001.0.0'] = "BMI"
icdict['f.1239.0.0'] = "smoking"
icdict['f.1558.0.0'] = "alcohol"

len_icd = len(icdcodes)

# Corrected typo 'aseessment_date' to 'assessment_date'
icdcodes.extend(['Death', 'assessment_date'] + [f'cancer_date_{j}' for j in range(17)])

data_list = []
ukb_iterator = pd.read_csv(ukb_basket_file, sep=',', chunksize=10000, index_col=0, low_memory=False)

for _, dd in tqdm.tqdm(enumerate(ukb_iterator)):
    dd = dd.rename(columns=icdict)
    dd.dropna(subset=['sex'], inplace=True)
    dd['sex'] += 1
    dd = dd[[col for col in dd.columns if not col.startswith('f.')]]
    dd['dob'] = pd.to_datetime(dd[['YEAR', 'MONTH']].assign(DAY=1))

    present_icdcodes = [c for c in icdcodes if c in dd.columns]
    if present_icdcodes:
        # Convert date columns to days from date of birth
        date_cols = dd[present_icdcodes].apply(pd.to_datetime, format="%Y-%m-%d", errors='coerce')
        date_cols_days = date_cols.sub(dd['dob'], axis=0)
        dd[present_icdcodes] = date_cols_days.apply(lambda x: x.dt.days)

    # Process ICD codes efficiently using melt
    cols_to_process = [col for col in icdcodes[:len_icd + 1] if col in dd.columns]
    if cols_to_process:
        melted_df = dd.reset_index().melt(
            id_vars=['f.eid'],
            value_vars=cols_to_process,
            var_name='event_code',
            value_name='days'
        )
        melted_df.dropna(subset=['days'], inplace=True)
        if not melted_df.empty:
            melted_df['label'] = melted_df['event_code'].map(label_dict)
            data_list.append(melted_df[['f.eid', 'days', 'label']].dropna().astype(int).to_numpy())

    # Process sex
    X = dd['sex'].reset_index().to_numpy().astype(int)
    data_list.append(np.c_[X[:, 0], np.zeros(X.shape[0]), X[:, 1]].astype(int))

    # Process cancer data efficiently using wide_to_long
    df_res = dd.reset_index()
    rename_dict = {f'cancer_date_{j}': f'cancerdate{j}' for j in range(17)}
    rename_dict.update({f'cancer_type_{j}': f'cancertype{j}' for j in range(17)})
    df_renamed = df_res.rename(columns=rename_dict)

    stubs_to_use = []
    if any('cancerdate' in col for col in df_renamed.columns): stubs_to_use.append('cancerdate')
    if any('cancertype' in col for col in df_renamed.columns): stubs_to_use.append('cancertype')

    if len(stubs_to_use) == 2:
        long_cancer = pd.wide_to_long(df_renamed,
                                      stubnames=stubs_to_use,
                                      i=['f.eid'],
                                      j='cancer_num'
                                      ).dropna()
        if not long_cancer.empty:
            long_cancer['cancer'] = long_cancer['cancertype'].str.slice(0, 3)
            long_cancer['cancer_label'] = long_cancer['cancer'].map(label_dict)
            cancer_array = long_cancer.reset_index()[['f.eid', 'cancerdate', 'cancer_label']].dropna().astype(int).to_numpy()
            if cancer_array.size > 0:
                data_list.append(cancer_array)

    # Process BMI, smoking, and alcohol
    dd_bmi = dd[['assessment_date', 'BMI']].dropna().reset_index()
    if not dd_bmi.empty:
        dd_bmi['bmi_status'] = np.select([dd_bmi['BMI'] > 28, dd_bmi['BMI'] > 22], [5, 4], default=3)
        data_list.append(dd_bmi[['f.eid', 'assessment_date', 'bmi_status']].astype(int).to_numpy())

    dd_sm = dd[['assessment_date', 'smoking']].dropna().reset_index()
    dd_sm = dd_sm[dd_sm['smoking'] != -3]
    if not dd_sm.empty:
        dd_sm['smoking_status'] = np.select([dd_sm['smoking'] == 1, dd_sm['smoking'] == 2], [8, 7], default=6)
        data_list.append(dd_sm[['f.eid', 'assessment_date', 'smoking_status']].astype(int).to_numpy())

    dd_al = dd[['assessment_date', 'alcohol']].dropna().reset_index()
    dd_al = dd_al[dd_al['alcohol'] != -3]
    if not dd_al.empty:
        dd_al['alcohol_status'] = np.select([dd_al['alcohol'] == 1, dd_al['alcohol'] < 4], [11, 10], default=9)
        data_list.append(dd_al[['f.eid', 'assessment_date', 'alcohol_status']].astype(int).to_numpy())

data = np.vstack(data_list)
data = data[np.lexsort((data[:, 1], data[:, 2] == data[:, 2].max(), data[:, 0]))]
data = data[data[:, 1] >= 0]
data = pd.DataFrame(data).drop_duplicates([0, 2]).values
data = data.astype(np.uint32)
data.tofile(output_prefix + '.bin')

# Correctly split train/validation sets
unique_ids = np.unique(data[:, 0])
split_id = unique_ids[int(len(unique_ids) * train_proportion)]
train_val_split = data[:, 0] <= split_id

data[train_val_split].tofile(output_prefix + '_train.bin')
data[~train_val_split].tofile(output_prefix + '_val.bin')
train.py (new file, 170 lines)
@@ -0,0 +1,170 @@
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader
import numpy as np
import math
import tqdm

from models import TimeAwareGPT2, CombinedLoss
from utils import PatientEventDataset

# --- Configuration ---
class TrainConfig:
    # Data parameters
    train_data_path = 'ukb_real_train.bin'
    val_data_path = 'ukb_real_val.bin'
    block_length = 256  # Sequence length

    # Model parameters
    n_embd = 256
    n_layer = 8
    n_head = 8
    pdrop = 0.1
    token_pdrop = 0.1

    # Training parameters
    max_epoch = 200
    batch_size = 128
    lr_initial = 6e-4
    lr_final = 6e-5
    warmup_epochs = 10
    early_stopping_patience = 5

    # Loss parameters
    # 0 = padding, 1 = "no event"
    ignored_token_ids = [0, 1]

    # System parameters
    device = 'cuda' if torch.cuda.is_available() else 'cpu'

# --- Main Training Script ---
def main():
    config = TrainConfig()

    # --- 1. Data Loading ---
    print(f"Loading data from {config.train_data_path} and {config.val_data_path}...")
    train_data_arr = np.memmap(config.train_data_path, dtype=np.uint32, mode='r').reshape(-1, 3)
    val_data_arr = np.memmap(config.val_data_path, dtype=np.uint32, mode='r').reshape(-1, 3)

    # Infer vocab_size from the data (max label + 1)
    vocab_size = int(max(train_data_arr[:, 2].max(), val_data_arr[:, 2].max())) + 1
    print(f"Inferred vocabulary size: {vocab_size}")

    train_dataset = PatientEventDataset(train_data_arr, config.block_length)
    val_dataset = PatientEventDataset(val_data_arr, config.block_length)

    train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True, num_workers=4, pin_memory=True)
    val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False, num_workers=4, pin_memory=True)

    # --- 2. Model, Optimizer, and Loss Initialization ---
    print(f"Initializing model on {config.device}...")
    model = TimeAwareGPT2(
        vocab_size=vocab_size,
        n_embd=config.n_embd,
        n_layer=config.n_layer,
        n_head=config.n_head,
        pdrop=config.pdrop,
        token_pdrop=config.token_pdrop
    ).to(config.device)

    print(f"Model initialized with {model.get_num_params():.2f}M trainable parameters.")

    loss_fn = CombinedLoss(config.ignored_token_ids)
    optimizer = Adam(model.parameters(), lr=config.lr_initial)

    # --- 3. Training Loop ---
    best_val_loss = float('inf')
    patience_counter = 0
    print("Starting training...")
    for epoch in range(config.max_epoch):
        # --- Learning Rate Scheduling ---
        if epoch < config.warmup_epochs:
            lr = config.lr_initial
        else:
            progress = (epoch - config.warmup_epochs) / (config.max_epoch - config.warmup_epochs)
            lr = config.lr_final + 0.5 * (config.lr_initial - config.lr_final) * (1 + math.cos(math.pi * progress))

        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # --- Training Phase ---
        model.train()
        train_loss_ce_acc, train_loss_surv_acc = 0.0, 0.0
        train_steps = 0

        pbar = tqdm.tqdm(train_loader, desc=f"Epoch {epoch+1}/{config.max_epoch} [Train]")
        for event_seq, time_seq in pbar:
            event_seq, time_seq = event_seq.to(config.device), time_seq.to(config.device)

            # Prepare inputs and targets
            input_events = event_seq[:, :-1]
            input_times = time_seq[:, :-1]
            target_events = event_seq[:, 1:]
            target_wait_times = (time_seq[:, 1:] - time_seq[:, :-1]).float()

            # Forward pass
            logits = model(input_events, input_times)
            loss_ce, loss_survival = loss_fn(logits, target_events, target_wait_times)
            loss = loss_ce + loss_survival

            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            train_loss_ce_acc += loss_ce.item()
            train_loss_surv_acc += loss_survival.item()
            train_steps += 1
            pbar.set_postfix({'loss_ce': f'{loss_ce.item():.4f}', 'loss_surv': f'{loss_survival.item():.4f}', 'lr': f'{lr:.2e}'})

        avg_train_loss_ce = train_loss_ce_acc / train_steps
        avg_train_loss_surv = train_loss_surv_acc / train_steps

        # --- Validation Phase ---
        model.eval()
        val_loss_ce_acc, val_loss_surv_acc = 0.0, 0.0
        val_steps = 0

        with torch.no_grad():
            pbar_val = tqdm.tqdm(val_loader, desc=f"Epoch {epoch+1}/{config.max_epoch} [Val]")
            for event_seq, time_seq in pbar_val:
                event_seq, time_seq = event_seq.to(config.device), time_seq.to(config.device)

                input_events = event_seq[:, :-1]
                input_times = time_seq[:, :-1]
                target_events = event_seq[:, 1:]
                target_wait_times = (time_seq[:, 1:] - time_seq[:, :-1]).float()

                logits = model(input_events, input_times)
                loss_ce, loss_survival = loss_fn(logits, target_events, target_wait_times)

                val_loss_ce_acc += loss_ce.item()
                val_loss_surv_acc += loss_survival.item()
                val_steps += 1
                pbar_val.set_postfix({'loss_ce': f'{loss_ce.item():.4f}', 'loss_surv': f'{loss_survival.item():.4f}'})

        avg_val_loss_ce = val_loss_ce_acc / val_steps
        avg_val_loss_surv = val_loss_surv_acc / val_steps
        total_val_loss = avg_val_loss_ce + avg_val_loss_surv

        print(f"Epoch {epoch+1} Summary: \n"
              f"  Train Loss: {avg_train_loss_ce + avg_train_loss_surv:.4f} (CE: {avg_train_loss_ce:.4f}, Surv: {avg_train_loss_surv:.4f})\n"
              f"  Val Loss: {total_val_loss:.4f} (CE: {avg_val_loss_ce:.4f}, Surv: {avg_val_loss_surv:.4f})\n"
              f"  Learning Rate: {lr:.6f}")

        # --- Early Stopping Check ---
        if total_val_loss < best_val_loss:
            best_val_loss = total_val_loss
            patience_counter = 0
            print(f"Validation loss improved to {best_val_loss:.4f}. Resetting patience.")
        else:
            patience_counter += 1
            print(f"Validation loss did not improve. Patience: {patience_counter}/{config.early_stopping_patience}")

        if patience_counter >= config.early_stopping_patience:
            print("\nEarly stopping triggered due to no improvement in validation loss.")
            break

if __name__ == '__main__':
    main()
utils.py (new file, 104 lines)
@@ -0,0 +1,104 @@
import torch
import numpy as np
import random
from collections import defaultdict


class PatientEventDataset(torch.utils.data.Dataset):
    """
    A PyTorch Dataset for handling temporal sequences of patient events.

    This class processes a raw NumPy array of patient records, groups them by
    patient ID, and prepares them for training by imputing gaps, padding, or
    truncating sequences to a fixed length.
    """

    def __init__(self, data: np.ndarray, block_length: int):
        """
        Initializes the dataset by pre-processing the patient event data.

        Args:
            data (np.ndarray): A NumPy array of shape (N, 3) with dtype=np.uint32.
                The columns represent (patient_id, time_in_days, event_code).
            block_length (int): The fixed length for the output sequences.
        """
        self.block_length = block_length

        # Group (time_in_days, event_code) pairs by patient_id.
        # This pre-processing step allows for efficient lookups in __getitem__.
        patient_events = defaultdict(list)
        for patient_id, time, event in data:
            patient_events[patient_id].append((time, event))

        # Store a list of unique patient_ids to map indices to patients.
        self.patient_ids = list(patient_events.keys())
        self.patient_events = dict(patient_events)

    def __len__(self) -> int:
        """
        Returns the total number of unique patients in the dataset.
        """
        return len(self.patient_ids)

    def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
        """
        Retrieves, processes, and returns a single patient's event sequence.

        Args:
            idx (int): The index of the patient to retrieve.

        Returns:
            A tuple of two torch.long tensors: (event_sequence, time_sequence),
            both of shape (block_length,).
        """
        # 1. Retrieve and Sort
        patient_id = self.patient_ids[idx]
        records = sorted(self.patient_events[patient_id], key=lambda x: x[0])

        # 2. Impute "No Event" Gaps
        imputed_sequence = []
        if not records:
            # Handle cases with no records for a patient if necessary, though
            # the constructor logic would typically prevent this.
            pass
        else:
            imputed_sequence.append(records[0])
            for i in range(len(records) - 1):
                prev_time, _ = records[i]
                next_time, _ = records[i+1]
                time_gap = next_time - prev_time

                # If the gap is 5 years (1826 days) or more, insert "no event" records.
                if time_gap >= 1826:
                    num_no_event_intervals = time_gap // 1826
                    for j in range(1, num_no_event_intervals + 1):
                        no_event_time = prev_time + j * 1826
                        imputed_sequence.append((no_event_time, 1))  # event_code=1 for "no event"

                imputed_sequence.append(records[i+1])

        # 3. Adjust Sequence Length
        seq_len = len(imputed_sequence)

        if seq_len > self.block_length:
            # If longer, randomly select a contiguous sub-sequence.
            start_index = random.randint(0, seq_len - self.block_length)
            final_sequence = imputed_sequence[start_index : start_index + self.block_length]
        elif seq_len < self.block_length:
            # If shorter, pad the sequence at the end.
            padding_needed = self.block_length - seq_len
            # Use event_code=0 and time_in_days=36525 for padding.
            padding = [(36525, 0)] * padding_needed
            final_sequence = imputed_sequence + padding
        else:
            # If equal, use the sequence as is.
            final_sequence = imputed_sequence

        # 4. Return Tensors
        # Separate the sequence into event codes and time, then convert to tensors.
        event_codes = [item[1] for item in final_sequence]
        time_stamps = [item[0] for item in final_sequence]

        event_tensor = torch.tensor(event_codes, dtype=torch.long)
        time_tensor = torch.tensor(time_stamps, dtype=torch.long)

        return event_tensor, time_tensor
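A small hedged usage sketch of `PatientEventDataset`; the toy records below are made up, and the column order (patient_id, time_in_days, event_code) follows the docstring above:

```python
import numpy as np
from utils import PatientEventDataset

# Two toy patients; the >1826-day gap for patient 1 should trigger "no event" imputation.
records = np.array([
    [1, 0,    2],   # patient 1, day 0, event 2
    [1, 4000, 3],   # patient 1, day 4000, event 3
    [2, 100,  4],   # patient 2, day 100, event 4
], dtype=np.uint32)

dataset = PatientEventDataset(records, block_length=8)
events, times = dataset[0]
print(events.shape, times.shape)  # torch.Size([8]) torch.Size([8])
print(events.tolist())            # [2, 1, 1, 3, 0, 0, 0, 0] after imputation and padding
```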