XGBoost AFT survival model with external memory iterator


How can I make XGBoost's external-memory training work together with an XGBoost survival AFT model?

Background: I've written an XGBoost iterator for batched training, as in the linked example.
Now I want to train an AFT (accelerated failure time) model from the xgboost library.
The problem is the DMatrix: for AFT we need to call set_float_info on it to set the survival censoring intervals. For example:

dtrain.set_float_info('label_lower_bound', y_lower_bound[train_index])
dtrain.set_float_info('label_upper_bound', y_upper_bound[train_index])

Attached please find my redacted code (I can't attach everything, but this is the problematic gist).
I have the censoring-time data in df, but I don't know how to attach it to Xy_train.

import os
from typing import Callable, Tuple

import pandas as pd
import xgboost


class BatchedParquetIterator(xgboost.DataIter):
  def __init__(self):
    # ...
    super().__init__(cache_prefix=os.path.join(".", "cache"))

  def next(self, input_data: Callable) -> int:
    """Advance the iterator by one step and pass the data to XGBoost.

    This function is called by XGBoost during the construction of ``DMatrix``.
    """
    
    if self._it == len(self._file_paths):
      return 0  # return 0 to let XGBoost know this is the end of iteration
    
    df = pd.read_parquet(self._file_paths[self._it])
    X, y = self._preprocess(df)
    
    input_data(data=X, label=y)
    self._it += 1

    return 1  # Return 1 to let XGBoost know we haven't seen all the files yet.

  def reset(self):
    """Reset the iterator to its beginning"""
    self._it = 0

  def _preprocess(self, df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame]:
    # ...
    return X, y


parquet_iterator_train = BatchedParquetIterator(batches)
Xy_train = xgboost.DMatrix(parquet_iterator_train)
