
preprocessing

boxcox(method='mle')

Applies the Box-Cox transformation to numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
method str

The method used to determine the lambda parameter of the Box-Cox transformation.

Supported methods:

  • mle: maximum likelihood estimation
  • pearsonr: Pearson correlation coefficient
'mle'
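For orientation, the Box-Cox transform itself can be sketched in plain Python. This is a minimal illustration with a hand-picked lambda, not the library's implementation: `boxcox_transform` is a hypothetical helper, and the estimation of lambda (by 'mle' or 'pearsonr') is deliberately left out.

```python
import math

def boxcox_transform(values, lmbda):
    """Apply the Box-Cox transform with a fixed, user-supplied lambda.

    Hypothetical helper for illustration; lambda is not estimated here.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lmbda == 0:
        # The lambda -> 0 limit of (x**lambda - 1) / lambda is log(x).
        return [math.log(v) for v in values]
    return [(v ** lmbda - 1.0) / lmbda for v in values]

series = [1.0, 4.0, 9.0]
sqrt_like = boxcox_transform(series, 0.5)  # square-root-like compression
log_like = boxcox_transform(series, 0.0)   # plain natural log
```

The transform is only defined for strictly positive inputs, which is why the sketch raises on zero or negative values.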

coerce_dtypes(schema)

Coerces the column datatypes of a DataFrame using the provided schema.

Parameters:

Name Type Description Default
schema Mapping[str, DataType]

A dictionary-like object mapping column names to the desired data types.

required

detrend(method='linear')

Removes the mean or a linear trend from numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
method str

If 'mean', subtracts the mean from each time series. If 'linear', subtracts the line of best fit (via OLS) from each time series. Defaults to 'linear'.

'linear'
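The two detrending modes can be sketched per series in plain Python. This is an illustration only, assuming the trend is fitted against the positions 0, 1, 2, ... of an evenly spaced series with at least two points; `detrend_linear` and `detrend_mean` are hypothetical helper names.

```python
def detrend_linear(values):
    """Subtract the OLS line of best fit (against 0, 1, 2, ...) from a series."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Closed-form OLS slope and intercept for a single regressor.
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    var = sum((t - t_mean) ** 2 for t in range(n))
    slope = cov / var
    intercept = y_mean - slope * t_mean
    return [y - (intercept + slope * t) for t, y in enumerate(values)]

def detrend_mean(values):
    """Subtract the series mean from every observation."""
    mean = sum(values) / len(values)
    return [y - mean for y in values]

# One list per entity, so each time series is detrended independently.
panel = {"a": [1.0, 3.0, 5.0, 7.0], "b": [10.0, 10.0, 13.0, 13.0]}
linear = {entity: detrend_linear(ys) for entity, ys in panel.items()}
centered = {entity: detrend_mean(ys) for entity, ys in panel.items()}
```

Series "a" is exactly linear, so its linear residuals are all zero, while mean detrending of "b" only recentres it.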

diff(order, sp=1)

Differences each time series in the panel by the given order and seasonal period.

Parameters:

Name Type Description Default
order int

The order of differencing.

required
sp int

Seasonal periodicity.

1
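One common reading of these two parameters is "take the lag-sp difference, order times". The sketch below follows that reading in plain Python; it is an assumption about the semantics rather than the library's code, and `difference` is a hypothetical helper. It drops the leading observations that have no predecessor, whereas a DataFrame implementation may keep them as nulls.

```python
def difference(values, order, sp=1):
    """Apply the lag-`sp` difference `order` times to a single series."""
    out = list(values)
    for _ in range(order):
        # Each pass shortens the series by `sp` observations.
        out = [out[i] - out[i - sp] for i in range(sp, len(out))]
    return out

series = [1, 4, 9, 16, 25, 36]
first = difference(series, order=1)           # [3, 5, 7, 9, 11]
second = difference(series, order=2)          # [2, 2, 2, 2]
seasonal = difference(series, order=1, sp=2)  # [8, 12, 16, 20]
```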

impute(method)

Performs missing value imputation on numeric columns of a DataFrame grouped by entity.

Parameters:

Name Type Description Default
method Union[str, int, float]

The imputation method to use.

Supported methods are:

  • 'mean': Replace missing values with the mean of the corresponding column.
  • 'median': Replace missing values with the median of the corresponding column.
  • 'fill': Replace missing values with the mean for float columns and the median for integer columns.
  • 'ffill': Forward fill missing values.
  • 'bfill': Backward fill missing values.
  • 'interpolate': Interpolate missing values using linear interpolation.
  • int or float: Replace missing values with the specified constant.
required
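A few of the methods above can be sketched for a single entity's series in plain Python, with None standing in for a missing value. This is a simplified illustration, not the library's implementation: `impute_series` is a hypothetical helper, and it covers only 'mean', 'median', 'ffill', 'bfill' and constant fills.

```python
from statistics import mean, median

def impute_series(values, method):
    """Fill None entries in one series using a subset of the listed methods."""
    if isinstance(method, (int, float)):
        # Constant fill.
        return [method if v is None else v for v in values]
    if method in ("mean", "median"):
        observed = [v for v in values if v is not None]
        fill = mean(observed) if method == "mean" else median(observed)
        return [fill if v is None else v for v in values]
    if method == "ffill":
        out, last = [], None
        for v in values:
            last = v if v is not None else last
            out.append(last)  # leading gaps stay None: nothing to carry forward
        return out
    if method == "bfill":
        # Backward fill is forward fill on the reversed series.
        return impute_series(values[::-1], "ffill")[::-1]
    raise ValueError(f"unsupported method: {method!r}")

series = [None, 2.0, None, 4.0]
forward = impute_series(series, "ffill")   # [None, 2.0, 2.0, 4.0]
backward = impute_series(series, "bfill")  # [2.0, 2.0, 4.0, 4.0]
averaged = impute_series(series, "mean")   # [3.0, 2.0, 3.0, 4.0]
zeroed = impute_series(series, 0)          # [0, 2.0, 0, 4.0]
```

Applying the helper to each entity's series separately mirrors the "grouped by entity" behaviour described above.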

lag(lags)

Applies lag transformation to a LazyFrame.

Parameters:

Name Type Description Default
lags List[int]

A list of lag values to apply.

required
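To make the lag transformation concrete, here is a plain-Python sketch for one series. The helper `lag_series` and the `lag_k` column naming are assumptions made for illustration; the library's output column names may differ.

```python
def lag_series(values, lags):
    """Return one shifted copy of `values` per lag, padded with None at the start."""
    n = len(values)
    return {f"lag_{k}": ([None] * k + list(values))[:n] for k in lags}

series = [10, 20, 30]
lagged = lag_series(series, lags=[1, 2])
# lagged["lag_1"] == [None, 10, 20]
# lagged["lag_2"] == [None, None, 10]
```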

one_hot_encode(drop_first=False)

Encode categorical features as a one-hot numeric array.

Parameters:

Name Type Description Default
drop_first bool

Drop the first one hot feature.

False

Raises:

Type Description
ValueError

If X passed into transform_new contains unknown categories.
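The effect of drop_first and the unknown-category error can be sketched in plain Python. This is an illustration under assumptions: `encode` is a hypothetical helper that takes the known categories explicitly, which is not how the library receives them.

```python
def encode(values, categories, drop_first=False):
    """One-hot encode `values` against a fixed, ordered list of categories."""
    unknown = set(values) - set(categories)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    # Dropping the first category removes the redundant indicator column.
    kept = categories[1:] if drop_first else categories
    return [[int(v == c) for c in kept] for v in values]

colors = ["red", "green", "red"]
known = ["green", "red"]
full = encode(colors, known)                      # [[0, 1], [1, 0], [0, 1]]
reduced = encode(colors, known, drop_first=True)  # [[1], [0], [1]]
```

With drop_first=True the dropped category is still recoverable: it is the row where every remaining indicator is zero.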

reindex(drop_duplicates=False)

Reindexes the entity and time columns to have every possible combination of (entity, time).

Parameters:

Name Type Description Default
drop_duplicates bool

Defaults to False. If True, duplicates are dropped before reindexing.

False
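Reindexing to every (entity, time) combination amounts to a cross product followed by a lookup. The plain-Python sketch below assumes rows are (entity, time, value) tuples with unique (entity, time) keys; `reindex_rows` is a hypothetical helper, and combinations that were absent are filled with None.

```python
from itertools import product

def reindex_rows(rows):
    """Expand (entity, time, value) rows to the full entity x time grid."""
    entities = sorted({entity for entity, _, _ in rows})
    times = sorted({time for _, time, _ in rows})
    lookup = {(entity, time): value for entity, time, value in rows}
    # Missing combinations appear with a None value.
    return [(e, t, lookup.get((e, t))) for e, t in product(entities, times)]

rows = [("a", 1, 1.0), ("a", 2, 2.0), ("b", 2, 5.0)]
full_grid = reindex_rows(rows)
# [("a", 1, 1.0), ("a", 2, 2.0), ("b", 1, None), ("b", 2, 5.0)]
```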

resample(freq, agg_method, impute_method)

Resamples and transforms a DataFrame using the specified frequency, aggregation method, and imputation method.

Parameters:

Name Type Description Default
freq str

Offset alias supported by Polars.

required
agg_method str

The aggregation method to use for resampling. Supported values are 'sum', 'mean', and 'median'.

required
impute_method Union[str, int, float]

The method used for imputing missing values. If a string, supported values are 'ffill' (forward fill) and 'bfill' (backward fill). If an int or float, missing values will be filled with the provided value.

required

roll(window_sizes, stats, freq)

Performs rolling window calculations on specified columns of a DataFrame.

Parameters:

Name Type Description Default
window_sizes List[int]

A list of integers representing the window sizes for the rolling calculations.

required
stats List[Literal['mean', 'min', 'max', 'mlm', 'sum', 'std', 'cv']]

A list of statistical measures to calculate for each rolling window.

Supported values are:

  • 'mean' for mean
  • 'min' for minimum
  • 'max' for maximum
  • 'mlm' for maximum minus minimum
  • 'sum' for sum
  • 'std' for standard deviation
  • 'cv' for coefficient of variation
required
freq str

Offset alias supported by Polars.

required
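The listed statistics, including the less familiar 'mlm' and 'cv', can be sketched over fixed-size windows in plain Python. This illustration makes several assumptions: it ignores freq and uses row-count windows, uses the sample standard deviation, and names output columns `{stat}_{size}`. None of these is confirmed library behaviour, and `roll_series` is a hypothetical helper.

```python
from statistics import mean, stdev

STATS = {
    "mean": mean,
    "min": min,
    "max": max,
    "mlm": lambda w: max(w) - min(w),    # maximum minus minimum
    "sum": sum,
    "std": stdev,                        # sample standard deviation
    "cv": lambda w: stdev(w) / mean(w),  # coefficient of variation
}

def roll_series(values, window_sizes, stats):
    """Compute each statistic over every complete trailing window."""
    out = {}
    for size in window_sizes:
        windows = [values[i - size + 1 : i + 1] for i in range(size - 1, len(values))]
        for stat in stats:
            out[f"{stat}_{size}"] = [STATS[stat](w) for w in windows]
    return out

series = [1.0, 2.0, 4.0, 7.0]
rolled = roll_series(series, window_sizes=[2], stats=["mean", "mlm", "sum"])
# rolled["mean_2"] == [1.5, 3.0, 5.5]
# rolled["mlm_2"] == [1.0, 2.0, 3.0]
# rolled["sum_2"] == [3.0, 6.0, 11.0]
```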

scale(use_mean=True, use_std=True, rescale_bool=False)

Performs scaling and rescaling operations on the numeric columns of a DataFrame.

Parameters:

Name Type Description Default
use_mean bool

Whether to subtract the mean from the numeric columns. Defaults to True.

True
use_std bool

Whether to divide the numeric columns by the standard deviation. Defaults to True.

True
rescale_bool bool

Whether to rescale boolean columns to the range [-1, 1]. Defaults to False.

False
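How use_mean and use_std combine can be sketched for one series in plain Python. This is an illustration only: `scale_series` is a hypothetical helper, the sample standard deviation is an assumption, and boolean rescaling is not covered.

```python
from statistics import mean, stdev

def scale_series(values, use_mean=True, use_std=True):
    """Centre and/or scale one numeric series."""
    mu = mean(values) if use_mean else 0.0
    sigma = stdev(values) if use_std else 1.0  # sample standard deviation
    return [(v - mu) / sigma for v in values]

series = [2.0, 4.0, 6.0]
standardized = scale_series(series)                  # centred and scaled
centered = scale_series(series, use_std=False)       # [-2.0, 0.0, 2.0]
unit_variance = scale_series(series, use_mean=False) # scaled only
```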

time_to_arange(eager=False)

Coerces time column into arange per entity.

Assumes evenly spaced time series and homogeneous start dates.
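The idea of replacing timestamps with a per-entity integer range can be sketched in plain Python. This is an illustration under assumptions: rows are (entity, time, value) tuples, the index starts at 0, and `to_arange` is a hypothetical helper.

```python
def to_arange(rows):
    """Replace each entity's time values with 0, 1, 2, ... in time order."""
    counters = {}
    out = []
    for entity, _, value in sorted(rows, key=lambda r: (r[0], r[1])):
        idx = counters.get(entity, 0)
        out.append((entity, idx, value))
        counters[entity] = idx + 1
    return out

rows = [
    ("a", "2024-01-02", 2.0),
    ("a", "2024-01-01", 1.0),
    ("b", "2024-01-01", 5.0),
    ("b", "2024-01-02", 6.0),
]
indexed = to_arange(rows)
# [("a", 0, 1.0), ("a", 1, 2.0), ("b", 0, 5.0), ("b", 1, 6.0)]
```

The integer index only lines up across entities when the stated assumptions hold, that is, when every series starts on the same date and has the same spacing.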

trim(direction='both')

Trims time-series in panel to have the same start or end dates as the shortest time-series.

Parameters:

Name Type Description Default
direction Literal['both', 'left', 'right']

Defaults to "both". If "left", trims from the start date of the shortest time series; if "right", trims up to the end date of the shortest time series; if "both", trims between the start and end dates of the shortest time series.

'both'
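One way to read the three directions is as a common time window shared by all entities. The plain-Python sketch below takes that reading as an assumption: "left" cuts at the latest first timestamp across entities and "right" cuts at the earliest last timestamp. `trim_panel` is a hypothetical helper operating on a dict of {entity: {time: value}}.

```python
def trim_panel(panel, direction="both"):
    """Restrict every entity's series to a shared start and/or end time."""
    start = max(min(series) for series in panel.values())  # latest first timestamp
    end = min(max(series) for series in panel.values())    # earliest last timestamp

    def keep(t):
        if direction == "left":
            return t >= start
        if direction == "right":
            return t <= end
        return start <= t <= end

    return {
        entity: {t: v for t, v in series.items() if keep(t)}
        for entity, series in panel.items()
    }

panel = {
    "a": {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0},
    "b": {2: 5.0, 3: 6.0},
}
both = trim_panel(panel)            # "a" keeps times 2 and 3
left = trim_panel(panel, "left")    # "a" keeps times 2, 3 and 4
right = trim_panel(panel, "right")  # "a" keeps times 1, 2 and 3
```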

yeojohnson(brack=(-2, 2))

Applies the Yeo-Johnson transformation to numeric columns in a panel DataFrame.

Parameters:

Name Type Description Default
brack 2-tuple

The starting interval for a downhill bracket search with scipy.optimize.brent. In most cases this choice is not critical; the final result is allowed to lie outside this bracket.

(-2, 2)
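For orientation, the Yeo-Johnson transform for a fixed lambda can be sketched in plain Python. Unlike Box-Cox it accepts zero and negative values. This is an illustration with a hand-picked lambda: `yeojohnson_transform` is a hypothetical helper, and the bracketed search for lambda controlled by brack is not shown.

```python
import math

def yeojohnson_transform(values, lmbda):
    """Apply the Yeo-Johnson transform with a fixed, user-supplied lambda."""
    def transform_one(x):
        if x >= 0:
            if lmbda == 0:
                return math.log1p(x)
            return ((x + 1) ** lmbda - 1) / lmbda
        # Negative branch uses the mirrored exponent 2 - lambda.
        if lmbda == 2:
            return -math.log1p(-x)
        return -((1 - x) ** (2 - lmbda) - 1) / (2 - lmbda)

    return [transform_one(x) for x in values]

series = [-2.0, 0.0, 3.0]
identity = yeojohnson_transform(series, 1)    # lambda = 1 leaves values unchanged
log_branch = yeojohnson_transform(series, 0)  # log1p on the non-negative side
```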