Documentation

The full documentation is available here. Use the sidebar to select the most relevant sub-module.

All of these functions can be loaded from the dcarte_transform package directly, for example:

>>> import dcarte_transform as dct
>>> dct.Labeller()

Label

Here is the documentation for the labelling functionality.

For labelling the data.

class dcarte_transform.label.Labeller

Bases: object

__init__()

This class allows the user to label data.

The labelling types available can be accessed through the attribute .label_types.

Example

>>> from dcarte_transform.label import Labeller
>>> labeller = Labeller()
>>> labeller.label_types
['uti', 'agitation']
>>> all_labels = labeller.get_labels(
...     days_either_side=2,
...     return_event=True,
...     )
>>> # data is an unlabelled dataframe; the two calls below are equivalent
>>> data_labelled = labeller.label_df(data, subset='uti')
>>> data_labelled = labeller.uti_label_df(data)

agitation_label_df(df: DataFrame, id_col: str = 'patient_id', datetime_col: str = 'start_date', days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will label the input dataframe based on the agitation data in behaviour.

Parameters:

df (-) – Unlabelled dataframe, must contain columns [id_col, datetime_col], where id_col is the ids of participants and datetime_col is the time of the sensors.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
days_either_side (-) – The number of days either side of a label that will be given the same label. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate agitation events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- df_labelled – This is a dataframe containing the original data along with a new column, 'agitation_labels', which contains the labels. If return_event=True, a column titled 'agitation_event' will be added which contains unique IDs for each of the agitation episodes.

Return type:

pandas.DataFrame:

get_agitation_labels(days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will return the agitation labels. If a single day for a particular ID contains two different labels (usually caused by using days_either_side), then both labels are removed.

Parameters:

days_either_side (-) – The number of days either side of a label that will be given the same label. If these days overlap and the labels are the same, the first is kept; if the labels differ, neither is kept. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate agitation events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- out – A dataframe containing the Agitation labels, with the corresponding patient_id and date.

Return type:

pd.DataFrame:

get_labels(subset: None | str | List[str] = None, days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will return the labels from the subset given. If a single day for a particular ID contains two different labels (usually caused by using days_either_side), then both labels are removed.

Parameters:

subset (-) – The subset of label types to be used in the labelling. If None, then all label types will be used. These can be accessed using the attribute .label_types. Defaults to None.
days_either_side (-) – The number of days either side of a label that will be given the same label. If these days overlap and the labels are the same, the first is kept; if the labels differ, neither is kept. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate label events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- out – A dataframe containing the labels, with the corresponding patient_id and date.

Return type:

pd.DataFrame:
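The overlap rule for days_either_side can be illustrated in a few lines of plain Python. This is a minimal sketch of the rule as described above, not the package's implementation: the helper name expand_labels, the (patient_id, date, label) tuple layout and the example dates are all assumptions made for illustration.

```python
from datetime import date, timedelta


def expand_labels(events, days_either_side=0):
    """Spread each (patient_id, date, label) event over its neighbouring days.

    Hypothetical helper: a day that receives two different labels is dropped,
    and a day that receives the same label twice keeps the first one.
    """
    seen = {}           # (patient_id, day) -> first label assigned to that day
    conflicted = set()  # keys that received two different labels
    for patient_id, day, label in events:
        for offset in range(-days_either_side, days_either_side + 1):
            key = (patient_id, day + timedelta(days=offset))
            if key in seen and seen[key] != label:
                conflicted.add(key)
            seen.setdefault(key, label)
    return {key: label for key, label in seen.items() if key not in conflicted}


# Invented example: a positive and a negative label two days apart.
events = [
    ("p1", date(2022, 1, 10), True),
    ("p1", date(2022, 1, 12), False),
]
labels = expand_labels(events, days_either_side=1)
```

With days_either_side=1 both events claim 11 January with different labels, so that day is removed while the remaining four days keep their labels.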

get_uti_labels(days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will return the UTI labels. If a single day for a particular ID contains two different labels (usually caused by using days_either_side), then both labels are removed.

Parameters:

days_either_side (-) – The number of days either side of a label that will be given the same label. If these days overlap and the labels are the same, the first is kept; if the labels differ, neither is kept. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate UTI events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- out – A dataframe containing the UTI labels, with the corresponding patient_id and date.

Return type:

pd.DataFrame:

label_df(df: DataFrame, subset: None | str | List[str] = None, id_col: str = 'patient_id', datetime_col: str = 'start_date', days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will label the input dataframe based on the subset of labels given.

Parameters:

df (-) – Unlabelled dataframe, must contain columns [id_col, datetime_col], where id_col is the ids of participants and datetime_col is the time of the sensors.
subset (-) – The subset of label types to be used in the labelling. If None, then all label types will be used. These can be accessed using the attribute .label_types. Defaults to None.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
days_either_side (-) – The number of days either side of a label that will be given the same label. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate label events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- df_labelled – This is a dataframe containing the original data along with new columns which contain the labels. If return_event=True, further columns will be added which contain unique IDs for each of the label episodes.

Return type:

pandas.DataFrame:

previous_agitation_label_df(df: DataFrame, id_col: str = 'patient_id', datetime_col: str = 'start_date', day_delay: int = 1)

This method allows you to label the number of agitation positives to date for the corresponding ID and date.

Parameters:

df (-) – The dataframe to append the number of previous agitation positives to.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
day_delay (-) – The number of days after an agitation event is detected before the data reflects that the ID has had another previous agitation event. This prevents a predictive model from simply learning to look for increases in this feature. Defaults to 1.

Returns:

- df_out – This is a dataframe containing the original data along with a new column, 'agitation_previous', which contains the number of previous agitations to date for that ID.

Return type:

pandas.DataFrame:

previous_label_df(df: DataFrame, subset: None | str | List[str] = None, id_col: str = 'patient_id', datetime_col: str = 'start_date', day_delay: int = 1) → DataFrame

This function allows you to label the number of positives to date for the corresponding ID and date from the subset given.

Parameters:

df (-) – The dataframe to append the number of previous positives to.
subset (-) – The subset of label types to be used in the labelling. If None, then all label types will be used. These can be accessed using the attribute .label_types. Defaults to None.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
day_delay (-) – The number of days after a label is detected before the data reflects that the ID has had another previous label. This prevents a predictive model from simply learning to look for increases in this feature. Defaults to 1.

Returns:

- df_out – This is a dataframe containing the original data along with new columns which contain the number of previous positives to date for that ID, for each label type in the subset.

Return type:

pandas.DataFrame:
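The effect of day_delay can be sketched with plain dates. This reflects one reading of the description above (a positive on day D is only counted from day D + day_delay onwards) and is not the package's implementation; the function name count_previous and the example dates are hypothetical.

```python
from datetime import date, timedelta


def count_previous(positive_dates, query_date, day_delay=1):
    """Count positives recorded at least `day_delay` days before `query_date`."""
    cutoff = query_date - timedelta(days=day_delay)
    return sum(1 for positive in positive_dates if positive <= cutoff)


# Invented positives for a single ID.
positives = [date(2022, 1, 5), date(2022, 1, 20)]

same_day = count_previous(positives, date(2022, 1, 5))      # positive not yet counted
next_day = count_previous(positives, date(2022, 1, 6))      # first positive now counted
later = count_previous(positives, date(2022, 1, 25))        # both positives counted
no_delay = count_previous(positives, date(2022, 1, 5), day_delay=0)
```

With the default delay, the count on the day of a positive does not yet include it, so the feature cannot reveal the label of the row it is attached to.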

previous_uti_label_df(df: DataFrame, id_col: str = 'patient_id', datetime_col: str = 'start_date', day_delay: int = 1)

This function allows you to label the number of UTI positives to date for the corresponding ID and date.

Parameters:

df (-) – The dataframe to append the number of previous UTI positives to.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
day_delay (-) – The number of days after a UTI is detected before the data reflects that the ID has had another previous UTI. This prevents a predictive model from simply learning to look for increases in this feature. Defaults to 1.

Returns:

- df_out – This is a dataframe containing the original data along with a new column, 'uti_previous', which contains the number of previous UTIs to date for that ID.

Return type:

pandas.DataFrame:

uti_label_df(df: DataFrame, id_col: str = 'patient_id', datetime_col: str = 'start_date', days_either_side: int = 0, return_event: bool = False) → DataFrame

This method will label the input dataframe based on the UTI data in procedure.

Parameters:

df (-) – Unlabelled dataframe, must contain columns [id_col, datetime_col], where id_col is the ids of participants and datetime_col is the time of the sensors.
id_col (-) – The column name that contains the ID information. Defaults to 'patient_id'.
datetime_col (-) – The column name that contains the date time information. Defaults to 'start_date'.
days_either_side (-) – The number of days either side of a label that will be given the same label. Defaults to 0.
return_event (-) – This dictates whether another column should be added, with a unique id given to each of the separate UTI events. This allows the user to group the outputted data based on events. Defaults to False.

Returns:

- df_labelled – This is a dataframe containing the original data along with a new column, 'uti_labels', which contains the labels. If return_event=True, a column titled 'uti_event' will be added which contains unique IDs for each of the UTI episodes.

Return type:

pandas.DataFrame:
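The event column added by return_event=True is intended for grouping. The sketch below shows that grouping with pandas on a hand-built frame that stands in for the output of uti_label_df. The column names 'uti_labels' and 'uti_event' come from the description above; the 'value' column, the event ID format and every value in the frame are invented for illustration.

```python
import pandas as pd

# Stand-in for a labelled dataframe; all values here are made up.
data_labelled = pd.DataFrame(
    {
        "patient_id": ["p1", "p1", "p1", "p2"],
        "start_date": pd.to_datetime(
            ["2022-01-09", "2022-01-10", "2022-03-02", "2022-01-10"]
        ),
        "value": [4.0, 6.0, 5.0, 2.0],
        "uti_labels": [True, True, False, True],
        "uti_event": ["a", "a", "b", "c"],
    }
)

# One row per event: the mean sensor value recorded during that event.
per_event = data_labelled.groupby("uti_event")["value"].mean()
```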

Model Selection

Here is the documentation for the model selection functionality.

For evaluating models.

class dcarte_transform.model_selection.StratifiedEventKFold(n_splits: int = 5, shuffle: bool = False, random_state: None | int = None)

Bases: StratifiedGroupKFold

Parameters:

n_splits (int) –
shuffle (bool) –
random_state (None | int) –

__init__(n_splits: int = 5, shuffle: bool = False, random_state: None | int = None)

This class splits the dataset so that the proportion of labels in the training and testing sets is as equal as possible, while ensuring that no single event appears in both the training and testing set.

Example

>>> from dcarte_transform.model_selection import StratifiedEventKFold
>>> splitter = StratifiedEventKFold()
>>> splits = splitter.split(X, y.astype(int), events)
>>> for train_idx, test_idx in splits:
...     X_train, y_train, events_train = X[train_idx], y[train_idx], events[train_idx]
...     X_test, y_test, events_test = X[test_idx], y[test_idx], events[test_idx]

Parameters:

n_splits (-) – This is the number of splits to produce.
shuffle (-) – Dictates whether the data should be shuffled before the splits are made.
random_state (-) – This dictates the random seed that is used in the random operations for this class.

split(X, y, event)

This function builds the splits and returns a generator that can be iterated over to produce the training and testing indices.

Parameters:

X (-) – Training data with shape (n_samples,n_features), where n_samples is the number of samples and n_features is the number of features.
y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to stratify the data. This must be an array of integers.
event (-) – Event data with shape (n_samples), where n_samples is the number of samples. These are the event ids that are used to group the data into either the training or testing set.

Returns:

- splits – This is the generator containing the indices of the splits.

Return type:

generator:

class dcarte_transform.model_selection.StratifiedPIDKFold(n_splits: int = 5, shuffle: bool = False, random_state: None | int = None)

Bases: StratifiedGroupKFold

Parameters:

n_splits (int) –
shuffle (bool) –
random_state (None | int) –

__init__(n_splits: int = 5, shuffle: bool = False, random_state: None | int = None)

This class splits the dataset so that the proportion of labels in the training and testing sets is as equal as possible, while ensuring that no single PID appears in both the training and testing set.

Example

>>> from dcarte_transform.model_selection import StratifiedPIDKFold
>>> splitter = StratifiedPIDKFold()
>>> splits = splitter.split(X, y.astype(int), ids)
>>> for train_idx, test_idx in splits:
...     X_train, y_train, ids_train = X[train_idx], y[train_idx], ids[train_idx]
...     X_test, y_test, ids_test = X[test_idx], y[test_idx], ids[test_idx]

Parameters:

n_splits (-) – This is the number of splits to produce.
shuffle (-) – Dictates whether the data should be shuffled before the splits are made.
random_state (-) – This dictates the random seed that is used in the random operations for this class.

split(X, y, pid)

This function builds the splits and returns a generator that can be iterated over to produce the training and testing indices.

Parameters:

X (-) – Training data with shape (n_samples,n_features), where n_samples is the number of samples and n_features is the number of features.
y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to stratify the data. This must be an array of integers.
pid (-) – PID data with shape (n_samples), where n_samples is the number of samples. These are the ids that are used to group the data into either the training or testing set.

Returns:

- splits – This is the generator containing the indices of the splits.

Return type:

generator:

dcarte_transform.model_selection.train_test_event_split(*arrays, y, event, test_size: float | None = None, train_size: float | None = None, random_state: None | int = None, shuffle: bool = True)

This function returns the train and test data given the split and the data. A single event will not be in both the training and testing set. You should use either test_size or train_size but not both.

Example

>>> from dcarte_transform.model_selection import train_test_event_split
>>> (X_train, X_test,
...     y_train, y_test,
...     event_train, event_test) = train_test_event_split(X, y=y, event=event, test_size=0.33)

Parameters:

arrays (-) – The data to split into training and testing sets. The labels and the events should be passed to y and event respectively.
y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to group the data into either the training or testing set.
event (-) – Event data with shape (n_samples), where n_samples is the number of samples. These are the event ids that are used to group the data into either the training or testing set.
test_size (-) – This dictates the size of the outputted test set. This should be used if train_size=None. If neither test_size nor train_size is given, then test_size will be set to 0.25. Defaults to None.
train_size (-) – This dictates the size of the outputted train set. This should be used if test_size=None. Defaults to None.
shuffle (-) – Dictates whether the data should be shuffled before the split is made.
random_state (-) – This dictates the random seed that is used in the random operations for this function.

Returns:

- split arrays – This is a list of the input data, split into the training and testing sets. See the Example for an understanding of the order of the outputted arrays.

Return type:

list:
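The idea of keeping each event on one side of the split can be sketched without any dependencies. This is a deliberately simplified illustration, not the package's implementation: it ignores the labels in y entirely, and the function name group_train_test_split, the seed argument and the example event IDs are hypothetical.

```python
import random


def group_train_test_split(groups, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) so that no group ID lands on both sides."""
    unique_groups = sorted(set(groups))
    random.Random(seed).shuffle(unique_groups)

    # Move whole groups into the test set until it is large enough.
    target = test_size * len(groups)
    test_groups, n_test = set(), 0
    for group in unique_groups:
        if n_test >= target:
            break
        test_groups.add(group)
        n_test += groups.count(group)

    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    return train_idx, test_idx


# Invented event IDs, one per sample.
event = ["e1", "e1", "e2", "e3", "e3", "e3", "e4", "e4"]
train_idx, test_idx = group_train_test_split(event, test_size=0.25)
```

Because whole events are moved at once, the achieved test fraction can exceed test_size when events are large.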

dcarte_transform.model_selection.train_test_pid_split(*arrays, y, pid, test_size: float | None = None, train_size: float | None = None, random_state: None | int = None, shuffle: bool = True)

This function returns the train and test data given the split and the data. A single pid will not be in both the training and testing set. You should use either test_size or train_size but not both.

Example

>>> from dcarte_transform.model_selection import train_test_pid_split
>>> (X_train, X_test,
...     y_train, y_test,
...     ids_train, ids_test) = train_test_pid_split(X, y=y, pid=pid, test_size=0.33)

Parameters:

arrays (-) – The data to split into training and testing sets. The labels and the PIDs should be passed to y and pid respectively.
y (-) – Label data with shape (n_samples), where n_samples is the number of samples. These are the labels that are used to group the data into either the training or testing set.
pid (-) – PID data with shape (n_samples), where n_samples is the number of samples. These are the ids that are used to group the data into either the training or testing set.
test_size (-) – This dictates the size of the outputted test set. This should be used if train_size=None. If neither test_size nor train_size is given, then test_size will be set to 0.25. Defaults to None.
train_size (-) – This dictates the size of the outputted train set. This should be used if test_size=None. Defaults to None.
shuffle (-) – Dictates whether the data should be shuffled before the split is made.
random_state (-) – This dictates the random seed that is used in the random operations for this function.

Returns:

- split arrays – This is a list of the input data, split into the training and testing sets. See the Example for an understanding of the order of the outputted arrays.

Return type:

list:

Recipe

Here is the documentation for the recipe functionality.

Custom recipes for dcarte.

dcarte_transform.recipe.create_feature_engineering_datasets()

dcarte_transform.recipe.create_tihm_and_minder_datasets()

Transform

Here is the documentation for the data transform functionality.

Data transform and calculation functions.

dcarte_transform.transform.between_time(df, time_range, datetime_col)

dcarte_transform.transform.collapse_levels(df)

dcarte_transform.transform.compute_delta(array: array, pad: bool = False, true_divide=nan)

This function allows the user to calculate the proportional change between each element in x and its previous element. This is done using the formula:

>>> (x_{i} - x_{i-1})/x_{i-1}

Parameters:

array (array) – The array, x in the formula above, to calculate the delta values on.
pad (-) – Dictates whether NAN values should be added to the beginning of the array, so that the output is of the same shape as array.
true_divide (-) – Whether the division by 0 should be replaced with nan values.

Returns:

- delta_values – An array containing the delta values.

Return type:

pandas.Series:
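The formula above can be written out with NumPy as follows. This is a minimal sketch of the calculation, not the package's compute_delta: the function name proportional_change is hypothetical, and returning NaN wherever the previous element is 0 is an assumption based on the parameter description.

```python
import numpy as np


def proportional_change(x, pad=False):
    """Return (x[i] - x[i-1]) / x[i-1] for i >= 1, with NaN where x[i-1] == 0."""
    x = np.asarray(x, dtype=float)
    previous, current = x[:-1], x[1:]

    # Start from NaN and only fill the positions where the division is defined.
    delta = np.full(previous.shape, np.nan)
    np.divide(current - previous, previous, out=delta, where=previous != 0)

    if pad:
        # A leading NaN makes the output the same length as the input.
        delta = np.concatenate([[np.nan], delta])
    return delta


delta = proportional_change([2, 4, 3, 0, 5])
padded = proportional_change([2, 4, 3, 0, 5], pad=True)
```

For this input the first three values are 1.0, -0.25 and -1.0, and the last is NaN because the preceding element is 0.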

dcarte_transform.transform.datetime_compare_rolling(df: DataFrame, funcs, s: str = '1d', w_distribution: str = '7d', w_sample: str = '1d', datetime_col: str = 'start_date', value_col: str = 'value', label: str = 'left', sorted=False)

This function will roll over a dataframe, with step size equal to s. This function compares the data in w_sample and w_distribution by passing the data in them to the functions given in funcs, which should have structure:

>>> result = func(array_sample, array_distribution)

Example

The following would calculate the relative change in the median between each day’s data and the previous week’s data, grouped by ID and transition. The calculations would also be computed in parallel.

>>> from functools import partial
>>> import pandas as pd
>>> import dcarte
>>> from pandarallel import pandarallel as pandarallel_
>>> from dcarte_transform.transform import datetime_compare_rolling
>>> from dcarte_transform.utils.progress import tqdm_style, pandarallel_progress

# loading data
>>> transitions = dcarte.load('transitions', 'base')
# filtering out the transitions longer than 5 minutes
>>> transition_upper_bound = 5*60
>>> transitions=transitions[transitions['dur']<transition_upper_bound]

# setting up parallel compute
>>> pandarallel_progress(desc="Computing transition median deltas", smoothing=0, **tqdm_style)
>>> pandarallel_.initialize(progress_bar=True)

# relative median function
>>> def relative_median_delta(array_sample, array_distribution):
        import numpy as np # required for parallel compute on Windows
        median_sample = np.median(array_sample)
        median_distribution = np.median(array_distribution)
        if median_distribution == 0:
            return np.nan
        return (median_sample-median_distribution)/median_distribution

# setting up arguments for the rolling calculations
>>> datetime_compare_rolling_partial = partial(
        datetime_compare_rolling,
        funcs=[relative_median_delta],
        s='1d',
        w_distribution='7d',
        w_sample='1d',
        datetime_col='start_date',
        value_col='dur',
        label='left',
        )

>>> daily_rel_transitions = (transitions
        [['patient_id', 'transition', 'start_date', 'dur']]
        .sort_values('start_date')
        .dropna()
        .groupby(by=['patient_id', 'transition',])
        .parallel_apply(datetime_compare_rolling_partial)
        )
>>> daily_rel_transitions['date'] = pd.to_datetime(daily_rel_transitions['start_date']).dt.date
>>> daily_rel_transitions = (daily_rel_transitions
        .reset_index(drop=False)
        .drop(['level_2', 'start_date'], axis=1))

Parameters:

df (-) – The dataframe to apply the rolling function to.
funcs (-) – The functions that will be applied to the values. This should allow for two arguments to be passed. It will be called in the following way: func(array_sample, array_distribution).
s (-) – The step size when rolling over the dataframe. Defaults to '1d'.
w_distribution (-) – The window size for the distribution. Defaults to '7d'.
w_sample (-) – The window size for the sample. Defaults to '1d'.
datetime_col (-) – The name of the column containing the datetimes. This will be passed into pandas.to_datetime() before operations are applied to it. Defaults to 'start_date'.
value_col (-) – The column name for the values that will be passed to the functions given in funcs. Defaults to 'value'.
label (-) – The direction used when labelling date time values based on each of the steps. This can be in ['left', 'right']. 'left' will use the date from the beginning of the data in w_sample, whereas 'right' will use the end datetime of the data in w_sample. Defaults to 'left'.
sorted (-) – If False, this function will sort the dataframe on the datetime_col before performing any calculations. If the dataframe is already sorted then please give sorted=True. Defaults to False.

Returns:

- out – A dataframe containing the calculations, under column names equal to the names of the functions that produced them, together with the date time of the beginning or end of each window, depending on the label argument.

Return type:

pandas.DataFrame:

dcarte_transform.transform.datetime_rolling(df: DataFrame, funcs, s: str = '1d', w: str = '7d', datetime_col: str = 'start_date', value_col: str = 'value', label: str = 'left', dataframe_apply: bool = False, pad: bool = False)

This function will roll over a dataframe, with step size equal to s and with a window equal to w, applying the functions given to each window.

This is required as pandas does not allow for a custom step size in its .rolling() function.

Parameters:

df (-) – The dataframe to apply the rolling function to.
funcs (-) – The functions that will be applied to the values.
s (-) – The step size when rolling over the dataframe. Defaults to '1d'.
w (-) – The window size used in the rolling calculation. Defaults to '7d'.
datetime_col (-) – The name of the column containing the datetimes. This will be passed into pandas.to_datetime() before operations are applied to it. Defaults to 'start_date'.
value_col (-) – The column name for the values that will be passed to the functions given in funcs. Defaults to 'value'.
label (-) – The direction used when labelling date time values based on each of the steps. This can be in ['left', 'right']. Defaults to 'left'.
dataframe_apply (-) – Whether to pass the functions the section of the dataframe for each window, or the values from value_col. Defaults to False.
pad (-) – Whether to pad the dates either side of the rolling operation. If True, the first window will be data from before the earliest date, only containing the earliest date and the last window will contain dates from after the latest date, and contain only the data from the latest date. Defaults to False.

Returns:

- out – A dataframe containing the calculations, under column names equal to the names of the functions that produced them, together with the date time of the beginning or end of each window, depending on the label argument.

Return type:

pandas.DataFrame:
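Rolling with a step size that differs from the window size can be sketched with a plain loop over window start times. This is an illustration of the idea only, not the package's datetime_rolling: the function name stepped_rolling is hypothetical, the step and window are given as whole days rather than strings such as '1d', each window is labelled with its left edge, and the example data are invented.

```python
import pandas as pd


def stepped_rolling(df, func, step_days=1, window_days=7,
                    datetime_col="start_date", value_col="value"):
    """Apply `func` to the values in each window [t, t + window), stepping t forward."""
    times = pd.to_datetime(df[datetime_col])
    step = pd.Timedelta(days=step_days)
    window = pd.Timedelta(days=window_days)

    rows = []
    start = times.min().floor("D")  # begin at midnight of the earliest day
    while start <= times.max():
        in_window = (times >= start) & (times < start + window)
        if in_window.any():
            result = func(df.loc[in_window, value_col])
            rows.append({datetime_col: start, func.__name__: result})
        start += step
    return pd.DataFrame(rows)


# Invented sensor readings over three days.
df = pd.DataFrame(
    {
        "start_date": pd.to_datetime(
            ["2022-01-01 08:00", "2022-01-01 20:00", "2022-01-02 09:00", "2022-01-03 10:00"]
        ),
        "value": [1.0, 3.0, 5.0, 7.0],
    }
)
out = stepped_rolling(df, pd.Series.median, step_days=1, window_days=2)
```

Here a two-day window moves forward one day at a time, so consecutive windows overlap by one day, and the result column is named after the function, as in the return description above.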

dcarte_transform.transform.groupby_freq(df, groupby_cols, count_col)

dcarte_transform.transform.lowercase_colnames(df)

dcarte_transform.transform.moving_average(array: array, w: int = 3, pad: bool = False)

Calculate the moving average of a 1D array.

Parameters:

array (-) – This is the array to calculate the moving average of.
w (-) – This is the window size to use when calculating the moving average of the array.
pad (-) – Dictates whether NAN values should be added to the beginning of the array, so that the output is of the same shape as array.

Returns:

- moving_average – An array containing the moving average.

Return type:

numpy.array:
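A moving average over a 1D array can be expressed as a convolution with a uniform kernel. The sketch below is one common way to compute it with NumPy and is not necessarily how the package implements moving_average; the function name moving_average_sketch is hypothetical.

```python
import numpy as np


def moving_average_sketch(array, w=3, pad=False):
    """Average each run of `w` consecutive elements of a 1D array."""
    array = np.asarray(array, dtype=float)

    # mode="valid" keeps only positions where the window fits entirely.
    averaged = np.convolve(array, np.ones(w) / w, mode="valid")

    if pad:
        # w - 1 leading NaNs restore the original length.
        averaged = np.concatenate([np.full(w - 1, np.nan), averaged])
    return averaged


averaged = moving_average_sketch([1, 2, 3, 4, 5], w=3)
padded = moving_average_sketch([1, 2, 3, 4, 5], w=3, pad=True)
```

Without padding the output has len(array) - w + 1 elements; with padding it has the same length as the input.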

dcarte_transform.transform.relative_func_delta(array_sample, array_distribution, func)

Utils

Here is the documentation for the utils functionality.

Util functions used by the package.

class dcarte_transform.utils.TQDMProgressBarPandarallelGenerator(**tqdm_kwargs)

Bases: object

__init__(**tqdm_kwargs)

get_bar()

dcarte_transform.utils.pandarallel_progress(**tqdm_kwargs)