cdpg_anonkit.generalisation
Classes
Module Contents
- class cdpg_anonkit.generalisation.GeneraliseData
- class SpatialGeneraliser
- static format_coordinates(series: pandas.Series) → Tuple[pandas.Series, pandas.Series]
Clean the formatting of a coordinates attribute.
Takes a pandas Series of coordinate strings and returns a tuple of two Series: the first holds the latitudes, the second the longitudes.
Each coordinate is expected in the format “[lat, lon]”. The function strips leading and trailing whitespace and brackets, splits the value into two parts, and converts each part to a float.
A ValueError is raised if a coordinate string is not in the expected format.
- Parameters:
series (pd.Series) – The series of coordinates to be cleaned.
- Returns:
A tuple of two Series, one with the latitude and one with the longitude.
- Return type:
Tuple[pd.Series, pd.Series]
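The parsing rule described for format_coordinates can be illustrated for plain strings without pandas. This is a minimal sketch, not the library's implementation: the helper names split_coordinate and split_coordinates are hypothetical, and the sample values are made up for illustration.

```python
from typing import List, Tuple


def split_coordinate(value: str) -> Tuple[float, float]:
    """Parse one "[lat, lon]" string into a (lat, lon) pair of floats."""
    # Strip surrounding whitespace first, then the square brackets.
    cleaned = value.strip().strip("[]")
    parts = cleaned.split(",")
    if len(parts) != 2:
        raise ValueError(f"Expected '[lat, lon]', got {value!r}")
    # float() tolerates whitespace around each number and raises
    # ValueError on its own for non-numeric text.
    return float(parts[0]), float(parts[1])


def split_coordinates(values: List[str]) -> Tuple[List[float], List[float]]:
    """Return parallel latitude and longitude lists."""
    pairs = [split_coordinate(v) for v in values]
    return [lat for lat, _ in pairs], [lon for _, lon in pairs]


lats, lons = split_coordinates(["[12.97, 77.59]", " [28.61,77.21] "])
```

The real method works on a pandas Series and returns two Series, so the lists here only stand in for those.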
- static generalise_spatial(latitude: pandas.Series, longitude: pandas.Series, spatial_resolution: int) → pandas.Series
Generalise a set of coordinates to an H3 index at a given resolution.
- Parameters:
latitude (pd.Series) – The series of latitude values to be generalised.
longitude (pd.Series) – The series of longitude values to be generalised.
spatial_resolution (int) – The spatial resolution of the H3 index. Must be between 0 and 15.
- Returns:
A series of H3 indices at the specified resolution.
- Return type:
pd.Series
- Raises:
ValueError – If the spatial resolution is not between 0 and 15, if any latitude is outside the range -90 to 90, or if any longitude is outside the range -180 to 180.
Warning – If the latitude and longitude series differ in length.
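The input checks listed above can be sketched as a standalone validator. This is a sketch under assumptions: the function name check_spatial_inputs is hypothetical, the order of the checks is a guess, and the length mismatch is reported with warnings.warn, which may not match how the library signals it. The H3 indexing step itself is not shown.

```python
import warnings
from typing import Sequence


def check_spatial_inputs(
    latitude: Sequence[float],
    longitude: Sequence[float],
    spatial_resolution: int,
) -> None:
    """Validate inputs before any H3 indexing would take place."""
    if not 0 <= spatial_resolution <= 15:
        raise ValueError("spatial_resolution must be between 0 and 15")
    if any(not -90 <= lat <= 90 for lat in latitude):
        raise ValueError("latitude values must be between -90 and 90")
    if any(not -180 <= lon <= 180 for lon in longitude):
        raise ValueError("longitude values must be between -180 and 180")
    if len(latitude) != len(longitude):
        # Assumption: a mismatch is surfaced as a warning, not an exception.
        warnings.warn("latitude and longitude differ in length")


# Valid inputs pass silently.
check_spatial_inputs([12.97, 28.61], [77.59, 77.21], spatial_resolution=7)
```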
- class TemporalGeneraliser
- static format_timestamp(series: pandas.Series) → pandas.Series
Convert a pandas Series of timestamps into datetime objects.
This function takes a Series containing timestamp data and converts it into pandas datetime objects. It handles mixed format timestamps and coerces any non-parseable values into NaT (Not a Time).
- Parameters:
series (pd.Series) – The input Series containing timestamp data to be converted.
- Returns:
A Series where all timestamp values have been converted to datetime objects, with non-parseable values set to NaT.
- Return type:
pd.Series
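The coercion behaviour described for format_timestamp (mixed formats accepted, unparseable values replaced by a missing-value marker) can be illustrated in plain Python. This is a minimal sketch, not the library's code: KNOWN_FORMATS and parse_or_none are hypothetical, and None stands in for pandas' NaT. In pandas, this kind of coercion is commonly done with pandas.to_datetime(series, errors="coerce"); whether the library uses that call is an assumption.

```python
from datetime import datetime
from typing import Optional

# Hypothetical list of accepted layouts; the library's parser may accept more.
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S",
    "%Y-%m-%d %H:%M:%S",
    "%d/%m/%Y %H:%M",
)


def parse_or_none(value: str) -> Optional[datetime]:
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None  # stands in for NaT


raw = ["2024-03-01T08:47:10", "01/03/2024 09:05", "not a time"]
parsed = [parse_or_none(v) for v in raw]
```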
- static generalise_temporal(data: pandas.Series | pandas.DataFrame, timestamp_col: str = None, temporal_resolution: int = 60) → pandas.Series
Generalise timestamp data into specified temporal resolutions.
This function processes timestamp data, either in the form of a Series or a DataFrame, and generalises it into timeslots based on the specified temporal resolution. The resolution must be one of the following values: 15, 30, or 60 minutes.
- Parameters:
data (Union[pd.Series, pd.DataFrame]) – The input timestamp data. Can be a pandas Series of datetime objects or a DataFrame containing a column with datetime data.
timestamp_col (str, optional) – The name of the column containing timestamp data in the DataFrame. Must be specified if the input data is a DataFrame. Defaults to None.
temporal_resolution (int, optional) – The temporal resolution in minutes for which the timestamps should be generalised. Allowed values are 15, 30, or 60. Defaults to 60.
- Returns:
A pandas Series representing the generalised timeslots, with each entry formatted as ‘hour_minute’, indicating the start of the timeslot.
- Return type:
pd.Series
- Raises:
AssertionError – If the temporal resolution is not one of the allowed values (15, 30, 60).
ValueError – If timestamp_col is not specified when the input data is a DataFrame, if the specified column is not found in the DataFrame, or if the timestamps cannot be converted to datetime objects.
TypeError – If the input data is neither a pandas Series nor a DataFrame.
Example
# Using with a Series
generalise_temporal(ts_series)
# Using with a DataFrame
generalise_temporal(df, timestamp_col='timestamp')
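The timeslot rule itself, rounding each timestamp down to the start of its 15, 30, or 60 minute slot, can be sketched for a single datetime. This is an illustration under assumptions: to_timeslot is a hypothetical helper, and the exact label layout (here unpadded, such as 8_45) is a guess at the 'hour_minute' format described above.

```python
from datetime import datetime

ALLOWED_RESOLUTIONS = (15, 30, 60)


def to_timeslot(ts: datetime, temporal_resolution: int = 60) -> str:
    """Return the start of the timeslot containing ts as 'hour_minute'."""
    assert temporal_resolution in ALLOWED_RESOLUTIONS, (
        "temporal_resolution must be 15, 30, or 60"
    )
    # Integer division drops the remainder, which rounds the minute down
    # to the nearest multiple of the resolution (always 0 for 60).
    slot_minute = (ts.minute // temporal_resolution) * temporal_resolution
    # Assumption: no zero padding in the label.
    return f"{ts.hour}_{slot_minute}"


stamp = datetime(2024, 3, 1, 8, 47)
slots = [to_timeslot(stamp, r) for r in ALLOWED_RESOLUTIONS]
```

For the sample timestamp 08:47, the sketch yields the 08:45 slot at 15 minutes, the 08:30 slot at 30 minutes, and the 08:00 slot at 60 minutes.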
- class CategoricalGeneraliser
- static generalise_categorical(data: pandas.Series, bins: int | List[float], labels: List[str] | None = None) → pandas.Series
Generalise a categorical column by binning the values into categories.
- Parameters:
data (pd.Series) – The input Series to be generalised.
bins (Union[int, List[float]]) – The number of bins to use, or a list of bin edges.
labels (Optional[List[str]], optional) – The labels to use for each bin. If not specified, the bin edges will be used as labels.
- Returns:
The generalised Series.
- Return type:
pd.Series
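Binning against an explicit list of edges with one label per bin can be sketched in plain Python. This is an illustration only: bin_values is a hypothetical helper, the bins are assumed to be right-closed (the default behaviour of pandas.cut), and whether generalise_categorical delegates to pandas.cut is not stated here. The age values and labels are made up.

```python
from bisect import bisect_left
from typing import List, Optional


def bin_values(
    values: List[float], edges: List[float], labels: List[str]
) -> List[Optional[str]]:
    """Assign each value to a right-closed bin (edges[i], edges[i + 1]]."""
    if len(labels) != len(edges) - 1:
        raise ValueError("Need exactly one label per bin")
    result: List[Optional[str]] = []
    for v in values:
        # bisect_left gives the first index i with edges[i] >= v, so v
        # belongs to the bin that ends at edges[i].
        i = bisect_left(edges, v)
        if i == 0 or i == len(edges):
            result.append(None)  # value lies outside every bin
        else:
            result.append(labels[i - 1])
    return result


ages = [5, 18, 19, 70, 130]
groups = bin_values(ages, edges=[0, 18, 65, 120], labels=["child", "adult", "senior"])
```

In this sketch a value equal to a bin's upper edge stays in that bin, and values at or below the first edge or above the last edge are left unlabelled.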