CDPG Anonymisation Toolkit

Welcome to the documentation for the CDPG Anonymisation Toolkit!

Overview

This toolkit provides a set of tools for anonymising data. Its functions cover the full workflow: preprocessing and preparing the data for anonymisation, anonymising it, applying post-processing methods, and validating the selected anonymisation method.

Quick Start

pip install cdpg-anonkit --extra-index-url=https://test.pypi.org/simple/

import pandas as pd
import cdpg_anonkit

# Quick example
from cdpg_anonkit import SanitiseData as sanitisation

example_data = pd.DataFrame({
    'age': [25, 40, 15, 60, 18, 90, 22, 45, 50, 55],
    'income': [50000, 80000, 65000, 120000, 20000, 90000, 55000, 75000, 85000, 95000],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix',
             'New York', 'Chicago', 'Los Angeles', 'Dallas', 'Dallas'],
})

sanitisation_rules = {
    'age': {'method': 'clip', 'params': {'min_value': 25, 'max_value': 70}},
    'name': {'method': 'hash', 'params': {'salt': 'md5'}},
}

sanitised_data = sanitisation.sanitise_data(df=example_data,
                                            columns_to_sanitise=['age', 'name'],
                                            sanitisation_rules=sanitisation_rules)
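
To make the two rules above concrete, here is a minimal sketch of what clipping and salted hashing do to a column, written with the Python standard library only. It does not call the toolkit: the helpers clip_values and hash_values are hypothetical, and the choice of SHA-256 and of prefixing the salt to the value are assumptions made for illustration, not a description of how cdpg_anonkit hashes internally.

```python
import hashlib


def clip_values(values, min_value, max_value):
    """Bound each value to the closed interval [min_value, max_value]."""
    return [max(min_value, min(max_value, v)) for v in values]


def hash_values(values, salt):
    """Replace each value with the SHA-256 hex digest of salt + value.

    Assumption: the salt is simply prefixed to the value before hashing.
    """
    return [
        hashlib.sha256((salt + str(v)).encode("utf-8")).hexdigest()
        for v in values
    ]


ages = [25, 40, 15, 60, 18, 90, 22, 45, 50, 55]
names = ["Alice", "Bob", "Charlie"]

# Ages below 25 are raised to 25 and ages above 70 are lowered to 70.
clipped_ages = clip_values(ages, min_value=25, max_value=70)

# Each name becomes a 64-character hex string; the salt value is illustrative.
hashed_names = hash_values(names, salt="example-salt")

print(clipped_ages)
print(hashed_names[0])
```

Clipping limits the influence of outliers (here the ages 15, 18, 22 and 90), while hashing removes the readable identifier but keeps equal inputs mapped to equal outputs, so records can still be grouped or joined on the hashed column.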

Possible Operations

  • Sanitisation
      • Clipping
      • Hashing
      • Suppression

  • Generalisation
      • Spatial Generalisation
      • Temporal Generalisation
      • Categorical Generalisation

  • Aggregation
      • Query Building

  • Differential Privacy
      • Sensitivity Computation
      • Noise Addition

  • Post Processing
      • Rounding and Clipping
      • Epsilon vs MAE
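
As an illustration of how the differential privacy and post-processing steps listed above fit together, the following sketch releases a noisy mean using the Laplace mechanism, then clips and rounds the result, and finally compares the mean absolute error (MAE) for several values of epsilon. It uses only the Python standard library and does not call the toolkit. All function names are hypothetical, and the sensitivity formula assumes a fixed, public number of records where neighbouring datasets differ in the value of one record.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def mean_sensitivity(lower, upper, n):
    """Sensitivity of a mean over n values bounded to [lower, upper]."""
    return (upper - lower) / n


def clipped_mean(values, lower, upper):
    """Mean of the values after bounding each one to [lower, upper]."""
    clipped = [max(lower, min(upper, v)) for v in values]
    return sum(clipped) / len(clipped)


def private_mean(values, lower, upper, epsilon, rng):
    """Add Laplace noise to the clipped mean, then post-process."""
    scale = mean_sensitivity(lower, upper, len(values)) / epsilon
    noisy = clipped_mean(values, lower, upper) + laplace_noise(scale, rng)
    # Post-processing: keep the release inside the valid range and round it.
    return round(max(lower, min(upper, noisy)), 2)


def mean_absolute_error(values, lower, upper, epsilon, trials, rng):
    """Average absolute gap between noisy releases and the clipped mean."""
    target = clipped_mean(values, lower, upper)
    errors = [
        abs(private_mean(values, lower, upper, epsilon, rng) - target)
        for _ in range(trials)
    ]
    return sum(errors) / trials


rng = random.Random(0)  # fixed seed so the comparison is repeatable
ages = [25, 40, 15, 60, 18, 90, 22, 45, 50, 55]

mae_by_epsilon = {
    eps: mean_absolute_error(ages, 25, 70, eps, trials=2000, rng=rng)
    for eps in (0.1, 1.0, 10.0)
}
for eps, mae in mae_by_epsilon.items():
    print(f"epsilon={eps}: MAE={mae:.2f}")
```

A smaller epsilon gives stronger privacy but a larger noise scale, so the MAE rises as epsilon falls. Plotting MAE against epsilon is one way to choose a privacy budget that keeps the released statistic useful.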
