Title: Multimethod synthetic data generation for confidentiality and measurement of disclosure risk

Authors: Michael D. Larsen; Jennifer C. Huckett

Addresses: Biostatistics Center; Department of Statistics, George Washington University, 6110 Executive Blvd., Ste 750, Rockville, MD 20852, USA. Battelle Memorial Institute, 505 King Ave., Columbus, OH 43201, USA

Abstract: Government agencies must simultaneously maintain confidentiality of individual records and disseminate useful microdata. We propose a method to create synthetic data that combines quantile regression, hot deck imputation, and rank swapping. The proposed procedure yields a releasable dataset containing original values for a few key variables, synthetic quantile regression predictions for several variables, and imputed and perturbed values for the remaining variables. To measure the disclosure risk in the resulting synthetic dataset, we extend existing probabilistic risk measures that imitate an intruder attempting to match a record in the released data with information previously available on a target respondent.

Keywords: disclosure control; hot deck imputation; quantile regression; rank swapping; simulation; statistical disclosure limitation; SDL; synthetic data; disclosure avoidance; disclosure risk; confidentiality; privacy; security.

DOI: 10.1504/IJIPSI.2012.046132

International Journal of Information Privacy, Security and Integrity, 2012 Vol.1 No.2/3, pp.184-204

Published online: 23 Aug 2014
