Applications to Assist in De-identification of Human Subjects Research Data

Researchers are increasingly asked to share the data they generate in the course of their research. However, some of this data contains information about study participants, and sharing it unaltered could breach those participants' confidentiality. Removing direct and indirect personal identifiers from research data can substantially reduce the risk of sharing this sensitive data.

Johns Hopkins Data Management Services has compiled a list of software tools and applications that researchers can use to de-identify their research data before sharing it more publicly. The list gives brief information about each tool to help a researcher determine whether it might prove useful for their project. Johns Hopkins Data Management Services has not tested these applications and does not endorse any of them. This information is provided only to assist researchers in searching for possible de-identification solutions.

See also our overview document on protecting and removing personal identifiers of research subjects for data sharing: De-Identifying Human Subjects Data (JHU version; a version for non-JHU visitors is also available).

Johns Hopkins researchers are encouraged to talk with Johns Hopkins Data Management Services for advice and guidance on de-identification of human subjects research data. Please contact us at datamanagement@jhu.edu.

 

  • Tools for De-identifying Unstructured Text
  • Tools for De-identifying Data in Digital Images
  • Tools for De-identifying Microdata, Tabular or Otherwise Structured Data

Tools for De-identifying Unstructured Text
  • ATLAS.ti
    • Software description: “a suite of tools that supports analysis of written texts, audio clips, video files, and visual/graphic data” (from Wikipedia)
    • Intended purpose: “help researchers uncover and systematically analyze complex phenomena hidden in unstructured data (text, multimedia, geospatial)” (from Wikipedia)
  • deid software package
    • Software description: “includes code and dictionaries for automated location and removal of protected health information (PHI) in free text from medical records”
    • Intended purpose: For free text in medical records
  • Nvivo
    • Software description: “a qualitative data analysis computer software package designed for qualitative researchers working with very rich text-based and/or multimedia information” (from Wikipedia)
    • Intended purpose: Transcripts and free-form text; qualitative data analysis
  • PARAT text (Privacy Analytics Lexicon)
    • Software description: “Using PARAT Text for anonymization enables organizations to…Extend the practice of anonymization to unstructured formats residing in electronic health and other data formats”
    • Intended purpose: Unstructured medical records
| Name | Freeware | Specific Data Input Format | Skill Needed | Latest Date on Website | Support | More Information |
| --- | --- | --- | --- | --- | --- | --- |
| ATLAS.ti | No | No | * | 2016 | Tech support, online knowledgebase, discussion forums | |
| deid software package | Yes | No | * | 2016 | No explicit support | Research article – “Automated de-identification of free-text medical records” |
| Nvivo | No | No | * | 2016 | Tech support, online discussion forums | |
| PARAT text (Privacy Analytics Lexicon) | No | No | * | 2016 | Tech support, online knowledgebase, discussion forums | |

Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application; ** For users with coding experience
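
To give a feel for what "automated location and removal" of identifiers in free text involves, here is a minimal Python sketch that combines a few regular-expression patterns with a list of known names. It is an illustration only and not the method of any tool listed above: the `PATTERNS` table, the `scrub` function, and the sample note are all hypothetical, and real PHI detection needs far broader patterns and dictionaries plus manual review.

```python
import re

# Hypothetical patterns for illustration; they cover only a few common formats.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def scrub(text, known_names=()):
    """Replace pattern matches and listed names with bracketed category tags."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    for name in known_names:
        # Whole-word, case-insensitive match on each dictionary entry.
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text, flags=re.IGNORECASE)
    return text


# Made-up example note; no real person or contact details.
note = "Pt. Jane Doe seen 03/14/2016; call 410-555-0199 or jdoe@example.org."
print(scrub(note, known_names=["Jane Doe"]))
```

A scrubber like this misses anything outside its patterns and name list (nicknames, misspellings, addresses), which is one reason to evaluate dedicated tools rather than rely on ad hoc scripts.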

Tools for De-identifying Data in Digital Images
  • DICOMCleaner
    • Software description: “DicomCleaner™ is a free open source tool with a user interface for importing, ‘cleaning’ and saving sets of DICOM instances (files)”
    • Intended purpose: Medical Images in DICOM (Digital Imaging and Communications in Medicine) format
| Name | Freeware | Specific Data Input Format | Skill Needed | Latest Date on Website | Support | More Information |
| --- | --- | --- | --- | --- | --- | --- |
| DICOMCleaner | Yes | DICOM format | * | 2016 | No explicit support | Blog Post about DICOMCleaner |
Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application
Tools for De-identifying Microdata, Tabular or Otherwise Structured Data
  • Cornell Anonymization Toolkit (CAT)
    • Software description: “designed for interactively anonymizing published dataset to limit identification disclosure of records under various attacker models”
    • Intended purpose: Medical records – tabular data
  • Open Refine
    • Software description: “a powerful tool for working with messy data: cleaning it; transforming it from one format into another; extending it with web services; and linking it to databases like Freebase”
    • Intended purpose: Working with messy data
  • PARAT Core (Privacy Analytics Eclipse)
    • Software description: “PARAT software masks and de-identifies personal information using a risk-based approach that optimizes the analytic utility of de-identified data sets”
    • Intended purpose: Working with structured medical records
  • mu-Argus 5.1
    • Software description: “μ-ARGUS is a software program designed to create safe micro-data files. Initially developed as a closed-source project but was converted to open source”
    • Intended purpose: Statistical Disclosure Control for microdata
  • tau-Argus 4.1
    • Software description: “τ-ARGUS is a software program designed to protect statistical tables. Initially developed as a closed-source project but was converted to open source”
    • Intended purpose: Statistical Disclosure Control for tabular data
  • The sdcMicro package in R
    • Software description: “This package can be used for the generation of anonymized (micro)data, i.e. for the creation of public- and scientific-use files. In addition, various risk estimation methods are included”
    • Intended purpose: Statistical Disclosure Control for microdata
  • The sdcTable package in R
    • Software description: “Methods for statistical disclosure control in tabular data such as primary and secondary cell suppression are covered in this package”
    • Intended purpose: Statistical Disclosure Control for tabular data
  • The University of Texas at Dallas Anonymization Toolbox
    • Software description: a researcher-compiled implementation (from UT Dallas Data Security and Privacy Lab) of various anonymization methods into a toolbox for public use by researchers
    • Intended purpose: Unstructured text files
  • ARX Data Anonymization Tool
    • Software description: “A comprehensive software for risk- and utility-based privacy-preserving microdata publishing” developed at Technical University of Munich, Germany.
    • Intended purpose: “an open source tool for transforming structured (i.e. tabular) sensitive personal data using selected methods from the broad area of statistical disclosure control.”
| Name | Freeware | Specific Data Input Format | Skill Needed | Latest Date on Website | Support | More Information |
| --- | --- | --- | --- | --- | --- | --- |
| Cornell Anonymization Toolkit (CAT) | Yes | No | * | 2013 | No explicit support | Short Paper “Interactive Anonymization of Sensitive Data” |
| Open Refine | Yes | No | * | 2016 | Users mailing list/forums | |
| PARAT Core (Privacy Analytics Eclipse) | No | Data imported from CSV Files, Access, SQL Server, Oracle | * | 2016 | Tech support, online knowledgebase | |
| mu-Argus 5.1 | Yes | No | * | 2015 | No explicit support (e-mail contacts provided) | User’s Manual |
| tau-Argus 4.1 | Yes | No | * | 2015 | No explicit support (e-mail contacts provided) | User’s Manual |
| The sdcMicro package in R | Yes | No | * | 2016 | Tech support, online knowledgebase, discussion forums | |
| The sdcTable package in R | Yes | No | * | 2016 | Tech support, online knowledgebase, discussion forums | |
| The University of Texas at Dallas Anonymization Toolbox | Yes | No | ** | 2012 | No explicit support (e-mail contacts provided) | |
| ARX Data Anonymization Tool | Yes | No | * | 2016 | No explicit support (e-mail contacts provided) | ARX - A Comprehensive Tool for Anonymizing Biomedical Data |
Skill needed: * For those technically proficient enough not to be frightened off by spending a couple of hours learning a new application; ** For users with coding experience
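
Two ideas recur in the descriptions above: estimating disclosure risk in microdata and primary cell suppression in tables. The Python sketch below illustrates simple versions of both. It is not how any listed tool works internally. The function names, the thresholds (`k=3`, `threshold=3`), the choice of quasi-identifier columns, and the sample values are all assumptions made for illustration.

```python
from collections import Counter


def small_group_records(records, quasi_identifiers, k=3):
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


def suppress_small_cells(table, threshold=3):
    """Primary suppression: replace nonzero counts below the threshold with None.

    Assumption: zero cells are left as-is; some disclosure rules treat them differently.
    """
    return {cell: (None if 0 < n < threshold else n) for cell, n in table.items()}


# Made-up microdata: the last record is unique on (age_band, zip3).
records = [
    {"age_band": "30-39", "zip3": "212", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "212", "diagnosis": "flu"},
    {"age_band": "30-39", "zip3": "212", "diagnosis": "asthma"},
    {"age_band": "80-89", "zip3": "217", "diagnosis": "diabetes"},
]
at_risk = small_group_records(records, ["age_band", "zip3"], k=3)
print(at_risk)

# Made-up frequency table keyed by (zip3, diagnosis).
table = {("212", "asthma"): 2, ("212", "flu"): 14, ("217", "diabetes"): 0}
print(suppress_small_cells(table))
```

Note that primary suppression alone is usually not enough: if row or column totals are published, a suppressed cell can often be recovered by subtraction, which is why the sdcTable description also mentions secondary cell suppression. This sketch does not attempt that step.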

 

 

(This page was last updated in September 2016.)