Providing Useful Anonymized Data for Health Research: Subsalt Finds a Blended Solution
Ever since data went online, health care organizations and others have been struggling to provide useful data for advanced analytics while guarding Protected Health Information (PHI). Data masking, data aggregation, synthetic data, and differential privacy are among the solutions, but each presents difficulties and limitations.
Each solution, for instance, requires expert intervention to determine how much data transformation is required to achieve de-identification. If you’re trying to protect data for a large population, such as people with high blood pressure, you can be fairly loose about anonymization, whereas if you’re tracking a rare condition, you have to suppress much more data to prevent individuals from being re-identified. Experts apply statistical tests to determine the right amount of protection.
Each solution also limits the questions researchers can ask. You might prepare a data set for researchers to track high blood pressure, or even to correlate high blood pressure with congestive heart failure—but if they think up some other correlation you haven’t anticipated, they may need a separate data set, with new privacy transformations, to study it precisely.
I recently talked to David Singletary, co-founder of Subsalt, about their innovative approach to providing anonymized synthetic data on demand in response to a data consumer’s specific needs. Subsalt uses advanced generative AI models in an automated manner to offer multiple views into the same data, depending on the task at hand.
Why synthetic data is most secure
Synthetic data is produced by generative models that create data meant to preserve the important underlying patterns that are useful to data consumers without disclosing anything about real, underlying people or events. For instance, if 30% of the patients at your hospital are older than 65 and 10% are of Latin background, the synthetic data can reproduce those ratios.
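As a deliberately simplified illustration of this idea (real systems such as Subsalt use far more sophisticated generative models that also capture correlations between fields), a generator can be built to reproduce the marginal ratios from the example above. The percentages and field names here are taken from the paragraph, not from any real data set:

```python
import random

random.seed(0)

def sample_synthetic_patient():
    """Draw one synthetic patient matching the marginal ratios
    observed in the real cohort: 30% over 65, 10% of Latin background."""
    return {
        "over_65": random.random() < 0.30,
        "latin_background": random.random() < 0.10,
    }

# A large synthetic cohort reproduces the original ratios closely,
# while no individual row corresponds to a real person.
cohort = [sample_synthetic_patient() for _ in range(100_000)]
pct_over_65 = sum(p["over_65"] for p in cohort) / len(cohort)
pct_latin = sum(p["latin_background"] for p in cohort) / len(cohort)
print(f"over 65: {pct_over_65:.1%}, Latin background: {pct_latin:.1%}")
```

The hard part, of course, is doing this for dozens of interrelated fields at once without memorizing any real record; that is what the generative models are for.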
Currently, synthetic data is popular for software testing, so that test staff don’t need access to sensitive real data. But Subsalt believes that synthetic data can be made more useful for advanced analytics such as AI/ML, research, and business intelligence as well, particularly when access to the source data is constrained by privacy and security concerns.
Of course, synthetic data can’t map real data perfectly while still being private. If the synthetic data is too close to the real patient data, it could reveal sensitive information about an individual.
But synthetic data can preserve a lot of complex correlations, and does better at balancing the goals of useful data and patient privacy than the alternatives.
Some data sets provide only aggregated data: A data set might show, for instance, that 20% of the residents in a location have tested positive for COVID-19. If you also know the ages of residents (within a 10-year span, for instance), you can draw useful conclusions about the relative risk of different age groups using aggregated data. But you still don’t have the detail you can get with synthetic data, which preserves full-featured, row-level granularity.
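The kind of analysis aggregated data does support can be shown in a few lines. The counts below are made up for illustration; only the structure (positivity counts per 10-year age band) comes from the paragraph above:

```python
# Aggregated COVID-19 test counts by 10-year age band (illustrative numbers).
aggregated = {
    "20-29": {"tested": 1000, "positive": 150},
    "60-69": {"tested": 1000, "positive": 300},
}

def positivity(band):
    """Positivity rate for one age band, computed from aggregate counts."""
    group = aggregated[band]
    return group["positive"] / group["tested"]

# Relative risk between age groups is recoverable from aggregates alone...
relative_risk = positivity("60-69") / positivity("20-29")
print(f"relative risk, 60-69 vs. 20-29: {relative_risk:.1f}")  # 2.0
```

What aggregates cannot answer is any question that needs row-level structure, such as how positivity interacts with a comorbidity that wasn’t part of the original grouping.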
Another classic way to provide data is with truncation. Instead of a five-digit ZIP code, patient information might be provided with just the initial three digits. Similarly, you could provide a street name without a house number. But research has shown that truncation alone is often not sufficient to protect privacy, and its adequacy must be assessed on a case-by-case basis.
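Truncation is mechanically trivial, which is part of its appeal. A minimal sketch of the two transformations described above (the record fields and values here are invented for illustration):

```python
def truncate_record(record):
    """Generalize quasi-identifiers: keep only the three-digit ZIP
    prefix and drop the house number from the street address."""
    truncated = dict(record)
    truncated["zip"] = record["zip"][:3] + "XX"
    # Drop the leading house number, keeping the street name.
    truncated["street"] = " ".join(record["street"].split()[1:])
    return truncated

patient = {"zip": "02139", "street": "42 Vassar Street", "diagnosis": "hypertension"}
print(truncate_record(patient))
# {'zip': '021XX', 'street': 'Vassar Street', 'diagnosis': 'hypertension'}
```

The weakness is exactly what the research shows: in a sparsely populated three-digit ZIP area, this record combined with a rare diagnosis may still identify one person, so the same rule cannot be applied blindly everywhere.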
Differential privacy is probably the form of anonymization most comparable to synthetic data in its usefulness. Differential privacy protects data by answering queries with strategically garbled results. Differential privacy-based solutions require a high degree of expertise to make these systems compliant with regulatory frameworks such as HIPAA, and data quality can vary dramatically based on how the system is tuned.
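To make “strategically garbled results” concrete, here is a toy version of the standard Laplace mechanism for a counting query. The epsilon value and the count are arbitrary choices for illustration; tuning epsilon is precisely the expertise the paragraph refers to:

```python
import random

random.seed(7)

def laplace_noise(scale):
    """Laplace(0, scale) noise, drawn as the difference of two
    exponential variables (a standard construction)."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_count(true_count, epsilon=0.5):
    """Answer a counting query under epsilon-differential privacy.
    A count changes by at most 1 when one person is added or removed,
    so Laplace noise with scale 1/epsilon suffices."""
    return true_count + laplace_noise(1 / epsilon)

# The answer is close to the true count of 200, but garbled enough
# that no single patient's presence can be inferred from it.
print(round(noisy_count(200)))
```

Smaller epsilon means stronger privacy but noisier answers; that trade-off is why data quality “can vary dramatically based on how the system is tuned.”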
Subsalt’s generative database
Subsalt offers a unique data service that combines common AI techniques with synthetic data and relational database administration. Their goal is to offer not just one data set, but a wide smorgasbord of data sets for different use cases. Some human intervention is required to configure the system, but automation can handle a lot of the work afterward.
Subsalt’s offerings start by pre-computing what data consumers are likely to need. When Subsalt understands which fields in the data interest a data consumer, and the type of analysis they intend to perform, Subsalt runs AI models that generate synthetic data appropriate for those fields and that analysis, while preserving privacy to the requisite legal or policy bar. Generating a novel result set can take a few hours, while data that has been previously requested is accessible in seconds.
Subsalt contracts with deidentification experts (i.e., expert determiners) to programmatically determine that privacy is adequately preserved: specifically, that the risk of re-identification is “very low” per HIPAA guidelines. The experts attach a report to that effect to the synthetic data set, explaining the methods used to generate the data and confirming that it is de-identified under HIPAA and can therefore be treated as non-Protected Health Information (non-PHI).
As a result of achieving the de-identified designation, many time-intensive and complex data governance processes can be avoided entirely, such as signing a business associate agreement (BAA) with vendors or reviewing cybersecurity audits of data recipients. Time to access data can be reduced by 95% or more, lowering costs and speeding time to value.
Note that Subsalt does not have to store any data. Once they have the trained generative models, the system can generate synthetic data on the fly when a data consumer enters a SQL query. The original data is stored securely at the health care entity that collected it and is never duplicated into Subsalt’s system.
Subsalt returns synthetic results to the data consumer, who is granted access by the data owner. This ensures that data consumers receive representative data for their use case and data owners can provide this low-risk data without expensive storage costs or complex data management.
In classic query interfaces, without adequate access control, someone could generate a lot of slightly varied synthetic data from the same source table and combine results to re-identify an individual and recreate sensitive data about the individual. Standalone synthetic generators share this weakness.
Subsalt ensures privacy by imposing limits on the number of queries any data consumer can enter. Subsalt also returns the same synthetic records for the same query, so attackers cannot gain new information through repeated queries.
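Subsalt hasn’t published the internals of these defenses, but the two ideas in this paragraph, a per-consumer query budget and deterministic results for repeated queries, can be sketched in a few lines. Everything here (the budget size, the field names, seeding the generator with a hash of the query text) is a hypothetical illustration, not Subsalt’s actual mechanism:

```python
import hashlib
import random

QUERY_BUDGET = 100  # hypothetical cap on distinct queries per consumer
queries_seen = set()

def synthetic_rows(sql_query, n=3):
    """Return synthetic rows for a query, enforcing a query budget and
    determinism: the generator is seeded with a hash of the query text,
    so the same query always yields the same synthetic records."""
    queries_seen.add(sql_query)
    if len(queries_seen) > QUERY_BUDGET:
        raise PermissionError("query budget exhausted for this consumer")
    seed = int.from_bytes(hashlib.sha256(sql_query.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [{"age": rng.randint(18, 90), "systolic_bp": rng.randint(95, 180)}
            for _ in range(n)]

q = "SELECT age, systolic_bp FROM patients"
# Repeating a query yields identical rows, so an attacker gains no new
# information by asking again.
assert synthetic_rows(q) == synthetic_rows(q)
```

Determinism closes off the averaging attack described above: varied re-queries no longer produce fresh draws that can be combined to triangulate a real individual.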
Users can connect to Subsalt from the tools they’re already using: Any tool that can connect to a relational database can connect to Subsalt without any integration work. A video demo shows how Subsalt looks to a data consumer.
The landscape for privacy continues to change as new methods of protecting data are invented, along with more powerful attacks on privacy. We need clever approaches like Subsalt’s to derive the benefits that health data can give us without compromising its privacy and security.