We already talked about best practices around identifying and managing data at your organization, but now it’s time to dive a little deeper and focus on keeping that data (and the people it represents) safe. We’re all familiar with what happens when data is breached – the consequences can be damaging and long-lasting for the organization that allowed it to fall into the hands of those who were never intended to see it.
There’s a better way to organize and manage data so that those it represents can have faith that it won’t be seen by people who shouldn’t see it. Here’s what you need to know.
Pseudonymization/Anonymization
This covers a collection of techniques for anonymizing elements within the data set. In the case of higher education, this usually involves replacing key elements (SSN, name, address, other identifiers) with either a one-way hash (when the shape/format of the data is important enough to not simply remove the element entirely) or with some form of master lookup or pseudo-random function to allow us to provide a consistent alias that we can use to refer back to the source dataset when required.
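As a concrete sketch of the idea (not DXtera's actual implementation — the salt value and function names here are illustrative assumptions), a keyed one-way hash can turn an identifier like an SSN into a consistent alias: the same input always yields the same alias, so records can still be joined, but the original value cannot be recovered without the secret key.

```python
import hmac
import hashlib

# Hypothetical secret salt; in practice this would live in a key vault,
# never in source code.
SECRET_SALT = b"institution-specific-secret"

def pseudonymize(identifier: str) -> str:
    """Return a consistent, one-way alias for an identifier (e.g. an SSN).

    HMAC-SHA256 with a secret key maps the same input to the same alias,
    so the alias can refer back to the source record, while the original
    value stays hidden from anyone without the key.
    """
    digest = hmac.new(SECRET_SALT, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability

# The same input always produces the same alias:
assert pseudonymize("123-45-6789") == pseudonymize("123-45-6789")
# Different inputs produce different aliases:
assert pseudonymize("123-45-6789") != pseudonymize("987-65-4321")
```

When the shape of the original field matters (for example, an ID that must stay nine digits), the hash output can be mapped into that format instead, or a master lookup table can assign pseudo-random aliases of the required shape.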
This is the most common technique we use, and it is in place across all of our data warehouse and data hub solutions, providing a clean set of identifiers that does not leak internal or institutional identifiers unless so desired. In addition, our “characteristics model” is also based on pseudonymization and allows institutions to share common data and reports (including demographic reporting) without leaking individual student information.
As with all techniques, there are some issues:
•The use of naive hashes can lead to re-identification. This risk can be reduced by ‘salting’ the hash with an additional secret input, which increases the complexity of any re-identification attempt.
•Some hash and pseudo-random techniques are vulnerable to dictionary attacks.
•Because these techniques focus only on key elements, it may be possible to infer identities by analyzing the other elements in the data set, or by linking data sets together. One example we have seen of this is mapping targeted event information (such as events for women engineers, or for first-generation students) into the data set, resulting in the inference of associated demographic information.
•Even if the dataset has no exposed identifiers, it may be sensitive in its own right.
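The first two issues above are easy to see in a toy example (the field, values, and key below are invented for illustration): an unsalted hash of a low-entropy field can be reversed simply by hashing every possible input, while a keyed hash defeats that precomputed dictionary.

```python
import hashlib
import hmac

# A naive, unsalted hash of a low-entropy field (here a hypothetical
# 4-digit campus PIN) is vulnerable to a dictionary attack.
def naive_hash(value: str) -> str:
    return hashlib.sha256(value.encode()).hexdigest()

target = naive_hash("4821")  # the "protected" value as stored

# Attacker precomputes hashes for the entire input space (only 10,000 PINs):
dictionary = {naive_hash(f"{n:04d}"): f"{n:04d}" for n in range(10000)}
recovered = dictionary.get(target)
assert recovered == "4821"  # re-identified almost instantly

# A keyed (salted) hash, where the attacker does not hold the key,
# makes the precomputed dictionary useless:
SECRET = b"assumed-secret-key"
def keyed_hash(value: str) -> str:
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()

assert keyed_hash("4821") not in dictionary
```

Note that salting protects against precomputed tables, but the last two issues — inference from other elements and inherently sensitive data — are not solved by hashing at all; they require limiting what is in the data set in the first place.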
Rich talked about his use of pseudonymization at BPU in the context of working with other EdTech vendors. He also went into detail about the effectiveness of using multiple levels of data obscuration, including blocking data elements (such as SSN) entirely for some audiences while notifying others that the data exists without providing access to any values. He reiterated one of our key assumptions: identifying who needs to know what, and when. He showed how an institution can go further internally, and talked about requiring an explicit statement of the business need for any request.
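A minimal sketch of that tiered approach might look like the following — the audiences, field names, and policy table are hypothetical, and a real deployment would drive these from an access-control system rather than a hard-coded dictionary:

```python
from enum import Enum

class Policy(Enum):
    SHOW = "show"      # value is visible to this audience
    EXISTS = "exists"  # only presence/absence is revealed, not the value
    BLOCK = "block"    # field is removed from the record entirely

# Hypothetical per-audience policies for illustration only.
POLICIES = {
    "registrar": {"ssn": Policy.SHOW,   "name": Policy.SHOW},
    "advisor":   {"ssn": Policy.EXISTS, "name": Policy.SHOW},
    "analyst":   {"ssn": Policy.BLOCK,  "name": Policy.BLOCK},
}

def mask_record(record: dict, audience: str) -> dict:
    """Apply the audience's field-level policy to a record."""
    out = {}
    for field, value in record.items():
        policy = POLICIES[audience].get(field, Policy.BLOCK)
        if policy is Policy.SHOW:
            out[field] = value
        elif policy is Policy.EXISTS:
            out[field] = "<present>" if value is not None else "<absent>"
        # Policy.BLOCK: the field is omitted entirely
    return out

student = {"ssn": "123-45-6789", "name": "A. Student"}
assert mask_record(student, "advisor") == {"ssn": "<present>", "name": "A. Student"}
assert mask_record(student, "analyst") == {}
```

The point of the EXISTS tier is exactly what Rich described: some audiences legitimately need to know whether a value is on file (for example, to prompt a student to supply it) without ever seeing the value itself.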
Now that we know how to transfer data to those who need it to do their jobs, as well as how to keep key identifiers safe, we’ll next tackle information reduction techniques that help key stakeholders manage data sets.
Want to learn more about how DXtera is working with institutions around the world to manage data? Join us for our upcoming Tuesday Session to collaborate with community members in real-time.