Blog: Anonymous data sets and what that means for your org 

We’ve already covered best practices for identifying and managing data at your organization. Now it’s time to dive a little deeper and focus on keeping that data (and the people it represents) safe. We’re all familiar with what happens when data is breached: the consequences can be damaging for the organization that allowed it to reach people who were never meant to see it.

There’s a better way to organize and manage data so that those it represents can have faith that it won’t be seen by people who shouldn’t see it. Here’s what you need to know. 

Pseudonymization/Anonymization

This covers a collection of techniques for obscuring identifying elements within a data set. In higher education, this usually means replacing key elements (SSN, name, address, other identifiers) in one of two ways. The first is a one-way hash, used when the shape or format of the data matters enough that we cannot simply remove the element. The second is some form of master lookup or pseudo-random function, which gives us a consistent alias that we can use to refer back to the source data set when required.
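To make the two approaches concrete, here is a minimal Python sketch. It is an illustration only, not a description of any production system: the names `SECRET_KEY`, `keyed_hash_alias`, and `AliasLookup` are hypothetical, and the choice of HMAC-SHA256 for the one-way hash is our assumption rather than a required algorithm.

```python
import hashlib
import hmac
import secrets

# Hypothetical secret key (an assumption for this sketch). In practice it
# would be generated randomly and stored apart from any shared data.
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def keyed_hash_alias(identifier, key=SECRET_KEY):
    """One-way alias: the same identifier and key always give the same alias."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


class AliasLookup:
    """Master lookup: random aliases that can be mapped back to the source."""

    def __init__(self):
        self._alias_by_id = {}
        self._id_by_alias = {}

    def alias_for(self, identifier):
        """Return the existing alias, or create a new random one."""
        if identifier not in self._alias_by_id:
            alias = secrets.token_hex(8)
            while alias in self._id_by_alias:  # regenerate on a rare collision
                alias = secrets.token_hex(8)
            self._alias_by_id[identifier] = alias
            self._id_by_alias[alias] = identifier
        return self._alias_by_id[identifier]

    def resolve(self, alias):
        """Map an alias back to the original identifier (data owner only)."""
        return self._id_by_alias[alias]


lookup = AliasLookup()
alias = lookup.alias_for("student-0001")
assert lookup.resolve(alias) == "student-0001"
```

The trade-off is that the keyed hash needs no stored table but cannot be reversed, while the lookup table can be reversed but must itself be protected as carefully as the original identifiers.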

This is the technique we use most often, and it is in place across all of the data warehouse and data hub solutions. It provides a clean set of identifiers that does not leak internal or institutional identifiers unless that is desired. Our “characteristics model” is also based on pseudonymization and allows institutions to share common data and reports (including demographic reporting) without leaking individual student information.

As with all techniques, there are some issues:

• The use of naive hashes can lead to re-identification. The risk can be reduced by ‘salting’ the hash with an additional input known only to the data owner, which makes any re-identification attempt harder.

• Some hash and pseudo-random techniques are vulnerable to dictionary attacks, in which an attacker hashes every plausible value (for example, every possible SSN) and compares the results against the data set.

• Because these techniques focus only on key elements, it may be possible to infer identities by analyzing the other elements in the data set, or by linking data sets together. One example we have seen is mapping targeted event information (such as events for women engineers or first-generation students) into the data set, which allowed associated demographic information to be inferred.

• Even if the data set has no exposed identifiers, it may be sensitive in its own right.
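The first two issues are easy to see in a small Python sketch. Everything here is a simplified assumption for illustration: the four-digit IDs stand in for a real identifier space such as SSNs, and the function names and the key are hypothetical.

```python
import hashlib
import hmac


def naive_hash(value):
    """Unsalted hash: anyone can recompute it for any candidate value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_hash(value, key):
    """Hash mixed with a secret input that only the data owner holds."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def dictionary_attack(target_hash, candidates, hash_fn):
    """Hash every candidate and return the one that matches, if any."""
    for candidate in candidates:
        if hash_fn(candidate) == target_hash:
            return candidate
    return None


# A small identifier space: every four-digit ID from 0000 to 9999.
candidates = [f"{n:04d}" for n in range(10_000)]

# An unsalted hash of ID 4821 is recovered by simply trying every candidate.
leaked_naive = naive_hash("4821")
recovered = dictionary_attack(leaked_naive, candidates, naive_hash)  # "4821"

# With a secret key, the same attack fails for anyone who lacks the key.
secret_key = b"known-only-to-the-data-owner"
leaked_keyed = keyed_hash("4821", secret_key)
not_recovered = dictionary_attack(leaked_keyed, candidates, naive_hash)  # None
```

Note that the protection depends entirely on the extra input staying secret. If it is published along with the data, an attacker can rebuild the dictionary just as easily as before.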

Rich talked about his use of pseudonymization at BPU in the context of working with other EdTech vendors. He also went into detail about the effectiveness of obscuring data at multiple levels: blocking data elements (such as SSN) entirely for some audiences, while telling others only whether the data exists, without giving access to any values. He reiterated one of our key assumptions: identify who needs to know what, and when. He also showed how an institution can go further internally by requiring an explicit statement of the business need for any request.
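One way to picture these levels is as a per-audience policy applied before any record leaves the source system. The Python sketch below is purely illustrative and is not a description of the BPU implementation: the audiences, field names, `POLICY` table, and `view_for` function are all hypothetical.

```python
from enum import Enum


class Access(Enum):
    BLOCKED = "blocked"          # the field is removed entirely
    EXISTS_ONLY = "exists_only"  # audience learns only whether a value exists
    FULL = "full"                # audience sees the value


# Hypothetical policy: who needs to know what.
POLICY = {
    "vendor": {"ssn": Access.BLOCKED, "name": Access.BLOCKED,
               "major": Access.FULL},
    "advisor": {"ssn": Access.EXISTS_ONLY, "name": Access.FULL,
                "major": Access.FULL},
}


def view_for(record, audience, business_need):
    """Return the view of a record that a given audience is allowed to see."""
    if not business_need.strip():
        raise ValueError("an explicit business need is required")
    rules = POLICY[audience]
    view = {}
    for field, value in record.items():
        access = rules.get(field, Access.BLOCKED)  # unlisted fields are denied
        if access is Access.FULL:
            view[field] = value
        elif access is Access.EXISTS_ONLY:
            view[field + "_present"] = value is not None
    return view


record = {"ssn": "123-45-6789", "name": "Ada", "major": "Physics"}
vendor_view = view_for(record, "vendor", "course placement pilot")
# {'major': 'Physics'}
advisor_view = view_for(record, "advisor", "financial aid follow-up")
# {'ssn_present': True, 'name': 'Ada', 'major': 'Physics'}
```

Defaulting unlisted fields to blocked means a newly added column stays hidden until someone decides which audiences need it.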

Now that we know how to transfer data to those who need it to do their jobs, and how to keep key identifiers safe, we’ll next tackle information reduction techniques that help key stakeholders manage data sets.

Want to learn more about how DXtera is working with institutions around the world to manage data? Join us for our upcoming Tuesday Session to collaborate with community members in real time.
