Blog: Data & Information Reduction Techniques

In our first two blogs on data privacy tools, we covered best practices for who should have access to sensitive data, as well as how to transform data in order to keep the people it represents safe.

With those topics covered, it’s time to tackle how to combine data in order to report on it accurately. We’ll walk through information reduction techniques as well as the use of synthetic data in analysis and presentations.

Information reduction techniques

These are techniques normally used when reporting analysis out to key stakeholders or other organizations. Examples include “bucketing” a set of values into a smaller set of need-specific values (e.g., mapping ages into buckets such as <18, 18-24, >24), or grouping students with a particular characteristic into a common group.
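The age example above can be sketched in a few lines. This is a minimal illustration, not DXtera’s implementation; the bucket labels simply mirror the ranges named in the text.

```python
# A minimal sketch of "bucketing": collapsing exact ages into
# need-specific ranges so the released data is less identifying.
def bucket_age(age: int) -> str:
    if age < 18:
        return "<18"
    elif age <= 24:
        return "18-24"
    return ">24"

ages = [16, 18, 22, 25, 40]
print([bucket_age(a) for a in ages])  # ['<18', '18-24', '18-24', '>24', '>24']
```

The report then carries only the bucket label, never the exact age.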

This technique is used sparingly inside of our tools (almost exclusively to conform to specific reporting needs such as IPEDS), and it usually represents a situation where we accept that de-anonymization can be made difficult, or more inexact, but cannot be eliminated without injecting false data into the data set. This is due to a variety of issues with bucketed data.

These issues include:

  1. Differencing attacks – identifying individuals by comparing overlapping data sets
  2. Minimum threshold issues – where there is an extremely limited number of individuals in the set, often referred to as the “small cell size” problem. The usual solution is either to incorporate those individuals into a larger set or to simply not report on that subgroup (which can still lead to some limited exposure as well).
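The usual small-cell remedy can be sketched as follows. The threshold of 10 is an assumption for illustration; real minimum cell sizes vary by policy and reporting requirement.

```python
# Sketch of minimum-threshold ("small cell size") suppression:
# subgroups below the threshold are merged into a catch-all bucket
# rather than reported individually.
MIN_CELL_SIZE = 10  # assumed threshold; actual values vary by policy

def suppress_small_cells(counts: dict, other_label: str = "Other") -> dict:
    released = {}
    merged = 0
    for group, n in counts.items():
        if n >= MIN_CELL_SIZE:
            released[group] = n
        else:
            merged += n
    if merged:
        released[other_label] = merged
    return released

counts = {"A": 120, "B": 45, "C": 3, "D": 7}
print(suppress_small_cells(counts))  # {'A': 120, 'B': 45, 'Other': 10}
```

Note that the merged bucket itself can still fall below the threshold, in which case it too would need to be suppressed or folded into a larger set.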

Rich noted that information reduction techniques may be necessary even for data sets that some may not consider sensitive, such as zip codes. His example of answering the business need without unnecessary data exposure involved generating a field called “miles to campus” to provide the needed business information without exposing the zip codes of individual students.
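One way to implement a derived field like this is to compute a great-circle distance from a zip-code centroid to the campus and release only that number. The coordinates and the `ZIP_CENTROIDS` lookup below are hypothetical; a real pipeline would join against a zip-code centroid reference table.

```python
from math import radians, sin, cos, asin, sqrt

# Hypothetical campus location and zip-centroid lookup (illustration only).
CAMPUS = (42.3601, -71.0589)
ZIP_CENTROIDS = {"01002": (42.3766, -72.5198)}

def haversine_miles(a, b):
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 3956 * 2 * asin(sqrt(h))  # Earth radius ~3956 miles

def miles_to_campus(zip_code: str) -> float:
    # The released field replaces the raw zip code in the reporting set.
    return round(haversine_miles(ZIP_CENTROIDS[zip_code], CAMPUS), 1)
```

The report consumes `miles_to_campus` and the raw zip code never leaves the source system.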

He also pointed out critical information reduction techniques across the various areas of analytics and walked attendees through a set of standard techniques for performing bucketing operations while minimizing the potential for data leakage.

Synthetic data

This refers to techniques for providing a data set that is realistic enough for particular types of work (process or system validation, etc.) without actually using data from the associated data source or institution. In the case of DXtera, we generate statistics from a given data source and then build synthetic data sets around those statistics.
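In its simplest form, the statistics-first approach looks like the sketch below: compute summary statistics from a source column, then sample a synthetic column from those statistics alone. The GPA figures and the normal-distribution assumption are illustrative, not drawn from any real data set; production pipelines model far richer statistics (correlations, categorical distributions, and so on).

```python
import random
import statistics

# Notional source column (illustration only).
source_gpas = [2.1, 2.8, 3.0, 3.3, 3.6, 3.9, 3.4, 2.5]

# Step 1: reduce the source to summary statistics.
mu = statistics.mean(source_gpas)
sigma = statistics.stdev(source_gpas)

# Step 2: build a synthetic column from those statistics alone,
# clipped to the valid GPA range. Seeded for reproducibility.
rng = random.Random(42)
synthetic_gpas = [min(4.0, max(0.0, rng.gauss(mu, sigma))) for _ in range(100)]
```

The synthetic column preserves the shape of the distribution without reproducing any real student’s record, which is exactly the trade-off the flaws below describe.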

This is in place for our data simulation projects, which provide simulated course and program enrollment data mapped to various institutional personae (4-year public research west coast, etc.) for analysis and presentations. This labor-intensive technique has a few flaws of its own, specifically:

  1. The data may not be accurate enough for computing needs, and
  2. When it is accurate enough to use, it is often real enough to carry some level of sensitivity as well.

Much of the follow-on conversation was focused on the various techniques in play at particular institutions and brainstorming potential solutions for particular data privacy and access issues.

Finally, we talked about DXtera’s future and in-process work. This work is focused on refining our ability to share data with select parties without exposing sensitive information, and it centers on two techniques:

  1. Homomorphic encryption – a set of techniques for performing analysis and computation while keeping the data encrypted
  2. Functional encryption – encrypting data sets with different decryption keys so that users can access or perform only specific tasks on the encrypted data. This technique in particular aligns well with the authorization, policy, and other data management tools that we are currently working on.
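To make the first idea concrete, here is a textbook sketch of an additively homomorphic scheme (Paillier): multiplying two ciphertexts yields a ciphertext of the sum, so a party can total encrypted values without ever decrypting them. The tiny fixed primes are for illustration only and offer no real security; this is not DXtera’s implementation.

```python
from math import gcd
import random

# Textbook Paillier sketch: E(a) * E(b) mod n^2 decrypts to a + b.
p, q = 293, 433                  # toy primes - far too small for real use
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1)          # phi(n); works in place of lcm here
mu = pow(lam, -1, n)             # modular inverse (Python 3.8+)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

a, b = 17, 25
total = (encrypt(a) * encrypt(b)) % n2   # computed entirely on ciphertexts
assert decrypt(total) == a + b
```

A recipient holding only ciphertexts can compute the encrypted sum; only the key holder can read the result.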

Are you ready to learn more about how to apply these data rules to your organization? Join DXtera and become a member of the most advanced ed-tech community that serves institutions globally. 
