At its heart, DXtera shares information among community members. We share best practices, lessons learned, and projects from members across the globe so that we can all benefit from the experience (and mistakes) of our peers. In our January “Tuesday Session” on Data Management, we focused on data sharing and data privacy tools, and on how to better understand who needs to be involved in the data management process.
Part of DXtera’s data management focus is on developing and supporting tools to ensure that data can be provided to the audiences that need it (analysts, EdTech partners, administrators, etc.) in a way that minimizes the potential for illicit access and misuse. This is the tension that lies at the heart of data management and data governance: providing enough access for effective work while limiting the impact of data exposure (we all know how the wrong data in the wrong hands can wreak havoc).
This session included input from Rich Silva, Senior Director of Analytics & Technology Infrastructure at Bay Path University (BPU), who was invaluable in providing insight into managing privacy from an institutional perspective.
Below are some data management best practices that we encourage anyone working with data to adopt at their institution or organization.
All data needs a steward. Data Steward is the role name given to the person who performs certain tasks and takes on responsibilities for a functional area or data domain. To have good data management practices, an institution needs individuals in each data area to take on this responsibility. In some respects, the steward is a facilitator and coordinator for a data area, so that users of the data have an authoritative place to go when they have questions, suggestions, or issues with interpreting it. Without this formal role, users turn to differing resources, often with inconsistent results.
Here are some tips on where to start your data journey:
•Ensure that the data steward at your organization has defined the meaning of the data and the rules around access. Part of responsible data management requires the identification of a specific data steward who will be responsible for signing off on access requests, ensuring the data is accurate, providing the data definitions to be used, and evaluating which parties should have access to specific slices of the data.
•Develop a data collection strategy before you start collecting data. Don’t capture data unless there is a specific need ( is a good place to start). One approach emphasizes capturing all available data, even if there doesn’t appear to be a need for it. While this has utility in some use cases, it is a liability nightmare. A quick way to think about this in a US context: if you capture it, someone else can subpoena it. Always have a plan for the data.
•Develop a data maintenance plan. Don’t store data unless there is a plan to maintain it. Analysts often develop one-time solutions that are nearly impossible to recreate or maintain, but for most business questions the expectation should be that the question will recur on a regular basis.
•Develop a data retention plan. Have a data retention plan before you begin collecting data. This goes hand-in-hand with the data collection strategy above. Identify upfront how long data will be kept before it is purged, and document this as a stated policy. In particular, be aware of the subjective and hazardous nature of comments (about projects and people). Capturing this sort of information may be necessary for your business process, but it probably should not be retained longer than is needed to complete the business process it is associated with.
•Develop a data release plan. Don’t release data without a clear understanding of how the data will be used. Ensure that the release has been approved by all appropriate parties. This has been a problem at multiple institutions, although with growing awareness of data use and abuse it has been largely managed. Nevertheless, it is important to remind folks that all data requests should be documented and should follow the associated administrative and regulatory processes at your institution.
•Document the data definition and sensitivity before collecting data. One of the most common scenarios we run into involves multiple people interpreting some ancient data artifact differently. The purpose of data management and governance practices is to build a common vocabulary and understanding of the business domain and the problem space. Without explicit definitions agreed upfront, data is liable to be a cause of misunderstanding rather than a solution. Establishing these definitions is one of the roles of the data steward.
•Only update data in the system of record. We have seen multiple situations where, due to institutional problems, people recommend changing the data in the warehouse or a data mart. This results in a discrepancy between the warehouse and the system of record. To ensure that one version of the truth is maintained, it is critical that changes be made in the system of record and that those corrections flow through standard processes to update the warehouse.
•Use public identifiers to identify data for data integration purposes. When a dataset is intended to be shared externally or linked across business domains or systems, it is important for the data to use formal, low-sensitivity, public identifiers. This is one of the many reasons why institutions moved from SSNs to institution-specific identifiers.
•Encrypt all data at rest and across the wire. This is part of standard security practice. Any defense-in-depth security plan includes contingencies in case of illicit data access, and encrypting data both at rest and in transit is critical to avoiding unintended data exposure.
•Provide explicit tools for the management and auditing of access. We feel that this applies as much to systems (data warehouse/lake) as it does to datasets. Developing and deploying appropriate access and privacy tools can limit the impact of data breaches and help to identify their sources.
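To make the last tip more concrete, here is a minimal Python sketch of what explicit access management and auditing could look like. It is an illustration only, not a description of any DXtera tool: the `AccessRegistry` class, its method names, and the example users and dataset are all hypothetical. The idea is simply that a steward signs off on each grant and that every access attempt, allowed or denied, leaves a record.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessRegistry:
    """Hypothetical registry: steward-approved grants plus an audit trail."""
    grants: set = field(default_factory=set)        # (user, dataset) pairs
    audit_log: list = field(default_factory=list)   # one dict per event

    def grant(self, steward, user, dataset):
        # Record who approved the grant so it can be reviewed later.
        self.grants.add((user, dataset))
        self._log("grant", user, dataset, approved_by=steward)

    def check(self, user, dataset):
        # Every attempt is logged, whether or not it is allowed.
        allowed = (user, dataset) in self.grants
        self._log("read" if allowed else "denied", user, dataset)
        return allowed

    def _log(self, action, user, dataset, approved_by=None):
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "user": user,
            "dataset": dataset,
            "approved_by": approved_by,
        })


# Example usage with made-up names.
registry = AccessRegistry()
registry.grant(steward="registrar", user="analyst_a", dataset="enrollment")
registry.check("analyst_a", "enrollment")  # allowed and logged
registry.check("vendor_x", "enrollment")   # denied, but still logged
```

In practice the log would live somewhere tamper-resistant rather than in memory, but even this small shape shows why auditing helps after an incident: the trail says who asked for what, when, and who approved it.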
Our current work (and the discussion) focused on three primary techniques: pseudonymization/anonymization, information reduction techniques, and synthetic data.
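As a taste of the first technique, the sketch below shows one common way to pseudonymize an identifier: replacing it with a keyed hash (HMAC-SHA256) from the Python standard library. This is our illustrative assumption of an approach, not necessarily the one used in the tools discussed in the session, and the key, field names, and records are made up. The same input always yields the same pseudonym, so records can still be linked, but the original identifier is not exposed to the recipient.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Return a stable pseudonym for an identifier using HMAC-SHA256."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Hypothetical key and records, for illustration only. A real key would be
# generated randomly and stored apart from the data it protects.
KEY = b"example-key-do-not-use"
records = [
    {"student_id": "S1001", "credits": 12},
    {"student_id": "S1002", "credits": 15},
    {"student_id": "S1001", "credits": 9},
]

# Build the version that is safe to share: the raw identifier is dropped
# and replaced by its pseudonym.
shared = [
    {"pseudo_id": pseudonymize(r["student_id"], KEY), "credits": r["credits"]}
    for r in records
]
```

Note that this is pseudonymization rather than anonymization: anyone holding the key can recompute the mapping from known identifiers, so the key needs the same care as the identifiers themselves.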
Stay tuned in the coming weeks for follow-up blogs that will touch on each of these topics at length. Better yet, register for our upcoming Tuesday Session and join in on the conversation alongside members of our global community.