Data Governance

In today’s world, data is often described as the new oil – a valuable resource capable of driving innovation and fueling business decisions. As a data professional, understanding how to collect, analyze, and report on this data is at the center of your work. However, equally important, and perhaps less discussed, is the critical role of data governance.

Data governance is a broad topic, but at its core, it’s about the policies designed to protect people and the integrity of the data. These policies also help ensure data is clean, accessible, and easy to use. Data governance encompasses standards and techniques that can originate from various sources, including international standards, national and local laws, industry regulations, company contracts, and even personal rules.

For data analysts, data governance is not just an abstract concept handled by others; it’s a fundamental part of their work. While analysts might not implement data governance protocols (that often falls to cybersecurity specialists), they are personally responsible for understanding and following the rules that apply to the data they use. Ignoring data governance can have serious consequences, potentially hurting the individuals whose data is being used and leading to legal repercussions for both the company and the analyst.

Data governance, quality, and control address concepts that span the entire lifecycle of data analytics, including protected data handling, data quality assurance, and master data management. Let’s delve into the essential components of data governance that every data professional needs to grasp.

Protecting Your Most Valuable Asset: Data Security

A cornerstone of data governance is data security. Without data, a data analyst would be out of a job. Data security is essential not only for keeping data available but also for ensuring its integrity. Data integrity refers to how valid, or accurate, the data is. Calculating a new variable doesn’t affect the integrity of the original data, but changing or manipulating data to falsely show trends certainly does. Data security protocols keep unauthorized individuals from accessing data and damaging its integrity, whether through malice or mistake.

Data security serves multiple vital roles, some of which are legally mandated. It helps maintain data integrity, prevents competitors from using proprietary information, and is legally required when working with protected data classifications. Data analysts must understand and adhere to these protocols.

Key types of data security protocols you should be familiar with include:

  • Access requirements—These dictate who can access data and how. Common models include role-based access, where permissions are granted based on job title or role (e.g., only data analysts can access certain data), and user group-based access, where permissions are tied to a department or group (e.g., everyone in the sales department has access).
  • Security requirements—These cover general concepts like data encryption, data transmission security, and de-identification.
    • Data encryption—This uses algorithms to transform data into unreadable ciphertext, rendering it useless to unauthorized parties who lack the correct key or algorithm. Laws regarding encryption vary widely by location.
    • Data transmission—This concerns how data is moved between storage locations. Data is vulnerable while in transit. It’s crucial to use approved, secure connections provided by your company and never take data home without explicit permission.
    • De-identification/masking—This involves hiding or removing protected information from a dataset, often done before sharing data with non-technical users or for certain reporting purposes.
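To make the last two techniques concrete, here is a minimal Python sketch, assuming the pandas and cryptography libraries are available; the dataset and column names are hypothetical. In practice, an encryption key would come from a managed key store, not from code.

```python
# A minimal sketch of de-identification and encryption; column names and data
# are hypothetical, and the key handling below is simplified for illustration.
import hashlib
import pandas as pd
from cryptography.fernet import Fernet

customers = pd.DataFrame({
    "customer_id": [101, 102],
    "email": ["ana@example.com", "raj@example.com"],   # PII
    "purchase_total": [250.00, 99.50],
})

# De-identification/masking: replace the identifier with a one-way hash so the
# column can still be used for joins or counts but no longer reveals the person.
customers["email"] = customers["email"].apply(
    lambda v: hashlib.sha256(v.encode("utf-8")).hexdigest()
)

# Encryption: transform the data into ciphertext that is unreadable without the key.
key = Fernet.generate_key()          # in practice, retrieved from a secure key store
cipher = Fernet(key)
token = cipher.encrypt(customers.to_csv(index=False).encode("utf-8"))

# Only a holder of the key can recover the original bytes.
restored = cipher.decrypt(token).decode("utf-8")
```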

The Legal Framework: Data Use Requirements and Agreements

Data use requirements are often formalized in a Data Use Agreement (DUA). This is a legal document that sets clear boundaries on how data can be handled. As a data analyst, you are unlikely to write a DUA, but you must know the rules set out in your company’s DUA. Deviating from the DUA is illegal and can have severe consequences. It is your responsibility to verify the rules, not just trust what others say.

Key sections typically found in a DUA include:

  • Acceptable use policy—This section explicitly details how the data can and, crucially, how it should not be used.
  • Data processing—This covers how data is treated during interim steps, like transfers, focusing on protecting the security and rights of the individuals the data is about. It can also include requirements for training, assigning a data protection officer, or performing a data protection impact assessment. The European Union’s General Data Protection Regulation (GDPR) is widely regarded as a gold standard for how data should be treated. This section also details how to handle the critical issue of data breaches.
  • Data breaches—A data breach occurs when data security is compromised and unauthorized individuals gain access to data. If you suspect a data breach, the steps to follow are:
  1. Report the breach
  2. Secure operations
  3. Fix vulnerabilities
  4. Notify the impacted parties

The most important actions are reporting the breach and notifying those affected. Remaining silent is the worst possible response.

  • Data deletion—This section specifies when and how data will be destroyed. Best practice dictates that data should be deleted once it has served its intended purpose and is no longer required.
  • Data retention—This covers how long data will be kept, how it will be stored, and associated security protocols. It may even include a specific date after which the data must be deleted.
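As a rough illustration of how a retention rule might be enforced, the following Python sketch filters out records older than an assumed two-year window; the actual retention period and deletion procedure would come from your DUA.

```python
# A hedged sketch of a retention check; the retention window and dataset are
# assumptions, and secure destruction follows the DUA's data deletion section.
import pandas as pd

RETENTION_DAYS = 365 * 2                      # assumed two-year retention window

records = pd.DataFrame({
    "record_id": [1, 2],
    "collected_on": pd.to_datetime(["2021-03-01", "2025-01-15"]),
})

cutoff = pd.Timestamp.today() - pd.Timedelta(days=RETENTION_DAYS)
expired = records["collected_on"] < cutoff

# Keep only records still inside the retention window; expired rows would be
# securely destroyed according to the agreement.
records = records[~expired]
```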

Identifying and Handling Sensitive Information: Data Classifications

Understanding data classifications is paramount because different types of data require different levels of security and handling. Data analysts must know these classifications, keep such data secure, and never include it in reports or other shared outputs. Federal and local laws often dictate how this data must be treated.

The primary data classifications mentioned are:

  • Personally identifiable information (PII)—This is any data that can, even theoretically, be used to track down or identify a specific person. Examples include: name, physical address, email address, IP address, Social Security number, phone number, license number, passport number, login ID, social media ID, social media posts, date of birth, digital images, geolocation, biometric data, behavioral data. It’s acceptable to have PII in your dataset if needed, but it must be kept secure and never reported.
  • Personal health information (PHI)—This is health-related data, such as medical records, diagnoses, and treatment details. Like PII, it must be kept secure and is subject to strict regulations, such as HIPAA in the United States.
  • Payment card industry (PCI) data—This is data associated with payment cards, such as card numbers and cardholder details, governed by the PCI Data Security Standard (PCI DSS).
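Because protected values must never end up in shared output, a simple automated spot check can help catch them before a report goes out. The sketch below is illustrative only: the regular expressions, column names, and dataset are assumptions, and real PII detection usually requires far more robust tooling.

```python
# A hedged sketch of a pre-report spot check for PII-like values.
import re
import pandas as pd

# Patterns that look like a US Social Security number or an email address.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def find_possible_pii(df: pd.DataFrame) -> dict:
    """Return {column: number of values matching a PII-like pattern}."""
    hits = {}
    for col in df.select_dtypes(include="object").columns:
        values = df[col].dropna().astype(str)
        matches = values.str.contains(SSN_PATTERN) | values.str.contains(EMAIL_PATTERN)
        if matches.any():
            hits[col] = int(matches.sum())
    return hits

# A report draft that accidentally still carries a contact email.
report_df = pd.DataFrame({"region": ["West", "East"],
                          "contact": ["ana@example.com", "n/a"]})
print(find_possible_pii(report_df))   # {'contact': 1} -> scrub before sharing
```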

Structure and Relationships: Entity Relationship Requirements

Entity relationship requirements often take the form of rules about how different data objects, or entities (like tables), can relate to one another. These relationships are visualized in entity relationship diagrams.

Key concepts related to entity relationships include:

  • Data constraints—These are rules designed to protect data integrity by specifying what types, conditions, formats, or entry timing are allowed for data within a database. They ensure that only the highest quality data is entered from the start. An example is a filter that only allows a specific kind of data into a dataset. Data constraints are also referred to as data attribute limitations.
  • Cardinality—This defines the number of relationships each row in a table can have with rows in another table. The majority of relationship restrictions revolve around cardinality. A many-to-many relationship (e.g., employees to sales, where an employee makes multiple sales and a sale involves multiple employees) is an example of cardinality.
  • Record link restrictions—These are rules restricting how records can be linked.
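These three ideas are easiest to see in a small schema. The following sketch uses Python’s built-in sqlite3 module with a hypothetical employees-and-sales schema: constraints reject invalid values at entry, a junction table implements the many-to-many cardinality described above, and foreign keys act as record link restrictions.

```python
# A minimal sqlite3 sketch; the schema is invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")      # enforce record link restrictions

conn.execute("""
    CREATE TABLE employees (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL                  -- constraint: value required
    )""")
conn.execute("""
    CREATE TABLE sales (
        sale_id INTEGER PRIMARY KEY,
        amount  REAL NOT NULL CHECK (amount > 0)   -- constraint: only valid amounts
    )""")
conn.execute("""
    CREATE TABLE employee_sales (                  -- cardinality: many-to-many
        employee_id INTEGER REFERENCES employees(employee_id),
        sale_id     INTEGER REFERENCES sales(sale_id),
        PRIMARY KEY (employee_id, sale_id)
    )""")

conn.execute("INSERT INTO employees VALUES (1, 'Ana')")
conn.execute("INSERT INTO sales VALUES (10, 250.0)")
conn.execute("INSERT INTO employee_sales VALUES (1, 10)")

# Each of these would violate a rule and raise sqlite3.IntegrityError:
#   conn.execute("INSERT INTO sales VALUES (11, -5.0)")        -- fails CHECK
#   conn.execute("INSERT INTO employee_sales VALUES (99, 10)") -- unknown employee
```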

Beyond Policies: Data Quality and Management

Quality control is the process of testing data to ensure its integrity. It is essential because inaccurate data inevitably leads to inaccurate, and potentially misleading, results. Data integrity itself encompasses consistency, accuracy, and completeness, and in some fields, even knowing who entered data, when, and how.

Quality control checks should ideally be performed at various stages: during data acquisition, data manipulation, before analysis, and on the final product (like a report). If in doubt, it’s always better to double-check your data.

Methods for validating data quality include:

  • Cross-validation—A statistical technique to check if analysis results can be generalized.
  • Sample/spot check—Examining a small portion of the data.
  • Reasonable expectations—Checking if the data makes sense given what is known about the subject.
  • Data profiling—A formal process for an entire database, checking structure, content, and relationships.
  • Data audits—Reviews of data processes. Automated checks can also be used, such as checking the number of data points in a variable to identify missing values.
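As an illustration of the automated checks mentioned above, the following Python sketch (assuming pandas and a hypothetical sales dataset) counts missing values, duplicate keys, and values that fail reasonable expectations before analysis begins.

```python
# A hedged sketch of automated quality-control checks run before analysis;
# the dataset, column names, and expected value ranges are assumptions.
import pandas as pd

sales = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [120.0, None, 87.5, -10.0],
    "region":   ["West", "East", "East", "North"],
})

checks = {
    # Completeness: count missing data points per variable.
    "missing_values": sales.isna().sum().to_dict(),
    # Uniqueness: duplicate keys suggest a consolidation or entry problem.
    "duplicate_order_ids": int(sales["order_id"].duplicated().sum()),
    # Reasonable expectations: a negative sale amount does not make sense here.
    "negative_amounts": int((sales["amount"] < 0).sum()),
    # Validity: values outside the known set of regions.
    "unknown_regions": int((~sales["region"].isin({"North", "South", "East", "West"})).sum()),
}
print(checks)
```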

An important concept related to data quality is Master Data Management (MDM). MDM involves creating and managing a centralized data system, often referred to as a golden record or a single source of truth. This golden record contains data that is clean, standardized, consolidated, and up-to-date.

Benefits of MDM include higher data quality and integrity, cleaner data, faster and easier data access (potentially from a single table, avoiding complex joins), and the ability to automate compliance checks. While setting up and maintaining MDM can be work-intensive and expensive, it can be particularly useful for companies dealing heavily with protected data or data subject to numerous regulations, or when consolidating data from multiple sources (like during a merger).

The processes involved in MDM typically include:

  • Consolidation—Combining data from multiple source systems into the golden record. Updates to the golden record automatically update the original sources.
  • Standardization—Making data uniform in terms of field names, units, formats, and entry regulations to ensure consistency across the dataset.
  • Data dictionary—Creating a document that defines and describes every variable, including attributes, structure, relationships, and organization. Data dictionaries are crucial for understanding and correctly using data, whether or not MDM is implemented.
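Here is a simplified pandas sketch of these three processes working together; the source systems, field names, and units are invented for illustration, and a production MDM platform would handle this at far greater scale.

```python
# A simplified sketch of MDM-style standardization, consolidation, and a data
# dictionary; all names and values below are hypothetical.
import pandas as pd

# Two source systems describing the same customers with different conventions.
crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ana Ortiz", "Raj Patel"],
                    "revenue_usd": [1200.0, 800.0]})
billing = pd.DataFrame({"customer": [1, 3], "name": ["ANA ORTIZ", "Li Wei"],
                        "revenue_k_usd": [1.2, 0.4]})

# Standardization: uniform field names, units, and formats.
crm = crm.rename(columns={"cust_id": "customer_id", "full_name": "name"})
billing = billing.rename(columns={"customer": "customer_id"})
billing["revenue_usd"] = billing.pop("revenue_k_usd") * 1000
for df in (crm, billing):
    df["name"] = df["name"].str.title()

# Consolidation: one de-duplicated golden record per customer.
golden = (pd.concat([crm, billing])
            .drop_duplicates(subset="customer_id", keep="first")
            .sort_values("customer_id"))

# Data dictionary: define and describe every variable in the golden record.
data_dictionary = {
    "customer_id": "Integer surrogate key, unique per customer",
    "name":        "Customer full name, title case",
    "revenue_usd": "Lifetime revenue in US dollars (float)",
}
```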

Conclusion

Data governance is far more than just a set of bureaucratic rules; it is the framework that enables reliable, ethical, and legal data analysis. By understanding concepts like data security, use agreements, data classifications, and quality management practices like MDM, data professionals can ensure the integrity of their work, protect sensitive information, and build trust in their insights.

As the field of data science continues to grow, professionals who demonstrate a strong understanding of data governance will be well-positioned for success. By making data governance a priority, you’re not just protecting data; you’re empowering better, more trustworthy decision-making.