The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 3: Attribute-Level Analyses
This is part three of a five-part series on data profiling.
In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy. The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect. In this section, we introduce attributes, the lowest-level components of a profiling effort.
An attribute is simply an individual data element. Alone, an attribute has no context. The simple descriptor “Cost” tells us very little about an attribute’s true purpose and immediately raises the need for additional information, such as units (hours, dollars, euros…), type (weighted, unit, gross…), and use (invoice, sum, average…). Attributes must therefore be analyzed within the context of their business purpose to have meaning.
Some characteristics require business knowledge to define; others can be determined by interrogating existing values and the underlying rules of the storage medium. It takes both analyses to get a complete picture of the information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, physical profiling and business context yield only assumptions.
Analyses of values, domains, and constraints allow insight into the use (or abuse) of an attribute. The larger the sample size, the greater your confidence in the results. Without explicit proof of business rule enforcement, though, you must remember that just because a value does not presently exist does not mean it cannot exist. Business rules are defined by business experts and enforced through database constraints, data types and precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not fully understand it. Profiling actual values provides additional refinement: the percentage of NULL values, the percentage of distinct values; minimum, maximum, and average values; the top x and bottom x recurring values along with their counts; and minimum, maximum, and average data lengths.
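The value statistics listed above are straightforward to compute from a column sample. As a minimal sketch (the function name and sample data are invented for illustration, not from any particular tool):

```python
# Hypothetical sketch: value-level profiling of a single attribute sample.
from collections import Counter

def profile_attribute(values, top_n=3):
    """Summarize a column sample: NULL %, distinct %, min/max, top values, lengths."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    lengths = [len(str(v)) for v in non_null]
    return {
        "null_pct": 100.0 * (total - len(non_null)) / total,
        "distinct_pct": 100.0 * len(counts) / total,
        "min": min(non_null),
        "max": max(non_null),
        "top_values": counts.most_common(top_n),   # top x values with counts
        "min_len": min(lengths),
        "max_len": max(lengths),
        "avg_len": sum(lengths) / len(lengths),
    }

# Illustrative sample only; in practice you would pull this from the source system.
sample = ["A", "B", "A", None, "C", "A", "B"]
stats = profile_attribute(sample)
print(stats["null_pct"], stats["top_values"])
```

Note that these numbers describe only the sample: as the paragraph above warns, a value absent from the sample may still be legal in the domain.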
Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define the uniqueness of the data set may also support relationships between entities. Uniqueness can be further classified as belonging either to a system-enforced primary key or to a business key (outside of the defined primary key). System-enforced primary keys are relatively easy to identify within a database system by interrogating the system catalog. Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
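One brute-force way to hunt for business keys is to test which attribute combinations are unique across a sample. A minimal sketch (the function, column names, and rows below are invented for illustration):

```python
# Hypothetical sketch: finding candidate keys by testing attribute combinations.
from itertools import combinations

def candidate_keys(rows, columns, max_width=2):
    """Return column combinations whose value tuples are unique across all rows."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):      # no duplicate tuples => candidate key
                found.append(combo)
    return found

# Illustrative sample rows only.
rows = [
    {"sku": "A1", "store": 10, "price": 5.0},
    {"sku": "A1", "store": 20, "price": 5.0},
    {"sku": "B2", "store": 10, "price": 7.5},
]
print(candidate_keys(rows, ["sku", "store", "price"]))
```

On this tiny sample, (store, price) also happens to be unique, which illustrates the earlier caution: a small sample can suggest spurious keys, so candidates still need validation against business rules and larger volumes.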
Attribute-level information of interest includes: data type (size and precision), the number and percent of NULL values, column descriptions, the number and percent of distinct values, and the minimum, maximum, and average values and lengths. Querying the system catalog provides some of this information, but the rest must be collected by sampling the data.
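The split between catalog metadata and sampled statistics can be seen even in a lightweight database. A sketch using SQLite, whose `PRAGMA table_info` exposes declared types, NOT NULL constraints, and primary-key membership (the table and columns are invented for illustration):

```python
# Hypothetical sketch: catalog metadata vs. sampled statistics in SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        email       TEXT
    )
""")

# Catalog side: declared type, NOT NULL flag, and primary-key position per column.
for cid, name, ctype, notnull, default, pk in conn.execute(
        "PRAGMA table_info(customer)"):
    print(f"{name}: type={ctype}, not_null={bool(notnull)}, pk={bool(pk)}")

# Sampling side: value statistics the catalog cannot provide.
row = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT email) FROM customer"
).fetchone()
print("rows, non-null emails, distinct emails:", row)
```

The catalog tells you what *cannot* occur (a NULL name, a duplicate customer_id); only sampling tells you what actually does occur.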
Other types of attributes that may help in determining relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may allow you either to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.
Part 4 expands on attribute profiling with the introduction of relationships and patterns.