Data Profiling For All The Right Reasons, Part 1

The Hub Designs Blog welcomes another guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 1: The Psychology of Data Profiling

Swiss psychologist Carl Gustav Jung founded the school of analytical psychology. His theory of psychological types forms the basis of the Myers-Briggs Type Indicator, an assessment often used to gauge career aptitude in today’s high school students, and his word association tests explored personality through how an individual’s thoughts associated with various trigger words. By analyzing responses, he could understand how individuals viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought that enters their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?

If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.

If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.

If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.
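For readers who have not lived through such a session, two of those checks can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names, the field names (order_id, customer_id), and the sample rows are all hypothetical, and the scale of an implied decimal is assumed to come from the source system's record layout.

```python
from decimal import Decimal


def find_orphans(child_rows, parent_keys, fk_field):
    """Return child rows whose foreign key has no matching parent key."""
    known = set(parent_keys)
    return [row for row in child_rows if row[fk_field] not in known]


def apply_implied_decimal(raw, scale):
    """Convert a digit string with an implied decimal point to a Decimal.

    scale is the number of digits assumed to sit right of the decimal point.
    """
    return Decimal(int(raw)).scaleb(-scale)


# Hypothetical sample data: one order points at a customer that does not exist.
customer_ids = [101, 102]
orders = [
    {"order_id": 1, "customer_id": 101},
    {"order_id": 2, "customer_id": 999},
]

orphans = find_orphans(orders, customer_ids, "customer_id")
amount = apply_implied_decimal("0012345", 2)
```

Here orphans holds only the order that references customer 999, and amount is 123.45 rather than the 12345 a naive load would have stored.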

Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive-compulsive data architects compelled to organize the world in strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, and the existence of NULL or default values, and in understanding the meaning of each data element well enough to accurately identify business rules and to recognize when two or more objects are related or redundant.

Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.
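The paranoid coder's defensive checks might look something like the sketch below. It is only an illustration under stated assumptions: the function names, the "UNKNOWN" substitute value, the sample rows, and the use of a plain list as the audit trail are all hypothetical stand-ins for whatever a real project would use.

```python
from collections import Counter


def find_duplicate_keys(rows, key_fields):
    """Return each key value that appears more than once, with its count."""
    counts = Counter(tuple(row[f] for f in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def substitute_missing(rows, field, default, audit_log):
    """Replace missing values with a default and record each substitution."""
    cleaned = []
    for i, row in enumerate(rows):
        if row.get(field) in (None, ""):
            audit_log.append(f"row {i}: {field} missing, substituted {default!r}")
            row = {**row, field: default}  # copy so the input row is untouched
        cleaned.append(row)
    return cleaned


# Hypothetical sample data: a repeated id and two missing state values.
rows = [
    {"id": 1, "state": "FL"},
    {"id": 1, "state": ""},
    {"id": 2, "state": None},
]

audit_log = []
duplicates = find_duplicate_keys(rows, ["id"])
cleaned = substitute_missing(rows, "state", "UNKNOWN", audit_log)
```

The audit log is what lets the DBA, rather than the coder, take the early morning call: every substitution is recorded instead of silently applied.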

It is truly amazing how much we can conclude from the response to one simple phrase.

As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.

Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.

This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling: in particular, what information is important to collect at the system and table level, and how the enterprise can leverage that information to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.

Continue with Part 2.


9 Comments on “Data Profiling For All The Right Reasons, Part 1”

  1. Jim Harris 01/11/2010 at 12:50 pm #

    Excellent post Rob,

    I really enjoyed the Jungian word association analysis for “data profiling.”

    Jung was fond of the concept of collective unconsciousness, and data profiling can be viewed as a gateway to understanding the collective business mindset surrounding the enterprise’s most critical data assets.

    I am looking forward to reading the rest of the series.

    Best Regards,


  2. Dylan Jones 01/11/2010 at 3:25 pm #

    Nice post Rob.

    One line jumps out at me: “…how to share our findings across business and project resources…” – that is critical and for me goes way beyond the standard profiling stats we see.

    The ability to transform raw stats into a business context is critical. For me where this gets exciting is when we merge this with other metadata, external metrics and other reference material, using a repository of relationships to weave a compelling story.

    Really looking forward to the series, great intro.

    – Dylan

  3. dataintegrity 02/26/2010 at 2:13 pm #

    Good post

