The Hub Designs Blog welcomes the final installment of this great series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 5: The Profiling Payoff
This is the final part of a five-part series describing how data profiling benefits both IT projects and business operations. In Part One, we discussed profiling perspectives. In Parts Two, Three, and Four, we introduced the value of system-, entity-, and attribute-level metrics. This part discusses the archival and beneficial uses of profile results.
If you have defined your corporate data profiling strategy along the lines of the methods discussed in the preceding parts of this series, you will have amassed a robust collection of metadata spanning the relevant systems across your business. Although those systems may differ in type and location, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically. Ideally, this information will exist in an open-source database repository, with reports made available across the enterprise. System- and entity-level information helps planners and developers organize information strategies. Attribute-level domains, constraints, and business rules help data architects understand existing systems. Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.
If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information. To keep your repository relevant to business needs, design collection rules to be configurable. This allows you to easily ignore superfluous information or enable tests only at certain critical times. Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline. As you learn about the information, you will see which tests or which data objects add no value. We geeky DBA-types who understand system-level catalogs have our own scripts to do much of what was described in Parts Two, Three, and Four. Those less inclined may prefer to use a third-party tool for profiling. Either way works, as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
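To make the idea of a configurable, timestamped repository concrete, here is a minimal sketch using SQLite. All table and column names here are illustrative assumptions, not taken from the series; the point is the `enabled` flag for switching tests off, the `collected_at` timestamp, and the `is_baseline` indicator for the initial profiling run.

```python
import sqlite3

# Illustrative schema only: names are assumptions, not from the series.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profile_rule (
    rule_id     INTEGER PRIMARY KEY,
    system_name TEXT NOT NULL,
    entity_name TEXT NOT NULL,
    metric_name TEXT NOT NULL,              -- e.g. null_count, distinct_count
    enabled     INTEGER NOT NULL DEFAULT 1  -- toggle superfluous tests off
);
CREATE TABLE profile_result (
    rule_id      INTEGER REFERENCES profile_rule(rule_id),
    collected_at TEXT NOT NULL,             -- timestamp of the profiling run
    metric_value REAL,
    is_baseline  INTEGER NOT NULL DEFAULT 0 -- flag the initial baseline run
);
""")

# Register a rule, then select only the rules that still earn their keep.
conn.execute("INSERT INTO profile_rule (system_name, entity_name, metric_name) "
             "VALUES ('CRM', 'customer', 'null_count')")
active = conn.execute(
    "SELECT rule_id, metric_name FROM profile_rule WHERE enabled = 1").fetchall()
print(active)  # [(1, 'null_count')]
```

Whether you build this in SQLite, a full RDBMS, or a third-party tool's own store matters less than keeping the rules data-driven, so that disabling a low-value test is an update, not a code change.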
You will find that collecting and maintaining this level of detail has a definite cost. Even if collection is automated, interrogating large data sets places an overhead on production systems that may not be acceptable. Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities. Realize that the extent of data profiling is contingent on the project phase, the specific data elements, and most of all, business value. Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.
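Recording execution metrics can be as simple as wrapping each check in a timer and logging the elapsed time. The sketch below is a hypothetical example (the check name and data are made up); the logged entries are what you would later scan for bottlenecks.

```python
import time

def timed_profile(name, profile_fn, log):
    """Run one profiling check and record its elapsed time so that
    slow checks can be identified, tuned, or dropped later."""
    start = time.perf_counter()
    result = profile_fn()
    log.append({"check": name, "seconds": time.perf_counter() - start})
    return result

# Hypothetical check: count NULL-like values in a small column extract.
rows = ["a", None, "b", "", "c"]
log = []
nulls = timed_profile("customer.email null_count",
                      lambda: sum(1 for r in rows if not r), log)
print(nulls)  # 2

# Later, find the slowest check recorded in this run.
slowest = max(log, key=lambda entry: entry["seconds"])
```

In practice the same wrapper would surround queries against production tables, and the log would be written to the repository alongside the profile results themselves.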
How much profile history to maintain is another consideration. Even though disk is "relatively" cheap, maintaining every historical entry in a live repository may not be necessary. Weigh the business need for, and value of, historical profile information. Consider archiving at a summarized (or less frequent) level and keeping only a limited time window of statistics online.
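One hedged way to read that advice in code: keep detailed rows inside a retention window and collapse anything older into monthly averages. The function below is an illustrative sketch, not a prescribed design; the record layout and 90-day window are assumptions.

```python
from datetime import date, timedelta

def archive_history(history, keep_days=90, today=None):
    """Keep detailed rows inside the retention window; collapse older
    rows into one monthly average per metric (a summarized archive).

    Each history row is assumed to be {"metric": str, "day": date,
    "value": number} -- an illustrative layout, not a standard one."""
    today = today or date.today()
    cutoff = today - timedelta(days=keep_days)
    recent = [h for h in history if h["day"] >= cutoff]
    buckets = {}
    for h in history:
        if h["day"] < cutoff:
            key = (h["metric"], h["day"].strftime("%Y-%m"))
            buckets.setdefault(key, []).append(h["value"])
    summary = [{"metric": m, "month": mo, "avg": sum(vals) / len(vals)}
               for (m, mo), vals in sorted(buckets.items())]
    return recent, summary

history = [
    {"metric": "null_count", "day": date(2011, 5, 1),  "value": 4},
    {"metric": "null_count", "day": date(2011, 1, 10), "value": 2},
    {"metric": "null_count", "day": date(2011, 1, 20), "value": 4},
]
recent, summary = archive_history(history, keep_days=90,
                                  today=date(2011, 6, 1))
# recent keeps the May row; the two January rows collapse to one avg of 3.0
```

The same idea works inside the database with a scheduled summarization job; the window length and summary grain should come from the business value discussion, not from what the code makes convenient.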
This discussion of data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way. The blog format is not conducive to in-depth discussion, but hopefully the topics covered here spur some thoughts on how you can add value to your business by implementing some of these concepts. Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value for somebody, it might not be worth the overhead of continuing to record it.
Go back to Part 4.