Posts from the ‘Strategy’ Category

22
Sep

Org. Change and Data Governance

I read a great article recently by Steve Sarsfield on his blog “Data Governance and Data Quality Insider” about Change Management and Data Governance, and it got me thinking about the critical role that organizational change management plays in any well-founded data governance program.

For almost ten years, with a few years off during the “Dot Com” era, I implemented Oracle’s CRM and ERP products. One of the things I came to appreciate during that time was the huge difference that including organizational change management makes between a successful implementation and a “less than successful” one.

That’s why I include emphasizing the organizational change management aspects as one of the “Ten Best Practices in Master Data Management and Data Governance” when I speak at conferences like the Oracle Applications Users Group COLLABORATE 10 or Oracle OpenWorld.

That’s because big transformational programs like MDM and data governance are not that different from CRM and ERP. Any time you want the organization to embrace new processes and new technology, and more importantly to modify its DNA (that is, its culture), you’ve got to embrace “org. change”.

I’ve got a friend who is a professor in this stuff at Southern New Hampshire University, with a distinctive name – Dr. Burt Reynolds. I first met him on a 12 month ERP project at a $1 billion software company, where he helped define the org. change strategy. I studied what he did very carefully, and I’ve tried to weave it into every project I’ve done since then.

One of the biggest elements is the communications strategy. First, learn about your audience. How do they like to learn about things? Do they like e-mail newsletters, internal web sites, one-on-one meetings with their supervisors, town hall meetings with company leaders, lunch and learn sessions with project team leadership, small training sessions, or something else?

Second, think about your message. Some things lend themselves to certain media better than others. Short, snappy messages are probably better suited for town hall meetings. Technical material is better handled in hands-on training sessions. Anything involving changes to individual positions is best suited for individual meetings with supervisors.

What you’ll wind up with is a grid of messages on the left and media across the top. Then you add in the time element (when to deliver these messages), and you’ll have your internal communications campaign.

Steve mentions in his article the ADKAR model for organizational change developed by Prosci: Awareness, Desire, Knowledge, Ability, and Reinforcement.

What this will produce is a well-coordinated internal communications strategy that, when you deliver it, will result in every stakeholder and business constituent being aware of your data governance program, why it’s necessary, and how it links to the overall business strategy of the company.

As for desire to participate in the change, you want to reach as many people as possible, recruit some to be champions of the program, educate others so they’re at least neutral towards it, and keep the number of active opponents as small as possible.

Your communications plan must include a healthy amount of knowledge transfer, because data governance, although not solely a technology driven activity, includes enough technology that the people actively involved in it need to be completely comfortable with it.

You’ll also be raising the bar for the ability and skill of many of the individuals in the company, as well as redesigning some of the processes for entering, updating and consuming master data. Be prepared for the amount of time this is going to take, as well as the force of the political pushback you’ll encounter. People and organizations have a lot of inertia and tend to resist change at first. That’s why reinforcement is so important, by repeating important messages several times and weaving them into different media.

Steve’s article was great, and brought back to me the importance of introducing organizational change management into MDM and data governance programs. It can literally make the difference between success and failure. Please let us know – here in the comments or in the forums on the MDM Community – what you think of applying org. change to MDM and data governance.

30
Aug

Our MDM Strategy Offerings

Recently, I put together an overview of Hub Designs’ MDM strategy offerings for a potential client. Here’s a recap.

Education

  • Based on our popular “Best Practices in MDM and Data Governance” speaking engagements, presented at Oracle OpenWorld and the Oracle Applications Users Group COLLABORATE conference.
  • Our workshops get business & IT professionals up to speed quickly
  • You get access to the best MDM experts, and can bring your business people into the process early

Roadmap

  • Based on Hub Designs’ MDM framework
  • Defines where you are now, where you want to be, and over what time period
  • Looks at master data management, data integration, data quality, and data governance over time

Readiness Assessment

  • Looks at issues relating to politics & culture
  • Performs skills assessment on people who may need training
  • Examines process issues, outlining where business processes need improvement or redesign
  • Investigates technology issues, detailing where essential components are not present or not able to support your upcoming MDM initiative
  • Performs data profiling to discover data quality issues

Business Case

  • Captures business requirements
  • Identifies stakeholders and selects metrics
  • Baselines current performance
  • Negotiates expected benefits
  • Converts to financial results
  • Develops total cost of ownership
  • Calculates hard-dollar ROI

Software Selection

  • Develops selection criteria
  • Creates a weighted vendor scoring model
  • Includes functionality, technology, viability, costs, services and vision
  • Develops demo scripts for vendors to follow and sample data sets to give them
  • Manages proof of concept (POC) process
  • Assists in evaluating POC performance and scoring vendors
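To make the weighted vendor scoring model a little more concrete, here is a minimal Python sketch. It is only an illustration, not Hub Designs’ actual model: the six criteria come from the list above, but the weights, the 0 to 10 rating scale, the vendor names, and the ratings are all made-up assumptions.

```python
# Hypothetical weights for the six criteria named above. The numbers are
# invented for illustration; they are normalized by their sum when scoring.
WEIGHTS = {
    "functionality": 0.30,
    "technology": 0.20,
    "viability": 0.15,
    "costs": 0.15,
    "services": 0.10,
    "vision": 0.10,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Return the weighted average of a vendor's ratings across all criteria."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


def rank_vendors(vendor_ratings, weights=WEIGHTS):
    """Sort vendors from highest to lowest weighted score."""
    scored = [
        (name, weighted_score(ratings, weights))
        for name, ratings in vendor_ratings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up ratings (0 to 10) for two fictional vendors.
vendors = {
    "Vendor A": {"functionality": 8, "technology": 7, "viability": 9,
                 "costs": 5, "services": 6, "vision": 7},
    "Vendor B": {"functionality": 6, "technology": 8, "viability": 7,
                 "costs": 9, "services": 7, "vision": 6},
}

for name, score in rank_vendors(vendors):
    print(f"{name}: {score:.2f}")
```

Agreeing on the weights before the demos and the POC is the point of the exercise: it keeps the scoring tied to the selection criteria rather than to whichever demo was most recent.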

These engagements range in length from one to twelve months, with teams varying from two to ten people, depending on the size of the company, the number of domains of master data involved, and the complexity of the politics and legacy systems in the enterprise.

If you’re interested in discussing an MDM strategy engagement like this, please contact Hub Designs at http://www.hubdesigns.com/contact_us.html. Or if you have comments on the above approaches, please let us know by commenting here.

30
Jul

Data Profiling For All The Right Reasons, Part 5

The Hub Designs Blog welcomes the final installment of this great series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 5: The Profiling Payoff

This is the final part of a five-part series, describing how data profiling benefits both IT projects and business operations.  In Part One, we discussed profiling perspectives.  In Parts Two, Three and Four, we introduced the value of system, entity, and attribute-level metrics.  This part discusses the archival and beneficial uses of profile results.

If you have defined your corporate data profiling strategy similar to the methods discussed in the preceding parts of this series, you’ll have amassed a robust collection of metadata spanning relevant systems across your business.  Although systems may be of different types and locations, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically. Ideally, this information will exist in an open-source database repository with reports made available across the enterprise. System and Entity information help planners and developers organize information strategies. Attribute-level domains, constraints, and business rules help data architects understand existing systems. Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.

If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information. To keep your repository relevant to business needs, design collection rules to be configurable. This allows you to easily ignore superfluous information or enable tests only at certain critical times. Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline. As you learn about the information, you will see which tests or which data objects add no value. We geeky DBA types who understand system-level catalogs have our own scripts to do much of what was described in Parts Two, Three and Four. Those less inclined may prefer to use a third-party tool for profiling. Either way works as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
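As a rough illustration of configurable collection rules plus timestamps and a baseline indicator, here is a minimal Python sketch. Every class, field, and metric name in it is a hypothetical choice for this example, and the `collect` callback stands in for whatever script or tool actually gathers the metric.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CollectionRule:
    """One configurable profiling test (all names are hypothetical)."""
    metric: str            # e.g. "row_count" or "null_percent"
    entity: str            # table, file, or spreadsheet tab being profiled
    enabled: bool = True   # switch off tests that turn out to add no value


@dataclass
class ProfileResult:
    """One stored measurement, stamped with when it was collected."""
    metric: str
    entity: str
    value: float
    is_baseline: bool = False
    collected_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def run_rules(rules, collect, baseline=False):
    """Run every enabled rule; `collect` is a caller-supplied function."""
    return [
        ProfileResult(rule.metric, rule.entity, collect(rule), is_baseline=baseline)
        for rule in rules
        if rule.enabled
    ]


rules = [
    CollectionRule("row_count", "customer"),
    CollectionRule("null_percent", "customer_notes", enabled=False),
]
# The lambda is a placeholder collector that returns a fixed number.
results = run_rules(rules, collect=lambda rule: 42.0, baseline=True)
print(len(results), results[0].metric, results[0].is_baseline)
```

Keeping the enabled flag in data rather than in code means a test can be retired, or switched on only at critical times, without touching the scripts that do the collecting.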

You will find that collecting and maintaining this level of detail has a definite cost. Even if the collection is automated, interrogating large data sets places an overhead on production systems that may not be practical. Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities. Realize that the extent of data profiling is contingent on the project phase, specific data elements, and most of all, business value. Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.

How much profile history to maintain is another consideration. Even though disk is “relatively” cheap, maintaining all historical entries in a live repository may not be necessary. Consider the business need for, and value of, historical profile information. Consider archiving at a summarized (or less frequent) level and keeping only a limited time window of statistics online.

This discussion on data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way. The blog format is not conducive to in-depth discussions, but hopefully the topics covered here spur some thoughts into how you can add value to your business by implementing some of these concepts.  Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value to somebody, it might not be worth the overhead to continue recording it.

29
Jul

Data Profiling For All The Right Reasons, Part 4

The Hub Designs Blog welcomes Part 4 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 4: Profiling Relationships and Patterns

This is part four of a five-part series describing how data profiling assists in all aspects of system development, from design through deployment.

Part One introduced different perspectives on data profiling. Part Two identified valuable system and entity metrics to track. Part Three discussed attributes. In this segment, we dive deeper into attribute relationships and pattern recognition. We also expand on the primary key identification discussion and look at hidden relationships.

Pattern grouping provides a mask of distinct format patterns within an attribute data set and a count of the number of occurrences. Patterns give insight into the type of values found in an attribute. For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.

Observing distinct patterns gives insight into the maximum digits and precision, and also domains such as integer or real. Pattern analysis of a database date or date-time type yields unremarkably similar patterns for all dates. Because the database management system typically enforces the domain, pattern analysis of these dates provides no value and can be ignored. If dates are stored in character format, however, patterns quickly show variations in date formatting. Character patterns are only significant for a limited number of positions. It makes no sense to pattern a description field of 200 or 2000 characters. Smaller code attributes of fewer than 10 characters, though, do provide value. Ignore pattern profiling for character strings over 20 characters at first, then refine to shorter character strings if the results do not add value.
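Pattern grouping is easy to sketch in a few lines of Python. This is an illustration under stated assumptions rather than the author’s own script: digits are masked as 9 (as in the numeric examples above), letters are masked as A (a convention assumed here), and the function names and sample values are made up.

```python
from collections import Counter


def value_pattern(value):
    """Mask digits as 9 and letters as A; keep every other character as-is."""
    masked = []
    for ch in str(value):
        if ch.isdigit():
            masked.append("9")
        elif ch.isalpha():
            masked.append("A")
        else:
            masked.append(ch)
    return "".join(masked)


def pattern_counts(values, max_length=20):
    """Count distinct patterns, skipping NULLs and strings over max_length."""
    counts = Counter()
    for value in values:
        if value is None or len(str(value)) > max_length:
            continue
        counts[value_pattern(value)] += 1
    return counts


# Dates stored as character strings, in made-up inconsistent formats.
dates_as_text = ["2010-07-29", "07/29/2010", "2010-07-30", "29-JUL-10", None]
for pattern, count in pattern_counts(dates_as_text).most_common():
    print(pattern, count)
```

On this sample the three competing date formats show up as three separate patterns, which is exactly the kind of variation that profiling character-format dates is meant to expose. The `max_length` default of 20 follows the starting point suggested above and can be tightened later.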

In pure database theory, referential integrity (RI) is your friend. In practice, designers and software vendors often forgo RI to improve system performance on data inserts. These designers place the data quality burden on the application and do not endorse external data manipulation outside the application interfaces. In the real world, though, data corruption occurs and without RI or routine data quality checks, corruptions may not be found for a long time or not at all. Personally, I have identified over $50,000 of recent orphaned sales in a retail client resulting from deliberately disabled RI. These unreported sales were not added to the ledger and were allowed to occur for performance reasons until I found them through simple profiling. Enforcement of RI is a topic for another discussion but is mentioned here because it does identify a valid reason for data profiling.

Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons. First, interrogate the RI listed in the system catalogs to identify all enforced relationships. Reverse-engineering a system with a good modeling tool is probably the best way to do this. A harder and more valuable analysis is to identify unenforced relationships and determine the probability of the relationship if not all values are an exact match. Do this by counting all the candidate child attribute values that exist within a known parent attribute table. If all match and there is a non-trivial number of matches, there is a good probability of a non-identified relationship. A small number of mismatches could identify data quality issues.
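The counting step can be sketched in Python as follows. The function names, the thresholds (at least 100 matches, 95% for a near match), and the sample data are assumptions chosen for illustration; the article does not prescribe specific cut-offs.

```python
def relationship_match(child_values, parent_values):
    """Return (matched, checked, ratio) for the non-NULL child values."""
    parent_set = set(parent_values)
    checked = [v for v in child_values if v is not None]
    matched = sum(1 for v in checked if v in parent_set)
    ratio = matched / len(checked) if checked else 0.0
    return matched, len(checked), ratio


def classify(matched, checked, ratio, min_matches=100, near_match=0.95):
    """Interpret the counts; both thresholds are illustrative assumptions."""
    if matched < min_matches:
        return "too few matches to judge"
    if ratio == 1.0:
        return "probable unenforced relationship"
    if ratio >= near_match:
        return "probable relationship with data quality issues"
    return "unlikely relationship"


# Made-up data: 500 parent keys, 2000 child rows that all match,
# plus one orphaned value and one NULL.
parent_ids = range(1, 501)
child_fk = [i % 500 + 1 for i in range(2000)] + [9999, None]

matched, checked, ratio = relationship_match(child_fk, parent_ids)
print(matched, checked, round(ratio, 4), classify(matched, checked, ratio))
```

In this sample the single orphaned value keeps the match ratio just under 100%, so the pair is flagged as a likely relationship with a data quality problem worth chasing down. On real systems the same counts would normally come from a query against the candidate child and parent tables.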

In Part 5, we tie all the techniques discussed in the first four parts together to show the value of a repeatable data profiling process.

28
Jul

Data Profiling For All The Right Reasons, Part 3

The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 3: Attribute-Level Analyses

This is part three of a five-part series on data profiling.

In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy.  The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect.  In this section, we introduce attributes, the lowest-level components of a profiling effort.

An attribute is simply an individual data element. Alone, an attribute has no context. The simple descriptor “Cost” tells us very little about the attribute’s true purpose and immediately drives a need for additional information, such as units (hours, Dollars, Euros…), type (weighted, unit, gross…), and use (invoice, sum, average…). Attributes therefore must be analyzed within the context of their business purpose to have meaning.

Some characteristics require business knowledge to define and others can be determined through interrogation of existing values and underlying rules of the storage medium. It takes both analyses to get a complete picture of information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, only assumptions can result from physical profiling or business context.

Analyses of values, domains, and constraints allow insight into use (or abuse) of an attribute. The larger the sample size, the more confidence you gain in the results. Without explicit proof of business rule enforcement, though, you must assume that just because a value does not presently exist does not mean it cannot exist. Business rules are defined by business experts and enforced through database constraints, data type/precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not totally understand it. Profiling of actual values provides additional refinement in terms of percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
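Here is a minimal Python sketch of those value-level metrics for a single attribute. It is an illustration only: the function name and dictionary keys are made up, `None` stands in for NULL, the values are assumed to share one comparable type, and the distinct percentage is taken over non-NULL values (one of several reasonable definitions).

```python
from collections import Counter


def profile_attribute(values, top_n=3):
    """Summarize the sampled values of one attribute; None stands in for NULL."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    ranked = counts.most_common()  # most frequent first
    lengths = [len(str(v)) for v in non_null]
    all_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    return {
        "null_percent": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "distinct_percent": 100.0 * len(counts) / len(non_null) if non_null else 0.0,
        "min_value": min(non_null) if non_null else None,
        "max_value": max(non_null) if non_null else None,
        "avg_value": sum(non_null) / len(non_null) if all_numeric else None,
        "top_values": ranked[:top_n],
        "bottom_values": ranked[-top_n:],
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
    }


# A made-up sample of a "cost" attribute with two NULLs.
costs = [10, 20, 20, None, 30, 20, None, 10]
for metric, value in profile_attribute(costs).items():
    print(metric, value)
```

Remember the caveat above: these numbers describe the values that happen to be in the sample, not the rules that govern them.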

Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define uniqueness of the data set also may support relationships between entities.  Uniqueness can be further classified as being either members of a system-enforced primary key or of a business key (outside of the defined primary key).  System-enforced primary keys are relatively easy to define within a database system through interrogation of the system catalog.  Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
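One way to hunt for candidate business keys is a brute-force search over small attribute combinations, sketched below in Python. This is an assumption-laden illustration, not a method taken from the series: the function name, the `max_width` limit, and the sample rows are invented, NULLs are treated as ordinary values, and uniqueness in a sample only suggests a key; it does not prove one.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_width=2):
    """Return minimal column combinations whose values are unique across rows."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip combinations that merely extend a smaller unique key.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found


# Made-up sample rows: "id" is the system key, while store plus register
# happens to be unique as well.
rows = [
    {"id": 1, "store": "A", "register": 1, "sale_date": "2010-07-01"},
    {"id": 2, "store": "A", "register": 2, "sale_date": "2010-07-01"},
    {"id": 3, "store": "B", "register": 1, "sale_date": "2010-07-01"},
    {"id": 4, "store": "B", "register": 2, "sale_date": "2010-07-01"},
]
print(candidate_keys(rows, ["id", "store", "register", "sale_date"]))
```

The number of combinations grows quickly with `max_width`, so on wide tables it pays to start with a short list of likely attributes suggested by the business experts.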

Attribute-level information of interest includes: data type (size and precision), the number and percent of NULL values, column descriptions, number and percent of distinct values, and the minimum-maximum-average values and lengths. The system catalog provides some of this information, but the rest must be collected by sampling the data.

Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may either allow you to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.

Part 4 expands on attribute profiling with the introduction of relationships and patterns.

27
Jul

Data Profiling For All The Right Reasons, Part 2

The Hub Designs Blog welcomes Part 2 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 2: Profiling the Basics

This discussion is the second of a five-part series on data profiling. In Part 1, we discussed the project roles that benefit from data profiling and how better understanding information results in more reliable information systems. Important goals of any profiling strategy include automation of metric collection and socializing results to support the differing objectives of a data-centric project.

Early in a system development life cycle, profiling helps define sources, data storage requirements, and data transformations. As a system goes into production (or if profiling is added to an existing system for quality control purposes), routine profiling is useful to audit system quality and business rule enforcement. The frequency of collection and amount of effort you expend to automate your profiling methods should be based on the ability of the organization to benefit from the profile results.

This section discusses the beginnings of a profiling effort. Information assembled here forms the foundation of other profiling activities. For this discussion, consider a Profile Group as a set of information sharing a common purpose and data management methods. Examples of profile groups include tables within a single database schema or a group of spreadsheets with the same format but each spreadsheet representing a different time slice of data.

The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site being accessed through web services. The reason we abstract information into Systems is to group the information into distinct governance methods common to the underlying information. Relevant metadata and governance methods we track at the system-level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems. System-level metadata common to a profile group helps us understand and troubleshoot future analyses. This level of information also provides developers with an understanding of inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.

Entities within a profile group belong to the same system, may have a common unique identifier, and, for database entities, have the same schema owner. Typically, entities are database tables, but may also be similar files or spreadsheet tabs containing like attribute lists. For entities, we track characteristics common to all the attributes they contain. These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive, existence of change data management attributes such as insert/update timestamps, and existence of audit traceability indicators such as insert/update username.
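For database entities, several of these characteristics can be read straight from the system catalog. The Python sketch below uses SQLite (through the standard `sqlite3` module) purely because it is self-contained; the audit column names, table names, and output keys are hypothetical, and other database systems expose their catalogs differently.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical column names that would signal change tracking or auditing.
TIMESTAMP_COLUMNS = {"inserted_at", "updated_at"}
USERNAME_COLUMNS = {"inserted_by", "updated_by"}


def profile_entities(conn):
    """Collect basic entity-level metrics for every table in a SQLite database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    profiles = []
    for table in tables:
        # Names come from the catalog itself; quote them as identifiers.
        quoted = '"' + table.replace('"', '""') + '"'
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = {r[1].lower() for r in conn.execute(f"PRAGMA table_info({quoted})")}
        profiles.append({
            "entity": table,
            "row_count": row_count,
            "has_change_timestamps": bool(columns & TIMESTAMP_COLUMNS),
            "has_audit_usernames": bool(columns & USERNAME_COLUMNS),
            "last_analyzed": datetime.now(timezone.utc),
        })
    return profiles


# A tiny in-memory database with made-up tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
conn.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (name, updated_at) VALUES (?, ?)",
    [("Acme", "2010-07-27"), ("Globex", "2010-07-27")],
)
for profile in profile_entities(conn):
    print(profile["entity"], profile["row_count"], profile["has_change_timestamps"])
```

Growth characteristics fall out of running this on a schedule and comparing successive row counts, which is one reason to store the last analyzed date with every result.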

The combination of system and entity level profiling supplies the foundation for the attribute-level profiling, which is where physical information in a system resides. It also provides valuable metadata to classify information and allows for future correlation of like information across systems. Assembly and publication of entity and system level information benefits the various consumers of the information by providing a centralized “master” source of contact and context information.

In Part 3, we will dive into the attribute level analyses around data profiling.

26
Jul

Data Profiling For All The Right Reasons, Part 1

The Hub Designs Blog welcomes a guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 1: The Psychology of Data Profiling

Swiss psychologist Carl Gustav Jung founded the Analytical School of Psychology. His word association theories form the basis of the Myers-Briggs Type Indicator Assessment test to identify career aptitude in today’s high school students. Dr. Jung’s approach assigned personality profiles based on how an individual’s thoughts associated to various phrases. By analyzing responses, he could understand how an individual viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought entering their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?

If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.

If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.

If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.

Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive compulsive data architects compelled to organize the world into strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, existence of NULL or default values, and understanding the meaning of each data element to accurately identify business rules and when two or more objects are related or redundant.

Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.

It is truly amazing how much we can conclude from the response to one simple phrase.

As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.

Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.

This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling. In particular, it covers what information is important to collect at the system and table level and how the Enterprise can leverage that information to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.

23
Jul

Modeling the MDM Blueprint – Part 6

In this series, we’ve discussed developing the MDM blueprint by developing the Common Information (Part 2), Canonical (Part 3), and Operating (Part 4) models in our work. Part 5 introduced the Reference Architecture model into the mix to apply the technical infrastructure or patterns we plan on using.

The blueprint has now moved from being computation and platform independent to one that expresses intent through the use of more concrete platform-specific models. The solution specification is now documented (independent of the functional Business Requirements) to provide shared insight into the overall design.

Now, it’s time to bring the modeling products together and incorporate them into an MDM solution specification we can use in many ways to communicate the intent of the project.

First, the MDM blueprint specification becomes the vehicle for communicating the system’s design to interested stakeholders at each stage of its evolution. The blueprint can be used by:

  • Downstream designers and implementers to provide overall policy and design guidance. This establishes inviolable constraints (and a certain amount of freedom) on downstream development activities.
  • Testers and integrators to dictate the correct black-box behavior of the pieces that must fit together.
  • Technical managers as the basis for forming development teams corresponding to the work assignments identified.
  • Project managers as the basis for a work breakdown structure, planning, allocation of project resources, and tracking of progress by the various teams.
  • Designers of other systems with which this one must interoperate to define the set of operations provided and required, and the protocols for their operation, that allows the inter-operation to take place.

Second, the MDM blueprint specification provides a basis for performing up-front analysis to validate (or uncover deficiencies in) design decisions and refine or alter those decisions where necessary. The blueprint could be used by:

  • Architects and requirements engineers who represent the customer. The MDM blueprint specification becomes the forum for negotiating and making trade-offs among competing requirements.
  • Architects and component designers as a vehicle for arbitrating resource contention and establishing performance and other kinds of run-time resource consumption budgets.
  • Development using vendor-provided products from the commercial marketplace to establish the possibilities for commercial off-the-shelf (COTS) component integration by setting system and component boundaries and establishing requirements for the required behavior and quality properties of those components.
  • Architects to evaluate the ability of the design to meet the system’s quality objectives. The MDM blueprint specification serves as the input for architectural evaluation methods such as the Software Architecture Analysis Method, the Architecture Tradeoff Analysis Method (ATAM-SM), and Software Performance Engineering (SPE), as well as less ambitious (and less effective) activities such as unfocused design walkthroughs.
  • Performance engineers as the formal model that drives analytical tools such as rate schedulers, simulations, and simulation generators.
  • Development product line managers to determine whether a potential new member of a product family is in or out of scope, and if out, by how much.

Third, the MDM blueprint becomes the first artifact used to achieve system understanding for:

  • Technical managers, as the basis for conformance checking, for assurance that implementations have in fact been faithful to the architectural prescriptions.
  • Maintainers, as a starting point for maintenance activities, revealing the areas a prospective change will affect.
  • New project members, as the first artifact for familiarization with a system’s design.
  • New architects, as the artifacts that (if properly documented) preserve and capture the previous incumbent’s knowledge and rationale.
  • Re-engineers, as the first artifact recovered from a program understanding activity or (in the event that the architecture is known or has already been recovered) the artifact that drives program understanding activities at the appropriate level of component granularity.

Blueprint for MDM - Where this fits within a larger program

Developing and refining the MDM blueprint is typically associated with larger programs or strategic initiatives. In this last part of the series, I'll discuss where all this typically fits within a larger program and how to organize and plan this work within context.

The following diagram puts our modeling efforts within the context of a larger program, drawn from a mix of actual engagements with large, global customers. The key MDM blueprint components are highlighted with numbers representing:

  1. Common Information Model
  2. The Canonical Model
  3. The Operating Model
  4. The Reference Architecture

[Diagram: program management design showing where the four numbered blueprint components fit]

I have also assumed a business case exists (you have this, right?) and the functional requirements are known. Taken together with the MDM blueprint, we now have a powerful arsenal of robust information products we can use to prepare a high quality solution specification that is relevant and can be used to meet a wide variety of needs.

Typically, use of the MDM blueprint may include:

  • Identifying all necessary components and services
  • Reviewing existing progress to validate (or uncover deficiencies in) design decisions; refine or alter those decisions where necessary
  • Preparation of detailed planning products (Product, Organization, and Work Breakdown structures)
  • Program planning and coordination of resources
  • Facilitating prioritization of key requirements – technical and business
  • Development of Request for Quotation, Request for Information products (make vs. buy)
  • Preparing funding estimates (Capital and Operating Expense) and program budget preparation
  • Understanding a vendor’s contribution to the solution and pricing accordingly (for example, repurpose as needed in contract and licensing activities and decouple supplier proprietary lock-in from the solution where appropriate)

We are also helping to ensure the business needs drive the solution by mitigating the impact of the dreaded Vendor Driven Architecture (VDA) in the MDM solution specification.

Summary

I hope you have enjoyed this brief journey through “Modeling the MDM Blueprint” and have gained something from my experience. I’m always interested in learning from others, so please let me know what you’ve encountered yourself, and maybe we can help others avoid the pitfalls and pain in this difficult, demanding work.

The difference between success and failure on an MDM journey is taking the time to model the blueprint and share this early and often with the business. This is, after all, a business project, not an elegant technical exercise. In an earlier part, I mentioned Ward Cunningham’s Technical Debt concept. Recall that this metaphor means that doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the extra effort we have to do in future development because of the quick and dirty design choices we have made. The technical debt and resulting interest due in an MDM initiative with this kind of far-reaching impact across the enterprise is, well, unthinkable.

Take the time to develop your MDM blueprint and use this product to ensure success by clearly communicating business and technical intent with your stakeholders.

22
Jul

Modeling the MDM Blueprint – Part 5

In this series, we’ve discussed developing the MDM blueprint by creating the Common Information (Part 2), Canonical (Part 3), and Operating (Part 4) models in our work streams. We’ve introduced the Operating Model into the mix to communicate with the business how the solution will be adopted and used to realize the expected benefits. And hopefully we’ve set reasonable expectations with our business partners as to what this solution will look like when deployed.

Now, it’s time to model and apply the technical infrastructure or patterns we plan on using. The blueprint now moves from being computation and platform independent to one of expressing intent through the use of more concrete platform-specific models.

Reference Architecture

After the initial (CIM, Canonical, and Operating models) work is completed, then, and only then, are we ready to move on to the computation and platform specific models. We know how to do this; for example, see Information Service Patterns, Part 4: Master Data Management architecture patterns.

At this point, we now have enough information to create the reference architecture. One way (there are several) to organize this content is to use the Rozanski and Woods extensions to the classic 4+1 view model introduced by Philippe Kruchten. The views are used to describe the system from the viewpoint of different stakeholders (end-users, developers and project managers). The four views of the model are the logical, development, process and physical views. In addition, selected use cases or scenarios are used to demonstrate the architecture’s intent, which is why the model contains 4+1 views (the +1 being the selected scenarios).

Rozanski and Woods extended this idea by introducing a catalog of six core viewpoints for information systems architecture: the Functional, Information, Concurrency, Development, Deployment, and Operational viewpoints and related perspectives. This is elaborated in detail in their book, “Software Systems Architecture: Working with Stakeholders Using Viewpoints and Perspectives”. There is much to learn from their work, and I encourage you to visit the book’s web site for more information.

What we are describing here is how MDM leadership within very large-scale organizations can eventually realize the five key “markers” or characteristics in the reference architecture, which include:

  • Shared services architecture evolving to process hubs;
  • Sophisticated hierarchy management;
  • High-performance identity management;
  • Data governance-ready framework; and
  • Registry, persisted or hybrid design options in the selected architecture.

This is an exceptional way to tie the technical models back to the stakeholders’ needs, as reflected in the viewpoints, perspectives, guidelines, principles, and template models used in the reference architecture. Grady Booch said “… the 4+1 view model has proven to be both necessary and sufficient for most interesting systems”, and there is no doubt that MDM is interesting. Once this work has been accomplished and agreed to as part of a common vision, we have several options for proceeding. One interesting approach is leveraging this effort into the Service-Oriented Modeling Framework introduced by Michael Bell at Methodologies Corporation.

Service-Oriented Modeling

The service-oriented modeling framework (SOMF) is a development life cycle methodology. It offers a number of modeling practices and disciplines that contribute to successful service-oriented life cycle management and modeling. It illustrates the major elements that identify the “what to do” aspects of a service development scheme.

These are the modeling pillars that will enable practitioners to craft an effective project plan and to identify the milestones of a service-oriented initiative—in this case crafting an effective MDM solution.  SOMF provides four major SOA modeling styles that are useful throughout a service life cycle (conceptualization, discovery and analysis, business integration, logical design, conceptual and logical architecture).

These modeling styles: Circular, Hierarchical, Network, and Star, can assist us with the following modeling aspects:

  • Identify service relationships: contextual and technological affiliations
  • Establish message routes between consumers and services
  • Provide efficient service orchestration and choreography methods
  • Create powerful service transaction and behavioral patterns
  • Offer valuable service packaging solutions

SOMF Modeling Styles

SOMF offers four major service-oriented modeling styles. Each pattern identifies the various approaches and strategies that one should consider employing when modeling MDM services in a SOA environment.

Circular Modeling Style: enables message exchange in a circular fashion, rather than employing a controller to carry out the distribution of messages. The Circular Style also offers a way to affiliate services.

Hierarchical Modeling Style: offers a relationship pattern between services for the purpose of establishing transactions and message exchange routes between consumers and services. The Hierarchical pattern enforces parent/child associations between services and lends itself to a well known taxonomy.

Network Modeling Style: this pattern establishes a “many to many” relationship between services, their peer services, and consumers, similar to RDF. The Network pattern emphasizes distributed environments and interoperable computing networks.

Star Modeling Style: the Star pattern advocates arranging services in a star formation, in which the central service passes messages to its extending arms. The Star modeling style is often used in “multi casting” or “publish and subscribe” instances, where “solicitation” or “fire and forget” message styles are involved.
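
To make the contrast between two of these styles concrete, here is a minimal Python sketch of how messages might flow in a Star arrangement versus a Circular one. It is only an illustration under my own assumptions: the service names (`mdm_hub`, `customer`, `product`, `location`), the message name, and both function names are hypothetical, and none of this is SOMF notation or tooling.

```python
from collections import deque


def star_broadcast(hub, arms, message):
    """Star style: the central service passes the message to each extending arm."""
    return [(hub, arm, message) for arm in arms]


def circular_pass(services, start, message):
    """Circular style: each service hands the message to its neighbor,
    with no central controller, until it comes back to the sender."""
    ring = deque(services)
    ring.rotate(-services.index(start))  # put the sender first in the ring
    order = list(ring)
    receivers = order[1:] + order[:1]    # each service's next neighbor
    return [(sender, receiver, message) for sender, receiver in zip(order, receivers)]


# Hypothetical services, used only for illustration.
services = ["customer", "product", "location"]
star_hops = star_broadcast("mdm_hub", services, "party.updated")
ring_hops = circular_pass(services, "product", "party.updated")
```

In the star case every hop originates at the hub (a natural fit for publish and subscribe), while in the circular case each service appears once as sender and once as receiver.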

There is much more to this method, so I encourage you to visit the Methodologies Corporation site and download the tools, PowerPoint presentations, and articles they’ve shared.

Summary

Based on my experience, we have to get this modeling effort completed to improve the probability we’ll be successful. MDM is really just another set of tools and processes for modeling and managing business knowledge of data in a sustainable way. Take the time to develop a robust blueprint to include the Common Information (semantic, pragmatic and logical modeling), Canonical (business rules and format specifications), and Operating Models to ensure completeness. Use these models to drive a suitable Reference Architecture to guide design choices in the technical implementation.

This is hard work. Anything worthwhile usually is. Why put the business at risk to solve this important and urgent need without our stakeholders’ understanding and real enthusiasm for shared success? A key differentiator, and the difference between success and failure on an MDM journey, is taking the time to model the blueprint and share this early and often with the business. This is, after all, a business project, not an elegant technical exercise. Creating and sharing a common vision through our modeling efforts helps ensure success from inception through adoption by communicating clearly the business and technical intent of each element of the MDM program.

In the last part of the series, I’ll discuss where all this fits into the larger MDM program and how to plan, organize, and complete this work.

21
Jul

Modeling the MDM Blueprint – Part 4

In Part 2 and Part 3 of this series, we discussed the Common Information and Canonical Models. Because MDM is a business project, we need to establish a common set of models that can be referenced independently of the technical infrastructure or patterns we plan on using. Now it is time to introduce the Operating Model to communicate how the solution will actually be deployed and used to realize the expected benefits.

This is the most important set of models you will undertake. Sadly, it is not widely accounted for “in the wild”; it is rarely seen, much less achieved. This effort describes how the organization will govern, create, maintain, use, and analyze consistent, complete, contextual, and accurate data values for all stakeholders.

There are a couple of ways to do this. One interesting approach I’ve seen is to use the Galbraith Star Model as an organizational design framework. The model is developed within this framework to understand what design policies and guidelines will be needed to align organizational decision making and behavior within the MDM initiative.

The Star model includes the following five categories:

Strategy: Determine direction through goals, objectives, values and mission. It defines the criteria for selecting an organizational structure (for example functional or balanced matrix). The strategy defines the ways of making the best trade-off between alternatives.

Structure: Determines the location of decision making power. Structure policies can be subdivided into:
- specialization: type and number of job specialties;
- shape: the span of control at each level in the hierarchy;
- distribution of power: the level of centralization versus decentralization;
- departmentalization: the basis to form departments (function, product, process, market or geography).

In our case, this will really help when it comes time to designing the entitlement and data steward functions.

Processes: The flow of information and decision processes across the proposed organization’s structure. Processes can be either vertical through planning and budgeting, or horizontal through lateral relationships (matrix).

Reward Systems: Influence the motivation of organization members to align employee goals with the organization’s objectives.

People and Policies: Influence and define employees’ mindsets and skills through recruitment, promotion, rotation, training and development.

Now before your eyes glaze over, I’m only suggesting this be used as a starting point. We’re not originating much of this thought capital, only examining the impact the adoption of MDM will have on the operating model within this framework. And more importantly, identifying how any gaps uncovered will be addressed to ensure this model remains internally consistent. After all, we do want to enable the kind of behavior we expect in order to be effective, right?

A typical design sequence starts with an understanding of the strategy as defined. This in turn drives the organizational structure. Processes are based on the organization’s structure. Structure and Processes define the implementation of reward systems and people policies.

The preferred sequence in this design process is composed in the following order: (a) strategy; (b) structure;  (c) key processes; (d) key people; (e) roles and responsibilities; (f) information systems (supporting and ancillary); (g) performance measures and rewards; (h) training and development; (i) career paths.

The design process can be accomplished using a variety of tools and techniques. I have used IDEF, BPMN or other process management methods and tools (including RASIC charts describing roles and responsibilities, for example). Whatever tools you elect to use, they should effectively communicate intent and be used to validate changes with the stakeholders, who must be engaged in this process.

Armed with a clear understanding of how the Star model works we can turn our attention to specific MDM model elements to include:

Master Data Life Cycle Management processes
- Process used to standardize the way the asset (data) is used across an enterprise
- Process to coordinate and manage the lifecycle of master data
- How to understand and model the lifecycle of each business object using state machines (UML)
- Process to externalize business rules locked in proprietary applications (ERP) for use with Business Rules Management Systems (BRMS) (if you’re lucky enough to have one)
- Operating Unit interaction
- Stewardship (Governance Model)
- Version and variant management, permission management, approval processes
- Context (languages, countries, channels, organizations, etc.) and inheritance of reference data values between contexts
- Hierarchy management
- Lineage (historical), auditability, traceability
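
As one way to picture the state machine idea in the list above, here is a minimal Python sketch of a master data record lifecycle with an approval step. The states, events, and class names are hypothetical choices of mine for illustration, not a prescribed lifecycle; a real model would come out of the stewardship and approval processes agreed with the business.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PENDING_APPROVAL = "pending_approval"
    ACTIVE = "active"
    RETIRED = "retired"


# Allowed transitions: (current state, event) -> next state.
# These states and events are illustrative assumptions only.
TRANSITIONS = {
    (State.DRAFT, "submit"): State.PENDING_APPROVAL,
    (State.PENDING_APPROVAL, "approve"): State.ACTIVE,
    (State.PENDING_APPROVAL, "reject"): State.DRAFT,
    (State.ACTIVE, "retire"): State.RETIRED,
}


class MasterRecord:
    def __init__(self, key):
        self.key = key
        self.state = State.DRAFT
        self.history = [State.DRAFT]  # simple lineage of states visited

    def fire(self, event):
        """Apply an event, refusing any transition the model does not allow."""
        next_state = TRANSITIONS.get((self.state, event))
        if next_state is None:
            raise ValueError(f"{event!r} is not allowed in state {self.state.name}")
        self.state = next_state
        self.history.append(next_state)
        return next_state


record = MasterRecord("PARTY-001")
record.fire("submit")
record.fire("approve")
```

Keeping the transition table as data rather than scattered `if` statements makes it something you can review with data stewards, and the `history` list gives a crude form of the lineage and auditability mentioned above.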

I know this seems like a lot of work. Ensuring success and widespread adoption of Master Data Management mandates this kind of clear understanding and shared vision among all stakeholders. We do this to communicate how the solution will actually be deployed and used to realize the benefits we expect.

In many respects, this is the business equivalent to the Technical Debt concept Ward Cunningham developed (we’ll address this in the next part on Reference Architecture) to help us think about this problem. Recall this metaphor means doing things the quick and dirty way sets us up with a technical debt, which is similar to a financial debt. Like a financial debt, the technical debt incurs interest payments, which come in the form of the extra effort we have to do in future development because of the quick and dirty design choices we have made. The same concept applies to this effort. The most elegant technical design may be the worst possible fit for the business. The interest due in a case like this is, well, unthinkable.

Take the time to get this right. You will be rewarded with enthusiastic and supportive sponsors who will welcome your efforts to achieve success within an operating model they understand.

20
Jul

Modeling the MDM Blueprint – Part 3

In Part 2 of this series we discussed the Common Information Model. Because MDM is a business project, we need to establish a common set of models that can be referenced independently of the technical infrastructure or patterns we plan on using. The essential elements should include:

- Common Information Model
- Canonical Model
- Operating Model, and
- Reference Architecture (e.g. 4+1 views, viewpoints and perspectives).

We will now turn our attention to the second element, the Canonical Model.

The Canonical Model (business rules and format specification) describes how business rules extracted from the software portfolio are managed and shared among other applications. In addition to externalizing business rules locked in proprietary applications (for example, ERP or CRM), we also use design patterns defined here to communicate between different data formats. Instead of writing translators between each and every pair of formats (with the potential for a combinatorial explosion), use this in combination with the CIM to write one translator between each format and the canonical format, using rules to guide the effort.

See the Open Applications Group Integration Specification (OAGIS) as an example of an integration architecture that is based on a canonical data model. Implicit (and emerging now as generally accepted practice) is the use of rules (rules engines like iLOG, for example) to handle reference data that must be shared across systems beyond the software packages in our portfolio. OAGIS uses XML as the common protocol for defining business messages and processes (scenarios) to enable business applications to communicate with one another in a standard manner. Not only is it the most complete set of XML business messages currently available (there are several others; see the eXtensible Business Reporting Language (XBRL), for example), it also accommodates specific industries by collaborating with vertical industry groups to add and extend requirements as needed. For another real working example in the Product Information Management (PIM) space, see the GS1 Global Data Synchronization Network and the standards that make this possible.
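
The arithmetic behind the translator argument is worth seeing once. The Python sketch below counts translators for both approaches and then routes a record from one format to another through a canonical form. The field names (`CUST_NO`, `accountId`, and so on), the two systems, and the function names are hypothetical, and the canonical form here is a plain dictionary rather than an OAGIS message.

```python
def point_to_point_translators(n):
    """Directed translators needed when every format maps to every other format."""
    return n * (n - 1)


def canonical_translators(n):
    """Translators needed when each format maps to and from one canonical format."""
    return 2 * n


# Hypothetical source and target layouts for a customer record.
def from_erp(rec):
    return {"party_id": rec["CUST_NO"], "name": rec["CUST_NAME"].strip()}


def from_crm(rec):
    return {"party_id": rec["accountId"], "name": rec["displayName"].strip()}


def to_crm(canon):
    return {"accountId": canon["party_id"], "displayName": canon["name"]}


TO_CANONICAL = {"erp": from_erp, "crm": from_crm}
FROM_CANONICAL = {"crm": to_crm}


def translate(rec, source, target):
    """Translate via the canonical form instead of a dedicated source-to-target mapper."""
    return FROM_CANONICAL[target](TO_CANONICAL[source](rec))


crm_record = translate({"CUST_NO": "42", "CUST_NAME": " Acme "}, "erp", "crm")
```

With ten formats, the point-to-point approach needs 90 directed translators against 20 for the canonical approach, and adding an eleventh format costs two new translators rather than twenty.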

Nick Malik over at Inside Architecture has written an exceptional post about this. We may not agree on all aspects (mostly semantics), but I think he has summed up well what this set of models should address in the blueprint. His post addresses the essential elements a complete modeling effort would produce. These products would typically include:

Canonical Message Schema - describes how when passing messages from one application to another we pass a set of data between applications where both the sender and the receiver have a shared understanding of what the values are: (a) data type, (b) range of values, and (c) semantic meaning.
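
As a rough illustration of those three parts of a shared understanding (data type, range of values, and semantic meaning), here is a small Python sketch of a message schema with a validator. The fields, allowed values, and names are hypothetical; a real canonical schema would more likely be expressed in XML Schema or a similar standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    name: str
    data_type: type   # (a) data type
    allowed: tuple    # (b) range of values; an empty tuple means unrestricted
    meaning: str      # (c) semantic meaning agreed by sender and receiver


# Hypothetical schema for a party status message.
SCHEMA = (
    FieldSpec("party_id", str, (), "Enterprise-wide identifier of the party"),
    FieldSpec("status", str, ("active", "inactive"), "Whether the party may transact"),
    FieldSpec("credit_limit", int, (), "Credit limit in whole US dollars"),
)


def validate(message, schema=SCHEMA):
    """Return a list of problems; an empty list means the message conforms."""
    errors = []
    for spec in schema:
        if spec.name not in message:
            errors.append(f"{spec.name}: missing")
            continue
        value = message[spec.name]
        if not isinstance(value, spec.data_type):
            errors.append(f"{spec.name}: expected {spec.data_type.__name__}")
        elif spec.allowed and value not in spec.allowed:
            errors.append(f"{spec.name}: {value!r} is outside the allowed values")
    return errors


good = {"party_id": "P-1", "status": "active", "credit_limit": 5000}
bad = {"party_id": "P-2", "status": "dormant", "credit_limit": "5000"}
```

The `meaning` text is not checked by code, which is rather the point: it is the part the sender and receiver have to agree on as people.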

Event Driven Perspective (Views) - a style of architecture characterized by a set of relatively independent actors who communicate events amongst themselves in order to achieve a coordinated goal.  This can be done at the application level, the distributed system level, the enterprise level, and the inter-enterprise level (B2B and EDI).  Although we disagree on where this effort belongs (see Part IV of this series on reference architecture development), the logical view will have its origins here.

Business Event Ontology – This ontology includes a list of business events, usually in a hierarchy, that represents the points in the overall business process where two or more objects (entities) need to communicate or share the same data values and intent (semantics). And this, as Nick states, “is not the same as a process step. An event may trigger a process step, but the event itself is strictly speaking simply a ‘notification of something that has occurred,’ not the name of the process.” Ontology development is a pretty exciting technology I have watched mature from simple lab exercises (toys really) to something far more useful. For more on this, see Part II (The Common Information Model) or my post at Essential Analytics about the Protege ontology editor.

Business Rules – The last modeling effort is the collection (identification and grouping) of the rules used to define the behavior of the elements we have already referred to. Typically buried in application code (if you are not lucky enough to have a business rules engine), this model describes the business rules, protocol, and default behavior expected when the model elements interact with each other (especially useful when exceptions occur or logical constraints are violated). This is not a common artifact, I find; I wish more of us would take the time and effort to accomplish this task. For another real world reference, see the GDSN Package Measurement Rules (issue 1.9.2) for the global definition of nominal measurement attributes of product packaging, or the GDSN Validation Rules.
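
To show what pulling rules out of application code can look like, here is a minimal Python sketch that keeps rules as a list of named checks and reports which ones a record violates. The two rules and the attribute names are made up for illustration; they are not taken from the GDSN rule sets, and a real rules engine offers far more than this.

```python
# Each rule is data: an id, a description the business can read, and a check.
# The rules and attribute names below are hypothetical examples.
RULES = [
    {
        "id": "R1",
        "description": "Gross weight must not be less than net weight",
        "check": lambda p: p["gross_weight_kg"] >= p["net_weight_kg"],
    },
    {
        "id": "R2",
        "description": "Every package dimension must be positive",
        "check": lambda p: all(p[d] > 0 for d in ("height_mm", "width_mm", "depth_mm")),
    },
]


def violations(record, rules=RULES):
    """Return the ids of the rules that the record breaks."""
    return [rule["id"] for rule in rules if not rule["check"](record)]


carton = {
    "gross_weight_kg": 1.2,
    "net_weight_kg": 1.5,   # heavier than gross: breaks R1
    "height_mm": 100,
    "width_mm": 0,          # not positive: breaks R2
    "depth_mm": 50,
}
broken = violations(carton)
```

Because the rules live in one list with readable descriptions, they can be reviewed, versioned, and shared across systems instead of being rediscovered in each application.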

As I stated in Part 2, this is hard, challenging work. The key differentiator and difference between success and failure on your MDM journey will be taking the time to model the blueprint and sharing this work early and often with the business. We will be discussing the third (and most important) element of the MDM blueprint, the Operating Model, in Part 4. I encourage you to participate and share your experience, as we can all learn from each other.

19
Jul

Modeling the MDM Blueprint – Part 2

In Part 1 of this series, we discussed what essential elements should be included in an MDM blueprint. The important thing to remember is that MDM is a business project that requires establishing a common set of models that can be referenced independently of the technical infrastructure or patterns you plan on using. The blueprint should remain computation and platform independent until the models are completed (and accepted by the business) to support and ensure the business intent. The essential elements should include:

- Common Information Model
- Canonical Model
- Operating Model, and
- Reference Architecture (e.g. 4+1 views, viewpoints and perspectives).

We will now turn our attention to the first element, the Common Information Model.

A Common Information Model (CIM) is defined using relational, object, hierarchical, and semantic modeling methods. What we are really developing here is rich semantic data architecture in selected business domains using:

  • Object Oriented modeling: reusable data types, inheritance, operations for validating data
  • Relational: manage referential integrity constraints (primary keys, foreign keys)
  • Hierarchical: nested data types and facets for declaring behaviors on data (e.g. think XML schemas)
  • Semantic models: ontologies defined through RDF, RDFS and OWL
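
As a rough sketch of how these four methods can describe the same business object, the Python below models a party using inheritance and a validation operation (object oriented), a key and a foreign key check (relational), a nested address (hierarchical), and a set of triples (semantic). All class, field, and predicate names are hypothetical, and the triples are plain tuples rather than real RDF.

```python
from dataclasses import dataclass, field


@dataclass
class Party:                       # object oriented: reusable base type
    party_id: str                  # relational: primary key
    name: str

    def validate(self):            # object oriented: operation for validating data
        return bool(self.party_id) and bool(self.name.strip())


@dataclass
class Address:                     # hierarchical: nested inside an Organization
    city: str
    country: str


@dataclass
class Organization(Party):         # object oriented: inheritance
    addresses: list = field(default_factory=list)
    parent_id: str = None          # relational: foreign key to another Party


def dangling_parents(parties):
    """Relational view: report foreign keys that match no primary key."""
    keys = {p.party_id for p in parties}
    return [p.party_id for p in parties
            if getattr(p, "parent_id", None) and p.parent_id not in keys]


def as_triples(org):
    """Semantic view: the same facts as subject, predicate, object triples."""
    triples = [(org.party_id, "type", "Organization"),
               (org.party_id, "name", org.name)]
    if org.parent_id:
        triples.append((org.party_id, "subsidiaryOf", org.parent_id))
    return triples


acme = Organization("P1", "Acme", [Address("Boston", "US")])
acme_uk = Organization("P2", "Acme UK", parent_id="P1")
orphan = Organization("P3", "Orphan Ltd", parent_id="P9")
```

None of the four views is sufficient on its own, which is the argument for treating the CIM as their intersection.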

I believe (others may not) that MDM truly represents the intersection of Relational, Object, Hierarchical, and Semantic modeling methods to achieve a rich expression of the reality in which the organization operates. Expressed in business terms, this model represents a “foundation principle” or theme we can pivot around to understand each facet in the proper context. This is not easy to pull off, but will provide a fighting chance to resolve semantic differences in a way that helps focus the business on the real matters at hand. This is especially important when developing the Canonical model introduced in the next step.

If you want to see what one of these looks like visit the MDM Alliance Group (MAG). MAG is a community that Pierre Bonnet founded to share MDM Modeling procedures and pre-built data models. The MDM Alliance Group publishes a set of pre-built data models that include the usual suspects (Location, Asset, Party, Party Relationship, Party Role, Event, Period [Date, Time, Condition]) downloadable from the website. And some more interesting models like Classification (Taxonomy) and Thesaurus organized across three domains. Although we may disagree about the “semantics”, I do agree with him that adopting this approach can help us avoid setting up siloed reference databases “…unfortunately often noted when using specific functional approaches such as PIM (Product Information Management) and CDI (Customer Data Integration) modeling”. How true. And an issue I encounter often.

Another good example is the CIM developed over the years at the Distributed Management Task Force (DMTF). You can get the CIM V2.20 Schema MOF, PDF and UML at their web site and take a look for yourself. While this is not what most of us think of as MDM, they are solving for some of the same problems and challenges we face.

Even more interesting is what is happening in semantic technology. Building semantic models (ontologies) includes many of the same concepts found in the other modeling methods we’ve already discussed but further extend the expressive quality we often need to fully communicate intent. For example:

- Ontologies can be used at run time (queried and reasoned over).
- Relationships are first-class constructs.
- Classes and attributes (properties) are set-based and dynamic.
- Business rules are encoded and organized using axioms.
- XML schemas are graphs, not trees, and are used for reasoning.
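
To give a feel for the first two points (querying and reasoning at run time, with relationships as first-class constructs), here is a toy Python sketch that stores facts as triples and infers class membership through a subclass hierarchy. The classes and instances are hypothetical, and this is plain Python rather than RDF, RDFS or OWL tooling; a real project would use an ontology editor and a reasoner.

```python
# Facts as (subject, predicate, object) triples; all names are hypothetical.
TRIPLES = {
    ("Customer", "subClassOf", "Party"),
    ("Supplier", "subClassOf", "Party"),
    ("KeyAccount", "subClassOf", "Customer"),
    ("acme", "type", "KeyAccount"),
    ("globex", "type", "Supplier"),
}


def superclasses(cls, triples=TRIPLES):
    """Follow subClassOf relationships transitively."""
    found, frontier = set(), {cls}
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if s == current and p == "subClassOf" and o not in found:
                found.add(o)
                frontier.add(o)
    return found


def instances_of(cls, triples=TRIPLES):
    """Instances whose asserted type is cls or any subclass of cls."""
    result = set()
    for s, p, o in triples:
        if p == "type" and (o == cls or cls in superclasses(o, triples)):
            result.add(s)
    return result


parties = instances_of("Party")
customers = instances_of("Customer")
```

Nothing in the data says that `acme` is a Party; that fact is inferred from the relationships at query time, which is the kind of expressiveness the other modeling methods do not give you directly.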

If you haven’t been exposed to ontology development, I encourage you to grab the open source Protege Ontology Editor and discover for yourself what this is all about. And while you are there see the Protégé Wiki and grab the Federal Enterprise Architecture Reference Model Ontology (FEA-RMO) for an example of its use in the EA world. Or see the set of tools found at the Essential project. The project uses this tool to enter model content, based on a model pre-built for Protégé. While you are at the Protégé Wiki, grab some of the ontologies developed for use with this tool for other examples, such as the SWEET Ontologies (A Semantic Web for Earth and Environmental Terminology. Source: Jet Propulsion Laboratory). For more on this, see my post on this tool at Essential Analytics. This is an interesting and especially useful modeling method to be aware of and an important tool to have at your disposal.

This is hard, challenging work. Doing anything worthwhile usually is. A key differentiator and the difference between success and failure on your MDM journey will be taking the time to model the blueprint and sharing this work early and often with the business. We will be discussing the second element of the MDM blueprint, the Canonical Model, in Part 3. I encourage you to participate and share your professional experience via the comments here.

16
Jul

Modeling the MDM Blueprint – Part 1

Several practitioners have contributed to this complex subject (see Dan Power’s Five Essential Elements of MDM and CDI, for example) and have done a good job at describing the critical elements.  There is one more element that’s often overlooked however, and it remains a key differentiator and all too often, it’s the difference between success and failure among the major initiatives I’ve had the opportunity to witness – modeling the blueprint for MDM.

This is an important first step to take, assuming the business case is completed and approved. It forces us to address the very real challenges up front, before embarking on a journey that our stakeholders must understand and support. Obtaining buy-in and executive support means we all share a common vision.

MDM is more than maintaining a central repository of master data. The shared reference model should provide a resilient, adaptive blueprint to sustain high performance and value over time.

An MDM solution should include the tools for modeling and managing business knowledge of data in a sustainable way.  This may seem like a tall order, but consider the implications if we focus on the tactical and exclude the reality of how the business will actually adopt and embrace all of your hard work.

Or worse, consider asking the business to start from a blank sheet of paper and expecting them to tell you how to rationalize and manage the integrity rules connecting data across several systems, eliminate duplication and waste, and ensure an authoritative source of clean, reliable information that can be audited for completeness and accuracy. Still waiting?

So What’s in This Blueprint?

The critical thing to remember is the MDM project is a business project that requires establishing a common information model that applies whatever the technical infrastructure or patterns you plan on using may be. The blueprint should remain computation and platform independent until the Operating Model is defined (and accepted by the business), and a suitable Common Information Model (CIM) and Canonical Model are completed to support and ensure the business intent.

Then, and only then, are you ready to tackle the Reference Architecture.

The essential elements should include:

  • Common Information Model
  • Canonical Model
  • Operating Model, and
  • Reference Architecture (e.g. 4+1 views).

I’ll be discussing each of these important and necessary components within the MDM blueprint in future articles in this series, and I encourage you to participate and share your professional experience. Adopting and succeeding at Master Data Management is not easy, and jumping into the “deep end” without truly understanding what you are solving for is never a good idea.

Whether you are a hands-on practitioner, program manager, or an executive planner, I can’t emphasize enough how critical modeling the MDM blueprint and sharing this with the stakeholders is to success. You simply have to get this right before proceeding further.

29
Jun

Data Governance: The People Make It Real

I was talking with a potential client the other day about their master data management program and how they’ve structured their data governance initiative.  This company really seems to have gotten it right.  They’ve put together a data governance framework that makes sense, and they’ve made some good technology choices along the way as well.

But the analogy that came to mind was that of the United States in its early days, after the Constitution was framed up but before any of the representatives had been elected.

The data governance framework (the Constitution) has been put together, but the people haven’t been hired yet. And in data governance, as in most other areas of business, it’s the people who make it really happen. A great business process is just a piece of paper or a drawing on a white board without the right person to make it happen.

What do you look for in an employee filling a data governance role?  Depending on how senior the position, I’d look for a certain amount of political savvy, a drive to get things done, an ability to focus and be detail oriented, a combination of business and technical experience, and of course, the usual “strong written and verbal communication skills”.

Seriously, the more senior data governance people are taking on a pretty tough task. While I don’t believe that “data governance is career suicide”, I do believe that you’re asking a lot of someone:

  • come into a new company (or change roles within their current company)
  • start up a new organization within the company that has a somewhat vague mission (“data governance? what’s that?”)
  • take on the task of forming the data governance group while overcoming skeptics of its mission
  • hire people to round out the data governance group, lest they risk becoming a “team of one”
  • sometimes, come into a crisis situation and hit the ground running

So one way to support a new data governance program leader is to bring in a SWAT team around them temporarily, to support them while they build out their permanent team, and to have that SWAT team help with the definition of the job roles and responsibilities, explaining the mission to the rest of the organization, creating the communication strategy, and all of the other tasks associated with getting data governance up and running.

That’s a lot kinder than just handing a newly hired data governance leader a blank sheet of paper and saying “good luck!”

And until you’ve got the bodies in the seats, data governance probably isn’t fully real at your organization yet.

28
Jun

Philosophy of MDM

My philosophy of MDM is simple: all things being equal, enter and manage master data in its own repository or hub, and pay the same attention to the organization and business processes for creating, distributing, updating and retiring master data that you do for other types of data within the enterprise.

You’d be amazed how often that simple statement confounds people though. They want to enter master data in their ERP or CRM system, and then synchronize it over to the MDM hub. Or they’d like to somehow do without an organization to manage their master data for the enterprise. Or they’re willing to concede the need for a data governance group, but don’t think that group will need any formal processes or technology to help orchestrate their work or facilitate it and improve their productivity.

Even though the link between data quality tools and master data management is well established, I sometimes still see people try to do MDM projects without using data quality technology. And even though synchronizing the high quality master data available in the hub should be a high priority, people (typically for cost reasons) still try to skimp on integration technology and try to get by with only the most basic ETL tools.

One of the most popular articles we’ve had here on the Hub Designs Blog was the Five Essential Elements of MDM, in which I laid out what I thought were the most important related areas of technology. In it, I included the MDM hub itself, of course, and also data quality, data integration or middleware, third party content and data governance (which, of course, is not really technology, but needs to be included because it too is so often forgotten).

So getting back to the focus of this article, my philosophy of MDM is to have all of the essential elements, to have a sound vision and strategy for MDM, a strong business case based on metrics, to create a governance framework and organization to carry it out, to design governance processes, and then (last but not least) to implement technology to facilitate the governance needed to support the enterprise’s master data requirements.

So often today, we see organizations taking a technology-driven approach, or leaving out important parts of the above approach.  Have you thought your MDM initiative all the way through?

10
Jun

Intersection of MDM, CRM and ERP

My earlier article on Why Product Information Management in Information Management magazine prompted Andrew White of Gartner to write a short blog article.

Andrew picked up on my comment “If CRM and ERP platforms were better able to manage master data, perhaps we wouldn’t need MDM solutions.” He goes on to say that “these applications were designed in an era when there was no need to take account of information requirements ACROSS the enterprise.”

The operating assumption for most CRM and ERP platforms, unfortunately, was that you were going to run your ENTIRE business on them.  This rarely, if ever, turns out to be the case, particularly if the business does a lot of acquisitions. One business unit or geography certainly. And the count may grow over time. But there are always going to be areas of the business “outside the pale” – not included in that particular CRM or ERP solution’s purview. This leads to silos of data, which create many problems in the management and analysis of information in the enterprise.

That’s why an MDM hub makes so much sense. It provides a neutral place for customer, product and other master data from all over the enterprise to be created, read, updated and managed. Increasingly, today’s CRM and ERP applications are being used in concert with a robust MDM hub. Even now, CRM and ERP products just aren’t designed to manage master data effectively. They don’t have the built-in data quality and data governance processes that are needed to ensure a single view of accurate, complete, timely and consistent master data across the enterprise.

You can read the article by Andrew White of Gartner Research at http://blogs.gartner.com/andrew_white/2010/06/07/good-summary-of-mdm-of-product-data-and-its-value-to-the-business/.

25
May

MIKE2.0

MIKE2.0 (Method for an Integrated Knowledge Environment) is an Open Source methodology for Enterprise Information Management.

I first became familiar with it in 2009. MIKE2.0 provides a lot of “thought capital” for practitioners in the areas of enterprise architecture, master data management, and data governance. While it was (at that time, at least) too incomplete to use “as is”, it was very helpful in being able to show to a client as an example of what a data governance program would look like, or what an outline of an enterprise master data management program would look like.

MIKE2.0 evolved from work done by BearingPoint, which has emerged from its 2009 bankruptcy operating in 14 countries throughout Europe with about 3,250 employees. The MIKE2.0 intellectual property is now open source and is controlled by the MIKE2.0 Governance Association, which includes representation from BearingPoint and Deloitte.

Now, MIKE2.0 is firmly in the hands of a non-profit, independent governing body, which makes the entire body of work available as a tool for MDM and data governance practitioners.

There are a lot of great assets embedded in MIKE2.0 – in particular, there’s a customer data integration solution offering, a data integration solution offering, a data investigation and re-engineering solution offering, and an information governance solution offering.

These map fairly well to my “Five Essential Elements of MDM” article, where I said that, to succeed with MDM, you really needed:

  • a Hub of some type
  • some kind of data integration or middleware
  • data quality capabilities
  • external content
  • data governance (which of course, is the most important)

So while I wouldn’t recommend using MIKE2.0 “out of the box” (i.e. without some fairly heavy adaptation), it may very well save you a lot of time in your MDM and data governance initiative. If you’re not already familiar with it, I highly recommend you check it out.

21
May

Recent eLearning Curve Webinar

Hub Designs recently hosted a 30 minute webinar on “Best Practices in MDM and Data Governance with Dan Power”, in concert with our friends at eLearning Curve and Information Management magazine.

To download the replay of the webinar (with audio), please go to http://bit.ly/hub-designs-webinar.  To download just the slides, please go to http://bit.ly/mdm-best-practices and click “Download”.

For the “When Data Governance Turns Bureaucratic” white paper mentioned in the presentation, go to http://bit.ly/data-governance.  Scroll to and click the link at the end of that article.

Thanks for attending the webinar (or the replay). We hope you found it valuable!

2
May

When Data Governance Turns Bureaucratic

How Data Governance Police Can Constrain the Value of Your Multidomain Master Data Management Initiative

(This appeared as a guest post on Informatica’s blog on Friday, April 30, 2010.)

I published a white paper last year, entitled “When Data Governance Turns Bureaucratic,” that looked at how reactive data governance was preventing organizations from realizing the full value of master data management (MDM). By “reactive”, I mean organizations using a “coexistence” architecture where front office applications (CRM) and back office applications (ERP) are still used to author master data (customer and product data, suppliers, employees, etc.). Because these applications remain the “Systems of Entry” while the MDM hub’s role is limited to being the “System of Record,” some of the biggest promises of MDM remain unfulfilled.

So, what exactly would proactive data governance look like? Essentially, the proactive model places more emphasis on business users being the owners of the master data. Rather than letting data stewards carry the burden of the central issues of accuracy and completeness, the accountability for these goals shifts towards the business users. Since end users are empowered to enter new master data directly into the hub, their trust in the accuracy and completeness of master data goes up, plus there’s less need for data stewards to act as the “data quality police.” Once users are no longer dependent on the CRM and ERP systems to perform initial entry and updating of master data, the data stewards can focus on managing exceptions and measuring data for quality, availability, security and usefulness. In this less-intrusive role, data stewards don’t present a bottleneck to critical business processes such as order management or invoicing.

By getting the master data right at the source, your initial level of quality for new records is much higher. The proactive style of data governance also effectively eliminates any time lags between the initial entry of a new master record, and its certification and publishing via middleware to the rest of the enterprise. As such, marketing campaigns can be done more quickly, with no upfront data remediation needed prior to launching a campaign. Finance benefits as well, since all of the data elements needed for a new customer are captured at once, and the hub-based process for adding a new customer can include pulling third-party content and calculating a credit limit, then passing that information back to the ERP system. Customer service benefits too, because all information is stored in one hub and made accessible through an efficient, user-friendly front end. Customer service reps are able to access all of the data needed for each customer interaction, as well as being able to author new data when necessary.

When is the right time to transition from reactive to proactive data governance? Some situations call for starting out immediately with the proactive approach, such as when you’ve got multiple CRM systems and ERP systems that would require integration with the hub in order to allow them to continue to operate as Systems of Entry, or when your current source systems are very brittle or hard to maintain or modify. In those cases, bite the bullet and plan from the beginning for proactive data governance.

Want to learn more about reactive vs. proactive data governance? You can download the complete white paper “When Data Governance Turns Bureaucratic” here.

16
Apr

Kalido MDM and AB InBev

The Gartner MDM Summit in Las Vegas wraps up today, and this morning I caught a session by Kalido’s President and CEO Bill Hewitt and Jonathan Starkey, the Director of Business Intelligence at AB InBev North America.

InBev purchased Anheuser-Busch in 2008 to form AB InBev, the largest brewer in the world, with over 116,000 employees worldwide and $39 billion in annual revenue.

AB InBev sees master data as a foundation element supporting supply chain management (SCM), enterprise resource planning (ERP) and customer relationship management (CRM). All of that data winds up in a data warehouse and is used for reporting and planning. This shared focus on both reporting and analysis, and planning and forecasting makes up their philosophy on business intelligence.

This integration approach is being used to bring together the Canadian and US operations gradually over time, but integrating the SCM, ERP and CRM pillars of the US and Canadian operations of such a large enterprise is realistically going to take three to five years.

Turning more to the master data side of things, the first way AB InBev is using Kalido is to synchronize and cross-reference product and customer information across SCM and ERP systems. Secondly, they’re using Kalido to look for active exceptions across all of the various domains – between plants and products, between employees in HR and in ERP, between any two systems where master data is not in agreement. Thirdly, they’re using Kalido to kick off requests for new master data – new employees, new products, etc. that then get passed to various systems around the company.
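To make the second use case (looking for exceptions between two systems) a bit more concrete, here is a minimal sketch in Python. It is not Kalido's implementation or API; the system names, record keys and field names are all hypothetical, and it assumes both systems share a common identifier, which in practice usually requires a cross-reference first.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]

# Hypothetical extracts of the same master data domain from two systems,
# keyed by a shared identifier. All values are made up for illustration.
hr_employees: Dict[str, Record] = {
    "E100": {"name": "Ana Ruiz", "plant": "St. Louis"},
    "E101": {"name": "Tom Lee", "plant": "Toronto"},
    "E102": {"name": "Kim Park", "plant": "Houston"},
}
erp_employees: Dict[str, Record] = {
    "E100": {"name": "Ana Ruiz", "plant": "St. Louis"},
    "E101": {"name": "Tom Lee", "plant": "Montreal"},
    "E103": {"name": "Raj Patel", "plant": "Toronto"},
}


def find_exceptions(
    left: Dict[str, Record], right: Dict[str, Record]
) -> List[Tuple[str, str]]:
    """Return (key, reason) pairs wherever the two systems disagree."""
    exceptions = []
    for key in sorted(set(left) | set(right)):
        if key not in left:
            exceptions.append((key, "missing in first system"))
        elif key not in right:
            exceptions.append((key, "missing in second system"))
        else:
            # Compare field by field so the steward sees what differs.
            for field in sorted(set(left[key]) | set(right[key])):
                if left[key].get(field) != right[key].get(field):
                    exceptions.append((key, f"mismatch on {field}"))
    return exceptions


for key, reason in find_exceptions(hr_employees, erp_employees):
    print(key, reason)
```

The point of the sketch is only that exceptions come in two flavors: records that exist in one system but not the other, and records that exist in both but disagree on an attribute.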

The “real world” benefits from Kalido at AB InBev include procurement savings, strategic inventory optimization, overhead and budget tracking, people and resource movement tracking.

AB InBev went through a rigorous selection process, and selected Kalido in large part because of its ability to change rapidly as their business needs changed. Jonathan Starkey said “Kalido does a very good job at managing change over time”.

I really enjoyed this session. Both Bill Hewitt and Jonathan Starkey did a great job, and it was enlightening to hear how a large global enterprise has addressed their MDM and business intelligence needs. Hub Designs recently became a Kalido partner, and one of our goals for this Gartner MDM Summit was to learn more about the company and their products, and this session definitely helped us do that.

For more information on Kalido, please visit www.kalido.com.

15
Apr

Evolving from Product MDM to Multidomain MDM

I’m attending the Gartner MDM Summit in Las Vegas, and this morning I caught a great session by Andrew White on the evolution from master data management (MDM) of product data to “multidomain MDM”.

Andrew started by talking about the strong intersection of product MDM with enterprise resource planning (ERP), workflow, product configuration, and business rules. The market for product MDM is fairly healthy and is actually a little larger than the market for customer MDM.

The initial need to master product data usually arises from having too many copies of product data in different places around the enterprise. Typically, product data quality issues then need to be addressed, and as a continuing process rather than a one-time effort.

Multi-channel commerce is known as the “sell side” of product MDM, and procurement is known as the “buy side”. There’s involvement with fulfillment and supply chain management, and with ERP and operations. There are many different silos that need to be connected and synchronized (one client I worked with last year had 175 different applications, systems and databases, most of which used or created product data in some way).

At some point, governance has to be addressed. Companies have to go from departmental or business unit governance to enterprise-wide data governance, and expand from single domain (typically customer) to multidomain (customer and product) master data governance.

Andrew mentioned the level of product MDM adoption – there was software license spending of $432 million in 2008. Certain industries such as discrete and process manufacturing, communications, retailing, and healthcare providers are classified as “hot” according to Gartner (as of Q1, 2010). Retail in particular is almost post-recession. Healthcare providers have more awareness on the buy side.

A common scenario for some is to have a product MDM hub as a system of record, connected to CRM systems for sales & marketing and customer service, to PLM (product lifecycle management) as a system of reference, and to ERP systems (which need the data for their Item Masters). So the CRM, PLM and ERP systems are process owners, but the MDM platform provides the product and material master data, attributes, hierarchies and so on, for consumption by the other systems.

Andrew talked about how the inquiries he gets break down: ERP and MDM: 50%, product data quality: 33%, information exchange: 15%, metadata management: 10%, content management: 20%, and “can I use my CDI hub to master product data?”: 10%.

Andrew talked briefly about the current vendors in the product MDM space: the specialists (handling just product data) such as Hybris Software, Heiler, QAD, Pindar, Tribold, Requisite Technology and EnterWorks. He categorized Stibo Systems, Riversand and Tribold as being somewhere in the middle between specialists and generalists (handling other domains).

Oracle, IBM and SAP are strong on product MDM and customer MDM. Tibco and Informatica (formerly Siperian) are customer MDM providers that are moving towards handling the product MDM domain. Microsoft is entering the MDM space but their solution (when it is released later this year) is really suited more for analytical use.

And other vendors such as Data Foundations and Orchestra Networks can model any domain of data, including product data.

Through the end of 2013, you might need two MDM platforms. IBM has three MDM products (IBM InfoSphere MDM Server, MDM Server for PIM which handles complex workflow, and their recent acquisition of Initiate). Other strong vendors include SAP, Oracle and Stibo Systems.

The five-year market growth rate is projected at 18%. The Top Five products have 51% of the market. Vendors to watch include Teradata, Informatica, Tibco and Hybris.

Over the next 12 months, product configuration remains an unsolved problem. Companies typically define business rules all over the place. Over the long term, in MDM, that doesn’t work – those business rules themselves need to be governed centrally. The master data and the business rules both need to be governed. Successful product MDM requires business rules governance.

Reference data is another area – price is NOT master data but it behaves like master data in a lot of ways. It needs to be governed and managed. Business process management and its intersection with MDM is another area of development.

Data quality for product data has its foibles. You need to know where you’re starting from. Most importantly, data quality is not a once and done thing, it’s an ongoing process.

The product master data life cycle looks like: Author > Store > Publish / Synchronize > Enrich > Consume > Analyze.

The picture for the future – there are three main “provinces” for MDM: the “thing” province, the “party” province and the “place” province. But vendors typically have a history in a single domain.

Andrew gave a couple of great examples of companies that went through the evolutionary process of going from a single domain of MDM to multiple domains over time.

Andrew closed with recommendations for people beginning their MDM process: create a vision of what could be achieved with a “single view of product data”, start small but think big and deliver value early, and define data and process metrics early and then revise them as needed as you go along.

I’ve been a big fan of Andrew White for several years now, and I thought he did a great job today (as usual). He brings a great deal of analysis to bear on the questions involved in product MDM, and provides clarity and insight into where the MDM market is headed over the next several years. If you’re attending the Gartner MDM Summit in Las Vegas, or have a chance to catch his sessions at a future event, I think you’d find those sessions very rewarding.

29
Mar

Answering Questions from LinkedIn

I got a good question via LinkedIn the other day, so I thought I’d answer it here:

Dear Dan,

I am a database architect but I am new to MDM and data governance, and I’m very interested in this area.

Can you please suggest where to start? I’ve found some information on the web (sometimes a bit disconnected), but I seem to be lost with so much information. Also, the tools that are currently available in the market – do they address all the challenges in this space?

One question I have is: if data quality is given the importance it deserves from the beginning of any project (operational or data warehouse), are MDM initiatives necessary? Are MDM projects needed because of the proliferation of applications that are developed in silos and that don’t consider what information is already available to the enterprise? In essence, should MDM be part of any project?

Thanks for your time.

Not to sound too self-serving, but I’d start with this blog and the MDM Community.

As for your question about whether MDM initiatives are necessary if data quality is given sufficient importance, please realize that MDM is a relatively new discipline which includes embedded data quality; it does not replace data quality.

What MDM does is sit between the source systems (typically CRM and ERP) and the data warehouse and business intelligence. So instead of trying to flow master data and transactional data directly into the warehouse for analysis, we bring it into the MDM system first, where it can be “mastered” – which includes fixing data quality issues. We then flow those corrections back into the source systems and downstream into the analytical systems. Which of course you can’t do without data quality tools. But data quality tools by themselves are not sufficient, because they typically don’t persist or store the data.
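For readers who think better in code, here is a deliberately tiny Python sketch of that "bring it into the hub, master it, then flow it onward" idea. Everything in it is an assumption made for illustration: the source system names, the fields, the standardization rules and the exact-match key. Real hubs use far more sophisticated data quality and matching engines.

```python
def standardize(record):
    """Very basic data quality step: collapse whitespace, normalize casing."""
    return {
        "name": " ".join(record["name"].split()).title(),
        "city": record["city"].strip().title(),
    }


def master(sources):
    """Build golden records plus a cross-reference back to source keys.

    `sources` maps a source system name to {source_id: record}.
    Matching here is a naive exact match on the standardized name and city.
    """
    golden = {}  # match key -> golden record
    xref = {}    # (system, source_id) -> master_id
    for system, records in sources.items():
        for source_id, record in records.items():
            clean = standardize(record)
            match_key = (clean["name"], clean["city"])
            if match_key not in golden:
                master_id = f"M{len(golden) + 1}"
                golden[match_key] = {"master_id": master_id, **clean}
            xref[(system, source_id)] = golden[match_key]["master_id"]
    return list(golden.values()), xref


# Hypothetical customer records arriving from CRM and ERP.
sources = {
    "crm": {"C1": {"name": "acme  corp", "city": "boston "}},
    "erp": {
        "V9": {"name": "ACME Corp", "city": "Boston"},
        "V10": {"name": "Globex", "city": "Austin"},
    },
}

golden_records, xref = master(sources)
print(golden_records)
print(xref)
```

The golden records are what would flow downstream to the warehouse, while the cross-reference is what lets corrections be pushed back to the right record in each source system. Note that the hub persists both, which is exactly what a standalone data quality tool typically doesn't do.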

Your next question, are MDM projects necessary because of the proliferation of apps developed as silos – yes, that’s a big part of it. Essentially, if you developed a new architecture from scratch, you’d put a multi-domain MDM hub, able to handle many types of master data, at the core, and you’d build data quality into it, then you’d surround it with integration so you can flow data from there to wherever it’s needed. So clean, accurate, consistent and timely master data would be available to any other IT project that was going on, but it would only have to be built once. “Build once, use many”, as they say.

Please keep reading and I hope you stay interested in MDM and data governance!

I got an answer from this person today that I thought I’d share with you here:

Dan,

Thank you very much for your insights. Whatever documents I had read about MDM had more to do with the people, process or technology, but didn’t cover the essence of MDM. I’ve gone through some of your blogs and I’m beginning to understand MDM.

14
Mar

Is It Taxonomy Season Already?

Like death and taxes, every Master Data Management (MDM) project goes through a taxonomy definition exercise.  During this time, Data Architects realize whether their payment of time thus far will yield a refund (of time) or require them to spend nights and weekends in jail (at the office).  Let this article serve as your free consultation with your personal Taxonomy Preparation professional.

An MDM taxonomy is simply a structured hierarchy applied to the topic of the MDM project (for example: products, people, or customers) that defines that topic’s attribution.  At each level, this hierarchy enforces the inheritance of characteristics to all of its children and their children. For example, the taxonomy of biology that has remained in my memory since 6th grade contains the levels Kingdom, Phylum, Class, Order, Family, Genus, and Species. Any animal or plant can belong to only one member of the lowest level, species, and each level of the taxonomy defines the inherited characteristics of its children. The same concept is core to an MDM design, and each widget in an MDM topic can reside in only one of the lowest taxonomy levels.

The number of levels in the MDM taxonomy varies based on the business need, topic, and a count of the widgets in the topic. There are standards available to guide you in level counts and names if you want to follow them, but the assignment of attributes, definitions, and placement of your widgets in the structure is business-specific.  Plan for a significant investment of effort to get the taxonomy and item assignments correct.  This effort should result in the business agreeing on a taxonomy containing the fewest levels necessary to accurately represent the MDM topic’s widgets, along with a few other guidelines.

The topic of an MDM project may have many business purposes and be categorized by business users in a variety of different ways. This is expected and encouraged. We are not trying to restrict how the business analyzes the topic’s widgets.  The taxonomy we are concerned with is a single hierarchy defining widgets through attribution characteristics as described in the prior biology example. We do this to create a single unambiguous definition that can be applied to every existing and new widget so that each widget falls under one and only one of the taxonomy’s lowest levels. The business must validate the one widget per lowest taxonomy level rule, what attributes are common to each level, and that the attributes of any level apply to all levels below it. The taxonomy not only results in a standardized method of defining widgets, but also allows for automatic inheritance of widget properties during definition which reduces the workload and chances of errors during the widget information entry.
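The two rules described above (attributes are inherited down the hierarchy, and each widget sits under exactly one lowest-level node) can be sketched in a few lines of Python. This is only an illustration under assumed names: the classes, the beverage levels and the attribute names are all hypothetical, not taken from any MDM product.

```python
class TaxonomyNode:
    """One level in the taxonomy; attributes flow down to all descendants."""

    def __init__(self, name, attributes=(), parent=None):
        self.name = name
        self.own_attributes = list(attributes)
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def is_leaf(self):
        return not self.children

    def all_attributes(self):
        """Attributes inherited from every ancestor, then this node's own."""
        inherited = self.parent.all_attributes() if self.parent else []
        return inherited + self.own_attributes


class Taxonomy:
    """Enforces the 'one widget, one lowest-level node' rule."""

    def __init__(self, root):
        self.root = root
        self.assignments = {}  # widget id -> leaf node

    def assign(self, widget_id, node):
        if not node.is_leaf():
            raise ValueError(f"{node.name} is not a lowest-level node")
        if widget_id in self.assignments:
            raise ValueError(f"{widget_id} is already assigned")
        self.assignments[widget_id] = node

    def attributes_for(self, widget_id):
        return self.assignments[widget_id].all_attributes()


# Hypothetical three-level product taxonomy.
beverage = TaxonomyNode("Beverage", ["brand", "volume_ml"])
beer = TaxonomyNode("Beer", ["abv"], parent=beverage)
lager = TaxonomyNode("Lager", ["fermentation_temp"], parent=beer)
ale = TaxonomyNode("Ale", ["yeast_strain"], parent=beer)

taxonomy = Taxonomy(beverage)
taxonomy.assign("SKU-001", lager)
print(taxonomy.attributes_for("SKU-001"))
```

Because a new widget picks up every attribute defined above its leaf, whoever enters it only has to choose the right lowest-level node, which is where the reduced workload and lower error rate come from.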

Expect to encounter puzzled looks when introducing the concept of attribution-driven taxonomy. Business subject matter experts do not think of their widgets in those terms.  Instead, they will be thinking in terms of how the business reports on the widgets.  The distinction is clear only when you remember the purpose the taxonomy serves. Conducting workshops with business users across the board promotes the required consensus. After a few episodes of realigning discussions from a reporting mindset into an attribution mindset, the users will start to change their thinking and the results will be a valid taxonomy that the MDM initiative can grow on.  Without this foundation, your success will be limited.

25
Feb

And Then There Were Five

The landscape of the MDM hub vendors has shifted quite a bit in the last month. Siperian has been acquired by Informatica, and Initiate Systems has been acquired by IBM.

What does this mean for the average Fortune 1000 company buying MDM technology? Not as much as you might think.

On the mega-vendor side, they’ve still got Oracle, IBM and SAP to choose from.  IBM, obviously, now has three MDM platforms to offer (InfoSphere MDM Server, InfoSphere MDM Server for PIM, and Initiate Systems) where they used to have two. But Oracle has three as well, and will soon have four: Customer Data Hub and Universal Customer Master for customer MDM, PIM Data Hub for product MDM, and Fusion MDM Hub, Release 1 of which is supposed to ship later in 2010.  And SAP continues to forge ahead with improved versions of their NetWeaver MDM product. So the recent consolidation doesn’t seem to have affected the mega-vendors that much – “the big get bigger”, you might say.

Outside of the “Big Three”, I continue to think Siperian being acquired by Informatica is a good thing, for Siperian’s customers, for the product roadmap, and for the market as a whole. Informatica brings a lot of expertise in integration and data quality to the table, and its Identity Systems matching engine and Address Doctor data cleansing tools are very good at what they do. It will be interesting to see how Informatica integrates Siperian as a company and as a product into itself, but I have a lot of confidence that they’ll do it well.

All this does pose an interesting issue for Oracle, however. Oracle made a big commitment to Informatica in its Fusion MDM Hub by including Informatica components for matching and cleansing on an OEM basis. But by buying Siperian, Informatica has declared itself a direct competitor in the MDM market. So there’s a lot of speculation as to what Oracle will do about this. In the short term, it may be too late to pull the Informatica components out of Fusion MDM Release 1.0, but longer term, I wouldn’t be surprised to see the Informatica components either replaced or deemphasized, perhaps with an open architecture approach allowing other third party identity resolution / matching and address cleansing products to be plugged in, in place of Informatica’s. Although there’s also been a lot of speculation about Oracle buying Informatica.

D&B/Purisma remains an interesting player. Disclosure: prior to starting Hub Designs, I worked for D&B. I saw D&B’s launch of a hosted version of Purisma last fall and was impressed by it. For a lot of situations, Purisma’s product can be a good solution. So even though I wouldn’t call Purisma a full-fledged master data management solution, it’s worth keeping an eye on because it does a great job of integrating internal customer data with D&B’s external reference data. And having it available on a hosted basis can be very helpful.

So the bottom line is, where there used to be six players, now there are five.  Of course, the MDM market being as hot as it is, everyone and their brother claims to be an MDM solution, but these are the five products that I pay the most attention to, and that we see the most in the marketplace. What do you think?  Please let us know by commenting here.

17
Feb

Long Live MDM

Editor’s Note: Today’s post was written by Jeff Schaffzin. Jeff is an independent consultant with over 15 years of experience in high tech. He’s worked with a number of leading software vendors in roles such as product marketing, professional services and information technology. Specializing in data management, Jeff has spent the last three years focusing on Customer Data Integration and Master Data Management and has worked with a number of high profile companies in the United States and abroad.

DISCLAIMER: While the facts that I’ve included here are true, I’m speculating on the reasons why they’re taking place. I have no affiliation with any company mentioned here, nor should my opinions be construed as knowledge of their actions.

If you, like me, have followed MDM for the past year or two, you knew that what has been happening recently was going to happen sooner or later. Whether it was due to choice or necessity, MDM has been in the IT press a lot lately. Oracle acquired Silver Creek to enrich its product information management offering. Talend has announced and started to promote its open source MDM application. Data integration provider Informatica acquired Siperian recently in order to enter the MDM space and IBM recently acquired Initiate Systems as well.

Each of these events leads to one key question – how will this impact MDM in the short term and in the future? Given my understanding of the space, I think three scenarios are likely:

Scenario 1

It is hard to ignore the movements that IBM and Oracle have been making in the past year or so. In a market economy, the goal is to have as much market share as possible. In order to do this, you either build new products or acquire existing companies that have the technologies that you want.

While each company has done a combination of both building and buying solutions, their strategic plans are hardly a secret. IBM has proposed a vision of an end-to-end data management platform, which includes their MDM offering as well as reporting tools like Cognos and analytics/statistics from SPSS. Now that IBM has acquired Initiate Systems to complement their MDM stack, the question is why. Do they want to be known as a serious player in the health care industry? There could be other reasons too – they may consider MDM just a small piece of their data management toolkit and this could solidify that position, moving MDM from one of the hottest ‘technologies’ out there to just a “means to an end” to increase market share for their software business unit. Regardless of the reason, it means one less major MDM player in the market.

Then we have Oracle. For as long as I can remember, Oracle has been promoting its Fusion strategy. For those of you who are not familiar with it, Fusion is Oracle’s attempt to provide one code base that would pull together the applications it has built and purchased. This momentous undertaking was finally demonstrated at last year’s Oracle OpenWorld (while Oracle continued to acquire other companies such as Silver Creek Systems).

However, as with IBM, one can only speculate on where MDM fits in this Fusion strategy. Oracle has always promoted its database first and sold its applications second. Even with the numerous special-purpose hubs they’ve been developing lately, could this finally be the technology that enables Oracle to move from being a database vendor to being a true platform player? Only time will tell with this one.

Scenario 2

There’s always the possibility that MDM has been considered the “secret sauce” – the so-called missing link – that rounds out the product lines for data integration/migration vendors.

Talend’s acquisition of French software company Amalto provided them a way to enter the MDM space. The open source vendor has been a darling of the analysts for a number of years and even won an award from Gartner, one of the first (if not the first) given to such a company. However, since I don’t have contacts within Talend, it’s not clear what their next step will be; they seem to be focusing their energies mostly on MDM after hiring two people to drive that effort within the past six months or so.

As the de facto leader in data integration, Informatica needed to extend its reach beyond that space. If you look at their job listings, they are looking for someone to market their CEP (Complex Event Processing) efforts. Relatively recently, they were looking to hire someone who had experience with ERP or MDM, but it is unclear which path they decided to take with that. Regardless, there were always looming rumors of them wanting to add MDM to their portfolio with the press suggesting that they would acquire Initiate Systems. However, instead of buying them, they purchased Siperian – a company half its size in terms of customer base and revenue.

In either of these cases, MDM may not be their flagship product, but at least it shows that it is a viable technology and shows that it is something that won’t be going away any time soon.

Scenario 3

People like me who have been in the data management space are always interested in improving something. I believe in the statement, “even if something isn’t broken, there’s always a reason to make it better.” This was clear when Customer Data Integration (CDI) first came about and many companies hopped on that bandwagon, knowing that they wanted a way to track their customers more efficiently.

At the same time, other companies explored Product Information Management (PIM), a way to have a single source of product information which was sourced from PLM, inventory and supply chain systems. Following that was the concept of MDM – a beautiful vision – having a single source of truth that can be used by an entire company.

Now we have a new concept that has been promoted – Multi-domain MDM. Siperian and other companies have begun to promote this to show the world that they are truly the most advanced players out there. While this has been going on, there have been rumblings about Enterprise Information Management (EIM). What I’m still not clear on is – what’s the difference between multi-domain MDM and EIM? Are they the same? If not, what are the differences between the two concepts?

In any case, there’s a lot to think about. I don’t know where you stand, but one thing is certain – MDM is not going away, at least for the foreseeable future.

10
Feb

Data Profiling For All The Right Reasons, Part 5

The Hub Designs Blog welcomes the final installment of this great series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 5: The Profiling Payoff

This is the final part of a five-part series, describing how data profiling benefits both IT projects and business operations.  In Part One, we discussed profiling perspectives.  In Parts Two, Three and Four, we introduced the value of system, entity, and attribute-level metrics.  This part discusses the archival and beneficial uses of profile results.

If you have defined your corporate data profiling strategy similar to the methods discussed in the preceding parts of this series, you’ll have amassed a robust collection of metadata spanning relevant systems across your business.  Although systems may be of different types and locations, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically. Ideally, this information will exist in an open-source database repository with reports made available across the enterprise. System and Entity information help planners and developers organize information strategies. Attribute-level domains, constraints, and business rules help data architects understand existing systems. Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.

If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information. To keep your repository relevant to business needs, design collection rules to be configurable. This allows you to easily ignore superfluous information or enable tests only at certain critical times. Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline. As you learn about the information, you will see which tests or which data objects add no value. Those of us geeky DBA-types who understand system-level catalogs have our own scripts to do much of what was described in Parts Two, Three and Four. Those less inclined may prefer to use a third-party tool for profiling. Either way works as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
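A minimal sketch of what such a repository might look like, using SQLite purely for illustration. The table names, column names, and the sample rules are all assumptions of mine, not part of any particular tool: one table holds configurable collection rules with an on/off indicator, and another holds timestamped results with a baseline flag.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE collection_rule (
    rule_id     INTEGER PRIMARY KEY,
    entity_name TEXT NOT NULL,
    metric_name TEXT NOT NULL,
    enabled     INTEGER NOT NULL DEFAULT 1   -- switch a test off without deleting it
);
CREATE TABLE profile_result (
    rule_id      INTEGER NOT NULL REFERENCES collection_rule(rule_id),
    metric_value REAL,
    collected_at TEXT NOT NULL,              -- ISO-8601 timestamp of the run
    is_baseline  INTEGER NOT NULL DEFAULT 0  -- marks the initial, wide profiling pass
);
""")
conn.executemany(
    "INSERT INTO collection_rule (entity_name, metric_name, enabled) VALUES (?, ?, ?)",
    [("orders", "row_count", 1), ("orders", "pct_null_ship_date", 0)],
)

def enabled_rules(connection):
    """Return (rule_id, entity_name, metric_name) for rules currently switched on."""
    return connection.execute(
        "SELECT rule_id, entity_name, metric_name FROM collection_rule "
        "WHERE enabled = 1 ORDER BY rule_id"
    ).fetchall()

def record(connection, rule_id, value, baseline=False):
    """Store one metric value together with the time it was collected."""
    stamp = datetime.now(timezone.utc).isoformat()
    connection.execute(
        "INSERT INTO profile_result VALUES (?, ?, ?, ?)",
        (rule_id, value, stamp, int(baseline)),
    )

for rule_id, entity, metric in enabled_rules(conn):
    record(conn, rule_id, 1250.0, baseline=True)  # placeholder value, not a real measurement
```

Disabling a rule rather than deleting it keeps its history intact, so a test can be re-enabled at a critical time without losing the earlier baseline.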

You will find that collecting and maintaining this level of detail has a definite cost. Even if the collection is automated, interrogating large data sets places an overhead on production systems that may not be practical. Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities. Realize that the extent of data profiling is contingent on the project phase, specific data elements, and most of all, business value. Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.

How much profile history to maintain is another consideration.  Even though disk is “relatively” cheap, maintaining all historical entries in a live repository may not be necessary. Consider business needs and value for historical profile information. Even consider archiving at a summarized (or less frequent) level and keep only a limited time window of statistics online.
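One way to picture the "limited time window plus summarized archive" idea is the sketch below. It assumes profile history is a list of (date, value) pairs, a 90-day online window, and monthly averages for anything older; the window length, the monthly grain, and the sample numbers are illustrative choices of mine, not recommendations.

```python
from collections import defaultdict
from datetime import date, timedelta

def split_history(entries, today, keep_days=90):
    """Split (day, value) entries into recent detail and monthly averages of older data."""
    cutoff = today - timedelta(days=keep_days)
    recent = [(d, v) for d, v in entries if d >= cutoff]
    buckets = defaultdict(list)
    for d, v in entries:
        if d < cutoff:
            buckets[(d.year, d.month)].append(v)
    archived = {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}
    return recent, archived

# Hypothetical row-count history for one entity.
history = [
    (date(2009, 10, 5), 1000.0),
    (date(2009, 10, 19), 1100.0),
    (date(2009, 11, 2), 1200.0),
    (date(2010, 1, 25), 1500.0),
]
recent, archived = split_history(history, today=date(2010, 2, 10))
print(recent)    # detail kept online
print(archived)  # one averaged value per (year, month) for older entries
```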

This discussion on data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way. The blog format is not conducive to in-depth discussions, but hopefully the topics covered here spur some thoughts into how you can add value to your business by implementing some of these concepts.  Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value to somebody, it might not be worth the overhead to continue recording it.

Go back to Part 4.

28
Jan

Data Profiling For All The Right Reasons, Part 4

The Hub Designs Blog welcomes Part 4 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 4: Profiling Relationships and Patterns

This is part four of a five-part series describing how data profiling assists in all aspects of system development, from design through deployment.

Part One introduced different perspectives on data profiling. Part Two identified valuable system and entity metrics to track. Part Three discussed attributes. In this segment, we dive deeper into attribute relationships and pattern recognition. Also, we expand on primary key identification discussion and discuss hidden relationships.

Pattern grouping provides a mask of distinct format patterns within an attribute data set and a count of the number of occurrences. Patterns give insight into the type of values found in an attribute. For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.

Observing distinct patterns gives insight into the maximum digits and precision, and also into domains such as integer or real. Pattern analysis of a database date or date-time type produces unremarkably similar patterns for all dates. Because the database management system typically enforces the domain, pattern analysis of such dates provides no value and can be ignored. If dates are stored in character format, however, patterns quickly show variations in date formatting. Character patterns only have significance for a limited number of positions. It makes no sense to pattern a description field of 200 or 2000 characters. Smaller code attributes of fewer than 10 characters, though, do provide value. Ignore pattern profiling for character strings over 20 characters at first, then refine to shorter character strings if the results do not add value.
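As a rough illustration of pattern grouping, here is a short sketch. It assumes a simple masking convention (every digit becomes 9, every letter becomes A, everything else is kept), and the function names and sample values are made up for the example.

```python
from collections import Counter

def mask(value):
    """Replace each digit with 9 and each letter with A; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_counts(values, max_len=20):
    """Count distinct masks, skipping NULLs (None) and strings longer than max_len."""
    return Counter(mask(v) for v in values if v is not None and len(v) <= max_len)

# Hypothetical date column stored as text.
dates_as_text = ["2010-01-28", "01/28/2010", "2010-02-10", "28-JAN-10", None]
for pattern, count in pattern_counts(dates_as_text).most_common():
    print(pattern, count)
```

Three different masks for one "date" attribute is exactly the kind of formatting variation that a character-stored date tends to hide. The `max_len` cutoff mirrors the advice to skip long character strings at first.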

In pure database theory, referential integrity (RI) is your friend. In practice, designers and software vendors often forgo RI to improve system performance on data inserts. These designers place the data quality burden on the application and do not endorse external data manipulation outside the application interfaces. In the real world, though, data corruption occurs and without RI or routine data quality checks, corruptions may not be found for a long time or not at all. Personally, I have identified over $50,000 of recent orphaned sales in a retail client resulting from deliberately disabled RI. These unreported sales were not added to the ledger and were allowed to occur for performance reasons until I found them through simple profiling. Enforcement of RI is a topic for another discussion but is mentioned here because it does identify a valid reason for data profiling.

Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons. First, interrogate the RI listed in the system catalogs to identify all enforced relationships. Reverse-engineering a system with a good modeling tool is probably the best way to do this. A harder and more valuable analysis is to identify unenforced relationships and to determine the probability of a relationship when not all values match exactly. Do this by counting all the candidate child attribute values that exist within a known parent attribute table. If all match and there is a non-trivial number of matches, there is a good probability of a non-identified relationship. A small number of mismatches could identify data quality issues.
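The counting step can be sketched as follows, assuming the candidate child values and the known parent keys have already been pulled into Python lists. The column names and values are hypothetical.

```python
def match_rate(child_values, parent_keys):
    """Return (matched, total, rate) for non-NULL child values found among the parent keys."""
    parents = set(parent_keys)
    candidates = [v for v in child_values if v is not None]
    matched = sum(1 for v in candidates if v in parents)
    total = len(candidates)
    rate = matched / total if total else 0.0
    return matched, total, rate

customer_ids = [1, 2, 3, 4]                     # hypothetical parent key column
order_customer_ids = [1, 1, 2, 4, 4, 9, None]   # hypothetical candidate child column

matched, total, rate = match_rate(order_customer_ids, customer_ids)
parent_set = set(customer_ids)
orphans = sorted({v for v in order_customer_ids if v is not None and v not in parent_set})
print(matched, total, round(rate, 2), orphans)
```

A rate near 1.0 on a non-trivial number of rows suggests an unenforced relationship, and the short orphan list is where a data quality investigation would start.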

In Part 5, we tie all the techniques discussed in the first four parts together to show the value of a repeatable data profiling process.

Continue with Part 5 or go back to Part 3.

25
Jan

Data Profiling For All The Right Reasons, Part 3

The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 3: Attribute-Level Analyses

This is part three of a five-part series on data profiling.

In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy.  The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect.  In this section, we introduce attributes, the lowest-level components of a profiling effort.

An attribute is simply an individual data element. Alone, an attribute has no context. The simple descriptor “Cost” tells us very little about the attribute’s true purpose and immediately drives a need for additional information, such as units (hours, Dollars, Euros…), type (weighted, unit, gross…), and use (invoice, sum, average…). Attributes therefore must be analyzed within the context of their business purpose to have meaning.

Some characteristics require business knowledge to define and others can be determined through interrogation of existing values and underlying rules of the storage medium. It takes both analyses to get a complete picture of information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, only assumptions can result from physical profiling or business context.

Analyses of values, domains, and constraints allow insight into use (or abuse) of an attribute. The larger the sample size, the more confidence you gain in the results. Without explicit proof of business rule enforcement, though, you must assume that just because a value does not presently exist does not mean it cannot exist. Business rules are defined by business experts and enforced through database constraints, data type/precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not totally understand it. Profiling of actual values provides additional refinement in terms of percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
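Several of these value metrics can be sketched in a few lines, assuming the attribute's values have been sampled into a Python list with None standing in for NULL. The function name and sample data are illustrative; the average value is left out because it only applies to numeric attributes.

```python
from collections import Counter

def profile_attribute(values, top_n=3):
    """Compute basic value metrics for one attribute; None stands in for NULL."""
    total = len(values)
    present = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in present]
    counts = Counter(present)
    return {
        "pct_null": 100.0 * (total - len(present)) / total if total else 0.0,
        "pct_distinct": 100.0 * len(counts) / total if total else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "top": counts.most_common(top_n),
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "avg_len": sum(lengths) / len(lengths) if lengths else None,
    }

# Hypothetical state-code column with one NULL and one non-standard entry.
state_codes = ["NH", "MA", "NH", None, "ME", "NH", "Mass."]
result = profile_attribute(state_codes)
for name, value in result.items():
    print(name, value)
```

A maximum length of 5 on what is supposed to be a two-character code is the sort of signal these simple metrics surface without any business knowledge.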

Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define uniqueness of the data set also may support relationships between entities.  Uniqueness can be further classified as being either members of a system-enforced primary key or of a business key (outside of the defined primary key).  System-enforced primary keys are relatively easy to define within a database system through interrogation of the system catalog.  Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
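For business keys that are not declared in the catalog, one brute-force approach is to test column combinations for uniqueness in a sample. The sketch below assumes rows are available as dictionaries and only tries combinations up to two columns wide; the table and column names are invented for the example.

```python
from itertools import combinations

def candidate_keys(rows, columns, max_width=2):
    """Return minimal column combinations whose values are unique across all rows."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip combinations that contain an already-found smaller key.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found

# Hypothetical inventory rows: "id" is the system key, (store, sku) the business key.
rows = [
    {"id": 1, "store": "A", "sku": "X1", "qty": 5},
    {"id": 2, "store": "A", "sku": "X2", "qty": 5},
    {"id": 3, "store": "B", "sku": "X1", "qty": 7},
    {"id": 4, "store": "B", "sku": "X2", "qty": 5},
]
print(candidate_keys(rows, ["id", "store", "sku", "qty"]))
```

Uniqueness in a sample only makes a combination a candidate. It still takes a business expert to confirm that the combination is meant to be unique.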

Attribute-level information of interest includes: data type (size and precision), the number and percent of NULL values, column descriptions, number and percent of distinct values, and the minimum-maximum-average values and lengths. The system catalog provides some of this information, but the rest must be collected by sampling the data.

Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may either allow you to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.

Part 4 expands on attribute profiling with the introduction of relationships and patterns.

Continue with Part 4 or go back to Part 2.

18
Jan

Data Profiling For All The Right Reasons, Part 2

The Hub Designs Blog welcomes Part 2 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 2: Profiling the Basics

This discussion is the second of a five-part series on data profiling. In Part 1, we discussed the project roles that benefit from data profiling and how better understanding information results in more reliable information systems. Important goals of any profiling strategy include automation of metric collection and socializing results to support the differing objectives of a data-centric project.

Early in a system development life cycle, profiling helps define sources, data storage requirements, and data transformations. As a system goes into production (or if profiling is added to an existing system for quality control purposes), routine profiling is useful to audit system quality and business rule enforcement. The frequency of collection and amount of effort you expend to automate your profiling methods should be based on the ability of the organization to benefit from the profile results.

This section discusses the beginnings of a profiling effort. Information assembled here forms the foundation of other profiling activities. For this discussion, consider a Profile Group as a set of information sharing a common purpose and data management methods. Examples of profile groups include tables within a single database schema or a group of spreadsheets with the same format but each spreadsheet representing a different time slice of data.

The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site being accessed through web services. The reason we abstract information into Systems is to group the information into distinct governance methods common to the underlying information. Relevant metadata and governance methods we track at the system-level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems. System-level metadata common to a profile group helps us understand and troubleshoot future analyses. This level of information also provides developers with an understanding of inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.

Entities within a profile group belong to the same system, may have a common unique identifier, and, for database entities, have the same schema owner. Typically, entities are database tables, but may also be similar files or spreadsheet tabs containing like attribute lists. For entities, we track characteristics common to all the attributes they contain. These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive, existence of change data management attributes such as insert/update timestamps, and existence of audit traceability indicators such as insert/update username.
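To make the entity-level metrics concrete, here is a small sketch against SQLite, chosen only because it ships with Python. It collects row counts and checks for change-tracking columns; the column names `updated_at` and `updated_by` and the sample tables are assumptions for the example, and other databases expose their catalogs differently.

```python
import sqlite3

def entity_metrics(connection):
    """Return row count and audit-column flags for each table in a SQLite database."""
    tables = [r[0] for r in connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    metrics = {}
    for table in tables:
        # Table names come from the catalog itself, so quoting them is sufficient here.
        columns = {r[1] for r in connection.execute(f'PRAGMA table_info("{table}")')}
        rows = connection.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        metrics[table] = {
            "row_count": rows,
            "has_update_timestamp": "updated_at" in columns,   # assumed column name
            "has_update_username": "updated_by" in columns,    # assumed column name
        }
    return metrics

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT, updated_by TEXT);
CREATE TABLE region (code TEXT PRIMARY KEY, name TEXT);
INSERT INTO customer (name) VALUES ('Acme'), ('Globex');
INSERT INTO region VALUES ('NE', 'Northeast');
""")
for table, values in entity_metrics(conn).items():
    print(table, values)
```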

The combination of system and entity level profiling supplies the foundation for the attribute-level profiling, which is where physical information in a system resides. It also provides valuable metadata to classify information and allows for future correlation of like information across systems. Assembly and publication of entity and system level information benefits the various consumers of the information by providing a centralized “master” source of contact and context information.

In Part 3, we will dive into the attribute level analyses around data profiling.

Continue with Part 3 or go back to Part 1.

10
Jan

Data Profiling For All The Right Reasons, Part 1

The Hub Designs Blog welcomes another guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 1: The Psychology of Data Profiling

Swiss psychologist Carl Gustav Jung founded the Analytical School of Psychology. His word association theories form the basis of the Myers-Briggs Type Indicator Assessment test to identify career aptitude in today’s high school students. Dr. Jung’s approach assigned personality profiles based on the associations an individual’s thoughts made with various phrases. By analyzing responses, he could understand how an individual viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought entering their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?

If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.

If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.

If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.

Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive compulsive data architects compelled to organize the world into strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, existence of NULL or default values, and understanding the meaning of each data element to accurately identify business rules and when two or more objects are related or redundant.

Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.

It is truly amazing how much we can conclude from the response to one simple phrase.

As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.

Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.

This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling. In particular, what information is important to collect at the system and table level and how can that information be leveraged by the Enterprise to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.

Continue with Part 2.

19
Nov

Calendar and MDM

The Hub Designs Blog welcomes a guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Most business intelligence architects are well versed in the value of the time dimension.

With query performance and the need to support complex analyses being the two most important considerations in BI, a flattened set of time dimensions provides a multitude of options to represent and standardize time with limited overhead.

It’s easy to see the value of having a flexible, consistent, and integrated representation of time when thinking of business activities. Aspects such as when a transaction or activity occurs in relationship to other transactions, activities, or even pre-defined thresholds form the basis of Business Process Management activities. And accounting departments group transactions into time periods every financial reporting period.

So, how valuable can these same time dimensions be to a Master Data Management solution? If you are well versed in MDM at this point, you’re probably saying “What you’ve talked about so far is useful for relating transactions but it doesn’t tie back to mastering business objects like customers, products, or locations”.

But remember that mastering those objects does require standardization during information acquisition and publishing and that the various inputs and outputs to an MDM system are often diverse. Also, don’t underestimate the value of mastering “Time Tables” themselves as a component in your MDM universe.

First, let’s define just what we mean by a set of time tables before we apply them to MDM. A typical implementation would have two distinct groups of tables to represent time: day, and time-of-day. At the lowest level of the day group is a day-level table with every imaginable way the business can identify a day, such as: by its day of year, week, month, quarter, advertising week (for retail), same day last year (in some special context), or special tags like holiday, weekend, season, positional sunrise/sunset times, or even astrological sign and full moon cycles. And that just covers the calendar view of the business. There is an equally important and extensive set of calendar hierarchies and attributes associated with the business fiscal reporting needs. Add to that every way you want to represent attributes like day of the week or month of the year (number, 3-letter abbreviation, full name) and ending up with over 100 attributes in the day-level table is not uncommon.

Related to the day-level table are hierarchy tables at levels such as: month, quarter, year (and their fiscal counterparts). Each of the hierarchy tables contains all the attributes that define that level and higher levels. For example, the calendar month table would contain attributes defining month of year, month of quarter, and month overall, in addition to quarter and year and all the ways to call the month. Primary keys for the higher level hierarchy tables, like month, would have child entries in the lower level tables, like day, for every entry that rolls into the higher level.
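A tiny sketch of how one row of the day-level table might be derived is shown below. It covers only a handful of the attributes mentioned above, and it assumes a fiscal year that starts in July and is labeled by the calendar year in which it ends; that fiscal convention and all of the attribute names are illustrative assumptions, since every business defines its own.

```python
from datetime import date

def day_row(d, fiscal_year_start_month=7):
    """Build one day-level row; the fiscal year is labeled by the calendar year it ends in."""
    if fiscal_year_start_month > 1 and d.month >= fiscal_year_start_month:
        fiscal_year = d.year + 1
    else:
        fiscal_year = d.year
    return {
        "day_key": d.strftime("%Y%m%d"),
        "day_of_week_num": d.isoweekday(),        # 1 = Monday ... 7 = Sunday
        "day_of_week_abbr": d.strftime("%a"),     # locale-dependent text
        "day_of_week_name": d.strftime("%A"),     # locale-dependent text
        "day_of_year": d.timetuple().tm_yday,
        "month_key": d.strftime("%Y%m"),          # links to the month-level hierarchy table
        "month_num": d.month,
        "quarter": (d.month - 1) // 3 + 1,
        "year": d.year,
        "is_weekend": d.isoweekday() >= 6,
        "fiscal_year": fiscal_year,
        "fiscal_quarter": ((d.month - fiscal_year_start_month) % 12) // 3 + 1,
    }

row = day_row(date(2009, 11, 19))
for name, value in row.items():
    print(name, value)
```

Because the month, quarter, and year attributes are repeated on the day row, most queries never need to join up to the higher-level tables; the `month_key` is there for the cases where data arrives at a coarser grain.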

The same holds true for time of day, with hierarchies like hour, minute of hour, shift, peak time, off-peak time, and others.

Because all the higher-level attributes are repeated in the lower-levels, there is typically not a compelling need to join the two tables. The relationships are there for flexibility. Having the various hierarchy tables as stand-alone entities allows you to attach them to business tables at all of the levels you collect or report time values. These tables and hierarchy relationships allow you to easily merge data of different time grains.

The best thing about time is that time is constant. There are always sixty seconds to the minute, sixty minutes to the hour, twenty-four hours to the day (excluding Daylight Saving Time adjustments), and seven days to the week; the number of days in each month is fixed and the number of days in a year is predictable. Except for adjustments to fiscal calendars and special events, most of the information related to time hierarchies is static.

BI uses these techniques to conform information allowing it to readily apply to many views of the business… which sounds a lot like the same business issues we try to solve when integrating data within an MDM solution.

Introducing a robust set of Master Time dimensions into an MDM architecture opens up flexibility in how you consolidate information and also how you can apply it to many business purposes. It’s a natural expansion of MDM to include a master version of the corporate calendar (particularly the fiscal calendar) using a common set of time-related identifiers complete with any time references relevant to business operations.

Please let us know what you think of mastering the Time dimension or other types of corporate reference data in the MDM hub by leaving a comment here.

13
Oct

First Day at Oracle OpenWorld

Having a dedicated MDM track at Oracle OpenWorld this year makes a big difference, in terms of being able to find the sessions more easily and in the focus and energy in the sessions.

First up today was a panel discussion on Hyperion Data Relationship Management (DRM).  It was moderated by my friend Rahul Kamath from Oracle, and included Dongyan Wang from NetApp, Anand Raaj from Halliburton, and Nimish Mehta from Lumendata. It was very well done, and gave some good insights into the role that DRM can play as a hierarchy management tool in an MDM environment.

Next was Pascal Laik, VP of MDM Product Strategy at Oracle, who co-presented with Cisco’s Kin-Ching Wu.  Pascal talked about the reality of complex, heterogeneous environments, and the difference between “push mode” and “pull mode”. He discussed the business drivers of growth, efficiency, IT agility and compliance, and the hard work Oracle has been doing over the past couple of years to help its customers to create their business cases and document the ROI that MDM has been realizing for them. Pascal laid out Oracle’s end-to-end data quality, pre-built integration and data governance strategies, and announced the new Data Governance Manager as a way to Define, Operate, Monitor and Fix data in the hub. Interestingly, 95% of the applications that Oracle customers integrate with are non-Oracle applications.

KC Wu from Cisco discussed their Customer Registry program, which draws data from 40 source systems and publishes it to about 80 downstream systems. She described a fascinating 10-year journey up the MDM maturity model.

The highlight of the next session for me was Bill Miller, a senior IT person at Oracle whom I’ve known for several years, who recently successfully implemented Oracle Customer Hub 8.0 at Oracle. It was very interesting to hear him describe how Oracle has put in place a lot of customer MDM and data governance best practices.

The last session of the day was Vanessa Hsu from Oracle, along with Kelle O’Neal from First San Francisco Partners and Angie Couron from Symantec. They did a great session on enterprise data governance, and gave a “first look” at Data Governance Manager.

16
Sep

Webinar with Initiate Systems

“Master Data Management: The Sliding Scale Between Build and Buy”

Replay of the webinar with Dan Power and Marty Moseley

Please join industry experts Dan Power, Founder and President, Hub Solutions, and Marty Moseley, CTO, Initiate Systems, for this webinar where we’ll outline the best practices that have evolved to support organizations in making the critical “build vs. buy” decision.

Master data management (MDM) transforms data integration and business processes. Many organizations are exploring an MDM solution and will eventually have to answer the build vs. buy question. The combination of build and buy for MDM depends on the individual organization’s circumstances, goals and objectives. As MDM has evolved, so have the best practices for considering how much should be built and how much should be bought.

Some key considerations include:

  • What are your current data volumes? How will they change in the near and distant future?
  • Are customer relationships one-dimensional? Are you concerned with multiple domains of data and managing the corresponding hierarchies?
  • Will you implement Web services? How will they be used?
  • Do you augment your internal data with information from external vendors?
  • What are the time, budget and resource limitations?
  • Is MDM intended to eventually provide an enterprise data platform?

Please click here for the on-demand replay.
