Data Profiling For All The Right Reasons, Part 5
The Hub Designs Blog welcomes the final installment of this great series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 5: The Profiling Payoff
This is the final part of a five-part series, describing how data profiling benefits both IT projects and business operations. In Part One, we discussed profiling perspectives. In Parts Two, Three and Four, we introduced the value of system, entity, and attribute-level metrics. This part discusses the archival and beneficial uses of profile results.
If you have defined your corporate data profiling strategy similar to the methods discussed in the preceding parts of this series, you’ll have amassed a robust collection of metadata spanning relevant systems across your business. Although systems may be of different types and locations, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically. Ideally, this information will exist in an open-source database repository with reports made available across the enterprise. System and Entity information help planners and developers organize information strategies. Attribute-level domains, constraints, and business rules help data architects understand existing systems. Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.
If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information. To keep your repository relevant to business needs, design collection rules to be configurable. This allows you to easily ignore superfluous information or enable tests only at certain critical times. Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline. As you learn about the information, you will see which tests or which data objects add no value. We geeky DBA types who understand system-level catalogs have our own scripts to do much of what was described in Parts Two, Three and Four. Those less inclined may prefer to use a third-party tool for profiling. Either way works as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
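As a rough illustration of configurable collection rules with timestamps and indicators, here is a minimal Python sketch. Everything named in it (`ProfileRule`, `run_rules`, the `is_baseline` indicator, the `customer.status_code` attribute) is hypothetical and invented for this example; it is not taken from any particular profiling tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class ProfileRule:
    """One configurable profiling test (hypothetical structure)."""
    name: str
    metric: Callable[[list], object]
    enabled: bool = True  # set to False to skip a test that adds no value


def run_rules(rules: List[ProfileRule], values: list, entity: str, attribute: str,
              is_baseline: bool = False,
              collected_at: Optional[datetime] = None) -> List[dict]:
    """Run every enabled rule and stamp each result for the repository."""
    stamp = collected_at or datetime.now(timezone.utc)
    return [
        {
            "entity": entity,
            "attribute": attribute,
            "rule": rule.name,
            "value": rule.metric(values),
            "collected_at": stamp,       # timestamp for managing history
            "is_baseline": is_baseline,  # indicator marking the first broad run
        }
        for rule in rules
        if rule.enabled
    ]


rules = [
    ProfileRule("row_count", len),
    ProfileRule("null_count", lambda vals: sum(v is None for v in vals)),
    # Disabled rule: kept in the configuration but not collected.
    ProfileRule("distinct_count",
                lambda vals: len({v for v in vals if v is not None}),
                enabled=False),
]
results = run_rules(rules, ["A", None, "B", "A"], "customer", "status_code",
                    is_baseline=True)
```

Keeping a disabled rule in the list, rather than deleting it, makes it easy to switch the test back on at critical times.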
You will find that collecting and maintaining this level of detail has a definite cost. Even if the collection is automated, interrogation of large data sets places an overhead on production systems that may not be practical. Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities. Realize that the extent of data profiling is contingent on the project phase, specific data elements, and most of all, business value. Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.
How much profile history to maintain is another consideration. Even though disk is “relatively” cheap, maintaining all historical entries in a live repository may not be necessary. Consider business needs and value for historical profile information. Even consider archiving at a summarized (or less frequent) level and keep only a limited time window of statistics online.
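The idea of keeping a limited window online and archiving the rest at a summarized level could look something like the sketch below. The function name `split_history`, the 90-day default, and the choice of monthly summaries are all assumptions made for illustration; the right window and grain depend on your business needs.

```python
from collections import defaultdict
from datetime import date, timedelta


def split_history(entries, today, keep_days=90):
    """Split (collected_on, value) pairs into an online window and
    monthly summaries for everything older than the window."""
    cutoff = today - timedelta(days=keep_days)
    online = [(d, v) for d, v in entries if d >= cutoff]

    # Group the older entries by (year, month) before summarizing.
    by_month = defaultdict(list)
    for d, v in entries:
        if d < cutoff:
            by_month[(d.year, d.month)].append(v)

    archive = {
        month: {
            "count": len(vals),
            "min": min(vals),
            "max": max(vals),
            "avg": sum(vals) / len(vals),
        }
        for month, vals in sorted(by_month.items())
    }
    return online, archive


history = [
    (date(2010, 1, 5), 100.0),
    (date(2010, 1, 20), 110.0),
    (date(2010, 5, 15), 120.0),
]
online, archive = split_history(history, today=date(2010, 6, 1), keep_days=30)
```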
This discussion on data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way. The blog format is not conducive to in-depth discussions, but hopefully the topics covered here spur some thoughts into how you can add value to your business by implementing some of these concepts. Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value to somebody, it might not be worth the overhead to continue recording it.
Data Profiling For All The Right Reasons, Part 4
The Hub Designs Blog welcomes Part 4 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 4: Profiling Relationships and Patterns
This is part four of a five-part series describing how data profiling assists in all aspects of system development, from design through deployment.
Part One introduced different perspectives on data profiling. Part Two identified valuable system and entity metrics to track. Part Three discussed attributes. In this segment, we dive deeper into attribute relationships and pattern recognition. We also expand on the primary key identification discussion and cover hidden relationships.
Pattern grouping provides a mask of distinct format patterns within an attribute data set and a count of the number of occurrences. Patterns give insight into the type of values found in an attribute. For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.
Observing distinct patterns gives insight into the maximum digits and precision, and also domains such as integer or real. Patterning a database date or date-time type yields unremarkably similar patterns for all dates. Because the database management system typically enforces the domain, pattern analysis of such dates provides no value and can be ignored. If dates are stored in character format, however, patterns quickly show variations in date formatting. Character patterns only have significance to a limited number of positions. It makes no sense to pattern a description field of 200 or 2000 characters. Smaller code attributes of fewer than 10 characters, though, do provide value. Ignore pattern profiling for character strings over 20 characters at first, then refine to shorter character strings if the results do not add value.
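Pattern grouping as described above can be sketched in a few lines of Python. Digits are masked as 9, matching the examples in the text. Masking letters as A and the function names `pattern_mask` and `pattern_counts` are assumptions made for this illustration, and the 20-character cutoff simply follows the starting point suggested above.

```python
from collections import Counter


def pattern_mask(value, max_len=20):
    """Return a format mask: digits become 9, letters become A,
    and every other character is kept as-is."""
    if value is None:
        return None
    text = str(value)
    if len(text) > max_len:
        return None  # too long for a pattern to be meaningful
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in text
    )


def pattern_counts(values, max_len=20):
    """Count occurrences of each distinct mask, skipping unmasked values."""
    masks = (pattern_mask(v, max_len) for v in values)
    return Counter(m for m in masks if m is not None)


# Dates stored as character strings show their formatting variations.
date_patterns = pattern_counts(["2010-04-30", "04/30/2010", "2010-05-01", None])
```

Here `date_patterns` groups the two ISO-style strings under one mask and the slash-delimited string under another, which is exactly the kind of variation worth flagging.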
In pure database theory, referential integrity (RI) is your friend. In practice, designers and software vendors often forgo RI to improve system performance on data inserts. These designers place the data quality burden on the application and do not endorse external data manipulation outside the application interfaces. In the real world, though, data corruption occurs and without RI or routine data quality checks, corruptions may not be found for a long time or not at all. Personally, I have identified over $50,000 of recent orphaned sales in a retail client resulting from deliberately disabled RI. These unreported sales were not added to the ledger and were allowed to occur for performance reasons until I found them through simple profiling. Enforcement of RI is a topic for another discussion but is mentioned here because it does identify a valid reason for data profiling.
Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons. First, interrogate the RI listed in the system catalogs to identify all enforced relationships. Reverse-engineering a system with a good modeling tool is probably the best way to do this. A harder and more valuable analysis is to identify unenforced relationships and determine the probability of a relationship when not all values match exactly. Do this by counting all the candidate child attribute values that exist within a known parent attribute table. If all match and there is a non-trivial number of matches, there is a good probability of an unidentified relationship. A small number of mismatches could identify data quality issues.
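The counting approach for unenforced relationships can be sketched as follows. This is only an illustration working on in-memory lists; the function name `relationship_match` and the returned field names are hypothetical, and what counts as a "non-trivial" number of matches is left to your judgment.

```python
def relationship_match(child_values, parent_values):
    """Count how many non-NULL candidate child values exist in the parent."""
    parent_set = {v for v in parent_values if v is not None}
    child = [v for v in child_values if v is not None]
    matched = sum(1 for v in child if v in parent_set)
    return {
        "child_rows": len(child),
        "matched": matched,
        "match_ratio": matched / len(child) if child else 0.0,
        # Child values with no parent: possible orphans or quality issues.
        "orphans": sorted({v for v in child if v not in parent_set}),
    }


# Hypothetical data: order rows pointing at customer ids.
result = relationship_match(
    child_values=[1, 2, 2, 3, None, 9],
    parent_values=[1, 2, 3, 4],
)
```

A ratio of 1.0 over many rows suggests an unidentified relationship. A ratio just below 1.0, as in this example, points at a likely relationship with a few orphaned rows worth investigating.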
In Part 5, we tie all the techniques discussed in the first four parts together to show the value of a repeatable data profiling process.
Data Profiling For All The Right Reasons, Part 3
The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 3: Attribute-Level Analyses
This is part three of a five-part series on data profiling.
In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy. The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect. In this section, we introduce attributes, the lowest-level components of a profiling effort.
An attribute is simply an individual data element. Alone, an attribute has no context. The simple descriptor “Cost” tells us very little about the attribute’s true purpose and immediately drives a need for additional information, such as units (hours, Dollars, Euros…), type (weighted, unit, gross…), and use (invoice, sum, average…). Attributes therefore must be analyzed within the context of their business purpose to have meaning.
Some characteristics require business knowledge to define and others can be determined through interrogation of existing values and underlying rules of the storage medium. It takes both analyses to get a complete picture of information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, only assumptions can result from physical profiling or business context.
Analyses of values, domains, and constraints allow insight into use (or abuse) of an attribute. The larger the sample size, the more confidence you gain in the results. Without explicit proof of business rule enforcement, though, you must remember that the absence of a value today does not mean it cannot exist. Business rules are defined by business experts and enforced through database constraints, data type/precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not totally understand it. Profiling of actual values provides additional refinement in terms of percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define uniqueness of the data set also may support relationships between entities. Uniqueness can be further classified as being either members of a system-enforced primary key or of a business key (outside of the defined primary key). System-enforced primary keys are relatively easy to define within a database system through interrogation of the system catalog. Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
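For business keys that span more than one attribute, a brute-force uniqueness check is one possible starting point. The sketch below assumes the rows fit in memory as a list of dictionaries; the function name `candidate_keys` and the sample columns are hypothetical. Uniqueness in the data you sampled is only evidence of a key, not proof that the business enforces it.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_size=2):
    """Return minimal column combinations whose values are unique
    and non-NULL across all rows."""
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(columns, size):
            # Skip supersets of a key that was already found.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = set()
            unique = bool(rows)
            for row in rows:
                values = tuple(row[c] for c in combo)
                if None in values or values in seen:
                    unique = False
                    break
                seen.add(values)
            if unique:
                found.append(combo)
    return found


rows = [
    {"id": 1, "store": "A", "sku": "X"},
    {"id": 2, "store": "A", "sku": "Y"},
    {"id": 3, "store": "B", "sku": "X"},
]
keys = candidate_keys(rows, ["id", "store", "sku"])
```

In this sample, `id` is unique on its own, while `store` and `sku` are unique only in combination. The number of combinations grows quickly with `max_size`, so keep it small on wide tables.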
Attribute-level information of interest includes: data type (size and precision), the number and percent of NULL values, column descriptions, number and percent of distinct values, and the minimum-maximum-average values and lengths. The system catalog provides some of this information, but the rest must be collected by sampling the data.
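The value-based metrics listed above can be gathered with a short routine like the following sketch. The function name `profile_attribute` and the output field names are made up for illustration. Two assumptions are worth calling out: the distinct percentage is computed against the total row count, and minimum, maximum, and average values are calculated only when every non-NULL value is numeric.

```python
from collections import Counter


def profile_attribute(values, top_n=3):
    """Compute basic attribute-level metrics from a list of sampled values."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    lengths = [len(str(v)) for v in non_null]
    distinct = len(set(non_null))
    # Assumption: value statistics only make sense for all-numeric data.
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_null
    )
    return {
        "row_count": total,
        "null_count": total - len(non_null),
        "null_pct": 100.0 * (total - len(non_null)) / total if total else 0.0,
        "distinct_count": distinct,
        "distinct_pct": 100.0 * distinct / total if total else 0.0,
        "min_value": min(non_null) if numeric else None,
        "max_value": max(non_null) if numeric else None,
        "avg_value": sum(non_null) / len(non_null) if numeric else None,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
        "top_values": Counter(non_null).most_common(top_n),
    }


profile = profile_attribute([10, 20, 20, None])
```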
Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may either allow you to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.
Part 4 expands on attribute profiling with the introduction of relationships and patterns.
Data Profiling For All The Right Reasons, Part 2
The Hub Designs Blog welcomes Part 2 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 2: Profiling the Basics
This discussion is the second of a five-part series on data profiling. In Part 1, we discussed the project roles that benefit from data profiling and how better understanding information results in more reliable information systems. Important goals of any profiling strategy include automation of metric collection and socializing results to support the differing objectives of a data-centric project.
Early in a system development life cycle, profiling helps define sources, data storage requirements, and data transformations. As a system goes into production (or if profiling is added to an existing system for quality control purposes), routine profiling is useful to audit system quality and business rule enforcement. The frequency of collection and amount of effort you expend to automate your profiling methods should be based on the ability of the organization to benefit from the profile results.
This section discusses the beginnings of a profiling effort. Information assembled here forms the foundation of other profiling activities. For this discussion, consider a Profile Group as a set of information sharing a common purpose and data management methods. Examples of profile groups include tables within a single database schema or a group of spreadsheets with the same format but each spreadsheet representing a different time slice of data.
The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site being accessed through web services. The reason we abstract information into Systems is to group the information into distinct governance methods common to the underlying information. Relevant metadata and governance methods we track at the system-level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems. System-level metadata common to a profile group helps us understand and troubleshoot future analyses. This level of information also provides developers with an understanding of inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.
Entities within a profile group belong to the same system, may have a common unique identifier, and, for database entities, have the same schema owner. Typically, entities are database tables, but may also be similar files or spreadsheet tabs containing like attribute lists. For entities, we track characteristics common to all the attributes they contain. These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive, existence of change data management attributes such as insert/update timestamps, and existence of audit traceability indicators such as insert/update username.
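To make the entity-level metrics concrete, here is a minimal sketch that reads SQLite's system catalog through Python's built-in `sqlite3` module. SQLite is used only because it is easy to run anywhere; other database systems expose their catalogs differently. The function name `profile_entities` and the audit column names `inserted_at` and `updated_at` are assumptions for this example.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical naming convention for change data management attributes.
CHANGE_TIMESTAMP_COLUMNS = {"inserted_at", "updated_at"}


def profile_entities(conn):
    """Collect row counts and simple indicators for every user table."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    results = []
    for (name,) in tables:
        # Quote the identifier, since it cannot be bound as a parameter.
        quoted = '"' + name.replace('"', '""') + '"'
        row_count = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        # PRAGMA table_info returns one row per column; index 1 is the name.
        columns = {
            row[1].lower()
            for row in conn.execute(f"PRAGMA table_info({quoted})")
        }
        results.append({
            "entity": name,
            "row_count": row_count,
            "has_change_timestamps": CHANGE_TIMESTAMP_COLUMNS <= columns,
            "last_analyzed": datetime.now(timezone.utc),
        })
    return results
```

Running this on a schedule and storing each result set gives you the growth characteristics mentioned above, since row counts can be compared between runs.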
The combination of system and entity level profiling supplies the foundation for attribute-level profiling, which is where physical information in a system resides. It also provides valuable metadata to classify information and allows for future correlation of like information across systems. Assembly and publication of entity and system level information benefits the various consumers of the information by providing a centralized “master” source of contact and context information.
In Part 3, we will dive into the attribute level analyses around data profiling.
Data Profiling For All The Right Reasons, Part 1
The Hub Designs Blog welcomes a guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 1: The Psychology of Data Profiling
Swiss psychologist Carl Gustav Jung founded the Analytical School of Psychology. His word association theories form the basis of the Myers-Briggs Type Indicator Assessment test to identify career aptitude in today’s high school students. Dr. Jung’s approach assigned personality profiles based on how an individual’s thoughts associated with various phrases. By analyzing responses, he could understand how an individual viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought entering their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?
If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.
If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.
If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.
Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive compulsive data architects compelled to organize the world into strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, existence of NULL or default values, and understanding the meaning of each data element to accurately identify business rules and when two or more objects are related or redundant.
Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.
It is truly amazing how much we can conclude from the response to one simple phrase.
As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.
Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.
This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling. In particular, what information is important to collect at the system and table level and how can that information be leveraged by the Enterprise to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.
Building MDM-Powered Solutions with Initiate Composer
Earlier this week, I saw a demo of Initiate’s new Composer product, and was impressed. Composer, announced in March and scheduled for release in June, will be available to all existing Initiate customers.
Initiate Composer is a framework for building MDM-powered solutions on top of the company’s MDM hub, which is called Initiate Master Data Service. Typically, an MDM hub is populated with data from monolithic enterprise systems like front office suites such as customer relationship management (CRM) applications and back office suites such as enterprise resource planning (ERP) applications.
Essentially, these data sources offer the best of both worlds. By pulling the data into a robust MDM hub, you create a “single view of the customer” rather than having multiple views within different silos across the enterprise. Then by building a new, easy-to-use application on top of that trustworthy data, you’ve found a way to quickly deliver value from the MDM initiative back into the business.
Of course, in the real world it’s never quite that easy. But one of the most common things we see clients wanting to do with their newly-built MDM hub is to make the information in it widely available to the enterprise – for search, for reference, for additional data entry, for automation of manual processes, and for viewing corporate hierarchies and other relationships.
Based on the demo I saw, Initiate fulfills this need with Composer. Customer teams can now quickly create production-ready user interfaces that are pre-integrated with the Initiate Master Data Service.
Composer creates Adobe Flex applications, which are cross-platform rich Internet applications. This is helpful because they will run on a variety of clients using only a browser.
It was impressive to see the degree to which business analysts could quickly be productive writing simple MDM applications, even if they were prototypes that would need to be finished up by a developer. A lot of times, there’s a big gap between design documents and working code. It’s a lot easier for a power user or a business analyst to work with a tool like Composer to “show you what I want” than to just describe it verbally, in writing or on a white board.
With Composer, teams can more easily and more productively build a variety of different user interfaces on top of Initiate’s MDM hub. IBM thought highly enough of Initiate Systems back in February to acquire the company. While I’m sure that Composer was only a small part of why that happened, I’m sure it didn’t hurt.
Initiate has always been a company I’ve followed closely and with whom Hub Designs has partnered, and we look forward to continuing that as they become part of the IBM universe.
Why I Enjoy MDM So Much
I was reading a very good article on the Presentation Zen blog called The Importance of Starting from Why. The article describes a TED talk by a leadership expert and author named Simon Sinek. In his talk, which I encourage you to watch yourself, he talks about the importance of understanding the “Why” of something vs. the “How” and the “What”. Since I read that article and watched that video, I’ve been thinking about why I enjoy MDM and data governance so much, and about the central premise of Simon’s talk, which is that there’s a simple pattern, that all great and inspiring leaders and organizations think, act and communicate in the same way, and it’s the exact opposite from everyone else. He calls it “the golden circle”:
Why -> How -> What, and goes on to say that this little idea explains why some organizations and some leaders are able to inspire, where others aren’t. Every person and organization knows what they do, most know how they do it, but a lot don’t know why. The successful ones start with the why and work “inside out” (in the opposite direction from most people and companies). By nailing down the why first, everything else falls into place.
I don’t want to reproduce his whole talk here in this article, but it got me thinking about my interest in master data management and about Hub Designs and our approach to working with our clients.
I got interested in master data in one of my first consulting projects after graduating from college. I had a client that was a distributor of VHS videotapes. People would call up and order a show on tape, and the customer service people would enter them in as new customers rather than search to see if they might be an existing customer. Their order entry system was written in FoxPro on a PC network, and I had my own consulting business doing FoxPro programming. So I was engaged to help them deduplicate their customer master, based on similarity of customer name and address. I remember at the time thinking it was a great intellectual exercise.
That was my first exposure, but hardly my last. In 1995, I got recruited into a position as a project manager for an Oracle ERP implementation, and I did many Oracle projects over the years that followed. In ERP implementations, converting master data well is a big contributor to the success of the project, and I found that handling data quality issues properly became second nature to me.
In 2001-2002, I was a program manager on a large Oracle ERP project for a $1 billion software company, and one of the areas I oversaw was Customer Registration. My client and my team were one of the first to integrate Oracle’s Trading Community Architecture (TCA) with Dun & Bradstreet’s real-time database (D&B Data Integration Toolkit). That led to my going to work for D&B in 2004, and being part of the Global Alliances team there until 2007. While at D&B, I managed their strategic alliance with Oracle, and worked closely with Oracle on Customer Data Hub and its integration with D&B content.
I mention all this not to bore you with my professional history for the past twenty-three years, but to illustrate how a passion for master data can get into your bones, and shape your career. It’s woven itself into my life, and become part of the “Why” for Hub Designs and how we work with our clients. Anyone who knows me or has worked with Hub Designs professionally knows that we care deeply about our clients and their success. They become part of our family. We hug them when we see them. We put so much of ourselves into our clients’ projects that we form relationships that last for years.
In the video we produced for the recent Gartner MDM Summit, we used words like ‘passion’, ‘performance’, ‘teamwork’, and ‘integrity’ to describe our “why”. That’s what gets us out of bed in the morning – making a difference for our clients, helping them solve their business problems, moving the needle, making things better in their organizations, and improving things one company at a time.
In the end, why I started my own consulting firm again was so I could work with clients in my own unique way, so I could develop something of lasting value, and so I could turn my passion for MDM and data governance into a business that would make a difference to our clients.
What’s your why?
Recent eLearning Curve Webinar
Hub Designs recently hosted a 30 minute webinar on “Best Practices in MDM and Data Governance with Dan Power”, in concert with our friends at eLearning Curve and Information Management magazine.
To download the replay of the webinar (with audio), please go to http://bit.ly/hub-designs-webinar. To download just the slides, please go to http://bit.ly/mdm-best-practices and click “Download”.
For the “When Data Governance Turns Bureaucratic” white paper mentioned in the presentation, go to http://bit.ly/data-governance. Scroll to and click the link at the end of that article.
Thanks for attending the webinar (or the replay). We hope you found it valuable!
When Data Governance Turns Bureaucratic
How Data Governance Police Can Constrain the Value of Your Multidomain Master Data Management Initiative
(this appeared as a guest post on Informatica’s blog on Friday, April 30 2010)
I published a white paper last year, entitled “When Data Governance Turns Bureaucratic,” that looked at how reactive data governance was preventing organizations from realizing the full value of master data management (MDM). By “reactive”, I mean organizations using a “coexistence” architecture where front office applications (CRM) and back office applications (ERP) are still used to author master data (customer and product data, suppliers, employees, etc.). Because these applications remain the “Systems of Entry” while the MDM hub’s role is limited to being the “System of Record,” some of the biggest promises of MDM remain unfulfilled.
So, what exactly would proactive data governance look like? Essentially, the proactive model places more emphasis on business users being the owners of the master data. Rather than letting data stewards carry the burden of the central issues of accuracy and completeness, the accountability for these goals shifts towards the business users. Since end users are empowered to enter new master data directly into the hub, their trust in the accuracy and completeness of master data goes up, plus there’s less need for data stewards to act as the “data quality police.” Once users are no longer dependent on the CRM and ERP systems to perform initial entry and updating of master data, the data stewards can focus on managing exceptions and measuring data for quality, availability, security and usefulness. In this less-intrusive role, data stewards don’t present a bottleneck to critical business processes such as order management or invoicing.
By getting the master data right at the source, your initial level of quality for new records is much higher. The proactive style of data governance also effectively eliminates any time lags between the initial entry of a new master record, and its certification and publishing via middleware to the rest of the enterprise. As such, marketing campaigns can be done more quickly, with no upfront data remediation needed prior to launching a campaign. Finance benefits as well, since all of the data elements needed for a new customer are captured at once, and the hub-based process for adding a new customer can include pulling third-party content and calculating a credit limit, then passing that information back to the ERP system. Customer service benefits too, because all information is stored in one hub and made accessible through an efficient, user-friendly front end. Customer service reps are able to access all of the data needed for each customer interaction, as well as being able to author new data when necessary.
When is the right time to transition from reactive to proactive data governance? Some situations call for starting out immediately with the proactive approach, such as when you’ve got multiple CRM systems and ERP systems that would require integration with the hub in order to allow them to continue to operate as Systems of Entry, or when your current source systems are very brittle or hard to maintain or modify. In those cases, bite the bullet and plan from the beginning for proactive data governance.
Want to learn more about reactive vs. proactive data governance? You can download the complete whitepaper “When Data Governance Turns Bureaucratic” here.
Oracle’s MDM Strategy and Roadmap
At the Oracle Applications Users Group (OAUG) COLLABORATE 2010 conference this week, I attended a session by Pascal Laik, Oracle’s VP of Master Data Management Strategy.
He started out by talking about several Oracle MDM customers, their success stories and their return on investment, across drivers like growth, efficiency, improved IT agility, and compliance.
Pascal moved on to talk about MDM implementation challenges. Oracle surveys its MDM customers every two years. Measuring actual ROI achieved is the most difficult challenge reported. Next is breaking down organizational silos, and then demonstrating incremental business value.
Five out of the top ten challenges were related to data governance and project/organization. These were big themes two years ago as well. So Oracle worked with an outside partner on the areas of strategy, policies & processes, organization, measurement & monitoring, technology, and communication. They got a group of 10-15 customers together 2-3 times per year, and that group put together a set of requirements for a product that Oracle has now created called Data Governance Manager. This product helps data governance professionals to operate and monitor the hub and to define and enforce policies.
Pascal showed a short video from an Oracle customer, Areva. Their program was called STOCK – Strategic and Operational Customer Knowledge, to ensure the high quality of customer data. They used a five step approach: Collect, Harmonize, Merge, Enrich, and Publish. The benefits included saving employees time, ensuring that internal people can rely on customer and prospect data, and providing the entire enterprise with a clear vision of the customer database.
The second set of challenges related to ROI and business case – measuring actual ROI achieved. Oracle now has a web-based ROI model available through its sales team. Oracle also has a group of people that do a 3-5 week management consulting exercise called “Insight” that delivers a full business case.
The third set of challenges is the first one involving technical issues: #10 and #11 (integration and data quality).
Two years ago, the #1 issue was procuring skilled resources. Oracle has since been working closely with systems integrators, and this issue is now down to #7. Integration with operational applications has gone from #2 to #11.
Lastly, Pascal discussed Oracle solutions, investments and its strategy going forward. Oracle now has Customer Hub, Supplier Hub, Product Hub, and Site Hub. Data Relationship Management, which is a financial hub to manage financial entities such as the chart of accounts and other hierarchies, is also an analytical hub.
Oracle Customer Hub (formerly known as Universal Customer Master) is now on release 8.2, which shipped in January 2010, and includes the new Data Governance Manager module. This is the largest customer release in four years.
Oracle’s MDM strategy has two legs. The first is embedded “best in class”: Oracle has OEM’d the Informatica solution, using the Identity Systems solution (now owned by Informatica) and the Address Doctor solution (also from Informatica) for postal cleansing for 200+ countries. The other leg is “open” – Oracle is providing a “Universal DQ Connector” for selected vendors like Trillium, Acxiom, D&B and Datanomic. (Note: the embedded “best in class” approach is somewhat controversial, because Informatica now competes directly with Oracle, having acquired the Siperian MDM hub.)
The end-to-end data quality framework (the Data Quality “Machine”) has a Rules Manager for design, development and validation (IDQ). There is a process (Analyze/Profile, Standardize/Cleanse, Match & De-Duplicate, Enrich) with a Scorecard & Reporting, and an Exception Management Process. The output is to load the MDM system with zero rejects.
Oracle has also acquired Silver Creek Systems, which is focused on product data quality. It is a self-learning semantic engine to handle the complexities of product information.
Pascal talked about some of the newer MDM hubs, Supplier Hub and Site Hub. Site Hub in particular has experienced strong interest from retailers, fast food companies and large enterprises, which are using it to manage stores and locations.
Oracle’s MDM investments are critical for Oracle in terms of its differentiation strategy, and data governance is the number one item from its customer advisory board. Oracle has reached 1,000 MDM customers across all of its various MDM products.
Pascal wrapped up by talking about how competitive the MDM space is and the recent acquisitions in the market. Oracle’s history is in applications. Oracle brings a pre-built, flexible schema with enterprise-grade, verticalized hub applications. Oracle MDM hubs are pre-integrated with both Oracle and non-Oracle applications. And Oracle provides best-in-class data quality and data governance solutions.
Our Booth at the Gartner MDM Summit
Hub Designs was a Silver sponsor at the Gartner MDM Summit 2010. Here’s the new, 3-minute video we produced to describe what Hub Designs does as a consulting firm focused specifically on MDM:
Great New White Paper and Other Collateral Available at Our Booth
At the event, we announced with Equifax a new product that integrates Equifax commercial information with Oracle E-Business Suite and Oracle Customer Data Hub. This product simplifies the process of integrating Equifax credit and marketing information with prospect and customer data in Oracle. Both the joint press release and a one-page product overview were distributed at the booth.
Also available was a new whitepaper written in collaboration with Informatica titled, When Data Governance Turns Bureaucratic: How Data Governance Police Can Constrain the Value of Your Multidomain Master Data Management Initiative. This updated version of an earlier white paper written with Siperian in 2009 added both new content and industry insights. It was very well received at the Gartner conference this week.
Finally, we handed out one of the most popular recent articles from this blog, Hidden Costs of Duplicate Customer Data.
The conference drew attendees from many different market sectors, so discussions and meetings were both informative and valuable from an MDM perspective. Several Hub Designs clients were able to join us there, from the insurance, software and transportation industries, and we had four of our team members there as well. I’m going to write a separate article with my thoughts on the sessions and the mood of the conference, but I wanted to provide a look at our booth as well, for our readers who weren’t able to make it to Las Vegas this week.
Kalido MDM and AB InBev
The Gartner MDM Summit in Las Vegas wraps up today, and this morning I caught a session by Kalido’s President and CEO Bill Hewitt and Jonathan Starkey, the Director of Business Intelligence at AB InBev North America.
AB InBev purchased Anheuser Busch in 2008 to become the largest brewer in the world, with over 116,000 employees worldwide and $39 billion in annual revenue.
AB InBev sees master data as a foundation element supporting supply chain management (SCM), enterprise resource planning (ERP) and customer relationship management (CRM). All of that data winds up in a data warehouse and is used for reporting and planning. This shared focus on both reporting and analysis, and planning and forecasting makes up their philosophy on business intelligence.
This integration approach is being used to bring together the Canadian and US operations gradually over time, but integrating the SCM, ERP and CRM pillars of the US and Canadian operations of such a large enterprise is realistically going to take three to five years.
Turning more to the master data side of things, the first way AB InBev is using Kalido is to synchronize and cross-reference product and customer information across SCM and ERP systems. Secondly, they’re using Kalido to look for active exceptions across all of the various domains – between plants and products, between employees in HR and in ERP, between any two systems where master data is not in agreement. Thirdly, they’re using Kalido to kick off requests for new master data – new employees, new products, etc. that then get passed to various systems around the company.
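To make the second use case concrete, here is a minimal sketch of what “looking for active exceptions” between two systems can mean. This is not how Kalido does it; the function name, the record layout and the sample HR and ERP data are all hypothetical, chosen only to illustrate the idea of flagging master records that disagree.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def find_exceptions(
    system_a: Dict[str, Record],
    system_b: Dict[str, Record],
    attributes: List[str],
) -> List[Tuple[str, str, str, str]]:
    """Return (key, attribute, value_in_a, value_in_b) for every disagreement."""
    found = []
    for key in sorted(set(system_a) | set(system_b)):
        rec_a = system_a.get(key)
        rec_b = system_b.get(key)
        if rec_a is None or rec_b is None:
            # The master record exists in only one of the two systems.
            found.append((
                key,
                "<record>",
                "present" if rec_a is not None else "missing",
                "present" if rec_b is not None else "missing",
            ))
            continue
        for attr in attributes:
            if rec_a.get(attr) != rec_b.get(attr):
                # Both systems hold the record but disagree on this attribute.
                found.append((key, attr, str(rec_a.get(attr)), str(rec_b.get(attr))))
    return found


# Hypothetical employee master data held in HR and in ERP.
hr = {
    "E100": {"name": "Pat Lee", "plant": "St. Louis"},
    "E101": {"name": "Sam Roy", "plant": "Toronto"},
}
erp = {
    "E100": {"name": "Pat Lee", "plant": "Newark"},
    "E102": {"name": "Ana Diaz", "plant": "Toronto"},
}

exceptions = find_exceptions(hr, erp, ["name", "plant"])
for row in exceptions:
    print(row)
```

In this sample, E100 is flagged because the two systems disagree on the plant, while E101 and E102 are flagged because each exists in only one system.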
The “real world” benefits from Kalido at AB InBev include procurement savings, strategic inventory optimization, overhead and budget tracking, and people and resource movement tracking.
AB InBev went through a rigorous selection process, and selected Kalido in large part because of its ability to change rapidly as their business needs changed. Jonathan Starkey said, “Kalido does a very good job at managing change over time”.
I really enjoyed this session. Both Bill Hewitt and Jonathan Starkey did a great job, and it was enlightening to hear how a large global enterprise has addressed their MDM and business intelligence needs. Hub Designs recently became a Kalido partner, and one of our goals for this Gartner MDM Summit was to learn more about the company and their products, and this session definitely helped us do that.
For more information on Kalido, please visit www.kalido.com.
Evolving from Product MDM to Multidomain MDM
I’m attending the Gartner MDM Summit in Las Vegas, and this morning I caught a great session by Andrew White on the evolution from master data management (MDM) of product data to “multidomain MDM”.
Andrew started by talking about the strong intersection of product MDM with enterprise resource planning (ERP), workflow, product configuration, and business rules. The market for product MDM is fairly healthy and is actually a little larger than the market for customer MDM.
The initial need to master product data usually arises from having too many copies of product data in different places around the enterprise. Then typically, product data quality issues need to be addressed, but that needs to be addressed as a continuing process, not as a one-time process.
Multi-channel commerce is known as the “sell side” of product MDM, and procurement is known as the “buy side”. There’s involvement with fulfillment and supply chain management, and with ERP and operations. There are many different silos that need to be connected and synchronized (one client I worked with last year had 175 different applications, systems and databases, most of which used or created product data in some way).
At some point, governance has to be addressed. Companies have to go from departmental or business unit governance to enterprise-wide data governance, and expand from single domain (typically customer) to multidomain (customer and product) master data governance.
Andrew mentioned the level of Product MDM adoption – there was software license spending of $432 million in 2008. Certain industries such as discrete and process manufacturing, communications, retailing, and healthcare providers are classified as “hot” according to Gartner (as of Q1, 2010). Retail in particular is almost post-recession. Healthcare providers have more awareness on the buy side.
A common scenario for some is to have a product MDM hub as a system of record, connected to CRM systems for sales & marketing and customer service, to PLM (product lifecycle management) as a system of reference, and to ERP systems (which need the data for their Item Masters). So the CRM, PLM and ERP systems are process owners, but the MDM platform provides the product and material master data, attributes, hierarchies and so on, for consumption by the other systems.
Andrew talked about how the inquiries he gets break down: ERP and MDM: 50%, product data quality: 33%, information exchange: 15%, metadata management: 10% and content management: 20%, and “can I use my CDI hub to master product data?”: 10%.
Andrew talked briefly about the current vendors in the product MDM space: the specialists (handling just product data) such as Hybris Software, Heiler, QAD, Pindar, Tribold, Requisite Technology, EnterWorks. He categorized Stibo Systems, Riversand and Tribold as being somewhere in the middle between specialists and generalists (handling other domains).
Oracle, IBM and SAP are strong on product MDM and customer MDM. Tibco and Informatica (formerly Siperian) are customer MDM providers that are moving towards handling the product MDM domain. Microsoft is entering the MDM space but their solution (when it is released later this year) is really suited more for analytical use.
And other vendors such as Data Foundations and Orchestra Networks can model any domain of data, including product data.
Through the end of 2013, you might need two MDM platforms. IBM has three MDM products (IBM InfoSphere MDM Server, MDM Server for PIM which handles complex workflow, and their recent acquisition of Initiate). Other strong vendors include SAP, Oracle and Stibo Systems.
The five-year market growth rate is projected at 18%. The Top Five products have 51% of the market. Vendors to watch include Teradata, Informatica, Tibco and Hybris.
Over the next 12 months, product configuration remains an unsolved problem. Companies typically define business rules all over the place. Over the long term, in MDM, that doesn’t work – those business rules themselves need to be governed centrally. The master data and the business rules both need to be governed. Successful product MDM requires business rules governance.
Reference data is another area – price is NOT master data but it behaves like master data in a lot of ways. It needs to be governed and managed. Business process management and its intersection with MDM is another area of development.
Data quality for product data has its foibles. You need to know where you’re starting from. Most importantly, data quality is not a once and done thing, it’s an ongoing process.
The product master data life cycle looks like: Author > Store > Publish / Synchronize > Enrich > Consume > Analyze.
The picture for the future – there are three main “provinces” for MDM: the “thing” province, the “party” province and the “place” province. But vendors typically have a history in a single domain.
Andrew gave a couple of great examples of companies that went through the evolutionary process of going from a single domain of MDM to multiple domains over time.
Andrew closed with recommendations for people beginning their MDM process: create a vision of what could be achieved with a “single view of product data”, start small but think big and deliver value early, and define data and process metrics early and then revise them as needed as you go along.
I’ve been a big fan of Andrew White for several years now, and I thought he did a great job today (as usual). He brings a great deal of analysis to bear on the questions involved in product MDM, and provides clarity and insight into where the MDM market is headed over the next several years. If you’re attending the Gartner MDM Summit in Las Vegas, or have a chance to catch his sessions at a future event, I think you’d find those sessions very rewarding.
Informatica Analyst Briefing
Arvind Parthasarathi, Ken Hoang and Ravi Shankar from Informatica were kind enough recently to give me a detailed briefing on Informatica’s master data management (MDM) strategy after its acquisition of Siperian.
First, there’s no doubt this was a game-changing move, for both Siperian and for Informatica. With over 4,000 Informatica installed base customers to leverage, and 200 Informatica sales reps going through training and certification, Siperian’s sales momentum should increase dramatically. And in fact, several new deals have closed just since the acquisition was announced in late January.
And being acquired by Informatica eliminates the “company viability” question that some Fortune 500 IT shops would have about any software company under a certain size (not just Siperian). Informatica itself might be acquired by one of the mega-vendors at some point, but with annual revenue of $500 million, it’s big enough not to be subject to the financial viability question.
Informatica also provides a large partner ecosystem and a significant marketing budget, so living on under the Informatica banner, Siperian can compete more readily for mind share both with partners and with potential customers.
But what impressed me the most was the strategic nature of the other purchases that Informatica has made over the past couple of years, such as Identity Systems for entity resolution (i.e. matching) and Address Doctor for address cleansing. With the addition of Siperian as a strong player in the multidomain MDM hub space, Informatica has declared itself a real competitor against the likes of Oracle, IBM, Initiate Systems (an IBM company) and SAP.
And in some ways, Informatica is better positioned than most of these, for two reasons. First, it has a complete suite of leading products for data integration, data quality and all of the associated things that make up the “MDM ecosystem”. And second, many of its competitors are dependent on it for those components (Ramon Chen wrote a great article on Informatica’s OEM agreements with various competitors).
Informatica’s product lineup supports all of these MDM requirements:
- Multiple MDM architectural styles including the ability to support Registry style (competes most directly with Initiate Systems)
- Multiple data domains, i.e. multidomain MDM (competes most directly with Oracle, IBM and SAP)
- Data Integration and Data Quality as a foundation for MDM (competes with a wide variety of products)
So in some ways, Informatica wins even if customers buy a competitor’s MDM hub product, because there’s a good chance they’ll still buy Informatica’s data integration and/or data quality solutions, to help them with data integration, data profiling and data quality, or to help build the inevitable data services, once the master data is gathered in a centralized hub and able to deliver timely, trusted and relevant data to the rest of the enterprise.
Informatica sees its MDM products used in both Operational MDM (where the master data is actively managed by data stewards, governed and improved and then synchronized back to the operational systems), and in Analytical MDM (where for various reasons, the improved master data does not flow back to the operational systems, but flows instead to data warehousing, analytical and business intelligence applications).
Informatica has such a strong, integrated story, with its PowerCenter data integration, Informatica Data Quality, and Informatica MDM products, that it’s able to accommodate customers’ maturity needs starting with data integration and progressing to data quality and MDM.
And Informatica, by giving customers the ability to solve any MDM-related business problem with a unified architecture, spanning data integration, data profiling, data quality, identity resolution, address validation, and all major styles of master data management, has pulled together a great set of solutions for MDM.
I’m looking forward to seeing the Informatica folks at this week’s Gartner MDM Summit conference in Las Vegas. If you’re going to be there, stop by and see the Hub Designs team at Booth #7 during the exhibit hall hours. We’ll be announcing a new product with Equifax, and we’ll be releasing a data governance white paper with Informatica.
Answering Questions from LinkedIn
I got a good question via LinkedIn the other day, so I thought I’d answer it here:
Dear Dan,
I am a database architect but I am new to MDM and data governance, and I’m very interested in this area.
Can you please suggest where to start? I’ve found some information on the web (sometimes a bit disconnected), but I seem to be lost with so much information. Also, the tools that are currently available in the market – do they address all the challenges in this space?
One question I have is: if data quality is given the importance it deserves from the beginning of any project (operational or data warehouse), are MDM initiatives necessary? Are MDM projects needed because of the proliferation of applications that are developed in silos and that don’t consider what information is already available to the enterprise? In essence, should MDM be part of any project?
Thanks for your time.
Not to sound too self-serving, but I’d start with this blog and the MDM Community.
As for your question about whether MDM initiatives are necessary if data quality is given sufficient importance, please realize that MDM is a relatively new discipline which includes embedded data quality; it does not replace data quality.
What MDM does is sit between the source systems (typically CRM and ERP) and the data warehouse and business intelligence. So instead of trying to flow master data and transactional data directly into the warehouse for analysis, we bring it into the MDM system first, where it can be “mastered” – which includes fixing data quality issues. We then flow those corrections back into the source systems and downstream into the analytical systems. Which of course you can’t do without data quality tools. But data quality tools by themselves are not sufficient, because they typically don’t persist or store the data.
Your next question, are MDM projects necessary because of the proliferation of apps developed as silos – yes, that’s a big part of it. Essentially, if you developed a new architecture from scratch, you’d put a multi-domain MDM hub, able to handle many types of master data, at the core, and you’d build data quality into it, then you’d surround it with integration so you can flow data from there to wherever it’s needed. So clean, accurate, consistent and timely master data would be available to any other IT project that was going on, but it would only have to be built once. “Build once, use many”, as they say.
Please keep reading and I hope you stay interested in MDM and data governance!
I got an answer from this person today that I thought I’d share with you here:
Dan,
Thank you very much for your insights. Whatever documents I had read about MDM had more to do with the people, process or technology, but didn’t cover the essence of MDM. I’ve gone through some of your blogs and I’m beginning to understand MDM.
AMB Releases Community Edition
Hub Designs has been a partner of AMB, a provider of information governance, quality and discovery software, since November 2008.
Now AMB is launching an open source version of its Information Governance Suite, called the Community Edition. AMB delivers tools to facilitate real time governance, and is now extending its reach as one of the first major vendors with an open source version of its core product.
This might be an ideal way for companies looking to familiarize themselves with a data profiling and data quality product to learn the tool, get a data governance proof of concept up and running in a cost effective way, and then demonstrate value to the business.
The Community Edition allows you to:
- become familiar with the concept of data profiling as a way of identifying and fixing information anomalies
- enable enterprises embarking on a data stewardship program to use the Community Edition to spotlight, identify and determine the priority of their internal information issues
- enable organizations to define and automate a repeatable process, using software to administer the information governance program that aligns with the repeatable process, not the other way around
The Community Edition should provide a core set of data profiling and governance capabilities, and training and support are available, as are upgrades to the Professional and Enterprise Editions.
For more information, contact AMB at 1-847-899-5154 or community@ambpdm.com, or visit http://www.ambpdm.com.
Is It Taxonomy Season Already?
Like death and taxes, every Master Data Management (MDM) project goes through a taxonomy definition exercise. During this time, Data Architects realize whether their payment of time thus far will yield a refund (of time) or require them to spend nights and weekends in jail (at the office). Let this article serve as your free consultation with your personal Taxonomy Preparation professional.
An MDM taxonomy is simply a structured hierarchy applied to the topic of the MDM project (for example: products, people, or customers) that defines that topic’s attribution. At each level, this hierarchy enforces the inheritance of characteristics to all of its children and their children. For example, the taxonomy of biology that has remained in my memory since 6th grade contains the levels Kingdom, Phylum, Class, Order, Family, Genus, and Species. Any animal or plant can belong to only one member of the lowest level, species, and each level of the taxonomy defines the inherited characteristics of its children. The same concept is core to an MDM design and each widget in an MDM topic can reside in only one of the lowest taxonomy levels.
The number of levels in the MDM taxonomy varies based on the business need, topic, and a count of the widgets in the topic. There are standards available to guide you in level counts and names if you want to follow them, but the assignment of attributes, definitions, and placement of your widgets in the structure is business-specific. Plan for a significant investment of effort to get the taxonomy and item assignments correct. This effort should result in the business agreeing on a taxonomy containing the fewest levels necessary to accurately represent the MDM topic’s widgets, along with a few other guidelines.
The topic of an MDM project may have many business purposes and be categorized by business users in a variety of different ways. This is expected and encouraged. We are not trying to restrict how the business analyzes the topic’s widgets. The taxonomy we are concerned with is a single hierarchy defining widgets through attribution characteristics as described in the prior biology example. We do this to create a single unambiguous definition that can be applied to every existing and new widget so that each widget falls under one and only one of the taxonomy’s lowest levels. The business must validate the one widget per lowest taxonomy level rule, what attributes are common to each level, and that the attributes of any level apply to all levels below it. The taxonomy not only results in a standardized method of defining widgets, but also allows for automatic inheritance of widget properties during definition which reduces the workload and chances of errors during the widget information entry.
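For readers who think better in code, the two rules above (attributes flow down to every level below, and each widget sits under one and only one lowest level) can be sketched roughly as follows. The class names, the beverage hierarchy and the SKU are all made up for illustration, and letting a lower level override an inherited value is an assumption of this sketch, not something the taxonomy concept requires.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)  # compare nodes by identity; the parent/child links are circular
class TaxonomyNode:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)
    parent: Optional["TaxonomyNode"] = None
    children: List["TaxonomyNode"] = field(default_factory=list)

    def add_child(self, name: str, **attributes: str) -> "TaxonomyNode":
        child = TaxonomyNode(name, dict(attributes), parent=self)
        self.children.append(child)
        return child

    def is_leaf(self) -> bool:
        return not self.children

    def inherited_attributes(self) -> Dict[str, str]:
        """Attributes of this level plus everything defined above it."""
        inherited = self.parent.inherited_attributes() if self.parent else {}
        # Assumption: a lower level may override a value defined higher up.
        inherited.update(self.attributes)
        return inherited


class Taxonomy:
    def __init__(self, root: TaxonomyNode) -> None:
        self.root = root
        self._assignments: Dict[str, TaxonomyNode] = {}

    def assign(self, widget_id: str, node: TaxonomyNode) -> None:
        if not node.is_leaf():
            raise ValueError(f"{node.name} is not a lowest taxonomy level")
        if widget_id in self._assignments:
            raise ValueError(f"{widget_id} is already assigned")
        self._assignments[widget_id] = node

    def describe(self, widget_id: str) -> Dict[str, str]:
        return self._assignments[widget_id].inherited_attributes()


# Hypothetical three-level product taxonomy.
root = TaxonomyNode("Beverages", {"unit_of_measure": "case"})
beer = root.add_child("Beer", contains_alcohol="yes")
lager = beer.add_child("Lager", fermentation="bottom")

taxonomy = Taxonomy(root)
taxonomy.assign("SKU-1", lager)
print(taxonomy.describe("SKU-1"))
```

Because the widget picks up its properties from every level above it, the person entering SKU-1 only has to choose the correct lowest level, which is where the reduced workload and fewer entry errors come from.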
Expect to encounter puzzled looks when introducing the concept of attribution-driven taxonomy. Business subject matter experts do not think of their widgets in those terms. Instead, they will be thinking in terms of how the business reports on the widgets. The distinction is clear only when you remember the purpose the taxonomy serves. Conducting workshops with business users across the board promotes the required consensus. After a few episodes of realigning discussions from a reporting mindset into an attribution mindset, the users will start to change their thinking and the results will be a valid taxonomy that the MDM initiative can grow on. Without this foundation, your success will be limited.
Data Profiling For All The Right Reasons, Part 5
The Hub Designs Blog welcomes the final installment of this great series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 5: The Profiling Payoff
This is the final part of a five-part series, describing how data profiling benefits both IT projects and business operations. In Part One, we discussed profiling perspectives. In Parts Two, Three and Four, we introduced the value of system, entity, and attribute-level metrics. This part discusses the archival and beneficial uses of profile results.
If you have defined your corporate data profiling strategy similar to the methods discussed in the preceding parts of this series, you’ll have amassed a robust collection of metadata spanning relevant systems across your business. Although systems may be of different types and locations, the structured approach and common metrics you collected create a centralized repository of information that can be examined holistically. Ideally, this information will exist in an open-source database repository with reports made available across the enterprise. System and Entity information help planners and developers organize information strategies. Attribute-level domains, constraints, and business rules help data architects understand existing systems. Relationships and value patterns are readily available to support validation of information-related hypotheses as needed.
If you plan to design your own repository, consider adding timestamps and indicators to help you manage and present the information. To keep your repository relevant to business needs, design collection rules to be configurable. This allows you to easily ignore superfluous information or enable tests only at certain critical times. Allow initial system profiling efforts to gather a large set of metrics and store them as your baseline. As you learn about the information, you will see which tests or which data objects add no value. Us geeky DBA-types who understand system-level catalogs have our own scripts to do much of what was described in Parts Two, Three and Four. Those less-inclined may prefer to use a third-party tool for profiling. Either way works as long as the business needs are satisfied and the entire enterprise standardizes on one approach (and thus one integrated repository).
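As a rough illustration of the repository ideas above, the sketch below stamps every result with a collection time and a baseline indicator, and lets each collection rule be switched on or off. Every name in it (CollectionRule, ProfileResult, run_profile, the two sample tests and the customer.state column) is hypothetical; a real repository would live in database tables rather than in Python objects.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Sequence


@dataclass
class CollectionRule:
    name: str
    test: Callable[[Sequence], float]
    enabled: bool = True  # configurable: switch a test off without deleting it


@dataclass
class ProfileResult:
    rule_name: str
    target: str
    value: float
    collected_at: datetime  # timestamp for managing and presenting history
    is_baseline: bool       # indicator marking the initial, wide collection


def null_ratio(values: Sequence) -> float:
    return sum(v is None for v in values) / len(values) if values else 0.0


def distinct_count(values: Sequence) -> float:
    return float(len(set(values)))


def run_profile(
    target: str,
    values: Sequence,
    rules: List[CollectionRule],
    is_baseline: bool = False,
) -> List[ProfileResult]:
    """Run only the enabled rules against one attribute's values."""
    now = datetime.now(timezone.utc)
    return [
        ProfileResult(rule.name, target, rule.test(values), now, is_baseline)
        for rule in rules
        if rule.enabled
    ]


rules = [
    CollectionRule("null_ratio", null_ratio),
    CollectionRule("distinct_count", distinct_count, enabled=False),
]
values = ["MO", None, "NY", "MO"]
results = run_profile("customer.state", values, rules, is_baseline=True)
for result in results:
    print(result.rule_name, result.target, result.value, result.is_baseline)
```

Here the distinct-count test stays defined but disabled, so it can be re-enabled at a critical time without any redesign.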
You will find that collecting and maintaining this level of detail has a definite cost. Even if the collection is automated, interrogations of large data sets place an overhead on production systems that may not be practical. Record and monitor profile execution metrics to identify bottlenecks or tuning opportunities. Realize that the extent of data profiling is contingent on the project phase, specific data elements, and most of all, business value. Review profiling goals on a regular basis and eliminate unnecessary and redundant checks.
How much profile history to maintain is another consideration. Even though disk is “relatively” cheap, maintaining all historical entries in a live repository may not be necessary. Consider business needs and value for historical profile information. Even consider archiving at a summarized (or less frequent) level and keep only a limited time window of statistics online.
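One way to picture the “limited time window online, summarized archive” idea is the sketch below. It keeps detailed entries from the last 90 days and rolls older ones up into monthly averages. The 90-day window, the monthly grain, the use of an average as the summary, and the sample metric values are all assumptions made for illustration; the right choices depend on the business value of the history.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List, Tuple

Entry = Tuple[date, str, float]            # (collected_on, metric_name, value)
ArchiveKey = Tuple[int, int, str]          # (year, month, metric_name)


def split_history(
    entries: List[Entry],
    today: date,
    online_days: int = 90,
) -> Tuple[List[Entry], Dict[ArchiveKey, float]]:
    """Keep recent entries in detail; summarize older ones by month."""
    cutoff = today - timedelta(days=online_days)
    online = [entry for entry in entries if entry[0] >= cutoff]

    monthly: Dict[ArchiveKey, List[float]] = defaultdict(list)
    for collected_on, metric, value in entries:
        if collected_on < cutoff:
            monthly[(collected_on.year, collected_on.month, metric)].append(value)

    archive = {key: mean(values) for key, values in monthly.items()}
    return online, archive


# Hypothetical profile history for a single metric.
history = [
    (date(2010, 1, 5), "null_ratio", 0.25),
    (date(2010, 1, 20), "null_ratio", 0.75),
    (date(2010, 6, 1), "null_ratio", 0.5),
]
online, archive = split_history(history, today=date(2010, 6, 30))
print(online)
print(archive)
```

The two January readings collapse into a single monthly figure for the archive, while the June reading stays online in full detail.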
This discussion on data profiling was intended to broaden perceptions of what it means to a business and the value it can bring if done in a sustainable way. The blog format is not conducive to in-depth discussions, but hopefully the topics covered here spur some thoughts into how you can add value to your business by implementing some of these concepts. Use your imagination, but remember that no matter how cool it might be to collect and store some profile output, if it does not add business value to somebody, it might not be worth the overhead to continue recording it.
Go back to Part 4.
Initiate Systems Acquired By IBM
Today, IBM announced that it is acquiring Initiate Systems.
This was widely rumored last week, but the announcement of Informatica’s acquisition of Siperian took my mind off this temporarily.
On the face of it, it makes all the sense in the world. IBM knows a good product when it sees it, and Initiate has been doing well in the MDM world lately, particularly in the healthcare vertical, where it grew up, and in the public sector vertical. IBM’s press release explicitly mentions Initiate as a leader in “data integrity software for information sharing” among healthcare and government organizations. I thought it was interesting that the IBM release didn’t mention the terms “master data management” or “MDM” even once.
I was a little surprised that IBM’s release made no mention of the financial terms, since IBM is a public company, but I’m sure it will only be a matter of time before those details become available to those who know where to look or whom to ask.
It wasn’t a surprise to see the IBM release mention the stimulus funding being invested around the globe. When I first saw the rumors last week, I immediately thought – IBM is buying Initiate to be better prepared for the various e-Healthcare initiatives that are coming down the pike.
Where things may get a bit tricky is explaining the multiple MDM platforms from IBM to potential customers, and managing several different development roadmaps and product portfolios. There’s the IBM InfoSphere MDM Server (the former DWL product) and there’s also IBM InfoSphere MDM Server for Product Information Management (the former Trigo product). And now there’s the Initiate product too.
While the acquisition does make sense, there is an “embarrassment of riches” factor. IBM will, of course, develop a sales playbook explaining what situations at what type of customer are a good fit for each product.
The lingering feeling I have about Initiate Systems is that it may be headed for a “golden ghetto” at IBM – never reaching its full potential as a solution across many different industries and, eventually, many different domains of master data. IBM may (and rightly so, in its mind) pigeonhole it into the healthcare and government verticals.
But Initiate’s had some good success outside those two industries. In the Financial Services vertical, they’ve got customers like Capital One Financial, Countrywide Financial (now Bank of America), eSure Insurance, and Wells Fargo. In the Hospitality industry, they’ve got Choice Hotels. In manufacturing, they’ve got Mitsubishi Motors Australia. In the Logistics vertical, they’ve got Federal Express. In the retail sector, Barnes & Noble, CVS, Longs Drug Stores and SuperValu are all customers. And in the high tech space, they’ve got Dell, Ingenix, Intuit, LocatePLUS, Microsoft and National Instruments.
Unfortunately, they didn’t achieve enough critical mass in any of these other verticals to compete with the strong momentum they’d developed in healthcare and government.
As I said last week, these are interesting times in the MDM world. The recent M&A activity, the healthy demand from large and medium sized corporations, the large number of consultants from other areas claiming to now have experience in MDM – these are all signals to me of a large and fast-growing market. So the New Year, for those of us in the MDM space, is off to a good start.
Siperian Acquired By Informatica
Siperian, one of the last best-of-breed providers of master data management (MDM) technology, is being acquired by Informatica.
The two firms were already working together closely, having an alliance and OEM relationship through Informatica’s acquisitions in 2008 of Identity Systems (for entity resolution and matching) and in 2009 of Address Doctor (for customer address cleansing).
This will strengthen the Siperian product further by bringing Informatica’s technology even more tightly into the Siperian MDM Hub.
At the same time, it eliminates the “company viability” question mark that sometimes gets raised in large IT shops’ minds when evaluating enterprise software vendors. When a Fortune 500 company is evaluating a smaller company, it sometimes wonders how long a company like Siperian can last against companies like IBM, Oracle and SAP. I’ve never been a big fan of that argument, since some of the best software gets created at small and medium-sized companies, but there’s no doubt it’s an obstacle to be overcome with the larger enterprises. Now, it shouldn’t be an issue.
As a Siperian partner, Hub Designs is excited about this acquisition. Based on the information we’ve got at this point, it seems like a good thing for Siperian’s customers, products, shareholders, partners and people. In today’s economic climate, dreams of a big IPO (for any venture-backed technology company) are unlikely, so an acquisition by a well-run larger company is a good outcome.
I know a lot of the people at Siperian personally, and have worked closely with them over the last few years. I hope the people at Informatica realize what a strong team they are getting in this acquisition, and do everything they can to hang onto them all.
I do suggest they stop using the term “MDM Infrastructure” though (which appeared 5 times in Informatica’s press release announcing the acquisition). First, it’s not accurate – MDM projects typically need to be driven by the business to be successful, so they can’t and shouldn’t be thought of as “IT Infrastructure” projects. Second, from a marketing perspective, “infrastructure” is about as exciting as mud – it’s hard to get senior management excited about spending money on something with the word “infrastructure” in the name.
As for the acquisition’s impact on the rest of the MDM market, the market is still growing pretty quickly, but the number of players is shrinking. So I think we’ll see it become even more competitive. And with Informatica now becoming a strong player in the MDM hub market, that’s got to cool its relationship with Oracle, which selected Informatica as an OEM component of its Oracle Fusion MDM hub.
IBM is rumored to be acquiring Initiate Systems, which is an interesting play in its own right, especially given the expected growth in spending in the e-healthcare space over the next few years.
And SAP continues to improve its products slowly but steadily, while D&B/Purisma is doing some interesting things with web services access to the D&B central database of information on businesses.
As for the remaining independent MDM vendors, like Orchestra Networks and Kalido, or Product Information Management (PIM) solutions like Stibo and Riversand, they should see this as further validation of the strength of the MDM market. Kalido feels that it’s the only independent MDM provider who can manage every master data domain. That may be true. I plan on learning more about Kalido over the next few months.
So like the Chinese curse, “may you live in interesting times”, the beginning of 2010 promises to be interesting for all of us in the MDM business!
If you’d like to continue the discussion on the “Impact of Informatica’s Acquisition of Siperian”, click http://ning.it/aJ1Xj5.
Data Profiling For All The Right Reasons, Part 4
The Hub Designs Blog welcomes Part 4 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 4: Profiling Relationships and Patterns
This is part four of a five-part series describing how data profiling assists in all aspects of system development, from design through deployment.
Part One introduced different perspectives on data profiling. Part Two identified valuable system and entity metrics to track. Part Three discussed attributes. In this segment, we dive deeper into attribute relationships and pattern recognition. We also expand on the earlier discussion of primary key identification and look at hidden relationships.
Pattern grouping provides a mask of distinct format patterns within an attribute data set and a count of the number of occurrences. Patterns give insight into the type of values found in an attribute. For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.
Observing distinct patterns gives insight into the maximum digits and precision, and also into domains such as integer or real. Profiling a database date or date-time type yields unremarkably similar patterns for all dates. Because the database management system typically enforces the domain, pattern analysis of native date types provides no value and can be ignored. If dates are stored in character format, however, patterns quickly show variations in date formatting. Character patterns are only significant for a limited number of positions. It makes no sense to pattern a description field of 200 or 2000 characters, whereas short code attributes of fewer than 10 characters do provide value. At first, skip pattern profiling for character strings over 20 characters, then narrow the limit to shorter strings if the results do not add value.
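As a rough illustration of pattern grouping, here is a minimal Python sketch. It assumes a simple masking convention: every digit becomes 9 and, as an assumption beyond the numeric examples above, every letter becomes A. The function names, the sample values, and the `max_length` parameter are hypothetical, and a profiling tool may use a different convention.

```python
from collections import Counter

def pattern_mask(value):
    """Replace each digit with 9 and each letter with A; keep other characters."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def pattern_counts(values, max_length=20):
    """Count distinct masks, skipping NULLs and strings longer than max_length."""
    counts = Counter()
    for v in values:
        if v is None:
            continue
        text = str(v)
        if len(text) > max_length:
            continue  # long free-text values rarely produce useful patterns
        counts[pattern_mask(text)] += 1
    return counts

# Hypothetical sample: dates stored as character strings in mixed formats.
sample = ["2009-12-31", "2010-01-15", "01/15/2010", "15-JAN-2010", None]
for mask, occurrences in pattern_counts(sample).most_common():
    print(mask, occurrences)
```

Run against a character column that is supposed to hold dates, a report like this makes competing formats visible at a glance.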
In pure database theory, referential integrity (RI) is your friend. In practice, designers and software vendors often forgo RI to improve system performance on data inserts. These designers place the data quality burden on the application and do not endorse data manipulation outside the application interfaces. In the real world, though, data corruption occurs, and without RI or routine data quality checks, corruption may go undetected for a long time or never be found at all. Personally, I have identified over $50,000 of recent orphaned sales at a retail client resulting from deliberately disabled RI. These unreported sales were never added to the ledger, and the condition was allowed to persist for performance reasons until I found it through simple profiling. Enforcement of RI is a topic for another discussion, but it is mentioned here because it identifies a valid reason for data profiling.
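An orphan check of this kind can be sketched with Python's built-in sqlite3 module. The table names, column names, and rows below are hypothetical and only stand in for a parent/child pair where RI was left unenforced. The query is an ordinary anti-join and is not the specific script used at the client mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_order (order_id INTEGER PRIMARY KEY);
    -- No FOREIGN KEY on order_id: RI is deliberately not enforced here.
    CREATE TABLE sales_line (line_id INTEGER PRIMARY KEY, order_id INTEGER, amount REAL);
    INSERT INTO sales_order VALUES (1), (2);
    INSERT INTO sales_line VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 99, 15.5);
""")

def find_orphans(conn):
    """Return child rows whose order_id has no matching parent row."""
    return conn.execute("""
        SELECT l.line_id, l.order_id, l.amount
        FROM sales_line AS l
        LEFT JOIN sales_order AS o ON o.order_id = l.order_id
        WHERE l.order_id IS NOT NULL AND o.order_id IS NULL
        ORDER BY l.line_id
    """).fetchall()

orphans = find_orphans(conn)
orphan_total = sum(amount for _, _, amount in orphans)
print(orphans, orphan_total)
```

Summing the orphaned amounts, as in the last lines, is what turns a data quality finding into a dollar figure the business can act on.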
Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons. First, interrogate the RI listed in the system catalogs to identify all enforced relationships. Reverse-engineering a system with a good modeling tool is probably the best way to do this. A harder and more valuable analysis is to identify unenforced relationships and to determine the probability of a relationship when not all values match exactly. Do this by counting the candidate child attribute values that also exist in a known parent attribute. If all match and there is a non-trivial number of matches, there is a good probability of a non-identified relationship. A small number of mismatches could indicate data quality issues.
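The counting step for unenforced relationships might look like the following Python sketch. The thresholds (`min_matches`, `near_match`) and the category labels are assumptions chosen for illustration; what counts as a "non-trivial" number of matches depends on your data.

```python
def relationship_score(child_values, parent_values):
    """Return (matched, total) for non-NULL child values found in the parent."""
    parents = {v for v in parent_values if v is not None}
    candidates = [v for v in child_values if v is not None]
    matched = sum(1 for v in candidates if v in parents)
    return matched, len(candidates)

def classify(matched, total, min_matches=3, near_match=0.8):
    """Label a candidate relationship; the thresholds are illustrative assumptions."""
    if total == 0 or matched < min_matches:
        return "no evidence"
    if matched == total:
        return "probable relationship"
    if matched / total >= near_match:
        return "probable relationship with data quality issues"
    return "unlikely"

# Hypothetical values: a known parent key and two candidate child attributes.
parent_ids = [1, 2, 3, 4, 5]
clean_child = [1, 1, 2, 3, 5, None]
dirty_child = [1, 2, 3, 4, 5, 1, 2, 3, 4, 77]

for child in (clean_child, dirty_child):
    matched, total = relationship_score(child, parent_ids)
    print(matched, total, classify(matched, total))
```

The minimum match count guards against declaring a relationship on the strength of one or two coincidental values, which is common with small integer codes.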
In Part 5, we tie all the techniques discussed in the first four parts together to show the value of a repeatable data profiling process.
Data Profiling For All The Right Reasons, Part 3
The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 3: Attribute-Level Analyses
This is part three of a five-part series on data profiling.
In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy. The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect. In this section, we introduce attributes, the lowest-level components of a profiling effort.
An attribute is simply an individual data element. Alone, an attribute has no context. The simple descriptor “Cost” tells us very little about an attribute’s true purpose and immediately drives a need for additional information, such as units (hours, Dollars, Euros…), type (weighted, unit, gross…), and use (invoice, sum, average…). Attributes therefore must be analyzed within the context of their business purpose to have meaning.
Some characteristics require business knowledge to define and others can be determined through interrogation of existing values and underlying rules of the storage medium. It takes both analyses to get a complete picture of information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, only assumptions can result from physical profiling or business context.
Analyses of values, domains, and constraints allow insight into the use (or abuse) of an attribute. The larger the sample size, the more confidence you gain in the results. Without explicit proof of business rule enforcement, though, you must not assume that a value cannot exist just because it does not presently exist. Business rules are defined by business experts and enforced through database constraints, data type/precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not totally understand it. Profiling of actual values provides additional refinement in terms of percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
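The value-level metrics listed above can be gathered in one pass over a column. The Python sketch below is one possible arrangement: the function name, the dictionary keys, and the choice to compute the distinct percentage against all rows (NULLs included) are assumptions, not a prescribed layout.

```python
from collections import Counter

def profile_attribute(values, top_n=3):
    """Summarize one attribute: NULL/distinct percentages, range, lengths, top/bottom values."""
    total = len(values)
    present = [v for v in values if v is not None]
    counts = Counter(present)
    lengths = [len(str(v)) for v in present]
    profile = {
        "row_count": total,
        "null_pct": 100.0 * (total - len(present)) / total if total else 0.0,
        "distinct_pct": 100.0 * len(counts) / total if total else 0.0,
        "min_length": min(lengths) if lengths else None,
        "max_length": max(lengths) if lengths else None,
        "avg_length": sum(lengths) / len(lengths) if lengths else None,
        "top_values": counts.most_common(top_n),
        # Least frequent values first (standard Counter idiom).
        "bottom_values": counts.most_common()[:-top_n - 1:-1],
    }
    numeric = [v for v in present
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if numeric and len(numeric) == len(present):
        # Min/max/average are only reported when every non-NULL value is numeric.
        profile.update(min=min(numeric), max=max(numeric),
                       avg=sum(numeric) / len(numeric))
    return profile

# Hypothetical column sample.
column = [10, 20, 20, None, 30, 20, None, 10]
for key, value in profile_attribute(column, top_n=2).items():
    print(key, value)
```

Storing each run of such a profile with a timestamp gives you a baseline to compare against later.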
Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define uniqueness of the data set also may support relationships between entities. These attributes can be further classified as members of either a system-enforced primary key or a business key (outside of the defined primary key). System-enforced primary keys are relatively easy to identify within a database system through interrogation of the system catalog. Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
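One brute-force way to hunt for business keys is to test column combinations for uniqueness. The Python sketch below does this for combinations up to a small width. The function name, the `max_width` limit, and the sample rows are hypothetical. A combination that is unique in a sample is only a candidate and still needs confirmation from the business.

```python
from itertools import combinations

def candidate_keys(rows, columns, max_width=2):
    """Return the smallest column combinations whose values are unique across rows."""
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # Skip supersets of a key already found; they are trivially unique.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[c] for c in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return found

# Hypothetical table: "id" is the primary key, (store, register) a business key.
rows = [
    {"id": 1, "store": "A", "register": 1, "amount": 5.0},
    {"id": 2, "store": "A", "register": 2, "amount": 5.0},
    {"id": 3, "store": "B", "register": 1, "amount": 7.5},
    {"id": 4, "store": "B", "register": 2, "amount": 5.0},
]
print(candidate_keys(rows, ["id", "store", "register", "amount"]))
```

The number of combinations grows quickly with table width, so in practice you would limit the search to short code-like attributes and a small `max_width`.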
Attribute-level information of interest includes: data type (size and precision), column descriptions, the number and percent of NULL values, the number and percent of distinct values, and the minimum, maximum, and average values and lengths. The system catalog provides some of this information, but the rest must be collected by sampling the data.
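As an example of the catalog side, the sketch below reads declared column metadata from SQLite through Python's sqlite3 module, using SQLite's `PRAGMA table_info`. SQLite is used only because it ships with Python; other database systems expose the same kind of information through their own catalog views. The `customer` table and the `catalog_columns` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        name         VARCHAR(60) NOT NULL,
        credit_limit DECIMAL(10,2)
    )
""")

def catalog_columns(conn, table):
    """Read declared column metadata from SQLite's catalog.

    The table name is interpolated directly, so pass only trusted names.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, declared_type, notnull, default_value, pk).
    return [
        {
            "name": r[1],
            "declared_type": r[2],
            "not_null": bool(r[3]),
            "primary_key": bool(r[5]),
        }
        for r in rows
    ]

for column in catalog_columns(conn, "customer"):
    print(column)
```

Counts of NULL and distinct values, and the value and length ranges, are not in the catalog and still have to come from the data itself.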
Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may either allow you to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.
Part 4 expands on attribute profiling with the introduction of relationships and patterns.
Data Profiling For All The Right Reasons, Part 2
The Hub Designs Blog welcomes Part 2 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
This discussion is the second of a five-part series on data profiling. In Part 1, we discussed the project roles that benefit from data profiling and how better understanding information results in more reliable information systems. Important goals of any profiling strategy include automation of metric collection and socializing results to support the differing objectives of a data-centric project.
Early in a system development life cycle, profiling helps define sources, data storage requirements, and data transformations. As a system goes into production (or if profiling is added to an existing system for quality control purposes), routine profiling is useful to audit system quality and business rule enforcement. The frequency of collection and amount of effort you expend to automate your profiling methods should be based on the ability of the organization to benefit from the profile results.
This section discusses the beginnings of a profiling effort. Information assembled here forms the foundation of other profiling activities. For this discussion, consider a Profile Group as a set of information sharing a common purpose and data management methods. Examples of profile groups include tables within a single database schema or a group of spreadsheets with the same format but each spreadsheet representing a different time slice of data.
The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site being accessed through web services. The reason we abstract information into Systems is to group the information into distinct governance methods common to the underlying information. Relevant metadata and governance methods we track at the system-level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems. System-level metadata common to a profile group helps us understand and troubleshoot future analyses. This level of information also provides developers with an understanding of inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.
Entities within a profile group belong to the same system, may have a common unique identifier, and, for database entities, have the same schema owner. Typically, entities are database tables, but may also be similar files or spreadsheet tabs containing like attribute lists. For entities, we track characteristics common to all the attributes they contain. These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive, existence of change data management attributes such as insert/update timestamps, and existence of audit traceability indicators such as insert/update username.
The combination of system and entity-level profiling supplies the foundation for attribute-level profiling, which is where the physical information in a system resides. It also provides valuable metadata to classify information and allows for future correlation of like information across systems. Assembly and publication of entity and system-level information benefits the various consumers of the information by providing a centralized “master” source of contact and context information.
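To make the system and entity levels concrete, here is a hedged sketch of how such records might be modeled in Python. Every class, field, and column name is hypothetical, and only a subset of the metadata listed above is shown (growth characteristics, for instance, are left out). The audit-column check assumes a naming convention that your systems may not follow.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional

# Hypothetical naming conventions for change-tracking and audit columns.
TIMESTAMP_COLUMNS = {"inserted_at", "updated_at"}
USERNAME_COLUMNS = {"inserted_by", "updated_by"}

@dataclass
class SystemProfile:
    name: str
    system_type: str        # e.g. relational database, file directory, web service
    technical_contact: str
    business_owner: str
    host_os: str
    backup_schedule: str
    connection_string: str

@dataclass
class EntityProfile:
    system: SystemProfile
    name: str
    row_count: int
    description: str = ""
    last_analyzed: Optional[date] = None
    active: bool = True
    attribute_names: List[str] = field(default_factory=list)

    def has_change_timestamps(self) -> bool:
        """True if any insert/update timestamp column is present."""
        return bool(TIMESTAMP_COLUMNS & {a.lower() for a in self.attribute_names})

    def has_audit_usernames(self) -> bool:
        """True if any insert/update username column is present."""
        return bool(USERNAME_COLUMNS & {a.lower() for a in self.attribute_names})

erp = SystemProfile(
    name="erp_prod", system_type="relational database",
    technical_contact="dba-team", business_owner="Finance",
    host_os="Linux", backup_schedule="nightly",
    connection_string="<connection string goes here>",
)
orders = EntityProfile(
    system=erp, name="sales_order", row_count=125000,
    last_analyzed=date(2010, 1, 15),
    attribute_names=["order_id", "customer_id", "INSERTED_AT", "UPDATED_AT"],
)
print(orders.name, orders.has_change_timestamps(), orders.has_audit_usernames())
```

Linking each entity to its system record keeps contact and governance details in one place instead of repeating them per table.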
In Part 3, we will dive into the attribute level analyses around data profiling.
Data Profiling For All The Right Reasons, Part 1
The Hub Designs Blog welcomes another guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.
Part 1: The Psychology of Data Profiling
Swiss psychologist Carl Gustav Jung founded the Analytical School of Psychology. His word association theories form the basis of the Myers-Briggs Type Indicator Assessment test to identify career aptitude in today’s high school students. Dr. Jung’s approach assigned personality profiles based on how an individual’s thoughts associated with various phrases. By analyzing responses, he could understand how individuals viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought entering their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?
If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.
If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.
If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.
Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive-compulsive data architects compelled to organize the world in a strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, and the existence of NULL or default values, and in understanding the meaning of each data element to accurately identify business rules and when two or more objects are related or redundant.
Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.
It is truly amazing how much we can conclude from the response to one simple phrase.
As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.
Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.
This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling. In particular, it covers what information is important to collect at the system and table level and how the Enterprise can leverage that information to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.
Continue with Part 2.
2009 Year in Review
As we’re about to enter 2010, it’s a good time to reflect on what happened in 2009 and what it all means.
“It was the best of times; it was the worst of times…” So Dickens begins “A Tale of Two Cities”, but it’s also a good description of the past year.
The first half of the year was one of the most challenging I’ve faced in my twenty-three year career in business and technology. The second half of 2009 was better – not without its speed bumps but every month was a little better than the one before it.
The macro-economic climate has been tumultuous at best. But the second half of the year showed enough improvement that Hub Designs’ revenue for the year was up 33%. Not bad for a two and a half year old company during the worst economic conditions in 80 years …
Marketing and Thought Leadership
We launched a new web site in January, and it’s been well received. Total visits to www.hubdesigns.com were up 14% over 2008.
A little later in the year, we updated the “look and feel” of the Hub Designs Blog, branding it as the “world’s fastest growing blog covering master data management and data governance”. We’ve gotten more than 43,000 hits since we started writing in July 2007, and our readership more than doubled in 2009, to about 27,000 hits per year.
We published six issues of our “Best Practices in Master Data Management” newsletter this year, each going out to roughly 3,300 subscribers.
I wrote six articles for Information Management magazine, including some popular ones on “Product Information Management Challenges”, how to build a business case for master data management, and how to select the right MDM vendor for your organization. I also wrote for Identity Resolution Daily, on “The Growing Role of Identity Resolution in MDM” and “Matching – MDM’s Secret Sauce”.
With our partner Siperian, we wrote a white paper in August called “When Data Governance Turns Bureaucratic: How Data Governance Police Can Constrain the Value of Your MDM Initiative” that has generated quite a bit of discussion. You can download a copy of it here.
A second white paper, called “Best Practices for Leveraging D&B in Oracle E-Business Suite”, was written in partnership with Dun & Bradstreet. It describes using D&B information to drive better supply chain performance for companies using Oracle E-Business Suite. You can download it here.
I volunteer for the Education Committee of the Oracle Applications Users Group (OAUG). A big part of that effort lies in programming the MDM track for the annual conference. This year, it was in Orlando in May, and I really enjoyed speaking there and seeing people from the Oracle community that I don’t see very often. Here’s a link to my OAUG presentation.
We participated in conference calls with Oracle Development during the year, and ultimately attended the Oracle Fusion “Hands-On Validation & Testing” session for Customer MDM at Oracle headquarters in August. It was a great chance to get some early insights into Oracle’s next major product release and to see the progress Oracle has made in building out its Fusion MDM vision, which is striking in its powerful hub technology and its elegant & productive user interface.
In 2008, we attended the Gartner MDM Summit to decide whether to exhibit there in 2009. We were impressed enough with the conference that we did exhibit in 2009, in October in Los Angeles. We had a positive experience, so we’ll be a Silver level sponsor in April 2010 in Las Vegas. Since we specialize in MDM and data governance, we find the association with Gartner’s MDM event a powerful one.
I didn’t attend Oracle OpenWorld for the past couple of years, but this year I was glad I did. It was like “old home week”, seeing people from Oracle itself and from the broader Oracle community that I’ve met over the past 15 years. David Butler, Senior Director of MDM Marketing at Oracle, posted my presentation on Oracle’s web site, and said “you were our cleanup hitter and you hit a home run with the bases loaded”.
We also did webinars with our partners Siperian and Initiate Systems. The Siperian webinar covered the differences between MDM platforms like Siperian and ERP platforms like SAP from a master data perspective. The Initiate webinar, with Initiate’s CTO Marty Moseley, discussed developing a strong MDM business case, deploying core MDM technologies and lessons learned on the “build vs. buy” question.
Social Networking
After experimenting with social networking in 2008, this year we had a coordinated strategy to use the Hub Designs Blog, Facebook, LinkedIn and Twitter to communicate & collaborate with our clients, potential clients, team members, partners, suppliers, etc.
It’s a pretty simple strategy. Short updates (140 characters or less) go out on Twitter, and are re-published on both LinkedIn and Facebook. Longer updates (i.e. blog articles) are published on the Hub Designs Blog. We encourage responses and feedback using @replies on Twitter and comments on LinkedIn and Facebook, as well as longer-form comments on the blog. And we get them – almost every blog article gets at least one comment, sometimes as many as a dozen.
When a new blog article comes out, we notify everyone via a single update on Twitter. What’s amazing is that in 2009, social networking drove about 15% of the Hub Designs Blog’s total traffic. And one of our clients gave us some good feedback that our social networking activities help her stay current on what we’re up to, and help her feel connected to us as a company.
Another social networking experiment that developed further in 2009 was the MDM Community. We started this using Ning (a “social network in a box”) in November 2008, out of frustration with LinkedIn’s “Group” functionality. It now has more than 210 members, from 23 different countries. It’s still a work in progress, but if you’re interested in master data management or data governance, you should check it out at http://mdmcommunity.ning.com. It’s becoming an international “who’s who” of the MDM world.
Summary of Client Projects
In case you think the Hub Designs team has been doing nothing but marketing, writing white papers and magazine articles, speaking at conferences, and volunteering for user groups, here’s a summary of our 2009 client projects:
- Technology provider for vehicle dealers: integration of Oracle E-Business Suite with D&B data
- Payroll services company: integration of Oracle E-Business Suite with external credit information
- Information services company: technical support for customers using Oracle E-Business Suite
- Legal information services company: readiness assessment and product MDM strategy & design
- Simulation and engineering software company: advisor to data governance board
- Manufacturer of oil and gas equipment: integration of Oracle E-Business Suite R12 with D&B
- Software company: built connector between Oracle AR and D&B’s DNBi risk management solution
- Technology company: customer MDM strategy workshop
Out With The Old, In With The New
This past year has been a lot of fun, but it has also been somewhat exhausting. So I’m looking forward to a bit more deliberate pace in 2010.
We’re very excited about the coming year at Hub Designs. We’ve got some great projects underway and in the pipeline, and we’ll be continuing to grow and expand to meet our clients’ needs and market demands.
In closing, I’d like to say how grateful I am to my family, for their patience with my traveling so much and for their unconditional love.
Hidden Costs of Duplicate Customer Data
A client asked me last week about what rate of duplicate data was “normal” in customer master data.
My initial answer was that, among companies that don’t have any formal master data management, data governance or data quality initiatives in place, duplication rates of 10%-30% (or more) are not uncommon.
When I was at D&B, we used to routinely see that level of duplication in clients’ customer files.
In a study in the healthcare field, Children’s Medical Center Dallas engaged an outside firm to help clean up their duplicate data:
“Solving both the current and future problems around duplicate records helped Children’s improve the quality of patient care and increase physician acceptance of the new EHR. The duplicate record rate was initially reduced from 22.0% to 0.2% and five years later it remains an exceptionally low 0.14%. The 5 FTEs initially tasked with resolving duplicate records have been reduced to less than 1 FTE.”
“For the Children’s Medical Center, the results were heartening, not only from a care delivery standpoint but also because of the significant cost-savings that can be realized. A study conducted on Children’s data showed that on average, a duplicate medical record costs the organization more than $96.”
So it is possible to get the duplication rate down to really low levels through careful analysis and the application of the right tools, as part of an ongoing data governance program. Even the hospital above (and hospitals are usually not mentioned as practitioners of best practices) was able to maintain a duplication rate of only 0.14% after 5 years.
And there are very real costs to not de-duplicating your customer data. Depending on the functional area (marketing, sales, finance, customer service, etc.) and the business activities you undertake, high levels of duplicate customer data can:
- annoy customers or undermine their confidence in your company,
- increase mailing costs,
- cause hundreds of hours of manual reconciliation of data,
- increase resistance to implementation of new systems,
- result in multiple sales people, sales teams or collectors calling on the same customer,
- etc.
The best studies I’ve seen of the cost of duplicate data have been in the healthcare industry. One study I saw said:
“According to Just Associates, the direct cost of leaving duplicates in an Master Patient Index database is anywhere from $20 per duplicate to several hundred dollars. The lower cost reflects the organization’s labor and supply costs to identify and fix the record while the higher expense reflects the costs of repeated diagnostic tests done on a patient whose previous medical records could not be located.
The American Health Information Management Association (AHIMA) estimates that it costs between $10 and $20 per pair of duplicates to reconcile the records. If the records aren’t reconciled, however, the costs are even higher.”
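To see how per-duplicate figures like these add up, here is a back-of-the-envelope Python sketch. The file size of 100,000 records is purely hypothetical. The 22.0% and 0.14% rates and the $20 and $96 per-duplicate costs are the figures quoted above, applied here only as an illustration. It treats the duplicate rate as the fraction of records that are duplicates, which is an assumption about how the rate is defined.

```python
def duplicate_cost_range(record_count, duplicate_rate, cost_low, cost_high):
    """Return (duplicate_count, low_cost, high_cost) for a given duplicate rate."""
    duplicates = round(record_count * duplicate_rate)
    return duplicates, duplicates * cost_low, duplicates * cost_high

# Hypothetical file of 100,000 records, before and after cleanup.
RECORDS = 100_000
for label, rate in (("before cleanup", 0.22), ("after cleanup", 0.0014)):
    count, low, high = duplicate_cost_range(RECORDS, rate, cost_low=20, cost_high=96)
    print(f"{label}: {count} duplicates, ${low:,} to ${high:,}")
```

Even with rough inputs, a range like this is usually enough to start a business case conversation.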
Here are three more case studies backing up the range I quoted of 10%-30%:
- Once the analysis was complete, Sentara discovered they had a significant duplication rate, over 18%. They had attempted to address the duplication rate in the past through a remediation process, but due to either technology issues or because the cost of merging and cleaning up the duplicates across their many different systems was too high, they had not yet successfully reduced their duplication rate. Source: Initiate Systems success story
- Emerson Process Management faced a tremendous challenge four years ago in getting its CRM data in order: There were potentially 400 different master records for each customer, based on different locations or different functions associated with the client. “You have to begin to think about a customer as an organization you do business with that has a set of addresses tied to it,” says Nancy Rybeck, the data warehouse architect at Emerson who took charge of the cleanup. Working with Group 1, Rybeck analyzed the customer records for similarities and connections using everything from postal standards to D&B data, and managed to eliminate the 75 percent site-duplication rate the company suffered in its data. “That’s going to ripple through everything,” she says. Source: DestinationCRM.com
- Problem: Number of duplicate records: 20.9% of Utah Statewide Immunization Information System records. Impact of Problem: Difficult to find patients in system—key barrier to provider participation, risk of over-immunization—unable to find reliable patient record, cost of unnecessary immunizations, risk of adverse effects on patients. Source: health.utah.gov.
And here’s a good quote from a white paper titled “Data Quality and the Bottom Line” by The Data Warehousing Institute:
“Peter Harvey, CEO of Intellidyn, a marketing analytics firm, says that when his firm audits recently ‘cleaned’ customer files from clients, it finds that 5 percent of the file contains duplicate records. The duplication rate for untouched customer files can be 20 percent or more.”
Every organization will need its own metrics, but left unchecked, the duplication problem is a hidden cost that drags at your company, slowing down your processes and making your analyses less reliable.
If your sales analysis reports can’t be sure that there’s one and only one record for each of your largest customers, then the sales figures for those customers are probably not right. So the entire report becomes suspect at that point.
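A small Python sketch shows how this happens. The customer names, sales amounts, and the `match_key` helper are all made up for illustration, and the matching rule is deliberately crude; real matching tools rely on far more robust techniques such as fuzzy matching and address standardization.

```python
import re
from collections import defaultdict

# Legal suffixes to ignore when comparing names (a hypothetical, partial list)
SUFFIXES = {"inc", "corp", "co", "ltd", "llc"}


def match_key(name):
    """Build a crude matching key: lowercase, no punctuation, no legal suffix."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(w for w in cleaned.split() if w not in SUFFIXES)


def totals_by(records, key_func):
    """Sum sales amounts per customer, using key_func to identify the customer."""
    totals = defaultdict(int)
    for name, amount in records:
        totals[key_func(name)] += amount
    return dict(totals)


# Made-up sales rows: three of the four are really the same customer
sales = [
    ("Acme Corp.", 120_000),
    ("ACME Corp", 80_000),
    ("Acme, Inc.", 50_000),
    ("Globex LLC", 200_000),
]

raw = totals_by(sales, lambda name: name)   # one total per customer record
merged = totals_by(sales, match_key)        # one total per matched customer

print("Largest customer (raw):   ", max(raw, key=raw.get))
print("Largest customer (merged):", max(merged, key=merged.get))
```

In the raw totals, the largest customer appears to be Globex; once the three Acme records are matched, Acme turns out to be the larger account. That is exactly the kind of error that makes a sales report suspect.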
I’d like to end with a great quote on data quality by Ken Orr from the Cutter Consortium in “The Good, The Bad, and The Data Quality”:
“Ultimately, poor data quality is like dirt on the windshield. You may be able to drive for a long time with slowly degrading vision, but at some point, you either have to stop and clear the windshield or risk everything.”
Please let us know what you think by commenting here. We’re interested in hearing your thoughts on data quality and the issue of customer data duplication.
Oracle OpenWorld Presentation
I had a great time at the Oracle OpenWorld conference this year.
Oracle did a great job organizing the MDM track. There were a lot of great presentations, and a good balance of speakers between Oracle people, outside consultants and experts, and end users with success stories to share.
David Butler, Senior Director of MDM Marketing at Oracle, was kind enough to convert my presentation titled “Best Practices in Master Data Management and Data Governance” to PDF format and to post it on the Oracle.com MDM web page.
You can find it in the ‘Partners’ portlet on the right hand side of the page, or just click here.
MDM Track at the OAUG Conference
The Oracle Applications Users Group conference, COLLABORATE 10, is being held April 18-22, 2010 in Las Vegas, Nevada.
But the Master Data Management (MDM) track of COLLABORATE 10 needs YOUR help!
This is your final invitation to share your MDM and Data Governance success story, knowledge and expertise by presenting at the conference.
The MDM Track’s call for papers has been extended to 11:59 pm EDT on Monday, October 26; this deadline will not be extended further.
More than 5,000 users, technology leaders, Oracle executives and solution innovators will gather for the event April 18-22, 2010, at the Mandalay Bay Convention Center.
We hope we’ll see you there — as a speaker!
If you’re interested in presenting, all you need at this point is a title, a short abstract of 520 characters summarizing your idea, and up to five “bullet point” objectives.
If you’d like to submit a paper, just send an e-mail to info (at) hubdesigns (dot) com, giving me a brief sketch of your idea. I’ll respond with the URL you’ll need to submit it.
Silver Creek Systems
Another strong session at Oracle OpenWorld this afternoon.
Alison Schofield, the Product Strategy Director at Oracle for PIM Data Hub, led off the session by talking about the business challenges in improving the data quality of product information, calling it the “greatest threat to your PIM initiative.”
Items are formatted inconsistently, misclassified, with overloaded description fields and lots of non-standardized data.
Martin Boyd from Silver Creek Systems took over to talk about the DataLens product, which Oracle is now selling on an OEM basis on the Oracle price list.
Martin pointed out that 10% of the total effort will be on the MDM software implementation, 40% on establishing governance and documenting the master data architecture, and 50% on data remediation (according to AMR Research, “MDM Strategies for Enterprise Applications,” April 2007).
Data mastering is about “getting your data right” and “keeping it right”.
And there are very few standards governing product data (outside of your product information management system) – all of your legacy systems and outside trading partners are going to feed you a lot of product data of questionable quality.
Martin presented Silver Creek’s DataLens capabilities “at a glance”: standardization and validation of attributes and descriptions, translation between languages, assignment to popular product classification schemas, enrichment with internal and external data, matching and merging, and re-purposing so data can be published in any format for use by downstream systems.
Martin differentiated between tools designed to handle customer data quality and those handling product data.
Name and address data has a relatively fixed syntax, but product data has no fixed syntax. And there are only about 200 or so country address formats, while there are tens of thousands of product types.
Two thirds of companies use manual efforts or custom code, but they say it’s too unreliable (75%) or too slow (64%).
Gartner and many other analyst firms have given Silver Creek great reviews in the last few months.
Oracle’s Product Data Quality Server (DataLens bundled into and pre-integrated with Oracle PIM Hub by Oracle) has been used at large retail, manufacturing and health care companies.
The product’s capability starts with semantic recognition – recognizing the product within the current context – and then you can standardize, match, enrich, and repurpose the data, although those things are quite different for product data than for customer data.
The session wound up with a demo of DataLens, and the integration with Oracle’s PIM Hub.
I’ve spent the last six months on the product side of the master data management world, so I’ve found Silver Creek’s DataLens product very interesting, as it solves a major problem in the product MDM space. It was great seeing the Silver Creek folks presenting with Oracle at OpenWorld today.
Aaron Zornes’ Data Governance Session at Oracle OpenWorld
I’ve always enjoyed the depth and quality of Aaron Zornes’ analysis on master data management. I’ve been attending the MDM Summit conferences that he organizes in the U.S. with SourceMedia since 2006, and I’ve spoken at quite a few of his events.
Today I had the pleasure of hearing him speak on enterprise data governance. Here are some of his major points:
- Don’t settle for “passive” / downstream data governance; instead demand “active” / upstream data governance (please see my white paper with Siperian on this at http://forms.siperian.com/content/PowerGovernancePR).
- Don’t expect data governance maturity assessments to solve all your problems and provide a roadmap out of data governance anarchy.
- Today’s “data stewardship consoles” are substantially less than true enterprise data governance.
- Vendor viability does matter.
- Be prepared to spend $250k-$500k for an initial data governance solution.
Aaron styles himself as the “godfather of MDM” and today was a good reminder of why he deserves that title.
First Day at Oracle OpenWorld
Having a dedicated MDM track at Oracle OpenWorld this year makes a big difference, in terms of being able to find the sessions more easily and in the focus and energy in the sessions.
First up today was a panel discussion on Hyperion Data Relationship Management (DRM). It was moderated by my friend Rahul Kamath from Oracle, and included Dongyan Wang from NetApp, Anand Raaj from Halliburton, and Nimish Mehta from Lumendata. It was very well done, and gave some good insights into the role that DRM can play as a hierarchy management tool in an MDM environment.
Next was Pascal Laik, VP of MDM Product Strategy at Oracle, who co-presented with Cisco’s Kin-Ching Wu. Pascal talked about the reality of complex, heterogeneous environments, and the difference between “push mode” and “pull mode”. He discussed the business drivers of growth, efficiency, IT agility and compliance, and the hard work Oracle has been doing over the past couple of years to help its customers to create their business cases and document the ROI that MDM has been realizing for them. Pascal laid out Oracle’s end-to-end data quality, pre-built integration and data governance strategies, and announced the new Data Governance Manager as a way to Define, Operate, Monitor and Fix data in the hub. Interestingly, 95% of the applications that Oracle customers integrate with are non-Oracle applications.
KC Wu from Cisco discussed their Customer Registry program, which draws data from 40 source systems and publishes it to about 80 downstream systems. She described a fascinating 10-year journey up the MDM maturity model.
The highlight of the next session for me was Bill Miller, a senior IT person at Oracle whom I’ve known for several years, who recently successfully implemented Oracle Customer Hub 8.0 at Oracle. It was very interesting to hear him describe how Oracle has put in place a lot of customer MDM and data governance best practices.
The last session of the day was Vanessa Hsu from Oracle, along with Kelle O’Neal from First San Francisco Partners and Angie Couron from Symantec. They did a great session on enterprise data governance, and gave a “first look” at Data Governance Manager.






