
Archive for January 2010

28
Jan

Siperian Acquired By Informatica

Siperian, one of the last best-of-breed providers of master data management (MDM) technology, is being acquired by Informatica.

The two firms were already working together closely, having an alliance and OEM relationship through Informatica’s acquisitions in 2008 of Identity Systems (for entity resolution and matching) and in 2009 of Address Doctor (for customer address cleansing).

This will strengthen the Siperian product further by bringing Informatica’s technology even more tightly into the Siperian MDM Hub.

At the same time, it eliminates the “company viability” question mark that sometimes gets raised in large IT shops’ minds when evaluating enterprise software vendors. When a Fortune 500 company evaluates a smaller vendor, it sometimes wonders how long a company like Siperian can last against companies like IBM, Oracle and SAP. I’ve never been a big fan of that argument, since some of the best software gets created at small and medium-sized companies, but there’s no doubt it’s an obstacle to be overcome with the larger enterprises. Now, it shouldn’t be an issue.

As a Siperian partner, Hub Designs is excited about this acquisition. Based on the information we’ve got at this point, it seems like a good thing for Siperian’s customers, products, shareholders, partners and people. In today’s economic climate, dreams of a big IPO (for any venture-backed technology company) are unlikely, so an acquisition by a well-run larger company is a good outcome.

I know a lot of the people at Siperian personally, and have worked closely with them over the last few years. I hope the people at Informatica realize what a strong team they are getting in this acquisition, and do everything they can to hang onto them all.

I do suggest they stop using the term “MDM Infrastructure” though (which appeared five times in Informatica’s press release announcing the acquisition). First, it’s not accurate – MDM projects typically need to be driven by the business to be successful, so they can’t and shouldn’t be thought of as “IT Infrastructure” projects. Second, from a marketing perspective, “infrastructure” is about as exciting as mud – it’s hard to get senior management excited about spending money on something with the word “infrastructure” in the name.

As for the acquisition’s impact on the rest of the MDM market: the market is still growing pretty quickly, but the number of players is shrinking, so I think we’ll see it become even more competitive. And with Informatica now becoming a strong player in the MDM hub market, that’s got to cool its relationship with Oracle, which selected Informatica as an OEM component of its Oracle Fusion MDM hub.

IBM is rumored to be acquiring Initiate Systems, which is an interesting play in its own right, especially given the expected growth in spending in the e-healthcare space over the next few years.

And SAP continues to improve its products slowly but steadily, while D&B/Purisma is doing some interesting things with web services access to the D&B central database of information on businesses.

As for the remaining independent MDM vendors, like Orchestra Networks and Kalido, or Product Information Management (PIM) solutions like Stibo and Riversand, they should see this as further validation of the strength of the MDM market. Kalido feels that it’s the only independent MDM provider that can manage every master data domain. That may be true. I plan on learning more about Kalido over the next few months.

So like the Chinese curse, “may you live in interesting times”, the beginning of 2010 promises to be interesting for all of us in the MDM business!

If you’d like to continue the discussion on the “Impact of Informatica’s Acquisition of Siperian”, click http://ning.it/aJ1Xj5.

28
Jan

Data Profiling For All The Right Reasons, Part 4

The Hub Designs Blog welcomes Part 4 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 4: Profiling Relationships and Patterns

This is part four of a five-part series describing how data profiling assists in all aspects of system development, from design through deployment.

Part One introduced different perspectives on data profiling. Part Two identified valuable system and entity metrics to track. Part Three discussed attributes. In this segment, we dive deeper into attribute relationships and pattern recognition. We also expand on the earlier discussion of primary key identification and look at hidden relationships.

Pattern grouping provides a mask of distinct format patterns within an attribute data set and a count of the number of occurrences. Patterns give insight into the type of values found in an attribute. For example, a numeric pattern analysis may show values such as 999.99999, 99, or -.9999.

Observing distinct patterns gives insight into the maximum digits and precision, and also into domains such as integer or real. Patterning a native database date or date-time type yields unremarkably similar results for every value. Because the database management system typically enforces the domain, pattern analysis of native date columns provides no value and can be ignored. If dates are stored in character format, however, patterns quickly show variations in date formatting. Character patterns only have significance for a limited number of positions. It makes no sense to pattern a description field of 200 or 2000 characters. Smaller code attributes of fewer than 10 characters, though, do provide value. Start by ignoring pattern profiling for character strings over 20 characters, then narrow the limit to shorter strings if the results do not add value.
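As a rough illustration of pattern grouping, here is a minimal Python sketch. The masking rules (every digit becomes 9, every letter becomes A, everything else is kept) and the function names pattern_mask and pattern_counts are assumptions made for this example, not the conventions of any particular profiling tool. The max_len default follows the 20-character starting point suggested above.

```python
from collections import Counter


def pattern_mask(value):
    """Return a format mask: digits become '9', letters become 'A', other characters are kept."""
    out = []
    for ch in str(value):
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)


def pattern_counts(values, max_len=20):
    """Count occurrences of each distinct mask, skipping NULLs and long strings."""
    counts = Counter()
    for v in values:
        if v is None:
            continue
        text = str(v)
        if len(text) > max_len:
            continue  # long free-text fields rarely produce useful patterns
        counts[pattern_mask(text)] += 1
    return counts


# Dates stored as character data: the masks expose the formatting variations.
dates = ["2010-01-28", "01/28/2010", "2010-01-25", "Jan 28 2010"]
for mask, n in pattern_counts(dates).most_common():
    print(mask, n)
```

Running this on character-format dates lists each distinct mask with its count, which is usually enough to spot a stray format at a glance.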

In pure database theory, referential integrity (RI) is your friend. In practice, designers and software vendors often forgo RI to improve system performance on data inserts. These designers place the data quality burden on the application and do not endorse data manipulation outside the application interfaces. In the real world, though, data corruption occurs, and without RI or routine data quality checks, corruption may go undetected for a long time or never be found at all. Personally, I have identified over $50,000 of recent orphaned sales at a retail client resulting from deliberately disabled RI. These unreported sales were never added to the ledger; RI had been disabled for performance reasons, and the orphans kept accumulating until I found them through simple profiling. Enforcement of RI is a topic for another discussion, but it is mentioned here because it identifies a valid reason for data profiling.
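A simple orphan check of this kind can be sketched with Python's built-in sqlite3 module. The sale_header and sale_line tables and their columns are hypothetical, invented for this illustration; they are not the retail client's actual schema. The query is an ordinary anti-join, one common way to find child rows with no parent.

```python
import sqlite3

# Hypothetical tables with no foreign key declared between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale_header (sale_id INTEGER PRIMARY KEY);
    CREATE TABLE sale_line (line_id INTEGER PRIMARY KEY, sale_id INTEGER, amount REAL);
    INSERT INTO sale_header VALUES (1), (2);
    INSERT INTO sale_line VALUES (10, 1, 25.0), (11, 2, 40.0), (12, 3, 99.5);
""")

# Anti-join: child rows whose parent key is populated but has no matching parent row.
ORPHAN_SQL = """
    SELECT c.line_id, c.sale_id, c.amount
    FROM sale_line AS c
    LEFT JOIN sale_header AS p ON p.sale_id = c.sale_id
    WHERE c.sale_id IS NOT NULL AND p.sale_id IS NULL
"""

orphans = conn.execute(ORPHAN_SQL).fetchall()
orphan_total = sum(row[2] for row in orphans)
print(len(orphans), "orphaned rows, total amount", orphan_total)
```

Scheduling a query like this as a routine check gives some of the protection of RI without the insert overhead that led to RI being disabled.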

Even in presumably good relational designs, some parent-child relationships are not enforced, for various reasons. First, interrogate the RI listed in the system catalogs to identify all enforced relationships. Reverse-engineering a system with a good modeling tool is probably the best way to do this. A harder and more valuable analysis is to identify unenforced relationships and to determine the probability of a relationship when not all values are an exact match. Do this by counting all the candidate child attribute values that exist within a known parent attribute table. If all match and there is a non-trivial number of matches, there is a good probability of a non-identified relationship. A small number of mismatches could identify data quality issues.
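The counting step can be sketched in a few lines of Python. The function name relationship_score and the shape of its result are assumptions made for this example. The threshold for a "non-trivial" number of matches is left to the analyst.

```python
def relationship_score(child_values, parent_values):
    """Count how many non-NULL candidate child values exist in the parent attribute."""
    parent = {v for v in parent_values if v is not None}
    child = [v for v in child_values if v is not None]
    matched = sum(1 for v in child if v in parent)
    return {
        "checked": len(child),
        "matched": matched,
        "mismatched": len(child) - matched,
        "match_ratio": matched / len(child) if child else 0.0,
    }


# Hypothetical data: store codes on an order table versus a store master table.
order_store_codes = ["A", "B", "A", "Z", None]
store_master_codes = ["A", "B", "C"]
print(relationship_score(order_store_codes, store_master_codes))
```

A ratio of 1.0 over many rows suggests an unenforced relationship. A ratio just below 1.0 points to a probable relationship with a few data quality problems worth investigating.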

In Part 5, we tie all the techniques discussed in the first four parts together to show the value of a repeatable data profiling process.

Continue with Part 5 or go back to Part 3.

25
Jan

Data Profiling For All The Right Reasons, Part 3

The Hub Designs Blog welcomes Part 3 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 3: Attribute-Level Analyses

This is part three of a five-part series on data profiling.

In Part One, we took a light-hearted view of where profiling benefits an organization and in Part Two, we discussed the fundamentals of a profiling strategy.  The remaining three parts discuss attributes, relationships, patterns, and how to use the combined data profiling information you collect.  In this section, we introduce attributes, the lowest-level components of a profiling effort.

An attribute is simply an individual data element.  Alone, an attribute has no context.  The simple descriptor “Cost” tells us very little about an attribute’s true purpose and immediately drives a need for additional information, such as units (hours, Dollars, Euros…), type (weighted, unit, gross…), and use (invoice, sum, average…).  Attributes therefore must be analyzed within the context of their business purpose to have meaning.

Some characteristics require business knowledge to define and others can be determined through interrogation of existing values and underlying rules of the storage medium. It takes both analyses to get a complete picture of information within a system. While assembling this puzzle, though, keep in mind that until you validate the enforcement of business rules, only assumptions can result from physical profiling or business context.

Analyses of values, domains, and constraints allow insight into use (or abuse) of an attribute. The larger the sample size, the more confidence you gain in the results. Without explicit proof of business rule enforcement, though, you must assume that just because a value does not presently exist does not mean it cannot exist. Business rules are defined by business experts and enforced through database constraints, data type/precision, and application code. Knowing the methods of enforcement allows you to narrow a domain but not totally understand it. Profiling of actual values provides additional refinement in terms of percentage of NULL values, percentage of distinct values, minimum, maximum, and average values, top x and bottom x recurring values along with their counts, and minimum, maximum, and average data lengths.
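The value metrics listed above can be collected with a short routine. The following Python sketch is illustrative only: the name profile_attribute and its result keys are invented here, and it assumes the non-NULL values of an attribute share one comparable type.

```python
from collections import Counter


def profile_attribute(values, top_n=3):
    """Profile one attribute; assumes its non-NULL values share one comparable type."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    ranked = counts.most_common()  # most frequent values first
    lengths = [len(str(v)) for v in non_null]
    # Averages only make sense when every value is a number.
    numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )

    def pct(part):
        return 100.0 * part / total if total else 0.0

    return {
        "rows": total,
        "pct_null": pct(total - len(non_null)),
        "pct_distinct": pct(len(counts)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "avg": sum(non_null) / len(non_null) if numeric else None,
        "top": ranked[:top_n],
        "bottom": ranked[-top_n:],
        "min_len": min(lengths) if lengths else None,
        "max_len": max(lengths) if lengths else None,
        "avg_len": sum(lengths) / len(lengths) if lengths else None,
    }


print(profile_attribute([10, 20, 20, None, 30]))
```

The percentages here are taken over all sampled rows, NULLs included. Some teams prefer to compute the distinct percentage over non-NULL rows only, so the choice should be documented alongside the results.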

Some attributes within a data set serve valuable purposes that are important to identify. Attributes that individually or in conjunction with others define uniqueness of the data set may also support relationships between entities.  These attributes can be further classified as members of either a system-enforced primary key or a business key (outside of the defined primary key).  System-enforced primary keys are relatively easy to identify within a database system through interrogation of the system catalog.  Business keys that exist in tables in addition to a primary key may be more difficult to identify, especially if more than one attribute is needed to define uniqueness.
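One brute-force way to look for candidate business keys is to test column combinations for uniqueness in a data sample. This Python sketch is an illustration under stated assumptions: the function name candidate_keys, the max_width limit, and the rule that a NULL disqualifies a combination are choices made for this example. Uniqueness in a sample only suggests a key; it does not prove one.

```python
from itertools import combinations


def candidate_keys(rows, columns, max_width=2):
    """Return minimal column combinations whose values are unique across the sampled rows.

    rows is a list of dicts keyed by column name.
    """
    found = []
    for width in range(1, max_width + 1):
        for combo in combinations(columns, width):
            # A superset of a known key is unique too, but not minimal, so skip it.
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = set()
            unique = True
            for row in rows:
                value = tuple(row[c] for c in combo)
                if None in value or value in seen:
                    unique = False
                    break
                seen.add(value)
            if unique:
                found.append(combo)
    return found


# Hypothetical sample: "id" is the primary key, (store, sku) is a business key.
sample = [
    {"id": 1, "store": "A", "sku": "X"},
    {"id": 2, "store": "A", "sku": "Y"},
    {"id": 3, "store": "B", "sku": "X"},
]
print(candidate_keys(sample, ["id", "store", "sku"]))
```

The number of combinations grows quickly with max_width, so on wide tables it helps to first drop attributes with a very low percentage of distinct values.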

Attribute-level information of interest includes: data type (size and precision), the number and percent of NULL values, column descriptions, number and percent of distinct values, and the minimum-maximum-average values and lengths.  The system catalog provides some of this information, but the rest must be collected by sampling the data.
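The split between catalog information and sampled information can be shown with SQLite, used here only because it ships with Python; other database systems expose their catalogs differently. The product table is hypothetical, and the helper names catalog_columns and sampled_null_count are invented for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        sku VARCHAR(10) NOT NULL,
        unit_cost REAL
    );
    INSERT INTO product VALUES (1, 'AB-100', 4.25), (2, 'AB-200', NULL);
""")


def catalog_columns(conn, table):
    """Read declared column metadata from the SQLite catalog.

    PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column.
    The table name is interpolated directly, so pass trusted names only.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"name": r[1], "declared_type": r[2], "not_null": bool(r[3]), "pk_position": r[5]}
        for r in rows
    ]


def sampled_null_count(conn, table, column):
    """Count NULLs by reading the data, since the catalog cannot report this."""
    sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(sql).fetchone()[0]


for col in catalog_columns(conn, "product"):
    print(col, "nulls:", sampled_null_count(conn, "product", col["name"]))
```

The catalog answers what is declared, while the query against the data answers what is actually stored. A profile needs both.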

Other types of attributes that may help in identifying relevancy are those that provide system-level auditing or change control. Knowing which attributes fill these roles may allow you either to (a) ignore them for profiling purposes or (b) use them to help explain versions or data anomalies.

Part 4 expands on attribute profiling with the introduction of relationships and patterns.

Continue with Part 4 or go back to Part 2.

18
Jan

Data Profiling For All The Right Reasons, Part 2

The Hub Designs Blog welcomes Part 2 of this series by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 2: Profiling the Basics

This discussion is the second of a five-part series on data profiling. In Part 1, we discussed the project roles that benefit from data profiling and how better understanding information results in more reliable information systems. Important goals of any profiling strategy include automation of metric collection and socializing results to support the differing objectives of a data-centric project.

Early in a system development life cycle, profiling helps define sources, data storage requirements, and data transformations. As a system goes into production (or if profiling is added to an existing system for quality control purposes), routine profiling is useful to audit system quality and business rule enforcement. The frequency of collection and amount of effort you expend to automate your profiling methods should be based on the ability of the organization to benefit from the profile results.

This section discusses the beginnings of a profiling effort. Information assembled here forms the foundation of other profiling activities. For this discussion, consider a Profile Group as a set of information sharing a common purpose and data management methods. Examples of profile groups include tables within a single database schema or a group of spreadsheets with the same format but each spreadsheet representing a different time slice of data.

The underlying System managing a set of information within the profile group may be a named relational database, a file system directory, or even a web site being accessed through web services. The reason we abstract information into Systems is to group the information into distinct governance methods common to the underlying information. Relevant metadata and governance methods we track at the system-level include: technical contacts, backup schedules, system descriptors, connection strings, business unit owners, and host operating systems. System-level metadata common to a profile group helps us understand and troubleshoot future analyses. This level of information also provides developers with an understanding of inherent restrictions (or freedoms) they may encounter when trying to use or integrate the information.

Entities within a profile group belong to the same system, may have a common unique identifier, and, for database entities, have the same schema owner. Typically, entities are database tables, but may also be similar files or spreadsheet tabs containing like attribute lists. For entities, we track characteristics common to all the attributes they contain. These include: row counts, entity-level descriptors, growth characteristics (size and frequency), last analyzed date, and various customized indicators such as active/inactive, existence of change data management attributes such as insert/update timestamps, and existence of audit traceability indicators such as insert/update username.
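To make the system-level and entity-level metadata concrete, here is a minimal Python sketch of the two record types. The class names, field names, and the audit column naming conventions (insert_ts, update_ts, insert_user, update_user) are all assumptions for illustration. A real repository would follow the organization's own standards.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed naming conventions for change-data and audit attributes.
TIMESTAMP_COLUMNS = {"insert_ts", "update_ts"}
USERNAME_COLUMNS = {"insert_user", "update_user"}


@dataclass
class SystemProfile:
    """System-level metadata shared by every entity in a profile group."""
    name: str
    kind: str  # e.g. relational database, file directory, web service
    technical_contact: str
    business_owner: str
    host_os: str
    backup_schedule: str


@dataclass
class EntityProfile:
    """Entity-level metrics: characteristics common to all attributes of the entity."""
    system: SystemProfile
    name: str
    row_count: int
    columns: List[str]
    last_analyzed: Optional[date] = None
    active: bool = True

    def _has_all(self, required):
        return required <= {c.lower() for c in self.columns}

    def has_change_timestamps(self):
        return self._has_all(TIMESTAMP_COLUMNS)

    def has_audit_usernames(self):
        return self._has_all(USERNAME_COLUMNS)


crm = SystemProfile("CRM", "relational database", "dba@example.com",
                    "Sales Operations", "Linux", "nightly")
customer = EntityProfile(crm, "customer", 120000,
                         ["customer_id", "name", "INSERT_TS", "UPDATE_TS"],
                         last_analyzed=date(2010, 1, 18))
print(customer.has_change_timestamps(), customer.has_audit_usernames())
```

Keeping the system record separate from the entity record means contact and governance details are stored once per system rather than repeated for every table.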

The combination of system and entity level profiling supplies the foundation for the attribute-level profiling, which is where physical information in a system resides. It also provides valuable metadata to classify information and allows for future correlation of like information across systems. Assembly and publication of entity and system level information benefits the various consumers of the information by providing a centralized “master” source of contact and context information.

In Part 3, we will dive into the attribute level analyses around data profiling.

Continue with Part 3 or go back to Part 1.

10
Jan

Data Profiling For All The Right Reasons, Part 1

The Hub Designs Blog welcomes another guest post by Rob DuMoulin, an information architect with more than 26 years of IT experience, specializing in master data management, database administration and design, and business intelligence.

Part 1: The Psychology of Data Profiling

Swiss psychologist Carl Gustav Jung founded the Analytical School of Psychology. His word association theories form the basis of the Myers-Briggs Type Indicator Assessment test to identify career aptitude in today’s high school students. Dr. Jung’s approach assigned personality profiles based on how an individual’s thoughts associated with various phrases. By analyzing responses, he could understand how individuals viewed the world around them and perceived themselves. Typically, subjects are asked to speak the first thought entering their minds after hearing a trigger phrase. For the following example, remember, there are no wrong answers. If I say the words “Data Profiling”, what is the first thing you think of?

If you thought of food, cats, country music, CSI NY, or residential plumbing, you are either not in IT or are an IT Manager.

If your first thought was “Quality Assurance”, you align yourself with data quality professionals having anti-social thoughts of failing test cases and sadistically reporting lazy developers for buggy code. You gleefully scour test cases looking for any evidence of truncation, missing values, non-matching codes, numeric precision errors, and inconsistent abbreviation, text, and date formatting.

If “Integration” comes first in your mind, past legacy integration projects have scarred you with a disdain for source system data quality levels. You view production apps with contempt and loathe the time it takes to track down data issues caused by system integrations. You investigate upstream sources to create detailed mappings and transformation rules. Typical debugging sessions consist of validating relationships to identify orphaned data, identifying attributes that contain overloaded columns (attributes containing more than one distinct data element), or fixing format errors from implied decimals.

Some of you responded with “Value Domains” or “Data Types”, indicating you are obsessive compulsive data architects compelled to organize the world into strict and orderly fashion with some degree of normalization, though you are not considered “normal” by your peers. Your concerns lie in understanding and regulating naming conventions, relationships, existence of NULL or default values, and understanding the meaning of each data element to accurately identify business rules and when two or more objects are related or redundant.

Lastly, if “Debugging” is the first item in your thought queue, you are a coder justifying why presumably good code is not working. Extreme paranoia has taught you to assume nothing about data quality, so you add tests to identify duplicates, validate relationships, enforce business rules, track change data capture, and provide substitute values. Your phobia of early morning phone calls causes you to add auditing to your code to inform a DBA of data issues rather than waking you up in the middle of the night.

It is truly amazing how much we can conclude from the response to one simple phrase.

As stated before, there are no wrong answers. Aside from the innocent jab at Managers and non-IT resources, we all realize the benefits of information quality and absolutely need business involvement to understand context and domains of business information. The meaning and actions of Data Profiling change both by role and by project phase. Through profiling, we are able to identify best sources of information, learn proper ways to categorize and store it, reactively identify quality issues, and proactively define business rules to prevent future issues.

Identifying what is important to profile, when and how profiling is done, and how to share our findings across business and project resources is key. Done properly, profile results integrate to a master metadata repository and are periodically refreshed through an automated process.

This five-part series provides a tool-agnostic approach to comprehensive data profiling, focusing on information meaning and use. The next part of the series discusses system and table-level profiling. In particular, it covers what information is important to collect at the system and table level and how the enterprise can leverage that information to help assure quality. The third part dives into attribute-level profiling and the fourth discusses attribute patterns and relationships. The final part discusses the benefits and utility of gathering profiled information into a single repository.

Continue with Part 2.

4
Jan

Silver Creek Systems Acquired by Oracle

It had to happen eventually: Oracle is acquiring Silver Creek Systems, a leading provider of product data quality solutions.

I first became familiar with Silver Creek through a chance meeting with Martin Boyd, Silver Creek’s VP of Marketing, at the Fall 2007 MDM Summit in New York. We both ran into someone from Weyerhaeuser, and all of us ended up going out to dinner at a great New York steak house.

I stayed in touch with Martin after that, and gradually learned more about Silver Creek’s product data quality solution, DataLens. I’ve said for a long time that data quality plays a critical role in master data management, so as I learned more about product information management (PIM) and product MDM, I naturally wanted to learn more about Silver Creek.

I profiled Silver Creek in April 2009, and my first hunch that they might end up getting acquired by Oracle came with the announcement later that April about the OEM relationship between Oracle and Silver Creek, where Oracle would pre-integrate Silver Creek’s DataLens solution with Oracle’s Product Data Hub.

This blog covered Silver Creek again in October 2009, when Martin Boyd gave a great presentation at Oracle OpenWorld, saying that “10% of the total effort will be on the MDM software implementation, 40% on establishing governance and documenting the master data architecture, and 50% on data remediation” (according to AMR Research).

So I’m pleased but not surprised to see the news of Oracle’s acquisition today. For more information, you can read Oracle’s press release here.
