In the next segment of Rob DuMoulin’s three part series, he continues to discuss the dangers of implicit knowledge and the importance of context.
The first segment of this series defined implicit values as ones that require additional context in order to infer an unambiguous meaning from the value. The segment went on to define how coded values are a valuable design practice that does not cause concern when each component of the implicit relationship exists within a single domain. A derivation to the coded values concept called “Overloading” was discussed as an undesirable practice where one column or record spans more than one domain. In this segment, we’ll discuss the practices of implicit defaulting and the assigning of implicit business meaning to blanks or NULL values other than the business meaning of “uh, nobody put a value here.”
From a user perspective, text fields containing a NULL or any number of blank spaces look the same. Although the casual business user cannot tell the difference between NULL and blanks, the computer treats them very differently and even changes retrieval performance based on the distinction. Suppose a data source or entry screen of a Yes/No business attribute defines that a NULL has an implicit meaning of ‘No’ because the value was not explicitly set to ‘Yes’. Since ‘No’ and NULL implicitly mean the same thing, the business cannot determine if the value was intentionally meant to be ‘No’ or if the user accidently missed data entry, but intended to select ‘Yes’. Unless somebody realizes and corrects the mistake, the value will be incorrect and nobody will ever know. Now suppose this column propagates through and ends up in an MDM system and the corporate data warehouse as Null and considered the best version of the truth. To accurately perform analyses using the column with this implied business rule, one must check for a ‘Y’ or the existence of both NULL and ‘No’. Another analysis option would be to check for a ‘Yes’ or a Not Equal to ‘Yes’ condition. In either case, a consumer of this column has to implicitly know that a NULL and ‘No’ are the same and even with this knowledge, the consumer still does not have confidence that the column was intentionally meant to be ‘No’.
Expand the above discussion to consider a lookup code value. The practice of allowing NULL code values is referred to as creating an ‘optional relationship’ within a parent (code table) and a child (reference table). Code tables are, by definition, explicit domains of allowable codes with each having a mutually-exclusive meaning for a record. Allowing NULL values for codes poses a similar ambiguity towards the data meaning. Did the user or source neglect to enter a value? Does the existence of NULL implicitly mean something else, like ‘Not Applicable’?
I can think of no positive business value to introducing ambiguity in data, ever. Moreover, there should be no compelling argument to intentionally inject ambiguity into new designs, especially in MDM or BI systems which are supposed to be reliable and pristine. When doing data integration from multiple systems, such as merging data into MDM or BI systems, the existence of NULLs and implicit values introduces several unnecessary challenges. MDM and BI systems are tasked with the collect of information from multiple sources with the intention of conforming it to a single Enterprise view. Adding implicit values to that process ensures logic will be needed in the transformation layer to conform it to the “Best Version of the Truth”. Once in the MDM Hub, more logic is necessary to transform the data back to its implicit form for each of the consuming systems.
“Oh what a tangled web we weave when first we try to cut corners and be implicit.”
In the last part of this three-part series, we will review the revelations thus far and discuss ways to both manage implicit legacy conditions and implement policies to keep from creating any additional ones.
 For example, Oracle does not index null values, so if one has a dataset that is predominantly nulls for a certain column, a query for a non-null value will perform quickly using the index, whereas a query for a null value would result in a table scan and would not use the index.