Tuesday 14 February 2012

Has data modelling become old-fashioned?

When Object-Oriented programming and UML was becoming popular, there was a trend to focus on OO principles like polymorphism, encapsulation an other design patterns. Hence data modelling became less popular with developers. Although we have in UML class diagrams, the old-fashioned data modelling principles are not fashionable anymore. Too much emphasis was (and is being) laid on the functional side of the analysis (using Activity diagrams and Use Case diagrams) and on OO and CBD design principles and patterns.

But I have seen too many projects where the only data analysis techniques that where applied were: we need a table to store some information, I need a primary key (preferably a system-generated one), some columns and that's all. If I asked then what is the logical unique key, they look at me as if I came directly from Mars. "That is something we will handle in our code!". Or "We have a business rule engine for handling that!".

This article wants to promote some simple principles that may seem old-fashioned, but will guarantee that your data is stored properly, allowing you to develop better/easier code and offering you easier retrieval and manipulation of your data. And when choosing a data modelling tool, it will give you insight on features that make the difference.

Logical and Physical Model

Although UML class diagrams support a great deal of the classical data modelling techniques, they do serve a different purpose and lack the flexibility of data modelling tools that work with a logical and physical model. This twofold approach is giving you greater flexibility and less headache if you want to change relations or logical keys or try to keep the logical model in sync with your physical model. It also offers the possibility to focus on logical decisions in the logical model and to postpone physical decisions to the physical model.

Suppose you have a classical Ordering database with tables like customer, order, order line, delivery, delivery line, invoice, invoice line, product, product category, service, discount, stock, warehouse. Most of these tables are all linked together with foreign keys. During development you discover you made a crucial error in one of the master tables product: the product code is only unique in combination with the product category. The impact this might have on all tables linked with a foreign key may be huge.

Some less experienced analysts try to overcome this problem by creating tables with one primary, meaningless key, generated by the system that is used as foreign key in the linked tables. This is a good physical approach but these analysts skip one important step: performing proper data analysis.

Logical Model

In a logical (data) model, you can change relationships, promote identifiers to become primary identifiers  and demote primary identifiers to become secondary identifiers, without any impact on related entities. Something you cannot do on a physical model or even worse, on a physical database containing company data. In a logical model, you focus on logical decisions. You define entities, attributes, relations, candidate identifiers, deciding which one is primary and which are secondary. And you try to model your entities until the Third Normalisation Form (see Normalisation process).
  • The relation with Order and the attribute Line Nr is the primary identifier for an Order Line
  • Two different Order Lines on  the same Order cannot order  the same Product. The relation with Product and the attribute Line Nr is a unique identifier.

You do not bother about system-generated primary keys: they are a physical decision and belong to the physical model. 

Physical Model

Once you are satisfied with the result, you can transform the logical model into a physical model:
  • entities will become tables
  • attributes become columns
  • relations become foreign keys 
  • primary identifiers become primary keys
  • secondary  identifiers become unique keys
The transformation will result for the table Order Line in:
  • A primary key, consisting of two columns: Order Nr and Line Nr
  • A unique key, consisting of two columns: Order Nr and Product Code
  • A foreign key column Order Nr 
  • A foreign key column Product Code
This might be a very nice solution from a logical point of view, but for technical reasons this is, as we saw, not a good solution:
  • What happens if a the code changes in the Product table? Also change it in the Order Line?
  • Since Order Line will be linked to other tables like Discount and Invoice Line, these tables would inherit all the columns of the primary key (Order Nr and Line Nr) as a foreign key in their table. And what if one of those numbers are changed.
This is were your data modelling tool can prove added value. The thing you want to achieve when transforming your logical model into a physical model are:
  • Create for all tables a meaningless system-generated primary key
  • Use this primary key as foreign key column to all related child tables
  • Translate the primary identifier as unique key
  • Translate the secondary identifier(s) as unique key(s)
The end result should look similar to this:


Of course you can perform these tasks yourself, but this is very cumbersome and you are bound to make mistakes.

Tool Features


So when considering a data modelling tool, following features are indispensable:

Logical Model

  • Promote unique identifiers to become primary identifiers
  • Demote primary identifiers to become unique identifiers 
  • Ability to define multiple unique identifiers per entity based on attributes, relations or a combination of both. Even an expensive and renowned tool like PowerDesigner does not allow defining a relationship as part of a unique identifier, only as part of the primary identifier.

Transformation

  • Generate all relationships as foreign keys in the physical model
  • Translate Primary Identifiers as Primary keys 
  • Translate Unique Identifiers as Unique Keys
  • Ability to generate a meaningless primary key for each table, leaving all unique keys coming from the logical model intact.
  • Apply changes in the Logical Model to the Physical Model to keep both synchronised
  • Generate DDL statements (CREATE, ALTER, ...) for your preferred database.

Reference

A nice discussion regarding the same issue can be found on https://forums.oracle.com/forums/thread.jspa?threadID=1665886.

On http://www.databaseanswers.org/modelling_tools.htm you have a complete list of data modelling tools. I preferred Oracle Designer but it is quite expensive!

If you want to know how you have to read ERD diagrams using the Barker notation and specifically their relationships, please refer to http://www.essentialstrategies.com/publications/modeling/barker.htm.

2 comments:

  1. Nice article, though I don't agree with creating surrogate keys all over the place just in case a code changes.
    I checked out what you said about PowerDesigner. What you said is correct, but not complete. Via the relationship definition, selecting the 'Dependent' option adds the relationship to the primary identifier in the child entity. If you want the relationship to participate in a non-primary identifier, you have to edit the identifier in the child entity.
    Either:
    1. Change the existing identifier to 'non-primary'
    Or:
    2. Create a new non-primary identifier, and add the FK attribute(s)via the 'attributes' tab on the identifier's property sheet.

    ReplyDelete
    Replies
    1. Hi junkie,

      I agree surrogate keys are not always necessary. There are other reasons for it as well beside code changing.

      Regarding PowerDesigner, in the version 15 we used at Toyota, there is no way to add a relationship to a non-primary identifier. You can only add attributes to a non-primary identifier. The dependent option will always add the relationship to the primary identifier. Adding the FK's attribute to the non-primary identifier is not possible since the relation is not yet transformed to an attribute. This only happens on the physical model when relations are translated to foreign keys and the pk from the parent table is placed as a fk in the child table.

      So you need to address these "logical" decisions in the physical model, since from there you can add any column to the foreign key but this is quiet a hassle.

      Delete