The need for a Data Dictionary in Data Migration
At a bare minimum the data dictionary needs to contain the following attributes:
• Source data type
• Target data type
Further fields can optionally be added to the data dictionary.
This can be implemented in many ways, depending partly on how public the data needs to be. For example, it can be stored as an Excel workbook, an Access database or in SQL Server itself (possibly with an Access front end to allow simple updates to the underlying data).
It’s advisable to make the data dictionary publicly available, for example by publishing a report on it or by making the Access / Excel data available.
In all cases, I would recommend change tracking, which lends itself more to storing the data in SQL Server, even if it’s then made visible to all relevant SMEs via Excel as a linked source.
Common transformation rules could include:
• Cast as data type
• Format datetime (style 112)
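As a sketch of what those two rules might look like when implemented in T-SQL (the literal values here are purely illustrative):

```sql
-- Cast as data type:
SELECT CAST('123' AS int);

-- Format a datetime using style 112 (yyyymmdd):
SELECT CONVERT(varchar(8), GETDATE(), 112);
```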
Data Matching – Some Hints and Tips
When performing data matching, I find that it helps to create match keys on common elements if you are matching / merging data during the migration.
For example, if you are matching postcodes you may want to normalise them all into the same format, e.g. remove all spaces, prior to doing any matching.
For addresses it may be worth creating a string match key from the address, but here you have to be careful: if you strip unwanted characters like ‘-’ out of house numbers, 1-11 becomes the same as 111. This is particularly important if you are using any of the fuzzy transforms in SSIS.
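A minimal sketch of both points, assuming hypothetical source table and column names (src.Address, Postcode, AddressLine1):

```sql
-- Postcode match key: upper-case and remove spaces so 'so14 3ab' matches 'SO14 3AB'.
SELECT UPPER(REPLACE(Postcode, ' ', '')) AS PostcodeKey
FROM   src.Address;

-- Address match key: strip punctuation carefully. Removing '-' outright would
-- make house numbers 1-11 and 111 identical, so replace it with a space instead.
SELECT UPPER(REPLACE(REPLACE(AddressLine1, '-', ' '), ',', '')) AS AddressKey
FROM   src.Address;
```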
Interestingly, in the past I’ve used Jaro-Winkler routines heavily in data matching to match customer names and also address details. This is now one of the routines available in Master Data Services; unfortunately it’s restricted to certain versions of SQL Server, in particular:
• SQL Server 2012 Business Intelligence (64-bit) x64
• SQL Server 2012 Enterprise (64-bit) x64 – upgrade from SQL Server 2008 R2 Enterprise only
• SQL Server 2012 Developer (64-bit) x64
• Microsoft SQL Server 2008 R2 Enterprise (64-bit) x64
• Microsoft SQL Server 2008 R2 Developer (64-bit) x64
Alternatively, if CLR is enabled there are quite a few resources on the net for the CLR version of the algorithm, and there is also a T-SQL version of the routine on SQLServerCentral which you could use.
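For reference, CLR integration is off by default and can be switched on at the server level like this (requires appropriate permissions):

```sql
-- Enable CLR integration so a C# implementation of the algorithm can be deployed.
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;
```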
I’ve performance tested some of the C# CLR versions but haven’t done any testing against the T-SQL version there, although it’s worth pointing out that SQLServerCentral also has optimisations for some of the functions used in it.
Using a mapping table for lookups
One of the key fields for a mapping table where, for example, the migration is a merge and de-duplicate, is an indicator of which record is the destination record. If the migration has multiple sources you also need to include attributes such as source system / sub-system, as well as the reference attributes and, obviously, the target key.
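Pulling those fields together, a mapping table might look something like the following sketch (the schema, table and column names are illustrative assumptions, not a definitive design):

```sql
-- Hypothetical mapping table for a merge / de-duplicate migration.
CREATE TABLE map.CustomerMapping (
    MappingID       int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem    varchar(30)  NOT NULL,  -- which source system the row came from
    SourceSubSystem varchar(30)  NULL,      -- optional sub-system within the source
    SourceRef       varchar(50)  NOT NULL,  -- source key, held in a least restrictive type
    TargetKey       int          NOT NULL,  -- key of the record in the new system
    IsDestination   bit          NOT NULL   -- 1 = the surviving record after de-duplication
);
```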
In most cases a single key field can be used to accommodate multiple source attributes, but if you are going for this approach it must use the least restrictive data type, and you will have to be prepared to always cast the data to that data type (e.g. varchar / nvarchar).
Performance-wise, it may be worth grouping the source data so that the lookup table columns can be typed more precisely; that would give much better performance than casting all the data.
My personal preference for keys in a system is still the good old-fashioned identity column, defined as either INT or BIGINT depending on the expected number of records and record growth, although there is also a good case for using GUIDs.
In all cases, in a migration, and especially one with a merge, the keys will need to be pre-determined to avoid always having to join to all the source systems. This can be achieved in many ways, including using identity columns on work tables, the ROW_NUMBER() / DENSE_RANK() functions, or the OUTPUT clause of any INSERT statements.
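As one sketch of the ROW_NUMBER() approach, pre-assigning keys across two hypothetical sources into a work table (all object names here are assumptions for illustration):

```sql
-- Pre-assign target keys for customers merged from two source systems.
WITH AllSources AS (
    SELECT SourceSystem = 'CRM',     SourceRef = CustomerRef FROM crm.Customer
    UNION ALL
    SELECT SourceSystem = 'Billing', SourceRef = AccountRef  FROM billing.Account
)
SELECT
    TargetKey = ROW_NUMBER() OVER (ORDER BY SourceSystem, SourceRef),
    SourceSystem,
    SourceRef
INTO map.CustomerKey   -- work table holding the pre-determined keys
FROM AllSources;
```

An identity column on the work table, or capturing generated keys with an OUTPUT clause on the insert into the target, would achieve the same end.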
We will over time be publishing examples of all these options. However, the design of the mapping table will depend on a number of factors in the migration, not least whether the old system uses codes for key values or has descriptions. For either method the initial population of the table can generally be performed via a SELECT DISTINCT columnname from the source system.
Any joins to the lookup table need to be done either as updates or as LEFT JOINs, to ensure that the absence of a code in the lookup table doesn’t cause the row to be dropped from the original source data.
Reports should be run regularly to verify that there are no NULLs resulting from the LEFT JOIN (or update), since these indicate missing entries in the table.
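The join and the verification report could be sketched as follows, assuming hypothetical src.Customer and map.StatusLookup tables:

```sql
-- LEFT JOIN so a missing code does not drop the source row.
SELECT s.CustomerRef, s.StatusCode, m.TargetStatus
FROM   src.Customer AS s
LEFT JOIN map.StatusLookup AS m
       ON m.SourceSystem = 'CRM'
      AND m.SourceCode   = s.StatusCode;

-- Regular verification report: rows here indicate missing lookup entries.
SELECT s.StatusCode, COUNT(*) AS MissingRows
FROM   src.Customer AS s
LEFT JOIN map.StatusLookup AS m
       ON m.SourceSystem = 'CRM'
      AND m.SourceCode   = s.StatusCode
WHERE  m.SourceCode IS NULL
GROUP BY s.StatusCode;
```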
As part of the analysis of the data, a data mapping document should be produced; ideally this should be signed off by the end user.
The data mapping should show the source and destination of all fields, but should also identify any fields that are not going to be migrated; these are the items that will cause most problems if the requirements change later.
One of the key prerequisites for data migration is setting up a plan for translating data from the old system to the new system or warehouse. This requires creating a number of lookup lists, both for key entities (e.g. customers, accounts, addresses) and for reference data (e.g. the old system may have yes / no where the new system has 0 / 1 for the equivalent data). In order to accurately translate data from the old system to the new system it is vital to manage these lookups correctly.
It is preferable to create these lookup lists as tables rather than hard-coding them in the transformations, since this increases flexibility.
Lookup lists may be implemented as separate tables or in one large table; my personal preference is for all straight translation tables to be combined into a single table with a key attribute indicating the source system of the data.
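A minimal sketch of that single-table approach, including an initial population via SELECT DISTINCT, follows; the object names, the ActiveFlag attribute and the yes / no values are illustrative assumptions:

```sql
-- One translation table for all straight value translations,
-- keyed by source system, attribute and source value.
CREATE TABLE map.ValueTranslation (
    SourceSystem  varchar(30)  NOT NULL,
    AttributeName varchar(50)  NOT NULL,  -- e.g. 'ActiveFlag'
    SourceValue   nvarchar(50) NOT NULL,  -- e.g. 'Yes'
    TargetValue   nvarchar(50) NULL,      -- e.g. '1'; filled in during mapping sign-off
    CONSTRAINT PK_ValueTranslation
        PRIMARY KEY (SourceSystem, AttributeName, SourceValue)
);

-- Initial population from the distinct values in the source system:
INSERT INTO map.ValueTranslation (SourceSystem, AttributeName, SourceValue)
SELECT DISTINCT 'CRM', 'ActiveFlag', ActiveFlag
FROM   src.Customer;
```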
For entity translations, the design of the structure will be influenced by whether there is an element of de-duplication of the data, or whether it’s a straight-through translation from one entity to its equivalent in the new system.
We will be creating a number of articles discussing structures for translation tables which we hope you will find interesting.