Data Matching

Data Matching – Some Hints and Tips

When performing data matching, I find it helps to create match keys on common elements when you are matching / merging data during a data migration.

For example, if you are matching postcodes you may want to normalise them all into the same format, e.g. remove all spaces prior to doing any matching.

For addresses it may be worth creating a string match key from the address, but here you have to be careful: if you strip unwanted characters like ‘-‘ out of house numbers, 1-11 becomes the same as 111. This is particularly important if you are using any of the fuzzy transforms in SSIS.
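As a rough illustration (the staging table and column names here are made up), a set of match keys might be created along these lines:

-- Normalise the postcode and build a simple address match key
-- (dbo.StagingCustomers, Postcode and Address1 are illustrative names)
SELECT CustomerId,
       REPLACE(UPPER(Postcode), ' ', '')                   AS PostcodeMatchKey,
       UPPER(REPLACE(REPLACE(Address1, ',', ''), '.', '')) AS AddressMatchKey  -- note: '-' deliberately left in
FROM   dbo.StagingCustomers;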

Interestingly, in the past I’ve used Jaro-Winkler routines heavily in data matching to provide match scores on customer names and also on address details. This is now one of the routines available in Master Data Services – unfortunately it’s restricted to certain versions of SQL Server, in particular:

  • SQL Server 2012 Business Intelligence (64-bit) x64
  • SQL Server 2012 Enterprise (64-bit) x64 – upgrade from SQL Server 2008 R2 Enterprise only
  • SQL Server 2012 Developer (64-bit) x64
  • Microsoft SQL Server 2008 R2 Enterprise (64-bit) x64
  • Microsoft SQL Server 2008 R2 Developer (64-bit) x64

Alternatively, if CLR is enabled there are quite a few resources on the net for the CLR version of the algorithm, and there is also a T-SQL version of the routine on SQL Server Central which you could use.

I’ve performance-tested some of the C# CLR versions but haven’t done any testing against the T-SQL version – although it’s worth pointing out that there are optimisations for some of the functions it uses posted on SQL Server Central.
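Whichever implementation you use, the matching query itself ends up looking much the same. Purely as an illustration (dbo.JaroWinkler is a hypothetical scalar function name – substitute whichever CLR or T-SQL routine you deploy, and tune the threshold to suit your data):

SELECT s.CustomerName AS SourceName,
       t.CustomerName AS TargetName,
       m.Score
FROM   dbo.SourceCustomers AS s
CROSS JOIN dbo.TargetCustomers AS t
CROSS APPLY (SELECT dbo.JaroWinkler(s.CustomerName, t.CustomerName) AS Score) AS m
WHERE  m.Score >= 0.9;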

Compiling CLR without Visual Studio

To compile a VB or C# module as a CLR routine, run the following from a command (DOS) prompt.

(C#)

path=%path%;c:\windows\microsoft.net\framework\v3.5
csc /out:UserDefinedFunctions.dll /target:library userdefinedsourcefile.cs

(VB)
path=%path%;c:\windows\microsoft.net\framework\v3.5
vbc /out:UserDefinedFunctions.dll /target:library userdefinedsourcefile.vb

Note that this doesn’t give you the benefits of debugging and tracing that you get with Visual Studio, but there are cases where a CLR routine is faster to write (I’m thinking of string manipulation here) than the equivalent inline T-SQL.
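Once the DLL is compiled it still has to be registered in SQL Server. A rough sketch only – the assembly path, class name and function signature below are assumptions and need to match your own source file:

-- Enable CLR on the instance (if not already enabled)
EXEC sp_configure 'clr enabled', 1;
RECONFIGURE;

-- Register the compiled DLL and expose the function
CREATE ASSEMBLY UserDefinedFunctions
FROM 'C:\CLR\UserDefinedFunctions.dll'   -- path is illustrative
WITH PERMISSION_SET = SAFE;

CREATE FUNCTION dbo.JaroWinkler (@string1 NVARCHAR(4000), @string2 NVARCHAR(4000))
RETURNS FLOAT
AS EXTERNAL NAME UserDefinedFunctions.UserDefinedFunctions.JaroWinkler;  -- assembly.class.method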

Structure of a mapping table

Using a mapping table for lookups

Some of the key fields for mapping tables – where, for example, it’s a merge and de-duplicate – include an indicator of which record is the destination record. If the migration has multiple sources you also need to include things like source system / sub-system as well as the reference attributes and, obviously, the target key.

In most cases a single key field can be used to accommodate multiple source attributes – but if you go for this approach it must use the least restrictive data type, and you will have to be prepared to always cast the source data to that data type (e.g. varchar / nvarchar).
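As a sketch of the sort of structure described above (all names are illustrative):

CREATE TABLE dbo.CustomerMapping
(
    MappingId       BIGINT IDENTITY(1,1) PRIMARY KEY,
    SourceSystem    VARCHAR(30)   NOT NULL,  -- which source system the row came from
    SourceSubSystem VARCHAR(30)   NULL,
    SourceKey       NVARCHAR(100) NOT NULL,  -- least restrictive type; cast all source keys to this
    TargetKey       BIGINT        NULL,      -- key of the record in the new system
    IsDestination   BIT           NOT NULL DEFAULT 0  -- flags the surviving record in a merge / de-duplicate
);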

Performance-wise, it may be worth grouping the source data so that you can type the data in the lookup table more precisely – that gives much better performance than casting all the data.

My personal preference for keys in a system is still the good old-fashioned identity column, defined as either INT or BIGINT depending on the expected number of records and record growth, although there is also a good case for using GUIDs.

In all cases, in a migration – especially one with a merge – the keys will need to be pre-determined to avoid always having to join to all the source systems. This can be achieved in many ways, including identity columns on work tables, the ROW_NUMBER() / DENSE_RANK() functions, or even the OUTPUT clause of any INSERT statements.
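For example, pre-assigning keys with ROW_NUMBER() might look something like this (a sketch only, reusing the illustrative mapping table from above):

-- Pre-assign target keys, offsetting from the highest key already issued
DECLARE @MaxKey BIGINT = (SELECT ISNULL(MAX(TargetKey), 0) FROM dbo.CustomerMapping);

UPDATE m
SET    TargetKey = @MaxKey + x.rn
FROM   dbo.CustomerMapping AS m
JOIN  (SELECT MappingId,
              ROW_NUMBER() OVER (ORDER BY SourceSystem, SourceKey) AS rn
       FROM   dbo.CustomerMapping
       WHERE  TargetKey IS NULL) AS x
  ON   x.MappingId = m.MappingId;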

We will over time be publishing examples of all these options; however, the design of the mapping table will depend on a number of factors in the migration, not least whether the old system uses codes for key values or holds descriptions. Generally, for either method the initial population of the table can be performed via a SELECT DISTINCT columnname from the source system.

Any joins to the lookup table need to be done either as updates or as LEFT JOINs, to ensure that the absence of a code in the lookup table doesn’t cause the row to be dropped from the original source data.

Reports should be run regularly to verify that there are no NULLs resulting from the left join (or update), since these indicate missing entries in the table.
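A simple check along these lines (table and column names are illustrative), run against each lookup, is usually enough:

-- Source codes with no entry in the lookup table
SELECT s.SourceCode, COUNT(*) AS MissingRows
FROM   dbo.StagingData AS s
LEFT JOIN dbo.CodeLookup AS l
       ON l.SourceCode = s.SourceCode
WHERE  l.SourceCode IS NULL
GROUP BY s.SourceCode;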

 

Setting up for Data Migration

One of the key requisites for data migration is setting up a plan for translating data from the old system to the new system or warehouse. The requirements include creating a number of lookup lists, both for key entities (e.g. customers, accounts, addresses) and for reference data (e.g. the old system may have Yes / No where the new system has 0 / 1 for the equivalent data). In order to accurately translate data from the old system to the new system it is vital to manage these lookups correctly.

It is preferable to create these lookup lists as tables rather than coding them into the transformations, since this increases flexibility.

Lookup lists may be held as separate tables or in one large table; my personal preference is for all straight translation tables to live in a single table with a key attribute indicating the source system and attribute.
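As a sketch of that single-table approach (names are illustrative):

CREATE TABLE dbo.CodeTranslation
(
    SourceSystem  VARCHAR(30)   NOT NULL,
    AttributeName VARCHAR(50)   NOT NULL,  -- which attribute the row translates, e.g. 'ActiveFlag'
    SourceValue   NVARCHAR(100) NOT NULL,  -- value in the old system, e.g. 'Yes'
    TargetValue   NVARCHAR(100) NOT NULL,  -- value in the new system, e.g. '1'
    CONSTRAINT PK_CodeTranslation PRIMARY KEY (SourceSystem, AttributeName, SourceValue)
);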

For entity translations, the design of the structure will be influenced by whether there is an element of de-duplication of the data or whether it’s a straight-through translation from one entity to its equivalent in the new system.

We will be creating a number of articles discussing structures for translation tables which we hope you will find interesting.

Excel files with mixed data

Ever find that when you import data from Excel files with SSIS you sometimes lose data? One of the issues is that SSIS only samples a small number of rows – it’s fully documented at http://technet.microsoft.com/en-us/library/ms141683.aspx. In summary, adding IMEX=1 to the source connection will resolve a lot of these issues automatically; otherwise you have a few options:

  1. Because of the sample size issue you can reorder the source spreadsheet to get a mixture of the correct data types in the first 8 rows (or use the tip in the TechNet article to change the sample size).
  2. Create a new column for the offending column in the source data and use =TEXT(cell) in this column – this forces SSIS to import the data as text (doesn't always work, but sometimes does).
  3. Convert the spreadsheet to tab-delimited (which also gives you the option of doing a straight bulk insert).
  4. Change the connection string to add IMEX=1 – this forces the connection to be reset but resolves most issues ** EXCEPT ** when there is text data in a column that is mostly numeric! (An example connection string is shown below.)

The biggest issue is that once SSIS determines the type for the data, that’s it – you can’t amend it.
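For reference, IMEX goes into the Extended Properties of the connection string – something along these lines (the file paths are illustrative; use the Jet provider for .xls and the ACE provider for .xlsx):

Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Prices.xls;Extended Properties="Excel 8.0;HDR=YES;IMEX=1";

Provider=Microsoft.ACE.OLEDB.12.0;Data Source=C:\Data\Prices.xlsx;Extended Properties="Excel 12.0 Xml;HDR=YES;IMEX=1";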

SSIS – Modifying column names in flat file

When column names are not in the first row of the source file, it can be a pain to change all the column names in the SSIS package. If you’ve only got a couple of columns of data it’s OK, but when you have hundreds of columns it’s extremely painful.
One workaround is to point your connection at the flat file source, save the package, and use View Code to get at the XML schema. At this point your columns are named Column 0 to Column n.

The following tip ONLY works if the column names are in the SAME order as your file – or any additional columns in the table are at the beginning / end of the list. Ensure results are set to text, not grid!

select Name, ColId from syscolumns where id = object_id('targettable') order by ColId asc

Use the Alt key to select the column names (including trailing spaces, to make sure you get “all” of each name) from the results pane and copy the selection. Do this for columns 0-9.

Then go into the code view in Visual Studio, find Column 0, use Alt + Shift to select columns 0-9 and replace them with the contents copied from Management Studio.

Repeat the process for columns 10-99 in Management Studio and Visual Studio, then again for columns 100-999 (hopefully you won’t have to repeat it for columns 1000-9999!).

One minor quirk – the last column has a slightly different format, so manually correct that, or just perform this exercise excluding the last column (the last column has the object type inline with the column list).

Finally, select the columns you have just amended in Visual Studio – again using Alt + Shift (or Alt + mouse) – and do a search and replace to remove the redundant spaces.

Having just had to create an SSIS package with 400+ columns, I can say it’s a lot easier than doing it manually!

SSIS Unpivoting data

Had a request to load some prices into SQL Server from an Excel spreadsheet – unfortunately the data was in a pivoted format without the raw, unpivoted data.

 

[Screenshot: ssis1]

In the Unpivot transformation, the pivot key value becomes the description for each unpivoted value – in this example I wanted to unpivot a variety of durations and get back the prices, so I set the pivot key value to the duration. With the spreadsheet as the source, I created a table as the output – using New table from the OLE DB destination – modified the table name, and we have a new table of the unpivoted data.

It’s much easier if the spreadsheet contains the original unpivoted data, but this gets you the same results. For this client I also added another data flow task for data from another sheet in the same spreadsheet, and finally a transform to merge the data from both sources into their data tables.
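If you would rather land the pivoted sheet in a staging table first, a T-SQL UNPIVOT gives the same result – a sketch only, with made-up table and column names:

-- Each duration column becomes a row with a Duration label and a Price value
SELECT ProductCode, Duration, Price
FROM   dbo.StagedPrices
UNPIVOT (Price FOR Duration IN ([12 Months], [24 Months], [36 Months])) AS u;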

Have fun with it – it’s easier than loading the data in manually

Loading XML Files

This should be a lot easier than it is, but if you look for it on the web you will get confused.

Put simply – if you have an embedded DTD in the file it won’t work within SSIS, so the easiest solution is: don’t use SSIS, use

SELECT * FROM OPENROWSET(
   BULK 'C:\users\kevin\downloads\shopwhale.xml',
   SINGLE_BLOB) AS x

To use this, create a table with a column of data type XML and insert into it.
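Something along these lines (the table name T matches the snippets that follow):

CREATE TABLE T (XmlCol XML);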

INSERT INTO T(XmlCol)
SELECT * FROM OPENROWSET(
   BULK 'C:\users\kevin\downloads\shopwhale.xml',
   SINGLE_BLOB) AS x

If there is an embedded DTD you need to convert the data explicitly (the column returned by OPENROWSET is called BulkColumn):

INSERT INTO T(XmlCol)
SELECT convert(XML, BulkColumn, 2) FROM OPENROWSET(
   BULK 'C:\users\kevin\downloads\shopwhale.xml',
   SINGLE_BLOB) AS x

Using convert(xml, columnname, 2) gives limited DTD support; as the documentation puts it:

“Enable limited internal DTD subset processing.

If enabled, the server can use the following information that is provided in an internal DTD subset to perform nonvalidating parse operations.

  • Defaults for attributes are applied.
  • Internal entity references are resolved and expanded.
  • The DTD content model will be checked for syntactical correctness.

The parser will ignore external DTD subsets. It also does not evaluate the XML declaration to see whether the standalone attribute is set yes or no, but instead parses the XML instance as if it is a stand-alone document.”

This gives you a single row holding the entire document – not very usable (the example is an XML feed for shopwhale) – but we can use CROSS APPLY to convert this into XML rows in a new table:

-- Cross apply to get 1 row per product
SELECT NewTable.ProductDetails.query('.') AS ProductName
into tTransformed
FROM t CROSS APPLY XmlCol.nodes('//product') AS NewTable(ProductDetails)

We now have all the XML rows in the table and can then extract the components – if you do a SELECT TOP n from the transformed table you can see the structure – click on the XML and it’s loaded in a new pane in Management Studio.

SELECT ProductName.value('(//product/offer_id)[1]', 'int') as Offer_id
,ProductName.value('data(//product/merchant_category)[1]', 'nvarchar(60)') as merchant_category
,ProductName.value('data(//product/update_date)[1]', 'datetime') as update_date
,ProductName.value('data(//product/regular_price)[1]', 'money') as price
,ProductName.value('data(//product/model_number)[1]', 'nvarchar(38)') as Model_number
into tTransformedValues
FROM tTransformed

Hey presto – we have the rows in a table in a usable format. Wrap this into a stored procedure, pass in the name of the XML document, and you end up with a table that you can use!
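As a final sketch (the procedure name is made up, and the shredding statements above would follow the insert): because OPENROWSET(BULK ...) only accepts a literal file path, the wrapper ends up building dynamic SQL.

CREATE PROCEDURE dbo.usp_LoadXmlFile
    @FilePath NVARCHAR(260)
AS
BEGIN
    -- OPENROWSET(BULK ...) only takes a literal path, hence the dynamic SQL
    DECLARE @sql NVARCHAR(MAX) =
        N'INSERT INTO T (XmlCol) ' +
        N'SELECT CONVERT(XML, BulkColumn, 2) ' +
        N'FROM OPENROWSET(BULK ''' + REPLACE(@FilePath, N'''', N'''''') + N''', SINGLE_BLOB) AS x;';
    EXEC sys.sp_executesql @sql;
END;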

Standards

The MOST important thing about standards is to have some, keep them and enforce them. Even if they are not the “best” standards in the world at least everything is consistent.

Look for

Naming conventions

  • Tables
  • Stored procedures
  • Triggers
  • Files – data, log and dump

Stored procedure conventions

  • Change audit and description
  • Parameters
  • Variables
  • Use of rowcount and @@error

As a simple example, which is easier to read?

create procedure usp_SelClients @cl int = -1
as
 select Client_name, House_No, House_name, Street, Town, Postcode from clients join addresses on clients.addressid = addresses.addressid where Client_id=@cl

or
Create Procedure usp_SelClients @ClientId int = -1
as
       select
              Client_name,
              House_No,
              House_name,
              Street,
              Town,
              Postcode
       from
              clients
       join
              addresses
       on
              clients.addressid = addresses.addressid
       where
              Client_id=@ClientId

And that’s a small stored procedure .. what happens when you get a complicated select statement with multiple joins and conditions!