The Extract, Transform & Load (ETL) process typically accounts for 70% of an entire data warehousing/data migration effort (according to The Data Warehouse ETL Toolkit). And this proportion represents a major element of the project in terms of risk: the implementation could make or break the project. This fact should be taken into consideration while designing and implementing an ETL solution.
Every staging project has two elements: the ETL itself and a staging data structure. So today I’d like to talk about best practices for standing up a staging area using SQL Server Integration Services [ETL] and hosting a staging database in SQL Server 2012 [DB]. The ‘best practices’ are across three areas: Architecture, Development, and Implementation & Maintenance of the solution. Each area represents key patterns and practices (not a comprehensive list) for the ETL component and the data structure of the staging database.
And since I presume readers are looking for best practices silver bullets, that is what I will try to deliver. What you will not find in this post are turnkey ETL performance optimizations.
[DB] – Staging Database patterns/practices
[ETL] – Extract, Transform & Load patterns/practices.
Architecture Best Practices
1. [DB] Design 101: Design the staging database as a mirror copy of the source database. Create required objects with the same name and data types as source objects. This creates familiarity and ease of querying without the need to learn source-to-staging mappings.
2. [DB] Partitioning: Implement SQL Server database partitioning on the staging database based on the same partitioning scheme of the DB2 database. This may involve creating custom attributes to support partitioning in the staging database and also deriving this custom attribute in the ETL from legacy data. This step may not necessarily help a lot in the load of the staging database but will help in the next step of the process (i.e. using the staging database for further processing where aligned keys create optimal query plans).
3. [DB] Separating concerns: Study the source system’s data usage patterns. If a data split strategy can be achieved by identifying a subset of primary data and secondary data, employ it. Split the staging database if needed based on this. Querying a smaller set of data regardless of partitioning and other enhancements is always a good option. (Example: Separating data on the basis of current and historic data.)
4. [ETL] Source data format: Sourcing data in the form of Raw Files as opposed to direct database connections is your best option for reasons of performance and also flexibility, as your solution can be easily coupled as a solution that loads source data by partition. (Provided that the source data is given partitioned, that is. Ask for this: the DB2 Unload utility can help you get source data in partitioned file sets based on your requirement.)
5. [ETL] Reference Architecture: If you don’t already have something in place for Logging, Auditing and Exception Handling, a good place to start would be Microsoft Project Real Reference Implementation. Although aimed at the SQL Server 2005 BI stack, one can use the core concepts to extend the framework into a custom 2012 framework without much difficulty. This will also help in setting baselines for data transfer performance.
6. [ETL] Design Patterns: Establish patterns or reuse existing best practices like parent-child package executions, execution control via configurations, and final execution from SQL Agent as opposed to direct package execution from DTExec, package templates etc. A good resource would be the book SQL Server 2012 Integration Services Design Patterns.
Development Best Practices
1. [ETL] Code Reuse: The best approach to code reuse with SSIS is package templates. Once a package pattern has been established in BIDS, use the package as a template so that the reusable components (like logging, auditing etc.) can be used without spending additional development time. Here are instructions to create a package template in BIDS.
2. [DB & ETL] Standards: This item is a pretty obvious yet often the most ignored one, in my opinion. Even the simplest guidelines go astray if not paid attention to. Imagine a simple case where using default component names raises an exception “Data Flow 23 failed in package 14.” Go figure.
3. [ETL] Gotchas: Be prepared for these during a DB2 data migration:
3a) DB2 timestamp conversion: The conversion of DB2 timestamp to the SQL Server counterpart although simple can be overlooked. The mapping of data types should be from DB2 Timestamp to SQL Server Datetime2(6). The precision 6 is important or the default 7 will end up adding an extra 0 to the timestamp after data type conversion. The data type mapping between database and the source data was the easy part. The transformation needs to be handled in the ETL (a derived column) for converting a DB2 date to SQL Server acceptable format.
Example: (DB2) 2008-12-09–18.34.07.560041 –> (SQL Server) 2008-12-09 18:34:07.560041
Here is the (SSIS-derived column) code for the above transformation:
SUBSTRING(<TimestampColumn>,1,10) + " " + SUBSTRING(<TimestampColumn>,12,2) + ":" + SUBSTRING(<TimestampColumn>,15,2) + ":" + SUBSTRING(<TimestampColumn>,18,2) + "." + SUBSTRING(<TimestampColumn>,21,26)
3b) High & Low values: Low values are defined as one or more of the character with the lowest position in the program collating sequence. High values are defined as one or more of the character with the highest position in the program collating sequence [Mainframe Tips].
There will be instances where these values will not be identified in your ETL due to the collation. The best way to fix these is to ask the DB2 admins to replace these values during the unload process. Here is another solution to deal with low values.
3c) Invalid values: These values will violate your data type contracts. If left unchecked, these will raise exceptions and fail a running data extract. Handling these errors beforehand can be done either by sending them to another output (file or table) or simply ignoring these errors (Explicit ignore in SSIS).
Note: I included the ETL ‘gotchas’ in this post as these items are (IMO) some of the most common issues faced in a DB2 to SQL Server migration and preparing for these ahead of time should (again IMO) be considered best practices to be followed. An ETL solution can survive (barely, I should say) without implementing best practices; but it could definitely fail to deliver without fixing these ‘gotchas.’
4. [ETL] Memory Miser Development: A few components in SSIS have their own drawbacks in terms of memory usage. SSIS uses an in-memory pipeline transfer for data which makes it memory hungry very soon, and also greedy. This calls for a different mindset of development. I call it Memory Miser Development (MMD) where a developer develops ETL components so that they mostly make use of non-blocking transforms especially NO SORT TRANSFORMS!! (And there you go…I went all caps.) While data sorting is necessary for merge joins, it does not necessarily have to be sorted in memory. The source data can either be sorted by the DB2 unload process or let the data pass unsorted and finally sort it using T-SQL at the database engine. This will be faster than the in-memory task.
Implementation & Maintenance Best Practices
1. [DB & ETL] Network I/O: The physical location of source data files, destination SQL Server and the SSIS server *must* reside on the same physical server for optimal data transfer rates. If they are spread across different servers then the adverse impact on performance must be communicated and dealt with. Any prior baselines from running on a same server will no longer apply. Remember the simple fact that fewer data hops translates into faster performance in terms of transfer speeds.
2. [DB & ETL] Consistency of Resource Parameters: Keeping most of the environments with controlled resources (i.e., available memory) is important and this helps communicate the outcome of running your solution in terms of duration. With ETL projects “the faster the better” becomes synonymous and hence everyone is keen on looking at run-time durations across different servers. To get the similar results across different environments (dev, qa, pre-prod, etc.) maintaining consistency between these parameters is important.
[pullquote]The importance of staging areas should not be underplayed. The staging project should be treated as a vital operation that transforms inputs into desired output by adding value along the process.[/pullquote]The importance of staging areas should not be underplayed. The staging project should be treated as a vital operation that transforms inputs into desired output by adding value along the process. The implementation of a staging area directly impacts the next step in the process, which could be a data migration effort of modernization or — for that matter — a data warehouse load. There tends to be a very over-optimistic confidence level about the short-lived lifespan of the staging area. I wouldn’t recommend buying that idea right away. Staging areas are known to linger for a while, given that they now provide an interface that is very similar to the legacy database, easily accessible and query-able in T-SQL. This staging area has the potential to serve as a source system for batch and data warehouse initial loads and even as the source of truth once the legacy database is sunsetted.
My final recommendations: Use the risk element and the lifespan of the staging project as eyes on the back of your head. These two points will help you drive key decisions during the project implementation. Thinking about long-term sustainability of the ETL solution will eventually transform your solution as a best practices pattern and reference implementation solution in your organization.