CDC – Change Data Capture (zero cost ETL solution)

 

What is CDC ? 

Change data capture records insert, update, and delete activity that is applied to a SQL Server table

How it does ?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  • The source of change data for change data capture is the SQL Server transaction log.
  • As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log. The log serves as input to the change data capture capture process.
  • This reads the log and adds information about changes to the tracked table’s associated change table.
  • Functions are provided to enumerate the changes that appear in the change tables over a specified range, returning the information in the form of a filtered result set.

Where it is used ?

A good example of a data consumer that is targeted by this technology is an extraction, transformation, and loading (ETL) application. An ETL application incrementally      loads change data from SQL Server source tables to a data warehouse or data mart.

we can  eliminate the use the of after update/delete/insert trigger

Note:Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server.

Advantages of using CDC :

  • Minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host).
  • No need for programmatic changes to the applications that use the database.
  • Low latency in acquiring changes.
  • Transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction.
  • No need to change the database schema

is CDC (Change Data Capture)  available in other off the shelf products)?

products perform change data capture using database transaction log files. These include:

  • Attunity Stream
  • Centerprise Data Integrator from Astera
  • DatabaseSync from WisdomForce
  • GoldenGate Transactional Data Integration
  • HVR from HVR Software
  • DBMoto from HiT Software
  • Shadowbase from Gravic
  • IBM InfoSphere Change Data Capture (previously DataMirror Transformation Server)
  • Informatica PowerExchange CDC Option (previously Striva)
  • Oracle Streams
  • Oracle Data Guard
  • Replicate1 from Vision Solutions
  • SharePlex from Quest Software
  • FlexCDC, part of Flexviews for MySQL

How to implement CDC ?

Lets get into the details of how to implement the CDC in SQL server 2008

Before enabling the cdc on the tables we need to enable the cdc on the Database.

To determine weather the database is enabled or not check the sys.databases table

SELECT NAME,IS_CDC_ENABLED 

FROM SYS.DATABASES

WHERE NAME = 'SQLJUNKIESHARE'

Use the following system stored procedure to enable CDC on a database

USE SQLJUNKIESHARE

EXEC SYS.SP_CDC_ENABLE_DB
GO

after CDC is enabled on the database above listed tables are created in system tables

  • cdc.captured_columns Returns the columns tracked for a specific capture instance.
  • cdc.change_tables Returns tables created when CDC is enabled for a table. Use sys.sp_cdc_help_change_data_capture to query this informationrather than query this table directly.
  • cdc.ddl_history Returns rows for each DDL change made to the table, once CDE is enabled. Use sys.sp_cdc_get_ddl_history instead of querying this table directly.
  • cdc.index_columns Returns index columns associated with the CDC-enabled table. Query
  • sys.sp_cdc_help_change_data_capture to retrieve this information rather than querying this table directly.
  • cdc.lsn_time_mapping Helps you map the log sequence number to transaction begin and end times. Again, avoid querying the table directly, and instead use the functions sys.fn_cdc_map_lsn_to_time and sys.fn_cdc_map_time_to_lsn.

after enabling CDC on database lets enable CDC on table

Now that Change Data Capture is enabled, I can proceed with capturing changes for tables in
the database by using the sys.sp_cdc_enable_table system stored procedure. The parameters of
this stored procedure are described below
sp_cdc_enable_table Parameters
Parameter Description

  • @source_schema       This parameter defines the schema of the object.
  • @source_name           This parameter specifies the table name.
  • @role_name                This option allows you to select the name of the user-defined role that will have permissions to access the CDC data.
  • @capture_instance     You can designate up to two capture instances for a single table. This comes in handy if you plan on altering the schema of a table already                                                 captured by CDC. You can alter the schema without affecting theoriginal CDC (unless it is a data type change), create a new capture instance,                                        track changes in two tables, and then drop the original  capture instance once you are sure the new schema capture fits your requirements. If                                        you don’t designate the name, the default value is schema_source.
  • @supports_net_changes     When enabled, this option allows you to show just the latest change to the data within the LSN range selected. This option requires a                                                      primary key be defined on the table. If no primary key is defined, you can alsodesignate a unique key in the @index_name option.
  • @index_name             This parameter allows you to designate the unique key on the table to be used by CDC if a primary key doesn’t exist.
  • @captured_column_list If you aren’t interested in tracking all column changes, this option allows you to narrow down the list.
  • @filegroup_name       This option allows you to designate where the CDC data will be stored. For very large data sets, isolation on a separate filegroup may yield                                              better manageability and performance.
  • @partition_switch       This parameter takes a TRUE or FALSE value designating whether or not a ALTER TABLE…SWITCH PARTITION command will be                                                    allowed against the  CDC table (default is FALSE)here how its like before enabling the cdc on a particular4when we set the @supports_net_changes to 1 either we need to have a primary key or we need to specify a non clustered index name
    after making those changes when we execute the stored procedure result will beJob ‘cdc.sqljunkieshare_capture’ started successfully.
    Job ‘cdc.sqljunkieshare_cleanup’ started successfully.

    Capture and Cleanup Jobs

    Besides the Change Data Capture tables and query functions that have been created in our example, two SQL Server Agent jobs are created: a Capture and a Cleanup Job.

    The Capture job generally runs continuously and is used to move changed data to the CDC tables from the transaction log.

    The Cleanup job runs on a scheduled basis to remove older data from the CDC tables so that they don’t get too large. By default, data older than three days is automatically removed from CDC tables by this job.

    we can also validate the settings of your newly configured capture instance using the
    sys.sp_cdc_help_change_data_capture stored procedure:
    EXEC sys.sp_cdc_help_change_data_capture ‘dbo’, ‘EMPLOYEE’        

  • The commit LSN both identifies changes that were committed within the same transaction, and orders those transactions.
  • The column __$start_lsn identifies the commit log sequence number (LSN) that was assigned to the change.
  • The column __$seqval can be used to order more changes that occur in the same transaction.
  • The column __$operation records the operation that is associated with the change:
  • 1 = delete,
  • 2 = insert,
  • 3 = update (before image)
  • 4 = update (after image).
  • The column __$update_mask is a variable bit mask with one defined bit for each captured column. For insert and delete entries, the update mask will always have all bits set. Update rows, however, will only have those bits set that correspond to changed columns.
Before going in to more details  lets look at some CDC functions

cdc.fn_cdc_get_all_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )

<row_filter_option> ::= { all | all update old }

it is prefreble to use cdc function to query the change tables

DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn =
 sys.fn_cdc_get_min_lsn('dbo_EMPLOYEE')
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_EMPLOYEE (@from_lsn, @to_lsn, 'all');
 
<row_filter_option> ::= { all | all update old } 
Option 

‘all’ is used to filter the result set so that to get all the changes on that table or capture instance but one row for each change

(i.e one row for insert ,one row for delete one row for update)

‘all update old’ is used to filter the result set so that to get all the changes on that table or capture instance with two rows for update
(i.e one containing the values of the captured columns before the update and another containing the values of the captured columns after the update).
another important function
cdc.fn_cdc_get_net_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )
<row_filter_option> ::= { all | all with mask | all with merge }
You might have confused what is the difference between cdc.fn_cdc_get_net_changes_capture_instance and cdc.fn_cdc_get_all_changes_capture_instance  when to use net changes and all changes
 cdc.fn_cdc_get_net_changes_capture_instance  returns one net change row for each source row changed within the specified LSN range. That is, when a source row has multiple changes during the LSN range, a single row that reflects the final content of the row is returned by the function. For example, if a transaction inserts a row in the source table and a subsequent transaction within the LSN range updates one or more columns in that row, the function returns only one row, which includes the updated column values.
lets say we need to get all the net changes in the source the table during a period of 24 hours
DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
-- Obtain the beginning of the time interval.
SET @begin_time = GETDATE() -1;
-- DML statements to produce changes in the DBO.Employee table.

-- Obtain the end of the time interval.
SET @end_time = GETDATE();
-- Map the time interval to a change data capture query range.
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

-- Return the net changes occurring within the query window.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all');
 
a new function we used above is sys.fn_cdc_map_time_to_lsn() ,function returns the LSN when supplied with time and relational operator

sys.fn_cdc_map_time_to_lsn ( ‘<relational_operator>’, tracking_time )

<relational_operator>::= { largest less than | largest less than or equal | smallest greater than | smallest greater than or equal }

please use the below link to know some more  CDC functions

http://msdn.microsoft.com/en-us/library/bb510744.aspx

now lets create a  dimension table which servers as a destination table

CREATE TABLE DBO.DIMEMPLOYEE
(ID INT IDENTITY (1,1) PRIMARY KEY, COUSTEMERID INT UNIQUE ,FIRST_NAME VARCHAR(50),LAST_NAME VARCHAR(50))

INSERT INTO DBO.DIMEMPLOYEE (COUSTEMERID,FIRST_NAME,LAST_NAME)

SELECT ID,FIRST_NAME,LAST_NAME FROM EMPLOYEE

now lets update some rows and delete some rows  in the dbo.employee table 

delete from dbo.employee where id between 50 and 100
update dbo.employee
set last_name = 'no lastname'
where id between 150 and 200 

now lets use the below query to get the changes in the table after updating and deleting 

DECLARE @begin_time datetime, @end_time datetime,
 @from_lsn binary(10), @to_lsn binary(10);
-- Obtain the beginning of the time interval.
SET @begin_time = DATEADD(HH,-1,GETDATE());
-- DML statements to produce changes in the Db.employee  table.

-- Obtain the end of the time interval.
SET @end_time = GETDATE();
-- Map the time interval to a change data capture query range.
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal'
, @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal'
, @end_time);

-- Return the net changes occurring within the query window.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all');
Now lets create a simple package in SSIS which query’s these functions and updates the dimension table
Use the below query in the OLEDB source  , we can also query the change tables directly
SELECT * 
FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE
(sys.fn_cdc_map_time_to_lsn
('smallest greater than or equal'
, DATEADD(HH,-1,GETDATE()))
, sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()), 'all')

write simple update and delete queries in and map the id to coustemerid column in DBO.DIMEMPLOYEE table
Using CDC with out effecting the source table we were able to log those changes and using simple package we updated the data warehouse environment
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s