CDC – Change Data Capture (zero cost ETL solution)

 

What is CDC ? 

Change data capture records insert, update, and delete activity that is applied to a SQL Server table

How it does ?

 

 

 

 

 

 

 

 

 

 

 

 

 

 

  • The source of change data for change data capture is the SQL Server transaction log.
  • As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log. The log serves as input to the change data capture capture process.
  • This reads the log and adds information about changes to the tracked table’s associated change table.
  • Functions are provided to enumerate the changes that appear in the change tables over a specified range, returning the information in the form of a filtered result set.

Where it is used ?

A good example of a data consumer that is targeted by this technology is an extraction, transformation, and loading (ETL) application. An ETL application incrementally      loads change data from SQL Server source tables to a data warehouse or data mart.

we can  eliminate the use the of after update/delete/insert trigger

Note:Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server.

Advantages of using CDC :

  • Minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host).
  • No need for programmatic changes to the applications that use the database.
  • Low latency in acquiring changes.
  • Transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction.
  • No need to change the database schema

is CDC (Change Data Capture)  available in other off the shelf products)?

products perform change data capture using database transaction log files. These include:

  • Attunity Stream
  • Centerprise Data Integrator from Astera
  • DatabaseSync from WisdomForce
  • GoldenGate Transactional Data Integration
  • HVR from HVR Software
  • DBMoto from HiT Software
  • Shadowbase from Gravic
  • IBM InfoSphere Change Data Capture (previously DataMirror Transformation Server)
  • Informatica PowerExchange CDC Option (previously Striva)
  • Oracle Streams
  • Oracle Data Guard
  • Replicate1 from Vision Solutions
  • SharePlex from Quest Software
  • FlexCDC, part of Flexviews for MySQL

How to implement CDC ?

Lets get into the details of how to implement the CDC in SQL server 2008

Before enabling the cdc on the tables we need to enable the cdc on the Database.

To determine weather the database is enabled or not check the sys.databases table

SELECT NAME,IS_CDC_ENABLED 

FROM SYS.DATABASES

WHERE NAME = 'SQLJUNKIESHARE'

Use the following system stored procedure to enable CDC on a database

USE SQLJUNKIESHARE

EXEC SYS.SP_CDC_ENABLE_DB
GO

after CDC is enabled on the database above listed tables are created in system tables

  • cdc.captured_columns Returns the columns tracked for a specific capture instance.
  • cdc.change_tables Returns tables created when CDC is enabled for a table. Use sys.sp_cdc_help_change_data_capture to query this informationrather than query this table directly.
  • cdc.ddl_history Returns rows for each DDL change made to the table, once CDE is enabled. Use sys.sp_cdc_get_ddl_history instead of querying this table directly.
  • cdc.index_columns Returns index columns associated with the CDC-enabled table. Query
  • sys.sp_cdc_help_change_data_capture to retrieve this information rather than querying this table directly.
  • cdc.lsn_time_mapping Helps you map the log sequence number to transaction begin and end times. Again, avoid querying the table directly, and instead use the functions sys.fn_cdc_map_lsn_to_time and sys.fn_cdc_map_time_to_lsn.

after enabling CDC on database lets enable CDC on table

Now that Change Data Capture is enabled, I can proceed with capturing changes for tables in
the database by using the sys.sp_cdc_enable_table system stored procedure. The parameters of
this stored procedure are described below
sp_cdc_enable_table Parameters
Parameter Description

  • @source_schema       This parameter defines the schema of the object.
  • @source_name           This parameter specifies the table name.
  • @role_name                This option allows you to select the name of the user-defined role that will have permissions to access the CDC data.
  • @capture_instance     You can designate up to two capture instances for a single table. This comes in handy if you plan on altering the schema of a table already                                                 captured by CDC. You can alter the schema without affecting theoriginal CDC (unless it is a data type change), create a new capture instance,                                        track changes in two tables, and then drop the original  capture instance once you are sure the new schema capture fits your requirements. If                                        you don’t designate the name, the default value is schema_source.
  • @supports_net_changes     When enabled, this option allows you to show just the latest change to the data within the LSN range selected. This option requires a                                                      primary key be defined on the table. If no primary key is defined, you can alsodesignate a unique key in the @index_name option.
  • @index_name             This parameter allows you to designate the unique key on the table to be used by CDC if a primary key doesn’t exist.
  • @captured_column_list If you aren’t interested in tracking all column changes, this option allows you to narrow down the list.
  • @filegroup_name       This option allows you to designate where the CDC data will be stored. For very large data sets, isolation on a separate filegroup may yield                                              better manageability and performance.
  • @partition_switch       This parameter takes a TRUE or FALSE value designating whether or not a ALTER TABLE…SWITCH PARTITION command will be                                                    allowed against the  CDC table (default is FALSE)here how its like before enabling the cdc on a particular4when we set the @supports_net_changes to 1 either we need to have a primary key or we need to specify a non clustered index name
    after making those changes when we execute the stored procedure result will beJob ‘cdc.sqljunkieshare_capture’ started successfully.
    Job ‘cdc.sqljunkieshare_cleanup’ started successfully.

    Capture and Cleanup Jobs

    Besides the Change Data Capture tables and query functions that have been created in our example, two SQL Server Agent jobs are created: a Capture and a Cleanup Job.

    The Capture job generally runs continuously and is used to move changed data to the CDC tables from the transaction log.

    The Cleanup job runs on a scheduled basis to remove older data from the CDC tables so that they don’t get too large. By default, data older than three days is automatically removed from CDC tables by this job.

    we can also validate the settings of your newly configured capture instance using the
    sys.sp_cdc_help_change_data_capture stored procedure:
    EXEC sys.sp_cdc_help_change_data_capture ‘dbo’, ‘EMPLOYEE’        

  • The commit LSN both identifies changes that were committed within the same transaction, and orders those transactions.
  • The column __$start_lsn identifies the commit log sequence number (LSN) that was assigned to the change.
  • The column __$seqval can be used to order more changes that occur in the same transaction.
  • The column __$operation records the operation that is associated with the change:
  • 1 = delete,
  • 2 = insert,
  • 3 = update (before image)
  • 4 = update (after image).
  • The column __$update_mask is a variable bit mask with one defined bit for each captured column. For insert and delete entries, the update mask will always have all bits set. Update rows, however, will only have those bits set that correspond to changed columns.
Before going in to more details  lets look at some CDC functions

cdc.fn_cdc_get_all_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )

<row_filter_option> ::= { all | all update old }

it is prefreble to use cdc function to query the change tables

DECLARE @from_lsn binary(10), @to_lsn binary(10)
SET @from_lsn =
 sys.fn_cdc_get_min_lsn('dbo_EMPLOYEE')
SET @to_lsn = sys.fn_cdc_get_max_lsn()
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_EMPLOYEE (@from_lsn, @to_lsn, 'all');
 
<row_filter_option> ::= { all | all update old } 
Option 

‘all’ is used to filter the result set so that to get all the changes on that table or capture instance but one row for each change

(i.e one row for insert ,one row for delete one row for update)

‘all update old’ is used to filter the result set so that to get all the changes on that table or capture instance with two rows for update
(i.e one containing the values of the captured columns before the update and another containing the values of the captured columns after the update).
another important function
cdc.fn_cdc_get_net_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )
<row_filter_option> ::= { all | all with mask | all with merge }
You might have confused what is the difference between cdc.fn_cdc_get_net_changes_capture_instance and cdc.fn_cdc_get_all_changes_capture_instance  when to use net changes and all changes
 cdc.fn_cdc_get_net_changes_capture_instance  returns one net change row for each source row changed within the specified LSN range. That is, when a source row has multiple changes during the LSN range, a single row that reflects the final content of the row is returned by the function. For example, if a transaction inserts a row in the source table and a subsequent transaction within the LSN range updates one or more columns in that row, the function returns only one row, which includes the updated column values.
lets say we need to get all the net changes in the source the table during a period of 24 hours
DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10);
-- Obtain the beginning of the time interval.
SET @begin_time = GETDATE() -1;
-- DML statements to produce changes in the DBO.Employee table.

-- Obtain the end of the time interval.
SET @end_time = GETDATE();
-- Map the time interval to a change data capture query range.
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time);

-- Return the net changes occurring within the query window.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all');
 
a new function we used above is sys.fn_cdc_map_time_to_lsn() ,function returns the LSN when supplied with time and relational operator

sys.fn_cdc_map_time_to_lsn ( ‘<relational_operator>’, tracking_time )

<relational_operator>::= { largest less than | largest less than or equal | smallest greater than | smallest greater than or equal }

please use the below link to know some more  CDC functions

http://msdn.microsoft.com/en-us/library/bb510744.aspx

now lets create a  dimension table which servers as a destination table

CREATE TABLE DBO.DIMEMPLOYEE
(ID INT IDENTITY (1,1) PRIMARY KEY, COUSTEMERID INT UNIQUE ,FIRST_NAME VARCHAR(50),LAST_NAME VARCHAR(50))

INSERT INTO DBO.DIMEMPLOYEE (COUSTEMERID,FIRST_NAME,LAST_NAME)

SELECT ID,FIRST_NAME,LAST_NAME FROM EMPLOYEE

now lets update some rows and delete some rows  in the dbo.employee table 

delete from dbo.employee where id between 50 and 100
update dbo.employee
set last_name = 'no lastname'
where id between 150 and 200 

now lets use the below query to get the changes in the table after updating and deleting 

DECLARE @begin_time datetime, @end_time datetime,
 @from_lsn binary(10), @to_lsn binary(10);
-- Obtain the beginning of the time interval.
SET @begin_time = DATEADD(HH,-1,GETDATE());
-- DML statements to produce changes in the Db.employee  table.

-- Obtain the end of the time interval.
SET @end_time = GETDATE();
-- Map the time interval to a change data capture query range.
SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal'
, @begin_time);
SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal'
, @end_time);

-- Return the net changes occurring within the query window.
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all');
Now lets create a simple package in SSIS which query’s these functions and updates the dimension table
Use the below query in the OLEDB source  , we can also query the change tables directly
SELECT * 
FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE
(sys.fn_cdc_map_time_to_lsn
('smallest greater than or equal'
, DATEADD(HH,-1,GETDATE()))
, sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()), 'all')

write simple update and delete queries in and map the id to coustemerid column in DBO.DIMEMPLOYEE table
Using CDC with out effecting the source table we were able to log those changes and using simple package we updated the data warehouse environment

Executing SSIS packages from a stored procedure

hi,

lets learn how to execute SSIS packages from a stored procedure

This can be done in two ways

a) Creating a job and calling the job from procedure

    b)Using Xp_cmdshell Extended stored procedure

Lets look at the first option

Step 1 : Create job in Sql server

 Job name : SSIS

go steps  on the left side of  job menu

Create new step

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Select the type and account to run the job

select the package location

use the simple procedure below

CREATE  PROCEDURE SSIS_EXE (@JOB_NAME VARCHAR(100))

AS

BEGIN 

EXEC MSDB..SP_START_JOB @JOB_NAME

END 

EXEC SSIS_EXE 'SSIS'
Job 'SSIS' started successfully.
 Go to job activity monitor and refresh depending upon the size of package wait for the success message


or user the below query to find the job status
SELECT S.NAME,H.JOB_ID,H.MESSAGE 

FROM MSDB..SYSJOBHISTORY H 

INNER JOIN MSDB..SYSJOBS S ON H.JOB_ID = S.JOB_ID 

WHERE S.NAME = 'SSIS'



Using Xp_cmdshell system stored procedure:
This is the best and the easy way 

we pass the parameter as the package location, package path, and server name to the 
stored procedure Before jumping in to the code let's make sure  the 
XP_Cmdshell extended procedure is turned on you machine 

Use the below mentioned procedure to turn on XP_CMDSHELL

-- To allow advanced options to be changed.
EXEC sp_configure 'show advanced options', 1
GO
-- To update the currently configured value for advanced options.
RECONFIGURE
GO

-- To enable the feature.
EXEC sp_configure 'xp_cmdshell', 1
GO
-- To update the currently configured value for this feature.
RECONFIGURE
GO
 
and here's the procedure

CREATE PROCEDURE SSIS_XP 
(@LOCATION VARCHAR(1000),@PATH VARCHAR(100),@SERVERNAME VARCHAR(40))

AS

DECLARE @CMD VARCHAR(1000)

BEGIN

IF @LOCATION = 'FILESYSTEM'

 IF(@SERVERNAME IS NULL OR PATINDEX('%[a-z]%',@SERVERNAME)= 0 )

 PRINT 'PLEASE PROVIDE VALID SERVER NAME'

 ELSE 

    BEGIN

           SET @CMD = 'DTEXEC /DTS "\FILE SYSTEM\'+@PATH +'" /SERVER "'+ @SERVERNAME 
                        + '" /CHECKPOINTING OFF  /REPORTING V'

    IF @LOCATION = 'MSDB' 

       IF(@SERVERNAME IS NULL OR PATINDEX('%[A-Z]%',@SERVERNAME)= 0 )

 PRINT 'PLEASE PROVIDE VALID SERVER NAME'

 ELSE 

   BEGIN

           SET @CMD = @CMD + 'DTEXEC /DTS "\MSDB'+@PATH +'" /SERVER "'+ @SERVERNAME
                      + '" /CHECKPOINTING OFF  /REPORTING V'

IF @LOCATION = 'FILE'  

 BEGIN

 SET @CMD = @CMD + 'DTEXEC /FILE'+@PATH +'.DTSX" /CHECKPOINTING OFF  /REPORTING V'

END

END

END

ELSE

BEGIN

PRINT 'PLEASE SPECIFY CORRECT PACKAGE LOCATION'

END 

--

EXEC MASTER..XP_CMDSHELL  @CMD

END

how to use
 EXEC SSIS_XP 'FILESYSTEM' ,'PACKAGES\PACKAGE' ,'--servername--' 

or
 EXEC SSIS_XP 'MSDB' ,'SSIS\PACKAGE1' ,'--servername--'
or

EXEC SSIS_XP 'FILE' ,'C:\Programfiles\mydocuments\SSIS\PACKAGE1' ,'--servername--' 


 
We can still have lot parameters like reporting , configurations, variable values 
to make it complicated

 Data driven package execution 

A) CURSOR

 Create a table 

CREATE TABLE [dbo].[PACKAGES](
	[PATH] [varchar](100) NULL,
	[LOCATION] [varchar](1000) NULL,
	[SERVERNAME] [varchar](40) NULL
) ON [PRIMARY]

Insert the records as required by the packages
crate a cursor to loop through the table for packages
DECLARE @PATH VARCHAR(1000)

DECLARE @LOCATION VARCHAR(50)

DECLARE @SERVERNAME VARCHAR(40) 

DECLARE DATA_PACKAGE CURSOR

FOR SELECT [PATH],LOCATION,SERVERNAME  FROM DBO.PACKAGES

OPEN DATA_PACKAGE

 FETCH NEXT FROM DATA_PACKAGE 

 INTO @PATH,@LOCATION,@SERVERNAME

WHILE @@FETCH_STATUS = 0 

BEGIN

 EXEC SSIS_XP @LOCATION,@PATH,@SERVERNAME

 FETCH NEXT FROM DATA_PACKAGE 

  INTO @PATH,@LOCATION,@SERVERNAME

END

 CLOSE DATA_PACKAGE

 DEALLOCATE DATA_PACKAGE

B) we can create a simple package to do above cursor action in SSIS if you don't 
like cursor

 step1) drag and drop a data flow task in control flow


and a OLEDB source and OLEDB command in data flow

 

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Map the parameters to the procedure
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 
 
any questions please comment below
 
 

How to load a variable value into a Execute SQL Task

Lets take a look at  a simple example of how to load a variable value into execute sql task in the control flow

We  need to update a table called Employee in the database according to the value supplied to the variable

so lets create  a  variable called ID  , Data type Int32

lets drag and drop a execute sql task into control flow and make the necessary connections

Query we used in the sql task is a simple  update query, in order to map the ID value of the variable to the id value in the where clause of the update query we use a question

once we written the query now we need to do the parameter mapping for the variable

To the left side of the Execute SQL Task there is a option called parameter mapping once we click it it will show the mapping list which will be empty

Click the add button

Browse the available parameters in the parameter name  and select the variable  User:ID

and now lets execute the package

lets see the result

That’s it! for today Thank you

Variables in Integration Services

let’s  take a look at the simple definition of variables in SSIS

Variables store values which can be used  to update tasks in control flow , data flow and in event handlers,   dynamically during the run time of the package

Types of Variables

1) System variables (system defined)

2) User defined Variables

How they are accessed(scope)  :  system variables can be accessed from any task(if it allows) in the package and these can only be read

User Defined Variables :    are created by the users  and data types include    string, int16, int32, int64, double,  dbnull, object, single,  Boolean, char, Uint32, Uint64, datetime, sbyte

how they accessed in the package is kind of different from system defined variables and now we can start using the word called scope

if a variable is created in data flow  (scope)  it can not be accessed from control flow

so lets consider different scopes here

Package level :  a variable created with out highlighting any of the task in  control flow

Task level : a variable created with highlighting any of the tasks in control flow

Data flow :  variable  created in data flow by selecting or not selecting any tasks are transformations in data flow

Important : variables are case sensitive

a variable as Var1 is not same as var1