What is CDC ?
Change data capture records insert, update, and delete activity that is applied to a SQL Server table
How it does ?
- The source of change data for change data capture is the SQL Server transaction log.
- As inserts, updates, and deletes are applied to tracked source tables, entries that describe those changes are added to the log. The log serves as input to the change data capture capture process.
- This reads the log and adds information about changes to the tracked table’s associated change table.
- Functions are provided to enumerate the changes that appear in the change tables over a specified range, returning the information in the form of a filtered result set.
Where it is used ?
A good example of a data consumer that is targeted by this technology is an extraction, transformation, and loading (ETL) application. An ETL application incrementally loads change data from SQL Server source tables to a data warehouse or data mart.
we can eliminate the use the of after update/delete/insert trigger
Note:Change data capture is available only on the Enterprise, Developer, and Evaluation editions of SQL Server.
Advantages of using CDC :
- Minimal impact on the database (even more so if one uses log shipping to process the logs on a dedicated host).
- No need for programmatic changes to the applications that use the database.
- Low latency in acquiring changes.
- Transactional integrity: log scanning can produce a change stream that replays the original transactions in the order they were committed. Such a change stream include changes made to all tables participating in the captured transaction.
- No need to change the database schema
is CDC (Change Data Capture) available in other off the shelf products)?
products perform change data capture using database transaction log files. These include:
- Attunity Stream
- Centerprise Data Integrator from Astera
- DatabaseSync from WisdomForce
- GoldenGate Transactional Data Integration
- HVR from HVR Software
- DBMoto from HiT Software
- Shadowbase from Gravic
- IBM InfoSphere Change Data Capture (previously DataMirror Transformation Server)
- Informatica PowerExchange CDC Option (previously Striva)
- Oracle Streams
- Oracle Data Guard
- Replicate1 from Vision Solutions
- SharePlex from Quest Software
- FlexCDC, part of Flexviews for MySQL
How to implement CDC ?
Lets get into the details of how to implement the CDC in SQL server 2008
Before enabling the cdc on the tables we need to enable the cdc on the Database.
To determine weather the database is enabled or not check the sys.databases table
SELECT NAME,IS_CDC_ENABLED FROM SYS.DATABASES WHERE NAME = 'SQLJUNKIESHARE'
Use the following system stored procedure to enable CDC on a database
USE SQLJUNKIESHARE EXEC SYS.SP_CDC_ENABLE_DB GO
after CDC is enabled on the database above listed tables are created in system tables
- cdc.captured_columns Returns the columns tracked for a specific capture instance.
- cdc.change_tables Returns tables created when CDC is enabled for a table. Use sys.sp_cdc_help_change_data_capture to query this informationrather than query this table directly.
- cdc.ddl_history Returns rows for each DDL change made to the table, once CDE is enabled. Use sys.sp_cdc_get_ddl_history instead of querying this table directly.
- cdc.index_columns Returns index columns associated with the CDC-enabled table. Query
- sys.sp_cdc_help_change_data_capture to retrieve this information rather than querying this table directly.
- cdc.lsn_time_mapping Helps you map the log sequence number to transaction begin and end times. Again, avoid querying the table directly, and instead use the functions sys.fn_cdc_map_lsn_to_time and sys.fn_cdc_map_time_to_lsn.
after enabling CDC on database lets enable CDC on table
Now that Change Data Capture is enabled, I can proceed with capturing changes for tables in
the database by using the sys.sp_cdc_enable_table system stored procedure. The parameters of
this stored procedure are described below
sp_cdc_enable_table Parameters
Parameter Description
- @source_schema This parameter defines the schema of the object.
- @source_name This parameter specifies the table name.
- @role_name This option allows you to select the name of the user-defined role that will have permissions to access the CDC data.
- @capture_instance You can designate up to two capture instances for a single table. This comes in handy if you plan on altering the schema of a table already captured by CDC. You can alter the schema without affecting theoriginal CDC (unless it is a data type change), create a new capture instance, track changes in two tables, and then drop the original capture instance once you are sure the new schema capture fits your requirements. If you don’t designate the name, the default value is schema_source.
- @supports_net_changes When enabled, this option allows you to show just the latest change to the data within the LSN range selected. This option requires a primary key be defined on the table. If no primary key is defined, you can alsodesignate a unique key in the @index_name option.
- @index_name This parameter allows you to designate the unique key on the table to be used by CDC if a primary key doesn’t exist.
- @captured_column_list If you aren’t interested in tracking all column changes, this option allows you to narrow down the list.
- @filegroup_name This option allows you to designate where the CDC data will be stored. For very large data sets, isolation on a separate filegroup may yield better manageability and performance.
- @partition_switch This parameter takes a TRUE or FALSE value designating whether or not a ALTER TABLE…SWITCH PARTITION command will be allowed against the CDC table (default is FALSE)here how its like before enabling the cdc on a particular
4when we set the @supports_net_changes to 1 either we need to have a primary key or we need to specify a non clustered index name
after making those changes when we execute the stored procedure result will beJob ‘cdc.sqljunkieshare_capture’ started successfully.
Job ‘cdc.sqljunkieshare_cleanup’ started successfully.Capture and Cleanup Jobs
Besides the Change Data Capture tables and query functions that have been created in our example, two SQL Server Agent jobs are created: a Capture and a Cleanup Job.
The Capture job generally runs continuously and is used to move changed data to the CDC tables from the transaction log.
The Cleanup job runs on a scheduled basis to remove older data from the CDC tables so that they don’t get too large. By default, data older than three days is automatically removed from CDC tables by this job.
we can also validate the settings of your newly configured capture instance using the
sys.sp_cdc_help_change_data_capture stored procedure:
EXEC sys.sp_cdc_help_change_data_capture ‘dbo’, ‘EMPLOYEE’ - The commit LSN both identifies changes that were committed within the same transaction, and orders those transactions.
- The column __$start_lsn identifies the commit log sequence number (LSN) that was assigned to the change.
- The column __$seqval can be used to order more changes that occur in the same transaction.
- The column __$operation records the operation that is associated with the change:
- 1 = delete,
- 2 = insert,
- 3 = update (before image)
- 4 = update (after image).
- The column __$update_mask is a variable bit mask with one defined bit for each captured column. For insert and delete entries, the update mask will always have all bits set. Update rows, however, will only have those bits set that correspond to changed columns.
cdc.fn_cdc_get_all_changes_capture_instance ( from_lsn , to_lsn , ‘<row_filter_option>’ )
<row_filter_option> ::= { all | all update old }
it is prefreble to use cdc function to query the change tables
DECLARE @from_lsn binary(10), @to_lsn binary(10) SET @from_lsn = sys.fn_cdc_get_min_lsn('dbo_EMPLOYEE') SET @to_lsn = sys.fn_cdc_get_max_lsn() SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_EMPLOYEE (@from_lsn, @to_lsn, 'all'); <row_filter_option> ::= { all | all update old } Option
‘all’ is used to filter the result set so that to get all the changes on that table or capture instance but one row for each change
(i.e one row for insert ,one row for delete one row for update)
DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10); -- Obtain the beginning of the time interval. SET @begin_time = GETDATE() -1; -- DML statements to produce changes in the DBO.Employee table. -- Obtain the end of the time interval. SET @end_time = GETDATE(); -- Map the time interval to a change data capture query range. SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal', @begin_time); SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time); -- Return the net changes occurring within the query window. SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all'); a new function we used above is sys.fn_cdc_map_time_to_lsn() ,function returns the LSN when supplied with time and relational operator
sys.fn_cdc_map_time_to_lsn ( ‘<relational_operator>’, tracking_time )
<relational_operator>::= { largest less than | largest less than or equal | smallest greater than | smallest greater than or equal }
please use the below link to know some more CDC functions
http://msdn.microsoft.com/en-us/library/bb510744.aspx
now lets create a dimension table which servers as a destination table
CREATE TABLE DBO.DIMEMPLOYEE (ID INT IDENTITY (1,1) PRIMARY KEY, COUSTEMERID INT UNIQUE ,FIRST_NAME VARCHAR(50),LAST_NAME VARCHAR(50)) INSERT INTO DBO.DIMEMPLOYEE (COUSTEMERID,FIRST_NAME,LAST_NAME) SELECT ID,FIRST_NAME,LAST_NAME FROM EMPLOYEE now lets update some rows and delete some rows in the dbo.employee table delete from dbo.employee where id between 50 and 100 update dbo.employee set last_name = 'no lastname' where id between 150 and 200 now lets use the below query to get the changes in the table after updating and deleting DECLARE @begin_time datetime, @end_time datetime, @from_lsn binary(10), @to_lsn binary(10); -- Obtain the beginning of the time interval. SET @begin_time = DATEADD(HH,-1,GETDATE()); -- DML statements to produce changes in the Db.employee table. -- Obtain the end of the time interval. SET @end_time = GETDATE(); -- Map the time interval to a change data capture query range. SET @from_lsn = sys.fn_cdc_map_time_to_lsn('smallest greater than or equal' , @begin_time); SET @to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal' , @end_time); -- Return the net changes occurring within the query window. SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE(@from_lsn, @to_lsn, 'all');
SELECT * FROM cdc.fn_cdc_get_net_changes_dbo_EMPLOYEE (sys.fn_cdc_map_time_to_lsn ('smallest greater than or equal' , DATEADD(HH,-1,GETDATE())) , sys.fn_cdc_map_time_to_lsn('largest less than or equal', GETDATE()), 'all')