How to remove duplicates rows in SSIS data flow task from a Flat file source or Excel source, Rawsource

Hi , Using a simple trick you can easily eliminate duplicate rows when using Flat file or Excel source or Raw source or any where in the Data Flow task























Simple and easy trick you can using any where in the Data Flow task

Thank you

SQLJUNKIESHARE

Advertisements

Inserting Records and Getting Identity in SSIS

Hi ,

When one of my friend asked about how can we catch a Identity value after inserting the rows in to table  having a Identity column and using that result in the same package

well there is no out of box solution in SSIS  once the data is fed in to the OLEDB destination or SQL server destination it becomes a impossible to find inserted Identity values (Unless a trigger is used on the destination table) for these components outputs are ERROR redirection, which we can not use

Problem :

During the Loading phase of ETL it is so common to have identity columns , but it becomes to tough to find the Identity values after inserting and using them in the same Data flow

Solutions :

There are many work around solutions to perform this which I will list them , I will try to demonstrate the one way which I like

1) Using trigger on the destination table catching the inserted values in another table

2) Using the OLEDB command and a stored procedure which uses Output clause to find all Identity values or we can use Identity_scope (problem: dataflow will be terminated which makes the procedure to be the control flow of data (which makes no sense of using SSIS))

Famous MVP Todd Mcdermid  tries to explain the same problem in his view point please take a look

http://toddmcdermid.blogspot.com/2011/08/inserting-records-and-getting-identity.html

3)add a unique number to a comment field to find the rows just inserted and in other data flow task use the unique number to process them as required

As it is a RBAR(row by agonizing row) which is time consuming and it degrades the package performance when the number of rows are more

4)  Using the Record-set destination and taking the rows back to the control flow , and using the for each loop enumerator and catch the inserted Identity columns and use the values for further processing in SSIS

lets look at the example

In this session we will also learn how to use  Recordset destination

How to perform cusrsor action in Dataflow

how to use Foreach loop (Foreach ADO enumerator)

how to transfer data between control flow and dataflow

so looks like nice information session on SSIS so please stay with me

Parallelism in ETL Process :: SSIS 2008 and SSIS 2012

Hi, Today we will talk about achieving  parallelism in ETL using SSIS 2008 and 2012 We can generalize this topic two ways is by
 
1) Using maximum concurrent executable s in SSIS control flow and data flow( easy way to achieve parallelism and not so dynamic everythng should be configured in design)
 
2)Partitioning the source and destination ( little bit complicated but most power full way to achieve parallelism, everything is dynamic)
    1) There is a setting in the SSIS which allows the user to specify the maximum number of executable’s it can execute in a given package Max Concurrent executable
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2)Partitioning the source and destination ( little bit complicated but most power full way to achieve parallelism, everything is dynamic) 
In order achieve parallelism in ETL
 
1) Source and destination should be partitioned according the Data load
 
2) Entire work load should be divided to individual tasks
3) Each task should be sub divided into multiple tasks
4) Single unit of task should not not depend up on the other task(dead locks, I/O waits..)
 
5) before using the maximum threshold CPU for parallelism proper testing should be done, so there are always resources available for system processes Some times Parallelism may degrade the performance
 
6) If work load is divided in to to many small tasks which may also degrade the performance (ex: Time spent by the Worker thread to wait for child threads may exceed single  execution of the process ) Declare variables  Conclusion: Parallelism can be save lot of time in ETL process if utilized properly

SSIS package logging – Custom logging 2008 R2 and 2012

Logging is an important part of SSIS package development , I personaly prefer to have logging in all production level packages

Logging can be used as more than debugging errors in a scheduled package execution environment but also performs the audit functions it needs and we can also measure

performance, Out of the box SSIS 2008 and SSIS 2012 have some nice logging capabilities

The following table describes the predefined events that can be enabled to write log entries when run-time events occur. These log entries apply to executables, the package, and the tasks and containers that the package includes. The name of the log entry is the same as the name of the run-time event that was raised and caused the log entry to be written.

Events Description
OnError Writes a log entry when an error occurs.
OnExecStatusChanged Writes a log entry when the execution status of the executable changes.
OnInformation Writes a log entry during the validation and execution of an executable to report information.
OnPostExecute Writes a log entry immediately after the executable has finished running.
OnPostValidate Writes a log entry when the validation of the executable finishes.
OnPreExecute Writes a log entry immediately before the executable runs.
OnPreValidate Writes a log entry when the validation of the executable starts.
OnProgress Writes a log entry when measurable progress is made by the executable.
OnQueryCancel Writes a log entry at any juncture in the task processing where it is feasible to cancel execution.
OnTaskFailed Writes a log entry when a task fails.
OnVariableValueChanged Writes a log entry when the value of a variable changes.
OnWarning Writes a log entry when a warning occurs.
PipelineComponentTime For each data flow component, writes a log entry for each phase of validation and execution. The log entry specifies the processing time for each phase.
Diagnostic Writes a log entry that provides diagnostic information.For example, you can log a message before and after every call to an external data provider

there are some custom logging messages to some of the control flow tasks and data flow tasks

Data Flow Task

The following table lists the custom log entries for the Data Flow task.

Log entry Description
BufferSizeTuning Indicates that the Data Flow task changed the size of the buffer. The log entry describes the reasons for the size change and lists the temporary new buffer size.
OnPipelinePostEndOfRowset Denotes that a component has been given its end-of-rowset signal, which is set by the last call of the ProcessInput method. An entry is written for each component in the data flow that processes input. The entry includes the name of the component.
OnPipelinePostPrimeOutput Indicates that the component has completed its last call to the PrimeOutput method. Depending on the data flow, multiple log entries may be written. If the component is a source, this means that the component has finished processing rows.
OnPipelinePreEndOfRowset Indicates that a component is about to receive its end-of-rowset signal, which is set by the last call of the ProcessInput method. An entry is written for each component in the data flow that processes input. The entry includes the name of the component.
OnPipelinePrePrimeOutput Indicates that the component is about to receive its call from the PrimeOutput method. Depending on the data flow, multiple log entries may be written.
OnPipelineRowsSent Reports the number of rows provided to a component input by a call to the ProcessInput method. The log entry includes the component name.
PipelineBufferLeak Provides information about any component that kept buffers alive after the buffer manager goes away. This means that buffers resources were not released and may cause memory leaks. The log entry provides the name of the component and the ID of the buffer.
PipelineExecutionPlan Reports the execution plan of the data flow. It provides information about how buffers will be sent to components. This information, in combination with the PipelineExecutionTrees entry, describes what is occurring in the task.
PipelineExecutionTrees Reports the execution trees of the layout in the data flow. The scheduler of the data flow engine use the trees to build the execution plan of the data flow.
PipelineInitialization Provides initialization information about the task. This information includes the directories to use for temporary storage of BLOB data, the default buffer size, and the number of rows in a buffer. Depending on the configuration of the Data Flow task, multiple log entries may be written.

for example in a data flow task you can also log some performance counters to measure the buffer allocation and buffer leakage during the transformations

to know more about custom logging messages in SSIS click the link below

http://msdn.microsoft.com/en-us/library/ms345174.aspx

Why do we need custom logging using event handlers?

Some times it becomes overhead by using the default log providers and it often ends up using the event handlers and system variables to log some key information which some times difficult to analyze using default logging

in order to perform the logging using event handlers we need to better understand the system variables and information they can provide


The following table describes the system variables that Integration Services provides for packages.

System variable Data type Description
CancelEvent Int32 The handle to a Windows Event object that the task can signal to indicate that the task should stop running.
CreationDate DateTime The date that the package was created.
CreatorComputerName String The computer on which the package was created.
CreatorName String The name of the person who built the package.
ExecutionInstanceGUID String The unique identifier of the executing instance of a package.
InteractiveMode Boolean Indicates whether the package is run in interactive mode. If a package is running in SSIS Designer, this property is set to True. If a package is running using the DTExec command prompt utility, the property is set toFalse.
LocaleId Int32 The locale that the package uses.
MachineName String The name of the computer on which the package is running.
OfflineMode Boolean Indicates whether the package is in offline mode. Offline mode does not acquire connections to data sources.
PackageID String The unique identifier of the package.
PackageName String The name of the package.
StartTime DateTime The time that the package started to run.
UserName String The account of the user who started the package. The user name is qualified by the domain name.
VersionBuild Int32 The package version.
VersionComment String Comments about the package version.
VersionGUID String The unique identifier of the version.
VersionMajor Int32 The major version of the package.
VersionMinor Int32 The minor version of the package.

please use the below link to better understand the system variables

http://msdn.microsoft.com/en-us/library/ms141788(v=SQL.105).aspx

lets do some  custom logging using event handlers for a simple package

before digging into more details lets create the table to log the details

as

CREATE TABLE [DBO].[PACKAGELOG](
[LOGID] [INT] IDENTITY(1,1) NOT NULL,
[EXECUTIONID] [UNIQUEIDENTIFIER] NULL,
[PACKAGENAME] [NVARCHAR](255) NULL,
[PACKAGEID] [VARCHAR](50) NULL,
[USERNAME] [NVARCHAR](100) NULL,
[MACHINENAME] [NVARCHAR](255) NULL,
[STARTDATETIME] [DATETIME] NULL,
[ENDDATETIME] [DATETIME] NULL,
[SOURCENAME] [VARCHAR](255) NULL,
[MESSAGE] [NVARCHAR](MAX) NULL,
[MESSAGECODE] [INT] NULL,
[LOGDATETIME] [DATETIME] NULL,
[INTERACTIVEMODE] [BIT] NULL,
[SOURCEDESCRIPTION] [VARCHAR](255) NULL,
[TASKNAME] [VARCHAR](255) NULL,
[TASKID] [NVARCHAR](255) NULL,
[EXECSTATUS] [BIT] NULL,
[SERVERNAME] [NVARCHAR](255) NULL,
CONSTRAINT [PK__PACKAGEL__5E5499A820C44329] PRIMARY KEY CLUSTERED
(
[LOGID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
SET ANSI_PADDING OFF
GO
ALTER TABLE [DBO].[PACKAGELOG] ADD CONSTRAINT [DF_PACKAGELOG_EXECSTATUS] DEFAULT ((0)) FOR [EXECSTATUS]
GO

Note : there are some noted issues while using the system:: start time system variable and mapping to any SSIS date time data type

The reason is the value of the system start time looks like  this SYSTEM::STARTTIME  {18/10/2010 14:30:27}

which can not be mapped to any SSIS date time data types we will be getting a error

“The type is not supported.DBTYPE_DBDATE”. Possible failure reasons: Problems with the query, “ResultSet” property not set correctly, parameters not set correctly, or connection not established correctly

To avoid this error we have a work around is to create a variable with string and use a expression and setting the variable value to be evaluate as expression and get the parsed System start time from the variable and map that variable to the execute SQL task

as

(DT_WSTR, 4)YEAR(@[System::StartTime])

+ “-” +
RIGHT(“0” + (DT_WSTR, 2)MONTH(@[System::StartTime]), 2)

+ “-” +

RIGHT(“0″ + (DT_WSTR, 2)DAY(@[System::StartTime]), 2)

+ ” ” +

RIGHT(“0” + (DT_WSTR, 2)DATEPART(“Hour”, @[System::StartTime]), 2)

+ “:” +

RIGHT(“0” + (DT_WSTR, 2)DATEPART(“Minute”, @[System::StartTime]), 2)

+ “:” +

RIGHT(“0” + (DT_WSTR, 2)DATEPART(“Second”, @[System::StartTime]), 2)

+ “.” +

RIGHT(“00” + (DT_WSTR, 3)DATEPART(“Millisecond”,@[System::StartTime]), 3)

insert into dbo.packagelog (ExecutionID,packagename,packageid,username,machinename,interactivemode,sourcename,taskname,taskid,Message,MessageCode,execstatus,servername,startdatetime,enddatetime)
values (?,?,?,?,?,?,?,?,?,?,?,((0)),@@SERVERNAME,?,getdate())

as

log

success log

there are some nice free third party tools to dig the information from the log graphically like log analyzer

http://ssisloganalyzer.codeplex.com/

SSIS Script task and Components (Tool Kit) using – C#

With the introduction of C#, and the embedding of the new Visual Studio Tools for Applications into SSIS, you can ’ t think of using the Script Task and Script Component as scripting anymore; now it ’ s all – out programming. Typically, you ’ d code logic into these Script Tasks to control the
execution and logic flow within a package or to perform some specialized business validation.

The two Scripting Components provide access into a new scripting development environment using Microsoft Visual Studio Tools for Applications (VSTA). This change finally allows us to script logic into packages using Visual Basic 2008 or Visual C# 2008 .NET code.

It is not necessary to know the C# universe in order to write valuable C# script tasks and components . Keep in mind that it is being used as a scripting language, not a full-fledged application development language.

The ability to perform the following basic operations will allow you to do quite a lot.

  • Access SSIS variables
  • Modifying SSIS variables
  • File validation Task (File properties task)
  • Read/write ASCII files
  • Parse strings with the split function
  • Create and loop through list arrays
  • Copy, move and delete files
  • Capture a listing of specified files in a sub directory
  • Create a database connections to SQL Server
  • Execute T-SQL queries and stored procedures and capture any results

Our first task is  accessing  variables

  1. Access SSIS variables

create a variable

drag and drop a script task in control flow

Double click on the variables and use the below shown settings

click on edit script ,once you click the edit script button a window pops pup (Microsoft Visual Studio Tools for Applications (VSTA))

Result

as

Code:

string str = null;

Dts.VariableDispenser.LockForRead(“User::user”);

str = (string) Dts.Variables[“user”].Value;

MessageBox.Show(str);
Dts.TaskResult = (int)ScriptResults.Success;

2. Modifying variables in SSIS using script task

click on the script task and remove the variable “user” as read only variable and assign it to the read write variable

Modifying the variable

string str = null;

Dts.VariableDispenser.LockForWrite(“User::user”);

str = (string)Dts.Variables[“user”].Value;

MessageBox.Show(str,”variable value before changing “);

Dts.Variables[“user”].Value = “Sql junkie share”;

str = (string)Dts.Variables[“user”].Value;

MessageBox.Show(str,”variable value after changing “);

3. File Validation Task 

In any of the ETL processes when dealing with flat files, some times there will be a need to check the existence of the file so lets create a simple file validation task using C#

create a  variable “file exits” in  control flow and assign default value as False

as

Click the button Edit Script

Script

using System;
using System.Data;
using Microsoft.SqlServer.Dts.Runtime;
using System.Windows.Forms;
using System.IO;
namespace ST_ad163f48be91459c9c185a38bccd8436.csproj
{
    [System.AddIn.AddIn(“ScriptMain”, Version = “1.0”, Publisher = “”, Description = “”)]
    public partial class ScriptMain : Microsoft.SqlServer.Dts.Tasks.ScriptTask.VSTARTScriptObjectModelBase
    {
        #region VSTA generated code
        enum ScriptResults
        {
            Success = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Success,
            Failure = Microsoft.SqlServer.Dts.Runtime.DTSExecResult.Failure
        };
        #endregion
        public void Main()
        {
            string filelocation = “D:\\New Text Document.txt”;
            if (File.Exists(filelocation) == true)
            {
                Dts.Variables[“fileexists”].Value = true;
            }
            else
            {
                Dts.Variables[“fileexists”].Value = false;
            }
            Dts.TaskResult = (int)ScriptResults.Success;
        }
    }
}

File exists

File did not exist

Success

Script Component

  1. Read/write ASCII files
  2. Parse strings with the split function

we will use the earlier package , in this section  we will learn how to create a flat file source connection using script component and read the contents of the flat file and push the columns into the data flow task and  parse the contents of the text using split function

Replace the file exists part with a Data flow task , in the data flow task drag and drop a script component , click on the script component and select the component type as Source

and the flat file we will be dealing with here  looks like this

1,john,apt#3 – California USA\
2,tom,st 112 ca;lifornia USA

Task 

  • Read the flat file
  • Separate the columns which are delimited by ‘,’
  • and parse the address column and remove the special characters (apt#3 – California USA\) to (apt 3  California USA)

back to the source component

and in the script section select the scripting language as C# and click the button edit script

Code:
using System;
using System.Data;
using System.IO;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Text;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{

public override void CreateNewOutputRows()
{

StreamReader reader = new StreamReader(“D:\\New Text Document.txt”);

string line;

while ((line = reader.ReadLine()) != null)
{
string[] columns = line.Split(‘,’);

this.Output0Buffer.AddRow();
if (columns.Length > 0)
{
Output0Buffer.id = (columns[0]);
}
if (columns.Length > 1)
{
Output0Buffer.name = (columns[1]);
}
if (columns.Length > 2)
{

string[] split = columns[2].Split(new Char[] { ‘#’, ‘,’, ‘.’, ‘:’, ‘-‘, ‘\\’, ‘;’ });

Output0Buffer.address = ConvertStringArrayToString(split);

}
}
}
static string ConvertStringArrayToString(string[] array)
{
//
// Concatenate all the elements into a StringBuilder.
//
StringBuilder builder = new StringBuilder();
foreach (string value in array)
{
builder.Append(value);
}
return builder.ToString();
}
}

Output