Tag Archives: SSIS

Extract Multiple Access DBs (with different schemas) via SSIS Script Task into SQL Server

*Another post moved over from my old blog.  Since I wrote this, a coworker modified the code to extract data from multiple Excel workbooks and tabs in a similar fashion.  At some point I’ll try to add that addendum.

Problem:

We’ve all heard it before – “Well, I get the crucial financial numbers from this Access database…”. Or even worse, “we have hundreds of Access databases all over the place with varying schemas”. On a recent project I was faced with the latter – several hundred Access DBs (mdb files) that had accumulated over a number of years, in various places across the network, all with different schemas. In the client’s defense, some of this was an outside 3rd party’s doing, and they had since implemented an enterprise-class solution for the process going forward. Nevertheless, I was tasked with extracting all of this legacy Access data into SQL Server.

   🙁

 

I figured I would just be able to use a For Each Loop within SSIS to extract each mdb file and just drop it to SQL Server. I quickly realized however that because every Access DB could potentially have a different schema, I would need to dynamically handle the schemas in each source file. I briefly thought about using BIML (see my recent posts about Business Intelligence Markup Language) to dynamically create the SSIS packages, but I really didn’t want hundreds of packages or components since this was a one and done job. So for better or for worse I turned to the trusty Script Task.

Solution:

So what I needed to do was:

  1. Connect to each Access DB file in different locations across the network
  2. Create the table schema for that specific mdb file in SQL Server
  3. Copy the data from the Access file to SQL Server
  4. Run some other processing stored procedures not related to this post

Turns out there was also a 1b) – map each file location as a network drive using a separate Active Directory account. There was no need to keep the data after the last processing step was complete, so I was able to drop the created SQL Server tables when I was done.

Here is an overview of the control flow of my completed SSIS package:

First I use an Execute SQL Task to select all of the file names and locations of the mdb files from a pre-populated table (thankfully I had a spreadsheet of all the file names and locations). To set up the For Each Loop to loop through all the rows returned in the Execute SQL Task, we need to first create a User Variable of type Object – here I called it “mdbsToProcess”. Then in the Execute SQL Task properties, we set the ResultSet property to Full Result Set.

Then we click on the Result Set option and set the result set to the object variable we created.

Next I use a For Each Loop to loop through each of the mdb files I need to process. The For Each Loop is set up as follows to use the mdbsToProcess variable, and loop through each record to get information we need for each mdb file.

For the final code sample – click here. I’ll walk through it below. This is just a copy/paste from the SSIS Script Task, so please ignore all the SSIS default stuff.

After the Script Component generates and loads all of the tables from the mdb file into SQL Server, I simply get the current time, execute some other stored procedures to process the data unrelated to this post, then log the results.

Code Explanation:

By no means am I an App Dev guy, so feel free to improve upon this and let me know. But below is an overview of the more interesting parts of the Script Task code. First we set up all our variables (it’s quick and dirty, so some are hard coded and others are passed in from SSIS).


public void Main()
{
    string FullFilePath = Dts.Variables["User::FullFilePathFromDB"].Value.ToString(); //Full network file path and name of mdb file
    string TargetServer = Dts.Variables["$Package::TargetServer"].Value.ToString(); //Target SQL Server
    string TargetDB = Dts.Variables["$Package::TargetDB"].Value.ToString(); //Target SQL DB
    string tableName = null; //name of table in source
    string TargetTableName = null; //name of table being created/populated
    string DropTableSQL = null;
    string NetworkDrivePassword = "MyNetworkDrivePassword";
    string NetworkDriveUser = "MyNetworkDriveUserName";
    string NetworkSharePath = @"\\MyNetworkPath\Share$"; //verbatim string so the backslashes aren't treated as escapes

    int CountCreatedTables = 0; //Used in log file

    string SourceConnString = ("Provider=Microsoft.Jet.OLEDB.4.0;" + "Data Source=" + FullFilePath);
    string TargetConnString = ("Data Source=" + TargetServer + ";Password=MyPassword;User ID=MyUserName;Initial Catalog=" + TargetDB + ";");

    string command;
    command = "/C NET USE " + NetworkSharePath + " /delete"; //If share mapped, disconnect to prevent multi "user" error

    //WriteToLog("Executing Cmd Line: " + command);
    ExecuteCommand(command, 5000);

    command = "/C NET USE " + NetworkSharePath + " /user:" + NetworkDriveUser + " " + NetworkDrivePassword; //grant access

    //WriteToLog("Executing Cmd Line: " + command);
    ExecuteCommand(command, 5000);

 

Then we execute 2 NET USE commands to first delete any existing mapped network drive by the given name, and then to map it based on the given credentials. This is only necessary if the credentials you run the SSIS package under do not have access to the network shares. I found I had to first delete it if it existed or I could run into a “multi user” error. These commands are executed by the ExecuteCommand() method which just runs the NET USE commands via cmd.exe.
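The ExecuteCommand() method itself isn’t shown in the snippet above. A minimal sketch of what it might look like – this body is my reconstruction based on the description (“runs the NET USE commands via cmd.exe”), and the timeout parameter is assumed to be milliseconds:

```csharp
//Hypothetical sketch of the ExecuteCommand() helper described above.
//Passes the given arguments (including the /C prefix) to cmd.exe and
//waits up to timeoutMs for the command to finish.
public void ExecuteCommand(string command, int timeoutMs)
{
    System.Diagnostics.ProcessStartInfo psi = new System.Diagnostics.ProcessStartInfo("cmd.exe", command);
    psi.UseShellExecute = false;
    psi.CreateNoWindow = true;

    using (System.Diagnostics.Process proc = System.Diagnostics.Process.Start(psi))
    {
        proc.WaitForExit(timeoutMs); //returns false if the command is still running after the timeout
    }
}
```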

Next we use the GetSchema method on the Source Connection (AKA our Access file/OleDbConnection) to populate a data table with every table within the mdb file. Note that we specify the “Tables” schema object type.

OleDbConnection SourceConn = new OleDbConnection(SourceConnString);
SqlConnection TargetConn = new SqlConnection(TargetConnString);

SourceConn.Open();

WriteToLog("Connect to File # " + Dts.Variables["User::FileNumber"].Value + " in " + FullFilePath);

DataTable DBTabsTables = SourceConn.GetSchema("Tables");

SourceConn.Close();

Now that we have a table of all of the Access file tables, we iterate through each one in the DBTabsTables DataTable and generate our DROP and CREATE scripts which we pass to the CreateTable() method. In this case I specified the tables I wanted, but you can remove the IF statement if you want them all.

foreach (DataRow row in DBTabsTables.Rows) //For each table in mdb file
{
    tableName = row["table_name"].ToString();

    if (tableName == "AccessTable1" || tableName == "AccessTable2" || tableName == "AccessTable3") //only get specified tables from Access file. Or remove IF to get all tables
    {
        DropTableSQL = "IF OBJECT_ID('dbo." + tableName + "') IS NOT NULL BEGIN DROP TABLE " + tableName + " END; "; //build drop table if exists SQL

        CreateTable(SourceConn, tableName, TargetConn, DropTableSQL, tableName); //For the initial create, we want to use the source tableName

        ++CountCreatedTables;

        BulkLoadTable(SourceConn, tableName, TargetConn, tableName); //For the initial bulk load, we want to use the source tableName
    }
}

 

Within CreateTable() we again see that we use GetSchema on the connection object, but this time we use the “Columns” object type and another variable called “restrictions”. If you look a few lines from the top of the code snippet below, you can see that “restrictions” is a 4-value array of strings. You can read about the options here, but I am passing the name of the current table within the loop into the 3rd position in the array (remember, it’s 0-based), which is then used in the GetSchema call to restrict it to only 1 table. So now we have a DataTable called ColumnsDataTable which contains 1 record for each column in the designated table.

public void CreateTable(OleDbConnection SourceConn, string tableName, SqlConnection TargetConn, string DropTablesSQL, string TargetTableName)
{
    string[] restrictions = new string[4];

    restrictions[2] = tableName; //restrict which table information is returned by GetSchema

    string accessTableFieldName;

    DataRow myRow;

    SourceConn.Open();

    DataTable ColumnsDataTable = SourceConn.GetSchema("Columns", restrictions); //Fill DataTable with columns information

    SourceConn.Close();

 

When using the schema information to build the CREATE scripts for the new SQL tables, you need to be careful about data types between OLE DB and SQL Server. There is probably a better way to do this, but I used the switch statement below to brute-force change the data type names to their SQL counterparts. (SortedColumnsDataTable below is the ColumnsDataTable from above, sorted by column position.)

//For every row in the table
for (int i = 0; i < SortedColumnsDataTable.Rows.Count; i++)
{   //Get column name and type
    myRow = SortedColumnsDataTable.Rows[i];
    accessTableFieldName = "[" + myRow["column_name"] + "] ";

    switch (((OleDbType)myRow["data_type"]).ToString()) //Change OleDbType to SQL datatypes
    {
        case "Boolean": accessTableFieldName += "bit";
            break;
        case "Currency": accessTableFieldName += "money";
            break;
        case "Date":
        case "DBDate":
        case "DBTimeStamp": accessTableFieldName += "datetime";
            break;
        case "VarWChar":
        case "WChar":
        case "VarChar":
        case "Char": accessTableFieldName += "nvarchar(" + myRow["character_maximum_length"] + ")";
            break;
        case "UnsignedTinyInt": accessTableFieldName += "int";
            break;
        case "Double": accessTableFieldName += "float";
            break;
        default: accessTableFieldName += ((OleDbType)myRow["data_type"]).ToString();
            break;
    }
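The excerpt above cuts off before the CREATE statement is actually assembled and executed. That tail end of CreateTable() isn’t in the original snippet, but based on the variables in play it presumably does something like the following hedged sketch (the exact string handling is my guess, not the original code):

```csharp
//Hypothetical continuation of CreateTable(): the "[column] datatype" pieces
//built in the switch above are appended into one CREATE TABLE statement,
//then the DROP (if exists) and CREATE are run against the target SQL Server.
string CreateTableSQL = "CREATE TABLE dbo.[" + TargetTableName + "] (";

for (int i = 0; i < SortedColumnsDataTable.Rows.Count; i++)
{
    //...switch statement from above populates accessTableFieldName...
    CreateTableSQL += accessTableFieldName + ", ";
}
CreateTableSQL = CreateTableSQL.TrimEnd(' ', ',') + ");";

TargetConn.Open();
using (SqlCommand cmd = new SqlCommand(DropTablesSQL + CreateTableSQL, TargetConn))
{
    cmd.ExecuteNonQuery(); //drop the table if it exists, then recreate it
}
TargetConn.Close();
```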

Lastly, in BulkLoadTable() we use SqlBulkCopy to copy the data from the mdb file to the newly created SQL Server table. Be sure to set the batch size and timeout length to something that can handle the amount of data you are transferring.

public void BulkLoadTable(OleDbConnection SourceConn, string tableName, SqlConnection TargetConn, string TargetTableName) //use SqlBulkCopy to load data from source to target
{
    OleDbCommand SelectFromCmd = new OleDbCommand("SELECT * FROM " + tableName, SourceConn);

    OleDbDataReader rdr;

    SourceConn.Open();
    TargetConn.Open();

    rdr = SelectFromCmd.ExecuteReader();

    SqlBulkCopy sbc = new SqlBulkCopy(TargetConn);

    sbc.DestinationTableName = TargetTableName;
    sbc.BatchSize = 50000;
    sbc.BulkCopyTimeout = 120;

    sbc.WriteToServer(rdr);

    sbc.Close();
    rdr.Close();

    SourceConn.Close();
    TargetConn.Close();

    WriteToLog("Load " + TargetTableName);
}

 

Summary

In this post we discussed how to use SSIS and a Script Task to iterate through any number of Access mdb files in order to import the data into SQL Server. The code can handle Access DBs with varying schemas. This type of process can be very useful when you are dealing with a large number of legacy Access DBs and the Import Wizard isn’t practical. Hopefully this post will help someone else upgrade/migrate a legacy business process into a more enterprise-ready solution – and not allow an “Access for the Enterprise” application to continue to exist!

Erik

@ErikH_BI

 

Having an SSIS Component “Wait” Until a Condition is met or a Time Period Passes

*This is an old post of mine from my employer’s old blog site that I had not moved here.  I remembered it when I was trying to do something similar and figured I’d post it.  Today I would probably do it in C# using a timer or Thread.Sleep.

Background
For a recent project, my colleague (@craigblevin) and I found ourselves reloading a QA environment of a Data Warehouse from scratch. In this case, our source data consisted of all flat files – upwards of 70 gigs worth. In normal operation, our ETL easily handled the incremental data, but we found that reloading all of the data at once choked up our minimally resourced SQL Server. So we broke the files up into more manageable chunks and ran them one at a time (in this case going back to modify our DW ETLs themselves to process one file at a time was not an expedient option). After spending the better part of a weekend taking turns checking on the progress, manually moving a file, and starting the SQL job again, we realized we could build another SSIS package to take care of this for us. But how to make sure the previous run was complete before copying over the next file and restarting the job?

Set up a For Each File Loop

The first step was to set up a simple For Each File Loop in SSIS. In this case, as depicted below, we wanted to move 1 file at a time from our staging directory into our processing folder.

Once the file was in the processing folder, our package fired the SQL Job to start the deployed SSIS packages to load the DW.

Check for the job to finish

Once the deployed job was running, we had to know when it was completed so the package could start the next file sequentially. For this we created a For Loop which contained an Execute SQL Task that checked our custom DW audit tables to see if the job was completed. The SQL Task simply set the variable @IsLoadComplete to 1 when the last run was complete, and the For Loop continued until @IsLoadComplete == 1.

This led to a problem though – the Loop Until Complete container would execute the SQL Task continuously until the loop condition was met.  We did not want to query the server tens of thousands of times until the job was complete.

Have SSIS wait a specified period of time

Since SSIS does not have a native “Wait” task, our quick solution was to create a nested For Loop that does nothing, with an evaluation expression that checks the current time until it’s 10 minutes after the start time of the container. This causes the “Check” SQL statement to run only once per 10 minutes.

EvalExpression: DATEADD(“mi”, 10, @[System::ContainerStartTime]) > GETDATE()

The finished product is below (ignore the warning on the Send Mail task).

How does it perform?

In our case, since we were just loading a QA environment and wanted a quick and dirty solution, we were not concerned with performance.  We successfully allowed this design to run in VS on our local medium-powered machine overnight while processing multiple files without issue. However, a quick check of CPU usage did show a 30% spike – so in the end this solution won’t win any performance awards.

In hindsight, was there a better way? (hint: yes)

While writing this post, and checking the CPU usage, I started thinking there had to be a more efficient way. Since I am not a programmer by nature, I went straight for available SSIS components instead of opening up a script task. But a quick search for “SSIS Script Task Wait” found several examples (such as Mike Davis) of doing this with a Script Task – and sure enough using a Script Task with code such as below, performs the same function with much less CPU usage.

Credit to http://www.bidn.com/blogs/MikeDavis/ssis/158/make-an-ssis-package-delay-or-wait-for-data for the below Script Task code.

Public Sub Main()
    '
    Dim sec As Double = Convert.ToDouble(Dts.Variables("intDelayTime").Value)
    Dim ms As Int32 = Convert.ToInt32(sec * 1000)
    System.Threading.Thread.Sleep(ms)
    '
    Dts.TaskResult = ScriptResults.Success
End Sub

So if we had to do it again, replacing the “Do Nothing” For Loop with a Script Task such as the above would probably be less resource intensive and accomplish the same thing!
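As mentioned in the note at the top of this post, today I’d do this with a C# Script Task and Thread.Sleep. A sketch of the C# equivalent of the VB code above (same intDelayTime SSIS variable assumed):

```csharp
public void Main()
{
    //Read the delay (in seconds) from the SSIS variable, then sleep for that long.
    double sec = Convert.ToDouble(Dts.Variables["intDelayTime"].Value);
    int ms = Convert.ToInt32(sec * 1000);
    System.Threading.Thread.Sleep(ms);

    Dts.TaskResult = (int)ScriptResults.Success;
}
```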

Duplicate objects in SSIS (ErrorCode Column)

In SSIS (multiple versions) you can get a seemingly random error when trying to create an OLEDB Source Component.

Full Error:
The package contains two objects with the duplicate name of “OLE DB Source.Outputs[OLE DB Source Error Output].Columns[ErrorCode]” and “OLE DB Source.Outputs[OLE DB Source Error Output].Columns[ErrorCode]”
The error is caused by a dumb, but in hindsight sort of obvious, problem.  The source query I was using is below.
SELECT [SurveyID]
      ,[VendorSurveyID]
      ,[SurveyStatusID]
      ,[SurveyStatusName]
      ,[ErrorCode]
      ,[ErrorDescription]
      ,[ErrorPriorityID]
      ,[ErrorPriorityDescription]
      ,[InsertDate]
FROM [CurrentSurveyStatus] s
The problem is that my source query had a column explicitly named ErrorCode – which happens to be the name of one of the default built-in columns that are piped to the component’s error output.  The same would be true if I had a column in my source query called “ErrorColumn”.  It’s not readily apparent that these output columns exist because they only show up if you use the component’s error path, or if you look at the Advanced Editor as shown below.

SSIS won’t let you remove the default SSIS ErrorCode column from above, but it will let you rename it in the Advanced Editor.
So you have two options –
  • Rename your column
  • Rename the default SSIS column
Below I renamed the default column as you can see in the Meta Data Viewer, and the component no longer has errors.
Note: The warning is just that no rows would currently be sent to the error output in this example.

BIML Part III – Using BIML Script and Custom Meta Data Tables for Dynamic ETL Frameworks

 
Using BIML, BIML Script and existing metadata, such as INFORMATION_SCHEMA.TABLES, to dynamically generate SSIS packages can be very useful (See my previous post).  But what can be even more useful is to create your own metadata tables which can be used as the drivers for packages created via BIML Script.
Scenario
Let’s say that you have a scenario where you know that you are going to have to continue making a large number of SSIS packages in the future that will perform similar, but slightly different, functions.  Maybe they all use different source and destination tables with different queries, but also some of them have a transform or two while others don’t.  You know that in the future you’re going to have to make countless more of these packages for different feeds, but you don’t know the source queries or which exact transforms each one will need.
Setup
Let’s assume we currently know of three tables we need to populate, one of which needs a derived column. In the future, the other two might also need this derived column, and we know that more tables will follow this same pattern. We’ll start with 3 source tables and 3 destination tables with the “Dest” suffix.  The destination tables have a couple of column differences from the source tables.  We also have a table called BIMLMetaData which will function as the driver for our package generation, similar to how we used INFORMATION_SCHEMA.TABLES in the previous post’s example. You can download the CREATE script here.
The script assumes an existing db called Sandbox, but change it as you like.  It will also insert a few records into the source tables and into our meta data table.
Custom Metadata Table
So now we should have 7 tables – Employee, EmployeeDest, Product, ProductDest, Store, StoreDest, and BIMLMetaData.  Each row in BIMLMetaData stores the information specific to one package, which will be combined with the common elements of all the packages that we will create in our BIML file.  In BIMLMetaData we have:
  • Id – simple Identity column
  • SourceTableName – The name of our source table
  • TargetTableName – The name of our target table
  • SourceSQLQuery – The full SQL query that we want to use as the source for our package
  • NeedsTransform – A bit field to determine if this package requires a transform (in this case a derived column)
  • IsActive – A bit field to determine if we need to generate a package for this row
Three rows should have been inserted into BIMLMetaData from the script above – one each for Product, Employee, and Store.  Note that they are all set to IsActive = 1 and Employee is set to NeedsTransform = 1.
The SourceSQLQuery values are just simple SELECT statements which will be used in our packages.
BIML Code
First in the BIML code we will set up our connections and some variables within C#.  This is very similar to my previous post, except you will notice a query against our BIMLMetaData table instead of the INFORMATION_SCHEMA tables as in our previous example.  Also, as in that example, we are going to begin a foreach loop over each row in the result set ‘MyDataTable’.  In the query we also add a WHERE clause for IsActive = 1, which would allow us to “turn off” BIML regenerating any packages we don’t want it to.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
    <Connections>
        <Connection Name="SandBox" ConnectionString="Provider=SQLNCLI11.1;Data Source=localhost;Initial Catalog=Sandbox;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI;" />
    </Connections>
    <Packages>
        <#
        //Now we are in C#!

        string MyConnectionString = "Provider=SQLNCLI11.1;Data Source=localhost;Initial Catalog=Sandbox;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI;";

        string TargetTableName;
        string TableSchema;
        string SourceTableName;
        int NeedsTransform;
        string SourceSQLQuery;

        DataTable MyDataTable;

        //Populate Data Table with values from BIMLMetaData where IsActive = 1
        MyDataTable = ExternalDataAccess.GetDataTable(MyConnectionString, "SELECT m.SourceTableName, m.TargetTableName, m.NeedsTransform, m.SourceSQLQuery FROM dbo.BIMLMetaData m WHERE m.IsActive = 1");

        /*Loop over each target table row in MyDataTable */
        foreach (DataRow row in MyDataTable.Rows) {
            TargetTableName = row["TargetTableName"].ToString();
            TableSchema = "dbo";
            SourceTableName = row["SourceTableName"].ToString();
            NeedsTransform = Convert.ToInt32(row["NeedsTransform"]);
            SourceSQLQuery = row["SourceSQLQuery"].ToString();
        #>
The following code snippets are contained within the above foreach loop.  For every record in the BIMLMetaData table we are going to create one package named after the source table, count the existing rows in the target table (*Note: we don’t do anything with this count in this demo), and populate the OleDbSource in the Data Flow with the SourceSQLQuery from our meta data table.
        <Package Name="Extract <#=SourceTableName#>" ConstraintMode="Linear" ProtectionLevel="EncryptSensitiveWithUserKey">
            <!--ConstraintMode connects the tasks in order-->
            <Tasks>
                <ExecuteSQL Name="Get initial count" ConnectionName="SandBox">
                    <DirectInput>
                        SELECT COUNT(1) AS Total FROM <#=TargetTableName#>
                    </DirectInput>
                </ExecuteSQL>
                <Dataflow Name="<#=SourceTableName#> Data Flow">
                    <Transformations>
                        <OleDbSource ConnectionName="SandBox" Name="<#=SourceTableName#> Source">
                            <DirectInput>
                                <#=SourceSQLQuery#> <!--Use SourceSQLQuery column from BIMLMetaData to populate the OleDbSource direct input query-->
                            </DirectInput>
                        </OleDbSource>
Next, still contained within the foreach loop, we add another nugget of C# code, this time with an if control statement.  Using the value in the  BIMLMetaData.NeedsTransform column for each table, we can have BIML generate components or logic based on our settings in the meta data table.  
<# if (NeedsTransform == 1)
{#>
                        <DerivedColumns Name="Transform <#=SourceTableName#> Data">
                            <Columns>
                                <Column Name="CreateDate" DataType="DateTime"> <!--Can do multiple derived columns. Values are Expressions-->
                                    GETDATE()
                                </Column>
                            </Columns>
                        </DerivedColumns>
<#}#>
Here I’m just adding a derived column with a GETDATE(), but you could use this technique for much more interesting things, such as which type of Slowly Changing Dimension logic to apply.  With this type of setup, you can configure options and regenerate packages within minutes if a component is added or removed. Lastly, we just add the OleDB Destination, as I have covered in previous posts, along with our directives at the bottom.
                        <OleDbDestination Name="Insert into Destination" ConnectionName="SandBox">
                            <ExternalTableOutput Table="<#=TableSchema#>.<#=TargetTableName#>"></ExternalTableOutput>
                        </OleDbDestination>
                    </Transformations>
                </Dataflow>
            </Tasks>
        </Package>
        <#}#> <!--End for each MyDataTable loop-->
    </Packages>
</Biml>
<!--Directives:-->
<#@ template language="C#" tier="2" #>
<#@ import namespace="System.Data" #>
<#@ import namespace="Varigence.Biml.CoreLowerer.SchemaManagement" #>
Results
Once we have created our BIML file (full code sample available here), we right click and Generate SSIS Packages and should get one package per record in the BIMLMetaData table.
 
What you should notice is that of the three packages generated, only the ‘Extract Employee’ package has our derived column component since it was the only record in BIMLMetaData to have NeedsTransform = 1.
Summary
In this demo we took a look at how you can create your own meta data tables to use in conjunction with BIML to build flexible and configurable SSIS frameworks.  Here we only used a few flags to generate packages with or without a derived column task, but the concept could easily be applied to much more complex logic – such as which SCD type a dimension uses or what behavior updated fact table data should follow.  The thing to keep in mind is that BIML should just be “one tool in your toolbox”.  You want to be careful not to build a very flexible, yet overly complex, set of BIML files when it would be simpler to create a package the old-fashioned way.
Sample Files
Table CREATE script – CREATE_MetaData_Example_Tables.sql
Finished BIML file – BIMLScript_MetaDataTable.biml
Erik

 

BIML Part II – Adding BIML Script to Automate SSIS Package Generation

In my last post (http://www.erikhudzik.com/2016/08/09/introduction-to-business-intelligence-markup-language-biml/) we went through a quick introduction to BIML – AKA Business Intelligence Markup Language.  We used it, with the help of BIDS Helper, to generate a very basic SSIS package simply from the XML-based BIML code.  In this post, we are going to jump ahead to the really cool things that BIML can do by adding in the concept of BIML Script.  With BIML Script, we are going to very quickly create a separate extract and load package for all of the AdventureWorks DW tables just by using metadata.
To jump right ahead, all demo files and samples are at the end of the post.

 

BIML Script

 

Along with BIML proper, our friends at Varigence have also defined a scripting language that can be used within BIML files, called BIML Script.  BIML and BIML Script are somewhat like HTML and ASP.NET.  Using BIML, BIML Script, and BIDS Helper to generate the packages, we can take one BIML file and generate multiple SSIS packages for different meta data with the same exact design pattern.  You can probably already start to imagine the possibilities.

 

I’m going to jump right into the BIML Script part of things, so please see my previous post linked above to get caught up.  First, some light housekeeping.  In order to properly use BIML Script, we need to import some references just as we would in other languages.  We also need to define which .NET language we want to use – C# or VB.NET.  We will be using C# for this demo.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
<!--Directives-->
<#@ template language="C#" #> <!--VB.Net or C#-->
<#@ import namespace="System.Data" #> <!--Imported namespaces to give us access to some BIML Script methods-->
<#@ import namespace="Varigence.Biml.CoreLowerer.SchemaManagement" #>
</Biml>
 

 

The <#@ #> tags are called “Directives” and can actually be included at the beginning or the end of our BIML file, because the compiler finds them first wherever they are.  I like to keep them at the bottom because I think it keeps our BIML files a little cleaner.

 

Now the magic of BIML Script is the ability to embed .NET code “nuggets” inside of the BIML.  We embed these code nuggets within <# #> tags as shown below.  As you can see, within the BIML file we can jump in and out of C#.  Below we jump to C# and define some variables. Unfortunately, at this point IntelliSense gets confused by the mixing of languages, so we lose a lot of syntax highlighting going forward.  I will try to indent and italicize the C# nuggets for clarity.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="EriksDB" ConnectionString="Data Source=localhost;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI;" />
    <Connection Name="EriksDestDB" ConnectionString="Data Source=localhost;Initial Catalog=AdventureWorks2014Destination;Provider=SQLNCLI11.1;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI;" />
  </Connections>
  <Packages>
<#
//Now we are in C#!
//Create some variables

string MyConnectionString = "Data Source=localhost;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI;";
string DestinationTableName;
string TableSchema;
string SourceTableName;
DataTable MyDataTable;
#>
<!--Back in BIML XML-->
  </Packages>
</Biml>

 

As you can see, we can jump back and forth between BIML/XML and BIML Script C# code just by using the <# #> tags.  The important takeaway here is that the embedded BIML Script will be “run” when you compile your BIML file and will be used in the expansion of the resulting BIML file – which is then compiled into your SSIS packages.  You could use BIDS Helper to compile the above BIML, but it would not produce anything.

 

Setting Up Our Demo Environment

 

Now that we have a basic idea of BIML Script, let’s use it to dynamically create a bunch of simple extract and load packages.  For this demo we will use the AdventureWorks 2014 DW sample, as well as a modified version of it for our destination db that removes constraints to make it easier.  Here is the original AdventureWorks 2014 DW db backup, and here is the modified version.  First restore the .bak file from CodePlex to create the AdventureWorks2014 db on your test server, then run the CREATE_AW2014Destination.sql file to create a new db called AW2014Destination with all empty tables.

 

Using BIML Script to Dynamically Create Multiple Packages

 

We will walk through this piece by piece, and then summarize at the end, so please bear with me on the code snippets.  Below is a continuation of the C# snippet from above, which is embedded within the BIML <Packages> tag.  What we added below is a GetDataTable method using a SQL query to pull a list of tables from INFORMATION_SCHEMA.TABLES.  Some tables with hierarchy datatypes were filtered out for simplicity in this demo.

 

We then set up a foreach loop to loop over each row (AKA each record of a table from the meta data).  Then using the #> tag, we close out of the C# nugget and jump back to BIML XML with a <Package> tag.  Notice that within the <Package> tag, the Name attribute contains <#=SourceTableName#>.  Using the <#= … #> syntax, we can reference variables from previous C# BIML Script nuggets – in this case, the SourceTableName variable.

 

Eagle eyed readers will notice that in the snippet below, there is no closing “}” to the foreach loop. We’ll get to that shortly!

 

<Packages>
<#
//Now we are in C#!

string MyConnectionString = "Data Source=localhost;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI";

string DestinationTableName;
string TableSchema;
string SourceTableName;

DataTable MyDataTable;

//Populate Data Table with list of tables
MyDataTable = ExternalDataAccess.GetDataTable(MyConnectionString,
    "SELECT t.TABLE_CATALOG, t.TABLE_SCHEMA, t.TABLE_NAME FROM INFORMATION_SCHEMA.TABLES t WHERE t.TABLE_SCHEMA <> 'dbo' AND t.TABLE_TYPE = 'BASE TABLE' AND t.TABLE_NAME NOT LIKE '%Document%' AND t.TABLE_NAME <> 'Employee'");

/*Loop over each target table row in MyDataTable */
foreach (DataRow row in MyDataTable.Rows)
{
    TableSchema = row["TABLE_SCHEMA"].ToString();
    SourceTableName = row["TABLE_NAME"].ToString();
#>

<Package ConstraintMode="Linear" Name="Extract <#=SourceTableName#>" ProtectionLevel="EncryptSensitiveWithUserKey"> <!--ConstraintMode connects the tasks in order-->

 

 

The key thing to realize with the above snippet is that everything after the foreach loop is STILL contained in the foreach loop – including the BIML XML.  This means that for each row in MyDataTable, when the BIML file is expanded, everything following the foreach’s opening “{” is repeated.  In the full file below, everything between that opening “{” and the closing “}” is repeated for every row in MyDataTable.  You will finally notice the closing “}” bracket in the <#}#> nugget just after the </Package> tag.
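To make the expansion mechanic concrete in miniature: the sketch below is plain Python, not BIML Script’s C#, and the schema/table names are made-up stand-ins for the INFORMATION_SCHEMA rows.  It only illustrates the idea that everything between the foreach’s opening “{” and the closing “}” nugget is emitted once per metadata row.

```python
# Hypothetical Python stand-in for the BIML expansion step: the engine runs the
# C# nuggets and replays the literal XML between them once per loop iteration.

def expand_packages(metadata_rows):
    """Emit one <Package> element per (schema, table) metadata row."""
    packages = []
    for schema, table in metadata_rows:
        # This append corresponds to the BIML between <# foreach ... { #> and <#}#>
        packages.append(
            ('<Package ConstraintMode="Linear" Name="Extract {0}">'
             '<!-- truncate + dataflow for {1}.{0} -->'
             '</Package>').format(table, schema)
        )
    return "<Packages>" + "".join(packages) + "</Packages>"

# Two made-up metadata rows, as if read from INFORMATION_SCHEMA.TABLES
print(expand_packages([("Sales", "Customer"), ("Production", "Product")]))
```

Each tuple plays the role of one DataRow; the real BIML file does exactly this, just with XML outside the nuggets instead of strings inside a function.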

 

<Packages>
<#
//Now we are in C#!

string MyConnectionString = "Data Source=localhost;Initial Catalog=AdventureWorks2014;Provider=SQLNCLI11.1;Persist Security Info=True;Auto Translate=False;Integrated Security=SSPI";
string DestinationTableName;
string TableSchema;
string SourceTableName;
DataTable MyDataTable;

//Populate Data Table with list of tables
MyDataTable = ExternalDataAccess.GetDataTable(MyConnectionString,
    "SELECT t.TABLE_CATALOG, t.TABLE_SCHEMA, t.TABLE_NAME FROM INFORMATION_SCHEMA.TABLES t WHERE t.TABLE_SCHEMA <> 'dbo' AND t.TABLE_TYPE = 'BASE TABLE' AND t.TABLE_NAME NOT LIKE '%Document%' AND t.TABLE_NAME <> 'Employee'");

/*Loop over each target table row in MyDataTable */
foreach (DataRow row in MyDataTable.Rows)
{
    TableSchema = row["TABLE_SCHEMA"].ToString();
    SourceTableName = row["TABLE_NAME"].ToString();
#>

<Package ConstraintMode="Linear" Name="Extract <#=SourceTableName#>" ProtectionLevel="EncryptSensitiveWithUserKey">
  <!--ConstraintMode connects the tasks in order-->
  <Tasks>
    <ExecuteSQL Name="Truncate <#=SourceTableName#>" ConnectionName="EriksDestDB">
      <DirectInput>
        TRUNCATE TABLE <#=TableSchema#>.<#=SourceTableName#>
      </DirectInput>
    </ExecuteSQL>
    <Dataflow Name="DFT Extract">
      <Transformations>
        <OleDbSource ConnectionName="EriksDB" Name="<#=SourceTableName#> Source">
          <DirectInput>
            SELECT *
            FROM <#=TableSchema#>.<#=SourceTableName#>
          </DirectInput>
        </OleDbSource>
        <OleDbDestination Name="Insert <#=SourceTableName#>" ConnectionName="EriksDestDB" KeepIdentity="true">
          <ExternalTableOutput Table="<#=TableSchema#>.<#=SourceTableName#>"></ExternalTableOutput>
        </OleDbDestination>
      </Transformations>
    </Dataflow>
  </Tasks>
</Package>

<#}#>
<!--End for each MyDataTable loop-->
</Packages>

 

The whole BIML file for this demo can be downloaded here.  A very brief review (see my previous post for more) of what the BIML above is actually doing: it uses <Package> </Package> tags to represent each package.  Within that we have <Tasks>, which are similar to the Control Flow.  Here we have an <ExecuteSQL> tag to TRUNCATE the current table in the loop.  Then lastly we have the <Dataflow> tag, which contains a source and destination component based on the current table name within the foreach loop.  The table names and schemas are represented as <#=TableSchema#>.<#=SourceTableName#> to access the C# variables as described above.

 

Now all we have to do is right-click our BIML file (in my screenshot, BimlScript_2.biml) and select Generate SSIS Packages (remember, you need the free BIDS Helper add-in for this).  The results can be seen in the screenshot below – dozens of distinct SSIS packages, one for each AdventureWorks table from the metadata.

 

 

The resulting .dtsx packages should look like the below –

 

 

Summary and Next Steps

 

In this example we built on the basics of BIML from my first post on the topic and introduced the concept of BIML Script – C# or VB.Net code nuggets embedded within the BIML XML itself. Using these embedded code snippets, we gain the extremely powerful ability to generate large numbers of ETL packages dynamically with very few lines of code.  This demo required less than 75 lines of code in the .biml file and only minutes to write, whereas manually creating all of these – even simple – SSIS packages would have taken a few hours.
 

 

In upcoming posts we will continue to build on the concept of dynamic SSIS package generation and try to address some tips and common pitfalls.

 

Sample Files:

 

Adventure Works 2014 DW .bak – Adventure Works DW 2014 Full Database Backup.zip

 

Modified Adventureworks db create script for destination
CREATE_AW2014Destination.sql
Final BIML file – Biml_Script2.biml

 

Erik

 

 

 

 

Introduction to Business Intelligence Markup Language (BIML)

Business Intelligence Markup Language, or BIML, is a dialect of XML that describes Business Intelligence objects such as ETL Packages, Cubes, and even Facts and Dimensions (here we will focus mainly on ETL Packages).  This post will serve as an
intro to this very powerful (and mostly free!) BI development tool.  In this post we will just create a singular SSIS package, but once we dive deeper into BIML’s capabilities we will be able to build dynamic and automated ETL design patterns.  I hope to turn this into a series of posts on BIML.
Unless otherwise noted, we will assume SQL Server 2014 and VS 2012, although much of this applies at least back to SQL Server 2008.  
Getting Started
First, in order to use BIML, you will need a tool to translate the actual BIML code into the desired BI object.  There are two main options for compiling BIML code – the free Visual Studio add-on, BIDS Helper, and the proprietary IDE, Mist, by Varigence.  We will focus on FREE 🙂
· Assuming you already have SQL Server and Visual Studio installed, download and install the awesome VS add-in BIDS Helper (while there, check out the other great features it adds).
· Once BIDS Helper is installed, if you start a new SSIS project, you will have a new option when you right-click on the project in the Solution Explorer.
Now you will get a new file called BimlScript.biml under the “Miscellaneous” folder in the Solution Explorer.  If you open the file, you’ll see the XML root node of your new BIML file.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
</Biml>
Before going any further, let’s set up our demo tables, Employee and EmployeeDest – both with identical columns – and insert a few rows.
/****** Object:  Table [dbo].[Employee]    Script Date: 11/1/2015 4:35:35 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[Employee](
       [EmployeeId] [int] IDENTITY(1,1) NOT NULL,
       [FirstName] [nvarchar](50) NULL,
       [LastName] [nvarchar](100) NULL,
       [Notes] [nvarchar](255) NULL,
       [CreateDate] [datetime] NULL,
 CONSTRAINT [PK_Employee] PRIMARY KEY CLUSTERED
(
       [EmployeeId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
/****** Object:  Table [dbo].[EmployeeDest]    Script Date: 11/1/2015 4:35:43 PM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[EmployeeDest](
       [EmployeeId] [int] IDENTITY(1,1) NOT NULL,
       [FirstName] [nvarchar](50) NULL,
       [LastName] [nvarchar](100) NULL,
       [Notes] [nvarchar](255) NULL,
       [CreateDate] [datetime] NULL,
 CONSTRAINT [PK_EmployeeDest] PRIMARY KEY CLUSTERED
(
       [EmployeeId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
INSERT INTO [dbo].[Employee]
           ([FirstName]
           ,[LastName])
     VALUES
           ('Erik', 'H.'),
           ('Jeff', 'K.'),
           ('Ed', 'B.')
GO
Generate a Simple Package
We will start simple and just create a single package, but you can take a look at the vast array of possibilities at the Varigence website (which is unfortunately always painfully slow) – https://www.varigence.com/Documentation/Language/Element/AstPackageNode.
As you would with an SSIS package made via the standard UI, we will start by setting up our connection managers.  We do this with a <Connections> and <Connection> tag as seen below.  As you can see, thanks to BIDS Helper, we also have some intellisense to help us along.
  <Connections>
    <Connection Name="MyConnection" ConnectionString="Data Source=localhost;Initial Catalog=Sandbox;Provider=SQLNCLI11.1;Integrated Security=SSPI;" />
  </Connections>
With our connection defined, we now want to add the BIML to generate our actual package(s).  As with connections, we start with a <Packages> node, then <Package>, and within <Package> we have <Tasks>.  The <Tasks> tag loosely  translates to the Control Flow tab of an SSIS package in the UI.  So all of the nodes underneath a task, such as ExecuteSQL as shown below, would be Control Flow  objects.
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="MyConnection" ConnectionString="Data Source=localhost;Initial Catalog=Sandbox;Provider=SQLNCLI11.1;Integrated Security=SSPI;" />
  </Connections>
  <Packages>
    <Package Name="MyFirstPackage" ConstraintMode="Linear"> <!--Name and attributes of package. Linear tells BIML to execute the components in sequence.-->
      <Tasks> <!--"Tasks" are roughly equivalent to components on the Control Flow-->
        <ExecuteSQL Name="Select Employees" ConnectionName="MyConnection"> <!--This would be on the Control Flow-->
          <DirectInput>
            SELECT COUNT(1) AS TotalEmployees FROM Employee
          </DirectInput>
        </ExecuteSQL>
      </Tasks>
    </Package>
  </Packages>
</Biml>
So if we take the above code and put it in our BimlScript.biml file (and assuming we  have a DB named Sandbox and the sample tables created), we can then generate our first package.  Right click on the BIML file and select “Generate SSIS Packages”  as shown below.
Assuming no errors, you should now have an SSIS package called  MyFirstPackage.dtsx which looks like the below:

 

 
Simple, and useless!
Adding a Data Flow
Now let’s add some Data Flow components to our BIML file.  As you might expect, within the <Tasks> tag, we now would add a <DataFlow> tag to add a Data Flow to our SSIS Control Flow.  In BIML, all of the components within a Data Flow are called “Transformations” and thus go in… a <Transformations> tag.  Here’s a quick review:
In order to add a Data Flow and an OleDbSource to our package, within the <Tasks>  tag, we add
<Dataflow Name="Extract Employee Data">
  <Transformations>
    <OleDbSource Name="Employee Source" ConnectionName="MyConnection">
      <DirectInput>
        SELECT [EmployeeId]
        ,[FirstName]
        ,[LastName]
        ,[Notes]
        ,[CreateDate]
        FROM [Sandbox].[dbo].[Employee]
      </DirectInput>
    </OleDbSource>
  </Transformations>
</Dataflow>
after our closing </ExecuteSQL> tag.  You’ll notice that within the <OleDbSource> node there is another node for <DirectInput>.  This is exactly what would go in the OLE DB Source Editor SQL command box in the SSIS UI.  This is a good point to note that, as with all XML, BIML IS CASE SENSITIVE.
We could generate the package again at this point, and successfully re-create the MyFirstPackage.dtsx package (note that BIDS Helper helpfully will ask you if you want to overwrite an existing package when you re-generate it), but it wouldn’t do much yet.
Add a Derived Column and OLEDB Destination
For the last part of this demo, let’s add a simple Derived Column transform and then our destination.  So after our closing </OleDbSource> tag, we can add a <DerivedColumns> tag.  As you can see from the screenshot, the intellisense lists the wide variety of components that you can use with BIML.
Within the <DerivedColumns> tag, we can have <Columns> and then each individual <Column>, which contains the value that would go into the Derived Column component’s Expression box in the SSIS UI.  You can add multiple columns in one <DerivedColumns> tag.  Also note that in this case I set ReplaceExisting="true", just as you can in the UI.  At this point, our full BimlScript file should look like this:
<Biml xmlns="http://schemas.varigence.com/biml.xsd">
  <Connections>
    <Connection Name="MyConnection" ConnectionString="Data Source=localhost;Initial Catalog=Sandbox;Provider=SQLNCLI11.1;Integrated Security=SSPI;" />
  </Connections>
  <Packages>
    <Package Name="MyFirstPackage" ConstraintMode="Linear"> <!--Name and attributes of package-->
      <Tasks> <!--"Tasks" are roughly equivalent to components on the Control Flow-->
        <ExecuteSQL Name="Count Employees" ConnectionName="MyConnection"> <!--This would be on the Control Flow-->
          <DirectInput>
            SELECT COUNT(1) AS TotalEmployees FROM Employee
          </DirectInput>
        </ExecuteSQL>
        <Dataflow Name="Extract Employee Data">
          <Transformations>
            <OleDbSource Name="Employee Source" ConnectionName="MyConnection">
              <DirectInput>
                SELECT [EmployeeId]
                ,[FirstName]
                ,[LastName]
                ,[Notes]
                ,[CreateDate]
                FROM [Sandbox].[dbo].[Employee]
              </DirectInput>
            </OleDbSource>
            <DerivedColumns Name="Set Columns">
              <Columns>
                <Column Name="CreateDate" DataType="DateTime" ReplaceExisting="true"> <!--The value inside the Column tags is what would be in the Expression box on a Derived Column component-->
                  GETDATE()
                </Column>
                <Column Name="Notes" DataType="String" ReplaceExisting="true">
                  "Added via ETL"
                </Column>
              </Columns>
            </DerivedColumns>
            <OleDbDestination Name="Insert Employees" ConnectionName="MyConnection">
              <ExternalTableOutput Table="dbo.EmployeeDest"></ExternalTableOutput> <!--Destination target table-->
            </OleDbDestination>
          </Transformations>
        </Dataflow>
      </Tasks>
    </Package>
  </Packages>
</Biml>
And if we re-generate the package, and run it, hopefully it’s all green for you also.
Summary and Next Steps
Obviously this post just scratches the surface of BIML and doesn’t even begin to touch on its applications for dynamic automated development.  But like with everything, with BIML it’s good to start small and build on what you’ve learned.  In this post we covered the basic syntax of BIML, and created a very simple SSIS package to move data from one table to another with a simple transformation.
One of the best things about BIML is that packages created with it are identical to any package created in the SSIS UI – other than maybe straighter lines.  If you generate packages via BIML, a future developer does not need to know a thing about BIML to do maintenance work on your BIML-generated package.
In future posts, I will cover more components and uses for BIML and the added power that BIML Script – .NET code embedded within BIML – gives us to generate truly dynamic packages.
Erik

Have your SSIS Fast Load (Bulk Load) and Row by Row Error Messages too

 
SSIS OLEDB Destination
In the SSIS OLEDB Destination component there are several Data Access options to select from.  For this post we will only be discussing “Table or view” and “Table or view – fast load”.
The difference between these two options is that “Table or view” executes one SQL command for each and every row that you are loading into your destination, while “Table or view – fast load” effectively uses a BULK INSERT command to insert the data from the pipeline.  For an under-the-hood look at the differences, please see Jamie Thomson’s post here – http://consultingblogs.emc.com/jamiethomson/archive/2006/08/14/SSIS_3A00_-Destination-Adapter-Comparison.aspx
 When you choose the fast load option, you are also given a few other options to set.  Rows per batch and Maximum insert commit size have always confused me because they sound similar, so I dug around a bit for the definitions.
Since fast load uses a BULK INSERT, Rows Per Batch and Maximum Insert Commit Size are actually arguments for the BULK INSERT operation.  Maximum Insert Commit Size actually equates to the BATCHSIZE argument and Rows Per Batch equals… ROWS_PER_BATCH.  See this post on MSDN for a bit more. Knowing that, we can define the OLEDB Destination fast load options as such from BOL:
Rows Per Batch (ROWS_PER_BATCH)
“Indicates the approximate number of rows of data in the data file.  By default, all the data in the data file is sent to the server as a single transaction, and the number of rows in the batch is unknown to the query optimizer.  If you specify ROWS_PER_BATCH (with a value > 0), the server uses this value to optimize the bulk-import operation.  The value specified for ROWS_PER_BATCH should be approximately the same as the actual number of rows.”
Maximum Insert Commit Size (BATCHSIZE)
“Specifies the number of rows in a batch. Each batch is copied to the server as one transaction. If this fails, SQL Server commits or rolls back the transaction for every batch. By default, all data in the specified data file is one batch”
So by changing the Maximum Insert Commit Size, we can change the number of rows that will be committed at one time.  SSIS will continue to BULK INSERT batches of this size until all of the rows in the pipeline have been inserted (or failed).  And, if I understand the BOL entry correctly, Rows Per Batch is actually just an optimizer hint to get maximum performance out of the BULK INSERT operation.  That is a topic for another discussion!
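The batching arithmetic is simple enough to sketch.  This is my own back-of-envelope model in Python, not SSIS internals; it just makes explicit what Maximum Insert Commit Size (BULK INSERT’s BATCHSIZE) controls: how many separate transactions a load of N pipeline rows gets split into.

```python
import math

# Idealized helper (an assumption-laden model, not SSIS internals): each
# commit-sized chunk of the pipeline is committed as one transaction, so the
# number of transactions is just the ceiling of rows / commit size.

def transactions_needed(total_rows, max_insert_commit_size):
    """How many all-or-nothing commits a fast load of total_rows performs."""
    return math.ceil(total_rows / max_insert_commit_size)

print(transactions_needed(1_000_000, 10_000))      # 100 separate commits
print(transactions_needed(1_000_000, 2147483647))  # 1 commit (the default: everything at once)
```

A smaller commit size means more, smaller transactions, which matters for the error behavior discussed next.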
Why would you ever NOT use fast load?
With the fast load option, if ANY row in a transaction fails (constraints, PK violations, etc.), that entire transaction will fail to commit.  This means that all of the rows that were part of that BULK INSERT operation – a number controlled by the Maximum Insert Commit Size – will fail and/or be passed to the OLEDB Destination error row path.  Unfortunately, this means that if you leave the default value for Maximum Insert Commit Size (2147483647) and try to insert 1 million records, and 1 record fails a constraint, all 1 million records would be sent to the error output.  And even worse, you can’t get an Error Description from the fast load/BULK INSERT operation.  So you would get 1 million records in your error pipeline, not know what the error was, and not know which row caused the error.
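To make the all-or-nothing behavior concrete, here is a rough simulation – plain Python, nothing SSIS-specific, with an invented validity rule – of how one bad row poisons an entire commit-sized batch:

```python
# Simplified model of fast-load batching (assumption: every commit-sized batch
# is a single all-or-nothing transaction; real SSIS buffering is ignored).

def fast_load(rows, max_insert_commit_size, row_is_valid):
    """Return (committed, errored) rows under all-or-nothing batch commits."""
    committed, errored = [], []
    for i in range(0, len(rows), max_insert_commit_size):
        batch = rows[i:i + max_insert_commit_size]
        if all(row_is_valid(r) for r in batch):
            committed.extend(batch)  # the whole batch commits as one transaction
        else:
            errored.extend(batch)    # the whole batch fails, with no row-level detail
    return committed, errored

rows = list(range(1_000_000))
ok, bad = fast_load(rows, max_insert_commit_size=2147483647,
                    row_is_valid=lambda r: r != 42)
print(len(bad))  # all 1,000,000 rows land in the error output because one row is bad
```

Drop the commit size to, say, 10,000 and only the one batch containing the bad row errors out, while the rest commit.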
Getting the benefits of Fast Load, but still capture error rows and descriptions
What you can do to get the benefits of a fast load, but still get the detailed error information of the non-fast-load row-by-row insert, is to chain two OLEDB Destinations together and change the Maximum Insert Commit Size.
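The logic of that chained pattern can be sketched abstractly.  This is plain Python with invented helper names, not the actual SSIS components; it just shows why routing a failed batch to a row-by-row fallback recovers per-row error detail without giving up fast load for the clean batches:

```python
# Hypothetical sketch of the chained-destination pattern: try the bulk path
# first; only the rows of a failing batch take the slow row-by-row path, where
# the specific failing row (and an error description) can finally be captured.

def load_with_fallback(rows, commit_size, row_is_valid):
    """Bulk-insert clean batches; retry each failing batch row by row."""
    committed, error_rows = [], []
    for i in range(0, len(rows), commit_size):
        batch = rows[i:i + commit_size]
        if all(row_is_valid(r) for r in batch):
            committed.extend(batch)  # fast path: one bulk transaction
        else:
            for r in batch:          # fallback: row-by-row insert of just this batch
                if row_is_valid(r):
                    committed.append(r)
                else:
                    # now we know exactly which row failed, and why
                    error_rows.append((r, "constraint violation"))
    return committed, error_rows

ok, errs = load_with_fallback(range(100), commit_size=25,
                              row_is_valid=lambda r: r != 7)
print(len(ok), errs)
```

Only one batch out of four pays the row-by-row cost here; the other three keep the full bulk-load speed, which is exactly the trade the two-destination chain is making.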