Framework for CSV to SQL Server conversion

Standard

For a migration project from an old database system to a SQL Server database I created a framework for CSV to SQL Server conversion.  From the old database system I would receive several CSV files that had to be imported and converted to SQL Server.

The problemfile-import-script-vacuum

When you do a migration project you’re always having the issue that the “old” system isn’t able to convert data to another database system.

I had to find a way to define sources, columns and datatypes without inventing the wheel everytime a new file was introduced.

Another problem was that some CSV files were only distinquished by year but the format was different. File a-2010 could have five columns and file a-2011 could have the same columns but only four of them. In normal circumstances you’d have to create an SSIS package for every file or create BCP or other scripts to create the inserts.

Furthermore during the proces I had to convert several columns from text to decimal, integer or any other type of data. Because we wanted to sum up some of the columns to check for integrity I had to have a way to easily convert the data to a specfic type.

The solution

Thank mister Microsoft for creating Powershell which is able to connect to SQL Server and is very good with importing CSV files!
I’m jumping to the conclusion but I love Powershell and its abilities en flexbility.

Based on a config file with XML data the application called “BASE” will enable or disable certain actions.

The actions are as followed in execution order:

  1. Initialize all the query files needed for creating objects, importing files etc
  2. Create objects needed for BASE
  3. Loading the table definitions
  4. Retrieving all the tables from the table definitions and create them
  5. Retrieving all the views from the table definitions and create them
  6. Rebuild all the tables (only when the parameter for rebuilding is set)
  7. Rebuild all the views (only when the parameter for rebuilding is set)
  8. Import the CSV files (only when the parameter for import is set)
  9. Check integrity of table definitions (only when the parameter for import is set)
  10. Convert data from staging tables to destination tables (this only works when the integrity check is either disabled or is passed)

One feature that turned out really helpful was the  ability to check which rows returned as an error in step 9.
By executing the following command when you have errors in your conversion you’ll see which rows can’t be converted.

EXEC [dbo].[sp_CheckStagingLog] @maxRecords = 100

 The reason the stored procedure has a parameter to return only the amount needed is because it’s a labor intensive query to get a good overview of the results.

How does it work

BASE works by importing a table definition that is the basic information for creating the tables and views.

The table definitions are loaded into the following tables:

  • _BASE_Staging
  • _BASE_Column
  • _BASE_Table
  • _BASE_Mapping

When the table definition is loaded it is saved in the table “_BASE_Staging” and is a complete reference of the CSV file. After the import BASE will extract the different columns, tables and will map these these values together in the _BASE_Mappings. From here we have all the information to create the necessary tables with their respective columns.

To make the import easy while using id columns BASE will create views based on the table definitions. Basically it creates the same columns in the view as the actual table but excludes the id column.

When all the necessary objects are created BASE will start by importing the files. It will show a progress bar that shows how far the import process is per file. I could’ve made it more accurate but this would only take a lot of time and I didn’t find the information very useful when I created the program.

The data is imported to so called staging tables and these tables always have to start with the following name “STG_” in the table definition.

After the import of the data BASE will start ( when the configuration allows it ) a pre-conversion check. The process will check every column on every row if it contains values that can be converted to the datatype of the destination table. This datatype is set in the table definition.

If no errors occur BASE will continue to start the conversion of the data. The data will be selected from the staging table to the destination table.
When an error occurs BASE will stop and you have to correct the issue before continuing. I chose this because I had to be sure all the data was correct before I continued to the next step.

This, in a nutshell, is what BASE does. It imports the data, checks the data, converts the data.

How does a table definition look like

Because the working of BASE is depended on the table definition I will explain how a table definition is set up.

A table definition is a csv file with the following columns in the header:

  • table_src
  • column_src
  • row_number_src
  • datatype_src
  • length_src
  • scale_src
  • precision_src
  • table_des
  • column_des
  • row_number_des
  • datatype_des
  • length_des
  • scale_des
  • precision_des

Make sure to take the following pointers in consideration:

  1. The value in the column “table_src” always starts with the name “STG_”
  2. The number in the column “row_number_src” doesn’t appear twice for the table
  3. The following datatypes are supported; int, smallint, bigint, decimal, varchar, date and time
  4. The columns in the csv are seperated with semi-colons “;”

I only needed the datatypes mentioned in 3 and because of that no other datatypes are supported. Maybe in the future other datatypes will be added but for now this is it.

The link will show you an example of a table definition:  Example_Tabledefinition

How to configure BASE

The configuration file consists of the following items:

  • BaseConfigFile: Refers to the table definition file
  • QueryDirectory: Directory that contains the queries needed to build the object in the BASE database
  • SourceServer: This is explained by itself
  • SourceDatabase: This is explained by itself
  • DestinationServer: This is explained by itself
  • DestinationDatabase: This is explained by itself
  • ImportDirectory: Directory containing the import files
  • DoRebuild: Do you want to rebuild all the objects ( Y or N)
  • DoImport: Do you want to import the files ( Y or N)
  • DoIntegrityCheck: Do you want to do an integrity check ( Y or N)
  • DoConversion: Do you want to do the conversion ( Y or N)
  • EmailStatus: Do you want to e-mail the status ( Y or N)
  • EmailFrom: From who
  • EmailTo: To who
  • EmailSubject: What is the subject
  • EmailServer: The e-mail server to use

I always used the same server for performance but you could use a different server if you wanted to. I would always use the same database for import and conversion because this makes the process a whole lot faster. You can easily copy the data after the process to another destination database.

The following link shows an example of an configuration file: Examle_Config.xml

How do I start BASE

When you created all the table definitions, set all the configurations and have the import files ready in place you can start the program.

Executing only asks one parameter. Below is an example of the command:

  • BASE.ps1 -path config.xml

How do I get it

If you want to use the software you can download it with the following link: BASE

Disclaimer

All the scripts on sqlstad.nl are published under a Creative Commons license, which means you should refer to me if you want to republish (pieces of) them. Thank you. More information about Creative Commons can be found here: Creative Commons Attribution 3.0 Netherlands License.

Leave a Reply