Archive for the ‘Business Intelligence’ Category

OLE DB versus ADO.NET Components in SSIS

ETL package developers often overlook these questions: What is the difference between the OLE DB and ADO.NET components? Should I use the OLE DB Source/Destination or the ADO.NET Source/Destination? Which one is faster?

Let me explain the difference. Both of them rely on drivers/providers behind the scenes, and the only real difference is that those providers have been developed using different technologies: ADO.NET providers are managed components built on ADO.NET (the .NET data access classes used from languages such as C#), while OLE DB providers are native, COM-based components. Recently, Microsoft said that it would not be enhancing the features of the OLE DB providers, so don’t expect much more from OLE DB; hence, you will need to use ADO.NET providers more and more. Don’t get scared! In fact, if you wish to connect to Azure SQL Database, you need to use ADO.NET providers, as OLE DB doesn’t offer the flexibility to connect to Azure SQL Database or Azure SQL Data Warehouse. However, OLE DB has more providers for different types of data stores than ADO.NET, at least as of now.
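Just to make this concrete, here is a minimal sketch of an ADO.NET (SqlClient) connection string for Azure SQL Database; the server, database and credentials below are placeholders, not real values:

    Server=tcp:yourserver.database.windows.net,1433;Initial Catalog=YourDatabase;User ID=youruser;Password=yourpassword;Encrypt=True;Connection Timeout=30;

You would put such a string in an ADO.NET connection manager and then point the ADO.NET Source/Destination at that connection manager.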

As far as performance is concerned, I personally did R&D with both Sources/Destinations using large data sets (roughly 1-5 GB) and all types of packages: database to database, database to file, and file to database. Eventually, the statistics showed that OLE DB has an edge over ADO.NET, giving roughly 60% better performance. Yes! The ADO.NET Sources/Destinations process data more slowly than OLE DB. Here, I would like to mention that I tweaked all the relevant options, such as DefaultBufferMaxRows, DefaultBufferSize and AutoAdjustBufferSize (SSIS 2016), before reaching this conclusion. And my package development experience corroborates this.

Hence, I recommend using the OLE DB Source/Destination for the time being, until Microsoft enhances the ADO.NET providers. Yes, you may still need to use ADO.NET in the case of Azure, which is becoming very common nowadays as more and more clients move their databases to Azure.

There are many other factors which affect the package performance, which I am going to discuss in other blogs. So, stay tuned!

Hope this helps you out.

 


Why Not to Use Sort and Aggregate Transformations in SSIS

I often come across queries about SSIS package optimization techniques. There are many, and it all depends on the scenario. However, there are a few fundamental principles for architecting SSIS packages that help us design packages which run faster. Here, I am going to talk about two transformations that I always avoid: the Sort transformation and the Aggregate transformation.

You can categorize all the transformations into two broad categories: Synchronous and Asynchronous. Or, you may divide them into three types: Non-Blocking, Semi-Blocking and Blocking. This is not the place to discuss every category/type; I will be discussing only Blocking transformations here. Just for the sake of your knowledge, Non-Blocking transformations belong to the Synchronous category, while Semi-Blocking and Blocking transformations belong to the Asynchronous category.

Blocking transformations first accept the entire input, i.e. all rows from the source, then perform the assigned operation, and only then produce output. During this process, you have to wait for the transformation to collect all the required rows before getting any result. This is what degrades Data Flow performance and, in turn, slows down the package.

Both the Sort and Aggregate transformations belong to the Blocking type. This is the reason I always recommend avoiding them altogether in your pipeline. Then the question arises: how do you sort the rows and, if required, get the aggregates?

There are ways! In the case of an OLE DB Source with SQL Server as the source, use a SQL query (not the table name from the drop-down list) and perform these operations in the query; they will run on the database engine machine and return the result far more quickly (see the sketch below). In the case of a flat file/raw file, request the team producing the file to calculate these values and put them either in another file or in the first one or two lines of the source file. I am sure you know how to skip initial lines while reading flat files; the Flat File Connection Manager has a “Header rows to skip” option. Check it out! This is how you can avoid Blocking transformations like Sort and Aggregate and, in turn, achieve better package performance.
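As a quick sketch, suppose the source is a hypothetical dbo.Sales table and we need per-customer totals sorted by customer. Instead of a Sort plus an Aggregate transformation in the Data Flow, set the OLE DB Source’s data access mode to “SQL command” and push both operations into the query (table and column names here are illustrative only):

    -- Aggregation and sorting happen on the SQL Server engine,
    -- so no Blocking transformations are needed in the Data Flow.
    SELECT   CustomerID,
             SUM(SalesAmount) AS TotalSales,
             COUNT(*)         AS OrderCount
    FROM     dbo.Sales
    GROUP BY CustomerID
    ORDER BY CustomerID;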

I will be blogging about other techniques to optimize the SSIS package. So, stay tuned!

 

Connecting Oracle and Teradata Using SSIS

SSIS offers various options to connect to Oracle but, of course, none for Teradata. However, there are third-party connectors for both Oracle and Teradata, and Microsoft officially recommends them; the first choice is the Microsoft Connectors for Oracle and Teradata by Attunity.

If you use the OLE DB Source/Destination, you get the Microsoft OLE DB Provider for Oracle, while if you use the ADO.NET Source/Destination, you get the OracleClient Data Provider. Both of these options work fine; however, the ETL process becomes slower, rather I would say VERY SLOW, if the volume of data is huge. On top of this, if you are running the 32-bit version, it hurts even more.

Hence, if you wish your ETL process to run faster, I recommend using the Microsoft Connectors for Oracle and Teradata by Attunity. These connectors provide a 60-70% performance improvement, and THAT’S PROVEN. You can download these connectors from the Microsoft site for free; click here to download them. They are also available on Attunity.com, but, of course, at a price. You will get connectors for SQL Server 2016 and SQL Server 2017.

Having installed these connectors, restart Visual Studio and they will appear in the SSIS Toolbox under the Common category.

Hope this blog helps you out!

Connecting SAP HANA Using SSIS

In order to connect to SAP HANA, all we need to do is install the SAP HANA Client tools, which can be downloaded from the SAP web site. Unfortunately, Microsoft has not yet officially released any connector to help us out.

The SAP HANA Client tools are available in both 32-bit and 64-bit versions. What we did was install the 64-bit SAP HANA Client tools but, to our surprise, it didn’t work. We always received the error message “The type initializer for Sap.Data.Hana.HanaConnection threw an exception.” despite providing the correct credentials, and we had no idea what was wrong. Then the thought of installing the 32-bit SAP HANA Client tools did the magic, and we were able to connect in whatever manner we wished, i.e. by creating a DSN on the system, using Server Explorer in Visual Studio, using the ADO.NET Source in SSIS, etc.

We recommend installing the 32-bit SAP HANA Client tools in the development environment. It is possible that the 64-bit version might work in the production environment; however, we are yet to test that possibility. For the time being, this is good enough to get going.

Hope this helps you out!

Different Types/Data Stores in Linked Services in Azure Data Factory

I am sure you know what a Linked Service in Azure Data Factory is. If not, search my blog and read about it; I have already blogged about it earlier. Here, I’m going to list all the types/data stores that a Linked Service can connect to. This list will keep changing in the future as the Azure team adds or removes data stores.

Presently, there are 25 Types/Data Stores available for Linked Services to connect to. Here goes the list:

  1. Azure Storage:- ["type": "AzureStorage"]
  2. Azure Storage SAS:- ["type": "AzureStorageSas"]
  3. Azure Data Lake Store:- ["type": "AzureDataLakeStore"]
  4. Azure SQL Database:- ["type": "AzureSqlDatabase"]
  5. Azure SQL Data Warehouse:- ["type": "AzureSqlDW"]
  6. Azure DocumentDB:- ["type": "DocumentDb"]
  7. SQL Server (On-Premises):- ["type": "OnPremisesSqlServer"]
  8. Cassandra:- ["type": "OnPremisesCassandra"]
  9. MongoDB:- ["type": "OnPremisesMongoDb"]
  10. Oracle:- ["type": "OnPremisesOracle"]
  11. File System:- ["type": "OnPremisesFileServer"]
  12. DB2:- ["type": "OnPremisesDb2"]
  13. MySQL:- ["type": "OnPremisesMySql"]
  14. Teradata:- ["type": "OnPremisesTeradata"]
  15. PostgreSQL:- ["type": "OnPremisesPostgreSql"]
  16. Sybase:- ["type": "OnPremisesSybase"]
  17. HDFS:- ["type": "Hdfs"]
  18. ODBC:- ["type": "OnPremisesOdbc"]
  19. OData:- ["type": "OData"]
  20. Web:- ["type": "Web"]
  21. Salesforce:- ["type": "Salesforce"]
  22. Amazon Redshift:- ["type": "AmazonRedshift"]
  23. Amazon Web Services:- ["type": "AwsAccessKey"]
  24. FTP Server:- ["type": "FtpServer"]
  25. Azure Search:- ["type": "AzureSearch"]

The "type" is mentioned in the JSON script, and then we need to write the "typeProperties" as per that type. For example, if the type is AzureStorage, we need to mention the connection string in the typeProperties section; if it is Web, we need to mention the authentication type and the URL; and so on.
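For illustration, here is a minimal sketch of a Linked Service JSON script for the AzureStorage type; the name is just my own convention, and the account name and key are placeholders:

    {
      "name": "StorageLS",
      "properties": {
        "type": "AzureStorage",
        "typeProperties": {
          "connectionString": "DefaultEndpointsProtocol=https;AccountName=<accountname>;AccountKey=<accountkey>"
        }
      }
    }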

I will be discussing each data store and a few of its important properties in separate blogs. So, stay tuned!

What are Linked Services in Azure Data Factory

Linked Services in Azure Data Factory play the role of connection strings to external resources.

As we know, Data Factory is a kind of cloud data integration service; in other words, it is a cloud-based ETL service. Every ETL tool connects to data sources and destinations and processes the data, and hence requires connection details like the server name or drive name/folder, blob containers, database name, user credentials, etc. We create a Linked Service in Data Factory to store this information, which can then be used by the Activities in the pipelines. It helps us connect to external resources like Azure SQL, on-prem SQL Server or Oracle, Azure HDInsight, Azure Storage, etc. We can create more than one Linked Service in the same Data Factory.

If you’re a SQL Server Integration Services (SSIS) developer, treat this as the connection string inside a Connection Manager. That’s it!

Developers usually add the word “LinkedServices” as a suffix to the name of a Linked Service, though I prefer to add “LS”, which helps me keep the names of these objects shorter. Linked Service information is maintained as JSON in a physical file, which is also known as the JSON script.
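For example, a minimal sketch of such a JSON script for an Azure SQL Database Linked Service, using my “LS” naming convention, could look like the one below (the server, database and credentials are placeholders):

    {
      "name": "AzureSqlLS",
      "properties": {
        "type": "AzureSqlDatabase",
        "typeProperties": {
          "connectionString": "Server=tcp:<server>.database.windows.net,1433;Database=<database>;User ID=<user>;Password=<password>;Encrypt=True;"
        }
      }
    }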

Hope this helps you get an idea about Linked Services.

Happy reading!

What is Azure Data Factory

In simple words, Azure Data Factory is an Extract, Transform and Load (ETL) tool which does more than an ETL tool usually can and has been designed for the cloud. You may say it’s a data integration service in the cloud. It’s just a two-and-a-half-year-old kid of the Azure team, which is maturing with the passage of time.

As per the MSDN documentation, Azure Data Factory is a cloud-based data integration service that composes data storage, movement, and processing services into automated data pipelines.

From a BI developer’s perspective, it’s just a cloud-based ETL tool. You design connections (Linked Services), data store schemas (Datasets/Tables) and Activities, which help you connect to a data source, pull the data out, transform it and then load it into a data destination. The beauty of this tool/service is that it can connect to diverse data stores, i.e. relational and non-relational, big data and analytical data, which are very difficult to process using existing ETL tools such as SSIS or Informatica. And the bonus is that we can utilize Azure services, i.e. hardware and software, to improve the performance of the pipelines.

I have just given you a brief overview of Azure Data Factory. Click here if you wish to know more about it.

Happy reading!