In SAP BusinessObjects Data Services, you often need to use the transform "Table Comparison". Within it, you have the choice between three comparison methods: Row-by-row select, Cached comparison table, and Sorted input. This choice can have a huge impact on the data flow performance. Which one is the best for your implementation? That's what we are going to discover in this article. One of the common uses of the "Table comparison" transform is the implementation of slowly changing dimensions (you can see SCD type 2 on the Data Services wiki here) In the screenshot below is a data flow used for SCD type 1.
In a Table Comparison transform, you have the choice between the following comparison methods: Row-by-row select, Cached comparison table, and Sorted input.
The default option (automatically chosen when you create a new Table Comparison transform) is Row-by-row select. What does it do? It actually executes a SQL query on the comparison table every time that it receives an input row. So as you can imagine, this default method is rather slow. You should try to use only for data flows where you have a small number of input rows.
The next option is Cached comparison table. In this case, Data Services will load the entire comparison table. From what I understand, Data Services still does a SQL query (or equivalent) for each input row. However, this time the comparison table is cached (in-memory or paged). As there is no need to do real SQL queries in a database, this option can get much faster than the row-by-row select. Of course you've got to figure out by yourself if it's worth sending the whole comparison table in memory. If your comparison table has millions of rows and you only have 5 input rows in the table comparison transform, then it's probably not worth it.
The last option is the Sorted input. There is a pre-requisite for this method to work: your input rows must be sorted (ascending) in the same order as you've set the primary keys of the transform (the field "Input primary key columns"). In most cases this will require you to insert a Query transform before the Table Comparison transform, and to do the correct sorting in this transform. But once you've done that, this option is faster than the "Cached comparison table" method. As everything is sorted, Data Services indeed only needs to read the comparison table once. To give you an idea, for a comparison table with about 250 000 rows and as many input rows, the "Row-by-row select" method used to take us 7 to 8 minutes. Switching to "Sorted input" brought this time down to a mere 7 seconds. And "Cached comparison table" was only slightly slower, with 10 to 12 seconds.
To summarize: - Small number of input rows: you can keep the default "Row-by-row select" comparison method - Big number of input rows: use "Sorted input" if you can, otherwise choose "Cached comparison table".