InfoPackage/DTP Package Size
I learnt something new today about Package size while extracting from ECC into BI 7.0.
I happened to notice that one of the datasources which executes in delta mode from ECC to BI was transferring a very high number of packages. What was strange was that, although the number of packages were high, the number of records that each package contained was significantly less. This aroused my curiosity!
I did know that BW internally allocates the package size, but I did not know the criteria on which this was being done.
On reading a few posts at SDN, I figured out the logic behind this golden number called Package size.
[Credits to original posters at SDN]
Package size = MAXSIZE * 1000 / size of the transfer structure,but not more than MAXLINES.
I tried simulating and it worked out like a gem!
Read on to see the steps involved in this simulation. . .
SAP BI Cube Zero Record Timing problem
This post is all about another problem that I faced in our production system. I had to do significant analysis/simulation to understand the problem.
When you have a configuration as below
- Datastage job to post into BW (3.x datasource)
- Infopackage triggered is of type – Update PSA and data target in parallel, typically a write optimized DSO
- You use an external scheduling tool to trigger the datastage job
- Have another BW chain to extract data from a DSO and post into a cube which is also triggered by an external scheduler
There is a possibility that your DTP load from the DSO to the cube could post empty data.
The reason why this happens is that, BW takes a lot of time to actually update the data in the WO DSO. BW responds back to datastage as soon as it recieves all the data and does not really wait for the data to get saved in the DSO.
If your BI system is slow for any reason, the external scheduler would trigger your DTP for the cube much before the data is actually saved in the DSO. This eventually results in zero records being transferred to the cube.
A solution to this problem is to only post to the PSA by datastage and have a process chain to push data to the DSO and DSO to the Cube.
Sap Datastage – Unable to delete variant problem
It has been a while now since I have created a post. I have slowly migrated SAP’s BI product and have been learning it. Recently we hit a major issue with datastage working with SAP. I had to perform extensive simulation/analysis to understand the problem and come to a solution. I would like to share the analysis with everyone (could save some precious time). . .
Explanation
Multiple calls are made by datastage to the function module Z_RFC_DS_SERVICE with different parameters. Accordingly different activities such as create variant, delete variant, load program etc. is done.
Z_RFC_DS_SERVICE is the function module invoked remotely by datastage through the RFC.
Z_RFC_DS_SERVICE invokes the function module RS_VARIANT_DELETE
This function in turn invokes the form check_v_used_in_job.
perform check_v_used_in_job using rsvar-report rsvar-variant.
This function invokes the function module BP_VARIANT_USED_IN_JOB
Depending on the response from this function module the “unable to rename/delete variant” error is thrown. If the number of jobs in TBTCP is not equal to the number of entries in TBTCO which has finished or aborted status, the error will be thrown.
Open Source solution for Datastage
Ever thought of an Open Source alternative for the expensive IBM product -Datastage. Ladies and Gentlemen http://www.talend.com/demo/index.php
Data Mining | Scores and Models
What is scoring?
Scoring is a well known concept under data mining. For a newbie, Data Mining is the usage of data from a data warehouse for marketing. Data Mining involves complicate logics applied to get the maximum out of the data available with you on sales, guests, inventory etc.
Coming back to our topic, scoring is one of the methodologies under data-mining. Scoring can be applied to several dimensions such as guests, items etc.
Example :
Consider that you are planning to send out an offer on discount of Bakery items. Consider you being a retail giant, will have around 100 million customers. Sending offers to all of them would prove an expensive investment. Hence, you would rather prefer to send it to customers who have a history of purchasing bakery items. That is where scoring comes in. The probability that a customer will buy a bakery item again will be granted as a score to that guest. Say a score of 0 will be the least probability that he will purchase a bakery item and 1 the maximum. This directly depends on the number of times he/she has already purchased bakery items.
What is the Scoring Process?
The above picture summarizes the scoring process. I shall explain in detail each of the sections.
Segment
Segment is the category of data to which the scoring process is to be applied. An example would explain it well.
Example:
Say I decide to run the offer of discount in Bakery items to Women of the age category 30-40. This becomes the segment that forms as a candidate for the scoring process.
Model
Model is the logic or rule being applied to the data to decide what score it gets. Models are usually written in an XML type of language called PMML(Predictive Model Markup Language). Designing a model itself is a critical and complex process. Usually a dedicated team is available for this.
Example:
The model decides how to interpret the data. An example of a model would be
(Total amount spent on bakery Items by the guest)/(Total amount spent by the guest on purchase)
Basically it is just the calculation that needs to be applied on the data to generate the score.
Scoring Engine
The scoring engine is responsible for applying the model on the input segment of data. It could be a Procedure, Java based application etc.
Score
The output of the whole scoring process is the score. The score need not be 0-1 always. It could be any kind of desirable range value.
Campaign
Just having the scores does not complete the work. Now comes the work of campaigning. Once the scores have been assigned, a selection is made on the scored data to identify which need to be chosen or eliminated.
Example:
Once scoring is done for all women in the segment 30-40, we have a range of scores(0-1). The guests can be ordered by scores and the first 30% can be chosen for sending out the offers.
Summarizing everything
A segment of data is taken from a large Data Warehouse to perform the scoring. This selection is decided usually by the Business teams of the retail firm. Usually a view or table is created for each segment as there would be several segments and several models. A model is to be applied to these segments. The model is again done usually by a dedicated modeling team. Model is the statistical analysis that needs to be done for the segment to assign the score. Once the score is generated by the scoring engine the scores are saved in a ordered format. Again as there are multiple models and segments, usually dedicated tables are created. The campaigning team runs a selection criteria on the data based on the scores and decides the audience for the discounted bakery offer.
Firing up db2 9.5 on Ubuntu Hardy Heron
I work on IBM DB2 at office. So, I was curious to set it up at home in my new lappy. It was fun and I did it successfully.
I found it quite difficult to set it up though, but after a few hick-ups I was able to make it. Even though I ran the command line setup, the engine never started when I gave the db2start statement. After a lot of googling I found a solution ironically on a Russian website.
De-package
sudo dpkg -i db2exc_9.5.0-1gutsy1_i386.deb
Set the annoying environment variable
sudo echo “kernel.shmmax=1610612736″ >> /etc/sysctl.conf
Run the Set-up
/home/Prabhu/exp $ . ./db2_install
Start the Engine
db2start
Create the SAMPLE database
db2sampl
Refresh
db2stop
db2start
Connect and Enjoy
db2 connect to sample
ftp timeout in AIX
Ever wondered how you could increase the timeout of an ftp connection. The default ftp timeout limit for for an AIX OS would be 15mins. This can be easily increased.
- Edit the file /etc/inetd.conf file.
- Go to the line where ftp configuration is done. (Run a search in vi)
- add the following “-t <time-out limit>”
- for further details please go to the man pages of the ftp daemon. “man ftpd”.
Updating a large table with partitions
I came across a situation where I had to update a very large table. It had 373 million records. A simple query could time out or you would end up with a snapshot too old error.
UPDATE EMP
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
Now if this table has more than a billion records, the query might not succeed.
The ideal solution is to run an update partition by partition. You can get the partitions for your table by querying dba_tab_partitions.
SELECT TABLE_NAME,PARTITION_NAME
FROM DBA_TAB_PARTITIONS
WHERE TABLE_NAME LIKE ‘EMP’
P_001
P_002
. . . . .
P_030
Now what you could do is to let the query run against each partitions separately.
UPDATE EMP PARTITION(P_001)
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
Moreover if there is an index built on any of the columns you can use that as well and split the query even further.
UPDATE EMP PARTITION(P_001)
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
AND EMP_I > 8612751936
So this way you can split and update the table w.r.t. partitions.
Lookup failing due to huge data volume
This is an annoying situation you would come across in a production environment. Occasionally the volume of data would surge and the lookup stage would not be able to process it.
Sparse lookup stage basically fetches all the records from the driver link and the reference link and then loads it onto the RAM prior to performing the join. Definitely if the volume is above a certain extent the RAM would become a bottle neck. There are two ways by which this can be averted.
1) Modify the lookup type to “Sparse” from “Normal”. This can be done provided the input link count is considerable less. It should be below 1000 ideally.
2) Another option would be replace the lookup with a join. The join stage does not load the records into the RAM and performs quite well. The obvious question would be why not always use a join stage. Lookup stage is expected to work much faster than a join stage.
That’s it for now. . Happy Warehousing!
-
Archives
- January 2011 (1)
- December 2010 (2)
- February 2010 (1)
- August 2008 (5)
-
Categories
-
RSS
Entries RSS
Comments RSS

