Data Mining | Scores and Models
What is scoring?
Scoring is a well known concept under data mining. For a newbie, Data Mining is the usage of data from a data warehouse for marketing. Data Mining involves complicate logics applied to get the maximum out of the data available with you on sales, guests, inventory etc.
Coming back to our topic, scoring is one of the methodologies under data-mining. Scoring can be applied to several dimensions such as guests, items etc.
Example :
Consider that you are planning to send out an offer on discount of Bakery items. Consider you being a retail giant, will have around 100 million customers. Sending offers to all of them would prove an expensive investment. Hence, you would rather prefer to send it to customers who have a history of purchasing bakery items. That is where scoring comes in. The probability that a customer will buy a bakery item again will be granted as a score to that guest. Say a score of 0 will be the least probability that he will purchase a bakery item and 1 the maximum. This directly depends on the number of times he/she has already purchased bakery items.
What is the Scoring Process?
The above picture summarizes the scoring process. I shall explain in detail each of the sections.
Segment
Segment is the category of data to which the scoring process is to be applied. An example would explain it well.
Example:
Say I decide to run the offer of discount in Bakery items to Women of the age category 30-40. This becomes the segment that forms as a candidate for the scoring process.
Model
Model is the logic or rule being applied to the data to decide what score it gets. Models are usually written in an XML type of language called PMML(Predictive Model Markup Language). Designing a model itself is a critical and complex process. Usually a dedicated team is available for this.
Example:
The model decides how to interpret the data. An example of a model would be
(Total amount spent on bakery Items by the guest)/(Total amount spent by the guest on purchase)
Basically it is just the calculation that needs to be applied on the data to generate the score.
Scoring Engine
The scoring engine is responsible for applying the model on the input segment of data. It could be a Procedure, Java based application etc.
Score
The output of the whole scoring process is the score. The score need not be 0-1 always. It could be any kind of desirable range value.
Campaign
Just having the scores does not complete the work. Now comes the work of campaigning. Once the scores have been assigned, a selection is made on the scored data to identify which need to be chosen or eliminated.
Example:
Once scoring is done for all women in the segment 30-40, we have a range of scores(0-1). The guests can be ordered by scores and the first 30% can be chosen for sending out the offers.
Summarizing everything
A segment of data is taken from a large Data Warehouse to perform the scoring. This selection is decided usually by the Business teams of the retail firm. Usually a view or table is created for each segment as there would be several segments and several models. A model is to be applied to these segments. The model is again done usually by a dedicated modeling team. Model is the statistical analysis that needs to be done for the segment to assign the score. Once the score is generated by the scoring engine the scores are saved in a ordered format. Again as there are multiple models and segments, usually dedicated tables are created. The campaigning team runs a selection criteria on the data based on the scores and decides the audience for the discounted bakery offer.
Firing up db2 9.5 on Ubuntu Hardy Heron
I work on IBM DB2 at office. So, I was curious to set it up at home in my new lappy. It was fun and I did it successfully.
I found it quite difficult to set it up though, but after a few hick-ups I was able to make it. Even though I ran the command line setup, the engine never started when I gave the db2start statement. After a lot of googling I found a solution ironically on a Russian website.
De-package
sudo dpkg -i db2exc_9.5.0-1gutsy1_i386.deb
Set the annoying environment variable
sudo echo “kernel.shmmax=1610612736″ >> /etc/sysctl.conf
Run the Set-up
/home/Prabhu/exp $ . ./db2_install
Start the Engine
db2start
Create the SAMPLE database
db2sampl
Refresh
db2stop
db2start
Connect and Enjoy
db2 connect to sample
ftp timeout in AIX
Ever wondered how you could increase the timeout of an ftp connection. The default ftp timeout limit for for an AIX OS would be 15mins. This can be easily increased.
- Edit the file /etc/inetd.conf file.
- Go to the line where ftp configuration is done. (Run a search in vi)
- add the following “-t <time-out limit>”
- for further details please go to the man pages of the ftp daemon. “man ftpd”.
Updating a large table with partitions
I came across a situation where I had to update a very large table. It had 373 million records. A simple query could time out or you would end up with a snapshot too old error.
UPDATE EMP
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
Now if this table has more than a billion records, the query might not succeed.
The ideal solution is to run an update partition by partition. You can get the partitions for your table by querying dba_tab_partitions.
SELECT TABLE_NAME,PARTITION_NAME
FROM DBA_TAB_PARTITIONS
WHERE TABLE_NAME LIKE ‘EMP’
P_001
P_002
. . . . .
P_030
Now what you could do is to let the query run against each partitions separately.
UPDATE EMP PARTITION(P_001)
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
Moreover if there is an index built on any of the columns you can use that as well and split the query even further.
UPDATE EMP PARTITION(P_001)
SET EMP_LOC_C = ’03′
WHERE EMP_LOC = ‘NY’
AND EMP_I > 8612751936
So this way you can split and update the table w.r.t. partitions.
Lookup failing due to huge data volume
This is an annoying situation you would come across in a production environment. Occasionally the volume of data would surge and the lookup stage would not be able to process it.
Sparse lookup stage basically fetches all the records from the driver link and the reference link and then loads it onto the RAM prior to performing the join. Definitely if the volume is above a certain extent the RAM would become a bottle neck. There are two ways by which this can be averted.
1) Modify the lookup type to “Sparse” from “Normal”. This can be done provided the input link count is considerable less. It should be below 1000 ideally.
2) Another option would be replace the lookup with a join. The join stage does not load the records into the RAM and performs quite well. The obvious question would be why not always use a join stage. Lookup stage is expected to work much faster than a join stage.
That’s it for now. . Happy Warehousing!
-
Archives
- March 2012 (1)
- January 2011 (1)
- December 2010 (2)
- February 2010 (1)
- August 2008 (5)
-
Categories
-
RSS
Entries RSS
Comments RSS