Page 28 - DCAP606_BUSINESS_INTELLIGENCE
P. 28
Unit 2: Multidimensional Analysis
Notes
For fact table data that has a much larger number of interesting groups (such as per-
product sales information of a large retailer), retaining the sparse format becomes critical
to avoid densification of such high cardinality information. Oracle Data Mining algorithms
are designed to interpret missing entries in a sparse fact table appropriately, enabling
increased performance and simpler transformation processing.
Some steps in the referenced case study are not completely defined (in my opinion), and in
those situations I will take my best guess as to the intended objective. This approximation
is sufficient since the intent of this series of posts is to show the power and flexibility of
Oracle Data Mining on a real-world scenario rather than to match the case study letter-for-
letter.
The following files support reproduction of the results in this series of posts:
telcoddl.sql - SQL which creates the four tables
telcoloadproc.plb - Obfuscated SQL which creates the procedure that can generate data
and populate the tables - all data is generated, and patterns are injected to make it interesting
and “real-world” like
telcoprep.sql - A SQL create view statement corresponding to the data preparation steps
from part 2 of this series
telcomodel.sql - A SQL script corresponding to the steps from part 3 of this series
In order to prepare a schema that can run the above SQL, a user must be created with the
following privileges: create table, create view, create mining model, and create procedure
(for telcoloadproc), as well as any other privs as needed for the database user (e.g., create
session).Once the schema is prepared, telcoddl.sql and telcoloadproc.plb can be run to
create the empty tables and the procedure for loading data. The procedure that is created
is named telco_load, and it takes one optional argument - the number of customers (default
10000). The results from parts 2 and 3 of this series correspond to loading 10,000 customers.
The sample code in these posts has been tested against an 11gR2 database. Many new
features have been added in each release, so some of the referenced routines and syntax
are not available in older releases; however, similar functionality can be achieved with
10g. The following modified scripts can be used with 10g (tested with 10gR2):
telcoprep_10g.sql - A SQL create view statement corresponding to the data preparation
steps from part 2 of this series, including substitution for the 11g PIVOT syntax and
inclusion of manual data preparation for nested columns.
telcomodel_10g.sql - A SQL script corresponding to the steps from part 3 of this series,
including substitution of the Generalized Linear Model algorithm for 10g Support Vector
Machine, manual data preparation leveraging the transformation package, use of
dbms_data_mining.apply instead of 10gR2 built-in data mining scoring functions, explicit
commit of settings prior to build, and removal of the EXPLAIN routine from the script
flow.
In addition, the create mining model privilege is not available in 10g.
1. Handling missing values for call data records: The CDR_T table records the number
of phone minutes used by a customer per month and per call type (tariff). For
example, the table may contain one record corresponding to the number of peak
(call type) minutes in January for a specific customer, and another record associated
with international calls in March for the same customer. This table is likely to be
fairly dense (most type-month combinations for a given customer will be present)
Contd....
LOVELY PROFESSIONAL UNIVERSITY 23