Data Mining Beyond Algorithms
Dr Akeel Al-Attar
Managing Director - Attar Software Limited
Chief Technical Officer - Knowledge Process Software Plc
1 The case for data mining
Most organisations can be currently labelled 'data rich' since they are
collecting increasing volumes of data about business processes and resources.
Typically these data mountains are used to provide endless 'facts and
figures' such are 'there are 60 categories of occupation', '2000 mortgage
accounts are in arrears' etc. Such 'facts and figures' do not represent
knowledge but if anything can lead to 'information overload'. However,
patterns in the data represent knowledge and most organisations nowadays
can be labelled 'knowledge poor'. Our definition of data mining is the
process of discovering knowledge from data.
Data mining enables complex business processes to be understood and re-engineered.
This can be achieved through the discovery of patterns in data relating
to the past behaviour of a business process. Such patterns can be used
to improve the performance of a process by exploiting favourable patterns
and avoiding problematic patterns.
Examples of business processes where data mining can be useful are customer
response to mailing, lapsed insurance policies and energy consumption.
In each of these examples, data mining can reveal what factors affect
the outcome of the business event or process and the patterns relating
the outcome to these factors. Such patterns increase our understanding
of these processes and therefore our ability to predict and affect the
2 Data Mining Technologies
There is a high degree of confusion among the potential users of data
mining as to what data mining technologies are. This confusion has been
compounded by vendors, of complimentary technologies, positioning their
tools as data mining tools. So we have vendors of query and reporting
tools and OLAP tools claiming that their tools can be used for data mining.
Whilst it is true that one can discover useful patterns in the data using
these tools there is a question mark as to who is doing the discovery
the user or the tool! For example, query and reporting tools will interrogate
the data and report on any pattern (query) requested by the user. This
is a 'manual' and 'validation driven' method of discovery in the sense
that unless the user suspects a pattern he/she will never find it! A marginally
better situation is encountered with OLAP tools which can be termed 'visualisation
driven' since they assist the user in the process of pattern discovery
by displaying multi-dimensional data graphically. The class of tools that
can genuinely termed 'data mining tools' are those that support the automatic
discovery of patterns in data.
We are going to make one more assertion regarding the difference between
data mining and data modelling. Data mining is about discovering understandable
patterns (trees, rules or associations) in data. Data modelling is about
discovering a model that fits the data, regardless of whether the model
is understandable (e.g. tree or rules) or a black box (e.g. neural network).
Based on this assertion we restrict the main data mining technologies
to induction and the discovery of associations.
2.1 Rule Induction
Rule or decision tree induction is the most established and effective
data mining technologies in use today. It is what can be termed 'goal
driven' data mining in that a business goal is defined and rule induction
is used to generate patterns that relate to that business goal. The business
goal can be the occurrence of an event such as 'response to mail shots'
or 'mortgage arrears' or the magnitude of an event such as 'energy use'
or 'efficiency'. Rule induction will generate patterns relating the business
goal to other data fields (attributes). The resulting patterns are typically
generated as a tree with splits on data fields and terminal points (leafs)
showing the propensity or magnitude of the business event of interest.
As an example of tree induction data mining consider the data table below
which represents captured data on the process of loan authorisation. The
table captures a number of data items relating to each loan applicant
(sex, age, time at address, residence status, occupation, time in employment,
time with the bank and monthly house expenses) as well as the decision
made by the underwriters (accept or reject).
The objective of applying rule induction data mining to the above table
is to discover patterns relating the decisions made by the loan underwriters
to the details of the application. Such patterns can reveal the decision
making process of the underwriters and its consistency as shown in the
tree below which reveals that the time with the bank is the attribute
considered most important which a critical threshold of 3 years. For applicants
that have been with the bank over 3 years, the time in employment is considered
the next most important factor and so on. The tree below reveals 5 patterns
(leafs) each with an outcome (accept or reject) and a probability (0 -
1). High probability figures represent consistent decision making.
The majority of data miners who use tree induction will most probably
use as in automatic algorithm which can generate a tree once the outcome
and the attributes are defined. Whilst this is a reasonable first cut
for generating patterns from data, the real power of tree induction can
be gained using the interactive (incremental) tree induction mode. This
mode allows the user to impart his/her knowledge of the business process
to the induction algorithm. In interactive induction, the algorithm stops
at every split in the tree (starting at the root) and displays to the
user the list of attributes available for activating a split, with these
attributes being ranked by the criteria of the induction engine for selecting
attributes (significance, entropy or a combination of both). The user
is also presented with the best split of attribute values (threshold or
groups of values) according to the algorithm. The user is then free to
select the top ranking attribute (and value split) according to the algorithm
or select any other attribute in the ranked list. This allows the user
to override the automatic selection of attributes based on the user's
background knowledge of the process. For example, the user may feel that
the relationship between the outcome and the best attribute is a spurious
one, or that the best attribute is one that the user has no control over
and should be replaced by one that can be controlled.
Interactive induction can also be seen as bridging the gap between OLAP
based manual data segmentation / exploration and algorithm assisted segmentation.
2.2 The Discovery of associations
This is the second most common data mining technology and involves the
discovery of associations between the various data fields. One popular
application of this technology is the discovery of associations between
business events or transactions. For example discovering that 90% of customers
that buy product A will also buy product B (basket analysis) or that in
80% of cases when fault 1 is encountered then fault 7 is also encountered.
If the sequence of events is important then another data mining technology
for discovering sequences can be used.
A second application of associations discovery data mining is the discovery
of associations between the fields of case data. Case data is data that
can be structured as a flat table of cases. Records of mortgage applications
is an example of case data. In such data, associations can be found between
data fields; for example that 75% of all applicants that are over 45 and
in managerial occupations are also earning over £40,000. Such associations
can be used as a way of discovering clusters in the data. Note that this
differs from rule induction on case data in that no outcome needs to be
defined for the discovery process.
3 Case Studies
A number of case studies are described in the following sections which
detail the background to each case studies, the data mining approach used
and the benefits gained by the organisations concerned.
3.1 Case Study 1: Mortgage Lending
This case study comes from a UK Mortgage Lender which had a mortgage portfolio
in which 9.8% of all accounts were in arrears (over 3 months in arrears)
and 4.1% of all accounts were in severe arrears (over 6 months in arrears).
The objectives of the data mining project was to discover patterns relating
the propensity of arrears to the mortgage application data. Such patterns
can be applied at the front end applications processing to reduce the
level of arrears and can result in better management of accounts that
go into arrears.
Rule induction data mining was used with the outcome of the analysis
being the arrears status healthy, moderate arrears or severe arrears.
The attributes of the data mining analysis were the mortgage application
data such as age, income, occupation, term, loan amount, region etc..
Two separate data mining analysis were carried out; one to discover the
patterns of arrears and the second to discover the patterns of severe
arrears. Rule induction generated the following tree for arrears with
splits on the attributes %Adv (% of loan to property value), SecondCh
(second charge on property), Owned (is the property already owned), sub/Inc
(subscription to income), AppSource (application source) and Region. The
tree reveals 10 profiles with a propensity for arrears ranging from 0.02
The trees discovered from the arrears data was used in three ways:
- to generate policy rules which were introduced at the application
processing stage to reduce arrears.
- to formulate a marketing strategy targeting low risk profiles
- to focus arrears management resources on accounts most likely to
end up in severe arrears
3.2 Case Study 2: Life Underwriting
This case study is from the Norwich Union in Ireland who like other Life
Insurers were facing the challenges of reducing costs, maintaining market
share and meeting market demands. In order to meet these challenges, Norwich
Union decided to re-engineer its Life Underwriting Process in order to
speed up the process and reduce its costs.
The first phase of re-engineering involved the implementation of a rule
based underwriting system which was used to automate the processing of
Life Proposals at the point of application. The system involved capturing
underwriting knowledge which resulted in 51% of proposals being underwritten
automatically with the remaining cases being referred to head office for
Whilst the automated underwriting system proved very successful, Norwich
Union looked for ways of increasing the percentage of cases that can be
processed by the system. Attempts were made to capture more advanced underwriting
rules, however this proved to be very difficult. Data mining was then
considered as an alternative for generating additional knowledge. The
basic premise was that out of the 49% of cases being referred to Head
Office a significant number were underwritten with no or a very small
additional premium (less than the cost of the manual underwriting!). It
was therefore decided to apply rule induction analysis to cases being
referred to Head Office with the amount of additional underwriting premium
being used as an outcome. Rule induction analysis generated patterns of
low additional premiums of the following format :
If AGE > 30 & AGE < 41 and HEIGHT-WEIGHT = NORMAL Then
PREMIUM LOADING = 1%
The generated patterns for low additional premiums were qualified and
checked for risk by the actuaries at Norwich Union before being added
as additional underwriting rules to the automated underwriting system.
The result was to increase the rate of automated underwriting to 78%.
This case study illustrates how data mining helped Norwich Union in Ireland
develop new ways of processing life proposals and these methods now underpin
a cost effective new business process.
3.3 Case Study 3 : Gas Processing Plant
This project was carried out for an oil company and was based in a remote
US oil field location. The process investigated was a very large gas processing
plant which produces two useful products from the gas from the wells,
natural gas liquids and miscible injectant. NGL is mixed with crude oil
and transported for refining, and MI is used to improve the viscosity
of oil in the fields to improve crude oil recovery.
The aim of the study was to use data mining techniques to analyse historical
process data to find opportunities to increase the production rates, and
hence increase the revenue generated by the process. Approximately 2000
data measurements for the process are captured every minute.
Rule induction data mining was used to discover patterns in the data.
The business goal for data mining was the revenue from the Gas Process
Plant, while the attributes of the analysis fell into two categories:
- disturbances, such as wind speed and ambient temperature, which have
an impact on the way the process is operated and performs, but which
have to be accepted by the operators.
- control set points, these can be altered by the process operators
or automatically by the control systems, and include temperature and
pressure set points, flow ratios, control valve positions etc.
- An important part of the process is where the incoming feed gas is
pre cooled with heat exchangers in two parallel process streams. The
Oil company has always believed that there is an opportunity to improve
process performance by altering the split of flows, however, it was
not sure in which way to split the flow and what the impact will be
on the revenue. Therefore flow split was put forward as an attribute
for data mining . The tree generated by rule induction is shown below.
The induced tree reveals patterns relating the revenue from the process
to the disturbances and control settings of the process. In particular
the impact of the flow_split is revealed with a critical ratio of 1.25
The benefits derived from the generated patterns include the identification
of opportunities to improve process revenue considerably (by up to 4%).
Mostly these involve altering control set points, such as altering the
flow splits. Some of the discovered knowledge can be implemented without
any further work and for no extra cost (eg. Altering flow splits). In
other cases, it is necessary to provide the operators with timely advice
about the best combination of settings for a given circumstance. This
can be achieved cost effectively by delivering the rules generated as
part of an expert system.
The company is in the process of carrying out another rule induction
analysis with product quality as an outcome. The discovered patterns will
allow the plant to be operated at the maximum revenue possible with acceptable
3.4 Case Study 4 - Energy Usage in a Power Station
ICI Thornton Power station, produces steam for a range of processes on
the site, and generates electricity in a mix of primary pass out and secondary
condensing turbines. Total power output is approximately 50 MW (i.e. a
small power station). Fuel and water costs amount to about £5 million
a year (depending on site steam demands).
The objective of the data mining project was to identify opportunities
to reduce power station operating costs. Costs include the cost of fuel
(gas and oil to fire the boilers) and water (to make up for losses). Electricity
and steam are sold and represent a revenue.
Rule induction data mining was used for the project with the outcome
(goal) of the analysis being the net cost of steam per unit of steam supplied
to the site (i.e. the cost of the product). The attributes fall into two
categories; disturbances such as ambient temperature and the site steam
demand over which the operators have no control, and control settings
such as pressure and bled steam rates. A section of the induced tree is
shown below revealing the variation of steam cost with attribute values.
The above tree identifies the main contributors to efficient operations
as manifold pressure (i.e. pressure at primary turbine inlet), steam flow
to the secondary turbines and the total site steam flow.
The benefits derived from the generated patterns include the identification
of opportunities to improve process revenue considerably (by up to 5%).
Mostly these involve altering control set points which can be implemented
without any further work and for no extra cost. Implementing some of the
opportunities identified needed additional controls and instrumentation.
The pay back for the additional controls would be a few months.
4 The Current Issues in Data Mining
With real case studies of organisations deploying data mining as a catalyst
for enhancing / re-engineering their business processes, data mining is
now entering mainstream IT as a mature and tested technology. With this
phase of evolution data mining has moved beyond the debate on algorithms
and into the debate on usability. There are three main issues which should
be considered by any organisation considering the introduction of data
mining; methodology, ease of use and performance / scalability.
For data mining to gain wide acceptance, it is important to have a step
by step methodology for a data mining project. This ensures that the benefits
reported by seasoned data miners are repeatable by other people in various
business sectors. This can help dispel the belief that data mining is
a kind of 'black art' which can only be practised by specialist. Such
a methodology is beginning to emerge and there is certainly wide agreement
on the main steps of such a methodology. These are :
- Problem analysis This step involves analysing a
business problem to assess if it is suitable for tackling using data
mining. If it is then an assessment has to be made of the availability
of data, the data mining technology to used (induction, associations
etc..) and how the results of data mining will be deployed as part of
the overall solution
- Data Preparation This step involves extracting the
data and transforming it to the format required for the data mining
algorithm. This involves, aggregation, table joins, deriving new fields,
data cleansing etc..
- Data Exploration This step preceeds the actual pattern
discovery stage. It involves visualisation driven exploration and its
aim is to give the user a good feel for the data and to reveal any errors
in the data preparation / extraction.
- Pattern Generation This step involves using rule
induction (automatic or interactive) and associations discovery algorithms
to generate patterns. This step also involves validating and interpreting
the discovered patterns
- Pattern Deployment This step involves deploying the
discovered patterns as designed in the problem analysis stage. Patterns
are typically used in decision support systems, to produce reports /
guidelines, or to filter data for further processing.
- Pattern Monitoring A main premise for the deployment
of data mining results is that the future resembles the past and hence
historic patterns can be applied to future situations. This strategy
is safe only if the historic patterns are regularly monitored against
new data to detect shifts in these patterns at the earliest possible
4.2 Ease of Use
Data mining tools are increasingly used by computer literate business
users. This requires these tools to be no more difficult to use than a
spreadsheet program. Furthermore the data mining tool needs to support
all the steps of a data mining methodology. Finally, because of the nature
of data mining, the tool has to support extensive data and patterns reporting
4.3 Performance / Scalability of Client based data mining
With the decreasing costs of data processing and storage comes the data
rich organisation. It is now common place for small and medium sized organisations
to hold gigabytes of data relating to a business process. It is therefore
essential that data mining tools can deliver acceptable performance on
large volumes of data regardless of the computing platform / architecture
being used. The most typical computing architectures for data mining is
the Client based data mining. In this architecture the data to be mined
is downloaded (extracted) from the Data Warehouse or directly from Operational
Databases and stored in a local DataMart on the client PC. All the data
preparation and mining is carried out on the client. Until recently this
approach was limited to mining tens of thousands of records (in acceptable
times of under an hour). Recent advances has made it possible for millions
of records to be mined on a client PC.