
Data Mining Beyond Algorithms

Dr Akeel Al-Attar
Managing Director - Attar Software Limited
Chief Technical Officer - Knowledge Process Software Plc


1 The Case for Data Mining
Most organisations can currently be labelled 'data rich', since they are collecting ever-increasing volumes of data about their business processes and resources. Typically these data mountains are used to provide endless 'facts and figures', such as 'there are 60 categories of occupation' or '2000 mortgage accounts are in arrears'. Such 'facts and figures' do not represent knowledge; if anything, they lead to 'information overload'. Patterns in the data, however, do represent knowledge, and by that measure most organisations nowadays can be labelled 'knowledge poor'. Our definition of data mining is the process of discovering knowledge from data.

Data mining enables complex business processes to be understood and re-engineered. This can be achieved through the discovery of patterns in data relating to the past behaviour of a business process. Such patterns can be used to improve the performance of a process by exploiting favourable patterns and avoiding problematic patterns.

Examples of business processes where data mining can be useful are customer response to mailings, lapsed insurance policies and energy consumption. In each of these examples, data mining can reveal which factors affect the outcome of the business event or process, and the patterns relating the outcome to these factors. Such patterns increase our understanding of these processes and therefore our ability to predict and affect the outcome.

2 Data Mining Technologies
There is a high degree of confusion among potential users of data mining as to what data mining technologies actually are. This confusion has been compounded by vendors of complementary technologies positioning their tools as data mining tools, so we have vendors of query and reporting tools and OLAP tools claiming that their tools can be used for data mining. Whilst it is true that one can discover useful patterns in data using these tools, there is a question mark over who is doing the discovery: the user or the tool? For example, query and reporting tools will interrogate the data and report on any pattern (query) requested by the user. This is a 'manual' and 'validation driven' method of discovery, in the sense that unless the user suspects a pattern he/she will never find it. A marginally better situation is encountered with OLAP tools, which can be termed 'visualisation driven' since they assist the user in the process of pattern discovery by displaying multi-dimensional data graphically. The class of tools that can genuinely be termed 'data mining tools' are those that support the automatic discovery of patterns in data.

We are going to make one more assertion regarding the difference between data mining and data modelling. Data mining is about discovering understandable patterns (trees, rules or associations) in data. Data modelling is about discovering a model that fits the data, regardless of whether the model is understandable (e.g. a tree or rules) or a black box (e.g. a neural network). Based on this assertion we restrict the main data mining technologies to induction and the discovery of associations.


2.1 Rule Induction
Rule or decision tree induction is the most established and effective data mining technology in use today. It is what can be termed 'goal driven' data mining, in that a business goal is defined and rule induction is used to generate patterns that relate to that business goal. The business goal can be the occurrence of an event, such as 'response to mail shots' or 'mortgage arrears', or the magnitude of an event, such as 'energy use' or 'efficiency'. Rule induction will generate patterns relating the business goal to other data fields (attributes). The resulting patterns are typically generated as a tree with splits on data fields and terminal points (leaves) showing the propensity or magnitude of the business event of interest.
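
As a rough sketch of this idea, the fragment below induces a tree for a binary business goal and prints each leaf as a pattern. It assumes the scikit-learn library and uses invented example data; any tree induction engine would serve equally well.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented case data: two attributes plus a binary business goal.
    data = pd.DataFrame({
        "time_with_bank":     [1, 5, 8, 2, 6, 0, 7, 3],
        "time_in_employment": [2, 10, 4, 1, 7, 0, 12, 2],
        "decision": ["reject", "accept", "accept", "reject",
                     "accept", "reject", "accept", "reject"],
    })

    X = data[["time_with_bank", "time_in_employment"]]
    tree = DecisionTreeClassifier(max_depth=3).fit(X, data["decision"])

    # Each leaf of the printed tree is one discovered pattern.
    print(export_text(tree, feature_names=list(X.columns)))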

As an example of tree induction data mining consider the data table below which represents captured data on the process of loan authorisation. The table captures a number of data items relating to each loan applicant (sex, age, time at address, residence status, occupation, time in employment, time with the bank and monthly house expenses) as well as the decision made by the underwriters (accept or reject).
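
The original table is not reproduced here; an invented illustration with the fields described above might be assembled as follows.

    import pandas as pd

    # Invented rows -- the article's actual table is not reproduced here.
    loans = pd.DataFrame(
        [("M", 34,  5, "owner",  "engineer",  8, 6, 450, "accept"),
         ("F", 28,  2, "tenant", "clerk",     3, 1, 600, "reject"),
         ("M", 45, 12, "owner",  "manager",  15, 9, 520, "accept"),
         ("F", 23,  1, "tenant", "student",   0, 0, 700, "reject")],
        columns=["sex", "age", "time_at_address", "residence_status",
                 "occupation", "time_in_employment", "time_with_bank",
                 "monthly_house_expenses", "decision"])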

The objective of applying rule induction data mining to the above table is to discover patterns relating the decisions made by the loan underwriters to the details of the application. Such patterns can reveal the decision-making process of the underwriters and its consistency, as shown in the tree below, which reveals that the time with the bank is the attribute considered most important, with a critical threshold of 3 years. For applicants that have been with the bank over 3 years, the time in employment is considered the next most important factor, and so on. The tree reveals 5 patterns (leaves), each with an outcome (accept or reject) and a probability (0 - 1). High probability figures represent consistent decision making.


The majority of data miners who use tree induction will most probably use an automatic algorithm, which can generate a tree once the outcome and the attributes are defined. Whilst this is a reasonable first cut for generating patterns from data, the real power of tree induction is gained using the interactive (incremental) tree induction mode. This mode allows the user to impart his/her knowledge of the business process to the induction algorithm. In interactive induction, the algorithm stops at every split in the tree (starting at the root) and displays to the user the list of attributes available for activating a split, with these attributes ranked by the criteria the induction engine uses for selecting attributes (significance, entropy or a combination of both). The user is also presented with the best split of attribute values (a threshold or groups of values) according to the algorithm. The user is then free to select the top-ranking attribute (and value split) according to the algorithm, or to select any other attribute in the ranked list. This allows the user to override the automatic selection of attributes based on the user's background knowledge of the process. For example, the user may feel that the relationship between the outcome and the best attribute is a spurious one, or that the best attribute is one that the user has no control over and should be replaced by one that can be controlled.
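
A minimal sketch of the ranking step follows, assuming entropy-based information gain as the selection criterion; all names and data are illustrative.

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of outcome labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n)
                    for c in Counter(labels).values())

    def information_gain(rows, attr, outcome):
        """Entropy reduction achieved by splitting the rows on attr."""
        base = entropy([r[outcome] for r in rows])
        n = len(rows)
        remainder = 0.0
        for value in {r[attr] for r in rows}:
            subset = [r[outcome] for r in rows if r[attr] == value]
            remainder += len(subset) / n * entropy(subset)
        return base - remainder

    def rank_attributes(rows, attrs, outcome):
        """Rank candidate split attributes, best first -- the list an
        interactive inducer presents to the user at each node."""
        return sorted(attrs, reverse=True,
                      key=lambda a: information_gain(rows, a, outcome))

    rows = [
        {"residence": "owner",  "occupation": "manager", "decision": "accept"},
        {"residence": "tenant", "occupation": "clerk",   "decision": "reject"},
        {"residence": "owner",  "occupation": "clerk",   "decision": "accept"},
        {"residence": "tenant", "occupation": "manager", "decision": "reject"},
    ]
    ranking = rank_attributes(rows, ["residence", "occupation"], "decision")
    print(ranking)        # the user may accept the top attribute...
    chosen = ranking[0]   # ...or override it with domain knowledge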

Interactive induction can also be seen as bridging the gap between OLAP-based manual data segmentation/exploration and algorithm-assisted segmentation.


2.2 The Discovery of Associations
This is the second most common data mining technology and involves the discovery of associations between the various data fields. One popular application of this technology is the discovery of associations between business events or transactions: for example, discovering that 90% of customers that buy product A will also buy product B (basket analysis), or that in 80% of cases where fault 1 is encountered, fault 7 is also encountered. If the sequence of events is important, then another data mining technology, for discovering sequences, can be used.
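
A minimal sketch of confidence-based association discovery over transactions, with invented baskets (only pairwise rules are considered here):

    from itertools import combinations

    # Illustrative transactions (baskets); contents invented.
    baskets = [
        {"A", "B"}, {"A", "B", "C"}, {"A", "B"}, {"A", "C"},
        {"B", "C"}, {"A", "B", "C"}, {"A", "B"}, {"B"},
    ]

    def associations(baskets, min_confidence=0.8):
        """Yield rules X -> Y with confidence = support(X & Y) / support(X)."""
        items = set().union(*baskets)
        for x, y in combinations(sorted(items), 2):
            for antecedent, consequent in ((x, y), (y, x)):
                with_ant = [b for b in baskets if antecedent in b]
                both = [b for b in with_ant if consequent in b]
                if with_ant:
                    conf = len(both) / len(with_ant)
                    if conf >= min_confidence:
                        yield antecedent, consequent, conf

    for ant, cons, conf in associations(baskets):
        print(f"{conf:.0%} of baskets containing {ant} also contain {cons}")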

A second application of associations discovery data mining is the discovery of associations between the fields of case data. Case data is data that can be structured as a flat table of cases; records of mortgage applications are an example. In such data, associations can be found between data fields: for example, that 75% of all applicants that are over 45 and in managerial occupations are also earning over £40,000. Such associations can be used as a way of discovering clusters in the data. Note that this differs from rule induction on case data in that no outcome needs to be defined for the discovery process.
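
The same idea on a flat case table can be sketched as follows, assuming pandas and invented application data; the 75% figure above is simply the confidence of such a rule.

    import pandas as pd

    # Invented mortgage-application case data.
    cases = pd.DataFrame({
        "age":        [50, 47, 52, 30, 48, 55],
        "occupation": ["manager", "manager", "manager",
                       "clerk", "manager", "manager"],
        "income":     [45000, 52000, 38000, 22000, 61000, 47000],
    })

    profile = (cases["age"] > 45) & (cases["occupation"] == "manager")
    earns_over_40k = cases["income"] > 40000

    confidence = (profile & earns_over_40k).sum() / profile.sum()
    print(f"{confidence:.0%} of cases matching the profile earn over 40,000")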

3 Case Studies
A number of case studies are described in the following sections, detailing the background to each case study, the data mining approach used and the benefits gained by the organisations concerned.


3.1 Case Study 1: Mortgage Lending
This case study comes from a UK mortgage lender which had a mortgage portfolio in which 9.8% of all accounts were in arrears (over 3 months in arrears) and 4.1% of all accounts were in severe arrears (over 6 months in arrears). The objective of the data mining project was to discover patterns relating the propensity for arrears to the mortgage application data. Such patterns can be applied at the front end of application processing to reduce the level of arrears, and can result in better management of accounts that do go into arrears.

Rule induction data mining was used, with the outcome of the analysis being the arrears status: healthy, moderate arrears or severe arrears. The attributes of the data mining analysis were the mortgage application data, such as age, income, occupation, term, loan amount, region etc. Two separate data mining analyses were carried out: one to discover the patterns of arrears and a second to discover the patterns of severe arrears. Rule induction generated the following tree for arrears, with splits on the attributes %Adv (% of loan to property value), SecondCh (second charge on property), Owned (is the property already owned), sub/Inc (subscription to income), AppSource (application source) and Region. The tree reveals 10 profiles with a propensity for arrears ranging from 0.02 to 0.49.
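
For illustration, the three-way outcome used for the induction could be derived from the raw arrears figure roughly as follows (the thresholds are taken from the case study description above):

    def arrears_status(months_in_arrears):
        """Outcome label for induction: >6 months severe, >3 moderate."""
        if months_in_arrears > 6:
            return "severe arrears"
        if months_in_arrears > 3:
            return "moderate arrears"
        return "healthy"

    assert arrears_status(1) == "healthy"
    assert arrears_status(4) == "moderate arrears"
    assert arrears_status(9) == "severe arrears"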


The trees discovered from the arrears data were used in three ways:

  • to generate policy rules, which were introduced at the application processing stage to reduce arrears;
  • to formulate a marketing strategy targeting low-risk profiles;
  • to focus arrears management resources on accounts most likely to end up in severe arrears.

3.2 Case Study 2: Life Underwriting
This case study is from Norwich Union in Ireland which, like other life insurers, was facing the challenges of reducing costs, maintaining market share and meeting market demands. In order to meet these challenges, Norwich Union decided to re-engineer its Life Underwriting Process in order to speed up the process and reduce its costs.

The first phase of re-engineering involved the implementation of a rule-based underwriting system, which was used to automate the processing of life proposals at the point of application. The system involved capturing underwriting knowledge, and resulted in 51% of proposals being underwritten automatically, with the remaining cases being referred to head office for manual underwriting.

Whilst the automated underwriting system proved very successful, Norwich Union looked for ways of increasing the percentage of cases that could be processed by the system. Attempts were made to capture more advanced underwriting rules; however, this proved very difficult. Data mining was then considered as an alternative means of generating additional knowledge. The basic premise was that, of the 49% of cases being referred to head office, a significant number were underwritten with no additional premium or a very small one (less than the cost of the manual underwriting!). It was therefore decided to apply rule induction analysis to cases being referred to head office, with the amount of additional underwriting premium being used as the outcome. Rule induction analysis generated patterns of low additional premiums of the following format:

IF AGE > 30 AND AGE < 41 AND HEIGHT-WEIGHT = NORMAL THEN PREMIUM LOADING = 1%
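
Rendered as deployable logic, such a pattern might look like the sketch below (a single illustrative rule; the real system held many, all vetted before use):

    def premium_loading(age, height_weight):
        """One induced low-loading pattern rendered as deployable logic
        (illustrative; the real system held many vetted rules)."""
        if 30 < age < 41 and height_weight == "NORMAL":
            return 0.01      # 1% premium loading: underwrite automatically
        return None          # no pattern fired: refer to a human underwriter

    print(premium_loading(35, "NORMAL"))   # 0.01
    print(premium_loading(55, "NORMAL"))   # None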

The generated patterns for low additional premiums were qualified and checked for risk by the actuaries at Norwich Union before being added as additional underwriting rules to the automated underwriting system. The result was an increase in the rate of automated underwriting to 78%.

This case study illustrates how data mining helped Norwich Union in Ireland develop new ways of processing life proposals, and these methods now underpin a cost-effective new business process.


3.3 Case Study 3: Gas Processing Plant
This project was carried out for an oil company and was based at a remote US oil field. The process investigated was a very large gas processing plant which produces two useful products from the gas from the wells: natural gas liquids (NGL) and miscible injectant (MI). NGL is mixed with crude oil and transported for refining, and MI is used to improve the viscosity of oil in the fields and thereby improve crude oil recovery.

The aim of the study was to use data mining techniques to analyse historical process data to find opportunities to increase the production rates, and hence increase the revenue generated by the process. Approximately 2000 data measurements for the process are captured every minute.

Rule induction data mining was used to discover patterns in the data. The business goal for data mining was the revenue from the Gas Process Plant, while the attributes of the analysis fell into two categories:

  • disturbances, such as wind speed and ambient temperature, which have an impact on the way the process is operated and performs, but which have to be accepted by the operators;
  • control set points, which can be altered by the process operators or automatically by the control systems, and include temperature and pressure set points, flow ratios, control valve positions etc.

An important part of the process is where the incoming feed gas is pre-cooled with heat exchangers in two parallel process streams. The oil company had always believed that there was an opportunity to improve process performance by altering the split of flows; however, it was not sure which way to split the flow or what the impact would be on revenue. The flow split was therefore put forward as an attribute for data mining. The tree generated by rule induction is shown below.
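
The induced tree itself is not reproduced here. As a sketch of the analysis, the fragment below fits a regression tree to invented process snapshots in which revenue responds to a flow-split threshold, mimicking the kind of pattern the study reports; it assumes the scikit-learn and NumPy libraries.

    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    n = 500

    # Invented process snapshots: two disturbances and two set points.
    plant = pd.DataFrame({
        "ambient_temp":      rng.uniform(-30, 20, n),
        "wind_speed":        rng.uniform(0, 25, n),
        "flow_split":        rng.uniform(0.8, 1.6, n),
        "manifold_pressure": rng.uniform(30, 60, n),
    })
    # Revenue responds to a flow-split threshold (mimicking the finding).
    plant["revenue"] = (100.0 + 5.0 * (plant["flow_split"] > 1.25)
                        - 0.1 * plant["ambient_temp"]
                        + rng.normal(0.0, 1.0, n))

    X = plant.drop(columns="revenue")
    tree = DecisionTreeRegressor(max_depth=3).fit(X, plant["revenue"])
    print(export_text(tree, feature_names=list(X.columns)))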

The induced tree reveals patterns relating the revenue from the process to the disturbances and control settings of the process. In particular, the impact of the flow split is revealed, with a critical ratio of 1.25 : 1.

The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 4%). Mostly these involve altering control set points, such as the flow splits. Some of the discovered knowledge can be implemented without any further work and at no extra cost (e.g. altering the flow splits). In other cases, it is necessary to provide the operators with timely advice about the best combination of settings for a given circumstance. This can be achieved cost-effectively by delivering the generated rules as part of an expert system.
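
A toy sketch of such delivery follows: induced patterns replayed as operator guidance. The rules shown are illustrative, not the plant's actual ones.

    def operator_advice(flow_split, ambient_temp):
        """Replay induced patterns as timely operator guidance.
        The rules here are illustrative, not the plant's actual ones."""
        advice = []
        if flow_split <= 1.25:
            advice.append("Raise the flow split above 1.25 : 1 "
                          "for the higher-revenue operating region.")
        if ambient_temp < -20:
            advice.append("Very cold ambient: review pre-cooling set points.")
        return advice or ["Settings match the high-revenue patterns."]

    for tip in operator_advice(flow_split=1.1, ambient_temp=-25):
        print(tip)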

The company is in the process of carrying out another rule induction analysis, with product quality as the outcome. The discovered patterns will allow the plant to be operated at the maximum revenue possible with acceptable product quality.


3.4 Case Study 4: Energy Usage in a Power Station
ICI Thornton Power Station produces steam for a range of processes on the site, and generates electricity in a mix of primary pass-out and secondary condensing turbines. Total power output is approximately 50 MW (i.e. a small power station). Fuel and water costs amount to about £5 million a year (depending on site steam demands).

The objective of the data mining project was to identify opportunities to reduce power station operating costs. Costs include the cost of fuel (gas and oil to fire the boilers) and of water (to make up for losses). Electricity and steam are sold and represent revenue.

Rule induction data mining was used for the project, with the outcome (goal) of the analysis being the net cost of steam per unit of steam supplied to the site (i.e. the cost of the product). The attributes fall into two categories: disturbances, such as ambient temperature and the site steam demand, over which the operators have no control; and control settings, such as pressure and bled steam rates. A section of the induced tree is shown below, revealing the variation of steam cost with attribute values.
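
One plausible reading of that outcome, sketched as a calculation; the exact costing formula is not given in the text, so this reconstruction is an assumption.

    def net_steam_cost(fuel_cost, water_cost, electricity_revenue,
                       steam_supplied):
        """Assumed outcome definition: (fuel + water costs, less
        electricity revenue) per unit of steam supplied to the site.
        Reconstructed from the text, not taken from the source."""
        return (fuel_cost + water_cost - electricity_revenue) / steam_supplied

    # Invented hourly figures, for illustration only.
    print(net_steam_cost(520.0, 40.0, 310.0, 100.0))   # -> 2.5 per unit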


The above tree identifies the main contributors to efficient operation as the manifold pressure (i.e. the pressure at the primary turbine inlet), the steam flow to the secondary turbines and the total site steam flow.

The benefits derived from the generated patterns include the identification of opportunities to improve process revenue considerably (by up to 5%). Mostly these involve altering control set points, which can be implemented without any further work and at no extra cost. Implementing some of the other opportunities identified required additional controls and instrumentation; the payback period for these additional controls would be a few months.

4 The Current Issues in Data Mining
With real case studies of organisations deploying data mining as a catalyst for enhancing or re-engineering their business processes, data mining is now entering mainstream IT as a mature and tested technology. With this phase of evolution, data mining has moved beyond the debate on algorithms and into the debate on usability. There are three main issues which should be weighed by any organisation considering the introduction of data mining: methodology, ease of use, and performance / scalability.


4.1 Methodology
For data mining to gain wide acceptance, it is important to have a step-by-step methodology for a data mining project. This ensures that the benefits reported by seasoned data miners are repeatable by other people in various business sectors. It can also help dispel the belief that data mining is a kind of 'black art' which can only be practised by specialists. Such a methodology is beginning to emerge, and there is certainly wide agreement on its main steps. These are:

  • Problem analysis. This step involves analysing a business problem to assess whether it is suitable for tackling using data mining. If it is, then an assessment has to be made of the availability of data, the data mining technology to be used (induction, associations etc.) and how the results of data mining will be deployed as part of the overall solution.
  • Data Preparation. This step involves extracting the data and transforming it into the format required by the data mining algorithm. This involves aggregation, table joins, deriving new fields, data cleansing etc.
  • Data Exploration. This step precedes the actual pattern discovery stage. It involves visualisation-driven exploration, and its aim is to give the user a good feel for the data and to reveal any errors in the data preparation / extraction.
  • Pattern Generation. This step involves using rule induction (automatic or interactive) and associations discovery algorithms to generate patterns. It also involves validating and interpreting the discovered patterns.
  • Pattern Deployment. This step involves deploying the discovered patterns as designed in the problem analysis stage. Patterns are typically used in decision support systems, to produce reports / guidelines, or to filter data for further processing.
  • Pattern Monitoring. A main premise of deploying data mining results is that the future resembles the past, and hence that historic patterns can be applied to future situations. This strategy is safe only if the historic patterns are regularly monitored against new data to detect shifts in these patterns at the earliest possible time (a minimal check of this kind is sketched after this list).
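
As a minimal sketch of that monitoring step, the fragment below compares the outcome rate a pattern showed when discovered against its rate on new data; all figures are invented.

    def pattern_drift(historic_rate, new_outcomes, tolerance=0.1):
        """Flag a pattern whose outcome rate on new data has shifted by
        more than `tolerance` from the rate seen when it was discovered."""
        if not new_outcomes:
            return False
        new_rate = sum(new_outcomes) / len(new_outcomes)
        return abs(new_rate - historic_rate) > tolerance

    # A leaf discovered with a 0.85 accept rate, checked against new
    # cases (1 = accept, 0 = reject); all figures invented.
    recent = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
    if pattern_drift(0.85, recent):
        print("Pattern has shifted -- re-run induction on recent data.")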


4.2 Ease of Use
Data mining tools are increasingly used by computer-literate business users. This requires these tools to be no more difficult to use than a spreadsheet program. Furthermore, the data mining tool needs to support all the steps of a data mining methodology. Finally, because of the nature of data mining, the tool has to support extensive data and pattern reporting and visualisation.


4.3 Performance / Scalability of Client-based Data Mining
With the decreasing costs of data processing and storage comes the data-rich organisation. It is now commonplace for small and medium-sized organisations to hold gigabytes of data relating to a business process. It is therefore essential that data mining tools can deliver acceptable performance on large volumes of data regardless of the computing platform / architecture being used. The most typical computing architecture for data mining is client-based data mining. In this architecture, the data to be mined is downloaded (extracted) from the data warehouse, or directly from operational databases, and stored in a local data mart on the client PC. All the data preparation and mining is carried out on the client. Until recently this approach was limited to mining tens of thousands of records (in acceptable times of under an hour); recent advances have made it possible for millions of records to be mined on a client PC.