OUCL's second submission to the PTE Challenge

C4.5 with relational and bulk attributes


Consortium

OUCL-2

Address

Ashwin Srinivasan and Ross D. King
Oxford University Computing Laboratory
Wolfson Building, Parks Road,
Oxford OX2 6UD, UK

  ashwin@comlab.ox.ac.uk   rdk@aber.ac.uk
 


 

Materials

 

Attribute-based representation of compounds

This submission uses an attribute-based representation for all compounds. The attributes fall into two categories: (A) properties extracted from the background knowledge described by Srinivasan and King, 1997; and (B) bulk properties (available from the NTP Structural Descriptors page). The following groups of attributes are used (tabulated here in the order in which they appear in the data files):
 
 
| Name | Category | Attributes | Type | Comment |
|---|---|---|---|---|
| Atom-type counts (Attr: 1-69) | A | 69 | Numerical | Number of atoms of each QUANTA type in a molecule, e.g. the number of c,22 atoms |
| Mutagenicity alert (Attr: 70) | A | 1 | Boolean | Mutagenicity rules used in Srinivasan and King, 1997 |
| WARMR alerts (Attr: 71-285) | A | 215 | Boolean | Association rules discovered by WARMR: see K.U. Leuven's submission |
| Generic group counts (Attr: 286-313) | A | 28 | Numerical | Counts of generic groups in each molecule, e.g. the number of benzenes |
| NTP bulk properties (Attr: 314-376) | B | 63 | Numerical | Bulk chemical properties |
| Ashby feature counts (Attr: 377-404) | A | 28 | Numerical | Counts of ``Ashby alerts'' in each molecule |
| Genetic Toxicology alerts (Attr: 405-416) | A | 12 | Boolean | Genetic toxicology alerts: see Srinivasan and King, 1997 |
| Ames (Attr: 417) | A | 1 | Boolean | Ames test outcome |
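Because the groups occupy contiguous column ranges, an attribute's position in the data files identifies its group. The following is a minimal sketch of that lookup, using only the ranges tabulated above; the names `ATTRIBUTE_GROUPS` and `group_of` are illustrative and do not come from the submission.

```python
# Attribute groups as tabulated above: (name, first column, last column),
# with 1-based, inclusive column positions.
ATTRIBUTE_GROUPS = [
    ("Atom-type counts", 1, 69),
    ("Mutagenicity alert", 70, 70),
    ("WARMR alerts", 71, 285),
    ("Generic group counts", 286, 313),
    ("NTP bulk properties", 314, 376),
    ("Ashby feature counts", 377, 404),
    ("Genetic Toxicology alerts", 405, 416),
    ("Ames", 417, 417),
]


def group_of(index):
    """Return the group name for a 1-based attribute position."""
    for name, first, last in ATTRIBUTE_GROUPS:
        if first <= index <= last:
            return name
    raise ValueError(f"attribute index out of range: {index}")


# The group sizes should add up to the 417 attributes per compound.
total = sum(last - first + 1 for _, first, last in ATTRIBUTE_GROUPS)
print(total)          # 417
print(group_of(300))  # Generic group counts
```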
 
 

Data

The training set consists of 337 compounds from the NTP database that were used in previous experiments for predicting carcinogenesis (see Predicting carcinogenicity). The distribution of these into carcinogens (class ``1'') and non-carcinogens (class ``0'') is shown below.

| Class | Number |
|---|---|
| 1 | 182 |
| 0 | 155 |
 
The test data consists of the 30 compounds in PTE-2. With the representation described earlier, 417 attributes are used to describe each compound. A file containing the description of the training data in a form suitable for C4.5 is available in c45pte.data. The corresponding description of the compounds in PTE-2 is in c45pte.test. The ``names'' file for both datasets is in c45pte.names. In the test set, the class value 2 is used to signify compounds whose classification is unavailable at the time of this submission. The order of the chemicals in the training and test data is given in Order of chemicals in data files.
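As a rough illustration of how such files can be read, the sketch below parses cases in the usual C4.5 convention: one case per line, comma-separated attribute values, the class label last, ``?'' for an unknown value, and ``|'' starting a comment. It is an assumption that c45pte.data and c45pte.test follow this layout exactly. The toy lines have three attributes rather than 417, and `parse_c45_data` is a hypothetical helper name.

```python
from collections import Counter


def parse_c45_data(lines):
    """Parse C4.5-style cases into (attribute_values, class_label) pairs."""
    cases = []
    for raw in lines:
        line = raw.split("|", 1)[0].strip()  # drop any '|' comment
        if line.endswith("."):               # optional terminating period
            line = line[:-1]
        if not line:
            continue
        fields = [field.strip() for field in line.split(",")]
        cases.append((fields[:-1], fields[-1]))
    return cases


# Toy stand-in for the real files (three attributes instead of 417).
toy_lines = [
    "3,t,0.5,1",
    "0,f,1.2,0",
    "1,t,?,2   | class 2: classification not yet available",
]

cases = parse_c45_data(toy_lines)
class_counts = Counter(label for _, label in cases)
unlabelled = [values for values, label in cases if label == "2"]
print(class_counts)
print(unlabelled)
```

Values are kept as strings here; converting the numerical attributes would require the type information in the ``names'' file.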
 

Algorithms and machines

All experiments used the decision-tree learner C4.5 (Release 8) in conjunction with the companion program C4.5rules, which constructs rules (Quinlan, 1993). The information provided by the WARMR alerts was obtained independently using the WARMR program (Dehaspe et al., 1998). The mutagenicity alert consists of rules obtained by the ILP algorithm (Muggleton, 1995), as implemented by P-Progol (available by contacting A. Srinivasan). The decision-tree experiments were conducted on a Sun Ultra 1 equipped with 120 MB of memory.

Method

In broad outline, the final model is constructed in two stages: attribute selection (Stage 1), followed by a choice between the tree-based and the rule-based model (Stage 2). This two-stage approach is motivated by the large number of features and the relatively small number of training cases, a combination that increases the probability of selecting an irrelevant attribute when constructing a tree. To reduce this risk, Stage 1 consists of the following iterative procedure, based on the description in Quinlan, 1993 (pp 100-101):
  1. Let i = 0, and let A(0) be the set of all available attributes
  2. Increment i
  3. Let T(i), R(i) be the tree and rule-set obtained with C4.5 and C4.5rules using data represented by A(i-1)
  4. Let A(i) be the set of attributes that appear in T(i) or R(i)
  5. If A(i) = A(i-1) then return A(i), T(i), R(i); otherwise return to Step 2.
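The steps above can be sketched as a fixed-point loop. In this sketch `learn` is a hypothetical callback standing in for a run of C4.5 and C4.5rules on data restricted to the given attributes; it is assumed to return the tree, the rule-set, and the set of attributes appearing in either. The toy learner exists only to exercise the loop and says nothing about the real data.

```python
def select_attributes(all_attributes, learn):
    """Repeat learning until the set of attributes used stops changing.

    Returns (A(i), T(i), R(i), i) for the first i with A(i) == A(i-1).
    """
    current = frozenset(all_attributes)      # A(0)
    i = 0
    while True:
        i += 1
        tree, rules, used = learn(current)   # T(i), R(i) built from A(i-1)
        used = frozenset(used)               # A(i)
        if used == current:
            return used, tree, rules, i
        current = used


# Toy stand-in for C4.5 + C4.5rules: it always uses the "relevant"
# attributes and half of whatever irrelevant ones are still available.
RELEVANT = {"a", "b"}


def toy_learn(attrs):
    noise = sorted(attrs - RELEVANT)
    used = (attrs & RELEVANT) | set(noise[: len(noise) // 2])
    return "tree", "rules", used


selected, tree, rules, iterations = select_attributes(
    {"a", "b", "n1", "n2", "n3", "n4"}, toy_learn
)
print(sorted(selected), iterations)  # ['a', 'b'] 4
```

The loop must terminate because A(i) is always a subset of A(i-1), so the attribute set can only shrink or stay the same.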
While the iterative procedure above performs a selection of attributes, it does not stipulate whether the tree-based model (the output of C4.5) or the rule-based model (the output of C4.5rules) should be selected. In Stage 2, the error of each of these models is estimated using leave-one-out cross-validation, and the model with the lowest estimated error is selected. In all experiments, only models obtained with the default settings of C4.5 and C4.5rules were considered.
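A minimal sketch of the Stage 2 comparison follows, assuming each candidate is available as a pair of `train` and `predict` functions. These names, the toy cases, and the two toy learners are illustrative only; they are not C4.5 or C4.5rules.

```python
from collections import Counter


def leave_one_out_error(cases, train, predict):
    """Hold out each case in turn, train on the rest, and count mistakes."""
    errors = 0
    for i, (features, label) in enumerate(cases):
        training = cases[:i] + cases[i + 1:]
        model = train(training)
        if predict(model, features) != label:
            errors += 1
    return errors / len(cases)


# Toy candidate 1: always predict the majority class of the training cases.
def train_majority(training):
    return Counter(label for _, label in training).most_common(1)[0][0]


# Toy candidate 2: always predict class "0".
def train_constant(training):
    return "0"


def predict_stored_label(model, features):
    return model


toy_cases = [([1], "1"), ([2], "1"), ([3], "1"), ([4], "1"), ([5], "0"), ([6], "0")]

estimates = {
    "majority": leave_one_out_error(toy_cases, train_majority, predict_stored_label),
    "constant": leave_one_out_error(toy_cases, train_constant, predict_stored_label),
}
best = min(estimates, key=estimates.get)  # candidate with the lowest estimate
print(estimates, best)
```

With 337 training cases, this kind of estimate requires 337 training runs per candidate model.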
 
 

Results

C4.5 model

Applying the method described above resulted in the tree produced by C4.5 being selected as the final model: the tree-based theory yields an estimated error of 31.2%, compared to 33.2% for the corresponding rule-based theory. Four iterations of the attribute-selection procedure were required. The tree obtained is reproduced below. In it, attributes prefixed with "W" are WARMR alerts, those with "AT" are atom-type counts, those with "N" are generic group counts, and those with "S" are NTP bulk properties. A more readable form of this tree is below: