Basic Ideas Relevant to ILP

Ashwin Srinivasan

Oxford University Computing Laboratory
Last modified: September 12, 1999.


Clausal Form 
Clause Negation and Skolemisation 
A Note on Generalisation 

Clausal form

Usually logic programs are written as definite clauses that look like:

        Head:- Body_1, Body_2, ..., Body_n.

An example is:

        grandparent(X,Z):-
                parent(X,Y), parent(Y,Z).

Quantification of variables is usually left implicit in this
notation.  What the clause above really means is:

        (Forall X,Z) (grandparent(X,Z):-
                        (Exists Y) (parent(X,Y), parent(Y,Z)))
Because the body is negated in the disjunctive reading, the
existential quantifier on Y becomes a universal one, giving the
logically equivalent sentence:

        (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

This form  of  the  grandparent  definition  is  called  its
"clausal form", ie all variables are universally quantified,
and literals are joined together by "or".

There is a methodical way to convert any logical sentence
to clausal form.

Negating Clauses and Skolemisation

Negating a clause C turns its universal quantifiers into existential
ones, turns "or" into "and", and flips the sign of each literal:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
~C: (Exists X,Y,Z) ~grandparent(X,Z) and parent(X,Y) and parent(Y,Z)

You can replace the existentially quantified variables in ~C
with Skolem constants. Thus, we can take ~C as:

~C: ~grandparent($X,$Z) and parent($X,$Y) and parent($Y,$Z)

where $X, $Y, and $Z are Skolem constants chosen to  reflect
the existential quantification on the variables.
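As a sketch, the negation-and-Skolemisation step can be written in Python. The encoding here (literals as (sign, predicate, args) tuples, variables as upper-case strings) is an assumption of this sketch, not something from the text:

```python
def negate_and_skolemise(clause):
    """Negate a universally quantified clause and replace the resulting
    existentially quantified variables with Skolem constants ($X, $Y, ...)."""
    skolem = {}
    def sk(term):
        # variables (upper-case strings) map to Skolem constants
        if isinstance(term, str) and term[0].isupper():
            return skolem.setdefault(term, '$' + term)
        return term
    # Negation turns the disjunction of literals into a conjunction
    # of their complements; Skolemisation then grounds the variables.
    return [(not sign, pred, tuple(sk(a) for a in args))
            for (sign, pred, args) in clause]

# C: grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
C = [(True,  'grandparent', ('X', 'Z')),
     (False, 'parent',      ('X', 'Y')),
     (False, 'parent',      ('Y', 'Z'))]
for lit in negate_and_skolemise(C):
    print(lit)
```

Running this prints the three literals of ~C: ~grandparent($X,$Z), parent($X,$Y) and parent($Y,$Z).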


ILP programs typically look to remove redundancy in two ways: by
removing redundant literals from a clause, and by removing
redundant clauses from a program.
We will first look at redundant literals in a clause.
Consider (in clausal form):

C : (Forall X,Y,Z..) L_1 or ... or L_(l-1) or L_l or L_(l+1) or ... or L_n
C1: (Forall X,Y,Z..) L_1 or ... or L_(l-1) or L_(l+1) or ... or L_n

Here X,Y,Z... are the variables in C and C1, and C1 is C with
L_l removed.  Remember, for definite clauses, only one of
these literals will be positive. All others will be
negated.  With some background knowledge B, L_l is redundant
in C iff:

        B and C is logically equivalent to B and C1

That is:

     B and C |= B and C1    and    B and C1 |= B and C
It is straightforward to show that L_l is redundant in C iff:

                   B and C and ~C1 |=  []

Besides useless literals in a clause, you can imagine cases
where a clause may be useless in a program (i.e. a set of
clauses). To find such a clause, the procedure is similar to
that used above for redundant literals. Thus, consider the
following sets of clauses:

S : C_1 and ... and C_(l-1) and C_l and C_(l+1) and ... and C_n
S1: C_1 and ... and C_(l-1) and C_(l+1) and ... and C_n

Notice first that a set of clauses means  a  conjunction  of
clauses   (each  clause  in  turn  being  a  disjunction  of
literals).  In the above, S1 consists of all  clauses  in  S
except C_l.  That is, S = S1 and C_l.

The clause C_l is said to be logically redundant in S iff:

              S is logically equivalent to S1

That is:

         S1 and C_l |= S1   and   S1 |= S1 and C_l
It is straightforward to show that C_l is redundant in S iff:

                     S1 and ~C_l |= []

A Note on Generalisation

You can see from the previous sections that definitions like:

        mem(A,[B,A|C]).

make those like:

        mem(A,[B,A]).

and even:

        mem(1,[2,1]).

redundant. This is not surprising,  when  you  consider  the
universal  quantification  on  the  variables  of the clause
mem(A,[B,A|C]).  In some sense, all the clauses made redundant
by this clause are special cases of this clause.
The process of constructing  some  general  hypothesis
and then providing a justification for why it should be true
can be viewed (controversially) as a form of "induction". Thus:

           Induction = Hypothesis Formation + Justification

ILP  systems form hypotheses in first-order logic. For this,
obtaining hypotheses from specific cases is tied in with two
notions: logical consequence and logical subsumption. The other side
of   induction,  namely  justification,  is  concerned  with
selecting the best hypothesis amongst all those that satisfy
the  logical  constraints imposed. For this, ILP draws on
ideas from probability or information theory.


Recall that a term is recursively defined as being either:

        1. A constant (written in lower case); or
        2. A variable (written in upper case); or
        3. A function symbol applied to a number of terms
           as arguments (such as f(a,g(X)) etc.)

Also recall that  substitutions  are  unique  mappings  from
variables  to  terms  and  are  usually  written  as sets of
variable/term pairs, for example {X/a, Y/f(b)}. In  this,  X
and  Y  are  called the "domain" of the substitution. When a
substitution theta is applied to a term t, each  variable  X
in  t  that is an element of the domain of theta is replaced
by the term that X maps to in theta.
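Applying a substitution can be sketched in Python. The term representation (variables as upper-case strings, compound terms as (functor, arg, ..., arg) tuples) is an assumption of this sketch:

```python
def apply_subst(term, theta):
    """Replace each variable in `term` that is in the domain of
    theta by the term it maps to."""
    if isinstance(term, tuple):                      # compound term
        return (term[0],) + tuple(apply_subst(a, theta) for a in term[1:])
    if isinstance(term, str) and term[0].isupper():  # variable
        return theta.get(term, term)                 # unmapped vars unchanged
    return term                                      # constant

# Applying {X/a, Y/f(b)} to g(X,Y,Z) gives g(a,f(b),Z)
theta = {'X': 'a', 'Y': ('f', 'b')}
print(apply_subst(('g', 'X', 'Y', 'Z'), theta))   # ('g', 'a', ('f', 'b'), 'Z')
```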

Term Subsumption

Given two terms t1 and t2, t1 is  said  to  subsume  t2  iff
there exists a substitution theta s.t. t1.theta = t2.
Thus, in the following, t1.theta = t2 with theta = {A/h(a), B/i(b)}:

                       t1 = f(g(A),B)
                       t2 = f(g(h(a)),i(b))

Note that of the two terms, t1 "looks" more general.
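Term subsumption amounts to one-way matching: find a substitution for t1's variables only. A minimal sketch, using the same assumed tuple encoding as above:

```python
def subsumes_term(general, specific):
    """Return a substitution theta with general.theta == specific,
    or None if no such substitution exists."""
    theta = {}
    def walk(g, s):
        if isinstance(g, str) and g[0].isupper():      # variable in `general`
            if g in theta:
                return theta[g] == s                   # must bind consistently
            theta[g] = s
            return True
        if isinstance(g, tuple) and isinstance(s, tuple) and \
                g[0] == s[0] and len(g) == len(s):     # same functor and arity
            return all(walk(ga, sa) for ga, sa in zip(g[1:], s[1:]))
        return g == s                                  # constants
    return theta if walk(general, specific) else None

t1 = ('f', ('g', 'A'), 'B')                   # f(g(A),B)
t2 = ('f', ('g', ('h', 'a')), ('i', 'b'))     # f(g(h(a)),i(b))
print(subsumes_term(t1, t2))   # {'A': ('h', 'a'), 'B': ('i', 'b')}
```

Note that only the variables of the more general term are bound; a variable may not be bound to two different terms.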

Clause Subsumption

Recall that a clause like:

C: grandparent(X,Z):-
        parent(X,Y), parent(Y,Z).

is really:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

Yet another way that this is written is as a set of literals
like so:

C:  {grandparent(X,Z), ~parent(X,Y), ~parent(Y,Z)}

Given two clauses C1 and C2, C1 is said to  subsume  C2  iff
there exists a substitution theta s.t. C1.theta is a subset
of C2 (subset here being the improper subset relation, i.e.
subset-or-equal).
Thus for the following pair of clauses:

C1:  mem(A,[B|C]):- mem(A,C).
C2:  mem(0,[1,0]):- nat(0), nat(1), mem(0,[0]).

It is the case that C1 subsumes C2.

Note that of the two clauses C1 "looks" more general.
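Clause subsumption can be sketched as a backtracking search that tries to map each literal of C1 onto some literal of C2 under one consistent substitution. The encoding (literals as (sign, predicate, args), lists as cons terms ('.', Head, Tail)) is an assumption of this sketch:

```python
def subsumes(c1, c2):
    """True iff some substitution theta makes c1.theta a subset of c2."""
    def match(g, s, theta):              # one-way matching of terms
        if theta is None:
            return None
        if isinstance(g, str) and g[0].isupper():        # variable in c1
            if g in theta and theta[g] != s:
                return None
            return {**theta, g: s}
        if isinstance(g, tuple) and isinstance(s, tuple) and \
                g[0] == s[0] and len(g) == len(s):
            for ga, sa in zip(g[1:], s[1:]):
                theta = match(ga, sa, theta)
            return theta
        return theta if g == s else None

    def search(lits, theta):             # map each c1 literal into c2
        if not lits:
            return True
        sign, pred, args = lits[0]
        for s2, p2, a2 in c2:            # try every literal of c2, backtrack
            if (sign, pred, len(args)) == (s2, p2, len(a2)):
                t = theta
                for ga, sa in zip(args, a2):
                    t = match(ga, sa, t)
                if t is not None and search(lits[1:], t):
                    return True
        return False

    return search(list(c1), {})

# C1: mem(A,[B|C]):- mem(A,C).   C2: mem(0,[1,0]):- nat(0),nat(1),mem(0,[0]).
C1 = [(True, 'mem', ('A', ('.', 'B', 'C'))), (False, 'mem', ('A', 'C'))]
C2 = [(True, 'mem', ('0', ('.', '1', ('.', '0', 'nil')))),
      (False, 'nat', ('0',)), (False, 'nat', ('1',)),
      (False, 'mem', ('0', ('.', '0', 'nil')))]
print(subsumes(C1, C2))   # True
```

Here theta = {A/0, B/1, C/[0]} maps both literals of C1 onto literals of C2.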

Relation to Implication

It is the case that if a clause C1 subsumes clause C2,  then
C1 logically implies C2. That is:

            If C1 subsumes C2   then   C1 |=  C2

However the reverse does not hold.
Consider the following clauses:

C1:  mem(A,[B|C]):- mem(A,C).
C2:  mem(X,[H1,H2|T]):- mem(X,T).

C1 logically implies C2 (resolve C1 with itself), but does not subsume it.

Generalisation in ILP

For ILP systems, the word "generalisation" has a  particular
logical  meaning.  A sentence F is said to be more general
than G iff:

                           F |= G

The hypothesis formation problem in this setting is then to find F,
given  G. Most ILP systems do this by finding Fs that subsume G.
ILP systems for prediction are given a set of clauses as background
B, positive examples E+ and negative examples E-. With these the
abductive step is to find a set of clauses H, s.t. the following
constraints hold:

        B and E- is satisfiable  i.e.  B and E- |/= []
        B and H and E- is satisfiable  i.e.  B and H and E- |/= []
        B does not logically imply E+  i.e.  B |/= E+
        B and H logically implies E+   i.e.  B and H |= E+

Since, for any H that satisfies these constraints:

        B and H |= B and E+

B and H are said to be "more general" than B  and  E+.  That
is, the hypothesis H is said to be a "generalisation" of E+.

Typically, there are  many  (possibly  infinite)  hypotheses
that would satisfy these constraints.

Hypothesis spaces

Consider constructing a set of clauses T within the  follow-
ing language constraints:

        P: a set of predicate symbols
        C: a set of constant symbols
        V: a set of variables
        a: the maximum arity of any predicate in P
        c: the maximum number of literals allowed in the body
           of any clause in T
An upper bound on the number of atoms A that can  be
constructed given P,C,V and a is: B = P.(C+V)^a
(reading P, C and V here as the sizes of those sets).

An upper bound on the number of definite  clauses  C
that  can  be  chosen  with  atoms  from  A,  with at most c
literals in the body is: S = B^(c+1)

An upper bound on the number of clause sets (possible hypotheses)
that can be constructed is: 2^S

Note that this is without any function symbols allowed.
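The bounds above are easy to compute for a toy language. The sizes below are arbitrary choices for illustration only:

```python
import math

# Assumed toy language: 5 predicates, 4 constants, 3 variables,
# max arity 2, at most 3 body literals.
P, C, V, a, c = 5, 4, 3, 2, 3

B = P * (C + V) ** a          # atoms:            B = P.(C+V)^a
S = B ** (c + 1)              # definite clauses: S = B^(c+1)
digits = math.floor(S * math.log10(2)) + 1   # decimal digits of 2^S

print(B, S, digits)           # B = 245, S = 3603000625
```

Even for this tiny language, the number of clause sets 2^S is a number with over a billion decimal digits, which is why the space is searched via an ordering rather than enumerated.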

Given some positive and  negative  examples,  not  all  such
clause  sets will satisfy the logical constraints to qualify
them as a hypothesis for the examples.

Rather than search this space of clause sets in an arbitrary
fashion, we can choose to look at them in some order.  The ordering
that almost all ILP systems use is based not on implication,
but  on  the weaker relation of subsumption, which imposes a
lattice on clauses that can be part of a hypothesis.

To really understand the subsumption lattice, you must first
get to grips with the notion of least general generalisations
(lggs) of terms, literals and clauses.

Given a pair of terms, literals or clauses u,v:

L is lgg of u and v iff

        1) L is a common generalisation of u and v; and
        2) Every other common generalisation of u and v is
           also a generalisation of L.

F is a common generalisation of G and H iff
        1) F is a generalisation of  G; and
        2) F is a generalisation of H.

F is a generalisation of G iff F subsumes G

Term Lgg

In an important thesis, Gordon Plotkin provides a  recursive
method  for  constructing  lggs of two terms u,v (do not get
confused by the lower case letters: u and v here do not have
to be constants):

        1) If u=v then lgg(u,v) = u; otherwise
        2) If u is f(s1,...,sn) and v is f(t1,...,tn) then
                lgg(u,v) = f(lgg(s1,t1),...,lgg(sn,tn)); otherwise
        3) lgg(u,v) is the variable V that represents this pair
           of terms throughout (the same pair is always replaced
           by the same variable).
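Plotkin's three rules translate almost directly into Python (the tuple encoding of terms is an assumed representation):

```python
def lgg(u, v, inverse=None):
    """Plotkin's lgg of two terms. `inverse` remembers the variable
    assigned to each distinct pair of subterms, so the same pair is
    generalised by the same variable throughout."""
    if inverse is None:
        inverse = {}
    if u == v:                                           # rule 1
        return u
    if isinstance(u, tuple) and isinstance(v, tuple) and \
            u[0] == v[0] and len(u) == len(v):           # rule 2
        return (u[0],) + tuple(lgg(a, b, inverse)
                               for a, b in zip(u[1:], v[1:]))
    if (u, v) not in inverse:                            # rule 3
        inverse[(u, v)] = 'V%d' % len(inverse)
    return inverse[(u, v)]

# lgg of f(g(a),a) and f(g(b),b): both occurrences of the pair (a,b)
# map to the SAME variable.
print(lgg(('f', ('g', 'a'), 'a'), ('f', ('g', 'b'), 'b')))
```

The result is f(g(V0),V0), not f(g(V0),V1): keeping the `inverse` table is what makes the generalisation *least* general.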

Literal Lgg

In addition to the rules for  term  lggs,  the  lgg  of  two
literals  is only defined when the sign and predicate symbol
are the same. That is, they are both  positive  or  negative
literals with the same name and arity.  Such literals are
said to be "compatible literals".  With this restriction, the
lgg of literals p(u1,...,un) and p(v1,...,vn) is simply
p(lgg(u1,v1),...,lgg(un,vn)).

Clause Lgg

We are now in a position to define the Lgg of  two  clauses.
Recall  a  clause  can be thought of as a set of literals.
The lgg of two clauses C1 and C2 is:

lgg(C1,C2) = {l : li in C1 and lj in C2 s.t. li, lj are compatible
              and l = lgg(li,lj)}

where a single table of variables is shared across all the literal
lggs, so that the same pair of terms is always generalised by the
same variable.
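The clause lgg can be sketched by pairing compatible literals and sharing one variable table across the whole clause (the tuple encoding and cons representation of lists are assumptions of this sketch):

```python
def lgg_term(u, v, inv):
    # Plotkin's term lgg, with `inv` shared across the whole clause
    if u == v:
        return u
    if isinstance(u, tuple) and isinstance(v, tuple) and \
            u[0] == v[0] and len(u) == len(v):
        return (u[0],) + tuple(lgg_term(a, b, inv)
                               for a, b in zip(u[1:], v[1:]))
    return inv.setdefault((u, v), 'V%d' % len(inv))

def lgg_clause(c1, c2):
    """Lggs of all compatible literal pairs, with ONE variable
    table shared across the clause."""
    inv, out = {}, []
    for (s1, p1, a1) in c1:
        for (s2, p2, a2) in c2:
            if s1 == s2 and p1 == p2 and len(a1) == len(a2):  # compatible
                out.append((s1, p1, tuple(lgg_term(x, y, inv)
                                          for x, y in zip(a1, a2))))
    return out

# lgg of mem(a,[b,a]) and mem(c,[d,c])
C1 = [(True, 'mem', ('a', ('.', 'b', ('.', 'a', 'nil'))))]
C2 = [(True, 'mem', ('c', ('.', 'd', ('.', 'c', 'nil'))))]
print(lgg_clause(C1, C2))
```

The result corresponds to mem(A,[B,A]): the pair (a,c) is generalised by the same variable in both positions, tying the head argument to the list element.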

Subsumption Lattice and Search

It can be  shown  that  subsumption  imposes a  lattice over
equivalence classes of clauses. Here, it is sufficient to
understand the broad meaning of this statement.
First, for representative clauses C and D in two equivalence classes
there is a partial ordering ">=" based on whether one clause
subsumes  the  other. Second, there is a least upper bound
for C and D, which is simply the lgg  of  the  two  clauses.
Third,  there is a greatest lower bound for C and D, which
is simply their union.
The equivalence class of C contains all clauses that are
subsumption-equivalent to C. This is denoted by:

          [C] = {C' | C' subsumes C and C subsumes C'}
This lattice forms the basis of a clause-by-clause search in ILP.
A straightforward procedure is:

        1. Select a positive example not yet covered.
        2. Construct a most-specific clause for that example.
        3. Search the lattice of clauses subsuming the clause in
           Step 2 for a "good" clause, and add it to the hypothesis.
        4. Remove the examples covered and repeat until no positive
           examples remain.

Not all programs explicitly perform Step 2.

Most-Specific Single Clause Hypothesis

Given B and some positive example e, the most specific
single clause hypothesis can be derived from the following
reasoning.  We want the most specific clause _|_ s.t.

        B and _|_ |= e

Using the Deduction Theorem, this is equivalent to:

        B and ~e |= ~_|_

Thus, as a procedure:
1. Calculate ~e
2. Find the conjunction of literals derivable from B and ~e
3. The most specific single clause hypothesis is the negation
   of the conjunction in Step 2.

Since Step 2  produces a conjunction of  literals,  Step  3,
which  negates  this  conjunction, produces a disjunction of
literals, which is a clause.
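The three steps can be illustrated on a toy case where B contains only ground facts, so the literals derivable in Step 2 are just those facts together with ~e (the names ann, bob, cal are hypothetical, chosen only for illustration; in general Step 2 requires deduction from B and ~e):

```python
# B: parent(ann,bob). parent(bob,cal).   e: grandparent(ann,cal).
B = [('parent', ('ann', 'bob')), ('parent', ('bob', 'cal'))]
e = ('grandparent', ('ann', 'cal'))

# Step 1: negate e.  Step 2: the literals derivable from B and ~e are
# here just B's facts (plus ~e itself).  Step 3: negating the
# conjunction gives the clause  e :- <facts of B>.
bottom = [(True, e[0], e[1])] + [(False, p, args) for (p, args) in B]
print(bottom)
```

The printed clause corresponds to grandparent(ann,cal):- parent(ann,bob), parent(bob,cal), the most specific single clause explaining e given this B.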
The most-specific clause may be infinitely long. In practice
the best that programs can construct is some finite clause
that implies this infinite clause. This is done
using constructs like a depth-bounded mode language. The
(finite) most-specific clause in this language implies the true
_|_ and forms the basis for the clause-by-clause search.


There has to be some reason for preferring one hypothesis over
another. A probabilistic setting for providing justification for
the hypothesis selected is that afforded by Bayes Rule:

    Pr(Hyp|Examples) = PriorPr(Hyp).Pr(Examples|Hyp)/PriorPr(Examples)

An information-theoretic version of this follows from taking negative
logs of both sides of this equation:

    Info(Hyp|Examples) = Info(Hyp) + Info(Examples|Hyp) - Info(Examples)

where Info(X) is the number of bits required to encode X.
This forms the basis of the Minimum Description Length (MDL) principle,
which selects hypotheses that minimise Info(Hyp|Examples). This is
equivalent to maximising the Bayesian posterior probability.
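The equivalence is easy to check numerically. The priors and likelihoods below are made-up numbers for two hypothetical hypotheses:

```python
import math

# (prior, P(examples|hypothesis)) -- assumed toy numbers
hyps = {'h1': (0.01, 0.5), 'h2': (0.10, 0.02)}

def info(p):
    """Bits needed to encode an event of probability p
    under an optimal code."""
    return -math.log2(p)

# MDL: minimise Info(Hyp) + Info(Examples|Hyp)
best_mdl = min(hyps, key=lambda h: info(hyps[h][0]) + info(hyps[h][1]))
# Bayes: maximise PriorPr(Hyp).Pr(Examples|Hyp)
best_post = max(hyps, key=lambda h: hyps[h][0] * hyps[h][1])

print(best_mdl, best_post)   # the same hypothesis both ways
```

Info(Examples) is the same for every hypothesis, so it can be dropped from the minimisation, which is why the two selections always agree.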


Evaluation is concerned with estimating the accuracy of a hypothesis
for a particular target concept and example set.
Accuracy is measured according to some probability distribution D over
the examples. It is simply the probability of drawing an example, under D,
that the hypothesis correctly classifies.
The simplest way of estimating this probability is to get more labelled
examples (a ``test'' set). If m of n new examples are correctly classified
by the hypothesis, then an unbiased estimate of its accuracy is m/n. In fact
if the true accuracy is p, then the number of correct classifications is 
binomially distributed with mean np and variance np(1-p).
For many practical problems, getting a new set of labelled examples is
not a viable option. This motivates the use of resampling methods. Here
the available data of n examples are partitioned into k sets. A hypothesis
is obtained by ignoring, in turn, the data in each of these sets. For each
hypothesis the classification on the set ignored is then obtained. The accuracy
of the hypothesis is then the proportion m/n of correct classifications made
on these ``left-out'' datasets. Strictly speaking the variance expression
above does not hold with this technique.
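The resampling loop can be sketched as follows. The data and the "learner" are toy assumptions (the hypothesis here simply predicts the majority label of its training folds), chosen only to show the mechanics of leaving each partition out in turn:

```python
# 10 labelled examples; label is whether x >= 5
data = [(x, x >= 5) for x in range(10)]
k = 5
folds = [data[i::k] for i in range(k)]   # partition into k sets

correct = 0
for i, test_fold in enumerate(folds):
    # train on everything EXCEPT fold i
    train = [ex for j, f in enumerate(folds) if j != i for ex in f]
    # toy "hypothesis": predict the majority label of the training data
    majority = sum(lab for _, lab in train) * 2 >= len(train)
    # score the hypothesis on the left-out fold
    correct += sum(1 for _, lab in test_fold if lab == majority)

print(correct / len(data))   # 0.5 -- the leave-out accuracy estimate
```

The final estimate m/n pools the classifications made on all the left-out folds, exactly as described above.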


Comparing the performance of a pair of algorithms involves testing
the null hypothesis that there is no difference in their accuracies.
The simplest way of comparing a pair of algorithms is to record their
performance on n independently drawn ``test'' examples. The comparison is then
analogous to a coin-tossing experiment where a ``head'' occurs if Algorithm 1
does better than Algorithm 2. If there is no difference in the algorithms then
we would expect n/2 heads. Since the number of heads with a fair coin is
governed by the binomial distribution with probability of success equal to 0.5,
we can calculate the probability of obtaining at least the number of
heads observed. This can be compared to an acceptable significance level.
Care must be taken with repeated testing of pairs of algorithms. With
sufficient comparisons, it is possible, simply by chance, to obtain one
where Algorithm 1 looks better than its adversary. This has to be
accounted for by statistical corrections on the acceptable significance
level.
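The coin-tossing comparison is a sign test, and the tail probability is easy to compute. The counts below (Algorithm 1 better on 8 of 10 test examples) are assumed for illustration:

```python
import math

n, heads = 10, 8   # assumed: Algorithm 1 "wins" on 8 of 10 examples

# Under the null hypothesis, heads ~ Binomial(n, 0.5).
# P(at least `heads` heads) = sum of the upper tail.
p_value = sum(math.comb(n, k) for k in range(heads, n + 1)) / 2 ** n
print(p_value)   # 0.0546875 -- not significant at the 0.05 level
```

With 8 wins out of 10 the one-sided p-value is just above 0.05, so the null hypothesis of no difference would (narrowly) not be rejected; and if many such pairwise comparisons are made, the 0.05 threshold itself must be corrected.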