Usually logic programs are written as definite clauses that look like:

head_literal :- body_literal_1, body_literal_2, ... .

An example is:

grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

Quantification of variables is an issue that is usually left out.

What the clause above really means is:

(Forall X,Z) (grandparent(X,Z) :- (Exists Y) (parent(X,Y), parent(Y,Z)))

The "clausal form" of this definition has all variables universally quantified, and literals joined together by "or". There is a methodical way to convert any logical sentence to clausal form.

Negating Clauses and Skolemisation

Writing the grandparent definition in clausal form and negating it gives:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
~C: (Exists X,Y,Z) ~grandparent(X,Z) and parent(X,Y) and parent(Y,Z)

You can replace the existentially quantified variables in ~C with Skolem constants. Thus we can take ~C as:

~C: ~grandparent($X,$Z) and parent($X,$Y) and parent($Y,$Z)

where $X, $Y, and $Z are Skolem constants chosen to reflect the existential quantification on the variables.

Redundancy

ILP programs typically look to remove redundancy in two ways:

1. When constructing a clause, some literals in the clause may be unnecessary, and can therefore be removed safely (that is, what the clause can prove remains unaffected by removing the literals). This makes the clause more compact.

2. A clause constructed may make other clauses that are already present unnecessary. These can then be safely removed (that is, what the program can prove remains unaffected if the new clause is added and these other clauses are removed).

We will first look at redundant literals in a clause.

Consider (in clausal form):

C:  (Forall X,Y,Z,...) L_1 or ... or L_(l-1) or L_l or L_(l+1) or ... or L_n
C1: (Forall X,Y,Z,...) L_1 or ... or L_(l-1) or L_(l+1) or ... or L_n

Here X,Y,Z,... are the variables in C and C1, and C1 is C with L_l removed. Remember, for definite clauses, only one of these literals is positive; all others are negated. With some background knowledge B, L_l is redundant in C iff:

B and C is logically equivalent to B and C1

That is:

B and C |= B and C1    and    B and C1 |= B and C

It is straightforward to show that L_l is redundant in C iff:

B and C and ~C1 |= []

Besides useless literals in a clause, you can imagine cases where a clause may be useless in a program (i.e. a set of clauses). To find such a clause, the procedure is similar to that used above for redundant literals. Thus, consider the following sets of clauses:

S:  C_1 and C_2 and ... and C_(l-1) and C_l and C_(l+1) and ... and C_n
S1: C_1 and C_2 and ... and C_(l-1) and C_(l+1) and ... and C_n

Notice first that a set of clauses means a conjunction of clauses (each clause in turn being a disjunction of literals). In the above, S1 consists of all clauses in S except C_l. That is:

S = S1 and C_l

The clause C_l is said to be logically redundant in S iff:

S is logically equivalent to S1

That is:

S1 and C_l |= S1    and    S1 |= S1 and C_l

It is straightforward to show that C_l is redundant in S iff:

S1 and ~C_l |= []
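For ground (propositional) clauses, this redundancy test can be sketched by brute-force model enumeration: C_l is redundant iff S1 and ~C_l has no model. The representation below (clauses as sets of signed atoms) is invented for illustration; real ILP systems use theorem provers rather than truth tables.

```python
from itertools import product

def entails(clauses, goal_clause, atoms):
    """Check clauses |= goal_clause by truth-table search: equivalent to
    testing that (clauses AND NOT goal_clause) is unsatisfiable."""
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        # A clause (set of signed atoms) is true if any literal is true.
        def true_in(clause):
            return any(model[a] if pos else not model[a] for (pos, a) in clause)
        if all(true_in(c) for c in clauses) and not true_in(goal_clause):
            return False  # found a counter-model
    return True

# Clauses as frozensets of (sign, atom) pairs; sign True = positive literal.
# S1 = {p :- q, q}; the clause {p} is redundant in S1 + {p} since S1 |= p.
c1 = frozenset({(True, 'p'), (False, 'q')})   # p or ~q
c2 = frozenset({(True, 'q')})                 # q
c_l = frozenset({(True, 'p')})                # p
S1 = [c1, c2]
print(entails(S1, c_l, ['p', 'q']))  # True: c_l is redundant
```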

You can see from the previous sections that definitions like:

mem(A,[B,A|C]).

make those like:

mem(0,[0,0]). mem(1,[0,1]). mem(0,[1,0]). mem(0,[2,0]). mem(1,[1,1]).

and even:

mem(A,[1,A|B]).

redundant. This is not surprising when you consider the universal quantification on the variables of the clause mem(A,[B,A|C]). In some sense, all the clauses made redundant by this clause are special cases of this clause.

The process of constructing some general hypothesis and then providing a justification for why it should be true can be viewed (controversially) as a form of "induction". Thus:

Induction = Hypothesis Formation + Justification

ILP systems form hypotheses in first-order logic. For this, obtaining hypotheses from specific cases is tied in with logical consequence and logical subsumption. The other side of induction, namely justification, is concerned with selecting the best hypothesis amongst all those that satisfy the logical constraints imposed. For this, ILP draws on ideas from probability or information theory.

Recall that a term is recursively defined as being either:

1. a constant (written in lower case); or
2. a variable (written in upper case); or
3. a function symbol applied to a number of terms as arguments (such as f(a,g(X)) etc.)

Also recall that substitutions are mappings from variables to terms, usually written as sets of variable/term pairs, for example {X/a, Y/f(b)}. In this, X and Y are called the "domain" of the substitution. When a substitution theta is applied to a term t, each variable X in t that is an element of the domain of theta is replaced by the term that X maps to in theta.

Given two terms t1 and t2, t1 is said to subsume t2 iff there exists a substitution theta s.t. t1.theta = t2. Thus, in the following, t1.theta = t2 with theta = {A/h(a), B/i(b)}:

t1 = f(g(A),B)
t2 = f(g(h(a)),i(b))

Note that of the two terms, t1 "looks" more general.
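Term subsumption can be sketched as one-way matching: find a theta under which t1 becomes t2, treating variables in t2 as constants. The Python encoding below (variables as capitalised strings, compound terms as tuples of functor and arguments) is an assumption made purely for illustration.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def apply_subst(t, theta):
    """Apply substitution theta (a dict Var -> term) to term t."""
    if is_var(t):
        return theta.get(t, t)
    if isinstance(t, tuple):  # compound term: (functor, arg1, ..., argn)
        return (t[0],) + tuple(apply_subst(a, theta) for a in t[1:])
    return t  # constant

def match(t1, t2, theta=None):
    """Find theta such that t1.theta == t2 (one-way matching), or None."""
    theta = dict(theta or {})
    if is_var(t1):
        if t1 in theta:
            return theta if theta[t1] == t2 else None
        theta[t1] = t2
        return theta
    if isinstance(t1, tuple) and isinstance(t2, tuple) and \
       t1[0] == t2[0] and len(t1) == len(t2):
        for a1, a2 in zip(t1[1:], t2[1:]):
            theta = match(a1, a2, theta)
            if theta is None:
                return None
        return theta
    return theta if t1 == t2 else None

# t1 = f(g(A),B) subsumes t2 = f(g(h(a)),i(b))
t1 = ('f', ('g', 'A'), 'B')
t2 = ('f', ('g', ('h', 'a')), ('i', 'b'))
theta = match(t1, t2)
print(theta)                         # {'A': ('h', 'a'), 'B': ('i', 'b')}
print(apply_subst(t1, theta) == t2)  # True
```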

Recall that a clause like:

C: grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

is really:

C: (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

Yet another way that this is written is as a set of literals, like so:

C: {grandparent(X,Z), ~parent(X,Y), ~parent(Y,Z)}

Given two clauses C1 and C2, C1 is said to subsume C2 iff there exists a substitution theta s.t. C1.theta is a subset of C2 (where "subset" denotes the improper subset relation, so C1.theta may equal C2).

Thus, for the following pair of clauses:

C1: mem(A,[B|C]) :- mem(A,C).
C2: mem(0,[1,0]) :- nat(0), nat(1), mem(0,[0]).

it is the case that C1 subsumes C2 (take theta = {A/0, B/1, C/[0]}). Note that of the two clauses, C1 "looks" more general.
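Clause subsumption can then be sketched as a backtracking search for one substitution that maps every literal of C1 onto some literal of C2. The encodings below (literals as (sign, atom) pairs, Prolog lists as '.'-cons cells) are illustrative assumptions, not a real ILP system's representation.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(t1, t2, theta):
    """Extend theta so that t1.theta == t2 (one-way matching), or None.
    Variables in t2 are treated as constants."""
    if is_var(t1):
        if t1 in theta:
            return theta if theta[t1] == t2 else None
        new = dict(theta)
        new[t1] = t2
        return new
    if isinstance(t1, tuple) and isinstance(t2, tuple) and \
       t1[0] == t2[0] and len(t1) == len(t2):
        for a1, a2 in zip(t1[1:], t2[1:]):
            theta = match(a1, a2, theta)
            if theta is None:
                return None
        return theta
    return theta if t1 == t2 else None

def subsumes(c1, c2, theta=None):
    """True iff some theta maps every literal of c1 onto a literal of c2,
    backtracking over the choice of target literal."""
    theta = theta or {}
    if not c1:
        return True
    (sign, atom), rest = c1[0], c1[1:]
    for (sign2, atom2) in c2:
        if sign == sign2:
            t = match(atom, atom2, theta)
            if t is not None and subsumes(rest, c2, t):
                return True
    return False

# Lists as cons cells: [1,0] is ('.', 1, ('.', 0, 'nil')).
C1 = [(True,  ('mem', 'A', ('.', 'B', 'C'))),
      (False, ('mem', 'A', 'C'))]
C2 = [(True,  ('mem', 0, ('.', 1, ('.', 0, 'nil')))),
      (False, ('nat', 0)),
      (False, ('nat', 1)),
      (False, ('mem', 0, ('.', 0, 'nil')))]
print(subsumes(C1, C2))  # True
```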

It is the case that if a clause C1 subsumes clause C2, then C1 logically implies C2. That is:

If C1 subsumes C2 then C1 |= C2

However, the reverse does not hold.

Consider the following clauses:

C1: mem(A,[B|C]) :- mem(A,C).
C2: mem(X,[H1,H2|T]) :- mem(X,T).

C1 logically implies C2 (C2 follows by resolving C1 with itself), but C1 does not subsume it.

For ILP systems, the word "generalisation" has a particular logical meaning. A sentence F is said to be more general than G iff:

F |= G

The hypothesis-formation problem in this setting is then to find F, given G. Most ILP systems do this by finding Fs that subsume G.

ILP systems for prediction are given a set of clauses as background B, positive examples E+ and negative examples E-. With these, the task is to find a set of clauses H s.t. the following constraints hold:

B and E- is satisfiable          i.e. B and E- |/= []
B and H and E- is satisfiable    i.e. B and H and E- |/= []
B does not logically imply E+    i.e. B |/= E+
B and H logically implies E+     i.e. B and H |= E+

Since, for any H that satisfies these constraints:

B and H |= B and E+

B and H are said to be "more general" than B and E+. That is, the hypothesis H is said to be a "generalisation" of E+. Typically, there are many (possibly infinitely many) hypotheses that satisfy these constraints.

Consider constructing a set of clauses T within the following language constraints:

P: a set of predicate symbols
C: a set of constant symbols
V: a set of variables
a: the maximum arity of any predicate in P
c: the maximum number of literals allowed in the body of any clause in T

An upper bound on the number of atoms A that can be constructed given P, C, V and a is:

A = |P|.(|C|+|V|)^a

An upper bound on the number of definite clauses that can be chosen with atoms from A, with at most c literals in the body, is:

D = A^(c+1)

An upper bound on the number of clause sets (possible hypotheses) that can be constructed is:

2^D

Note that this is without any function symbols allowed.
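These bounds can be computed directly, and even a tiny language explodes. A minimal sketch (treating P, C, V as sizes rather than as the sets themselves):

```python
def atom_bound(P, C, V, a):
    """Upper bound on distinct atoms: each of P predicates takes up to a
    arguments, each drawn from C constants or V variables."""
    return P * (C + V) ** a

def clause_bound(P, C, V, a, c):
    """Upper bound on definite clauses: one head plus up to c body
    literals, crudely counted as c+1 independent atom choices."""
    return atom_bound(P, C, V, a) ** (c + 1)

def hypothesis_bound(P, C, V, a, c):
    """Upper bound on clause sets: any subset of the possible clauses."""
    return 2 ** clause_bound(P, C, V, a, c)

# 2 predicates, 2 constants, 2 variables, arity 1, at most 1 body literal:
print(atom_bound(2, 2, 2, 1))           # 8 atoms
print(clause_bound(2, 2, 2, 1, 1))      # 64 clauses
print(hypothesis_bound(2, 2, 2, 1, 1))  # 2**64 clause sets
```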

Given some positive and negative examples, not all such clause sets will satisfy the logical constraints that qualify them as a hypothesis for the examples. Rather than search this space of clause sets in an arbitrary fashion, we can choose to examine them in some order. The ordering that almost all ILP systems use is based not on implication, but on the weaker relation of subsumption, which imposes a lattice on clauses that can be part of a hypothesis. To really understand the subsumption lattice, you must first get to grips with the notion of least general generalisations (lggs) of terms, literals and clauses. Given a pair of terms, literals or clauses u, v:

L is the lgg of u and v iff:
1) L is a common generalisation of u and v; and
2) every other common generalisation of u and v is also a generalisation of L.

F is a common generalisation of G and H iff:
1) F is a generalisation of G; and
2) F is a generalisation of H.

F is a generalisation of G iff F subsumes G.

Term Lgg

In an important thesis, Gordon Plotkin provides a recursive method for constructing lggs of two terms u, v (do not get confused by the lower-case letters: u and v here do not have to be constants):

1) If u = v then lgg(u,v) = u; otherwise
2) if u is f(s1,...,sn) and v is f(t1,...,tn) then lgg(u,v) = f(lgg(s1,t1),...,lgg(sn,tn)); otherwise
3) lgg(u,v) is the variable V that represents this pair of terms throughout.

In addition to the rules for term lggs, the lgg of two literals is only defined when the sign and predicate symbol are the same; that is, they are both positive or both negative literals with the same name and arity. Such literals are said to be "compatible literals". With this restriction, the lgg of literals p(u1,...,un) and p(v1,...,vn) is simply p(lgg(u1,v1),...,lgg(un,vn)).

We are now in a position to define the lgg of two clauses. Recall that a clause can be thought of as a set of literals. The lgg of two clauses C1 and C2 is:

lgg(C1,C2) = { l : there are li in C1 and lj in C2 s.t. li, lj are compatible and l = lgg(li,lj) }
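Plotkin's term lgg and the clause lgg above can be sketched as follows. The tuple encoding of terms and the (sign, atom) encoding of literals are assumptions for illustration. The key detail is that the pair-to-variable map is shared across literals, so the same pair of differing subterms always maps to the same variable:

```python
def lgg_term(u, v, pairs):
    """Plotkin's lgg of two terms; `pairs` maps each (u,v) pair of
    differing subterms to one fresh variable, shared throughout."""
    if u == v:
        return u
    if (isinstance(u, tuple) and isinstance(v, tuple)
            and u[0] == v[0] and len(u) == len(v)):
        return (u[0],) + tuple(lgg_term(a, b, pairs)
                               for a, b in zip(u[1:], v[1:]))
    if (u, v) not in pairs:
        pairs[(u, v)] = 'V%d' % len(pairs)
    return pairs[(u, v)]

def lgg_clause(c1, c2):
    """lgg of clauses as literal sets: lggs of all compatible literal
    pairs, with a single pair-to-variable map shared across literals."""
    pairs = {}
    return [(s1, lgg_term(a1, a2, pairs))
            for (s1, a1) in c1
            for (s2, a2) in c2
            if s1 == s2 and a1[0] == a2[0] and len(a1) == len(a2)]

# lgg of  p(a) :- q(a)  and  p(b) :- q(b)  is  p(V0) :- q(V0)
c1 = [(True, ('p', 'a')), (False, ('q', 'a'))]
c2 = [(True, ('p', 'b')), (False, ('q', 'b'))]
print(lgg_clause(c1, c2))  # [(True, ('p', 'V0')), (False, ('q', 'V0'))]
```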

Subsumption Lattice and Search

It can be shown that subsumption imposes a lattice over equivalence classes of clauses. Here, it is sufficient to understand the broad meaning of this statement.

For representative clauses C and D in two equivalence classes, there is first a partial ordering ">=" based on whether one clause subsumes the other. Second, there is a least upper bound for C and D, which is simply the lgg of the two clauses. Third, there is a greatest lower bound for C and D, which is simply their union.

The equivalence class of C contains all clauses that are subsumption-equivalent to C. This is denoted by:

[C] = { C' | C' subsumes C and C subsumes C' }

This lattice forms the basis of a clause-by-clause search in ILP.

A straightforward procedure is:

1. Pick a positive example e

2. Construct the most-specific single-clause hypothesis (called ⊥, pronounced "bottom")

3. Find the best clause C above ⊥ in the subsumption lattice

4. Remove all clauses made redundant by C and repeat the search

Not all programs explicitly perform Step 2.

Given B and some positive example e, the most specific single-clause hypothesis can be derived from the following steps:

We want the most specific clause ⊥ s.t.

B and ⊥ |= e

Using the Deduction Theorem:

B and ~e |= ~⊥

or:

⊥ |= ~(B and ~e)

Thus, as a procedure:

1. Calculate ~e
2. Find the conjunction of literals derivable from B and ~e
3. The most specific single-clause hypothesis is the negation of the conjunction in Step 2.

Since Step 2 produces a conjunction of literals, Step 3, which negates this conjunction, produces a disjunction of literals, which is a clause.

The most-specific clause may be infinitely long. In practice, the best that programs can construct is some finite clause that implies this infinite clause. This is done using constructs like a depth-bounded mode language. The (finite) most-specific clause in this language implies the true ⊥, and forms the basis for the clause-by-clause search.
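For a flavour of Steps 1-3, here is a drastically simplified sketch for the ground case: B is taken to be ground facts plus ground Horn rules, e is a ground atom, and the body of the bottom clause is just everything derivable from B. The mode-directed construction used by real systems such as Progol is considerably more involved; the example data here is hypothetical.

```python
def derivable(facts, rules):
    """Forward-chain ground Horn rules (head, [body...]) to a fixpoint."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

def bottom_clause(facts, rules, e):
    """Most specific clause, ground case: e :- <everything B derives>.
    (Negating the ground atom e contributes the head literal; the
    derivable facts become the body literals.)"""
    return (e, sorted(derivable(facts, rules)))

# Hypothetical background: two parent facts, no rules.
B_facts = ['parent(henry,jane)', 'parent(jane,john)']
B_rules = []
head, body = bottom_clause(B_facts, B_rules, 'grandparent(henry,john)')
print(head, ':-', ', '.join(body))
# grandparent(henry,john) :- parent(henry,jane), parent(jane,john)
```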

There has to be some reason for preferring one hypothesis over another. A probabilistic setting for providing justification for the hypothesis selected is that afforded by Bayes Rule:

Pr(Hyp|Examples) = PriorPr(Hyp).Pr(Examples|Hyp)/PriorPr(Examples)

An information-theoretic version of this follows from taking negative logs of both sides of this equation:

Info(Hyp|Examples) = Info(Hyp) + Info(Examples|Hyp) - Info(Examples)

where Info(X) is the number of bits required to encode X.

This forms the basis of the Minimum Description Length (MDL) principle, which selects hypotheses that minimise Info(Hyp|Examples). This is equivalent to maximisation of the Bayesian posterior probability.
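The equivalence can be checked on toy numbers: the hypothesis with the largest (unnormalised) posterior is also the one with the smallest description length, since Info(Examples) is common to all hypotheses. The priors and likelihoods below are made up purely for illustration.

```python
import math

def info(p):
    """Bits required to encode an event of probability p."""
    return -math.log2(p)

# Two hypothetical hypotheses with assumed priors and likelihoods.
prior      = {'h1': 0.5,  'h2': 0.01}
likelihood = {'h1': 0.01, 'h2': 0.9}

# Posterior up to the common factor Pr(Examples), and description length.
score  = {h: prior[h] * likelihood[h] for h in prior}
length = {h: info(prior[h]) + info(likelihood[h]) for h in prior}

best_bayes = max(score, key=score.get)
best_mdl   = min(length, key=length.get)
print(best_bayes, best_mdl)  # the same hypothesis wins under both criteria
```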

Evaluation is concerned with estimating the accuracy of a hypothesis for a particular target concept and example set.

Accuracy is measured according to some probability distribution D over the examples. It is simply the probability of drawing an example, under D, that the hypothesis correctly classifies (the corresponding error is the probability of drawing one that it misclassifies).

The simplest way of estimating this probability is to get more labelled examples (a "test" set). If m of n new examples are correctly classified by the hypothesis, then an unbiased estimate of its accuracy is m/n. In fact, if the true accuracy is p, then the number of correct classifications is binomially distributed with mean np and variance np(1-p).
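A minimal sketch of this estimate, together with the standard error implied by the binomial variance (using the estimate m/n in place of the unknown p):

```python
import math

def accuracy_estimate(m, n):
    """Unbiased accuracy estimate m/n from a test set, plus the standard
    error sqrt(p(1-p)/n) implied by the binomial variance np(1-p)."""
    p_hat = m / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

p_hat, se = accuracy_estimate(80, 100)
print(p_hat, round(se, 3))  # 0.8 0.04
```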

For many practical problems, getting a new set of labelled examples is not a viable option. This motivates the use of resampling methods. Here, the available data of n examples are partitioned into k sets. A hypothesis is obtained by ignoring, in turn, the data in each of these sets. For each hypothesis, the classification on the set ignored is then obtained. The accuracy of the hypothesis is then the proportion m/n of correct classifications made on these "left-out" datasets. Strictly speaking, the variance expression above does not hold with this technique.
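A sketch of the resampling procedure follows. The round-robin fold split and the stand-in "learner" are assumptions for illustration; the callback would really train on the retained data and classify the held-out fold.

```python
def k_fold_accuracy(examples, k, train_and_test):
    """Leave out each of k folds in turn; train on the rest, count correct
    classifications on the left-out fold; return pooled accuracy m/n."""
    folds = [examples[i::k] for i in range(k)]   # simple round-robin split
    correct = 0
    for i in range(k):
        held_out = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        correct += train_and_test(train, held_out)
    return correct / len(examples)

# Hypothetical stand-in learner: training data is ignored, and we pretend
# the hypothesis classifies an example correctly iff its label is positive.
data = [('e%d' % i, i % 3 != 0) for i in range(12)]
acc = k_fold_accuracy(data, 3, lambda tr, te: sum(1 for (_, y) in te if y))
print(acc)  # 8 of the 12 labels are positive, so 8/12
```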

Comparing the performance of a pair of algorithms involves testing the null hypothesis that there is no difference in their accuracies.

The simplest way of comparing a pair of algorithms is to record their performance on n independently drawn "test" examples. The comparison is then analogous to a coin-tossing experiment in which a "head" occurs if Algorithm 1 does better than Algorithm 2. If there is no difference between the algorithms then we would expect n/2 heads. Since the number of heads with a fair coin is governed by the binomial distribution with probability of success equal to 0.5, we can calculate the probability of obtaining at least the number of heads observed. This can be compared to an acceptable significance level.
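This coin-tossing comparison is a sign test, and the binomial tail probability is easy to compute directly:

```python
from math import comb

def sign_test_p(heads, n):
    """One-sided p-value: probability of at least `heads` successes in n
    fair coin tosses, under the no-difference null hypothesis."""
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

# Algorithm 1 beats Algorithm 2 on 8 of 10 test examples.
p = sign_test_p(8, 10)
print(p)  # 0.0546875 -- not quite significant at the 0.05 level
```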

Care must be taken with repeated testing of pairs of algorithms. With sufficient comparisons, it is possible, simply by chance, to obtain one in which Algorithm 1 looks better than its adversary. This has to be accounted for by statistical corrections to the acceptable significance level.