Usually logic programs are written as definite clauses that look like:

head_literal :- body_literal_1, body_literal_2, ... .

An example is:

grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

Quantification of variables is an issue that is usually left out.

What the clause above really means is:

(Forall X,Z) (grandparent(X,Z) :- (Exists Y) (parent(X,Y), parent(Y,Z)))

The "clausal form" of this definition has all variables universally quantified, and literals joined together by "or". There is a methodical way to convert any logical sentence to clausal form.

Negating Clauses and Skolemisation

Writing the grandparent definition in clausal form and negating it gives:

C:  (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)
~C: (Exists X,Y,Z) ~grandparent(X,Z) and parent(X,Y) and parent(Y,Z)

You can replace the existentially quantified variables in ~C with Skolem constants. Thus we can take ~C as:

~C: ~grandparent($X,$Z) and parent($X,$Y) and parent($Y,$Z)

where $X, $Y, and $Z are Skolem constants chosen to reflect the existential quantification on the variables.

Redundancy

ILP programs typically look to remove redundancy in two ways:

1. When constructing a clause, some literals in the clause may be unnecessary, and can therefore be removed safely (that is, what the clause can prove remains unaffected by removing the literals). This makes the clause more compact.

2. A clause constructed may make other clauses that are already present unnecessary. These can then be safely removed (that is, what the program can prove remains unaffected if the new clause is added and these other clauses are removed).

We will first look at redundant literals in a clause.

Consider (in clausal form):

C:  (Forall X,Y,Z,...) L_1 or ... or L_(l-1) or L_l or L_(l+1) or ... or L_n
C1: (Forall X,Y,Z,...) L_1 or ... or L_(l-1) or L_(l+1) or ... or L_n

Here X,Y,Z,... are the variables in C and C1, and C1 is C with L_l removed. Remember, for definite clauses, only one of these literals is positive; all others are negated. With some background knowledge B, L_l is redundant in C iff:

B and C is logically equivalent to B and C1

That is:

B and C |= B and C1    and    B and C1 |= B and C

It is straightforward to show that L_l is redundant in C iff:

B and C and ~C1 |= []

Besides useless literals in a clause, you can imagine cases where a clause may be useless in a program (i.e. a set of clauses). To find such a clause, the procedure is similar to that used above for redundant literals. Thus, consider the following sets of clauses:

S:  C_1 and C_2 and ... and C_(l-1) and C_l and C_(l+1) and ... and C_n
S1: C_1 and C_2 and ... and C_(l-1) and C_(l+1) and ... and C_n

Notice first that a set of clauses means a conjunction of clauses (each clause in turn being a disjunction of literals). In the above, S1 consists of all clauses in S except C_l. That is:

S = S1 and C_l

The clause C_l is said to be logically redundant in S iff:

S is logically equivalent to S1

That is:

S1 and C_l |= S1    and    S1 |= S1 and C_l

It is straightforward to show that C_l is redundant in S iff:

S1 and ~C_l |= []
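For ground (propositional) clauses, this redundancy test can be sketched by brute-force model enumeration: C_l is redundant iff S1 and ~C_l has no model. The representation below (clauses as sets of signed atoms) is invented for illustration; real ILP systems use theorem provers rather than truth tables.

```python
from itertools import product

def entails(clauses, goal_clause, atoms):
    """Check clauses |= goal_clause by truth-table search: equivalent to
    testing that (clauses AND NOT goal_clause) is unsatisfiable."""
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        # A clause (set of signed atoms) is true if any literal is true.
        def true_in(clause):
            return any(model[a] if pos else not model[a] for (pos, a) in clause)
        if all(true_in(c) for c in clauses) and not true_in(goal_clause):
            return False  # found a counter-model
    return True

# Clauses as frozensets of (sign, atom) pairs; sign True = positive literal.
# S1 = {p :- q, q}; the clause {p} is redundant in S1 + {p} since S1 |= p.
c1 = frozenset({(True, 'p'), (False, 'q')})   # p or ~q
c2 = frozenset({(True, 'q')})                 # q
c_l = frozenset({(True, 'p')})                # p
S1 = [c1, c2]
print(entails(S1, c_l, ['p', 'q']))  # True: c_l is redundant
```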

You can see from the previous sections that definitions like:

mem(A,[B,A|C]).

make those like:

mem(0,[0,0]). mem(1,[0,1]). mem(0,[1,0]). mem(0,[2,0]). mem(1,[1,1]).

and even:

mem(A,[1,A|B]).

redundant. This is not surprising when you consider the universal quantification on the variables of the clause mem(A,[B,A|C]). In some sense, all the clauses made redundant by this clause are special cases of this clause.

The process of constructing some general hypothesis and then providing a justification for why it should be true can be viewed (controversially) as a form of "induction". Thus:

Induction = Hypothesis Formation + Justification

ILP systems form hypotheses in first-order logic. For this, obtaining hypotheses from specific cases is tied in with logical consequence and logical subsumption. The other side of induction, namely justification, is concerned with selecting the best hypothesis amongst all those that satisfy the logical constraints imposed. For this, ILP draws on ideas from probability or information theory.

Recall that a term is recursively defined as being either:

1. a constant (written in lower case); or
2. a variable (written in upper case); or
3. a function symbol applied to a number of terms as arguments (such as f(a,g(X)) etc.)

Also recall that substitutions are mappings from variables to terms, usually written as sets of variable/term pairs, for example {X/a, Y/f(b)}. In this, X and Y are called the "domain" of the substitution. When a substitution theta is applied to a term t, each variable X in t that is an element of the domain of theta is replaced by the term that X maps to in theta.

Given two terms t1 and t2, t1 is said to subsume t2 iff there exists a substitution theta s.t. t1.theta = t2. Thus, in the following, t1.theta = t2 with theta = {A/h(a), B/i(b)}:

t1 = f(g(A),B)
t2 = f(g(h(a)),i(b))

Note that of the two terms, t1 "looks" more general.
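Term subsumption can be sketched as one-way matching: find a theta under which t1 becomes t2, treating variables in t2 as constants. The Python encoding below (variables as capitalised strings, compound terms as tuples of functor and arguments) is an assumption made purely for illustration.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def apply_subst(t, theta):
    """Apply substitution theta (a dict Var -> term) to term t."""
    if is_var(t):
        return theta.get(t, t)
    if isinstance(t, tuple):  # compound term: (functor, arg1, ..., argn)
        return (t[0],) + tuple(apply_subst(a, theta) for a in t[1:])
    return t  # constant

def match(t1, t2, theta=None):
    """Find theta such that t1.theta == t2 (one-way matching), or None."""
    theta = dict(theta or {})
    if is_var(t1):
        if t1 in theta:
            return theta if theta[t1] == t2 else None
        theta[t1] = t2
        return theta
    if isinstance(t1, tuple) and isinstance(t2, tuple) and \
       t1[0] == t2[0] and len(t1) == len(t2):
        for a1, a2 in zip(t1[1:], t2[1:]):
            theta = match(a1, a2, theta)
            if theta is None:
                return None
        return theta
    return theta if t1 == t2 else None

# t1 = f(g(A),B) subsumes t2 = f(g(h(a)),i(b))
t1 = ('f', ('g', 'A'), 'B')
t2 = ('f', ('g', ('h', 'a')), ('i', 'b'))
theta = match(t1, t2)
print(theta)                         # {'A': ('h', 'a'), 'B': ('i', 'b')}
print(apply_subst(t1, theta) == t2)  # True
```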

Recall that a clause like:

C: grandparent(X,Z) :- parent(X,Y), parent(Y,Z).

is really:

C: (Forall X,Y,Z) grandparent(X,Z) or ~parent(X,Y) or ~parent(Y,Z)

Yet another way that this is written is as a set of literals, like so:

C: {grandparent(X,Z), ~parent(X,Y), ~parent(Y,Z)}

Given two clauses C1 and C2, C1 is said to subsume C2 iff there exists a substitution theta s.t. C1.theta is a subset of C2 (where "subset" denotes the improper subset relation, so C1.theta may equal C2).

Thus, for the following pair of clauses:

C1: mem(A,[B|C]) :- mem(A,C).
C2: mem(0,[1,0]) :- nat(0), nat(1), mem(0,[0]).

it is the case that C1 subsumes C2 (take theta = {A/0, B/1, C/[0]}). Note that of the two clauses, C1 "looks" more general.
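Clause subsumption can then be sketched as a backtracking search for one substitution that maps every literal of C1 onto some literal of C2. The encodings below (literals as (sign, atom) pairs, Prolog lists as '.'-cons cells) are illustrative assumptions, not a real ILP system's representation.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()

def match(t1, t2, theta):
    """Extend theta so that t1.theta == t2 (one-way matching), or None.
    Variables in t2 are treated as constants."""
    if is_var(t1):
        if t1 in theta:
            return theta if theta[t1] == t2 else None
        new = dict(theta)
        new[t1] = t2
        return new
    if isinstance(t1, tuple) and isinstance(t2, tuple) and \
       t1[0] == t2[0] and len(t1) == len(t2):
        for a1, a2 in zip(t1[1:], t2[1:]):
            theta = match(a1, a2, theta)
            if theta is None:
                return None
        return theta
    return theta if t1 == t2 else None

def subsumes(c1, c2, theta=None):
    """True iff some theta maps every literal of c1 onto a literal of c2,
    backtracking over the choice of target literal."""
    theta = theta or {}
    if not c1:
        return True
    (sign, atom), rest = c1[0], c1[1:]
    for (sign2, atom2) in c2:
        if sign == sign2:
            t = match(atom, atom2, theta)
            if t is not None and subsumes(rest, c2, t):
                return True
    return False

# Lists as cons cells: [1,0] is ('.', 1, ('.', 0, 'nil')).
C1 = [(True,  ('mem', 'A', ('.', 'B', 'C'))),
      (False, ('mem', 'A', 'C'))]
C2 = [(True,  ('mem', 0, ('.', 1, ('.', 0, 'nil')))),
      (False, ('nat', 0)),
      (False, ('nat', 1)),
      (False, ('mem', 0, ('.', 0, 'nil')))]
print(subsumes(C1, C2))  # True
```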

It is the case that if a clause C1 subsumes clause C2, then C1 logically implies C2. That is:

If C1 subsumes C2 then C1 |= C2

However, the reverse does not hold.

Consider the following clauses:

C1: mem(A,[B|C]) :- mem(A,C).
C2: mem(X,[H1,H2|T]) :- mem(X,T).

C1 logically implies C2 (C2 follows by resolving C1 with itself), but C1 does not subsume it.

For ILP systems, the word "generalisation" has a particular logical meaning. A sentence F is said to be more general than G iff:

F |= G

The hypothesis-formation problem in this setting is then to find F, given G. Most ILP systems do this by finding Fs that subsume G.

ILP systems for prediction are given a set of clauses as background B, positive examples E+ and negative examples E-. With these, the task is to find a set of clauses H s.t. the following constraints hold:

B and E- is satisfiable          i.e. B and E- |/= []
B and H and E- is satisfiable    i.e. B and H and E- |/= []
B does not logically imply E+    i.e. B |/= E+
B and H logically implies E+     i.e. B and H |= E+

Since, for any H that satisfies these constraints:

B and H |= B and E+

B and H are said to be "more general" than B and E+. That is, the hypothesis H is said to be a "generalisation" of E+. Typically, there are many (possibly infinitely many) hypotheses that satisfy these constraints.

Consider constructing a set of clauses T within the following language constraints:

P: a set of predicate symbols
C: a set of constant symbols
V: a set of variables
a: the maximum arity of any predicate in P
c: the maximum number of literals allowed in the body of any clause in T

An upper bound on the number of atoms A that can be constructed given P, C, V and a is:

A = |P|.(|C|+|V|)^a

An upper bound on the number of definite clauses that can be chosen with atoms from A, with at most c literals in the body, is:

D = A^(c+1)

An upper bound on the number of clause sets (possible hypotheses) that can be constructed is:

2^D

Note that this is without any function symbols allowed.
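These bounds can be computed directly, and even a tiny language explodes. A minimal sketch (treating P, C, V as sizes rather than as the sets themselves):

```python
def atom_bound(P, C, V, a):
    """Upper bound on distinct atoms: each of P predicates takes up to a
    arguments, each drawn from C constants or V variables."""
    return P * (C + V) ** a

def clause_bound(P, C, V, a, c):
    """Upper bound on definite clauses: one head plus up to c body
    literals, crudely counted as c+1 independent atom choices."""
    return atom_bound(P, C, V, a) ** (c + 1)

def hypothesis_bound(P, C, V, a, c):
    """Upper bound on clause sets: any subset of the possible clauses."""
    return 2 ** clause_bound(P, C, V, a, c)

# 2 predicates, 2 constants, 2 variables, arity 1, at most 1 body literal:
print(atom_bound(2, 2, 2, 1))           # 8 atoms
print(clause_bound(2, 2, 2, 1, 1))      # 64 clauses
print(hypothesis_bound(2, 2, 2, 1, 1))  # 2**64 clause sets
```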

Given some positive and negative examples, not all such clause sets will satisfy the logical constraints that qualify them as a hypothesis for the examples. Rather than search this space of clause sets in an arbitrary fashion, we can choose to examine them in some order. The ordering that almost all ILP systems use is based not on implication, but on the weaker relation of subsumption, which imposes a lattice on clauses that can be part of a hypothesis. To really understand the subsumption lattice, you must first get to grips with the notion of least general generalisations (lggs) of terms, literals and clauses. Given a pair of terms, literals or clauses u, v:

L is the lgg of u and v iff:
1) L is a common generalisation of u and v; and
2) every other common generalisation of u and v is also a generalisation of L.

F is a common generalisation of G and H iff:
1) F is a generalisation of G; and
2) F is a generalisation of H.

F is a generalisation of G iff F subsumes G.

Term Lgg

In an important thesis, Gordon Plotkin provides a recursive method for constructing lggs of two terms u, v (do not get confused by the lower-case letters: u and v here do not have to be constants):

1) If u = v then lgg(u,v) = u; otherwise
2) if u is f(s1,...,sn) and v is f(t1,...,tn) then lgg(u,v) = f(lgg(s1,t1),...,lgg(sn,tn)); otherwise
3) lgg(u,v) is the variable V that represents this pair of terms throughout.

In addition to the rules for term lggs, the lgg of two literals is only defined when the sign and predicate symbol are the same; that is, they are both positive or both negative literals with the same name and arity. Such literals are said to be "compatible literals". With this restriction, the lgg of literals p(u1,...,un) and p(v1,...,vn) is simply p(lgg(u1,v1),...,lgg(un,vn)).

We are now in a position to define the lgg of two clauses. Recall that a clause can be thought of as a set of literals. The lgg of two clauses C1 and C2 is:

lgg(C1,C2) = { l : there are li in C1 and lj in C2 s.t. li, lj are compatible and l = lgg(li,lj) }
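Plotkin's term lgg and the clause lgg above can be sketched as follows. The tuple encoding of terms and the (sign, atom) encoding of literals are assumptions for illustration. The key detail is that the pair-to-variable map is shared across literals, so the same pair of differing subterms always maps to the same variable:

```python
def lgg_term(u, v, pairs):
    """Plotkin's lgg of two terms; `pairs` maps each (u,v) pair of
    differing subterms to one fresh variable, shared throughout."""
    if u == v:
        return u
    if (isinstance(u, tuple) and isinstance(v, tuple)
            and u[0] == v[0] and len(u) == len(v)):
        return (u[0],) + tuple(lgg_term(a, b, pairs)
                               for a, b in zip(u[1:], v[1:]))
    if (u, v) not in pairs:
        pairs[(u, v)] = 'V%d' % len(pairs)
    return pairs[(u, v)]

def lgg_clause(c1, c2):
    """lgg of clauses as literal sets: lggs of all compatible literal
    pairs, with a single pair-to-variable map shared across literals."""
    pairs = {}
    return [(s1, lgg_term(a1, a2, pairs))
            for (s1, a1) in c1
            for (s2, a2) in c2
            if s1 == s2 and a1[0] == a2[0] and len(a1) == len(a2)]

# lgg of  p(a) :- q(a)  and  p(b) :- q(b)  is  p(V0) :- q(V0)
c1 = [(True, ('p', 'a')), (False, ('q', 'a'))]
c2 = [(True, ('p', 'b')), (False, ('q', 'b'))]
print(lgg_clause(c1, c2))  # [(True, ('p', 'V0')), (False, ('q', 'V0'))]
```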

Subsumption Lattice and Search

It can be shown that subsumption imposes a lattice over equivalence classes of clauses. Here, it is sufficient to understand the broad meaning of this statement.

For representative clauses C and D in two equivalence classes, there is first a partial ordering ">=" based on whether one clause subsumes the other. Second, there is a least upper bound for C and D, which is simply the lgg of the two clauses. Third, there is a greatest lower bound for C and D, which is simply their union.

The equivalence class of C contains all clauses that are subsumption-equivalent to C. This is denoted by:

[C] = { C' | C' subsumes C and C subsumes C' }

This lattice forms the basis of a clause-by-clause search in ILP.

A straightforward procedure is:

1. Pick a positive example e

2. Construct the most-specific single-clause hypothesis (called ⊥, pronounced "bottom")

3. Find the best clause C above ⊥ in the subsumption lattice

4. Remove all clauses made redundant by C and repeat the search

Not all programs explicitly perform Step 2.

Given B and some positive example e, the most specific single-clause hypothesis can be derived from the following steps:

We want the most specific clause ⊥ s.t.

B and ⊥ |= e

Using the Deduction Theorem:

B and ~e |= ~⊥

or:

⊥ |= ~(B and ~e)

Thus, as a procedure:

1. Calculate ~e
2. Find the conjunction of literals derivable from B and ~e
3. The most specific single-clause hypothesis is the negation of the conjunction in Step 2.

Since Step 2 produces a conjunction of literals, Step 3, which negates this conjunction, produces a disjunction of literals, which is a clause.

The most-specific clause may be infinitely long. In practice, the best that programs can construct is some finite clause that implies this infinite clause. This is done using constructs like a depth-bounded mode language. The (finite) most-specific clause in this language implies the true ⊥, and forms the basis for the clause-by-clause search.
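For a flavour of Steps 1-3, here is a drastically simplified sketch for the ground case: B is taken to be ground facts plus ground Horn rules, e is a ground atom, and the body of the bottom clause is just everything derivable from B. The mode-directed construction used by real systems such as Progol is considerably more involved; the example data here is hypothetical.

```python
def derivable(facts, rules):
    """Forward-chain ground Horn rules (head, [body...]) to a fixpoint."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in rules:
            if head not in known and all(b in known for b in body):
                known.add(head)
                changed = True
    return known

def bottom_clause(facts, rules, e):
    """Most specific clause, ground case: e :- <everything B derives>.
    (Negating the ground atom e contributes the head literal; the
    derivable facts become the body literals.)"""
    return (e, sorted(derivable(facts, rules)))

# Hypothetical background: two parent facts, no rules.
B_facts = ['parent(henry,jane)', 'parent(jane,john)']
B_rules = []
head, body = bottom_clause(B_facts, B_rules, 'grandparent(henry,john)')
print(head, ':-', ', '.join(body))
# grandparent(henry,john) :- parent(henry,jane), parent(jane,john)
```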

There has to be some reason for preferring one hypothesis over another. A probabilistic setting for providing justification for the hypothesis selected is that afforded by Bayes Rule:

Pr(Hyp|Examples) = PriorPr(Hyp).Pr(Examples|Hyp)/PriorPr(Examples)

An information-theoretic version of this follows from taking negative logs of both sides of this equation:

Info(Hyp|Examples) = Info(Hyp) + Info(Examples|Hyp) - Info(Examples)

where Info(X) is the number of bits required to encode X.

This forms the basis of the Minimum Description Length (MDL) principle, which selects hypotheses that minimise Info(Hyp|Examples). This is equivalent to maximisation of the Bayesian posterior probability.
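The equivalence can be checked on toy numbers: the hypothesis with the largest (unnormalised) posterior is also the one with the smallest description length, since Info(Examples) is common to all hypotheses. The priors and likelihoods below are made up purely for illustration.

```python
import math

def info(p):
    """Bits required to encode an event of probability p."""
    return -math.log2(p)

# Two hypothetical hypotheses with assumed priors and likelihoods.
prior      = {'h1': 0.5,  'h2': 0.01}
likelihood = {'h1': 0.01, 'h2': 0.9}

# Posterior up to the common factor Pr(Examples), and description length.
score  = {h: prior[h] * likelihood[h] for h in prior}
length = {h: info(prior[h]) + info(likelihood[h]) for h in prior}

best_bayes = max(score, key=score.get)
best_mdl   = min(length, key=length.get)
print(best_bayes, best_mdl)  # the same hypothesis wins under both criteria
```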

Evaluation is concerned with estimating the accuracy of a hypothesis for a particular target concept and example set.

Accuracy is measured according to some probability distribution D over the examples. It is simply the probability of drawing an example, under D, that the hypothesis correctly classifies (the corresponding error is the probability of drawing one that it misclassifies).

The simplest way of estimating this probability is to get more labelled examples (a "test" set). If m of n new examples are correctly classified by the hypothesis, then an unbiased estimate of its accuracy is m/n. In fact, if the true accuracy is p, then the number of correct classifications is binomially distributed with mean np and variance np(1-p).
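A minimal sketch of this estimate, together with the standard error implied by the binomial variance (using the estimate m/n in place of the unknown p):

```python
import math

def accuracy_estimate(m, n):
    """Unbiased accuracy estimate m/n from a test set, plus the standard
    error sqrt(p(1-p)/n) implied by the binomial variance np(1-p)."""
    p_hat = m / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se

p_hat, se = accuracy_estimate(80, 100)
print(p_hat, round(se, 3))  # 0.8 0.04
```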

For many practical problems, getting a new set of labelled examples is not a viable option. This motivates the use of resampling methods. Here, the available data of n examples are partitioned into k sets. A hypothesis is obtained by ignoring, in turn, the data in each of these sets. For each hypothesis, the classification on the set ignored is then obtained. The accuracy of the hypothesis is then the proportion m/n of correct classifications made on these "left-out" datasets. Strictly speaking, the variance expression above does not hold with this technique.
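A sketch of the resampling procedure follows. The round-robin fold split and the stand-in "learner" are assumptions for illustration; the callback would really train on the retained data and classify the held-out fold.

```python
def k_fold_accuracy(examples, k, train_and_test):
    """Leave out each of k folds in turn; train on the rest, count correct
    classifications on the left-out fold; return pooled accuracy m/n."""
    folds = [examples[i::k] for i in range(k)]   # simple round-robin split
    correct = 0
    for i in range(k):
        held_out = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        correct += train_and_test(train, held_out)
    return correct / len(examples)

# Hypothetical stand-in learner: training data is ignored, and we pretend
# the hypothesis classifies an example correctly iff its label is positive.
data = [('e%d' % i, i % 3 != 0) for i in range(12)]
acc = k_fold_accuracy(data, 3, lambda tr, te: sum(1 for (_, y) in te if y))
print(acc)  # 8 of the 12 labels are positive, so 8/12
```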

Comparing the performance of a pair of algorithms involves testing the null hypothesis that there is no difference in their accuracies.

The simplest way of comparing a pair of algorithms is to record their performance on n independently drawn "test" examples. The comparison is then analogous to a coin-tossing experiment in which a "head" occurs if Algorithm 1 does better than Algorithm 2. If there is no difference between the algorithms then we would expect n/2 heads. Since the number of heads with a fair coin is governed by the binomial distribution with probability of success equal to 0.5, we can calculate the probability of obtaining at least the number of heads observed. This can be compared to an acceptable significance level.
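This coin-tossing comparison is a sign test, and the binomial tail probability is easy to compute directly:

```python
from math import comb

def sign_test_p(heads, n):
    """One-sided p-value: probability of at least `heads` successes in n
    fair coin tosses, under the no-difference null hypothesis."""
    return sum(comb(n, k) for k in range(heads, n + 1)) / 2 ** n

# Algorithm 1 beats Algorithm 2 on 8 of 10 test examples.
p = sign_test_p(8, 10)
print(p)  # 0.0546875 -- not quite significant at the 0.05 level
```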

Care must be taken with repeated testing of pairs of algorithms. With sufficient comparisons, it is possible, simply by chance, to obtain one in which Algorithm 1 looks better than its adversary. This has to be accounted for by statistical corrections to the acceptable significance level.