Alignment-free sequence and network comparison
Current biological ''next generation'' sequencing technology gives a large number of relatively short sequences (''reads'') from a genome. Not all segments of the genome may be covered, and some sections may be covered multiple times. Assembling these reads into long sequences which could then be compared using sequence alignment is a computer-intensive process. Also an alignment may miss the subtle signal which is of interest. Hence there is considerable interest in alignment-free methods for sequence comparison.
In joint work with Mike Waterman, Fengzhu Sun and many others we have developed a suite of statistics based on word counts for alignment-free sequence comparison, together with their asymptotic distributions both under a Markov model and under alternative models which include insertion of motifs.
For the comparison of random networks, a similar issue arises - while in principle one could try to align networks by finding large matching subgraphs, finding the optimal alignment is not only computer-intensive but also usually covers only a small fraction of each network. Our alignment-free sequence comparison statistics provide the basis for a network analogue (which we have called Netdis, joint work with Waqar Ali, Charlotte Deane, Tiago Rito and Fengzhu Sun). Netdis separates correctly different random graph model types independent of network size and density. Netdis is also able to build the correct phylogenetic tree of species based solely on the topology of current protein interaction networks.