The genomes of complex organisms, including the human genome, are highly structured. This structure takes the form of segmental patterns of variation in various properties, and may be caused by the division of genomes into regions of distinct function, by the contingent evolutionary processes that gave rise to genomes, or by a combination of both. Whatever the cause, identifying the change-points between segments is potentially important, as a means of discovering the functional components of a genome, understanding the evolutionary processes involved, and fully describing genomic architecture.
One property of genomes that is known to display a segmental pattern of variation is GC content. Genomes are composed of DNA: a long, double-stranded, linear polymer built up from four nucleotide bases, namely adenine, cytosine, guanine, and thymine (A, C, G, and T). The two strands of a DNA molecule form a double helix, held together by hydrogen bonds between paired G and C nucleotides and between paired A and T nucleotides, as illustrated in Figure 1. The two strands are thus complementary; the sequence of either strand can be deduced from that of the other by interconverting G with C, and A with T.
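The complementarity rule can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are hypothetical, and the input is assumed to contain only the letters A, C, G, and T. The reversed form is included because the two strands run in opposite directions, so the partner strand is conventionally read in reverse order.

```python
# Watson-Crick pairing: A <-> T and G <-> C.
_PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement, in the same left-to-right order."""
    return strand.upper().translate(_PAIRING)


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GATTACA"))
    print(reverse_complement("GATTACA"))
```

Applying `complement` twice returns the original strand, which is the sense in which either strand determines the other.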
The GC content of a portion of DNA is the proportion of its base pairs that are GC pairs. Sharp changes in GC content can be observed in the human genome and in other genomes. For example, Figure 2 shows a small portion of a sequence in which GC content increases sharply at about position 35.
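As a minimal sketch of this definition, GC content can be computed from a single strand, since every G or C on one strand corresponds to one GC pair. The function names and the sliding-window scheme below are assumptions made for illustration; the window size is a free choice, not a value taken from the paper.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC content of every window of the given length, sliding one base at a time."""
    return [gc_content(seq[i:i + window]) for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    # An AT-rich stretch followed by a GC-rich stretch.
    print(windowed_gc("AAAAGGGG", 4))
```

A windowed profile like this makes a sharp change visible, but it blurs the exact boundary over the width of the window, which is one motivation for modelling change-points explicitly.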
Such change-points may be the boundaries of functional elements, or may play a structural role. We model genome sequences as a multiple change-point process, that is, a process in which sequential data is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process.
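To make the idea concrete, the sketch below simulates a simple multiple change-point process and scores a candidate segmentation. It rests on assumptions not stated in the abstract: each position is reduced to a 0/1 indicator (1 for G or C), each segment is an independent Bernoulli process with its own GC probability, and a segmentation is scored by its log-likelihood with each segment's probability set to its maximum-likelihood estimate. The function names and parameter values are hypothetical.

```python
import math
import random


def simulate(change_points, gc_probs, length, seed=0):
    """Generate a 0/1 sequence whose GC probability is constant within each segment."""
    rng = random.Random(seed)
    bounds = [0, *change_points, length]
    seq = []
    for start, end, p in zip(bounds, bounds[1:], gc_probs):
        seq.extend(1 if rng.random() < p else 0 for _ in range(end - start))
    return seq


def segmentation_log_likelihood(seq, change_points):
    """Log-likelihood of seq under a segmentation, using per-segment ML estimates."""
    bounds = [0, *change_points, len(seq)]
    total = 0.0
    for start, end in zip(bounds, bounds[1:]):
        n = end - start
        k = sum(seq[start:end])
        # k ones and n - k zeros with p = k / n; terms with a zero count contribute 0.
        for count in (k, n - k):
            if count:
                total += count * math.log(count / n)
    return total


if __name__ == "__main__":
    seq = simulate(change_points=[35], gc_probs=[0.3, 0.7], length=70)
    print(segmentation_log_likelihood(seq, []))
    print(segmentation_log_likelihood(seq, [35]))
```

Under this score, adding a change-point can never lower the likelihood, so a practical method must also penalise or place a prior on the number of change-points, which here is unknown.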
Multiple change-point models are important in many biological applications and, in particular, in the analysis of biomolecular sequences. For example, multiple change-point models can be applied to segment protein sequences (which have a 20-character alphabet) according to hydrophobicity. This can aid in the identification of functional domains and can assist in determining the three-dimensional conformations of protein molecules. Another application in which the authors have an interest is the identification of segments that are conserved between two species.
We consider a Sequential Importance Sampling approach to change-point modeling, using Monte Carlo simulation to estimate both the locations of change-points and the parameters of the process on each segment. Numerical experiments illustrate the effectiveness of the approach: we estimate the locations of change-points in artificially generated sequences and compare the accuracy of these estimates with those obtained via Markov chain Monte Carlo (MCMC) and a well-known method, IsoFinder. We also provide examples with real data sets to illustrate the usefulness of the method.
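The abstract does not describe the sampling scheme itself, so the following is not the authors' algorithm. It is a deliberately simplified, non-sequential illustration of the underlying idea of importance sampling: self-normalised importance sampling for the location of a single change-point in a 0/1 sequence. The assumptions are all hypothetical: a uniform prior on the change-point position, a Uniform(0, 1) prior on each segment's GC probability (integrated out analytically), and a uniform proposal, so that each importance weight is proportional to the marginal likelihood.

```python
import math
import random


def log_marginal(seq, start, end):
    """Log marginal likelihood of seq[start:end] under Bernoulli(p), p ~ Uniform(0, 1)."""
    n = end - start
    k = sum(seq[start:end])
    # Integral of p^k (1 - p)^(n - k) over [0, 1] equals k! (n - k)! / (n + 1)!
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)


def estimate_change_point(seq, n_samples=5000, seed=0):
    """Posterior-mean estimate of a single change-point position."""
    rng = random.Random(seed)
    # Proposal: uniform over interior positions 1 .. len(seq) - 1.
    draws = [rng.randrange(1, len(seq)) for _ in range(n_samples)]
    log_w = [log_marginal(seq, 0, c) + log_marginal(seq, c, len(seq)) for c in draws]
    top = max(log_w)  # subtract the maximum before exponentiating, for stability
    weights = [math.exp(lw - top) for lw in log_w]
    return sum(w * c for w, c in zip(weights, draws)) / sum(weights)


if __name__ == "__main__":
    # Ten AT positions followed by ten GC positions: the change-point is at 10.
    print(estimate_change_point([0] * 10 + [1] * 10))
```

With several change-points the space of segmentations is far too large for a uniform proposal to be efficient, which is the kind of difficulty that a sequential construction of the proposal is designed to address.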
|Title of host publication||Modsim 2007: international congress on modelling and simulation|
|Subtitle of host publication||Land, Water and Environmental Management: Integrated Systems for Sustainability|
|Editors||Les Oxley, Don Kulasiri|
|Place of Publication||Christchurch, NZ|
|Publisher||Modelling & Simulation Society Australia & New Zealand|
|Number of pages||7|
|Publication status||Published - 2007|
|Event||International Congress on Modelling and Simulation - Land, Water and Environmental Management: Integrated Systems for Sustainability, MODSIM07 - Christchurch, New Zealand|
Duration: 10 Dec 2007 → 13 Dec 2007
- Comparative genomics
- Multiple change-point problem
- Sequential importance sampling