Identifying change-points in biological sequences via sequential importance sampling

G. Yu. Sofronov*, G. E. Evans, J. M. Keith, D. P. Kroese

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference proceeding contribution › peer-review

Abstract

The genomes of complex organisms, including the human genome, are highly structured. This structure takes the form of segmental patterns of variation in various properties, and may be caused by the division of genomes into regions of distinct function, by the contingent evolutionary processes that gave rise to genomes, or by a combination of both. Whatever the cause, identifying the change-points between segments is potentially important, as a means of discovering the functional components of a genome, understanding the evolutionary processes involved, and fully describing genomic architecture.

One property of genomes that is known to display a segmental pattern of variation is GC content. Genomes are composed of DNA: a long, double-stranded, linear polymer built up from four nucleotide bases, namely adenine, cytosine, guanine, and thymine (A, C, G, and T). The two strands of a DNA molecule form a double helix, held together by hydrogen bonds formed between G and C nucleotide pairs, and between A and T nucleotide pairs, as illustrated in Figure 1. The two strands are thus complementary: the sequence of either strand can be deduced from that of the other by interconverting G with C, and A with T.

[Figure 1]
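The complementarity rule described above can be sketched in a few lines of Python. This is only an illustration: the function names `complement` and `reverse_complement` are hypothetical, and the reversal step reflects the fact that the two strands run in opposite directions, which the abstract does not discuss.

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the complementary strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("GATTACA"))
    print(reverse_complement("GATTACA"))
```

Because G is always exchanged with C and A with T, a region has the same number of GC pairs whichever strand is read.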

The GC content of a portion of DNA is thus the proportion of GC pairs that it contains. Sharp changes in GC content can be observed in the human and other genomes. For example, Figure 2 shows a small portion of a sequence, in which a sharp increase in GC content is observed at about position 35.

[Figure 2]
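As a rough illustration of how GC content can be measured along a sequence, the sketch below computes the proportion of G and C bases in a string and in sliding windows. The function names and the window size are assumptions chosen for the example, not part of the paper.

```python
def gc_content(seq: str) -> float:
    """Proportion of bases in `seq` that are G or C."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return sum(base in "GC" for base in seq) / len(seq)


def windowed_gc(seq: str, window: int) -> list[float]:
    """GC content of every length-`window` stretch, moving one base at a time."""
    if window <= 0 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    return [gc_content(seq[i:i + window]) for i in range(len(seq) - window + 1)]


if __name__ == "__main__":
    # An AT-only stretch followed by a GC-only stretch.
    print(windowed_gc("AATTGGCC", 4))
```

A sharp rise in the windowed values, as in the toy sequence above, is the kind of signal a change-point method tries to locate precisely.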

Such change-points may be the boundaries of functional elements, or may play a structural role. We model genome sequences as a multiple change-point process, that is, a process in which sequential data is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process.
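To make the idea of a multiple change-point process concrete, the following sketch generates an artificial DNA sequence whose GC content is constant within each segment and changes at given positions. Everything here is an assumption for illustration: the function name `simulate_sequence`, the independent-bases model, and the equal split between G and C (and between A and T) are not taken from the paper.

```python
import random


def simulate_sequence(change_points, gc_levels, length, seed=0):
    """Generate a DNA string with piecewise-constant GC content.

    change_points: sorted positions at which a new segment starts.
    gc_levels: one GC probability per segment (len(change_points) + 1 values).
    """
    if len(gc_levels) != len(change_points) + 1:
        raise ValueError("need exactly one GC level per segment")
    rng = random.Random(seed)
    bounds = [0, *change_points, length]
    bases = []
    for start, end, gc in zip(bounds, bounds[1:], gc_levels):
        for _ in range(end - start):
            # Each base is drawn independently using the segment's GC level.
            pool = "GC" if rng.random() < gc else "AT"
            bases.append(rng.choice(pool))
    return "".join(bases)


if __name__ == "__main__":
    # One change-point at position 35: low GC before it, high GC after it.
    print(simulate_sequence([35], [0.2, 0.8], 70))
```

In the inference problem the direction is reversed: only the sequence is observed, and the number of change-points, their positions, and the per-segment parameters must all be estimated.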

Multiple change-point models are important in many biological applications, particularly in the analysis of biomolecular sequences. For example, multiple change-point models can be applied to segment protein sequences (which have a 20-character alphabet) according to hydrophobicity. This can aid in the identification of functional domains and can assist in determining the three-dimensional conformations of protein molecules. Another application in which the authors have an interest is identifying segments that are conserved between two species.

We consider a sequential importance sampling (SIS) approach to change-point modeling, using Monte Carlo simulation to estimate the locations of change-points as well as the parameters of the process on each segment. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates for the locations of change-points in artificially generated sequences and compare the accuracy of these estimates to those obtained via Markov chain Monte Carlo (MCMC) and a well-known method, IsoFinder. We also provide examples with real data sets to illustrate the usefulness of this method.
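The abstract does not specify the sampler, so the following is only a generic sketch of sequential importance sampling applied to a toy change-point model, not the authors' algorithm. It assumes a binary sequence (1 for G or C, 0 for A or T), a prior probability `rho` of a change-point at each position, a uniform Beta(1, 1) prior on each segment's GC level, proposals drawn from the prior, and no resampling step. All names and parameter values are hypothetical.

```python
import math
import random


def sis_change_points(x, n_particles=2000, rho=0.05, seed=0):
    """Estimate the posterior probability of a change-point at each position.

    x: list of 0/1 values (1 = G or C). Returns (probs, weights).
    """
    rng = random.Random(seed)
    n = len(x)
    log_w = [0.0] * n_particles          # log importance weights
    ones = [0] * n_particles             # count of 1s in each particle's current segment
    zeros = [0] * n_particles            # count of 0s in each particle's current segment
    cps = [[] for _ in range(n_particles)]
    for t, xt in enumerate(x):
        for i in range(n_particles):
            # Proposal = prior: start a new segment with probability rho.
            if t > 0 and rng.random() < rho:
                ones[i] = zeros[i] = 0
                cps[i].append(t)
            # Incremental weight: Beta-Bernoulli predictive probability of x[t].
            p_one = (ones[i] + 1) / (ones[i] + zeros[i] + 2)
            log_w[i] += math.log(p_one if xt else 1.0 - p_one)
            if xt:
                ones[i] += 1
            else:
                zeros[i] += 1
    # Normalize the weights in a numerically stable way.
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    probs = [0.0] * n
    for wi, positions in zip(w, cps):
        for t in positions:
            probs[t] += wi
    return probs, w


if __name__ == "__main__":
    data = [0] * 30 + [1] * 30
    probs, weights = sis_change_points(data)
    print(max(range(len(probs)), key=probs.__getitem__))
```

Without resampling, the weights of such a sampler concentrate on very few particles as the sequence grows, so this form is only practical for short toy sequences.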

Original language: English
Title of host publication: Modsim 2007: international congress on modelling and simulation
Subtitle of host publication: Land, Water and Environmental Management: Integrated Systems for Sustainability
Editors: Les Oxley, Don Kulasiri
Place of Publication: Christchurch, NZ
Publisher: Modelling & Simulation Society Australia & New Zealand
Pages: 2917-2923
Number of pages: 7
ISBN (Print): 9780975840047
Publication status: Published - 2007
Externally published: Yes
Event: International Congress on Modelling and Simulation - Land, Water and Environmental Management: Integrated Systems for Sustainability, MODSIM07 - Christchurch, New Zealand
Duration: 10 Dec 2007 to 13 Dec 2007

Other

Other: International Congress on Modelling and Simulation - Land, Water and Environmental Management: Integrated Systems for Sustainability, MODSIM07
Country/Territory: New Zealand
City: Christchurch
Period: 10/12/07 to 13/12/07

Keywords

  • Comparative genomics
  • Multiple change-point problem
  • Sequential importance sampling
