TY - JOUR
T1 - Estimating change-points in biological sequences via the cross-entropy method
AU - Evans, G. E.
AU - Sofronov, G. Y.
AU - Keith, J. M.
AU - Kroese, D. P.
PY - 2011/9
Y1 - 2011/9
N2 - The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well-defined regions. We model DNA sequences as a multiple change-point process in which the sequence is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. Multiple change-point problems are important in many biological applications, particularly in the analysis of DNA sequences. Multiple change-point problems also arise in segmentation of protein sequences according to hydrophobicity. We use the Cross-Entropy method to estimate the positions of the change-points. Parameters of the process for each segment are approximated with maximum likelihood estimates. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates of the locations of change-points in artificially generated sequences and compare the accuracy of these estimates with those obtained via other methods such as IsoFinder (Oliver et al. in Nucl. Acids Res. 32:W283-W292, 2004) and Markov Chain Monte Carlo. Lastly, we provide examples with real data sets to illustrate the usefulness of our method.
AB - The genomes of complex organisms, including the human genome, are known to vary in GC content along their length. That is, they vary in the local proportion of the nucleotides G and C, as opposed to the nucleotides A and T. Changes in GC content are often abrupt, producing well-defined regions. We model DNA sequences as a multiple change-point process in which the sequence is separated into segments by an unknown number of change-points, with each segment supposed to have been generated by a different process. Multiple change-point problems are important in many biological applications, particularly in the analysis of DNA sequences. Multiple change-point problems also arise in segmentation of protein sequences according to hydrophobicity. We use the Cross-Entropy method to estimate the positions of the change-points. Parameters of the process for each segment are approximated with maximum likelihood estimates. Numerical experiments illustrate the effectiveness of the approach. We obtain estimates of the locations of change-points in artificially generated sequences and compare the accuracy of these estimates with those obtained via other methods such as IsoFinder (Oliver et al. in Nucl. Acids Res. 32:W283-W292, 2004) and Markov Chain Monte Carlo. Lastly, we provide examples with real data sets to illustrate the usefulness of our method.
UR - http://www.scopus.com/inward/record.url?scp=80052018747&partnerID=8YFLogxK
U2 - 10.1007/s10479-010-0687-0
DO - 10.1007/s10479-010-0687-0
M3 - Article
AN - SCOPUS:80052018747
SN - 0254-5330
VL - 189
SP - 155
EP - 165
JO - Annals of Operations Research
JF - Annals of Operations Research
IS - 1
ER -