TY - JOUR
T1 - Probabilistic latent semantic analysis applied to whole bacterial genomes identifies common genomic features
AU - Rusakovica, Julija
AU - Hallinan, Jennifer
AU - Wipat, Anil
AU - Zuliani, Paolo
N1 - Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2014
Y1 - 2014
N2 - The spread of drug resistance amongst clinically important bacteria is a serious and growing problem [1]. However, the analysis of entire genomes requires considerable computational effort, usually including the assembly of the genome and subsequent identification of genes known to be important in pathology. An alternative approach is to use computational algorithms to identify genomic differences between pathogenic and non-pathogenic bacteria, even without knowing the biological meaning of those differences. To overcome this problem, a range of techniques for dimensionality reduction have been developed. One such approach uses latent-variable models [2]. In latent-variable models, dimensionality reduction is achieved by representing high-dimensional data by a few hidden or latent variables, which are not directly observed but inferred from the observed variables present in the model. Probabilistic Latent Semantic Analysis (PLSA) is an extension of Latent Semantic Analysis (LSA) [3]. PLSA is based on a mixture decomposition derived from a latent class model. The main objective of the algorithm, as in LSA, is to represent high-dimensional co-occurrence information in a lower-dimensional way in order to discover the hidden semantic structure of the data using a probabilistic framework. In this work we applied the PLSA approach to analyse the common genomic features in methicillin-resistant Staphylococcus aureus, using tokens derived from amino acid sequences rather than DNA. We characterised genome-scale amino acid sequences in terms of their components, and then investigated the relationships between genomes and tokens and the phenotypes they generated. As a control we used the non-pathogenic model Gram-positive bacterium Bacillus subtilis.
AB - The spread of drug resistance amongst clinically important bacteria is a serious and growing problem [1]. However, the analysis of entire genomes requires considerable computational effort, usually including the assembly of the genome and subsequent identification of genes known to be important in pathology. An alternative approach is to use computational algorithms to identify genomic differences between pathogenic and non-pathogenic bacteria, even without knowing the biological meaning of those differences. To overcome this problem, a range of techniques for dimensionality reduction have been developed. One such approach uses latent-variable models [2]. In latent-variable models, dimensionality reduction is achieved by representing high-dimensional data by a few hidden or latent variables, which are not directly observed but inferred from the observed variables present in the model. Probabilistic Latent Semantic Analysis (PLSA) is an extension of Latent Semantic Analysis (LSA) [3]. PLSA is based on a mixture decomposition derived from a latent class model. The main objective of the algorithm, as in LSA, is to represent high-dimensional co-occurrence information in a lower-dimensional way in order to discover the hidden semantic structure of the data using a probabilistic framework. In this work we applied the PLSA approach to analyse the common genomic features in methicillin-resistant Staphylococcus aureus, using tokens derived from amino acid sequences rather than DNA. We characterised genome-scale amino acid sequences in terms of their components, and then investigated the relationships between genomes and tokens and the phenotypes they generated. As a control we used the non-pathogenic model Gram-positive bacterium Bacillus subtilis.
UR - http://www.scopus.com/inward/record.url?scp=84925347881&partnerID=8YFLogxK
U2 - 10.2390/biecoll-jib-2014-243
DO - 10.2390/biecoll-jib-2014-243
M3 - Article
C2 - 24980693
AN - SCOPUS:84925347881
SN - 1613-4516
VL - 11
SP - 1
EP - 13
JO - Journal of Integrative Bioinformatics
JF - Journal of Integrative Bioinformatics
IS - 2
M1 - 243
ER -