The enormous amounts of molecular microbiological data currently produced by high-throughput analytical techniques present both huge opportunities and huge challenges for microbiologists. With over 1000 databases online, it is clearly not feasible for researchers to search each one manually for information about the genes and processes in which they are interested. Much of the data stored in these databases never reaches the peer-reviewed literature, and so becomes essentially unavailable. A powerful approach to maximising the usefulness of large datasets, whether generated in-house or obtained from public repositories, is data integration and mining. Data integration is the process of bringing together large amounts of disparate data into a single, computationally accessible data source, while data mining is the process of finding hidden patterns and relationships in such large datasets. A wide range of algorithms is used for data mining, including established statistical methods and approaches from the field of machine learning. These algorithms have different strengths and weaknesses, and are applicable to different types of data. In this review we first discuss the data mining life cycle and then describe some of the most widely used algorithms, illustrating their applications with examples from the microbiological literature. Where possible, we have identified freely available software for implementing these algorithms.
|Number of pages||53|
|Journal||Methods in Microbiology|
|Publication status||Published - 2012|