Earth system models (ESMs) are increasing in complexity by incorporating more processes than their predecessors, making them potentially important tools for studying the evolution of climate and associated biogeochemical cycles. However, their coupled behaviour has only recently been examined in any detail, and these examinations have yielded a very wide range of outcomes. For example, coupled climate-carbon cycle models that represent land-use change simulate total land carbon stores at 2100 that vary by as much as 600 Pg C, given the same emissions scenario. This large uncertainty is associated with differences in how key processes are simulated in different models, and illustrates the necessity of determining which models are most realistic using rigorous methods of model evaluation. Here we assess the state of the art in the evaluation of ESMs, with a particular emphasis on the simulation of the carbon cycle and associated biospheric processes. We examine some of the new advances and remaining uncertainties relating to (i) modern data and palaeodata and (ii) metrics for evaluation. We note that the practice of averaging results from many models is unreliable and no substitute for proper evaluation of individual models. We discuss a range of strategies, such as the inclusion of pre-calibration, combined process- and system-level evaluation, and the use of emergent constraints, that can contribute to the development of more robust evaluation schemes. An increasingly data-rich environment offers more opportunities for model evaluation, but also presents a challenge. Improved knowledge of data uncertainties is still necessary to move the field of ESM evaluation away from a "beauty contest" towards the development of useful constraints on model outcomes.