Complex Earth system models and their sub-components are not yet evaluated against observations as rigorously as they should be, despite the existence of hundreds of proposed diagnostics. A concerted process is urgently needed to make rigorous evaluation the norm, not the exception. Earth Observation, field observations and palaeo data can be applied to contexts as diverse as wildfire, marine ecosystems, the land carbon cycle, and greenhouse gases. Model evaluation (comparing models with benchmark data) and model weighting (defining the 'quality' of models on the basis of such a comparison) should be treated as separate issues. Systematic approaches to parameter optimization, such as the adjoint technique, allow structural differences between models to be identified and limitations to be addressed. Such methods are established in atmospheric tracer transport and carbon cycling; research carried out in the QUEST programme has demonstrated their applicability to climate modelling. Although it is impossible to devise a foolproof metric for the ability of models to predict the future, relevant metrics could be based on their ability to simulate the past. Furthermore, it should be possible to extend parameter optimization techniques to assimilate data from the past. Benchmarking against a mean state can achieve only so much when it is a change in state that is of greatest interest. It is useful to benchmark individual processes rather than aggregate properties. Coupling good components does not automatically result in a good Earth system model, so complex models need a two-stage process: first, benchmarking the components in stand-alone mode, and second, applying the same benchmarks in coupled mode.
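The distinction drawn above between evaluation and weighting, and between benchmarking a mean state and benchmarking a change in state, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names (`rmse`, `changes`, `evaluate`, `weight`), the choice of root-mean-square error as the metric, the Gaussian-style weighting with a free parameter `sigma`, and the numbers, which are invented and are not observations or model output.

```python
import math


def rmse(model, obs):
    """Root-mean-square error between a model series and a benchmark series."""
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def changes(series):
    """Step-to-step changes of a series (a crude stand-in for 'change in state')."""
    return [b - a for a, b in zip(series, series[1:])]


def evaluate(models, obs):
    """Evaluation step: compare each model with the benchmark, nothing more."""
    return {name: rmse(series, obs) for name, series in models.items()}


def weight(scores, sigma):
    """Weighting step: one possible (assumed) mapping from scores to weights.

    Uses exp(-(score / sigma)^2), normalised to sum to 1. Choosing sigma is a
    judgement that is separate from the evaluation itself.
    """
    raw = {name: math.exp(-(s / sigma) ** 2) for name, s in scores.items()}
    total = sum(raw.values())
    return {name: r / total for name, r in raw.items()}


# Invented illustrative numbers, not real observations or model output.
obs = [1.0, 2.0, 3.0, 4.0]
models = {
    "model_a": [1.0, 2.0, 3.0, 4.0],  # matches the benchmark exactly
    "model_b": [2.0, 3.0, 4.0, 5.0],  # constant offset, but the same changes
}

# Evaluation against the state itself: model_b is penalised for its offset.
state_scores = evaluate(models, obs)

# Evaluation against the change in state: the offset no longer matters.
change_scores = evaluate(
    {name: changes(series) for name, series in models.items()}, changes(obs)
)

# The same evaluation yields different weights under different weighting choices.
weights_sharp = weight(state_scores, sigma=0.5)
weights_broad = weight(state_scores, sigma=5.0)
```

In this sketch the two models are indistinguishable when scored on changes but not when scored on the state, and the relative weight given to `model_a` depends strongly on `sigma` even though the evaluation scores are unchanged, which is one reason to keep the two steps separate.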