This paper presents a set of analytical tools to evaluate the performance of three land surface models (LSMs) used in global climate models (GCMs). Predictions of the fluxes of sensible heat, latent heat, and net CO2 exchange obtained with the process-based LSMs are benchmarked against two statistical models that use only incoming solar radiation, air temperature, and specific humidity as inputs. Both model types are then compared with fluxes measured at several flux stations on three continents. The parameter sets used for the LSMs include default values used in GCMs for the plant functional type and soil type surrounding each flux station, locally calibrated values, and ensemble sets spanning combinations of parameters within their respective uncertainty ranges. The performance of the LSMs is found to be generally inferior to that of the statistical models across a wide variety of performance metrics, suggesting that the LSMs underutilize the meteorological information available in their inputs and that model complexity may be hindering accurate prediction. The authors show that model evaluation is purpose specific: good performance in one metric does not guarantee good performance in others. Self-organizing maps are used to divide meteorological "forcing space" into distinct regions as a mechanism for identifying the conditions under which model bias is greatest. These new techniques will help modelers identify the areas of model structure responsible for poor performance.
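The benchmarking idea can be illustrated with a minimal sketch: fit a simple empirical model of a flux as a function of the three forcing variables and score it with a standard metric. This is an assumption-laden toy, not the paper's method: the forcing and "observed" flux below are synthetic, the regression is ordinary least squares rather than the paper's statistical benchmarks, and RMSE stands in for the wider suite of metrics the authors use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic half-hourly forcing: shortwave radiation, air temperature,
# and specific humidity (hypothetical values, for illustration only).
n = 1000
sw = rng.uniform(0.0, 800.0, n)       # W m^-2
tair = rng.uniform(270.0, 305.0, n)   # K
qair = rng.uniform(0.001, 0.02, n)    # kg kg^-1

# A synthetic "observed" latent heat flux with noise, standing in for
# eddy-covariance measurements at a flux station.
obs = 0.3 * sw + 2.0 * (tair - 273.15) + 500.0 * qair \
      + rng.normal(0.0, 10.0, n)

# Statistical benchmark: least-squares regression of the flux on the
# three forcing variables (intercept plus three slopes).
X = np.column_stack([np.ones(n), sw, tair, qair])
coef, *_ = np.linalg.lstsq(X, obs, rcond=None)
pred = X @ coef

# Score the benchmark; an LSM prediction for the same site would be
# scored the same way and compared against this value.
rmse = np.sqrt(np.mean((pred - obs) ** 2))
print(f"benchmark RMSE: {rmse:.1f} W m^-2")
```

Because the regression sees the same forcing the LSM receives, any LSM that scores worse than this benchmark is, by construction, extracting less predictive information from its inputs than a simple empirical fit does.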