This paper investigates the development of accurate and efficient classifiers for identifying misbehaving users (i.e., "flashers") in a mobile video chat application. Our analysis is based on video session data collected from a mobile client we built that connects to a popular random video chat service. We show that prior image-based classifiers designed to identify normal and misbehaving users in online video chat systems perform poorly on mobile video chat data. We present an enhanced image-based classifier that improves classification performance on mobile data. More importantly, we demonstrate that incorporating multi-modal mobile sensor data from the accelerometer and the camera state (front/back), along with audio, can significantly improve overall image-based classification accuracy. Our work also shows that leveraging multiple image-based predictions within a session (i.e., the temporal modality) can further improve classification performance. Finally, we show that the running-time cost of classification can be significantly reduced by employing a multilevel cascaded classifier in which high-complexity features and additional image-based predictions are generated only when needed.
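The cascaded design mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the two-stage structure, the feature extractors, and the confidence thresholds are all assumptions made for the example. The key idea is that a cheap first stage resolves confident cases, and expensive features are computed only when the first stage is uncertain.

```python
# Sketch of a multilevel cascaded classifier: a low-cost first stage
# handles confident cases; a high-complexity second stage runs only
# when needed. Feature extractors and thresholds are hypothetical.

def cheap_features(frame):
    # Placeholder for low-cost features (e.g., simple pixel statistics).
    return [sum(frame) / len(frame)]

def expensive_features(frame):
    # Placeholder for high-complexity features computed only on demand.
    return [min(frame), max(frame), sum(frame) / len(frame)]

class CascadedClassifier:
    def __init__(self, stage1, stage2, lo=0.2, hi=0.8):
        # stage1/stage2 map a feature vector to a misbehavior score in [0, 1].
        self.stage1, self.stage2 = stage1, stage2
        self.lo, self.hi = lo, hi  # confidence band that triggers escalation

    def predict(self, frame):
        score = self.stage1(cheap_features(frame))
        if score <= self.lo:          # confidently normal: stop early
            return 0, "stage1"
        if score >= self.hi:          # confidently misbehaving: stop early
            return 1, "stage1"
        # Uncertain: pay the cost of high-complexity features.
        score2 = self.stage2(expensive_features(frame))
        return (1 if score2 >= 0.5 else 0), "stage2"
```

For example, with toy scoring functions `stage1=lambda f: f[0]` and `stage2=lambda f: f[2]`, a frame whose cheap score falls outside the `[0.2, 0.8]` band is labeled immediately by stage 1, while an ambiguous frame is escalated to stage 2; the saving comes from how often the early exit fires.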