TY - GEN
T1 - Self-supervised human-object interaction of complex scenes with context-aware mixing
T2 - 2024 IEEE/CVF Winter Conference on Applications of Computer Vision
AU - Kikuchi, Takashi
AU - Takeuchi, Shun
PY - 2024/1/1
Y1 - 2024/1/1
N2 - Recognizing human-object interactions (HOIs) in physical retail stores, such as picking up a product, can provide valuable information about non-purchasers and is an important aspect of understanding customer behavior. However, scenes in physical retail stores are often complex, with numerous similar objects on the shelf, which makes recognizing the interacting object challenging. To address the problem of complex background scenes, we propose a method that uses image mixing and self-supervised techniques to train the model to differentiate interacting objects from background objects. The proposed method uses context-aware image mixing to generate, from the input image, images without the object's influence. We then introduce a self-supervised method that uses the generated images to learn the difference between the actual and the background objects. We evaluated the network's performance on public and private retail datasets. We confirmed that, when applied to physical retail scenes, the proposed method outperformed recent HOI detection methods, including the recent state-of-the-art method. To the best of our knowledge, this is the first study to apply a self-supervised technique to control the target of interaction for an HOI detection model, demonstrating promising potential for in-store consumer behavior analysis.
AB - Recognizing human-object interactions (HOIs) in physical retail stores, such as picking up a product, can provide valuable information about non-purchasers and is an important aspect of understanding customer behavior. However, scenes in physical retail stores are often complex, with numerous similar objects on the shelf, which makes recognizing the interacting object challenging. To address the problem of complex background scenes, we propose a method that uses image mixing and self-supervised techniques to train the model to differentiate interacting objects from background objects. The proposed method uses context-aware image mixing to generate, from the input image, images without the object's influence. We then introduce a self-supervised method that uses the generated images to learn the difference between the actual and the background objects. We evaluated the network's performance on public and private retail datasets. We confirmed that, when applied to physical retail scenes, the proposed method outperformed recent HOI detection methods, including the recent state-of-the-art method. To the best of our knowledge, this is the first study to apply a self-supervised technique to control the target of interaction for an HOI detection model, demonstrating promising potential for in-store consumer behavior analysis.
UR - https://www.scopus.com/pages/publications/85191718413
U2 - 10.1109/wacvw60836.2024.00086
DO - 10.1109/wacvw60836.2024.00086
M3 - Conference proceeding contribution
SN - 9798350370713
T3 - IEEE Winter Conference On Applications Of Computer Vision Workshops
SP - 744
EP - 751
BT - Proceedings - 2024 IEEE Winter Conference on Applications of Computer Vision Workshops
PB - Institute of Electrical and Electronics Engineers (IEEE)
CY - Los Alamitos
Y2 - 1 January 2024 through 6 January 2024
ER -