TY - CHAP
T1 - Dataset construction for multimodal detection of online gambling advertisements
AU - Sentana, I Wayan Budi
AU - Astawa, I Nyoman Gede Arya
AU - Lu, Junda
AU - Atmaja, I Made Ari Dwi Suta
AU - Sarja, Ni Ketut Pradani Gayatri
AU - Puspita, Ni Nyoman Harini
N1 - Copyright the Author(s) 2025. Version archived for private and non-commercial use with the permission of the author/s and according to publisher conditions. For further rights please contact the publisher.
PY - 2025
Y1 - 2025
N2 - This study presents the construction of a multimodal dataset designed to detect online gambling advertisement infiltrations on websites. The dataset incorporates both visual (image-based) and textual data extracted from compromised web pages. Data collection begins with a Google Engine Scraper that utilizes specialized search commands (commonly known as Google Hacking techniques) to identify URLs containing keywords frequently associated with online gambling in Bahasa Indonesia. Once identified, these URLs are processed using an automated Selenium-based module that retrieves and extracts the content of each webpage. The extracted content is then categorized into visual and textual components. The textual data is further analyzed using a large language model (LLM) via the OpenAI API to assist in the preliminary classification of gambling-related content. Final verification and labeling are performed manually to ensure accuracy. The resulting dataset comprises 600 samples—300 positively labeled as containing online gambling advertisements and 300 as non-infiltrated, forming a balanced and validated corpus for future multimodal detection model development.
KW - Online Gambling Ad
KW - Multimodal Dataset Type
KW - Semantic Type Dataset
KW - Visual Type Dataset
U2 - 10.2991/978-94-6463-926-1_6
DO - 10.2991/978-94-6463-926-1_6
M3 - Chapter
T3 - Advances in Engineering Research
SP - 39
EP - 46
BT - Proceedings of the International Conference on Applied Science and Technology on Engineering Science 2025 (iCAST-ES 2025)
A2 - Al Rasyid, Muhammad Udin Harun
A2 - Mufid, Mohammad Robihul
A2 - Negara, I Gede Artha
A2 - Baiti, Risa Nurin
A2 - Dewi, Gusti Ayu Wulan Krisna
A2 - Rani, Ni Made Sintya
A2 - Yuliana, Ni Putu Indah
PB - Springer, Springer Nature
CY - Online
T2 - International Conference on Applied Science and Technology on Engineering Science 2025
Y2 - 10 October 2025 through 11 October 2025
ER -