Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia

SEA-VL is an open-source project aimed at creating a large, culturally relevant dataset for Southeast Asian languages to improve AI inclusivity.

Year
2025
Venue
arXiv 2025
Authors
92
Hosting
Abstract only

Attribution

Abstract & full text
arxiv.org/abs/2503.07920v2
TL;DR
Semantic Scholar

Abstract

Southeast Asia (SEA) is a region of extraordinary linguistic and cultural diversity, yet it remains significantly underrepresented in vision-language (VL) research. This often results in artificial intelligence (AI) models that fail to capture SEA cultural nuances. To fill this gap, we present SEA-VL, an open-source initiative dedicated to developing high-quality, culturally relevant data for SEA languages. By involving contributors from SEA countries, SEA-VL aims to ensure better cultural relevance and diversity, fostering greater inclusivity of underrepresented languages in VL research. Beyond crowdsourcing, our initiative goes one step further by exploring the automatic collection of culturally relevant images through crawling and image generation. First, we find that image crawling achieves approximately 85% cultural relevance while being more cost- and time-efficient than crowdsourcing. Second, despite the substantial progress in generative vision models, synthetic images remain unreliable in accurately reflecting SEA cultures. The generated images often fail to reflect the nuanced traditions and cultural contexts of the region. Collectively, we gather 1.28M culturally relevant SEA images, more than 50 times larger than other existing datasets. Through SEA-VL, we aim to bridge the representation gap in SEA, fostering the development of more inclusive AI systems that authentically represent diverse cultures across SEA.

Authors

Bin Wang, Yueqi Song, Lester James V. Miranda, Börje F. Karlsson, Genta Indra Winata, Ruochen Zhang, Alham Fikri Aji, Peerat Limkonchotiwat, Can Udomcharoenchaikit, Muhammad Reza Qorib, Joseph Marvin Imperial, Tim Santos, Mahardika Krisna Ihsani, Mohamed Fazli Imam, David Anugraha, Mohammad Rifqi Farhansyah, Jan Christian Blaise Cruz, Fajri Koto, Phakphum Artkaew, Haochen Li, Adisai Na-Thalang, Amit Agarwal, Supryadi, Samuel Cahyawijaya, Karissa Vincentio, Holy Lovenia, Frederikus Hudi, Tirana Noor Fatyanosa, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, William Nixon, Dan John Velasco, Giang Nguyen, Ming Shan Hee, Michael Anugraha, Saptarshi Saha, Taki Hasan Rafi, Robert Wijaya, Chengwei Wei, Jiayun Luo, Hitesh Laxmichand Patel, Priyaranjan Pattnayak, Ayushman Singh, Tack Hwa Wong, Thant Thiri Maung, Vicky Feliren, Bahrul Ilmi Nasution, Manuel Antonio Rufino, Rian Adam Rajagede, Carlos Rafael Catalan, Salsabila Zahirah Pranida, Kevin Pratama, Yeshil Bangera, Patricia Nicole Monderin, Christian Simon, Lynnette Hui Xian Ng, Richardy Lobo' Sapan, Kanyakorn Veerakanjana, Piyalitt Ittichaiwong, Matthew Theodore Roque, Takdanai Kreangphet, Kadek Hendrawan Palgunadi, Yanzhi Yu, Rochana Prih Hastuti, Mithil Bangera, Adrian Xuan Wei Lim, Aye Hninn Khine, Hanif Muhammad Zhafran, Teddy Ferdinan, Audra Aurora Izzani, Evan, Jauza Akbar Krito, Fenal Ashokbhai Ilasariya, John Amadeo Daniswara, Filbert Aurelian Tjiaranata, Eryawan Presma Yulianrifat, Fadil Risdian Ansori, Anab Maulana Barik, Rifo Ahmad Genadi, Isaiah Flores, Kenneth Ko Han Chen, Anjela Gail Santos, Wan Shen Lim, Kaung Si Phyo, Meisyarah Dwiastuti, Ikhlasul Akmal Hanif, M. Alif Al Hakim, Muhammad Rizky Sya'ban, Kun Kerdthaisong, Jostin Jerico Rosal, Jun Kevin