Ruri: Japanese General Text Embeddings

Ruri is a series of Japanese general text embedding models trained on datasets synthesized by LLMs; a reranker is built for dataset filtering and knowledge distillation, and the resulting models are evaluated as general-purpose text embedders.

Year
2024
Venue
arXiv 2024
Authors
2
Hosting
Abstract only (ARXIV-DEFAULT)

Attribution

Abstract & full text
arxiv.org/abs/2409.07737 (ARXIV-DEFAULT)
TL;DR
Semantic Scholar

Abstract

We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
