September 8, 2025
by Lautaro Matas and Kathleen Shearer
Introduction
Every day, researchers around the world publish knowledge in hundreds of languages — Spanish in Argentina, Portuguese in Brazil, Arabic in Egypt, Japanese in Japan, Swahili in Kenya. This linguistic diversity is not a side note; it is the lifeblood of global scholarship. And yet, when we go looking for that knowledge, the tools at our disposal behave as if only a handful of languages truly matter. A vast amount of valuable research remains hidden simply because it was written in another tongue. This is because most discovery systems still rely on keyword search — matching the exact words in your query with the exact words in an index. That works fine in a monolingual setting, but it breaks down in a multilingual world.
But what if search worked differently? What if you could type a query in your own language (cambio climático, énergies renouvelables, 再生可能エネルギー) and find relevant results in English, French, Spanish, Japanese, or beyond, without ever translating a word?
That’s the promise of semantic multilingual search: searching not by exact words, but by meaning. In June 2025, COAR embarked on a project to investigate the potential of semantic multilingual searching in the context of scholarly literature and to develop a proposed conceptual model for applying this technology in repositories and their full-text aggregations. This work involved interviews with experts in the field, a review of current technical options, and a short survey about current practices in the scholarly ecosystem. These efforts build on the foundational work undertaken over the last few years by the COAR Task Force on Supporting Multilingualism and non-English Content, and on early proofs of concept recently undertaken in Latin America by LA Referencia and IBICT (Instituto Brasileiro de Informação em Ciência e Tecnologia). This blog post presents an overview of our initial deliberations and conclusions, and of the next phase of our work.
The problem with today’s search
Most discovery systems are built around keyword matching. They try to line up the exact words in a query with the exact words stored in titles, abstracts, and metadata. This method is rigid: it assumes the user knows the precise vocabulary that was used in the record. Even in a single language, that creates gaps: a paper on sustainable power generation, for example, may never appear in response to a search for renewable energy. But the problem grows far worse in multilingual contexts, because most search indexes are designed with processing tools (tokenizers, stemmers, ranking rules) built for a single language. They work reasonably well in English and sometimes in another major language, but they fail to handle the diversity of linguistic structures across hundreds of languages. As a result, vast amounts of research remain effectively invisible unless the searcher can predict the exact form of words in the same language the system was built to process.
The result? Researchers, practitioners, and policymakers are cut off from knowledge that could make a real difference. This isn’t just a technical limitation — it’s a structural inequity.

Typical keyword discovery problems
- Exact word bias: if your query is “renewable energy,” you’ll miss content labeled “sustainable power generation.”
- Language silos: a Spanish search for cambio climático won’t surface English papers on “climate change.”
- Dominant-language bias: ranking algorithms often push English-language content to the top, even when material in other languages is highly relevant.
- Non-Latin scripts: content in Chinese, Arabic, or Russian can be invisible unless queries are typed in precisely the right way.
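To make the first two problems concrete, here is a minimal sketch of exact-word matching in Python. The tokenizer is deliberately naive and the titles are invented for illustration; real search engines are more sophisticated, but the underlying limitation is the same.

```python
def tokenize(text: str) -> set[str]:
    """Lowercase and split on whitespace: a deliberately naive tokenizer."""
    return set(text.lower().split())


def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears verbatim in the document."""
    return tokenize(query) <= tokenize(document)


# Hypothetical titles, invented for illustration.
titles = [
    "Sustainable power generation in rural communities",
    "Climate change and coastal erosion",
    "Renewable energy policy in Latin America",
]

for query in ["renewable energy", "cambio climático"]:
    hits = [title for title in titles if keyword_match(query, title)]
    print(query, "->", hits)
```

The English query finds only the title that repeats its exact words, and the Spanish query finds nothing, even though two of the three titles are about the same topics.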
A different approach: searching by meaning
Instead of asking “Do these words appear in this document?”, semantic search asks “Does this document express the same idea as the query — no matter the language or phrasing?”
This is possible thanks to multilingual embeddings — machine learning models that represent sentences as vectors in a shared “meaning space.” Equivalent phrases in different languages — renewable energy, energía renovable, énergie renouvelable — are placed close together.
The outcome:
- Systems no longer depend on huge manually curated thesauri.
- A researcher can query in their own language and get results from many others.
- Different phrasings of similar concepts and synonyms don’t block relevant results.
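The idea of a shared “meaning space” can be sketched with a few lines of Python. The vectors below are hand-made, three-dimensional stand-ins, not the output of any real model (real multilingual embeddings have hundreds of dimensions), and cosine similarity is assumed here as the closeness measure because it is a common choice.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: values near 1.0 mean similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors, NOT produced by a real embedding model.
toy_embeddings = {
    "renewable energy": [0.90, 0.10, 0.05],
    "energía renovable": [0.88, 0.12, 0.07],
    "énergie renouvelable": [0.91, 0.09, 0.04],
    "medieval poetry": [0.05, 0.20, 0.95],
}

# Rank every phrase by its similarity to a Spanish query.
query_vector = toy_embeddings["energía renovable"]
ranked = sorted(
    toy_embeddings,
    key=lambda phrase: cosine_similarity(query_vector, toy_embeddings[phrase]),
    reverse=True,
)
print(ranked)
```

With these toy values, the English and French phrases rank just behind the Spanish query itself, while the unrelated phrase falls to the bottom of the list.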
Semantic search doesn’t replace keyword search. Instead, it adds a powerful new capability alongside it, resulting in a hybrid system that combines the precision of keywords with the inclusivity of meaning-based discovery.
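One way such a hybrid could merge its two result lists is reciprocal rank fusion, a widely used technique. It is chosen here purely for illustration and is not necessarily the method a final model would use. The document IDs and rankings are hypothetical, and the constant k = 60 is a conventional default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical document IDs and rankings, invented for illustration.
keyword_ranking = ["doc_en_1", "doc_en_2"]
semantic_ranking = ["doc_es_1", "doc_en_1", "doc_pt_1"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

A document found by both methods rises to the top, while documents found only by the semantic ranking (here, the Spanish and Portuguese ones) still appear in the merged list instead of being lost.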
Early experiments: promising signs
Proofs of concept carried out in different repository networks, such as LA Referencia in Latin America and IBICT in Brazil, have already tested semantic multilingual search on a small scale. The setup was simple: generate embeddings for titles and abstracts, run vector similarity searches, and compare those results with traditional keyword matches.
The findings were encouraging:
- Cross-language retrieval felt natural. Users could type queries in one language and find meaningful results in others.
- Greater visibility for underrepresented languages. Content in Portuguese, Arabic, or Japanese surfaced more often.
- Compatibility with existing systems. Prototypes worked alongside standard metadata harvesting and indexing without disruption.
Challenges remain: performance varies by language, embeddings require extra storage and compute, and evaluating relevance across languages is complex. But the proof of concept is clear: the approach works.
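The comparison step in that setup can be sketched as follows. This is an illustrative assumption about how two result lists might be compared, not a description of how the proofs of concept were actually evaluated; the function name, record IDs, and the choice of Jaccard overlap are all hypothetical.

```python
def compare_top_k(keyword_hits: list[str], semantic_hits: list[str], k: int = 3) -> dict:
    """Summarize how two ranked result lists differ within their top k entries."""
    kw, sem = set(keyword_hits[:k]), set(semantic_hits[:k])
    union = kw | sem
    return {
        # Jaccard overlap: shared results divided by all distinct results.
        "jaccard_overlap": len(kw & sem) / len(union) if union else 1.0,
        "semantic_only": sorted(sem - kw),
        "keyword_only": sorted(kw - sem),
    }


# Hypothetical result lists for a single query, invented for illustration.
keyword_hits = ["en_01", "en_02", "en_03"]
semantic_hits = ["en_01", "pt_07", "es_04"]

report = compare_top_k(keyword_hits, semantic_hits)
print(report)
```

A low overlap is not good or bad in itself. The records that appear only in the semantic list are the ones worth reviewing by hand, which is part of why evaluating relevance across languages is complex.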
Proposed pathways to adoption
Rather than each infrastructure adopting its own multilingual embeddings, we propose a community-based approach in which we work together to maintain a shared collection of embeddings that everyone can use. Three practical routes could be implemented in this community context:
1. The “Embedding Commons.” Repositories and networks share a common multilingual model and exchange format, generating or reusing embeddings locally.
   - Strength: maximum interoperability and respect for institutional sovereignty.
   - Challenge: requires strong coordination and versioning.
2. Harvest, process, and serve. A central service generates embeddings and provides them via API, so institutions don’t need to run the infrastructure.
   - Strength: very low barrier to entry.
   - Challenge: dependence on a central operator, need for sustainable funding.
3. Plugins and add-ons. For platforms like DSpace or Dataverse, community-built plugins could deliver semantic search features directly.
   - Strength: meets users where they already are.
   - Challenge: risk of fragmentation without standards.
The key is to treat adoption as phased and flexible — starting small, experimenting with hosted services, and scaling up as experience is gained.
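To illustrate why the first route depends on coordination and versioning, here is a hypothetical sketch of an exchange record. No such format has been defined yet: the class, its field names, the model name, and the identifiers are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EmbeddingRecord:
    """Hypothetical exchange record; field names are illustrative only."""
    record_id: str             # identifier of the repository record
    model_name: str            # shared model that produced the vector
    model_version: str         # version of that model
    vector: tuple[float, ...]  # the embedding itself

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)

    @classmethod
    def from_json(cls, payload: str) -> "EmbeddingRecord":
        data = json.loads(payload)
        data["vector"] = tuple(data["vector"])  # JSON arrays load as lists
        return cls(**data)


def comparable(a: EmbeddingRecord, b: EmbeddingRecord) -> bool:
    """Vectors can only be compared if the same model and version made them."""
    return (a.model_name, a.model_version) == (b.model_name, b.model_version)


# Invented example values.
local = EmbeddingRecord("oai:example.org:123", "shared-model", "1.0", (0.1, 0.2, 0.3))
remote = EmbeddingRecord.from_json(local.to_json())
outdated = EmbeddingRecord("oai:example.org:456", "shared-model", "0.9", (0.3, 0.2, 0.1))

print(comparable(local, remote), comparable(local, outdated))
```

Carrying the model name and version with every vector is what would let a network detect embeddings that need to be regenerated after a model update.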
Guardrails for success
Introducing multilingual embedding models into discovery systems requires careful governance. Three areas stand out:
- Fairness: avoid privileging languages that already dominate. Regular monitoring of search results rankings is essential.
- Transparency: models and training data should be documented and, ideally, open source. Users have the right to know how the results are generated.
- Sustainability: semantic search isn’t “set and forget.” Models need to be updated regularly and content re-indexed, which in turn requires stable funding for compute and expertise.
These types of guardrails will ensure the system works not just technically, but ethically and inclusively.
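As one possible starting point for the monitoring mentioned under fairness, the sketch below compares each language's share of the top results with its share of the corpus. The function names and sample data are hypothetical, and corpus share is only one of several baselines that could reasonably be used.

```python
from collections import Counter


def language_share(languages: list[str]) -> dict[str, float]:
    """Fraction of items in each language."""
    counts = Counter(languages)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.items()}


def exposure_gap(corpus_langs: list[str], top_k_langs: list[str]) -> dict[str, float]:
    """Share in results minus share in corpus; negative means under-exposed."""
    corpus = language_share(corpus_langs)
    results = language_share(top_k_langs)
    return {lang: results.get(lang, 0.0) - share for lang, share in corpus.items()}


# Invented sample: a 10-record corpus and the languages of one top-5 result list.
corpus_langs = ["en"] * 5 + ["es"] * 3 + ["pt"] * 2
top_k_langs = ["en", "en", "en", "en", "es"]

gaps = exposure_gap(corpus_langs, top_k_langs)
print(gaps)
```

In this sample, English is over-represented in the results and Portuguese does not appear at all. A single query proves little, but the same calculation averaged over many queries could flag a systematic bias.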
The vision ahead
Imagine an agronomist in Peru searching in Spanish and discovering studies in Japanese and Amharic. Or a policymaker in Ghana retrieving Portuguese and English research with equal ease. This isn’t just a technical upgrade. It’s a transformation towards a truly multilingual scholarly commons, where the language of a query no longer limits access to knowledge. To get there, we need shared infrastructure, transparent governance, and a collective choice: to treat every language as equally valuable in the global flow of ideas. If we take this path, the next generation of researchers will inherit something richer: a global record of scholarship that reflects the full diversity of human knowledge.

Because when you ask a question in your language, the whole world should be able to answer.
Next steps for COAR
A full report detailing the technologies underpinning our semantic multilingual search model is being prepared and will be shared with the community in the coming weeks. The report will also include a proposal for a phased implementation of this approach in the context of the global repository network. We will also launch a community consultation to gather input and feedback from the broader community. Information about how to participate in the consultation will be made available when we release the report within the next month on the COAR website and through various social media channels (Mastodon / Bluesky).

