A project backed by the Singapore government has developed a large language model (LLM) trained on Southeast Asian data. Millions of Southeast Asians have tried out LLMs like Meta's Llama 2 and Mistral AI's models, but the results in their own languages have often been gibberish. Experts warn that this puts them at a disadvantage as artificial intelligence transforms education, work, and governance worldwide.
To address this, the Singapore government has launched a Southeast Asian LLM called SEA-LION (Southeast Asian Languages in One Network). The model is trained on data in 11 Southeast Asian languages, including Vietnamese, Thai, and Bahasa Indonesia. It is open-source, offering a cheaper and more efficient option for businesses, governments, and academia in the region.
Leslie Teo at AI Singapore said, "We are not trying to compete with the big LLMs; we are trying to complement them so there can be a better representation of us." Governments and tech firms worldwide are attempting to bridge the gap in LLM development for non-English languages. By building local-language datasets and launching LLMs trained on them, researchers hope to promote technological self-reliance and provide better privacy for local populations.
As more countries and regions build their own LLMs, digital and human rights experts worry that these models will reproduce only the dominant views expressed online, which can be particularly problematic in nations with authoritarian governments, strict media censorship, or a weak civil society.