The Joint Artificial Intelligence Institute (JAII) of the Universities of Bielefeld and Paderborn is organising another public online lecture on the topic of scaling language models. The lecture by Prof. Hinrich Schütze of LMU Munich will take place on 6 July from 16:00-17:00.
Information on the lecture:
- Title: "Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages"
- Lecturer: Prof. Hinrich Schütze (LMU Munich, Schütze lab)
- Date: 6 July, 16:00-17:00 CET
- Zoom link to the lecture
Abstract:
Large language models (LLMs) are currently the most active area of research in NLP. Most work has focused on what we call “vertical” scaling: making LLMs even better for a relatively small number of high-resource languages. We address “horizontal” scaling instead: extending LLMs to a large subset of the world’s languages, focusing on low-resource languages. Our Glot500-m model is trained on 500 languages, many of which are not covered by any other language model. I will talk about the major challenges we faced in creating Glot500: (i) finding, validating and cleaning training data for that many languages; (ii) evaluating the performance of Glot500-m on languages for which native speakers and labeled datasets were not available to us; and (iii) determining the factors that ultimately make training on a language successful. We find that trying to reduce such factors to the so-called curse of multilinguality is naive, and that there is in fact also a “boon of multilinguality”. We are in the process of making Glot500-c, our training corpus covering 500 languages, publicly available.
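For readers who want to experiment with the model ahead of the lecture, below is a minimal sketch of how one might query a multilingual masked language model such as Glot500-m through the Hugging Face transformers library. The model identifier cis-lmu/glot500-base and the use of the standard masked-LM interface are assumptions on our part, not details from the announcement.

```python
# Minimal sketch: filling a masked token with a multilingual masked LM.
# The model ID below is an assumption; check the Glot500 release for the
# official identifier.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "cis-lmu/glot500-base"  # assumed Hugging Face model identifier
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Mask one token in an input sentence; any of the 500 covered languages
# could be used here.
text = f"Paris is the capital of {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring prediction.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(top_id))
```

This kind of fill-mask probe is also a simple way to get a feel for the “horizontal” coverage the abstract describes: the same few lines work unchanged for sentences in low-resource languages, provided the tokenizer can represent them.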