JAII Talk “Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages”

The Joint Artificial Intelligence Institute (JAII) of Bielefeld University and Paderborn University is hosting another public online talk on scaling language models. The talk by Prof. Hinrich Schütze of LMU Munich takes place on July 6 from 16:00 to 17:00.

Talk details:

  • Title: “Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages”
  • Speaker: Prof. Dr. Hinrich Schütze (LMU Munich, Schütze Lab)
  • Date: July 6, 16:00–17:30 CET
  • Zoom link to the talk

Abstract:

Large language models (LLMs) are currently the most active area of research in NLP. Most work has focused on what we call “vertical” scaling: making LLMs even better for a relatively small number of high-resource languages. We address “horizontal” scaling instead: extending LLMs to a large subset of the world’s languages, focusing on low-resource languages. Our Glot500-m model is trained on 500 languages, many of which are not covered by any other language model. I will talk about the major challenges we faced in creating Glot500: (i) finding, validating, and cleaning training data for that many languages; (ii) evaluating the performance of Glot500-m on languages for which native speakers and labeled datasets were not available to us; and (iii) determining the factors that ultimately make training on a language successful. We find that trying to reduce such factors to the so-called curse of multilinguality is naive and that there is in fact also a “boon of multilinguality”. We are in the process of making Glot500-c, our training corpus covering 500 languages, publicly available.

Further information:

  • Website of the Joint Artificial Intelligence Institute (JAII)
  • News item from the SAIL network about the talk