A new collaboration aims to develop a language model for Danish and other Germanic languages
A new large-scale EU research initiative aims to develop a Germanic language model, providing valuable insights for the subsequent major task: the development of a Danish language model designed for practical and meaningful applications.
Many people already know and regularly use ChatGPT and Bard. These language models are closed systems, trained primarily on widely spoken languages such as English and driven by commercial interests in the USA. The EU and others see a notable drawback in these large systems: they are built under different regulations and cultural norms, which compels Europeans to engage with systems that may not align with European values of ‘human-centered, trustworthy, and democratized’ artificial intelligence. The EU is now stepping in to support the Germanic languages, with the goal of safeguarding linguistic diversity.
– Artificial intelligence is a rapidly advancing force that will influence at least 80% of the workforce. If we neglect the development of our language, others will exploit the opportunities that arise. It is imperative that we protect our language and enhance our skills to safeguard our interests, emphasizes Torben Blach, the project leader at the Alexandra Institute for the ambitious research project TrustLLM.
Within this initiative, experts from the Alexandra Institute will collaborate with leading European researchers in language technology. As a GTS (Advanced Technology Group) institute, the institute takes this role seriously, aiming to contribute its expertise to the national initiative of ultimately developing a Danish language model.
– We are now part of a crucial collective dedicated to developing models for the Germanic languages. This collaboration allows us to enhance our skills and gain firsthand insights into the data being collected, which will be used to train the models, explains Torben Blach.
Through this project, the EU establishes frameworks and opportunities for the best experts in the natural language field to collaborate across the European Union.
Integration of open source and AI is fundamental
The project will also delve into the ethical, research-intensive, and business dimensions of AI. Dan Saattrup Nielsen, Senior AI Specialist and PhD at the Alexandra Institute, highlights the constraints of current models, which is why the project’s core motivation is to adopt an open-source approach.
– Open source enables the democratization of model usage, ensuring accessibility to many rather than just a few. Currently, there is no open-source model for the Danish language. We aim to change this so that a broader audience can benefit from the models, fostering innovation and meaningful use cases, he explains.
The concentration of power within a select few makes us highly vulnerable if they decide to discontinue or significantly raise the price for the models. This lack of autonomy leaves us with no alternative but to comply.
– We rely on external data and on a model structure that is closed, so we do not know its logic. Therefore, we need to enhance the models by addressing the observed issues, such as the shortcomings in ChatGPT. This includes mitigating biases during model training and reducing instances where the models generate inaccurate information, he says.
Open source is meant to generate business
The partners in the project receive funding to develop the model, and the resulting product will be released as open source, making it freely accessible. This allows the model to serve as a starting point for companies to develop products that can generate profits, as their internal data science departments can leverage it.
– We can train and release these models and enable companies to download them onto their own in-house servers without the need to share data externally. This not only makes model use safer but also gives each company the flexibility to customize the model to its specific requirements, explains Dan.
According to Dan, this approach simplifies the creation of numerous models tailored for specific use cases. For instance, a model designed to create journal notes could greatly benefit from this approach. The key idea is that by democratizing the models, anyone can download and customize them according to their unique needs.
– Currently, we are constrained to use pre-packaged solutions from the USA. However, if the model is made publicly available, numerous companies could find a compelling business case. They could develop user-friendly products that make it easy to use hosted solutions such as ChatGPT. This opens up new possibilities for the public sector, where current restrictions limit the use of ChatGPT, he notes.
A competitive European language model
The purpose of the collaboration is to ensure a competitive European language model that aligns with the values of the EU and can be used in countries with small, low-resource languages, such as Denmark.
This approach helps reduce costs, which would otherwise be high in countries with small languages. Now, the work begins to build a processing infrastructure, acquire the underlying data, make them accessible, and establish the massive computing power required by the models.
– We are working with a setup where the largest possible data set is available in a storage system that can be used for training and can be copied flexibly. We gain access to a substantial amount of underlying training data in at least six Germanic languages, including German, Dutch, Norwegian, Swedish, Icelandic, and Danish, emphasizes Torben Blach.
Facts about TrustLLM
The primary objective is to create an open, reliable, and sustainable language model (LLM), with an initial focus on the Germanic languages. The model will serve as the foundation for an advanced, open ecosystem for the next generation of modular and scalable European language models, characterized by trustworthiness, sustainability, and democratization. The TrustLLM project and its associated ecosystem aim to facilitate, support, and enhance context-aware human-machine interaction across a wide range of applications.
Linköping University (LIU), Sweden
Forschungszentrum Jülich, Germany
Lindholmen Science Park Aktiebolag, Sweden
Miðeind ehf., Iceland
Háskóli Íslands, Iceland
University of Copenhagen, Denmark
The Alexandra Institute, Denmark
Norwegian University of Science and Technology (NTNU), Norway
Nederlandse Organisatie Voor Toegepast, Netherlands
Akademie für Künstliche Intelligenz, Germany
Horizon Europe Framework Programme (HORIZON) 6.9 million Euro.
November 2023 – October 2026