Cristina Ferrero Castaño
MarIA, the first massive artificial intelligence model of the Spanish language, is born.
The system, created by the BSC and financed with funds from the Language Technology Plan of the Ministry of Economic Affairs and Digital Agenda and the Future Computing Center (an initiative of the BSC and IBM), is openly available so that any developer, company or entity can use it free of charge.
July 28, 2021
The Barcelona Supercomputing Center team has developed an artificial intelligence system that is an expert in understanding and writing Spanish. It is the first massive, data-driven AI model of this language. The system was trained on files from the National Library of Spain (59 terabytes of the institution's web archive) using the MareNostrum supercomputer.
The project, financed with funds from the Ministry of Economic Affairs and Digital Agenda's Language Technology Plan and the Future Computing Center, an initiative of the BSC and IBM, will make it possible for any developer, company or entity to use this system free of charge. This technology can be used in language predictors and correctors, chatbots, automatic summarisation applications, intelligent searches, sentiment analysis applications or automatic translation and subtitling engines, among other applications.
As Marta Villegas, head of the project and leader of the text mining group at the BSC-CNS, points out, new artificial intelligence technologies "are completely transforming the field of natural language processing. With this project, we are helping the country to join this scientific-technical revolution and position itself as a full player in the computational processing of Spanish".
How does MarIA work?
The first massive AI model of the Spanish language is actually a "set of language models" or, as its developers explain in a statement, "deep neural networks that have been trained to acquire an understanding of the language, its lexicon and its mechanisms for expressing meaning and writing at an expert level". These networks can handle both short- and long-range dependencies in a text and capture not only abstract concepts but also their context.
The first step in creating a language model is to build a corpus of words and phrases that will serve as the basis on which the system is trained. As the project leaders explain, the equivalent of 59,000 gigabytes of the National Library's web archive was used to create the MarIA corpus. These files were then processed to discard everything that was not well-formed text, keeping only well-formed Spanish. This screening and subsequent compilation required 6,910,000 processor hours on the MareNostrum supercomputer and resulted in 201,080,084 clean documents occupying a total of 570 gigabytes of clean, duplicate-free text.
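The kind of filtering described above can be sketched in a few lines: keep only documents that look like well-formed text, and drop exact duplicates via content hashing. This is a toy illustration only; the heuristics and function names below are assumptions, and the actual BSC pipeline is far more elaborate.

```python
import hashlib

def looks_malformed(text):
    # Toy heuristic (an assumption, not the BSC criterion): a page whose
    # characters are mostly non-alphabetic is likely markup or boilerplate.
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) < 0.5

def clean_corpus(documents):
    """Illustrative sketch of corpus screening: keep well-formed text
    and remove exact duplicates, as the article describes in outline."""
    seen = set()
    clean = []
    for doc in documents:
        text = doc.strip()
        if not text:
            continue  # drop empty pages
        if looks_malformed(text):
            continue  # drop pages that are mostly markup debris
        # Hash the content so an identical page archived twice is kept once.
        fingerprint = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue  # drop exact duplicates
        seen.add(fingerprint)
        clean.append(text)
    return clean
```

At the scale of 59 terabytes this step is what consumed millions of processor hours; the sketch only shows the logical shape of the filter, not a distributed implementation.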
This corpus, they say, "surpasses by several orders of magnitude the size and quality of the corpora currently available. It is a corpus that will enrich the digital heritage of Spanish and of the BNE's own archive and that could serve multiple applications in the future, such as having a temporal image that allows us to analyse the evolution of the language, to understand the digital society as a whole and, of course, to train new models".
Once the corpus was created, the BSC researchers took a neural network based on the Transformer architecture, which has proven to work well in English, and trained it on the corpus to learn to use the language. This training required 184,000 processor hours and more than 18,000 GPU hours.
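The core operation of the Transformer architecture mentioned above is scaled dot-product attention: each position in a text queries every other position and takes a weighted mix of their representations, which is what lets these models relate words across both short and long distances. A pure-Python sketch with toy dimensions (no trained weights, purely illustrative):

```python
import math

def softmax(xs):
    # Numerically stable softmax: turn raw scores into weights summing to 1.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Toy scaled dot-product attention over lists of vectors.
    Each query scores every key; the output is the score-weighted
    average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot-product similarity with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

In a real model such as MarIA this operation runs with learned projection matrices, many attention heads and hundreds of dimensions per vector, which is why the training required thousands of GPU hours.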
Having launched the general models, the BSC text mining team is now working on extending the corpus with new archive sources, such as CSIC scientific publications, whose texts have characteristics different from those found on the web. The team also plans to train models on texts in several languages: Spanish, Catalan, Galician, Basque, Portuguese and Latin American Spanish.