Guest blogger: Using artificial intelligence to identify language proficiency requirements

24.4.2023

Jobseekers, especially those with a foreign background, should be able to find jobs for which their language skills are sufficient. Of course, this requires that the job advertisement includes information about the language skills needed for the job. However, this information is not always available in a structured form, even though the text of the job advertisement may mention what kind of language skills are required. This raised the question whether we could perhaps deduce the language proficiency requirements directly from the text with the help of an AI solution.

We set out to find a service that would understand:

when we are talking about language skills and when we are not: "you must know Finnish" vs. "we have offices around Finland",
whether the sentence is positive or negative: "proficiency in English required" vs. "you don't need to know English", and
whether language proficiency is necessary or only useful: "proficiency in French required" vs. "we value French language skills".

The service should work in the languages used in Job Market Finland (in Finnish, Swedish and English). It should identify at least twenty of the most common languages in job advertisements.

Finding the right solution

First, we tested whether the technology familiar from the relevance model (FastText + RNN network) would work in this case as well. The beginning looked promising, and the language-related words were found in the text quite easily. However, when we wanted to distinguish between essential and useful language skills, the limitations of the model became apparent. It may be that the problem could have been solved by additional training on the FastText model used, but we chose not to take that path.

Instead, we decided to find out whether large language models (LLMs) would provide a convenient solution. Hugging Face offers a great number of language models and tools for using them, and from these we chose the multilingual LaBSE model developed by Google. The advantage of a multilingual model is that one model is enough: we do not need separate models for processing job advertisements in Finnish, Swedish and English. Additionally, training materials in different languages support each other, i.e. the language model learns, to some extent, to process job advertisements also in languages on which it has not been separately trained.

Training material

In order to teach the language model which words we are interested in, we need training material: job advertisements in which the words describing language skills have been labelled. This material could be produced by going through actual job advertisements and classifying the words manually. However, that would be quite labour-intensive, and the training material would not necessarily be balanced. Finnish job advertisements contain many requirements for Finnish, Swedish and English skills, but quite rarely for French skills, for example. Nevertheless, we would like the model to react in the same way to English, French, Spanish and Farsi.

We solved this conundrum by creating artificial job advertisements, i.e. real job advertisements with a sentence or two about language requirements added. These sentences, in turn, were generated from example sentences, such as "customer work requires fluent LANGUAGE and LANGUAGE skills", in which the LANGUAGE entries were replaced with random languages such as "Finnish", "Sámi", "French", etc. (taking into account the inflection of the word). This way, we were able to generate a large amount of training and test material in which different languages appeared in different sentences and in different job advertisements.
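The template idea described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the template list, the language list and the function names `fill_template` and `make_artificial_ad` are all assumptions made for the example. It works with English sentences only, so it ignores the inflection handling mentioned above, and it simply appends the generated sentence to the end of the advertisement.

```python
import random

# Hypothetical templates; LANGUAGE marks where a language name is substituted.
TEMPLATES = [
    "Customer work requires fluent LANGUAGE and LANGUAGE skills.",
    "We value LANGUAGE language skills.",
]

# Hypothetical language list, much shorter than a real one would be.
LANGUAGES = ["Finnish", "Swedish", "English", "French", "Spanish", "Farsi", "Sámi"]


def fill_template(template: str, rng: random.Random) -> str:
    """Replace each LANGUAGE placeholder with a distinct, randomly chosen language."""
    count = template.count("LANGUAGE")
    chosen = rng.sample(LANGUAGES, count)
    sentence = template
    for language in chosen:
        # Replace one placeholder at a time, left to right.
        sentence = sentence.replace("LANGUAGE", language, 1)
    return sentence


def make_artificial_ad(real_ad: str, rng: random.Random) -> str:
    """Append one generated language-requirement sentence to a real advertisement text."""
    return real_ad.rstrip() + " " + fill_template(rng.choice(TEMPLATES), rng)


rng = random.Random(0)  # fixed seed so the generated material is reproducible
ad = make_artificial_ad("We are hiring a customer service agent in Tampere.", rng)
print(ad)
```

Because every language is drawn with the same probability, rare languages appear in the generated material as often as common ones, which is the balance that real advertisements lack.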

As a result, each word in the training material was classified into one of three categories: 1) words describing the required language proficiency, 2) words describing useful language proficiency and 3) other words.
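To make the three categories concrete, here is a small sketch of what word-level labels for generated sentences might look like. The label names `REQUIRED`, `USEFUL` and `OTHER` and the function `label_tokens` are illustrative assumptions. The sketch also assumes that only the language words themselves are tagged and that words are split on whitespace, which is simpler than the tokenisation a real language model uses.

```python
# Label names are illustrative; the post only names the three categories.
REQUIRED, USEFUL, OTHER = "REQUIRED", "USEFUL", "OTHER"


def label_tokens(template: str, languages: list[str], category: str) -> list[tuple[str, str]]:
    """Fill LANGUAGE placeholders in order and label each whitespace-separated token."""
    remaining = iter(languages)
    labelled = []
    for token in template.split():
        if "LANGUAGE" in token:
            # Keep attached punctuation, e.g. "LANGUAGE," becomes "French,".
            labelled.append((token.replace("LANGUAGE", next(remaining)), category))
        else:
            labelled.append((token, OTHER))
    return labelled


pairs = label_tokens("We require fluent LANGUAGE and LANGUAGE skills.", ["Finnish", "Farsi"], REQUIRED)
pairs += label_tokens("We value LANGUAGE skills.", ["French"], USEFUL)
for token, label in pairs:
    print(f"{token}\t{label}")
```

Since the sentences are generated from templates, the labels come for free: the position of every language word is known at generation time, so no manual classification is needed.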

Training the model

Further training of the Hugging Face language models is not very difficult in itself, but it requires processing power and time. It is no longer very convenient to train models of this size on one's own machine, so we took advantage of Azure's machine learning environment. We started with smaller machines, but we soon realised that we needed a GPU machine for training the model. After that, the problem was that some of the available graphics cards had so little memory that they could not handle a language model of this size class. Ultimately, the solution was Nvidia's Tesla T4 graphics card, which completed the training runs in a few hours.

Changing language words into language codes

Thanks to the trained language model, we can find the words in the job advertisement texts that express language skill requirements. This alone is not enough, as the languages must be represented by ISO 639-1 language codes. The conversion from a language word to a language code proved to be so straightforward that it was not worth building more intelligence into it. Therefore, languages are identified purely by string comparison: for example, if the word found contains the string "Swedish", "svensk" or "swedis", the language code "sv" is returned.

ISO 639-1 (wikipedia.org)
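The string comparison described above could look something like the following sketch. Only the patterns for "sv" are taken from the text. The patterns for "fi" and "en", the function name `to_language_code` and the case-insensitive matching are assumptions made for illustration.

```python
from typing import Optional

# Substring patterns per ISO 639-1 code. Only the "sv" patterns come from the
# post; the "fi" and "en" patterns are illustrative assumptions.
LANGUAGE_PATTERNS = {
    "sv": ["swedish", "svensk", "swedis"],
    "fi": ["finnish", "finsk", "suom"],
    "en": ["english", "engelsk", "englan"],
}


def to_language_code(word: str) -> Optional[str]:
    """Return the ISO 639-1 code whose pattern occurs in the word, or None."""
    lowered = word.lower()  # assumption: matching ignores letter case
    for code, patterns in LANGUAGE_PATTERNS.items():
        if any(pattern in lowered for pattern in patterns):
            return code
    return None


print(to_language_code("svenska"))           # matches the "svensk" pattern
print(to_language_code("Swedish-language"))  # matches the "swedish" pattern
print(to_language_code("Finish"))            # a typo: no pattern matches
```

Matching on substrings rather than whole words lets one pattern cover several inflected forms, but a misspelt word such as "Finish" matches nothing and returns no code.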

Does it work?

Of course, the key question is whether the service works with real job advertisements. Most of the time it works excellently (for clearly more than 95% of job advertisements), even though language proficiency requirements can be expressed in very different ways. Let's look at the following example of a job advertisement. "A full-time Swedish-language teacher in basic education. The school's language programme currently includes B1 Swedish, A1 and A2 English, French and German and B2 Spanish." According to the AI, the language skill required in the assignment based on the text is Swedish. In the following example, the AI has identified Finnish and Farsi as the required language proficiencies. "We are looking for a Persian/Farsi language interpreter. We require excellent Finnish and Persian/Farsi language proficiency."

In the following example, the AI has identified English as the required language proficiency and German, French and Finnish as useful language proficiencies. "Fluent in speaking English, and completely proficient in writing in English (writing autonomously contractual letters, Contract Amendments and documents issued to the subcontractors). - Proficiency in German and/or French and/or Finnish is a plus." From the following Swedish-language text, the AI has been able to extract Finnish and English as the required languages and German, Swedish and Italian as useful language proficiencies. "Du kommer att lyckas med den här uppgiften med flytande finska och engelska muntliga och skriftliga färdigheter. Färdighet i andra språk, särskilt tyska, svenska och italienska, anses till din fördel."

As a further example, in the quote "We also value versatile language skills. A good command of the English language is a necessity (our working language is Finnish) and proficiencies in other languages such as Arabic, Spanish, Swedish, Russian and Estonian is an advantage.", English and Swedish have been identified as the required language skills and Arabic, Spanish, Swedish, Russian and Estonian as useful or "good to know" language proficiencies.
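A share such as "more than 95% of job advertisements" could be measured along the following lines. The post does not say how the figure was calculated, so this sketch assumes that an advertisement counts as correct only when both the required and the useful language sets match the expected ones exactly. The function name `ad_level_accuracy` and the toy data are invented for illustration and are not results from the project.

```python
def ad_level_accuracy(predicted, expected):
    """Share of advertisements whose predicted language sets match the expected ones exactly."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("no advertisements to evaluate")
    correct = sum(
        1
        for p, e in zip(predicted, expected)
        if p["required"] == e["required"] and p["useful"] == e["useful"]
    )
    return correct / len(expected)


# Invented toy data for illustration only; not results from the project.
expected = [
    {"required": {"sv"}, "useful": set()},
    {"required": {"fi", "fa"}, "useful": set()},
    {"required": {"fi", "sv"}, "useful": set()},
    {"required": {"en"}, "useful": {"de", "fr", "fi"}},
]
predicted = [
    {"required": {"sv"}, "useful": set()},
    {"required": {"fi", "fa"}, "useful": set()},
    {"required": set(), "useful": {"fi", "sv"}},  # requirement misclassified as useful
    {"required": {"en"}, "useful": {"de", "fr", "fi"}},
]
print(ad_level_accuracy(predicted, expected))
```

Exact matching per advertisement is a strict measure: one overlooked or misclassified language makes the whole advertisement count as an error.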

Typing errors hinder the functioning of the AI

It should be noted that AI services are rarely perfect. For example, typos and ambiguities prove difficult for the language model. Furthermore, the language model does not understand references to the "other official" or "Scandinavian" languages, for example. And sometimes the model may fail without any obvious reason. However, further development of the model is relatively straightforward, as the errors detected provide new training material.

The errors made by the model are typically such that a language proficiency requirement is overlooked or misclassified. For example, the AI has not been able to classify Finnish and Swedish as compulsory language proficiencies in the following sentence. Instead, it has incorrectly defined them as useful language proficiencies. "Your tasks include customer service by telephone in both Finnish and Swedish. Your job description also includes various office tasks.”

Due to a typographical error, Finnish competence was missed in the following example. "We are looking for team player with good communication (English and Finish) and CAD skills." Only English skills were identified as required language skills. In the following example, the AI did not recognise the English language, for no clear reason. "Vi använder finska som vårt huvudsakliga arbetsspråk, men nästan daglig kommunikation med våra rektorer bedrivs på engelska, kunskap om tyska är en fördel men inte en nödvändighet." The language skills identified as required were Finnish and German.

Services coming to Job Market Finland in spring 2023

All in all, however, we can be satisfied, because the original objective was achieved with sufficient precision. In the future, we can automatically deduce the required languages from the job advertisement if they have not been provided separately. This, in turn, improves the matching of jobseekers with jobs, as we are able to offer jobseekers job advertisements that match their language skills!

The service will be available for production in Job Market Finland in spring 2023.

Heikki Niittylä
Data Scientist
Gofore