Our Large Language Model (LLM) is built using the DistilBERT model.
This LLM is pre-trained on the entire English-language Wikipedia corpus, which gives it a broad grasp of English and helps it interpret the intended meaning of a given document or passage.
We further pre-train this model on roughly one million text sequences drawn from our corpus of online vacancy postings, so that it is familiar with the distinctive vocabulary and phrasing of job ad text.
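The paper does not spell out the continued pre-training objective here, but DistilBERT-style models are typically pre-trained with masked language modeling: a fraction of tokens is hidden and the model learns to predict them from context. The sketch below illustrates the standard BERT-style masking rule on a toy job-ad sentence (the token list and mini-vocabulary are invented for illustration, not our actual pipeline):

```python
import random

MASK = "[MASK]"
# Toy replacement vocabulary for the "random token" branch (illustrative only).
VOCAB = ["remote", "work", "office", "hybrid", "role", "team"]

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of tokens as prediction targets.
    Of the selected tokens, 80% become [MASK], 10% become a random
    token, and 10% are left unchanged. Returns (masked, labels), where
    labels holds the original token at target positions and None elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover this token
            branch = rng.random()
            if branch < 0.8:
                masked.append(MASK)
            elif branch < 0.9:
                masked.append(rng.choice(VOCAB))
            else:
                masked.append(tok)
        else:
            masked.append(tok)
            labels.append(None)  # not a prediction target
    return masked, labels

tokens = "this role offers fully remote work with quarterly office visits".split()
masked, labels = mask_tokens(tokens)
print(masked)
```

Training on masked job-ad text in this way adapts the model's internal representations to vacancy-posting language before any remote-work labels are introduced.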
Finally, we fine-tune the model on 30,000 human-coded text extracts from job ads. Our human auditors were asked to flag text that indicates an offer of remote work, and these labels train the model to identify jobs offering remote work.
We also use the human-coded extracts to evaluate the model's predictive performance, and find that the final model achieves 99% accuracy relative to human coders.
For further information about our method, including a comparison of its performance against other text algorithms and recent generative AI models, see our paper: “Remote Work across Jobs, Companies, and Space” (2023).
Researchers and other non-commercial users can contact us to gain access to the underlying code and information used to construct the WHAM model.