- **Gender, number, and case**: Gender assignment (32A), coding of nominal plurality (33A), and the number of cases (49A).
- **Possession**: Obligatory possessive inflection (58A) and possessive classification (59A).
- **Tense and aspect**: Perfective/imperfective aspect (65A), past tense (66A), future tense (67A), and the perfect (68A); the full set is gathered into a single mapping in the sketch below.
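As a concrete reference, here is a minimal sketch that collects the nine features into one mapping and uses it to filter a tabular WALS export. It assumes pandas and a hypothetical `wals_export.csv` with one row per language and one column per feature ID; the mapping values are the official WALS chapter names.

```python
import pandas as pd

# The nine WALS features discussed above, keyed by feature ID.
WALS_FEATURES = {
    "32A": "Systems of Gender Assignment",
    "33A": "Coding of Nominal Plurality",
    "49A": "Number of Cases",
    "58A": "Obligatory Possessive Inflection",
    "59A": "Possessive Classification",
    "65A": "Perfective/Imperfective Aspect",
    "66A": "The Past Tense",
    "67A": "The Future Tense",
    "68A": "The Perfect",
}

# Hypothetical export: one row per language, one column per feature ID.
df = pd.read_csv("wals_export.csv", index_col="language")
subset = df[list(WALS_FEATURES)]      # keep only the nine features
print(subset.dropna().shape)          # languages with values for all nine
```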
This specific set of features is often used for the following purposes:
- **Probing model representations**: Using the WALS database features as labels to see if a model's internal representations (embeddings) cluster according to known linguistic traits, such as whether a language uses definite articles (see the sketch after this list).
- **Evaluating typological knowledge**: Testing if models like RoBERTa or XLM-RoBERTa have "learned" the typological rules of specific languages during pre-training.
- **Low-resource NLP**: Leveraging the broad cross-linguistic data in WALS to improve how models handle the hundreds of languages that lack large amounts of training text.
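To make the probing idea concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library and scikit-learn; the four sentences and the simplified labels for WALS feature 37A (Definite Articles) are illustrative toy data, not a real evaluation set. It embeds one sentence per language with XLM-RoBERTa and fits a logistic-regression probe to predict whether the language uses a definite article:

```python
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")
model.eval()

def embed(sentences):
    """Mean-pool the final hidden states into one vector per sentence."""
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # zero out padding tokens
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Toy data: one sentence each for English, German, Polish, and Russian,
# labelled with a simplified value of WALS 37A (Definite Articles).
sentences = ["The dog sleeps.", "Der Hund schläft.", "Pies śpi.", "Собака спит."]
labels = ["definite", "definite", "none", "none"]

# If the probe predicts the label above chance, the embeddings encode the trait.
X = embed(sentences)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=2)
print(f"probe accuracy: {scores.mean():.2f}")
```

Mean pooling over the final layer is only one design choice; probing studies often repeat the experiment per layer, since linguistic signal frequently peaks in the middle layers.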
For more information on the specific data points, you can explore the Official WALS Features List or the WALS-Bench dataset on Hugging Face.
