| Title: | The Fill-Mask Association Test |
|---|---|
| Description: | The Fill-Mask Association Test ('FMAT') <doi:10.1037/pspa0000396> is an integrative, probability-based social computing method using Masked Language Models to measure conceptual associations (e.g., attitudes, biases, stereotypes, social norms, cultural values) as propositional semantic representations in natural language. Supported language models include 'BERT' <doi:10.48550/arXiv.1810.04805> and its variants available at 'Hugging Face' <https://huggingface.co/models?pipeline_tag=fill-mask>. Methodological references and installation guidance are provided at <https://psychbruce.github.io/FMAT/>. |
| Authors: | Han Wu Shuang Bao [aut, cre] (ORCID: <https://orcid.org/0000-0003-3043-710X>) |
| Maintainer: | Han Wu Shuang Bao <[email protected]> |
| License: | GPL-3 |
| Version: | 2026.1 |
| Built: | 2026-06-05 07:13:33 UTC |
| Source: | https://github.com/psychbruce/fmat |
Download and save BERT models to local cache folder "%USERPROFILE%/.cache/huggingface".
BERT_download(models = NULL, verbose = FALSE)
models |
A character vector of model names at HuggingFace. |
verbose |
Alert if a model has been downloaded. Defaults to FALSE. |
Invisibly return a data.table of basic file information of local models.
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_download(models)
BERT_download()  # check downloaded models
BERT_info()  # information of all downloaded models
## End(Not run)
Get basic information of BERT models.
BERT_info(models = NULL)
models |
A character vector of model names at HuggingFace. |
A data.table:
model name
model type
number of parameters
vocabulary size (of input token embeddings)
embedding dimensions (of input token embeddings)
hidden layers
attention heads
[MASK] token
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)
BERT_info()  # information of all downloaded models
# speed: ~1.2s/model for first use; <1s afterwards
## End(Not run)
Scrape the initial commit date of BERT models.
BERT_info_date(models = NULL)
models |
A character vector of model names at HuggingFace. |
A data.table:
model name
initial commit date (scraped from huggingface commit history)
## Not run:
model.date = BERT_info_date()
# get all models from cache folder
one.model.date = FMAT:::get_model_date("bert-base-uncased")
# call the internal function to scrape a model
# that may not have been saved in cache folder
## End(Not run)
Remove BERT models from local cache folder.
BERT_remove(models)
models |
Model names. |
NULL.
Check if mask words are in the model vocabulary.
BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.verbose = FALSE,
  weight.decay = 1
)
models |
A character vector of model names at HuggingFace. |
mask.words |
Option words filling in the mask. |
add.tokens |
Add new tokens (for out-of-vocabulary words or phrases) to model vocabulary? Defaults to FALSE. |
add.verbose |
Print subwords of each new token? Defaults to FALSE. |
weight.decay |
Decay factor of relative importance of multiple subwords. Defaults to 1.
For example, decay = 0.5 would give raw weights of 0.5 and 0.25 (normalized weights 0.667 and 0.333) to two subwords (e.g., "individualism" = 0.667 "individual" + 0.333 "##ism"). |
A data.table of model name, mask word, real token (replaced if out of vocabulary), and token id (0~N).
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)
BERT_vocab(models, c("bruce", "Bruce"))
BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab
BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
## End(Not run)
These functions are only for technical checks. Please use FMAT_run() for general purposes.
fill_mask(query, model, targets = NULL, topn = 5, gpu)
fill_mask_check(query, models, targets = NULL, topn = 5, gpu)
query |
Query sentence with mask token. |
model, models
|
Model name(s). |
targets |
Target words to fill in the mask. Defaults to NULL. |
topn |
Number of the most likely predictions to return. Defaults to 5. |
gpu |
Use GPU (3x faster than CPU) to run the fill-mask pipeline? Defaults to missing value that will automatically use available GPU (if not available, then use CPU). An NVIDIA GPU device (e.g., GeForce RTX Series) is required to use GPU. See Guidance for GPU Acceleration. Options passing on to the
|
A data.table of raw results.
fill_mask(): Check performance of one model.
fill_mask_check(): Check performance of multiple models.
## Not run:
query = "Paris is the [MASK] of France."
models = c("bert-base-uncased", "bert-base-cased")
d.check = fill_mask_check(query, models, topn=2)
## End(Not run)
Prepare a data.table of queries and variables for the FMAT.
FMAT_query(
  query = "Text with [MASK], optionally with {TARGET} and/or {ATTRIB}.",
  MASK = .(),
  TARGET = .(),
  ATTRIB = .()
)
query |
Query text (should be a character string/vector with at least one [MASK] token). |
MASK |
A named list of [MASK] option words. |
TARGET, ATTRIB
|
A named list of Target/Attribute words or phrases. If specified, then query must contain {TARGET} and/or {ATTRIB}. |
A data.table of queries and variables.
FMAT_query("[MASK] is a nurse.", MASK = .(Male="He", Female="She"))
FMAT_query(
  c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  MASK = .(Male="He", Female="She"),
  TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
)
FMAT_query(
  "The [MASK] {ATTRIB}.",
  MASK = .(Male=c("man", "boy"),
           Female=c("woman", "girl")),
  ATTRIB = .(Masc=c("is masculine", "has a masculine personality"),
             Femi=c("is feminine", "has a feminine personality"))
)
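The way a query template, MASK words, and TARGET phrases combine can be pictured as a full crossing. The base-R sketch below is only an illustration under that assumption: the helper fill_slot() and the column names text and sentence are hypothetical, and the actual data.table returned by FMAT_query() may be laid out differently.

```r
# Illustrative sketch only; not the package's implementation.
queries <- c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}.")
mask_words <- c(Male = "He", Female = "She")
target_words <- c("a doctor", "a nurse", "an artist")

# Cross every query with every mask word and every target phrase.
grid <- expand.grid(
  query = queries,
  M_word = unname(mask_words),
  T_word = target_words,
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace a literal slot in each text with its paired word.
fill_slot <- function(text, slot, word) {
  mapply(function(s, w) sub(slot, w, s, fixed = TRUE),
         text, word, USE.NAMES = FALSE)
}

grid$text <- fill_slot(grid$query, "{TARGET}", grid$T_word)
grid$sentence <- fill_slot(grid$text, "[MASK]", grid$M_word)

print(nrow(grid))  # 2 queries x 2 mask words x 3 targets = 12 rows
print(head(grid$sentence, 3))
```

Each row corresponds to one sentence whose mask-word probability a masked language model would be asked to estimate.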
Combine multiple query data.tables and renumber query ids.
FMAT_query_bind(...)
... |
Query data.tables returned from FMAT_query(). |
A data.table of queries and variables.
FMAT_query_bind(
  FMAT_query(
    "[MASK] is {TARGET}.",
    MASK = .(Male="He", Female="She"),
    TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
  ),
  FMAT_query(
    "[MASK] occupation is {TARGET}.",
    MASK = .(Male="His", Female="Her"),
    TARGET = .(Occupation=c("doctor", "nurse", "artist"))
  )
)
Run the fill-mask pipeline on multiple models with CPU or GPU (faster but requires an NVIDIA GPU device).
FMAT_run(
  models,
  data,
  gpu,
  add.tokens = FALSE,
  add.verbose = FALSE,
  weight.decay = 1,
  pattern.special = special_case(),
  file = NULL,
  progress = TRUE,
  warning = TRUE,
  na.out = TRUE
)
models |
A character vector of model names at HuggingFace. |
data |
A data.table returned from FMAT_query() or FMAT_query_bind(). |
gpu |
Use GPU (3x faster than CPU) to run the fill-mask pipeline? Defaults to missing value that will automatically use available GPU (if not available, then use CPU). An NVIDIA GPU device (e.g., GeForce RTX Series) is required to use GPU. See Guidance for GPU Acceleration. Options passing on to the
|
add.tokens |
Add new tokens (for out-of-vocabulary words or phrases) to model vocabulary? Defaults to FALSE. |
add.verbose |
Print subwords of each new token? Defaults to FALSE. |
weight.decay |
Decay factor of relative importance of multiple subwords. Defaults to 1.
For example, decay = 0.5 would give raw weights of 0.5 and 0.25 (normalized weights 0.667 and 0.333) to two subwords (e.g., "individualism" = 0.667 "individual" + 0.333 "##ism"). |
pattern.special |
See special_case(). |
file |
File name of |
progress |
Show a progress bar? Defaults to TRUE. |
warning |
Alert warning of out-of-vocabulary word(s)? Defaults to TRUE. |
na.out |
Replace probabilities of out-of-vocabulary word(s) with NA? Defaults to TRUE. |
The function automatically adjusts for the compatibility of tokens used in certain models: (1) for uncased models (e.g., ALBERT), it converts tokens to lowercase; (2) for models that use <mask> rather than [MASK], it automatically uses the correct mask token; (3) for models that require a prefix to estimate whole words rather than subwords (e.g., ALBERT, RoBERTa), it adds a white space before each mask option word. See special_case() for details.
These changes only affect the token variable in the returned data, but will not affect the M_word variable. Thus, users may analyze data based on the unchanged M_word rather than the token.
Note also that there may be extremely trivial differences (after 5~6 significant digits) in the raw probability estimates between using CPU and GPU, but these differences would have little impact on main results.
A data.table (class fmat) appending data with these new variables:
model: model name.
output: complete sentence output with unmasked token.
token: actual token to be filled in the blank mask
(a note "out-of-vocabulary" will be added
if the original word is not found in the model vocabulary).
prob: (raw) conditional probability of the unmasked token given the provided context, estimated by the masked language model.
Raw probabilities should NOT be directly used or interpreted. Please use summary.fmat() to contrast between a pair of probabilities.
## Running the examples requires the models downloaded
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
query1 = FMAT_query(
  c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  MASK = .(Male="He", Female="She"),
  TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
)
data1 = FMAT_run(models, query1)
summary(data1, target.pair=FALSE)
query2 = FMAT_query(
  "The [MASK] {ATTRIB}.",
  MASK = .(Male=c("man", "boy"),
           Female=c("woman", "girl")),
  ATTRIB = .(Masc=c("is masculine", "has a masculine personality"),
             Femi=c("is feminine", "has a feminine personality"))
)
data2 = FMAT_run(models, query2)
summary(data2, mask.pair=FALSE)
summary(data2)
## End(Not run)
Interrater agreement of log probabilities (treated as "ratings"/rows) among BERT language models (treated as "raters"/columns), with both row and column as ("two-way") random effects.
ICC_models(data, type = "agreement", unit = "average")
data |
Raw data returned from FMAT_run(). |
type |
Type of interrater reliability. Defaults to "agreement". |
unit |
Reliability of "average" scores (default). |
A data.frame of ICC.
Reliability analysis (Cronbach's α) of LPR.
LPR_reliability(fmat, item = c("query", "T_word", "A_word"), by = NULL)
fmat |
A data.table returned from |
item |
Reliability of multiple "query" (default), "T_word", or "A_word". |
by |
Variable(s) to split data by. Options can be |
A data.frame of Cronbach's α.
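For readers unfamiliar with the statistic, the sketch below computes Cronbach's α with the standard textbook formula, α = k/(k-1) × (1 - Σ item variances / variance of total scores), treating multiple queries as items. It is a base-R illustration with made-up LPR values; cronbach_alpha() and lpr_items are hypothetical names, and LPR_reliability() may compute the coefficient through a different routine.

```r
# Illustrative sketch only; values are invented, not model output.
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)                      # number of items (e.g., queries)
  item_var <- apply(items, 2, var)      # variance of each item
  total_var <- var(rowSums(items))      # variance of the summed score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Rows: hypothetical cases (e.g., models); columns: hypothetical query items.
lpr_items <- cbind(
  query1 = c(0.8, 0.1, -0.4, 1.2, 0.3),
  query2 = c(0.9, 0.0, -0.2, 1.0, 0.5),
  query3 = c(0.6, 0.2, -0.5, 1.4, 0.1)
)

alpha <- cronbach_alpha(lpr_items)
print(round(alpha, 3))
```

Items that rank the cases similarly yield α close to 1; identical items yield exactly 1.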
This function allows you to change the default cache directory (when it lacks storage space) to another path (e.g., your portable SSD) temporarily.
set_cache_folder(path = NULL)
path |
Folder path to store HuggingFace models. If |
This function takes effect only for the current R session temporarily, so you should run this each time BEFORE you use other FMAT functions in an R session.
## Not run:
library(FMAT)
set_cache_folder("D:/huggingface_cache/")
# -> models would be saved to "D:/huggingface_cache/hub/"
# run this function each time before using FMAT functions
BERT_download()
BERT_info()
## End(Not run)
Specify models that require special treatment to ensure accuracy.
special_case(
  uncased = "uncased|albert|electra|muhtasham",
  u2581 = "albert|xlm-roberta|xlnet",
  u2581.excl = "chinese",
  u0120 = "roberta|bart|deberta|bertweet-large|ModernBERT",
  u0120.excl = "chinese|xlm-|kornosk/"
)
uncased |
Regular expression pattern (matching model names) for uncased models. |
u2581, u0120
|
Regular expression pattern (matching model names) for models that require a special prefix character when performing whole-word fill-mask pipeline. WARNING: The developer is unable to check all models, so users need to check the models they use and modify these parameters if necessary.
|
u2581.excl, u0120.excl
|
Exclusions to negate matches of u2581 and u0120, respectively. |
A list of regular expression patterns.
special_case()
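To check how the default patterns would classify the models you use, the patterns can be tested against model names with base-R regular expressions. This is a sketch under the assumption that matching is a plain, case-sensitive grepl() on the model name with the exclusion pattern applied afterwards; matches_case() is a hypothetical helper, not a package function.

```r
# Illustrative sketch only; the package's internal matching may differ.
patterns <- list(
  uncased = "uncased|albert|electra|muhtasham",
  u2581 = "albert|xlm-roberta|xlnet",
  u2581.excl = "chinese",
  u0120 = "roberta|bart|deberta|bertweet-large|ModernBERT",
  u0120.excl = "chinese|xlm-|kornosk/"
)

# Hypothetical helper: TRUE if the name matches `include` but not `exclude`.
matches_case <- function(model, include, exclude = NULL) {
  hit <- grepl(include, model)
  if (!is.null(exclude)) hit <- hit & !grepl(exclude, model)
  hit
}

models <- c("bert-base-uncased", "roberta-base",
            "xlm-roberta-base", "albert-base-v2")

flags <- data.frame(
  model = models,
  uncased = matches_case(models, patterns$uncased),
  u2581 = matches_case(models, patterns$u2581, patterns$u2581.excl),
  u0120 = matches_case(models, patterns$u0120, patterns$u0120.excl)
)
print(flags)
```

Note how "xlm-roberta-base" contains "roberta" yet is removed from the u0120 group by the "xlm-" exclusion, which is the purpose of the .excl parameters.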
Summarize the results of Log Probability Ratio (LPR), which indicates the relative (vs. absolute) association between concepts.
## S3 method for class 'fmat'
summary(
  object,
  mask.pair = TRUE,
  target.pair = TRUE,
  attrib.pair = TRUE,
  warning = TRUE,
  ...
)
object |
A data.table (class fmat) returned from FMAT_run(). |
mask.pair, target.pair, attrib.pair
|
Pairwise contrast of [MASK], TARGET, and ATTRIB? Defaults to TRUE. |
warning |
Alert warning of out-of-vocabulary word(s)? Defaults to TRUE. |
... |
Other arguments (currently not used). |
The LPR of just one contrast (e.g., only between a pair of attributes) may not be sufficient for a proper interpretation of the results, and may further require a second contrast (e.g., between a pair of targets).
Users are suggested to use linear mixed models (with the R packages nlme or lme4/lmerTest) to perform the formal analyses and hypothesis tests based on the LPR.
A data.table of the summarized results with Log Probability Ratio (LPR).
# see examples in `FMAT_run`
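The idea of contrasting a pair of probabilities, and then contrasting two such contrasts, can be worked through by hand. The sketch below uses invented probabilities and assumes the natural logarithm and a "He minus She" direction; the helper lpr() is hypothetical and the output layout of summary() is not reproduced here.

```r
# Illustrative sketch only; probabilities are invented, not model output.
probs <- data.frame(
  T_word = c("a doctor", "a doctor", "a nurse", "a nurse"),
  M_word = c("He", "She", "He", "She"),
  prob = c(0.60, 0.30, 0.10, 0.80)
)
probs$logp <- log(probs$prob)

# Hypothetical helper: log probability ratio of "He" vs. "She" for one target.
lpr <- function(target) {
  he <- probs$logp[probs$T_word == target & probs$M_word == "He"]
  she <- probs$logp[probs$T_word == target & probs$M_word == "She"]
  he - she
}

lpr_doctor <- lpr("a doctor")              # log(0.60 / 0.30) = log(2)
lpr_nurse <- lpr("a nurse")                # log(0.10 / 0.80) = log(1/8)
double_contrast <- lpr_doctor - lpr_nurse  # log(2) - log(1/8) = log(16)

print(c(doctor = lpr_doctor, nurse = lpr_nurse, double = double_contrast))
```

The single contrast for "a doctor" says little on its own; the second contrast against "a nurse" is what expresses a relative association between gender and occupation.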
Compute a vector of weights with a decay rate.
weight_decay(vector, decay)
vector |
Vector of sequence. |
decay |
Decay factor for computing weights. A smaller decay value would give greater weight to the former items than to the latter items. The i-th item has raw weight = decay ^ i.
|
Normalized weights (i.e., sum of weights = 1).
# "individualism"
weight_decay(c("individual", "##ism"), 0.5)
weight_decay(c("individual", "##ism"), 0.8)
weight_decay(c("individual", "##ism"), 1)
weight_decay(c("individual", "##ism"), 2)
# "East Asian people"
weight_decay(c("East", "Asian", "people"), 0.5)
weight_decay(c("East", "Asian", "people"), 0.8)
weight_decay(c("East", "Asian", "people"), 1)
weight_decay(c("East", "Asian", "people"), 2)
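The rule stated above (raw weight of the i-th item = decay ^ i, then normalized to sum to 1) can be reproduced in a few lines of base R. This is a sketch of that rule only; decay_weights() is a hypothetical stand-in and is not the source code of weight_decay().

```r
# Illustrative sketch only; follows the documented rule raw weight = decay ^ i.
decay_weights <- function(items, decay) {
  i <- seq_along(items)   # positions 1, 2, ..., n
  raw <- decay^i          # raw weights
  w <- raw / sum(raw)     # normalize so that the weights sum to 1
  names(w) <- items
  w
}

w_half <- decay_weights(c("individual", "##ism"), 0.5)  # raw 0.5, 0.25
w_one <- decay_weights(c("East", "Asian", "people"), 1) # equal weights
print(round(w_half, 3))
print(round(w_one, 3))
```

With decay = 0.5 the two subwords receive 2/3 and 1/3, matching the 0.667 and 0.333 quoted for "individualism"; decay = 1 weights all items equally, and decay > 1 favors later items.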