Title: | The Fill-Mask Association Test |
---|---|
Description: | The Fill-Mask Association Test ('FMAT') <doi:10.1037/pspa0000396> is an integrative and probability-based method using Masked Language Models to measure conceptual associations (e.g., attitudes, biases, stereotypes, social norms, cultural values) as propositions in natural language. Supported language models include 'BERT' <doi:10.48550/arXiv.1810.04805> and its variants available at 'Hugging Face' <https://huggingface.co/models?pipeline_tag=fill-mask>. Methodological references and installation guidance are provided at <https://psychbruce.github.io/FMAT/>. |
Authors: | Han-Wu-Shuang Bao [aut, cre] |
Maintainer: | Han-Wu-Shuang Bao <[email protected]> |
License: | GPL-3 |
Version: | 2024.7 |
Built: | 2024-10-30 07:26:53 UTC |
Source: | https://github.com/psychbruce/fmat |
A simple function equivalent to list.
.(...)
... |
Named objects (usually character vectors for this package). |
A list of named objects.
.(Male=c("he", "his"), Female=c("she", "her"))
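As a rough illustration of the equivalence, the sketch below defines a local stand-in with the same name and compares its result with a plain list() call. The stand-in is an assumption made so the example runs without the package installed; the package supplies its own version.

```r
# Hypothetical stand-in for the package's `.` helper (assumption for
# illustration; load the package to use the real one).
. <- function(...) list(...)

groups <- .(Male = c("he", "his"), Female = c("she", "her"))

# The result has the same structure as a plain list() call
same <- identical(
  groups,
  list(Male = c("he", "his"), Female = c("she", "her"))
)
print(names(groups))
print(same)
```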
Download and save BERT models to the local cache folder "%USERPROFILE%/.cache/huggingface".
BERT_download(models = NULL)
models |
Model names at HuggingFace. |
No return value.
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_download(models)

BERT_download()  # check downloaded models

BERT_info()  # information of all downloaded models
## End(Not run)
Get basic information of BERT models.
BERT_info(models = NULL)
models |
Model names at HuggingFace. |
A data.table of model name, model file size, vocabulary size (of word/token embeddings), embedding dimensions (of word/token embeddings), and [MASK] token.
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

BERT_info()  # information of all downloaded models
## End(Not run)
Check if mask words are in the model vocabulary.
BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean")
)
models |
Model names at HuggingFace. |
mask.words |
Option words filling in the mask. |
add.tokens |
Add new tokens (for out-of-vocabulary words or even phrases) to model vocabulary?
Defaults to FALSE. |
add.method |
Method used to produce the token embeddings of newly added tokens.
Can be "sum" or "mean". |
A data.table of model name, mask word, real token (replaced if out of vocabulary), and token id (0~N).
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)
## End(Not run)
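The two add.method options can be pictured with a small base R sketch. It assumes, as a guess about the general approach rather than a description of the package internals, that the embedding of a newly added token is built by combining the embeddings of its existing subword pieces. The embedding values and the helper name new_token_embedding are made up for illustration.

```r
# Made-up 3-dimensional embeddings for two subword pieces (illustration only)
sub_emb <- rbind(
  artificial   = c(0.2, -0.1,  0.4),
  intelligence = c(0.6,  0.3, -0.2)
)

# Hypothetical helper: combine subword embeddings into one new embedding
new_token_embedding <- function(emb, method = c("sum", "mean")) {
  method <- match.arg(method)
  if (method == "sum") colSums(emb) else colMeans(emb)
}

emb_sum  <- new_token_embedding(sub_emb, "sum")
emb_mean <- new_token_embedding(sub_emb, "mean")
print(emb_sum)
print(emb_mean)
```

With two pieces, the "mean" embedding is half the "sum" embedding, so the two options differ in scale but not in direction.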
Load BERT models from the local cache folder "%USERPROFILE%/.cache/huggingface".
For GPU acceleration, please use FMAT_run directly.
In general, FMAT_run is always preferred over FMAT_load.
FMAT_load(models)
models |
Model names at HuggingFace. |
A named list of fill-mask pipelines obtained from the models. The returned object cannot be saved as any RData. You will need to rerun this function if you restart the R session.
## Not run:
models = c("bert-base-uncased", "bert-base-cased")
models = FMAT_load(models)  # load models from cache
## End(Not run)
Prepare a data.table of queries and variables for the FMAT.
FMAT_query(
  query = "Text with [MASK], optionally with {TARGET} and/or {ATTRIB}.",
  MASK = .(),
  TARGET = .(),
  ATTRIB = .()
)
query |
Query text (should be a character string/vector
with at least one |
MASK |
A named list of For model vocabulary, see, e.g., https://huggingface.co/bert-base-uncased/raw/main/vocab.txt Infrequent words may not be included in a model's vocabulary,
and in this case you may insert the words into the context by
specifying either |
TARGET, ATTRIB |
A named list of Target/Attribute words or phrases.
If specified, then |
A data.table of queries and variables.
FMAT_query("[MASK] is a nurse.", MASK = .(Male="He", Female="She"))

FMAT_query(
  c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  MASK = .(Male="He", Female="She"),
  TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
)

FMAT_query(
  "The [MASK] {ATTRIB}.",
  MASK = .(Male=c("man", "boy"),
           Female=c("woman", "girl")),
  ATTRIB = .(Masc=c("is masculine", "has a masculine personality"),
             Femi=c("is feminine", "has a feminine personality"))
)
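To make the template expansion concrete, the following base R sketch crosses two query templates with two mask words and three target phrases. It only illustrates the idea of one row per combination and is not the package's implementation; the column names filled and output are chosen here for illustration, while M_word and T_word follow the variable names mentioned elsewhere in this manual.

```r
# Illustrative only: one row per query x MASK word x TARGET phrase
queries <- c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}.")
mask    <- c(Male = "He", Female = "She")
target  <- c("a doctor", "a nurse", "an artist")

grid <- expand.grid(
  query  = queries,
  M_word = unname(mask),
  T_word = target,
  stringsAsFactors = FALSE
)

# Fill the {TARGET} slot; [MASK] stays in place for the language model
grid$filled <- mapply(
  function(q, t) sub("{TARGET}", t, q, fixed = TRUE),
  grid$query, grid$T_word,
  USE.NAMES = FALSE
)

# Sentence obtained when the mask is replaced by each candidate word
grid$output <- mapply(
  function(s, m) sub("[MASK]", m, s, fixed = TRUE),
  grid$filled, grid$M_word,
  USE.NAMES = FALSE
)

print(nrow(grid))
print(grid$output[1])
```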
Combine multiple query data.tables and renumber query ids.
FMAT_query_bind(...)
... |
Query data.tables returned from |
A data.table of queries and variables.
FMAT_query_bind(
  FMAT_query(
    "[MASK] is {TARGET}.",
    MASK = .(Male="He", Female="She"),
    TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
  ),
  FMAT_query(
    "[MASK] occupation is {TARGET}.",
    MASK = .(Male="His", Female="Her"),
    TARGET = .(Occupation=c("doctor", "nurse", "artist"))
  )
)
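The renumbering step can be sketched in base R with plain data frames. The column name qid is hypothetical (chosen for this example), and the code shows the general idea of assigning fresh consecutive ids after stacking, not the package's own code.

```r
# Two small query tables, each numbered from 1 (hypothetical `qid` column)
q1 <- data.frame(
  qid   = 1:2,
  query = c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  stringsAsFactors = FALSE
)
q2 <- data.frame(
  qid   = 1L,
  query = "[MASK] occupation is {TARGET}.",
  stringsAsFactors = FALSE
)

# Stack the tables, then renumber so each distinct query gets a unique id
combined <- rbind(q1, q2)
combined$qid <- match(combined$query, unique(combined$query))
print(combined)
```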
Run the fill-mask pipeline on multiple models with CPU or GPU (faster but requiring an NVIDIA GPU device).
FMAT_run(
  models,
  data,
  gpu,
  add.tokens = FALSE,
  add.method = c("sum", "mean"),
  file = NULL,
  progress = TRUE,
  warning = TRUE,
  na.out = TRUE
)
models |
Options:
|
data |
A data.table returned from |
gpu |
Use GPU (3x faster than CPU) to run the fill-mask pipeline? Defaults to a missing value, in which case an available GPU is used automatically (falling back to CPU if none is available). An NVIDIA GPU device (e.g., GeForce RTX Series) is required to use GPU. See Guidance for GPU Acceleration. Options passing to the
|
add.tokens |
Add new tokens (for out-of-vocabulary words or even phrases) to model vocabulary?
Defaults to FALSE. |
add.method |
Method used to produce the token embeddings of newly added tokens.
Can be "sum" or "mean". |
file |
File name of |
progress |
Show a progress bar? Defaults to TRUE. |
warning |
Alert warning of out-of-vocabulary word(s)? Defaults to TRUE. |
na.out |
Replace probabilities of out-of-vocabulary word(s) with |
The function automatically adjusts for the compatibility of tokens used in certain models:
(1) for uncased models (e.g., ALBERT), it turns tokens to lowercase;
(2) for models that use <mask> rather than [MASK], it automatically uses the correct mask token;
(3) for models that require a prefix to estimate whole words rather than subwords (e.g., ALBERT, RoBERTa), it adds a certain prefix (usually a white space; \u2581 for ALBERT and XLM-RoBERTa, \u0120 for RoBERTa and DistilRoBERTa).
Note that these changes only affect the token variable in the returned data, but will not affect the M_word variable. Thus, users may analyze data based on the unchanged M_word rather than the token.
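The three adjustments described above can be mimicked with a few lines of base R. The helper names adjust_token and adjust_query are hypothetical and exist only for this sketch; the function itself handles these details automatically, so users do not need to write such code.

```r
# Hypothetical helpers mirroring the three adjustments (illustration only)
adjust_token <- function(word, uncased = FALSE, prefix = "") {
  if (uncased) word <- tolower(word)  # (1) uncased models: lowercase
  paste0(prefix, word)                # (3) prefix marking a whole word
}

adjust_query <- function(query, mask.token = "[MASK]") {
  # (2) swap in the mask token the model expects, e.g. "<mask>"
  gsub("[MASK]", mask.token, query, fixed = TRUE)
}

tok_uncased <- adjust_token("He", uncased = TRUE)
tok_roberta <- adjust_token("He", prefix = "\u0120")
tok_albert  <- adjust_token("He", uncased = TRUE, prefix = "\u2581")
qry_roberta <- adjust_query("[MASK] is a nurse.", mask.token = "<mask>")

print(tok_uncased)
print(qry_roberta)
```

In this sketch the original word ("He") is left untouched in the caller's data, which parallels the note that only the token variable changes while M_word stays as entered.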
Note also that there may be extremely trivial differences (after 5~6 significant digits) in the raw probability estimates between using CPU and GPU, but these differences would have little impact on main results.
A data.table (of new class fmat) appending data with these new variables:
model: model name.
output: complete sentence output with unmasked token.
token: actual token to be filled in the blank mask (a note "out-of-vocabulary" will be added if the original word is not found in the model vocabulary).
prob: (raw) conditional probability of the unmasked token given the provided context, estimated by the masked language model. It is NOT SUGGESTED to directly interpret the raw probabilities because the contrast between a pair of probabilities is more interpretable. See summary.fmat.
FMAT_load (deprecated)
## Running the examples requires the models downloaded

## Not run:
models = c("bert-base-uncased", "bert-base-cased")

query1 = FMAT_query(
  c("[MASK] is {TARGET}.", "[MASK] works as {TARGET}."),
  MASK = .(Male="He", Female="She"),
  TARGET = .(Occupation=c("a doctor", "a nurse", "an artist"))
)
data1 = FMAT_run(models, query1)
summary(data1, target.pair=FALSE)

query2 = FMAT_query(
  "The [MASK] {ATTRIB}.",
  MASK = .(Male=c("man", "boy"),
           Female=c("woman", "girl")),
  ATTRIB = .(Masc=c("is masculine", "has a masculine personality"),
             Femi=c("is feminine", "has a feminine personality"))
)
data2 = FMAT_run(models, query2)
summary(data2, mask.pair=FALSE)
summary(data2)
## End(Not run)
Interrater agreement of log probabilities (treated as "ratings"/rows) among BERT language models (treated as "raters"/columns), with both row and column as ("two-way") random effects.
ICC_models(data, type = "agreement", unit = "average")
data |
Raw data returned from |
type |
Interrater |
unit |
Reliability of |
A data.table of ICC.
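For readers unfamiliar with the statistic, the sketch below computes a two-way random-effects ICC for absolute agreement from mean squares, using the standard Shrout and Fleiss ICC(2,1) and ICC(2,k) formulas. It is a self-contained base R illustration, not the package's code; the helper name icc_agreement is hypothetical and the log probabilities are made-up numbers.

```r
# Hypothetical helper: two-way random-effects ICC, absolute agreement
icc_agreement <- function(x, unit = c("average", "single")) {
  unit <- match.arg(unit)
  x <- as.matrix(x)
  n <- nrow(x)  # rated items (rows)
  k <- ncol(x)  # raters, here language models (columns)
  gm <- mean(x)
  ss_rows <- k * sum((rowMeans(x) - gm)^2)
  ss_cols <- n * sum((colMeans(x) - gm)^2)
  ss_err  <- sum((x - gm)^2) - ss_rows - ss_cols
  msr <- ss_rows / (n - 1)
  msc <- ss_cols / (k - 1)
  mse <- ss_err / ((n - 1) * (k - 1))
  if (unit == "average") {
    (msr - mse) / (msr + (msc - mse) / n)
  } else {
    (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
  }
}

# Made-up log probabilities: 4 items (rows) rated by 3 models (columns)
lp <- cbind(
  model_a = c(-1.2, -2.5, -0.7, -3.1),
  model_b = c(-1.0, -2.8, -0.9, -3.0),
  model_c = c(-1.4, -2.4, -0.5, -3.3)
)
icc_avg    <- icc_agreement(lp, unit = "average")
icc_single <- icc_agreement(lp, unit = "single")
print(c(average = icc_avg, single = icc_single))
```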
Reliability analysis (Cronbach's α) of LPR.
LPR_reliability(fmat, item = c("query", "T_word", "A_word"), by = NULL)
fmat |
A data.table returned from |
item |
Reliability of multiple |
by |
Variable(s) to split data by.
Options can be |
A data.table of Cronbach's α.
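As a reminder of what the coefficient measures, Cronbach's α for k items is k/(k-1) multiplied by one minus the ratio of summed item variances to the variance of the total score. The base R sketch below applies that textbook formula to made-up LPR values; the helper name cronbach_alpha and the data are illustrative assumptions, not part of the package.

```r
# Hypothetical helper: textbook Cronbach's alpha (columns are items)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_var  <- apply(items, 2, var)   # variance of each item
  total_var <- var(rowSums(items))    # variance of the summed score
  k / (k - 1) * (1 - sum(item_var) / total_var)
}

# Made-up LPR values: rows are observations, columns are three query items
lpr_items <- cbind(
  query1 = c(0.8, -0.3, 1.5, 0.2, -1.1),
  query2 = c(0.6, -0.1, 1.2, 0.4, -0.9),
  query3 = c(1.0, -0.5, 1.4, 0.1, -1.3)
)
alpha <- cronbach_alpha(lpr_items)
print(alpha)
```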
Summarize the results of Log Probability Ratio (LPR), which indicates the relative (vs. absolute) association between concepts.
The LPR of just one contrast (e.g., only between a pair of attributes) may not be sufficient for a proper interpretation of the results, and may further require a second contrast (e.g., between a pair of targets).
Users are advised to use linear mixed models (with the R packages nlme or lme4/lmerTest) to perform the formal analyses and hypothesis tests based on the LPR.
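The two-step contrast logic can be worked through by hand. The sketch below assumes the LPR is the difference between the log probabilities of two mask words in the same context, and then takes a second contrast between two targets. The probabilities are made-up numbers for illustration, not model output, and the code is not the summary method itself.

```r
# Made-up probabilities for illustration only (not model output)
d <- data.frame(
  T_word = c("a doctor", "a doctor", "a nurse", "a nurse"),
  M_word = c("He", "She", "He", "She"),
  prob   = c(0.60, 0.30, 0.10, 0.80),
  stringsAsFactors = FALSE
)

# First contrast: LPR between the two mask words within each target
lpr <- sapply(split(d, d$T_word), function(x) {
  log(x$prob[x$M_word == "He"]) - log(x$prob[x$M_word == "She"])
})

# Second contrast: difference in LPR between the two targets
lpr_diff <- lpr[["a doctor"]] - lpr[["a nurse"]]

print(lpr)
print(lpr_diff)
```

A positive LPR here means the first mask word is relatively more probable than the second in that context; the second contrast asks whether this relative preference differs between targets.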
## S3 method for class 'fmat'
summary(
  object,
  mask.pair = TRUE,
  target.pair = TRUE,
  attrib.pair = TRUE,
  warning = TRUE,
  ...
)
object |
A data.table (of new class |
mask.pair, target.pair, attrib.pair |
Pairwise contrast of
|
warning |
Alert warning of out-of-vocabulary word(s)? Defaults to TRUE. |
... |
Other arguments (currently not used). |
A data.table of the summarized results with Log Probability Ratio (LPR).
# see examples in `FMAT_run`