This version brings crucial changes and improvements to the add.tokens method of FMAT_run(). Old versions are completely deprecated.
add.method parameter of FMAT_run(): Now it always computes the average of subword token embeddings, and a new parameter weight.decay (default value = 1, i.e., equally weighted) specifies the relative importance of multiple subwords.
See weight_decay() for computational details.
special_case() as an explicit function to better specify models requiring special treatment, which was previously just a parameter of FMAT_run().
add.tokens=TRUE. Note that FMAT_run() results with add.tokens=FALSE are not affected by this change.
FMAT_run().
ICC_models() to support ICC estimates of both log probability (raw) and log probability ratio (LPR).
Changed add.method of add.tokens from "sum" to "mean", relevant to BERT_vocab() and FMAT_run(). Using the averaged rather than the summed subword token embeddings for out-of-vocabulary tokens would have a smaller impact on the probability estimates of vocabulary tokens.
BERT_remove(): Remove models from the local cache folder.
fill_mask() and fill_mask_check(): These functions are only for technical checks (i.e., checking the raw results of the fill-mask pipeline). Normal users should usually use FMAT_run().
pattern.special argument for FMAT_run(): Regular expression patterns (matching model names) for special model cases that are uncased or require a special prefix character in certain situations.
prefix.u2581: adding prefix \u2581 for all mask words
prefix.u0120: adding prefix \u0120 for only non-starting mask words
set_cache_folder(), BERT_download(), BERT_info(), and BERT_info_date().
Results of BERT_info() and model initial commit dates scraped from HuggingFace by BERT_info_date() will be saved in subfolders of the local cache: /.info/ and /.date/, respectively.
FMAT_load().
library(FMAT):
Sys.setenv("HF_HUB_DISABLE_SYMLINKS_WARNING" = "1")
Sys.setenv("TF_ENABLE_ONEDNN_OPTS" = "0")
Sys.setenv("KMP_DUPLICATE_LIB_OK" = "TRUE")
Sys.setenv("OMP_NUM_THREADS" = "1")
set_cache_folder(): Set (change) HuggingFace cache folder temporarily.
BERT_info_date(): Scrape the initial commit date of BERT models from HuggingFace.
BERT_download() and BERT_info().
BERT_download() connects to the Internet, while all the other functions run offline.
BERT_info().
add.tokens and add.method arguments for BERT_vocab() and FMAT_run(): An experimental functionality to add new tokens (e.g., out-of-vocabulary words, compound words, or even phrases) as [MASK] options. Validation is still needed for this novel practice (one of my ongoing projects), so for now please use it only at your own risk until my validation work is published.
Functions other than BERT_download() now import local model files only, without automatically downloading models. Users must first use BERT_download() to download models.
FMAT_load(): Better to use FMAT_run() directly.
BERT_vocab() and ICC_models().
summary.fmat(), FMAT_query(), and FMAT_run() (significantly faster because now it can simultaneously estimate all [MASK] options for each unique query sentence, with running time depending only on the number of unique queries and not on the number of [MASK] options).
If you use reticulate package version ≥ 1.36.1, then FMAT should be updated to ≥ 2024.4. Otherwise, out-of-vocabulary [MASK] words may not be identified and marked. Now FMAT_run() directly uses model vocabulary and token ID to match [MASK] words. To check if a [MASK] word is in the model vocabulary, please use BERT_vocab().
BERT_download() (downloading models to local cache folder "%USERPROFILE%/.cache/huggingface") to differentiate from FMAT_load() (loading saved models from local cache). In fact, FMAT_load() can also download models silently if they have not been downloaded.
gpu argument (see Guidance for GPU Acceleration) in FMAT_run() to allow for specifying an NVIDIA GPU device on which the fill-mask pipeline will be allocated. A GPU runs the fill-mask pipeline roughly 3x faster than a CPU.
By default, FMAT_run() would automatically detect and use any available GPU with an installed CUDA-supported Python torch package (if not, it would use CPU).
FMAT_run().
BERT_download(), FMAT_load(), and FMAT_run().
parallel in FMAT_run(): FMAT_run(model.names, data, gpu=TRUE) is the fastest.
progress in FMAT_run().
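
The entries above on averaged versus summed subword token embeddings and on weight.decay can be illustrated with a small base-R sketch. It is not the package's implementation: combine_subwords() is a hypothetical helper, the toy embedding matrix is made up, and the geometric weighting (decay^0, decay^1, ...) is an assumption about how a decay factor might down-weight later subwords. See weight_decay() in the package for the actual computation.

```r
# Hypothetical helper (NOT part of FMAT): combine the embeddings of the
# subwords of one out-of-vocabulary token into a single embedding vector.
# emb: numeric matrix, one row per subword, one column per dimension.
combine_subwords = function(emb, method = c("mean", "sum"), decay = 1) {
  method = match.arg(method)
  if (method == "sum") return(colSums(emb))
  # ASSUMPTION: geometric weights decay^(0:(n-1)), normalized to sum to 1.
  # With decay = 1 every subword gets the same weight (a plain average).
  w = decay ^ (seq_len(nrow(emb)) - 1)
  w = w / sum(w)
  # A vector of length nrow(emb) is recycled down the columns,
  # so row i is multiplied by w[i].
  colSums(emb * w)
}

# Toy example: a token split into two subwords, 2-dimensional embeddings
emb = rbind(c(1, 2),
            c(3, 4))

summed   = combine_subwords(emb, "sum")                # 4 6
averaged = combine_subwords(emb, "mean")               # 2 3
decayed  = combine_subwords(emb, "mean", decay = 0.5)  # 5/3 8/3
```

In this toy case the averaged vector has a smaller norm than the summed one, which is in line with the note that averaging has a smaller impact on the probability estimates of existing vocabulary tokens. A decay below 1 pulls the result toward the first subword under the assumed weighting.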