Title: | Chinese Name Database 1930-2008 |
---|---|
Description: | A database of Chinese surnames and Chinese given names (1930-2008). This database contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008). This package also contains a function for computing multiple features of Chinese surnames and Chinese given names for scientific research (e.g., name uniqueness, name gender, name valence, and name warmth/competence). |
Authors: | Han-Wu-Shuang Bao [aut, cre] |
Maintainer: | Han-Wu-Shuang Bao <[email protected]> |
License: | GPL-3 |
Version: | 2023.8 |
Built: | 2024-11-20 03:36:43 UTC |
Source: | https://github.com/psychbruce/chinesenames |
Details are described in https://psychbruce.github.io/ChineseNames/
This database contains no individual-level information, so it does not expose personal data. All data are at the name level or character level. Extremely rare characters are not included.
This database was provided by Beijing Meiming Science and Technology Company (in collaboration) and originally obtained from the National Citizen Identity Information Center (NCIIC) of China in 2008.
Useful links:
Report bugs at https://github.com/psychbruce/ChineseNames/issues
Compute all available name features (indices) based on familyname and givenname.
You can either input a data frame with a variable of Chinese full names (and a variable of birth years, if necessary) or just input a vector of full names (and a vector of birth years, if necessary).
Usage 1: Input a single value or a vector as name [and birth, if necessary].
Usage 2: Input a data frame as data and the variable name as var.fullname (or var.surname and/or var.givenname) [and var.birthyear, if necessary].
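The two input styles can be sketched as follows. The names and birth years are made-up illustrative values, and the calls are guarded so the snippet still runs when the ChineseNames package is not installed. Only arguments listed in the function signature are used.

```r
# Illustrative inputs (made-up values, not taken from the database)
full_names <- c("王伟", "李娜")
birth_years <- c(1960, 1990)
people <- data.frame(name = full_names, birth = birth_years)

if (requireNamespace("ChineseNames", quietly = TRUE)) {
  library(ChineseNames)

  # Usage 1: pass vectors directly via `name` and `birth`
  res_vec <- compute_name_index(name = full_names, birth = birth_years)

  # Usage 2: pass a data frame plus the names of the relevant columns
  res_df <- compute_name_index(people,
                               var.fullname = "name",
                               var.birthyear = "birth")
  print(res_df)
}
```

Both calls are expected to describe the same two people, so comparing res_vec with res_df is a quick sanity check when the package is available.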
Caution. Name-character uniqueness (NU) for birth year >= 2010 is estimated by forecasting and thereby may not be accurate.
compute_name_index(
  data = NULL,
  var.fullname = NULL,
  var.surname = NULL,
  var.givenname = NULL,
  var.birthyear = NULL,
  name = NA,
  birth = NA,
  index = c("NLen", "SNU", "SNI", "NU", "CCU", "NG", "NV", "NW", "NC"),
  NU.approx = TRUE,
  digits = 4,
  return.namechar = TRUE,
  return.all = FALSE
)
data | Data frame. |
---|---|
var.fullname | Variable name of Chinese full names. |
var.surname | Variable name of Chinese surnames. |
var.givenname | Variable name of Chinese given names. |
var.birthyear | Variable name of birth year. |
name | If no data is given, a single value or a vector of full names. |
birth | If no data is given, a single value or a vector of birth years. |
index | Which indices to compute? By default, it computes all available name indices: "NLen", "SNU", "SNI", "NU", "CCU", "NG", "NV", "NW", "NC". For details, see https://psychbruce.github.io/ChineseNames/ |
NU.approx | Whether to approximately compute name-character uniqueness (NU) using the nearest two birth cohorts with relative weights (which would be more precise than just using a single birth cohort). Default is TRUE. |
digits | Number of decimal places. Default is 4. |
return.namechar | Whether to return separate name characters. Default is TRUE. |
return.all | Whether to return all temporary variables in the computation of the final variables. Default is FALSE. |
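The idea behind NU.approx (blending the nearest two birth cohorts with relative weights) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper name blend_two_cohorts, the use of cohort midpoints, the linear distance-based weights, and the input values. The package's actual weighting scheme is not described in this section and may differ.

```r
# Assumed cohort midpoints for the six birth cohorts in the database
cohort_mid <- c("1930_1959" = 1944.5, "1960_1969" = 1964.5,
                "1970_1979" = 1974.5, "1980_1989" = 1984.5,
                "1990_1999" = 1994.5, "2000_2008" = 2004)

# Hypothetical helper (not part of ChineseNames): weighted average of a
# cohort-level value over the two cohorts whose midpoints are nearest
# to the birth year. The nearer cohort receives the larger weight.
blend_two_cohorts <- function(birth, values, mids = cohort_mid) {
  stopifnot(length(values) == length(mids))
  d <- abs(mids - birth)              # distance to every cohort midpoint
  nearest <- order(d)[1:2]            # indices of the two nearest cohorts
  w <- rev(d[nearest]) / sum(d[nearest])  # swap distances to get weights
  sum(w * values[nearest])
}

# Made-up cohort-level values for one character
toy_values <- c(100, 200, 300, 400, 500, 600)
blend_two_cohorts(1972, toy_values)
```

For a birth year of 1972 the two nearest midpoints are 1974.5 and 1964.5, so the result lies between the values of those two cohorts and closer to the 1970s one.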
A new data frame (of class data.table) with name indices appended.
Full names are split into name0 (surnames, with compound surnames automatically detected), name1, name2, and name3 (given-name characters).
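To make the name0/name1/name2/name3 layout concrete, here is a minimal base-R sketch of such a split. It is not the package's implementation: the helper split_fullname and the three-entry compound-surname list are hypothetical, chosen only to show how a compound surname changes where the given name starts.

```r
# Hypothetical, deliberately tiny list of compound surnames (illustration only)
compound_surnames <- c("欧阳", "司马", "诸葛")

# Hypothetical helper: split one non-empty full name into surname (name0)
# and up to three given-name characters (name1, name2, name3).
split_fullname <- function(fullname, compounds = compound_surnames) {
  chars <- strsplit(fullname, "")[[1]]
  # Use a two-character surname only if it is a known compound surname
  # and at least one given-name character remains after it.
  is_compound <- length(chars) >= 3 &&
    paste(chars[1:2], collapse = "") %in% compounds
  n_sur <- if (is_compound) 2 else 1
  given <- chars[-seq_len(n_sur)]
  c(name0 = paste(chars[seq_len(n_sur)], collapse = ""),
    name1 = given[1], name2 = given[2], name3 = given[3])
}

split_fullname("欧阳修")   # compound surname, one given-name character
split_fullname("王小明")   # single surname, two given-name characters
```

Positions with no character are returned as NA, which mirrors the idea of fixed name1 to name3 columns for names of different lengths.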
For details and examples, see https://psychbruce.github.io/ChineseNames/
## Prepare ##
sn = familyname$surname[1:12]
gn = c(top100name.year$name.all.1960[1:6],
       top100name.year$name.all.2000[1:6],
       top100name.year$name.all.1960[95:100],
       top100name.year$name.all.2000[95:100])
demodata = data.frame(name=paste0(sn, gn),
                      birth=c(1960:1965, 2000:2005, 1960:1965, 2000:2005))
demodata

## Compute ##
newdata = compute_name_index(demodata,
                             var.fullname="name",
                             var.birthyear="birth")
newdata
1,806 Chinese surnames and nationwide frequency.
data(familyname)
A data frame with 7 variables:
surname | surname (in Chinese) |
---|---|
compound | 0 = single surname, 1 = compound surname |
initial | initial letter (a-z) |
initial.rank | initial order (1-26) |
n.1930_2008 | total counts in the database |
ppm.1930_2008 | proportion in population (ppm = parts per million) |
surname.uniqueness | surname uniqueness |
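The relationship between a count column and a ppm column can be shown with a toy calculation. The counts below are made up, and the denominator is assumed to be the total of the toy counts; the denominator the database itself uses is not stated in this section.

```r
# Made-up counts for three placeholder surnames (not database values)
toy <- data.frame(surname = c("A", "B", "C"),
                  n = c(1400, 500, 100))

# Assumption: the toy population consists of these three surnames only
total <- sum(toy$n)

# Parts per million: share of the population scaled by one million
toy$ppm <- toy$n / total * 1e6
toy
```

Because the shares sum to 1, the ppm values of all surnames in a complete population sum to one million.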
https://psychbruce.github.io/ChineseNames/
2,614 Chinese characters used in given names and nationwide frequency.
data(givenname)
A data frame with 25 variables:
character | character used in given names (in Chinese) |
---|---|
pinyin | pinyin (pronunciation) |
bihua | number of strokes in a character |
n.male | total count among males |
n.female | total count among females |
name.gender | difference in the proportions of a character's use by males vs. females |
n.1930_1959, n.1960_1969, n.1970_1979, n.1980_1989, n.1990_1999, n.2000_2008 | total counts in a birth cohort |
ppm.1930_1959, ppm.1960_1969, ppm.1970_1979, ppm.1980_1989, ppm.1990_1999, ppm.2000_2008 | proportion (parts per million) in a birth cohort |
name.ppm | average ppm (parts per million) across all cohorts |
name.uniqueness | name-character uniqueness (in naming practices) |
corpus.ppm | proportion (parts per million) in contemporary Chinese corpus |
corpus.uniqueness | character-corpus uniqueness (in contemporary Chinese corpus) |
name.valence | name valence (positivity of character meaning), based on subjective ratings from 16 raters (ICC = 0.921) |
name.warmth | name warmth/morality, based on subjective ratings from 10 raters (ICC = 0.774) |
name.competence | name competence/assertiveness, based on subjective ratings from 10 raters (ICC = 0.712) |
https://psychbruce.github.io/ChineseNames/
Population statistics for the Chinese name database.
data(population)
https://psychbruce.github.io/ChineseNames/
Top 1,000 given names in 31 Chinese mainland provinces.
data(top1000name.prov)
https://psychbruce.github.io/ChineseNames/
Top 100 given names in 6 birth cohorts.
data(top100name.year)
https://psychbruce.github.io/ChineseNames/
Top 50 given-name characters in 6 birth cohorts.
data(top50char.year)
https://psychbruce.github.io/ChineseNames/