Research paper
Inferring Chinese surnames with Y-STR profiles
Introduction
In most human societies, surnames are paternally inherited, resulting in their co-segregation with Y chromosomes [1], [2]. Thus men sharing surnames might be expected to share similar, but not necessarily identical, Y chromosomes [3]. Indeed, King et al. [1] found that sharing a British surname significantly elevated the probability of sharing a Y-chromosomal haplotype.
The co-ancestry between surnames and Y chromosomes also has potential applications in forensics. By analyzing Y-chromosome variation in crime-scene samples, it may be possible to predict the likely surname of a suspect and thus substantially narrow the scope of an investigation. However, the link between surnames and Y chromosomes can be weakened or even lost through multiple factors, including adoption, name change, non-paternity, and mutation of Y haplotypes, as well as methodological deficiencies such as insufficient marker resolution and violation of the assumptions of the analytical methods. In addition, the detectability of co-segregation between surnames and Y chromosomes depends strongly on the sensitivity of the molecular markers used and on the history of the surnames.
Short tandem repeats (STRs, also known as microsatellites) are hypervariable sequences in the genome and have served as informative DNA markers in forensic inference, such as human identification [4], [5]. STRs have also been widely used in kinship analysis [6] and forensic DNA typing [7]. In particular, the haploidy and patrilineal inheritance of Y-chromosomal STRs (Y-STRs) make them an invaluable male-specific addition to the standard panel of autosomal loci used in forensic genetics [8]. However, the reliability and robustness of Y-STRs for inferring surnames remain to be fully tested, and the correlation between Chinese surnames and their Y-STR profiles has yet to be explored.
Hereditary surnames were established in China as far back as 4000 years ago [9], [10], [11]. The earliest surnames originated from either a totem or a place of residence. Later surnames were derived from state names, historic events, official positions, occupations, posthumous titles, or other characteristics of individuals [9]. The patrilineal surname system was initially used by the aristocracy, then became common and was adopted by people from all echelons of society at least 2000 years ago, when the Han culture expanded in China. Although Chinese surnames have a long evolutionary history compared with those of other societies, the diversity of heritable surnames in China is low [12]. Chinese surnames appear to have maintained long-term conservatism, stability and continuity owing to strong cultural constraints [13]. Only around 7300 surnames are used by 1280 million people nationwide in China [11], and approximately 3000 surnames are currently used by the Han Chinese, the largest group, which accounts for more than 90% of the Chinese population and nearly one fifth of all humans [14], [15]. In sharp contrast, European surnames originated and spread only during the late Middle Ages, yet a large number of different surnames have accumulated over the past few hundred years [16]. For example, more than 75,000 different surnames with more than 19 occurrences were recorded in the 2012 Spanish census [17], and 425,000 unique surnames, of which ∼49,000 occurred more than 20 times, were recorded in the 1881 census for 29 million British people [18]. In addition, the 100 most frequent surnames, which account for 85% of the Han Chinese population, overlap across different historical periods, and their population size distributions have maintained an exponential shape for the past 1000 years (since the Song dynasty) [13], [19]. This is a clear indicator of long-term stability.
In addition, many common surnames can be traced back to a single ancient founding family and are highly clustered geographically in particular regions, implying a historically continuous inheritance of Chinese surnames [13]. Moreover, Chinese people cling loyally to their surnames and do not change them without special reasons [9], although surnames of the aristocracy were bestowed on lower echelons of society in earlier history. Furthermore, Chinese surnames are represented by unique Chinese characters, so variants equivalent to the spelling variants found in alphabetic languages are very rare, if they exist at all. These socio-cultural features would predict a high degree of co-ancestry between Chinese surnames and Y chromosomes.
In this study, we develop two computational methods to infer surnames from Y-STR profiles. We also conduct a survey of Y-STR variations in Shandong Province, China, and use it as a case study for our methods. We further explore the effects of sample size and number of STR loci on surname inference accuracy. Our results indicate that Y-STR profiles contain information from which our methods could reliably infer Chinese surnames.
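The two methods themselves are not described in this excerpt. As a purely illustrative sketch of how a surname might be inferred from a Y-STR profile, the Python example below assigns a query haplotype the surname of its nearest reference haplotype under a simple stepwise distance, and estimates accuracy by leave-one-out testing. The function names, the distance measure, and the five-locus toy data are all assumptions made for illustration and should not be read as the authors' implementation.

```python
# Hypothetical reference panel: (surname, haplotype) pairs. Each haplotype is a
# tuple of repeat counts at the same ordered set of Y-STR loci. All values are
# invented for illustration and are not taken from the study.
REFERENCE = [
    ("Wang", (14, 12, 23, 10, 11)),
    ("Wang", (14, 12, 24, 10, 11)),
    ("Wang", (14, 13, 23, 10, 11)),
    ("Li", (15, 13, 21, 11, 13)),
    ("Li", (15, 13, 22, 11, 13)),
    ("Zhang", (16, 11, 25, 9, 12)),
    ("Zhang", (16, 11, 25, 10, 12)),
]


def stepwise_distance(a, b):
    """Sum of absolute repeat-count differences across loci (an assumed metric)."""
    if len(a) != len(b):
        raise ValueError("haplotypes must be typed at the same loci")
    return sum(abs(x - y) for x, y in zip(a, b))


def infer_surname(query, reference):
    """Return the surname attached to the reference haplotype closest to the query."""
    surname, _ = min(reference, key=lambda item: stepwise_distance(query, item[1]))
    return surname


def leave_one_out_accuracy(reference):
    """Fraction of reference samples whose surname is recovered when held out."""
    hits = 0
    for i, (surname, haplotype) in enumerate(reference):
        rest = reference[:i] + reference[i + 1:]
        if infer_surname(haplotype, rest) == surname:
            hits += 1
    return hits / len(reference)


if __name__ == "__main__":
    print(infer_surname((14, 12, 23, 10, 12), REFERENCE))
    print(leave_one_out_accuracy(REFERENCE))
```

Rerunning `leave_one_out_accuracy` on subsets of the reference panel, or on haplotypes truncated to fewer loci, is one simple way to examine how sample size and the number of STR loci affect accuracy in a sketch of this kind.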
Section snippets
Sampling and genotyping
Blood samples were collected with informed consent from 19,009 males in a Y-STR database project carried out by the Public Security Bureau of Shandong Province from 2012 to 2014. These individuals, bearing a total of 266 distinct surnames, are mostly permanent residents of Shandong Province (99.6%), with a few individuals (0.4%) from 19 other provinces. The sample size for each surname ranges from 1 to 1889, with a mean of 71. Five of the surnames have sample sizes >1000, and the sample
Data summary
The dataset includes a total of 19,009 men bearing 266 distinct surnames. The sample size for each surname ranges from 1 to 1889, with a mean of 71. The distribution of the top 100 surnames ranked by their frequencies in our data is exponentially shaped (Fig. 1A). The most frequent 126 surnames (sample sizes ≥ 10) comprise 97.6% of the sampled population. Only these data were included in the following analyses. Approximately 10% of individuals have missing data at one or more STR loci. After excluding
Performance of surname inference
Using the new methods developed here, we demonstrate that surnames can be reliably inferred using 15 Y-STR loci commonly used in forensic genotyping in human populations. The inference accuracy can exceed 80% under certain conditions (Fig. 2, Fig. 3). Such high accuracy implies that our methods are efficient in exploring the mutual information between surnames and Y-STR profiles. The surname distribution of our data (Fig. 1A) has, to a large degree, recovered the exponential shape of
Conclusions
We have developed two reliable methods that efficiently infer surnames from Y-STR profiles. Our methods showed high inference accuracy in the case study of a dataset composed of ∼19,000 men bearing 266 surnames. The high accuracy suggests a high degree of co-ancestry between Chinese surnames and Y chromosomes, and is consistent with the socio-cultural features that might constrain the non-patrilineal inheritance of surnames in China. We are confident that our methods are useful in surname
Conflict of interest
None.
Acknowledgements
We are grateful to Yanchai Wang for preparing the data, and grateful to the reviewers for their constructive comments, which greatly improved the quality of the manuscript. This work was supported by the CAS Key Program (KGFZD-135-16-021, to J.Y. and H.C.), the National Natural Science Foundation of China (91631106 and 31571370 to H.C., and 81330073 to J.Y.) and the “Hundred Talents Program” of the Chinese Academy of Sciences (to H.C.).
References (34)
- et al., Genetic signatures of coancestry within surnames, Curr. Biol. (2006)
- et al., Microsatellites and kinship, Trends Ecol. Evol. (1993)
- et al., DNA commission of the International Society of Forensic Genetics: recommendations on forensic analysis using Y-chromosome short tandem repeats, Leg. Med. (2001)
- In the name of the father: surnames and genetics, Trends Genet. (2001)
- et al., What’s in a name? Y chromosomes, surnames and the genetic genealogy revolution, Trends Genet. (2009)
- et al., Genetic structure of the Han Chinese population revealed by genome-wide SNP variation, Am. J. Hum. Genet. (2009)
- et al., DNA commission of the International Society of Forensic Genetics (ISFG): an update of the recommendations on the use of Y-STRs in forensic analysis, Forensic Sci. Int. (2006)
- et al., Mutability of Y-chromosomal microsatellites: rates, characteristics, molecular bases, and forensic implication, Am. J. Hum. Genet. (2010)
- et al., Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees, Am. J. Hum. Genet. (2001)
- et al., Identifying personal genomes by surname inference, Science (2013)
- The relationship between surname frequency and Y chromosome variation in Spain, Eur. J. Hum. Genet.
- Identification of the remains of the Romanov family by DNA analysis, Nat. Genet.
- Improving human forensics through advances in genetics, genomics and molecular biology, Nat. Rev. Genet.
- Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers
- Chinese surnames and the genetic differences between North and South China, J. Chin. Ling. Monogr. Ser.
- A study of surnames in China through isonymy, Am. J. Phys. Anthropol.
- Chinese Surnames: Community Heredity and Population Distribution
Cited by (10)
Development of the decision tree model for distinguishing individuals of Chinese four surnames from Zhanjiang Han population based on Y-STR haplotypes
2021, Legal Medicine
Citation excerpt: Yang et al. evaluated application values of 17 Y-STRs for 10 main surnames in China, and they didn’t observe the mainstream haplotypes in these surnames; therefore, they stated that these Y-STRs couldn’t be utilized to predict surnames [7]. In contrast with the Yang’s results, Shi et al. investigated genetic distributions of these Y-STRs in larger Han populations, constructed two distance models based on observed Y-STR haplotypes and inferred surname origins of unknown samples by using both distances; they found that the developed methods could achieve single surname prediction with 65% accuracy; more importantly, they argued that the accuracy of surname inference would be further improved with the increasing of the sample size of surnames and the number of used Y-STRs [8]. Given the discrepancies of both results outlined above, further research on the power of Y-STRs to surname predictions should be carried out.
Forensic characteristics of 36 Y-STR loci in a Changzhou Han population and genetic distance analysis among several Chinese populations
2019, Forensic Science International: Genetics
Quality control measures in Short Tandem Repeat (STR) Analysis
2022, Handbook of DNA Profiling
Interpretation of DNA data within the context of UK forensic science - investigation
2021, Emerging Topics in Life Sciences
1 These authors contributed equally to this study.