Research paper
Inferring Chinese surnames with Y-STR profiles

https://doi.org/10.1016/j.fsigen.2017.11.014

Highlights

  • Two efficient computational methods were developed to infer surnames from Y-STR profiles.

  • More than 19,000 men bearing 266 surnames were typed for 17 Y-STR loci to demonstrate the performance of the methods.

  • The possibility of inferring surnames from Y-STR profiles reliably enables promising applications in forensics.

Abstract

Co-ancestry of human surnames and Y-chromosomes in most human populations and social groups suggests the possibility of inferring one from the other. However, this intuitive idea has yet to be formally explored. In the present study, we develop two computational methods, based on cosine distance (dcos) and coalescence distance (dcoal) respectively, to infer surnames from Y-STR profiles. We also survey Y-STR variations at 15 loci for 19,009 individuals of Shandong Province in China. For a total of 266 surnames included in the data set, our methods can pinpoint a single surname with an average accuracy of 65%, and with an average accuracy higher than 80% when providing >4 candidate surnames. We also demonstrate that increasing the sample size of surnames and the number of STR loci improves the accuracy of surname inference. Our results indicate that the 15 non-duplicated Y-STR loci contain information from which surnames can be reliably inferred for Chinese populations, showing a promising application in forensics.

Introduction

In most human societies, surnames are paternally inherited, resulting in their co-segregation with Y chromosomes [1], [2]. Thus men sharing surnames might be expected to share similar, but not necessarily identical, Y chromosomes [3]. Indeed, King et al. [1] found that sharing a British surname significantly elevated the probability of sharing a Y-chromosomal haplotype.

The co-ancestry between surnames and Y chromosomes also has potential application in forensics. By analyzing Y chromosome variations from crime-scene samples, it may be possible to predict the likely surname of suspects, and thus substantially narrow down the investigation target. However, the link between surnames and Y chromosomes can be weakened or even lost due to multiple factors, including adoption, name change, non-paternity, and mutations of Y haplotypes, as well as methodological deficiencies such as insufficient resolution of the markers used and violation of the assumptions of the analytical methods. In addition, the detectability of co-segregation between surnames and Y chromosomes is highly dependent on the sensitivity of the molecular markers used and the history of the surnames.

Short tandem repeats (STRs, also known as microsatellites) are hyper-variable sequences in the genome, and have served as informative DNA markers in forensic inference, such as human identification [4], [5]. STRs have also been widely used in kinship analysis [6] and forensic DNA typing [7]. In particular, the haploidy and patrilineal inheritance of Y chromosomal STRs (Y-STRs) make them an invaluable male-specific addition to the standard panel of autosomal loci used in forensic genetics [8]. However, the reliability and robustness of Y-STRs in inferring surnames remain to be fully tested. In particular, the correlation between Chinese surnames and their Y-STR profiles remains to be explored.

Hereditary surnames were established in China as far back as 4000 years ago [9], [10], [11]. The earliest surnames originated from either a totem or a place of residence. Later surnames were derived from state names, from historic events, or from official positions, occupations, posthumous titles, or other characteristics of individuals [9]. The patrilineal surname system was initially used by the aristocracy, then became common and was adopted by people from all echelons of society at least 2000 years ago when the Han culture expanded in China. Although they have undergone a long evolutionary history compared to other societies of the world, the diversity of heritable surnames in China is low [12]. It appears that Chinese surnames have maintained long-term conservatism, stability and continuity, due to strong cultural constraints [13]. There are only around 7300 surnames being used by 1280 million people nationwide in China [11], and approximately 3000 surnames currently used by the Chinese Han group, the largest group, accounting for more than 90% of the Chinese population and nearly one fifth of all humans [14], [15]. In sharp contrast, European surnames originated and spread during the late Middle Ages, but a large number of different surnames have accumulated during the past few hundred years [16]. For example, more than 75,000 different surnames with more than 19 occurrences were recorded in the 2012 Spain census [17], and 425,000 unique surnames, of which ∼49,000 occurred more than 20 times, were recorded in the 1881 census for 29 million British people [18]. In addition, the 100 most frequent surnames, which account for 85% of the Han Chinese population, overlap across different periods, and their population size distributions have maintained an exponential shape since 1000 years ago (Song dynasty) [13], [19]. This is a clear indicator of long-term stability.

Besides, many common surnames can be traced back to a single ancient founding family and are highly clustered geographically in particular regions, implying a historically continuous inheritance of Chinese surnames [13]. Moreover, Chinese people cling to their surnames loyally and do not change them without special reasons [9], although surnames of the aristocracy were bestowed on lower echelons of society in earlier history. Furthermore, Chinese surnames are represented by unique Chinese character(s), and thus variants equivalent to the spelling variants found in alphabetical languages, if they exist, are very rare. These socio-cultural features would predict a high degree of co-ancestry between Chinese surnames and Y-chromosomes.

In this study, we develop two computational methods to infer surnames from Y-STR profiles. We also conduct a survey of Y-STR variations in Shandong Province, China, and use it as a case study for our methods. We further explore the effects of sample size and number of STR loci on surname inference accuracy. Our results indicate that Y-STR profiles contain information from which our methods could reliably infer Chinese surnames.
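To make the idea of distance-based surname inference concrete, the following is a minimal Python sketch of a nearest-profile lookup using cosine distance. It is an illustration under stated assumptions, not the procedure used in this study: the exact definition of dcos and the ranking rule are not given in this excerpt, so the sketch assumes each Y-STR profile is a vector of repeat counts and that each surname is scored by its closest reference profile. The function names (`cosine_distance`, `rank_surnames`), the toy panel `REFERENCE` and its repeat counts are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def rank_surnames(query, reference, k=1):
    """Return up to k candidate surnames, closest first.

    Assumption: a surname's score is the smallest distance between the
    query profile and any reference profile carrying that surname.
    """
    best = {}
    for surname, profile in reference:
        d = cosine_distance(query, profile)
        if surname not in best or d < best[surname]:
            best[surname] = d
    # Sort by distance, breaking ties alphabetically for determinism.
    ranked = sorted(best, key=lambda s: (best[s], s))
    return ranked[:k]

# Hypothetical reference panel: (surname, repeat counts at 4 loci).
# These values are made up for illustration and are not real data.
REFERENCE = [
    ("Wang", (14, 13, 30, 24)),
    ("Wang", (14, 13, 31, 24)),
    ("Li", (15, 12, 28, 23)),
    ("Zhang", (13, 14, 29, 22)),
]

# Ask for two candidate surnames for a query profile.
candidates = rank_surnames((14, 13, 30, 24), REFERENCE, k=2)
print(candidates)
```

Returning a short ranked list rather than a single name mirrors the way accuracy is reported above, where providing several candidate surnames raises the chance that the true surname is among them.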

Section snippets

Sampling and genotyping

Blood samples were collected with informed consent from 19,009 males in a Y-STR database project carried out by the Public Security Bureau, Shandong Province, from 2012 to 2014. These individuals, bearing a total of 266 distinct surnames, are mostly permanent residents of Shandong Province (99.6%), with a few individuals (0.4%) from the other 19 provinces. The sample size for each surname ranges from 1 to 1889, with a mean of 71. Five of the surnames have sample sizes >1000, and the sample

Data summary

The dataset includes a total of 19,009 men bearing 266 distinct surnames. The sample size for each surname ranges from 1 to 1889, with a mean of 71. The distribution of the top 100 surnames ranked by their frequencies in our data is exponentially shaped (Fig. 1A). The most frequent 126 surnames (sample sizes ≥ 10) comprise 97.6% of the sampled population. Only these data were included in the following analyses. Approximately 10% of individuals have missing data at one or more STR loci. After excluding

Performance of surname inference

Using the new methods developed here, we demonstrate that surnames can be reliably inferred using 15 Y-STR loci commonly used in forensic genotyping in human populations. The inference accuracy can exceed 80% under certain conditions (Fig. 2, Fig. 3). Such high accuracy implies that our methods are efficient in exploiting the mutual information between surnames and Y-STR profiles. The surname distribution of our data (Fig. 1A) has, to a large degree, recovered the exponential shape of

Conclusions

We have developed two reliable methods that efficiently infer surnames from Y-STR profiles. Our methods showed high inference accuracy in the case study of a dataset composed of ∼19,000 men bearing 266 surnames. The high accuracy suggests a high degree of co-ancestry between Chinese surnames and Y chromosomes, and is consistent with the socio-cultural features that might constrain the non-patrilineal inheritance of surnames in China. We are confident that our methods are useful in surname

Conflict of interest

None.

Acknowledgements

We are grateful to Yanchai Wang for preparing the data, and grateful to the reviewers for their constructive comments, which greatly improved the quality of the manuscript. This work was supported by the CAS Key Program (KGFZD-135-16-021, to J.Y. and H.C.), the National Natural Science Foundation of China (91631106 and 31571370 to H.C., and 81330073 to J.Y.) and the “Hundred Talents Program” of the Chinese Academy of Sciences (to H.C.).

References (34)

  • C. Martinez-Cadenas et al. The relationship between surname frequency and Y chromosome variation in Spain. Eur. J. Hum. Genet. (2016)

  • P. Gill et al. Identification of the remains of the Romanov family by DNA analysis. Nat. Genet. (1994)

  • M. Kayser et al. Improving human forensics through advances in genetics, genomics and molecular biology. Nat. Rev. Genet. (2012)

  • J.M. Butler. Forensic DNA Typing: Biology, Technology, and Genetics of STR Markers. (2005)

  • R.F. Du et al. Chinese surnames and the genetic differences between North and South China. J. Chin. Ling. Monogr. Ser. (1992)

  • Y. Liu et al. A study of surnames in China through isonymy. Am. J. Phys. Anthropol. (2012)

  • Y.D. Yuan et al. Chinese Surnames: Community Heredity and Population Distribution. (2002)
    1

    These authors contributed equally to this study.
