Abstract
The draft sequence of the whole human genome was first released in 2000. Since then, genomic sequence analyses of human populations have been conducted intensively. In particular, the following three international human genome sequencing projects are good examples of large-scale studies of human genome variation: 1) HapMap (1,417 individuals), 2) the Human Genome Diversity Project (940 individuals), and 3) the 1000 Genomes Project (2,504 individuals). However, these data sets are not readily compared with one another because of essential differences in data format and annotation. If all three data sets can be integrated into a single data set, a more detailed analysis of human genome sequence variation becomes possible. Simple addition of the individual samples gives a total of 4,861 individuals (= 1,417 + 940 + 2,504). Moreover, although genomic data for Middle East (ME) populations are limited, the ME could be a key to understanding human history because of its unique location at the crossroads of Asia, Europe and Africa. With the aim of elucidating the evolutionary history of ME populations, we have developed computational tools for integrating these three human genome sequence data sets, including ME populations, into a single data set. Using these tools, we successfully integrated the data. We then constructed a phylogenetic tree of about 5,000 human individuals at the genome sequence level. As a result, we identified evolutionary clusters of the ME populations in relation to other major ethnic groups, with several notable features. Here, we report the outcome of this large-scale analysis based on the integrated data and discuss the evolutionary significance of human genomic variation. We also present a method for identifying ME-specific variant alleles associated with particular diseases, such as diabetes, using the ClinVar database.
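The integration step can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the tools developed in this work: it assumes each data set has already been converted to a common in-memory layout keyed by (chromosome, position, reference allele, alternate allele), keeps only the variants shared by all data sets, and pools the samples. The names `integrate`, `VariantKey` and `Dataset` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical common layout: variant key -> {sample ID: alternate-allele count}
VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)
Dataset = Dict[VariantKey, Dict[str, int]]


def integrate(datasets: Dict[str, Dataset]) -> Dataset:
    """Keep variants present in every data set and pool their samples.

    Sample IDs are prefixed with the data set name so that identical IDs
    in different sources cannot overwrite each other.
    """
    if not datasets:
        return {}

    # Intersect the variant keys across all data sets.
    key_sets = [set(ds) for ds in datasets.values()]
    shared = set.intersection(*key_sets)

    merged: Dataset = {}
    for key in sorted(shared):
        genotypes: Dict[str, int] = {}
        for name, ds in datasets.items():
            for sample, genotype in ds[key].items():
                genotypes[f"{name}:{sample}"] = genotype
        merged[key] = genotypes
    return merged
```

Restricting the result to shared variants is one possible design choice; it avoids missing genotypes at the cost of discarding sites typed in only one project.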