North and South Steppe/Caucasus Ancestry

January 28, 2022

Indo-European languages originated somewhere in the Russian Steppe/Caucasus Mountains area, and studies like Sarno et al. (2017) have suggested that some of them spread to Europe (especially the Balkans and Italy) through a southern route, carried by a population that was different from the Yamnaya, who went north.

A recent study (Wang et al., 2019) found that in the Steppe/Caucasus zone during the Bronze Age there were two distinct populations: a Northern one (which the authors call "Steppe") and a Southern one (which they call "Caucasus"). Both had CHG/Iran-related ancestry and similar mtDNA, but "Steppe" was mixed with EHG/ANE and carried mainly Y-chromosome haplogroup R, whereas "Caucasus" mostly lacked EHG and carried mainly haplogroup J.

It's clear that Italy and the Balkans received more of the Southern kind of ancestry, which could have brought with it languages like Greek, Albanian, Illyrian, Thracian, Messapian and possibly others. (It would also likely be the source of the Anatolian and Armenian branches of Indo-European.)

Based on PCA and ADMIXTURE plots we observe two distinct genetic clusters: one falls with previously published ancient individuals from the West Eurasian steppe (hence termed 'Steppe'), and the second clusters with present-day southern Caucasian populations and ancient BA individuals from today's Armenia (henceforth called 'Caucasus'), while a few individuals take on intermediate positions between the two. The stark distinction seen in our temporal transect is also visible in the Y-chromosome haplogroup distribution, with R1/R1b1 and Q1a2 types in the Steppe and L, J, and G2 types in the Caucasus cluster (Fig. 3a, Supplementary Data 1, Supplementary Note 4). In contrast, the mitochondrial haplogroup distribution is more diverse and similar in both groups (Fig. 3b, Supplementary Data 1).


Our fitted qpGraph model recapitulates the genetic separation between the Caucasus and Steppe groups with the Eneolithic steppe individuals deriving more than 60% of ancestry from EHG and the remainder from a CHG-related basal lineage, whereas the Maykop group received about 86.4% from CHG, 9.6% Anatolian farming related ancestry, and 4% from EHG. The Yamnaya individuals from the Caucasus derived the majority of their ancestry from Eneolithic steppe individuals, but also received about 16% from Globular Amphora-related farmers (Fig. 5, Supplementary Note 6).
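
As a back-of-the-envelope reading of the proportions quoted above, the short Python sketch below checks that the three Maykop components sum to 100% and works out a rough lower bound on the EHG share in Yamnaya. It rests on simplifications that are assumptions for illustration, not part of the paper's fitted graph: the Eneolithic steppe group is taken as exactly 60% EHG (the text says "more than 60%"), everything in Yamnaya other than the 16% farmer input is treated as Eneolithic steppe ancestry, and the farmer source is treated as contributing no EHG.

```python
# Ancestry proportions quoted from Wang et al. (2019), in percent.
maykop = {"CHG": 86.4, "Anatolian farmer": 9.6, "EHG": 4.0}

# Sanity check: the three Maykop components should account for all ancestry.
maykop_total = round(sum(maykop.values()), 1)

# Assumptions for illustration (not taken from the paper's fitted graph):
#  - Eneolithic steppe is exactly 60% EHG (the quoted lower bound),
#  - all Yamnaya ancestry other than the 16% farmer input is Eneolithic steppe,
#  - the farmer source contributes no EHG.
eneolithic_steppe_ehg = 0.60
yamnaya_farmer_share = 0.16
yamnaya_steppe_share = 1.0 - yamnaya_farmer_share

# Rough lower bound on the EHG share in Yamnaya under those assumptions.
yamnaya_ehg_min = round(yamnaya_steppe_share * eneolithic_steppe_ehg * 100, 1)

print(f"Maykop components sum to {maykop_total}%")
print(f"Yamnaya EHG share is at least about {yamnaya_ehg_min}%")
```

Under these assumptions, roughly half or more of Yamnaya ancestry would trace to EHG, against only about 4% in Maykop, which is the contrast the two clusters rest on.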


The insight that the Caucasus mountains served as a corridor for the spread of CHG ancestry north but also for subtle later gene-flow from the south allows speculations on the postulated homelands of Proto-Indo-European (PIE) languages and documented gene-flows that could have carried a consecutive spread of both across West Eurasia. This also opens up the possibility of a homeland of PIE south of the Caucasus, and could offer a parsimonious explanation for an early branching off of Anatolian languages, as shown on many PIE tree topologies. Geographically conceivable are also Armenian and Greek, for which genetic data support an eastern influence from Anatolia or the southern Caucasus, and an Indo-Iranian offshoot to the east. However, latest ancient DNA results from South Asia suggest an LMBA spread via the steppe belt. Irrespective of the early branching pattern, the spread of some or all of the PIE branches would have been possible via the North Pontic/Caucasus region and from there, along with pastoralist expansions, to the heart of Europe. This scenario finds support from the well attested and widely documented 'steppe ancestry' in European populations and the postulate of increasingly patrilinear societies in the wake of these expansions.

The two clusters are visible in the study's PCA, where the dotted lines show trajectories of admixture: the pink one between Western European Farmers and the Northern Steppe/Caucasus cluster, and the brown one between Eastern European Farmers and the Southern Steppe/Caucasus cluster. All Italians (and Balkan peoples) fall on those clines, plotting between the two dotted lines near the bottom.
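
To illustrate why mixed populations line up along such trajectories, here is a minimal Python sketch of a two-way admixture cline. It treats a mixed population's position as a weighted average of its two source positions, which is a common simplification of how admixed groups behave in a PCA. The coordinates and the function name `cline_position` are made up for illustration and are not read off the paper's plot.

```python
def cline_position(source_a, source_b, fraction_b):
    """Return the point a fraction `fraction_b` of the way from source_a to source_b."""
    if not 0.0 <= fraction_b <= 1.0:
        raise ValueError("fraction_b must be between 0 and 1")
    return tuple((1.0 - fraction_b) * a + fraction_b * b
                 for a, b in zip(source_a, source_b))

# Hypothetical (PC1, PC2) coordinates, chosen only to show the geometry.
farmers = (0.0, 0.0)
steppe = (4.0, 2.0)

# Populations with 25%, 50% and 75% ancestry from the second source.
for share in (0.25, 0.5, 0.75):
    print(share, cline_position(farmers, steppe, share))
```

In this simplified picture, the further a population sits along the line, the larger its share of ancestry from the second source.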

Wang et al. "Ancient human genome-wide data from a 3000-year interval in the Caucasus corresponds with eco-geographic regions". Nat Commun, 2019.