Adrien Ickowicz, Ben Raymond


It is always an interesting exercise to try to predict how performances at youth and junior level translate to senior level. It is equally interesting to use historical data to get an idea of the main features of play to expect in a given competition.

What we are asking here is not so much a predictive question. Rather, we want to identify whether there is indeed a distinction between the different age groups, competitions and continents, and if so, what it is. To assess that, we run a statistical analysis called principal component analysis (PCA) on a number of competitions, using a number of indicators that reflect how each competition is played, and see whether the competitions cluster into groups.

Data

We used as many scouted files from the games as possible, as per Table 1 below. Unfortunately we could not get our hands on any African volleyball data, but we do have coverage of many age groups for South America, North America, Asia, Europe and World competitions, with many games in each category. While we were generally unable to get access to every game in every competition, we have only included competitions for which we had a reasonable fraction of games, so as to be confident that the data give a reasonable reflection of the standard and characteristics of play.

The table lists the competitions we were able to include in the analysis, the number of teams that participated, and the number of games we accessed for each competition.

Table 1
The analysed competitions
Summary of games and teams analysed through the shared scouted data. Not all games within the competitions had available scouted files.
| Competition | Age group | Gender | Continent | Year | Games | Teams |
| --- | --- | --- | --- | --- | --- | --- |
| AVC Senior Men 2022 | Senior | Men | Asia & Oceania | 2022 | 20 | 19 |
| AVC U16 Men 2023 | U16 | Men | Asia & Oceania | 2023 | 16 | 11 |
| AVC U18 Men 2018 | U18 | Men | Asia & Oceania | 2018 | 7 | 9 |
| AVC U20 Men 2022 | U20 | Men | Asia & Oceania | 2022 | 27 | 17 |
| CEV Senior Men 2021 | Senior | Men | Europe | 2021 | 16 | 20 |
| CEV U17 Men 2021 | U17 | Men | Europe | 2021 | 17 | 17 |
| CEV U18 Men 2022 | U18 | Men | Europe | 2022 | 18 | 16 |
| CEV U20 Men 2022 | U20 | Men | Europe | 2022 | 20 | 18 |
| CSV Senior Men 2021 | Senior | Men | South America | 2021 | 9 | 6 |
| CSV U17 Men 2023 | U17 | Men | South America | 2023 | 9 | 6 |
| CSV U19 Men 2022 | U19 | Men | South America | 2022 | 10 | 10 |
| CSV U21 Men 2022 | U21 | Men | South America | 2022 | 8 | 11 |
| FIVB U19 Men 2021 | U19 | Men | World | 2021 | 29 | 20 |
| FIVB U19 Men 2023 | U19 | Men | World | 2023 | 45 | 20 |
| FIVB U21 Men 2021 | U21 | Men | World | 2021 | 20 | 20 |
| FIVB U21 Men 2023 | U21 | Men | World | 2023 | 31 | 22 |
| FIVB Senior Men 2018 | Senior | Men | World | 2018 | 94 | 25 |
| FIVB Senior Men 2022 | Senior | Men | World | 2022 | 37 | 25 |
| NORCECA Senior Men 2021 | Senior | Men | North America | 2021 | 13 | 9 |
| NORCECA Senior Men 2022 | Senior | Men | North America | 2022 | 26 | 13 |
| NORCECA Senior Men 2023 | Senior | Men | North America | 2023 | 17 | 12 |
| NORCECA U19 Men 2023 | U19 | Men | North America | 2023 | 11 | 8 |
| NORCECA U21 Men 2023 | U21 | Men | North America | 2023 | 13 | 9 |
Source: Courtesy of Lionel Bonnaure and the many scouts worldwide for sharing their scouted files. Sourced from VB Canada’s scout share server.

The data go back only as far as 2018 (two competitions); the remaining competitions in Table 1 were played between 2021 and 2023. Having two editions each of the FIVB U19, U21 and Senior competitions should also allow us to evaluate how consistent these competitions are in terms of the key statistical descriptors.

Key statistics

Anyone familiar with volleyball, or with sport in general, knows that many numbers can be derived from any game. These numbers are then used in a number of ways, Moneyball being one example. Among the suite of Science Untangled apps for analysing a given game, one (available at https://apps.untan.gl/dvrr/) gives you access to a range of metrics. Its Analysis summary tab in particular provides a number of statistical metrics that we are going to use in this article. The key statistics that we decided to include in the analysis are:

The tab actually provides more than 70 different metrics. We deliberately only focus on the 12 indicators described above to keep it high level. You can check the values for the different competitions in Table 2.

While we are by no means covering every aspect of the game, this should give us a good idea of general tendencies. Further analysis can then be performed following the direction provided by this early approach.

Method

Even though we’ve restricted ourselves to 12 measurements, this is still far too many to visualize directly. Ideally we’d like to summarize this data down to two or three dimensions, so that it can be plotted and the differences between the competitions visualized more easily. To do this we use a method called principal component analysis (PCA). This is a well-known statistical technique for capturing the most important information in high-dimensional data. It relies on the fact that many of the variables will be correlated with each other (i.e. when one changes, another tends to change with it). We can combine those correlated variables into new composite variables, called “principal components”, leaving us with a smaller number of variables that still capture the main patterns in the original data. For a technical explanation of how this is done, see e.g. the Wikipedia page.
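For readers who prefer code to prose, here is a minimal sketch of the idea in Python with NumPy. It is not the code used for this article: the data are random placeholders with the same shape as our problem (23 competitions by 12 indicators), the function name `pca` is ours, and standardizing each indicator before the decomposition is an assumption on our part.

```python
import numpy as np

def pca(X, n_components):
    """PCA of X (rows = competitions, columns = indicators) via SVD.

    Each column is standardized first (an assumption, so that indicators
    measured on different scales contribute equally).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)            # share of variance per component
    scores = U[:, :n_components] * s[:n_components]  # competitions on the new axes
    return scores, Vt[:n_components], explained

# Placeholder data only: 12 correlated indicators driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(23, 3))
mixing = rng.normal(size=(3, 12))
X = latent @ mixing + 0.3 * rng.normal(size=(23, 12))

scores, components, explained = pca(X, n_components=3)
print(scores.shape)       # one row per competition, one column per kept component
print(explained.round(2)) # shares of variance, largest first
```

Because the placeholder indicators are built from only three hidden factors, most of the variance ends up in the first three components, which is the situation PCA is designed to exploit.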

In the histogram above, we can see that the quantity of information provided by the successive components decreases steadily. The first component explains about 43% of the differences between competitions, the second about 24%, the third about 15%, and so on. The components are calculated so that each captures a different, non-overlapping aspect of the information, so the percentages of the components you keep can simply be added together. The first three components can be considered the most significant, since together they contain almost 81% of the total information in the data. The remaining components each add only a small amount of information, so we keep the first three components as a balance between explaining as much as possible and keeping things simple.
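The adding-up of percentages can be sketched in a few lines of Python. The first three shares below are the rounded figures quoted above; the remaining ones are made up purely so that the list sums to 100%. The helper `components_needed` and the 80% threshold are illustrative assumptions, one common rule of thumb rather than the rule we applied.

```python
from itertools import accumulate

def components_needed(ratios, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return len(ratios)

# First three values: rounded shares quoted in the text.
# Remaining values: hypothetical, chosen only so that the shares sum to 1.
ratios = [0.43, 0.24, 0.15, 0.07, 0.05, 0.03, 0.02, 0.01]

cumulative = list(accumulate(ratios))
print([round(c, 2) for c in cumulative])
print(components_needed(ratios, threshold=0.80))
```

The rounded shares add up to about 82% rather than the "almost 81%" quoted above; a gap of that size is what you would expect from rounding each share before adding.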

Having kept three components, we also need to understand what they represent (remember that each component is some combination of our original variables). This is what the figure below helps to do. Within each circle, the original key statistics are represented, and their coordinates mark their relative importance.

Three main pieces of information can be observed from the axis plot below:

The goal of the third visualization is to determine how much each variable is represented in a given component.
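As a rough sketch of what sits behind these variable plots, the snippet below computes, for each indicator, its correlation with each component and the square of that correlation. We are assuming here that the circles are standard correlation circles and that "how much each variable is represented" corresponds to the squared correlation (often called cos2); the data are random placeholders and the function name is ours.

```python
import numpy as np

def variable_coordinates(X):
    """Correlation of each standardized indicator with each principal component.

    Returns an array of shape (n_indicators, n_components).
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    # Scale each principal axis by the standard deviation of its component.
    return Vt.T * (s / np.sqrt(n - 1))

# Placeholder data only: 23 competitions x 12 indicators.
rng = np.random.default_rng(42)
X = rng.normal(size=(23, 12))

coords = variable_coordinates(X)  # positions of the indicators in the circles
cos2 = coords**2                  # share of each indicator carried by each component

print(coords[:, :2].round(2))     # coordinates on axes 1 and 2
print(cos2[:, :3].sum(axis=1).round(2))  # share captured by the first three components
```

An indicator drawn close to the edge of the circle is well described by the two axes shown, while one close to the centre is mostly carried by other components.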

With these explanations, here is what can be said about each component:

Competitions positioning

Having established our principal components, we can project each of our competitions onto these new axes, and examine how they compare to each other. This is shown in the figure below (these are two-dimensional plots, so we plot axes 1 and 2 together in the first plot and axes 1 and 3 in the second plot). The figure is telling us that:

We can also present the same information on a single 3D plot, rather than two 2-D plots. Below is a 3D plot to help navigate the difference in 3 dimensions. Click and drag the plot to rotate it, and scroll to zoom.
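The projection behind these plots can be sketched as follows. Again this is an illustration under assumptions, not our production code: the data and competition labels are placeholders, indicators are standardized before the decomposition, and the helper names `project` and `distance` are ours.

```python
import numpy as np

def project(X, n_components=3):
    """Coordinates of each row of X on the first n_components principal axes."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

# Placeholder data and labels only: 23 competitions x 12 indicators.
rng = np.random.default_rng(1)
X = rng.normal(size=(23, 12))
labels = [f"competition_{i}" for i in range(X.shape[0])]

scores = project(X)
plot_12 = scores[:, [0, 1]]  # first 2-D plot: axes 1 and 2
plot_13 = scores[:, [0, 2]]  # second 2-D plot: axes 1 and 3

def distance(i, j):
    """Euclidean distance between two competitions in the 3-component space."""
    return float(np.linalg.norm(scores[i] - scores[j]))

print(labels[0], labels[1], round(distance(0, 1), 2))
```

Two competitions that sit close together in this space have similar profiles on the retained components, which is one simple way to compare, say, two editions of the same competition.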

Discussion

This analysis is not the right way to evaluate a given team’s performance in a competition, particularly when there is no full round robin in which everyone gets a chance to play everyone else. It does, however, tell us how different the overall standards of play can be between competitions, and how much variability can be expected across competitions. Bear in mind that the key statistics for a team are influenced by what their opponents do, so drawing conclusions about the ‘quality’ of one competition over another is not warranted. One can, however, say which aspects of the game seem to take priority in the different confederations.

The next article will dig into the within-competition variability.