Kunho

Project Report

2022 World Cup Data Analysis: Uncovering Modern Football Trends

Published: 2025. 2. 25.
Data: FIFA Qatar 2022
Method: UMAP, K-means
Scope: 63 matches, 37 features

1. Introduction

1.1 Background and Necessity

Data analysis is becoming more important across many sports. Most teams now analyze an opponent's strategy before a match, and strategic analysis has moved from relying on intuition alone to a systematic, data-driven approach. Football has tactical trends of its own, and analyzing them shows the direction in which the modern game is heading. This study was conducted to understand that direction and to help develop future football tactical systems.

1.2 Research Objective

This study aims to identify trends in modern football, using the 2022 Qatar World Cup as its data source. Because the World Cup brings together teams from many continents, it is useful for understanding modern football trends without being limited to a specific country or continent. Using indicators collected from every Qatar World Cup match, together with each country's tournament result, this study attempts to identify both overall characteristics and cluster-specific characteristics. The dataset consists of 37 features, including attacking and defensive indicators.

1.3 Methodology Overview

The data came from the official FIFA website for the 2022 Qatar World Cup. Indicators for each team were compiled from all matches, from the group stage to the final. The data was then preprocessed with Z-score standardization and UMAP dimensionality reduction, and clustering was performed on the reduced data. K-means was used as the clustering algorithm because it is computationally fast and its results are easy to visualize. The number of clusters K was selected using the Elbow Method and the Silhouette Score.

2. Research Method

2.1 Data Introduction

This study uses data from the official FIFA website for the 2022 Qatar World Cup. It includes a total of 63 matches from the group stage to the final. Team-level indicators are organized for each match. For each match, 37 features can be obtained per team, and the final dataset was generated by averaging each team's indicators by feature.
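
The averaging step described above can be sketched as follows. This is a minimal illustration, assuming a table with one row per team per match; the column names (`team`, `possession`, `passes`) and the values are hypothetical placeholders, not taken from the FIFA data.

```python
import pandas as pd

# Hypothetical per-match rows: one row per team per match.
# Column names and values are illustrative only.
match_rows = pd.DataFrame(
    {
        "team": ["A", "A", "B", "B"],
        "possession": [60.0, 50.0, 40.0, 44.0],
        "passes": [600.0, 500.0, 380.0, 420.0],
    }
)

# Average every numeric feature over each team's matches,
# giving one row per team (the team-level dataset).
team_means = match_rows.groupby("team").mean(numeric_only=True)
print(team_means)
```

With the real data, the same grouping would produce one row per country and one column per feature.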

2.2 Data Preprocessing

Z-score standardization was applied to the data so that each feature has a mean of 0 and a standard deviation of 1. Aligning the scales of the features stabilizes the dimensionality reduction and clustering models used later. Next, UMAP, which can capture nonlinear structure in the data, was used to reduce the number of features from 37 to 2. The UMAP hyperparameters were set to n_neighbors=15 and min_dist=0.1 to preserve the global structure of the data well.
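
As a minimal sketch of the standardization step, each feature column x can be transformed as z = (x - mean) / std. The random matrix below is a toy stand-in for the team-by-feature table, and the handling of constant columns is an assumption added for safety, not something described in this report.

```python
import numpy as np

def zscore(features: np.ndarray) -> np.ndarray:
    """Standardize each column to mean 0 and standard deviation 1."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    # Assumption: a constant column is mapped to 0 instead of dividing by 0.
    std = np.where(std == 0, 1.0, std)
    return (features - mean) / std

# Toy stand-in for the team-by-feature matrix (37 feature columns).
rng = np.random.default_rng(0)
team_features = rng.normal(loc=50.0, scale=10.0, size=(32, 37))

scaled = zscore(team_features)
print(scaled.mean(axis=0).round(6)[:3], scaled.std(axis=0).round(6)[:3])
```

The standardized matrix could then be passed to a two-component UMAP, for example `umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1)` from the umap-learn package. That call is left out of the sketch to keep it dependency-free, and the use of umap-learn specifically is an assumption.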

Country-level data reduced with UMAP after Z-score standardization
UMAP Projection of Scaled Data

2.3 Checking Correlations Between Features

A Rank Score feature was added based on each country's result in the Qatar World Cup, and its correlation with other features was checked. Visualization was performed to identify which characteristics are important for achieving good results in the World Cup.
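
The correlation check can be illustrated with a short sketch. It assumes Pearson correlation, which this report does not state explicitly, and it uses toy values: the `rank_score` vector and the feature names `feature_a` to `feature_c` are hypothetical.

```python
import numpy as np

def correlations_with_target(features: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Pearson correlation of each feature column with the target vector."""
    return np.array(
        [np.corrcoef(features[:, j], target)[0, 1] for j in range(features.shape[1])]
    )

# Toy Rank Score (higher = better result) and three toy features.
rank_score = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
features = np.column_stack(
    [
        2.0 * rank_score + 1.0,               # rises with Rank Score
        -rank_score,                          # falls with Rank Score
        np.array([2.0, 1.0, 3.0, 1.0, 2.0]),  # roughly unrelated
    ]
)
names = ["feature_a", "feature_b", "feature_c"]

corr = correlations_with_target(features, rank_score)
# Sort features from most positively to most negatively correlated.
ranked = sorted(zip(names, corr), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print(f"{name}: {value:.3f}")
```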

Football metrics highly correlated with Rank Score
Features Correlated with Rank Score

2.4 Clustering Algorithm

K-means was selected as the clustering algorithm because it is computationally fast and makes clustering results easy to inspect visually. K-means is sensitive to noise and outliers, but this was not considered a problem because no noise was observed during preprocessing. To speed up convergence, k-means++ was used to choose the initial centroids.
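
A minimal sketch of K-means with k-means++ initialization, assuming scikit-learn is used (this report does not name the library). The 2-D points are a toy stand-in for the UMAP embedding, and `n_clusters=3`, `n_init=10`, and `random_state=0` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points standing in for the UMAP embedding:
# three well-separated groups of 10 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
embedding = np.vstack([c + rng.normal(scale=0.5, size=(10, 2)) for c in centers])

# init="k-means++" spreads the initial centroids apart,
# which usually reduces the number of iterations needed.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = kmeans.fit_predict(embedding)

print(labels)
print(kmeans.cluster_centers_.round(2))
```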

2.5 Method for Determining the Number of Clusters

To find the optimal number of clusters, the Elbow Method and the Silhouette Score were visualized and compared. The Elbow Method initially suggested an optimal number of 2 clusters. The average Silhouette Score was then calculated for each candidate number of clusters and compared with the Silhouette Scores of the individual clusters. With 6 clusters, the average Silhouette Score was 0.434 and the cluster-level Silhouette Scores were mostly uniform. Therefore, the optimal number of clusters was determined to be 6.
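
The comparison across candidate cluster counts can be sketched as below, assuming scikit-learn. The data is a toy set with four well-separated groups, so it will not reproduce the values reported here; the candidate range of 2 to 7 is also an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Toy 2-D points standing in for the UMAP embedding (four groups of 8).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
embedding = np.vstack([c + rng.normal(scale=0.5, size=(8, 2)) for c in centers])

results = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0)
    labels = model.fit_predict(embedding)
    per_point = silhouette_samples(embedding, labels)
    results[k] = {
        # Within-cluster sum of squares, the quantity plotted for the Elbow Method.
        "inertia": model.inertia_,
        # Average silhouette over all points.
        "mean_silhouette": silhouette_score(embedding, labels),
        # Average silhouette inside each cluster, to check uniformity.
        "per_cluster": [per_point[labels == c].mean() for c in range(k)],
    }

best_k = max(results, key=lambda k: results[k]["mean_silhouette"])
for k, row in results.items():
    print(k, round(row["inertia"], 2), round(row["mean_silhouette"], 3))
print("k with highest mean silhouette:", best_k)
```

Inspecting `per_cluster` alongside the mean is what allows a check that no single cluster scores much lower than the others.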

UMAP visualization by candidate K-means cluster counts
KMeans Clustering with Different k Values
Silhouette score comparison by candidate cluster count
Silhouette Score by Number of Clusters

3. Research Results

3.1 Correlation Analysis Results

The correlation between Rank Score and each of the other features was calculated. The correlation was 0.382850 for infront offers to receive Average, 0.321767 for total offers to receive Average, and 0.248061 for inbetween offers to receive Average. This suggests that movements and attempts to receive the ball are important, especially movements to receive the ball high up the pitch.

In addition, receptions between midfield and defensive lines Average had a correlation of 0.293273, indicating the importance of receiving the ball between the opponent's midfield and defensive lines. The correlation was 0.238437 for left channel Average and 0.280012 for right channel Average, suggesting that using wide spaces is more important than using central areas.

3.2 Clustering Results

To identify the characteristics of each cluster, the mean and standard deviation of each cluster's features were visualized with error bars. Cluster 2, which includes Argentina, the Qatar World Cup champion, as well as football powerhouses Brazil, England, Germany, Portugal, and Spain, recorded high values in most features. Its average tournament performance was also excellent. Cluster 0, which includes runner-up France, did not show superior values in most features. Cluster 1 recorded the lowest values in most features, but achieved better results than Clusters 3 and 4. Cluster 3 showed a high goals-conceded rate and recorded the lowest results.
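
The per-cluster summary behind the error-bar plot can be sketched as follows. The table is a toy example, and the column names (`cluster`, `possession`, `goals`) are hypothetical; the real table would have one row per country and 37 feature columns.

```python
import pandas as pd

# Toy team-level table with an assigned cluster label per team.
team_table = pd.DataFrame(
    {
        "cluster": [0, 0, 1, 1],
        "possession": [40.0, 44.0, 60.0, 64.0],
        "goals": [1.0, 3.0, 2.0, 2.0],
    }
)

# Mean and sample standard deviation of every feature within each cluster.
summary = team_table.groupby("cluster").agg(["mean", "std"])
print(summary)
```

Each (mean, std) pair in `summary` could then be drawn as one point with an error bar, for example with `matplotlib.pyplot.errorbar`.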

Feature means and standard deviations by cluster
Mean and Standard Deviation of Features in Clusters

3.3 Team Style Analysis by Cluster

Style indicators for Cluster 0
Cluster 0 Team Style

Cluster 0, which includes runner-up France, has high values for penetration-related indicators such as attempted defensive line breaks Average, completed defensive line breaks Average, and offsides Average compared with its pass count and pass success rate. Also, central channel Average is low, while right channel Average and left channel Average are high, showing that it mainly used wide spaces rather than central spaces. This indicates a direct style of play that attacks the space behind the defense rather than building the game through short passes.

Style indicators for Cluster 1
Cluster 1 Team Style

Cluster 1 has low values in most indicators, but achieved better results than Clusters 3 and 4. Despite these low values, goal preventions Average, forced turnovers Average, and defensive pressures applied Average were all high. switches of play completed Average was also high, meaning that this cluster opened gaps in the opponent's defense by switching play. Overall, it performed well defensively, and this led to good results.

Style indicators for Cluster 2
Cluster 2 Team Style

Cluster 2, consisting only of football powerhouses, shows the best values in most indicators. It recorded overwhelming possession, pass count, and pass success rate. The high values of left inside channel Average, central channel Average, and right inside channel Average suggest strong performance in central areas. Based on control of central areas, this cluster produced many shots on target and many goals.

Style indicators for Cluster 3
Cluster 3 Team Style

Cluster 3, which recorded the lowest results, did not show outstanding values in either attack or defense compared with other clusters. It had the lowest scoring value and also showed poor defensive indicators. Unstable defense and unclear attacking ability appear to be the main reasons for its low results.

Style indicators for Cluster 4
Cluster 4 Team Style

Cluster 4 shows low values in most indicators, such as low possession, few passes, and low pass success rate. However, its goal inside the penalty area Average value is high. This shows that even though its overall performance was not strong, it had the ability to score from few chances through finishing inside the box. Compared with attacking indicators, its defensive indicators show good values.

Style indicators for Cluster 5
Cluster 5 Team Style

Cluster 5 recorded the second-best values across many indicators, behind only Cluster 2. In particular, it shows the highest value for switches of play completed Average, meaning that it opened gaps in the opponent's defense on the back of generally strong performance. It had high values in passing and penetration-related indicators, but mainly used wide spaces rather than central areas. However, crosses completed Average was low, and its scoring ability was also weak.

4. Conclusion

This study shows that using wide spaces is important. Excluding the traditional football powerhouses, most countries emphasized the use of the wings, and this led to good results. In particular, the combination of forward penetration movements and wide-space usage appears to be a major reason for strong performance. Given a base of sound defensive ability, it is important not to focus only on possession, but to decide how to use the wide areas of the pitch effectively.

In addition, teams can achieve better results when players move organically by exchanging positions on the pitch. For this reason, most countries rely on wide spaces and continuous penetration. Football powerhouses composed of players with outstanding individual ability still focus on possession and control of central areas.

Among the 37 features used in the data, 33 are attacking indicators and only 4 are defensive indicators. This shortage of defensive indicators left the data imbalanced. Because defense is heavily emphasized in modern football, the imbalance is a significant limitation of the analysis. Factors that are difficult to express numerically, such as individual ability or a manager's capability, were also not included in the analysis.