Silent Parties: A Cluster Analysis of Voting Behavior in the European Parliament

Welcome to this interactive application, designed to explore and analyze the clustering of voting behaviors within the European Parliament. This tool provides insights into how Members of the European Parliament (MEPs) form latent voting blocs based on their voting patterns. Using advanced dimensionality reduction techniques like UMAP, MCA, and DW-NOMINATE, as well as clustering methods such as K-Means, PAM, and HDBSCAN, this application allows you to investigate the ideological alignments and coalitions that emerge within the EU legislative body. Don't worry if you're new to these techniques! This app provides an intuitive interface to guide you through each step, offering visuals and interpretations to help you understand the complex relationships between MEPs.

The data used in this application are publicly accessible through the VoteWatch Europe and Parltrack.eu databases. These sources contain voting records and biographical information for MEPs, covering multiple EU legislative sessions from 2004 to 2022. With this dataset, you can examine how political groups, alliances, and individual MEPs align on critical issues across policy areas such as economics, foreign policy, and social issues.

For more information, please visit the European Parliament website.

How to Use this App

How This Project Came to Be

This application allows you to explore the complexities of MEP voting behavior through various analytical lenses. My hope is that it reveals some of the nuances within the European Parliament and inspires further exploration.

Inspiration

The idea to analyze voting behavior in the European Parliament started with a simple question: Are there hidden patterns in the way MEPs vote?

Data Collection

Using publicly available data on MEPs' voting records, I gathered detailed datasets spanning multiple legislative periods.

Data Processing

Data cleaning, filtering, and encoding helped prepare the dataset for deeper analysis, ensuring accuracy in every step.

Exploratory Analysis

With visualizations and statistical summaries, I began uncovering initial patterns and trends in the voting data.

Dimensionality Reduction

Using UMAP, MCA, and DW-NOMINATE, I transformed high-dimensional voting data into a compact format for clustering.

Clustering

Applying K-Means, PAM, and HDBSCAN, I identified clusters representing potential political groups or alliances.

Validation

Metrics such as the silhouette score, together with a bootstrap stability analysis, were used to assess how reliable the resulting clusters are.

Submitted in partial fulfillment of the requirements
for the degree of Master of Science

Developed by
John F. Brüne

Introduction - Defining the Problem

In May 2024, the U.S. website FiveThirtyEight analyzed patterns in the voting behavior of U.S. House members, using the K-Means clustering algorithm to sort representatives into 8 distinct clusters based on their voting records. This example is also a reminder of the central challenge of clustering unlabelled data: there is no single 'correct' number of clusters. Depending on how we choose to analyze the data, we could find 4, 8, or even 10 clusters, without a definitive answer as to which is most accurate.

In the following tabs, this study will turn to similar data from the European Parliament, aiming to explore the same clustering challenge and investigate how representatives form latent voting blocs based on their behavior.

Graphical tests

The K-Means algorithm aims to minimize the following objective function:

$$ \underset{C_1, \dots, C_k}{\operatorname{arg\,min}} \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 $$

Here, $k$ is the number of clusters, $x_j$ are the data points (each MEP's voting record), and $\mu_i$ is the centroid of cluster $C_i$. The algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids, which reduces the within-cluster variance at every step, but the choice of $k$ remains subjective. Clustering unlabelled data is inherently flexible: the same dataset can yield different numbers of clusters depending on interpretation and goals.
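
The effect of the choice of $k$ on the objective can be illustrated with a minimal sketch. It is not the app's implementation: the function name `kmeans`, the random initialisation, and the synthetic two-dimensional data are all assumptions made for illustration, and no real voting records are used.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (illustrative); returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance from every point to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = centroids.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Evaluate the objective (within-cluster sum of squares) at the final centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = float(d2[np.arange(len(X)), labels].sum())
    return labels, centroids, wcss

# Synthetic stand-in data: three well-separated groups of 30 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# The objective for k = 1..6; it tends to shrink as k grows, so a low value
# alone cannot tell us which k is "right".
wcss_by_k = {k: kmeans(X, k)[2] for k in range(1, 7)}
for k, w in wcss_by_k.items():
    print(k, round(w, 2))
```

Because the objective usually keeps falling as more clusters are added, it is typically read for a visible "elbow" rather than minimised over $k$. Lloyd's algorithm can also stop in a local optimum, so a single run per $k$ is only a rough guide.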

Analytical tests

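Remember that each metric has to be interpreted on its own terms: for the silhouette score and the Calinski-Harabasz index, higher values indicate better-defined clusters, whereas for the Davies-Bouldin index, lower values are better.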
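
As one example of such a metric, the following sketch computes the mean silhouette score from scratch. It assumes Euclidean distances and at least two clusters, the function name `silhouette_score` and the tiny one-dimensional data are hypothetical, and it is not the code used in the app.

```python
import numpy as np

def silhouette_score(X, labels):
    """Mean silhouette over all points (Euclidean distance, >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters get a silhouette of 0 by convention
        # a: mean distance to the other members of the point's own cluster
        # (the distance of the point to itself is 0, so dividing by n_own - 1 excludes it).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

# Hypothetical toy data: two tight pairs of points on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_score(X_toy, [0, 0, 1, 1])  # clusters match the two pairs
bad = silhouette_score(X_toy, [0, 1, 0, 1])   # clusters cut across the pairs
print(good > bad)
```

Values close to 1 mean that points sit much nearer to their own cluster than to the next one, while values near 0 or below suggest overlapping or misassigned clusters.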

Exploratory Analysis

In this step, we delve into the key patterns and relationships within the voting behavior data of the European Parliament. Exploratory analysis is essential for understanding how MEPs (Members of the European Parliament) align on different issues, which factions tend to vote together, and how individual MEP characteristics might correlate with voting behavior. This preliminary step will help us determine relevant variables for clustering and ultimately guide our grouping decisions.

Use the options below to select variables of interest and generate visualizations that reveal underlying structures in the data. For instance, you might explore how different political groups or countries align on specific voting topics, or examine the attendance and activity levels of MEPs. Additionally, interactive maps and summary tables are provided to give you a comprehensive view of the data.

Exploratory Plots and Statistics

The exploratory plots and statistics below provide insights into the distribution and relationships within the selected variables. Use these visualizations to identify notable voting patterns, similarities, and differences among MEPs. You may find that certain groups consistently align on votes, while others show greater variability.
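
To make the idea of "aligning on votes" concrete, here is a minimal sketch of a pairwise agreement matrix. The vote encoding (+1 for, -1 against, 0 abstain, NaN absent), the function name `agreement_matrix`, and the toy data are assumptions for illustration and need not match how the app encodes or measures alignment.

```python
import numpy as np

def agreement_matrix(votes):
    """Share of jointly attended votes on which each pair of MEPs voted the same way."""
    votes = np.asarray(votes, dtype=float)
    n = votes.shape[0]
    out = np.full((n, n), np.nan)
    for i in range(n):
        for j in range(n):
            # Only compare votes in which both MEPs took part.
            both = ~np.isnan(votes[i]) & ~np.isnan(votes[j])
            if both.any():
                out[i, j] = (votes[i, both] == votes[j, both]).mean()
    return out

# Hypothetical toy data: 3 MEPs (rows) x 4 roll-call votes (columns).
votes = [
    [1, 1, -1, 0],
    [1, 1, -1, np.nan],   # absent for the last vote
    [-1, -1, 1, 0],
]
A = agreement_matrix(votes)
print(A)
```

In this toy example the first two MEPs agree on every vote they both attended, while the third agrees with the first only on the shared abstention.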

Interpretation of Box Plot

The box plot visualizes the distribution of the selected variable(s) across different categories, such as political groups (EPGs) or countries. Each box represents the spread of data for one category, showing the median, interquartile range, and potential outliers. This helps in identifying which groups have more homogeneous voting behavior or attendance levels and which ones display greater variation.

For instance, a narrow box with minimal outliers indicates that a group's members tend to behave similarly on the selected measure, while a wider box or multiple outliers may suggest diverse voting patterns within that group.
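
The quantities behind a box plot can be computed directly. The sketch below uses the common 1.5 x IQR rule to flag outliers, and the attendance figures and the function name `box_stats` are made up for illustration. Plotting libraries may use slightly different quantile conventions, so the numbers can differ marginally from what a given plot shows.

```python
import numpy as np

def box_stats(values):
    """Median, quartiles, IQR and 1.5*IQR outliers for one group."""
    v = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the box edges are flagged as potential outliers.
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = v[(v < lower) | (v > upper)]
    return {"q1": float(q1), "median": float(median), "q3": float(q3),
            "iqr": float(iqr), "outliers": outliers.tolist()}

# Hypothetical attendance scores (in percent) for two groups.
group_a = [90, 91, 92, 93, 94]          # homogeneous group: narrow box
group_b = [20, 60, 75, 80, 85, 99]      # heterogeneous group: wide box, one outlier

stats_a = box_stats(group_a)
stats_b = box_stats(group_b)
print(stats_a)
print(stats_b)
```

Here the first group has a narrow interquartile range and no flagged points, while the second has a much wider box and one value flagged as a potential outlier.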

Country Map

The map visualizes the selected measure (such as attendance or voting alignment) by country. This can help you observe geographical patterns in the data, such as higher engagement in certain regions or countries with MEPs who tend to vote together.

Summary Table

This table provides a summary of the selected variable(s) for each political group or country. Use this table to identify differences across groups, such as variations in average attendance scores or voting alignments. Sorting by mean can be useful for highlighting key trends and outliers.

Parliament View


Key Terms and References

Here you’ll find a handy glossary of key terms and concepts, along with a list of references that helped shape this thesis. Think of this section as your go-to guide for understanding the ideas, methods, and insights shared throughout the work.

References

Below is the list of references used throughout the thesis. These references are cited in appropriate sections to provide academic credibility and to acknowledge the authors whose work has contributed to this project.

Hix, Simon, Noury, Abdul, and Roland, Gerard. (2005). Dimensions of Politics in the European Parliament.
Parltrack.eu: Biographical data on Members of the European Parliament.
DW-NOMINATE: Method documentation and applications in legislative studies.
UMAP (Uniform Manifold Approximation and Projection): Dimension reduction for clustering analysis.
HDBSCAN: Clustering methodology and its applications in density-based clustering (official documentation).
FactoMineR: Statistical methods for MCA and PCA analysis.
Flagpedia: Country flag icons.
ggiraph and Shiny: Interactive visualizations in R (package documentation).
Rivera Baena, Oscar Daniel. Didactic modeling process: Linear regression (Shiny app that inspired the design of this application).

Glossary

Cluster: A group of observations that are similar to each other based on a defined set of features. Clustering aims to partition the data into subsets with high internal similarity and high external dissimilarity.

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise): An advanced clustering algorithm that identifies clusters of varying density in the data. It also designates outliers, often as Cluster 0, which are considered noise.

K-means Clustering: A partitioning algorithm that divides data into a pre-defined number of clusters by minimizing within-cluster variance.

PAM (Partitioning Around Medoids): A clustering method that selects medoids as representative points and minimizes the sum of dissimilarities between each data point and its nearest medoid.

UMAP (Uniform Manifold Approximation and Projection): A dimensionality reduction technique that aims to preserve the local neighborhood structure of the data, and to a lesser extent its global structure, while reducing it to a low-dimensional space such as 2D or 3D for visualization.

MCA (Multiple Correspondence Analysis): A method for reducing the dimensionality of categorical data by representing it in a lower-dimensional space.

DW-NOMINATE: A scaling method commonly used in political science to map voting behavior onto ideological dimensions, such as economic and social dimensions.

Silhouette Score: A measure of how well each data point fits within its own cluster compared with the nearest neighboring cluster. Values range from -1 to 1, and a higher score indicates better-defined clusters.

Davies-Bouldin Index: A metric for evaluating clustering quality by measuring the ratio of within-cluster spread to between-cluster separation. Lower values indicate better clustering.

Calinski-Harabasz Index: An index that evaluates clustering performance by comparing the dispersion of points within clusters to the dispersion between clusters. Higher scores are better.

Normalization: A process to scale the data so that each feature contributes equally to the analysis, typically by adjusting the values to a standard range or scale.

Feature Engineering: The process of selecting, transforming, and creating features to improve the performance of machine learning models.

Complete Case Analysis: A method for handling missing data by excluding observations with any missing values.

Radar Chart: A graphical method of displaying multivariate data in the form of a two-dimensional chart of three or more quantitative variables represented on axes starting from the same point.

Box Plot: A visualization technique for summarizing the distribution of a dataset and identifying potential outliers.

Interactive Parliament Plot: A custom visualization that maps Members of the European Parliament (MEPs) and their respective parties onto a semicircular plot with clickable interactivity.

Correlation Coefficient: A measure of the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to +1 (perfect positive correlation).

R-squared (Coefficient of Determination): Represents the proportion of variance in the dependent variable that is predictable from the independent variables.

Adjusted R-squared: A modified version of R-squared that adjusts for the number of predictors in the model, penalizing the inclusion of irrelevant variables.

EPG (European Political Group): The official grouping of Members of the European Parliament based on shared political ideologies.

MEP (Member of the European Parliament): An elected representative in the European Parliament.

Voting Alignment: A measure of how frequently MEPs or parties vote in line with one another.

Shiny: An R framework for building interactive web applications and dashboards for data analysis and visualization.

reactable: A package in R for creating interactive and customizable tables with advanced features.

ggplot2: A widely used R package for creating elegant and versatile visualizations.

Cluster Stability: The extent to which clusters remain consistent across different sampling or parameter settings, often tested through bootstrapping.

Outliers: Data points that deviate significantly from the majority of observations, often identified and excluded in clustering.

Bootstrap Analysis: A resampling method used to estimate the stability of clusters by creating multiple datasets through random sampling with replacement.
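
The Cluster Stability and Bootstrap Analysis entries can be illustrated with a small sketch. It measures, for each original cluster, the best Jaccard overlap with any cluster found on a bootstrap resample, averaged over resamples. Everything here is an assumption for illustration: the function names, the simplified K-means with farthest-point initialisation, and the synthetic data. It is not the procedure implemented in the app.

```python
import numpy as np

def kmeans_labels(X, k, n_iter=50):
    """Simplified K-means with deterministic farthest-point initialisation."""
    centroids = [X[0]]
    for _ in range(1, k):
        # Next centroid: the point farthest from all centroids chosen so far.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[d2.argmax()])
    centroids = np.array(centroids, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(k):
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def cluster_stability(X, k, n_boot=50, seed=0):
    """Mean best-match Jaccard similarity per original cluster over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = len(X)
    base = kmeans_labels(X, k)
    sums, counts = np.zeros(k), np.zeros(k)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample rows with replacement
        boot = kmeans_labels(X[idx], k)
        sampled = set(idx.tolist())
        for c in range(k):
            # Members of original cluster c that appear in this resample.
            a = set(np.flatnonzero(base == c).tolist()) & sampled
            if not a:
                continue
            best = 0.0
            for d in range(k):
                b = set(idx[boot == d].tolist())
                union = a | b
                if union:
                    best = max(best, len(a & b) / len(union))
            sums[c] += best
            counts[c] += 1
    return sums / np.maximum(counts, 1)

# Synthetic stand-in data: two clearly separated groups of 20 points each.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
                    rng.normal(10.0, 0.5, size=(20, 2))])
stability = cluster_stability(X_demo, k=2)
print(stability)
```

Values near 1 indicate that a cluster is recovered almost unchanged across resamples, while clearly lower values suggest that it depends on the particular sample.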