GitHub-Repository link: https://github.com/techlabsms/project-wt-18-12-Abgeordnetenanalyse
The project’s goal is to investigate and visualize historical data on the members of the German parliament since 1949 with regard to their sex, age, country of birth and party affiliation, and to calculate the probability that a representative will be part of the parliament in the next parliamentary period.
During the fall of 2018, all members of our team were accepted into the next TechLabs semester in the city of Münster, Germany. Grateful to have been given the opportunity to learn state-of-the-art tech skills, we all chose the ‘DataScience in Python’ track. Now came the time to develop a collaborative project, which would occupy us for the next months. As our new team, full of aspiring learners, came into being, it quickly became clear that we wanted to analyze a topic in the political realm. The German federal parliament was the obvious choice: its history since 1949 not only provided a lot of useful data, it also allowed us to find interesting correlations previously unknown to us. However, we not only hoped to find such connections, but also to use the historical data to develop a predictive model for the future composition of the parliament.
Our task started with finding a suitable data set that contained all the data we needed on the parliaments since 1949. Luckily this proved quite easy, since the parliament maintains a publicly available data set of all MPs. Furthermore, this data was an abundant source of all kinds of interesting properties, like age, country of birth and even previous profession. There was only one problem: the data was saved in an XML file, which is relatively hard to import into Python.
Importing the data
Since we had never worked with XML files in Python before, we spent a lot of time simply understanding the structure of the data file and how best to import it into a pandas DataFrame, with which we were more familiar. The data can best be described as a tree with many branches and further nested branches inside them. Each MP was assigned an individual ID tag, below which all details about this MP were stored. Each of these properties also had its own tag; for example, there was a tag specifically for the MP’s previous jobs. The real challenge, however, was selecting the parliamentary periods of each MP, since these were nested inside another tag. While we were able to assemble a first working DataFrame using Google magic and a lot of helpful tips from our mentor, the parliamentary periods were a tough nut to crack. The situation was complicated by the fact that, in contrast to single-valued properties like the name, the parliamentary periods had to be stored in a list for every MP. Eventually, however, we were able to incorporate these elements as well, and our first fully working DataFrame was complete.
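To illustrate the kind of import described above, here is a minimal sketch using Python’s built-in xml.etree.ElementTree. The tag names (MP, ID, NAME, BIRTH_YEAR, PERIODS, PERIOD, NUMBER) and the sample records are made up for this example and will not match the real file; only the overall shape (one element per MP, with the periods nested one level deeper) follows the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the file layout; the real tag names differ.
SAMPLE_XML = """
<MEMBERS>
  <MP>
    <ID>11000001</ID>
    <NAME>Example, Anna</NAME>
    <BIRTH_YEAR>1950</BIRTH_YEAR>
    <PERIODS>
      <PERIOD><NUMBER>12</NUMBER></PERIOD>
      <PERIOD><NUMBER>13</NUMBER></PERIOD>
    </PERIODS>
  </MP>
  <MP>
    <ID>11000002</ID>
    <NAME>Sample, Bernd</NAME>
    <BIRTH_YEAR>1961</BIRTH_YEAR>
    <PERIODS>
      <PERIOD><NUMBER>13</NUMBER></PERIOD>
    </PERIODS>
  </MP>
</MEMBERS>
"""


def parse_members(xml_text):
    """Return one dict per MP, with the nested periods collected in a list."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for mp in root.findall("MP"):
        rows.append({
            "mp_id": mp.findtext("ID"),
            "name": mp.findtext("NAME"),
            "birth_year": int(mp.findtext("BIRTH_YEAR")),
            # The periods sit one level deeper, so they need their own path.
            "periods": [int(p.findtext("NUMBER"))
                        for p in mp.findall("PERIODS/PERIOD")],
        })
    return rows


rows = parse_members(SAMPLE_XML)
```

A list of dicts like this can be handed directly to pandas, for example with pd.DataFrame(rows).set_index("mp_id"), which keeps the list of periods as a single cell per MP.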
At this point we began to distribute individual assignments: Paulo Frevert would work on cleaning the DataFrame so that it could later be visualized in an appealing way. László Kühl would be mostly responsible for the development of the prediction model and all related tasks. Last but not least, Raimund Koop and Marc Lütkehermöller had the task of finding correlations and trends in the data and visualizing them appropriately. I’m going to describe each part and its challenges below.
Cleaning the data should always be the first step after importing it, since all other tasks that process the DataFrame rely on data that is complete and free of errors. To this end, each column had to be inspected for duplicates and false values, Not-a-Number (NaN) values had to be filled or dropped, and unnecessary information had to be removed. Most of these tasks could be ticked off very quickly, as the administrators of the parliament had done a good job of keeping the provided data as clean as possible. Nearly every column was free of duplicate entries and false information, and only once did we need to fill None values. This allowed us to move on quickly to other tasks. For example, it proved helpful to encode the sexes ‘male’ and ‘female’ as 1 and 0 respectively. The same goes for columns like country of birth, for which we merged all non-German places of birth into one entry, ‘foreign countries’, for convenience and then converted the column to a binary as well. As the entry barrier for parties was lower during the first years of the parliament, a large number of very small parties cluttered the visualizations. We therefore decided to pool parties together in a way that would allow us to properly show their development over time. For this purpose, some parties were simply labeled ‘Other’, while others were added to a similar but bigger party (e.g. the regional party of the Free Democratic Party (FDP) in the Saargebiet was merged with its parent party for the whole of Germany).
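The encoding and pooling steps described above could look roughly like the following pandas sketch. The column names, the toy rows, the label ‘FDP-Saar’ and the set of parties kept as separate groups are all assumptions made for illustration, not the names used in our actual DataFrame.

```python
import pandas as pd

# Toy data with hypothetical column names and party labels.
df = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "country_of_birth": ["Germany", "Poland", "Germany"],
    "party": ["FDP", "FDP-Saar", "Tiny Party"],
})

# Encode sex as a binary: 1 for male, 0 for female.
df["is_male"] = (df["sex"] == "male").astype(int)

# Collapse every non-German country of birth into a single binary flag.
df["born_abroad"] = (df["country_of_birth"] != "Germany").astype(int)

# Fold regional branches into their parent party (hypothetical mapping) ...
PARTY_MAP = {"FDP-Saar": "FDP"}
# ... and label everything outside an assumed list of main parties 'Other'.
MAIN_PARTIES = {"CDU/CSU", "SPD", "FDP"}

merged = df["party"].replace(PARTY_MAP)
df["party_group"] = merged.where(merged.isin(MAIN_PARTIES), "Other")
```

Keeping the mapping and the list of main parties in two small constants makes it easy to change the pooling later without touching the rest of the code.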
However, a quick validity check of the most recent parliamentary period revealed that the number of MPs in the DataFrame for that period did not match the actual number. At first we believed this to be a small error in the assignment of a few representatives who had switched parliamentary groups, but it later turned out to be a much bigger problem. In fact, every representative who left parliament and was replaced by another remained in the data, while the replacing member was added to it. The result was a higher number of MPs in every period, sometimes more than 50 members too many. This frustrated us, especially since we worried that it might distort our prediction, but given the sheer number of changes and the fact that every such MP would have had to be deleted by hand, we decided to leave the data as is.
Engineering the DataFrame & Machine Learning
With the cleaning process finished, it was time to engineer the DataFrame for the prediction model. This process consisted of two main tasks: calculating each MP’s age for each parliamentary period and changing the index from the previous cryptic ‘MP-ID’ to one that takes the parliamentary period into account. More concretely, we wanted every MP to get a separate row for each period they had been in the parliament (e.g. a representative who held a seat for three periods would get three rows with different indexes). This admittedly was quite a daunting task, since it required many different steps, each needing its own functions written from scratch. First, the list of parliamentary periods for each MP had to be broken down into 19 columns, each containing a binary value indicating whether the representative had been a member in that period. Second, we had to make 19 individual DataFrames from these new columns, one for each parliamentary period. Then we had to create the new index and apply it to the individual DataFrames by creating a dictionary that zipped the previous index to the new one. Finally, the 19 DataFrames had to be merged back into one big DataFrame. Calculating the age of each MP was now easy: we simply took the difference between the starting year of each period and the birth year of that MP. With this groundwork in place, we could now face the prediction model itself.
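The reshaping steps above can be sketched in a few lines of pandas. This is a condensed illustration rather than our original code: it uses only three periods instead of 19, and the column names, the toy MPs and the ‘ID_period’ index format are assumptions.

```python
import pandas as pd

# Starting years of the first three periods (the real data has 19).
PERIOD_START = {1: 1949, 2: 1953, 3: 1957}

# Toy input: one row per MP, periods stored as a list (hypothetical columns).
mps = pd.DataFrame(
    {"birth_year": [1900, 1910], "periods": [[1, 2, 3], [2]]},
    index=pd.Index(["A", "B"], name="mp_id"),
)

# Step 1: one binary column per period.
for p in PERIOD_START:
    mps[f"in_period_{p}"] = mps["periods"].apply(lambda ps, p=p: int(p in ps))

# Steps 2 and 3: one DataFrame per period, re-indexed via a zipped dictionary.
frames = []
for p, start_year in PERIOD_START.items():
    part = mps[mps[f"in_period_{p}"] == 1].copy()
    part["period"] = p
    part["age"] = start_year - part["birth_year"]
    new_index = dict(zip(part.index, [f"{mp_id}_{p}" for mp_id in part.index]))
    frames.append(part.rename(index=new_index))

# Step 4: merge the per-period DataFrames back into one long DataFrame.
long_df = pd.concat(frames)
```

In this toy example MP ‘A’ ends up with the three rows ‘A_1’, ‘A_2’ and ‘A_3’, each with its own age, while the binary period columns are kept for later use.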
As our initial goal was to predict the composition of the next parliament, we first thought about ways to work properly with the large number of variables we encountered. However, we soon realized that a full-scale prediction of the next parliament based only on the properties of past MPs would be infeasible, especially given the inherent uncertainty of voting patterns and party politics. Thus, we had to narrow our focus to simpler prediction tasks. Concretely, we now wanted to train a model on what we dubbed the ‘survival-rate’: the share of MPs in a given parliamentary period who also belonged to the next parliament. Calculating this survival-rate proved pretty easy: we only had to select all representatives of a given period and then sum the column for the next parliamentary period over these MPs. As these columns had all been converted to binaries, we then only had to divide this sum by the total number of MPs in the given period to obtain the survival-rate. Repeat for each period, add all rates to a list et voilà, the task was done! Processing this new information was the more challenging exercise. It was especially hard to find the model that would best fit the survival-rate time series. Using scikit-learn, we tried a number of different models, from a simple Decision Tree Classifier, through Logistic Regression, to KNeighbors. Eventually we found a Gaussian model to be the best fit. The resulting figure can be seen below:
The blue line represents the actual survival-rate over time, while the orange line shows the survival-rate predicted by the Gaussian model. As you can see, our model generally does a good job of aligning with the actual data, but in the last years especially the fit is no longer nearly as good (this can also be seen in the green line, which shows the standard deviation). Nevertheless, we chose to predict the survival-rate of the current parliament at the end of its parliamentary period and, using the same model, calculated a rate of 78%. It will be very interesting to compare this prediction with the actual development, though we reckon that the actual survival-rate will be much lower, as uncertainty has increased and the entry of new parties into the Bundestag makes a reshuffling of seats more likely.
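The survival-rate calculation described earlier can be sketched as follows. The function name survival_rates, the column names and the toy membership matrix are made up for illustration; the only thing taken from our approach is the idea of summing the binary column of the next period over the MPs of the current period.

```python
import pandas as pd

# Toy membership matrix: one row per MP, one binary column per period.
members = pd.DataFrame(
    {
        "in_period_1": [1, 1, 1, 0],
        "in_period_2": [1, 1, 0, 1],
        "in_period_3": [1, 0, 0, 0],
    },
    index=["A", "B", "C", "D"],
)


def survival_rates(flags):
    """Share of each period's MPs who also sit in the following period."""
    cols = list(flags.columns)
    rates = []
    for current, following in zip(cols, cols[1:]):
        sitting = flags[flags[current] == 1]
        # Summing a binary column counts the MPs who are still present.
        rates.append(sitting[following].sum() / len(sitting))
    return rates


rates = survival_rates(members)
```

The last period has no successor, so the list contains one rate fewer than there are periods. Here two of the three MPs from period 1 return in period 2, and one of the three MPs from period 2 returns in period 3.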
Finally, we wanted to visualize our DataFrame in the hope of finding interesting correlations. This part consisted of two main sections: first, showing the development of particular characteristics of representatives (age, sex and foreign country of birth), and second, the survival-rate of MPs for particular parties and attributes. For this we made use of the extensive plotting features of matplotlib and seaborn. As the actual visualization is pretty straightforward (we just had to select the respective part of the data and label the axes), I will focus on the most important results hereafter.
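As a rough idea of what ‘selecting the respective part of the data’ means here, the following sketch aggregates a long-format DataFrame per period and per party. The column names and the toy values are assumptions, and the plotting call itself is left out so that the example only depends on pandas.

```python
import pandas as pd

# Toy long-format data: one row per MP and period (hypothetical columns).
long_df = pd.DataFrame({
    "period_start": [1949, 1949, 1953, 1953],
    "party": ["SPD", "FDP", "SPD", "SPD"],
    "age": [50, 60, 40, 44],
    "is_male": [1, 1, 0, 1],
})

# Mean age and share of men for each parliamentary period.
mean_age = long_df.groupby("period_start")["age"].mean()
share_male = long_df.groupby("period_start")["is_male"].mean()

# Mean age per party: periods as rows, parties as columns.
mean_age_by_party = long_df.pivot_table(
    index="period_start", columns="party", values="age", aggfunc="mean"
)
```

Each of these results can then be drawn with its .plot() method (which uses matplotlib) or passed to seaborn; a party that is absent in a period simply shows up as a missing value in the pivot table.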
First we looked at the age structure of representatives over time:
The main point of interest in this data is the change in the mean age of all MPs. As shown, the mean age dropped sharply during the late 1960s and early 70s. A likely explanation is the emergence of the counterculture movement of 1968. We found it really fascinating to be able to see such historical events in the data and to speculate about their origins (as the reasons for a change were often far from obvious).
The same can be said when looking at the age distribution for each party:
While these results were as expected, it was still nice to get a confirmation of our intuitions. Next, we looked at the distribution of sex for each party and the parliament as a whole:
The evolution of the share of men was not quite as radical as that of age; however, the general trend can clearly be seen. In particular, the entrance of the Greens and the Left Party visibly reduced the proportion of male representatives. Note that the drop to 0% for the Left Party in 2002 can be explained by their exit from the parliament.
On to the share of foreigners! A foreigner in our data was every MP with a country of birth that was not Germany.
Surprisingly, there is no clear trend in the data. While the development during the 1970s and 80s is similar to that of the age and sex distributions, the proportion soon decreased again. Although Germany as a whole has generally grown more diverse over the years, this development is not reflected in the data. It could be interesting to theorize about the reasons for that.
Lastly, we have taken a look at the survival-rate in greater detail than in our prediction model, as we found it to be of particular interest to see how it changed based on personal characteristics.
We are already familiar with the overall survival-rate by now. The big Volksparteien, the Union and the SPD, were relatively consistent and resembled each other, as expected. Astonishingly, the slow decline of these parties does not show in the data; the poor performance of the SPD in particular would give reason to expect a more distinct kink in the graph. The three other parties matched our expectations more closely, considering their varying election results, exemplified by the exits from parliament of the Left Party in 2002 and of the FDP at the 2013 election, which ended the period that began in 2009. The fact that the Left Party still scored a survival-rate slightly above 0% in 2002, followed by a rate of 100%, can be explained as follows: while the party failed to clear the national 5% hurdle, a few individual representatives still won a seat via a direct mandate (and they of course entered the following parliament as well).
To conclude, we all believe the last few months to have been a great experience which we would repeat anytime. Not only were we able to learn the basics of programming with Python, we also gained deeper insight into the workings of data science and its related fields. The project itself was very interesting, especially given its relevance to the current political situation. While it was not possible to completely fulfill all our initial ideas for the project, we are generally pretty happy with the way the results turned out. Furthermore, we believe the project to be worth revisiting after the election of the next parliament, since the entrance of the AfD and the further decline of party loyalty could significantly change the dynamics revealed here.
Team members (all ‘DataScience in Python’-Track):