📊 Exploring open access GTFS data
🚉 On Trains: Post #2
Introduction
As discussed in my last blogpost, I have been exploring different General Transit Feed Specification (GTFS) sources to build a railway map of Europe with scheduling data.
In this short blogpost, I will show some simple visualisations to look at the data I managed to collect from two databases: Mobility Database and transit.land.
Additionally, I looked at stations on OpenStreetMap (OSM) (nodes with the "railway"="station" attribute), and at stations from the open-source trainline stations dataset.
The data collection process was straightforward: a script queries the transit.land API and the Mobility Database API, retrieves all feeds marked as active from agencies in European countries, and downloads the file behind each returned URL.
Both APIs were queried, and the URLs downloaded, today, December 8th 2024.
Visualisations
Collected data overview
Combining both data sources, I ended up with a list of 739 URLs linking to GTFS files, after removing duplicates.
Only some of these URLs returned a zip file (the others were dead links, 404s, ...). In the plot below, I call such URLs "Available", and differentiate between URLs that originated from transit.land, Mobility Database, or both.
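To make the bookkeeping concrete, here is a minimal sketch of how two URL lists could be merged and labelled by origin, and how a downloaded payload could be checked for being a zip file. This is not the script used for this post: the function names and the example URLs are hypothetical, and fetching the URL lists from the two APIs is left out.

```python
import io
import zipfile
from collections import defaultdict


def merge_feed_urls(transitland_urls, mobility_db_urls):
    """Map each distinct URL to 'transit.land', 'Mobility Database' or 'both'."""
    sources = defaultdict(set)
    for url in transitland_urls:
        sources[url.strip()].add("transit.land")
    for url in mobility_db_urls:
        sources[url.strip()].add("Mobility Database")
    return {
        url: "both" if len(found_in) == 2 else next(iter(found_in))
        for url, found_in in sources.items()
    }


def looks_like_zip(payload: bytes) -> bool:
    """True if the downloaded bytes form a zip archive (an 'Available' URL)."""
    return zipfile.is_zipfile(io.BytesIO(payload))
```

A dead link that answers with an HTML error page fails the `looks_like_zip` check even when the server returns a success status code, which is why the check looks at the content rather than at the response code.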
47 of the downloaded zip files were duplicates of other files (found by comparing file hashes) and were therefore removed. The downloaded zip files totaled more than 9 GB, with the uncompressed archives averaging 734 MB each.
A dozen zip files were corrupted, so they had to be left out of the subsequent plots.
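For illustration, duplicate and corrupted archives could be separated along these lines. This is a sketch under my own assumptions (SHA-256 as the hash, Python's zipfile integrity test as the definition of "corrupted"), and the function names are hypothetical.

```python
import hashlib
import zipfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_archives(paths):
    """Return (usable, duplicates, corrupted) lists of paths."""
    seen_hashes = set()
    usable, duplicates, corrupted = [], [], []
    for path in sorted(paths):
        file_hash = sha256_of(path)
        if file_hash in seen_hashes:
            duplicates.append(path)
            continue
        seen_hashes.add(file_hash)
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() returns the first damaged member, or None.
                bad_member = archive.testzip()
        except zipfile.BadZipFile:
            bad_member = "<not a zip archive>"
        (corrupted if bad_member else usable).append(path)
    return usable, duplicates, corrupted
```

Hashing only catches byte-identical files; two feeds with the same content but different compression settings would still count as distinct.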
Different GTFS files
My first curiosity was to see which files from the official GTFS documentation were actually used.
This plot shows the frequency of filenames found in GTFS files. Files in red are filenames that are not referenced in the official documentation and appear more than once. After a closer look, emissions.txt files seem to be common in Finland, but pretty much nowhere else.
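A count like this can be sketched with the standard library alone. The set of reference filenames below is an illustrative subset that I assume for the example, not the complete list from the GTFS documentation, and the function names are hypothetical.

```python
import zipfile
from collections import Counter

# Illustrative subset of filenames from the GTFS reference (assumed, not exhaustive).
REFERENCE_FILES = {
    "agency.txt", "stops.txt", "routes.txt", "trips.txt", "stop_times.txt",
    "calendar.txt", "calendar_dates.txt", "shapes.txt", "frequencies.txt",
    "transfers.txt", "feed_info.txt",
}


def count_filenames(zip_files):
    """Count in how many feeds each filename appears."""
    counts = Counter()
    for zip_file in zip_files:
        with zipfile.ZipFile(zip_file) as archive:
            # A set, so a name counts at most once per feed; folders are ignored.
            names = {
                name.rsplit("/", 1)[-1]
                for name in archive.namelist()
                if not name.endswith("/")
            }
        counts.update(names)
    return counts


def unofficial_files(counts, reference=REFERENCE_FILES, min_count=2):
    """Filenames outside the reference set that appear in at least min_count feeds."""
    return {
        name: n for name, n in counts.items()
        if name not in reference and n >= min_count
    }
```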
The plot below shows the sizes of each of these files. The stop_times.txt file is unsurprisingly the largest, with a maximum size of 3.70 GB!
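File sizes can be read from the zip directory without extracting anything. The sketch below assumes "size" means the uncompressed size recorded in the archive; the function names are hypothetical.

```python
import zipfile
from collections import defaultdict


def member_sizes(zip_files):
    """Collect the uncompressed size in bytes of every file, grouped by filename."""
    sizes = defaultdict(list)
    for zip_file in zip_files:
        with zipfile.ZipFile(zip_file) as archive:
            for info in archive.infolist():
                if not info.is_dir():
                    name = info.filename.rsplit("/", 1)[-1]
                    sizes[name].append(info.file_size)
    return sizes


def largest_by_file(sizes):
    """Return (max_size, filename) pairs, largest first."""
    return sorted(((max(v), name) for name, v in sizes.items()), reverse=True)
```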
Route types
Something I pointed out in my last post is how most companies do not adhere to the GTFS reference for route types, which only defines the values 0 to 7, 11 and 12, and use the TPEG taxonomy instead. Here are the different distributions of route types in all downloaded GTFS files:
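As a sketch of how such a distribution could be computed for one feed, the code below tallies the route_type column of routes.txt and flags values outside the basic set. It assumes routes.txt sits at the root of the archive and is UTF-8 encoded; the function names are hypothetical.

```python
import csv
import io
import zipfile
from collections import Counter

# Values defined by the basic GTFS route_type enumeration.
BASIC_ROUTE_TYPES = {0, 1, 2, 3, 4, 5, 6, 7, 11, 12}


def route_type_counts(zip_file):
    """Count how often each route_type value occurs in a feed's routes.txt."""
    counts = Counter()
    with zipfile.ZipFile(zip_file) as archive:
        with archive.open("routes.txt") as raw:
            # utf-8-sig drops the byte-order mark that some feeds start with.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            for row in csv.DictReader(text):
                counts[(row.get("route_type") or "").strip()] += 1
    return counts


def is_basic(route_type):
    """True if the value belongs to the basic enumeration, False otherwise."""
    return route_type.isdigit() and int(route_type) in BASIC_ROUTE_TYPES
```

Values for which `is_basic` is false, such as three-digit codes, are the ones that point to the extended taxonomy.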
Train stations per data source
Now onto the interesting stuff! What is the actual coverage of train stations each database has?
This plot combines data from OSM stations and the downloaded GTFS files. Each station is given a color depending on whether its source came from transit.land, Mobility Database, or both.
It is interesting to see how transit.land has better coverage of Finland and Sweden, while Mobility Database offers more stations in Spain. This may require further investigation.
Some stations had locations outside of Europe, including some at 0,0, which I removed from the dataset.
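A filter of this kind can be sketched with a simple bounding box. The bounds below are a rough assumption of mine for "Europe", not the values used for the plot, and the function names are hypothetical.

```python
# Rough bounding box for Europe (assumed values, for illustration only).
EUROPE_BOUNDS = {"min_lat": 34.0, "max_lat": 72.0, "min_lon": -25.0, "max_lon": 45.0}


def is_plausible_station(lat, lon, bounds=EUROPE_BOUNDS):
    """Reject the 0,0 placeholder and anything outside the bounding box."""
    if lat == 0 and lon == 0:
        return False
    return (
        bounds["min_lat"] <= lat <= bounds["max_lat"]
        and bounds["min_lon"] <= lon <= bounds["max_lon"]
    )


def keep_european_stops(rows):
    """Filter stops.txt rows (dicts with stop_lat / stop_lon strings)."""
    kept = []
    for row in rows:
        try:
            lat, lon = float(row["stop_lat"]), float(row["stop_lon"])
        except (KeyError, TypeError, ValueError):
            continue  # missing or unparsable coordinates
        if is_plausible_station(lat, lon):
            kept.append(row)
    return kept
```

A bounding box is coarse: it keeps parts of North Africa and Turkey, so a polygon test would be needed for a stricter definition.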
Timeline of data sources
The timeline above shows the minimum and maximum values of "date" in the calendar_dates.txt file of each feed. The letter next to the plot indicates whether the feed link came from Mobility Database (M), transit.land (T) or both (B).
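For reference, the date span of a single feed could be extracted roughly as follows. GTFS stores dates as YYYYMMDD strings; the sketch assumes calendar_dates.txt sits at the root of the archive, and the function name is hypothetical.

```python
import csv
import io
import zipfile
from datetime import datetime


def calendar_dates_span(zip_file):
    """Return (earliest, latest) date in calendar_dates.txt, or None if absent or empty."""
    with zipfile.ZipFile(zip_file) as archive:
        if "calendar_dates.txt" not in archive.namelist():
            return None
        with archive.open("calendar_dates.txt") as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            dates = [
                datetime.strptime(row["date"].strip(), "%Y%m%d").date()
                for row in csv.DictReader(text)
                if (row.get("date") or "").strip()
            ]
    return (min(dates), max(dates)) if dates else None
```

Feeds that describe their service only through calendar.txt return None here, so they would need the start_date and end_date columns of that file instead.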
As one can see, Norway really stands out here: they provide a zip file with their entire historical data. Most other companies seem to provide data for a few months only. Some links also seem to point to GTFS data from the past that is no longer updated. This highlights the importance of preprocessing all data.
OSM stations with uic attribute
Finally, here is a map of railway stations from OSM, with information on whether or not they contain a UIC attribute, and whether that attribute matches the one from the trainline database. The UIC, the International Union of Railways, gives each station a unique identifier. The UIC does not publish a public list, so this useful identifier has to be collected from different sources online.
Note: this plot was updated on 17/12/2024 to include railway=halt
The stations were matched by looking at all stations within 600 m of the original station, picking the one with the same ID if there is one, and returning the closest station otherwise. It is clear from the map that some countries, such as Italy, really lack UIC attributes for their stations. Other countries, such as Germany, seem to misuse the UIC id: upon a closer look, a lot of German stations seem to use Deutsche Bahn's internal number instead of the UIC code.
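The matching rule described above can be sketched as follows. This is an illustration rather than the code behind the map: it assumes great-circle (haversine) distance, stations represented as dicts with "lat", "lon" and an optional "uic" key, and hypothetical function names.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


def match_station(osm_station, candidates, max_distance_m=600):
    """Prefer a nearby candidate with the same UIC id, else the closest nearby one."""
    nearby = []
    for candidate in candidates:
        distance = haversine_m(
            osm_station["lat"], osm_station["lon"], candidate["lat"], candidate["lon"]
        )
        if distance <= max_distance_m:
            nearby.append((distance, candidate))
    if not nearby:
        return None
    nearby.sort(key=lambda pair: pair[0])
    osm_uic = osm_station.get("uic")
    if osm_uic:
        for _, candidate in nearby:
            if candidate.get("uic") == osm_uic:
                return candidate
    return nearby[0][1]
```

Scanning every candidate for every station is quadratic; with tens of thousands of stations a spatial index would be the obvious next step.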
Being in close contact with members of the OSM community around Europe, I will discuss automating the integration of different sources, such as trainline's database, into OSM data in the future.
Closing thoughts
This blog post ended up a bit more rushed than I wanted, but I'm happy it made it to my bi-weekly schedule. The visualisations are quite basic so far, but I'm still happy with them! More complex visualisations will definitely come in the future, and I will make sure to share them in my blogposts.
If you're fascinated by visualisations of mobility networks in Europe, I highly recommend checking out Spiekermann & Wegener's visualisations. One of my goals is to replicate some of their visualisations and release the methodology as open source code.
What's next? I am planning on comparing different trip planner software in a future blogpost.
I am also planning on releasing my code for downloading and efficiently processing GTFS files (with threading) in the future. I am also happy to share it via email; feel free to send me a message.
Note: no text in this document has been generated or rewritten by a Large Language Model.
This blogpost was updated on 09/12 and 17/12 to fix minor issues in plots