🚏 Getting started with transit data

🚉 On Trains: Post #1

Fabio Barbero published on

8 min, 1424 words

Categories: Random

Introduction

Imagine the following hypothetical scenario: you've set yourself the task of visualizing and optimizing the rail network in Europe, and now you need to find a way to get transit data.

You're interested in data about the physical train tracks, as well as the timetables of all rail companies in Europe. And, why not, bus companies and other transport too.

Fine, you tell yourself. It can't be that hard. If Google Maps can do it, you can do it too, right? Deep down, you know that's a lie. Not only because Google Maps is built by a large team with years of development behind it, but because you have yet to see a good alternative to Google Maps that works well across all countries.

So, you start researching. And you come across a very promising start: the General Transit Feed Specification (GTFS).

What is the GTFS?

*opens Wikipedia*

"The GTFS or the General Transit Feed Specification defines a common data format for public transportation schedules and associated geographic information."

A Google employee "monkeyed around" with this specification (yes, that's still from Wikipedia) while developing Google Maps. They basically wanted a way to query any transport company's server and get back structured files describing where the routes are, their names, the stops along each route, their locations and names, the timetables, and so on... The "G" in GTFS used to stand for Google, but it was changed to "General" since not everyone likes Google.

It is now no longer maintained by Google, but by the MobilityData organisation. It keeps evolving (more on that later): it now supports real-time data and has a sibling, the General Bikeshare Feed Specification (GBFS).

The way it works is very simple: you get a zip file containing different text files: routes.txt, trips.txt, stop_times.txt, stops.txt, ... Some of them are optional, such as shapes.txt (which contains the geographical shapes of the routes), and each file's contents are explained in the official documentation (yes, I read docs sometimes).
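Reading one of these tables needs nothing beyond the Python standard library. Here's a minimal sketch; the tiny demo feed is made up, but the file and column names come from the spec:

```python
import csv
import io
import tempfile
import zipfile

def read_gtfs_table(feed_path, table):
    """Read one table (e.g. 'routes.txt') from a GTFS zip into a list of dicts."""
    with zipfile.ZipFile(feed_path) as feed:
        with feed.open(table) as f:
            # GTFS tables are plain CSV, UTF-8 (sometimes with a BOM).
            return list(csv.DictReader(io.TextIOWrapper(f, encoding="utf-8-sig")))

# Tiny made-up feed on disk, just to show the shape of the data.
with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
    with zipfile.ZipFile(tmp, "w") as z:
        z.writestr("routes.txt", "route_id,route_type\nIC1,2\n")
    demo_path = tmp.name

routes = read_gtfs_table(demo_path, "routes.txt")
print(routes)  # [{'route_id': 'IC1', 'route_type': '2'}]
```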

Side note: there seems to be an alternative founded by the European Commission, called NeTEx. It also includes data about fares. However, I have not had much time to investigate it yet, and it does not seem to be as widespread, even within Europe.

Okay, so assuming I now know how to read all the data I get... where do I get it? What I need is an up-to-date list of all GTFS feeds of all transit companies in Europe. Hmmm. I keep finding incomplete or unmaintained lists everywhere. Maybe I should start one myself (no).

From what I understand, a big part of what Mobility as a Service (MaaS) companies (Google Maps, Transit, Moovit, ...) do is aggregating these lists of GTFS feeds and keeping them up to date.

I was shocked to read in a European Commission report that they paid 8000 EUR for one year of access to incomplete data from the Multiple East-West Railways Integrated Timetable Storage (MERITS). This is the cheapest plan on offer, and it provides neither real-time nor historical data. MERITS is owned by the Union Internationale des Chemins de fer (UIC) and seems to be used in a lot of business applications: the company Hacon, for instance, provides the data behind Eurail's (Interrail's) trip planning. I could write a blogpost another time about how angry I am about MERITS and how I reverse engineered Eurail's API, but it's beyond the scope of this article.

Luckily, there are two maintained datasets to be found: transit.land and the Mobility Database (from MobilityData, the organisation maintaining GTFS). I'm not fully sure how much data is missing. I have reported incorrect feeds and submitted new feeds to the Mobility Database, but have not heard back from them in a month.

Quirks with GTFS feeds

So how does it work? From these databases you can retrieve a list of feed URLs, which you then call directly to download each feed's zip file.

Well, that is if the server still works, and properly returns data. I have seen a lot of different things. Dead servers. Corrupted zip files. Google Drive links. FTP servers. Expired certificates.
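A defensive check helps here. This is just a sketch of the kind of validation I mean, not my actual pipeline code: treat anything that isn't a readable zip containing the mandatory GTFS files as a bad feed, instead of letting it crash the run.

```python
import io
import zipfile

REQUIRED = {"stops.txt", "routes.txt", "trips.txt"}  # a few of the mandatory files

def looks_like_gtfs(blob: bytes) -> bool:
    """Return True if these bytes are a readable zip containing the required files."""
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as z:
            if z.testzip() is not None:  # a bad CRC somewhere in the archive
                return False
            names = set(z.namelist())
    except zipfile.BadZipFile:  # dead server sent HTML, truncated download, ...
        return False
    return REQUIRED.issubset(names)

# A feed URL that returns an error page should be rejected, not crash the pipeline:
ok = looks_like_gtfs(b"<html>404 Not Found</html>")
print(ok)  # False
```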

And then, after downloading 12GB of zip files containing GTFS data (raw text) for Europe alone, I have to iterate through them to find the ones that are exclusively about rail. Once again, I've seen everything. I implemented my own CSV parser to make it faster, but a bus company in Spain put spaces after their commas and made my script crash, and so did a lot of other edge cases. Iterating through all the zip files, in a single thread, takes around 20 minutes. And I feel like I've done a good job optimizing my (Python) code.
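Incidentally, the spaces-after-commas problem is exactly what the skipinitialspace flag of Python's built-in csv module exists for (my hand-rolled parser had no such thing). A small illustration, with a made-up row standing in for the Spanish bus company:

```python
import csv
import io

# A header/row pair with spaces after the commas, as some feeds ship them.
messy = "route_id, route_short_name, route_type\n42, C1, 3\n"

strict = list(csv.DictReader(io.StringIO(messy)))
lenient = list(csv.DictReader(io.StringIO(messy), skipinitialspace=True))

print(strict[0])   # {'route_id': '42', ' route_short_name': ' C1', ' route_type': ' 3'}
print(lenient[0])  # {'route_id': '42', 'route_short_name': 'C1', 'route_type': '3'}
```

Note how without the flag, the stray spaces end up inside the column names themselves, so a lookup like row["route_type"] silently fails.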

Okay, so: figuring out which routes are about rail. Let's look at the specification. That means I should look for route_type = 2. But wait. route_type should be an integer from 0 to 12. Why am I getting values like 117, or 1300?

This drove me crazy for a while. It turns out that although the official specification only has values from 0 to 12, most companies in Europe use a different numbering system, apparently derived from the TPEG taxonomy. I didn't investigate it too much, but you can read more about it here. It was proposed as a change to GTFS in 2012, but does not seem to have ever been formally adopted.
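In practice, filtering for rail means checking both numbering schemes. A sketch, assuming (from my reading of the extended route types proposal - do verify this) that the 100-117 block covers railway services:

```python
# Rail under both numbering schemes: 2 in the base GTFS spec, and the
# 100-117 "Railway Service" block in the extended (TPEG-derived) types.
# The extended range here is my reading of the proposal, so double-check it.
RAIL_BASE = {2}

def is_rail(route_type: int) -> bool:
    return route_type in RAIL_BASE or 100 <= route_type <= 117

# 3 is bus, 109 is suburban railway in the extended scheme, 1300 is an aerial lift.
print([t for t in (2, 3, 109, 117, 1300) if is_rail(t)])  # [2, 109, 117]
```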

Open Street Map data

For gathering data about the actual infrastructure (the train lines themselves), OpenStreetMap is the place to go. I love OpenStreetMap. It contains relevant information on the train tracks, such as their gauge (= "track width" - did you know that different countries use different gauges, meaning trains in parts of Spain or in Finland cannot leave the country for the rest of the European Union?) and whether they are abandoned or not.
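As a sketch, here is the kind of Overpass query that pulls rail tracks with their gauge tag for one country. The tag names and the area filter are my assumptions from the OSM wiki, and I haven't tested this at scale:

```python
# Overpass QL sketch: railway tracks in Finland that carry a gauge tag.
query = """
[out:json][timeout:120];
area["ISO3166-1"="FI"][admin_level=2]->.fi;
way(area.fi)["railway"="rail"]["gauge"];
out tags;
"""
# You would POST this string to an Overpass endpoint, e.g.:
# urllib.request.urlopen("https://overpass-api.de/api/interpreter",
#                        data=query.encode())
```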

I have not played much with the data yet, but I found the amazing website by French pirate-politician Pierre Beyssac that tells you how to get from one place to another as a train (assuming you don't have any other trains on the lines). Unfortunately, the website's code doesn't seem to be public, meaning I will have to reimplement it myself.

Stations data

This will probably be a topic for another blogpost, but I just wanted to share a crazy fun fact: each station is given a UIC code by the UIC (the international railway organisation I mentioned earlier for MERITS), yet the data on all stations is not made publicly available by the UIC. Instead, you need to spend a 4-5 figure sum to get access to their database for a year.

The data is not confidential: if you know the UIC code of a station, you are free to share it. The codes are included in many publicly available datasets, such as Wikidata, OpenStreetMap, or trainline's stations dataset.
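Wikidata, for instance, can be queried for UIC codes with SPARQL. A sketch - P722 is, as far as I can tell, the "UIC station code" property, but verify that before relying on it:

```python
# SPARQL sketch against Wikidata: stations that carry a UIC code.
# P722 is my best guess at the "UIC station code" property - check it first.
sparql = """
SELECT ?station ?stationLabel ?uic WHERE {
  ?station wdt:P722 ?uic .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""
# Send it to https://query.wikidata.org/sparql as the (URL-encoded) query parameter.
```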

Closing thoughts

I was planning on having a more well-thought-out article this week. But I waited until the very last moment to write it (I was busy setting up a Minecraft server and touching grass). I really admire how tech articles by Robert Heaton manage to be so well-structured and easy to read. I'd love to know his secrets to writing blogposts. So there you have it, this week's shoutout. The next blogpost will be in 2 weeks!

There are many tools that I have not yet explored and therefore not included in the blogpost, such as the routing server OpenTripPlanner.

I was a bit too shy to include visualizations I have made from the data I got. Mostly because I feel like they're not good enough. Perhaps my next blogpost will indeed be about visualizing map data.

And please, if you have any comments, or know anyone that might have any, please let me know! I'm always happy to talk to people about this!

Note: no text in this document has been generated or rewritten by a Large Language Model.