Flight data at SFO and SFO Museum – 1.17 million flights later

This is a blog post by aaron cope that was published on January 24, 2020 . It was tagged sfo, opendata, flightdata and whosonfirst.

timetable: TWA (Trans World Airlines), Paper, ink. Gift of David A. Abercrombie, in memory of Stanley A. Abercrombie , SFO Museum Collection, 2001.039.770

A little over a year ago we announced that we had started to collect flight data for every flight in and out of SFO in a blog post titled Capturing flight data at SFO and SFO Museum. A few months later we supplemented that work with improved aircraft information and flight paths, detailed in a blog post titled Harvey Milk Plane Has a Permalink – Updated flight data at SFO Museum.

airmail flight cover: Pan American World Airways, Airbus, Toulouse - New York route, Paper, ink, adhesive. Gift of the Pan Am Association, SFO Museum Collection, 2000.058.1185 a c

As of January 2020 we continue to collect flight data every day and publish it as open data. So far, we’ve collected over 1.2 million flights in and out of SFO since January 2019! Each one is recorded as a Who’s On First -style GeoJSON document with a stable and permanent identifier, hierarchies specific to that flight and pointers to all the places and airports and airlines and aircraft involved with that flight.

airmail flight cover: Pan American World Airways, Boeing 747, New York - London route, Paper, ink, adhesive. Gift of the Pan Am Association, SFO Museum Collection, 2000.058.1185 a c

The number “1.2 million flights” is a little misleading since it is inclusive of “codeshares” where one airline operates the same flight (as in a plane with seats in the sky) on behalf of one or more other airlines.

For example if you are flying from Auckland to Houston you might first cross the Pacific Ocean on flight NZ9816, an Air New Zealand plane to San Francisco. From there you’d transfer to flight NZ9194 which is really flight UA505 on a United Airlines plane to Houston.

Your ticket says NZ9194 but you’re really on UA505. That’s codesharing and it happens a lot but there were still over 769,250 actual flights (as in a plane on the runway) in and out of SFO in 2019!

menu: Pan American World Airways, Historic First Flights series, Lockheed L-049 Constellation, Paper, ink. Gift of Thomas G. Dragges, SFO Museum Collection, 2001.080.201

These data aren’t necessarily interesting in the moment. These data become interesting over time when there are a lot of them to coral in to unexpected patterns and the proverbial shape of the elephant. Their value comes from being able to look back and see things the then-present never imagined. The challenge when you want to look back at past data is often that no one thought it worthwhile to collect at the time or to give it a safe and patient home where the future might find it in the…well, future.

Our project to collect flight data is part of a larger effort. How and where do we keep the data internally? How and where do we make it available to the public? How do we do these things in such a way that it doesn’t become a burden for staff and resources? Every six months or so I spend a day or two revisiting and improving the tools and infrastructure we’ve set up to harvest and publish flights but otherwise the entire workflow is automated and doesn’t need any babysitting from me or anyone else at the museum.

brochure: TWA (Trans World Airlines), Boeing 747, Paper, ink. Gift of the William Hough Collection, SFO Museum Collection, 2006.010.545

“A day or two” is still a precious commodity in a museum where available staff time is a limited resource. The effort to collect these flight data is not nothing. However, when you consider the amount of time it would take to do the same work retroactively, with uncertain results at best, coupled with the fact that these efforts ensure data is also collected going forward it seems to me like time well spent.

The cultural heritage sector needs more infrastructure to automate the work that would never be left to staff to do by hand because of time and cost. We need those systems to be stable and, with a little bit of investment upfront, inexpensive. These are not insignificant challenges but the work we’re doing around flight data is one attempt to address those concerns and to make them viable.

airmail flight cover: Pan American World Airways, Boeing 747, Paper, ink, adhesive. Gift of the Pan Am Association, SFO Museum Collection, 2000.058.1185 a c

Today we are happy to announce that we’ve published the flight data for all of 2019 as a single downloadable SQLite database file:

https://millsfield.sfomuseum.org/distributions/sqlite/sfomuseum-data-flights-2019-latest.db.bz2
That file is 1.5GB compressed and 10GB uncompressed so you may want to be careful where you download it from.

SQLite is a small and fast relational database with a rich set of features that has been ported to nearly all operating systems and is supported by a long list of programming languages. Unlike other databases which store data across a number of files and folders SQLite databases are entirely self-contained in a single file.

The SFO Museum 2019 flight data database contains 1,513,571 records spanning 1,171,680 flights (which constitute 1,507,230 records when you include flight paths and other “alternate” geometries for those flights).

negative: San Francisco International Airport (SFO), inaugural TWA (Trans World Airlines) Boeing 747 flight, Negative. Transfer, SFO Museum Collection, 2011.032.2236

The other 6,341 records are the data from the sfomuseum-data-architecture, sfomuseum-data-enterprise, sfomuseum-data-aircraft and sfomuseum-data-whosonfirst repositories. These data are included with the database because the flight data will reference some or all of their records and we want people to be able to retrieve all the related and contextual information for flights at SFO in a single download.

Eventually we will produce blended (or “combined”) SQLite databases for all of our collections-related metadata. As with all our open data releases this data is licensed using the Linux Foundation’s Community Data License Agreement (Permissive).

The data is stored in five tables reflecting the Who’s On First -ish nature of the documents:

airmail flight cover: BOAC (British Overseas Airways Corporation), first polar flight, Paper, cardstock, ink. Purchase, SFO Museum Collection, 2010.221.001

What you do with a database full of flights is up to you. For example, I queried the data directly to count the number of not-codeshared flights:

$> sqlite3 sfomuseum-data-flights-2019-latest.db

sqlite> SELECT COUNT(id) FROM geojson
	WHERE
	json_extract(json(body), '$.properties.sfomuseum:placetype') = "flight"
	AND				 
	json_extract(json(body), '$.properties.flysfo:base_airline') != ""; 

769250


The easiest place to get started is probably Simon Willison’s Datasette tool, which describes itself as:

…a tool for exploring and publishing data. It helps people take data of any shape or size and publish that as an interactive, explorable website and accompanying API.

Datasette is aimed at data journalists, museum curators, archivists, local governments and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of tools and plugins dedicated to making working with structured data as productive as possible.


Once you’ve installed Datasette you can start it up with the flight data like this:

$> datasette ./sfomuseum-data-flights-2019-latest.db --config sql_time_limit_ms:3600


And then you can start to query the data in your web browser. For example, all of the British Airways flights involving JFK:

Or all of the flights in and out of Gate G97 (before the new Terminal 1 was opened and before Gate G97 became Gate G8).

In a similar vein we’ve made sure that our SQLite database works with the Who’s On First Browser tool which describes itself as:

…a web application written in the Go programming language for rendering known Who’s On First (WOF) IDs in a number of formats including HTML, SVG, PNG and GeoJSON. It uses Bootstrap for HTML layouts and Leaflet, Tangram.js and Nextzen vector tiles for rendering maps. All of these dependencies are bundled with the tool and served locally. With the exception of the vector tiles (which can be cached) and a configurable data source there are no external dependencies. It is designed to work locally and remotely, including other people’s cloud services, with a variety of Who’s On First datasources.


The Who’s On First Browser builds on work we discussed in an earlier blog post titled Using the Placeholder Geocoder at SFO Museum around the idea of self-contained and locally hosted web applications.

Unlike Datasette the Who’s On First Browser does not currently support search functionality. It will support searching one day in the future but right now it’s probably not the right tool for someone who wants to casually browse or “spelunk” their way through the data.

As mentioned above the Who’s On First Browser is a tool for rendering the geometries and properties of known records, including their “alternate” geometries in a number of different formats. These include a human-friendly web page as well as endpoints for returning raw GeoJSON data and for rendering that data as SVG or PNG images.

The Who’s On First Browser tool is available from the Who’s On First GitHub organization and only requires the Go programming language to use. Once you’ve downloaded the tool you can start it up like this:

$> cd go-whosonfirst-browser-sqlite

$> go run -mod vendor cmd/whosonfirst-browser/main.go \
	-enable-all \
	-reader-source 'sql://sqlite3/geojson/id/body?dsn=sfomuseum-data-flights-2019-latest.db' \
	-nextzen-api-key {NEXTZEN_APIKEY}


Then you can start to look up known IDs. For example, Alaska Airlines flight AS1925 from Los Angeles to San Francisco on October 15, 2019. Its permanent ID is 1495160039.

This is what it looks like in the Who’s On First browser when you visit http://localhost:8080/id/1495160039 in a web browser:

The screenshot above is actually cropped because the list of properties on the left hand side is very long. Most property lists are collapsed by default but here’s what the flysfo and icao properties look like for flight AS1925:

Here’s a screenshot of the raw GeoJSON and rendered HTML data side by side:

The Who’s On First browser also supports the display of “alternate” geometries which is where we store various flight path data. These include the actual recorded path versus the flight plan, the approach or takeoff from SFO and so on. These alternative geometries are listed under the src:geom_alt property.

For example if you visited http://localhost:8080/id/1495160039-alt-swim-path/ in your web browser you’d see the recorded flight path for flight AS1925:

If you visited http://localhost:8080/id/1495160039-alt-swim-approach/ you’d see the detailed path the flight took on its approach in to SFO:

SFO Museum has contributed enough time and code to the Who’s On First Browser project to ensure that alternate geometries are supported. When and whether we will be able to add search functionality is uncertain. It’s definitely something we would find helpful so it might happen sooner than later. In the meantime it’s a good and useful tool for proving that the open data we publish is usable and useful in a variety of different tools.

airmail envelope: Aeroflot Soviet Airlines, Paper, ink. Gift of the Captain John B. Russell Family, SFO Museum Collection, 2012.149.1736

In the first blog post about collecting flight data I noted that:

Importantly, these data are not being accessioned in to the collection yet. There is not a corresponding accession number for each flight on every day. It is reasonable to ask: Does it really make sense to collect every single flight? Even if it does, is this data better suited for a library or an archive, rather than the museum? Conveniently, SFO Museum is all three of these things so we’re well-positioned to craft an answer but they remain valid questions.


We still haven’t answered the larger question of whether we should accession flight data. The notion of rolling up a year’s worth of data in to a SQLite database suggests how we might accession this and other similar kinds of data, though.

Rather than adding a few thousand records to our collection management system each day perhaps we might add a single record every year and reference a database file as the collection “asset”. In this way the database is sort of like a traditional library finding aid but with built-in searching and indexing.

Part of the challenge around collecting any digital asset is how to preserve it. One of the nice things about relational databases is that they are just rows and tables and can easily be exported as comma-separated value (CSV) text files. Those files can, if necessary, be printed to paper which might seem counter-intuitive but paper is a proven and trusted archival material. In this way our flight data starts to look a lot like the log books of the past.

Taking the idea one step further we could print these text files using an optical character recognition (OCR) friendly font and then re-scan those same paper files using OCR software finally re-importing the CSV data back in to a SQLite database. That would be the best of both worlds for both preserving and making accessible these kinds of datasets. Actually designing and implementing the tools for this sort of archival “round-trip” is still just an idea right now so if you, or someone you know, builds it before we do please let us know!

photograph: TWA (Trans World Airlines), Boeing 707 final flight, Photograph. Gift of George Gayuski , SFO Museum Collection, 2001.082.108 f