SFO Museum flight data records now available as GeoParquet exports

This is a short blog post to announce the availability of flight data for of the over 8.4 million (and counting) flights that have traveled in and out of SFO, since 2006, as GeoParquet files. GeoParquet is a standard which add geospatial types to the Apache Parquet standard:
Apache Parquet is a powerful column-oriented data format, built from the ground up to as a modern alternative to CSV files. GeoParquet is an incubating Open Geospatial Consortium (OGC) standard that adds interoperable geospatial types (Point, Line, Polygon) to Parquet.
One of the most attractive properties of (Geo)Parquet files is that they can be queried “over the wire” from a remote server without having to download the entire database locally. That’s pretty exciting especially considering how much flight data we publish. For example, using the command line DuckDB tool we can query for flights in January 2025 grouped by the individual aircraft (tail number) that operated them like this:
D SELECT "swim:tail_number" AS tail_number, COUNT("wof:id") AS count FROM read_parquet('https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2025-01.parquet') WHERE tail_number != '' GROUP BY tail_number ORDER BY count DESC LIMIT 10;
┌─────────────┬───────┐
│ tail_number │ count │
│ varchar │ int64 │
├─────────────┼───────┤
│ N76502 │ 15 │
│ N24211 │ 14 │
│ N27269 │ 13 │
│ N487UA │ 12 │
│ N17550 │ 11 │
│ N77536 │ 10 │
│ N14231 │ 10 │
│ N37274 │ 10 │
│ N27568 │ 10 │
│ N464AS │ 10 │
├─────────────┴───────┤
│ 10 rows 2 columns │
└─────────────────────┘
Did you notice the read_parquet('https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2025-01.parquet') syntax? That tells DuckDB to load Parquet data from a remote URL. It is also possible to specify multiple Parquet files read using the read_parquet([ URL, URL, URL... ]) syntax. For example here is the same query but for flights from January 2025, 2024 and 2023:
D SELECT "swim:tail_number" AS tail_number, COUNT("wof:id") AS count FROM read_parquet(['https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2025-01.parquet','https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2024-01.parquet', 'https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2023-01.parquet']) WHERE tail_number != '' GROUP BY tail_number ORDER BY count DESC LIMIT 10;
┌─────────────┬───────┐
│ tail_number │ count │
├─────────────┼───────┤
│ N115BZ │ 92 │
│ N37540 │ 60 │
│ N27526 │ 53 │
│ N47517 │ 53 │
│ N37547 │ 52 │
│ N47524 │ 50 │
│ N27546 │ 46 │
│ N991JT │ 46 │
│ N851UA │ 45 │
│ N37538 │ 45 │
└─────────────┴───────┘
Additionally, DuckDB supports a number of spatial functions. For example, here are the flights (from January 2025) whose paths intersected with the point -97.21, 33.4:
D SELECT "wof:id", "src:alt_label" FROM read_parquet('https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2025-01.parquet') WHERE ST_Intersects(ST_GeomFromText('POINT(-97.21 33.4)'), geometry);
┌────────────┬───────────────┐
│ wof:id │ src:alt_label │
├────────────┼───────────────┤
│ 1947729937 │ swim-route │
│ 1947729983 │ swim-route │
└────────────┴───────────────┘
To see all the column types in a GeoParquet file you can use the DESCRIBE (SELECT * FROM .. )) query, like this:
D .mode box
D DESCRIBE (SELECT * FROM read_parquet('https://static.sfomuseum.org/parquet/sfomuseum-data-flights-2025-01.parquet'));
┌────────────────────────────┬────────────────────────────────────────────────────────┬──────┬─────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
├────────────────────────────┼────────────────────────────────────────────────────────┼──────┼─────┼─────────┼───────┤
│ date:cessation_lower │ DATE │ YES │ │ │ │
│ date:cessation_upper │ DATE │ YES │ │ │ │
│ date:inception_lower │ DATE │ YES │ │ │ │
│ date:inception_upper │ DATE │ YES │ │ │ │
│ edtf:cessation │ VARCHAR │ YES │ │ │ │
│ edtf:inception │ TIMESTAMP │ YES │ │ │ │
│ flysfo:actual_timestamp │ INTEGER │ YES │ │ │ │
│ flysfo:airline │ VARCHAR │ YES │ │ │ │
│ flysfo:base_airline │ VARCHAR │ YES │ │ │ │
│ flysfo:base_flight_number │ VARCHAR │ YES │ │ │ │
│ flysfo:date │ DATE │ YES │ │ │ │
│ flysfo:estimated_timestamp │ INTEGER │ YES │ │ │ │
│ flysfo:event │ VARCHAR │ YES │ │ │ │
│ flysfo:flight_number │ VARCHAR │ YES │ │ │ │
│ flysfo:gate │ VARCHAR │ YES │ │ │ │
│ flysfo:journey │ VARCHAR │ YES │ │ │ │
│ flysfo:scheduled_timestamp │ INTEGER │ YES │ │ │ │
│ flysfo:timestamp │ INTEGER │ YES │ │ │ │
│ geom:area │ INTEGER │ YES │ │ │ │
│ geom:bbox │ VARCHAR │ YES │ │ │ │
│ geom:latitude │ DOUBLE │ YES │ │ │ │
│ geom:longitude │ DOUBLE │ YES │ │ │ │
│ iata:airline │ VARCHAR │ YES │ │ │ │
│ iata:base_airline │ VARCHAR │ YES │ │ │ │
│ icao:aircraft │ VARCHAR │ YES │ │ │ │
│ icao:airline │ VARCHAR │ YES │ │ │ │
│ icao:base_airline │ VARCHAR │ YES │ │ │ │
│ lbl:latitude │ DOUBLE │ YES │ │ │ │
│ lbl:longitude │ DOUBLE │ YES │ │ │ │
│ sfomuseum:aircraft_id │ INTEGER │ YES │ │ │ │
│ sfomuseum:airline_id │ INTEGER │ YES │ │ │ │
│ sfomuseum:arrival_id │ INTEGER │ YES │ │ │ │
│ sfomuseum:base_flight_id │ VARCHAR │ YES │ │ │ │
│ sfomuseum:departure_id │ INTEGER │ YES │ │ │ │
│ sfomuseum:flight_id │ VARCHAR │ YES │ │ │ │
│ sfomuseum:flight_number │ VARCHAR[] │ YES │ │ │ │
│ sfomuseum:placetype │ VARCHAR │ YES │ │ │ │
│ src:alt_label │ VARCHAR │ YES │ │ │ │
│ src:geom │ VARCHAR │ YES │ │ │ │
│ src:geom_alt │ VARCHAR[] │ YES │ │ │ │
│ swim:altitudes │ VARCHAR[] │ YES │ │ │ │
│ swim:geom │ VARCHAR │ YES │ │ │ │
│ swim:speeds │ INTEGER[] │ YES │ │ │ │
│ swim:tail_number │ VARCHAR │ YES │ │ │ │
│ swim:timestamps │ INTEGER[] │ YES │ │ │ │
│ wof:belongsto │ INTEGER[] │ YES │ │ │ │
│ wof:concordances │ VARCHAR │ YES │ │ │ │
│ wof:country │ VARCHAR │ YES │ │ │ │
│ wof:created │ INTEGER │ YES │ │ │ │
│ wof:geomhash │ VARCHAR │ YES │ │ │ │
│ wof:hierarchy │ VARCHAR │ YES │ │ │ │
│ wof:id │ INTEGER │ YES │ │ │ │
│ wof:involves │ INTEGER[] │ YES │ │ │ │
│ wof:lastmodified │ INTEGER │ YES │ │ │ │
│ wof:name │ VARCHAR │ YES │ │ │ │
│ wof:parent_id │ INTEGER │ YES │ │ │ │
│ wof:placetype │ VARCHAR │ YES │ │ │ │
│ wof:placetype_alt │ VARCHAR[] │ YES │ │ │ │
│ wof:repo │ VARCHAR │ YES │ │ │ │
│ wof:superseded_by │ VARCHAR │ YES │ │ │ │
│ wof:supersedes │ VARCHAR │ YES │ │ │ │
│ geometry │ GEOMETRY │ YES │ │ │ │
│ geometry_bbox │ STRUCT(xmin FLOAT, ymin FLOAT, xmax FLOAT, ymax FLOAT) │ YES │ │ │ │
└────────────────────────────┴────────────────────────────────────────────────────────┴──────┴─────┴─────────┴───────┘
The GeoParquet files are generated nightly derived from source files in the sfomuseum-data/sfomuseum-data-flights-YYYY-MM GitHub repositories and are available from:
There is also a machine-readable (JSON) list of all those files available from:
For the time being only flight data records are available as GeoParquet files but once we’ve let these “bake” for a little while it is likely we will make this an option for all the other data that SFO Museum publishes. We hope these GeoParquet files are interesting and useful and we’d love to hear how you are using them.
