Archiving social media accounts at SFO Museum – Take three

This is a blog post by aaron cope that was published on May 04, 2022. It was tagged twitter, instagram, golang, socialmedia and tools.

There wasn’t meant to be a two-year hiatus between updates to the local archive of SFO Museum social media accounts but that’s what happened. In mid-April we finally downloaded fresh exports from Twitter and Instagram and updated the tools we use to process those data. There is now over a decade’s worth of posts from each service on the Mills Field website. The process is still not fully automated but we are a few steps closer to achieving that goal.

Depending on how you feel about the impending sale of Twitter these kinds of processes can take on an added urgency. Fundamentally, though, we believe that this - creating our own local archive of our social media accounts - is an important practice regardless of the internal drama any given service is experiencing. These third-party services that we use offer many benefits but too often we forget that they are not necessarily built for longevity. Importantly, it’s not necessarily their responsibility either. In the first blog post about archiving SFO Museum’s social media accounts I wrote:

SFO Museum makes a point of being active on a number of social media platforms. We have accounts on Twitter and Facebook and Instagram and a full-time staff member – Bao Li – whose job it is to craft and curate posts related to the museum and its collection and share them far and wide on the internet. It is important to recognize that Bao’s work is not simply “non-institutional contextualization of digitized collection objects” but an important contribution, one that is central to the museum’s mission … but we haven’t done a great job of “capturing” or “archiving” any of it.

So long as there is a way for SFO Museum to export the things that it posts on a service we can and should take on some of the burden of preserving those efforts for posterity. That is, after all, the business of museums and libraries and archives.

Photograph: Panama-Pacific International Exposition. Gelatin silver print. Gift of Edwin I. Power, Jr. and Linda L. Liscom, printing funded by San Francisco Aeronautical Society, SFO Museum Collection. 2015.040.060 a

As part of that effort we have updated the Go package we use to process Instagram archives to include a new tool called derive-media-json. This is a command line tool to derive an abbreviated “media.json” file from a “contents/posts-(N).html” file as published by the Instagram export tool, circa April, 2022.

Previous Instagram export data bundles (circa October, 2020) used to provide one or more “media-(N).json” files that contained machine-readable properties for working with Instagram exports. This tool attempts to reconstruct that data from the HTML markup and outputs the results as JSON to STDOUT.

For example:

$> bin/derive-media-json /usr/local/instagram-export/contents/posts_1.html

   {
     "path": "media/posts/201502/1209467_621332467997055_325446168_n_17841739630062499.jpg",
     "taken_at": "2015-02-26T15:07:00Z",
     "caption": "\"Making art is like escaping to find peace of mind.\" -Lee Kang Hyo (b. 1961). A final image from Dual Natures in Ceramics before the exhibition is deinstalled tomorrow. #DualNatures #Korean #ceramics #pottery"
   },
   {
     "path": "media/posts/201502/10986292_690732684371255_1179212910_n_17841739627062499.jpg",
     "taken_at": "2015-02-13T15:15:00Z",
     "caption": "Gorgeous and golden details emphasize the exotic elements on this 1860s-70s table stand. #EgyptianRevival #furniture #design"
   },
   ... and so on

This tool is expected to be brittle precisely because it parses markup that was never meant to be machine-readable, as observed at a single moment in time. It has been demonstrated to work with Instagram exports as published in April, 2022 but there are no guarantees that it will work with future (or past) Instagram exports. This tool should not need to exist but until equivalent machine-readable data is published by Instagram it will have to do.

Parachute-jumping and gliding: popular Soviet sports. Paper, ink. The Tony Bill Aviation Library Collection, SFO Museum Collection. 2000.160.1301

What you do with this machine-readable data is up to you. The derive-media-json tool is part of our ongoing effort to create small focused tools that do one thing well and to share those tools with others to save them time while fostering a culture of generosity.