Small focused tools

This is a blog post by Aaron Cope that was published on August 04, 2020. It was tagged golang and tools.

It’s been a little while since I’ve written anything here. I still need to finish the series of posts we started about our work geotagging photos in the SFO Museum collection and this will happen soon. As a prelude to that wrap up I want to touch on something I said in the Geotagging Photos at SFO Museum, Part 1 – Setting the Stage blog post:

The cultural heritage sector needs as many small, focused tools as it can produce. It needs them in the long-term to finally reach the goal of a common infrastructure that can be employed sector-wide. It needs them in the short-term to develop the skill and the practice required to make those tools successful. We need to learn how to scope the purpose of and our expectations of any single tool so that we can be generous of, and learn from, the inevitable missteps and false starts that will occur along the way.

In that spirit, this blog post is about four small command-line utilities that we’ve written to help us do our work and that we are sharing with the wider cultural heritage sector. These tools aren’t specific to SFO Museum and might be useful to other organizations.

dump and restore (from Elasticsearch)

Recently we needed to migrate 11 million records from one Elasticsearch instance to another. Elasticsearch has its own process for creating and restoring snapshots of an index but it is sufficiently complex and time-consuming that I decided to write my own tools to perform the migration instead.

The first tool, called dump, exports all the records in an Elasticsearch index as a stream of line-separated JSON records that can be written to a file. The second tool, called restore, reads a file containing line-separated JSON records and inserts them into an Elasticsearch index. Both tools are written in the Go programming language, which allows them to be compiled and distributed as standalone binary applications with no external dependencies.
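As an aside, “line-separated JSON” simply means one complete JSON document per line. Here is a minimal Go sketch of producing that format; the record type and its fields are hypothetical and are not part of the dump tool itself:

package main

import (
	"encoding/json"
	"log"
	"os"
)

// record is a hypothetical, simplified document used for illustration.
// The actual dump tool emits whatever documents are stored in the index.
type record struct {
	ID   string `json:"id"`
	Name string `json:"name"`
}

func main() {

	records := []record{
		{ID: "1", Name: "first"},
		{ID: "2", Name: "second"},
	}

	// json.Encoder appends a newline after each value it encodes,
	// which yields line-separated JSON records on STDOUT.
	enc := json.NewEncoder(os.Stdout)

	for _, r := range records {

		if err := enc.Encode(r); err != nil {
			log.Fatalf("Failed to encode record, %v", err)
		}
	}
}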

By default, the dump tool outputs everything to STDOUT, which means you can pipe the data into other tools like bzip2 to compress it before writing everything to a file on disk. For example:

$> bin/dump \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index millsfield \
   | bzip2 -c > /usr/local/data/millsfield.bz2

2020/07/09 13:29:52 Wrote 1000 (55658) records
2020/07/09 13:29:53 Wrote 2000 (55658) records
...
2020/07/09 13:30:28 Wrote 53000 (55658) records
2020/07/09 13:30:29 Wrote 54000 (55658) records
2020/07/09 13:30:29 Wrote 55000 (55658) records
2020/07/09 13:30:29 Wrote 55658 (55658) records

As of this writing, the Go programming language doesn’t have native support for bzip2 encoding. That’s why we need to pipe the output of the dump tool to the bzip2 utility. Go does have support for decoding bzip2-encoded files, so the restore tool can read the file we just exported by setting the -is-bzip flag. For example:

$> ./bin/restore \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index millsfield \
	-is-bzip \
	/usr/local/data/millsfield.bz2

{
  "NumAdded": 55658,
  "NumFlushed": 55658,
  "NumFailed": 0,
  "NumIndexed": 55658,
  "NumCreated": 0,
  "NumUpdated": 0,
  "NumDeleted": 0,
  "NumRequests": 31
}
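The bzip2 decoding that the -is-bzip flag relies on comes from Go’s standard library. Here is a hedged sketch, and not the restore tool’s actual source, of wrapping a compressed export in a bzip2 reader and streaming the decoded records to STDOUT:

package main

import (
	"compress/bzip2" // the standard library can decode bzip2 but not encode it
	"io"
	"log"
	"os"
)

func main() {

	fh, err := os.Open("/usr/local/data/millsfield.bz2")

	if err != nil {
		log.Fatalf("Failed to open export, %v", err)
	}

	defer fh.Close()

	// Wrap the compressed file in a bzip2 reader and stream the
	// decompressed line-separated JSON records to STDOUT.
	if _, err := io.Copy(os.Stdout, bzip2.NewReader(fh)); err != nil {
		log.Fatalf("Failed to copy records, %v", err)
	}
}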

The restore tool is also able to read data from STDIN, so if we wanted to we could pipe the output of the dump tool directly into the restore tool, migrating one Elasticsearch index to another without the need to write any files to disk. For example:

$> bin/dump \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index index-1 \
   | bin/restore \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index index-2 \
	-stdin

That’s all these tools do but we think they are useful (and save time) all the same. In the end they made our Elasticsearch migration faster than the official snapshot-and-restore model. As an added bonus, the exports themselves are in an easy-to-read and easy-to-parse format that we can inspect and manipulate outside of Elasticsearch itself.
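For example, this is the sort of small Go program we might write to walk an (uncompressed) export and inspect individual records; the "id" key is an assumption about the documents and not something the dump tool guarantees:

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"log"
	"os"
)

func main() {

	scanner := bufio.NewScanner(os.Stdin)

	// Elasticsearch documents can exceed bufio.Scanner's default 64KB
	// token limit so give the scanner a larger buffer to work with.
	scanner.Buffer(make([]byte, 0, 1024*1024), 16*1024*1024)

	for scanner.Scan() {

		var doc map[string]interface{}

		if err := json.Unmarshal(scanner.Bytes(), &doc); err != nil {
			log.Fatalf("Failed to unmarshal record, %v", err)
		}

		// "id" is a hypothetical property used here for illustration.
		fmt.Println(doc["id"])
	}

	if err := scanner.Err(); err != nil {
		log.Fatalf("Failed to scan records, %v", err)
	}
}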

echo

negative: San Francisco Airport, radar tower. Negative. Transfer, SFO Museum Collection. 2011.032.0315.

The third tool is called echo and copies (or echoes) one data source to another. There are lots of tools, many of them bundled with a computer’s operating system, that do this already. What distinguishes the echo tool is that it uses the Go Cloud blob package, which allows it to read and write data from a variety of different sources and storage providers.

The echo tool has two parameters, -from and -to, which are pointers to the data sources (or “blobs”) to read from and write to, respectively. If the former is empty then data is read from STDIN. If the latter is empty then data is written to STDOUT. For example:

$> echo 'hello world' | bin/echo 
hello world

This is literally the same as using the built-in Unix echo tool, but slower and less efficient:

$> echo 'hello world'
hello world

Here’s an example of using our custom echo tool to read data from STDIN and write it to a local file on disk:

$> echo 'hello world' | bin/echo -to 'file:///usr/local/data/test'

$> cat /usr/local/data/test
hello world

And this is how we would read that back, sending the data to STDOUT:

$> bin/echo -from 'file:///usr/local/data/test'
hello world

Again, all of these things can be accomplished using the built-in echo and cat Unix command-line tools:

$> echo 'hello world' > /usr/local/data/test
$> cat /usr/local/data/test
hello world

The value of our custom echo tool is that it supports reading and writing data to any source that implements the Go Cloud blob interface. The blob interface allows us to read and write data to a local filesystem as well as to remote storage providers like Amazon’s S3 or Google’s Cloud Storage services.

For example, here’s how we would read data from STDIN and write it to a file in S3:

$> echo 'this is a test' \
   | bin/echo \
   	-to 's3://s3-bucket/misc/test.txt?region=us-east-1'

And then read back the file we’ve just written, sending the data to STDOUT:

$> bin/echo \
	-from 's3://s3-bucket/misc/test.txt?region=us-east-1'
this is a test

Data sources are defined using the Go Cloud URL syntax, allowing us to reuse the same echo tool in a variety of settings by changing the URL that it uses to read or write data.
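Under the hood that URL syntax maps to the blob package’s OpenBucket function and to driver packages that register themselves for a given URL scheme. The following is a minimal, hypothetical sketch of copying STDIN to a blob, and not the echo tool’s actual source; the bucket URL and key are made-up examples:

package main

import (
	"context"
	"io"
	"log"
	"os"

	"gocloud.dev/blob"
	_ "gocloud.dev/blob/fileblob" // registers the file:// scheme
	_ "gocloud.dev/blob/s3blob"   // registers the s3:// scheme
)

func main() {

	ctx := context.Background()

	// Swapping this URL for an s3:// URL is all it takes to write to S3
	// instead of the local filesystem.
	bucket, err := blob.OpenBucket(ctx, "file:///usr/local/data")

	if err != nil {
		log.Fatalf("Failed to open bucket, %v", err)
	}

	defer bucket.Close()

	wr, err := bucket.NewWriter(ctx, "test", nil)

	if err != nil {
		log.Fatalf("Failed to create writer, %v", err)
	}

	if _, err := io.Copy(wr, os.Stdin); err != nil {
		log.Fatalf("Failed to copy data, %v", err)
	}

	// The write is not complete (or visible) until Close returns successfully.
	if err := wr.Close(); err != nil {
		log.Fatalf("Failed to close writer, %v", err)
	}
}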

If we combine our custom echo tool with the dump tool we can chain them together and export our Elasticsearch database as a bzip2-encoded file of line-separated JSON records stored in S3, in a simple pipeline of commands, like this:

$> bin/dump \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index millsfield \
   | bzip2 \
   | bin/echo \
	-to 's3://s3-bucket/millsfield.bz2?region=us-east-1'

An Elasticsearch index could be restored from an export stored in S3 like this:

$> bin/echo \
   	-from 's3://s3-bucket/millsfield.bz2?region=us-east-1' \
   | bin/restore \
	-elasticsearch-endpoint http://localhost:9200 \
	-elasticsearch-index millsfield \
	-is-bzip \
	-stdin

runtimevar


The fourth tool, called runtimevar, is a command-line wrapper around the Go Cloud runtimevar package. The runtimevar package describes itself as:

…an easy and portable way to watch runtime configuration variables. This guide shows how to work with runtime configuration variables using the Go CDK. Subpackages contain driver implementations of runtimevar for various services, including Cloud and on-prem solutions. You can develop your application locally using filevar or constantvar, then deploy it to multiple Cloud providers with minimal initialization reconfiguration.

This tool processes a Go Cloud runtimevar URL and emits its string value to STDOUT. As with the echo tool, the value of the runtimevar tool is that it supports a variety of sources, from in-memory (constant) values and files on disk to remote parameter storage services.
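In Go terms, a stripped-down version of such a wrapper might look like the sketch below. This is not the tool’s actual source; it assumes only the constant and AWS Parameter Store drivers and simply prints the latest value of the variable named by the first command-line argument:

package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"gocloud.dev/runtimevar"
	_ "gocloud.dev/runtimevar/awsparamstore" // registers the awsparamstore:// scheme
	_ "gocloud.dev/runtimevar/constantvar"   // registers the constant:// scheme
)

func main() {

	flag.Parse()

	ctx := context.Background()

	// For example 'constant://?val=hello+world'
	v, err := runtimevar.OpenVariable(ctx, flag.Arg(0))

	if err != nil {
		log.Fatalf("Failed to open variable, %v", err)
	}

	defer v.Close()

	// Latest blocks until a value is available and returns a snapshot of it.
	snapshot, err := v.Latest(ctx)

	if err != nil {
		log.Fatalf("Failed to retrieve value, %v", err)
	}

	// %s renders both string and []byte values sensibly.
	fmt.Printf("%s\n", snapshot.Value)
}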

For example, this is how the runtimevar tool would be used with an in-memory value:

$> bin/runtimevar 'constant://?val=hello+world'

hello world

And this is how the tool would be used with a value stored in the AWS Parameter Store service:

$> bin/runtimevar 'awsparamstore://sfomuseum-secret-parameter?region=us-east-1'

s33kr3t

And this is how the tool would be used in a shell script to retrieve and store a different PASSWORD value, depending on whether the script was run in a development or production environment:

# use a constant value for local development
# use a password stored in AWS Parameter Store for production

PASSWORD_URI="constant://?val=password"

if [ "${IS_PRODUCTION}" = "1" ]
then
	PASSWORD_URI="awsparamstore://sfomuseum-password?region=us-east-1"
fi

PASSWORD=$(runtimevar "${PASSWORD_URI}")

Like the echo tool, the same runtimevar tool can be used in a variety of different settings simply by changing the URL of the variable we need to retrieve.

All of these tools are currently in use at SFO Museum and we have published the source code for each to our GitHub account. We would welcome any suggestions or contributions to improve them. These aren’t big or flashy tools, but they are helpful, and by sharing them we hope to encourage others in the sector to publish their own small, focused tools.