OEmbeddings - What is the least amount of metadata necessary for shared vector embeddings?

This is a blog post by aaron cope that was published on April 15, 2026. It was tagged roboteyes, machine-learning, collection, oembeddings, embeddings, golang and parquet.

Photograph: San Francisco Airport, aerial view. Photograph. Collection of SFO Museum, SFO Museum Collection. 2011.068.026

This is a blog post describing a proposal for a set of common attributes to include with shared vector embeddings. Shared vector embeddings were the subject of our last blog post: “Shared cross-institutional vector embeddings – how we might get there”. These common attributes are meant to be the least amount of metadata necessary to provide a simple preview and suitable attribution for the item (an image or text) for which vector embeddings have been produced. We are calling these common properties “OEmbeddings”. This is a play on words and a shout-out to the oEmbed standard.

oEmbed is a format for allowing an embedded representation of a URL on third party sites. The simple API allows a website to display embedded content (such as photos or videos) when a user posts a link to that resource, without having to parse the resource directly.

“OEmbeddings” properties are imagined to be part of a free-form attributes dictionary associated with a vector embeddings record. They do not prevent producers from including additional metadata properties as their needs demand. Here is the proposed set of attributes:

| Name | Type | Required | Notes |
| --- | --- | --- | --- |
| type | string | yes | Either “image” or “text”. |
| preview | string | yes | The preview content for the vector embeddings. If type is “text” then this is expected to be a string. If type is “image” then this is expected to be a string conforming to the JSON Schema “uri” type. |
| depiction_url | uri | no | A web page (or resource) for the depiction used to create the vector embeddings. |
| subject_url | uri | yes | A web page (or resource) for the subject of the depiction used to create the vector embeddings. |
| subject_title | string | yes | The title of the subject of the depiction. This may be an empty string. |
| subject_creditline | string | yes | The creditline or attribution for the subject of the depiction. This may be an empty string. |
| provider_name | string | yes | The name of the provider (holder) of the subject being depicted. |
| provider_url | uri | yes | The primary web page for the provider (holder) of the subject being depicted. |
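
As a rough illustration of the rules in that table, here is a minimal sketch in Go of how a consumer might check a free-form attributes dictionary. It is not the formal JSON schema and it is not the code SFO Museum maintains: the `CheckOEmbeddings` and `isURI` names are hypothetical, the example values are invented, and "is a URI" is approximated as "parses with a scheme", which is looser than a real JSON Schema validator.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// required lists the properties the proposal marks as required.
var required = []string{
	"type", "preview", "subject_url", "subject_title",
	"subject_creditline", "provider_name", "provider_url",
}

// isURI reports whether s parses as a URI with a scheme.
// This is an approximation of the JSON Schema "uri" format.
func isURI(s string) bool {
	u, err := url.Parse(s)
	return err == nil && u.Scheme != ""
}

// CheckOEmbeddings is a hypothetical helper that returns an error if
// attrs is missing a required property or a property has the wrong shape.
// Properties not named in the proposal are ignored.
func CheckOEmbeddings(attrs map[string]any) error {
	str := make(map[string]string)
	for _, k := range required {
		v, ok := attrs[k]
		if !ok {
			return fmt.Errorf("missing required property %q", k)
		}
		s, ok := v.(string)
		if !ok {
			return fmt.Errorf("property %q is not a string", k)
		}
		str[k] = s // empty strings are allowed here
	}
	if t := str["type"]; t != "image" && t != "text" {
		return fmt.Errorf("type must be \"image\" or \"text\", got %q", t)
	}
	uris := []string{"subject_url", "provider_url"}
	if str["type"] == "image" {
		uris = append(uris, "preview") // image previews are URIs
	}
	for _, k := range uris {
		if !isURI(str[k]) {
			return fmt.Errorf("property %q is not a URI", k)
		}
	}
	// depiction_url is optional but must be a URI when present.
	if v, ok := attrs["depiction_url"]; ok {
		if s, ok := v.(string); !ok || !isURI(s) {
			return fmt.Errorf("property %q is not a URI", "depiction_url")
		}
	}
	return nil
}

func main() {
	// Invented example record; "model_notes" stands in for a producer-specific extra.
	doc := `{
		"type": "image",
		"preview": "https://example.org/images/123.jpg",
		"subject_url": "https://example.org/objects/123",
		"subject_title": "",
		"subject_creditline": "Gift of an example donor",
		"provider_name": "Example Museum",
		"provider_url": "https://example.org/",
		"model_notes": "anything else a producer wants to add"
	}`
	var attrs map[string]any
	if err := json.Unmarshal([]byte(doc), &attrs); err != nil {
		panic(err)
	}
	if err := CheckOEmbeddings(attrs); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("valid")
}
```

Working against a plain map rather than a fixed struct keeps the check compatible with the idea of a free-form dictionary: extra producer-specific keys pass through untouched.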

These are per-record properties. This is necessary because the assumption is that these records may be stored alongside records from a variety of different sources and producers. While it may be desirable to encode some of these properties as “file-level” metadata (for example, only defining provider_name or provider_url once for a collection of records written to an Apache Parquet file), those attributes, which contain important information for attribution purposes, will likely not be available to the underlying database used to index and query records. Likewise, it is easy to imagine introducing a secondary “meta data” file or lookup table for shared properties, but then there is the problem of producing, managing and synchronizing those data across systems and implementations.

Per-record properties introduce a non-trivial difference in overall file size for large distributions of vector embeddings. That is not ideal but is also considered an acceptable trade-off given that plain-text values enjoy high compression rates in formats like Apache Parquet or CSV encodings compressed using bzip.
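
To get a feel for that trade-off, here is a small Go sketch that writes the same provider properties into many records and compares the raw size with the compressed size. It is only an approximation under stated assumptions: it uses gzip from the Go standard library as a stand-in (Parquet applies its own encodings and codecs), the records are invented, and the `repeatedAttributes` and `compressedSize` helpers are hypothetical. It is not a measurement of the published files.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// compressedSize returns how many bytes data occupies after gzip compression.
func compressedSize(data []byte) (int, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(data); err != nil {
		return 0, err
	}
	// Close flushes any buffered data and writes the gzip footer.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// repeatedAttributes builds n newline-delimited JSON records that differ
// only by id and repeat the same (invented) provider properties each time.
func repeatedAttributes(n int) []byte {
	var buf bytes.Buffer
	for i := 0; i < n; i++ {
		fmt.Fprintf(&buf,
			`{"id":%d,"provider_name":"Example Museum","provider_url":"https://example.org/"}`+"\n",
			i)
	}
	return buf.Bytes()
}

func main() {
	raw := repeatedAttributes(10000)
	n, err := compressedSize(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("raw: %d bytes, gzip: %d bytes\n", len(raw), n)
}
```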

That’s it. That’s the entire proposal. As mentioned, it does not preclude producers of vector embeddings from adding their own attributes. It simply describes a common set of properties that consumers can use to display previews and assign attribution for works from third-party sources. There is a formal JSON schema for these properties, which can be used to ensure that a free-form attributes dictionary contains OEmbeddings properties.
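
On the consuming side, assigning attribution might look something like the following Go sketch. The `Attribution` function, the output format and the “Untitled” fallback for an empty subject_title are all assumptions made for illustration, and the example values are invented; the proposal itself does not say how consumers should render these properties.

```go
package main

import (
	"fmt"
	"strings"
)

// Attribution is a hypothetical helper that builds a one-line, plain-text
// credit from OEmbeddings properties. Empty titles and creditlines are
// permitted by the proposal, so both cases are handled explicitly.
func Attribution(attrs map[string]string) string {
	title := attrs["subject_title"]
	if title == "" {
		title = "Untitled" // assumed fallback, not part of the proposal
	}
	parts := []string{title}
	if credit := attrs["subject_creditline"]; credit != "" {
		parts = append(parts, credit)
	}
	// Always name the provider and link back to it.
	parts = append(parts, attrs["provider_name"]+" <"+attrs["provider_url"]+">")
	return strings.Join(parts, ". ")
}

func main() {
	// Invented example record.
	attrs := map[string]string{
		"type":               "image",
		"preview":            "https://example.org/images/123.jpg",
		"subject_url":        "https://example.org/objects/123",
		"subject_title":      "An example object",
		"subject_creditline": "Gift of an example donor",
		"provider_name":      "Example Museum",
		"provider_url":       "https://example.org/",
	}
	fmt.Println(Attribution(attrs))
}
```

Because every record carries its own provider_name and provider_url, a function like this needs nothing beyond the record itself, even when records from many producers sit in the same database.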

The go-embeddingsdb "inspector" web application showing an SFO Museum Instagram photo of Yayoi Kusama's sculpture "High Heels for Going to Heaven" and similar images from the National Gallery of Art collection using the apple/mobileclip_s2 model.

SFO Museum maintains code, written in Go and also available as a WebAssembly binary, for testing whether a dictionary conforms to the OEmbeddings protocol. We have updated all the tools in the sfomuseum/go-embeddings-harvest package to include OEmbeddings properties in the output they create. We have also updated the “vector embeddings inspector” web application to use these new properties when present.

SFO Museum Vector Embeddings

To demonstrate this proposal we have published vector embeddings for four different SFO Museum sources (objects from the Aviation collection, photos we’ve posted to Instagram, exhibition photos and exhibition installation photos), two openly-licensed SFO and California aviation-related Flickr sources and all the object images in the National Gallery of Art’s (NGA) opendata release (just to prove we could). These data are available from:

Data are encoded as Apache Parquet files. Each record included in these files has an attributes dictionary conforming to the OEmbeddings protocol with details specific to the parties responsible for the item depicted. In addition to being a compact exchange format Parquet files may be read “over the wire” (without downloading all the data first) using a tool like DuckDB. For example:

D SELECT depiction_id, subject_id, model, JSON_EXTRACT(attributes, 'subject_title') AS subject_title,
JSON_EXTRACT(attributes, 'provider_name') AS provider_name FROM
read_parquet('https://static.sfomuseum.org/embeddings/20260410-sfomuseum-collection.parquet') LIMIT 10;

┌──────────────┬────────────┬─────────────────────┬───────────────────────────────────┬───────────────┐
│ depiction_id │ subject_id │        model        │           subject_title           │ provider_name │
│   varchar    │  varchar   │       varchar       │               json                │     json      │
├──────────────┼────────────┼─────────────────────┼───────────────────────────────────┼───────────────┤
│ 1527818053   │ 1511921717 │ apple/mobileclip_s0 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818053   │ 1511921717 │ apple/mobileclip_s1 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818053   │ 1511921717 │ apple/mobileclip_s2 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818057   │ 1511923667 │ apple/mobileclip_s0 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818057   │ 1511923667 │ apple/mobileclip_s1 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818057   │ 1511923667 │ apple/mobileclip_s2 │ "menu: British Airways, Concorde" │ "SFO Museum"  │
│ 1527818059   │ 1511908213 │ apple/mobileclip_s0 │ "menu: JAL (Japan Air Lines)"     │ "SFO Museum"  │
│ 1527818059   │ 1511908213 │ apple/mobileclip_s1 │ "menu: JAL (Japan Air Lines)"     │ "SFO Museum"  │
│ 1527818059   │ 1511908213 │ apple/mobileclip_s2 │ "menu: JAL (Japan Air Lines)"     │ "SFO Museum"  │
│ 1527818061   │ 1511921615 │ apple/mobileclip_s0 │ "menu: JAL (Japan Air Lines)"     │ "SFO Museum"  │
├──────────────┴────────────┴─────────────────────┴───────────────────────────────────┴───────────────┤
│ 10 rows                                                                                   5 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────┘

These data files should still be considered experimental, at least until there is some consensus about what a common set of properties for sharing vector embeddings between cultural heritage organizations should be. If no consensus emerges then we (SFO Museum) will likely just adopt the proposed properties and use those until another set of properties is formalized.

Although we have produced vector embeddings for another organization (NGA), these are for demonstration purposes only. Our hope is that other cultural heritage organizations will produce and publish vector embeddings on their own. The (nearly forgotten) promise of open data is that SFO Museum can produce third-party vector embeddings if and when we need to. But given the time and resources necessary to create them, it seems like it would benefit everyone for individual organizations to do that work once and then share the results with their peers.

Photograph: San Jose Municipal Airport. Photograph. Gift of the William Hough Collection, SFO Museum Collection. 2010.225.152