ImageBind: six modalities in one space, and why it matters

On 9 May 2023, a Meta team led by Rohit Girdhar published "ImageBind: One Embedding Space To Bind Them All," a system that places six different kinds of data — images, text, audio, depth, thermal imaging and motion-sensor readings — into a single mathematical space.

The striking part is not the number six. It is how they got there: by pairing each modality only with images, never with each other. And even so, once trained, the system correctly related sounds to depth maps — two things it had never been shown together.

Behind that result is a general recipe for connecting many things cheaply, and you will meet it in places that have nothing to do with artificial intelligence.

What a shared space is

A model does not handle photos or sentences: it handles lists of numbers. Each item becomes a vector, typically a few hundred to a few thousand numbers summarising its content, and the trick is that position should mean something. In a well-built space, two photos of dogs land near each other and a photo of a dog lands far from a photo of an aeroplane.

A shared space goes one step further: it puts things of different natures into the same coordinate system. The photo of a dog, the word "dog" and the sound of a bark all end up in the same region. And once that holds, operations that seemed to require separate programs collapse into one: searching images by description, finding the sound matching a scene, or comparing a text with a video are all the same act — measuring distances.
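That single act of measuring distances can be sketched in a few lines. The vectors below are invented three-number toys, not the output of any real model, and cosine similarity is assumed as the measure of closeness (a common choice, though the text above does not name one).

```python
import math

def cosine(a, b):
    """Cosine similarity: close to 1.0 means same direction, close to 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, keyed by (modality, description). Invented for illustration.
space = {
    ("image", "dog photo"):       [0.9, 0.1, 0.0],
    ("text",  "the word 'dog'"):  [0.8, 0.2, 0.1],
    ("audio", "bark"):            [0.85, 0.05, 0.15],
    ("image", "aeroplane photo"): [0.0, 0.2, 0.95],
    ("audio", "jet engine"):      [0.1, 0.1, 0.9],
}

def nearest(query_key, modality):
    """Return the item of the requested modality that lies closest to the query."""
    query = space[query_key]
    candidates = [k for k in space if k[0] == modality and k != query_key]
    return max(candidates, key=lambda k: cosine(query, space[k]))

# Text-to-image search and image-to-audio search are the same function call.
print(nearest(("text", "the word 'dog'"), "image"))
print(nearest(("image", "aeroplane photo"), "audio"))
```

With these toy numbers the word "dog" retrieves the dog photo and the aeroplane photo retrieves the jet engine sound. Nothing in `nearest` knows which modality it is searching.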

The pairing problem

Aligning two modalities requires paired material: photos with their captions, videos with their audio, scenes with their depth maps. Somebody has to assemble those pairs, and that is the expensive part.
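What the pairs are used for can be made concrete. A common way to align two modalities is a contrastive objective: each item should score higher against its true partner than against every other item in the batch. The sketch below is a plain-Python, InfoNCE-style loss on invented vectors. The function name, the temperature value and the data are assumptions for illustration, not the paper's training code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(anchors, partners, temperature=0.1):
    """InfoNCE-style loss: anchors[i] should match partners[i], not partners[j]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [dot(anchor, p) / temperature for p in partners]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(scores)
        log_sum = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability assigned to the true partner.
        total += log_sum - scores[i]
    return total / len(anchors)

# Invented vectors standing in for photo and caption embeddings.
photos = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.1, 0.9]]

aligned = contrastive_loss(photos, captions)         # correct pairing
shuffled = contrastive_loss(photos, captions[::-1])  # captions swapped
print(aligned, shuffled)
```

The loss is small when each photo sits next to its own caption and large when the captions are swapped, and that signal is only available if someone has recorded which caption belongs to which photo.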

With six modalities, the number of possible pairs is fifteen. Assembling fifteen paired datasets is unworkable, and some do not even exist: there is no archive of thermal images with matching sound, nor any reason for one to exist.

Here is the finding, in the paper's own words: "all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together."

Five datasets instead of fifteen. And not any five: five that already existed, because images come naturally accompanied by everything else. Photos carry text beside them, videos carry sound, depth cameras and motion sensors record alongside the lens.
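The count is easy to check by enumeration. In this short sketch the modality names are shorthand ("motion" stands for the motion-sensor readings), and "needed" simply marks the pairs that involve an image.

```python
from itertools import combinations

modalities = ["image", "text", "audio", "depth", "thermal", "motion"]

# Every unordered pair of modalities, and the subset that includes images.
all_pairs = list(combinations(modalities, 2))
image_pairs = [pair for pair in all_pairs if "image" in pair]

print(len(all_pairs), "possible pairings,", len(image_pairs), "involve images")
for first, second in all_pairs:
    status = "needed" if "image" in (first, second) else "skipped"
    print(f"{first:>7} + {second:<7} {status}")
```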

What appears without being asked for

The consequence is the elegant part. If audio aligns with images, and depth does too, then audio and depth end up aligned with each other without anyone ever pairing them. They share a frame of reference because both were measured against the same standard.

The authors describe the resulting capabilities as available "out-of-the-box": retrieval across modalities, arithmetic composition of modalities, and cross-modal detection and generation. None of that was trained for as an objective. It is geometry: if A is near C and B is near C, then A and B cannot be far apart.
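The geometric argument can be tried on toy data. The sketch below does not train anything: it assumes training has already succeeded and simulates the end state, in which each audio vector and each depth vector sits close to its own scene's image vector. All sizes, the noise level and the random vectors are invented. For normalised vectors the angle between them obeys the triangle inequality, so two vectors that are each close to the same image cannot be far from each other.

```python
import math
import random

random.seed(0)
DIM, SCENES, NOISE = 32, 10, 0.1

def jitter(vec):
    """Copy of vec with small Gaussian noise: aligned to it, but not identical."""
    return [x + random.gauss(0.0, NOISE) for x in vec]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# One image vector per scene (the hub), drawn at random.
images = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(SCENES)]

# Audio and depth are each derived from the image only; they never see each other.
audio = [jitter(img) for img in images]
depth = [jitter(img) for img in images]

# Retrieval check: for each sound, which depth map is closest?
hits = 0
for i, sound in enumerate(audio):
    best = max(range(SCENES), key=lambda j: cosine(sound, depth[j]))
    hits += best == i

print(f"{hits}/{SCENES} sounds retrieved their own scene's depth map")
```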

Why those six, and not others

The list is worth studying, because the choice explains what this is really for. Alongside the three predictable ones — image, text and audio — sit three that are not: depth, the distance to each point in a scene; thermal imaging, which records temperature rather than light; and readings from inertial measurement units, the acceleration and rotation sensors inside phones and watches.

Those three share something: they are data-poor. Nothing remotely comparable to the billions of captioned images on the internet exists for them. Nobody has hand-labelled millions of accelerometer traces, and nobody is going to.

And there lies the method's practical payoff: a data-poor modality inherits from the hub what it could never have learned alone. By anchoring accelerometer readings to the images that accompanied them, those readings enter a space where the notions of "running," "falling" or "climbing stairs" already exist — notions built from abundant visual and textual material. The sensor did not learn what falling is; it was placed where that had already been written down.
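In code, "being placed where that had already been written down" amounts to a nearest-label lookup. The sketch below assumes the text labels and the sensor trace have already been embedded in the shared space. Every vector is invented, and `zero_shot_label` is a hypothetical helper, not part of any library.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical text embeddings that already live in the shared space.
label_vectors = {
    "running":         [0.9, 0.1, 0.1],
    "falling":         [0.1, 0.9, 0.1],
    "climbing stairs": [0.1, 0.2, 0.9],
}

def zero_shot_label(sensor_vector):
    """Name a sensor embedding by its nearest text label; no labelled sensor data is used."""
    return max(label_vectors, key=lambda name: cosine(sensor_vector, label_vectors[name]))

# Hypothetical embedding of one accelerometer trace.
trace_embedding = [0.2, 0.8, 0.15]
print(zero_shot_label(trace_embedding))
```

With these numbers the trace is labelled "falling", although no accelerometer reading was ever tagged with that word. The label comes entirely from the neighbourhood the reading landed in.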

This is why the approach matters outside research labs. Most valuable real-world data — from a factory, a hospital, an electricity grid — looks exactly like this: abundant in volume and barren of labels. If those signals can be hitched to something that is described, they stop needing a training set of their own that nobody is going to fund.

The capability you take away

This is the idea that outlives the particular model, and it is among the most reusable there is.

To connect N things to one another you do not need N(N-1)/2 connections: N-1 connections to a common hub will do. With six items, fifteen against five. With twenty, one hundred and ninety against nineteen. The first count grows with the square of N and the second only in step with N, which is why the pattern shows up wherever the problem is interoperability.
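Written out, with the two worked cases from the text:

```latex
\[
\text{pairwise}(N) = \binom{N}{2} = \frac{N(N-1)}{2},
\qquad
\text{hub}(N) = N - 1,
\qquad
\frac{\text{pairwise}(N)}{\text{hub}(N)} = \frac{N}{2}.
\]
\[
N = 6:\ \frac{6 \cdot 5}{2} = 15 \ \text{against}\ 5,
\qquad
N = 20:\ \frac{20 \cdot 19}{2} = 190 \ \text{against}\ 19.
\]
```

The ratio N/2 is the saving: every doubling of the number of things to connect doubles the advantage of the hub.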

It is exactly why protocols and interchange formats exist. A translator working across twenty languages does not need a dictionary for each of the one hundred and ninety language pairs if it adopts a pivot language: nineteen will do. A system integrating twenty applications does not need connectors for every pair if they all speak to a common bus. Whenever you see an integration growing by multiplication, the useful question is: what could the hub be?
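The same pattern in a deliberately mundane setting: unit conversion through a pivot unit. This is a minimal sketch in which `TO_METRES` and `convert` are hypothetical names, and the metre plays the role of the hub.

```python
# Each unit is described once, relative to the hub (metres).
TO_METRES = {
    "metre": 1.0,
    "kilometre": 1000.0,
    "inch": 0.0254,
    "foot": 0.3048,
    "mile": 1609.344,
}

def convert(value, source, target):
    """Convert via the hub: source -> metres -> target."""
    return value * TO_METRES[source] / TO_METRES[target]

units = len(TO_METRES)
print(f"{units} table entries cover {units * (units - 1)} directed conversions")
print(convert(1, "mile", "foot"))  # roughly 5280
```

Adding a sixth unit means adding one line to the table, not five new converters. That is the design choice a pivot buys.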

And the price, which is worth knowing too

A hub is not free, and this is where the enthusiasm ends and judgement begins.

Everything passes through the hub, so the hub's blind spots become everybody's blind spots. If a distinction is not recorded in images, it cannot be carried between two modalities that are linked only through them. A nuance the ear catches and the eye does not, such as the tone in which something is said or the difference between a healthy engine and a seizing one, has no route into a space built on the visual.
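The pivot-language analogy shows the same loss in miniature. In the sketch below, English is the pivot between Spanish and French. The two tiny dictionaries are invented for illustration, but the linguistic fact is real: Spanish and French both distinguish informal from formal "you", and English does not.

```python
# Hypothetical pivot dictionaries, invented for illustration.
SPANISH_TO_PIVOT = {"tú": "you", "usted": "you", "perro": "dog"}
PIVOT_TO_FRENCH = {"you": ["tu", "vous"], "dog": ["chien"]}

def translate(word):
    """Spanish -> English pivot -> French candidates."""
    return PIVOT_TO_FRENCH[SPANISH_TO_PIVOT[word]]

print(translate("perro"))  # the pivot carries this word intact
print(translate("tú"))     # informal in Spanish...
print(translate("usted"))  # ...formal in Spanish, yet the same output
```

Both ends of the chain can express the distinction, yet it is gone on arrival, because the hub in the middle has no slot for it.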

The same goes for bias: if the image material represents a group, a situation or a part of the world poorly, that flaw does not stay in the images. It propagates to every modality anchored to them, and becomes harder to see, because it no longer looks like a photo problem.

You can see it clearly by looking at what is not on the list. There is no smell, no taste, no touch or force, and not for want of interest: no archive of images accompanied by those signals exists, and without that pairing there is no way to hitch them on. The method does not reach where the hub has never been.

That trade-off — huge savings in exchange for a common dependency — is the same one accepted when choosing a standard format or a lingua franca. It is nearly always worth it. But it is worth signing while knowing what you sign.

Where to go next, with no middlemen

The original paper is public and its figures make sense without reading the mathematics: they show what the system retrieves when handed a sound or a thermal image, which is the quickest way to develop an intuition for what "same space" means.

The capability you take from this is recognising the hub pattern: when something must connect to everything, do not connect it to everything — find the centre things already pass through, and know what it leaves out.
