When programmers need to create a shorter surrogate for a larger file or block of data, they often turn to hash functions. These functions analyze a block of data and produce a short number that can act as a stand-in or shorthand for the larger collection of bytes, sometimes in an index and other times in a more complicated calculation.
Perceptual hash functions are tuned to produce the same result for similar images or sounds. They aim to imitate human perception by focusing on the types of features (colors and frequencies) that drive human sight and hearing.
Many popular non-perceptual hash functions are very sensitive to the smallest changes. Simply flipping one bit, say by changing the amount of blue in a pixel from 200 to 199 units, might change half of the bits in the hash function's output. Perceptual hash functions, by contrast, are designed to return the same or similar answers for images or sounds that a human would judge to be similar. That is, small changes in the media don't meaningfully affect the output.
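This "avalanche" sensitivity is easy to see with Python's standard hashlib; the pixel buffers below are invented for illustration:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count how many bits differ between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

# Two pixel buffers that differ by a single unit of blue (200 vs. 199).
pixels_a = bytes([10, 20, 200] * 100)
pixels_b = bytes([10, 20, 199] + [10, 20, 200] * 99)

digest_a = hashlib.sha256(pixels_a).digest()
digest_b = hashlib.sha256(pixels_b).digest()

# Of the 256 output bits, roughly half flip for this one-unit change.
print(bit_difference(digest_a, digest_b))
```

A perceptual hash aims for the opposite behavior: the two buffers above should map to the same, or nearly the same, value.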
Hash functions simplify searching and indexing through databases and other data storage. Hash tables, a popular data structure known for fast response, rely on a good hash function as an index to quickly locate the larger block of data. Facial recognition algorithms, for instance, use a perceptual hash function to organize photos by the people in the image. The algorithms use the relative distances between facial features — like eyes, nose, and mouth — to construct a short vector of numbers that can organize a collection of images.
Some algorithms depend on hash functions to flag changes. These approaches, often called "checksums," began as a quick way to look for mistransmitted data. Both the sender and receiver might add together all of the bytes in the data and then compare the answer. If both agree, the algorithm might assume no mistakes were made — an assumption that is not guaranteed. If the errors made in transmission happened in a certain way — say, adding three to one byte while also subtracting three from a different one — the mistakes would cancel out and the checksum algorithm would fail to catch the error.
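A minimal sketch of such an additive checksum, along with the compensating-error failure described above (the byte values are arbitrary):

```python
def byte_sum_checksum(data: bytes) -> int:
    """Naive checksum: add all bytes and keep the low 8 bits."""
    return sum(data) % 256

original = bytes([10, 20, 30, 40])
# Corrupt the data in a compensating way: +3 on one byte, -3 on another.
corrupted = bytes([13, 20, 27, 40])

assert original != corrupted
# The sums are identical, so the error slips through undetected.
assert byte_sum_checksum(original) == byte_sum_checksum(corrupted)
```

Modern integrity checks use stronger functions (such as CRCs or cryptographic hashes) precisely because this kind of cancellation is so easy.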
All hash functions are vulnerable to “collisions” when two different blocks of data produce the same hash value. This happens more often with hash functions that produce shorter answers because the number of possible data blocks is much, much greater than the number of potential answers.
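The pigeonhole effect behind collisions can be demonstrated by truncating a strong hash to a short output. With only two bytes — 65,536 possible values — a collision is guaranteed within 65,537 distinct inputs, and in practice appears after only a few hundred (a sketch using hashlib):

```python
import hashlib

def short_hash(data: bytes) -> bytes:
    """Truncate SHA-256 to 2 bytes: only 65,536 possible outputs."""
    return hashlib.sha256(data).digest()[:2]

seen = {}
i = 0
while True:
    h = short_hash(str(i).encode())
    if h in seen:
        break  # two different inputs now share the same 16-bit hash
    seen[h] = i
    i += 1

print(f"inputs {seen[h]} and {i} share the short hash {h.hex()}")
```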
Some functions, like the U.S. government’s standard Secure Hash Algorithm (SHA256), are designed to make it practically impossible for anyone to find a collision. They were designed using the same principles as strong encryption routines to prevent reverse engineering. Many cryptographic algorithms rely on secure hash functions like SHA256 as a building block, and some refer to them colloquially as the “duct tape” of cryptography.
Perceptual hash functions can't be as resistant. They are designed so that similar data produces a similar hash value, which makes it easy to search for a collision. This leaves them vulnerable to spoofing and misdirection. Given one file, it is relatively easy to construct a second file that looks or sounds quite different but produces the same perceptual hash value.
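A toy quantizing hash makes the point: once similar inputs are meant to collide, an attacker can pick any visibly different input that lands in the same bucket. The brightness values here are invented for illustration:

```python
def toy_perceptual_hash(brightness_values):
    """Quantize each region's brightness into 8 coarse buckets (0-7)."""
    return tuple(v // 32 for v in brightness_values)

original = (40, 100, 200, 250)
# A deliberately different image whose regions land in the same buckets.
spoof = (63, 96, 220, 255)

assert original != spoof
assert toy_perceptual_hash(original) == toy_perceptual_hash(spoof)
```

Real perceptual hashes are more sophisticated, but the underlying trade-off is the same: tolerance for small changes creates room for deliberate collisions.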
How do perceptual hash functions work?
Perceptual hash functions are still a field of active research, and there are no definitive or even dominant standards. These functions tend to break a sound or image file into relatively large blocks and then convert similar shapes or sounds to the same value. The rough pattern and distribution of values in these blocks can be thought of as a very low-resolution version and is often the same or very similar for images or sounds that are close.
A basic function for sound, for instance, may split the file into one-second sections and then analyze the presence or absence of frequencies in each section. If there are low-frequency sounds, say between 100Hz and 300Hz, the function may assign a 1 to that section. It might also test other popular frequencies, like the common range for the human voice. Some automatic functions for identifying popular music can do a good job with a simple function like this because they will sense the bass rhythm and the moments when someone is singing.
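A rough sketch of the idea in pure Python, probing two frequency bands with a single-frequency Fourier correlation; the sample rate, probe frequencies, and threshold are all illustrative choices, not a standard:

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

def band_energy(samples, freq_hz):
    """Correlate the signal against sine/cosine probes at one frequency."""
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)
             for i, s in enumerate(samples))
    return (re * re + im * im) / len(samples)

def section_bits(samples, probe_freqs=(200, 1000), threshold=1.0):
    """One bit per probe frequency: 1 if that band carries real energy."""
    return [1 if band_energy(samples, f) > threshold else 0 for f in probe_freqs]

# One second of a pure 200 Hz tone: the low band lights up, 1 kHz does not.
tone = [math.sin(2 * math.pi * 200 * i / SAMPLE_RATE)
        for i in range(SAMPLE_RATE)]
print(section_bits(tone))  # → [1, 0]
```

Stringing these bits together, section by section, yields a compact fingerprint that survives noise and re-encoding far better than a byte-level checksum would.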
The size of the blocks and the frequencies that are tested can be adjusted for the application. A hash function for identifying bird songs might be triggered by higher frequencies. Shorter blocks offer more precision — something that may not be desired if the goal is simply to group similar sounds.
Image functions use similar techniques with colors and blocks. For this reason, many perceptual functions will often match shapes. A picture of a person with their arms at their side and their legs apart may match a photo of the Eiffel Tower because both have the same shape.
Several common options for comparing images are ahash, dhash, and phash. The ahash splits the image into an 8×8 grid of 64 blocks, computes the average brightness of each, and sets one bit per block. The dhash instead tracks differences between adjacent blocks, while the phash builds its fingerprint from a discrete cosine transform of the image. The phash function is available as open source.
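A simplified ahash-style function can be written in a few lines. Real implementations first convert to grayscale and resize with proper interpolation; this sketch assumes a grayscale image whose dimensions divide evenly by the grid:

```python
def average_hash(pixels, grid=8):
    """ahash-style fingerprint: shrink to grid x grid block averages,
    then set one bit per block: 1 if brighter than the overall mean."""
    h, w = len(pixels), len(pixels[0])
    bh, bw = h // grid, w // grid
    blocks = []
    for gy in range(grid):
        for gx in range(grid):
            total = sum(pixels[y][x]
                        for y in range(gy * bh, (gy + 1) * bh)
                        for x in range(gx * bw, (gx + 1) * bw))
            blocks.append(total / (bh * bw))
    mean = sum(blocks) / len(blocks)
    return [1 if b >= mean else 0 for b in blocks]

def hamming(a, b):
    """Number of differing bits — a small distance means similar images."""
    return sum(x != y for x, y in zip(a, b))

# A 16x16 image: bright left half, dark right half.
img = [[250] * 8 + [5] * 8 for _ in range(16)]
# Nudging one pixel barely moves the hash — unlike a cryptographic hash.
noisy = [row[:] for row in img]
noisy[0][0] = 240
assert hamming(average_hash(img), average_hash(noisy)) <= 1
```

The Hamming distance between two such fingerprints is the usual similarity measure: near zero for near-duplicates, around 32 (half of 64 bits) for unrelated images.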
What can they do?
Perceptual hashes can support a diverse collection of applications:
- Copyright infringement — Similar hash values can detect and match images, sounds, or videos, even if they’ve been changed through cropping or downscaling.
- Video tagging — Facial perceptual hashes can help index a video to identify when particular people are visible.
- Misspelling — Textual perceptual hash functions can categorize words by their sounds, making it possible to catch and correct misspelled words.
- Security — Perceptual hashes can find and identify people or animals in video or still images, tracking their movement.
- Compliance — Some algorithms can detect what people are wearing, something useful for construction sites and hospitals. One algorithm can flag people who might not be wearing personal protective equipment required by law, for example.
How legacy players are using them
Some databases — like MySQL, Oracle, and Microsoft — use the Soundex algorithm to allow “fuzzy search” for words that sound alike even though they’re spelled differently. The algorithm’s answer is made up of a letter followed by several digits. For example, both “SURE” and “SHORE” produce the same result: “S600.”
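The core of Soundex fits in a short function. This simplified version (it skips some edge cases around name prefixes) reproduces the example above:

```python
def soundex(word: str) -> str:
    """Classic Soundex: the first letter plus three digits coding the
    remaining consonants; vowels are dropped, duplicates collapsed."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":  # vowels reset the duplicate check; h and w do not
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(soundex("SURE"), soundex("SHORE"))  # → S600 S600
```

Because two words that sound alike map to the same four-character code, an ordinary equality index on the Soundex column is enough to power a "fuzzy" search.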
Some of the cloud companies also offer facial recognition algorithms that can be easily integrated with their database. Microsoft’s Azure, for instance, offers Face, a tool that will find and group similar faces in a collection of images. The company’s API will find and return attributes of a face — like hair color or the presence of any facial hair. It will also try to construct an estimate of the age and basic emotions of the person (anger, contempt, happiness, etc.).
Amazon Rekognition can detect faces in images, as well as other useful features, like text. It works with both still images and videos, which makes it useful for many tasks, like finding all scenes with a particular actor. Rekognition also maintains a database of celebrities and will identify them in your images.
Google’s Cloud Vision API detects and categorizes many parts of an image, like text or landmarks. The tool doesn’t offer direct facial recognition, but the API will find and measure the location of elements, like the midpoint between eyes and the boundaries of the eyebrows. Celebrity recognition is currently a restricted beta product.
How upstarts are applying them
Apple recently announced it would use a perceptual hash function called NeuralHash to search customers' iPhones for potentially illegal images of child sexual abuse. The results of the perceptual hash algorithm would be compared against values of known images found in other investigations. The process would be automatic, but any match could trigger an investigation.
A number of companies — like Clearview.ai or Facebook — are creating databases filled with perceptual hashes of scanned images. They are, in general, not making these databases available to other developers.
Is there anything perceptual hash functions can’t do?
While perceptual hash functions are often quite accurate, they can produce false matches. Apple's facial recognition software used to unlock an iPhone can sometimes confuse parents with children, allowing the children to unlock their parents' phones.
In general, the ability of a hash function to reduce an often large or complex set of data to a short number is also the source of this weakness. Collisions are impossible to prevent because the number of potential answers is dramatically smaller than the number of possible inputs. While some cryptographically secure hash functions can make it hard to find these collisions, they still exist.
In the same way, the strength of perceptual hash functions is also a major weakness. If the function does a good job of approximating human perception, it will also be easier for humans to find and even create collisions. There are a number of attacks that can exploit this aspect. Several early experimental projects, for instance, offer software to help find and even create collisions.