Semantic Search

Sometimes when you’re searching documents you aren’t interested in an exact match. If you’re investigating financial dealings across 70,000 emails, you don’t just want to ctrl+f and search for “money” – you’re interested in “currency” and “finances” and “debt” and everything else with the same feeling as “money.”

Searching using similarity that’s not an exact word-for-word match is called semantic search. It’s been around for a while, but it’s gotten much much much more popular with the rise of large language models!

Semantic search also goes far beyond one-word synonyms like the example above: it can match whole phrases or sentences that express the same idea in completely different words, as we’ll see in the examples below.

Embeddings

Any time someone explains semantic search they’re legally obligated to go very very very deep into a technical topic called embeddings. We’ll take a look at embeddings briefly so we have a better understanding of the limitations we’ll run into when using semantic search.

[Figure: embedding the word “Cat”]
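If you want to see what an embedding actually looks like, here’s a minimal sketch using the sentence-transformers library. The model name (all-MiniLM-L6-v2) and the example words are just my picks for illustration – any sentence-transformers model will give you the same kind of output.

```python
# A quick sketch of embeddings using the sentence-transformers library.
# The model name here (all-MiniLM-L6-v2) is just one popular choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# An embedding is just a long list of numbers describing a piece of text
embedding = model.encode("cat")
print(embedding.shape)   # (384,) – this model uses 384 numbers per text
print(embedding[:5])     # peek at the first few numbers

# Texts with similar meanings end up with similar lists of numbers,
# which we can measure with cosine similarity
words = ["money", "currency", "debt", "cat"]
embeddings = model.encode(words)
scores = util.cos_sim(embeddings[0], embeddings[1:])
for word, score in zip(words[1:], scores[0]):
    print(f"money vs {word}: {float(score):.2f}")
```

The exact numbers aren’t meaningful on their own – what matters is that “currency” and “debt” should land a lot closer to “money” than “cat” does.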

Gotchas

Types of matches

Different embeddings

Across languages

Most embedding models only understand English unless they note otherwise – for example, uer/sbert-base-chinese-nli is a Chinese-only model. But what happens if your data is in a mix of languages? No worries, embeddings can work just as well… as long as you pick a multilingual model!

For a real-life use case, here’s a great writeup by Jeremy Merrill about the process of analyzing a 356-gigabyte leak that included documents in both English and Portuguese:

For instance, searching for sentences similar to “establishing a new corporation” found these sentence fragments as the top two matches:

  • “nova entidade para a sociedade,” Portuguese for “a new entity for the company” in an email discussing creating a new corporate structure.
  • “of the firm as newly constituted,” from a law authorizing secretive corporate structures in Mauritius.

You can see an example comparing a multilingual model to an English-only model below. Feel free to adjust the sentences if you’d like to test it out across other languages!
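Here’s a rough sketch of that kind of comparison using the sentence-transformers library. The two model names – all-MiniLM-L6-v2 as an English-only model and paraphrase-multilingual-MiniLM-L12-v2 as a multilingual one – are just my stand-ins for illustration.

```python
# Comparing a multilingual model to an English-only model.
# Both model choices here are common examples, not the only options.
from sentence_transformers import SentenceTransformer, util

sentences = [
    "establishing a new corporation",   # our English search phrase
    "nova entidade para a sociedade",   # Portuguese: "a new entity for the company"
    "the weather is lovely today",      # an unrelated English sentence
]

models = {
    "English-only": "sentence-transformers/all-MiniLM-L6-v2",
    "Multilingual": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
}

for label, name in models.items():
    model = SentenceTransformer(name)
    embeddings = model.encode(sentences)
    # Compare the search phrase to the other two sentences
    scores = util.cos_sim(embeddings[0], embeddings[1:])
    print(label)
    print("  Portuguese sentence:", round(float(scores[0][0]), 2))
    print("  unrelated sentence: ", round(float(scores[0][1]), 2))
```

With the multilingual model, the Portuguese sentence should score noticeably higher than the unrelated one, while the English-only model typically can’t tell the difference.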

The model we’re using provides a list of 50+ languages it knows – how well it works for any given pair, though, I can’t quite say!