Ask like a human: Implementing semantic search on Stack Overflow
Search has always been core to the Stack Overflow experience. It was one of the first features we built for users, and since the early days of Stack Overflow, most of our visitors have come to us through search engines. With search engine referrals consistently comprising over 90% of our site traffic, we felt for many years as though this was a solved problem.
Our search has been lexical, meaning it tries to match keywords in your query in order to identify the best results. With the recent advancements in AI and LLMs, however, semantic search has begun to push lexical search out of fashion. Semantic search converts an entire document’s content into numerical vectors based on machine-learned meaning, which a search can then traverse as if it were a 3D physical space. It allows for more efficient storage of search data and better, faster results, but most importantly, it allows users to search using natural language instead of a rigid syntax of keyword manipulation.
We’re fascinated by search; co-author David Haney built his early career on it. To David, search isn’t interesting because of the logic and algorithms it employs; it’s fascinating because it’s a human sentiment problem. It doesn’t matter if the algorithms “work perfectly,” just that the human doing the searching feels good about the results and believes they got what they wanted. With the rise of semantic search, we saw a well-timed opportunity to bring its benefits to our own site search. Last week’s announcement at WeAreDevelopers highlighted how changes to search, powered by OverflowAI, are a big part of our roadmap. Below, we’ll dig into some of the details of how search used to work around here, and how we’re building the next generation.
The old ways: Lexical
Let’s talk about lexical search and where we’ve come from. When Stack Overflow started in 2008, we used Microsoft SQL Server’s full-text search capabilities. As our website grew into one of the most heavily trafficked domains on the internet, we needed to expand. We replaced it with Elasticsearch, which is the industry standard, and have been using that ever since.
We use an algorithm that matches keywords to documents using TF-IDF (Term Frequency – Inverse Document Frequency). Essentially, TF-IDF ranks a keyword’s importance by how rare it is across your entire search corpus. The fewer documents a word appears in, the more important it is. That helps to devalue “filler” words that appear all the time in text but don’t provide much search value: “is,” “you,” “the,” and so on. We also use stemming, which groups related word forms so that “run” will also match “runner” and “running.” When you do a search, we find the documents that best match your query, with effectively no regard for where the keywords appear in the document.
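To make the ranking idea concrete, here is a minimal sketch of TF-IDF scoring in plain Python. It is an illustration only, not the Elasticsearch implementation: the function name `tf_idf_scores`, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for brevity, and stemming is left out.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document against the query with a basic TF-IDF sum."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term not in this document
            tf = counts[term] / len(tokens)
            # Smoothed IDF: rare terms get a larger weight than common ones.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            score += tf * idf
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the python list sort method",
]
scores = tf_idf_scores("the python", docs)
# Index of the highest-scoring document.
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy corpus, “the” appears in every document and so carries the minimum weight, while “python” appears in only one and pulls that document to the top.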
We do some additional processing to improve the relevance of search results, like bi-gram shingles, which tokenize text into pairs of adjacent words. This helps differentiate phrases like “The alligator ate Sue” from “Sue ate the alligator,” which have identical keywords but very different meanings.
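The alligator example can be sketched in a few lines of Python. The helper name `bigram_shingles` and the lowercase, whitespace-based tokenization are assumptions for illustration; a real search index would do this inside its analyzer.

```python
def bigram_shingles(text):
    """Return adjacent word pairs (bi-gram shingles) for a piece of text."""
    words = text.lower().split()
    # Pair each word with the word that follows it.
    return [f"{first} {second}" for first, second in zip(words, words[1:])]

a = bigram_shingles("The alligator ate Sue")
b = bigram_shingles("Sue ate the alligator")

# Both phrases contain the same set of single keywords...
same_keywords = set("the alligator ate sue".split()) == set("sue ate the alligator".split())
# ...but they share only one of their three shingles.
shared = set(a) & set(b)
```

Matching on shingles as well as single keywords lets the index tell the two phrases apart even though their individual words are the same.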
But even with the top-of-the-line algorithm, lexical search suffers from a couple of significant problems. First of all, it’s very rigid. If you misspell a keyword or use a synonym, you won’t get good results unless someone has done some processing in the index. If you pack a bunch of words into a query—by, let’s say, asking a question as if you were having a conversation with someone—then you might not match any documents.
The second problem is that lexical search requires a domain-specific language to get results for anything more than a stack of keywords. It’s not intuitive to most people to have to use specialized punctuation and boolean operators to get what you want.
To get good results, you shouldn’t need to know any magic words. With semantic search, you don’t.
And the new: Semantic
As we said earlier, search is a human sentiment problem. We don’t want to have to figure out the right keywords in a query; we just want to ask a question and get a good answer. Questions and answers are the bread and butter of stackoverflow.com and our Stack Exchange sites. It makes sense to employ semantic search to make our search equally intuitive and approachable.
Semantic search is a completely different paradigm than lexical search. Imagine the search space as a 3D cube where all the documents are given a numerical score based on their meaning and embedded into vectors with coordinates. Vectors are stored in proximity to each other according to the meanings of their terms. Said another way, if you think of the search space as a city, terms with closely related meaning or sentiment will live in the same or adjacent neighborhoods.
When you run a search, your query is first transformed into a vector in the same way that your documents were indexed. The system then traverses the vector space from the query embeddings to find nearby vectors, because those are related and therefore relevant. In a way, it’s a nearest-neighbor function, as it returns the vectors for terms found in your documents that are most closely situated to the query.
The search itself is a simple mathematical calculation: cosine distance. The real challenge is in creating the embeddings themselves.
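As a rough sketch of that calculation, the snippet below ranks a handful of documents by cosine distance to a query vector. The three-dimensional vectors and document IDs are made up for illustration (real embeddings have hundreds of dimensions), and a production vector database would use an approximate nearest-neighbor index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query_vec, doc_vecs, k=2):
    """Return the IDs of the k documents with the smallest cosine distance."""
    ranked = sorted(
        doc_vecs.items(),
        # Cosine distance is 1 minus cosine similarity.
        key=lambda item: 1.0 - cosine_similarity(query_vec, item[1]),
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical toy embeddings keyed by hypothetical document IDs.
doc_vecs = {
    "sort-a-list": [0.9, 0.1, 0.0],
    "reverse-a-list": [0.7, 0.3, 0.1],
    "css-centering": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]
top = nearest(query, doc_vecs)
```

Here the two list-related documents point in nearly the same direction as the query, so they rank ahead of the unrelated one.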
The first challenge is deciding what model to use. We could choose between a pre-tuned open-source model, a proprietary model, or fine-tuning our own model. Luckily, our data dumps have been used in many of these embedding models, so they are already optimized for our data to some extent. For the time being, we’re using a pre-tuned open-source model that produces 768-dimensional embeddings. Others, mainly closed-source models, can produce double or even triple that number of dimensions, but it’s pretty widely accepted that embeddings that large are unnecessary or even overkill.
The second challenge is deciding how to chunk the text; that is, how to break the text up into tokens for embedding. Because embedding models have a fixed context length, we need to make sure to include the right text in the embedding: not too little but also not too much.
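One common way to handle a fixed context length is a sliding window with some overlap between chunks. The sketch below is an assumption about how that could look, not a description of the actual pipeline: `chunk_words` is a hypothetical helper, and it counts whitespace-separated words as a stand-in for the model’s real tokens.

```python
def chunk_words(text, max_words=8, overlap=2):
    """Split text into overlapping windows of at most max_words words."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once a window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks

# Ten placeholder words, w0 through w9.
post = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(post, max_words=4, overlap=1)
```

The overlap repeats the last word of each chunk at the start of the next, so a sentence that straddles a boundary is not cut off from its context entirely. Larger windows keep more context per embedding; smaller ones keep each embedding more focused.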
With this semantic mapping of our data, we can avoid the rigidity and strictness of lexical search. You can write your query like a natural question you’d ask a friend, such as “how to sort list of integers in python,” and get relevant results back in kind.
The downside to semantic search is that sometimes you actually want a rigid search. If you’re looking for a specific error code or unique keyword, then semantic search may not do as well. In general, semantic search does worse with short queries, like two- or three-word phrases. But fortunately, we don’t have to exclusively use either lexical or semantic—we can do a hybrid search experience. In a hybrid model, we can continue to support our Advanced Search use cases while offering the best of semantic search to most users at the same time.
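One simple way to combine the two kinds of results is to normalize each score list and blend them with a weight. The sketch below assumes that approach; the function names, the `alpha` parameter, the min-max normalization, and the sample scores are all hypothetical, and real hybrid search engines may fuse rankings differently.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range 0..1."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(lexical, semantic, alpha=0.5):
    """Blend lexical and semantic scores; alpha=1.0 means semantic only."""
    lex, sem = min_max(lexical), min_max(semantic)
    fused = {
        # A document missing from one result list scores 0 on that side.
        doc_id: alpha * sem.get(doc_id, 0.0) + (1 - alpha) * lex.get(doc_id, 0.0)
        for doc_id in set(lex) | set(sem)
    }
    # Highest fused score first; ties are broken by document ID.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical raw scores from each search backend.
lexical = {"q1": 12.0, "q2": 3.0, "q3": 0.5}
semantic = {"q2": 0.91, "q3": 0.88, "q4": 0.40}
ranking = hybrid_rank(lexical, semantic, alpha=0.5)
```

With an even blend, a document that scores reasonably on both sides (“q2”) outranks one that is strong lexically but absent semantically (“q1”). Sliding `alpha` toward 0 favors exact keyword matches, which suits error codes and other rigid queries.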
Of course, the devil is in the details, so here’s what we’ve been doing to make semantic search a reality on Stack Overflow.
Implementing semantic search on SO
Semantic search and LLMs go together like cookies and milk. Across the technology sector, organizations are rapidly deploying Retrieval Augmented Generation (RAG) to create intuitive search experiences. We’re doing the same; after all, Not Invented Here is an antipattern! The thing about RAG is that it’s only as good as the quality of your search results. We think our tens of millions of questions and answers, curated and moderated by our amazing community, are about as qualified as it gets.
Building on the work we did to build course recommendations, we used the same pipeline from our Azure Databricks data platform that fed into a pre-trained BERT model from the SentenceTransformers library to generate embeddings. For the vector database that stored those embeddings, we had a few non-negotiable requirements:
- It had to be open-source and not hosted so we could run it on our existing Azure infrastructure.
- It needed to support hybrid search—lexical and semantic on the same data.
- Because our existing data science efforts have leaned pretty heavily into the PySpark ecosystem, it needed to have a native Spark connection.
Weaviate, a startup focused on building open source AI-first infrastructure, satisfied all those requirements. So far, so good.
The part of this system that will undergo the most experimentation once it rolls out is the embedding generation process. Specifically, what text we include as part of the embedding for a question or answer—and how we process it. We’ve made our embedding generation pipeline highly configurable to easily support creating multiple embeddings per post. This should enable us to experiment rapidly.
We also have access to other signals that help define the quality of a question and its answers: votes, page views, anonymized copy data, and more. We could bias towards documents that have been edited recently or received new votes. There are a lot of levers we can pull, and as we experiment in the coming months, we’ll figure out which ones add the most value to our search experience.
Another important signal is feedback from the community. We want to know when people are happy with their search results. So when you see a chance to give us feedback, please do!
AI and the future
Let us pose a question: to programmers seeking accurate, logically correct answers to their questions, what good is an LLM that hallucinates? How do you feel when you’re given technical guidance by a conversational AI that is confidently incorrect?
In all of our forthcoming search iteration and experimentation, our ethos is simple: accuracy and attribution. In a world of LLMs creating results from sources unknown, we will provide clear attribution of questions and answers used in our RAG LLM summaries. As always on Stack Overflow, all of your search results are created and curated by community members.
Our hypothesis is that if our semantic search produces high-quality results, technologists looking for answers will use our search instead of a search engine or conversational AI. Our forthcoming semantic search functionality is the first step in a continuous experimental process that will involve a lot of data science, iteration, and most importantly: our users. We’re excited to embark upon this adventure together with our community and can’t wait for you to experience our new semantic search.