A recent research report from security research firm Trail of Bits highlights some of the key differences between OpenSearch and Elasticsearch, differences that represent critical considerations for contemporary information retrieval. OpenSearch and the OpenSearch Project were created by Amazon; OpenSearch’s search and analytics platform was forked from Elasticsearch.
The offerings were evaluated with the OpenSearch Benchmark, which compares solutions across various workloads. The report indicates that OpenSearch v2.17.1 (the latest version at the time the research was performed) was 11 percent faster on the Vectorsearch workload than Elasticsearch v8.15.4.
It also reveals that OpenSearch was 1.6x faster on the Big5 workload. These results were derived by taking the geometric mean across each solution’s queries. Both platforms have since been updated to newer versions.
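As a concrete sketch of that aggregation method, the following Python computes a geometric-mean speedup from per-query latencies; the latency figures are hypothetical placeholders, not numbers from the report.

```python
import math

# Hypothetical per-query service times (in ms) for the same query set
# on each platform; illustrative values, not figures from the report.
opensearch_ms = [12.0, 45.0, 8.0, 150.0]
elasticsearch_ms = [20.0, 50.0, 15.0, 210.0]

def geomean(values):
    """Geometric mean: the nth root of the product of n values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# A ratio above 1.0 means OpenSearch was faster on a geometric-mean basis.
speedup = geomean(elasticsearch_ms) / geomean(opensearch_ms)
print(f"Geometric-mean speedup: {speedup:.2f}x")
```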
Trail of Bits chose to spotlight the results of these workloads in a recent blog post partly because of their relevance to the enterprise. According to Evan Downing, Trail of Bits senior security engineer, AI/ML, and one of the preparers of the report, “Big5’s kind of your generic workload that will satisfy most users and the Vectorsearch workload will evaluate things that have to do with machine learning and vector embeddings.”
The Vectorsearch workload correlates directly to generative AI applications and uses of vector similarity search. According to Trail of Bits Engineering Director William Woodruff, the Big5 workload involves “things like searching for terms over a product database.”
An examination of the different approaches OpenSearch and Elasticsearch take to these workloads, and to others in the OpenSearch Benchmark, illustrates some of the most useful capabilities in search today.
Multiple Search Engines
Although the solutions were assessed with the OpenSearch Benchmark, that tool shares the same tangled lineage: “To my knowledge, OpenSearch Benchmark was forked from the Elasticsearch benchmarking suite,” Downing said. Yet despite the fact that OpenSearch itself was forked from Elasticsearch, the report indicates that a comparison between the two solutions isn’t apples to apples.
One of the chief differences is that, at the time of the research (most of which occurred between September and December of 2024), OpenSearch supported a variety of search engines, including those designed for vector embedding retrieval use cases, while Elasticsearch supported just one, Apache Lucene. OpenSearch users can avail themselves of Lucene, Facebook AI Similarity Search (Faiss), and Non-Metric Space Library (NMSLIB).
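In OpenSearch, the engine is chosen per vector field in the index mapping. The sketch below, which assumes the opensearch-py client and a local cluster with the k-NN plugin enabled, creates a Faiss-backed index; the index name, dimension, and HNSW parameters are illustrative, and changing the engine value to lucene or nmslib would select one of the other libraries.

```python
from opensearchpy import OpenSearch

# Placeholder connection to a local cluster with the k-NN plugin enabled.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# The "engine" field selects the underlying library: "faiss", "lucene",
# or "nmslib". The dimension and HNSW parameters here are illustrative.
client.indices.create(
    index="products-vectors",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "faiss",
                        "parameters": {"m": 16, "ef_construction": 128},
                    },
                }
            }
        },
    },
)
```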
This three-to-one ratio of supported engines could have contributed to OpenSearch’s favorable results on the Vectorsearch workload.
Vector Search Algorithms and Quantizations
The various search engines assessed in the benchmark employ different approaches to information retrieval — which is not a monolithic process. According to Downing, Lucene, Faiss, and NMSLIB “support different algorithms for doing vector search and also different quantizations. So basically, you can think of this as a compression for the dataset size and the requirements that are required by the users of these algorithms.”
Quantization techniques are one of the factors that influence the performance of vector search databases. The compression to which Downing referred can impact the cost of using vector search systems, particularly in terms of storage. Although there are a host of differences between these three engines, for the actual benchmark, it was pertinent that “each of those workload engines requires different parameters in order to run, based on different API requirements and other things,” Downing said. “So, when we’re comparing this all on the line, we’re comparing OpenSearch with Lucene, OpenSearch with NMSLIB, OpenSearch with Faiss, and Elasticsearch with Lucene.”
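To make the compression trade-off concrete, here is a minimal sketch using the Faiss library directly, with made-up dimensions and random data: a flat index stores every 4-byte float of every vector, while a product-quantized index stores a fixed number of code bytes per vector at some cost in recall.

```python
import faiss
import numpy as np

d = 128  # vector dimensionality (illustrative)
vectors = np.random.random((10_000, d)).astype("float32")

# Uncompressed baseline: 128 floats * 4 bytes = 512 bytes per vector.
flat = faiss.IndexFlatL2(d)
flat.add(vectors)

# Product quantization: each vector is split into 16 sub-vectors, each
# encoded with 8 bits, shrinking storage to 16 bytes per vector.
pq = faiss.IndexPQ(d, 16, 8)
pq.train(vectors)  # PQ codebooks must be trained before vectors are added
pq.add(vectors)

print(f"flat: {d * 4} bytes/vector, PQ: {pq.code_size} bytes/vector")
```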
Smart Metadata Filtering
Of the three, Lucene may be the most widely known engine. It’s an open source search engine library maintained by the Apache Software Foundation. For solutions that have multiple engines to choose from, as OpenSearch does, there are some applications for which Lucene is particularly appropriate. “It is my understanding that Lucene is generally a good option for smaller deployments,” Downing commented.
One of the more notable facets of Lucene is its metadata filtering. Typically, users can filter the results of vector database searches based on metadata about the actual embeddings. Filters can be applied before or after the vector search itself, a choice that can affect the overall quality of the results.
The distinction with Lucene is that it “offers some benefits, as does Faiss, with some things like smart filtering, where the optimum filtering strategy, like pre-filtering, or post-filtering, or exact K-Nearest Neighbors, is automatically applied depending on the different situation,” Downing said. Faiss is a software library (with few third-party dependencies) for vector similarity search and other applications that underpin use cases for generative models. NMSLIB is a vector embedding search library and toolset for assessing similarity search methods. “NMSLIB and Faiss are built mostly for large-scale use cases,” Downing said.
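In practice, such a filtered vector query nests a filter inside the k-NN clause; with the Lucene or Faiss engine, OpenSearch chooses the filtering strategy on its own. This sketch runs against a hypothetical index like the one above, with placeholder field names and filter values.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Filtered k-NN search: with the Lucene or Faiss engine, OpenSearch
# applies pre-filtering, post-filtering, or exact k-NN automatically.
response = client.search(
    index="products-vectors",
    body={
        "size": 10,
        "query": {
            "knn": {
                "embedding": {
                    "vector": [0.1] * 384,  # placeholder query embedding
                    "k": 10,
                    "filter": {"term": {"category": "shoes"}},
                }
            }
        },
    },
)
```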
Big5 Workload
The Big5 workload illustrates how far information retrieval has come. It encompasses text querying, sorting, date histograms, range queries, and term aggregations. These capabilities are useful for searching through documents, product and customer information, structured and unstructured data, and more.
OpenSearch outperformed Elasticsearch in all Big5 categories and was 16.55 times faster than Elasticsearch in the date histogram component. Date histogram features provide temporal aggregations. “This is sort of a chronological grouping, you could say, where you’re dividing the dataset into buckets or intervals,” Downing commented. “So, for example, we want to say give me all the documents from a specific day on this month.”
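In query terms, a date histogram is a single aggregation clause. This sketch, with placeholder index and field names, groups documents into daily buckets along the lines of Downing’s example.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Bucket documents by calendar day; "logs" and "timestamp" are placeholders.
response = client.search(
    index="logs",
    body={
        "size": 0,  # return only aggregation buckets, no individual hits
        "aggs": {
            "per_day": {
                "date_histogram": {
                    "field": "timestamp",
                    "calendar_interval": "day",
                }
            }
        },
    },
)
for bucket in response["aggregations"]["per_day"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```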
Text queries are predicated in part on lexical, or keyword, search capabilities and are commonly applied to use cases involving user IDs, email addresses, or names. Range queries “are based on a specific range of values in a given field,” Downing explained. With these capabilities, users can retrieve results from a dataset in which the temperature is between 70 and 85 degrees, for example. Sorting enables organizations to order the results of queries according to any number of factors, which might include chronological, numeric, or alphabetical order.
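Downing’s temperature example maps directly onto a range query combined with a sort clause, as in this sketch with placeholder index and field names.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Documents where temperature falls between 70 and 85 inclusive,
# sorted newest first; index and field names are placeholders.
response = client.search(
    index="weather-readings",
    body={
        "query": {"range": {"temperature": {"gte": 70, "lte": 85}}},
        "sort": [{"timestamp": {"order": "desc"}}],
    },
)
```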
Meaningful Findings
For the enterprise user, the most meaningful findings from the recent benchmark between OpenSearch and Elasticsearch have less to do with the performance of these solutions and more to do with their capabilities. The report indicates that not all vector search platforms are the same: they incorporate different engines that support different features.
Some of those distinctions pertain to libraries for vector embedding search and pivotal considerations like metadata filtering, as well as versatility for quantization and compression. Moreover, capabilities for sorting search results, aggregating search terms, issuing range queries, and other facets of the Big5 workload are also worthy of consideration when assessing search and analytics platforms — and their performance.