
Binary Search Revisited

posted by Stephan Brumme

Same same but different ?!

Some time ago I had to write code for a binary search of sorted data. Obviously there are libraries which can do the heavy lifting for me, such as C++'s STL.

When I looked up the source code of STL's binary_search I was surprised to find that it slightly differs from what I expected: it always performs log2(N) less-than comparisons and only after that checks for equality exactly once:
binary_search / pseudo code (STL):

while at least one element
{
  find element in the middle
  if it's smaller than x
    strip all elements before and including middle element
  else
    strip all elements after middle element
}
return true if remaining element matches x, else false
That means if it finds a match (something that equals what we are searching for) it still keeps going until subdivision finishes - and forfeits the chance for an early exit.
The STL guys usually prefer fast algorithms - why did they do it ?

This clearly calls for some serious googling ... and the vast majority of web sites, including Wikipedia, combine the less-than comparison with a check for equality and indeed allow for an early exit if they find a match:
binary_search / pseudo code (most common algorithm):

while at least one element
{
  find element in the middle
  if it matches x, return true                              <== NEW
  if it's smaller than x
    strip all elements before and including middle element
  else
    strip all elements after middle element
}
return false                                                <== DIFFERENT
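For reference, the early-exit variant might look like this in C++ (my own sketch of the pseudo code above rather than code from any particular library; the name binary_search_early_exit is made up):

#include <iterator>

/// binary search with an early exit on equality (sketch)
template <typename Iterator, typename Value>
bool binary_search_early_exit(Iterator begin, Iterator end, Value value)
{
  auto numElements = std::distance(begin, end);
  while (numElements > 0)
  {
    // element in the middle
    auto middlePos = numElements >> 1;
    auto middle    = begin;
    std::advance(middle, middlePos);

    if (*middle == value)
      return true;                    // early exit: found a match

    if (*middle < value)
    {
      // continue on the right side, excluding the middle element
      begin = middle;
      ++begin;
      numElements -= middlePos + 1;
    }
    else
    {
      // continue on the left side, excluding the middle element
      numElements = middlePos;
    }
  }
  return false;
}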
In Big-O notation: if we have a container with N elements, the following numbers of comparisons are expected on average:

                   check if less than    check if equal
STL                ≈ log2(N)             1
"rest of world"    ≈ log2(N) - 1         ≈ log2(N) - 1
Is STL right ? Does saving all these checks for equality at the cost of one more less-than give a real-world performance boost ?
There is only one way to find the best binary_search: WE NEED BENCHMARKS !
Git users: scroll down to the repository link
Download  BinarySearch.cpp
Latest release: April 24, 2013, size: 10.2 kBytes, 381 lines

CRC32: 8ccb4820
MD5: 3e13d18e5c402b03dd670e18d11c8510
SHA1: 1dce5af123bf2f76c97bd1fd21c9c35d444dd32c
SHA256: 47efec0b9b5facca2a8ccf29d0b7a36972140ee61478f16de48054c57455e36b

Stay up-to-date: git clone https://create.stephan-brumme.com/binary-search/git

GitHub mirror: https://github.com/stbrumme/binary-search

If you encounter any bugs/problems or have ideas for improving future versions, please write me an email: create@stephan-brumme.com

License

This code is licensed under the zlib License:
This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.

Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:

1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.

Simplifying the Code

Visual C++'s and GCC's STL implementations of binary_search rely on lower_bound (I'm pretty sure all STL libraries do). However, like most STL code it is hardly readable. After inlining lower_bound, stripping off all the debugging stuff and replacing the weird variable and type names with something more understandable, binary_search boils down to a short loop; the spot where the check for equality could be inserted is marked in the listing below.
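A sketch of that stripped-down form (reconstructed from the description above and mirroring the hybrid variant shown further below; binary_search_simple is my own name for it, and the comment marks where an early check for equality could be inserted):

#include <iterator>

/// simplified binary_search (lower_bound inlined) - sketch
template <typename Iterator, typename Value>
bool binary_search_simple(Iterator begin, Iterator end, Value value)
{
  // narrow down range of candidates
  auto numElements = std::distance(begin, end);
  while (numElements > 0)
  {
    // pivot element
    auto middlePos = numElements >> 1;
    auto middle    = begin;
    std::advance(middle, middlePos);

    if (*middle < value)
    {
      // value must be on the right side of pivot element
      begin = middle;
      ++begin;
      numElements = numElements - middlePos - 1;
    }
    else
    {
      // value must be on the left side of pivot element (or it is the pivot element)
      // ... an early check for equality (if (*middle == value) return true;) could go here ...
      numElements = middlePos;
    }
  }

  // exactly one check for equality at the very end
  return begin != end && *begin == value;
}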

Note: you need a C++11 compiler because of the super-convenient auto keyword.

Benchmark

The main benchmark parameters are the container type (std::vector, std::deque or std::list) and the element type. My sorted data is generated by a simple loop (sketched below): step (and notFound = step/2) is responsible for creating a certain gap between neighboring numbers. These gaps are helpful to get a better estimate of the algorithm's runtime because they allow looking for numbers that are NOT in the container in an evenly distributed manner.
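A sketch of such a loop (Number and numElements stand for the benchmark's element type and container size; step = 10 matches the table below):

// fill the container with evenly spaced, sorted values (sketch)
std::vector<Number> sorted;
sorted.reserve(numElements);
const Number step     = 10;        // gap between neighboring numbers
const Number notFound = step / 2;  // value + notFound is guaranteed to miss
for (size_t i = 0; i < numElements; i++)
  sorted.push_back(Number(i * step));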
index          0    1    2    3    4    5   ...
sorted[index]  0   10   20   30   40   50   ...
The benchmark probes 0 (succeeds), 5 (fails), 10 (succeeds), 15 (fails), 20 (succeeds), 25 (fails), 30 (succeeds), 35 (fails), etc.:
test routine:

for (size_t i = 0; i < numElements; i++)
{
  // hit
  Number succeed = Number(i*step);
  if (!std::binary_search(sorted.begin(), sorted.end(), succeed))
    std::cout << "error @" << succeed << std::endl;

  // miss
  Number fail = Number(i*step+notFound);
  if ( std::binary_search(sorted.begin(), sorted.end(), fail))
    std::cout << "error @" << fail << std::endl;
}
Searching 1,000,000 sorted numbers means the inner while-loop is repeated about 20 times (log2(1,000,000) ≈ 20). All compiler optimizations I could think of were enabled.
That's what I get:
1 million int              Visual C++ 2010   GCC 4.7
std::binary_search         131ms             122ms
my binary search           107ms             96ms
early check for equality   160ms             97ms
recursive                  242ms             110ms
hybrid (8)                 85ms              133ms
The hybrid algorithm (discussed below) is the overall winner for integers (compiled with Visual C++).

My implementation of std::binary_search isn't as deeply nested as the STL's and lacks some iterator checking. In the end, both compilers generate code that is about 25% faster than the STL's.
For GCC it doesn't really matter which of my binary search variants is chosen, but Visual C++ seriously loses ground when the early check for equality is enabled.
I added #define _SECURE_SCL 0 to disable Visual C++'s checked iterators in release mode, too.
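For completeness: the define has to appear before the first STL header is included, e.g.:

// must be defined before any STL include (Visual C++ only)
#define _SECURE_SCL 0
#include <algorithm>
#include <vector>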

Let's have a look at float (very similar results for double):
1 million float            Visual C++ 2010   GCC 4.7
std::binary_search         251ms             119ms
my binary search           217ms             96ms
early check for equality   200ms             126ms
recursive                  290ms             129ms
hybrid (16)                158ms             132ms
GCC remains mostly unaffected but Visual C++ becomes about two times slower ! It's worth noting that Visual C++ in this scenario becomes faster when adding the early check for equality. The difference is small but consistent.

Even More Simplifying the Code

A recursive implementation of the binary search is quite short:
binary_search_recursive:

/// recursive implementation
template <typename Iterator, typename Value>
bool binary_search_recursive(Iterator begin, Iterator end, Value value)
{
  // empty ?
  if (begin == end)
    return false;

  // stop recursion ?
  auto numElements = std::distance(begin, end);
  if (numElements == 1)
    return value == *begin;

  // pivot element
  auto middlePos = numElements >> 1;
  auto middle = begin;
  std::advance(middle, middlePos);

  // keep looking on the right or left side of pivot element ?
  if (value < *middle)
    return binary_search_recursive(begin, middle, value);
  else
    return binary_search_recursive(middle, end, value);
}
As we have seen in the benchmark, it runs pretty fast on GCC 4.7 - and sucks with Visual C++ 2010.

Hybrid Approach

When the container holds small element types which are laid out next to each other in the memory (such as int), the L1/L2 caches' influence plays a major role.
My hybrid algorithm switches from binary subdivision to a linear search once the number of remaining candidates drops to a given thresholdLinear:
binary_search_hybrid:

/// linear search in inner loop
template <size_t thresholdLinear, typename Iterator, typename Value>
bool binary_search_hybrid(Iterator begin, Iterator end, Value value)
{
  // narrow down range of candidates
  auto numElements = std::distance(begin, end);

  // repeat until only a few elements are left
  while (numElements > int(thresholdLinear))
  {
    // pivot element
    auto middlePos = numElements >> 1;
    auto middle = begin;
    std::advance(middle, middlePos);

    if (*middle < value)
    {
      // value must be on the right side of pivot element
      begin = middle;
      ++begin;
      numElements = numElements - middlePos - 1;
    }
    else
    {
      // value must be on the left side of pivot element (or it is the pivot element)
      numElements = middlePos;
    }
  }

  // skip all smaller values
  while (begin != end && *begin < value)
    ++begin;

  // hit ?
  return begin != end && *begin == value;
}
In contrast to the original algorithm, the while loop compares against thresholdLinear instead of 0, and the final linear scan ("skip all smaller values" plus the hit check) is new; everything else should look familiar.

The best threshold value heavily depends on the data type. For 32-bit ints I found thresholdLinear = 8 to be the best parameter.
Your mileage may vary.
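Usage is straightforward; the threshold is simply passed as the first template argument (a small sketch reusing binary_search_hybrid from above):

std::vector<int> sorted;
for (int i = 0; i < 6; i++)
  sorted.push_back(i * 10);          // 0 10 20 30 40 50

// threshold 8: switch to the linear scan once at most 8 candidates remain
bool found = binary_search_hybrid<8>(sorted.begin(), sorted.end(), 30);  // true
bool miss  = binary_search_hybrid<8>(sorted.begin(), sorted.end(), 35);  // false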

The World Is Not A Vector

std::binary_search accepts arbitrary forward iterators and makes it easy to replace the container (benchmarks above: std::vector).
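Switching containers is indeed just a matter of changing the declaration; the call itself stays the same (sketch; sorted is the std::vector from the benchmark setup):

#include <algorithm>
#include <deque>
#include <list>

std::deque<int> asDeque(sorted.begin(), sorted.end());
std::list<int>  asList (sorted.begin(), sorted.end());

// exactly the same call for every container type
bool inDeque = std::binary_search(asDeque.begin(), asDeque.end(), 30);
bool inList  = std::binary_search(asList.begin(),  asList.end(),  30);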
Here are the results of STL's double-ended queue (std::deque):
std::deque, 1 million int  Visual C++ 2010   GCC 4.7
std::binary_search         352ms             206ms
my binary search           335ms             204ms
early check for equality   482ms             215ms
recursive                  510ms             540ms
hybrid (8)                 309ms             198ms
Except for the recursive variant, GCC wins these benchmarks by a huge margin. The hybrid algorithm fits the paged structure of a std::deque well and utilizes the CPU caches best.

When looking at std::list we have to keep in mind that even though only a forward iterator is required, a lot of iterating is done in std::advance and std::distance.
Therefore, instead of 1,000,000, only 10,000 ints are pushed into a std::list:
std::list, 10,000 int      Visual C++ 2010   GCC 4.7
std::binary_search         692ms             903ms
my binary search           683ms             878ms
early check for equality   687ms             882ms
recursive                  982ms             1261ms
hybrid (8)                 686ms             884ms
This time, Visual C++ wins all benchmarks over GCC. Maybe VC++ comes with a better memory allocator.
Unlike all tables before, I didn't highlight the fastest algorithm - because it's not shown in this table !
A simple std::find is faster than all of the presented algorithms when working on a std::list. It can be improved for sorted data, too: abort as soon as the iterator points at a value larger than the one we are looking for. Here is the code of find and find_sorted:
find:

/// same algorithm as STL's find
template <typename Iterator, typename Value>
Iterator find(Iterator begin, Iterator end, Value value)
{
  // compare all values until found
  while (begin != end)
  {
    if (*begin == value)
      return begin; // found !
    ++begin;
  }

  // not found
  return end;
}
find_sorted:

/// find, optimized for sorted data
// requires: iff !(x<y) && !(x>y) then x == y
template <typename Iterator, typename Value>
Iterator find_sorted(Iterator begin, Iterator end, Value value)
{
  // skip all smaller values
  while (begin != end && *begin < value)
    ++begin;

  // hit ?
  if (begin != end && *begin == value)
    return begin;
  else
    return end;
}
Note: these two templates return an Iterator instead of bool (just like the original std::find).
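So turning the result into a bool just takes one extra comparison against end() (sketch; numbers is a sorted std::list like in the benchmark):

std::list<int> numbers(sorted.begin(), sorted.end());   // sorted data, as before

// convert the returned iterator into a bool, mimicking binary_search's interface
bool found = find_sorted(numbers.begin(), numbers.end(), 30) != numbers.end();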
std::list, 10,000 int      Visual C++ 2010   GCC 4.7
find                       253ms             292ms
find_sorted                163ms             184ms
That's 3 to 4 times faster than before. However, a small std::vector with only 10,000 ints is still the clear winner:
std::vector, 10,000 int    Visual C++ 2010   GCC 4.7
hybrid binary search       1.2ms             0.8ms
find                       192ms             160ms
find_sorted                125ms             62ms
The same test with 1 million int in a std::list takes several minutes for find and find_sorted to complete.

The World Is Not A Simple Number

My benchmarks were completely memory-based and the objects were only simple numbers (int or float). Performance was mainly influenced by instruction branching and caching.

If data is stored on high-latency storage, especially storage with slow seeking times like a hard disk or some network-based solution, then the advantage of an early equality test truly kicks in. The hybrid approach computes the most comparisons and suffers severely if comparisons are expensive (e.g. large objects instead of primitive data types such as int).

If your container is very small then find can be surprisingly efficient because it gets support from modern CPUs' intelligent prefetching. For fewer than approx. 50 ints, find_sorted is as fast as a binary search, and for fewer than approx. 25 ints even find becomes competitive. The simple std::find has a few advantages, too: no need for a less-than operator and - more importantly - the input data doesn't have to be sorted.

Conclusion

Download my code and test with your own data. Then decide on your own what fits best to your needs !