Intro to Hashing in C++

Hashing is the process of converting a given key into another smaller value for O(1) retrieval time.

Crack FAANG · 21 min read · Jan 30, 2021
  • This is done with the help of some function or algorithm, called a hash function, which maps data to a simplified representative value termed a “hash code” or “hash”. This hash is then used as an index to narrow down the search and retrieve data quickly.
  • Hashing can be considered a significant improvement over a direct access table (DAT), as it greatly reduces the space required.
  • In our example of the employee system from the introduction, we can simply pass the employee ID to our hash function, get the hash code, and use it as an index to look up the record.
  • Let us see one more example to understand how hashing works.
  • Purely as an example to help us grasp the concept, let us suppose that we want to map a list of string keys to string values (for example, map a list of countries to their capital cities). So let’s say we want to store the data in Table 1 in the map.
  • And let us suppose that our hash function is to simply take the length of the string.
  • For simplicity, we will have two arrays: one for our keys and one for the values. So to put an item in the hash table, we compute its hash code (in this case, simply count the number of characters), then put the key and value in the arrays at the corresponding index.
    For example, Cuba has a hash code (length) of 4. So we store Cuba in the 4th position in the keys array, and Havana in the 4th index of the values array, etc. And we end up with the following:
  • Now, in this specific example, things work quite well. Our array needs to be big enough to accommodate the longest string, but in this case, that’s only 11 slots. And we do waste a bit of space because, for example, there are no 1-letter keys in our data, nor keys between 8 and 10 letters. But in this case, the waste isn’t so bad either. And taking the length of a string is nice and fast, so the process of finding the value associated with a given key is fast too (certainly faster than doing up to five string comparisons).
  • We can also easily see that this method would not work for storing arbitrary strings. If one of our string keys was a thousand characters in length but the rest were small, we would waste the majority of the space in the arrays. More seriously, this model can’t deal with collisions: that is, what to do when there is more than one key with the same hash code (in this case, more than one string of a given length). For example, if our keys were random English words, taking the string length would be fairly useless. Granted, the word “pseudoantidisestablishmentarianistically” would probably get its own place in the array. But on the other hand, we’d be left with thousands of, say, 6-letter words all competing for the same slot in the array.
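The toy scheme above can be sketched in C++. This is purely illustrative: the table size of 12 (lengths 0 to 11) and the complete lack of collision handling are the example's own simplifications.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy hash from the example above: the key's length.
std::size_t toyHash(const std::string& key) { return key.size(); }

// Two parallel arrays, indexed directly by the hash.
struct ToyMap {
    std::vector<std::string> keys, values;
    ToyMap() : keys(12), values(12) {}  // slots for lengths 0..11

    void put(const std::string& key, const std::string& value) {
        std::size_t h = toyHash(key);  // note: no collision handling!
        keys[h] = key;
        values[h] = value;
    }
    std::string get(const std::string& key) const {
        std::size_t h = toyHash(key);
        return keys[h] == key ? values[h] : "";
    }
};
```

Storing "Cuba"/"Havana" puts both at index 4, exactly as in the walkthrough above; a second 4-letter key would silently overwrite it, which is the collision problem the next sections address.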

In this topic, we explore hashing, a technique very widely used in interview questions. Hashing is designed to solve the problem of needing to efficiently find or store an item in a collection.

For example, if we have a list of 10,000 words of English and we want to check if a given word is in the list, it would be inefficient to successively compare the word with all 10,000 items until we find a match. Hashing is a technique to make things more efficient by effectively narrowing down the search at the outset.

Hash Table

A hash table is an array that stores pointers to data mapping to a given hashed key.

  • The Hash table uses a hash function to compute the index of the array where a record will be inserted or searched. Basically, there are 2 main components of the hash table: Hash function and array.
  • A value in the hash table is null if no existing key has a hash value equal to the index for the entry.
  • The average time complexity needed to search for a record in the hash table is O(1) under reasonable assumptions.
  • The size of the array is chosen according to the amount of data expected to be hashed.
  • Above, we introduced the notion of hashing: mapping a piece of data such as a string to some kind of representative integer value. We can then create a map by using this hash as an index into an array of key/value pairs. Such a structure is generally called a hash table: HashMap in Java, dictionary in Python, or unordered_map in C++ (sorry C users, you will have to implement your own hash map). We saw that using the string length to create the hash, and indexing a simple array, could work in some restricted cases, but is no good generally: for example, we have the problem of collisions (several keys with the same length) and wasted space if a few keys are vastly larger than the majority.

Buckets

  • Now, we can solve the problem of collisions by having an array of (references to) linked lists rather than simply an array of keys/values. Each little list is generally called a bucket.
  • Then, we can solve the problem of having an array that is too large simply by taking the hash code modulo a certain array size. So for example, if the array were 32 positions in size (going from 0 to 31), rather than storing a key/value pair in the list at position 33, we store it at position (33 mod 32) = 1. (In simple terms, we “wrap around” when we reach the end of the array.) So we end up with a structure something like this:
  • Each node in the linked lists stores a pairing of a key with a value. Now, to look for the mapping for, say, Ireland, we first compute this key’s hash code (in this case, the string length, 7). Then we start traversing the linked list at position 7 in the table. We traverse each node in the list, comparing the key stored in that node with Ireland. When we find a match, we return the value from the pair stored in that node (Dublin). In our example here, we find it on the second comparison. So although we have to do some comparisons, if the list at a given position in the table is fairly short, we’ll still reduce significantly the amount of work we need to do to find a given key/value mapping.
  • The structure we have just illustrated is essentially the one used by Java’s hash maps and hash sets. However, we generally wouldn’t want to use the string length as the hash code.
  • Let us see how we can convert an object into a hash code more effectively by using hash functions in the next section rather than just using string length as the hashing algorithm.
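The bucket structure just described can be sketched as follows, still using the deliberately weak string-length hash; the table size of 32 matches the modulo example above, and the struct name is this sketch's own.

```cpp
#include <array>
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Bucket structure sketched above: a weak hash (string length) wrapped
// modulo the table size; each slot holds a linked list of key/value pairs.
constexpr std::size_t TABLE_SIZE = 32;

struct ChainedMap {
    std::array<std::list<std::pair<std::string, std::string>>, TABLE_SIZE> buckets;

    static std::size_t bucketFor(const std::string& key) {
        return key.size() % TABLE_SIZE;
    }
    void put(const std::string& key, const std::string& value) {
        buckets[bucketFor(key)].push_back({key, value});
    }
    // Traverse the bucket's list, comparing keys, as in the Ireland example.
    std::string get(const std::string& key) const {
        for (const auto& kv : buckets[bucketFor(key)])
            if (kv.first == key) return kv.second;
        return "";
    }
};
```

Two 7-letter keys such as "Ireland" and "Iceland" now land in the same bucket but both survive, since lookup walks the bucket's list comparing full keys.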

Hash Functions

A hash function is a function or algorithm used to generate a (typically smaller) representative value from any given key. The result of a hash function is termed a hash value, or simply a hash.

Properties of a good hash function:

  • Should be one-way: it should not be feasible to recover the original key from its hash value.
  • Should be efficiently computable. In real applications, if computing the hash value itself takes a long time, it defeats the purpose of hashing.
  • Should distribute the keys uniformly across the available slots.

Types of Hash Functions:

Index Mapping Method

The most trivial form of hashing technique is called “Index Mapping (or Trivial Hashing)”. Here:

  • We consider an array wherein the value of each position of the array corresponds to a key in the given list of keys.
  • This technique is very effective when the number of possible keys is reasonably small, so that allocating one position of the array for every possible key is affordable.
  • Here, the hash function simply takes the input and returns the same value as the output: h(key) = key.
  • This gives O(1) retrieval, since each key maps directly to its own slot.
  • But otherwise, this approach is too trivial and inefficient for real-life scenarios, as it assumes the keys are integers, whereas in real life we may have data of any type. Also, even for integers, if the key values are large, the table becomes impractically big.
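A minimal sketch of index mapping, assuming small non-negative integer keys (the struct name is this illustration's own):

```cpp
#include <cassert>
#include <vector>

// Index mapping (trivial hashing): h(key) = key, so the key itself is
// the array index. One slot is allocated per possible key.
struct IndexMappedSet {
    std::vector<bool> present;
    explicit IndexMappedSet(int maxKey) : present(maxKey + 1, false) {}

    void insert(int key) { present[key] = true; }         // O(1)
    bool contains(int key) const { return present[key]; } // O(1)
};
```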

Division method

  • Here, we map each key into a slot of the hash table by taking the remainder of that key divided by the table size, i.e. h(key) = key % table_length (key MODULO table_length).
  • This method is quite fast since it takes the help of a single division operation and is most commonly used.

Things to keep in mind while using the division method:

  • Certain values of table size should be avoided for effective performance. For example, the size of the table should not be a power of some number r: if table_length = r^p, then h(key) depends only on the p lowest-order digits (base r) of the key. We want the hash function to depend on all the bits of the key, unless we are sure that all low-order p-bit patterns are equally likely.
  • The division method tends to give the best results when the size of the table is prime. Also note that if r is the number of possible character codes on a system and r % table_length = 1, then h(key) reduces to the sum of the key’s character codes modulo table_length, so strings that are permutations of one another collide; table sizes with this property are best avoided for string keys.
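A minimal sketch of the division method; the table length 97 below is just an example of a prime size, not anything prescribed by the method itself.

```cpp
#include <cassert>

// Division method: h(key) = key % table_length.
// A prime table length (e.g. 97) tends to spread keys more evenly.
int divisionHash(int key, int tableLength) {
    return key % tableLength;
}
```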

Mid square method

  • Suppose we want to place a record of key 3101 and the size of the hash table is 2000.
  • Here, firstly the key is squared, and then the middle part of the resultant number is taken as the index for the hash table. So in our case: key = 3101 => 3101² = 3101 × 3101 = 9616201, i.e.
    h(key) = h(3101) = 162 (by taking the middle 3 digits from 9616201). Place the record at index 162 of the hash table.
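The computation above can be sketched as follows, assuming the squared key has at least 3 digits (the function name is this sketch's own):

```cpp
#include <cassert>
#include <string>

// Mid-square method: square the key and take its middle digits
// (here the middle 3, matching the 3101 example above).
int midSquareHash(long long key) {
    std::string sq = std::to_string(key * key);
    std::size_t mid = sq.size() / 2;
    return std::stoi(sq.substr(mid - 1, 3));  // middle 3 digits
}
```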

Digit folding method

  • Here, the key is divided into separate parts and by using simple operations these separated parts are combined to produce a hash.
  • Consider a record of key 12345678. Divide this into parts say 123, 456, 78. After dividing the parts combine these parts by performing add operation on them. h(key) = h(12345678) = 123 + 456 + 78 = 657
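A sketch of digit folding; note that this version groups digits from the low-order end, so the example key 12345678 folds as 678 + 345 + 12 = 1035 rather than the left-to-right split 123 + 456 + 78 shown above — either grouping is a valid folding scheme.

```cpp
#include <cassert>

// Digit folding: split the key into 3-digit groups and add them,
// grouping from the low-order (rightmost) end.
int foldHash(long long key) {
    int sum = 0;
    while (key > 0) {
        sum += static_cast<int>(key % 1000);  // lowest 3-digit group
        key /= 1000;
    }
    return sum;
}
```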

Multiplication method

  • In this method, the hash function implementation is as follows:
  • We multiply the key by a real number c in the range 0 < c < 1 and extract the fractional part of the product.
  • Then, this fractional part is multiplied by the size of the hash table table_size. The floor of the result is considered as the final hash. It is represented as

h(key) = floor(table_size * ((key * c) mod 1)), i.e. h(key) = floor(table_size * fractional(key * c))

  • Here, the function floor(x) returns the integer part of the real number x, and fractional(x) yields the fractional part of that number: fractional(x) = x - floor(x)
  • The main advantage of this method is that the choice of table_size is not critical. Typically, it is chosen as a power of 2, since this is easy to implement on most computing systems.
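A sketch of the multiplication method, using c ≈ 0.618 (a common choice, attributed to Knuth) as the constant; the function name and the table size in the usage are this illustration's own.

```cpp
#include <cassert>
#include <cmath>

// Multiplication method: h(key) = floor(table_size * fractional(key * c)).
// c = (sqrt(5) - 1) / 2 ≈ 0.618 is a commonly suggested constant.
int multiplicationHash(int key, int tableSize) {
    const double c = 0.6180339887;
    double product = key * c;
    double frac = product - std::floor(product);  // fractional part of key*c
    return static_cast<int>(std::floor(tableSize * frac));
}
```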

Collisions

  • The phenomenon where two keys generate the same hash value from a given hash function is called a collision. A good hash function should ensure that the collisions are minimum.

Load Factor:

  • The load factor is a measure of how full (occupied) the hash table is, defined as: α = number of occupied slots / total slots
  • In simple words, consider we have a hash table of size 1000, and we have 500 slots filled, then the load factor would become α = 500/1000 = 0.5
  • The larger the load factor, the more space is saved, but at the cost of more elements per slot via chaining (we will learn about chaining in the sections below). This worsens the access time of elements.
  • Generally, when forced to decide what is more important, time or space, we lean towards optimizing time (increasing speed) over memory. This is called a time-space tradeoff.
  • To optimize the time complexity, we can reduce the load factor, but doing this will increase the memory or space required for the hash table.

Hashing Techniques

Collisions are bound to occur no matter how good a hash function is. Hence, to maintain the performance of a hash table, it is important to minimize collisions through various collision resolution techniques.

There are two main methods for handling collisions:

  • Separate Chaining
  • Open Addressing

Separate Chaining

  • Here the main idea is to make each slot of the hash table point to a linked list of records that have the same hash value.

Performance Analysis:

  • Hashing performance can be evaluated under the assumption that each key is equally likely and uniformly hashed to any slot of the hash table.
  • Consider table_size as number of slots in the hash table and n as the number of keys to be saved in the table. Then:

We have load factor α = n/table_size
Time complexity to search/delete = O(1 + α)
Time complexity for insert = O(1)
Hence, the overall time complexity of search, insert and delete operations is O(1) if α is O(1)

Advantages of Separate Chaining:

  • This technique is very simple in terms of implementation.
  • We can guarantee that the insert operation always occurs in O(1) time complexity as linked lists allow insertion in constant time.
  • We need not worry about the hash table getting filled up. We can always add any number of elements to the chain whenever needed.
  • This method is less sensitive to the quality of the hash function and to the load factor.
  • Generally, this method is used when we do not know exactly how many and how frequently the keys would be inserted or deleted.

Disadvantages of Separate Chaining:

  • Chaining uses extra memory space for creating and maintaining links.
  • It might happen that some parts of the hash table will never be used. This contributes to the wastage of space.
  • In the worst-case scenario, it might happen that all the keys to be inserted belong to a single bucket. This would result in a linked list structure and the search time would be O(n).
  • Chaining cache performance is not that great since we are storing keys in the form of a linked list. Open addressing techniques have better cache performance as all the keys are guaranteed to be stored in the same table. We will explore open addressing techniques in the next section.

Open Addressing

In this technique, we ensure that all records are stored in the hash table itself. The size of the table must be greater than or equal to the total number of keys available. In case the array gets filled up, we increase the size of the table by copying old data whenever needed. How do we handle the following operations in this technique?

  • Insert(key): When we try to insert a key to the bucket which is already occupied, we keep probing the hash table until an empty slot is found. Once we find the empty slot, we insert key into that slot.
  • Search(key): While searching for key in the hash table, we keep probing until we find a slot whose value equals key or until an empty slot is found.
  • Delete(key): While performing delete operation, when we try to simply delete key, then the search operation for that key might fail. Hence, deleted key’s slots are marked as “deleted” so that we get the status of the key when searched.

The term “open addressing” tells us that the final address or location of a key is not determined solely by its hash value.

Following are the open addressing techniques:

Linear Probing:

  • In this, we linearly probe for the next free slot in the hash table. Generally, the gap between two successive probes is taken as 1.
  • Consider hash(key) be the slot index computed using hash function and table_size represent the hash table size. Suppose hash(key) index has a value occupied already, then:

We check if (hash(key) + 1) % table_size is free

If ((hash(key) + 1) % table_size) is also not free, then we check for ((hash(key) + 2) % table_size) If ((hash(key) + 2) % table_size) is again not free, then we try ((hash(key) + 3) % table_size), and so on until we find the next available free slot

  • When performing a search operation, the array is scanned linearly in the same sequence until we find the target element or reach an empty slot. An empty slot indicates that no such key is present in the table.
  • Disadvantage: There would be cases where many consecutive elements form groups, and the time complexity to find an available slot or to search for an element increases greatly, reducing efficiency. This phenomenon is called clustering.
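The probing steps above can be sketched as follows, using -1 as the empty marker and omitting deletion/tombstone handling for brevity (the struct name is this sketch's own):

```cpp
#include <cassert>
#include <vector>

// Linear probing sketch: on a collision, try (hash + 1) % size,
// (hash + 2) % size, ... until a free slot (-1 marks empty).
struct LinearProbeTable {
    std::vector<int> slots;
    explicit LinearProbeTable(int size) : slots(size, -1) {}

    int hash(int key) const { return key % (int)slots.size(); }

    // Assumes the table is not full.
    void insert(int key) {
        int i = hash(key);
        while (slots[i] != -1) i = (i + 1) % (int)slots.size();
        slots[i] = key;
    }
    bool search(int key) const {
        int i = hash(key), start = i;
        while (slots[i] != -1) {       // an empty slot means "not present"
            if (slots[i] == key) return true;
            i = (i + 1) % (int)slots.size();
            if (i == start) break;     // scanned the whole table
        }
        return false;
    }
};
```

With a table of size 7, inserting 10, 17 and 24 (all hashing to 3) fills slots 3, 4 and 5 in turn, illustrating the consecutive-run clustering described above.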

Quadratic Probing:

  • In this approach, we look for the ((hash(key) + i²) % table_size)-th slot in the i-th iteration.
  • Consider hash(key) be the slot index required to place key computed using hash function and table_size is the size of hash table available, then:

If slot (hash(key) % table_size) is not free, then we check for availability of ((hash(key) + 1*1) % table_size)

If ((hash(key) + 1*1) % table_size) is again full, then we check for ((hash(key) + 2*2) % table_size)

If ((hash(key) + 2*2) % table_size) is not empty, then we check for the status of ((hash(key) + 3*3) % table_size) and so on until we find the next available empty slot.
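The same sketch with a quadratic probe sequence; again -1 marks an empty slot, and the struct name is this illustration's own. The insert assumes a free slot is reachable (e.g. the table is kept well under full).

```cpp
#include <cassert>
#include <vector>

// Quadratic probing sketch: on a collision at hash(key), try
// (hash + 1*1) % size, (hash + 2*2) % size, ... as described above.
struct QuadraticProbeTable {
    std::vector<int> slots;
    explicit QuadraticProbeTable(int size) : slots(size, -1) {}

    int hash(int key) const { return key % (int)slots.size(); }

    void insert(int key) {
        for (int i = 0; ; i++) {
            int idx = (hash(key) + i * i) % (int)slots.size();
            if (slots[idx] == -1) { slots[idx] = key; return; }
        }
    }
    bool search(int key) const {
        for (int i = 0; i < (int)slots.size(); i++) {
            int idx = (hash(key) + i * i) % (int)slots.size();
            if (slots[idx] == key) return true;
            if (slots[idx] == -1) return false;
        }
        return false;
    }
};
```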

Double Hashing:

  • In this method: we follow the idea of applying a second hash function on the key whenever a collision occurs.
  • It can be done as follows:

(hash1(key) + c * hash2(key)) % Table_Size, where c is incremented by 1 upon every collision to find the next free slot.

  • Here hash1() is the first hash function and hash2() is the second hash function applied in case of collision and Table_Size is the size of the hash table.
  • The first hash function can be defined as hash1(key) = key % Table_Size. A popular choice for the second hash function is hash2(key) = PRIME - (key % PRIME), where PRIME is a prime number smaller than Table_Size.
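A sketch of double hashing with the second hash built from a small prime (7 here, chosen purely for illustration); hash2 is never zero, so the probe sequence always advances.

```cpp
#include <cassert>
#include <vector>

// Double hashing sketch: probe (hash1 + c * hash2) % size for c = 0, 1, 2, ...
// hash2 uses a prime (7) smaller than the table size, as described above.
struct DoubleHashTable {
    std::vector<int> slots;
    explicit DoubleHashTable(int size) : slots(size, -1) {}  // -1 = empty

    int hash1(int key) const { return key % (int)slots.size(); }
    int hash2(int key) const { return 7 - (key % 7); }  // in 1..7, never 0

    void insert(int key) {
        for (int c = 0; c < (int)slots.size(); c++) {
            int i = (hash1(key) + c * hash2(key)) % (int)slots.size();
            if (slots[i] == -1) { slots[i] = key; return; }
        }
    }
    bool search(int key) const {
        for (int c = 0; c < (int)slots.size(); c++) {
            int i = (hash1(key) + c * hash2(key)) % (int)slots.size();
            if (slots[i] == key) return true;
            if (slots[i] == -1) return false;
        }
        return false;
    }
};
```

Note how 5, 18 and 31 all share hash1 = 5 in a size-13 table, yet take different probe steps (hash2 = 2, 3 and 4 respectively), which is what breaks up the clustering seen with linear probing.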

Analysis of Open Addressing:

  • The performance of hashing technique can be evaluated under the assumption that each key is equally likely and uniformly hashed to any slot of the hash table.
  • Consider table_size as the total number of slots in the hash table, n as number of keys to be inserted into the table, then:

* Load factor, α = n/table_size (α < 1, since the table must retain at least one empty slot)

* Expected number of probes in a search/insert/delete operation ≤ 1/(1 - α), assuming uniform hashing

* Hence, search/insert/delete operations take expected time O(1/(1 - α))

Hash Table Implementation

Declaration

unordered_map<int, int> A; // declares an empty map. O(1)

Adding a key, value pair

A.insert({key, value}); // O(1) time on average
OR
A[key] = value;

Find the value for key = K

if (A.find(K) == A.end()) return -1; // K does not exist in A
else return A[K]; // O(1) average case. Rare worst case O(n).
// Note: avoid testing A[K] directly, since operator[] inserts a default value for a missing key. Use A.count(K) instead.

Size ( number of elements ) of the map

A.size()  // O(1)

Erase from the map

A.erase(K);

Clear Map:

A.clear();

Iterate over Map:

for (auto i : A) {
    cout << i.first << " " << i.second << endl;
}

Hash Search Questions

Question 1: Array Difference

Given an array A of integers and another non-negative integer k, find if there exist 2 indices i and j such that A[i] - A[j] = k, i != j.

Input :

A : [1 5 3]
k : 2

Output :

1

as 3 - 1 = 2

Solution

We are looking to find a pair of integers where A[i] - A[j] = k, with k being a known entity.
Let's say we fix A[i] (i.e. we know A[i]); do we know what A[j] should be?
A[j] = A[i] - k.

We can store all the numbers in a HashMap / HashSet and then lookup A[j] in it to find out if A[j] exists.

int diffPossible(const vector<int> &num, int diff) {
    if (num.size() < 2 || diff < 0) return false;
    unordered_set<int> S;
    for (int i = 0; i < (int)num.size(); i++) {
        // For a fixed num[i], the partner is num[i] - diff or num[i] + diff.
        int aj = num[i] - diff;
        int ai = num[i] + diff;
        if (S.find(aj) != S.end()) return true;
        if (S.find(ai) != S.end()) return true;
        S.insert(num[i]);
    }
    return false;
}

Time Complexity: O(n)

Space Complexity: O(n)

Question 2: First Repeating element

Given an integer array A of size N, find the first repeating element in it.

We need to find the element that occurs more than once and whose index of the first occurrence is smallest.

Example Input

Input 1:

A = [10, 5, 3, 4, 3, 5, 6]

Input 2:

A = [6, 10, 5, 4, 9, 120]

Example Output

Output 1:

5

Output 2:

-1

Example Explanation

Explanation 1:

5 is the first element that repeats

Explanation 2:

There is no repeating element, output -1

Solution:

A Simple Solution is to use two nested loops. The outer loop picks an element one by one, the inner loop checks whether the element is repeated or not. Once we find an element that repeats, we break the loops and print the element.

The time complexity of this solution is O(n²)

We can use sorting to solve the problem in O(n log n) time. Following are detailed steps.
1) Copy the given array to an auxiliary array temp[].
2) Sort temp[] using an O(n log n) sorting algorithm.
3) Scan the input array from left to right. For every element, count its occurrences in temp[] using binary search. As soon as we find an element that occurs more than once, we return it.
This step can be done in O(n log n) time.

We can optimize the solution using Hashing.

We can Use Hashing to solve this in O(n) time on average.

The idea is to traverse the given array from right to left and update the minimum index whenever we find an element that has been visited on the right side.

At last return the element at the minimum index stored.

int Solution::solve(vector<int> &A) {
    unordered_map<int, int> M;  // element -> index of an occurrence
    int ans = -1;
    // Traverse right to left; whenever an element has been seen on the
    // right side, it becomes the current best (leftmost-so-far) answer.
    for (int i = (int)A.size() - 1; i >= 0; i--) {
        int n = A[i];
        if (M.find(n) != M.end()) {
            ans = n;
        }
        M[n] = i;
    }
    return ans;
}

Key Formation Questions

Question 3: Pairs With Given Xor

Given a 1D integer array A containing N distinct integers.

Find the number of unique pairs of integers in the array whose XOR is equal to B.

Pairs (a, b) and (b, a) are considered to be the same and should be counted once.

Example Input

Input 1:

A = [5, 4, 10, 15, 7, 6]
B = 5

Input 2:

A = [3, 6, 8, 10, 15, 50]
B = 5

Example Output

Output 1:

1

Output 2:

2

Example Explanation

Explanation 1:

(10 ^ 15) = 5

Explanation 2:

(3 ^ 6) = 5 and (10 ^ 15) = 5

Solution:

The idea is based on the fact that A[i] ^ A[j] is equal to B if and only if A[i] ^ B is equal to A[j].

Using this fact think of a solution that works in linear time complexity.

A simple solution is to traverse each element and check if there’s another number whose XOR with it is equal to B.
This solution takes O(N²) time.

An efficient solution to this problem takes O(N) time, using the fact stated above.

  1. Initialize result as 0.
  2. Create an empty hash set “s”.
  3. For each element A[i] in A[]: if B ^ A[i] is in “s”, increment result by 1; then insert A[i] into “s”.
  4. Return result.
int Solution::solve(vector<int> &A, int B) {
    unordered_map<int, int> mp;  // element -> seen flag (elements are distinct)
    int ans = 0;
    for (int i = 0; i < (int)A.size(); i++) {
        ans += mp[A[i] ^ B];  // count earlier elements that pair with A[i]
        mp[A[i]] = 1;
    }
    return ans;
}

Maths and Hashing Questions

Question 4: Check Palindrome!

Given a string A consisting of lowercase characters.

Check if characters of the given string can be rearranged to form a palindrome.

Return 1 if it is possible to rearrange the characters of the string A such that it becomes a palindrome else return 0.

Example Input

Input 1:

A = "abcde"

Input 2:

A = "abbaee"

Example Output

Output 1:

0

Output 2:

1

Example Explanation

Explanation 1:

No possible rearrangement to make the string palindrome.

Explanation 2:

Given string "abbaee" can be rearranged to "aebbea" to form a palindrome.

Solution:

A set of characters can form a palindrome if at most one character occurs an odd number of times and all other characters occur an even number of times.

We can do it in O(|A|) time using a count array.
Following are detailed steps.

  1. Create a count array of alphabet size which is typically 26. Initialize all values of the count array as 0.
  2. Traverse the given string and increment the count of every character.
  3. Traverse the count array and if the count array has more than one odd value, return false. Otherwise, return true.
int Solution::solve(string A) {
    int n = A.size();
    vector<int> freq(26, 0);
    for (int i = 0; i < n; i++) freq[A[i] - 'a']++;  // count each character
    int odd = 0;  // number of characters with an odd count
    for (int i = 0; i < 26; i++) {
        if (freq[i] % 2 != 0) odd++;
    }
    return (odd > 1) ? 0 : 1;
}

Incremental Hash Questions

Question 5: Two out of Three

Given are three arrays A, B, and C.

Return the sorted list of numbers that are present in at least 2 out of the 3 arrays.

Example Input

Input 1:

A = [1, 1, 2]
B = [2, 3]
C = [3]

Input 2:

A = [1, 2]
B = [1, 3]
C = [2, 3]

Example Output

Output 1:

[2, 3]

Output 2:

[1, 2, 3]

Example Explanation

Explanation 1:

1 is only present in A. 2 is present in A and B. 3 is present in B and C.

Explanation 2:

All numbers are present in atleast 2 out of 3 lists.

Solution:

Use maps to deduplicate the elements of each array. Then use another map to maintain the count of the number of lists each of these elements is present in.

Finally, store all elements whose count is 2 or 3 in a vector, sort it, and return it.

vector<int> Solution::solve(vector<int> &A, vector<int> &B, vector<int> &C) {
    unordered_map<int, int> vis, t;  // vis: list count per value; t: per-array dedup
    vector<int> P;
    for (int i = 0; i < (int)A.size(); i++) {
        if (t[A[i]] == 0) {  // first occurrence of this value in A
            vis[A[i]]++;
            t[A[i]]++;
        }
    }
    t.clear();  // reset the dedup map for the next array
    for (int i = 0; i < (int)B.size(); i++) {
        if (t[B[i]] == 0) {
            vis[B[i]]++;
            t[B[i]]++;
        }
    }
    t.clear();
    for (int i = 0; i < (int)C.size(); i++) {
        if (t[C[i]] == 0) {
            vis[C[i]]++;
            t[C[i]]++;
        }
    }
    for (auto i : vis) {
        if (i.second > 1) P.push_back(i.first);  // present in at least 2 lists
    }
    sort(P.begin(), P.end());
    return P;
}

Hashing Two Pointer Questions

Question 6: Subarray with B odd numbers

Given an array of integers A and an integer B.

Find the total number of subarrays having exactly B odd numbers.

Problem Constraints

1 <= length of the array <= 10^5

1 <= A[i] <= 10^9

0 <= B <= length of the array

Input Format

The first argument given is the integer array A.
The second argument given is integer B.

Output Format

Return the total number of subarrays having exactly B odd numbers.

Example Input

Input 1:

A = [4, 3, 2, 3, 4]
B = 2

Input 2:

A = [5, 6, 7, 8, 9]
B = 3

Example Output

Output 1:

4

Output 2:

1

Example Explanation

Explanation 1:

The subarrays having exactly B odd numbers are:
[4, 3, 2, 3], [4, 3, 2, 3, 4], [3, 2, 3], [3, 2, 3, 4]

Explanation 2:

The subarrays having exactly B odd numbers is [5, 6, 7, 8, 9]

Solution:

The naive approach is to generate all possible subarrays and check each one for exactly B odd numbers.
The time complexity of this solution is O(n^2)

An efficient approach is to compute a prefix[] array while traversing: prefix[i] stores the number of prefixes that contain exactly i odd numbers.

We increase the count of odd numbers whenever the current array element is odd. Once the running count of odd numbers, odd, is at least B, we add prefix[odd - B] to the answer.

At every step where odd >= B, prefix[odd - B] gives the number of earlier prefixes containing exactly odd - B odd numbers; each such prefix marks the start of a subarray ending at the current index with exactly B odd numbers.
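The approach above can be sketched as follows (the function name is this sketch's own):

```cpp
#include <cassert>
#include <vector>

// Prefix-count approach: prefix[i] holds the number of prefixes seen
// so far that contain exactly i odd numbers.
int countSubarraysWithBOdds(const std::vector<int>& A, int B) {
    std::vector<int> prefix(A.size() + 1, 0);
    prefix[0] = 1;            // the empty prefix contains zero odds
    int odd = 0, ans = 0;
    for (int x : A) {
        odd += x & 1;         // running count of odd numbers
        if (odd >= B) ans += prefix[odd - B];  // subarrays ending here
        prefix[odd]++;
    }
    return ans;
}
```

On the first example above, A = [4, 3, 2, 3, 4] with B = 2, this counts the 4 subarrays listed in the explanation.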

Author: https://www.linkedin.com/in/shivam-ross/ | https://twitter.com/BeastofBayArea | https://www.instagram.com/sup.its.shiv/
