Sunday, July 15, 2012

C# - Dictionary performance

Ever wondered if your dictionary is performing as it should? You used a dictionary and so it's fast, discussion closed? If you think so please read on as performance of a dictionary depends very much on the hashkeys for the objects you store.

Having identical hashes for different objects happens a lot and you hardly notice it. Unless you check it. Below you will find an extension that will give you information on how well your data is stored within most C# hashing data containers like Dictionary, HashSet and ConcurrentDictionary.

A bad hashcode algorithm will generate identical hashes for different objects and make items end up in the same bucket in a Dictionary. Thereby strongly reducing performance. If multiple items end up in the same bucket, which is just a list, the dictionary will have to walk through them to find the one you need. And through all of them to find out that an item is not in there.

But it is pretty hard to check whether you have an issue in this department. But not any more, as I have written a C# extenstion together with a colleague of mine; Jerremy Koot. Use it on a hashing container after you have added a lot of data to it to see how well the dictionary is filled. It will generate a small report like this;

Dictionary has 2893249 buckets with 1836025 items.
The Largest bucket contains 91 items.
It has 1647411 empty buckets (5.7%).
Each non-empty bucket has on average 1.5 items.
The 1836025 items share 1627301 unique hashes.
The largest collision has 90 items sharing the same hash


var myDictionaryStats = ExtHashContainers.Statistics(myDictionaryInstance); 

Based on your own statistics you might decide to Add or Improve the GetHashCode() override for the key data structure that you use to add to the Dictionary. Lowering the average number of items in a non-empty bucket  should be the goal.

Paste the code below into an Editor and use the source formatter will make it a bit more readable.


using System;
using System.Linq;
using System.Reflection;
using System.Collections.Generic;
using System.Collections.Concurrent;

// This unit is Freeware. It was developed by Jerremy Koot & Ivo Tops. July 2011
// Version  By    Changes
// =======  ===== ==============================================================
// v1.02    Ivo   Removed not-working Hashtable support and simplified code
// v1.01    Ivo   Lowered memory usage
// v1.00    I&J   First Version

namespace FastLibrary

    /// <summary>
    /// Static Extension Methods for Dictionary, ConcurrentDictionary and HashSet
    /// </summary>
    public static class ExtHashContainers
        /// <summary>
        /// Checks a dictionary for performance statistics
        /// </summary>
        public static string Statistics<TKey, TValue>(this Dictionary<TKey, TValue> source)
            return ExamineData(source.Keys, source);

        /// <summary>
        /// Checks a concurrent dictionary for performance statistics
        /// </summary>
        public static string Statistics<TKey, TValue>(this ConcurrentDictionary<TKey, TValue> source)
            return ExamineData(source.Keys, source);

        /// <summary>
        /// Checks a HashSet for performance statistics
        /// </summary>
        public static string Statistics<TKey>(this HashSet<TKey> source)
            return ExamineData(source, source);

        private static string ExamineData<TKey>(ICollection<TKey> source, Object hashContainer)

            if (!source.Any()) return "No Data found.";

            // Find Buckets
            var b = GetBuckets(hashContainer);
            if (b < 0) return ("Unable to get Buckets Field for HashContainer");

            // Create our counting temp dictionaries
            var d = new int[b];
            var h = new Dictionary<int, int>(source.Count);

            // Find Hash Collisions and Bucket Stats
            foreach (var k in source)
                // Hashes are stripped of sign bit in HashContainers
                // .NET Hashers do not use negative hashes, and use % voor bucket selection
                var hash = k.GetHashCode() & 0x7FFFFFFF; 
                int bucket = hash % b;    
                // Bucket Stats

                // Hashing Stats
                int c;
                if (h.TryGetValue(hash, out c)) h.Remove(hash);
                else c = 0;
                h.Add(hash, c);

            // Do some math
            var maxInBucket = d.Max(q => q);
            var maxSameHash = h.Values.Max(q => q);
            var emptyBuckets = d.Count(q => q == 0);
            var emptyStr = b == 0 ? "0" : ((float)(emptyBuckets) / b * 100).ToString("0.0");
            var worstHash = (from i in h where i.Value == maxSameHash select i.Key).FirstOrDefault();

            // Report our findings
            var r = Environment.NewLine + hashContainer.GetType().Name + " has " + b "buckets with " + source.Count + " items. " +
              Environment.NewLine + "The Largest bucket contains " + maxInBucket + " items. " + Environment.NewLine + "It has " + (emptyBuckets) +
                    " empty buckets (" + emptyStr + "%)" + Environment.NewLine + "Each non-empty bucket has on average " +
                    ((float)(source.Count / (float)(b - emptyBuckets))).ToString("0.0") + " items." + "The " + source.Count + " items share " + h.Count + " unique hashes. ";
            if (maxSameHash > 1) r += Environment.NewLine + "The largest collision has " + maxSameHash + " items sharing the same hash, which == " + worstHash;
            return r;

        private static Int32 GetBuckets(object dictionary)
            var type = dictionary.GetType();
            while (type != null && !type.IsGenericType) type = type.BaseType;
            if (type == null) return -1;

            string field = null;
            if (type.GetGenericTypeDefinition() == typeof(Dictionary<,>)) field = "buckets";
            if (type.GetGenericTypeDefinition() == typeof(ConcurrentDictionary<,>)) field = "m_buckets";
            if (type.GetGenericTypeDefinition() == typeof(HashSet<>)) field = "m_buckets";
            if (field == null) return -1;

            var bucketsField = type.GetField(field, BindingFlags.NonPublic | BindingFlags.Instance);
            if (bucketsField == null) return -1;

            var buckets = bucketsField.GetValue(dictionary);
            if (buckets == null) return -1;

            var length = buckets.GetType().GetProperty("Length");
            return (int)length.GetGetMethod().Invoke(buckets, null);