PROBLEM! BruXy: Python list's value histogram
Published: 18. 4. 2021   Category: Programming

Python list's value histogram

As I am using Python more and more, I am also more thinking also neat Python „tricks“ manipulating lists and using functions like map, lambda, etc. It is probably not fair to call these standard functions of the languages as tricks. They are features :) But the way I am usually thinking about list or dictionary processing leads into some for-cycle block and many times it can be written as an elegant pythonic one-liner.

Today, after a short mental exercise, I have figured out simplest way how to get a histogram of list's values, example:

data = ['715', '5757', '5620', '5756', '5754', '5756', '5758', '5754', '5756',
'5757', '5754', '5756', '5756', '5755', '5755', '5756', '5756', '5756', '5755',
'5755', '5756', '5755', '5756', '5756', '5756', '5754', '5756', '5756', '5754',
'5756', '5756', '4796']

hist = dict( (x, data.count(x)) for x in set(data) )

{'5758': 1,
 '5757': 2,
 '5754': 5,
 '5620': 1,
 '5755': 5,
 '4796': 1,
 '715': 1,
 '5756': 16}

The logic of that one-liner is quite simple. It is generating a dictionary from a list of key/value pair tuples. Items of the list are strings because I have exported data from a log monitoring tool, but it can easily be integers or any hashable objects. List of tuple pairs is just another way how do define a dictionary:

example = dict( [ ('key1', 11), ('key2', 22), ('key3', 33) ] )

{'key1': 11, 'key2': 22, 'key3': 33}

Generator expression (PEP 289) creates that list of pairs: the value and its number of occurrences. The heart of the for-cycle is set(data), which creates a set of all unique values of the list data. Then you will just count() the number of each unique value in the list.

I was quite curious about the efficiency of this one-liner because iterate the whole list for each value to count them is O(n2), so more efficient way is to create a dictionary in one iteration of for-cycle and increment values which will be just O(n):

hist2 = dict()
for x in data:
   if x in hist2:
      hist2[x] += 1
   else: 
      hist2[x] = 1

I tried to run a profiler on both versions: generator, for-cycle on the array with 10M values and there are results on my PC and the for-cycle was about 20% faster (I would expect more, but have no idea what is going on in CPython guts):