Working With Histograms

Histograms are used to represent the distribution of data. The data can be 1-dimensional (1-D, a single value) or multi dimensional (n-D, seveal values per data point). The LT.box supports 1D and 2D histograms.

Histograms are constructed by defining a range of values that is divided into a series of (equal) consecutive, non-overlapping intervals. These intervals are called bins and the number of intervals is called the number of bins. For a 2-D histrogram there are two ranges and two number of bins. The histogram is filled when for each test value the interval into which the test value falls is determined and the numer of times a test value falls into each interval is counted. The final histogram is typically plotted in form of a bar graph, where on the x-axis one plots the central values the intervals (bin-center) and on the y-axis the final number of times a test value was in withing that interval (bin-content). The histogram provides therefore a rough estimate of the underlying distribution density. An example of a 1D histogram is shown below:

Anatomy of a histogram explaining bin-width, bin-center, bin-content and range

Anatomy of a histogram. The red bars indicate 150 randomly distributed test values. The underlyin probability distribution is a gaussian centered at 0 with a sigma of 1. The arrows point to the bin-width, bin-center, bin-content and the range used in this example.

The same principle also applies to 2-D (and higher) histograms which allow one to study the correlation between pairs of values.

1D-Histograms

Histograms are typically used in counting experiments, especially in nuclear physics experiments. One experiment you do very early on is to count the number of decays from a radioactive source during a certain amount of time. This measurement is then repeated many times. I have simulated the result of such and experiment and the data are stored in the file ’counts.data’. To work with these data I first load them in the same way as I did before:

In [27]: mcf = B.get_file('counts.data') # get the file

To see what columns have been defined you do (as before)

In [28]: mcf.show_keys()

And the output should be something like

In [28]: mcf.show_keys()
['n', 'counts', 'indx']

Here n is the number of the measurement (the first one would be 0) and counts is the number of counts you obtained in this experiment. The additional variable indx is used internally and is not interesting to you (DO NOT TOUCH IT!). To get an idea what the data look like, I could just plot counts as a function of n. Since I want to plot the data that are part of a pdfile object I will used the dplot_exp() function.

In [29]: B.dplot_exp(mcf, 'n', 'counts')

You will notice that the data scatter around a value of 150. To analyze them further we need to load them into numpy arrays as shonw below:

In [30]: ne = mcf['n']
In [31]: counts = mcf['counts']

Now let’s calculate the average. To do this you need a for loop

In [32]: N = 0              # set the counter to 0
In [33]: sum = 0.           # set the sum to 0
In [34]: for cc in counts:  # loop over all counts
   ....:    N += 1          # increment the counter
   ....:    sum += cc       # add the current value to the total sum

Close the loop with two returns. Now you you can get the average by doing

In [35]: N_av = sum/N       # evaluate the average

This was a little programming. First I created a variable called N to count the number of terms in the sum and, in the first statement (N = 0), I set it to 0. Then I did the same for another variable, called sum. sum should contain the running sum of the values which is carried out in a for loop (for cc in counts:). In the loop, cc contains the current value of counts. There are two strange looking lines. The first, N += 1 means: add 1 to the current value of N or increment N by 1. The second does the same for the variable sum but instead of adding 1 we add the current value of the counts stored in cc. As a consequence at the end of the loop N contains the number of measurements and sum the total sum of the values. To calculate the average all I have to do is take the ratio of the two numbers and store them in the variable N_av. This was done in the last line (N_av = sum/N). To look at its value:

In [35]: N_av
Out[35]: 150.41499999999999

This was a small programming example. Since the average is such an important quantity this procedure has been built into numpy.

In [36]: np.average( counts )

Will give you the same result. There are many useful function defined for numpy.array() ‘s. Here are just a few useful ones using counts as an example:

Function

Returns

counts.sum()

the sum of all elements

counts.mean()

the mean (average) of all elements

counts.var()

the variance of all elements

counts.min()

the smallest value of the array

counts.max()

the larges value of the array

Try them out !

To create a 1D histogram you can used the histo function of LT.box. The simplest way to use it is as follows:

In [37]: hh = B.histo(counts)

It looks as if nothing has happened. But all the work as already been completed. The variable hh contains the new histogram. You can now make a plot of the histogram by doing

In [38]: hh.plot()

And you will get a (blue) graph that represents the frequency with which certain counts values occur. There are other useful plotting commands, such as hh.plot_exp() which will plot error bars associated with the bin contents. Check the histogram documentation for more details.

(Source code)

Plotted histogram, vertical filled bars. x-axis label "x-bin", y-axis label "content" and title "my histogram"

What you see is a relatively coarse representation of this distribution since in its simples form histo automatically determines the range of values that you provided. This range is divided into 10 equally spaced regions, called bins centered around a central value (bin_center) with a width called the bin_width. It then goes through all the data in the array counts and checks in which region (bin) each value falls. To each bin belongs a counter that is incremented each time a values falls within the associated region. The value of each counter is called the bin content. The graph now shows the value of the bin content as a function of the bin-center value. The width of each step corresponds to the width of a bin. You have much more control on how your histogram is setup. Lets re-define it and this time we give it a range and also the number of bins. From the first graph we see that most bin-content values of counts lie between 120 and 180. If we want to get a count of how often a specific value occurs we need to make the bin width equal to one as follows

In [39]: hh = B.histo(counts, range = (120., 180.), bins = 60)

This is another example of using keywords to control what an object/function is doing. the range = (120., 180.) determines the range of values accepted, where range is the keyword. The other statement bins = 60, sets the number of bins to 60. In that way the width of a bin is 1. Now plot the new histogram again.

In [40]: clf(); show()
In [41]: hh.plot(); show()

(Remember the first line clf() clears the figure). Now you have a much better representation of the histogram:

(Source code)

Plot of histogram with bin width 1, x-axis label "x-bin", y-axis label "content", title "my histogram"

The central values (bin-center), the bin-content (number of occurrences) and the associated errors (bin-error) of each bin are can be accesses as follows:

In [42]: hh.bin_center (return)

Produces the the output:

In [43]: hh.bin_center

Out[43]:
array([ 120.5,  121.5,  122.5,  123.5,  124.5,  125.5,  126.5,  127.5,
        128.5,  129.5,  130.5,  131.5,  132.5,  133.5,  134.5,  135.5,
        136.5,  137.5,  138.5,  139.5,  140.5,  141.5,  142.5,  143.5,
        144.5,  145.5,  146.5,  147.5,  148.5,  149.5,  150.5,  151.5,
        152.5,  153.5,  154.5,  155.5,  156.5,  157.5,  158.5,  159.5,
        160.5,  161.5,  162.5,  163.5,  164.5,  165.5,  166.5,  167.5,
        168.5,  169.5,  170.5,  171.5,  172.5,  173.5,  174.5,  175.5,
        176.5,  177.5,  178.5,  179.5])

When working with integer values suich as counts it would be better if the bin-center were also be at integer values. To achieve this remember that the range is defined by the lower limit (left edge) of the first bin (bin number 0) and the upper limit (right edge) of the last bin (bin number \(n\)). To have bin centers at integer values the lower limit of the range should be between \(x_0 - \delta x/2\) and \(x_n + \delta x/2\) and the number of bins should be \(n+1\). Here \(x_0\) is thedesired bin-center of the fist bin, \(x_n\) the desired bin-center of the last bin and \(\delta x\) is the bin-width. For the previous example this would mean that the histogram should be created as shown below

In [44]: hh1 = B.histo(counts, range = (120.0 - 0.5, 180.0 + 0.5), bins = 60 + 1,
                       xlabel = 'counts', ylabel = 'frequency', title = 'bin width:1, centered')

The resulting bin-centers are as expected centered on integer count values and the bin-width is still 1.

In [45]: hh1.bin_center

Out[45]:
array([ 120., 121., 122., 123., 124., 125., 126., 127., 128., 129., 130.,
        131., 132., 133., 134., 135., 136., 137., 138., 139., 140., 141.,
        142., 143., 144., 145., 146., 147., 148., 149., 150., 151., 152.,
        153., 154., 155., 156., 157., 158., 159., 160., 161., 162., 163.,
        164., 165., 166., 167., 168., 169., 170., 171., 172., 173., 174.,
        175., 176., 177., 178., 179., 180. ])

Histogram Operations

A histogram can be saved as a datafile as follows (using the histogram, hh1 defined above)

In [45]: hh1.save(filename = 'hh.data')

The saved histogram can then be loaded again using the following statements:

In [46]: hh = B.histo(file = 'hh.data')

Try it out and plot the two histogram to check that they are identical.

Several operations are possible with a histogram. One of the most important one is the summation of the bin content between two given limits. For example lets sum the bin content between 140 and 160 of histogram hh:

In [47]: S, sig_S = hh.sum(140, 160)

In this case S contains the sum and sig_S contains the uncertainty of this sum (i.e. its square root).

Sometimes the data that are entered into a histogram contain background data. If another histogram with just background data has been acquired the backround histogram can be subtracted from the histogram containing both signal and background data. This happens for example in the Compton Scattering experiment. An example is given below (the corresponding data files are in the example directory):

In [48]: h_sig_bkg = B.histo(file = 'h_signal_bkg.data')  # load the histogram containing signal and background
In [49]: h_bkg = B.histo(file = 'h_bkg.data')             # load the histogram containing background only
In [50]: h_sig = h_sig_bkg -  h_bkg                       # the difference of these 2 histograms contains the signal only
In [51]: h_sig.xlabel = h_sig_bkg.xlabel                  # set the labels of the new histogram
In [52]: h_sig.ylabel = h_sig_bkg.ylabel
In [53]: h_sig.title = 'Signal (bkg corrected)'

If the binning is too fine you see large statistical fluctuations. This can be reduced by combining bins using the rebin member function of histo:

In [48]: h_sig2 = h_sig.rebin(2)   # combine 2 bins to form a new histogram (h_sig2)
In [47]: h_sig4 = h_sig.rebin(4)   # combine 4 bins to form a new histogram (h_sig4)

Make a plot of the new histograms and compare them to the original.

We can normalize a histogram (i.e. make sure that the sum of all bin content is 1). This is done by dividing the histogram by the sum of all bins:

In [48]: S_h, dS_h = h_sig.sum()   # calculate the total sum
In [47]: h_sig_norm = h_sig/S      # divide all the bin conten by S and store the result in h_sig_norm

If you now sum h_sig_norm you should get the value of 1. h_sig_norm is now an approximated probability density.

Summary of important member functions:

Below is a summary of the most important information stored in histo. I use hh as the name for the histogram here by you can use any other variable name when you create it:

variable

meaning

bin_center

Central value of each bin.

bin_content

Content of each bin (the value of the counter)

bin_error

Error of the content of each bin

bin_width

Width of each bin

title

Histogram title. You can store the short description of the data (used in plotting).

xlabel

Description of the bin_center values (used in plotting).

ylabel

Description of the bin_content (used in plotting).

And here the most important (member) functions:

Function

Arguments and Results

histo

Create a histogram. There are several ways to create one depending on the arguments in histo here are the main possibilities:

  • h = histo( data_array, bins = 20, range = (10., 30.) ), data_array is a numpy array containing the data to be histogrammed, bins the number of bins, in this case 20 and range the range of values to be sorted into bins bins , in this example between 10 and 30.

  • h = histo(file = ’histo.data’), read the histogram data from file ’histo.data’ and create a new histogram. The ’histo.data’ file should have been created previously by a histo.save(’histo.data’) command (see below)

plot

(ymin = 0., filled = True, color = ’b’)

Plot the histogram. There are several keywords that can be set to control the plot. You can assign a minimal value ymin for the smallest content to be drawn (default is 0) . filled = False the area below the curve is not filled. color = ’b’ draw the line in blue. Other colors are: red(r), green(g) , magenta(m), yellow(y) or any other matplotlib color.

sum

(xmin=10., xmax=20)

Add the content of the histogram for the values ranging between (and including) xmin = 10 and xmax = 20. Returns a tuple with the value and the error. If no arguments are given it returns the total of all bins.

save

(filename = ’histo.data’)

Save the histogram to the file histo.data.

For more information see LT.box.histo.

2D-Histograms

If one is interested in analyzing correlations of data 2-dimensional histograms are frequently used. In this case one counts the number of occurrences of pairs of data. The 2D-histogram is setup as follows:

In [37]: hh2 = B.histo2d(x_vals, y_vals, bins = [15,25]) # create a 2D-histogram for the values x_vals, y_vals
                                                         # use 15 bins in the x-direction
                                                         # use 25 bins in the y-direction

Once the histogram is created you can plot it with the command:

In [38]: hh2.plot()  # plot the 2d histogram

For more detailed information on 2D-histograms see LT.box.histo2d.