Working With Histograms¶
Histograms are used to represent the distribution of data. The data can be 1-dimensional (1-D, a single value) or multi dimensional (n-D, seveal values per data point). The LT.box supports 1D and 2D histograms.
Histograms are constructed by defining a range of values that is divided into a series of (equal) consecutive, non-overlapping intervals. These intervals are called bins and the number of intervals is called the number of bins. For a 2-D histrogram there are two ranges and two number of bins. The histogram is filled when for each test value the interval into which the test value falls is determined and the numer of times a test value falls into each interval is counted. The final histogram is typically plotted in form of a bar graph, where on the x-axis one plots the central values the intervals (bin-center) and on the y-axis the final number of times a test value was in withing that interval (bin-content). The histogram provides therefore a rough estimate of the underlying distribution density. An example of a 1D histogram is shown below:

Anatomy of a histogram. The red bars indicate 150 randomly distributed test values. The underlyin probability distribution is a gaussian centered at 0 with a sigma of 1. The arrows point to the bin-width, bin-center, bin-content and the range used in this example.¶
The same principle also applies to 2-D (and higher) histograms which allow one to study the correlation between pairs of values.
1D-Histograms¶
Histograms are typically used in counting experiments, especially in nuclear physics experiments. One experiment you do very early on is to count the number of decays from a radioactive source during a certain amount of time. This measurement is then repeated many times. I have simulated the result of such and experiment and the data are stored in the file ’counts.data’. To work with these data I first load them in the same way as I did before:
In [27]: mcf = B.get_file('counts.data') # get the file
To see what columns have been defined you do (as before)
In [28]: mcf.show_keys()
And the output should be something like
In [28]: mcf.show_keys()
['n', 'counts', 'indx']
Here n
is the number of the measurement (the first one would be 0) and
counts
is the number of counts you obtained in this experiment. The
additional variable indx
is used internally and is not interesting to
you (DO NOT TOUCH IT!). To get an idea what the data look like, I
could just plot counts
as a function of n
. Since I want to plot the
data that are part of a pdfile
object I will used the dplot_exp()
function.
In [29]: B.dplot_exp(mcf, 'n', 'counts')
You will notice that the data scatter around a value of 150. To analyze them further we need to load them into numpy arrays as shonw below:
In [30]: ne = mcf['n']
In [31]: counts = mcf['counts']
Now let’s calculate the average. To do this you need a for
loop
In [32]: N = 0 # set the counter to 0
In [33]: sum = 0. # set the sum to 0
In [34]: for cc in counts: # loop over all counts
....: N += 1 # increment the counter
....: sum += cc # add the current value to the total sum
Close the loop with two returns. Now you you can get the average by doing
In [35]: N_av = sum/N # evaluate the average
This was a little programming. First I created a variable called N
to
count the number of terms in the sum and, in the first statement (N =
0
), I set it to 0. Then I did the same for another variable, called
sum
. sum
should contain the running sum of the values which is
carried out in a for
loop (for cc in counts:
). In the loop, cc
contains the current value of counts. There are two strange looking
lines. The first, N += 1
means: add 1 to the current value of N or
increment N
by 1. The second does the same for the variable sum
but instead of adding 1 we add the current value of the counts stored in
cc
. As a consequence at the end of the loop N
contains the number of
measurements and sum
the total sum
of the values. To calculate the
average all I have to do is take the ratio of the two numbers and store
them in the variable N_av
. This was done in the last line (N_av =
sum/N
). To look at its value:
In [35]: N_av
Out[35]: 150.41499999999999
This was a small programming example. Since the average is such an important quantity this procedure has been built into numpy.
In [36]: np.average( counts )
Will give you the same result. There are many useful function defined
for numpy.array()
‘s. Here are just a few useful ones using counts
as an example:
Function |
Returns |
---|---|
counts.sum() |
the sum of all elements |
counts.mean() |
the mean (average) of all elements |
counts.var() |
the variance of all elements |
counts.min() |
the smallest value of the array |
counts.max() |
the larges value of the array |
Try them out !
To create a 1D histogram you can used the histo
function of LT.box
. The simplest way to use it is as
follows:
In [37]: hh = B.histo(counts)
It looks as if nothing has happened. But all the work as already been
completed. The variable hh
contains the new histogram. You can now make a plot of the histogram by doing
In [38]: hh.plot()
And you will get a (blue) graph that represents the frequency with
which certain counts values occur. There are other useful plotting commands, such as
hh.plot_exp()
which will plot error bars associated with the bin contents.
Check the histogram documentation for more details.

What you see is a relatively coarse representation of this
distribution since in its simples form histo
automatically determines the range of values that you provided. This
range is divided into 10 equally spaced regions, called bins
centered
around a central value (bin_center
) with a width called the bin_width
. It
then goes through all the data in the array counts
and checks in
which region (bin
) each value falls. To each bin
belongs a
counter that is incremented each time a values falls within the
associated region. The value of each counter is called the bin
content
. The graph now shows the value of the bin content
as a
function of the bin-center
value. The width of each step
corresponds to the width of a bin. You have much more control on how
your histogram is setup. Lets re-define it and this time we give it a
range and also the number of bins. From the first graph we see that
most bin-content
values of counts lie between 120 and 180. If we
want to get a count of how often a specific value occurs we need to
make the bin width equal to one as follows
In [39]: hh = B.histo(counts, range = (120., 180.), bins = 60)
This is another example of using keywords to control what an
object/function is doing. the range = (120., 180.)
determines the
range of values accepted, where range
is the keyword. The other
statement bins = 60
, sets the number of bins to 60. In that way the
width of a bin is 1. Now plot the new histogram again.
In [40]: clf(); show()
In [41]: hh.plot(); show()
(Remember the first line clf()
clears the figure). Now you have a much
better representation of the histogram:

The central values (bin-center
), the bin-content
(number of occurrences) and the
associated errors (bin-error
) of each bin are can be accesses as follows:
In [42]: hh.bin_center (return)
Produces the the output:
In [43]: hh.bin_center
Out[43]:
array([ 120.5, 121.5, 122.5, 123.5, 124.5, 125.5, 126.5, 127.5,
128.5, 129.5, 130.5, 131.5, 132.5, 133.5, 134.5, 135.5,
136.5, 137.5, 138.5, 139.5, 140.5, 141.5, 142.5, 143.5,
144.5, 145.5, 146.5, 147.5, 148.5, 149.5, 150.5, 151.5,
152.5, 153.5, 154.5, 155.5, 156.5, 157.5, 158.5, 159.5,
160.5, 161.5, 162.5, 163.5, 164.5, 165.5, 166.5, 167.5,
168.5, 169.5, 170.5, 171.5, 172.5, 173.5, 174.5, 175.5,
176.5, 177.5, 178.5, 179.5])
When working with integer values suich as counts it would be better if the bin-center were also be at integer values. To achieve this remember that the range is defined by the lower limit (left edge) of the first bin (bin number 0) and the upper limit (right edge) of the last bin (bin number \(n\)). To have bin centers at integer values the lower limit of the range should be between \(x_0 - \delta x/2\) and \(x_n + \delta x/2\) and the number of bins should be \(n+1\). Here \(x_0\) is thedesired bin-center of the fist bin, \(x_n\) the desired bin-center of the last bin and \(\delta x\) is the bin-width. For the previous example this would mean that the histogram should be created as shown below
In [44]: hh1 = B.histo(counts, range = (120.0 - 0.5, 180.0 + 0.5), bins = 60 + 1,
xlabel = 'counts', ylabel = 'frequency', title = 'bin width:1, centered')
The resulting bin-centers are as expected centered on integer count values and the bin-width is still 1.
In [45]: hh1.bin_center
Out[45]:
array([ 120., 121., 122., 123., 124., 125., 126., 127., 128., 129., 130.,
131., 132., 133., 134., 135., 136., 137., 138., 139., 140., 141.,
142., 143., 144., 145., 146., 147., 148., 149., 150., 151., 152.,
153., 154., 155., 156., 157., 158., 159., 160., 161., 162., 163.,
164., 165., 166., 167., 168., 169., 170., 171., 172., 173., 174.,
175., 176., 177., 178., 179., 180. ])
Histogram Operations¶
A histogram can be saved as a datafile as follows (using the histogram, hh1
defined above)
In [45]: hh1.save(filename = 'hh.data')
The saved histogram can then be loaded again using the following statements:
In [46]: hh = B.histo(file = 'hh.data')
Try it out and plot the two histogram to check that they are identical.
Several operations are possible with a histogram. One of the most important one is the summation of the bin content between two given limits. For example lets sum the bin content between 140 and 160 of histogram hh:
In [47]: S, sig_S = hh.sum(140, 160)
In this case S
contains the sum and sig_S
contains the uncertainty of this sum (i.e. its square root).
Sometimes the data that are entered into a histogram contain background data. If another histogram with just background data has been acquired the backround histogram can be subtracted from the histogram containing both signal and background data. This happens for example in the Compton Scattering experiment. An example is given below (the corresponding data files are in the example directory):
In [48]: h_sig_bkg = B.histo(file = 'h_signal_bkg.data') # load the histogram containing signal and background
In [49]: h_bkg = B.histo(file = 'h_bkg.data') # load the histogram containing background only
In [50]: h_sig = h_sig_bkg - h_bkg # the difference of these 2 histograms contains the signal only
In [51]: h_sig.xlabel = h_sig_bkg.xlabel # set the labels of the new histogram
In [52]: h_sig.ylabel = h_sig_bkg.ylabel
In [53]: h_sig.title = 'Signal (bkg corrected)'
If the binning is too fine you see large statistical fluctuations. This can be reduced by combining bins using
the rebin
member function of histo:
In [48]: h_sig2 = h_sig.rebin(2) # combine 2 bins to form a new histogram (h_sig2)
In [47]: h_sig4 = h_sig.rebin(4) # combine 4 bins to form a new histogram (h_sig4)
Make a plot of the new histograms and compare them to the original.
We can normalize a histogram (i.e. make sure that the sum of all bin content is 1). This is done by dividing the histogram by the sum of all bins:
In [48]: S_h, dS_h = h_sig.sum() # calculate the total sum
In [47]: h_sig_norm = h_sig/S # divide all the bin conten by S and store the result in h_sig_norm
If you now sum h_sig_norm
you should get the value of 1. h_sig_norm
is now an approximated probability density.
Summary of important member functions:¶
Below is a summary of the most important information stored in
histo
. I use hh
as the name for the histogram here by you can use
any other variable name when you create it:
variable |
meaning |
---|---|
bin_center |
Central value of each bin. |
bin_content |
Content of each bin (the value of the counter) |
bin_error |
Error of the content of each bin |
bin_width |
Width of each bin |
title |
Histogram title. You can store the short description of the data (used in plotting). |
xlabel |
Description of the bin_center values (used in plotting). |
ylabel |
Description of the bin_content (used in plotting). |
And here the most important (member) functions:
Function |
Arguments and Results |
---|---|
histo |
Create a histogram. There are several ways to create one depending on the arguments in histo here are the main possibilities:
|
plot |
Plot the histogram. There are several keywords that can be set to
control the plot. You can assign a minimal value |
sum |
Add the content of the histogram for the values ranging between (and
including) |
save |
Save the histogram to the file |
For more information see LT.box.histo
.
2D-Histograms¶
If one is interested in analyzing correlations of data 2-dimensional histograms are frequently used. In this case one counts the number of occurrences of pairs of data. The 2D-histogram is setup as follows:
In [37]: hh2 = B.histo2d(x_vals, y_vals, bins = [15,25]) # create a 2D-histogram for the values x_vals, y_vals
# use 15 bins in the x-direction
# use 25 bins in the y-direction
Once the histogram is created you can plot it with the command:
In [38]: hh2.plot() # plot the 2d histogram
For more detailed information on 2D-histograms see LT.box.histo2d
.