Working with Data Files and Numpy Arrays¶
Open files and load data¶
To load the file you just created (my_exp_1.data) you do the following:
In [2]: mf = B.get_file('my_exp_1.data') # B is the LT.box
my_exp_1.data is the name of the data file that you just created and that resides in your current working directory (if this did not work, see the section "Python cannot find my files !").
The line of code:
B.get_file('my_exp_1.data')
creates a dfile object, to which we assign the name mf.
Now you can start to ‘play’ with it. As an example you can find the
column names of your data by doing the following:
In [3]: mf.show_keys()
The () is important. To look at the values associated with each name you can do:
In [4]: mf.show_data('time')
In [5]: mf.show_data('dist')
In [6]: mf.show_data('d_err')
Or you can display them together:
In [7]: mf.show_data('time:dist:d_err')
This may be useful for showing them but in order to work with the data you need to get them into ipython as arrays. This is achieved by doing:
In [8]: t = mf['time']
If you remember from earlier, this is similar to accessing an element of a dictionary, but in this case you get a so-called numpy array (numpy.array()), which I then stored in the variable t (for time).
Numpy arrays are one of the most important objects we will be using and will be discussed some more below.
You can display the array again by just typing its name and hitting return:
In [9]: t
You should see something like this (the number in the brackets after Out will be different):
Out[9]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
To access a single element of an array you enter:
In [10]: t[1]
Try out other values for the index!
The integer (whole number) in the bracket is the index and in this case runs from 0 to 10, since there are 11 elements in the array. You can find the length of an array by typing:
In [11]: len(t)
And you should get back 11.
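If you do not have the data file at hand, the following minimal sketch lets you try the same indexing steps. It assumes that numpy is imported as np and builds the array t by hand from the values shown above instead of reading it from my_exp_1.data.

```python
import numpy as np

# Stand-in for mf['time']: the values are copied from the output shown above.
t = np.array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])

first = t[0]     # index 0 is the first element
second = t[1]    # index 1 is the second element
n = len(t)       # number of elements in the array

print(first, second, n)
```

Asking for t[11] would raise an IndexError, since the largest valid index is len(t) - 1.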
Numpy arrays¶
A numpy array is a one- or multi-dimensional array that most frequently holds numerical or logical data (but other data can also be stored). You can find a very nice introduction at
numpy for beginners .
Information on its dimensions and its size is stored in the shape attribute of the array. Try the following:
In [11]: t.shape
and you should get (11,).
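Besides shape, a numpy array carries a few related attributes. The sketch below uses a stand-in array created with np.arange (an assumption, your t comes from the data file) to show them side by side.

```python
import numpy as np

# Stand-in for the time array: 0., 1., ..., 10.
t = np.arange(11, dtype=float)

print(t.shape)   # tuple with one entry per dimension
print(t.ndim)    # number of dimensions
print(t.size)    # total number of elements
print(t.dtype)   # data type of the elements
```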
You can now manipulate this array. For instance you can multiply it by a number. In the example you will multiply all values by 1000 to convert to milliseconds and assign the new result to tms:
In [12]: tms = t*1000
In [13]: tms
The output now looks like this:
Out[13]: array([ 0., 1000., 2000., 3000., 4000., 5000., 6000.,
7000., 8000., 9000., 10000.])
Almost any mathematical operation is possible; check the documentation. Now get the distance data and the corresponding errors by doing:
In [14]: dexp = mf['dist']
In [15]: derr = mf['d_err']
You can also make arrays that have the same size as your data but contain only 0’s or 1’s. In the example below we make an array exactly like derr that contains only ones and another that contains only zeros.
In [16]: err_one = np.ones_like(derr)
In [17]: err_0 = np.zeros_like(derr)
Look at err_one and err_0 and verify that they contain only 1’s or 0’s and have exactly the same number of elements as your array derr.
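Instead of checking by eye you can also let numpy do the verification. This is a minimal sketch: the derr values are typed in by hand from the example data in this tutorial rather than read from the file.

```python
import numpy as np

# Stand-in for mf['d_err'], copied from the example data.
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

err_one = np.ones_like(derr)
err_0 = np.zeros_like(derr)

same_shape = (err_one.shape == derr.shape) and (err_0.shape == derr.shape)
all_ones = bool(np.all(err_one == 1))    # True if every element equals 1
all_zeros = bool(np.all(err_0 == 0))     # True if every element equals 0

print(same_shape, all_ones, all_zeros)
```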
You can convert a python list to a numpy array as follows:
In [12]: my_list = [1,2,3,4,5,6]
In [13]: my_array = np.array(my_list)
You can also convert a numpy array back to a list by doing:
In [12]: my_list = list(my_array)
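One detail worth knowing: list(my_array) keeps the elements as numpy numbers, while the array method tolist() converts them to plain Python numbers. The short sketch below shows the difference; the variable names as_list and as_py_list are just for illustration.

```python
import numpy as np

my_array = np.array([1, 2, 3, 4, 5, 6])

as_list = list(my_array)          # list whose items are numpy integers
as_py_list = my_array.tolist()    # list whose items are plain Python ints

print(type(as_list[0]), type(as_py_list[0]))
```

For most calculations either form works; tolist() is mainly useful when another tool expects ordinary Python numbers.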
Another very useful tool is np.linspace(start, stop, num, endpoint). This function returns an array of num equally spaced values between start and stop. By default endpoint = True, meaning that the stop value is included in the array. If you set endpoint = False the stop value is not included. Below are a few examples.
In [21]: np.linspace(start = -2. ,stop = 5.,num = 10, endpoint = True) # Include the endpoint
Out[21]:
array([-2. , -1.22222222, -0.44444444, 0.33333333, 1.11111111,
1.88888889, 2.66666667, 3.44444444, 4.22222222, 5. ])
In [22]: np.linspace(-2., 5., 10) # short cut version
Out[22]:
array([-2. , -1.22222222, -0.44444444, 0.33333333, 1.11111111,
1.88888889, 2.66666667, 3.44444444, 4.22222222, 5. ])
In [23]: np.linspace(start = -2. ,stop = 5.,num = 10, endpoint = False) # without the end point
Out[23]: array([-2. , -1.3, -0.6, 0.1, 0.8, 1.5, 2.2, 2.9, 3.6, 4.3])
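The spacing differs between the two cases because the same interval is divided into num - 1 steps when the end point is included and into num steps when it is not. As a small sketch, the optional retstep argument of np.linspace returns the step size together with the array so you can check this yourself.

```python
import numpy as np

# retstep=True makes linspace return (array, step) instead of just the array.
with_end, step_with = np.linspace(-2., 5., 10, endpoint=True, retstep=True)
without_end, step_without = np.linspace(-2., 5., 10, endpoint=False, retstep=True)

print(step_with)       # (5 - (-2)) / 9
print(step_without)    # (5 - (-2)) / 10
```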
As an example try the following from your spyder console (assuming pyplot has been preloaded so that plot is available):
In [24]: x = np.linspace(0., 2.*np.pi, 1000) # create an array with 1000 elements
In [25]: plot(np.sin(x), np.cos(x + np.pi/4.)) # create 2 Lissajous curves
In [26]: plot(np.sin(x), np.cos(5.*(x + np.pi/4.)))
Note that in the previous example the terms \(sin(x)\) and \(cos(x + \pi/4)\) are calculated for the entire array of 1000 values.
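To make clear what "calculated for the entire array" means, the sketch below compares the single numpy call with the same calculation written as an explicit loop over the elements. The names y_array and y_loop are only used for this illustration.

```python
import math
import numpy as np

x = np.linspace(0., 2. * np.pi, 1000)

# One call evaluates the function for all 1000 values at once.
y_array = np.cos(x + np.pi / 4.)

# The same calculation, one element at a time.
y_loop = np.array([math.cos(xi + math.pi / 4.) for xi in x])

print(y_array.shape, np.allclose(y_array, y_loop))
```

Both give the same numbers; the array version is shorter to write and usually much faster.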
2-dimensional arrays¶
You can combine 1-dimensional arrays of the same length into a two-dimensional array by:
In [16]: time_distance = np.array([t, dexp])
In [17]: time_distance.shape
should now give you (2,11). You can access each element by its indices. The first index selects the row (0 or 1 here) and the second the column (0 to 10); try:
In [18]: time_distance[1,3]
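Here is a self-contained sketch of the two-dimensional case, again with the example values typed in by hand. With a shape of (2, 11) the valid row indices are 0 and 1 and the valid column indices are 0 to 10.

```python
import numpy as np

# Stand-ins for mf['time'] and mf['dist'], copied from the example data.
t = np.arange(11, dtype=float)
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])

time_distance = np.array([t, dexp])   # row 0 holds t, row 1 holds dexp

print(time_distance.shape)      # (rows, columns)
print(time_distance[1, 3])      # row 1 (distance), column 3
print(time_distance[0])         # the whole first row, i.e. the times
print(time_distance[:, 3])      # column 3: time and distance of data point 3
print(time_distance.T.shape)    # transposed: one row per data point
```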
Selecting data from arrays / Logical operations on arrays¶
As with regular python lists, numpy arrays support a wide variety of slicing operations to select a sub-set of data from an array:
In [16]: t_sub = tms[2:8]
In [17]: t_sub
Out[17]: array([2000., 3000., 4000., 5000., 6000., 7000.])
This selects elements 2 through 7 of the array tms (the stop index 8 itself is not included) and stores them in the array t_sub. You can also place the indices of the array elements that you would like to access in an array, as shown in the example below:
In [16]: i_s = np.array([2,3,5,7])
In [17]: t_s = tms[i_s]
In [18]: t_s
Out[16]: array([2000., 3000., 5000., 7000.])
Remember also:
In [16]: tms[0] # is the first element of tms or any array in general
Out[16]: 0.0
In [18]: tms[-1] # is the last element
Out[18]: 10000.0
In [19]: tms[-2] # is the 2nd to last element etc.
Out[16]: 9000.0
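A few more slicing variations that are often handy are sketched below. The array tms is rebuilt here with np.arange so the example runs on its own; the variable names are only for illustration.

```python
import numpy as np

# Stand-in for tms: 0., 1000., ..., 10000.
tms = np.arange(11, dtype=float) * 1000.

first_three = tms[:3]       # start omitted: begin at element 0
last_three = tms[-3:]       # stop omitted: run to the end
every_other = tms[::2]      # a step of 2 picks every second element
reversed_tms = tms[::-1]    # a negative step runs through the array backwards

print(first_three, last_three)
print(every_other)
print(reversed_tms)
```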
Numpy arrays can also be used in logical operations. This is especially useful when you would like to select a subset of the data for further operations. Try out the following
In [16]: big = tms > 4000
In [17]: small = tms < 7000.
In [17]: tms[big]
Out[17]: array([ 5000., 6000., 7000., 8000., 9000., 10000.])
In [18]: tms[small]
Out[18]: array([ 0., 1000., 2000., 3000., 4000., 5000., 6000.])
The arrays big and small are used to select elements from the original array. The array tms[big] only contains those values of tms that are bigger than 4000 and tms[small] only contains the values that are smaller than 7000. The arrays big and small contain the logical result (True or False) of the logical expression for each array element.
In [19]: small
Out[19]:
array([ True, True, True, True, True, True, True, False, False,
False, False])
They can also be combined as
In [20]: both = big & small
In [21]: tms[both]
Out[21]: array([5000., 6000.])
Here & means and, and | means or. The array tms[both] therefore contains only those elements of tms that are between 4000 and 7000 (excluding the limits).
This can also be written in one line as:
In [21]: tms[ (4000 < tms) & (tms < 7000)]
Out[21]: array([5000., 6000.])
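Two things are worth trying in addition. First, the parentheses around each comparison are needed because & and | bind more tightly than < and >. Second, a logical array made from one array can select the matching elements of another array of the same length. The sketch below assumes the example values typed in by hand; the names inside, outside and d_inside are only for illustration.

```python
import numpy as np

tms = np.arange(11, dtype=float) * 1000.
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])

inside = (tms > 4000) & (tms < 7000)
outside = (tms <= 4000) | (tms >= 7000)   # | is the element-wise "or"
also_outside = ~inside                    # ~ inverts every True/False

n_inside = np.count_nonzero(inside)   # how many elements were selected
d_inside = dexp[inside]               # distances measured at the selected times

print(n_inside, d_inside)
```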
Parameters in data files¶
You can also access the parameters that you defined in your file. First you can look at all the parameters that you defined by doing:
In [16]: mf.par.show_all_data()
pressure 1.e5
temperature 80.
In this case you see the two parameters called pressure and temperature with a value of 1.e5 and 80, respectively. To get these values and store them in variables you would do:
In [17]: T = mf.par['temperature']
In [18]: P = mf.par['pressure']
If you get an error message saying e.g. mf.par does not exist, you have an error in your parameter definition in the data file.
For more detailed information look at the datafile documentation (pdfile).
Computations using arrays¶
Now all your data are in the form of variables and (numpy) arrays that can be used for computation. For instance you might want to know what percentage error each data point has. This can be done as follows:
In [16]: p_err = derr/dexp * 100.
In [17]: p_err
And the output should be:
Out[17]: array([ 31.42857143, 26.08695652, 12.5 , 13.15789474,
17.89473684, 8.57142857, 11.04294479, 7.60233918,
9.23076923, 5.91133005, 7.0754717 ])
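Once the percentage errors are in an array you can summarize them with a few array methods. This is a sketch with the example values typed in by hand; rounding to one decimal is just a choice for display.

```python
import numpy as np

dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

p_err = derr / dexp * 100.    # element-by-element division, then scaling

print(np.round(p_err, 1))     # rounded copy, easier to read
print(p_err.max())            # largest percentage error
print(p_err.argmax())         # index of the data point where it occurs
print(p_err.mean())           # average percentage error
```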
To get somewhat nicer output you can use a for loop. First some information on loops. The simple for loop works as follows:
In [18]: for D in dexp:
....:     _
The cursor will have moved to the right by about 4 spaces, the prompt has changed, and the cursor is typically just below the D. Now enter at the location of the _:
print( 'distance = ', D )
The display should then look like:
In [18]: for D in dexp:
....:     print( 'distance = ', D )
....:     _
The cursor is now just below the p of print. Now press return twice and the loop starts to run. Your output will look like this (with the first part of the loop):
In [18]: for D in dexp:
....:     print( 'distance = ', D )
....:
....:
distance = 3.5
distance = 4.6
distance = 8.0
distance = 11.4
distance = 9.5
distance = 14.0
distance = 16.3
distance = 17.1
distance = 19.5
distance = 20.3
distance = 21.2
What happened here:
You created a for loop, where each element of dexp (one after another) gets assigned the name D.
In the loop body (what comes below the for... statement and is indented) the current value D is printed together with the string 'distance = '.
The loop ends where the indentation ends.
This is typical syntax in python and is used for all other program blocks. In the beginning it can be a bit irritating when you first encounter it (see indent_error for an example).
Interactively a block is closed with two returns.
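In a script file there are no two returns; the indentation alone decides what belongs to the loop. The following sketch uses a short stand-in list and a counter (both assumptions for illustration) to show which lines run once per element and which run only once.

```python
dexp = [3.5, 4.6, 8.0]    # a short stand-in list

count = 0
for D in dexp:
    # indented: part of the loop body, runs once per element
    print('distance = ', D)
    count += 1

# not indented: runs once, after the loop has finished
print('number of values printed:', count)
```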
In order to print all values of t, dexp and derr in one for loop I use enumerate. First I check what enumerate does:
In [19]: for i,D in enumerate(dexp):
....:     print( 'i = ', i, 'D = ', D )
and end the last line again with 2 returns. You should then see:
In [19]: for i,D in enumerate(dexp):
....:     print( 'i = ', i, 'D = ', D )
....:
....:
i = 0 D = 3.5
i = 1 D = 4.6
i = 2 D = 8.0
i = 3 D = 11.4
i = 4 D = 9.5
i = 5 D = 14.0
i = 6 D = 16.3
i = 7 D = 17.1
i = 8 D = 19.5
i = 9 D = 20.3
i = 10 D = 21.2
In this variation i contains the index of D in dexp. Since the corresponding values in t, dexp and derr all have the same index, I can print them all in one loop as follows:
In [20]: for i,D in enumerate(dexp):
....:     print( 'time = ', t[i], 'dist = ', D, ' error = ', derr[i] )
Again I close the loop with 2 returns. And the output now is:
In [20]: for i,D in enumerate(dexp):
....:     print( 'time = ', t[i], 'dist = ', D, ' error = ', derr[i] )
....:
....:
time = 0.0 dist = 3.5 error = 1.1
time = 1.0 dist = 4.6 error = 1.2
time = 2.0 dist = 8.0 error = 1.0
time = 3.0 dist = 11.4 error = 1.5
time = 4.0 dist = 9.5 error = 1.7
time = 5.0 dist = 14.0 error = 1.2
time = 6.0 dist = 16.3 error = 1.8
time = 7.0 dist = 17.1 error = 1.3
time = 8.0 dist = 19.5 error = 1.8
time = 9.0 dist = 20.3 error = 1.2
time = 10.0 dist = 21.2 error = 1.5
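As an alternative sketch, the built-in zip hands you the matching elements of several arrays directly, so no index is needed, and an f-string lets you control the column widths. The arrays are typed in by hand here, and the format widths are just one possible choice.

```python
import numpy as np

t = np.arange(11, dtype=float)
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

lines = []
for ti, D, dD in zip(t, dexp, derr):
    # {value:width.decimalsf} prints a number with fixed width and decimals
    line = f'time = {ti:4.1f}  dist = {D:5.1f}  error = {dD:3.1f}'
    lines.append(line)
    print(line)
```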
Now you have learned how to get the data and how to loop over data. There are many more loop possibilities in Python that you can find in the documentation. For your needs in modern lab the for loop is enough.
Python cannot find my files !¶
This is a problem that many people encounter in the beginning. When you issue the command:
In [2]: mf = B.get_file('my_exp_1.data') # B is the LT.box
Python looks for the file in the current working directory. Where is this? There are three commands that you can issue from within ipython regarding the directory (or folder) that you are currently working in:
In [1]: pwd # print working directory: displays where it is currently looking for files
Out[1]: '/Users/boeglinw'
In [2]: ls # list contents of the current directory
In [3]: cd Documents # change directory to the Documents which is part of boeglinw
In [4]: pwd
Out[4]: '/Users/boeglinw/Documents'
In [5]: cd .. # change directory back up to boeglinw
In [6]: pwd
Out[6]: '/Users/boeglinw'
This works for all operating systems. Alternatively use the file tab in spyder to set your working directory, or right-click on the tab in the editor window containing your file name and select ‘Set console working directory’. If you need more help let me know.
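The commands pwd, ls and cd are ipython conveniences. Inside a regular Python script you can do the same with the standard os module, as in the minimal sketch below. The file name my_exp_1.data is the one used in this tutorial; the check simply reports whether it is in the directory where Python is looking.

```python
import os

start_dir = os.getcwd()               # like pwd: the current working directory
entries = os.listdir(start_dir)       # like ls: names of files and folders in it
print(start_dir)
print(entries)

# Is the data file where Python will look for it?
print(os.path.isfile('my_exp_1.data'))

parent_dir = os.path.dirname(start_dir)
os.chdir(parent_dir)                  # like cd ..
print(os.getcwd())

os.chdir(start_dir)                   # go back to where we started
```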