Working with Data Files and Numpy Arrays

Open files and load data

To load the file you just created (my_exp_1.data) you do the following:

In [2]: mf = B.get_file('my_exp_1.data')  # B is the LT.box

my_exp_1.data is the name of the data file that you just created and that resides in your current working directory (if this did not work, see "Python cannot find my files" at the end of this section). The line of code:

B.get_file('my_exp_1.data')

creates a dfile object, to which we assign the name mf. Now you can start to ‘play’ with it. As an example you can find the column names of your data by doing the following:

In [3]: mf.show_keys()

The () is important: without it the function is not called. To look at the values associated with each name you can do:

In [4]: mf.show_data('time')
In [5]: mf.show_data('dist')
In [6]: mf.show_data('d_err')

Or you can display them together:

In [7]: mf.show_data('time:dist:d_err')

This is useful for displaying the data, but in order to work with them you need to get them into IPython as arrays. This is achieved by doing:

In [8]: t = mf['time']

If you remember from earlier, this is similar to accessing an element of a dictionary, but in this case you get a so-called numpy array, which I then stored in the variable t (for time). Numpy arrays are one of the most important objects we will be using and will be discussed some more below.

You can again display the array by just typing its name and pressing return:

In [9]: t

You should see something like (the number in the brackets will be different):

Out[9]: array([  0.,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

To access a single element of an array you enter:

In [10]: t[1]

Try out other values for the index!

The integer (whole number) in the bracket is the index and in this case runs from 0 to 10, since there are 11 elements in the array. You can find the length of an array by typing:

In [11]: len(t)

And you should get back 11.
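If you want to experiment with indexing without a data file at hand, a minimal sketch is to build a stand-in for t directly with numpy. The values are the ones shown in the output above; using np.arange here is only an assumption made for illustration and says nothing about how LabTools creates the array.

```python
import numpy as np

# Stand-in for t = mf['time']: the 11 time values 0., 1., ..., 10.
t = np.arange(11, dtype=float)

first = t[0]      # index 0 is the first element
second = t[1]     # the element accessed in the text above
last = t[10]      # the highest valid index is len(t) - 1
n = len(t)        # number of elements

print(first, second, last, n)
```

Asking for t[11] would raise an IndexError, since the valid indices end at 10.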

Numpy arrays

A numpy array is a one- or multi-dimensional array that most frequently holds numerical or logical data (but other data can also be stored). You can find a very nice introduction at numpy for beginners. Information on its dimensions and its size is stored in the shape attribute of the array. Try the following:

In [11]: t.shape

and you should get (11,).
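Besides shape, an array carries a few related attributes that are worth knowing. The sketch below again uses a hand-made stand-in for t (an assumption, so that it runs without the data file).

```python
import numpy as np

t = np.arange(11, dtype=float)   # stand-in for mf['time']

print(t.shape)   # tuple with one entry per dimension
print(t.ndim)    # number of dimensions
print(t.size)    # total number of elements
print(t.dtype)   # type of the stored values
```

For a one-dimensional array the shape is a tuple with a single entry, which is why it is written with a trailing comma.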

You can now manipulate this array. For instance, you can multiply it by a number. In the example below you multiply all values by 1000 to convert to milliseconds and assign the result to tms:

In [12]: tms = t*1000
In [13]: tms

The output now looks like this:

Out[13]: array([     0.,   1000.,   2000.,   3000.,   4000.,   5000.,   6000.,
         7000.,   8000.,   9000.,  10000.])

Almost any mathematical operation is possible; check the documentation. Now get the distance data and the corresponding errors by doing:

In [14]: dexp = mf['dist']
In [15]: derr = mf['d_err']
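To illustrate the mathematical operations mentioned above, here is a short sketch of a few of them. The variable names (t_squared, t_shifted, t_root, t_sum) are made up for this example, and t is a hand-made stand-in for the time column.

```python
import numpy as np

t = np.arange(11, dtype=float)   # stand-in for mf['time']

t_squared = t**2          # each element squared
t_shifted = t + 0.5       # add a constant to every element
t_root = np.sqrt(t)       # numpy functions also act element by element
t_sum = t + t_squared     # two arrays of equal length combine pairwise

print(t_squared)
print(t_root[4])          # square root of the element 4.0
```

In every case the operation is applied to each element, and the result is a new array of the same length.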

You can also make arrays that have the same size as your data but contain only 0’s or 1’s. In the example below we make an array exactly like derr that contains only ones and another that contains only zeros.

In [16]: err_one = np.ones_like(derr)
In [17]: err_0 = np.zeros_like(derr)

Look at err_one and err_0 and verify that they contain only 1’s or 0’s and have exactly the same number of elements as your array derr.
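Instead of inspecting the arrays by eye, you can also let numpy do the check. This is only a sketch: derr is typed in by hand here, using the error values that are printed in the loop output later in this section, and the names same_shape, all_ones and all_zeros are made up for the example.

```python
import numpy as np

# Error values as printed in the loop output later in this section
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

err_one = np.ones_like(derr)    # same shape and type as derr, filled with 1
err_0 = np.zeros_like(derr)     # same shape and type as derr, filled with 0

same_shape = err_one.shape == derr.shape == err_0.shape
all_ones = np.all(err_one == 1.0)
all_zeros = np.all(err_0 == 0.0)

print(same_shape, all_ones, all_zeros)
```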

You can convert a python list to a numpy array as follows:

In [12]: my_list = [1,2,3,4,5,6]
In [13]: my_array = np.array(my_list)

You can also convert a numpy array back to a list by doing:

In [12]: my_list = list(my_array)
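One detail that can matter later: list(my_array) keeps the individual elements as numpy numbers, while the array method tolist() converts them to plain Python numbers. The sketch below shows the difference; the names as_list and as_plain are made up for the example.

```python
import numpy as np

my_list = [1, 2, 3, 4, 5, 6]
my_array = np.array(my_list)

as_list = list(my_array)        # a list whose elements are numpy integers
as_plain = my_array.tolist()    # a list whose elements are plain Python ints

print(type(as_list[0]))
print(type(as_plain[0]))
```

For most calculations either form works; tolist() is the safer choice when another tool expects ordinary Python numbers.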

Another very useful tool is np.linspace(start,stop, num, endpoint). This function returns an array of num equally spaced values between start and stop. By default endpoint = True meaning that the stop value is included in the array. If you set endpoint = False the stop value is not included. Below are a few examples.

In [21]: np.linspace(start = -2. ,stop = 5.,num = 10, endpoint = True)     # Include the endpoint
Out[21]:
array([-2.        , -1.22222222, -0.44444444,  0.33333333,  1.11111111,
    1.88888889,  2.66666667,  3.44444444,  4.22222222,  5.        ])

In [22]: np.linspace(-2., 5., 10)                                          # short cut version
Out[22]:
array([-2.        , -1.22222222, -0.44444444,  0.33333333,  1.11111111,
    1.88888889,  2.66666667,  3.44444444,  4.22222222,  5.        ])

In [23]: np.linspace(start = -2. ,stop = 5.,num = 10, endpoint = False)    # without the end point
Out[23]: array([-2. , -1.3, -0.6,  0.1,  0.8,  1.5,  2.2,  2.9,  3.6,  4.3])
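If you are unsure about the spacing that np.linspace has chosen, you can ask for it with the optional argument retstep=True, in which case the function returns both the array and the step size. A small sketch (the variable names are made up for the example):

```python
import numpy as np

# With the end point: 10 values span 9 intervals
with_end, step_with = np.linspace(-2., 5., 10, retstep=True)

# Without the end point: 10 values span 10 intervals
no_end, step_without = np.linspace(-2., 5., 10, endpoint=False, retstep=True)

print(step_with)      # (5 - (-2)) / 9
print(step_without)   # (5 - (-2)) / 10
```

This explains why the two outputs above differ: including the end point changes the step from 0.7 to 7/9.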

As an example try the following from your Spyder console (assuming pyplot has been preloaded):

In [24]: x = np.linspace(0., 2.*np.pi, 1000)      # create an array with 1000 elements
In [25]: plot(sin(x), cos(x + np.pi/4.))          # create 2 Lissajous curves
In [26]: plot(sin(x), cos(5.*(x + np.pi/4.)))

Note that in the previous example the terms \(sin(x)\) and \(cos(x + \pi/4)\) are calculated for the entire array of 1000 values.
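To make clear what "calculated for the entire array" means, the sketch below compares the array version with an explicit loop over the individual values. It uses np.sin, on the assumption that the preloaded sin in the console is numpy's; y_vec and y_loop are names made up for the example.

```python
import math
import numpy as np

x = np.linspace(0., 2.*np.pi, 1000)

# Array version: one call evaluates sin for every element of x
y_vec = np.sin(x)

# Equivalent explicit loop using the math module, one value at a time
y_loop = np.array([math.sin(xi) for xi in x])

print(np.allclose(y_vec, y_loop))
```

Both give the same numbers; the array version is shorter to write and usually much faster.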

2-dimensional arrays

You can combine 1-dimensional arrays of the same length into a two dimensional array by:

In [16]: time_distance = np.array([t, dexp])
In [17]: time_distance.shape

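should now give you (2,11). You can access each element by its indices. The first index selects the row (0 for t, 1 for dexp) and the second index selects the position within that row; try:

In [16]: time_distance[1,3]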
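Since the shape is (2,11), the first index can only be 0 or 1 and the second runs from 0 to 10. The sketch below shows a few ways to pick rows, columns and single values. The arrays are typed in by hand using the values shown in this section, and the names row_t, pair_3, single and by_point are made up for the example.

```python
import numpy as np

t = np.arange(11, dtype=float)
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])

time_distance = np.array([t, dexp])   # shape (2, 11): row 0 is t, row 1 is dexp

row_t = time_distance[0]          # the whole first row (the times)
pair_3 = time_distance[:, 3]      # column 3: time and distance of data point 3
single = time_distance[1, 3]      # row 1, column 3: the distance of data point 3
by_point = time_distance.T        # transposed, with shape (11, 2)

print(pair_3, single, by_point.shape)
```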

Selecting data from arrays / Logical operations on arrays

As with regular Python lists, numpy arrays support a wide variety of slicing operations to select a subset of data from an array:

In [16]: t_sub = tms[2:8]
In [17]: t_sub
Out[17]: array([2000., 3000., 4000., 5000., 6000., 7000.])

This selects elements 2 through 7 of the array tms (the end index 8 is not included) and stores them in the array t_sub. You can also place the indices of the array elements that you would like to access in an array, as shown in the example below:

In [16]: i_s = np.array([2,3,5,7])
In [17]: t_s = tms[i_s]
In [18]: t_s
Out[18]: array([2000., 3000., 5000., 7000.])

Remember also:

In [16]: tms[0]    # is the first element of tms or any array in general
Out[16]: 0.0
In [18]: tms[-1]   # is the last element
Out[18]: 10000.0
In [19]: tms[-2]   # is the 2nd to last element etc.
Out[19]: 9000.0
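Slices can also leave out the start or the end, and take a step as a third number. The sketch below shows a few common forms; tms is rebuilt by hand as a stand-in, and the other names are made up for the example.

```python
import numpy as np

tms = np.arange(11, dtype=float) * 1000   # stand-in for tms = t*1000

every_other = tms[::2]     # from start to end in steps of 2
first_three = tms[:3]      # elements 0, 1 and 2
last_three = tms[-3:]      # the final three elements
reversed_tms = tms[::-1]   # all elements in reverse order

print(every_other)
print(last_three)
```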

Numpy arrays can also be used in logical operations. This is especially useful when you would like to select a subset of the data for further operations. Try out the following:

In [16]: big = tms > 4000
In [17]: small = tms < 7000.
In [17]: tms[big]
Out[17]: array([ 5000.,  6000.,  7000.,  8000.,  9000., 10000.])
In [18]: tms[small]
Out[18]: array([   0., 1000., 2000., 3000., 4000., 5000., 6000.])

The arrays big and small are used to select elements from the original array. The array tms[big] only contains those values of tms that are bigger than 4000 and tms[small] only contains the values that are smaller than 7000. The arrays big and small contain the logical result (True or False) of the logical expression for each array element.

In [19]: small
Out[19]:
array([ True,  True,  True,  True,  True,  True,  True, False, False,
False, False])

They can also be combined as:

In [20]: both = big & small
In [21]: tms[both]
Out[21]: array([5000., 6000.])

Here & means "and", and | means "or". The array tms[both] therefore contains only those elements of tms that are between 4000 and 7000 (excluding the limits).

This can also be written in one line as:

In [21]: tms[ (4000 < tms) & (tms < 7000)]
Out[21]: array([5000., 6000.])
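A logical array made from one array can be used to select from any other array of the same length, which is how you would pick the distances that belong to a certain time range. The sketch below illustrates this with hand-typed copies of the data shown in this section; the names mask, d_sel, n_sel, outside and either are made up for the example.

```python
import numpy as np

t = np.arange(11, dtype=float)
tms = t * 1000
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])

mask = (4000 < tms) & (tms < 7000)

d_sel = dexp[mask]                # distances measured at the selected times
n_sel = np.count_nonzero(mask)    # number of selected elements
outside = tms[~mask]              # ~ inverts the selection (logical not)
either = tms[(tms < 2000) | (tms > 8000)]   # | combines conditions with "or"

print(d_sel, n_sel)
print(either)
```

The parentheses around each comparison are required, because & and | bind more tightly than < and >.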

Parameters in data files

You can also access the parameters that you defined in your file. First you can look at all the parameters that you defined by doing:

In [16]: mf.par.show_all_data()
pressure 1.e5
temperature 80.

In this case you see the two parameters called pressure and temperature with a value of 1.e5 and 80, respectively. To get these values and store them in variables you would do:

In [17]: T = mf.par['temperature']
In [18]: P = mf.par['pressure']

If you get an error message saying, for example, that mf.par does not exist, you have an error in your parameter definition in the data file. For more detailed information look at the datafile documentation (pdfile).

Computations using arrays

Now all your data are in the form of variables and (numpy) arrays that can be used for computation. For instance you might want to know what percentage error each data point has. This can be done as follows:

In [16]: p_err = derr/dexp * 100.

In [17]: p_err

And the output should be:

Out[17]: array([ 31.42857143,  26.08695652,  12.5       ,  13.15789474,
         17.89473684,   8.57142857,  11.04294479,   7.60233918,
         9.23076923,   5.91133005,   7.0754717 ])
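Once the percentage errors are in an array, numpy can also summarize them for you. The sketch below rounds the values for display and finds the data points with the largest and smallest relative error. The arrays are typed in by hand using the values printed in this section, and i_max and i_min are names made up for the example.

```python
import numpy as np

dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

p_err = derr / dexp * 100.

i_max = np.argmax(p_err)   # index of the largest percentage error
i_min = np.argmin(p_err)   # index of the smallest percentage error

print(np.round(p_err, 1))  # same values, rounded to one decimal place
print(i_max, i_min)
```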

To get a somewhat nicer output you can use a for loop. First some information on loops. The simple for loop works as follows:

In [18]: for D in dexp:
   ....:     _

The cursor will have moved to the right by about 4 spaces, the prompt has changed and the cursor is typically just below the D

Now enter at the location of the _:

print( 'distance = ', D )

Your input should now look like:

In [18]: for D in dexp:
   ....:     print( 'distance = ', D )
   ....:     _

The cursor is now just below the p of print. Now press return twice and the loop starts to run. Your output will look like (with the first part of the loop):

In [18]: for D in dexp:
   ....:     print( "distance = ", D )
   ....:
   ....:
distance =  3.5
distance =  4.6
distance =  8.0
distance =  11.4
distance =  9.5
distance =  14.0
distance =  16.3
distance =  17.1
distance =  19.5
distance =  20.3
distance =  21.2

What happened here:

You created a for loop in which each element of dexp, one after another, is assigned the name D. In the loop body (the indented lines below the for ... statement) the current value of D is printed together with the string 'distance = '.

The loop ends where the indentation ends.

This is typical syntax in Python and is used for all other program blocks. In the beginning it can be a bit irritating when you first encounter it (see indent_error for an example).

Interactively, a block is closed with two returns.

In order to print all values of t, dexp and derr in one for loop I use enumerate. First I check what enumerate does:

In [19]: for i,D in enumerate(dexp):
   ....:     print( 'i = ', i, 'D = ', D )

End the last line again with two returns. You should then see:

In [19]: for i,D in enumerate(dexp):
   ....:     print( 'i = ', i, 'D = ', D )
   ....:
   ....:
i =  0 D =  3.5
i =  1 D =  4.6
i =  2 D =  8.0
i =  3 D =  11.4
i =  4 D =  9.5
i =  5 D =  14.0
i =  6 D =  16.3
i =  7 D =  17.1
i =  8 D =  19.5
i =  9 D =  20.3
i =  10 D =  21.2

In this variation, i contains the index of D in dexp. Since the corresponding values in t, dexp and derr all have the same index, I can print them all in one loop as follows:

In [20]: for i,D in enumerate(dexp):
   ....:     print( 'time = ', t[i], 'dist = ', D, ' error = ', derr[i] )

Again I close the loop with two returns. And the output now is:

In [20]: for i,D in enumerate(dexp):
   ....:     print( 'time = ', t[i], 'dist = ', D, ' error = ', derr[i] )
   ....:
   ....:
time =  0.0 dist =  3.5  error =  1.1
time =  1.0 dist =  4.6  error =  1.2
time =  2.0 dist =  8.0  error =  1.0
time =  3.0 dist =  11.4  error =  1.5
time =  4.0 dist =  9.5  error =  1.7
time =  5.0 dist =  14.0  error =  1.2
time =  6.0 dist =  16.3  error =  1.8
time =  7.0 dist =  17.1  error =  1.3
time =  8.0 dist =  19.5  error =  1.8
time =  9.0 dist =  20.3  error =  1.2
time =  10.0 dist =  21.2  error =  1.5
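As a possible variation on the loop above, Python's built-in zip steps through several arrays at once, and a format string lines the numbers up in columns. This is only a sketch: the arrays are typed in by hand using the values printed above, and the names ti, err and lines are made up for the example.

```python
import numpy as np

t = np.arange(11, dtype=float)
dexp = np.array([3.5, 4.6, 8.0, 11.4, 9.5, 14.0, 16.3, 17.1, 19.5, 20.3, 21.2])
derr = np.array([1.1, 1.2, 1.0, 1.5, 1.7, 1.2, 1.8, 1.3, 1.8, 1.2, 1.5])

lines = []
for ti, D, err in zip(t, dexp, derr):
    # one decimal place and a fixed width, so that the columns line up
    lines.append(f"time = {ti:5.1f}  dist = {D:5.1f}  error = {err:4.1f}")

print("\n".join(lines))
```

With zip there is no need for an index, which avoids mistakes when the arrays are used in a different order.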

Now you have learned how to get the data and how to loop over data. There are many more loop possibilities in Python that you can find in the documentation. For your needs in modern lab the for loop is enough.

Python cannot find my files!

This is a problem that many people encounter in the beginning. When you issue the command:

In [2]: mf = B.get_file('my_exp_1.data')  # B is the LT.box

Python looks for the file in the current working directory. Where is this? There are three commands that you can issue from within IPython regarding the directory (or folder) that you are currently working in:

In [1]: pwd           # print working directory: displays where it is currently looking for files
Out[1]: '/Users/boeglinw'
In [2]: ls            # list contents of the current directory
In [3]: cd Documents  # change directory to Documents, which is inside boeglinw
In [4]: pwd
Out[4]: '/Users/boeglinw/Documents'
In [5]: cd ..         # change directory back up to boeglinw
In [6]: pwd
Out[6]: '/Users/boeglinw'

This works for all operating systems. Alternatively, use the file tab in Spyder to set your working directory, or right-click on the tab in the editor window containing your file name and select ‘Set console working directory’. If you need more help let me know.
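The pwd, ls and cd commands are typed at the IPython prompt. Inside a Python script you can get the same information from the standard os module. The sketch below checks whether the data file is in the current working directory before you try to open it; the names cwd, entries, file_name and found are made up for the example.

```python
import os

cwd = os.getcwd()             # same information as pwd
entries = os.listdir(cwd)     # same information as ls

file_name = 'my_exp_1.data'
found = os.path.isfile(os.path.join(cwd, file_name))

print('working directory:', cwd)
print(file_name, 'found' if found else 'not found in this directory')

# os.chdir(path) changes the working directory, like cd
```

If the file is reported as not found, change to the directory that contains it and try B.get_file again.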