Quick start to CMS Open Data

This is a Jupyter notebook, in which you can have text “cells” (like this one) and code “cells”, i.e. boxes where you can write Python code to be executed (like the one below). There is no need to install anything (if you run this on http://mybinder.org/) or to find compilers: it is all done for you in the background.

For this exercise with CMS open data, we use Python as the programming language. It is easy to get started: just type, for example, 1 + 1 in the cell below and click on the “Run” icon above.

1+1

Now try something more advanced, for example sqrt(4):

sqrt(4)

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Input In [2], in <cell line: 1>()
----> 1 sqrt(4)

NameError: name 'sqrt' is not defined

That failed: basic Python can do some operations, but for anything more complex we need additional software packages, or “modules”.

That’s what we will import here (select the cell below and run it by clicking on the Run icon):

import pandas as pd 
#pandas is for data structures and data analysis tools
import numpy as np
#numpy is for scientific computing
import matplotlib.pyplot as plt
#matplotlib is for plotting

Do you want to check whether sqrt(4) works now? It still does not: you have to tell Python that the function comes from numpy (which, for brevity, was imported under the name np above). So try np.sqrt(4):

np.sqrt(4)

2.0
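As a small illustration of why a package such as numpy is convenient, here is a minimal sketch comparing it with the math module from the Python standard library. The array values are made up for the example.

```python
import math
import numpy as np

# The standard-library version works on one number at a time.
print(math.sqrt(4))

# numpy applies the square root to every element of an array in one call.
values = np.array([4.0, 9.0, 16.0])  # made-up example values
roots = np.sqrt(values)
print(roots)
```

This element-wise behaviour is what lets us later take the square root of a whole column of data at once.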

Note that you can modify this page at any time; it does no harm. You can add cells (Insert) and change their type from Code (the default) to text (i.e. “Markdown”) under Cell -> Cell type.

OK, let’s get started with the data. We’ll read the data from the CERN Open Data Portal. We store it in a variable called “data”, but you can use any other name.

# We use the read_csv function from pandas to read the data into a "data frame"
data = pd.read_csv('http://cern.ch/opendata/record/545/files/Dimuon_DoubleMu.csv')

Let’s have a look at what we got in these data:

data.head()

	Run	Event	type1	E1	px1	py1	pz1	pt1	eta1	phi1	...	type2	E2	px2	py2	pz2	pt2	eta2	phi2	Q2	M
0	165617	74601703	G	9.6987	-9.5104	0.3662	1.8633	9.5175	0.1945	3.1031	...	G	9.7633	7.3277	-1.1524	6.3473	7.4178	0.7756	-0.1560	1	17.4922
1	165617	75100943	G	6.2039	-4.2666	0.4565	-4.4793	4.2910	-0.9121	3.0350	...	G	9.6690	7.2740	-2.8211	-5.7104	7.8019	-0.6786	-0.3700	1	11.5534
2	165617	75587682	G	19.2892	-4.2121	-0.6516	18.8121	4.2622	2.1905	-2.9881	...	G	9.8244	4.3439	-0.4735	8.7985	4.3697	1.4497	-0.1086	1	9.1636
3	165617	75660978	G	7.0427	-6.3268	-0.2685	3.0802	6.3325	0.4690	-3.0992	...	G	5.5857	4.4748	0.8489	-3.2319	4.5546	-0.6605	0.1875	1	12.4774
4	165617	75947690	G	7.2751	0.1030	-5.5331	-4.7212	5.5340	-0.7736	-1.5522	...	G	7.3181	-0.3988	6.9408	2.2825	6.9523	0.3227	1.6282	1	14.3159

5 rows × 21 columns
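To see what read_csv does without downloading anything, here is a minimal sketch that reads a tiny CSV snippet from memory. The snippet is typed in by hand from the first two rows of the table above and keeps only a few of the 21 columns; the names csv_text and sample are just illustrative.

```python
import io
import pandas as pd

# Hand-typed CSV text: a header line followed by two data rows.
# The numbers are copied from the first two rows shown above.
csv_text = """Run,Event,E1,E2,M
165617,74601703,9.6987,9.7633,17.4922
165617,75100943,6.2039,9.6690,11.5534
"""

# read_csv accepts a file-like object as well as a URL or a file path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header line
print(sample.dtypes)         # column types inferred by pandas
```

The same three calls (shape, columns, dtypes) work on the full data frame and are a quick way to check that a file was read as expected.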

This is a dataset from http://opendata.cern.ch/record/545 on the CERN Open Data Portal. It is a CSV (comma-separated values) file, a format that can easily be used in many different frameworks. You can find other files of this type with this search.

The dataset contains the values (charge, direction, energy, momentum) of two muons from a CMS primary dataset, http://opendata.cern.ch/record/17. All other particles have been omitted. If you’re interested, the code and instructions for producing this kind of simplified dataset are in http://opendata.cern.ch/record/552

Now, if you are brave enough, you can compute the invariant mass of the two muons yourself. If you are in a hurry, just use the value M, which has already been computed for you :-)

invariant_mass = data['M']

You can type the name of your new variable to see what went in there:

invariant_mass

0        17.4922
1        11.5534
2         9.1636
3        12.4774
4        14.3159
          ...   
99995    11.2077
99996    14.5819
99997    29.8425
99998    20.2068
99999     9.3741
Name: M, Length: 100000, dtype: float64
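Besides printing a column, you can ask pandas for a quick numerical summary of it. The sketch below uses only the first five masses printed above, typed in by hand, so the numbers it prints are not those of the full dataset.

```python
import pandas as pd

# The first five invariant masses shown above (hand-typed sample).
masses = pd.Series([17.4922, 11.5534, 9.1636, 12.4774, 14.3159], name="M")

print(len(masses))                 # number of entries
print(masses.min(), masses.max())  # smallest and largest value
print(masses.describe())           # count, mean, std, min, quartiles, max
```

On the real data, invariant_mass.describe() gives the same kind of summary and helps in choosing a sensible histogram range.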

Now let’s make a histogram:

# Plot the histogram with the function hist() of the matplotlib.pyplot module:
# (http://matplotlib.org/api/pyplot_api.html?highlight=matplotlib.pyplot.hist#matplotlib.pyplot.hist).
# 'bins' sets the number of bins used.
plt.hist(invariant_mass, bins=500)

# Label the axes and add a title.
plt.xlabel('Invariant mass [GeV]')
plt.ylabel('Number of events')
plt.title('The histogram of the invariant masses of two muons \n') # \n creates a new line for making the title look better
# plt.yscale('log') 
# Show the plot.
plt.show()

../../_images/quick-start-to-CMS-open-data_18_0.png

You can check whether a log scale looks better. Just edit the cell above, uncomment plt.yscale('log') before plt.show() and run it again.

Now zoom in to see whether you can find some familiar particles in the dimuon spectrum. We can do that by setting the range of the histogram with the option range=[min,max].

plt.hist(invariant_mass, bins=200, range=[8,12])
plt.show()

../../_images/quick-start-to-CMS-open-data_21_0.png
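If you would rather read off the position of a peak numerically than by eye, numpy can fill the same histogram without drawing it. The sketch below runs on synthetic toy data, not on the CMS file: a flat background plus a narrow bump whose position (9.46 GeV) and width are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Synthetic toy data, NOT CMS data: a flat background between 8 and 12 GeV
# plus a narrow Gaussian bump at an assumed position of 9.46 GeV.
background = rng.uniform(8.0, 12.0, size=5000)
signal = rng.normal(loc=9.46, scale=0.1, size=2000)
toy_masses = np.concatenate([background, signal])

# Same binning as the zoomed histogram: 200 bins between 8 and 12.
# Values outside the range are simply not counted.
counts, edges = np.histogram(toy_masses, bins=200, range=(8, 12))

# Bin centres are the midpoints between neighbouring bin edges.
centers = 0.5 * (edges[:-1] + edges[1:])
peak_position = centers[np.argmax(counts)]
print(f"Fullest bin is centred at {peak_position:.2f} GeV")
```

To try this on the real data, replace toy_masses with invariant_mass. Note that argmax only finds the single fullest bin, so it is a rough guide rather than a fit.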

Have a look at the very low invariant mass range as well (change the range limits in the cell above and run it again). You will be surprised to see how well some low mass particles are visible in our data!

We have compiled a list of particles decaying into two muons (extracted, for easier reading and for educational purposes, from C. Patrignani et al. (Particle Data Group), Chin. Phys. C, 40, 100001 (2016) and 2017 update). In a teaching situation, you can ask students to identify these particles in the plot.

Note that Binder does not save your changes. To save your work, download it from File -> Download as -> Notebook. This saves a local copy to your computer (and has no effect on the original notebook).

Some additional material

If you want to compute the invariant mass yourself from the energy and momentum of the two muons, you can access the values in our dataset with data.E1, data.px1, etc. Remember that every more complex mathematical function must be prefixed with the package it comes from, here numpy, which we imported under the name np (e.g. np.sqrt(your_value)).

As an extra challenge, note that one value of the newly computed invariant mass squared turns out to be below 0 (sorry, this is real data), so we have to exclude it from our set of values before taking the square root. That is not so nice, but it is a good occasion to show how efficiently you can apply a selection criterion to a dataset with “pandas”:
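For reference, the quantity to compute can be written as follows. This assumes natural units (c = 1), with energies and momenta both in GeV; the file itself does not state its units, so that is an assumption consistent with the GeV axis label used in the histograms above. The subscripts follow the column names E1, px1, ..., pz2.

```latex
M^2 = (E_1 + E_2)^2
      - \left[ (p_{x1} + p_{x2})^2 + (p_{y1} + p_{y2})^2 + (p_{z1} + p_{z2})^2 \right],
\qquad
M = \sqrt{M^2}
```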

mass_squared = (data.E1 + data.E2)**2 - ((data.px1 + data.px2)**2 + (data.py1 + data.py2)**2 + (data.pz1 + data.pz2)**2)
mass_squared_pos = mass_squared[mass_squared > 0]
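The selection in square brackets is a boolean mask. Here is a minimal sketch of how it works, using three made-up numbers rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; one of them is negative.
toy_mass_squared = pd.Series([4.0, -0.01, 9.0])

# The comparison returns one True/False per entry.
mask = toy_mass_squared > 0

# Indexing with the mask keeps only the entries where it is True.
selected = toy_mass_squared[mask]

print(mask.tolist())
print(np.sqrt(selected).tolist())
```

The selected entries keep their original index labels (here 0 and 2), so you can still trace them back to rows of the original dataset.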

Now plot the invariant mass again to compare it with the histograms above. Did we get it right?

plt.hist(np.sqrt(mass_squared_pos), bins=200, range=[8,12])
plt.show()

../../_images/quick-start-to-CMS-open-data_26_0.png
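One way to answer that question numerically rather than by eye is to compare the computed mass with the column M. The sketch below does this for the first two rows of the table shown earlier, typed in by hand so that it runs without downloading the file; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First two rows of the table shown earlier (hand-typed, subset of columns).
rows = pd.DataFrame({
    "E1": [9.6987, 6.2039], "px1": [-9.5104, -4.2666],
    "py1": [0.3662, 0.4565], "pz1": [1.8633, -4.4793],
    "E2": [9.7633, 9.6690], "px2": [7.3277, 7.2740],
    "py2": [-1.1524, -2.8211], "pz2": [6.3473, -5.7104],
    "M": [17.4922, 11.5534],
})

# Same formula as for mass_squared above, split into two steps.
energy_sum = rows.E1 + rows.E2
momentum_sum_squared = (
    (rows.px1 + rows.px2) ** 2
    + (rows.py1 + rows.py2) ** 2
    + (rows.pz1 + rows.pz2) ** 2
)
computed_mass = np.sqrt(energy_sum ** 2 - momentum_sum_squared)

# Largest absolute difference between our value and the stored M.
difference = (computed_mass - rows.M).abs()
print(difference.max())
```

A small difference is expected, since the values in the file are rounded to four decimal places.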

For more examples, have a look at https://github.com/cms-opendata-education. You are free to copy and modify these notebooks, and to suggest new ones to be added to our collection. This material will also be available through the CERN Open Data Portal.
