Working with arbitrary data

In this tutorial, we demonstrate working with arbitrary, rather than pre-defined, data.

Setup

[1]:
import copy
import numpy as np
import matplotlib.pyplot as plt

import histlite as hl
import csky as cy

%matplotlib inline
[2]:
cy.plotting.mrichman_mpl()
/mnt/lfs7/user/ssclafani/software/external/csky/csky/plotting.py:92: MatplotlibDeprecationWarning: Support for setting the 'text.latex.preamble' or 'pgf.preamble' rcParam to a list of strings is deprecated since 3.3 and will be removed two minor releases later; set it to a single string instead.
  r'\SetSymbolFont{operators}   {sans}{OT1}{cmss} {m}{n}'

Get some real data

[3]:
timer = cy.timing.Timer()
time = timer.time
[6]:
with time('ana setup: original'):
    ana = cy.get_analysis(cy.selections.Repository(), 'version-003-p02', cy.selections.PSDataSpecs.ps_2011)
a = ana[0]
data, sig, livetime = a.data, a.sig, a.livetime
Setting up Analysis for:
IC86_2011
Setting up IC86_2011...
Reading /data/ana/analyses/ps_tracks/version-003-p02/IC86_2011_MC.npy ...
Reading /data/ana/analyses/ps_tracks/version-003-p02/IC86_2011_exp.npy ...
Reading /data/ana/analyses/ps_tracks/version-003-p02/GRL/IC86_2011_exp.npy ...
Energy PDF Ratio Model...
  * gamma = 4.0000 ...
Signal Acceptance Model...
  * gamma = 4.0000 ...
Done.

0:00:14.625254 elapsed.

Generate modified data

You can choose your workflow here: modify the data and then define a DataSpec, or include the dataset modifications inside the DataSpec. I’ll go with the latter, and define a version of IC86 that is magically better in that its angular reconstruction is twice as good. By inheriting from the original, we reuse anything we don’t specifically redefine: in particular, binning settings for PDFs.

[7]:
class IC86_2011_MagicallyBetter(cy.selections.PSDataSpecs.IC86_2011):
    """Better version of IC86-2011"""
    def dataset_modifications(self, ds):
        """Improve angres by a factor of two."""
        sig = copy.deepcopy(ds.sig)
        data = copy.deepcopy(ds.data)
        sig['xra'] = .5 * sig.xra
        sig['xdec'] = .5 * sig.xdec
        # this particular geometric code has a silly (dec,ra) convention, please forgive
        sig['dec'], sig['ra'] = cy.coord.rotate_xaxis_to_source(
            sig.true_dec, sig.true_ra, sig.xdec, sig.xra, latlon=True)
        sigma_floor = np.radians(0.2)
        sig['sigma'] = np.maximum(sigma_floor, .5 * sig.sigma)
        data['sigma'] = np.maximum(sigma_floor, .5 * data.sigma)
        ds.sig, ds.data = sig, data

    _path_data = data
    _path_sig = sig
    _livetime = a.livetime

with time('ana setup: modified'):
    ana2 = cy.get_analysis(cy.selections.Repository(), 'version-003-p02', IC86_2011_MagicallyBetter)
Setting up Analysis for:
IC86_2011_MagicallyBetter
Setting up IC86_2011_MagicallyBetter...
Reading /data/ana/analyses/ps_tracks/version-003-p02/GRL/IC86_2011_exp.npy ...
Energy PDF Ratio Model...
  * gamma = 4.0000 ...
Signal Acceptance Model...
  * gamma = 4.0000 ...
Done.

0:00:12.286926 elapsed.

Test performance

[8]:
src = cy.sources(0, 0)
tr = cy.get_trial_runner(src=src, ana=ana, mp_cpus=10)
tr2 = cy.get_trial_runner(src=src, ana=ana2, mp_cpus=10)
[9]:
with time('bg 1'):
    b = cy.dists.Chi2TSD(tr.get_many_fits(1000, seed=1))
Performing 1000 background trials using 10 cores:
       1000/1000 trials complete.

0:00:02.381897 elapsed.
[10]:
sens_kw = dict(beta=0.9, n_sig_step=1, batch_size=500, tol=.03, seed=2)
[11]:
with time('sens 1'):
    sens = tr.find_n_sig(b.median(), **sens_kw)
Start time: 2021-07-30 15:13:30.866028
Using 10 cores.
* Starting initial scan for 90% of 50 trials with TS >= 0.052...
  n_sig = 1.000 ... frac = 0.74000
  n_sig = 2.000 ... frac = 0.88000
  n_sig = 3.000 ... frac = 0.90000
* Generating batches of 500 trials...
n_trials | n_inj    0.00    1.20    2.40    3.60    4.80    6.00  |  n_sig(relative error)
500      |         45.4%   73.6%   85.4%   95.0%   95.0%   98.2%  |      2.895 (+/-  4.3%) [chi2.cdf]
1000     |         47.4%   72.7%   85.8%   94.4%   96.1%   98.2%  |      2.928 (+/-  3.3%) [chi2.cdf]
1500     |         48.6%   72.6%   84.9%   93.9%   96.0%   98.2%  |      3.009 (+/-  2.4%) [chi2.cdf]
End time: 2021-07-30 15:13:54.452319
Elapsed time: 0:00:23.586291

0:00:23.587761 elapsed.
[12]:
with time('bg 2'):
    b2 = cy.dists.Chi2TSD(tr2.get_many_fits(1000, seed=1))
Performing 1000 background trials using 10 cores:
       1000/1000 trials complete.

0:00:02.341552 elapsed.
[13]:
with time('sens 2'):
    sens2 = tr2.find_n_sig(b2.median(), **sens_kw)
Start time: 2021-07-30 15:14:04.353769
Using 10 cores.
* Starting initial scan for 90% of 50 trials with TS >= 0.022...
  n_sig = 1.000 ... frac = 0.80000
  n_sig = 2.000 ... frac = 0.90000
* Generating batches of 500 trials...
n_trials | n_inj    0.00    0.80    1.60    2.40    3.20    4.00  |  n_sig(relative error)
500      |         50.2%   67.6%   85.0%   87.8%   94.8%   95.8%  |      2.454 (+/-  4.9%) [chi2.cdf]
1000     |         51.1%   69.7%   84.0%   89.6%   94.7%   97.0%  |      2.358 (+/-  3.2%) [chi2.cdf]
1500     |         49.3%   70.6%   84.1%   89.3%   94.9%   96.7%  |      2.345 (+/-  2.6%) [chi2.cdf]
End time: 2021-07-30 15:14:23.955296
Elapsed time: 0:00:19.601527

0:00:19.605097 elapsed.

Do these compare about right?

[14]:
sens['n_sig'] / sens2['n_sig'], np.sqrt(2)
[14]:
(1.2831309784417606, 1.4142135623730951)

I guess I buy it.

[13]:
print(timer)
ana setup: original | 0:00:15.504567
ana setup: modified | 0:00:14.002988
bg 1                | 0:00:02.314820
sens 1              | 0:00:21.714302
bg 2                | 0:00:02.306961
sens 2              | 0:00:50.869679
------------------------------------
total               | 0:01:46.713317