Correlated analyses with MultiTrialRunner

In this tutorial, we consider the use case of multiple, potentially partially-correlated analyses. Here, we want to perform multiple likelihood tests on the same “trials” — the same realizations of the datasets.

[2]:
import numpy as np
import matplotlib.pyplot as plt

import histlite as hl
import csky as cy

%matplotlib inline
[3]:
cy.plotting.mrichman_mpl()
[4]:
ana_dir = '/data/user/mrichman/csky_cache/ana'

We’ll test with one year of data, PS IC86-2011.

[5]:
ana = cy.get_analysis(cy.selections.repo, 'version-003-p02', cy.selections.PSDataSpecs.ps_2011, dir=ana_dir)
Setting up Analysis for:
IC86_2011
Setting up IC86_2011...
Reading /data/ana/analyses/ps_tracks/version-003-p02/IC86_2011_MC.npy ...
Reading /data/ana/analyses/ps_tracks/version-003-p02/IC86_2011_exp.npy ...
Reading /data/ana/analyses/ps_tracks/version-003-p02/GRL/IC86_2011_exp.npy ...
<- /data/user/mrichman/csky_cache/ana/IC86_2011.subanalysis.version-003-p02.npy
Done.

Everything that follows will use this same Analysis, so we put it in cy.CONF.

[5]:
cy.CONF['ana'] = ana
cy.CONF['mp_cpus'] = 15

Individual Trial Runners

We begin by setting up an arbitrary catalog of two sources:

[6]:
ra = np.r_[0, 180]
dec = np.r_[-45, 45]
src = cy.sources(ra, dec, deg=True)
src.as_dataframe
[6]:
dec dec_deg sindec ra ra_deg weight extension extension_deg
0 -0.785398 -45.0 -0.707107 0.000000 0.0 0.5 0.0 0.0
1 0.785398 45.0 0.707107 3.141593 180.0 0.5 0.0 0.0

We get a trial runner for each:

[7]:
trs = [cy.get_trial_runner(src=s) for s in src]
trs
[7]:
[<csky.trial.TrialRunner at 0x7f91d37d5080>,
 <csky.trial.TrialRunner at 0x7f91baf133c8>]

Now for the tricky part: we need to make sure our background trials have events for every source, but we don’t want to scramble the whole sky. So we set up our “injection” trial runner to know about the whole catalog.

We also need to decide how the signal injection should work; here we’ll inject signal only for the 0th source.

[8]:
tr_inj = cy.get_trial_runner(src=src, inj_conf=dict(src=src[0]))

To confirm that signal events wind up where they should, let’s look at an example trial:

[9]:
trial = tr_inj.get_one_trial(n_sig=5, poisson=False, seed=2)
trial
[9]:
Trial(evss=[[Events(13188 items | columns: dec, idx, inj, log10energy, ra, sigma, sindec), Events(5 items | columns: dec, idx, inj, log10energy, ra, sigma, sindec)]], n_exs=[123056])
[10]:
trial.evss[0][1].as_dataframe
[10]:
dec log10energy ra sigma sindec idx inj
0 -0.789157 5.577282 0.000794 0.003491 -0.704913 773766 <csky.inj.PointSourceInjector object at 0x7f91...
1 -0.831554 5.118237 0.008226 0.011637 -0.736841 311727 <csky.inj.PointSourceInjector object at 0x7f91...
2 -0.787004 5.241669 0.003833 0.003491 -0.720210 698751 <csky.inj.PointSourceInjector object at 0x7f91...
3 -0.789060 5.397988 0.001629 0.005621 -0.702275 499808 <csky.inj.PointSourceInjector object at 0x7f91...
4 -0.773355 5.275870 6.279155 0.018069 -0.697965 746488 <csky.inj.PointSourceInjector object at 0x7f91...

As you can see, the signal events have ra and dec clustered around the 0th source (ra=0°, dec=−45°), as they should.

Background trials

In order to get per-test p-values, we need to run background trials for each source. At this step, correlations between tests do not matter, so we can still use the ordinary trial runners.

[11]:
%time bgs = [cy.dists.Chi2TSD(tr.get_many_fits(1e3)) for tr in trs]
bgs
Performing 1000 background trials using 15 cores:
       1000/1000 trials complete.
Performing 1000 background trials using 15 cores:
       1000/1000 trials complete.
CPU times: user 422 ms, sys: 225 ms, total: 646 ms
Wall time: 4.65 s
[11]:
[Chi2TSD(1000 trials, eta=0.503, ndof=1.105, median=0.000 (from fit 0.000)),
 Chi2TSD(1000 trials, eta=0.540, ndof=1.060, median=0.013 (from fit 0.013))]
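To see roughly what these fitted distributions encode, here is a numpy/scipy sketch of the kind of mixture model behind the eta and ndof parameters printed above: a fraction (1 − eta) of background trials sits at TS = 0, and the remaining fraction eta follows a chi-square distribution. This function is illustrative only and not part of csky’s API; conventions (e.g. the value at exactly TS = 0) may differ from Chi2TSD.

```python
import numpy as np
from scipy import stats

def chi2_mixture_sf(ts, eta, ndof):
    """Survival function of a TS background model in which a fraction
    (1 - eta) of trials is at TS = 0 and the remaining fraction eta
    follows a chi2 distribution with `ndof` degrees of freedom.
    (Illustrative sketch; not csky's Chi2TSD implementation.)"""
    ts = np.asarray(ts, dtype=float)
    return np.where(ts <= 0, 1.0, eta * stats.chi2.sf(ts, ndof))

# with the values fitted for source 0 above (eta = 0.503, ndof = 1.105):
p = chi2_mixture_sf(5.0, eta=0.503, ndof=1.105)
```

A TS of 5 then maps to a p-value well below eta, since only the chi-square tail contributes.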

MultiTrialRunner

We are ready to perform “simultaneous” trials for all hypothesis tests. For this we use a MultiTrialRunner.

[12]:
multr = cy.trial.MultiTrialRunner(
    # the Analysis
    ana,
    # bg+sig injection trial runner (produces trials)
    tr_inj,
    # llh test trial runners (perform fits given trials)
    trs,
    # background distributions
    bgs=bgs,
    # use multiprocessing
    mp_cpus=cy.CONF['mp_cpus'],
)

One more time without interspersed comments:

[13]:
multr = cy.trial.MultiTrialRunner(ana, tr_inj, trs, bgs=bgs, mp_cpus=cy.CONF['mp_cpus'])

Single Trial

A single trial looks like so:

[14]:
trial_kw = dict(n_sig=10, poisson=False, seed=2)
multr.get_one_fit(**trial_kw)
[14]:
array([ 8.39466146, 32.19708837, 10.61458083,  2.06225432,  0.26760624,
        0.        ,  0.        ,  3.        ])

This output is easier to understand with get_all_fits, which breaks up the result per source:

[15]:
multr.get_all_fits(**trial_kw)
[15]:
[array([ 8.39466146, 32.19708837, 10.61458083,  2.06225432]),
 array([0.26760624, 0.        , 0.        , 3.        ])]

Now we can see what’s happening: for each source, we get mlog10p,ts,ns,gamma, or \((-\log_{10}(p),\text{TS},n_s,\gamma)\). Since we’ve injected lots of signal for the 0th source, the significance is over 5σ:

[16]:
from scipy import stats
stats.norm.isf(10**-7.633)
[16]:
5.463963905200207
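This conversion from \(-\log_{10}(p)\) to Gaussian significance can be wrapped in a small helper. The function below is a convenience sketch built on scipy, not part of csky’s API:

```python
import numpy as np
from scipy import stats

def mlog10p_to_sigma(mlog10p):
    """Convert -log10(p) into a one-sided Gaussian significance.
    (Convenience sketch; not part of csky's API.)"""
    return stats.norm.isf(10.0 ** -np.asarray(mlog10p, dtype=float))

sigma = mlog10p_to_sigma(7.633)  # ≈ 5.46
```

Because the input is passed through np.asarray, the helper works on scalars and on whole columns of trial results alike.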

Note that technically, the bgs argument is optional. If we don’t provide it, then as a fallback we get “fake” mlog10p values that simply duplicate the TS:

[17]:
multr_nobg = cy.trial.MultiTrialRunner(ana, tr_inj, trs)
multr_nobg.get_all_fits(**trial_kw)
[17]:
[array([32.19708837, 32.19708837, 10.61458083,  2.06225432]),
 array([0., 0., 0., 3.])]

MultiTrialRunner trials

We’re ready to perform a batch of trials using MultiTrialRunner.

[18]:
trials = multr.get_many_fits(1000, n_sig=10, seed=1)
Performing 1000 trials with n_sig = 10.000 using 15 cores:
       1000/1000 trials complete.
[19]:
trials
[19]:
Arrays(1000 items | columns: gamma_0000, gamma_0001, mlog10p_0000, mlog10p_0001, ns_0000, ns_0001, ts_0000, ts_0001)

The results have been tabulated by fit-parameter name and source index (e.g. ns_0000 is \(n_s\) for source 0). Let’s look at the results for the first few trials:

[20]:
trials.as_dataframe.head()
[20]:
gamma_0000 gamma_0001 mlog10p_0000 mlog10p_0001 ns_0000 ns_0001 ts_0000 ts_0001
0 1.972629 3.250000 10.400346 0.267606 14.934783 0.000000 40.814930 0.000000
1 2.028484 4.000000 7.719988 0.267606 13.135672 0.000000 29.309175 0.000000
2 2.075980 2.842599 4.350988 0.335503 8.785978 0.919642 15.035911 0.046768
3 1.923978 4.000000 9.293134 0.267606 13.924785 0.000000 36.052241 0.000000
4 2.025977 2.000000 5.080775 0.267606 7.943536 0.000000 18.098095 0.000000

Once again, since we’ve injected plenty of signal (10 events per trial, but now with Poisson smearing), we see that the results are much more significant for source 0000 than for source 0001 (which only has background nearby).

We can plot the resulting distributions like so:

[21]:
fig, axs = plt.subplots(2, 2, figsize=(8,6))
axs = np.ravel(axs)
names = 'mlog10p ts ns gamma'.split()
for (i, name) in enumerate(names):
    ax = axs[i]
    mask0 = trials.ns_0000 > 0 if name in 'ns gamma' else trials.ns_0000 >= 0
    mask1 = trials.ns_0001 > 0 if name in 'ns gamma' else trials.ns_0001 >= 0
    hl.plot1d(ax, hl.hist(trials[name+'_0000'][mask0], bins=25), crosses=True, color='C1', label='source 0')
    hl.plot1d(ax, hl.hist(trials[name+'_0001'][mask1], bins=25), crosses=True, color='C0', label='source 1')
    ax.set_xlabel(name)
    ax.set_ylabel(r'trials per bin')
    ax.semilogy()
ax.legend()
plt.tight_layout()
_images/07_multi_trial_runner_44_0.png

Stacking

The simplest way to implement a stacking analysis is to add the per-source TS values. Let’s do some background trials for that setup:

[22]:
bg_trials = multr.get_many_fits(10e3, seed=1)
bg_trials['sum_ts'] = np.sum([bg_trials[name] for name in bg_trials.keys() if 'ts' in name], axis=0)
Performing 10000 background trials using 15 cores:
      10000/10000 trials complete.
[23]:
sum_bg = cy.dists.Chi2TSD(bg_trials.sum_ts)
[24]:
fig, ax = plt.subplots()
h = sum_bg.get_hist(bins=25)
hl.plot1d(ax, h, crosses=True)
norm = h.integrate().values
x = np.linspace(.5, 20, 300)
ax.plot(x, norm * sum_bg.pdf(x), 'k:')
ax.set_xlabel(r'TS')
ax.set_ylabel(r'trials per bin')
ax.semilogy()
plt.tight_layout()
_images/07_multi_trial_runner_49_0.png
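An observed stacked TS can then be converted to a p-value either through the fitted distribution (sum_bg.sf) or empirically from the trials themselves. Below is a minimal numpy sketch of the empirical version; the synthetic bg_sum_ts array stands in for the bg_trials.sum_ts column above and is not real analysis output:

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic background sum-TS trials (a stand-in for bg_trials.sum_ts):
# roughly half the trials at TS = 0, the rest chi2-like
bg_sum_ts = rng.chisquare(2, size=10000) * (rng.random(10000) < 0.5)

def trials_pvalue(bg_ts, obs_ts):
    """Empirical p-value: fraction of background trials with a TS
    at least as large as the observed one."""
    return np.mean(np.asarray(bg_ts) >= obs_ts)

p = trials_pvalue(bg_sum_ts, 10.0)
```

The empirical estimate is reliable only down to roughly 1/N for N trials; beyond that, a parametric tail fit like Chi2TSD is needed.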

Hottest source analysis

We might be interested in the hottest source in each trial. Then the “post-trials” test statistic is just the maximum mlog10p from each trial. This can be obtained like so:

[25]:
mlog10p = np.array([bg_trials[name] for name in bg_trials.keys() if name.startswith('mlog10p_')])
ts = np.array([bg_trials[name] for name in bg_trials.keys() if name.startswith('ts_')])
ns = np.array([bg_trials[name] for name in bg_trials.keys() if name.startswith('ns_')])
gamma = np.array([bg_trials[name] for name in bg_trials.keys() if name.startswith('gamma_')])
[26]:
imax = np.argmax(mlog10p, axis=0)
r = np.arange(len(imax))
bg_trials_hottest = cy.utils.Arrays(dict(
    mlog10p=mlog10p[imax,r], ts=ts[imax,r], ns=ns[imax,r], gamma=gamma[imax,r]))
[27]:
fig, axs = plt.subplots(2, 2, figsize=(8,6))
axs = np.ravel(axs)
names = 'mlog10p ts ns gamma'.split()
for (i, name) in enumerate(names):
    ax = axs[i]
    mask = bg_trials_hottest.ns > 0 if name in 'ns gamma' else bg_trials_hottest.ns >= 0
    hl.plot1d(ax, hl.hist(bg_trials_hottest[name][mask], bins=25), crosses=True)
    ax.set_xlabel(name + ' (hottest source)')
    ax.set_ylabel(r'trials per bin')
    ax.semilogy()
plt.tight_layout()
_images/07_multi_trial_runner_54_0.png
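For context on why these background trials matter: if the tests were independent, the post-trials p-value of the hottest source could be computed analytically with a Šidák-style correction, sketched below (illustrative, not csky’s API). With partially correlated tests, one instead compares the observed maximum mlog10p against the bg_trials_hottest distribution built above.

```python
def post_trial_p_independent(p_min, n_tests):
    """Post-trials p-value of the smallest pre-trials p-value among
    `n_tests` *independent* tests (Sidak correction). Correlated tests
    require background trials, e.g. from a MultiTrialRunner."""
    return 1.0 - (1.0 - p_min) ** n_tests

# for 2 independent sources, a pre-trials p of 0.01 becomes:
p_post = post_trial_p_independent(0.01, 2)  # 0.0199
```

Since correlations reduce the effective number of independent tests, the trials-based p-value is generally no larger than this independent-tests bound.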

p-values from single batch

Above we described the most intuitive way to set all this up. However, if the per-hypothesis background distributions and the final post-trials distributions require comparable statistics, the initial per-source trial batches are redundant.

We can do all the work with one batch; let’s see how. First we run some trials:

[28]:
%time t = multr_nobg.get_many_fits(10000, seed=1, mp_cpus=15)
Performing 10000 background trials using 15 cores:
      10000/10000 trials complete.
CPU times: user 52.9 ms, sys: 149 ms, total: 202 ms
Wall time: 12.2 s
[29]:
t.as_dataframe.head()
[29]:
gamma_0000 gamma_0001 mlog10p_0000 mlog10p_0001 ns_0000 ns_0001 ts_0000 ts_0001
0 3.250000 3.250000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
1 4.000000 4.000000 4.938129 0.000000 5.184195 0.000000 4.938129 0.000000
2 3.236252 2.842487 0.030601 0.046903 0.332028 0.921009 0.030601 0.046903
3 2.043473 4.000000 1.255303 0.000000 2.614685 0.000000 1.255303 0.000000
4 2.500000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

We fit each TS distribution:

[30]:
new_bgs = [cy.dists.Chi2TSD(t['ts_{:04d}'.format(i)]) for i in range(len(src))]
new_bgs
[30]:
[Chi2TSD(10000 trials, eta=0.524, ndof=1.119, median=0.005 (from fit 0.007)),
 Chi2TSD(10000 trials, eta=0.500, ndof=1.016, median=0.000 (from fit 0.000))]

And finally, we overwrite the placeholder mlog10p values using the newly fitted distributions:

[31]:
for i in range(len(src)):
    suffix = '_{:04d}'.format(i)
    t['mlog10p'+suffix] = -np.log10(new_bgs[i].sf(t['ts'+suffix]))
[32]:
t.as_dataframe.head()
[32]:
gamma_0000 gamma_0001 mlog10p_0000 mlog10p_0001 ns_0000 ns_0001 ts_0000 ts_0001
0 3.250000 3.250000 0.280752 0.300596 0.000000 0.000000 0.000000 0.000000
1 4.000000 4.000000 1.741344 0.300596 5.184195 0.000000 4.938129 0.000000
2 3.236252 2.842487 0.329330 0.371563 0.332028 0.921009 0.030601 0.046903
3 2.043473 4.000000 0.795645 0.300596 2.614685 0.000000 1.255303 0.000000
4 2.500000 2.000000 0.280752 0.300596 0.000000 0.000000 0.000000 0.000000
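As a quick sanity check on the patched values: trials with TS = 0 get mlog10p = −log10(sf(0)), and for source 0001 the fitted eta ≈ 0.500 puts sf(0) just above 0.5, so we expect mlog10p ≈ −log10(0.5) ≈ 0.301, consistent with the 0.3006 entries in the table above:

```python
import numpy as np

# trials with TS = 0 map to mlog10p = -log10(sf(0)); with sf(0) ~ 0.5
# (about half of background trials have TS > 0), that is roughly:
mlog10p_at_zero = -np.log10(0.5)  # ≈ 0.301
```

The small difference from the tabulated 0.3006 simply reflects the fitted survival function at 0 being slightly above 0.5.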

Remarks

Here we’ve demonstrated how to perform an analysis that includes multiple hypothesis tests. In general, one can either:

  • Characterize the background for each hypothesis test independently, then run additional trials using a MultiTrialRunner so that correlations between hypothesis tests are properly accounted for.

Or:

  • Run trials with a MultiTrialRunner and, if needed, compute per-test p-values from distributions fitted to those trials.

Either way, MultiTrialRunner provides a mechanism for efficiently performing trials that account for possible partial correlations between multiple hypothesis tests.