KszPipe¶
- class kszx.KszPipe(input_dir, output_dir)¶
This is the main kSZ analysis pipeline, which computes data/surrogate power spectra from catalogs.
There are two ways to run a KszPipe pipeline. The first way is to create a KszPipe instance and call the
run()method. This can be done either in a script or a jupyter notebook (if you’re using jupyter, you should keep in mind that the KszPipe may take hours to run, and you’ll need to babysit the connection to the jupyterhib). The second way is to run from the command line with:python -m kszx kszpipe_run [-p NUM_PROCESSES] <input_dir> <output_dir>
The
input_dircontains a parameter fileparams.ymland galaxy/random catalogs. Theoutput_dirwill be populated with power spectra. For details of what KszPipe computes, and documentation of file formats, see the sphinx docs:After running the pipeline, you may want to load pipeline outputs using the helper class
KszPipeOutdir, or do parameter estimation usingPgvLikelihood.High-level features:
Runs “surrogate” sims (see overleaf) to characterize the survey window function, determine dependence of power spectra on \((f_{NL}, b_v)\), and assign error bars to power spectra.
Velocity reconstruction noise is included in surrogate sims via a bootstrap procedure, using the observed CMB realization. This automatically incorporates noise inhomogeneity and “striping”, and captures correlations e.g. between 90 and 150 GHz.
The galaxy catalog can be spectroscopic or photometric (via the
ztrue_colandzobs_colconstructor args). Surrogate sims will capture the effect of photo-z errors.The windowed power spectra \(P_{gg}\), \(P_{gv}\), \(P_{vv}\) use a normalization which should be approximately correct. The normalization is an ansatz which is imperfect, especially on large scales, so surrogate sims should still be used to compare power spetra to models. Eventually, we’ll implement a precise calculation of the window function.
Currently assumes one galaxy field, and two velocity reconstructions labelled “90” and “150” (with ACT in mind).
Currently, there is not much implemented for CMB foregrounds. Later, I’d like to include foreground clustering terms in the surrogate model (i.e. terms of the form \(b_\delta \delta(x)\), in addition to the kSZ term \(b_v v_r(x)\)), and estimate the \(b_\delta\) biases by estimating the spin-zero \(P_{gv}\) power spectrum.
- property window_function¶
3-by-3 matrix \(W_{ij}\) containing the window function for power spectra on three spatial footprints:
footprint 0: random catalog weighted by
weight_galcolumnfootprint 1: random catalog weighted by product of columns
weight_vr * bv_90footprint 2: random catalog weighted by product of columns
weight_vr * bv_150
These spatial weightings are appropriate for the \(\delta_g\), \(v_r^{90}\), and \(v_r^{150}\) fields.
Window functions are computed with
wfunc_utils.compute_wcrude()and are crude approximations (for more info seecompute_wcrude()docstring), but this is okay since surrogate fields are treated consistently.
- property surrogate_factory¶
Returns an instance of class SurrogateFactory, a helper class for simulating the density and radial velocity fields at locations of randoms.
- get_pk_data(run=False, force=False)¶
Returns a shape (3,3,nkbins) array, and saves it in
pipeline_outdir/pk_data.npy.The returned array contains auto and cross power spectra of the following fields:
0: galaxy overdensity
1: kSZ velocity reconstruction \(v_r^{90}\)
2: kSZ velocity reconstruction \(v_r^{150}\)
Flags:
If
run=False, then this function expects the \(P(k)\) file to be on disk from a previous pipeline run.If
run=True, then the \(P(k)\) file will be computed if it is not on disk.If
force=True, then this function recomputes \(P(k)\), even if it is on disk from a previous pipeline run.
- get_pk_surrogate(isurr, run=False, force=False)¶
Returns a shape (6,6,nkbins) array, and saves it in
pipeline_outdir/tmp/pk_surr_{isurr}.npy.The returned array contains auto and cross power spectra of the following fields, for a single surrogate:
0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).
1: derivative \(dS_g/df_{NL}\).
2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).
3: derivative \(dS_v^{90}/db_v\).
4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).
5: derivative \(dS_v^{150}/db_v\).
Flags:
If
run=False, then this function expects the \(P(k)\) file to be on disk from a previous pipeline run.If
run=True, then the \(P(k)\) file will be computed if it is not on disk.If
force=True, then this function recomputes \(P(k)\), even if it is on disk from a previous pipeline run.
- get_pk_surrogates()¶
Returns a shape (nsurr,6,6,nkins) array, and saves it in
pipeline_outdir/pk_surrogates.npy.The returned array contains auto and cross power spectra of the following fields, for all surrogates:
0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).
1: derivative \(dS_g/df_{NL}\).
2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).
3: derivative \(dS_v^{90}/db_v\).
4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).
5: derivative \(dS_v^{150}/db_v\).
This function only reads files from disk – it does not run the pipeline. To run the pipeline, use
run().
- run(processes)¶
Runs pipeline and saves results to disk, skipping results already on disk from previous runs.
Implementation: creates a multiprocessing Pool, and calls
get_pk_data()andget_pk_surrogates()in worker processes.Can be run from the command line with:
python -m kszx kszpipe_run [-p NUM_PROCESSES] <input_dir> <output_dir>
The
processesargument is the number of worker processes. Currently I don’t have a good way of setting this automatically – the caller must adjust the number of processes, based on the size of the datasets, and amount of memory available.
- class kszx.KszPipeOutdir(dirname, nsurr=None)¶
A helper class for loading and processing output files from
class KszPipe.Note: for MCMCs and parameter fits, there is a separate class
PgvLikelihood. The KszPipeOutdir class is more minimal (the main use case is plot scripts!)The constructor reads the files
{dirname}/params.yml,{dirname}/pk_data.npy,{dirname}/pk_surrogates.npywhich are generated byrun(). For more info on these files, and documentation of file formats, see the sphinx docs:Constructor arguments:
dirname(string): name of pipeline output directory.nsurr(integer or None): this is a hack for running on an incomplete pipeline. If specified, then{dirname}/pk_surr.npyis not read. Instead we read files of the form{dirname}/tmp/pk_surr_{i}.npy.
- pgg_data()¶
Returns shape
(nkbins,)array containing \(P_{gg}^{data}(k)\).
- pgg_mean(fnl=0)¶
Returns shape
(nkbins,)array, containing \(\langle P_{gg}^{surr}(k) \rangle\).
- pgg_rms(fnl=0)¶
Returns shape
(nkbins,)array, containing sqrt(Var(\(P_{gg}^{surr}(k)\))).
- pgv_data(field)¶
Returns shape
(nkbins,)array containing \(P_{gv}^{data}(k)\).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
- pgv_mean(field, fnl, bv)¶
Returns shape
(nkbins,)array containing \(\langle P_{gv}^{surr}(k) \rangle\).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
- pgv_rms(field, fnl, bv)¶
Returns shape
(nkbins,)array containing sqrt(Var(\(P_{gv}^{surr}(k)\))).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
- pvv_data(field)¶
Returns shape
(nkbins,)array containing \(P_{vv}^{data}(k)\).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
- pvv_mean(field, bv)¶
Returns shape
(nkbins,)array containing \(\langle P_{vv}^{data}(k) \rangle\).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
- pvv_rms(field, bv)¶
Returns shape
(nkbins,)array containing sqrt(var(\(P_{vv}^{data}(k)\))).The
fieldargument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:field=[1,0]for 90 GHz reconstructionfield=[0,1]for 150 GHz reconstructionfield=[1,-1]for null (90-150) GHz reconstruction.
KszPipe details¶
High-level organization¶
Our KSZ pipelines can be organized into three stages:
“Pre-KszPipe”: a collection of jupyter notebooks which filter the CMB maps, apply cuts/weighting to the galaxy catalog, and write a few output files (
params.yml,galaxies.h5,randoms.h5). These output files are specified below under KszPipe input files.For this stage, we’ve been using ad hoc jupyter notebooks, rather than “polished” pipeline code in git. We often want to run experiments where we change an aspect of the processing, so the “hackability” of jupyter notebooks is really convenient.
“KszPipe”: takes the galaxy/random catalogs (see KszPipe input files), and creates files containing power spectra of the data, and power spectra of simulated “surrogate” fields. These output files are specified below under KszPipe output files.
In this stage, the heavy lifting can be done with
kszx.KszPipe, which callskszx.SurrogateFactory.“Post-KszPipe”: plot power spectra, fit parameters \((f_{NL}, b_v)\), and run MCMCs. Some useful helper classes:
kszx.KszPipeOutdir,kszx.PgvLikelihood.
Data and surrogate fields¶
The KszPipe computes power spectra involving a galaxy density field \(\rho_g\), one or more kSZ velocity reconstructions \(\hat v_r\), and surrogate fields \(S_g, S_v\). Definitions of these fields are given in the overleaf, and can be summarized as follows:
To mitigate foregrounds, the KszPipe also includes mean-subtraction steps in \(v_r\) and \(S_v\) which are described in the overleaf.
The bias \(b_v\) which appears in these equations is a per-object fiducial bias, which must be estimated in an earlier pipeline stage, before the KszPipe is run. The fiducial bias \(b_v\) will depend on a choice of fiducial \(P_{ge}(k)\), as well as the filter applied to \(T_{CMB}\). (Since different CMB frequency channels can use different CMB filters, \(b_v\) can depend on CMB frequency channel.) One reasonable way of initializing \(b_v\) is to use the following approximate expression from the overleaf:
KszPipe input files¶
The KszPipe constructor has an argument input_dir.
This is the name of a directory containing the following files:
params.yml: this file is easiest to describe by example:# Version of the KszPipe params.yml format. # (In case we change the file format in the future.) version: 1 # Number of surrogate sims nsurr: 400 # Galaxy bias used in surrogate sims surr_bg: 2.1 # Number of redshift bins used for field-level mean-subtraction. # Passed as 'nbins' argument to kszx.utils.subtract_binned_means(). # If zero, then kszx.utils.subtract_binned_means() is not called. nzbins_gal: 25 # mean-subtraction for galaxy field nzbins_vr: 25 # mean-subtraction for galaxy field # k-binning for power spectrum estimation # Note: bins are generated from (nkbins, kmax) as follows. # # kbin_edges = np.linspace(0, kmax, nkbins+1) # kbin_centers = (kbin_edges[1:] + kbin_edges[:-1]) / 2. nkbins: 25 kmax: 0.05
galaxies.h5: galaxy catalog, written in HDF5 format bywrite_h5(). The galaxy catalog should contain the following columns.
ra_deg,dec_deg: angular locations of galaxies.
bv_90,bv_150: per-object KSZ velocity bias \(b_v^i\), defined by the equation \(\tilde T(\theta_i) = b_v^i v_r(x_i) + (\mbox{noise})\).
tcmb_90,tcmb_150: per-object filtered CMB temperatures \(\tilde T(\theta_i)\).
weight_gal,weight_vr: per-object weightings used when constructing the galaxy density field \(\rho_g\) and kSZ velocity reconstruction \(\hat v_r\). (These weightings are denoted \(W_i^L\) and \(W_i^S\) in the overleaf.)
z: observed redshift.
randoms.h5random catalog. Contains the columns above, except that thezcolumn (redshift) is replaced by two columnszobs,ztrue. (For a photometric survey, thezobsandztruecolumns will differ – for a spectroscopic survey they will be copies of each other.)
bounding_box.pkl: object of classBoundingBox, saved in pickle format.
KszPipe output files¶
The KszPipe constructor has an argument output_dir.
This is the name of a directory which the KszPipe will populate with the following files:
params.yml: this file is copied from the input directory to the output directory unmodified.
pk_data.npy: array of shape(3,3,nkbins), containing auto and cross power spectra of the following fields:
0: galaxy overdensity \(\delta_g\)
1: kSZ velocity reconstruction \(v_r^{90}\)
2: kSZ velocity reconstruction \(v_r^{150}\)
pk_surrogates.npy: array of shape(nsurr,6,6,nkbins), containing auto and cross power spectra of the following fields:
0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).
1: derivative \(dS_g/df_{NL}\).
2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).
3: derivative \(dS_v^{90}/db_v\).
4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).
5: derivative \(dS_v^{150}/db_v\).