KszPipe

class kszx.KszPipe(input_dir, output_dir)

This is the main kSZ analysis pipeline, which computes data/surrogate power spectra from catalogs.

There are two ways to run a KszPipe pipeline. The first way is to create a KszPipe instance and call the run() method. This can be done either in a script or a jupyter notebook (if you’re using jupyter, you should keep in mind that the KszPipe may take hours to run, and you’ll need to babysit the connection to the jupyterhib). The second way is to run from the command line with:

python -m kszx kszpipe_run [-p NUM_PROCESSES] <input_dir> <output_dir>

The input_dir contains a parameter file params.yml and galaxy/random catalogs. The output_dir will be populated with power spectra. For details of what KszPipe computes, and documentation of file formats, see the sphinx docs:

After running the pipeline, you may want to load pipeline outputs using the helper class KszPipeOutdir, or do parameter estimation using PgvLikelihood.

High-level features:

  • Runs “surrogate” sims (see overleaf) to characterize the survey window function, determine dependence of power spectra on \((f_{NL}, b_v)\), and assign error bars to power spectra.

  • Velocity reconstruction noise is included in surrogate sims via a bootstrap procedure, using the observed CMB realization. This automatically incorporates noise inhomogeneity and “striping”, and captures correlations e.g. between 90 and 150 GHz.

  • The galaxy catalog can be spectroscopic or photometric (via the ztrue_col and zobs_col constructor args). Surrogate sims will capture the effect of photo-z errors.

  • The windowed power spectra \(P_{gg}\), \(P_{gv}\), \(P_{vv}\) use a normalization which should be approximately correct. The normalization is an ansatz which is imperfect, especially on large scales, so surrogate sims should still be used to compare power spetra to models. Eventually, we’ll implement a precise calculation of the window function.

  • Currently assumes one galaxy field, and two velocity reconstructions labelled “90” and “150” (with ACT in mind).

  • Currently, there is not much implemented for CMB foregrounds. Later, I’d like to include foreground clustering terms in the surrogate model (i.e. terms of the form \(b_\delta \delta(x)\), in addition to the kSZ term \(b_v v_r(x)\)), and estimate the \(b_\delta\) biases by estimating the spin-zero \(P_{gv}\) power spectrum.

property window_function

3-by-3 matrix \(W_{ij}\) containing the window function for power spectra on three spatial footprints:

  • footprint 0: random catalog weighted by weight_gal column

  • footprint 1: random catalog weighted by product of columns weight_vr * bv_90

  • footprint 2: random catalog weighted by product of columns weight_vr * bv_150

These spatial weightings are appropriate for the \(\delta_g\), \(v_r^{90}\), and \(v_r^{150}\) fields.

Window functions are computed with wfunc_utils.compute_wcrude() and are crude approximations (for more info see compute_wcrude() docstring), but this is okay since surrogate fields are treated consistently.

property surrogate_factory

Returns an instance of class SurrogateFactory, a helper class for simulating the density and radial velocity fields at locations of randoms.

get_pk_data(run=False, force=False)

Returns a shape (3,3,nkbins) array, and saves it in pipeline_outdir/pk_data.npy.

The returned array contains auto and cross power spectra of the following fields:

  • 0: galaxy overdensity

  • 1: kSZ velocity reconstruction \(v_r^{90}\)

  • 2: kSZ velocity reconstruction \(v_r^{150}\)

Flags:

  • If run=False, then this function expects the \(P(k)\) file to be on disk from a previous pipeline run.

  • If run=True, then the \(P(k)\) file will be computed if it is not on disk.

  • If force=True, then this function recomputes \(P(k)\), even if it is on disk from a previous pipeline run.

get_pk_surrogate(isurr, run=False, force=False)

Returns a shape (6,6,nkbins) array, and saves it in pipeline_outdir/tmp/pk_surr_{isurr}.npy.

The returned array contains auto and cross power spectra of the following fields, for a single surrogate:

  • 0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).

  • 1: derivative \(dS_g/df_{NL}\).

  • 2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).

  • 3: derivative \(dS_v^{90}/db_v\).

  • 4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).

  • 5: derivative \(dS_v^{150}/db_v\).

Flags:

  • If run=False, then this function expects the \(P(k)\) file to be on disk from a previous pipeline run.

  • If run=True, then the \(P(k)\) file will be computed if it is not on disk.

  • If force=True, then this function recomputes \(P(k)\), even if it is on disk from a previous pipeline run.

get_pk_surrogates()

Returns a shape (nsurr,6,6,nkins) array, and saves it in pipeline_outdir/pk_surrogates.npy.

The returned array contains auto and cross power spectra of the following fields, for all surrogates:

  • 0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).

  • 1: derivative \(dS_g/df_{NL}\).

  • 2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).

  • 3: derivative \(dS_v^{90}/db_v\).

  • 4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).

  • 5: derivative \(dS_v^{150}/db_v\).

This function only reads files from disk – it does not run the pipeline. To run the pipeline, use run().

run(processes)

Runs pipeline and saves results to disk, skipping results already on disk from previous runs.

Implementation: creates a multiprocessing Pool, and calls get_pk_data() and get_pk_surrogates() in worker processes.

Can be run from the command line with:

python -m kszx kszpipe_run [-p NUM_PROCESSES] <input_dir> <output_dir>

The processes argument is the number of worker processes. Currently I don’t have a good way of setting this automatically – the caller must adjust the number of processes, based on the size of the datasets, and amount of memory available.

class kszx.KszPipeOutdir(dirname, nsurr=None)

A helper class for loading and processing output files from class KszPipe.

Note: for MCMCs and parameter fits, there is a separate class PgvLikelihood. The KszPipeOutdir class is more minimal (the main use case is plot scripts!)

The constructor reads the files {dirname}/params.yml, {dirname}/pk_data.npy, {dirname}/pk_surrogates.npy which are generated by run(). For more info on these files, and documentation of file formats, see the sphinx docs:

Constructor arguments:

  • dirname (string): name of pipeline output directory.

  • nsurr (integer or None): this is a hack for running on an incomplete pipeline. If specified, then {dirname}/pk_surr.npy is not read. Instead we read files of the form {dirname}/tmp/pk_surr_{i}.npy.

pgg_data()

Returns shape (nkbins,) array containing \(P_{gg}^{data}(k)\).

pgg_mean(fnl=0)

Returns shape (nkbins,) array, containing \(\langle P_{gg}^{surr}(k) \rangle\).

pgg_rms(fnl=0)

Returns shape (nkbins,) array, containing sqrt(Var(\(P_{gg}^{surr}(k)\))).

pgv_data(field)

Returns shape (nkbins,) array containing \(P_{gv}^{data}(k)\).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

pgv_mean(field, fnl, bv)

Returns shape (nkbins,) array containing \(\langle P_{gv}^{surr}(k) \rangle\).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

pgv_rms(field, fnl, bv)

Returns shape (nkbins,) array containing sqrt(Var(\(P_{gv}^{surr}(k)\))).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

pvv_data(field)

Returns shape (nkbins,) array containing \(P_{vv}^{data}(k)\).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

pvv_mean(field, bv)

Returns shape (nkbins,) array containing \(\langle P_{vv}^{data}(k) \rangle\).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

pvv_rms(field, bv)

Returns shape (nkbins,) array containing sqrt(var(\(P_{vv}^{data}(k)\))).

The field argument is a length-2 vector, selecting a linear combination of the 90+150 GHz velocity reconstructions. For example:

  • field=[1,0] for 90 GHz reconstruction

  • field=[0,1] for 150 GHz reconstruction

  • field=[1,-1] for null (90-150) GHz reconstruction.

KszPipe details

High-level organization

Our KSZ pipelines can be organized into three stages:

  • “Pre-KszPipe”: a collection of jupyter notebooks which filter the CMB maps, apply cuts/weighting to the galaxy catalog, and write a few output files (params.yml, galaxies.h5, randoms.h5). These output files are specified below under KszPipe input files.

    For this stage, we’ve been using ad hoc jupyter notebooks, rather than “polished” pipeline code in git. We often want to run experiments where we change an aspect of the processing, so the “hackability” of jupyter notebooks is really convenient.

  • “KszPipe”: takes the galaxy/random catalogs (see KszPipe input files), and creates files containing power spectra of the data, and power spectra of simulated “surrogate” fields. These output files are specified below under KszPipe output files.

    In this stage, the heavy lifting can be done with kszx.KszPipe, which calls kszx.SurrogateFactory.

  • “Post-KszPipe”: plot power spectra, fit parameters \((f_{NL}, b_v)\), and run MCMCs. Some useful helper classes: kszx.KszPipeOutdir, kszx.PgvLikelihood.

Data and surrogate fields

The KszPipe computes power spectra involving a galaxy density field \(\rho_g\), one or more kSZ velocity reconstructions \(\hat v_r\), and surrogate fields \(S_g, S_v\). Definitions of these fields are given in the overleaf, and can be summarized as follows:

\[\begin{split}\begin{align} W_L(x) &= \mbox{Large-scale galaxy weight function (e.g. FKP)} \\ W_S(x) &= \mbox{Small-scale weight function, applied to galaxies during velocity reconstruction} \\ \rho_g(x) &= \bigg( \sum_{i\in \rm gal} W_i^L \, \delta^3(x-x_i) \bigg) - \frac{N_g}{N_r} \bigg( \sum_{j\in \rm rand} W_j^L \, \delta^3(x-x_j) \bigg) \\ \hat v_r(x) &= \sum_{i\in \rm gal} W_i^S \, \tilde T(\theta_i) \, \delta^3(x-x_i) \\ S_g(x) &= \sum_{j\in \rm rand} \frac{N_g}{N_r} W_j^L \big( b_j^G \delta_m(x_j) + f_{NL} b_j^{NG} \phi(x_j) + \eta_j \big) \delta^3(x-x_j) \\ S_v(x) &= \sum_{j\in\rm rand} \bigg( \frac{N_g}{N_r} W_j^S b_j^v v_r(x_j) + M_j W_j^S \tilde T(\theta_j) \bigg) \delta^3(x-x_j) \\ \big\langle \eta_j^2 \big\rangle &= \frac{N_r}{N_g} - \big\langle \delta_G(z_j)^2 \big\rangle \\ M_j &= \begin{cases} 1 & \mbox{if $j$ is in a randomly selected subset of size } N_{\rm gal} \\ 0 & \mbox{otherwise} \end{cases} \end{align}\end{split}\]

To mitigate foregrounds, the KszPipe also includes mean-subtraction steps in \(v_r\) and \(S_v\) which are described in the overleaf.

The bias \(b_v\) which appears in these equations is a per-object fiducial bias, which must be estimated in an earlier pipeline stage, before the KszPipe is run. The fiducial bias \(b_v\) will depend on a choice of fiducial \(P_{ge}(k)\), as well as the filter applied to \(T_{CMB}\). (Since different CMB frequency channels can use different CMB filters, \(b_v\) can depend on CMB frequency channel.) One reasonable way of initializing \(b_v\) is to use the following approximate expression from the overleaf:

\[\begin{split}\begin{align} b_j^v &\approx B_v(z_j) \, W_{\rm CMB}(\theta_j) \\ B_v(\chi) &\equiv \frac{K(\chi)}{\chi^2} \int \frac{d^2L}{(2\pi)^2} \, b_L F_L \, P_{ge}^{\rm true}(k,\chi)_{k=L/\chi} \end{align}\end{split}\]

KszPipe input files

The KszPipe constructor has an argument input_dir. This is the name of a directory containing the following files:

  • params.yml: this file is easiest to describe by example:

    # Version of the KszPipe params.yml format.
    # (In case we change the file format in the future.)
    version: 1
    
    # Number of surrogate sims
    nsurr: 400
    
    # Galaxy bias used in surrogate sims
    surr_bg: 2.1
    
    # Number of redshift bins used for field-level mean-subtraction.
    # Passed as 'nbins' argument to kszx.utils.subtract_binned_means().
    # If zero, then kszx.utils.subtract_binned_means() is not called.
    
    nzbins_gal: 25  # mean-subtraction for galaxy field
    nzbins_vr: 25   # mean-subtraction for galaxy field
    
    # k-binning for power spectrum estimation
    # Note: bins are generated from (nkbins, kmax) as follows.
    #
    #  kbin_edges = np.linspace(0, kmax, nkbins+1)
    #  kbin_centers = (kbin_edges[1:] + kbin_edges[:-1]) / 2.
    
    nkbins: 25
    kmax: 0.05
    
  • galaxies.h5: galaxy catalog, written in HDF5 format by write_h5(). The galaxy catalog should contain the following columns.

    • ra_deg, dec_deg: angular locations of galaxies.

    • bv_90, bv_150: per-object KSZ velocity bias \(b_v^i\), defined by the equation \(\tilde T(\theta_i) = b_v^i v_r(x_i) + (\mbox{noise})\).

    • tcmb_90, tcmb_150: per-object filtered CMB temperatures \(\tilde T(\theta_i)\).

    • weight_gal, weight_vr: per-object weightings used when constructing the galaxy density field \(\rho_g\) and kSZ velocity reconstruction \(\hat v_r\). (These weightings are denoted \(W_i^L\) and \(W_i^S\) in the overleaf.)

    • z: observed redshift.

  • randoms.h5 random catalog. Contains the columns above, except that the z column (redshift) is replaced by two columns zobs, ztrue. (For a photometric survey, the zobs and ztrue columns will differ – for a spectroscopic survey they will be copies of each other.)

  • bounding_box.pkl: object of class BoundingBox, saved in pickle format.

KszPipe output files

The KszPipe constructor has an argument output_dir. This is the name of a directory which the KszPipe will populate with the following files:

  • params.yml: this file is copied from the input directory to the output directory unmodified.

  • pk_data.npy: array of shape (3,3,nkbins), containing auto and cross power spectra of the following fields:

    • 0: galaxy overdensity \(\delta_g\)

    • 1: kSZ velocity reconstruction \(v_r^{90}\)

    • 2: kSZ velocity reconstruction \(v_r^{150}\)

  • pk_surrogates.npy: array of shape (nsurr,6,6,nkbins), containing auto and cross power spectra of the following fields:

    • 0: surrogate galaxy field \(S_g\) with \(f_{NL}=0\).

    • 1: derivative \(dS_g/df_{NL}\).

    • 2: surrogate kSZ velocity reconstruction \(S_v^{90}\), with \(b_v=0\) (i.e. noise only).

    • 3: derivative \(dS_v^{90}/db_v\).

    • 4: surrogate kSZ velocity reconstruction \(S_v^{150}\), with \(b_v=0\) (i.e. noise only).

    • 5: derivative \(dS_v^{150}/db_v\).