
The Probabilistic Cataloger (PCAT)

Have you ever worried about unknown unknowns in life? Looking toward the horizon over the ocean, have you ever found it disturbing that you couldn't identify an object far away, because the limited angular resolution of your eyes made it appear simultaneously consistent with a large buoy, a boat, and a small ship? I have, many times.

Imagine you count the number of sources in each of several globular clusters and find 1244, 2532, 3424, and 2344. Would it not be great to have an uncertainty estimate on the number of stars, so that you could reconcile the information from these individual systems into a concordance model?

To delve into this deep issue in statistics using a simple and widely encountered research problem as an example, consider the following elementary operation in astrophysics: inferring a catalog of sources from a given image of stars. The image is a dataset: in essence, a two-dimensional array of the number of photons that landed in each pixel during the exposure. The traditional inference approach addresses the following question: what set of point-source positions and fluxes maximizes the Poisson likelihood that the observed photon-count map was generated by the point-source model?

However, there is something fundamentally wrong with this method (pardon my Bayesian priors here). In a deterministic catalog, point sources in a crowded field can be blended together when their separation is comparable to the width of the point spread function (PSF). Similarly, background fluctuations or mismodeling can mimic multiple point sources where there is actually a single real source. Given an observed photon-count map, the maximum-likelihood solution commits to either a single source or multiple sources. Depending on the reality, it can therefore either miss an existing point source or overfit by introducing a spurious one. This type of across-model covariance cannot be accounted for by a point estimate, whether frequentist or Bayesian.

A non-transdimensional model only includes elements that improve the goodness of fit above a hard significance threshold, so the resulting catalog is limited by a selection function. For source finding in an image, the result is a flux-limited catalog, where the distribution of source detection significances is cut off at an arbitrary, ad hoc value. You may have wondered: how would the catalog look with a different threshold? Your best-fit model would have a different number of sources, each with different positions and fluxes. You may then wonder: which is the correct answer? I'd say none of them. I argue that a well-principled Bayesian answer should include all of these catalogs, weighted by how frequently they occur in a statistical sample of catalogs. This is because non-transdimensional modeling misuses the information content of a given dataset, introducing significant biases into our inference of hyperparameters and into any further analyses that rely on the catalog.

If you spend most of your time playing with data, a homebrew sampler is your friend. Any inference problem (given some data) can almost always be cast in terms of a likelihood function, which encodes the probability of observing the data under a model. Folding in our priors, the fundamental problem of inference essentially reduces to sampling from a non-analytic, potentially multimodal, and possibly high-dimensional posterior probability density.

These factors eventually led me, a fellow graduate student at the Harvard-Smithsonian Center for Astrophysics (CfA), Stephen Portillo, and my advisor, Douglas Finkbeiner, to construct a Bayesian framework for sampling from the space of catalogs consistent with the image. The ensemble of such fair samples represents our state of knowledge about the point sources in the image. This is in stark contrast to the frequentist approach, which attempts to represent that knowledge with a single estimator of the most likely solution.

Transdimensional sampling

By construction, the catalog space is a transdimensional space encompassing the parameter spaces of all point-source models of any dimension. Constructing a Markov chain whose stationary distribution is the target probability distribution in the catalog space therefore requires across-model moves. I perform these state transitions using reversible jumps that respect detailed balance. To span the catalog space, multiple types of proposals, such as births, deaths, splits, and merges, are used alongside the usual heavy-tailed within-model transitions.
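The mechanics of birth and death moves can be illustrated with a toy sketch (this is not PCAT's implementation). Here the "catalog" is just a list of elements whose count has a Poisson prior with an assumed mean LAM, the likelihood is taken to be flat so that prior and proposal densities cancel, and the reversible-jump acceptance ratio for a birth reduces to LAM/(k+1):

```python
import math
import random

random.seed(0)
LAM = 3.0  # hypothetical prior mean for the number of elements

def log_prior_k(k):
    # log of Poisson(k; LAM), dropping the constant factor exp(-LAM)
    return k * math.log(LAM) - math.lgamma(k + 1)

state = []  # current catalog: one uniform "mark" per element
hist = {}   # visit counts for each model dimension k
for step in range(200_000):
    k = len(state)
    if random.random() < 0.5:
        # Birth: draw the new element's mark from its prior. The mark's
        # prior density cancels against the proposal density, so the
        # acceptance ratio reduces to Poisson(k+1)/Poisson(k) = LAM/(k+1).
        if random.random() < min(1.0, math.exp(log_prior_k(k + 1) - log_prior_k(k))):
            state.append(random.random())
    elif k > 0:
        # Death: remove a uniformly chosen element; the ratio is k/LAM.
        if random.random() < min(1.0, math.exp(log_prior_k(k - 1) - log_prior_k(k))):
            state.pop(random.randrange(k))
    hist[len(state)] = hist.get(len(state), 0) + 1

mean_k = sum(k * n for k, n in hist.items()) / sum(hist.values())
print(f"posterior mean of k: {mean_k:.2f}")  # should hover near LAM
```

Because the moves respect detailed balance, the chain's visits to each dimension k recover the Poisson prior, which is exactly the sanity check one would run before adding a real likelihood.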

Hierarchical modeling

A given point source can be regarded as a realization of an underlying Poisson process unique to the population it belongs to. The priors on its position, flux, or color can then be made conditional on a few hyperparameters, which characterize the spatial, spectral, or color distribution of its population. This type of hierarchical modeling lets us constrain the form of the priors while allowing the individual point sources to be independent realizations of the population. Note that hyperparameters are just higher-level parameters that control the shape of the priors on the individual point sources, and that the likelihood is invariant with respect to them!
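This conditional structure can be sketched as follows (the power-law flux prior, the cutoff F_MIN, and the uniform hyperprior range are illustrative assumptions, not PCAT's actual choices). The hyperparameter alpha enters only through the per-source priors; the likelihood term is passed through untouched:

```python
import math

F_MIN = 1.0  # assumed lower flux cutoff, in arbitrary units

def log_flux_prior(flux, alpha):
    """Log density of a power-law (Pareto) flux prior with slope alpha > 1:
    p(f | alpha) = (alpha - 1) * F_MIN**(alpha - 1) * f**(-alpha) for f >= F_MIN."""
    if flux < F_MIN or alpha <= 1.0:
        return float("-inf")
    return math.log(alpha - 1.0) + (alpha - 1.0) * math.log(F_MIN) - alpha * math.log(flux)

def log_target(fluxes, alpha, log_like):
    """Hierarchical log-posterior up to a constant: a hyperprior on alpha
    (here uniform on [1.5, 3.5]), the conditional prior on each flux, and
    the likelihood, which does not depend on alpha at all."""
    if not (1.5 <= alpha <= 3.5):
        return float("-inf")
    return sum(log_flux_prior(f, alpha) for f in fluxes) + log_like

# Changing alpha reweights the fluxes through the prior alone
val = log_target([2.0, 5.0], alpha=2.0, log_like=-10.0)
print(val)
```

The key property is visible in the code: updating alpha never requires re-evaluating the likelihood, only the sum of per-source log-priors, which is what makes hyperparameter moves cheap.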


The Probabilistic Cataloger (PCAT)

The Probabilistic Cataloger (PCAT) is a transdimensional, hierarchical, Bayesian sampler written in Python that draws fair samples from the posterior distribution, given a likelihood function and a prior distribution over transdimensional elements. It uses transdimensional proposals as well as fixed-dimensional, within-model proposals; the latter are heavy-tailed Gaussian proposals, as in a Metropolis-Hastings Markov chain Monte Carlo (MCMC) sampler. PCAT constructs a Markov chain of states whose stationary distribution is the target posterior probability density. The sampler takes steps in a transformed parameter space where the prior density is uniform and uses an adaptive burn-in to make transdimensional sampling as efficient as possible.
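The prior-uniform reparameterization can be illustrated with a one-parameter sketch (the power-law flux prior and its cutoff F_MIN are assumptions for illustration, not PCAT's API). Stepping uniformly in a unit variable u and mapping it through the inverse CDF of the prior is equivalent to stepping under the prior itself:

```python
import math
import random

random.seed(0)
F_MIN, ALPHA = 1.0, 2.0  # assumed power-law flux prior p(f) ∝ f**(-ALPHA)

def flux_from_unit(u):
    """Inverse CDF of the Pareto prior: maps u ~ Uniform(0, 1) to a flux,
    so a sampler stepping uniformly in u effectively steps under the prior."""
    return F_MIN * (1.0 - u) ** (-1.0 / (ALPHA - 1.0))

# A within-model step in the transformed (unit) space; wrapping the
# Gaussian step modulo 1 keeps the proposal symmetric on the unit interval
u = 0.3
u_new = (u + random.gauss(0.0, 0.05)) % 1.0
print(flux_from_unit(u), flux_from_unit(u_new))
```

The payoff is that the prior factor drops out of the within-model acceptance ratio entirely, leaving only the likelihood ratio, which simplifies tuning the proposal scale.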

Despite its superior properties compared to deterministic cataloging, probabilistic cataloging has not been the mainstream approach. This is expected, since probabilistic cataloging is computationally demanding: given the high, variable dimensionality of the hypothesis space, drawing effectively independent samples with MCMC requires long runs. I therefore optimize the sampler's time performance by making the model's predictions more efficient and by introducing approximations where possible.

PCAT is designed to be a general-purpose catalog sampler with a user-friendly API. For the full API, refer to the GitHub project page.
