The Probabilistic Cataloger (PCAT)

Have you ever worried about unknown unknowns in life? Looking towards the horizon over the ocean, have you every found it disturbing that you couldn't identify an object you saw far away as the limited angular resolution of your eyes made the object appear to be simultaneously consistent with a large buoyant, a boat, and a small ship? I have, many times.

 

when you are asked to tell the number of objects in a fuzzy image?

Imagine you count the number of sources in each of these globular clusters and find that they are 1244, 2532, 3424, and 2344. ould it not be great to have an uncertainty estimate on the number of stars so you can reconcile the information from these individual systems to construct a concordance model.

 

To delve into this deep issue in statistics using a simple and widely-confronted research problem as an example, consider the following elementary operation in astrophysics: inferring a catalog of sources in a given image of stars. The image is a data set, and in its essence, is a two dimensional array of the number of photons that landed in each pixel during the exposure.

 

The traditional inference approach is to address the following question: what is the set of point source positions and fluxes that maximizes the Poisson likelihood that the observed photon count map is generated by the point source model.

However, there is something fundamentally wrong about this method (pardon my Bayesian priors here). In a deterministic catalog, point sources in a crowded field can be blended together when their separation is comparable to the point spread function (PSF). Similarly, background fluctuations or mismodeling can fake multiple point sources when there is actually a single real source. Given an observed photon count map, the maximum likelihood solution will either identify a single source or multiple sources. Therefore depending on what the reality is, it can either miss an existing point source or overfit by introducing a spurious one. This type of across-model covariance cannot be accounted for in a frequentist approach or Bayesian approach with a point-estimate.

A non-transdimensional model only includes elements that improve the goodness-of-fit above a hard significance threshold. The resulting catalog of these elements is thus limited by a selection function. In the case of the problem of finding sources in an image, the result is a flux limited catalog, where the distribution of the detection significances of the sources cuts off at an arbitrary, ad-hoc value. You may have wondered: how would the catalog look if you had a different threshold? Your best-fit model would have a different number of sources and these sources would have different positions and fluxes. You may then wonder: what is the correct answer? I'd say neither of these. I argue that a well-principled Bayesian answer should include all these answers according to how frequently they occur in a statistical sample of catalogs. This is because non-transdimensional modeling causes significant biases our inference of hyperparameters due to a misuse of the information content of a given data set.

 

which could otherwise be fed into further analyses that rely on the catalog.

If you spend most of your time playing with data, a homebrew sampler is your friend. This is because any type of inference (given some data) can almost always be performed with the use of a likelihood function, which encodes the probability of obtaining the observed data under the assumption of a model. Potentially folding our priors in, then, the fundamental problem of inference is essentially reduced to that of sampling from a non-analytical and potentially multimodal and/or large dimensional posterior probability density.

These factors eventually lead me, a fellow graduate student at the Harvard-CfA, Stephen Portillo, and my advisor Douglas Finkbeiner, to construct a Bayesian framework to take samples from the space of catalogs consistent with the image. The ensemble of such fair samples together is a representation of our state of knowledge about the point sources in the image. This is in stark contrast with the frequentist approach, which attempts to represent it with an estimator for themost likely solution.

Transdimensional sampling

By construction, the catalog space is a transdimensional space, which encompasses the parameter space of all point source models of any dimension. Therefore constructing a Markov chain, whose stationary distribution is the target probability distribution in the catalog space, requires across-model moves. I perform these state transitions using reversible jumps that respect detailed balance. Therefore multiple types of proposals such as birth, death, split and merge are used along with the usual heavy-tailed within-model transitions in order to span the catalog space.

Hierarchical modeling

A given point source can be regarded as a realization of an underlying Poisson process unique for the population to which it belongs. The priors on its position, flux or color can then be made conditional on a few hyperpriors, which characterize the spatial, spectral or color distribution of its population. This type of hierarchical modeling allows us to constrain the form of the priors, while allowing the individual point sources to be independent realizations of the population. Note that hyperpriors are just higher level priors that control the shape of the priors on the individual point sources and that the likelihood is still invariant with respect to hyperpriors!

 

The Probabilistic Cataloger (PCAT)

The Probabilistic Cataloger (PCAT) is a transdimensional, hierarchical, and Bayesian sampler written in python that takes fair samples from the posterior probability distribution, given a likelihood function and prior probability distribution over transdimensional elements. It uses transdimensional as well as fixed-dimensional, within-model proposals. The within-model proposals are heavy-tailed Gaussian proposals as in a Metropolis-Hastings Markov Chain Monte Carlo (MCMC) sampler. PCAT constructs a Markov chain of states, whose stationary distribution is the target posterior probability density. The sampler takes steps in a transformed parameter space where the prior density is uniform and uses adaptive burn-in to make transdimensional sampling as efficient as possible.

Despite its superior properties compared to deterministic catalogs, probabilistic cataloging has not been the mainstream approach to cataloging. This is expected, since probabilistic cataloging is a computationally demanding task. Given that the dimensionality of the hypothesis space is variable and large, taking independent samples from it using MCMC requires long simulations. Therefore I optimize the time performance of the sampler by making the model prediction efficient and introducing approximations when possible.

PCAT is designed to be a general-purpose catalog sampler with a user-friendly API. For the full API, refer to the GitHub project page. If you have any concerns or questions feel free to file an issue in its GitHub repository.