# NRC 2017 Invited Speakers

### Rina Foygel Barber (Tweedie Award winner)

Department of Statistics, University of Chicago

**Title: Flexible inference for high-dimensional regression and classification**

**Abstract:** In this talk, I will present ongoing work on two problems in high-dimensional regression: model selection, where we would like to select relevant features without too many false positives; and confidence intervals for estimation, where we would like to perform inference on our estimates of the response Y given covariates X, without assuming any particular model. For the first problem, model selection, I will present work on the knockoff filter, which creates knockoff copies of all the covariates to provide a “control group” of irrelevant features that we can use to assess the precision of any model selection method. Ongoing work on this problem includes assessing the robustness of the knockoff method, and combining knockoffs with permutation methods. For the estimation problem, I will discuss some preliminary results for data-driven inference on isotonic regression, where any “black box” algorithm fitting a regression function to the data can be recalibrated to provide a confidence interval for $E[Y\vert X]$.

### Nilanjan Chatterjee

Department of Biostatistics and School of Medicine, Johns Hopkins University

**Title: Developing predictive models for precision medicine using summary-level information from big-data sources**

**Abstract:** Extraction of information through summary-level statistics, as opposed to individual-level data, from big datasets can be appealing for various practical reasons, such as data sharing, storage and computing, as well as for ethical reasons, such as maintenance of the privacy of the study subjects and protection of the future research interest of data-generating institutions/investigators. In this talk, I will describe statistical methods for building predictive models using summary-level information in two different settings. One involves development of high-dimensional penalized additive regression models using summary-level association statistics from genome-wide association studies and functional/annotation information from other genomic databases. The other application involves building general regression models using individual-level data from an analytic study while utilizing information on parameters of a reduced model fitted to an external big dataset. The methods will be illustrated with cutting-edge applications of disease risk prediction models in precision medicine.

### Alison Etheridge (IMS President Elect)

Department of Statistics, University of Oxford

**Title: Modelling evolution in a spatial continuum**

**Abstract:** Since the pioneering work of Fisher, Haldane and Wright at the beginning of the 20th century, mathematics has played a central role in theoretical population genetics. In turn, population genetics has provided the motivation both for important classes of probabilistic models, such as coalescent processes, and for deterministic models, such as the celebrated Fisher-KPP equation. Whereas coalescent models capture ‘relatedness’ between genes, the Fisher-KPP equation captures something of the interaction between natural selection and spatial structure. What has proved to be remarkably difficult is to combine the two, at least in the biologically relevant setting of a two-dimensional spatial continuum.
In this talk we describe some of the challenges of modelling evolution in a spatial continuum, present a model that addresses those challenges, and, as time permits, describe some applications.

### Ed George

Statistics Department, University of Pennsylvania

**Title: Mortality Rate Estimation and Standardization for Public Reporting: Medicare’s Hospital Compare**

**Abstract:** Bayesian models are increasingly fit to large administrative data sets and then used to make individualized recommendations. In particular, Medicare’s Hospital Compare webpage provides information to patients about specific hospital mortality rates for a heart attack or Acute Myocardial Infarction (AMI). Hospital Compare’s current recommendations are based on a random-effects logit model with a random hospital indicator and patient risk factors. Except for the largest hospitals, these individual recommendations or predictions are not checkable against data, because data from smaller hospitals are too limited to provide a meaningful check. Before individualized Bayesian recommendations, people derived general advice from empirical studies of many hospitals; e.g., prefer hospitals of type 1 to type 2 because the risk is lower at type 1 hospitals. Here we calibrate these Bayesian recommendation systems by checking, out of sample, whether their predictions aggregate to give correct general advice derived from another sample. This process of calibrating individualized predictions against general empirical advice leads to substantial revisions in the Hospital Compare model for AMI mortality. In order to make appropriately calibrated predictions, our revised models incorporate information about hospital volume, nursing staff, medical residents, and the hospital’s ability to perform cardiovascular procedures. For the ultimate purpose of comparisons, hospital mortality rates must be standardized to adjust for patient mix variation across hospitals. We find that indirect standardization, as currently used by Hospital Compare, fails to adequately control for differences in patient risk factors and systematically underestimates mortality rates at the low volume hospitals. To provide good control and correctly calibrated rates, we propose direct standardization instead. 
(This is joint research with Veronika Rockova, Paul Rosenbaum, Ville Satopaa and Jeffrey Silber.)

### Carey E. Priebe

Department of Applied Mathematics & Statistics, Johns Hopkins University

**Title: Semiparametric spectral modeling of the Drosophila connectome**

**Abstract:** We present semiparametric spectral modeling of the complete larval Drosophila mushroom body connectome. Motivated by a thorough exploratory data analysis of the network via Gaussian mixture modeling (GMM) in the adjacency spectral embedding (ASE) representation space, we introduce the latent structure model (LSM) for network modeling and inference. LSM is a generalization of the stochastic block model (SBM) and a special case of the random dot product graph (RDPG) latent position model, and is amenable to semiparametric GMM in the ASE representation space. The resulting connectome code derived via semiparametric GMM composed with ASE captures latent connectome structure and elucidates biologically relevant neuronal properties.

### Jon A. Wellner (IMS President)

Department of Statistics, University of Washington, Seattle

**Title: Teaching Statistics in the Age of Data Science**

**Abstract:** Advances in statistical theory and the development of more accurate statistical methods in the 1920s and 1930s led to increases in the teaching of statistics. Because of difficulties in meeting the demand for trained statisticians and knowledgeable teachers of statistics, the IMS formed a committee chaired by Harold Hotelling to make recommendations concerning the teaching of statistics. A talk based on the initial work of this committee was delivered by Hotelling at an IMS Meeting in Hanover, New Hampshire, and was published in the Annals of Mathematical Statistics, volume 11, in 1940. This report was reprinted several times: (a) in the Festschrift presented to Hotelling in 1960 on the occasion of his 65th birthday; (b) as a “Golden Oldie” in Statistical Science, volume 3, 1988. It led to the creation of many departments of statistics at universities across the U.S., and remains worthy of re-reading even now.

The field of statistics is again experiencing high demand for statisticians and data scientists. The demand and rapid shifts in technology provoke several important questions:

- What is statistics?
- What is data science?
- What should we be teaching in statistics and data science courses?

My talk will attempt to summarize and review several of the current (divergent!) perspectives on these questions, propose some possible answers from my own teaching experience, and explain why I am optimistic about the future of statistics and data science.