Publications and Presentations of Dr. Wolfgang Rolke

PDF versions are available for download by clicking on the name of the paper.

[30] Testing Goodness of Fit

A talk on the general goodness of fit testing problem. I discuss some of the history and controversies of the main methods as well as some newer developments.

Presented at the Terascale Statistics School 2017 at DESY, Hamburg, Germany

A shorter version which I presented in our Mathematics Department Seminar is available here:

PDF

Powerpoint

[29] Introduction to R

An introduction to R. Presented at the Terascale Statistics School 2017 at DESY, Hamburg, Germany

[28] GOFer

An introduction to the online goodness-of-fit testing app https://drrolke.shinyapps.io/GOFer/.

CMS Statistics meeting, CERN, Geneva, Switzerland

[27] R / Shiny workshop
together with the Puerto Rico Chapter of the American Statistical Association
To get started, also download workshop.zip

[26] Limit Setting Methods for the On/Off Problem

Abstract: In the talk I discuss a number of different methods for calculating confidence intervals for the On-Off problem. The methods include all those in common use today. I derive explicit formulas for the limits and calculate the true coverage and the expected lengths of these methods.

CMS Statistical Meeting, November 2015, Large Hadron Collider at CERN, Geneva, Switzerland

[25] Generalized Linear Models Workshop

Abstract: a workshop with an introduction to simple and multiple regression as well as the generalized linear model.
Expo in Statistics 2015, C3Tec, Caguas, PR, together with the Puerto Rico Chapter of the American Statistical Association

[24] What's wrong with Hypothesis Testing?

Abstract: Talk at Department of Mathematical Sciences UPRM seminar. Hypothesis testing has been a large part of Statistics for almost a century and is one of the most common methodologies in many fields of research. Yet earlier this year a major journal in Psychology announced that they will no longer publish any papers that include a hypothesis test. So, what's wrong with hypothesis testing?

[23] A Comparison of Limit Setting Methods for the On-Off Problem

Abstract: We study the frequentist properties of confidence intervals for the On-Off problem. The methods include all those in common use today. We derive explicit formulas for the limits and calculate the true coverage and the expected lengths of these methods.

Published in Nuclear Instruments and Methods A
DOI: 10.1016/j.nima.2015.10.028

For an online limits calculator go to https://wolfgangrolke.shinyapps.io/OnOffLimitsCalculator
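
To illustrate what "true coverage" means for the On-Off problem in [23], here is a minimal Python sketch; it is not the code used in the paper. It assumes the usual model n_on ~ Poisson(mu + b) and n_off ~ Poisson(tau*b), and it uses a naive Wald-type interval purely as a stand-in for a limit setting method. The function names and the truncation rule for the sums are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability P(K = k) for rate lam >= 0."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def wald_interval(x, y, tau, z=1.959963984540054):
    """Naive 95% interval for the signal rate mu (stand-in method only).

    x is the on-count, y the off-count, tau the off/on exposure ratio.
    The estimate x - y/tau has estimated variance x + y/tau**2.
    """
    mu_hat = x - y / tau
    half = z * math.sqrt(x + y / tau ** 2)
    return max(0.0, mu_hat - half), max(0.0, mu_hat + half)

def true_coverage(mu, b, tau, interval=wald_interval):
    """Probability that the interval contains mu, by direct summation.

    The sums over x and y are truncated far out in the Poisson tails,
    where the omitted probability is negligible.
    """
    def cutoff(lam):
        return int(lam + 10 * math.sqrt(lam) + 30)

    cover = 0.0
    for x in range(cutoff(mu + b) + 1):
        px = poisson_pmf(x, mu + b)
        for y in range(cutoff(tau * b) + 1):
            lo, hi = interval(x, y, tau)
            if lo <= mu <= hi:
                cover += px * poisson_pmf(y, tau * b)
    return cover

# Coverage of the stand-in method at one point of the parameter space
coverage_example = true_coverage(mu=5.0, b=3.0, tau=2.0)
```

Any other method can be checked the same way by passing a different `interval` function with the same (x, y, tau) signature.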

[22] Some Features of R You Might Not Yet Know

Abstract: Talk at SIDIM XXX about startup customization in R using Dropbox, .First and .Rprofile. I also discuss Rcpp and R Shiny.

my .Rprofile is here

[21] What Country is the All-Time Best in the Worldcup?

Abstract: I analyze data from all the Worldcup tournaments held to answer this question. The main statistical point of the paper is that there is no single obviously correct answer. In just about any statistical (or even scientific) research we have to make a number of essentially subjective choices. The best we can do is to make it clear what those choices were and why we made them the way we did.

[20] Identifying Students at Risk

Abstract: An analysis of UPR-Mayaguez student data to develop a method for identifying students at a high risk of not returning for the second year or of not graduating.

[19] The Power to See: A New Graphical Test of Normality

with Sivan Aldor-Noima, Lawrence D. Brown, Andreas Buja and Robert A. Stine

Abstract: Many statistical procedures assume the underlying data generating process involves Gaussian errors. Among the popular tests for normality, only the Kolmogorov-Smirnov test has a graphical representation. Alternative tests, such as the Shapiro-Wilk test, offer little insight as to how the observed data deviate from normality. In this paper we discuss a simple new graphical procedure which provides simultaneous confidence bands for a normal quantile-quantile plot. These bands define a test of normality and are narrower in the tails than those related to the Kolmogorov-Smirnov test. Correspondingly the new procedure has greater power to detect deviations from normality in the tails.

Published in The American Statistician (2013), Vol 67/4

DOI: 10.1080/00031305.2013.847865

a free copy of the paper is available here

The routines in R are available here. R routines for probability plots of general distributions can be downloaded from here (these routines require a fully specified null distribution).

[18] Estimating a Signal In the Presence of an Unknown Background

with Angel Lopez

Abstract: We describe a new method for fitting distributions to data which only requires knowledge of the parametric form of either the signal or the background but not both. The unknown distribution is fit using a non-parametric kernel density estimator. The method returns parameter estimates as well as limits on those estimates. Simulation studies show that these estimates are unbiased and that the limits on the estimates are correct.

Published in Nuclear Instruments and Methods in Physics Research A, (2012) Volume 685, p. 16-21.

DOI: 10.1016/j.nima.2012.05.029

How to run the app

C Code for Semiparametric Fitting

Abstract: C++ code for the semiparametric fitting discussed in [17a]

[17] Solution to Banff 2 Challenge Based on Likelihood Ratio Test

Abstract: We describe our solution to the Banff 2 challenge problems as well as the outcomes.

[16] A Test for Equality of Distributions in High Dimensions

with Angel Lopez

Abstract: We present a method which tests whether or not two datasets (one of which could be Monte Carlo generated) might come from the same distribution. Our method works in arbitrarily high dimensions.

[15] A Shared Spatial Cache Model for Mobile Environments

with Fernando J. Maymi (West Point Military Academy) and Manuel Rodriguez-Martinez (UPRM)

Abstract: In many scenarios, particularly in military and emergency response operations, mobile nodes that are in close proximity to each other exhibit a high degree of data affinity. For example, all soldiers in the same region, regardless of their specialty, will want to know all nearby threats, as well as all friendly assets. Since relaying queries to a distant server is costly in terms of bandwidth and battery power, it would be ideal to use local resources that are only a hop away. In this paper we propose a shared spatial cache that can be thought of as residing in a region rather than in any given node. Each node that participates in the cache holds an expendable part of the data, so that the loss of any node or small group of nodes can be tolerated with little or no degradation of service. We describe the analytical models that verify our claims and show the results of extensive simulations that validate our models under simulated but realistic conditions.

Published in the Proceedings of MobiDE'2010, Ninth International ACM Workshop on Data Engineering for Wireless and Mobile Access, June 6th, 2010, Indianapolis, Indiana, USA (in conjunction with SIGMOD/PODS 2010)

eConf C030908 (2003) MOBT002
DOI: 10.1145/1850822.1850834

[14] Limits, discovery and cut optimization for a Poisson process with uncertainty in background and signal efficiency: TRolke 2.0.

with J. Lundberg, J. Conrad, and A. Lopez

Abstract: A C++ class was written for the calculation of frequentist confidence intervals using the profile likelihood method. Seven combinations of Gaussian, Poissonian and Binomial uncertainties are implemented. The package provides routines for the calculation of upper and lower limits, sensitivity and related properties. It also supports hypothesis tests which take uncertainties into account. It can be used in compiled C++ code, in Python or interactively via the ROOT analysis framework.

DOI: 10.1016/j.cpc.2009.11.001

[13] A Test for the Presence of a Signal

with A. Lopez

Abstract: We describe a statistical hypothesis test for the presence of a signal based on the likelihood ratio statistic. We derive the test for several cases of interest and also show that for those cases the test works very well, even far out in the tails of the distribution. We also study extensions of the test to cases where there are multiple channels.

[12] Limits and Confidence Intervals in the Presence of Nuisance Parameters

with A. Lopez and J. Conrad

Abstract: We study the frequentist properties of confidence intervals computed by the method known to statisticians as the Profile Likelihood. It is seen that the coverage of these intervals is surprisingly good over a wide range of possible parameter values for important classes of problems, in particular whenever there are additional nuisance parameters with statistical or systematic errors.

Published in Nuclear Instruments and Methods in Physics Research A, 551/2-3, 2005, pp. 493-503

DOI: 10.1016/j.nima.2005.05.068

For the routines to carry out the calculations go here.
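
As an illustration of the profile likelihood idea in [12], here is a minimal Python sketch for one simple case; it is not the routines linked above. It assumes an on-count x ~ Poisson(mu + b), a background sideband count y ~ Poisson(tau*b), a signal rate constrained to mu >= 0, and the large-sample chi-square cutoff with one degree of freedom. All function names are illustrative.

```python
import math

CHI2_95_1DF = 3.841458820694124  # 95% quantile of chi-square, 1 d.o.f.

def _xlogy(k, v):
    """k*log(v) with the convention 0*log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(v)

def log_lik(mu, b, x, y, tau):
    """Poisson log-likelihood, dropping terms that do not depend on (mu, b)."""
    return _xlogy(x, mu + b) - (mu + b) + _xlogy(y, tau * b) - tau * b

def profiled_b(mu, x, y, tau):
    """Background rate that maximizes the likelihood for a fixed mu.

    It is the nonnegative root of
    (1+tau)*b**2 + ((1+tau)*mu - x - y)*b - y*mu = 0.
    """
    B = (1 + tau) * mu - x - y
    disc = math.sqrt(B * B + 4 * (1 + tau) * y * mu)
    if B > 0:  # algebraically equivalent form that avoids cancellation
        return 2 * y * mu / (B + disc)
    return (-B + disc) / (2 * (1 + tau))

def profile_interval(x, y, tau, crit=CHI2_95_1DF):
    """Interval {mu >= 0 : 2*(max loglik - profile loglik(mu)) <= crit}."""
    mu_hat = max(0.0, x - y / tau)
    ll_max = log_lik(mu_hat, profiled_b(mu_hat, x, y, tau), x, y, tau)

    def stat(mu):
        return 2 * (ll_max - log_lik(mu, profiled_b(mu, x, y, tau), x, y, tau))

    def bisect(inside, outside):
        # inside satisfies stat <= crit, outside does not
        for _ in range(200):
            mid = 0.5 * (inside + outside)
            if stat(mid) <= crit:
                inside = mid
            else:
                outside = mid
        return inside

    hi = mu_hat + 1.0
    while stat(hi) <= crit:  # expand until we are outside the interval
        hi = 2 * hi + 1.0
    upper = bisect(mu_hat, hi)
    lower = 0.0 if stat(0.0) <= crit else bisect(mu_hat, 0.0)
    return lower, upper

# Example: 30 events in the signal region, 10 in a sideband twice as large
lower_example, upper_example = profile_interval(x=30, y=10, tau=2.0)
```

When the statistic at mu = 0 is below the cutoff, the sketch reports a lower limit of 0, so the result reads as an upper limit.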

[11] How to Claim a Discovery

with A. Lopez

Abstract: We describe a statistical hypothesis test for the presence of a signal. The test allows the researcher to fix the signal location and/or width a priori, or perform a search to find the signal region that maximizes the signal. The background rate and/or distribution can be known or might be estimated from the data. Cuts can be used to bring out the signal.

Published in Proceedings of PHYSTAT2003: Statistical Problems in Particle Physics, Astrophysics and Cosmology, SLAC, p41-44.

[10] Search for Rare and Forbidden 3-body Di-muon Decays of the Charmed Mesons D+ and D+s

A high energy physics paper using the analysis tools developed in [1], [2], [3] and [5]

DOI: 10.1016/j.physletb.2003.07.079

[9] Calibration for Simultaneity: (Re) Sampling Methods for Simultaneous Inference with Applications to Function Estimation and Functional Data

with Andreas Buja, in preparation for resubmission to JASA.

Abstract: We describe and illustrate a simple Monte Carlo technique for carrying out simultaneous inference with arbitrarily many statistics. Special cases of the technique have appeared in the literature, but there exists widespread unawareness of the simplicity and broad applicability of this solution to simultaneous inference. Simultaneous inference for multiple statistics gives the appearance of an ill-posed search problem because it is not clear how to choose among the too many possibilities of simultaneous coverage regions. The problem can, however, be simplified by restricting the search to a one-parameter family of nested regions and selecting the region whose estimated coverage probability equals the desired value. Natural one-parameter families are readily available.
The technique applies whenever inference is based on a single distribution. A nonexhaustive list of examples of such distributions is: 1) fixed distributions such as standard normals when diagnosing distributional assumptions, 2) conditional null distributions in exact tests with Neyman structure, in particular permutation tests, 3) bootstrap distributions for bootstrap confidence regions, 4) Bayesian posterior distributions for high-dimensional posterior probability regions, or 5) predictive distributions for multiple prediction intervals.
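
The calibration idea in this abstract can be sketched in a few lines of Python. This is only one possible reading, not the paper's code. It assumes the one-parameter family consists of bands running from the j-th smallest to the j-th largest simulated value of each statistic, and that the draws contain no ties. The function name and the simulated standard normal example are illustrative.

```python
import random

def calibrated_bands(draws, level=0.95):
    """Pick simultaneous bands from a nested one-parameter family.

    draws is a list of n Monte Carlo draws, each a list of k statistics.
    Region j is the box whose k-th side runs from the j-th smallest to the
    j-th largest simulated value of statistic k. Regions shrink as j grows,
    so we return the largest j whose estimated simultaneous coverage is
    still at least `level`.
    """
    n, k = len(draws), len(draws[0])
    depth = [n] * n          # depth[i]: largest j such that draw i is in region j
    columns = []             # sorted simulated values of each statistic
    for col in range(k):
        order = sorted(range(n), key=lambda i: draws[i][col])
        columns.append([draws[i][col] for i in order])
        for rank0, i in enumerate(order):
            depth[i] = min(depth[i], rank0 + 1, n - rank0)

    def coverage_of(j):
        return sum(d >= j for d in depth) / n

    j = 1                    # region 1 (min to max) always covers every draw
    while j < n // 2 and coverage_of(j + 1) >= level:
        j += 1
    bands = [(c[j - 1], c[n - j]) for c in columns]
    return j, coverage_of(j), bands

# Example: 20 independent standard normal statistics, 2000 simulated draws
rng = random.Random(1)
draws = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(2000)]
j, coverage, bands = calibrated_bands(draws, level=0.95)
```

With 2000 draws, pointwise 95% bands would correspond to j = 50. The calibrated j is much smaller, which is the widening of the bands needed for simultaneous coverage.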

[8] A Glossary of Selected Statistical Terms

with Harrison Prosper and Jim Linneman

Abstract: This glossary brings together some statistical concepts that physicists may happen upon in the course of their work. The aim is not absolute mathematical precision---few physicists would tolerate such a burden. Instead, (one hopes) there is just enough precision to be clear. We begin with an introduction and a list of notations. We hope this will make the glossary, which is in alphabetical order, somewhat easier to read.

Published in Proceedings Of The Conference On: Advanced Statistical Techniques in Particle Physics, Institute for Particle Physics Phenomenology, University of Durham, UK (2002), 314-330

[7] Bias-Corrected Confidence Intervals for Rare Searches

with A. Lopez

Abstract: A short version of [3].

Published in Proceedings Of The Conference On: Advanced Statistical Techniques in Particle Physics, Institute for Particle Physics Phenomenology, University of Durham, UK (2002), 44-48

[6] Statistical Analysis of the SELEX Double Charm Signals

with A. Lopez

Abstract: A discussion of the statistical significance of some discoveries claimed by the SELEX collaboration.

[5] Correcting the Minimization Bias in Searches for Small Signals

with A. Lopez

Abstract: We discuss a method for correcting the bias in the limits for small signals if those limits were found based on cuts that were chosen by minimizing a criterion such as sensitivity. This type of bias is commonly present when a "minimization" and an "evaluation" are done at the same time. We propose to use a variant of the statistical bootstrap to adjust the limits. A Monte Carlo study shows that these new limits have correct coverage.

Published in Nuclear Instruments and Methods in Physics Research A, vol 503/3, 2003, pp 617 - 624

DOI: 10.1016/S0168-9002(03)00428-5

[4] Setting Limits for Poisson Rates in the Presence of Noise

Abstract: A short version of [1].

Published in Proceedings of SIDIM 2000

[3] Confidence Intervals and Upper Bounds for Small Signals in the Presence of Background Noise

with A. Lopez

Abstract: We discuss a new method for setting limits on small signals in the presence of background noise. The method is based on a combination of a two dimensional confidence region and the large sample approximation to the likelihood ratio test statistic. It automatically quotes upper limits for small signals and two-sided confidence intervals for larger samples. We show that this method gives the correct coverage and also has good power.

Published in Nuclear Instruments and Methods in Physics Research A, V.458, 2001, 745-758

DOI: 10.1016/S0168-9002(00)00935-9

[2] Stock Abundance and Potential Yield of the Queen Conch Resource in Belize

with R. Appeldorn

Report to the CARICOM Fisheries Research Assessment and Management Program, 1997

[1] Continuous-time Markov Processes in Geology

Journal of Mathematical Geology, Vol 23, # 3, April 1991

DOI 10.1007/BF02065784

Routines

1) A C++ routine to find the limits discussed in [11] is here. You can also carry out these calculations using ROOT; for this go to http://root.cern.ch/root/html/TRolke.html.