Poisson Models to Predict Scoring Rates in Hockey
Abstract: Trying to predict the score of a hockey game can be a complicated and seemingly impossible task due to the fast-paced nature of the game. We propose to model scoring rates by investigating various factors such as a team’s offensive ability, defensive ability of its opponent, and home-ice advantage. Considering that hockey scores are not normally distributed, we assume that scores follow a Poisson distribution and use these factors to build a Poisson regression model for the scoring rates. We apply this model to data from the 2008-2009 ECAC Division I Women’s Ice Hockey season. Using the Poisson model we examine each team’s scoring to generate a Poisson scoring rate and use the fit to produce an offensive rating, defensive rating, and predicted winning percentage for each team.
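As a rough illustration of how Poisson scoring rates could turn into a predicted result, the sketch below assumes a log-linear rate, log(rate) = baseline + offense - opponent defense + home-ice bonus, and independent Poisson scores for the two teams. The function names and every numeric value are hypothetical; none of them are estimates from the ECAC data, and the abstract does not state the exact form of the regression.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate)."""
    return math.exp(-rate) * rate**k / math.factorial(k)


def scoring_rate(baseline, offense, opp_defense, home_bonus=0.0):
    """Assumed log-linear form: log(rate) = baseline + offense - opp_defense + home_bonus."""
    return math.exp(baseline + offense - opp_defense + home_bonus)


def outcome_probabilities(home_rate, away_rate, max_goals=25):
    """Return (P(home win), P(tie), P(away win)) for independent Poisson scores.

    Scores above max_goals are ignored; for hockey-sized rates the
    truncated probability mass is negligible.
    """
    home_win = tie = away_win = 0.0
    for h in range(max_goals + 1):
        p_h = poisson_pmf(h, home_rate)
        for a in range(max_goals + 1):
            p = p_h * poisson_pmf(a, away_rate)
            if h > a:
                home_win += p
            elif h == a:
                tie += p
            else:
                away_win += p
    return home_win, tie, away_win


# Hypothetical ratings, for illustration only.
home_rate = scoring_rate(baseline=1.0, offense=0.2, opp_defense=0.1, home_bonus=0.1)
away_rate = scoring_rate(baseline=1.0, offense=0.1, opp_defense=0.2)
print(outcome_probabilities(home_rate, away_rate))
```

Summing such probabilities over a schedule is one way a predicted winning percentage could be derived from fitted rates.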
Evaluating the Robustness of Competing Clustering Algorithms
Abstract: Grouping individual data points that have no visible traits other than their proximity to other points on a graph (which in and of itself is not always obvious) is a useful but often difficult task. One tool to address this problem is cluster analysis, which is an unsupervised process that allows us to construct groups of similar observations. Because there are numerous competing clustering algorithms, a natural question is, “Which algorithm is the most robust at identifying groups in data?” There are various ways to approach this question, but we focus on comparing the performance of five of the more popular clustering algorithms across various scenarios. In particular, we use a simulation study to assess the performance of each algorithm when the correct number of clusters is given as well as the robustness under various incorrect numbers of clusters. The results are then used to suggest which algorithms are best in each situation. We then apply the best algorithm to grouping hockey player data.
Classification Trees and Effective Recruiting in College Sports
Abstract: Classification trees are used with a categorical response variable. The goal of a classification tree is to derive a model that predicts to which category a particular subject or individual belongs, based on one or more explanatory factors. For example, we could use a classification tree to predict the level of success for college soccer players based upon information available to coaches during the recruiting of these athletes. These classification trees are displayed as a decision tree that has a start node which then branches into other nodes. Using classification and regression trees (CART), we develop the ability to fit a tree to data. Once we have formulated a CART model through pruning and impurity, we evaluate its predictive ability. We apply this methodology to data obtained from the St. Lawrence Women's Soccer team. Upon finding the best CART, we compare it against a logistic regression model to check its accuracy. If results are sufficient and measurable, we can use the model to improve future recruiting.
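The abstract mentions impurity without naming a measure. A common choice in CART is the Gini index, which the sketch below assumes; it scores candidate thresholds on a single numeric predictor and picks the one with the lowest weighted impurity. The data are made up, the function names are hypothetical, and a real tree would repeat this search recursively and then prune.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two groups formed by value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return the best one.

    Assumes the predictor takes at least two distinct values.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: split_impurity(values, labels, t))


# Made-up recruiting scores and outcomes, for illustration only.
scores = [1, 2, 3, 6, 7, 8]
outcomes = ["low", "low", "low", "high", "high", "high"]
cut = best_threshold(scores, outcomes)
print(cut, gini(outcomes), split_impurity(scores, outcomes, cut))
```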
Fourier Analysis of Boolean Functions and Its Applications
Abstract: In 1988, Jeff Kahn, Gil Kalai and Nathan Linial were the first people to use Fourier analysis to prove theorems on Boolean functions. Their method has since led to several major breakthroughs on problems in various fields, especially in social choice theory. Kalai subsequently applied the idea to prove the famous Arrow's Impossibility Theorem, which is considered one of the most important theorems in social choice theory. Not only did he prove the theorem, but he also found an optimal bound for the probability of obtaining a rational outcome of a social choice function. We will describe notions of Fourier analysis on Boolean functions and Kalai's proof, and also discuss other applications.
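To make the basic notion concrete, the sketch below computes the Fourier coefficients of a Boolean function under the usual convention f: {-1, 1}^n -> {-1, 1}, where the coefficient on a set S of coordinates is the average of f(x) times the product of x_i over i in S. The choice of three-voter majority as the example is ours, not taken from the talk.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, n):
    """Return {S: f_hat(S)} with f_hat(S) = E_x[f(x) * prod_{i in S} x_i].

    x is uniform on {-1, 1}^n and S is a tuple of coordinate indices.
    """
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for S in combinations(range(n), size):
            total = sum(f(x) * prod(x[i] for i in S) for x in points)
            coeffs[S] = total / len(points)
    return coeffs


def majority(x):
    """Majority vote of an odd number of +/-1 ballots."""
    return 1 if sum(x) > 0 else -1


coeffs = fourier_coefficients(majority, 3)
for S, value in coeffs.items():
    print(S, value)

# Parseval: for a +/-1 valued function the squared coefficients sum to 1.
print(sum(v * v for v in coeffs.values()))
```

For three voters this recovers the expansion Maj(x) = (x1 + x2 + x3)/2 - (x1 x2 x3)/2.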
Does Iron-Fortified Fish Sauce Reduce the Presence of Anemia?: Data Analysis and Simulations
Abstract: Anemia is a blood disorder that, if left untreated, can cause serious complications, especially in women of childbearing age. A study was done in 2004 to investigate the effectiveness of iron-fortified fish sauce for controlling anemia rates in women of childbearing age from Vietnam. In this study, entire villages were assigned to either a treatment or control group. Studies of this nature are often referred to as cluster randomization trials. Across disciplines, several different methods have been used to analyze this type of data. We investigated whether the conclusions made from these data depend on the choice of method. Further, we performed a series of simulations to investigate how different aspects of the experimental design, such as cluster imbalance, intracluster correlation, and the number of clusters, impact the power of these statistical methods.
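One standard way to see why intracluster correlation affects power is the design effect for equal-sized clusters, 1 + (m - 1) * ICC, where m is the cluster size. The sketch below applies that textbook formula to hypothetical numbers; it is not the simulation used in the study and it does not address cluster imbalance.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations carrying the same information."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Hypothetical design: 20 villages of 50 women each.
for icc in (0.0, 0.01, 0.05, 0.2):
    print(icc, round(effective_sample_size(20, 50, icc), 1))
```

Even a small intracluster correlation shrinks the effective sample size sharply, which is why the number of clusters matters more than the number of individuals per cluster.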
Dan Look, SLU Faculty
A Visual Introduction to Complex Dynamics
Abstract: Complex Dynamics deals with the behavior of functions under repeated application, or iteration. The most memorable aspects of this field are the beautiful fractal images representing locations where the function behaves "chaotically". We will introduce basic terms and describe the mathematics behind the images.
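A minimal sketch of iteration, assuming the quadratic family z -> z^2 + c (the abstract does not say which functions the talk uses). Starting points whose orbits stay bounded form the filled Julia set, and its boundary is where the chaotic behavior lives. The escape radius of 2 is valid for this family when |c| <= 2; the function names and the choice c = -1 are ours.

```python
def escape_time(z0, c, max_iter=100, radius=2.0):
    """Iterate z -> z*z + c starting from z0.

    Returns the first step at which |z| exceeds radius, or None if the
    orbit stays within radius for max_iter steps.
    """
    z = complex(z0)
    for step in range(1, max_iter + 1):
        z = z * z + c
        if abs(z) > radius:
            return step
    return None


def render_julia(c, width=60, height=24):
    """Coarse text picture: '#' marks starting points that never escaped."""
    rows = []
    for j in range(height):
        im = 1.2 - 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            re = -2.0 + 4.0 * i / (width - 1)
            row += "#" if escape_time(complex(re, im), c) is None else "."
        rows.append(row)
    return "\n".join(rows)


print(render_julia(-1))
```

Coloring each point by its escape time instead of a single character is what produces the familiar fractal images.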
Modelling the Spread of Raccoon Rabies in Connecticut using Spatial Data
Abstract: In this paper we will look at the spread of raccoon rabies across the state of Connecticut. In 1993 raccoon rabies appeared on the western end of Connecticut, and within 4 years made its way to the easternmost coast. Spatial data provided by Waller and Gotway (2004) includes the coordinates of 169 Connecticut townships and the relative times of the first appearance of rabies at each. Methods such as trend surface modeling will be used to investigate the spatial spread of raccoon rabies through the state of Connecticut.
Forecasting the Natural Gas Market: Applications of Time Series Analysis
Abstract: In this project I consider forecasting the natural gas market from rig count, production, and wellhead price, using data supplied by the Energy Information Administration. Utilizing smoothing, univariate, and transfer function methods, I establish time series models for each of the three series independently and then consider models to forecast natural gas prices based upon changes in rig count and production values.
Predicting Wins for Baseball Games
Abstract: Baseball is the great American pastime. In this study we examine different aspects of baseball games to determine what factors play a role in predicting the winning team for a specific game or an entire season. To predict who is likely to win individual games, we consider factors such as each team’s offensive or defensive ability, batting lineup, past game scores, and previous winning percentage. We will also model winning percentage of an overall season, based on offensive production, pitching, and defense. We use these models and computer simulations to examine how often the best team actually wins. Play ball!
Detecting Hotspots with Spatial Analysis
Abstract: The assessment and detection of areas with abnormally high incidence of rare cancers, sometimes called hotspots, is an important epidemiological function. Spatial analysis is useful in determining if there are clusters of high rates of rare diseases in a particular location. This type of analysis aims to determine if the rates of cancer are randomly dispersed or are clustered in uncommon ways. Using a data set of leukemia rates in Upstate New York, I use spatial analysis to assess whether there are 'hotspots' present.
Investigating the Convergence Rate of Sampling Distributions from Skewed Populations
Abstract: The Central Limit Theorem (CLT) states that the distribution of the sample mean of independent and identically distributed random variables converges to the normal distribution as the sample size increases. A common rule of thumb is to consider sample sizes greater than 30 as "large enough" samples to use the CLT as an approximation. However, what counts as "large enough" depends on how far the distribution of the individual observations departs from normality. Using the level of skewness as a measure of non-normality, this study investigates the normality of the distribution of sample means using Gamma distributed random variables with different levels of skewness. Various simulation techniques in R are used to test whether the use of the CLT is a good approximation for the distribution of the sample mean. As skewness increases, a larger sample is needed to use the CLT as a good approximation of the distribution of the sample mean. This study employs the Kolmogorov-Smirnov test to measure the normality of the distribution of sample means and looks at the coverage rate for large sample confidence intervals. The results from this study will help us decide, for a given skewness level and sample size, when it is appropriate to use the CLT or if alternative methods (such as the bootstrap) are needed.
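The effect of skewness can be sketched with a small simulation. The study itself uses R; Python is used here only for illustration, and the shapes, sample size, replication count, and seed are arbitrary choices of ours. A Gamma distribution with shape k has skewness 2/sqrt(k), and the mean of n independent draws has skewness 2/sqrt(n*k), so the sample means stay visibly skewed when k is small. The sketch checks skewness only; it does not run the Kolmogorov-Smirnov test or compute coverage rates.

```python
import math
import random


def gamma_skewness(shape):
    """Theoretical skewness of a Gamma distribution with the given shape."""
    return 2.0 / math.sqrt(shape)


def sample_skewness(values):
    """Moment-based sample skewness m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2**1.5


def simulate_sample_means(shape, sample_size, reps, rng):
    """Means of `reps` independent samples of Gamma(shape, scale=1) draws."""
    return [
        sum(rng.gammavariate(shape, 1.0) for _ in range(sample_size)) / sample_size
        for _ in range(reps)
    ]


rng = random.Random(2024)
for shape in (4.0, 1.0, 0.25):
    means = simulate_sample_means(shape, sample_size=30, reps=5000, rng=rng)
    expected = 2.0 / math.sqrt(30 * shape)  # skewness of the mean of 30 draws
    print(shape, gamma_skewness(shape), round(sample_skewness(means), 3), round(expected, 3))
```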
Exploring Markov Chain Monte Carlo Techniques
Abstract: Markov Chain Monte Carlo (MCMC) methods are powerful algorithms that enable statisticians to explore information about probability distributions through computer simulations when exact theoretical methods are not feasible. The Gibbs Sampler, for example, allows us to gather information about marginal and joint distributions of multivariate densities assuming that we know information about the conditional distributions. Of particular interest is the use of MCMC methods in Bayesian statistics to help estimate posterior distributions. In this talk we illustrate several uses of MCMC methods through computer simulation and applications to real data.
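A minimal Gibbs sampler sketch, assuming a standard bivariate normal target with correlation rho (our choice of example, not necessarily one from the talk). Each full conditional is itself normal, X | Y = y ~ N(rho*y, 1 - rho^2) and symmetrically for Y, so the sampler alternates draws from the two conditionals to explore the joint distribution. The burn-in length and seed are arbitrary.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw (x, y) pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)  # conditional standard deviation
    x = y = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)  # draw X given the current Y
        y = rng.gauss(rho * x, sd)  # draw Y given the new X
        if i >= burn_in:
            samples.append((x, y))
    return samples


def correlation(pairs):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(p[0] for p in pairs) / n
    my = sum(p[1] for p in pairs) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pairs)
    sxx = sum((p[0] - mx) ** 2 for p in pairs)
    syy = sum((p[1] - my) ** 2 for p in pairs)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(rho=0.6, n_samples=20000)
print(round(correlation(draws), 3))
```

The sample correlation of the draws should land close to the target rho, which is a quick sanity check that the chain is sampling the intended joint distribution.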
Sam Vandervelde, SLU Faculty
Euler, Partitions and Triangular Numbers
Abstract: A theorem of Euler asserts that there are as many partitions of n into distinct parts as there are partitions into odd parts. But this result falls apart when one attempts to involve even numbers… or does it? In this talk we will examine Euler's famous result and discover how triangular numbers come to the rescue for partitions involving even numbers. In particular, we will establish a new companion result, which states that the number of partitions of n into distinct parts is also equal to the number of partitions of n into even parts along with exactly one triangular part.
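Both counts can be checked numerically for small n. The sketch below counts partitions into distinct parts, into odd parts, and into even parts together with exactly one triangular part. It treats 0 as a triangular number, so the triangular part may be empty; that convention is our assumption about how the result is stated. The function names are hypothetical.

```python
def count_partitions(n, parts):
    """Number of partitions of n using the allowed part sizes, repeats allowed."""
    ways = [1] + [0] * n
    for p in parts:
        for total in range(p, n + 1):
            ways[total] += ways[total - p]
    return ways[n]


def count_distinct(n):
    """Number of partitions of n into distinct parts."""
    ways = [1] + [0] * n
    for p in range(1, n + 1):
        # Descending totals ensure each part size is used at most once.
        for total in range(n, p - 1, -1):
            ways[total] += ways[total - p]
    return ways[n]


def count_odd(n):
    """Number of partitions of n into odd parts."""
    return count_partitions(n, range(1, n + 1, 2))


def triangular_numbers_up_to(n):
    """Triangular numbers 0, 1, 3, 6, ... not exceeding n (0 included by assumption)."""
    result, k = [], 0
    while k * (k + 1) // 2 <= n:
        result.append(k * (k + 1) // 2)
        k += 1
    return result


def count_even_plus_triangular(n):
    """Partitions of n into even parts plus exactly one triangular part."""
    return sum(
        count_partitions(n - t, range(2, n - t + 1, 2))
        for t in triangular_numbers_up_to(n)
    )


for n in range(1, 11):
    print(n, count_distinct(n), count_odd(n), count_even_plus_triangular(n))
```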