Festival of Science 2018 Student Abstracts

Math, Computer Science, and Statistics Department

Festival of Science 2018 was held on Friday, April 27, 2018.

Josh Ma - Mathematics - Faculty Advisor: Dr. Daniel Look

Chaotic Dynamical Systems in Financial Markets

Naturally occurring time series typically take the form of nonlinear dynamical systems: deterministic progressions of initial conditions under fixed, iterative rules. Under certain conditions, these systems exhibit interesting behavior and even converge to recognizable patterns. Certain types of dynamical systems display deterministic chaos, marked by sensitive dependence on initial conditions. Though these models may seem simple, they can appear unpredictable and be difficult to analyze. Examples of naturally occurring time series that could display chaotic behavior include the Dow Jones Industrial Average and the Dollar/Peso exchange rate.

Elliot Martin-Mathematics - Faculty Advisor: Daniel Look

Bifurcation Diagrams and Chaotic Maps

In the 1976 paper "Simple Mathematical Models with Very Complicated Dynamics," Robert May discusses the interesting dynamics that occur within the logistic map, x_{n+1} = r x_n (1 - x_n). This map relates to a wide range of fields, including population modeling. Many of the ideas that May introduced in his paper would eventually become a part of the field of Chaos Theory. We focus on what it means for a function to act chaotically, and we examine bifurcation diagrams, which are used to visualize drastic changes in the population dynamics.
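
As a rough illustration of the dynamics described above, the following Python sketch iterates the logistic map x -> r x (1 - x) and records the values the orbit settles on for a given parameter r. Collecting these values over a fine grid of r is what a bifurcation diagram plots. The function name `attractor`, the starting value 0.5, the transient length, and the rounding tolerance are assumptions chosen for illustration and are not taken from the talk.

```python
def logistic(r, x):
    """One step of the logistic map x -> r * x * (1 - x)."""
    return r * x * (1.0 - x)


def attractor(r, x0=0.5, transient=2000, keep=64, digits=4):
    """Return the sorted set of values the orbit visits after a transient.

    x0, transient, keep, and digits are illustrative choices.
    """
    x = x0
    for _ in range(transient):   # discard the approach to the attractor
        x = logistic(r, x)
    seen = set()
    for _ in range(keep):        # record where the orbit settles
        x = logistic(r, x)
        seen.add(round(x, digits))
    return sorted(seen)


if __name__ == "__main__":
    for r in (2.8, 3.2, 3.5, 3.9):
        points = attractor(r)
        print(f"r = {r}: {len(points)} distinct value(s)")
```

A single value indicates a stable fixed point, two or four values indicate period-doubled cycles, and many distinct values suggest chaotic behavior.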

Hussien Elrayes-Mathematics - Faculty Advisor: Daniel Look

Discrete-time Dynamical Systems: Bifurcation Theory and Applications

Discrete dynamical systems are used to model the behavior of processes such as population growth and supply/demand. These systems are defined by difference equations, which are functions connecting the current state of a system to its next state. These states are usually connected by a parameter (e.g. a ratio). However, certain changes in the parameter value can lead to drastic changes in the behavior of the system; these are referred to as bifurcations. Different difference maps experience different types of bifurcations. We discuss some common types of bifurcations and their implications for their respective systems.

Jennifer Scudder-Mathematics - Faculty Advisor: Daniel Look

What Does Math Have To Do With Taffy-Making?

How does taffy-making relate to math? Mathematical chaos is seen more often than one may think, including in taffy-making. In this project, we investigate various chaotic mixing theories in relation to Smale's Horseshoe Map. The Horseshoe Map can be viewed as the result of the consistent squishing and stretching of a rectangle into the shape of a horseshoe. The repetition of this act creates mathematical chaos. This stretching and folding process is also seen in the important final step in the creation of taffy, as it incorporates many tiny air bubbles throughout the candy to make it lighter and chewier.

Delaney Spink-Mathematics - Faculty Advisor: Daniel Look

The Beauty in Chaos: Fractals and Art

Fractals were given their name in 1975 by Benoit Mandelbrot, but since long before that time fractals have captured the minds and imaginations of artists and mathematicians alike for both their beauty and their complexity. Fractals, on the surface, can easily be recognized as art on their own - but there is much more to them than what first meets the eye. We will be exploring the intersection between art and fractals and some of the interesting properties of fractal art.

Leo Kraunelis-Mathematics - Faculty Advisor: Daniel Look

A Chaotic Commute: Chaos Theory and Traffic Flow

Traffic can be frustrating, time-consuming, and seemingly unavoidable. In urban planning it is important to predict traffic flow; however, this is difficult due to the chaotic nature of traffic. This talk will give a brief overview of chaos theory. We will then look at how chaos theory can be applied to transportation flows and how it is used to avoid inefficiencies such as traffic jams.

Sara Holmgren-Mathematics - Faculty Advisor: Daniel Look

Chaos in Ecology: Discrete Dynamics of Populations

This talk investigates chaos as it relates to ecology, using discrete dynamics of populations as represented by simple logistic models. Predicting human population over long periods of time can be difficult, so we focus on populations that developed in relative isolation. In particular, we will discuss the dynamics given by models for the population of Easter Island and for biological populations with discrete generations. In these models, different parameter values lead to different dynamics of the populations: extinction, stable equilibriums, dynamic equilibriums, and transitions to chaos.

Aijia Wu-Mathematics - Faculty Advisor: Duncan Melville

Implication of Optimization in the Iron and Steel Industry

The iron and steel industry is one of the most important basic industries of a nation, so the importance of iron ore transportation cannot be ignored. My Honors SYE concerns the optimal method for iron ore transportation, based on applications of linear algebra. There are two essential principles in my plan: 1. Minimize the total transport volume (ton×kilometers) and, at the same time, minimize the number of trucks used, in order to minimize the cost of transportation; 2. For a given distance, maximize the number of trips in order to maximize the total amount of rock and ore transported. (The iron yield of rock is better than that of ore, so rock has priority; if the amounts are the same, we choose the plan with the smaller total volume.) Following these two principles, I set up two separate mathematical models to find the optimal plans: the number of electric forklifts to use and the shoveling positions at which to place them, and the number of trucks to use along with their routes and numbers of trips.

Yue Yang-Mathematics - Faculty Advisor: Duncan Melville

Studying the Properties of Young Tableaux: Replacing Young Tableaux with Certain Numbers

Young tableaux are combinatorial objects useful in representation theory and Schubert calculus. A standard Young tableau is a Young diagram (a diagram that represents a partition) filled so that the entries in each row and each column are increasing; normally those entries are the integers from 1 to n. This project asks: if you replace one integer with a smaller half-integer and refill the Young diagram, how many new standard Young tableaux are generated? During the study, I found three different ways to represent the process of filling Young tableaux, and one way to count the new Young tableaux.
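
For reference, the number of ordinary standard Young tableaux of a given shape (before any half-integer replacement) can be computed with the classical hook length formula: n! divided by the product of the hook lengths of all cells. The Python sketch below illustrates only that classical formula and is not the counting method developed in this project; the function name `count_standard_tableaux` is hypothetical.

```python
from math import factorial


def count_standard_tableaux(shape):
    """Count standard Young tableaux of a partition via the hook length formula.

    shape is a weakly decreasing list of positive row lengths, e.g. [3, 2].
    """
    n = sum(shape)
    product = 1
    for i, row_len in enumerate(shape):
        for j in range(row_len):
            arm = row_len - j - 1                          # cells to the right
            leg = sum(1 for r in shape[i + 1:] if r > j)   # cells below
            product *= arm + leg + 1                       # hook length of cell (i, j)
    return factorial(n) // product
```

For example, `count_standard_tableaux([3, 2])` counts the fillings of a two-row diagram with rows of length 3 and 2 by the integers 1 to 5.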

Funding: SLU Summer Fellowship

Khang Le-Mathematics - Faculty Advisor: Natasha Komarov

The Sprague-Grundy number and its application in Combinatorial Game Theory

Combinatorial game theory is a field of game theory dealing with games with perfect information. In particular, these games are two-person games with win-or-lose outcomes in which players take turns moving until a terminal position is reached (a player who cannot move loses). Impartial games are combinatorial games in which each player has the same set of possible moves from any given position. It has been shown that the Sprague-Grundy number is the key to understanding impartial games because a single Sprague-Grundy number measures the entire state of the game. I will show how to calculate the Sprague-Grundy number in a specific combinatorial game and present some winning strategies derived from that number. The game I am going to examine is called the SOS game, in which players take turns writing an S or an O in an empty square of a row consisting of squares, until one player completes SOS in consecutive squares.
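
To make the idea of a Sprague-Grundy number concrete, here is a minimal Python sketch for a much simpler impartial game than SOS: a subtraction game in which a move removes 1, 2, or 3 tokens from a pile, and a player who cannot move loses. The game, the `MOVES` rule, and the function names are assumptions chosen for illustration; the analysis of the SOS game itself is not reproduced here. The Grundy number of a position is the minimum excludant (mex) of the Grundy numbers of the positions reachable in one move.

```python
from functools import lru_cache

MOVES = (1, 2, 3)  # hypothetical rule: remove 1, 2, or 3 tokens per turn


def mex(values):
    """Smallest non-negative integer not contained in values."""
    m = 0
    while m in values:
        m += 1
    return m


@lru_cache(maxsize=None)
def grundy(n):
    """Sprague-Grundy number of a single pile of n tokens under MOVES."""
    options = {grundy(n - k) for k in MOVES if k <= n}
    return mex(options)


def first_player_wins(piles):
    """A sum of impartial games is a first-player win iff the XOR of the
    Grundy numbers of its components is nonzero."""
    total = 0
    for p in piles:
        total ^= grundy(p)
    return total != 0
```

A position with Grundy number 0 is a loss for the player about to move, which is why one number per component is enough to decide who wins a sum of games.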

Evan Page-Computer Science - Faculty Advisor: Choong-Soo Lee

Developing a Mobile Application Controlled LED Light System Using Raspberry Pi

There are many LED light-strip systems that are controlled by a remote or a mobile application on your phone. This project uses a Raspberry Pi system (a small single-board computer) to build a similar system from scratch. The Raspberry Pi system powers and controls the LED lights, and accepts messages from an Android app. Using the microphone, the Android app listens to music and analyzes it to compute the colors for the LED system. Then, the app communicates the colors by sending UDP (User Datagram Protocol) messages to the Raspberry Pi system. Preliminary testing results demonstrate that it provides a positive experience to the audience, and further testing will help tune the system for a better experience.
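
The UDP exchange described above can be sketched as follows. This is a Python illustration only (the actual sender is an Android app), and the wire format of three bytes per color, the address, and the port number are all assumptions rather than details of the project.

```python
import socket

PI_ADDRESS = ("192.168.1.50", 5005)  # hypothetical address and port of the Pi


def encode_color(red, green, blue):
    """Pack an RGB color into a 3-byte payload (an assumed wire format)."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be in 0..255")
    return bytes((red, green, blue))


def decode_color(payload):
    """Unpack a 3-byte payload back into an (r, g, b) tuple."""
    if len(payload) != 3:
        raise ValueError("expected exactly 3 bytes")
    return tuple(payload)


def send_color(red, green, blue, address=PI_ADDRESS):
    """Send one color as a single UDP datagram (no delivery guarantee)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(encode_color(red, green, blue), address)


def receive_colors(port=5005):
    """Receiver loop for the Pi side: yield each color as it arrives."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        while True:
            payload, _sender = sock.recvfrom(16)
            if len(payload) == 3:      # ignore malformed datagrams
                yield decode_color(payload)
```

UDP suits this kind of stream because a lost color update is quickly replaced by the next one, so retransmission would add delay without much benefit.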

Yuxi Zhang-Math, Computer Science and Statistics - Faculty Advisor: Ed Harcourt

Postural Control Measurement and Balance Training Using Mobile Technology

Impaired postural control is common in the elderly population and people with health conditions such as Parkinson’s disease, stroke, concussion, and multiple sclerosis. Recent advances in measuring capabilities of inertial sensors inside mobile devices offer a potential way to measure postural stability quantitatively and practically. For this project we built an Android application that uses accelerometer and gyroscope sensor data collected by mobile devices to measure patients’ performance during balance interventions and to deliver personalized task-orientated balance training.

Cameron Pilarski-Computer Science - Faculty Advisor: Ed Harcourt

My Closet: A social, mobile application for Android

Clothes are a necessity; the types of clothing that you wear can define who you are and help form an identity for yourself. For those who are untidy or unorganized, drawers can become a mess, and finding certain clothes or knowing what’s clean or dirty can be a pain. My Closet provides a social, mobile platform for users to keep track of what clothes are in their closet or laundry, and what outfits are planned for the day or special dates in the future. The social aspect allows users to message friends and peek into friends’ closets, with the ability to borrow and loan clothing articles, easily transferring them between closets in an effort to prevent friends from accidentally stealing your clothes ever again! My Closet makes all of these features easy to use by incorporating the latest Android Material Design components, with its back end driven by Google’s Firebase service, which handles in-app user authentication and user data storage.

Ryan Lough and Joseph Sullivan-Statistics - Faculty Advisor: Jessica Chapman

Textbooks to Playbooks: A Model to Predict What It Means to Have a “Successful” Career as a Professional Hockey Player

It is challenging for NHL scouts to determine whether a good college player will be able to translate that success to the professional level. This poses several interesting questions, including can the statistics of a collegiate hockey player in addition to the quality of their team be used to develop a statistical model to predict how successful of a professional career that player will have? And how exactly does one define a “successful” career? We have developed different models to predict a player’s success in professional hockey. There have been similar studies to predict how many points a player will be able to produce in the NHL using how many points they produced in college. However, there are very few models, if any, that take into consideration the strength of that player’s collegiate team and the difference it makes in their professional career. We use player data from Elite Prospects and team data from College Hockey News from the 2000-2001 season until present to build and assess models for predicting the professional success of college hockey players.

Wenhui Huo-Statistics - Faculty Advisor: Michael Schuckers

Using Statistical Classification Methods to Label Diagnosis of Breast Cancer

Many women are concerned about the possibility that they will develop breast cancer. Early detection and treatment have resulted in significant gains in preventing the progression of breast cancer. During the early stage, factors such as clump thickness, uniformity of cell size and shape, and number of cells are used to predict the diagnosis (benign or malignant) of a particular patient. These factors may show more predictive power as cancer cells develop. In order to provide prompt treatment, it is necessary to make predictions that are as accurate as possible. Therefore, using data on 699 instances obtained from the University of Wisconsin Hospitals, different statistical classification methods such as Logistic Regression, K-Nearest Neighbors, and Classification Trees have been evaluated and compared to see which method gives the most accurate predictions. By doing so, people can not only gain a better understanding of breast cancer but also learn about the application of different classification methods.

Margaret Musser-Statistics - Faculty Advisor: Michael Schuckers

Using the automated statistical program measuRing to detect tree rings and analyze width measurements

Dendrochronology, or tree-ring science, is commonly used to determine past environmental conditions and climate cycles. Doing so, however, requires tree coring, sample preparation, tedious manual measurement of ring width on a moving microscope stage, and data analysis. To streamline this process, there are statistical programs that count and measure tree rings automatically. The R package measuRing is designed to detect changes in pixel color as tree ring boundaries. In this talk we will explore the usage of this package in R and some of its limitations. In our analyses, the ringDetect function within this package did not detect every tree ring, which caused difficulties in subsequent data analyses. While measuRing includes the ringSelect function to manually select rings that were missed by ringDetect, this process is as tedious as the manual method. To improve upon this package, the ringDetect output plot was run through a local regression (loess) model, which smoothed the plot locally over a small, controlled span. The ringDetect plot and the loess plot were then layered and the residuals extracted in order to determine which peaks in the plot indicated tree ring boundaries.

Alexandra Withee-Statistics - Faculty Advisor: Michael Schuckers

Yacht Hydrodynamics

The residuary resistance of a ship is the amount of resistance (excluding friction) against a ship caused by the opposing motion as it moves through the water. The purpose of my project was to determine the best way to predict sailing yachts’ residuary resistance. In order to do this, I considered the basic hull dimensions and velocity of the boat, such as the longitudinal position of the center of buoyancy, prismatic coefficient, length-displacement ratio, beam-draught ratio, length-beam ratio, and Froude number. All variables, including residuary resistance, are adimensional (dimensionless), meaning that they are expressed as ratios without physical units. My final model was selected by comparing statistical learning regression models such as Best Subsets, Lasso, Ridge Regression, Principal Component, and Partial Least Squares. By analyzing the test errors of each model, I assessed how well each met the goal of most accurately predicting residuary resistance. The data I applied these methods to is a sailing yacht hydrodynamics dataset from the University of Delft which includes data from 308 full-scale experiments with 22 different hull forms. By accurately predicting residuary resistance, ship makers can more effectively design and build sailboats that move efficiently through the water, which makes them faster and easier to sail.

Ashley Norris-Statistics - Faculty Advisor: Michael Schuckers

Collection and Analysis of SLU Lacrosse Data

For this project we are collecting and analyzing data from the St. Lawrence University Varsity Men’s Lacrosse games for the 2018 season. Lacrosse analytics is a field with little prior research, so we started our project by working with the coaches to prioritize which statistics to track. While basic statistics for the game are already tracked, we are doing a more in-depth analysis by focusing on the team’s offensive efficiency and face-offs. For offensive efficiency, we analyze both individual player involvement and how the team is doing overall. We track statistics such as positive/negative turnovers for individual players, the number of quality scoring opportunities within a possession, the number of quality shots in a possession, and the outcomes for each offensive possession. Our goal is to find trends in the team’s offensive efficiency and provide analyses for the coaches to make data-driven decisions.

Margar Harutyunyan-Statistics - Faculty Advisors: Michael Schuckers and Fatma Gunay

Econometrics of Choice: Estimating Multinomial Discrete-Choice Model for Differentiated Automobile Market

Search Google for “How many choices does a person make in a day?” The answer that comes up is 35,000 choices, most of which are explicit or discrete choices. This is staggering and fascinating given that we have only 24 hours in a day. Although we cannot learn everything about the choices we make, we can still study them in some depth to understand our behavior as consumers. Many of our choices can be grouped into discrete categories to be analyzed through multinomial logit and discrete-choice statistical and economic models. Multinomial logit is a statistical learning technique for analyzing relationships between polytomous response variables and sets of predictors that has been actively used in discrete-choice modeling. Similarly, discrete-choice models have been developed by econometricians to describe choices consumers make in the marketplace. We apply the framework of multinomial logits and discrete-choice models to 4654 consumer-level observations from R’s mlogit package to develop an understanding of differentiated product market demand for the U.S. automobile market and to obtain demand parameters. The final model explains how different variables affect consumer choices and demand. This analysis utilizes both economic and statistical theory to establish a coherent theoretical foundation for the research. The importance of the project rests in understanding consumer behavior as well as providing a theoretical and quantitative explanation of why people choose particular alternatives in the automobile market.
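
The core of a multinomial logit model is the formula that turns the utility of each alternative into a choice probability, P(j) = exp(V_j) / sum over k of exp(V_k). The Python sketch below shows that calculation only; the alternative names and utility values in the usage note are hypothetical and are not estimates from this project.

```python
from math import exp


def choice_probabilities(utilities):
    """Multinomial logit choice probabilities.

    utilities maps each alternative to its systematic utility V_j;
    the result maps each alternative to exp(V_j) / sum_k exp(V_k).
    """
    top = max(utilities.values())   # subtract the max for numerical stability
    weights = {alt: exp(v - top) for alt, v in utilities.items()}
    total = sum(weights.values())
    return {alt: w / total for alt, w in weights.items()}
```

For instance, `choice_probabilities({"sedan": 1.0, "suv": 1.5, "truck": 0.2})` assigns the highest probability to the alternative with the highest utility, and the probabilities sum to one.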

Anna Izzo-Statistics - Faculty Advisor: Michael Schuckers

Predicting Employee Turnover using Survival Analysis: Cox Proportional Hazards Model

Employee turnover, which occurs when an employee leaves the company for which they are working, is inevitable. At some point in their careers, almost all employees will turn over. Unexpected attrition is expensive for companies. Because of this, statisticians have investigated modeling to understand and minimize the costs associated with turnover within businesses. One of these methods is survival analysis, a method in which analysis of data can predict the time until the occurrence of an event of interest as well as whether or not the event of interest previously occurred. In this project, survival analysis is applied to employee data with the Cox proportional hazards model in an attempt to better understand the relationships between predictors and rate of hazard, as well as gain insight into the ideal timing of engagement intervention to avoid unexpected turnover. A deeper understanding of this approach will permit businesses to retain quality talent and avoid exorbitant attrition costs.

Victoria Shaffer-Statistics - Faculty Advisor: Michael Schuckers

Determining Song Rank on Spotify Using Poisson Methods

Spotify is a song and podcast streaming service that was officially launched in 2008 and has since grown to over 159 million users. Songs tend to rise and fall in popularity. Here we will try to predict the typical trajectory of a song or a class of songs. Given the number of streams of each song in the first eight months of 2017 along with the song's rank, we use Poisson models for count data to predict the rank of the songs in the last four months of 2017 (September, October, November, and December). We will consider Poisson regression as well as Poisson Lasso and Poisson Ridge regression for these predictions.

Bridget Benz-Statistics - Faculty Advisor: Michael Schuckers

The Kaplan-Meier Estimator

Survival analysis is a collection of methods used to predict the chances of an event occurring given historical data. For example, one could use survival analysis to predict whether a patient with a certain illness will live or die at a variety of points in time based on studies of previous patients over time. One such method of survival analysis is the Kaplan-Meier Estimator, which depends on the number of patients living and dead at various times after diagnosis. This poster will examine the derivation of the Kaplan-Meier Estimator, some pros and cons of using it, and an application to real-world data.
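
As a minimal sketch of the estimator discussed above, the following Python function computes the Kaplan-Meier survival curve, S(t) = product over event times t_i <= t of (1 - d_i / n_i), where d_i is the number of events at time t_i and n_i is the number of subjects still at risk just before t_i. The function name and the input layout (parallel lists of times and event indicators) are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where an event was observed.

    times  : follow-up time for each subject
    events : 1 if the event (e.g. death) was observed, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        while i < len(data) and data[i][0] == t:   # group ties at time t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk     # multiply by (1 - d_i / n_i)
            curve.append((t, survival))
        at_risk -= removed     # both deaths and censored subjects leave the risk set
    return curve
```

Censored subjects never lower the curve directly; they only shrink the risk set for later event times, which is how the estimator uses partial follow-up information.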

Nicole Williams-Statistics - Faculty Advisor: Michael Schuckers

Where Are You From?: Using Log Linear Models to Predict a Person’s Native Country

In statistical analyses, we are often interested in predicting counts as the outcome of our analyses. Log linear models show interaction and association between categorical variables using the relationships among the cell counts of those variables. Log linear models assume that the cell counts follow a Poisson distribution and are appropriate to use when there is more than one response or no response is specified. In this project the log linear approach is applied to Census data from 1994. These data provide information on age, wage, race, education level, relationship status, and occupation for people of ages 16 to 100. Using this information and log linear models, this project looks into the various relationships among different categorical variables. Understanding the interactions between these variables could then be used to help predict someone’s native country.

Shyanne White-Statistics - Faculty Advisor: Michael Schuckers

Predicting a Starbucks Drink from Nutritional Values

In this analysis, I use a set of data based upon the drink menu at a typical Starbucks store. This particular dataset includes information on each drink’s category, type, preparation, and various nutritional facts. The goal of this project is to see how accurately different methods predict the beverage type from the other variables. The data were analyzed in R using classification trees, which are built with a categorical response variable and then combined to create random forests, which output the class predicted by the most individual trees. This was then cross-validated to determine the out-of-sample prediction accuracy. This approach is used often in statistics to predict the classification of an observation using other factors.

Carly Jefferson-Statistics - Faculty Advisor: Michael Schuckers

Redundancy Analysis

Redundancy analysis (RDA) is a method that summarizes the variation in a set of response variables that can be explained by a set of explanatory variables. RDA is a direct gradient analysis technique which summarizes linear relationships between components of response variables that are redundant with a set of explanatory variables. RDA can also be considered a version of principal component analysis, where the response variables are linear combinations of each other. RDA produces an ordination that summarizes the main patterns of variation in the response, which can be explained by a matrix of explanatory variables. The total variance of the data set can be separated into constrained and unconstrained variances. This result shows how much variation in your response variables was redundant with the variation in your explanatory variables. If the constrained variance is much higher than your unconstrained variance, the analysis suggests that much of the variation in the response data may be accounted for by your explanatory variables. If there is a large proportion of unconstrained variation, then only a small amount of the variation in your response matrix is displayed. Each RDA axis has an eigenvalue associated with it and the total variance of the solution is equivalent to the sum of all eigenvalues (constrained and unconstrained). I will use plots such as distance and correlation plots to represent the importance of redundancy analysis.

Tanjona Rakotoarisoa-Statistics - Faculty Advisor: Michael Schuckers

What are the key factors that influence entrepreneurship in the United States?

The United States (US) is thought to be an entrepreneurs’ heaven, but according to a recent study published in Forbes magazine, the most entrepreneurial group in America was not born here. Hence, we ask: what are the key factors that influence entrepreneurship in the United States? The purpose of this project is to identify the predictors of whether a person is an entrepreneur, using the 2016 Census Income data extracted from the data.world website. Using a logistic regression model, the probability of someone being an entrepreneur will be estimated from the following predictors: age, level of education, marital status, race, gender, and country of birth. Furthermore, we will explore other classification methods, such as K Nearest Neighbors classification, to assess the quality of our predictions.

Sam Gavett-Statistics - Faculty Advisor: Michael Schuckers

Hierarchical Clustering

For this poster, I investigate hierarchical clustering using linkage methods, applied to a forestry data set. The primary aim of this analysis is to consider the effects of the different types of distances used, including the standard Euclidean distance. The presentation will also investigate the impact of standardizing variables using z-scores versus using raw data, and the visualization of results using different dendrograms. The data considered here are observations from four areas of the Roosevelt National Forest in Colorado. All observations are cartographic variables from 30 meter x 30 meter sections of forest. This dataset comes from Kaggle.com and includes over half a million measurements.
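
The effect of standardizing before clustering can be seen in a small Python sketch. It converts each column to z-scores and then performs agglomerative clustering with Euclidean distance, using either single linkage (`min`) or complete linkage (`max`). The function names and the toy data in the usage note are hypothetical and unrelated to the forestry data set; this is a bare-bones illustration, not the analysis from the poster.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Convert each column to z-scores (assumes no column is constant)."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sds = [pstdev(c) for c in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, mus, sds)) for row in rows]


def agglomerate(points, k, linkage=min):
    """Merge the two closest clusters until k clusters remain.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(points))]

    def gap(c1, c2):
        # Linkage distance: min or max Euclidean distance over cross pairs.
        return linkage(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: gap(clusters[p[0]], clusters[p[1]]))
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]          # safe because b > a
    return sorted(clusters)
```

With rows such as `[(1.0, 1000.0), (1.2, 1100.0), (0.9, 900.0), (5.0, 1050.0), (5.1, 950.0), (4.9, 1000.0)]`, the raw distances are dominated by the second column because of its larger scale, while `agglomerate(standardize(rows), 2)` lets both columns contribute comparably.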

Jack Stokoe-Statistics - Faculty Advisor: Michael Schuckers

Implementing Random Forests in a Regression Context

Decision trees for regression (i.e. regression trees) divide the set of possible values for the variables into separate regions, taking the shape of an upside-down tree. The predicted response value of an observation that falls into a certain region is simply the average of the response values of observations already in that region. Regression trees are easy to explain as they closely mimic human decision-making and offer a graphical representation that is more interpretable than other regression methods. However, there is a tradeoff: an increase in interpretability generally leads to a sacrifice in prediction accuracy. One solution is to use an extension of decision trees called random forests for regression (i.e. random decision forests). Instead of producing a single tree, this method yields a prediction by combining a large number of trees (in a forest), which improves prediction accuracy. In this poster, I will present the methodology of random decision forests and illustrate its application.
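
The two ideas above, predicting with the mean of a region and averaging many trees, can be sketched in a few lines of Python. To keep the example short, each "tree" here is a single split on one predictor, and the forest is built from bootstrap resamples; a full random forest would also grow deeper trees and choose a random subset of predictors at each split. All function names and default values are assumptions for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor.

    Returns (threshold, left_mean, right_mean), choosing the split that
    minimizes the total squared error around the two region means.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:                  # all x equal: predict the overall mean
        overall = mean(ys)
        return (values[0], overall, overall)
    return best[1:]


def predict_stump(stump, x):
    """Predict the mean response of the region that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit one stump per bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_forest(forest, x):
    """Average the predictions of all trees in the forest."""
    return mean(predict_stump(s, x) for s in forest)
```

Averaging over resampled trees smooths out the sharp jump of any single split, which is the source of the accuracy gain and also the reason the result is harder to draw as one interpretable tree.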

Xiaobing Wang-Statistics - Faculty Advisor: Michael Schuckers

Dimensional reduction for high-dimensional datasets

Dimensional reduction is the process of reducing the number of variables under consideration for statistical modelling by obtaining a smaller set of predictors. Dimensional reduction has many advantages in data analysis. It can help in data compression, reduce computation time, and help remove redundant features. Data analyses such as regression or classification can be done more efficiently after dimensional reduction in many cases. In this presentation, I will introduce several methods of dimensional reduction, including Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Generalized Discriminant Analysis (GDA) and Partial Least Squares (PLS). I will apply them to a dataset from NHTSA (National Highway Traffic Safety Administration) and explain both the advantages and disadvantages of each method.

Yihuan Lai-Statistics - Faculty Advisor: Michael Schuckers

Application of the Latent Class Growth Analysis to Two Longitudinal Datasets

In statistics, different methods have been developed to estimate individual change over a period of time. Also, the existence of latent trajectories, where individuals are captured by trajectories that are unobserved, can be examined. The method used to evaluate such trajectories is called Latent Class Growth Analysis (LCGA). This method has become increasingly popular in the social sciences in recent years. In this project, LCGA will be applied to two longitudinal datasets related to Health Psychology. The first dataset is about the post-traumatic growth (PTG) of breast cancer survivors over seven years, and the second dataset is about patients' psychological responses during the year after dentofacial correction surgery. The analysis was conducted in Mplus, which is a powerful statistical program especially suited to estimating latent variables. The types of post-traumatic growth experienced by the patients in the first dataset were divided into three trajectories: constructive PTG, illusory PTG, and distressed PTG. In the second dataset, the patients with dentofacial deformities were divided into two trajectories: resilient type and chronic dysfunctional type.

Gordon White-Statistics - Faculty Advisor: Michael Schuckers

Machine Learning in Hockey Statistics

This project focused on predicting whether a hockey player will make it to the National Hockey League (NHL) based on their performance at age eighteen. Data for this project was taken from a relational database of approximately 400,000 hockey players initially created by Guinevere Gilman. Using these player data, we will determine how well we can predict whether the player will make the NHL using logistic regression, classification random forests and k-nearest-neighbors (KNN). Logistic regression will be used as a baseline to determine how well a machine learning algorithm can perform. Player statistics such as goals, assists, points and the league in which they play will be used to build each model. The goal of this project is to determine how well a machine learning algorithm can predict if a player will make it to the NHL.

Gordon White-Statistics - Faculty Advisors: Michael Schuckers and Lisa Torrey

A Web App for Visualization of Hockey Statistics

This project focused on building a graphical user interface (GUI) to a relational database of approximately 400,000 hockey players. The database, which contains information about player demographics and performances, was initially created in Postgres by Guinevere Gilman in a previous project. We developed a useful web interface for it using HTML, CSS, JavaScript, and PHP. The web application extracts player statistics from the database and uses Plotly, a JavaScript visualization package, to display them in a user-friendly way.

Yue Yang-Statistics - Faculty Advisor: Robin Lock

Beyond Constant: A Study on Generalized Autoregressive Conditional Heteroskedasticity Model

In time series analysis, many models assume that the conditional variance of a series is constant, including one powerful model called the autoregressive moving average (ARMA) model. The ARMA model assumes that the data have constant conditional and unconditional variance. My senior project studied a model that does not require constant conditional variance, which expands the types of datasets that can be analyzed. The generalized autoregressive conditional heteroskedasticity (GARCH) model focuses on modeling the conditional variance based on past observations and past variances, and I focused on the properties of this model. I also used different R packages to study this model by simulating different datasets, and analyzed how well those datasets fit the GARCH model. During this process, one package in R turned out to give unusual results depending on the properties of the datasets, so my work also included how to use R packages to get accurate results even with different datasets. I also worked with real-life data sets of stock prices, though many stocks do not show heteroskedasticity patterns.
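
To show what "modeling the conditional variance" means in the simplest case, the Python sketch below simulates a GARCH(1,1) process, in which the variance at time t is omega + alpha * (previous shock squared) + beta * (previous variance). The project itself used R packages; this standalone simulation, its function name, and its parameter values are assumptions chosen for illustration.

```python
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=0):
    """Simulate n observations from a GARCH(1,1) process.

    sigma2[t] = omega + alpha * eps[t-1]**2 + beta * sigma2[t-1]
    eps[t]    = sqrt(sigma2[t]) * z[t],  with z[t] standard normal
    The parameter values are illustrative; alpha + beta < 1 keeps the
    unconditional variance finite at omega / (1 - alpha - beta).
    """
    rng = random.Random(seed)
    eps, variances = [], []
    sigma2 = omega / (1.0 - alpha - beta)   # start at the unconditional variance
    for _ in range(n):
        shock = (sigma2 ** 0.5) * rng.gauss(0.0, 1.0)
        eps.append(shock)
        variances.append(sigma2)
        # Next period's conditional variance depends on today's shock and variance.
        sigma2 = omega + alpha * shock ** 2 + beta * sigma2
    return eps, variances
```

Large shocks raise the next period's variance, so simulated series show clusters of volatile and calm periods, which a constant-variance model cannot reproduce.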

Xiaobing Wang-Statistics - Faculty Advisor: Robin Lock

Comparing Randomization Methods for Difference in Means Using Simulations in R

We explain three randomization methods for testing a difference in two means: Reallocate Groups, Shift Groups, and Combine Groups. For Reallocate Groups, we put all cases from the two groups together and then randomly assign them, without replacement, to two new groups with the same sample sizes as the groups in the original sample. For Shift Groups, we take the common mean to be the mean of all cases from both groups. We then shift the two groups so that their new sample means both equal the common mean. For Combine Groups, we put all cases from the two groups together and then randomly assign them, with replacement, to two new groups with the same sample sizes as the groups in the original sample. We use R-based simulations to evaluate these three methods along with the traditional t-test and pooled t-test. We use numerical and graphical methods to compare the performance and power of all five procedures.
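
The three methods described above can be sketched as follows. The study itself used R; this Python version is an illustration only, and the function names, the two-sided p-value, and the default number of repetitions are assumptions. In particular, the abstract does not say how new samples are drawn after shifting, so the sketch assumes each shifted group is resampled with replacement.

```python
import random
from statistics import mean


def reallocate(a, b, rng):
    """Pool all cases and deal them out again without replacement."""
    pooled = a + b
    rng.shuffle(pooled)
    return pooled[:len(a)], pooled[len(a):]


def shift(a, b, rng):
    """Shift each group to the common mean, then resample within groups.

    Assumption: resampling is done with replacement from each shifted group.
    """
    common = mean(a + b)
    mean_a, mean_b = mean(a), mean(b)
    a_shifted = [x - mean_a + common for x in a]
    b_shifted = [x - mean_b + common for x in b]
    return rng.choices(a_shifted, k=len(a)), rng.choices(b_shifted, k=len(b))


def combine(a, b, rng):
    """Pool all cases and draw both new groups with replacement."""
    pooled = a + b
    return rng.choices(pooled, k=len(a)), rng.choices(pooled, k=len(b))


def p_value(a, b, method, reps=2000, seed=0):
    """Two-sided randomization p-value for the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    for _ in range(reps):
        new_a, new_b = method(list(a), list(b), rng)
        if abs(mean(new_a) - mean(new_b)) >= observed:
            extreme += 1
    return extreme / reps
```

Calling `p_value(group1, group2, reallocate)` and repeating with `shift` and `combine` on the same data gives a quick way to compare how the three null distributions differ.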