The Festival of Science 2014 will be held Friday, May 2, in Eben Holden from 1:00 to 5:00 p.m.
This Student Symposium during the annual Festival of Science (FoS) is a successful, valued, and growing opportunity for SLU undergraduates to share the results of their research with the campus community. Together with the annual Romer and Susan Caroline Ferguson Lectures, the Student Symposium has become an effective showcase of some of the exciting progress in the sciences. This event is also a great opportunity to support our students and the hard work they have done.
Roselyne Laboso and Evan Walsh - Weekly Monitoring of River Chemistry in St. Lawrence County, New York
Abstract: Four rivers in the northern Adirondack region flowing into the St. Lawrence River were sampled weekly for one year (June 2011 – May 2012) for 72 elements and 8 anions. From west to east, the rivers are the Oswegatchie, Grasse, Raquette, and St. Regis. The headwaters of all the rivers begin in the acidified Adirondack Highlands (crystalline acidic bedrock) and flow across the Adirondack Lowlands (marble-rich metasedimentary sequence) and early Paleozoic sedimentary rocks (sandstone and dolostone) of the St. Lawrence River Valley. Sampling was conducted close to US Geological Survey gauging stations in the St. Lawrence River Valley so that loading of the analytes could be determined. In general, the waters were dilute: OSW (128.0±37.4), GRA (125.3±40.6), RAQ (61.8±21.7), and STR (96.1±36.2) μS. Their mean pH values were 7.11±0.52, 7.62±0.66, 7.22±0.50, and 7.33±0.50. The most abundant cations were Ca, Na, Si, Mg, and K. The chief anions were CO32-, Cl-, and SO42-. Mean calculated ANC values in the four rivers are 43.1±6.4, 43.6±9.0, 17.2±5.4, and 31.5±7.2 (mg/L). The chemistry of the rivers varies from west to east and is influenced by the nature of the bedrock. The chemistry of the Raquette River is clearly anomalous: it has the least total dissolved solids and the least seasonal variability in nearly every analyte. This homogeneity is likely related to the number (n = 17) of hydropower reservoirs along its length. Runoff from a dolostone quarry upriver is also investigated to determine its effects on the chemistry of the Raquette River.
Kevin Angstadt - Accelerating Database Joins Using a General Purpose GPU
Abstract: We demonstrate a significant speedup in database operations by repurposing hardware normally dedicated to computer graphics. In recent years, the computer manufacturing industry has achieved significant advances in the design of graphics cards (formally known as graphics processing units or GPUs). These add-on cards are now more computationally powerful than modern CPUs and often less expensive. Graphics processing units are highly parallel devices, meaning that the hardware can run thousands of instructions simultaneously, whereas the average CPU can only execute four to eight operations at the same time. Manufacturers have also developed software interfaces for graphics cards that allow developers to harness this power in their programs, as well as special hardware to run such computations. Previous research has shown that these general purpose graphics processing units (GPGPUs) have many applications in the research area of database systems. We extend previous work demonstrating efficient data storage and processing techniques on GPGPUs for single-table data queries to multi-table queries. Queries of this nature (known in database theory as joins) require far more processing and storage space, and our research presents novel techniques for efficiently harnessing the power of a graphics card to compute such data requests. Our project demonstrates a low-cost method for accelerating data processing, with initial results indicating speedups of two to four times on consumer-level GPUs over CPUs.
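To make the operation concrete, a relational join can be sketched as a classic hash join. The Python below is an illustrative CPU baseline on invented toy tables, not the authors' GPU implementation:

```python
# Illustrative sketch of the join operation a GPGPU implementation parallelizes,
# shown here as a simple single-threaded hash join over lists of dicts.
def hash_join(left, right, key_left, key_right):
    """Inner-join two row lists on the given key columns."""
    # Build phase: index the right table by its join key.
    index = {}
    for row in right:
        index.setdefault(row[key_right], []).append(row)
    # Probe phase: for each left row, emit every matching combination.
    result = []
    for row in left:
        for match in index.get(row[key_left], []):
            result.append({**row, **match})
    return result

# Invented toy tables for illustration.
players = [{"player_id": 1, "name": "Ann"}, {"player_id": 2, "name": "Bo"}]
scores = [{"pid": 1, "score": 10}, {"pid": 1, "score": 7}, {"pid": 3, "score": 4}]
joined = hash_join(players, scores, "player_id", "pid")
```

The build and probe phases are exactly the steps a GPU version distributes across thousands of threads.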
Katherine Abramski - Improving the Statistical Method for Classifying Geomagnetic Storms
Abstract: Solar storms create disturbances in the Earth’s magnetic field that can damage satellites and cause power grids to fail. In an effort to better understand solar storms, we explore ways to improve the current method for classifying events as storms based on the Disturbance Storm Time (Dst) index, which measures disturbances in the Earth’s magnetic field. Several methods currently exist for classifying events as storms based on their Dst observation, including the threshold method, which classifies an event as a storm if a Dst value below a specified threshold is observed. This method is rather simple and results in significant error (high false positive and false negative rates). We investigate how these classification methods were derived and search for a more statistically justified way to define storms in an attempt to reduce error.
We explore three different logistic regression methods of classifying events as storms based on Dst. Our logistic regression methods use predictors to create a model that can determine the probability that a given event is a storm. We varied our predictors in order to find a method with low error rates. First, we used Dst lows as our predictors. Then we detrended the Dst data and used the detrended Dst lows as our predictors. Finally, we used the percent decrease in Dst from the day before as a predictor. We found that all of these methods were more effective than the threshold method.
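The contrast between the two approaches can be sketched in a few lines. Everything below (the synthetic Dst lows, the labels, the cutoff, and the rescaling of Dst to hundreds of nT for numerical stability) is invented for illustration, not taken from the study:

```python
import math

def threshold_classify(dst_min, threshold=-50):
    """Threshold method: call an event a storm if its Dst low falls below a cutoff (nT)."""
    return dst_min < threshold

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """One-predictor logistic regression, P(storm) = 1/(1+exp(-(a+b*x))),
    fit by batch gradient descent on the log-loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

# Synthetic Dst lows rescaled to hundreds of nT; label 1 = storm, 0 = quiet.
dst_lows = [-1.20, -0.95, -0.80, -0.60, -0.40, -0.30, -0.20, -0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
a, b = fit_logistic(dst_lows, labels)
p_storm = 1.0 / (1.0 + math.exp(-(a + b * -0.70)))  # P(storm | Dst low of -70 nT)
```

The fitted slope is negative, so deeper Dst lows raise the storm probability, and the model returns a probability rather than the threshold rule's hard yes/no.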
Alex Gladwin - Who Writes the Unwritable?: The Issue of Authorship in H. P. Lovecraft and C. M. Eddy’s “The Loved Dead.”
Abstract: In 1924, Weird Tales published a controversial story called “The Loved Dead,” which is told from the perspective of a necrophile. The issue was banned in at least one state, Indiana. Today, the controversy surrounds the story’s authorship: while it was originally published under the name C. M. Eddy, Jr., a contemporary weird fiction writer and friend of Eddy’s, H. P. Lovecraft, is known to have done some sort of revision work on the story. The extent of that work, however, is uncertain, and has caused debate. Thus, we will look at “The Loved Dead” using stylometry, a method that attempts to quantify underlying aspects of style, in order to provide evidence toward a more informed claim about the controversial tale’s authorship.
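One common stylometric approach is to represent each text by its relative frequencies of common function words and compare the resulting vectors. The sketch below illustrates the idea on invented sample sentences; the actual study would work with the full texts of Lovecraft, Eddy, and “The Loved Dead”:

```python
import math
from collections import Counter

# A short, illustrative list of function words; real stylometric studies
# typically use dozens to hundreds of them.
FUNCTION_WORDS = ["the", "and", "of", "to", "a", "in", "that", "it", "was", "his"]

def style_vector(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two style vectors (1.0 = identical profile)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented sample sentences standing in for candidate-author texts.
sample_a = "the cat sat in the hall and it was quiet"
sample_b = "the dog ran to the yard and it was loud"
similarity = cosine_similarity(style_vector(sample_a), style_vector(sample_b))
```

Texts whose function-word profiles sit closer to one candidate author's profile provide quantitative evidence toward that attribution.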
John Balderston - Foxes, Hipsters, and The Internet Meme: A First-World Social Epidemic
Abstract: The Internet Meme: a fast spreading, sometimes “viral,” internet fad that is quite possibly the fastest mutating disease known to mankind. The Meme virus threatens the health and abdominal circumference of individuals everywhere. The multiple strains of the virus and its speedy mutation rate have left the grand majority of the human race perpetually infected. For this reason, we create a mathematical SIR model to demonstrate the spread of memes, where individuals can either be Susceptible (S), Addicted (A), or Rehabilitated (R). Our SAR model incorporates social impacts on the spread of this dreaded plague, including personal preference, hipster effects, boredom, and meme mutation. Observing the internet meme in this manner allows for the relevant understanding of social diseases, in which interactions within the population can result in a form of vaccination or devaccination, unseen in typical SIR modeling. Our hope is that our SARs will lend insight into combating the spread of this debilitating disease.
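Stripped of the social-effect terms, the core compartment dynamics can be sketched numerically. The rate constants below are invented for illustration; the study's SAR model layers preference, hipster, boredom, and mutation effects on top of this skeleton:

```python
# A minimal forward-Euler integration of an SIR-style compartment model with
# the abstract's Susceptible/Addicted/Rehabilitated classes (as fractions of
# the population). The parameters beta and gamma here are illustrative only.
def simulate_sar(s0, a0, r0, beta, gamma, steps, dt=0.1):
    """Integrate dS/dt = -beta*S*A, dA/dt = beta*S*A - gamma*A, dR/dt = gamma*A."""
    s, a, r = s0, a0, r0
    history = [(s, a, r)]
    for _ in range(steps):
        ds = -beta * s * a
        da = beta * s * a - gamma * a
        dr = gamma * a
        s += ds * dt
        a += da * dt
        r += dr * dt
        history.append((s, a, r))
    return history

traj = simulate_sar(s0=0.99, a0=0.01, r0=0.0, beta=0.5, gamma=0.1, steps=1000)
```

Because the three rates sum to zero, the population fractions are conserved at every step, which is a handy sanity check on any implementation.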
Juan Chang - Predicting Owner Tendencies in Fantasy Football Drafts
Abstract: As the world of professional sports grows more popular every year, so do interactive competitions such as fantasy sports. Fantasy Football, in particular, is among the most widely enjoyed competitions in the United States today. Fantasy Football has become more popular, and league play has become more competitive as owners decide to gamble their money for potential profit. As the competitiveness increases, owners strive to compile the best possible team. In particular, owners will have an advantage if they can pick the player they want in each round of the Fantasy Football draft. This research will create a model that will be beneficial for Fantasy Football owners who want to succeed in their leagues, specifically in the draft. By gathering information to develop a data set, multinomial logistic regression analysis will be used to develop this model. The model will act as a forecast, predicting the upcoming picks in the draft, so that owners can successfully anticipate which specific NFL players will be available at any point in time. This will allow the owner to produce a team of players he or she wants and, ultimately, have the most successful team by the end of the season.
Kathryn Christensen - Understanding Uncertainty in Predicting the Lifetime of Plutonium Fuel Cells
Abstract: Plutonium is a radioactive element that gives off a large amount of heat in the decay process, making it an excellent source of energy. This energy can be harnessed as power, capable of fueling satellites and spacecraft. Because it is such a long-lasting power source, it is important to be able to accurately predict how long the fuel cell will sustain its craft and how much power remains at a given point in time. After modeling power over time, we identified potential sources of error in our prediction: the measured mass of the fuel clad, the percentage of plutonium in the clad, the percentage of each plutonium isotope, and the half-life of each isotope. We design a simulation study to help determine which source of error has the biggest impact on the variability of the predictions. Using three different levels for each source of error (low, medium, and high), we explore which source has the biggest impact on the predicted power remaining by observing the effects on prediction interval width, variability of the prediction, and entropy. In addition, we run an ANOVA-type analysis to determine whether there is any relationship between the size of the error and how much it impacts the variability of the prediction, looking specifically for a linear or quadratic relationship.
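The deterministic core of such a prediction is just summed exponential decay. The sketch below uses placeholder wattages and half-lives (loosely Pu-238- and Pu-241-like); the study's point is that uncertainty in exactly these inputs propagates into the prediction:

```python
def remaining_power(isotopes, years):
    """Total thermal power of a fuel clad after `years`, where isotopes is a
    list of (initial_watts, half_life_years) pairs, each decaying as 0.5**(t/T)."""
    return sum(watts0 * 0.5 ** (years / half_life)
               for watts0, half_life in isotopes)

# Placeholder fuel composition: a dominant long-lived component (~87.7 yr,
# Pu-238-like) plus a small shorter-lived one (~14.4 yr, Pu-241-like).
fuel = [(100.0, 87.7), (5.0, 14.4)]
power_now = remaining_power(fuel, 0.0)
power_25yr = remaining_power(fuel, 25.0)
```

Perturbing the wattages or half-lives within their error bounds and recomputing, as in the study's simulation design, shows how each input's uncertainty widens the prediction interval.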
Lara Clemens - Comparison of E. coli Genomes: An introduction to comparative genomics
Abstract: Bioinformatics is a growing field that hybridizes biology, mathematics, and computer science in order to analyze biological data, most commonly DNA sequences. I focused on a subset of bioinformatics called comparative genomics, which analyzes the DNA sequences of organisms and aims to relate sequence-level data to organismal-level data, for instance by noting preserved DNA sequences that align with the preservation of a trait among organisms. I used data available for Escherichia coli strains to compare their DNA sequences and to test whether groups of E. coli with similar DNA sequences matched groups of E. coli with distinct morphological traits.
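At its simplest, a sequence-level comparison reduces to counting matching positions between aligned sequences. The toy function below illustrates percent identity on made-up fragments; real E. coli comparisons rely on alignment tools rather than this sketch:

```python
def percent_identity(seq_a, seq_b):
    """Fraction of matching positions between two equal-length aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

# Invented aligned fragments differing at one position.
pid = percent_identity("ACGTACGT", "ACGTACCT")
```

Pairwise identities like this, computed across whole genomes, are what cluster strains into sequence-similar groups for comparison against morphological groupings.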
Ian DePuy - Biomathematical Modeling in Python
Abstract: The use of mathematical and computational models in biology is increasing rapidly, allowing insight into nature that was not possible before. These models provide insight useful in areas such as wildlife and habitat preservation, infectious disease control, medical research on DNA mutations, and phylogenetics. I examined models that simulate predator-prey interactions, molecular evolution, population change, phylogenetic trees, and the transmission of infectious disease. The models were written in the programming language Python using the packages NumPy (for numerical analysis) and Matplotlib (for basic plotting and graphing), allowing users to provide initial values for a variety of model parameters. The resulting graphs can be used to help visualize and comprehend each model, as well as provide an understanding of the mathematics that went into creating them.
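As one example of the model family described, a predator-prey (Lotka-Volterra) system can be integrated with a simple forward-Euler loop. The version below is stripped down to plain Python with invented parameter values; the project's implementations use NumPy for the numerics and Matplotlib for the plots:

```python
def predator_prey(prey0, pred0, alpha, beta, delta, gamma, steps, dt=0.01):
    """Forward-Euler integration of the Lotka-Volterra equations:
    dx/dt = alpha*x - beta*x*y (prey), dy/dt = delta*x*y - gamma*y (predator)."""
    prey, pred = prey0, pred0
    orbit = [(prey, pred)]
    for _ in range(steps):
        dprey = alpha * prey - beta * prey * pred
        dpred = delta * prey * pred - gamma * pred
        prey += dprey * dt
        pred += dpred * dt
        orbit.append((prey, pred))
    return orbit

# Illustrative initial populations and rates.
orbit = predator_prey(prey0=10.0, pred0=5.0, alpha=1.1, beta=0.4,
                      delta=0.1, gamma=0.4, steps=500)
```

Plotting the prey and predator series against time (the Matplotlib step omitted here) shows the characteristic offset oscillations of the two populations.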
Zachary Felix - Pitch Classification Using Major League Baseball’s Pitchf/x Data
Abstract: Pitchf/x is a baseball tracking system that uses cameras to record measures such as the velocity, movement, and location of every pitch thrown in Major League Baseball games since 2006. We use an R package called pitchRx to access and store data from the online Pitchf/x files. We examine methods, such as k-means clustering and multinomial logistic regression, for classifying the type of pitch (fastball, curveball, etc.) based on Pitchf/x characteristics.
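The clustering step can be illustrated with a tiny k-means sketch on invented two-feature “pitches” (speed in mph, horizontal break in inches); the actual analysis runs in R on the full Pitchf/x feature set:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: alternate nearest-center assignment and center updates."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = tuple(sum(c[d] for c in cluster) / len(cluster)
                                   for d in range(len(points[0])))
    return centers, clusters

pitches = [(95, 2), (96, 1), (94, 3),    # fastball-like: fast, little break
           (78, 12), (77, 11), (79, 13)]  # curveball-like: slow, big break
centers, clusters = kmeans(pitches, k=2)
```

With well-separated pitch types, the two recovered clusters correspond to the fastball-like and curveball-like groups.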
Sarah Koallick - A Java Graphical User Interface to an R Pareto Front Library
Abstract: The Pareto front multiple criteria optimization method is used by statisticians to make an informed decision when choosing an experimental design. In this project, we created a graphical user interface (GUI) that incorporates an R library written by Dr. Lu Lu and Dr. Christine Anderson-Cook. Their library creates a Pareto front and other graphical representations of the data to aid a user in making the best and most well-informed decision possible. The software we developed presents a GUI for the Pareto front R library, allowing the user to input the data they want to optimize in an understandable way. The GUI is developed in Java and uses an interface called JRI, the Java/R Interface, which allows an instance of R to run inside a Java application.
Devyn LaFrance - Using Cluster Analysis to Identify Groups in Cancer Gene Expression Data
Abstract: Cluster analysis is an exploratory data analysis tool used to separate data into groups (clusters) in such a way that objects in the same cluster are more similar to one another than objects in other clusters. Cluster analysis groups data in such a way that makes it easier to predict behavior or qualities of data based on group membership. Due to this grouping technique, cluster analysis can be applied to genetic data in order to group genes for medical research. Cluster analysis is often one of the first steps in gene expression analysis. When dealing with gene expression analysis, data is obtained from a DNA microarray, which is used to measure the expression level of a large quantity of genes simultaneously. When gene expression data is obtained, cluster analysis can be applied in order to study the effects of treatments, disease identification, the developmental stages of gene expression, and the identification of biological samples. We apply cluster analysis techniques on publicly available brain, colon, and prostate cancer datasets to look for patterns in gene expressions.
Chelsey Legacy - Minnesota Lake Surveys: An Application of Web-Scraping and Parallel Computing in R
Abstract: The main objective of this project was to implement a program for web-scraping and parallel computing in the statistical computing language R. Web scraping is the process of gathering data from the Internet. Many websites are filled with interesting and useful data; however, it is often not in a proper format for statistical analysis. With the use of web-scraping, this data can be compiled into a format that is easily manipulated for study. The process of gathering the data (which can contain thousands of variables and entries) can take a lot of time, as it is a complex process. However, with parallel computing the computer is able to divide the task into smaller jobs that can be solved at the same time, in “parallel”. The process of web-scraping using parallel computing was applied to lake survey data from the Minnesota Department of Natural Resources website. The website contains an abundance of information on characteristics of the lakes in Minnesota, including lake area, maximum depth, littoral area, water clarity, bottom substrate, and depth of plant growth. Basic statistical analysis was conducted on this dataset once gathered in order to gain further insight into the relationships among these variables. Examples include the number of fish species per county, a comparison of littoral and lake area, and a graphical representation of the relative lake sizes in Minnesota.
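The project itself is written in R; as a language-neutral sketch of the same divide-and-conquer idea, the Python below farms independent page parses out to a worker pool. The pages are canned strings and `extract_depth` is a hypothetical stand-in for a real scraper, so nothing here touches the actual DNR website:

```python
from concurrent.futures import ThreadPoolExecutor

def extract_depth(page_html):
    """Toy parser: pull the number following 'Max depth:' out of one page."""
    marker = "Max depth:"
    start = page_html.index(marker) + len(marker)
    return float(page_html[start:].split()[0])

# Canned stand-ins for fetched survey pages.
pages = [f"<html>Lake {i} Max depth: {10 * i} ft</html>" for i in range(1, 5)]

# Each page is an independent task, so a pool can process them concurrently;
# map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=4) as pool:
    depths = list(pool.map(extract_depth, pages))
```

Because each page's parse depends on no other page, the work splits cleanly into parallel jobs, which is exactly what makes scraping thousands of lake surveys tractable.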
Brian Magovney - Bradley Terry Model to Rank Division III Baseball Teams
Abstract: The ranking of athletic teams, particularly college teams, is often debated and discussed, especially when there may be large differences in strength of schedule between different teams or leagues. Who is the best? How do you determine which team deserves to be ranked where? We use a Bradley-Terry model to generate team rankings based on schedule information and game results scraped from the web using an R package.
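The Bradley-Terry model assigns each team a positive strength p_i such that team i beats team j with probability p_i / (p_i + p_j). Below is a small sketch of fitting those strengths with the classical minorization-maximization updates, on an invented three-team win matrix; the project fits the model in R on scraped schedule data:

```python
def bradley_terry(wins, iterations=200):
    """Fit Bradley-Terry strengths by MM updates.
    wins[i][j] = number of times team i beat team j."""
    n = len(wins)
    strength = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for team i
            # Games played against each opponent, weighted by current strengths.
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom else strength[i])
        total = sum(new)
        strength = [s / total for s in new]  # normalize for identifiability
    return strength

# Invented season: team 0 goes 7-1, team 1 goes 4-4, team 2 goes 1-7.
wins = [[0, 3, 4],
        [1, 0, 3],
        [0, 1, 0]]
strengths = bradley_terry(wins)
```

Sorting teams by fitted strength yields the ranking, and because opponents' strengths enter the updates, an easy schedule is automatically discounted.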
Dan Mulcahey - Identifying Groups of Similarly Performing Mutual Funds Using Cluster Analysis
Abstract: Cluster analysis is an exploratory tool used to identify groups within data. We apply cluster analysis in two different ways. First, we use recent percent returns to find clusters of similarly performing mutual funds and demonstrate how word clouds can be used to conveniently summarize fund descriptions and quickly identify the similarities among funds in each cluster and the differences among funds between clusters. Second, we consider clustering mutual funds based on the correlations in their annual returns over the past ten years, identifying five distinct categories with varying degrees of stability.
Michael Orlando - Mapping the National Survey on Recreation and the Environment Data
Abstract: The NSRE dataset contains information on 162 variables measuring attitudes about the environment and recreation for almost 100,000 respondents. To help analyze and display these data, we have developed a set of R functions to generate choropleths, maps that are colored to display quantitative information along a color gradient. We discuss some of the interesting findings the maps reveal about the NSRE data, as well as the challenges of automatically producing choropleths and exporting them to a webpage using the shiny package.
Aden Peterson - Simulation Methods for Survival Models with Applications to Lifetimes of C. elegans
Abstract: We examine methods for estimating survival curves and comparing life expectancies with applications to biological events. The analysis addresses the idea of determining the proportion of a population that will survive past a particular time. We use Kaplan-Meier curves to estimate survival functions and perform inference on curves using computer simulation methods, such as randomization tests and bootstrapping. We illustrate these methods modeling the lifetimes of C. elegans used in a study of treatments for brain disease involving free radicals.
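The Kaplan-Meier estimator at the heart of the analysis multiplies, at each observed death, the fraction of at-risk subjects surviving it. Below is a compact sketch on invented lifetimes (it assumes no tied event times; the study works with real C. elegans data and adds randomization and bootstrap inference on top):

```python
def kaplan_meier(times, observed):
    """Product-limit estimate of the survival curve.
    times: event or censoring times; observed: True for a death, False if censored.
    Assumes no tied times. Returns [(t, S(t))] at each observed death."""
    n_at_risk = len(times)
    survival = 1.0
    curve = []
    for t, obs in sorted(zip(times, observed)):
        if obs:  # a death: survival drops by the at-risk fraction
            survival *= (n_at_risk - 1) / n_at_risk
            curve.append((t, survival))
        n_at_risk -= 1  # either way, this subject leaves the risk set
    return curve

# Invented lifetimes in days; False marks worms lost to follow-up (censored).
curve = kaplan_meier([2, 3, 4, 5, 8, 9], [True, True, False, True, False, True])
```

Censored subjects shrink the risk set without forcing a drop in the curve, which is what lets the estimator use incomplete lifetimes.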
Dean Petzing - An Analysis of Play Selection in College Football
Abstract: Football is a game of many dimensions. Coaches face hundreds of decisions throughout the week of preparation leading up to a game: who should play, how much should they play, what formations to run, what plays to call out of those formations, how to attack the other team, and so on. This paper examines a similar question that has a large impact over the course of a game: how much do we want to run the football, and how much do we want to pass it? This question is asked by every offensive coach throughout the country. The purpose of this analysis is to compare the outcomes of running the ball and passing the ball. Is there an advantage to doing one more than the other? Inspired by Ken Kovash and Steven Levitt, who examine the same phenomenon at the professional level, this study examines the decision to run or pass in college football. The success or failure of a play is determined by weighing the expected change in points from a given run or pass play against the average change in points from all plays at the same position on the field and distance from a first down. By doing so, we can determine whether to favor running over passing or vice versa.
Vir Seth - CSA Squash Rating Using a Bradley-Terry Model
Abstract: We use a Bradley-Terry Model to generate rankings of the squash teams in the College Squash Association (CSA). The model takes into account several factors such as the number of matches played, the strength of schedule and the win-loss records. At the end of the regular season, the top 8 teams in the nation are invited to compete for the Potter Cup, a single elimination draw to crown the National Champion. How do the rankings for the 2014 Potter Cup compare to the rankings obtained by using our Bradley-Terry model?