Innovation Challenge:
PepsiCo Data Science
This Challenge is
CLOSED

Key Dates
Challenge Began
September 25, 2020
Solutions Due
October 14, 2020
Virtual Pitch
November 9, 2020
Winners Announced
November 16, 2020
Create the Unexpected
From September 25th to October 14th, 2020, over 250 innovators from 38 U.S. states signed up for the challenge, in which they were tasked with developing a model that predicts the effect of growing location, soil type, fertilizer, and crop parameters associated with growth and development on product assessment. PepsiCo and the New York Academy of Sciences invited solvers to participate for a chance to join the PepsiCo R&D Intern Team in the summer of 2021. The winning solutions, created by Md Taufeeq Uddin and Blake Bullwinkel, were both meticulously crafted through data exploration, data carpentry, and model training. You can learn more about the winning solutions here.
The Challenge
Solvers were asked to use analytics software of their choosing (including but not limited to R, Python, and MATLAB) to create a predictive model based on the Crop and Grain dataset provided by PepsiCo. Read the full challenge statement, including the question and background, here.
How It Works
Webinar
After signing up to participate, solvers were invited to register for the webinar A Recruiter's Perspective: Leveraging STEM Skills to Meet the Needs of Industry, where they heard from a panel of PepsiCo R&D, data science, and HR leaders on PepsiCo's creative approaches to data science and Research & Development, and on how a STEM-proficient workforce is leading their innovation efforts.
Challenge
On September 25th, 2020, participants received an exclusive link to the Crop and Grain dataset along with the challenge survey. Solvers worked independently to develop a model that predicted the effect of growing location, soil type, fertilizer, and crop parameters associated with growth and development on product assessment.
Virtual Pitch
The top five finalists were invited to a virtual pitch session with the Challenge Judges as well as members of the Academy's challenge design team. Finalists were given 10 minutes each to present their solution and model design, followed by a judges' Q&A.
Grand Prize Winners
Md Taufeeq Uddin, University of South Florida
The raw data was first pre-processed to extract the information relevant to the challenge goal: assessing the quality of agricultural products using product and assessment information, geographic location, and weather data. Second, new features were engineered from the assessment types, weather, and location data. Finally, the data was fed to a random forest (RF) regressor to predict the assessment score. The model was validated using a cross-validation strategy and obtained impressive results on normalized RMSE (root mean square error) and Spearman rank correlation coefficient metrics. In terms of interpretation, the importance scores of the RF model showed that the features created from the assessment types made the largest contribution to predicting the target assessment score, with the product growth stage and weather data also making fair contributions.
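The core of this pipeline — a random forest regressor validated with cross-validation and scored on normalized RMSE and Spearman rank correlation — can be sketched roughly as follows. Since the Crop and Grain dataset is not public, synthetic stand-in data is used, and scikit-learn/SciPy stand in for whatever tooling the winner actually used; this is an illustrative sketch, not the winning code.

```python
# Illustrative sketch: RF regressor + cross-validated predictions,
# scored with normalized RMSE and Spearman rank correlation.
# The features here are synthetic stand-ins for the real dataset's
# assessment-type, growth-stage, and weather features.
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                      # stand-in features
y = X @ rng.normal(size=5) + rng.normal(scale=0.3, size=n)

model = RandomForestRegressor(n_estimators=200, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
pred = cross_val_predict(model, X, y, cv=cv)     # out-of-fold predictions

rmse = np.sqrt(np.mean((y - pred) ** 2))
nrmse = rmse / (y.max() - y.min())               # normalized RMSE
rho, _ = spearmanr(y, pred)                      # Spearman rank correlation
print(f"nRMSE={nrmse:.3f}  Spearman={rho:.3f}")

# Feature importances (the basis for the interpretation step above)
model.fit(X, y)
print(model.feature_importances_)
```

The importance scores printed at the end are what would support the kind of interpretation described — ranking which feature groups contribute most to the predicted assessment score.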
Blake Bullwinkel, Harvard University
To predict the assessment score of crops as accurately as possible, a variety of regression models were considered, including multiple linear regression, decision tree, and random forest models. First, the crop and grain data was combined with site-specific data, weather data was incorporated using rolling averages, and categorical variables were converted to dummy variables using one-hot encoding. Second, the pre-processed data was split into train (80%) and test (20%) sets, and a baseline multiple linear regression model using all the features achieved an R² of 0.8357 on the test set. Incorporating interaction terms yielded marginal improvements, and applying a Yeo-Johnson transformation to the response increased the test R² to 0.8886. Finally, decision tree and random forest models tuned with 10-fold cross-validation performed even better, the random forest achieving a test R² of 0.9815. All models indicated that assessment type was the most important feature for predicting assessment score.
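The workflow described — one-hot encoding, an 80/20 split, a linear-regression baseline, a Yeo-Johnson transform of the response, and a random forest — can be sketched as below. The dataset and its column names are not public, so synthetic data with placeholder feature names is used, and the reported R² values will not be reproduced; scikit-learn is an assumed choice of tooling.

```python
# Illustrative sketch of the described workflow on synthetic data.
# Column names ("assessment_type", "temp_rolling_avg", "soil_ph")
# are placeholders, not the real dataset's schema.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
n = 800
df = pd.DataFrame({
    "assessment_type": rng.choice(["moisture", "protein", "yield"], n),
    "temp_rolling_avg": rng.normal(20, 5, n),   # e.g. rolling-average weather
    "soil_ph": rng.normal(6.5, 0.5, n),
})
# One-hot encode the categorical variable
X = pd.get_dummies(df, columns=["assessment_type"])
y = (2.0 * X["assessment_type_protein"] + 0.5 * X["temp_rolling_avg"]
     + rng.normal(scale=1.0, size=n))

# 80/20 train/test split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

# Baseline: multiple linear regression on all features
lin = LinearRegression().fit(X_tr, y_tr)
print("linear test R2:", r2_score(y_te, lin.predict(X_te)))

# Yeo-Johnson transform of the response, then refit and invert
pt = PowerTransformer(method="yeo-johnson")
y_tr_t = pt.fit_transform(y_tr.to_numpy().reshape(-1, 1)).ravel()
lin_t = LinearRegression().fit(X_tr, y_tr_t)
pred = pt.inverse_transform(lin_t.predict(X_te).reshape(-1, 1)).ravel()
print("transformed test R2:", r2_score(y_te, pred))

# Random forest (hyperparameters would be tuned via 10-fold CV)
rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)
print("forest test R2:", r2_score(y_te, rf.predict(X_te)))
```

Evaluating every model on the same held-out 20% is what makes the R² figures comparable across the baseline, the transformed model, and the forest.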
Finalists
- Christopher Cammilleri, Rensselaer Polytechnic Institute
- Alexander Shen, University of Michigan
- Sam Tauke, University of Wisconsin-Madison