Using Data Mining Tools & Concepts to Beat the Spread
Author: Sharon • January 14, 2018 • 2,700 Words (11 Pages) • 682 Views
...
Contents:
- Passing Attempts / Completions / Yards
- Rushing Attempts / Completions / Yards
- Scoring Details
- Turnovers (Both Fumbles & Interceptions, both allowed and forced)
- Scoring Stats
- Table Size: 67 columns / 1600 rows (team-game pairs)
- General Team Details (Table name: team_info)
Data Sources:
- http://www.cfbstats.com
Description: This table links general details about each football team to the unique team ID used for this project
Contents:
- Team ID (Primary Key)
- Team Name
- Home Stadium Location
- Conference ID
- Table Size: 5 columns / 120 rows (FBS teams)
- Player Injury Details (Table name: player_injuries)
Data Sources:
- http://www.collegeinjuryreport.com
Description: This table was created to store information about each reported injury throughout the 2011 season
Contents:
- Player Name
- Player Position & Team
- Seriousness of Injury
- Status for Next Game
- Date of Injury
- Player Info (Starter vs Non-Starter, Key-Player vs Non key-Player)
- Table Size: 11 columns / 20,000 rows (weekly injury reports)
- College Football Rankings (Table name: rankings)
Data Sources:
- http://www.collegepollarchive.com
Description: Contains college football poll rankings for each team and week during the 2011 season
Contents:
- AP Poll - Rankings
- Coaches Poll - Rankings
- Harris Poll - Rankings
- BCS Poll - Rankings
- Table Size: 7 columns / 379 rows (weekly ranked teams)
3.2 Data preprocessing
Three key data preprocessing steps were required. First, to create the aggregate data set needed for the task, the tables in the relational database had to be linked. For instance, after data collection was complete, one table held the college football rankings for each week of the 2011 season, while another held detailed game statistics for each team in every game of the season. Each weekly poll needed to be paired with the games played the following week. The main difficulty was that different websites identify the same team in different ways. One website may use a university's full name, such as “University of Southern California”, while another may use a shortened form, such as “USC” or “Southern California”. The first preprocessing step therefore involved reconciling these differences so that team names were consistent across all tables.
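The name reconciliation and poll pairing described above can be sketched in a few lines of Python. This is only an illustration, not the project's actual code: the alias table, the function names, the dictionary layout, and the choice of "Southern California" as the canonical spelling are all assumptions made for the example.

```python
# Hypothetical alias table: every spelling seen on a source website maps to
# one canonical team name. A real table would cover all 120 FBS teams.
TEAM_ALIASES = {
    "university of southern california": "Southern California",
    "usc": "Southern California",
    "southern california": "Southern California",
}


def canonical_team_name(raw_name):
    """Return the canonical name for a team, ignoring case and extra spaces."""
    key = " ".join(raw_name.lower().split())
    if key not in TEAM_ALIASES:
        # Fail loudly so an unmapped spelling is noticed instead of silently
        # producing a row that never joins with the other tables.
        raise KeyError(f"no canonical name for {raw_name!r}")
    return TEAM_ALIASES[key]


def attach_rankings(games, rankings):
    """Pair each game with the poll published the week before it.

    games:    list of dicts with 'team' and 'week' keys (assumed layout)
    rankings: dict mapping (canonical team name, poll week) -> rank
    """
    paired = []
    for game in games:
        team = canonical_team_name(game["team"])
        # The poll from week w is paired with games played in week w + 1.
        rank = rankings.get((team, game["week"] - 1))  # None if unranked
        paired.append({**game, "team": team, "rank": rank})
    return paired
```

Raising an error on an unknown spelling is a deliberate choice in this sketch: a silent fallback would hide exactly the kind of mismatch this step is meant to remove.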
Second, recall that the primary goal of this project was to accurately predict the spread outcome of a game from historical game statistics for both teams. Since the raw data contained only information about the game being played (as opposed to data from previous games), additional work was needed to summarize the results of the games played before the game being analyzed. To perform this task, an additional PHP script was created to compute, for each game in the data set, each team's averages over the previous 1, 3, and 5 games. These values were then inserted into a new MySQL table called GamePrediction.
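The trailing averages can be illustrated with a short Python sketch. The original work used a PHP script that is not reproduced here, so the function name, the output column names, and the handling of early-season games (averaging whatever prior games exist, and returning None when there are none) are assumptions.

```python
from statistics import mean

WINDOWS = (1, 3, 5)  # window sizes named in the text


def trailing_averages(values, windows=WINDOWS):
    """Compute averages of one statistic over the games before each game.

    values: one team's per-game values for a single statistic, in
            chronological order. The current game is never included in its
            own average, so only information known before kickoff is used.
    """
    rows = []
    for i in range(len(values)):
        row = {}
        for w in windows:
            prior = values[max(0, i - w):i]  # up to w games before game i
            # Assumption: with fewer than w prior games, average what exists.
            row[f"avg_last_{w}"] = mean(prior) if prior else None
        rows.append(row)
    return rows
```

For example, given rushing yards of 10, 20, 30, and 40 over four games, the row for the fourth game averages the first three games for both the 3-game and 5-game windows, since only three earlier games exist.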
[pic 1]
Third, the preprocessing step described in the previous paragraph rests on the premise that a team’s performance in a future game is highly correlated with its performance in previous games during the same season. However, several events leading up to a game can weaken this premise. One is the case in which several key players on a team are injured within a short period before the game being predicted, reducing the team’s performance capability. The spread for a given game will typically take key injuries into account; if the data set does not incorporate this information, the classification will likely have a higher error rate. Because of this, a key part of this project involved using injury reports to remove games from the data set where the number of pre-game injuries for one of the two teams exceeded a particular threshold. I found the most effective threshold to be 2. This data cleaning step can be summarized as follows: for each game, count the number of “relevant” pre-game injuries to key players of either team. An injury is considered “relevant” if it involves a key starting player, it originally occurred within 2 weeks prior to the date of the game being predicted, and the player’s status for the game is “Doubtful”, “Out”, or “Out for Year”. Games whose count exceeds the threshold are removed. By doing this, I hope to eliminate games that have a higher chance of being incorrectly classified, which should result in a lower overall classification error rate.
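The injury filter can be sketched as follows. The field names and record layout are hypothetical, and two details are assumptions the text leaves open: injuries are counted per team rather than summed across both teams, and the 2-week window includes an injury dated exactly 14 days before the game but excludes one dated on game day.

```python
from datetime import date, timedelta

EXCLUDING_STATUSES = {"Doubtful", "Out", "Out for Year"}
LOOKBACK = timedelta(weeks=2)  # "within 2 weeks prior to the date of the game"
INJURY_THRESHOLD = 2           # threshold reported as most effective


def is_relevant(injury, game_date):
    """True if the injury meets all three relevance criteria from the text."""
    days_before = game_date - injury["injury_date"]
    return (
        injury["is_key_starter"]
        and injury["status"] in EXCLUDING_STATUSES
        and timedelta(0) < days_before <= LOOKBACK
    )


def keep_game(game, injuries, threshold=INJURY_THRESHOLD):
    """Return False if either team's relevant injury count exceeds threshold."""
    for team in (game["home"], game["away"]):
        count = sum(
            1
            for injury in injuries
            if injury["team"] == team and is_relevant(injury, game["date"])
        )
        if count > threshold:
            return False
    return True
```

Under these assumptions, a game is dropped only when one team has three or more relevant injuries; a team with exactly two stays in the data set.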
[pic 2]
- Data Mining Methods
Once preprocessing was complete and the data was in a format suitable for WEKA, the next step was to train a decision tree in WEKA. The GamePrediction table was exported from MySQL in comma-separated values (CSV) format, which was then
...