KNN Model Data Analysis
Author: goude2017 • December 18, 2017
...
to compare with the models built by filling in the missing data as discussed above. This dataset had 3,567 records, including a header row.
3. Data Model Analysis
• Linear Regression was not considered because it predicts a continuous numerical output, while our target (evergreen or not) is categorical.
• The Logistic Regression model performed about as well as the Tree model. However, because the Tree is easier to explain, we preferred it to Logistic Regression.
• We built a Logistic Regression model using only the significant variables from the Tree model. The new Logistic Regression performed worse than the first. This is not surprising, as many of the significant variables in the original Logistic Regression did not match those in the Tree.
• We are least comfortable using the Naive Bayes model because it does not account for any of the numeric inputs; only 3 of the 20 variables are considered in this model. Its performance was also the worst of all the models tested.
• Between the Tree and the KNN model, the KNN model had much better results in training and essentially equivalent results in validation. While the Tree has an easy-to-explain progression, it is a bit too simplistic: after splitting on Non_Markup_Score it splits on the Alchemy_Category values Recreation and Business. This is not very helpful to StumbleUpon, because the recommendation would simply be to promote only recreation and business websites.
• Based on the excellent results on the training data, we feel the KNN model has great potential and is the best model to use (a sketch of the model comparison follows this list).
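As a supplement, the sketch below shows roughly how this model comparison could be reproduced outside of XLMiner using Python and scikit-learn. The file name, the "evergreen" label column, the predictor handling, and the 60/40 partition are illustrative assumptions, not the exact XLMiner settings used in this analysis.

    # Sketch only: reproduce the four-model comparison with scikit-learn.
    # File name, label column, and partition sizes are assumptions.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    df = pd.read_csv("stumbleupon_clean.csv")   # assumed cleaned dataset
    X = df.drop(columns=["evergreen"])          # predictors assumed numeric
    y = df["evergreen"]                         # (categories already dummy-coded)

    # 60/40 training/validation partition, similar to a standard XLMiner split
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, test_size=0.4, random_state=1)

    # Standardize the inputs; KNN is distance-based, so scale matters
    scaler = StandardScaler().fit(X_train)
    X_train_s = scaler.transform(X_train)
    X_valid_s = scaler.transform(X_valid)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Classification Tree": DecisionTreeClassifier(max_depth=5, random_state=1),
        "Naive Bayes": GaussianNB(),
        "KNN (k=9)": KNeighborsClassifier(n_neighbors=9),
    }

    for name, model in models.items():
        model.fit(X_train_s, y_train)
        print(f"{name:20s} training accuracy={model.score(X_train_s, y_train):.3f} "
              f"validation accuracy={model.score(X_valid_s, y_valid):.3f}")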
4. Best Model
4.1. KNN Analysis
• In this model we allowed XLMiner to normalize the data, since KNN is distance-based and sensitive to the scale of the inputs.
• We scored values of k from 1 to 10; k = 9 was chosen, giving an overall error of 36.8% on validation and 26% on training (see the first sketch after this list).
• The question StumbleUpon may be asking is whether it is better to correctly predict successes (true positives) at the cost of more false positives, or to correctly predict failures (true negatives) at the cost of more false negatives. The cutoff probability shifts the results one way or the other depending on StumbleUpon’s answer to this question.
• There is a strong possibility that the KNN model is overfit, as the validation results did not closely match the training results.
• This model leaves the user somewhat blind to why it works: there are no coefficients or odds ratios to suggest the importance of each variable.
• The sensitivity of this model is 0.69; that is, 69% of the websites that are actually evergreen are correctly predicted to be evergreen.
• The specificity of this model is 0.58; that is, only 58% of the websites that are actually not evergreen are correctly predicted to be not evergreen. This number is a bit low and suggests we should look at ways to better identify the websites that are not evergreen.
• We could potentially combine the KNN model with Alchemy_Category. To further predict whether a website is evergreen, we could first split the websites by Alchemy_Category and then use the KNN model within each category. While this sounds intriguing, it was not pursued due to its complexity: there are 12 categories in Alchemy_Category and not enough data in each category to build a model.
• We also tried dimension reduction using the Tree to select variables (see the second sketch after this list). This technique improved the overall error rate in training to 28.71% from 28.9%, but in validation the overall error rate worsened from 36.1% to 38.1%. Therefore, while slower to execute than the reduced model, the full model has better performance and should be used.
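To make the sensitivity, specificity, and cutoff-probability points above concrete, the first sketch below computes the validation confusion matrix for the k = 9 KNN model and shows how shifting the cutoff trades one kind of error for the other. It reuses the (assumed) variable names from the comparison sketch in Section 3.

    # Sketch only: validation confusion matrix, sensitivity, specificity, and
    # the effect of the cutoff probability for the k = 9 KNN model.
    # Reuses X_train_s, X_valid_s, y_train, y_valid from the earlier sketch.
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix

    knn = KNeighborsClassifier(n_neighbors=9).fit(X_train_s, y_train)

    # Default cutoff of 0.5: with k = 9, "evergreen" wins when at least
    # 5 of the 9 nearest neighbors are evergreen
    tn, fp, fn, tp = confusion_matrix(y_valid, knn.predict(X_valid_s)).ravel()
    print("sensitivity:", tp / (tp + fn))   # share of true evergreens caught
    print("specificity:", tn / (tn + fp))   # share of true non-evergreens caught

    # A lower cutoff predicts "evergreen" more often (higher sensitivity,
    # lower specificity); a higher cutoff does the opposite
    for cutoff in (0.3, 0.5, 0.7):
        pred = (knn.predict_proba(X_valid_s)[:, 1] >= cutoff).astype(int)
        tn, fp, fn, tp = confusion_matrix(y_valid, pred).ravel()
        print(f"cutoff={cutoff}: sensitivity={tp / (tp + fn):.2f}, "
              f"specificity={tn / (tn + fp):.2f}")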
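The second sketch illustrates the tree-based dimension reduction described in the last bullet: keep only the predictors the classification tree ranks as most important, refit KNN on that reduced set, and compare validation error with the full model. Keeping the top seven predictors is an illustrative assumption.

    # Sketch only: tree-based dimension reduction, then refit KNN on the
    # reduced variable set and compare validation error with the full model.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier

    tree = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_train_s, y_train)
    top = np.argsort(tree.feature_importances_)[::-1][:7]   # top 7 is an assumption

    knn_full = KNeighborsClassifier(n_neighbors=9).fit(X_train_s, y_train)
    knn_reduced = KNeighborsClassifier(n_neighbors=9).fit(X_train_s[:, top], y_train)

    print("full model validation error:   ", 1 - knn_full.score(X_valid_s, y_valid))
    print("reduced model validation error:", 1 - knn_reduced.score(X_valid_s[:, top], y_valid))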
4.2. Classification Tree
Variables in order of importance (a sketch reproducing these splits follows the list):
a) Non-Markup - the number of alphanumeric characters on the page - Split at 2527
b) Alchemy_Recreation
c) Alchemy_Business
d) Link_Word_score - percentage of words on the page that are hyperlinks - Split at 20.5
e) Number of words in URL - Split at 6.5 - less than 6.5 is evergreen, more is not.
f) Average Link Size - average number of words in each link - Split at 2.8 - more than 2.8 is not evergreen
g) Common Link Ratio 4 - if Average Link Size is less than 2.8, Common Link Ratio 4 is split at 0.06; more than 0.06 is evergreen
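The sketch below shows how a shallow classification tree could be fit and its split rules printed for comparison with the ordering above. The feature names are assumptions based on the variable descriptions in this report, and X_train / y_train are the (assumed) partitions from the earlier sketches.

    # Sketch only: fit a shallow classification tree and print its split rules
    # so they can be compared with the variable ordering listed above.
    from sklearn.tree import DecisionTreeClassifier, export_text

    features = ["non_markup", "alchemy_recreation", "alchemy_business",
                "link_word_score", "num_words_in_url", "avg_link_size",
                "common_link_ratio_4"]            # assumed column names

    tree = DecisionTreeClassifier(max_depth=4, random_state=1)
    tree.fit(X_train[features], y_train)          # unscaled data; trees do not need scaling

    # Each line shows the variable and split value used at a node,
    # e.g. "non_markup <= 2527.0"
    print(export_text(tree, feature_names=features))

    # Variables ranked by how much they reduce impurity in this fitted tree
    for name, importance in sorted(zip(features, tree.feature_importances_),
                                   key=lambda pair: -pair[1]):
        print(f"{name:22s} {importance:.3f}")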
...