Revenue Generation Prediction

Summary

  • The goal is to predict whether revenue will be generated when a customer visits a website page.
  • Based on the features available for a website, we need to predict whether a customer's visit will result in revenue.
  • By analyzing a customer's behavior pattern (e.g. the type of product searched for, the duration of the stay), we can identify the features on which the target variable depends most strongly.
  • Visual analysis was used to refine the features and streamline prediction of the target, in this case revenue generation.
  • A classification algorithm, K-nearest neighbors (KNN), was used to predict the outcome.
  • The model reached an accuracy of 85.915%.

On a website, the amount of traffic helps determine the revenue generated, so it pays to keep an eye on how customers visit each webpage.

Like any other dataset, this one contains various features that influence the outcome.

Number of customers in various features

Based on the bar charts above, we can determine:

  • Which features have the highest number of customer visits, which helps us understand the customer trend.
  • That Product Related queries are the most sought after.

Determining share of categorical features

The pie chart above shows the dominance of a single category over the rest.

Determining the correlation

Correlation plays an important role in identifying the features that matter most. Changes in these specific features affect the target greatly.

Correlation in continuous features

The following observations are made:

  • Revenue is positively correlated with page values.
  • Revenue is negatively correlated with exit rates and bounce rates.
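As a rough illustration of how such observations can be checked, the sketch below computes the Pearson correlation coefficient by hand. The session values are hypothetical toy numbers invented for this example, not rows from the project's dataset, and the variable names are assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy sessions, invented for illustration only.
page_values = [0.0, 0.0, 12.5, 30.0, 0.0, 45.0]
exit_rates = [0.20, 0.15, 0.05, 0.02, 0.18, 0.01]
revenue = [0, 0, 1, 1, 0, 1]  # 1 = the visit generated revenue

r_page = pearson(page_values, revenue)
r_exit = pearson(exit_rates, revenue)
print(f"page values vs revenue: {r_page:+.2f}")
print(f"exit rates vs revenue:  {r_exit:+.2f}")
```

A positive coefficient means the feature tends to rise with revenue, and a negative one means it tends to fall. Neither implies strict proportionality.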

Correlation heatmap

Modeling and Prediction

This is a classification problem, so a K-nearest neighbors (KNN) classifier can be used.
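To make the idea concrete, here is a minimal from-scratch sketch of KNN classification: find the k training points closest to a query and take a majority vote over their labels. It is not the project's actual code. The function name `knn_predict` and the two toy features are assumptions made for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    neighbors = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

# Hypothetical, already-scaled features: (page value, exit rate).
train_X = [(0.0, 0.9), (0.1, 0.8), (0.05, 0.7), (0.8, 0.1), (0.9, 0.05), (0.7, 0.2)]
train_y = [0, 0, 0, 1, 1, 1]  # 1 = the visit generated revenue

high_value_visit = knn_predict(train_X, train_y, (0.85, 0.1), k=3)
low_value_visit = knn_predict(train_X, train_y, (0.05, 0.85), k=3)
print(high_value_visit, low_value_visit)
```

Because the prediction depends on raw distances, features on very different scales are usually standardized first. With an even k, ties are possible and need a tie-breaking rule.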

KNN can be used for both classification and regression predictive problems. However, it is more widely used in classification problems in the industry. To evaluate any technique we generally look at 3 important aspects:

  • Ease of interpreting the output
  • Calculation time
  • Predictive Power

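KNN finds the K nearest neighbors of a point using a distance metric. We use Euclidean distance, since it is the most popular choice; other metrics that can be used include Chebyshev and cosine distance. The value of K itself is a hyperparameter, typically chosen by comparing model performance across candidate values.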
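For reference, the three metrics named above can be written in a few lines each. This is an illustrative sketch with hypothetical function names, and it defines cosine distance as one minus cosine similarity, which is a common convention.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chebyshev(a, b):
    """Largest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; undefined if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = (0.0, 0.0), (3.0, 4.0)
print(euclidean(a, b))                          # 5.0
print(chebyshev(a, b))                          # 4.0
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))  # 1.0
```

Euclidean and Chebyshev distances depend on magnitude, while cosine distance only compares direction, so the choice of metric can change which neighbors count as nearest.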

Optimal number of neighbors

Conclusion

  • We determine that the optimal value of K is 4.
  • The model performs with an accuracy of approximately 86% (85.915%).
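One common way to arrive at a value of K is to score several candidates on held-out data and keep the best one. The write-up does not state the exact procedure used, so the sketch below is an assumption. It runs on synthetic data generated for illustration, and all names in it are hypothetical.

```python
import random
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of held-out rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Hypothetical synthetic data: two noisy 2-D clusters standing in for scaled features.
rng = random.Random(0)
data = [
    ((rng.gauss(center, 0.6), rng.gauss(center, 0.6)), label)
    for label, center in ((0, 0.0), (1, 1.0))
    for _ in range(100)
]
rng.shuffle(data)
train, test = data[:150], data[150:]

# Score each candidate K on the held-out rows and keep the most accurate one.
scores = {k: accuracy(train, test, k) for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print(f"best K = {best_k}, hold-out accuracy = {scores[best_k]:.3f}")
```

A single hold-out split can be noisy, so cross-validation is often preferred when the dataset is small.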

Classification report

Link to GitHub repository