TZ Gaming: Optimal Targeting of Mobile Ads

Logistic Regression, Permutation Importance, Prediction Plots, Pseudo R-Squared, Chi-Square Test, Correlation, Decile Analysis, Gain Curves, Confusion Matrix
Author

Mario Nonog

Published

August 10, 2025

Problem Description:

As a developer of games for mobile devices, TZ gaming has achieved strong growth in its customer base. A prominent source of new customers has been ads displayed through the Vneta ad-network. A mobile-ad network is a technology platform that serves as a broker between (1) app developers (or publishers) looking to sell ad space and (2) a group of advertisers.

App developers sell “impressions”, i.e., slots where an ad can be shown, through the Vneta network to companies such as TZ gaming looking to advertise to app users. Vneta acts as a broker for 50-60 million impressions per day. TZ gaming uses ads to appeal to prospective customers for its games, generally short (15 sec) video ads that emphasize the dynamic nature of the games. In the past, TZ has been able to approximately break even on ad spend with Vneta when counting only the benefits that can be directly attributed to ad click-through. Many senior executives at TZ believe there are additional, longer-term benefits from these ads, such as brand awareness, that are harder to quantify.

Currently, TZ has access to very limited data from Vneta. Matt Huateng, the CEO of TZ gaming, is intrigued by the potential for data science to enhance the efficiency of targeted advertising on mobile devices. Specifically, two options are under consideration: (1) Buy access to additional data from Vneta and use TZ’s analytics team to build targeting models or (2) Subscribe to Vneta’s analytics consultancy service, which provides impression-level click-through rate predictions based on Vneta’s proprietary data and algorithms.

Vneta has shared behavioral information linked to 115,488 recent impressions used to show TZ ads and has also provided a set of predictions based on its own (proprietary) algorithm. Matt is unsure whether the consulting services offered by Vneta will be worth the money for future ad campaigns and has asked you to do some initial analyses on the provided data and compare the generated predictions to Vneta’s recommendations. The following targeting options will be evaluated to determine the best path forward.

Options:

  1. Spam all prospects
  2. Continue with the current targeting approach
  3. Use predictions from a logistic regression model for ad targeting
  4. Use predictions generated by Vneta for ad targeting

Assumptions

The assumptions used for the analysis are as follows:

  • Targeting of impressions to consumers covered by the Vneta ad-network to date has been (approximately) random
  • Cost per 1,000 video impressions (CPM) is $10
  • Conversion to sign-up as a TZ game player after clicking on an ad is 5%
  • The expected CLV of customers that sign up with TZ after clicking on an ad is approximately $25
  • The price charged for the data by Vneta is $50K
  • The price charged for the data science consulting services by Vneta is $150K
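These assumptions pin down the break-even click-through rate directly; a quick arithmetic check (numbers taken from the assumptions above):

```python
# Break-even CTR implied by the stated assumptions
cpm = 10.0                          # cost per 1,000 impressions ($)
cost_per_impression = cpm / 1000    # $0.01 per impression
conversion_rate = 0.05              # click -> sign-up conversion
clv = 25.0                          # expected CLV per sign-up ($)

value_per_click = conversion_rate * clv               # $1.25 expected value per click
breakeven_ctr = cost_per_impression / value_per_click
print(f"{breakeven_ctr:.3%}")                         # 0.800%
```

A CTR of 0.8% is the point at which ad spend equals directly attributable benefits, consistent with TZ roughly breaking even to date.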

Approach:

• Use the 87,535 rows in the data with “training == ‘train’” to estimate different models. Then generate predictions for all 115,488 rows in the dataset • Options 1-4 should be evaluated only on the predictions generated for the 27,953 rows in the data with “training == ‘test’”. These are the observations that were not used to estimate your model • Extrapolate the cost and benefits for options 1-4 above for an upcoming advertising campaign where TZ will commit to purchase 20-million impressions from Vneta

TZ gaming has decided to use logistic regression for targeting. This is a powerful and widely used tool to model consumer response. It is similar to linear regression but the key difference is that the response variable (target) is binary (e.g., click or no-click) rather than continuous. For each impression, the logistic regression model will predict the probability of click-through, which can be used for ad targeting. Like linear regression, you can include both continuous and categorical predictors in your model as explanatory variables (features). Matt is eager to assess the value of logistic regression as a method to predict ad click-through and target prospects and has asked you to complete the following analyses.
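The core of logistic regression is the inverse-logit link, which converts a linear score into a probability; a minimal sketch:

```python
import numpy as np

def inv_logit(score):
    # Maps a linear score x'beta from (-inf, inf) to a probability in (0, 1)
    return 1 / (1 + np.exp(-score))

# Higher scores imply higher predicted click probability
print(inv_logit(0.0))    # 0.5
print(inv_logit(-3.5))   # about 0.029, a low click probability
```
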


Each row in the tz_gaming dataset represents an impression. For each row (impression), we have data on 22 variables. All explanatory variables were created by Vneta based on one month of tracking history for users, apps, and ads. The available variables are described below.

  • training – Dummy variable that splits the dataset into a training (“train”) and a test (“test”) set
  • inum – Impression number
  • click – Click indicator for the TZ ad served in the impression. Equals “yes” if the ad was clicked and “no” otherwise
  • time – The hour of the day in which the impression occurred (1-24). For example, “2” indicates the impression occurred between 1 am and 2 am
  • time_fct – Same as time but coded as a categorical variable
  • app – The app in which the impression was shown. Ranges from 1 to 49
  • mobile_os – Customer’s mobile OS
  • impup – Number of past impressions the user has seen in the app
  • clup – Number of past impressions the user has clicked on in the app
  • ctrup – Past CTR (Click-Through Rate) (x 100) for the user in the app
  • impua – Number of past impressions of the TZ ad that the user has seen across all apps
  • clua – Number of past impressions of the TZ ad that the user has clicked on across all apps
  • ctrua – Past CTR (x 100) of the TZ ad by the user across all apps
  • imput – Number of past impressions the user has seen in the hour
  • clut – Number of past impressions the user has clicked on in the hour
  • ctrut – Past CTR (x 100) of the user in the hour
  • imppat – Number of past impressions that showed the TZ ad in the app in the hour
  • clpat – Number of past clicks the TZ ad has received in the app in the hour
  • ctrpat – Past CTR (x 100) of the TZ ad in the app in the hour
  • rnd – Simulated data from a normal distribution with mean 0 and a standard deviation of 1
  • pred_vneta – Predicted probability of a click per impression, generated by Vneta’s proprietary machine learning algorithm
  • id – Anonymized user ID

Note that there is a clear relationship between the impression, click, and CTR variables within a stratum. Since the CTR variables are reported × 100:

  • ctrup = 100 × clup/impup
  • ctrua = 100 × clua/impua
  • ctrut = 100 × clut/imput
  • ctrpat = 100 × clpat/imppat
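These identities can be sanity-checked against the data; an illustrative check using values copied from the first row of the data preview shown below (note the × 100 scaling of the CTR columns):

```python
import pandas as pd

# Values from the first row of tz_gaming (impression I7)
row = pd.Series({"clup": 2, "impup": 439, "ctrup": 0.455581})

recomputed = 100 * row["clup"] / row["impup"]  # 100 x 2/439
print(abs(recomputed - row["ctrup"]) < 1e-4)   # True
```
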

The trailing letters of a feature name indicate the sources of variation in the variable:

  • u — denotes user
  • t — denotes time
  • p — denotes app
  • a — denotes ad

Logistic Regression

import pandas as pd
tz_gaming = pd.read_parquet("data/tz_gaming.parquet")
print(tz_gaming)
       training     inum click  time time_fct    app mobile_os  impup  clup  \
0         train       I7    no     9        9   app8       ios    439     2   
1         train      I23    no    15       15   app1       ios     64     0   
2         train      I28    no    12       12   app5       ios     80     0   
3         train      I30    no    19       19   app1       ios     25     0   
4         train      I35    no    24       24   app1   android   3834    29   
...         ...      ...   ...   ...      ...    ...       ...    ...   ...   
115483     test  I399982    no    21       21   app2       ios   2110     0   
115484     test  I399986    no    17       17  app14   android    291     1   
115485     test  I399991    no    23       23   app1   android    364     3   
115486     test  I399992    no    20       20   app6   android     59     2   
115487     test  I399994    no    18       18   app1       ios    498     7   

           ctrup  ...     ctrua  imput  clut     ctrut  imppat  clpat  \
0       0.455581  ...  0.000000     25     0  0.000000      71      1   
1       0.000000  ...  0.000000      7     0  0.000000   67312   1069   
2       0.000000  ...  6.578947     94     0  0.000000     331      1   
3       0.000000  ...  0.000000     19     0  0.000000   71114   1001   
4       0.756390  ...  0.689655    329     4  1.215805  183852   2317   
...          ...  ...       ...    ...   ...       ...     ...    ...   
115483  0.000000  ...  0.000000    158     0  0.000000   23216     19   
115484  0.343643  ...  0.000000     74     1  1.351351    3665     14   
115485  0.824176  ...  4.347826     19     0  0.000000  173353   2292   
115486  3.389831  ...  0.000000     37     1  2.702703    3474     53   
115487  1.405622  ...  4.166667     53     1  1.886792   77884   1201   

          ctrpat       rnd  pred_vneta        id  
0       1.408451 -1.207066    0.003961  id247135  
1       1.588127  0.277429    0.003961  id245079  
2       0.302115  1.084441    0.003961  id927245  
3       1.407599 -2.345698    0.018965  id922188  
4       1.260253  0.429125    0.003961  id355833  
...          ...       ...         ...       ...  
115483  0.081840 -1.852059    0.003961  id847352  
115484  0.381992 -0.296415    0.003961  id457437  
115485  1.322158  0.099201    0.003961  id792352  
115486  1.525619 -0.186421    0.050679  id115678  
115487  1.542037  0.857281    0.003961  id705546  

[115488 rows x 22 columns]
tz_train = tz_gaming[tz_gaming["training"] == "train"].copy()  # .copy() avoids SettingWithCopyWarning when adding columns later
print(tz_train["training"].head())

tz_train
0    train
1    train
2    train
3    train
4    train
Name: training, dtype: object
(tz_train displayed: the same 22 columns as tz_gaming above, first and last five rows shown)

87535 rows × 22 columns

import pyrsm as rsm
clf = rsm.model.logistic(
    data={"tz_train": tz_train},
    rvar="click",
    lev="yes",
    evar=["time_fct", "app", "mobile_os", "impua", "clua", "ctrua"],
)
clf.summary()
Logistic regression (GLM)
Data                 : tz_train
Response variable    : click
Level                : yes
Explanatory variables: time_fct, app, mobile_os, impua, clua, ctrua
Null hyp.: There is no effect of x on click
Alt. hyp.: There is an effect of x on click

                     OR      OR%  coefficient   std.error  z.value p.value     
Intercept         0.029   -97.1%       -3.528       0.197  -17.936  < .001  ***
time_fct[2]       0.622   -37.8%       -0.474       0.321   -1.478   0.139     
time_fct[3]       0.718   -28.2%       -0.332       0.454   -0.730   0.466     
time_fct[4]       0.000  -100.0%      -23.543   42007.161   -0.001     1.0     
time_fct[5]       0.000  -100.0%      -23.721   55229.970   -0.000     1.0     
time_fct[6]       0.349   -65.1%       -1.052       1.021   -1.030   0.303     
time_fct[7]       1.221    22.1%        0.200       0.426    0.468    0.64     
time_fct[8]       1.104    10.4%        0.099       0.296    0.335   0.737     
time_fct[9]       1.029     2.9%        0.029       0.287    0.101    0.92     
time_fct[10]      0.830   -17.0%       -0.187       0.295   -0.633   0.527     
time_fct[11]      0.637   -36.3%       -0.451       0.276   -1.635   0.102     
time_fct[12]      0.874   -12.6%       -0.135       0.280   -0.483   0.629     
time_fct[13]      0.590   -41.0%       -0.528       0.290   -1.823   0.068    .
time_fct[14]      1.099     9.9%        0.094       0.225    0.419   0.675     
time_fct[15]      0.986    -1.4%       -0.014       0.225   -0.062   0.951     
time_fct[16]      1.046     4.6%        0.045       0.233    0.195   0.846     
time_fct[17]      1.014     1.4%        0.014       0.250    0.055   0.956     
time_fct[18]      1.061     6.1%        0.060       0.247    0.241   0.809     
time_fct[19]      1.284    28.4%        0.250       0.233    1.072   0.284     
time_fct[20]      1.224    22.4%        0.202       0.231    0.873   0.382     
time_fct[21]      0.867   -13.3%       -0.142       0.244   -0.584   0.559     
time_fct[22]      0.970    -3.0%       -0.030       0.238   -0.127   0.899     
time_fct[23]      1.044     4.4%        0.043       0.238    0.180   0.857     
time_fct[24]      1.099     9.9%        0.094       0.229    0.410   0.682     
app[app2]         0.124   -87.6%       -2.091       0.229   -9.127  < .001  ***
app[app3]         0.184   -81.6%       -1.694       1.003   -1.689   0.091    .
app[app4]         0.383   -61.7%       -0.960       0.321   -2.988   0.003   **
app[app5]         0.282   -71.8%       -1.264       1.005   -1.258   0.208     
app[app6]         0.651   -34.9%       -0.429       0.175   -2.457   0.014    *
app[app7]         0.596   -40.4%       -0.517       1.007   -0.513   0.608     
app[app8]         0.000  -100.0%      -24.374   71686.172   -0.000     1.0     
app[app9]         0.748   -25.2%       -0.290       0.236   -1.227    0.22     
app[app10]        0.000  -100.0%      -23.891   67058.977   -0.000     1.0     
app[app11]        1.051     5.1%        0.050       0.719    0.069   0.945     
app[app12]        0.713   -28.7%       -0.338       0.224   -1.504   0.133     
app[app13]        2.131   113.1%        0.757       0.204    3.709  < .001  ***
app[app14]        0.218   -78.2%       -1.525       0.451   -3.384  < .001  ***
app[app15]        0.462   -53.8%       -0.772       0.717   -1.077   0.281     
app[app16]        0.251   -74.9%       -1.383       0.711   -1.945   0.052    .
app[app17]        0.882   -11.8%       -0.126       0.724   -0.174   0.862     
app[app18]        0.075   -92.5%       -2.593       1.002   -2.586    0.01   **
app[app19]        0.000  -100.0%      -24.366   92379.731   -0.000     1.0     
app[app20]        0.000  -100.0%      -24.211   71551.198   -0.000     1.0     
app[app21]        0.337   -66.3%       -1.087       0.383   -2.839   0.005   **
app[app22]        0.321   -67.9%       -1.136       1.005   -1.130   0.258     
app[app23]        3.111   211.1%        1.135       0.392    2.898   0.004   **
app[app24]        1.590    59.0%        0.464       0.273    1.701   0.089    .
app[app25]        0.975    -2.5%       -0.025       0.594   -0.042   0.966     
app[app26]        0.000  -100.0%      -24.471   62613.516   -0.000     1.0     
app[app27]        0.202   -79.8%       -1.601       0.711   -2.253   0.024    *
app[app28]        0.328   -67.2%       -1.114       0.581   -1.918   0.055    .
app[app29]        2.634   163.4%        0.969       0.345    2.805   0.005   **
app[app30]        0.000  -100.0%      -24.385   85789.970   -0.000     1.0     
app[app31]        0.246   -75.4%       -1.403       0.710   -1.975   0.048    *
app[app32]        0.000  -100.0%      -23.157   46670.216   -0.000     1.0     
app[app33]        0.513   -48.7%       -0.668       0.338   -1.974   0.048    *
app[app34]        0.000  -100.0%      -24.221   54802.115   -0.000     1.0     
app[app35]        0.226   -77.4%       -1.486       1.004   -1.480   0.139     
app[app36]        0.000  -100.0%      -24.122   76747.449   -0.000     1.0     
app[app37]        0.434   -56.6%       -0.836       0.714   -1.171   0.242     
app[app38]        0.000  -100.0%      -23.864   53394.100   -0.000     1.0     
app[app39]        1.741    74.1%        0.554       0.419    1.322   0.186     
app[app40]        1.005     0.5%        0.005       0.514    0.009   0.993     
app[app41]        0.632   -36.8%       -0.460       0.714   -0.644   0.519     
app[app42]        2.009   100.9%        0.698       0.369    1.889   0.059    .
app[app43]        0.000  -100.0%      -24.466   75014.775   -0.000     1.0     
app[app44]        0.000  -100.0%      -24.238   45702.677   -0.001     1.0     
app[app45]        1.479    47.9%        0.391       0.588    0.665   0.506     
app[app46]        0.296   -70.4%       -1.216       1.005   -1.210   0.226     
app[app47]        0.289   -71.1%       -1.241       1.005   -1.235   0.217     
app[app48]        0.000  -100.0%      -24.030  101834.366   -0.000     1.0     
app[app49]        0.241   -75.9%       -1.424       1.004   -1.418   0.156     
mobile_os[ios]    0.450   -55.0%       -0.799       0.076  -10.440  < .001  ***
mobile_os[other]  0.719   -28.1%       -0.330       0.183   -1.801   0.072    .
impua             0.979    -2.1%       -0.022       0.002  -11.386  < .001  ***
clua              1.295    29.5%        0.259       0.058    4.452  < .001  ***
ctrua             1.022     2.2%        0.021       0.002    9.058  < .001  ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.109
Pseudo R-squared (McFadden adjusted): 0.092
Area under the ROC Curve (AUC): 0.792
Log-likelihood: -3946.072, AIC: 8046.145, BIC: 8768.389
Chi-squared: 968.279, df(76), p.value < 0.001 
Nr obs: 87,535

Interpretation

Odds ratios (OR) for dummy variables are interpreted relative to the omitted base level, here hour 1. For time_fct[2], the model suggests that impressions served in hour 2 have odds of being clicked that are lower by a factor of 0.622, i.e., 37.8% lower than in hour 1, keeping all other variables in the model constant.

Similarly, for time_fct[3], the odds of a click are lower by a factor of 0.718, i.e., 28.2% lower than in hour 1, all else constant.

The remaining categorical coefficients are interpreted the same way, while for the continuous features (impua, clua, ctrua) the OR gives the multiplicative change in the odds per one-unit increase.
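The OR column is simply the exponentiated coefficient; for example, for time_fct[2]:

```python
import math

coefficient = -0.474               # time_fct[2] from the summary above
odds_ratio = math.exp(coefficient)
print(odds_ratio)                  # about 0.622, i.e., a 37.8% drop in the odds
```
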

Chi-Square Critical Value

from scipy.stats import chi2

df = 76
alpha_0_05 = chi2.ppf(1 - 0.05, df)
alpha_0_01 = chi2.ppf(1 - 0.01, df)
print(alpha_0_05, alpha_0_01)
97.35097037903296 107.58254478061235

The model’s chi-squared statistic (968.3) far exceeds both critical values computed above (97.4 at α = 0.05 and 107.6 at α = 0.01). The deviation between the fitted model and the intercept-only (null) model is therefore extremely large, so we reject the null hypothesis that the explanatory variables have no effect on click.
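Equivalently, the p-value for the reported statistic can be computed directly with scipy’s survival function:

```python
from scipy.stats import chi2

# P(X >= 968.279) for a chi-squared distribution with 76 df
p_value = chi2.sf(968.279, df=76)
print(p_value < 0.001)  # True: effectively zero
```
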

Metric                          Value       Interpretation
McFadden pseudo R²              0.109       Modest model fit (0.2-0.4 is considered strong)
McFadden pseudo R² (adjusted)   0.092       Adjusted down for the number of predictors
AUC                             0.792       Good discrimination (0.7-0.8 = good)
Log-likelihood                  -3946.072   Higher (closer to zero) is better
AIC                             8046.145    Lower indicates a better fit
BIC                             8768.389    Like AIC, but penalizes complexity more heavily
Chi-squared (df = 76)           968.279     Strong model signal (p < 0.001)
Nr obs                          87,535      Large sample, good statistical power
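As a cross-check, McFadden’s pseudo R² equals 1 − LL_model/LL_null. The intercept-only log-likelihood of about −4429.7 is inferred from the output rather than reported directly, so treat it as an assumption:

```python
ll_model = -3946.072   # log-likelihood of the fitted model (from the summary)
ll_null = -4429.726    # intercept-only log-likelihood (inferred, not reported directly)

pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 3))  # 0.109, matching the summary
```
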

Based on the p-values, the most significant features include app2, app13, app14, mobile_os[ios], impua, clua, and ctrua. Now let us use a different method to see which variables are the most important. First, however, let us look at the prediction plots for the classifier (clf).

Plot

clf.plot(
    plots="pred", incl=[ "time_fct", "app", "mobile_os", "impua", "clua", "ctrua"]
)

Permutation Importance

clf.plot("vimp")

Interpretation

The app variable has the highest importance because permuting it produces the largest drop in AUC. impua comes second and still contributes substantially to model predictions, followed by mobile_os; the last three, time_fct, clua, and ctrua, have far less impact. The chi-squared statistic of 968 is well above the critical values of 97 and 107, indicating a large deviation between the fitted model and the null model; together with the p-value below 0.001, this confirms that the explanatory variables, taken together, are associated with click. The McFadden pseudo R-squared of 0.109 (0.092 adjusted) measures goodness of fit for the logistic regression, with higher values indicating a better fit; it summarizes how well the explanatory variables account for variation in the response.

Prediction of Click Probability

tz_gaming["pred_logit"]= clf.predict(tz_gaming)["prediction"]
print(tz_gaming.head())
  training inum click  time time_fct   app mobile_os  impup  clup     ctrup  \
0    train   I7    no     9        9  app8       ios    439     2  0.455581   
1    train  I23    no    15       15  app1       ios     64     0  0.000000   
2    train  I28    no    12       12  app5       ios     80     0  0.000000   
3    train  I30    no    19       19  app1       ios     25     0  0.000000   
4    train  I35    no    24       24  app1   android   3834    29  0.756390   

   ...  imput  clut     ctrut  imppat  clpat    ctrpat       rnd  pred_vneta  \
0  ...     25     0  0.000000      71      1  1.408451 -1.207066    0.003961   
1  ...      7     0  0.000000   67312   1069  1.588127  0.277429    0.003961   
2  ...     94     0  0.000000     331      1  0.302115  1.084441    0.003961   
3  ...     19     0  0.000000   71114   1001  1.407599 -2.345698    0.018965   
4  ...    329     4  1.215805  183852   2317  1.260253  0.429125    0.003961   

         id    pred_logit  
0  id247135  3.382977e-13  
1  id245079  1.156355e-02  
2  id927245  2.655311e-03  
3  id922188  1.349420e-02  
4  id355833  1.868222e-03  

[5 rows x 23 columns]
tz_train["pred_logit"]= clf.predict(tz_train)["prediction"]
print(tz_train.head())
  training inum click  time time_fct   app mobile_os  impup  clup     ctrup  \
0    train   I7    no     9        9  app8       ios    439     2  0.455581   
1    train  I23    no    15       15  app1       ios     64     0  0.000000   
2    train  I28    no    12       12  app5       ios     80     0  0.000000   
3    train  I30    no    19       19  app1       ios     25     0  0.000000   
4    train  I35    no    24       24  app1   android   3834    29  0.756390   

   ...  imput  clut     ctrut  imppat  clpat    ctrpat       rnd  pred_vneta  \
0  ...     25     0  0.000000      71      1  1.408451 -1.207066    0.003961   
1  ...      7     0  0.000000   67312   1069  1.588127  0.277429    0.003961   
2  ...     94     0  0.000000     331      1  0.302115  1.084441    0.003961   
3  ...     19     0  0.000000   71114   1001  1.407599 -2.345698    0.018965   
4  ...    329     4  1.215805  183852   2317  1.260253  0.429125    0.003961   

         id    pred_logit  
0  id247135  3.382977e-13  
1  id245079  1.156355e-02  
2  id927245  2.655311e-03  
3  id922188  1.349420e-02  
4  id355833  1.868222e-03  

[5 rows x 23 columns]

Logistic Regression (EVAR: ONLY RND)

clf_rnd = rsm.model.logistic(
    data={"tz_train": tz_train},
    rvar="click",
    lev="yes",
    evar="rnd",
)
clf_rnd.summary()
Logistic regression (GLM)
Data                 : tz_train
Response variable    : click
Level                : yes
Explanatory variables: rnd
Null hyp.: There is no effect of x on click
Alt. hyp.: There is an effect of x on click

              OR     OR%  coefficient  std.error  z.value p.value     
Intercept  0.009  -99.1%       -4.720      0.036 -130.657  < .001  ***
rnd        0.965   -3.5%       -0.036      0.036   -0.986   0.324     

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.0
Pseudo R-squared (McFadden adjusted): -0.0
Area under the ROC Curve (AUC): 0.513
Log-likelihood: -4429.726, AIC: 8863.451, BIC: 8882.211
Chi-squared: 0.972, df(1), p.value 0.324 
Nr obs: 87,535

Prediction of Click Probability (EVAR: ONLY RND)

tz_gaming["pred_rnd"]= clf_rnd.predict(tz_gaming)["prediction"]
print(tz_gaming.head())
  training inum click  time time_fct   app mobile_os  impup  clup     ctrup  \
0    train   I7    no     9        9  app8       ios    439     2  0.455581   
1    train  I23    no    15       15  app1       ios     64     0  0.000000   
2    train  I28    no    12       12  app5       ios     80     0  0.000000   
3    train  I30    no    19       19  app1       ios     25     0  0.000000   
4    train  I35    no    24       24  app1   android   3834    29  0.756390   

   ...  clut     ctrut  imppat  clpat    ctrpat       rnd  pred_vneta  \
0  ...     0  0.000000      71      1  1.408451 -1.207066    0.003961   
1  ...     0  0.000000   67312   1069  1.588127  0.277429    0.003961   
2  ...     0  0.000000     331      1  0.302115  1.084441    0.003961   
3  ...     0  0.000000   71114   1001  1.407599 -2.345698    0.018965   
4  ...     4  1.215805  183852   2317  1.260253  0.429125    0.003961   

         id    pred_logit  pred_rnd  
0  id247135  3.382977e-13  0.009222  
1  id245079  1.156355e-02  0.008751  
2  id927245  2.655311e-03  0.008505  
3  id922188  1.349420e-02  0.009600  
4  id355833  1.868222e-03  0.008704  

[5 rows x 24 columns]
tz_train["pred_rnd"]= clf_rnd.predict(tz_train)["prediction"]
print(tz_train.head())
  training inum click  time time_fct   app mobile_os  impup  clup     ctrup  \
0    train   I7    no     9        9  app8       ios    439     2  0.455581   
1    train  I23    no    15       15  app1       ios     64     0  0.000000   
2    train  I28    no    12       12  app5       ios     80     0  0.000000   
3    train  I30    no    19       19  app1       ios     25     0  0.000000   
4    train  I35    no    24       24  app1   android   3834    29  0.756390   

   ...  clut     ctrut  imppat  clpat    ctrpat       rnd  pred_vneta  \
0  ...     0  0.000000      71      1  1.408451 -1.207066    0.003961   
1  ...     0  0.000000   67312   1069  1.588127  0.277429    0.003961   
2  ...     0  0.000000     331      1  0.302115  1.084441    0.003961   
3  ...     0  0.000000   71114   1001  1.407599 -2.345698    0.018965   
4  ...     4  1.215805  183852   2317  1.260253  0.429125    0.003961   

         id    pred_logit  pred_rnd  
0  id247135  3.382977e-13  0.009222  
1  id245079  1.156355e-02  0.008751  
2  id927245  2.655311e-03  0.008505  
3  id922188  1.349420e-02  0.009600  
4  id355833  1.868222e-03  0.008704  

[5 rows x 24 columns]

Multicollinearity and Omitted Variable Bias

clf_mc1 = rsm.model.logistic(
    data={"tz_train": tz_train},
    rvar="click",
    lev="yes",
    evar=["imppat", "clpat", "ctrpat"],
)
clf_mc1.summary()
Logistic regression (GLM)
Data                 : tz_train
Response variable    : click
Level                : yes
Explanatory variables: imppat, clpat, ctrpat
Null hyp.: There is no effect of x on click
Alt. hyp.: There is an effect of x on click

              OR     OR%  coefficient  std.error  z.value p.value     
Intercept  0.004  -99.6%       -5.419      0.073  -74.156  < .001  ***
imppat     1.000   -0.0%       -0.000      0.000   -4.802  < .001  ***
clpat      1.002    0.2%        0.002      0.000    5.713  < .001  ***
ctrpat     1.615   61.5%        0.479      0.034   13.933  < .001  ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.035
Pseudo R-squared (McFadden adjusted): 0.035
Area under the ROC Curve (AUC): 0.676
Log-likelihood: -4273.088, AIC: 8554.176, BIC: 8591.695
Chi-squared: 314.248, df(3), p.value < 0.001 
Nr obs: 87,535

Chi-Square Critical Value

from scipy.stats import chi2

df = 3
alpha_0_05 = chi2.ppf(1 - 0.05, df)
alpha_0_01 = chi2.ppf(1 - 0.01, df)
print(alpha_0_05, alpha_0_01)
7.8147279032511765 11.34486673014437

Plot

clf_mc1.plot(
    plots="pred", incl=["imppat", "clpat", "ctrpat"]
)

Permutation Importance

clf_mc1.plot(plots="vimp")

Interpretation

For the odds ratios: a one-unit increase in ctrpat (the past click-through rate of the TZ ad in the app in the hour) multiplies the odds of clicking by 1.615, i.e., raises them by 61.5%, while the other variables remain constant. Meanwhile, a one-unit increase in clpat (past clicks in the app in the hour) multiplies the odds by only a factor of 1.002, and imppat (the number of past impressions that showed the TZ ad) has essentially no effect on the odds of clicking.

Using the coefficients together with the plot: an increase in the number of past impressions that showed a TZ ad (imppat) has a negligible negative association (-0.000) with the odds of clicking, which shows in both the regression output and the plot. There is a slight positive coefficient (0.002) between clpat and the odds of clicking, and a considerable positive coefficient (0.479) between ctrpat and the odds of clicking. Both of these relationships are visible in the plot as well.

All p-values are below 0.05, implying that imppat, clpat, and ctrpat each have a significant effect on the odds of clicking. In fact, all three carry three asterisks, indicating significance even at the 0.001 level.

Correlation

tz_train[['imppat', 'clpat', 'ctrpat']].corr()
           imppat     clpat    ctrpat
imppat   1.000000  0.971579  0.344099
clpat    0.971579  1.000000  0.460035
ctrpat   0.344099  0.460035  1.000000
clf_mc2 = rsm.model.logistic(
    data={"tz_train": tz_train},
    rvar="click",
    lev="yes",
    evar=["imppat", "ctrpat"],
)
clf_mc2.summary()
Logistic regression (GLM)
Data                 : tz_train
Response variable    : click
Level                : yes
Explanatory variables: imppat, ctrpat
Null hyp.: There is no effect of x on click
Alt. hyp.: There is an effect of x on click

              OR     OR%  coefficient  std.error  z.value p.value     
Intercept  0.004  -99.6%       -5.529      0.068  -80.814  < .001  ***
imppat     1.000    0.0%        0.000      0.000    5.460  < .001  ***
ctrpat     1.733   73.3%        0.550      0.030   18.422  < .001  ***

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.031
Pseudo R-squared (McFadden adjusted): 0.031
Area under the ROC Curve (AUC): 0.674
Log-likelihood: -4290.903, AIC: 8587.805, BIC: 8615.945
Chi-squared: 278.619, df(2), p.value < 0.001 
Nr obs: 87,535
clf_mc2.plot("pred")

Interpretation

The coefficient of ctrpat increased from 0.48 to 0.55, and the odds ratio rose from 61.5% to 73.3%. This suggests that removing clpat revealed the true impact of the remaining explanatory variables. While Pseudo R-Squared and AUC were not significantly affected, the predicted plot now shows a clearer relationship between the odds of clicking and the variables imppat and ctrpat. Since imppat and clpat have a high correlation of 0.97, multicollinearity exists, which can make coefficient estimates unstable and affect model interpretation. By omitting one of these highly correlated variables, we mitigate this issue, leading to a more interpretable model. This is evident in the plot, where the relationship between ctrpat, imppat, and the odds of clicking is now more clearly observed.
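A rule-of-thumb check (not part of the original output): with a pairwise correlation near 0.97, the variance inflation factor 1/(1 − r²) is far above the common threshold of 10, which is why dropping one of the pair helps. A sketch with simulated data mimicking the imppat/clpat relationship:

```python
import numpy as np

rng = np.random.default_rng(42)
imppat = rng.normal(size=5000)
clpat = 0.97 * imppat + rng.normal(scale=0.25, size=5000)  # highly correlated, like the real pair

r = np.corrcoef(imppat, clpat)[0, 1]
vif = 1.0 / (1.0 - r**2)    # VIF for one predictor in a two-predictor model
print(r > 0.95, vif > 10)   # True True
```
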

Difference After Adding More Features

clf_mc3 = rsm.model.logistic(
    data={"tz_train": tz_train},
    rvar="click",
    lev="yes",
    evar=["time_fct", "app", "imppat", "clpat", "ctrpat"],
)

clf_mc3.summary()
Logistic regression (GLM)
Data                 : tz_train
Response variable    : click
Level                : yes
Explanatory variables: time_fct, app, imppat, clpat, ctrpat
Null hyp.: There is no effect of x on click
Alt. hyp.: There is an effect of x on click

                 OR      OR%  coefficient   std.error  z.value p.value     
Intercept     0.012   -98.8%       -4.424       0.284  -15.575  < .001  ***
time_fct[2]   0.588   -41.2%       -0.532       0.323   -1.644     0.1     
time_fct[3]   0.693   -30.7%       -0.367       0.461   -0.795   0.426     
time_fct[4]   0.000  -100.0%      -23.834   43904.903   -0.001     1.0     
time_fct[5]   0.000  -100.0%      -23.860   57249.164   -0.000     1.0     
time_fct[6]   0.359   -64.1%       -1.025       1.027   -0.999   0.318     
time_fct[7]   1.220    22.0%        0.199       0.438    0.453    0.65     
time_fct[8]   1.135    13.5%        0.127       0.309    0.411   0.681     
time_fct[9]   1.063     6.3%        0.061       0.299    0.203   0.839     
time_fct[10]  0.843   -15.7%       -0.170       0.303   -0.561   0.575     
time_fct[11]  0.637   -36.3%       -0.451       0.288   -1.565   0.118     
time_fct[12]  0.834   -16.6%       -0.181       0.292   -0.620   0.535     
time_fct[13]  0.535   -46.5%       -0.626       0.306   -2.047   0.041    *
time_fct[14]  0.982    -1.8%       -0.018       0.257   -0.069   0.945     
time_fct[15]  0.840   -16.0%       -0.174       0.272   -0.639   0.523     
time_fct[16]  0.874   -12.6%       -0.135       0.279   -0.483   0.629     
time_fct[17]  0.864   -13.6%       -0.146       0.296   -0.493   0.622     
time_fct[18]  0.942    -5.8%       -0.060       0.286   -0.208   0.835     
time_fct[19]  1.178    17.8%        0.164       0.251    0.651   0.515     
time_fct[20]  1.188    18.8%        0.172       0.247    0.698   0.485     
time_fct[21]  0.782   -21.8%       -0.245       0.261   -0.942   0.346     
time_fct[22]  0.933    -6.7%       -0.069       0.259   -0.267   0.789     
time_fct[23]  0.993    -0.7%       -0.007       0.268   -0.025    0.98     
time_fct[24]  1.134    13.4%        0.125       0.258    0.486   0.627     
app[app2]     0.136   -86.4%       -1.997       0.351   -5.681  < .001  ***
app[app3]     0.187   -81.3%       -1.675       1.016   -1.649   0.099    .
app[app4]     0.487   -51.3%       -0.719       0.359   -2.004   0.045    *
app[app5]     0.413   -58.7%       -0.883       1.011   -0.874   0.382     
app[app6]     1.007     0.7%        0.007       0.216    0.034   0.973     
app[app7]     0.708   -29.2%       -0.346       1.019   -0.340   0.734     
app[app8]     0.000  -100.0%      -24.110   72717.593   -0.000     1.0     
app[app9]     0.935    -6.5%       -0.067       0.271   -0.248   0.804     
app[app10]    0.000  -100.0%      -24.021   75559.599   -0.000     1.0     
app[app11]    1.365    36.5%        0.311       0.752    0.414   0.679     
app[app12]    0.575   -42.5%       -0.553       0.274   -2.022   0.043    *
app[app13]    2.790   179.0%        1.026       0.504    2.037   0.042    *
app[app14]    0.246   -75.4%       -1.402       0.479   -2.927   0.003   **
app[app15]    0.773   -22.7%       -0.258       0.723   -0.357   0.721     
app[app16]    0.381   -61.9%       -0.965       0.733   -1.316   0.188     
app[app17]    1.502    50.2%        0.407       0.727    0.560   0.576     
app[app18]    0.108   -89.2%       -2.226       1.016   -2.190   0.029    *
app[app19]    0.000  -100.0%      -24.046   93787.196   -0.000     1.0     
app[app20]    0.000  -100.0%      -24.145   73366.877   -0.000     1.0     
app[app21]    0.475   -52.5%       -0.744       0.416   -1.789   0.074    .
app[app22]    0.517   -48.3%       -0.660       1.019   -0.648   0.517     
app[app23]    3.549   254.9%        1.267       0.419    3.023   0.003   **
app[app24]    2.411   141.1%        0.880       0.363    2.421   0.015    *
app[app25]    1.571    57.1%        0.452       0.612    0.738    0.46     
app[app26]    0.000  -100.0%      -24.073   63336.276   -0.000     1.0     
app[app27]    0.292   -70.8%       -1.230       0.731   -1.683   0.092    .
app[app28]    0.318   -68.2%       -1.144       0.603   -1.897   0.058    .
app[app29]    2.584   158.4%        0.949       0.379    2.502   0.012    *
app[app30]    0.000  -100.0%      -24.080   87078.077   -0.000     1.0     
app[app31]    0.164   -83.6%       -1.808       0.735   -2.460   0.014    *
app[app32]    0.000  -100.0%      -24.044   56046.291   -0.000     1.0     
app[app33]    0.524   -47.6%       -0.647       0.376   -1.720   0.086    .
app[app34]    0.000  -100.0%      -24.014   55897.140   -0.000     1.0     
app[app35]    0.357   -64.3%       -1.030       1.019   -1.011   0.312     
app[app36]    0.000  -100.0%      -24.001   78847.008   -0.000     1.0     
app[app37]    0.741   -25.9%       -0.300       0.732   -0.410   0.682     
app[app38]    0.000  -100.0%      -24.036   55273.917   -0.000     1.0     
app[app39]    2.349   134.9%        0.854       0.449    1.904   0.057    .
app[app40]    1.625    62.5%        0.486       0.530    0.915    0.36     
app[app41]    0.816   -18.4%       -0.204       0.725   -0.281   0.779     
app[app42]    2.909   190.9%        1.068       0.452    2.363   0.018    *
app[app43]    0.000  -100.0%      -24.109   75901.133   -0.000     1.0     
app[app44]    0.000  -100.0%      -24.047   46569.636   -0.001     1.0     
app[app45]    1.920    92.0%        0.652       0.604    1.079   0.281     
app[app46]    0.466   -53.4%       -0.763       1.020   -0.748   0.455     
app[app47]    0.406   -59.4%       -0.902       1.012   -0.891   0.373     
app[app48]    0.000  -100.0%      -24.045  106095.563   -0.000     1.0     
app[app49]    0.259   -74.1%       -1.349       1.019   -1.324   0.185     
imppat        1.000    -0.0%       -0.000       0.000   -1.131   0.258     
clpat         1.001     0.1%        0.001       0.001    1.135   0.256     
ctrpat        1.077     7.7%        0.075       0.117    0.637   0.524     

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Pseudo R-squared (McFadden): 0.056
Pseudo R-squared (McFadden adjusted): 0.04
Area under the ROC Curve (AUC): 0.704
Log-likelihood: -4180.646, AIC: 8511.292, BIC: 9214.776
Chi-squared: 499.132, df(74), p.value < 0.001 
Nr obs: 87,535
clf_mc3.plot(
    plots="pred", incl=["time_fct", "app", "imppat", "clpat", "ctrpat"]
)

Interpretation

Adding time_fct and app changes the estimated effects of imppat, clpat, and ctrpat. In the expanded model, none of the three pattern-level variables remains statistically significant (p-values of 0.258, 0.256, and 0.524), while several app dummies are strong predictors. This suggests that the app on which an impression is served accounts for much of the variation previously attributed to the pattern variables: once the model conditions on app and time of day, the incremental information in imppat, clpat, and ctrpat is limited. The categorical variables also act as confounder adjustments, so the remaining coefficients are estimated within, rather than across, apps and time slots. These conditioning effects show up in the prediction plots, where the curves for the pattern variables are noticeably flatter than in the simpler model. Overall fit improves modestly: McFadden pseudo R-squared rises from 0.031 to 0.056 and AUC from 0.674 to 0.704.
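The confounding logic can be seen in a toy example (all counts invented, not from the TZ data): pooled over apps, one user segment appears to click less, but within every app it clicks more. Conditioning on app, as the expanded model does, reverses the naive pooled conclusion.

```python
import pandas as pd

# Hypothetical impression/click counts by app and user segment
rows = [
    ("app1", "active",   1000, 20),
    ("app1", "inactive",  100,  1),
    ("app2", "active",    100,  8),
    ("app2", "inactive", 1000, 50),
]
df = pd.DataFrame(rows, columns=["app", "segment", "impressions", "clicks"])

pooled = df.groupby("segment")[["impressions", "clicks"]].sum()
pooled["ctr"] = pooled["clicks"] / pooled["impressions"]
print(pooled)  # pooled: active CTR (~2.5%) < inactive CTR (~4.6%)

per_app = df.assign(ctr=df["clicks"] / df["impressions"])
print(per_app)  # within each app: active CTR > inactive CTR
```

This is Simpson's paradox: because the segments are unevenly distributed across apps with very different baseline CTRs, the pooled comparison is confounded by app.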

Decile Analysis

tz_test = tz_gaming[tz_gaming["training"] == "test"].copy()  # .copy() avoids SettingWithCopyWarning
tz_test["pred_logit_dec"] = tz_test.groupby("training").pred_logit.transform(
    rsm.xtile, 10, rev=True
)
print(tz_test.head())
      training     inum click  time time_fct    app mobile_os  impup  clup  \
87535     test  I300002    no    21       21   app1   android   1458     3   
87536     test  I300006    no     3        3  app40       ios      3     0   
87537     test  I300012    no     5        5  app12   android   5057     6   
87538     test  I300015    no    10       10   app1   android   1993    10   
87539     test  I300016    no    14       14   app1       ios    212     7   

          ctrup  ...     ctrut  imppat  clpat    ctrpat       rnd  pred_vneta  \
87535  0.205761  ...  0.000000   68113    957  1.405018  0.147891    0.003961   
87536  0.000000  ...  0.000000      50      0  0.000000  0.383246    0.018965   
87537  0.118647  ...  0.000000     754      8  1.061008  1.274485    0.003961   
87538  0.501756  ...  0.000000   26537    276  1.040057  0.673022    0.003961   
87539  3.301887  ...  5.263158   57348    874  1.524029 -0.785851    0.050679   

             id    pred_logit  pred_rnd  pred_logit_dec  
87535  id466983  1.020981e-02  0.008791               4  
87536  id946375  8.665095e-03  0.008718               4  
87537  id479295  1.910723e-14  0.008448              10  
87538   id83284  6.240407e-03  0.008630               5  
87539  id359434  1.233449e-02  0.009086               3  

[5 rows x 25 columns]
dec_tab = (
    tz_test.groupby("pred_logit_dec")
    .agg(
        nr_impressions=("pred_logit_dec", "size"),
        nr_clicks=("click", lambda x: (x == "yes").sum()),
    )
    .assign(ctr=lambda x: x.nr_clicks / x.nr_impressions)
    .reset_index()
)
dec_tab
pred_logit_dec nr_impressions nr_clicks ctr
0 1 2796 103 0.036838
1 2 2793 48 0.017186
2 3 2788 42 0.015065
3 4 2796 30 0.010730
4 5 2802 15 0.005353
5 6 2796 7 0.002504
6 7 2794 7 0.002505
7 8 2796 3 0.001073
8 9 2796 4 0.001431
9 10 2796 12 0.004292
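rsm.xtile handles the decile assignment above; for reference, a pandas-only equivalent can be sketched with pd.qcut (synthetic predictions here; the rev=True option in xtile corresponds to labeling the highest predictions as decile 1).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
pred = pd.Series(rng.random(1000), name="pred_logit")

# qcut bins low-to-high, so reverse the labels to put the best scores in decile 1
dec = pd.qcut(pred, 10, labels=list(range(10, 0, -1))).astype(int)
print(dec.value_counts().sort_index().tolist())  # ten deciles of 100 observations each
```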

Bar Chart

import matplotlib.pyplot as plt

bc = dec_tab.plot.bar(x="pred_logit_dec", y="ctr", legend=False)
bc.set_xlabel("Decile")
bc.set_ylabel("Click-through rate (CTR)")
bc.set_title("Click-Through Rate by Decile")
bc.axhline(dec_tab["ctr"].mean(), color="r", linestyle="--")

plt.show()
print(tz_train)

      training     inum click  time time_fct    app mobile_os  impup  clup  \
0        train       I7    no     9        9   app8       ios    439     2   
1        train      I23    no    15       15   app1       ios     64     0   
2        train      I28    no    12       12   app5       ios     80     0   
3        train      I30    no    19       19   app1       ios     25     0   
4        train      I35    no    24       24   app1   android   3834    29   
...        ...      ...   ...   ...      ...    ...       ...    ...   ...   
87530    train  I299985    no    11       11   app2   android   1181     0   
87531    train  I299986    no    10       10  app33       ios   1885     0   
87532    train  I299990    no     1        1  app45       ios      8     0   
87533    train  I299991    no     8        8   app1       ios    113     2   
87534    train  I299995    no    18       18  app35   android     13     0   

          ctrup  ...  clut     ctrut  imppat  clpat    ctrpat       rnd  \
0      0.455581  ...     0  0.000000      71      1  1.408451 -1.207066   
1      0.000000  ...     0  0.000000   67312   1069  1.588127  0.277429   
2      0.000000  ...     0  0.000000     331      1  0.302115  1.084441   
3      0.000000  ...     0  0.000000   71114   1001  1.407599 -2.345698   
4      0.756390  ...     4  1.215805  183852   2317  1.260253  0.429125   
...         ...  ...   ...       ...     ...    ...       ...       ...   
87530  0.000000  ...     0  0.000000    9625     14  0.145455 -0.249031   
87531  0.000000  ...     0  0.000000     658      1  0.151976  0.770718   
87532  0.000000  ...     0  0.000000     166      7  4.216867  0.181559   
87533  1.769912  ...     0  0.000000   14245    158  1.109161 -1.263831   
87534  0.000000  ...     1  7.142857     472      1  0.211864 -1.428302   

       pred_vneta        id    pred_logit  pred_rnd  
0        0.003961  id247135  3.382977e-13  0.009222  
1        0.003961  id245079  1.156355e-02  0.008751  
2        0.003961  id927245  2.655311e-03  0.008505  
3        0.018965  id922188  1.349420e-02  0.009600  
4        0.003961  id355833  1.868222e-03  0.008704  
...           ...       ...           ...       ...  
87530    0.003961  id565693  2.353185e-04  0.008915  
87531    0.003961  id222657  8.101551e-04  0.008600  
87532    0.018965  id340594  1.876022e-02  0.008781  
87533    0.003961  id634151  9.397408e-03  0.009241  
87534    0.050679  id280606  7.000647e-03  0.009294  

[87535 rows x 24 columns]

Gain Curves

dec_tab["cum_prop"] = dec_tab["nr_impressions"].cumsum() / dec_tab["nr_impressions"].sum()
dec_tab["cum_gains"] = dec_tab["nr_clicks"].cumsum() / dec_tab["nr_clicks"].sum()
gains_tab = dec_tab
gains_tab
pred_logit_dec nr_impressions nr_clicks ctr cum_prop cum_gains
0 1 2796 103 0.036838 0.100025 0.380074
1 2 2793 48 0.017186 0.199943 0.557196
2 3 2788 42 0.015065 0.299682 0.712177
3 4 2796 30 0.010730 0.399707 0.822878
4 5 2802 15 0.005353 0.499946 0.878229
5 6 2796 7 0.002504 0.599971 0.904059
6 7 2794 7 0.002505 0.699925 0.929889
7 8 2796 3 0.001073 0.799950 0.940959
8 9 2796 4 0.001431 0.899975 0.955720
9 10 2796 12 0.004292 1.000000 1.000000
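Dividing cumulative gains by the cumulative proportion gives the lift at each targeting depth. Using the first row of the gains table above, the top decile captures clicks at roughly 3.8 times the rate of untargeted impressions:

```python
# Values from the first row of gains_tab above
cum_prop, cum_gains = 0.100025, 0.380074
lift_top_decile = cum_gains / cum_prop
print(round(lift_top_decile, 1))  # 3.8
```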
plt.figure(figsize=(10, 6))
plt.plot(dec_tab['cum_prop'], dec_tab['cum_gains'], label='Cumulative Gains', drawstyle='steps-post')
plt.plot([0, 1], [0, 1], 'k--', label='No Model')  

# Labeling the plot
plt.title('Cumulative Gains Chart')
plt.xlabel('Cumulative Proportion of Impressions')
plt.ylabel('Cumulative Gains')
plt.legend()
plt.grid(True)

# Show the plot
plt.show()

Confusion Matrix

cpm = 10  # cost per 1,000 video impressions
conversion_rate = 0.05  # conversion to sign-up with TZ after clicking on an ad
clv = 25  # expected CLV of customers that sign up with TZ after clicking on an ad

threshold = cpm / (conversion_rate * clv * 1000)
threshold
0.008
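The 0.008 break-even follows from comparing the cost of a single impression to the expected value of a click: an impression is worth showing when p(click) times the expected margin per click covers its cost.

```python
cpm = 10                 # $ per 1,000 impressions
conversion_rate = 0.05   # click -> sign-up rate
clv = 25                 # $ value of a sign-up

cost_per_impression = cpm / 1000          # $0.01
value_per_click = conversion_rate * clv   # $1.25

# Target an impression when p(click) * value_per_click >= cost_per_impression
threshold = cost_per_impression / value_per_click
print(round(threshold, 3))  # 0.008
```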
tz_test = tz_gaming[tz_gaming["training"] == "test"].copy()

# click is "yes"/"no"; encode it as 1.0/0.0
tz_test["click_yes"] = tz_test["click"].map({"yes": 1.0, "no": 0.0})


tz_test
training inum click time time_fct app mobile_os impup clup ctrup ... ctrut imppat clpat ctrpat rnd pred_vneta id pred_logit pred_rnd click_yes
87535 test I300002 no 21 21 app1 android 1458 3 0.205761 ... 0.000000 68113 957 1.405018 0.147891 0.003961 id466983 1.020981e-02 0.008791 0.0
87536 test I300006 no 3 3 app40 ios 3 0 0.000000 ... 0.000000 50 0 0.000000 0.383246 0.018965 id946375 8.665095e-03 0.008718 0.0
87537 test I300012 no 5 5 app12 android 5057 6 0.118647 ... 0.000000 754 8 1.061008 1.274485 0.003961 id479295 1.910723e-14 0.008448 0.0
87538 test I300015 no 10 10 app1 android 1993 10 0.501756 ... 0.000000 26537 276 1.040057 0.673022 0.003961 id83284 6.240407e-03 0.008630 0.0
87539 test I300016 no 14 14 app1 ios 212 7 3.301887 ... 5.263158 57348 874 1.524029 -0.785851 0.050679 id359434 1.233449e-02 0.009086 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
115483 test I399982 no 21 21 app2 ios 2110 0 0.000000 ... 0.000000 23216 19 0.081840 -1.852059 0.003961 id847352 1.093091e-03 0.009435 0.0
115484 test I399986 no 17 17 app14 android 291 1 0.343643 ... 1.351351 3665 14 0.381992 -0.296415 0.003961 id457437 3.609483e-03 0.008930 0.0
115485 test I399991 no 23 23 app1 android 364 3 0.824176 ... 0.000000 173353 2292 1.322158 0.099201 0.003961 id792352 2.052670e-02 0.008806 0.0
115486 test I399992 no 20 20 app6 android 59 2 3.389831 ... 2.702703 3474 53 1.525619 -0.186421 0.050679 id115678 2.192207e-02 0.008896 0.0
115487 test I399994 no 18 18 app1 ios 498 7 1.405622 ... 1.886792 77884 1201 1.542037 0.857281 0.003961 id705546 1.170346e-02 0.008574 0.0

27953 rows × 25 columns

# tz_test already contains only test-set rows
TP = ((tz_test["pred_logit"] > threshold) & (tz_test["click_yes"] == 1)).sum()
FP = ((tz_test["pred_logit"] > threshold) & (tz_test["click_yes"] == 0)).sum()
TN = ((tz_test["pred_logit"] <= threshold) & (tz_test["click_yes"] == 0)).sum()
FN = ((tz_test["pred_logit"] <= threshold) & (tz_test["click_yes"] == 1)).sum()

cm_logit = pd.DataFrame(
    {
        "label": ["TP", "FP", "TN", "FN"],
        "nr": [TP, FP, TN, FN],
    }
)
cm_logit
label nr
0 TP 221
1 FP 10661
2 TN 17021
3 FN 50

Accuracy

accuracy_logit = (TP + TN) / (TP + TN + FP + FN)
accuracy_logit
np.float64(0.6168210925482059)

Confusion Matrix Based on Pred_RND

TP = ((tz_test["pred_rnd"] > threshold) & (tz_test["click_yes"] == 1)).sum()
FP = ((tz_test["pred_rnd"] > threshold) & (tz_test["click_yes"] == 0)).sum()
TN = ((tz_test["pred_rnd"] <= threshold) & (tz_test["click_yes"] == 0)).sum()
FN = ((tz_test["pred_rnd"] <= threshold) & (tz_test["click_yes"] == 1)).sum()


cm_rnd = pd.DataFrame(
    {
        "label": ["TP", "FP", "TN", "FN"],
        "nr": [TP, FP, TN, FN],
    }
)
cm_rnd
label nr
0 TP 271
1 FP 27606
2 TN 76
3 FN 0

Accuracy Based on Pred_RND Confusion Matrix

accuracy_rnd = (TP + TN) / (TP + TN + FP + FN)
accuracy_rnd
np.float64(0.012413694415626229)

Summary and Interpretation

first_model = {
    "TP": 271,
    "FP": 27606,
    "TN": 76,
    "FN": 0
}

rnd_model = {
    "TP": 0,
    "FP": 0,
    "TN": 27682,
    "FN": 271
}

def compute_metrics(cm):
    TP = cm["TP"]
    FP = cm["FP"]
    TN = cm["TN"]
    FN = cm["FN"]
    total = TP + FP + TN + FN

    accuracy = (TP + TN) / total
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    specificity = TN / (TN + FP) if (TN + FP) > 0 else 0
    f1_score = (2 * precision * recall) / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "TP": TP,
        "FP": FP,
        "TN": TN,
        "FN": FN,
        "Accuracy": round(accuracy, 4),
        "Precision": round(precision, 4),
        "Recall": round(recall, 4),
        "Specificity": round(specificity, 4),
        "F1 Score": round(f1_score, 4)
    }

first_metrics = compute_metrics(first_model)
rnd_metrics = compute_metrics(rnd_model)

results_df = pd.DataFrame({
    "Metric": list(first_metrics.keys()),
    "First Model": list(first_metrics.values()),
    "RND": list(rnd_metrics.values())
})

print(results_df.to_string(index=False))
     Metric  First Model        RND
         TP     271.0000     0.0000
         FP   27606.0000     0.0000
         TN      76.0000 27682.0000
         FN       0.0000   271.0000
   Accuracy       0.0124     0.9903
  Precision       0.0097     0.0000
     Recall       1.0000     0.0000
Specificity       0.0027     1.0000
   F1 Score       0.0193     0.0000

Per the table, the First Model identifies all 271 true positives, achieving 100% recall but only 0.97% precision because of its 27,606 false positives, leaving its accuracy at just 1.24%. In contrast, the RND column predicts no positives at all, giving perfect specificity and 99.03% accuracy but 0% recall and no ability to detect actual clicks. Despite its far lower accuracy, the First Model is more useful because it captures every true click event, though its extremely low precision means almost all of its predicted clicks are wrong. The comparison illustrates that on heavily imbalanced data, precision, recall, and the F1 score are far more informative than accuracy alone.

Model Comparison

Cost Information

  • Cost per 1,000 video impressions (CPM) is $10
  • Conversion to sign-up as a TZ game player after clicking on an ad is 5%
  • The expected CLV of customers that sign-up with TZ after clicking on an ad is approximately $25
  • The total cost of the data from Vneta is $50K
  • The total cost charged for the data science consulting services by Vneta is $150K
cpm = 10  # cost per 1,000 video impressions
conversion_rate = 0.05  # conversion to sign-up with TZ after clicking on an ad
clv = 25  # expected CLV of customers that sign up with TZ after clicking on an ad
total_impressions = 20_000_000  # given in the problem above
additional_cost_VNETA = 50_000  # cost of the data from Vneta
additional_cost_Consulting = 150_000  # cost of the consulting services from Vneta
additional_cost_spamming = 0  # no additional cost when spamming without Vneta

def calculate_break_even_response_rate(cpm, conversion_rate, clv, total_impressions, additional_cost):
    total_cost = (total_impressions / 1000) * cpm + additional_cost
    margin = total_impressions * conversion_rate * clv
    return total_cost / margin

break_even_response_rate_spamming = calculate_break_even_response_rate(
    cpm, conversion_rate, clv, total_impressions, additional_cost_spamming
)
break_even_response_rate_spamming
0.008
tz_gaming = tz_test  # evaluate the targeting rules on the test set
tz_gaming["target_logit"] = tz_gaming["pred_logit"] > break_even_response_rate_spamming
tz_gaming["target_rnd"] = tz_gaming["pred_rnd"] > break_even_response_rate_spamming
tz_gaming["target_vneta"] = tz_gaming["pred_vneta"] > break_even_response_rate_spamming
tz_gaming
training inum click time time_fct app mobile_os impup clup ctrup ... ctrpat rnd pred_vneta id pred_logit pred_rnd click_yes target_logit target_rnd target_vneta
87535 test I300002 no 21 21 app1 android 1458 3 0.205761 ... 1.405018 0.147891 0.003961 id466983 1.020981e-02 0.008791 0.0 True True False
87536 test I300006 no 3 3 app40 ios 3 0 0.000000 ... 0.000000 0.383246 0.018965 id946375 8.665095e-03 0.008718 0.0 True True True
87537 test I300012 no 5 5 app12 android 5057 6 0.118647 ... 1.061008 1.274485 0.003961 id479295 1.910723e-14 0.008448 0.0 False True False
87538 test I300015 no 10 10 app1 android 1993 10 0.501756 ... 1.040057 0.673022 0.003961 id83284 6.240407e-03 0.008630 0.0 False True False
87539 test I300016 no 14 14 app1 ios 212 7 3.301887 ... 1.524029 -0.785851 0.050679 id359434 1.233449e-02 0.009086 0.0 True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
115483 test I399982 no 21 21 app2 ios 2110 0 0.000000 ... 0.081840 -1.852059 0.003961 id847352 1.093091e-03 0.009435 0.0 False True False
115484 test I399986 no 17 17 app14 android 291 1 0.343643 ... 0.381992 -0.296415 0.003961 id457437 3.609483e-03 0.008930 0.0 False True False
115485 test I399991 no 23 23 app1 android 364 3 0.824176 ... 1.322158 0.099201 0.003961 id792352 2.052670e-02 0.008806 0.0 True True False
115486 test I399992 no 20 20 app6 android 59 2 3.389831 ... 1.525619 -0.186421 0.050679 id115678 2.192207e-02 0.008896 0.0 True True True
115487 test I399994 no 18 18 app1 ios 498 7 1.405622 ... 1.542037 0.857281 0.003961 id705546 1.170346e-02 0.008574 0.0 True True False

27953 rows × 28 columns

Spamming

total_successful_clicks = tz_gaming[tz_gaming["click_yes"] == 1].shape[0]
total_costs = (tz_gaming.shape[0] / 1000) * cpm
total_revenue = total_successful_clicks * conversion_rate * clv
profit_spam = total_revenue - total_costs
rome_spam = profit_spam / total_costs
print(profit_spam, rome_spam)
59.22000000000003 0.21185561478195555

LOGIT

total_successful_clicks = tz_gaming[tz_gaming["target_logit"] & (tz_gaming["click_yes"] == 1)].shape[0]
total_costs = (tz_gaming[tz_gaming["target_logit"]].shape[0] / 1000) * cpm
total_revenue = total_successful_clicks * conversion_rate * clv
profit_logit = total_revenue - total_costs
rome_logit = profit_logit / total_costs
print(profit_logit, rome_logit)
167.43 1.5385958463517737

RND

total_successful_clicks = tz_gaming[tz_gaming["target_rnd"] & (tz_gaming["click_yes"] == 1)].shape[0]
total_costs = (tz_gaming[tz_gaming["target_rnd"]].shape[0] / 1000) * cpm
total_revenue = total_successful_clicks * conversion_rate * clv
profit_rnd = total_revenue - total_costs
rome_rnd = profit_rnd / total_costs
print(profit_rnd, rome_rnd)
59.98000000000002 0.21515945044301762

VNETA

total_successful_clicks = tz_gaming[tz_gaming["target_vneta"] & (tz_gaming["click_yes"] == 1)].shape[0]
total_costs = (tz_gaming[tz_gaming["target_vneta"]].shape[0] / 1000) * cpm
total_revenue = total_successful_clicks * conversion_rate * clv
profit_vneta = total_revenue - total_costs
rome_vneta = profit_vneta / total_costs
print(profit_vneta, rome_vneta)
151.29 3.1059330732909047
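The four profit/ROME blocks above repeat the same arithmetic; a small helper (a sketch using the case's cost assumptions as defaults) keeps them consistent. Running it on the spamming scenario (27,953 targeted impressions, 271 clicks) reproduces the numbers above.

```python
def campaign_profit(n_targeted, n_clicks, cpm=10, conversion_rate=0.05, clv=25):
    """Return (profit, ROME) for a set of targeted impressions."""
    cost = n_targeted / 1000 * cpm                # media cost of the targeted impressions
    revenue = n_clicks * conversion_rate * clv    # expected margin from the resulting clicks
    profit = revenue - cost
    return profit, profit / cost

profit, rome = campaign_profit(27953, 271)
print(round(profit, 2), round(rome, 4))  # 59.22 0.2119
```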

Summary

tz_gaming["pred_spam"] = 1
tz_gaming["target_spam"] = True

mod_perf = pd.DataFrame(
    {
        "model": [
            "logit",
            "rnd",
            "vneta",
            "spam",
        ],
        "profit": [profit_logit, profit_rnd, profit_vneta, profit_spam],
        "ROME": [rome_logit, rome_rnd, rome_vneta, rome_spam],
    }
)
mod_perf
model profit ROME
0 logit 167.43 1.538596
1 rnd 59.98 0.215159
2 vneta 151.29 3.105933
3 spam 59.22 0.211856

Analysis

The Vneta model has the highest ROME at 3.11, making it the most efficient use of the marketing budget, and it generates a profit of 151.29, only slightly below the logit model. The logit model achieves the highest profit at 167.43, but its ROME of 1.54 indicates less efficient spending, so it is best suited to cases where total profit matters more than efficiency. Both the rnd and spam approaches underperform on profit and ROME, confirming that model-based targeting substantially outperforms untargeted methods. In short, Vneta's predictions are recommended when return on marketing expenditure is the priority, and the logit model when maximizing total profit is the primary objective.

Profit Comparison

total_impressions_purchased = 20_000_000
test_sample = tz_gaming.shape[0]  # 27,953 test impressions
logit = (profit_logit / test_sample) * total_impressions_purchased
rnd = (profit_rnd / test_sample) * total_impressions_purchased
vneta = (profit_vneta / test_sample) * total_impressions_purchased
spam = (profit_spam / test_sample) * total_impressions_purchased

logit, rnd, vneta, spam
(119793.93982756771, 42914.89285586521, 108245.98433084105, 42371.12295639111)

ROME Comparison

# Scale costs to the full campaign as well, so ROME remains a scale-free ratio
scale = total_impressions_purchased / test_sample

total_costs = (tz_gaming[tz_gaming["target_logit"]].shape[0] / 1000) * cpm * scale
rome_logit = logit / total_costs

total_costs = (tz_gaming[tz_gaming["target_rnd"]].shape[0] / 1000) * cpm * scale
rome_rnd = rnd / total_costs

total_costs = (tz_gaming[tz_gaming["target_vneta"]].shape[0] / 1000) * cpm * scale
rome_vneta = vneta / total_costs

total_costs = (tz_gaming.shape[0] / 1000) * cpm * scale
rome_spam = spam / total_costs

[round(r, 6) for r in (rome_logit, rome_rnd, rome_vneta, rome_spam)]
[1.538596, 0.215159, 3.105933, 0.211856]

Profit and ROME Summary

mod_perf_20M = pd.DataFrame(
    {
        "model": [
            "logit",
            "rnd",
            "vneta",
            "spam",
        ],
        "profit": [logit, rnd, vneta, spam],
        "ROME": [rome_logit, rome_rnd, rome_vneta, rome_spam],
    }
)
mod_perf_20M
model profit ROME
0 logit 119793.939828 1.538596
1 rnd 42914.892856 0.215159
2 vneta 108245.984331 3.105933
3 spam 42371.122956 0.211856

Conclusion

The logit model leads in profit generation at $119,793.94, making it the top choice for revenue maximization, with a solid ROME of 1.54. The rnd model lags behind with a profit of $42,914.89 and a ROME of 0.22, and the spam approach is similar at $42,371.12 with a ROME of 0.21, making untargeted methods the least effective. The Vneta model secures a high profit of $108,245.98 and the highest ROME at 3.11, demonstrating superior marketing efficiency. Scaled to the 20M-impression campaign, the results mirror the test-sample comparison: predictive targeting clearly outperforms the untargeted rnd and spam approaches, Vneta remains the most efficient option, and the logit model delivers somewhat more total profit. Overall, Vneta's predictions are the primary recommendation when balancing profit against marketing spend, while the in-house logit model is preferable when absolute profit takes precedence. Before a final decision, however, the fixed costs noted earlier ($50K for the Vneta data used by the in-house model, $150K for Vneta's consulting service) should be netted against these scaled profits, since they materially affect the comparison.