Logistic Regression

Author

Dr. Mohammad Nasir Abdullah

Logistic Regression Analysis for the GLOW dataset.

In this example, we use the GLOW dataset from the book Applied Logistic Regression by Hosmer and Lemeshow.

Aim: To select those variables that result in a “best” model within the scientific context of the problem.

Dataset GLOW (n=500)

The GLOW500 dataset comes from The Global Longitudinal Study of Osteoporosis in Women (GLOW), an international study of osteoporosis in women over 55 years of age coordinated at the Center for Outcomes Research (COR) at the University of Massachusetts/Worcester by its Director, Dr. Frederick Anderson, Jr.

The study has enrolled over 60,000 women aged 55 and older in ten countries. The major goals of the study are to use the data to provide insights into the management of fracture risk, patient experience with prevention and treatment of fractures, and the distribution of risk factors among older women on an international scale over the follow-up period.

Data used here come from six sites in the United States and include a few selected potential risk factors for fracture from the baseline questionnaire. The outcome variable is any fracture in the first year of follow-up. The incident first-year fracture rate among the 21,000 subjects enrolled at these six sites is about 4 percent. In order to obtain a data set of a manageable size (n=500) for this note, the fractures were oversampled and the non-fractures undersampled (Hosmer-Lemeshow, 2013).

The codesheet for variables in the GLOW study

Variable  Description                              Codes/Values                   Name
1         Identification code                      1-n                            SUB_ID
2         Study site                               1-6                            SITE_ID
3         Physician ID code                        128 unique codes               PHY_ID
4         History of prior fracture                1 = Yes, 0 = No                PRIORFRAC
5         Age at enrollment                        Years                          AGE
6         Weight at enrollment                     Kilograms                      WEIGHT
7         Height at enrollment                     Centimeters                    HEIGHT
8         Body mass index                          kg/m2                          BMI
9         Menopause before age 45                  1 = Yes, 0 = No                PREMENO
10        Mother had hip fracture                  1 = Yes, 0 = No                MOMFRAC
11        Arms are needed to stand from a chair    1 = Yes, 0 = No                ARMASSIST
12        Former or current smoker                 1 = Yes, 0 = No                SMOKE
13        Self-reported risk of fracture           1 = Less than others of the    RATERISK
                                                   same age, 2 = Same as others
                                                   of the same age, 3 = Greater
                                                   than others of the same age
14        Fracture risk score                      Composite risk score           FRACSCORE
15        Any fracture in first year               1 = Yes, 0 = No                FRACTURE

Importing the dataset into the R environment.

data1 <- read.table("GLOW500.txt", sep="\t", header = TRUE)
dim(data1) #dimensions of the dataset
[1] 500  15
str(data1) #Structure of the dataset
'data.frame':   500 obs. of  15 variables:
 $ SUB_ID   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ SITE_ID  : int  1 4 6 6 1 5 5 1 1 4 ...
 $ PHY_ID   : int  14 284 305 309 37 299 302 36 8 282 ...
 $ PRIORFRAC: int  0 0 1 0 0 1 0 1 1 0 ...
 $ AGE      : int  62 65 88 82 61 67 84 82 86 58 ...
 $ WEIGHT   : num  70.3 87.1 50.8 62.1 68 68 50.8 40.8 62.6 63.5 ...
 $ HEIGHT   : int  158 160 157 160 152 161 150 153 156 166 ...
 $ BMI      : num  28.2 34 20.6 24.3 29.4 ...
 $ PREMENO  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ MOMFRAC  : int  0 0 1 0 0 0 0 0 0 0 ...
 $ ARMASSIST: int  0 0 1 0 0 0 0 0 0 0 ...
 $ SMOKE    : int  0 0 0 0 0 1 0 0 0 0 ...
 $ RATERISK : int  2 2 1 1 2 2 1 2 2 1 ...
 $ FRACSCORE: int  1 2 11 5 1 4 6 7 7 0 ...
 $ FRACTURE : int  0 0 0 0 0 0 0 0 0 0 ...
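
Before modelling, it is also worth confirming that the import introduced no missing values; a one-line check (output not shown):

#quick sanity check: count missing values in each column
colSums(is.na(data1))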

Correcting the structure of the data frame

The categorical variables could be recoded as factors at this point; the conversions below are left commented out, and each variable is converted to a factor only where the analysis requires it.

# data1$PREMENO <- as.factor(data1$PREMENO)
# data1$MOMFRAC <- as.factor(data1$MOMFRAC)
# data1$ARMASSIST <- as.factor(data1$ARMASSIST)
# data1$SMOKE <- as.factor(data1$SMOKE)
# data1$RATERISK <- as.factor(data1$RATERISK)
# data1$FRACTURE <- as.factor(data1$FRACTURE)
# str(data1)
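
If all the conversions were wanted at once, a compact equivalent is the sketch below (not run in this note, where variables are kept numeric and converted only where required):

#convert all the binary/categorical columns to factors in one step
cat_vars <- c("PRIORFRAC", "PREMENO", "MOMFRAC", "ARMASSIST",
              "SMOKE", "RATERISK", "FRACTURE")
data1[cat_vars] <- lapply(data1[cat_vars], as.factor)
str(data1[cat_vars])  #verify the new structure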

Model Building Strategies

  1. Purposeful selection begins with a careful univariable analysis of each independent variable.

    a. Include any variable whose univariable test has a p-value less than 0.25, along with all variables of known clinical importance.

  2. Fit the multivariable model containing all covariates identified for inclusion in step 1.

  3. Apply variable selection procedures (forward selection, backward elimination, stepwise selection).

  4. Check for interactions among the variables in the model.

  5. Assess the model's adequacy and check its fit:

    a. Check multicollinearity.
    b. Check linearity of continuous variables.
    c. Check influence statistics.

  6. Establish the final model.

Univariable Logistic Regression

The first step in purposeful selection is to fit a univariable logistic regression model for each covariate.

#categorical variable is RATERISK
data1$RATERISK <- as.factor(data1$RATERISK)
v1 <- glm(FRACTURE~RATERISK, data=data1, family=binomial("logit"))
summary(v1)

Call:
glm(formula = FRACTURE ~ RATERISK, family = binomial("logit"), 
    data = data1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.6023     0.2071  -7.735 1.03e-14 ***
RATERISK2     0.5462     0.2664   2.050   0.0404 *  
RATERISK3     0.9091     0.2711   3.353   0.0008 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 550.58  on 497  degrees of freedom
AIC: 556.58

Number of Fisher Scoring iterations: 4
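
To interpret these coefficients on the odds scale, they can be exponentiated together with Wald confidence limits; a short sketch (output not shown):

#odds ratios and Wald 95% CIs for the RATERISK design variables
exp(cbind(OR = coef(v1), confint.default(v1)))

(confint() would give profile-likelihood intervals instead of Wald intervals.)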

Assessing the univariable logistic regressions simultaneously

library(broom)

#create a data frame to store the results
results <- data.frame()

#loop through each candidate variable
#(drop SUB_ID, SITE_ID, PHY_ID, and the outcome FRACTURE)
for(var in names(data1)[c(-1, -2, -3, -15)]){
  #dynamically generate the formula
  formula <- as.formula(paste0("FRACTURE ~ ", var))
  
  #fit a univariable binary logistic regression and tidy the summary
  fit <- tidy(glm(formula, data = data1, family = binomial("logit")))
  
  #capture the slope estimate, standard error, Wald statistic, and p-value
  #(for the factor RATERISK, row 2 corresponds to the RATERISK2 dummy)
  df <- data.frame(variable = var,
                   coef     = fit$estimate[2],
                   stderror = fit$std.error[2],
                   walds    = fit$statistic[2],
                   pvalue   = fit$p.value[2])
  
  #bind the temporary data frame onto the results data frame
  results <- rbind(results, df)
}
results[1:10, ]  #FRACSCORE, a composite of the other covariates, is not shown
    variable         coef    stderror      walds       pvalue
1  PRIORFRAC  1.063829449 0.223081075  4.7688019 1.853248e-06
2        AGE  0.052886215 0.011628679  4.5479126 5.418063e-06
3     WEIGHT -0.005196790 0.006414854 -0.8101182 4.178722e-01
4     HEIGHT -0.051665422 0.017094823 -3.0222847 2.508745e-03
5        BMI  0.005757553 0.017185317  0.3350275 7.376044e-01
6    PREMENO  0.050772325 0.259208611  0.1958744 8.447085e-01
7    MOMFRAC  0.660502240 0.280983786  2.3506774 1.873927e-02
8  ARMASSIST  0.709147522 0.209766610  3.3806501 7.231455e-04
9      SMOKE -0.307654206 0.435807000 -0.7059414 4.802246e-01
10  RATERISK  0.546216749 0.266436095  2.0500854 4.035610e-02

Based on the univariable models, all variables are significant (p < 0.05) except WEIGHT, BMI, PREMENO, and SMOKE, whose p-values exceed 0.05.

Step 2: We now fit our first multivariable model containing all covariates that are significant in the univariable analysis at the 25% level. The covariates WEIGHT, BMI, PREMENO, and SMOKE are excluded because each has a p-value greater than 0.25.

fit1 <- glm(FRACTURE~AGE+HEIGHT+PRIORFRAC+MOMFRAC+ARMASSIST+ RATERISK,
            data=data1,family = binomial("logit"))

summary(fit1)

Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + 
    ARMASSIST + RATERISK, family = binomial("logit"), data = data1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  2.70935    3.22992   0.839  0.40157   
AGE          0.03434    0.01305   2.632  0.00848 **
HEIGHT      -0.04383    0.01827  -2.400  0.01640 * 
PRIORFRAC    0.64526    0.24606   2.622  0.00873 **
MOMFRAC      0.62122    0.30698   2.024  0.04300 * 
ARMASSIST    0.44579    0.23281   1.915  0.05551 . 
RATERISK2    0.42202    0.27925   1.511  0.13071   
RATERISK3    0.70692    0.29342   2.409  0.01599 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 507.50  on 492  degrees of freedom
AIC: 523.5

Number of Fisher Scoring iterations: 4

Once this model is fit, we examine each covariate to ascertain its continued significance, at traditional levels, in the model. The covariate with the largest p-value above 0.05 is RATERISK2, the design/dummy variable that compares women with RATERISK = 2 to women with RATERISK = 1.

Next, we check the likelihood ratio test for RATERISK.

#without RATERISK,
mod_fit_RATERISK <- glm(FRACTURE~AGE+HEIGHT+PRIORFRAC+MOMFRAC+ARMASSIST, 
                    data=data1,family = binomial("logit"))

anova(fit1, mod_fit_RATERISK, test ="Chisq")
Analysis of Deviance Table

Model 1: FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST + RATERISK
Model 2: FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
1       492     507.50                       
2       494     513.46 -2  -5.9587  0.05083 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(lmtest)
Loading required package: zoo

Attaching package: 'zoo'
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
lrtest(fit1, mod_fit_RATERISK)
Likelihood ratio test

Model 1: FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST + RATERISK
Model 2: FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   8 -253.75                       
2   6 -256.73 -2 5.9587    0.05083 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test for the exclusion of self-reported risk of fracture (i.e., deleting RATERISK2 and RATERISK3 from the model) gives Chi-square = 5.9587, which with two degrees of freedom yields p-value = 0.05083, nearly significant at the 0.05 level.
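
The same statistic can be verified by hand from the two log-likelihoods reported above; a minimal sketch:

#likelihood ratio statistic: G = 2*(logLik full - logLik reduced)
G <- as.numeric(2 * (logLik(fit1) - logLik(mod_fit_RATERISK)))
G                                      #5.9587
pchisq(G, df = 2, lower.tail = FALSE)  #0.05083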

Before moving to the next step, we consider another possible model. Since the coefficient for RATERISK2 is not significant, one possibility is to combine levels 1 and 2 (self-reported risk less than or the same as other women) into a single category, although we then lose information about the separate log-odds of categories 1 and 2. On consultation with subject matter investigators, it was thought that combining these two categories is reasonable.

Hence we fit this model; the results are shown below:

data1$RATERISK <- ifelse(data1$RATERISK==1|data1$RATERISK==2, 1, 2)
data1$RATERISK <-as.factor(data1$RATERISK)

fit1 <- glm(FRACTURE~AGE+HEIGHT+PRIORFRAC+MOMFRAC+ARMASSIST+ RATERISK,
            data=data1,family = binomial("logit"))

summary(fit1)

Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + 
    ARMASSIST + RATERISK, family = binomial("logit"), data = data1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  3.40658    3.17704   1.072  0.28361   
AGE          0.03318    0.01295   2.563  0.01039 * 
HEIGHT      -0.04631    0.01813  -2.555  0.01063 * 
PRIORFRAC    0.66416    0.24521   2.709  0.00676 **
MOMFRAC      0.66402    0.30558   2.173  0.02978 * 
ARMASSIST    0.47272    0.23133   2.044  0.04100 * 
RATERISK2    0.45792    0.23806   1.924  0.05441 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 509.82  on 493  degrees of freedom
AIC: 523.82

Number of Fisher Scoring iterations: 4

In this model, the coefficient for the covariate RATERISK2 (formerly RATERISK_3) provides the estimate of the log odds ratio comparing the odds of fracture for women in level 3 to that of the combined group consisting of levels 1 and 2.

Even after combining the categories of RATERISK, the p-value remains above 0.05, so we now remove the variable from the model.

fit1 <- glm(FRACTURE~AGE+HEIGHT+PRIORFRAC+MOMFRAC+ARMASSIST,
            data=data1,family = binomial("logit"))

summary(fit1)

Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + 
    ARMASSIST, family = binomial("logit"), data = data1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  3.74081    3.17824   1.177  0.23919   
AGE          0.02975    0.01275   2.333  0.01964 * 
HEIGHT      -0.04635    0.01816  -2.552  0.01070 * 
PRIORFRAC    0.75259    0.23959   3.141  0.00168 **
MOMFRAC      0.72263    0.30235   2.390  0.01684 * 
ARMASSIST    0.52372    0.22829   2.294  0.02179 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 513.46  on 494  degrees of freedom
AIC: 525.46

Number of Fisher Scoring iterations: 4

Step 3: At this point we have our preliminary main effects model and must now check the scale of the logit for the continuous covariates AGE and HEIGHT.

In this note, I demonstrate this using the Box-Tidwell test.

The Box-Tidwell test evaluates the linearity-of-the-logit assumption in logistic regression models: for a continuous covariate x, the term x*ln(x) is added to the model, and a significant coefficient on that term indicates a violation of linearity.
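
A minimal hand-rolled version of this check for AGE (a sketch; bt_age is an illustrative name):

#add the AGE*ln(AGE) term to the main effects model and test its coefficient
bt_age <- glm(FRACTURE ~ AGE + I(AGE * log(AGE)) + HEIGHT + PRIORFRAC +
                MOMFRAC + ARMASSIST,
              data = data1, family = binomial("logit"))
summary(bt_age)$coefficients["I(AGE * log(AGE))", ]  #significant => non-linear

The car package provides the same idea as a formal score test: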

library(car)
Loading required package: carData
boxTidwell(FRACTURE~AGE, 
             other.x = ~HEIGHT+PRIORFRAC+MOMFRAC+ARMASSIST, 
             data = data1)
 MLE of lambda Score Statistic (t) Pr(>|t|)
      -0.46074             -0.3303   0.7413

iterations =  9 

Since the Box-Tidwell p-value is greater than 0.05, the linearity assumption for the variable AGE is met.

Now we check the variable HEIGHT:

boxTidwell(FRACTURE~HEIGHT, 
             other.x = ~AGE+PRIORFRAC+MOMFRAC+ARMASSIST, 
             data = data1)
 MLE of lambda Score Statistic (t) Pr(>|t|)
        -12.23              1.5921    0.112

iterations =  3 

Since the Box-Tidwell p-value is greater than 0.05, the linearity assumption for the variable HEIGHT is met.
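
A complementary graphical check, in the spirit of Hosmer and Lemeshow's quartile design-variable method, is sketched below for AGE: bin the covariate into quartiles, compute the empirical logit in each bin, and look for a roughly linear trend.

#empirical logits of fracture across AGE quartiles
brks  <- quantile(data1$AGE, probs = seq(0, 1, 0.25))
age_q <- cut(data1$AGE, breaks = brks, include.lowest = TRUE)
p_hat <- tapply(data1$FRACTURE, age_q, mean)  #fracture proportion per quartile
e_logit <- log(p_hat / (1 - p_hat))           #empirical logit per quartile
mids  <- tapply(data1$AGE, age_q, median)     #representative AGE per quartile
plot(mids, e_logit, type = "b",
     xlab = "Median AGE within quartile", ylab = "Empirical logit")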

Step 4: Checking for two-way interaction terms

formula(fit1)
FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST

Interaction term between the continuous variables (AGE and HEIGHT):

interaction1 <- glm(FRACTURE ~ AGE + HEIGHT + AGE*HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST , data=data1, family=binomial("logit"))

summary(interaction1)

Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + AGE * HEIGHT + PRIORFRAC + 
    MOMFRAC + ARMASSIST, family = binomial("logit"), data = data1)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) 14.692414  23.850071   0.616  0.53787   
AGE         -0.124501   0.332818  -0.374  0.70834   
HEIGHT      -0.114786   0.148872  -0.771  0.44068   
PRIORFRAC    0.743006   0.240328   3.092  0.00199 **
MOMFRAC      0.725302   0.302650   2.397  0.01655 * 
ARMASSIST    0.523937   0.228411   2.294  0.02180 * 
AGE:HEIGHT   0.000965   0.002081   0.464  0.64283   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 513.24  on 493  degrees of freedom
AIC: 527.24

Number of Fisher Scoring iterations: 4

In the model with the interaction term, the AGE x HEIGHT interaction is not significant (p > 0.05), so the interaction term is removed from the model.
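
As a cross-check, a likelihood ratio test between the nested models reaches the same conclusion; the deviance difference is 513.46 - 513.24 = 0.22 on 1 df. A one-line sketch (output not shown):

#deviance-based test of the AGE:HEIGHT interaction
anova(fit1, interaction1, test = "Chisq")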

Step 5: Checking Multicollinearity of independent variables

library(car)
car::vif(fit1)
      AGE    HEIGHT PRIORFRAC   MOMFRAC ARMASSIST 
 1.140680  1.066260  1.080805  1.012421  1.085556 

All Variance Inflation Factor (VIF) values are well below 10, indicating that multicollinearity is not a problem in this dataset. Thus, we can proceed to establish the final model.

To establish the final model, we need to assess the goodness of fit of the model, that is, whether the probabilities it produces accurately reflect the true outcome experience in the data.

The methods used to assess goodness of fit in this note are: the Hosmer-Lemeshow test, the classification table, and the area under the ROC curve.

  1. The Hosmer-Lemeshow test
#Hosmer and Lemeshow test 
library(generalhoslem)
Loading required package: reshape
Loading required package: MASS
HL <- logitgof(data1$FRACTURE, fitted(fit1)) 
HL #Hosmer and Lemeshow test result
# cbind(HL$expected, HL$observed)  #compare expected and observed counts per group

    Hosmer and Lemeshow test (binary model)

data:  data1$FRACTURE, fitted(fit1)
X-squared = 6.1627, df = 8, p-value = 0.629

The Hosmer-Lemeshow test gives a p-value of 0.629, which is greater than 0.05; we therefore find no evidence of lack of fit.

  2. Classification table
#Classification table
MyStep <- step(fit1, scope = formula(fit1))  ##stepwise check; fitted(MyStep) gives the predicted probabilities
Start:  AIC=525.46
FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST

            Df Deviance    AIC
<none>           513.46 525.46
- ARMASSIST  1   518.69 528.69
- AGE        1   518.92 528.92
- MOMFRAC    1   518.95 528.95
- HEIGHT     1   520.19 530.19
- PRIORFRAC  1   523.12 533.12
theProbs <- fitted(MyStep)  
## Table  
classDF <- data.frame(response=data1$FRACTURE, predicted=round(fitted(MyStep),0))

classification.table <- xtabs(~ predicted + response, data=classDF)  

classification.rate <- ((classification.table[2,2] + classification.table[1,1])/ (sum(margin.table(classification.table,1))))*100  

classification.rate 
[1] 75.6

The overall rate of correct classification is estimated as 75.6%, meaning that 75.6% of all subjects (fracture and non-fracture combined) are correctly classified at the 0.5 cutoff.
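
To see the four cells behind this rate, the table itself can be printed with its margins; a short sketch (output not shown):

#2x2 table: rows = predicted class, columns = observed FRACTURE
addmargins(classification.table)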

The sensitivity and specificity

sensitivity <- (classification.table[2,2] / (margin.table(classification.table,2)[2]))*100

#specificity = true negatives / all observed non-fractures, i.e. the column
#margin of the table (the row margin of predicted non-fractures would give
#the negative predictive value instead)
specificity <- (classification.table[1,1] / (margin.table(classification.table,2)[1]))*100

#Sensitivity 
print(sensitivity)
   1 
13.6 
#Specificity
print(specificity)
       0 
96.26667 

The sensitivity of the model is 13.6%, meaning the model correctly identifies 13.6% of the fractures, while the specificity is 96.27%, meaning it correctly identifies 96.27% of the non-fractures. The low sensitivity is a consequence of applying a 0.5 cutoff.

  3. Area under the ROC curve

The area under the ROC curve, which ranges from 0.5 to 1.0, provides a measure of the model’s ability to discriminate between those subjects who experience the outcome of interest versus those who do not.

#ROC
prob <- predict(fit1, type = "response")

library(pROC) 
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'
The following objects are masked from 'package:stats':

    cov, smooth, var
g <- roc(FRACTURE~prob, data=data1) 
Setting levels: control = 0, case = 1
Setting direction: controls < cases
mean(g$auc) #Area under the curve ROC 
[1] 0.7070613

The area under the ROC curve is 0.7071, which means the model has acceptable discrimination.

As a rule of thumb for interpreting the area under the ROC curve (AUC):

1) AUC = 0.5 –> No discrimination; we might as well flip a coin.

2) 0.5 < AUC < 0.7 –> Poor discrimination, not much better than a coin toss.

3) 0.7 <= AUC < 0.8 –> Acceptable discrimination.

4) 0.8 <= AUC < 0.9 –> Excellent discrimination.

5) AUC >= 0.9 –> Outstanding discrimination.
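
The classification table above used an arbitrary 0.5 cutoff, which is why sensitivity was low. pROC can suggest the cutoff that maximises sensitivity + specificity (the Youden index); a sketch (output not shown):

#probability cutoff that maximises the Youden index
coords(g, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))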

To plot the ROC curve

plot(g, main = "A Receiver Operating Characteristic curve",
     cex.main = 0.8, family = "Times")


text(0, 0.6, "AUC: 0.7071",col="blue", cex = 0.8)

The next step would be regression diagnostics; that topic will be covered in the next note.

For now, we establish the final model.

final.model <- glm(formula(fit1), data1, family=binomial("logit"))
summary(final.model)

Call:
glm(formula = formula(fit1), family = binomial("logit"), data = data1)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  3.74081    3.17824   1.177  0.23919   
AGE          0.02975    0.01275   2.333  0.01964 * 
HEIGHT      -0.04635    0.01816  -2.552  0.01070 * 
PRIORFRAC    0.75259    0.23959   3.141  0.00168 **
MOMFRAC      0.72263    0.30235   2.390  0.01684 * 
ARMASSIST    0.52372    0.22829   2.294  0.02179 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 562.34  on 499  degrees of freedom
Residual deviance: 513.46  on 494  degrees of freedom
AIC: 525.46

Number of Fisher Scoring iterations: 4

The odds ratio for each variable

exp(fit1$coefficients)
(Intercept)         AGE      HEIGHT   PRIORFRAC     MOMFRAC   ARMASSIST 
 42.1319492   1.0302015   0.9547102   2.1224882   2.0598462   1.6882976 
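
For reporting, these odds ratios are usually accompanied by 95% confidence intervals; a one-line sketch (output not shown; confint() here uses the profile-likelihood method from MASS, which was loaded earlier):

#odds ratios with profile-likelihood 95% CIs
exp(cbind(OR = coef(final.model), confint(final.model)))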

Based on the results, we conclude that AGE, HEIGHT, prior fracture (PRIORFRAC), mother's hip fracture (MOMFRAC), and needing arms to stand from a chair (ARMASSIST) are statistically significantly associated with FRACTURE (any fracture in the first year), since all p-values are less than 0.05. PRIORFRAC has the largest odds ratio: a woman with a prior fracture has 2.12 times the odds of a first-year fracture compared with a woman without one, adjusted for the other variables; that is, about 112% higher odds.