In this example, we use the GLOW dataset from the book Applied Logistic Regression by Hosmer and Lemeshow.
Aim: To select those variables that result in a “best” model within the scientific context of the problem.
Dataset GLOW (n=500)
The GLOW500 dataset comes from the Global Longitudinal Study of Osteoporosis in Women (GLOW), an international study of osteoporosis in women over 55 years of age, coordinated at the Center for Outcomes Research (COR) at the University of Massachusetts/Worcester by its Director, Dr. Frederick Anderson, Jr.
The study has enrolled over 60,000 women aged 55 and older in ten countries. The major goals of the study are to use the data to provide insights into the management of fracture risk, patient experience with prevention and treatment of fractures, and the distribution of risk factors among older women on an international scale over the follow-up period.
Data used here come from six sites in the United States and include a few selected potential risk factors for fracture from the baseline questionnaire. The outcome variable is any fracture in the first year of follow-up. The incident first-year fracture rate among the 21,000 subjects enrolled at these six sites is about 4 percent. In order to have a data set of a manageable size (n = 500) for this note, the fractures have been oversampled and the non-fractures undersampled (Hosmer-Lemeshow, 2013).
The codesheet for the variables in the GLOW study

Variable  Description                             Codes/Values                               Name
--------  --------------------------------------  -----------------------------------------  ---------
1         Identification code                     1-n                                         SUB_ID
2         Study site                              1-6                                         SITE_ID
3         Physician ID code                       128 unique codes                            PHY_ID
4         History of prior fracture               1 = Yes, 0 = No                             PRIORFRAC
5         Age at enrollment                       Years                                       AGE
6         Weight at enrollment                    Kilograms                                   WEIGHT
7         Height at enrollment                    Centimeters                                 HEIGHT
8         Body mass index                         kg/m^2                                      BMI
9         Menopause before age 45                 1 = Yes, 0 = No                             PREMENO
10        Mother had hip fracture                 1 = Yes, 0 = No                             MOMFRAC
11        Arms are needed to stand from a chair   1 = Yes, 0 = No                             ARMASSIST
12        Former or current smoker                1 = Yes, 0 = No                             SMOKE
13        Self-reported risk of fracture          1 = Less than others of the same age,       RATERISK
                                                  2 = Same as others of the same age,
                                                  3 = Greater than others of the same age
14        Fracture risk score                     Composite risk score                        FRACSCORE
15        Any fracture in first year              1 = Yes, 0 = No                             FRACTURE
Importing the dataset into the R environment:

```r
data1 <- read.table("GLOW500.txt", sep = "\t", header = TRUE)
dim(data1)  # dimensions of the dataset
```
```r
library(broom)

# One-by-one (univariable) estimates
results <- data.frame()  # data frame to store the results

# loop over each candidate covariate, dropping SUB_ID, SITE_ID, PHY_ID,
# and the outcome FRACTURE (columns 1, 2, 3, and 15)
for (var in names(data1)[c(-1, -2, -3, -15)]) {
  # dynamically generate the model formula
  formula <- as.formula(paste0("FRACTURE ~ ", var))
  # fit the univariable binary logistic regression
  fit <- tidy(glm(formula, data = data1, family = binomial("logit")))
  # capture the summary statistics for the covariate (row 2 of the tidy output)
  df <- data.frame(variable = var,
                   coef     = fit$estimate[2],
                   stderror = fit$std.error[2],
                   walds    = fit$statistic[2],
                   pvalue   = fit$p.value[2])
  # bind the rows of the temporary data frame to the results data frame
  results <- rbind(results, df)
}
results
```
Based on the univariable models, all variables are significant at the 0.05 level except WEIGHT, BMI, PREMENO, and SMOKE, whose p-values are greater than 0.05.
Step 2: We now fit our first multivariable model containing all covariates that are significant in the univariable analysis at the 25% level. The covariates WEIGHT (p > 0.25), BMI (p > 0.25), PREMENO (p > 0.25), and SMOKE (p > 0.25) are therefore excluded.
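The R code for this first multivariable fit is not shown in the original note; a minimal sketch, assuming the model object is named fit1 (the name referenced in later chunks) and that RATERISK is converted to a factor so that the RATERISK2 and RATERISK3 dummy variables are created:

```r
# "fit1" is an assumed name, matching the object used in later chunks
data1$RATERISK <- factor(data1$RATERISK)  # creates RATERISK2/RATERISK3 dummies
fit1 <- glm(FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST + RATERISK,
            data = data1, family = binomial("logit"))
summary(fit1)
```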
Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC +
ARMASSIST + RATERISK, family = binomial("logit"), data = data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.70935 3.22992 0.839 0.40157
AGE 0.03434 0.01305 2.632 0.00848 **
HEIGHT -0.04383 0.01827 -2.400 0.01640 *
PRIORFRAC 0.64526 0.24606 2.622 0.00873 **
MOMFRAC 0.62122 0.30698 2.024 0.04300 *
ARMASSIST 0.44579 0.23281 1.915 0.05551 .
RATERISK2 0.42202 0.27925 1.511 0.13071
RATERISK3 0.70692 0.29342 2.409 0.01599 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 562.34 on 499 degrees of freedom
Residual deviance: 507.50 on 492 degrees of freedom
AIC: 523.5
Number of Fisher Scoring iterations: 4
Once this model is fit, we examine each covariate to ascertain its continued significance, at traditional levels, in the model. The covariate with the largest p-value greater than 0.05 is RATERISK2, the design/dummy variable that compares women with RATERISK = 2 to women with RATERISK = 1.
Next, we will check the likelihood ratio test for RATERISK
```r
# refit the model without RATERISK
mod_fit_RATERISK <- glm(FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST,
                        data = data1, family = binomial("logit"))
anova(fit1, mod_fit_RATERISK, test = "Chisq")
```
The likelihood ratio test for the exclusion of self-reported risk of fracture (i.e., deleting RATERISK_2 and RATERISK_3 from the model) gives Chi-Square = 5.9587, which, with two degrees of freedom, yields p = 0.05083, nearly significant at the 0.05 level.
Before moving to the next step, we consider another possible model. Since the coefficient for RATERISK_2 is not significant, one possibility is to combine levels 1 and 2 (self-reported risk less than or the same as other women) into a single dichotomous covariate, although we then lose information about the specific log-odds of categories 1 and 2. On consultation with subject-matter investigators, combining these two categories was thought to be reasonable.
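The recoding step is not shown in the original note; one possible sketch, assuming RATERISK carries the integer codes 1-3 (the new dummy and model names here are assumptions):

```r
# combine levels 1 and 2 of RATERISK into the reference group;
# the new dummy equals 1 only for level 3
data1$RATERISK2 <- as.numeric(data1$RATERISK == 3)
fit2 <- glm(FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST + RATERISK2,
            data = data1, family = binomial("logit"))
summary(fit2)
```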
Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC +
ARMASSIST + RATERISK, family = binomial("logit"), data = data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.40658 3.17704 1.072 0.28361
AGE 0.03318 0.01295 2.563 0.01039 *
HEIGHT -0.04631 0.01813 -2.555 0.01063 *
PRIORFRAC 0.66416 0.24521 2.709 0.00676 **
MOMFRAC 0.66402 0.30558 2.173 0.02978 *
ARMASSIST 0.47272 0.23133 2.044 0.04100 *
RATERISK2 0.45792 0.23806 1.924 0.05441 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 562.34 on 499 degrees of freedom
Residual deviance: 509.82 on 493 degrees of freedom
AIC: 523.82
Number of Fisher Scoring iterations: 4
In this model, the coefficient for the recoded covariate (now labelled RATERISK2) estimates the log of the odds ratio comparing the odds of fracture for individuals in level 3 to the combined group consisting of levels 1 and 2.
Even after combining the categories of RATERISK, the p-value remains above 0.05, so we now remove the variable from the model.
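The refit is again not shown in the original; a sketch, reusing the name fit1 for the preliminary main-effects model (consistent with the vif(fit1) and logitgof(..., fitted(fit1)) calls later in the note):

```r
# preliminary main-effects model after dropping RATERISK
fit1 <- glm(FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST,
            data = data1, family = binomial("logit"))
summary(fit1)
```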
Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + PRIORFRAC + MOMFRAC +
ARMASSIST, family = binomial("logit"), data = data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.74081 3.17824 1.177 0.23919
AGE 0.02975 0.01275 2.333 0.01964 *
HEIGHT -0.04635 0.01816 -2.552 0.01070 *
PRIORFRAC 0.75259 0.23959 3.141 0.00168 **
MOMFRAC 0.72263 0.30235 2.390 0.01684 *
ARMASSIST 0.52372 0.22829 2.294 0.02179 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 562.34 on 499 degrees of freedom
Residual deviance: 513.46 on 494 degrees of freedom
AIC: 525.46
Number of Fisher Scoring iterations: 4
Step 3: At this point we have our preliminary main-effects model and must now check the scale of the logit for the continuous covariates AGE and HEIGHT.
In this note, I show this using the Box-Tidwell test.
The Box-Tidwell test evaluates the linearity-of-the-logit assumption in logistic regression models.
library(car)
Loading required package: carData
```r
boxTidwell(FRACTURE ~ AGE, other.x = ~ HEIGHT + PRIORFRAC + MOMFRAC + ARMASSIST,
           data = data1)
```
Step 4: We also check for plausible interactions among the main effects by adding an AGE by HEIGHT interaction term to the model.
Call:
glm(formula = FRACTURE ~ AGE + HEIGHT + AGE * HEIGHT + PRIORFRAC +
MOMFRAC + ARMASSIST, family = binomial("logit"), data = data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 14.692414 23.850071 0.616 0.53787
AGE -0.124501 0.332818 -0.374 0.70834
HEIGHT -0.114786 0.148872 -0.771 0.44068
PRIORFRAC 0.743006 0.240328 3.092 0.00199 **
MOMFRAC 0.725302 0.302650 2.397 0.01655 *
ARMASSIST 0.523937 0.228411 2.294 0.02180 *
AGE:HEIGHT 0.000965 0.002081 0.464 0.64283
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 562.34 on 499 degrees of freedom
Residual deviance: 513.24 on 493 degrees of freedom
AIC: 527.24
Number of Fisher Scoring iterations: 4
Based on the model with the interaction term, the AGE by HEIGHT interaction is not significant, since its p-value is greater than 0.05. Hence, the interaction term should be removed from the model.
Step 5: Checking Multicollinearity of independent variables
```r
library(car)
car::vif(fit1)
```
AGE HEIGHT PRIORFRAC MOMFRAC ARMASSIST
1.140680 1.066260 1.080805 1.012421 1.085556
Based on the Variance Inflation Factor (VIF), all values are well below 10, indicating that multicollinearity is not an issue in this dataset. Thus, we can proceed to establish the final model.
To establish the final model, we assess its goodness of fit, i.e., whether the probabilities produced by the model accurately reflect the true outcome experience in the data.
The methods used to assess goodness of fit in this note are: the Hosmer-Lemeshow test, the classification table, and the area under the ROC curve.
The Hosmer-Lemeshow test
```r
# Hosmer and Lemeshow test
library(generalhoslem)
```
Loading required package: reshape
Loading required package: MASS
```r
HL <- logitgof(data1$FRACTURE, fitted(fit1))
HL  # Hosmer and Lemeshow test p-value
cbind(HL$expected, HL$observed)  # expected vs observed counts per group
```
Hosmer and Lemeshow test (binary model)
data: data1$FRACTURE, fitted(fit1)
X-squared = 6.1627, df = 8, p-value = 0.629
Based on the Hosmer-Lemeshow test, the p-value is 0.629, which is greater than 0.05, so we do not reject the null hypothesis of adequate fit: the model fits the data well.
Classification table
Based on the classification table, the sensitivity of the model was 13.6%: the model correctly classified the fractures at a rate of 17.6%, and correctly classified 76.97% of the non-fractures.
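The classification table itself is not reproduced above; a minimal sketch of how it can be computed, assuming the conventional 0.5 cutoff (the cutoff actually used in the note is not stated):

```r
# cross-tabulate observed outcomes against predictions at a 0.5 cutoff
pred_class <- as.numeric(fitted(fit1) >= 0.5)
table(Observed = data1$FRACTURE, Predicted = pred_class)
```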
Area under the ROC curve
The area under the ROC curve, which ranges from 0.5 to 1.0, provides a measure of the model’s ability to discriminate between those subjects who experience the outcome of interest versus those who do not.
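The note does not show the code for the ROC analysis; one possible approach uses the pROC package (an assumption, since no ROC package is loaded above):

```r
library(pROC)
# use the fitted probabilities from the final model as the classifier score
roc_obj <- roc(data1$FRACTURE, fitted(fit1))
auc(roc_obj)   # area under the ROC curve
plot(roc_obj)  # draw the ROC curve
```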
Based on the results, we can conclude that AGE, HEIGHT, PRIORFRAC (history of prior fracture), MOMFRAC (mother had hip fracture), and ARMASSIST (arms are needed to stand from a chair) are statistically significant predictors of FRACTURE (any fracture in the first year), since all their p-values are less than 0.05. Among the odds ratios of the model, PRIORFRAC has the highest: those with a prior fracture have 2.122 times the odds of a first-year fracture compared with those without, adjusted for the other variables. Equivalently, a person with a prior fracture has 122% higher odds of fracture than a person without one.
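The odds ratios quoted here can be obtained by exponentiating the model coefficients; a sketch using Wald intervals via confint.default:

```r
# odds ratios and Wald 95% confidence intervals for the final model;
# exp(0.75259) is approximately 2.122, matching the PRIORFRAC odds ratio above
exp(cbind(OR = coef(fit1), confint.default(fit1)))
```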