Naïve Bayes Classification (R)

Heart Attack Survival

This analysis examines the factors that determine the likelihood of survival at two points in time after a patient has been admitted to the hospital for a heart attack: during the hospital stay and up to a follow-up appointment. Patient attributes are categorized and a Naïve Bayes classifier is built to identify the most influential factors and how they affect predicted survival. Identifying these factors can be useful in improving long-term survival.


Data

Three quantitative variables (age, bmi, hr) were left unmodified. Patient age is negatively skewed, which is in line with the average age of heart attack patients; also represented are younger age groups (under 55), who have seen an increase in hospitalization due to heart attacks since 1995 (American Heart Association News, 2018). Body mass index (bmi) is approximately normally distributed. Heart rate (hr) is slightly positively skewed, with most observations falling in the normal range (60 to 100 bpm).

Two numeric attributes were not normally distributed. Length of hospital stay (los) is positively skewed, with an average stay of just over six days. The length of time between a hospital stay and follow-up (lenfol) is multimodal, with most follow-up appointments occurring about 1, 3, or 5 years after hospital admission. The los variable was converted to a factor using discretization. The lenfol variable was removed from the dataset because of its relationship to the outcome status: only patients who did not die at the hospital would have a follow-up appointment.
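As a rough illustration of the discretization and column removal described above, the sketch below bins a numeric length of stay with base R's `cut()` and then drops lenfol. The data values, the cut points, and the bin labels are all assumptions made for the example; the write-up does not state which bins were actually used.

```r
# Hypothetical patient rows (not the study data)
patients <- data.frame(
  los    = c(1, 2, 3, 5, 6, 8, 12, 21, 47),                    # days in hospital
  lenfol = c(400, 1100, 30, 1800, 365, 90, 1095, 12, 700)      # days to follow-up
)

# Illustrative cut points; the original analysis may have used different ones
patients$los <- cut(
  patients$los,
  breaks = c(0, 3, 7, Inf),
  labels = c("short", "medium", "long"),
  right  = TRUE            # intervals are (0,3], (3,7], (7,Inf)
)

# Drop lenfol because it is only observed for patients who left the hospital alive
patients$lenfol <- NULL

table(patients$los)
```

Fixed cut points keep the bins interpretable; quantile-based breaks would be an alternative if roughly equal bin sizes mattered more.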


Correlation

The strongest positive correlation (.61) was between the two blood pressure variables, which are later combined into one variable. Length of time between hospital admission and the follow-up appointment (lenfol) showed the strongest negative correlation (-.56), with admission year (year) and patient survival status at the follow-up appointment (fstat).
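A correlation screen like this one can be sketched in base R with `cor()`, followed by a ranking of the variable pairs by absolute correlation. The data below are simulated, and the column names `sysbp` and `diasbp` are hypothetical stand-ins for the two blood pressure variables, which the write-up does not name.

```r
set.seed(1)
n <- 200

# Simulated numeric attributes; the two blood pressure columns are built to correlate
sysbp <- rnorm(n, mean = 140, sd = 30)
toy <- data.frame(
  sysbp  = sysbp,
  diasbp = 0.4 * sysbp + rnorm(n, mean = 20, sd = 15),
  age    = rnorm(n, mean = 70, sd = 14),
  hr     = rnorm(n, mean = 87, sd = 23)
)

corr <- cor(toy)                              # Pearson correlation by default

# Keep each pair once by blanking the diagonal and the lower triangle
corr[lower.tri(corr, diag = TRUE)] <- NA
pairs <- as.data.frame(as.table(corr), stringsAsFactors = FALSE)
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("var1", "var2", "r")

# Strongest relationships first, regardless of sign
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```

Strongly correlated predictors matter for Naïve Bayes in particular, because the classifier assumes predictors are independent within each class.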


Model Comparison

The first model was constructed using all attributes as predictors. (72.3% accuracy on training, 71.5% accuracy on the test set.)
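To make the fit-then-score workflow concrete, here is a minimal base R sketch of a Naïve Bayes classifier for factor predictors, evaluated on a training and a test split. It is not the author's script: the data are simulated, the helper names `fit_nb` and `predict_nb` are hypothetical, the 70/30 split and the Laplace smoothing are assumptions, and numeric predictors are left out for brevity. A package implementation would normally be used in practice.

```r
set.seed(42)
n <- 400

# Simulated factor predictors; names only echo the article
toy <- data.frame(
  chf = factor(sample(c("no", "yes"), n, replace = TRUE)),
  sho = factor(sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1))),
  los = factor(sample(c("short", "medium", "long"), n, replace = TRUE))
)
# Outcome loosely tied to chf and sho so there is something to learn
p_dead <- 0.2 + 0.3 * (toy$chf == "yes") + 0.4 * (toy$sho == "yes")
toy$status <- factor(ifelse(runif(n) < p_dead, "dead", "alive"))

train_idx <- sample(n, size = 0.7 * n)        # assumed 70/30 split
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit_nb <- function(data, target, predictors, laplace = 1) {
  y   <- data[[target]]
  tab <- table(y)
  list(
    prior = setNames(as.vector(tab) / sum(tab), names(tab)),
    # One table of P(x | class) per predictor, with Laplace smoothing
    cond = lapply(setNames(predictors, predictors), function(v) {
      counts <- table(y, data[[v]]) + laplace
      counts / rowSums(counts)
    })
  )
}

predict_nb <- function(model, newdata) {
  classes <- names(model$prior)
  # Log posterior (up to a constant): log prior + sum of log P(x | class)
  log_post <- do.call(cbind, lapply(classes, function(k) {
    lp <- rep(log(model$prior[[k]]), nrow(newdata))
    for (v in names(model$cond)) {
      lp <- lp + log(model$cond[[v]][k, as.character(newdata[[v]])])
    }
    lp
  }))
  classes[max.col(log_post, ties.method = "first")]
}

model      <- fit_nb(train, "status", c("chf", "sho", "los"))
pred_train <- predict_nb(model, train)
pred_test  <- predict_nb(model, test)

acc_train <- mean(pred_train == as.character(train$status))
acc_test  <- mean(pred_test  == as.character(test$status))
c(train = acc_train, test = acc_test)
```

Comparing the two accuracies, as the write-up does for each model, is a quick check for overfitting: a large gap between training and test accuracy would suggest the model does not generalize.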

For the second model, variable selection was based on two logistic regression models that used backward elimination based on p-values. The model summaries are shown in Figures 4 and 5. The first logistic regression was fit to predict whether the patient was living or dead. Variables identified as significant were age, hr, sho, chf, av3, and year. The second logistic regression was fit to predict whether death occurred at the hospital or by the time of the follow-up appointment. Significant attributes identified were hr, sho, los, and bp. (72.1% accuracy on training, 70.3% accuracy on the test set.)
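Backward elimination on p-values can be sketched as a loop that refits a logistic regression and drops the least significant term until every remaining term clears a threshold. The version below is an assumption-laden illustration: the data are simulated, `backward_by_p` is a hypothetical helper, the 0.05 cutoff is assumed, and it uses likelihood-ratio p-values from `drop1()` rather than whichever p-values the original scripts used.

```r
set.seed(7)
n <- 500

# Simulated data: age and hr drive the outcome, bmi and noise do not
toy <- data.frame(
  age   = rnorm(n, mean = 70, sd = 14),
  hr    = rnorm(n, mean = 87, sd = 23),
  bmi   = rnorm(n, mean = 26, sd = 5),
  noise = rnorm(n)
)
toy$dead <- rbinom(n, size = 1, prob = plogis(-8 + 0.08 * toy$age + 0.02 * toy$hr))

backward_by_p <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- glm(reformulate(predictors, response), data = data, family = binomial)
    # Likelihood-ratio test for dropping each term; first row is "<none>"
    tests <- drop1(fit, test = "Chisq")
    p <- setNames(tests[["Pr(>Chi)"]][-1], rownames(tests)[-1])
    # Stop when every term is significant or only one predictor is left
    if (max(p) <= alpha || length(predictors) == 1) {
      return(list(fit = fit, predictors = predictors, p = p))
    }
    predictors <- setdiff(predictors, names(which.max(p)))
  }
}

selected <- backward_by_p(toy, "dead", c("age", "hr", "bmi", "noise"))
selected$predictors
```

The surviving predictors would then be passed to the Naïve Bayes model, which is how the second model's variable set is described above.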

The third and final model was created using a stepwise, accuracy-based procedure. Variables were removed one at a time. If accuracy decreased after a variable was removed, the variable was added back in. If accuracy increased or stayed roughly the same, the variable stayed out. The resulting model predicts survival status class from the independent variables afb, age, bmi, bp, chf, cvd, gender, hr, los, sho, and year. (73.5% accuracy on training, 72.9% accuracy on the test set.)
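The remove-and-check loop described above might look like the following sketch. The function `backward_by_accuracy`, the tolerance that defines "roughly the same", and the toy scoring function are all hypothetical. In a real run, `score_fn` would refit the Naïve Bayes model on the given variables and return its accuracy.

```r
backward_by_accuracy <- function(vars, score_fn, tol = 0.005) {
  best <- score_fn(vars)
  for (v in vars) {
    if (length(vars) == 1) break
    trial <- setdiff(vars, v)
    acc   <- score_fn(trial)
    # Leave the variable out unless accuracy fell by more than the tolerance;
    # otherwise it is effectively added back by keeping `vars` unchanged
    if (acc >= best - tol) {
      vars <- trial
      best <- acc
    }
  }
  list(vars = vars, accuracy = best)
}

# Stand-in scorer with made-up contributions, used only to exercise the loop
toy_gain <- c(age = 0.10, chf = 0.06, bmi = 0.03, noise1 = 0, noise2 = -0.02)
toy_score <- function(vars) 0.55 + sum(toy_gain[vars])

result <- backward_by_accuracy(names(toy_gain), toy_score)
result$vars       # age, chf, bmi
result$accuracy   # 0.74 with the made-up contributions above
```

Because the outcome depends on the order in which variables are tried, a different ordering can yield a different final set; scoring on a validation set rather than the final test set would also avoid tuning to the reported test accuracy.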


The overall accuracy of the model was moderate. It predicted survival for 56% of patients, death at the hospital for 7%, and death between hospital admission and the follow-up appointment for 37%. Age is the most significant factor in determining survival of a heart attack up to follow-up care, followed by congestive heart complications and body mass index.
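Summaries like these come from tabulating predictions against the actual classes. The short sketch below uses ten made-up cases and hypothetical class labels to show how overall accuracy and the share of each predicted class can be computed in base R; the numbers it produces are not the study's results.

```r
lv <- c("alive", "died_in_hospital", "died_by_followup")   # hypothetical labels

# Ten made-up cases for illustration only
actual <- factor(c("alive", "alive", "alive", "alive", "died_in_hospital",
                   "died_by_followup", "died_by_followup", "died_by_followup",
                   "died_by_followup", "alive"), levels = lv)
predicted <- factor(c("alive", "alive", "alive", "died_by_followup", "died_in_hospital",
                      "died_by_followup", "died_by_followup", "alive",
                      "died_by_followup", "alive"), levels = lv)

conf_mat <- table(predicted, actual)            # confusion matrix
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) # share of correct predictions
shares   <- prop.table(table(predicted))        # share of each predicted class

conf_mat
accuracy
shares
```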


R scripts on GitHub
