Data Portfolio
Naïve Bayes Classification (R)
Heart Attack Survival
This analysis focuses on the factors that determine the likelihood of survival at two points in time after a patient has been admitted to the hospital for a heart attack: at the hospital and up to a follow-up appointment. After categorizing patient attributes, a Naïve Bayes classifier is built to identify the most influential factors and how they affect the predicted survivability of a heart attack at each point. Identifying these factors can be useful in prolonging long-term survival.

Data
Three quantitative variables (age, bmi, hr) were left unmodified. Patient age is negatively skewed, which is in line with the average age of heart attack patients; younger age groups (under 55), who have seen an increase in hospitalizations due to heart attacks since 1995 (American Heart Association News, 2018), are also represented. Body mass index (bmi) is roughly normally distributed. Heart rate (hr) is slightly positively skewed, with most observations falling in the normal range (60 to 100 bpm).
Two numeric attributes were not normally distributed. Length of hospital stay (los) is positively skewed, with an average length of stay just over six days. The length of time between a hospital stay and follow-up (lenfol) is multimodal, with most follow-up appointments occurring 1, 3, or 5 years after hospital admission. The los variable was converted to a factor using discretization. The lenfol variable was removed from the dataset because of its relationship to the outcome status: only patients who did not die at the hospital would have a follow-up appointment.
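The discretization of los can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the analysis. The write-up does not state the cut points, so the bin edges, the labels, and the function name discretize_los are all assumptions.

```python
from bisect import bisect_right

# Hypothetical bin edges in days; the analysis does not state the ones it used.
LOS_EDGES = [4, 8]
# Assumed labels; there must be exactly one more label than there are edges.
LOS_LABELS = ["short", "medium", "long"]


def discretize_los(days, edges=LOS_EDGES, labels=LOS_LABELS):
    """Map a length of stay in days to a categorical label.

    A value equal to an edge falls into the higher bin:
    days < 4 is "short", 4 <= days < 8 is "medium", days >= 8 is "long".
    """
    return labels[bisect_right(edges, days)]


# Made-up lengths of stay, not taken from the dataset.
stays = [1, 4, 6, 8, 21]
print([discretize_los(d) for d in stays])
```

Turning a skewed numeric variable into a small number of levels lets a Naïve Bayes model estimate a simple frequency table for it instead of assuming a particular distribution.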

Correlation
The strongest positive correlation (0.61) was between the two blood pressure variables, which are later combined into one variable. Length of time between hospital admission and follow-up appointment (lenfol) showed the strongest negative correlation (-0.56), with admission year (year) and patient survival status at the follow-up appointment (fstat).
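For reference, a pairwise correlation like the ones reported above can be computed as in this minimal Python sketch. It assumes the Pearson coefficient, which the write-up does not name explicitly. The function name pearson_r and the sample values are made up for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Deviations from the mean for each variable.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up blood pressure readings, not taken from the dataset.
systolic = [120, 135, 150, 110, 160]
diastolic = [80, 85, 95, 70, 100]
print(round(pearson_r(systolic, diastolic), 2))
```

Strongly correlated predictors matter here because Naïve Bayes treats predictors as independent given the class, which is one motivation for combining the two blood pressure variables.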
Model Comparison
The first model was constructed using all attributes as predictors (72.3% accuracy on the training set, 71.5% on the test set).
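The general idea of fitting a Naïve Bayes classifier and measuring its accuracy can be sketched as below. This is a simplified Python illustration, not the R model from the analysis: it handles categorical predictors only and uses add-one (Laplace) smoothing, which is an assumption. The class name, the "yes"/"no" values, the class labels, and the rows are made up; only the variable names sho and chf come from the dataset.

```python
from collections import Counter, defaultdict
from math import log


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.features = sorted(rows[0])
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # value_counts[(feature, class)][value] = number of training rows
        self.value_counts = defaultdict(Counter)
        self.levels = defaultdict(set)  # distinct values seen per feature
        for row, label in zip(rows, labels):
            for f in self.features:
                self.value_counts[(f, label)][row[f]] += 1
                self.levels[f].add(row[f])
        return self

    def predict(self, row):
        best_class, best_score = None, float("-inf")
        for c, n_c in self.class_counts.items():
            # log P(class) + sum over features of log P(value | class)
            score = log(n_c / self.total)
            for f in self.features:
                count = self.value_counts[(f, c)][row[f]]
                score += log((count + 1) / (n_c + len(self.levels[f])))
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def accuracy(model, rows, labels):
    """Fraction of rows whose predicted class matches the true label."""
    hits = sum(model.predict(r) == y for r, y in zip(rows, labels))
    return hits / len(labels)


# Made-up rows for illustration only; not taken from the dataset.
train_rows = [
    {"sho": "yes", "chf": "yes"},
    {"sho": "yes", "chf": "no"},
    {"sho": "no", "chf": "yes"},
    {"sho": "no", "chf": "no"},
    {"sho": "no", "chf": "no"},
    {"sho": "no", "chf": "no"},
]
train_labels = ["dead", "dead", "alive", "alive", "alive", "alive"]

model = CategoricalNaiveBayes().fit(train_rows, train_labels)
print(model.predict({"sho": "yes", "chf": "yes"}))
print(accuracy(model, train_rows, train_labels))
```

Calling accuracy on a held-out set of rows, instead of the training rows, gives the kind of test-set figure reported for each model.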
For the second model, variable selection was based on two logistic regression models that used backward elimination based on p-values. The model summaries are shown in Figures 4 and 5. The first logistic regression model was created to determine survival status (living or dead). Variables identified as significant were age, hr, sho, chf, av3, and year. The second was created based on death at the hospital or at the time of the follow-up appointment. Significant attributes identified were hr, sho, los, and bp. The second Naïve Bayes model reached 72.1% accuracy on the training set and 70.3% on the test set.
The third and final model was created using a stepwise approach in which variables were removed one at a time. If accuracy decreased after a variable was removed, the variable was added back in. If accuracy increased or stayed roughly the same, the variable remained removed. The resulting model predicts survival status class from the predictors afb, age, bmi, bp, chf, cvd, gender, hr, los, sho, and year (73.5% accuracy on the training set, 72.9% on the test set).
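The remove-one-variable-at-a-time procedure described above can be sketched as a greedy loop. This is a Python illustration under stated assumptions, not the code used in the analysis: the function names, the tolerance parameter (standing in for "stayed roughly the same"), and the toy scoring function are hypothetical. In practice, score would train the Naïve Bayes model on the given predictors and return its accuracy.

```python
def backward_eliminate(features, score, tolerance=0.0):
    """Greedy one-at-a-time removal guided by an accuracy function.

    `score(feature_list)` must return the accuracy of a model trained on
    those features. A feature stays removed when accuracy does not fall
    by more than `tolerance`; otherwise it is added back.
    """
    kept = list(features)
    best = score(kept)
    for f in list(features):
        if len(kept) == 1:
            break  # keep at least one predictor
        trial = [k for k in kept if k != f]
        trial_score = score(trial)
        if trial_score >= best - tolerance:
            # Removal did not hurt accuracy: leave the feature out.
            kept, best = trial, trial_score
        # Otherwise accuracy dropped, so f stays in `kept` (added back).
    return kept, best


# Toy stand-in for "train the model and measure accuracy": only the
# features in USEFUL raise the score, and every feature costs a little.
# The numbers are made up and do not come from the analysis.
USEFUL = {"age": 0.10, "sho": 0.08}


def toy_score(feature_list):
    gain = sum(USEFUL.get(f, 0.0) for f in feature_list)
    return 0.5 + gain - 0.001 * len(feature_list)


selected, acc = backward_eliminate(["age", "bmi", "sho", "year"], toy_score)
print(selected)
```

Because the loop is greedy, the result can depend on the order in which variables are tried, so a different ordering may keep a different subset.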

