Data Portfolio
Predictive Analysis
Public Libraries and the Catalog Inclusion of eBooks (R)
This project focuses on public libraries’ integration of eBooks into their catalogs. Using Census data, this analysis predicts the expected increase in digital materials and explores public libraries' ability to keep pace with it.
Exploratory Analysis
How Americans access library services.
“Between 2008 and 2017, the number of eBooks and downloadable audio materials has more than doubled” (IMLS, 2020). Both public libraries and the eBook publishing industry are directly affected by community demand for eBooks. With the eBook market projected to grow, public library eBook demand can also be expected to grow. Public libraries are also expanding their services to include technology and digital literacy training. Librarians have been, and continue to be, trained to assist community members with new content and services.


US Public Libraries
The Public Libraries Survey (PLS) data set for each year from 2012 to 2018 is used for this analysis. The raw data contains between 9,245 and 9,309 records per year, with 262 features. After an initial review of the data, 56 variables remain. The features can be grouped by demographics (state, county, zip, area population), library entities (library type, number of branches and bookmobiles), expenses (salaries, operating expenditures), capital, catalog content (print and digital content), circulation, internet services, and visits and website sessions.
The American Community Survey (ACS) data sets for each year from 2010 to 2019 were gathered through the Census Bureau’s API. The years used may vary slightly once additional data exploration is complete, depending on how PLS eBook catalog increases correlate with prior years’ changes in the community. Variable groups included in the ACS data are Age and Sex, Race, Educational Attainment, Poverty (including food assistance via SNAP), and Health Insurance.

Correlation
Many fields within the PLS data are highly correlated, especially fields within the same group. PLS correlation is shown in Table 4. The analysis will likely rely primarily on group totals (for example, total income to represent all income groups); however, the individual fields are being left in place for later use (for example, tying FEDGVT income to the LSTA grant data).
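As a rough illustration of how highly correlated fields can be flagged, the base R sketch below computes pairwise correlations and lists the pairs above a cutoff. The column names, the toy values, and the 0.9 cutoff are all hypothetical; they are not taken from the PLS data or from this project's code.

```r
# Hypothetical toy data: three related income columns and one unrelated column.
local_income <- c(10, 12, 15, 11, 18, 20, 17, 25, 22, 30)
state_income <- c(3, 4, 5, 3, 5, 6, 5, 8, 7, 9)
pls_toy <- data.frame(
  local_income = local_income,
  state_income = state_income,
  total_income = local_income + state_income,
  visits       = c(500, 480, 510, 495, 505, 490, 515, 485, 500, 495)
)

# Pairwise Pearson correlations between all columns.
corr <- cor(pls_toy)

# Keep each pair once (upper triangle) and flag pairs above an assumed cutoff.
cutoff <- 0.9
idx <- which(upper.tri(corr) & abs(corr) > cutoff, arr.ind = TRUE)
high_pairs <- data.frame(
  feature_a = rownames(corr)[idx[, "row"]],
  feature_b = colnames(corr)[idx[, "col"]],
  r         = corr[idx],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

In this toy example the three income columns are flagged against one another while visits is not, which mirrors the idea of letting a group total stand in for its components.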

Calculated Features
Because this analysis focuses on year-to-year changes within communities and the public libraries located in them, the data was used to create a new data set containing the percent change for each feature. This allowed the analysis to focus on percentage changes in population segments from the ACS and in library features from the PLS.
After percent changes were calculated using the standard percent change equation and the reshape2 package in R, the data was transformed into a time series so that imputation could be done using multivariate Kalman smoothing. I felt the numbers produced by this imputation method (with the imputeTS package) were more representative of actual changes within the ACS and PLS data than those from the earlier imputation performed with the mice package.
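The percent change step can be sketched in base R as follows. This is a minimal illustration, not the project's code: the data frame, its column names, and its values are hypothetical, and it uses base R's ave() rather than reshape2. The formula is (current - previous) / previous * 100, computed within each library system.

```r
# Hypothetical long-format data: one row per library system and year.
ebooks <- data.frame(
  library_id = c("A", "A", "A", "B", "B", "B"),
  year       = c(2016, 2017, 2018, 2016, 2017, 2018),
  ebook_vols = c(1000, 1500, 1200, 400, 400, 500),
  stringsAsFactors = FALSE
)

# Percent change from the previous value. The first year of each series has
# no prior value, so its result is NA.
pct_change <- function(x) {
  previous <- c(NA, head(x, -1))
  (x - previous) / previous * 100
}

# Sort so each library's years are consecutive, then apply within each library.
ebooks <- ebooks[order(ebooks$library_id, ebooks$year), ]
ebooks$ebook_pct_change <- ave(ebooks$ebook_vols, ebooks$library_id,
                               FUN = pct_change)
print(ebooks)
```

Library A moves from 1,000 to 1,500 to 1,200 volumes, giving changes of 50% and -20%; library B gives 0% and 25%. The NA values for each first year, along with any infinite values produced when a prior value is zero, are the kind of gaps an imputation step would then need to address.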

Percent change of eBook volume in public library catalogs from year to year
eBook incorporation into public library catalogs has been fairly consistent, with increases in the number of eBooks more common than decreases. Where decreases occur, they appear to be grouped by state, suggesting that public libraries within a state are connected to one another through shared digital catalogs. Statewide decreases in volume could be due to a decrease in funding to public libraries. It is also possible that these states negotiate eBook usage rights as a group on behalf of all libraries in a given area. I think this is unlikely because libraries often act as independent entities, but it is a possibility. The statewide decreases also follow a year in which eBook catalogs increased significantly. Areas showing an immediate and significant increase followed by a significant decrease could have taken part in a digital trial, possibly through a third party like OverDrive or Libby.
The Northeast and the Midwest are actively increasing their eBook catalogs. Cities including New York, Boston, Philadelphia, and Chicago may have a substantial influence on these increases. Texas is also actively increasing its eBook catalog sizes. It appears that areas around large cities, and possibly areas with concentrations of colleges and universities, have steadier volume increases. Although students use college and university libraries, which are considered a different type of library, acclimation to eBooks at learning institutions may increase patron familiarity with them.
eBook volume in public library catalogs is negatively correlated with the number of people without health insurance in the community.

Correlation insights
Other correlations seen in the data include a negative correlation with males and females whose educational attainment is less than high school. Positive correlations are seen with decreases in food assistance, increases in the female population aged 24 to 44, and, surprisingly, females with educational attainment greater than a high school diploma and less than a bachelor’s degree (females group A). One possible reason for the increased demand for eBooks at public libraries among this group is the desire to continue education at a reduced cost. One of the goals of public libraries is to provide free education.
The economic health of a community is directly related to the increase or decrease of eBooks available in public libraries. As patrons face economic struggles of their own, the demand for eBooks diminishes. This could also tie into the effect of females group A. A decrease in resources could lead this group to seek out more affordable ways to pursue education.
eBook catalogs have increased in size while electronic material expenditures and circulation have increased only slightly


Large towns and cities of all sizes pay significantly less per eBook.
Public libraries in large cities are growing their eBook catalogs the fastest. Public libraries located in more densely populated areas can purchase the rights to eBooks at lower costs. Rural areas, including small towns and areas outside metro areas, pay far more than areas on the fringe of cities. Less densely populated areas are still growing their eBook catalogs but are adding fewer eBooks. One possibility is that libraries in smaller areas are more selective in the eBooks they acquire. Another factor is that smaller populations require fewer eBooks, whereas more densely populated areas are more likely to purchase multiple copies of the same eBook.
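One way a per-eBook cost comparison by locale could be set up is sketched below in base R. The cost definition used here (electronic materials spending divided by eBooks added, summed within each locale) is an assumption rather than the calculation used in this project, and the column names and values are hypothetical.

```r
# Hypothetical values; column names and the cost definition are assumptions.
libs <- data.frame(
  locale        = c("City", "City", "Suburb", "Town", "Rural", "Rural"),
  elec_spending = c(50000, 30000, 12000, 6000, 3000, 2000),
  ebooks_added  = c(25000, 15000, 4000, 1500, 500, 500),
  stringsAsFactors = FALSE
)

# Sum spending and eBooks within each locale, then divide to get a unit cost.
by_locale <- aggregate(cbind(elec_spending, ebooks_added) ~ locale,
                       data = libs, FUN = sum)
by_locale$cost_per_ebook <- by_locale$elec_spending / by_locale$ebooks_added

# Order from cheapest to most expensive locale.
by_locale <- by_locale[order(by_locale$cost_per_ebook), ]
print(by_locale)
```

Summing before dividing weights each locale by its total purchasing volume; averaging each library's own ratio instead would give small libraries the same weight as large ones, so the choice affects how large the urban and rural gap appears.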
Another possible but unlikely influence on eBook catalogs is the interaction between libraries. Inter-library loan agreements allow patrons access to eBooks at different library locations. “eBook interlibrary loan has not become prevalent in libraries” (Ren, 2018).
The vast difference in eBook costs should be explored further, and the calculations should be extended to include Bowker’s pricing data. Pricing has decreased and leveled out in the most recent years, but the difference in what libraries pay based on the type of geographic location is likely still significant.
Although inter-library loans have not become prevalent, do they have some effect? For instance, are inter-library loans more prevalent in more densely populated areas? Another consideration for urban public libraries is that they are likely to have a greater number of branches that share eBook access. These eBooks would not be captured as inter-library loans but would increase the number of eBooks and increase access within the library service area.
Public library expenditures for all materials have increased by 32% since 1995 (Public Library Revenue, n.d.). The increase does not seem to correspond to the higher price of eBooks, which cost more to incorporate into catalogs than their print counterparts.
Clustering of Data Sets
Cluster plots used for clustering the ACS and PLS data sets.
Neither of the two main data sets used in this project is well-suited for clustering without a substantial amount of data manipulation. There are too many features in each dataset to cluster the data quickly and successfully. Further investigation including a large reduction in the number of features could provide improved results.

Classification Models
Decision Tree
The decision tree model was tuned using the complexity parameter (cp). The best cp value was determined to be 0.001252003 and was used to prune the tree. Figure 14 shows how complexity was determined and the resulting pruned decision tree. The benefit gained at this cp value is extremely small, and the tree remains complex. The overall accuracy rate for this model was 41%.
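Choosing a cp value and pruning can be sketched as follows. This assumes the rpart package (the source does not name the package used) and substitutes R's built-in iris data for the project's data, so the selected cp will not match the value reported above.

```r
library(rpart)  # assumed package; it ships with most R installations

set.seed(42)  # cross-validation folds are assigned at random

# Grow a deliberately large tree by setting cp = 0.
full_tree <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# cptable has one row per candidate cp, including cross-validated error.
cp_table <- full_tree$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the subtree associated with the selected cp.
pruned_tree <- prune(full_tree, cp = best_cp)

printcp(full_tree)
cat("Selected cp:", best_cp, "\n")
cat("Nodes before and after pruning:",
    nrow(full_tree$frame), nrow(pruned_tree$frame), "\n")
```

When the cross-validated error is nearly flat across candidate cp values, as the small benefit described above suggests, the minimum-error tree can still be large; picking the simplest tree within one standard error of the minimum is a common alternative.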

Support Vector Machine (SVM)
Both models (decision tree and SVM) performed poorly at identifying eBook change type groups. Although highly correlated features were removed, the number of variables and the amount of data affected both models. The number of features created the opportunity for the decision tree model to become too cumbersome very quickly. The amount of data may have caused the SVM model to perform poorly.
The silver lining is that these two models, which were only about 40% accurate at identifying change groups, reached accuracy of around 70% when distinguishing a decrease or no change from an increase.
k Nearest Neighbors (k-NN)
The best k value is 20; overfitting occurred around k = 40. Although k-NN is known as a lazy algorithm that relies on memorization, it performed well during model creation and on the test set. The model was then applied to the unlabeled 2019 ACS data, which does not yet have matching PLS data (PLS data is released later than ACS data). This model can use community factors to predict volume change in eBooks for public library catalogs at nearly 80% accuracy.
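The mechanics of k-NN classification can be sketched in base R as below. This is an illustrative implementation, not the project's model: the function knn_predict, the feature names, and the values are hypothetical, and k = 3 is used only because the toy data set is tiny (the analysis above settled on k = 20).

```r
# Predict the class of each row of `test` by majority vote among its k nearest
# training rows (Euclidean distance). Features should be on comparable scales.
knn_predict <- function(train, labels, test, k) {
  train <- as.matrix(train)
  test  <- as.matrix(test)
  apply(test, 1, function(row) {
    distances <- sqrt(rowSums(sweep(train, 2, row)^2))
    nearest   <- order(distances)[seq_len(k)]
    votes     <- table(labels[nearest])
    names(votes)[which.max(votes)]
  })
}

# Hypothetical community features (percent changes) with known eBook outcomes.
train <- data.frame(
  pct_change_snap      = c(-5, -4, -6, 3, 4, 5),
  pct_change_uninsured = c(-3, -2, -4, 2, 3, 1)
)
labels <- c("increase", "increase", "increase",
            "no_increase", "no_increase", "no_increase")

# Hypothetical communities with ACS features but no library data yet.
new_communities <- data.frame(
  pct_change_snap      = c(-4.5, 3.5),
  pct_change_uninsured = c(-2.5, 2.5)
)

predictions <- knn_predict(train, labels, new_communities, k = 3)
print(predictions)
```

Because k-NN stores the training rows and defers all work to prediction time, labeling new ACS records only requires computing distances to the labeled rows, which is why it can be applied to a year that has no PLS data yet.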

The highest percentages for sensitivity, specificity, and precision were all measured in the model where k = 20. Sensitivity is much higher than specificity in all models. For this analysis, an argument can be made that correctly identifying true positives (communities where eBook increases will occur) is more important than avoiding misclassification of communities where they will not occur, which makes sensitivity and precision more important than specificity.
When evaluating the simpler classification between decrease/no change and increase, a balanced accuracy of 90% is achieved.
For the k-NN model, false positives are far more likely than false negatives (16.57% versus 3.37%). In this analysis, false negatives would do more harm to preparing for an increase in eBook volume than false positives, where growth is predicted but does not occur. The F1 score for this model is 93.79%, indicating that both precision and recall are high and that the classifier is capable of promising results.
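The evaluation measures discussed above relate to one another as shown in this base R sketch. The confusion-matrix counts are hypothetical and are not the project's results. The last step reads the reported 16.57% and 3.37% as false positive and false negative rates, which is an assumption about how those figures were calculated.

```r
# Hypothetical confusion-matrix counts for the binary task
# (positive class = "increase"); these are not the project's results.
tp <- 90; fn <- 10
fp <- 20; tn <- 80

sensitivity <- tp / (tp + fn)  # recall: share of real increases that were found
specificity <- tn / (tn + fp)  # share of non-increases correctly identified
precision   <- tp / (tp + fp)  # share of predicted increases that were real
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              precision = precision, f1 = f1,
              balanced_accuracy = balanced_accuracy), 4))

# Assumption: the reported 16.57% and 3.37% are the false positive rate and
# false negative rate, so specificity = 1 - FPR and sensitivity = 1 - FNR.
reported_balanced <- ((1 - 0.0337) + (1 - 0.1657)) / 2
print(round(reported_balanced, 4))
```

Under that reading, the two reported rates give a balanced accuracy of about 0.90, which is consistent with the 90% balanced accuracy reported for the simpler two-class task.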

Results
k-Nearest Neighbors was used to predict whether public libraries would see an increase in their volume of eBooks. An accuracy of 92% (90% balanced accuracy) was reached. Using the ACS data for 2,791 counties, volume increases were predicted for 6,557 library systems.

