IALH Research Fellow Dillon Chrimes has a new publication entitled "Big data analytics of predicting annual US Medicare billing claims with health services." The information in this paper was presented at the IEEE International Conference on Big Data in Osaka, Japan, in December 2022.
Abstract:
This paper investigated the use of large public use files (PUFs) of US Medicare claims in big data analytics to predict claim amounts in US dollars (USD) and to detect large spending anomalies across hundreds of health services documented in the dataset. Two main research questions were posed to better understand the content and use of the US Medicare PUFs. The first concerned understanding the dataset and the parameters that could predict the total submitted billing claims for one year (i.e., the 2017 fiscal year, in USD). The second was whether anomalies in health service costs could be detected. The null hypothesis was that there are no significant variables in the general linear model (GLM) of the regression analysis. The hypothesis was that factors including type and frequency of health services, total HCPCS (Healthcare Common Procedure Coding System), population (total beneficiaries), age, provider specialty, chronic disease, states, and regions could be significant in the classification and regression models. The 2017 Medicare Claims dataset, publicly provided by the Centers for Medicare & Medicaid Services (CMS), was 291 MB and consisted of >30 columns and ~1,048,576 rows. The methodology applied data mining techniques and general linear regression to fit models and compare the model residuals. From the residuals, multivariate outlier detection was carried out that included k-means clustering and principal component analysis. The results showed an R² of 52% between health services and submitted Medicare amounts (USD), with thousands of outliers. The total services variable was highly significant with respect to the total amount of submitted claims (maximum of 1,025,413,240). HCPCS was not significant. There was also a strong correlation between Medicare costs and larger populations in states with larger cities, especially California, Florida, New York, and Texas.
However, regions, states, cities, zip codes, and other divisions of the US were not significantly different. Adding other variables (such as procedures and chronic diseases) improved the R² only slightly. The results showed a significant correlation between services and the total submitted amount. Furthermore, there were many extreme outliers in terms of costs, and it was consistently diagnostic radiology services in larger-population states that had the highest total amount (USD) of Medicare claims. New York City's 2017 diagnostic radiology total was ten times larger than that of another city in the US. These large amounts (USD) in the outliers and their characteristics indicated that outliers were detectable. Medicare data comprise longitudinal information on a substantial proportion of the US population aged ≥65 years; however, the poor model performance showed that there are many gaps in the linkage to patients' chronic conditions, wherein diagnoses could be misclassified, which would require more detailed investigation spanning other datasets. Similarly, Medicare data, like other forms of health care claims, are collected not for research purposes but to support reimbursement for health care services. Unlike data captured in electronic health records (EHRs) or as part of a prospective study, information collected as claims for reimbursement for services provided may be influenced by financial incentives and may be more susceptible to misclassification. In conclusion, Medicare claims can be predicted from a variety of parameters and outliers can be detected. However, poor model performance in the big data analytics of billing claims requires further investigation of the data mining techniques.
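For readers who want a feel for the general approach described in the abstract (fit a regression, then look for outliers in the residuals), the following is a minimal Python sketch. It is not the paper's method: it swaps in a one-variable least-squares fit and a simple residual threshold in place of the GLM, k-means clustering, and principal component analysis, and it runs on synthetic data rather than the CMS dataset. All names (`services`, `amounts`, `fit_line`, `flag_outliers`) and all numbers are hypothetical.

```python
import random
import statistics

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    mean_y = statistics.fmean(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def flag_outliers(residuals, threshold=3.0):
    """Indices whose residual lies more than `threshold` standard
    deviations from the mean residual (an assumed, simple rule)."""
    mu = statistics.fmean(residuals)
    sd = statistics.pstdev(residuals)
    return [i for i, r in enumerate(residuals) if abs(r - mu) > threshold * sd]

# Synthetic stand-in data: total services per row and submitted amount (USD).
rng = random.Random(0)
services = [rng.uniform(10, 1000) for _ in range(500)]
amounts = [50.0 * s + rng.gauss(0, 5000) for s in services]

# Inject a few artificially large claims so there is something to detect.
injected = [7, 123, 400]
for i in injected:
    amounts[i] += 100_000.0

intercept, slope = fit_line(services, amounts)
fitted = [intercept + slope * s for s in services]
residuals = [a - f for a, f in zip(amounts, fitted)]

r2 = r_squared(amounts, fitted)
outliers = flag_outliers(residuals)
print(f"R^2 = {r2:.2f}, flagged rows: {outliers}")
```

A modest R² together with a handful of very large residuals is the pattern the abstract reports: the model explains only part of the variation, yet the extreme claims still stand out clearly from the fit.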
To read the full conference proceedings, see DOI: 10.1109/BigData55660.2022.10020524