Using Natural Language Processing (NLP) to Identify Fraudulent Healthcare Claims
DOI: https://doi.org/10.47941/ijce.2738
Keywords: Artificial Intelligence, Healthcare Fraud Detection, Unstructured Data Analysis, Natural Language Processing
Abstract
Purpose: This white paper makes the case for strengthening healthcare fraud detection by applying Natural Language Processing (NLP) to unstructured text such as physician notes, patient records, and claim descriptions. The objective is to overcome the limitations of traditional rule-based platforms in handling the complexity and scale of healthcare’s unstructured data.
Methodology: The proposed approach combines well-established pre-trained NLP models (BioBERT and ClinicalBERT) with known methods such as named entity recognition, anomaly detection, and predictive modeling. The implementation strategy follows a phased approach for deploying NLP models in clinical IT environments, from data ingestion and transformation through model deployment and live fraud surveillance.
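As a rough illustration of the anomaly detection step over unstructured text, the sketch below scores a claim note by how rare its vocabulary is relative to a reference set of notes. It is a standard-library stand-in, not the paper's method: the paper proposes pre-trained models such as BioBERT and ClinicalBERT, whereas this example uses a simple inverse-document-frequency average. The function names (`tokenize`, `fit_idf`, `anomaly_score`) and the sample notes are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the note and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(notes):
    """Learn a smoothed inverse document frequency for each term.

    Returns the per-term weights and the weight assigned to unseen terms.
    """
    n = len(notes)
    doc_freq = Counter()
    for note in notes:
        doc_freq.update(set(tokenize(note)))
    idf = {
        term: math.log((1 + n) / (1 + count)) + 1.0
        for term, count in doc_freq.items()
    }
    unseen_idf = math.log(1 + n) + 1.0  # weight for terms never seen in training
    return idf, unseen_idf


def anomaly_score(note, idf, unseen_idf):
    """Mean term rarity of a note; higher means more unusual wording."""
    tokens = tokenize(note)
    if not tokens:
        return 0.0
    return sum(idf.get(t, unseen_idf) for t in tokens) / len(tokens)


# Hypothetical reference notes (assumption: illustrative only, not real data).
reference_notes = [
    "office visit for routine follow up",
    "office visit for flu symptoms",
    "routine follow up for hypertension",
]
idf, unseen_idf = fit_idf(reference_notes)

typical = anomaly_score("office visit for routine follow up", idf, unseen_idf)
unusual = anomaly_score("complex spinal fusion with bilateral implants", idf, unseen_idf)
```

In this toy setup the note with unfamiliar wording receives the higher score. A production system would replace the rarity measure with model-derived representations and calibrate the threshold on labeled claims.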
Findings: According to the studies’ results, NLP systems increase fraud detection accuracy by 30 percent, reduce false positives by 20 percent, and allow claims to be processed in under a second. The white paper proposes a hybrid solution that combines NLP-driven text analysis with existing rule-based systems, which delivers a stronger and more flexible means of fraud detection. The predictive nature of NLP enables healthcare organizations to identify potential provider fraud risks before they escalate.
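One way to picture the hybrid idea is a decision function that consults both a rule-based check and a text-derived risk score. The sketch below is an assumption-laden illustration, not the paper's implementation: the rule thresholds (`max_amount`, `max_procedures`), the `text_risk` input (a hypothetical NLP score in the range 0 to 1), and the three outcome labels are all invented for the example.

```python
def rule_flags(billed_amount, procedure_count,
               max_amount=10_000.0, max_procedures=8):
    """Return the names of the hypothetical static rules this claim violates."""
    flags = []
    if billed_amount > max_amount:
        flags.append("amount_over_limit")
    if procedure_count > max_procedures:
        flags.append("too_many_procedures")
    return flags


def hybrid_decision(billed_amount, procedure_count, text_risk,
                    text_threshold=0.7):
    """Combine rule hits with a text-derived risk score.

    text_risk is assumed to come from an upstream NLP model, scaled to [0, 1].
    Both signals agreeing escalates the claim; a single signal queues it
    for manual review; neither lets it pass.
    """
    flags = rule_flags(billed_amount, procedure_count)
    text_hit = text_risk >= text_threshold
    if flags and text_hit:
        return "investigate"
    if flags or text_hit:
        return "review"
    return "pass"
```

For example, `hybrid_decision(500.0, 2, 0.9)` returns `"review"`: no static rule fires, but the text signal alone is enough to queue the claim. That is the kind of case a purely rule-based platform would miss.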
Unique Contribution to Theory, Practice and Policy: The paper calls on IT personnel to lead the adoption of NLP systems, refresh models to meet new fraud threats, and explore combining NLP with federated learning and blockchain to enhance protections and compliance. By implementing these recommendations, healthcare organizations will be able to deal with fraudulent activity more effectively and run their workflows more efficiently.
Copyright (c) 2025 Mani Joga Rao Cheekaramelli

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution (CC-BY) 4.0 License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this journal.