Detecting Phishing Emails Targeting Healthcare Practitioners: A Domain-Specific Ensemble Approach Using Diverse Datasets

The healthcare sector is increasingly targeted by sophisticated phishing emails generated with Large Language Models (LLMs) and advanced online tools. These emails often bypass traditional security measures, posing serious threats to patient safety and privacy. This study integrates the transformer-based BERT model, an advanced Natural Language Processing (NLP) technique, to extract domain-specific phishing embeddings from health-related email content, improving overall detection accuracy. The proposed detection framework is a stacking ensemble that uses Random Forest (RF) and a Convolutional Neural Network (CNN) as base models, with a Neural Network (NN) meta-learner for final classification. The models achieved accuracies of 96.4% (RF), 97.3% (CNN), and 97% (NN), with ROC-AUC scores of 99% for both CNN and RF and 97% for NN. The evaluation used a dataset of 18,354 instances that combines synthetic emails generated by ChatGPT-4o and Llama-3.1 with real-world samples from the Kaggle repository; on this dataset, the model demonstrated robust performance in identifying healthcare-targeted phishing attempts. While the results are promising, they are constrained by the dataset's characteristics. Future research will explore end-to-end transformer-based models to further enhance the detection of LLM-generated phishing threats in the healthcare sector.
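
To make the stacking idea concrete, the following is a minimal sketch of the meta-learning step only, not the authors' implementation. It assumes that each base model emits a phishing probability per email and that the meta-learner is trained on those probabilities. The base-model outputs are simulated with noisy scores (the hypothetical helper simulated_base_probability stands in for the trained RF and CNN), and a logistic regression fitted by gradient descent is swapped in for the NN meta-learner to keep the example dependency-free. All function names, noise levels, and sample sizes are illustrative assumptions.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulated_base_probability(label, noise):
    """Hypothetical stand-in for a trained base model's phishing probability."""
    centre = 0.7 if label == 1 else 0.3
    return min(1.0, max(0.0, random.gauss(centre, noise)))


def make_meta_features(labels):
    # One row per email: [P_rf(phishing), P_cnn(phishing)], both simulated here.
    return [[simulated_base_probability(y, 0.15),
             simulated_base_probability(y, 0.15)] for y in labels]


def train_meta_learner(X, y, epochs=1000, lr=1.0):
    """Fit a logistic-regression meta-learner with batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi):
    """Return 1 (phishing) when the meta-learner's probability is at least 0.5."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0


# Synthetic labels: 1 = phishing, 0 = legitimate.
train_labels = [random.randint(0, 1) for _ in range(300)]
test_labels = [random.randint(0, 1) for _ in range(200)]
train_X = make_meta_features(train_labels)
test_X = make_meta_features(test_labels)

meta_weights, meta_bias = train_meta_learner(train_X, train_labels)
correct = sum(predict(meta_weights, meta_bias, xi) == yi
              for xi, yi in zip(test_X, test_labels))
accuracy = correct / len(test_labels)
print(f"held-out accuracy of the stacked meta-learner: {accuracy:.3f}")
```

In a real pipeline, the training rows for the meta-learner would normally come from out-of-fold predictions of the base models, so that the meta-learner does not see probabilities produced on data the base models were fitted on.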

Gaston Elongha
Marymount University
United States
gle85144@marymount.edu

Michelle Liu
Marymount University
United States
xliu@marymount.edu