Abstract
Phishing is one of the most prevalent and expensive types of cybercrime faced by organizations and individuals
worldwide. Most prior research has focused on various technical features and traditional representations of
text to characterize phishing emails. A significant knowledge gap remains about the qualitative traits embedded
in these emails, which could be useful in a range of phishing mitigation tasks. In this paper, we dissect the structure
of phishing emails to gain a better understanding of the factors that influence human decision-making when
assessing suspicious emails and to identify a novel set of descriptive features. To do so, we employ an iterative
qualitative coding approach to identify features that describe the emails. We develop the “Phishing
Codebook”, a structured framework for systematically extracting key information from phishing emails, and we
apply this codebook to a publicly available dataset of 503 phishing emails collected between 2015 and 2021.
We present key observations and challenges related to phishing attacks delivered indirectly through legitimate
services, recurring and long-lasting scams, and the variations within campaigns that attackers use
to bypass rule-based filters. Furthermore, we provide two use cases to show how the Phishing Codebook is
useful in identifying similar phishing emails and in creating well-tailored responses for end users. We share the
Phishing Codebook and the annotated benchmark dataset to help researchers better understand
phishing emails.