SUSTech KC (南方科技大学知识苑): Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets

Title	Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets
Authors	Xiao, Xi 1,2; Xiao, Wentao 1,2; Zhang, Dianyan 1,2; Zhang, Bin 2; Hu, Guangwu 3; Li, Qing 2,4; Xia, Shutao 1,2
Corresponding author	Hu, Guangwu
Publication date	2021-09-01
DOI	10.1016/j.cose.2021.102372
Journal	COMPUTERS & SECURITY
ISSN	0167-4048
EISSN	1872-6208
Volume	108
Abstract	Phishing websites are a form of social engineering attack in which perpetrators imitate legitimate websites to lure visitors and illegally acquire their identities, passwords, private information and even property. This attack poses a great threat to people and is becoming more and more severe. Many approaches have been proposed to identify phishing websites. For example, the classical CNN-LSTM approach achieved very high precision by combining a Convolutional Neural Network (CNN) with Long Short-Term Memory (LSTM). However, although CNN has achieved great success in AI, LSTM still suffers from a bias problem, since it always treats later features as much more important than earlier ones. Meanwhile, because the self-attention mechanism can discover the inner dependency relationships of a text, it has been widely applied to various deep learning-based Natural Language Processing (NLP) tasks. If we treat a URL as a text string, this mechanism can learn comprehensive URL representations. To further improve the accuracy of phishing website detection, in this paper we propose a novel CNN with self-attention, named self-attention CNN, for identifying phishing Uniform Resource Locators (URLs). Specifically, self-attention CNN first leverages a Generative Adversarial Network (GAN) to generate phishing URLs so as to balance the datasets of legitimate and phishing URLs. It then uses CNN and multi-head self-attention to construct a new classifier comprising four blocks: the input block, the attention block, the feature block and the output block. Finally, the trained classifier gives a high-accuracy result for an unknown website URL. Thorough experiments indicate that self-attention CNN achieves 95.6% accuracy, outperforming CNN-LSTM, single CNN and single LSTM by 1.4%, 4.6% and 2.1%, respectively.
Keywords	Convolutional neural network; Generative adversarial network; Imbalanced dataset; Multi-head self-attention; Phishing websites detection
Related links	[Scopus record]
Indexed by	SCI; EI
Language	English
Institutional attribution	Other
Funding	National Key Research and Development Program of China["2018YFB1800204","2018YFB1800601"]; National Natural Science Foundation of China[61972219,61771273]; Natural Science Foundation of Guangdong Province[2021A1515012640]; R&D Program of Shenzhen["JCYJ20190813174403598","SGDX20190918101201696","JCYJ20190813165003837"]
WOS research area	Computer Science
WOS category	Computer Science, Information Systems
WOS accession number	WOS:000677639500010
Publisher	ELSEVIER ADVANCED TECHNOLOGY
EI accession number	20212710578501
EI subject terms	Computer crime; Convolution; Long short-term memory; Natural language processing systems
EI classification codes	Information Theory and Signal Processing:716.1; Data Processing and Image Processing:723.2
ESI subject category	COMPUTER SCIENCE
Scopus record ID	2-s2.0-85108874331
Source database	Scopus
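Citation statistics	Times cited [WOS]: 29
Document type	Journal article
Item identifier	http://kc.sustech.edu.cn/handle/2SGJ60CL/230141
Collection	College of Engineering, Southern University of Science and Technology_Department of Computer Science and Engineering
Author affiliations	1. Shenzhen International Graduate School, Tsinghua University, China 2. Peng Cheng Laboratory, Shenzhen, China 3. School of Computer Science, Shenzhen Institute of Information Technology, Shenzhen, China 4. Southern University of Science and Technology, Shenzhen, China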
Recommended citation, GB/T 7714	Xiao, Xi, Xiao, Wentao, Zhang, Dianyan, et al. Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets[J]. COMPUTERS & SECURITY, 2021, 108.
APA	Xiao, Xi., Xiao, Wentao., Zhang, Dianyan., Zhang, Bin., Hu, Guangwu., ... & Xia, Shutao. (2021). Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets. COMPUTERS & SECURITY, 108.
MLA	Xiao, Xi, et al. "Phishing websites detection via CNN and multi-head self-attention on imbalanced datasets". COMPUTERS & SECURITY 108 (2021).
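
The abstract describes treating a URL as a text string and applying multi-head self-attention to it. The sketch below illustrates only that general mechanism (scaled dot-product attention split across heads) in plain Python. It is not the authors' classifier: the character-level embedding, the dimensions, the helper names (`embed_url`, `multi_head_self_attention`) and the random, untrained weights are all assumptions made for illustration, and the CNN feature block, the GAN-based balancing step and training are omitted.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def embed_url(url, dim, rng):
    """Hypothetical character-level embedding: one random vector per distinct character."""
    table = {}
    vectors = []
    for ch in url:
        if ch not in table:
            table[ch] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        vectors.append(table[ch])
    return vectors


def multi_head_self_attention(x, num_heads, rng):
    """Return (output, per-head attention weights) for a sequence x of vectors."""
    dim = len(x[0])
    assert dim % num_heads == 0, "embedding size must be divisible by the head count"
    head_dim = dim // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query, key and value projections.
        q = matmul(x, random_matrix(dim, head_dim, rng))
        k = matmul(x, random_matrix(dim, head_dim, rng))
        v = matmul(x, random_matrix(dim, head_dim, rng))
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        # Scale by sqrt(head_dim), then normalise each row into attention weights.
        weights = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads position by position, then apply an output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(concat, random_matrix(dim, dim, rng)), head_weights


rng = random.Random(0)
url = "http://example.com/login"  # illustrative URL, not taken from the paper's data
x = embed_url(url, dim=8, rng=rng)
out, weights = multi_head_self_attention(x, num_heads=2, rng=rng)
print(len(out), len(out[0]))  # sequence length and embedding size
```

Each character position ends up with a vector that mixes information from every other position in the URL, which is the property the abstract contrasts with the LSTM's tendency to weight later features more heavily.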