From 83e5c861a5c0f1c850b3453674b7438fa1a9877b Mon Sep 17 00:00:00 2001 From: Morning Glow <10953218+MorningGlow@user.noreply.gitee.com> Date: Sun, 16 Jul 2023 01:01:55 +0000 Subject: [PATCH] Spam SMS detection based on text content MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Signed-off-by: Morning Glow <10953218+MorningGlow@user.noreply.gitee.com> --- .../第3组-夏添/homework/垃圾短信.ipynb | 447 ++++++++++++++++++ 1 file changed, 447 insertions(+) create mode 100644 1、人才招聘数据分析/第3组-夏添/homework/垃圾短信.ipynb diff --git a/1、人才招聘数据分析/第3组-夏添/homework/垃圾短信.ipynb b/1、人才招聘数据分析/第3组-夏添/homework/垃圾短信.ipynb new file mode 100644 index 0000000..a509a47 --- /dev/null +++ b/1、人才招聘数据分析/第3组-夏添/homework/垃圾短信.ipynb @@ -0,0 +1,447 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Spam SMS Detection\n", + "Task: read the dataset file that was just distributed, draw 10,000 messages each from the positive and the negative samples, and preprocess the data by removing digits, punctuation, and stop words. Transform the text with TF-IDF, classify it with naive Bayes, and evaluate the model.\n", + "\n", + "## Data processing\n", + "We start with data processing and first import the required libraries." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 1, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import re\n", + "import jieba\n", + "from sklearn.naive_bayes import GaussianNB\n", + "from sklearn.model_selection import train_test_split\n", + "from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 2, + "outputs": [ + { + "data": { + "text/plain": " label message\n0 \n1 0 商业秘密的秘密性那是维系其商业价值和垄断地位的前提条件之一\n2 1 南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...\n3 0 带给我们大常州一场壮观的视觉盛宴\n4 0 有原因不明的泌尿系统结石等\n5 0 23年从盐城拉回来的麻麻的嫁妆\n... ... 
...\n799996 0 助排毒缓解痛经预防子宫肌瘤&\n799997 0 这是今年首次启动I级防台应急响应\n799998 0 丽江下飞机时迎接我们的是凉风\n799999 0 费了半天劲各种找关系终于联系上心仪公司的内部人\n800000 0 是汉奸还是被强奸自己对号入座吧\n\n[800000 rows x 2 columns]", + "text/html": "
<div>\n<table border=\"1\" class=\"dataframe\">\n  <thead>\n    <tr style=\"text-align: right;\">\n      <th></th>\n      <th>label</th>\n      <th>message</th>\n    </tr>\n    <tr>\n      <th>0</th>\n      <th></th>\n      <th></th>\n    </tr>\n  </thead>\n  <tbody>\n    <tr>\n      <th>1</th>\n      <td>0</td>\n      <td>商业秘密的秘密性那是维系其商业价值和垄断地位的前提条件之一</td>\n    </tr>\n    <tr>\n      <th>2</th>\n      <td>1</td>\n      <td>南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...</td>\n    </tr>\n    <tr>\n      <th>3</th>\n      <td>0</td>\n      <td>带给我们大常州一场壮观的视觉盛宴</td>\n    </tr>\n    <tr>\n      <th>4</th>\n      <td>0</td>\n      <td>有原因不明的泌尿系统结石等</td>\n    </tr>\n    <tr>\n      <th>5</th>\n      <td>0</td>\n      <td>23年从盐城拉回来的麻麻的嫁妆</td>\n    </tr>\n    <tr>\n      <th>...</th>\n      <td>...</td>\n      <td>...</td>\n    </tr>\n    <tr>\n      <th>799996</th>\n      <td>0</td>\n      <td>助排毒缓解痛经预防子宫肌瘤&amp</td>\n    </tr>\n    <tr>\n      <th>799997</th>\n      <td>0</td>\n      <td>这是今年首次启动I级防台应急响应</td>\n    </tr>\n    <tr>\n      <th>799998</th>\n      <td>0</td>\n      <td>丽江下飞机时迎接我们的是凉风</td>\n    </tr>\n    <tr>\n      <th>799999</th>\n      <td>0</td>\n      <td>费了半天劲各种找关系终于联系上心仪公司的内部人</td>\n    </tr>\n    <tr>\n      <th>800000</th>\n      <td>0</td>\n      <td>是汉奸还是被强奸自己对号入座吧</td>\n    </tr>\n  </tbody>\n</table>\n<p>800000 rows × 2 columns</p>\n</div>" + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data = pd.read_csv('message80W1.csv', header=None, index_col=0)\n", + "data.columns = ['label', 'message']\n", + "data" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "\n", + "The raw dataset has 800,000 rows and 2 columns. The second column holds the message text and the first column labels it: 1 marks a spam message and 0 a normal one. We rename the columns to `label` and `message`." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "As the task requires, we only need to draw 10,000 messages from each class (label 0 and label 1). Below we run the sampling, concatenate the two samples into a new dataset, and then drop duplicate messages." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 3, + "outputs": [], + "source": [ + "n = 10000\n", + "a = data[data['label'] == 0].sample(n)  # sample normal messages\n", + "b = data[data['label'] == 1].sample(n)  # sample spam messages\n", + "data_sample = pd.concat([a, b], axis=0)  # stack the two samples row-wise (axis=0)\n", + "data_unique = data_sample['message'].drop_duplicates()" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Some messages contain runs of `x` characters (such as `xx`); we remove them." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 4, + "outputs": [], + "source": [ + "data_quxxx = data_unique.apply(lambda x: re.sub('x', '', x))" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Load a custom dictionary and segment each message into words with jieba." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 5, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Building prefix dict from the default dictionary ...\n", + "Loading model from cache 
C:\\Users\\between\\AppData\\Local\\Temp\\jieba.cache\n", + "Loading model cost 0.578 seconds.\n", + "Prefix dict has been built successfully.\n" + ] + } + ], + "source": [ + "jieba.load_userdict('newdic1.txt')  # load the custom dictionary before segmenting\n", + "data_cut = data_quxxx.apply(lambda x: jieba.lcut(x))" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Load the stop-word list and remove stop words." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 6, + "outputs": [], + "source": [ + "stopWords = pd.read_csv('stopword.txt', encoding='GB18030', sep='hahaha', header=None, engine='python')  # sentinel separator: each line is read as one field\n", + "stopWords = set(['≮', '≯', '≠', ' ', '会', '月', '日', '–'] + list(stopWords.iloc[:, 0]))  # a set makes membership tests fast\n", + "data_stop = data_cut.apply(lambda x: [i for i in x if i not in stopWords])" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Use `join` to turn each token list back into a space-separated string." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 7, + "outputs": [], + "source": [ + "labels = data_sample.loc[data_stop.index, 'label']\n", + "adata = data_stop.apply(lambda x: ' '.join(x))" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "## Splitting the dataset" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 8, + "outputs": [], + "source": [ + "data_tr, data_te, labels_tr, labels_te = train_test_split(adata, labels, test_size=0.2)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Convert the text to word-count vectors." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", 
+ "execution_count": 9, + "outputs": [], + "source": [ + "countVectorizer, tfidfTransformer = CountVectorizer(), TfidfTransformer()" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Training set" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 10, + "outputs": [], + "source": [ + "data_tr = countVectorizer.fit_transform(data_tr)\n", + "X_tr = tfidfTransformer.fit_transform(data_tr).toarray()  # GaussianNB needs a dense array" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Test set" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 11, + "outputs": [], + "source": [ + "data_te = countVectorizer.transform(data_te)  # reuse the vocabulary learned from the training set\n", + "X_te = tfidfTransformer.transform(data_te).toarray()  # reuse the IDF weights fitted on the training set" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "## Model training\n", + "Train a naive Bayes model (`GaussianNB`)." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 12, + "outputs": [ + { + "data": { + "text/plain": "GaussianNB()", + "text/html": "
GaussianNB()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model = GaussianNB()\n", + "model.fit(X_tr, labels_tr)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [ + "Print the model's accuracy on the test set." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "code", + "execution_count": 13, + "outputs": [ + { + "data": { + "text/plain": "0.905811623246493" + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.score(X_te, labels_te)" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + }, + { + "cell_type": "markdown", + "source": [], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + } + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file