Group 2

This commit is contained in:
[wrh]
2023-07-13 22:04:26 +08:00
parent d53873a33b
commit 733c55b78a
15 changed files with 22234 additions and 21 deletions
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
@@ -0,0 +1,814 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Day 8"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "## Chinese word segmentation with an HMM\n",
    "import numpy as np\n",
    "\n",
    "# B: begin, M: middle, E: end, S: single\n",
    "status = ['B', 'M', 'E', 'S']\n",
    "\n",
    "start_probability = {'B': 0.2, 'M': 0.3, 'E': 0.2, 'S': 0.3}\n",
    "\n",
    "# Transition probability: probability of moving from one state to another (toy values; each row sums to 1)\n",
    "transition_probability={\n",
    "    \"B\":{\"E\":0.5,\"M\":0.5},\n",
    "    \"M\":{\"E\":0.5,\"M\":0.5},\n",
    "    \"E\":{\"B\":0.5,\"S\":0.5},\n",
    "    \"S\":{\"B\":0.5,\"S\":0.5}\n",
    "}\n",
    "\n",
    "# Emission probability: probability that a state generates a given observation (toy values; each row sums to 1)\n",
    "emission_probability={\n",
    "    \"B\":{\"我\":0.1,\"你\":0.2,\"他\":0.3,\"她\":0.4},\n",
    "    \"M\":{\"是\":0.2,\"有\":0.4,\"的\":0.4},\n",
    "    \"E\":{\"人\":0.5,\"吗\":0.5},\n",
    "    \"S\":{\"了\":0.5,\"啊\":0.2,\"啦\":0.3}\n",
    "}\n",
    "\n",
    "## Viterbi algorithm\n",
    "def viterbi(obs,status,start_p,trans_p,emit_p):\n",
    "    \"\"\"\n",
    "    Viterbi algorithm: solves the decoding problem of a hidden Markov model,\n",
    "    i.e. finds the most likely hidden-state sequence for an observation sequence.\n",
    "    :param obs: observation sequence (the characters of the sentence)\n",
    "    :param status: list of hidden states\n",
    "    :param start_p: initial probability of each hidden state at the start of a sentence\n",
    "    :param trans_p: probability of moving from one hidden state to another, i.e. from the tag of one character to the tag of the next\n",
    "    :param emit_p: probability that a hidden state emits an observation, i.e. that a tag generates a given character\n",
    "    :return: tuple (probability of the best path, hidden-state sequence)\n",
    "    \"\"\"\n",
    "    # Initialization\n",
    "    V=[{}]\n",
    "    path={}\n",
    "\n",
    "    for y in status:\n",
    "        V[0][y]=start_p[y]*emit_p[y].get(obs[0],0)\n",
    "        path[y]=[y]\n",
    "\n",
    "    # Recursion\n",
    "    for t in range(1, len(obs)):\n",
    "        V.append({})\n",
    "        newpath = {}\n",
    "\n",
    "        for y in status:\n",
    "            em_p = emit_p[y].get(obs[t], 0)\n",
    "            (prob, state) = max([(V[t-1][y0] * trans_p[y0].get(y, 0) * em_p, y0) for y0 in status])\n",
    "            V[t][y] = prob\n",
    "            newpath[y] = path[state] + [y]\n",
    "\n",
    "        path = newpath\n",
    "\n",
    "    # Termination: pick the best final state\n",
    "    (prob, state) = max([(V[len(obs) - 1][y], y) for y in status])\n",
    "    return (prob, path[state])"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cut(sentence,states,start_p,trans_p,emit_p):\n",
    "    # Decode the most likely B/M/E/S tag for every character\n",
    "    prob,pos_list=viterbi(sentence,states,start_p,trans_p,emit_p)\n",
    "    # begin: start of the current word; next_i: index just after the last emitted word\n",
    "    begin,next_i=0,0\n",
    "    for i,char in enumerate(sentence):\n",
    "        pos=pos_list[i]\n",
    "        if pos==\"B\":\n",
    "            begin=i\n",
    "        elif pos==\"E\":\n",
    "            yield sentence[begin:i+1]\n",
    "            next_i=i+1\n",
    "        elif pos==\"S\":\n",
    "            yield char\n",
    "            next_i=i+1\n",
    "    # Emit any trailing characters that were never closed by an E or S tag\n",
    "    if next_i<len(sentence):\n",
    "        yield sentence[next_i:]\n",
    "\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sentence=\"我是人\"\n",
"print(list(cut(sentence,status,start_probability,transition_probability,emission_probability)))"
]
},
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "A={1:2,2:6,4:8}\n",
    "A.get(1,\"missing\"),A.get(3,\"missing\")\n",
    "# dict.get(key, default)\n",
    "# returns the value stored under key if key exists, otherwise returns default"
]
},
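  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This code implements the Viterbi algorithm for decoding a hidden Markov model (HMM). A line-by-line explanation follows:\n",
    "\n",
    "1. `V=[{}]`: `V` is a list of dictionaries that stores intermediate results. `V[t][y]` is the highest probability of any state path that ends in state y at time step t.\n",
    "\n",
    "2. `path={}`: `path` is a dictionary of paths. `path[y]` is the most probable path that ends in state y at the current time step.\n",
    "\n",
    "3. `for y in status:`: iterates over all possible states y.\n",
    "\n",
    "4. `V[0][y]=start_p[y]*emit_p[y].get(obs[0],0)`: multiplies the initial state probability by the emission probability to get the best probability at the first time step. `start_p[y]` is the probability of starting in state y, and `emit_p[y].get(obs[0],0)` is the probability of observing obs[0] in state y (0 if the state never emits it).\n",
    "\n",
    "5. `path[y]=[y]`: starts the path for state y with y itself.\n",
    "\n",
    "6. `for t in range(1, len(obs)):`: iterates over the remaining time steps of the observation sequence, starting from step 1.\n",
    "\n",
    "7. `V.append({})`: appends an empty dictionary to V for the new time step.\n",
    "\n",
    "8. `newpath = {}`: creates an empty dictionary for the updated paths.\n",
    "\n",
    "9. `for y in status:`: iterates over all possible states y.\n",
    "\n",
    "10. `em_p = emit_p[y].get(obs[t], 0)`: looks up the probability of observing obs[t] in state y.\n",
    "\n",
    "11. `(prob, state) = max([(V[t-1][y0] * trans_p[y0].get(y, 0) * em_p, y0) for y0 in status])`: for every previous state y0, multiplies the best probability of being in y0 at step t-1 by the transition probability from y0 to y and by the emission probability of obs[t] in state y. It then keeps the largest product and the previous state y0 that produced it.\n",
    "\n",
    "12. `V[t][y] = prob`: stores that maximum probability in V.\n",
    "\n",
    "13. `newpath[y] = path[state] + [y]`: extends the best path into the chosen previous state with the current state y.\n",
    "\n",
    "14. `path = newpath`: replaces the old paths with the updated ones.\n",
    "\n",
    "15. `(prob, state) = max([(V[len(obs) - 1][y], y) for y in status])`: at the last time step, selects the state with the highest probability, together with that probability.\n",
    "\n",
    "16. `return (prob, path[state])`: returns the maximum probability and the corresponding path.\n",
    "\n",
    "In summary, the function uses dynamic programming to find the most likely state sequence for a given observation sequence, and returns its probability together with the path."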
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The code above can be written mathematically as follows.\n",
    "\n",
    "First, define the symbols and parameters:\n",
    "- `obs`: the observation sequence $O = (O_1, O_2, \\ldots, O_T)$, where $O_t$ is the observation at time step $t$.\n",
    "- `status`: the set of possible states $S = \\{s_1, s_2, \\ldots, s_N\\}$, where $N$ is the number of states.\n",
    "- `start_p`: the initial state probabilities $\\pi_i = P(s_i)$, where $1 \\leq i \\leq N$.\n",
    "- `trans_p`: the state transition matrix $A = (a_{ij})$, where $a_{ij}$ is the probability of moving from state $s_i$ to state $s_j$.\n",
    "- `emit_p`: the emission probabilities $b_j(o)$, the probability of observing $o$ in state $s_j$.\n",
    "\n",
    "The formulas use 1-based time steps, so $V_t$ corresponds to `V[t-1]` in the code. The Viterbi computation is then:\n",
    "\n",
    "1. Initialization:\n",
    "   $$\\begin{align*}\n",
    "   V_1(s_i) &= \\pi_i \\cdot b_i(O_1), \\quad 1 \\leq i \\leq N \\\\\n",
    "   \\text{path}_1(s_i) &= [s_i], \\quad 1 \\leq i \\leq N\n",
    "   \\end{align*}$$\n",
    "\n",
    "2. Recursion, for $2 \\leq t \\leq T$ and $1 \\leq j \\leq N$:\n",
    "   $$\\begin{align*}\n",
    "   V_t(s_j) &= \\max_{1 \\leq i \\leq N} \\left( V_{t-1}(s_i) \\cdot a_{ij} \\cdot b_j(O_t) \\right) \\\\\n",
    "   i^* &= \\arg\\max_{1 \\leq i \\leq N} \\left( V_{t-1}(s_i) \\cdot a_{ij} \\cdot b_j(O_t) \\right) \\\\\n",
    "   \\text{path}_t(s_j) &= \\text{path}_{t-1}(s_{i^*}) + [s_j]\n",
    "   \\end{align*}$$\n",
    "\n",
    "3. Termination:\n",
    "   $$\\begin{align*}\n",
    "   \\text{prob} &= \\max_{1 \\leq i \\leq N} V_T(s_i) \\\\\n",
    "   s_{\\text{max}} &= \\arg\\max_{s_i \\in S} V_T(s_i)\n",
    "   \\end{align*}$$\n",
    "\n",
    "The final result is $(\\text{prob}, \\text{path}_T(s_{\\text{max}}))$, where $\\text{prob}$ is the maximum probability and $\\text{path}_T(s_{\\text{max}})$ is the corresponding state sequence."
]
},
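  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The recursion above multiplies one probability per character, so on long sentences the products can underflow to 0. A common remedy is to add log-probabilities instead. The cell below is a minimal sketch of that idea, not the notebook's own implementation: `viterbi_log`, the back-pointer list `back`, and the three-state toy parameters are all hypothetical names and values chosen for illustration.\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "def viterbi_log(obs, states, start_p, trans_p, emit_p):\n",
    "    def log(p):\n",
    "        # log(0) is mapped to -inf so impossible paths never win\n",
    "        return math.log(p) if p > 0 else float('-inf')\n",
    "\n",
    "    # V[t][s]: best log-probability of any path ending in state s at step t\n",
    "    V = [{s: log(start_p.get(s, 0)) + log(emit_p[s].get(obs[0], 0)) for s in states}]\n",
    "    back = []  # back[t-1][s]: best predecessor of state s at step t\n",
    "    for t in range(1, len(obs)):\n",
    "        V.append({})\n",
    "        back.append({})\n",
    "        for s in states:\n",
    "            best_prev = max(states, key=lambda p: V[t - 1][p] + log(trans_p[p].get(s, 0)))\n",
    "            V[t][s] = (V[t - 1][best_prev] + log(trans_p[best_prev].get(s, 0))\n",
    "                       + log(emit_p[s].get(obs[t], 0)))\n",
    "            back[t - 1][s] = best_prev\n",
    "\n",
    "    # Backtrack from the best final state instead of copying whole paths\n",
    "    last = max(states, key=lambda s: V[-1][s])\n",
    "    path = [last]\n",
    "    for pointers in reversed(back):\n",
    "        path.append(pointers[path[-1]])\n",
    "    path.reverse()\n",
    "    return V[-1][last], path\n",
    "\n",
    "# Hypothetical toy parameters, used only for this illustration\n",
    "states = ['B', 'E', 'S']\n",
    "start = {'B': 0.6, 'E': 0.0, 'S': 0.4}\n",
    "trans = {'B': {'E': 1.0}, 'E': {'B': 0.5, 'S': 0.5}, 'S': {'B': 0.5, 'S': 0.5}}\n",
    "emit = {'B': {'x': 0.7, 'y': 0.3}, 'E': {'x': 0.2, 'y': 0.8}, 'S': {'x': 0.5, 'y': 0.5}}\n",
    "log_prob, tags = viterbi_log('xy', states, start, trans, emit)\n",
    "print(tags, math.exp(log_prob))\n",
    "```\n",
    "\n",
    "Storing back-pointers rather than full paths keeps memory proportional to the sentence length times the number of states, which matters more as the state set grows."
   ]
  },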
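  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Baum-Welch algorithm, a special case of the EM algorithm, is an iterative method for estimating the parameters of a hidden Markov model (HMM) in the unsupervised setting, where the state sequence is not observed. It repeatedly re-estimates the model's initial, transition, and emission probabilities until they converge. The mathematical procedure is as follows:\n",
    "\n",
    "Assume an HMM with a state set $S$ of $N$ states, observation symbols $v_1, \\ldots, v_M$, an observation sequence $O = (O_1, \\ldots, O_T)$, and unknown transition matrix $A$ and emission matrix $B$.\n",
    "\n",
    "1. Initialize the parameters:\n",
    "   - Randomly initialize the transition matrix $A$ and the emission matrix $B$.\n",
    "   - Initialize the initial state probability vector $\\pi$.\n",
    "\n",
    "2. E-step (Expectation Step):\n",
    "   - Use the forward-backward algorithm to compute the forward probability $\\alpha$ and the backward probability $\\beta$ for every state at every time step.\n",
    "   - Compute the posterior probability $\\gamma(t, i)$ of being in state $s_i$ at time step $t$, given the observation sequence.\n",
    "   - Compute the posterior transition probability $\\xi(t, i, j)$ of being in state $s_i$ at time step $t$ and in state $s_j$ at time step $t+1$, given the observation sequence.\n",
    "\n",
    "3. M-step (Maximization Step):\n",
    "   - Use the posteriors $\\gamma$ and $\\xi$ to update the model parameters $A$, $B$, and $\\pi$.\n",
    "   - Update the transition matrix $A$:\n",
    "     $$a_{ij} = \\frac{\\sum_{t=1}^{T-1} \\xi(t, i, j)}{\\sum_{t=1}^{T-1} \\gamma(t, i)}, \\quad 1 \\leq i, j \\leq N$$\n",
    "   - Update the emission matrix $B$:\n",
    "     $$b_{jk} = \\frac{\\sum_{t=1, O_t = v_k}^{T} \\gamma(t, j)}{\\sum_{t=1}^{T} \\gamma(t, j)}, \\quad 1 \\leq j \\leq N, \\quad 1 \\leq k \\leq M$$\n",
    "   - Update the initial state probability vector $\\pi$:\n",
    "     $$\\pi_i = \\gamma(1, i), \\quad 1 \\leq i \\leq N$$\n",
    "\n",
    "4. Repeat steps 2 and 3 until the parameters converge.\n",
    "\n",
    "Each E-step/M-step iteration never decreases the likelihood of the observation sequence, so the parameters fit the observations progressively better, although the procedure may converge to a local rather than a global optimum. Baum-Welch is a typical expectation-maximization (EM) algorithm for parameter estimation in unsupervised learning."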
]
},
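  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One E-step and one M-step of the procedure above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not a production implementation: the function names `forward_backward` and `m_step`, the two-state model, and the observation sequence are hypothetical, and the probabilities are left unscaled, so it only suits short sequences.\n",
    "\n",
    "```python\n",
    "def forward_backward(obs, states, pi, A, B):\n",
    "    T = len(obs)\n",
    "    # alpha[t][i] = P(O_1..O_t, state at t is i)\n",
    "    alpha = [{i: pi[i] * B[i][obs[0]] for i in states}]\n",
    "    for t in range(1, T):\n",
    "        alpha.append({j: sum(alpha[t - 1][i] * A[i][j] for i in states) * B[j][obs[t]]\n",
    "                      for j in states})\n",
    "    # beta[t][i] = P(O_{t+1}..O_T | state at t is i)\n",
    "    beta = [None] * T\n",
    "    beta[T - 1] = {i: 1.0 for i in states}\n",
    "    for t in range(T - 2, -1, -1):\n",
    "        beta[t] = {i: sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in states)\n",
    "                   for i in states}\n",
    "    likelihood = sum(alpha[T - 1][i] for i in states)\n",
    "    # gamma[t][i]: posterior of state i at step t\n",
    "    gamma = [{i: alpha[t][i] * beta[t][i] / likelihood for i in states} for t in range(T)]\n",
    "    # xi[t][(i, j)]: posterior of state i at step t and state j at step t+1\n",
    "    xi = [{(i, j): alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / likelihood\n",
    "           for i in states for j in states} for t in range(T - 1)]\n",
    "    return gamma, xi\n",
    "\n",
    "def m_step(obs, states, symbols, gamma, xi):\n",
    "    T = len(obs)\n",
    "    new_pi = {i: gamma[0][i] for i in states}\n",
    "    new_A = {i: {j: sum(xi[t][(i, j)] for t in range(T - 1))\n",
    "                    / sum(gamma[t][i] for t in range(T - 1))\n",
    "                 for j in states} for i in states}\n",
    "    new_B = {j: {v: sum(gamma[t][j] for t in range(T) if obs[t] == v)\n",
    "                    / sum(gamma[t][j] for t in range(T))\n",
    "                 for v in symbols} for j in states}\n",
    "    return new_pi, new_A, new_B\n",
    "\n",
    "# Hypothetical two-state model and observation sequence\n",
    "states = ['s1', 's2']\n",
    "symbols = ['a', 'b']\n",
    "pi = {'s1': 0.6, 's2': 0.4}\n",
    "A = {'s1': {'s1': 0.7, 's2': 0.3}, 's2': {'s1': 0.4, 's2': 0.6}}\n",
    "B = {'s1': {'a': 0.9, 'b': 0.1}, 's2': {'a': 0.2, 'b': 0.8}}\n",
    "obs = ['a', 'b', 'b', 'a']\n",
    "\n",
    "gamma, xi = forward_backward(obs, states, pi, A, B)   # E-step\n",
    "pi, A, B = m_step(obs, states, symbols, gamma, xi)    # M-step\n",
    "print(A)\n",
    "```\n",
    "\n",
    "Wrapping the last two calls in a loop and stopping once the likelihood changes by less than a small tolerance gives the full iteration described in step 4."
   ]
  },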
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Excerpted from ChatGPT."
]
},
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting started with NLP"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer,TfidfVectorizer\n",
"from sklearn.naive_bayes import GaussianNB\n",
"corpus=[\n",
" \"My dog has flea problems,help please.\",\n",
" \"Maybe not take him to dog park is stupid.\",\n",
" \"My dalmation is so cute. I love him.\",\n",
" \"Stop posting stupid worthless garbage.\",\n",
" \"Mr licks ate mu steak,what can I do?\",\n",
" \"Quit buying worthless dog food stupid.\"\n",
"]\n",
"labels=[0,1,0,1,0,1]\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]]\n",
"[0]\n"
]
}
],
"source": [
"\n",
"cv=CountVectorizer()\n",
"x=cv.fit_transform(corpus).toarray()\n",
"y=np.array(labels)\n",
"\n",
"clf=GaussianNB()\n",
"clf.fit(x,y)\n",
"\n",
"sample=[\"Do not eat me!\"]\n",
"sample_x=cv.transform(sample).toarray()\n",
"print(sample_x)\n",
"prediction=clf.predict(sample_x)\n",
"print(prediction)\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>商业秘密的秘密性那是维系其商业价值和垄断地位的前提条件之一</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>带给我们大常州一场壮观的视觉盛宴</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>有原因不明的泌尿系统结石等</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>0</td>\n",
" <td>23年从盐城拉回来的麻麻的嫁妆</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799995</th>\n",
" <td>799996</td>\n",
" <td>0</td>\n",
" <td>助排毒缓解痛经预防子宫肌瘤&amp;amp</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799996</th>\n",
" <td>799997</td>\n",
" <td>0</td>\n",
" <td>这是今年首次启动I级防台应急响应</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799997</th>\n",
" <td>799998</td>\n",
" <td>0</td>\n",
" <td>丽江下飞机时迎接我们的是凉风</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799998</th>\n",
" <td>799999</td>\n",
" <td>0</td>\n",
" <td>费了半天劲各种找关系终于联系上心仪公司的内部人</td>\n",
" </tr>\n",
" <tr>\n",
" <th>799999</th>\n",
" <td>800000</td>\n",
" <td>0</td>\n",
" <td>是汉奸还是被强奸自己对号入座吧</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>800000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1 0 商业秘密的秘密性那是维系其商业价值和垄断地位的前提条件之一\n",
"1 2 1 南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...\n",
"2 3 0 带给我们大常州一场壮观的视觉盛宴\n",
"3 4 0 有原因不明的泌尿系统结石等\n",
"4 5 0 23年从盐城拉回来的麻麻的嫁妆\n",
"... ... .. ...\n",
"799995 799996 0 助排毒缓解痛经预防子宫肌瘤&amp\n",
"799996 799997 0 这是今年首次启动I级防台应急响应\n",
"799997 799998 0 丽江下飞机时迎接我们的是凉风\n",
"799998 799999 0 费了半天劲各种找关系终于联系上心仪公司的内部人\n",
"799999 800000 0 是汉奸还是被强奸自己对号入座吧\n",
"\n",
"[800000 rows x 3 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"me=pd.read_csv('message80W1.csv',header=None)\n",
"me\n"
]
},
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score\n",
    "# Balance the classes: keep the first 10,000 rows with label 1 and the first 10,000 with label 0\n",
    "me_zheng=me[me[1]==1]\n",
    "me_fu=me[me[1]==0]\n",
    "me_zheng=me_zheng[:10000]\n",
    "me_fu=me_fu[:10000]\n",
    "# Combined DataFrame\n",
    "me=pd.concat([me_zheng,me_fu],axis=0)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>感谢致电杭州萧山全金釜韩国烧烤店,本店位于金城路xxx号。韩式烧烤等,价格实惠、欢迎惠顾【全...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>一次价值xxx元王牌项目;可充值xxx元店内项目卡一张;可以参与V动好生活百分百抽奖机会一次...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" <td>(长期诚信在本市作各类资格职称(以及印 /章、牌、 ……等。祥:x x x x x x x ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>《依林美容》三.八.女人节倾情大放送活动开始啦!!!!超值套餐等你拿,活动时间x月x日一x月...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11079</th>\n",
" <td>11080</td>\n",
" <td>0</td>\n",
" <td>居住面积2045尺/190平米</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11080</th>\n",
" <td>11081</td>\n",
" <td>0</td>\n",
" <td>期间在16、17号两天会须闭句容西收费站杭州往南京方向下高速的匝道</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11081</th>\n",
" <td>11082</td>\n",
" <td>0</td>\n",
" <td>分享自无线徐州客户端</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11082</th>\n",
" <td>11083</td>\n",
" <td>0</td>\n",
" <td>本身就经常坏的电梯此刻是终于瘫痪了</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11085</th>\n",
" <td>11086</td>\n",
" <td>0</td>\n",
" <td>地铁运营方宣布车门故障排除</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>20000 rows × 3 columns</p>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"1 2 1 南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...\n",
"6 7 1 感谢致电杭州萧山全金釜韩国烧烤店,本店位于金城路xxx号。韩式烧烤等,价格实惠、欢迎惠顾【全...\n",
"8 9 1 一次价值xxx元王牌项目;可充值xxx元店内项目卡一张;可以参与V动好生活百分百抽奖机会一次...\n",
"11 12 1 (长期诚信在本市作各类资格职称(以及印 /章、牌、 ……等。祥:x x x x x x x ...\n",
"12 13 1 《依林美容》三.八.女人节倾情大放送活动开始啦!!!!超值套餐等你拿,活动时间x月x日一x月...\n",
"... ... .. ...\n",
"11079 11080 0 居住面积2045尺/190平米\n",
"11080 11081 0 期间在16、17号两天会须闭句容西收费站杭州往南京方向下高速的匝道\n",
"11081 11082 0 分享自无线徐州客户端\n",
"11082 11083 0 本身就经常坏的电梯此刻是终于瘫痪了\n",
"11085 11086 0 地铁运营方宣布车门故障排除\n",
"\n",
"[20000 rows x 3 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"me"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Building prefix dict from the default dictionary ...\n",
"Dumping model to file cache C:\\Users\\wrh\\Temp\\jieba.cache\n",
"Loading model cost 0.578 seconds.\n",
"Prefix dict has been built successfully.\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>cleantxt</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>1</td>\n",
" <td>南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ...</td>\n",
" <td>南口 阿玛施 新春 第一批 限量 春装 春暖花开 淑女 蓝色 公主 气质 粉小 西装 冰丝 ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>7</td>\n",
" <td>1</td>\n",
" <td>感谢致电杭州萧山全金釜韩国烧烤店,本店位于金城路xxx号。韩式烧烤等,价格实惠、欢迎惠顾【全...</td>\n",
" <td>感谢 致电 杭州 萧山 全金 韩国 烧烤店 本店 位于 金城 xxx 韩式 烧烤 价格 实惠...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>9</td>\n",
" <td>1</td>\n",
" <td>一次价值xxx元王牌项目;可充值xxx元店内项目卡一张;可以参与V动好生活百分百抽奖机会一次...</td>\n",
" <td>价值 xxx 王牌 项目 充值 xxx 元店 项目 一张 参与 动好 生活 百分百 抽奖 机...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>12</td>\n",
" <td>1</td>\n",
" <td>(长期诚信在本市作各类资格职称(以及印 /章、牌、 ……等。祥:x x x x x x x ...</td>\n",
" <td>长期 诚信 本市 各类 资格 职称 李伟</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>13</td>\n",
" <td>1</td>\n",
" <td>《依林美容》三.八.女人节倾情大放送活动开始啦!!!!超值套餐等你拿,活动时间x月x日一x月...</td>\n",
" <td>依林 美容 女人 倾情 大放送 活动 超值 套餐 等你拿 活动 时间 xx 详情 进店 咨询...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 \\\n",
"1 2 1 南口阿玛施新春第一批限量春装到店啦   春暖花开淑女裙、冰蓝色公主衫 ... \n",
"6 7 1 感谢致电杭州萧山全金釜韩国烧烤店,本店位于金城路xxx号。韩式烧烤等,价格实惠、欢迎惠顾【全... \n",
"8 9 1 一次价值xxx元王牌项目;可充值xxx元店内项目卡一张;可以参与V动好生活百分百抽奖机会一次... \n",
"11 12 1 (长期诚信在本市作各类资格职称(以及印 /章、牌、 ……等。祥:x x x x x x x ... \n",
"12 13 1 《依林美容》三.八.女人节倾情大放送活动开始啦!!!!超值套餐等你拿,活动时间x月x日一x月... \n",
"\n",
" cleantxt \n",
"1 南口 阿玛施 新春 第一批 限量 春装 春暖花开 淑女 蓝色 公主 气质 粉小 西装 冰丝 ... \n",
"6 感谢 致电 杭州 萧山 全金 韩国 烧烤店 本店 位于 金城 xxx 韩式 烧烤 价格 实惠... \n",
"8 价值 xxx 王牌 项目 充值 xxx 元店 项目 一张 参与 动好 生活 百分百 抽奖 机... \n",
"11 长期 诚信 本市 各类 资格 职称 李伟 \n",
"12 依林 美容 女人 倾情 大放送 活动 超值 套餐 等你拿 活动 时间 xx 详情 进店 咨询... "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
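   ],
   "source": [
    "import jieba\n",
    "# Preprocessing\n",
    "# Stop-word filtering may also remove sentiment words such as '好' (good)\n",
    "# sep='aaa' is assumed never to occur in the file, so each line is read as one stop word\n",
    "stoplist = set(pd.read_csv('stopword.txt', names = ['w'], sep = 'aaa',\n",
    "                           encoding = 'ANSI', engine='python').w)\n",
    "\n",
    "# Tokenize, then drop stop words and single-character tokens\n",
    "def m_cut(intxt):\n",
    "    return [ w for w in jieba.cut(intxt)\n",
    "             if w not in stoplist and len(w) > 1 ]\n",
    "\n",
    "# Join the tokens with spaces so the vectorizer can split them again\n",
    "cuttxt = lambda x: \" \".join(m_cut(x))\n",
    "me[\"cleantxt\"] = me[2].apply(cuttxt)\n",
    "me.head()"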
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" ...,\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.],\n",
" [0., 0., 0., ..., 0., 0., 0.]])"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"labels=me.iloc[:,1]\n",
"y=np.array(labels)\n",
"\n",
"#词频矩阵\n",
"td=TfidfVectorizer()\n",
"wordmtx=td.fit_transform(me.cleantxt).toarray()\n",
"wordmtx"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.9015\n"
]
}
   ],
   "source": [
    "x_train,x_test,y_train,y_test=train_test_split(wordmtx,y,test_size=0.2,random_state=0)\n",
    "clf=GaussianNB()\n",
    "clf.fit(x_train,y_train)\n",
    "y_pred=clf.predict(x_test)\n",
    "print(accuracy_score(y_test,y_pred))"
]
},
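  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`fit_transform` returns a SciPy sparse matrix, and calling `.toarray()` on it allocates every cell, zeros included. `GaussianNB` needs dense input, whereas `MultinomialNB` accepts the sparse matrix directly, which is one way to avoid the allocation. The sketch below only estimates how large the dense array would be; the helper `dense_matrix_gib` and the vocabulary size of 50,000 are hypothetical, and the real column count is `len(td.vocabulary_)`.\n",
    "\n",
    "```python\n",
    "def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):\n",
    "    # A dense float64 array stores 8 bytes for every cell, zeros included\n",
    "    return n_rows * n_cols * bytes_per_value / 2**30\n",
    "\n",
    "n_docs = 20000        # rows kept after sampling 10,000 messages per class\n",
    "vocab_size = 50000    # hypothetical; replace with len(td.vocabulary_)\n",
    "print(round(dense_matrix_gib(n_docs, vocab_size), 2), 'GiB')\n",
    "```\n",
    "\n",
    "If the estimate approaches the machine's RAM, keeping the matrix sparse or capping the vocabulary with the vectorizer's `max_features` parameter are the usual options."
   ]
  },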
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=int64)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
   ],
   "source": [
    "# Check the classifier on a few example messages\n",
"sentence=[\"尊敬的客上,感谢您一直的支持,亿美亿康美容部特在本月的x、x、x三天举办秒杀活动,现场更是优惠多多,开抢倒计时还有两天,欲抢从速!xx号艳艳\",\n",
" \"感谢致电杭州萧山全金釜韩国烧烤店,本店位于金城路xxx号。韩式烧烤等,价格实惠、欢迎惠顾【全...\",\n",
" \"秒杀价格88488848你值得拥有\",\n",
" \"有博主做过同类防晒霜的对比\",\n",
" \"一刀999\",\n",
" \"并夕夕\",\n",
" \"csc每天打游戏\",\n",
" \"今天年脑爆炸了,太刺激了\",\n",
" \"我收到了垃圾短信\",\n",
" ]\n",
"\n",
"sentence = [cuttxt(s) for s in sentence]\n",
"sentence=td.transform(sentence).toarray()\n",
"pred=clf.predict(sentence)\n",
"pred"
]
},
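  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Warning! Warning! Warning!**\n",
    "\n",
    "This code processes a large amount of data!\n",
    "\n",
    "Do not run it repeatedly! Doing so may exhaust memory and raise a MemoryError!\n",
    "\n",
    "Restarting the kernel resolves the problem."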
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.7"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
@@ -0,0 +1,2 @@
model_checkpoint_path: "train_model"
all_model_checkpoint_paths: "train_model"
File diff suppressed because one or more lines are too long