447 lines
15 KiB
Plaintext
447 lines
15 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 80,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pandas as pd\n",
|
|
"import numpy as np\n",
|
|
"import os\n",
|
|
"import matplotlib.pyplot as plt"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"# 1、合并表格"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 110,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"['CNKI-01-450.xlsx', 'CNKI-02-500.xlsx', 'CNKI-03-199.xlsx']\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>SrcDatabase-来源库</th>\n",
|
|
" <th>Title-题名</th>\n",
|
|
" <th>Author-作者</th>\n",
|
|
" <th>Organ-单位</th>\n",
|
|
" <th>Source-文献来源</th>\n",
|
|
" <th>Keyword-关键词</th>\n",
|
|
" <th>Summary-摘要</th>\n",
|
|
" <th>PubTime-发表时间</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>CPFDTEMP</td>\n",
|
|
" <td>应用金融科技创新服务小微企业的途径与机制研究</td>\n",
|
|
" <td>李慧;</td>\n",
|
|
" <td>中国工商银行股份有限公司天津市分行;</td>\n",
|
|
" <td>第三十四届中国(天津)2020’IT、网络、信息技术、电子、仪器仪表创新学术会议论文集</td>\n",
|
|
" <td>融资困境;;金融科技;;小微企业</td>\n",
|
|
" <td>近些年来,人工智能、区块链、云计算和大数据成为了各行各业的焦点,传统金融业包括商业银行在内,...</td>\n",
|
|
" <td>2020-08-17 16:17</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>CJFDTEMP</td>\n",
|
|
" <td>民生锐评</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>中国商界</td>\n",
|
|
" <td>中国商界</td>\n",
|
|
" <td>新能源车;人工智能应用;新能源汽车;</td>\n",
|
|
" <td>人工智能成为推动金融科技发展关键力量近日,在2020世界人工智能大会云端峰会召开前夕,金融壹...</td>\n",
|
|
" <td>2020-08-15</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>CJFDTEMP</td>\n",
|
|
" <td>金融科技研究进展与评析</td>\n",
|
|
" <td>鲁钊阳;张珂瑞;</td>\n",
|
|
" <td>西南政法大学经济学院;</td>\n",
|
|
" <td>金融理论与实践</td>\n",
|
|
" <td>金融科技;;金融创新;;金融发展</td>\n",
|
|
" <td>从历史的角度来看,金融科技的产生是数百年来技术变革的必然产物,大数据、区块链、云计算、人工智...</td>\n",
|
|
" <td>2020-08-12 16:58</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>CJFDTEMP</td>\n",
|
|
" <td>聚焦新兴技术创新研究 洞悉商业银行智慧金融战略</td>\n",
|
|
" <td>NaN</td>\n",
|
|
" <td>中国金融电脑</td>\n",
|
|
" <td>中国金融电脑</td>\n",
|
|
" <td>科技创新机制;智慧金融;金融科技创新;</td>\n",
|
|
" <td>当下,以5G、人工智能、大数据、云计算、区块链等为代表的新ICT技术的创新发展如火如荼,并加...</td>\n",
|
|
" <td>2020-08-07</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>CJFDTEMP</td>\n",
|
|
" <td>打造科技创新引擎 赋能业务价值创造</td>\n",
|
|
" <td>陈满才;</td>\n",
|
|
" <td>中国工商银行金融科技部;</td>\n",
|
|
" <td>中国金融电脑</td>\n",
|
|
" <td>云计算平台;物联网;人工智能技术;普惠金融发展;服务实体经济;工作体制机制;产学研用;银行业...</td>\n",
|
|
" <td>当前,金融科技已逐渐从后台走向前台,成为推动金融转型升级的新引擎、金融服务实体经济的新途径、...</td>\n",
|
|
" <td>2020-08-07</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" SrcDatabase-来源库 Title-题名 Author-作者 Organ-单位 \\\n",
|
|
"0 CPFDTEMP 应用金融科技创新服务小微企业的途径与机制研究 李慧; 中国工商银行股份有限公司天津市分行; \n",
|
|
"1 CJFDTEMP 民生锐评 NaN 中国商界 \n",
|
|
"2 CJFDTEMP 金融科技研究进展与评析 鲁钊阳;张珂瑞; 西南政法大学经济学院; \n",
|
|
"3 CJFDTEMP 聚焦新兴技术创新研究 洞悉商业银行智慧金融战略 NaN 中国金融电脑 \n",
|
|
"4 CJFDTEMP 打造科技创新引擎 赋能业务价值创造 陈满才; 中国工商银行金融科技部; \n",
|
|
"\n",
|
|
" Source-文献来源 \\\n",
|
|
"0 第三十四届中国(天津)2020’IT、网络、信息技术、电子、仪器仪表创新学术会议论文集 \n",
|
|
"1 中国商界 \n",
|
|
"2 金融理论与实践 \n",
|
|
"3 中国金融电脑 \n",
|
|
"4 中国金融电脑 \n",
|
|
"\n",
|
|
" Keyword-关键词 \\\n",
|
|
"0 融资困境;;金融科技;;小微企业 \n",
|
|
"1 新能源车;人工智能应用;新能源汽车; \n",
|
|
"2 金融科技;;金融创新;;金融发展 \n",
|
|
"3 科技创新机制;智慧金融;金融科技创新; \n",
|
|
"4 云计算平台;物联网;人工智能技术;普惠金融发展;服务实体经济;工作体制机制;产学研用;银行业... \n",
|
|
"\n",
|
|
" Summary-摘要 PubTime-发表时间 \n",
|
|
"0 近些年来,人工智能、区块链、云计算和大数据成为了各行各业的焦点,传统金融业包括商业银行在内,... 2020-08-17 16:17 \n",
|
|
"1 人工智能成为推动金融科技发展关键力量近日,在2020世界人工智能大会云端峰会召开前夕,金融壹... 2020-08-15 \n",
|
|
"2 从历史的角度来看,金融科技的产生是数百年来技术变革的必然产物,大数据、区块链、云计算、人工智... 2020-08-12 16:58 \n",
|
|
"3 当下,以5G、人工智能、大数据、云计算、区块链等为代表的新ICT技术的创新发展如火如荼,并加... 2020-08-07 \n",
|
|
"4 当前,金融科技已逐渐从后台走向前台,成为推动金融转型升级的新引擎、金融服务实体经济的新途径、... 2020-08-07 "
|
|
]
|
|
},
|
|
"execution_count": 110,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"base='./datasets/'\n",
|
|
"datasets=os.listdir(base)\n",
|
|
"print(datasets)\n",
|
|
"dfs=[]\n",
|
|
"for dataset in datasets:\n",
|
|
" df=pd.read_excel(os.path.join(base,dataset),encoding='gbk')\n",
|
|
" dfs.append(df)\n",
|
|
"df=pd.concat(dfs)\n",
|
|
"df.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 2、删除第一列数据,以及缺失值"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 111,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.drop(df.columns[0],axis=1,inplace=True)\n",
|
|
"df.dropna(inplace=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 3、异常值处理 \n",
|
|
"- 发布时间中存在部分显示为'PubTime-发表时间'的内容,应该去除"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 112,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"异常数据如下:\n"
|
|
]
|
|
},
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Title-题名</th>\n",
|
|
" <th>Author-作者</th>\n",
|
|
" <th>Organ-单位</th>\n",
|
|
" <th>Source-文献来源</th>\n",
|
|
" <th>Keyword-关键词</th>\n",
|
|
" <th>Summary-摘要</th>\n",
|
|
" <th>PubTime-发表时间</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>439</th>\n",
|
|
" <td>Title-篇名</td>\n",
|
|
" <td>Author-作者</td>\n",
|
|
" <td>Organ-单位</td>\n",
|
|
" <td>Source-文献来源</td>\n",
|
|
" <td>Keyword-关键词</td>\n",
|
|
" <td>Summary-摘要</td>\n",
|
|
" <td>PubTime-发表时间</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>500</th>\n",
|
|
" <td>Title-篇名</td>\n",
|
|
" <td>Author-作者</td>\n",
|
|
" <td>Organ-单位</td>\n",
|
|
" <td>Source-文献来源</td>\n",
|
|
" <td>Keyword-关键词</td>\n",
|
|
" <td>Summary-摘要</td>\n",
|
|
" <td>PubTime-发表时间</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>196</th>\n",
|
|
" <td>Title-篇名</td>\n",
|
|
" <td>Author-作者</td>\n",
|
|
" <td>Organ-单位</td>\n",
|
|
" <td>Source-文献来源</td>\n",
|
|
" <td>Keyword-关键词</td>\n",
|
|
" <td>Summary-摘要</td>\n",
|
|
" <td>PubTime-发表时间</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" Title-题名 Author-作者 Organ-单位 Source-文献来源 Keyword-关键词 Summary-摘要 \\\n",
|
|
"439 Title-篇名 Author-作者 Organ-单位 Source-文献来源 Keyword-关键词 Summary-摘要 \n",
|
|
"500 Title-篇名 Author-作者 Organ-单位 Source-文献来源 Keyword-关键词 Summary-摘要 \n",
|
|
"196 Title-篇名 Author-作者 Organ-单位 Source-文献来源 Keyword-关键词 Summary-摘要 \n",
|
|
"\n",
|
|
" PubTime-发表时间 \n",
|
|
"439 PubTime-发表时间 \n",
|
|
"500 PubTime-发表时间 \n",
|
|
"196 PubTime-发表时间 "
|
|
]
|
|
},
|
|
"execution_count": 112,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"abnormal=df[df['PubTime-发表时间']=='PubTime-发表时间']\n",
|
|
"print('异常数据如下:')\n",
|
|
"abnormal"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"## 4、删除异常值,将发表时间转换成时间序列,保存数据"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 113,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.drop(abnormal.index,inplace=True)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 114,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>Title-题名</th>\n",
|
|
" <th>Author-作者</th>\n",
|
|
" <th>Organ-单位</th>\n",
|
|
" <th>Source-文献来源</th>\n",
|
|
" <th>Keyword-关键词</th>\n",
|
|
" <th>Summary-摘要</th>\n",
|
|
" <th>PubTime-发表时间</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
"Empty DataFrame\n",
|
|
"Columns: [Title-题名, Author-作者, Organ-单位, Source-文献来源, Keyword-关键词, Summary-摘要, PubTime-发表时间]\n",
|
|
"Index: []"
|
|
]
|
|
},
|
|
"execution_count": 114,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df[df['PubTime-发表时间']=='PubTime-发表时间']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 115,
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"Title-题名 0\n",
|
|
"Author-作者 0\n",
|
|
"Organ-单位 0\n",
|
|
"Source-文献来源 0\n",
|
|
"Keyword-关键词 0\n",
|
|
"Summary-摘要 0\n",
|
|
"PubTime-发表时间 0\n",
|
|
"dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 115,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df.isnull().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {},
|
|
"source": [
|
|
"- 经过上述两行检查,缺失值与异常值处理完毕\n",
|
|
"---"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 116,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#时间序列转换\n",
|
|
"time=df['PubTime-发表时间']\n",
|
|
"df['PubTime-发表时间']=[i[:10] for i in time]\n",
|
|
"df['PubTime-发表时间']=pd.to_datetime(df['PubTime-发表时间'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 118,
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df.to_csv('CNKI.csv',encoding='gbk',index=None)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.7.0"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 2
|
|
}
|