
Dev Notes: Python Web Crawler (News Collection Script)


Editor's note: This article was compiled by the 编程笔记#自由互联 editors. It mainly introduces a Python web crawler (a news collection script); hopefully it is of some reference value.

=====================Crawler Principle=====================

Access the news homepage with Python and use regular expressions to extract the links in the news ranking list.
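A minimal sketch of this first step, assuming a hypothetical link pattern (pat_link below is illustrative only; the real pattern has to match the target site's actual markup):

import re
import urllib.request

home = 'http://baijia.baidu.com/'
html = urllib.request.urlopen(home).read().decode('utf8')
# Hypothetical pattern for illustration; adjust to the site's real HTML.
pat_link = r'<a href="(http://baijia\.baidu\.com/[^"]+)"'
links = re.findall(pat_link, html)[0:10]  # keep the top ten ranked links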

Visit each of those links in turn, extract the article information from the page's HTML, and store it in an Article object.
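For instance, a capture group pulls a single field out of the HTML like this (the sample markup here is invented for illustration):

import re

sample = '<h1 class="title"> Example Headline </h1>'
# The parenthesized group captures just the headline text.
title = re.findall(r'<h1 class="title">\s*(.+?)\s*</h1>', sample)[0]
print(title)  # Example Headline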

Save the data held in the Article object to the database via pymysql (a third-party module).
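The insert is done with parameterized queries (the %s placeholders), so pymysql escapes the values and the script is not vulnerable to SQL injection. A minimal sketch, reusing the connection settings from the script below:

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
cursor = connect.cursor()
sql = "INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)"
# Values are passed separately and escaped by the driver.
cursor.execute(sql, ('a title', 'an author', '2016-01-01', 'a summary', 'body text'))
connect.commit()
connect.close()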

=====================Data Structure=====================

CREATE TABLE `news` (
  `id` int(6) unsigned AUTO_INCREMENT NOT NULL,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
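If the table does not exist yet, the same DDL can be executed once from Python with pymysql; a sketch, again using the script's connection settings:

import pymysql

connect = pymysql.connect(host='localhost', port=3310, user='woider',
                          passwd='3243', db='python', charset='utf8')
cursor = connect.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS `news` (
  `id` int(6) unsigned AUTO_INCREMENT NOT NULL,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8''')
connect.close()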

=====================Script Code=====================

# Baidu Baijia article collector
import re
import urllib.request
import pymysql.cursors

# Database configuration
config = {
    'host': 'localhost',
    'port': '3310',
    'username': 'woider',
    'password': '3243',
    'database': 'python',
    'charset': 'utf8'
}

# Table creation statement
'''
CREATE TABLE `news` (
  `id` int(6) unsigned AUTO_INCREMENT NOT NULL,
  `title` varchar(45) NOT NULL,
  `author` varchar(12) NOT NULL,
  `date` varchar(12) NOT NULL,
  `about` varchar(255) NOT NULL,
  `content` text NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
'''

# Article object
class Article(object):
    title = None
    author = None
    date = None
    about = None
    content = None

# Regular expressions
# NOTE: the HTML-tag portions of these patterns were stripped when this post
# was republished; only the capture-group skeletons survive. They must be
# restored against the target site's actual markup before the script will run.
patArticle = ''              # match article links (pattern lost)
patTitle = r'\s*(.+)'        # match article title
patAuthor = r'\s*(.+)'       # match article author
patDate = r'(.+)'            # match publication date
patAbout = r'(.+)'           # match article summary
patContent = r'((.|\s)+)'    # match article content
patCopy = r'(.|\s)+'         # match copyright notice
patTag = r'()|(\s*)'         # match HTML tags

# Collect article info
def collect_article(url):
    article = Article()
    html = urllib.request.urlopen(url).read().decode('utf8')
    article.title = re.findall(patTitle, html)[0]
    article.author = re.findall(patAuthor, html)[0]
    article.date = re.findall(patDate, html)[0]
    article.about = re.findall(patAbout, html)[0]
    content = re.findall(patContent, html)[0]
    content = re.sub(patCopy, '', content[0])
    content = re.sub('<br>', '\n', content)  # the original tag was also stripped; <br> is the likely intent
    content = re.sub(patTag, '', content)
    article.content = content
    return article

# Save article info
def save_article(connect, article):
    message = None
    try:
        cursor = connect.cursor()
        sql = "INSERT INTO news (title, author, date, about, content) VALUES (%s, %s, %s, %s, %s)"
        data = (article.title, article.author, article.date, article.about, article.content)
        cursor.execute(sql, data)
        connect.commit()
    except Exception as e:
        message = str(e)
    else:
        message = article.title
    finally:
        cursor.close()
    return message

# Fetch links
home = 'http://baijia.baidu.com/'  # Baidu Baijia homepage
html = urllib.request.urlopen(home).read().decode('utf8')  # fetch the page source
links = re.findall(patArticle, html)[0:10]  # daily hot news

# Connect to the database
connect = pymysql.connect(
    host=config['host'],
    port=int(config['port']),
    user=config['username'],
    passwd=config['password'],
    db=config['database'],
    charset=config['charset']
)

for url in links:
    article = collect_article(url)            # collect article info
    message = save_article(connect, article)  # save article info
    print(message)

connect.close()  # close the database connection
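One subtlety in collect_article: patContent contains two capture groups, and when a pattern has groups, re.findall returns one tuple of groups per match rather than a plain string. That is why the script indexes content[0] before running the substitutions. A quick illustration:

import re

matches = re.findall(r'((\w)\w*)', 'hello world')
print(matches)        # [('hello', 'h'), ('world', 'w')] -- one tuple per match
print(matches[0][0])  # 'hello' -- the outer group, like content[0] in the script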

=====================Run Results=====================

[The original post showed two screenshots of the run results here.]
