
A Small Tool for Fetching Official Account Articles

Source: collected from the internet. Publisher: 自由互联. Published: 2023-10-08


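Automatically fetching official account articles with Python

Preface

I wanted to publish an official account article every day, but I have neither the time nor the energy to write one daily. So I decided to write a small tool that lets me publish an article quickly with only a few simple steps each day. This is admittedly not the best practice, but it does save a lot of effort.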

(Screenshot of the tool; the image failed to transfer from the original page.)

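Now you only need to enter a keyword and the number of articles, and the tool saves that many articles locally as Word documents (.docx). As you probably know, a Word document can be imported directly into an official account's rich-text post.

Sounds like a fun little tool, doesn't it?

Without further ado, let's start the tutorial!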

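1. Development environment

Windows
Python 3.7

Python libraries used:
json, os, tkinter (GUI), pypandoc (requires pandoc to be installed), requests
To get the pandoc installer, follow the official account 小磊秒秒屋 and reply "pandoc" or "pdoc".

2. Development steps

2.1. The crawler

We first develop the crawler as a standalone project. The crawler for this site is fairly basic, so complete beginners can follow along too.

Steps:

There are four main steps:

1. Read the search keyword and the number of articles from the console
2. Send a request to the URL and extract the data
3. Save the page source as an HTML file
4. Convert the saved HTML file to a docx file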

Import the libraries

import json
import os
import pypandoc
import requests

Read the keyword

def main():
    serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
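    key = input('Enter the search keyword: ')
    try:
        num = int(input('Enter the number of articles to fetch: '))
    except ValueError as e:
        # Fall back to a single article when the input is not an integer
        print('Please enter a valid number:', e)
        num = 1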
    post_data = '{"requests":[{"query":"' + key + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
    post_data = post_data.encode('utf-8')
    run_(serch_url, post_data, num)  # call run_()
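
The request body above is assembled by string concatenation, so a keyword containing a double quote or a backslash would yield invalid JSON. As a minimal sketch, assuming the same payload shape, the body could instead be serialized with json.dumps. `build_post_data` is a hypothetical helper, and its `params` default is a shortened stand-in for the long query string used above.

```python
import json


def build_post_data(key, params="hitsPerPage=20"):
    """Serialize the search request; json.dumps escapes quotes in the keyword."""
    # NOTE: hypothetical helper; `params` is a shortened stand-in for the
    # full query string used in the article.
    body = {"requests": [{"query": key, "indexName": "mdnice", "params": params}]}
    # ensure_ascii=False keeps non-ASCII keywords readable before UTF-8 encoding
    return json.dumps(body, ensure_ascii=False).encode("utf-8")


if __name__ == "__main__":
    print(build_post_data('say "hi"'))
```

The returned bytes can be passed to requests.post in the same way as the hand-built `post_data`.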

Send the POST request

def run_(serch_url, post_data, num=1):
    text = requests.post(url=serch_url, data=post_data)
    if text.status_code == 200:
        results = json.loads(text.text)['results'][0]['hits']
        if num > len(results):
            num = len(results)
        for r in results[:num]:
            file_name = r['hierarchy']['lvl1']
            url = r['url']
            html_name = download_html(file_name, url)
            to_docx(html_name)
    else:
        print('The link is no longer valid')
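
The loop in run_ assumes that every hit carries a hierarchy.lvl1 title and a url. Below is a small sketch of pulling the (title, url) pairs out of a decoded response more defensively. `extract_articles` and the sample response are hypothetical and only mirror the fields the code above reads.

```python
def extract_articles(response, num=1):
    """Return up to `num` (title, url) pairs from a decoded search response."""
    # Assumed shape (taken from the code above): response['results'][0]['hits']
    results = response.get("results") or [{}]
    hits = results[0].get("hits", [])
    articles = []
    for hit in hits:
        if len(articles) >= num:
            break
        title = (hit.get("hierarchy") or {}).get("lvl1")
        url = hit.get("url")
        # Skip hits that lack a title or a URL instead of raising KeyError
        if title and url:
            articles.append((title, url))
    return articles


if __name__ == "__main__":
    sample = {"results": [{"hits": [
        {"hierarchy": {"lvl1": "First"}, "url": "https://example.com/1"},
        {"hierarchy": {"lvl1": None}, "url": "https://example.com/2"},
        {"hierarchy": {"lvl1": "Third"}, "url": "https://example.com/3"},
    ]}]}
    print(extract_articles(sample, num=2))
```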

Fetch the page and save it as an HTML file

def download_html(filename, url):
    # Fetch the article page source
    content = requests.get(url).text
    # Save into a "file" folder under the current working directory (Windows separators)
    file_path = os.getcwd() + '\\file'
    if not os.path.exists(file_path):
        os.mkdir(file_path)
    filename = file_path + '\\' + filename + '.html'
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(content)
    return filename
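
The article title is used directly as the file name, yet a title may contain characters that Windows does not allow in file names (for example `?` or `:`). Here is a minimal sketch of cleaning the title first. `safe_filename` is a hypothetical helper, not part of the original script.

```python
import re


def safe_filename(title, replacement="_"):
    """Replace characters that are invalid in Windows file names."""
    # Characters Windows rejects in file names: \ / : * ? " < > |
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title)
    # Leading whitespace and trailing dots or spaces also cause trouble on Windows
    cleaned = cleaned.lstrip().rstrip(". ")
    # Fall back to a fixed name if nothing usable is left
    return cleaned or "untitled"


if __name__ == "__main__":
    print(safe_filename("What is Python? A: a language"))
```

The result could be passed to download_html in place of the raw title.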

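Convert the HTML file to docx

Note: if pandoc is not installed, the program will raise an error, because pypandoc works by calling pandoc.
To get the pandoc installer, follow the official account 小磊秒秒屋 and reply "小工具".
You can also find and download it yourself via Baidu, though the download is fairly slow.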

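def to_docx(html_name):
    # Strip only the final extension; split('.') would cut at the first dot anywhere in the path
    new_name = os.path.splitext(html_name)[0]
    pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')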
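
Because pypandoc depends on the pandoc executable, it can help to check for it before converting. The sketch below uses the standard library's shutil.which. `pandoc_available` is a hypothetical helper, and it only searches the PATH, so a pandoc installed elsewhere would not be detected.

```python
import shutil


def pandoc_available(executable="pandoc"):
    """Return True if the given executable can be found on the PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    if pandoc_available():
        print("pandoc found; conversion can proceed")
    else:
        print("pandoc not found; install it before running the tool")
```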

Call the main function

if __name__ == '__main__':
    main()

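2.2. Wrapping the crawler with tkinter

This time we use object-oriented programming in Python. If you are not familiar with it, you may want to read up on it first.

Import the libraries

import json
import os
# GUI
import tkinter as tk
from tkinter import messagebox
import pypandoc
import requests

Create the class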

class ToolGetArticle(tk.Tk):
    def __init__(self):
        super(ToolGetArticle, self).__init__()
        self.title('Article Fetching Tool')
        width, height = 300, 150
        screenwidth = self.winfo_screenwidth()
        screenheight = self.winfo_screenheight()
        size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
        self.geometry(size_geo)
        # self.root_window.iconbitmap('C:/Users/Administrator/Desktop/favicon.ico')
        self["background"] = "#C9C9C9"
        # Search keyword and number of articles
        self.mainkey = tk.StringVar()
        self.num = tk.IntVar()

    def add_kongjian(self):
        tk.Label(self, text="Keyword:").grid(row=0)
        tk.Label(self, text="Articles:").grid(row=1)
        self.e1 = tk.Entry(self)
        self.e2 = tk.Spinbox(self)
        self.e1.grid(row=0, column=1, padx=10, pady=5)
        self.e2.grid(row=1, column=1, padx=10, pady=5)
        tk.Button(self, text="Start", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
        tk.Button(self, text="Quit", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)

    def check_func(self):
        self.mainkey = self.e1.get()
        self.num = self.e2.get()
        # Check that the article count is an integer
        try:
            num = int(self.num)
            self.num1 = num
        except ValueError as e:
            messagebox.showwarning(str(e), "Articles: please enter a number")
            self.e2.delete(0, tk.END)
            return False
        # Check that a keyword was entered
        if self.mainkey == '':
            messagebox.showwarning("Error", "Please enter a search keyword")
            self.e1.delete(0, tk.END)
            return False
        return True

    #####################Crawler######################
    def download_html(self, filename, url):
        content = requests.get(url).text
        file_path = os.getcwd() + '\\file'
        if not os.path.exists(file_path):
            os.mkdir(file_path)
        filename = file_path + '\\' + filename + '.html'
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(content)
        return filename

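    def to_docx(self, html_name):
        # Strip only the final extension
        new_name = os.path.splitext(html_name)[0]
        try:
            pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
        except Exception as e:
            print(e)
            messagebox.showwarning('Warning', 'Something went wrong; please contact the administrator')
        else:
            # Report success only when the conversion did not raise
            messagebox.showinfo('Success', 'Download complete; see the "file" folder next to the script')
        print('Source URL:', self.url)
        print('File name:', new_name)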

    def run_(self, serch_url, post_data, num=1):
        text = requests.post(url=serch_url, data=post_data)
        if text.status_code == 200:
            results = json.loads(text.text)['results'][0]['hits']
            if num > len(results):
                num = len(results)
            for r in results[:num]:
                file_name = r['hierarchy']['lvl1']
                self.url = r['url']
                html_name = self.download_html(file_name,self.url)
                self.to_docx(html_name)
        else:
            return False

    def main(self):
        serch_url = 'https://4l77k49qor-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(4.13.0)%3B%20Browser%20(lite)%3B%20docsearch%20(3.0.0)%3B%20docsearch-react%20(3.0.0)&x-algolia-api-key=0f8cb8d4dbe2581b6912018d4e33fb8d&x-algolia-application-id=4L77K49QOR'
        key = self.mainkey
        num = self.num1
        post_data = '{"requests":[{"query":"' + str(
            key) + '","indexName":"mdnice","params":"attributesToRetrieve=%5B%22hierarchy.lvl0%22%2C%22hierarchy.lvl1%22%2C%22hierarchy.lvl2%22%2C%22hierarchy.lvl3%22%2C%22hierarchy.lvl4%22%2C%22hierarchy.lvl5%22%2C%22hierarchy.lvl6%22%2C%22content%22%2C%22type%22%2C%22url%22%5D&attributesToSnippet=%5B%22hierarchy.lvl1%3A10%22%2C%22hierarchy.lvl2%3A10%22%2C%22hierarchy.lvl3%3A10%22%2C%22hierarchy.lvl4%3A10%22%2C%22hierarchy.lvl5%3A10%22%2C%22hierarchy.lvl6%3A10%22%2C%22content%3A10%22%5D&snippetEllipsisText=%E2%80%A6&highlightPreTag=%3Cmark%3E&highlightPostTag=%3C%2Fmark%3E&hitsPerPage=20&clickAnalytics=true"}]}'
        post_data = post_data.encode('utf-8')
        self.run_(serch_url, post_data, num)

    ##################################################

    def new_func(self):
        if self.check_func():
            self.main()

    def run_main(self):
        self.mainloop()

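The class inherits from tkinter.Tk, so the methods and widgets of the parent Tk class are available directly, which makes working with widgets convenient.

I am not sure how much needs explaining here, so let's start with the constructor:

Constructor

In the __init__() function:

1. This part sets the size and position of the main window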

screenwidth = self.winfo_screenwidth()
screenheight = self.winfo_screenheight()
size_geo = '%dx%d+%d+%d' % (width, height, (screenwidth - width) / 2, (screenheight - height) / 2)
self.geometry(size_geo)
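
The centering arithmetic can be isolated into a small pure function, which makes it easy to check by hand. This is only a sketch: `center_geometry` is a hypothetical name, and it uses integer division so that the offsets are whole pixels.

```python
def center_geometry(width, height, screenwidth, screenheight):
    """Build a Tk geometry string 'WxH+X+Y' that centers the window."""
    x = (screenwidth - width) // 2
    y = (screenheight - height) // 2
    return f"{width}x{height}+{x}+{y}"


if __name__ == "__main__":
    # A 300x150 window on a 1920x1080 screen
    print(center_geometry(300, 150, 1920, 1080))  # 300x150+810+465
```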

2. When tk widgets need values passed dynamically, the variables are declared like this

self.mainkey = tk.StringVar()
self.num = tk.IntVar()

Adding widgets

In the add_kongjian() function:

1. Add the Label widgets, plus the Entry and Spinbox widgets

tk.Label(self, text="Keyword:").grid(row=0)
tk.Label(self, text="Articles:").grid(row=1)
self.e1 = tk.Entry(self)
self.e2 = tk.Spinbox(self)
self.e1.grid(row=0, column=1, padx=10, pady=5)
self.e2.grid(row=1, column=1, padx=10, pady=5)

2. Add the Button widgets and bind their callbacks

tk.Button(self, text="Start", width=10, command=self.new_func).grid(row=3, column=0, sticky="w", padx=10, pady=5)
tk.Button(self, text="Quit", width=10, command=self.quit).grid(row=3, column=1, sticky="e", padx=10, pady=5)

Note that when placing widgets with grid(), parameters such as row and column control their position.
The Start button is bound to the custom function new_func().

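Starting the crawl

Before running the crawler, we need to check whether the input meets the requirements. The crawl proceeds only if it does; otherwise a warning pops up.

1. The new_func function

It calls the check function:

def new_func(self):
    if self.check_func():
        self.main()

2. The check_func function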

def check_func(self):
    self.mainkey = self.e1.get()
    self.num = self.e2.get()
    # Check that the article count is an integer
    try:
        num = int(self.num)
        self.num1 = num
    except ValueError as e:
        messagebox.showwarning(str(e), "Articles: please enter a number")
        self.e2.delete(0, tk.END)
        return False
    # Check that a keyword was entered
    if self.mainkey == '':
        messagebox.showwarning("Error", "Please enter a search keyword")
        self.e1.delete(0, tk.END)
        return False
    return True
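
The same checks can also be written as a function that does not touch any widget, so the rules can be tested without opening a window. This is a sketch under that assumption: `validate_inputs` is hypothetical, and rejecting a count below 1 is an added rule that check_func does not have.

```python
def validate_inputs(keyword, count_text):
    """Return (True, count) on success or (False, message) on failure."""
    try:
        count = int(count_text)
    except ValueError:
        return False, "Articles: please enter a number"
    # Assumption: a count below 1 is treated as invalid (not in the original)
    if count < 1:
        return False, "Articles: please enter at least 1"
    if keyword.strip() == "":
        return False, "Please enter a search keyword"
    return True, count


if __name__ == "__main__":
    print(validate_inputs("python", "3"))
    print(validate_inputs("", "3"))
```

A GUI method could call this with the text of the two input widgets and show the returned message in a warning dialog.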

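3. The main crawler code

The main difference from before is that the to_docx function now catches exceptions; if one occurs, a warning dialog pops up.

def to_docx(self, html_name):
    # Strip only the final extension
    new_name = os.path.splitext(html_name)[0]
    try:
        pypandoc.convert_file(html_name, 'docx', outputfile=f'{new_name}.docx')
    except Exception as e:
        print(e)
        messagebox.showwarning('Warning', 'Something went wrong; please contact the administrator')
    else:
        # Report success only when the conversion did not raise
        messagebox.showinfo('Success', 'Download complete; see the "file" folder next to the script')
    print('Source URL:', self.url)
    print('File name:', new_name)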

Run it

if __name__ == '__main__':
    tk1 = ToolGetArticle()
    tk1.add_kongjian()
    tk1.run_main()

(Screenshot of the running tool window; the image failed to transfer from the original page.)

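When it runs successfully, the tool window pops up. Pretty neat, right? Go and try it.

When running the code on Linux, you need to change the file paths.

Usage hardly needs explaining: after a successful run, a file folder is created in the directory the program runs from, and both the fetched HTML files and the converted docx files are stored in it.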
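
Regarding the Linux note: the hard-coded `\\` separators are what tie the script to Windows. Below is a sketch of building the same paths with pathlib, which picks the right separator for the platform. `html_path` is a hypothetical helper, not part of the original script.

```python
from pathlib import Path


def html_path(title, base_dir="."):
    """Return <base_dir>/file/<title>.html using the platform's separator."""
    return Path(base_dir) / "file" / f"{title}.html"


if __name__ == "__main__":
    target = html_path("demo")
    # The docx name is derived by swapping only the final suffix
    print(target, target.with_suffix(".docx"))
```

Before writing, the folder can be created with `target.parent.mkdir(parents=True, exist_ok=True)`.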

