Concurrency and Asynchronous IO
When writing a crawler, the performance cost is concentrated in IO requests: in single-process, single-thread mode, every URL request must wait for its response, which slows the whole run down.
1. Synchronous execution

import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

for url in url_list:
    print(url, fetch_async(url))
2. Multithreading (thread pool)

from concurrent.futures import ThreadPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

pool = ThreadPoolExecutor(5)
for url in url_list:
    pool.submit(fetch_async, url)
pool.shutdown(wait=True)
"""并发未来-线程池""" from concurrent.futures import ThreadPoolExecutor import time import requests def task(url): response = requests.get(url) print(url,response.status_code) response.encoding = response.apparent_encoding if response.status_code == 200: return {"url":url,"text":response.text} def save_to_html(res,*args,**kwargs): res = res.result() #res 回调函数接收到res返回的是一个对象<Future at 0x1ed4cf245c0 state=finished returned dict> filename = res[‘url‘].split(".")[-2] + ".html" with open(filename,‘w+‘) as f: f.write(res["text"]) print(filename,"--->写入成功!") def parse_html(res,*args,**kwargs): pass if __name__ == ‘__main__‘: start = time.time() pool = ThreadPoolExecutor() #线程池 不过不指定值 默认为CPU*5 url_list = [ ‘http://www.cnblogs.com/‘, ‘https://huaban.com/favorite/beauty/‘, ‘http://www.bing.com‘, ‘http://www.zhihu.com‘, ‘http://www.sina.com‘, ‘http://www.baidu.com‘, ‘http://www.autohome.com.cn‘, ] for url in url_list: v = pool.submit(task,url) v.add_done_callback(save_to_html) v.add_done_callback(parse_html) pool.shutdown(wait=True) print("consume time is:",time.time()-start)3-多线程+回调函数
4. Multiprocessing (process pool)

from concurrent.futures import ProcessPoolExecutor
import requests

def fetch_async(url):
    response = requests.get(url)
    return response

url_list = ['http://www.github.com', 'http://www.bing.com']

if __name__ == '__main__':  # required on platforms that spawn worker processes (e.g. Windows)
    pool = ProcessPoolExecutor(5)
    for url in url_list:
        pool.submit(fetch_async, url)
    pool.shutdown(wait=True)
"""并发未来-进程池""" from concurrent.futures import ProcessPoolExecutor import time import requests def task(url): response = requests.get(url) print(url,response.status_code) response.encoding = response.apparent_encoding if response.status_code == 200: return {"url":url,"text":response.text} def save_to_html(res,*args,**kwargs): res = res.result() #res 回调函数接收到res返回的是一个对象<Future at 0x1ed4cf245c0 state=finished returned dict> filename = res[‘url‘].split(".")[-2] + ".html" with open(filename,‘w+‘) as f: f.write(res["text"]) print(filename,"--->写入成功!") def parse_html(res,*args,**kwargs): pass if __name__ == ‘__main__‘: start = time.time() pool = ProcessPoolExecutor() #线程池 不过不指定值 默认为CPU*5 url_list = [ ‘http://www.cnblogs.com/‘, ‘https://huaban.com/favorite/beauty/‘, ‘http://www.bing.com‘, ‘http://www.zhihu.com‘, ‘http://www.sina.com‘, ‘http://www.baidu.com‘, ‘http://www.autohome.com.cn‘, ] for url in url_list: v = pool.submit(task,url) v.add_done_callback(save_to_html) v.add_done_callback(parse_html) pool.shutdown(wait=True) print("consume time is:",time.time()-start)5-多进程+回调函数
Any of the snippets above improves request throughput. The drawback of multithreading and multiprocessing, however, is that a thread or process blocked on IO sits idle and is wasted, which is why asynchronous IO is the first choice:
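A minimal sketch of the coroutine approach, assuming the third-party aiohttp package (the blocking requests library cannot be awaited):

import asyncio
import aiohttp

async def fetch_async(session, url):
    # await suspends this coroutine while the response is pending,
    # so the event loop can drive the other requests in the meantime
    async with session.get(url) as response:
        text = await response.text()
        print(url, response.status)
        return text

async def main():
    url_list = ['http://www.github.com', 'http://www.bing.com']
    async with aiohttp.ClientSession() as session:
        # gather schedules all coroutines on one event loop and waits
        # for them all; no thread or process is parked on blocked IO
        return await asyncio.gather(*[fetch_async(session, url) for url in url_list])

if __name__ == '__main__':
    asyncio.run(main())  # Python 3.7+

Everything runs in a single thread: while one response is in flight, the event loop switches to another coroutine instead of blocking.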
Supplement: coroutines + async IO (the references below also explain concurrency, parallelism, synchronous, asynchronous, blocking, and non-blocking with examples)
Reference: https://blog.csdn.net/weixin_41207499/article/details/80657201
Reference: https://www.cnblogs.com/ssyfj/p/9222342.html
Reference: https://www.liaoxuefeng.com/wiki/1016959663602400/1017985577429536