1. Creating the Lagou spider project: using CrawlSpider
Recommended tool: cmder, download at: http:/// → get the full version, so that some Linux commands are also available on Windows.
In a terminal/cmder, cd into the project and run `scrapy genspider --list` to see the available spider templates:
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
# -t means: generate the spider from the named template
scrapy genspider -t crawl lagou www.lagou.com
# with no template specified, the basic template is used by default
scrapy genspider lagou www.lagou.com
Create the spider from the crawl template:
scrapy genspider -t crawl lagou www.lagou.com

This generates a lagou.py file. Inside it, the class is LagouSpider(CrawlSpider): it inherits from CrawlSpider rather than the basic template's scrapy.Spider. (Note that CrawlSpider itself subclasses Scrapy's Spider.)
lagou.py:


from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
For background on site-wide crawling with CrawlSpider, see this link:
Crawling Lagou data with CrawlSpider: a test
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        # 3 rules
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),    # follow every url under zhaopin
        Rule(LinkExtractor(allow=r'gongsi/\d+.html'), follow=True),  # follow every url under gongsi
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),  # every url under jobs
    )

    def parse_job(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
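To sanity-check which urls each Rule will pick up, the allow patterns can be tried against sample urls with plain `re`. This is only a sketch: it assumes the `\d` escapes shown above and tests the regexes alone, not LinkExtractor's full extraction logic.

```python
import re

# The allow patterns from the rules above (zhaopin and jobs shown here).
patterns = {
    "zhaopin": r"zhaopin/.*/",
    "jobs": r"jobs/\d+.html",
}

def matching_rules(url):
    """Names of the rules whose allow pattern matches somewhere in the url."""
    return [name for name, pat in patterns.items() if re.search(pat, url)]

print(matching_rules("https://www.lagou.com/zhaopin/Java/"))      # ['zhaopin']
print(matching_rules("https://www.lagou.com/jobs/4804576.html"))  # ['jobs']
print(matching_rules("https://www.lagou.com/about/"))             # []
```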
Set a breakpoint and run under the debugger.

Crawl results:
link:
Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False)  # each url with its anchor text
links:
# the extracted urls → everything under https://www.lagou.com/zhaopin/
<class 'list'>: [Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/PHP/', text='PHP', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/C++/', text='C++', fragment='', nofollow=False),
Link(url='https://www.lagou.com/zhaopin/qukuailian/', text='区块链', fragment='', nofollow=False),
......
response:
# https://www.lagou.com/: the page that was fetched
<200 https://www.lagou.com/>
seen:
# every extracted url that matches a rule is added to the seen set
{Link(url='https://www.lagou.com/zhaopin/Java/', text='Java', fragment='', nofollow=False)}

On ordering when there are multiple Rules: Scrapy crawls asynchronously, so the Rules are not applied in any fixed order.
→1. The start_urls page is crawled first (call it the first level; a single url). Its response contains all the links on that page.
→2. Each url found there (the second level, i.e. all the links on the first-level page) is checked against the user-defined Rules. The Rules have no precedence: as soon as a url matches any one of them, it is requested under that Rule, and once the data comes back that Rule's callback is invoked.
→3. Then follow is checked. If it is True, the crawl goes one level deeper (the third level: the links on each second-level page); if False, no further crawling happens from there.
→4. Finally, the item data is processed and written to the database, completing the Lagou crawl.
One point about follow that my earlier post on the CrawlSpider source did not cover: if a Rule sets follow=False, crawling stops after the second-level urls have been fetched, not immediately after the first (home) page. The reason, from the source:
The crawl starts in start_requests, which then leads to parse; parse passes follow=True down, and whether to go a level deeper is decided by this line in _parse_response:
if follow and self._follow_links:
On the first call, the follow reaching this check is the hard-coded True from parse, so the check passes regardless of the Rule's follow, and the second level is crawled. Only on the next call to _parse_response does follow carry the user-defined value from the Rule, and that is when follow=False stops any deeper crawling.
While crawling Lagou urls we hit 302 redirects (login required). We can use custom_settings to override some default settings for this spider only. For Lagou, either log in to obtain cookies, or simply paste the cookies into custom_settings by hand.
Testing shows two things are mandatory for Lagou: the Cookie and the User-Agent. Put both into custom_settings. There are several ways to get them:
1. Add the cookies to custom_settings manually: open the site in a browser (a plain GET is enough), then copy the Cookie and User-Agent values from the developer tools into custom_settings:
# Cookie and User-Agent are required
custom_settings = {
    "COOKIES_ENABLED": False,
    # "DOWNLOAD_DELAY": 1,  # delay in seconds
    'DEFAULT_REQUEST_HEADERS': {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'Connection': 'keep-alive',
        'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
        'Host': 'www.lagou.com',
        'Origin': 'https://www.lagou.com',
        'Referer': 'https://www.lagou.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
    }
}
2. Log in with selenium, collect the cookies, write them to a file, and load them into custom_settings. This automated approach is the recommended one; readers should try it themselves. Since this post is only a test I did not need it and have not specifically tested it. Here is some code from the web for reference:


from selenium import webdriver
import time


def login_lagou():
    browser = webdriver.Chrome(executable_path="D:/chromedriver.exe")
    browser.get("https://passport.lagou.com/login/login.html")
    # fill in the username and password
    browser.find_element_by_css_selector(
        "body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(1) > input"
    ).send_keys("username")
    browser.find_element_by_css_selector(
        "body > section > div.left_area.fl > div:nth-child(2) > form > div:nth-child(2) > input"
    ).send_keys("password")
    # click the login button
    browser.find_element_by_css_selector(
        "body > section > div.left_area.fl > div:nth-child(2) > form > div.input_item.btn_group.clearfix > input"
    ).click()
    cookie_dict = {}
    time.sleep(3)
    Cookies = browser.get_cookies()
    for cookie in Cookies:
        cookie_dict[cookie['name']] = cookie['value']
    # browser.quit()
    return cookie_dict
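If you go this route, the cookie_dict returned by login_lagou() still has to become a single Cookie header string before it can be pasted into DEFAULT_REQUEST_HEADERS. A small helper can do the join (a sketch; the cookie values below are made up):

```python
def cookies_to_header(cookie_dict):
    """Join a {name: value} cookie dict into one Cookie header string."""
    return "; ".join("{0}={1}".format(name, value) for name, value in cookie_dict.items())

# Hypothetical values standing in for what get_cookies() would return.
cookies = {"JSESSIONID": "ABAA123", "user_trace_token": "20181112003151"}
print(cookies_to_header(cookies))  # JSESSIONID=ABAA123; user_trace_token=20181112003151
```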
Logging in through the requests library, or simulating the login with Scrapy's own Request, also works. There are many options; experiment if you are interested.
Test code:


# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=False),
        # Rule(LinkExtractor(allow=r'gongsi/\d+.html'), follow=False),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        # "DOWNLOAD_DELAY": 1,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': 'JSESSIONID=ABAAABAABEEAAJAF08A698E7D4CC5B1B474ED6DDA70F780; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902; _ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGSID=20181112003151-4b03aeb8-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; TG-TRACK-CODE=index_navigation; SEARCH_ID=6edac9dbb3714a8780795d56ccdc7f78; LGRID=20181112013543-36e69b49-e5d8-11e8-888e-5254005c3644; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541957735',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
    }

    # headers = {
    #     "HOST": "www.lagou.com",
    #     'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36"
    # }
    #
    # def _build_request(self, rule, link):
    #     r = Request(url=link.url, callback=self._response_downloaded, headers=self.headers)
    #     r.meta.update(rule=rule, link_text=link.text)
    #     return r

    def parse_job(self, response):
        i = {}
        # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        # i['name'] = response.xpath('//div[@id="name"]').extract()
        # i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
With the lagou.py spider in place, the next step is pinning down the crawl targets. For every job posting on Lagou we want the following fields:
class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()           # title
    url = scrapy.Field()
    url_object_id = scrapy.Field()   # md5 hash of the url
    salary = scrapy.Field()          # salary
    job_city = scrapy.Field()        # city
    work_years = scrapy.Field()      # years of experience
    degree_need = scrapy.Field()     # education required
    job_type = scrapy.Field()        # job type (full-time/part-time)
    publish_time = scrapy.Field()    # publish time
    job_advantage = scrapy.Field()   # perks
    job_desc = scrapy.Field()        # job description
    job_addr = scrapy.Field()        # work address
    company_name = scrapy.Field()    # company name
    company_url = scrapy.Field()     # company url
    tags = scrapy.Field()            # job tags
    crawl_time = scrapy.Field()      # crawl time
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()


def parse_job(self, response):
    # parse a Lagou job posting
    item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
    item_loader.add_css("title", ".job-name::attr(title)")
    item_loader.add_value("url", response.url)
    item_loader.add_value("url_object_id", get_md5(response.url))
    item_loader.add_css("salary", ".job_request .salary::text")
    item_loader.add_css("job_city", ".job_request span:nth-child(2)::text")  # the second span under .job_request
    item_loader.add_css("work_years", ".job_request span:nth-child(3)::text")
    item_loader.add_css("degree_need", ".job_request span:nth-child(4)::text")
    item_loader.add_css("job_type", ".job_request span:nth-child(5)::text")
    item_loader.add_css("tags", ".position-label li::text")
    item_loader.add_css("publish_time", ".publish_time::text")
    item_loader.add_css("job_advantage", ".job-advantage p::text")
    item_loader.add_css("job_desc", ".job_bt div")
    item_loader.add_css("job_addr", ".work_addr")
    item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
    item_loader.add_css("company_url", "#job_company dt a::attr(href)")
    item_loader.add_value("crawl_time", datetime.now())

    lagou_job_item = item_loader.load_item()
    return lagou_job_item
Then debug a crawl; the resulting item looks like this:

import re


def remove_splash(value):
    # strip '/' characters
    return value.replace("/", "")


def time_split(value):
    # split on spaces and return the time, e.g. publish_time: "13:55 发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]


def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(r".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year


def get_job_addr(value):
    # join the address lines and drop the useless parts
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)


def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)
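These helpers are easy to exercise in isolation on sample values of the kind the loader would feed them. Redefined inline here so the snippet runs standalone, with the `\d` escapes restored in the regex:

```python
import re

def remove_splash(value):
    return value.replace("/", "")

def time_split(value):
    return value.split(" ")[0]

def get_word_year(value):
    match_re = re.match(r".*?((\d+)-?(\d*)).*", value)
    return match_re.group(1) if match_re else "经验不限"

def get_job_addr(value):
    addr_list = value.split("\n")
    return "".join(item.strip() for item in addr_list if item.strip() != '查看地图')

print(remove_splash("广州/"))                 # 广州
print(time_split("13:55 发布于拉勾网"))        # 13:55
print(get_word_year("经验3-5年"))             # 3-5
print(get_word_year("经验不限"))              # 经验不限 (no digits, falls through)
print(get_job_addr("广州 - 天河区\n查看地图"))  # 广州 - 天河区
```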
Applied in the LagouJobItem class in items.py:
# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()


class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()           # title
    url = scrapy.Field()
    url_object_id = scrapy.Field()   # md5 hash of the url
    salary = scrapy.Field()          # salary
    job_city = scrapy.Field(         # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(       # years of experience
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(      # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()        # job type (full-time/part-time)
    publish_time = scrapy.Field(     # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()   # perks
    job_desc = scrapy.Field(         # job description
        input_processor=MapCompose(remove_tags, get_job_desc)
    )
    job_addr = scrapy.Field(         # work address
        input_processor=MapCompose(remove_tags, get_job_addr)
    )
    company_name = scrapy.Field()    # company name
    company_url = scrapy.Field()     # company url
    tags = scrapy.Field(             # job tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field()      # crawl time
Debug again; the crawled data now has exactly the shapes we want:

1) First, create the table and set the column types.
Table name: lagou_job

2) Writing the data to the database
Previously, the insert/update/delete/select logic lived directly inside the MysqlTwistedPipeline class in pipelines.py, which hard-codes it. In a project that crawls several sites, each site's data needs its own persistence logic, so that approach no longer fits.
The fix is simple: the asynchronous database plumbing is identical for every spider; only the SQL differs. We therefore move the SQL into each spider's corresponding Item. Concretely:
The old insert code:


import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):  # reads the settings; runs before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # rows come back as dicts
            use_unicode=True,
        )
        # adbapi.ConnectionPool: twisted's asynchronous connection pool; pass the
        # DB driver module name plus the connection parameters to connect to mysql
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparm)
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.sql_insert, item)  # run the SQL asynchronously
        query.addErrback(self.handle_error, item, spider)  # error handling
        return item  # pass the item on to later pipelines

    def handle_error(self, failure, item, spider):
        # handle errors
        print("error:", failure)

    def sql_insert(self, cursor, item):
        # the actual insert
        insert_sql = """
            insert into article_spider(title, url, create_date, fav_nums, url_object_id)
            VALUES (%s, %s, %s, %s, %s)
        """
        cursor.execute(insert_sql,
                       (item["title"], item["url"], item["create_date"], item["fav_nums"], item["url_object_id"]))
The new version:
First, move the insert logic into the corresponding item class:
# items.py / LagouJobItem
def get_insert_sql(self):
    insert_sql = """
        insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
        job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
        tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
    """
    params = (
        self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
        self["work_years"], self["degree_need"], self["job_type"],
        self["publish_time"], self["job_advantage"], self["job_desc"],
        self["job_addr"], self["company_name"], self["company_url"],
        self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),  # fixed: tags, not job_addr a second time
    )
    return insert_sql, params
What ON DUPLICATE KEY UPDATE does: if the inserted row's primary key already exists (a conflict), the listed columns are updated instead of a duplicate row being inserted.
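MySQL's ON DUPLICATE KEY UPDATE can be demonstrated without a MySQL server: SQLite (3.24+) has the analogous ON CONFLICT ... DO UPDATE clause, so this runnable sketch shows the same insert-or-update behaviour. The table and values are simplified stand-ins for lagou_job.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table lagou_job (url_object_id text primary key, salary text)")

# SQLite's equivalent of MySQL's ON DUPLICATE KEY UPDATE
upsert_sql = """
    insert into lagou_job (url_object_id, salary) values (?, ?)
    on conflict(url_object_id) do update set salary=excluded.salary
"""
conn.execute(upsert_sql, ("abc123", "10k-15k"))
conn.execute(upsert_sql, ("abc123", "15k-20k"))  # same primary key: row is updated, not duplicated

rows = conn.execute("select url_object_id, salary from lagou_job").fetchall()
print(rows)  # [('abc123', '15k-20k')]
```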
Next, change the do_insert function in MysqlTwistedPipeline:
def do_insert(self, cursor, item):
    # perform the actual insert
    # each item builds its own sql statement, which is then executed against mysql
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)
In setting.py:
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,  # option 3: save item data asynchronously to the database
}
Debug again and watch the lagou_job table; rows keep being written:

1. setting.py


import os

BOT_NAME = 'ArticleSpider'

SPIDER_MODULES = ['ArticleSpider.spiders']
NEWSPIDER_MODULE = 'ArticleSpider.spiders'

ROBOTSTXT_OBEY = False

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
    # 'ArticleSpider.pipelines.ArticleImagePipeline': 2,  # used for image downloads
    # 'ArticleSpider.pipelines.JsonWithEncodingPipeline': 3,  # option 1: save item data to json, runs after image download
    'ArticleSpider.pipelines.MysqlTwistedPipeline': 4,  # option 3: save item data asynchronously to the database
    # 'ArticleSpider.pipelines.JsonExporterPipleline': 3,  # option 2: scrapy's JsonItemExporter, saves item data as json
    # 'scrapy.pipelines.images.ImagesPipeline': 1,  # scrapy's built-in pipeline for image/media downloads
}

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
IMAGES_STORE = os.path.join(BASE_DIR, 'images')  # fixed setting name; where downloaded images are stored
IMAGES_URLS_FIELD = "acticle_image_url"  # fixed setting name; the item field holding the image urls to download
ITEM_DATA_DIR = os.path.join(BASE_DIR, "item_data")  # item data is saved to the local item_data folder

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# mysql configuration
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "0315"

SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
2. common.py


# md5 hashing
import hashlib


def get_md5(url):
    if isinstance(url, str):  # in Python 3, unicode is str
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()
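A quick usage check of get_md5. The exact digest is not shown here since only its shape matters: a stable 32-character hex string per url, which is why it works as a deduplicating primary key.

```python
import hashlib

def get_md5(url):
    if isinstance(url, str):  # in Python 3, unicode is str
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

digest = get_md5("http://www.lagou.com/")
print(len(digest))                                     # 32
print(digest == get_md5("http://www.lagou.com/"))      # True: same url, same id
print(digest == get_md5("http://www.lagou.com/jobs"))  # False: different url
```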
3. items.py


import re

import scrapy  # missing from the original snippet; scrapy.Item and scrapy.Field are used below
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join
from w3lib.html import remove_tags  # strips HTML tags

from ArticleSpider.settings import SQL_DATETIME_FORMAT, SQL_DATE_FORMAT


def return_value(value):
    return value


def remove_splash(value):
    return value.replace("/", "")


def time_split(value):
    # split on spaces and return the time, e.g. publish_time: "13:55 发布于拉勾网"
    value_list = value.split(" ")
    return value_list[0]


def get_word_year(value):
    # extract the years-of-experience range
    match_re = re.match(r".*?((\d+)-?(\d*)).*", value)
    if match_re:
        word_year = match_re.group(1)
    else:
        word_year = "经验不限"
    return word_year


def get_job_addr(value):
    # join the address lines and drop the useless parts
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != '查看地图']
    return "".join(addr_list)


def get_job_desc(value):
    # join the job-description lines
    desc_list = value.split("\n")
    desc_list = [item.strip() for item in desc_list]
    return "".join(desc_list)


# Lagou items
class LagouJobItemLoader(ItemLoader):
    # custom ItemLoader for Lagou
    default_output_processor = TakeFirst()


class LagouJobItem(scrapy.Item):
    # Lagou job posting
    title = scrapy.Field()           # title
    url = scrapy.Field()
    url_object_id = scrapy.Field()   # md5 hash of the url
    salary = scrapy.Field()          # salary
    job_city = scrapy.Field(         # city
        input_processor=MapCompose(remove_splash)
    )
    work_years = scrapy.Field(       # years of experience
        input_processor=MapCompose(get_word_year)
    )
    degree_need = scrapy.Field(      # education required
        input_processor=MapCompose(remove_splash)
    )
    job_type = scrapy.Field()        # job type (full-time/part-time)
    publish_time = scrapy.Field(     # publish time
        input_processor=MapCompose(time_split)
    )
    job_advantage = scrapy.Field()   # perks
    job_desc = scrapy.Field(         # job description
        input_processor=MapCompose(remove_tags, get_job_desc)
    )
    job_addr = scrapy.Field(         # work address
        input_processor=MapCompose(remove_tags, get_job_addr)
    )
    company_name = scrapy.Field()    # company name
    company_url = scrapy.Field()     # company url
    tags = scrapy.Field(             # job tags
        input_processor=Join("-")
    )
    crawl_time = scrapy.Field()      # crawl time

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"],
            self["publish_time"], self["job_advantage"], self["job_desc"],
            self["job_addr"], self["company_name"], self["company_url"],
            self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),  # fixed: tags, not job_addr a second time
        )
        return insert_sql, params
4. pipelines.py


import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):  # reads the settings; runs before process_item
        dbparm = dict(
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,  # rows come back as dicts
            use_unicode=True,
        )
        # adbapi.ConnectionPool: twisted's asynchronous connection pool; pass the
        # DB driver module name plus the connection parameters to connect to mysql
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparm)
        return cls(dbpool)  # instantiate MysqlTwistedPipeline

    def process_item(self, item, spider):
        # called for every item
        query = self.dbpool.runInteraction(self.do_insert, item)  # run the SQL asynchronously
        query.addErrback(self.handle_error, item, spider)  # error handling
        return item  # pass the item on to later pipelines

    def handle_error(self, failure, item, spider):
        # handle errors from the async insert
        print(failure)

    def do_insert(self, cursor, item):
        # perform the actual insert
        # each item builds its own sql statement, which is then executed against mysql
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
5. lagou.py


from datetime import datetime

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ArticleSpider.items import LagouJobItem, LagouJobItemLoader
from ArticleSpider.util.common import get_md5


class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'zhaopin/.*/'), follow=True),
        Rule(LinkExtractor(allow=r'gongsi/\d+.html'), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\d+.html'), callback='parse_job', follow=True),
    )

    custom_settings = {
        "COOKIES_ENABLED": False,
        "DOWNLOAD_DELAY": 3,
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Connection': 'keep-alive',
            'Cookie': '_ga=GA1.2.1358601872.1541953903; user_trace_token=20181112003151-4b03ac83-e5cf-11e8-8882-5254005c3644; LGUID=20181112003151-4b03b056-e5cf-11e8-8882-5254005c3644; _gid=GA1.2.1875637681.1541953903; index_location_city=%E5%B9%BF%E5%B7%9E; JSESSIONID=ABAAABAAAGGABCB641A801FD52253622370040445465BDC; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541953902,1541989806; TG-TRACK-CODE=index_navigation; SEARCH_ID=0ee1c4af2c2d47dc84be450da8c8c8fc; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1541992192; LGRID=20181112111000-70d55352-e628-11e8-9b85-525400f775ce',
            'Host': 'www.lagou.com',
            'Origin': 'https://www.lagou.com',
            'Referer': 'https://www.lagou.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36',
        }
    }

    def parse_job(self, response):
        # parse a Lagou job posting
        item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
        item_loader.add_css("title", ".job-name::attr(title)")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("salary", ".job_request .salary::text")
        item_loader.add_css("job_city", ".job_request span:nth-child(2)::text")  # the second span under .job_request
        item_loader.add_css("work_years", ".job_request span:nth-child(3)::text")
        item_loader.add_css("degree_need", ".job_request span:nth-child(4)::text")
        item_loader.add_css("job_type", ".job_request span:nth-child(5)::text")
        item_loader.add_css("tags", ".position-label li::text")
        item_loader.add_css("publish_time", ".publish_time::text")
        item_loader.add_css("job_advantage", ".job-advantage p::text")
        item_loader.add_css("job_desc", ".job_bt div")
        item_loader.add_css("job_addr", ".work_addr")
        item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url", "#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time", datetime.now())

        lagou_job_item = item_loader.load_item()
        return lagou_job_item
6. main.py


import os
import sys

from scrapy.cmdline import execute  # this import was missing from the original snippet

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # add the parent path to sys.path
execute(['scrapy', 'crawl', 'lagou'])  # runs: scrapy crawl lagou; 'lagou' is the name of LagouSpider in lagou.py
Crawler: a program that fetches data automatically; the point is fetching it in bulk.
Anti-crawling: technical measures that stop crawler programs.
False positives: anti-crawling that flags ordinary users as crawlers; however effective it is otherwise, that makes it unusable.
Cost: the human and machine cost of anti-crawling.
Interception: the higher the interception rate, the higher the false-positive rate.
Goals of anti-crawling:

The crawler vs. anti-crawler arms race:

Scrapy ships a default user-agent middleware (the UA defaults to 'Scrapy'); to rotate user-agents randomly we need our own UserAgentMiddleware. First, enable DOWNLOADER_MIDDLEWARES in the settings:
DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
}
Then set Scrapy's built-in UserAgentMiddleware to None so it is skipped:
DOWNLOADER_MIDDLEWARES = {
    'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default user-agent middleware
}
Next, create a new class in middlewares.py: RandomUserAgentMiddlware.
Before that, install the fake-useragent package (on GitHub); it maintains a long list of user-agent strings, see its README for details.
Install it with `pip install fake-useragent`, then import it in the project.
The RandomUserAgentMiddlware class:
from fake_useragent import UserAgent  # from the fake-useragent package


class RandomUserAgentMiddlware(object):
    # rotate user-agents randomly
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()  # instantiate UserAgent
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")  # ua type from settings (firefox, chrome, ie or random)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured type to the matching attribute

        request.headers.setdefault('User-Agent', get_ua())  # set the header
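The getattr() trick in get_ua() is just attribute lookup driven by a settings string. A stub class, standing in for fake_useragent.UserAgent with made-up strings, makes the mapping visible without any network access:

```python
class StubUserAgent(object):
    """Stand-in for fake_useragent.UserAgent; the strings are made up."""
    chrome = "Mozilla/5.0 (stub) Chrome/70.0"
    firefox = "Mozilla/5.0 (stub) Firefox/63.0"
    random = "Mozilla/5.0 (stub) whatever-came-up"

ua = StubUserAgent()
ua_type = "chrome"  # what the RANDOM_UA_TYPE setting would supply
print(getattr(ua, ua_type))  # Mozilla/5.0 (stub) Chrome/70.0
```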
Configure in setting.py:
1) the user-agent type:
RANDOM_UA_TYPE = "random"
2) the DOWNLOADER_MIDDLEWARES entry:
DOWNLOADER_MIDDLEWARES = {
    # 'ArticleSpider.middlewares.ArticlespiderDownloaderMiddleware': 543,
    'ArticleSpider.middlewares.RandomUserAgentMiddlware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
With that, random user-agent switching is in place.
If, afterwards, debugging or running raises fake_useragent.errors.FakeUserAgentError: Maximum amount of retries reached, try the following in turn.
If the first two approaches do not work, construct the UserAgent like this:
ua = UserAgent(verify_ssl=False)
fake-useragent keeps its user-agent list on an online page, and the list URL an old version depends on may return a 404.
Update the cached list:
ua.update()
List all known user-agents:
ua.data_browsers
Re-run under debug: a randomly chosen user-agent is added to the headers of each request:

3. Building an IP proxy pool from Xici, for rotating proxies
Changing your own IP: e.g. restarting the router (with a dynamic IP).
How an IP proxy works: rather than sending requests from your real IP, you route them through an intermediary (a proxy server); the target server never sees your IP, so it cannot ban you.
Test: using a proxy is trivial. Add one line to process_request in the RandomUserAgentMiddlware class defined above:
request.meta["proxy"] = "http://118.190.95.35:9001"  # a Xici proxy: ip 118.190.95.35, port 9001, type HTTP

This way, every request the spider makes reaches the server through the proxy.
The above is only the simplest proxy setup. A single fixed proxy IP is still easy to detect, so, like the random user-agent above, we should rotate proxies randomly; that greatly lowers the chance of being caught by anti-crawling.
First we need a script that scrapes Xici's proxy listings into a file or database on our side.

Create a tools package to hold such scripts.
Inside tools, add crawl_xici_ip.py: it scrapes Xici's proxy data (ip, port, protocol, response time, etc.), stores it in the database, and hands proxies back on demand (the proxy pool).
1) Scrape the data
import requests
from scrapy.selector import Selector

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}

for i in range(200):
    rep = requests.get("http:///nn/{0}".format(i), headers=headers)
    # print(rep)
    selector = Selector(text=rep.text)  # hand the response text to a Selector
    all_trs = selector.css("#ip_list tr")
    ip_list = []
    for tr in all_trs[1:]:  # pull ip, port etc. out of each Xici table row
        ip = tr.css("td:nth-child(2)::text").extract_first('')
        port = tr.css("td:nth-child(3)::text").extract_first('')
        anony_type = tr.css("td:nth-child(5)::text").extract_first('')
        proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
        speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
        if speed_str:
            speed = float(speed_str.split("秒")[0])
        else:
            speed = 9999.0
        ip_list.append((ip, port, anony_type, proxy_type, speed))  # collect the row
2) Store it in the database
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="*******", db="article_spider", charset="utf8")
cursor = conn.cursor()

for ip_info in ip_list:
    cursor.execute(
        "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
            ip_info[0], ip_info[1], ip_info[2], ip_info[3], ip_info[4]
        )
    )
    conn.commit()
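One caveat with the insert above: it interpolates the scraped values straight into the SQL string with .format(), which breaks on quotes and invites SQL injection. Database drivers accept parameterized queries instead. A runnable sketch with sqlite3 (MySQLdb works the same way, except it uses %s placeholders rather than ?):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table proxy_ip (ip text, port text, anony_type text, proxy_type text, speed real)"
)

ip_list = [
    ("118.190.95.35", "9001", "高匿", "HTTP", 0.123),
    ("118.190.95.43", "9001", "高匿", "HTTP", 0.5),
]
insert_sql = "insert into proxy_ip (ip, port, anony_type, proxy_type, speed) values (?, ?, ?, ?, ?)"
conn.executemany(insert_sql, ip_list)  # the driver quotes and escapes each value
conn.commit()

count = conn.execute("select count(*) from proxy_ip").fetchone()[0]
print(count)  # 2
```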
3) Pull a row back out of the database and test its (ip, port). If it works, return it; if not, delete that row and pick another, looping until a usable one turns up.
class GetIP(object):
    def delete_ip(self, ip):
        # remove a dead ip from the database
        delete_sql = """
            delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # hit Baidu through the proxy to check whether the ip still works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)  # the proxy address
        try:
            proxy_dict = {
                "http": proxy_url,  # proxies must be a dict of scheme -> proxy url, e.g. {"http": "http://ip:port"}
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pull one random (ip, port) row out of mysql
        random_sql = """
            select ip, port from proxy_ip where proxy_type='http'
            order by RAND()
            limit 1
        """
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_re = self.judge_ip(ip, port)
            if judge_re:  # the check passed; this ip/port is usable
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()  # dead ip; pick another at random
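One thing to watch in get_random_ip(): every failed proxy triggers a recursive call, and a long run of dead free proxies can in the worst case exceed Python's recursion limit. An iterative variant with a retry cap avoids that. This is a sketch; fetch_candidate and judge below are made-up stand-ins for the database lookup and judge_ip:

```python
def get_random_ip_iterative(fetch_candidate, judge, max_tries=50):
    """Try up to max_tries candidates; return a proxy url or None."""
    for _ in range(max_tries):
        ip, port = fetch_candidate()
        if judge(ip, port):
            return "http://{0}:{1}".format(ip, port)
    return None  # give up instead of recursing indefinitely

# Simulated pool: the first two candidates are dead, the third responds.
candidates = iter([("1.1.1.1", "80"), ("2.2.2.2", "80"), ("3.3.3.3", "8080")])
proxy = get_random_ip_iterative(lambda: next(candidates), lambda ip, port: ip == "3.3.3.3")
print(proxy)  # http://3.3.3.3:8080
```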
The complete crawl_xici_ip.py:


import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="0315", db="article_spider", charset="utf8")
cursor = conn.cursor()


def crawl_ips():
    # scrape Xici's proxy listings and store them in the database
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
    }
    for i in range(200):
        rep = requests.get("http:///nn/{0}".format(i), headers=headers)
        # print(rep)
        selector = Selector(text=rep.text)  # hand the response text to a Selector
        all_trs = selector.css("#ip_list tr")
        ip_list = []
        for tr in all_trs[1:]:  # pull ip, port etc. out of each Xici table row
            ip = tr.css("td:nth-child(2)::text").extract_first('')
            port = tr.css("td:nth-child(3)::text").extract_first('')
            anony_type = tr.css("td:nth-child(5)::text").extract_first('')
            proxy_type = tr.css("td:nth-child(6)::text").extract_first('')
            speed_str = tr.css("td:nth-child(8) div::attr(title)").extract_first('')
            if speed_str:
                speed = float(speed_str.split("秒")[0])
            else:
                speed = 9999.0
            ip_list.append((ip, port, anony_type, proxy_type, speed))  # collect the row
        for ip_info in ip_list:
            cursor.execute(
                "insert into proxy_ip (ip,port,anony_type,proxy_type,speed) values('{0}','{1}','{2}','{3}',{4})".format(
                    ip_info[0], ip_info[1], ip_info[2], ip_info[3], ip_info[4]
                )
            )
            conn.commit()


class GetIP(object):
    def delete_ip(self, ip):
        # remove a dead ip from the database
        delete_sql = """
            delete from proxy_ip where ip='{0}'
        """.format(ip)
        cursor.execute(delete_sql)
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # hit Baidu through the proxy to check whether the ip still works
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)  # the proxy address
        try:
            proxy_dict = {
                "http": proxy_url,  # proxies must be a dict of scheme -> proxy url, e.g. {"http": "http://ip:port"}
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception as e:
            print("Invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print("Effective ip")
                return True
            else:
                print("Invalid ip and port")
                self.delete_ip(ip)
                return False

    def get_random_ip(self):
        # pull one random (ip, port) row out of mysql
        random_sql = """
            select ip, port from proxy_ip where proxy_type='http'
            order by RAND()
            limit 1
        """
        cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]
            judge_re = self.judge_ip(ip, port)
            if judge_re:  # the check passed; this ip/port is usable
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()  # dead ip; pick another at random
Use the random proxy directly inside RandomUserAgentMiddlware:
from ArticleSpider.tools.crawl_xici_ip import GetIP  # our script in tools/crawl_xici_ip.py

    def process_request(self, request, spider):
        def get_ua():
            return getattr(self.ua, self.ua_type)  # map the configured ua type to the matching attribute

        request.headers.setdefault('User-Agent', get_ua())  # set the header
        request.meta["proxy"] = self.get_ip.get_random_ip()  # random proxy from the pool (assumes self.get_ip = GetIP() in __init__)
With that, proxying through Xici's data is done. In practice, though, Xici's free proxies are unstable; if reliability matters, use a paid proxy service instead.
scrapy-crawlera setup:
1) Install:
pip install scrapy-crawlera
2) Configuration (setting.py):
DOWNLOADER_MIDDLEWARES = {
    ...
    'scrapy_crawlera.CrawleraMiddleware': 610
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = 'apikey'  # apikey: issued when you register on the site (now a paid service)
2.1) Instead of the settings-based setup above, you can also enable it per spider:
class MySpider:
    crawlera_enabled = True
    crawlera_apikey = 'apikey'
3) Use it on individual requests:
scrapy.Request(
    'http://example.com',
    headers={
        'X-Crawlera-Max-Retries': 1,  # crawlera-specific header
        ...
    },
)