selenium+browsermob: capturing network traffic for scraping
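When scraping data from a certain e-commerce site with selenium, the data is rendered by JavaScript and there is also a troublesome verification step. Simulating the verification is difficult, but the relevant data can be obtained by calling specific URLs that show up in the browser's network panel.
So how do you get those network URLs when selenium is driving the browser?
By routing the selenium-driven browser through browsermob proxy, you can capture the network URLs.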
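Install the selenium and browsermob-proxy Python packages (the latter is the browsermob-proxy-py client).
Download a browsermob-proxy release.
Run the code below, remembering to set the path to your browsermob-proxy release.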
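import json
from browsermobproxy import Server
from selenium import webdriver

# Path to the browsermob-proxy launcher; adjust it to your local install
server = Server("C:\\browsermob-proxy-2.0-beta-9\\bin\\browsermob-proxy.bat")
server.start()
proxy = server.create_proxy()
# Route Firefox through the proxy (Selenium 2/3-era profile API)
profile = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)
# Start recording a HAR named "baidu", then load the page
proxy.new_har("baidu")
driver.get("http://www.baidu.com")
# Arguments are (quiet period, timeout), both in milliseconds
proxy.wait_for_traffic_to_stop(1, 60)
with open('1.har', 'w') as outfile:
    # proxy.har holds the captured HAR as JSON-serializable data (a dict)
    json.dump(proxy.har, outfile)
server.stop()
driver.quit()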
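Once the HAR has been captured, the requested URLs can be pulled out of it. The following is a minimal sketch, assuming the standard HAR layout (log, then entries, then request, then url). The names find_request_urls, load_har and the keyword parameter are hypothetical helpers for illustration, not part of browsermob-proxy, and the sample data is hand-written rather than real captured traffic.

```python
import json


def find_request_urls(har, keyword=""):
    """Return the URLs of all requests in a HAR dict whose URL contains keyword."""
    entries = har.get("log", {}).get("entries", [])
    urls = []
    for entry in entries:
        url = entry.get("request", {}).get("url", "")
        if keyword in url:
            urls.append(url)
    return urls


def load_har(path):
    """Load a HAR file that was previously written with json.dump."""
    with open(path) as infile:
        return json.load(infile)


if __name__ == "__main__":
    # Hand-written sample in HAR layout (an assumption, not real traffic).
    sample = {
        "log": {
            "entries": [
                {"request": {"url": "http://example.com/api/items?page=1"}},
                {"request": {"url": "http://example.com/static/logo.png"}},
            ]
        }
    }
    for url in find_request_urls(sample, "api"):
        print(url)
```

In the script above, find_request_urls(proxy.har, "some-keyword") could be called before server.stop(), or load_har('1.har') could be used afterwards to inspect the saved file.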
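Recently, after Firefox was upgraded to version 53, the browsermob proxy stopped having any effect, and downgrading Firefox did not help either. After a lot of trial and error I saw that someone had used PhantomJS, so I switched to PhantomJS and it worked. The essence is to set the browser's proxy and ignore SSL errors. Searching Stack Overflow turns up an explicit recommendation to use Firefox 48, which is reported to run more stably.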
Key points of the PhantomJS proxy setup:
proxy = server.create_proxy()
print(proxy.port)
proxy_address = "--proxy=127.0.0.1:%s" % proxy.port
# --ignore-ssl-errors=yes lets HTTPS pages load through the intercepting proxy
service_args = [proxy_address, '--ignore-ssl-errors=yes']
pjs_url = r"D:\Python27\Scripts\phantomjs-2.1.1-windows\bin\phantomjs.exe"
driver = webdriver.PhantomJS(executable_path=pjs_url, service_args=service_args)
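The argument list above can be wrapped in a small helper so the port from proxy.port is filled in consistently. This is only a sketch: build_phantomjs_args is a hypothetical name, and the default host of 127.0.0.1 assumes the proxy runs on the same machine as the browser. The --proxy and --ignore-ssl-errors options are the same ones used in the snippet above.

```python
def build_phantomjs_args(port, host="127.0.0.1", ignore_ssl_errors=True):
    """Build the PhantomJS command-line arguments for a local proxy.

    Hypothetical helper: host and ignore_ssl_errors are illustrative
    parameters, not part of selenium or browsermob-proxy.
    """
    args = ["--proxy=%s:%s" % (host, port)]
    if ignore_ssl_errors:
        # Needed so HTTPS pages load through the intercepting proxy.
        args.append("--ignore-ssl-errors=yes")
    return args


if __name__ == "__main__":
    # 8081 is a placeholder; in practice pass proxy.port.
    print(build_phantomjs_args(8081))
```

The result would then be passed as service_args when creating the PhantomJS driver, in place of the hand-built list.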
Related resources:
http://stackoverflow.com/questions/26028604/cant-capture-har-using-python-selenium-script-with-browsermob-proxy
geckodriver releases download page: https://github.com/mozilla/geckodriver/releases/