Downloading Files with Python (Large Files and Redirected Files)

Many files on the web are served for download over HTTP. When writing a crawler in Python, fetching such files is a common goal.

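Python has several packages for making HTTP requests:

urllib in the standard library (Python 2 also shipped urllib2), plus the third-party urllib3
requests, a third-party package built on top of urllib3
grequests, which extends requests with gevent to send HTTP requests asynchronously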

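This article uses requests to download files and walks through three examples:

Downloading small files
Downloading large files
Downloading files behind a redirect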

Basic usage of requests for fetching content

The simplest approach is the get function of requests, which fetches a web page in a single call:

import requests
url = 'https://www.baidu.com'
html = requests.get(url)

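The code first imports the requests package, then passes the URL to requests.get. The call returns a Response object, stored here in the variable html.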
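To make the returned object more concrete, here is a minimal sketch that summarizes a Response. The helper name summarize_response is hypothetical and not part of requests; status_code, headers and content are attributes of the Response class.

```python
import requests


def summarize_response(response):
    """Return a few commonly inspected fields of a requests.Response.

    summarize_response is an illustrative helper, not a requests API.
    """
    return {
        "status": response.status_code,  # HTTP status code, e.g. 200
        "content_type": response.headers.get("Content-Type"),  # may be None
        "size": len(response.content),  # body length in bytes
    }


# Usage (performs a real network request, so it is left commented out):
# html = requests.get('https://www.baidu.com', timeout=10)
# print(summarize_response(html))
```

Checking the status code before using the body helps avoid treating an error page as the page you asked for.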

Downloading small files with requests

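In the example above, the response is stored in the html variable, and its body is available as bytes through html.content. When downloading a file, we usually want to save those bytes to disk in binary form.

Open the target file with Python's built-in open function in binary write mode ('wb') and write html.content to it: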

with open('filename.txt', 'wb') as f:
    f.write(html.content)
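The two steps above can be wrapped into one function. This is only a sketch: the name download_small and the 10-second timeout are assumptions for illustration, while raise_for_status and the timeout argument are part of requests.

```python
import requests


def download_small(url, path, timeout=10):
    """Download url fully into memory, then write it to path.

    download_small and its timeout default are assumptions for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx/5xx responses instead of
    # silently saving an error page to disk.
    response.raise_for_status()
    with open(path, 'wb') as f:
        f.write(response.content)
    return len(response.content)  # number of bytes written


# Usage (needs network access):
# download_small('https://www.baidu.com', 'filename.txt')
```

Because the request is made before the file is opened, a failed request leaves no empty file behind.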

Downloading large files with requests

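For large files, simply reading html.content is not a good idea, because it loads the whole response body into memory at once.

Instead, pass stream=True to requests.get so that the body is read as a stream. The Response object provides the iter_content method, which reads the body iteratively. Each iteration yields one block (a chunk), and the chunk_size argument sets how many bytes are read per chunk.

For example: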

# stream=True defers downloading the body until it is iterated
r = requests.get(url, stream=True)
with open('filename.pdf', 'wb') as pypdf:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # skip empty chunks
            pypdf.write(chunk)
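A slightly more defensive variant of the streaming download might look like the sketch below. The function names write_chunks and download_large, the 8192-byte chunk size and the timeout are all assumptions chosen for illustration. The response is used as a context manager so the connection is released even if writing fails.

```python
import requests


def write_chunks(chunks, path):
    """Write an iterable of bytes objects to path and return the byte count."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


def download_large(url, path, chunk_size=8192, timeout=10):
    """Stream url to path without holding the whole body in memory."""
    # Using the response as a context manager releases the connection
    # even if writing fails partway through.
    with requests.get(url, stream=True, timeout=timeout) as r:
        r.raise_for_status()
        return write_chunks(r.iter_content(chunk_size=chunk_size), path)


# Usage (needs network access):
# download_large('http://example.com/file.pdf', 'filename.pdf')
```

Splitting the writing loop into its own function keeps it independent of the network, so it can be exercised with any iterable of bytes.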

Downloading redirected files with requests

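For a file served behind a redirect, requests.get provides the allow_redirects parameter. Setting it to True makes requests follow the redirect to the final file. For GET requests this is already the default, so passing it explicitly mainly documents the intent.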

import requests
url = 'http://example.com/file.pdf'
response = requests.get(url, allow_redirects=True)
with open('filename.pdf', 'wb') as pypdf:
    pypdf.write(response.content)
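After a redirect, the final URL often differs from the one requested, so it can be useful to see which hops were followed and to name the saved file after the final URL. The sketch below assumes two hypothetical helpers, filename_from_url and download_following_redirects, and a fallback name of download.bin; response.history and response.url are attributes provided by requests.

```python
import posixpath
from urllib.parse import urlsplit

import requests


def filename_from_url(url, default='download.bin'):
    """Return the last path segment of url, or default if there is none."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def download_following_redirects(url, timeout=10):
    """Follow redirects, then save the body under the final URL's file name."""
    response = requests.get(url, allow_redirects=True, timeout=timeout)
    response.raise_for_status()
    # response.history lists the intermediate redirect responses, oldest first;
    # response.url is the URL that finally served the content.
    for hop in response.history:
        print(hop.status_code, hop.url)
    path = filename_from_url(response.url)
    with open(path, 'wb') as f:
        f.write(response.content)
    return path


# Usage (needs network access):
# download_following_redirects('http://example.com/file.pdf')
```

Some servers put the real file name in the Content-Disposition header instead of the URL; this sketch does not handle that case.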