Downloading a lot of PDFs
Sometimes you'll have a big list of PDFs that you want to download. Maybe you'll eventually run them through Tika or tesseract or something, but for now: we need to download them!
Downloading a list of files
The best way to download a list of files is using wget. You can install wget on OS X with brew install wget. On Windows it isn't included by default, so you'll need to install a Windows build of wget separately.
After it's installed, make sure you have a file formatted with one complete URL on each line, like below.
https://www.example.com/file-1.pdf
https://www.example.com/file-2.pdf
https://www.example.com/file-3.pdf
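If you want to double-check the list before downloading, a few lines of Python can flag anything that isn't a complete URL. This is only a rough sketch: the find_bad_lines helper and the sample list are made up for illustration, and it assumes "complete" means an http or https address with a host name.

```python
from urllib.parse import urlparse


def find_bad_lines(lines):
    """Return (line_number, text) pairs for lines that are not complete http(s) URLs."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        parts = urlparse(text)
        # A complete URL needs both a scheme (https://) and a host (www.example.com)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append((number, text))
    return bad


# Hypothetical sample: the second entry is missing its https:// prefix
sample = [
    "https://www.example.com/file-1.pdf",
    "www.example.com/file-2.pdf",
    "https://www.example.com/file-3.pdf",
]
print(find_bad_lines(sample))
```

To check a real file, you could pass it an open file instead of the sample list, since iterating over a file gives you one line at a time.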
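Then you can use the wget command with the -i flag, which reads URLs from a file, to download each one of the files. Assuming your list is saved as urls.txt:
wget -i urls.txt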
Creating a list of files from a dataframe
Maybe you have a dataframe where one of the columns is your URL.
name | code | url |
---|---|---|
ABC | 123 | https://www.example.com/file-1.pdf |
XYZ | 456 | https://www.example.com/file-2.pdf |
LMN | 789 | https://www.example.com/file-3.pdf |
How do you export the url column into a list of URLs? You just need to tell to_csv to only save the column you're interested in, and not to write any header or index information.
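Here's roughly what that export could look like. The dataframe below is rebuilt by hand to match the example table (in practice you'd already have your own df), and urls.txt is just the filename assumed for the output.

```python
import pandas as pd

# Hypothetical dataframe mirroring the example table
df = pd.DataFrame({
    "name": ["ABC", "XYZ", "LMN"],
    "code": [123, 456, 789],
    "url": [
        "https://www.example.com/file-1.pdf",
        "https://www.example.com/file-2.pdf",
        "https://www.example.com/file-3.pdf",
    ],
})

# Save only the url column, skipping the header row and the index column,
# so the file ends up with one URL per line
df["url"].to_csv("urls.txt", header=False, index=False)
```

Leaving out the header and index matters here: without those options the file would have extra text on each line instead of just the URLs.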
Now you're all set to use wget -i urls.txt!