Repository of scripts that scrape news headlines from Google News, prepare them for readability analysis, and visualize the results aggregated by news outlet. The scripts and their output are described in this blog post. The script googlenews.py scrapes news headlines and the names of their outlets from the Google News homepage on a set schedule; a minimal sketch of that step follows the field list below.
Each scraped article has the following fields:
- title: Title of the article
- datetime: Publication date
- content: Full content (text format)
- link: URL where the article was published
- keyword: Google News keyword used to find this article
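As a rough sketch of that scraping step (a minimal illustration assuming a `requests` + `BeautifulSoup` approach; the selectors are assumptions, since Google News markup changes often and the repository may parse it differently):

```python
import requests
from bs4 import BeautifulSoup

def scrape_google_news_homepage():
    # Fetch the homepage with a browser-like User-Agent so Google serves HTML.
    response = requests.get(
        "https://news.google.com/",
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    articles = []
    # <article> elements holding a headline link -- an assumed selector.
    # Outlet-name extraction is omitted here for brevity.
    for node in soup.select("article"):
        link = node.find("a", href=True)
        if link and link.get_text(strip=True):
            articles.append({
                "title": link.get_text(strip=True),
                "link": link["href"],
            })
    return articles
```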
## How many articles can I fetch with this scraper?

There is no hard upper bound, but expect on the order of 100,000 articles per day when scraping 24/7 with a VPN enabled.

## How to get started?
## Output example
Article 1
Article 2
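For illustration, a record following the schema above might look like this (every value below is invented, not real scraped data):

```python
article = {
    "title": "Example headline about renewable energy",
    "datetime": "2018-02-20 09:15:00",  # format is an assumption
    "content": "Full text of the article, truncated for readability...",
    "link": "https://example.com/news/renewable-energy-article",
    "keyword": "renewable energy",
}
```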
NOTE: The field `content` was truncated to improve readability.

## Configuration
- `SLEEP_TIME_EVERY_TEN_ARTICLES_IN_SECONDS`: Sleep time between two calls to Google News. On average, 10 articles are fetched per call. Default: 1 second.
- `ARTICLE_COUNT_LIMIT_PER_KEYWORD`: Maximum number of articles fetched for one keyword. Default: 300. I tried values up to 600 and it worked.
- `RUN_POST_PROCESSING`: Post-processing means opening each article's URL and extracting its content. For maximum efficiency, we first scrape all the available tuples (title, datetime, url) on Google.com; then, from the collected URLs, we fetch the content. This two-step procedure is empirically more efficient. Run the scraper first with `RUN_POST_PROCESSING` set to 0, then a second time with it set to 1. All the scraped Google data is persisted, so nothing is lost between runs. A sketch of this step follows this list.
- `LINKS_POST_PROCESSING_CLEAN_HTML_RATIO_LETTERS_LENGTH`: Technical parameter for the post-processing. Applies to Japanese only: we want to drop the English sentences from the Japanese articles. Default: 0.33.
- `LINKS_POST_PROCESSING_NUM_THREADS`: Number of threads used for the post-processing task. Default: 8.
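As a rough sketch of the post-processing step described above (the helper names are illustrative, not the repository's code, and the letters-ratio heuristic is my reading of how `LINKS_POST_PROCESSING_CLEAN_HTML_RATIO_LETTERS_LENGTH` could be applied):

```python
import re
from concurrent.futures import ThreadPoolExecutor

import requests

LINKS_POST_PROCESSING_NUM_THREADS = 8
LINKS_POST_PROCESSING_CLEAN_HTML_RATIO_LETTERS_LENGTH = 0.33

def drop_english_sentences(text):
    # Keep only sentences that are mostly non-Latin characters. A sentence
    # whose ratio of ASCII letters to total length exceeds the threshold is
    # assumed to be English and dropped (my reading of the 0.33 default).
    kept = []
    for sentence in re.split(r"(?<=[。．.!?])", text):
        if not sentence.strip():
            continue
        ascii_letters = sum(c.isascii() and c.isalpha() for c in sentence)
        if ascii_letters / len(sentence) <= LINKS_POST_PROCESSING_CLEAN_HTML_RATIO_LETTERS_LENGTH:
            kept.append(sentence)
    return "".join(kept)

def fetch_content(url):
    # Illustrative fetch: the real project extracts the article body from
    # the page's HTML before cleaning it.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return drop_english_sentences(response.text)

def post_process(urls):
    # Second pass (RUN_POST_PROCESSING = 1): fetch the content of every
    # collected URL in parallel with a fixed-size thread pool.
    with ThreadPoolExecutor(max_workers=LINKS_POST_PROCESSING_NUM_THREADS) as pool:
        return list(pool.map(fetch_content, urls))
```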
## VPN
Scraping Google News usually results in a ban for a few hours. Using a VPN with dynamic IP fetching is a way to overcome this problem.
In my case, I subscribed to this VPN: https://www.expressvpn.com/.
I provide a Python binding for this VPN here: https://github.com/philipperemy/expressvpn-python.
Follow the expressvpn-python setup instructions on 64-bit Ubuntu to configure the VPN for the Google News Scraper project.
Also make sure that:

- you can run `expressvpn` in your terminal;
- ExpressVPN is properly configured;
- you get `expressvpn-python (x.y)`, where `x.y` is the version, when you run `pip list | grep 'expressvpn-python'`.
Once you have all of that, simply run the scraper.
Every time the script detects that Google has banned you, it will ask the VPN for a fresh IP and resume.
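A rough sketch of that detect-and-rotate loop, assuming the ExpressVPN CLI (`expressvpn connect` / `expressvpn disconnect`) is available; the ban check below (HTTP 429 or a redirect to Google's /sorry/ page) is an assumption about how detection could work, not necessarily what the script does:

```python
import subprocess
import time

import requests

def looks_banned(response):
    # Heuristic ban check (an assumption): Google answers with HTTP 429 or
    # redirects to its /sorry/ CAPTCHA page when it rate-limits a client.
    return response.status_code == 429 or "/sorry/" in response.url

def rotate_ip():
    # Ask the ExpressVPN CLI for a fresh connection, and hence a new IP.
    subprocess.run(["expressvpn", "disconnect"], check=False)
    subprocess.run(["expressvpn", "connect"], check=True)

def fetch_with_rotation(url):
    # Retry the request until it goes through, rotating the IP on each ban.
    while True:
        response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
        if not looks_banned(response):
            return response
        rotate_ip()
        time.sleep(5)  # give the tunnel a moment before resuming
```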
## Questions/Answers
- Why didn't you use the RSS feed provided by Google News? It does not exist for Japanese!
- What is the best way to use this scraper? If you want to scrape a lot of data, I highly recommend subscribing to a VPN, preferably ExpressVPN (I implemented the VPN wrapper and its interaction with this scraper).