Web Scraping Python Reddit


Today I’m going to walk you through the process of scraping search results from Reddit using Python. We’re going to write a simple program that performs a keyword search and extracts useful information from the search results. Then we’re going to improve our program’s performance by taking advantage of parallel processing.


Tools

We’ll be using the following Python 3 libraries to make our job easier:

  • Beautiful Soup 4 to extract data from the pages,
  • Requests to fetch the HTML content,
  • lxml as the underlying HTML parser,
  • and multiprocessing to speed things up.

The multiprocessing module is part of the Python 3 standard library, but you may need to install the others manually using a package manager such as pip:
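    pip install beautifulsoup4 requests lxml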

Old Reddit

Before we begin, I want to point out that we’ll be scraping the old Reddit, not the new one. That’s because the new site loads more posts automatically when you scroll down.


The problem is that it’s not possible to simulate this scroll-down action using a simple tool like Requests. We’d need to use something like Selenium for that kind of thing. As a workaround, we’re going to use the old site, which is easier to crawl using the links located on the navigation panel.

Scraper v1 - Program Arguments

Let’s start by making our program accept some arguments that will allow us to customize our search. Here are some useful parameters:

  • keyword to search
  • subreddit restriction (optional)
  • date restriction (optional)

Let’s say we want to search for the keyword “web scraping”. In this case, the URL we want to go to is:
https://old.reddit.com/search?q=%22web+scraping%22

If we want to limit our search to a particular subreddit such as “r/Python”, then our URL becomes:
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on

Finally, the URL is going to look like one of the following if we want to search for the posts submitted in the last year:
https://old.reddit.com/search?q=%22web+scraping%22&t=year
https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year

The following is the initial version of our program that builds and prints the appropriate URL according to the program arguments:
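The sketch below uses argparse to read the arguments and urllib.parse to encode the query; the build_url helper and the flag names are illustrative rather than the exact original listing.

    import argparse
    import urllib.parse

    def build_url(keyword, subreddit=None, time_filter=None):
        """Build the old Reddit search URL from the program arguments."""
        base = "https://old.reddit.com"
        if subreddit:
            base += "/r/" + subreddit
        # Quote the keyword so multi-word searches match the exact phrase
        url = base + "/search?" + urllib.parse.urlencode({"q": '"' + keyword + '"'})
        if subreddit:
            url += "&restrict_sr=on"
        if time_filter:
            url += "&t=" + time_filter
        return url

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Search Reddit for a keyword")
        parser.add_argument("keyword", help="keyword to search for")
        parser.add_argument("--subreddit", help="restrict the search to this subreddit")
        parser.add_argument("--time", dest="time_filter",
                            choices=["hour", "day", "week", "month", "year", "all"],
                            help="restrict the search to a time period")
        args = parser.parse_args()
        print(build_url(args.keyword, args.subreddit, args.time_filter))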

Now we can run our program as follows:
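For example, with the sketch above saved as scraper.py (the file name is arbitrary):

    $ python scraper.py "web scraping" --subreddit Python --time year
    https://old.reddit.com/r/Python/search?q=%22web+scraping%22&restrict_sr=on&t=year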

Scraper v2 - Collecting Search Results

If you take a look at the page source, you’ll notice that all the post results are stored in <div>s with a search-result-link class. Also note that unless it’s the last page, there will be an <a> tag with a rel attribute equal to nofollow next. That’s how we’ll know when to stop advancing to the next page.

Therefore, using the URL we built from the program arguments, we can collect the post sections from all pages with a simple function that we’ll call getSearchResults. Here’s the second version of our program:
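Below is a sketch of getSearchResults based on the page structure described above; the custom User-Agent header is an assumption (Reddit tends to throttle the default one), not part of the original program:

    import requests
    from bs4 import BeautifulSoup

    HEADERS = {"User-Agent": "simple-search-scraper"}

    def getSearchResults(url):
        """Collect the post <div>s from every page of the search results."""
        posts = []
        while url:
            html = requests.get(url, headers=HEADERS).text
            soup = BeautifulSoup(html, "lxml")
            # Each post on the page sits in a <div> with the search-result-link class
            posts.extend(soup.find_all("div", class_="search-result-link"))
            # Unless this is the last page, there is a link with rel="nofollow next"
            next_link = soup.select_one('a[rel="nofollow next"]')
            url = next_link["href"] if next_link else None
        return posts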

Scraper v3 - Parsing Post Data

Now that we have a bunch of posts in the form of a list of bs4.element.Tag objects, we can extract useful information by parsing each element of this list further. We can extract information such as:

Information      Source
date             datetime attribute of the <time> tag
title            <a> tag with search-title class
score            <span> tag with search-score class
author           <a> tag with author class
subreddit        <a> tag with search-subreddit-link class
URL              href attribute of the <a> tag with search-comments class
# of comments    text field of the <a> tag with search-comments class

We’re also going to create a container object to store the extracted data and save it as a JSON file (product.json). We’ll load this file at the beginning of our program, since it may already contain data from other keyword searches. When we’re done scraping the current keyword, we’ll append the new content to the existing data. Here’s the third version of our program:
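The sketch below follows the table above; the helper names (parsePosts, loadProduct, saveProduct) and the JSON layout are illustrative, and error handling for deleted posts that lack some of these fields is omitted for brevity:

    import json
    import os

    def parsePosts(posts, product):
        """Extract the fields from the table above for each post <div>."""
        for post in posts:
            comments_link = post.find("a", class_="search-comments")
            product.append({
                "date": post.find("time")["datetime"],
                "title": post.find("a", class_="search-title").get_text(strip=True),
                "score": post.find("span", class_="search-score").get_text(strip=True),
                "author": post.find("a", class_="author").get_text(strip=True),
                "subreddit": post.find("a", class_="search-subreddit-link").get_text(strip=True),
                "url": comments_link["href"],
                "num_comments": comments_link.get_text(strip=True),
            })

    def loadProduct(path="product.json"):
        """Load previously scraped data if the file already exists."""
        if os.path.exists(path):
            with open(path) as f:
                return json.load(f)
        return []

    def saveProduct(product, path="product.json"):
        """Write the accumulated data back to disk."""
        with open(path, "w") as f:
            json.dump(product, f, indent=2)

    if __name__ == "__main__":
        # build_url, getSearchResults and args come from the earlier sketches
        product = loadProduct()
        posts = getSearchResults(build_url(args.keyword, args.subreddit, args.time_filter))
        parsePosts(posts, product)
        saveProduct(product)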

Now we can search for different keywords by running our program multiple times. The extracted data will be appended to the product.json file after each execution.
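For example (the second keyword is arbitrary):

    $ python scraper.py "web scraping" --time year
    $ python scraper.py "data analysis" --subreddit Python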

Scraper v4 - Scraping Comments

So far we’ve been able to scrape information from the post results easily, since this information is available directly on a given results page. But we might also want to scrape comment information, which cannot be accessed from the results page. We must instead parse the comment page of each individual post using the URL that we previously extracted in our parsePosts function.

If you take a close look at the HTML source of a comment page such as this one, you’ll see that the comments are located inside a <div> with a sitetable nestedlisting class. Each comment inside this <div> is stored in another <div> with a data-type attribute equal to comment. From there, we can obtain some useful information such as:

Information     Source
# of replies    data-replies attribute
author          <a> tag with author class inside the <p> tag with tagline class
date            datetime attribute in the <time> tag inside the <p> tag with tagline class
comment ID      name attribute in the <a> tag inside the <p> tag with parent class
parent ID       <a> tag with the data-event-action attribute equal to parent
text            text field of the <div> tag with md class
score           text field of the <span> tag with score unvoted class

Let’s create a new function called parseComments and call it from our parsePosts function so that we can get the comment data along with the post data:
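Here is a sketch of parseComments following the table above; the guards against deleted comments and the use of the link’s href as the parent reference are assumptions rather than the original code:

    def parseComments(post_url):
        """Fetch a post's comment page and extract the fields from the table above."""
        soup = BeautifulSoup(requests.get(post_url, headers=HEADERS).text, "lxml")
        comments = []
        listing = soup.find("div", class_="sitetable nestedlisting")
        if listing is None:
            return comments
        for div in listing.find_all("div", attrs={"data-type": "comment"}):
            tagline = div.find("p", class_="tagline")
            author = tagline.find("a", class_="author") if tagline else None
            date = tagline.find("time") if tagline else None
            id_anchor = div.find("p", class_="parent")
            id_anchor = id_anchor.find("a") if id_anchor else None
            parent_link = div.find("a", attrs={"data-event-action": "parent"})
            text = div.find("div", class_="md")
            score = div.find("span", class_="score unvoted")
            comments.append({
                "replies": div.get("data-replies"),
                "author": author.get_text(strip=True) if author else None,
                "date": date["datetime"] if date else None,
                "comment_id": id_anchor.get("name") if id_anchor else None,
                "parent_id": parent_link.get("href") if parent_link else None,
                "text": text.get_text(strip=True) if text else "",
                "score": score.get_text(strip=True) if score else None,
            })
        return comments

Inside parsePosts, each post’s dictionary can then pick up its comments with something like a "comments" key set to parseComments(url).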

Scraper v5 - Multiprocessing

Our program is functionally complete at this point. However, it runs a bit slowly because all the work is done serially by a single process. We can improve the performance by handling the posts with multiple processes, using the Process and Manager objects from the multiprocessing library.

The first thing we need to do is to rename the parsePosts function and make it handle only a single post. To do that, we’re simply going to remove the for statement. We also need to change the function parameters a little bit. Instead of passing our original product object, we’ll pass a list object to append the results obtained by the current process.

results is actually a multiprocessing.managers.ListProxy object that we can use to accumulate the output generated by all processes. We’ll later convert it to a regular list and save it in our product. Our main script will now look as follows:
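Here is a sketch of that script, reusing the helpers from the earlier sketches; the single-post function is named parsePost here, and launching one process per post assumes a fork-based start method (the default on Linux) or arguments that can be pickled:

    from multiprocessing import Manager, Process

    def parsePost(post, results):
        """Handle a single post and append its data to the shared results list."""
        comments_link = post.find("a", class_="search-comments")
        data = {
            "date": post.find("time")["datetime"],
            "title": post.find("a", class_="search-title").get_text(strip=True),
            "url": comments_link["href"],
            # ... the remaining fields from the post table above ...
        }
        data["comments"] = parseComments(data["url"])
        results.append(data)

    if __name__ == "__main__":
        # build_url, getSearchResults, loadProduct, saveProduct and args
        # come from the earlier sketches
        posts = getSearchResults(build_url(args.keyword, args.subreddit, args.time_filter))

        manager = Manager()
        results = manager.list()  # shared ListProxy collecting every process's output

        processes = [Process(target=parsePost, args=(post, results)) for post in posts]
        for p in processes:
            p.start()
        for p in processes:
            p.join()

        product = loadProduct()
        product.extend(list(results))  # convert the proxy back to a regular list
        saveProduct(product)

If launching one process per post turns out to be too heavy, a multiprocessing.Pool would cap the number of simultaneous workers while keeping the same structure.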

This simple technique alone greatly speeds up the program. For instance, when I perform a search involving 163 posts on my machine, the serial version of the program takes 150 seconds to execute, corresponding to approximately 1 post per second. The parallel version, on the other hand, takes only 15 seconds (~10 posts per second), which is 10x faster.

You can check out the complete source code on GitHub. Also, make sure to subscribe to get updates on my future articles.




