Web crawling with Python and BeautifulSoup (and a little HTML)

May 19, 2022

Here is the code I used to scrape fact-check data from the PolitiFact website. It is easy!

Crawling fact check data

Load libraries

from bs4 import BeautifulSoup
import requests
from time import sleep 

import spacy
import pandas as pd

Check the HTML structure of the website

url = 'https://www.politifact.com/factchecks/list/?page=2&category=elections&ruling=true'
    
response = requests.get(url)

soup = BeautifulSoup(response.text, "html5lib")
page = soup.find_all('li', class_='o-listicle__item')

print(page[0])

<li class="o-listicle__item">
<article class="m-statement m-statement--is-medium m-statement--true">
<div class="m-statement__author">
<div class="m-statement__avatar">
<div class="m-statement__image">
<div class="c-image" style="padding-top: 119.27710843373494%;">
<img class="c-image__thumb" height="99" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/3214d9831a1ebeb94994638553905ac6.jpg" width="83"/>
<picture>
<img class="c-image__original" height="178" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/e0db4deb19a51268fcdde8871510580a.jpg" width="166"/>
</picture>
</div>
</div>
</div>
<div class="m-statement__meta">
<a class="m-statement__name" href="/personalities/mac-heller/" title="Mac Heller">
Mac Heller
</a>
<div class="m-statement__desc">
stated on April 7, 2019 in an interview on CNN:
</div>
</div>
</div>
<div class="m-statement__content">
<div class="m-statement__body">
<div class="m-statement__quote-wrap">
<div class="m-statement__quote">
<a href="/factchecks/2019/apr/23/mac-heller/minority-electorate-surpassed-25-percent-when-obam/">
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
</a>
</div>
</div>
<div class="m-statement__meter">
<div class="c-image" style="padding-top: 89.49771689497716%;">
<img alt="true" class="c-image__thumb" height="196" src="https://static.politifact.com/img/meter-true-th.jpg" width="219"/>
<picture>
<img alt="true" class="c-image__original" height="196" src="https://static.politifact.com/img/meter-true.jpg" width="219"/>
</picture>
</div>
</div>
<footer class="m-statement__footer">
By Amy Sherman • April 23, 2019
</footer>
</div>
</div>
</article>
</li>
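
The list URL is just three query parameters (page, category, ruling), so it can also be built with the standard library rather than by hand. This is only a minimal sketch: `BASE_URL` and `build_list_url` are names made up for illustration and are not used elsewhere in this post.

```python
from urllib.parse import urlencode

# Base address of the fact-check list, taken from the URL used above.
BASE_URL = "https://www.politifact.com/factchecks/list/"


def build_list_url(page, category, ruling):
    """Return the list URL for one page of a category/ruling combination."""
    # urlencode escapes the values and keeps the parameters in insertion order.
    query = urlencode({"page": page, "category": category, "ruling": ruling})
    return f"{BASE_URL}?{query}"


print(build_list_url(2, "elections", "true"))
```

Keeping the parameters in a dictionary makes it harder to mistype the query string when looping over many categories and rulings.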

# Test one post
post = page[0]

name = post.find('a', class_='m-statement__name').get_text().strip()
date = post.find('div', class_='m-statement__desc').get_text().strip()
text = post.find('div', class_='m-statement__quote').get_text().strip()

print(name)
print(date)
print(text)

Mac Heller
stated on April 7, 2019 in an interview on CNN:
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
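
The three lookups above assume every post contains all three elements; if one is missing, `find` returns `None` and `.get_text()` raises an `AttributeError`. Below is a small sketch of a more defensive version. The helper names `text_or_none` and `parse_post` are hypothetical, `SAMPLE_HTML` is a trimmed copy of the markup printed earlier so the example runs offline, and I use the built-in `html.parser` here only to avoid the extra `html5lib` dependency.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the list item printed above (images and footer removed).
SAMPLE_HTML = """
<li class="o-listicle__item">
<a class="m-statement__name" href="/personalities/mac-heller/" title="Mac Heller">
Mac Heller
</a>
<div class="m-statement__desc">
stated on April 7, 2019 in an interview on CNN:
</div>
<div class="m-statement__quote">
<a href="/factchecks/2019/apr/23/mac-heller/minority-electorate-surpassed-25-percent-when-obam/">
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
</a>
</div>
</li>
"""


def text_or_none(post, tag, css_class):
    """Return the stripped text of the first matching element, or None."""
    element = post.find(tag, class_=css_class)
    return element.get_text().strip() if element is not None else None


def parse_post(post):
    """Extract the name, description line and quote from one list item."""
    return {
        "name": text_or_none(post, "a", "m-statement__name"),
        "date": text_or_none(post, "div", "m-statement__desc"),
        "text": text_or_none(post, "div", "m-statement__quote"),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
post = soup.find("li", class_="o-listicle__item")
print(parse_post(post))
```

A post with a missing field then shows up as `None` in the data instead of stopping the whole crawl.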

Crawl all pages

# Categories and rulings to crawl; every (category, ruling) pair is paged through below
category = ['elections', 'taxes', 'environment', 'immigration', 'health-check', 'coronavirus', 'foreign-policy']
ruling = ['true', 'mostly-true', 'half-true', 'barely-true', 'false', 'pants-fire']

names = []
dates = []
texts = []
topics = []
labels = []

nlp = spacy.load("en_core_web_sm")

for cat in category:
    for rul in ruling:
        print(cat,rul)
        
        for i in range(1,50):  # pages 1 to 49; the loop stops early at the first empty page

            sleep(1)  # pause between requests to be nice to the server
            
            url = 'https://www.politifact.com/factchecks/list/?page={0}&category={1}&ruling={2}'.format(i,cat,rul)

            html = requests.get(url).text
            soup = BeautifulSoup(html, "html5lib")
            page = soup.find_all('li', class_='o-listicle__item')
            
            if len(page)==0:
                print('break at page',i)
                break
            else:
                for post in page:

                    name = post.find('a', class_='m-statement__name').get_text().strip()
                    date = post.find('div', class_='m-statement__desc').get_text().strip()
                    text = post.find('div', class_='m-statement__quote').get_text().strip()

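                    # Keep only the date from the description line, e.g. "April 7, 2019".
                    # Note: [0] raises IndexError if spaCy finds no DATE entity.
                    doc = nlp(date)
                    date = [ent.text for ent in doc.ents if ent.label_ =='DATE'][0]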
                    label = rul
                    topic = cat

                    names.append(name)
                    dates.append(date)
                    texts.append(text)
                    topics.append(topic)
                    labels.append(label)

elections true
break at page 8
elections mostly-true
break at page 8
elections half-true
break at page 8
elections barely-true
break at page 8
elections false
break at page 18
elections pants-fire
break at page 10
taxes true
break at page 8
taxes mostly-true
break at page 12
taxes half-true
break at page 13
taxes barely-true
break at page 12
taxes false
break at page 12
taxes pants-fire
break at page 5
environment true
break at page 4
environment mostly-true
break at page 5
environment half-true
break at page 5
environment barely-true
break at page 5
environment false
break at page 6
environment pants-fire
break at page 3
immigration true
break at page 5
immigration mostly-true
break at page 8
immigration half-true
break at page 9
immigration barely-true
break at page 9
immigration false
break at page 12
immigration pants-fire
break at page 5
health-check true
break at page 2
health-check mostly-true
break at page 2
health-check half-true
break at page 2
health-check barely-true
break at page 3
health-check false
break at page 2
health-check pants-fire
break at page 2
coronavirus true
break at page 3
coronavirus mostly-true
break at page 4
coronavirus half-true
break at page 5
coronavirus barely-true
break at page 8
coronavirus false
break at page 25
coronavirus pants-fire
break at page 8
foreign-policy true
break at page 4
foreign-policy mostly-true
break at page 6
foreign-policy half-true
break at page 6
foreign-policy barely-true
break at page 7
foreign-policy false
break at page 7
foreign-policy pants-fire
break at page 4
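
In the crawl loop, spaCy is used only to pull the date out of the description line. If the descriptions always follow the "Month D, YYYY" pattern seen in the sample post (an assumption I have not verified for every page), a regular expression is a lighter alternative. The names `DATE_PATTERN` and `extract_date` are hypothetical; this is a sketch, not the method used for the results in this post.

```python
import re
from datetime import datetime

# Matches dates written like "April 7, 2019" (full English month name).
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{1,2}, \d{4}\b"
)


def extract_date(description):
    """Return the first 'Month D, YYYY' date in the text, or None if absent."""
    match = DATE_PATTERN.search(description)
    if match is None:
        return None
    # %B expects the full month name; this relies on the default English locale.
    return datetime.strptime(match.group(0), "%B %d, %Y").date()


print(extract_date("stated on April 7, 2019 in an interview on CNN:"))
```

Unlike indexing the spaCy result with `[0]`, this returns `None` when no date is found, so such rows can be inspected later instead of crashing the crawl.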

Generate a dataframe

df = pd.DataFrame({
    'name':names,
    'date':dates,
    'text':texts,
    'topic':topics,
    'label':labels
})

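df.date = pd.to_datetime(df.date)
# topic and label are converted to category, matching the info() output below
df.topic = df.topic.astype('category')
df.label = df.label.astype('category')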

display(df.sample(10))

df.info()

	name	date	text	topic	label
3394	John Kasich	2012-11-29	Says "Detroit dumping a bunch of sewage" in La...	environment	half-true
349	Mitt Romney	2012-01-25	Says Newt Gingrich said "Spanish is the langua...	elections	mostly-true
6747	Jack Posobiec	2021-03-23	“New information coming in that Boulder shoote...	foreign-policy	barely-true
7129	Marco Rubio	2016-01-17	Hostages were released as soon as Ronald Reaga...	foreign-policy	pants-fire
2750	Facebook posts	2021-02-15	“If you make $50,000/year, $36 of your taxes g...	taxes	false
13	Tweets	2020-11-18	When Donald Trump lost the Iowa caucus to Ted ...	elections	true
5483	Dave McCormick	2022-01-31	“We all know China created COVID.”	coronavirus	false
3242	Social Media	2017-03-06	"It's important to pay attention to the Russia...	environment	mostly-true
3239	Al Gore	2017-06-04	"70 percent of Florida is in drought today."	environment	mostly-true
3543	Rush Limbaugh	2009-06-29	On the day the House voted on the climate chan...	environment	barely-true

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7179 entries, 0 to 7178
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    7179 non-null   object        
 1   date    7179 non-null   datetime64[ns]
 2   text    7179 non-null   object        
 3   topic   7179 non-null   category      
 4   label   7179 non-null   category      
dtypes: category(2), datetime64[ns](1), object(2)
memory usage: 183.0+ KB

Save the dataset as a CSV file

df.to_csv('data/politifact.csv',index=False)
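
One thing to keep in mind: a CSV file stores plain text, so the datetime and category dtypes are lost on save and have to be requested again when loading. The sketch below shows a round trip on a two-row stand-in frame (`df_small`, built from two rows of the sample above) and writes to an in-memory buffer instead of `data/politifact.csv` so that it runs anywhere.

```python
import io

import pandas as pd

# Tiny stand-in for the scraped frame, using two rows from the sample above.
df_small = pd.DataFrame({
    "name": ["Al Gore", "Dave McCormick"],
    "date": pd.to_datetime(["2017-06-04", "2022-01-31"]),
    "text": ['"70 percent of Florida is in drought today."',
             "“We all know China created COVID.”"],
    "topic": pd.Categorical(["environment", "coronavirus"]),
    "label": pd.Categorical(["mostly-true", "false"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)

# Ask for the dtypes again on load, otherwise all three columns come back as text.
restored = pd.read_csv(
    buffer,
    parse_dates=["date"],
    dtype={"topic": "category", "label": "category"},
)
print(restored.dtypes)
```

For the real file, the same `parse_dates` and `dtype` arguments can be passed to `pd.read_csv('data/politifact.csv', ...)`.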
