Here is the code I used to scrape fact-check data from the PolitiFact website. It is easy!



Crawling fact check data

Load libraries

from bs4 import BeautifulSoup
import requests
from time import sleep 

import spacy
import pandas as pd

Check the HTML structure of the website

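# One listing page: page 2 of the "elections" fact checks rated "true"
url = 'https://www.politifact.com/factchecks/list/?page=2&category=elections&ruling=true'

response = requests.get(url)
response.raise_for_status()  # stop early on an HTTP error

# Each fact check sits in its own <li class="o-listicle__item">
soup = BeautifulSoup(response.text, "html5lib")
page = soup.find_all('li', class_='o-listicle__item')

print(page[0])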
<li class="o-listicle__item">
<article class="m-statement m-statement--is-medium m-statement--true">
<div class="m-statement__author">
<div class="m-statement__avatar">
<div class="m-statement__image">
<div class="c-image" style="padding-top: 119.27710843373494%;">
<img class="c-image__thumb" height="99" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/3214d9831a1ebeb94994638553905ac6.jpg" width="83"/>
<picture>
<img class="c-image__original" height="178" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/e0db4deb19a51268fcdde8871510580a.jpg" width="166"/>
</picture>
</div>
</div>
</div>
<div class="m-statement__meta">
<a class="m-statement__name" href="/personalities/mac-heller/" title="Mac Heller">
Mac Heller
</a>
<div class="m-statement__desc">
stated on April 7, 2019 in an interview on CNN:
</div>
</div>
</div>
<div class="m-statement__content">
<div class="m-statement__body">
<div class="m-statement__quote-wrap">
<div class="m-statement__quote">
<a href="/factchecks/2019/apr/23/mac-heller/minority-electorate-surpassed-25-percent-when-obam/">
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
</a>
</div>
</div>
<div class="m-statement__meter">
<div class="c-image" style="padding-top: 89.49771689497716%;">
<img alt="true" class="c-image__thumb" height="196" src="https://static.politifact.com/img/meter-true-th.jpg" width="219"/>
<picture>
<img alt="true" class="c-image__original" height="196" src="https://static.politifact.com/img/meter-true.jpg" width="219"/>
</picture>
</div>
</div>
<footer class="m-statement__footer">
By Amy Sherman • April 23, 2019
</footer>
</div>
</div>
</article>
</li>
# Test one post
post = page[0]

name = post.find('a', class_='m-statement__name').get_text().strip()
date = post.find('div', class_='m-statement__desc').get_text().strip()
text = post.find('div', class_='m-statement__quote').get_text().strip()

print(name)
print(date)
print(text)
Mac Heller
stated on April 7, 2019 in an interview on CNN:
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
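The description line printed above mixes the statement date with the place where it was said. As a small sketch, the two parts can be separated with a regular expression and `datetime.strptime` from the standard library. The helper name `split_description` is hypothetical, and the pattern "stated on <Month day, year> in <context>:" is an assumption based on the single example above, not something the site guarantees. The function returns `None` when the pattern does not match, and the month parsing assumes an English locale.

```python
import re
from datetime import date, datetime

# Assumed pattern: "stated on <Month> <day>, <year> in <context>:"
DESC_PATTERN = re.compile(
    r"stated on\s+(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\s+in\s+(?P<context>.+?):?\s*$"
)

def split_description(desc):
    """Return (date, context) parsed from a description line, or None."""
    match = DESC_PATTERN.search(desc.strip())
    if match is None:
        return None
    stated_on = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    return stated_on, match.group("context")

print(split_description("stated on April 7, 2019 in an interview on CNN:"))
# (datetime.date(2019, 4, 7), 'an interview on CNN')
```

A rule like this is cheaper than running an NLP model on every post, but it only works while the site keeps this wording, so checking for `None` results is worthwhile.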

Crawl all pages

# For all pages
category = ['elections','taxes','environment','immigration','health-check','coronavirus','foreign-policy']
ruling=['true','mostly-true','half-true','barely-true','false','pants-fire']

names = []
dates = []
texts = []
topics = []
labels = []

nlp = spacy.load("en_core_web_sm")

for cat in category:
    for rul in ruling:
        print(cat,rul)
        
        for i in range(1,50):  # pages 1 to 49; stops earlier at the first empty page

            sleep(1) # To be nice
            
            url = 'https://www.politifact.com/factchecks/list/?page={0}&category={1}&ruling={2}'.format(i,cat,rul)

            html = requests.get(url).text
            soup = BeautifulSoup(html, "html5lib")
            page = soup.find_all('li', class_='o-listicle__item')
            
            if len(page)==0:
                print('break at page',i)
                break
            else:
                for post in page:

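                    name = post.find('a', class_='m-statement__name').get_text().strip()
                    desc = post.find('div', class_='m-statement__desc').get_text().strip()
                    text = post.find('div', class_='m-statement__quote').get_text().strip()

                    # desc reads like "stated on April 7, 2019 in an interview on CNN:";
                    # keep the first DATE entity spaCy finds, skip the post if there is none
                    doc = nlp(desc)
                    found = [ent.text for ent in doc.ents if ent.label_ == 'DATE']
                    if not found:
                        continue
                    date = found[0]
                    label = rul
                    topic = cat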

                    names.append(name)
                    dates.append(date)
                    texts.append(text)
                    topics.append(topic)
                    labels.append(label)
elections true
break at page 8
elections mostly-true
break at page 8
elections half-true
break at page 8
elections barely-true
break at page 8
elections false
break at page 18
elections pants-fire
break at page 10
taxes true
break at page 8
taxes mostly-true
break at page 12
taxes half-true
break at page 13
taxes barely-true
break at page 12
taxes false
break at page 12
taxes pants-fire
break at page 5
environment true
break at page 4
environment mostly-true
break at page 5
environment half-true
break at page 5
environment barely-true
break at page 5
environment false
break at page 6
environment pants-fire
break at page 3
immigration true
break at page 5
immigration mostly-true
break at page 8
immigration half-true
break at page 9
immigration barely-true
break at page 9
immigration false
break at page 12
immigration pants-fire
break at page 5
health-check true
break at page 2
health-check mostly-true
break at page 2
health-check half-true
break at page 2
health-check barely-true
break at page 3
health-check false
break at page 2
health-check pants-fire
break at page 2
coronavirus true
break at page 3
coronavirus mostly-true
break at page 4
coronavirus half-true
break at page 5
coronavirus barely-true
break at page 8
coronavirus false
break at page 25
coronavirus pants-fire
break at page 8
foreign-policy true
break at page 4
foreign-policy mostly-true
break at page 6
foreign-policy half-true
break at page 6
foreign-policy barely-true
break at page 7
foreign-policy false
break at page 7
foreign-policy pants-fire
break at page 4
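
The log above shows the stopping rule at work: each category and ruling pair is paged through until a listing page comes back empty. A minimal sketch of that rule, kept apart from the network code so it can be tried offline, might look like this. The names `crawl_listing`, `fetch_page`, and `fake_site` are hypothetical; in the real crawl `fetch_page` would wrap the `requests.get` and `find_all` calls.

```python
def crawl_listing(fetch_page, max_pages=49):
    """Collect items page by page until the first empty page.

    fetch_page(i) is any callable that returns the list of items on page i.
    Returns (items, stop_page); stop_page is None when max_pages was reached
    without seeing an empty page, which means the listing may be truncated.
    """
    items = []
    for i in range(1, max_pages + 1):
        page = fetch_page(i)
        if not page:
            return items, i
        items.extend(page)
    return items, None

# Stand-in for the website: three pages of results, then nothing.
fake_site = {1: ["a", "b"], 2: ["c", "d"], 3: ["e"]}
items, stop_page = crawl_listing(lambda i: fake_site.get(i, []))
print(items, stop_page)  # ['a', 'b', 'c', 'd', 'e'] 4
```

Returning the stop page makes it easy to notice a listing that is longer than the page cap, which `range(1,50)` would cut off without any message.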

Generate a dataframe

df = pd.DataFrame({
    'name':names,
    'date':dates,
    'text':texts,
    'topic':topics,
    'label':labels
})

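df.date = pd.to_datetime(df.date)

# Store the repeated topic and label strings as categories
df.topic = df.topic.astype('category')
df.label = df.label.astype('category')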

display(df.sample(10))

df.info()
      name            date        text                                               topic           label
3394  John Kasich     2012-11-29  Says "Detroit dumping a bunch of sewage" in La...  environment     half-true
349   Mitt Romney     2012-01-25  Says Newt Gingrich said "Spanish is the langua...  elections       mostly-true
6747  Jack Posobiec   2021-03-23  “New information coming in that Boulder shoote...  foreign-policy  barely-true
7129  Marco Rubio     2016-01-17  Hostages were released as soon as Ronald Reaga...  foreign-policy  pants-fire
2750  Facebook posts  2021-02-15  “If you make $50,000/year, $36 of your taxes g...  taxes           false
13    Tweets          2020-11-18  When Donald Trump lost the Iowa caucus to Ted ...  elections       true
5483  Dave McCormick  2022-01-31  “We all know China created COVID.”                 coronavirus     false
3242  Social Media    2017-03-06  "It's important to pay attention to the Russia...  environment     mostly-true
3239  Al Gore         2017-06-04  "70 percent of Florida is in drought today."       environment     mostly-true
3543  Rush Limbaugh   2009-06-29  On the day the House voted on the climate chan...  environment     barely-true
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7179 entries, 0 to 7178
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   name    7179 non-null   object        
 1   date    7179 non-null   datetime64[ns]
 2   text    7179 non-null   object        
 3   topic   7179 non-null   category      
 4   label   7179 non-null   category      
dtypes: category(2), datetime64[ns](1), object(2)
memory usage: 183.0+ KB

Save the dataset as a CSV file

df.to_csv('data/politifact.csv',index=False)
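
`to_csv` does not create missing folders, so the call above fails if `data/` does not exist yet. A small sketch of a guard using `pathlib` follows. `ensure_parent_dir` is a hypothetical helper name, and the temporary directory is only there so the demonstration does not touch the working directory.

```python
import tempfile
from pathlib import Path

def ensure_parent_dir(path):
    """Create the parent folder of path if needed and return path as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "data" / "politifact.csv")
    parent_exists = target.parent.is_dir()
    print(parent_exists)  # True

# In the notebook this could be used as:
# df.to_csv(ensure_parent_dir('data/politifact.csv'), index=False)
```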
