Web crawling with Python and BeautifulSoup (and a little HTML)
Here I present the code I used to scrape fact-check data from the PolitiFact website. It is easy!
Crawling fact check data
Load libraries
from bs4 import BeautifulSoup
import requests
from time import sleep
import spacy
import pandas as pd
Check the HTML structure of a website
url = 'https://www.politifact.com/factchecks/list/?page=2&category=elections&ruling=true'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html5lib")
page = soup.find_all('li', class_='o-listicle__item')
print(page[0])
<li class="o-listicle__item">
<article class="m-statement m-statement--is-medium m-statement--true">
<div class="m-statement__author">
<div class="m-statement__avatar">
<div class="m-statement__image">
<div class="c-image" style="padding-top: 119.27710843373494%;">
<img class="c-image__thumb" height="99" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/3214d9831a1ebeb94994638553905ac6.jpg" width="83"/>
<picture>
<img class="c-image__original" height="178" src="https://static.politifact.com/CACHE/images/politifact/mugs/Screen_Shot_2019-04-23_at_2.53.39_PM/e0db4deb19a51268fcdde8871510580a.jpg" width="166"/>
</picture>
</div>
</div>
</div>
<div class="m-statement__meta">
<a class="m-statement__name" href="/personalities/mac-heller/" title="Mac Heller">
Mac Heller
</a>
<div class="m-statement__desc">
stated on April 7, 2019 in an interview on CNN:
</div>
</div>
</div>
<div class="m-statement__content">
<div class="m-statement__body">
<div class="m-statement__quote-wrap">
<div class="m-statement__quote">
<a href="/factchecks/2019/apr/23/mac-heller/minority-electorate-surpassed-25-percent-when-obam/">
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
</a>
</div>
</div>
<div class="m-statement__meter">
<div class="c-image" style="padding-top: 89.49771689497716%;">
<img alt="true" class="c-image__thumb" height="196" src="https://static.politifact.com/img/meter-true-th.jpg" width="219"/>
<picture>
<img alt="true" class="c-image__original" height="196" src="https://static.politifact.com/img/meter-true.jpg" width="219"/>
</picture>
</div>
</div>
<footer class="m-statement__footer">
By Amy Sherman • April 23, 2019
</footer>
</div>
</div>
</article>
</li>
# Test one post
post = page[0]
name = post.find('a', class_='m-statement__name').get_text().strip()
date = post.find('div', class_='m-statement__desc').get_text().strip()
text = post.find('div', class_='m-statement__quote').get_text().strip()
print(name)
print(date)
print(text)
Mac Heller
stated on April 7, 2019 in an interview on CNN:
"The 2008 election was the first election in which voters of color comprised over 25 percent of the electorate, and that number is going up."
Crawl all pages
# For all pages
category = ['elections','taxes','environment','immigration','health-check','coronavirus','foreign-policy']
ruling=['true','mostly-true','half-true','barely-true','false','pants-fire']
names = []
dates = []
texts = []
topics = []
labels = []
nlp = spacy.load("en_core_web_sm")
for cat in category:
    for rul in ruling:
        print(cat, rul)
        for i in range(1, 50):
            sleep(1) # To be nice
            url = 'https://www.politifact.com/factchecks/list/?page={0}&category={1}&ruling={2}'.format(i, cat, rul)
            html = requests.get(url).text
            soup = BeautifulSoup(html, "html5lib")
            page = soup.find_all('li', class_='o-listicle__item')
            if len(page) == 0:
                print('break at page', i)
                break
            else:
                for post in page:
                    name = post.find('a', class_='m-statement__name').get_text().strip()
                    date = post.find('div', class_='m-statement__desc').get_text().strip()
                    text = post.find('div', class_='m-statement__quote').get_text().strip()
                    doc = nlp(date)
                    date = [ent.text for ent in doc.ents if ent.label_ == 'DATE'][0]
                    label = rul
                    topic = cat
                    names.append(name)
                    dates.append(date)
                    texts.append(text)
                    topics.append(topic)
                    labels.append(label)
elections true
break at page 8
elections mostly-true
break at page 8
elections half-true
break at page 8
elections barely-true
break at page 8
elections false
break at page 18
elections pants-fire
break at page 10
taxes true
break at page 8
taxes mostly-true
break at page 12
taxes half-true
break at page 13
taxes barely-true
break at page 12
taxes false
break at page 12
taxes pants-fire
break at page 5
environment true
break at page 4
environment mostly-true
break at page 5
environment half-true
break at page 5
environment barely-true
break at page 5
environment false
break at page 6
environment pants-fire
break at page 3
immigration true
break at page 5
immigration mostly-true
break at page 8
immigration half-true
break at page 9
immigration barely-true
break at page 9
immigration false
break at page 12
immigration pants-fire
break at page 5
health-check true
break at page 2
health-check mostly-true
break at page 2
health-check half-true
break at page 2
health-check barely-true
break at page 3
health-check false
break at page 2
health-check pants-fire
break at page 2
coronavirus true
break at page 3
coronavirus mostly-true
break at page 4
coronavirus half-true
break at page 5
coronavirus barely-true
break at page 8
coronavirus false
break at page 25
coronavirus pants-fire
break at page 8
foreign-policy true
break at page 4
foreign-policy mostly-true
break at page 6
foreign-policy half-true
break at page 6
foreign-policy barely-true
break at page 7
foreign-policy false
break at page 7
foreign-policy pants-fire
break at page 4
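Inside the loop above, spaCy's entity recogniser pulls the date out of each description, and the `[0]` index raises an `IndexError` whenever no `DATE` entity is found. A lighter alternative is sketched here with Python's `re` and `datetime` modules instead of spaCy. It assumes every description follows the "stated on Month day, year" pattern seen in the sample post; the helper name `extract_date` and the regular expression are illustrative and not part of the original code.

```python
import re
from datetime import date, datetime
from typing import Optional

# Matches e.g. "stated on April 7, 2019" (pattern assumed from the sample post).
DATE_PATTERN = re.compile(r"stated on ([A-Z][a-z]+ \d{1,2}, \d{4})")


def extract_date(description: str) -> Optional[date]:
    """Return the date found in a 'stated on ...' line, or None if there is none."""
    match = DATE_PATTERN.search(description)
    if match is None:
        return None
    try:
        # %B expects a full English month name such as "April".
        return datetime.strptime(match.group(1), "%B %d, %Y").date()
    except ValueError:
        # The text looked like a date but was not a valid one.
        return None


print(extract_date("stated on April 7, 2019 in an interview on CNN:"))  # 2019-04-07
```

Returning `None` instead of raising lets the crawl keep going and leaves the decision about incomplete rows to a later cleaning step.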
Generate a dataframe
df = pd.DataFrame({
'name':names,
'date':dates,
'text':texts,
'topic':topics,
'label':labels
})
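df.date = pd.to_datetime(df.date)
# Store the repeated topic and label strings as categoricals (the dtypes reported by df.info())
df.topic = df.topic.astype('category')
df.label = df.label.astype('category')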
display(df.sample(10))
df.info()
| | name | date | text | topic | label |
---|---|---|---|---|---|
3394 | John Kasich | 2012-11-29 | Says "Detroit dumping a bunch of sewage" in La... | environment | half-true |
349 | Mitt Romney | 2012-01-25 | Says Newt Gingrich said "Spanish is the langua... | elections | mostly-true |
6747 | Jack Posobiec | 2021-03-23 | “New information coming in that Boulder shoote... | foreign-policy | barely-true |
7129 | Marco Rubio | 2016-01-17 | Hostages were released as soon as Ronald Reaga... | foreign-policy | pants-fire |
2750 | Facebook posts | 2021-02-15 | “If you make $50,000/year, $36 of your taxes g... | taxes | false |
13 | Tweets | 2020-11-18 | When Donald Trump lost the Iowa caucus to Ted ... | elections | true |
5483 | Dave McCormick | 2022-01-31 | “We all know China created COVID.” | coronavirus | false |
3242 | Social Media | 2017-03-06 | "It's important to pay attention to the Russia... | environment | mostly-true |
3239 | Al Gore | 2017-06-04 | "70 percent of Florida is in drought today." | environment | mostly-true |
3543 | Rush Limbaugh | 2009-06-29 | On the day the House voted on the climate chan... | environment | barely-true |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7179 entries, 0 to 7178
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 7179 non-null object
1 date 7179 non-null datetime64[ns]
2 text 7179 non-null object
3 topic 7179 non-null category
4 label 7179 non-null category
dtypes: category(2), datetime64[ns](1), object(2)
memory usage: 183.0+ KB
Save the dataset as a CSV file
df.to_csv('data/politifact.csv',index=False)
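`to_csv` does not create missing folders, so the call above fails if the `data/` folder does not exist yet. A small sketch of one way around that, assuming a hypothetical helper `ensure_parent` (not part of pandas) that creates the parent folder before the file is written:

```python
from pathlib import Path


def ensure_parent(path: str) -> Path:
    """Create the parent folder of `path` if needed and return the path."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Hypothetical usage with the dataframe built above:
# df.to_csv(ensure_parent('data/politifact.csv'), index=False)
```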