How To Extract HTML Links With A Matching Word From A Website Using Python
Solution 1:
You need to search for the word india in the displayed text of each link. To do this you'll need a custom filter function rather than a plain text search:
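from bs4 import BeautifulSoup
import requests

url = "http://www.bbc.com/news/world/asia/"
r = requests.get(url)
# Pass the raw bytes; name the parser so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")
# Match <a> tags that have an href and mention "india" in their displayed text
india_links = lambda tag: (getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower())
results = soup.find_all(india_links)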
The india_links lambda finds all tags that are <a> links with an href attribute and that contain india (case insensitive) somewhere in the displayed text.
Note that I passed the requests response object's .content attribute (the raw bytes) rather than .text; leave decoding to BeautifulSoup!
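The same filter can be written as a named function, which is easier to read and test than a lambda. The following is a minimal sketch that runs against a small inline snippet instead of the live page; SAMPLE_HTML and is_india_link are hypothetical names introduced only for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/world/asia/india/">India</a></li>
  <li><a href="/news/world/asia/china/">China</a></li>
  <li><a name="india-anchor">India (no href)</a></li>
  <li><a href="/sport/cricket"><span>INDIA captain quits Tests</span></a></li>
</ul>
"""


def is_india_link(tag):
    """Return True for <a> tags with an href whose displayed text mentions 'india'."""
    return (
        tag.name == "a"
        and tag.has_attr("href")
        and "india" in tag.get_text().lower()
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
matches = soup.find_all(is_india_link)

# Collect the href of every matching link, in document order.
hrefs = [tag["href"] for tag in matches]
print(hrefs)
```

The anchor without an href and the China link are skipped, while the link whose text sits inside a nested <span> is still found because get_text() collects text from all descendants.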
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> url = "http://www.bbc.com/news/world/asia/"
>>> r = requests.get(url)
>>> soup = BeautifulSoup(r.content, "html.parser")
>>> india_links = lambda tag: getattr(tag, 'name', None) == 'a' and 'href' in tag.attrs and 'india' in tag.get_text().lower()
>>> results = soup.find_all(india_links)
>>> from pprint import pprint
>>> pprint(results)
[<a href="/news/world/asia/india/">India</a>,
 <a class="story" href="/news/world-asia-india-30647504" rel="published-1420102077277">India scheme to monitor toilet use </a>,
 <a class="story" href="/news/world-asia-india-30640444" rel="published-1420022868334">India to scrap tax breaks on cars</a>,
 <a class="story" href="/news/world-asia-india-30640436" rel="published-1420012598505">India shock over Dhoni retirement</a>,
 <a href="/news/world/asia/india/">India</a>,
 <a class="headline-anchor" href="/news/world-asia-india-30630274" rel="published-1419931669523"><img alt="A Delhi police officer with red flag walks amidst morning fog in Delhi, India, Monday, Dec 29, 2014. " src="http://news.bbcimg.co.uk/media/images/79979000/jpg/_79979280_79979240.jpg"/><span class="headline heading-13">India fog continues to cause chaos</span></a>,
 <a class="headline-anchor" href="/news/world-asia-india-30632852" rel="published-1419940599384"><span class="headline heading-13">Court boost to India BJP chief</span></a>,
 <a class="headline-anchor" href="/sport/0/cricket/30632182" rel="published-1419930930045"><span class="headline heading-13">India captain Dhoni quits Tests</span></a>,
 <a class="story" href="http://www.bbc.co.uk/news/world-radio-and-tv-15386555" rel="published-1392018507550"><img alt="A woman riding a scooter waits for a traffic signal along a street in Mumbai February 5, 2014." src="http://news.bbcimg.co.uk/media/images/72866000/jpg/_72866856_020889093.jpg"/>Special report: India Direct</a>,
 <a href="/2/hi/south_asia/country_profiles/1154019.stm">India</a>]
Note the http://www.bbc.co.uk/news/world-radio-and-tv-15386555 link here; we had to use the lambda search because a search with a text regular expression would not have found that element. The contained text (Special report: India Direct) is not the only child of the tag (there is also an <img>), so a text search would not match it.
A similar problem applies to the /news/world-asia-india-30632852 link; the nested <span> element means that the Court boost to India BJP chief headline text is not a direct child of the link tag.
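The mixed-content case can be reproduced in isolation. The sketch below assumes a shortened, hypothetical version of the Special report markup and compares a regular-expression search (via the string argument, the current name for the older text argument) with the function-based search.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, shortened version of a link that mixes an <img> with text.
html = (
    '<a href="/report">'
    '<img alt="scooter" src="x.jpg"/>'
    "Special report: India Direct"
    "</a>"
)
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# With two children (an <img> and a text node) the tag has no single .string.
single_string = link.string

# A regex search on the tag's string therefore has nothing to match against.
by_string = soup.find_all("a", string=re.compile("india", re.I))

# get_text() joins the text of all descendants, so the function search still matches.
by_text = soup.find_all(
    lambda tag: tag.name == "a" and "india" in tag.get_text().lower()
)

print(single_string, len(by_string), len(by_text))
```

Here the regular-expression search comes back empty, while the function-based search returns the link.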
You can extract just the links with:
from urllib.parse import urljoin
result_links = [urljoin(url, tag['href']) for tag in results]
where all relative URLs are resolved relative to the original URL:
>>> from urllib.parse import urljoin
>>> result_links = [urljoin(url, tag['href']) for tag in results]
>>> pprint(result_links)
['http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30647504',
 'http://www.bbc.com/news/world-asia-india-30640444',
 'http://www.bbc.com/news/world-asia-india-30640436',
 'http://www.bbc.com/news/world/asia/india/',
 'http://www.bbc.com/news/world-asia-india-30630274',
 'http://www.bbc.com/news/world-asia-india-30632852',
 'http://www.bbc.com/sport/0/cricket/30632182',
 'http://www.bbc.co.uk/news/world-radio-and-tv-15386555',
 'http://www.bbc.com/2/hi/south_asia/country_profiles/1154019.stm']
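To see how urljoin treats the different kinds of href values, here is a small standalone sketch using only the standard library. The base URL and the first two hrefs come from the example above; the third, a plain relative path, is a hypothetical value added for illustration.

```python
from urllib.parse import urljoin

base = "http://www.bbc.com/news/world/asia/"

hrefs = [
    "/news/world/asia/india/",  # root-relative: replaces the whole path of the base
    "http://www.bbc.co.uk/news/world-radio-and-tv-15386555",  # absolute: kept as-is
    "india/",  # hypothetical relative path: appended to the base directory
]

resolved = [urljoin(base, href) for href in hrefs]
for link in resolved:
    print(link)
```

Because absolute URLs pass through unchanged, it is safe to run every href through urljoin without first checking whether it is relative.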