BeautifulSoup - Python scraping href links

- June 24, 2012

My goal is to scrape the href links on the base_url site.

My code:

from bs4 import BeautifulSoup
from selenium import webdriver
import requests, csv, re

game_links = []
link_pages = []
base_url = "http://www.basket.fi/sarjat/ohjelma_tulokset/?season_id=93783&league_id=4#mbt:2-303$f&stage=177155:$p&0="

# Render the page in a headless browser so the JavaScript-built table is present
browser = webdriver.PhantomJS()
browser.get(base_url)
table = BeautifulSoup(browser.page_source, 'lxml')

# Print the href of every <a> tag that has a numeric game_id attribute
for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    href = game.get("href")
    print(href)

Result:

http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4
http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4
......

The problem is that I can't understand why each href link appears twice in the result.

As you can notice in the image, there are 2 links with the same game_id.
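To make that concrete, here is a minimal sketch that runs the same find_all call against a small piece of hand-written HTML instead of the live page. The markup in SAMPLE_HTML is an assumption made up for illustration (two anchors per game row, one around the team names and one around the score); the real page may be laid out differently. It uses the built-in "html.parser" so it does not need lxml or a browser.

```python
from collections import Counter
import re

from bs4 import BeautifulSoup

# Hypothetical markup, not copied from the real site: every game row holds
# two <a> tags that carry the same game_id and the same href.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a game_id="3502579" href="/sarjat/ottelu/?game_id=3502579">Team A - Team B</a></td>
    <td><a game_id="3502579" href="/sarjat/ottelu/?game_id=3502579">80-75</a></td>
  </tr>
  <tr>
    <td><a game_id="3502523" href="/sarjat/ottelu/?game_id=3502523">Team C - Team D</a></td>
    <td><a game_id="3502523" href="/sarjat/ottelu/?game_id=3502523">66-70</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Same selector as in the question: <a> tags with a numeric game_id attribute.
anchors = soup.find_all("a", {"game_id": re.compile(r"\d+")})

# Count how many anchors share each game_id.
link_counts = Counter(a["game_id"] for a in anchors)
print(link_counts)
```

With markup like this, every game_id is matched twice, so the loop prints every href twice.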

Modified code, which should give only 1 link per game:

for game in table.find_all("a", {'game_id': re.compile(r'\d+')}):
    if game.children:
        href = game.get("href")
        print(href)
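One caveat, as far as I can tell: in bs4, game.children is an iterator, so `if game.children:` is true for every tag and may not filter anything. If the links still come out twice, a rough alternative is to drop repeated hrefs after collecting them. The helper name unique_links below is made up for this sketch, and the sample list simply reuses the URLs from the result above.

```python
def unique_links(hrefs):
    """Return the hrefs with duplicates removed, keeping first-seen order.

    Empty or missing hrefs (None) are skipped.
    """
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(h for h in hrefs if h))


# Sample input: the duplicated URLs printed by the original loop.
hrefs = [
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502579&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
    "http://www.basket.fi/sarjat/ottelu/?game_id=3502523&season_id=93783&league_id=4",
]

game_links = unique_links(hrefs)
for link in game_links:
    print(link)
```

In the scraper itself, the same helper could be fed directly from the loop, for example `unique_links(game.get("href") for game in table.find_all("a", {'game_id': re.compile(r'\d+')}))`.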
