How to scrape multiple tables from a dynamic page in Python

Edit:



I found the links to everything embedded in a data attribute of a link tag under a div - not knowing AJAX or front-end development, I'm not sure what to call it. It looks like this:



<a class="tabs__link js-tabs-ranking" href="it" data-ajax-stack="{&quot;itg&quot;:&quot;/en/ajax/ranking/19/itg/18f11c81b4cd83f7b82b47a88d939a9c/none&quot;,&quot;ipg&quot;:&quot;/en/ajax/ranking/19/ipg/b1c62bbc714bc8823f59f3ec1030a3d7/none&quot;,&quot;etg&quot;:&quot;/en/ajax/ranking/19/etg/5b2a3871133c7df8954b81ca884d233f/none&quot;,&quot;img&quot;:&quot;/en/ajax/ranking/19/img/03a4a10eac4baaffa954cebf29c39b1c/none&quot;,&quot;ijg&quot;:&quot;/en/ajax/ranking/19/ijg/ec301eb70c0b7df824159aaa00d79135/none&quot;,&quot;icg&quot;:&quot;/en/ajax/ranking/19/icg/81b5589ac9889472dcda9560dd23683d/none&quot;}" data-type="g" data-xtclick="ranking::tab::overall">General classification</a>
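
[Editor's note: once BeautifulSoup unescapes the &quot; entities, the attribute value is a JSON object, so a single json.loads can replace the chain of string replaces used in the code below. A minimal sketch, with the literal shortened to two entries:]

import json

# BeautifulSoup unescapes the &quot; entities when reading item['data-ajax-stack'],
# leaving a JSON object (json.loads also handles any \/ escaped slashes natively).
stack = '{"itg": "/en/ajax/ranking/19/itg/18f11c81b4cd83f7b82b47a88d939a9c/none", "ipg": "/en/ajax/ranking/19/ipg/b1c62bbc714bc8823f59f3ec1030a3d7/none"}'
ajax_urls = json.loads(stack)
for key, path in ajax_urls.items():
    print(key, path)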


I've gotten the following code to get the tables into dataframes for everything EXCEPT pages where there are multiple tables and headings (types ime and ipe) - I've made an if-else to try and handle those pages differently.



What I'd like to do is put each table into its own dataframe, but I keep getting a bunch of ugly HTML and an error at the end: "TypeError: 'NoneType' object is not callable". I'll keep banging away at this, but any suggestions are more than welcome!



import requests
import html5lib
import pandas as pd
from bs4 import BeautifulSoup


#type_dict = {'e':'Stage', 'g':'General Classification'}
tab_dict = {'ite':'Stage',
            'ipe':'Points',
            'ime':'Mountains',
            'ije':'Young riders',
            'ice':'Combativity',
            'ete':'Teams',
            'itg':'General Classification',
            'ipg':'Points Classification',
            'img':'Mountains Classification',
            'ijg':'Young Riders Classification',
            'icg':'Combativity Classification',
            'etg':'Teams Classification'}
#Add a user input for the URL
start_url = "https://www.letour.fr/en/rankings/stage-19"
base_url = start_url.split('/')[2]

page = requests.get(start_url)
content = page.content
r_table = pd.read_html(content)

#This worked to get the table out into a DataFrame
df = r_table[0]
#print(df['Rider'])
soup = BeautifulSoup(content, "html5lib")

all_links = soup.find_all(class_="tabs__link js-tabs-ranking")
#grabbing the block of ajax links that give URLs to various stage/GC results
for item in all_links:
    myurl = item['data-ajax-stack']
    myurl = myurl.replace('\\/', '/').replace('{', '').replace('}', '').replace('"', '')
    myurl = dict(x.split(':') for x in myurl.split(','))
    #looping through the lists of links and getting the pages
    for key, value in myurl.items():
        r_type = tab_dict[key]
        print("Getting the data for: " + r_type)
        url = ("http://" + base_url + value)
        try:
            if key == "ipe" or key == "ime":
                page = requests.get(url).content
                print(page)
                soup = BeautifulSoup(page, "html5lib")
                #heading = soup.find_all('div', class_="rankingTables__caption")
                for caption in soup.find_all('div', class_="rankingTables__caption"):
                    res_caption = caption.text.title()
                    print(res_caption)
                    res_table = pd.read_html(caption)
                    df = res_table[0]
                    print(df)  #debugging, test
            else:
                page = requests.get(url).content
                soup = BeautifulSoup(page, "html5lib")
                res_table = pd.read_html(page)
                df = res_table[0]
                print(df)  #debugging/test
        except ValueError:
            print("No table found for " + key)
            break
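
[Editor's note: the TypeError likely comes from pd.read_html(caption). read_html expects a URL, file-like object, or raw HTML string; a BeautifulSoup Tag answers unknown attribute lookups like .read with a child-tag search that returns None, so pandas mistakes it for a file object and calls read(), which is None. A hedged sketch of one way around it, assuming each caption div is followed by its table; the URL is a placeholder, not a real hash:]

import requests
import pandas as pd
from bs4 import BeautifulSoup

ajax_url = "https://www.letour.fr/en/ajax/ranking/19/ime/<hash>/none"  # hypothetical placeholder
soup = BeautifulSoup(requests.get(ajax_url).content, "html5lib")

for caption in soup.find_all('div', class_="rankingTables__caption"):
    table = caption.find_next('table')    # assumes the table follows its caption div
    if table is not None:
        df = pd.read_html(str(table))[0]  # pass a string, not a bs4 Tag
        print(caption.text.title())
        print(df)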



I am relatively new to Python, and am using a web scraping project to learn more. I am stuck on a problem trying to get multiple blocks of tabular data from a dynamic web page. Is there an easy way to get the tables generated from the various clicks on this page?



My code below works on the results that come up on the default load, but I want to be able to loop through the tabs and grab them all into the same dataframe.



Looking at the source code, there is one tag whose data-current-type and data-current-tab attributes change depending on the tab you click.



I thought about making a dictionary for each attribute:

data-current-type: {'e': 'Stage', 'g': 'General Classification'}
data-current-tab: {'it': 'Individual Classification',
                   'ip': 'Points',
                   'im': 'Mountains',
                   'ij': 'Young riders',
                   'ic': 'Combativity',
                   'et': 'Teams'}



This design relies on being able to pass these different tags back to the page, and I don't think that's going to work.



Looking at the POST requests, the URLs appear to be dynamically generated:



https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab
https://www.letour.fr/en/ajax/ranking/20/ipg/2d4afa3722c55ad1564caddee00f117f/subtab
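
[Editor's note: these endpoints appear to serve plain HTML fragments (the accepted approach below reads them directly), so they can be sanity-checked with one call - a quick sketch using the first URL above:]

import pandas as pd

# One of the ajax endpoints listed above; it serves an HTML fragment containing the table.
url = "https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab"
tables = pd.read_html(url)
print(len(tables), "table(s) found")
print(tables[0].head())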



Can anyone point me in the direction of the best tool to get the data I want? I've tried searching and searching this forum but I must not be using the right tags...



import requests
import pandas as pd

start_url = "https://www.letour.fr/en/rankings/stage-20"

page = requests.get(start_url)

content = page.content
#get the table
res_table = pd.read_html(content)
#Define the DataFrame
df = res_table[0]

python pandas web-scraping

edited Nov 13 '18 at 19:20
asked Nov 12 '18 at 17:17 by lweislo

  • If content is loaded dynamically then use selenium not requests
    – Chris
    Nov 12 '18 at 17:55
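
[Editor's note: an untested minimal sketch of the comment's suggestion, assuming chromedriver is installed; the CSS selector is the tab class visible in the page source above, and explicit waits may be needed for the ajax content to load:]

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.letour.fr/en/rankings/stage-19")

tables = []
for tab in driver.find_elements_by_css_selector("a.tabs__link.js-tabs-ranking"):
    tab.click()  # triggers the ajax load for this tab; a WebDriverWait may be needed here
    tables.extend(pd.read_html(driver.page_source))

driver.quit()
print(len(tables), "tables collected")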
2 Answers

You could do something like the following, where I loop over the links you provided and concat the results into a final dataframe.



import pandas as pd

links = ["https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab",
         "https://www.letour.fr/en/ajax/ranking/20/ipg/2d4afa3722c55ad1564caddee00f117f/subtab"]
final = []

for link in links:
    result = pd.read_html(link)  # each ajax endpoint returns one ranking table
    # print(result)
    final.append(result[0])

df = pd.concat(final, sort=False)
df = df.drop_duplicates()                # drop_duplicates returns a copy, so reassign it
df.index = pd.RangeIndex(len(df.index))  # renumber rows 0..n-1
print(df)
df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8')
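
[Editor's note: a small variation on the above, illustrative and not part of the original answer - passing a dict to pd.concat labels each block of rows with the classification code from its URL, so the tables stay distinguishable after combining:]

import pandas as pd

links = ["https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab",
         "https://www.letour.fr/en/ajax/ranking/20/ipg/2d4afa3722c55ad1564caddee00f117f/subtab"]

# Path segment 7 is the classification code ('itg', 'ipg', ...).
frames = {link.split('/')[7]: pd.read_html(link)[0] for link in links}
df = pd.concat(frames, names=['ranking', 'row'], sort=False)
print(df)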

edited Nov 12 '18 at 19:24, answered Nov 12 '18 at 19:19 by QHarr

  • I took another approach, the links for both stage and general classification were tucked inside an ajax tag. I am posting the solution to that, but of course it raised another question!
    – lweislo
    Nov 13 '18 at 19:04

  • Often the way :-)
    – QHarr
    Nov 13 '18 at 19:05

I am not so sure the links are dynamically generated; refreshing the page, the links appear to stay the same.



That being said, what you may want to do is extract the links based on the xpath of the <a> elements of the tabs.



So your xpath to get the links might be a dict:



links_xpath = {
    'climber': "//a[contains(@class, 'tabs__link') and contains(text(), 'Climber')]/@href",
    'points': "//a[contains(@class, 'tabs__link') and contains(text(), 'Points')]/@href",
    # etc.
}



That will extract the links, which you can then concatenate with the base URL, so your scraper keeps working regardless of the underlying link values, at least until the page layout changes.
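
[Editor's note: a minimal sketch, not the answerer's code, of how such an XPath dict could be applied with lxml; the 'points' entry is taken from the dict above:]

import requests
from lxml import html

start_url = "https://www.letour.fr/en/rankings/stage-20"
tree = html.fromstring(requests.get(start_url).content)

links_xpath = {
    'points': "//a[contains(@class, 'tabs__link') and contains(text(), 'Points')]/@href",
}
for name, xpath in links_xpath.items():
    hrefs = tree.xpath(xpath)  # the /@href step returns attribute values as strings
    if hrefs:
        print(name, "https://www.letour.fr" + hrefs[0])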

answered Nov 12 '18 at 19:45 by Dmitriy Khaykin

  • You're right! I am getting there. There's another element that has all the URLs in source code format. Much easier to deal with!
    – lweislo
    Nov 12 '18 at 20:30