How to scrape multiple tables from a dynamic page in Python

Edit:



I found the links to everything embedded in an <a> tag under a div. Not knowing AJAX/front-end development, I'm not sure what you'd call it; it looks like this:



<a class="tabs__link js-tabs-ranking" href="it" data-ajax-stack="{&quot;itg&quot;:&quot;/en/ajax/ranking/19/itg/18f11c81b4cd83f7b82b47a88d939a9c/none&quot;,&quot;ipg&quot;:&quot;/en/ajax/ranking/19/ipg/b1c62bbc714bc8823f59f3ec1030a3d7/none&quot;,&quot;etg&quot;:&quot;/en/ajax/ranking/19/etg/5b2a3871133c7df8954b81ca884d233f/none&quot;,&quot;img&quot;:&quot;/en/ajax/ranking/19/img/03a4a10eac4baaffa954cebf29c39b1c/none&quot;,&quot;ijg&quot;:&quot;/en/ajax/ranking/19/ijg/ec301eb70c0b7df824159aaa00d79135/none&quot;,&quot;icg&quot;:&quot;/en/ajax/ranking/19/icg/81b5589ac9889472dcda9560dd23683d/none&quot;}" data-type="g" data-xtclick="ranking::tab::overall">General classification</a>
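
That data-ajax-stack attribute is HTML-escaped JSON, so instead of the chained replace() calls in the code below, json.loads could parse it directly. A minimal sketch, assuming page_html already holds the rankings page source:

import json
from bs4 import BeautifulSoup

# Sketch: BeautifulSoup decodes the &quot; entities when reading the attribute,
# leaving a plain JSON object string that json.loads turns into a dict.
soup = BeautifulSoup(page_html, "html5lib")        # page_html is assumed fetched already
link = soup.find("a", class_="js-tabs-ranking")
stack = json.loads(link["data-ajax-stack"])
# stack -> {'itg': '/en/ajax/ranking/19/itg/.../none', 'ipg': '...', ...}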


I've gotten the following code to pull the tables into dataframes for everything EXCEPT the pages that contain multiple tables and headings (types ime and ipe). I've made an if/else to try to handle those pages differently.



What I'd like to do is put each table into its own dataframe, but I keep getting a bunch of ugly HTML and an error at the end: "TypeError: 'NoneType' object is not callable". I'll keep banging away at this, but any suggestions are more than welcome!



import requests
import html5lib
import pandas as pd
from bs4 import BeautifulSoup

#type_dict = {'e':'Stage', 'g':'General Classification'}
tab_dict = {'ite':'Stage',
            'ipe':'Points',
            'ime':'Mountains',
            'ije':'Young riders',
            'ice':'Combativity',
            'ete':'Teams',
            'itg':'General Classification',
            'ipg':'Points Classification',
            'img':'Mountains Classification',
            'ijg':'Young Riders Classification',
            'icg':'Combativity Classification',
            'etg':'Teams Classification'}
#Add a user input for the URL
start_url = "https://www.letour.fr/en/rankings/stage-19"
base_url = start_url.split('/')[2]

page = requests.get(start_url)
content = page.content
r_table = pd.read_html(content)

#This worked to get the table out into a DataFrame
df = r_table[0]
#print(df['Rider'])
soup = BeautifulSoup(content, "html5lib")

all_links = soup.find_all(class_="tabs__link js-tabs-ranking")
#grabbing the block of ajax links that give URLs to various stage/GC results
for item in all_links:
    myurl = item['data-ajax-stack']
    myurl = myurl.replace('\\/', '/').replace('{', '').replace('}', '').replace('"', '')
    myurl = dict(x.split(':') for x in myurl.split(','))
    #looping through the lists of links and getting the pages
    for key, value in myurl.items():
        r_type = tab_dict[key]
        print("Getting the data for: " + r_type)
        url = ("http://" + base_url + value)
        try:
            if key == "ipe" or key == "ime":
                page = requests.get(url).content
                print(page)
                soup = BeautifulSoup(page, "html5lib")
                #heading = soup.find_all('div', class_="rankingTables__caption")
                for caption in soup.find_all('div', class_="rankingTables__caption"):
                    res_caption = caption.text.title()
                    print(res_caption)
                    res_table = pd.read_html(caption)
                    df = res_table[0]
                    print(df) #debugging, test
            else:
                page = requests.get(url).content
                soup = BeautifulSoup(page, "html5lib")
                res_table = pd.read_html(page)
                df = res_table[0]
                print(df) #debugging/test
        except ValueError:
            print("No table found for " + key)
            break
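
One likely culprit for that TypeError: pd.read_html() is being handed a BeautifulSoup Tag (the caption div) rather than HTML text. A minimal sketch of a fix, assuming each rankingTables__caption div is followed by its <table> in the AJAX response:

for caption in soup.find_all('div', class_="rankingTables__caption"):
    table = caption.find_next('table')     # assumption: the table follows its caption
    if table is not None:
        df = pd.read_html(str(table))[0]   # read_html wants markup text, not a Tag
        print(caption.text.title())
        print(df)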



I am relatively new to Python, and am using a web scraping project to learn more. I am stuck on a problem trying to get multiple blocks of tabular data from a dynamic web page. Is there an easy way to get the tables generated from the various clicks on this page?



My code below works on the results that come up on the default load, but I want to be able to loop through the tabs and grab them all into the same dataframe.



Looking at the source code, there is one tag whose data-current-type and data-current-tab attributes change depending on the tab you click. I thought about making a dictionary for each:

data-current-type: {'e': 'Stage', 'g': 'General Classification'}
data-current-tab:  {'it': 'Individual Classification',
                    'ip': 'Points',
                    'im': 'Mountains',
                    'ij': 'Young riders',
                    'ic': 'Combativity',
                    'et': 'Teams'}



This design relies on being able to pass these different tags back to the page, and I don't think that's going to work.



Looking at the requests the page makes, the URLs are dynamically generated:



https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab
https://www.letour.fr/en/ajax/ranking/20/ipg/2d4afa3722c55ad1564caddee00f117f/subtab



Can anyone point me in the direction of the best tool to get the data I want? I've tried searching and searching this forum but I must not be using the right tags...



import requests
import pandas as pd

start_url = "https://www.letour.fr/en/rankings/stage-20"

page = requests.get(start_url)

content = page.content
#get the table
res_table = pd.read_html(content)
#Define the DataFrame
df = res_table[0]

python pandas web-scraping

asked Nov 12 '18 at 17:17, edited Nov 13 '18 at 19:20
– lweislo

  • If content is loaded dynamically then use selenium not requests
    – Chris
    Nov 12 '18 at 17:55

2 Answers

You could do something like the following, where I loop the links you provided and concat the results into a final dataframe.



import pandas as pd

links = ["https://www.letour.fr/en/ajax/ranking/20/itg/8c7d5ddc44042219f544306cab96c718/subtab",
         "https://www.letour.fr/en/ajax/ranking/20/ipg/2d4afa3722c55ad1564caddee00f117f/subtab"]
final = []

for link in links:
    result = pd.read_html(link)      # read_html fetches the URL and parses its tables
    # print(result)
    header = result[0][0:0]
    final.append(result[0][0:])

df = pd.concat(final, sort=False)
df = df.drop_duplicates()            # drop_duplicates returns a new frame
df.index = pd.RangeIndex(len(df.index))
print(df)
df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8')
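
If you would rather keep each ranking in its own dataframe, as the question's edit mentions, one variant (a sketch under the same assumptions) keys the frames by the tab code embedded in each URL:

# 'itg', 'ipg', ... is the third-from-last path segment of each AJAX URL
frames = {link.split('/')[-3]: pd.read_html(link)[0] for link in links}
print(frames['itg'].head())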

answered Nov 12 '18 at 19:19, edited Nov 12 '18 at 19:24
– QHarr

  • I took another approach, the links for both stage and general classification were tucked inside an ajax tag. I am posting the solution to that, but of course it raised another question!
    – lweislo
    Nov 13 '18 at 19:04

  • Often the way :-)
    – QHarr
    Nov 13 '18 at 19:05
I am not so sure the links are dynamically generated; refreshing the page, the links appear to stay the same.



That being said, what you may want to do is extract the links based on the xpath of the <a> elements of the tabs.



So your xpath to get the links might be a dict:



links_xpath = {
    'climber': "//a[contains(@class, 'tabs__link') and contains(text(), 'Climber')]/@href",
    'points':  "//a[contains(@class, 'tabs__link') and contains(text(), 'Points')]/@href",
    # etc.
}



That will extract the link, which you can then concatenate with the base URL, so your scraper keeps working regardless of the underlying link, at least until the page layout changes.
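
For illustration, a minimal sketch of applying those XPaths with lxml; the anchor-text labels ('Climber', 'Points') are assumptions about what the live page shows:

import requests
from lxml import html

# Parse the rankings page and apply each XPath from links_xpath above
tree = html.fromstring(requests.get("https://www.letour.fr/en/rankings/stage-20").content)
for name, xpath in links_xpath.items():
    hrefs = tree.xpath(xpath)          # the trailing /@href yields href strings directly
    if hrefs:
        print(name, "https://www.letour.fr" + hrefs[0])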

answered Nov 12 '18 at 19:45
– Dmitriy Khaykin

  • You're right! I am getting there. There's another element that has all the URLs in source code format. Much easier to deal with!
    – lweislo
    Nov 12 '18 at 20:30