Need to scrape the data using BeautifulSoup

I need to get the celebrity details from https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php

Input filters: "Time of birth: known only", and every profession category except world events, which gives about 22,822 celebrities. I am able to get the first-page data using urllib2 and bs4:



import re
import urllib2
from bs4 import BeautifulSoup

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
data = "sexe=M|F&categorie[0]=0|1|2|3|4|5|6|7|8|9|10|11|12&connue=1&pays=-1&tri=0&x=33&y=13"

# POST the search form and parse the first page of results
fp = urllib2.urlopen(url, data)
soup = BeautifulSoup(fp, 'html.parser')
from_div = soup.find_all('div', attrs={'class': 'titreFiche'})

# each titreFiche div holds one celebrity's name and profile link
for major in from_div:
    name = re.findall(r'portrait">(.*?)<br/>', str(major))
    link = re.findall(r'<a href="(.*?)"', str(major))
    print name[0], link[0]


For the next 230 pages, I am unable to get the data. I tried changing the page parameter in the URL up to the last page, but I can't scrape anything. Is there any way to get the remaining data from that page?

python-2.7 web-scraping beautifulsoup

edited Nov 13 '18 at 14:15 by ewwink
asked Nov 13 '18 at 14:11 by Aravindh Thirumaran

  • Who gave me a negative score? Please tell me why.

    – Aravindh Thirumaran, Nov 14 '18 at 6:27

1 Answer

You need session cookies; use requests to keep the session easily:



from bs4 import BeautifulSoup
import requests, re

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
# search filters, taken from the form data in the question
searchData = {
    'sexe': 'M|F',
    'categorie[0]': '0|1|2|3|4|5|6|7|8|9|10|11|12',
    'connue': '1',
    'pays': '-1',
    'tri': '0',
    'x': '33',
    'y': '13'
}
session = requests.session()

def doSearch(url, data=None):
    # POST the search form on the first call, plain GET afterwards
    if data:
        fp = session.post(url, data=data).text
    else:
        fp = session.get(url).text
    soup = BeautifulSoup(fp, 'html.parser')
    from_div = soup.find_all('div', attrs={'class': 'titreFiche'})

    for major in from_div:
        name = re.findall(r'portrait">(.*?)<br/>', str(major))
        link = re.findall(r'<a href="(.*?)"', str(major))
        print name[0], link[0]

# do POST search in first request
doSearch(url, searchData)

# we have a session, so we can use GET requests for the next pages
for index in range(2, 4):  # get pages 2 to 3
    print('getting page: %s' % index)
    pageurl = '%s?page=%s' % (url, index)
    print(pageurl)
    doSearch(pageurl)
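
The shared session object is what makes this work: requests stores the cookies set by the first POST and sends them with every subsequent GET, so the site can tie each ?page=N request back to the original search. The bare urllib2.urlopen calls in the question send no cookies, which is presumably why pages beyond the first came back empty.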





answered Nov 13 '18 at 15:09 by ewwink

  • Thanks a lot, man. You are awesome. It worked fine for me. Again, thanks, thanks and thanks.

    – Aravindh Thirumaran, Nov 14 '18 at 5:49

  • Why did my question get -1 in the score? What is the problem with my question?

    – Aravindh Thirumaran, Nov 14 '18 at 5:51

  • I don't know, but I didn't downvote your question. And you're welcome.

    – ewwink, Nov 14 '18 at 8:22

  • Boss, I am getting a connection error on the line fp = session.get(url).text: raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",)). I googled it but couldn't resolve it; some suggest changing https to http, others suggest adding a header. I tried both ways, but no luck. Is there any other way? Also, you used only 2 pages; I want to go until the last page. Did it work fine for you for the next 200 pages?

    – Aravindh Thirumaran, Nov 14 '18 at 11:34

  • It worked, for example, with for index in range(200, 204):. It could be the server dropping the connection because your requests are too fast; try adding a sleep between requests.

    – ewwink, Nov 14 '18 at 12:45
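
Building on that last comment, here is a minimal sketch of the full crawl with a delay between requests and a simple retry for dropped connections. It reuses doSearch, url, searchData, and requests from the answer above; LAST_PAGE follows from the question's "next 230 pages", and the sleep durations are assumptions to tune.

import time

LAST_PAGE = 231  # first page + the "next 230 pages" from the question

doSearch(url, searchData)  # POST once so the session holds the search filters

for index in range(2, LAST_PAGE + 1):
    pageurl = '%s?page=%s' % (url, index)
    for attempt in range(3):  # retry a dropped connection a few times
        try:
            doSearch(pageurl)
            break
        except requests.exceptions.ConnectionError:
            print('connection dropped, retrying %s' % pageurl)
            time.sleep(5)  # assumed back-off before retrying
    time.sleep(1)  # assumed polite delay between pages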









