Need to scrape the data using BeautifulSoup

I need to get the celebrity details from https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php

Input filters: "Time of birth: known only", and every profession category except world events, which gives about 22,822 celebrities. I am able to get the first-page data using urllib2 and bs4:



import re
import urllib2
from bs4 import BeautifulSoup

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
data = "sexe=M|F&categorie[0]=0|1|2|3|4|5|6|7|8|9|10|11|12&connue=1&pays=-1&tri=0&x=33&y=13"

# POST the search form and parse the first page of results
fp = urllib2.urlopen(url, data)
soup = BeautifulSoup(fp, 'html.parser')
from_div = soup.find_all('div', attrs={'class': 'titreFiche'})

# each titreFiche div holds one celebrity's name and profile link
for major in from_div:
    name = re.findall(r'portrait">(.*?)<br/>', str(major))
    link = re.findall(r'<a href="(.*?)"', str(major))
    print name[0], link[0]


For the next 230 pages, I am unable to get the data. I tried changing the page parameter in the URL up to the last page, but I can't scrape anything. Is there any way to get the remaining data from that page?

python-2.7 web-scraping beautifulsoup

edited Nov 13 '18 at 14:15 by ewwink
asked Nov 13 '18 at 14:11 by Aravindh Thirumaran

  • Who gave me a negative score? Please tell me why.

    – Aravindh Thirumaran, Nov 14 '18 at 6:27

1 Answer

You need session cookies; use requests to keep the session easily:



from bs4 import BeautifulSoup
import requests, re

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
# search filters, taken from the form data in the question
searchData = {
    'sexe': 'M|F',
    'categorie[0]': '0|1|2|3|4|5|6|7|8|9|10|11|12',
    'connue': '1',
    'pays': '-1',
    'tri': '0',
    'x': '33',
    'y': '13'
}
session = requests.session()

def doSearch(url, data=None):
    # POST the search form on the first call, plain GET afterwards
    if data:
        fp = session.post(url, data=data).text
    else:
        fp = session.get(url).text
    soup = BeautifulSoup(fp, 'html.parser')
    from_div = soup.find_all('div', attrs={'class': 'titreFiche'})

    for major in from_div:
        name = re.findall(r'portrait">(.*?)<br/>', str(major))
        link = re.findall(r'<a href="(.*?)"', str(major))
        print name[0], link[0]

# do POST search in first request
doSearch(url, searchData)

# we have a session, so we can use GET requests for the next pages
for index in range(2, 4):  # get pages 2 to 3
    print('getting page: %s' % index)
    pageurl = '%s?page=%s' % (url, index)
    print(pageurl)
    doSearch(pageurl)
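
The shared session object is what makes this work: requests stores the cookies set by the first POST and sends them with every subsequent GET, so the site can tie each ?page=N request back to the original search. The bare urllib2.urlopen calls in the question send no cookies, which is presumably why pages beyond the first came back empty.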





answered Nov 13 '18 at 15:09 by ewwink

  • Thanks a lot, man. You are awesome. It worked fine for me. Again, thanks, thanks and thanks.

    – Aravindh Thirumaran, Nov 14 '18 at 5:49

  • Why did my question get -1 in the score? What is the problem with my question?

    – Aravindh Thirumaran, Nov 14 '18 at 5:51

  • I don't know, but I didn't downvote your question. And you're welcome.

    – ewwink, Nov 14 '18 at 8:22

  • Boss, I am getting a connection error on the line fp = session.get(url).text: raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",)). I googled it but couldn't resolve it; some suggest changing https to http, others suggest adding a header. I tried both ways, but no luck. Is there any other way? Also, you used only 2 pages; I want to go until the last page. Did it work fine for you for the next 200 pages?

    – Aravindh Thirumaran, Nov 14 '18 at 11:34

  • It worked, for example, with for index in range(200, 204):. It could be the server dropping the connection because your requests are too fast; try adding a sleep between requests.

    – ewwink, Nov 14 '18 at 12:45
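
Building on that last comment, here is a minimal sketch of the full crawl with a delay between requests and a simple retry for dropped connections. It reuses doSearch, url, searchData, and requests from the answer above; LAST_PAGE follows from the question's "next 230 pages", and the sleep durations are assumptions to tune.

import time

LAST_PAGE = 231  # first page + the "next 230 pages" from the question

doSearch(url, searchData)  # POST once so the session holds the search filters

for index in range(2, LAST_PAGE + 1):
    pageurl = '%s?page=%s' % (url, index)
    for attempt in range(3):  # retry a dropped connection a few times
        try:
            doSearch(pageurl)
            break
        except requests.exceptions.ConnectionError:
            print('connection dropped, retrying %s' % pageurl)
            time.sleep(5)  # assumed back-off before retrying
    time.sleep(1)  # assumed polite delay between pages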









