Find and retrieve content from html text using BeautifulSoup
I have the following HTML (or at least I think it's HTML) that I am working on with BeautifulSoup in Python.
I have already parsed the HTML with BeautifulSoup. What I would like to do next is retrieve the content of the 'div' associated with a given data-label (for example, in the bottom part of the code, data-label="Relation"). In particular, I would like to obtain a dictionary whose key is the text of the data-label, i.e. "Relation" in my example, and whose value is the content of the associated 'div', i.e. in my example the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
I have tried several approaches but data-label, as far as I know, does not appear to be a valid attribute, so I am not sure how to handle this.
(Note that this is just an example; I will have to do the same for thousands, if not millions, of web pages with a similar structure.)
Any help is appreciated. Thank you!
<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="<http://dati.camera.it/ocd/intervento>">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="proprietà generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="<http://dati.camera.it/ocd/deputato.rdf/d15080_17>">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="<http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010>">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>
python html web-scraping beautifulsoup
asked Nov 10 at 19:08
mgiom
1 Answer
You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:
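from bs4 import BeautifulSoup as soup
import re

# `content` holds the page's HTML as a string
d = soup(content, 'html.parser').find('div', {'id': 'directs'})
# each <label> wraps an <a> that carries the data-label attribute
_labels = [i.a['data-label'] for i in d.find_all('label')]
# the value containers have class "c2 value", sometimes with trailing whitespace
_content = [i.text for i in d.find_all('div', {'class': re.compile(r'c2 value\s*')})]
result = dict(zip(_labels, _content))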
Output:
{'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
 'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
 '': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
 'type': '\n\n\n ocd:intervento\n \n\n',
 'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
 'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}
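If you would rather get cleaned values, and the href itself where the div contains a link (as the question asks for "Relation"), a variant along these lines may be worth trying. This is only a sketch under a few assumptions: the function name `extract_fields`, the `SAMPLE_HTML` snippet and its shortened `example.org` URL are made up for illustration, and it assumes every label is immediately followed by its value div as in the posted markup. It also shows that `data-label` can be queried like any other attribute by passing it through `attrs`.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the page in the question (hypothetical, shortened URL).
SAMPLE_HTML = """
<div id="directs">
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://example.org/doc#int00010" target="_blank">
http://example.org/doc#int00010
</a>
</div>
</div>
</div>
"""


def extract_fields(html):
    """Map each data-label to the href (or stripped text) of the div that follows its label."""
    container = BeautifulSoup(html, "html.parser").find("div", id="directs")
    fields = {}
    # Hyphenated attribute names cannot be keyword arguments, so pass them via attrs.
    for anchor in container.find_all("a", attrs={"data-label": True}):
        label = anchor.find_parent("label")
        # The value div is the next <div> sibling of the enclosing <label>.
        value_div = label.find_next_sibling("div") if label else None
        if value_div is None:
            continue
        link = value_div.find("a", href=True)
        # Prefer the link target when there is one, otherwise the whitespace-stripped text.
        fields[anchor["data-label"]] = (
            link["href"] if link else value_div.get_text(" ", strip=True)
        )
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Pairing each label with its own sibling div, instead of zipping two separate lists, avoids values shifting onto the wrong key if a page is missing one of the divs. Note that labels sharing the same data-label would overwrite each other in the dictionary.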
It works! Thank you so much!
– mgiom
Nov 10 at 19:42
@mgiom Glad to help!
– Ajax1234
Nov 10 at 19:42
answered Nov 10 at 19:15
Ajax1234