Find and retrieve content from html text using BeautifulSoup





















I have the following HTML code (or at least I think it's HTML) that I am working on with BeautifulSoup in Python.



I have parsed the HTML with BeautifulSoup without problems. Next, I would like to retrieve the content associated with each element that carries a given data-label (for example, data-label="Relation" near the bottom of the code). Specifically, I would like to build a dictionary whose key is the text of the data-label ("Relation" in my example) and whose value is the content of the corresponding 'div', which in my example is the href "http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"



I have tried several approaches but data-label, as far as I know, does not appear to be a valid attribute, so I am not sure how to handle this.



(Note that this is just one example; I will have to do the same for thousands, if not millions, of web pages with a similar structure.)



Any help is appreciated. Thank you!






<div id="directs">
<label class="c1"><a data-comment="A human-readable name for the subject." data-label="label" href="http://www.w3.org/2000/01/rdf-schema#label">
rdfs:<span>label</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:string</span>
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="A name given to the resource." data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-comment="" data-label="" href="http://lod.xdams.org/ontologies/ods/modified">
ods:<span>modified</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
<span class="dType">xsd:dateTime</span>
2016-07-05T12:26:02Z
</div>
</div>
</div>
<label class="c1"><a data-comment="The subject is an instance of a class." data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento" title="&lt;http://dati.camera.it/ocd/intervento&gt;">
ocd:intervento
</a>
</div>
</div>
<label class="c1"><a data-comment="propriet generica utilizzata per puntare alla risorsa deputato in vari punti dell'ontologia" data-label="rierimento a deputato" href="http://dati.camera.it/ocd/rif_deputato">
ocd:<span>rif_deputato</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/deputato.rdf/d15080_17" title="&lt;http://dati.camera.it/ocd/deputato.rdf/d15080_17&gt;">
http://dati.camera.it/ocd/deputato.rdf/d15080_17
</a>
</div>
</div>
<label class="c1"><a data-comment="A related resource." data-label="Relation" href="http://purl.org/dc/elements/1.1/relation">
dc:<span>relation</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" " href="http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010"
target="_blank" title="&lt;http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010&gt;">
http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010
</a>
</div>
</div>
</div>












      python html web-scraping beautifulsoup
















      asked Nov 10 at 19:08









      mgiom

      226


























          1 Answer



          accepted










          You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:



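          from bs4 import BeautifulSoup as soup
          import re
          # content holds the page's HTML source as a string
          d = soup(content, 'html.parser').find('div', {'id': 'directs'})
          # each <label> wraps an <a> whose data-label attribute becomes a key
          _labels = [i.a['data-label'] for i in d.find_all('label')]
          # the value divs have class "c2 value", sometimes with trailing whitespace
          _content = [i.text for i in d.find_all('div', {'class': re.compile(r'c2 value\s*')})]
          result = dict(zip(_labels, _content))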


          Output:



          {'label': '\n\n\nxsd:string \n intervento di Fabrizio CICCHITTO\n \n\n',
          'Title': '\n\n\n intervento di Fabrizio CICCHITTO\n \n\n',
          '': '\n\n\nxsd:dateTime \n 2016-07-05T12:26:02Z\n \n\n',
          'type': '\n\n\n ocd:intervento\n \n\n',
          'rierimento a deputato': '\n\n\n http://dati.camera.it/ocd/deputato.rdf/d15080_17\n \n\n',
          'Relation': '\n\n\n http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010\n \n\n'}
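
          If you would rather get the href itself (as asked in the question) and values without the surrounding whitespace, one possible variation is to walk from each data-label anchor to the value div that follows its label. This is only a sketch: the function name extract_directs and the shortened sample markup are made up for illustration, and it assumes every label is immediately followed by a sibling div with the class "value", as in the snippet from the question.

```python
from bs4 import BeautifulSoup


def extract_directs(html):
    """Map each data-label to the href (or, failing that, the text) of its value div.

    Hypothetical helper: assumes the <div id="directs"> layout from the question.
    """
    container = BeautifulSoup(html, "html.parser").find("div", id="directs")
    result = {}
    if container is None:
        return result
    # attrs={"data-label": True} matches any <a> that has the attribute,
    # including the ones where it is an empty string.
    for anchor in container.find_all("a", attrs={"data-label": True}):
        label = anchor.find_parent("label")
        if label is None:
            continue
        # The value lives in the next sibling <div> whose class list contains "value".
        value_div = label.find_next_sibling("div", class_="value")
        if value_div is None:
            continue
        link = value_div.find("a", href=True)
        if link is not None:
            value = link["href"]
        else:
            # Collapse newlines and runs of spaces into single spaces.
            value = " ".join(value_div.get_text().split())
        result[anchor["data-label"]] = value
    return result


# Shortened sample in the same shape as the markup from the question.
sample = """
<div id="directs">
<label class="c1"><a data-label="Title" href="http://purl.org/dc/elements/1.1/title">
dc:<span>title</span>
</a></label>
<div class="c2 value ">
<div class="toMultiLine ">
<div class="fixed">
intervento di Fabrizio CICCHITTO
</div>
</div>
</div>
<label class="c1"><a data-label="type" href="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
rdf:<span>type</span>
</a></label>
<div class="c2 value">
<div class="toOneLine">
<a class=" isLocal" href="http://dati.camera.it/ocd/intervento">
ocd:intervento
</a>
</div>
</div>
</div>
"""

print(extract_directs(sample))
```

          Pairing each label with its own sibling div, instead of zipping two separate lists, should also keep keys and values aligned on pages where a label happens to have no value div. Note that for the typed values (xsd:string, xsd:dateTime) the type name stays part of the text with this version.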

























          • It works! Thank you so much!
            – mgiom
            Nov 10 at 19:42










          • @mgiom Glad to help!
            – Ajax1234
            Nov 10 at 19:42










          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













           

          draft saved


          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53242467%2ffind-and-retrieve-content-from-html-text-using-beautifulsoup%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes








          up vote
          0
          down vote



          accepted










          You can find the data-labels in one pass and the div content in another. Then, the results can be zipped together to create the dictionary:



          from bs4 import BeautifulSoup as soup
          import re
          d = soup(content, 'html.parser').find('div', 'id':'directs')
          _labels = [i.a['data-label'] for i in d.find_all('label')]
          _content = [i.text for i in d.find_all('div', 'class':re.compile('c2 values*'))]
          result = dict(zip(_labels, _content))


          Output:



          'label': 'nnnxsd:string n intervento di Fabrizio CICCHITTOn nn', 
          'Title': 'nnn intervento di Fabrizio CICCHITTOn nn',
          '': 'nnnxsd:dateTime n 2016-07-05T12:26:02Zn nn',
          'type': 'nnn ocd:interventon nn',
          'rierimento a deputato': 'nnn http://dati.camera.it/ocd/deputato.rdf/d15080_17n nn',
          'Relation': 'nnn http://documenti.camera.it/apps/commonServices/getDocumento.ashx?sezione=bollettini=comunicato=17=2016=06=14=03=data.20160614.com03.bollettino.sede00020.tit00010.int00010=data.20160614.com03.bollettino.sede00020.tit00010.int00010#data.20160614.com03.bollettino.sede00020.tit00010.int00010n nn'





          share|improve this answer




















          • It works! Thank you so much!
            – mgiom
            Nov 10 at 19:42










          • @mgiom Glad to help!
            – Ajax1234
            Nov 10 at 19:42














          answered Nov 10 at 19:15









          Ajax1234





























           
