Writing an AND query to find matching documents within a dataset (Python)





















I am trying to write a function called 'and_query' that takes a single string of one or more words as input and returns a list of the documents whose abstracts contain all of those words.



First, I put all the words in an inverted index that maps each term to the set of ids of the documents containing it. Abstracts is a dict with the document id as key and the plain-text abstract as value.



    from collections import defaultdict

    inverted_index = defaultdict(set)

    for (id, abstract) in Abstracts.items():
        for term in preprocess(tokenize(abstract)):
            inverted_index[term].add(id)
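
For illustration, here is a minimal sketch of what such an index looks like on toy data. The sample abstracts and the simple lowercase/split term extraction are assumptions standing in for the real Abstracts, tokenize and preprocess, which are not shown in the question.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> abstract text
sample_abstracts = {
    1: "vaccine trial in the netherlands",
    2: "vaccine safety review",
    3: "netherlands flood defences",
}

def simple_terms(text):
    # Stand-in for preprocess(tokenize(text)): lowercase and split on whitespace
    return text.lower().split()

# term -> set of ids of the documents that contain the term
sample_index = defaultdict(set)
for doc_id, abstract in sample_abstracts.items():
    for term in simple_terms(abstract):
        sample_index[term].add(doc_id)

print(sorted(sample_index["vaccine"]))      # [1, 2]
print(sorted(sample_index["netherlands"]))  # [1, 3]
```

Each value is a set, so the same id is stored only once per term and the sets can be combined directly with set operations.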


Then, I wrote a query function where finals is a list of all the matching documents.



Because it should only return documents that contain every word of the query string, I used the set operation 'intersection'.



    def and_query(tokens):
        documents = set()
        finals = []
        terms = preprocess(tokenize(tokens))

        for term in terms:
            for i in inverted_index[term]:
                documents.add(i)

        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals

    def finals_print(finals):
        for final in finals:
            display_summary(final)

    finals_print(and_query("netherlands vaccine trial"))


However, it seems like the function is still returning documents for which only 1 word is in the abstract of the document.



Does anyone know what I did wrong in my set operations?

(I think the fault is somewhere in this part of the code):



        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals


Thanks in advance



In short, this is basically what I want to do:



    for word in words:
        id_set_for_one_word = set()
        for i in get_id_of_that_word[word]:
            id_set_for_one_word.add(i)
        # pseudo:
        # id_set_for_one_word intersection (id_set_of_other_words)

    # pseudo:
    # finals.extend(set of all intersections for all words)


So I need the intersection of the id sets of all of these words: a set containing only the ids that occur for every word in words.
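
The behavior described above can be reproduced on toy data. In this sketch the index contents are hypothetical and only the two loops of and_query are traced. Because documents already holds the union of all id sets, intersecting it with the set for one term just gives that term's own set back, and extend appends every one of those sets to finals.

```python
# Hypothetical index: term -> set of document ids
toy_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1},
}
terms = ["netherlands", "vaccine", "trial"]

# First loop of the question: union of all id sets
documents = set()
for term in terms:
    documents |= toy_index[term]

# Second loop of the question: each intersection returns the term's own set
finals = []
for term in terms:
    finals.extend(documents.intersection(toy_index[term]))

print(sorted(set(finals)))  # [1, 2, 3]: every document matching at least one term

# What an AND query should return instead
expected = set.intersection(*(toy_index[term] for term in terms))
print(sorted(expected))     # [1]
```

So the list ends up behaving like an OR query with duplicates, which matches the symptom of documents being returned when only one word occurs in the abstract.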





























  • Could you provide some input data to be able to test the code?
    – Franco Piccolo
    Nov 11 at 15:48










  • Not really, actually. A lot of preprocessing and other operations are performed before the data is actually queried. Also, a lot of modules have to be imported to make it work, so it would be a lot of work to provide that here.
    – Jorian Onderwater
    Nov 11 at 16:15










  • I updated my question with some pseudocode to make it somewhat clearer what I'm trying to do.
    – Jorian Onderwater
    Nov 11 at 16:33










  • TLDR, but if you want to ‘and’ several criteria so that only abstracts matching return then I would 1. prep in advance, outside matchers. 2. call matchers in sequence, passing in the list of abstracts. 3. prune non matching abstracts within each simple matcher function. having ‘extends’ is code smell here for me.
    – JL Peyret
    Nov 11 at 17:34


















































python set set-intersection














edited Nov 11 at 16:32

























asked Nov 11 at 15:03









Jorian Onderwater










































3 Answers






























To elaborate on my code smell comment, here's a rough draft of what I have done before to solve this kind of problem.



    def tokenize(abstract):
        #return <set of words in abstract>
        set_ = .....
        return set_

    candidates = [(id, abstract, tokenize(abstract)) for (id, abstract) in Abstracts.items()]

    all_criterias = "netherlands vaccine trial".split()

    def searcher(candidates, criteria, match_on_found=True):
        search_results = []
        for cand in candidates:
            #cand[2] has a set of tokens or somesuch... abstract.
            if criteria in cand[2]:
                if match_on_found:
                    search_results.append(cand)
            elif not match_on_found:
                #that's a AND NOT if you wanted that
                search_results.append(cand)
        return search_results

    for criteria in all_criterias:
        #pass in the full list every time, but it gets progressively shrunk
        candidates = searcher(candidates, criteria)

    #whats left is what you want
    answer = [(cand[0], cand[1]) for cand in candidates]
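
A runnable variant of this draft could look like the sketch below. The sample abstracts and the simple_tokenize helper are assumptions made for illustration, since the draft leaves the tokenizer open. The idea is unchanged: each call keeps only the candidates that satisfy one criterion, so the list shrinks with every pass.

```python
def simple_tokenize(abstract):
    # Hypothetical tokenizer: the set of lowercase words in the abstract
    return set(abstract.lower().split())

def searcher(candidates, criterion, match_on_found=True):
    # Keep candidates whose token set contains the criterion (AND),
    # or, with match_on_found=False, those that do not contain it (AND NOT)
    return [cand for cand in candidates
            if (criterion in cand[2]) == match_on_found]

# Hypothetical sample data: document id -> abstract text
sample_abstracts = {
    1: "Vaccine trial in the Netherlands",
    2: "Vaccine safety review",
    3: "Netherlands flood defences",
}

candidates = [(doc_id, text, simple_tokenize(text))
              for doc_id, text in sample_abstracts.items()]

for criterion in "netherlands vaccine trial".split():
    # The candidate list shrinks with every criterion
    candidates = searcher(candidates, criterion)

answer = [(cand[0], cand[1]) for cand in candidates]
print(answer)  # [(1, 'Vaccine trial in the Netherlands')]
```

This scans every remaining abstract once per criterion, so it does not need an inverted index at all, at the cost of more work per query on large collections.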
















































Question: returns a list of matching documents for the words being in the abstracts of the documents

The term with the fewest documents bounds the result: every match must be in its document set, so start from that set and intersect it with the sets of the other terms.

If a term does not exist in inverted_index, there is no match at all.

For the sake of simplicity, predefined data:

    Abstracts = {1: 'Lorem ipsum dolor sit amet,',
                 2: 'consetetur sadipscing elitr,',
                 3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                 4: 'sed diam voluptua.',
                 5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                 6: 'Stet clita kasd gubergren,',
                 7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
                 }

    inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5}, 'gubergren,': {6},
                      'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7}, 'nonumy': {3}, 'voluptua.': {4},
                      'est': {7}, 'elitr,': {2}, 'At': {5}, 'rebum.': {5}, 'magna': {3}, 'sadipscing': {2},
                      'diam': {3, 4}, 'dolore': {3}, 'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7},
                      'Lorem': {1, 7}, 'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7},
                      'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5}, 'tempor': {3},
                      'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}

    def and_query(tokens):
        print("tokens:{}".format(tokens))
        #terms = preprocess(tokenize(tokens))
        terms = tokens.split()

        term_min = None
        for term in terms:
            if term in inverted_index:
                # Find min
                if not term_min or term_min[0] > len(inverted_index[term]):
                    term_min = (len(inverted_index[term]), term)
            else:
                # Break early, if a term is not in inverted_index
                return set()

        if term_min is None:
            # Empty query: nothing to match
            return set()

        # Start from the smallest set and intersect it with the set of every term
        finals = set(inverted_index[term_min[1]])
        for term in terms:
            finals &= inverted_index[term]
        print("term_min:{} inverted_index:{}".format(term_min, finals))
        return finals


    def finals_print(finals):
        if finals:
            for final in finals:
                print("Document [{}]:{}".format(final, Abstracts[final]))
        else:
            print("No matching Document found")

    if __name__ == "__main__":
        for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
            finals_print(and_query(tokens))
            print()

Output:

    tokens:sed diam voluptua.
    term_min:(1, 'voluptua.') inverted_index:{4}
    Document [4]:sed diam voluptua.

    tokens:Lorem ipsum dolor
    term_min:(2, 'Lorem') inverted_index:{1, 7}
    Document [1]:Lorem ipsum dolor sit amet,
    Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

    tokens:Lorem ipsum dolor test
    No matching Document found

Tested with Python: 3.4.2














































Found the solution eventually myself. Replacing

        finals.extend(documents.intersection(id_set_for_one_word))
    return finals

with

        documents = (documents.intersection(id_set_for_one_word))
    return documents

seems to work here.

Still, thanks for all the effort y'all.
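
Put together, the change above gives a function like the following sketch. The index contents and the lowercase/split term extraction are assumptions standing in for the real inverted_index and preprocess(tokenize(...)); and_query_sketch is a hypothetical name. The key difference from the question is that documents is reassigned to the narrowed set on every iteration, so an id survives only if it is present for every term.

```python
from collections import defaultdict

# Hypothetical index: term -> set of document ids
toy_index = defaultdict(set, {
    "netherlands": {1, 3},
    "vaccine": {1, 2},
    "trial": {1},
})

def and_query_sketch(query, index):
    # Stand-in for preprocess(tokenize(query))
    terms = query.lower().split()
    if not terms:
        return set()

    # Start from the union of all id sets, as in the question ...
    documents = set()
    for term in terms:
        documents |= index.get(term, set())

    # ... then narrow it down: reassign instead of extending a list
    for term in terms:
        documents = documents.intersection(index.get(term, set()))
    return documents

print(and_query_sketch("netherlands vaccine trial", toy_index))  # {1}
print(and_query_sketch("netherlands vaccine", toy_index))        # {1}
print(and_query_sketch("vaccine unknown", toy_index))            # set()
```

Using index.get with an empty-set default avoids adding unknown terms to the defaultdict as a side effect of querying. The result is a set; wrap it in sorted() or list() if a list is needed for display.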















































            edited Nov 11 at 18:02

























            answered Nov 11 at 17:57









            JL Peyret























                up vote
                0
                down vote














                Question: returns a list of matching documents for the words being in the abstracts of the documents




                The term with the min number of documents, hold always the result.

                If a term does not exists in inverted_index, gives no match at all.



                For the sake of simplicity, predefined data:



                Abstracts = 1: 'Lorem ipsum dolor sit amet,',
                2: 'consetetur sadipscing elitr,',
                3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                4: 'sed diam voluptua.',
                5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                6: 'Stet clita kasd gubergren,',
                7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',



                inverted_index = 'Stet': 6, 'ipsum': 1, 7, 'erat,': 3, 'ut': 3, 'dolores': 5, 'gubergren,': 6, 'kasd': 6, 'ea': 5, 'consetetur': 2, 'sit': 1, 7, 'nonumy': 3, 'voluptua.': 4, 'est': 7, 'elitr,': 2, 'At': 5, 'rebum.': 5, 'magna': 3, 'sadipscing': 2, 'diam': 3, 4, 'dolore': 3, 'sanctus': 7, 'labore': 3, 'sed': 3, 4, 'takimata': 7, 'Lorem': 1, 7, 'invidunt': 3, 'aliquyam': 3, 'accusam': 5, 'duo': 5, 'amet.': 7, 'et': 3, 5, 'sea': 7, 'dolor': 1, 7, 'vero': 5, 'no': 7, 'eos': 5, 'tempor': 3, 'amet,': 1, 'clita': 6, 'justo': 5, 'eirmod': 3

                def and_query(tokens):
                print("tokens:".format(tokens))
                #terms = preprocess(tokenize(tokens))
                terms = tokens.split()

                term_min = None
                for term in terms:
                if term in inverted_index:
                # Find min
                if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
                else:
                # Break early, if a term is not in inverted_index
                return set()

                finals = inverted_index[term_min[1]]
                print("term_min: inverted_index:".format(term_min, finals))
                return finals


                def finals_print(finals):
                if finals:
                for final in finals:
                print("Document []:".format(final, Abstracts[final]))
                else:
                print("No matching Document found")

                if __name__ == "__main__":
                for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
                finals_print(and_query(tokens))
                print()



                Output:



                tokens:sed diam voluptua.
                term_min:(1, 'voluptua.') inverted_index:4
                Document [4]:sed diam voluptua.

                tokens:Lorem ipsum dolor
                term_min:(2, 'Lorem') inverted_index:1, 7
                Document [1]:Lorem ipsum dolor sit amet,
                Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

                tokens:Lorem ipsum dolor test
                No matching Document found



                Tested with Python: 3.4.2






                share|improve this answer
























                  up vote
                  0
                  down vote














                  Question: returns a list of matching documents for the words being in the abstracts of the documents




The term with the fewest matching documents bounds the result: every document that matches all terms must be in that term's set, so start from that set and intersect it with the sets of the remaining terms.

If a term does not exist in inverted_index, there is no match at all.

For the sake of simplicity, predefined data:

    Abstracts = {
        1: 'Lorem ipsum dolor sit amet,',
        2: 'consetetur sadipscing elitr,',
        3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
        4: 'sed diam voluptua.',
        5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
        6: 'Stet clita kasd gubergren,',
        7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
    }

    inverted_index = {
        'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5},
        'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7},
        'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5},
        'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3},
        'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7},
        'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7},
        'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5},
        'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3},
    }

    def and_query(tokens):
        print("tokens:{}".format(tokens))
        #terms = preprocess(tokenize(tokens))
        terms = tokens.split()

        term_min = None
        for term in terms:
            if term in inverted_index:
                # Find the term with the fewest documents
                if not term_min or term_min[0] > len(inverted_index[term]):
                    term_min = (len(inverted_index[term]), term)
            else:
                # Return early if a term is not in inverted_index
                return set()

        if term_min is None:
            # Empty query: nothing to match
            return set()

        # Start from the smallest set, then keep only the documents
        # that also appear in every other term's set
        finals = set(inverted_index[term_min[1]])
        for term in terms:
            finals &= inverted_index[term]
        print("term_min:{} inverted_index:{}".format(term_min, finals))
        return finals


    def finals_print(finals):
        if finals:
            for final in finals:
                print("Document [{}]:{}".format(final, Abstracts[final]))
        else:
            print("No matching Document found")

    if __name__ == "__main__":
        for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
            finals_print(and_query(tokens))
            print()



Output:

    tokens:sed diam voluptua.
    term_min:(1, 'voluptua.') inverted_index:{4}
    Document [4]:sed diam voluptua.

    tokens:Lorem ipsum dolor
    term_min:(2, 'Lorem') inverted_index:{1, 7}
    Document [1]:Lorem ipsum dolor sit amet,
    Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

    tokens:Lorem ipsum dolor test
    No matching Document found



                  Tested with Python: 3.4.2
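The same AND query can also be written more compactly with `set.intersection`. The following is only a minimal sketch under stated assumptions, not the code from this answer: `build_index` and `and_query_compact` are hypothetical names, only a few of the abstracts are used, and plain `str.split()` stands in for the question's `preprocess(tokenize(...))`.

```python
from collections import defaultdict


def build_index(abstracts):
    """Map each whitespace-separated term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, abstract in abstracts.items():
        for term in abstract.split():
            index[term].add(doc_id)
    return index


def and_query_compact(index, query):
    """Return the ids of the documents that contain every term of the query."""
    terms = query.split()
    if not terms:
        return set()
    # .get() does not insert empty entries into a defaultdict on lookup
    postings = [index.get(term, set()) for term in terms]
    # Put the smallest set first so the intermediate results stay small
    postings.sort(key=len)
    return set(postings[0]).intersection(*postings[1:])


# Hypothetical subset of the abstracts, for illustration only
abstracts = {
    1: 'Lorem ipsum dolor sit amet,',
    4: 'sed diam voluptua.',
    6: 'Stet clita kasd gubergren,',
    7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
}
index = build_index(abstracts)

print(and_query_compact(index, 'Lorem ipsum dolor'))  # {1, 7}
print(and_query_compact(index, 'Stet Lorem'))         # set()
print(and_query_compact(index, 'Lorem test'))         # set()
```

The 'Stet Lorem' query shows why the smallest set by itself is not enough: 'Stet' occurs only in document 6, but that document does not contain 'Lorem', so the intersection is empty.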
















                    answered Nov 11 at 20:02









                    stovfl





















                        up vote
                        0
                        down vote













I eventually found the solution myself. Replacing

        finals.extend(documents.intersection(id_set_for_one_word))
    return finals

with

        documents = documents.intersection(id_set_for_one_word)
    return documents

seems to work here.

Still, thanks for all the effort, y'all.
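A short sketch of why this change appears to work, under stated assumptions: `documents` starts as the union of all the terms' sets, so `finals.extend(...)` simply appended every document of each term, which behaves like an OR query. Reassigning `documents` narrows the running set with each term instead. The miniature `inverted_index` and the name `and_query_original` below are hypothetical, and `str.split()` stands in for `preprocess(tokenize(...))`.

```python
# Hypothetical miniature index: term -> set of document ids
inverted_index = {
    'netherlands': {1, 2, 3},
    'vaccine': {2, 3, 4},
    'trial': {3, 5},
}


def and_query_original(tokens):
    """The question's logic: collects per-term matches, so it behaves like OR."""
    terms = tokens.split()
    documents = set()
    for term in terms:
        documents |= inverted_index.get(term, set())
    finals = []
    for term in terms:
        # documents is the union of all sets, so this appends every id of the term
        finals.extend(documents.intersection(inverted_index.get(term, set())))
    return finals


def and_query(tokens):
    """The fix: narrow the running set with each term."""
    terms = tokens.split()
    documents = set()
    for term in terms:
        documents |= inverted_index.get(term, set())
    for term in terms:
        documents = documents.intersection(inverted_index.get(term, set()))
    return documents


print(sorted(and_query_original('netherlands vaccine trial')))  # [1, 2, 2, 3, 3, 3, 4, 5]
print(and_query('netherlands vaccine trial'))                   # {3}
```

Note that the fixed function returns a set rather than a list; wrap the result in `list(...)` or `sorted(...)` if a list is required.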
















                            answered Nov 12 at 9:34









                            Jorian Onderwater



























