Writing an AND query to find matching documents within a dataset (Python)

I am trying to write a function called 'and_query' that takes a single string of one or more words as input and returns a list of the documents whose abstracts contain all of those words.

First, I put all the words in an inverted index that maps each word to the ids of the documents containing it. In Abstracts, the key is the id of the document and the value is the abstract as plain text.

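from collections import defaultdict

inverted_index = defaultdict(set)

# Map every preprocessed term to the set of document ids that contain it
for (id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(id)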
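To make the index concrete, here is a minimal self-contained sketch. The `abstracts` dictionary, the whitespace `tokenize` and the `build_inverted_index` helper are assumptions made up for illustration; the real `Abstracts`, `tokenize` and `preprocess` from the question are not shown in the post.

```python
from collections import defaultdict

# Hypothetical stand-in for the asker's Abstracts dictionary.
abstracts = {
    1: "Vaccine trial in the Netherlands",
    2: "Vaccine safety review",
    3: "Netherlands trial of a new vaccine",
}


def tokenize(text):
    # Assumed behaviour: lowercase the text and split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to the set of document ids whose text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


inverted_index = build_inverted_index(abstracts)
print(sorted(inverted_index["vaccine"]))  # [1, 2, 3]
print(sorted(inverted_index["trial"]))    # [1, 3]
```

Each value is a set, so the set operations discussed in the rest of the question apply to it directly.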

Then I wrote a query function, where finals is a list of all the matching documents.

Because it should only return documents in which every word of the query has a match, I used the set operation 'intersection'.

def and_query(tokens):
    documents = set()
    finals = []
    terms = preprocess(tokenize(tokens))

    for term in terms:
        for i in inverted_index[term]:
            documents.add(i)

    for term in terms:
        temporary_set = set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals

def finals_print(finals):
    for final in finals:
        display_summary(final)

finals_print(and_query("netherlands vaccine trial"))

However, the function still returns documents in which only one of the words appears in the abstract.

Does anyone know what I did wrong with my set operations?

(I think the fault is somewhere in this part of the code):

for term in terms:
    temporary_set = set()
    for i in inverted_index[term]:
        temporary_set.add(i)
    finals.extend(documents.intersection(temporary_set))
return finals
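A small trace may help show why this loop returns single-word matches. The posting sets below are invented for illustration. Since documents is the union of all posting sets, intersecting it with any one term's set just gives that term's set back, and extend appends it to the list.

```python
# Toy posting sets (assumed data, not the asker's real index).
inverted_index = {
    "netherlands": {1, 3},
    "vaccine": {1, 2, 3},
    "trial": {1, 3, 4},
}
terms = ["netherlands", "vaccine", "trial"]

# First loop of the question: documents becomes the union of all sets.
documents = set()
for term in terms:
    documents |= inverted_index[term]

# Second loop: union & posting set == posting set, for every term.
finals = []
for term in terms:
    per_term = documents.intersection(inverted_index[term])
    assert per_term == inverted_index[term]
    finals.extend(per_term)

print(sorted(set(finals)))  # [1, 2, 3, 4]: the union, not the intersection
print(len(finals))          # 8: ids repeat once per term they match
```

Documents 2 and 4 each match only one or two of the three terms, yet both end up in finals.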

Thanks in advance

Basically, what I want to do, in short:

for word in words:
    id_set_for_one_word = set()
    for i in get_id_of_that_word[word]:
        id_set_for_one_word.add(i)
    # pseudo:
    id_set_for_one_word intersection (id_set_of_other_words)

finals.extend(set of all intersections for all words)

In other words, I need the intersection of the id sets of all these words: a set containing only the ids that occur for every word in words.
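The pseudocode above can be sketched with set.intersection, which accepts any number of sets. The function name `and_query_ids` and the toy index are assumptions for illustration, and preprocessing is left out.

```python
def and_query_ids(terms, index):
    """Return the ids that appear in the posting set of every term."""
    if not terms:
        return set()
    # .get avoids inserting unknown terms when index is a defaultdict.
    posting_sets = [index.get(term, set()) for term in terms]
    return set.intersection(*posting_sets)


# Toy index (assumed data).
index = {"netherlands": {1, 3}, "vaccine": {1, 2, 3}, "trial": {1, 3, 4}}

print(sorted(and_query_ids(["netherlands", "vaccine", "trial"], index)))  # [1, 3]
print(and_query_ids(["vaccine", "unknown"], index))  # set()
```

An unknown term contributes an empty set, so the whole query comes back empty, which is what an AND over the terms should do.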

edited Nov 11 at 16:32

asked Nov 11 at 15:03

Jorian Onderwater

Could you provide some input data to be able to test the code?
– Franco Piccolo
Nov 11 at 15:48

Not really, actually. A lot of preprocessing and other operations are performed before the data is actually queried, and a lot of modules have to be imported to make it work. It would be a lot of work to provide that here.
– Jorian Onderwater
Nov 11 at 16:15

I updated my question with a sort of pseudocode to make it somewhat clearer what I'm trying to do.
– Jorian Onderwater
Nov 11 at 16:33

TLDR, but if you want to ‘and’ several criteria so that only matching abstracts are returned, then I would: 1. prep in advance, outside the matchers; 2. call the matchers in sequence, passing in the list of abstracts; 3. prune non-matching abstracts within each simple matcher function. Having ‘extend’ is a code smell here for me.
– JL Peyret
Nov 11 at 17:34

python set set-intersection

3 Answers

To elaborate on my code smell comment, here's a rough draft of what I have done before to solve this kind of problem.

def tokenize(abstract):
    # return <set of words in abstract>
    set_ = .....
    return set_

# one (id, abstract, token set) tuple per document
candidates = [(id, abstract, tokenize(abstract)) for id, abstract in Abstracts.items()]


all_criterias = "netherlands vaccine trial".split()


def searcher(candidates, criteria, match_on_found=True):

    search_results = []
    for cand in candidates:
        # cand[2] holds the set of tokens for the abstract
        if criteria in cand[2]:
            if match_on_found:
                search_results.append(cand)
        elif not match_on_found:
            # that's an AND NOT, if you wanted that
            search_results.append(cand)
    return search_results


for criteria in all_criterias:
    # pass in the full list every time, but it gets progressively shrunk
    candidates = searcher(candidates, criteria)

# what's left is what you want
answer = [(cand[0], cand[1]) for cand in candidates]
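For readers who want to run the idea, here is a minimal sketch of the same progressive filtering. The whitespace tokenizer and the three sample abstracts are assumptions standing in for the elided tokenize body and the real Abstracts data.

```python
def tokenize(abstract):
    # Assumed tokenizer: lowercase, whitespace-separated words as a set.
    return set(abstract.lower().split())


def searcher(candidates, criteria, match_on_found=True):
    """Keep candidates whose token set contains criteria.

    With match_on_found=False, keep the ones that lack it (AND NOT).
    """
    return [c for c in candidates if (criteria in c[2]) == match_on_found]


# Hypothetical sample data.
abstracts = {
    1: "vaccine trial in the netherlands",
    2: "vaccine safety review",
    3: "vaccine trial in belgium",
}

candidates = [(doc_id, text, tokenize(text)) for doc_id, text in abstracts.items()]

# AND: every pass shrinks the candidate list.
for criteria in ["vaccine", "trial"]:
    candidates = searcher(candidates, criteria)

# AND NOT: drop candidates that mention the term.
candidates = searcher(candidates, "netherlands", match_on_found=False)

print([c[0] for c in candidates])  # [3]
```

This scans every remaining abstract once per term, so it trades the speed of an inverted index for simplicity.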

edited Nov 11 at 18:02

answered Nov 11 at 17:57

JL Peyret

Question: returns a list of matching documents for the words being in the abstracts of the documents

The term with the fewest documents bounds the result: every matching document must be in its set, so start from that set and intersect it with the sets of the other terms.

If a term does not exist in inverted_index, there is no match at all.

For the sake of simplicity, predefined data:

Abstracts = {
    1: 'Lorem ipsum dolor sit amet,',
    2: 'consetetur sadipscing elitr,',
    3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
    4: 'sed diam voluptua.',
    5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
    6: 'Stet clita kasd gubergren,',
    7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.',
}

inverted_index = {
    'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5},
    'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7},
    'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5},
    'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3},
    'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7},
    'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7},
    'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7},
    'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5},
    'eirmod': {3},
}

def and_query(tokens):
    print("tokens:{}".format(tokens))
    # terms = preprocess(tokenize(tokens))
    terms = tokens.split()

    term_min = None
    for term in terms:
        if term in inverted_index:
            # Track the term with the fewest documents
            if not term_min or term_min[0] > len(inverted_index[term]):
                term_min = (len(inverted_index[term]), term)
        else:
            # Return early if a term is not in inverted_index
            return set()

    if term_min is None:
        # Empty query: nothing to match
        return set()

    # Start from the smallest set and intersect with every term's set
    finals = set(inverted_index[term_min[1]])
    for term in terms:
        finals &= inverted_index[term]
    print("term_min:{} inverted_index:{}".format(term_min, finals))
    return finals


def finals_print(finals):
    if finals:
        for final in finals:
            print("Document [{}]:{}".format(final, Abstracts[final]))
    else:
        print("No matching Document found")

if __name__ == "__main__":
    for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
        finals_print(and_query(tokens))
        print()

Output:

tokens:sed diam voluptua.
term_min:(1, 'voluptua.') inverted_index:{4}
Document [4]:sed diam voluptua.

tokens:Lorem ipsum dolor
term_min:(2, 'Lorem') inverted_index:{1, 7}
Document [1]:Lorem ipsum dolor sit amet,
Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

tokens:Lorem ipsum dolor test
No matching Document found

Tested with Python: 3.4.2
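One caveat worth checking with the sample index: the smallest posting set is only an upper bound on the result. 'sed' maps to documents 3 and 4 and 'et' to documents 3 and 5, but only document 3 contains both. The sketch below, with a hypothetical helper name and a three-term excerpt of the index, sorts the sets by size and intersects them in that order.

```python
def and_query_smallest_first(terms, index):
    """Intersect posting sets, starting from the smallest one."""
    posting_sets = []
    for term in terms:
        if term not in index:
            # A missing term means no document can match every term.
            return set()
        posting_sets.append(index[term])
    if not posting_sets:
        return set()

    posting_sets.sort(key=len)
    result = set(posting_sets[0])
    for other in posting_sets[1:]:
        result &= other
        if not result:
            break  # already empty, further intersections cannot add ids
    return result


# Excerpt of the sample index above.
index = {"sed": {3, 4}, "et": {3, 5}, "diam": {3, 4}}

print(and_query_smallest_first(["sed", "et"], index))  # {3}
```

Starting from the smallest set keeps the intermediate results small, which is where the size ordering pays off.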

answered Nov 11 at 20:02

stovfl

I eventually found the solution myself. Replacing

    finals.extend(documents.intersection(id_set_for_one_word))
return finals

with

    documents = documents.intersection(id_set_for_one_word)
return documents

seems to work here: each pass now narrows documents to the ids shared with the current word, instead of appending every word's ids to a list.

Still, thanks for all the effort, y'all.
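Put together, the change might look like the sketch below. The function name, the explicit index parameter, the toy index and the omission of preprocess/tokenize are assumptions made so the example runs on its own.

```python
def and_query_fixed(terms, inverted_index):
    """Return ids of documents that contain every term."""
    # Start from the union of all posting sets, as in the question.
    documents = set()
    for term in terms:
        documents |= inverted_index.get(term, set())

    # Narrow the running set once per term instead of extending a list.
    for term in terms:
        id_set_for_one_word = inverted_index.get(term, set())
        documents = documents.intersection(id_set_for_one_word)
    return documents


# Toy index (assumed data).
toy_index = {"netherlands": {1, 3}, "vaccine": {1, 2, 3}, "trial": {1, 3, 4}}

print(sorted(and_query_fixed(["netherlands", "vaccine", "trial"], toy_index)))  # [1, 3]
```

The function now returns a set rather than a list, so a caller that needs a stable order would have to sort the result.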

answered Nov 12 at 9:34

Jorian Onderwater
