Writing an AND query to find matching documents within a dataset (Python)
I am trying to construct a function called 'and_query' that takes a single string of one or more words as input and returns a list of documents whose abstracts contain all of those words.
First, I put all the words in an inverted index, with the id being the id of the document and the abstract being its plain text.
    from collections import defaultdict

    inverted_index = defaultdict(set)
    for (id, abstract) in Abstracts.items():
        for term in preprocess(tokenize(abstract)):
            inverted_index[term].add(id)
Then, I wrote a query function where finals is a list of all the matching documents.
Because it should only return documents in which every word of the function parameter has a match, I used the set operation 'intersection'.
    def and_query(tokens):
        documents = set()
        finals = []
        terms = preprocess(tokenize(tokens))
        for term in terms:
            for i in inverted_index[term]:
                documents.add(i)
        for term in terms:
            temporary_set = set()
            for i in inverted_index[term]:
                temporary_set.add(i)
            finals.extend(documents.intersection(temporary_set))
        return finals
    def finals_print(finals):
        for final in finals:
            display_summary(final)

    finals_print(and_query("netherlands vaccine trial"))
However, it seems like the function is still returning documents for which only one word is in the abstract of the document.
Does anyone know what I did wrong with my set operations?
(I think the fault is somewhere in this part of the code:)
    for term in terms:
        temporary_set = set()
        for i in inverted_index[term]:
            temporary_set.add(i)
        finals.extend(documents.intersection(temporary_set))
    return finals
Thanks in advance.
Basically, what I want to do, in short:

    for word in words:
        id_set_for_one_word = set()
        for i in get_id_of_that_word[word]:
            id_set_for_one_word.add(i)

Pseudo:

    id_set_for_one_word intersection (id_set_of_other_words)
    finals.extend(set of all intersections for all words)

And then I need the intersection of the id sets of all these words, returning a set containing the ids that exist for every word in words.
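In other words, the goal can be sketched as a single intersection over per-word id sets. This is a minimal runnable sketch: the toy `tokenize`, `preprocess`, and sample `Abstracts` here are stand-ins for the real preprocessing pipeline, not the actual code.

```python
from collections import defaultdict

# Toy stand-ins for the real tokenize/preprocess pipeline (assumptions).
def tokenize(text):
    return text.split()

def preprocess(tokens):
    return [t.lower() for t in tokens]

# Hypothetical sample data standing in for the real Abstracts.
Abstracts = {
    1: "netherlands vaccine trial results",
    2: "vaccine trial in belgium",
    3: "netherlands weather report",
}

inverted_index = defaultdict(set)
for (doc_id, abstract) in Abstracts.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(doc_id)

def and_query(tokens):
    terms = preprocess(tokenize(tokens))
    # Intersect the id sets of all terms; a term with no postings yields set().
    id_sets = [inverted_index[term] for term in terms]
    return set.intersection(*id_sets) if id_sets else set()

print(and_query("netherlands vaccine trial"))  # only doc 1 has all three words
```

`set.intersection` accepts any number of sets at once, so no running `extend` list is needed at all.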
Tags: python, set, set-intersection
Could you provide some input data to be able to test the code?
– Franco Piccolo
Nov 11 at 15:48
Not really, actually. A lot of preprocessing and other operations are performed before the data is actually queried. Also, a lot of modules have to be imported to make it work. It would be a lot of work to provide that here.
– Jorian Onderwater
Nov 11 at 16:15
I updated my question with a sort of pseudocode to make it somewhat clearer what I'm trying to do.
– Jorian Onderwater
Nov 11 at 16:33
TL;DR, but if you want to 'and' several criteria so that only matching abstracts return, then I would: 1. prep in advance, outside the matchers; 2. call the matchers in sequence, passing in the list of abstracts; 3. prune non-matching abstracts within each simple matcher function. Having 'extend' is a code smell here for me.
– JL Peyret
Nov 11 at 17:34
edited Nov 11 at 16:32
asked Nov 11 at 15:03
Jorian Onderwater
3 Answers
To elaborate on my code-smell comment, here's a rough draft of what I have done before to solve this kind of problem.
    def tokenize(abstract):
        # return <set of words in abstract>
        set_ = .....
        return set_

    candidates = [(id, abstract, tokenize(abstract)) for (id, abstract) in Abstracts.items()]

    all_criterias = "netherlands vaccine trial".split()

    def searcher(candidates, criteria, match_on_found=True):
        search_results = []
        for cand in candidates:
            # cand[2] has a set of tokens or somesuch... abstract.
            if criteria in cand[2]:
                if match_on_found:
                    search_results.append(cand)
            else:
                if not match_on_found:
                    # that's an AND NOT, if you wanted that
                    search_results.append(cand)
        return search_results

    for criteria in all_criterias:
        # pass in the full list every time, but it gets progressively shrunk
        candidates = searcher(candidates, criteria)

    # what's left is what you want
    answer = [(abs[0], abs[1]) for abs in candidates]

answered Nov 11 at 17:57, edited Nov 11 at 18:02, by JL Peyret
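If it helps, here is a small self-contained version of this progressive-filter idea. The toy `tokenize` and sample `Abstracts` below are my own stand-ins, not the answerer's code; the real pipeline would normalize and stem the tokens.

```python
def tokenize(abstract):
    # Toy tokenizer: the real one would also normalize/stem.
    return set(abstract.lower().split())

# Hypothetical sample data for illustration.
Abstracts = {
    1: "netherlands vaccine trial results",
    2: "vaccine trial in belgium",
    3: "netherlands trial of a new vaccine",
}

candidates = [(doc_id, abstract, tokenize(abstract))
              for (doc_id, abstract) in Abstracts.items()]

def searcher(candidates, criteria):
    # Keep only candidates whose token set contains the criteria word.
    return [cand for cand in candidates if criteria in cand[2]]

for criteria in "netherlands vaccine trial".split():
    # Pass in the full list every time; it gets progressively shrunk.
    candidates = searcher(candidates, criteria)

answer = [(cand[0], cand[1]) for cand in candidates]
print(answer)  # docs 1 and 3 contain all three words
```

Each pass through `searcher` prunes the candidate list, so by the last criterion only abstracts matching every word remain, with no `extend` anywhere.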
Question: returns a list of matching documents for the words being in the abstracts of the documents

The term with the min number of documents always holds the result.
If a term does not exist in inverted_index, there is no match at all.
For the sake of simplicity, predefined data:
    Abstracts = {1: 'Lorem ipsum dolor sit amet,',
                 2: 'consetetur sadipscing elitr,',
                 3: 'sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,',
                 4: 'sed diam voluptua.',
                 5: 'At vero eos et accusam et justo duo dolores et ea rebum.',
                 6: 'Stet clita kasd gubergren,',
                 7: 'no sea takimata sanctus est Lorem ipsum dolor sit amet.'}

    inverted_index = {'Stet': {6}, 'ipsum': {1, 7}, 'erat,': {3}, 'ut': {3}, 'dolores': {5}, 'gubergren,': {6}, 'kasd': {6}, 'ea': {5}, 'consetetur': {2}, 'sit': {1, 7}, 'nonumy': {3}, 'voluptua.': {4}, 'est': {7}, 'elitr,': {2}, 'At': {5}, 'rebum.': {5}, 'magna': {3}, 'sadipscing': {2}, 'diam': {3, 4}, 'dolore': {3}, 'sanctus': {7}, 'labore': {3}, 'sed': {3, 4}, 'takimata': {7}, 'Lorem': {1, 7}, 'invidunt': {3}, 'aliquyam': {3}, 'accusam': {5}, 'duo': {5}, 'amet.': {7}, 'et': {3, 5}, 'sea': {7}, 'dolor': {1, 7}, 'vero': {5}, 'no': {7}, 'eos': {5}, 'tempor': {3}, 'amet,': {1}, 'clita': {6}, 'justo': {5}, 'eirmod': {3}}
    def and_query(tokens):
        print("tokens:{}".format(tokens))
        #terms = preprocess(tokenize(tokens))
        terms = tokens.split()
        term_min = None
        for term in terms:
            if term in inverted_index:
                # Find min
                if not term_min or term_min[0] > len(inverted_index[term]):
                    term_min = (len(inverted_index[term]), term)
            else:
                # Break early, if a term is not in inverted_index
                return set()
        finals = inverted_index[term_min[1]]
        print("term_min:{} inverted_index:{}".format(term_min, finals))
        return finals
    def finals_print(finals):
        if finals:
            for final in finals:
                print("Document [{}]:{}".format(final, Abstracts[final]))
        else:
            print("No matching Document found")

    if __name__ == "__main__":
        for tokens in ['sed diam voluptua.', 'Lorem ipsum dolor', 'Lorem ipsum dolor test']:
            finals_print(and_query(tokens))
            print()
Output:

    tokens:sed diam voluptua.
    term_min:(1, 'voluptua.') inverted_index:{4}
    Document [4]:sed diam voluptua.

    tokens:Lorem ipsum dolor
    term_min:(2, 'Lorem') inverted_index:{1, 7}
    Document [1]:Lorem ipsum dolor sit amet,
    Document [7]:no sea takimata sanctus est Lorem ipsum dolor sit amet.

    tokens:Lorem ipsum dolor test
    No matching Document found

Tested with Python: 3.4.2

answered Nov 11 at 20:02 by stovfl
Found the solution eventually myself.
Replacing

    finals.extend(documents.intersection(id_set_for_one_word))
    return finals

with

    documents = documents.intersection(id_set_for_one_word)
    return documents

seems to work here.
Still, thanks for all the effort, y'all.

answered Nov 12 at 9:34 by Jorian Onderwater
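Put together, the fix narrows one running result set term by term instead of extending a list. A runnable sketch, assuming the same overall setup as the question; the toy `preprocess`/`tokenize` and the two-document index here are stand-ins for the real ones.

```python
from collections import defaultdict

# Toy stand-ins for the real pipeline (assumptions).
def tokenize(text):
    return text.split()

def preprocess(tokens):
    return [t.lower() for t in tokens]

# Hypothetical two-document corpus for illustration.
inverted_index = defaultdict(set)
for doc_id, abstract in {1: "netherlands vaccine trial",
                         2: "vaccine trial"}.items():
    for term in preprocess(tokenize(abstract)):
        inverted_index[term].add(doc_id)

def and_query(tokens):
    terms = preprocess(tokenize(tokens))
    documents = set(inverted_index[terms[0]]) if terms else set()
    for term in terms[1:]:
        # Narrow the running result instead of extending a list.
        documents = documents.intersection(inverted_index[term])
    return documents

print(and_query("netherlands vaccine"))  # {1}
```

The key difference from the original is that `documents` is reassigned each iteration, so by the end it contains only ids present in every term's posting set; `extend` on a list never shrinks anything.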