Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram
Suppose there is the following mapping with an Edge NGram tokenizer:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete_analyzer": {
              "tokenizer": "autocomplete_tokenizer",
              "filter": ["standard"]
            },
            "autocomplete_search": {
              "tokenizer": "whitespace"
            }
          },
          "tokenizer": {
            "autocomplete_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": ["letter", "symbol"]
            }
          }
        }
      },
      "mappings": {
        "tag": {
          "properties": {
            "id": { "type": "long" },
            "name": {
              "type": "text",
              "analyzer": "autocomplete_analyzer",
              "search_analyzer": "autocomplete_search"
            }
          }
        }
      }
    }

And the following documents are indexed:

    POST /tag/tag/_bulk
    {"index": {}}
    {"name": "HITS FIND SOME"}
    {"index": {}}
    {"name": "TRENDING HI"}
    {"index": {}}
    {"name": "HITS OTHER"}

Then searching

    {
      "query": {
        "match": {
          "name": {
            "query": "HI"
          }
        }
      }
    }

yields all three documents with the same score, or TRENDING HI with a score higher than one of the others.
How can it be configured to give a higher score to the entries that actually start with the searched n-gram? In this case, HITS FIND SOME and HITS OTHER should have a higher score than TRENDING HI; at the same time, TRENDING HI should still be in the results.
A highlighter is also used, so the given solution shouldn't mess it up.
The highlighter used in the query is:

    "highlight": {
      "pre_tags": ["<"],
      "post_tags": [">"],
      "fields": {
        "name": {}
      }
    }

Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.
elasticsearch search n-gram
edited 2 days ago
asked Nov 10 at 11:44
m3th0dman
This question has an open bounty worth +100 reputation from m3th0dman, ending in 5 days (2018-11-19 14:15:36Z).
This question has not received enough attention.
Expecting a solution to the given issue without messing up the highlighter.
2 Answers
You need to understand how Elasticsearch/Lucene analyzes your data and calculates the search score.
1. Analyze API
https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html shows what Elasticsearch will store; in your case, for TRENDING HI:
T / TR / TRE / ... / TRENDING / H / HI
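To see the stored tokens without a running cluster, here is a minimal Python sketch that approximates the question's edge_ngram tokenizer (min_gram 1, max_gram 10, token_chars letter and symbol). The function name edge_ngrams is hypothetical, and the real tokenizer may differ in edge cases, so treat the Analyze API output as authoritative.

```python
import unicodedata

def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the question's edge_ngram tokenizer: split on any character
    that is not a letter or symbol, then emit the leading n-grams of each word."""
    def keep(ch):
        # Unicode general categories starting with L are letters, S are symbols.
        return unicodedata.category(ch)[0] in ("L", "S")

    words, current = [], []
    for ch in text:
        if keep(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))

    tokens = []
    for word in words:
        # Emit prefixes of length min_gram .. max_gram (capped at the word length).
        for n in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append(word[:n])
    return tokens

docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
for doc in docs:
    tokens = edge_ngrams(doc)
    print(doc, "->", len(tokens), "tokens,", "HI indexed:", "HI" in tokens)
```

Under this approximation the exact term HI is indexed for all three documents, so a plain match on HI cannot tell a leading word from a later one. The documents also produce different token counts; assuming the default similarity applies field-length normalization, that may be why TRENDING HI can outscore HITS FIND SOME.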
2. Score
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html
The bool query is often used to build complex queries for a particular use case. Use must to filter documents, then should to score them. A common use case is to apply different analyzers to the same field (by using the fields keyword in the mapping, you can analyze the same field in different ways).
3. Don't mess up the highlight
According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query
You can add an extra query:

    {
      "query": {
        "bool": {
          "must": [
            { "match": { "name": "HI" } }
          ],
          "should": [
            { "prefix": { "name": "HI" } }
          ]
        }
      },
      "highlight": {
        "pre_tags": ["<"],
        "post_tags": [">"],
        "fields": {
          "name": {
            "highlight_query": {
              "match": { "name": "HI" }
            }
          }
        }
      }
    }

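One way to combine the three points above is sketched below in Python, which only builds the request body as a dict. It assumes a keyword sub-field, here called name.raw, has been added to the mapping through fields; that sub-field and the helper name build_search are hypothetical and not part of the question's mapping. The reason for the assumption: a prefix query is not analyzed, and on the edge n-gram field itself HI is an indexed term of TRENDING HI as well, so a prefix clause on name would probably not separate the documents.

```python
import json

def build_search(text, field="name", raw_field="name.raw"):
    """Build a search body: match selects the hits, a prefix clause on an
    (assumed) keyword sub-field rewards values that start with the text."""
    return {
        "query": {
            "bool": {
                # Every hit must contain the typed text as an indexed edge n-gram.
                "must": [{"match": {field: text}}],
                # Optional clause: adds score only when the whole value starts with the text.
                "should": [{"prefix": {raw_field: text}}],
            }
        },
        "highlight": {
            "pre_tags": ["<"],
            "post_tags": [">"],
            "fields": {
                # Highlight only what the match query matched, not the prefix clause.
                field: {"highlight_query": {"match": {field: text}}}
            },
        },
    }

body = build_search("HI")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. Note that a prefix query on a keyword field is case-sensitive unless the sub-field defines a normalizer, so the typed text has to match the stored casing.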
edited 2 days ago
answered 2 days ago
Thomas Decaux
In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term in the text:

    {
      "query": {
        "bool": {
          "should": [
            { "match": { "name": "HI" } },
            { "match_phrase_prefix": { "name": "HI" } }
          ]
        }
      }
    }

The match clause will match all three documents, but the match_phrase_prefix clause won't match TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.
Quoting the docs:
The match_phrase_prefix query is a poor-man’s autocomplete [...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.
On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
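As a rough illustration of the minimum_should_match remark, the Python sketch below builds the same bool query with the option set explicitly and turns on explain so that each hit reports which clauses contributed to its score. The helper name build_should_query is hypothetical; minimum_should_match and explain are regular search body options.

```python
import json

def build_should_query(text, field="name", minimum_should_match=1):
    """Build a bool/should query where at least `minimum_should_match`
    of the optional clauses has to match for a document to be returned."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"match": {field: text}},
                    {"match_phrase_prefix": {field: text}},
                ],
                "minimum_should_match": minimum_should_match,
            }
        },
        # Ask Elasticsearch to include a scoring breakdown for every hit.
        "explain": True,
    }

query = build_should_query("HI")
print(json.dumps(query, indent=2))
```

With a value of 1 a document is returned as soon as either clause matches; with 2 both clauses would have to match. The explain output is a practical way to check whether match_phrase_prefix really skips TRENDING HI on an edge n-gram field, since HI is itself an indexed term there.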
edited Nov 11 at 14:02
answered Nov 10 at 14:27
AdrienF
But I need TRENDING HI as a result; just with a lower score.
– m3th0dman
Nov 11 at 10:54
@m3th0dman the overall results are a combination of matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
– AdrienF
Nov 11 at 14:04
Thank you for your answer!
– m3th0dman
2 days ago
Unfortunately this messes up the highlighter.
– m3th0dman
2 days ago
@m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
– AdrienF
2 days ago