Elasticsearch Edge NGram tokenizer higher score when word begins with n-gram

Suppose there is the following mapping with an Edge NGram tokenizer:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete_analyzer": {
              "tokenizer": "autocomplete_tokenizer",
              "filter": [
                "standard"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "whitespace"
            }
          },
          "tokenizer": {
            "autocomplete_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 10,
              "token_chars": [
                "letter",
                "symbol"
              ]
            }
          }
        }
      },
      "mappings": {
        "tag": {
          "properties": {
            "id": {
              "type": "long"
            },
            "name": {
              "type": "text",
              "analyzer": "autocomplete_analyzer",
              "search_analyzer": "autocomplete_search"
            }
          }
        }
      }
    }

And the following documents are indexed:

    POST /tag/tag/_bulk
    {"index":{}}
    {"name" : "HITS FIND SOME"}
    {"index":{}}
    {"name" : "TRENDING HI"}
    {"index":{}}
    {"name" : "HITS OTHER"}

Then searching

    {
      "query": {
        "match": {
          "name": { "query": "HI" }
        }
      }
    }

returns all three documents with the same score, or ranks TRENDING HI above one of the others.

How can it be configured so that entries that actually start with the searched n-gram get a higher score? In this case, HITS FIND SOME and HITS OTHER should score higher than TRENDING HI, while TRENDING HI should still appear in the results.

A highlighter is also used, so the solution shouldn't break it.

The highlighter used in the query is:

    "highlight": {
      "pre_tags": [
        "<"
      ],
      "post_tags": [
        ">"
      ],
      "fields": {
        "name": {}
      }
    }

Using this with match_phrase_prefix messes up the highlighting, yielding <H><I><T><S> FIND SOME when searching only for H.

asked Nov 10 at 11:44 by m3th0dman (edited 2 days ago)

This question has an open bounty worth +100 reputation from m3th0dman, ending at 2018-11-19 14:15:36Z (in 5 days).

This question has not received enough attention.

Expecting a solution to the given issue without messing up the highlighter.


elasticsearch search n-gram


2 Answers

You need to understand how Elasticsearch/Lucene analyzes your data and calculates the search score.

1. Analyze API

The Analyze API (https://www.elastic.co/guide/en/elasticsearch/reference/current/_testing_analyzers.html) shows which tokens Elasticsearch stores. In your case, for TRENDING HI:

T / TR / TRE / ... / TRENDING / H / HI
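
To see why the plain match query cannot tell the three documents apart, the analysis can be approximated locally. The following is a rough Python sketch, not Elasticsearch code: the helper name edge_ngrams is made up for illustration, and it only imitates the tokenizer settings from the question (min_gram 1, max_gram 10, token_chars letter and symbol).

```python
import unicodedata


def edge_ngrams(text, min_gram=1, max_gram=10):
    """Approximate the edge_ngram tokenizer with token_chars = letter, symbol."""
    def is_token_char(ch):
        # Letters, plus the Unicode symbol categories (Sm, Sc, Sk, So).
        return ch.isalpha() or unicodedata.category(ch).startswith("S")

    tokens, word = [], ""
    # The trailing space flushes the last word.
    for ch in text + " ":
        if is_token_char(ch):
            word += ch
            continue
        # Any other character ends the current word; emit its leading n-grams.
        for size in range(min_gram, min(len(word), max_gram) + 1):
            tokens.append(word[:size])
        word = ""
    return tokens


docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
index = {doc: edge_ngrams(doc) for doc in docs}

for doc, tokens in index.items():
    print(doc, "->", " / ".join(tokens))

# The whitespace search analyzer leaves the query "HI" as a single term,
# and every sample document has indexed that exact term.
matching = [doc for doc, tokens in index.items() if "HI" in tokens]
print(matching)
```

Since all three documents contain the exact token HI, the position of the word inside the name plays no part in whether the match clause hits.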

2. Score

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html

The bool query is often used to build a complex query for a particular use case. Use must to select the documents, then should to influence the score. A common use case is to apply different analyzers to the same field (by using the "fields" keyword in the mapping, you can analyze the same field in different ways).
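
As an illustration of analyzing the same field in more than one way, here is a hedged sketch of a multi-field mapping, written as a Python dict that serializes to the JSON Elasticsearch expects. The sub-field name raw and its keyword type are assumptions made for this example; they are not part of the mapping in the question.

```python
import json

# Hypothetical variant of the "name" mapping from the question: the main field
# keeps the edge n-gram analysis, and a sub-field (the name "raw" is an
# assumption) stores the whole value as one untokenized keyword term.
name_mapping = {
    "type": "text",
    "analyzer": "autocomplete_analyzer",
    "search_analyzer": "autocomplete_search",
    "fields": {
        "raw": {"type": "keyword"},
    },
}

print(json.dumps({"properties": {"name": name_mapping}}, indent=2))
```

With such a mapping the sub-field would be addressed as name.raw in queries, while highlighting can keep using name.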

3. Don't break the highlighting

According to the docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-highlighting.html#specify-highlight-query

You can add an extra query:

    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "name": "HI"
              }
            }
          ],
          "should": [
            {
              "prefix": {
                "name": "HI"
              }
            }
          ]
        }
      },
      "highlight": {
        "pre_tags": [
          "<"
        ],
        "post_tags": [
          ">"
        ],
        "fields": {
          "name": {
            "highlight_query": {
              "match": {
                "name": "HI"
              }
            }
          }
        }
      }
    }

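One caveat that seems worth checking with the Analyze API: a prefix query is not analyzed, and the edge n-gram field has indexed the term HI for TRENDING HI as well, so a prefix clause on name itself may add the same boost to all three documents. A possible variant, sketched here as a Python dict, points the should clause at a keyword sub-field instead. The sub-field name.raw is hypothetical and would have to be added to the mapping; a keyword prefix is also case-sensitive unless a normalizer is configured.

```python
import json

QUERY_TEXT = "HI"

# "name.raw" is a hypothetical keyword sub-field holding the whole value as
# one term; it is not part of the mapping shown in the question.
search_body = {
    "query": {
        "bool": {
            # Decides which documents are returned.
            "must": [{"match": {"name": QUERY_TEXT}}],
            # Only adds score for values that start with the query text.
            "should": [{"prefix": {"name.raw": QUERY_TEXT}}],
        }
    },
    "highlight": {
        "pre_tags": ["<"],
        "post_tags": [">"],
        "fields": {
            # Highlight from the match query alone, ignoring the prefix clause.
            "name": {"highlight_query": {"match": {"name": QUERY_TEXT}}}
        },
    },
}

print(json.dumps(search_body, indent=2))

# Local illustration of which sample documents the prefix clause would boost.
docs = ["HITS FIND SOME", "TRENDING HI", "HITS OTHER"]
boosted = [doc for doc in docs if doc.startswith(QUERY_TEXT)]
print(boosted)  # ['HITS FIND SOME', 'HITS OTHER']
```

Because highlight_query only contains the match clause on name, the extra scoring clause should not change which fragments are highlighted.
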
answered 2 days ago by Thomas Decaux (edited 2 days ago)


In this particular case you could add a match_phrase_prefix clause to your query, which does a prefix match on the last term of the query text:

    {
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "name": "HI"
              }
            },
            {
              "match_phrase_prefix": {
                "name": "HI"
              }
            }
          ]
        }
      }
    }

The match term will match on all three results, but the match_phrase_prefix won't match on TRENDING HI. As a result, you'll get all three items in the results, but TRENDING HI will appear with a lower score.

Quoting the docs:

The match_phrase_prefix query is a poor-man’s autocomplete [...] For better solutions for search-as-you-type see the completion suggester and Index-Time Search-as-You-Type.

On a side note, if you're introducing that bool query, you'll probably want to look at the minimum_should_match option, depending on the results you want.
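
To make the effect of minimum_should_match concrete, here is a toy Python model of how a bool query decides whether one document matches. It is only a sketch: bool_matches is a made-up helper, filter and must_not clauses are left out, and the clause results for TRENDING HI are assumed rather than measured. The default it encodes (1 when there are only should clauses, otherwise 0) follows the documented behaviour.

```python
def bool_matches(must, should, minimum_should_match=None):
    """Toy model of whether a bool query matches one document.

    must and should are lists of booleans: did each clause match the document?
    """
    if minimum_should_match is None:
        # Documented default: 1 when there are only should clauses, else 0.
        minimum_should_match = 1 if should and not must else 0
    return all(must) and sum(should) >= minimum_should_match


# Clause results for "TRENDING HI", assuming the match clause hits and the
# prefix-style clause does not.
print(bool_matches(must=[], should=[True, False]))                          # True
print(bool_matches(must=[], should=[True, False], minimum_should_match=2))  # False
print(bool_matches(must=[True], should=[False]))                            # True
```

In other words, raising minimum_should_match to 2 would drop TRENDING HI from the results, which is not what the question asks for, so the default is probably the right choice here.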

answered Nov 10 at 14:27 by AdrienF (edited Nov 11 at 14:02)

But I need TRENDING HI as a result; just with a lower score.
– m3th0dman
Nov 11 at 10:54

@m3th0dman the overall results are a combination of matching results for each term, so TRENDING HI will appear in the results, and it will appear with a lower score. Edited the answer to make this clearer.
– AdrienF
Nov 11 at 14:04

Thank you for your answer!
– m3th0dman
2 days ago

Unfortunately this messes up the highlighter.
– m3th0dman
2 days ago

@m3th0dman that's a new element. Could you give some more details on how you're doing the highlighting, and what you mean exactly by it being "messed up"?
– AdrienF
2 days ago
