I create an index like this:
curl --location --request PUT 'http://127.0.0.1:9200/test/' \
--header 'Content-Type: application/json' \
--data-raw '{
"settings" : {
"number_of_shards" : 1
},
"mappings" : {
"properties" : {
"word" : { "type" : "text" }
}
}
}'
Then I create a document:
curl --location --request POST 'http://127.0.0.1:9200/test/_doc/' \
--header 'Content-Type: application/json' \
--data-raw '{ "word":"organic" }'
And finally, search with an intentionally misspelled word:
curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"suggest": {
"001" : {
"text" : "rganic",
"term" : {
"field" : "word"
}
}
}
}'
The word 'organic' lost its first letter, and ES never gives suggestion options for this kind of misspelling (it works absolutely fine for any other misspellings: 'orgnic', 'oragnc' and 'organi'). What am I missing?
This is happening because of the prefix_length parameter: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html. It defaults to 1, i.e. at least 1 letter from the beginning of the term has to match. You can set prefix_length to 0, but this has performance implications. Only your hardware, your setup and your dataset can show you exactly what those will be in practice, i.e. try it :). However, be careful: the Elasticsearch and Lucene devs set the default to 1 for a reason.
Here's a query which for me returns the suggestion result you're after on Elasticsearch 7.4.0 after I perform your setup steps.
curl --location --request POST 'http://127.0.0.1:9200/test/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
"suggest": {
"001" : {
"text" : "rganic",
"term" : {
"field" : "word",
"prefix_length": 0
}
}
}
}'
You need to use candidate generators with the phrase suggester. Check out this excerpt from the book Elasticsearch in Action, page 444:
Having multiple generators and filters lets you do some neat tricks. For instance, if typos are likely to happen both at the beginning and end of words, you can use multiple generators to avoid expensive suggestions with low prefix lengths by using the reverse token filter, as shown in figure F.4.
You’ll implement what’s shown in figure F.4 in listing F.4:
■ First, you'll need an analyzer that includes the reverse token filter.
■ Then you'll index the correct product description in two fields: one analyzed with the standard analyzer and one with the reverse analyzer.
From the Elasticsearch docs:
The following example shows a phrase suggest call with two generators: the first one is using a field containing ordinary indexed terms, and the second one uses a field that uses terms indexed with a reverse filter (tokens are indexed in reverse order). This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions. The pre_filter and post_filter options accept ordinary analyzer names.
So you can achieve this by using a reverse analyzer together with the pre_filter and post_filter options.
And as you can see, they said:
This is used to overcome the limitation of the direct generators to require a constant prefix to provide high-performance suggestions.
Check this figure from the Elasticsearch in Action book; I believe it will make the idea clearer. (The figure, a screenshot from the book, shows how Elasticsearch arrives at the correct phrase.)
For more information refer to the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-suggesters-phrase.html
Explaining the full idea would make this a very long answer, but I have given you the key; you can go and do your own research on using the phrase suggester with multiple generators.
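To make this concrete, here is a rough sketch (not the book's exact listing; the index name, analyzer name and reverse subfield are my own assumptions) of how the two-generator setup could look for the word field from the question. First, an index whose word field is also indexed in reverse:
curl --location --request PUT 'http://127.0.0.1:9200/test2/' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "analysis": {
      "analyzer": {
        "reverse_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "word": {
        "type": "text",
        "fields": {
          "reverse": { "type": "text", "analyzer": "reverse_analyzer" }
        }
      }
    }
  }
}'
Then a phrase suggest call with two direct generators, the second one working on the reversed field so that a missing letter at the start of a word becomes a missing letter at the end:
curl --location --request POST 'http://127.0.0.1:9200/test2/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "suggest": {
    "001": {
      "text": "rganic",
      "phrase": {
        "field": "word",
        "direct_generator": [
          { "field": "word", "suggest_mode": "always" },
          {
            "field": "word.reverse",
            "suggest_mode": "always",
            "pre_filter": "reverse_analyzer",
            "post_filter": "reverse_analyzer"
          }
        ]
      }
    }
  }
}'
Note that the docs example additionally runs the phrase suggester on a shingled (trigram) field; for real data you would want that too.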
A file is located in a known path on Google Drive, for example:
/root/Myfiles/test.txt
How can I get the item-id of the file using the Google Drive V3 REST API (https://www.googleapis.com/drive/v3/files/)? In detail, I am not sure how to construct the query parameter q= for this.
Unless you already have the file id of Myfiles, you're going to have to do this in two calls.
The first thing we will do is list all the directories in root.
This can be done using the q parameter, as you already know.
By passing parents in 'root' and mimeType = 'application/vnd.google-apps.folder' and name = 'Myfiles', I tell it that I am looking for a folder called Myfiles that has a parent folder of root.
curl \
'https://www.googleapis.com/drive/v3/files?q=parents%20in%20%27root%27%20and%20mimeType%20%3D%20%27application%2Fvnd.google-apps.folder%27%20and%20name%20%3D%20%27Myfiles%27&key=[YOUR_API_KEY]' \
--header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
--header 'Accept: application/json' \
--compressed
The response from this will then look something like this:
{
"kind": "drive#fileList",
"incompleteSearch": false,
"files": [
{
"kind": "drive#file",
"id": "1R_QjyKyvET838G6loFSRu27C-3ASMJJa",
"name": "Myfiles",
"mimeType": "application/vnd.google-apps.folder"
}
]
}
I now know the file id of the folder called Myfiles.
Now I can make another call requesting a file within that directory with the name test.txt, like this: parents in '1R_QjyKyvET838G6loFSRu27C-3ASMJJa' and name = 'test.txt'
The call will then look something like this:
curl \
'https://www.googleapis.com/drive/v3/files?q=parents%20in%20%271R_QjyKyvET838G6loFSRu27C-3ASMJJa%27%20and%20name%20%3D%20%27test.txt%27&key=[YOUR_API_KEY]' \
--header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
--header 'Accept: application/json' \
--compressed
The response
{
"kind": "drive#fileList",
"incompleteSearch": false,
"files": [
{
"kind": "drive#file",
"id": "1_BgrWKsjnZvayvr2kbdHzSzE3K2tNsWhntBsQwfrDOw",
"name": "test.txt",
"mimeType": "application/vnd.google-apps.document"
}
]
}
Summary
As @DalmTo said, if you want to search for files within a specific folder you need to have that folder's ID to search within it.
parents in: Whether the parents collection contains the specified ID.
Which means that you should do two separate queries: one asking for the id of your folder and another looking for the file test.txt in that folder.
q: parents in "root" and mimeType = "application/vnd.google-apps.folder" and name = "Myfiles"
q: parents in "ID_FOLDER" and mimeType = "text/plain" and name = "test"
Example:
If you only have one file in your entire Drive that meets the required characteristics, you could do it in a single query:
q: name = "test" and mimeType = "text/plain"
Caution: if you have uploaded the file yourself, Drive may have detected it as application/octet-stream. Normally .txt files are detected as text/plain; for more information on MIME types and the Drive API, check the docs for common MIME types and for Drive-specific types.
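If you are not sure which MIME type Drive assigned, the q language supports or and parentheses, so (as a sketch) you can match both in one query:
q: name = "test" and (mimeType = "text/plain" or mimeType = "application/octet-stream")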
Alternative using Google Apps Script
Here is an example using Google Apps Script:
function findFile() {
  // Find the folder called "Myfiles" directly under the root folder.
  // Note: the advanced Drive service uses Drive API v2, so the
  // query term is "title", not "name".
  const folderQuery = '"root" in parents and title = "Myfiles" and mimeType = "application/vnd.google-apps.folder"';
  const folders = Drive.Files.list({ q: folderQuery });
  const folderId = folders.items[0].id;

  // Find the file called "test" inside that folder.
  const fileQuery = `parents in "${folderId}" and title = "test"`;
  const files = Drive.Files.list({ q: fileQuery });
  return files.items[0].id;
}
Caution: the Google Apps Script advanced Drive service uses Drive API v2; in v2 the query_term name becomes title.
More Information
For a deeper understanding of how the Drive API works you can check Search for files guide:
A query string contains the following three parts:
query_term operator values
query_term is the query term or field to search upon.
operator specifies the condition for the query term.
values are the specific values you want to use to filter your search results.
To keep in mind when used outside of a client library:
Note: These examples use the unencoded q parameter, where name = 'hello' is encoded as name+%3d+%27hello%27. Client libraries handle this encoding automatically.
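If you are calling the endpoint by hand with curl rather than through a client library, one way to avoid encoding mistakes (a small sketch, not from the docs) is to let curl do the encoding itself with -G and --data-urlencode:
curl -G 'https://www.googleapis.com/drive/v3/files' \
  --data-urlencode "q=name = 'test' and mimeType = 'text/plain'" \
  --header 'Authorization: Bearer [YOUR_ACCESS_TOKEN]' \
  --header 'Accept: application/json'
-G turns the data into URL query parameters, and --data-urlencode percent-encodes the q expression for you.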
I have elasticsearch mapping as follows:
{
"info": {
"properties": {
"timestamp": {"type":"date","format":"epoch_second"},
"user": {"type":"keyword" },
"filename": {"type":"text"}
}
}
}
When I try to do match query on filename, it works properly when I don't give dot in search input, but when dot in included, it returns many false results.
I learnt that standard analyzer is the issue. It breaks search input on dots and then search. What analyzer I can use in this case? The filenames can be millions and I don't want something with takes lot of memory and time. Please suggest.
As you are talking about filenames here, I would suggest using the keyword analyzer. This will not split the string and will index it as it is.
You could also just change your mapping from text to keyword instead.
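As a minimal sketch (reusing the field names from your mapping; the index name is made up), the keyword variant would look like this:
curl -XPUT 'http://localhost:9200/files' -H 'Content-Type: application/json' -d '{
  "mappings": {
    "properties": {
      "timestamp": { "type": "date", "format": "epoch_second" },
      "user": { "type": "keyword" },
      "filename": { "type": "keyword" }
    }
  }
}'
Note that you cannot change the type of an existing field in place; you would have to create a new index with this mapping and reindex your documents into it.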
I can do a quick URI search like
GET twitter/tweet/_search?q=user:kimchy
Can I search multiple fields this way? For example, user:kimchy AND age:23?
What I tried 1 (error):
curl -XDELETE localhost:9200/myindex/
curl localhost:9200/myindex/mytype/1 -d '{"a":1,"b":9}'
curl localhost:9200/myindex/mytype/2 -d '{"a":9,"b":9}'
curl localhost:9200/myindex/mytype/3 -d '{"a":9,"b":1}'
Say I want just the document {"a":9, "b":9}. I tried:
GET localhost:9200/myindex/_search?q=a:9&b:9
but I get this error:
{
  "error": {
    "root_cause": [{
      "type": "illegal_argument_exception",
      "reason": "request [/myindex/_search] contains unrecognized parameter: [b:9]"
    }],
    "type": "illegal_argument_exception",
    "reason": "request [/myindex/_search] contains unrecognized parameter: [b:9]"
  },
  "status": 400
}
What I tried 2 (works!):
GET localhost:9200/myindex/_search?q=a:9 AND b:9
The spaces are important. Alternatively, use %20.
Yes, you can. Try something like this:
GET twitter/tweet/_search?q=user:kimchy%20AND%20age:23
Note that if you URI decode this, it's equivalent to:
GET twitter/tweet/_search?q=user:kimchy AND age:23
Note that when you use the REST endpoint like this, you are really taking advantage of the query_string query. Refer to those docs to get an idea of the extent of the query string language and the features available to you.
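For reference, the same search expressed with an explicit query_string query in the request body would be (a sketch using the index and fields from the question):
curl -XPOST 'localhost:9200/twitter/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "query_string": {
      "query": "user:kimchy AND age:23"
    }
  }
}'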
Here is what I'd like the stemmer to do:
breaking: break
broke: break
broken: break
entering: enter
entered: enter
enter: enter
I've indexed the field as follows:
"body": {
"type": "text",
"fields": {
"stemmed": {
"type": "text",
"analyzer": "english"
}
}
}
When I query “breaking and entering”, I can see that what is searched for in the body.stemmed field is: "break and enter". Seems good.
However, when I query “broke and entered”, I get: “broke and enter”. Thus, apparently, “broke” does not become “break” when the "english" stemmer is used.
Likewise, “broken and entered” becomes: “broken and enter”. So ES apparently does not change either “broke” or “broken” to “break” (which, judging by the Snowball stemmer's own output, I guess explains why, if that is what is used).
So, is there a way to specify a "known" stemmer that will accomplish what I'm trying to do?
Your requirement can be fulfilled by a dictionary stemmer, which does dictionary lookups to stem words. Algorithmic stemmers stem without knowledge of the root words; they simply apply rules algorithmically.
Look at the Hunspell stemmer; I think it will do the job:
https://www.elastic.co/guide/en/elasticsearch/guide/current/hunspell.html
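As a sketch (assuming you have installed the en_US Hunspell dictionary, i.e. the en_US.aff and en_US.dic files, under hunspell/en_US/ in the Elasticsearch config directory), an analyzer using the hunspell token filter could be defined like this:
curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "en_US_stemmer": {
          "type": "hunspell",
          "locale": "en_US"
        }
      },
      "analyzer": {
        "en_hunspell": {
          "tokenizer": "standard",
          "filter": ["lowercase", "en_US_stemmer"]
        }
      }
    }
  }
}'
You would then point your body.stemmed subfield at the en_hunspell analyzer instead of english. Whether broke and broken actually stem to break depends on the dictionary files you install.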
I have an index containing lots of streets. The index looks like this:
Mainstreet 42
Some other street 15
Foostr. 9
The default search query looks like this:
+QUERY_STRING*
So querying for foo (sent as +foo*) or foostr (sent as +foostr*) results in Foostr. 9, which is correct. BUT querying for foostr. (which gets sent to Elasticsearch as +foostr.*) gives no results, but why?
I use the standard analyzer and the query string with no special options. (This also returns 0 results when using http://127.0.0.1:9200/test/streets?q=+foostr.*.)
Btw, this: http://127.0.0.1:9200/test/streets?q=+foostr. (same as above without the asterisk) finds the right results.
Questions:
Why is this happening?
How to avoid this behavior?
One thing I didn't think about was:
Elasticsearch will not analyze wildcard queries by default!
This means that, by default, it will act like this:
input query | the query that ES will use
----------------------------------------
foo | foo
foo. | foo
foo* | foo*
foo.* | foo.*
As you can see, if the input query contains a wildcard, ES will not remove any characters. When using no wildcard, ES will take the query and run an analyzer, which (e.g. when using the default analyzer) will remove all dots.
To "fix" this, you can either
Remove all dots manually from the query string. Or
Use analyze_wildcard=true (i.e. http://127.0.0.1:9200/test/streets?q=+foostr.*&analyze_wildcard=true). Here's an explanation of what happens: https://github.com/elastic/elasticsearch/issues/787
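The same option is available on the query_string query if you build the request body yourself (a sketch against the index from the question):
curl -XPOST 'http://127.0.0.1:9200/test/_search' -H 'Content-Type: application/json' -d '{
  "query": {
    "query_string": {
      "query": "+foostr.*",
      "analyze_wildcard": true
    }
  }
}'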
1) This is because the standard analyzer does not index special characters. For example, if you index the string Yoo! My name is Karthik., Elasticsearch breaks it down to (yoo, my, name, is, karthik): lowercased and without special characters (which actually makes sense in many simple cases). So when you searched for foostr., there were no results, as it was indexed as foostr (without the ".").
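You can check this yourself with the _analyze API (the exact request syntax varies a little between versions; this is the current body form):
curl -XPOST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
  "analyzer": "standard",
  "text": "Yoo! My name is Karthik."
}'
The returned tokens are yoo, my, name, is and karthik, with the punctuation stripped.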
2) You can use different types of analyzers for different fields, depending on your requirements, while indexing (or you can mark a field as not_analyzed so it is indexed as-is).
Example:
$ curl -XPUT 'http://localhost:9200/bookstore/book/_mapping' -d '
{
"book" : {
"properties" : {
"title" : {"type" : "string", "analyzer" : "simple"},
"description" : {"type" : "string", "index" : "not_analyzed"}
}
}
}
'
Note that string and not_analyzed are pre-5.x mapping syntax; on modern Elasticsearch use text and keyword instead. You can refer to the docs for more information.
HTH!