I have hundreds of indexes and I want to fetch only a given field from every record under these indexes. I can do the following
curl -XGET 'http://localhost:9200/cms-2016-03-30/job/_search?pretty=true&field=CMSDataset'
This unfortunately returns a lot of things I don't want and also doesn't give me all the records (~10^6).
Also, there are many cms-* style indexes, and I want to parse through all of them and get only this field. How do I do this?
You need to use source filtering instead of fields (which is deprecated)
curl -XGET 'http://localhost:9200/cms-2016-03-30/job/_search?pretty=true&_source=CMSDataset'
^
|
change this
From the official documentation:
The fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended. Use source filtering instead to select subsets of the original source document to be returned.
UPDATE
You can use the size parameter (e.g. 100) in order to return more records (by default it's 10) and then simply use * as the index name:
curl -XGET 'http://localhost:9200/*/_search?pretty=true&size=100&_source=CMSDataset'
Related
I have a situation where I need to search from multiple indexes (products and users). Below is a sample query I am using to do that search
http://localhost:9200/_all/_search?q=*wood*
http://localhost:9200/users,products/_search?q=*wood*
With the above API request, it only returns search results for the product index. But if I search using the below API it returns search results for users index
http://localhost:9200/users/_search?q=*wood*
As you can see I am passing same value for "q" parameter. I need to search for both product and users index and check if there is the word "wood" in any attribute in both indexes. How can I achieve this
You can pass multiple index names instead of _all as it will search in other indices that you don't intent to by using the comma seprated index name like
http://localhost:9200/users,products/_search?q=*wood*
Although, _all should also fetch the result from users index which you get when you specify its name, you need to debug why its happening, maybe increase the size param to 1000 as by default Elasticsearch returns only 10 results and it seems in case of _all all the top results coming from products index only.
I was wondering what is the recommended approach to filter out some of the fields that are sent to Elasticsearch from Store and Index?
I want to filter our some fields from getting indexed in Elasticsearch. You may ask why you are sending them to Elasticsearch from the first place. Unfortunately, it is sent via another application that doesn't accept any filtering mechanism. Hence, filtering should be addressed at the time of indexing. Here is what we have done, but I am not sure what would be the consequences of these steps:
1- Disable dynamic mapping ("dynamic": "false" ) in ES templates.
2- Including only the required fields in _source and excluding the rest.
According to ES website, some of the ES functionalities will be disabled by disabling _source fields. Given I don't need the filtered fields at all, I was wondering whether the mentioned solution will break anything regarding the remaining fields or not?
There are a few mapping parameters that allow you to do what you want:
index: true/false: if true the field value is indexed in order to be searched later on (default: true)
store: true/false: if true the field values are stored in addition to being indexed. Usually, the field values are stored in the source already, but you can choose to not store the source but store the field value itself (default: false)
enabled: true/false: only for the mapping type as a whole or for object types. you can decide whether to only store the value but not index it
So you can use any combination of the above parameters if you don't want to modify the source documents and simple let ES do it for you.
We have an elasticsearch that contains over half a billion documents that each have a url field that stores a URL.
The url field mapping currently has the settings:
{
index: not_analyzed
doc_values: true
...
}
We want our users to be able to search URLs, or portions of URLs without having to use wildcards.
For example, taking the URL with path: /part1/user#site/part2/part3.ext
They should be able to bring back a matching document by searching:
part3.ext
user#site
part1
part2/part3.ext
The way I see it, we have two options:
Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user#site into user and site).
Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
My question is this:
Which is better for performance: having a list of variable lengths that has doc_values on, or having an analysed field? (ie: option 1 or option 2) OR is there an option 3 that would be even better yet?!
Thanks for your help!
Your question is about a field where you need doc_values but can not index with keyword-analyzer.
You did not mention why you need doc_values. But you did mention that you currently not search in this field.
So I guess that the name of the search-field do not have to be the same: you can copy the field value in an other field which is only for search ( "store": false ). For this new field you can use the pattern-analyzer or pattern-tokenizer for your use case.
It seems that no-one has actually performance tested the two options, so I did.
I took a sample of 10 million documents and created two new indices:
An index with an analysed field that was setup as suggested in the other answer.
An index with a string field that would store all permutations of URL segmentation.
I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
Then I created a set of gatling tests to run against the indices and compared the gatling results and netdata (https://github.com/firehol/netdata) landscape for each.
The results were as follows:
Regarding the netadata landscape: The analysed field showed a spike - although only a small one - on all elastic nodes. The not_analysed list field tests didn't even register.
It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade off - you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.
Update
Don't do this. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms will mean massive field data that will, eventually, never fit in the amount of memory you can allocate it.
I am using elasticsearch as a document database and each record I create has a guid id that the system uses for the record id. Business people want to offer a feature to let the user have their own auto file name convention based on date and how many records were created so far this day/month.
What I need is to prevent duplicate user file names. Is there a way to setup an indexed field to be unique? Like a sql unique constraint?
You'd need to use the field that is supposed to be unique as id for your documents. By default a new document with existing id would override the existing document with same id, but you can switch to op_type=create in order to get back an error if a document with same id already exists.
There's no way to have the same behaviour with arbitrary fields though, only the _id field works that way. I would probably consider handling this logic in the application layer instead of within elasticsearch.
One solution will be to use uniqueId field value for specifying document ID and use op_type=create while storing the documents in ES. With this you can make sure your uniqueId field will have unique value and will not be overridden by another same valued document.
For this, the elasticsearch document says:
The index operation also accepts an op_type that can be used to force a create operation, allowing for "put-if-absent" behavior. When create is used, the index operation will fail if a document by that id already exists in the index.
Here is an example of using the op_type parameter:
$ curl -XPUT 'http://localhost:9200/es_index/es_type/unique_a?op_type=create' -d '{
"user" : "kimchy",
"uniqueId" : "unique_a"
}'
If you run the above request it is ok, but running it the next time will give you an error.
You can use the _id in the column you want to have unique contraint on.
Here is the sample river that uses postgresql. Yo can change the Database Driver/DB-URL according to your usage.
curl -XPUT localhost:9200/_river/simple_jdbc_river/_meta -d "{\"type\":\"jdbc\",\"jdbc\":{\"strategy\":\"simple\",\"poll\":\"1s\",\"driver\":\"org.postgresql.Driver\",\"url\":\"jdbc:postgresql://DB-URL/DB-INSTANCE\",\"user\":\"USERNAME\",\"password\":\"PASSWORD\",\"sql\":\"select t.id as _id,t.name from topic as t \",\"digesting\" : true},\"index\":{\"index\":\"jdbc\",\"type\":\"topic_jdbc_river1\"}}"
So far as to ES 7.5, there is no such extra "constraint" to ensure uniqueness using a custom field in the mapping.
But you still can walk around it via your own application UUID, which could be used directly explicitly as the _id (which is unique implictly) to achieve your goals.
PUT <your_index_name>/_doc/<your_app_uuid>
{
"a_field": "a_value"
}
Another approach might be to generate the string you store in a field that should be unique by integrating an auto-incrementing integer. This way you ensure from the start that your field values are unique.
You would put your file name together like this:
<current day/month>_<auto-incremented integer>
Auto-incrementing integers are not supported by Elasticsearch per se but you could mimic them using this approach. If you happen to use node.js you can use the es-sequence module.
I want to delete index entries directly from MOBZ's ElasticSearch head (web UI).
I tried a DELETE query in the "Any Request" section with the following:
{"query":{"term":{"supplier":"ABC"}}}
However, all I get in return is:
{
ok: true
acknowledged: true
}
and the entries do not get deleted.
What am I doing wrong?
You should have removed the "query" from your post data.
You only need it for _search, and you should be using the _query entrypoint for delete.
In that case it is obvious the post is only a query, thus making it redendant (and actually irrelevant) to explicitly state it's a query.
That is:
curl -XPOST 'localhost:9200/myindex/mydoc/_search' -d
'{"query":{"term":{"supplier":"ABC"}}}'
will work fine for search.
But to delete by query, if you try:
curl -XDELETE 'localhost:9200/myindex/mydoc/_query' -d
'{"query":{"term":{"supplier":"ABC"}}}'
it won't work (note the change in entry point to _query, as well as switch CURL parameter to delete).
You need to call:
curl -XDELETE 'localhost:9200/myindex/mydoc/_query' -d
'{"term":{"supplier":"ABC"}}'
Let me know if this helps.
If you want to do it in HEAD:
put /stock/one/_query in the any request text box next to the drop-box of "GET/PUT/POST/DELETE"
choose DELETE in the drop-down menu
the request body should be {"term":{"vendor":"Socks"}}
Your problem was that you used a request body of: {"query":{"term":{"vendor":"Socks"}}}
That is fine for search, but not for delete.
A simple way to delete from plugin head by doc Id:
Go to Any Request TAB in plugin head
Simply put http:/localhost:9200/myindex/indextype/id in the text box above DELETE drop-down
Select DELETE from drop-down
Execute the request by clicking in Request button
Here is the sample image:
I'd issue a search request first, to verify that the documents you want deleted are actually being returned by your query.
It's impossible to give clear help, since there are many things that could be going wrong, but here are some possible probelems:
You don't have the correct index/type specified in the ES Head query
You need to specify the index and type on the second input box, not the first. The first line is meant for the host address and automatically adds a trailing slash
You need to use the Delete command from the dropdown
The analyzer of your fields is altering the field text in a way that it isn't being found by the Term query.
In all likelihood, it is the last option. If you haven't specified an analyzer, the default one that ES picks is the Standard analyzer, which includes a lowercase filter. The term "ABC" is therefore never indexed, instead "abc" is indexed.
Term query is not analyzed at all, so case sensitivity is important.
If those tips don't help, post up your mapping and some sample data, and we can probably help better.