Elasticsearch index alias - elasticsearch

I am trying to use elasticsearch to filter millions of data. All data are in one index and I want to access them in a 'direct' way.
What I mean with direct way?
Direct way means for example accessing the 700000th element of this index (not by id). Is this possible somehow?
What I tried already:
from + size works, but seems not to be fast if number of elements > 10000
Scrolling I didn't try, but it's seem somehow not the right thing for my use-case.
So any other ideas?

Scrolling will not work. That will fetch all the data.
I think elasticseach is not the correct use case for what you want to do.
It would be better to use a linked list of the ids, that will let you fetch the id by index and then you can query elasticsearch to get the data.
If you data is such that it does not get modified or deleted then you can add an extra field in the mapping that will act like an auto increment field in a database. You can fetch the data using that field.

Related

Get last document from index in Elasticsearch

I'm playing around the package github.com/olivere/elastic; all works fine, but I've a question: is it possible to get the last N inserted documents?
The From statement has 0 as default starting point for the Search action and I didn't understand if is possible to omit it in search.
Tldr;
Although I am not aware of a feature in elasticsearch api to retrieve the latest inserted documents.
There is a way to achieve something alike if you store the ingest time of the documents.
Then you can sort on the ingest time, and retrieve the top N documents.

Prevent data duplication over multiple indices in Elasticsearch

Data duplication prevention is handled at the index level with the field "_id".
However, to avoid having huge indices, I work with several small indices linked under an alias. Is there a mechanism in place to check existing _ids at the alias level (over multiple indices) when a document is inserted or should it be handled at the application level ?
indices architecture
not natively, no. you'd need to handle this in your own code
Before inserting your document, you need to first find out which real index contains your document via the alias using
GET alias/_search?q=_id:123456&filter_path=hits.hits._index
In the response you'll get the concrete index name that you can then use to index/update your new document version.

Can I update multiple documents with different field values at once?

I have an index of 280,000 documents in elasticsearch. I need to assign unique field values for each document. Currently I am iterating through all ID values and updating each document using _update. This process works fine but is very slow taking around 8 hours for 280,000 documents.
Any ideas on how I could speed this process up? Is it possible to update multiple documents at once assigning different field values to each document.
Try using ES Bulk API, which will allow you to update multiple documents with single request.
I would suggest to check the refresh property on the index as well, if you refresh the index every time you insert a record the performance will go down and I think that is whats happening to you just now. But if you use bulk update that should be fine, jsut something to keep in mind.

Elasticsearch slow performance for huge data retrieval with source field

I'm using ElasticSearch to search from more than 10 million records, most records contains 1 to 25 words. I want to retrieve data from it, the method I'm using now is drastically slow for big data retrieval as I'm trying to get data from the source field. I want a method that can make this process faster. I'm free to use other database or anything with ElasticSearch. Can anyone suggest some good Ideas and Example for this?
I've tried searching for solution on google and one solution I found was pagination and I've already applied it wherever it's possible but pagination is not an option when I want to retrieve many(5000+) hits in one query.
Thanks in advance.
Try using scroll
While a search request returns a single “page” of results, the scroll
API can be used to retrieve large numbers of results (or even all
results) from a single search request, in much the same way as you
would use a cursor on a traditional database.

How to create an index from search results, all on the server?

I will be getting documents from a filtered query (quite a lot of documents). I will then immediately create an index from them (in Python, using requests to directly query the REST API), without any modification.
Is it possible to make this operation directly on the server, without the round-trip of data to the script and back?
Another question was similar (in the intent) and the only answer is to go via Logstash (equivalent to using my code, though possibly more efficient)
refer http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/reindex.html
in short what you need to do is
0.) ensure you have _source set to true
1.) use scan and scroll API , pass your filtered query with search type scan,
2.)fetch documents using scroll id
2.) bulk index the result using the source field which returns you the json used to index data
refer:
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
guide/en/elasticsearch/guide/current/bulk.html
guide/en/elasticsearch/guide/current/reindex.html
es 2.3 has an experimental feature that allows reindex from a query
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

Resources