How exactly does Elasticsearch versioning work?

My understanding was that Elasticsearch stores only the latest copy of a document and just increments the version field. But I was playing around with a few thousand documents and needed to index them repeatedly without changing any data in them. My thinking was that the index size would remain the same, but that wasn't the case: the index size seemed to increase.
This confused me a little bit, so I just wanted to seek clarification on the internal mechanism of versioning within Elasticsearch.

An update is a Delete + Insert Lucene operation behind the scenes.
But you should know that Lucene does not really delete the document; it only marks it as deleted.
To remove deleted docs, you have to optimize your Lucene segments.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
See the Optimize API. Also have a look at the merge options. Merging of segments happens behind the scenes from time to time.
For a general overview of versioning support in Elasticsearch, please refer to Elasticsearch Versioning Support.
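You can observe this yourself: indexing the same document twice bumps _version, and the superseded copy lives on as a deleted Lucene document until segments are merged. A minimal sketch, assuming a twitter index and document id 1 (the exact URL layout depends on your Elasticsearch version):
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{"user": "kimchy"}'    # response reports "_version": 1
$ curl -XPUT 'http://localhost:9200/twitter/tweet/1' -d '{"user": "kimchy"}'    # identical body, response reports "_version": 2
$ curl 'http://localhost:9200/_cat/indices/twitter?v'                           # docs.deleted grows until a merge expunges the old copies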

Related

Implementing popular keywords in Elasticsearch

I'm using Elasticsearch on AWS EC2, and I want to implement a "today's popular keywords" feature in ES.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords for the name index only.
I tried using the ES slow log and Logstash, but the slow log saves an entry per shard.
(e.g. with 5 shards, 5 query log entries are saved for a single search.)
Is there any good and easy way to implement popular keywords in ES?
As far as I know, this is not supported by Elasticsearch and you need to build your own custom solution.
The design you mentioned using the slow log is not a good fit. As you noted, it works on a per-shard basis, and even if you do some extra computation to merge the per-shard entries and relate them to a single search at the index level, it still would not be good, because:
you would have to change the slow log configuration, and every index needs a different threshold; you can set it to 0ms to make sure you capture all search queries in the slow logs, but that would take a huge amount of disk space and would hurt Elasticsearch performance.
you would have to do some parsing of the slow log in your application, and doing that at runtime would be very costly.
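For completeness, if you did decide to capture every query in the slow log, the per-index thresholds can be lowered like this (the name index from the question is assumed; again, not recommended for the reasons above):
$ curl -XPUT 'http://localhost:9200/name/_settings' -H 'Content-Type: application/json' -d '
{
  "index.search.slowlog.threshold.query.warn": "0ms",
  "index.search.slowlog.threshold.fetch.warn": "0ms"
}'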
Instead, I think you can maintain a distributed cache in your application where you store the top searched keywords, like the leaderboard of a multiplayer gaming app, which changes very frequently; in your case you don't even have to update this cache very often. I won't go into much implementation detail, but a simple hash map with the search term as key and its count as value would solve the issue.
Hope this helps. Let me know if you have questions.

How does Elasticsearch handle an index with 230m entries?

I was looking through Elasticsearch and noticed that you can create an index and bulk-add items. I currently have a series of flat files with 220 million entries. I am working on using Logstash to parse them and add them to Elasticsearch, but I feel that having all of it under one index would be rough to query. Each row has no more than 1-3 properties.
How does Elasticsearch function in this case? In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
I have been walking through the documentation, and it explains what to do, but not always why it does what it does.
In order to effectively query this index, do you just add additional instances to the cluster and they will work together to crunch the set?
That is exactly what you need to do. Typically it's an iterative process:
start by putting a subset of the data in. You can also put in all the data, if time and cost permit.
put some search load on it that is as close as possible to production conditions, e.g. by turning on whatever search integration you're planning to use. If you're planning to only issue queries manually, now's the time to try them and gauge their speed and the relevance of the results.
see if the queries are particularly slow and if their results are relevant enough. You can then change the index mappings or the queries you're using to get faster results, and indeed add more nodes to your cluster.
Since you mention Logstash, there are a few things that may help further:
check out Filebeat for indexing the data on an ongoing basis. You may not need to do the work of reading the files and bulk indexing yourself.
if it's log or log-like data and you're mostly interested in more recent results, it could be a lot faster to split up the data by date & time (e.g. index-2019-08-11, index-2019-08-12, index-2019-08-13). See the Index Lifecycle Management feature for automating this.
try using the Keyword field type where appropriate in your mappings (see the sketch below). It stops analysis on the field, so you can't do full-text searches inside it, only exact string matches. It's useful for fields like a "tags" field or a "status" field with values like ["draft", "review", "published"].
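A rough sketch combining the last two points, creating a date-based index with keyword fields (names are made up, and the exact mapping syntax varies by Elasticsearch version; this is the 7.x style without mapping types):
$ curl -XPUT 'http://localhost:9200/index-2019-08-11' -H 'Content-Type: application/json' -d '
{
  "mappings": {
    "properties": {
      "tags":    { "type": "keyword" },
      "status":  { "type": "keyword" },
      "message": { "type": "text" }
    }
  }
}'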
Good luck!

Can we migrate non-stored index data in Solr to Elasticsearch?

We are currently using Solr for full-text search, and now we are planning to move from Solr to Elasticsearch. During this process I read somewhere that there are some plugins available that will migrate data from Solr to Elasticsearch, but they won't be able to migrate records that are not stored in Solr. So is there a plugin available that will migrate non-stored index data from Solr to Elasticsearch? If so, please let me know.
Currently I am using the SOLR-to-ES plugin, but it won't migrate the non-stored index data.
Thanks
If the field is not stored, then you don't have the original value. If you have it indexed, what is in there is the value after it has gone through the analysis chain, and so it is probably different from the original one (it has no stopwords, is probably lowercased, maybe stemmed... stuff like that).
There are a couple of possibilities that might allow you to have the original content when not stored:
indexed field: if it has been analyzed with just the keyword tokenizer, then the indexed value is the original value.
field with docValues=true: then the original value is also stored. This feature was introduced later, so your index might not be using it.
The issue is that the common plugins might not take advantage of those cases where stored=true is not strictly necessary. You need to check them.
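One way to check is Solr's Schema API, which reports for each field whether it is stored, has docValues, and which field type (and hence analysis) it uses; the collection and field names below are placeholders:
$ curl 'http://localhost:8983/solr/my_collection/schema/fields/my_field'
$ curl 'http://localhost:8983/solr/my_collection/schema/fields'             # list every field definition at once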

Elasticsearch - What is the indexing process?

I am working on a project that uses Elasticsearch. I have my core search UI working. I'm now looking to improve some things. In this process, I discovered that I do not really understand what happens during "indexing". I understand what an index is. I understand what a document is. I understand that indexing happens either a) when a document is added, b) when a document is updated, or c) when the refresh endpoint is called.
Still, I do not really understand the detail behind indexing. For example, does indexing happen if a document is removed? What really happens during indexing? I keep looking for some documentation that explains this. However, I'm not having any luck.
Can someone please explain what happens during indexing and possibly point out some documentation?
Thank you!
Indexing is a huge process with a lot of steps involved. I will try to provide a brief intro to the major steps in the indexing process.
Making Text Searchable
Every word in a text field needs to be searchable.
The data structure that best supports the multiple-values-per-field requirement is the inverted index. The inverted index contains a sorted list of all of the unique values, or terms, that occur in any document and, for each term, a list of all the documents that contain it.
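A quick way to see which terms a piece of text would contribute to the inverted index is the _analyze API; for example, with the standard analyzer:
$ curl -XPOST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '
{
  "analyzer": "standard",
  "text": "The QUICK brown fox"
}'
# returns the terms: the, quick, brown, fox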
Updating the Index:
First of all, please note that a Lucene index is immutable.
Hence, for any write operation (the C, U, and D of CRUD), instead of rewriting the whole inverted index, Lucene adds new supplementary indices to reflect more recent changes.
Indexing Process
New documents are collected in an in-memory indexing buffer.
Every so often, the buffer is committed:
A new segment—a supplementary inverted index—is written to disk.
A new commit point is written to disk, which includes the name of the new segment.
The disk is fsync’ed—all writes waiting in the filesystem cache are flushed to disk, to ensure that they have been physically written.
The new segment is opened, making the documents it contains visible to search.
The in-memory buffer is cleared, and is ready to accept new documents.
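Elasticsearch exposes pieces of this machinery through its APIs; for instance, you can force newly indexed documents to become searchable and inspect the resulting segments (the index name here is just an example):
$ curl -XPOST 'http://localhost:9200/my_index/_refresh'        # open new segments so recent docs become visible to search
$ curl 'http://localhost:9200/_cat/segments/my_index?v'        # list the Lucene segments backing the index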
What happens in case of Delete
Segments are immutable, so documents cannot be removed from older segments.
When a document is “deleted,” it is actually just marked as deleted in the .del file. A document that has been marked as deleted can still match a query, but it is removed from the results list before the final query results are returned.
When is it actually removed
In Segment Merging, deleted documents are purged from the filesystem.
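You can watch this happen: the _cat/segments output includes a docs.deleted column, and a merge that expunges deletes can be requested explicitly (the index name and document id are placeholders; on older versions the endpoint is _optimize rather than _forcemerge):
$ curl -XDELETE 'http://localhost:9200/my_index/_doc/1'                                   # marks document 1 as deleted
$ curl 'http://localhost:9200/_cat/segments/my_index?v'                                   # the deleted doc is still counted in docs.deleted
$ curl -XPOST 'http://localhost:9200/my_index/_forcemerge?only_expunge_deletes=true'      # purge deleted docs during merging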
References:
Elasticsearch Docs
Inverted Index
Lucene Talks

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing all documents, such as adding a field to all documents, which from my understanding means retrieving the document, performing the desired operation, deleting the original document from the index, and reindexing it. This seems somewhat dangerous, and a backup of the original index seems preferable before performing this (obviously).
This made me wonder if Elasticsearch is actually suitable as a final storage solution at all, or if I should keep the raw documents that make up an index stored separately, to be able to recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
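That post's approach boils down to index aliases: create a new index with the new mappings, copy the documents over, then atomically flip an alias from the old index to the new one. A rough sketch with made-up index names (on recent versions the copy step can use the _reindex API; the original post predates it):
$ curl -XPUT 'http://localhost:9200/my_index_v2' -H 'Content-Type: application/json' -d '
{ "mappings": { "properties": { "title": { "type": "text" }, "added_field": { "type": "keyword" } } } }'
$ curl -XPOST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{ "source": { "index": "my_index_v1" }, "dest": { "index": "my_index_v2" } }'
$ curl -XPOST 'http://localhost:9200/_aliases' -H 'Content-Type: application/json' -d '
{ "actions": [
    { "remove": { "index": "my_index_v1", "alias": "my_index" } },
    { "add":    { "index": "my_index_v2", "alias": "my_index" } }
] }'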
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows you to update a document based on a provided script. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (it also allows deleting the document or ignoring the operation). It uses versioning to make sure no updates have happened between the "get" and the "reindex".
Note, this operation still means a full reindex of the document; it just removes some network roundtrips and reduces the chances of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
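As an illustration, adding a field to an existing document with a scripted partial update might look like this (index, id, and field names are placeholders; the URL and script syntax differ between Elasticsearch versions, this is the 7.x form):
$ curl -XPOST 'http://localhost:9200/my_index/_update/1' -H 'Content-Type: application/json' -d '
{
  "script": {
    "source": "ctx._source.new_field = params.value",
    "params": { "value": 42 }
  }
}'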
Using Elasticsearch as a final storage solution: It depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore, or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML, etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of the practice of backing up a search engine's index.
The usual setup when you use Elasticsearch, Solr, or whatever search engine you have:
You have some kind of datasource (it could be a database, a legacy mainframe, Excel files, some REST service with data, or whatever).
You have a search engine that indexes this datasource to add search capability to your system. When the datasource changes, you can reindex it fully, or index only the changed part with the help of incremental indexing.
If something happens to the search engine's index, you can easily reindex all your data.
