Swapping out one index for another in Elasticsearch

I am running Elasticsearch on a personal machine that only has so much memory. I'd like to use all of the memory at any given time for whatever problem I'm working on, but make it easy to switch between projects.
For example, I have a project involving a large text corpus, and a different project with geospatial data. I'd like to switch Elasticsearch from indexing one to the other without reindexing all the documents.
Is there an easier way to do this than to do a backup/reload of the index?

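Elasticsearch has an open/close index API. A closed index stays on disk but is blocked for reads and writes and has almost no overhead on the cluster, so you can close one project's index and open the other's: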
curl -XPOST 'localhost:9200/my_index/_close'
curl -XPOST 'localhost:9200/my_index/_open'
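If you switch projects often, the two calls above can be wrapped in a small script. This is only a sketch: the index names `corpus` and `geo` are made up, and it assumes a node listening on localhost:9200 as in the curl commands.

```python
import urllib.request


def switch_requests(close_index, open_index, base_url="http://localhost:9200"):
    """Return the (method, url) pairs that close one index and open another."""
    return [
        ("POST", f"{base_url}/{close_index}/_close"),
        ("POST", f"{base_url}/{open_index}/_open"),
    ]


def switch_indices(close_index, open_index, base_url="http://localhost:9200"):
    """Send the requests. Needs a running Elasticsearch node at base_url."""
    for method, url in switch_requests(close_index, open_index, base_url):
        request = urllib.request.Request(url, method=method)
        with urllib.request.urlopen(request) as response:
            print(url, response.status)


if __name__ == "__main__":
    # Hypothetical index names: park the text corpus, bring back the geo data.
    for method, url in switch_requests("corpus", "geo"):
        print(method, url)
```

Calling `switch_indices("corpus", "geo")` would actually send the two requests; the `__main__` part only prints them so you can check the URLs first.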

Related

Implementing a popular-keywords feature in Elasticsearch

I'm using Elasticsearch on AWS EC2, and I want to implement a "today's popular keywords" feature.
There are 3 indexes (place, genre, name), and I want to see today's popular keywords for the name index only.
I tried using the Elasticsearch slowlog with Logstash, but the slowlog writes one entry per shard.
For example, with 5 shards, a single query produces 5 log entries.
Is there a good, easy way to implement popular keywords in Elasticsearch?
As far as I know, Elasticsearch does not support this out of the box, so you need to build your own solution.
The slowlog design is not a good fit because, as you noticed, it logs on a per-shard basis. Even if you do the extra work to merge the per-shard entries back into a single index-level search, it still has drawbacks:
You have to change the slowlog configuration, and every index needs its own threshold. You can set the threshold to 0ms to capture every search query, but that consumes a lot of disk space and hurts Elasticsearch performance.
You have to parse the slowlog in your application, and doing that at runtime is very costly.
Instead, you can maintain a distributed cache in your application that stores the most-searched keywords, much like the leaderboard of a multiplayer game. A game leaderboard changes very frequently, but in your case the cache doesn't even need to be updated that often. I won't go into implementation details, but a simple hash map with the search term as the key and the count as the value would solve the issue.
Hope this helps. Let me know if you have questions.
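To make the hash-map idea concrete, here is a minimal in-process sketch. The class name and the normalization (trim and lowercase) are my own assumptions, and a real deployment would probably keep the counts in a shared cache rather than in one process's memory.

```python
from collections import Counter
from datetime import date


class DailyKeywordCounter:
    """Counts search terms per day: search term as key, count as value."""

    def __init__(self):
        self._counts = {}  # maps a date to a Counter of terms

    def record(self, term, day=None):
        """Call this from the application whenever a search on the name index runs."""
        day = day or date.today()
        normalized = term.strip().lower()
        if normalized:
            self._counts.setdefault(day, Counter())[normalized] += 1

    def top(self, n=10, day=None):
        """Return the n most searched terms for the given day (today by default)."""
        day = day or date.today()
        return self._counts.get(day, Counter()).most_common(n)


if __name__ == "__main__":
    counter = DailyKeywordCounter()
    for term in ["Pizza", "sushi", "pizza "]:
        counter.record(term)
    print(counter.top(2))
```

Because the application records the term before it sends the query to Elasticsearch, each search is counted once, no matter how many shards the index has.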

Index existing documents on startup

I'm new to Elasticsearch and this is a question I've been trying to find an answer to. Basically, I have around a thousand documents that I would like Elasticsearch to index for me. Do I have to write a bash/Python script that uses curl to PUT/POST all these documents to my Elasticsearch server, or can I configure the server so that it automatically indexes documents in a specific folder/location on disk when I start it up for the first time?
As far as I know, Elasticsearch has no option for pulling in documents to index by itself. As you mentioned, you need to write a script and push your documents to ES yourself.
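Such a script can be quite short. The sketch below builds a request body for the `_bulk` endpoint from a folder of `.json` files. The folder layout, the index name `docs`, and using the file name as the document ID are all assumptions, and older Elasticsearch versions also expect a `_type` in the action line.

```python
import json
from pathlib import Path


def bulk_body(pairs, index_name):
    """Build a newline-delimited _bulk body from (doc_id, document) pairs."""
    lines = []
    for doc_id, document in pairs:
        # Action line first, then the document source on its own line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps(document))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n" if lines else ""


def load_folder(folder):
    """Yield (file stem, parsed JSON) for every .json file in the folder."""
    for path in sorted(Path(folder).glob("*.json")):
        yield path.stem, json.loads(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # Hypothetical folder and index name; the body is then sent to the _bulk endpoint.
    print(bulk_body(load_folder("./documents"), "docs"), end="")
```

You could save the printed body to a file and send it with `curl -XPOST 'localhost:9200/_bulk' --data-binary @file`. Recent versions also need the header `Content-Type: application/x-ndjson`.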

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing all documents. Adding a field to all documents, for example, means (as I understand it) retrieving each document, performing the desired operation, deleting the original document from the index, and reindexing it. This seems somewhat dangerous, and a backup of the original index seems advisable before doing it.
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or whether I should keep the raw documents that make up an index stored separately, so that I can recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options to change the schema. Have a look at this blog which explains changing the schema without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
Also, look at the Update API which gives you the ability to add/remove fields.
The update API allows to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and index back the result (also allows to delete, or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".
Note, this operation still means full reindex of the document, it just removes some network roundtrips and reduces chances of version conflicts between the get and the index. The _source field need to be enabled for this feature to work.
Using Elasticsearch as a final storage solution: It depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore, or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML, etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
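The zero-downtime schema change mentioned in the first point is commonly done with an index alias. The application queries the alias, you reindex into a new index with the new mapping, and then you repoint the alias in one atomic `_aliases` call. Here is a sketch of that request body; the names `my_index`, `my_index_v1` and `my_index_v2` are placeholders.

```python
import json


def alias_swap_body(alias, old_index, new_index):
    """Build the body for POST /_aliases that moves an alias in a single step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


if __name__ == "__main__":
    # Placeholder names: the application only ever talks to "my_index".
    print(json.dumps(alias_swap_body("my_index", "my_index_v1", "my_index_v2"), indent=2))
```

Since both actions are applied together, searches against the alias never see a moment where it points at no index.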
I'm fairly sure that search engines shouldn't be treated as a storage solution, given the nature of these applications. I've never heard of backing up a search engine's index as a practice.
The usual setup when you use Elasticsearch, Solr, or any other search engine:
You have some kind of data source (a database, a legacy mainframe, Excel sheets, a REST service with data, or whatever).
You have a search engine that indexes this data source to add search capability to your system. When the data source changes, you can reindex it fully, or index only the changed part with incremental indexing.
If something happens to the search engine index, you can easily reindex all your data.
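As a rough illustration of the full vs. incremental choice described above, the sketch below picks which records from the data source get sent to the search engine. The `updated_at` field and the function name are assumptions about how your data source tracks changes.

```python
from datetime import datetime


def select_for_indexing(records, last_indexed_at=None):
    """Pick the records to (re)index.

    records: dicts with an 'updated_at' datetime (hypothetical field name).
    last_indexed_at: time of the previous run, or None for a full rebuild.
    """
    if last_indexed_at is None:
        return list(records)  # index lost or first run: rebuild everything
    return [r for r in records if r["updated_at"] > last_indexed_at]


if __name__ == "__main__":
    source = [
        {"id": 1, "updated_at": datetime(2020, 1, 1)},
        {"id": 2, "updated_at": datetime(2020, 3, 1)},
    ]
    print(select_for_indexing(source, last_indexed_at=datetime(2020, 2, 1)))
```

Passing `last_indexed_at=None` covers the "something happened to the index" case, because the data source remains the source of truth.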

How to check elasticsearch query performance?

I need to check Elasticsearch query performance, but because of caching I can't figure out the actual query performance. Is there any way to disable caching?
I tried _cache/clear as suggested in the document below.
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-clearcache.html
$ curl -XPOST 'http://localhost:9200/_cache/clear'
I also tried setting index.cache.filter.type to none in elasticsearch.yml:
index.cache.filter.type : none
I'm using Sense to run the Elasticsearch queries.
Is there any other way to do this?
Maybe restart your Elasticsearch cluster, then run some queries that hit more or less the same data but are not the actual query you want to test, and then run the query you want to test.
I have also noticed that the first query you run against a restarted cluster is slow, but after that everything tends to be fast.
It's very possible that Elasticsearch isn't even caching the query you're trying to get performance data on, and that it's just really, really fast ;)
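The warm-up-then-measure procedure above can be scripted, so you see the first (uncached) run next to the repeated ones. This is a sketch only: `run_query` stands for whatever function sends your query (hypothetical here), and wall-clock time on the client includes network overhead.

```python
import time


def time_query(run_query, warmup_queries=(), repeats=5):
    """Run the warm-up callables, then time run_query `repeats` times.

    Returns the timings in milliseconds. The first entry is the first
    execution of the target query, later entries may benefit from caches.
    """
    for warmup in warmup_queries:
        warmup()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


if __name__ == "__main__":
    # Stand-in workload instead of a real HTTP call to the cluster.
    timings = time_query(lambda: sum(range(100_000)), repeats=3)
    print([round(t, 3) for t in timings])
```

If the first timing is about the same as the later ones, caching is probably not what makes the query fast.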

How exactly does elasticsearch versioning work?

My understanding was that Elasticsearch would store the latest copy of the document and just update the version number. But I was playing around with a few thousand documents and needed to index them repeatedly without changing any data in the documents. I expected the index size to remain the same, but that wasn't the case: the index size seemed to increase.
This confused me a little, so I wanted to ask for clarification on the internal mechanism of versioning within Elasticsearch.
An update is a delete + insert Lucene operation behind the scenes.
But you should know that Lucene does not really delete the document; it only marks it as deleted.
To remove deleted docs, you have to optimize your Lucene segments.
$ curl -XPOST 'http://localhost:9200/twitter/_optimize?only_expunge_deletes=true'
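See the Optimize API, and also have a look at the merge options. Segment merging also happens automatically in the background over time. Note that in Elasticsearch 2.1 and later the _optimize endpoint is named _forcemerge.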
For a general overview of versioning support in Elasticsearch, please refer to the Elasticsearch Versioning Support.
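To see why the index grows when the same documents are indexed again, here is a toy model of the delete + insert behaviour described above. It is not how Lucene stores segments internally, just an illustration of "mark as deleted, reclaim later".

```python
class ToyIndex:
    """Toy model: re-indexing marks the old copy deleted and appends a new one."""

    def __init__(self):
        self._slots = []  # every stored copy, including ones marked deleted
        self._live = {}   # doc_id -> position of the live copy in _slots

    def index(self, doc_id, source):
        """Index (or re-index) a document and return its new version."""
        version = 1
        if doc_id in self._live:
            old = self._slots[self._live[doc_id]]
            old["deleted"] = True  # marked only, the space is not reclaimed yet
            version = old["version"] + 1
        self._slots.append(
            {"id": doc_id, "version": version, "source": source, "deleted": False}
        )
        self._live[doc_id] = len(self._slots) - 1
        return version

    def stored_copies(self):
        return len(self._slots)

    def live_docs(self):
        return len(self._live)

    def expunge_deletes(self):
        """Rough analogue of optimizing with only_expunge_deletes=true."""
        self._slots = [s for s in self._slots if not s["deleted"]]
        self._live = {s["id"]: i for i, s in enumerate(self._slots)}


if __name__ == "__main__":
    idx = ToyIndex()
    for _ in range(3):
        idx.index("1", {"user": "kimchy"})
    print(idx.live_docs(), idx.stored_copies())
    idx.expunge_deletes()
    print(idx.live_docs(), idx.stored_copies())
```

Indexing the same document three times leaves one live document but three stored copies until the deleted ones are expunged, which matches the growth seen in the question.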
