Does Elasticsearch make a copy of my data? - elasticsearch

I copied 1TB of data to a cloud server, then ran Elasticsearch on that folder. Things seemed to index great. However, I noticed that hard disk space went from 33% used to 90% used. So it seems Elastic must have copied the source directory? Can I now delete that 1TB of original data from that machine?

If you run GET _stats/?human you'll see lots of details from your cluster, like how much storage you are using or how many documents you have added. If you have all the data you want in your cluster and it's correctly structured, you can delete the original data. Elasticsearch has its own copy.
BTW, each index gets 1 replica by default, so with more than 1 node you end up with 1 primary and 1 replica copy of the data. If you have a single node, only the primary copy is allocated, because a replica is never placed on the same node as its primary.
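To make the stats check concrete, here is a minimal Python sketch that pulls the headline numbers out of an already-parsed `GET _stats` response. The helper name `summarize_stats` and the sample values are made up for illustration; only the `_all` / `primaries` / `total` fields it reads come from the stats response.

```python
def summarize_stats(stats: dict) -> dict:
    """Extract document count and store sizes from a parsed `GET _stats` body."""
    all_stats = stats["_all"]
    primaries = all_stats["primaries"]  # primary shards only
    total = all_stats["total"]          # primaries plus replicas
    return {
        "docs": primaries["docs"]["count"],
        "primary_bytes": primaries["store"]["size_in_bytes"],
        "total_bytes": total["store"]["size_in_bytes"],
    }


# Made-up sample, trimmed to the fields used above.
sample = {
    "_all": {
        "primaries": {"docs": {"count": 1200}, "store": {"size_in_bytes": 500}},
        "total": {"docs": {"count": 2400}, "store": {"size_in_bytes": 1000}},
    }
}

summary = summarize_stats(sample)
print(summary)
```

If `total_bytes` is roughly double `primary_bytes`, the extra space is the replica copy rather than a second copy of the source files.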

Related

Elasticsearch index is taking up too much disk space

I have an index in production with 1 replica (it takes ~1TB in total). New data arrives in this index all the time (a lot of updates and creates).
When I created a copy of this index by running _reindex (with the same data and 1 replica as well), the new index took only 600 GB.
It looks like the original index contains a lot of junk and some kind of logs that could be cleaned up, but I'm not sure how to do it.
The questions: how do I clean up the index (without _reindex), why is this happening, and how can I prevent it in the future?
Lucene segment files are immutable, so when you delete or update a document (Lucene can't update a doc in place), the old version is only marked as deleted and is not actually removed from disk. ES runs merge operations periodically to "defragment" the data, but you can also trigger a merge manually with _forcemerge (try running it with only_expunge_deletes as well: it might be faster).
Also, make sure your shards are sized correctly and use ILM rollover to keep index size under control.
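As a rough illustration of the advice above, the following Python sketch estimates how much of an index consists of deleted documents and builds the request path for a force merge. The function names are hypothetical; the inputs are assumed to be the `docs.count` and `docs.deleted` values reported by `GET <index>/_stats`. The ratio is only a hint, since deleted documents do not map one-to-one to bytes on disk.

```python
from urllib.parse import quote


def deleted_ratio(docs_count: int, docs_deleted: int) -> float:
    """Fraction of documents in the segments that are marked as deleted."""
    total = docs_count + docs_deleted
    return docs_deleted / total if total else 0.0


def forcemerge_path(index: str, only_expunge_deletes: bool = True) -> str:
    """Build the path for `POST /<index>/_forcemerge`."""
    flag = "true" if only_expunge_deletes else "false"
    return f"/{quote(index, safe='')}/_forcemerge?only_expunge_deletes={flag}"


ratio = deleted_ratio(docs_count=600, docs_deleted=400)
path = forcemerge_path("my-index")  # "my-index" is a placeholder name
print(ratio, path)
```

The path would be sent as a POST to the cluster, for example with curl, ideally at a quiet time because a force merge is I/O heavy.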

Elasticsearch multi data path disk full

I have a 500GB hard drive shared between the OS and Elasticsearch data. Since it was almost full, I added a second 1TB hard drive and added it as a second data path in elasticsearch.yml
(e.g. path.data: /els, /var/lib/elasticsearch).
In Kibana I can see that the available space is now 1.5TB, but every time I send a file it is saved on the original HDD, leaving the 1TB drive empty.
Can someone help me?
Elasticsearch version: 6.6.1
If you send documents to an old index, then that index is not moved to the new path and stays on the old HDD. With multiple data paths, Elasticsearch does not relocate existing shards to the path that has free space.
For further information, see the following docs:
"The path.data settings can be set to multiple paths, in which case all paths will be used to store data (although the files belonging to a single shard will all be stored on the same data path)."
If you want to extend the current path using the new HDD, you can use something like Logical Volume Management (LVM). It is an abstraction over drives, so you can attach many physical disks to a single logical volume.
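Because a shard's files stay on one data path, it can help to see how full each path actually is. The sketch below is only an illustration using Python's standard library; `free_space_by_path` is a hypothetical helper, and the temp directory stands in for the real entries of path.data.

```python
import shutil
import tempfile


def free_space_by_path(paths):
    """Return total and free bytes for the filesystem behind each path."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)
        report[path] = {"total": usage.total, "free": usage.free}
    return report


# Stand-in for the configured data paths, e.g. ["/els", "/var/lib/elasticsearch"].
demo_path = tempfile.gettempdir()
report = free_space_by_path([demo_path])
print(report)
```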

Backing up, Deleting, Restoring Elasticsearch Indexes By Index Folder

Most of the ElasticSearch documentation discusses working with the indexes through the REST API - is there any reason I can't simply move or delete index folders from the disk?
You can move data around on disk, to a point -
If Elasticsearch is running, it is never a good idea to move or delete the index
folders, because Elasticsearch will not know what happened to the data, and you
will get all kinds of FileNotFoundExceptions in the logs as well as indices
that are red until you manually delete them.
If Elasticsearch is not running, you can move index folders to another node (for
instance, if you were decommissioning a node permanently and needed to get the
data off), however, if you delete or move the folder to a place where
Elasticsearch cannot see it when the service is restarted, then Elasticsearch
will be unhappy. This is because Elasticsearch writes what is known as the
cluster state to disk, and in this cluster state the indices are recorded, so if
ES starts up and expects to find index "foo", but you have deleted the "foo"
index directory, the index will stay in a red state until it is deleted through
the REST API.
Because of this, I would recommend that if you want to move or delete individual
index folders from disk, that you use the REST API whenever possible, as it's
possible to get ES into an unhappy state if you delete a folder that it expects
to find an index in.
EDIT: I should mention that it's safe to copy (for backups) an indices folder,
from the perspective of Elasticsearch, because it doesn't modify the contents of
the folder. Sometimes people do this to perform backups outside of the snapshot
& restore API.
I use this procedure: I close, backup, then delete the indexes.
curl -XPOST "http://127.0.0.1:9200/*index_name*/_close"
After this point all index data is on disk and in a consistent state, and no writes are possible. I copy the directory where the index is stored and then delete it:
curl -XDELETE "http://127.0.0.1:9200/*index_name*"
By closing the index, Elasticsearch stops all access to the index. Then I send a command to delete the index (and all corresponding files on disk).
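The close, copy, delete procedure could be scripted roughly as follows. This is a sketch under assumptions, not a tested backup tool: the function names are hypothetical, and `index_dir` must be the on-disk folder of the index, which the caller has to look up (in recent versions the folder is named after the index UUID, not the index name). Note that an index is deleted with the HTTP DELETE method on the index URL.

```python
import shutil
import urllib.request


def close_request(base_url: str, index: str) -> urllib.request.Request:
    """POST /<index>/_close: stop reads and writes on the index."""
    return urllib.request.Request(f"{base_url}/{index}/_close", method="POST")


def delete_request(base_url: str, index: str) -> urllib.request.Request:
    """DELETE /<index>: remove the index and its files on disk."""
    return urllib.request.Request(f"{base_url}/{index}", method="DELETE")


def archive_index(base_url: str, index: str, index_dir: str, backup_dir: str) -> None:
    """Close the index, copy its folder, then delete it from the cluster."""
    urllib.request.urlopen(close_request(base_url, index))
    shutil.copytree(index_dir, backup_dir)  # backup_dir must not exist yet
    urllib.request.urlopen(delete_request(base_url, index))


# Building the requests does not contact the cluster.
req = delete_request("http://127.0.0.1:9200", "index_name")
print(req.get_method(), req.full_url)
```

`archive_index` is deliberately not called here, since it needs a running node and a real index folder.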

Use Elasticsearch as backup store

My application receives and parses thousands of small JSON snippets, each about 1KB, every hour. I want to create a backup of all incoming JSON snippets.
Is it a good idea to use Elasticsearch to back up these snippets in an index with, for example, "number_of_replicas": 4? I have never read of anyone using Elasticsearch for this.
Is my data safe in Elasticsearch when I use a cluster of servers and replicas, or should I use another storage system for this use case?
(Writing to the local file system isn't safe, as our hard disks crash often. At first I thought about using HDFS, but it isn't made for small files.)
First, you need to understand the difference between replicas and backups.
A replica is an additional copy of the data at run time. It improves availability and failover support, but it won't protect against accidental deletion of data.
A backup is a copy of the whole data set at backup time. It is used to restore the data after a system crash.
Using Elasticsearch as a backup store is not a good idea. Elasticsearch is a search engine, not a DB. If you have not configured the ES cluster carefully, you can end up losing data.
So, in my opinion:
There are plenty of DBs for storing JSON objects. For example, MongoDB is a NoSQL DB. It can easily be configured with more replicas, which means high availability of data and failover support. As you asked, it is also open source and more reliable.
For more info about MongoDB, refer to https://www.mongodb.org/
Update:
In Elasticsearch, if you create an index with more shards, they are distributed among the nodes. If a node fails and its shards have no replicas on other nodes, that data is lost. In MongoDB, each node of a replica set contains its own copy of the data, so if one node fails we can retrieve our data from the other replicas. We need to be more careful about replica setup and shard allocation in Elasticsearch; in MongoDB it's easier, and the architecture is good too.
Note: I didn't say storing data in Elasticsearch is not safe. I mean that, compared to MongoDB, replicas are more difficult to configure and maintain in Elasticsearch.
Hope it helps!
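For reference, the replica count from the question is an index setting. The sketch below builds a request body for `PUT /<index>` and works out how many copies of each shard that implies; the function names and the shard count are assumptions for illustration. Since a replica is never allocated on the same node as its primary, 4 replicas only become fully assigned on a cluster with at least 5 data nodes.

```python
import json


def index_settings(number_of_shards: int, number_of_replicas: int) -> dict:
    """Request body for creating an index with explicit shard/replica counts."""
    return {
        "settings": {
            "index": {
                "number_of_shards": number_of_shards,
                "number_of_replicas": number_of_replicas,
            }
        }
    }


def copies_per_shard(number_of_replicas: int) -> int:
    """Each shard exists once as a primary plus once per replica."""
    return 1 + number_of_replicas


def min_nodes_for_full_allocation(number_of_replicas: int) -> int:
    """Copies of the same shard must live on different nodes."""
    return copies_per_shard(number_of_replicas)


body = index_settings(number_of_shards=3, number_of_replicas=4)
print(json.dumps(body))
print(copies_per_shard(4), min_nodes_for_full_allocation(4))
```

Even with 5 copies, a delete request removes the document from all of them, which is why replicas do not replace a backup.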

Moving an elasticsearch index of one node in a machine to another drive of the same machine

I have an elasticsearch node in a machine with a 150gb ssd and a 3 tb hdd. Since I am running out of space in the ssd, I would like to move one index from the ssd to the hdd. Is this possible? If so how?
I could create another node on the hdd, but I'd rather have one node in the machine...
Thanks!
You can safely move the data directory (and individual indexes and even shards) around. We've scp'd entire indexes around in this manner.
You probably should not actively index or delete when you are doing this though, or unpredictable things could happen.
Once you do the move, you just need to tell Elasticsearch where to find the data directory. You set this in the Elasticsearch config file (elasticsearch.yml) found in /etc/elasticsearch.
Just add this setting:
path:
  logs: /path/to/log/files
  data: /path/to/data/directory
You might want to cp and not mv, just in case things don't go as planned.
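The "copy first, then switch" approach can be sketched in Python as below. It assumes the node is stopped while copying; `copy_data_dir` and `count_files` are hypothetical helpers, and the demo runs against a throwaway temp directory instead of a real data directory.

```python
import os
import shutil
import tempfile


def count_files(root: str) -> int:
    """Count regular files under root, recursively."""
    return sum(len(files) for _, _, files in os.walk(root))


def copy_data_dir(src: str, dst: str) -> int:
    """Copy src to dst (dst must not exist) and check the file counts match."""
    shutil.copytree(src, dst)
    copied = count_files(dst)
    if copied != count_files(src):
        raise RuntimeError("file count mismatch after copy; keep the original")
    return copied


# Demo on a temporary tree that stands in for the old data directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "old_data")
    os.makedirs(os.path.join(src, "nodes", "0"))
    for name in ("a.bin", "b.bin"):
        with open(os.path.join(src, "nodes", "0", name), "w") as fh:
            fh.write("x")
    copied = copy_data_dir(src, os.path.join(tmp, "new_data"))

print(copied)
```

Only after the copy checks out and Elasticsearch starts cleanly with the new path.data would the old directory be removed.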

Resources