ElasticSearch mass delete by URL - elasticsearch

I have about 8 million documents (with URLs) in an Elasticsearch (v0.90) index, but only 3.5 million of them are "valid" (need to be searchable). I have a list of all "valid" URLs (no particular pattern to them).
How can I delete the other 4.5M documents from the index?

You can use the Delete By Query API.
Then you will probably need to run an Optimize.
But you will need a lot of free disk space for that.
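To make that concrete, here is a rough sketch of what it could look like on 0.90, assuming the index is called pages, the field url is indexed not_analyzed, and you have first derived the list of "invalid" URLs by diffing what is in the index against your valid list (all names here are made up):

# delete one batch of unwanted URLs (repeat for each batch)
curl -XDELETE 'http://localhost:9200/pages/_query' -d '
{
  "terms" : { "url" : ["http://example.com/old-1", "http://example.com/old-2"] }
}'

# afterwards, reclaim the disk space held by deleted documents
curl -XPOST 'http://localhost:9200/pages/_optimize?only_expunge_deletes=true'

With 4.5M documents to remove you would run the delete in many such batches rather than one giant request.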

Related

Filter result in memory to search in elasticsearch from multiple indexes

I have 2 indexes and they both have one common field (basically a relationship).
Since Elasticsearch doesn't let us filter across multiple indexes, should we load the results into a variable in memory and filter them in Node.js (which basically means that my application itself is now acting as a database server)?
We previously used MongoDB, which is also a NoSQL DB, but there we could handle this with aggregation queries; Elasticsearch doesn't seem to provide that.
So even if we use both databases together, we have to store their results somewhere to filter them further, because we give users an advanced search feature that filters data across multiple collections.
So should we store results in memory to filter the data further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides; now we are planning to offer Elasticsearch's text search to customers.
What approach do you suggest for making MongoDB and Elasticsearch work together? We are using Node.js to serve the data.
Or which of these options should we choose:
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all, the simplest way to solve this problem is to make one combined index. It's trivial to do: just insert all the documents from index-2 into index-1, prefixing every field coming from index-2 with something like "idx2" so you don't overwrite any similarly named fields. You can use an ingest pipeline to do this, or just do it client side. You only ever have to do this once.
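As a rough sketch of the server-side variant, you could use the Reindex API with a script that adds the prefix while copying (the index names "index-1"/"index-2" and the "idx2_" prefix are placeholders; an ingest pipeline with a script processor would achieve the same effect):

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type: application/json' -d '
{
  "source": { "index": "index-2" },
  "dest":   { "index": "index-1" },
  "script": {
    "lang": "painless",
    "source": "Map copy = new HashMap(ctx._source); ctx._source.clear(); for (def k : copy.keySet()) { ctx._source.put(\"idx2_\" + k, copy.get(k)); }"
  }
}'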
After that you can perform aggregations on the single index, since you have all the data in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that previously went into index-2 also goes into index-1 (with the prefixed fields).
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client side is NOT RECOMMENDED, as it throws away the entire value of using ES.
If you need any further help executing this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/Postgres or even MongoDB is to restructure the indices to suit your query patterns. Never try to work across multiple indices; ES is not built for that.
HTH.

ElasticSearch as primary DB for document library

My task is a full-text search system for a really large number of documents. Right now I have the documents as RTF files plus their metadata, and all of this will be indexed in Elasticsearch. The documents are immutable (they can only be deleted) and I don't really expect many new documents per day. So is it a good idea to use Elasticsearch as the primary DB in this case?
Maybe I'll store the RTF files separately, but I really don't see the point of storing all this data somewhere else.
This question was solved here, so it's a good case for Elasticsearch as the primary DB.
Elasticsearch is better known as a distributed full-text search engine than as a database.
If you preserve the document _source, it can be used as a database, since almost any time you apply document changes or mapping changes you need to re-index the documents into the index (the equivalent of a table in the relational world): there is no way to update parts of the underlying Lucene inverted index, you have to re-index the whole document.
Elasticsearch's index-survival mechanism is one of the best, meaning that if you lose a node, the index's lost replicas are automatically replicated to some of the other available nodes, so you don't need to do anything manually.
If you do regular backups and have no requirement for the data to be available 24/7, it is completely acceptable to hold the data and the full-text index in Elasticsearch as you would in a database.
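For the regular backups, a minimal sketch using the snapshot API (the repository name and filesystem path are placeholders, and the path has to be whitelisted under path.repo in elasticsearch.yml):

# register a shared-filesystem snapshot repository
curl -XPUT 'localhost:9200/_snapshot/my_backup' -H 'Content-Type: application/json' -d '
{
  "type": "fs",
  "settings": { "location": "/mnt/es_backups" }
}'

# take a snapshot of all indices and wait for it to finish
curl -XPUT 'localhost:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true'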
But if you need a highly available setup, I would recommend keeping the documents in MongoDB, for example (regarded as one of the best distributed document stores), and using Elasticsearch only for its original purpose as a full-text search engine.

Elasticsearch indices eating too much space

I'm using Elasticsearch 7.5.2 on Ubuntu. Recently, I began using Elasticsearch to display relevant search results on every page load. This drove the volume up considerably, but I also found out that it has created large index files. Note that I'm using 'app-search' to power my queries.
Here's the sample index files that are occupying too much space:
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.26 => 52 GB
.app-search-analytics-logs-loco_togo_production-7.1.0-2020.01.27 => 53 GB
I tried deleting these using curl, but they reappear, now taking up less space (~5 GB each).
I want to know if there is a way to control these indices. I'm not sure what purpose they serve and whether there is a way to prevent them.
I tried deleting these using curl, but they reappear, now taking up less space (~5 GB each).
Obviously your delete action was executed. It seems that the indices are still being written to: if documents keep arriving in Elasticsearch, the index gets re-created.
So for example:
The index from 2020.01.27 had 53 GB before the deletion. After you delete it, the data is gone and the index itself too. But as soon as new documents for that very same day (2020.01.27) get indexed, the index is re-created, containing only the documents indexed after the deletion, which is probably the ~5 GB you see.
If this is not what you want, you need to check whether some sources are still sending data.
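One quick way to check is to watch the document counts over time with the _cat API (the index pattern below simply mirrors the names from the question):

curl 'localhost:9200/_cat/indices/.app-search-analytics-logs-*?v&h=index,docs.count,store.size'

If docs.count keeps climbing between two runs, something is still shipping analytics events into those indices.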
Hope this helps.
EDIT:
Q: However, is there a way to manage these indices? I don't want them to eat up too much space.
Yes! Index Lifecycle Management (ILM) is what you are looking for. It automates the maintenance/management of indices. For example, you could define a rollover to a new index every 30 GB in order to keep them small. Another example is to delete an index after X days. Take a look at all the phases and actions.
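A minimal sketch of such a policy (the policy name and the thresholds are placeholders, and you would still have to attach the policy to the relevant index template or indices):

curl -XPUT 'localhost:9200/_ilm/policy/logs-cleanup' -H 'Content-Type: application/json' -d '
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "30gb" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}'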

Is it possible to limit a size of an Elasticsearch index?

I have an Elasticsearch instance for indexing log records. Naturally the data grows over time and I would like to limit its size (to about 10 GB), something like a MongoDB capped collection.
I'm not interested in old log records anyway.
I haven't found any config option for this and I'm not sure that I can just remove data files.
Any suggestions?
The Elasticsearch "way" of dealing with "old" data is to create time-based indices. Meaning, for each day or each week you create an index. Index everything belonging to that day/week in that index.
You decide how many days you want to keep around and stick to that number. Let's say that 7 days' worth of data adds up to 10 GB. On the 8th day you create the new index as usual, then you delete the index from 8 days before.
At any given time you'll have 7 indices in your cluster.
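As a rough sketch of the daily rotation (index names and dates are just examples):

# new log records always go into today's index
curl -XPOST 'localhost:9200/logs-2016.03.15/log' -d '
{ "@timestamp": "2016-03-15T10:00:00Z", "message": "user logged in" }'

# once a day, drop the index that has fallen out of the 7-day window
curl -XDELETE 'localhost:9200/logs-2016.03.08'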
Using ttl, as the other poster suggested, is not recommended, because it is far more difficult and it creates additional pressure on the cluster. The ttl mechanism checks for expired documents every indices.ttl.interval (60 seconds by default), builds bulk delete requests out of them and executes them. This means a constant stream of unnecessary requests hitting the cluster.
Instead, deleting an index is very easy and quick.
Take a look at this and how to easily manage time based indices with Curator.
From what I remember, a capped collection in MongoDB was just a circular-buffer type of collection that removes the oldest entries when there's no more room? Unfortunately there's nothing like this out of the box in Elasticsearch; you have to add this functionality yourself by removing single documents (or batches of documents) using ES's API. A more performant way is described in their documentation under "retiring data".
You can provide a per-index/type default _ttl (time to live) value as follows:
{
    "tweet" : {
        "_ttl" : { "enabled" : true, "default" : "1d" }
    }
}
You will find more detail here: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-ttl-field.html
Regards,
Alain

Is Elasticsearch suitable as a final storage solution?

I'm currently learning Elasticsearch, and I have noticed that a lot of operations for modifying indices require reindexing all the documents, such as adding a field to every document. As I understand it, this means retrieving the document, performing the desired operation, deleting the original document from the index and reindexing it. This seems somewhat dangerous, and a backup of the original index seems preferable before doing it (obviously).
This made me wonder whether Elasticsearch is actually suitable as a final storage solution at all, or whether I should keep the raw documents that make up an index stored separately so that I can recreate the index from scratch if necessary. Or is a regular backup of the index safe enough?
You are talking about two issues here:
Deleting old documents and re-indexing on schema change: You don't always have to delete old documents when you add new fields. There are various options for changing the schema. Have a look at this blog post, which explains how to change the mapping without any downtime.
http://www.elasticsearch.org/blog/changing-mapping-with-zero-downtime/
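The core idea in that post is to read and write through an alias, so a mapping change becomes: create a new index with the new mapping, reindex into it, then swap the alias atomically. A minimal sketch of the swap (index and alias names are placeholders):

curl -XPOST 'localhost:9200/_aliases' -d '
{
  "actions": [
    { "remove": { "index": "items_v1", "alias": "items" } },
    { "add":    { "index": "items_v2", "alias": "items" } }
  ]
}'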
Also, look at the Update API, which gives you the ability to add or remove fields.
The update API allows you to update a document based on a provided script. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (it also allows you to delete the document, or ignore the operation). It uses versioning to make sure no updates have happened between the "get" and the "reindex".
Note that this operation still means a full reindex of the document; it just removes some network round trips and reduces the chance of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.
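For example, a scripted partial update that sets a single field could look like this (index, type, id and field names are made up; this is the older "script"/"params" syntax used at the time, while newer versions expect a script object with "source" and Painless):

curl -XPOST 'localhost:9200/blog/post/1/_update' -d '
{
  "script" : "ctx._source.category = newcat",
  "params" : { "newcat" : "search" }
}'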
Using Elasticsearch as a final storage solution at all: it depends on how you intend to use Elasticsearch as storage. Do you need an RDBMS, a key-value store, a column-based datastore or a document store like MongoDB? Elasticsearch is definitely well suited when you need a distributed document store (JSON, HTML, XML etc.) with Lucene-based advanced search capabilities. Have a look at the various use cases for ES, especially the usage at The Guardian: http://www.elasticsearch.org/case-study/guardian/
I'm pretty sure that search engines shouldn't be viewed as a storage solution, because of the nature of these applications. I've never heard of backing up a search engine's index as a practice.
The usual setup when you use Elasticsearch or Solr or whatever search engine you have:
You have some kind of datasource (it could be a database, a legacy mainframe, Excel sheets, some REST service with data, or whatever).
You have a search engine that indexes this datasource to add search capability to your system. When the datasource changes, you can reindex it, or index only the changed part with the help of incremental indexing.
If something happens to the search engine's index, you can easily reindex all your data.

Resources