Data load from HDFS to ES taking very long time - elasticsearch

I have created an external table in Hive and need to move the data to ES (a cluster of 2 nodes, each with 1 TB). The query below takes a very long time (more than 6 hours) for a source table with 9 GB of data.
INSERT INTO TABLE <ES_DB>.<EXTERNAL_TABLE_FOR_ES>
SELECT COL1, COL2, COL3..., COL10
FROM <HIVE_DB>.<HIVE_TABLE>;
The ES index has the default 5 shards and 1 replica. Would increasing the number of shards speed up the ingestion in any way?
Could someone suggest improvements to speed up ingestion into the ES nodes?

You don't mention the methodology you're using to feed the data into ES, so it's hard to tell whether you're using an ingest pipeline or what technology bridges the gap. Given that, I'll stick to generic advice on how to optimize ingestion into Elasticsearch.
Elastic has published some guidance for optimizing systems for ingestion, and there are three points that we've found make a real difference:
Turn Off Replicas: Set the number of replicas to zero while ingesting the data, to eliminate the need to copy the data while also indexing it. This is an index-level setting ("number_of_replicas").
Don't Specify an ID: It isn't clear from your schema whether you are mapping across any identifiers, but if you can avoid specifying a document ID and let Elasticsearch assign its own, that significantly improves performance.
Use Parallel Bulk Operations: Use the Bulk API to push data into ES and feed it from multiple threads, so the server always has more than one bulk request to work on.
Finally, have you installed Kibana and monitored your nodes to see what they are limited by, in particular CPU or memory?
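As an illustration of points 1 and 3, here is a minimal sketch using the official Python client (elasticsearch-py). The index name my_hive_index, the node addresses, and the generated rows are placeholders, and in practice a Hive external table would push data through the ES-Hadoop connector rather than a hand-written loader, so treat this only as a demonstration of the settings and the Bulk API usage:

# Sketch: faster bulk ingestion into Elasticsearch (hypothetical index/host names).
from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch(["http://es-node1:9200", "http://es-node2:9200"])

# 1. Turn off replicas (and pause refresh) for the duration of the load.
es.indices.put_settings(
    index="my_hive_index",
    body={"index": {"number_of_replicas": 0, "refresh_interval": "-1"}},
)

def actions(rows):
    # 2. No "_id" field: let Elasticsearch assign document IDs itself.
    for row in rows:
        yield {"_index": "my_hive_index", "_source": row}

# 3. Parallel bulk requests so the cluster always has more than one request queued.
rows = ({"col1": i, "col2": "value"} for i in range(100_000))  # stand-in for the Hive data
for ok, info in parallel_bulk(es, actions(rows), thread_count=4, chunk_size=1000):
    if not ok:
        print("failed:", info)

# Restore replicas and refresh once the load is done.
es.indices.put_settings(
    index="my_hive_index",
    body={"index": {"number_of_replicas": 1, "refresh_interval": "1s"}},
)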

Related

How to do small queries efficiently in Clickhouse

In our deployment, there are one thousand shards. Insertions are done via a distributed table with the sharding key jumpConsistentHash(colX, 1000). When I query for rows with colX=... and turn on send_logs_level='trace', I see that the query is sent to all shards and executed on each shard. This is limiting our QPS (queries per second). Checking the ClickHouse documentation, it states:
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly).
When you add a new shard, you don’t have to transfer the old data to it.
You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
You should be concerned about the sharding scheme in the following cases:
* Queries are used that require joining data (IN or JOIN) by a specific key. If data is sharded by this key, you can use local IN or JOIN instead of GLOBAL IN or GLOBAL JOIN, which is much more efficient.
* A large number of servers is used (hundreds or more) with a large number of small queries (queries of individual clients - websites, advertisers, or partners).
In order for the small queries to not affect the entire cluster, it makes sense to locate data for a single client on a single shard.
Alternatively, as we’ve done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into “layers”, where a layer may consist of multiple shards.
Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them.
Distributed tables are created for each layer, and a single shared distributed table is created for global queries.
It seems there is a solution for small queries such as ours (the second bullet above), but I am not clear on the point. Does it mean that when running a query with the predicate colX=..., I need to find the corresponding "layer" that contains its rows and then query the distributed table for that layer?
Is there a way to query the global distributed table for these small queries?
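For reference, a minimal sketch of the setup described in the question, written with the clickhouse-driver Python package; the table and cluster names are hypothetical and the DDL simply mirrors what the question states:

# Sketch of the sharding setup described above (hypothetical names).
from clickhouse_driver import Client

client = Client(host="clickhouse-host")

# Local table on every shard, plus a Distributed table that routes inserts
# with jumpConsistentHash(colX, 1000), as in the question.
client.execute("""
    CREATE TABLE IF NOT EXISTS events_local ON CLUSTER my_cluster
    (colX UInt64, payload String)
    ENGINE = MergeTree ORDER BY colX
""")
client.execute("""
    CREATE TABLE IF NOT EXISTS events_dist ON CLUSTER my_cluster AS events_local
    ENGINE = Distributed(my_cluster, currentDatabase(), events_local,
                         jumpConsistentHash(colX, 1000))
""")

# A "small" point query: by default this fans out to every shard, which is
# the behaviour the question observes with send_logs_level='trace'.
rows = client.execute(
    "SELECT count() FROM events_dist WHERE colX = %(x)s",
    {"x": 42},
    settings={"send_logs_level": "trace"},
)
print(rows)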

Elasticsearch maximum index count limit

Is there any limit on how many indices we can create in Elasticsearch?
Can 100,000 indices be created in Elasticsearch?
I have read that a maximum of 600-1000 indices can be created. Can this be scaled?
E.g. I have a number of stores, and each store has items. Each store will have its own index where its items will be indexed.
There is no hard limit as such, but obviously you don't want to create too many indices (how many is too many depends on your cluster, nodes, size of the indices, etc.). In general it's not advisable, as it can have a severe impact on cluster functioning and performance.
Please check Loggly's blog; their first point is about proper provisioning, and below is the relevant text from that blog.
ES makes it very easy to create a lot of indices and lots and lots of shards, but it's important to understand that each index and shard comes at a cost. If you have too many indices or shards, the management load alone can degrade your ES cluster performance, potentially to the point of making it unusable. We're focusing on management load here, but running too many indices/shards can also have pretty significant impacts on your indexing and search performance.
The biggest factor we've found to impact management overhead is the size of the Cluster State, which contains all of the mappings for every index in the cluster. At one point, we had a single cluster with a Cluster State size of over 900MB! The cluster was alive but not usable.
Edit: Thanks @Silas, who pointed out that from ES 2.X, cluster state updates are not that costly (as only the diff is sent in the update call). More info on this change can be found on this ES issue.
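If you want to see how close a cluster is to that situation, here is a quick sketch with the Python client (a 7.x client is assumed, where responses are plain dicts; the host is a placeholder) that reports shard counts and a rough cluster-state size:

# Sketch: gauging index/shard counts and the size of the cluster state.
import json
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

health = es.cluster.health()
print("active shards:", health["active_shards"], "status:", health["status"])

# The cluster state contains the mappings and settings of every index, so its
# serialized size is a rough proxy for the management overhead described above.
state = es.cluster.state()
print("approx. cluster state size:", len(json.dumps(state)), "bytes")

# Per-index overview via the cat API.
for row in es.cat.indices(format="json"):
    print(row["index"], row["pri"], row["rep"], row["store.size"])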

How Yandex implemented 2-layered sharding

In the ClickHouse documentation, there is a mention of Yandex.Metrica implementing bi-level sharding.
"Alternatively, as we've done in Yandex.Metrica, you can set up bi-level sharding: divide the entire cluster into "layers", where a layer may consist of multiple shards. Data for a single client is located on a single layer, but shards can be added to a layer as necessary, and data is randomly distributed within them."
Is there a detailed implementation of this sharding scheme documented somewhere?
Logically, Yandex.Metrica has only one high-cardinality ID column that serves as the main sharding key.
By default, SELECTs from a table with the Distributed engine request partial results from one replica of each shard.
If you have hundreds of servers or more, querying all shards (probably 1/2 or 1/3 of all servers) involves a lot of network communication, which might introduce more latency than the actual query execution.
The reason for this behavior is that ClickHouse allows writing data directly to shards (bypassing the Distributed engine and its configured sharding key), and an application that does so is not forced to comply with the sharding key of the Distributed table (it can choose differently to spread data more evenly, or for whatever other reason).
So the idea of bi-level sharding is to split a large cluster into smaller sub-clusters (10-20 servers each) and make most SELECT queries go through Distributed tables that are configured against sub-clusters, thus requiring less network communication and lowering the impact of possible stragglers.
A global Distributed table for the whole large cluster is also configured for ad-hoc or overview-style queries, but those are not as frequent and have looser latency requirements.
This still leaves the application that writes data free to balance it arbitrarily between the shards forming a sub-cluster (by writing directly to them).
But to make this all work together, applications that write and read data need a consistent mapping from whatever high-cardinality ID is used (CounterID in the case of Metrica) to the sub-cluster ID and the hostnames it consists of. Metrica stores this mapping in MySQL, but in other cases something else might be more applicable.
An alternative approach is to use the "optimize_skip_unused_shards" setting, which makes SELECT queries that have a condition on the sharding key of the Distributed table skip shards that are not supposed to have the data. It introduces the requirement that data be distributed between the shards exactly as if it had been written through this Distributed table, otherwise the report will not include misplaced data.
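A rough sketch of the two read paths described above, using the clickhouse-driver Python package; the cluster and table names are hypothetical, and a local table hits_local is assumed to already exist on every node:

# Sketch of the two read paths (hypothetical cluster/table names).
from clickhouse_driver import Client

client = Client(host="clickhouse-host")

# Per-layer Distributed table: most client-scoped queries go here and touch
# only the 10-20 servers of that sub-cluster.
client.execute("""
    CREATE TABLE IF NOT EXISTS hits_layer1_dist AS hits_local
    ENGINE = Distributed(layer1_cluster, currentDatabase(), hits_local)
""")

# Global Distributed table over the whole cluster, for ad-hoc/overview queries.
client.execute("""
    CREATE TABLE IF NOT EXISTS hits_all_dist AS hits_local
    ENGINE = Distributed(all_layers_cluster, currentDatabase(), hits_local,
                         intHash64(CounterID))
""")

# Application side: resolve which layer a CounterID lives on (the mapping the
# answer says Metrica keeps in MySQL), then query that layer's table.
layer_table = "hits_layer1_dist"  # result of the external CounterID -> layer lookup
print(client.execute(f"SELECT count() FROM {layer_table} WHERE CounterID = 42"))

# Alternative: keep one global table but let ClickHouse prune shards; this only
# works if data placement matches the global table's sharding key.
print(client.execute(
    "SELECT count() FROM hits_all_dist WHERE CounterID = 42",
    settings={"optimize_skip_unused_shards": 1},
))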

What is the best way to store big data and create instant search with ES?

I am working on a project that will store millions of records per day. I want to store them in a compressed structure (only the searchable fields, with unwanted fields removed) in Elasticsearch for instant text search, but I also want the uncompressed data to be stored for later processing and analytics. It should have high write speed and be cheap for storing billions of records.
Elasticsearch allows you to decide, per index, where to store it (via shard allocation) and what kind of compression you would like to use (via the index codec).
So, with unlimited resources and time, you could design a process where you index documents into daily indices, for example on a 5-node cluster where you keep the last 7 days on 3 of the servers (let's call these the fast servers) and anything older than that on the 2 slower servers. That way, queries ranged over the last 7 days will run faster, while jobs that are not time-sensitive can run against the older indices stored on the slower servers.
The fast servers could have more computing power and faster SSD disks, while the slower servers have normal spinning disks.
Regarding compression, Elasticsearch compression works on the _source data, so it should not affect aggregation speed. It's also important to note that if you change the index compression, it will only apply to new/updated documents and will not be applied retroactively to documents you indexed in the past.
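As a sketch of those two per-index controls with the Python client (the index name and the "box_type" node attribute are assumptions; the attribute would have to be defined in each node's elasticsearch.yml, e.g. node.attr.box_type: hot or warm):

# Sketch: per-index placement (allocation filtering) and compression (codec).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Daily index created on the "fast" servers with the default LZ4 codec.
es.indices.create(
    index="logs-2023.09.01",
    body={
        "settings": {
            "index.routing.allocation.require.box_type": "hot",
            "index.codec": "default",
        }
    },
)

# Later: move the index to the slower servers and switch to best_compression.
# index.codec is a static setting, so the index is closed first; the new codec
# only applies to segments written after the change, as noted above.
es.indices.close(index="logs-2023.09.01")
es.indices.put_settings(
    index="logs-2023.09.01",
    body={
        "index.routing.allocation.require.box_type": "warm",
        "index.codec": "best_compression",
    },
)
es.indices.open(index="logs-2023.09.01")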

Performance degrades after adding solr nodes

I'm having an odd issue where I set up a DSE 4.0 cluster with 1 Cassandra node and 1 Solr node (using DseSimpleSnitch) and performance is great. If I add nodes to get 3 Cassandra nodes and 3 Solr nodes, the performance of my Solr queries goes downhill dramatically. Does anyone have any idea what I might be doing wrong? I have basically all default options for DSE and have tried wiping all data and recreating everything from scratch several times with the same result. I've also tried creating the keyspace with replication factors of 1 and 2, with the same results.
Maybe my use case is a bit odd, but I'm using Solr for OLTP-type queries (via SolrJ with binary writers/readers), which is why performance is critical. With a very light workload of, say, 5 clients making very simple Solr queries, the response times go up about 50% from a single Solr node to 3 Solr nodes, with only a few hundred small documents seeded for my test (~25 ms to ~50 ms). The response times are about 2 to 3 times slower with 150 clients against 3 nodes compared to a single node. The response times for Cassandra are unchanged; it's only the Solr queries that get slower.
Could there be something with my configuration causing this?
Solr queries need to fan out to cover the full range of keys for the column family. So, when you go from one node to three nodes, it should be no surprise that the total query time rises to roughly three times that of a query that can be satisfied by a single node.
You haven't mentioned the RF for the Search DC.
For more complex queries, the fan-out would give a net reduction in query latency, since only a fraction of the total query time would occur on each node, while for a small query the overhead of the fan-out and aggregation of query results dwarfs the time to do the actual Solr core query.
Generally, Cassandra queries tend to be much simpler than Solr queries, so they are rarely comparable.
Problem solved. After noticing that the documentation says not to use virtual nodes for Solr nodes (without saying why), I checked my configuration and noticed I was using virtual nodes. I changed my configuration to not use virtual nodes and the performance issue disappeared. I also upgraded from 4.0.0 to 4.0.2 at the same time, but I'm pretty sure it was the virtual nodes causing the problem.
