Elasticsearch can't handle multiple requests without dramatically decreasing its performance

I have a two-node cluster hosted in Elastic Cloud.
Host: Elastic Cloud
Platform: Google Cloud
Region: US Central 1 (Iowa)
Memory: 8 GB
Storage: 192 GB
SSD: Yes
HA: Yes
Each node has:
Allocated processors: 2
Number of processors: 2
Number of indices: 4*
Shards (per index): 5*
Number of replicas: 1
Number of documents: 150M
Allocated disk: 150 GB
* These counts cover the main indices only; Kibana and Watcher also create a bunch of small indices.
My documents are mostly text. There are some other fields (no more than 5 per index) and no nested objects. Index specs:
| Index | Avg Doc Length | # Docs | Disk |
|---------|----------------|--------|------|
| index-1 | 300 | 80M | 70GB |
| index-2 | 500 | 5M | 5GB |
| index-3 | 3000 | 2M | 10GB |
| index-4 | 2500 | 18M | 54GB |
When the system is idle, response time (load time) is typically a few seconds. But when I simulate the behavior of 10 users I start to get timeouts in my application. The timeout was originally 10 s; I raised it to 60 s and I am still having issues. Below is a chart from a simulation of 10 concurrent users using the Search API.
The red line is the total request time in seconds and the dotted pink line is my 60-second timeout. So I'd say most of the time my users will experience a timeout. The query I've used is quite simple:
{
  "size": 500,
  "from": ${FROM},
  "query": {
    "query_string": {
      "query": "good OR bad"
    }
  }
}
I've tried every tweak I know of. I don't know whether this is simply the real ES performance and I have to accept it and upgrade my plan.
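For reference, the 10-concurrent-user simulation can be reproduced with something like the following sketch (plain Java 11 HttpClient; the cluster URL, index name, and pagination offsets are placeholders, not my actual setup):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ConcurrentSearchSim {
    // Placeholder endpoint; point this at your own cluster and index.
    private static final String URL = "https://my-cluster.example.com:9243/index-1/_search";

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(10); // 10 "users"
        for (int user = 0; user < 10; user++) {
            final int from = user * 500; // stands in for the ${FROM} offset
            pool.submit(() -> {
                String body = "{\"size\":500,\"from\":" + from
                        + ",\"query\":{\"query_string\":{\"query\":\"good OR bad\"}}}";
                HttpRequest req = HttpRequest.newBuilder(URI.create(URL))
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(body))
                        .build();
                long start = System.nanoTime();
                try {
                    HttpResponse<String> resp =
                            client.send(req, HttpResponse.BodyHandlers.ofString());
                    System.out.printf("from=%d status=%d took=%dms%n",
                            from, resp.statusCode(), (System.nanoTime() - start) / 1_000_000);
                } catch (Exception e) {
                    System.out.printf("from=%d failed: %s%n", from, e);
                }
            });
        }
        pool.shutdown();
    }
}
```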

Related

Is one TaskManager with three slots the same as three TaskManagers with one slot in Apache Flink?

In Flink, as I understand it, the JobManager can assign a job to multiple TaskManagers with multiple slots if necessary. For example, one job can be assigned to three TaskManagers, using five slots in total.
Now, say I run one TaskManager (TM) with three slots, assigned 3 GB of RAM and one CPU.
Is this exactly the same as running three TaskManagers that share one CPU, each assigned 1 GB of RAM?
case 1
---------------
| 3G RAM |
| one CPU |
| three slots |
| TM |
---------------
case 2
--------------------------------------------|
| one CPU |
| ------------ ------------ ------------ |
| | 1G RAM | | 1G RAM | | 1G RAM | |
| | one slot | | one slot | | one slot | |
| | TM | | TM | | TM | |
| ------------ ------------ ------------ |
--------------------------------------------|
There are performance and operational differences that pull in both directions.
When running in non-containerized environments, with the RocksDB state backend, it can make sense to have a single TM per machine, with many slots. This will minimize the per-TM overhead. However, the per-TM overhead is not that significant.
On the other hand, running with one slot per TM provides some helpful isolation, and reduces the impact of garbage collection, which is particularly relevant with a heap-based state backend.
With containerized deployments, it is generally recommended to go with one slot per TM until reaching some significant scale, at which point you will want to scale by adding more slots per TM rather than more TMs. The issue is that the checkpoint coordinator needs to coordinate with each TM (but not with each slot), and as the number of TMs gets into the hundreds or thousands, this can become a bottleneck.
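For reference, the slots-per-TM knob discussed above is the `taskmanager.numberOfTaskSlots` setting. A minimal sketch of setting it programmatically for a local test environment (cluster deployments would set it in flink-conf.yaml instead; the class name and values here are illustrative):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.TaskManagerOptions;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SlotConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "case 1": one TaskManager with three slots
        conf.setInteger(TaskManagerOptions.NUM_TASK_SLOTS, 3);
        // In a standalone or containerized cluster the equivalent entry in
        // flink-conf.yaml is: taskmanager.numberOfTaskSlots: 3
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.createLocalEnvironment(3, conf);
        // ... build the job graph on env and call env.execute() as usual
    }
}
```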

How to perform big dataset transformations in Azure with high performance?

Goal / Problem
For three weeks we have been trying to find the best-performing solution in Azure for loading 10 million records (could be even more!) into a staging area, running different transformations over the staged records, and finally persisting the updated records to a store again.
To achieve this we did a lot of research and tried different approaches to get the results back in a decent amount of time (one minute at most), but we are completely stuck! Every second we can save is a huge benefit for our customer!
Note: We have a huge budget to solve this problem, so the cost-factor can be ignored.
Example of the input schema
+------+--------+----------+
| Id | Year | Amount |
+------+--------+----------+
| 1 | 1900 | 1000 |
| 2 | 1900 | 2000 |
| 3 | 1901 | 4000 |
| 4 | 1902 | 8000 |
| ... | ... | ... |
| 1M | 9999 | 1000 |
+------+--------+----------+
Transformation
The transformation process is split into separate steps. Each step has to temporarily store its results until we persist the data into a physical store. It must be possible to rearrange the steps or skip a step, so that they form a kind of workflow (see the sketch after the list below).
A step could be one of the following:
Double the Amount
Subtract 1k from Amount
Cap the Amount to a maximum of 5k
Cap the Amount to a minimum of zero
Cap the sum of a Year to a maximum of 100k
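To make the step/workflow semantics concrete, here is a minimal single-machine sketch of what we mean (plain Java with toy data; the per-year cap would need an extra grouping pass, and this is not meant as the target Azure architecture):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TransformationSketch {

    record Row(long id, int year, long amount) {
        Row withAmount(long a) { return new Row(id, year, a); }
    }

    public static void main(String[] args) {
        List<Row> staged = List.of(
                new Row(1, 1900, 1000), new Row(2, 1900, 2000),
                new Row(3, 1901, 4000), new Row(4, 1902, 8000));

        // Per-record steps kept as an ordered list, so they can be
        // rearranged or skipped to form a "workflow".
        List<UnaryOperator<Row>> steps = List.of(
                r -> r.withAmount(r.amount() * 2),              // double the amount
                r -> r.withAmount(r.amount() - 1_000),          // subtract 1k
                r -> r.withAmount(Math.min(r.amount(), 5_000)), // cap at 5k
                r -> r.withAmount(Math.max(r.amount(), 0)));    // floor at zero

        List<Row> result = new ArrayList<>();
        for (Row row : staged) {
            Row current = row;
            for (UnaryOperator<Row> step : steps) current = step.apply(current);
            result.add(current);
        }

        // The "cap the sum of a Year to 100k" step needs a grouping pass,
        // which is what makes it the interesting one to distribute.
        Map<Integer, Long> sumPerYear = new LinkedHashMap<>();
        result.forEach(r -> sumPerYear.merge(r.year(), r.amount(), Long::sum));

        System.out.println(result);
        System.out.println(sumPerYear);
    }
}
```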
There are so many possible solutions and services in Azure that it is really hard to know which way to go, so we need help from you.
What data stores we already considered
Azure SQL Database
Azure CosmosDB
What services we already considered
Azure Data Factory
Azure Functions with a self-implemented fan-out/fan-in architecture (via Service Bus queues and Redis Cache)
Durable Functions
Azure Databricks
Question
Is there anyone who has solved a similar problem and could give us some advice or recommendations for an architecture? We would be really thankful.
Edit #1: Description of transformation process added

Tuples take too long to get from the spout to the last bolt (i.e. complete latency is high)

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I have been experimenting with a Storm topology I created, which has 4 bolts and one Kafka spout.
I have been tuning settings such as the bolts' parallelism, max-spout-pending, etc. to see how much scale I can get out of it. After some tuning, the configuration and results look like this:
max-spout-pending: 1200
Kafka Spout Executors: 10
Num Workers: 10
+----------+-------------+----------+----------------------+
| boltName | Parallelism | Capacity | Execute latency (ms) |
+----------+-------------+----------+----------------------+
| __acker | 10 | 0.008 | 0.005 |
| bolt1 | 15 | 0.047 | 0.211 |
| bolt2 | 150 | 0.846 | 33.151 |
| bolt3 | 1500 | 0.765 | 289.679 |
| bolt4 | 48 | 0.768 | 10.451 |
+----------+-------------+----------+----------------------+
Process latency and execute latency are almost the same. Bolt 3 makes an HTTP call that accounts for roughly that much time, and bolts 2 and 4 also do some I/O.
Each bolt can individually process well over 3k tuples/s (bolt3: 1500/289.679 ms ≈ 5.17k qps, bolt4: 48/10.451 ms ≈ 4.59k qps, and so on), yet overall the topology processes only ~3k qps. I am running it on 10 boxes (one worker per box), each with a 12-core CPU and 32 GB RAM. Each worker process gets -Xms 8 GB and -Xmx 10 GB, so RAM should not be a constraint either. GC also looks healthy: about 4 collections per minute, totalling around 350 ms per minute (from a one-minute flight recording of a worker process).
The complete latency per tuple is around 4 s, which I cannot explain: if I add up the time spent in all bolts it comes to around 334 ms. But, as mentioned here, tuples can be waiting in buffers, and the suggestion is to increase the degree of parallelism, which I have done to reach the state above.
I added some more metering and tuples take on average around 1.3 s to get from bolt 2 to bolt 3 and 5 s from bolt 3 to bolt 4. I understand Storm might be keeping them in its outbound or inbound buffers, but my question is how to reduce this: per my earlier calculation (see the sketch below) these bolts should be able to process more tuples per second, so what is holding them back from being processed at a faster rate?
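The back-of-the-envelope math behind those per-bolt numbers, with the parallelism and execute-latency values copied from the table above:

```java
public class BoltThroughput {
    public static void main(String[] args) {
        // parallelism and execute latency (ms) from the Storm UI table above
        Object[][] bolts = {
                {"bolt1", 15, 0.211}, {"bolt2", 150, 33.151},
                {"bolt3", 1500, 289.679}, {"bolt4", 48, 10.451}};
        for (Object[] b : bolts) {
            int parallelism = (int) b[1];
            double latencyMs = (double) b[2];
            // each executor handles 1000/latency tuples per second at 100% capacity
            double maxQps = parallelism * (1000.0 / latencyMs);
            System.out.printf("%s: ~%.0f tuples/s theoretical max%n", b[0], maxQps);
        }
    }
}
```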
I think your issue may be that the ack tuples, which are used to start and stop the complete-latency clock, are stuck waiting at the ackers.
You have a lot of bolts and presumably high throughput, which will result in a lot of ack messages. Try increasing the number of ackers using the topology.acker.executors config value, which will hopefully reduce the queuing delay for the ack tuples.
If you are also using a custom metrics consumer, you may want to increase the parallelism of that component too, given the number of bolts you have.
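A minimal sketch of bumping the acker count at submission time (the value 50 is just a starting point to experiment with, not something derived from your numbers):

```java
import org.apache.storm.Config;

public class SubmitWithMoreAckers {
    public static void main(String[] args) {
        Config conf = new Config();
        conf.setNumWorkers(10);          // as in the current setup
        conf.setMaxSpoutPending(1200);   // as in the current setup
        // Equivalent to setting topology.acker.executors; by default it
        // matches the number of workers (10 here).
        conf.setNumAckers(50);
        // StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());
    }
}
```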

Based on Throughput = (number of requests) / (total time), I get two different throughput numbers

Please look at this output from a JMeter run:
TestA 20 0 0.00% 45423.30 26988 62228 60189.40 62130.85 62228.00 0.24 1.21 3.07
TestB 20 0 0.00% 245530.50 225405 260410 259775.40 260401.20 260410.00 0.06 0.29 0.51
It is all from the same test run (same period), yet one throughput is 0.24 and the other is 0.06. Is something wrong with JMeter?
Thanks for the input,
John
My expectation is that you're using numbers from the JMeter Reporting Dashboard, so we're looking at:
| Label | # Samples | KO | % Errors | Average | Min | Max | 90% | 95% | 99% | Throughput | Received | Sent |
|-------|-----------|----|----------|---------|-----|-----|-----|-----|-----|------------|----------|------|
| TestA | 20 | 0 | 0.00% | 45423.30 | 26988 | 62228 | 60189.40 | 62130.85 | 62228.00 | 0.24 | 1.21 | 3.07 |
| TestB | 20 | 0 | 0.00% | 245530.50 | 225405 | 260410 | 259775.40 | 260401.20 | 260410.00 | 0.06 | 0.29 | 0.51 |
According to JMeter Glossary
Throughput is calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time).
Looking at the Average column, you have roughly 45 seconds of average response time for Test A and roughly 245 seconds for Test B. Test B therefore took far longer to complete than Test A, which is why you see a correspondingly lower throughput.
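To make the arithmetic concrete, a rough sanity check (the test durations here are inferred back from the reported throughputs, so they are approximations rather than values read from the dashboard):

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        int samples = 20;                              // both tests ran 20 samples
        double throughputA = 0.24, throughputB = 0.06; // from the dashboard

        // Rearranging Throughput = (number of requests) / (total time):
        double durationA = samples / throughputA;  // ~83 s of wall-clock time
        double durationB = samples / throughputB;  // ~333 s of wall-clock time

        System.out.printf("TestA: ~%.0f s total -> %.2f req/s%n", durationA, samples / durationA);
        System.out.printf("TestB: ~%.0f s total -> %.2f req/s%n", durationB, samples / durationB);
    }
}
```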
So I would recommend looking into server-side logs, version control commits, APM tools, profiling tools, JMeter PerfMon Plugin results, etc. to identify why the response time for Test B is so much worse than for Test A.

Oracle 11g resource manager clarification

I'm hoping that someone can just confirm my understanding of how the resource manager works...
Say I've got a 4-node RAC with 2 consumer groups and 2 services, where the services restrict each consumer group to specific nodes, i.e. consumer group 1 ALWAYS gets sent to nodes 1 and 2 and consumer group 2 ALWAYS gets sent to nodes 3 and 4.
If I've got a tiered resource plan such as:
| Group Name | L0  | L1  | Max |
|------------|-----|-----|-----|
| Group 1    | 75% | 0   | 80% |
| Group 2    | 0   | 75% | 80% |
Am I right in saying that, as group 1 is on nodes 1 and 2 and group 2 is on nodes 3 and 4, they will each have 75% of the resources available on their respective nodes, and both be limited to 80% on the nodes they are running on?
I.e. resources are constrained and calculated on a per-node basis, not per cluster.
So even if a group 1 connection on node 1 is using 80% of resources, another group 1 connection on node 2 will still have up to 80% available to it, not 0%.
And similarly, if group 1 is using its allocated maximum, group 2 will still get its full share on nodes 3 and 4, since group 1 (which has higher priority) isn't running on those nodes.
I've had a response from Oracle Support:
Resource management's limits are applied per node (except PARALLEL_TARGET_PERCENTAGE), so for your example you are right. Connections in consumer group 2 only ever hit node 2 (due to the services); group 2 will get a minimum of 75% of resources on the 2nd node, and potentially 100% if no max limit has been set, or 80% if the max limit has been set.
