How to perform high-performance transformations of large datasets in Azure?

Goal / Problem
For three weeks we have been trying to find the best possible high-performing solution in Azure for loading 10 million records (possibly even more!) into a staging area, applying different transformations to the staged records, and finally persisting the updated records back into a store.
To achieve this we did a lot of research and tried different approaches to get the results back in a reasonable amount of time (one minute at most), but we are completely stuck! Every second we can save is a huge benefit for our customer!
Note: We have a huge budget to solve this problem, so the cost factor can be ignored.
Example of the input schema
+------+--------+----------+
| Id | Year | Amount |
+------+--------+----------+
| 1 | 1900 | 1000 |
| 2 | 1900 | 2000 |
| 3 | 1901 | 4000 |
| 4 | 1902 | 8000 |
| ... | ... | ... |
| 1M | 9999 | 1000 |
+------+--------+----------+
Transformation
The transformation process is split into separate steps. Each step has to temporarily store its results until we persist the data into a physical store. It has to be possible to rearrange the steps in a different order, or to skip a step, to create some kind of workflow.
A step could be one of the following (a sketch of these steps as Spark transformations follows the list):
Double the Amount
Subtract 1k from Amount
Cap the Amount to a maximum of 5k
Cap the Amount to a minimum of zero
Cap the sum of a Year to a maximum of 100k
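For illustration only, here is a minimal sketch of how such steps could be expressed as composable DataFrame transformations in Scala on Spark (e.g. Azure Databricks, one of the services listed below). The staging and target table names are made up, the column names follow the example schema above, and the "cap the sum of a Year" step shows just one possible interpretation (proportional scaling).

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("staged-transformations").getOrCreate()

// Each step is a DataFrame => DataFrame function, so steps can be reordered or skipped.
val doubleAmount: DataFrame => DataFrame = _.withColumn("Amount", col("Amount") * 2)
val subtract1k:   DataFrame => DataFrame = _.withColumn("Amount", col("Amount") - 1000)
val capMax5k:     DataFrame => DataFrame = _.withColumn("Amount", least(col("Amount"), lit(5000)))
val capMinZero:   DataFrame => DataFrame = _.withColumn("Amount", greatest(col("Amount"), lit(0)))

// Cap the sum per Year at 100k by scaling each Amount proportionally (one possible reading).
val capYearSum100k: DataFrame => DataFrame = { df =>
  val yearSum = sum("Amount").over(Window.partitionBy("Year"))
  df.withColumn("Amount",
    when(yearSum > 100000, col("Amount") * lit(100000) / yearSum).otherwise(col("Amount")))
}

// Pick an order (or skip steps) to form the workflow, then persist the result.
val pipeline: Seq[DataFrame => DataFrame] =
  Seq(doubleAmount, subtract1k, capMax5k, capMinZero, capYearSum100k)

val staged = spark.read.table("staging_records")                    // hypothetical staging table
val result = pipeline.foldLeft(staged)((df, step) => step(df))
result.write.mode("overwrite").saveAsTable("transformed_records")   // hypothetical target table

Because each step is just a function from DataFrame to DataFrame, rearranging or skipping steps only means changing the Seq.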
There are so many possible solutions and options in Azure that it is really hard to know which way is best, so we need your help.
Data stores we have already considered
Azure SQL Database
Azure CosmosDB
Services we have already considered
Azure Data Factory
Azure Functions with a self-implemented fan-out/fan-in architecture (via Service Bus queues and Redis Cache)
Durable Functions
Azure Databricks
Question
Has anyone had to solve a similar problem and could give us some advice or recommendations for an architecture? We would be really thankful.
Edit #1: Description of transformation process added

Related

Is one TaskManager with three slots the same as three TaskManagers with one slot in Apache Flink

In Flink, as I understand it, the JobManager can assign a job to multiple TaskManagers with multiple slots if necessary. For example, one job can be assigned to three TaskManagers, using five slots.
Now, say I run one TaskManager (TM) with three slots, which is assigned 3 GB of RAM and one CPU.
Is this exactly the same as running three TaskManagers that share one CPU, each assigned 1 GB of RAM?
case 1
---------------
| 3G RAM      |
| one CPU     |
| three slots |
| TM          |
---------------
case 2
----------------------------------------------
|                  one CPU                   |
|  ------------  ------------  ------------  |
|  | 1G RAM   |  | 1G RAM   |  | 1G RAM   |  |
|  | one slot |  | one slot |  | one slot |  |
|  | TM       |  | TM       |  | TM       |  |
|  ------------  ------------  ------------  |
----------------------------------------------
There are performance and operational differences that pull in both directions.
When running in non-containerized environments, with the RocksDB state backend, it can make sense to have a single TM per machine, with many slots. This will minimize the per-TM overhead. However, the per-TM overhead is not that significant.
On the other hand, running with one slot per TM provides some helpful isolation, and reduces the impact of garbage collection, which is particularly relevant with a heap-based state backend.
With containerized deployments, it is generally recommended to go with one slot per TM until reaching some significant scale, at which point you will want to scale by adding more slots per TM rather than more TMs. The issue is that the checkpoint coordinator needs to coordinate with each TM (but not with each slot), and as the number of TMs gets into the hundreds or thousands, this can become a bottleneck.
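For concreteness, the knob behind this trade-off is the number of slots each TaskManager advertises, plus its memory. On a real cluster these are per-TM settings in flink-conf.yaml; the sketch below just writes them into Flink Configuration objects in Scala to spell out the two cases (the memory key name varies across Flink versions, so treat it as indicative):

import org.apache.flink.configuration.Configuration

// Case 1: one TaskManager exposing three slots and 3 GB of memory.
val caseOne = new Configuration()
caseOne.setInteger("taskmanager.numberOfTaskSlots", 3)
caseOne.setString("taskmanager.memory.process.size", "3g")   // key name depends on the Flink version

// Case 2: three TaskManagers, each exposing one slot and 1 GB of memory.
val caseTwo = new Configuration()
caseTwo.setInteger("taskmanager.numberOfTaskSlots", 1)
caseTwo.setString("taskmanager.memory.process.size", "1g")

In case 2 each slot lives in its own JVM (separate heap, GC and failure domain), while in case 1 the three slots share one JVM, which is exactly the isolation-versus-overhead trade-off described above.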

Time taken by tuples to get from the spout to the last bolt (aka Complete Latency) is high

Version Info:
"org.apache.storm" % "storm-core" % "1.2.1"
"org.apache.storm" % "storm-kafka-client" % "1.2.1"
I am experimenting with a Storm topology I created, which has 4 bolts and one Kafka spout.
I have been trying to tune configs like the parallelism of these bolts, max-spout-pending, etc. to see how much scale I can get out of it. After some tuning, the configuration/results look something like this:
max-spout-pending: 1200
Kafka Spout Executors: 10
Num Workers: 10
+----------+-------------+----------+----------------------+
| boltName | Parallelism | Capacity | Execute latency (ms) |
+----------+-------------+----------+----------------------+
| __acker | 10 | 0.008 | 0.005 |
| bolt1 | 15 | 0.047 | 0.211 |
| bolt2 | 150 | 0.846 | 33.151 |
| bolt3 | 1500 | 0.765 | 289.679 |
| bolt4 | 48 | 0.768 | 10.451 |
+----------+-------------+----------+----------------------+
Process latency and execute latency are almost the same. There is an HTTP call involved in bolt 3 which takes approximately that much time, and bolt 2 and bolt 4 also do some I/O.
While I can see that each bolt can individually process more than 3k tuples per second (bolt3: 1500/289.679 ms = 5.17k qps, bolt4: 48/10.451 ms = 4.59k qps, and so on), overall this topology is processing tuples at only ~3k qps. I am running it on 10 boxes (so one worker per box), each with a 12-core CPU and 32 GB of RAM. I have given each worker process -Xms 8 GB and -Xmx 10 GB, so RAM should not be a constraint either. GC also looks healthy: about 4 collections per minute, taking roughly 350 ms in total per minute (from a one-minute flight recording of a worker process).
I see a Complete Latency of around 4 seconds per tuple, which I am not able to understand: if I add up the time taken by all the bolts, it comes to around 334 ms. But, as mentioned here, tuples can be waiting in buffers, and the suggestion is to increase the degree of parallelism, which I have done to reach the state above.
I added some more metering and I see tuples taking on average around 1.3 s to get from bolt 2 to bolt 3 and 5 s from bolt 3 to bolt 4. While I understand Storm might be keeping them in its outbound or inbound buffers, my question is how to reduce this: these bolts should be able to process more tuples per second as per my earlier calculation, so what is holding them back from entering and being processed at a faster rate?
I think your issue may be due to ack tuples, which are used to start and stop the complete latency clock, getting stuck waiting at the ackers.
You have a lot of bolts and presumably high throughput, which will result in a lot of ack messages. Try increasing the number of ackers using the topology.acker.executors config value, which will hopefully reduce the queuing delay for the ack tuples.
If you are also using a custom metrics consumer, you may want to increase the parallelism of that component as well, given the number of bolts you have.
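As a rough sketch of where those knobs live (Scala; the topology wiring is omitted and the acker count is just an arbitrary starting point to tune):

import org.apache.storm.{Config, StormSubmitter}
import org.apache.storm.topology.TopologyBuilder

val builder = new TopologyBuilder()
// builder.setSpout(...) / builder.setBolt(...) exactly as in the existing topology

val conf = new Config()
conf.setNumWorkers(10)        // one worker per box, as described in the question
conf.setMaxSpoutPending(1200)
conf.setNumAckers(50)         // topology.acker.executors; 50 is a hypothetical starting value
// equivalent low-level form: conf.put(Config.TOPOLOGY_ACKER_EXECUTORS, Int.box(50))

StormSubmitter.submitTopology("tuned-topology", conf, builder.createTopology())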

Elasticsearch can't handle multiple requests without a dramatic decrease in performance

I have a two-node cluster hosted in Elastic Cloud.
Host: Elastic Cloud
Platform: Google Cloud
Region: US Central 1 (Iowa)
Memory: 8 GB
Storage: 192 GB
SSD: Yes
HA: Yes
Each node has:
Allocated processors: 2
Number of processors: 2
Number of indices: 4*
Shards (per index): 5*
Number of replicas: 1
Number of documents: 150M
Allocated disk: 150GB
* the main indices; Kibana and Watcher create a bunch of small indices.
My documents are mostly text. There are some other fields (no more than 5 per index), no nested objects. Indices specs:
| Index | Avg Doc Length | # Docs | Disk |
|---------|----------------|--------|------|
| index-1 | 300 | 80M | 70GB |
| index-2 | 500 | 5M | 5GB |
| index-3 | 3000 | 2M | 10GB |
| index-4 | 2500 | 18M | 54GB |
When the system is idle, response time (load time) is typically a few seconds. But when I simulate the behavior of 10 users, I start to get timeouts in my application. The timeout was originally 10 s; I raised it to 60 s and I am still having issues. Below is a chart from a simulation of 10 concurrent users using the Search API.
The red line is the total request time in seconds and the dotted pink line is my 60-second timeout. So I'd say that most of the time my users will experience a timeout. The query I've used is quite simple:
{
  "size": 500,
  "from": ${FROM},
  "query": {
    "query_string": {
      "query": "good OR bad"
    }
  }
}
I've tried every tweak I know of. I don't know whether this is just the real ES performance and I have to accept it and upgrade my plan.
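For reference, this is roughly how such a 10-concurrent-user simulation against the Search API could look in Scala (JDK 11 HttpClient); the endpoint, index name and credentials are placeholders, not values from the question:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Placeholder cluster URL and index; add an Authorization header as appropriate.
val endpoint = "https://my-cluster.us-central1.gcp.cloud.es.io:9243/index-1/_search"
val body =
  """{ "size": 500, "from": 0,
    |  "query": { "query_string": { "query": "good OR bad" } } }""".stripMargin

val client  = HttpClient.newBuilder().connectTimeout(Duration.ofSeconds(60)).build()
val request = HttpRequest.newBuilder(URI.create(endpoint))
  .header("Content-Type", "application/json")
  .POST(HttpRequest.BodyPublishers.ofString(body))
  .build()

// 10 "users" firing the same search concurrently, measuring wall-clock time per request.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(10))

val timings: Seq[Future[Long]] = (1 to 10).map { _ =>
  Future {
    val start = System.nanoTime()
    client.send(request, HttpResponse.BodyHandlers.ofString())
    (System.nanoTime() - start) / 1000000L   // elapsed milliseconds
  }
}
Await.result(Future.sequence(timings), 5.minutes).foreach(ms => println(s"$ms ms"))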

Writing a large Parquet file (500 million rows / 1000 columns) to S3 takes too much time

First let me introduce my use case: I receive 500 million rows daily, like so:
ID | Categories
1 | cat1, cat2, cat3, ..., catn
2 | cat1, catx, caty, ..., anothercategory
Input data: 50 compressed CSV files, each 250 MB -> Total: 12.5 GB compressed.
The purpose is to answer questions like: find all ids that belong to catx and caty, find ids that belong to cat3 and not caty, etc., i.e. ids in cat1 ∪ cat2, or ids in cat3 ∩ catx.
Assuming that categories are created dynamically (every day I have a new set of categories) and my business wants to explore all possible intersections and unions (we don't have a fixed set of queries), I came up with the following solution:
I wrote a Spark job that transforms the data into a fat sparse matrix whose columns are all possible categories plus an ID column; for each row and column I set true where the id belongs to that category and false where it does not:
ID | cat1 | cat2 | cat3 |...| catn | catx | caty | anothercategory |....
1 | true | true | true |...| true | false |false |false |....
2 | true |false |false |...|false | true | true | true |....
SQL can easily answer my questions. For instance, if I want to find all ids that belong to category cat1 and category catx, I run the following SQL query against this matrix:
Select id from MyTable where cat1 = true and catx=true;
I chose to save this sparse matrix as a compressed Parquet file. I made this choice with regard to the sparsity and the nature of the queries; I believe columnar storage is the most appropriate storage format.
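For context, here is a rough Scala/Spark sketch of the kind of pipeline described above, assuming Spark 2.x and that the input is the ID | Categories CSV; paths, the partition count and the column names are made up:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("category-matrix").getOrCreate()

// Input: ID | Categories ("cat1, cat2, ...") — path and header layout are hypothetical.
val raw = spark.read.option("header", "true").csv("s3a://bucket/input/*.csv.gz")
val exploded = raw
  .withColumn("category", explode(split(col("Categories"), ",\\s*")))
  .select(col("ID"), col("category"))

// Pivot into the fat sparse matrix: one column per category.
val counts = exploded
  .groupBy("ID")
  .pivot("category")
  .agg(count(lit(1)))
  .na.fill(0)

// Turn the counts into the true/false membership matrix.
val matrix = counts.columns.filter(_ != "ID").foldLeft(counts) { (df, c) =>
  df.withColumn(c, col(c) > 0)
}

// Control the number (and therefore the size) of the output files explicitly before writing.
matrix.repartition(200)   // hypothetical target partition count
  .write.mode("overwrite")
  .parquet("s3a://bucket/category-matrix/")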
Now, with my use case described, here are my observations (I may be missing some optimization points):
The 12.5 GB of compressed input data take ~300 GB after transformation. Writing this sparse matrix as Parquet takes too much time and resources: it took 2.3 hours on a Spark 1.6 standalone cluster of 6 AWS r4.4xlarge instances (I set enough parallelism to distribute the work and take advantage of all the workers I have).
I ended up with too many Parquet files: the more I parallelize, the smaller the Parquet files are. It seems that each RDD partition produces a single Parquet file -> too many small files is not optimal to scan, as my queries go through all the column values.
I went through a lot of posts but still don't understand why writing a 500-million-row / 1000-column compressed Parquet file to S3 takes this much time; once on S3, the small files add up to ~35 GB.
Looking at the application master UI, the job hangs on the write stage; the transformation stage and the shuffle don't seem to be resource- or time-consuming.
I tried to tweak Parquet parameters like group_size, page_size and disable_dictionnary but did not see any performance improvement.
I tried to repartition into bigger RDDs and write them to S3 in order to get bigger Parquet files, but the job took too much time and I finally killed it.
I could run the job in ~1 hour using a Spark 2.1 standalone cluster of 4 AWS r4.16xlarge instances. I feel like I am using a huge cluster to achieve a small improvement; the only benefit I got is running more parallel tasks. Am I missing something? Maybe I can leverage the ~1 TB of RAM to do this better and get bigger Parquet files.
Do you have any feedback about writing large Parquet files to S3 using Spark?
I would also like to hear your opinions/critiques of this solution.
Thanks and Regards.
It's a combination of Spark re-reading stuff to do summaries (something you can disable) and the algorithm for committing work doing a rename(), which is mimicked in S3 by a copy.
See "Apache Spark and object stores" for more details and some switches which can slightly speed up your work (disable summaries, use the less-renaming committer algorithm).
Even with those you will get delays and, because S3 is eventually consistent, you are at risk of producing corrupt output. The safest approach is to write to a transient HDFS filesystem and then copy to S3 at the end of all your work.
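The switches the answer alludes to look roughly like this (exact names and availability depend on the Spark/Hadoop build, so verify them against the linked material):

import org.apache.spark.sql.SparkSession

// Illustrative settings only.
val spark = SparkSession.builder()
  .appName("parquet-to-s3")
  // Skip the Parquet summary metadata files (_metadata / _common_metadata).
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  // Use the "version 2" file output committer, which does less renaming.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  // Avoid an extra schema-merging pass when the result is read back.
  .config("spark.sql.parquet.mergeSchema", "false")
  .getOrCreate()

// then write as before, e.g. df.write.mode("overwrite").parquet("s3a://bucket/output/")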

Correcting improper usage of Cassandra

I have a similar question that was unanswered (but had many comments):
How to make Cassandra fast
My setup:
Ubuntu Server
AWS service - Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz, 4 GB RAM.
2 nodes of Cassandra DataStax Community Edition (2.1.3).
PHP 5.5.9, with the DataStax php-driver.
I come from a MySQL background, with very basic hands-on NoSQL experience with Elasticsearch (the company is now called Elastic) and MongoDB for document storage.
From reading about how to use Cassandra, here are the bullet points I took away:
It is distributed
You can have replicated rings to distribute data
You need to establish partition keys for maximum efficiency
Rethink your queries rather than relying on indices
Model according to queries and not data
Deletes are bad
You can only sort starting from the second key of your primary key set
Cassandra has "fast" write
I have a PHP Silex framework API that receives batch JSON data, which is inserted into at least 4 and at most 6 tables (mainly because of the different sort orders I need).
At first I only had two Cassandra nodes. I ran Apache Bench to test. Then I added a third node, and it barely shaved off a fraction of a second at higher batch sizes/concurrency.
+-------------+------------+--------------------------+--------------------------+
| Concurrency | Batch size | avg. time (ms) - 2 Nodes | avg. time (ms) - 3 Nodes |
+-------------+------------+--------------------------+--------------------------+
| 1           | 5          | 288                      | 180                      |
| 1           | 50         | 421                      | 302                      |
| 1           | 400        | 1,298                    | 1,504                    |
| 25          | 5          | 1,993                    | 2,111                    |
| 25          | 50         | 3,636                    | 3,466                    |
| 25          | 400        | 32,208                   | 21,032                   |
| 100         | 5          | 5,115                    | 5,167                    |
| 100         | 50         | 11,776                   | 10,675                   |
| 100         | 400        | 61,892                   | 60,454                   |
+-------------+------------+--------------------------+--------------------------+
The batch size is the number of entries (to the 4-6 tables) made per call.
So a batch of 5 means it is making 5 x (4-6) tables' worth of inserts. At higher batch size/concurrency the application times out.
There are 5 columns per table with relatively small data (mostly ints, with text no longer than about 10 characters).
My keyspace is the following:
user_data | True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
My "main" question is: what did I do wrong? This seems to be a relatively small data set, considering that Cassandra's design is based on Bigtable and built for very high write speed.
Do I add more nodes beyond 3 in order to speed up?
Do I change my replication factor and tune the read/write consistency levels (e.g. quorum), hunting for a sweet spot using the DataStax doc: http://docs.datastax.com/en/cassandra/2.0/cassandra/dml/dml_config_consistency_c.html
Do I switch frameworks, e.g. go with Node.js for higher concurrency?
Do I rework my tables? I have no good example of how to use column families effectively; I need some hints for this one.
For the table question:
I'm tracking the history of a user. A user has an event that is associated with a media id, and there is some extra metadata too.
So the columns are: event_type, user_id, time, media_id, extra_data.
I need to sort them in different ways, so I made different tables for them (this is how I understood Cassandra data modeling should work... I may be wrong). I am therefore duplicating the data across various tables.
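To make the "one table per sort order" idea concrete, here is a hedged sketch of what such query-specific tables could look like, issued through the DataStax Java driver from Scala (the question uses the PHP driver; table names and column types here are only illustrative):

import com.datastax.driver.core.Cluster

// Contact point is a placeholder; "user_data" is the keyspace shown above.
val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("user_data")

// The same event data is written into query-specific tables, each clustered for one sort order.
session.execute(
  """CREATE TABLE IF NOT EXISTS events_by_user_time (
    |  user_id int, time timestamp, event_type text, media_id int, extra_data text,
    |  PRIMARY KEY ((user_id), time)
    |) WITH CLUSTERING ORDER BY (time DESC)""".stripMargin)

session.execute(
  """CREATE TABLE IF NOT EXISTS events_by_user_media (
    |  user_id int, media_id int, time timestamp, event_type text, extra_data text,
    |  PRIMARY KEY ((user_id), media_id, time)
    |)""".stripMargin)

// Each read pattern then hits exactly one partition in the matching table, e.g.:
// SELECT * FROM events_by_user_time WHERE user_id = ? LIMIT 100;

cluster.close()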
Help?
EDIT PART HERE
The application also has Redis and MySQL attached for other CRUD points of interest, such as retrieving user data and caching it for faster pulls.
So far, on average, with MySQL and then Redis activated, I get 72 ms once Redis kicks in and 180 ms from MySQL pre-Redis.
The first problem is that you're trying to benchmark the whole system without knowing what any individual component can do. Are you trying to see how fast an individual operation is, or how many operations per second you can do? They're different values.
I typically recommend you start by benchmarking Cassandra. Modern Cassandra can typically do 20-120k operations per second per server. With RF=3, that means somewhere between 5k and 40k reads / second or writes/second. Use cassandra-stress to make sure cassandra is doing what you expect, THEN try to loop in your application and see if it matches. If you slow way down, then you know the application is your bottleneck, and you can start thinking about various improvements (different driver, different language, async requests instead of sync, etc).
Right now, you're doing too much and analyzing too little. Break the problem into smaller pieces. Solve the individual pieces, then put the puzzle together.
Edit: Cassandra 2.1.3 is getting pretty old. It has some serious bugs. Use 2.1.11 or 2.2.3. If you're just starting development, 2.2.3 may be OK (and let's assume you'll actually go to production with 2.2.5 or so). If you're ready to go prod tomorrow, use 2.1.x instead.
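To illustrate the "async requests instead of sync" point, here is a hedged sketch using the older DataStax Java driver (2.x/3.x) from Scala; the table and values are placeholders:

import com.datastax.driver.core.{Cluster, ResultSetFuture}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()   // placeholder contact point
val session = cluster.connect("user_data")

val insert = session.prepare(
  "INSERT INTO events_by_user_time (user_id, time, event_type, media_id, extra_data) VALUES (?, ?, ?, ?, ?)")

// Fire a batch of writes asynchronously instead of one blocking execute() per row,
// then wait for all of them; this keeps the driver's connections busy.
val futures: Seq[ResultSetFuture] = (1 to 1000).map { i =>
  session.executeAsync(insert.bind(Int.box(i % 10), new java.util.Date(), "view", Int.box(i), "meta"))
}
futures.foreach(_.getUninterruptibly())

cluster.close()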
