Meaning of input records in Spark - spark-streaming

I have a question about Spark Streaming and what I see in the Spark UI:
Each microbatch receives 160,000 records; I can see this in the Spark UI and from the offsets I'm reading (160K).
In the first stage, which reads from Kafka, I see:
Total Time Across All Tasks: 39 min
Locality Level Summary: Process local: 54
**Input Size / Records: 755.2 MB / 48114**
Output: 124.8 KB / 5179
Why isn't the input 160K records? What exactly does Input Size / Records mean?

Related

Block size effect in Hadoop

I am working on Hadoop Apache 2.7.1 and I am adding files whose size does not exceed 100 KB. Whether I configure the block size to be 1 MB or leave it at the default value of 128 MB will not affect my files, because each one is saved in a single block, and a single block is retrieved when we download the file.

But what is the difference in block storage? I mean, does storing files with a 1 MB block size differ from storing them with a 128 MB block size when the files are smaller than 1 MB? In other words, when a 1 MB file is stored in a block of size 128 MB, does it reserve the whole block so that the block cannot be used for other files, or is the empty space used for other files, with a pointer referring to the file's start location within the block?

I found no difference in upload and download time. Are there any other points I have to consider?
I am going to cite the (now discontinued) SO documentation for this, written by me, because why not.
Say, for example, you have a file of size 1024 MB. If your block size is 128 MB, you will get 8 blocks of 128 MB each. This means that your NameNode will need to store metadata for 8 x 3 = 24 blocks (3 being the replication factor).
Consider the same scenario with a block size of 4 KB. It results in 1 GB / 4 KB ≈ 250,000 blocks, which requires the NameNode to keep metadata for roughly 750,000 block replicas for just a 1 GB file. Since all of this metadata is kept in memory, a larger block size is preferred to save the NameNode that bit of extra load.
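A quick sketch of that arithmetic (illustrative only; real NameNode accounting tracks objects per file and per block rather than a simple count):

    public class BlockCountEstimate {
        public static void main(String[] args) {
            long fileBytes = 1024L * 1024 * 1024;                         // 1 GB file
            int replication = 3;

            long bigBlock = 128L * 1024 * 1024;                           // 128 MB blocks
            long bigBlocks = (fileBytes + bigBlock - 1) / bigBlock;       // 8 blocks
            System.out.println("128 MB blocks: " + bigBlocks + " blocks, "
                    + bigBlocks * replication + " block replicas to track");

            long smallBlock = 4L * 1024;                                  // 4 KB blocks
            long smallBlocks = (fileBytes + smallBlock - 1) / smallBlock; // 262,144 (~250K) blocks
            System.out.println("4 KB blocks: " + smallBlocks + " blocks, "
                    + smallBlocks * replication + " block replicas to track");
        }
    }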

Kafka + Spark scalability

We have a very simple Spark Streaming job (implemented in Java), which:
reads JSONs from Kafka via DirectStream (acks on Kafka messages are turned off)
parses the JSONs into POJOs (using GSON; our messages are only ~300 bytes)
maps each POJO to a key-value tuple (value = the object)
calls reduceByKey (a custom reduce function that always compares one field, quality, of the objects and keeps the instance with the higher quality)
stores the result in state (via mapWithState, keeping the object with the highest quality per key)
stores the result to HDFS
The JSONs are generated with a set of 1000 IDs (keys), and all events are randomly distributed across the Kafka topic partitions. This also means that the resulting set of objects is at most 1000, as the job stores only the object with the highest quality for each ID. A rough sketch of the pipeline is shown below.
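For reference, a minimal sketch of such a pipeline, assuming the spark-streaming-kafka-0-10 integration; the Event POJO, topic name, paths and quality field are placeholders, not the actual job:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    import com.google.gson.Gson;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.Optional;
    import org.apache.spark.api.java.function.Function3;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.State;
    import org.apache.spark.streaming.StateSpec;
    import org.apache.spark.streaming.api.java.*;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;
    import scala.Tuple2;

    public class HighestQualityJob {

        // Hypothetical POJO; the real message schema is not shown in the question.
        public static class Event implements java.io.Serializable {
            public String id;
            public double quality;
        }

        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf().setAppName("kafka-highest-quality");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint("hdfs:///tmp/highest-quality-checkpoint"); // required by mapWithState

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "broker:9092");        // placeholder
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "highest-quality-job");

            JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
                    jssc,
                    LocationStrategies.PreferConsistent(),
                    ConsumerStrategies.<String, String>Subscribe(Arrays.asList("events"), kafkaParams));

            // JSON -> POJO, key by id, keep the higher-quality record per key within the batch.
            // (Creating a Gson per record is just for brevity here.)
            JavaPairDStream<String, Event> bestPerBatch = stream
                    .map(record -> new Gson().fromJson(record.value(), Event.class))
                    .mapToPair(e -> new Tuple2<>(e.id, e))
                    .reduceByKey((a, b) -> a.quality >= b.quality ? a : b);

            // Keep the highest-quality event per key across batches.
            Function3<String, Optional<Event>, State<Event>, Tuple2<String, Event>> keepBest =
                    (id, incoming, state) -> {
                        Event current = state.exists() ? state.get() : null;
                        Event candidate = incoming.isPresent() ? incoming.get() : null;
                        Event winner = (current == null
                                || (candidate != null && candidate.quality > current.quality))
                                ? candidate : current;
                        if (winner != null) {
                            state.update(winner);
                        }
                        return new Tuple2<>(id, winner);
                    };
            JavaMapWithStateDStream<String, Event, Event, Tuple2<String, Event>> best =
                    bestPerBatch.mapWithState(StateSpec.function(keepBest));

            // Write each batch to HDFS (simplified sink; the real job may write differently).
            best.foreachRDD((rdd, time) ->
                    rdd.saveAsTextFile("hdfs:///output/quality/" + time.milliseconds()));

            jssc.start();
            jssc.awaitTermination();
        }
    }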
We were running the performance tests on AWS EMR (m4.xlarge = 4 cores, 16 GB memory) with the following parameters:
number of executors = number of nodes (i.e. 1 executor per node)
number of Kafka partitions = number of nodes (i.e. in our case also executors)
batch size = 10 (s)
sliding window = 20 (s)
window size = 600 (s)
block size = 2000 (ms)
default parallelism - we tried different settings, but got the best results when the default parallelism = number of nodes/executors (see the configuration sketch below)
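Expressed as Spark configuration, the setup above would look roughly like this (a sketch, assuming 10 executor nodes; the property names are standard Spark ones, but the exact values were varied during the tests):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class TestSetup {
        public static void main(String[] args) {
            int executors = 10;  // = number of worker nodes in this test run

            SparkConf conf = new SparkConf()
                    .setAppName("kafka-scalability-test")
                    // best results with default parallelism = number of nodes/executors
                    .set("spark.default.parallelism", String.valueOf(executors))
                    // "block size = 2000 (ms)" maps to the streaming block interval
                    .set("spark.streaming.blockInterval", "2000ms");

            // batch size = 10 s
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // a 600 s window sliding every 20 s would be applied on the DStream, e.g.:
            // stream.window(Durations.seconds(600), Durations.seconds(20));
        }
    }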
The Kafka cluster contains just 1 broker, which is utilized to a maximum of ~30-40% during peak load (we pre-fill the data into the topic and then execute the test independently). We tried increasing num.io.threads and num.network.threads, but without significant improvement.
The results of the performance tests (about 10 minutes of continuous load) were as follows (the YARN master and driver nodes are on top of the node counts below):
2 nodes - able to process max. 150 000 events/s without any processing delay
5 nodes - 280 000 events/s => 25 % penalty compared to the expected "almost linear scalability"
10 nodes - 380 000 events/s => 50 % penalty compared to the expected "almost linear scalability"
The CPU utilization in case of 2 nodes was ~
We also played around with other settings, including:
- testing low/high numbers of partitions
- testing low/high/default values of defaultParallelism
- testing with a higher number of executors (i.e. dividing the resources into e.g. 30 executors instead of 10)
but the settings above gave us the best results.
So - the question: is Kafka + Spark (almost) linearly scalable? If it should scale much better than our tests showed, how can it be improved? Our goal is to support hundreds/thousands of Spark executors (i.e. scalability is crucial for us).
We have resolved this by:
increasing the capacity of the Kafka cluster
more CPU power - we increased the number of Kafka nodes (1 Kafka node per 2 Spark executor nodes seemed to be fine)
more brokers - basically 1 broker per executor gave us the best results
setting a proper default parallelism (number of cores in the cluster * 2)
ensuring all the nodes have approximately the same amount of work
making batch size / blockSize ~equal to or a multiple of the number of executors (see the arithmetic sketch below)
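A tiny sketch of the last two numeric heuristics (illustrative values only; the block interval shown here is hypothetical, not the one from the original test):

    public class TuningHeuristics {
        public static void main(String[] args) {
            int executorNodes = 10;
            int coresPerNode = 4;                    // m4.xlarge
            int totalCores = executorNodes * coresPerNode;

            // "proper default parallelism (number of cores in cluster * 2)"
            int defaultParallelism = totalCores * 2; // 80

            // "batch size / blockSize ~equal to or a multiple of the number of executors"
            long batchMs = 10_000;                   // 10 s batches
            long blockIntervalMs = 1_000;            // hypothetical 1 s blocks -> 10 blocks/batch
            long blocksPerBatch = batchMs / blockIntervalMs;

            System.out.println("spark.default.parallelism = " + defaultParallelism);
            System.out.println("blocks per batch = " + blocksPerBatch
                    + " (aim for ~equal to or a multiple of " + executorNodes + " executors)");
        }
    }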
In the end, we were able to achieve 1 100 000 events/s processed by the Spark cluster with 10 executor nodes. The tuning also increased performance on configurations with fewer nodes -> we achieved practically linear scalability when scaling from 2 to 10 Spark executor nodes (m4.xlarge on AWS).
At the beginning, the CPU on the Kafka node wasn't approaching its limits, but it was still not able to respond to the demands of the Spark executors.
Thanks for all the suggestions, particularly to @ArturBiesiadowski, who suggested that the Kafka cluster was incorrectly sized.

HDFS disk usage showing different information

I got the following details through hadoop fsck /
Total size: 41514639144544 B (Total open files size: 581 B)
Total dirs: 40524
Total files: 124348
Total symlinks: 0 (Files currently being written: 7)
Total blocks (validated): 340802 (avg. block size 121814540 B) (Total open file blocks (not validated): 7)
Minimally replicated blocks: 340802 (100.0 %)
I am using a 256 MB block size.
So 340802 blocks * 256 MB = 83.2 TB, * 3 (replicas) = 249.6 TB,
but Cloudera Manager shows 110 TB of disk used. How is that possible?
You cannot just multiply the block count by the block size and the replication factor. Block size and replication factor can be set (and changed) per file, and a block only occupies as much disk space as the data it actually holds.
Hence the computation in the second part of your question need not be correct; in particular, the fsck output shows an average block size of approximately 120 MB, not 256 MB.
In this case, roughly 40 TB of data is taking up around 110 TB of storage, so the replication factor is also not 3 for all the files. Whatever you see in Cloudera Manager is the correct value.
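Redoing the arithmetic with the numbers from the fsck output above (illustrative only):

    public class HdfsUsageEstimate {
        public static void main(String[] args) {
            long totalBytes  = 41_514_639_144_544L; // "Total size" from fsck
            long totalBlocks = 340_802L;            // "Total blocks (validated)"
            double TB = 1024.0 * 1024 * 1024 * 1024;

            long avgBlockBytes = totalBytes / totalBlocks;                 // 121814540 B, matches fsck

            // Naive estimate: every block assumed to be a full 256 MB, replication factor 3.
            double naiveTb = totalBlocks * 256.0 * 1024 * 1024 * 3 / TB;   // ~249.6 TB

            System.out.printf("average block size  = %d B (roughly 120 MB, as fsck reports)%n",
                    avgBlockBytes);
            System.out.printf("naive estimate      = %.1f TB%n", naiveTb);
            System.out.printf("logical data        = %.1f TB%n", totalBytes / TB);
            System.out.printf("implied replication = %.2f if 110 TB is really used%n",
                    110 * TB / totalBytes);
        }
    }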

how is "cumulative map pmem" calculated?

My MR job has the following characteristics :
input size: 2 GB
chunk size: 128 MB
cluster size: 17 nodes
Distribution: MapR
I can see the "Cumulative Map PMem" size reported as 15 GB when I run my MR job.
I have learned that "Cumulative Map PMem" is a physical-memory counter representing the total physical memory used while executing the MR job.
I would like to know exactly how this "Cumulative Map PMem" is calculated. How does it shoot up to 15 GB?

Caches node of Cassandra jconsole is not expandable

I have 1 node of Cassandra 1.1.2 installed on Linux, and I want to determine the size that every CF occupies in the cache, and what percentage of every CF is in the cache (both for the row cache and the key cache).
When I connect to this node via jconsole and expand the org.apache.cassandra.db node, the 'Caches' node is not expandable, although according to:
http://www.datastax.com/docs/1.1/operations/monitoring#monitoring-and-adjusting-cache-performance
it should be expandable.
In addition, the output of nodetool also does not contain the properties
Key cache capacity, Key cache size and Key cache hit rate:
Column Family: io2
SSTable count: 4
Space used (live): 566387478
Space used (total): 566387478
Number of Keys (estimate): 3858816
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Postives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 7238040
Compacted row minimum size: 125
Compacted row maximum size: 149
Compacted row mean size: 149
Any idea?
The Caches node in the Cassandra jconsole is not expandable in Cassandra 1.1.2 because the individual per-CF caches were combined into a single global cache in 1.1, so these per-column-family fields are no longer exposed in 1.1.2. See:
http://www.datastax.com/dev/blog/caching-in-cassandra-1-1
