I'm feeding Spark Streaming from a Kinesis stream.
My project uses 1s batches. During the first batches (the stream already contains a few million items, and the job is told to start from the beginning of the stream), Spark Streaming consumes batches of 10K records, but only once every 10-20 seconds, i.e.:
t0  -> records: 0
t1  -> records: 0
...
t10 -> records: 10,000 -> total processing time is 0.8s (lower than the batch time)
t11 -> records: 0
...
t15 -> records: 0
...
t20 -> records: 10,000
This behaviour occurs until Spark catches up with the tip of the stream; after that, every batch processes records every second.
It feels like during the catch-up phase it should consistently process some number of records per batch, instead of having that many batches that process no records at all.
Is there a setting I'm overlooking? Is the described behaviour expected?
The cause of this issue is a bug in the spark-kinesis consumer, which does not apply maxRate correctly: https://issues.apache.org/jira/browse/SPARK-18620
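For reference, a hedged sketch (Scala, assuming the spark-streaming-kinesis-asl artifact and placeholder app/stream names) of the rate settings involved; spark.streaming.receiver.maxRate is the limit that the linked bug says the Kinesis receiver fails to honour:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// Sketch only: app name, stream name, endpoint and region are placeholders.
val conf = new SparkConf()
  .setAppName("kinesis-batch-test")
  // Upper bound on records/sec per receiver -- the setting SPARK-18620 reports as not applied.
  .set("spark.streaming.receiver.maxRate", "1000")
  // Optionally let Spark adapt the ingestion rate to the observed processing speed.
  .set("spark.streaming.backpressure.enabled", "true")

val ssc = new StreamingContext(conf, Seconds(1))          // 1s batches, as in the question

val stream = KinesisUtils.createStream(
  ssc, "myKinesisApp", "myStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON,                   // start from the beginning of the stream
  Seconds(1), StorageLevel.MEMORY_AND_DISK_2)

stream.count().print()
ssc.start()
ssc.awaitTermination()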
I'm testing the script below in spark-shell: a single-partition scan of a partitioned table.
val s = System.nanoTime
val q =
  s"""
  select * from partitioned_table where part_column = 'part_column_value'
  """
spark.sql(q).show
println("Elapsed: " + (System.nanoTime - s) / 1e9 + " seconds")
First execution takes around 30 seconds while all subsequent executions take around 2 seconds.
If we look at the runtime statistics, there are two additional jobs before the first execution.
It looks like the job with 1212 stages scans all the partitions in the table (total number of partitions: 1199, total number of HDFS files for this table: 1384).
I did not find a way to discover exactly what Scala/Java or SQL code is running for Job 0, but I suspect it is for caching.
Each time I exit spark-shell and start it again, I see these two additional jobs before the first execution.
Of course, similar observations are true for other queries.
Questions
Is it possible to prove or disprove the caching hypothesis?
If it is caching, how can I disable the cache and how can I clear it?
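For reference, the caching hypothesis can be checked directly from spark-shell; a minimal sketch, assuming Spark 2.x where spark.catalog is available:

// Check whether the table is registered as cached, then drop any cached data.
spark.catalog.isCached("partitioned_table")   // false => the 30s is not spent reading a table cache
spark.catalog.clearCache()                    // removes all cached tables/DataFrames for this session
spark.sql("CLEAR CACHE")                      // SQL equivalent of the call above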
Update: details about the job.
The problem was occurring on a specific Spark version, 2.0.2.
Spark was scanning all the partitions while building the plan, before query execution.
The issue was logged and fixed in Spark 2.1.0:
https://issues.apache.org/jira/browse/SPARK-16980
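To confirm that the extra work comes from scanning all partitions during planning rather than from caching, the physical plan can be inspected; a sketch for Spark 2.x (output details depend on the table's file format):

// Print the parsed, analyzed, optimized and physical plans without collecting any results.
spark.sql(q).explain(true)
// For file-based tables the scan node reports the pruning result,
// e.g. a PartitionCount of 1 instead of all 1199 partitions when pruning works.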
I'm trying to perform a simple join operation using Talend & Spark. The input data set has a few million records and the lookup data set has around 100 records (we might need to join against a lookup data set with millions of records too).
When just reading the input data and generating a flat file with the following memory settings, the job works fine and takes little time to run. But when performing the join operation described above, the job gets stuck at 99.7%.
ExecutorMemory = 20g
Cores Per Executor = 4
Yarn resources allocation = Fixed
Num of executors = 100
spark.yarn.executor.memoryOverhead=6000 (from some preliminary research I found that this should be about 10% of the executor memory, but that didn't help either.)
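For reference, a sketch of how these settings would typically be passed to spark-submit on YARN; the application jar path is a placeholder, and "Fixed" resource allocation is read here as dynamic allocation being disabled, which is an assumption:

# Sketch only: the jar path is a placeholder for the Talend-generated Spark job.
spark-submit \
  --master yarn \
  --num-executors 100 \
  --executor-cores 4 \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=6000 \
  --conf spark.dynamicAllocation.enabled=false \
  /path/to/talend_spark_job.jar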
After a while (30-40 minutes) the job prints a log message saying "Lost executor xx on abc.xyz.com". This is probably because the executor is kept waiting for too long and gets killed.
Has anyone run into this issue where a Spark job gets stuck at 99.7% for a simple operation? Also, what are the recommended tuning properties to use in such a scenario?
We have a problem with storing data in HBase. We have taken the following steps:
A big CSV file (size: 20 GB) is processed by a Spark application, producing HFiles as a result (result data size: 180 GB).
The table is created with the command: create 'TABLE_NAME', {'NAME'=>'cf', 'COMPRESSION'=>'SNAPPY'}
Data from the created HFiles is bulk-loaded with the command: hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles -Dhbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily=1024 hdfs://ip:8020/path TABLE_NAME
Right after loading, the table size is 180 GB; however, after some period of time (yesterday it was at 8pm, two days ago around 8am) a process is launched which compacts the data down to 14 GB.
My question is: what is the name of this process? Is it a major compaction? I ask because when I try to trigger compaction (major_compact and compact) manually, this is the output of the command run on the uncompacted table:
hbase(main):001:0> major_compact 'TEST_TYMEK_CRM_ACTION_HISTORY'
0 row(s) in 1.5120 seconds
This is the compaction process. I can suggest the following reason for such a big difference in table size: the Spark application does not apply a compression codec when it writes the HFiles, because the compression is specified on the table, i.e. after the files have already been created. Attaching the HFiles to the table does not change their format (all files in HDFS are immutable). Only after the compaction process is the data actually compressed. You can monitor the compaction process via the HBase UI, which usually runs on port 60010 (16010 in HBase 1.0+).
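A sketch (Scala, assuming the HBase 1.x client API on the classpath) of how a major compaction can be requested programmatically and its state polled, which also helps tell whether the shell request above actually started anything:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.ConnectionFactory

val conf  = HBaseConfiguration.create()
val conn  = ConnectionFactory.createConnection(conf)
val admin = conn.getAdmin
val table = TableName.valueOf("TEST_TYMEK_CRM_ACTION_HISTORY")

admin.majorCompact(table)                 // asynchronous: the shell's "0 row(s)" reply is normal
println(admin.getCompactionState(table))  // NONE / MINOR / MAJOR / MAJOR_AND_MINOR
conn.close()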
I did some profiling of my MR job and found that fetching the next records for the table scan takes ~30% of the time spent in the mapper. As far as I understand, the scanner fetches N rows from the server, as configured by scan.setCaching, and then iterates over them locally.
Is there anything I can do to minimize the cache load time? Is this a signal that the scan was set up incorrectly? Current setup:
scan caching = 100
record size = ~5kb
cf block size = ~130kb, compression=gz
I have thought about a custom table record reader that performs pre-fetching in the background.
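For context, a minimal sketch of the scan setup being tuned (standard HBase client API; the caching value here is only illustrative, not a recommendation):

import org.apache.hadoop.hbase.client.Scan

val scan = new Scan()
scan.setCaching(1000)        // rows fetched per RPC; at ~5 KB per row this is roughly 5 MB per round trip
scan.setCacheBlocks(false)   // usual for full-table MR scans, to avoid evicting the region server block cache
// The scan is then handed to TableMapReduceUtil.initTableMapperJob(...) when the MR job is set up.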
I'm evaluating NiFi for our ETL process.
I want to build the following flow:
Fetch a lot of data from a SQL database -> split it into chunks of 1000 records each -> count the error records in each chunk -> count the total number of error records -> if it exceeds a threshold, fail the process -> else save each chunk to the database.
The problem I can't resolve is how to wait until all chunks are validated. If, for example, I have 5 validation tasks working concurrently, I need some kind of barrier to wait until all chunks are processed and only after that run the error-count processor, because I don't want to save invalid data and then have to delete it if the threshold is reached.
The other question I have is whether it is possible to run this validation processor on multiple nodes in parallel and still be able to wait until they have all completed.
One solution is to use the ExecuteScript processor as a "relief valve": it holds a simple count in memory, triggered by the first receipt of a flowfile with a specific attribute value (stored in local/cluster state as essentially a map from the key attribute value to a count). Once that count reaches a threshold, you can generate a new flowfile containing the attribute value that has finished and route it to the success relationship. In this setup, send the other results (the flowfiles that need to be batched) to a MergeContent processor and set the minimum batch size to whatever you like. The follow-on processor after the valve should have its Scheduling Strategy set to Event Driven so it only runs when it receives a flowfile from the valve.
Updating the count in a DistributedMapCache is not the right approach, as the fetch and the update are separate operations and cannot be made atomic by a processor that simply increments counts.
http://apache-nifi-users-list.2361937.n4.nabble.com/How-do-I-atomically-increment-a-variable-in-NiFi-td1084.html