[HDFS connector + Kafka] How to write multiple topics in standalone mode?

I am using Confluent's HDFS Connector to write streamed data to HDFS. I followed the user manual and quick start and set up my connector.
It works properly when I consume only one topic.
My property file looks like this:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test_topic1
hdfs.url=hdfs://localhost:9000
flush.size=30
When I add more than one topic, I see it continuously committing offsets, but I do not see it writing the committed messages:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=2
topics=test_topic1,test_topic2
hdfs.url=hdfs://localhost:9000
flush.size=30
I tried tasks.max with both 1 and 2.
I continuously get "Committing offsets" logged, as below:
[2016-10-26 15:21:30,990] INFO Started recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,222] INFO Finished recovery for topic partition test_topic1-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:31,230] INFO Started recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:193)
[2016-10-26 15:21:31,236] INFO Finished recovery for topic partition test_topic2-0 (io.confluent.connect.hdfs.TopicPartitionWriter:208)
[2016-10-26 15:21:35,155] INFO Reflections took 6962 ms to scan 249 urls, producing 11712 keys and 77746 values (org.reflections.Reflections:229)
[2016-10-26 15:22:29,226] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:23:29,227] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:24:29,225] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
[2016-10-26 15:25:29,224] INFO WorkerSinkTask{id=hdfs-sink-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSinkTask:261)
When I gracefully stop the service (Ctrl+C), I see it removing the tmp files.
What am I doing wrong? What is the proper way to do this?
I'd appreciate any suggestions.

I kept stumbling over the same problem you've mentioned here for the past month or so, and I couldn't get to the bottom of it until today, when I upgraded to Confluent 3.1.1 and things started working as expected.
This is how I roll:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=5
topics=accounts,contacts,users
hdfs.url=hdfs://localhost:9000
flush.size=1
hive.metastore.uris=thrift://localhost:9083
hive.integration=true
schema.compatibility=BACKWARD
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
partitioner.class=io.confluent.connect.hdfs.partitioner.HourlyPartitioner
locale=en-us
timezone=UTC
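For reference, the standalone worker accepts one or more connector property files on the command line; a sketch of how it can be launched (the worker properties path below is a placeholder for whatever your install uses):
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties hdfs-sink.properties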

Related

How to stop Preparing to rebalance group with old generation in Kafka?

I use Kafka for my web application and I found the messages below in kafka.log:
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Preparing to rebalance group qpcengine-group in state PreparingRebalance with old generation 105 (__consumer_offsets-28) (reason: removing member consumer-1-7eafeb56-e6fe-4161-9c88-e69c06a0ab37 on heartbeat expiration) (kafka.coordinator.group.GroupCoordinator)
[2021-07-06 08:49:03,658] INFO [GroupCoordinator 0]: Group qpcengine-group with generation 106 is now empty (__consumer_offsets-28) (kafka.coordinator.group.GroupCoordinator)
But Kafka seems to keep looping like this forever for that one consumer.
How can I stop it?
Here is a picture of the Kafka log (screenshot omitted).
If you only have one partition, you don't need to use a consumer group.
Just try using assign (not subscribe), as in the sketch below.
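For illustration, a minimal sketch in Scala using the plain Java consumer client with manual assignment (the broker address and topic name are placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // placeholder broker
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
// No group.id: with manual assignment there is no group coordinator,
// so there are no rebalances and no heartbeat-expiration log messages.

val consumer = new KafkaConsumer[String, String](props)
consumer.assign(Collections.singletonList(new TopicPartition("my-topic", 0)))
consumer.seekToBeginning(consumer.assignment())

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.forEach(r => println(s"${r.offset()}: ${r.value()}"))
}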

kafka-streams instance on startup continuously logs "Found no committed offset for partition traces-1"

I have a kafka-streams app with 2 instances. This is a brand-new Kafka cluster, with all topics created and no messages written to them yet.
I start the first instance and see that it has transitioned from REBALANCING to the RUNNING state.
Now I start the next instance and notice that it continuously logs the following:
2020-01-14 18:03:57.896 [streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2] INFO o.a.k.c.c.i.ConsumerCoordinator - [Consumer clientId=streaming-app-f2457059-c9ec-4c21-a177-be54f8d59cb2-StreamThread-2-consumer, groupId=streaming-app] Found no committed offset for partition traces-1

Kafka Connect JDBC OOM - Large Amount of Data

I am trying to implement something similar to this tutorial. However, it only worked because the data set was very small. How would I do this for a larger table? I keep getting an out-of-memory error. My logs are:
ka.connect.runtime.rest.RestServer:60)
[2018-04-04 17:16:17,937] INFO [Worker clientId=connect-1, groupId=connect-cluster] Marking the coordinator ip-172-31-14-140.ec2.internal:9092 (id: 2147483647 rack: null) dead (org.apache.kafka.clients.consumer.internals.AbstractCoordinator:341)
[2018-04-04 17:16:17,938] ERROR Uncaught exception in herder work thread, exiting: (org.apache.kafka.connect.runtime.distributed.DistributedHerder:218)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,939] ERROR Uncaught exception in thread 'kafka-coordinator-heartbeat-thread | connect-sink-redshift': (org.apache.kafka.clients.consumer.internals.AbstractCoordinator$HeartbeatThread:51)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] INFO Kafka Connect stopping (org.apache.kafka.connect.runtime.Connect:65)
[2018-04-04 17:16:17,940] INFO Stopping REST server (org.apache.kafka.connect.runtime.rest.RestServer:154)
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,940] ERROR WorkerSinkTask{id=sink-redshift-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,940] INFO Stopping task (io.confluent.connect.jdbc.sink.JdbcSinkTask:96)
[2018-04-04 17:16:17,941] INFO WorkerSourceTask{id=production-db-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:306)
[2018-04-04 17:16:17,940] ERROR Unexpected exception in Thread[KafkaBasedLog Work Thread - connect-statuses,5,main] (org.apache.kafka.connect.util.KafkaBasedLog:334)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,946] INFO WorkerSourceTask{id=production-db-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:323)
[2018-04-04 17:16:17,954] ERROR WorkerSourceTask{id=production-db-0} Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:172)
java.lang.OutOfMemoryError: Java heap space
[2018-04-04 17:16:17,960] ERROR WorkerSourceTask{id=production-db-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:173)
[2018-04-04 17:16:17,960] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer:341)
[2018-04-04 17:16:17,960] INFO Stopped ServerConnector#64f4bfe4{HTTP/1.1}{0.0.0.0:8083} (org.eclipse.jetty.server.ServerConnector:306)
[2018-04-04 17:16:17,967] INFO Stopped o.e.j.s.ServletContextHandler#2f06a90b{/,null,UNAVAILABLE} (org.eclipse.jetty.server.handler.ContextHandler:865)
I have also tried increasing the memory with the suggestion here, but I am unable to load the entire table into memory. Is there a way to limit the amount of data produced?
For the JDBC connector, the most important property you can apply is probably this one, which seems to be what you are asking for:
batch.max.rows
Maximum number of rows to include in a single batch when polling for new data. This setting can be used to limit the amount of data buffered internally in the connector.
There is no need to "buffer the entire table into memory". With smaller batches, and more frequent polls and commits, you can ensure that almost all rows will be scanned, and you won't be at risk of a large batch failing, the connector stopping for a period of time, then restarting and missing a few rows on the next poll.
Otherwise, make sure you aren't using bulk table mode, as it will try to scan the entire table again and again.
The query option can also do a column projection on the table, as sketched below.
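As a rough illustration, a JDBC source connector config that combines these options (the connection URL, table, and column names are placeholders, not a drop-in config):

name=jdbc-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:postgresql://localhost:5432/mydb
connection.user=me
connection.password=secret
# incremental mode instead of bulk, so the whole table is never re-scanned
mode=timestamp+incrementing
timestamp.column.name=updated_at
incrementing.column.name=id
# limit how many rows are fetched and buffered per poll
batch.max.rows=100
# project only the columns you actually need
query=SELECT id, updated_at, amount FROM orders
topic.prefix=orders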
You can find more configuration options in the documentation, but any OOM error will need to be examined carefully on a case-by-case basis: enable JMX monitoring and export the metrics into an aggregation system you can watch more closely, such as Prometheus, rather than just seeing the OOM error and not knowing whether changing any particular parameter is really helping.
Another option would be to use CDC-based connectors, as another blog post shows.

remove hadoop info streaming.PipeMapRed and others

When I run Hadoop, I get many INFO messages. I want to remove the unhelpful ones, such as INFO streaming.PipeMapRed, INFO mapred.MapTask, etc., and retain the most important one: INFO mapreduce.Job.
So how do I do it?
There is an interface in org.apache.hadoop.mapred called Reporter, which lets tasks report progress, update counters, set status information, and so on. If you have access to the code, you can cut down on INFO messages and report the important ones with its setStatus method.
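For illustration, a sketch in Scala of a mapper using the old mapred API (the class and counter names are made up):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{MapReduceBase, Mapper, OutputCollector, Reporter}

class StatusReportingMapper extends MapReduceBase
    with Mapper[LongWritable, Text, Text, LongWritable] {

  override def map(key: LongWritable, value: Text,
                   output: OutputCollector[Text, LongWritable],
                   reporter: Reporter): Unit = {
    // setStatus surfaces a short message in the task/job status
    // instead of adding another INFO line to the log stream.
    reporter.setStatus("processing offset " + key.get())
    // Counters are another low-noise way to report what matters.
    reporter.incrCounter("MyApp", "RecordsSeen", 1L)
    output.collect(value, new LongWritable(1L))
  }
}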

What is Apache Spark doing before a job start

I have an Apache Spark batch job running continuously on AWS EMR. It pulls from AWS S3, runs a couple of jobs with that data, and then stores the data in an RDS instance.
However, there seems to be a long period of inactivity between jobs.
This is the CPU use, and this is the network traffic (graphs omitted).
Notice the gap between each column; it is almost the same size as the activity column itself!
At first I thought the two graphs were shifted (when it was pulling from S3, it wasn't using a lot of CPU, and vice versa), but then I noticed that they actually follow each other. This makes sense, since the RDDs are lazy and will thus pull as the job is running.
Which leads to my question: what is Spark doing during that time? All of the Ganglia graphs seem zeroed during that period. It is as if the cluster decided to take a break before each job.
Thanks.
EDIT: Looking at the logs, this is the part where it seems to take an hour of...doing nothing?
15/04/27 01:13:13 INFO storage.DiskBlockManager: Created local directory at /mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1429892010439_0020/spark-c570e510-934c-4510-a1e5-aa85d407b748
15/04/27 01:13:13 INFO storage.MemoryStore: MemoryStore started with capacity 4.9 GB
15/04/27 01:13:13 INFO netty.NettyBlockTransferService: Server created on 37151
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Trying to register BlockManager
15/04/27 01:13:13 INFO storage.BlockManagerMaster: Registered BlockManager
15/04/27 01:13:13 INFO util.AkkaUtils: Connecting to HeartbeatReceiver: akka.tcp://sparkDriver#ip-10-0-3-12.ec2.internal:41461/user/HeartbeatReceiver
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 0
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
15/04/27 02:30:45 INFO executor.Executor: Running task 77251.0 in stage 0.0 (TID 0)
15/04/27 02:30:45 INFO executor.Executor: Running task 77258.0 in stage 0.0 (TID 7)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 8
15/04/27 02:30:45 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 8)
15/04/27 02:30:45 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 15
15/04/27 02:30:45 INFO executor.Executor: Running task 7.0 in stage 0.0 (TID 15)
15/04/27 02:30:45 INFO broadcast.TorrentBroadcast: Started reading broadcast variable
Notice that at 01:13:13 it just hangs there until 02:30:45.
I found the issue. The problem was in the way I was pulling from S3.
We have our data in S3 separated by a date pattern, as in s3n://bucket/2015/01/03/10/40/actualData.txt for the data from 2015-01-03 at 10:40.
So when we want to run the batch process on the whole set, we call sc.textFile("s3n://bucket/*/*/*/*/*/*").
BUT that is bad. In retrospect, this makes sense: for each star (*), Spark needs to list all of the entries in that "directory", and then list everything in the directories under that. A single month has about 30 day directories, each day has 24 hour directories, and each of those has 60 minute directories. So the above pattern calls "list files" for each star AND then calls it again on everything returned, all the way down to the minutes! This is so that it can eventually find all of the **/actualData.txt files and then union all of their RDDs.
This, of course, is really slow. So the answer was to build these paths in code (a list of strings for all the dates; in our case, all possible dates can be determined) and reduce them into a comma-separated string that can be passed to textFile.
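For illustration, a rough Scala sketch of building the explicit paths (the bucket name and date range are placeholders, and it assumes you only generate paths that actually exist):

import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3-batch"))

// Enumerate every minute in the range and build its explicit path,
// so Spark never has to expand wildcards against S3 itself.
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd/HH/mm")
val start = LocalDateTime.of(2015, 1, 1, 0, 0)
val end = LocalDateTime.of(2015, 1, 31, 23, 59)
val paths = Iterator.iterate(start)(_.plusMinutes(1))
  .takeWhile(!_.isAfter(end))
  .map(t => s"s3n://bucket/${t.format(fmt)}/actualData.txt")
  .toSeq

// textFile accepts a comma-separated list of paths.
val data = sc.textFile(paths.mkString(","))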
If in your case you can't determine all of the possible paths, consider restructuring your data, or building as much of each path as possible and only using * towards the end, or using the AmazonS3Client to get all the keys via the list-objects API (which lets you fetch ALL keys in a bucket under a prefix very quickly) and then passing them as a comma-separated string to textFile, as sketched below. It will still make a listStatus call for each file and it will still be serial, but there will be far fewer calls.
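A minimal sketch of the listing approach with the AWS SDK for Java v1 (the bucket name and prefix are placeholders; sc is the SparkContext from the sketch above):

import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ObjectListing
import scala.collection.JavaConverters._
import scala.collection.mutable.ArrayBuffer

val s3 = new AmazonS3Client() // uses the default credential chain
val keys = ArrayBuffer[String]()

// listObjects returns at most 1000 keys per call, so page through the listing.
var listing: ObjectListing = s3.listObjects("bucket", "2015/01/")
keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
while (listing.isTruncated) {
  listing = s3.listNextBatchOfObjects(listing)
  keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
}

val data = sc.textFile(keys.map(k => s"s3n://bucket/$k").mkString(","))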
However, all of these solutions just delay the inevitable: as more and more data gets built, more and more list-status calls will be made serially. The root of the issue seems to be that sc.textFile("s3n://...") pretends that S3 is a file system, which it is not; it is a key-value store. Spark (and Hadoop) need a different way of dealing with S3 (and possibly other key-value stores) that doesn't assume a file system.
