Flume Twitter Stream rolling small files in HDFS - hadoop

I think I have tried every combination of settings in my config file. I also read that the problem might be caused by my replication factor being 3, so I changed it to 1. I am using Cloudera Manager on AWS.
In HDFS, the files are all under 20 KB, and I am trying to get at least 40-50 MB. Oddly, the same config file writes ~60 MB files on the virtual machine I was practicing with (pre-installed Hadoop + tools). My config file is below; any ideas?
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = apple, grapes, fruits, strawberry, mango, pear
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://123.456.789.us-west-2.compute.amazonaws.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000

If rollInterval, batchSize, rollSize and rollCount have no effect, the remaining setting to look at is hdfs.callTimeout.
Someone suggested that reducing the replication factor could be a solution.
Reducing the replication factor shortens HDFS operation time, and according to the Flume User Guide the default value of callTimeout is 10000 milliseconds.
Other clues:
How-to: Do Apache Flume Performance Tuning (Part 1)
How can I force Flume-NG to process the backlog of events after a sink failed?
Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information

So I finally figured out the issue. (Note: I am running a single-node test cluster.) One of the solutions on Stack Overflow was to set dfs.replication to 1, which I did, but that did not solve the problem.
For some reason, there was a config mismatch in my Flume agent. The HDFS sink has a parameter called minBlockReplicas, which tells it how many block replicas are required; if it is not specified, the sink takes the value from the default HDFS configuration file (which I thought I had set to 1). It looks like it was picking up a different value for dfs.replication or for dfs.namenode.replication.min.
I worked around the problem by setting the value directly in my Flume config file:
TwitterAgent.sinks.HDFS.hdfs.minBlockReplicas = 1
Hope this helps.

Yes, adding this line resolved the problem of many small files being created on HDFS when using Flume:
a1.sinks.HDFS.hdfs.minBlockReplicas = 1
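Since both answers come down to one missing property, a quick check of the agent file can save some trial and error. The following is only a minimal sketch: the helper names (parse_properties, hdfs_sink_warnings) and the SAMPLE text are made up for illustration, and the parser handles only simple key = value lines, not the full Java properties syntax.

```python
# Hypothetical helpers for illustration; not part of Flume.
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def hdfs_sink_warnings(props, agent, sink):
    """Return a warning if the HDFS sink does not set minBlockReplicas."""
    key = "{}.sinks.{}.hdfs.minBlockReplicas".format(agent, sink)
    warnings = []
    if key not in props:
        warnings.append(
            key + " is not set; the sink will fall back to the Hadoop "
            "configuration it finds on its classpath"
        )
    return warnings


# Sample text modelled on the sink settings from the question.
SAMPLE = """
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
"""

props = parse_properties(SAMPLE)
for warning in hdfs_sink_warnings(props, "TwitterAgent", "HDFS"):
    print(warning)
```

With the sample above the check prints one warning; adding the minBlockReplicas line from the answers makes it go away.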

Related

EOFException from Kafka in Flume

I am trying to set up a simple data pipeline from a console Kafka producer to the Hadoop file system (HDFS). I am working on a 64bit Ubuntu Virtual Machine and have created separate users for both Hadoop and Kafka as was suggested by the guides that I have followed. Consuming the produced input in Kafka with a console consumer works and the HDFS seems to be up and running.
Now I want to use Flume to pipe the input into the HDFS. I am using the following configuration file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 127.0.0.1:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 2000
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = hdfs://flume/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
Now when I run Flume with the following command
bin/flume-ng agent --conf ./conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n tier1
I get the same exception in the console output over and over again:
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.clients.NetworkClient.handleConnections(NetworkClient.java:467)] Completed connection to node 2147483647
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.common.network.Selector.poll(Selector.java:307)] Connection with Ubuntu-Sandbox/127.0.1.1 disconnected
java.io.EOFException
    at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
    at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
    at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
    at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:311)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:890)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
    at org.apache.flume.source.kafka.KafkaSource.doStart(KafkaSource.java:529)
    at org.apache.flume.source.BasicSourceSemantics.start(BasicSourceSemantics.java:83)
    at org.apache.flume.source.PollableSourceRunner.start(PollableSourceRunner.java:71)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:249)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
The only way to stop Flume is to kill the Java process.
I thought it might have something to do with the separate users for Hadoop and Kafka, but I get the same result even when running everything as the Kafka user. I haven't found anything about this EOFException online either, which is strange considering that I just followed the "Getting Started" guides and used pretty standard configurations for everything.
Maybe it has something to do with the preceding line ("Ubuntu-Sandbox/127.0.1.1 disconnected") and hence the configuration of my VM?
Any help is highly appreciated!
Have you considered using Kafka Connect (part of Apache Kafka) and the HDFS connector instead? This is generally seen as having superseded Flume. It is easy to use, with a file-based configuration similar to Flume's.

Spark Job error GC overhead limit exceeded [duplicate]

This question already has answers here:
Error java.lang.OutOfMemoryError: GC overhead limit exceeded
I am running a Spark job with the following settings in spark-defaults.sh, which I changed on the name node. I have 1 data node, and I am working on 2 GB of data.
spark.master spark://master:7077
spark.executor.memory 5g
spark.eventLog.enabled true
spark.eventLog.dir hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory 5g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
But I am getting an error saying "GC overhead limit exceeded".
Here is the code I am working on:
import os
import sys
import unicodedata
from operator import add
try:
    from pyspark import SparkConf
    from pyspark import SparkContext
except ImportError as e:
    print ("Error importing Spark Modules", e)
    sys.exit(1)
# delimiter function
def findDelimiter(text):
    sD = text[1]
    eD = text[2]
    return (eD, sD)
def tokenize(text):
    sD = findDelimiter(text)[1]
    eD = findDelimiter(text)[0]
    arrText = text.split(sD)
    text = ""
    seg = arrText[0].split(eD)
    arrText=""
    senderID = seg[6].strip()
    yield (senderID, 1)
conf = SparkConf()
sc = SparkContext(conf=conf)
textfile = sc.textFile("hdfs://my_IP:9000/data/*/*.txt")
rdd = textfile.flatMap(tokenize)
rdd = rdd.reduceByKey(lambda a,b: a+b)
rdd.coalesce(1).saveAsTextFile("hdfs://my_IP:9000/data/total_result503")
I also tried groupByKey instead of reduceByKey, but I get the same error. When I remove the reduceByKey or groupByKey step, I do get output. Can someone help me with this error?
Should I also increase the GC size in Hadoop? As I said earlier, I set driver.memory to 5 GB on the name node. Should I do that on the data node as well?
Try adding the settings below to your spark-defaults.conf:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC
Tuning JVM garbage collection can be tricky, but G1GC seems to work pretty well. Worth trying!
The code you have should work with your configuration. As suggested earlier, try using G1GC.
Also try reducing the storage memory fraction. By default it is 60%; try reducing it to 40% or less.
You can set it by adding spark.storage.memoryFraction 0.4
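To get a feel for what that change means for the 5g executor in the question, here is a rough back-of-the-envelope sketch. The helper storage_budget_mb is made up for illustration, and it deliberately ignores the safety margins and other JVM overhead that Spark applies on top of the fraction, so treat the results as upper bounds rather than exact figures.

```python
# Hypothetical helper for illustration; the real accounting inside Spark
# is more involved than a single multiplication.
def storage_budget_mb(heap_mb, memory_fraction):
    """Rough upper bound on the heap set aside for cached data, in MB."""
    return int(round(heap_mb * memory_fraction))


EXECUTOR_HEAP_MB = 5 * 1024  # spark.executor.memory 5g from the question

default_budget = storage_budget_mb(EXECUTOR_HEAP_MB, 0.6)  # default fraction
reduced_budget = storage_budget_mb(EXECUTOR_HEAP_MB, 0.4)  # suggested fraction

print("storage budget at 0.6: %d MB" % default_budget)
print("storage budget at 0.4: %d MB" % reduced_budget)
print("heap freed for other work: %d MB" % (default_budget - reduced_budget))
```

The point of lowering the fraction is that the difference becomes available to the rest of the job, such as the shuffle behind reduceByKey.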
I was able to solve the problem. I was running Hadoop as the root user on the master node, but I had configured Hadoop under a different user on the data nodes. Once I configured it under the root user on the data nodes as well and increased the executor and driver memory, it worked fine.

Flume Spooling Directory Source: Cannot load larger files

I am trying to ingest files into HDFS using the Flume spooling directory source (SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2 (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, but it fails with larger files. My test scenario is below:
Files from a few KB up to 50-60 MB are processed without issue.
With files larger than 50-60 MB, it writes around 50 MB to HDFS and then the Flume agent exits unexpectedly.
There is no error message in the Flume log.
I found that it tries to create the ".tmp" file (in HDFS) several times, and each time it writes a few megabytes (sometimes 2 MB, sometimes 45 MB) before exiting unexpectedly.
After some time, the last ".tmp" file is renamed as completed (".tmp" removed) and the file in the source spoolDir is also renamed with ".COMPLETED", although the full file has not been written to HDFS.
In the real scenario, our files will be around 2 GB in size, so I need a robust Flume configuration to handle the workload.
Note:
The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).
The spool directory is on the local filesystem of the same server that runs the Flume agent.
All servers are physical (not virtual).
In the same cluster, we have a Twitter data feed running fine with Flume (although with a very small amount of data).
Below is the flume.conf file I am using:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1
######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader =true
spoolDir.sources.src-1.batchSize = 100000
######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
Please let me know whether there is an issue with my configuration or whether I am missing something.
Or is it a known issue that the Flume SpoolDir source cannot handle bigger files?
Regards,
-Obaid
I have posted the same topic to another open community; if I get a solution there, I will update here, and vice versa.
I have tested Flume with files of several sizes and finally came to the conclusion that "Flume is not for larger files".
So I have started using the HDFS NFS Gateway instead. This is really cool, and now I do not even need a spool directory in local storage; I push files directly to the NFS-mounted HDFS using scp.
I hope this helps someone who is facing the same issue.
Thanks,
Obaid
Try using a file channel, as it is more reliable than a memory channel.
Use the following configuration to add the file channel:
spoolDir.channels = channel-1
spoolDir.channels.channel-1.type = file
spoolDir.channels.channel-1.checkpointDir = /mnt/flume/checkpoint
spoolDir.channels.channel-1.dataDirs = /mnt/flume/data
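For comparison, it may help to see what the memory channel limits in the question's configuration add up to, since a memory channel holds its events on the agent's JVM heap. The sketch below only does the arithmetic for byteCapacity and byteCapacityBufferPercentage; the helper name usable_body_bytes is made up, and the question does not say how much heap the agent was actually given.

```python
# Values copied from the memory channel settings in the question.
BYTE_CAPACITY = 6442450944   # spoolDir.channels.channel-1.byteCapacity
BUFFER_PERCENTAGE = 20       # spoolDir.channels.channel-1.byteCapacityBufferPercentage


def usable_body_bytes(byte_capacity, buffer_percentage):
    """Bytes left for event bodies after reserving the header buffer."""
    return byte_capacity * (100 - buffer_percentage) // 100


capacity_gib = BYTE_CAPACITY / float(2 ** 30)
usable = usable_body_bytes(BYTE_CAPACITY, BUFFER_PERCENTAGE)

print("byteCapacity: %.1f GiB" % capacity_gib)
print("usable for event bodies: %d bytes (%.1f GiB)" % (usable, usable / float(2 ** 30)))
```

If the agent's heap is smaller than these figures, the channel limits cannot be reached in practice, which is one reason a file channel is the safer choice for multi-gigabyte inputs.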

Impala - file not found error

I'm using Impala with Flume as the file stream.
The problem is that Flume adds temporary files with the extension .tmp, and when they are deleted, Impala queries fail with the following message:
Backend 0:Failed to open HDFS file
hdfs://localhost:8020/user/hive/../FlumeData.1420040201733.tmp
Error(2): No such file or directory
How can I make Impala ignore these .tmp files, or make Flume not write them, or write them to another directory?
Flume configuration:
### Agent2 - Avro Source and File Channel, hdfs Sink ###
# Name the components on this agent
Agent2.sources = avro-source
Agent2.channels = file-channel
Agent2.sinks = hdfs-sink
# Describe/configure Source
Agent2.sources.avro-source.type = avro
Agent2.sources.avro-source.hostname = 0.0.0.0
Agent2.sources.avro-source.port = 11111
Agent2.sources.avro-source.bind = 0.0.0.0
# Describe the sink
Agent2.sinks.hdfs-sink.type = hdfs
Agent2.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/hive/table/
Agent2.sinks.hdfs-sink.hdfs.rollInterval = 0
Agent2.sinks.hdfs-sink.hdfs.rollCount = 10000
Agent2.sinks.hdfs-sink.hdfs.fileType = DataStream
#Use a channel which buffers events in file
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = /home/ubuntu/flume/checkpoint/
Agent2.channels.file-channel.dataDirs = /home/ubuntu/flume/data/
# Bind the source and sink to the channel
Agent2.sources.avro-source.channels = file-channel
Agent2.sinks.hdfs-sink.channel = file-channel
I had this problem once.
I upgraded Hadoop and Flume and it was solved (from Cloudera CDH 5.2 to CDH 5.3).
Try upgrading Hadoop, Flume or Impala.
Also check that your Flume configuration matches your Flume version; that was my problem.
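Another option that is not mentioned in the answer above is the HDFS sink's hdfs.inUsePrefix setting, which is empty by default, alongside hdfs.inUseSuffix, which defaults to .tmp. The sketch below only illustrates the naming effect. It assumes that Impala follows the common Hadoop convention of skipping files whose names start with "." or "_", which should be verified for the version in use, and both helper functions are made up for illustration.

```python
# Hypothetical helpers for illustration; they mimic file naming only.
def in_use_name(base_name, in_use_prefix="", in_use_suffix=".tmp"):
    """Name of a file while the HDFS sink is still writing to it."""
    return in_use_prefix + base_name + in_use_suffix


def is_hidden(file_name):
    """True for names that Hadoop-style readers conventionally skip."""
    return file_name.startswith((".", "_"))


base = "FlumeData.1420040201733"  # file name taken from the error message

default_name = in_use_name(base)
prefixed_name = in_use_name(base, in_use_prefix="_")

print(default_name, "hidden:", is_hidden(default_name))
print(prefixed_name, "hidden:", is_hidden(prefixed_name))
```

Under that assumption, adding a line such as Agent2.sinks.hdfs-sink.hdfs.inUsePrefix = _ would keep in-progress files out of query plans until they are renamed.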

Flume 1.4 with Hadoop 2.2.0 and hdfs sink type having issue

Below mentioned is my flume configuration file...
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# read files from a spooling directory and send the data to the channel
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/apache-flume-1.4.0-bin/logs
a1.sources.r1.channels = c1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://PPWFMD509:9160/flume-test
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
I am running Hadoop 2.2.0 (I added the hadoop-core-1.2.1.jar file to the Flume lib directory).
I cannot find a hadoop-core 2.2.x jar in the Maven repository. Why? And what are the hadoop-core-0.20 versions?
When I run the agent and place a file in the spool directory, I get the following exception:
2014-02-26 14:51:30,865 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:219)] Creating hdfs://PPWFMD509:9160/flume-test/events-.1393406490812.tmp
2014-02-26 14:51:31,079 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)] HDFS IO error
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
    at org.apache.hadoop.ipc.Client.call(Client.java:1113)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
The message "Server IPC version 9 cannot communicate with client version 4" indicates a compatibility issue. Flume is trying to use a Hadoop client version that is not compatible with your Hadoop cluster (a 1.2.1 client cannot work with a 2.x cluster).
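When scanning long agent logs for this kind of mismatch, it can be handy to pull the two version numbers out of the message automatically. This is just a small sketch: the function name parse_ipc_mismatch is made up, and the pattern is based only on the wording of the exception shown in the question.

```python
import re

# Pattern based on the exception text quoted in the question.
IPC_MISMATCH = re.compile(
    r"Server IPC version (\d+) cannot communicate with client version (\d+)"
)


def parse_ipc_mismatch(message):
    """Return (server_version, client_version), or None if no match."""
    match = IPC_MISMATCH.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


log_line = (
    "org.apache.hadoop.ipc.RemoteException: "
    "Server IPC version 9 cannot communicate with client version 4"
)

versions = parse_ipc_mismatch(log_line)
if versions is not None:
    server, client = versions
    print("server IPC version %d, client IPC version %d" % (server, client))
```

A server number higher than the client number, as here, points to client jars that are older than the cluster.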
As for the library versions, this is a quote from the Hadoop Releases page:
1.2.X - current stable version, 1.2 release
2.5.X - current stable 2.x version
0.23.X - similar to 2.X.X but missing NN HA.

Resources