I'm using impala with flume as filestream.
The problem is flume is adding temporary files with extension .tmp, and then when they are deleted impala queries are failing with the following message:
Backend 0:Failed to open HDFS file
hdfs://localhost:8020/user/hive/../FlumeData.1420040201733.tmp
Error(2): No such file or directory
How can I make impala to ignore this tmp files, or flume not to write them, or write them to another directory?
Flume configuration:
### Agent2 - Avro Source and File Channel, hdfs Sink ###
# Name the components on this agent
Agent2.sources = avro-source
Agent2.channels = file-channel
Agent2.sinks = hdfs-sink
# Describe/configure Source
Agent2.sources.avro-source.type = avro
Agent2.sources.avro-source.hostname = 0.0.0.0
Agent2.sources.avro-source.port = 11111
Agent2.sources.avro-source.bind = 0.0.0.0
# Describe the sink
Agent2.sinks.hdfs-sink.type = hdfs
Agent2.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/hive/table/
Agent2.sinks.hdfs-sink.hdfs.rollInterval = 0
Agent2.sinks.hdfs-sink.hdfs.rollCount = 10000
Agent2.sinks.hdfs-sink.hdfs.fileType = DataStream
#Use a channel which buffers events in file
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = /home/ubutnu/flume/checkpoint/
Agent2.channels.file-channel.dataDirs = /home/ubuntu/flume/data/
# Bind the source and sink to the channel
Agent2.sources.avro-source.channels = file-channel
Agent2.sinks.hdfs-sink.channel = file-channel
I had this problem once.
I've upgraded hadoop and flume and it got solved. (from cloudera hadoop cdh-5.2 into cdh-5.3)
Try upgrading - hadoop, flume or impala.
See if your flume configuration match the flume version, that was my problem.
Related
I am trying to set up a simple data pipeline from a console Kafka producer to the Hadoop file system (HDFS). I am working on a 64bit Ubuntu Virtual Machine and have created separate users for both Hadoop and Kafka as was suggested by the guides that I have followed. Consuming the produced input in Kafka with a console consumer works and the HDFS seems to be up and running.
Now I want to use Flume to pipe the input into the HDFS. I am using the following configuration file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 127.0.0.1:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 2000
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = hdfs://flume/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
Now when I run Flume with the following command
bin/flume-ng agent --conf ./conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n tier1
I get the same exception in the console output over and over again:
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.clients.NetworkClient.handleConnections(NetworkClient.java:467)] Completed connection to node 2147483647
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.common.network.Selector.poll(Selector.java:307)] Connection with Ubuntu-Sandbox/127.0.1.1 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:311)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:890)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at org.apache.flume.source.kafka.KafkaSource.doStart(KafkaSource.java:529)
at org.apache.flume.source.BasicSourceSemantics.start(BasicSourceSemantics.java:83)
at org.apache.flume.source.PollableSourceRunner.start(PollableSourceRunner.java:71)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:249)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The only way to stop Flume is to kill the Java process.
I thought that it might have something to do with the separate users for Hadoop and Kafka, but even when running everything with the Kafka user I get the same result. I haven't found anything concerning the EOFException method online either, which is strange considering that I have just followed the "Getting Started" guides and used pretty standard configurations for everything.
Maybe it has something to do with the preceding line ("Ubuntu-Sandbox/127.0.1.1 disconnected") and hence the configuration of my VM?
Any help is highly appreciated!
Have you considered using Kafka Connect (part of Apache Kafka) and the HDFS connector instead? This is generally seen to have superseded Flume. It is easy to use, with a similar file-based configuration as Flume.
I'm running hbase in a cluster mode and I'm getting the following error:
DEBUG org.apache.hadoop.hbase.zookeeper.ZKUtil - catalogtracker-on-hconnection-0x6e704bd0x0, quorum=node2:2181, baseZNode=/hbase Set watcher on znode that does not yet exist, /hbase/meta-region-server
I had the similar the error and got it resolved by doing these:
1) Making sure HBase Client Version is compatible with the HBase version on the cluster.
2) Adding hbase-site.xml to your application classpath, so that HBase client determines all the appropriate HBase configurations from it.
val conf = org.apache.hadoop.hbase.HBaseConfiguration.create()
// Instead of the following settings, pass hbase-site.xml in classpath
// conf.set("hbase.zookeeper.quorum", hbaseHost)
// conf.set("hbase.zookeeper.property.clientPort", hbasePort)
HBaseAdmin.checkHBaseAvailable(conf);
log.debug("HBase found! with conf " + conf);
I am trying to ingest using flume spooling directory to HDFS(SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2. (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, however it fails with larger files. Please find below my testing scenerio:
files with size Kbytes to 50-60MBytes, processed without issue.
files with greater than 50-60MB, it writes around 50MB to HDFS then I found flume agent unexpected exit.
There are no error message on flume log.
I found that it is trying to create the ".tmp" file (HDFS) several times, and each time writes couple of megabytes (some time 2MB, some time 45MB ) before unexpected exit.
After some time, the last tried ".tmp" file renamed as completed(".tmp" removed) and the file in source spoolDir also renamed as ".COMPLETED" although full file is not written to HDFS.
In real scenerio, our files will be around 2GB in size. So, need some robust flume configuration to handle workload.
Note:
Flume agent node is part of hadoop cluster and not a datanode (it is an edge node).
Spool directory is local filesystem on the same server running flume agent.
All are physical sever (not virtual).
In the same cluster, we have twitter datafeeding with flume running fine(although very small about of data).
Please find below flume.conf file I am using here:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1
######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader =true
spoolDir.sources.src-1.batchSize = 100000
######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
Kindly suggest me whether there is any issue with my configuration or am I missing something.
Or is it a known issue that Flume SpoolDir cannot handle with bigger files.
Regards,
-Obaid
I have posted the same topic to another open community, if I get solution from other one, I will update here and vice versa.
I have tested flume with several size files and finally come up with conclusion that "flume is not for larger size files".
So, finally I have started using HDFS NFS Gateway. This is really cool and now I do not even need a spool directory in local storage. Pushing file directly to nfs mounted HDFS using scp.
Hope it will help some one who is facing same issue like me.
Thanks,
Obaid
Try using File channel as it is more reliable than Memory channel.
Use the following configuration to add File-Channel.
spoolDir.channels = channel-1
spoolDir.channels.channel-1.type = file
spoolDir.channels.channel-1.checkpointDir = /mnt/flume/checkpoint
spoolDir.channels.channel-1.dataDirs = /mnt/flume/data
I think I have tried every combination of altering my config file. I also saw somewhere that it might be due to my replication factor being 3 so I changed it to 1. I am using cloudera manager on AWS. Below is my config file, any ideas?
In HDFS, the file sizes are all under 20kb, trying to get at least 40-50mb. What is funny is that the same config file is writing ~60mb files on my virtual machine that I was practicing with (pre-installed hadoop + tools). See below for config file, any ideas?
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = apple, grapes, fruits, strawberry, mango, pear
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://123.456.789.us-west-2.compute.amazonaws.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
If rollInterval, batchSize, rollSize & rollCount are not working, remain things looks hdfs.callTimeout.
Because someone said reducing replication factor could be solution.
Reducing replication factor means reducing hdfs operation time and according to flume user guideline, default value of callTimeout is 10000 milliseconds.
Other clues are
How-to: Do Apache Flume Performance Tuning (Part 1)
How can I force Flume-NG to process the backlog of events after a sink failed?
Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information
So i finally figured out the issue. (note I am running a single node test cluster). One of the solutions in stackoverflow was to set the dfs.replication factor to 1 which I did but that did not solve the problem.
For some reason what was happening was that in my flume agent, there was a mismatch in configs. The HDFS Sink has a parameter called minBlockReplicas, which informs it as to how many block replicas are necessary to have, and if not specified, it pulls that paramaneter from the default HDFS configuration file (which i thought I set to 1). It looks like it was getting a different value for dfs.replication or for dfs.namennode.replication.min.
I circumvented the error my modifying my flume file directly by using
TwitterAgent.sinks.HDFS.hdfs.minBlockReplicas = 1
Hope this helps.
Yes, by adding this line it is resolved my small multiple files creating on HDFS while using flume
a1.sinks.HDFS.hdfs.minBlockReplicas = 1
Below mentioned is my flume configuration file...
# example.conf: A single-node Flume configuration
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# reading file using tail command and sending data to channel
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /data/apache-flume-1.4.0-bin/logs
a1.sources.r1.channels = c1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://PPWFMD509:9160/flume-test
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
running on hadoop version 2.2.0 (have added hadoop-core 1.2.1.jar file to the flume lib directory)
on the maven repository i am not able to file jar for hadoop-core2.2.x. why ? and what id hadoop-core-0.20 versions ?
when running the same and placing file have below mentioned exception
2014-02-26 14:51:30,865 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:219)] Creating hdfs://PPWFMD509:9160/flume-test/events-.1393406490812.tmp
2014-02-26 14:51:31,079 (SinkRunner-PollingRunner-DefaultSinkProcessor) [WARN - org.apache.flume.sink.hdfs.HDFSEventSink.process(HDFSEventSink.java:418)] HDFS IO error
org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
The message Server IPC version 9 cannot communicate with client version 4 points that you faced a compatibility issue. Flume is trying to use hadoop client version which is not compatible with your hadoop cluster (1.2.1 cannot work with 2+ version).
As for lib version, this is quote from Hadoop Releases:
1.2.X - current stable version, 1.2 release
2.5.X - current stable 2.x version
0.23.X - similar to 2.X.X but missing NN HA.