Writing to HDFS using a Flume spooling directory source: how to rename the file - hadoop

I am writing to HDFS using a Flume spooling directory source. Here is my configuration:
#initialize agent's source, channel and sink
agent.sources = test
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.test.type = spooldir
agent.sources.test.spoolDir = /johir
agent.sources.test.fileHeader = false
agent.sources.test.fileSuffix = .COMPLETED
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = /user/root/
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# never roll over based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# never roll over based on elapsed time
agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# roll over the file once it reaches roughly 1 MB
agent.sinks.flumeHDFS.hdfs.rollSize = 1000000
agent.sinks.flumeHDFS.hdfs.batchSize = 1000
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.test.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
But the problem is that the data written to HDFS ends up in a file with a randomly generated name. How can I rename the file in HDFS to match the original file name in the source directory? For example, I have the files day1.txt, day2.txt and day3.txt, holding data for three different days. I want to keep them stored in HDFS as day1.txt, day2.txt and day3.txt, but instead the three files are merged and stored in HDFS as FlumeData.1464629158164.tmp. Is there any way to do this?

If you want to retain the original file name, you should attach the file name to each event as a header.
Set the basenameHeader property of the spooling directory source to true. This creates a header with the key basename, unless you change the key with the basenameHeaderKey property.
Then use the sink's hdfs.filePrefix property to build the output file name from that header value.
Add the properties below to your configuration file:
#source properties
agent.sources.test.basenameHeader = true
#sink properties
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.filePrefix = %{basename}
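Note that your configuration also disables all size- and time-based rolling and sets hdfs.maxOpenFiles = 1, which allows only one writer at a time. Since each distinct %{basename} value gets its own bucket writer, the extra sink settings below are a hedged sketch of how to keep one closed output file per input file (the timeout and maxOpenFiles values are illustrative assumptions, not tested against your data volume):
# close an output file once its basename has received no new events for 60 seconds
agent.sinks.flumeHDFS.hdfs.idleTimeout = 60
# allow one writer per basename currently being ingested
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 50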

Related

How to make Flume load files to HDFS: HDFS never closes the .tmp file, and how to rename the file by its original name.

Actually I have two questions. My first question is: how do I make HDFS close the file (for example .123456789.tmp) after the entire file has been flushed by the Flume agent?
In fact, the file is never closed until I force the Flume agent to stop.
I believe there is a method using the four parameters below:
hdfs.rollSize = 0
hdfs.rollCount =0
hdfs.rollInterval = 0
hdfs.batchSize = 1000000
My second question is: my Flume agent receives files from an SFTP server, and I need to keep each file name in HDFS. This works fine with the spooldir source type, but not with SFTP. Any ideas?
My configuration file for the Flume agent is as follows:
agent.sources = r1
agent.channels = c1
agent.sinks = k
# configure the SFTP source
agent.sources.r1.type = org.keedio.flume.source.mra.source.Source
agent.sources.r1.client.source = sftp
agent.sources.r1.name.server = ip
agent.sources.r1.user = user
agent.sources.r1.password = secret
agent.sources.r1.port = 22
agent.sources.r1.knownHosts = ~/.ssh/known_hosts
agent.sources.r1.work.dir = /DATA/test/flumrFTP
agent.sources.r1.fileHeader = true
agent.sources.r1.basenameHeader = true
agent.sources.r1.inputCharset = ISO-8859-1
#agent.sources.r1.batchSize = 1000
agent.sources.r1.flushlines = true
# configure the HDFS sink
agent.sinks.k.type = hdfs
agent.sinks.k.hdfs.path = hdfs://hostname:8000/user/admin/DATA/import_flume/
agent.sinks.k.hdfs.filePrefix = %{basename}
agent.sinks.k.hdfs.rollCount = 0
agent.sinks.k.hdfs.rollInterval = 0
agent.sinks.k.hdfs.rollSize = 0
agent.sinks.k.hdfs.useLocalTimeStamp = true
agent.sinks.k.hdfs.batchSize = 1000000
agent.sinks.k.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 1000000
agent.sources.r1.channels = c1
agent.sinks.k.channel = c1
Try setting hdfs.rollInterval: it is the number of seconds to wait before rolling the current file.
This setting closes the file after the number of seconds you set. I set mine to 200 seconds and I am loading smaller files.
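A hedged sketch of how that suggestion maps onto the configuration above (200 seconds is only the answerer's example; pick a value that matches your own ingest rate):
# roll, and therefore close and rename, the current HDFS file every 200 seconds
agent.sinks.k.hdfs.rollInterval = 200
# optionally also close files that simply stop receiving events
agent.sinks.k.hdfs.idleTimeout = 60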

Flume taking time to copy data into HDFS when rolling based on file size

I have a use case where I want to copy remote files into HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I am rolling files based on size (128/256 MB). But for copying a file from the remote machine and storing it into HDFS (file size 128/256 MB), Flume is taking an average of 2 minutes.
Flume Configuration:
Agent on the remote machine (spooling directory source, Avro sink):
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx #IP Address destination machine
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Agent on the machine where HDFS is running (Avro source, HDFS sink):
### Agent1 - Avro Source and File Channel, HDFS Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between both machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternative configuration, so that the copying doesn't take so much time?
Both agents use a file channel, so before the data is written to HDFS it has already been written to disk twice. You can try using a memory channel for each agent to see if the performance improves.
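A minimal sketch of that change on the HDFS-side agent, keeping the existing channel name so the source and sink bindings stay untouched (the capacities are illustrative and must fit in the agent's JVM heap; unlike a file channel, events held in a memory channel are lost if the agent dies):
Agent1.channels.file-channel1.type = memory
Agent1.channels.file-channel1.capacity = 1000000
Agent1.channels.file-channel1.transactionCapacity = 100000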

Apache Flume spoolDirectory configuration is failing

I am using the following configuration to write the files in my source directory to HDFS.
# Initialize agent's source, channel and sink
agent.sources = test
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.test.type = spooldir
agent.sources.test.spoolDir = /Data
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactionCapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = /user/team
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# never roll over based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# roll over the file every 2000 seconds
agent.sinks.flumeHDFS.hdfs.rollInterval = 2000
# never roll over based on file size
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize = 1000000
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
But I am getting the following error:
Failed to configure component!
org.apache.flume.conf.ConfigurationException: Failed to configure component!
        at org.apache.flume.conf.source.SourceConfiguration.configure(SourceConfiguration.java:110)
        at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(FlumeConfiguration.java:566)
        at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.isValid(FlumeConfiguration.java:345)
        at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.access$000(FlumeConfiguration.java:212)
        at org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfiguration.java:126)
        at org.apache.flume.conf.FlumeConfiguration.<init>(FlumeConfiguration.java:108)
        at org.apache.flume.node.PropertiesFileConfigurationProvider.getFlumeConfiguration(PropertiesFileConfigurationProvider.java:193)
        at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(AbstractConfigurationProvider.java:94)
        at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:140)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.flume.conf.ConfigurationException: No channels set for test
        at org.apache.flume.conf.source.SourceConfiguration.configure(SourceConfiguration.java:68)
        ... 15 more
Can anyone help me with what I should do to get my data from the source directory /Data into the HDFS directory /user/team?
The stack trace mentions:
No channels set for test
You have declared your source as test:
agent.sources = test
But when connecting it to the channel you wrote:
agent.sources.TwitterExampleDir.channels = memoryChannel
So you have to use test instead of TwitterExampleDir.
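In other words, the binding line should use the source name that was actually declared:
agent.sources.test.channels = memoryChannel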

Configuring flume to not generate .tmp files when sinking data to hdfs

I am using Flume to stream data into HDFS from server logs. But while the data is being streamed into HDFS, it first creates a .tmp file. Is there a way in the configuration for .tmp files to be hidden, or for their names to be changed by prepending a dot? My collection agent file looks like this:
## TARGET AGENT ##
## configuration file location: /etc/flume/conf
## START Agent: flume-ng agent -c conf -f /etc/flume/conf/flume-trg-agent.conf -n collector
#http://flume.apache.org/FlumeUserGuide.html#avro-source
collector.sources = AvroIn
collector.sources.AvroIn.type = avro
collector.sources.AvroIn.bind = 0.0.0.0
collector.sources.AvroIn.port = 4545
collector.sources.AvroIn.channels = mc1 mc2
## Channels ##
## Source writes to 2 channels, one for each sink
collector.channels = mc1 mc2
#http://flume.apache.org/FlumeUserGuide.html#memory-channel
collector.channels.mc1.type = memory
collector.channels.mc1.capacity = 100
collector.channels.mc2.type = memory
collector.channels.mc2.capacity = 100
## Sinks ##
collector.sinks = LocalOut HadoopOut
## Write copy to Local Filesystem
#http://flume.apache.org/FlumeUserGuide.html#file-roll-sink
#collector.sinks.LocalOut.type = file_roll
#collector.sinks.LocalOut.sink.directory = /var/log/flume
#collector.sinks.LocalOut.sink.rollInterval = 0
#collector.sinks.LocalOut.channel = mc1
## Write to HDFS
#http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
collector.sinks.HadoopOut.type = hdfs
collector.sinks.HadoopOut.channel = mc2
collector.sinks.HadoopOut.hdfs.path = /user/root/flume-channel/%{log_type}
collector.sinks.k1.hdfs.filePrefix = events-
collector.sinks.HadoopOut.hdfs.fileType = DataStream
collector.sinks.HadoopOut.hdfs.writeFormat = Text
collector.sinks.HadoopOut.hdfs.rollSize = 1000000
Any help will be appreciated.
All files which Flume has open for writing carry the .tmp extension by default. You can change this to another suffix, but you cannot avoid having one altogether; it is needed to distinguish in-progress files from closed ones.
So it is better to use a prefix such as "." for open files, so that they are hidden. The Flume HDFS sink offers these parameters:
hdfs.inUsePrefix (default empty) – prefix used for temporary files that Flume is actively writing into
hdfs.inUseSuffix (default .tmp) – suffix used for temporary files that Flume is actively writing into
For example, to hide in-progress files behind a leading dot:
collector.sinks.HadoopOut.hdfs.inUsePrefix = .
For hdfs.inUseSuffix, if it is left blank the default .tmp is used; otherwise the suffix you specify is used.
Also set hdfs.idleTimeout = x, where x is a positive number of seconds, so that idle files are closed and renamed automatically.
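Applied to the sink in this question, a hedged sketch (the 60-second timeout is an illustrative assumption):
# hide in-progress files behind a leading dot; they keep the default .tmp suffix
collector.sinks.HadoopOut.hdfs.inUsePrefix = .
# close and rename files that have received no events for 60 seconds
collector.sinks.HadoopOut.hdfs.idleTimeout = 60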

sink.hdfs writer adds garbage in my text file

I have successfully configured Flume to transfer text files from a local folder to HDFS. My problem is that when a file is transferred into HDFS, some unwanted text ("hdfs.write.Longwriter" plus binary characters) is prefixed to my text file.
Here is my flume.conf:
agent.sources = flumedump
agent.channels = memoryChannel
agent.sinks = flumeHDFS
agent.sources.flumedump.type = spooldir
agent.sources.flumedump.spoolDir = /opt/test/flume/flumedump/
agent.sources.flumedump.channels = memoryChannel
# Each sink's type must be defined
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = hdfs://bigdata.ibm.com:9000/user/vin
agent.sinks.flumeHDFS.fileType = DataStream
#Format to be written
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 10
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollSize = 10485760
# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 min
agent.sinks.flumeHDFS.hdfs.rollInterval = 60
#Specify the channel the sink should use
agent.sinks.flumeHDFS.channel = memoryChannel
# Each channel's type is defined.
agent.channels.memoryChannel.type = memory
# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 100
My source text file is very simple, containing the text:
Hi My name is Hadoop and this is file one.
The sink file I get in hdfs looks like this :
SEQ !org.apache.hadoop.io.LongWritable org.apache.hadoop.io.Text������5����>I <4 H�ǥ�+Hi My name is Hadoop and this is file one.
Please let me know what I am doing wrong.
Figured it out.
I had to fix this line:
agent.sinks.flumeHDFS.fileType = DataStream
and change it to:
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
This fixed the issue: without the hdfs. prefix the fileType setting is ignored, so the sink falls back to its default SequenceFile format, which is what produces the LongWritable header and binary characters.
