How to put files in flume spooldir one by one?

I am using flume spooldir to put files in HDFS, but I am getting so many small files in HDFS. I thought of using batch size and roll interval, but I don't want to get dependent on size and interval. So I decided to push files in flume spooldir one at a time. How can I do this?

According to https://flume.apache.org/FlumeUserGuide.html#spooling-directory-source, setting a1.sources.src-1.fileHeader = true adds a header holding the absolute path of the source file to every event, and basenameHeader = true adds a header holding just the file name. You can then reference any event header in the HDFS sink path with the %{header} escape sequence (see %{host} in the escape sequence description at https://flume.apache.org/FlumeUserGuide.html#hdfs-sink).
EDIT:
For an example config, you can try the following:
a1.sources = r1
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /flumespool
a1.sources.r1.basenameHeader = true
a1.channels = c1
a1.channels.c1.type = memory
a1.sinks = k1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flumeout/%{basename}
a1.sinks.k1.hdfs.fileType = DataStream
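With the HDFS sink's default roll settings, the sink would still split output by time, size and event count. To get closer to one HDFS file per spooled file, one option is to switch off all three roll triggers and let the sink close a file only after it has gone idle. The lines below are a minimal sketch that extends the config above. The 60-second idle timeout is an assumed value, not part of the original answer, and should be longer than any pause that can occur while a single file is being read.

```properties
# Disable rolling by time, size and event count (0 turns each trigger off)
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
# Assumed value: close a file after 60 seconds without new events
a1.sinks.k1.hdfs.idleTimeout = 60
```

Because the path contains %{basename}, events from different input files land in different directories, so each output file only ever receives events from one input file.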

Related

flume taking time to copy data into hdfs when rolling based on file size

I have a use case where I want to copy remote files into HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro sink and source to copy the remote data into HDFS, and on the HDFS sink side I roll files by size (128/256 MB). However, Flume takes an average of 2 minutes to copy a file from the remote machine and store it in HDFS (file size 128/256 MB).
Flume Configuration:
Avro Source(Remote Machine)
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
# IP address of the destination machine
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Avro Sink(Machine where hdfs running)
### Agent1 - Avro Source and File Channel, HDFS Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between the two machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternative configuration, so that the copying doesn't take so much time?
Both agents use a file channel, so the data is written to disk twice before it reaches HDFS. You can try a memory channel on each agent to see whether performance improves.
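As a rough sketch of that suggestion, the channel sections of the two agents could be swapped as shown below. The channel name mem-channel and the capacity values are assumptions chosen for illustration. transactionCapacity is kept at least as large as the sink batch size used in the question. Keep in mind that a memory channel loses any buffered events if the agent stops or crashes.

```properties
# Remote machine (spooldir source -> avro sink)
Agent1.channels = mem-channel
Agent1.channels.mem-channel.type = memory
# Assumed sizes; tune to the available heap
Agent1.channels.mem-channel.capacity = 1000000
Agent1.channels.mem-channel.transactionCapacity = 50000
Agent1.sources.spooldir-source.channels = mem-channel
Agent1.sinks.avro-sink.channel = mem-channel

# HDFS machine (avro source -> hdfs sink), in that agent's own config file
Agent1.channels = mem-channel
Agent1.channels.mem-channel.type = memory
Agent1.channels.mem-channel.capacity = 1000000
Agent1.channels.mem-channel.transactionCapacity = 100000
Agent1.sources.avro-source1.channels = mem-channel
Agent1.sinks.hdfs-sink1.channel = mem-channel
```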

Flume adding line feed after 2048 characters in a row

I have a Flume 1.5 agent running on an Ubuntu workstation that collects logs from various devices and reformats them into a comma-delimited file with very long rows. After collection and reformatting, the logs are placed into a spool directory, from which the Flume agent sends each log file to a Hadoop server. A second Flume agent on that server accepts the log files and places them in an HDFS directory.
Everything works fine, except that when Flume writes the file to the HDFS directory there is a line feed after every 2048 characters in each row.
Below are my Flume config files.
Is there a setting to tell Flume not to insert line feeds?
#On Ubuntu Workstation
#list sources, sinks and channels in the agent
agent.sources = axon_source
agent.channels = memorychannel
agent.sinks = AvroOut
#define flow
agent.sources.axon_source.channels = memorychannel
agent.sinks.AvroOut.channel = memorychannel
agent.channels.memorychannel.type = memory
agent.channels.memorychannel.capacity = 100000
#source
agent.sources.axon_source.type = spooldir
agent.sources.axon_source.spoolDir = /home/ubuntu/workspace/logdump
agent.sources.axon_source.decodeErrorPolicy = ignore
#avro out
agent.sinks.AvroOut.type = avro
agent.sinks.AvroOut.hostname = 172.31.12.221
agent.sinks.AvroOut.port = 41415
agent.sinks.AvroOut.maxIoWorkers = 2
------------------------------------------------------------
#On Hadoop Server
agent.sources = AvroIn
agent.sources.AvroIn.type = avro
agent.sources.AvroIn.bind = 172.31.131.1
agent.sources.AvroIn.port = 41415
agent.sources.AvroIn.channels = MemChan1
agent.channels = MemChan1
agent.channels.MemChan1.type = memory
agent.channels.MemChan1.capacity = 100000
agent.sinks = HDFSSink
agent.sinks.HDFSSink.type = hdfs
agent.sinks.HDFSSink.channel = MemChan1
agent.sinks.HDFSSink.hdfs.path = /Logs/%Y%m/
agent.sinks.HDFSSink.hdfs.filePrefix = axoncapture
agent.sinks.HDFSSink.hdfs.fileSuffix = .log
agent.sinks.HDFSSink.hdfs.minBlockReplicas = 1
agent.sinks.HDFSSink.hdfs.rollCount = 0
agent.sinks.HDFSSink.hdfs.rollSize = 314572800
agent.sinks.HDFSSink.hdfs.writeFormat = Text
agent.sinks.HDFSSink.hdfs.fileType = DataStream
agent.sinks.HDFSSink.hdfs.useLocalTimeStamp = True
Found the answer to my question:
The default maxLineLength for the LINE deserializer is 2048:
http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#line
Adding this line to my flume.conf file fixed the problem:
agent.sources.axon_source.deserializer.maxLineLength=60000

Flume sinks data in inconsistent fashion

I have a problem. I am using Apache Flume to read logs from a text file and sink them to HDFS, but some records are skipped while reading. I am using a file channel; please check the configuration below.
agent2.sources = file_server
agent2.sources.file_server.type=exec
agent2.sources.file_server.command = tail -F /home/datafile/error.log
agent2.sources.file_server.channels = fileChannel
agent2.channels = fileChannel
agent2.channels.fileChannel.type=file
agent2.channels.fileChannel.capacity = 12000
agent2.channels.fileChannel.transactionCapacity = 10000
agent2.channels.fileChannel.checkpointDir=/home/data/flume/checkpoint
agent2.channels.fileChannel.dataDirs=/home/data/flume/data
# Agent2 sinks
agent2.sinks = hadooper loged
agent2.sinks.hadooper.type = hdfs
agent2.sinks.loged.type=logger
agent2.sinks.hadooper.hdfs.path = hdfs://localhost:8020/flume/data/file
agent2.sinks.hadooper.hdfs.fileType = DataStream
agent2.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.rollInterval = 600
agent2.sinks.hadooper.hdfs.rollCount = 0
agent2.sinks.hadooper.hdfs.rollSize = 67108864
agent2.sinks.hadooper.hdfs.batchSize = 10
agent2.sinks.hadooper.hdfs.idleTimeout=0
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = fileChannel
agent2.sinks.hadooper.hdfs.threadsPoolSize = 20
Please help.
I think the problem is that you have two sinks reading from a single channel. In that case, a Flume event taken by one sink is not seen by the other, and vice versa.
If you want both sinks to receive a copy of every Flume event, create a dedicated channel for each sink. Once these channels exist, the default channel selector, ReplicatingChannelSelector, puts a copy of each event into every channel.
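A minimal sketch of that layout for the config in the question might look like the following. The channel names hdfsChannel and logChannel, the choice of a memory channel for the logger sink, and the directory paths are assumptions made for illustration. The source must list both channels so that the replicating selector can copy each event into each of them.

```properties
# One channel per sink (names are hypothetical)
agent2.channels = hdfsChannel logChannel
agent2.channels.hdfsChannel.type = file
agent2.channels.hdfsChannel.checkpointDir = /home/data/flume/hdfs/checkpoint
agent2.channels.hdfsChannel.dataDirs = /home/data/flume/hdfs/data
agent2.channels.logChannel.type = memory

# The source writes to both channels; replicating is the default selector
agent2.sources.file_server.channels = hdfsChannel logChannel
agent2.sources.file_server.selector.type = replicating

# Each sink drains its own channel
agent2.sinks.hadooper.channel = hdfsChannel
agent2.sinks.loged.channel = logChannel
```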

Flume loses data when collecting online data to HDFS

I am using flume-ng 1.5 to collect logs.
There are two agents in the data flow, each on its own host.
Data is sent from agent1 to agent2.
The agents' components are as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to lose about 1/1000 of the data (out of millions of records).
To track down the problem I tried these steps:
Checked the agents' logs: no errors or exceptions.
Checked the agents' monitoring metrics: the number of events put into and taken from each channel is always equal.
Counted the records with a Hive query and with shell commands on the HDFS files: the two counts match each other, but both are lower than the number of records produced online.
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
There is a bug in the line deserializer when it encounters certain Unicode characters whose code points lie between U+10000 and U+10FFFF. In UTF-16 these characters are represented by two 16-bit code units, called a surrogate pair.

Flume posts data into HDFS, but with character issues

Below is my Flume configuration.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
a1.sources.r1.type = http
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1
a1.sources.r1.handler = org.apache.flume.source.http.JSONHandler
a1.sources.r1.handler.nickname = random props
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://10.0.40.18:9160/flume-test
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
There is no error in the Flume log file, but I have an issue with the characters when reading the file with the hadoop command:
hadoop fs -cat hdfs://10.0.40.18:9160/flume-test/even1393415633931
The Flume log message says the HDFS file created is "hdfs://10.0.40.18:9160/flume-test/even1393415633931".
Any help is appreciated.
First, try replacing the HDFS sink with a logger sink to see whether your input is arriving correctly.
Once that is confirmed, I would recommend adjusting the flush settings for the sink. The HDFS sink batches events before flushing them to HDFS, controlled by hdfs.batchSize, which defaults to 100. This is probably the issue, as you will need to send 100 JSON posts before your output flushes for the first time.
Lastly, you may also want to try changing hdfs.writeFormat, which defaults to Writable rather than Text.
It sounds like you want a text file. The default hdfs.fileType is SequenceFile, so you should switch it to DataStream like this:
a1.sinks.k1.hdfs.fileType = DataStream
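Put together, the suggestions in this answer could be sketched as the sink section below. The batch size of 1 is an assumed value meant only for testing with a handful of HTTP posts. Note that the property name is hdfs.fileType, with no dot between "file" and "Type".

```properties
# Step 1 (temporary check): uncomment to log events instead of writing to HDFS
# a1.sinks.k1.type = logger

# Steps 2 and 3: plain-text output, flushed after every event
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# Assumed test value; the default is 100
a1.sinks.k1.hdfs.batchSize = 1
```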
