Flume + Kafka + HDFS: Split messages - hadoop

I have the following Flume agent configuration to read messages from a Kafka source and write them to an HDFS sink:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 192.168.0.100:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = 192.168.0.100:9092
tier1.channels.channel1.topic = test
tier1.channels.channel1.zookeeperConnect = 192.168.0.100:2181/kafka
tier1.channels.channel1.parseAsFlumeEvent = false
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.writeFormat = Text
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.filePrefix = test-kafka
tier1.sinks.sink1.hdfs.fileSuffix = .avro
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%y-%m-%d
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.rollSize=0
The Kafka message content is Avro data, which is properly serialized into a file as long as only one Kafka message arrives per polling period.
When two Kafka messages arrive in the same batch, they are grouped into the same HDFS file. Since each Avro message contains both schema and data, the resulting file contains schema + data + schema + data, which makes it an invalid .avro file.
How can I split the batch so that each Kafka message is written to a separate file?
Thank you
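One possible workaround, sketched here as an assumption rather than a confirmed fix (it relies on every Kafka message being a complete, self-contained Avro container), is to make the HDFS sink roll a new file for every event:

# sketch: roll a new HDFS file per event so each Avro message gets its own file
tier1.sinks.sink1.hdfs.rollCount = 1
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollInterval = 0
tier1.sinks.sink1.hdfs.batchSize = 1

The trade-off is one (potentially tiny) HDFS file per message, which can put pressure on the NameNode if the message rate is high.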

One approach: let's say you call the topic carrying your incoming Kafka data 'SourceTopic'. You can register a custom sink against this 'SourceTopic':
<FlumeNodeRole>.sinks.<your-sink>.type = net.my.package.CustomSink
In your CustomSink, you can write a method that differentiates the incoming messages, splits them, and resends them to a different 'DestinationTopic'. This 'DestinationTopic' can then act as a new Flume source for your file serialization.
Refer to the link below for pipelining Flume:
https://flume.apache.org/FlumeUserGuide.html
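As a rough sketch of that two-hop layout (the topic names, agent names and the sink class are placeholders taken from the answer above, not a tested configuration):

# Hop 1: read 'SourceTopic'; the custom sink splits each event and
# republishes single messages to 'DestinationTopic'
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 192.168.0.100:2181
tier1.sources.source1.topic = SourceTopic
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
tier1.sinks.sink1.type = net.my.package.CustomSink
tier1.sinks.sink1.channel = channel1
# Hop 2: read 'DestinationTopic' (now one Avro message per event) and write to HDFS
tier2.sources = source2
tier2.channels = channel2
tier2.sinks = sink2
tier2.sources.source2.type = org.apache.flume.source.kafka.KafkaSource
tier2.sources.source2.zookeeperConnect = 192.168.0.100:2181
tier2.sources.source2.topic = DestinationTopic
tier2.sources.source2.channels = channel2
tier2.channels.channel2.type = memory
tier2.sinks.sink2.type = hdfs
tier2.sinks.sink2.hdfs.fileType = DataStream
tier2.sinks.sink2.hdfs.path = /tmp/kafka/%y-%m-%d
tier2.sinks.sink2.hdfs.useLocalTimeStamp = true
tier2.sinks.sink2.channel = channel2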

Related

Data loss (skipping) using Flume with Kafka source and HDFS sink

I am experiencing data loss (skipped chunks of time in the data) when pulling data off a Kafka topic as a source and putting it into an HDFS file (DataStream) as a sink. The skips seem to come in 10-, 20- or 30-minute blocks of data. I have verified that the skipped data is in the topic .log file that Kafka generates. (The original data comes from a syslog, goes through a different Flume agent and is put into the Kafka topic; the data loss isn't happening there.)
I find it interesting and unusual that the blocks of skipped data are always 10, 20 or 30 minutes long and happen at least once an hour in my data.
Here is a copy of my configuration file:
a1.sources = kafka-source
a1.channels = memory-channel
a1.sinks = hdfs-sink
a1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-source.zookeeperConnect = 10.xx.x.xx:xxxx
a1.sources.kafka-source.topic = firewall
a1.sources.kafka-source.groupId = flume
a1.sources.kafka-source.channels = memory-channel
a1.channels.memory-channel.type = memory
a1.channels.memory-channel.capacity = 100000
a1.channels.memory-channel.transactionCapacity = 1000
a1.sinks.hdfs-sink.type = hdfs
a1.sinks.hdfs-sink.hdfs.fileType = DataStream
a1.sinks.hdfs-sink.hdfs.path = /topics/%{topic}/%m-%d-%Y
a1.sinks.hdfs-sink.hdfs.filePrefix = firewall1
a1.sinks.hdfs-sink.hdfs.fileSuffix = .log
a1.sinks.hdfs-sink.hdfs.rollInterval = 86400
a1.sinks.hdfs-sink.hdfs.rollSize = 0
a1.sinks.hdfs-sink.hdfs.rollCount = 0
a1.sinks.hdfs-sink.hdfs.maxOpenFiles = 1
a1.sinks.hdfs-sink.channel = memory-channel
Any insight would be helpful. I have been searching online for answers for a while.
Thanks.
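No answer is recorded for this question, but one hedged observation on the configuration above: a memory channel holds events only in RAM, so anything buffered is lost if the agent restarts or crashes. As a first diagnostic step (a sketch only, not a verified explanation of the 10/20/30-minute gaps; the checkpoint and data directories are placeholders), the memory channel could be swapped for a durable file channel:

# sketch: durable file channel instead of the memory channel
a1.channels = file-channel
a1.channels.file-channel.type = file
# placeholder paths; point these at local disk with enough space
a1.channels.file-channel.checkpointDir = /var/lib/flume/checkpoint
a1.channels.file-channel.dataDirs = /var/lib/flume/data
a1.channels.file-channel.transactionCapacity = 1000
a1.sources.kafka-source.channels = file-channel
a1.sinks.hdfs-sink.channel = file-channel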

flume taking time to copy data into hdfs when rolling based on file size

I have a use case where I want to copy remote files into HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I am rolling by file size (128/256 MB). But copying a file from the remote machine and storing it in HDFS (at a 128/256 MB file size) takes Flume an average of 2 minutes.
Flume Configuration:
Avro Source(Remote Machine)
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx #IP Address destination machine
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Avro Sink(Machine where hdfs running)
### Agent1 - Avro Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between the two machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternative configuration, so that the copying doesn't take so much time?
Both agents use a file channel, so the data is written to disk twice before it reaches HDFS. You can try using a memory channel for each agent to see if performance improves.
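A minimal sketch of that change on the HDFS-side agent (the capacity numbers are placeholders to be tuned against the available heap; transactionCapacity must stay at least as large as the sink's batchSize of 50000):

# sketch: memory channel instead of the file channel on the receiving agent
Agent1.channels = memory-channel1
Agent1.channels.memory-channel1.type = memory
Agent1.channels.memory-channel1.capacity = 1000000
Agent1.channels.memory-channel1.transactionCapacity = 50000
Agent1.sources.avro-source1.channels = memory-channel1
Agent1.sinks.hdfs-sink1.channel = memory-channel1

The same swap can be made on the sending agent. Keep in mind that a memory channel trades durability for speed: events still buffered in memory are lost if the agent stops.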

Flume to Kafka to Elasticsearch Integration

I am passing some data from Flume to Kafka. My Flafka config file looks like this:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = exec
tier1.sources.source1.command = cat /testing.txt
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = sink1
tier1.sinks.sink1.brokerList = kafkagames-1:9092,kafkagames-2:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20
I have connected Kafka and Elasticsearch using the Kafka river plugin. The Flume source is sending data to the Kafka consumer, whereas Elasticsearch is expecting data from the Kafka producer.
Is there a way I can push the data from Flume to the Kafka producer rather than going directly to the Kafka consumer, so that Elasticsearch can read the data?
Any advice? Thanks.

Flume sinks data in an inconsistent fashion

I have a problem. I am using Apache Flume to read logs from a txt file and sink them to HDFS. Somehow some records are getting skipped while reading. I am using a file channel; please check the configuration below.
agent2.sources = file_server
agent2.sources.file_server.type=exec
agent2.sources.file_server.command = tail -F /home/datafile/error.log
agent2.sources.file_server.channels = fileChannel
agent2.channels = fileChannel
agent2.channels.fileChannel.type=file
agent2.channels.fileChannel.capacity = 12000
agent2.channels.fileChannel.transactionCapacity = 10000
agent2.channels.fileChannel.checkpointDir=/home/data/flume/checkpoint
agent2.channels.fileChannel.dataDirs=/home/data/flume/data
# Agent2 sinks
agent2.sinks = hadooper loged
agent2.sinks.hadooper.type = hdfs
agent2.sinks.loged.type=logger
agent2.sinks.hadooper.hdfs.path = hdfs://localhost:8020/flume/data/file
agent2.sinks.hadooper.hdfs.fileType = DataStream
agent2.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.rollInterval = 600
agent2.sinks.hadooper.hdfs.rollCount = 0
agent2.sinks.hadooper.hdfs.rollSize = 67108864
agent2.sinks.hadooper.hdfs.batchSize = 10
agent2.sinks.hadooper.hdfs.idleTimeout=0
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = fileChannel
agent2.sinks.hadooper.hdfs.threadsPoolSize = 20
Please help.
I think the problem is that you are using two sinks that both read from a single channel; in that case, a Flume event read by one of the two sinks is not read by the other one, and vice versa.
If you want both sinks to receive a copy of the same Flume event, then you will need to create a dedicated channel for each sink. Once these channels are created, the default channel selector, which is ReplicatingChannelSelector, will put a copy of each event into each channel.
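A minimal sketch of that layout for the configuration above (channel names and directories are illustrative):

# one dedicated channel per sink; the default replicating selector
# copies every event from the source into both channels
agent2.channels = fileChannel1 fileChannel2
agent2.channels.fileChannel1.type = file
agent2.channels.fileChannel1.checkpointDir = /home/data/flume/checkpoint1
agent2.channels.fileChannel1.dataDirs = /home/data/flume/data1
agent2.channels.fileChannel2.type = file
agent2.channels.fileChannel2.checkpointDir = /home/data/flume/checkpoint2
agent2.channels.fileChannel2.dataDirs = /home/data/flume/data2
agent2.sources.file_server.channels = fileChannel1 fileChannel2
agent2.sinks.hadooper.channel = fileChannel1
agent2.sinks.loged.channel = fileChannel2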

Flume loses data when collecting online data to HDFS

I use flume-ng version 1.5 to collect logs.
There are two agents in the data flow, and they run on two separate hosts.
The data is sent from agent1 to agent2.
The agents' components are as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to lose data, roughly 1 in every 1,000 records out of about a million.
To diagnose the problem I tried these steps:
Looked at the agents' logs: I cannot find any error or exception.
Looked at the agents' monitoring metrics: the number of events put into and taken from the channel is always equal.
Counted the records with a Hive query and by inspecting the HDFS files from the shell, respectively: the two numbers are equal, and both are less than the number of online records.
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
There is a bug in the file LINE deserializer when it encounters certain UTF characters whose code points lie between U+10000 and U+10FFFF; these are represented in UTF-16 by two 16-bit code units called surrogate pairs.
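If the input may contain such characters, one hedged mitigation on the spooling-directory source is to pin the input charset and tell the source how to handle bytes it cannot decode (these properties exist on the spooldir source; whether they work around this particular deserializer bug is an assumption):

# sketch: be explicit about the charset and about undecodable bytes
agent1.sources.src_spooldir.inputCharset = UTF-8
# REPLACE substitutes the Unicode replacement character instead of failing
agent1.sources.src_spooldir.decodeErrorPolicy = REPLACE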
