Flume to Kafka to Elasticsearch Integration

I am passing some data from Flume to Kafka. My Flafka config file looks like this:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = exec
tier1.sources.source1.command = cat /testing.txt
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = org.apache.flume.sink.kafka.KafkaSink
tier1.sinks.sink1.topic = sink1
tier1.sinks.sink1.brokerList = kafkagames-1:9092,kafkagames-2:9092
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.batchSize = 20
I have connected Kafka and Elasticsearch using the Kafka river plugin. The Flume source is sending data to a Kafka consumer, whereas Elasticsearch is expecting data from a Kafka producer.
Is there a way I can push the data to a Kafka producer from Flume, rather than going directly to a Kafka consumer, so Elasticsearch can read the data?
Any advice? Thanks.
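For reference, the KafkaSink configured above already acts as a Kafka producer: it publishes every Flume event to the sink1 topic, and any consumer subscribed to that topic (including the Elasticsearch river) can read the data from there. As a quick check that events are reaching the topic, assuming ZooKeeper is reachable at kafkagames-1:2181 (an assumption, adjust for your cluster), you can run the console consumer:
bin/kafka-console-consumer.sh --zookeeper kafkagames-1:2181 --topic sink1 --from-beginning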

Related

FlumeData file not getting created in HDFS sink

I am trying to ingest real-time data using Kafka as the source and Flume as the sink. The sink type is HDFS. My producer is working fine, I can see the data being produced, and my agent is running fine (no errors while running the command), but the file is not getting generated in the specified directory.
Command for Starting flume agent:
/usr/hdp/2.5.0.0-1245/flume/bin/flume-ng agent -c /usr/hdp/2.5.0.0-1245/flume/conf -f /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs.conf -n tier1
And my flume-hdfs.conf file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = data_1
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = localhost:6667
tier1.channels.channel1.zookeeperConnect = localhost:2181
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/user_name/FLUME_LOGS/
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
I am not able to find out what is wrong with the execution.
Please suggest how to overcome this problem.
Set the path of the HDFS sink this way:
tier1.sinks.sink1.hdfs.path = <value of fs.default.name from core-site.xml>/user/user_name/FLUME_LOGS/
For example:
tier1.sinks.sink1.hdfs.path = hdfs://localhost:54310/user/user_name/FLUME_LOGS/
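For reference, the fs.default.name value (fs.defaultFS on newer Hadoop releases) is defined in core-site.xml; a typical entry looks like this (host and port are illustrative):
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>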

Flume taking time to copy data into HDFS when rolling based on file size

I have a use case where I want to copy remote files into HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I am rolling files by size (128/256 MB). However, copying a file from the remote machine and storing it into HDFS (at a 128/256 MB file size) takes Flume an average of 2 minutes.
Flume Configuration:
Agent on the remote machine (spooling directory source, Avro sink):
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
# IP address of the destination machine
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Agent on the machine where HDFS runs (Avro source, HDFS sink):
### Agent1 - Avro Source and File Channel, HDFS Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between both machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternate configuration, so that the copying doesn't take so much time?
Both agents use a file channel, so before the data reaches HDFS it is written to disk twice. You can try a memory channel for each agent to see whether performance improves.
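As a minimal sketch (keeping the existing channel name, capacities are illustrative), the channel definition on the HDFS-side agent would change to:
Agent1.channels.file-channel1.type = memory
Agent1.channels.file-channel1.capacity = 100000
Agent1.channels.file-channel1.transactionCapacity = 100000
Keep in mind that a memory channel trades durability for speed: events still buffered in the channel are lost if the agent dies.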

Flume + Kafka + HDFS: Split messages

I have the following Flume agent configuration to read messages from a Kafka source and write them back to an HDFS sink:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 192.168.0.100:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 100
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = 192.168.0.100:9092
tier1.channels.channel1.topic = test
tier1.channels.channel1.zookeeperConnect = 192.168.0.100:2181/kafka
tier1.channels.channel1.parseAsFlumeEvent = false
tier1.sinks.sink1.channel = channel1
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.writeFormat = Text
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.hdfs.filePrefix = test-kafka
tier1.sinks.sink1.hdfs.fileSuffix = .avro
tier1.sinks.sink1.hdfs.useLocalTimeStamp = true
tier1.sinks.sink1.hdfs.path = /tmp/kafka/%y-%m-%d
tier1.sinks.sink1.hdfs.rollCount=0
tier1.sinks.sink1.hdfs.rollSize=0
The Kafka message content is Avro data, which is properly serialized into a file if only one Kafka message arrives per polling period.
When two Kafka messages arrive in the same batch, they are grouped into the same HDFS file. Since each Avro message contains both schema and data, the resulting file contains schema + data + schema + data, which makes it an invalid .avro file.
How can I split the Avro events so that each Kafka message is written to a different file?
Thank you
One approach: let's say the topic carrying your incoming Kafka data is called 'SourceTopic'. You can register a custom sink on this 'SourceTopic':
<FlumeNodeRole>.sinks.<your-sink>.type = net.my.package.CustomSink
In your CustomSink, you can write a method to differentiate the incoming messages, split them, and resend them to a different 'DestinationTopic'. This 'DestinationTopic' can then act as a new Flume source for your file serialization.
Refer to the link below for pipelining Flume:
https://flume.apache.org/FlumeUserGuide.html
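A minimal skeleton of such a custom sink, following the standard Flume Sink API (the class name and the splitting/forwarding logic are placeholders to fill in):
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class CustomSink extends AbstractSink implements Configurable {
    @Override
    public void configure(Context context) {
        // Read sink properties here, e.g. the name of the 'DestinationTopic'.
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction txn = channel.getTransaction();
        txn.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                txn.commit();
                return Status.BACKOFF;
            }
            // Placeholder: inspect event.getBody(), split it into individual
            // Avro records and forward each record to the 'DestinationTopic'
            // with your own Kafka producer.
            txn.commit();
            return Status.READY;
        } catch (Throwable t) {
            txn.rollback();
            return Status.BACKOFF;
        } finally {
            txn.close();
        }
    }
}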

Flume - Could not configure sink - No channel configured for sink

I have configured Flume to read log files and write to HDFS. When I start Flume, the log files are read but nothing is written to HDFS. flume.log has the warning message 'could not configure sink - no channel configured for sink', but I have already assigned a channel to the sink in the conf file.
Given below is the conf-file and error message:
File: spool-to-hdfs.properties
# List all components.
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
# Describe source.
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir =/Suriya/flume/input_files
# Describe channel
#agent1.channels.channel1.type = file
#agent1.channels.channel1.checkpointDir = /Suriya/flume/checkpointDir
#agent1.channels.channel1.dataDirs =/Suriya/flume/dataDirs
agent1.channels.channel1.type = memory
# Describe sink
agent1.sinks.sink1.type = hdfs
#agent1.sinks.sink1.hdfs.path = hdfs://sandbox.hortonworks.com:8020/hdfs/Suriya/flume
agent1.sinks.sink1.hdfs.path = hdfs://localhost/hdfs/Suriya/flume
agent1.sinks.sink1.hdfs.fileType= DataStream
agent1.sinks.sink1.hdfs.writeFormat = Text
# Bind source and sink to channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channels = channel1
Starting the agent:
flume-ng agent --conf-file spool-to-hdfs.properties --conf /etc/flume/conf --name agent1;
flume.log
03 Aug 2015 23:37:16,699 WARN [conf-file-poller-0] (org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSinks:697) - Could not configure sink sink1 due to: No channel configured for sink: sink1
org.apache.flume.conf.ConfigurationException: No channel configured for sink: sink1 at org.apache.flume.conf.sink.SinkConfiguration.configure(SinkConfiguration.java:51)
Replace the bind part with:
# Bind source and sink to channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
This bind config:
# Bind source and sink to channel
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channels = channel1
The line agent1.sources.source1.channels = channel1 looks okay, BUT
agent1.sinks.sink1.channels = channel1
should be
agent1.sinks.sink1.channel = channel1
(a source can feed several channels, hence the plural 'channels', whereas a sink reads from exactly one channel, hence the singular 'channel').
Let us know how it goes.

Flume sinks data in an inconsistent fashion

I have got a problem. I am using Apache Flume to read logs from a txt file and sink them to HDFS. Somehow some records are getting skipped while reading. I am using a file channel; please check the configuration below.
agent2.sources = file_server
agent2.sources.file_server.type=exec
agent2.sources.file_server.command = tail -F /home/datafile/error.log
agent2.sources.file_server.channels = fileChannel
agent2.channels = fileChannel
agent2.channels.fileChannel.type=file
agent2.channels.fileChannel.capacity = 12000
agent2.channels.fileChannel.transactionCapacity = 10000
agent2.channels.fileChannel.checkpointDir=/home/data/flume/checkpoint
agent2.channels.fileChannel.dataDirs=/home/data/flume/data
# Agent2 sinks
agent2.sinks = hadooper loged
agent2.sinks.hadooper.type = hdfs
agent2.sinks.loged.type=logger
agent2.sinks.hadooper.hdfs.path = hdfs://localhost:8020/flume/data/file
agent2.sinks.hadooper.hdfs.fileType = DataStream
agent1.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.rollInterval = 600
agent2.sinks.hadooper.hdfs.rollCount = 0
agent2.sinks.hadooper.hdfs.rollSize = 67108864
agent2.sinks.hadooper.hdfs.batchSize = 10
agent2.sinks.hadooper.hdfs.idleTimeout=0
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = fileChannel
agent2.sinks.hdfs.threadsPoolSize = 20
Please help.
I think the problem is you are using two sinks, both reading from a single channel; in that case, a Flume event read by one of the two sinks is not read by the other one, and vice versa.
If you want both sinks to receive a copy of the same Flume event, you will need to create a dedicated channel for each sink. Once these channels are created, the default channel selector, which is ReplicatingChannelSelector, will copy each event into every channel.
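A minimal sketch of that layout, assuming a memory channel named loggerChannel for the logger sink (the channel name and type are placeholders):
agent2.channels = fileChannel loggerChannel
agent2.channels.loggerChannel.type = memory
agent2.sources.file_server.channels = fileChannel loggerChannel
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = loggerChannel
With the source bound to both channels, the default replicating selector writes a copy of each event into each channel, and each sink drains its own channel independently.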
