I'm trying to create a new Flume agent with a spooldir source that picks up files and puts them into HDFS. This is my config file:
agent.sources = file
agent.channels = channel
agent.sinks = hdfsSink
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
# SINKS CONFIGURATION
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /HADOOP/PATH/%Y/%m/%d/%H/
agent.sinks.hdfsSink.hdfs.filePrefix = common
agent.sinks.hdfsSink.hdfs.fileSuffix = .json
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 5242880
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.maxOpenFiles = 2
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.callTimeout = 100000
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.channel = channel
# CHANNELS CONFIGURATION
agent.channels.channel.type = memory
agent.channels.channel.capacity = 10000
agent.channels.channel.transactionCapacity = 1000
I'm getting an error that says Expected timestamp in the Flume event headers, but it was null. The files I'm reading contain JSON, and each record has a field named timestamp.
Is there a way to add this timestamp to the event header?
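For context, the error appears because the %Y/%m/%d/%H escapes in hdfs.path are resolved from a timestamp event header, and nothing in the configuration above sets one. If processing time is acceptable instead of the event's own timestamp, a minimal workaround (a sketch, not necessarily what is wanted here) is to let the HDFS sink fall back to the agent's local clock:
# Workaround sketch: fill the time escapes from the sink host's clock
# (processing time) instead of requiring a timestamp header
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
To take the timestamp from the JSON body itself, see the interceptor-based answer further down.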
As per my earlier comment, I am now sharing the complete steps I followed for spooling a JSON file (with the file header enabled), putting it into the Hadoop HDFS cluster using Flume, creating an external Hive table over the JSON files, and later running a DML query over it.
Created flume-spool.conf:
# Flume configuration starts
erum.sources = source-1
erum.channels = file-channel-1
erum.sinks = hdfs-sink-1
erum.sources.source-1.channels = file-channel-1
erum.sinks.hdfs-sink-1.channel = file-channel-1
# Define a file channel called file-channel-1 on erum
erum.channels.file-channel-1.type = file
erum.channels.file-channel-1.capacity = 2000000
erum.channels.file-channel-1.transactionCapacity = 100000
# Define a source for erum
erum.sources.source-1.type = spooldir
erum.sources.source-1.bind = localhost
erum.sources.source-1.port = 44444
erum.sources.source-1.inputCharset = UTF-8
erum.sources.source-1.bufferMaxLineLength = 100
# Spool dir in my case is /home/arif/practice/flume_sink
erum.sources.source-1.spoolDir = /home/arif/practice/flume_sink/
erum.sources.source-1.fileHeader = true
erum.sources.source-1.fileHeaderKey = file
erum.sources.source-1.fileSuffix = .COMPLETED
# Sink is flume_import under hdfs
erum.sinks.hdfs-sink-1.pathManager = DEFAULT
erum.sinks.hdfs-sink-1.type = hdfs
erum.sinks.hdfs-sink-1.hdfs.filePrefix = common
erum.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
erum.sinks.hdfs-sink-1.hdfs.writeFormat = Text
erum.sinks.hdfs-sink-1.hdfs.fileType = DataStream
erum.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/user/arif/flume_sink/products/
erum.sinks.hdfs-sink-1.hdfs.batchSize = 1000
erum.sinks.hdfs-sink-1.hdfs.rollSize = 2684354560
erum.sinks.hdfs-sink-1.hdfs.rollInterval = 5
erum.sinks.hdfs-sink-1.hdfs.rollCount = 5000
Now we are running flume-spool using the agent erum:
bin/flume-ng agent -n erum -c conf -f conf/flume-spool.conf -Dflume.root.logger=DEBUG,console
Copied the products.json file into the directory specified by erum.sources.source-1.spoolDir in the Flume configuration.
The contents of the products.json file are as follows:
{"productid":"5968dd23fc13ae04d9000001","product_name":"sildenafilcitrate","mfgdate":"20160719031109","supplier":"WisozkInc","quantity":261,"unit_cost":"$10.47"}
{"productid":"5968dd23fc13ae04d9000002","product_name":"MountainJuniperusashei","mfgdate":"20161003021009","supplier":"Keebler-Hilpert","quantity":292,"unit_cost":"$8.74"}
{"productid":"5968dd23fc13ae04d9000003","product_name":"DextromathorphanHBr","mfgdate":"20161101041113","supplier":"Schmitt-Weissnat","quantity":211,"unit_cost":"$20.53"}
{"productid":"5968dd23fc13ae04d9000004","product_name":"MeophanHBr","mfgdate":"20161101061113","supplier":"Schmitt-Weissnat","quantity":198,"unit_cost":"$18.73"}
Download hive-serdes-sources-1.0.6.jar from the URL below:
https://www.dropbox.com/s/lsjgk2zaqz8uli9/hive-serdes-sources-1.0.6.jar?dl=0
After spooling the JSON file to the HDFS cluster using flume-spool, we start the Hive server, log in to the Hive shell, and then do the following:
hive> add jar /home/arif/applications/hadoop/apache-hive-2.1.1-bin/lib/hive-serdes-sources-1.0.6.jar;
hive> create external table products (productid string, product_name string, mfgdate string, supplier string, quantity int, unit_cost string)
> row format serde 'com.cloudera.hive.serde.JSONSerDe' location '/user/arif/flume_sink/products/';
OK
Time taken: 0.211 seconds
hive> select * from products;
OK
5968dd23fc13ae04d9000001 sildenafilcitrate 20160719031109 WisozkInc 261 $10.47
5968dd23fc13ae04d9000002 MountainJuniperusashei 20161003021009 Keebler-Hilpert 292 $8.74
5968dd23fc13ae04d9000003 DextromathorphanHBr 20161101041113 Schmitt-Weissnat 211 $20.53
5968dd23fc13ae04d9000004 MeophanHBr 20161101061113 Schmitt-Weissnat 198 $18.73
Time taken: 0.291 seconds, Fetched: 4 row(s)
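As an example of a further DML query over the same external table (a sketch that uses only the columns defined above; the results depend on your data):
hive> select supplier, count(*) as product_count, sum(quantity) as total_quantity
    > from products group by supplier;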
I completed all of these steps without a single error. I hope this helps you, thanks.
As explained in this post:
http://shzhangji.com/blog/2017/08/05/how-to-extract-event-time-in-apache-flume/
the change needed is to add an interceptor and a serializer to the source:
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
agent.sources.file.interceptors = i1
agent.sources.file.interceptors.i1.type = regex_extractor
agent.sources.file.interceptors.i1.regex = <regex_for_timestamp>
agent.sources.file.interceptors.i1.serializers = s1
agent.sources.file.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.file.interceptors.i1.serializers.s1.name = timestamp
agent.sources.file.interceptors.i1.serializers.s1.pattern = <pattern_that_matches_your_regex>
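For instance, if each JSON line carries something like "timestamp":"2017-08-05 12:34:56" (an assumed format, since the actual one isn't shown), the two placeholders could be filled in roughly like this:
agent.sources.file.interceptors.i1.regex = "timestamp"\\s*:\\s*"(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})"
agent.sources.file.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss
The serializer parses the captured group with that pattern, converts it to epoch milliseconds, and stores it under the timestamp header, which is exactly what the %Y/%m/%d/%H escapes in the HDFS sink path need.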
Thanks for pointing out that, besides the link, I needed to include a proper snippet :)
I am generating data into a spool directory and forwarding it to a Hive table using the Flume Hive sink. The Flume sink connects to the Hive metastore, but after that I am facing the following issue.
Issue:
Unable to deliver event. Exception follows: org.apache.flume.EventDeliveryException: java.lang.ArrayIndexOutOfBoundsException: 1
Flume.conf
flume-hive-ingest.sources = src1
flume-hive-ingest.channels = chan1
flume-hive-ingest.sinks = sink1
flume-hive-ingest.sources.src1.type = spooldir
flume-hive-ingest.sources.src1.channels = chan1
flume-hive-ingest.sources.src1.spoolDir = /vagrant/flume_log
flume-hive-ingest.channels.chan1.type = memory
flume-hive-ingest.channels.chan1.capacity = 1000
flume-hive-ingest.channels.chan1.transactionCapacity = 1000
flume-hive-ingest.sinks.sink1.type = hive
flume-hive-ingest.sinks.sink1.channel = chan1
flume-hive-ingest.sinks.sink1.hive.metastore = thrift://one.hdp:9083
flume-hive-ingest.sinks.sink1.hive.database = default
flume-hive-ingest.sinks.sink1.hive.table = stocks
flume-hive-ingest.sinks.sink1.serializer = delimited
flume-hive-ingest.sinks.sink1.serializer.delimiter = ,
flume-hive-ingest.sinks.sink1.serializer.fieldnames = date,open,high,low,close,volume,adj_close
Hive script
DROP TABLE IF EXISTS stocks;
CREATE EXTERNAL TABLE stocks (
date STRING,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE,
volume BIGINT,
adj_close DOUBLE)
STORED AS ORC
LOCATION '/ingest/stocks';
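One thing to note about the table (an assumption about a likely cause, not a confirmed fix): the Flume Hive sink writes through Hive streaming, which generally requires the target table to be bucketed, stored as ORC, and transactional, and a transactional table cannot be EXTERNAL; date is also a reserved word in newer Hive versions. A sketch of a DDL along those lines (bucket column and bucket count are arbitrary):
CREATE TABLE stocks (
  `date` STRING,
  open DOUBLE,
  high DOUBLE,
  low DOUBLE,
  close DOUBLE,
  volume BIGINT,
  adj_close DOUBLE)
CLUSTERED BY (`date`) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');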
I am trying to ingest real-time data using Kafka as the source and Flume with an HDFS sink. My producer is working fine and I can see the data being produced, and my agent runs fine (no error while running the command), but no file gets generated in the specified directory.
Command for starting the Flume agent:
/usr/hdp/2.5.0.0-1245/flume/bin/flume-ng agent -c /usr/hdp/2.5.0.0-1245/flume/conf -f /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs.conf -n tier1
And my flume-hdfs.conf file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = data_1
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = localhost:6667
tier1.channels.channel1.zookeeperConnect = localhost:2181
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/user_name/FLUME_LOGS/
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
I am not able to find out what is wrong with the execution.
Please suggest how to overcome this problem.
Set the path of the HDFS sink this way:
tier1.sinks.sink1.hdfs.path = "VALUE of fs.default.name, located in core-site.xml"/user/user_name/FLUME_LOGS/
For example
tier1.sinks.sink1.hdfs.path = hdfs://localhost:54310/user/user_name/FLUME_LOGS/
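If that value is not known offhand, it can be read from the cluster configuration, for example (the config file path, and whether the deprecated fs.default.name or fs.defaultFS is used, depend on the distribution):
hdfs getconf -confKey fs.defaultFS
grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml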
I'm trying to use the Flume spooling directory source to copy CSV files to HDFS. As I'm a beginner in Hadoop concepts, please help me resolve the issue below.
HDFS directory: /home/hdfs
Flume dir: /etc/flume/
Please find the flume-hwdgteam01.conf file below:
# Define a source, a channel, and a sink
hwdgteam01.sources = src1
hwdgteam01.channels = chan1
hwdgteam01.sinks = sink1
# Set the source type to Spooling Directory and set the directory
# location to /home/flume/ingestion/
hwdgteam01.sources.src1.type = spooldir
hwdgteam01.sources.src1.spoolDir = /home/hwdgteam01/nandan/input-data
hwdgteam01.sources.src1.basenameHeader = true
# Configure the channel as simple in-memory queue
hwdgteam01.channels.chan1.type = memory
# Define the HDFS sink and set its path to your target HDFS directory
hwdgteam01.sinks.sink1.type = hdfs
hwdgteam01.sinks.sink1.hdfs.path = /home/datalanding
hwdgteam01.sinks.sink1.hdfs.fileType = DataStream
# Disable rollover functionality as we want to keep the original files
hwdgteam01.sinks.sink1.rollCount = 0
hwdgteam01.sinks.sink1.rollInterval = 0
hwdgteam01.sinks.sink1.rollSize = 0
hwdgteam01.sinks.sink1.idleTimeout = 0
# Set the files to their original name
hwdgteam01.sinks.sink1.hdfs.filePrefix = %{basename}
# Connect source and sink
hwdgteam01.sources.src1.channels = chan1
hwdgteam01.sinks.sink1.channel = chan1
I executed the command in the following ways:
/usr/bin/flume-ng agent --conf conf --conf-file /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf -Dflume.root.logger=DEBUG,console --name hwdgteam01
OR
/usr/bin/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
OR
/home/hwdgteam01/nandan/config/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
But nothing worked out, and I'm getting the error shown in the attached Flume error message.
Please let me know where I'm going wrong.
Thanks for any help.
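One thing worth double-checking in the configuration above (an observation based on how the HDFS sink is normally configured, not a confirmed fix): the rollover and idle-timeout settings are sub-properties of hdfs, so without the hdfs. prefix they are silently ignored and the defaults apply. A sketch of the prefixed form:
hwdgteam01.sinks.sink1.hdfs.rollCount = 0
hwdgteam01.sinks.sink1.hdfs.rollInterval = 0
hwdgteam01.sinks.sink1.hdfs.rollSize = 0
hwdgteam01.sinks.sink1.hdfs.idleTimeout = 0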
I used flume-ng version 1.5 to collect logs.
There are two agents in the data flow, and they run on two different hosts.
Data is sent from agent1 to agent2.
The agents' components are as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to lose about 1/1000 (0.1%) of the data out of millions of events.
To troubleshoot the problem I tried these steps:
Looked at the agents' logs: could not find any error or exception.
Looked at the agents' monitoring metrics: the number of events put into and taken from the channel are always equal (see the monitoring example below).
Counted the records with a Hive query and over the HDFS files with a shell script, respectively: the two counts are equal, and both are less than the number of records generated online.
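For reference, the channel counters mentioned in step 2 can be exposed over HTTP by starting each agent with Flume's built-in JSON monitoring (the port and config file names here are just examples):
bin/flume-ng agent -n agent1 -c conf -f conf/agent1.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
curl http://localhost:34545/metrics
The JSON output includes EventPutSuccessCount and EventTakeSuccessCount for each channel, which are the numbers being compared above.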
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration:
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
There is a bug in the file line deserializer when it encounters certain UTF characters whose code points lie between U+10000 and U+10FFFF; these are represented in UTF-16 by two 16-bit code units, called a surrogate pair.
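For context, a minimal Java illustration (not Flume code) of why such code points occupy two 16-bit units, which is what can trip up a deserializer that works on 16-bit chars:
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 (two 16-bit code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (one code point)
    }
}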