I'm trying to create a new Flume agent with a spooldir source that picks up files and puts them into HDFS. This is my config file:
agent.sources = file
agent.channels = channel
agent.sinks = hdfsSink
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
# SINKS CONFIGURATION
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /HADOOP/PATH/%Y/%m/%d/%H/
agent.sinks.hdfsSink.hdfs.filePrefix = common
agent.sinks.hdfsSink.hdfs.fileSuffix = .json
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 5242880
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.maxOpenFiles = 2
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.callTimeout = 100000
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.channel = channel
# CHANNELS CONFIGURATION
agent.channels.channel.type = memory
agent.channels.channel.capacity = 10000
agent.channels.channel.transactionCapacity = 1000
I'm getting an error that says Expected timestamp in the Flume event headers, but it was null. The files I'm reading contain JSON, and each record has a field named timestamp.
Is there a way to add this timestamp to the event header?
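For context, the error appears because the %Y/%m/%d/%H escapes in hdfs.path are resolved from a timestamp event header, and nothing in the configuration above sets one. If processing time is acceptable instead of the event's own timestamp, a minimal workaround (a sketch, not necessarily what is wanted here) is to let the HDFS sink fall back to the agent's local clock:
# Workaround sketch: fill the time escapes from the sink host's clock
# (processing time) instead of requiring a timestamp header
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
To take the timestamp from the JSON body itself, see the interceptor-based answer further down.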
As per my earlier comment, I am now sharing the complete steps I followed for spooling a JSON file (with the file header enabled), putting it into the Hadoop HDFS cluster using Flume, creating an external Hive table over the JSON files, and later running a DML query over it.
Created flume-spool.conf:
# Flume configuration starts
erum.sources = source-1
erum.channels = file-channel-1
erum.sinks = hdfs-sink-1
erum.sources.source-1.channels = file-channel-1
erum.sinks.hdfs-sink-1.channel = file-channel-1
# Define a file channel called file-channel-1 on erum
erum.channels.file-channel-1.type = file
erum.channels.file-channel-1.capacity = 2000000
erum.channels.file-channel-1.transactionCapacity = 100000
# Define a source for erum
erum.sources.source-1.type = spooldir
erum.sources.source-1.bind = localhost
erum.sources.source-1.port = 44444
erum.sources.source-1.inputCharset = UTF-8
erum.sources.source-1.bufferMaxLineLength = 100
# Spool dir in my case is /home/arif/practice/flume_sink
erum.sources.source-1.spoolDir = /home/arif/practice/flume_sink/
erum.sources.source-1.fileHeader = true
erum.sources.source-1.fileHeaderKey = file
erum.sources.source-1.fileSuffix = .COMPLETED
# Sink is flume_import under hdfs
erum.sinks.hdfs-sink-1.pathManager = DEFAULT
erum.sinks.hdfs-sink-1.type = hdfs
erum.sinks.hdfs-sink-1.hdfs.filePrefix = common
erum.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
erum.sinks.hdfs-sink-1.hdfs.writeFormat = Text
erum.sinks.hdfs-sink-1.hdfs.fileType = DataStream
erum.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/user/arif/flume_sink/products/
erum.sinks.hdfs-sink-1.hdfs.batchSize = 1000
erum.sinks.hdfs-sink-1.hdfs.rollSize = 2684354560
erum.sinks.hdfs-sink-1.hdfs.rollInterval = 5
erum.sinks.hdfs-sink-1.hdfs.rollCount = 5000
Now we are running flume-spool using the agent erum:
bin/flume-ng agent -n erum -c conf -f conf/flume-spool.conf -Dflume.root.logger=DEBUG,console
Copied the products.json file into the directory specified by erum.sources.source-1.spoolDir in the Flume configuration.
The contents of the products.json file are as follows:
{"productid":"5968dd23fc13ae04d9000001","product_name":"sildenafilcitrate","mfgdate":"20160719031109","supplier":"WisozkInc","quantity":261,"unit_cost":"$10.47"}
{"productid":"5968dd23fc13ae04d9000002","product_name":"MountainJuniperusashei","mfgdate":"20161003021009","supplier":"Keebler-Hilpert","quantity":292,"unit_cost":"$8.74"}
{"productid":"5968dd23fc13ae04d9000003","product_name":"DextromathorphanHBr","mfgdate":"20161101041113","supplier":"Schmitt-Weissnat","quantity":211,"unit_cost":"$20.53"}
{"productid":"5968dd23fc13ae04d9000004","product_name":"MeophanHBr","mfgdate":"20161101061113","supplier":"Schmitt-Weissnat","quantity":198,"unit_cost":"$18.73"}
Download hive-serdes-sources-1.0.6.jar from the URL below:
https://www.dropbox.com/s/lsjgk2zaqz8uli9/hive-serdes-sources-1.0.6.jar?dl=0
After spooling the JSON file to the HDFS cluster using flume-spool, we start the Hive server, log in to the Hive shell, and then do the following:
hive> add jar /home/arif/applications/hadoop/apache-hive-2.1.1-bin/lib/hive-serdes-sources-1.0.6.jar;
hive> create external table products (productid string, product_name string, mfgdate string, supplier string, quantity int, unit_cost string)
> row format serde 'com.cloudera.hive.serde.JSONSerDe' location '/user/arif/flume_sink/products/';
OK
Time taken: 0.211 seconds
hive> select * from products;
OK
5968dd23fc13ae04d9000001 sildenafilcitrate 20160719031109 WisozkInc 261 $10.47
5968dd23fc13ae04d9000002 MountainJuniperusashei 20161003021009 Keebler-Hilpert 292 $8.74
5968dd23fc13ae04d9000003 DextromathorphanHBr 20161101041113 Schmitt-Weissnat 211 $20.53
5968dd23fc13ae04d9000004 MeophanHBr 20161101061113 Schmitt-Weissnat 198 $18.73
Time taken: 0.291 seconds, Fetched: 4 row(s)
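As an example of a further DML query over the same external table (a sketch that uses only the columns defined above; the results depend on your data):
hive> select supplier, count(*) as product_count, sum(quantity) as total_quantity
    > from products group by supplier;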
I completed all of these steps without a single error. I hope this helps you, thanks.
As explained in this post:
http://shzhangji.com/blog/2017/08/05/how-to-extract-event-time-in-apache-flume/
the change needed is to add an interceptor and a serializer to the source:
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
agent.sources.file.interceptors = i1
agent.sources.file.interceptors.i1.type = regex_extractor
agent.sources.file.interceptors.i1.regex = <regex_for_timestamp>
agent.sources.file.interceptors.i1.serializers = s1
agent.sources.file.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.file.interceptors.i1.serializers.s1.name = timestamp
agent.sources.file.interceptors.i1.serializers.s1.pattern = <pattern_that_matches_your_regex>
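For instance, if each JSON line carries something like "timestamp":"2017-08-05 12:34:56" (an assumed format, since the actual one isn't shown), the two placeholders could be filled in roughly like this:
agent.sources.file.interceptors.i1.regex = "timestamp"\\s*:\\s*"(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})"
agent.sources.file.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss
The serializer parses the captured group with that pattern, converts it to epoch milliseconds, and stores it under the timestamp header, which is exactly what the %Y/%m/%d/%H escapes in the HDFS sink path need.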
Thanks for pointing out that, besides the link, I needed to include a proper snippet :)
I am generating data into a spool directory and forwarding it to a Hive table using the Flume Hive sink. The Flume sink connects to the Hive metastore, but after that I am facing the following issue.
Issue:
Unable to deliver event. Exception follows: org.apache.flume.EventDeliveryException: java.lang.ArrayIndexOutOfBoundsException: 1
Flume.conf
flume-hive-ingest.sources = src1
flume-hive-ingest.channels = chan1
flume-hive-ingest.sinks = sink1
flume-hive-ingest.sources.src1.type = spooldir
flume-hive-ingest.sources.src1.channels = chan1
flume-hive-ingest.sources.src1.spoolDir = /vagrant/flume_log
flume-hive-ingest.channels.chan1.type = memory
flume-hive-ingest.channels.chan1.capacity = 1000
flume-hive-ingest.channels.chan1.transactionCapacity = 1000
flume-hive-ingest.sinks.sink1.type = hive
flume-hive-ingest.sinks.sink1.channel = chan1
flume-hive-ingest.sinks.sink1.hive.metastore = thrift://one.hdp:9083
flume-hive-ingest.sinks.sink1.hive.database = default
flume-hive-ingest.sinks.sink1.hive.table = stocks
flume-hive-ingest.sinks.sink1.serializer = delimited
flume-hive-ingest.sinks.sink1.serializer.delimiter = ,
flume-hive-ingest.sinks.sink1.serializer.fieldnames = date,open,high,low,close,volume,adj_close
Hive script
DROP TABLE IF EXISTS stocks;
CREATE EXTERNAL TABLE stocks (
date STRING,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE,
volume BIGINT,
adj_close DOUBLE)
STORED AS ORC
LOCATION '/ingest/stocks';
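One thing to note about the table (an assumption about a likely cause, not a confirmed fix): the Flume Hive sink writes through Hive streaming, which generally requires the target table to be bucketed, stored as ORC, and transactional, and a transactional table cannot be EXTERNAL; date is also a reserved word in newer Hive versions. A sketch of a DDL along those lines (bucket column and bucket count are arbitrary):
CREATE TABLE stocks (
  `date` STRING,
  open DOUBLE,
  high DOUBLE,
  low DOUBLE,
  close DOUBLE,
  volume BIGINT,
  adj_close DOUBLE)
CLUSTERED BY (`date`) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');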
I am trying to ingest real-time data using Kafka as the source and Flume with an HDFS sink. My producer is working fine and I can see the data being produced, and my agent runs fine (no error while running the command), but no file gets generated in the specified directory.
Command for starting the Flume agent:
/usr/hdp/2.5.0.0-1245/flume/bin/flume-ng agent -c /usr/hdp/2.5.0.0-1245/flume/conf -f /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs.conf -n tier1
And my flume-hdfs.conf file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = data_1
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = localhost:6667
tier1.channels.channel1.zookeeperConnect = localhost:2181
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/user_name/FLUME_LOGS/
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
I am not able to find out what is wrong with the execution.
Please suggest how to overcome this problem.
Set the path of the HDFS sink this way:
tier1.sinks.sink1.hdfs.path = "VALUE of fs.default.name, located in core-site.xml"/user/user_name/FLUME_LOGS/
For example
tier1.sinks.sink1.hdfs.path = hdfs://localhost:54310/user/user_name/FLUME_LOGS/
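If that value is not known offhand, it can be read from the cluster configuration, for example (the config file path, and whether the deprecated fs.default.name or fs.defaultFS is used, depend on the distribution):
hdfs getconf -confKey fs.defaultFS
grep -A1 'fs.defaultFS' /etc/hadoop/conf/core-site.xml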
I'm trying to use the Flume spooling directory source to copy CSV files to HDFS. As I'm a beginner in Hadoop concepts, please help me resolve the issue below.
HDFS directory: /home/hdfs
Flume dir: /etc/flume/
Please find the flume-hwdgteam01.conf file below:
# Define a source, a channel, and a sink
hwdgteam01.sources = src1
hwdgteam01.channels = chan1
hwdgteam01.sinks = sink1
# Set the source type to Spooling Directory and set the directory
# location to /home/flume/ingestion/
hwdgteam01.sources.src1.type = spooldir
hwdgteam01.sources.src1.spoolDir = /home/hwdgteam01/nandan/input-data
hwdgteam01.sources.src1.basenameHeader = true
# Configure the channel as simple in-memory queue
hwdgteam01.channels.chan1.type = memory
# Define the HDFS sink and set its path to your target HDFS directory
hwdgteam01.sinks.sink1.type = hdfs
hwdgteam01.sinks.sink1.hdfs.path = /home/datalanding
hwdgteam01.sinks.sink1.hdfs.fileType = DataStream
# Disable rollover functionality as we want to keep the original files
hwdgteam01.sinks.sink1.rollCount = 0
hwdgteam01.sinks.sink1.rollInterval = 0
hwdgteam01.sinks.sink1.rollSize = 0
hwdgteam01.sinks.sink1.idleTimeout = 0
# Set the files to their original name
hwdgteam01.sinks.sink1.hdfs.filePrefix = %{basename}
# Connect source and sink
hwdgteam01.sources.src1.channels = chan1
hwdgteam01.sinks.sink1.channel = chan1
I executed the command in the following ways:
/usr/bin/flume-ng agent --conf conf --conf-file /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf -Dflume.root.logger=DEBUG,console --name hwdgteam01
OR
/usr/bin/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
OR
/home/hwdgteam01/nandan/config/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
But nothing worked out, and I'm getting the error shown in the attached Flume error message.
Please let me know where I'm going wrong.
Thanks for any help.
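One thing worth double-checking in the configuration above (an observation based on how the HDFS sink is normally configured, not a confirmed fix): the rollover and idle-timeout settings are sub-properties of hdfs, so without the hdfs. prefix they are silently ignored and the defaults apply. A sketch of the prefixed form:
hwdgteam01.sinks.sink1.hdfs.rollCount = 0
hwdgteam01.sinks.sink1.hdfs.rollInterval = 0
hwdgteam01.sinks.sink1.hdfs.rollSize = 0
hwdgteam01.sinks.sink1.hdfs.idleTimeout = 0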
I used flume-ng version 1.5 to collect logs.
There are two agents in the data flow, and they run on two different hosts.
Data is sent from agent1 to agent2.
The agents' components are as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to lose about 1/1000 (0.1%) of the data out of millions of events.
To troubleshoot the problem I tried these steps:
Looked at the agents' logs: could not find any error or exception.
Looked at the agents' monitoring metrics: the number of events put into and taken from the channel are always equal (see the monitoring example below).
Counted the records with a Hive query and over the HDFS files with a shell script, respectively: the two counts are equal, and both are less than the number of records generated online.
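For reference, the channel counters mentioned in step 2 can be exposed over HTTP by starting each agent with Flume's built-in JSON monitoring (the port and config file names here are just examples):
bin/flume-ng agent -n agent1 -c conf -f conf/agent1.conf -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
curl http://localhost:34545/metrics
The JSON output includes EventPutSuccessCount and EventTakeSuccessCount for each channel, which are the numbers being compared above.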
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration:
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
There is a bug in the file line deserializer when it encounters certain UTF characters whose code points lie between U+10000 and U+10FFFF; these are represented in UTF-16 by two 16-bit code units, called a surrogate pair.
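For context, a minimal Java illustration (not Flume code) of why such code points occupy two 16-bit units, which is what can trip up a deserializer that works on 16-bit chars:
public class SurrogatePairDemo {
    public static void main(String[] args) {
        // U+1F600 lies above U+FFFF, so UTF-16 needs a surrogate pair for it
        String s = new String(Character.toChars(0x1F600));
        System.out.println(s.length());                      // 2 (two 16-bit code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (one code point)
    }
}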