Flume to load data to local file system - hadoop

I am using Hadoop 2.2 on Linux. Can anyone tell me how to use the file roll sink in Flume? I know that file_roll writes data to the local file system, but I don't know how to configure it.
Thanks in advance.

To use the file roll sink, you only need to configure it in the Flume configuration file. The example below reads data from a spooling directory source watching /logs/source and sends it through a memory channel to a file roll sink that writes into /logs/sink.
There are other configuration options worth a look in the Flume user's guide.
# Name the components on agent1
agent1.sources = spool
agent1.channels = ch1
agent1.sinks = fr1
# Define a memory channel called ch1 on agent1
agent1.channels.ch1.type = memory
# Define a spooling directory source watching /logs/source
agent1.sources.spool.type = spooldir
agent1.sources.spool.channels = ch1
agent1.sources.spool.spoolDir = /logs/source
agent1.sources.spool.fileHeader = true
# Define a file roll sink writing into /logs/sink
agent1.sinks.fr1.type = file_roll
agent1.sinks.fr1.channel = ch1
agent1.sinks.fr1.sink.directory = /logs/sink
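Assuming the file above is saved as, say, conf/agent1.conf (the file name here is just an example), the agent can then be started along these lines:
bin/flume-ng agent -n agent1 -c conf -f conf/agent1.conf -Dflume.root.logger=INFO,console
The rolled files will appear in /logs/sink; by default the file roll sink starts a new file every 30 seconds, which you can change with agent1.sinks.fr1.sink.rollInterval.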

Related

Flume 1.6.0 spooling directory source with timestamp on header

I'm trying to create a new Flume agent with a spooldir source that puts the files into HDFS. This is my config file:
agent.sources = file
agent.channels = channel
agent.sinks = hdfsSink
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
# SINKS CONFIGURATION
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = /HADOOP/PATH/%Y/%m/%d/%H/
agent.sinks.hdfsSink.hdfs.filePrefix = common
agent.sinks.hdfsSink.hdfs.fileSuffix = .json
agent.sinks.hdfsSink.hdfs.rollInterval = 300
agent.sinks.hdfsSink.hdfs.rollSize = 5242880
agent.sinks.hdfsSink.hdfs.rollCount = 0
agent.sinks.hdfsSink.hdfs.maxOpenFiles = 2
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.callTimeout = 100000
agent.sinks.hdfsSink.hdfs.batchSize = 1000
agent.sinks.hdfsSink.channel = channel
# CHANNELS CONFIGURATION
agent.channels.channel.type = memory
agent.channels.channel.capacity = 10000
agent.channels.channel.transactionCapacity = 1000
I'm getting an error that says Expected timestamp in the Flume event headers, but it was null. The files I'm reading contain JSON records, each with a field named timestamp.
Is there a way to add this timestamp to the event header?
As per my earlier comment, I am now sharing the full set of steps I followed: spooling a JSON file with the file header enabled, putting it into the Hadoop HDFS cluster using Flume, creating an external Hive table over the JSON data, and then running queries over it.
Created flume-spool.conf
# Flume configuration starts
erum.sources = source-1
erum.channels = file-channel-1
erum.sinks = hdfs-sink-1
erum.sources.source-1.channels = file-channel-1
erum.sinks.hdfs-sink-1.channel = file-channel-1
# Define a file channel called file-channel-1 on erum
erum.channels.file-channel-1.type = file
erum.channels.file-channel-1.capacity = 2000000
erum.channels.file-channel-1.transactionCapacity = 100000
# Define a spooling directory source for erum
erum.sources.source-1.type = spooldir
erum.sources.source-1.inputCharset = UTF-8
erum.sources.source-1.bufferMaxLineLength = 100
# Spooldir in my case is /home/arif/practice/flume_sink
erum.sources.source-1.spoolDir = /home/arif/practice/flume_sink/
erum.sources.source-1.fileHeader = true
erum.sources.source-1.fileHeaderKey = file
erum.sources.source-1.fileSuffix = .COMPLETED
# Sink is flume_import under hdfs
erum.sinks.hdfs-sink-1.pathManager = DEFAULT
erum.sinks.hdfs-sink-1.type = hdfs
erum.sinks.hdfs-sink-1.hdfs.filePrefix = common
erum.sinks.hdfs-sink-1.hdfs.fileSuffix = .json
erum.sinks.hdfs-sink-1.hdfs.writeFormat = Text
erum.sinks.hdfs-sink-1.hdfs.fileType = DataStream
erum.sinks.hdfs-sink-1.hdfs.path = hdfs://localhost:9000/user/arif/flume_sink/products/
erum.sinks.hdfs-sink-1.hdfs.batchSize = 1000
erum.sinks.hdfs-sink-1.hdfs.rollSize = 2684354560
erum.sinks.hdfs-sink-1.hdfs.rollInterval = 5
erum.sinks.hdfs-sink-1.hdfs.rollCount = 5000
Now run flume-spool with the agent named erum:
bin/flume-ng agent -n erum -c conf -f conf/flume-spool.conf -Dflume.root.logger=DEBUG,console
Copied the products.json file into the directory configured as erum.sources.source-1.spoolDir.
The contents of the products.json file were as follows:
{"productid":"5968dd23fc13ae04d9000001","product_name":"sildenafilcitrate","mfgdate":"20160719031109","supplier":"WisozkInc","quantity":261,"unit_cost":"$10.47"}
{"productid":"5968dd23fc13ae04d9000002","product_name":"MountainJuniperusashei","mfgdate":"20161003021009","supplier":"Keebler-Hilpert","quantity":292,"unit_cost":"$8.74"}
{"productid":"5968dd23fc13ae04d9000003","product_name":"DextromathorphanHBr","mfgdate":"20161101041113","supplier":"Schmitt-Weissnat","quantity":211,"unit_cost":"$20.53"}
{"productid":"5968dd23fc13ae04d9000004","product_name":"MeophanHBr","mfgdate":"20161101061113","supplier":"Schmitt-Weissnat","quantity":198,"unit_cost":"$18.73"}
Download the hive-serdes-sources-1.0.6.jar from the URL below:
https://www.dropbox.com/s/lsjgk2zaqz8uli9/hive-serdes-sources-1.0.6.jar?dl=0
After spooling the JSON file to the HDFS cluster using flume-spool, start the Hive server, log in to the Hive shell and then do the following:
hive> add jar /home/arif/applications/hadoop/apache-hive-2.1.1-bin/lib/hive-serdes-sources-1.0.6.jar;
hive> create external table products (productid string, product_name string, mfgdate string, supplier string, quantity int, unit_cost string)
> row format serde 'com.cloudera.hive.serde.JSONSerDe' location '/user/arif/flume_sink/products/';
OK
Time taken: 0.211 seconds
hive> select * from products;
OK
5968dd23fc13ae04d9000001 sildenafilcitrate 20160719031109 WisozkInc 261 $10.47
5968dd23fc13ae04d9000002 MountainJuniperusashei 20161003021009 Keebler-Hilpert 292 $8.74
5968dd23fc13ae04d9000003 DextromathorphanHBr 20161101041113 Schmitt-Weissnat 211 $20.53
5968dd23fc13ae04d9000004 MeophanHBr 20161101061113 Schmitt-Weissnat 198 $18.73
Time taken: 0.291 seconds, Fetched: 4 row(s)
I completed all of these steps without a single error. I hope this helps, thanks.
As explained in this post:
http://shzhangji.com/blog/2017/08/05/how-to-extract-event-time-in-apache-flume/
the change needed is to add an interceptor and a serializer to the source:
# SOURCES CONFIGURATION
agent.sources.file.type = spooldir
agent.sources.file.channels = channel
agent.sources.file.spoolDir = /path/to/json_files
agent.sources.file.interceptors = i1
agent.sources.file.interceptors.i1.type = regex_extractor
agent.sources.file.interceptors.i1.regex = <regex_for_timestamp>
agent.sources.file.interceptors.i1.serializers = s1
agent.sources.file.interceptors.i1.serializers.s1.type = org.apache.flume.interceptor.RegexExtractorInterceptorMillisSerializer
agent.sources.file.interceptors.i1.serializers.s1.name = timestamp
agent.sources.file.interceptors.i1.serializers.s1.pattern = <pattern_that_matches_your_regex>
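As a purely illustrative example (the actual regex and pattern depend on how the timestamp appears in your JSON), if each record contained something like "timestamp":"2017-08-05 12:30:00", the two placeholder lines above could be filled in roughly as:
agent.sources.file.interceptors.i1.regex = "timestamp":"(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2})"
agent.sources.file.interceptors.i1.serializers.s1.pattern = yyyy-MM-dd HH:mm:ss
The serializer then parses the captured group with that pattern and stores the result in the timestamp header, which the %Y/%m/%d/%H escapes in the HDFS path need.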
Thanks for pointing out that, besides the link, I needed to include a proper snippet :)

Hadoop: copying CSV file to HDFS using Flume spooling directory, Error: INFO source.SpoolDirectorySource: Spooling Directory Source runner has shutdown

I'm trying to use the Flume spooling directory source to copy a CSV file to HDFS. As I'm a beginner in Hadoop, please help me resolve the issue below.
HDFS directory: /home/hdfs
Flume directory: /etc/flume/
The flume-hwdgteam01.conf file is as follows:
# Define a source, a channel, and a sink
hwdgteam01.sources = src1
hwdgteam01.channels = chan1
hwdgteam01.sinks = sink1
# Set the source type to Spooling Directory and set the directory
# location to /home/flume/ingestion/
hwdgteam01.sources.src1.type = spooldir
hwdgteam01.sources.src1.spoolDir = /home/hwdgteam01/nandan/input-data
hwdgteam01.sources.src1.basenameHeader = true
# Configure the channel as simple in-memory queue
hwdgteam01.channels.chan1.type = memory
# Define the HDFS sink and set its path to your target HDFS directory
hwdgteam01.sinks.sink1.type = hdfs
hwdgteam01.sinks.sink1.hdfs.path = /home/datalanding
hwdgteam01.sinks.sink1.hdfs.fileType = DataStream
# Disable rollover functionality as we want to keep the original files
hwdgteam01.sinks.sink1.rollCount = 0
hwdgteam01.sinks.sink1.rollInterval = 0
hwdgteam01.sinks.sink1.rollSize = 0
hwdgteam01.sinks.sink1.idleTimeout = 0
# Set the files to their original name
hwdgteam01.sinks.sink1.hdfs.filePrefix = %{basename}
# Connect source and sink
hwdgteam01.sources.src1.channels = chan1
hwdgteam01.sinks.sink1.channel = chan1
I executed the command in the following ways:
/usr/bin/flume-ng agent --conf conf --conf-file /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf -Dflume.root.logger=DEBUG,console --name hwdgteam01
OR
/usr/bin/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
OR
/home/hwdgteam01/nandan/config/flume-ng agent -n hwdgteam01 -f /home/hwdgteam01/nandan/config/flume-hwdgteam01.conf
but nothing worked and I keep getting the Flume error message from the title.
Please let me know where I am going wrong.
Thanks for any help.

flume taking time to copy data into hdfs when rolling based on file size

I have a use case where I want to copy a remote file into HDFS using Flume, and I want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an Avro source and sink to copy the remote data into HDFS, and on the sink side I am rolling files by size (128/256 MB). However, copying a file from the remote machine and storing it into HDFS (as a 128/256 MB file) takes Flume an average of 2 minutes.
Flume Configuration:
Agent on the remote machine (spooling directory source, file channel, Avro sink):
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx #IP Address destination machine
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Agent on the machine where HDFS is running (Avro source, file channel, HDFS sink):
### Agent1 - Avro Source and File Channel, HDFS Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between the two machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternate configuration, so that the copying doesn't take so much time?
Both agents use a file channel, so before being written to HDFS the data is written to disk twice. You can try using a memory channel for each agent to see whether performance improves.
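For instance, on the HDFS-side agent the swap could look roughly like this (the channel name and capacities below are illustrative, and keep in mind a memory channel trades durability for speed, so buffered events are lost if the agent dies):
Agent1.channels = memory-channel1
# Buffer events in memory instead of on local disk
Agent1.channels.memory-channel1.type = memory
Agent1.channels.memory-channel1.capacity = 1000000
# Must be at least as large as the HDFS sink's batchSize (50000)
Agent1.channels.memory-channel1.transactionCapacity = 50000
# Rebind the existing source and sink to the new channel
Agent1.sources.avro-source1.channels = memory-channel1
Agent1.sinks.hdfs-sink1.channel = memory-channel1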

Flume sinks data in an inconsistent fashion

I have a problem. I am using Apache Flume to read logs from a txt file and sink them to HDFS, but somehow some records are getting skipped. I am using a file channel; please check the configuration below.
agent2.sources = file_server
agent2.sources.file_server.type=exec
agent2.sources.file_server.command = tail -F /home/datafile/error.log
agent2.sources.file_server.channels = fileChannel
agent2.channels = fileChannel
agent2.channels.fileChannel.type=file
agent2.channels.fileChannel.capacity = 12000
agent2.channels.fileChannel.transactionCapacity = 10000
agent2.channels.fileChannel.checkpointDir=/home/data/flume/checkpoint
agent2.channels.fileChannel.dataDirs=/home/data/flume/data
# Agent2 sinks
agent2.sinks = hadooper loged
agent2.sinks.hadooper.type = hdfs
agent2.sinks.loged.type=logger
agent2.sinks.hadooper.hdfs.path = hdfs://localhost:8020/flume/data/file
agent2.sinks.hadooper.hdfs.fileType = DataStream
agent2.sinks.hadooper.hdfs.writeFormat = Text
agent2.sinks.hadooper.hdfs.rollInterval = 600
agent2.sinks.hadooper.hdfs.rollCount = 0
agent2.sinks.hadooper.hdfs.rollSize = 67108864
agent2.sinks.hadooper.hdfs.batchSize = 10
agent2.sinks.hadooper.hdfs.idleTimeout=0
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = fileChannel
agent2.sinks.hadooper.hdfs.threadsPoolSize = 20
Please help.
I think the problem is that you are using 2 sinks both reading from a single channel; in that case, a Flume event read by one of the 2 sinks is not read by the other one, and vice versa.
If you want both sinks to receive a copy of the same Flume event, you will need to create a dedicated channel for each sink. Once these channels are created, the default channel selector, which is ReplicatingChannelSelector, will put a copy of each event into every channel.
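A rough sketch of that layout against the configuration above (the second channel's name and sizes are illustrative):
# Two channels, one per sink; the default replicating selector copies each event into both
agent2.channels = fileChannel logChannel
agent2.sources.file_server.selector.type = replicating
agent2.sources.file_server.channels = fileChannel logChannel
# A lightweight memory channel dedicated to the logger sink
agent2.channels.logChannel.type = memory
agent2.channels.logChannel.capacity = 12000
agent2.channels.logChannel.transactionCapacity = 10000
# The HDFS sink keeps the file channel; the logger sink gets its own channel
agent2.sinks.hadooper.channel = fileChannel
agent2.sinks.loged.channel = logChannel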

Flume Tail a File

I am new to Flume NG and need help tailing a file. I have a cluster running Hadoop with Flume running remotely, and I connect to this cluster using PuTTY. I want to tail a file on my PC and put it into HDFS on the cluster. I am using the following configuration for this.
#flume.conf: exec source, hdfs sink
# Name the components on this agent
tier1.sources = r1
tier1.sinks = k1
tier1.channels = c1
# Describe/configure the source
tier1.sources.r1.type = exec
tier1.sources.r1.command = tail -F /(Path to file on my PC)
# Describe the sink
tier1.sinks.k1.type = hdfs
tier1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
tier1.sinks.k1.hdfs.filePrefix = events-
tier1.sinks.k1.hdfs.round = true
tier1.sinks.k1.hdfs.roundValue = 10
tier1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
tier1.channels.c1.type = memory
tier1.channels.c1.capacity = 1000
tier1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
tier1.sources.r1.channels = c1
tier1.sinks.k1.channel = c1
I believe the mistake is in the source. This kind of source does not take the host name or IP to look for (which in this case should be my PC). Could someone give me a hint as to how to tail a file on my PC and upload it to the remote HDFS using Flume?
The exec source in your configuration will run on the machine where you start Flume's tier1 agent. If you want to collect data from another machine, you'll need to start a Flume agent on that machine too; to sum up, you need:
an agent (remote1) running on the remote machine that has an avro source, which will listen for events from collector agents and will act like an aggregator.
an agent (local1) running on your machine (to act like a collector) that has an exec source and sends data to the remote agent via avro sink.
Alternatively, you can have only one Flume agent running on your local machine (with the same configuration you posted) and set the HDFS path to "hdfs://REMOTE_IP/hdfs/path" (though I'm not entirely sure this will work).
edit:
Below are sample configurations for the two-agent scenario (they may not work without some modification).
remote1.channels.mem-ch-1.type = memory
remote1.sources.avro-src-1.channels = mem-ch-1
remote1.sources.avro-src-1.type = avro
remote1.sources.avro-src-1.port = 10060
# Replace 10.88.66.4 with your remote machine's external IP
remote1.sources.avro-src-1.bind = 10.88.66.4
remote1.sinks.k1.channel = mem-ch-1
remote1.sinks.k1.type = hdfs
remote1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
remote1.sinks.k1.hdfs.filePrefix = events-
remote1.sinks.k1.hdfs.round = true
remote1.sinks.k1.hdfs.roundValue = 10
remote1.sinks.k1.hdfs.roundUnit = minute
remote1.sources = avro-src-1
remote1.sinks = k1
remote1.channels = mem-ch-1
and
local1.channels.mem-ch-1.type = memory
local1.sources.exc-src-1.channels = mem-ch-1
local1.sources.exc-src-1.type = exec
local1.sources.exc-src-1.command = tail -F /(Path to file on my PC)
local1.sinks.avro-snk-1.channel = mem-ch-1
local1.sinks.avro-snk-1.type = avro
# Replace 10.88.66.4 with the remote machine's IP
local1.sinks.avro-snk-1.hostname = 10.88.66.4
local1.sinks.avro-snk-1.port = 10060
local1.sources = exc-src-1
local1.sinks = avro-snk-1
local1.channels = mem-ch-1
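To run this setup, you would start the remote (aggregator) agent on the cluster machine first and then the local (collector) agent on your PC, along these lines (the config file names are just examples):
bin/flume-ng agent -n remote1 -c conf -f conf/remote1.conf -Dflume.root.logger=INFO,console
bin/flume-ng agent -n local1 -c conf -f conf/local1.conf -Dflume.root.logger=INFO,console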
