Flume Tail a File - hadoop

I am new to Flume NG and need help tailing a file. I have a Hadoop cluster with Flume running remotely, and I connect to this cluster using PuTTY. I want to tail a file on my PC and put it into HDFS on the cluster. I am using the following configuration to do this.
# flume.conf: exec source, hdfs sink
# Name the components on this agent
tier1.sources = r1
tier1.sinks = k1
tier1.channels = c1
# Describe/configure the source
tier1.sources.r1.type = exec
tier1.sources.r1.command = tail -F /(Path to file on my PC)
# Describe the sink
tier1.sinks.k1.type = hdfs
tier1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
tier1.sinks.k1.hdfs.filePrefix = events-
tier1.sinks.k1.hdfs.round = true
tier1.sinks.k1.hdfs.roundValue = 10
tier1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
tier1.channels.c1.type = memory
tier1.channels.c1.capacity = 1000
tier1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
tier1.sources.r1.channels = c1
tier1.sinks.k1.channel = c1
I believe the mistake is in the source. This kind of source does not take a host name or IP to look at (which in this case should be my PC). Could someone give me a hint as to how to tail a file on my PC and upload it to the remotely located HDFS using Flume?

The exec source in your configuration will run on the machine where you start the Flume tier1 agent. If you want to collect data from another machine, you will need to start a Flume agent on that machine too; to sum up, you need:
an agent (remote1) running on the remote machine that has an avro source, which will listen for events from collector agents and act as an aggregator.
an agent (local1) running on your machine (acting as a collector) that has an exec source and sends data to the remote agent via an avro sink.
Alternatively, you can have only one Flume agent running on your local machine (with the same configuration you posted) and set the hdfs path to "hdfs://REMOTE_IP/hdfs/path" (though I'm not entirely sure this will work).
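If you try that single-agent alternative, only the sink path changes; a minimal sketch (the NameNode host and port are placeholders you would replace with your cluster's values):
tier1.sinks.k1.hdfs.path = hdfs://REMOTE_NAMENODE_IP:8020/user/ntimbadi/flume/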
edit:
Below are the sample configurations for the 2-agents scenario (they may not work without some modification).
remote1.channels.mem-ch-1.type = memory
remote1.sources.avro-src-1.channels = mem-ch-1
remote1.sources.avro-src-1.type = avro
remote1.sources.avro-src-1.port = 10060
# REPLACE WITH YOUR MACHINE'S EXTERNAL IP
remote1.sources.avro-src-1.bind = 10.88.66.4
remote1.sinks.k1.channel = mem-ch-1
remote1.sinks.k1.type = hdfs
remote1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
remote1.sinks.k1.hdfs.filePrefix = events-
remote1.sinks.k1.hdfs.round = true
remote1.sinks.k1.hdfs.roundValue = 10
remote1.sinks.k1.hdfs.roundUnit = minute
remote1.sources = avro-src-1
remote1.sinks = k1
remote1.channels = mem-ch-1
and
local1.channels.mem-ch-1.type = memory
local1.sources.exc-src-1.channels = mem-ch-1
local1.sources.exc-src-1.type = exec
local1.sources.exc-src-1.command = tail -F /(Path to file on my PC)
local1.sinks.avro-snk-1.channel = mem-ch-1
local1.sinks.avro-snk-1.type = avro
# REPLACE WITH THE REMOTE MACHINE'S IP
local1.sinks.avro-snk-1.hostname = 10.88.66.4
local1.sinks.avro-snk-1.port = 10060
local1.sources = exc-src-1
local1.sinks = avro-snk-1
local1.channels = mem-ch-1
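To tie these together, a minimal sketch of how the two agents might be started (the config file names remote1.conf and local1.conf are assumptions; start remote1 first so its avro source is listening before local1 begins sending):
# on the remote machine
flume-ng agent --conf conf --conf-file remote1.conf --name remote1
# on your PC
flume-ng agent --conf conf --conf-file local1.conf --name local1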

Related

How to make Flume load files to HDFS; HDFS never closes the .tmp file and renames it

Actually I have two questions. My first question is: how do I make HDFS close the file (for example .123456789.tmp) after the entire file has been flushed by the Flume agent?
In fact, the file is never closed until I force the Flume agent to stop.
I believe there is a method using the four parameters below:
hdfs.rollSize = 0
hdfs.rollCount =0
hdfs.rollInterval = 0
hdfs.batchSize = 1000000
My second question is: my Flume agent receives files from an SFTP server, and I need to keep each file name in HDFS. It works fine with the spooldir type, but not with SFTP! Are there any ideas?
My configuration file for the Flume agent is as follows:
agent.sources = r1
agent.channels = c1
agent.sinks = k
# configure ftp source
agent.sources.r1.type = org.keedio.flume.source.mra.source.Source
agent.sources.r1.client.source = sftp
agent.sources.r1.name.server = ip
agent.sources.r1.user = user
agent.sources.r1.password = secret
agent.sources.r1.port = 22
agent.sources.r1.knownHosts = ~/.ssh/known_hosts
agent.sources.r1.work.dir = /DATA/test/flumrFTP
agent.sources.r1.fileHeader = true
agent.sources.r1.basenameHeader = true
agent.sources.r1.inputCharset = ISO-8859-1
#agent.sources.r1.batchSize = 1000
agent.sources.r1.flushlines = true
# configure sink k
agent.sinks.k.type = hdfs
agent.sinks.k.hdfs.path = hdfs://hostname:8000/user/admin/DATA/import_flume/
agent.sinks.k.hdfs.filePrefix = %{basename}
agent.sinks.k.hdfs.rollCount = 0
agent.sinks.k.hdfs.rollInterval = 0
agent.sinks.k.hdfs.rollSize = 0
agent.sinks.k.hdfs.useLocalTimeStamp = true
agent.sinks.k.hdfs.batchSize = 1000000
agent.sinks.k.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000000
agent.channels.c1.transactionCapacity = 1000000
agent.sources.r1.channels = c1
agent.sinks.k.channel = c1
Try setting the variable hdfs.rollInterval; it is the number of seconds to wait before rolling the current file.
This setting closes the file after the number of seconds you set. I set mine to 200 seconds and I am loading smaller files.
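A minimal sketch of that suggestion applied to the configuration above (the 200-second value matches the answer; the idleTimeout line is an extra assumption for closing files that go inactive):
agent.sinks.k.hdfs.rollInterval = 200
# optionally also close files after they have been idle for this many seconds
agent.sinks.k.hdfs.idleTimeout = 60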

Flume taking time to copy data into HDFS when rolling based on file size

I have a use case where I want to copy remote files into HDFS using Flume. I also want the copied files to align with the HDFS block size (128 MB/256 MB). The total size of the remote data is 33 GB.
I am using an avro source and sink to copy the remote data into HDFS, and on the sink side I am rolling based on file size (128/256 MB). But copying a file from the remote machine and storing it in HDFS (at 128/256 MB per file) takes Flume an average of 2 minutes.
Flume Configuration:
Agent on the remote machine (spooldir source, file channel, avro sink)
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
# IP address of the destination machine
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Agent on the machine where HDFS is running (avro source, file channel, HDFS sink)
### Agent1 - Avro Source and File Channel, HDFS Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between the two machines is 686 Mbps.
Can somebody please help me identify whether something is wrong in the configuration, or suggest an alternate configuration, so that copying does not take so much time?
Both agents use a file channel, so before being written to HDFS the data is written to disk twice. You can try using a memory channel for each agent to see if the performance improves.
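A minimal sketch of that change for the second agent (the capacity values are assumptions; the same substitution applies to the agent on the remote machine):
Agent1.channels = mem-channel1
Agent1.channels.mem-channel1.type = memory
Agent1.channels.mem-channel1.capacity = 1000000
Agent1.channels.mem-channel1.transactionCapacity = 100000
Agent1.sources.avro-source1.channels = mem-channel1
Agent1.sinks.hdfs-sink1.channel = mem-channel1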

Flume file_roll sink stuck after a few minutes

I am using the Flume file_roll sink type to sink a high volume of data (~10,000 events/second) via the syslogTCP source type. However, the process (a Spark Streaming job) pushing data to the syslogTCP port gets stuck after 15-20 minutes, having ingested around 1.5 million events. I have also observed a file descriptor issue on the Linux box where the flume-ng agent is running.
Below is the Flume configuration I am using:
agent2.sources = r1
agent2.channels = c1
agent2.sinks = f1
agent2.sources.r1.type = syslogtcp
agent2.sources.r1.bind = i-170d29de.aws.amgen.com
agent2.sources.r1.port = 44442
agent2.channels.c1.type = memory
agent2.channels.c1.capacity = 1000000000
agent2.channels.c1.transactionCapacity = 40000
agent2.sinks.f1.type = file_roll
agent2.sinks.f1.sink.directory = /opt/app/svc-edl-ops-ngmp-dev/rdas/flume_output
agent2.sinks.f1.sink.rollInterval = 300
agent2.sinks.f1.sink.rollSize = 104857600
agent2.sinks.f1.sink.rollCount = 0
agent2.sources.r1.channels = c1
agent2.sinks.f1.channel = c1
Because of performance issues, mainly due to the high ingestion rate, I cannot use the HDFS sink type.
This was my bad. I was using console logging, and at some point the PuTTY terminal was freezing because of a connectivity issue, causing the entire Flume agent to choke.
Redirecting the Flume console output, or using a log4j.properties that writes the output to a log file instead of the console, resolved the freezing issue.
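A minimal sketch of the two workarounds (the config file name agent2.conf is an assumption; LOGFILE is the file appender defined in the log4j.properties shipped with Flume):
# Option 1: log to a file instead of the console
flume-ng agent --conf conf --conf-file agent2.conf --name agent2 -Dflume.root.logger=INFO,LOGFILE
# Option 2: keep console logging but redirect it away from the terminal
nohup flume-ng agent --conf conf --conf-file agent2.conf --name agent2 > /dev/null 2>&1 &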

Flume adding line feed after 2048 characters in a row

I have a Flume 1.5 agent running on an Ubuntu workstation that collects logs from various devices and re-formats them into a comma-delimited file with very long rows. After the logs are collected and re-formatted, they are placed into a spool directory, from which the Flume agent sends the log file to a Hadoop server running another Flume agent that accepts the file and places it in an HDFS directory.
Everything works fine, except that when Flume sends the file to the HDFS directory there are line feeds after every 2048 characters in each row.
Below are my Flume config files.
Is there a setting to tell Flume not to insert line feeds?
#On Ubuntu Workstation
#list sources, sinks and channels in the agent
agent.sources = axon_source
agent.channels = memorychannel
agent.sinks = AvroOut
#define flow
agent.sources.axon_source.channels = memorychannel
agent.sinks.AvroOut.channel = memorychannel
agent.channels.memorychannel.type = memory
agent.channels.memorychannel.capacity = 100000
#source
agent.sources.axon_source.type = spooldir
agent.sources.axon_source.spoolDir = /home/ubuntu/workspace/logdump
agent.sources.axon_source.decodeErrorPolicy = ignore
#avro out
agent.sinks.AvroOut.type = avro
agent.sinks.AvroOut.hostname = 172.31.12.221
agent.sinks.AvroOut.port = 41415
agent.sinks.AvroOut.maxIoWorkers = 2
------------------------------------------------------------
#On Hadoop Server
agent.sources = AvroIn
agent.sources.AvroIn.type = avro
agent.sources.AvroIn.bind = 172.31.131.1
agent.sources.AvroIn.port = 41415
agent.sources.AvroIn.channels = MemChan1
agent.channels = MemChan1
agent.channels.MemChan1.type = memory
agent.channels.MemChan1.capacity = 100000
agent.sinks = HDFSSink
agent.sinks.HDFSSink.type = hdfs
agent.sinks.HDFSSink.channel = MemChan1
agent.sinks.HDFSSink.hdfs.path = /Logs/%Y%m/
agent.sinks.HDFSSink.hdfs.filePrefix = axoncapture
agent.sinks.HDFSSink.hdfs.fileSuffix = .log
agent.sinks.HDFSSink.hdfs.minBlockReplicas = 1
agent.sinks.HDFSSink.hdfs.rollCount = 0
agent.sinks.HDFSSink.hdfs.rollSize = 314572800
agent.sinks.HDFSSink.hdfs.writeFormat = Text
agent.sinks.HDFSSink.hdfs.fileType = DataStream
agent.sinks.HDFSSink.hdfs.useLocalTimeStamp = True
Found the answer to my question: the default maxLineLength for the LINE deserializer is 2048:
http://archive.cloudera.com/cdh5/cdh/5/flume-ng/FlumeUserGuide.html#line
I added the following line to my flume.conf file and it fixed the problem:
agent.sources.axon_source.deserializer.maxLineLength=60000

How to access a folder on a remote host as the spooling directory path in Flume

Here is my FlumeHadoop.conf file.
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.channels = c1
a1.sources.r1.spoolDir = /home/rabindra/idirectory
a1.sources.r1.basenameHeader=true
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 1000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 131072000
a1.sinks.k1.channel = c1
# properties of k1-sink
a1.sinks.k1.type = hdfs
#a1.sinks.k1.hdfs.path = hdfs://namenode/flumesource/source1
a1.sinks.k1.hdfs.path = hdfs://localhost/logdata
a1.sinks.k1.hdfs.filePrefix=%{basename}
a1.sinks.k1.hdfs.fileSuffix=.txt
a1.sinks.k1.rollInterval=0
a1.sinks.k1.hdfs.deletePolicy=immediate
a1.sinks.k1.hdfs.rollSize=131072000
a1.sinks.k1.hdfs.rollCount=0
a1.sinks.k1.hdfs.idleTimeout=0
a1.sinks.k1.hdfs.maxOpenFiles = 10000
On a remote host, say 192.168.7.43, there is a directory rdirectory with the path /home/alex/rdirectory.
I want to point to the remote folder rdirectory as the spooling directory.
How is this possible?
I am trying this change:
a1.sources.r1.spoolDir = alex#192.168.7.43:/home/alex/rdirectory
But this gives the exception: java.lang.IllegalStateException: Directory doesn't exists: /opt/flume/alex#192.168.7.43:/home/alex/rdirectory.
Did you try this option?
Run a Flume agent on 192.168.7.43:
spooldir [source] --> netcat [sink]
Run a Flume agent on the local machine:
netcat [source] --> HDFS [sink]
1. Install Samba on host 192.168.7.43 and configure the folder /home/alex/rdirectory so that it can be accessed as //192.168.7.43/rdirectory.
2. Mount the shared directory on the Flume agent host; then you can use the mounted directory like a local directory.
Here is an example of mounting a directory on 192.168.59.166 onto host H0045170.
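That example does not appear here; below is a minimal sketch of such a mount using the share from this question (the mount point and CIFS options are assumptions):
sudo mkdir -p /mnt/rdirectory
sudo mount -t cifs //192.168.7.43/rdirectory /mnt/rdirectory -o username=alex
# then point the spooling directory at the mounted path
a1.sources.r1.spoolDir = /mnt/rdirectory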
