Configure sink elasticsearch apache-flume - elasticsearch

This is my first time here, so sorry if I don't post fine, and sorry for my bad English.
I'm trying to configure Apache Flume and Elasticsearch sinks. Everything is ok, it seems that it runs fine, but there are 2 warnings when I start an agent; the following ones:
2015-11-16 09:11:22,122 (lifecycleSupervisor-1-3) [ERROR - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:253)] Unable to start SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#ce359aa counterGroup:{ name:null counters:{} } } - Exception follows.
java.lang.NoSuchMethodError: org.elasticsearch.common.transport.InetSocketTransportAddress.<init>(Ljava/lang/String;I)V
at org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.configureHostnames(ElasticSearchTransportClient.java:143)
at org.apache.flume.sink.elasticsearch.client.ElasticSearchTransportClient.<init>(ElasticSearchTransportClient.java:77)
at org.apache.flume.sink.elasticsearch.client.ElasticSearchClientFactory.getClient(ElasticSearchClientFactory.java:48)
at org.apache.flume.sink.elasticsearch.ElasticSearchSink.start(ElasticSearchSink.java:357)
at org.apache.flume.sink.DefaultSinkProcessor.start(DefaultSinkProcessor.java:46)
at org.apache.flume.SinkRunner.start(SinkRunner.java:79)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:251)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2015-11-16 09:11:22,137 (lifecycleSupervisor-1-3) [WARN - org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:260)] Component SinkRunner: { policy:org.apache.flume.sink.DefaultSinkProcessor#ce359aa counterGroup:{ name:null counters:{} } } stopped, since it could not besuccessfully started due to missing dependencies
My agent configuration:
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink ES
a1.sinks = k1
a1.sinks.k1.type = elasticsearch
a1.sinks.k1.hostNames = 127.0.0.1:9200,127.0.0.2:9300
a1.sinks.k1.indexName = items
a1.sinks.k1.indexType = item
a1.sinks.k1.clusterName = elasticsearch
a1.sinks.k1.batchSize = 500
a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer=org.apache.flume.sink.elasticsearch.ElasticSearchDynamicSerializer
a1.sinks.k1.channel = c1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
It starts the netcat and all is fine, but I'm afraid about theses warnings, I don't understand it.

I found a reason, it seems that Apache Flume 1.6.0 and Elasticsearch 2.0 cant communicate right.
I found a good sink from a 3rd person that I modified.
Here is the link
And this is my final configuration,
# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1
# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
# Describe the sink ES
a1.sinks = k1
a1.sinks.k1.type = com.frontier45.flume.sink.elasticsearch2.ElasticSearchSink
a1.sinks.k1.hostNames = 127.0.0.1:9300
a1.sinks.k1.indexName = items
a1.sinks.k1.indexType = item
a1.sinks.k1.clusterName = elasticsearch
a1.sinks.k1.batchSize = 500
a1.sinks.k1.ttl = 5d
a1.sinks.k1.serializer = com.frontier45.flume.sink.elasticsearch2.ElasticSearchDynamicSerializer
a1.sinks.k1.indexNameBuilder = com.frontier45.flume.sink.elasticsearch2.TimeBasedIndexNameBuilder
a1.sinks.k1.channel = c1
# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
It works for me.
Thanks for answers.
P.S. yes, I had to move the libraries.

Attending to the logs, there is a problem with some missing dependency.
If you have a look to the ElasticSearchSink documentation, you'll see the following:
The elasticsearch and lucene-core jars required for your environment must be placed in the lib directory of the Apache Flume installation. Elasticsearch requires that the major version of the client JAR match that of the server and that both are running the same minor version of the JVM. SerializationExceptions will appear if this is incorrect. To select the required version first determine the version of elasticsearch and the JVM version the target cluster is running. Then select an elasticsearch client library which matches the major version. A 0.19.x client can talk to a 0.19.x cluster; 0.20.x can talk to 0.20.x and 0.90.x can talk to 0.90.x. Once the elasticsearch version has been determined then read the pom.xml file to determine the correct lucene-core JAR version to use. The Flume agent which is running the ElasticSearchSink should also match the JVM the target cluster is running down to the minor version.
Most probably you did not place the required Java jars, or the version is not the appropriate one.

Added below 2 JARs only in flume/lib dir and it worked, do not have to add all other Lucene JARs:
elasticsearch-1.7.1.jar
lucene-core-4.10.4.jar
command to start flume:
bin/flume-ng agent --conf conf --conf-file conf/flume-aggregator.conf --name agent2 -Dflume.root.logger=INFO,console
make sure to add below to flume-env.sh
export JAVA_HOME=/usr/java/default
export JAVA_OPTS="-Xms3072m -Xmx3072m -XX:MaxPermSize=48m -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=5445 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
FLUME_CLASSPATH="/usr/flume/flume1.6/apache-flume-1.6.0-bin/;/usr/flume/flume1.6/apache-flume-1.6.0-bin/lib"
Flume aggregator config to load data in ES : flume-aggregator.conf
agent2.sources = source1
agent2.sinks = sink1
agent2.channels = channel1
################################################
# Describe Source
################################################
# Source Avro
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 0.0.0.0
agent2.sources.source1.port = 9997
################################################
# Describe Interceptors
################################################
# an example of nginx access log regex match
# agent2.sources.source1.interceptors = interceptor1
# agent2.sources.source1.interceptors.interceptor1.type = regex_extractor
#
# agent2.sources.source1.interceptors.interceptor1.regex = "^(\\S+) \\[(.*?)\\] \"(.*?)\" (\\S+) (\\S+)( \"(.*?)\" \"(.*?)\")?"
#
# # agent2.sources.source1.interceptors.interceptor1.regex = ^(.*) ([a-zA-Z\\.\\#\\-\\+_%]+) ([a-zA-Z\\.\\#\\-\\+_%]+) \\[(.*)\\] \\"(POST|GET) ([A-Za-z0-9\\$\\.\\+\\##%_\\/\\-]*)\\??(.*) (.*)\\" ([a-zA-Z0-9\\.\\/\\s\-]*) (.*) ([0-9]+) ([0-9]+) ([0-9\\.]+)
# # agent2.sources.source1.interceptors.interceptor1.serializers = s1 s2 s3 s4 s5 s6 s7 s8 s9 s10 s11 s12 s13
#
# agent2.sources.source1.interceptors.interceptor1.serializers = s1 s2 s3 s4 s5 s6 s7 s8
# agent2.sources.source1.interceptors.interceptor1.serializers.s1.name = clientip
# agent2.sources.source1.interceptors.interceptor1.serializers.s2.name = datetime
# agent2.sources.source1.interceptors.interceptor1.serializers.s3.name = method
# agent2.sources.source1.interceptors.interceptor1.serializers.s4.name = request
# agent2.sources.source1.interceptors.interceptor1.serializers.s5.name = response
# agent2.sources.source1.interceptors.interceptor1.serializers.s6.name = status
# agent2.sources.source1.interceptors.interceptor1.serializers.s7.name = bytes
# agent2.sources.source1.interceptors.interceptor1.serializers.s8.name = requesttime
#
################################################
# Describe Sink
################################################
# Sink ElasticSearch
# Elasticsearch lib ---> flume/lib
# elasticsearch/config/elasticsearch.yml cluster.name clusterName. data/clustername data.
agent2.sinks.sink1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent2.sinks.sink1.hostNames = 10.20.156.16:9300,10.20.176.20:9300
agent2.sinks.sink1.indexName = pdi
agent2.sinks.sink1.indexType = pdi_metrics
agent2.sinks.sink1.clusterName = My-ES-CLUSTER
agent2.sinks.sink1.batchSize = 1000
agent2.sinks.sink1.ttl = 2
#this serializer is crucial in order to use kibana
agent2.sinks.sink1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
################################################
# Describe Channel
################################################
# Channel Memory
agent2.channels.channel1.type = memory
agent2.channels.channel1.capacity = 10000000
agent2.channels.channel1.transactionCapacity = 1000
################################################
# Bind the source and sink to the channel
################################################
agent2.sources.source1.channels = channel1
agent2.sinks.sink1.channel = channel1

Related

FlumeData file not getting created in HDFS sink

I am trying to ingest real time data using Kafka as source and flume as sink.Sink type is HDFS. My producer is working fine,i can see the data being produced and my agent is running fine(no error while running the command) but the file is not getting generated in specified directory.
Command for Starting flume agent:
/usr/hdp/2.5.0.0-1245/flume/bin/flume-ng agent -c /usr/hdp/2.5.0.0-1245/flume/conf -f /usr/hdp/2.5.0.0-1245/flume/conf/flume-hdfs.conf -n tier1
And my flume-hdfs.conf file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = localhost:2181
tier1.sources.source1.topic = data_1
tier1.sources.source1.channels = channel1
tier1.channels.channel1.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.channel1.brokerList = localhost:6667
tier1.channels.channel1.zookeeperConnect = localhost:2181
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = /user/user_name/FLUME_LOGS/
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
I am not able to find out what is wrong with the execution.
Please suggest how to overcome this problem.
Set the path of HDFS sink this way:
tier1.sinks.sink1.hdfs.path = "VALUE of fs.default.name, located in core-site.xml"/user/user_name/FLUME_LOGS/
For example
tier1.sinks.sink1.hdfs.path = hdfs://localhost:54310/user/user_name/FLUME_LOGS/

flume taking time to copy data into hdfs when rolling based on file size

I have a usecase where i want to copy remote file into hdfs using flume. I also want that the copied files should align with the HDFS block size (128MB/256MB).Total size of remote data is 33GB.
I am using avro source and sink to copy remote data into hdfs. Similarly from sink side i am doing file size rolling(128,256).but for copying file from remote machine and storing it into hdfs(file size 128/256 MB) flume is taking an avg of 2 min.
Flume Configuration:
Avro Source(Remote Machine)
### Agent1 - Spooling Directory Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = spooldir-source
Agent1.channels = file-channel
Agent1.sinks = avro-sink
# Describe/configure Source
Agent1.sources.spooldir-source.type = spooldir
Agent1.sources.spooldir-source.spoolDir =/home/Benchmarking_Simulation/test
# Describe the sink
Agent1.sinks.avro-sink.type = avro
Agent1.sinks.avro-sink.hostname = xx.xx.xx.xx #IP Address destination machine
Agent1.sinks.avro-sink.port = 50000
#Use a channel which buffers events in file
Agent1.channels.file-channel.type = file
Agent1.channels.file-channel.checkpointDir = /home/Flume_CheckPoint_Dir/
Agent1.channels.file-channel.dataDirs = /home/Flume_Data_Dir/
Agent1.channels.file-channel.capacity = 10000000
Agent1.channels.file-channel.transactionCapacity=50000
# Bind the source and sink to the channel
Agent1.sources.spooldir-source.channels = file-channel
Agent1.sinks.avro-sink.channel = file-channel
Avro Sink(Machine where hdfs running)
### Agent1 - Avro Source and File Channel, Avro Sink ###
# Name the components on this agent
Agent1.sources = avro-source1
Agent1.channels = file-channel1
Agent1.sinks = hdfs-sink1
# Describe/configure Source
Agent1.sources.avro-source1.type = avro
Agent1.sources.avro-source1.bind = xx.xx.xx.xx
Agent1.sources.avro-source1.port = 50000
# Describe the sink
Agent1.sinks.hdfs-sink1.type = hdfs
Agent1.sinks.hdfs-sink1.hdfs.path =/user/Benchmarking_data/multiple_agent_parallel_1
Agent1.sinks.hdfs-sink1.hdfs.rollInterval = 0
Agent1.sinks.hdfs-sink1.hdfs.rollSize = 130023424
Agent1.sinks.hdfs-sink1.hdfs.rollCount = 0
Agent1.sinks.hdfs-sink1.hdfs.fileType = DataStream
Agent1.sinks.hdfs-sink1.hdfs.batchSize = 50000
Agent1.sinks.hdfs-sink1.hdfs.txnEventMax = 40000
Agent1.sinks.hdfs-sink1.hdfs.threadsPoolSize=1000
Agent1.sinks.hdfs-sink1.hdfs.appendTimeout = 10000
Agent1.sinks.hdfs-sink1.hdfs.callTimeout = 200000
#Use a channel which buffers events in file
Agent1.channels.file-channel1.type = file
Agent1.channels.file-channel1.checkpointDir = /home/Flume_Check_Point_Dir
Agent1.channels.file-channel1.dataDirs = /home/Flume_Data_Dir
Agent1.channels.file-channel1.capacity = 100000000
Agent1.channels.file-channel1.transactionCapacity=100000
# Bind the source and sink to the channel
Agent1.sources.avro-source1.channels = file-channel1
Agent1.sinks.hdfs-sink1.channel = file-channel1
Network connectivity between both machine is 686 Mbps.
Can somebody please help me to identify whether something is wrong in the configuration or an alternate configuration so that the copying doesn't take so much of time.
Both agents use file channel. So before writing to HDFS, data has been written to disk twice. You can try to use a memory channel for each agent to see if the performance is improved.

Apache Flume spoolDirectory configuration is failing

I am using the following code to write the files in my source directory to hdfs.
# Initialize agent's source, channel and sink
agent.sources = test
agent.channels = memoryChannel
agent.sinks = flumeHDFS
# Setting the source to spool directory where the file exists
agent.sources.test.type = spooldir
agent.sources.test.spoolDir = /Data
# Setting the channel to memory
agent.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
agent.channels.memoryChannel.capacity = 10000
# agent.channels.memoryChannel.batchSize = 15000
agent.channels.memoryChannel.transactioncapacity = 1000000
# Setting the sink to HDFS
agent.sinks.flumeHDFS.type = hdfs
agent.sinks.flumeHDFS.hdfs.path = /user/team
agent.sinks.flumeHDFS.hdfs.fileType = DataStream
# Write format can be text or writable
agent.sinks.flumeHDFS.hdfs.writeFormat = Text
# use a single csv file at a time
agent.sinks.flumeHDFS.hdfs.maxOpenFiles = 1
# rollover file based on maximum size of 10 MB
agent.sinks.flumeHDFS.hdfs.rollCount=0
agent.sinks.flumeHDFS.hdfs.rollInterval=2000
agent.sinks.flumeHDFS.hdfs.rollSize = 0
agent.sinks.flumeHDFS.hdfs.batchSize =1000000
# never rollover based on the number of events
agent.sinks.flumeHDFS.hdfs.rollCount = 0
# rollover file based on max time of 1 min
#agent.sinks.flumeHDFS.hdfs.rollInterval = 0
# agent.sinks.flumeHDFS.hdfs.idleTimeout = 600
# Connect source and sink with channel
agent.sources.TwitterExampleDir.channels = memoryChannel
agent.sinks.flumeHDFS.channel = memoryChannel
But I am getting the following error
: Failed to configure component!
org.apache.flume.conf.ConfigurationException: Failed to configure
component!
at org.apache.flume.conf.source.SourceConfiguration.configure(SourceConfigurati
on.java:110)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.validateSources(
FlumeConfiguration.java:566)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.isValid(FlumeCon
figuration.java:345)
at org.apache.flume.conf.FlumeConfiguration$AgentConfiguration.access$000(Flume
Configuration.java:212)
at org.apache.flume.conf.FlumeConfiguration.validateConfiguration(FlumeConfigur
ation.java:126)
at org.apache.flume.conf.FlumeConfiguration.(FlumeConfiguration.java:108)
at org.apache.flume.node.PropertiesFileConfigurationProvider.getFlumeConfigurat
ion(PropertiesFileConfigurationProvider.java:193)
at org.apache.flume.node.AbstractConfigurationProvider.getConfiguration(Abstrac
tConfigurationProvider.java:94)
at org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcher
Runnable.run(PollingPropertiesFileConfigurationProvider.java:140)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$
301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Sch
eduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:11
42)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6
17)
at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.flume.conf.ConfigurationException: No channels set for test
at org.apache.flume.conf.source.SourceConfiguration.configure(SourceConfigurati
on.java:68)
... 15 more
Can anyone help what should I do to put my data from the source directory /Data to hdfs directory /user/team?
Stack Trace mention:
No channels set for test
You have a specified sources as test
agent.sources = test
But while connecting it to channel
agent.sources.TwitterExampleDir.channels = memoryChannel
so you have to mention test instead of TwitterExampleDir

flume loss data when collect online data to hdfs

I used flume-ng 1.5 version to collect logs.
There are two agents in the data flow and they are on two hosts, respectively.
And the data is sended from agent1 to agent2.
The agents's component is as follows:
agent1: spooling dir source --> file channel --> avro sink
agent2: avro source --> file channel --> hdfs sink
But it seems to loss data about 1/1000 percentage of million data.
To solve problem I tried these steps:
look up agents log: cannot find any error or exception.
look up agents monitor metrics: the events number that put and take from channel always equals
statistic the data number by hive query and hdfs file use shell, respectively: the two number is equal and less than the online data number
agent1's configuration:
#agent
agent1.sources = src_spooldir
agent1.channels = chan_file
agent1.sinks = sink_avro
#source
agent1.sources.src_spooldir.type = spooldir
agent1.sources.src_spooldir.spoolDir = /data/logs/flume-spooldir
agent1.sources.src_spooldir.interceptors=i1
#interceptors
agent1.sources.src_spooldir.interceptors.i1.type=regex_extractor
agent1.sources.src_spooldir.interceptors.i1.regex=(\\d{4}-\\d{2}-\\d{2}).*
agent1.sources.src_spooldir.interceptors.i1.serializers=s1
agent1.sources.src_spooldir.interceptors.i1.serializers.s1.name=dt
#sink
agent1.sinks.sink_avro.type = avro
agent1.sinks.sink_avro.hostname = 10.235.2.212
agent1.sinks.sink_avro.port = 9910
#channel
agent1.channels.chan_file.type = file
agent1.channels.chan_file.checkpointDir = /data/flume/agent1/checkpoint
agent1.channels.chan_file.dataDirs = /data/flume/agent1/data
agent1.sources.src_spooldir.channels = chan_file
agent1.sinks.sink_avro.channel = chan_file
agent2's configuration
# agent
agent2.sources = source1
agent2.channels = channel1
agent2.sinks = sink1
# source
agent2.sources.source1.type = avro
agent2.sources.source1.bind = 10.235.2.212
agent2.sources.source1.port = 9910
# sink
agent2.sinks.sink1.type= hdfs
agent2.sinks.sink1.hdfs.fileType = DataStream
agent2.sinks.sink1.hdfs.filePrefix = log
agent2.sinks.sink1.hdfs.path = hdfs://hnd.hadoop.jsh:8020/data/%{dt}
agent2.sinks.sink1.hdfs.rollInterval = 600
agent2.sinks.sink1.hdfs.rollSize = 0
agent2.sinks.sink1.hdfs.rollCount = 0
agent2.sinks.sink1.hdfs.idleTimeout = 300
agent2.sinks.sink1.hdfs.round = true
agent2.sinks.sink1.hdfs.roundValue = 10
agent2.sinks.sink1.hdfs.roundUnit = minute
# channel
agent2.channels.channel1.type = file
agent2.channels.channel1.checkpointDir = /data/flume/agent2/checkpoint
agent2.channels.channel1.dataDirs = /data/flume/agent2/data
agent2.sinks.sink1.channel = channel1
agent2.sources.source1.channels = channel1
Any suggestions are welcome!
there is a bug in file line deseriazer when encounter some specific character of utf which point is between U+10000 and U+10FFFF, they represent in utf16 by two 16-bit code unit called surrogate pairs.

Flume Tail a File

I am new to Flume-Ng and need help to tail a file. I have a cluster running hadoop with flume running remotely. I communicate to this cluster by using putty. I want to tail a file on my PC and put it on the HDFS in the cluster. I am using the following code to this.
#flume.conf: http source, hdfs sink
# Name the components on this agent
tier1.sources = r1
tier1.sinks = k1
tier1.channels = c1
# Describe/configure the source
tier1.sources.r1.type = exec
tier1.sources.r1.command = tail -F /(Path to file on my PC)
# Describe the sink
tier1.sinks.k1.type = hdfs
tier1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
tier1.sinks.k1.hdfs.filePrefix = events-
tier1.sinks.k1.hdfs.round = true
tier1.sinks.k1.hdfs.roundValue = 10
tier1.sinks.k1.hdfs.roundUnit = minute
# Use a channel which buffers events in memory
tier1.channels.c1.type = memory
tier1.channels.c1.capacity = 1000
tier1.channels.c1.transactionCapacity = 100
# Bind the source and sink to the channel
tier1.sources.r1.channels = c1
tier1.sinks.k1.channel = c1
I believe the mistake is in the source. This kind source does not take the host name or i.p to look for(in this case should be my PC). Could someone just give me a hint as to how to tail a file on my PC to upload it to the remotely located HDFS using flume.
The exec source in your configuration will run on the machine where you start the flume's tier1 agent. If you want to collect data from another machine, you'll need to start a flume agent on that machine too; to sum up you need:
an agent (remote1) running on the remote machine that has an avro source, which will listen for events from collector agents and will act like an aggregator.
an agent (local1) running on your machine (to act like a collector) that has an exec source and sends data to the remote agent via avro sink.
Or alternatively, you can have only one flume agent running on your local machine (having the same configuration you posted) and set the hdfs path as "hdfs://REMOTE_IP/hdfs/path" (though I'm not entirely sure this will work).
edit:
Below are the sample configurations for the 2-agents scenario (they may not work without some modification).
remote1.channels.mem-ch-1.type = memory
remote1.sources.avro-src-1.channels = mem-ch-1
remote1.sources.avro-src-1.type = avro
remote1.sources.avro-src-1.port = 10060
remote1.sources.avro-src-1.bind = 10.88.66.4 /* REPLACE WITH YOUR MACHINE'S EXTERNAL IP */
remote1.sinks.k1.channel = mem-ch-1
remote1.sinks.k1.type = hdfs
remote1.sinks.k1.hdfs.path = /user/ntimbadi/flume/
remote1.sinks.k1.hdfs.filePrefix = events-
remote1.sinks.k1.hdfs.round = true
remote1.sinks.k1.hdfs.roundValue = 10
remote1.sinks.k1.hdfs.roundUnit = minute
remote1.sources = avro-src-1
remote1.sinks = k1
remote1.channels = mem-ch-1
and
local1.channels.mem-ch-1.type = memory
local1.sources.exc-src-1.channels = mem-ch-1
local1.sources.exc-src-1.type = exec
local1.sources.exc-src-1.command = tail -F /(Path to file on my PC)
local1.sinks.avro-snk-1.channel = mem-ch-1
local1.sinks.avro-snk-1.type = avro
local1.sinks.avro-snk-1.hostname = 10.88.66.4 /* REPLACE WITH REMOTE IP */
local1.sinks.avro-snk-1.port = 10060
local1.sources = exc-src-1
local1.sinks = avro-snk-1
local1.channels = mem-ch-1

Resources