I am trying to set up a simple data pipeline from a console Kafka producer to the Hadoop Distributed File System (HDFS). I am working on a 64-bit Ubuntu virtual machine and have created separate users for Hadoop and Kafka, as suggested by the guides I followed. Consuming the produced input in Kafka with a console consumer works, and HDFS seems to be up and running.
Now I want to use Flume to pipe the input into HDFS. I am using the following configuration file:
tier1.sources = source1
tier1.channels = channel1
tier1.sinks = sink1
tier1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
tier1.sources.source1.zookeeperConnect = 127.0.0.1:2181
tier1.sources.source1.topic = test
tier1.sources.source1.groupId = flume
tier1.sources.source1.channels = channel1
tier1.sources.source1.interceptors = i1
tier1.sources.source1.interceptors.i1.type = timestamp
tier1.sources.source1.kafka.consumer.timeout.ms = 2000
tier1.channels.channel1.type = memory
tier1.channels.channel1.capacity = 10000
tier1.channels.channel1.transactionCapacity = 1000
tier1.sinks.sink1.type = hdfs
tier1.sinks.sink1.hdfs.path = hdfs://flume/kafka/%{topic}/%y-%m-%d
tier1.sinks.sink1.hdfs.rollInterval = 5
tier1.sinks.sink1.hdfs.rollSize = 0
tier1.sinks.sink1.hdfs.rollCount = 0
tier1.sinks.sink1.hdfs.fileType = DataStream
tier1.sinks.sink1.channel = channel1
Now, when I run Flume with the following command:
bin/flume-ng agent --conf ./conf -f conf/flume.conf -Dflume.root.logger=DEBUG,console -n tier1
I get the same exception in the console output over and over again:
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.clients.NetworkClient.handleConnections(NetworkClient.java:467)] Completed connection to node 2147483647
2017-10-19 12:17:04,279 (lifecycleSupervisor-1-2) [DEBUG - org.apache.kafka.common.network.Selector.poll(Selector.java:307)] Connection with Ubuntu-Sandbox/127.0.1.1 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:153)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:134)
at org.apache.kafka.common.network.Selector.poll(Selector.java:286)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:256)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:320)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:213)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:193)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:163)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:311)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:890)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:853)
at org.apache.flume.source.kafka.KafkaSource.doStart(KafkaSource.java:529)
at org.apache.flume.source.BasicSourceSemantics.start(BasicSourceSemantics.java:83)
at org.apache.flume.source.PollableSourceRunner.start(PollableSourceRunner.java:71)
at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(LifecycleSupervisor.java:249)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The only way to stop Flume is to kill the Java process.
I thought it might have something to do with the separate users for Hadoop and Kafka, but even when running everything as the Kafka user I get the same result. I haven't found anything online about this EOFException either, which is strange considering that I have just followed the "Getting Started" guides and used fairly standard configurations for everything.
Maybe it has something to do with the preceding line ("Ubuntu-Sandbox/127.0.1.1 disconnected") and hence the configuration of my VM?
Any help is highly appreciated!
Have you considered using Kafka Connect (part of Apache Kafka) and the HDFS connector instead? It is generally seen as having superseded Flume, and it is easy to use, with file-based configuration similar to Flume's.
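For example, a minimal sink configuration might look like the following. This is only a sketch, assuming the Confluent HDFS connector is installed; the connector name, topic, HDFS URL, and flush size are placeholders to adapt:
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=test
hdfs.url=hdfs://localhost:8020
flush.size=1000
You would then start it with the standalone worker, e.g. connect-standalone.sh worker.properties hdfs-sink.properties.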
Related
I am trying to scale Airflow using Celery and RabbitMQ on EC2.
I am following this guide:
http://site.clairvoyantsoft.com/setting-apache-airflow-cluster/
Following is the configuration on the master node:
sql_alchemy_conn = postgresql+psycopg2://user:gues#localhost:5432/airflow
executor = CeleryExecutor
broker_url = amqp://user:gues#ip-11-222-12-117:5672
celery_result_backend = db+postgresql://user:gues#localhost:5432/airflow
Following is the configuration on the slave node:
sql_alchemy_conn = postgresql+psycopg2://user:gues#ip-11-222-12-117:5432/airflow
executor = CeleryExecutor
broker_url = amqp://user:gues#ip-11-222-12-117:5672
celery_result_backend = db+postgresql://user:gues#localhost:5432/airflow
When I run the Airflow scheduler, it works fine. But on the slave node I get the following error:
[2017-05-23 21:47:44,385: ERROR/MainProcess] consumer: Cannot connect to amqp://user:**#ip-11-222-12-117:5672//: Couldn't log in: a socket error occurred.
Trying again in 2.00 seconds..
However, I can see both nodes connected in the RabbitMQ UI.
What am I doing wrong?
Have you checked that the AMQP server is allowed to listen on anything other than the loopback interface? Please check this answer: Can't access RabbitMQ web management interface after fresh install
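By default RabbitMQ also restricts the built-in guest account to loopback connections, so remote logins can fail with exactly this kind of error. Two common fixes, sketched below (the user name and password are placeholders):
# create a dedicated user that is allowed to connect remotely
rabbitmqctl add_user airflow_user airflow_password
rabbitmqctl set_permissions -p / airflow_user ".*" ".*" ".*"
Alternatively, lift the loopback restriction in /etc/rabbitmq/rabbitmq.config:
[{rabbit, [{loopback_users, []}]}].
Then update broker_url accordingly and restart the worker.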
I am trying to ingest into HDFS using a Flume spooling directory (SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2 (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, but fails with larger ones. Please find my testing scenario below:
Files from a few KB up to 50-60 MB are processed without issue.
For files larger than 50-60 MB, Flume writes around 50 MB to HDFS and then the agent exits unexpectedly.
There is no error message in the Flume log.
I found that it tries to create the ".tmp" file (on HDFS) several times, each time writing a couple of megabytes (sometimes 2 MB, sometimes 45 MB) before the unexpected exit.
After some time, the last attempted ".tmp" file is renamed as completed (".tmp" removed) and the file in the source spoolDir is also renamed to ".COMPLETED", although the full file was not written to HDFS.
In the real scenario our files will be around 2 GB in size, so I need a robust Flume configuration that can handle that workload.
Note:
The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).
The spool directory is on the local filesystem of the same server that runs the Flume agent.
All are physical servers (not virtual).
In the same cluster we have a Twitter data feed running fine with Flume (although with a very small amount of data).
Please find below the flume.conf file I am using:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1
######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader = true
spoolDir.sources.src-1.batchSize = 100000
######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
Kindly suggest whether there is any issue with my configuration or whether I am missing something.
Or is it a known issue that the Flume SpoolDir source cannot handle bigger files?
Regards,
-Obaid
I have posted the same topic to another open community; if I get a solution there, I will update here, and vice versa.
I have tested Flume with files of several sizes and have finally come to the conclusion that Flume is not the right tool for larger files.
So I have started using the HDFS NFS Gateway instead. This works really well, and now I do not even need a spool directory on local storage: I push files directly to the NFS-mounted HDFS using scp.
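For anyone wanting to try the same approach, a rough sketch of the mount (the gateway host, edge node, and paths are placeholders; the mount options follow the Hadoop NFS Gateway documentation):
sudo mkdir -p /hdfs_nfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync <nfs_gateway_host>:/ /hdfs_nfs
scp /stage/ETL/bigfile.dat etl@<edge_node>:/hdfs_nfs/user/etl/landing/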
Hope it helps someone facing the same issue.
Thanks,
Obaid
Try using a File channel, as it is more reliable than a Memory channel.
Use the following configuration to add a File channel:
spoolDir.channels = channel-1
spoolDir.channels.channel-1.type = file
spoolDir.channels.channel-1.checkpointDir = /mnt/flume/checkpoint
spoolDir.channels.channel-1.dataDirs = /mnt/flume/data
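You may also want to size the channel explicitly; the capacity and transactionCapacity values below are illustrative (they match the File channel defaults), not tuned numbers:
spoolDir.channels.channel-1.capacity = 1000000
spoolDir.channels.channel-1.transactionCapacity = 10000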
I'm using Impala with Flume as a filestream.
The problem is that Flume is adding temporary files with the extension .tmp, and when they are deleted, Impala queries fail with the following message:
Backend 0:Failed to open HDFS file
hdfs://localhost:8020/user/hive/../FlumeData.1420040201733.tmp
Error(2): No such file or directory
How can I make Impala ignore these .tmp files, or make Flume not write them, or write them to another directory?
Flume configuration:
### Agent2 - Avro Source and File Channel, hdfs Sink ###
# Name the components on this agent
Agent2.sources = avro-source
Agent2.channels = file-channel
Agent2.sinks = hdfs-sink
# Describe/configure Source
Agent2.sources.avro-source.type = avro
Agent2.sources.avro-source.hostname = 0.0.0.0
Agent2.sources.avro-source.port = 11111
Agent2.sources.avro-source.bind = 0.0.0.0
# Describe the sink
Agent2.sinks.hdfs-sink.type = hdfs
Agent2.sinks.hdfs-sink.hdfs.path = hdfs://localhost:8020/user/hive/table/
Agent2.sinks.hdfs-sink.hdfs.rollInterval = 0
Agent2.sinks.hdfs-sink.hdfs.rollCount = 10000
Agent2.sinks.hdfs-sink.hdfs.fileType = DataStream
#Use a channel which buffers events in file
Agent2.channels.file-channel.type = file
Agent2.channels.file-channel.checkpointDir = /home/ubutnu/flume/checkpoint/
Agent2.channels.file-channel.dataDirs = /home/ubuntu/flume/data/
# Bind the source and sink to the channel
Agent2.sources.avro-source.channels = file-channel
Agent2.sinks.hdfs-sink.channel = file-channel
I had this problem once.
I upgraded Hadoop and Flume and it was solved (from Cloudera Hadoop CDH 5.2 to CDH 5.3).
Try upgrading Hadoop, Flume, or Impala.
Also check that your Flume configuration matches your Flume version; that was my problem.
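If upgrading is not an option, one workaround to try: the Flume HDFS sink can prefix in-flight files so that query engines treat them as hidden. Assuming your Impala/Hive setup skips files whose names start with an underscore or a dot (most versions do), something like the line below should keep queries from touching the temporary files:
Agent2.sinks.hdfs-sink.hdfs.inUsePrefix = _
With this, files being written appear as _FlumeData...tmp and should be ignored until Flume renames them on roll.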
I am trying to run Oryx on top of Hadoop using Google's Cloud Storage Connector for Hadoop:
https://cloud.google.com/hadoop/google-cloud-storage-connector
I prefer to use Hadoop 2.4.1 with Oryx, so I use the hadoop2_env.sh setup for the Hadoop cluster I create on Google Compute Engine, e.g.:
./bdutil -b <BUCKET_NAME> -n 2 --env_var_files hadoop2_env.sh \
--default_fs gs --prefix <PREFIX_NAME> deploy
I face two main problems when I try to run Oryx using Hadoop.
1) Despite confirming that my Hadoop conf directory matches what is expected for the Google installation on Compute Engine, e.g.:
$ echo $HADOOP_CONF_DIR
/home/hadoop/hadoop-install/etc/hadoop
I still find something is looking for a /conf directory, e.g.:
Caused by: java.lang.IllegalStateException: Not a directory: /etc/hadoop/conf
My understanding is that ../etc/hadoop should be the /conf directory, e.g.:
hadoop: configuration files
And while I shouldn't need to make any changes, this problem is only resolved when I copy the config files into a newly created directory, e.g.:
sudo mkdir /etc/hadoop/conf
sudo cp /home/hadoop/hadoop-install/etc/hadoop/* /etc/hadoop/conf
So why is this? Is it a result of using the Google Hadoop connector?
2) After "resolving" the issue above, I find additional errors which seem (to me) to be related to communication between the Hadoop cluster and the Google file system:
Wed Oct 01 20:18:30 UTC 2014 WARNING Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Wed Oct 01 20:18:30 UTC 2014 INFO Namespace prefix: hdfs://BUCKET_NAME
Wed Oct 01 20:18:30 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: resistance-prediction
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:258)
at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:153)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:602)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:547)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:139)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
... 9 more
Caused by: java.net.UnknownHostException: BUCKET_NAME
... 22 more
What seems relevant to me is that the namespace prefix is hdfs:// even though I set the default file system to gs://.
Perhaps this is leading to the UnknownHostException?
Note that I have "confirmed" that the Hadoop cluster is connected to the Google file system, e.g.:
hadoop fs -ls
yields the contents of my Google Cloud bucket and all the expected contents of the gs://BUCKET_NAME directory. However, I am not familiar with the Google manifestation of Hadoop via the connector, and the traditional way I test whether the Hadoop cluster is running, i.e.:
jps
only yields
6440 Jps
rather than listing all the nodes. However, I am running this command from the master node of the Hadoop cluster (PREFIX_NAME-m), and I am not sure of the expected output when using the Google Cloud Storage connector for Hadoop.
So, how can I resolve these errors and have my Oryx job (via Hadoop) successfully access data in my gs://BUCKET_NAME directory?
Thanks in advance for any insights or suggestions.
UPDATE:
Thanks for the very detailed response. As a workaround I "hard-coded" gs:// into Oryx by changing:
prefix = "hdfs://" + host + ':' + port;
} else {
prefix = "hdfs://" + host;
to:
prefix = "gs://" + host + ':' + port;
} else {
prefix = "gs://" + host;
I now get the following errors:
Tue Oct 14 20:24:50 UTC 2014 SEVERE Unexpected error in execution
java.lang.ExceptionInInitializerError
at com.cloudera.oryx.common.servcomp.StoreUtils.listGenerationsForInstance(StoreUtils.java:50)
at com.cloudera.oryx.computation.PeriodicRunner.run(PeriodicRunner.java:173)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1905)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2573)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2586)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2625)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2607)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:76)
at com.cloudera.oryx.common.servcomp.Store.<init>(Store.java:57)
As per the instructions here: https://cloud.google.com/hadoop/google-cloud-storage-connector#classpath, I believe I have added the connector jar to Hadoop's classpath; I added:
HADOOP_CLASSPATH=$HADOOP_CLASSPATH:'https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.9-hadoop2.jar
to /home/rich/hadoop-env-setup.sh, and echo $HADOOP_CLASSPATH yields:
/contrib/capacity-scheduler/.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar:/contrib/capacity-scheduler/.jar:/home/hadoop/hadoop-install/share/hadoop/common/lib/gcs-connector-1.2.9-hadoop2.jar
Do I need to add more to the classpath?
I also note (perhaps related) that I still get the error for /etc/hadoop/conf even with the export commands. I have been using sudo mkdir /etc/hadoop/conf as a temporary workaround. I mention it here in case it is leading to additional issues.
There appear to be a couple of problems. The first is that normally, when things are run under hadoop jar, Hadoop injects the various system environment variables, classpaths, etc. into the program being run; in your case, since Oryx runs without using hadoop jar, instead using something like:
java -Dconfig.file=oryx.conf -jar computation/target/oryx-computation-x.y.z.jar
then $HADOOP_CONF_DIR doesn't actually make it into the environment, so System.getenv in OryxConfiguration.java fails to pick it up and falls back to the default /etc/hadoop/conf value. This is solved simply with the export command, which you can test by checking whether it makes it into a subshell:
echo $HADOOP_CONF_DIR
bash -c 'echo $HADOOP_CONF_DIR'
export HADOOP_CONF_DIR
bash -c 'echo $HADOOP_CONF_DIR'
java -Dconfig.file=oryx.conf -jar computation/target/oryx-computation-x.y.z.jar
The second, and more unfortunate, issue is that Oryx appears to hard-code 'hdfs' rather than allowing any filesystem scheme set by the user:
private Namespaces() {
Config config = ConfigUtils.getDefaultConfig();
boolean localData;
if (config.hasPath("model.local")) {
log.warn("model.local is deprecated; use model.local-data");
localData = config.getBoolean("model.local");
} else {
localData = config.getBoolean("model.local-data");
}
if (localData) {
prefix = "file:";
} else {
URI defaultURI = FileSystem.getDefaultUri(OryxConfiguration.get());
String host = defaultURI.getHost();
Preconditions.checkNotNull(host,
"Hadoop FS has no host? Did you intent to set model.local-data=true?");
int port = defaultURI.getPort();
if (port > 0) {
prefix = "hdfs://" + host + ':' + port;
} else {
prefix = "hdfs://" + host;
}
}
log.info("Namespace prefix: {}", prefix);
}
It all depends on whether Oryx intends to add support for other filesystem schemes in the future, but in the meantime you would either have to change the Oryx code yourself and recompile, or you could attempt to hack around it (with the potential for pieces of Oryx that have a hard dependency on HDFS to fail).
The change to Oryx should theoretically just be:
String scheme = defaultURI.getScheme();
if (port > 0) {
prefix = scheme + "://" + host + ':' + port;
} else {
prefix = scheme + "://" + host;
}
However, if you do go this route, keep in mind the eventual list-consistency semantics of GCS, where multi-stage workflows must not rely on "list" operations to immediately find all the outputs of a previous stage; Oryx may or may not have such a dependency.
The most reliable solution in your case would be to deploy with --default_fs hdfs, where bdutil will still install the gcs-connector, so that you can run hadoop distcp to move your data from GCS to HDFS temporarily, run Oryx, and then, once finished, copy the results back out into GCS.
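A sketch of that workflow (the bucket and paths are placeholders):
hadoop distcp gs://BUCKET_NAME/input hdfs:///user/oryx/input
# ... run Oryx against the hdfs:// paths ...
hadoop distcp hdfs:///user/oryx/output gs://BUCKET_NAME/output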
I think I have tried every combination of altering my config file. I also saw somewhere that it might be due to my replication factor being 3, so I changed it to 1. I am using Cloudera Manager on AWS.
In HDFS the file sizes are all under 20 KB, and I am trying to get at least 40-50 MB. What is funny is that the same config file writes ~60 MB files on the virtual machine I was practicing with (pre-installed Hadoop + tools). See below for my config file; any ideas?
# The configuration file needs to define the sources,
# the channels and the sinks.
# Sources, channels and sinks are defined per agent,
# in this case called 'TwitterAgent'
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.consumerSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessToken = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.accessTokenSecret = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
TwitterAgent.sources.Twitter.keywords = apple, grapes, fruits, strawberry, mango, pear
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://123.456.789.us-west-2.compute.amazonaws.com:8020/user/flume/tweets
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 0
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 0
TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 1000
If rollInterval, batchSize, rollSize, and rollCount are not working, the remaining thing to look at is hdfs.callTimeout.
Someone said that reducing the replication factor could be a solution, and reducing the replication factor means reducing HDFS operation time; according to the Flume user guide, the default value of callTimeout is 10000 milliseconds.
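So raising the timeout is worth a try; the value below is only an illustration to rule timeouts out, not a tuned number:
TwitterAgent.sinks.HDFS.hdfs.callTimeout = 180000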
Other clues are:
How-to: Do Apache Flume Performance Tuning (Part 1)
How can I force Flume-NG to process the backlog of events after a sink failed?
Using an HDFS Sink and rollInterval in Flume-ng to batch up 90 seconds of log information
So I finally figured out the issue (note that I am running a single-node test cluster). One of the solutions on Stack Overflow was to set the dfs.replication factor to 1, which I did, but that alone did not solve the problem.
For some reason there was a mismatch in configs in my Flume agent. The HDFS sink has a parameter called minBlockReplicas, which tells it how many block replicas are necessary; if it is not specified, it pulls that parameter from the default HDFS configuration file (which I thought I had set to 1). It looks like it was getting a different value for dfs.replication or for dfs.namenode.replication.min.
I circumvented the error by modifying my Flume file directly, adding:
TwitterAgent.sinks.HDFS.hdfs.minBlockReplicas = 1
Hope this helps.
Yes, adding this line resolved my issue of Flume creating multiple small files on HDFS:
a1.sinks.HDFS.hdfs.minBlockReplicas = 1