Set replication in Hadoop - hadoop

I was trying loading file using hadoop API as an experiment.
I want to set replication to minimum as this one is for experiment.
I first tried this with FileSystem.setReplication():
Configuration config = new Configuration();
config.set("fs.defaultFS","hdfs://192.168.248.166:8020");
FileSystem dfs2 = FileSystem.get(config);
Path src2 = new Path("C:\\Users\\abc\\Desktop\\testfile.txt");
Path dst2 = new Path(dfs2.getWorkingDirectory()+"/tempdir");
dfs2.copyFromLocalFile(src2, dst2);
dfs2.setReplication(dst2, (short)1); /**setting replication**/
The replica was shown as 1, but it was available on 3 datanodes.
When I tried it with Configuration.set():
Configuration config = new Configuration();
config.set("fs.defaultFS","hdfs://192.168.248.166:8020");
config.set("dfs.replication", "1"); /**setting replication**/
FileSystem dfs2 = FileSystem.get(config);
Path src2 = new Path("C:\\Users\\abc\\Desktop\\testfile.txt");
Path dst2 = new Path(dfs2.getWorkingDirectory()+"/tempdir");
This gave the desired outcome (1 replica available on 1 datanode)
Why there are two APIs for the same thing?
What is the difference between these two?

The difference is that Filesystem's setReplication() sets the replication of an existing file on HDFS. In your case, you first copy the local file testFile.txt to HDFS, using the default replication factor (3) and then change the replication factor of this file to 1. After this command, it takes a while until the over-replicated blocks get deleted. (source)
On the other hand, when you use the config.set("dfs.replication", "1"); command to set the replication, you can copy the local file after that, so its blocks get copied just once, from the first time.
In other words, I believe (but I might be wrong) that both commands have the same final result, but you have to wait a little bit until the first one is carried out.

Related

How to write data in real time to HDFS using Flume?

I am using Flume to store sensor data in HDFS. Once the data is received through MQTT. The subscriber posts the data in JSON format to Flume HTTP listener. It is currently working fine, but the problem is that flume is not writing to HDFS file till I stop it (or the size of the file reachs 128MB). I am using Hive to apply a schema on read. Unfortunately, the resulting hive table contains only 1 entry. This is normal because Flume did not write new coming data to file (loaded by Hive).
Is there any manner to force Flume to write new coming data to HDFS in a near-real time way? So, I don't need to restart it or to use small files?
here is my flume configuration:
# Name the components on this agent
emsFlumeAgent.sources = http_emsFlumeAgent
emsFlumeAgent.sinks = hdfs_sink
emsFlumeAgent.channels = channel_hdfs
# Describe/configure the source
emsFlumeAgent.sources.http_emsFlumeAgent.type = http
emsFlumeAgent.sources.http_emsFlumeAgent.bind = localhost
emsFlumeAgent.sources.http_emsFlumeAgent.port = 41414
# Describe the sink
emsFlumeAgent.sinks.hdfs_sink.type = hdfs
emsFlumeAgent.sinks.hdfs_sink.hdfs.path = hdfs://localhost:9000/EMS/%{sensor}
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollInterval = 0
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollSize = 134217728
emsFlumeAgent.sinks.hdfs_sink.hdfs.rollCount=0
#emsFlumeAgent.sinks.hdfs_sink.hdfs.idleTimeout=20
# Use a channel which buffers events in memory
emsFlumeAgent.channels.channel_hdfs.type = memory
emsFlumeAgent.channels.channel_hdfs.capacity = 10000
emsFlumeAgent.channels.channel_hdfs.transactionCapacity = 100
# Bind the source and sinks to the channel
emsFlumeAgent.sources.http_emsFlumeAgent.channels = channel_hdfs
emsFlumeAgent.sinks.hdfs_sink.channel = channel_hdfs
I think the tricky bit here is that you would like to write data to HDFS in near real time but don't want small files either (for obvious reasons) and this could be a difficult thing to a achieve.
You'll need to find optimal balance between the following two parameters:
hdfs.rollSize (Default = 1024) - File size to trigger roll, in bytes (0: never roll based on file size)
and
hdfs.batchSize (Default = 100) - Number of events written to file before it is flushed to HDFS
If your data is not likely to reach 128 MB in the preferred time duration, then you may need to reduce the rollSize but only to an extent that you don't run into the small files problem.
Since, you have not set any batch size in your HDFS sink, you should see the results of HDFS flush after every 100 records but once the size of the flushed records jointly reaches 128 MB, the contents would be rolled up in a 128 MB file. Is this also not happening? Could you please confirm?
Hope this helps!

Flume creating small files

I am trying to move my files in hdfs from local system using flume but when i am running my flume it is creating many small files. Size of my original file's are 154 - 500Kb but in my HDFS it is creating many files of size 4-5kb. I searched and got to know that changing the rollSize and rollCount will work i increased the values but still same issue is happening. Also i am getting below error.
Error:
ERROR hdfs.BucketWriter: Hit max consecutive under-replication
rotations (30); will not continue rolling files under this path due to
under-replication
As i am working in cluster i am a bit scared to do changes in the hdfs-site.xml. Please suggest me what i can do to either move original files in HDFS or make the small files more in size (instead of 4-5kb make it 50-60kb).
Below is my configuration.
Configuration:
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/Downloads/CD/parsedCD
agent1.sources.source1.deletePolicy = immediate
agent1.sources.source1.basenameHeader = true
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/cloudera/flumecd
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.filePrefix = %{basename}
agent1.sinks.sink1.hdfs.rollInterval = 0
agent1.sinks.sink1.hdfs.batchsize= 1000
agent1.sinks.sink1.hdfs.rollSize= 1000000
agent1.sinks.sink1.hdfs.rollCount= 0
agent1.channels.channel1.type = memory
agent1.channels.channel1.maxFileSize =900000000
I think the error you are posting is clear enough: the files you are creating are under-replicated (which means the blocks of the files you are creating, and which are distributed along the cluster, have less copies than the replication factor -usually 3-); and while that situation continues in time, no more rolls will be done (because each time you roll the file, a new under-replicated file is created, and the maximum allowed -30- has been reached).
I'll recommend you to check why files are under-replicated. Maybe this is because the cluster is running out of disk, or because the cluster was set up with the minimum number of nodes -i.e. 3 nodes- and one is down -i.e. only 2 datanodes are alive and the replication factor is set to 3-.
Other options (not recommended) would be to decrease the replication factor -even to 1-. Or increase the allowed number of under-replicated rolls (I don't know if such a thing is possible, and even it is possible, in the end you will experience again the same error).

Write event with flume to S3 via HDFS Sink ensure transaction

We are using flume and S3 to store our events.
I recognized that events are only transferred to S3 whenever the HDFS sink rolls to the next file or flume is shutdown gracefully.
This can, in my mind, lead to potential data loss. The Flume Documentation writes:
...Flume uses a transactional approach to guarantee the reliable
delivery of the Events...
here my configuration:
agent.sinks.defaultSink.type = HDFSEventSink
agent.sinks.defaultSink.hdfs.fileType = DataStream
agent.sinks.defaultSink.channel = fileChannel
agent.sinks.defaultSink.serializer = avro_event
agent.sinks.defaultSink.serializer.compressionCodec = snappy
agent.sinks.defaultSink.hdfs.path = s3n://testS3Bucket/%Y/%m/%d
agent.sinks.defaultSink.hdfs.filePrefix = events
agent.sinks.defaultSink.hdfs.rollInterval = 3600
agent.sinks.defaultSink.hdfs.rollCount = 0
agent.sinks.defaultSink.hdfs.rollSize = 262144000
agent.sinks.defaultSink.hdfs.batchSize = 10000
agent.sinks.defaultSink.hdfs.useLocalTimeStamp = true
#### CHANNELS ####
agent.channels.fileChannel.type = file
agent.channels.fileChannel.capacity = 1000000
agent.channels.fileChannel.transactionCapacity = 10000
I assume that I just do something wrong, any Ideas?
After some investigation I found one of the main problems using S3 with flume and the HDFS Sink.
One of the main differences between plain HDFS and the S3 implementation is that S3 does not directly support rename. When a file is renamed in S3 the file will be copied and to the new name and the old file will be deleted. (see: How to rename files and folder in Amazon S3?)
Flume by default extend files with .tmp when the file is not full. After the rotation the file will be renamed to the final filename. In HDFS this will be no problem but in S3 this can cause problems according to this issue:
https://issues.apache.org/jira/browse/FLUME-2445
Because S3 with HDFS sink seams not 100% trustworthy I prefer the more safe way of saving all files local and sync/delete the finished files with the aws tool s3 sync (http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html)
In worse case the files are not synced or the local disk is full but both problems can be easily solved via a monitoring system that anyways should be used.

How to tune Spark application with hadoop custom input format

My spark application process the files (average size is 20 MB) with custom hadoop input format and stores the result in HDFS.
Following is the code snippet.
Configuration conf = new Configuration();
JavaPairRDD<Text, Text> baseRDD = ctx
.newAPIHadoopFile(input, CustomInputFormat.class,Text.class, Text.class, conf);
JavaRDD<myClass> mapPartitionsRDD = baseRDD
.mapPartitions(new FlatMapFunction<Iterator<Tuple2<Text, Text>>, myClass>() {
//my logic goes here
}
//few more translformations
result.saveAsTextFile(path);
This application creates 1 task/ partition per file and processes and stores the corresponding part file in HDFS.
i.e, For 10,000 input files 10,000 tasks are created and 10,000 part files are stored in HDFS.
Both mapPartitions and map operations on baseRDD are creating 1 task per file.
SO question
How to set the number of partitions for newAPIHadoopFile?
suggests to set
conf.setInt("mapred.max.split.size", 4); for configuring no of partitions.
But when this parameter is set CPU is utilized at maximum and none of the stage is not started even after long time.
If I don't set this parameter then application will be completed successfully as mentioned above.
How to set number of partitions with newAPIHadoopFile and increase the efficiency?
What happens with mapred.max.split.size option?
============
update:
What happens with mapred.max.split.size option?
In my use case file size is small and changing the split size options are irrelevant here.
more info on this SO: Behavior of the parameter "mapred.min.split.size" in HDFS
Just use baseRDD.repartition(<a sane amount>).mapPartitions(...). That will move the resulting operation to fewer partitions, especially if your files are small.

MapReduce Distributed Cache

I am adding a file to distributed cache of Hadoop using
Configuration cng=new Configuration();
JobConf conf = new JobConf(cng, Driver.class);
DistributedCache.addCacheFile(new Path("DCache/Orders.txt").toUri(), cng);
where DCache/Orders.txt is the file in HDFS.
When I try to retrieve this file from the cache in configure method of mapper using:
Path[] cacheFiles=DistributedCache.getLocalCacheFiles(conf);
I get null pointer. What can be the error?
Thanks
DistributedCache doesn't work in single node mode, it just returns a null pointer. Or at least that was my experience with the current version.
I think the url is supposed to start with the hdfs identifier.
http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#DistributedCache

Resources