HDFS Replication Issue - hadoop

I have set up a single node cluster (initially) and am attempting to write a file from a client outside the cluster. While the write call returns, the close call hangs for a very long time, eventually returns, but the resulting file in HDFS is 0 bytes in length. The log says:
2016-10-03 22:01:41,367 INFO BlockStateChange: chooseUnderReplicatedBlocks selected 1 blocks at priority level 0; Total=1 Reset bookmarks? true
2016-10-03 22:01:41,367 INFO BlockStateChange: BLOCK* neededReplications = 1, pendingReplications = 0.
2016-10-03 22:01:41,367 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Blocks chosen but could not be replicated = 1; of which 1 have no target, 0 have no source, 0 are UC, 0 are abandoned, 0 already have enough replicas.
Why is the block not written to the single datanode (same as namenode)? What does it mean to "have no target"? The replication count is 1 and I would have thought that a single copy of the file would be stored on the single cluster node.

Related

kafka streams logging disable INFO

Is there any way to disable Kafka streams processing summary info? because it will take alot of disk space
e.g INFO 21284 --- [-StreamThread-6] o.a.k.s.p.internals.StreamThread : stream-thread [test-20-37836474-d182-4066-a5f5-25b211e2fbdb-StreamThread-1] Processed 0 total records, ran 0 punctuators, and committed 0 total tasks since the last update

Invalid state: The Flow Controller is initializing the Data Flow

I'm trying out a test scenario to add a new node to the already existing cluster (for now 1-node) using external zookeeper.
I'm constantly getting the below repeated lines, and on UI "Invalid state: The Flow Controller is initializing the Data Flow."
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.c.n.LeaderElectionNodeProtocolSender Determined that Cluster Coordinator is located at nifi-02:9489; will use this address for sending heartbeat messages
2022-02-28 17:51:29,668 INFO [main] o.a.n.c.p.AbstractNodeProtocolSender Cluster Coordinator is located at nifi-02:9489. Will send Cluster Connection Request to this address
2022-02-28 17:51:37,572 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
2022-02-28 17:52:36,914 INFO [Write-Ahead Local State Provider Maintenance] org.wali.MinimalLockingWriteAheadLog org.wali.MinimalLockingWriteAheadLog#13c90c06 checkpointed with 1 Records and 0 Swap Files in 4 milliseconds (Stop-the-world time = 1 milliseconds, Clear Edit Logs time = 1 millis), max Transaction ID 1
2022-02-28 17:52:37,581 INFO [Cleanup Archive for default] o.a.n.c.repository.FileSystemRepository Successfully deleted 0 files (0 bytes) from archive
NiFi-1.15.3 is being used (unsecure setup)
It seems that cluster coordinator is not running on mentioned port for node already in cluster. This I thought from timeout prospective, but new node is able to detect that a cluster coordinator is present at the mentioned node. (How to solve this?)
nc (netcat) is also timing out for the same port

Flume creating small files

I am trying to move my files in hdfs from local system using flume but when i am running my flume it is creating many small files. Size of my original file's are 154 - 500Kb but in my HDFS it is creating many files of size 4-5kb. I searched and got to know that changing the rollSize and rollCount will work i increased the values but still same issue is happening. Also i am getting below error.
Error:
ERROR hdfs.BucketWriter: Hit max consecutive under-replication
rotations (30); will not continue rolling files under this path due to
under-replication
As i am working in cluster i am a bit scared to do changes in the hdfs-site.xml. Please suggest me what i can do to either move original files in HDFS or make the small files more in size (instead of 4-5kb make it 50-60kb).
Below is my configuration.
Configuration:
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/Downloads/CD/parsedCD
agent1.sources.source1.deletePolicy = immediate
agent1.sources.source1.basenameHeader = true
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/cloudera/flumecd
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.filePrefix = %{basename}
agent1.sinks.sink1.hdfs.rollInterval = 0
agent1.sinks.sink1.hdfs.batchsize= 1000
agent1.sinks.sink1.hdfs.rollSize= 1000000
agent1.sinks.sink1.hdfs.rollCount= 0
agent1.channels.channel1.type = memory
agent1.channels.channel1.maxFileSize =900000000
I think the error you are posting is clear enough: the files you are creating are under-replicated (which means the blocks of the files you are creating, and which are distributed along the cluster, have less copies than the replication factor -usually 3-); and while that situation continues in time, no more rolls will be done (because each time you roll the file, a new under-replicated file is created, and the maximum allowed -30- has been reached).
I'll recommend you to check why files are under-replicated. Maybe this is because the cluster is running out of disk, or because the cluster was set up with the minimum number of nodes -i.e. 3 nodes- and one is down -i.e. only 2 datanodes are alive and the replication factor is set to 3-.
Other options (not recommended) would be to decrease the replication factor -even to 1-. Or increase the allowed number of under-replicated rolls (I don't know if such a thing is possible, and even it is possible, in the end you will experience again the same error).

Hadoop2.4.0 creating 39063 Map tasks to process 10MB file in Local mode with invalid Inputsplit configuration

am using hadoop-2.4.0 with all default configuration expect below:
FileInputFormat.setInputPaths(job, new Path("in")); //10mb file; just one file.
FileOutputFormat.setOutputPath(job, new Path("out"));
job.getConfiguration().set("mapred.max.split.size", "64");
job.getConfiguration().set("mapred.min.split.size", "128");
PS: I set max split size is lesser than min(Initially I set by mistake and I realized)
And, as per inputsplit calucaiton logic
max(minimumSize, min(maximumSize, blockSize))
max(128,min(64,128) --> 128MB and it is great than file size, so it should create only one inputsplit(one mapper)
Am just curious about how the framework calculating 39063 mappers each time when I run this program in eclipse?
Logs:
2015-07-15 12:02:37 DEBUG LocalJobRunner Starting mapper thread pool executor.
2015-07-15 12:02:37 DEBUG LocalJobRunner Max local threads: 1
2015-07-15 12:02:37 DEBUG LocalJobRunner Map tasks to process: 39063
2015-07-15 12:02:38 INFO LocalJobRunner Starting task:
attempt_local192734774_0001_m_000000_0
Thanks,
In your code you have specified:
job.getConfiguration().set("mapred.max.split.size", "64");
job.getConfiguration().set("mapred.min.split.size", "128");
Its calculating into bytes. Hence you are getting high number of Mapper.
I think you should use something like this:
job.getConfiguration().set("mapred.min.split.size", 67108864);
67108864 is value in bytes of 64MB
Calculation: 64*1024*1024 = 67108864
mapred.max.split.size is basicall used to combine small file to defint split size where you are dealing with large number of small files and mapred.min.split.size is used to define split where you are dealing with large files.
If you are using YARN or MR2 then you should use mapreduce.input.fileinputformat.split.minsize

Flume to HDFS split a file to lots of files

I'm trying to transfer a 700 MB log file from flume to HDFS.
I have configured the flume agent as follows:
...
tier1.channels.memory-channel.type = memory
...
tier1.sinks.hdfs-sink.channel = memory-channel
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.path = hdfs://***
tier1.sinks.hdfs-sink.fileType = DataStream
tier1.sinks.hdfs-sink.rollSize = 0
The source is a spooldir, channel is memory and sink is hdfs.
I have also tried to send a 1MB file, and flume split it to 1000 files, each one of size of 1KB.
Another thing I have noticed is that the transfer was very slow, 1MB took about 1 minute.
Am I doing something wrong?
You need to disable the rolltimeout too, that's done with the following settings:
tier1.sinks.hdfs-sink.hdfs.rollCount = 0
tier1.sinks.hdfs-sink.hdfs.rollInterval = 300
rollcount prevents roll overs, rollIntervall here is set to 300 seconds, setting that to 0 will disable timeouts. You will have to chosse which mechanism you want for rollovers, otherwise Flume will only close the files upon shutdown.
The default values are the following:
hdfs.rollInterval 30 Number of seconds to wait before rolling current file (0 = never roll based on time interval)
hdfs.rollSize 1024 File size to trigger roll, in bytes (0: never roll based on file size)
hdfs.rollCount 10 Number of events written to file before it rolled (0 = never roll based on number of events)

Resources