Hive Could not obtain block - hadoop

I am getting the following error:
Failed with exception java.io.IOException:java.io.IOException: Could not obtain block: blk_364919282277866885_1342 file=/user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
I checked, and the file is actually there:
hive>dfs -ls /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
Found 1 items
-rw-r--r-- 2 root supergroup 216 2012-11-16 16:28 /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
What should I do?
Please help.

I ran into this problem on my cluster, but it disappeared once I restarted the task on a cluster with more nodes available. The underlying cause appears to be an out-of-memory error, as this thread indicates. My original cluster on AWS was running 3 c1.xlarge instances (7 GB memory each), while the new one had 10 c3.4xlarge instances (30 GB memory each).

Try hadoop fsck /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt ?
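If the block really is missing or corrupt, fsck can also show which datanodes (if any) still hold replicas, for example:
hadoop fsck /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt -files -blocks -locations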

Related

Hadoop multinode cluster too slow. How do I increase speed of data processing?

I have a 6-node cluster: 5 DNs and 1 NN. All have 32 GB RAM. All slaves have 8.7 TB HDDs; the NN has a 1.1 TB HDD. Here are the links to my core-site.xml, hdfs-site.xml, and yarn-site.xml.
After running an MR job, I checked my RAM usage, which is shown below:
Namenode
free -g
total used free shared buff/cache available
Mem: 31 7 15 0 8 22
Swap: 31 0 31
Datanodes:
Slave1:
free -g
total used free shared buff/cache available
Mem: 31 6 6 0 18 24
Swap: 31 3 28
Slave2:
total used free shared buff/cache available
Mem: 31 2 4 0 24 28
Swap: 31 1 30
Likewise, the other slaves have similar RAM usage. Even if only a single job is running, any other submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.
Here is the ps output for the JAR that I submitted to execute the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to upgrade the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data, and the RAM usage is still around 8 GB with 22 GB of RAM free. Here is my job submission command:
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information:
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there additional memory parameters that we can use when submitting the job to make memory usage more efficient?
I believe you can edit mapred-default.xml.
The parameters you are looking for are:
mapreduce.job.running.map.limit
mapreduce.job.running.reduce.limit
0 (probably what it is set to at the moment) means UNLIMITED.
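As an illustration only (the values are hypothetical, and these properties only exist in newer Hadoop 2.x releases, so check that your version supports them), such overrides would normally go into mapred-site.xml rather than the bundled defaults file:
<!-- mapred-site.xml fragment (illustrative values, not a recommendation) -->
<property>
  <name>mapreduce.job.running.map.limit</name>
  <value>50</value>   <!-- cap concurrent map tasks per job; 0 = unlimited -->
</property>
<property>
  <name>mapreduce.job.running.reduce.limit</name>
  <value>10</value>   <!-- cap concurrent reduce tasks per job; 0 = unlimited -->
</property>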
Looking at your memory, 32 GB per machine seems too small.
What CPUs/cores do you have? I would expect a quad CPU with 16 cores minimum per machine.
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. It effectively means you have at best 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB it's far too large. Drop it to 1 or 2 GB.
Remember, an HDFS block is what each mapper typically consumes, so 1-2 GB of memory for 128 MB of data sounds much more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs roughly 10x faster than 18 vcores.
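A minimal sketch along those lines (the fragments and values are illustrative, not tuned for this cluster):
<!-- yarn-site.xml fragment -->
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>1024</value>  <!-- hand out containers in 1 GB steps instead of 10 GB -->
</property>
<!-- mapred-site.xml fragment: per-task container requests sized to the data, not the node -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>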
To give you an idea of how a 4-node cluster with 32 cores and 128 GB RAM per node is set up: for Tez, divide RAM by cores to get the maximum Tez container size. So in my case: 128/32 = 4 GB. The corresponding Tez and YARN settings are sketched below.
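Purely as an illustration of that 4 GB sizing (the property values are examples, not taken from the original answer):
<!-- tez-site.xml fragment -->
<property>
  <name>tez.task.resource.memory.mb</name>
  <value>4096</value>  <!-- one Tez task container = RAM / cores = 4 GB -->
</property>
<property>
  <name>tez.am.resource.memory.mb</name>
  <value>4096</value>
</property>
<!-- yarn-site.xml fragment -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>114688</value>  <!-- offer ~112 GB of the 128 GB to YARN, leave the rest for OS/daemons -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>4096</value>  <!-- no single container larger than the Tez container size -->
</property>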

Flink 1.6 bucketing sink HDFS files stuck in .in-progress

I am writing a Kafka data stream to a bucketing sink in an HDFS path. Kafka gives out string data, and I am using FlinkKafkaConsumer010 to consume from Kafka. The files stay stuck in the .in-progress state:
-rw-r--r-- 3 ubuntu supergroup 4097694 2018-10-19 19:16 /streaming/2018-10-19--19/_part-0-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3890083 2018-10-19 19:16 /streaming/2018-10-19--19/_part-1-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3910767 2018-10-19 19:16 /streaming/2018-10-19--19/_part-2-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 4053052 2018-10-19 19:16 /streaming/2018-10-19--19/_part-3-1.in-progress
This happens only when I use a mapping function to manipulate the stream data on the fly. If I write the stream directly to HDFS, it works fine. Any idea why this might be happening? I am using Flink 1.6.1, Hadoop 3.1.1, and Oracle JDK 1.8.
A little late to this question, but I also ran into a similar issue.
I have a case class Address
case class Address(val i: Int)
and I read the source from a collection of Address instances, for example:
env.fromCollection(Seq(new Address(...), ...))
...
val customAvroFileSink = StreamingFileSink
  .forBulkFormat(
    new Path("/tmp/data/"),
    ParquetAvroWriters.forReflectRecord(classOf[Address]))
  .build()
...
xxx.addSink(customAvroFileSink)
With checkpointing enabled, my Parquet files also ended up stuck in .in-progress.
I found that Flink finished the job before a checkpoint was triggered, so my results were never fully flushed to disk. After I changed the checkpoint interval to a smaller value, the Parquet files were no longer left in-progress.
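For reference, a minimal sketch of shortening the interval (Scala DataStream API; the 10-second value is only an example):
import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
// Checkpoint every 10 seconds so the sink's part files are finalized
// (moved out of .in-progress) instead of waiting on a long interval.
env.enableCheckpointing(10000L)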
This scenario generally happens when checkpointing is disabled.
Could you check the checkpointing settings while running the job with the mapping function? It looks like you only have checkpointing enabled for the job that writes directly to HDFS.
I had a similar issue, and enabling checkpointing and changing the state backend from the default MemoryStateBackend to FsStateBackend worked. In my case, checkpointing failed because MemoryStateBackend had a maxStateSize that was too small, such that the state of one of the operators could not fit in memory.
// Keep checkpoint state on the local file system instead of the JobManager heap.
StateBackend stateBackend = new FsStateBackend("file:///home/ubuntu/flink_state_backend");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
    .enableCheckpointing(Duration.ofSeconds(60).toMillis())  // checkpoint every 60 s
    .setStateBackend(stateBackend);

Unable to load large file to HDFS on Spark cluster master node

I have fired up a Spark cluster on Amazon EC2 containing 1 master node and 2 worker nodes that have 2.7 GB of memory each.
However, when I tried to put a 3 GB file onto HDFS with the command below,
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returned the error "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". FYI, I am able to upload smaller files, but not once they exceed a certain size (about 2.2 GB).
If the file exceeds the capacity of a single node, wouldn't Hadoop split it across the other nodes?
Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6GB
Note: You have bad blocks (4 Blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running jps showed that the datanodes were down.
Those Q&As suggest ways to restart the datanode:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your data-node, and let us know if it solved the problem.
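A minimal sketch of bouncing a datanode daemon (the script lives under bin/ on Hadoop 1.x and sbin/ on 2.x; adjust HADOOP_HOME to your install):
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
jps   # DataNode should now appear in the process list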
When using HDFS, you have one shared file system, i.e. all nodes share the same file system.
From your description, the current free space on HDFS is about 2.2 GB, while you are trying to put 3 GB there.
Execute the following commands to get the HDFS free size:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report

How to clean up Hadoop MapReduce memory usage?

Say, for example, I have 10 MB of free memory on each node after I run the start-all.sh process, which starts the namenode, datanode, secondary namenode, etc. After the Hadoop MapReduce job has finished, why does the free memory drop to, say, 5 MB, even though the job is done?
How can I get back to 10 MB of free memory? Thanks all.
Maybe you can try the Linux command for clearing cached memory:
echo 3 > /proc/sys/vm/drop_caches
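Note that buff/cache memory is reclaimable anyway, so dropping it is rarely necessary. If you do want to, it is safer to flush dirty pages first, and the write requires root, for example:
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drops page cache, dentries and inodes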

Hadoop: Datanodes available: 0 (0 total, 0 dead)

Each time I run:
hadoop dfsadmin -report
I get the following output:
Configured Capacity: 0 (0 KB)
Present Capacity: 0 (0 KB)
DFS Remaining: 0 (0 KB)
DFS Used: 0 (0 KB)
DFS Used%: �%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 0 (0 total, 0 dead)
There is no data directory in my dfs/ folder.
A lock file exists in this folder: in_use.lock
The master, job tracker and data nodes are running fine.
Please check the datanode logs. They will contain errors when the datanode is unable to report to the namenode. If you post those errors, people will be able to help.
I had exactly the same problem. When I checked the datanode logs, there were lots of "could not connect to master:9000" errors, and when I checked the ports on the master via netstat -ntlp I had this in the output:
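A sketch of where to look (the exact file name depends on the user and hostname of your install):
tail -n 100 $HADOOP_HOME/logs/hadoop-*-datanode-*.log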
tcp 0 0 127.0.1.1:9000 ...
I realized that I should either change my master machine's name or change "master" in all the configs. I decided to do the first because it seemed much easier.
So I modified /etc/hosts, changed 127.0.1.1 master to 127.0.1.1 master-machine, and added an entry at the end of the file like this:
192.168.1.1 master
Then I changed master to master-machine in /etc/hostname and restarted the machine.
The problem was gone.
Did you check the firewall?
When I use Hadoop, I turn off the firewall (iptables -F on all nodes)
and then try again.
This has happened to us when we restarted the cluster, but after a while the datanodes were automatically detected. It could possibly be because of the block report delay property.
Usually this is caused by namespace ID mismatches on the datanodes.
So delete the name dir on the master and the data dirs on the datanodes.
Then format the namenode and try start-dfs.sh.
The report usually takes some time to reflect all the datanodes.
I was also getting 0 datanodes, but after some time the master detected the slaves.
I had the same problem and I just solved it.
The /etc/hosts file on all nodes should look like this:
127.0.0.1 localhost
xxx.xxx.xxx.xxx master
xxx.xxx.xxx.xxx slave-1
xxx.xxx.xxx.xxx slave-2
I just resolved the issue by following the steps below:
Make sure the IP addresses for the master and slave nodes are correct in the /etc/hosts file.
Unless you really need the data, run stop-dfs.sh, delete all data directories on the master/slave nodes, then run hdfs namenode -format and start-dfs.sh. This should recreate HDFS and fix the issue (see the sketch below).
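A sketch of that sequence (the dfs paths are placeholders for whatever dfs.namenode.name.dir and dfs.datanode.data.dir point at in your hdfs-site.xml):
stop-dfs.sh
rm -rf /path/to/dfs/name/*    # on the master (namenode metadata)
rm -rf /path/to/dfs/data/*    # on each slave (datanode blocks)
hdfs namenode -format
start-dfs.sh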
Just formatting the namenode didn't work for me, so I checked the logs under $HADOOP_HOME/logs. In the secondarynamenode log, I found this error:
ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Exception in doCheckpoint
java.io.IOException: Inconsistent checkpoint fields.
LV = -64 namespaceID = 2095041698 cTime = 1552034190786 ; clusterId = CID-db399b3f-0a68-47bf-b798-74ed4f5be097 ; blockpoolId = BP-31586866-127.0.1.1-1552034190786.
Expecting respectively: -64; 711453560; 1550608888831; CID-db399b3f-0a68-47bf-b798-74ed4f5be097; BP-2041548842-127.0.1.1-1550608888831.
at org.apache.hadoop.hdfs.server.namenode.CheckpointSignature.validateStorageInfo(CheckpointSignature.java:143)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:550)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:482)
at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
at java.lang.Thread.run(Thread.java:748)
So I stopped Hadoop and then formatted the namenode, specifying the existing cluster ID:
hdfs namenode -format -clusterId CID-db399b3f-0a68-47bf-b798-74ed4f5be097
This solved the problem.
There's another obscure reason this could happen as well: Your datanode did not start properly, but everything else was working.
In my case, when going through the log, I found that the bound port, 50010, was already in use by SideSync (on macOS). I found this with:
sudo lsof -iTCP -n -P | grep 0010
You can use similar techniques to determine what might have already taken your well-known datanode port.
Killing this off and restarting fixed the problem.
Additionally, if you've installed Hadoop/YARN as root but keep the data dirs in individual home directories, and then try to run it as an individual user, you'll have to make the datanode directory accessible to that user.
