Flink 1.6 bucketing sink HDFS files stuck in .in-progress - hadoop

I am writing a Kafka data stream to a bucketing sink at an HDFS path. Kafka gives out string data, and I am using FlinkKafkaConsumer010 to consume from Kafka. The part files are created but remain stuck in the in-progress state:
-rw-r--r-- 3 ubuntu supergroup 4097694 2018-10-19 19:16 /streaming/2018-10-19--19/_part-0-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3890083 2018-10-19 19:16 /streaming/2018-10-19--19/_part-1-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 3910767 2018-10-19 19:16 /streaming/2018-10-19--19/_part-2-1.in-progress
-rw-r--r-- 3 ubuntu supergroup 4053052 2018-10-19 19:16 /streaming/2018-10-19--19/_part-3-1.in-progress
This happens only when I use a mapping function to manipulate the stream data on the fly. If I write the stream directly to HDFS, it works fine. Any idea why this might be happening? I am using Flink 1.6.1, Hadoop 3.1.1, and Oracle JDK 1.8.

A little late to this question, but I also experienced a similar issue.
I have a case class Address
case class Address(val i: Int)
and I read the source from a collection with a number of Address instances, for example
env.fromCollection(Seq(new Address(...), ...))
...
val customAvroFileSink = StreamingFileSink
  .forBulkFormat(
    new Path("/tmp/data/"),
    ParquetAvroWriters.forReflectRecord(classOf[Address]))
  .build()
...
xxx.addSink(customAvroFileSink)
With checkpointing enabled, my parquet files also end up stuck in in-progress.
I found that Flink finishes the job before a checkpoint is triggered, so my results are never fully flushed to disk. After I changed the checkpoint interval to a smaller value, the parquet files are no longer left in-progress.
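For reference, a minimal sketch of the relevant change (shown here with the Java API; the 10-second interval is only an illustrative value, not the exact number I used):
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
// StreamingFileSink only finalizes in-progress part files when a checkpoint completes,
// so the interval must be short enough to fire before the bounded source finishes
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000L); // checkpoint every 10 seconds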

This scenario generally happens when checkpointing is disabled.
Could you check the checkpointing settings of the job that uses the mapping function? It looks like checkpointing is enabled for the job that writes directly to HDFS.

I had a similar issue and enabling checkpointing and changing the state backend from the default MemoryStateBackend to FsStateBackend worked. In my case, checkpointing failed because MemoryStateBackend had a maxStateSize that was too small such that the state of one of the operations could not fit in memory.
StateBackend stateBackend = new FsStateBackend("file:///home/ubuntu/flink_state_backend");
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment()
    .enableCheckpointing(Duration.ofSeconds(60).toMillis())
    .setStateBackend(stateBackend);

Related

Unable to load large file to HDFS on Spark cluster master node

I have fired up a Spark cluster on Amazon EC2 containing 1 master node and 2 worker nodes that have 2.7 GB of memory each.
However, when I try to put a 3 GB file onto HDFS with the command below
/root/ephemeral-hdfs/bin/hadoop fs -put /root/spark/2GB.bin 2GB.bin
it returns the error "/user/root/2GB.bin could only be replicated to 0 nodes, instead of 1". FYI, I am able to upload smaller files, but not once the size exceeds a certain limit (about 2.2 GB).
If the file exceeds the memory size of a node, wouldn't Hadoop split it onto the other node?
Edit: Summary of my understanding of the issue you are facing:
1) Total HDFS free size is 5.32 GB
2) HDFS free size on each node is 2.6 GB
Note: You have bad blocks (4 blocks with corrupt replicas)
The following Q&A mentions similar issues:
Hadoop put command throws - could only be replicated to 0 nodes, instead of 1
In that case, running JPS showed that the datanodes were down.
These Q&As suggest ways to restart the datanode:
What is best way to start and stop hadoop ecosystem, with command line?
Hadoop - Restart datanode and tasktracker
Please try to restart your datanode, and let us know if that solves the problem.
When using HDFS you have one shared file system, i.e. all nodes share the same file system.
From your description, the current free space on HDFS is about 2.2 GB, while you are trying to put a 3 GB file there.
Execute the following commands to get the HDFS free size:
hdfs dfs -df -h
hdfs dfsadmin -report
or (for older versions of HDFS)
hadoop fs -df -h
hadoop dfsadmin -report
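If you prefer to check the same numbers from code rather than the shell, here is a minimal sketch using the Hadoop FileSystem API (the class name is mine; it assumes fs.defaultFS on the classpath points at your cluster):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

public class HdfsFreeSpace {
    public static void main(String[] args) throws Exception {
        // picks up fs.defaultFS from core-site.xml on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        FsStatus status = fs.getStatus();
        // all three values are reported in bytes for the file system as a whole
        System.out.println("capacity:  " + status.getCapacity());
        System.out.println("used:      " + status.getUsed());
        System.out.println("remaining: " + status.getRemaining());
        fs.close();
    }
}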

I executed a Hadoop MapReduce program successfully. Can someone tell me how to see the output through a browser, like <http://localhost:port/hdfsLocation/>?

I executed a Hadoop MapReduce program successfully in CDH4, but where can I see my output? Can someone tell me how to see the output through a browser? It would be helpful to me.
On the terminal:
hadoop dfs -ls /inputfile
It will give a result like:
Found 2 items
-rw-r--r-- 3 user17 supergroup 0 2014-11-27 16:47 /inputfile/_SUCCESS
-rw-r--r-- 3 user17 supergroup 24441 2014-11-27 16:47 /inputfile/part-00000
hadoop dfs -cat /inputfile/part-00000
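If you want to read that part file from a program instead of the shell, here is a minimal sketch with the Hadoop FileSystem API (the path is the one from the listing above; the class name is mine):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatPartFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataInputStream in = fs.open(new Path("/inputfile/part-00000"));
        try {
            // stream the file contents to stdout, 4 KB at a time
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            in.close();
            fs.close();
        }
    }
}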
NameNode and DataNode each run an internal web server in order to display basic information about the current status of the cluster. With the default configuration, the NameNode front page is at http://namenode-name:50070/. It lists the DataNodes in the cluster and basic statistics of the cluster. The web interface can also be used to browse the file system (using "Browse the file system" link on the NameNode front page).
If you want to see the output on the web, please see http://gethue.com/#

Hadoop Hive: How to allow a regular user to continuously write data and create tables in the warehouse directory?

I am running Hadoop 2.2.0.2.0.6.0-101 on a single node.
I am trying to run a Java MRD program from Eclipse, as a regular user, that writes data to an existing Hive table. I get the exception:
org.apache.hadoop.security.AccessControlException: Permission denied: user=dev, access=WRITE, inode="/apps/hive/warehouse/testids":hdfs:hdfs:drwxr-xr-x
This happens because the regular user has no write permission to the warehouse directory; only the hdfs user does:
drwxr-xr-x - hdfs hdfs 0 2014-03-06 16:08 /apps/hive/warehouse/testids
drwxr-xr-x - hdfs hdfs 0 2014-03-05 12:07 /apps/hive/warehouse/test
To circumvent this I changed the permissions on the warehouse directory, so everybody now has write permission:
[hdfs@localhost wks]$ hadoop fs -chmod -R a+w /apps/hive/warehouse
[hdfs@localhost wks]$ hadoop fs -ls /apps/hive/warehouse
drwxrwxrwx - hdfs hdfs 0 2014-03-06 16:08 /apps/hive/warehouse/testids
drwxrwxrwx - hdfs hdfs 0 2014-03-05 12:07 /apps/hive/warehouse/test
This helps to some extent, and the MRD program can now write to the warehouse directory as a regular user, but only once. When trying to write data into the same table a second time I get:
ERROR security.UserGroupInformation: PriviledgedActionException as:dev (auth:SIMPLE) cause:org.apache.hcatalog.common.HCatException : 2003 : Non-partitioned table already contains data : default.testids
Now, if I delete the output table and create it anew in the hive shell, I again get the default permissions, which do not allow a regular user to write data into this table:
[hdfs@localhost wks]$ hadoop fs -ls /apps/hive/warehouse
drwxr-xr-x - hdfs hdfs 0 2014-03-11 12:19 /apps/hive/warehouse/testids
drwxrwxrwx - hdfs hdfs 0 2014-03-05 12:07 /apps/hive/warehouse/test
Please advise on the correct Hive configuration steps that will allow a program running as a regular user to do the following operations in the Hive warehouse:
Programmatically create / delete / rename Hive tables
Programmatically read / write data from Hive tables
Many thanks!
If you maintain the table from outside Hive, then declare the table as external:
An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
A Hive administrator can create the table and point it toward your own user-owned HDFS storage location, and you grant Hive permission to read from there.
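For illustration, a minimal sketch of creating such a table programmatically over Hive JDBC; the HiveServer2 URL, credentials, column layout and LOCATION are all assumptions for the example, not values from the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateExternalTable {
    public static void main(String[] args) throws Exception {
        // assumes a HiveServer2 endpoint at a made-up address, with a made-up schema
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "dev", "");
        Statement stmt = conn.createStatement();
        try {
            // the data lives in the user's own HDFS directory instead of /apps/hive/warehouse
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS testids (id INT) "
                    + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
                    + "LOCATION '/user/dev/testids'");
        } finally {
            stmt.close();
            conn.close();
        }
    }
}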
As a general comment, there is no way for an unprivileged user to perform an unauthorized privileged action. Any such way is technically an exploit and you should never rely on it: even if it is possible today, it will likely be closed soon. Hive Authorization (and HCatalog authorization) is orthogonal to HDFS authorization.
Your application is also incorrect, regardless of the authorization issues. You are trying to write 'twice' into the same table, which means your application does not handle partitions correctly. Start from An Introduction to Hive's Partitioning.
You can configure hdfs-site.xml like this:
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
This configuration disables permission checking on HDFS, so a regular user can perform the operations on HDFS.
I hope this helps.

How can I have a 66MB job config in a job tracker while jobconf.limit is set to 5MB?

How can I have a 66 MB job config in the JobTracker while mapred.user.jobconf.limit is set to 5 MB?
$ ls -lh /mapred/jt/jobTracker/job_201309061800_0037.xml
-rwxr-xr-x 1 mapred mapred 66M Sep 6 22:21 /mapred/jt/jobTracker/job_201309061800_0037.xml
$ cat /mapred/jt/jobTracker/job_201309061800_0037.xml | grep mapred.user.jobconf.limit
<property><name>mapred.user.jobconf.limit</name><value>5242880</value><source>mapred-default.xml</source></property>
You only showed the configuration sent from the client (job_201309061800_0037.xml). This configuration applies only to the current job and has no effect on the JobTracker. You need to check mapred-default.xml on your JobTracker.
The JobTracker reads mapred.user.jobconf.limit when it initializes. After that, the value in memory (MAX_JOBCONF_SIZE in JobTracker) does not change. You can check the code here: http://www.grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/0.20.2-cdh3u1/org/apache/hadoop/mapred/JobTracker.java#158
I admit Hadoop does not provide a mechanism to indicate which configuration options can be set by a job and which cannot. For now, my solution is to search for the option in the Hadoop source code and find out how Hadoop uses it.

Hive Could not obtain block

I am getting this error:
Failed with exception java.io.IOException:java.io.IOException: Could not obtain block: blk_364919282277866885_1342 file=/user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
I checked that the file is actually there:
hive>dfs -ls /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
Found 1 items
-rw-r--r-- 2 root supergroup 216 2012-11-16 16:28 /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt
What should I do?
Please help.
I ran into this problem on my cluster, but it disappeared once I restarted the task on a cluster with more nodes available. The underlying cause appears to be an out-of-memory error, as this thread indicates. My original cluster on AWS was running 3 c1.xlarge instances (7 GB memory each), while the new one had 10 c3.4xlarge instances (30 GB memory each).
Try hadoop fsck /user/hive/warehouse/invites/ds=2008-08-08/kv3.txt ?
