Issue inserting data into a Hive partition table with over 100k partitions - hadoop

I created a staging table with 20 million records and only two fields, viewerid and viewedid. From that I am trying to create a dynamically partitioned ORC table partitioned on the "viewerid" column, but the map job is not completing, as shown in the attached pic.
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
<property>
<name>mapreduce.map.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>8192</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx6144m</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
</configuration>
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoop-master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoop-master:8032</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>hadoop-master:8088</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoop-master:8031</value>
</property>
Job status:
My staging table:
hive> desc formatted bmviews;
OK
# col_name data_type comment
viewerid int
viewedid int
# Detailed Table Information
Database: bm
Owner: sudheer
CreateTime: Tue Aug 29 18:22:34 IST 2017
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://hadoop-master:54311/user/hive/warehouse/bm.db/bmviews
Table Type: MANAGED_TABLE
Table Parameters:
numFiles 9
numRows 0
rawDataSize 0
totalSize 539543256
transient_lastDdlTime 1504070146
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
My partition table description:
I have changed the maximum dynamic partitions per node to 200k but am still facing the issue. I have two data nodes (8 GB and 6 GB of RAM, respectively) and a namenode with 16 GB of RAM.
How can I insert the data into my partition table?
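For reference, here is a minimal sketch of the kind of dynamic-partition insert being described; the database, staging table, and column names come from the question, while the target table name bmviews_part and the limit values are assumptions to adjust:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.max.dynamic.partitions=200000;
SET hive.exec.max.dynamic.partitions.pernode=200000;
SET hive.exec.max.created.files=1000000;
-- sort by the partition column so each task keeps only one ORC writer open at a time
SET hive.optimize.sort.dynamic.partition=true;

-- bmviews_part is a hypothetical name for the partitioned ORC table
CREATE TABLE IF NOT EXISTS bm.bmviews_part (viewedid INT)
PARTITIONED BY (viewerid INT)
STORED AS ORC;

-- the dynamic partition column must come last in the SELECT list
INSERT OVERWRITE TABLE bm.bmviews_part PARTITION (viewerid)
SELECT viewedid, viewerid
FROM bm.bmviews;

Even with these settings, 100k+ partitions means 100k+ HDFS directories and at least as many files, which is heavy on both the namenode and the ORC writers' memory, so it is worth asking whether bucketing on viewerid would serve the query pattern equally well.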

Related

Simple Hive job with 8 GB of CSV data consumes all disk space (around 500 GB)

I have set up a Hadoop cluster with 20 GB RAM and 6 cores. I have around 8 GB of data in 3 CSV files that I have to join. For this purpose I have used Apache Hive. Hadoop and Hive are both 3.x versions.
Here is the Hive query
SELECT DISTINCT
  rm.UID, rm.Num_Period,
  rpd.C_Mon - rpd.Non_Cred_Inputs AS Claimed_Mon,
  rpd.Splr_UID, rpd.Doc_Type, rpd.Doc_No_Num, rpd.Doc_Date,
  rpd.Purchased_Type, rpd.Rate_ID, rpd.C_Withheld, rpd.Non_Creditable_Inputs,
  rsd.G_UID, rsd.G_Type,
  rsd.Doc_Type AS G_doc_type, rsd.Doc_No_Num AS G_doc_no_num, rsd.Doc_Date AS G_doc_date,
  rsd.Sale_Type AS G_sale_type, rsd.Rate_ID AS G_rate_id, rsd.Rate_Value AS G_rate_value,
  rsd.hscode AS G_hscode
FROM ZUniq rm
INNER JOIN Zpurchasedetails rpd ON rm.UniqID = rpd.UniqID
INNER JOIN Zsaledetails rsd ON rpd.UniqID = rsd.UniqID
WHERE rpd.Non_Cred_Inputs < rpd.C_Mon;
Now, there is around 300 GB of free disk on one node and 400 GB on the other. When I run the above query, all the disk space gets used and the job then goes to pending with a message that no healthy node exists.
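One setting that directly affects how much local disk the shuffle consumes is intermediate compression; a sketch of the standard Hive/MapReduce knobs for it (the values are only a suggestion, and the Snappy codec must be available on the cluster):

SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
-- compress the final output files as well
SET hive.exec.compress.output=true;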
Here is the Hadoop configuration
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.class</name>
<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
<!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler</value> -->
<!-- <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value> -->
</property>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/mnt/disk1/.hdfs/tmp</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hms-master</value>
</property>
<property>
<name>yarn.log-aggregation.retain-seconds</name>
<value>5184000</value>
<description>Delete the logs after 60 days </description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>3</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<!-- Logging related option -->
<property>
<name>yarn.log-aggregation-enable</name>
<value>true</value>
</property>
<property>
<name>yarn.log.server.url</name>
<value>http://hms-master:19888/jobhistory/logs</value>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>10240</value>
<description>Total RAM that can be used on a single system by all containers.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>10240</value>
<description>Maximum RAM that one container can get</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum RAM that one container (e.g. a map or reduce task) can get. It should be less than or equal to the yarn.nodemanager.resource.memory-mb value</description>
</property>
</configuration>

Hadoop jobtracker's tracking URL cannot be accessed

I have configured my Hadoop system in WSL and run the wordcount example. But when I want to see the history of the job, I find the tracking URL cannot be accessed.
The job works well, and the job history server is running as well.
The history tracking URL is my WSL hostname:8088/proxy/application_1585482453915_0002/.
You can see the URL above.
How does this problem occur? Is it a configuration problem?
My Hadoop version is 2.7.1.
My core-site.xml
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoop/tmp</value>
<description>Abase for other temporary directories.</description>
</property>
<property>
     <name>fs.defaultFS</name>
     <value>hdfs://localhost:9000</value>
</property>
My hdfs-site.xml
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoop/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoop/tmp/dfs/data</value>
</property>
My mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>localhost:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>localhost:19888</value>
</property>
My yarn-site.xml
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
<description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
My /etc/hosts
127.0.0.1 localhost
127.0.1.1 DESKTOP-U1EOV4J.localdomain DESKTOP-U1EOV4J
The JobHistoryServer daemon is running on localhost (127.0.0.1), whereas the tracking URL is constructed with the hostname, thus redirecting to DESKTOP-U1EOV4J.localdomain (127.0.1.1).
For a pseudo-distributed cluster, it is safer to set the JobHistoryServer host to 0.0.0.0.
Update the job history server properties in mapred-site.xml
<property>
<name>mapreduce.jobhistory.address</name>
<value>0.0.0.0:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>0.0.0.0:19888</value>
</property>
and restart the JobHistoryServer.
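On a Hadoop 2.x pseudo-distributed setup, that typically amounts to the following (assuming HADOOP_HOME points at the install directory):

$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
# confirm the daemon is up and the web UI answers
jps | grep JobHistoryServer
curl -I http://localhost:19888/jobhistory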

hdfs zkfc –formatZK error

I have a cluster consisting of three nodes
hadoop-master (namenode) 192.168.4.128
hadoop-slave-1 (secondary name node ) 192.168.4.111
hadoop-slave-3 (data node ) 192.168.4.106
jps command on hadoop-master shows
15799 JournalNode
15929 Jps
14978 QuorumPeerMain
but when executing the command hdfs zkfc –formatZK on the namenode,
I am getting this error
17/03/30 07:33:09 INFO zookeeper.ZooKeeper: Session: 0x15b1ecb76480000 closed
17/03/30 07:33:09 FATAL tools.DFSZKFailoverController: Got a fatal error, exiting now
org.apache.hadoop.HadoopIllegalArgumentException: Bad argument: –formatZK
at org.apache.hadoop.ha.ZKFailoverController.badArg(ZKFailoverController.java:251)
at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:214)
at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:61)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:172)
at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:168)
at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:415)
at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:168)
at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:181)
17/03/30 07:33:09 WARN ha.ActiveStandbyElector: Ignoring stale result from old client with sessionId 0x15b1ecb76480000
17/03/30 07:33:09 INFO zookeeper.ClientCnxn: EventThread shut down
my zoo.cfg is
initLimit=10
syncLimit=5
dataDir=/usr/local/zookeeper/data/
clientPort=2181
DataLogDir=/usr/local/log/
server.1=hadoop-master:2888:3888
server.2=hadoop-slave-1:2889:3889
server.3=hadoop-slave-2:2890:3890
my slaves file is
hadoop-slave-1
hadoop-slave-2
hadoop-master
my core-site.xml
<property>
<name>dfs.tmp.dir</name>
<value>/opt/hadoop/data15</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/journal/node/local/data</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp</value>
</property>
my hdfs-site.xml is
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/data16</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/data17</value>
<final>true</final>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-slave-1:50090</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<final>true</final>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>hadoop-master,hadoop-slave-1</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-master</name>
<value>hadoop-master:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-master</name>
<value>hadoop-master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-master:8485;hadoop-slave-2:8485;hadoop-slave-1:8485/mycluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-master:2181,hadoop-slave-1:2181,hadoop-slave-2:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
I have run stop-dfs.sh on all nodes before executing hdfs zkfc –formatZK on the name node (hadoop-master).
Are there any wrong configurations?
And is it necessary to issue hdfs namenode -format before executing
hdfs zkfc –formatZK?
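For what it's worth, the stack trace rejects the literal argument –formatZK: the dash there is an en-dash (typically picked up by copy/pasting from formatted text), whereas the tool expects an ASCII hyphen, i.e. hdfs zkfc -formatZK. As for ordering, a rough sketch of the usual QJM HA bootstrap sequence on a 2.x cluster is below (hostnames per the question; the commands are the standard daemon scripts, adjust paths as needed):

# 1. start ZooKeeper on all three nodes
zkServer.sh start
# 2. start the journal nodes
$HADOOP_HOME/sbin/hadoop-daemon.sh start journalnode
# 3. format the first namenode only if it has never been formatted
hdfs namenode -format
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# 4. sync the second namenode from the first
hdfs namenode -bootstrapStandby
$HADOOP_HOME/sbin/hadoop-daemon.sh start namenode
# 5. create the HA znode once, from either namenode (note the ASCII hyphen)
hdfs zkfc -formatZK
# 6. start a ZKFC next to each namenode
$HADOOP_HOME/sbin/hadoop-daemon.sh start zkfc

So hdfs namenode -format is not a prerequisite for -formatZK; -formatZK only needs ZooKeeper reachable and the HA settings in hdfs-site.xml.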

Two name nodes are standby after configuring HA

I have configured high availability in my cluster,
which consists of three nodes:
hadoop-master (192.168.4.128) (name node)
hadoop-slave-1 (192.168.4.111) (another name node)
hadoop-slave-2 (192.168.4.106) (data node)
without formatting the name node (converting a non-HA-enabled cluster to be HA-enabled), as described here:
https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
but I got two name nodes working as standby,
so I tried to transition one of these two nodes to active by applying the following command
hdfs haadmin -transitionToActive mycluster --forcemanual
with the following output:
17/04/03 08:07:35 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-master/192.168.4.128:8020
17/04/03 08:07:36 WARN ha.HAAdmin: Proceeding with manual HA state management even though
automatic failover is enabled for NameNode at hadoop-slave-1/192.168.4.111:8020
Illegal argument: Unable to determine service address for namenode 'mycluster'
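The "Illegal argument" line appears to come from passing the nameservice name (mycluster) where haadmin expects one of the namenode IDs declared in dfs.ha.namenodes.mycluster (here hadoop-master or hadoop-slave-1), so the invocation would look more like:

hdfs haadmin -transitionToActive hadoop-master --forcemanual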
my core-site.xml is
<property>
<name>dfs.tmp.dir</name>
<value>/opt/hadoop/data15</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://hadoop-master:8020</value>
</property>
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>
<property>
<name>dfs.journalnode.edits.dir</name>
<value>/usr/local/journal/node/local/data</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp</value>
</property>
my hdfs-site.xml is
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hadoop/data16</value>
<final>true</final>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hadoop/data17</value>
<final>true</final>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop-slave-1:50090</value>
</property>
<property>
<name>dfs.nameservices</name>
<value>mycluster</value>
<final>true</final>
</property>
<property>
<name>dfs.ha.namenodes.mycluster</name>
<value>hadoop-master,hadoop-slave-1</value>
<final>true</final>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-master</name>
<value>hadoop-master:8020</value>
</property>
<property>
<name>dfs.namenode.rpc-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:8020</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-master</name>
<value>hadoop-master:50070</value>
</property>
<property>
<name>dfs.namenode.http-address.mycluster.hadoop-slave-1</name>
<value>hadoop-slave-1:50070</value>
</property>
<property>
<name>dfs.namenode.shared.edits.dir</name>
<value>qjournal://hadoop-master:8485;hadoop-slave-2:8485;hadoop-slave-1:8485/mycluster</value>
</property>
<property>
<name>dfs.ha.automatic-failover.enabled</name>
<value>true</value>
</property>
<property>
<name>ha.zookeeper.quorum</name>
<value>hadoop-master:2181,hadoop-slave-1:2181,hadoop-slave-2:2181</value>
</property>
<property>
<name>dfs.ha.fencing.methods</name>
<value>sshfence</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.private-key-files</name>
<value>root/.ssh/id_rsa</value>
</property>
<property>
<name>dfs.ha.fencing.ssh.connect-timeout</name>
<value>3000</value>
</property>
What should the service address value be? And what possible solutions can I apply in order
to turn one of the two name nodes to the active state?
Note: the ZooKeeper server on all three nodes is stopped.
I met the same issue, and it turned out that I hadn't formatted ZooKeeper and started ZKFC.
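In command form (2.x daemon scripts; run the first line on every ZooKeeper node, the second once from one namenode, the third on both namenodes):

zkServer.sh start
hdfs zkfc -formatZK
$HADOOP_HOME/sbin/hadoop-daemon.sh start zkfc

Once both ZKFCs are running they hold the leader election, and one of the two standby namenodes should become active on its own.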

How to increase the space for dfs on HDFS cluster

We have a 4-datanode HDFS cluster... there is a large amount of space available on each data node, about 98 GB... but when I look at the datanode information,
it's only using about 10 GB...
How can we make it use all of the 98 GB and not run out of space, as indicated in the image?
This is the hdfs-site.xml on the name node:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>file:///test/hadoop/hadoopinfra/hdfs/namenode</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///tmp/hadoop/data</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>2368709120</value>
</property>
<property>
<name>dfs.datanode.fsdataset.volume.choosing.policy</name>
<value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
<name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
<value>1.0</value>
</property>
</configuration>
This is the hdfs-site.xml on the data node:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>file:///test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:///tmp/hadoop/data</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>2368709120</value>
</property>
<property>
<name>dfs.datanode.fsdataset.volume.choosing.policy</name>
<value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>
<property>
<name>dfs.datanode.available-space-volume-choosing-policy.balanced-space-preference-fraction</name>
<value>1.0</value>
</property>
</configuration>
The 98 GB is under /test.
Please let us know if we missed anything in the configuration.
Look at dfs.datanode.data.dir in hdfs-site.xml. This property controls all the directories which can be used to store DFS blocks.
Documentation Link
So on your machines execute "df -h"; that should list all the mount points which make up the 98 GB. Then, in each of those mount points, decide which directory can be used to store HDFS block data and add those directories, comma separated, to dfs.datanode.data.dir in hdfs-site.xml. Then restart the namenode and all datanode services.
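For example, if df -h showed the 98 GB filesystem mounted at /test and a second data disk at /data1 (the /data1 path here is only a hypothetical illustration), the datanode property would become:

<property>
<name>dfs.datanode.data.dir</name>
<!-- one directory per mount point; /data1/... is a hypothetical second mount -->
<value>/test/hadoop/hadoopinfra/hdfs/datanode,/data1/hadoop/hdfs/datanode</value>
</property>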
And from your edited post:
<property>
<name>dfs.data.dir</name>
<value>file:///test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
It should not be file://. It should look like:
<property>
<name>dfs.data.dir</name>
<value>/test/hadoop/hadoopinfra/hdfs/datanode</value>
</property>
Same for other properties.
