Hive queries generating mismanaged staging directories - hadoop

We are using HDP hadoop distribution v2.3.2, we are dealing with Hive external tables and these are queried on daily basis.
After few days since process started, the data directories contain lot of staging directories with format: .hive-staging_hive_date-time_
There are lot of staging directories generated, each staging directory corresponds to a query run on the Hive table.
What can I do to avoid these staging directories to be piled up into my data directories ?

The answer I posted at https://stackoverflow.com/a/35583367/14186 may help you here. You can configure Hive to make those staging dirs some place else (normally they are made as a subdir of the final destination dir)
In the example from that answer, I have hive put them in dirs under /tmp, and we have a cron-job that we run each day to delete any leftover staging dirs older than 1 week to keep things tidy in case hive doesn't remove them.

Related

how hadoop directory differ from hadoop-x.x.x

I am new to hadoop and recently when I was running MapReduce jobs on Openstack hadoop cluster and cd into directory on a datanode machine, I found there are two hadoop folders one is called "hadoop" while the other named"hadoop-2.7.1". Obviously, the latter one makes more sense as it tells the hadoop version. The two folder contains same sub-directories, but how these two differ from each other? What if I'd like to disable HDFS permission checking on this machine, which one should I go?
Here is a screenshot
As colors in the screenshot suggest, hadoop is not a separate directory but is just a symbolic link, obviously pointing to hadoop-2.7.1. Run ls -l to check this.
You should cd into hadoop directory. It exists intentionally to avoid writing hadoop version explicitly. When new version of hadoop is deployed, a new versioned directory will be created, and hadoop symbolic link will be changed to point to the latest versioned directory. Like this:
hadoop-2.7.1
hadoop-2.7.2
hadoop-2.7.3
hadoop -> hadoop-2.7.3

hdfs or hadoop command to sync the files or folder between local to hdfs

I have a local files which gets added daily so I want to sync these newly added files to hdfs.
I tried below command but all are complete copy, I want some command which copies only newly added files
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
Its Alex's custom package and perhaps useful on a dev box but could be hard to deploy on production environment. I am looking for a similar solution but for now this seems to be closest. Other option is to write your own shell script to compare source/target file times and then overwrite newer files only.

"No such file or directory" in hadoop while executing WordCount program using jar command

I am new to Hadoop and am trying to execute the WordCount Problem.
Things I did so far -
Setting up the Hadoop Single Node cluster referring the below link.
http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Write the word count problem referring the below link
https://kishorer.in/2014/10/22/running-a-wordcount-mapreduce-example-in-hadoop-2-4-1-single-node-cluster-in-ubuntu-14-04-64-bit/
Problem is when I execute the last line to run the program -
hadoop jar wordcount.jar /usr/local/hadoop/input /usr/local/hadoop/output
Following is the error I get -
The directory seems to be present
The file is also present in the directory with contents
Finally, on a side note I also tried the following directory sturcture in the jar command.
No avail! :/
I would really appreciate if someone could guide me here!
Regards,
Paul Alwin
Your first image is using input from the local Hadoop installation directory, /usr
If you want to use that data on your local filesystem, you can specify file:///usr/...
Otherwise, if you're running pseudo distributed mode, HDFS has been setup, and /usr does not exist in HDFS unless you explicitly created it there.
Based on the stacktrace, I believe the error comes from the /app/hadoop/ staging directory path not existing, or the permissions for it are not allowing your current user to run commands against that path
Suggestion: Hortonworks and Cloudera offer pre-built VirtualBox images and lots of tutorial resources. Most companies will have Hadoop from one of those vendors, so it's better to get familiar with that rather than mess around with having to install Hadoop yourself from scratch, in my opinion

How to copy HDFS files from one cluster to another cluster by preserving the modification time

I have to move some HDFS files from my production cluster to dev cluster. I have to test some operations on HDFS files after moving to dev cluster based on the file modification time. Need files with different dates to test it in dev.
I tried doing with DISTCP, Modification time is updating with the current time in that. i checked the Distcp by using many parameters that I found here distcp version2 guide
Is there any other way to get the files without changing modification time? or can i change the modification time manually after getting the files into hdfs ?
thanks in advance
Use -pt flag with the hadoop distcp command. This will preserve timestamp (modification time) of the file that is distcp'd.
hadoop distcp -pt hdfs://src_cluster/file hdfs://dest_cluster/file
Tested with Hadoop-2.7.3
Refer latest Distcp Guide

how do i backup hbase using distcp?

I would like to do a back up of hbase files using distcp. Then point hbase to the newly copied files and work with the stored tables.
I realize that there are tools out there which are recommended for this job. However, I'd like to know what I need to do after I've copied the files to get hbase to recognize the copied files.
For example, i'd like to start hbase shell and scan the stored tables from the newly copied file.
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. So if you want to backup your clusterA to clusterB, you'll have to:
do the copy from clusterA to clusterB using distcp
start an Hbase master and some RegionServers
enjoy the command line interface on clusterB
This means have 2 clusters each with HDFS and Hbase.
But, if you want to backup your data in the same cluster, this is simplier:
do the intra copy in a different folder: hadoop distcp hdfs://nn:8020/hbase hdfs://nn:8020/backuptest
stop all the Hbase processes and change the property hbase.rootdir from "hbase" to "backuptest"
restart all the processes

Resources