Copy files from a UNC path to HDFS using a shell script (bash)

I have folders under the UNC path "//aloha/log/folderlevel1/folderlevel2/".
Each of these level-2 folders contains files such as "empllog.txt", "deptlog.txt" and "adminlog.txt", plus a few others.
I want to copy the contents of a folder to HDFS on a Cloudera cluster only if the folder was created in the last 24 hours and all three of these files are present. If any one of them is missing, that folder should not be copied. I also need to preserve the folder structure,
i.e. in HDFS the result should be "/user/test/todaydate/folderlevel1/folderlevel2".
I have written the shell script below, which copies files to HDFS under a folder named after today's date, but I am not sure how to handle the UNC paths and the other criteria.
day=$(date +%Y-%m-%d)
srcdir="/home/test/sparkjops"
stdir="/user/test/$day/"
hadoop dfs -mkdir $day /user/test
for f in ${srcdir}/*
do
    if [ $f == "$srcdir/empllog.txt" ]
    then
        hadoop dfs -put $f $stdir
    elif [ $f == "$srcdir/deptlog.txt" ]
    then
        hadoop dfs -put $f $stdir
    elif [ $f == "$srcdir/adminlog.txt" ]
    then
        hadoop dfs -put $f $stdir
    fi
done
I tried changing srcdir to the UNC path in the ways shown below. None of them did anything: there was no error, but nothing was copied either.
srcdir="//aloha/log/*/*"
srcdir='//aloha/log/*/*'
srcdir="\\aloha\log\*\*"
Appreciate all help.
Thanks.
EDIT 1 :
I ran the script in debug mode with sh -x, and also with bash -x (just to check). It returned a "not found" error, as shown below:
test#ubuntu:~/sparkjops$ sh -x ./hdfscopy.sh
+ date +%Y-%m-%d
+ day=2016-12-24
+ srcdir= //aloha/logs/folderlevel1/folderlevel2
+ stdir=/user/test/2016-12-24/
+ hadoop dfs -mkdir 2016-12-24 /user/test
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.
mkdir: `2016-12-24': File exists
mkdir: `/user/test': File exists
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/empllog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/deptlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
+ //aloha/logs/folderlevel1/folderlevel2/* = //aloha/logs/folderlevel1/folderlevel2/adminlog.txt.txt
./hdfscopy.sh: 12: ./hdfscopy.sh: //aloha/logs/folderlevel1/folderlevel2/*: not found
test#ubuntu:~/sparkjops$
I cannot understand why it is not reading from that path. I have tried different escape sequences as well (doubling each slash, and using Windows-style slashes as in a Windows folder path), but none of them works. They all produce the same error message. I am not sure how to read these files in the script. Any help would be appreciated.
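A UNC path such as //aloha/log is generally not something a Linux shell can read directly; the share usually has to be mounted first (for example with mount -t cifs) so that it appears as an ordinary directory. The sketch below assumes the share is already mounted at a hypothetical /mnt/aloha_log and only illustrates the selection logic: a level-2 folder qualifies when it was modified in the last 24 hours and contains all three files. Linux typically exposes a modification time rather than a creation time, so "created in the last 24 hours" is approximated here with find -mmin. The mount point and all function names are assumptions, not part of the original script.

```shell
#!/usr/bin/env bash
# Sketch only: the mount point and the function names are hypothetical.
SRC_ROOT="${SRC_ROOT:-/mnt/aloha_log}"     # assumed CIFS mount of //aloha/log
HDFS_ROOT="/user/test/$(date +%Y-%m-%d)"   # dated target folder, as in the question

# Return 0 only if all three required files exist in the given folder.
has_required_files() {
    local name
    for name in empllog.txt deptlog.txt adminlog.txt; do
        [ -f "$1/$name" ] || return 1
    done
}

# Return 0 if the folder itself was modified within the last 24 hours.
modified_last_day() {
    [ -n "$(find "$1" -maxdepth 0 -type d -mmin -1440)" ]
}

# Map <SRC_ROOT>/level1/level2 to <HDFS_ROOT>/level1/level2.
hdfs_target() {
    local dir="${1%/}"
    printf '%s/%s\n' "$HDFS_ROOT" "${dir#"$SRC_ROOT"/}"
}

# Copy every qualifying level-2 folder, preserving the folder structure.
copy_qualifying_folders() {
    local dir target
    for dir in "$SRC_ROOT"/*/*/; do
        [ -d "$dir" ] || continue
        has_required_files "$dir" && modified_last_day "$dir" || continue
        target="$(hdfs_target "$dir")"
        hdfs dfs -mkdir -p "$target" && hdfs dfs -put "$dir"* "$target"/
    done
}
```

Calling copy_qualifying_folders on a machine with the hdfs client installed would then perform the copy; the helper functions can be tried on any local directory tree without Hadoop.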

Related

hdfs ls on directory returns No such file or directory error

Running hdfs dfs -ls on either of the two directories below returns a "No such file or directory" error.
[mybox]$ hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/ | grep Tenant
drwxr-xr-x - tdcdv1r tdcdv1c 0 2018-05-01 18:28 /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event
drwxr-xr-x - tdcdv1r tdcdv1c 0 2018-05-01 15:35 /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.v2.event
See the error:
[mybox]$ hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event
ls: `/data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event': No such file or directory
I don't understand this. It is a directory, so the command should list its contents, but it returns an error instead.
You just need to escape the weird characters ({ and }) in the path:
hdfs dfs -ls /data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.\\{Tenant\\}.Rejected.v2.event
EDIT
As noted in the comments, you can quote the path instead of escaping the special characters.
This should work fine:
hdfs dfs -ls '/data/tdc/dv1/corp/base/dpp/raw/load_date=2018-05-01/rtng_ky=Access.NBNOrder.Amend.Info.{Tenant}.Rejected.v2.event'
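To see what argument the shell actually hands to hdfs in each variant, it can help to print it. The helper below (show_arg is a hypothetical name) just echoes its first argument, and the short path is a stand-in for the long one above. As far as I understand, the braces matter because the HDFS shell treats its path argument as a glob pattern in which {a,b} means alternation; treat that as an assumption to verify on your Hadoop version.

```shell
# Hypothetical helper: print the argument exactly as a command would receive it.
show_arg() {
    printf '%s\n' "$1"
}

show_arg rtng_ky=Info.{Tenant}.event        # unquoted
show_arg 'rtng_ky=Info.{Tenant}.event'      # single-quoted
show_arg rtng_ky=Info.\\{Tenant\\}.event    # backslashes escaped for the shell
```

The first two calls print the same string, because the shell leaves a brace pair without a comma untouched. Only the third call passes a backslash in front of each brace through to the command.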

Difference between 'hdfs dfs -ls' and 'hdfs dfs -ls /'

Why does hdfs dfs -ls point to a different location than hdfs dfs -ls /?
Running the two commands gives clearly different output.
What is the main cause of the difference?
This comes from the official source code, org.apache.hadoop.fs.shell.Ls (Ls.java). Search for the word DESCRIPTION and you will find the following:
public static final String DESCRIPTION =
    "List the contents that match the specified file pattern. If " +
    "path is not specified, the contents of /user/<currentUser> " +
    "will be listed. For a directory a list of its direct children " +
    "is returned (unless -" + OPTION_DIRECTORY +
    " option is specified)"
hadoop fs -ls lists the contents of the current user's home directory.
hadoop fs -ls / lists the direct children of the root directory.
The default location for -ls in Hadoop is the home directory of the user, in this case /user/root.
Adding the / makes the -ls command point at the root directory of the file system.
The / refers to the root folder of HDFS.
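The rule described in these answers can be written down as a small shell function. This is only an illustration of the rule, not how Hadoop implements it; resolve_hdfs_path and the HDFS_USER override are hypothetical names introduced here.

```shell
# Illustration of the rule only; this is not Hadoop's implementation.
# resolve_hdfs_path is a hypothetical helper name.
resolve_hdfs_path() {
    local user="${HDFS_USER:-$(id -un)}"   # HDFS_USER is an assumed override
    case "${1:-}" in
        "")  printf '/user/%s\n' "$user" ;;           # no path: home directory
        /*)  printf '%s\n' "$1" ;;                    # absolute path: used as given
        *)   printf '/user/%s/%s\n' "$user" "$1" ;;   # relative path: under home
    esac
}
```

With HDFS_USER=root, resolve_hdfs_path with no argument prints /user/root, while resolve_hdfs_path / prints /, which matches the two listings being compared.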

Spark on EMR: saveAsTextFile won't write data to local dir

Running Spark on EMR (AMI 3.8). When trying to write an RDD to a local file, I am getting no results on the name/master node.
On my previous EMR cluster (same version of Spark installed with bootstrap script instead of as an add-on to EMR), the data would write to the local dir on the name node. Now I can see it appearing in "/home/hadoop/test/_temporary/0/task*" directories on the other nodes in the cluster, but only the 'SUCCESS' file on the master node.
How can I get the file to write to the name/master node only?
Here is an example of the command I am using:
myRDD.saveAsTextFile("file:///home/hadoop/test")
I can do this in a roundabout way by pushing to HDFS first and then writing the results to the local filesystem with shell commands, but I would love to hear if others have a more elegant approach.
import org.apache.spark.rdd.RDD
import scala.sys.process._

// Write an RDD to a single local text file by going through HDFS
def rddToFile(rdd: RDD[_], filePath: String): Unit = {
  // Set up the bash commands
  val createFileStr = "hadoop fs -cat " + filePath + "/part* > " + filePath
  val removeDirStr = "hadoop fs -rm -r " + filePath
  // Remove the HDFS directory in case it already exists
  Process(Seq("bash", "-c", removeDirStr)).!
  // Save the data to HDFS
  rdd.saveAsTextFile(filePath)
  // Concatenate the part files into a local file
  Process(Seq("bash", "-c", createFileStr)).!
  // Remove the HDFS directory
  Process(Seq("bash", "-c", removeDirStr)).!
}
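A likely reason for the behaviour described above is that, with a file:// URI, each executor writes its partitions to its own local disk, so the part files end up spread across the worker nodes. One common way to get a single local file on the master is to save to HDFS and then pull the parts down with hadoop fs -getmerge, which does the concatenation that the cat pipeline above does by hand. A sketch, where hdfs_dir_to_local_file and the HADOOP_CMD override are hypothetical names introduced here:

```shell
# Hypothetical helper: merge all part files of an HDFS output directory
# into one local file, then remove the HDFS directory.
# HADOOP_CMD is an assumed override, handy for dry runs (e.g. HADOOP_CMD=echo).
hdfs_dir_to_local_file() {
    local hdfs_dir="$1"
    local local_file="$2"
    "${HADOOP_CMD:-hadoop}" fs -getmerge "$hdfs_dir" "$local_file" &&
        "${HADOOP_CMD:-hadoop}" fs -rm -r "$hdfs_dir"
}
```

Setting HADOOP_CMD=echo before calling the function prints the two commands instead of running them, which makes it easy to check the paths first.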

Checking if directory in HDFS already exists or not

I have the following directory structure in HDFS:
/analysis/alertData/logs/YEAR/MONTH/DATE/HOURS
That is, data arrives on an hourly basis and is stored in year/month/day/hour format.
I have written a shell script to which I pass the path up to
"/analysis/alertData/logs" (this varies depending on which product's data I am handling).
The script then walks the year/month/date/hour folders and returns the most recent path.
For example:
The directories present in HDFS have the following structure:
/analysis/alertData/logs/2014/10/22/01
/analysis/alertData/logs/2013/5/14/04
The shell script is given the path up to: "/analysis/alertData/logs"
It outputs the most recent directory: /analysis/alertData/logs/2014/10/22/01
My question is: how can I validate whether the HDFS directory path passed to the shell script is valid? Say I pass a wrong path, or a path that does not exist; how do I handle that in the shell script?
Sample wrong paths:
wrong path : /analysis/alertData ( correct path : /analysis/alertData/logs/ )
wrong path : /abc/xyz/ ( path does not exist in HDFS )
I tried the hadoop dfs -test options -z, -d and -e, but they did not work for me.
Any suggestions?
NOTE : Not posting my original code here, as solution to my problem does not depend on it.
Thanks in advance.
Try it without the test command ([ ]):
if hadoop fs -test -d "$yourdir"; then echo "ok"; else echo "not ok"; fi
Since
hdfs dfs -test -d $yourdir
returns 0 if the directory exists, you can run it and then check the exit status:
hdfs dfs -test -d "$yourdir"
if [ $? -eq 0 ]; then
    echo "exists"
else
    echo "dir does not exist"
fi
hadoop dfs is deprecated; use hdfs dfs instead.
Usage: hdfs dfs -test -[ezd] URI
Options:
The -e option will check to see if the file exists, returning 0 if true.
The -z option will check to see if the file is zero length, returning 0 if true.
The -d option will check to see if the path is directory, returning 0 if true.
Example: hdfs dfs -test -d $yourdir
Please check the following for more info: https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html
Regards
Hi, I have used the following script to test whether an HDFS directory exists. I see in your question that you tried this test command and it did not work. Could you please provide a trace showing why it is not working?
hadoop fs -test -d "$dirpath"
if [ $? -ne 0 ]
then
    hadoop fs -mkdir "$dirpath"
else
    echo "Directory already present in HDFS"
fi
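For a script that takes the base path as an argument, the check can be wrapped in a small function so the script stops early on a bad path. The function names below are hypothetical; the only Hadoop call used is hdfs dfs -test -d, as described above.

```shell
# Hypothetical helpers for validating an HDFS path passed to a script.
# hdfs dfs -test -d exits with 0 when the path exists and is a directory.
hdfs_dir_exists() {
    hdfs dfs -test -d "$1"
}

# Print the path if it is an existing HDFS directory, otherwise complain and fail.
require_hdfs_dir() {
    if hdfs_dir_exists "$1"; then
        printf '%s\n' "$1"
    else
        printf 'error: %s is not an existing HDFS directory\n' "$1" >&2
        return 1
    fi
}
```

A script could then start with something like base=$(require_hdfs_dir "$1") || exit 1. This only catches paths that do not exist; a path such as /analysis/alertData, which exists but is one level too high, would need an additional check against the expected year/month/day/hour layout.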
This works for Scala with Spark:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val fileExists = fs.exists(new Path(<HDFSPath>)) //return boolean of true or false
In Java, we can verify this using the FileSystem class.
FileSystem

Concatenate S3 files to read in EMR

I have an S3 bucket with log files that I want to concatenate, then use as an input to an EMR job. The log files are in paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I'd like to take all the minute logs in all the hour directories in all the date directories, and concatenate them into one file. I want to use that file as an input to an EMR job. The original log files need to be preserved, and the new combined log file will probably be written to a different S3 bucket.
I tried using hadoop fs -getmerge on the EMR master node via SSH, but got this error:
This file system object (file:///) does not support access to the request path 's3://target-bucket-name/merged.log'
The source S3 bucket has some other files in it, so I don't want to include all of its files. The wildcard match looks like this: s3n://bucket-name/*/product/out/*/log.*.
The purpose is to get around the problem of having tens/hundreds of thousands of small (10k-3mb) input files to EMR, and instead give it one large file that it can split more efficiently.
I ended up just writing a script that wraps some Hadoop filesystem commands to do this.
#!/usr/bin/env ruby
require 'date'

# Merge minute-based log files into daily log files
# Usage: Run on EMR master (e.g. SSH to master then `ruby ~/merge-historical-logs.rb [FROM [TO]]`)

SOURCE_BUCKET_NAME = 's3-logs-bucket'
DESTINATION_BUCKET_NAME = 's3-merged-logs-bucket'

# Optional date inputs
min_date = if ARGV[0]
  min_date_args = ARGV[0].split('-').map {|item| item.to_i}
  Date.new(*min_date_args)
else
  Date.new(2012, 9, 1)
end
max_date = if ARGV[1]
  max_date_args = ARGV[1].split('-').map {|item| item.to_i}
  Date.new(*max_date_args)
else
  Date.today
end

# Setup directories
hdfs_logs_dir = '/mnt/tmp/logs'
local_tmp_dir = './_tmp_merges'

puts "Cleaning up filesystem"
system "hadoop fs -rmr #{hdfs_logs_dir}"
system "rm -rf #{local_tmp_dir}*"

puts "Making HDFS directories"
system "hadoop fs -mkdir #{hdfs_logs_dir}"

# We will progress backwards, from max to min
date = max_date
while date >= min_date
  # Format date pieces
  year = date.year
  month = "%02d" % date.month
  day = "%02d" % date.day

  # Make a directory in HDFS to store this day's hourly logs
  today_hours_dir = "#{hdfs_logs_dir}/#{year}-#{month}-#{day}"
  puts "Making today's hourly directory"
  system "hadoop fs -mkdir #{today_hours_dir}"

  # Break the day's hours into a few chunks
  # This seems to avoid some problems when we run lots of getmerge commands in parallel
  [*(0..23)].each_slice(8).to_a.each do |hour_chunk|
    hour_chunk.each do |_hour|
      hour = "%02d" % _hour

      # Setup args to merge minute logs into hour logs
      source_file = "s3://#{SOURCE_BUCKET_NAME}/#{year}-#{month}-#{day}/product/out/#{hour}/"
      output_file = "#{local_tmp_dir}/#{hour}.log"

      # Launch each hour's getmerge in the background
      full_command = "hadoop fs -getmerge #{source_file} #{output_file}"
      puts "Forking: #{full_command}"
      fork { system full_command }
    end

    # Wait for this batch of getmerges to finish
    Process.waitall
  end

  # Delete the local temp files Hadoop created
  puts "Removing temp files"
  system "rm #{local_tmp_dir}/.*.crc"

  # Move local hourly logs to hdfs to free up local space
  puts "Moving local logs to HDFS"
  system "hadoop fs -put #{local_tmp_dir}/* #{today_hours_dir}"
  puts "Removing local logs"
  system "rm -rf #{local_tmp_dir}"

  # Merge the day's hourly logs into a single daily log file
  daily_log_file_name = "#{year}-#{month}-#{day}.log"
  daily_log_file_path = "#{local_tmp_dir}_day/#{daily_log_file_name}"
  puts "Merging hourly logs into daily log"
  system "hadoop fs -getmerge #{today_hours_dir}/ #{daily_log_file_path}"

  # Write the daily log file to another s3 bucket
  puts "Writing daily log to s3"
  system "hadoop fs -put #{daily_log_file_path} s3://#{DESTINATION_BUCKET_NAME}/daily-merged-logs/#{daily_log_file_name}"

  # Remove daily log locally
  puts "Removing local daily logs"
  system "rm -rf #{local_tmp_dir}_day"

  # Remove the hourly logs from HDFS
  puts "Removing HDFS hourly logs"
  system "hadoop fs -rmr #{today_hours_dir}"

  # Go back in time
  date -= 1
end
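The core of the script is one getmerge per hour directory. For readers who would rather drive this from plain shell, the list of hourly source prefixes for a given day can be generated as sketched below. The function name hour_prefixes is hypothetical; the bucket name and path layout are taken from the script above.

```shell
# Hypothetical helper: list the 24 hourly S3 prefixes for one day,
# following the bucket layout used in the Ruby script.
SOURCE_BUCKET_NAME="${SOURCE_BUCKET_NAME:-s3-logs-bucket}"

hour_prefixes() {
    local day="$1"    # expected as YYYY-MM-DD
    local hour=0
    while [ "$hour" -lt 24 ]; do
        printf 's3://%s/%s/product/out/%02d/\n' "$SOURCE_BUCKET_NAME" "$day" "$hour"
        hour=$((hour + 1))
    done
}
```

Each printed prefix could then be passed as the source argument of hadoop fs -getmerge, in the same way the Ruby script builds source_file.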

Resources