Add multiple files to distributed cache in HIVE - hadoop

I currently have an issue adding a folder's contents to Hive's distributed cache. I can successfully add multiple files to the distributed cache in Hive using:
ADD FILE /folder/file1.ext;
ADD FILE /folder/file2.ext;
ADD FILE /folder/file3.ext;
etc.
I also see that there is an ADD FILES (plural) option, which to my mind means you could specify a directory like ADD FILES /folder/; and everything in the folder gets included (this works with Hadoop Streaming's -files option). But this does not work with Hive. Right now I have to explicitly add each file.
Am I doing this wrong? Is there a way to add a whole folder's contents to the distributed cache?
P.S. I tried wildcards (ADD FILE /folder/* and ADD FILES /folder/*) but that fails too.
Edit:
As of Hive 0.11 this is now supported, so:
ADD FILE /folder
now works.
What I am doing now is passing the folder location to the Hive script as a parameter:
$ hive -f my-query.hql -hiveconf folder=/folder
and in the my-query.hql file:
ADD FILE ${hiveconf:folder}
Nice and tidy now!

ADD doesn't support directories, but as a workaround you can zip the files, then add the zip to the distributed cache as an archive (ADD ARCHIVE my.zip). When the job is running, the contents of the archive will be unpacked into the local job directory on the slave nodes (see the mapred.job.classpath.archives property).
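A short sketch of that workaround (the archive name my.zip, the .ext file names and /path/to/ are just placeholders):
$ zip -j my.zip /folder/*.ext        # -j stores just the file names, not their paths
hive> ADD ARCHIVE /path/to/my.zip;
hive> LIST ARCHIVES;                 -- verify the archive was registered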
If the number of files you want to pass is relatively small and you don't want to deal with archives, you can also write a small script which prepares the ADD FILE commands for all the files you have in a given directory:
E.g.:
#!/bin/bash
# list.sh
if [ ! "$1" ]
then
  echo "Directory is missing!"
  exit 1
fi
ls -d "$1"/* | while read f; do echo "ADD FILE $f;"; done
Then invoke it from the Hive shell and execute the generated output:
!/home/user/list.sh /path/to/files
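Alternatively, you can redirect the generated output to a file and let Hive run it as an initialization script before your query; a minimal sketch (add_files.hql is just an assumed name):
./list.sh /folder > add_files.hql
hive -i add_files.hql -f my-query.hql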

Well, in my case, I had to move a folder with child folders and files in it.
I used ADD ARCHIVE xxx.gz, which was adding the file, but it was not being exploded (unzipped) on the slave machines.
Instead, ADD FILE <folder_name_without_trailing_slash> actually copies the whole folder recursively to the slaves.
Courtesy: the comments helped with debugging.
Hope this helps!

Related

Copy files from multiple Hadoop directories to an edge node folder

I have the following multiple directories in Hadoop:
/env/hdfsdata/ob/sample/partfile..
/env/hdfsdata/ob/sample_1/partfile..
/env/hdfsdata/ob/sample_2/partfile..
I am new to Hadoop and shell scripting and am looking for a way to copy the files present in the sample directories (sample*) onto an edge node folder location, with the files named as follows, assuming sample is the prefix for the file name:
sample.txt
sample_1.txt
sample_2.txt
Once the files are copied onto the edge node location, the respective directories have to be deleted in Hadoop. I have tried listing the directories using wildcards and then processing them with a shell script and the cat command, but I am facing a "no such directory found" issue.
Use getmerge to create one file from many
#!/bin/bash
# Merge all part files under a given HDFS directory into a single local file
dl() {
  FILENAME=$1
  BASE_DIR='/env/hdfsdata/ob'
  hadoop fs -getmerge "${BASE_DIR}/${FILENAME}/*" "${FILENAME}.txt"
}
FILENAME='sample'
dl "${FILENAME}"           # sample
for i in `seq 2`; do
  dl "${FILENAME}_${i}"    # sample_1, sample_2
done
"new to hadoop and shell scripting"
You can use Java/Python/etc. to do the exact same thing.
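The question also mentions that the source directories have to be deleted once the files are copied; a hedged sketch of that cleanup step, using the paths from the question:
hadoop fs -rm -r /env/hdfsdata/ob/sample    # repeat for sample_1, sample_2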

How to create tar files automatically

I'd like to create tar files to distribute some scripts using bash.
For every script, certain configuration files and libraries (or toolboxes) are needed,
e.g. a script called CheckTool.py needs Checks.ini, CheckToolbox.py and CommontToolbox.py to run; these are stored in specific folders on my hard disk and need to be copied in the same layout onto the user's hard disk.
I can create a tar file manually for each script, but I'd like to make it simpler.
My idea is to define a list of all needed files and their paths for a specific script and read this in a bash script, which creates the tar file.
I started with:
#!/bin/bash
while read line
do
echo "$line"
done < $1
This reads the files and paths. In my example the lines are:
./CheckTools/CheckMesh.bs
./Configs/CheckMesh.ini
./Toolboxes/CommonToolbox.bs
./Toolboxes/CheckToolbox.bs
My question is: how do I have to organize the data to make a tar file with the specified files using bash?
Or does someone have a better idea?
No need for a complicated script; use tar's -T option. Every file listed in that file will be added to the tar archive:
-T, --files-from FILE
get names to extract or create from FILE
So your script becomes:
#!/bin/bash
tar -cvpf something.tar -T listoffiles.txt
The listoffiles.txt format is super easy: one file per line. You might want to use full paths to ensure you get the right files:
./CheckTools/CheckMesh.bs
./Configs/CheckMesh.ini
./Toolboxes/CommonToolbox.bs
./Toolboxes/CheckToolbox.bs
You can add tar commands to the script as needed, or you could loop over the list files; from that point on, your imagination is the limit!
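For instance, a small sketch that builds one archive per list file, assuming the list files live in a lists/ directory and each archive should be named after its list (both assumptions):
#!/bin/bash
# build one tar archive per file list, named after the list itself
for list in lists/*.txt; do
  name=$(basename "$list" .txt)
  tar -cvpf "$name.tar" -T "$list"
done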

Uploading files from multiple directories to an SFTP site using Shell Scripting

I'm trying to upload items from multiple folder locations locally to an SFTP site. I'm using an existing shell script that I know works for uploads from a single local location, but I can't figure out how to make it work for uploads from multiple local locations.
I'm fairly new to coding and have only basic experience with batch scripting and some minor editing of existing shell scripts, so I would appreciate any help that can be given.
Here's a sample of my existing single-location upload script:
open sftp://(userid):(password)@(sftp site) -hostkey="(hostkey)"
pwd
ls
lcd "(local directory)"
lls
cd (remote directory)
ls
put * -filemask=|*/ ./
exit
This has worked well for us previously. I'm trying to clean up some of our existing scripts by combining them into one process that runs as an automated task, but I can't figure out how to chain multiple uploads like this together.
Just repeat the upload code for each location:
cd /remote/directory
lcd /local/directory1
put * -filemask=|*/ ./
lcd /local/directory2
put * -filemask=|*/ ./
Though if it's really a WinSCP script, you can use just one command like:
put -filemask=|*/ /local/directory1/* /local/directory2/* /remote/directory/
See the documentation for the put command:
put <file> [ [ <file2> ... ] <directory>/[ <newname> ] ]
...
If more parameters are specified, all except the last one specify set of files to upload. Filename can be replaced with Windows wildcard to select multiple files. To upload all files in a directory, use mask *.
The last parameter specifies target remote directory and optionally operation mask to store file(s) under different name. Target directory must end with slash. ...

Getting directory of a file in Unix

I have a requirement where I need to copy some files from one location to another (where the file may already exist). While doing so, I need to:
1. Take a backup if the file already exists.
2. Copy the new file to the same location.
I am facing a problem with point 2. While trying to get the destination path for copying files, I am unable to extract the directory of the file. I tried using various options of the find command, but was unable to crack it.
I need to trim the file name from the full file path so that it can be used in the cp command. I am new to shell scripting. Any pointers are appreciated.
You can use
cp --backup
-b, --backup[=METHOD]
Make a backup of each file that would otherwise be overwritten or removed. As a special case, cp makes a backup of SOURCE when the force and backup options are given and SOURCE and DEST are the same name for an existing, regular file. One useful application of this combination of options is this tiny Bourne shell script:
#!/bin/sh
# Usage: backup FILE...
# Create a GNU-style backup of each listed FILE.
for i; do
cp --backup --force -- "$i" "$i"
done
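Applied to the question's scenario (back up the existing destination file, then overwrite it with the new one), a minimal sketch; both paths are hypothetical:
# keeps numbered backups (index.txt.~1~, index.txt.~2~, ...) of whatever gets overwritten
cp --backup=numbered /staging/index.txt /root/wkdir/index.txt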
If you need only the filename, why not do a
basename /root/wkdir/index.txt
and assign it to a variable which would return only the filename?
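If instead you need the directory portion of the path (which is what the cp step in the question calls for), dirname is the counterpart to basename; a quick sketch using the same example path:
filepath=/root/wkdir/index.txt
dir=$(dirname "$filepath")       # /root/wkdir
name=$(basename "$filepath")     # index.txt
cp /some/new/index.txt "$dir/"   # hypothetical copy into the extracted directory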

Bash script to archive files and then copy new ones

Need some help with this as my shell scripting skills are somewhat less than l337 :(
I need to gzip several files and then copy newer ones over the top from another location. I need to be able to call this script in the following manner from other scripts.
exec script.sh $oldfile $newfile
Can anyone point me in the right direction?
EDIT: To add more detail:
This script will be used for monthly updates of some documents uploaded to a folder; the old documents need to be archived into one compressed file, and the new documents, which may have different names, copied over the top of the old. The script needs to be called on a document-by-document basis from another script. The basic flow for this script should be:
1. Create a new gzip archive with a specified name (built from a prefix constant in the script and the current month and year, e.g. prefix.september.2009.tar.gz) only if it does not already exist; otherwise add to the existing one.
2. Copy the old file into the archive.
3. Replace the old file with the new one.
Thanks in advance,
Richard
EDIT: Added more detail on the archive filename
Here's the modified script based on your clarifications. I've used tar archives, compressed with gzip, to store the multiple files in a single archive (you can't store multiple files using gzip alone). This code is only superficially tested - it probably has one or two bugs, and you should add further code to check for command success etc. if you're using it in anger. But it should get you most of the way there.
#!/bin/bash
oldfile=$1
newfile=$2
month=`date +%B`
year=`date +%Y`
prefix="frozenskys"
archivefile="$prefix.$month.$year.tar"
# Check for existence of a compressed archive matching the naming convention
if [ -e "$archivefile.gz" ]
then
  echo "Archive file $archivefile already exists..."
  echo "Adding file '$oldfile' to existing tar archive..."
  # Uncompress the archive, because you can't append to a compressed archive
  gunzip "$archivefile.gz"
  # Add the file to the archive
  tar --append --file="$archivefile" "$oldfile"
  # Recompress the archive
  gzip "$archivefile"
# No existing archive - create a new one and add the file
else
  echo "Creating new archive file '$archivefile'..."
  tar --create --file="$archivefile" "$oldfile"
  gzip "$archivefile"
fi
# Update the files outside the archive
mv "$newfile" "$oldfile"
Save it as script.sh, then make it executable:
chmod +x script.sh
Then run like so:
./script.sh oldfile newfile
Something like frozenskys.September.2009.tar.gz will be created, and newfile will replace oldfile. You can also call this script with exec from another script if you want; just put this line in your second script:
exec ./script.sh $1 $2
A good reference for any bash scripting is the Advanced Bash-Scripting Guide.
This guide explains everything about bash scripting.
The basic approach I would take is:
1. Move the files you want to zip to a directory you create (commands mv and mkdir).
2. Zip the directory (command gzip, I assume).
3. Copy the new files to the desired location (command cp).
In my experience, bash scripting is mainly knowing how to use these commands well, and if you can run it on the command line you can run it in your script.
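A rough sketch of that approach; since gzip alone cannot bundle a directory, this uses tar with gzip compression for step 2 (all directory and file names are made up):
#!/bin/bash
# 1. move the files to be archived into a new directory
mkdir -p to_archive
mv oldfile1 oldfile2 to_archive/
# 2. compress the directory (tar + gzip)
tar -czf archive.tar.gz to_archive
# 3. copy the new files to the desired location
cp newfile1 newfile2 /desired/location/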
Another command that might be useful is
pwd - this returns the current directory
Why don't you use version control? It's much easier; just check out and compress.
(Apologies if that's not an option.)
