Is there any way to pass arguments to a Pig script without using -param?

I am new to Pig. Is there any way to pass arguments to a Pig script without using -param, as in Unix? I want to access the values inside the script using positional arguments like $1 and $2.

Unlike Unix, Pig does not support positional $1, $2 notation unless the parameter names themselves happen to be 1, 2, and so on.
Ref : http://chimera.labs.oreilly.com/books/1234000001811/ch06.html#param_sub
To pass individual parameters, use -param (or its short form -p):
pig -p fn=Raj -p ln=Kumar -f display_name.pig
The other alternative is to define all key-value pairs in a parameter file:
names.cfg
---------
fn=Raj
ln=Kumar
To use the file, pass -param_file (or -m):
pig -m names.cfg -f display_name.pig

How to obtain the name of the disk given the disk ID?

The stat system call (man 2 stat) returns the ID of the device containing the file.
In a script, this ID can be obtained with, say
perl -e 'print((stat "/tmp/blah.txt")[0])'
Given the ID, how do I obtain the name of the disk, such as /dev/sda2 or /dev/disk1s1?
I want to do it in a script (bash, perl, etc.) preferably in a portable way so that it works reliably both on MacOS and Linux.
You can use df to enumerate the mounted file systems and build a map between device IDs and devices. Each df line lists the device (/dev/sda1, ...) in the first column and the mount point in the sixth column.
The following script uses a bash associative array for the map, and stat -c '%d' to extract the device ID of a given path (on macOS/BSD, the equivalent is stat -f '%d').
# Create map dev2fs
function file2dev {
  # stderr silenced: df's header line is not a valid path
  stat -c '%d' "$1" 2>/dev/null
  # On macOS/BSD, use: stat -f '%d' "$1"
  # Or the portable Perl equivalent:
  # perl -e 'print((stat($ARGV[0]))[0])' "$1"
}
declare -A dev2fs
while read -r fs size used available use mount ; do
  id=$(file2dev "$mount")
  [ "$id" ] && dev2fs[$id]=$fs
done <<< "$(df)"
# Map a file to its device
dev=$(file2dev /path/to/file)
echo "Device=${dev2fs[$dev]}"
It is also possible to iterate over mounted file systems using 'mount -l'. I'm not sure which of these commands exists on macOS.
I think you're trying to re-invent the wheel.
Given a file name, the filesystem it belongs to can be found in the first column of the last row of df's output, e.g.:
df -P /tmp | awk 'END { print $1 }'
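A hedged sketch of that idea wrapped in a small helper, using only POSIX df -P (which guarantees one output line per filesystem), so it should behave the same on Linux and macOS. The function name disk_of is an invention for illustration:

```shell
# Resolve the device backing a given path. df -P prints a portable,
# one-line-per-filesystem format, so the last line's first field is the
# device (e.g. /dev/sda2 on Linux, /dev/disk1s1 on macOS).
disk_of() {
  df -P "$1" | awk 'END { print $1 }'
}

disk_of /tmp
```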

Can /apps be extracted from a variable with /apps/ow/... in one line?

The string is: fs=/apps/ow/abc/xyz/def/
IFS=/ read -r fs fs rest <<< "$fs"
fs="/"$fs
I want to extract only the first component, i.e. the output should be /apps, in a one-liner.
There are various techniques, including read and string deletion, but I want a single-command solution using only bash built-ins, with no external commands.
The single command:
[[ $fs =~ ^(/[^/]+)/ ]]
...will assign your desired result (/apps) to ${BASH_REMATCH[1]}.
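A quick check of that capture (bash only; BASH_REMATCH holds the capture groups of the last successful [[ =~ ]] match):

```shell
# The regex anchors at the start, captures "/" plus the first non-slash
# run, and requires a following "/".
fs=/apps/ow/abc/xyz/def/
[[ $fs =~ ^(/[^/]+)/ ]]
echo "${BASH_REMATCH[1]}"   # → /apps
```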

Running Shell Script in parallel for each line in a file

I have a delimited (|) input file (TableInfo.txt) that has data as shown below
dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
...
I have a shell script (LoadTables.sh) that parses each line and calls an executable, passing arguments from the line such as dbName and TableName. This process reads data from a SQL Server and loads it into HDFS.
while IFS= read -r line;do
fields=($(printf "%s" "$line"|cut -d'|' --output-delimiter=' ' -f1-))
query=$(< ../sqoop/"${fields[1]}".sql)
sh ../ProcessName "${fields[0]}" "${fields[1]}" "$query"
done < ../TableInfo.txt
Right now the process runs sequentially for each line in the file, which is time-consuming given the number of entries.
Is there any way I can execute the process in parallel? I have heard about using xargs, GNU parallel, and the ampersand/wait approach, but I am not familiar with how to construct and use them. Any help is appreciated.
Note: I don't have GNU parallel installed on the Linux machine, so xargs seems like the only option, as I have heard some cons about the ampersand-and-wait approach.
Put an & at the end of any line you want to run in the background. Replacing the buggy array-splitting method used in your code with read's own field splitting, this looks something like:
while IFS='|' read -r db table; do
../ProcessName "$db" "$table" "$(<"../sqoop/${table}.sql")" &
done < ../TableInfo.txt
...FYI, re: what I meant about "buggy" --
fields=( $(foo) )
...performs not only string-splitting but also globbing on the output of foo; thus, a * in the output is replaced with a list of filenames in the current directory; a name such as foo[bar] can be replaced with files named foob, fooa or foor; the failglob shell option can cause such an expansion to result in a failure; the nullglob shell option can cause it to produce an empty result; etc.
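One caveat with the plain & approach is that it backgrounds every line at once and never waits for them. A hedged sketch of the same loop with a simple batch-style concurrency cap, where process_one is a stand-in for ../ProcessName and the table rows are embedded for demonstration:

```shell
# Each iteration launches a background job; every max_jobs iterations we
# block on wait until the whole batch has finished, then start the next.
process_one() { echo "loaded $1.$2" >> results.txt; }

rm -f results.txt
max_jobs=2
i=0
while IFS='|' read -r db table; do
  process_one "$db" "$table" &
  i=$((i + 1))
  [ "$i" -ge "$max_jobs" ] && { wait; i=0; }
done <<'EOF'
dbName1|Table1
dbName1|Table2
dbName2|Table3
dbName2|Table4
EOF
wait   # collect any remaining jobs before the script exits
```

Batching is cruder than a true job pool (a slow job stalls its whole batch), but it needs nothing beyond POSIX shell.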
If you have GNU xargs, consider the following:
# assuming you have "nproc" to get the number of CPUs; otherwise, hardcode
xargs -P "$(nproc)" -d $'\n' -n 1 bash -c '
db=${1%|*}; table=${1##*|}
query=$(<"../sqoop/${table}.sql")
exec ../ProcessName "$db" "$table" "$query"
' _ < ../TableInfo.txt

How to loop a bash script with different variable values each time

I am new to bash scripting. I have written a small script containing a set of commands that use declared variables (e.g. SAMPLENAME=Alex). Now I want to loop this script with different values of the declared variables (e.g. SAMPLENAME=John), read from a file each time.
For example: I have set following values of variables
#!/bin/bash
tID=000H003HG.TAAGGCGA+GCGATCTA
tSM=Alex
tLB=lib1
O_FOLDER_NAME=HD690_Alex
This is the command that will be executed using the above values of the variables:
bwa mem -V -M -R "@RG\tID:${tID}\tSM:${tSM}\tPL:ILLUMINA\tLB:${tLB}" REFERENCES/$REFERENCE.fa "<zcat OUTPUT/2_TRIMMED_DATA/$O_FOLDER_NAME/split-adapter-quality-trimmed/${O_FOLDER_NAME}-READ1.fastq.gz" "<zcat OUTPUT/2_TRIMMED_DATA/${O_FOLDER_NAME}/split-adapter-quality-trimmed/${O_FOLDER_NAME}-READ2.fastq.gz" > OUTPUT/3_MAPPED_READS/${O_FOLDER_NAME}/aligned_reads.sam
Now, after the above command executes, I want the loop to continue with the following different set of values for the declared variables:
#!/bin/bash
tID=000998U3HG.STPUIHY+UIYUSIA
tSM=John
tLB=lib2
O_FOLDER_NAME=HD700_John
Thanks!
If you have a whitespace-separated file with the fields tID, tSM, tLB, and O_FOLDER_NAME, just read those.
while read -r tID tSM tLB O_FOLDER_NAME; do
bwa mem -V -M \
-R "#RG\tID:${tID}\tSM:${tSM}\tPL:ILLUMINA\tLB:${tLB}" \
REFERENCES/"$REFERENCE".fa \
"<zcat OUTPUT/2_TRIMMED_DATA/$O_FOLDER_NAME/split-adapter-quality-trimmed/${O_FOLDER_NAME}-READ1.fastq.gz" \
"<zcat OUTPUT/2_TRIMMED_DATA/${O_FOLDER_NAME}/split-adapter-quality-trimmed/${O_FOLDER_NAME}-READ2.fastq.gz" \
> OUTPUT/3_MAPPED_READS/"${O_FOLDER_NAME}"/aligned_reads.sam
done <file
If your input file is in CSV format, you will need to fiddle with IFS or something, but this is the basic principle.
For your two examples, the file could contain the following.
000H003HG.TAAGGCGA+GCGATCTA Alex lib1 HD690_Alex
000998U3HG.STPUIHY+UIYUSIA John lib2 HD700_John
If the file has no use outside of your script, you might want to use a here document instead.
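A sketch of the here-document variant: the same read loop, with the rows embedded in the script itself. run_bwa is a stand-in for the real bwa command line, so the example stays self-contained:

```shell
# Each here-document line supplies one set of variable values; read
# splits it on whitespace into the four named variables.
run_bwa() { printf 'mapping sample %s into %s\n' "$2" "$4"; }

out=$(
  while read -r tID tSM tLB O_FOLDER_NAME; do
    run_bwa "$tID" "$tSM" "$tLB" "$O_FOLDER_NAME"
  done <<'EOF'
000H003HG.TAAGGCGA+GCGATCTA Alex lib1 HD690_Alex
000998U3HG.STPUIHY+UIYUSIA John lib2 HD700_John
EOF
)
echo "$out"
```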
Another way to do it is to change the variables file to:
#!/bin/bash
echo tID=000H003HG.TAAGGCGA+GCGATCTA
echo tSM=Alex
echo tLB=lib1
echo O_FOLDER_NAME=HD690_Alex
and then run:
eval "$(./var_file.sh)" && your_command

Expand a variable in a variable in Bash

I have a script to which I need to pass a string containing variables and text that together form a URI.
An example of this would be:
URI="${PROTO}://${USER}:${PASS}#${HOST}/${TARGET}"
The variables ${PROTO}, ${USER}, ${PASS}, ${HOST} and ${TARGET} are not defined when I define the variable $URI but they will be when the Bash Script is going to be executed so I need to expand URI to the final form of the string.
How can I do that? I've read eval: When You Need Another Chance but I really don't like the idea of using eval as it might be dangerous plus it means to escape a lot of parts of the URI string.
Is there another way to do it? What's the recommended solution to this problem?
Thanks
M
A variable stores data; if you don't have values for PROTO et al. yet, then you don't have data. You need a template.
uri_template="%s://%s:%s@%s/%s"
Later, when you do have the rest of the data, you can plug them into the template.
uri=$(printf "$uri_template" "$PROTO" "$USER" "$PASS" "$HOST" "$TARGET")
(In bash, you can avoid the fork from the command substitution by using the -v option: printf -v uri "$uri_template" "$PROTO" "$USER" "$PASS" "$HOST" "$TARGET".)
You can also define a function:
uri () {
# I'm being lazy and assuming uri will be called with the correct 5 arguments
printf "%s://%s:%s#%s/%s" "$#"
}
# Variables and functions exist in separate namespaces, so the following works
uri=$(uri "$PROTO" "$USER" "$PASS" "$HOST" "$TARGET")
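A quick check of the template approach, with placeholder credential values invented for the example:

```shell
# The template holds the URI shape; printf fills the %s slots in order.
uri_template="%s://%s:%s@%s/%s"
PROTO=http USER=bob PASS=secret HOST=example.com TARGET=index.html
uri=$(printf "$uri_template" "$PROTO" "$USER" "$PASS" "$HOST" "$TARGET")
echo "$uri"   # → http://bob:secret@example.com/index.html
```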
Before executing the script, define the variables using 'export'.
export PROTO='http'
export USER='bob'
export PASS='password'
export HOST='myhostname'
export TARGET='index.html'
bash ./myscript_with_uri.sh
OR
Create the URI script as a procedure that will return the URI.
uri_creater.sh
makeUri ()
{
URI="${PROTO}://${USER}:${PASS}#${HOST}/${TARGET}
}
script_using_uri.sh
. uri_creater.sh
PROTO='http'
USER='bob'
PASS='password'
HOST='myhostname'
TARGET='index.html'
makeUri
url="${URI}"
echo "${url}"
Tested on Centos 6.5 with bash.
