awk exec command for every line and keep columns - bash

I have a large dataset file with two columns like
AS জীৱবিজ্ঞানবিভাগ
AS চেতনাদাস
AS বৈকল্পিক
and I want to run my command on the second column, store the result and get the output with the same column formatting:
AS jibvigyanvibhag
AS chetanadas
AS baikalpik
where my command is this pipe:
echo "$0" | indictrans -s asm -t eng --ml --build-lookup
So I'm doing something like
awk -v OFS="\t" '{ print "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | "/bin/sh"}' in.txt > out.txt
but this will not preserve the columns; it just prints out the translated text like this
jibvigyanvibhag
chetanadas
baikalpik
My solution was the following
awk -v OFS="\t" '{ "echo "$2" | indictrans -s asm -t eng --ml --build-lookup" | getline RES; print $1,$2,RES}' in.txt > out.txt
that will print out
AS জীৱবিজ্ঞানবিভাগ jibvigyanvibhag
AS চেতনাদাস chetanadas
AS বৈকল্পিক baikalpik
Now I want to parametrize the command, but the escaping looks odd here:
"echo "$0" | indictrans -s $SOURCE -t $TARGET --ml --build-lookup"
and it does not work. How do I correctly exec this command and escape the parameters?
[UPDATE]
This is a partial solution I came up with, inspired by the suggested one
#!/bin/bash
SOURCE=asm
TARGET=eng
IN=$2
OUT=$3
awk -v OFS="\t" '{
CMD = "echo "$2" | indictrans -s asm -t eng --ml --build-lookup"
CMD | getline RES
print $1,RES
close(CMD)
}' $IN > $OUT
I still cannot get rid of the hard-coded values; it seems that I cannot define them with -v as usual, like
awk -v OFS="\t" -v source=$SOURCE -v target=$TARGET '{
CMD = "echo "$2" | indictrans -s source -t target --ml --build-lookup"
...
NOTES.
The indictrans process handles the stdin and writes to stdout in this way:
for line in ifp:
    tline = trn.convert(line)
    ofp.write(tline)
# close files
ifp.close()
ofp.close()
where
ifp = codecs.getreader('utf8')(sys.stdin)
ofp = codecs.getwriter('utf8')(sys.stdout)
so it takes one line from stdin, processes the data with some library trn.convert and writes the results to stdout without any parallelism.
For this reason (lack of parallelism over multi-line input) the performance is bound by the size of the dataset (number of rows).
An example two-column input dataset (1K rows) is available here. A sample:
KN ಐಕ್ಯತೆ ಕ್ಷೇಮಾಭಿವೃದ್ಧಿ ಸಂಸ್ಥೆ ವಿಜಯಪುರ
KN ಹೊರಗಿನ ಸಂಪರ್ಕಗಳು
KN ಮಕ್ಕಳ ಸಾಹಿತ್ಯ ಮತ್ತು ಸಾಂಸ್ಖ್ರುತಿಕ ಕ್ಷೇತ್ರದಲ್ಲಿ ಸೇವೆ ಸಲ್ಲಿಸುತ್ತಿರುವ ಸಂಸ್ಠೆ ಮಕ್ಕಳ ಲೋಕ
while the example script based on the last accepted answer is here

Don't invoke shells with awk. The shell itself avoids treating data as if it were code unless explicitly instructed to do otherwise -- but when you use system() or popen(), as the awk code is doing here, everything passed as an argument is parsed in a context where data is able to escape its quoting and be treated as code.
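For instance (a hypothetical illustration with a harmless payload), a field that merely contains shell syntax gets executed by the /bin/sh that awk spawns:
printf 'AS $(date>/tmp/owned)\n' |
    awk '{ cmd = "echo " $2; cmd | getline res; close(cmd); print res }'
# The $(...) embedded in the data runs inside the spawned shell, creating
# /tmp/owned even though it was never part of the script.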
Simple approach: One indictrans per line
If you need a separate copy of indictrans for each line to be executed, use:
while read -r col1 rest; do
printf '%s\t%s\n' "$col1" "$(indictrans -s asm -t eng --ml --build-lookup <<<"$rest")"
done <in.txt >out.txt
Fast Approach: One indictrans processing all lines
If indictrans generates one line of output per line of input, you can do even better, by pasting together one stream with all the first columns and a second stream with the translations of the remainder of the lines, thus requiring only one copy of indictrans to be run:
#!/usr/bin/env bash
# ^^^^- not compatible with /bin/sh
paste <(<in.txt awk '{print $1}') \
      <(<in.txt sed -E 's/^[^[:space:]]*[[:space:]]//' \
          | indictrans -s asm -t eng --ml --build-lookup) \
      >out.txt
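The same idea with the languages parameterized (a sketch reusing the $SOURCE and $TARGET variables from the question's wrapper script):
paste <(awk '{print $1}' in.txt) \
      <(sed -E 's/^[^[:space:]]*[[:space:]]//' in.txt \
          | indictrans -s "$SOURCE" -t "$TARGET" --ml --build-lookup) \
      >out.txt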

You can pipe column 2 to your command and replace it with the command's output, like below in awk.
{
cmd = "echo "$2" | indictrans -s asm -t eng --ml --build-lookup"
cmd | getline $2
close(cmd)
} 1
If SOURCE and TARGET are awk variables, concatenate them into the command string:
{
    cmd = "echo "$2" | indictrans -s "SOURCE" -t "TARGET" --ml --build-lookup"
    cmd | getline $2
    close(cmd)
} 1
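Wired up from the wrapper script, the full invocation might look like this (a sketch using the variable names from the question's update):
awk -v OFS="\t" -v SOURCE="$SOURCE" -v TARGET="$TARGET" '{
    cmd = "echo " $2 " | indictrans -s " SOURCE " -t " TARGET " --ml --build-lookup"
    cmd | getline $2
    close(cmd)
} 1' "$IN" > "$OUT"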

Related

Concatenate the output of 2 commands in the same line in Unix

I have a command like below
md5sum test1.txt | cut -f 1 -d " " >> test.txt
I want the output of the above command prefixed with File_CheckSum:
Expected output: File_CheckSum: <checksumvalue>
I tried as follows
echo 'File_Checksum:' >> test.txt | md5sum test.txt | cut -f 1 -d " " >> test.txt
but getting result as
File_Checksum:
adbch345wjlfjsafhals
I want the entire output in 1 line
File_Checksum: adbch345wjlfjsafhals
echo writes a newline after it finishes writing its arguments. Some versions of echo allow a -n option to suppress this, but it's better to use printf instead.
You can use a command group to concatenate the standard output of your two commands:
{ printf 'File_Checksum: '; md5sum test.txt | cut -f 1 -d " "; } >> test.txt
Note that there is a race condition here: you can theoretically write to test.txt before md5sum is done reading from it, causing you to checksum more data than you intended. (Your original command mentions test1.txt and test.txt as separate files, so it's not clear if you are really reading from and writing to the same file.)
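If you really do read from and write to the same file, one way to avoid that race (a sketch) is to capture the checksum before opening the file for appending:
sum=$(md5sum test.txt | cut -f 1 -d " ")    # hashing finishes first
printf 'File_Checksum: %s\n' "$sum" >> test.txt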
You can use command grouping to have a list of commands executed as a unit and redirect the output of the group at once:
{ printf 'File_Checksum: '; md5sum test1.txt | cut -f 1 -d " "; } >> test.txt
printf "%s: %s\n" "File_Checksum:" "$(md5sum < test1.txt | cut ...)" > test.txt
Note that if you are trying to compute the hash of test.txt (the same file you are trying to write to), this changes things significantly.
Another option is:
{
printf "File_Checksum: "
md5sum ...
} > test.txt
Or:
exec > test.txt
printf "File_Checksum: "
md5sum ...
but be aware that all subsequent commands will also write their output to test.txt. The typical way to restore stdout is:
exec 3>&1
exec > test.txt # Redirect all subsequent commands to `test.txt`
printf "File_Checksum: "
md5sum ...
exec >&3 # Restore original stdout
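If nothing else needs the saved descriptor afterwards, it can also be closed:
exec 3>&-    # close the temporary copy of the original stdout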
You can also chain the commands with the && operator, e.g.
mkdir example && cd example

I'm facing an error while converting my bash command to a shell script: syntax error in shell script

#!/bin/bash
set -o errexit
set -o nounset
#VAF_and_IGV_TAG
paste <(grep -v "^#" output/"$1"/"$1"_Variant_Filtering/"$1"_GATK_filtered.vcf | cut -f-5) \
<(grep -v "^#" output/"$1"/"$1"_Variant_Filtering/"$1"_GATK_filtered.vcf | cut -f10-| cut -d ":" -f2,3) |
sed 's/:/\t/g' |
sed '1i chr\tstart\tend\tref\talt\tNormal_DP_VCF\tTumor_DP_VCF\tDP'|
awk 'BEGIN{FS=OFS="\t"}{sub(/,/,"\t",$6);print}' \
> output/"$1"/"$1"_Variant_Annotation/"$1"_VAF.tsv
My above code ends up with a syntax error when run as a script:
sh Test.sh S1
Test.sh: 6: Test.sh: Syntax error: "(" unexpected
If I run the same pipeline in the terminal without using the variable, it shows no syntax error:
paste <(grep -v "^#" output/S1/S1_Variant_Filtering/S1_GATK_filtered.vcf | cut -f-5) \
<(grep -v "^#" output/S1/S1_Variant_Filtering/S1_GATK_filtered.vcf | cut -f10-| cut -d ":" -f2,3) |
sed 's/:/\t/g' |
sed '1i chr\tstart\tend\tref\talt\tNormal_DP_VCF\tTumor_DP_VCF\tDP'|
awk 'BEGIN{FS=OFS="\t"}{sub(/,/,"\t",$6);print}' \
> output/S1/S1_Variant_Annotation/S1_VAF.ts
My vcf file looks like this: https://drive.google.com/file/d/1HaGx1-3o1VLCrL8fV0swqZTviWpBTGds/view?usp=sharing
You cannot use <(command) process substitution if you are trying to run this code under sh. Unfortunately, there is no elegant way to avoid a temporary file (or something even more horrid) but your paste command - and indeed the entire pipeline - seems to be reasonably easy to refactor into an Awk script instead.
#!/bin/sh
set -eu
awk -F '\t' 'BEGIN { OFS=FS;
    print "chr\tstart\tend\tref\talt\tNormal_DP_VCF\tTumor_DP_VCF\tDP" }
!/#/ { p=$0; sub(/^([^\t]*\t){9}/, "", p);
    sub(/^[^:]*:/, "", p); sub(/:.*/, "", p);
    sub(/,/, "\t", p);
    s = sprintf("%s\t%s\t%s\t%s\t%s\t%s", $1, $2, $3, $4, $5, p);
    gsub(/:/, "\t", s);
    print s
}' output/"$1"/"$1"_Variant_Filtering/"$1"_GATK_filtered.vcf \
    > output/"$1"/"$1"_Variant_Annotation/"$1"_VAF.tsv
Without access to the VCF file, I have been unable to test this, but at the very least it should suggest a general direction for how to proceed.
sh does not support bash process substitution <(). The easiest way to port it is to write out two temporary files, and remove them via a trap when done (a sketch of that appears after the sed version below). The better option is to use a tool that is sufficiently powerful (i.e. sed) to do the filtering and manipulation required:
#!/bin/sh
header="chr\tstart\tend\tref\talt\tNormal_DP_VCF\tTumor_DP_VCF\tDP"
field_1_to_5='\(\([^\t]*\t\)\{5\}\)' # \1 to \2
field_6_to_8='\([^\t]*\t\)\{4\}[^:]*:\([^,]*\),\([^:]*\):\([^:]*\).*' # \3 to \6
src="output/${1}/${1}_Variant_Filtering/${1}_GATK_filtered.vcf"
dst="output/${1}/${1}_Variant_Variant_Annotation/${1}_VAF.tsv"
sed -n \
-e '1i '"$header" \
-e '/^#/!s/'"${field_1_to_5}${field_6_to_8}"'/\1\4\t\5\t\6/p' \
"$src" > "$dst"
If you are using awk (or perl, python etc) just port the script to that language instead.
As an aside, all those repeated $1 suggest you should rework your file naming standard.

Use argument twice from standard output pipelining

I have a command line tool which receives two arguments:
TOOL arg1 -o arg2
I would like to invoke it with the same argument provided for both arg1 and arg2, and to make that easy for me, I thought I would do:
each <arg1_value> | TOOL $1 -o $1
but that doesn't work: $1 is not replaced, but is added once to the end of the command line.
An explicit example, performing:
cp fileA fileA
returns an error fileA and fileA are identical (not copied)
While performing:
echo fileA | cp $1 $1
returns the following error:
usage: cp [-R [-H | -L | -P]] [-fi | -n] [-apvX] source_file target_file
cp [-R [-H | -L | -P]] [-fi | -n] [-apvX] source_file ... target_directory
any ideas?
If you want to use xargs, the -I option may help:
-I replace-str
    Replace occurrences of replace-str in the initial-arguments with names read from standard input. Also, unquoted blanks do not terminate input items; instead the separator is the newline character. Implies -x and -L 1.
Here is a simple example:
mkdir test && cd test && touch tmp
ls | xargs -I '{}' cp '{}' '{}'
This returns an error: cp: tmp and tmp are the same file
The xargs utility will duplicate its input stream to replace all placeholders in its argument if you use the -I flag:
$ echo hello | xargs -I XXX echo XXX XXX XXX
hello hello hello
The placeholder XXX (may be any string) is replaced with the entire line of input from the input stream to xargs, so if we give it two lines:
$ printf "hello\nworld\n" | xargs -I XXX echo XXX XXX XXX
hello hello hello
world world world
You may use this with your tool:
$ generate_args | xargs -I XXX TOOL XXX -o XXX
Where generate_args is a script, command or shell function that generates arguments for your tool.
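For the cp example from the question, generate_args could be as small as this (a hypothetical sketch):
generate_args() { printf '%s\n' fileA; }    # emit one argument per line
generate_args | xargs -I XXX cp XXX XXX     # fails with the expected "identical files" error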
The reason
each <arg1_value> | TOOL $1 -o $1
did not work, apart from each not being a command that I recognise, is that $1 expands to the first positional parameter of the current shell or function.
The following would have worked:
set - "arg1_value"
TOOL "$1" -o "$1"
because that sets the value of $1 before calling your tool.
You can re-run a shell to perform variable expansion, with sh -c. The -c option takes an argument which is the command to run in a shell, performing expansion. The next arguments of sh are interpreted as $0, $1, and so on, for use inside the -c command. For example:
sh -c 'echo $1, i repeat: $1' foo bar baz
will execute echo $1, i repeat: $1 with $1 set to bar ($0 is set to foo and $2 to baz), finally printing bar, i repeat: bar.
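Combined with xargs, the same trick lets the piped value fill both argument slots (a sketch; TOOL is the placeholder name from the question):
echo fileA | xargs -n 1 sh -c 'TOOL "$1" -o "$1"' _
# xargs appends each item after the dummy "_" (which becomes $0), so $1 is fileA
# and the tool is run as: TOOL fileA -o fileA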
The $1, $2 ... $N are only visible inside a bash script, as the arguments passed to that script, and won't work the way you want them to here. Piping redirects stdout to stdin and is not what you are looking for either.
If you just want a one-liner, use something like
ARG1=hello && tool $ARG1 $ARG1
Using GNU parallel to use STDIN four times, to print a multiplication table:
seq 5 | parallel 'echo {} \* {} = $(( {} * {} ))'
Output:
1 * 1 = 1
2 * 2 = 4
3 * 3 = 9
4 * 4 = 16
5 * 5 = 25
One could encapsulate the tool using awk:
$ echo arg1 arg2 | awk '{ system("echo TOOL " $1 " -o " $2) }'
TOOL arg1 -o arg2
Remove the echo within the system() call and TOOL should be executed in accordance with requirements:
echo arg1 arg2 | awk '{ system("TOOL " $1 " -o " $2) }'
Double up the data from a pipe, and feed it to a command two at a time, using sed and xargs:
seq 5 | sed p | xargs -L 2 echo
Output:
1 1
2 2
3 3
4 4
5 5

how do I run concurrent background process from shell script?

I tried the following:
#!/bin/bash
while read device; do
    name=$(echo "$device" | awk '{ print $1 }')
    ip=$(echo "$device" | awk '{ print $2 }')
    while read creds; do
        community=$(echo "$creds" | awk '{ print $1 }')
        version=$(echo "$creds" | awk '{ print $2 }')
        mkdir -p walks/$name;
        `echo -e "snmpwalk -v$version -c \x27$community\x27 $ip system > walks/$name/$community-$version.txt
    done < <(##MySQL query that returns tuples in form: (snmp_ro,(1,2c,3))##")
done < <(cat devices.txt)
exit 0
This is meant to go through and find the snmp string and version of each device.
devices.txt is a list of devices in form: hostname ip
It doesn't create the file walks/$name/$community-$version.txt, and it only seems to run through the walks one at a time, which I don't want.
Use & to put the contents you want backgrounded in, well, the background.
pids=( )
while read -r -u 3 name ip _; do
    while read -r -u 4 community version _; do
        mkdir -p "walks/$name"
        snmpwalk -v"$version" -c "$community" "$ip" system \
            </dev/null >"walks/$name/$community-$version.txt" & pids+=( "$!" )
    done 4< <(: get data for "$name" and "$ip")
done 3<devices.txt
wait "${pids[@]}"
Other items of note:
read can already split fields into their own variables; using awk for this is silly.
The _ in read -r foo bar _ ensures that if more than two columns exist in the input file, the third column and onward are discarded (actually, put into a variable named _, but this is considered discard by convention) rather than appended to bar.
Make a habit of quoting expansions unless you have a specific and compelling reason to do otherwise; otherwise, you get string-splitting and glob expansion of string contents.
This example puts each input stream on its own file descriptor, and redirects each read to its own FD. This prevents any other content within your loop from consuming stdin.
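As a hypothetical illustration of what that guards against: a command that itself reads stdin inside a plain while read loop will swallow the remaining input lines.
while read -r name ip; do
    ssh "$ip" uptime    # ssh also reads stdin, consuming the rest of devices.txt
done <devices.txt
# Only the first device is processed; reading via a dedicated FD (read -u 3 ... 3<devices.txt)
# or redirecting the inner command from /dev/null, as above, avoids this.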

AWK: execute CURL on each line and parse result

Given an input stream with the following lines:
123
456
789
098
...
I would like to call
curl -s http://foo.bar/some.php?id=xxx
with xxx being the number from each line, and every time let an awk script fetch some information from the curl output, which is written to the output stream. I am wondering if this is possible without using the awk "system()" call in the following way:
cat lines | grep "^[0-9]*$" | awk '
{
system("curl -s " $0 \
" | awk \'{ #parsing; print }\'")
}'
You can use bash and avoid the awk system call:
grep "^[0-9]*$" lines | while read line; do
curl -s "http://foo.bar/some.php?id=$line" | awk 'do your parsing ...'
done
A shell loop would achieve a similar result, as follows:
#!/bin/bash
for f in $(cat lines|grep "^[0-9]*$"); do
curl -s "http://foo.bar/some.php?id=$f" | awk '{....}'
done
Alternative methods for doing similar tasks include using Perl or Python with an HTTP client.
If the ids are dynamically appended to your file, you can daemonize a small while loop to keep checking the file for more data, like this:
while IFS= read -d $'\n' -r a || sleep 1; do [[ -n "$a" ]] && curl -s "http://foo.bar/some.php?id=${a}"; done < lines.txt
Otherwise, if it's static, you can change the sleep 1 to break and it will read the file and quit when there is no data left, as shown below.
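The static-file variant, with sleep 1 replaced by break so the loop simply stops at end of file:
while IFS= read -d $'\n' -r a || break; do
    [[ -n "$a" ]] && curl -s "http://foo.bar/some.php?id=${a}"
done < lines.txt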
