Search equality in a certain field with AWK [duplicate] - bash

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 1 year ago.
I am trying to get the name out of /etc/passwd using awk to search only in the 5th field of every row, and then to cut some part of that line and print it out.
This is what I wrote but it doesn't seem to work:
for iter in "$@";
do cat /etc/passwd | awk -F ":" '$5==$iter' | cut -d":" -f6;
done;
Concerning the delimiter syntax, everything should be fine, I guess?
So my problem is in the $5==$iter part, I assume.
How can I change that $5==$iter so that, if the 5th field of a row contains my $iter var, it cuts and prints as above?
Sorry for the ignorance, I am a beginner :)
Thanks in advance.

See How do I use shell variables in an awk script?
-v should be used to pass shell variables into awk. Also, there's no reason to use either cat or cut here:
for iter in "$@"; do
awk -F: -v iter="$iter" '$5==iter { print $6 }' </etc/passwd
done
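For example, assuming a typical passwd entry such as root:x:0:0:root:/root:/bin/bash (5th field "root", 6th field "/root"), and saving the loop as lookup.sh (a hypothetical name), a run might look like:
$ ./lookup.sh root
/root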

As Charles Duffy commented, your code would be more efficient if it didn't need to read /etc/passwd every pass. And while this particular loop probably doesn't need to be optimized (after all, /etc/passwd is typically not that long and most OS's would cache the file anyway after the first read), it would be interesting to see an awk script read the file only once.
That said, here's another implementation where awk is only invoked once:
printf "%s\n" "$#" | awk -F: '
NR == FNR { etc_passwd[ $5 ] = $6; next }
{ print $0 , etc_passwd[ $0 ] }
' /etc/passwd /dev/stdin
The NR == FNR condition is an idiom that causes its associated command only to be executed for the first file in the list of files that follows the awk script (that is, for the reading of /etc/passwd).
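A quick way to see the mechanics (a.txt and b.txt being hypothetical files): NR keeps counting across all inputs while FNR resets to 1 for each new file, so NR == FNR holds exactly while the first file is being read.
awk '{ print FILENAME, NR, FNR }' a.txt b.txt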

You can also do everything in bash, example:
#!/bin/bash
declare -A passwd # declare an associative array
# build the associative array "passwd" with the
# 5th field as a "key" and 6th field as "value"
while IFS=$':\n' read -r -a line; do # emulate awk to extract fields
[[ -n "${line[4]}" ]] || continue # avoid blank "keys"
passwd["${line[4]}"]=${line[5]} # bash arrays are 0-indexed, so field 5 is index 4
done < /etc/passwd
for iter in "$@"; do
if [ "${passwd[$iter]+x}" ]; then # "+x" expands to x only if the key exists
echo "${passwd[$iter]}"
fi
done
(This version doesn't take into account multiple values for the 5th field.)
Here is a better version that can handle blank values as well, like ./script.sh '':
while IFS=$':\n' read -r -a line; do
for iter in "$@"; do
if [ "$iter" == "${line[4]}" ]; then
echo ${line[5]}
continue
fi
done
done < /etc/passwd
A pure awk solution could be:
#!/usr/bin/awk -f
BEGIN {
FS = ":"
for ( i = 1; i < ARGC; i++ ) {
args[ARGV[i]] = 1
delete ARGV[i]
}
ARGV[1] = "/etc/passwd"
}
($5 in args) { print $6 }
and you could call it as ./script.awk 'param1' 'param2' (the -f flag is already in the shebang line).

Related

How to filter text data in bash more efficiently

I have a data file which I need to filter with a bash script; see the data example:
name=pencils
name=apples
value=10
name=rocks
value=3
name=tables
value=6
name=beds
name=cups
value=89
I need to group name/value pairs like so: apples=10. If the current line starts with name and the next line also starts with name, the first line should be omitted entirely. So the result file should look like this:
apples=10
rocks=3
tables=6
cups=89
I came up with this simple solution, which works but is very slow: it takes 5 minutes to complete for a file with 2000 lines.
VALUES=$(cat input.txt)
for x in $VALUES; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done
I'm aware that this kind of task is not very suitable for bash, but the script is already written and this is just a small part of it.
How can I optimize this task in bash?
Do not run any commands in subshells; it slows your script a lot. You can do everything in the current shell.
#! /bin/bash
while IFS== read k v ; do
if [[ $k == name ]] ; then
name=$v
elif [[ $k == value ]] ; then
printf '%s=%s\n' "$name" "$v"
fi
done < input.txt > output.txt
There are three easy optimizations you can make that will greatly speed up the script without requiring a major rethink.
1. Replace for with while read
Loading input.txt into a string, and then looping over that string with for x in $VALUES is slow. It requires the whole file to be read into memory even though this task could be done in a streaming fashion, reading a line at a time.
A common replacement for for line in $(cat file) is while read line; do ... done < file. It turns out that loops are compound commands, and like the normal one-line commands we're used to, compound commands can have < and > redirections. Redirecting a file into a loop means that for the duration of the loop, stdin comes from the file. So if you call read line inside the loop then it will read one line each iteration.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}" >> output.txt
fi
done < input.txt
2. Redirect output outside loop
It's not just input that can be redirected. We can do the same thing for the >> output.txt redirection. Here's where you'll see the biggest speedup. When >> output.txt is inside the loop output.txt must be opened and closed every iteration, which is crazy slow. Moving it to the outside means it only needs to be opened once. Much, much faster.
while IFS= read -r x; do
if [[ -n $(echo $x | grep 'name=') ]]; then
name=$(echo $x | sed "s/name=//")
elif [[ -n $(echo $x | grep 'value=') ]]; then
value=$(echo $x | sed "s/value=//")
echo "${name}=${value}"
fi
done < input.txt > output.txt
3. Shell string processing
One final improvement is to use faster string processing. Calling grep requires forking a subprocess every time just to do a simple string split. It'd be a lot faster if we could do the string splitting using just shell constructs. Well, as it happens that's easy now that we've switched to read. read can do more than read whole lines; it can also split on a delimiter from the variable $IFS (inter-field separator).
while IFS='=' read -r key value; do
case "$key" in
name) name="$value";;
value) echo "$name=$value";;
esac
done < input.txt > output.txt
Further reading
BashFAQ/001 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?
This explains why I have IFS= read -r in the first two iterations.
BashFAQ/024 - I set variables in a loop that's in a pipeline. Why do they disappear after the loop terminates? Or, why can't I pipe data to read?
cmd | while read; do ... done is another popular use of while read, but it has unique pitfalls (a minimal demonstration follows this list).
BashFAQ/100 - How do I do string manipulations in bash?
More in-shell string processing options.
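As a minimal sketch of the BashFAQ/024 pitfall: each side of a pipeline runs in a subshell, so variables set inside the loop vanish once it ends.
n=0
printf 'a\nb\nc\n' | while read -r line; do n=$((n+1)); done
echo "$n"   # prints 0: the increments happened in a subshell
Redirecting the file into the loop, as done above, avoids the subshell entirely.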
If you have performance issues do not use bash at all. Use a text processing tool like, for instance, awk:
$ awk -F= '$1 == "value" {print name "=" $2} {name = $2}' data.txt
apples=10
rocks=3
tables=6
cups=89
Explanation: -F= defines the field separator as the character =. The first block is executed only if the first field of a line ($1) is equal to the string value; it prints variable name followed by the character = and the second field ($2). The second block is executed on each line and stores the second field ($2) in variable name.
Normally, if your input resembles what you show, this automatically skips the first line. Otherwise, we can exclude it explicitly using a test on the NR variable, whose value is the current line number, starting at 1:
awk -F= 'NR != 1 && $1 == "value" {print name "=" $2}
NR != 1 {name = $2}' data.txt
All this works on inputs like the one you show but not on inputs where you would have other types of lines or several value=... consecutive lines. If you really want to test that the name/value pair is on two consecutive lines we need something more. For instance, test if the first field is name and use another variable n to store the line number of the last encountered name=... line. With all these tests we can now put the 2 blocks in a slightly more intuitive order (but the opposite would work the same):
awk -F= 'NR != 1 && $1 == "name" {name = $2; n = NR}
NR != 1 && NR == n+1 && $1 == "value" {print name "=" $2}' data.txt
With awk there might be a more elegant solution but you can have:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {if($2) printf "%s=%s\n",$1,$2}' inputs.txt
RS="\n?name=" says that the record separator is name=
FS="\nvalue=" says that the field separator for each record is value=
if($2) says to only proceed the printf is the second field exists
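To see how the input gets carved up under these separators, a quick diagnostic (against the same inputs.txt) prints each record's fields; note that record 1 is empty, which is exactly why the if($2) guard is needed:
awk 'BEGIN{RS="\n?name=";FS="\nvalue="} {printf "record %d: [%s] [%s]\n", NR, $1, $2}' inputs.txt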

Adding similar lines in bash [duplicate]

This question already has answers here:
Sort keys and Sum their values in bash
(4 answers)
sum of column in text file using shell script
(4 answers)
How can I sum values in column based on the value in another column?
(5 answers)
Closed 4 years ago.
I have a file with below records:
$ cat sample.txt
ABC,100
XYZ,50
ABC,150
QWE,100
ABC,50
XYZ,100
Expecting the output to be:
$ cat output.txt
ABC,300
XYZ,150
QWE,100
I tried the below script:
PREVVAL1=0
SUM1=0
cat sam.txt | sort > /tmp/Pos.part
while read line
do
VAL1=$(echo $line | awk -F, '{print $1}')
VAL2=$(echo $line | awk -F, '{print $2}')
if [ $VAL1 == $PREVVAL1 ]
then
SUM1=` expr $SUM + $VAL2`
PREVVAL1=$VAL1
echo $VAL1 $SUM1
else
SUM1=$VAL2
PREVVAL1=$VAL1
fi
done < /tmp/Pos.part
I want a one-liner command to get the required output and avoid the while-loop approach. I just want to add up the numbers where the first column is the same and show each total on a single line.
awk -F, '{a[$1]+=$2} END{for (i in a) print i FS a[i]}' sample.txt
Output
QWE,100
XYZ,150
ABC,300
The first part is executed for each line and creates an associative array. The END part prints this array.
It's an awk one-liner:
awk -F, -v OFS=, '{sum[$1]+=$2} END {for (key in sum) print key, sum[key]}' sample.txt > output.txt
sum[$1] += $2 creates an associative array whose keys are the first field and values are the corresponding sums.
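One caveat for both awk versions above: for (key in sum) visits keys in an unspecified order, so the groups may print in any order. If you need stable output, sort afterwards, e.g.:
awk -F, -v OFS=, '{sum[$1]+=$2} END {for (key in sum) print key, sum[key]}' sample.txt | sort -t, -k1,1 > output.txt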
This can also be done easily enough in native bash. The following uses no external tools, no subshells and no pipelines, and is thus far faster (I'd place money on 100x the throughput on a typical/reasonable system) than your original code:
declare -A sums=( )
while IFS=, read -r name val; do
sums[$name]=$(( ${sums[$name]:-0} + val ))
done < sample.txt
for key in "${!sums[@]}"; do
printf '%s,%s\n' "$key" "${sums[$key]}"
done
If you want to, you can make this a one-liner:
declare -A sums=( ); while IFS=, read -r name val; do sums[$name]=$(( ${sums[$name]:-0} + val )); done < sample.txt; for key in "${!sums[@]}"; do printf '%s,%s\n' "$key" "${sums[$key]}"; done

For loop in a awk command

I have a file which has rows; now I want to read its values with an awk command in Unix. I am able to read the file, but I have added a for loop to traverse all the data in the file, and the for loop is not ending; it goes into an infinite loop.
Below is the code I am using to read the file and get the data at positions $1, $2 and $3:
file=$1;
nbrClients=`wc -l $file | cut -d' ' -f1`;
echo $nbrClients;
awk '{
for(i=0; i<=$nbrClients; ++i)
{print $1 $2 $3}
}' $file
The file I am reading has the below format:
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
abc 12 test.txt
So for this, the nbrClients value will be 6 and it should loop 6 times, but it is not doing so. Please suggest what I am doing wrong here.
Here is the full code which I am trying:
file=$1;
nbrClients=`wc -l $file | cut -d' ' -f1`;
echo $nbrClients;
file=$1;
cat | awk '{
fileName=$1
tnxCount=$2
for i in `seq 1 $tnxCount`
do
echo "Starting thread number $i"
nohup perl /home/user/abc.pl -i $fileName >>/home/user/test_load_${today}.out 2>&1 &
done
}' $file;
I think the problem here is that you're under the impression that the for loop is what will cause awk to step through your input file, whereas it's awk's nature to do that already.
Awk works by taking a set of condition { statement } pairs, and then FOR EACH LINE OF INPUT, evaluating the condition, and if it rings true, executing the statement. Note that conditions can be statements (since functions and other commands have a return value) and statements can include if constructs, so there's a lot of flexibility here.
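For instance, a small sketch of a few condition { statement } pairs (file.txt is a hypothetical input):
awk 'NR == 1 { print "first line:", $0 }   # condition: record number
     /error/ { count++ }                   # condition: regex match
     END     { print count+0, "error lines" }' file.txt
Every line of file.txt is tested against each condition in turn; no explicit loop over lines is ever needed.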
Note that awk can also reduce or simplify stuff you'd do in a shell script. Consider the following:
#!/bin/sh
file="$1"
awk '
NR==FNR {
ClientCount++
next
}
FNR==1 {
printf "%s: %d\n", FILENAME, ClientCount
}
{
print $1, $2, $3
}
' "$file" "$file"
This script reads your input file twice -- once to count the lines (so that the line count can be placed at the top of the output), and once to process the lines, printing the first three fields. The script is composed of three condition { statement } groupings:
The first one is the counter. It only operates on the first instance of the file, and the next command ensures that no other commands will be run on that file.
The second one operates on the first line of the file. But since the first condition captured all of the first file, this statement will only be executed once, when the first line of the second file is in play.
The third one is what prints the bulk of your output. With awk, when no condition is included, the condition is assumed to be "true", so this statement runs for each line of the second file.
The awk script could of course be compressed onto a single line, I've spaced it out for easier reading.
Note also that this method of keeping or showing a line count might be a little heavy handed. If you know that you're just showing a line count, you can use the internal awk variable NR. At the point in your script where the second condition is evaluated, NR-1 is the line count of the previous file, so you could use:
#!/bin/sh
file="$1"
awk '
NR==FNR {
next
}
FNR==1 {
printf "%s: %d\n", FILENAME, NR-1
}
{
print $1, $2, $3
}
' "$file" "$file"
Updating the answer based on the comments and the latest version of the question:
file=$1
nbrClients=$(wc -l < "$file")
echo "$nbrClients"
# $today is assumed to be set in the calling shell; pass it to awk with -v
awk -v today="$today" '{
    fileName = $1
    tnxCount = $2
    for (i = 1; i <= tnxCount; i++) {
        print "Starting thread number " i
        system("nohup perl /home/user/abc.pl -i " fileName " >>/home/user/test_load_" today ".out 2>&1 &")
    }
}' "$file"

Using Array With Awk [duplicate]

This question already has answers here:
How do I use shell variables in an awk script?
(7 answers)
Closed 7 years ago.
I am using an array of values, and I want to look for those values using awk and output to file. In the awk line if I replace the first "$i" with the numbers themselves, the script works, but when I try to use the variable "$i" the script no longer works.
declare -a arr=("5073770" "7577539")
for i in "${arr[@]}"
do
echo "$i"
awk -F'[;\t]' '$2 ~ "$i"{sub(/DP=/,"",$15); print $15}' $INPUT >> "$i"
done
The file I'm looking at contains many lines like the following:
chr12 3356475 . C A 76.508 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1/1:3:0:0:3:111:-10,-0.90309,0
Pass the value $i to awk using -v:
awk -F'[;\t]' -v var="$i" '$2 ~ var{sub(/DP=/,"",$15); print $15}' $INPUT >> "$i"
awk will have no idea what the value of the shell's $i is unless you explicitly pass it into awk as a variable
awk -F'[;\t]' -v "VAR=${i}" '$2 ~ VAR {....
I expect the result you see is because i is undefined inside awk and treated as zero,
which makes a test written as $2 ~ $i behave like $2 ~ $0 {...
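A quick sketch of the difference (the constants are arbitrary): the shell never expands $i inside single quotes, so awk sees either the literal string "$i" or its own variable i.
$ awk -v i=5073770 'BEGIN { print ("5073770" ~ "$i"), ("5073770" ~ i) }'
0 1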
You can avoid awk and do this in BASH itself:
arr=("5073770" "7577539" "3356475")
for i in "${arr[@]}"; do
while IFS='['$'\t'';]' read -ra arr; do
[[ ${arr[1]} == *$i* ]] && { s="${arr[14]}"; echo "${s#DP=}"; }
done < "$INPUT"
done

pulling information out of a string in shell script

I am having trouble pulling out the information I need from a string in my shell script. I have read and tried to come up with the correct awk or sed command to do it, but I just can't figure it out. Hopefully you guys can help.
Lets say I have a string as follows:
["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]
Now what I want to do is pull out all of these properties into individual arrays of strings. For example:
I would like to have an array of ids 2817262 2262 28182
an array of name somename somename somename
an array of hasproperty false false true
Can anyone help me come up with the commands I need to pull this out? Also keep in mind the string will likely be much longer than this, so it would be helpful not to hard-code it to 3 entries. Thanks so much in advance.
You could use grep with Perl-compatible regexes: -o prints only the matched part, and \K discards everything matched before it, so only the digits are reported.
grep -oP '"ids":\K\d+' file
Example:
$ echo '["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]' | grep -oP '"ids":\K\d+'
2817262
2262
28182
Since it is tagged with awk
awk '{while(x=match($0,/"ids":([^,]+)/,a)){print a[1];$0=substr($0,x+RLENGTH)}}' file
This just keeps matching any ids then changing the line to contain only what is after the id. (Note that the three-argument form of match() is a gawk extension.)
Output
2817262
2262
28182
Could also do this (inspired by Wintermute's comment on another answer):
awk -v RS=",|]" 'sub(/^.*"ids":/,"")' file
Here each record ends at a comma or the closing bracket, and sub() both strips everything up to "ids": and, by succeeding only on those records, acts as the condition that triggers the default print action.
The grep solution is beautiful. Your question was tagged awk, though. The awk solution is ugly:
echo '["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]' \
| awk '{split(substr($0,2,length($0)-2),x,",");
for(i=1;i<=length(x);i++) {split(x[i],a,":");
if(a[1]=="\"ids\"") print a[1],a[2]}}'
Output:
"ids" 2817262
"ids" 2262
"ids" 28182
Please choose the grep solution as the correct answer.
Here is a pure bash solution (long-winded, isn't it? I tend to agree with @chepner):
str='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,
"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,
"isvalid":true,"name":"somename","hasproperty":true]'
#Remove [ ]
str=${str/[/}
str=${str/]/}
declare -a ids
declare -a names
declare -a properties
oldIFS="$IFS"
IFS=','
for record in $str
do
type=${record%%:*}
value=${record##*:}
if [[ $type == \"ids\" ]]
then
ids[ids_i++]="$value"
elif [[ $type == \"name\" ]]
then
names[names_i++]="$value"
elif [[ $type == \"hasproperty\" ]]
then
properties[properties_i++]="$value"
else
echo "Ignored type: '$type'" >&2
fi
done
IFS="$oldIFS"
echo "ids: ${ids[#]}"
echo "names: ${names[#]}"
echo "properties: ${properties[#]}"
The only thing going for it is that there are no child processes.
awk 'BEGIN {
FS = ","
Index = 0
}
{
Field = 1
gsub( /[][]/,"")
gsub( /"[a-z]*":/, "")
while ( Field < NF) {
ThisID[ Index]=$Field
ThisName[ Index]=$(Field + 2)
ThisProperty[ Index]=$(Field + 3)
Index+=1
Field+=4
}
}
END {
for ( Iter=0;Iter<Index;Iter+=1) printf( "%s ", ThisID[Iter])
printf "\n"
for ( Iter=0;Iter<Index;Iter++) printf( "%s ", ThisName[Iter])
printf "\n"
for ( Iter=0;Iter<Index;Iter++) printf( "%s ", ThisProperty[Iter])
printf "\n"
}' YourFile
It still remains to assign the arrays to your favorite variable names:
unset n
string='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]'
while IFS=',' read -ra line
do
((n++))
for i in "${line[@]//\"/}"
do
eval ${i%:*}[$n]=${i#*:}
done
done < <(sed 's/[][]//g;s/,"ids/\n"ids/g' <<<$string)
The above will produce 4 arrays (ids, isvalid, name, hasproperty). If you don't need isvalid, just add a test:
unset n
string='["ids":2817262,"isvalid":true,"name":"somename","hasproperty":false,"ids":2262,"isvalid":false,"name":"somename","hasproperty":false,"ids":28182,"isvalid":true,"name":"somename","hasproperty":true]'
while IFS=',' read -ra line
do
((n++))
for i in "${line[@]//\"/}"
do
[ "${i%:*}" != "isvalid" ] && eval ${i/:/[$n]=}
done
done < <(sed 's/[][]//g;s/,"ids/\n"ids/g' <<<$string)
Given your posted input, if all you wanted was the list of each type of item then this is all you'd need:
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^ids/{print $2}' file
2817262
2262
28182
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^name/{print $2}' file
somename
somename
somename
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^hasproperty/{print $2}' file
false
false
true
$ awk -v RS=, -F: '{gsub(/[[\]"\n]/,"")} /^isvalid/{print $2}' file
true
false
true
but it's extremely unlikely that this is the right way to approach your problem. As I mentioned in a comment, edit your question to provide more information if you'd like some real help with it.
