awk and md5: replace a column - bash

Starting from Awk replace a column with its hash value, I tried to hash (MD5) a list of numbers:
$ cat -n file
1 40755462755
2 40751685373
3 40730094339
4 40722740446
5 40722740446
6 40743802204
7 40730094339
8 40745188886
9 40740593352
10 40745561530
If I run:
cat file | awk '{cmd="echo -n " $1 " | md5sum|cut -d\" \" -f1"; cmd|getline md5; $1=md5;print;}' | cat -n
1 29ece26ce4633b6e9480255db194cc40
2 120148eca0891d0fc645413d0f26b66b
3 cafc48d392a004f75b669f9d1d7bf894
4 7b4367e8f58835c0827dd6a2f61b7258
5 7b4367e8f58835c0827dd6a2f61b7258
6 49b12d1f3305ab93b33b330e8b1d3165
7 49b12d1f3305ab93b33b330e8b1d3165
8 bee44c89ac9d4e8e4e1f1c5c63088c71
9 f07262ac8f53755232c5abbf062364d0
10 2ac7c22170c00a3527eb99a2bfde2c2c
I don't know why line 7 gets the same md5 as line 6, because if I run them separately they are different:
$ echo -n 40743802204 | md5sum|cut -d" " -f1
49b12d1f3305ab93b33b330e8b1d3165
$ echo -n 40730094339 | md5sum|cut -d" " -f1
cafc48d392a004f75b669f9d1d7bf894
I tried some prints:
cat file| awk '{print $0,NF,NR;cmd="echo -n " $1 " | md5sum|cut -d\" \" -f1"; cmd|getline md5; $1=md5"---"cmd"---"$1;print;}' | cat -n
but had no success finding what's going wrong.
EDIT: As the title says, I'm trying to replace a column in a file (a file with hundreds of fields). So $1 would be $24, and NF would be 120 for one file and 233 for another.

I wouldn't use getline in awk like that. You can do:
while read -r num; do
    echo -n "$num" | md5sum | cut -d ' ' -f1
done < file
29ece26ce4633b6e9480255db194cc40
120148eca0891d0fc645413d0f26b66b
cafc48d392a004f75b669f9d1d7bf894
7b4367e8f58835c0827dd6a2f61b7258
7b4367e8f58835c0827dd6a2f61b7258
49b12d1f3305ab93b33b330e8b1d3165
cafc48d392a004f75b669f9d1d7bf894
bee44c89ac9d4e8e4e1f1c5c63088c71
f07262ac8f53755232c5abbf062364d0
2ac7c22170c00a3527eb99a2bfde2c2c

OK, I found the issue: pipes opened in awk must be closed, so I needed a close(cmd). When the same command string recurs (line 7 feeds md5sum the same value as line 3), awk reuses the still-open pipe, whose output has already been consumed, so getline fails and md5 keeps its value from the previous line.
I found the solution here
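For the record, the fixed one-liner would simply add the close() call after each getline (a sketch of the fix described above):
cat file | awk '{cmd="echo -n " $1 " | md5sum|cut -d\" \" -f1"; cmd|getline md5; close(cmd); $1=md5; print}' | cat -n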

I would GUESS, but can't tell since you aren't testing its return code, that your getline is failing at line 7, so md5 still has the value it had for the previous line. Use of getline is fraught with caveats and not for use by beginners; see http://awk.info/?tip/getline.
What value are you getting out of using awk for this anyway, as opposed to just staying in shell?

It's a bit awkward with all the quoting - I'm not sure why it would fail, to be honest. But here's something that uses less awk and works just fine:
while read -r num; do echo -n "$num" | md5sum | cut -f1 -d' '; done < tmp | cat -n

Related

How to use awk to select text from a file starting from a line number until a certain string

I have a file that I want to read starting from a certain line number, until a certain string. I already used
awk "NR>=$LINE && NR<=$((LINE + 121)) {print}" db_000022_model1.dlg
to read from a specific line until an incremented line number, but now I need it to stop by itself at a certain string so I can use it on other files.
DOCKED: ENDBRANCH 7 22
DOCKED: TORSDOF 3
DOCKED: TER
DOCKED: ENDMDL
I want it to stop after it reaches
DOCKED: ENDMDL
#!/bin/bash
# This script is for extracting the pdb files from a sorted list of scored
# ligands
mkdir top_poses
for d in $(head -20 summary_2.0.sort | cut -d, -f1 | cut -d/ -f1)
do
    cd "$d" || continue
    # find the cluster with the highest population within the dlg
    RUN=$(grep '###*' "$d.dlg" | sort -k10 -r | head -1 | cut -d\| -f3 | sed 's/ //g')
    LINE=$(grep -ni "BEGINNING GENETIC ALGORITHM DOCKING $RUN of 100" "$d.dlg" | cut -d: -f1)
    echo "$LINE"
    # extract the best pose and correct the format
    awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
    # convert the pdbqt file into pdb
    #obabel -ipdbqt $d.pdbqt -opdb -O../top_poses/$d.pdb
    cd ..
done
When I try the
awk -v line="$((LINE + 14))" "NR>=line; /DOCKED: ENDMDL/{exit}" "$d.dlg" | sed 's/^........//' > "$d.pdbqt"
just like that in the shell terminal, it works, but in the script it outputs an empty file.
Depending on your requirements for handling DOCKED: ENDMDL occurring before your target line:
awk -v line="$LINE" 'NR>=line; /DOCKED: ENDMDL/{exit}' db_000022_model1.dlg
or:
awk -v line="$LINE" 'NR>=line{print; if (/DOCKED: ENDMDL/) exit}' db_000022_model1.dlg
The first exits at the first DOCKED: ENDMDL anywhere in the file, even before line $LINE; the second only tests for it once printing has begun.

how to find maximum and minimum values of a particular column using AWK [duplicate]

I'm using awk to deal with a simple .dat file, which contains several lines of data and each line has 4 columns separated by a single space.
I want to find the minimum and maximum of the first column.
The data file looks like this:
9 30 8.58939 167.759
9 38 1.3709 164.318
10 30 6.69505 169.529
10 31 7.05698 169.425
11 30 6.03872 169.095
11 31 5.5398 167.902
12 30 3.66257 168.689
12 31 9.6747 167.049
4 30 10.7602 169.611
4 31 8.25869 169.637
5 30 7.08504 170.212
5 31 11.5508 168.409
6 31 5.57599 168.903
6 32 6.37579 168.283
7 30 11.8416 168.538
7 31 -2.70843 167.116
8 30 47.1137 126.085
8 31 4.73017 169.496
The commands I used are as follows.
min=`awk 'BEGIN{a=1000}{if ($1<a) a=$1 fi} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>a) a=$1 fi} END{print a}' mydata.dat`
However, the output is min=10 and max=9.
(Similar commands return the correct minimum and maximum of the second column.)
Could someone tell me where I went wrong? Thank you!
Awk guesses the type.
String "10" is less than string "4" because character "1" comes before "4".
Force a type conversion, using addition of zero:
min=`awk 'BEGIN{a=1000}{if ($1<0+a) a=$1} END{print a}' mydata.dat`
max=`awk 'BEGIN{a= 0}{if ($1>0+a) a=$1} END{print a}' mydata.dat`
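A quick sanity check of the difference (an illustration, not from the original answer) - the string comparison puts "10" before "4", the numeric one doesn't:
$ awk 'BEGIN{ print ("10" < "4"), ("10"+0 < "4"+0) }'
1 0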
a non-awk answer:
cut -d" " -f1 file |
sort -n |
tee >(echo "min=$(head -1)") \
> >(echo "max=$(tail -1)")
That tee command is perhaps a bit too clever. tee duplicates its stdin stream to the files named as arguments, and also streams the same data to stdout. I'm using process substitutions to filter the streams.
The same effect can be used (with less flourish) to extract the first and last lines of a stream of data:
cut -d" " -f1 file | sort -n | sed -n '1s/^/min=/p; $s/^/max=/p'
or
cut -d" " -f1 file | sort -n | {
    read line
    echo "min=$line"
    while read line; do max=$line; done
    echo "max=$max"
}
Your problem was simply that in your script you had:
if ($1<a) a=$1 fi
That final fi is not part of awk syntax, so it is treated as a variable. a=$1 fi is therefore string concatenation, so you are TELLING awk that a contains a string, not a number, and hence you get the string comparison instead of the numeric one in $1<a.
More importantly in general, never start with some guessed value for max/min, just use the first value read as the seed. Here's the correct way to write the script:
$ cat tst.awk
BEGIN { min = max = "NaN" }
{
    min = (NR==1 || $1<min ? $1 : min)
    max = (NR==1 || $1>max ? $1 : max)
}
END { print min, max }
$ awk -f tst.awk file
4 12
$ awk -f tst.awk /dev/null
NaN NaN
$ a=( $( awk -f tst.awk file ) )
$ echo "${a[0]}"
4
$ echo "${a[1]}"
12
If you don't like NaN pick whatever you'd prefer to print when the input file is empty.
Late, but here's a shorter command that needs no initial assumption:
awk '(NR==1){Min=$1;Max=$1};(NR>=2){if(Min>$1) Min=$1;if(Max<$1) Max=$1} END {printf "The Min is %d ,Max is %d",Min,Max}' FileName.dat
A very straightforward solution (if it's not compulsory to use awk):
Find Min --> sort -n -r numbers.txt | tail -n1
Find Max --> sort -n -r numbers.txt | head -n1
You can use a combination of sort, head and tail to get the desired output as shown above.
(PS: if you want to extract the first column, or any desired column, you can use the cut command, e.g. to extract the first column: cut -d " " -f 1 sample.dat)
#minimum
cat your_data_file.dat | sort -nk3,3 | head -1
#this will find the minimum of column 3
#maximum
cat your_data_file.dat | sort -nk3,3 | tail -1
#this will find the maximum of column 3
#to find it in column 2, use -nk2,2
#assign to a variable and use
min_col=`cat your_data_file.dat | sort -nk3,3 | head -1 | awk '{print $3}'`

How to remove all but the last 3 parts of FQDN?

I have a list of IP lookups and I wish to remove all but the last 3 parts, so:
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
would become
163data.com.cn
I have spent hours searching for clues, including parameter substitution, but the closest I got was:
$ string="98.254.237.114.broad.lyg.js.dynamic.163data.com.cn"
$ string1=${string%.*.*.*}
$ echo $string1
Which gives me the inverted answer of:
98.254.237.114.broad.lyg.js.dynamic
which is everything but the last 3 parts.
A script to do a list would be better than just the static example I have here.
Using CentOS 6, I don't mind if it by using sed, cut, awk, whatever.
Any help appreciated.
Thanks, now that I have working answers, may I ask a follow-up: how would I then process the resulting list so that if the last part (after the last '.') is 3 characters - e.g. .com, .net etc. - only the last 2 parts are kept?
If this is against protocol, please advise how to post a follow-up question.
If parameter expansion inside another parameter expansion is supported, you can use this:
$ s='98.254.237.114.broad.lyg.js.dynamic.163data.com.cn'
$ # removing last three fields
$ echo "${s%.*.*.*}"
98.254.237.114.broad.lyg.js.dynamic
$ # remove the ${s%.*.*.*} prefix, plus the extra ., from the front
$ echo "${s#${s%.*.*.*}.}"
163data.com.cn
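Since a script for a whole list was requested, here's a minimal sketch wrapping that same expansion in a read loop (assuming one name per line in ip.txt; lines with fewer than four parts pass through unchanged):
while IFS= read -r s; do
    echo "${s#${s%.*.*.*}.}"
done < ip.txt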
You can also reverse the line, get the required fields and then reverse again - this makes it easier to change the number of fields kept:
$ echo "$s" | rev | cut -d. -f1-3 | rev
163data.com.cn
$ echo "$s" | rev | cut -d. -f1-4 | rev
dynamic.163data.com.cn
$ # and easy to use with file input
$ cat ip.txt
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
foo.bar.123.baz.xyz
a.b.c.d.e.f
$ rev ip.txt | cut -d. -f1-3 | rev
163data.com.cn
123.baz.xyz
d.e.f
echo $string | awk -F. '{ if (NF == 2) { print $0 } else { print $(NF-2)"."$(NF-1)"."$NF } }'
NF is the total number of fields separated by ".", so we want the last field ($NF), the last but one ($(NF-1)) and the last but two ($(NF-2)).
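A quick illustration of that field numbering (just a sanity check, not part of the original answer):
$ echo 'a.b.c.d' | awk -F. '{print NF, $(NF-2), $(NF-1), $NF}'
4 b c d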
$ echo $string | awk -F'.' '{printf "%s.%s.%s\n",$(NF-2),$(NF-1),$NF}'
163data.com.cn
Brief explanation:
Set the field separator to .
Print only the last 3 fields using $(NF-2), $(NF-1) and $NF.
And there's also another option you may try,
$ echo $string | awk -v FPAT='[^.]+.[^.]+.[^.]+$' '{print $NF}'
163data.com.cn
It sounds like this is what you need:
awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
e.g.
$ echo "$string" | awk -F'.' '{sub("([^.]+[.]){"NF-3"}","")}1'
163data.com.cn
but with just 1 sample input/output it's just a guess.
wrt your followup question, this might be what you're asking for:
$ echo "$string" | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
163data.com.cn
$ echo 'www.google.com' | awk -F'.' '{n=(length($NF)==3?2:3); sub("([^.]+[.]){"NF-n"}","")}1'
google.com
A version which uses only the shell and expr:
echo $(expr "$string" : '.*\.\(.*\..*\..*\)')
To use it with a file you can iterate with xargs:
File:
head list.dat
98.254.237.114.broad.lyg.js.dynamic.163data.com.cn
98.254.34.56.broad.kkk.76onepi.co.cn
98.254.237.114.polst.a65dal.com.cn
Iterating over the whole file:
cat list.dat | xargs -I^ -L1 expr "^" : '.*\.\(.*\..*\..*\)'
Notice: it won't be very efficient at large scale, so consider for yourself whether it is good enough for your use.
Regexp explanation:
.*\.        matches everything up to, and including, the dot before the last three words - the part we are not interested in (outside the brackets)
\( ... \)   the brackets mark which part we extract
.*\..*\..*  inside the brackets: three words separated by the two literal dots (\.)
details:
http://tldp.org/LDP/abs/html/string-manipulation.html -> Substring Extraction

Use regex capturing group to extract tokens and store them in variables

I have a shell script that reads an output string from the terminal that has the format:
Book 2 Title (Chapter 1) [Page 2]
I want to grab the Title, Chapter, and Page and store them in variables like this:
TITLE="Book 2 Title"
CHAPTER="Chapter 1"
PAGENUMBER="Page 2"
Is there a shell command that allows me to use regex capturing groups to grab these tokens and store them in variables?
Note: So far I've looked into awk, but it separates the tokens by a space so it doesn't work for my case.
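Yes - assuming bash 3.0 or newer (not a POSIX-only shell), the [[ =~ ]] operator does exactly this, filling the BASH_REMATCH array with the capture groups. A minimal sketch:
s="Book 2 Title (Chapter 1) [Page 2]"
if [[ $s =~ ^(.+)\ \((.+)\)\ \[(.+)\]$ ]]; then
    TITLE="${BASH_REMATCH[1]}"       # Book 2 Title
    CHAPTER="${BASH_REMATCH[2]}"     # Chapter 1
    PAGENUMBER="${BASH_REMATCH[3]}"  # Page 2
fi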
s="Book 2 Title (Chapter 1) [Page 2]"
eval $(echo $s | sed 's/^/TITLE="/;s/(/";CHAPTER="/;s/) \[/";PAGENUMBER="/;s/\]/"/' )
echo $TITLE
echo $CHAPTER
echo $PAGENUMBER
Test
Book 2 Title
Chapter 1
Page 2
Let's suppose your line is stored in a variable called "Line". The next step is to "normalize" the content so it can be parsed more easily by commands (like a line from a CSV file).
Version 1: reformat the line so it has 3 fields separated by the "|" character, then split the line so each variable receives the required field value.
Line=$(echo "$Line" | sed "s/^\(.*\) (\(.*\)) \[\(.*\)]$/\1|\2|\3/")
# put each field into the correct variable
TITLE="${Line%%|*}"
CHAPTER="${Line%|*}"; CHAPTER="${CHAPTER#*|}"
PAGENUMBER="${Line##*|}"
# display the values
echo "$TITLE + $CHAPTER + $PAGENUMBER"
Version 2: use a regular expression to mark the 3 fields, then prepare the commands that set the variables; use eval to actually run them.
eval $(echo "$Line" | sed "s/^\(.*\) (\(.*\)) \[\(.*\)]$/TITLE='\1' CHAPTER='\2' PAGENUMBER='\3'/")
echo "$TITLE + $CHAPTER + $PAGENUMBER"
Version 2 would be much easier to extend to any number of fields.
One way to achieve this could be:
myString="Book 2 Title (Chapter 1) [Page 2]"
title="${myString%(*}"
chapter="$(echo "$myString" | cut -f2 -d'(' | cut -f1 -d')')"
pageNumber="$(echo "$myString" | cut -f2 -d'[' | cut -f1 -d']')"
Output:
echo "$title" && echo "$chapter" && echo "$pageNumber"
Book 2 Title
Chapter 1
Page 2
Edit: the "enhanced" version below works even if the book title contains one or more parentheses or square brackets:
myString="Book 2 Title Foo (Revised Version) (1993) [abc publisher] (Chapter 1) [Page 2]"
title="${myString%(*}"
chapter="$(echo "$myString" | rev | cut -f2 -d ')' | cut -f1 -d'(' | rev)"
pageNumber="$(echo "$myString" | rev | cut -f2 -d ']' | cut -f1 -d'[' | rev)"
Output:
echo "$title" && echo "$chapter" && echo "$pageNumber"
Book 2 Title Foo (Revised Version) (1993) [abc publisher]
Chapter 1
Page 2
Title=$(awk 'BEGIN {FS=" "}{ print $1, $2, $3 }' filename)
chapter_tmp=$(awk 'BEGIN {FS="("}{ print $2 }' filename)
chapter=$(echo $chapter_tmp | awk 'BEGIN {FS=")"}{ print $1 }')
pages_tmp=$(awk 'BEGIN {FS="["}{ print $2 }' filename)
pages=$(echo $pages_tmp | awk 'BEGIN {FS="]"}{ print $1 }')
