Extracting the pattern from the line - shell

How can I extract words which contains the pattern "arum" from the following line:
Agarum anoestrum alabastrum sun antirumor alarum antiserum ambulacrum antistrumatic Anatherum antistrumous androphorum antrum 4. foodstuff foody nonfood Aplectrum
So words like sun, 4., foody, nonfood should be removed.

You can use grep:
echo "Agarum anoestrum sun" | tr ' ' '\n' | grep "arum"
tr is used to split the input string in one word per line, since grep operates on a per-line basis and would display the whole line.
If you want the output to be in one line again, use:
echo "Agarum anoestrum sun" | tr ' ' '\n' | grep "arum" | tr '\n' ' '

Using grep -Eo:
grep -Eo 'a[[:alnum:]]*rum' file
arum
anoestrum
alabastrum
antirum
alarum
antiserum
ambulacrum
antistrum
atherum
antistrum
androphorum
antrum

Try:
echo Agarum anoestrum alabastrum sun antirumor alarum antiserum ambulacrum antistrumatic Anatherum antistrumous androphorum antrum 4. foodstuff foody nonfood Aplectrum | awk '{for (i=1;i<NF;i++) { if (match($i, "[aA][[:alnum:]]*[rR][uU][mM]") != 0) { printf ("%s ", $i) }} print}'

Related

CGI output Shell - md5sum wrong value

I need to split an IPv4 address into octets, calculate the MD5 hash of each and print as a CGI output:
IP1=$(echo ${REMOTE_ADDR} | tr "." " " | awk '{print $1'} | md5sum | cut -c1-32)
printf $IP1
In this example, REMOTE_ADDR = 192.168.20.100
But the MD5 of 192 gives me a wrong MD5 IP1=6be7de648baa9067fa3087928d5ab0b4, while it should be 58a2fc6ed39fd083f55d4182bf88826d
If I do this:
cat /tmp/test.txt | md5sum | cut -c1-32
where test.txt contains 192,
I get the correct MD5 hash, i.e 58a2fc6ed39fd083f55d4182bf88826d
What am I doing wrong?
Your awk's print is adding a newline so you're computing the md5 of "192\n", not "192". Use
IP1=$(printf "%s" "${REMOTE_ADDR%.*.*.*}" | md5sum | cut -c1-32)
instead, which uses shell parameter expansion to remove all but the first octet of the IP address, and printf to write it without the newline.
As #Shawn said, the problem was on awk print.
Adding tr -d '\n' solved the problem.
Now it is working correctly; for other octects I had to change print $2..etc on awk
IP1=$(echo ${REMOTE_ADDR} | awk -F. '{print $1'} | tr -d '\n' | md5sum | cut -c1-32 )

How can I search for a string that may or may not be across two lines?

So I am trying to collect as series of numbers after a string. However, the string's position in the block of text I am sorting through can change a lot.
Here is the full code I am working with right now:
echo name HF MP2 | cat > allE
for i in *.out
do grep "Slide" $i | cut -d "\\" -f2 | cat | tr -d '\n' > $i.name &&
grep "EUMP2" $i | cut -d "=" -f3 | cut -c 1-25 | tr '\n' ' ' >> $i.mp2 &&
grep 'AG\\HF' $i | cut -d "=" -f3 | cut -c 1-13 | tr '\n' ' ' >> $i.hf &&
paste $i.name >> $i.energies &&
paste $i.hf >> $i.energies &&
sed -i 's/[ABCEFGHIJKLMNOPQRSTUVWXYZ]//g' $i.mp2 &&
paste $i.mp2 >> $i.energies &&
transpose $i.energies >> $i.allE #temp.txt &&
#cat temp.txt > $i.energies
#echo $i is finished
done
echo see allE for energies
rm *.energies #temp.txt
rm *.name
rm *.mp2
The string I am searching for is AG\HF.
The problem is the data it is searching through can look like (Note: there are actual new line characters in this data, which I think is causing a bit of the problem)
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-1.
3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.6974
,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.\H
,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.,1
.3948,3.\C,0,0.,-1.3948,3.\C,0,1.2079,0.6974,3.\C,0,-1.2079,0.6974,3.\
C,0,-1.2079,-0.6974,3.\C,0,1.2079,-0.6974,3.\H,0,0.,2.4822,3.\H,0,2.14
97,1.2411,3.\H,0,-2.1497,1.2411,3.\H,0,-2.1497,-1.2411,3.\H,0,2.1497,-
1.2411,3.\H,0,0.,-2.4822,3.\\Version=ES64L-G09RevD.01\State=1-AG\HF=-4
61.3998608\MP2=-463.0005321\RMSD=3.490e-09\PG=D02H [SG"(C4H4),X(C8H8)]
\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.1_Slide0\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.,-
1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.69
74,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,0.
\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,0.
,1.3948,3.1\C,0,0.,-1.3948,3.1\C,0,1.2079,0.6974,3.1\C,0,-1.2079,0.697
4,3.1\C,0,-1.2079,-0.6974,3.1\C,0,1.2079,-0.6974,3.1\H,0,0.,2.4822,3.1
\H,0,2.1497,1.2411,3.1\H,0,-2.1497,1.2411,3.1\H,0,-2.1497,-1.2411,3.1\
H,0,2.1497,-1.2411,3.1\H,0,0.,-2.4822,3.1\\Version=ES64L-G09RevD.01\St
ate=1-AG\HF=-461.4104442\MP2=-463.0062587\RMSD=3.651e-09\PG=D02H [SG"(
C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.3_Slide1.7\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,-0.3052,3.3\C,0,0.,-3.0948,3.3\C,0,1.2079,-1.0026,3.3\C,0,-1.2079,-
1.0026,3.3\C,0,-1.2079,-2.3974,3.3\C,0,1.2079,-2.3974,3.3\H,0,0.,0.782
2,3.3\H,0,2.1497,-0.4589,3.3\H,0,-2.1497,-0.4589,3.3\H,0,-2.1497,-2.94
11,3.3\H,0,2.1497,-2.9411,3.3\H,0,0.,-4.1822,3.3\\Version=ES64L-G09Rev
D.01\State=1-AG\HF=-461.436061\MP2=-463.0177441\RMSD=7.859e-09\PG=C02H
[SGH(C4H4),X(C8H8)]\\#
OR
1\1\GINC-N076\SP\RMP2-FC\CC-pVDZ\C12H12\R2536\09-Apr-2020\0\\# mp2/cc-
pVDZ\\Squish3.6_Slide0.9\\0,1\H,0,0.,2.4822,0.\C,0,0.,1.3948,0.\C,0,0.
,-1.3948,0.\C,0,1.2079,0.6974,0.\C,0,-1.2079,0.6974,0.\C,0,-1.2079,-0.
6974,0.\C,0,1.2079,-0.6974,0.\H,0,2.1497,1.2411,0.\H,0,-2.1497,1.2411,
0.\H,0,-2.1497,-1.2411,0.\H,0,2.1497,-1.2411,0.\H,0,0.,-2.4822,0.\C,0,
0.,0.4948,3.6\C,0,0.,-2.2948,3.6\C,0,1.2079,-0.2026,3.6\C,0,-1.2079,-0
.2026,3.6\C,0,-1.2079,-1.5974,3.6\C,0,1.2079,-1.5974,3.6\H,0,0.,1.5822
,3.6\H,0,2.1497,0.3411,3.6\H,0,-2.1497,0.3411,3.6\H,0,-2.1497,-2.1411,
3.6\H,0,2.1497,-2.1411,3.6\H,0,0.,-3.3822,3.6\\Version=ES64L-G09RevD.0
1\State=1-AG\HF=-461.4376969\MP2=-463.0163868\RMSD=7.263e-09\PG=C02H [
SGH(C4H4),X(C8H8)]\\#
And several other possible combinations.
Currently I am using grep with a = as a delimiter and that works about 60% of the time.
The grep reads like so:
grep 'AG\\HF' $i | cut -d "=" -f3 | cut -c 1-13 | tr '\n' ' ' >> $i.hf
This grep is going to occur many times in the same file.
Any suggestions for making the collection of the desired 13 characters consistently would be greatly appreciated.
The end result should look like:
-461.4440942 -461.4441024 -461.4441114 -461.4441212 -461.4441321 -461.4441575 -461.4441725 -461.4441893 -461.444208 -461.4442289 -461.4442522 -461.444278 -461.4443063 -461.4443371 -461.4444054 -461.4444421 -461.4444798 -461.4445175 -461.4445544 -461.4445891
What I actually get is a combination of the desired output and this:
-461.4417716\ -461.4413023\ 1-AG\HF -461.439848\M -461.4387568\ -461.4373225\ -461.4354367\ -461.4329522\ -461.4296709\ -461.4253285\ -461.419576\M -461.4119582\ 1-AG\HF -461.4432257\ -461.4431843\ -461.4431419\ -461.443098\M -461.4430519\ 1-AG\HF -461.4429461\ -461.4428799\ -461.4427974\ -461.4426902\ -461.4425469\ -461.4423525\ -461.4420882\ -461.4417302\ -461.4412489\ 1-AG\HF -461.439758\M -461.4386392\ -461.4371684\ -461.4352341\ -461.4326853\ -461.4293183\ -461.4248614\ -461.4189557\ -461.411132\M 1-AG\HF -461.4432226\ -461.443181\M -461.4431381\ -461.4430938\ -461.4430472\ 1-AG\HF -461.4429401\ -461.4428728\ -461.4427889\ -461.44268\MP -461.4425343\ -461.4423369\ -461.4420684\ -461.4417048\ -461.4412162\ 1-AG\HF -461.4397026\ -461.4385667\ -461.4370734\ -461.4351091\ -461.4325204\ -461.4291001\ -461.424572\M -461.4185707\ -461.4106184\ 1-AG\HF -461.4432215\ -461.4431798\ -461.4431369\ -461.4430924\ -461.4430457\ 1-AG\HF -461.442938\M -461.4428704\ -461.4427861\ -461.4426766\ -461.4425301\ -461.4423316\ -461.4420617\ -461.4416963\ -461.4412051\ 1-AG\HF -461.4396839\ -461.4385423\ -461.4370413\ -461.4350669\ -461.4324646\ -461.4290263\ -461.4244739\ -461.4184402\ -461.4104442\ 1-AG\HF
Using the line
awk -v ORS=' ' -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' $i >> $i.hf &&
All that was output was
-4 -4 -4
You may use this single awk:
awk -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' file
-461.3998608
-461.4104442
-461.4440942
-461.4441483
To get output in single line:
awk -v ORS=' ' -F= '$3 ~ /AG\\HF$/{print substr($4, 1, 12)}' file

Extracting string from line, give as input to a command and then output the entire line with replacing the string

I have a file containing like below, multiple rows are there
test1| 1234 | test2 | test3
Extract second column 1234 and run a command feeding that as input
lets say we get X as output to the command
Print the output as below for each of the line
test1 | X | test2 | test3
Prefer if I could do it in one-liner, but open to ideas.
I am able to extract string using awk, but I am not sure how I can still preserve the initial output and replace it in the output. Below is what I tested
cat file.txt | awk -F '|' '{newVar=system("command "$2); print newVar $4}'
#
Sample command output, where we extract the "name"
openstack show 36a6c06e-5e97-4a53-bb42
+----------------------------+-----------------------------------+
| Property | Value |
+----------------------------+-----------------------------------+
| id | 36a6c06e-5e97-4a53-bb42 |
| name | testVM1 |
+----------------------------+-----------------------------------+
Perl to the rescue!
perl -lF'/\|/' -ne 'chomp( $F[1] = qx{ command $F[1] }); print join "|", #F' < file.txt
-n reads the input line by line
-l removes newlines from input and adds them to prints
F specifies how to split each input line into the #F array
$F[1] corresponds to the second column, we replace it with the output of the command
chomp removes the trailing newline from the command output
join glues the array back to one line
Using awk:
awk -F ' *| *' '{("command "$2) | getline $2}1' file.txt
e.g.
$ awk -F ' *| *' '{("date -d #"$2) | getline $2}1' file.txt
test1| Thu 01 Jan 1970 05:50:34 AM IST | test2 | test3
I changed the field separator from | to *| * to accommodate the spaces surrounding the fields. You can remove those based on your actual input.
This finally did the trick..
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "openstack show \047" $2 "\047"
while ( (cmd | getline line) > 0 ) {
if ( line ~ /name/ ) {
split(line,flds,/ *[|] */)
$2 = flds[3]
break
}
}
close(cmd)
print
}' file
If command can take the whole list of values once and generate the converted list as output (e.g. tr 'a-z' 'A-Z') then you'd want to do something like this to avoid spawning a shell once per input line (which is extremely slow):
awk -F' *[|] *' '{print $2}' file |
command |
awk -F' *[|] *' -v OFS=' | ' 'NR==FNR{a[FNR]=$0; next} {$2=a[FNR]} 1' - file
otherwise if command needs to be called with one value at a time (e.g. echo) or you just don't care about execution speed then you'd do:
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "command \047" $2 "\047"
if ( (cmd | getline line) > 0 ) {
$2 = line
}
close(cmd)
print
}' file
The \047s will produce single quotes around $2 when it's passed to command and so shield it from shell interpretation (see https://mywiki.wooledge.org/Quotes) and the test on the result of getline will protect you from silently overwriting the current $2 with the output of an earlier command execution in the event of a failure (see http://awk.freeshell.org/AllAboutGetline). The close() ensures that you don't end up with a "too many open files" error or other cryptic problem if the pipe isn't being closed properly, e.g. if command is generating multiple lines and you're just reading the first one.
Given your comment below, if you're going with the 2nd approach above then you'd write something like:
awk -F' *[|] *' -v OFS=' | ' '{
cmd = "openstack show \047" $2 "\047"
while ( (cmd | getline line) > 0 ) {
split(line,flds)
if ( flds[2] == "name" ) {
$2 = flds[3]
break
}
}
close(cmd)
print
}' file

Why uniq -c output with space instead of \t?

I use uniq -c some text file.
Its output like this:
123(space)first word(tab)other things
2(space)second word(tab)other things
....
So I need extract total number(like 123 and 2 above), but I can't figure out how to, because if I split this line by space, it will like this ['123', 'first', 'word(tab)other', 'things'].
I want to know why doesn't it output with tab?
And how to extract total number in shell? ( I finally extract it with python, WTF)
Update: Sorry, I didn't describe my question correctly. I didn't want to sum the total number, I just want to replace (space) with (tab), but it doesn't effect the space in words, because I still need the data after. Just like this:
123(tab)first word(tab)other things
2(tab)second word(tab)other things
Try this:
uniq -c | sed -r 's/^( *[^ ]+) +/\1\t/'
Try:
uniq -c text.file | sed -e 's/ *//' -e 's/ /\t/'
That will remove the spaces prior to the line count, and then replace only the first space with a tab.
To replace all spaces with tabs, use tr:
uniq -c text.file | tr ' ' '\t'
To replace all continuous runs of tabs with a single tab, use -s:
uniq -c text.file | tr -s ' ' '\t'
You can sum all the numbers using awk:
awk '{s+=$1}END{print s}'
$ cat <file> | uniq -c | awk -F" " '{sum += $1} END {print sum}'
One possible solution to getting tabs after counts is to write a uniq -c-like script that formats exactly how you want. Here's a quick attempt (that seems to pass my minute or so of testing):
awk '
(NR == 1) || ($0 != lastLine) {
if (NR != 1) {
printf("%d\t%s\n", count, lastLine);
}
lastLine = $0;
count = 1;
next;
}
{
count++;
}
END {
printf("%d\t%s\n", count, lastLine);
}
' yourFile.txt
Another solution. This is equivalent to the earlier sed solution, but it does use awk as requested / tagged!
cat yourFile.txt \
| uniq -c \
| awk '{
match($0, /^ *[^ ]* /);
printf("%s\t%s\n", $1, substr($0, RLENGTH + 1));
}'
Based on William Pursell answer , if you like Perl compatible regular expressions (PCRE) maybe a more elegant and modern way would be
perl -pe 's/ *(\d+) /$1\t/'
Options are to execute (-e) and print (-p).

finding pattern in a file

I have a txt file of 500 rows and one column.
The column in each row appears some what like this (as an example I am pasting two rows):
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB,chr22:49368010-49368760_NM_152247_CPT1B,chr22:49368010-49368760_NM_152253_CHKB
chr22:49367820-49368570_NR_021492_LOC100144603,chr22:49368010-49368760_NM_005198_CHKB
Want I want to extract from each row is the values starting from NM_ or NR_
like
row 1 has NR_021492 NM_005198 NM_152247 NM_152253
row 2 has NR_021492 NM_005198
...
in tab delimited file
any suggestions for a bash command line?
Try:
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g'
Presuming GNU sed.
So
sed -r -e 's/chr[0-9]+:[^_]*_(N[RM])_([0-9]+)_[^,_]+([, ]|$)/\1_\2'$'\t''/g;s/'$'\t''$//g' your_file > tab_delimited_file
EDIT: Updated to not leave a trailing tab character on each row.
EDIT 2: Updated again to work for any chr-then-number sequence.
grep "NM" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NM_/'
grep "NR" yourfiname | cut -d_ -f3 | sed 's/[/\d]*/NR_/'
cat file|sed s/$.*!(NR)//;
Use a regular expression to remove everything before the NR
awk -F '[,:_-]' '{
for (i=1; i<NF; i++)
if ($i == "NR" || $i == "NM")
printf("%s_%s ", $i, $(i+1))
print ""
}'
This will also work, but will print each match on its own line: egrep -o 'N[RM]_[0-9]+

Resources