Find a match in a field and print next n fields - bash

BASH noob here.
I have a tab separated file structured like this:
ABC DEF x 123 456
GHI x 678 910
I need to match "x" and print x plus the following two fields:
x 123 456
x 678 910
I've tried a few things, but what throws me off is that "x" is not always in the same field. Can somebody please help?
Thanks in advance.

If you are working in bash, then bash provides parameter expansions with substring removal that are built-in. They (along with many more) are:
${var#pattern} Strip shortest match of pattern from front of $var
${var##pattern} Strip longest match of pattern from front of $var
${var%pattern} Strip shortest match of pattern from back of $var
${var%%pattern} Strip longest match of pattern from back of $var
So in your case you want to strip the longest match of the pattern *x from the front of the line, e.g.
while read line || [ -n "$line" ]; do
echo "x${line##*x}"
done
Here you read each line and strip everything from the front through the last 'x' on the line (the 'x' itself is removed as well), then output "x....." where "....." is the rest of the line (restoring the 'x').
(for large data sets, you would want to use awk or sed for efficiency reasons)
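A field-aware variant in pure bash, as a minimal sketch: it compares whole tab-separated fields against "x" instead of pattern-matching the line, so an x inside a longer field is ignored. The function name print_from_x is made up for illustration, and because a tab counts as IFS whitespace, empty fields (consecutive tabs) are collapsed.

```shell
# print_from_x (hypothetical helper): read tab-separated lines on stdin and
# print the field equal to "x" plus the two fields that follow it.
print_from_x() {
    local -a f
    local i
    # "|| [ ... ]" also handles a last line without a trailing newline
    while IFS=$'\t' read -r -a f || [ "${#f[@]}" -gt 0 ]; do
        for i in "${!f[@]}"; do
            if [ "${f[i]}" = "x" ]; then
                printf '%s\t%s\t%s\n' "${f[i]}" "${f[i+1]}" "${f[i+2]}"
                break   # only the first matching field per line
            fi
        done
    done
}

printf 'ABC\tDEF\tx\t123\t456\nGHI\tx\t678\t910\n' | print_from_x
```

Lines without an "x" field print nothing.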
Example Use/Output
Using your sample data in a heredoc, you could do:
$ while read line || [ -n "$line" ]; do
> echo "x${line##*x}"
> done << 'eof'
> ABC DEF x 123 456
> GHI x 678 910
> eof
x 123 456
x 678 910
You can just select-copy/middle-mouse-paste the following in your xterm to test:
while read line || [ -n "$line" ]; do
echo "x${line##*x}"
done << 'eof'
ABC DEF x 123 456
GHI x 678 910
eof
Using grep -o For Simplicity
Your other option is to use grep -o, where the -o option prints only the part of the line that matches the expression you provide, so
grep -o 'x.*$' file
Is another simple option, e.g.
$ grep -o 'x.*$' << 'eof'
> ABC DEF x 123 456
> GHI x 678 910
> eof
x 123 456
x 678 910
Let me know if you have any further questions.
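One difference worth checking against your own data: grep -o 'x.*$' starts at the first x on the line, while the ${line##*x} expansion cuts at the last one. A small sketch of the contrast (the helper names from_first_x and from_last_x are hypothetical, and the input line is invented to contain two x characters):

```shell
# grep -o: the match begins at the leftmost x on the line
from_first_x() { grep -o 'x.*$'; }

# bash ##: the longest match of *x is stripped, i.e. up to the last x
from_last_x() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        echo "x${line##*x}"
    done
}

printf 'box\tx\t1\t2\n' | from_first_x   # starts at the x inside "box"
printf 'box\tx\t1\t2\n' | from_last_x    # starts at the standalone x
```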

In case you need to match only tab separated field x:
pcregrep -o '(^|\t)\Kx(\t|$).*' file
awk 'n=match($0,/(^|\t)x(\t|$)/) {$0=substr($0,n); sub(/^\t/,""); print}' file
To print only the two following fields:
pcregrep -o '(^|\t)\Kx(\t[^\t]*){2}' file
awk 'n=match($0,/(^|\t)x\t[^\t]*\t[^\t]*/) {$0=substr($0,n,RLENGTH); sub(/^\t/,""); print}' file

Could you please try following, written and tested with shown samples in GNU awk.
awk '
match($0,/[[:space:]]+x[[:space:]]+[0-9]+[[:space:]]+[0-9]+$/){
val=substr($0,RSTART,RLENGTH)
sub(/^[[:space:]]+/,"",val)
print val
}
' Input_file
OR to match more than 1 set of digits after x with spaces try following.
awk '
match($0,/[[:space:]]+x[[:space:]]+([0-9]+[[:space:]]+){1,}[0-9]+/){
val=substr($0,RSTART,RLENGTH)
sub(/^[[:space:]]+/,"",val)
print val
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[[:space:]]+x[[:space:]]+[0-9]+[[:space:]]+[0-9]+$/){ ##Using match function to match regex here.
val=substr($0,RSTART,RLENGTH) ##Creating val which has sub string of matched regex(previous step) length.
sub(/^[[:space:]]+/,"",val) ##Substituting initial space with NULL in val here.
print val ##Printing val here.
}
' Input_file ##mentioning Input_file name here.

I need to match x and print x plus the following two fields:
Using awk without any regex:
awk 'BEGIN {FS=OFS="\t"} {for (i=1; i<=NF; ++i) if ($i == "x") {
print $i, $(i+1), $(i+2); break}}' file
x 123 456
x 678 910
Or, using gnu sed:
sed -E 's/(^|.*\t)(x(\t[^\t]+){2}).*/\2/' file
x 123 456
x 678 910
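Since the title asks for the next n fields, the awk loop generalizes naturally. A minimal sketch, assuming tab-separated input on stdin; the function name print_next_n and the awk variables key and n are introduced here for illustration only:

```shell
# print_next_n KEY N (hypothetical helper): print the first field equal to
# KEY plus up to N following fields, tab-separated; other lines are skipped.
print_next_n() {
    awk -v key="$1" -v n="$2" 'BEGIN { FS = OFS = "\t" }
    {
        for (i = 1; i <= NF; i++) {
            if ($i == key) {
                out = $i
                # stop at NF so short lines do not gain empty fields
                for (j = i + 1; j <= i + n && j <= NF; j++)
                    out = out OFS $j
                print out
                break
            }
        }
    }'
}

printf 'ABC\tDEF\tx\t123\t456\nGHI\tx\t678\t910\n' | print_next_n x 2
```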

If you want to remove everything before the "x", you can run a sed command like this:
sed 's/^.*x/x/' file.txt
On each line it finds the text matching the pattern ^.*x and replaces it with x. (A g flag would add nothing, since a pattern anchored with ^ can match only once per line.)
Breakdown of ^.*x:
^ means beginning of a line
.* matches any sequence of characters, including none; it is greedy, so it extends to the last x on the line
x the character "x"
Hence it replaces everything up to and including the last "x" on each line with just "x".
For more info on sed's find and replace command, see https://www.cyberciti.biz/faq/how-to-use-sed-to-find-and-replace-text-in-files-in-linux-unix-shell/.
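To see what the greedy .* does when a line contains more than one x, here is a small sketch. The function names and the sample line are invented for illustration; the second command swaps in [^x]* for .* so that the match stops at the first x instead.

```shell
# Greedy: ^.*x runs through the last x on the line
to_last_x()  { sed 's/^.*x/x/'; }

# [^x]* cannot cross an x, so the match ends at the first x
to_first_x() { sed 's/^[^x]*x/x/'; }

printf 'ab x1 x2\n' | to_last_x    # prints: x2
printf 'ab x1 x2\n' | to_first_x   # prints: x1 x2
```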

Related

Extract String before bracket and create new line

I have data in below format
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW [123457, 89053]
FAFDJ-ER 1234 MNO [6532, 789, 234578]
I want to create the data in below format using sed or awk.
ABC-ERW 12344 ZYX 12345
FFANKN 2345 QW 123457
FFANKN 2345 QW 89053
FAFDJ-ER 1234 MNO 6532
FAFDJ-ER 1234 MNO 789
FAFDJ-ER 1234 MNO 234578
I can extract the data before bracket but I don't know how to concatenate the same with data from bracket repeatedly.
My Effort :--
#!/bin/bash
while IFS= read -r line
do
echo "$line"
cnt=`echo $line | grep -o "\[" | wc -l`
if [ $cnt -gt 0 ]
then
startstr=`echo $line | awk -F[ '{print $1}'`
echo $startstr
intrstr=`echo $line | cut -d "[" -f2 | cut -d "]" -f1`
echo $intrstr
else
echo "$line" >> newfile.txt
fi
done < 1.txt
I am able to get the first part and also write the rows that have no "[" to a new file, but I don't know how to take the values inside the brackets and append each one to the first part, since the number of values inside the brackets varies from line to line.
Regards
With your shown samples, please try the following awk code.
awk '
match($0,/\[[^]]*\]$/){
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ")
for(i=1;i<=num;i++){
print substr($0,1,RSTART-1) arr[i]
}
next
}
1
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
match($0,/\[[^]]*\]$/){ ##Using match function to match from [ till ] at the end of line.
num=split(substr($0,RSTART+1,RLENGTH-2),arr,", ") ##Splitting matched values by regex above and passing into array named arr with delimiters comma and space.
for(i=1;i<=num;i++){ ##Running for loop till value of num.
print substr($0,1,RSTART-1) arr[i] ##printing sub string before matched along with element of arr with index of i.
}
next ##next will skip all further statements from here.
}
1 ##1 will print current line.
' Input_file ##Mentioning Input_file name here.
Suggesting simple awk script:
awk 'NR==1{print}{for (i=2;i<NF;i++)print $1, $i}' FS="( \\\[)|(, )|(\\\]$)" input.1.txt
Explanation:
FS="( \\\[)|(, )|(\\\]$)" Set the awk field separator to any of: a space followed by [, a comma followed by a space, or ] at the end of the line.
This makes the interesting fields $2 ---> $(NF-1), each of which is appended to $1.
NR==1{print} print first line only as it is.
{for (i=2;i<NF;i++)print $1, $i} for 2nd line on, print: field $1 appended by current field.
This might work for you (GNU sed):
sed -E '/(.*)\[([^,]*), /{s//\1\2\n\1[/;P;D};s/[][]//g' file
Match the string up to the opening square bracket and also the string between that bracket and the next comma and space.
Replace the entire match with the leading and trailing matched strings, followed by a newline, the leading matched string again and an opening bracket.
Print/delete the first line and repeat.
The last repetition will fail to match because there is no trailing comma and space, in which case the opening and closing square brackets are removed instead.
Alternative:
sed -E ':a;s/([^\n]*)\[([^,]*), /\1\2\n\1[/;ta;s/[][]//g' file
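For comparison with the shell loop attempted in the question, the same expansion can be sketched in pure bash with parameter expansion and read -a. This is only an illustration under the assumptions that the bracketed list is the last thing on the line and that items are separated by a comma and a space; the function name expand_brackets is made up.

```shell
# expand_brackets (hypothetical helper): for lines ending in "[a, b, c]",
# print the text before the bracket once per list item; pass other lines through.
expand_brackets() {
    local line prefix list item
    local -a items
    while IFS= read -r line || [ -n "$line" ]; do
        if [[ $line == *'['*']' ]]; then
            prefix=${line%%\[*}     # everything before the first [
            list=${line#*\[}        # everything after the first [
            list=${list%\]}         # drop the closing ]
            IFS=',' read -r -a items <<< "$list"
            for item in "${items[@]}"; do
                printf '%s%s\n' "$prefix" "${item# }"   # trim the space after each comma
            done
        else
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' 'ABC-ERW 12344 ZYX 12345' 'FFANKN 2345 QW [123457, 89053]' | expand_brackets
```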

Appending a string to all elements of cells in a column using awk or bash

I have the following text file:
$ cat file.txt
# file;GYPA;Boston
Josh 81-62 20
Mike 72-27 1;42;53
Allie 71-27 24;12
I would like to add GYPA to every element of the third column in the following manner:
GYPA:20
GYPA:1;GYPA:42;GYPA:53
GYPA:24;GYPA:12
so far, I have
cat combine.awk
NR==1 {
FS=";"; Add=$2
}
{
FS="\t"; split($3,a,";");
for (i in a) {
print Add":"a[i]
}
}
the array part did not work.
Assuming there's no backreference (e.g. &) or escape chars in the prefix string you want to add:
$ awk -F';' 'NR==1{add=$2":"; FS=" "; next} {gsub(/(^|;)/,"&"add,$3); print $3}' file
GYPA:20
GYPA:1;GYPA:42;GYPA:53
GYPA:24;GYPA:12
You could do it like this:
#!/usr/bin/awk -f
NR == 1 {
# Get the replacement string from the first line
split($0, h, ";");
add = h[2]
next
}
{
# split the third field by ';' into the array 'a'
# n contains the number of elements in 'a'
n=split($3,a,";");
for(i=1;i<=n;i++){
# print every element of a, separated by ';' as in the requested output
printf "%s%s:%s", (i-1)?";":"", add, a[i];
}
# finish the line by printing the ORS
print ""
}
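To try this logic without saving a script file, it can be wrapped in a shell function and fed the sample data on stdin. This is a sketch rather than the answer's exact script: the function name prefix_third_column is hypothetical, the line is built in a variable before printing, and elements are joined with ";" to match the output requested in the question.

```shell
# prefix_third_column (hypothetical helper): take the prefix from the second
# ';'-separated item of the first line, then prefix every element of column 3.
prefix_third_column() {
    awk '
    NR == 1 { split($0, h, ";"); add = h[2]; next }
    {
        n = split($3, a, ";")
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? ";" : "") add ":" a[i]
        print out
    }'
}

printf '# file;GYPA;Boston\nJosh\t81-62\t20\nMike\t72-27\t1;42;53\nAllie\t71-27\t24;12\n' |
    prefix_third_column
```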
My mistake
The clarification (maybe obvious to some) that GYPA is not to be hardcoded in the script, but has to be obtained from the first line starting with #, came in a comment; I did not see it, hence my answer in the following is wrong.
Actual (wrong) answer
Why not sed?
< file.txt sed -n '/^#/!{s/^[^ ]* *[^ ]* */GYPA:/;s/;/;GYPA:/g;p}'
Written like that it is a bit unreadable, so maybe this is better:
< file.txt sed -n ' # -n inhibits the automatic printing
/^#/!{ # only for lines not starting with #
s/^[^ ]* *[^ ]* */GYPA:/ # change the first two columns, space included to GYPA:
s/;/;GYPA:/g # add a GYPA: after each semicolon
p # print the resulting line
}'
Actually I'm maybe too addicted to the -n option, and I should cure myself, as not using it (or any other option besides -f) allows you to put everything in an executable file that is run as a sed script via its shebang line:
#!/usr/bin/sed -f
/^#/d
s/^[^ ]* *[^ ]* */GYPA:/
s/;/;GYPA:/g
which you can use like this:
< file.txt ./thefileabove

Searching a string and replacing another string above the searched string

I have a file with the lines below
123
456
123
789
abc
efg
xyz
I need to search for abc and replace the nearest 123 above it with 111. abc occurs only once in the file, but 123 can occur multiple times and can be at any position above abc.
Please help me.
I have tried with below sed command
sed -i.bak "/abc/!{x;1!p;d;};x;s/123/1111" filename
The above command replaces 123 only if it is directly above abc; if 123 is two lines above abc, the replacement fails.
There's more than one way to do it. Here's one:
sed -i.bak '1{h;d;};/123/{x;p;d;};/abc/{x;s/123/111/;p;d;};H;${x;p;};d' filename
ed comes in handy for complex editing of files in scripts:
ed -s file <<EOF
/^abc$/;?^123$?;.c
111
.
w
EOF
This: Sets the current line to the first one matching abc (/^abc$/;). Then changes the first line before that point that matches 123 to 111 (?XXX? searches backwards for a matching regular expression, and ?^123$?;. selects that single line for c to change) and finally saves the modified file.
This is a classic case where you keep track of your previous line and change it depending on conditions satisfied by the current line. Generally, such an awk program looks like this:
awk '(FNR==1){prev=$0; next}
(condition_on_$0) { action_on_prev }
{ print prev; prev = $0 }
END { print $0 }'
So in the case of the OP, this would read (note that it only changes 123 when it is the line directly above abc):
awk '(FNR==1){prev=$0; next}
$0 == "abc" { if (prev == "123") prev = "111" }
{ print prev; prev = $0 }
END { print $0 }'
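When 123 may sit several lines above abc, one option is to read the file twice: the first pass records the line number of the last 123 seen before abc, the second pass rewrites that line. A minimal sketch, assuming the input is a regular file (it is opened twice, so a pipe will not work) and that lines match exactly; the function name replace_above and the variable names are made up for illustration.

```shell
# replace_above FILE (hypothetical helper): replace the nearest "123" line
# above the "abc" line with "111"; everything else is printed unchanged.
replace_above() {
    awk '
    NR == FNR {                                   # first pass over the file
        if ($0 == "123" && !seen_abc) last = FNR  # remember latest 123 before abc
        if ($0 == "abc") seen_abc = 1
        next
    }
    FNR == last && seen_abc { $0 = "111" }        # second pass: rewrite that line
    { print }
    ' "$1" "$1"
}

tmp=$(mktemp)
printf '123\n456\n123\n789\nabc\nefg\nxyz\n' > "$tmp"
replace_above "$tmp"
rm -f "$tmp"
```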
This might work for you (GNU sed):
sed -Ez 's/(.*)(\n123.*\nabc)/\1\n111\2/' file
This slurps the file into memory and inserts 111 in front of the last occurrence of 123 before abc.
A less memory intensive solution:
sed -E '/^123$/{:a;N;/\n123$/{h;s///p;g;s/.*\n//;ba};/\nabc$/!ba;s/^/111\n/}' file
This gathers up lines following a line containing 123. If another line containing 123 is encountered it offloads all lines before it and begins gathering lines again. If it finds a line containing abc it inserts 111 at the front of the lines gathered so far.
Another alternative:
sed '/abc/{x;/./{s/^/111\n/p;z};x;b};/123/{x;/./p;x;h;$!d;b};x;/./{x;H;$!d};x' file
$ tac file | awk 'f && sub(/123/,"111"){f=0} /abc/{f=1} 1' | tac
123
456
111
789
abc
efg
xyz

Print next line of given string using unix shell scripting

I have the following text file:
File: Test.txt
Number of Cars:
10
Number of Bikes:
20
Number of Cycles:
10
Note: I now want to get the line that follows "Number of Bikes:", that is 20, from the file.
My Try: 1
sed -n '/Number of Bikes:/{N;p;}' Test.txt
Output:
Number of Bikes:
20
My Try: 2
awk '/Number of Bikes/{_=2}_&&_--' Test.txt
Output:
Number of Bikes:
20
Expected Output:
20
If you only need to search for Bikes you can grep the line and include the following line, pipe it to tail and get the last line, and remove any leading blank space.
grep -A 1 "^Number of Bikes:$" file1|tail -1|sed -e 's/ *//'
Here's a way you could do it using awk:
awk 'f && f-- { print $1 } /Number of Bikes:/ { f = 1 }' file
The flag f is set when the heading is matched. The first field of the next line is printed, because this is the only line where f is true. f-- means that f will be set back to 0 for all subsequent lines.
Perhaps slightly better in this case is to simply exit after printing one line:
awk 'f { print $1; exit } /Number of Bikes:/ { f = 1 }' file
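The sed attempt in the question prints both lines because N appends the next line to the pattern space. Lowercase n replaces the pattern space with the next line instead, so only that line is printed. A minimal sketch; the function name next_line_after is hypothetical, and the pattern is inserted into the sed script as-is, so it is assumed to contain no "/" characters:

```shell
# next_line_after PATTERN (hypothetical helper): print the line that follows
# the first line matching PATTERN, then stop reading.
next_line_after() {
    sed -n "/$1/{n;p;q;}"
}

printf 'Number of Cars:\n10\nNumber of Bikes:\n20\nNumber of Cycles:\n10\n' |
    next_line_after 'Number of Bikes:'    # prints: 20
```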

Kind of transpose needed for a file with an inconsistent number of columns in each row

I have a tab delimited file (in which number of columns in each row is not fixed) which looks like this:
chr1 92536437 92537640 NM_024813 NM_053274
I want to produce a file from this in the following layout (the first three columns are identifiers that I need to keep on each row when splitting):
chr1 92536437 92537640 NM_024813
chr1 92536437 92537640 NM_053274
Suggestions for a shell script.
#!/bin/bash
{
IFS=$'\t'
while read -r a b c rest
do
for fld in $rest
do
echo -e "$a\t$b\t$c\t$fld"
done
done
}
Note that IFS must be a real tab; $'\t' is bash's ANSI-C quoting for one.
I also thought I should do a perl version:
#!/bin/perl -n
($a,$b,$c,@r)=(chomp and split /\t/); print "$a\t$b\t$c\t$_\n" for @r
To do it all from the commandline, reading from in.txt and outputting to out.txt:
perl -ne '($a,$b,$c,@r)=(chomp and split /\t/); print "$a\t$b\t$c\t$_\n" for @r' in.txt > out.txt
Of course if you save the perl script (say as script.pl)
perl script.pl in.txt > out.txt
If you also make the script file executable (chmod +x script.pl):
./script.pl in.txt > out.txt
HTH
Not shell, and the other answer is perfectly fine, but i onelined it in perl :
perl -F'/\s/' -lane '$,="\t"; print @F,$_ for splice @F,3' $FILE
Edit: New (even more unreadable ;) version, inspired by the other answers. Abusing perl's command line parameters and special variables for autosplitting and line ending handling.
Means: For each of the fields after the three first (for splice @F,3), print the first three and it (print @F,$_).
-F sets the field separator to \s (should be \t) for -a autosplitting into @F.
-l turns on line ending handling for -n which runs the -e code for each line of the input.
$, is the output field separator.
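The same idea can be sketched in awk, which some may find easier to read than the Perl one-liner. It assumes tab-separated input on stdin with at least three columns; the function name expand_columns is made up for illustration.

```shell
# expand_columns (hypothetical helper): repeat the first three tab-separated
# columns once for every remaining column on the line.
expand_columns() {
    awk 'BEGIN { FS = OFS = "\t" }
    { for (i = 4; i <= NF; i++) print $1, $2, $3, $i }'
}

printf 'chr1\t92536437\t92537640\tNM_024813\tNM_053274\n' | expand_columns
```

Lines with only three columns produce no output.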
[Edited]
So you want to duplicate the first three columns for each remaining item?
$ cat File | while read X
do PRE=$(echo "$X" | cut -f1-3 -d ' ')
for Y in $(echo "$X" | cut -f4- -d ' ')
do echo $PRE $Y >> OutputFilename
done
done
Returns:
chr 786 789 NM
chr 786 789 NR
chr 786 789 NT
chr 123 345 NR
This cuts the first three space delimited columns as a prefix, and then abuses the fact that a for loop will step through a space delimited list to call echo.
Enjoy.
This is just a subset of your data comparison in two files question.
Extracting my slightly hacky solution from there:
for i in 4 5 6 7; do join -e _ -j $i f f -o 1.1,1.2,1.3,0; done | sed '/_$/d'

Resources