Extract string between two patterns (inclusive) while conserving the format - bash

I have a file in the following format
cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACK
id3,PPLRTOMcccccccccccJACK
I am trying to identify and print the string between TOM and JACK including these two strings, while maintaining the first column FS=,
Desired output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
So far I have tried gsub:
awk -F"," 'gsub(/.*TOM|JACK.*/,"",$2) && !_[$0]++' test.txt > out.txt
and have the following output
id1 aaaaaaaaaaa
id2 bbbbbbbbbbb
id3 ccccccccccc
As you can see I am getting close but not able to include TOM and JACK patterns in my output. Plus I am also losing the original FS. What am I doing wrong?
Any help will be appreciated.

You are changing a field ($2) which causes awk to reconstruct the record using the value of OFS as the field separator and so in this case changing the commas to spaces.
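The effect described above can be reproduced with a no-op assignment. This is a minimal sketch on one line of the sample data; the variable names default_ofs and comma_ofs are made up for the illustration.

```shell
line='id1,PPLLTOMaaaaaaaaaaaJACK'

# Assigning to any field (even $2 = $2) makes awk rebuild $0 with OFS,
# which is a single space unless you set it.
default_ofs=$(printf '%s\n' "$line" | awk -F, '{ $2 = $2 } 1')

# Setting OFS to the same value as FS keeps the commas.
comma_ofs=$(printf '%s\n' "$line" | awk 'BEGIN { FS = OFS = "," } { $2 = $2 } 1')

printf '%s\n%s\n' "$default_ofs" "$comma_ofs"
```

The first line printed has a space where the comma was; the second keeps the comma.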
Avoid _ as a variable name: a name with no meaning is only slightly better than a name with the wrong meaning. Pick a name that says what the variable holds, which in this case would be seen, though it is not clear what the !_[$0]++ de-duplication is meant to achieve in this context.
gsub() and sub() do not support capture groups so you either need to use match()+substr():
$ awk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/){$2=substr($2,RSTART,RLENGTH)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or use GNU awk for the 3rd arg to match()
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
or for gensub():
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
The main difference between the match() and gensub() solutions is how they would behave if TOM appeared twice on the line:
$ cat file
id1,PPLLfooTOMbarTOMaaaaaaaaaaaJACK
id2,PPLRTOMbbbbbbbbbbbJACKfooJACKbar
id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar
$
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=a[0]} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMbarTOMcccccccccccJACKfooJACK
$
$ gawk 'BEGIN{FS=OFS=","} {$2=gensub(/.*(TOM.*JACK).*/,"\\1","",$2)} 1' file
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACKfooJACK
id3,TOMcccccccccccJACKfooJACK
and just to show one way of stopping at the first instead of the last JACK on the line:
$ gawk 'BEGIN{FS=OFS=","} match($2,/TOM.*JACK/,a){$2=gensub(/(JACK).*/,"\\1","",a[0])} 1' file
id1,TOMbarTOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMbarTOMcccccccccccJACK
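If GNU awk is not available, the same "first TOM to first JACK" result can probably be had with plain index() and substr(). This is only a sketch under that assumption; the function name first_tom_to_first_jack is hypothetical and not part of the answer above.

```shell
# Hypothetical helper: keep field 2 from the first TOM up to the first JACK after it.
first_tom_to_first_jack() {
  awk 'BEGIN { FS = OFS = "," }
  {
    start = index($2, "TOM")
    if (start) {
      rest = substr($2, start)                 # begins with TOM
      stop = index(substr(rest, 4), "JACK")    # search after the leading TOM
      if (stop) $2 = substr(rest, 1, 3 + stop + 3)
    }
    print
  }' "$@"
}

result=$(printf '%s\n' 'id3,PPLRfooTOMbarTOMcccccccccccJACKfooJACKbar' | first_tom_to_first_jack)
printf '%s\n' "$result"
```

Lines without both TOM and JACK are printed unchanged, which matches the behavior of the match() versions.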

Use capture groups to save the parts of the line you want to keep. Here's how to do it with sed
sed 's/^\([^,]*,\).*\(TOM.*JACK\).*/\1\2/' <test.txt > out.txt

Do you mean to do the following?
$ cat test.txt
id1,PPLLTOMaaaaaaaaaaaJACKABCD
id2,PPLRTOMbbbbbbbbbbbJACKDFCC
id3,PPLRTOMcccccccccccJACKSDER
$ cat test.txt | sed -e 's/,.*TOM/,TOM/g' | sed -e 's/JACK.*/JACK/g'
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK
$
This should work as long as the TOM and JACK do not repeat themselves.

sed 's/\(.*,\).*\(TOM.*JACK\).*/\1\2/' <oldfile >newfile
Output:
id1,TOMaaaaaaaaaaaJACK
id2,TOMbbbbbbbbbbbJACK
id3,TOMcccccccccccJACK

Related

Compare 2 csv files and delete rows - Shell

I have a 2 csv files. One has several columns, the other is just one column with domains. Simplified data of these files would be
file1.csv:
John,example.org,MyCompany,Australia
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
file2.csv:
example.org
google.es
mysite.uk
The output should be
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
I have tried this solution
grep -v -f file2.csv file1.csv >output-file
Found here
http://www.unix.com/shell-programming-and-scripting/177207-removing-duplicate-records-comparing-2-csv-files.html
But since there is no explanation whatsoever about how the script works, and I suck at shell, I cannot tweak it to make it work for me
A solution for this would be highly appreciated, a solution with some explanation would be awesome! :)
EDIT:
I have tried the line that was supposed to work, but for some reason it does not. Here is the output from my terminal. What's wrong with this?
Desktop $ cat file1.csv ; echo
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Desktop $ cat file2.csv ; echo
example.org
google.es
mysite.uk
Desktop $ grep -v -f file2.csv file1.csv
John,example.org,MyCompany,Australia
Lenny ,domain.com,OtherCompany,US
Martha,mysite.com,ThirCompany,US
Why doesn't grep remove the line
John,example.org,MyCompany,Australia
The line you posted works just fine.
$ grep -v -f file2.csv file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
And here's an explanation. grep will search for a given pattern in a given file and print all lines that match. The simplest example of usage is:
$ grep John file1.csv
John,example.org,MyCompany,Australia
Here we used a simple pattern that is matched literally, character by character, but you can also use regular expressions (basic, extended, and even Perl-compatible ones).
To invert the logic, and print only the lines that do not match, we use the -v switch, like this:
$ grep -v John file1.csv
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
To specify more than one pattern, you can use the option -e pattern multiple times, like this:
$ grep -v -e John -e Lenny file1.csv
Martha,site.com,ThirdCompany,US
However, if there is a larger number of patterns to check for, we might use the -f file option that will read all patterns from a file specified.
So, when we combine the two, reading patterns from a file with -f and inverting the matching logic with -v, we get the line you need.
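One possible reason the same command removes nothing on another machine is that the pattern file has Windows (CRLF) line endings. That is an assumption; the question does not show the file's line endings. The sketch below simulates it with made-up temporary files, strips the carriage returns with tr, and uses -F so that the dots in the domains are treated as literal characters rather than regex wildcards.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'John,example.org,MyCompany,Australia' \
              'Lenny,domain.com,OtherCompany,US' \
              'Martha,site.com,ThirdCompany,US' > "$workdir/file1.csv"
# Simulate a pattern file saved with CRLF line endings (assumed cause).
printf 'example.org\r\ngoogle.es\r\nmysite.uk\r\n' > "$workdir/file2.csv"

# Every pattern ends in a stray \r, so none matches and no line is removed.
unfiltered=$(grep -v -F -f "$workdir/file2.csv" "$workdir/file1.csv")

# Strip the carriage returns first, then filter with fixed-string patterns.
tr -d '\r' < "$workdir/file2.csv" > "$workdir/patterns.txt"
filtered=$(grep -v -F -f "$workdir/patterns.txt" "$workdir/file1.csv")

printf '%s\n' "$filtered"
rm -rf "$workdir"
```

Running cat -A (GNU) or od -c on the pattern file is a quick way to check whether this applies to your data.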
One in awk:
$ awk -F, 'NR==FNR{a[$1];next}($2 in a==0)' file2 file1
Lenny,domain.com,OtherCompany,US
Martha,site.com,ThirdCompany,US
Explained:
$ awk -F, ' # using awk, comma-separated records
NR==FNR { # process the first file, file2
a[$1] # hash the domain to a
next # proceed to next record
}
($2 in a==0) # process file1, if domain in $2 not in a, print the record
' file2 file1 # file order is important

"grep" a csv file including multi-lines fields?

file.csv:
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
I want to grep the "XA100" entry like this:
grep XA100 file.csv
to obtain this result:
XA100;"this is
the multi-line"
but grep returns only one line:
XA100;"this is
file.csv contains 3 entries.
The "XA100" entry contains a multi-line field.
grep does not seem to be the right tool for a CSV file with multi-line fields.
Do you know a way to do this?
Edit: the real-world file has many columns. The search term can be in any column (not necessarily at the beginning of the line or of a field). All fields are enclosed in ". Any field can span multiple lines, and how many cannot be predicted.
Give this line a try:
awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' file
I extended your example a bit:
kent$ cat f
XA90;"standard"
XA100;"this is
the
multi-
line"
XA110;"other standard"
kent$ awk '/^XA100;/{p=1}p;p&&/"$/{p=0}' f
XA100;"this is
the
multi-
line"
In the comments you mention that in the real-world file each line starts with ". I assume each also ends with ", and offer this:
Test file:
$ cat file
"single line"
"multi-
lined"
Code and outputs:
$ awk 'BEGIN{RS=ORS="\"\n"} /single/' file
"single line"
$ awk 'BEGIN{RS=ORS="\"\n"} /m/' file
"multi-
lined"
You can also parametrize the search:
$ awk -v s="multi" 'BEGIN{RS=ORS="\"\n"} match($0,s)' file
"multi-
lined"
try:
Solution 1:
awk -v RS="XA" 'NR==3{sub(/\n$/,"");print RS $0}' Input_file
This sets the record separator to the string XA, picks the 3rd record (the empty text before the first XA counts as record 1), removes the newline left at the end of the record, and prints the record separator followed by the record. A multi-character RS needs an awk that supports it, such as GNU awk.
Solution 2:
awk '/XA100/{print; while((getline) > 0 && $0 !~ /^XA/) print}' Input_file
This looks for the string XA100, prints that line, then keeps reading with getline and printing until a line starts with XA. Checking the return value of getline stops the loop at end of file instead of spinning forever when XA100 is the last entry.
If this file was exported from MS-Excel or similar then lines end with \r\n while the newlines inside quotes are just \ns so then all you need is:
$ awk -v RS='\r\n' '/XA100/' file
XA100;"this is
the multi-line"
The above uses GNU awk for multi-char RS. On some platforms, e.g. cygwin, you'll have to add -v BINMODE=3 so gawk sees the \rs rather than them getting stripped by underlying C primitives.
Otherwise, it's extremely hard to parse CSV files in general without a real CSV parser (which awk currently doesn't have but is in the works for GNU awk) but you could do this (again with GNU awk for multi-char RS):
$ cat file
XA90;"standard"
XA100;"this is
the multi-line"
XA110;"other standard"
$ awk -v RS="\"[^\"]*\"" -v ORS= '{gsub(/\n/," ",RT); print $0 RT}' file
XA90;"standard"
XA100;"this is the multi-line"
XA110;"other standard"
to replace all newlines within quotes with blank chars and then process it as regular 1-line-per-record file.
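For an awk without multi-char RS or RT, a similar effect can be sketched by counting double quotes: a physical line that leaves the running count odd is still inside a quoted field, so it is joined to the next one. This is an illustration rather than a real CSV parser; the function name join_quoted_lines is hypothetical, and it assumes quotes inside fields are escaped by doubling them ("").

```shell
# Hypothetical helper: join physical lines until the double quotes balance.
join_quoted_lines() {
  awk '{
    copy = $0
    quotes += gsub(/"/, "", copy)   # gsub returns the number of quotes on this line
    record = record sep $0
    sep = " "                       # joined lines are separated by one blank
    if (quotes % 2 == 0) {          # balanced: the logical record is complete
      print record
      record = ""; sep = ""; quotes = 0
    }
  }' "$@"
}

match_line=$(printf '%s\n' 'XA90;"standard"' 'XA100;"this is' 'the multi-line"' \
                           'XA110;"other standard"' | join_quoted_lines | grep XA100)
printf '%s\n' "$match_line"
```

Once every entry is on one line, an ordinary grep for the search term works in any column.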
Using PS response, this works for the small example:
sed 's/^X/\n&/' file.csv | awk -v RS= '/XA100/ {print}'
For my real-world CSV file (many columns, the search term anywhere, an unknown number of continuation lines, " characters escaped as "", continuation lines that may begin with ", and all fields enclosed in "), this works. Note the exclusion of a second " character in the sed part:
sed 's/^"[^"]/\n&/' file.csv | awk -v RS= '/RESEARCH_TERM/ {print}'
This works because the first column of an entry cannot start with "". The first column always looks like "XXXXXXXXX", where X is any character but ".
Thank you all for so many responses; the other solutions may also work, depending on the CSV format you use.

Need to capture particular output

This is the exact output I got from a program:
#Meaningless output
[TABL]
BSSID
4c:e6:78:e3:4e:58
a0:8b:16:e3:3a:42
ADMAC=a1:3c:24:e5:2e:22
ADMAC=.......
#Meaningless output
I just want to capture the BSSID column along with its MAC addresses ONLY, and not the ADMAC values or any other values. How can I do that using bash (or grep, sed, awk, anything)? Thanks.
awk to the rescue!
$ awk '/BSSID/{p=1} p&&!NF{exit} p' file
BSSID
4c:e6:78:e3:4e:58
a0:8b:16:e3:3a:42
prints after the pattern match until an empty line.
Or, simpler but gets you the empty line at the end.
$ awk '/BSSID/,/^$/' file
BSSID
4c:e6:78:e3:4e:58
a0:8b:16:e3:3a:42
<- empty line here ...
to filter the last empty line, you can add a condition
$ awk '/BSSID/,/^$/{if(NF) print}' file
note that the first alternative is the most flexible and the preferred one.
Try this. It worked on Mac using your example.
cat output.txt | awk '/BSSID/,/ADMAC/'| grep -v ADMAC
Tell grep to show the two lines after the match and stop after 1 match.
grep -m1 -A2 "^BSSID$" output.txt
sed to the rescue!
Since the requirement is to include only the MAC addresses, which must include a colon, period, or dash, the following would be reasonable, given the example input:
sed -n '/^BSSID/,/^ *$/ {/[:.-]/p;}'
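A stricter variant, sketched here as an assumption about what "MAC addresses only" should mean, keeps the BSSID header plus lines that consist of exactly six colon-separated hex pairs. The function name extract_bssids is hypothetical.

```shell
# Hypothetical helper: print the BSSID header and well-formed MAC lines after it.
extract_bssids() {
  sed -n '/^BSSID$/,/^ADMAC=/p' "$@" |
    grep -E '^(BSSID|([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})$'
}

bssids=$(printf '%s\n' '#Meaningless output' '[TABL]' 'BSSID' \
                       '4c:e6:78:e3:4e:58' 'a0:8b:16:e3:3a:42' '' \
                       'ADMAC=a1:3c:24:e5:2e:22' '#Meaningless output' | extract_bssids)
printf '%s\n' "$bssids"
```

Because the pattern is anchored at both ends, the ADMAC= line and any blank line inside the range are dropped without a separate grep -v.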
If you have awk try:
awk '/BSSID/,/ADMAC/{if (!/ADMAC/) print}' output.txt

awk command to select exact word in any field

I have input file as
ab,1,3,qqq,bbc
b,445,jj,abc
abcqwe,234,23,123
abc,12,bb,88
uirabc,33,99,66
I have to select the rows which has only 'abc'. And note that abc string can appear in any of the column. Please help me how to achieve this using awk.
Output:
b,445,jj,abc
abc,12,bb,88
You could also use plain grep:
grep -E '(^|,)abc(,|$)' file
Or if you have to use awk
awk '/(^|,)abc(,|$)/' file
Using awk
awk 'gsub(/(^|,)abc(,|$)/,"&")' file
b,445,jj,abc
abc,12,bb,88
Based on beny23's regex.
It looks for abc preceded by either the start of the line (^) or a comma, and followed by either a comma or the end of the line ($).
Another one using beny23 regex:
awk 'NF>1' FS="(^|,)abc(,|$)" infile
Not asked, but if you need to keep only the lines with exactly one occurrence:
$ cat infile
ab,1,3,qqq,bbc
b,445,jj,abc
abcqwe,234,23,123
abc,12,bb,88
abc,12,bb,abc
uirabc,33,99,66
This will be handy:
$ awk 'NF==2' FS="(^|,)abc(,|$)" infile
b,445,jj,abc
abc,12,bb,88
Also possible using Jotne solution:
$ awk 'gsub(/(^|,)abc(,|$)/,"&")==1' infile
Through awk,
$ awk -F, '{for(i=1;i<=NF;i++){if($i=="abc") print $0;}}' file | uniq
b,445,jj,abc
abc,12,bb,88
OR
$ awk -F, '{for(i=1;i<=NF;i++){if($i=="abc") {print; next}}}' file
b,445,jj,abc,abc
abc,12,bb,88
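In the commands above, the field separator is set to , and awk reads the input file line by line. The for loop walks over every field in the line, and if a field is exactly abc the whole line is printed. In the first command a line with abc in two fields is printed twice, which is why it is piped through uniq; in the second, next moves on to the following line after the first match, so such a line (b,445,jj,abc,abc in that run) is printed only once.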
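The same field loop can be wrapped so the word is passed in as a parameter and compared as a plain string, which avoids regex metacharacters in the search term. A minimal sketch; the function name rows_with_field is hypothetical.

```shell
# Hypothetical helper: print rows where any comma-separated field equals the word exactly.
rows_with_field() {
  word=$1; shift
  awk -F, -v word="$word" '{
    for (i = 1; i <= NF; i++)
      if ($i == word) { print; next }   # print once, then go to the next row
  }' "$@"
}

rows=$(printf '%s\n' 'ab,1,3,qqq,bbc' 'b,445,jj,abc' 'abcqwe,234,23,123' \
                     'abc,12,bb,88' 'uirabc,33,99,66' | rows_with_field abc)
printf '%s\n' "$rows"
```

Note that -v processes backslash escapes in the value, so a word containing a backslash would need different handling.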

Add blank column using awk or sed

I have a file with the following structure (comma delimited)
116,1,89458180,17,FFFF,0403254F98
I want to add a blank column on the 4th field such that it becomes
116,1,89458180,,17,FFFF,0403254F98
Any inputs as to how to do this using awk or sed if possible ?
thank you
Assuming that none of the fields contain embedded commas, you can restate the task as replacing the third comma with two commas. This is just:
sed 's/,/,,/3'
With the example line from the file:
$ echo "116,1,89458180,17,FFFF,0403254F98" | sed 's/,/,,/3'
116,1,89458180,,17,FFFF,0403254F98
You can use this awk,
awk -F, '$4="," $4' OFS=, yourfile
(OR)
awk -F, '$4=FS$4' OFS=, yourfile
To add blank columns in front of several fields at once (here the original 1st, 4th and 6th fields):
awk -F, '{$4=FS$4; $1=FS$1; $6=FS$6}1' OFS=, yourfile
Through awk
$ echo '116,1,89458180,17,FFFF,0403254F98' | awk -F, -v OFS="," '{print $1,$2,$3,","$4,$5,$6}'
116,1,89458180,,17,FFFF,0403254F98
It prints an extra , after the third field, so an empty field appears in the fourth position.
Through GNU sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed -r 's/^([^,]*,[^,]*,[^,]*)(.*)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
It captures all the characters up to the third comma into one group. The characters from the third , to the end of the line are stored in another group. In the replacement part, we just add a , between these two captured groups.
Through Basic sed
$ echo 116,1,89458180,17,FFFF,0403254F98| sed 's/^\([^,]*,[^,]*,[^,]*\)\(.*\)$/\1,\2/'
116,1,89458180,,17,FFFF,0403254F98
echo 116,1,89458180,17,FFFF,0403254F98|awk -F',' '{print $1","$2","$3",,"$4","$5","$6}'
Non-awk
t="116,1,89458180,17,FFFF,0403254F98"
echo $(echo $t|cut -d, -f1-3),,$(echo $t|cut -d, -f4-)
You can use the awk command below to achieve that. Replace $3 with whichever column you want the blank inserted before.
awk -F, '{$3="" FS $3;}1' OFS=, filename
sed -e 's/\([^,]*,\)\{3\}/&,/' YourFile
This replaces the first run of 3 fields (each a sequence of non-comma characters followed by a comma) with itself followed by an extra comma.
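The answers above hard-code the position. A generic version can be sketched by shifting fields to the right and blanking the target column; the function name insert_blank_column is hypothetical, and like the other answers it assumes no field contains an embedded comma.

```shell
# Hypothetical helper: insert an empty column at position $1 of comma-separated input.
insert_blank_column() {
  col=$1; shift
  awk -F, -v OFS=, -v col="$col" '{
    for (i = NF; i >= col; i--) $(i + 1) = $i   # shift fields col..NF one place right
    $col = ""                                   # the freed position becomes the blank column
    print
  }' "$@"
}

widened=$(echo '116,1,89458180,17,FFFF,0403254F98' | insert_blank_column 4)
printf '%s\n' "$widened"
```

Setting OFS to a comma matters here for the same reason as in any awk command that assigns to a field: the record is rebuilt with OFS.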
