Condition on Nth character of string in an Mth column in bash - shell

I have a sample
$ cat c.csv
a,1234543,c
b,1231456,d
c,1230654,e
I need to grep only the lines where the 4th character of the 2nd column is not 0 or 1.
Output must be
a,1234543,c
All I know is this:
awk -F, 'BEGIN { OFS = FS } $2 ~/^[2-9]/' c.csv
Is it possible to put a condition on the 4th character?

Could you please try the following.
awk 'BEGIN{FS=","} substr($2,4,1)!=0 && substr($2,4,1)!=1' Input_file
OR as per Ed's suggestion:
awk 'BEGIN{FS=","} substr($2,4,1)!~/[01]/' Input_file
Explanation: Adding a detailed explanation for above code here.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS="," ##Setting field separator as comma here.
} ##Closing BLOCK for this program BEGIN section.
substr($2,4,1)!=0 && substr($2,4,1)!=1 ##Checking condition: if the 4th character of the 2nd field is neither 0 nor 1 then print the current line.
' Input_file ##Mentioning Input_file name here.
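To sanity-check it against the sample from the question (the file is recreated here; substitute your own Input_file name):

```shell
# recreate the sample c.csv from the question
printf 'a,1234543,c\nb,1231456,d\nc,1230654,e\n' > c.csv

# keep only lines whose 2nd field's 4th character is neither 0 nor 1
awk 'BEGIN{FS=","} substr($2,4,1)!~/[01]/' c.csv
# prints: a,1234543,c
```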

This might work for you (GNU sed or grep):
grep -vE '^([^,]*,){1}[^,]{3}[01]' file
or:
sed -E '/^([^,]*,){1}[^,]{3}[01]/d' file
Replace the 1 with m-1 to select the mth column, and the 3 with n-1 to select the nth character in that column.
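For example, with the sample from the question (m=2, n=4, hence the {1} and {3}), both commands keep only the first row:

```shell
printf 'a,1234543,c\nb,1231456,d\nc,1230654,e\n' > c.csv

# drop rows whose 2nd column has 0 or 1 as its 4th character
grep -vE '^([^,]*,){1}[^,]{3}[01]' c.csv   # prints: a,1234543,c
sed -E '/^([^,]*,){1}[^,]{3}[01]/d' c.csv  # prints: a,1234543,c
```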

Grep is the answer, but here is another way using an array and parameter substitution:
test=( $(cat c.csv) ) # load c.csv data to an array
echo ${test[#]//*,???[0-1]*/} # print all items from an array,
# but remove the ones that correspond to this regex *,???[0-1]*
# so 'b,1231456,d' and 'c,1230654,e' from example will be removed
# and only 'a,1234543,c' will be printed

There are many ways to do this with awk. The most literal form would be:
4th character of 2nd column is not 0 or 1
$ awk -F, '($2 !~ /^...[01]/)' file
$ awk -F, '($2 ~ /^...[^01]/)' file
These will also match a line such as a,abcdefg,b.
2nd column is an integer and 4th character is not 0 or 1
$ awk -F, '($2+0==$2) && ($2!~/[.]/) && ($2 !~ /^...[01]/)' file
$ awk -F, '($2 ~ /^[0-9][0-9][0-9][^01][0-9]*$/)' file
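Adding a non-numeric row to the sample shows the difference from the looser forms above (file name assumed):

```shell
# sample with a non-numeric 2nd column in the middle row
printf 'a,1234543,c\nx,abcdefg,b\nc,1230654,e\n' > c2.csv

# both stricter forms keep only the numeric row whose 4th digit is not 0/1
awk -F, '($2+0==$2) && ($2!~/[.]/) && ($2 !~ /^...[01]/)' c2.csv  # prints: a,1234543,c
awk -F, '($2 ~ /^[0-9][0-9][0-9][^01][0-9]*$/)' c2.csv            # prints: a,1234543,c
```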

Related

Regex pattern as variable in AWK

Let's say I have a file with multiple fields, and field 1 needs to be filtered on 2 conditions. I was thinking of turning those conditions into regex patterns and passing them as variables to the awk statement. For some reason, they are not filtering out the records at all. Here is my attempt; it runs fine but doesn't filter the results per the conditions, except when the patterns are fed directly into awk without variable assignment.
regex1="/abc|def/"; # match first field for abc or def;
regex2="/123|567/"; # and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} {if ( ($1~pat1) && ($1~pat2) ) print $0}'
Update: Fixed a syntax error related to missing parenthesis for the if conditions in the awk. (I had it fixed in the code I ran).
Sample data
abc:567 1
egf:888 2
Expected output
abc:567 1
The problem is that I am getting all the results instead of the ones that satisfy the 2 regex for field 1
Note that the match needs to be a substring match rather than an exact match, meaning 567 as defined in the regex pattern should also match 567_1 if present.
It seems like the way to implement what you want to do would be:
awk -F'\t' '
($1 ~ /abc|def/) &&
($1 ~ /123|567/)
' file
or probably more robustly:
awk -F'\t' '
{ split($1,a,/:/) }
(a[1] ~ /abc|def/) &&
(a[2] ~ /123|567/)
' file
What's wrong with that?
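Running the split-based version against the shown samples (recreated here as a tab-separated file; file name assumed):

```shell
# sample from the question; the two columns are tab-separated
printf 'abc:567\t1\negf:888\t2\n' > file.tsv

awk -F'\t' '
{ split($1,a,/:/) }
(a[1] ~ /abc|def/) &&
(a[2] ~ /123|567/)
' file.tsv
# prints only the abc:567 line
```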
EDIT: here is me running the OP's code before and after fixing the inclusion of regexp delimiters (/) in the dynamic regexp strings:
$ cat tst.sh
#!/usr/bin/env bash
regex1="/abc|def/"; #--match first field for abc or def;
regex2="/123|567/"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
echo "###################"
regex1="abc|def"; #--match first field for abc or def;
regex2="123|567"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
$
$ ./tst.sh
###################
abc:567 1
EDIT: Since the OP has changed the samples, adding this solution here; it will also work for partial matches. Again written and tested with the shown samples in GNU awk.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
{
for(i in reg1){
if(index($1,i)){
for(j in reg2){
if(index($2,j)){ print; next }
}
}
}
}
' Input_file
Let's say the following is the Input_file:
cat Input_file
abc_2:567_3 1
egf:888 2
Now after running the above code we will get abc_2:567_3 1 as output.
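The same logic, condensed to one line per block and run against that sample (file name assumed):

```shell
printf 'abc_2:567_3 1\negf:888 2\n' > sample1.txt

awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
  num=split(var1,arr1,"|"); split(var2,arr2,"|")
  for(i=1;i<=num;i++){ reg1[arr1[i]]; reg2[arr2[i]] }
}
{
  for(i in reg1) if(index($1,i)) for(j in reg2) if(index($2,j)){ print; next }
}' sample1.txt
# prints: abc_2:567_3 1
```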
With your shown samples only, could you please try the following. Written and tested in GNU awk. Put the values you want to look for in the 1st field in var1, and those you want to look for in the 2nd field in var2, pipe-delimited.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
($1 in reg1) && ($2 in reg2)
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" ' ##Starting awk program from here.
##Setting field separator as colon or spaces, setting var1 and var2 values here.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var1,arr1,"|") ##Splitting var1 to arr1 here.
split(var2,arr2,"|") ##Splitting var2 to arr2 here.
for(i=1;i<=num;i++){ ##Running for loop from 1 to till value of num here.
reg1[arr1[i]] ##Creating reg1 with index of arr1 value here.
reg2[arr2[i]] ##Creating reg2 with index of arr2 value here.
}
}
($1 in reg1) && ($2 in reg2) ##Checking condition: if the 1st field is present in reg1 AND the 2nd field is present in reg2 then print that line.
' Input_file ##Mentioning Input_file name here.
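With the original exact-match samples it behaves like this (file name assumed):

```shell
printf 'abc:567 1\negf:888 2\n' > sample2.txt

awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
  num=split(var1,arr1,"|"); split(var2,arr2,"|")
  for(i=1;i<=num;i++){ reg1[arr1[i]]; reg2[arr2[i]] }
}
($1 in reg1) && ($2 in reg2)
' sample2.txt
# prints: abc:567 1
```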

Reformatting text file using awk and cut as a one liner

Data:
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 chr1:1243:A:T 1243 T ADD 16283 -6.124 0.543 -1.431 0.3534 -1.123 0.14
Desired output:
MarkerName P-Value
chr1:1243 0.14
The actual file is 1.2G worth of lines like the above
I need to strip everything past the 2nd colon from the 2nd column, then pair this with the final (12th) column and give the two a new header.
I have tried:
awk '{print $2, $12}' | cut -d: -f1-2
but this cuts the whole line at the colons, dropping the P column that I want to keep.
I output this to a new file and then pasted it onto the P-value column using awk, but was wondering if there was a one-liner way of doing this?
Many thanks
My comment in more understandable form:
$ awk '
BEGIN {
print "MarkerName P-Value" # output header
}
NR>1 { # skip the funky first record
split($2,a,/:/) # split by :
printf "%s:%s %s\n",a[1],a[2],$12 # printf allows easier output formatting
}' file
Output:
MarkerName P-Value
chr1:1243 0.14
EDIT: Adding one more solution here, since the OP mentioned my first solution somehow didn't work for them, though it worked fine for me; adding this as an alternative.
awk '
BEGIN{
print "MarkerName P-Value"
}
FNR>1{
match($2,/([^:]*:){2}/)
print OFS substr($2,RSTART,RLENGTH-1),$NF
}
' Input_file
With the shown samples, could you please try the following. You need not use cut with awk; awk can take care of everything by itself.
awk -F' +|:' '
BEGIN{
print "MarkerName P-Value"
}
FNR>1{
print OFS $2":"$3,$NF
}
' Input_file
Explanation: Adding detailed explanation for above.
awk -F' +|:' ' ##Starting awk program from here and setting field separator as spaces or colon for all lines.
BEGIN{ ##Starting BEGIN section of this program from here.
print "MarkerName P-Value" ##Printing headers here.
}
FNR>1{ ##Checking condition if line number is greater than 1 then do following.
print OFS $2":"$3,$NF ##Printing space(OFS) 2nd field colon 3rd field and last field as per OP request.
}
' Input_file ##Mentioning Input_file name here.
$ awk -F'[: ]+' '{print (NR==1 ? "MarkerName P-Value" : $2":"$3" "$NF)}' file
MarkerName P-Value
chr1:1243 0.14
Sed alternative:
sed -En '1{s/^.*$/MarkerName\tP-Value/p};s/([[:digit:]]+[[:space:]]+)([[:alnum:]]+:[[:digit:]]+)(.*)([[:digit:]]+\.[[:digit:]]+$)/\2\t\4/p'
For the first line, substitute the full line with the headers. Then split each line into 4 sections based on regular expressions and print the 2nd subsection, followed by a tab and the 4th subsection.

How do I join lines using space and comma

I have the file that contains content like:
IP
111
22
25
I want to print the output in the format IP 111,22,25.
I have tried tr ' ' ',' but it's not working.
Welcome to paste
$ paste -sd " ,," file
IP 111,22,25
Normally paste writes lines to standard output consisting of the sequentially corresponding lines of each given file, separated by a <tab> character. The -s option does it differently: it pastes the lines of each file sequentially, with a <tab> character as the delimiter. When using the -d flag, you can give a list of delimiters to be used instead of the <tab> character; the list is reused circularly. Here the list " ,," says: use a space for the first join and commas for the two remaining joins.
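Because the -d list is recycled when it runs out, the exact delimiter pattern depends on the number of input lines. A length-independent sketch (joining everything after the first line with commas; file name assumed) could be:

```shell
printf 'IP\n111\n22\n25\n' > nums.txt

first=$(head -n1 nums.txt)                 # the "IP" line
rest=$(tail -n +2 nums.txt | paste -sd, -) # join remaining lines with commas
echo "$first $rest"
# prints: IP 111,22,25
```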
In pure Bash:
# Read file into array
mapfile -t lines < infile
# Print to string, comma-separated from second element on
printf -v str '%s %s' "${lines[0]}" "$(IFS=,; echo "${lines[*]:1}")"
# Print
echo "$str"
Output:
IP 111,22,25
I'd go with:
{ read a; read b; read c; read d; } < file
echo "$a $b,$c,$d"
This will also work:
xargs printf "%s %s,%s,%s" < file
Try cat file.txt | tr '\n' ',' | sed 's/IP,/IP /;s/,$//'
tr turns each newline into a comma, then sed changes IP,111,22,25 into IP 111,22,25 and drops the trailing comma left over from the final newline.
The following awk script will do the requested:
awk 'BEGIN{OFS=","} FNR==1{first=$0;next} {val=val?val OFS $0:$0} END{print first FS val}' Input_file
Explanation: Adding explanation for above code now.
awk ' ##Starting awk program here.
BEGIN{ ##Starting BEGIN section here of awk program.
OFS="," ##Setting OFS as comma, output field separator.
} ##Closing BEGIN section of awk here.
FNR==1{ ##Checking if line is first line then do following.
first=$0 ##Creating variable first whose value is current first line.
next ##next is an awk built-in keyword which skips all further statements for the current line.
} ##Closing FNR==1 BLOCK here.
{ ##This BLOCK will be executed for all lines apart from 1st line.
val=val?val OFS $0:$0 ##Creating variable val, which keeps concatenating the current line onto its own value.
}
END{ ##Mentioning awk END block here.
print first FS val ##Printing variable first FS(field separator) and variable val value here.
}' Input_file ##Mentioning Input_file name here which is getting processed by awk.
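Running it against the sample input (file name assumed):

```shell
printf 'IP\n111\n22\n25\n' > nums.txt

awk 'BEGIN{OFS=","} FNR==1{first=$0;next} {val=val?val OFS $0:$0} END{print first FS val}' nums.txt
# prints: IP 111,22,25
```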
Using Perl
$ cat captain.txt
IP
111
22
25
$ perl -0777 -ne ' @k=split(/\s+/); print $k[0]," ",join(",",@k[1..$#k]) ' captain.txt
IP 111,22,25
$

bash - replace all occurrences in a line with a captured pattern from that line

I have an input file:
a=,1,2,3
b=,4,5,6,7
c=,8,9
d=,10,11,12
e=,13,14,15
That I need to transform into
a/1 a/2 a/3
b/4 b/5 b/6 b/7
c/8 c/9
d/10 d/11 d/12
e/13 e/14 e/15
So I need to capture the phrase before the = sign and replace every comma with a space followed by \1/.
My most successful attempt was:
sed 's#\([^,]*\)=\([^,]*\),#\2 \1/#g'
but that would only replace the first occurrence.
Any suggestions?
With awk:
awk -F'[=,]' '{ for(i=3;i<=NF;i++) printf "%s/%s%s", $1,$i,(i==NF? ORS:OFS) }' file
The output:
a/1 a/2 a/3
b/4 b/5 b/6 b/7
c/8 c/9
d/10 d/11 d/12
e/13 e/14 e/15
Or a shorter one with gsub/sub substitution:
awk -F'=' '{ gsub(",", OFS $1"/"); sub(/^[^ ]+ /, "") }1' file
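Running the shorter one against the sample input (file name assumed):

```shell
printf 'a=,1,2,3\nb=,4,5,6,7\nc=,8,9\n' > list.txt

# replace each comma with " <prefix>/", then drop the leading "<prefix>= "
awk -F'=' '{ gsub(",", OFS $1"/"); sub(/^[^ ]+ /, "") }1' list.txt
```
Output:
a/1 a/2 a/3
b/4 b/5 b/6 b/7
c/8 c/9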
The following awk may help you with the same.
awk -F"=" '{gsub(/\,/,FS $1"/");$1="";gsub(/^ +| +$/,"")} 1' Input_file
Explanation: Adding explanation too now for above solution:
awk -F"=" '{
gsub(/\,/,FS $1"/"); ##Globally replacing every comma with FS(field separator), then $1 and a /.
$1=""; ##Nullifying the first column now.
gsub(/^ +| +$/,"") ##Globally substituting initial space and space at last with NULL here.
}
1 ##awk works on method of condition then action, so by mentioning 1 making condition TRUE here and not mentioning any action so by default action is print of the current line.
' Input_file ##Mentioning the Input_file name here.
Output will be as follows:
a/1 a/2 a/3
b/4 b/5 b/6 b/7
c/8 c/9
d/10 d/11 d/12
e/13 e/14 e/15
With sed
sed -E '
:A
s/([^=]*)(=[^,]*),([^,]*)/\1\2\1\/\3 /
tA
s/.*=//
' infile

awk delete all lines not containing substring using if condition

I want to delete lines where the first column does not contain the substring 'cat'.
So if string in col 1 is 'caterpillar', i want to keep it.
awk -F"," '{if($1 != cat) ... }' file.csv
How can i go about doing it?
I want to delete lines where the first column does not contain the substring 'cat'
That can be taken care of by this awk (the condition must be truthy for a line to be kept, so index() alone does it):
awk -F, 'index($1, "cat")' file.csv
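awk prints a line when its condition is truthy, and index() returns a non-zero position when the substring is found, so on a small sample (file name assumed):

```shell
printf 'caterpillar,1\ndog,2\nbobcat,3\n' > file.csv

awk -F, 'index($1, "cat")' file.csv
# prints:
# caterpillar,1
# bobcat,3
```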
If that doesn't work then I would suggest you provide your sample input and expected output in the question.
This awk does the job too
awk -F, '$1 ~ /cat/{print}' file.csv
Explanation
-F : "Delimiter"
$1 ~ /cat/ : match pattern cat in field 1
{print} : print
A shorter command is:
awk -F, '$1 ~ "cat"' file.csv
-F is the field delimiter: (,)
$1 ~ "cat" is a (not anchored) regular expression match, match at any position.
As no action has been given, the default: {print} is assumed by awk.
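For example, on a small sample (file name assumed):

```shell
printf 'caterpillar,1\ndog,2\nbobcat,3\n' > file.csv

awk -F, '$1 ~ "cat"' file.csv
# prints:
# caterpillar,1
# bobcat,3
```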
