I have the below file:
ab=5
ac=6
ad=5
ba=5
bc=7
bd=4
ca=5
cb=7
cd=3
...
"ab" and "ba", "ac" and "ca", "bc" and "cb" are redundant.
How do I eliminate these redundant lines in bash?
Expected output:
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
$ awk '{x=substr($0,1,1); y=substr($0,2,1)} !seen[x>y?x y:y x]++' file
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
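The ternary x>y?x y:y x builds a canonical key with the two characters in a fixed order, so "ab" and "ba" increment the same seen[] entry. A quick self-contained run with the question's sample data:

```shell
# Canonical-key dedup: "ab" and "ba" both map to the key "ba"
# (larger character first), so only the first occurrence prints.
input='ab=5
ac=6
ad=5
ba=5
bc=7
bd=4
ca=5
cb=7
cd=3'
result=$(printf '%s\n' "$input" |
  awk '{x=substr($0,1,1); y=substr($0,2,1)} !seen[x>y?x y:y x]++')
echo "$result"
```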
Short awk solution:
awk '{ c1=substr($0,1,1); c2=substr($0,2,1) }!a[c1 c2]++ && !((c2 c1) in a)' file
c1=substr($0,1,1) - assign the extracted 1st character to variable c1
c2=substr($0,2,1) - assign the extracted 2nd character to variable c2
!a[c1 c2]++ && !((c2 c1) in a) - the crucial condition: print only if neither this 2-character sequence nor its reversal has been seen before
The output:
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
Here's one with perl, a generic solution irrespective of the number of characters before the =
$ cat ip.txt
ab=5
ac=6
abd=51
ba=5
bad=23
bc=7
bd=4
ca=5
cb=7
cd=3
$ perl -F= -lane 'print if !$seen{join "",sort split//,$F[0]}++' ip.txt
ab=5
ac=6
abd=51
bc=7
bd=4
cd=3
like awk, by default uninitialized variables evaluate to false
-F= use = as field separator, results saved in @F array
$F[0] will give first field, i.e the characters before =
split//,$F[0] will give array with individual characters
sort by default does string sorting
join "" will then form single string from the sorted characters with null string as separator
See https://perldoc.perl.org/perlrun.html#Command-Switches for documentation on -lane and -F options. Use -i for inplace editing
Could you please try the following and let me know if it helps you; I have written and tested it with GNU awk.
awk -F'=' '
{
  split($1,array,"")
}
!((array[1],array[2]) in a){
  a[array[1],array[2]];
  a[array[2],array[1]];
  print;
  next
}
!((array[2],array[1]) in a){
  a[array[1],array[2]];
  a[array[2],array[1]];
  print;
}
' Input_file
Output will be as follows.
ab=5
ac=6
ad=5
bc=7
bd=4
cd=3
Related
I'm looking to replace the numerical values in a file with a new value provided by me. The value can appear in any part of the text; in some cases it is the third field, but that is not always the case. I also want to save the result as a new version of the file.
original format
A:fdg:user#server:r
A:g:1234:xtcy
A:d:1111:xtcy
modified format
A:fdg:user#server:rxtTncC
A:g:replaced_value:xtcy
A:d:replaced_value:xtcy
bash line command with awk:
awk -v newValue="newVALUE" 'BEGIN{FS=OFS=":"} /:.:.*:/ && $3~/^[0-9]+$/{$3=newValue} 1' original_file.txt > replaced_file.txt
You can simply use sed instead of awk:
sed -E 's/\b[0-9]+\b/replaced_value/g' /path/to/infile > /path/to/outfile
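Note that \b is a GNU sed extension, and the substitution rewrites every standalone run of digits on the line, not just the third field. A minimal check on two of the sample lines:

```shell
# GNU sed assumed for \b; every whole number on the line is replaced.
result=$(printf 'A:g:1234:xtcy\nA:d:1111:xtcy\n' |
  sed -E 's/\b[0-9]+\b/replaced_value/g')
echo "$result"
```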
Here is an awk that asks you for replacement values for each numerical value it meets:
$ awk '
BEGIN {
FS=OFS=":" # delimiters
}
{
for(i=1;i<=NF;i++) # loop all fields
if($i~/^[0-9]+$/) { # if numerical value found
printf "Provide replacement value for %d: ",$i > "/dev/stderr"
getline $i < "/dev/stdin" # ask for a replacement
}
}1' file_in > file_out # write output to a new file
I would use GNU AWK for this task in the following way. Let file.txt content be
A:fdg:user#server:rxtTncC
A:g:1234:xtcy
A:d:1111:xtcy
then
awk 'BEGIN{newvalue="replacement"}{gsub(/[[:digit:]]+/,newvalue);print}' file.txt
output
A:fdg:user#server:rxtTncC
A:g:replacement:xtcy
A:d:replacement:xtcy
Explanation: replace one or more digits using newvalue. Disclaimer: I assumed numeric is something consisting solely of digits.
(tested in gawk 4.2.1)
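To illustrate the disclaimer: gsub() rewrites every digit run, including digits embedded in otherwise non-numeric fields (the rxt9ncC value below is a made-up example):

```shell
# gsub() replaces all digit runs, even ones inside mixed text.
result=$(printf 'A:fdg:rxt9ncC\nA:g:1234:x\n' |
  awk 'BEGIN{newvalue="replacement"}{gsub(/[[:digit:]]+/,newvalue);print}')
echo "$result"
```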
How about
awk -F : '$3 ~ /^[0-9]+$/ { $3 = "new value"} {print}' original_file >replaced_file
?
Let's say I have a file with multiple fields, and field 1 needs to be filtered on 2 conditions. I was thinking of turning those conditions into regex patterns and passing them as variables to the awk statement. For some reason they are not filtering out the records at all. Here is my attempt: it runs fine but doesn't filter the results per the conditions, except when the patterns are fed directly into awk without variable assignment.
regex1="/abc|def/"; # match first field for abc or def;
regex2="/123|567/"; # and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} {if ( ($1~pat1) && ($1~pat2) ) print $0}'
Update: Fixed a syntax error related to missing parenthesis for the if conditions in the awk. (I had it fixed in the code I ran).
Sample data
abc:567 1
egf:888 2
Expected output
abc:567 1
The problem is that I am getting all the results instead of the ones that satisfy the 2 regex for field 1
Note that the match needs to be a partial (substring) match instead of an exact one, meaning that 567 as defined in the regex pattern should also match 567_1 if present.
It seems like the way to implement what you want to do would be:
awk -F'\t' '
($1 ~ /abc|def/) &&
($1 ~ /123|567/)
' file
or probably more robustly:
awk -F'\t' '
{ split($1,a,/:/) }
(a[1] ~ /abc|def/) &&
(a[2] ~ /123|567/)
' file
What's wrong with that?
EDIT: here is me running the OP's code before and after fixing the inclusion of regexp delimiters (/) in the dynamic regexp strings:
$ cat tst.sh
#!/usr/bin/env bash
regex1="/abc|def/"; #--match first field for abc or def;
regex2="/123|567/"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
echo "###################"
regex1="abc|def"; #--match first field for abc or def;
regex2="123|567"; #--and also match the first field for 123 or 567;
cat file_name \
| awk -v pat1="${regex1}" -v pat2="${regex2}" 'BEGIN{FS=OFS="\t"} $1~pat1 && $1~pat2'
$
$ ./tst.sh
###################
abc:567 1
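In a dynamic (string) regexp the /.../ delimiters are ordinary characters, so "/abc|def/" is the alternation of "/abc" and "def/" and matches neither sample field. A standalone check of both forms:

```shell
# With the slashes included nothing matches; without them it does.
bad=$(printf 'abc:567\t1\n'  | awk -v p="/abc|def/" '$1 ~ p')
good=$(printf 'abc:567\t1\n' | awk -v p="abc|def"   '$1 ~ p')
echo "bad=[$bad] good=[$good]"
```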
EDIT: Since the OP has changed the samples, adding this solution here; it will work for partial matches also. Again written and tested with the shown samples in GNU awk.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
{
for(i in reg1){
if(index($1,i)){
for(j in reg2){
if(index($2,j)){ print; next }
}
}
}
}
' Input_file
Let's say following is an Input_file:
cat Input_file
abc_2:567_3 1
egf:888 2
Now after running the above code we will get abc_2:567_3 1 as output.
With your shown samples only, could you please try the following, written and tested in GNU awk. Put the values you want to look for in the 1st field in var1 and those for the 2nd field in var2, pipe-delimited.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
BEGIN{
num=split(var1,arr1,"|")
split(var2,arr2,"|")
for(i=1;i<=num;i++){
reg1[arr1[i]]
reg2[arr2[i]]
}
}
($1 in reg1) && ($2 in reg2)
' Input_file
Explanation: Adding detailed explanation for above.
awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" ' ##Starting awk program from here.
##Setting field separator as colon or spaces, setting var1 and var2 values here.
BEGIN{ ##Starting BEGIN section of this program from here.
num=split(var1,arr1,"|") ##Splitting var1 to arr1 here.
split(var2,arr2,"|") ##Splitting var2 to arr2 here.
for(i=1;i<=num;i++){ ##Running for loop from 1 to till value of num here.
reg1[arr1[i]] ##Creating reg1 with index of arr1 value here.
reg2[arr2[i]] ##Creating reg2 with index of arr2 value here.
}
}
($1 in reg1) && ($2 in reg2) ##If the 1st field is present in reg1 AND the 2nd field is present in reg2, print that line.
' Input_file ##Mentioning Input_file name here.
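Since `in` is an exact lookup on array indices, this version only matches whole fields; a partial value such as abc_2 falls through (which is what the index()-based variant above handles):

```shell
# Exact lookups: "abc" is a key in reg1, "abc_2" is not.
exact=$(printf 'abc:567 1\nabc_2:567_3 1\n' |
  awk -F':|[[:space:]]+' -v var1="abc|def" -v var2="123|567" '
    BEGIN{
      num=split(var1,arr1,"|"); split(var2,arr2,"|")
      for(i=1;i<=num;i++){ reg1[arr1[i]]; reg2[arr2[i]] }
    }
    ($1 in reg1) && ($2 in reg2)')
echo "$exact"
```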
I'm trying to get the character that precedes each occurrence of a given character/pattern in a string, using standard bash tools such as grep, awk/gawk, sed ...
Step I: get the character that precedes each occurrence of the character :
Example:
String 1 => :hd:fg:kl:
String 2 => :df:lkjh:
String 3 => :glki:l:s:d:
Expected results
Result 1 => dgl
Result 2 => fh
Result 3 => ilsd
I tried many times with awk but without success
Step II: Insert a given character between each character of the resulting string
Example with /
Result 1 => d/g/l
Result 2 => f/h
Result 3 => i/l/s/d
I have an awk expression for this step awk -F '' -v OFS="/" '{$1=$1;print}'
I don't know if it is possible to do Step I with awk or sed, and, why not, do Step I and Step II at once.
Kind Regards
What about:
awk 'BEGIN{FS=":"}{for(i=1;i<NF;i++){if(i>2)printf"/";printf substr($i,length($i))}print""}' input.txt
input.txt:
:hd:fg:kl:
:df:lkjh:
:glki:l:s:d:
Output:
d/g/l
f/h
i/l/s/d
Solution 1st: Could you please try the following and let me know if it helps you.
awk -F":" '
{
for(i=1;i<=NF;i++){
if($i){ val=(val?val:"")substr($i,length($i)) }
}
print val;
val=""
}' Input_file
Output will be as follows.
dgl
fh
ilsd
Solution 2nd: With a / in between output strings.
awk '
BEGIN{
OFS="/";
FS=":"
}
{
for(i=1;i<=NF;i++){
if($i){
val=(val?val OFS:"")substr($i,length($i))
}}
print val;
val=""
}' Input_file
Output will be as follows.
d/g/l
f/h
i/l/s/d
Solution 3rd: With the match utility of awk.
awk '
{
while(match($0,/[a-zA-Z]:/)){
val=(val?val:"")substr($0,RSTART,RLENGTH-1)
$0=substr($0,RSTART+RLENGTH)
}
print val
val=""
}' Input_file
This might work for you (GNU sed):
sed -r 's/[^:]*([^:]):+|:+/\1/g;s/\B/\//g' file
Replace zero or more non-:'s followed by a single character followed by a :, or a lone :, by the single character, globally throughout the line. Then insert a / between each pair of characters.
Perl and negative lookahead:
$ perl -p -e 's/.(?!:)//g' file
dgl
fh
ilsd
This is easier to do with perl
$ cat ip.txt
:hd:fg:kl:
:df:lkjh:
:glki:l:s:d:
$ perl -lne 'print join "/", /.(?=:)/g' ip.txt
d/g/l
f/h
i/l/s/d
/.(?=:)/g gets all characters preceding :
(?=:) is a lookahead construct
the resulting matches are then printed using / as delimiter string
With any sed supporting ERE:
sed -E 's#[^:]*(.):#\1/#g;s/^.|.$//g' infile
Using GNU sed:
sed -E 's/[^:]*([^:]):/\1/g; s/([^:])/\/\1/g; s/^:\///'
The first command, s/[^:]*([^:]):/\1/g, strips out the extra characters and the colons (except the first one), so yields this:
:dgl
:fh
:ilsd
The second command s/([^:])/\/\1/g inserts a / before each character, yielding:
:/d/g/l
:/f/h
:/i/l/s/d
The last command s/^:\/// simply removes the :/ from the beginning of each line:
d/g/l
f/h
i/l/s/d
You could iterate across each line starting at the second character with gawk. Every time the iterator hits a colon, print the previous character.
$ awk <file.txt '{for(i=2;i<=length($0);i++) { \
if (substr($0,i,1)==":") printf substr($0,i-1,1);} printf "\n";}'
dgl
fh
ilsd
I have a sequencing file to analyze that has many lines like the following tab-separated line:
chr12 3356475 . C A 76.508 . AB=0;ABP=0;AC=2;AF=1;AN=2;AO=3;CIGAR=1X;DP=3;DPB=3;DPRA=0;EPP=9.52472;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=8.76405;PAIRED=0;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=111;QR=0;RO=0;RPP=9.52472;RPPR=0;RUN=1;SAF=3;SAP=9.52472;SAR=0;SRF=0;SRP=0;SRR=0;TYPE=snp GT:DP:RO:QR:AO:QA:GL 1/1:3:0:0:3:111:-10,-0.90309,0
I am trying to use awk to match particular regions to their DP value. This is how I'm trying it:
awk '$2 == 33564.. { match(DP=) }' file.txt | head
Neither the matching nor the wildcards seem to work.
Ideally this code would output 3 because that is what DP equals.
You can use both ; and tab as field delimiters. Doing so you can access the number in $2 and the DP= field in $14:
awk -F'[;\t]' '$2 ~ /33564../{sub(/DP=/,"",$14);print $14}' file.txt
The sub function is used to remove DP= from $14 which leaves only the value.
Btw, if you also add = to the set of field delimiters the value of DP will be in field 21:
awk -F'[;\t=]' '$2 ~ /33564../{print $21}' file.txt
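One caveat with positional access: the field number of DP depends on how many key=value pairs precede it. As a minimal check with a trimmed-down INFO column (only a few pairs kept from the question's line, so DP lands in $10 rather than $14):

```shell
# 7 tab-separated fields come first, then the ;-separated INFO pairs:
# AB=0 is $10-2=$8, AC=2 is $9, DP=3 is $10 with -F'[;\t]'.
dp=$(printf 'chr12\t3356475\t.\tC\tA\t76.508\t.\tAB=0;AC=2;DP=3;LEN=1\n' |
  awk -F'[;\t]' '$2 ~ /33564../{sub(/DP=/,"",$10); print $10}')
echo "$dp"
```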
Having worked with genomic data, I believe that the following will be more robust than the previously posted solution. The main difference is that the key-value pairs are treated as such, without any assumption about their ordering, etc. The minor difference is the caret ("^") in the regex:
awk -F'\t' '
$2 ~ /^33564../ {
n=split($8,a,";");
for(i=1;i<=n;i++) {
split(a[i],b,"=");
if (b[1]=="DP") {print $2, b[2]} }}'
If this script is to be used more than once, then it would be better to abstract the lookup functionality, e.g. like so:
awk -F'\t' '
function lookup(key, string, i,n,a,b) {
n=split(string,a,";");
for(i=1;i<=n;i++) {
split(a[i],b,"=");
if (b[1]==key) {return b[2]}
}
}
$2 ~ /^33564../ {
val = lookup("DP", $8);
if (val) {print $2, val;}
}'
I want to delete lines where the first column does not contain the substring 'cat'.
So if the string in col 1 is 'caterpillar', I want to keep it.
awk -F"," '{if($1 != cat) ... }' file.csv
How can I go about doing it?
I want to delete lines where the first column does not contain the substring 'cat'
That can be taken care of by this awk:
awk -F, 'index($1, "cat")' file.csv
If that doesn't work then I would suggest you to provide your sample input and expected output in question.
This awk does the job too
awk -F, '$1 ~ /cat/{print}' file.csv
Explanation
-F : "Delimiter"
$1 ~ /cat/ : match pattern cat in field 1
{print} : print
A shorter command is:
awk -F, '$1 ~ "cat"' file.csv
-F is the field delimiter: (,)
$1 ~ "cat" is an unanchored regular expression match; it matches at any position.
As no action has been given, the default: {print} is assumed by awk.
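Both the index() and regex answers keep any line whose first field merely contains cat; on literal (metacharacter-free) text like this they agree:

```shell
# index() is a plain substring test; ~ "cat" is an unanchored regex
# match. Both keep "caterpillar" and "bobcat", and drop "dog".
data='caterpillar,1
dog,2
bobcat,3'
by_index=$(printf '%s\n' "$data" | awk -F, 'index($1, "cat")')
by_regex=$(printf '%s\n' "$data" | awk -F, '$1 ~ "cat"')
echo "$by_index"
```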