deleting spaces between every other column - bash

I have a large dataset that looks like this:
ID224912 A A A B B A B A B A B
and I want to make it look like:
ID224912 AA AB BA BA BA BA
I have tried modifying this code that I found somewhere else, but with no success:
AWK=' { printf ("%s %s %s %s", $1, $2, $3, $4); }
{ for (f = 5; f <= NF; f += 2) printf ("%s %s", $(f), $(f + 1)); }
{ printf ("\n"); } '
awk "${AWK}" InFile > OutFile
Any suggestions?

This might work for you (GNU sed):
sed -E 's/((\S+\s\S+\s)*\S+).*/\1/g;s/(\S+\s\S+)\s/\1/g' file
The solution is in two parts. First, trim the line to an odd number of fields (the ID plus whole pairs), deleting the extra field if there is one. Then group the fields in pairs and remove the space inside each pair.

$ awk '{r=$1; for (i=2; i<NF; i+=2) r=r OFS $i $(i+1); print r}' file
ID224912 AA AB BA BA BA

You do not have to assign the AWK script into a variable. Just invoke it inline, which is simpler and safer.
It looks strange that you are grouping the first four fields. As far as I can see from your desired output, it would be enough just to treat the first (ID) field separately.
Try something like:
awk '{printf("%s", $1); for (i=2; i<=NF; i+=2) printf(" %s%s", $i, $(i+1)); print ""}' InFile > OutFile
Hope this helps.
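For what it's worth, running that on the sample line should produce the following; note the trailing lone B: since the sample row has an odd number of letters, the last field has nothing to pair with:
$ awk '{printf("%s", $1); for (i=2; i<=NF; i+=2) printf(" %s%s", $i, $(i+1)); print ""}' InFile
ID224912 AA AB BA BA BA B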

For funsies here is a sed solution:
sed 's/\([A-Z]\) \([A-Z]\)/\1\2/g' input > output
Just for clarification I tested on BSD sed.

Taking InFile as your input file, you can use sed this way:
sed -e 's/\([a-zA-Z]\)[ \t]\([a-zA-Z]\)/\1\2/g' InFile
N.B.: with the specified InFile in your initial question (with an odd count of letters), the result is:
ID224912 AA AB BA BA BA B

The following awk line
awk '{printf $1}{for(i=2;i<=NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA B
As you can see, we get an extra lone column B at the end: the sample input has an odd number of letter columns, so the last one is left without a partner. As the OP does not want this, we can fix it with a simple update to the for-loop condition
awk '{printf $1}{for(i=2;i<NF;i+=2) printf OFS $i $(i+1); print "" }'
will output
ID224912 AA AB BA BA BA

Related

counting occurrences of a character

I have a file that looks like this
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
I need to count the number of times A, B & D occur. Individually, I would do it like this:
awk '{if($1~/A/) print $0 }' < test.txt | wc
awk '{if($1~/B/) print $0 }' < test.txt | wc
awk '{if($1~/D/) print $0 }' < test.txt | wc
How can I join these so that I count the number of A, B, and D occurrences in one line instead of 3 separate commands?
For this specific line format (where the needed character is the last one before the _), you can key on the last character of the first field:
$ awk -F"_" '{ seen[substr($1, length($1))]++ }END{ for(k in seen) print k, seen[k] }' file
A 3
B 2
D 2
Counting occurrences is generally done by keeping a counter. So a single one of the OP's awk lines,
awk '{if($1~/A/) print $0}' < test.txt | wc
can be rewritten as
awk '($1~/A/){c++}END{print c}' test.txt
for multiple cases, you can now do:
awk '($1~/A/){c["A"]++}
($1~/B/){c["B"]++}
($1~/D/){c["D"]++}
END{for(i in c) print i,c[i]}' test.txt
Now you can even clean this up a bit more:
awk '{c["A"]+=($1~/A/)}
{c["B"]+=($1~/B/)}
{c["D"]+=($1~/D/)}
END{for(i in c) print i,c[i]}' test.txt
which you can clean up further as:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=($1~a[i])}
END{for(i in c) print i,c[i]}' test.txt
But these variants only count how many lines contain each letter, not how many times each letter occurs. Since gsub() returns the number of substitutions it made, it can do the real per-character counting:
awk 'BEGIN{split("A B D",a)}
{for(i in a) c[a[i]]+=gsub(a[i],"",$1)}
END{for(i in c) print i,c[i]}' test.txt
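For the sample data each line contains at most one of the target letters, so both variants print the same counts (the for(i in c) order is unspecified, so the lines may come out in any order):
$ awk 'BEGIN{split("A B D",a)} {for(i in a) c[a[i]]+=gsub(a[i],"",$1)} END{for(i in c) print i,c[i]}' test.txt
A 3
B 2
D 2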
Perl to the rescue!
perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
-n reads the input line by line
-l removes newlines from input and adds them to output
a hash table %seen is used to keep the number of occurrences of each symbol. Each time it's matched it's captured and the corresponding field in the hash is incremented.
END is run when the file ends. It outputs all the keys of the hash, i.e. the matched characters, each followed by the number of occurrences.
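An example run on the sample file (Perl hash key order is effectively random, so the three lines can appear in any order):
$ perl -lne '$seen{$1}++ if /([ABD])/; END { print "$_:$seen{$_}" for keys %seen }' < test.txt
A:3
B:2
D:2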
datafile:
chr1A_p1
chr1A_p2
chr10B_p1
chr10A_p1
chr11D_p2
chr18B_p2
chr9D_p1
script.awk
BEGIN {
    arr["A"] = 0
    arr["B"] = 0
    arr["D"] = 0
}
/A/ { arr["A"]++ }
/B/ { arr["B"]++ }
/D/ { arr["D"]++ }
END {
    printf "A: %s, B: %s, D: %s\n", arr["A"], arr["B"], arr["D"]
}
execution:
awk -f script.awk datafile
result:
A: 3, B: 2, D: 2

Split column using awk or sed

I have a file containing the following text.
dog
aa 6469
bb 5946
cc 715
cat
aa 5692
Bird
aa 3056
bb 2893
cc 1399
dd 33
I need the following output:
A-Z, aa, bb, cc, dd
dog, 6469, 5946, 715, 0
cat, 5692, 0, 0, 0
Bird, 3056, 2893, 1399, 33
I tried:
awk '{$1=$1}1' OFS="," RS=
But it is not giving the format I need.
Thanks in advance for your help.
Cris
With Perl
perl -00 -nE'
($t, %p) = split /[\n\s]/; $h{$t} = {%p}; # Top line, Pairs on lines
$o{$t} = ++$c; # remember Order
%k = map { $_, 1} keys %p; # find full set of subKeys
}{ # END block starts
say join ",", "A-Z", sort keys %k;
for $t (sort { $o{$a} <=> $o{$b} } keys %h) {
say join ",", $k, map { ($h{$k}{$_} // 0) } sort keys %k;
}
' data.txt
prints, in the original order
A-Z,aa,bb,cc,dd
dog,6469,5946,715,0
cat,5692,0,0,0
Bird,3056,2893,1399,33
Here's a sed solution which works on your input, but with some preconditions: you have to know the column names in advance; the column names must form a sorted, full range starting with the first name (so nothing like aa, cc or bb, aa or bb, cc); and every paragraph must be followed by one empty line. You would also need to adjust the script if you don't have exactly four numeric columns:
echo 'A-Z, aa, bb, cc, dd';sed -e '/./{s/.* //;H;d};x;s/\n/, /g;s/, //;s/$/, 0, 0, 0, 0/;:a;s/,[^,]*//5;ta' file
If you need to look up the sed commands, you can look at info sed, especially 3.5 Less Frequently-Used Commands.
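For reference, here is my reading of that script, restated as a commented multi-line version (GNU sed accepts comment lines in a script):
sed -e '
  # on every non-empty line: keep only the last field and stash it in the hold space
  /./ { s/.* //; H; d }
  # on the empty line after a paragraph: fetch the accumulated fields
  x
  # join them with ", " and drop the leading separator that H left behind
  s/\n/, /g
  s/, //
  # pad with four zero columns, then repeatedly delete the sixth column
  # until only five (name plus four numbers) remain
  s/$/, 0, 0, 0, 0/
  :a
  s/,[^,]*//5
  ta
' file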
awk to the rescue!
awk -v OFS=, 'NF==1 {h[++c]=$1}
NF==2 {v[c,$1]=$2; ks[$1]}
END {printf "%s", "A-Z";
for(k in ks) printf "%s", OFS k;
print "";
for(i=1;i<=c;i++)
{printf "%s", h[i];
for(k in ks) printf "%s", OFS v[i,k]+0;
print ""}}' file'
Note that the order of the columns will be random, since for (k in ks) scans keys in an unspecified order.
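If GNU awk is available, one way to pin the column order down (my addition, not part of the original answer) is gawk's sorted_in control, which makes every for (k in ks) loop scan keys in sorted order:
awk -v OFS=, 'BEGIN {PROCINFO["sorted_in"] = "@ind_str_asc"}  # gawk-only
     NF==1 {h[++c]=$1}
     NF==2 {v[c,$1]=$2; ks[$1]}
     END {printf "%s", "A-Z";
          for(k in ks) printf "%s", OFS k;
          print "";
          for(i=1;i<=c;i++)
            {printf "%s", h[i];
             for(k in ks) printf "%s", OFS v[i,k]+0;
             print ""}}' file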

Append a specific identifier to data in a tab delimited text file

Essentially I have something like this:
B3 LPC1030_64571 LPC1283_613422
B2 LPC107_67093 LPC174_1161466 LPC1283_579823 LPC5_2182288 LPC1378_340850 LPC203_5679639 LPC107_67396 LPC107_67535 LPC107_70165 LPC107_77297 LPC107_80176 LPC107_81524 LPC107_88715 AMZ216_267328 AMZ216_268028
B1 ...
For every entry in each Bx row, I want to append ".Bx" to it.
A simple awk script will do that:
awk '{for(i=2;i<=NF;i++){$i=$i "." $1}; print}' <infile
or, more nicely formatted:
awk '{
    for(i=2;i<=NF;i++)   # NF is the number of fields
    {
        $i = $i "." $1   # append "." and the row label to every field except the first
    };
    print                # print the modified record to stdout
}' <infile
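Applied to the first sample row, that should give something like:
B3 LPC1030_64571.B3 LPC1283_613422.B3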

bash while read line: can't compare line against different values

I want to read a file line by line and test whether the line exists in $3 of a second file; if it does, print $0, otherwise print the line.
My first file contains values like this:
00
01
03
.
.
80
A1
A2
A3
.
.
B5"
The script works correctly up to 80, but when the line becomes a string it doesn't work. Here is the code:
while read -r line
do
cat file2.txt | awk '
BEGIN { FS="."
test=0
}
('"$line"'==$1) {test=1
result=$0}
END{
if (test==1) { print result}
else { print '"$line"'}
}
'
done < file1.txt
This is not the right approach with awk; there is no need to spawn one awk process per input line. The idiomatic way of doing it is:
awk -F. 'NR==FNR {ks[++n]=$1; vs[n]=$0; next}
{for(k=n; k>0; k--) if(ks[k]==$1) {print vs[k]; next}
print "not found:", $1}' file2 file1
Note that the second file is provided first. Also, you mention that you're comparing against $3, but the code says $1.
@karakfa
file2:
00
01
02
A1
A2
B1
B2
file1:
B2.9.75.lkf
B1.69.874.cds
00.364.6478.abc
A1.635.7452.cds
01.36.3214.vcd
I want the output like this:
00.364.6478.abc
01.36.3214.vcd
02.not found
A1.635.7452.cds
A2.not found
B1.69.874.cds
B2.9.75.lkf
It follows the order of file2, and if an entry doesn't exist it writes not found.
@M122015: try:
awk 'FNR==NR {q=$1; sub(/.*/,"",$1); sub(/^[[:space:]]+/,""); gsub(/ /,"."); A[q]=$0; next}
($0 in A) {print $0 "." A[$0]; next}
!($0 in A) {print $0 ".not found."}' FS="." Input_file1 Input_file2
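With the sample files from the comment above (Input_file1 holding the B2.9.75.lkf-style data and Input_file2 the key list), this should print, in the key file's order:
00.364.6478.abc
01.36.3214.vcd
02.not found.
A1.635.7452.cds
A2.not found.
B1.69.874.cds
B2.9.75.lkf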

Compare all but last N Columns across two files in bash

I have 2 files: one with 18 columns; another with many more. I need to find the rows that mismatch on ONLY the first 18 columns while ignoring the rest in the other file. However, I need to preserve and print the entire row (cut will not work).
File 1:
F1 F2 F3....F18
A B C.... Y
AA BB CC... YY
File 2:
F1 F2 F3... F18... F32
AA BB CC... YY... 123
AAA BBB CCC... YYY...321
Output Not In File 1:
AAA BBB CCC YYY...321
Output Not In File 2:
A B C...Y
If possible, I would like to use diff or awk with as few loops as possible.
You can use awk:
awk '{k=""; for(i=1; i<=18; i++) k=k SUBSEP $i} FNR==NR{a[k]; next} !(k in a)' file1 file2
For each row we first build a key by concatenating the first 18 fields, joined with SUBSEP.
While iterating over the first file, we store each key in an associative array.
Finally, we print each row of the 2nd file whose key is not found in that array.
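The SUBSEP joiner matters: with plain concatenation, different rows could collide. A tiny illustration (my example, not from the answer; SUBSEP defaults to the control character "\034", which is unlikely to occur in data):
$ awk 'BEGIN { k1 = "ab" "c";        k2 = "a" "bc";        print (k1 == k2) }'
1
$ awk 'BEGIN { k1 = "ab" SUBSEP "c"; k2 = "a" SUBSEP "bc"; print (k1 == k2) }'
0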
You can use grep:
grep -vf file1 file2
grep -vf <(cut -d" " -f1-18 file2) file1
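A caveat worth flagging (my note, not part of the answer): grep treats each line of the -f file as a regex that can match anywhere in a line, so for a literal comparison of the first 18 columns the fixed-string flag is safer:
grep -vFf <(cut -d" " -f1-18 file2) file1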
To get the set differences between the two files, you'll need a little more, similar to @anubhava's answer:
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else f2[$0]}
END{print "not in f1";
for(k in f2) print k;
print "\nnot in f2";
for(k in f1) print k}' file1 file2
can be re-written to preserve order in file2
$ awk 'NR==FNR{f1[$0]; next}
{k=$1; for(i=2;i<=18;i++) k=k FS $i;
if(k in f1) delete f1[k];
else {if(!p) print "not in f1";
f2[$0]; print; p=1}}
END{print "\nnot in f2";
for(k in f1) print k}' file1 file2
