find if two consecutive lines are different and where - bash

How can I find whether two consecutive lines of a fixed-width file differ, and at which position?
sample file:
cat test.txt
1111111111111111122211111111111111
1111111111111111132211111111111111
output :
It should inform the user that the two lines differ and that the position of the difference is the 18th character (as in the example above).
It would be really helpful if it could list all the positions in case of multiple variations. For example:
11111111111111111211113111
11111111111111111211114111
Here it should say: difference spotted in the 18th and 26th characters.
I was trying things along the following lines, but I'm lost:
while read line
do
echo $line | sed 's/./ &/g' | xargs -n1   # Not able to apply diff (stupid try)
done <test.txt

Perl to the rescue:
$ echo '11131111111111111211113111
11111111111111111211114111' \
| perl -le '$d = <> ^ <>;
print pos $d while $d =~ /[^\0]/g'
4
23
It XORs the two input strings and reports all positions where the result isn't the null byte, i.e. where the strings were different.
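The same trick can be applied pairwise down a whole file. A sketch (untested, assuming fixed-width lines that come in pairs, as in the question):

perl -lne '
    if ($. % 2 == 0) {
        my $d = $prev ^ $_;    # XOR this line against the saved one
        print "lines ", $. - 1, "-", $., ": difference at char ", pos $d
            while $d =~ /[^\0]/g;
    }
    else { $prev = $_ }
' test.txt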

You can use an empty field separator to make each character a field in awk, and compare the fields of every even-numbered record with those of the preceding odd-numbered record:
awk 'BEGIN{ FS="" }
NR%2 {
    split($0, a)
    next
}
{
    print "line # ", NR
    for (i=1; i<=NF; i++)
        if ($i != a[i])
            print "difference spotted in position:", i
}' test.txt
line # 2
difference spotted in position: 18
line # 4
difference spotted in position: 18
difference spotted in position: 23
Where input data is:
cat test.txt
1111111111111111122211111111111111
1111111111111111132211111111111111
11111111111111111211113111
11111111111111111311114111
PS: It will only work on awk versions that split records into characters when FS is null, e.g. GNU awk, OS X awk, etc.
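A quick way to check whether a given awk does this (my own check, not from the original post): the one-liner below prints 2 if each character became its own field, and 1 otherwise.

echo ab | awk 'BEGIN{ FS="" } { print NF }'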

$ cat tst.awk
{ curr = $0 }
(NR%2)==0 {
    currLgth = length(curr)
    prevLgth = length(prev)
    maxLgth  = (currLgth > prevLgth ? currLgth : prevLgth)
    print "Comparing:"
    print prev
    print curr
    for (i=1; i<=maxLgth; i++) {
        prevChar = substr(prev,i,1)
        currChar = substr(curr,i,1)
        if ( prevChar != currChar ) {
            printf "Difference: char %d line %d = \"%s\", line %d = \"%s\"\n", i, NR-1, prevChar, NR, currChar
        }
    }
    print ""
}
{ prev = curr }
$ cat file
1111111111111111122211111111111111
1111111111111111132211111111111111
11111111111111111111111111
11111111111111111111111
$ awk -f tst.awk file
Comparing:
1111111111111111122211111111111111
1111111111111111132211111111111111
Difference: char 18 line 1 = "2", line 2 = "3"
Comparing:
11111111111111111111111111
11111111111111111111111
Difference: char 24 line 3 = "1", line 4 = ""
Difference: char 25 line 3 = "1", line 4 = ""
Difference: char 26 line 3 = "1", line 4 = ""

Related

Selectively reformatting a file with spaces and \n

I have multiple files in the following format. This one has 3 sequences (the number of sequences varies from file to file, but the names always end in "."), each 40 positions long, as indicated by the numbers in the first line. The names of the sequences appear at the beginning of each line except the first:
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC

GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
I need to convert it to this format, where the sequences are continuous, with no spaces or \n, and each on a new line after its name. The only spaces that should remain are between the two numbers in the first line.
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
I tried sed to delete the spaces and \n's, but I don't know how to apply it only after the first line, or how to avoid producing one huge line.
Thanks
Here's a shell script that may provide what you need:
head -1 input
awk '
NR == 1 { sequences = $1 ; positions = $2 ; next }
{
    if ( $1 ~ /^[0-9]/ ) {
        sid = $1 ; $1 = "" ; sequence_name[ NR - 1 ] = sid
        sequence[ NR - 1 ] = $0
    } else {
        sequence[ ( NR - 1 ) % ( sequences + 1 ) ] = sequence[ ( NR - 1 ) % ( sequences + 1 ) ] " " $0
    }
}
END {
    for ( x = 1 ; x <= length( sequence_name ) ; x++ )
    {
        print sequence_name[x]
        print sequence[x]
    }
}' input | tr -d ' '
I added head -1 at the top of the shell script just to get the first line out of your file. I couldn't output the first line within the awk script because of the pipe to tr -d ' '.
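A variant that keeps everything in the awk script (a sketch, untested, assuming the blank separator line is present as in the sample): strip the spaces with gsub() instead of piping to tr, so the header can be printed by awk itself.

awk '
NR == 1 { print; next }            # header line: print as-is, spaces preserved
/^ *$/  { next }                   # skip the blank separator line
$1 ~ /^[0-9]+\./ {                 # a named line: remember the name, start the sequence
    name[++n] = $1
    $1 = ""
    seq[n] = $0
    next
}
{                                  # continuation lines cycle through the sequences
    i = (m++ % n) + 1
    seq[i] = seq[i] $0
}
END {
    for (x = 1; x <= n; x++) {
        s = seq[x]
        gsub(/ /, "", s)           # squeeze out the remaining spaces
        print name[x]
        print s
    }
}' input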
I think this should work, but my output lines come out longer: when I concatenate all the trailing "orphan" sequences, I get a much longer line.
cat input.txt | awk '/^[0-9]+ [0-9]+$/{printf("%s\n",$0); next} /[0-9]+[.]/{ printf("\n%s\n",$1);for(i=2; i<=NF;i++){printf("%s",$i)}; next} /^ */{ for(i=1; i<=NF;i++){printf("%s",$i)}; next;}'
3 40
Please try it and let me know.
Remember the position of the empty line, and merge the lines before it with those after it:
awk '
NR==1{print;next}
NR!=1 && !empty{arr[NR]=$1 "\n" $2 $3}
/^$/{empty=NR-1;next}
NR!=1 && empty{printf "%s%s%s\n", arr[NR-empty], $1, $2}
' file
My second solution, without awk: merge the file with itself, using the empty line as the separator.
cat >file <<EOF
3 40
00076284. ATGTCTGTGG TTCTTTAACC
00892634. TTGTCTGAGG TTCGTAAACC
00055673. TTGTCTGAGG TCCGTGAACC

GCCGGGAACA TCCGCAAAAA
ACCGTGAAAC GGGGTGAACT
TCCCCCGAAC TCCCTGAACG
EOF
head -n1 file
paste <(sed -n '1!{ /^$/q;p; }' file) <(sed -n '1!{ /^$/,//{/^$/!p}; }' file) |
sed 's/[[:space:]]//g; s/\./.\n/'
Will output:
3 40
00076284.
ATGTCTGTGGTTCTTTAACCGCCGGGAACATCCGCAAAAA
00892634.
TTGTCTGAGGTTCGTAAACCACCGTGAAACGGGGTGAACT
00055673.
TTGTCTGAGGTCCGTGAACCTCCCCCGAACTCCCTGAACG
Explanation:
head -n1 file outputs the first line.
sed -n '1!{ /^$/q;p; }' file
1! - don't process the first line
/^$/q - quit at the empty line
p - print everything else
sed -n '1!{ /^$/,//{/^$/!p}; }' file
1! - ignore the first line
/^$/,// - from the empty line until the end
/^$/!p - print if not an empty line
paste <(..) <(...) - merge the two seds with a tab (note: the <(...) process substitution requires a shell such as bash, ksh, or zsh)
sed 's/[[:space:]]//g; s/\./.\n/'
s/[[:space:]]//g - remove all whitespace
s/\./.\n/ - replace a dot with a dot and a newline

Splitting a large, complex one column file into several columns with awk

I have a text file produced by some commercial software, looking like the sample below. It consists of bracket-delimited sections, each of which contains several million elements, but the exact number varies from case to case.
(1
2
3
...
)
(11
22
33
...
)
(111
222
333
...
)
I need to achieve an output like:
1; 11; 111
2; 22; 222
3; 33; 333
... ... ...
I found a complicated way to do it:
perform sed operations to get:
1
2
3
...
#
11
22
33
...
#
111
222
333
...
use awk as follows to split my file into several sub-files:
awk -v RS="#" '{print > ("splitted-" NR ".txt")}'
remove the whitespace from my sub-files, again with sed:
sed -i '/^[[:space:]]*$/d' splitted*.txt
join everything together:
paste splitted*.txt > out.txt
add a field separator (defined in my bash script)
awk -v sep=$my_sep 'BEGIN{OFS=sep}{$1=$1; print }' out.txt > formatted.txt
I feel this is crappy, as I loop over millions of lines several times.
Even though the run time is quite OK (~80 sec), I'd like to find a pure awk solution but can't get to it.
Something like:
awk 'BEGIN{RS="(\\n)"; OFS=";"} { print something } '
I found some related questions, especially this one: row to column conversion with awk, but it assumes a constant number of lines between brackets, which I can't guarantee.
Any help would be appreciated.
With GNU awk for multi-char RS and true multi dimensional arrays:
$ cat tst.awk
BEGIN {
    RS = "(\\s*[()]\\s*)+"
    OFS = ";"
}
NR>1 {
    cell[NR][1]
    split($0,cell[NR])
}
END {
    for (rowNr=1; rowNr<=NF; rowNr++) {
        for (colNr=2; colNr<=NR; colNr++) {
            printf "%6s%s", cell[colNr][rowNr], (colNr<NR ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
If you know you have 3 columns, you can do it in a very ugly way, as follows:
pr -3ts <file>
All that needs to be done then is to remove your brackets:
$ pr -3ts ~/tmp/f | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'
1; 11; 111
2; 22; 222
3; 33; 333
...; ...; ...
You can also do it in a single awk line, but it just complicates things. The above is quick and easy.
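If the number of columns is not known in advance, one option (a sketch, assuming every section starts with "(" at the beginning of a line) is to count the sections first and hand that count to pr:

n=$(grep -c '^(' file)    # one "(" per section = number of columns
pr -"$n"ts file | awk 'BEGIN{OFS="; "}{gsub(/[()]/,"")}(NF){$1=$1; print}'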
This awk program does the full generic version:
awk 'BEGIN{ r=c=0 }
/)/ { r=0; c++; next }
{ gsub(/[( ]/,"") }
(NF){ a[r++,c]=$1; rm = (rm>r ? rm : r) }
END {
    for(i=0; i<rm; ++i) {
        printf "%s", a[i,0]                 # explicit format string, in case the data contains %
        for(j=1; j<c; ++j) printf "; %s", a[i,j]
        print ""
    }
}' <file>
Please try the following, assuming that your actual Input_file is the same as the samples shown.
awk -v RS="" '
{
gsub(/\n|, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
}
}
for(j=1;j<=num;j++){
print val[j]
}
delete val
delete array
value=""
}' OFS="; "
OR (the script above assumes the number of values inside (...) is constant; the following script works even when the field counts inside (...) are not equal):
awk -v RS="" '
{
gsub(/\n/,",")
gsub(/, /,",")
}
1' Input_file |
awk '
{
while(match($0,/\([^\)]*/)){
value=substr($0,RSTART+1,RLENGTH-2)
$0=substr($0,RSTART+RLENGTH)
num=split(value,array,",")
for(i=1;i<=num;i++){
val[i]=val[i]?val[i] OFS array[i]:array[i]
max=num>max?num:max
}
}
for(j=1;j<=max;j++){
print val[j]
}
delete val
delete array
}' OFS="; "
Output will be as follows.
1; 11; 111
2; 22; 222
3; 33; 333
Explanation: Adding an explanation for the above code here.
awk -v RS="" '   ##Setting RS (record separator) to NULL here.
{                ##Starting main BLOCK here.
  gsub(/\n/,",") ##Using gsub to substitute newlines with commas here.
  gsub(/, /,",") ##Using gsub to substitute comma-space with a plain comma here.
}
1' Input_file |  ##Mentioning 1 will print the edited/non-edited lines of Input_file. Using | sends this output as input to the next awk program.
awk '            ##Starting another awk program here.
{
  while(match($0,/\([^\)]*/)){ ##Using a while loop which runs as long as a match for (...) is found in the line.
    value=substr($0,RSTART+1,RLENGTH-2) ##Storing the substring starting at RSTART+1, RLENGTH-2 characters long, in the variable value here.
    $0=substr($0,RSTART+RLENGTH) ##Re-creating the current line from the substring starting at RSTART+RLENGTH through the end of the line.
    num=split(value,array,",") ##Splitting the variable value into an array named array, using comma as the delimiter.
    for(i=1;i<=num;i++){ ##Using a for loop which runs from i=1 to the value of num (the length of array).
      val[i]=val[i]?val[i] OFS array[i]:array[i] ##Creating array val, indexed by the variable i, concatenating its own values.
    }
  }
  for(j=1;j<=num;j++){ ##Starting a for loop from j=1 to the value of num here.
    print val[j] ##Printing the value of val indexed by j here.
  }
  delete val ##Deleting array val here.
  delete array ##Deleting array here.
  value="" ##Nullifying the variable value here.
}' OFS="; " ##Setting OFS to "; " (semicolon and a space) here.
NOTE: This should work for more than 3 values inside (...) brackets also.
awk 'BEGIN { RS = "\\s*[()]\\s*"; FS = "\\s*" }
NF > 0 {
    maxCol++
    if (NF > maxRow)
        maxRow = NF
    for (row = 1; row <= NF; row++)
        a[row,maxCol] = $row
}
END {
    for (row = 1; row <= maxRow; row++) {
        for (col = 1; col <= maxCol; col++)
            printf "%s", a[row,col] ";"
        print ""
    }
}' yourFile
output
1;11;111;
2;22;222;
3;33;333;
...;...;...;
Change FS = "\\s*" to FS = "\n*" when you also want to allow spaces inside your fields.
This script supports columns of different lengths.
When benchmarking, also consider replacing [i,j] with [i][j] for GNU awk. I'm unsure which one is faster and did not benchmark the script myself.
Here is a Perl one-liner solution:
$ cat edouard2.txt
(1
2
3
a
)
(11
22
33
b
)
(111
222
333
c
)
$ perl -lne ' $x=0 if s/[)(]// ; if(/(\S+)/) { @t=@{$val[$x]}; push(@t,$1); $val[$x++]=[@t] } END { print join(";",@{$val[$_]}) for(0..$#val) }' edouard2.txt
1;11;111
2;22;222
3;33;333
a;b;c
I would convert each section to a row and then transpose after, e.g. assuming you are using GNU awk:
<infile awk '{ gsub("[( )]", ""); $1=$1 } 1' RS='\\)\n\\(' OFS=';' |
datamash -t';' transpose
Output:
1;11;111
2;22;222
3;33;333
...;...;...
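If datamash is not available, the transpose step can also be done in awk; a minimal sketch, assuming the ;-joined rows produced by the first command:

awk -F';' '
{
    for (i = 1; i <= NF; i++)
        cell[i, NR] = $i                 # store by (column, row)
    maxNF = (NF > maxNF ? NF : maxNF)
}
END {
    for (i = 1; i <= maxNF; i++) {
        row = cell[i, 1]
        for (j = 2; j <= NR; j++)
            row = row ";" cell[i, j]
        print row
    }
}'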

Extract desired column with values

Please help me with this small script I am making. I am trying to extract some columns with values from a big tab-separated file (mainFileWithValues.txt), which has this format:
A B C ......... (total 700 columns)
80 2.08 23
14 1.88 30
12 1.81 40
The column names are in columnnam.nam:
cat columnnam.nam
A
B
.
.
.
(and so on, up to 20 names)
I am first getting the column number from the big file using:
sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c
Then I am extracting the values using cut.
I have made a for loop:
#!/bin/bash
for i in `cat columnnam.nam`
do
cut -f`sed -n "1 s/${i}.*//p" mainFileWithValues.txt | sed 's/[^\t*]//g' |wc -c` mainFileWithValues.txt > test.txt
done
cat test.txt
A
80
14
12
B
2.08
1.88
1.81
My problem is that I want the output test.txt to be in columns, like the main file, i.e.:
A B
80 2.08
How can I fix this in this script?
Here is a one-liner:
awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
Explanation:
[note: the header match is case-insensitive; remove tolower() if you want a strict match]
awk '
FNR==NR{ # Here we read columns.nam file
h[NR]=$1; # h -> array, NR -> as array key, $1 -> as array value
next # go to next line
}
{ # Here we read second file
for(i=1; i in h; i++) # iterate array h
{
if(FNR==1) # if we are reading 1st row of second file, will parse header
{
for(j=1; j<=NF; j++) # iterate over the fields of the 1st row
{
# if it was the field we are looking for
if(tolower(h[i])==tolower($j))
{
# then
# d -> array, i -> as array key which is column order number
# j -> as array value which is column number
d[i]=j;
break
}
}
}
# for all records
# if field we searched was found then print such field
# from d[i] we access, column number
printf("%s%s",i>1 ? OFS:"", i in d ? $(d[i]): "");
}
# print newline char
print ""
}
' columns.nam mainfile
Test Results:
$ cat mainfile
A B C
80 2.08 23
14 1.88 30
12 1.81 40
$ cat columns.nam
A
C
$ awk 'FNR==NR{h[NR]=$1;next}{for(i=1; i in h; i++){if(FNR==1){for(j=1; j<=NF; j++){if(tolower(h[i])==tolower($j)){d[i]=j; break }}}printf("%s%s",i>1 ? OFS:"", i in d ?$(d[i]):"")}print ""}' columns.nam mainfile
A C
80 23
14 30
12 40
You can also save it as a script file and run it:
akshay@db-3325:/tmp$ cat col_parser.awk
FNR == NR {
    h[NR] = $1;
    next
}
{
    for (i = 1; i in h; i++) {
        if (FNR == 1) {
            for (j = 1; j <= NF; j++) {
                if (tolower(h[i]) == tolower($j)) {
                    d[i] = j;
                    break
                }
            }
        }
        printf("%s%s", i > 1 ? OFS : "", i in d ? $(d[i]) : "");
    }
    print ""
}
akshay@db-3325:/tmp$ awk -v OFS="\t" -f col_parser.awk columns.nam mainfile
A C
80 23
14 30
12 40
Similar Answer
AWK to display a column based on Column name and remove header and last delimiter
Another awk approach:
awk 'NR == FNR {
    hdr[$1]
    next
}
FNR == 1 {
    for (i=1; i<=NF; i++)
        if ($i in hdr)
            h[i]
}
{
    s = ""
    for (i in h)
        s = s (s == "" ? "" : OFS) $i
    print s
}' column.nam mainFileWithValues.txt
A B
80 2.08
14 1.88
12 1.81
To get formatted output, pipe the above command to column -t.
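One caveat (my own note, not from the original answer): for (i in h) visits indices in an unspecified order in POSIX awk, so the selected columns may not always come out in file order. In GNU awk you can force ascending numeric index order by adding this before the loop, e.g. in a BEGIN block:

PROCINFO["sorted_in"] = "@ind_num_asc"   # GNU awk only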

change range of letters in a specific column

I want to change these characters in column 11:
!"#$%&'()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ
to these characters:
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi
So, if I have 000@! in column 11, it should become PPP_@. I tried awk:
awk '{a = gensub(/[@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi]/, /[!\"\#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ]/, "g", $11); print a }' file.txt
but it does not work...
Try Perl.
perl -lane '$F[10] =~ y/!"#$%&'"'"'()*+,-.\/0-9:;<=>?@A-J/@A-Z[\\]^_`a-i/;
print join(" ", @F)'
I am assuming that by "column 11" you mean a string of several characters after the tenth run of successive whitespace, which is what the -a option splits on by default (basically to simulate awk). Unfortunately, changes to the array @F do not show up in the output directly, so you have to reconstruct the output line from (the modified) @F, which will normalize the field delimiter to a single space.
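If the input is strictly tab-separated and the tabs should be preserved, something along these lines (untested, same transliteration) splits on tabs and joins the fields back with tabs:

perl -F'\t' -lane '$F[10] =~ y/!"#$%&'"'"'()*+,-.\/0-9:;<=>?@A-J/@A-Z[\\]^_`a-i/;
print join("\t", @F)' file.txt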
Just change f = 2 to f = 11:
$ cat tst.awk
BEGIN {
    f = 2
    old = "!\"#$%&'()*+,-.\\/0123456789:;<=>?@ABCDEFGHIJ"
    new = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghi"
    n = length(old)
    for (i=1; i<=n; i++) {
        map[substr(old,i,1)] = substr(new,i,1)
    }
}
{
    n = length($f)
    newStr = ""
    for (i=1; i<=n; i++) {
        oldChar = substr($f,i,1)
        newStr = newStr (oldChar in map ? map[oldChar] : oldChar)
    }
    $f = newStr
    print
}
$ cat file
a 000@! b
$ awk -f tst.awk file
a PPP_@ b
Note that you have to escape "s and \s in strings.

Add leading zeroes to awk variable

I have the following awk command within a "for" loop in bash:
awk -v pdb="$pdb" 'BEGIN {file = 1; filename = pdb"_" file ".pdb"}
/ENDMDL/ {getline; file ++; filename = pdb"_" file ".pdb"}
{print $0 > filename}' < ${pdb}.pdb
This reads a series of files named $pdb.pdb and splits them into files called $pdb_1.pdb, $pdb_2.pdb, ..., $pdb_21.pdb, etc. However, I would like to produce files with names like $pdb_01.pdb, $pdb_02.pdb, ..., $pdb_21.pdb, i.e., to add padding zeros to the "file" variable.
I have tried without success using printf in different ways. Help would be much appreciated.
Here's how to create leading zeros with awk:
# echo 1 | awk '{ printf("%02d\n", $1) }'
01
# echo 21 | awk '{ printf("%02d\n", $1) }'
21
Replace the 2 in %02d with the total number of digits you need (including the leading zeros).
Replace file on output with sprintf("%02d", file).
Or even replace the whole assignment with filename = sprintf("%s_%02d.pdb", pdb, file);.
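Applied to the original command, the whole thing might look like this (a sketch, untested, keeping the original getline-based structure):

awk -v pdb="$pdb" 'BEGIN {file = 1; filename = sprintf("%s_%02d.pdb", pdb, file)}
/ENDMDL/ {getline; file++; filename = sprintf("%s_%02d.pdb", pdb, file)}
{print $0 > filename}' < ${pdb}.pdb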
This does it without resorting to printf, which is expensive. The first parameter is the string to pad, the second is the total length after padding.
echo 722 8 | awk '{ for(c = 0; c < $2; c++) s = s"0"; s = s$1; print substr(s, 1 + length(s) - $2); }'
If you know in advance the length of the result string, you can use a simplified version (say 8 is your limit):
echo 722 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'
The result in both cases is 00000722.
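One caveat worth noting (my own observation): because only the last 8 characters are kept, a value wider than the pad length gets truncated. For example:

echo 123456789 | awk '{ s = "00000000"$1; print substr(s, 1 + length(s) - 8); }'

prints 23456789, not the full number.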
Here is a function that left- or right-pads values with zeroes, depending on the parameters: zeropad(value, count, direction)
function zeropad(s,c,d) {
    if (d != "r")
        d = "l"                       # l is the default and fallback value
    return sprintf("%" (d=="l" ? "0" c : "") "d" (d=="r" ? "%0" c-length(s) "d" : ""), s, "")
}
{   # test main
    print zeropad($1, $2, $3)
}
Some tests:
$ cat test
2 3 l
2 4 r
2 5
a 6 r
The test:
$ awk -f program.awk test
002
2000
00002
000000
It's not fully battle-tested, so strange parameters may yield strange results.
