Compare file with different number format - sorting

First of all, I'd like to thank this community; you have been helping me tremendously over the past couple of months, thanks to your detailed answers and comments.
However, I came across a snag. I want to compare 2 files containing simulation data. These files are the result of a previous operation that consists of extracting the desired data from 2 output files.
So output-file1-> sorteddata1
Output-file2-> sorteddata2
Sorteddata1 looks like this:
0.200000e-4 a b c d e
0.400000e-4 f g h i j
0.560000e-4 k l m n o
.
.
.
Sorteddata2 looks like this:
2.000000E-5 A
3.600000E-5 B
5.600000E-5 C
.
.
.
And what I would like is this, sorteddata3:
0.200000e-4 a b c d e A
0.400000e-4 f g h i j
0.560000e-4 k l m n o C
.
.
.
So if the number in the first column is the same, add the corresponding value from sorteddata2 in the 7th column of sorteddata1.
I wanted to start from here:
Compare files with awk
But the number format in the first column of each file is different, so I don't get any matches. I really want to use awk for this (personal preference; I kind of like it).
The goal is to plot this using gnuplot, so hopefully a blank in the last column won't be a problem.
Any thoughts on this?

You can use sprintf to normalize the numbers to the same format:
sprintf(format, expression1, ...)
Return (without printing) the string that printf would have printed
out with the same arguments (see Printf).
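As a quick sanity check (a minimal sketch; the exact exponent width may vary between awk implementations), both notations collapse to the same canonical string:
$ awk 'BEGIN { printf "%e\n%e\n", "0.200000e-4", "2.000000E-5" }'
2.000000e-05
2.000000e-05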
Then the logic is the same as in the linked answer: print the current line, either alone or together with the matched value from the other file.
awk 'NR==FNR {value=sprintf("%e", $1)
a[value]=$2
next
}
{value2=sprintf("%e", $1)
print $0, a[value2]
}' f2 f1
For your given input, it returns:
$ awk 'NR==FNR{value=sprintf("%e", $1); a[value]=$2; next} {value2=sprintf("%e", $1); print $0, a[value2]}' f2 f1
0.200000e-4 a b c d e A
0.400000e-4 f g h i j
0.560000e-4 k l m n o C
Note: in the comments you say that the E format gives you an "unterminated string" error. In that case, you can replace the E with an e in the number with sub("E","e",$1). All together:
awk 'NR==FNR{value=sprintf("%e", $1); a[value]=$2; next} {sub("E","e",$1); value2=sprintf("%e", $1); print $0, a[value2] }' f2 f1

Related

grep a list into a multi columns file and get fully matching lines

Not sure how to phrase this question, but an example will surely clarify it. Suppose I have this file:
$ cat intoThat
a b
a h
a l
a m
b c
b d
b m
c b
c d
c f
c g
c p
d h
d f
d p
and this list:
$ cat grepThis
a
b
c
d
Now I would like to grepThis intoThat, and I would do this:
$ grep -wf grepThis intoThat
which will give an output like this:
**a b**
a h
a l
a m
**b c**
**b d**
b m
**c b**
**c d**
c f
c g
c p
d h
d f
d p
Now, the asterisks are used to highlight the lines that I would like grep to return. These are the lines that have a full match, but how do I tell grep (or awk or whatever) to get only these lines?
Of course it is possible that some lines do not match any pattern, e.g. in the intoThat file I may have some other letters like g, h, l, s, t, etc...
With awk, you could do:
awk 'NR==FNR{ seen[$0]++; next } ($1 in seen && $2 in seen)' grepThis intoThat
a b
b c
b d
c b
c d
NR is set to 1 for the first record awk reads and increments for every subsequent record, across all input files, until every record/line has been read.
FNR is also set to 1 for the first record and increments for every record read in the current file, but it resets to 1 at the start of each new input file when there are multiple input files.
So NR == FNR is true only while the first input file is being read, and the block that follows it performs actions on the first file only.
seen is an associative awk array (you can use a different name if you want) keyed on the whole line $0, whose value counts the occurrences of each line (this idiom is also commonly used to remove duplicated records, as in awk '!seen[$0]++' file).
The next statement skips the rest of the commands, so they actually execute only for the following file(s), not the first.
In the final (...), we just check whether both $1 and $2 are present in the array; if so, the line goes to the output.
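A quick way to watch the two counters side by side, run on the sample files above:
$ awk '{ print FILENAME, "FNR=" FNR, "NR=" NR }' grepThis intoThat | head -7
grepThis FNR=1 NR=1
grepThis FNR=2 NR=2
grepThis FNR=3 NR=3
grepThis FNR=4 NR=4
intoThat FNR=1 NR=5
intoThat FNR=2 NR=6
intoThat FNR=3 NR=7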

Grep a set of values appearing multiple times in a file and display in a specific way

I am new to shell scripting. I have the requirement described below.
INPUT FILE 1
START
A X|
M Q|
B Y|
C Z|
D U|
END
INPUT FILE 2
START
A X1|
M Q1|
B Y1|
C Z1|
D U1|
END
START
A X2|
M Q2
B Y2|
C Z2|
D U2|
END
Expected output
X,Y,Z
X1,Y1,Z1
X2,Y2,Z2
The files range from a few MBs to 10 GB.
I tried a few combinations of
grep -f patternfile file1....N >> file.txt
and awk with a transpose step.
Is there a more effective way of doing the same without hampering performance?
Thanks in advance.
To get values for keys A, B and C use the following awk approach:
awk '$1~/^(A|B|C)$/{ sub("\\|","",$2); s=($1=="C")?"\n":","; printf "%s%s",$2,s }' file[12]
The output:
X,Y,Z
X1,Y1,Z1
X2,Y2,Z2
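For readability, here is the same logic as a commented script (a sketch; save it as, say, extract.awk, a name used only for illustration, and run awk -f extract.awk file1 file2):
$1 ~ /^(A|B|C)$/ {               # only act on the keys of interest
    sub(/\|/, "", $2)            # strip the trailing pipe from the value
    s = ($1 == "C") ? "\n" : "," # C closes a record: newline, otherwise comma
    printf "%s%s", $2, s         # print the value followed by its separator
}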

Split up line with arbitrary many groups

I have many files with many entries (one entry per line) which I have to filter through a sequence of greps and seds. The lines are of the form
a
x, y
u --> v, w
s --> p, q, r
One of the steps is splitting up the lines containing --> such that the left-hand side and each of the comma-separated entries on the right side (of which there can be arbitrarily many) end up on different lines. I.e., the above lines should become:
a
x, y
u
v
w
s
p
q
r
Separating the left side from the right side is quickly done:
echo "u --> v, w" | sed 's/\(.\+\)\s*\-\->\s*\(.\+\)/\1\n\2/'
Gives me
u
v, w
But this seems to be a dead end in that I cannot then pipe this on to splitting on the comma, since that would also split the x, y.
So, I am wondering if there is a way to completely split up such lines in a sed command, or do I have to turn to, e.g., awk (or just go to Python)? It would be preferable to keep this a bash pipe sequence.
awk '/-->/ {gsub(/-->|,/,RS)}1' inputfile|column -t
a
x, y
u
v
w
s
p
q
r
Or, as Anubhav suggested, to avoid the pipe:
awk '/-->/ {gsub(/[ \t]*(-->|,)[ \t]*/ , ORS)} 1' inputfile
Using awk you can do this:
awk -F'[ \t]*-->[ \t]*' -v OFS='\n' '{gsub(/,[ \t]*/, OFS, $2)} 1' file
a
x, y
u
v
w
s
p
q
r
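Here the field separator consumes the arrow, and gsub() replaces the commas inside $2; because modifying a field makes awk rebuild $0 with OFS (here a newline) between the fields, matching lines come out already split. A minimal sketch of that rebuild effect in isolation (the $1 = $1 assignment exists only to force the rebuild):
$ echo 'u|v, w' | awk -F'|' -v OFS='\n' '{ $1 = $1 } 1'
u
v, w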
You can do this by creating a command group when you match -->. In this group, you replace --> with newline, print up to the newline, discard the portion you printed, then replace commas in the remainder:
#!/bin/sed -f
/\s*-->\s*/{
s//\n/
P
s/.*\n//
s/,\s*/\n/g
}
Results:
a
x, y
u
v
w
s
p
q
r
Alternatively, in GNU sed, you could use the T command to skip processing of the right-hand side unless you match and replace the -->:
#!/bin/sed -f
s/\s*-->\s*/\n/
Tend
P
s/.*\n//
s/,\s*/\n/g
:end
This produces the same output, as required.
I've assumed throughout that you don't want to split any commas on the left-hand side, so that
foo, bar --> baz
becomes
foo, bar
baz
If that's not the case (perhaps if you know there will be no comma to the left of -->), then you don't need P or s/.*\n//, and the script is as simple as
/\s*-->\s*/!b
s//\n/
s/,\s*/\n/g
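A quick check of this short version on the sample input (a sketch; it assumes GNU sed for \s, \n, and the empty-regex reuse, with the three lines above saved in a hypothetical file split.sed):
$ printf 'a\nx, y\nu --> v, w\ns --> p, q, r\n' | sed -f split.sed
a
x, y
u
v
w
s
p
q
r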

I would like to sort rows of a data file by NF increasing

I would like to sort rows of a data file by NF increasing.
input
z a b c d k l p m
m x y h j i
y w
g t y u
output
y w
g t y u
m x y h j i
z a b c d k l p m
I tried the sort command, but it doesn't work.
How can I do this?
Thanks for the help.
Typically you solve these types of problems by modifying the input stream to add some data, operating on that data, and then removing it. In this case, we want to add the field count to the input stream, sort (numerically) on the field count, and then remove it (using a space as the field delimiter):
awk '{ print NF, $0 }' | sort -n | cut -d' ' -f2-
You can either pipe your data to awk or pass the filename as another argument to awk.
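For example, with the input above saved in a (hypothetical) file data.txt:
$ awk '{ print NF, $0 }' data.txt | sort -n | cut -d' ' -f2-
y w
g t y u
m x y h j i
z a b c d k l p m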

Shell script to parse multiple rows from a single column

I am working through a really complex and long multi-conditional statement to do this and was wondering if anyone knew of a simpler method. I have a multi-column/multi-row list that I am trying to parse. What I need to do is take the first row, which has the "*****" placeholder in the 5th position, copy its entries into the blank spaces on the next few rows, and then discard the original top row. What complicates this a bit is that sometimes the next few rows may not have an empty space in all the other fields (see the bottom half of the original list). If that is the case, I want to take the extra entry (Q1 below) and put it at the end of the row, in a new column.
Original list:
A B C D ***** F G
E1
E2
E3
Q R S T ***** V W
U1
Q1 U2
Final output:
A B C D E1 F G
A B C D E2 F G
A B C D E3 F G
Q R S T U1 V W
Q R S T U2 V W Q1
Thanks in advance for help!
The concise/cryptic one-liner:
awk '/[*]/{f=$0;p="[*]+";next}{r=$2?$2:$1;sub(p,r,f);p=r;print $2?f" "$1:f}' file
A B C D E1 F G
A B C D E2 F G
A B C D E3 F G
Q R S T U1 V W
Q R S T U2 V W Q1
Explanation:
/[*]+/ { # If the line contains the pattern to replace
line = $0 # Store line
pat="[*]+" # Store pattern
next # Skip to next line
}
{
if (NF==2) # If the current line has 2 fields
replace = $2 # We want to replace with the second
else # Else
replace = $1 # We want to replace with the first
sub(pat,replace,line) # Do the substitution
pat=replace # Next time the pattern to replace will have changed
if (NF==2) # If the current line has 2 fields
print line,$1 # Print the line with the replacement and the 1st field
else # Else
print line # Just print the line with the replacement
}
To run the script save it to a file such as script.awk and run awk -f script.awk file.
