Calling AWK inside Gnuplot produces two extra lines - Windows

I am using AWK to preprocess files for plotting and fitting in Gnuplot 5.2 on Windows 10, e.g. like this:
plot '<awk "{print}" file1.dat file2.dat'
While fit works fine, plot yields this message:
Bad data on line 1 of file <awk "{print}" file1.dat file2.dat
I took a look at the bad data with print system('awk "{print}" file1.dat file2.dat'), which shows that there are two extra lines in front of the data. I can even reproduce this with a minimal print system('awk ""'), which gives
fstat < 0: fd = 0
fstat < 0: fd = 2
Of course, if I just want to extract a number out of the AWK command, I can do something like
sum = real(substr(system('awk "{sum+=$2} END {print sum}" file1.dat'), 37,-1))
While this is annoying, it works. But I have not found any workaround for plot. Even better, I would like a solution that avoids the extra lines in the first place. Does anyone have an idea how to do that?
Here are two more test cases that might provide information:
If I run AWK in CMD, the extra lines are not there.
Other CMD commands also do not produce the lines in Gnuplot, e.g. if I call print system('echo test').

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
  [[ ! -f ${FILENAME}.${i}.xyz ]] && break
  cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
  mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
  let "n = 2 + (${j} * ${LINES_PER_CONF})"
  let "m = ${j} + 1"
  ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
  sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make the energy the title for its conformer in the combined file (where each energy must become the title of its specific conformer).
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to new names in another directory does require a loop, or one of the less-than-highly-portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
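For instance, the whole of Step 2 could be a single Awk pass that first loads the ACT file into an array and then rewrites the title lines of the combined file. This is an untested sketch; the awk variable lpc is set from your LINES_PER_CONF, and the temporary-file handling is only illustrative:
awk -v lpc="$LINES_PER_CONF" '
  NR == FNR { energy[NR] = $2; next }                              # first file: ACT file, energies indexed by line number
  (FNR - 2) % lpc == 0 { print energy[(FNR - 2) / lpc + 1]; next } # every title line gets its conformer energy
  { print }                                                        # all other lines pass through unchanged
' "$ACTFILE" "$COMBINED_FILE" > tmp && mv tmp "$COMBINED_FILE"
This avoids re-reading $ACTFILE and rewriting $COMBINED_FILE once per conformer, which is likely where most of the time in the original loop goes.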
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before concatenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number in the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
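For illustration only (GNU Awk; summing column 2 per file is just a stand-in for whatever per-file calculation you actually need):
gawk '
  BEGINFILE { sum = 0 }              # reset the accumulator before each input file
  { sum += $2 }                      # accumulate over the records of the current file
  ENDFILE   { print FILENAME, sum }  # report once per file
' ${FILENAME}.[0-9][0-9][0-9].xyz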
If you really want to materialize all the file names without globbing, you can always jot them (jot is like seq, but with more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Slight error when using awk to remove spaces from a CSV column

I have used the following awk command in my bash script to delete spaces from the 26th column of my CSV:
awk 'BEGIN{FS=OFS="|"} {gsub(/ /,"",$26)}1' original.csv > final.csv
Out of 400 rows, I have about 5 random rows that this doesn't work on even if I rerun the script on final.csv. Can anyone assist me with a method to take care of this? Thank you in advance.
EDIT: Here is a sample of the 26th column in original.csv vs. final.csv, respectively:
2212026837 2212026837
2256 41688 6 2256416886
2076113566 2076113566
2009 84517 7 2009845177
2067950476 2067950476
2057 90531 5 2057 90531 5
2085271676 2085271676
2095183426 2095183426
2347366235 2347366235
2200160434 2200160434
2229359595 2229359595
2045373466 2045373466
2053849895 2053849895
2300 81552 3 2300 81552 3
I see two possibilities.
The simplest is that you have some whitespace other than a space. You can fix that by using a more general regex in your gsub: instead of / /, use /[[:space:]]/.
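Applied to your command, that would be:
awk 'BEGIN{FS=OFS="|"} {gsub(/[[:space:]]/,"",$26)}1' original.csv > final.csv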
If that solves your problem, great! You got lucky, move on. :)
The other possible problem is trickier. The CSV (or, in this case, pipe-SV) format is not as simple as it appears, since you can have quoted delimiters inside fields. This, for instance, is a perfectly valid 4-field line in a pipe-delimited file:
field 1|"field 2 contains some |pipe| characters"|field 3|field 4
If the first 4 fields on a line in your file looked like that, your gsub on $26 would actually operate on $24 instead, leaving $26 alone. If you have data like that, the only real solution is to use a scripting language with an actual CSV parsing library. Perl has Text::CSV, but it's not installed by default; Python's csv module is, so you could use a program like so:
import csv, fileinput as fi, re

for row in csv.reader(fi.input(), delimiter='|'):
    row[25] = re.sub(r'\s+', '', row[25])  # fields start at 0 instead of 1
    print('|'.join(row))
Save the above in a file like colfixer.py and run it with python colfixer.py original.csv >final.csv.
(If you tried hard enough, you could get that shoved into a -c option string and run it from the command line without creating a script file, but Python's not really built for that and it gets ugly fast.)
You can use the string function split, and iterate over the corresponding array to reassign the 26th field:
awk 'BEGIN{FS=OFS="|"} {
  n = split($26, a, /[[:space:]]+/)
  $26 = a[1]
  for (i=2; i<=n; i++)
    $26 = $26 "" a[i]
}1' original.csv > final.csv

Comparing/finding the difference between two text files using findstr

I have a requirement to compare two text files and to find out the difference between them. Basically I have an input file (input.txt) which will be processed by a batch job and my batch will log the output (successful.txt) where the job has successfully ran.
In simple words, I need to find out the difference between input.txt and successful.txt (input.txt-successful.txt) and I was thinking to use findstr. It seems to be fine, BUT I don't understand one part of it. It always includes the last line of my input.txt in the output. You could see that in the example below. Please note that there is no leading space or line break after the last line of my input.txt.
In the example below, you can see that the line server1,db1 is present in both files, but it is still listed in the output. (It is always the last line of input.txt.)
D:\Scripts\dummy>type input.txt
server2,db2
server3,db3
server10,db10
server4,db4
server1,db11
server10,schema11
host1,sch2
host11,sql2
host11,sql3
server1,db1
D:\Scripts\dummy>type successful.txt
server1,db1
server2,db2
server3,db3
server4,db4
server10,db10
host1,sch2
host11,sql2
host11,sql3
D:\Scripts\dummy>findstr /vixg:successful.txt input.txt
server1,db11
server10,schema11
server1,db1
What am I doing wrong?
Cheers,
G
I could reproduce your results by removing the newline after the last line of input.txt, so solution 1 is to add a newline to the end of input.txt. Since you say that input.txt has no terminal newline, adding one should cure the problem; findstr is behaving as expected, because it operates on newline-terminated lines.
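For example, one way to append that trailing newline from the command prompt (just one option; any editor that saves a final newline will do the same):
echo.>> input.txt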
Solution 2 would be
type input.txt|findstr /vixg:successful.txt

Search a specific line for a value within a range. Unix bash script

I'd like to jump to a specific line in a file, line 33866. If the third number in this line is within the range -10 and +10 then I'd like to print the entire next line, 33867, to a file and stop.
If it isn't, it should look at line 67893 (a difference of +34027); if that one is in the range, print the next line and stop.
This should continue, next looking at line 101920 (difference of +34027) and so on, until it finds a value in that range or reaches the end of the file.
Now, regardless of whether or not that printed anything, I need it to repeat the process at a new starting line; this time the new start line is 33869 (a difference of 3), to print line 33870 to the same file.
Ideally, it would repeat n times, n being a value read from the user when the script is run.
Please stop me right there if this is too much to ask and I'll go back to banging my head against the wall and searching the net for how to make this work on my own. Also let me know if I'm going about this the wrong way by trying to jump to specific lines and should instead search for the lines by another means.
Any input greatly appreciated!
Edit:
Here is an example of the two lines being handled:
17.33051021 18.02125499 30.40520932
1.776579372 -23.74037576 12.48448432
with the first number starting in column 6, the second in column 26 and the third in column 46 (if the minus sign is ignored I don't think it will matter).
Reading your question, I guess your file could be pretty big. Also, I assume "the 3rd number" means the 3rd field, so I came up with this one-liner:
awk -v l=33866 -v d=34027 'NR==l{if($3>=-10&&$3<=10){p=1;next}l+=d} p{print;exit}' file
You just need to change the two arguments: l (the first line number to check) and d (the difference).
After it finds the right line, awk prints the following line and stops processing the rest of your file.
I didn't test this, so sorry if there are typos, but it shows my idea.
You should give some example input, etc. The 3rd number: what is that? The 3rd field? Or, in a line like aa bb 2 dfd 3 asf 555, the 555?
Another thing: you should also show what you have already tried for your problem.
Since we don't have any input to test with, I am giving you an answer without testing.
tl=$(wc -l < input)   # total number of lines in the file
awk -v tl="$tl" '{
  for (i=33866; i<tl; i+=34027) {
    if (NR==i && $3 >= -10 && $3 <= 10) {
      getline    # advance to the next line
      print      # print it
      exit
    }
  }
}' input
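Neither snippet covers the "repeat n times, moving the start line up by 3 each pass" part of the question. A rough, untested bash wrapper around the first one-liner might look like this (the input and output file names are placeholders):
read -p "Number of passes: " n      # n is supplied by the user when the script is run
for ((k = 0; k < n; k++)); do
  start=$((33866 + 3 * k))          # start lines 33866, 33869, 33872, ...
  awk -v l="$start" -v d=34027 '
    NR==l { if ($3 >= -10 && $3 <= 10) { p=1; next } l += d }
    p { print; exit }
  ' input >> results.txt
done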

BASH awk field separator to perform calculation

I basically have a text file containing data which consists of times in this format
(00:00)
(06:08)
(07:54)
I've done my share of research to determine how to take only those specific times (filtering out all the other gunk), and even separate them using awk. The problem occurs when I attempt to add the digits: I seem to be getting a single-digit 0 value...
My code is as follows:
cat somefile.txt | awk -F: '/([0-9][0-9]:[0-9][0-9])/{total+=$1}END{print total}'
Note: I am using brackets because the file contains other times NOT enclosed in brackets... So I've attempted to get rid of the unwanted data, leaving me with only the portions shown above, e.g. (00:00).
I'm trying to add the left portions (the hours) together, which in this example should obviously give a total of 13.
Really hope I can find a solution, and I'm sure it's something that is not exactly coming to my mind at the moment.
try this line:
awk -F'[(:]' '{h+=$2}END{print h}' file
example:
kent$ echo "(00:00)
(06:08)
(07:54)"|awk -F'[(:]' '{h+=$2}END{print h}'
13
EDIT
[(:] (I really want to type :) ) defines two field separators, ( and :, so the line would be split like this:
(06:08)
field 1 - ""
field 2 - 06
field 3 - 08)
If you want to get only the minutes, you need to add ) to your FS too; otherwise you get $3 with the trailing ).
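For example, to sum the minutes instead, something like this should work:
awk -F'[(:)]' '{m+=$3} END{print m}' file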

Resources