Parse CSV data between two strings and include string from line below - shell

I have files containing data sampled at 20Hz. Appended to each line are data packets from an IMU that are not synchronised with the 20Hz data. The IMU data packets have a start marker (255,90) and an end marker (51). I am using the term packet for brevity; they are just comma-separated values. Packet1 is not the same length as packet2, and so on.
"2019-12-08 21:29:11.90",3390323,..[ CSV data ]..,1,"1 1025.357 ",[ incomplete packet from line above ],51,255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ]
"2019-12-08 21:29:11.95",3390324,.............,1,"1 1025.367 ",[ Second part of packet4 ],51,255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ]
I would like to parse the file so that I extract the time stamp together with the IMU packets from the first start marker onwards, take the partial packet from the start of the next line, and append it to the end of the line, so the output is in the form:
"2019-12-08 21:29:11.90",255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ][ Second part of packet4 ],51
"2019-12-08 21:29:11.95",255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ][ Second part of packet8 ],51
As requested, I have included my real-world example. This is just five lines. The last line would be deleted as it would remain incomplete.
"2019-08-28 10:43:46.2",10802890,32,22.1991,-64,"1 1015.400 ",0,0,0,0,67,149,115,57,11,0,63,24,51,255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0
"2019-08-28 10:43:46.25",10802891,32,22.1991,-64,"1 1015.400 ",0,0,0,0,118,76,101,57,11,0,32,249,51,255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0
"2019-08-28 10:43:46.3",10802892,32,22.1991,-64,"1 1015.400 ",0,0,0,0,151,113,95,57,11,0,72,194,51,255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0
"2019-08-28 10:43:46.35",10802893,32,22.1991,-64,"1 1015.400 ",0,0,0,0,110,51,95,57,11,0,9,37,51,255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0
"2019-08-28 10:43:46.4",10802894,32,22.1991,-64,"1 1015.400 ",0,0,0,0,173,103,97,57,11,0,185,229,51,255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0
I would like to parse the data to the following format:
"2019-08-28 10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51
"2019-08-28 10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51
"2019-08-28 10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51
"2019-08-28 10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51
"2019-08-28 10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0
This last line would remain incomplete as there is no next line.

When you are dealing with fields, you should be thinking awk. In this case, awk provides a simple solution -- so long as your record format does not change. Generally that wouldn't matter, but here it does...
Why? Because your wanted output does not match your problem description.
Why? Because in all records other than the fourth, the first 51 (which ends the data to be appended to the previous line) is located in field 19 (with ',' as the field separator), while in the fourth record it is found in field 12.
So normally you would just scan forward through your fields to find the first 51, eliminating the need to know which field the first 51 is found in -- but using that method with your data (sketched below) does not produce your wanted results. (The 3rd output line would get a short remainder from the 4th input line, reducing its length and instead forcing the additional packet data onto the fourth line of output.)
However, sacrificing that flexibility and considering fields 7-19 to be packets belonging with the previous line allows your wanted output to be matched exactly. (It also simplifies the script, but at the cost of flexibility in record format.)
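For illustration only (this is an added sketch, not the script that follows), the scan-forward variant described above might look roughly like this; it is flexible about the record format, but on this data it splits the fourth record at field 12 and so misplaces that record's packet data:
#!/usr/bin/awk -f
# Sketch of the flexible "find the first 51" approach -- shown only to
# illustrate why it does not reproduce the wanted output for this data.
BEGIN { FS = "," }
NF > 1 {
    end = 7                               # scan forward for the field holding
    while (end <= NF && $end != 51)       # the first 51 after the CSV columns
        end++
    if (length(packets) > 0) {            # complete the previous line's packet
        for (i = 7; i <= end; i++)
            packets = packets "," $i
        print dtfield packets
        packets = ""
    }
    dtfield = $1                          # store the date field
    for (i = end + 1; i <= NF; i++)       # save the rest for the next record
        packets = packets "," $i
}
END { print dtfield packets }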
A short awk script taking the file to process as its first argument can be written as follows:
#!/usr/bin/awk -f
BEGIN { FS=","; dtfield=""; packets=""; pkbeg=7; pkend=19 }
NF > 1 {
    if (length(packets) > 0) {          # handle 1st part of next line
        for (i=pkbeg; i<=pkend; i++)    # append packet data through field 19
            packets = packets "," $i
        print dtfield packets           # output the date and packet data
        packets = ""                    # reset packet data empty
    }
    dtfield = $1                        # for every line, store date field
    for (i=pkend+1; i<=NF; i++)         # loop from field 20 to end, saving data
        packets = packets "," $i
}
END {
    print dtfield packets               # output final line
}
Don't forget to chmod +x scriptname to make the script executable.
Example Use/Output
(non-fixed width due to output line length -- as was done in the question)
$ ./imupackets.awk imu
"2019-08-28
10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51
"2019-08-28
10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51
"2019-08-28
10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51
"2019-08-28
10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51
"2019-08-28
10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0
Look things over and let me know if you have questions.

The following command pipes your_input_file into a sed command (GNU sed 4.8) that accomplishes the task. At least it works for me with the files you provided (as they are at the time of writing, empty lines included).
cat your_input_file | sed '
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}'
Clearly, you can save the sed script in a file (named, for instance, myscript.sed):
#!/usr/bin/sed -f
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}
and use it like this: ./myscript.sed your_input_file.
Note that if the first ,51, on each line is guaranteed to be followed by 255,90, (something which your fourth example violates, ",0,0,0,0,110,51,95,), then the first substitution command reduces to s/,51,/,51\n,/.
Please, test it and let me know if I have correctly interpreted your question. I have not explained how the script works for the simple reason that it will take considerable time for me to write down an explanation (I tend to be fairly meticulous when walking through a script, as you can see here, where I created another analogous sed script), and I want to be sure it does represent a solution for you.
Maybe shorter solutions are possible (even with sed itself). I'm not sure awk would allow a shorter solution; it would certainly offer infinitely more readability than sed, but (I think) at the price of length. Indeed, as you can see from another answer, the awk script is more readable but longer (369 characters/bytes vs the sed script's 160 bytes).
Actually, even in the world of sed scripts, the one above is fairly inefficient, I guess, as it basically preprocesses each line and keeps appending each one to all the preceding ones, then does some processing on the resulting long multi-line string and prints it to screen.

Related

Is there a faster way to combine files in an ordered fashion than a for loop?

For some context, I am trying to combine multiple files (in an ordered fashion) named FILENAME.xxx.xyz (xxx starts from 001 and increases by 1) into a single file (denoted as $COMBINED_FILE), then replace a number of lines of text in the $COMBINED_FILE taking values from another file (named $ACTFILE). I have two for loops to do this which work perfectly fine. However, when I have a larger number of files, this process tends to take a fairly long time. As such, I am wondering if anyone has any ideas on how to speed this process up?
Step 1:
for i in {001..999}; do
    [[ ! -f ${FILENAME}.${i}.xyz ]] && break
    cat ${FILENAME}.${i}.xyz >> ${COMBINED_FILE}
    mv -f ${FILENAME}.${i}.xyz ${XYZDIR}/${JOB_BASENAME}_${i}.xyz
done
Step 2:
for ((j=0; j<=${NUM_CONF}; j++)); do
    let "n = 2 + (${j} * ${LINES_PER_CONF})"
    let "m = ${j} + 1"
    ENERGY=$(awk -v NUM=$m 'NR==NUM { print $2 }' $ACTFILE)
    sed -i "${n}s/.*/${ENERGY}/" ${COMBINED_FILE}
done
I forgot to mention: there are other files named FILENAME.*.xyz which I do not want to append to the $COMBINED_FILE
Some details about the files:
FILENAME.xxx.xyz are molecular xyz files of the form:
Line 1: Number of atoms
Line 2: Title
Line 3-Number of atoms: Molecular coordinates
Line (number of atoms +1): same as line 1
Line (number of atoms +2): Title 2
... continues on (where line 1 through Number of atoms is associated with conformer 1, and so on)
The ACT file is a file containing the energies which has the form:
Line 1: conformer1 Energy
Line 2: conformer2 Energy2
Where conformer1 is in column 1 and the energy is in column 2.
The goal is to make each conformer's energy the title line for that conformer in the combined file.
If you know that at least one matching file exists, you should be able to do this:
cat -- ${FILENAME}.[0-9][0-9][0-9].xyz > ${COMBINED_FILE}
Note that this will match the 000 file, whereas your script counts from 001. If you know that 000 either doesn't exist or isn't a problem if it were to exist, then you should just be able to do the above.
However, moving these files to renamed names in another directory does require a loop, or one of the less-than-highly portable pattern-based renaming utilities.
If you could change your workflow so that the filenames are preserved, it could just be:
mv -- ${FILENAME}.[0-9][0-9][0-9].xyz ${XYZDIR}/${JOB_BASENAME}
where we now have a directory named after the job basename, rather than a path component fragment.
The Step 2 processing should be doable entirely in Awk, rather than a shell loop; you can read the file into an associative array indexed by line number, and have random access over it.
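For example, a minimal sketch of that idea, assuming the layout from the question (the energy in column 2 of line j+1 of $ACTFILE becomes line 2 + j*LINES_PER_CONF of the combined file; the .tmp file name is just an illustration), could be:
# Sketch only -- assumes LINES_PER_CONF is set and the combined file already
# exists; replaces each conformer's title line with its matching energy in a
# single pass instead of running one sed -i per conformer.
awk -v lpc="$LINES_PER_CONF" '
    NR == FNR { energy[NR] = $2; next }                  # first file (ACT): store energies by line
    (FNR - 2) % lpc == 0 && ((FNR - 2) / lpc + 1) in energy {
        $0 = energy[(FNR - 2) / lpc + 1]                 # title line of conformer j+1
    }
    { print }
' "$ACTFILE" "$COMBINED_FILE" > "${COMBINED_FILE}.tmp" &&
    mv "${COMBINED_FILE}.tmp" "$COMBINED_FILE"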
Awk can also accept multiple files, so the following pattern may be workable for processing the individual files:
awk 'your program' ${FILENAME}.[0-9][0-9][0-9].xyz
for instance just before catenating and moving them away. Then you don't have to rely on a fixed LINES_PER_CONF and such. Awk has the FNR variable, which is the record number within the current file; condition/action pairs can tell when processing has moved on to the next file.
GNU Awk also has extensions BEGINFILE and ENDFILE, which are similar to the standard BEGIN and END, but are executed around each processed file; you can do some calculations over the record and in ENDFILE print the results for that file, and clear your accumulation variables for the next file. This is nicer than checking for FNR == 1, and having an END action for the last file.
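As a small illustration (the per-file action here, counting records, is just a placeholder assumption):
# gawk-only sketch: BEGINFILE/ENDFILE run around each input file.
gawk '
    BEGINFILE { n = 0 }                          # reset the accumulator per file
    { n++ }                                      # per-record work goes here
    ENDFILE   { print FILENAME ": " n " records" }
' "${FILENAME}".[0-9][0-9][0-9].xyz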
If you really want to materialize all the file names without globbing, you can always use jot (it's like seq, but keeps more integer digits in its default mode before switching to scientific notation):
jot -w 'myFILENAME.%03d' - 0 999 |
mawk '_<(_+=(NR == +_)*__)' \_=17 __=91 # extracting fixed interval
# samples without modulo(%) math
myFILENAME.016
myFILENAME.107
myFILENAME.198
myFILENAME.289
myFILENAME.380
myFILENAME.471
myFILENAME.562
myFILENAME.653
myFILENAME.744
myFILENAME.835
myFILENAME.926

Replace a part of a file by a part of another file

I have two files containing a lot of floating-point numbers. I would like to replace one of the floating-point numbers in File 1 with a floating-point number from File 2, using line and character positions to locate the numbers (and not their values).
There are a lot of topics on the subject, but I couldn't find anything that copies the values from a second file.
Here are examples of my two files:
File1:
14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
File2:
Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869
In this particular example, I would like to replace the first number of the second line of File1 (2.64895E-01) by the second floating number written on line 5 of File2 (0.65728944).
Note: the value of the numbers will change according to which file I consider, so I have to identify the numbers by their positions inside the files.
I am very new to writing bash scripts and have only used the "sed" command until now to modify my files.
Any help is welcome :)
Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:
awk 'NR==5 {val=$2} NR>FNR {FNR==2 && $1=val; print}' file2 file1
Explanation: read file2 first, and store the second field of the 5th record in variable val (the first part: NR==5 {val=$2}). Then, read file1 and print every line, but replace the first field of the second record (FNR is the current-file record number, while NR is the total number of records across all files so far) with the value stored in val.
In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions will get executed. $1..$NF are variables with field values, and each line (record) is split into fields on the field separator (FS variable, or -F'..' option), which defaults to a space.
The result (output):
14 4
0.53758275 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01

How can I get only special strings (by condition) from a file?

I have a huge text file with strings of a special format. How can I quickly create another file with only the strings corresponding to my condition?
for example, file contents:
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET
http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithzombie.jpg"
and I only need the strings with "myRule" and "cat"?
I think it should be perl, or bash, but it doesn't matter.
Thanks a lot, sorry for noob question.
Is it correct, that each entry is two lines long? Then you can use sed:
sed -n '/myRule/ {N }; /myRule.*cat/ {p}'
The first rule appends the next line to the pattern space when myRule matches.
The second rule tries to match myRule followed by cat in the pattern space; if found, it prints the pattern space.
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern computer systems), then the only way is to either change the record separator or to read the lines in pairs.
This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline
perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt
and this is the second
perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt
When given your sample data as input, both methods produce this output
[2/Nov/2015][rule="myRule"]"GET
http://uselesssotialnetwork.com/picturewithcat.jpg"

Delete lines in a file based on first row

I'm trying to work on a whole series of txt files (actually .out, but they behave like space-delimited txt files). I want to delete certain lines in the text, based on the values in a column named in the first row.
So for example:
ID VAR1 VAR2
1 8 9
2 4 1
3 3 2
I want to delete all the lines with VAR1 < 0.5.
I found a way to do this manually in Excel, but with 350+ files this is going to be a long night; there are surely ways to do this more effectively. I have already worked on this set of files in the terminal (OS X).
This is a typical job for awk, the venerable language for file manipulation.
What awk does is match each line in a file against a condition, and provide an action for it. It also allows for easy elementary parsing of line columns. In this case, you want to test whether the second column is less than 0.5, and if so not print that line. Otherwise, print the line (in effect this removes the lines for which the variable is less than 0.5).
Your variable is in column 2, which in awk is referred to as $2. Each full line is referred to by the variable $0.
So you would do something like this:
{
    if ($2 < 0.5) {
    }
    else {
        print $0
    }
}
Or something like that; I haven't used awk for a while. The above code is an awk script. Apply it to your file, and redirect the output to a new file (which will have all the lines not satisfying the condition removed).
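For instance, assuming you save the block above as filter.awk (the file names here are only an illustration), one way to run it over all 350+ files and write a filtered copy next to each original would be:
# Sketch: run the filter over every .out file, writing FILE.filtered.out
# alongside the original; adjust the glob and output naming as needed.
for f in *.out; do
    awk -f filter.awk "$f" > "${f%.out}.filtered.out"
done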

End of Line Overflow in start of next line

So I have come across an AWK script that used to be working on HP-UX but has been ported over to RHEL6.4/6.5. It does some work to create headers and trailers in a file and the main script body handles the record formatting.
The problem I am seeing when it runs now is that the last letter from the first line flows onto the start of the next line. Then the last two letters of the second line flow into the start of the third and so on.
This is the section of the script that deals with the record formatting:
ls_buffer = ls_buffer $0;
while (length(ls_buffer) > 99) {
    if (substr(ls_buffer,65,6) == "STUFF") {
        .....do some other stuff
    } else {
        if (substr(ls_buffer,1,1) != "\x01f" && substr(ls_buffer,1,1) != "^") {
            printf "%-100s\n", substr(ls_buffer,1,100);
        }
    };
    #----remove 1st 100 chars in string ls_buffer
    ls_buffer = substr(ls_buffer,100);
}
To start with, it looks like the file had picked up some LF, CR and FF characters, so I removed them with gsub hex replacements further up the code, but it is still ending the line at 100 and then re-printing the last character at the start of the second line.
This is some sample test output just in case it helps:
1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME130 DE TESTLLAND GROUP
P1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME131 TESTS RE TESTSLIN
NS1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME132 TESTINGS MORTGAG
GES1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME937 TESTS SUNDRY PA
Can anyone offer any suggestions as to why this is happening? Any help would be appreciated.
The problem here seems to be that the offsets are incorrect in the manual buffer printing loop.
Specifically, the loop prints 100 characters from the buffer but then strips only 99 characters off the front of the buffer (despite the comment's claim to the contrary).
The substr function in awk starts at the character position of its second argument. So to drop x characters from the front of the string you need to use x+1 as the argument to substr.
Example:
# Print the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 1, 10)}'
1234567890
# Attempt to chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 10)}'
01234567890
# Correctly chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 11)}'
1234567890
So the ls_buffer=substr(ls_buffer,100); line in the original script would seem to need to be ls_buffer=substr(ls_buffer,101); instead.
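As a quick sanity check (scaled down to 5-character chunks so the overlap is easy to see), the same off-by-one reproduces the reported symptom:
# Each chunk is printed with substr(s, 1, 5) but the buffer is advanced with
# substr(s, 5) instead of substr(s, 6), so every chunk re-prints the previous
# chunk's last character -- just like the sample output in the question.
awk 'BEGIN {
    s = "ABCDEFGHIJKLMNOPQRST"
    while (length(s) > 4) {
        print substr(s, 1, 5)
        s = substr(s, 5)        # bug: should be substr(s, 6)
    }
}'
This prints ABCDE, EFGHI, IJKLM and MNOPQ, each line starting with the last character of the line before it.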
Given that you claim the original script was working, however, I have to wonder whether whatever version of awk is on that HP-UX machine had a slightly different interpretation of substr (not that I see how that could be possible).
That aside, this seems like a very odd way to go about this business (manually assembling a buffer and then chopping it up), but without seeing the input and the rest of the script I can't comment much more in that direction.
