End of line overflows into the start of the next line - bash
So I have come across an AWK script that used to work on HP-UX but has been ported over to RHEL 6.4/6.5. It does some work to create headers and trailers in a file, and the main script body handles the record formatting.
The problem I am seeing when it runs now is that the last letter from the first line flows onto the start of the next line. Then the last two letters of the second line flow into the start of the third and so on.
This is the section of the script that deals with the record formatting:
ls_buffer=ls_buffer $0;
while (length(ls_buffer)>99) {
    if (substr(ls_buffer,65,6)=="STUFF") {
        .....do some other stuff
    } else {
        if (substr(ls_buffer,1,1)!="\x01f" && substr(ls_buffer,1,1)!="^") {
            printf "%-100s\n", substr(ls_buffer,1,100);
        }
    };
    #----remove 1st 100 chars in string ls_buffer
    ls_buffer=substr(ls_buffer,100);
}
To start with it looked like the file had picked up some LF, CR and FF characters, so I removed them with gsub hex replacements further up in the code, but it is still ending the line at 100 characters and then re-printing the last character at the start of the second line.
This is some sample test output just in case it helps:
1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME130 DE TESTLLAND GROUP
P1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME131 TESTS RE TESTSLIN
NS1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME132 TESTINGS MORTGAG
GES1234567890123456789012345678901 00000012345TESTS SUNDRY PAYME937 TESTS SUNDRY PA
Can anyone offer any suggestions as to why this is happening? Any help would be appreciated.
The problem here seems to be that the offsets are incorrect in the manual buffer-printing loop.
Specifically, the loop prints 100 characters from the buffer but then strips only 99 characters off the front of the buffer (despite the comment's claim to the contrary).
The substr function in awk starts at the character position given by its second argument. So to drop x characters from the front of the string you need to use x+1 as that argument to substr.
Example:
# Print the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 1, 10)}'
1234567890
# Attempt to chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 10)}'
01234567890
# Correctly chop off the first ten characters from the string.
$ awk 'BEGIN {f="12345678901234567890"; print substr(f, 11)}'
1234567890
So the ls_buffer=substr(ls_buffer,100); line in the original script would seem to need to be ls_buffer=substr(ls_buffer,101); instead.
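For illustration only, here is a minimal sketch of that loop with the corrected offset (the STUFF branch and the control-character check from the original are omitted here):

ls_buffer=ls_buffer $0;
while (length(ls_buffer)>99) {
    printf "%-100s\n", substr(ls_buffer,1,100);   # print the first 100 characters
    ls_buffer=substr(ls_buffer,101);              # drop exactly those 100 characters
}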
Given that you claim the original script was working, however, I have to wonder whether whatever version of awk is on that HP-UX machine had a slightly different interpretation of substr (not that I see how that could be possible).
The above aside, this seems like a very odd way to go about this business (manually assembling a buffer and then chopping it up), but without seeing the input and the rest of the script I can't comment much more in that direction.
Related
Parse CSV data between two strings and include string from line below
I have files containing data sampled at 20Hz. Appended to each line are data packets from an IMU that are not synchronised with the 20Hz data. The IMU data packets have a start marker (255,90) and an end marker (51). I am using the term packet for brevity; they are just comma separated variables. Packet1 is not the same length as packet2 and so on.

"2019-12-08 21:29:11.90",3390323,..[ CSV data ]..,1,"1 1025.357 ",[ incomplete packet from line above ],51,255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ]
"2019-12-08 21:29:11.95",3390324,.............,1,"1 1025.367 ",[ Second part of packet4 ],51,255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ]

I would like to parse the file so that I extract the time stamp together with the IMU packets from the first start marker through the last start marker, take the partial packet from the next line, and append it to the end of the line, so the output is in the form:

"2019-12-08 21:29:11.90",255,90,[ packet1 ],51,255,90,[ packet2 ],51,255,90,[ packet3 ],51,255,90,[ First part of packet4 ][ Second part of packet4 ],51
"2019-12-08 21:29:11.95",255,90,[ packet5 ],51,255,90,[ packet6 ],51,255,90,[ packet7 ],51,255,90,[ First part of packet8 ][ Second part of packet8 ],51

As requested I have included my real world example. This is just five lines; the last line would be deleted as it would remain incomplete.

"2019-08-28 10:43:46.2",10802890,32,22.1991,-64,"1 1015.400 ",0,0,0,0,67,149,115,57,11,0,63,24,51,255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0
"2019-08-28 10:43:46.25",10802891,32,22.1991,-64,"1 1015.400 ",0,0,0,0,118,76,101,57,11,0,32,249,51,255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0
"2019-08-28 10:43:46.3",10802892,32,22.1991,-64,"1 1015.400 ",0,0,0,0,151,113,95,57,11,0,72,194,51,255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0
"2019-08-28 10:43:46.35",10802893,32,22.1991,-64,"1 1015.400 ",0,0,0,0,110,51,95,57,11,0,9,37,51,255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0
"2019-08-28 10:43:46.4",10802894,32,22.1991,-64,"1 1015.400 ",0,0,0,0,173,103,97,57,11,0,185,229,51,255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

I would like to parse the data to the following format:

"2019-08-28 10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51
"2019-08-28 10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51
"2019-08-28 10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51
"2019-08-28 10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51
"2019-08-28 10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

This last line would remain incomplete as there is no next line.
When you are dealing with fields you should be thinking awk. In this case awk provides a simple solution -- so long as your record format does not change. Generally that wouldn't matter, but here it does. Why? Because your wanted output does not match your problem description: in all records other than the fourth, the first 51 ending the data to append to the previous line is located in field 19 (with ',' as the field separator), while in the fourth record it is found in field 12. So, where you would normally just scan forward through your fields to find the first 51 -- eliminating the need to know which field the first 51 is found in -- using that method with your data does not produce your wanted results. (The 3rd output line would have a short remainder from the 4th input line, reducing its length and instead forcing the additional packet data onto the fourth line of output.) However, sacrificing that flexibility and considering fields 7-19 to be packets belonging with the previous line allows your wanted output to be matched exactly. (It also simplifies the script, but at the cost of flexibility in the record format.)

A short awk script taking the file to process as its first argument can be written as follows:

#!/usr/bin/awk -f

BEGIN { FS=","; dtfield=""; packets=""; pkbeg=7; pkend=19 }

NF > 1 {
    if (length(packets) > 0) {              # handle 1st part of next line
        for (i=pkbeg; i<=pkend; i++)        # append packet data through field 19
            packets=packets "," $i
        print dtfield packets "\n"          # output the date and packet data
        packets=""                          # reset packet data empty
    }
    dtfield=$1                              # for every line, store date field
    for (i=pkend+1; i<=NF; i++)             # loop from 20 to end saving data
        packets=packets "," $i
}

END {
    print dtfield packets "\n"              # output final line
}

Don't forget to chmod +x scriptname to make the script executable.

Example Use/Output (non-fixed width due to output line length -- as was done in the question):

$ ./imupackets.awk imu
"2019-08-28 10:43:46.2",255,90,12,110,51,255,90,177,109,51,255,90,4,193,141,125,51,255,90,114,51,255,90,8,0,250,63,51,255,90,9,0,46,0,136,251,232,66,0,0,160,64,0,0,0,0,0,0,0,0,233,124,139,56,0,0,0,0,0,0,0,0,195,80,152,184,0,0,0,0,0,0,0,0,118,76,101,57,11,0,32,249,51
"2019-08-28 10:43:46.25",255,90,230,252,51,255,90,53,221,51,255,90,4,193,33,60,51,255,90,104,51,255,90,8,0,23,192,51,255,90,9,0,46,0,200,151,233,66,0,0,160,64,0,0,0,0,0,0,0,0,2,117,157,56,0,0,0,0,0,0,0,0,31,182,140,57,0,0,0,0,0,0,0,0,151,113,95,57,11,0,72,194,51
"2019-08-28 10:43:46.3",255,90,105,41,51,255,90,12,15,51,255,90,4,193,70,8,51,255,90,89,51,255,90,8,0,46,210,51,255,90,9,0,46,0,40,130,234,66,0,0,160,64,0,0,0,0,0,0,0,0,132,206,183,56,0,0,0,0,0,0,0,0,97,191,197,56,0,0,0,0,0,0,0,0,110,51,95,57,11,0,9,37,51
"2019-08-28 10:43:46.35",255,90,78,13,51,255,90,255,246,51,255,90,4,193,52,161,51,255,90,152,51,255,90,8,0,163,85,51,255,90,9,0,46,0,104,30,235,66,0,0,160,64,0,0,0,0,0,0,0,0,49,42,201,56,0,0,0,0,0,0,0,0,82,125,132,57,0,0,0,0,0,0,0,0,173,103,97,57,11,0,185,229,51
"2019-08-28 10:43:46.4",255,90,177,130,51,255,90,57,236,51,255,90,4,193,213,77,51,255,90,252,51,255,90,8,0,9,201,51,255,90,9,0,46,0,200,8,236,66,0,0,160,64,0,0,0,0,0,0,0,0,83,67,227,56,0,0,0,0,0,0,0,0,58,205,192,184,0,0,0,0

Look things over and let me know if you have questions.
The following command pipes your_input_file into a sed command (GNU sed 4.8) that accomplishes the task. At least it works for me with the files you provided (as they are at the time of writing, empty lines included).

cat your_input_file | sed '
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}'

Clearly you can save the sed script in a file (named for instance myscript.sed)

#!/usr/bin/sed -f
s/,51,\(255,90,.*,51\),255,90,/,51\n,\1,255,90,/
s/\("[^"]*"\).*",\(.*\),51\n/\2,51\n\1/
$!N
H
$!d
${
x
s/^[^"]*//
s/\n\n\([^\n]*\)/,\1\n/g
}

and use it like this: ./myscript.sed your_input_file.

Note that if the first ,51, on each line is guaranteed to be followed by 255,90, (something which your fourth example violates, ",0,0,0,0,110,51,95,"), then the first substitution command reduces to s/,51,/,51\n,/.

Please test it and let me know if I have correctly interpreted your question. I have not explained how the script works, for the simple reason that it would take considerable time for me to write down an explanation (I tend to be fairly meticulous when walking through a script, as you can see here, where I created another analogous sed script), and I want to be sure it does represent a solution for you.

Maybe shorter solutions are possible (even with sed itself). I'm not sure awk would allow a shorter solution; it would certainly offer infinitely more readability than sed, but (I think) at the price of length. Indeed, as you can see from another answer, the awk script is more readable but longer (369 characters/bytes vs the sed script's 160 bytes). Actually, even in the world of sed scripts, the one above is fairly inefficient, I guess, as it basically preprocesses each line and keeps appending each one to all the preceding ones, then does some processing on the resulting long multi-line block and prints it to screen.
Replace a part of a file by a part of another file
I have two files containing a lot of floating-point numbers. I would like to replace one of the floating-point numbers in File1 with a floating-point number from File2, using line and character positions to find the numbers (and not their values). There are a lot of topics on the subject, but I couldn't find anything that uses a second file to copy the values from. Here are examples of my two files:

File1:

14 4
2.64895E-01 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01

File2:

Some text on the first line
1
Some text on the third line
0
AND01 0.53758275 0.65728944
AND02 0.64889566 0.53386002
AND03 0.65729386 0.64628194
AND04 0.26586960 0.46582925
AND05 0.46480534 0.57415869

In this particular example, I would like to replace the first number on the second line of File1 (2.64895E-01) with the second floating-point number written on line 5 of File2 (0.65728944). Note: the values of the numbers will change according to which files I consider, so I have to identify the numbers by their positions inside the files. I am very new to writing bash scripts and have only used the "sed" command till now to modify my files. Any help is welcome :) Thanks a lot for your inputs!
It's not hard to do it in bash, but if that's not a strict requirement, an easier and more concise solution is possible with an actual text-processing tool like awk:

awk 'NR==5 {val=$3} NR>FNR {FNR==2 && $1=val; print}' file2 file1

Explanation: read file2 first, and store the third field of the 5th record (the second floating-point number on that line) in variable val (the first part: NR==5 {val=$3}). Then read file1 and print every line, but replace the first field of the second record (FNR is the current-file record number, and NR is the total number of records in all files so far) with the value stored in val.

In general, an awk program consists of pattern { actions } sequences. pattern is a condition under which a series of actions will get executed. $1..$NF are variables holding the field values, and each line (record) is split into fields on the field separator (the FS variable, or the -F'..' option), which defaults to whitespace.

The result (output):

14 4
0.65728944 4.75834E+02 2.85629E+05 -9.65829E+01
2.76893E-01 8.53749E+02 4.56385E+05 -7.65658E+01
6.25576E-01 5.27841E+02 5.72960E+05 -7.46175E+01
8.56285E-01 4.67285E+02 5.75962E+05 -5.17586E+01
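If you would rather keep file1 untouched and write the modified copy somewhere else, redirect the output (the file1.new name below is just an example):

awk 'NR==5 {val=$3} NR>FNR {FNR==2 && $1=val; print}' file2 file1 > file1.new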
How can I get only specific strings (by condition) from a file?
I have a huge text file with strings in a special format. How can I quickly create another file containing only the strings that match my condition? For example, the file contains:

[2/Nov/2015][rule="myRule"]"GET http://uselesssotialnetwork.com/picturewithcat.jpg"
[2/Nov/2015][rule="mySecondRule"]"GET http://anotheruselesssotialnetwork.com/picturewithdog.jpg"
[2/Nov/2015][rule="myRule"]"GET http://uselesssotialnetwork.com/picturewithzombie.jpg"

and I only need the strings with "myRule" and "cat"? I think it should be perl, or bash, but it doesn't matter. Thanks a lot, sorry for the noob question.
Is it correct that each entry is two lines long? Then you can use sed:

sed -n '/myRule/ {N}; /myRule.*cat/ {p}'

The first rule appends the next line to the pattern space when myRule matches; the second rule tries to match myRule followed by cat in the pattern space, and if it is found it prints the pattern space.
If your file is truly huge, to the extent that it won't fit in memory (although files up to a few gigabytes are fine on modern computer systems), then the only way is to either change the record separator or read the lines in pairs.

This shows the first way, and assumes that the second line of every pair ends with a double quote followed by a newline:

perl -ne'BEGIN{$/ = qq{"\n}} print if /myRule/ and /cat/' huge_file.txt

and this is the second:

perl -ne'$_ .= <>; print if /myRule/ and /cat/' huge_file.txt

When given your sample data as input, both methods produce this output:

[2/Nov/2015][rule="myRule"]"GET http://uselesssotialnetwork.com/picturewithcat.jpg"
Delete lines in a file based on first row
I am trying to work on a whole series of txt files (actually .out files, but they behave like space-delimited txt files). I want to delete certain lines in the text, based on how a value compares for a column named in the first row. So for example:

ID VAR1 VAR2
1 8 9
2 4 1
3 3 2

I want to delete all the lines with VAR1 < 0,5. I found a way to do this manually in Excel, but with 350+ files this is going to be a long night; there are surely ways to do this more effectively. I have already worked on this set of files in the terminal (OS X).
This is a typical job for awk, the venerable language for file manipulation. What awk does is match each line in a file against a condition, and provide an action for it. It also allows for easy elementary parsing of line columns. In this case, you want to test whether the second column is less than 0.5, and if so not print that line; otherwise, print the line (in effect this removes the lines for which the variable is less than 0.5). Your variable is in column 2, which in awk is referred to as $2. Each full line is referred to by the variable $0. So you would do something like this:

{
    if ($2 < 0.5) {
    } else {
        print $0
    }
}

Or something like that, I haven't used awk for a while. The above code is an awk script. Apply it to your file, and redirect the output to a new file (which will have all the lines with a value below 0.5 removed).
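To process all 350+ files from the shell, a loop along these lines should work (a sketch only; filter.awk and the _filtered.out suffix are example names, assuming the awk script above has been saved as filter.awk):

for f in *.out; do
    awk -f filter.awk "$f" > "${f%.out}_filtered.out"   # write a filtered copy per input file
done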
gsub issue with awk (gawk)
I need to search a text file for a string, and make a replacement that includes a number that increments with each match. The string to be "found" could be a single character, or a word, or a phrase. The replacement expression will not always be the same (as it is in my examples below), but will always include a number (variable) that increments.

For example:

1) I have a test file named "data.txt". The file contains:

Now is the time for all good men to come to the aid of their party.

2) I placed the awk script in a file named "cmd.awk". The file contains:

/f/ {sub ("f","f(" ++j ")")}1

3) I use awk like this:

awk -f cmd.awk data.txt

In this case, the output is as expected:

Now is the time f(1)or all good men to come to the aid of(2) their party.

The problem comes when there is more than one match on a line. For example, if I was searching for the letter "i" like:

/i/ {sub ("i","i(" ++j ")")}1

The output is:

Now i(1)s the time for all good men to come to the ai(2)d of their party.

which is wrong because it doesn't include the "i" in "time" or "their". So, I tried "gsub" instead of "sub" like:

/i/ {gsub ("i","i(" ++j ")")}1

The output is:

Now i(1)s the ti(1)me for all good men to come to the ai(2)d of thei(2)r party.

Now it makes the replacement for all occurrences of the letter "i", but the inserted number is the same for all matches on the same line. The desired output should be:

Now i(1)s the ti(2)me for all good men to come to the ai(3)d of thei(4)r party.

Note: The number won't always begin with "1", so I might use awk like this:

awk -f cmd.awk -v j=26 data.txt

To get the output:

Now i(27)s the ti(28)me for all good men to come to the ai(29)d of thei(30)r party.

And just to be clear, the number in the replacement will not always be inside parentheses, and the replacement will not always include the matched string (actually it would be quite rare).

The other problem I am having with this is... I want to use an awk variable (not an environment variable) for the "search string", so I can specify it on the awk command line. For example:

1) I placed the awk script in a file named "cmd.awk". The file contains something like:

/??a??/ {gsub (a,a "(" ++j ")")}1

2) I would use awk like this:

awk -f cmd.awk -v a=i data.txt

To get the output:

Now i(1)s the ti(2)me for all good men to come to the ai(3)d of thei(4)r party.

The question here is: how do I represent the variable "a" in the /search/ expression?
awk version:

awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i
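Run against the data.txt from the question, this produces the desired output; pre-setting the counter with -v k=26 would start the numbering at 27:

awk '{for(i=2; i<=NF; i++)$i="(" ++k ")" $i}1' FS=i OFS=i data.txt
Now i(1)s the ti(2)me for all good men to come to the ai(3)d of thei(4)r party.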
gensub() sounds ideal here; it allows you to replace the Nth match, so what sounds like a solution is to iterate over the string in a do{}while() loop, replacing one match at a time and incrementing j. This simple gensub() approach won't work if the replacement does not contain the original text (or worse, contains it multiple times), see below.

So in awk, lacking perl's "s///e" evaluation feature and its stateful regex /g modifier (as used by Steve), the best remaining option is to break the lines into chunks (head, match, tail) and stick them back together again:

BEGIN {
  if (j=="") j=1
  if (a=="") a="f"
}

match($0,a) {
  str=$0; newstr=""
  do {
    newstr=newstr substr(str,1,RSTART-1)  # head
    mm=substr(str,RSTART,RLENGTH)         # extract match
    sub(a,a"("j++")",mm)                  # replace
    newstr=newstr mm
    str=substr(str,RSTART+RLENGTH)        # tail
  } while (match(str,a))
  $0=newstr str
}

{print}

This uses match() as an expression instead of a // pattern so you can use a variable. (You could also just use "($0 ~ a) { ... }", but the results of match() are used in this code, so don't try that here.) You can define j and a on the command line.

gawk supports \y, which is the equivalent of perlre's \b, and also supports \< and \> to explicitly match the start and end of a word; just take care to add extra escapes from a unix command line (I'm not quite sure what Windows might require or permit).

Limited gensub() version

As referenced above:

match($0,a) {
  idx=1; str=$0
  do {
    prev=str
    str=gensub(a,a"(" j ")",idx++,prev)
  } while (str!=prev && j++)
  $0=str
}

The problems here are:

- if you replace substring "i" with substring "k" or "k(1)", then the gensub() index for the next match will be off by 1. You could work around this if you either know that in advance, or work backward through the string instead.

- if you replace substring "i" with substring "ii" or "ii(i)", then a similar problem arises (resulting in an infinite loop, because gensub() keeps finding a new match).

Dealing with both conditions robustly is not worth the code.
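For example, with the main script above saved as cmd.awk (the file name used in the question), a run over the sample data looks like this; note that because this script uses j++ rather than the question's ++j, passing -v j=27 would make the numbering start at 27:

awk -v a=i -f cmd.awk data.txt
Now i(1)s the ti(2)me for all good men to come to the ai(3)d of thei(4)r party.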
I'm not saying this can't be done using awk, but I would strongly suggest moving to a more powerful language. Use perl instead.

To include a count of the letter i beginning at 26, try:

perl -spe 's:i:$&."(".++$x.")":ge' -- -x=26 data.txt

This could also be a shell var:

var=26
perl -spe 's:i:$&."(".++$x.")":ge' -- -x=$var data.txt

Results:

Now i(27)s the ti(28)me for all good men to come to the ai(29)d of thei(30)r party.

To include a count of specific words, add word boundaries (i.e. \b) around the words, try:

perl -spe 's:\bthe\b:$&."(".++$x.")":ge' -- -x=5 data.txt

Results:

Now is the(6) time for all good men to come to the(7) aid of their party.