Joining lines that match specific conditions in bash

I need a command that will join lines if:
-following line starts with more than 5 spaces
-length of the joined lines won't be greater than 79 characters
-those lines are not between lines with pattern1 and pattern2
-same as above but with another set of patterns, like pattern3 and pattern4
It will work on a file like this:
Long line that contains too much text for combining it with following one
That line cannot be attached to the previous becouse of the length
This one also
becouse it doesn't start with spaces
This one
could be
expanded
pattern1
here are lines
that shouldn't be
changed
pattern2
Another line
to grow
After running the command, output should be:
Long line that contains too much text for combining it with following one
That line cannot be attached to the previous becouse of the length
This one also
becouse that one doesn't start with spaces
This one could be expanded
pattern1
here are lines
that shouldn't be
changed
pattern2
Another line to grow
It can't move part of the line.
I'm using bash 2.05, sed 3.02, awk 3.1.1 and grep 2.5.1, and I don't know how to solve this problem :)

Here's a start for you:
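#!/usr/bin/awk -f
BEGIN {
    TRUE = printflag1 = printflag2 = 1
    FALSE = 0
}
# using two different flags prevents premature enabling when blocks are
# nested or intermingled
/pattern1/ {
    printflag1 = FALSE
}
/pattern2/ {
    printflag1 = TRUE
}
/pattern3/ {
    printflag2 = FALSE
}
/pattern4/ {
    printflag2 = TRUE
}
{
    # collapse leading spaces to a single space and strip trailing spaces
    line = $0
    sub(/^ +/, " ", line)
    sub(/ +$/, "", line)
}
# join a continuation line onto the accumulated line
/^ / &&
length(accum line) <= 79 &&
printflag1 &&
printflag2 {
    accum = accum line
    next
}
{
    # nothing has been accumulated yet on the first line
    if (NR > 1) print accum
    accum = line
}
END {
    # flush the last accumulated line
    if (NR > 0) print accum
}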

Related

(sed/awk) Extract values from a text file and write to csv (no pattern)

I have (several) large text files from which I want to extract some values to create a csv file with all of these values.
My current solution is to have a few different calls to sed from which I save the values and then have a python script in which I combine the data in different files to a single csv file. However, this is quite slow and I want to speed it up.
The file, let's call it my_file_1.txt, has a structure that looks something like this:
lines I don't need
start value 123
lines I don't need
epoch 1
...
lines I don't need
some epoch 18 words
stop value 234
lines I don't need
words start value 345 more words
lines I don't need
epoch 1
...
lines I don't need
epoch 72
stop value 456
...
and I would like to construct something like
file,start,stop,epoch,run
my_file_1.txt,123,234,18,1
my_file_1.txt,345,456,72,2
...
How can I get the results I want? It doesn't have to be Sed or Awk as long as I don't need to install something new and it is reasonably fast.
I don't really have any experience with awk. With sed my best guess would be
filename=$1
echo 'file,start,stop,epoch,run' > my_data.csv
sed -n '
s/.*start value \([0-9]\+\).*/'"$filename"',\1,/
h
$!N
/.*epoch \([0-9]\+\).*\n.*stop value\([0-9]\+\)/{s/\2,\1/}
D
T
G
P
' $filename | sed -z 's/,\n/,/' >> my_data.csv
and then deal with not getting the run number. Furthermore, this is not quite correct as the N will gobble up some "start value" lines leading to wrong result. It feels like it could be done easier with awk.
It is similar to 8992158 but I can't use that pattern and I know too little awk to rewrite it.
Solution (Edit)
I was not general enough in my description of the problem, so I changed it up a bit and fixed some inconsistencies.
Awk (Rusty Lemur's answer)
Here I generalised from knowing that the numbers were at the end of the line to using gensub. For this I should have specified my version of awk, as gensub is not available in all versions.
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = gensub(/.*start value ([0-9]+).*/, "\\1", 1, $0)
}
/epoch/ {
epoch = gensub(/.*epoch ([0-9]+).*/, "\\1", 1, $0)
}
/stop value/ {
stopValue = gensub(/.*stop value ([0-9]+).*/, "\\1", 1, $0)
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
I accepted this answer because it is the most understandable.
Sed (potong's answer)
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^.*start value/{:a;N;/\n.*stop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^.*start value (\S+).*\n.*epoch (\S+)\n.*stop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' my_file_1.txt | sed '1!N;s/\n//'
It's not clear how you'd get exactly the output you provided from the input you provided but this may be what you're trying to do (using any awk in any shell on every Unix box):
$ cat tst.awk
BEGIN {
OFS = ","
print "file", "start", "stop", "epoch", "run"
}
{ f[$1] = $NF }
$1 == "stop" {
print FILENAME, f["start"], f["stop"], f["epoch"], ++run
delete f
}
$ awk -f tst.awk my_file_1.txt
file,start,stop,epoch,run
my_file_1.txt,123,234,N,1
my_file_1.txt,345,456,M,2
awk's basic structure is:
read a record from the input (by default a record is a line)
evaluate conditions
apply actions
The record is split into fields (by default based on whitespace as the separator).
The fields are referenced by their position, starting at 1. $1 is the first field, $2 is the second.
The last field is referenced by a variable named NF for "number of fields." $NF is the last field, $(NF-1) is the second-to-last field, etc.
A "BEGIN" section will be executed before any input file is read, and it can be used to initialize variables (which are implicitly initialized to 0).
BEGIN {
counter = 1
OFS = "," # This is the output field separator used by the print statement
print "file", "start", "stop", "epoch", "run" # Print the header line
}
/start value/ {
startValue = $NF # when a line contains "start value" store the last field as startValue
}
/epoch/ {
epoch = $NF
}
/stop value/ {
stopValue = $NF
# we have everything to print our line
print FILENAME, startValue, stopValue, epoch, counter
counter = counter + 1
startValue = "" # clear variables so they aren't maintained through the next iteration
epoch = ""
}
Save that as processor.awk and invoke as:
awk -f processor.awk my_file_1.txt my_file_2.txt my_file_3.txt > output.csv
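The field references described above can be tried on a couple of lines taken from the question. This is only a minimal sketch: show_fields is a hypothetical helper name, and the input is piped in rather than read from a file. It also shows why $NF is only suitable when the number is the last word on the line.

```shell
# Hypothetical helper: print the first and the last field of every input line.
show_fields() {
    awk '{ print "first=" $1, "last=" $NF }'
}

printf '%s\n' 'start value 123' 'words start value 345 more words' | show_fields
# first=start last=123
# first=words last=words
```

On the second line $NF picks up "words" rather than 345, so the processor.awk script above fits input where the value ends the line.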
This might work for you (GNU sed):
sed -nE '1{x;s/^/file,start,stop,epoch,run/p;s/.*/0/;x}
/^start value/{:a;N;/\nstop value/!ba;x
s/.*/expr & + 1/e;x;G;F
s/^start value (\S+).*\nepoch (\S+)\nstop value (\S+).*\n(\S+)/,\1,\3,\2,\4/p}' file |
sed '1!N;s/\n//'
The solution contains two invocations of sed, the first to format all but the file name and second to embed the file name into the csv file.
Format the header line on the first line and prime the run number.
Gather up lines between start value and stop value.
Increment the run number, append it to the current line and output the file name. This prints two lines per record, the first is the file name and the second the remainder of the csv file.
In the second sed invocation read two lines at a time (except for the first line) and remove the newline between them, formatting the csv file.

Find, Replace, Remove - within a file

I'm currently using this code:
awk 'BEGIN { s = \"{$CNEW}\" } /WORD_MATCH/ { $0 = s; n = 1 } 1; END { if(!n) print s }' filename > new_filename
This finds a match on WORD_MATCH and replaces that line with $CNEW in a file called filename; the results are written to new_filename.
This all works well. But I have an issue where I may want to DELETE the line instead of replacing it.
So I set $CNEW = '', which gives me a blank line in the file but does not actually remove the line.
Is there any way to adapt the awk command to allow removal of the line?
The total aim is :
If there isn't a line in the file containing WORD_MATCH add one, based on $CNEW
If there is a line in the file containing WORD_MATCH update that line with the new value from $CNEW
If $CNEW = '' then delete the line containing WORD_MATCH.
There will only be one line in the file containing WORD_MATCH.
Thanks
awk -v s="$CNEW" '/WORD_MATCH/ { n=1; if (s) $0=s; else next; } 1; END { if(s && !n) print s }' file
How it works
-v s="$CNEW"
This creates s as an awk variable with the value $CNEW. Note that the use of -v neatly eliminates the quoting problems that can occur by trying to define s in a BEGIN block.
/WORD_MATCH/ { n=1; if (s) $0=s; else next; }
If the current line matches WORD_MATCH, then set n to 1. If s is non-empty, then set the current line to s. If not, skip the rest of the commands and start over on the next line.
1
This is cryptic shorthand for print the line.
END { if(s && !n) print s }
At the end of the file, if n is still not 1 and s is non-empty, then print s.
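The three cases from the question (replace, delete, append) can be checked with a small sketch. update_line is a hypothetical wrapper around the one-liner above that takes the replacement text as its first argument and reads the file content from stdin.

```shell
# Hypothetical wrapper: $1 is the new line (may be empty), input comes from stdin.
update_line() {
    awk -v s="$1" '/WORD_MATCH/ { n=1; if (s) $0=s; else next } 1; END { if (s && !n) print s }'
}

# Case 1: a matching line exists and s is non-empty, so the line is replaced.
printf 'a\nWORD_MATCH old\nb\n' | update_line 'WORD_MATCH new'
# Case 2: s is empty, so the matching line is dropped.
printf 'a\nWORD_MATCH old\nb\n' | update_line ''
# Case 3: no matching line, so s is appended at the end.
printf 'a\nb\n' | update_line 'WORD_MATCH new'
```

One thing to keep in mind: awk processes backslash escape sequences in values passed with -v, so a replacement containing backslashes would need extra care.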

Extracting lines between two patterns and including line above the first and below the second

Given the following text file, I need to extract and print the lines between two patterns and also include the line above the first pattern and the one following the second:
asdgs sdagasdg sdagdsag
asdfgsdagg gsfagsaf
asdfsdaf dsafsdfdsfas
asdfdasfadf
nnnn nnnnn aaaaa
line before first pattern
***** FIRST *****
dddd ffff cccc
wwww rrrrrrrr xxxx
***** SECOND *****
line after second pattern
asdfgsdagg gsfagsaf
asdfsdaf dsafsdfdsfas
asdfdasfadf
nnnn nnnnn aaaaa
I have found many solutions with sed and awk to extract between two tags, such as the following
sed -n '/FIRST/,/SECOND/p' FileName
but how to include the line before and after the pattern?
Desired output:
line before first pattern
***** FIRST *****
dddd ffff cccc
wwww rrrrrrrr xxxx
***** SECOND *****
line after second pattern
As you've asked for an sed/awk solution (and everyone is scared of ed ;-), here's one way you can do it in awk:
awk '/FIRST/{print p; f=1} {p=$0} /SECOND/{c=1} f; c--==0{f=0}' file
When the first pattern is matched, print the previous line p and set the print flag f. When the second pattern is matched, set c to 1. If f is 1 (true), the current line will be printed. c--==0 is only true on the line after the second pattern is matched (and on the very first line of input, where c is still unset), and that is where f is reset.
Another way you can do this is by looping through the file twice:
awk 'NR==FNR{if(/FIRST/)s=NR;else if(/SECOND/)e=NR;next}FNR>=s-1&&FNR<=e+1' file file
The first pass through the file loops through the file and records the line numbers. The second prints the lines in the range.
The advantage of the second approach is that it is trivially easy to print M lines before and N lines after the range, simply by changing the numbers in the script.
To use shell variables instead of hard-coded patterns, you can pass the variables like this:
awk -v first="$first" -v second="$second" '...' file
Then use $0 ~ first instead of /FIRST/.
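Putting the last two points together, here is a sketch of the two-pass approach with the patterns and the amount of context passed in as awk variables. The function name range_with_context and the variable names b and a are assumptions made for this illustration, not part of the answer above.

```shell
# Hypothetical helper: print the first..second range of a file plus
# b lines of leading context and a lines of trailing context.
# Usage: range_with_context FIRST_REGEX SECOND_REGEX BEFORE AFTER FILE
range_with_context() {
    awk -v first="$1" -v second="$2" -v b="$3" -v a="$4" '
        NR == FNR {                     # first pass: record the boundaries
            if ($0 ~ first) s = NR
            else if ($0 ~ second) e = NR
            next
        }
        FNR >= s - b && FNR <= e + a    # second pass: print range plus context
    ' "$5" "$5"
}

tmp=$(mktemp)
printf '%s\n' one two 'line before' '*** FIRST ***' body '*** SECOND ***' 'line after' tail > "$tmp"
range_with_context FIRST SECOND 1 1 "$tmp"
rm -f "$tmp"
```

With 1 and 1 as the context sizes this prints the five lines from "line before" to "line after". As in the original one-liner, the file is named twice because awk has to read it twice.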
I'd say
sed '/FIRST/ { x; G; :a n; /SECOND/! ba; n; q; }; h; d' filename
That is:
/FIRST/ { # If a line matches FIRST
x # swap hold buffer and pattern space,
G # append hold buffer to pattern space.
# We saved the last line before the match in the hold
# buffer, so the pattern space now contains the previous
# and the matching line.
:a # jump label for looping
n # print pattern space, fetch next line.
/SECOND/! ba # unless it matches SECOND, go back to :a
n # fetch one more line after the match
q # quit (printing that last line in the process)
}
h # If we get here, it's before the block. Hold the current
# line for later use.
d # don't print anything.
Note that BSD sed (as comes with Mac OS X and *BSD) is a bit picky about branching commands. If you're working on one of those platforms,
sed -e '/FIRST/ { x; G; :a' -e 'n; /SECOND/! ba' -e 'n; q; }; h; d' filename
should work.
This will work whether or not there's multiple ranges in your file:
$ cat tst.awk
/FIRST/ { print prev; gotBeg=1 }
gotBeg {
print
if (gotEnd) gotBeg=gotEnd=0
if (/SECOND/) gotEnd=1
}
{ prev=$0 }
$ awk -f tst.awk file
line before first pattern
***** FIRST *****
dddd ffff cccc
wwww rrrrrrrr xxxx
***** SECOND *****
line after second pattern
If you ever need to print more than 1 line before FIRST change prev to an array. If you ever need to print more than 1 line after SECOND, change gotEnd to a count.
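One possible way to make that change is sketched below, with prev turned into an array and the end flag turned into a countdown. The names context_range, B, A, emit and last are assumptions for this illustration. The emit function keeps a line from being printed twice when two ranges sit close together.

```shell
# Hypothetical generalisation: print B lines before FIRST and A lines after SECOND.
# Usage: context_range B A < file
context_range() {
    awk -v B="$1" -v A="$2" '
        # print a line only if nothing at or after its number was printed already
        function emit(n, text) { if (n > last) { print text; last = n } }
        /FIRST/ && !inRange {
            for (i = NR - B; i < NR; i++)      # flush buffered context, oldest first
                if (i in prev) emit(i, prev[i])
            inRange = 1
        }
        inRange               { emit(NR, $0) }
        !inRange && after > 0 { emit(NR, $0); after-- }
        /SECOND/ && inRange   { inRange = 0; after = A }
        { prev[NR] = $0; delete prev[NR - B] } # keep only the last B lines
    '
}

printf '%s\n' x 'before 2' 'before 1' FIRST mid SECOND 'after 1' 'after 2' y | context_range 2 1
```

For this input the output runs from "before 2" through "after 1".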
sed '#n
H;$!d
x;s/\n/²/g
/FIRST.*SECOND/!b
s/.*²\([^²]*²[^²]*FIRST\)/\1/
:a
s/\(FIRST.*SECOND[^²]*²[^²]*\)².\{1,\}/\1/
ta
s/²/\
/g
p' YourFile
POSIX sed version (with GNU sed, use --posix).
It stops at the first SECOND that follows FIRST, even if both are on the same line; it is easy to adapt so that at least one newline is required between them.
#n : don't print unless explicitly requested (e.g. with p)
H;$!d : append each line to the hold buffer; if it is not the last line, delete the current line and start the next cycle
x;s/\n/²/g : load the buffer and replace every newline with another character (here I use ²), because POSIX sed does not allow [^\n]
/FIRST.*SECOND/!b : if the patterns are not present, end without output
s/.*²\([^²]*²[^²]*FIRST\)/\1/ : remove everything before the line preceding your first pattern
:a : label for a goto (used later)
s/\(FIRST.*SECOND[^²]*²[^²]*\)².\{1,\}/\1/ : remove everything after the line following your second pattern. The match is greedy, so the last occurrence of the pattern is used as the reference
ta : if the last s/// made a substitution, go to label a. It cycles until the first SECOND in the file (after FIRST) is reached
s/²/\
/g : put back the newlines
p : print the result
Based on Tom's comment: if the file isn't large, we can just store it in an array and then loop over it:
awk '{a[++i]=$0} /FIRST/{s=NR} /SECOND/{e=NR} END {for(i=s-1;i<e+1;i++) print a[i]}'
I would do it with Perl personally. We have the 'range operator' which we can use to detect if we're between two patterns:
if ( m/FIRST/ .. /SECOND/ )
That's the easy part. What's a little less easy is 'catching' the preceding and next lines. So I set a $prev_line value, so that when I first hit that test, I know what to print. And I clear that $prev_line, both so that it's empty when I print it again and so that I can spot the transition at the end of the range.
So something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $prev_line = " ";
while (<DATA>) {
if ( m/FIRST/ .. /SECOND/ ) {
print $prev_line;
$prev_line = '';
print;
}
else {
if ( not $prev_line ) {
print;
}
$prev_line = $_;
}
}
__DATA__
asdgs sdagasdg sdagdsag
asdfgsdagg gsfagsaf
asdfsdaf dsafsdfdsfas
asdfdasfadf
nnnn nnnnn aaaaa
line before first pattern
***** FIRST *****
dddd ffff cccc
wwww rrrrrrrr xxxx
***** SECOND *****
line after second pattern
asdfgsdagg gsfagsaf
asdfsdaf dsafsdfdsfas
asdfdasfadf
nnnn nnnnn aaaaa
This might work for you (GNU sed):
sed '/FIRST/!{h;d};H;g;:a;n;/SECOND/{n;q};$!ba' file
If the current line is not FIRST, save it in the hold space and delete the current line. If the line is FIRST, append it to the saved line and then print both and any further lines until SECOND, at which point one additional line is printed and the script exits.

How to get specific data from block of data based on condition

I have a file like this:
[group]
enable = 0
name = green
test = more

[group]
name = blue
test = home

[group]
value = 48
name = orange
test = out
There may be one or more spaces/tabs between the label, the = and the value.
The number of lines may vary in every block.
I would like to get the name, but only if the block does not contain enable = 0.
So output should be:
blue
orange
Here is what I have managed to create:
awk -v RS="group" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
There are several faults with this:
I am not able to set RS to [group]; both RS="[group]" and RS="\[group\]" fail. The current version will then fail if name or other labels contain group.
I would prefer not to use RS with multiple characters, since this is gnu awk only.
Does anyone have another suggestion? Preferably sed or awk, and not a long chain of commands.
If you know that groups are always separated by empty lines, set RS to the empty string:
$ awk -v RS="" '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
blue
orange
@devnull explained in his answer that GNU awk also accepts regular expressions in RS, so you could split at [group] only if it is on its own line:
gawk -v RS='(^|\n)[[]group]($|\n)' '!/enable = 0/ {sub(/.*name[[:blank:]]+=[[:blank:]]+/,x);print $1}'
This makes sure we're not splitting at evil names like
[group]
enable = 0
name = [group]
name = evil
test = more
Your problem seems to be:
I am not able to set RS to [group]; both RS="[group]" and
RS="\[group\]" fail.
Saying:
RS="[[]group[]]"
should yield the desired result.
In these situations where there's clearly name = value statements within a record, I like to first populate an array with those mappings, e.g.:
map["<name>"] = <value>
and then just use the names to reference the values I want. In this case:
$ awk -v RS= -F'\n' '
{
delete map
for (i=1;i<=NF;i++) {
split($i,tmp,/ *= */)
map[tmp[1]] = tmp[2]
}
}
map["enable"] !~ /^0$/ {
print map["name"]
}
' file
blue
orange
If your version of awk doesn't support deleting a whole array then change delete map to split("",map).
Compared to using REs and/or sub()s etc., it makes the solution much more robust and extensible in case you want to compare and/or print the values of other fields in future.
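As a sketch of that extensibility, the variant below prints a second value (test) next to the name and also tolerates tabs around the =. The function name enabled_names is an assumption for this example, and the input is expected to have blank lines between groups, as paragraph mode requires.

```shell
# Hypothetical variant of the map approach: print name and test of every
# group that is not disabled. Reads the configuration from stdin.
enabled_names() {
    awk -v RS= -F'\n' '
    {
        split("", map)                        # portable way to clear the array
        for (i = 1; i <= NF; i++) {
            # only lines of the form "label = value" are stored
            if (split($i, tmp, /[ \t]*=[ \t]*/) == 2)
                map[tmp[1]] = tmp[2]
        }
    }
    map["enable"] != "0" && ("name" in map) {
        print map["name"], map["test"]
    }
    '
}

printf '%s\n' '[group]' 'enable = 0' 'name = green' 'test = more' '' \
              '[group]' 'name = blue' 'test = home' '' \
              '[group]' 'value = 48' 'name = orange' 'test = out' | enabled_names
# blue home
# orange out
```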
Since you have line-separated records, you should consider putting awk in paragraph mode. If you must test for the [group] identifier, simply add code to handle that. Here's some example code that should fulfill your requirements. Run like:
awk -f script.awk file.txt
Contents of script.awk:
BEGIN {
RS=""
}
{
for (i=2; i<=NF; i+=3) {
if ($i == "enable" && $(i+2) == 0) {
f = 1
}
if ($i == "name") {
r = $(i+2)
}
}
}
!(f) && r {
print r
}
{
f = 0
r = ""
}
Results:
blue
orange
This might work for you (GNU sed):
sed -n '/\[group\]/{:a;$!{N;/\n$/!ba};/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p;d}' file
Read the [group] block into the pattern space then substitute out the colour if the enable variable is not set to 0.
sed -n '...' sets sed to run in silent mode: no output unless specified, i.e. by a p or P command
/\[group\]/{...} when we have a line which contains [group] do what is found inside the curly braces.
:a;$!{N;/\n$/!ba} to do a loop we need a place to loop to, and :a is that place. $ is the end of file address and $! means not the end of file, so $!{...} means do what is found inside the curly braces when it is not the end of file. N means append a newline and the next line to the current line, and /\n$/!ba means branch (b) back to a unless the pattern space now ends with an empty line. So this collects all lines from a line that contains [group] to an empty line (or end of file).
/enable\s*=\s*0/!s/.*name\s*=\s*\(\S\+\).*/\1/p if the lines collected contain enable = 0 then do not substitute out the colour. Or to put it another way, if the lines collected so far do not contain enable = 0 do substitute out the colour.
If you don't want to use the record separator, you could use a dummy variable like this:
#!/usr/bin/awk -f
function endgroup() {
if (e == 1) {
print n
}
}
$1 == "name" {
n = $3
}
$1 == "enable" && $3 == 0 {
e = 0;
}
$0 == "[group]" {
endgroup();
e = 1;
}
END {
endgroup();
}
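You could actually use Bash for this.
re_enable='^enable[[:space:]]*=[[:space:]]*0$'
re_name='^name[[:space:]]*=[[:space:]]*([a-z]+)'
n=0
while read -r line; do
    if [[ $line == "[group]" ]]; then
        n=0    # a new block starts: assume it is enabled
    elif [[ $line =~ $re_enable ]]; then
        n=1    # this block is disabled
    elif [ "$n" -eq 0 ] && [[ $line =~ $re_name ]]; then
        echo "${BASH_REMATCH[1]}"
    fi
done < file
This will only work, however, if enable = 0 always appears before the name line within its block.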

Delete n1 previous lines and n2 lines following with respect to a line containing a pattern

sed -e '/XXXX/,+4d' fv.out
I have to find a particular pattern in a file and delete 5 lines above and 4 lines below it simultaneously. I found out that the command above removes the line containing the pattern and the four lines below it.
sed -e '/XXXX/,~5d' fv.out
The sed manual says that ~ represents the lines followed by the pattern. But when I tried it, it was the lines following the pattern that were deleted.
So, how do I delete 5 lines above and 4 lines below a line containing the pattern simultaneously?
One way using sed, assuming that the matches are not too close to each other:
Content of script.sed:
## If line doesn't match the pattern...
/pattern/ ! {
## Append line to 'hold space'.
H
## Copy content of 'hold space' to 'pattern space' to work with it.
g
## If there are more than 5 lines saved, print and remove the first
## one. It's like a FIFO.
/\(\n[^\n]*\)\{6\}/ {
## Delete the first '\n' automatically added by previous 'H' command.
s/^\n//
## Print until first '\n'.
P
## Delete data printed just before.
s/[^\n]*//
## Save updated content to 'hold space'.
h
}
### Added to fix an error pointed out by potong in comments.
### =======================================================
## If last line, print lines left in 'hold space'.
$ {
x
s/^\n//
p
}
### =======================================================
## Read next line.
b
}
## If line matches the pattern...
/pattern/ {
## Remove all content of 'hold space'. It has the five previous
## lines, which won't be printed.
x
s/^.*$//
x
## Read next four lines and append them to 'pattern space'.
N ; N ; N ; N
## Delete all.
s/^.*$//
}
Run like:
sed -nf script.sed infile
A solution using awk:
awk '$0 ~ "XXXX" { lines2del = 5; nlines = 0; }
nlines == 5 { print lines[NR%5]; nlines-- }
lines2del == 0 { lines[NR%5] = $0; nlines++ }
lines2del > 0 { lines2del-- }
END { while (nlines-- > 0) { print lines[(NR - nlines) % 5] } }' fv.out
Update:
This is the script explained:
I remember the last 5 lines in the array lines using rotating indexes (NR%5; NR is the record number, in this case the line number).
If I find the pattern in the current line ($0 ~ "XXXX"; $0 is the current record, in this case a line, and ~ is the Extended Regular Expression match operator), I reset the number of lines read and note that I have 5 lines to delete (including the current line).
If I already have 5 lines buffered, I print the oldest one (its slot is about to be reused) and decrement the number of lines.
If I do not have lines to delete (which is also true if I had read 5 lines), I put the current line in the buffer and increment the number of lines. Note how the number of lines is decremented and then incremented if a line is printed.
If lines need to be deleted, I do not print anything and decrement the number of lines to delete.
At the end of the script, I print all the lines that are still in the array.
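To watch the rotating buffer at work, the script can be run on numbered lines. The sketch below wraps it in a hypothetical function delete_around that takes the pattern as an argument instead of hard-coding XXXX. The input is produced with seq, and line 12 is marked as the match.

```shell
# Hypothetical wrapper around the script above: $1 is the regex, input on stdin.
delete_around() {
    awk -v pat="$1" '
        $0 ~ pat       { lines2del = 5; nlines = 0 }           # drop the buffer, start deleting
        nlines == 5    { print lines[NR%5]; nlines-- }         # buffer full: release the oldest line
        lines2del == 0 { lines[NR%5] = $0; nlines++ }          # store the current line
        lines2del > 0  { lines2del-- }                         # swallow the match and 4 more lines
        END { while (nlines-- > 0) print lines[(NR - nlines) % 5] }
    '
}

seq 1 20 | sed 's/^12$/12 XXXX/' | delete_around XXXX
```

Lines 7 to 11 (the five above), line 12 itself and lines 13 to 16 (the four below) are removed, so the output is 1 through 6 followed by 17 through 20.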
My original version of the script was the following, but I ended up optimizing it to the above version:
awk '$0 ~ "XXXX" { lines2del = 5; nlines = 0; }
lines2del == 0 && nlines == 5 { print lines[NR%5]; lines[NR%5] = $0 }
lines2del == 0 && nlines < 5 { lines[NR%5] = $0; nlines++ }
lines2del > 0 { lines2del-- }
END { while (nlines-- > 0) { print lines[(NR - nlines) % 5] } }' fv.out
awk is a great tool! I strongly recommend that you find a tutorial on the net and read it. One important thing: awk works with Extended Regular Expressions (ERE). Their syntax is a little different from the Basic Regular Expressions (BRE) that sed uses by default, but almost everything that can be done with BRE can be done with ERE.
The idea is to read 5 lines without printing them. If you find the pattern, delete the unprinted lines and the 4 lines below. If you do not find the pattern, remember the current line and print the 1st unprinted line. At the end, print what is unprinted.
sed -n -e '/XXXX/,+4{x;s/.*//;x;d}' -e '1,5H' -e '6,${H;g;s/\n//;P;s/[^\n]*//;h}' -e '${g;s/\n//;p;d}' fv.out
Of course, this only works if you have one occurrence of your pattern in the file. If you have many, you need to read 5 new lines after finding your pattern, and it gets complicated if you again have your pattern in those lines. In this case, I think sed is not the right tool.
This might work for you:
sed 'H;$!d;g;s/\([^\n]*\n\)\{5\}[^\n]*PATTERN\([^\n]*\n\)\{5\}//g;s/.//' file
or this:
awk --posix -vORS='' -vRS='([^\n]*\n){5}[^\n]*PATTERN([^\n]*\n){5}' 1 file
A more efficient sed solution:
sed ':a;/PATTERN/,+4d;/\([^\n]*\n\)\{5\}/{P;D};$q;N;ba' file
If you are happy to output the result to a file instead of stdout, vim can do it quite efficiently:
vim -c 'g/pattern/-5,+4d' -c 'w! outfile|q!' infile
or
vim -c 'g/pattern/-5,+4d' -c 'x' infile
to edit the file in-place.

Resources