Deleting lines with more than 30% lowercase letters - bash

I am trying to process some data, but I'm unable to find a working solution for my problem. I have a file which looks like:
>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
and many lines more....
I want to filter out every record, that is, the sequence lines together with their header (header lines start with >), whose sequence (the lines not starting with >) consists of 30 percent or more lowercase letters. A sequence can span multiple lines.
So after running some command xy, the output should look like:
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
I tried some mix of a while loop for reading the input file and then working with awk, grep, sed but there was no good outcome.

Here's one idea, which sets the record separator to ">" to treat each header with its sequence lines as a single record.
Because the input starts with a ">", which causes an initial empty record, we guard the computation with NR > 1 (record number greater than one).
To count the number of characters we add the lengths of all the lines after the header. To count the number of lower-case characters, we save the string in another variable and use gsub to replace all the lower-case letters with nothing --- just because gsub returns the number of substitutions made, which is a convenient way of counting them.
Finally we check the ratio and print or not (adding back the initial ">" when we do print).
BEGIN { RS = ">" }
NR > 1 {
total_cnt = 0
lower_cnt = 0
for (i=2; i<=NF; ++i) {
total_cnt += length($i)
s = $i
lower_cnt += gsub(/[a-z]/, "", s)
}
ratio = lower_cnt / total_cnt
if (ratio < 0.3) print ">"$0
}
$ awk -f seq.awk seq.txt
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
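For comparison, the same record-based logic can be sketched in Python. This is only an illustration of the steps described above, not part of the original answer; the function name `filter_records` and the `threshold` parameter are invented for the sketch.

```python
def filter_records(text, threshold=0.3):
    """Keep records whose sequence is less than `threshold` lowercase letters."""
    kept = []
    # Splitting on ">" mirrors RS = ">"; the first chunk is empty,
    # just like awk's first record, so it is skipped.
    for record in text.split(">")[1:]:
        header, _, body = record.partition("\n")
        sequence = body.replace("\n", "")
        if not sequence:
            continue  # header without sequence lines: nothing to measure
        lower = sum(1 for ch in sequence if ch.islower())
        if lower / len(sequence) < threshold:
            kept.append(">" + header + "\n" + body.rstrip("\n"))
    return kept


sample = """>ram
cacacacacacacacacatatacacatacacatacacacacacacacacacacacacaca
cacacacacacacaca
>pam
GAATGTCAAAAAAAAAAAAAAAAActctctct
>sam
AATTGGCCAATTGGCAATTCCGGAATTCaattggccaattccggaattccaattccgg
"""
print("\n".join(filter_records(sample)))
```

Unlike the awk version, the sketch skips a header that has no sequence lines instead of dividing by zero.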

Or:
awk '{n=length(gensub(/[A-Z]/,"","g"));if(NF && n/length*100 < 30)print a $0;a=RT}' RS='>[a-z]+\n' file
RS='>[a-z]+\n' - Sets the record separator to the line containing '>' and name
RT - This value is set by what is matched by RS above
a=RT - save previous RT value
n=length(gensub(/[A-Z]/,"","g")); - get the length of lower case chars
if(NF && n/length*100 < 30)print a $0; - check we have a value and that the percentage is less than 30 for lower case chars

awk '/^>/{b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
H=$0;B="";next}
{B=( (B != "") ? B "\n" : "" ) $0}
END{ b=B;gsub( /[A-Z]/,"",b);
if( length( b) < length( B) * 0.3) print H "\n" B
}' YourFile
Quick and dirty; a function would better suit the duplicated test-and-print logic.

Nowadays I would not use sed or awk anymore for anything longer than 2 lines.
#! /usr/bin/perl
use strict; # Force variable declaration.
use warnings; # Warn about dangerous language use.
sub filter # Declare a subroutine, a function called `filter`.
{
my ($header, $body) = @_; # Give the first two function arguments the names header and body.
my $lower = $body =~ tr/a-z//; # Count the characters a-z (tr with an empty replacement list only counts).
print $header, $body, "\n" # Print header, body and newline,
unless $lower / length ($body) >= 0.3; # unless 30% or more of the characters are lowercase.
}
}
my ($header, $body); # Declare two variables for header and body.
while (<>) { # Loop over all lines from stdin or a file given in the command line.
if (/^>/) { # If the line starts with >,
filter ($header, $body) # call filter with header and body,
if defined $header; # if header is defined, which is not the case at the beginning of the file.
($header, $body) = ($_, ''); # Assign the current line to header and an empty string to body.
} else {
chomp; # Remove the newline at the end of the line.
$body .= $_; # Append the line to body.
}
}
filter ($header, $body); # Filter the last record.

Related

How can I read a CSV file if only non-empty fields are wrapped by double quotes?

I'm trying to read a CSV file in a Bash script. I achieved that successfully using gawk and specifying FPAT like:
gawk -v LOGFILE="${LOGFILE}" 'BEGIN {
FPAT = "([^,]+)|(\"[^\"]+\")"
}
NR == 1{
# doing some logic with header
}
NR >= 2{
# doing some logic with fields
}' <filename>
The problem here is, the file contains data like:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
With this data the field positions come out wrong, because the empty fields between consecutive commas are skipped.
For example, it reports "7865431234" at the 3rd position, whereas it is actually at the 6th.
Can anyone suggest the changes to get the correct position of fields?
Your FPAT requires each field to contain at least one character, but you want to recognize empty fields with zero characters. Add an alternative to FPAT that allows zero characters:
gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("[%s]", $i); print "" }'
Note the extra | at the end of FPAT. The action simply identifies the record number, the number of fields, and surrounds the value of each field with square brackets.
When your data string is provided to that script, the output is:
1:8:["RAM"]["31st street, Bengaluru, India"][][][]["7865431234"][]["VALID"]
That shows the four empty fields quite clearly.
Now all you have to do is deal with:
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,"",,,"INVALID"
where there are double quotes inside the quoted value. That's not dreadfully hard to manage:
gawk 'BEGIN { FPAT = "([^,]+)|(\"([^\"]|\"\")*\")[^,]*|" }
{ printf "%d:%d:", NR, NF; for (i = 1; i <= NF; i++) printf("%d[%s]", i, $i); print "" }' "$@"
The FPAT says that a field is:
a sequence of non-commas,
or it is a field started with a double quote, containing zero or more instances of either:
a non-quote, or
two double quotes
followed by a double quote and optional non-comma data
or it is empty
Note that the 'optional non-comma data' should be empty, and only appears in malformed CSV data.
Given input data:
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"
"Mr ""Manipulator"", the Artisan","29th Street, Delhi, India",,,,,,"INVALID"
"Some","","Empty","",Fields "" Wrapped,"",in quotes
"Malformed" CSV,Data,"Note it has data after" a close quote,"and before a comma,",,"INVALID"
This produces:
1:8:1["RAM"]2["31st street, Bengaluru, India"]3[]4[]5[]6["7865431234"]7[]8["VALID"]
2:8:1["Mr ""Manipulator"", the Artisan"]2["29th Street, Delhi, India"]3[]4[]5[]6[]7[]8["INVALID"]
3:7:1["Some"]2[""]3["Empty"]4[""]5[Fields "" Wrapped]6[""]7[in quotes]
4:6:1["Malformed" CSV]2[Data]3["Note it has data after" a close quote]4["and before a comma,"]5[]6["INVALID"]
Note that the field numbers are included as a prefix to the bracketed data (so I tweaked the print format slightly).
About the only format this doesn't handle is one where newlines can be embedded in the data for a field — by the nature of the line-based input, it assumes that no field is split over multiple lines. (It also means it won't properly recognize a field that starts with a double quote and doesn't have a matching double quote before the end of the line. I suppose you could add an alternative to recognize that. It would be better just to make the data right.)
Note the advice in Sobrique's answer to use a tool designed to handle CSV for handling CSV. That is generally a good idea, and the more complex the sets of variations you have to deal with, the better an idea it is. This is close to as complicated a regex as you should consider using. Also note that although RFC 4180 defines a version of CSV formally and rigorously, there are multiple programs (including MS Office) that handle different but related formats.
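To illustrate the "use a real CSV tool" advice, here is a minimal sketch using Python's standard csv module on the sample line from the question. It is an alternative to the FPAT approach, not a replacement for the gawk script; the variable names are arbitrary.

```python
import csv
import io

data = '"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"\n'

# csv.reader handles quoted fields, embedded commas and empty fields,
# so every field keeps its true position.
rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    for position, field in enumerate(row, start=1):
        print(position, repr(field))
```

The reader also copes with doubled quotes inside quoted fields and with embedded newlines, the two cases the regex approach struggles with.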
If you have CSV that needs parsing, then whilst you can usually hack it with a regex, it's far easier to use a parser.
Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV -> new;
open ( my $input, '<', 'flarg.csv' ) or die $!;
while ( my $row = $csv -> getline ( $input ) ) {
if ( $. == 1 ) {
# do first row stuff;
print "Header: ", join( ",", @$row ), "\n";
}
else {
print join "\n", @$row;
}
}
Or simpler yet - use Text::ParseWords which is core.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
while ( my $line = <DATA> ) {
my @fields = parse_line(',', 1, $line);
print join "\n", @fields;
}
__DATA__
"RAM","31st street, Bengaluru, India",,,,"7865431234",,"VALID"

awk: Interpreting strings as mathematical expressions

Context: I have an input file that contains parameters with associated values followed by literal mathematical expressions such as:
PARAMETERS DEFINITION
A = 5; B = 2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
...
and I would like to get the strings of the second part to be interpreted as mathematical expressions so that I get the results of the expressions in my output file:
...
MATHEMATICAL EXPRESSIONS
10
0.2
...
What I did already: So far, using awk, I store all the parameters names and their corresponding values in two distinct arrays. I then replace each parameter with its value so that I am now in a similar situation as the author of this thread.
However, the answers s/he gets are not in awk except for the last one which is very specific to her/his situation, and hard to understand for me as a beginner with awk and shell scripting.
What I tried afterwards: As I have no clue how to do this in awk, the idea I had was to store the new field value in a variable, then use a shell command within the awk script like this:
#!bin/awk -f
BEGIN{}
{
myExpression=$1
system("echo $myExpression | bc")
}
END{}
This, unfortunately does not work as the variable is somehow not recognized by the echo command.
What I would like:
I would prefer a solution using awk alone with no call to external functions, however, I am not against one using a shell command if it is simpler.
EDIT Taking into account all the comments so far, I will be more precise, my input files look more like this:
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
The names of the parameters are complex enough so that any match of the string: "[param#]" within the document corresponds to a parameter that I want changed for its value.
This is how I store the parameters and their values in arrays:
{
if (match($2,/PARAMETERS_DEFINITION/) != 0) {paramSwitch = 1}
if (match($2,/MATHEMATICAL_EXPRESSIONS/) != 0) {paramSwitch = 0}
if (paramSwitch == 1)
{
parameterName[numOfParam] = $1 ;
parameterVal[numOfParam] = $3 ;
numOfParam += 1
}
}
Instead of this:
{
myExpression=$1
system("echo $myExpression | bc")
}
I think you'd want this:
{
myExpression=$1
system("echo " myExpression " | bc")
}
That's because in awk, assignments do not end up as environment variables, and putting strings next to each other concatenates them.
You are asking how to make awk interpret strings as mathematical expressions. This functionality is usually called eval, and (AFAIK) awk has no such function. Your question is therefore a typical XY problem.
The right tool for this is bc, where you need to modify (nearly) nothing: simply feed bc your input, making sure the variable names are lowercase, as in the following input (edited from your example):
#PARAMETERS DEFINITION
a=5; b=2; c=1.5; d=7.5
#MATHEMATICAL EXPRESSIONS
a*b
c/d
Running it like this:
bc -l < inputfile
produces
10
.20000000000000000000
EDIT
For the new input data in your edit, the following command
grep '\[' inputfile | sed 's/[][]//g' | bc -l
for the input
PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
produces the following output:
10
.20000000000000000000
That is, grep keeps only the lines that contain [ (every parameter definition and every expression), sed removes the [ and ] characters, and the result is the following bc program:
param1 = 5
param2 = 2
param3 = 1.5
param4 = 7.5
param1*param2
param3/param4
and send the whole "program" to bc...
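If bc is not available, the same pipeline (collect the definitions, substitute them into the expression lines, evaluate) can be sketched in Python. This is an illustrative alternative rather than anything from the answer above: the names `evaluate` and `process` are invented, and the sketch assumes that every expression line starts with a [param] reference and uses only +, -, * and /.

```python
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(node):
    """Evaluate a parsed expression built only from numbers and + - * /."""
    if isinstance(node, ast.Expression):
        return evaluate(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -evaluate(node.operand)
    raise ValueError("unsupported expression")


def process(lines):
    params, results = {}, []
    for line in lines:
        definition = re.fullmatch(r"\s*(\[\w+\])\s*=\s*([-+]?[\d.]+)\s*", line)
        if definition:
            params[definition.group(1)] = definition.group(2)
        elif line.lstrip().startswith("["):
            # Replace each [name] literally by its value, then evaluate.
            expr = re.sub(r"\[\w+\]", lambda m: params[m.group(0)], line)
            results.append(evaluate(ast.parse(expr.strip(), mode="eval")))
    return results


sample = """PARAMETERS_DEFINITION
[param1] = 5
[param2] = 2
[param3] = 1.5
[param4] = 7.5
MATHEMATICAL_EXPRESSIONS
[param1]*[param2]
some text containing also numbers and formulas that I do not want to be affected.
e.g: 1.45*2.6 = x, de(x)/dx=e(x) ; blah,blah,blah
[param3]/[param4]
"""
print(process(sample.splitlines()))
```

Walking the syntax tree instead of calling eval() keeps the free-text lines, and anything else that is not plain arithmetic, from being executed.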
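Using BIDMAS as a basis, I have created this arithmetic evaluator in awk.
I have not included brackets (or indices) yet, as they will require some extra effort, but I may add them later.
For the four operators +, -, * and / this awk script works much as bc does. Negative operands such as 2*-3 are not handled, because every - is treated as a subtraction.
No system call required, all in awk.
Generic version for all applications:
awk '{
    b = 0
    # Split on the lowest-precedence operator first, then work inwards.
    # Counted loops are used because the order of for (i in a) is unspecified.
    na = split($0, a, "+")
    for (i = 1; i <= na; i++) {
        ns = split(a[i], s, "-")
        for (j = 1; j <= ns; j++) {
            nm = split(s[j], m, "*")
            for (k = 1; k <= nm; k++) {
                nd = split(m[k], d, "/")
                for (l = 2; l <= nd; l++) d[1] = d[1] / d[l]
                m[k] = d[1]
                if (k > 1) m[1] = m[1] * m[k]
            }
            s[j] = m[1]
            if (j > 1) s[1] = s[1] - s[j]
        }
        b += s[1]
    }
    print b
}' file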
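For your specific example:
awk '
/MATHEMATICAL_EXPRESSIONS/ { z = 1; next }
!z && / = / { split($0, y, " = "); x[y[1]] = y[2] }
z && /\[/ && /[-+*\/]/ {
    # Substitute each parameter literally: used as a regex, "[param1]"
    # would be a bracket expression matching single characters.
    for (n in x)
        while ((p = index($0, n)) > 0)
            $0 = substr($0, 1, p - 1) x[n] substr($0, p + length(n))
    b = 0
    na = split($0, a, "+")
    for (i = 1; i <= na; i++) {
        ns = split(a[i], s, "-")
        for (j = 1; j <= ns; j++) {
            nm = split(s[j], m, "*")
            for (k = 1; k <= nm; k++) {
                nd = split(m[k], d, "/")
                for (l = 2; l <= nd; l++) d[1] = d[1] / d[l]
                m[k] = d[1]
                if (k > 1) m[1] = m[1] * m[k]
            }
            s[j] = m[1]
            if (j > 1) s[1] = s[1] - s[j]
        }
        b += s[1]
    }
    print b
}' file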
Awk has no real eval, but it does convert a string to a number automatically when the context requires it; adding +0 forces the conversion.
Here is what I came up with (detailed version below), using a file named awkinput that contains your example input:
awk '/[A-Z]=[0-9.]+;?/ { for (i=1;i<=NF ;i++) { print "working on "$i; split($i,fields,"="); sub(/;/,"",fields[2]); params[fields[1]]=strtonum(fields[2]) } }; /[A-Z][*\/+-][A-Z]/ { for (p in params) { gsub(p, params[p],$0); }; system("echo \"" $0 "\" | bc -ql") }' awkinput
Detailed:
/[A-Z]=[0-9.]+;?/ { # if we match something like A=4.2, with or without a ; at the end
for (i=1;i<=NF ;i++) { # loop through the fields (separated by space, the default Field Separator of awk)
print "working on "$i; # report what we are doing
split($i,fields,"="); # split into an array to get param and value
sub(/;/,"",fields[2]); # remove the trailing ; if there is one
params[fields[1]]=strtonum(fields[2]) # new array of parameters where the values are numeric (strtonum is gawk-only)
}
}
/[A-Z][*\/+-][A-Z]/ { # when the line matches a math operation with (at least) one param on each side
for (p in params) { # loop over the known params
gsub(p, params[p],$0); # replace every occurrence of the param with its value
};
system("echo \"" $0 "\" | bc -ql") # print the result (no way to avoid the system call here)
}
Drawback:
An expression of the form AB*C would be resolved to 52*1.5
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
$ awk -vRS='[= ;\n]' '{if ($0 ~ /[0-9]/){a[x] = $0; print x"="a[x]}else{x=$0}}/MATHEMATICAL/{print "MATHEMATICAL EXPRESSIONS"}{if ($0~"*") print a[substr($0,1,1)] * a[substr($0,3,1)]}{if ($0~"/") print a[substr($0,1,1)] / a[substr($0,3,1)]}' test
A=5
B=2
C=1.5
D=7.5
MATHEMATICAL EXPRESSIONS
10
0.2
Formatted nicely:
$ cat test.awk
# Store all variables in an array
{
if ($0 ~ /[0-9]/){
a[x] = $0;
print x " = " a[x] # Print the keys & values
}
else{
x = $0
}
}
# Print header
/MATHEMATICAL/ {print "MATHEMATICAL EXPRESSIONS"}
# Do the maths (case can work too, but it's not as widely available)
{
if ($0~"*")
print a[substr($0,1,1)] * a[substr($0,3,1)]
}
{
if ($0~"/")
print a[substr($0,1,1)] / a[substr($0,3,1)]
}
{
if ($0~"+")
print a[substr($0,1,1)] + a[substr($0,3,1)]
}
{
if ($0~"-")
print a[substr($0,1,1)] - a[substr($0,3,1)]
}
$ cat test
PARAMETERS DEFINITION
A=5; B=2; C=1.5; D=7.5
MATHEMATICAL EXPRESSIONS
A*B
C/D
D+C
C-A
$ awk -f test.awk -vRS='[= ;\n]' test
A = 5
B = 2
C = 1.5
D = 7.5
MATHEMATICAL EXPRESSIONS
10
0.2
9
-3.5

How to merge rows from the same column using unix tools

I have a text file that looks like the following:
1000000    45    M    This is a line        This is another line  Another line
                      that breaks into      that also breaks      that has a blank
                      multiple rows         into multiple rows -  row below.
                                            How annoying!
1000001    50    F    I am another          I am well behaved.
                      column that has
                      text spanning
                      multiple rows
I would like to convert this into a csv file that looks like:
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!
== EDIT ==
I've modified the text file to have \n between rows - maybe this will be helpful?
== EDIT 2 ==
I've modified the text file to have a blank row.
Using GNU awk
gawk '
BEGIN { FIELDWIDTHS="11 6 5 22 22" }
length($1) == 11 {
if ($1 ~ /[^[:blank:]]/) {
if (f1) print_line()
f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
}
else {
f4 = f4" "$4; f5 = f5" "$5
}
}
function rtrim(str) {
sub(/[[:blank:]]+$/, "", str)
return str
}
function print_line() {
gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3),f4,f5
}
END {if (f1) print_line()}
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "
I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
Here's a Perl script that does what you want. It uses unpack to split the fixed width columns into fields, adding to the previous fields if there is no data in the first column.
As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates a format string like A11 A6 A5 A22 A21, where "A" means any character and the numbers specify the width of each field.
Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, remove the map { ... } stage in both places and keep only grep { $_ ne "" }:
use strict;
use warnings;
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;
while(<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        # grep drops the empty fields, map quotes the ones containing whitespace
        print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
        @cols = @tmp;
    }
    else {
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}
print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
Output:
1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
Using this awk:
awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5];
for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] $i}
END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
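You can write a script in Python that does that. Read each line and cut it into its columns; if the first column is filled in, the line starts a new record, otherwise append each piece to the matching column of the previous record. Finally use the csv writer to write the result set to a file.
Something along the lines of:
import csv

# Fixed-width slicing is used instead of line.split(), because splitting on
# whitespace cannot tell which column a continuation fragment belongs to.
# The widths are an assumption taken from the sample (same as the gawk answer).
WIDTHS = [11, 6, 5, 22, 22]
filename = 'input.txt'            # placeholder input path
csvOutputFileName = 'output.csv'  # placeholder output path

def split_fixed(line):
    """Slice a line into stripped fixed-width columns."""
    cols, start = [], 0
    for width in WIDTHS:
        cols.append(line[start:start + width].strip())
        start += width
    return cols

results = []
with open(filename) as inputFile:
    for line in inputFile:
        temp = split_fixed(line.rstrip('\n'))
        if not any(temp):
            continue                  # skip blank rows
        if temp[0] or not results:
            results.append(temp)      # first column filled in: new record
        else:
            lastRow = results[-1]     # continuation of the previous record
            for i, text in enumerate(temp):
                if text:
                    lastRow[i] = (lastRow[i] + ' ' + text).strip()

with open(csvOutputFileName, 'w', newline='') as outFile:
    csv.writer(outFile).writerows(results)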

awk script: removing line previous to pattern match and after, until a blank line

I began learning awk yesterday in attempt to solve this problem (and learn a useful new language). At first I tried using sed, but soon realized it was not the correct tool to access/manipulate lines previous to a pattern match.
I need to:
Remove all lines containing "foo" (trivial on its own, but not whilst keeping track of previous lines)
Find lines containing "bar"
Remove the line previous to the one containing "bar"
Remove all lines after and including the line containing "bar" until we reach a blank line
Example input:
This is foo stuff
I like food!
It is tasty!
stuff
something
stuff
stuff
This is bar
Hello everybody
I'm Dr. Nick
things
things
things
Desired output:
It is tasty!
stuff
something
stuff
things
things
things
My attempt:
{
valid=1; #boolean variable to keep track if x is valid and should be printed
if ($x ~ /foo/){ #x is valid unless it contains foo
valid=0; #invalidate x so that is doesn't get printed at the end
next;
}
if ($0 ~ /bar/){ #if the current line contains bar
valid = 0; #x is invalid (don't print the previous line)
while (NF == 0){ #don't print until we reach an empty line
next;
}
}
if (valid == 1){ #x was a valid line
print x;
}
x=$0; #x is a reference to the previous line
}
Super bonus points (not needed to solve my problem, but I'm interested in learning how this would be done):
Ability to remove n lines before pattern match
Option to include/exclude the blank line in output
Below is an alternative awk script using patterns & functions to trigger state changes and manage output, which produces the same result.
function show_last() {
if (!skip && !empty) {
print last
}
last = $0
empty = 0
}
function set_skip_empty(n) {
skip = n
last = $0
empty = NR <= 0
}
BEGIN { set_skip_empty(0) }
END { show_last() ; }
/foo/ { next; }
/bar/ { set_skip_empty(1) ; next }
/^ *$/ { if (skip > 0) { set_skip_empty(0); next } else show_last() }
!/^ *$/{ if (skip > 0) { next } else show_last() }
This works by retaining the "current" line in a variable last, which is either
ignored or output, depending on other events, such as the occurrence of foo and bar.
The empty variable keeps track of whether or not the last variable is really
a blank line, or simply empty from inception (e.g., BEGIN).
To accomplish the "bonus points", replace last with an array of lines which could then accumulate N number of lines as desired.
To exclude blank lines (such as the one that terminates the bar filter), replace the empty test with a test on the length of the last variable. In awk, empty lines have no length (but, lines with blanks or tabs *do* have a length).
function show_last() {
if (!skip && length(last) > 0) {
print last
}
last = $0
}
will result in no blank lines of output.
Read each blank-lines-separated paragraph in as a string, then do a gsub() removing the strings that match the RE for the pattern(s) you care about:
$ awk -v RS= -v ORS="\n\n" '{ gsub(/[^\n]*foo[^\n]*\n|\n[^\n]*\n[^\n]*bar.*/,"") }1' file
It is tasty!
stuff
something
stuff
things
things
things
To remove N lines, change [^\n]*\n to ([^\n]*\n){N}.
To not remove part of the RE use GNU awk and use gensub() instead of gsub().
To remove the blank lines, change the value of ORS.
Play with it...
This awk should work without storing full file in memory:
awk '/bar/{skip=1;next} skip && p~/^$/ {skip=0} NR>1 && !skip && !(p~/foo/){print p} {p=$0}
END{if (!skip && !(p~/foo/)) print p}' file
It is tasty!
stuff
something
stuff
things
things
things
One way:
awk '
/foo/ { next }
flag && NF { next }
flag && !NF { flag = 0 }
/bar/ { delete line[idx]; idx -= 1; flag = 1; next }
{ line[++idx] = $0 }
END {
for (x=1; x<=idx; x++) print line[x]
}' file
It is tasty!
stuff
something
stuff
things
things
things
If line contains foo skip it.
If flag is enabled and line is not blank skip it.
If flag is enabled and line is blank disable the flag.
If line contains bar delete the previous stored line, decrement the counter, enable the flag and skip it.
Store all lines that make it this far in an array indexed by an incrementing number.
In the END block print the lines.
Side Notes:
To remove n number of lines before a pattern match, you can create a loop. Start with current line number and using a reverse for loop you can remove lines from your temporary cache (array). You can then subtract n from your self defined counter variable.
To include or exclude blank lines you can use the NF variable. For a typical line, NF variable is set to number of fields based on your field separator. For blank lines this variable is 0. For example, if you modify the line above END block to NF { line[++idx] = $0 } in the answer above you will see we have bypassed all blank lines from output.
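The two side notes (removing n lines before the match, and optionally keeping the blank line) can be combined in one streaming sketch. It is written in Python purely for illustration; `filter_lines`, `n_before` and `keep_blank` are made-up names, and only the last n lines are held back instead of the whole file.

```python
from collections import deque


def filter_lines(lines, n_before=1, keep_blank=True):
    """Drop 'foo' lines, the n_before lines preceding a 'bar' line,
    and everything from 'bar' up to the next blank line."""
    out = []
    pending = deque()  # lines held back: a later 'bar' may still remove them
    skipping = False
    for line in lines:
        if skipping:
            if line.strip():
                continue          # still inside the removed block
            skipping = False      # the blank line ends the block
            if not keep_blank:
                continue
        elif "foo" in line:
            continue
        elif "bar" in line:
            for _ in range(min(n_before, len(pending))):
                pending.pop()     # discard the lines just before 'bar'
            skipping = True
            continue
        pending.append(line)
        while len(pending) > n_before:
            out.append(pending.popleft())
    out.extend(pending)
    return out


sample = ["This is foo stuff", "I like food!", "It is tasty!", "stuff",
          "something", "stuff", "stuff", "This is bar", "Hello everybody",
          "I'm Dr. Nick", "", "things", "things", "things"]
print("\n".join(filter_lines(sample, n_before=1, keep_blank=False)))
```

The sample assumes a blank line after "I'm Dr. Nick", which is what ends the removed block.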

search (e.g. awk, grep, sed) for string, then look for X lines above and another string below

I need to be able to search for a string (let's use 4320101), print the 20 lines above it, and then keep printing the lines after it until a second string (</eventUpdate>) is found.
For example:
Random text I do not want or blank line
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
Random text I do not want or blank line
I just want the following result outputted to a file:
16 Apr 2013 00:14:15
id="4320101"
</eventUpdate>
There are multiple examples of these groups of text in a file that I want.
I tried using this below:
cat filename | grep "</eventUpdate>" -A 20 4320101 -B 100 > greptest.txt
But it only ever shows for 20 lines either side of the string.
Notes:
- the line number the text is on is inconsistent so I cannot go off these, hence why I am using -A 20. - ideally I'd rather have it so when it searches after the string, it stops when it finds and then carries on searching.
Summary: find 4320101, output 20 lines above 4320101 (or one line of white space), and then output all lines below 4320101 up to
</eventUpdate>
Doing research I am unsure of how to get awk, nawk or sed to work in my favour to do this.
This might work for you (GNU sed):
sed ':a;s/\n/&/20;tb;$!{N;ba};:b;/4320101/!D;:c;n;/<\/eventUpdate>/!bc' file
EDIT:
:a;s/\n/&/20;tb;$!{N;ba}; this keeps a window of 20 lines in the pattern space (PS)
:b;/4320101/!D; this moves the above window through the file until the pattern 4320101 is found.
:c;n;/<\/eventUpdate>/!bc the 20 line window is printed and any subsequent line until the pattern <\/eventUpdate> is found.
Here is an ugly awk solution :)
awk 'BEGIN{last=1}
{if((length($0)==0) || ($0 ~ /Random/))last=NR}
/4320101/{flag=1;
if((NR-last)>20) last=NR-20;
cmd="sed -n \""last+1","NR-1"p \" input.txt";
system(cmd);
}
flag==1{print}
/eventUpdate/{flag=0}' <filename>
Basically, it keeps track of the last blank line, or the last line containing the Random pattern, in the variable last. When 4320101 is found, it prints from 20 lines back or from last, whichever is nearer, through a sed command run with system(), and sets the flag. The flag causes the following lines to be printed until eventUpdate is found. I have not tested it, but it should work.
Look-behind in sed/awk is always tricky.. This self contained awk script basically keeps the last 20 lines stored, when it gets to 4320101 it prints these stored lines, up to the point where the blank or undesired line is found, then it stops. At that point it switches into printall mode and prints all lines until the eventUpdate is encountered, then it prints that and quits.
awk '
function store( line ) {
for( i=1; i <= 20; i++ ) {
last[i-1] = last[i];
};
last[20]=line;
};
function purge() {
stop=-1;
for( i=20; i >= 0; i-- ) {
if( length(last[i])==0 || last[i] ~ "Random" ) {
stop=i;
break
};
};
for( i=(stop+1); i <= 20; i++ ) {
print last[i];
};
};
{
store($0);
if( /4320101/ ) {
purge();
printall=1;
next;
};
if( printall == 1) {
print;
if( /eventUpdate/ ) {
exit 0;
};
};
}' test
Let's see if I understand your requirements:
You have two strings, which I'll call KEY and LIMIT. And you want to print:
1. At most 20 lines before a line containing KEY, but stopping if there is a blank line.
2. All the lines between a line containing KEY and the following line containing LIMIT. (This ignores your requirement that there be no more than 100 such lines; if that's important, it's relatively straightforward to add.)
The easiest way to accomplish (1) is to keep a circular buffer of 20 lines, and print it out when you hit key. (2) is trivial in either sed or awk, because you can use the two-address form to print the range.
So let's do it in awk:
#file: extract.awk
# Initialize the circular buffer
BEGIN { count = 0; }
# When we hit an empty line, clear the circular buffer
length() == 0 { count = 0; next; }
# When we hit `key`, print and clear the circular buffer
index($0, KEY) { for (i = count < 20 ? 0 : count - 20; i < count; ++i)
print buf[i % 20];
count = 0;
}
# While we're between key and limit, print the line
index($0, KEY),index($0, LIMIT) { print; next; }
# Otherwise, save the line
{ buf[count++ % 20] = $0; }
In order to get that to work, we need to set the values of KEY and LIMIT. We can do that on the command line:
awk -v "KEY=4320101" -v "LIMIT=</eventUpdate>" -f extract.awk $FILENAME
Notes:
I used index($0, foo) instead of the more usual /foo/, because it avoids having to escape regex special characters, and there is nowhere in the requirements that regexen are even desired. index(haystack, needle) returns the index of needle in haystack, with indices starting at 1, or 0 if needle is not found. Used as a true/false value, it is true if needle is found.
next causes processing of the current line to end. It can be quite handy, as this little program shows.
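The circular-buffer idea translates almost directly to other languages. Below is a rough Python sketch of the same approach, for illustration only: `extract`, `key`, `limit` and `before` are names chosen here, and `collections.deque` with `maxlen` stands in for the modulo-indexed array.

```python
from collections import deque


def extract(lines, key, limit, before=20):
    """Yield up to `before` lines preceding each line containing `key`
    (never reaching back past a blank line), then all lines through `limit`."""
    buffer = deque(maxlen=before)  # circular buffer: old lines fall off the left
    printing = False
    for line in lines:
        if printing:
            yield line
            if limit in line:
                printing = False
            continue
        if line.strip() == "":
            buffer.clear()         # a blank line resets the look-behind window
            continue
        if key in line:
            yield from buffer
            buffer.clear()
            yield line
            printing = limit not in line
            continue
        buffer.append(line)


sample = [
    "Random text I do not want",
    "",
    "16 Apr 2013 00:14:15",
    'id="4320101"',
    "</eventUpdate>",
    "Random text I do not want",
]
for line in extract(sample, "4320101", "</eventUpdate>"):
    print(line)
```

As in the awk version, plain substring tests are used rather than regular expressions, and unwanted text is only excluded when a blank line separates it from the group.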
You can try something like this -
awk '{
a[NR] = $0
}
/<\/eventUpdate>/ {
x = NR
}
END {
for (i in a) {
if (a[i]~/4320101/) {
for (j=i-20;j<=x;j++) {
print a[j]
}
}
}
}' file
The simplest way is to use 2 passes of the file - the first to identify the line numbers in the range within which your target regexp is found, the second to print the lines in the selected range, e.g.:
awk '
NR==FNR {
if ($0 ~ /\<4320101\>/) {
for (i=NR-20;i<NR;i++)
range[i]
inRange = 1
}
if (inRange) {
range[NR]
}
if ($0 ~ /<\/eventUpdate>/) {
inRange = 0
}
next
}
FNR in range
' file file

Resources