Remove all lines except the last which start with the same string - bash

I'm using awk to process a file to filter lines to specific ones of interest. With the output which is generated, I'd like to be able to remove all lines except the last which start with the same string.
Here's an example of what is generated:
this is a line
duplicate remove me
duplicate this should go too
another unrelated line
duplicate but keep me
example remove this line
example but keep this one
more unrelated text
Lines 2 and 3 should be removed because they start with duplicate, as does line 5. Therefore line 5 should be kept, as it is the last line starting with duplicate.
The same follows for line 6, since it begins with example, as does line 7. Therefore line 7 should be kept, as it is the last line which starts with example.
Given the example above, I'd like to produce the following output:
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text
How could I achieve this?
I tried the following, however it doesn't work correctly:
awk -f initialProcessing.awk largeFile | awk '{currentMatch=$1; line=$0; getline; nextMatch=$1; if (currentMatch != nextMatch) {print line}}' -

Why don't you read the file from the end to the beginning and print the first line containing duplicate? This way you don't have to worry about what was printed or not, hold the line, etc.
tac file | awk '/duplicate/ {if (f) next; f=1}1' | tac
This sets a flag f the first time duplicate is seen. From the second time on, this flag makes the line be skipped.
If you want to make this generic in a way that every first word is printed just the last time, use an array approach:
tac file | awk '!seen[$1]++' | tac
This keeps track of the first words that have appeared so far. They are stored in the array seen[], so that !seen[$1]++ evaluates to true only the first time a given $1 occurs; from the second time on it evaluates to false and the line is not printed.
Test
$ tac a | awk '!seen[$1]++' | tac
this is a line
another unrelated line
duplicate but keep me
example but keep this one
more unrelated text

You could use an (associative) array to always keep the last occurrence:
awk '{last[$1]=$0;} END{for (i in last) print last[i];}' file
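Note that for (i in last) visits the array in an unspecified order, so the surviving lines may come out shuffled. If the original order matters, a two-pass sketch (the file name is repeated, so the file is read twice) preserves it:
awk 'NR==FNR{last[$1]=FNR; next} FNR==last[$1]' file file
The first pass records the line number of the last occurrence of each first word; the second pass prints a line only when it sits at that recorded position.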

Related

How to get all lines from a file after the last empty line?

Having a file like foo.txt with content
1
2
3

4
5
How do I get the lines starting with 4 and 5 out of it (everything after the last empty line), assuming the number of lines can vary?
Updated
Let's try a slightly simpler approach with just sed.
$: sed -n ':a; /^$/d; /\n$/{g;D;}; $p; N; ba' foo.txt
4
5
-n says don't print unless I tell you to.
/^$/d says if a cycle starts on a blank line, just delete it and move on.
/\n$/{g;D;}; says whenever the line just appended is blank (the pattern space now ends in a newline), clear it all out with this:
g : Replace the contents of the pattern space with the contents of the hold space. Since we never put anything in it, this erases the (possibly long) accumulated pattern space. Note that I could have used z since this is GNU, but I wanted to break it out for non-GNU seds below, and in this case this works for both.
D : the pattern space is now empty, so D acts like d: delete it and go read the next line.
Now previously accumulated lines have been wiped if (and only if) we saw a blank line, and the cycle restarts with an empty buffer.
N : Add a newline to the pattern space, then append the next line of input to the pattern space. The ba loops back to the top, so the pattern space keeps growing line by line.
This accumulates all nonblanks until either 1) a blank is hit, which will clear and restart the buffer as above, or 2) we reach EOF with a buffer intact.
Finally, $p says on the LAST line (which will already have been added to the pattern space unless the last line was blank, which will have removed the pattern space...), print the pattern space. The only time this will have nothing to print is if the last line of the file was a blank line.
So the whole logic boils down to: clean the buffer on empty lines, otherwise pile the non-empty lines up and print at the end.
If you don't have GNU sed, just put the commands on separate lines.
sed -n '
:a
/^$/d
/\n$/{
g
D
}
$p
N
b a
' foo.txt
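The same clear-on-blanks, accumulate-otherwise idea is easy to sketch in awk, if that reads more naturally:
awk '/^$/{buf=""; next} {buf = buf $0 ORS} END{printf "%s", buf}' foo.txt
Every blank line resets the accumulator buf, every other line is appended to it, and whatever is left at end of input gets printed.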
Alternate
The method above is efficient, but could potentially build up a very large pattern buffer on certain data sets. If that's not an issue, go with it.
Or, if you want it in simple steps, don't mind more processes doing less work each, and prefer less memory consumed:
last=$( sed -n /^$/= foo.txt|tail -1 ) # find the last blank
next=$(( ${last:-0} + 1 )) # get the number of the line after
cmd="$next,\$p" # compose the range command to print
sed -n "$cmd" foo.txt # run it to print the range you wanted
This runs a lot of small, simple tasks outside of sed so that it can give sed the simplest, most direct and efficient description of the task possible. It will read the target file twice, but won't have to manage filling, flushing, and refilling the accumulation of data in the pattern buffer with records before a blank line. Still likely slower unless you are memory bound, I'd think.
Reverse the file, print everything up to the first blank line, reverse it again.
$ tac foo.txt | awk '/^$/{exit}1' | tac
4
5
Using GNU awk:
awk -v RS='\n\n' 'END{printf "%s",$0}' file
RS is the record separator, here set to two consecutive newlines, i.e. a blank line between records.
The END statement prints the last record.
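If your awk lacks multi-character RS support, paragraph mode (RS set to the empty string) behaves similarly here; note that it collapses runs of blank lines into a single separator, and relying on $0 inside END assumes an awk, such as GNU awk, that still holds the last record there:
awk -v RS= 'END{print}' file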
try this:
tail -n +$(($(grep -nE '^$' test.txt | tail -n1 | sed -e 's/://g')+1)) test.txt
grep the input file for empty lines, with their line numbers (-n)
get the last match with tail => 5:
remove the unnecessary : => 5
add 1 to 5 => 6
tail starting from line 6
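The same pipeline, split into two steps for readability (test.txt being the answerer's sample file):
last=$(grep -nE '^$' test.txt | tail -n1 | sed 's/://g')  # line number of the last blank line
tail -n +$((last + 1)) test.txt                           # print from the next line onward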
You can try with sed; this one collects lines in the hold space, empties it at every blank line, and prints whatever has accumulated once the last line is reached:
sed -n ':A;$bB;/^$/{x;s/.*//;x};H;n;bA;:B;H;x;s/^..//;p' infile
With GNU sed:
sed ':a;/$/{N;s/.*\n\n//;ba;}' file

Sed range and removing last matching line

I have this data:
One
    two
    three
Four
    five
    six
Seven
    eight
And this command:
sed -n '/^Four$/,/^[^[:blank:]]/p'
I get the following output:
Four
    five
    six
Seven
How can I change this sed expression to not match the final line of the output? So the ideal output should be:
Four
    five
    six
I've tried many things involving exclamation points but haven't managed to get close to getting this working.
Use a "do..while()" loop:
sed -n '/^Four$/{:a;p;n;/^[[:blank:]]/ba}'
details:
/^Four$/ {
:a # define the label "a"
p # print the pattern-space
n # load the next line in the pattern space
/^[[:blank:]]/ba # if the line starts with a blank, go to label "a"
}
You may pipe to another sed and skip last line:
sed -n '/^Four$/,/^[^[:blank:]]/p' file | sed '$d'
Four
    five
    six
Alternatively you may use:
sed -n '/^Four$/,/^[^[:blank:]]/{/^Four$/p; /^[^[:blank:]]/!p;}' file
You're using the wrong tool. sed is for doing s/old/new, that is all. Just use awk:
$ awk '/^[^[:blank:]]/{f=/^Four$/} f' file
Four
    five
    six
How it works: every time it finds a line that doesn't start with spaces (/^[^[:blank:]]/), it sets a flag f (for "found") to 1 if that line starts with Four and to 0 otherwise (f=/^Four$/). Whenever f is non-zero, that is interpreted as a true condition and invokes awk's default behavior, which is to print the current line. So when it hits a block starting with Four it prints every line in that block because f is 1/true, and for every other block it prints nothing since f is 0/false.
The following awk may help you here.
awk '!/^ /{flag=""} /Four/{flag=1} flag' Input_file
Output will be as follows.
Four
    five
    six
Also, in case you need to save the output into Input_file itself, append > temp_file && mv temp_file Input_file to the above code.
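That is, the full command would be:
awk '!/^ /{flag=""} /Four/{flag=1} flag' Input_file > temp_file && mv temp_file Input_file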
grep -Pzo '\n\KFour\n(\s.+\n)+' input.txt
Output
Four
    five
    six
This might work for you (GNU sed):
sed '/^Four/{:a;n;/^\s/ba};d' file
If the line begins with Four print it and any following lines beginning with a space.
Another way:
sed '/^\S/h;G;/^Four/MP;d' file
If a line begins with a non-space, copy it to the hold space (HS). Append the HS to each line and if either line begins with Four print the first line and delete the rest. This will delete all lines other than the section beginning with Four.

Bash script - remove lines by looking ahead

I have a csv file where some rows have an empty first field, and some rows have content in the first field. The rows with content in the first field are header rows.
I would like to remove every unnecessary header row. The best way I can see of doing this is by deleting every row for which:
First field is not empty
First field in the following row is not empty
I do not necessarily need to keep the data in the same file, so I can see this being possible using grep, awk, or sed, but none of my attempts have come close to working.
Example input:
header1,value1,etc
,value2,etc
header2,value3,etc
header3,value4,etc
,value5,etc
Desired output:
header1,value1,etc
,value2,etc
header3,value4,etc
,value5,etc
Since the header2 line is not followed by a line with an empty field 1, it is an unnecessary header row.
awk -F, '$1{h=$0;next}h{print h;h=""}1' file
-F,: Use comma as a field separator
$1{h=$0;next}: If the first field has data (other than 0), save the line and go on to the next line.
h{print h;h=""}: If there is a saved header line, print it and forget it. (This can only execute if there is nothing in $1, because of the next above.)
1: print the current line.
These kinds of tasks are often conceptually easier by reversing the file and checking if the previous line is a header:
tac file |
awk -F, '$1 && have_header {next} {print; have_header = length($1)}' |
tac

Select full block of text delimited by some chars

I have a very large text file (40GB gzipped) where blocks of data are separated by //.
How can I select blocks of data where a certain line matches some criterion? That is, can I grep a pattern and extend the selection in both directions to the // delimiter? I can make no assumptions on the size of the block and the position of the line.
not interesting 1
not interesting 2
//
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
//
not interesting 1
not interesting 2
//
I want to select the block of data with MATCH THIS LINE:
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
I tried sed but can't get my head around the pattern definition. This for example should match from // to MATCH THIS LINE:
sed -n -e '/\/\//,/MATCH THIS LINE/ p' file.txt
But it fails matching the //.
Is it possible to achieve this with GNU command line tools?
With GNU awk (due to multi-char RS), you can set the record separator to //, so that every record is a //-delimited set of characters:
$ awk -v RS="//" '/MATCH THIS LINE/' file
get the whole block 1
MATCH THIS LINE
get the whole block 2
get the whole block 3
Note this leaves an empty line above and below, because it catches the newline just after // and prints it back, as well as the one just before the closing //. To remove them you can pipe to awk 'NF'.
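For example:
awk -v RS="//" '/MATCH THIS LINE/' file | awk 'NF'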
To print the separator between blocks of data you can say (thanks 123):
awk -v RS="//" '/MATCH THIS LINE/{print RT $0 RT}' file

Number of total records

I'm quite new to using AWK. I just discovered the FNR variable. I wonder if it is possible to get the number of total records before processing the file?
So the FNR at the end of the file.
I just need it to do something like this:
awk 'FNR<TOTALRECORDS-4 {print}'
in order to delete the last 4 lines of the file.
Thanks
If you merely want to print all but the last 4 lines of a file, use a different tool. But if you are doing some other processing with awk and need to incorporate this, just store the lines in a buffer and print them as needed. That is, store the most recent 4 lines, and print the oldest one in the buffer each time a new line is read. For example:
awk 'NR>4 { print a[i%4]} {a[i++%4]=$0}' input
This keeps 4 lines in the array a. If we are in the first 4 lines of the file, do nothing but store the line in a. If we are on a line greater than 4, the first thing to do is print the line from 4 lines back (stored in a at index i%4). You can put commands that manipulate $0 between these two action statements as needed.
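If you really do want the total number of records before deciding what to print, a two-pass sketch works too; the file name is given twice, so the file is read twice:
awk 'NR==FNR{total=NR; next} FNR<=total-4' file file
The first pass only counts the records; on the second pass FNR restarts from 1, so the condition prints every line except the last 4.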
To remove the last 4 lines from a file, you can just use head (GNU head, which accepts a negative count):
head -n -4 somefile > outputfile
