How to KEEP only the last line of consecutive lines starting with the same word? - bash

See this thread: How to remove the second line of consecutive lines starting with the same word?
Instead of keeping the first duplicate consecutive line starting with "TITLE", I would like to only keep the last one, to get from this input:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
TITLE extra info
DATA some more data
this output:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
Also, I'd like to be able to handle an arbitrary number of repetitions, not only 2 (if, for example, 7 lines in a row start with "TITLE", only keep the last one).
Like the other post, it can be a perl/bash/sed/awk command that keeps only the last line and outputs the rest of the file as well. I've been working on this for a long time, but I could only find solutions that do the opposite of what I want.

With sed:
sed '/^TITLE/ { :a $! { N; /\nTITLE/ { s/.*\n//; ba; }; }; }' filename
That is:
/^TITLE/ {        # if a line begins with TITLE
  :a              # jump label for looping.
  $! {            # unless we hit the end of input (in case the file
                  # ends with title lines)
    N             # fetch the next line
    /\nTITLE/ {   # if it begins with TITLE as well
      s/.*\n//    # remove the first
      ba          # go back to a
    }
  }
}

Just reverse the order of lines, then print the now-first occurrence, then reverse them again:
$ tac file | awk '$1!=prev; {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
or if there can be multiple consecutive DATA lines and you want to keep all of those:
$ tac file | awk '!($1=="TITLE" && $1==prev); {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
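If the awk idiom is unfamiliar, here is the first variant spelled out with comments (the annotation is mine, not part of the original answer):
tac file |
awk '
  $1 != prev    # print the line whenever its first word differs from the one before
  { prev = $1 } # remember the first word for the next comparison
' |
tac
Because the input is reversed, "first occurrence" here means the last occurrence in the original order; the final tac restores that order.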

If you're looking for a Perl one-line solution, like the one in the question that you linked, then this will do
perl -ne'if (/^TITLE/) {$t = $_} else {print $t, $_; $t = ""}' myfile
Note that it will not print a TITLE line at all unless it is followed by a line that doesn't begin with TITLE.
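If trailing TITLE lines should be kept too, a small tweak (an untested sketch) is to flush the held line in an END block:
perl -ne'if (/^TITLE/) {$t = $_} else {print $t, $_; $t = ""} END {print $t}' myfile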

This might work for you (GNU sed):
sed -r 'N;/^(TITLE ).*\n\1/!P;D' file
This keeps two consecutive lines in the pattern space at a time; when both begin with "TITLE ", the first is not printed, otherwise it is, and then the first line is deleted and the cycle repeats, so only the last TITLE line of a run survives.
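A quick way to check it against the question's sample, without creating a file (GNU sed assumed, since it prints the pattern space when N runs out of input):
printf '%s\n' 'TITLE something' 'DATA some data' 'TITLE something else' \
  'DATA some other data' 'TITLE some more' 'TITLE extra info' 'DATA some more data' |
  sed -r 'N;/^(TITLE ).*\n\1/!P;D'
which should print the expected six lines, with only TITLE extra info surviving from the run of TITLE lines.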

Related

How to replace a whole line (between 2 words) using sed?

Suppose I have text as:
This is a sample text.
I have 2 sentences.
text is present there.
I need to replace whole text between two 'text' words. The required solution should be
This is a sample text.
I have new sentences.
text is present there.
I tried using the below command but it's not working:
sed -i 's/text.*?text/text\
\nI have new sentence/g' file.txt
With your shown samples, please try the following. sed doesn't support lazy (non-greedy) matching in its regex. With awk's RS set to the empty string, the shown sample (which has no blank lines) is read as a single record, so the substitution can span lines; this works for your shown samples only. You need to create a variable val which holds the new value; then a simple sub() in awk does the rest to get your expected output.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file
The above code will print the output on the terminal; once you are happy with the results and want to save the output into Input_file itself, try the following code.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file > temp && mv temp Input_file
You have already solved your problem using awk, but in case anyone else will be looking for a sed solution in the future, here's a sed script that does what you needed. Granted, the script is using some advanced sed features, but that's the fun part of it :)
replace.sed
#!/usr/bin/env sed -nEf
# This pattern determines the start marker for the range of lines where we
# want to perform the substitution. In our case the pattern is any line that
# ends with "text." — the `$` symbol meaning end-of-line.
/text\.$/ {
  # [p]rint the start-marker line.
  p
  # Next, we'll read lines (using `n`) in a loop, so mark this point in
  # the script as the beginning of the loop using a label called `loop`.
  :loop
  # Read the next line.
  n
  # If the last read line doesn't match the pattern for the end marker,
  # just continue looping by [b]ranching to the `:loop` label.
  /^text/! {
    b loop
  }
  # If the last read line matches the end marker pattern, then just insert
  # the text we want and print the last read line. The net effect is that
  # all the previous read lines will be replaced by the inserted text.
  /^text/ {
    # Insert the replacement text
    i\
I have a new sentence.
    # [print] the end-marker line
    p
  }
  # Exit the script, so that we don't hit the [p]rint command below.
  b
}
# Print all other lines.
p
Usage
$ cat lines.txt
foo
This is a sample text.
I have many sentences.
I have many sentences.
I have many sentences.
I have many sentences.
text is present there.
bar
$
$ ./replace.sed lines.txt
foo
This is a sample text.
I have a new sentence.
text is present there.
bar
Substitute
sed -i 's/I have 2 sentences./I have new sentences./g'
sed -i 's/[A-Z]\s[a-z].*/I have new sentences./g'
Insert
sed -i -e '2iI have new sentences.' -e '2d'
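As written, these one-liners omit the file to operate on; presumably they are meant to be applied to the question's file.txt, for example:
sed -i 's/I have 2 sentences./I have new sentences./g' file.txt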
I need to replace whole text between two 'text' words.
If I understand correctly, the first text. (with a dot) is at the end of the first line and the second text is at the beginning of the third line. With awk you can get the required solution by adding values to the variable s:
awk -v s='\nI have new sentences.\n' '/text.?$/ {s=$0 s;next} /^text/ {s=s $0;print s;s=""}' file
This is a sample text.
I have new sentences.
text is present there.

How to get paragraphs of text by index number

I am wondering if there is a way to get paragraphs of text (source file would be a pyx file) by number as sed does with lines
sed -n ${i}p
At this moment I'd be interested to use awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command, but it doesn't work, and I can't really figure out why.
Any hint would be much appreciated, thanks.
EDIT:
This is just one example and not my exact need but I would need to know how to do it for a multipurpose project
Let's take the case that I have exported the ID3Tags of a huge collection of audio files and these have been stored in a pyx-like format, so in the end I will have a nice big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
so, being one line, it can be obtained with sed, but I don't know how to get an entire paragraph given its index; for example, if I want to get the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs
awk -v indx=1234 '
BEGIN {
  RS=""
}
{
  split($0,arr,"audio-artist")
  for (i=2;i<=length(arr);i=i+2) {
    gsub("[()]","",arr[i])
    arts[cnt+=1]=arr[i]
  }
}
END {
  print arts[indx]
}' audioartist
One liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk, and the file called audioartist, we consume the file as one line by setting the records separator (RS) to "". We then split the whole file into an array arr, based on the separator audio-artist. We look through the array arr starting from 2 in steps of 2 till the end of the array and strip out the opening and closing brackets, creating another array called arts with an incrementing count as the index and the stripped artist as the value. At the end we print the arts index specified by the passed indx variable (in this case 1234).
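To answer the lyrics part of the question, the same splitting idea can presumably be generalized by passing the tag name in as a variable; an untested sketch (the tag value and the file name collection.pyx are placeholders):
awk -v tag="audio-lyrics" -v indx=6543 '
BEGIN { RS="" }
{
  split($0,arr,tag)                 # split the whole file on the tag name
  for (i=2;i<=length(arr);i=i+2) {  # every second chunk is a tagged block
    gsub("[()]","",arr[i])          # strip the opening and closing parentheses
    blocks[cnt+=1]=arr[i]
  }
}
END { print blocks[indx] }' collection.pyx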

Matching pairs using Linux terminal

I have a file named list.txt containing (supplier, product) pairs, and I must show the number of products from every supplier and their names, using the Linux terminal.
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to a file and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), and, when the appended line starts with the same supplier (the \n\1: part), joins the two lines into one, getting rid of excessive spaces, *;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
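To make that concrete: for the sample input above, the output of the sort and sed stages (before the final awk) should be
stationery: paper pen rubber
grocery: apples pears
dairy: cheese milk
and the final awk then prints, before each of those lines, the supplier name followed by the item count (NF-1).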
This should be close to the only awk code I was referring to:
< os awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{                           # for each line
  # we use the word before the : as the key of an associative array
  count[$1] += 1                     # increment the count for the given supplier
  items[$1] = items[$1] " " $2       # concatenate the current item to the previous ones
}
END {                                # after processing the whole file
  for (supp in items)                # iterate on the suppliers and print the result
    print supp": " count[supp], "\n"supp":" items[supp]
}'

'sed' replace last pattern and delete other patterns

I want to replace only the last occurrence of "delay" with "ens_delay" in my file and delete the other ones before the last one:
Input file:
alpha_notify_teta=''
alpha_notify_check='YES'
text='CRDS'
delay=''
delay=''
delay=''
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
alpha_orange='YES'
alpha_orange_interval='300'
alpha_notification_level='ALL'
expression='YES'
delay='9'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
Output file: (expected value)
alpha_notify_teta=''
alpha_notify_check='YES'
text='CRDS'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
alpha_orange='YES'
alpha_orange_interval='300'
alpha_notification_level='ALL'
expression='YES'
ens_delay='9'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
Here is my first command, but it doesn't work because it only works if delay is the last line.
sed -e '$,/delay/ s/delay/ens_delay/'
My second command will delete all lines containing "delay"; even "ens_delay" would be deleted.
sed -i '/delay/d'
Thank you
This might work for you (GNU sed):
sed '/^delay=/,$!b;/^delay=/!H;//{x;s/^[^\n]*\n\?//;/./p;x;h};$!d;x;s/^/ens_/' file
Lines before the first line beginning delay= should be printed as normal. Otherwise, a line beginning delay= is stored in the hold space and subsequent lines that do not begin delay= are appended to it. Should the hold space already contain such lines, the first line is deleted and the remaining lines printed before the hold space is replaced by the current line. At the end of the file, the first line of the hold space is amended to prepend the string ens_ and then the whole of the hold space is printed.
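The same script written one command per line, with my reading of each step as comments (GNU sed; the comments are mine, not the answerer's):
sed '
  # lines before the first delay= line: branch, i.e. print them as normal
  /^delay=/,$!b
  # within that range, lines not beginning delay=: append them to the hold space
  /^delay=/!H
  # lines beginning delay= (the empty // reuses the previous regex)
  //{
    # swap in the hold space
    x
    # drop the previously held delay= line (the first line of the hold space)
    s/^[^\n]*\n\?//
    # if any appended lines remain, print them
    /./p
    # swap back to the current delay= line
    x
    # and store it as the new hold space
    h
  }
  # unless we are at the end of the file, delete the pattern space and start the next cycle
  $!d
  # at the end of the file, bring the held lines back ...
  x
  # ... and prepend ens_ to the held delay= line before it is auto-printed
  s/^/ens_/
' file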
You cannot do this kind of thing with sed. There is no way in sed to "look forward" and tell if there are more matches to the pattern. You can kind of look back, but that won't be sufficient to solve this problem.
This perl script will solve it:
#!/usr/bin/perl
use strict;
use warnings;
my ($seek, $replacement, $last, @new) = (shift, shift, 0);
open(my $fh, shift) or die $!;
my @l = <$fh>;
close($fh) or die $!;
foreach (reverse @l){
    if(/$seek/){
        if ($last++ == 0){
            s/$seek/$replacement/;
        } else {
            next;
        }
    }
    unshift(@new, $_);
}
print join "", @new;
Call like:
./script delay= ens_delay= inputfile
I chose to entirely eliminate the lines which you intended to delete rather than collapse them into a single blank line. If that is really required then it's a bit more complicated: the first such line in any consecutive set (or rather the last such) must be pushed onto the output list, and you have to track whether this has just been done so you know whether to push the next time, too.
You could also solve this problem with awk, python, or any number of other languages. Just not sed.
Have this monster:
sed -e "1,$(expr $(sed -n '/^delay=/=' your_file.txt | tail -1) - 1)"'s/^delay=.*$//' \
-e 's/^delay=/ens_delay=/' your_file.txt
Here:
sed -n '/^delay=/=' your_file.txt | tail -1 returns the line number of the last occurrence of the pattern (let's name it X)
expr is used to compute X-1
"1,X-1"'[command]' means "perform this command between the first and the (X-1)-th line, inclusive" (the double quotes are used so that the command substitution gets expanded)
's/^delay=.*$//' the said [command]
-e 's/^delay=/ens_delay=/' the next expression to perform (will occur only on the last line)
Output:
alpha_notify_teta=''
alpha_notify_check='YES'
text='CRDS'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
alpha_hsm_backup_notification='YES'
alpha_orange='YES'
alpha_orange_interval='300'
alpha_notification_level='ALL'
expression='YES'
ens_delay='9'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
alpha_hsm_backup_notification='YES'
If you want to delete the lines instead of leaving them blank:
sed -e "1,$(expr $(sed -n '/^delay=/=' your_file.txt | tail -1) - 1)"'{/^delay=.*$/d}' \
-e 's/^delay=/ens_delay=/' your_file.txt
As was mentioned elsewhere, sed can't know which occurrence of a substring is the last one. But awk can keep track of things in arrays. For example, the following will delete all duplicate assignments, as well as making your substitution:
awk 'BEGIN{FS=OFS="="} $1=="delay"{$1="ens_delay"} !($1 in a){o[++i]=$1} {a[$1]=$0} END{for(x=0;x<i;x++) printf "%s\n",a[o[x]]}' inputfile
Or, broken out for easier reading/comments:
BEGIN {
FS=OFS="=" # set the field separator, to help isolate the left hand side
}
$1=="delay" {
$1="ens_delay" # your field substitution
}
!($1 in a) {
o[++i]=$1 # if we haven't seen this variable, record its position
}
{
a[$1]=$0 # record the value of the last-seen occurrence of this variable
}
END {
for (x=0;x<i;x++) # step through the array,
printf "%s\n",a[o[x]] # printing the last-seen values, in the order
} # their variable was first seen in the input file.
You might not care about the order of the variables. If so, the following might be simpler:
awk 'BEGIN{FS=OFS="="} $1=="delay"{$1="ens_delay"} {o[$1]=$0} END{for(i in o) printf "%s\n", o[i]}' inputfile
This simply stores the last-seen line in an array whose key is the variable name, then prints out the content of the array in an unknown order.
Assuming I understand your specifications properly, this should do what you need. Given infile x,
$: last=$( grep -n delay x|tail -1|sed 's/:.*//' )
This greps the file for all lines containing delay and returns them with the line number and a colon prepended to each. The tail -1 grabs the last of those lines, ignoring all the others. sed 's/:.*//' strips the colon and the actual line content, leaving only the number (here it was 14).
That all evaluates out to assigning 14 to $last.
$: sed '/delay/ { '$last'!d; '$last' s/delay/ens_delay/; }' x
alpha_notify_teta=''
alpha_notify_check='YES'
text='CRDS'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
alpha_orange='YES'
alpha_orange_interval='300'
alpha_notification_level='ALL'
expression='YES'
ens_delay='9'
textfileooooop=''
alpha_enable='YES'
alpha_hostnames=''
Apologies for the ugly catenation. What this does is write the script using the value of $last, so that the result looks like this to sed:
$: sed '/delay/ { 14!d; 14 s/delay/ens_delay/; }' x
sed reads leading numbers as line selectors, so here is what this script of commands does -
First, sed automatically prints lines unless told not to, so by default it would just print every line. The script modifies that.
/delay/ {...} is a pattern-based record selector. It will apply the commands between the {} to all lines that match /delay/, which is why it doesn't need another grep - it handles that itself. Inside the curlies, the script does two things.
First, 14!d says (only for lines containing delay, since we are inside that block) that unless the line number is 14, delete (d) the record; the ! negates the address. Since all the other lines with delay won't be line 14 (or whatever line number the earlier command found for the last occurrence), they get deleted, which automatically restarts the cycle and reads the next record.
Second, if the line number is 14, then it won't delete, and so will progress to the s/delay/ens_delay/ which updates your value.
For all lines that don't match /delay/, sed just prints them as-is.

Removing lines between tags in a text file

I have many text files containing annotations. The original text is marked with lines containing the words:
START OF TEXT OF PASSAGE 1
END OF TEXT OF PASSAGE 1
Obviously I can search each document for the phrase START OF TEXT and delete everything up to it. Then search for END OF TEXT and start selecting text for deletion until I get to the next START OF TEXT.
I have come up with this design so far:
#!/bin/bash
a="START OF PROJECT"
b="END OF PROJECT"
while read line; do
  if 'line contains a'; then
    while read line; do
      if 'line does not contain b'; then
        'append the line to output.txt'
      fi
    done
  fi
done
Perhaps there is an easier way using sed, awk, grep and pipes?
'for every document' 'loop through it doing this' ('find the original text between START and END' | >> output.txt)
Unfortunately I am poor at bash and ignorant of sed/awk.
The reason for this is that I am assembling a huge text document that is a concatenation of thousands of marked up documents – each of which contains some annotated passages.
In Python:
import re
with open('in.txt') as f, open('out.txt', 'w') as output:
    output.write('\n'.join(re.findall(r'START OF TEXT(.*?)END OF TEXT', f.read(), re.DOTALL)))
This reads the input, searches for all matches that begin and end with the necessary markers (re.DOTALL lets . match newlines, so each capture can span multiple lines), captures the text of interest in a group, joins all those groups on a linefeed, and writes that to the result file.
Pretty easy to do with awk. You would create a script (I'll call it yank.awk) containing this:
#!/usr/bin/awk -f
/START OF PROJECT/ { capture = 1; next }
/END OF PROJECT/ { capture = 0 }
capture == 1 { print }
and then run it like so (after making it executable with chmod +x yank.awk):
./yank.awk in.txt > output.txt
Could also do with sed and grep:
sed -ne '/START OF PROJECT/,/END OF PROJECT/p' in.txt | grep -vE '(START|END) OF PROJECT' > output.txt
(Another Python solution)
You can have itertools.groupby group lines together based on a boolean value - just use a global flag to keep track of whether you are in a block or not, and then use groupby to group the lines that are inside or outside blocks. Then just discard the groups that are not inside a block:
sample_lines = """
lskdjflsdkjf
sldkjfsdlkjf
START OF TEXT
Asdlkfjlsdkfj
Bsldkjf
Clsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
START OF TEXT
Dsdlkfjlsdkfj
Esldkjf
Flsdkjf
END OF TEXT
sldkfjlsdkjf
sdlkjfdklsjf
sdlkfjdlskjf
""".splitlines()
from itertools import groupby
in_block = False
def is_in_block(line):
    global in_block
    if line.startswith("END OF TEXT"):
        in_block = False
    ret = in_block
    if line.startswith("START OF TEXT"):
        in_block = True
    return ret
for lines_are_text, lines in groupby(sample_lines, key=is_in_block):
    if lines_are_text:
        print(list(lines))
gives:
['Asdlkfjlsdkfj', 'Bsldkjf', 'Clsdkjf']
['Dsdlkfjlsdkfj', 'Esldkjf', 'Flsdkjf']
See that first group has the lines that start with A, B, and C, and the second group is made up of those lines starting with D, E, and F.
It sounds like the specific solution you need is:
awk '/END OF TEXT OF PASSAGE/{f=0} f; /START OF TEXT OF PASSAGE/{f=1}' file
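Spelled out (the annotation is mine), the rule order is what keeps the marker lines themselves out of the output:
awk '
  /END OF TEXT OF PASSAGE/   { f=0 }  # clear the flag first, so the END marker itself is not printed
  f                                   # while the flag is set, print the current line
  /START OF TEXT OF PASSAGE/ { f=1 }  # set the flag last, so the START marker itself is not printed
' file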
See https://stackoverflow.com/a/18409469/1745001 for other ways to select text from files.
Use Perl's Flip-Flop Operator to Print Text Between Markers
Given a corpus like:
START OF TEXT OF PASSAGE 1
foo
END OF TEXT OF PASSAGE 1
START OF TEXT OF PASSAGE 2
bar
END OF TEXT OF PASSAGE 2
you can use the Perl flip-flop operator to process within a range of lines. For example, from the shell prompt:
$ perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/corpus
foo
bar
Basically, this short Perl script loops through your input. When it finds your start and end tags, it throws away the tags themselves and prints everything else in between.
Usage Notes
The line breaks between passages in the corpus are for readability. It doesn't matter if your real corpus has no line breaks between passages, so long as the text markers always start at the beginning of the line as shown in your original post. If that assumption doesn't hold true, then you will need to adjust the regular expressions used to identify the start and end of your passages.
You can pass multiple files to the Perl script. Again, it makes no practical difference as long as you don't exceed the length limit of your shell.
If you want the final output to go to somewhere other than standard output, just use shell redirection. For example:
perl -ne 'if (/^START OF TEXT/ ... /^END OF TEXT/) {
next if /^(?:START|END)/;
print;
}' /tmp/file1 /tmp/file2 /tmp/file3 > /tmp/output
You can use sed as follows:
sed -n '/^START OF TEXT/,/^END OF TEXT/{/^\(START\|END\) OF TEXT/!p}' infile
or, with extended regular expressions (-r):
sed -rn '/^START OF TEXT/,/^END OF TEXT/{/^(START|END) OF TEXT/!p}' infile
-n prevents sed from printing as a default. The rest works as follows:
/^START OF TEXT/,/^END OF TEXT/ { # For lines between these two matches
/^\(START\|END\) OF TEXT/!p # If the line does NOT match, print it
}
This works with GNU sed and might require some tweaking to run with other seds.
