How to get paragraphs of text by index number - shell

I am wondering if there is a way to get paragraphs of text (source file would be a pyx file) by number as sed does with lines
sed -n ${i}p
At this moment I'd be interested to use awk with:
awk '/custom-pyx-tag\(/,/\)custom-pyx-tag/'
but I can't find documentation or examples about that.
I'm also trying to trim "\r\n" with gsub(/\r\n/,"; ") in the same awk command, but it doesn't work and I can't figure out why.
Any hint would be very appreciated, thanks
EDIT:
This is just one example and not my exact need but I would need to know how to do it for a multipurpose project
Let's take the case that I have exported the ID3Tags of a huge collection of audio files and these have been stored in a pyx-like format, so in the end I will have a nice big file with this pattern repeating for each file in the collection:
audio-genre(
blablabla
)audio-genre
audio-artist(
bla.blabla
)audio-artist
audio-album(
bla-bla-bla
)audio-album
audio-track-num(
0x
)audio-track-num
audio-track-title(
bla.bla-bla
)audio-track-title
audio-lyrics(
blablablablabla
bla.bla.bla.bla
blah-blah-blah
blabla-blabla
)audio-lyrics
...
Now if I want to extract the artist of the 1234th audio file I can use:
awk '/audio-artist\(/, /)audio-artist/' | sed '/audio-artist/d' | sed -n 1234p
Since the artist is a single line, it can be obtained with sed, but I don't know how to get an entire paragraph given its index. For example, if I want the lyrics of the 6543rd file, how could I do it?
In the end it is just a question of whether there is a command equivalent to
sed -n ${num}p
but to be used for paragraphs

awk -v indx=1024 '
BEGIN {
RS=""
}
{ n=split($0,arr,"audio-artist"); # n is the number of pieces
for (i=2;i<=n;i=i+2)
{ gsub("[()]","",arr[i]);
arts[cnt+=1]=arr[i]
}
}
END {
print arts[indx]
}' audioartist
One liner:
awk -v indx=1234 'BEGIN {RS=""} NR==1 { split($0,arr,"audio-artist");for (i=2;i<=length(arr);i=i+2) { gsub("[()]","",arr[i]);arts[cnt+=1]=arr[i] } } END { print arts[indx] }' audioartist
Using awk on the file called audioartist, we set the record separator (RS) to "", which puts awk in paragraph mode: records are separated by blank lines, so a file with no blank lines is read as a single record. We then split that record into an array arr on the separator audio-artist. We loop through arr starting from index 2 in steps of 2 until the end of the array, strip the opening and closing brackets, and store each artist in another array called arts, indexed by an incrementing count. At the end we print the element of arts selected by the passed indx variable (in this case 1234).
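For multi-line blocks such as the lyrics, a line-oriented variant may be easier to adapt than splitting one huge record. The following is a minimal sketch, assuming every block opens with a line that is exactly tag( and closes with a line that is exactly )tag. The name nth_block is a hypothetical helper, not an existing command.

```shell
#!/bin/sh
# nth_block TAG N FILE: print the body of the N-th "TAG( ... )TAG" block.
# (nth_block is a hypothetical helper name, not part of any tool.)
nth_block() {
  awk -v tag="$1" -v n="$2" '
    { sub(/\r$/, "") }                       # tolerate CRLF line endings
    $0 == (tag "(") { count++; inside = (count == n); next }
    $0 == (")" tag) { if (inside) exit; next }
    inside                                   # print lines of the wanted block
  ' "$3"
}

# Demo on a tiny sample with two lyrics blocks.
sample=$(mktemp)
printf '%s\n' 'audio-lyrics(' 'first a' 'first b' ')audio-lyrics' \
              'audio-artist(' 'someone' ')audio-artist' \
              'audio-lyrics(' 'second a' 'second b' ')audio-lyrics' > "$sample"
nth_block audio-lyrics 2 "$sample"
rm -f "$sample"
```

With the default record separator, awk has already removed the newline from each line before any rule runs, so a pattern such as /\r\n/ can never match inside a record. Stripping the trailing \r per line, as in the sketch, is one way around that.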

Related

How to replace a whole line (between 2 words) using sed?

Suppose I have text as:
This is a sample text.
I have 2 sentences.
text is present there.
I need to replace whole text between two 'text' words. The required solution should be
This is a sample text.
I have new sentences.
text is present there.
I tried using the below command but its not working:
sed -i 's/text.*?text/text\
\nI have new sentence/g' file.txt
With your shown samples, please try the following. sed doesn't support lazy matching in regex. With awk's RS you can do the substitution, at least for the samples shown. Create a variable val that holds the new value; a simple substitution in awk then does the rest to get your expected output.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file
The above code prints its output on the terminal. Once you are happy with the results and want to save the output into Input_file itself, try the following code.
awk -v val="your_new_line_Value" -v RS="" '
{
sub(/text\.\n*[^\n]*\n*text/,"text.\n"val"\ntext")
}
1
' Input_file > temp && mv temp Input_file
You have already solved your problem using awk, but in case anyone else will be looking for a sed solution in the future, here's a sed script that does what you needed. Granted, the script is using some advanced sed features, but that's the fun part of it :)
replace.sed
#!/usr/bin/env sed -nEf
# This pattern determines the start marker for the range of lines where we
# want to perform the substitution. In our case the pattern is any line that
# ends with "text." — the `$` symbol meaning end-of-line.
/text\.$/ {
# [p]rint the start-marker line.
p
# Next, we'll read lines (using `n`) in a loop, so mark this point in
# the script as the beginning of the loop using a label called `loop`.
:loop
# Read the next line.
n
# If the last read line doesn't match the pattern for the end marker,
# just continue looping by [b]ranching to the `:loop` label.
/^text/! {
b loop
}
# If the last read line matches the end marker pattern, then just insert
# the text we want and print the last read line. The net effect is that
# all the previous read lines will be replaced by the inserted text.
/^text/ {
# Insert the replacement text
i\
I have a new sentence.
# [print] the end-marker line
p
}
# Exit the script, so that we don't hit the [p]rint command below.
b
}
# Print all other lines.
p
Usage
$ cat lines.txt
foo
This is a sample text.
I have many sentences.
I have many sentences.
I have many sentences.
I have many sentences.
text is present there.
bar
$
$ ./replace.sed lines.txt
foo
This is a sample text.
I have a new sentence.
text is present there.
bar
Substitute
sed -i 's/I have 2 sentences./I have new sentences./g' file.txt
sed -i 's/[A-Z]\s[a-z].*/I have new sentences./g' file.txt
Insert
sed -i -e '2iI have new sentences.' -e '2d' file.txt
I need to replace whole text between two 'text' words.
If I understand, first text. (with a dot) is at the end of first line and second text at the beginning of third line. With awk you can get the required solution adding values to var s:
awk -v s='\nI have new sentences.\n' '/text.?$/ {s=$0 s;next} /^text/ {s=s $0;print s;s=""}' file
This is a sample text.
I have new sentences.
text is present there.
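The awk one-liner above prints only the lines belonging to the marker pair. A flag-based variant that also passes the surrounding lines through could look like the sketch below. It assumes the same markers (a line ending in "text." and the next line starting with "text"), and replace_between is a hypothetical helper name.

```shell
#!/bin/sh
# replace_between NEW FILE: hypothetical helper. Replaces every line between
# a line ending in "text." and the next line starting with "text" by NEW.
replace_between() {
  awk -v new="$1" '
    skipping && /^text/ { print new; skipping = 0 }  # end marker: emit NEW once
    !skipping           { print }                    # pass everything else through
    /text\.$/           { skipping = 1 }             # start marker seen
  ' "$2"
}

# Demo on the sample from the question, with extra lines around it.
demo=$(mktemp)
printf '%s\n' 'foo' 'This is a sample text.' 'I have 2 sentences.' \
              'text is present there.' 'bar' > "$demo"
replace_between 'I have new sentences.' "$demo"
rm -f "$demo"
```

Note that awk interprets backslash escapes in values passed with -v, so a replacement containing backslashes would need them doubled.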

Matching pairs using Linux terminal

I have a file named list.txt containing a (supplier,product) pair and I must show the number of products from every supplier and their names using Linux terminal
Sample input:
stationery:paper
grocery:apples
grocery:pears
dairy:milk
stationery:pen
dairy:cheese
stationery:rubber
And the result should be something like:
stationery: 3
stationery: paper pen rubber
grocery: 2
grocery: apples pears
dairy: 2
dairy: milk cheese
Save the input to file, and remove the empty lines. Then use GNU datamash:
datamash -s -t ':' groupby 1 count 2 unique 2 < file
Output:
dairy:2:cheese,milk
grocery:2:apples,pears
stationery:3:paper,pen,rubber
The following pipeline should do the job
< your_input_file sort -t: -k1,1r | sed -E -n ':a;$p;N;s/([^:]*): *(.*)\n\1:/\1: \2 /;ta;P;D' | awk -F' ' '{ print $1, NF-1; print $0 }'
where
sort sorts the lines according to what's before the colon, in order to ease the successive processing
the cryptic sed joins the lines with common supplier
awk counts the items for supplier and prints everything appropriately.
Doing it with awk only, as suggested by KamilCuk in a comment, would be a much easier job; doing it with sed only would be (for me) a nightmare. Using both is maybe silly, but I enjoyed doing it.
If you need a detailed explanation, please comment, and I'll find time to provide one.
Here's the sed script written one command per line:
:a
$p
N
s/([^:]*): *(.*)\n\1:/\1: \2 /
ta
P
D
and here's how it works:
:a is just a label where we can jump back through a test or branch command;
$p is the print command applied only to the address $ (the last line); note that all other commands are applied to every line, since no address is specified;
N reads one more line and appends it to the current pattern space, putting a \newline in between; this creates a multiline in the pattern space
s/([^:]*): *(.*)\n\1:/\1: \2 / captures what's before the first colon on the line, ([^:]*), as well as what follows it, (.*), getting rid of excessive spaces, *; if the next line starts with the same supplier, the two lines are joined into one;
ta tests if the previous s command was successful, and, if this is the case, transfers the control to the line labelled by a (i.e. go to step 1);
P prints the leading part of the multiline up to and including the embedded \newline;
D deletes the leading part of the multiline up to and including the embedded \newline.
This should be close to the awk-only code I was referring to:
< list.txt awk -F: '{ count[$1] += 1; items[$1] = items[$1] " " $2 } END { for (supp in items) print supp": " count[supp], "\n"supp":" items[supp]}'
The awk script is more readable if written on several lines:
awk -F: '{ # for each line
# we use the word before the : as the key of an associative array
count[$1] += 1 # increment the count for the given supplier
items[$1] = items[$1] " " $2 # concatenate the current item to the previous ones
}
END { # after processing the whole file
for (supp in items) # iterate on the suppliers and print the result
print supp": " count[supp], "\n"supp":" items[supp]
}' list.txt
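The order in which for (supp in items) visits the suppliers is unspecified in awk, so the output order can differ from the input order. If the order of first appearance matters, one option is to record it explicitly, as in this sketch. It assumes the supplier:product format from the question, and group_products is a hypothetical helper name.

```shell
#!/bin/sh
# group_products FILE: hypothetical helper. Prints, for each supplier in
# order of first appearance, the product count and the product names.
group_products() {
  awk -F: '
    NF < 2         { next }                  # skip blank or malformed lines
    !($1 in count) { order[++n] = $1 }       # remember first-seen order
                   { count[$1]++; items[$1] = items[$1] " " $2 }
    END {
      for (i = 1; i <= n; i++) {
        s = order[i]
        print s ": " count[s]
        print s ":" items[s]
      }
    }
  ' "$1"
}

# Demo on the sample input from the question.
demo=$(mktemp)
printf '%s\n' 'stationery:paper' 'grocery:apples' 'grocery:pears' 'dairy:milk' \
              'stationery:pen' 'dairy:cheese' 'stationery:rubber' > "$demo"
group_products "$demo"
rm -f "$demo"
```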

How can I retrieve the matching records from mentioned file format in bash

XYZNA0000778800Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
I have the above file format, from which I want to find a matching record. For example, match a number (7789) on a line starting with XYZ and, once matched, look for a matching number (7345) in the lines below that start with 1, until reaching a line starting with 9. Then retrieve the entire line. How can I accomplish this using a shell script, awk, sed or any combination?
Expected Output:
XYZNA0000778900Z
17345000012300324000000004000000000000000
With sed one can do:
$ sed -n '/^XYZ.*7789/,/^9$/{/^1.*7345/p}' file
17345000012300324000000004000000000000000
Breakdown:
sed -n ' ' # -n disables automatic printing
/^XYZ.*7789/, # Range start: line starting with XYZ and
# containing 7789
/^9$/ { } # Range end: line that is exactly 9
/^1.*7345/p # Inside the range, print lines starting
# with 1 and containing 7345
range { stuff } will execute stuff when it's inside range, in this case the range is starting at /^XYZ.*7789/ and ending with /^9$/.
.* will match anything but newlines zero or more times.
If you want to print the whole block matching the conditions, one can use:
$ sed -n '/^XYZ.*7789/{:s;N;/\n9$/!bs;/\n1.*7345/p}' file
XYZNA0000778900Z
16123000012300321000000008000000000000000
16124000012300322000000007000000000000000
17234000012300323000000005000000000000000
17345000012300324000000004000000000000000
17456000012300325000000003000000000000000
9
This works by reading the lines between ^XYZ.*7789 and ^9$ into the pattern
space, and then printing the whole thing if ^1.*7345 can be matched:
sed -n ' ' # -n disables printing
/^XYZ.*7789/{ } # Match line starting
# with XYZ that also contains 7789
:s; # Define label s
N; # Append next line to pattern space
/\n9$/!bs; # Goto s unless \n9$ matches
/\n1.*7345/p # Print whole pattern space
# if \n1.*7345 matches
I'd use awk:
awk -v rid=7789 -v fid=7345 -v RS='\n9\n' -F '\n' 'index($1, rid) { for(i = 2; i <= NF; ++i) { if(index($i, fid)) { print $i; next } } }' filename
This works as follows:
-v RS='\n9\n' is the meat of the whole thing. Awk separates its input into records (by default lines). This sets the record separator to \n9\n, which means that records are separated by lines with a single 9 on them. These records are further separated into fields, and
-F '\n' tells awk that fields in a record are separated by newlines, so that each line in a record becomes a field.
-v rid=7789 -v fid=7345 sets two awk variables, rid and fid (meant by me as record identifier and field identifier, respectively; the names are arbitrary), to your search strings. You could encode these in the awk script directly, but this way makes it easier and safer to replace the values with those of shell variables (which I expect you'll want to do).
Then the code:
index($1, rid) { # In records whose first field contains rid
for(i = 2; i <= NF; ++i) { # Walk through the fields from the second to the last
if(index($i, fid)) { # When you find one that contains fid
print $i # Print it,
next # and continue with the next record.
} # Remove the "next" line if you want all matching
} # fields.
}
Note that multi-character record separators are not strictly required by POSIX awk, and I'm not certain if BSD awk accepts it. Both GNU awk and mawk do, though.
EDIT: Misread question the first time around.
an extendable awk script can be
$ awk '/^9$/{s=0} s&&/7345/; /^XYZ/&&/7789/{s=1} ' file
set flag s when line starts with XYZ and contains 7789; reset when line is just 9, and print when flag is set and contains pattern 7345.
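The expected output in the question shows the header line followed by the matching detail line. A flag-based variant that prints both might look like the following sketch, which sticks to plain POSIX awk features. It assumes headers start with XYZ, detail lines start with 1 and a line consisting of 9 ends the record; find_record is a hypothetical helper name.

```shell
#!/bin/sh
# find_record RID FID FILE: hypothetical helper. For each record whose XYZ
# header contains RID, print the header once and every detail line
# (starting with 1) that contains FID.
find_record() {
  awk -v rid="$1" -v fid="$2" '
    /^XYZ/ { hdr = (index($0, rid) ? $0 : ""); printed = 0; next }
    /^9$/  { hdr = ""; next }                # end of record: forget the header
    hdr != "" && /^1/ && index($0, fid) {
      if (!printed) { print hdr; printed = 1 }
      print
    }
  ' "$3"
}

# Demo on a shortened version of the sample file.
demo=$(mktemp)
cat > "$demo" <<'EOF'
XYZNA0000778800Z
17345000012300324000000004000000000000000
9
XYZNA0000778900Z
16123000012300321000000008000000000000000
17345000012300324000000004000000000000000
9
EOF
find_record 7789 7345 "$demo"
rm -f "$demo"
```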
This might work for you (GNU sed):
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789/!b;/7345/p' file
Use the option -n for the grep-like nature of sed. Gather up records beginning with XYZ and ending in 9. Reject any records which do not have 7789 in the header. Print any remaining records that contain 7345.
If the 7345 always follows the header, this could be shortened to:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^XYZ[^\n]*7789.*7345/p' file
If all records are well-formed (begin XYZ and end in 9) then use:
sed -n '/^XYZ/h;//!H;/^9/!b;x;/^[^\n]*7789.*7345/p' file

use grep and awk to transfer data from .srt to .csv/xls

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.
a srt file would look like this:
1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.
2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!
3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!
while I want to output it into a csv file like this:
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
So as you can see, each subtitle takes up two rows. My thinking would be using grep to put the srt data into the xls, and then use awk to format the xls file.
What do you guys think? How am I supposed to write it? I tried
$grep filename.srt > filename.xls
It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?
Thank you in advance! :)
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
split($2,a,/ .* /)
print q $1 s a[1] s a[2] s $3 q
for (i=4;i<=NF;i++) {
print "", "", "", q $i q
}
}
$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
I think something like this should do it quite nicely:
awk -v RS= -F'\n' '
{
sub(" --> ","\x7c",$2) # change "-->" to "|"
printf "%s|%s|%s\n",$1,$2,$3 # print scene, time start, time stop, description
for(i=4;i<=NF;i++)printf "|||%s\n",$i # print remaining lines of description
}' file.srt
The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.
The sub() replaces the "-->" with a pipe symbol (|).
The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.
Output
1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!
As I am feeling like having some more fun with Perl and Excel, I took the above output and parsed it in Perl and wrote a real Excel XLSX file. Of course, there is no real need to use awk and Perl so ideally one would re-cast the awk and integrate it into the Perl since the latter can write Excel files while the former cannot. Anyway here is the Perl.
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $DEBUG=0;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row=0;
while(my $line=<>){
$row++; # move down a line in Excel worksheet
chomp $line; # strip trailing newline
my @f=split /\|/, $line; # split fields of line into array @f, on pipe symbols (|)
for(my $j=0;$j<scalar @f;$j++){ # loop through all fields
my $cell= chr(65+$j) . $row; # calculate Excel cell, starting at A1 (65="A")
$worksheet->write($cell,$f[$j]); # write to spreadsheet
printf "%s:%s ",$cell,$f[$j] if $DEBUG;
}
printf "\n" if $DEBUG;
}
$workbook->close;
My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Excel::Writer::XLSX;
my $workbook = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow=0;
local $/ = ""; # set paragraph mode, so we read till next blank line as one record
while(my $para=<>){
$ExcelRow++; # move down a line in Excel worksheet
chomp $para; # strip trailing newlines
my @lines=split /\n/, $para; # split paragraph into lines on linefeed character
my $scene = $lines[0]; # pick up scene number from first line of para
my ($start,$end)=split / --> /,$lines[1]; # pick up start and end time from second line
my $cell=sprintf("A%d",$ExcelRow); # work out cell
$worksheet->write($cell,$scene); # write scene to spreadsheet column A
$cell=sprintf("B%d",$ExcelRow); # work out cell
$worksheet->write($cell,$start); # write start time to spreadsheet column B
$cell=sprintf("C%d",$ExcelRow); # work out cell
$worksheet->write($cell,$end); # write end time to spreadsheet column C
$cell=sprintf("D%d",$ExcelRow); # work out cell
$worksheet->write($cell,$lines[2]); # write description to spreadsheet column D
for(my $i=3;$i<scalar @lines;$i++){ # output additional lines of description
$ExcelRow++;
$cell=sprintf("D%d",$ExcelRow); # work out cell
$worksheet->write($cell,$lines[$i]);
}
}
$workbook->close;
Save the above in a file called srt2xls and then make it executable with the command:
chmod +x srt2xls
Then you can run it with
./srt2xls < SomeFile.srt
and it will write a spreadsheet called result.xlsx
Since you want to convert the srt into csv, below is an awk command:
awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv
Detailed view of the awk command:
awk '{
gsub(" --> ","\x22,\x22");
if(NF!=0)
{
if(j<3)
k=k"\x22"$0"\x22,";
else
{
k="\x22"$0"\x22 ";
l=1
}
j=j+1
}
else
j=0;
if(j==3)
{
print k;
k=""
}
if(l==1)
{
print ",,,"k;
l=0;
k=""
}
}' inputfile > output.csv
Take output.csv to a Windows machine, open it with Microsoft Excel and save it with the .xls extension.
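One detail worth handling before opening the CSV in a spreadsheet: a subtitle line may itself contain double quotes, and CSV readers generally expect those to be doubled inside a quoted field. The sketch below uses the same paragraph-mode approach as the awk answers above and adds that escaping. It assumes Unix line endings and blank-line-separated cues; srt_to_csv and csv are hypothetical names.

```shell
#!/bin/sh
# srt_to_csv FILE: hypothetical helper; assumes Unix line endings and
# blank-line-separated cues.
srt_to_csv() {
  awk '
    BEGIN { RS = ""; FS = "\n" }           # one cue per record, one line per field
    function csv(s) {                      # quote a field, doubling embedded quotes
      gsub(/"/, "\"\"", s)
      return "\"" s "\""
    }
    {
      split($2, t, / --> /)                # t[1] = start time, t[2] = end time
      print csv($1) "," csv(t[1]) "," csv(t[2]) "," csv($3)
      for (i = 4; i <= NF; i++)            # extra subtitle lines get their own row
        print ",,," csv($i)
    }
  ' "$1"
}

# Demo on a two-cue sample.
demo=$(mktemp)
cat > "$demo" <<'EOF'
1
00:00:00,104 --> 00:00:02,669
She said "hi" to me.

2
00:00:02,982 --> 00:00:04,965
I am not sure if it would work,
but I will try it!
EOF
srt_to_csv "$demo"
rm -f "$demo"
```

If the .srt file has Windows line endings, the "blank" lines contain a carriage return and paragraph mode will not split the cues, so the file would need converting first.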

How to KEEP only the last line of consecutive lines starting with the same word?

See this thread : How to remove the second line of consecutive lines starting with the same word?
Instead of keeping the first duplicate consecutive line starting with "TITLE", I would like to only keep the last one, to get from this input:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE some more
TITLE extra info
DATA some more data
This output:
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
Also, I'd like to be able to handle an arbitrary number of repetitions, not only 2 (for example, if 7 lines in a row start with "TITLE", only keep the last one).
As in the other post, it can be a perl/bash/sed/awk command that keeps only the last line and outputs the rest of the file as well. I've been working on this for a long time, but I could only find solutions that do the opposite of what I want.
With sed:
sed '/^TITLE/ { :a $! { N; /\nTITLE/ { s/.*\n//; ba; }; }; }' filename
That is:
/^TITLE/ { # if a line begins with TITLE
:a # jump label for looping.
$! { # unless we hit the end of input (in case the file
# ends with title lines)
N # fetch the next line
/\nTITLE/ { # if it begins with TITLE as well
s/.*\n// # remove the first
ba # go back to a
}
}
}
Just reverse the order of lines, then print the now-first occurrence, then reverse them again:
$ tac file | awk '$1!=prev; {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
or if there can be multiple consecutive DATA lines and you want to keep all of those:
$ tac file | awk '!($1=="TITLE" && $1==prev); {prev=$1}' | tac
TITLE something
DATA some data
TITLE something else
DATA some other data
TITLE extra info
DATA some more data
If you're looking for a Perl one-line solution, like the one in the question that you linked, then this will do
perl -ne'if (/^TITLE/) {$t = $_} else {print $t, $_; $t = ""}' myfile
Note that it will not print a TITLE line at all unless it is followed by a line that doesn't begin with TITLE
This might work for you (GNU sed):
sed -r 'N;/^(TITLE ).*\n\1/!P;D' file
This looks at two consecutive lines at a time and, if both begin with "TITLE ", does not print the first.
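For completeness, the same result can be had in a single pass of awk without tac by holding back each TITLE line until the next line shows whether it was the last of its run. This is a minimal sketch under the question's assumptions; keep_last_title is a hypothetical helper name.

```shell
#!/bin/sh
# keep_last_title FILE: hypothetical helper. Of each run of consecutive
# lines starting with TITLE, only the last one is printed.
keep_last_title() {
  awk '
    /^TITLE/ { held = $0; have = 1; next }           # newer TITLE replaces older
             { if (have) { print held; have = 0 } print }
    END      { if (have) print held }                # file ended on a TITLE line
  ' "$1"
}

# Demo on the sample input from the question.
demo=$(mktemp)
printf '%s\n' 'TITLE something' 'DATA some data' 'TITLE something else' \
              'DATA some other data' 'TITLE some more' 'TITLE extra info' \
              'DATA some more data' > "$demo"
keep_last_title "$demo"
rm -f "$demo"
```

Unlike the Perl one-liner, this variant still prints a trailing TITLE line that is not followed by any other line.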

Resources