Block cut with sed and suppress the last line - bash

# cat file
LBL 434
any lines but not block start
...
LBL 75677
...
any
LBL 777
...
LBL 798
...
# sed -ne '/LBL 75677/,/LBL/p' file | head -n -1
LBL 75677
...
any
#
The above command works for me, but I would like to know:
Can I suppress the last line without the head command, in a single sed script? I know sed's commands and control flow (N P D b ...) but I couldn't figure it out at the moment.
@Cyrus, thanks, it works fine and I understand how it works, thanks again.
But I wanted to find a different way of solving it, if there is one.
My idea was to collect the lines of the block /LBL 75677/,/LBL/ into sed's pattern space with the N command, use D to remove the last line from the pattern space (that line being the first line of the next block), and then print the whole pattern space. Can somebody do that?
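For what it's worth, here is one way to do it in a single sed script, close to the idea described above: accumulate the block with N, then strip the trailing line (the next LBL) before printing. GNU sed syntax; note that under -n a block that reaches EOF without a following LBL line is silently dropped:

```shell
# sample input on stdin; accumulate the block with N, then drop its last line
sed -n '/LBL 75677/{:a;N;/\nLBL/!ba;s/\(.*\)\n.*/\1/;p}' <<'EOF'
LBL 434
any lines but not block start
...
LBL 75677
...
any
LBL 777
...
LBL 798
...
EOF
```

The s/\(.*\)\n.*/\1/ relies on the greedy .* so the \n matches the last newline in the pattern space, removing only the final (unwanted) line of the accumulated block.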

Below script :
sed -n '/LBL 75677/{p;:loop;n;/LBL/!{p;b loop}}' file
may be what you're looking for.
:loop here is a label and b loop is unconditional jumping to that label.
Here we create a small loop and go on to print the lines until the next LBL is reached.

sed is for simple substitutions on individual lines (s/old/new/), that is all. For anything else you should be using awk:
$ awk '/LBL/{f=0} /LBL 75677/{f=1} f' file
LBL 75677
...
any
In addition to being simpler and clearer than an equivalent sed script, the above will execute faster (especially if you only want one record output, so you can change /LBL/{f=0} to /LBL/{exit}). It will also be more portable, as it works as-is with every awk on every UNIX system, and it will be vastly easier to enhance if/when your requirements change (for anything beyond s/old/new/, a tiny requirements change typically means a complete rewrite of a sed script).
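If only one block is wanted, the early-exit variant needs a small guard, since a bare /LBL/{exit} would fire on the file's first LBL line before the target block is ever reached. A sketch, with the sample data fed on stdin:

```shell
# exit only on an LBL line seen *after* the flag f has been set
awk 'f && /LBL/{exit} /LBL 75677/{f=1} f' <<'EOF'
LBL 434
any lines but not block start
...
LBL 75677
...
any
LBL 777
...
LBL 798
...
EOF
```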
If you're using any constructs other than s, g, and p (with -n) in sed then you are using constructs that became obsolete in the mid-1970s when awk was invented and so sed no longer needed all the cryptic runes to perform simple multi-line tasks.

Sed through files without using for loop?

I have a small script which basically generates a menu of all the scripts in my ~/scripts folder and, next to each of them, displays a sentence describing it: that sentence being the comment on the third line of each script. I then plan to pipe this into fzf or dmenu to select one and start editing it or whatever.
1 #!/bin/bash
2
3 # a script to do
So it would look something like this:
foo.sh a script to do X
bar.sh a script to do Y
Currently I have it run a for loop over all the files in the scripts folder and then run sed -n 3p on all of them.
for i in $(ls -1 ~/scripts); do
    echo -n "$i"
    sed -n 3p ~/scripts/"$i"
    echo
done | column -t -s '#' | ...
I was wondering if there is a more efficient way of doing this that did not involve a for loop and only used sed. Any help will be appreciated. Thanks!
Instead of a loop that is parsing ls output + sed, you may try this awk command:
awk 'FNR == 3 {
f = FILENAME; sub(/^.*\//, "", f); print f, $0; nextfile
}' ~/scripts/* | column -t -s '#' | ...
Yes there is a more efficient way, but no, it doesn't only use sed. This is probably a silly optimization for your use case though, but it may be worthwhile nonetheless.
The inefficiency is that you're using ls to read the directory and then parsing its output. For large directories, that causes a lot of overhead for keeping the list in memory even though you only traverse it once. Also, it's not done correctly; consider filenames with special characters that the shell interprets.
The more efficient way is to use find in combination with its -exec option, which starts a second program with each found file in turn.
BTW: if, instead of relying on line numbers, you used a tag to mark the description, you could also use grep -r, which avoids an additional process per file altogether.
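A sketch of the find + -exec approach on a throwaway directory (the directory name and script contents here are invented for illustration; nextfile needs GNU awk or another awk that supports it):

```shell
# toy layout standing in for ~/scripts
mkdir -p /tmp/scripts-demo
printf '#!/bin/bash\n\n# a script to do X\n' > /tmp/scripts-demo/foo.sh
printf '#!/bin/bash\n\n# a script to do Y\n' > /tmp/scripts-demo/bar.sh

# find hands the files to awk; FNR==3 grabs the description line of each
find /tmp/scripts-demo -type f -name '*.sh' -exec \
  awk 'FNR == 3 { f = FILENAME; sub(/^.*\//, "", f); sub(/^# */, ""); print f, $0; nextfile }' {} +
```

With -exec ... {} +, find batches many files into a single awk invocation instead of starting one process per file.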
This might work for you (GNU sed):
sed -sn '1h;3{H;g;s/\n/ /p}' ~/scripts/*
Use the -s option to reset the line number addresses for each file.
Copy line 1 to the hold space.
Append line 3 to the hold space.
Swap the hold space for the pattern space.
Replace the newline with a space and print the result.
All files in the directory ~/scripts will be processed.
N.B. You may wish to replace the space delimiter by a tab or pipe the results to the column command.

How to get all lines from a file after the last empty line?

Having a file like foo.txt with content
1
2

3

4
5
How do I get the lines starting with 4 and 5 out of it (everything after the last empty line), assuming the number of lines can vary?
Updated
Let's try a slightly simpler approach with just sed.
$: sed -n '/^$/{g;D;}; N; $p;' foo.txt
4
5
-n says don't print unless I tell you to.
/^$/{g;D;}; says on each blank line, clear it all out with this:
g : Replace the contents of the pattern space with the contents of the hold space. Since we never put anything in the hold space, this erases the (possibly long accumulated) pattern space. Note that I could have used z since this is GNU sed, but I wanted to break it out for non-GNU seds below, and in this case this works for both.
D : remove the now empty line from the pattern space, and go read the next.
Now previously accumulated lines have been wiped if (and only if) we saw a blank line. The D loops back to the beginning, so N will never see a blank line.
N : Add a newline to the pattern space, then append the next line of input to the pattern space. This is done on every line except blanks, after which the pattern space will be empty.
This accumulates all nonblanks until either 1) a blank is hit, which will clear and restart the buffer as above, or 2) we reach EOF with a buffer intact.
Finally, $p says on the LAST line (which will already have been added to the pattern space unless the last line was blank, which will have removed the pattern space...), print the pattern space. The only time this will have nothing to print is if the last line of the file was a blank line.
So the whole logic boils down to: clean the buffer on empty lines, otherwise pile the non-empty lines up and print at the end.
If you don't have GNU sed, just put the commands on separate lines.
sed -n '
/^$/{
g
D
}
N
$p
' foo.txt
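The one-liner can be sanity-checked on a small sample with blank lines (recreated inline with printf; GNU sed assumed, and the file name is arbitrary):

```shell
# two blank lines; everything after the last one should survive
printf '1\n2\n\n3\n\n4\n5\n' > /tmp/foo-demo.txt
sed -n '/^$/{g;D;}; N; $p;' /tmp/foo-demo.txt
```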
Alternate
The method above is efficient, but could potentially build up a very large pattern buffer on certain data sets. If that's not an issue, go with it.
Or, if you want it in simple steps, don't mind more processes doing less work each, and prefer less memory consumed:
last=$( sed -n /^$/= foo.txt|tail -1 ) # find the last blank
next=$(( ${last:-0} + 1 )) # get the number of the line after
cmd="$next,\$p" # compose the range command to print
sed -n "$cmd" foo.txt # run it to print the range you wanted
This runs a lot of small, simple tasks outside of sed so that it can give sed the simplest, most direct and efficient description of the task possible. It will read the target file twice, but won't have to manage filling, flushing, and refilling the accumulation of data in the pattern buffer with records before a blank line. Still likely slower unless you are memory bound, I'd think.
Reverse the file, print everything up to the first blank line, reverse it again.
$ tac foo.txt | awk '/^$/{exit}1' | tac
4
5
Using GNU awk:
awk -v RS='\n\n' 'END{printf "%s",$0}' file
RS is the record separator, here set to a blank line (two consecutive newlines); a multi-character RS like this is treated as a regular expression and requires GNU awk.
The END block prints the last record.
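A quick check on a sample recreated inline (needs an awk with regex RS support, such as GNU awk; file name invented):

```shell
printf '1\n2\n\n3\n\n4\n5\n' > /tmp/foo-demo2.txt
# the last blank-line-separated record is everything after the last empty line
awk -v RS='\n\n' 'END{printf "%s",$0}' /tmp/foo-demo2.txt
```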
try this:
tail -n +$(($(grep -nE '^$' test.txt | tail -n 1 | sed -e 's/://g') + 1)) test.txt
grep the input file for empty lines, with line numbers.
get the last match with tail => 5:
remove the unnecessary :
add 1 to 5 => 6
tail starting from line 6
You can try with sed:
sed -n ':A;$bB;/^$/{x;s/.*//;x};H;n;bA;:B;H;x;s/^..//;p' infile
With GNU sed:
sed ':a;/$/{N;s/.*\n\n//;ba;}' file

Using both GNU Utils with Mac Utils in bash

I am working with plotting extremely large files with N number of relevant data entries. (N varies between files).
In each of these files, comments are automatically generated at the start and end of the file and would like to filter these out before recombining them into one grand data set.
Unfortunately, I am using OS X, where I encounter some issues when trying to remove the last line of a file. I have read that the most efficient way is to use the head/tail commands to cut off sections of data. Since head -n -1 does not work on OS X, I installed coreutils through Homebrew, where the ghead command works wonderfully. However, the command
tail -n+9 $COUNTER/test.csv | ghead -n -1 $COUNTER/test.csv >> gfinal.csv
does not work. A less than pleasing workaround was to separate the commands: use ghead > newfile, then use tail on newfile > gfinal. Unfortunately, this will take a while, as I have to write a new file with the first ghead.
Is there a workaround to incorporating both GNU Utils with the standard Mac Utils?
Thanks,
Keven
The problem with your command is that you specify the file operand again for the ghead command instead of letting it take its input from stdin via the pipe. This causes ghead to ignore its stdin, so the first pipe segment is effectively ignored. Simply omit the file operand for the ghead command:
tail -n+9 "$COUNTER/test.csv" | ghead -n -1 >> gfinal.csv
That said, if you only want to drop the last line, there's no need for GNU head - OS X's own BSD sed will do:
tail -n +9 "$COUNTER/test.csv" | sed '$d' >> gfinal.csv
$ matches the last line, and d deletes it (meaning it won't be output).
Finally, as @ghoti points out in a comment, you could do it all using sed:
sed -n '9,$ {$!p;}' file
Option -n tells sed to only produce output when explicitly requested; 9,$ matches everything from line 9 through (,) the end of the file (the last line, $), and {$!p;} prints (p) every line in that range, except (!) the last ($).
I realize that your question is about using head and tail, but I'll answer as if you're interested in solving the original problem rather than figuring out how to use those particular tools to solve the problem. :)
One method using sed:
sed -e '1,8d;$d' inputfile
At this level of simplicity, GNU sed and BSD sed both work the same way. Our sed script says:
1,8d - delete lines 1 through 8,
$d - delete the last line.
If you decide to generate a sed script like this on-the-fly, beware of your quoting; you will have to escape the dollar sign if you put it in double quotes.
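For example, if the line count comes from a shell variable (the variable name here is made up), the dollar sign needs escaping inside double quotes so the shell leaves it for sed:

```shell
# 12 numbered lines; strip the top 8 and the last one, keeping 9-11
seq 1 12 > /tmp/strip-demo.txt
top=8
sed -e "1,${top}d;\$d" /tmp/strip-demo.txt
```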
Another method using awk:
awk 'NR>9{print last} NR>1{last=$0}' inputfile
This works a bit differently in order to "recognize" the last line: each line is held back and only printed once the next line has been read (and only after line 9 has been reached), so lines 1 through 8 and the final line are never printed.
This awk solution is a bit of a hack, and like the sed solution, relies on the fact that you only want to strip ONE final line of the file.
If you want to strip more lines than one off the bottom of the file, you'd probably want to maintain an array that would function sort of as a buffered FIFO or sliding window.
awk -v striptop=8 -v stripbottom=3 '
{ last[NR]=$0; }                                # buffer every line by line number
NR > striptop*2 { print last[NR-striptop]; }    # start printing once clear of the top strip
{ delete last[NR-striptop]; }                   # discard entries as they are printed (or stripped)
END { for (r=1; r<=NR-stripbottom+0; r++)       # flush the remainder in line order,
        if (r in last) print last[r]; }         # skipping the bottom strip
' inputfile
You specify how much to strip in variables. The last array keeps a number of lines in memory, prints from the far end of the stack, and deletes them as they are printed. The END section steps through whatever remains in the array, and prints everything not prohibited by stripbottom.

How can I merge multiple lines to create exactly two records based on field separators?

I need help writing a Unix script loop to process the following data:
200250|Wk50|200212|January|20024|Quarter4|2002|2002
|2003-01-12
|2003-01-18
|2003-01-05
|2003-02-01
|2002-11-03
|2003-02-01|
|2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002
|2002-10-27
|2002-11-02
|2002-10-06
|2002-11-02
|2002-08-04
|2002-11-02|
|2003-02-01|||||||
I have data in the above format in a text file. What I need to do is remove the newline character at the end of every line whose next line has | as its first character. The output I need is:
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
I need some help to achieve this. These shell commands are giving me nightmares!
The 'sed' approach:
sed ':a;N;$!ba;s/\n|/|/g' input.txt
That said, awk would be faster and easier to understand/maintain. I just had that example handy (a common idiom for removing embedded newlines with sed).
EDIT:
To clarify the difference between this answer (option #1) and the alternative solution by #potong (which I actually prefer: sed ':a;N;s/\n|/|/;ta;P;D' file), which I'll call option #2:
note that these are two of many possible options with sed. I actually prefer non-sed solutions since they do in general run faster. But these two options are notable because they demonstrate two distinct ways to process a file: option #1 all in-memory, and option #2 as a stream. (note: below when I say "buffer", technically I mean "pattern space"):
option #1 reads the whole file into memory:
:a is just a label; N says append the next line to the buffer; if end-of-file ($) is not (!) reached, then branch (b) back to label :a ...
then after the whole file is read into memory, process the buffer with the substitution command (s), replacing all occurrences of "\n|" (newline followed by "|") with just a "|", on the entire (g) buffer
option #2 processes just a couple of lines at a time:
reads / appends the next line (N) into the buffer, processes it (s/\n|/|/); branches (t) back to label :a only if the substitution was successful; otherwise prints (P) and clears/deletes (D) the current buffer up to the first embedded newline ... and the stream continues.
option #1 takes a lot more memory to run. In general, as large as your file. Option #2 requires minimal memory; so small I didn't bother to see what it correlates to (I'm guessing the length of a line.)
option #1 runs faster. In general, twice as fast as option #2; but obviously it depends on the file and what is being done.
On a ~500MB file, option #1 runs about twice as fast (1.5s vs 3.4s),
$ du -h /tmp/foobar.txt
544M /tmp/foobar.txt
$ time sed ':a;N;$!ba;s/\n|/|/g' /tmp/foobar.txt > /dev/null
real 0m1.564s
user 0m1.390s
sys 0m0.171s
$ time sed ':a;N;s/\n|/|/;ta;P;D' /tmp/foobar.txt > /dev/null
real 0m3.418s
user 0m3.239s
sys 0m0.163s
At the same time, option #1 takes about 500MB of memory, and option #2 requires less than 1MB:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4197 11001 99 172427 558888 1 19:22 pts/10 00:00:01 sed :a;N;$!ba;s/\n|/|/g /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 558188 (545M)
And option #2:
$ ps -F -C sed
UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
username 4401 11001 99 3468 864 3 19:22 pts/10 00:00:03 sed :a;N;s/\n|/|/;ta;P;D /tmp/foobar.txt
note: /proc/{pid}/smaps (Pss): 236 (0M)
In summary (w/ commentary),
if you have files of unknown size, streaming without buffering is a better decision.
if every second matters, then buffering the entire file and processing it at once may be fine -- but ymmv.
my personal experience with tuning shell scripts is that awk or perl (or tr, but it's the least portable) or even bash may be preferable to using sed.
yet, sed is a very flexible and powerful tool that gets a job done quickly, and can be tuned later.
Here is an awk solution:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "\n%s",$0} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
Explanation:
Awk implicitly loops through every line in the file.
substr($0,1,1)=="|"{printf "%s",$0;next}
If this line begins with a vertical bar, print it (without a final newline) and then skip to the next line. We use printf here, as opposed to the more common print, so that newlines are not printed unless we explicitly ask for them; passing $0 as an argument to a "%s" format, rather than using it as the format string itself, also avoids surprises should a line ever contain a % character.
{printf "\n%s",$0}
If the line didn't begin with a vertical bar, print a newline and then this line (again without a final newline).
END{print""}
At the end of the file, print a newline.
Refinement
The above prints out an extra newline at the beginning of the file. If that is a problem, then it can be eliminated with just a minor change:
$ awk 'substr($0,1,1)=="|"{printf "%s",$0;next} {printf "%s%s",new,$0;new="\n"} END{print""}' data
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
This might work for you (GNU sed):
sed ':a;N;s/\n|/|/;ta;P;D' file
This processes the file a line at a time, as an alternative to @michael_n's solution, which slurps the file content into memory before processing.
You could do this simply with perl:
$ perl -0777pe 's/\n(?=\|)//g' file
200250|Wk50|200212|January|20024|Quarter4|2002|2002|2003-01-12|2003-01-18|2003-01-05|2003-02-01|2002-11-03|2003-02-01||2003-02-01|||||||
200239|Wk39|200209|October|20023|Quarter3|2002|2002|2002-10-27|2002-11-02|2002-10-06|2002-11-02|2002-08-04|2002-11-02||2003-02-01|||||||
awk -f test.awk input.txt
test.awk
{
    if($0 ~ /^\|/)
    {
        array[i++] = $0
    }
    else
    {
        if (NR > 1)        # avoid printing a spurious empty first line
            flushline()
        line = $0
    }
}
END { flushline() }        # flush the final record
function flushline(    j)
{
    for(j=0;j<i;j++)
        line = line array[j]
    i = 0
    print line
    line = ""
}
awk -f inp.awk input | sed '/^$/d'
inp.awk
{
    if($0 !~ /^\|/)
    {
        print line;
        line = $0;
    }
    else
    {
        line = line $0;
    }
}
END { print line }   # flush the final record; the sed in the pipeline removes the leading blank line

sed substitute and show line number

I'm working in bash trying to use sed substitution on a file and show both the line number where the substitution occurred and the final version of the line. For a file with lines that contain foo, trying with
sed -n 's/foo/bar/gp' filename
will show me the lines where substitution occurred, but I can't figure out how to include the line number. If I try to use = as a flag to print the current line number like
sed -n 's/foo/bar/gp=' filename
I get
sed: -e expression #1, char 14: unknown option to `s'
I can accomplish the goal with awk like
awk '{if (sub("foo","bar",$0)){print NR $0}}' filename
but I'm curious if there's a way to do this with one line of sed. If possible I'd love to use a single sed statement without a pipe.
I can't think of a way to do it without listing the search pattern twice and using command grouping.
sed -n "/foo/{s/foo/bar/g;=;p;}" filename
EDIT: mklement0 helped me out there by mentioning that if the pattern space is empty, the default pattern space is the last one used, as mentioned in the manual. So you could get away with it like this:
sed -n "/foo/{s//bar/g;=;p;}" filename
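Note that = writes the line number on a line of its own, before the substituted line, rather than prefixing it the way the awk version does. A quick demonstration on an invented sample file:

```shell
printf 'no match\nfoo here\nstill nothing\nfoo again\n' > /tmp/sub-demo.txt
# the empty s// pattern reuses the preceding /foo/ address
sed -n "/foo/{s//bar/g;=;p;}" /tmp/sub-demo.txt
```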
Before that, I figured out a way not to repeat the pattern space, but it uses branches and labels. "In most cases," the docs specify, "use of these commands indicates that you are probably better off programming in something like awk or Perl. But occasionally one is committed to sticking with sed, and these commands can enable one to write quite convoluted scripts." [source]
sed -n "s/foo/bar/g;tp;b;:p;=;p" filename
This does the following:
s/foo/bar/g does your substitution.
tp will jump to :p iff a substitution happened.
b (branch with no label) will process the next line.
:p defines label p, which is the target for the tp command above.
= and p will print the line number and then the line.
End of script, so go back and process the next line.
See? Much less readable...and maybe a distant cousin of :(){ :|:& };:. :)
It cannot be done in any reasonable way with sed; here's how to really do it clearly and simply in awk:
awk 'sub(/foo/,"bar"){print NR, $0}' filename
sed is an excellent tool for simple substitutions on a single line, for anything else use awk.
