Remove newline if preceded with specific character on non-consecutive lines - bash

I have a text file with every other line ending with a % character. I want to find the pattern "% + newline" and replace it with "%". In other words, I want to delete the newline character right after the % and not the other newline characters.
For example, I want to change the following:
abcabcabcabc%
123456789123
abcabcabcabc%
123456789123
to
abcabcabcabc%123456789123
abcabcabcabc%123456789123
I've tried the following sed command, to no avail.
sed 's/%\n/%/g' < input.txt > output.txt

By default, sed can't remove newlines because it reads one newline-separated line at a time and strips the trailing newline before the script runs, so \n never matches within a single line.
With any awk in any shell on every UNIX box for any number of lines ending in %, consecutive or not:
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
abcabcabcabc%123456789123
abcabcabcabc%123456789123
and with consecutive % lines:
$ cat file
now is the%
winter of%
our%
discontent
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
now is the%winter of%our%discontent
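For experimenting with this, the one-liner can be wrapped in a small shell function. This is only a sketch: the name join_pct is made up here, and the logic is the same awk program as above (print each line, and follow it with the output record separator only when the line does not end in %).

```shell
# join_pct (hypothetical name): join each line ending in % with the line after it.
join_pct() {
    # Print the line, then ORS (a newline) unless the line ends in %.
    awk '{ printf "%s%s", $0, (/%$/ ? "" : ORS) }' "$@"
}

# Usage: reads the named files, or standard input when none are given.
printf 'abcabcabcabc%%\n123456789123\n' | join_pct
```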

Your sample data implies that no two consecutive lines end with %.
In that case, you may use
sed '/%$/{N;s/\n//}' file.txt > output.txt
It works as follows:
/%$/ - finds all lines ending with %
{N;s/\n//} - a block:
N - adds a newline to the pattern space, then appends the next line of input to the pattern space
s/\n// - removes a newline in the current pattern space.
See the online sed demo.

In portable sed that supports any number of continued lines:
parse.sed
:a # A goto label named 'a'
/%$/ { # When the pattern space ends in '%'
N # Append the next line
s/\n// # Remove new-line
ta # If new-line was replaced goto label 'a'
}
Run it like this:
sed -f parse.sed infile
Output when infile contains your input and the input from Ed Morton's answer:
abcabcabcabc%123456789123
abcabcabcabc%123456789123
now is the%winter of%our%discontent
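If a separate script file is inconvenient, the same loop can probably be passed inline, one command per -e, which avoids relying on how a given sed parses labels, braces and comments on a single line. A sketch, with join_continued as a made-up function name; note that when the very last input line ends in %, what N does at end of input differs between sed implementations.

```shell
# join_continued (hypothetical name): same loop as parse.sed, given inline.
join_continued() {
    # :a       label "a"
    # /%$/{    only when the pattern space ends in %
    #   N      append the next input line
    #   s/\n// remove the newline that N inserted
    #   ta     if that substitution happened, go back to "a"
    # }
    sed -e ':a' -e '/%$/{' -e 'N' -e 's/\n//' -e 'ta' -e '}' "$@"
}

printf 'now is the%%\nwinter of%%\nour%%\ndiscontent\n' | join_continued
```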

Related

Bash; Replacing new line with ", " and ending with ".", can someone explain awk and sed, please?

so let's say i have
aaa
bbb
ccc
ddd
and i need to replace all new lines by comma and space, and end with a dot like so
aaa, bbb, ccc, ddd.
I found this, but i can't understand it at all and i need to know what every single character does ;
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Can someone make those two commands human-readable ?
tysm!
About the command:
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Awk prints each line ($0) followed by ", " and no newline. When it is done, a ", " is left trailing at the end.
The pipe to sed then replaces that trailing ", " with a single dot, because .\{2\}$ matches any two characters at the end of the string.
With a single sed command, you can use N to pull each next line into the pattern space, looping back to a label as long as the current line is not the last line, and then replace all the newlines at once.
After that you can append a dot to the end.
sed ':a;N;$!ba;s/\n/, /g;s/$/./' file
Output
aaa, bbb, ccc, ddd.
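Spelled out one command per -e, the loop may be easier to follow. The sketch below is a slight variation rather than the exact command above: it guards N with $! so that a one-line input still gets its dot (GNU sed otherwise stops at an N that has no next line to read). The function name commas_and_dot is hypothetical.

```shell
commas_and_dot() {
    # :a            label "a"
    # $!{ N; ba }   not on the last line: append the next line, jump back to "a"
    # s/\n/, /g     once the last line is in: turn every embedded newline into ", "
    # s/$/./        append the final dot
    sed -e ':a' -e '$!{' -e 'N' -e 'ba' -e '}' -e 's/\n/, /g' -e 's/$/./' "$@"
}

printf 'aaa\nbbb\nccc\nddd\n' | commas_and_dot
```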
ok,
first of all; thank u.
I do now understand that printf $0", " means 'print every line, and ", " at the end of each'
as for the sed command, a colleague explained it to me a minute ago;
in 's/.\{2\}$/./',
s/ replace
. any character
{2} x2, so two characters
$ at end of the line
/ by ( 's/ / /' = replace/ this / that /)
. the character '.'
/ end
without forgetting to escape { and }, so u end up with
's/ . \{2\} $ / . /'
but wait, it gets even better;
my colleague also told me that \{2\} wasn't necessary in this case ;
.{2} (without the escapes) could simply be replaced by
.. 'any character' twice.
so 's/..$/./' which is way more readable i think
'replace/ whichever two last characters / this character/'
hope this helps if any other 42 student gets here
tism again
awk '{ printf $0", " }'
This is an awk command with a single action, enclosed in {...}; this action is applied to every line of input.
printf is print with a format; here no formatting takes place, but another feature of printf is leveraged: unlike print, printf does not append the output record separator (default: newline).
$0 denotes whole current line (sans trailing newline).
", " is string literal for comma followed by space.
$0", " instructs awk to concatenate current line with comma and space.
Whole command thus might be read as: for every line output current line followed by comma and space
sed 's/.\{2\}$/./'
s is one of commands, namely substitute, basic form is
s/regexp/replacement/
therefore
.\{2\}$ is the regular expression: . denotes any character, \{2\} means repeated 2 times, and $ denotes end of line, so it matches the last 2 characters of each line. As the text was already converted to a single line, it matches the last 2 characters of the whole text.
. is replacement, literal dot character
Whole command thus might be read as: for every line replace 2 last characters using dot
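One caveat worth knowing: in printf $0", " the input line itself becomes the format string, so a line containing % can be misread as a format specifier. A minimal sketch that passes the line as an argument instead, and adds the separators and the final dot in awk so that no sed step is needed; the function name join_lines and the sep variable are made up for this example.

```shell
join_lines() {
    awk '
        { printf "%s%s", sep, $0; sep = ", " }   # separator goes before every line but the first
        END { if (NR) print "." }                # final dot and newline, if there was any input
    ' "$@"
}

printf 'aaa\nbbb\nccc\nddd\n' | join_lines
```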
Assuming the four lines are in a file...
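#!/bin/sh
# Write the ed commands to a script file (the quoted 'EOF' keeps every $ literal).
cat << 'EOF' > ed1
%s/$/,/
$
s/,$/./
wq
EOF
ed -s file < ed1
# Join the lines with spaces; tr reads the file directly, so no cat is needed.
tr '\n' ' ' < file > f2
mv f2 file
rm -v ./ed1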
echo 'aaa
bbb
ccc
ddd' |
mawk NF+=RS FS='\412' RS= OFS='\40\454' ORS='\456\12'
aaa, bbb, ccc, ddd.

Replace every 4th occurrence of char "_" with "#" in multiple files

I am trying to replace every 4th occurrence of "_" with "#" in multiple files with bash.
E.g.
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo..
would become
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo...
#perl -pe 's{_}{++$n % 4 ? $& : "#"}ge' *.txt
I have tried perl, but the problem is that it replaces every 4th _ carrying on from the last file. For example, in some files the first _ is replaced, because the count does not start at 0 for each new file; it carries on from the previous file.
I have tried:
#awk '{for(i=1; i<=NF; i++) if($i=="_") if(++count%4==0) $i="#"}1' *.txt
but this also does not work.
Using sed, I cannot find a way to keep replacing every 4th occurrence, as there are different numbers of _ in each file. Some files have 20 _, some have 200 _. Therefore, I can't specify a range.
I am really lost what to do, can anybody help?
You just need to reset the counter in the perl one using eof to tell when it's done reading each file:
perl -pe 's{_}{++$n % 4 ? "_" : "#"}ge; $n = 0 if eof' *.txt
This MAY be what you want, using GNU awk for RT:
$ awk -v RS='_' '{ORS=(FNR%4 ? RT : "#")} 1' file
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo..
It only reads each _-separated string into memory 1 at a time so should work no matter how large your input file, assuming there are _s in it.
It assumes you want to replace every 4th _ across the whole file as opposed to within individual lines.
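Where GNU awk's RT is not available, the same whole-file counting can be sketched in POSIX awk by walking each line with index() and resetting the counter when a new file starts. This is an illustration under those assumptions, not a drop-in replacement; the function name every_4th and the variables n, out and rest are invented for it, and it writes to standard output rather than editing the files.

```shell
every_4th() {
    awk '
        FNR == 1 { n = 0 }                  # restart the count at each new file
        {
            out = ""; rest = $0
            # Consume the line one underscore at a time, counting as we go.
            while ((p = index(rest, "_")) > 0) {
                n++
                out = out substr(rest, 1, p - 1) (n % 4 ? "_" : "#")
                rest = substr(rest, p + 1)
            }
            print out rest
        }
    ' "$@"
}

printf 'foo_foo_foo_foo_foo_foo_foo_foo_foo_foo\n' | every_4th
```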
A simple sed would handle this:
s='foo_foo_foo_foo_foo_foo_foo_foo_foo_foo'
sed -E 's/(([^_]+_){3}[^_]+)_/\1#/g' <<< "$s"
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
Explanation:
(: Start capture group #1
([^_]+_){3}: Match 1+ of non-_ characters followed by a _. Repeat this group 3 times to match 3 such words separated by _
[^_]+: Match 1+ of non-_ characters
): End capture group #1
_: Match a _
Replacement is \1# to replace 4th _ with a #
With GNU sed:
sed -nsE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
-n suppresses the automatic printing, -s processes each file separately, -E uses extended regular expressions.
The script is a loop between label a (:a) and the branch-to-label-a command (ba). Each iteration appends the next line of input to the pattern space (N). This way, after the last line has been read, the pattern space contains the whole file(*). During the last iteration, when the last line has been read ($), a substitute command (s) replaces every 4th _ in the pattern space by a # (s/(([^_]*_){3}[^_]*)_/\1#/g) and prints (p) the result.
Once you are satisfied with the result, you can change the options:
sed -i -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, or:
sed -i.bkp -nE ':a;${s/(([^_]*_){3}[^_]*)_/\1#/g;p};N;ba' *.txt
to modify the files in-place, but keep a *.txt.bkp backup of each file.
(*) Note that if you have very large files this could cause memory overflows.
With your shown samples, please try the following awk program. It defines an awk variable named fieldNum, set to 4 because the OP wants a # in place of every 4th _; change it to suit your needs. Note that the count restarts on every line.
awk -v fieldNum="4" '
BEGIN{ FS=OFS="_" }
{
  val=""
  for(i=1;i<=NF;i++){
    # After every fieldNum-th field emit "#" instead of "_"; nothing follows the last field.
    val=val $i (i<NF ? (i%fieldNum==0 ? "#" : OFS) : "")
  }
  print val
}
' Input_file
With GNU awk
$ cat ip.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
123_45678_90
_
$ awk -v RS='(_[^_]+){3}_' -v ORS= '{sub(/_$/, "#", RT); print $0 RT}' ip.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
123_45678_90
#
-v RS='(_[^_]+){3}_' set input record separator to cover sequence of four _ (text matched by this separator will be available via RT)
-v ORS= empty output record separator
sub(/_$/, "#", RT) change last _ to #
Use -i inplace for inplace editing.
If the count should reset for each line:
perl -pe's/(?:_[^_]*){3}\K_/\#/g'
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
If the count shouldn't reset for each line, but should reset for each file:
perl -0777pe's/(?:_[^_]*){3}\K_/\#/g'
The -0777 option causes the whole file to be treated as one line, so the count works properly across lines.
But since a new match is started for each file, the count is reset between files.
$ cat a.txt
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
foo_foo_foo_foo_foo_foo_foo_foo_foo_foo
$ perl -0777pe's/(?:_[^_]*){3}\K_/\#/g' a.txt a.txt
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
foo_foo_foo_foo#foo_foo_foo_foo#foo_foo
foo_foo_foo#foo_foo_foo_foo#foo_foo_foo
To avoid reading the entire file at once, you could keep using the counter-based approach from the question, with the following added:
$n = 0 if eof;
Note that eof is not the same thing as eof()! See eof.

If a line has a length less than a number, append to its previous line

I have a file that looks like this:
ABCDEFGH
ABCDEFGH
ABC
ABCDEFGH
ABCDEFGH
ABCD
ABCDEFGH
Most of the lines have a fixed length of 8. But there are some lines in between that have a length less than 8. I need a simple line of code that appends each of those short lines to its previous line.
I have tried the following code but it takes lots of memory when working with large files.
cat FILENAME | awk 'BEGIN{OFS=FS="\t"}{print length($1), $1}' | tr '\n' '\t' | sed 's/8/\n/g' | awk 'BEGIN{OFS="";FS="\t"}{print $2, $4}'
The output I expect:
ABCDEFGH
ABCDEFGHABC
ABCDEFGH
ABCDEFGHABCD
ABCDEFGH
If perl is your option, please try:
perl -0777 -pe 's/(\n)(.{1,7})$/$2/mg' filename
The -0777 option tells perl to slurp the whole file at once.
The pattern (\n)(.{1,7})$ matches a newline followed by a line shorter than 8 characters, capturing the newline in $1 and the short line in $2.
The replacement $2 omits the preceding newline, so the short line is appended to the previous line.
sed <FILENAME 'N;/\n.\{8\}/!s/\n//;P;D'
N; - append next line to pattern space
/\n.\{8\}/ - does second line contain 8 characters?
!s/\n//; - no: join the two lines
P - print first line of pattern space
D - delete first line of pattern space, start next cycle
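For readers less used to N, P and D, the same idea of holding back one output line until the next input line has been seen can be sketched in awk. This is an assumption-laden illustration, not the answer's method: join_short, width and prev are made-up names, and a line counts as "full" when it has at least width characters. Only one output line is held in memory at a time.

```shell
join_short() {
    awk -v width=8 '
        # A full-length line starts a new output line: flush the one held back.
        NR > 1 && length($0) >= width { print prev; prev = $0; next }
        # The first line, or a short line: append it to the held-back line.
        { prev = prev $0 }
        END { if (NR) print prev }
    ' "$@"
}

printf 'ABCDEFGH\nABCDEFGH\nABC\nABCDEFGH\nABCD\n' | join_short
```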
By default, print each line without a trailing newline, and emit a newline first whenever the current line has length 8, which ends the previous output line.
The first and last lines are special cases.
awk 'NR==1 {printf "%s", $0; next}
length($0)==8 {printf "\n"}
{printf "%s", $0}
END { printf "\n" }' FILENAME
If you have GNU sed 4.2.2 or later (which supports the -z option), you can try the following.
EDIT (see comments): this is the inferior solution:
sed -rz 's/\n(.{0,7})\n/\1\n/g' FILENAME
If you like old traditional tools, you can use ed, the standard text editor:
printf '%s\n' 'g/^.\{,7\}$/-,.j' wq | ed -s filename

Split a big txt file to do grep - unix

I work (Unix, shell scripts) with text files that contain millions of fields separated by pipes, with no \n or \r between records.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), a command like grep doesn't help, because everything is on the same line.
I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf "%s|", $(i+j); printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily set the length of line.
Hope this helps !
you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated multiple times
| followed by a '|'
the above is obviously one column, and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
you can use sed to convert every 6th | to a newline.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
consider this (with 2 fields per line instead of 6, to keep the example short):
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| characters.
[^|]\+| is a sequence of at least one non-| characters followed by a |.
\([^|]\+|\) is a sequence of at least one non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf $0 (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just change the 7 in NR%7 to the number of fields you want per line.
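A variation on the RS="|" idea, sketched under two assumptions: every field, including the last, is terminated by |, as in the question's sample, and any newline characters in the input can be discarded. split_every and its group-size argument are hypothetical names. Because awk reads one |-terminated field at a time, the whole line never has to fit in memory.

```shell
# split_every N [file...]: put N pipe-terminated fields on each output line.
split_every() {
    n=$1; shift
    # Drop newlines first so the text after the final | is not seen as a field.
    cat -- "$@" | tr -d '\n' |
        awk -v RS='|' -v n="$n" '
            { printf "%s|%s", $0, (NR % n ? "" : "\n") }   # newline after every n-th field
            END { if (NR % n) printf "\n" }                # finish a partial last group
        '
}

printf 'a1|b2|c3|d4|\n' | split_every 2
```

The result can then be piped into grep as usual, for example split_every 6 input.txt | grep surname.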
What about printing the line in blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} set input and output field separator as |.
{for (i=1; i<=NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6, printing six of them each time. Since print ends its output with a newline, each block lands on its own line.
If you want to treat the files as being in multiple lines, then make \n the field separator. For example, to get the 2nd column, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex

Add a period at end of paragraph

I need a command to add a period (full stop) to the end of a paragraph. I have tried the following command:
sed '/ +$ / s/$/ ./' $FILENAME
but it does not work!!
awk -v RS="" -v ORS=".\n\n" 1 file
This redefines the input record separator to be empty, so that awk reads blank-line separated paragraphs as a single record. It sets the output record separator to be a dot and 2 newlines. The actual awk program, 1, simply prints each record.
One side-effect is that any consecutive blank lines will be collapsed into a single blank line.
OK, sheesh
awk -v RS="" -v ORS="\n\n" '{sub(/\.?$/,".")} 1'
In action: (piping through cat -n just to point out the newlines)
echo -e "a.\n\nb\nc\n\n\nd" |
awk -v RS="" -v ORS="\n\n" '{sub(/\.?$/,".")} 1' |
cat -n
1 a.
2
3 b
4 c.
5
6 d.
7
There's an extra newline at the end, due to the ORS.
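To avoid that extra trailing newline, the blank line can be printed before every paragraph except the first instead of after each one. A sketch along those lines; add_periods is a made-up name, and like the command above it collapses runs of blank lines into a single one.

```shell
add_periods() {
    awk -v RS= '
        NR > 1 { print "" }              # blank line between paragraphs, none after the last
        { sub(/\.?$/, "."); print }      # make each paragraph end in exactly one dot
    ' "$@"
}

printf 'a.\n\nb\nc\n\n\nd\n' | add_periods
```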
And, as a bonus, here's a bit of Perl that preserves the inter-paragraph spacing:
echo -e "a.\n\nb\nc\n\n\nd" | perl -0777 -pe 's/\.?(\n(\n+|$))/.$1/g' | cat -n
1 a.
2
3 b
4 c.
5
6
7 d.
Not very good, but it seems to work...
$ cat input
This is a paragraph with some text. Some random text that is not really important.
This is another paragraph with some text.
However this sentence is still in the same paragraph
$ tr '\n' '#' < input | sed 's/\([^.]\)##/\1.##/g' | tr '#' '\n'
This is a paragraph with some text. Some random text that is not really important.
This is another paragraph with some text.
However this sentence is still in the same paragraph.
Accumulate 'paragraphs' in the hold space. Keep accumulating as long as the input line contains any non-space character(s).
When you get a blank/empty line, assume you have an accumulated paragraph. Swap the current (blank) line with the hold space. Replace the last non-space character in the pattern space (which is now the "paragraph" you were accumulating) with itself followed by a dot, unless that character is a dot. Print the result.
I think this does it:
$ cat test
this is a test line. one-line para
this is a test line. one-line para. with period.
this is a
two line para-
graph with dot.
this is a
two-line paragraph
with no dot
also works on last
line of file
$ sed -n \
-e '/^[[:space:]]*$/{x;s/\([^.[:space:]][[:space:]]*\)$/\1./;p;n;}' \
-e '/^[[:space:]]*[^[:space:]]/H' \
test
this is a test line. one-line para.
this is a test line. one-line para. with period.
this is a
two line para-
graph with dot.
this is a
two-line paragraph
with no dot.
Using sed.
sed ':loop;$!{N;b loop};s/[^\.]$/&./;s/\([^\.]\)\(\n[ \t]*\n\)/\1.\2/g' file
explanation
:loop;$!{N;b loop} collects all the lines in the pattern space, delimited by newlines.
s/[^\.]$/&./ adds a . if the last paragraph doesn't end with a dot.
s/\([^\.]\)\(\n[ \t]*\n\)/\1.\2/g adds a dot before \n \n, which is identified as a paragraph break.
This should work, as long as each paragraph is on a single line:
sed "s/\([[:alpha:]]\+[^.]\)$/\1./" "$FILENAME"
A pure sed solution using the hold space to save all lines from a paragraph and append a period just before printing:
sed -ne '
## Append current line to "hold space".
H
## When found an empty line, get content of "hold space", remove leading
## newline added by "H" command, append a period at the end and print.
## Also, clean "hold space" to save following paragraph.
/^$/ { g; s/^\n//; s/\(.*\)\(\n\)/\1.\2/; p; s/^.*$//; h; b }
## Last line is a bit special, it has no following blank line but it is also
## an end of paragraph. It is similar to previous case but simpler.
$ { x; s/^\n//; s/$/./; p }
' infile
Assuming an infile with content:
one
two
three
four
five
six
It yields:
one
two.
three.
four
five
six.

Resources