Bash; Replacing new line with ", " and ending with ".", can someone explain awk and sed, please? - bash

so let's say i have
aaa
bbb
ccc
ddd
and i need to replace all new lines by comma and space, and end with a dot like so
aaa, bbb, ccc, ddd.
I found this, but i can't understand it at all and i need to know what every single character does ;
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Can someone make those two commands human-readable ?
tysm!

About the command:
... | awk '{ printf $0", " }' | sed 's/.\{2\}$/./'
Awk prints each line ($0) followed by ", " and no newline. When it is done, the output ends with a trailing ", ".
Then the pipe to sed replaces that trailing ", " with a single dot, because .\{2\}$ matches any two characters at the end of the string.
With sed alone, you can read all lines using N to pull the next line into the pattern space, and use a label to keep looping as long as the current line is not the last line. Once everything is in the pattern space, every newline is replaced with ", ".
After that you can append a dot to the end.
sed ':a;N;$!ba;s/\n/, /g;s/$/./' file
Output
aaa, bbb, ccc, ddd.
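To make that sed one-liner easier to read, here is a sketch with one -e per command and a comment for each. The function name join_with_sed is only made up for this example. One caveat, as far as I know: with a single-line input, N has no next line to read, so sed stops there and the dot is never added.

```shell
# Hypothetical helper (name chosen for this example); reads lines on stdin.
join_with_sed() {
  # :a         define a label named "a"
  # N          append the next input line to the pattern space (joined by \n)
  # $!ba       if this is not the last line ($!), branch back to label "a"
  # s/\n/, /g  everything is loaded now: turn every newline into ", "
  # s/$/./     append a dot at the very end
  sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/, /g' -e 's/$/./'
}

printf 'aaa\nbbb\nccc\nddd\n' | join_with_sed
```

This prints aaa, bbb, ccc, ddd. for the sample input.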

ok,
first of all; thank u.
I do now understand that printf $0", " means 'print every line, and ", " at the end of each'
as for the sed command, a colleague explained it to me a minute ago;
in 's/.\{2\}$/./',
s/ replace
. any character
{2} x2, so two characters
$ at end of the line
/ by ( 's/ / /' = replace/ this / that /)
. the character '.'
/ end
without forgetting to escape { and }, so u end up with
's/ . \{2\} $ / . /'
but wait, it gets even better;
my colleague also told me that \{2\} wasn't necessary in this case ;
.{2} (without the escapes) could simply be replaced by
.. 'any character' twice.
so 's/..$/./' which is way more readable i think
'replace/ whichever two last characters / this character/'
hope this helps if any other 42 student gets here
tism again

awk '{ printf $0", " }'
This is an awk command with a single action, enclosed in {...}; the action is applied to every line of input.
printf is print with a format. Here the line itself ends up being the format string (so a literal % in the input could be misinterpreted), but the feature being leveraged is another one: unlike print, printf does not append the output record separator (default: newline).
$0 denotes the whole current line (sans trailing newline).
", " is a string literal for a comma followed by a space.
$0", " instructs awk to concatenate the current line with the comma and space.
The whole command thus might be read as: for every line, output the current line followed by a comma and a space
sed 's/.\{2\}$/./'
s is one of sed's commands, namely substitute; its basic form is
s/regexp/replacement/
therefore
.\{2\}$ is the regular expression: . denotes any single character, \{2\} means repeated 2 times, and $ denotes the end of the line, so it matches the last 2 characters of each line. As the text was already converted to a single line, it matches the last 2 characters of the whole text.
. is the replacement, a literal dot character
The whole command thus might be read as: for every line, replace the last 2 characters with a dot
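Since the separator is only needed between lines, the trailing ", " can be avoided in the first place and the sed step dropped. A minimal sketch in awk only, assuming the input is small enough to stream line by line (join_lines is a name made up here). It also passes the line as an argument to "%s" instead of using it as the format string, so a % in the data is printed as-is.

```shell
# Hypothetical helper (name chosen for this example); reads lines on stdin.
join_lines() {
  awk '
    NR > 1 { printf ", " }        # before every line except the first: separator
           { printf "%s", $0 }    # the line itself, no newline added
    END    { if (NR) print "." }  # after the last line: dot and final newline
  '
}

printf 'aaa\nbbb\nccc\nddd\n' | join_lines
```

For the sample input this prints aaa, bbb, ccc, ddd.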

Assuming the four lines are in a file...
#!/bin/sh
cat << EOF >> ed1
%s/$/,/g
$
s/,/./
wq
EOF
ed -s file < ed1
cat file | tr '\n' ' ' > f2
mv f2 file
rm -v ./ed1

echo 'aaa
bbb
ccc
ddd' |
mawk NF+=RS FS='\412' RS= OFS='\40\454' ORS='\456\12'
aaa, bbb, ccc, ddd.

Related

Remove newline if preceded with specific character on non-consecutive lines

I have a text file with every other line ending with a % character. I want to find the pattern "% + newline" and replace it with "%". In other words, I want to delete the newline character right after the % and not the other newline characters.
For example, I want to change the following:
abcabcabcabc%
123456789123
abcabcabcabc%
123456789123
to
abcabcabcabc%123456789123
abcabcabcabc%123456789123
I've tried the following sed command, to no avail.
sed 's/%\n/%/g' < input.txt > output.txt
By default sed can't remove newlines because it reads one newline-separated line at a time.
With any awk in any shell on every UNIX box for any number of lines ending in %, consecutive or not:
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
abcabcabcabc%123456789123
abcabcabcabc%123456789123
and with consecutive % lines:
$ cat file
now is the%
winter of%
our%
discontent
$ awk '{printf "%s%s", $0, (/%$/ ? "" : ORS)}' file
now is the%winter of%our%discontent
Your sample data implies that no two consecutive lines end with %.
In that case, you may use
sed '/%$/{N;s/\n//}' file.txt > output.txt
It works as follows:
/%$/ - finds all lines ending with %
{N;s/\n//} - a block:
N - adds a newline to the pattern space, then appends the next line of input to the pattern space
s/\n// - removes a newline in the current pattern space.
In portable sed that supports any number of continued lines:
parse.sed
:a # A goto label named 'a'
/%$/ { # When the last line ends in '%'
N # Append the next line
s/\n// # Remove new-line
ta # If new-line was replaced goto label 'a'
}
Run it like this:
sed -f parse.sed infile
Output when infile contains your input and the input from Ed Morton's answer:
abcabcabcabc%123456789123
abcabcabcabc%123456789123
now is the%winter of%our%discontent
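The rule behind all of these answers ("print the line, and only add a newline when it does not end in %") can also be sketched as a plain shell loop. This is just an illustration of the logic, not a replacement for the awk or sed versions, and it will be slower on large files. The name join_percent is made up for the example.

```shell
# Hypothetical helper (name chosen for this example); reads lines on stdin.
join_percent() {
  # "|| [ -n "$line" ]" keeps a last line that has no trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      *%) printf '%s' "$line" ;;    # ends in %: print without a newline
      *)  printf '%s\n' "$line" ;;  # anything else: print normally
    esac
  done
}

printf 'now is the%%\nwinter of%%\nour%%\ndiscontent\n' | join_percent
```

This prints now is the%winter of%our%discontent, the same as the awk output above.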

modify distribution of data inside a file

I need help with bash in order to modify a file.txt. I have names, each name in a line
for example
Peter
John
Markus
and I need them in the same row and with " before and at the end of each element of the vector.
"Peter" "John" "Markus"
Well, I can insert " when I have all elements in a row but I don't know how to modify the shape...all lines in a row.
array=( Peter John Markus )
number=${#array[@]}
for ((i=0;i<number;i++)); do
array[i]="\"${array[i]}"\"
echo "${array[i]}"
done
With awk
$ awk '{printf "\""$0"\" "} END{print""}' file
"Peter" "John" "Markus"
How it works:
printf "\""$0"\" "
With every new line of input, $0, this prints out a quote, the line itself, a quote and a space.
END{print""}
(optional) After we have read the last line of the file, this prints out a newline.
With sed and tr
$ sed 's/.*/"&"/' file | tr '\n' ' '
"Peter" "John" "Markus"
How it works:
s/.*/"&"/
This puts a quote before and after every line
tr '\n' ' '
This replaces newline characters with spaces so that all names appear on the same line.
With sed alone
$ sed ':a;$!{N;ba};s/^/"/; s/$/"/; s/\n/" "/g' file
"Peter" "John" "Markus"
How it works:
:a;$!{N;ba}
This reads the whole file into the pattern space.
s/^/"/
This adds a quote at the beginning of the file
s/$/"/
This adds a quote to the end of the file.
s/\n/" "/g
This replaces every newline with the three characters: quote-space-quote.
With bash
To make the bash script in the question print on one line, one can use echo -n in place of echo. In other words, replace:
echo "${array[i]}"
With:
echo -n "${array[i]} "
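If the names are still one per line in the file, the same idea can be sketched with a read loop and printf, which is generally more predictable than echo -n across shells. This is only a sketch: quote_names and the sep variable are names made up for the example.

```shell
# Hypothetical helper (name chosen for this example); reads names on stdin.
quote_names() {
  sep=''                               # nothing before the first name
  while IFS= read -r name; do
    printf '%s"%s"' "$sep" "$name"     # separator, then the quoted name
    sep=' '                            # a space before every later name
  done
  printf '\n'                          # finish the line
}

printf 'Peter\nJohn\nMarkus\n' | quote_names
```

This prints "Peter" "John" "Markus" with no trailing space.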
Quoting all words on one line
From the comments, suppose that our file has all the names on one line and we want to quote each individually. Use:
$ cat file2
Peter John Markus
$ sed -r 's/[[:alnum:]]+/"&"/g' file2
"Peter" "John" "Markus"
The above is for GNU sed. On OSX or other BSD system, try:
sed -E 's/[[:alnum:]]+/"&"/g' file2
Perl to the rescue:
perl -pe 'chomp; $_ = qq("$_" );chop if eof' < input
Explanation:
-p reads the input line by line and prints what's in $_
chomp removes a newline
$_ = qq("$_" ) puts a " before and "<Space> after the string.
chop if eof removes the trailing space.

Single-quote part of a line using sed or awk

Convert input text as follows, using sed or awk:
Input file:
113259740 QA Test in progress
219919630 UAT Test in progress
Expected output:
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
Using GNU sed or BSD (OSX) sed:
sed -E "s/^( *)([^ ]+)( +)(.*)$/\1\2\3'\4'/" file
^( *) captures all leading spaces, if any
([^ ]+) captures the 1st field (a run of non-space characters of at least length 1)
( +) captures the space(s) after the first field
(.*)$ matches the rest of the line, whatever it may be
\1\2\3'\4' replaces each (matching) input line with the captured leading spaces, followed by the 1st field, followed by the captured first inter-field space(s), followed by the single-quoted remainder of the input line. To discard the leading spaces, simply omit \1.
Note:
Matching the 1st field is more permissive than strictly required in that it matches any non-space sequence of characters, not just digits (as in the sample input data).
A generalized solution supporting other forms of whitespace (such as tabs), including after the 1st field, would look like this:
sed -E "s/^([[:space:]]*)([^[:space:]]+)([[:space:]]+)(.*)$/\1\2\3'\4'/" file
If your sed version doesn't support -E (or -r) to enable support for extended regexes, try the following, POSIX-compliant variant that uses a basic regex:
sed "s/^\( *\)\([^ ]\{1,\}\)\( \{1,\}\)\(.*\)$/\1\2\3'\4'/" file
You could try this GNU sed command also,
sed -r "s/^( +) ([0-9]+) (.*)$/\1 \2 '\3'/g" file
^( +) matches one or more spaces at the start of the line and stores them in group 1.
([0-9]+) - after those leading spaces and one more space, matches the run of digits that follows and stores it in group 2.
(.*)$ - after the space that follows the number, matches all remaining characters up to the end of the line and stores them in group 3.
The captured groups are then reassembled in the replacement part to produce the desired output.
Example:
$ cat ccc
113259740 QA Test in progress
219919630 UAT Test in progress
$ sed -r "s/^( +) ([0-9]+) (.*)$/\1 \2 '\3'/g" ccc
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
And in awk:
awk '{ printf "%s '"'"'", $1; for (i=2; i<NF; ++i) printf "%s ", $i; print $NF "'"'"'" }' file
Explanation:
printf "%s '"'"'", $1; Print the first field, followed by a space and a quote (')
for (i=2; i<NF; ++i) printf "%s ", $i; Print all of the following fields save the last one, each followed by a space.
print $NF "'"'"'" Print the last field followed by a quote(')
Note that '"'"'" is used to print just a single quote ('). An alternative is to specify the quote character on the command line as a variable:
awk -v qt="'" '{ printf "%s %s", $1, qt; for (i=2; i<NF; ++i) printf "%s ", $i; print $NF qt }' file
An awk solution:
awk -v q="'" '{ f1=$1; $1=""; print f1, q substr($0,2) q }' file
Lets awk split each input line into fields by whitespace (the default behavior).
-v q="'" defines awk variable q containing a single quote so as to make it easier to use a single quote inside the awk program, which is single-quoted as a whole.
f1=$1 saves the 1st field for later use.
$1="" effectively removes the first field from the input line, leaving $0, which originally referred to the whole input line, to contain a space followed by the rest of the line (strictly speaking, the fields are re-concatenated using the output-field separator OFS, which defaults to a space; since the 1st field is now empty, the resulting $0 starts with a single space followed by all remaining fields separated by a space each).
print f1, q substr($0,2) q then prints the saved 1st field, followed by a space (OFS) due to ,, followed by the remainder of the line (with the initial space stripped with substr()) enclosed in single quotes (q).
Note that this solution normalizes whitespace:
leading and trailing whitespace is removed
interior whitespace of length greater than 1 is compressed to a single space each.
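To see that normalization in action, here is a small sketch that wraps the same awk program in a function (quote_rest is a name made up for the example) and feeds it a line with extra leading, interior and trailing spaces.

```shell
# Hypothetical helper (name chosen for this example); reads lines on stdin.
quote_rest() {
  # Same program as above: save field 1, blank it (which rebuilds $0 with
  # single spaces), then print field 1 and the quoted remainder.
  awk -v q="'" '{ f1 = $1; $1 = ""; print f1, q substr($0, 2) q }'
}

printf '  113259740   QA  Test in progress \n' | quote_rest
```

The output is 113259740 'QA Test in progress', with the surrounding spaces gone and each interior run reduced to one space.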
Since the post is tagged with bash, here is an all Bash solution that preserves leading white space.
while IFS= read -r line; do
read -r f1 f2 <<<"$line"
echo "${line/$f1 $f2/$f1 $'\''$f2$'\''}"
done < file
Output:
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
Here is a simple way to do it with awk
awk '{sub($2,v"&");sub($NF,"&"v)}1' v=\' file
113259740 'QA Test in progress'
219919630 'UAT Test in progress'
It does not change the formatting of the file.
You can perform this by taking advantage of the word-splitting involved in most shells like bash. To avoid ending up with an extra single quote in the final result, you can just remove it with sed. This will also trim any extra spaces before i, between i and j and after j.
cat file.txt | sed "s/'//g" | while read i j; do echo "$i '$j'"; done
Here, we'll pipe the first word into variable i, and the rest in j.

Split a big txt file to do grep - unix

I work (unix, shell scripts) with txt files that contain millions of fields separated by pipes, with no \n or \r between records.
something like this:
field1a|field2a|field3a|field4a|field5a|field6a|[...]|field1d|field2d|field3d|field4d|field5d|field6d|[...]|field1m|field2m|field3m|field4m|field5m|field6m|[...]|field1z|field2z|field3z|field4z|field5z|field6z|
All text is in the same line.
The number of fields is fixed for every file.
(in this example I have field1=name; field2=surname; field3=mobile phone; field4=email; field5=office phone; field6=skype)
When I need to find a field (e.g. field2), a command like grep doesn't help, because everything is on the same line.
I think a good solution would be a script that inserts a "\n" after every 6 fields and then runs grep. Am I right? Thank you very much!
With awk :
$ cat a
field1a|field2a|field3a|field4a|field5a|field6a|field1d|field2d|field3d|field4d|field5d|field6d|field1m|field2m|field3m|field4m|field5m|field6m|field1z|field2z|field3z|field4z|field5z|field6z|
$ awk -F"|" '{for (i=1;i<NF;i=i+6) {for (j=0; j<6; j++) printf $(i+j)"|"; printf "\n"}}' a
field1a|field2a|field3a|field4a|field5a|field6a|
field1d|field2d|field3d|field4d|field5d|field6d|
field1m|field2m|field3m|field4m|field5m|field6m|
field1z|field2z|field3z|field4z|field5z|field6z|
Here you can easily set the length of line.
Hope this helps !
you can use sed to split the line in multiple lines:
sed 's/\(\([^|]*|\)\{6\}\)/\1\n/g' input.txt > output.txt
explanation:
we have to use heavy backslash-escaping of (){} which makes the code slightly unreadable.
but in short:
the term (([^|]*|){6}) (backslashes removed for readability) between s/ and /\1, will match:
[^|]* any character but '|', repeated multiple times
| followed by a '|'
the above is obviously one column and it is grouped together with enclosing parentheses ( and )
the entire group is repeated 6 times {6}
and this is again grouped together with enclosing parentheses ( and ), to form one full set
the rest of the term is easy to read:
replace the above (the entire dataset of 6 fields) with \1\n, the part between / and /g
\1 refers to the "first" group in the sed-expression (the "first" group that is started, so it's the entire dataset of 6 fields)
\n is the newline character
so replace the entire dataset of 6 fields by itself followed by a newline
and do so repeatedly (the trailing g)
you can use sed to convert every 6th | to a newline.
In my version of tcsh I can do:
sed 's/\(\([^|]\+|\)\{6\}\)/\1\n/g' filename
consider this (using groups of 2 instead of 6 to keep the example short):
> cat bla
a1|b2|c3|d4|
> sed 's/\(\([^|]\+|\)\{2\}\)/\1\n/g' bla
a1|b2|
c3|d4|
This is how the regex works:
[^|] is any non-| character.
[^|]\+ is a sequence of at least one non-| characters.
[^|]\+| is a sequence of at least one non-| characters followed by a |.
\([^|]\+|\) is a sequence of at least one non-| characters followed by a |, grouped together
\([^|]\+|\)\{6\} is 6 consecutive such groups.
\(\([^|]\+|\)\{6\}\) is 6 consecutive such groups, grouped together.
The replacement just takes this sequence of 6 groups and adds a newline to the end.
Here is how I would do it with awk
awk -v RS="|" '{printf $0 (NR%7?RS:"\n")}' file
field1a|field2a|field3a|field4a|field5a|field6a|[...]
field1d|field2d|field3d|field4d|field5d|field6d|[...]
field1m|field2m|field3m|field4m|field5m|field6m|[...]
field1z|field2z|field3z|field4z|field5z|field6z|
Just adjust the 7 in NR%7 to the number of fields you want on each line.
What about printing the lines on blocks of six?
$ awk 'BEGIN{FS=OFS="|"} {for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}}' file
field1a|field2a|field3a|field4a|field5a|field6a
field1d|field2d|field3d|field4d|field5d|field6d
field1m|field2m|field3m|field4m|field5m|field6m
field1z|field2z|field3z|field4z|field5z|field6z
Explanation
BEGIN{FS=OFS="|"} sets the input and output field separators to |.
{for (i=1; i<NF; i+=6) {print $(i), $(i+1), $(i+2), $(i+3), $(i+4), $(i+5)}} loops through the fields in blocks of 6 and prints six of them each time. The condition is i<NF rather than i<=NF because the trailing | leaves an empty last field, which would otherwise start one more (empty) block. Since print ends each block with a newline, you are done.
If you would rather have every field on its own line, turn each | into a newline with tr. For example, to get the 2nd field, just do:
tr \| \\n < input-file | sed -n 2p
To see which columns match a regex, do:
tr \| \\n < input-file | grep -n regex
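Putting the pieces together for the original goal (search one particular column and get the whole record back), a sketch in awk could look like the following. The function name find_by_field and its three arguments are invented for this example, and it assumes your awk can hold the whole line in memory, which is worth checking on files with millions of fields.

```shell
# Hypothetical helper (name and arguments chosen for this example).
# usage: find_by_field FIELDS_PER_RECORD COLUMN REGEX < file
find_by_field() {
  awk -F'|' -v n="$1" -v col="$2" -v re="$3" '{
    # walk the line in blocks of n fields; stop before an incomplete block
    for (i = 1; i + n - 1 <= NF; i += n)
      if ($(i + col - 1) ~ re) {        # does the chosen column match?
        rec = $i
        for (j = 1; j < n; j++)         # rebuild the record with | between fields
          rec = rec "|" $(i + j)
        print rec
      }
  }'
}

printf 'ann|lee|111|bob|ray|222|\n' | find_by_field 3 2 '^ray$'
```

With 3 fields per record and a match on the 2nd column, this prints bob|ray|222.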

Add a period at end of paragraph

I need a command to add a period (full stop) to the end of a paragraph. I have tried the following command:
sed '/ +$ / s/$/ ./' $FILENAME
but it does not work!!
awk -v RS="" -v ORS=".\n\n" 1 file
This redefines the input record separator to be empty, so that awk reads each blank-line-separated paragraph as a single record. It sets the output record separator to be a dot and 2 newlines. The actual awk program, 1, simply prints each record.
One side-effect is that any consecutive blank lines will be collapsed into a single blank line.
OK, sheesh
awk -v RS="" -v ORS="\n\n" '{sub(/\.?$/,".")} 1'
In action: (piping through cat -n just to point out the newlines)
echo -e "a.\n\nb\nc\n\n\nd" |
awk -v RS="" -v ORS="\n\n" '{sub(/\.?$/,".")} 1' |
cat -n
1 a.
2
3 b
4 c.
5
6 d.
7
There's an extra newline at the end, due to the ORS.
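If that extra trailing newline is a problem, one way around it is to print the blank line before every paragraph except the first, instead of after each one. A sketch of that idea follows; add_periods is a name made up for the example, and like the version above it still collapses runs of blank lines into one.

```shell
# Hypothetical helper (name chosen for this example); reads text on stdin.
add_periods() {
  awk -v RS='' '
    {
      sub(/\.?$/, ".")                               # make the paragraph end in one dot
      printf "%s%s", (NR > 1 ? "\n\n" : ""), $0      # blank line goes between paragraphs
    }
    END { if (NR) print "" }                         # single final newline
  '
}

printf 'a.\n\nb\nc\n\n\nd\n' | add_periods | cat -n
```

Piped through cat -n, this should show six numbered lines and no empty line at the end.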
And, as a bonus, here's a bit of Perl that preserves the inter-paragraph spacing:
echo -e "a.\n\nb\nc\n\n\nd" | perl -0777 -pe 's/\.?(\n(\n+|$))/.$1/g' | cat -n
1 a.
2
3 b
4 c.
5
6
7 d.
Not very good, but it seems to work...
$ cat input
This is a paragraph with some text. Some random text that is not really important.
This is another paragraph with some text.
However this sentence is still in the same paragraph
$ tr '\n' '#' < input | sed 's/\([^.]\)##/\1.##/g' | tr '#' '\n'
This is a paragraph with some text. Some random text that is not really important.
This is another paragraph with some text.
However this sentence is still in the same paragraph.
Accumulate 'paragraphs' in the hold space. Keep accumulating as long as
the input line contains any non-space character(s).
When you get a blank/empty line, assume you have an accumulated paragraph. Swap the current (blank) line with the hold space. Replace the last non-space character in the pattern space (which is now the "paragraph" you were accumulating) with itself followed by a dot, unless that character is a dot. Print the result.
I think this does it:
$ cat test
this is a test line. one-line para
this is a test line. one-line para. with period.
this is a
two line para-
graph with dot.
this is a
two-line paragraph
with no dot
also works on last
line of file
$ sed -n \
-e '/^[[:space:]]*$/{x;s/\([^.[:space:]][[:space:]]*\)$/\1./;p;n;}' \
-e '/^[[:space:]]*[^[:space:]]/H' \
test
this is a test line. one-line para.
this is a test line. one-line para. with period.
this is a
two line para-
graph with dot.
this is a
two-line paragraph
with no dot.
Using sed.
sed ':loop;$!{N;b loop};s/[^\.]$/&./;s/\([^\.]\)\(\n[ \t]*\n\)/\1.\2/g' file
explanation
:loop;$!{N;b loop} will save all the lines in pattern space delimited by newline.
s/[^.]$/&./ will add . if the last paragraph doesn't end with a dot.
s/\([^\.]\)\(\n[ \t]*\n\)/\1.\2/g will add a dot before \n \n, which is identified as a paragraph break.
This should work:
sed "s/[[:alpha:]]\+[^\.]$/&./" "$FILENAME"
A pure sed solution using the hold space to save all lines from a paragraph and append a period just before printing:
sed -ne '
## Append current line to "hold space".
H
## When found an empty line, get content of "hold space", remove leading
## newline added by "H" command, append a period at the end and print.
## Also, clean "hold space" to save following paragraph.
/^$/ { g; s/^\n//; s/\(.*\)\(\n\)/\1.\2/; p; s/^.*$//; h; b }
## Last line is a bit special, it has no following blank line but it is also
## an end of paragraph. It is similar to previous case but simpler.
$ { x; s/^\n//; s/$/./; p }
' infile
Assuming an infile with content:
one
two
three
four
five
six
It yields:
one
two.
three.
four
five
six.

Resources