POSIX: abcdef to ab bc cd de ef - shell

Using POSIX sed or awk, I would like to duplicate every second character in every pair of neighboring characters and list every newly-formed pair on a new line.
example.txt:
abcd 10001.
Expected result:
ab
bc
cd
d
1
10
00
00
01
1.
So far, this is what I have (N.B. omit "--posix" if on macOS). For some reason, adding a literal newline character before \2 does not produce the expected result. Removing the first group and using \1 has the same effect. What am I missing?
sed --posix -E -e 's/(.)(.)/&\2\
/g' example.txt
abb
cdd
100
000
1..

Try:
$ echo "abcd 10001." | awk '{for(i=1;i<length($0);i++) print substr($0,i,2)}'
ab
bc
cd
d
1
10
00
00
01
1.

You may use
sed --posix -e 's/./&\
&/g' example.txt | sed '1d;$d'
The first sed command finds every char in the string and replaces with the same char, then a newline and then the same char again. Since it replaces first and last chars, the first and last resulting lines must be removed, which is achieved with sed '1d;$d'.
Had sed supported lookarounds, one could have used (?!^).(?!$) (any char but not at the start or end of string) and the last sed command would not have been necessary, but it is not possible with sed. You could use it in perl though, perl -pe 's/(?!^).(?!$)/$&\n$&/g' example.txt (see demo online, $& in the RHS is the same as & placeholder in sed, the whole match value).
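A minimal sketch of the same duplicate-and-trim idea, done inside each line rather than across the whole output, assuming the input may hold more than one line. The function name pairs_per_line is made up for illustration. It relies only on forms POSIX sed specifies: a backslash followed by a literal newline in the replacement, and \n in the pattern.

```shell
# Hypothetical helper: print every pair of neighbouring characters, line by line.
pairs_per_line() {
  # 1) turn each character X into X<newline>X
  # 2) drop the lone first character and the newline after it
  # 3) drop the lone last character and the newline before it
  sed -e 's/./&\
&/g' -e 's/^.\n//' -e 's/\n.$//' "$@"
}

printf 'abcd 10001.\n' | pairs_per_line
```

A line that holds a single character is printed unchanged by this sketch.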

With GNU awk, you could try the following. It was written and tested with the samples shown, and also run at this link:
https://ideone.com/qahp0S
awk '
BEGIN{
FS=""
}
{
for(i=1;i<=(NF-1);i++){
print $i$(i+1)
}
}
' Input_file
Explanation: the BEGIN block sets the field separator to the empty string, which makes GNU awk treat every character of a line as a separate field. The main block then loops from the first field to the second-to-last one and, on each iteration, prints the current field followed by the next one.

Using same routine, it can be done in bash itself:
s='abcd 10001.'
for((i=0; i<${#s}-1; i++)); do echo "${s:i:2}"; done
ab
bc
cd
d
1
10
00
00
01
1.

Just for fun, a single sed consisting of 3 substitutions:
$ echo "abcd 10001." | sed 's/./&&/g;s/\(^.\|.$\)//g;s/../&\n/g'
The first part duplicates all characters, the second part removes the first and last character, the third part adds a newline character after each character-pair.
If you want to be POSIX compliant you have to do:
$ echo "abcd 10001." | sed -e 's/./&&/g' -e 's/^.//g' -e 's/.$//g' -e 's/../&\n/g'
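Here we had to split the second substitution in two, because \(^.\|.$\) relies on \|, which is a GNU extension: POSIX basic regular expressions (BRE), the only kind POSIX sed accepts, have no alternation operator.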
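The last substitution above still writes \n in the replacement, whose meaning POSIX leaves unspecified. The sketch below uses a backslash followed by a literal newline instead, and also trims the newline left behind after the final pair. The name pairs_dup is hypothetical.

```shell
# Hypothetical helper: duplicate, trim both ends, then break after every pair.
pairs_dup() {
  sed -e 's/./&&/g' \
      -e 's/^.//' \
      -e 's/.$//' \
      -e 's/../&\
/g' \
      -e 's/\n$//' "$@"
}

printf 'abcd 10001.\n' | pairs_dup
```

Without the final s/\n$// every input line would be followed by an empty output line.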

This might work for you (GNU sed):
sed 's/.\(.\)/&\n\1/;/../P;D' file
Replace the first two characters by the first two characters, a newline and the second character.
Print the first line if it is two characters long, delete the first line and repeat.
Alternative, more long winded:
sed -E ':a;s/^(([^\n]{2}\n)*[^\n])([^\n])([^\n])/\1\3\n\3\4/;ta' file
Or, with no hardcoded new line:
sed -E '/.../{G;s/^(.(.))(.*)(.)/\1\4\2\3/;P;D}' file
Lastly:
sed 's/./&\n&/g;s/^..\|..$//g' file

Process substitution isn't specified by POSIX. The POSIX requirement was only specified for awk and sed, so maybe the next solution is acceptable:
paste -d '\0' <(echo; fold -w1 example.txt) <(fold -w1 example.txt) | grep ..
or
while IFS= read -r -n1 ch; do
printf "%s\n%s" "${ch}" "${ch}"
done < example.txt | grep ..
or
sed 's/./&&/g;s/.//' example.txt | grep -o ..

Related

unix sed substitute nth occurrence misfunction?

Let's say I have a string which contains multiple occurrences of the letter Z.
For example: aaZbbZccZ.
I want to print parts of that string, each time until the next occurrence of Z:
aaZ
aaZbbZ
aaZbbZccZ
So I tried using unix sed for this, with the command sed s/Z.*/Z/i, where i is an index running from 1 to the number of Z's in the string. As far as my understanding of sed goes, this should delete everything that comes after the i-th Z. In practice this only works for i=1, as in sed s/Z.*/Z/, but not as I increment i: sed s/Z.*/Z/2, for example, just prints the entire original string. It feels as if I am missing something about how sed works, since according to multiple manuals it should work.
Edit: for example, when applying sed s/Z.*/Z/2 to the string aaZbbZccZ, I expect to get aaZbbZ, since everything after the 2nd occurrence of Z gets deleted.
The sed commands below come close to what you are looking for, except that the substitution also removes the last Z, so a second substitution appends it again.
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//1g;s/$/Z/'
aaZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//2g;s/$/Z/'
aaZbbZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//3g;s/$/Z/'
aaZbbZccZ
$ echo aaZbbZccZdd | sed -e 's/Z[^Z]*//4g;s/$/Z/'
aaZbbZccZddZ
Edit:
Modified according to Aaron's suggestion.
Edit2:
If you don't know how many Z's there are in the string, it is safer to use the command below. Otherwise an additional Z is added at the end.
-r - enables extended regular expressions
-e - separates sed operations, the same as ; but easier to read in my opinion.
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//1g' -e 's/([^Z])$/\1Z/'
aaZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//2g' -e 's/([^Z])$/\1Z/'
aaZbbZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//3g' -e 's/([^Z])$/\1Z/'
aaZbbZccZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//4g' -e 's/([^Z])$/\1Z/'
aaZbbZccZddZ
$ echo aaZbbZccZddZ | sed -r -e 's/Z[^Z]*//5g' -e 's/([^Z])$/\1Z/'
aaZbbZccZddZ
This should do what you expect (see comments) unless your string can contain line breaks:
# -n will prevent default printing
echo 'aaZbbZccZ' | sed -n '{
# Add a line break after each 'Z'
s/Z/Z\
/g
# Print it and consume it in the next sed command
p
}' | sed -n '{
# Add only the first line to the hold buffer (you can remove it if you don't mind to see first blank line)
1 {
h
}
# As for the rest of the lines
2,$ {
# Replace the hold buffer with the pattern space
x
# Remove line breaks
s/\n//
# Print the result
p
# Get the hold buffer again (matched line)
x
# And append it with new line to the hold buffer
H
}
}'
The idea is to break the string into multiple lines (each terminated by Z), which are then processed one by one by the second sed command.
In the second sed we use the hold buffer to remember the previous lines: we print the aggregated result, append the new line, and each time remove the line break we previously added.
And the output is
aaZ
aaZbbZ
aaZbbZccZ
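The same split-and-accumulate idea can also be sketched in awk, which avoids the two-stage pipeline and the hold buffer. This is only an illustration and not taken from the answers here; the function name prefixes_upto_z is made up, and the delimiter Z is hard-coded.

```shell
# Hypothetical helper: for each input line, print the prefix ending at each Z.
prefixes_upto_z() {
  awk '{
    out = ""      # prefix accumulated so far
    rest = $0     # part of the line not yet consumed
    while ((i = index(rest, "Z")) > 0) {
      out = out substr(rest, 1, i)   # extend the prefix through this Z
      print out
      rest = substr(rest, i + 1)
    }
  }' "$@"
}

printf 'aaZbbZccZ\n' | prefixes_upto_z
```

Text after the last Z is never printed, and a line without any Z produces no output.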
This might work for you (GNU sed):
sed -n 's/Z/&\n/g;:a;/\n/P;s/\n\(.*Z\)/\1/;ta' file
Use sed's grep-like option -n to explicitly print content. Append a newline after each Z. If there were no substitutions then there is nothing to be done. Print up to the first newline, remove the first newline if the following characters contain a Z, and repeat.

Replacing/removing excess white space between columns in a file

I am trying to parse a file with similar contents:
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
I want the out file to be tab delimited:
I am a string\t12831928
I am another string\t41327318
A set of strings\t39842938
Another string\t3242342
I have tried the following:
sed 's/\s+/\t/g' filename > outfile
I have also tried cut, and awk.
Just use awk:
$ awk -F'  +' -v OFS='\t' '{sub(/ +$/,""); $1=$1}1' file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Breakdown:
-F'  +' # tell awk that input fields (FS) are separated by 2 or more blanks
-v OFS='\t' # tell awk that output fields are separated by tabs
'{sub(/ +$/,""); # remove all trailing blank spaces from the current record (line)
$1=$1} # recompile the current record (line) replacing FSs by OFSs
1' # idiomatic: any true condition invokes the default action of "print"
I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins.
The difficulty comes in the varying number of words per-line. While you can handle this with awk, a simple script reading each word in a line into an array and then tab-delimiting the last word in each line will work as well:
#!/bin/bash
fn="${1:-/dev/stdin}"
while read -r line || test -n "$line"; do
arr=( $(echo "$line") )
nword=${#arr[@]}
for ((i = 0; i < nword - 1; i++)); do
test "$i" -eq '0' && word="${arr[i]}" || word=" ${arr[i]}"
printf "%s" "$word"
done
printf "\t%s\n" "${arr[i]}"
done < "$fn"
Example Use/Output
(using your input file)
$ bash rfmttab.sh < dat/tabfile.txt
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Each number is tab-delimited from the rest of the string. Look it over and let me know if you have any questions.
sed -E 's/[ ][ ]+/\\t/g' filename > outfile
NOTE: the [ ] is openBracket Space closeBracket
-E for extended regular expression support.
The double brackets [ ][ ]+ is to only substitute tabs for more than 1 consecutive space.
Tested on MacOS and Ubuntu versions of sed.
Your input has spaces at the end of each line, which makes things a little more difficult than without. This sed command would replace the spaces before that last column with a tab:
$ sed 's/[[:blank:]]*\([^[:blank:]]*[[:blank:]]*\)$/\t\1/' infile | cat -A
I am a string^I12831928 $
I am another string^I41327318 $
A set of strings^I39842938 $
Another string^I3242342 $
This matches – anchored at the end of the line – blanks, non-blanks and again blanks, zero or more of each. The last column and the optional blanks after it are captured.
The blanks before the last column are then replaced by a single tab, and the rest stays the same – see output piped to cat -A to show explicit line endings and ^I for tab characters.
If there are no blanks at the end of each line, this simplifies to
sed 's/[[:blank:]]*\([^[:blank:]]*\)$/\t\1/' infile
Notice that some seds, notably BSD sed as found in MacOS, can't use \t for tab in a substitution. In that case, you have to use either '$'\t'' or '"$(printf '\t')"' instead.
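A small sketch of the printf workaround mentioned above, assuming a two-column layout where the columns are separated by at least two spaces. The variable tab and the function spaces_to_tab are names invented for this example. Only POSIX BRE syntax is used, so it should not depend on \t support in sed.

```shell
# Store a literal tab once; command substitution only strips trailing newlines.
tab=$(printf '\t')

# Hypothetical helper: strip trailing spaces, then turn the first run of
# two or more spaces on each line into a single tab.
spaces_to_tab() {
  sed -e 's/[ ]*$//' -e "s/[ ][ ][ ]*/$tab/" "$@"
}

printf 'I am a string   12831928  \n' | spaces_to_tab
```

Because the second substitution has no g flag, only the first gap is converted, which matches a file with exactly two columns.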
another approach, with gnu sed and rev
$ rev file | sed -r 's/ +/\t/1' | rev
You have trailing spaces on each line. So you can do two sed expressions in one go like so:
$ sed -E -e 's/ +$//' -e $'s/  +/\t/' /tmp/file
I am a string 12831928
I am another string 41327318
A set of strings 39842938
Another string 3242342
Note the $'s/  +/\t/': This tells bash to replace \t with an actual tab character prior to invoking sed.
To show that these deletions and \t insertions are in the right place you can do:
$ sed -E -e 's/ +$/X/' -e $'s/  +/Y/' /tmp/file
I am a stringY12831928X
I am another stringY41327318X
A set of stringsY39842938X
Another stringY3242342X
Simple and without invisible semantic characters in the code:
perl -lpe 's/\s+$//; s/\s\s+/\t/' filename
Explanation:
Options:
-l: remove LF during processing (in this case)
-p: loop over records (like awk) and print
-e: code follows
Code:
remove trailing whitespace
change two or more whitespace to tab
Tested on OP data. The trailing spaces are removed for consistency.

how to delete a large number of lines from a file

I have a file with ~700,000 lines and I would like to remove a bunch of specific lines (~30,000) using bash scripting or another method.
I know I can remove lines using sed:
sed -i.bak -e '1d;34d;45d;678d' myfile.txt # an example
I have the lines in a text file but I don't know if I can use it as input to sed, maybe perl??
Thanks
A few options:
sed -f <(sed 's/$/d/' lines_file) data_file
awk 'NR==FNR {del[$1]; next} !(FNR in del)' lines_file data_file
perl -MPath::Class -e '
%del = map {$_ => 1} file("lines_file")->slurp(chomp => 1);
$f = file("data_file")->openr();
while (<$f>) {
print unless $del{$.};
}
'
perl -ne'
BEGIN{ local @ARGV =pop; @h{<>} =() }
exists $h{"$.\n"} or print;
' myfile.txt lines
You can remove the lines using a sed script file.
First make a list of the lines to remove (one line number per line).
$ cat lines
1
34
45
678
Convert this file into sed commands.
$ sed -e 's|$| d|' lines >lines.sed
$ cat lines.sed
1 d
34 d
45 d
678 d
Now pass this script file to sed with the -f option.
$ sed -i.bak -f lines.sed file_with_70k_lines
This will remove the lines.
If you can create a text file of the format
1d
34d
45d
678d
then you can run something like
sed -i.bak -f scriptfile datafile
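The script-file approach can be wrapped in a small function. This is a sketch under assumptions: delete_lines is a hypothetical name, mktemp is widely available but not specified by POSIX, and the result goes to standard output instead of being edited in place. Since sed addresses refer to input line numbers, the order of the numbers in the list does not matter here.

```shell
# Hypothetical helper: delete_lines LINES_FILE DATA_FILE
# Prints DATA_FILE without the line numbers listed in LINES_FILE.
delete_lines() {
  lines_file=$1
  data_file=$2
  script=$(mktemp) || return 1
  # Keep only purely numeric entries and turn e.g. "34" into "34d".
  sed -n 's/^[0-9][0-9]*$/&d/p' "$lines_file" > "$script"
  sed -f "$script" "$data_file"
  status=$?
  rm -f "$script"
  return "$status"
}
```

For example, delete_lines lines.txt myfile.txt > myfile.new writes the filtered copy to a new file and leaves the original untouched.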
You can use a genuine editor for that, and ed is the standard editor.
I'm assuming your lines are in a file lines.txt, one number per line, e.g.,
1
34
45
678
Then (with a blatant bashism):
ed -s file.txt < <(sed -n '/^[[:digit:]]\+$/p' lines.txt | sort -nr | sed 's/$/d/'; printf '%s\n' w q)
A first sed selects only the numbers from file lines.txt (just in case).
There's something quite special to take into account here: that when you delete line 1, then line 34 in the original file becomes line 33. So it's better to remove the lines from the end: start with 678, then 45, etc. that's why we're using sort -nr (to sort the numbers in reverse order). A final sed appends d (ed's delete command) to the numbers.
Then we issue the w (write) and q (quit) commands.
Note that this overwrites the original file!

using sed how to put space after numbers in a big string

This question was asked in an interview. I could not answer! So getting some help here to understand the logic. i.e. how to put space between a number string and character string.
Given the string "1abc2abcd3efghi10z11jkl100pqrs", what command would you use to get the following result?
"1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs"
Thanks in advance.
Here is another -- yet simple -- way to think about it:
echo "1abc2abcd3efghi10z11jkl100pqrs" | \
sed -r 's/([0-9])([a-zA-Z])/\1 \2/g; s/([a-zA-Z])([0-9])/\1 \2/g'
add a whitespace between a digit-letter string & letter-digit string
() is to capture the group and \1 and \2 is to return the first and second captured group
With GNU sed:
$ echo "1abc2abcd3efghi10z11jkl100pqrs" | sed -e 's/[0-9]\+/ & /g' -e 's/^ \| $//'
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
With awk:
$ echo "1abc2abcd3efghi10z11jkl100pqrs" | awk '{gsub(/[0-9]+/," & ",$0); $1=$1}1'
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
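gsub will substitute every number with the same number surrounded by a space before and after it.
$1=$1 will recompute the entire line, joining the fields with OFS (a single space by default), which also drops the leading and trailing spaces.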
I would have chosen sed over awk:
echo "1abc2abcd3efghi10z11jkl100pqrs" | sed 's/[0-9]\+/ & /g; s/^[ ]//; s/[ ]$//'
It surrounds each run of digits with spaces and afterwards removes the (possibly) leading and trailing ones.
It yields:
1 abc 2 abcd 3 efghi 10 z 11 jkl 100 pqrs
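The \+ used above is a GNU extension in basic regular expressions. A sketch of the same idea restricted to POSIX BRE syntax, where [0-9][0-9]* stands for one or more digits; the function name space_digits is hypothetical.

```shell
# Hypothetical helper: surround each run of digits with spaces,
# then remove the single space that may appear at either end.
space_digits() {
  sed -e 's/[0-9][0-9]*/ & /g' -e 's/^ //' -e 's/ $//' "$@"
}

printf '1abc2abcd3efghi10z11jkl100pqrs\n' | space_digits
```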
echo 1abc2abcd3efghi10z11jkl100pqrs | \
sed -r -e 's/([[:digit:]]+)/ \1 /g' -e 's/^ *//g' -e 's/ *$//g'
Take the expression -e 's/([[:digit:]]+)/ \1 /g' first.
The parentheses around [[:digit:]]+ 'capture' each sequence of one or more digits. Since it's the first capture group, it's referenced in the substitution by \1 (then there's the space before and after:  \1 ).
The g tells sed to perform this substitution 'globally' on the input.
The -r before the expression tells sed to use extended regular expressions.
The other two 'expressions' (each expression has -e before it to show that it's an expression):
-e 's/^ *//g' will remove leading whitespace, and -e 's/ *$//g' will remove trailing whitespace.
Using perl:
echo 1abc2abcd3efghi10z11jkl100pqrs | perl -F'(\d+)' -ane \
'$F[0] and print "@F\n" or print "@F[1..$#F]"'
Some explanation:
-an together tells Perl to split each line of input and put the resulting fields into the array @F.
-F specifies a delimiter of one or more digits to use with -an to split the input. The parentheses cause the delimiters themselves to be stored in the array, not just the strings they separate.
-e specifies the code to run after each line is read. We simply want to print the contents of @F, with the default field separator (space) used to separate elements of the array. The and...or combination is used to ignore the first field if it is empty, as it will be if the input line starts with a delimiter.

sed help - convert a string of form ABC_DEF_GHI to AbcDefGhi

How can I convert a string of the form ABC_DEF_GHI to AbcDefGhi using any one-line command such as sed?
Here's a one-liner using gawk:
echo ABC_DEF_GHI | gawk 'function cap(s){return toupper(substr(s,1,1))tolower(substr(s,2))}{n=split($0,x,"_");for(i=1;i<=n;i++)o=o cap(x[i]); print o}'
AbcDefGhi
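The one-liner above never resets o, so with several input lines each result would be appended to the previous one. A sketch that resets the accumulator for every line and uses only POSIX awk functions; to_camel is a made-up name.

```shell
# Hypothetical helper: ABC_DEF_GHI -> AbcDefGhi, one result per input line.
to_camel() {
  awk -F_ '{
    out = ""                                   # start fresh for every line
    for (i = 1; i <= NF; i++)
      out = out toupper(substr($i, 1, 1)) tolower(substr($i, 2))
    print out
  }' "$@"
}

printf 'ABC_DEF_GHI\nFOO_BAR\n' | to_camel
```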
Optimized awk 1-liner
awk -v RS=_ '{printf "%s%s", substr($0,1,1), tolower(substr($0,2))}'
Optimized sed 1-liner
sed 's/\(.\)\(..\)_\(.\)\(..\)_\(.\)\(..\)/\1\L\2\U\3\L\4\U\5\L\6/'
Edit:
Here's a gawk version:
gawk -F_ '{for (i=1;i<=NF;i++) printf "%s%s",substr($i,1,1),tolower(substr($i,2)); printf "\n"}'
Original:
Using sed for this is pretty scary:
sed -r 'h;s/(^|_)./\n/g;y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;x;s/((^|_)(.))[^_]*/\3\n/g;G;:a;s/(^.*)([^\n])\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;ta;s/\n//g'
Here it is broken down:
# make a copy in hold space
h;
# replace all the characters which will remain upper case with newlines
s/(^|_)./\n/g;
# lowercase all the remaining characters
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;
# swap the copy into pattern space and the lowercase characters into hold space
x;
# discard all but the characters which will remain upper case
s/((^|_)(.))[^_]*/\3\n/g;
# append the lower case characters to the end of pattern space
G;
# top of the loop
:a;
# shuffle the lower case characters back into their proper positions (see below)
s/(^.*)([^\n])\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;
# if a replacement was made, branch to the top of the loop
ta;
# remove all the newlines
s/\n//g
Here's how the shuffle works:
At the time it starts, this is what pattern space looks like:
A
D
G
bc
ef
hi
The shuffle loop picks up the string that's between the last newline and the end and moves it to the position before the two consecutive newlines (actually three) and moves the extra newline so it's before the character that it previously followed.
After the first step through the loop, this is what pattern space looks like:
A
D
Ghi
bc
ef
And processing proceeds similarly until there's nothing before the extra newline at which point the match fails and the loop branch is not taken.
If you want to title case a sequence of words separated by spaces, the script would be similar:
$ echo 'BEST MOVIE THIS YEAR' | sed -r 'h;s/(^| )./\n/g;y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/;x;s/((^| ).)[^ ]*/\1\n/g;G;:a;s/(^.*)( [^\n]*)\n\n(.*)\n([^\n]*)$/\1\n\2\4\3/;ta;s/^([^\n]*)(.*)\n([^\n]*)$/\1\3\2/;s/\n//g'
Best Movie This Year
One liner using perl:
$ echo 'ABC_DEF_GHI' | perl -npe 's/([A-Z])([^_]+)_?/$1\L$2\E/g;'
AbcDefGhi
This might work for you:
echo "ABC_DEF_GHI" |
sed 'h;s/\(.\)[^_]*\(_\|$\)/\1/g;x;y/'$(printf "%s" {A..Z} / {a..z})'/;G;:a;s/\(\(^[a-z]\)\|_\([a-z]\)\)\([^\n]*\n\)\(.\)/\5\4/;ta;s/\n//'
AbcDefGhi
Or using GNU sed:
echo "ABC_DEF_GHI" | sed 's/\([A-Z]\)\([^_]*\)\(_\|$\)/\1\L\2/g'
AbcDefGhi
Less scary sed version with tr (note that this lowercases every letter, so it prints abcdefghi rather than AbcDefGhi):
echo ABC_DEF_GHI | sed -e 's/_//g' - | tr 'A-Z' 'a-z'
