Portable sed way to find longest common prefix of strings - shell

The sed solutions in Longest common prefix of two strings in bash only work with GNU sed. I'd like a more portable sed solution (e.g. for BSD/macOS sed, Busybox sed).

The following solutions are tested with GNU sed, macOS (10.15) sed and busybox (v1.29) sed.
$ printf '%s\n' a ab abc | sed -e '$q;N;s/^\(.*\).*\n\1.*$/\1/;h;G;D'
a
$ printf '%s\n' a b c | sed -e '$q;N;s/^\(.*\).*\n\1.*$/\1/;h;G;D'
$
A more efficient variant for many strings, especially when there is no common prefix at all (note the ..* part, which differs from the previous solution: it requires a non-empty prefix, so the script quits as soon as two lines share nothing):
$ printf '%s\n' a ab abc | sed -ne :L -e '$p;N;s/^\(..*\).*\n\1.*/\1/;tL' -e q
a
$ printf '%s\n' a b c | sed -ne :L -e '$p;N;s/^\(..*\).*\n\1.*/\1/;tL' -e q
$
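For repeated use, the second command can be wrapped in a small shell function. This is only a sketch: the name lcp is made up here, and it assumes that no argument contains a newline, since the sed script compares whole lines.

```shell
# lcp: print the longest common prefix of its arguments (hypothetical helper).
# Assumes no argument contains a newline.
lcp() {
    printf '%s\n' "$@" |
        sed -ne :L -e '$p;N;s/^\(..*\).*\n\1.*/\1/;tL' -e q
}

lcp foobar foobaz fooqux   # prints: foo
lcp a b c                  # prints nothing: no common prefix
```

Because the back-reference matches the captured text literally, regex metacharacters in the strings themselves should not need escaping.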
Regarding $q in the first solution
According to GNU sed manual (info sed):
N command on the last line
Most versions of sed exit without printing anything when the N command is issued on the last line of a file. GNU sed prints pattern space before exiting unless of course the -n command switch has been specified.
Note that I did not use sed -E because macOS sed's -E does not support \N back-reference in s/pattern/replace/ command's pattern string.
$ # with GNU sed:
$ echo foofoo | gsed -E 's/(foo)\1/bar/'
bar
$
$ # with macOS's own sed:
$ echo foofoo | sed -E 's/(foo)\1/bar/'
foofoo
$
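If a script has to run on several systems, it may be worth probing the local sed before relying on -E with back-references. The snippet below is a rough sketch of such a check; the variable names bre_result and ere_backref are invented for the example.

```shell
# POSIX specifies back-references for basic regular expressions (BRE),
# so this form should behave the same everywhere.
bre_result=$(echo foofoo | sed 's/\(foo\)\1/bar/')

# Back-references in extended regular expressions (-E) are an extension;
# probe whether the local sed honours them.
if [ "$(echo foofoo | sed -E 's/(foo)\1/bar/' 2>/dev/null)" = bar ]; then
    ere_backref=yes
else
    ere_backref=no
fi

echo "BRE result: $bre_result"
echo "ERE back-references supported: $ere_backref"
```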
UPDATE (2021-04-26):
Found this in another answer :
sed -e '1{h;d;}' -e 'G;s/\(.*\).*\n\1.*/\1/;h;$!d'
Note that it does not work when there's only one line. This can be fixed easily by dropping the d from the 1{h;d;} block:
sed -e '1h;G;s/^\(.*\).*\n\1.*/\1/;h;$!d'


Error on sed script - extra characters after command

I've been trying to create a sed script that reads a list of phone numbers and only prints ones that match the following schemes:
+1(212)xxx-xxxx
1(212)xxx-xxxx
I'm an absolute beginner, but I tried to write a sed script that would print this for me using the -n -r flags (the contents of which are as follows):
/\+1\(212\)[0-9]{3}-[0-9]{4}/p
/1\(212\)[0-9]{3}-[0-9]{4}/p
If I run this in sed directly, it works fine (i.e. sed -n -r '/\+1\(212\)[0-9]{3}-[0-9]{4}/p' sample.txt prints matching lines as expected). This does NOT work in the sed script I wrote; instead sed says:
sed: -e expression #1, char 2: extra characters after command
I could not find a good solution, this error seems to have so many causes and none of the answers I found apply easily here.
EDIT: I ran it with sed -n -r script.sed sample.txt
sed cannot automatically determine whether you intended a parameter to be a script file or a script string.
To run a sed script from a file, you have to use -f:
$ echo 's/hello/goodbye/g' > demo.sed
$ echo "hello world" | sed -f demo.sed
goodbye world
If you neglect the -f, sed will try to run the filename as a command, and the delete command is not happy to have emo.sed after it:
$ echo "hello world" | sed demo.sed
sed: -e expression #1, char 2: extra characters after command
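Applied to the question, the invocation might look like the sketch below. It makes a few assumptions: the two patterns are merged into one with an optional leading +, the sample numbers are made up, and -E is used in place of -r because both GNU sed and BSD sed accept it.

```shell
# Work in a throwaway directory so no existing file is touched.
tmp=$(mktemp -d)

# The script file holds only sed commands, with no shell quoting around them.
cat > "$tmp/script.sed" <<'EOF'
/^\+?1\(212\)[0-9]{3}-[0-9]{4}$/p
EOF

# Made-up sample numbers.
printf '%s\n' '+1(212)555-0100' '1(212)555-0101' '1(646)555-0102' > "$tmp/sample.txt"

# -f names the script file; the remaining argument is the input file.
matches=$(sed -n -E -f "$tmp/script.sed" "$tmp/sample.txt")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Only the two 212 numbers should be printed.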
Of the various unix tools out there, two use BRE as their default regex dialect. Those two tools are sed and grep.
In most operating systems, you can use egrep or grep -E to tell that tool to use ERE as its dialect. A smaller (but still significant) number of sed implementations will accept a -E option to use ERE.
In BRE mode, however, you can still create grouped atoms, and you do it by escaping the parentheses. That's why your initial expression is failing: the parentheses are NOT special by default in BRE, but you're MAKING THEM SPECIAL by preceding the characters with backslashes.
The other thing to keep in mind is that if you want sed to execute a script from a command line argument, you should use the -e option.
So:
$ cat ph.txt
+1(212)xxx-xxxx
1(212)xxx-xxxx
212-xxx-xxxx
$ grep '^+\{0,1\}1([0-9]\{3\})' ph.txt
+1(212)xxx-xxxx
1(212)xxx-xxxx
$ egrep '^[+]?1\([0-9]{3}\)' ph.txt
+1(212)xxx-xxxx
1(212)xxx-xxxx
$ sed -n -e '/^+\{0,1\}1([0-9]\{3\})/p' ph.txt
+1(212)xxx-xxxx
1(212)xxx-xxxx
$ sed -E -n -e '/^[+]?1\([0-9]{3}\)/p' ph.txt
+1(212)xxx-xxxx
1(212)xxx-xxxx
Depending on your OS, you may be able to get a full list of how this works from man re_format.

Adding double quotes to beginning, end and around commas in bash variable

I have a shell script that accepts a parameter that is comma delimited,
-s 1234,1244,1567
That is passed to a curl PUT json field. Json needs the values in a "1234","1244","1567" format.
Currently, I am passing the parameter with the quotes already in it:
-s "\"1234\",\"1244\",\"1567\"", which works, but the users are complaining that it's too much typing and hard to do. So I'd like to just take a comma delimited list like I had at the top and programmatically stick the quotes in.
Basically, I want a parameter to be passed in as 1234,2345 and end up as a variable that is "1234","2345"
I've read that the easiest approach here is to use sed, but I'm really not familiar with it and all of my efforts are failing.
You can do this in BASH:
$> arg='1234,1244,1567'
$> echo "\"${arg//,/\",\"}\""
"1234","1244","1567"
awk to the rescue!
$ awk -F, -v OFS='","' -v q='"' '{$1=$1; print q $0 q}' <<< "1234,1244,1567"
"1234","1244","1567"
or shorter with sed
$ sed -r 's/[^,]+/"&"/g' <<< "1234,1244,1567"
"1234","1244","1567"
translating this back to awk
$ awk '{print gensub(/([^,]+)/,"\"\\1\"","g")}' <<< "1234,1244,1567"
"1234","1244","1567"
you can use this:
echo QV=$(echo 1234,2345,56788 | sed -e 's/^/"/' -e 's/$/"/' -e 's/,/","/g')
result:
echo $QV
"1234","2345","56788"
just add double quotes at start, end, and replace commas with quote/comma/quote globally.
easy to do with sed
$ echo '1234,1244,1567' | sed 's/[0-9]*/"\0"/g'
"1234","1244","1567"
[0-9]* zero or more consecutive digits; since * is greedy it will try to match as many as possible
"\0" double quote the matched pattern, entire match is by default saved in \0
g global flag, to replace all such patterns
In case, \0 isn't recognized in some sed versions, use & instead:
$ echo '1234,1244,1567' | sed 's/[0-9]*/"&"/g'
"1234","1244","1567"
Similar solution with perl
$ echo '1234,1244,1567' | perl -pe 's/\d+/"$&"/g'
"1234","1244","1567"
Note: Using * instead of + with perl will give
$ echo '1234,1244,1567' | perl -pe 's/\d*/"$&"/g'
"1234""","1244""","1567"""
""$
I think this difference between sed and perl is similar to this question: GNU sed, ^ and $ with | when first/last character matches
Using sed:
$ echo 1234,1244,1567 | sed 's/\([0-9]\+\)/\"\1\"/g'
"1234","1244","1567"
i.e. replace every run of digits with the same run in quotes, using a back-reference (\1).
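Several of the sed answers above rely on GNU extensions (-r, \0, \+). A variant that sticks to basic regular expressions and & could be wrapped in a function as sketched here; the name quote_csv is invented for the example.

```shell
# quote_csv: wrap every non-empty comma-separated field in double quotes.
# Uses only BRE features: [^,][^,]* stands in for the ERE [^,]+ and
# & stands for the whole match.
quote_csv() {
    printf '%s\n' "$1" | sed 's/[^,][^,]*/"&"/g'
}

quoted=$(quote_csv 1234,1244,1567)
printf '%s\n' "$quoted"   # "1234","1244","1567"
```

Empty fields (as in 1,,2) are left unquoted by this pattern, which may or may not be what the JSON payload needs.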

Case insensitive search matching with sed?

I'm trying to use sed to extract the text between two words, such as "Account" and "Recognized", and I'd like the search to be case insensitive. So I tried to use the I parameter, but receive this error message:
cat Security.txt | sed -n "/Account/,/Recognized/pI" | sed -e '1d' -e '$d'
sed: -e expression #1, char 24: extra characters after command
Avoid useless use of cat
/pattern/I is how to specify case-insensitive matching in sed
sed -n "/Account/I,/Recognized/Ip" Security.txt | sed -e '1d' -e '$d'
You can use single sed command to achieve the same:
sed -n '/account/I,/recognized/I{/account/I!{/recognized/I!p}}' Security.txt
Or awk
awk 'BEGIN{IGNORECASE=1} /account/{f=1; next} /recognized/{f=0} f' Security.txt
Reference:
How to select lines between two patterns?
Use:
sed -n "/Account/,/Recognized/Ip"
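i.e. change the order to: Ip instead of pI. Note that the I here applies only to the /Recognized/ address; add I after /Account/ as well to make both case insensitive.
You have a useless use of cat; you should feed the file directly to sed. Below is one way of doing it.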
$ cat file.txt
Some stuff Account sllslsljjs Security.
Another stuff account name and ffss security.
$ sed -nE 's/^.*account[[:blank:]]*(.*)[[:blank:]]*security.*$/\1/pI' file.txt
sllslsljjs
name and ffss
The [[:blank:]]* is greedy and will strip the spaces before and after the required text. The -E option enables the use of extended regular expressions.
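The I address flag and awk's IGNORECASE variable are extensions (IGNORECASE is specific to gawk), so a more portable route may be to lowercase each line with awk's standard tolower function. The sketch below assumes the two words are literal strings, not regular expressions; the function and variable names are made up.

```shell
# extract_between START STOP: print the lines strictly between a line that
# contains START and the next line that contains STOP, ignoring case.
# Reads standard input. Both words are treated as literal text.
extract_between() {
    awk -v start_word="$1" -v stop_word="$2" '
        BEGIN { start_word = tolower(start_word); stop_word = tolower(stop_word) }
        { line = tolower($0) }
        inside && index(line, stop_word)   { inside = 0 }
        inside                             { print }
        !inside && index(line, start_word) { inside = 1 }
    '
}

printf '%s\n' header 'ACCOUNT info' 'line one' 'line two' 'recognized: yes' footer |
    extract_between Account Recognized
```

With the sample input above, only "line one" and "line two" should be printed.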

invoking sed with a shell variable

Why doesn't this work?
$ s="-e 's/^ *//' -e 's/ *$//'"
$ ls | sed $s
sed: 1: "'s/^
": invalid command code '
$ ls | gsed $s
gsed: -e expression #1, char 1: unknown command: `''
But this does:
$ ls | eval sed $s
... prints stuff ...
$ ls | eval gsed $s
... prints stuff ...
Tried removing single quotes from $s but it only works for patterns without spaces:
$ s="-e s/a/b/"
$ ls | sed $s
... prints stuff ...
$ s="-e s/^ *//"
$ ls | sed $s
sed: 1: "s/^
": unterminated substitute pattern
or
$ s="-e s/^\ *//"
$ ls | sed $s
sed: 1: "s/^\
": unterminated substitute pattern
Mac OS 10.8, bash 4.2, default sed and gsed 4.2.2 from Mac Ports
Simple looking question with a complicated answer. Most of the issue is with the shell; it is only partly a problem with sed. (In other words, you could use a number of different commands instead of sed and would run into similar issues.)
Note that most commands documented with an option letter and a separate argument string will also work when the argument string is attached to the option. For example:
sort -t :
sort -t:
Both of these give the value : to the -t option. Similarly with sed and the -e option. That is, you can write either of these:
sed -n -e /match/p
sed -n -e/match/p
Let's look at one of the working sed commands you wrote:
$ s="-e s/a/b/"
$ ls | sed $s
What the sed command is passed here is two arguments (after its command name):
-e
s/a/b/
This is a perfectly fine set of arguments for sed. What went wrong with the first one, then?
$ s="-e 's/^ *//' -e 's/ *$//'"
$ ls | sed $s
Well, this time, the sed command was passed 6 arguments:
-e
's/^
*//'
-e
's/
*$//'
You can use the al command (argument list — print each argument on its own line; it is described and implemented at the bottom of this answer) to see how arguments are presented to sed. Simply type al in place of sed in the examples.
Now, the -e option should be followed by a valid sed command, but 's/^ is not a valid command; the quote ' is not a valid sed command. When you type the command at the shell prompt, the shell processes the single quote and removes it, so sed does not normally see it, but that happens before shell variables are expanded.
Why, then, does the eval work:
$ s="-e 's/^ *//' -e 's/ *$//'"
$ ls | eval sed $s
The eval re-evaluates the command line. It sees:
eval sed -e 's/^ *//' -e 's/ *$//'
and goes through the full evaluation process. It removes the single quotes after grouping the characters, so sed sees:
-e
s/^ *//
-e
s/ *$//
which is all completely valid sed scripting.
One of your tests was:
$ s="-e s/^ *//"
$ ls | sed $s
And this failed because sed was given the arguments:
-e
s/^
*//
The first is not a valid substitute command, and the second is unlikely to be a valid file name. Interestingly, you could rescue this by putting double quotes around the $s, as in:
$ s="-e s/^ *//"
$ ls | sed "$s"
Now sed gets a single argument:
-e s/^ *//
but the -e can have the command attached, and leading spaces on commands are ignored, so this is all valid. You can't do that with your first attempt, though:
$ s="-e 's/^ *//' -e 's/ *$//'"
$ ls | sed "$s"
Now you get told about the ' not being recognized. You could, however, have used:
$ s="-e s/^ *//; s/ *$//"
$ ls | sed "$s"
Again, sed sees a single argument, and there are two semicolon-separated sed commands in the argument to the -e option.
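One way to sidestep the quoting problem without eval, sketched here under the assumption that the options are known when the script is written, is to keep them as separate words in the positional parameters instead of in one string. The function name trim is made up for the example.

```shell
trim() {
    # Each sed argument is stored as its own word; the shell removes the
    # single quotes here, when this line is parsed.
    set -- -e 's/^ *//' -e 's/ *$//'
    # "$@" expands to exactly those four words, with no further splitting.
    sed "$@"
}

printf '   padded text   \n' | trim   # prints: padded text
```

In bash, an array expanded as "${args[@]}" plays the same role as the positional parameters do here.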
You can ring the variations from here. I find the al command very useful; it quite often helps me understand where something is going wrong.
Source for al — argument list
#include <stdio.h>

/* Print each command-line argument on its own line. */
int main(int argc, char **argv)
{
    while (*++argv)
        puts(*argv);
    return 0;
}
This is one of the smallest useful C programs you can write ('hello world' is one line shorter, but it isn't useful for much beyond demonstrating how to compile and run a program). It lists each of its arguments on a line on its own. You can also simulate it in bash and other related shells with the printf command:
printf "%s\n" "$@"
Wrap it as a function:
al()
{
    printf "%s\n" "$@"
}
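As a quick illustration of the argument counts discussed above, the shell version of al can be used to count how many arguments the variable turns into with and without double quotes. This is a sketch; the variable names unquoted_count and quoted_count are invented here.

```shell
al() {
    printf '%s\n' "$@"
}

s="-e 's/^ *//' -e 's/ *$//'"

# Unquoted: the shell splits $s on whitespace. set -f turns off filename
# expansion inside the subshell so the * characters stay literal.
unquoted_count=$( set -f; al $s | wc -l )

# Quoted: the whole string arrives as a single argument.
quoted_count=$( al "$s" | wc -l )

echo "unquoted: $((unquoted_count)) arguments"   # 6
echo "quoted: $((quoted_count)) arguments"       # 1
```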
The sed worked for your normal replace pattern because it did not have any metacharacters. You had just a and b. When there are metacharacters involved, you need single quotes.
I think the only way sed would work properly for your variable assignment case is only by using eval.

Concise and portable "join" on the Unix command-line

How can I join multiple lines into one line, with a separator where the new-line characters were, and avoiding a trailing separator and, optionally, ignoring empty lines?
Example. Consider a text file, foo.txt, with three lines:
foo
bar
baz
The desired output is:
foo,bar,baz
The command I'm using now:
tr '\n' ',' <foo.txt |sed 's/,$//g'
Ideally it would be something like this:
cat foo.txt |join ,
What's:
the most portable, concise, readable way.
the most concise way using non-standard unix tools.
Of course I could write something, or just use an alias. But I'm interested to know the options.
Perhaps a little surprisingly, paste is a good way to do this:
paste -s -d","
This won't deal with the empty lines you mentioned. For that, pipe your text through grep, first:
grep -v '^$' | paste -s -d"," -
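The closest thing to the join , command the question wishes for is probably a small wrapper around this pipeline. The sketch below assumes a single-character separator (paste -d treats its argument as a list of delimiters to cycle through); the name join_lines is invented.

```shell
# join_lines [SEP]: join the non-empty lines of standard input with SEP
# (default: a comma). SEP is assumed to be a single character.
join_lines() {
    grep -v '^$' | paste -s -d "${1:-,}" -
}

printf 'foo\n\nbar\nbaz\n' | join_lines       # prints: foo,bar,baz
printf 'foo\nbar\nbaz\n'   | join_lines ':'   # prints: foo:bar:baz
```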
This sed one-line should work -
sed -e :a -e 'N;s/\n/,/;ba' file
Test:
[jaypal:~/Temp] cat file
foo
bar
baz
[jaypal:~/Temp] sed -e :a -e 'N;s/\n/,/;ba' file
foo,bar,baz
To handle empty lines, you can remove the empty lines and pipe it to the above one-liner.
sed -e '/^$/d' file | sed -e :a -e 'N;s/\n/,/;ba'
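The one-liner above depends on what N does when it is issued on the last line, which differs between sed implementations. A variant that guards N with $! and loops with t avoids running N on the last line at all, so it may be the safer choice across systems. A sketch, with the made-up name sed_join:

```shell
# sed_join: join all lines of standard input with commas.
# $!N appends the next line only when one exists; "ta" loops back to :a
# only if the substitution actually replaced a newline.
sed_join() {
    sed -e :a -e '$!N;s/\n/,/;ta'
}

printf 'foo\nbar\nbaz\n' | sed_join   # prints: foo,bar,baz
```

Like the original, this accumulates the whole input in the pattern space, so it suits small to medium inputs.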
How about using xargs?
For your case:
$ cat foo.txt | sed 's/$/, /' | xargs
Be careful about the limit on the length of the command line that xargs builds: a very long input file cannot be joined into a single line this way.
Perl:
cat data.txt | perl -pe 'if(!eof){chomp;$_.=","}'
or yet shorter and faster, surprisingly:
cat data.txt | perl -pe 'if(!eof){s/\n/,/}'
or, if you want:
cat data.txt | perl -pe 's/\n/,/ unless eof'
Just for fun, here's an all-builtins solution
IFS=$'\n' read -r -d '' -a data < foo.txt ; ( IFS=, ; echo "${data[*]}" ; )
You can use printf instead of echo if the trailing newline is a problem.
This works by setting IFS, the delimiters that read will split on, to just newline and not other whitespace, then telling read to not stop reading until it reaches a nul, instead of the newline it usually uses, and to add each item read into the array (-a) data. Then, in a subshell so as not to clobber the IFS of the interactive shell, we set IFS to , and expand the array with *, which delimits each item in the array with the first character in IFS
I needed to accomplish something similar, printing a comma-separated list of fields from a file, and was happy with piping STDOUT to xargs and ruby, like so:
cat data.txt | cut -f 16 -d ' ' | grep -o "\d\+" | xargs ruby -e "puts ARGV.join(', ')"
I had a log file where some data was broken into multiple lines. When this occurred, the last character of the first line was the semi-colon (;). I joined these lines by using the following commands:
for LINE in `cat $FILE | tr -s " " "|"`
do
    if [ $(echo $LINE | egrep ";$") ]
    then
        echo "$LINE\c" | tr -s "|" " " >> $MYFILE
    else
        echo "$LINE" | tr -s "|" " " >> $MYFILE
    fi
done
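The same joining rule (a line ending in ; continues on the next line) can also be expressed in a single awk pass, which avoids the temporary | substitution and the dependence on echo understanding \c. This is an alternative sketch, not the method used above; the name join_continued is invented.

```shell
# join_continued: when a line ends with ';', print it without a newline so
# that the following line is appended to it. Reads standard input.
join_continued() {
    awk '/;$/ { printf "%s", $0; next } { print }'
}

printf 'first part;\nsecond part\nwhole line\n' | join_continued
# first part;second part
# whole line
```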
The result is a file where lines that were split in the log file were one line in my new file.
A simple way to join the lines with spaces, in place, using ex (this also ignores blank lines):
ex +%j -cwq foo.txt
If you want to print the results to the standard output, try:
ex +%j +%p -scq! foo.txt
To join lines without spaces, use +%j! instead of +%j.
To use a different delimiter, it's a bit more tricky:
ex +"g/^$/d" +"%s/\n/_/e" +%p -scq! foo.txt
where g/^$/d (or v/\S/d) removes blank lines and s/\n/_/ is a substitution that works basically the same as in sed, but is applied to all lines (%). When the editing is done, %p prints the buffer. Finally, -cq! executes the vi command q!, which quits without saving (-s silences the output).
Please note that ex is equivalent to vi -e.
This method is quite portable, as most Linux/Unix systems ship with ex/vi by default. It is also more compatible than sed for in-place editing: sed's in-place option (-i) is a non-standard extension, and sed itself is stream oriented.
POSIX shell:
( set -- $(cat foo.txt) ; IFS=+ ; printf '%s\n' "$*" )
My answer is:
awk '{printf "%s%s", (NR>1 ? "," : ""), $0} END {print ""}' foo.txt
printf is enough. We don't need -F"\n" to change field separator.
