Replacing a number with a Unicode character using sed in bash

So I have this output generated from printf
011010
Now I want to pipe it through sed to replace the 0s and 1s with Unicode characters, so that Unicode characters are printed instead of the binary digits (011010).
I can do this by copy-pasting the characters themselves, but I would rather use values like the ones found in a Unicode table:
Position: 0x2701
Decimal: 9985
Symbol: ✁
How do I use the above values with sed to generate the character?

With bash (since v4.2) or zsh, the simple solution is to use the $'...' syntax, which understands C escapes including \u escapes:
$ echo 011010 | sed $'s/1/\u2701/g'
0✁✁0✁0
If you have GNU sed, you can use escape sequences in the s// command. Unfortunately, GNU sed does not understand \u Unicode escapes, but it does understand \x hex escapes. To get it to decode them, however, you need to make sure that sed actually sees the backslashes. Then you can write the replacement as UTF-8 bytes, assuming you know the UTF-8 sequence corresponding to the Unicode code point:
$ # Quote the argument
$ echo 011010 | sed 's/1/\xE2\x9C\x81/g'
0✁✁0✁0
$ # Or escape the backslashes
$ echo 011010 | sed s/1/\\xE2\\x9C\\x81/g
0✁✁0✁0
$ # This doesn't work because the \ is removed by bash before sed sees it
$ echo 011010 | sed s/1/\xE2\x9C\x81/g
0xE2x9Cx81xE2x9Cx810xE2x9Cx810
$ # So that was the same as: sed s/1/xE2x9Cx81/g
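If you only know the code point (0x2701, or 9985 in decimal), the UTF-8 bytes can be derived with shell arithmetic. Here is a minimal sketch, assuming bash; utf8_hex_escapes is a made-up helper name, not part of bash or sed. It prints the \xHH escapes shown above, so the final sed call still requires GNU sed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the UTF-8 encoding of a code point as \xHH escapes.
# Accepts decimal (9985) or hex (0x2701) input.
utf8_hex_escapes() {
  local cp=$(( $1 ))
  if (( cp < 0x80 )); then
    # 1 byte: 0xxxxxxx
    printf '\\x%02X' "$cp"
  elif (( cp < 0x800 )); then
    # 2 bytes: 110xxxxx 10xxxxxx
    printf '\\x%02X\\x%02X' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '\\x%02X\\x%02X\\x%02X' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '\\x%02X\\x%02X\\x%02X\\x%02X' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_hex_escapes 0x2701; echo      # \xE2\x9C\x81

# GNU sed only: the escapes are decoded by sed, not by the shell.
echo 011010 | sed "s/1/$(utf8_hex_escapes 0x2701)/g"
```

The helper does not validate its input (surrogates or values above 0x10FFFF are not rejected), so treat it as an illustration rather than a complete encoder.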

Related

What is the difference between \u and \U in GNU sed

I came across the two escapes \u and \U (and others such as \l and \L) in sed. I have to admit I am a newbie and have little experience with GNU sed.
I have tried the following two commands but got the same result:
# I have tested this on Ubuntu 20.04
echo "abc" | sed 's/./\u&/g' # output is: ABC
echo "abc" | sed 's/./\U&/g' # output is: ABC
The output is the same for the two commands. So, what is the difference between them?
In a substitution replacement, \u converts only the next character to uppercase, whereas \U converts everything that follows in the replacement to uppercase, until a \L or \E is found.
Your example will not show the difference between them, because you are only replacing one character at a time. If you use the pattern .* (instead of .) in echo abc | sed 's/.*/\u&/', you will get Abc.
The commands are documented in info sed, and here: https://www.gnu.org/software/sed/manual/sed.html#The-_0022s_0022-Command
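To see the difference side by side, here is a small sketch that matches whole words rather than single characters. It assumes GNU sed, since \u, \U and \E are GNU extensions; the function names are only for illustration:

```shell
#!/usr/bin/env bash
# GNU sed only: \u, \U and \E are GNU extensions to the s command.

# \u uppercases just the first character of each matched word.
first_upper() { sed 's/[a-z][a-z]*/\u&/g'; }

# \U uppercases the whole of each matched word.
all_upper() { sed 's/[a-z][a-z]*/\U&/g'; }

# \E stops the conversion started by \U, so only the first group is affected.
first_word_upper() { sed 's/\([a-z]*\) \(.*\)/\U\1\E \2/'; }

echo "abc def" | first_upper        # Abc Def
echo "abc def" | all_upper          # ABC DEF
echo "abc def" | first_word_upper   # ABC def
```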

Removing punctuation and tabs with sed

I am using the following to remove punctuation, tabs, and convert uppercase text to lowercase in a text file.
sed 's/[[:punct:]]//g' $HOME/file.txt | sed $'s/\t//g' | tr '[:upper:]' '[:lower:]'
Do I need to use these two separate sed commands to remove punctuation and tabs or can this be done with a single sed command?
Also, could someone explain what the $ is doing in the second sed command? Without it the command doesn't remove tabs. I looked in the man page but I didn't see anything that mentioned this.
The input file looks like this:
Pochemu oni ne v shkole?
Kto tam?
Otkuda eto moloko?
Chei chai ona p’et?
Kogda vy chitaete?
Kogda ty chitaesh’?
Yes, a single sed invocation with multiple -e expressions can do it. For FreeBSD sed:
sed -e $'s/\t//g' -e "s/[[:punct:]]\+//g" -e 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/' file
The y command is documented as:
[2addr]y/string1/string2/
Replace all occurrences of characters in string1 in the pattern
space with the corresponding characters from string2.
With GNU sed, the \L case-conversion escape in the replacement does the lower-casing:
sed -e $'s/\t//g' -e "s/[[:punct:]]\+//g" -e "s/./\L&/g" file
$'' is a bash quoting mechanism to enable ANSI C-like escape sequences.
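As a rough sketch of both points (what $'...' hands to sed, and doing the deletion in one sed call), assuming bash: the tab can sit inside the same bracket expression as [:punct:]. The name clean_text is made up for this example, and tr is kept for the lower-casing as in the question.

```shell
#!/usr/bin/env bash
# $'...' is expanded by bash before sed runs, so sed receives a real tab character.
tab=$'\t'
echo "${#tab}"    # 1: a single character, not a backslash followed by t

# Hypothetical helper: one sed call deletes punctuation and tabs, tr lowercases.
clean_text() {
  sed $'s/[[:punct:]\t]//g' | tr '[:upper:]' '[:lower:]'
}

printf 'Kto\ttam?\n' | clean_text   # ktotam
```

Whether non-ASCII punctuation such as the ’ in the sample file counts as [:punct:] depends on the locale, so that part is worth checking on the actual input.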

Trying to print "\n" in bash with sed

Having some problems having sed insert the two-character sequence \n. (I'm using bash to create some expect scripts). Anything that I try in a replace pattern ends up as an actual newline character.
I've tried:
sed s/<string>/'\\\\n'/
sed s/<string>/\\\\n/
sed s/<string>/\\n/
And pretty much any permutation that does or doesn't make any sense.
I need it to work with the bash and sed installed on a Mac.
sed s/<string>/'\\n'/ works for me with both the Linux (GNU) and OS X (BSD) versions of sed:
$ echo aXb | sed s/X/'\\n'/
a\nb
sed s/<string>/\\\\n/ would also work. When bash sees \\ (outside of quotes), it treats it as a single escaped backslash, so \ is actually passed to the command. When it sees \\\\n, that's just two escaped backslashes followed by "n", so it passes \\n to the command. Then, when sed sees \\n, it also treats that as an escaped backslash followed by "n", so the replacement string winds up being \n. Since the "n" is always after any completed escape sequence, it's just treated as another character in the replacement string.
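With single quotes, the shell passes the expression to sed untouched:
sed 's/Pattern/\\n/' YourFile
With double quotes, the shell processes the backslashes first, so they have to be doubled again:
sed "s/Pattern/\\\\n/" YourFile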
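One way to check the two layers of escaping is to print the argument exactly as sed would receive it. A small sketch, assuming bash; show_arg is just a throwaway helper for this demonstration:

```shell
#!/usr/bin/env bash
# show_arg prints its first argument exactly as a command would receive it.
show_arg() { printf '%s\n' "$1"; }

show_arg s/X/\\n/        # s/X/\n/   sed gets the \n escape, not a literal backslash
show_arg s/X/\\\\n/      # s/X/\\n/  sed gets an escaped backslash followed by n
show_arg 's/X/\\n/'      # s/X/\\n/  single quotes: nothing is removed by the shell
show_arg "s/X/\\\\n/"    # s/X/\\n/  double quotes: each \\ collapses to one backslash

echo aXb | sed 's/X/\\n/'   # a\nb
```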

Delete flanking uppercase characters in a string

How could I remove the uppercases that start and end in this string (DNA sequence) using the linux terminal?
Input:
TCGTAAATGGTgggggtcagaccctaaggtttccataaagGCTGGtccaaacgcaacttctaattgaatgataaaatactcatgcatgttGTTCGAtaaaacgtaatatttatggcgtgtctacctaccgttccatcttatcgtttaaactttggtacaattctcagttaagtgacgattgctttggaggaagtaatactgtgatcacaatctatgctgtttgcgttgccAAAAAAtttcaatgtaaaaaaaaaTCGAAAATGGT
Desired Output:
gggggtcagaccctaaggtttccataaagGCTGGtccaaacgcaacttctaattgaatgataaaatactcatgcatgttGTTCGAtaaaacgtaatatttatggcgtgtctacctaccgttccatcttatcgtttaaactttggtacaattctcagttaagtgacgattgctttggaggaagtaatactgtgatcacaatctatgctgtttgcgttgccAAAAAAtttcaatgtaaaaaaaaa
Note there are other internal uppercases in the string that must be preserved.
Thanks!
Using sed you can do this, assuming each string is in one line:
sed 's/^[A-Z]*\|[A-Z]*$//g' <<< "$s"
You could use sed with a regular expression:
sed -e 's/^[A-Z]*//' -e 's/[A-Z]*$//'
(It would also be possible to combine these into a single regex, but I wrote it this way for clarity; the first regex strips leading uppercase chars, the second strips trailing uppercase chars.)
[me@localhost ~]$ echo 'TCGTAAATGGTgggggtcagaccctaaggtttccataaagGCTGGtccaaacgcaacttctaattgaatgataaaatactcatgcatgttGTTCGAtaaaacgtaatatttatggcgtgtctacctaccgttccatcttatcgtttaaactttggtacaattctcagttaagtgacgattgctttggaggaagtaatactgtgatcacaatctatgctgtttgcgttgccAAAAAAtttcaatgtaaaaaaaaaTCGAAAATGGT' | sed -e 's/^[A-Z]*//' -e 's/[A-Z]*$//'
gggggtcagaccctaaggtttccataaagGCTGGtccaaacgcaacttctaattgaatgataaaatactcatgcatgttGTTCGAtaaaacgtaatatttatggcgtgtctacctaccgttccatcttatcgtttaaactttggtacaattctcagttaagtgacgattgctttggaggaagtaatactgtgatcacaatctatgctgtttgcgttgccAAAAAAtttcaatgtaaaaaaaaa
Suppose
sequence=TCGTAAATGGTgggggtcagaccctaaggtttccataaagGCTGGtccaaacgcaacttctaattgaatgataaaatactcatgcatgttGTTCGAtaaaacgtaatatttatggcgtgtctacctaccgttccatcttatcgtttaaactttggtacaattctcagttaagtgacgattgctttggaggaagtaatactgtgatcacaatctatgctgtttgcgttgccAAAAAAtttcaatgtaaaaaaaaaTCGAAAATGGT
A pure bash solution, which requires extended glob patterns (extglob), would be
shopt -s extglob
tmp1=${sequence##*([TCGA])} # Strip the leading capitals and save the result
echo "${tmp1%%*([TCGA])}" # Strip the trailing capitals and print what is left
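The full sequence is hard to check by eye, so here is the two-expression sed approach on a short made-up input. This is only a sketch: strip_flanking_upper is a hypothetical wrapper, and [[:upper:]] is swapped in for [A-Z] so that the match does not depend on the locale's collation order.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the two-expression sed approach.
# The first expression strips leading uppercase letters, the second strips trailing ones;
# uppercase runs in the middle are never adjacent to ^ or $, so they survive.
strip_flanking_upper() {
  printf '%s\n' "$1" | sed -e 's/^[[:upper:]]*//' -e 's/[[:upper:]]*$//'
}

strip_flanking_upper 'TCGgggGCTtccAAT'   # gggGCTtcc
strip_flanking_upper 'acgt'              # acgt
```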

Is there an easy way to pass a "raw" string to grep?

grep can't be fed "raw" strings when used from the command-line, since some characters need to be escaped to not be treated as literals. For example:
$ grep '(hello|bye)' # WON'T MATCH 'hello'
$ grep '\(hello\|bye\)' # GOOD, BUT QUICKLY BECOMES UNREADABLE
I was using printf to auto-escape strings:
$ printf '%q' '(some|group)\n'
\(some\|group\)\\n
This produces a bash-escaped version of the string, and using backticks, this can easily be passed to a grep call:
$ grep `printf '%q' '(a|b|c)'`
However, it's clearly not meant for this: some characters in the output are not escaped, and some are unnecessarily so. For example:
$ printf '%q' '(^#)'
\(\^#\)
The ^ character should not be escaped when passed to grep.
Is there a cli tool that takes a raw string and returns a bash-escaped version of the string that can be directly used as pattern with grep? How can I achieve this in pure bash, if not?
If you want to search for an exact string,
grep -F '(some|group)\n' ...
-F tells grep to treat the pattern as is, with no interpretation as a regex.
(This is often available as fgrep as well.)
If you are attempting to get grep to use Extended Regular Expression syntax, the way to do that is to use grep -E (aka egrep). You should also know about grep -F (aka fgrep) and, in newer versions of GNU grep, grep -P.
Background: The original grep had a fairly small set of regex operators; it was Ken Thompson's original regular expression implementation. A new version with an extended repertoire was developed later, and for compatibility reasons, got a different name. With GNU grep, there is only one binary, which understands the traditional, basic RE syntax if invoked as grep, and ERE if invoked as egrep. Some constructs from egrep are available in grep by using a backslash escape to introduce special meaning.
Subsequently, the Perl programming language has extended the formalism even further; this regex dialect seems to be what most newcomers erroneously expect grep, too, to support. With grep -P, it does; but this is not yet widely supported on all platforms.
So, in grep, the following characters have a special meaning: ^$[]*.\
In egrep, the following characters also have a special meaning: ()|+?{}. (The braces for repetition were not in the original egrep.) The grouping parentheses also enable backreferences with \1, \2, etc.
In many versions of grep, you can get the egrep behavior by putting a backslash before the egrep specials. There are also special sequences like \<\>.
In Perl, a huge number of additional escapes like \w \s \d were introduced. In Perl 5, the regex facility was substantially extended, with non-greedy matching *? +? etc, non-grouping parentheses (?:...), lookaheads, lookbehinds, etc.
... Having said that, if you really do want to convert egrep regular expressions to grep regular expressions without invoking any external process, try ${regex/pattern/substitution} for each of the egrep special characters; but recognize that this does not handle character classes, negated character classes, or backslash escapes correctly.
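A quick way to see how the matching modes treat the same pattern is to count matches over a few sample lines. This is only a sketch with made-up input; it uses nothing beyond the standard -c, -E and -F options:

```shell
#!/usr/bin/env bash
# Three sample lines: two plain words and the pattern text itself.
sample() { printf '%s\n' 'hello' 'bye' '(hello|bye)'; }

sample | grep -c  '(hello|bye)'    # 1: BRE, so ( | ) are literal characters
sample | grep -cE '(hello|bye)'    # 3: ERE, so this is a group with alternation
sample | grep -cF '(hello|bye)'    # 1: fixed string, no regex interpretation at all
```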
When I use grep -E with user provided strings I escape them with this
ere_quote() {
sed 's/[][\.|$(){}?+*^]/\\&/g' <<< "$*"
}
example run
ere_quote ' \ $ [ ] ( ) { } | ^ . ? + *'
# output
# \\ \$ \[ \] \( \) \{ \} \| \^ \. \? \+ \*
This way you may safely insert the quoted string in your regular expression.
For example, to find each line starting with user-supplied content, where the user might provide awkward strings such as .*:
userdata=".*"
grep -E -- "^$(ere_quote "$userdata")" <<< ".*hello"
# if you have colors in grep you'll see only ".*" in red
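To illustrate why the quoting matters, here is a sketch with a different made-up input, 1+1, where the unquoted pattern silently matches the wrong line. ere_quote is repeated from above so the snippet runs on its own (bash is assumed because of the here-string):

```shell
#!/usr/bin/env bash
ere_quote() {
  sed 's/[][\.|$(){}?+*^]/\\&/g' <<< "$*"
}

userdata='1+1'
lines() { printf '%s\n' '1+1=2' '11=2'; }

# Unquoted: + is a repetition operator, so ^1+1 means "one or more 1, then 1".
lines | grep -E -- "^$userdata"                  # 11=2

# Quoted: the pattern becomes ^1\+1 and the + is matched literally.
lines | grep -E -- "^$(ere_quote "$userdata")"   # 1+1=2
```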
I think the previous answers are incomplete because they miss one important case: strings that begin with a dash (-). While this won't work:
echo "A-B-C" | grep -F "-B-"
This one will:
echo "A-B-C" | grep -F -- "-B-"
quote() {
sed 's/[^\^]/[&]/g;s/[\^]/\\&/g' <<< "$*"
}
Usage: grep [OPTIONS] "$(quote [STRING])"
This function has some substantial benefits:
quote is independent from the regex flavor. You can use quote's output in
grep (-G) (BRE, the default)
grep -E (ERE)
grep -P (PCRE)
sed (-E) "s/$(quote [STRING])/.../" (as long as you don't use \, [, or ] instead of /).
quote even works in corner cases that are not directly quoting related, for instance
Leading - are quoted so that they aren't misinterpreted as options by grep.
Trailing spaces are quoted so that they aren't removed by $(...).
quote only fails if [STRING] contains linebreaks. But in general there is no fix for this since tools like grep and sed may not support linebreaks in their search pattern (even if they are written as \n).
Also, there is the drawback that the quoted output usually is three times longer than the unquoted input.
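For a concrete picture of what quote produces, here is a short sketch on a made-up string. The function is repeated from above so the snippet is self-contained (bash is assumed because of the here-string):

```shell
#!/usr/bin/env bash
quote() {
  sed 's/[^\^]/[&]/g;s/[\^]/\\&/g' <<< "$*"
}

# Every character except \ and ^ is wrapped in a bracket expression; ^ gets a backslash.
quote 'a.b^c'                         # [a][.][b]\^[c]

lines() { printf '%s\n' 'a.b^c' 'axb^c'; }

lines | grep -c  'a.b^c'              # 2: the unquoted . matches any character
lines | grep -c  "$(quote 'a.b^c')"   # 1: BRE, only the literal string matches
lines | grep -cE "$(quote 'a.b^c')"   # 1: same quoted pattern under ERE
```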
To comment on the example below: the substring "-B" is interpreted by grep as a command-line option, so the command fails.
echo "A-B-C" | grep -F "-B-"
grep has a special option for this case:
-e PATTERNS, --regexp=PATTERNS
Use PATTERNS as the patterns. If this option is used multiple times or is combined with the -f (--file) option,
search for all patterns given. This option can be used to protect a pattern beginning with “-”.
So a fix for the issue is:
echo "A-B-C" | grep -F -e "-B-" -
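If this comes up often, the two protections can be combined in a small wrapper. This is only a sketch; grep_literal is a hypothetical name. It uses -e to protect the pattern and -- to protect any file names that happen to start with a dash:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: fixed-string search that is safe for patterns starting with "-".
# Usage: grep_literal PATTERN [FILE...]   (reads stdin when no FILE is given)
grep_literal() {
  local pattern=$1
  shift
  grep -F -e "$pattern" -- "$@"
}

echo "A-B-C" | grep_literal "-B-"   # A-B-C
```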
