Two seemingly identical strings with newlines not equal

I am trying to convert a list of quoted strings, separated by commas, into a list of strings separated by newlines using bash and sed.
Here is an example of what I am doing:
#!/bin/bash
comma_to_newline() {
    sed -En $'s/[ \t]*"([^"]*)",?[ \t]*/\\1\\\n/gp'
}
input='"one","two","three"'
expected="one\ntwo\nthree"
result="$( echo "${input}" | comma_to_newline )"
echo "Expected: <${expected}>"
echo "Result: <${result}>"
if [ "${result}" = "${expected}" ]; then
    echo "EQUAL!"
else
    echo "NOT EQUAL!"
fi
And the output I am getting is:
Expected: <one
two
three>
Result: <one
two
three>
NOT EQUAL!
I know it has something to do with the newline characters, but I can't work out what. If I replace the newlines with some other string, such as XXX, it works fine and bash reports the strings as being equal.

Prompted by the comments on my question, I managed to work out what was going on. I was so focussed on coming up with a working sed expression and ensuring that result was correct that I failed to notice that the expected string was incorrect.
In order to use \n newlines in a bash string, you have to use the $'one\ntwo\nthree' syntax - see How can I have a newline in a string in sh? for other solutions.
I was developing against bash version 3.2.57 (the version that comes with Mac OS 10.14.6). There, after assigning a variable using expected="one\ntwo\nthree" and then echoing it, the \n sequences were displayed as newlines in the console. Newer versions of bash display the string with the escapes intact, so I assume this was a bug that has been fixed in later versions of bash.
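A minimal sketch of the corrected comparison, assuming bash and the sed expression from the question. The only change to the logic is how expected is assigned; the extra variable named literal and the length printout are added here purely for illustration.

```bash
#!/bin/bash
# Same sed expression as in the question.
comma_to_newline() {
    sed -En $'s/[ \t]*"([^"]*)",?[ \t]*/\\1\\\n/gp'
}

input='"one","two","three"'
literal="one\ntwo\nthree"      # backslash followed by n: 15 characters
expected=$'one\ntwo\nthree'    # real newline characters: 13 characters
result="$(printf '%s\n' "$input" | comma_to_newline)"

# Comparing lengths is a quick way to spot the mismatch.
echo "lengths: literal=${#literal} expected=${#expected} result=${#result}"
if [ "$result" = "$expected" ]; then
    echo "EQUAL!"
else
    echo "NOT EQUAL!"
fi
```

The command substitution strips the trailing newlines that sed emits, which is why result matches a string with no newline at the end.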

For diagnosing seemingly identical strings, try combining side-by-side diff output with a one char per line hexdump format. Replace:
else
echo "NOT EQUAL!"
fi
...with:
else
echo "NOT EQUAL!"
diff -y \
<(hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"' <<< "${expected}") \
<(hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"' <<< "${result}")
fi
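If hexdump is not at hand, a lighter-weight check is to print both strings with bash's printf %q, which shows a string in a form that could be reused as shell input. This is only a sketch; the two variables below are stand-ins for the values in the question.

```bash
#!/bin/bash
expected="one\ntwo\nthree"     # as assigned in the question (no real newlines)
result=$'one\ntwo\nthree'      # what the sed pipeline actually produces

# %q makes otherwise invisible differences visible.
printf 'expected: %q\n' "$expected"
printf 'result:   %q\n' "$result"
```

The first line comes out with its backslashes doubled, while the second is shown in $'...' form, so the two can no longer be mistaken for each other.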

There is an extra newline character \n in the string returned by your function.
Octal dump
$echo '"one","two","three"' | sed -En $'s/[ \t]*"([^"]*)",?[ \t]*/\\1\\\n/gp' | od -c
0000000 o n e \n t w o \n t h r e e \n \n
0000017
$echo "one\ntwo\nthree" | od -c
0000000 o n e \ n t w o \ n t h r e e \n
0000020
$
Also, use echo -e
$echo "one\ntwo\nthree"
one\ntwo\nthree
$echo -e "one\ntwo\nthree"
one
two
three
$
From man page
-e enable interpretation of backslash escapes

Related

How to add a space after special characters in bash script?

I have a text file with something like,
!aa
#bb
#cc
$dd
%ee
expected output is,
! aa
# bb
# cc
$ dd
% ee
What I have tried, echo "${foo//#/# }".
This does work fine with one string but it does not work for all the lines in the file. I have tried with this while loop to read all the lines of the file and do the same using echo but it does not work.
while IFS= read -r line; do
foo=$line
sep="!##$%"
echo "${foo//$sep/$sep }"
done < $1
I have tried with awk split but it does not give the expected output. Is there any workaround for this? by using awk or sed.

The following assumes you want to add a space after every character in the !##$% set (even if it is the last character in a line). Test file:
$ cat file.txt
a!a
#bb
c#c
$dd
ee%
foo
%b%r
$ sep='!##$%'
With sed:
$ sed 's/['"$sep"']/& /g' file.txt
a! a
# bb
c# c
$ dd
ee%
foo
% b% r
With awk:
$ awk '{gsub(/['"$sep"']/,"& "); print}' file.txt
a! a
# bb
c# c
$ dd
ee%
foo
% b% r
With plain bash (not recommended, it is too slow):
$ while IFS= read -r line; do
str=""
for (( i=0; i<${#line}; i++ )); do
char="${line:i:1}"
str="$str$char"
[[ "$char" =~ [$sep] ]] && str="$str "
done
printf '%s\n' "$str"
done < file.txt
a! a
# bb
c# c
$ dd
ee%
foo
% b% r
Or (not sure which is the worst):
$ while IFS= read -r line; do
for (( i=0; i<${#sep}; i++ )); do
char="${sep:i:1}"
line="${line//$char/$char }"
done
printf '%s\n' "$line"
done < file.txt
a! a
# bb
c# c
$ dd
ee%
foo
% b% r

The characters you call special in your example seem to be a subset of the characters known as [[:punct:]] to GNU sed, so I propose the following solution:
sed 's/\([[:punct:]]\)/\1 /g' file.txt
which with file.txt content being
!aa
#bb
#cc
$dd
%ee
output
! aa
# bb
# cc
$ dd
% ee
Explanation: I use a capturing group \(...\) that matches any character belonging to [:punct:], then I replace what was captured with the content of that capture followed by a space. I use g to apply it to all occurrences in each line, though this has no visible impact for the data above. You might elect to drop g if you are sure there will be at most one character to replace in every line.
If you want to know more about [:punct:] or other similar character sets read about Character Classes on Regular-Expressions.info
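One thing to keep in mind is that [:punct:] is broader than an explicit set. The sketch below compares the two approaches on a made-up line; the sample string and the variable names are assumptions for illustration only.

```bash
#!/bin/bash
sep='!#$%'          # explicit set of characters to pad
line='a-b!c_d'      # sample input containing other punctuation too

# [[:punct:]] also matches '-' and '_', so they get a space as well.
with_punct=$(printf '%s\n' "$line" | sed 's/\([[:punct:]]\)/\1 /g')
# The explicit bracket set only touches the listed characters.
with_set=$(printf '%s\n' "$line" | sed 's/['"$sep"']/& /g')

printf '%s\n' "$with_punct" "$with_set"
```

Which of the two is preferable depends on whether the file can contain punctuation that should be left alone.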

If the file always contains a symbol at the start of each line like that, then use this:
sed -Ei 's/^(.)/\1 /g' yourfile.txt
The -E option tells sed to use extended regular expressions. -i modifies the file in place; remove it if you want to write to the console or another file. The ^(.) regex captures the first character on the line, and the replacement (\1 ) adds a space after it. The g flag is redundant here, since ^ can match only once per line.

Assuming that special characters are non-numeric and non-alphabetic characters, and special characters can appear anywhere in the line, use the following regular expression to replace them.
sed 's/[^a-zA-Z0-9]/& /g' urfile
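Note that [^a-zA-Z0-9] also matches whitespace, so existing spaces get doubled. A possible variation, shown here only as a sketch with a made-up sample line, excludes whitespace by using POSIX character classes:

```bash
#!/bin/bash
line='a b!c'   # sample input with a space and one special character

# Original expression: the space counts as "special" too.
doubled=$(printf '%s\n' "$line" | sed 's/[^a-zA-Z0-9]/& /g')
# Variation: anything that is neither alphanumeric nor whitespace.
kept=$(printf '%s\n' "$line" | sed 's/[^[:alnum:][:space:]]/& /g')

printf '<%s>\n' "$doubled" "$kept"
```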

BASH: unescape string

Suppose I have the following string:
"some\nstring\n..."
And it displays as one line when catted in bash. Further,
string_from_pipe | sed 's/\\\\/\\/g' # does not work
| awk '{print $0}'
| awk '{s = $0; print s}'
| awk '{s = $0; printf "%s",s}'
| echo $0
| sed 's/\\(.)/\1/g'
# all have not worked.
How do I unescape this string such that it prints as:
some
string
Or even displays that way inside a file?

POSIX sh provides printf %b for just this purpose:
s='some\nstring\n...'
printf '%b\n' "$s"
...will emit:
some
string
...
More to the point, the APPLICATION USAGE section of the POSIX spec for echo explicitly suggests using printf %b for this purpose rather than relying on optional XSI extensions.

As you observed, echo does not solve the problem:
$ s="some\nstring\n..."
$ echo "$s"
some\nstring\n...
You haven't mentioned where you got that string or which escapes are in it.
Using a POSIX-compliant shell's printf
If the escapes are ones supported by printf, then try:
$ printf '%b\n' "$s"
some
string
...
Using sed
$ echo "$s" | sed 's/\\n/\n/g'
some
string
...
Using awk
$ echo "$s" | awk '{gsub(/\\n/, "\n")} 1'
some
string
...

If you have the string in a variable (say myvar), you can use:
${myvar//\\n/$'\n'}
For example:
$ myvar='hello\nworld\nfoo'
$ echo "${myvar//\\n/$'\n'}"
hello
world
foo
$
(Note: it's usually safer to use printf %s <string> than echo <string>, if you don't have full control over the contents of <string>.)
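A small sketch of how this expansion differs from printf %b: the substitution only rewrites \n, while %b interprets the other backslash escapes as well. The sample value is made up for illustration.

```bash
#!/bin/bash
myvar='a\tb\nc'   # contains a literal \t and a literal \n

# Parameter expansion: only the \n sequence becomes a newline.
only_newlines=${myvar//\\n/$'\n'}
# printf %b: \t becomes a tab as well.
all_escapes=$(printf '%b' "$myvar")

printf '%s\n---\n%s\n' "$only_newlines" "$all_escapes"
```

Keep in mind that the command substitution strips trailing newlines from the printf output, so a string ending in \n would lose that final newline.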

How about using the -e option of echo?
$ s="some\nstring\n..." && echo -e "$s"
some
string
...
From the echo man-page
-e enable interpretation of the following backslash escapes
[...]
\a alert (bell)
\b backspace
\c suppress further output
\e escape character
\f form feed
\n new line
\r carriage return
\t horizontal tab
\v vertical tab
\\ backslash
\0nnn the character whose ASCII code is NNN (octal). NNN can be 0 to 3 octal digits
\xHH the eight-bit character whose value is HH (hexadecimal). HH can be one or two hex digits

Remove leading zeros from MAC address

I have a MAC address that looks like this.
01:AA:BB:0C:D0:E1
I want to convert it to lowercase and strip the leading zeros.
1:aa:bb:c:d0:e1
What's the simplest way to do that in a Bash script?

$ echo 01:AA:BB:0C:D0:E1 | sed 's/\(^\|:\)0/\1/g;s/.*/\L\0/'
1:aa:bb:c:d0:e1
\(^\|:\)0 represents either the line start (^) or a :, followed by a 0.
We want to replace this by the capture (either line start or :), which removes the 0.
Then, a second substitution (s/.*/\L\0/) puts the whole line in lowercase.
$ sed --version | head -1
sed (GNU sed) 4.2.2
EDIT: Alternatively:
echo 01:AA:BB:0C:D0:E1 | sed 's/0\([0-9A-Fa-f]\)/\1/g;s/.*/\L\0/'
This replaces 0x (with x any hexa digit) by x.
EDIT: if your sed does not support \L, use tr:
echo 01:AA:BB:0C:D0:E1 | sed 's/0\([0-9A-Fa-f]\)/\1/g' | tr '[:upper:]' '[:lower:]'

Here's a pure Bash≥4 possibility:
mac=01:AA:BB:0C:D0:E1
IFS=: read -r -d '' -a macary < <(printf '%s:\0' "$mac")
macary=( "${macary[@]#0}" )
macary=( "${macary[@],,}" )
IFS=: eval 'newmac="${macary[*]}"'
The line IFS=: read -r -d '' -a macary < <(printf '%s:\0' "$mac") is the canonical way to split a string into an array,
the expansion "${macary[@]#0}" is that of the array macary with leading 0 (if any) removed,
the expansion "${macary[@],,}" is that of the array macary in lowercase,
IFS=: eval 'newmac="${macary[*]}"' is a standard way to join the fields of an array (note that the use of eval is perfectly safe).
After that:
declare -p newmac
yields
declare -- newmac="1:aa:bb:c:d0:e1"
as required.

A more robust way is to validate the MAC address first:
mac=01:AA:BB:0C:D0:E1
a='([[:xdigit:]]{2})' ; regex="^$a:$a:$a:$a:$a:$a$"
[[ $mac =~ $regex ]] || { echo "Invalid MAC address" >&2; exit 1; }
And then, using the valid result of the regex match (BASH_REMATCH):
set -- $(printf '%x ' $(printf '0x%s ' "${BASH_REMATCH[@]:1}" ))
IFS=: eval 'printf "%s\n" "$*"'
Which will print:
1:aa:bb:c:d0:e1
Hex values without leading zeros and in lowercase.
If Uppercase is needed, change the printf '%x ' to printf '%X '.
If Leading zeros are needed change the same to printf '%02x '.
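The validate-then-convert idea can be wrapped in a function. The following is only a sketch under the same assumptions (bash, colon-separated two-digit octets); the function name normalize_mac is made up for this example.

```bash
#!/bin/bash
# normalize_mac (hypothetical helper): print the MAC address in lowercase
# without leading zeros, or return 1 if the input is not a valid MAC.
normalize_mac() {
    local mac=$1 octet
    local -a out=()
    local hex='([[:xdigit:]]{2})'
    local regex="^${hex}:${hex}:${hex}:${hex}:${hex}:${hex}\$"

    [[ $mac =~ $regex ]] || return 1

    # BASH_REMATCH[1..6] hold the six octets; %x reprints each one
    # as lowercase hex without padding.
    for octet in "${BASH_REMATCH[@]:1}"; do
        out+=( "$(printf '%x' "0x$octet")" )
    done

    local IFS=:            # join the fields with ':' for "${out[*]}"
    printf '%s\n' "${out[*]}"
}

normalize_mac 01:AA:BB:0C:D0:E1
```

Declaring IFS as local keeps the change confined to the function, which avoids the need for eval.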

How to display a file with multiple lines as a single string with escape chars (\n)

In bash, how I can display the content of a file with multiple lines as a single string where new lines appears as \n.
Example:
$ echo "line 1
line 2" >> file.txt
I need to get the content as this "line 1\nline2" with bash commands.
I tried using a combinations of cat/printf/echo with no success.

You can use bash's printf to get something close:
$ printf "%q" "$(< file.txt)"
$'line 1\nline 2'
and in bash 4.4 there is a new parameter expansion operator to produce the same:
$ foo=$(<file.txt)
$ echo "${foo@Q}"
$'line 1\nline 2'
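If the $'...' wrapper is not wanted, one alternative is a plain pattern substitution on the file contents. This is a sketch assuming bash; the temporary file stands in for file.txt and the variable names are made up.

```bash
#!/bin/bash
tmpfile=$(mktemp)                       # stand-in for file.txt
printf 'line 1\nline 2\n' > "$tmpfile"

content=$(<"$tmpfile")                  # trailing newline is stripped here
escaped=${content//$'\n'/\\n}           # each newline -> backslash + n
printf '%s\n' "$escaped"

rm -f "$tmpfile"
```

Because $(<file) drops trailing newlines, the final newline of the file does not show up as a trailing \n in the result.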

$ cat file.txt
line 1
line 2
$ paste -s -d '~' file.txt | sed 's/~/\\n/g'
line 1\nline 2
You can use the paste command to join all the lines of the file serially with a delimiter, say ~, and then replace every ~ with \n using sed. This assumes that ~ does not occur in the file itself.

Without '\n' after line 2, you need to use echo -n
echo -n "line 1
line 2" > file.txt
od -cv file.txt
0000000 l i n e 1 \n l i n e 2
sed -z 's/\n/\\n/g' file.txt
line 1\nline 2
With '\n' after line 2
echo "line 1
line 2" > file.txt
od -cv file.txt
0000000 l i n e 1 \n l i n e 2 \n
sed -z 's/\n/\\n/g' file.txt
line 1\nline 2\n

These tools can also display character codes:
$ hexdump -v -e '/1 "%_c"' file.txt ; echo
line 1\nline 2\n
$ od -vAn -tc file.txt
l i n e 1 \n l i n e 2 \n

You could try piping the string from stdin or a file through tr and replacing the newlines.
Try this:
cat file|tr '\n' ' '
where file is the name of the file containing the newlines. This returns all the text on a single line, with each newline replaced by a space.
If you want to write the result to a file, just redirect the output of the command, like this:
cat file|tr '\n' ' ' >> file2
Here is another example:
How to remove carriage return from a string in Bash

How to perform a for loop on each character in a string in Bash?

I have a variable like this:
words="这是一条狗。"
I want to make a for loop on each of the characters, one at a time, e.g. first character="这", then character="是", character="一", etc.
The only way I know is to output each character to separate line in a file, then use while read line, but this seems very inefficient.
How can I process each character in a string through a for loop?

You can use a C-style for loop:
foo=string
for (( i=0; i<${#foo}; i++ )); do
echo "${foo:$i:1}"
done
${#foo} expands to the length of foo. ${foo:$i:1} expands to the substring starting at position $i of length 1.
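As a small illustration, the same loop can collect the characters into an array, which shows that whitespace is preserved. The sample string and the array name chars are assumptions made for this sketch.

```bash
#!/bin/bash
foo='a b'
chars=()

# Inside ${foo:i:1} the offset is an arithmetic expression,
# so the $ in front of i is optional.
for (( i = 0; i < ${#foo}; i++ )); do
    chars+=( "${foo:i:1}" )
done

printf '<%s>\n' "${chars[@]}"
```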

With sed in the dash shell under LANG=en_US.UTF-8, I got the following to work correctly:
$ echo "你好嗎 新年好。全型句號" | sed -e 's/\(.\)/\1\n/g'
你
好
嗎
新
年
好
。
全
型
句
號
and
$ echo "Hello world" | sed -e 's/\(.\)/\1\n/g'
H
e
l
l
o
w
o
r
l
d
Thus, output can be looped with while read ... ; do ... ; done
edited for sample text translate into English:
"你好嗎 新年好。全型句號" is zh_TW.UTF-8 encoding for:
"你好嗎" = How are you[ doing]
" " = a normal space character
"新年好" = Happy new year
"。全型句號" = a double-byte-sized full-stop followed by text description

${#var} returns the length of var
${var:pos:N} returns N characters from pos onwards
Examples:
$ words="abc"
$ echo ${words:0:1}
a
$ echo ${words:1:1}
b
$ echo ${words:2:1}
c
so it is easy to iterate.
another way:
$ grep -o . <<< "abc"
a
b
c
or
$ grep -o . <<< "abc" | while read letter; do echo "my letter is $letter" ; done
my letter is a
my letter is b
my letter is c

I'm surprised no one has mentioned the obvious bash solution utilizing only while and read.
while read -n1 character; do
echo "$character"
done < <(echo -n "$words")
Note the use of echo -n to avoid the extraneous newline at the end. printf is another good option and may be more suitable for your particular needs. If you want to ignore whitespace then replace "$words" with "${words// /}".
Another option is fold. Please note however that it should never be fed into a for loop. Rather, use a while loop as follows:
while read char; do
echo "$char"
done < <(fold -w1 <<<"$words")
The primary benefit to using the external fold command (of the coreutils package) would be brevity. You can feed its output to another command such as xargs (part of the findutils package) as follows:
fold -w1 <<<"$words" | xargs -I% -- echo %
You'll want to replace the echo command used in the example above with the command you'd like to run against each character. Note that xargs will discard whitespace by default. You can use -d '\n' to disable that behavior.
Internationalization
I just tested fold with some of the Asian characters and realized it doesn't have Unicode support. So while it is fine for ASCII needs, it won't work for everyone. In that case there are some alternatives.
I'd probably replace fold -w1 with an awk array:
awk 'BEGIN{FS=""} {for (i=1;i<=NF;i++) print $i}'
Or the grep command mentioned in another answer:
grep -o .
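One way to consume the grep output without a for loop is to read it into an array first. The sketch below assumes bash 4 or later for mapfile (which is not used elsewhere in this answer) and uses an ASCII sample; whether multibyte characters are split correctly depends on the locale in effect.

```bash
#!/bin/bash
words='a b'

# grep -o . prints every matched character on its own line;
# mapfile -t stores each line as one array element.
mapfile -t chars < <(grep -o . <<< "$words")

for c in "${chars[@]}"; do
    printf '<%s>\n' "$c"
done
```

Iterating over the quoted array keeps the space as its own element, which an unquoted for loop over the command output would not.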
Performance
FYI, I benchmarked the aforementioned options. The while, fold, awk and grep loops were all fast and nearly tied, with the while loop slightly faster than the fold loop in the run below. Unsurprisingly xargs was the slowest... roughly 75x slower.
Here is the (abbreviated) test code:
words=$(python -c 'from string import ascii_letters as l; print(l * 100)')
testrunner(){
for test in test_while_loop test_fold_loop test_fold_xargs test_awk_loop test_grep_loop; do
echo "$test"
(time for (( i=1; i<$((${1:-100} + 1)); i++ )); do "$test"; done >/dev/null) 2>&1 | sed '/^$/d'
echo
done
}
testrunner 100
Here are the results:
test_while_loop
real 0m5.821s
user 0m5.322s
sys 0m0.526s
test_fold_loop
real 0m6.051s
user 0m5.260s
sys 0m0.822s
test_fold_xargs
real 7m13.444s
user 0m24.531s
sys 6m44.704s
test_awk_loop
real 0m6.507s
user 0m5.858s
sys 0m0.788s
test_grep_loop
real 0m6.179s
user 0m5.409s
sys 0m0.921s

I believe there is still no ideal solution that would correctly preserve all whitespace characters and is fast enough, so I'll post my answer. Using ${foo:$i:1} works, but is very slow, which is especially noticeable with large strings, as I will show below.
My idea is an expansion of a method proposed by Six, which involves read -n1, with some changes to keep all characters and work correctly for any string:
while IFS='' read -r -d '' -n 1 char; do
# do something with $char
done < <(printf %s "$string")
How it works:
IFS='' - Redefining internal field separator to empty string prevents stripping of spaces and tabs. Doing it on a same line as read means that it will not affect other shell commands.
-r - Means "raw", which prevents read from treating \ as an escape character (including as a line-continuation marker at the end of a line).
-d '' - Passing empty string as a delimiter prevents read from stripping newline characters. Actually means that null byte is used as a delimiter. -d '' is equal to -d $'\0'.
-n 1 - Means that one character at a time will be read.
printf %s "$string" - Using printf instead of echo -n is safer, because echo treats -n and -e as options. If you pass "-e" as a string, echo will not print anything.
< <(...) - Passing string to the loop using process substitution. If you use here-strings instead (done <<< "$string"), an extra newline character is appended at the end. Also, passing string through a pipe (printf %s "$string" | while ...) would make the loop run in a subshell, which means all variable operations are local within the loop.
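The options described above can be checked on a short string that contains a space, a backslash and a trailing newline. This is a sketch assuming bash; the sample string and the variable name rebuilt are made up for illustration.

```bash
#!/bin/bash
string=$'a b\\c\n'    # a, space, b, backslash, c, newline
rebuilt=''

# Rebuild the string one character at a time.
while IFS='' read -r -d '' -n 1 char; do
    rebuilt+=$char
done < <(printf %s "$string")

if [ "$rebuilt" = "$string" ]; then
    echo "identical"
else
    echo "different"
fi
```

Because the loop runs in the current shell rather than in a pipeline subshell, rebuilt is still set after the loop ends.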
Now, let's test the performance with a huge string.
I used the following file as a source:
https://www.kernel.org/doc/Documentation/kbuild/makefiles.txt
The following script was called through time command:
#!/bin/bash
# Saving contents of the file into a variable named `string'.
# This is for test purposes only. In real code, you should use
# `done < "filename"' construct if you wish to read from a file.
# Using `string="$(cat makefiles.txt)"' would strip trailing newlines.
IFS='' read -r -d '' string < makefiles.txt
while IFS='' read -r -d '' -n 1 char; do
# remake the string by adding one character at a time
new_string+="$char"
done < <(printf %s "$string")
# confirm that new string is identical to the original
diff -u makefiles.txt <(printf %s "$new_string")
And the result is:
$ time ./test.sh
real 0m1.161s
user 0m1.036s
sys 0m0.116s
As we can see, it is quite fast.
Next, I replaced the loop with one that uses parameter expansion:
for (( i=0 ; i<${#string}; i++ )); do
new_string+="${string:$i:1}"
done
The output shows exactly how bad the performance loss is:
$ time ./test.sh
real 2m38.540s
user 2m34.916s
sys 0m3.576s
The exact numbers may vary on different systems, but the overall picture should be similar.

I've only tested this with ascii strings, but you could do something like:
while test -n "$words"; do
c=${words:0:1} # Get the first character
echo character is "'$c'"
words=${words:1} # trim the first character
done

It is also possible to split the string into a character array using fold and then iterate over this array:
for char in `echo "这是一条狗。" | fold -w1`; do
echo $char
done

The C style loop in @chepner's answer is in the shell function update_terminal_cwd, and the grep -o . solution is clever, but I was surprised not to see a solution using seq. Here's mine:
read word
for i in $(seq 1 ${#word}); do
echo "${word:i-1:1}"
done

#!/bin/bash
word=$(echo 'Your Message' |fold -w 1)
for letter in ${word} ; do echo "${letter} is a letter"; done
Here is the output:
Y is a letter
o is a letter
u is a letter
r is a letter
M is a letter
e is a letter
s is a letter
s is a letter
a is a letter
g is a letter
e is a letter

To iterate ASCII characters on a POSIX-compliant shell, you can avoid external tools by using the Parameter Expansions:
#!/bin/sh
str="Hello World!"
while [ ${#str} -gt 0 ]; do
next=${str#?}
echo "${str%"$next"}"
str=$next
done
or
str="Hello World!"
while [ -n "$str" ]; do
next=${str#?}
echo "${str%"$next"}"
str=$next
done

sed works with unicode
IFS=$'\n'
for z in $(sed 's/./&\n/g' <(printf '你好嗎')); do
echo hello: "$z"
done
outputs
hello: 你
hello: 好
hello: 嗎

Another approach, if you don't care about whitespace being ignored:
for char in $(sed -E s/'(.)'/'\1 '/g <<<"$your_string"); do
# Handle $char here
done

Another way is:
Characters="TESTING"
index=1
while [ $index -le ${#Characters} ]
do
echo ${Characters} | cut -c${index}-${index}
index=$(expr $index + 1)
done

fold and while read are great for the job as shown in some answers here. Contrary to those answers, I think it's much more intuitive to pipe in the order of execution:
echo "asdfg" | fold -w 1 | while read c; do
echo -n "$c "
done
Outputs: a s d f g

I share my solution:
read word
for char in $(grep -o . <<<"$word") ; do
echo $char
done

TEXT="hello world"
for i in {1..${#TEXT}}; do
echo ${TEXT[i]}
done
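where {1..N} is an inclusive range,
${#TEXT} is the number of characters in the string,
${TEXT[i]} gets a character from the string like an item from an array.
Note that this is zsh syntax. In bash, brace expansion is performed before parameter expansion, so {1..${#TEXT}} is not expanded into a sequence, and a string cannot be subscripted like an array.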
