How does grep handle DOS end of line? - bash

I have a Windows text file which contains a line (with ending CRLF)
aline
The following is several commands' output:
[root@panel ~]# grep aline file.txt
aline
[root@panel ~]# grep aline$'\r' file.txt
[root@panel ~]# grep aline$'\r'$'\n' file.txt
[root@panel ~]# grep aline$'\n' file.txt
aline
The first command's output is as expected. I'm curious about the second and third: why is the output an empty line? And I expected the last command to find nothing, yet it actually finds the string. Why? The commands are run on CentOS/bash.

In this case grep really does match the string "aline\r"; you just don't see it because it is wiped out by the ANSI escape sequences grep prints for color. Pipe the output to od -c and you'll see:
$ grep aline file.txt
aline
$ grep aline$'\r' file.txt
$ grep aline$'\r' --color=never file.txt
aline
$ grep aline$'\r' --color=never file.txt | od -c
0000000 a l i n e \r \n
0000007
$ grep aline$'\r' --color=always file.txt | od -c
0000000 033 [ 0 1 ; 3 1 m 033 [ K a l i n e
0000020 \r 033 [ m 033 [ K \n
0000030
With --color=never you can see the output string because grep doesn't emit any color sequences: \r simply moves the cursor back to the start of the line, then the newline is printed, and nothing is overwritten. With --color=auto (which appears to be in effect here, probably through an alias), grep checks whether its output is a terminal or a pipe and prints the matched string in color if the terminal supports it. As the od -c dump shows, the color reset that follows \r ends with ESC [ K, the "erase to end of line" sequence. Because the cursor is already back at the start of the line, that sequence erases the text that was just printed.
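As a quick check that the carriage return really is in the match, here is a minimal sketch. It assumes GNU grep and a scratch file created with mktemp; the variable names are only illustrative.

```shell
# Create a one-line scratch file with a DOS (CRLF) line ending.
tmp=$(mktemp)
printf 'aline\r\n' > "$tmp"

# --color=never keeps escape sequences out of the output, and od -c
# shows the trailing \r instead of letting the terminal act on it.
matched=$(grep --color=never "aline"$'\r' "$tmp" | od -An -c | tr -s ' ')
echo "$matched"

# Count the lines that end in a carriage return.
crlf_lines=$(grep -c $'\r$' "$tmp")
echo "$crlf_lines"

rm -f "$tmp"
```

A non-zero count from the second command is a cheap way to detect DOS line endings before running other line-oriented tools on a file.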
To match \n you can use the -z option, which makes the null byte the line separator instead of the newline:
$ grep -z aline$'\r'$'\n' --color=never file.txt
aline
$ grep -z aline$'\r'$'\n' --color=never file.txt | od -c
0000000 a l i n e \r \n \0
0000010
$ grep -z aline$'\r'$'\n' --color=always file.txt | od -c
0000000 033 [ 0 1 ; 3 1 m 033 [ K a l i n e
0000020 \r 033 [ m 033 [ K \n \0
0000031
Your last command grep aline$'\n' file.txt works because \n is simply a word separator in bash, so the command is just the same as grep aline file.txt. Exactly the same thing happened in the 3rd line: grep aline$'\r'$'\n' file.txt. To pass a newline you must quote the argument to prevent word splitting:
$ echo "aline" | grep -z "aline$(echo $'\n')"
aline
To demonstrate the effect of the quote with the 3rd line I added another line to the file
$ cat file.txt
aline
another line
$ grep -z "aline$(echo $'\n')" file.txt | od -c
0000000 a l i n e \r \n a n o t h e r l
0000020 i n e \n \0
0000025
$ grep -z "aline$(echo $'\n')" file.txt
aline
another line
$

If the input is not well-formed, the behavior is undefined.
In practice, some versions of GNU grep use CR for internal purposes, so attempting to match it does not work at all, or produces really bizarre results.
For not entirely different reasons, passing in a literal newline as part of the regular expression could have some odd interpretations, including, but not limited to, interpreting the argument as two separate patterns. (Look at how grep -F reads from a file, and imagine that at least some implementations use the same logic to parse the command line.)
In the grand scheme of things, the sane solution is to fix the input so it's a valid text file before attempting to run Unix line-oriented tools on it.
For quick and dirty solutions, some tools have well-defined semantics for random binary input. Perl is a model citizen in this respect.
bash$ perl -ne 'print if /aline\r$/' <<<$'aline\r'
aline
Awk also tends to work amicably, though there are several implementations, so the risk that somebody somewhere has a version which doesn't behave identically to AT&T Awk is higher.
Maybe notice also how \r is the last character before the end of the line (the DOS line ending is the sequence CR LF, where LF is the standard Unix line terminator for text files).
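One way to "fix the input" first is to strip the carriage returns before grep ever sees them. This is only a sketch: strip_cr is a hypothetical helper name, and tr -d '\r' removes every CR in the stream, not just those at line ends.

```shell
# strip_cr (hypothetical helper): delete all carriage returns from stdin.
strip_cr() {
  tr -d '\r'
}

# With the CRs gone, an ordinary anchored pattern matches the whole line.
printf 'aline\r\nanother line\r\n' | strip_cr | grep -c '^aline$'
```

If dos2unix is installed it does the same job on files in place, but tr is available practically everywhere.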

At least for me phuclv's answer doesn't completely cover the last case, i.e. grep aline$'\n' file.txt.
Your mileage may vary depending on which shell and which version and implementation of grep you are using, but for me grep -z "aline$(echo $'\n')" and grep -z aline$'\n' both just match the same pattern as grep -z aline.
This becomes more apparent if the -o switch is used, so that grep outputs only the matched string and not the entire line (which is the entire file for most text files when the -z option is used).
If you use the same file.txt as in phuclv's second example:
$ cat file.txt
aline
another line
$ grep -z "aline$(echo $'\n')" file.txt | od -c
0000000 a l i n e \r \n a n o t h e r l
0000020 i n e \n \0
0000025
$ grep -z -o "aline$(echo $'\n')" file.txt | od -c
0000000 a l i n e \0
0000006
$ grep -z -o aline$'\n' file.txt | od -c
0000000 a l i n e \0
0000006
$ grep -z -o aline file.txt | od -c
0000000 a l i n e \0
0000006
To actually match a \n as part of the pattern I had to use the -P switch to turn on Perl-compatible regular expressions:
$ grep -z -o -P 'aline\r\n' file.txt | od -c
0000000 a l i n e \r \n \0
0000010
$ grep -z -o -P 'aline\r\nanother' file.txt | od -c
0000000 a l i n e \r \n a n o t h e r \0
0000017
For reference:
grep --version|head -n1
grep (GNU grep) 3.1
bash --version|head -n1
GNU bash, version 4.4.20(1)-release (x86_64-pc-linux-gnu)
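Part of the reason "aline$(echo $'\n')" behaves like plain aline can be checked in the shell alone: command substitution strips trailing newlines, so no newline ever reaches grep. A small sketch (the variable names are illustrative):

```shell
# Command substitution removes the trailing newlines that echo produced.
pattern="aline$(echo $'\n')"
printf '%s' "$pattern" | od -An -c

# ANSI-C quoting without command substitution does keep the newline.
with_nl="aline"$'\n'

# Compare the lengths: the first is just "aline".
echo "${#pattern} ${#with_nl}"
```

What grep then does with a pattern that contains a real newline is a separate matter and depends on the grep implementation.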

Related

Why does herestring add a newline [duplicate]

It seems that here string is adding line break. Is there a convenient way of removing it?
$ string='test'
$ echo -n $string | md5sum
098f6bcd4621d373cade4e832627b4f6 -
$ echo $string | md5sum
d8e8fca2dc0f896fd7cb4cb0031ba249 -
$ md5sum <<<"$string"
d8e8fca2dc0f896fd7cb4cb0031ba249 -
Yes, you are right: <<< adds a trailing new line.
You can see it with:
$ cat - <<< "hello" | od -c
0000000 h e l l o \n
0000006
Let's compare this with the other approaches:
$ echo "hello" | od -c
0000000 h e l l o \n
0000006
$ echo -n "hello" | od -c
0000000 h e l l o
0000005
$ printf "hello" | od -c
0000000 h e l l o
0000005
So we have the table:
         | adds new line |
---------|---------------|
printf   | No            |
echo -n  | No            |
echo     | Yes           |
<<<      | Yes           |
From Why does a bash here-string add a trailing newline char?:
Most commands expect text input. In the unix world, a text file
consists of a sequence of lines, each ending in a
newline.
So in most cases a final newline is required. An especially common
case is to grab the output of a command with a command substitution,
process it in some way, then pass it to another command. The command
substitution strips final newlines; <<< puts one back.
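The strip-then-restore behaviour described in that quote can be observed directly. A minimal sketch, with illustrative variable names:

```shell
# Command substitution strips all trailing newlines, however many there are.
captured=$(printf 'test\n\n\n')
echo "${#captured}"

# A here-string then appends exactly one newline again.
restored_bytes=$(wc -c <<< "$captured")
# Arithmetic expansion drops the padding some wc implementations print.
echo "$((restored_bytes))"
```

The first number is the length of the bare string and the second is one byte larger, which is the newline added by <<<.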
fedorqui's helpful answer shows that and why here-strings (and also here-documents) invariably append a newline.
As for:
Is there a convenient way of removing it?
In Bash, use printf inside a process substitution as an "\n-less" alternative to a here-string:
... < <(printf %s ...)
Applied to your example:
$ md5sum < <(printf %s 'test')
098f6bcd4621d373cade4e832627b4f6 -
Alternatively, as user202729 suggests, simply use printf %s in the pipeline, which has the added advantage of not only using a more familiar feature but also making the command work in (more strictly) POSIX-compliant shells (in scripts targeting /bin/sh):
$ printf %s 'test' | md5sum
098f6bcd4621d373cade4e832627b4f6 -
A "here doc" also adds a newline:
$ string="hello test"
$ cat <<_test_ | xxd
> $string
> _test_
0000000: 6865 6c6c 6f20 7465 7374 0a hello test.
A "here string" does too:
$ cat <<<"$string" | xxd
0000000: 6865 6c6c 6f20 7465 7374 0a hello test.
Probably the easiest way to get a string that does not end in a newline is printf:
$ printf '%s' "$string" | xxd
0000000: 6865 6c6c 6f20 7465 7374 hello test

Two seemingly identical strings with newlines not equal

I am trying to convert a list of quoted strings, separated by commas, into list of strings separated by newlines using bash and sed.
Here is an example of what I am doing:
#!/bin/bash
comma_to_newline() {
sed -En $'s/[ \t]*"([^"]*)",?[ \t]*/\\1\\\n/gp'
}
input='"one","two","three"'
expected="one\ntwo\nthree"
result="$( echo "${input}" | comma_to_newline )"
echo "Expected: <${expected}>"
echo "Result: <${result}>"
if [ "${result}" = "${expected}" ]; then
echo "EQUAL!"
else
echo "NOT EQUAL!"
fi
And the output I am getting is:
Expected: <one
two
three>
Result: <one
two
three>
NOT EQUAL!
I know it has something to do with the newlines characters, but I can't work out what. If I replace the newlines with some other string, such as XXX, it works fine and bash reports the strings as being equal.
Prompted by the comments on my question, I managed to work out what was going on. I was so focused on coming up with a working sed expression and ensuring that result was correct that I failed to notice that the expected string was incorrect.
In order to use \n newlines in a bash string, you have to use the $'one\ntwo\nthree' syntax - see How can I have a newline in a string in sh? for other solutions.
I was developing against bash version 3.2.57 (the version that comes with Mac OS 10.14.6). There, after assigning a variable with expected="one\ntwo\nthree", echoing it displayed the \n sequences as newlines in the console. Newer versions of bash display them as literal escapes, so I assume it is a bug that has been fixed in later versions of bash.
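The difference between the two spellings is easy to see by comparing lengths. This is a sketch using the strings from the question; the variable names literal, ansi and expanded are illustrative.

```shell
# Double quotes keep backslash-n as two literal characters.
literal="one\ntwo\nthree"
# ANSI-C quoting turns \n into a real newline.
ansi=$'one\ntwo\nthree'
echo "${#literal} ${#ansi}"

# printf %b interprets the escapes, which makes the two strings comparable.
printf -v expanded '%b' "$literal"
if [ "$expanded" = "$ansi" ]; then
  echo "EQUAL!"
else
  echo "NOT EQUAL!"
fi
```

Using printf instead of echo here avoids depending on how a particular shell's echo treats backslashes.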
For diagnosing seemingly identical strings, try combining side-by-side diff output with a one char per line hexdump format. Replace:
else
echo "NOT EQUAL!"
fi
...with:
else
echo "NOT EQUAL!"
diff -y \
<(hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"' <<< "${expected}") \
<(hexdump -v -e '/1 "%_ad# "' -e '/1 " _%_u\_\n"' <<< "${result}")
fi
There is an extra newline character \n in the string returned from your function.
Octal dump
$ echo '"one","two","three"' | sed -En $'s/[ \t]*"([^"]*)",?[ \t]*/\\1\\\n/gp' | od -c
0000000 o n e \n t w o \n t h r e e \n \n
0000017
$ echo "one\ntwo\nthree" | od -c
0000000 o n e \ n t w o \ n t h r e e \n
0000020
$
Also, use echo -e:
$ echo "one\ntwo\nthree"
one\ntwo\nthree
$ echo -e "one\ntwo\nthree"
one
two
three
$
From man page
-e enable interpretation of backslash escapes

How to use awk to find a char in a string in bash

I have a char variable called sign and a given string sub. I need to find out how many times this sign appears in the sub and cannot use grep.
For example:
sign = c
sub = mechanic cup cat
echo "$sub" | awk <code i am asking for> | wc -l
And the output should be 4 because c appears 4 times. What should be inside <>?
sign=c
sub='mechanic cup cat'
echo "$sub" |
awk -v sign="$sign" -F '' '{for (i=1;i<=NF;i++){if ($i==sign) cnt++}} END{print cnt}'
Edit:
Changes for the requirements in the comment:
Test if the length of sign is 1 (no = present). If true, change sign and sub to lowercase to ignore the case.
Use ${sign:0:1} to only pass the first character to awk.
sign=c
sub='mechanic Cup cat'
if [ "${#sign}" -eq 1 ]; then
sign=${sign,,}
sub=${sub,,}
fi
echo "$sub" |
awk -v sign="${sign:0:1}" -F '' '{for (i=1;i<=NF;i++){if ($i==sign) cnt++}} END{print cnt}'
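Splitting on an empty field separator (-F '') works in gawk but is not guaranteed in every awk. A more portable sketch walks the string with substr instead; count_char is a hypothetical helper name, and note that awk's -v processes backslash escapes in the values passed to it.

```shell
# count_char (hypothetical helper): count occurrences of a single character.
#   $1: the character to count   $2: the string to search
count_char() {
  awk -v sign="$1" -v text="$2" 'BEGIN {
    n = 0
    # Compare each character of text against sign, one position at a time.
    for (i = 1; i <= length(text); i++)
      if (substr(text, i, 1) == sign) n++
    print n
  }'
}

count_char c 'mechanic cup cat'
```

Because everything happens in the BEGIN block, no input needs to be piped in.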
A combination of Quasimodo's comment and Freddy's lower-case example:
$ sign=c
$ sub='mechanic Cup cat'
A tr + wc solution if ${sign} is a single character.
Count the number of times ${sign} shows up in ${sub}, ignoring case:
$ tr -cd "${sign,,}" <<< "${sub,,}" | wc -c
4
Where:
${sign,,} and ${sub,,} - convert to all lowercase
tr -cd "..." - -d says to delete the listed characters, while -c says to take the complement of the list, so tr -cd "${sign,,}" removes everything except the character stored in ${sign}
<<< ... - here string (feeds the string to tr on standard input)
wc -c - counts the number of characters that remain
NOTE: This only works if ${sign} contains a single character.
A sed solution that should work regardless of the number of characters in ${sign}.
$ sub='mechanic Cup cat'
First we embed a new line character before each occurrence of ${sign,,}:
$ sign=c
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,}
me
chani
c
cup
cat
$ sign=cup
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,}
mechanic
cup cat
Where:
\(${sign,,}\) - find the pattern that matches ${sign} (all lowercase) and assign to position 1
\n\1 - place a newline (\n) in the stream just before our pattern in position 1
At this point we just want the lines that start with ${sign,,}, which is where tail -n +2 comes into play (ie, display lines 2 through n):
$ sign=c
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,} | tail -n +2
chani
c
cup
cat
$ sign=cup
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,} | tail -n +2
cup cat
And now we pipe to wc -l to get a line count (ie, count the number of times ${sign} shows up in ${sub}, ignoring case):
$ sign=c
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,} | tail -n +2 | wc -l
4
$ sign=cup
$ sed "s/\(${sign,,}\)/\n\1/g" <<< ${sub,,} | tail -n +2 | wc -l
1
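The same count can also be obtained without any external tools, by removing every occurrence of the substring and comparing lengths. This is a sketch assuming bash 4 or later (for the ,, case conversion); count_substr is a hypothetical helper name, and it counts non-overlapping, case-insensitive matches.

```shell
# count_substr (hypothetical helper): count non-overlapping occurrences
# of $1 in $2, ignoring case.
count_substr() {
  local needle=${1,,} haystack=${2,,}
  # An empty needle would cause a division by zero below.
  [ -n "$needle" ] || { echo 0; return; }
  # Delete every occurrence; the quotes make the needle match literally.
  local stripped=${haystack//"$needle"/}
  echo $(( (${#haystack} - ${#stripped}) / ${#needle} ))
}

count_substr c 'mechanic Cup cat'
count_substr cup 'mechanic Cup cat'
```

The length difference divided by the needle length gives the number of removed occurrences.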

How to display a file with multiple lines as a single string with escape chars (\n)

In bash, how can I display the content of a file with multiple lines as a single string where newlines appear as \n?
Example:
$ echo "line 1
line 2" >> file.txt
I need to get the content as this "line 1\nline2" with bash commands.
I tried using a combinations of cat/printf/echo with no success.
You can use bash's printf to get something close:
$ printf "%q" "$(< file.txt)"
$'line 1\nline 2'
and in bash 4.4 there is a new parameter expansion operator to produce the same:
$ foo=$(<file.txt)
$ echo "${foo@Q}"
$'line 1\nline 2'
$ cat file.txt
line 1
line 2
$ paste -s -d '~' file.txt | sed 's/~/\\n/g'
line 1\nline 2
You can use the paste command to join all the lines of the file serially with a delimiter, say ~, and then replace every ~ with \n using sed.
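If the file might itself contain the chosen delimiter, a variant that avoids a placeholder character is to let awk print the separator between records. This is a sketch; escape_newlines is a hypothetical helper name.

```shell
# escape_newlines (hypothetical helper): join lines with a literal
# backslash-n between them, reading stdin or the files given as arguments.
escape_newlines() {
  # sep is empty before the first line and "\n" (two characters) afterwards.
  awk '{ printf "%s%s", sep, $0; sep = "\\n" } END { print "" }' "$@"
}

printf 'line 1\nline 2\n' | escape_newlines
```

The separator is only emitted between lines, so the final newline of the file does not produce a trailing \n in the result.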
Without '\n' after line 2, you need to use echo -n
echo -n "line 1
line 2" > file.txt
od -cv file.txt
0000000 l i n e 1 \n l i n e 2
sed -z 's/\n/\\n/g' file.txt
line 1\nline 2
With '\n' after line 2
echo "line 1
line 2" > file.txt
od -cv file.txt
0000000 l i n e 1 \n l i n e 2 \n
sed -z 's/\n/\\n/g' file.txt
line 1\nline 2\n
These tools can also display the character codes:
$ hexdump -v -e '/1 "%_c"' file.txt ; echo
line 1\nline 2\n
$ od -vAn -tc file.txt
l i n e 1 \n l i n e 2 \n
You could try piping the text from stdin or a file and translating the unwanted character.
Try this:
cat file|tr '\n' ' '
where file is the name of the file containing the newlines. This returns all the text on a single line, with each newline replaced by a space.
If you want to write the result to a file, just redirect the output of the command, like this:
cat file|tr '\n' ' ' >> file2
Here is another example:
How to remove carriage return from a string in Bash

