sed is adding character to start of word vs end - bash

So I'm trying to add a "!" to the end of every word in my placesCap file, writing the result to a new placesCapEx file.
This is what it looks like:
Yugoslavian
Zambia
Zambian
Zomba
This is what I want it to look like:
Yugoslavian!
Zambia!
Zambian!
Zomba!
I've tried sed 's/$/\!/' Wordlists/placesCap > Wordlists/placesCapEx and just sed 's/$/!/' Wordlists/placesCap > Wordlists/placesCapEx
What happens is that when I run this and then cat Wordlists/placesCapEx, it outputs:
!ugoslavian
!ambia
!ambian
!omba
I've done some research and someone said something about it being a Unix thing, but they never went into detail.

Your simpler sed command should work fine for a text file where end-of-line is a single newline character. You likely have "dos" format files here (carriage return / linefeed).
Consider:
$ cat zippy
Zippy
$ od -c zippy
0000000 Z i p p y \r \n
0000007
$ sed 's/$/!/' zippy
!ippy
$ sed 's/$/!/' zippy | od -c
0000000 Z i p p y \r ! \n
0000010
You're seeing the effect of the \r when it's displayed on a terminal: the cursor moves back to the start of the line, the '!' is printed there, and the newline moves to the next line.
To handle the presence of \r\n pairs as your end-of-line character, you might try:
$ sed 's/\r*$/!/' zippy
Zippy!
...assuming your sed honors the \r as mine (GNU sed 4.2.2) does.
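An alternative sketch, using the file names from the question: strip any carriage returns with tr first, then append the '!' (note that tr removes every \r in the stream, which is fine for a plain word list):
$ tr -d '\r' < Wordlists/placesCap | sed 's/$/!/' > Wordlists/placesCapEx
After that, cat Wordlists/placesCapEx should show the '!' at the end of each name rather than at the start of the line.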

Related

awk adds undesired newline at the end of last detected parameter [duplicate]

The intent of this question is to provide an answer to the daily questions whose answer is "you have DOS line endings" so we can simply close them as duplicates of this one without repeating the same answers ad nauseam.
NOTE: This is NOT a duplicate of any existing question. The intent of this Q&A is not just to provide a "run this tool" answer but also to explain the issue such that we can just point anyone with a related question here and they will find a clear explanation of why they were pointed here as well as the tool to run to solve their problem. I spent hours reading all of the existing Q&A and they are all lacking in the explanation of the issue, alternative tools that can be used to solve it, and/or the pros/cons/caveats of the possible solutions. Also some of them have accepted answers that are just plain dangerous and should never be used.
Now back to the typical question that would result in a referral here:
I have a file containing 1 line:
what isgoingon
and when I print it using this awk script to reverse the order of the fields:
awk '{print $2, $1}' file
instead of seeing the output I expect:
isgoingon what
I get the field that should be at the end of the line appearing at the start of the line, overwriting some text at the start of the line:
whatngon
or I get the output split onto 2 lines:
isgoingon
what
What could the problem be and how do I fix it?
The problem is that your input file uses DOS line endings of CRLF instead of UNIX line endings of just LF and you are running a UNIX tool on it so the CR remains part of the data being operated on by the UNIX tool. CR is commonly denoted by \r and can be seen as a control-M (^M) when you run cat -vE on the file while LF is \n and appears as $ with cat -vE.
So your input file wasn't really just:
what isgoingon
it was actually:
what isgoingon\r\n
as you can see with cat -v:
$ cat -vE file
what isgoingon^M$
and od -c:
$ od -c file
0000000   w   h   a   t       i   s   g   o   i   n   g   o   n  \r  \n
0000020
so when you run a UNIX tool like awk (which treats \n as the line ending) on the file, the \n is consumed by the act of reading the line, but that leaves the 2 fields as:
<what> <isgoingon\r>
Note the \r at the end of the second field. \r means Carriage Return which is literally an instruction to return the cursor to the start of the line so when you do:
print $2, $1
awk will print isgoingon and then will return the cursor to the start of the line before printing what which is why the what appears to overwrite the start of isgoingon.
To fix the problem, do either of these:
dos2unix file
sed 's/\r$//' file
awk '{sub(/\r$/,"")}1' file
perl -pe 's/\r$//' file
Apparently dos2unix is aka frodos in some UNIX variants (e.g. Ubuntu).
Be careful if you decide to use tr -d '\r' as is often suggested as that will delete all \rs in your file, not just those at the end of each line.
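Apart from dos2unix, those commands write the fixed text to stdout rather than changing the file; a minimal sketch for updating the file itself (the -i form assumes GNU sed, and file.tmp is just an illustrative temp-file name):
sed -i 's/\r$//' file
awk '{sub(/\r$/,"")}1' file > file.tmp && mv file.tmp file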
Note that GNU awk will let you parse files that have DOS line endings by simply setting RS appropriately:
gawk -v RS='\r\n' '...' file
but other awks will not allow that as POSIX only requires awks to support a single character RS and most other awks will quietly truncate RS='\r\n' to RS='\r'. You may need to add -v BINMODE=3 for gawk to even see the \rs though as the underlying C primitives will strip them on some platforms, e.g. cygwin.
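For example, with the sample input from this question (a quick sketch): once the \r is consumed as part of the record separator, the field reversal works as expected:
$ printf 'what isgoingon\r\n' | gawk -v RS='\r\n' '{print $2, $1}'
isgoingon what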
One thing to watch out for is that CSVs created by Windows tools like Excel will use CRLF as the line endings but can have LFs embedded inside a specific field of the CSV, e.g.:
"field1","field2.1
field2.2","field3"
is really:
"field1","field2.1\nfield2.2","field3"\r\n
so if you just convert \r\ns to \ns then you can no longer tell linefeeds within fields from linefeeds as line endings. If you want to do that, I recommend converting all of the intra-field linefeeds to something else first, e.g. this would convert all intra-field LFs to tabs and all line-ending CRLFs to LFs:
gawk -v RS='\r\n' '{gsub(/\n/,"\t")}1' file
Doing the same without GNU awk is left as an exercise, but with other awks it involves combining lines that do not end in CR as they're read.
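One possible sketch of that approach with a POSIX awk, assuming every real record ends in CRLF and, like the gawk one-liner above, turning intra-field LFs into tabs:
awk '
  { rec = (rec == "" ? $0 : rec "\t" $0) }             # a line not ending in CR continues the record
  /\r$/ { sub(/\r$/, "", rec); print rec; rec = "" }   # a CR before the LF closes the record
' file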
Also note that though CR is part of the [[:space:]] POSIX character class, it is not one of the whitespace characters that separate fields when the default FS of " " is used; those are only tab, blank, and newline. This can lead to confusing results if your input can have blanks before CRLF:
$ printf 'x y \n'
x y
$ printf 'x y \n' | awk '{print $NF}'
y
$
$ printf 'x y \r\n'
x y
$ printf 'x y \r\n' | awk '{print $NF}'
$
That's because field-separator white space is ignored at the beginning/end of a line that has LF line endings, but \r becomes the final field on a line with CRLF line endings if the character before it was whitespace:
$ printf 'x y \r\n' | awk '{print $NF}' | cat -Ev
^M$
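Stripping the \r first restores the expected behavior (a quick check):
$ printf 'x y \r\n' | awk '{sub(/\r$/,""); print $NF}'
y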
You can use the \R shorthand character class in PCRE for files with unknown line endings. There are even more line endings to consider with Unicode or other platforms. The \R form is a recommended character class from the Unicode consortium to represent all forms of a generic newline.
So if you have an 'extra' line-ending character you can find and remove it: the regex s/\R$/\n/ will normalize any combination of line endings at the end of each line into \n. Alternatively, you can use s/\R/\n/g to capture any notion of 'line ending' and standardize it into a \n character.
Given:
$ printf "what\risgoingon\r\n" > file
$ od -c file
0000000 w h a t \r i s g o i n g o n \r \n
0000020
Perl and Ruby and most flavors of PCRE implement \R combined with the end of string assertion $ (end of line in multi-line mode):
$ perl -pe 's/\R$/\n/' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
$ ruby -pe '$_.sub!(/\R$/,"\n")' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
(Note the \r between the two words is correctly left alone)
If you do not have \R you can use the equivalent of (?>\r\n|\v) in PCRE.
With straight POSIX tools, your best bet is likely awk like so:
$ awk '{sub(/\r$/,"")} 1' file | od -c
0000000 w h a t \r i s g o i n g o n \n
0000017
Things that kinda work (but know your limitations):
tr deletes all \r even if used in another context (granted the use of \r is rare, and XML processing requires that \r be deleted, so tr is a great solution):
$ tr -d "\r" < file | od -c
0000000 w h a t i s g o i n g o n \n
0000016
GNU sed works, but POSIX sed does not, since \r and \x0D are not supported by POSIX.
GNU sed only:
$ sed 's/\x0D$//' file | od -c # also sed 's/\r$//'
0000000 w h a t \r i s g o i n g o n \n
0000017
The Unicode Regular Expression Guide is probably the best bet for a definitive treatment of what a "newline" is.
Run dos2unix. While you can manipulate the line endings with code you wrote yourself, there are utilities which exist in the Linux / Unix world which already do this for you.
If on a Fedora system dnf install dos2unix will put the dos2unix tool in place (should it not be installed).
There is a similar dos2unix deb package available for Debian based systems.
From a programming point of view, the conversion is simple. Search all the characters in a file for the sequence \r\n and replace it with \n.
This means there are dozens of ways to convert from DOS to Unix using nearly every tool imaginable. One simple way is to use the command tr where you simply replace \r with nothing!
tr -d '\r' < infile > outfile
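If you would rather target the two-character \r\n sequence literally, leaving any lone \r elsewhere in the file untouched, a small sketch using perl:
perl -pe 's/\r\n/\n/' infile > outfile
Since perl -p reads one \n-terminated line at a time, the \r\n pair can only match at the end of each line.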

AWK printing single array indices with other text [duplicate]

Variable and string substitution is not working for parameters [duplicate]

Converting CR to LF using sed

I had a file containing CR and CRLF on Windows.
I ran this command on it:
$ sed -i 's \x0d \x0a ' foo
What I got back was that:
All CR that were not followed by LF were converted to LF
But
Those CR that were part of CRLF were left unchanged.
Why is that?
Assuming that you're running this on a Unix platform, using GNU sed:
sed -i 's/\r/\n/g; s/\n$//' foo
This replaces all isolated CR (\r, \x0d) instances as well as CRLF (\r\n, \x0d\x0a) sequences with one LF (\n, \x0a) each - see bottom for an explanation.
As for what you tried (again, assuming that you're running this on a Unix platform, using GNU sed):
sed reads everything up to, but not including, a LF (\n) as a single line, and, on output, terminates that line with LF.
In your case that means that a single line read would end in CR (\r) (due to sed reading up to CRLF, stripping the LF), possibly containing isolated CR instances in that line.
's \x0d \x0a ', due to not using option g, replaces at most 1 CR character with LF.
What that should have resulted in:
The first CR (\r, \x0d) instance on each line should have been replaced with LF (\n, \x0a)
Any additional CR instances on the current line - including one that is part of the line-ending CRLF sequence - would have been left alone.
Why does a correct solution need two s calls?
's/\r/\n/g' globally (g) replaces all CR (\r) instances in the current line with LF \n.
Since the CR that was part of the line-ending CRLF was therefore also replaced with \n, the in-memory line (the pattern space, in sed speak) now ends in \n.
Because sed invariably appends an LF (\n) on output, the extra trailing \n must be removed, which is what s/\n$// does.
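A quick check of the combined command (GNU sed; cat -A makes the resulting line endings visible):
$ printf 'a\rb\r\nc\r\n' | sed 's/\r/\n/g; s/\n$//' | cat -A
a$
b$
c$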
The reason for this behavior is that, on Unix, a line ending only in \r is not a complete line: it is read together with the following text, up to the next \n, as ONE input line:
$ echo -e "line1\rline2\r\nline3" |cat -A
line1^Mline2^M$
line3$
As a result your sed, without the g option, will replace only the first \r in this "concatenated" line:
$ echo -e "line1\rline2\r\nline3" |sed 's \x0d \x0a ' |cat -A
line1$
line2^M$ #this is the same input line as line1 and thus \r is not replaced a second time in the same line without g
line3$
You need to include g for global replacement of \r when it is found more than once in whatever sed considers to be a single input line:
$ echo -e "line1\rline2\r\nline3\rline4\r\nline5\r\nline6" |cat -A
line1^Mline2^M$ #line2 \r will not be replaced without g
line3^Mline4^M$ #line4 \r will not be replaced without g
line5^M$ # This \r will be replaced since it is unique on input line
line6$
$ echo -e "line1\rline2\r\nline3\rline4\r\nline5\r\nline6" |sed 's \r \n ' |cat -A
line1$
line2^M$
line3$
line4^M$
line5$ #the \r is removed from here even without g, since input line5 was alone
$
line6$
$ echo -e "line1\rline2\r\nline3\rline4\r\nline5\r\nline6" |sed 's \r \n g' |cat -A
line1$
line2$
$
line3$
line4$
$
line5$
$
line6$
Attention:
As is obvious from the above tests, replacing \r with \n turns CRLF into LFLF = \n\n, and this generates an extra blank line. This may or may not be desirable. The extra line can be removed as advised, i.e. by the s/\n$// step in mklement0's answer above.
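If you want the global replacement but not the extra blank lines it creates, one rough follow-up is to delete the empty lines afterwards (note this would also remove any lines that were genuinely empty in the original input):
$ echo -e "line1\rline2\r\nline3" | sed 's \r \n g' | sed '/^$/d' | cat -A
line1$
line2$
line3$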

Command line text find/replace for ^M (\r) and ^K (\v)

I'm trying to write a shell script that (among other things) will replace Windows line endings (^M) and vertical tabs (^K) with newlines. Sed looks like the tool to use, but I can't quite get it. I can't see why this won't work:
$ sed -i 's/^K/\n/g' article_filemakerExport.xml
sed: 1: "article_filemakerExport ...": command a expects \ followed by text
Note: I'm working on a mac.
With the Windows line ending, you want to remove the ^M (or \r or carriage return), but you want to replace the ^K with newline, it would seem.
The command I'd use is tr, twice.
tr -d '\r' < article_filemakerExport.xml | tr '\13' '\12' > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
Given that one operation is delete and the other substitute, I don't think you can combine those into a single tr invocation. You can use cp tmp.$$ article_filemakerExport.xml; rm -f tmp.$$ if you're worried about links, etc.
You could also use dos2unix to convert the CRLF to NL line endings instead of tr.
Note that tr is a pure filter; it only reads standard input and only writes to standard output. It does not read or write files directly.
Actually, I need to replace both of these with a newline.
That's easier: a single invocation of tr will do the job:
tr '\13\15' '\12\12' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
Or, if you prefer:
tr '\13\r' '\n\n' < article_filemakerExport.xml > tmp.$$ &&
mv tmp.$$ article_filemakerExport.xml || rm -f tmp.$$
I don't think there's a \z-style notation for control-K, but I'm willing to learn otherwise (it might be vertical tab, \v).
(Added the && and || rm -f tmp.$$ commands at the suggestion of Ed Morton.)
Partial list of control characters
C Oct Dec Hex Unicode Name
\a 07 7 07 U+0007 BELL
\b 10 8 08 U+0008 BACKSPACE
\t 11 9 09 U+0009 HORIZONTAL TABULATION
\n 12 10 0A U+000A LINE FEED
\v 13 11 0B U+000B VERTICAL TABULATION
\f 14 12 0C U+000C FORM FEED
\r 15 13 0D U+000D CARRIAGE RETURN
You can find a complete set of these control characters at the Unicode site (http://www.unicode.org/charts/PDF/U0000.pdf). No doubt there are many other possible places to look too.
dos2unix <article_filemakerExport.xml | tr '\013\015' '\n\n'
A BSD (OS X) sed solution, assisted by ANSI C-quoted bash strings:
sed -i "" $'s/\r$/\\\n/g; s/\v/\\\n/g' article_filemakerExport.xml
Note:
BSD sed - unlike GNU sed - requires an argument with the -i option; so, to indicate that no backup file should be created, an empty string ("") must be passed - see below for how that explains the error you got.
The command replaces \r\n with \n\n rather than \n, which is what I understand you want (to get just \n, simply make the replacement string in the first substitution empty; to replace \r even when not followed directly by \n, remove the $ after \r).
Here's a proof of concept with sample input:
$ sed $'s/\r$/\\\n/g; s/\v/\\\n/g' <<<$'one\vtwo\r\nthree\nfour'
one
two

three
four
(All line breaks in the output above are \n.)
An ANSI C-quoted string ($'...') is needed to compensate for the lack of support for escape sequences in BSD sed: the shell creates desired control characters ($'\v' creates a vertical tab (^K; $'\13' would work too), $'\r' the CR (^M), $'\n' the newline) and passes the resulting literals to sed.
\\\n results in a literal \ followed by a literal newline - BSD sed requires literal newlines in the replacement string to be \-escaped (and doesn't support the escape code \n).
As for why your command didn't work:
Note: It looks like your problems stem at least in part from assuming that BSD sed works the same as GNU sed, which, unfortunately, is not the case: there are many subtle and not so subtle differences - see https://stackoverflow.com/a/24276470/45375
The missing argument for the -i option caused sed to interpret your program as the -i argument, and your filename as the program. Since your filename starts with a, sed saw the a (append text) command, and choked on the rest of the filename (because it's not a valid a command).
Even fixing the missing -i option argument wouldn't have made the command work, for the reasons listed above (in short: no support for control-char. escape sequences), and also because of your attempt to represent a vertical tab as the string ^K (in GNU sed you could have used \v directly).
