Remove junk characters from a utf-8 file in Unix - shell

I'm getting junk characters (<9f>, <9d>, <9d>, etc.), control characters (^Z, ^M, etc.) and NUL characters (^@) in a file. I was able to remove the control and NUL characters from the file, but couldn't eliminate the junk characters. Could anyone suggest a way to remove them?
Control characters are being removed using the following command:
sed 's/\x1a//g;s/\xef\xbf\xbd//g'
Null characters are removed using the below command
tr -d '\000'
Also, please suggest a single command to remove all three types of garbage characters mentioned above.
Thanks in Advance

Strip "unusual" unicode characters
In the comments you mention that you want to strip control characters while keeping the Greek characters, so the tr solution below does not suit. One option is sed, which supports Unicode and whose [[:alpha:]] class also matches alphabetic characters outside ASCII. You first need to set LC_CTYPE, which determines which characters fall into the [[:alpha:]] class. For German with umlauts, that's e.g.
LC_CTYPE=de_DE.UTF-8
Then you can use sed to strip out everything which is not a letter or punctuation:
sed 's/[^[:alpha:];\ -#]//g' < junk.txt
What \ -# does: it matches all characters in the ASCII range between space and # (see an ASCII table). sed has a [[:punct:]] class, but unfortunately it also matches a lot of junk, so \ -# is needed.
You may need to experiment a little with LC_CTYPE; when I set it to utf-8 I could match Greek characters, but not Japanese ones.
If you only care about ascii
If you only care about regular ascii characters you can use tr: First you convert the file to a "one byte per character" encoding, since tr does not understand multibyte characters, e.g. using iconv.
Then, I'd advise you use a whitelist approach (as opposed to the blacklist approach you have in your question) as it's a lot easier to state what you want to keep, than what you want to filter out.
This command should do it:
iconv -c -f utf-8 -t latin1 < junk.txt | tr -cd '\11\12\40-\176'
This line:
converts to latin1 (a single byte per character); with -c, iconv silently drops every character that cannot be represented in latin1.
strips all characters outside this whitelist: \11\12\40-\176. The numbers are octal; have a look at an ASCII table. \11 is tab, \12 is line feed (newline), and \40-\176 covers the printable ASCII characters that are commonly considered "normal". Be aware that this removes every byte above 127, so it also strips umlauts and other special characters of your language that you might want to keep.
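For the "single command" the question asks for, one possible sketch is below. It assumes the file is meant to be UTF-8 and that tab and newline should survive; the function name strip_junk is made up for illustration. iconv -c drops bytes that do not form valid UTF-8 (such as a stray <9f>), and tr then deletes the ASCII control characters, including NUL, ^Z and ^M. Because every byte of a multibyte UTF-8 character is 0x80 or higher, tr does not touch Greek or other non-ASCII letters.

```shell
# Hypothetical helper: read text on stdin, write the cleaned text to stdout.
strip_junk() {
  # -c: silently discard input that is not valid in the source encoding
  iconv -c -f UTF-8 -t UTF-8 |
    # delete NUL-BS, VT-US (includes ^M and ^Z) and DEL; keep tab (\011) and LF (\012)
    tr -d '\000-\010\013-\037\177'
}

# Example: a stray 0x9f byte, a NUL, a ^Z and a ^M around ordinary text.
printf 'a\237b\000c\032d\r\n' | strip_junk    # prints: abcd
```

On a real file this would be used as strip_junk < junk.txt > clean.txt.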

Related

How do I remove special symbols such as & from the file?

I've been trying to clean up my huge xml file (> 6gb) with tr util. The goal is to get rid of all invalid characters and also to get rid of such things as , &, > and etc.
Here is my current implementation:
cat input.xml | tr -dc '[:print:]' > output.xml
But it only removes invalid characters. Do you have any suggestions how to achieve it with tr util?
tr probably won't work
tr is only for replacing individual characters or character classes. Your examples , &, and > are strings. We'll need another tool.
Here's an example with perl
$ cat input.xml
<xml><tag> hello&, >world!</tag></xml>
$ cat input.xml | perl -p -e 's/&.*?;//g'
<xml><tag>hello, world!</tag></xml>
Explanation:
perl -p -e 's/&.*?;//g'
perl -------------------- Run a perl program
-p ----------------- Sets up a loop around our program
-e -------------- Use what comes next as a line of our program
's/&.*?;//g' - Our program, which is a perl regular expression.
- Explanation below:
' ------------ Quotes prevent shell expansion/interpolation.
s ----------- Start a string substitution.
/ ---------- Use '/' as the command separator.
& --------- Matches literal ampersand (&),
. -------- followed by any character (.),
* ------- any number of times (*),
?; ----- until the next semicolon (?;).
// --- Replaces the matching text with the characters between the slashes (i.e. nothing at all)
g -- Allows matching the pattern multiple times per line
' - Quotes prevent shell expansion/interpolation
Note that I'm assuming a pattern of [AMPERSAND(&), SOMETHING, SEMICOLON(;)] based on the example strings you provided.
You could extend that program to also remove your invalid characters, but I'd just continue to use tr for that. It's faster at least on my system.
So putting it all together you get
cat input.xml | perl -p -e 's/&.*?;//g' | tr -dc '[:print:]' > output.xml
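If perl is not available, the same non-greedy match can be approximated in sed with a negated bracket expression. This is only a sketch: the function name strip_entities and the sample input are made up, and it shares the assumption that everything between & and the next ; is an entity. It also adds \n to the tr whitelist, because [:print:] alone does not include the newline, so tr -dc '[:print:]' joins all lines into one.

```shell
strip_entities() {
  # &[^;]*; = an ampersand, then anything except ';', then the closing ';'
  sed 's/&[^;]*;//g' |
    # keep printable characters plus newline; delete everything else
    tr -cd '[:print:]\n'
}

printf '<tag>&nbsp;hello&amp;, &gt;world!</tag>\n' | strip_entities
# prints: <tag>hello, world!</tag>
```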
Open the file in Notepad++ and use the Replace option.
A character escape is a way of representing a character in source code using only ASCII characters. In HTML you can escape the euro sign € in the following ways.
Format Name
&#x20AC; hexadecimal numeric character reference
&#8364; decimal numeric character reference
&euro; named character reference
In CSS syntax you would use one of the following.
Format Notes
\20AC must be followed by a space if the next character is one of a-f, A-F, 0-9
\0020AC must be 6 digits long, no space needed (but can be included)
A trailing space is treated as part of the escape, so use 2 spaces if you actually want to follow the escaped character with space. If using escapes in CSS identifiers, see the additional rules below.
Because you should use UTF-8 for the character encoding of the page, you won't normally need to use character escapes. You may, however, find them useful to represent invisible or ambiguous characters or characters that would otherwise interact in undesirable ways with the surrounding source code or text.

Extract text between two special characters

Trying to extract the text between the special characters "\ and \" through sed
Ex: "\hell##$\"},
expected output : hell##$
You can do it quite easily with using a capture-group and backreference with basic regular-expressions:
sed 's/^["][\]\([^\]*\).*$/\1/'
Explanation
Normal substitution sed 's/find/replace/, where
find is ^["][\] a double-quote and \ before beginning the capture \(...\) which contains [^\]* (zero or more characters not a \), the closing of the capture \) and then .*$ the remainder of the string;
replace is \1 (the first backreference) containing the text captured between \(...\).
(note: if your "\ doesn't begin the string, remove the first '^' anchor)
Example
$ echo '"\hell##$\"},' | sed 's/^["][\]\([^\]*\).*$/\1/'
hell##$
Look things over and let me know if you have questions.
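When the "\ marker is not at the start of the line, dropping the ^ anchor alone would leave the text before the marker in the output. A possible variant, sketched here with a made-up function name, consumes that prefix with a leading .* and prints only lines that actually contain both markers:

```shell
between_markers() {
  # -n plus the p flag: print only lines where the substitution succeeded
  sed -n 's/.*["][\]\([^\]*\)[\]["].*/\1/p'
}

printf '%s\n' 'key: "\hell##$\"},' | between_markers    # prints: hell##$
```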
This might work for you (GNU sed):
sed -nE '/"\\[^\\]*\\+([^\\"][^\\]*\\+)*"/{s/"\\/\n/;s/.*\n//;s/\\"/\n/;P;D}' file
The solution comes in two parts:
Firstly, a regexp to determine whether a pair of two characters exists. This can be tricky as a negated class is insufficient because edge cases can easily defeat a simplistic approach.
Secondly, once a pair of characters does exist the text between them must be extracted piece meal.

How to remove control characters in a delimited file?

I am just wondering what is the best way to remove control characters from a delimited file using sed/awk in bash. Thanks.
You can use the character class [:cntrl:] with GNU sed:
sed 's/[[:cntrl:]]//g' file.txt
From here:
‘[:cntrl:]’
Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL). In other character sets, these are the equivalent characters, if any.
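One caveat for delimited files: tab is itself a control character (octal 011), so [[:cntrl:]] also removes the delimiters of a tab-separated file. A minimal sketch that keeps tab and newline, using tr with explicit octal ranges instead (the function name is hypothetical):

```shell
strip_cntrl_keep_tab() {
  # delete 000-010, 013-037 and 177; tab (011) and newline (012) are left alone
  LC_ALL=C tr -d '\000-\010\013-\037\177'
}

# Inspect the result byte by byte: the tab survives, \001 and ESC are gone.
printf 'id\tname\001\033\n' | strip_cntrl_keep_tab | od -c
```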

Force encode from US-ASCII to UTF-8 (iconv)

I'm trying to transcode a bunch of files from US-ASCII to UTF-8.
For that, I'm using iconv:
iconv -f US-ASCII -t UTF-8 file.php > file-utf8.php
My original files are US-ASCII encoded, which makes the conversion not happen. Apparently it occurs because ASCII is a subset of UTF-8...
iconv US ASCII to UTF-8 or ISO-8859-15
And quoting:
There's no need for the textfile to appear otherwise until non-ASCII
characters are introduced
True. If I introduce a non-ASCII character in the file and save it, let's say with Eclipse, the file encoding (charset) is switched to UTF-8.
In my case, I'd like to force iconv to transcode the files to UTF-8 anyway, whether or not there are non-ASCII characters in them.
Note: The reason is my PHP code (non-ASCII files...) is dealing with some non-ASCII string, which causes the strings not to be well interpreted (french):
Il était une fois... l'homme série animée mythique d'Albert
Barillé (Procidis), 1ère
...
US ASCII -- is -- a subset of UTF-8 (see Ned's answer below)
Meaning that US ASCII files are actually encoded in UTF-8
My problem came from somewhere else
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
It looks like your problem is that the files are not actually ASCII. You need to determine what encoding they are using, and transcode them properly.
Short Answer
iconv will use whatever input/output encoding you specify regardless of what the contents of the file are. If you specify the wrong input encoding, the output will be garbled.
You can try to use the file command to detect a file's type/encoding.
file only guesses at the file encoding and may be wrong (especially in cases where special characters only appear late in large files).
even after running iconv, file may not report any change due to the limited way in which file attempts to guess at the encoding. For a specific example, see my long answer.
you can use hexdump to look at bytes of non-7-bit-ASCII text and compare against code tables for common encodings (ISO 8859-*, UTF-8) to decide for yourself what the encoding is.
7-bit ASCII (aka US ASCII) is identical at a byte level to UTF-8 and the 8-bit ASCII extensions (ISO 8859-*). So if your file only has 7-bit characters, then you can call it UTF-8, ISO 8859-* or US ASCII because at a byte level they are all identical. It only makes sense to talk about UTF-8 and other encodings (in this context) once your file has characters outside the 7-bit ASCII range.
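A check that does not depend on file's guess is to ask iconv to decode the whole file as UTF-8: it exits with a non-zero status when it hits a byte sequence that is not valid UTF-8. The sketch below wraps that in a hypothetical helper; note that a pure 7-bit ASCII file passes as well, for the reason given above.

```shell
is_valid_utf8() {
  # Decode and re-encode as UTF-8, discarding the output; only the exit status matters.
  iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"     # "café" encoded as UTF-8 (c3 a9)
is_valid_utf8 "$sample" && echo "valid UTF-8"
printf 'caf\351\n' > "$sample"         # the same word in ISO 8859-1 (e9)
is_valid_utf8 "$sample" || echo "not UTF-8"
rm -f "$sample"
```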
Long Answer
I ran into this today and came across your question. Perhaps I can add a little more information to help other people who run into this issue.
ASCII
First, the term ASCII is overloaded, and that leads to confusion.
7-bit ASCII only includes 128 characters (00-7F or 0-127 in decimal). 7-bit ASCII is also sometimes referred to as US-ASCII.
ASCII
UTF-8
UTF-8 encoding uses the same encoding as 7-bit ASCII for its first 128 characters. So a text file that only contains characters from that range of the first 128 characters will be identical at a byte level whether encoded with UTF-8 or 7-bit ASCII.
Codepage layout
ISO 8859-* and other ASCII Extensions
The term extended ASCII (or high ASCII) refers to eight-bit or larger character encodings that include the standard seven-bit ASCII characters, plus additional characters.
Extended ASCII
ISO 8859-1 (aka "ISO Latin 1") is a specific 8-bit ASCII extension standard that covers most characters for Western Europe. There are other ISO standards for Eastern European languages and Cyrillic languages. ISO 8859-1 includes encoding for characters like Ö, é, ñ and ß for German and Spanish (UTF-8 supports these characters too, but the underlying encoding is different).
"Extension" means that ISO 8859-1 includes the 7-bit ASCII standard and adds characters to it by using the 8th bit. So for the first 128 characters, ISO 8859-1 is equivalent at a byte level to both ASCII and UTF-8 encoded files. However, when you start dealing with characters beyond the first 128, you are no longer UTF-8 equivalent at the byte level, and you must do a conversion if you want your "extended ASCII" encoded file to be UTF-8 encoded.
ISO 8859 and proprietary adaptations
Before the ISO 8-bit ascii extension standards (ISO 8859-*) were released, there were many proprietary 8-bit code-pages (mapping bytes to characters) from IBM, DEC, HP, Apple, etc.
One notable way in which ISO character sets differ from code pages is
that the character positions 128 to 159, corresponding to ASCII
control characters with the high-order bit set, are specifically
unused and undefined in the ISO standards, though they had often been
used for printable characters in proprietary code pages
i.e. in all the ISO 8-bit extensions, characters 128-159 (80-9F) are not used, whereas the earlier proprietary code pages often used these positions (the ASCII control-character positions with the 8th bit set) for printable characters.
The above statement about 80-9F not being used/defined is not exactly true. Apparently in the ISO/IEC standard, this range is defined for control characters, but in the IANA character set of the same name, this range is not defined. I got this from some archived discussion on the confusingly written and misleading wikipedia page for windows-1252...but was unable to verify as the ISO standards are paywalled.
windows-1252
...to further confuse things.
After the ISO 8-bit extensions came out, Microsoft released a new code-page windows-1252 which is a superset* of ISO-8859-1 that uses the unused ISO range of characters 128-159 (80-9F) for things like smart quotes. Compare rows 8x and 9x of the code tables (iso-8859-1 windows-1252) if you don't understand.
Superset means that if you render ISO-8859-1 as windows-1252 it looks fine (because all printable characters in ISO-8859-1 also exist in windows-1252 with the same encoding)...but if you try to render windows-1252 as ISO-8859-1 and the rendered data happens to contain bytes in the 128-159 range, then those characters won't display properly.
It is very common to mislabel Windows-1252 text with the charset label
ISO-8859-1. A common result was that all the quotes and apostrophes
(produced by "smart quotes" in word-processing software) were replaced
with question marks or boxes on non-Windows operating systems, making
text difficult to read. Most modern web browsers and e-mail clients
treat the media type charset ISO-8859-1 as Windows-1252 to accommodate
such mislabeling. This is now standard behavior in the HTML5
specification, which requires that documents advertised as ISO-8859-1
actually be parsed with the Windows-1252 encoding.
So in the html5 standard, there is no encoding named ISO-8859-1, instead iso-8859-1 is one of multiple labels for encoding windows-1252.
windows-1252
html5 encodings
* - note, not technically a superset of the ISO/IEC 8859-1 standard, because the standard defines control characters in the 80-9F range and windows-1252 defines different characters in this range. But the IANA characterset 8859-1 does NOT define characters in this range, so technically it is a superset of the IANA characterset but not the ISO/IEC standard? (This is why standards should be open, so we can check these things.)
Detecting encoding with file
One lesson I learned today is that we can't trust file to always give correct interpretation of a file's character encoding.
file (command)
The command tells only what the file looks like, not what it is (in the case where file looks at the content). It is easy to fool the program by putting a magic number into a file the content of which does not match it. Thus the command is not usable as a security tool other than in specific situations.
file looks for magic numbers in the file that hint at the type, but these can be wrong, no guarantee of correctness. file also tries to guess the character encoding by looking at the bytes in the file. Basically file has a series of tests that helps it guess at the file type and encoding.
My file is a large CSV file. file reports this file as US ASCII encoded, which is WRONG.
$ ls -lh
total 850832
-rw-r--r-- 1 mattp staff 415M Mar 14 16:38 source-file
$ file -b --mime-type source-file
text/plain
$ file -b --mime-encoding source-file
us-ascii
My file has umlauts in it (ie Ö). The first non-7-bit-ascii doesn't show up until over 100k lines into the file. I suspect this is why file doesn't realize the file encoding isn't US-ASCII.
$ pcregrep -no '[^\x00-\x7F]' source-file | head -n1
102321:�
I'm on a Mac, so I'm using PCRE's grep (pcregrep). With GNU grep you could use the -P option. Alternatively, on a Mac one could install GNU grep (via Homebrew or similar).
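If neither pcregrep nor grep -P is at hand, a plain grep in the C locale can locate the first line containing a byte outside 7-bit ASCII. This is a sketch with a hypothetical function name; it assumes a grep that supports -a (treat the input as text) and -m (stop after N matching lines), which GNU and BSD grep do.

```shell
first_non_ascii_line() {
  # In the C locale, bytes 0x80-0xff are neither printable nor whitespace,
  # so the negated bracket expression matches those bytes
  # (and also ASCII control characters other than whitespace).
  LC_ALL=C grep -a -n -m 1 '[^[:print:][:space:]]' "$1" | cut -d: -f1
}
```

Called as first_non_ascii_line source-file, it prints only the line number.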
I haven't dug into the source-code of file, and the man page doesn't discuss the text encoding detection in detail, but I am guessing file doesn't look at the whole file before guessing encoding.
Whatever my file's encoding is, these non-7-bit-ASCII characters break stuff. My German CSV file is ;-separated and extracting a single column doesn't work.
$ cut -d";" -f1 source-file > tmp
cut: stdin: Illegal byte sequence
$ wc -l *
3081673 source-file
102320 tmp
3183993 total
Note the cut error and that my "tmp" file has only 102320 lines with the first special character on line 102321.
Let's take a look at how these non-ASCII characters are encoded. I dump the first non-7-bit-ascii into hexdump, do a little formatting, remove the newlines (0a) and take just the first few.
$ pcregrep -o '[^\x00-\x7F]' source-file | head -n1 | hexdump -v -e '1/1 "%02x\n"'
d6
0a
Another way. I know the first non-7-bit-ASCII char is at position 85 on line 102321. I grab that line and tell hexdump to take the two bytes starting at position 85. You can see the special (non-7-bit-ASCII) character represented by a ".", and the next byte is "M"... so this is a single-byte character encoding.
$ tail -n +102321 source-file | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
In both cases, we see the special character is represented by d6. Since this character is an Ö which is a German letter, I am guessing that ISO 8859-1 should include this. Sure enough, you can see "d6" is a match (ISO/IEC 8859-1).
Important question... how do I know this character is an Ö without being sure of the file encoding? The answer is context. I opened the file, read the text and then determined what character it is supposed to be. If I open it in Vim it displays as an Ö because Vim does a better job of guessing the character encoding (in this case) than file does.
So, my file seems to be ISO 8859-1. In theory I should check the rest of the non-7-bit-ASCII characters to make sure ISO 8859-1 is a good fit... There is nothing that forces a program to only use a single encoding when writing a file to disk (other than good manners).
I'll skip the check and move on to conversion step.
$ iconv -f iso-8859-1 -t utf8 source-file > output-file
$ file -b --mime-encoding output-file
us-ascii
Hmm. file still tells me this file is US ASCII even after conversion. Let's check with hexdump again.
$ tail -n +102321 output-file | head -n1 | hexdump -C -s85 -n2
00000055 c3 96 |..|
00000057
Definitely a change. Note that we have two bytes of non-7-bit-ASCII (represented by the "." on the right) and the hex code for the two bytes is now c3 96. If we take a look, seems we have UTF-8 now (c3 96 is the encoding of Ö in UTF-8) UTF-8 encoding table and Unicode characters
But file still reports our file as us-ascii? Well, I think this goes back to the point about file not looking at the whole file and the fact that the first non-7-bit-ASCII characters don't occur until late in the file.
I'll use sed to stick a Ö at the beginning of the file and see what happens.
$ sed '1s/^/Ö\'$'\n/' source-file > test-file
$ head -n1 test-file
Ö
$ head -n1 test-file | hexdump -C
00000000 c3 96 0a |...|
00000003
Cool, we have an umlaut. Note the encoding though is c3 96 (UTF-8). Hmm.
Checking our other umlauts in the same file again:
$ tail -n +102322 test-file | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
ISO 8859-1. Oops! It just goes to show how easy it is to get the encodings screwed up. To be clear, I've managed to create a mix of UTF-8 and ISO 8859-1 encodings in the same file.
Let's try converting our mangled (mixed encoding) test file with the umlaut (Ö) at the front and see what happens.
$ iconv -f iso-8859-1 -t utf8 test-file > test-file-converted
$ head -n1 test-file-converted | hexdump -C
00000000 c3 83 c2 96 0a |.....|
00000005
$ tail -n +102322 test-file-converted | head -n1 | hexdump -C -s85 -n2
00000055 c3 96 |..|
00000057
The first umlaut, which was already UTF-8, was interpreted as ISO 8859-1 since that is what we told iconv. Not what we want, but it is what we told iconv to do. The second umlaut is correctly converted from d6 (ISO 8859-1) to c3 96 (UTF-8).
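The double encoding can be reproduced on two bytes without the large file. The sketch below feeds the UTF-8 bytes of Ö (c3 96) to iconv while claiming they are ISO 8859-1, which yields the c3 83 c2 96 sequence seen above, and then reverses the mistake by converting back from UTF-8 to ISO 8859-1. The helper name hex is made up; od is used instead of hexdump only because it is specified by POSIX.

```shell
hex() { od -An -tx1 | tr -d ' \n'; }    # print stdin as a plain hex string

# Wrong source encoding: each UTF-8 byte is treated as one Latin-1 character.
mangled=$(printf '\303\226' | iconv -f ISO-8859-1 -t UTF-8 | hex)
echo "$mangled"     # prints: c383c296

# Undo it: decode as UTF-8, then write the characters back out as single bytes.
repaired=$(printf '\303\203\302\226' | iconv -f UTF-8 -t ISO-8859-1 | hex)
echo "$repaired"    # prints: c396
```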
I'll try again, but this time I will use Vim to do the Ö insertion instead of sed. Vim seemed to detect the encoding better before (as "latin1" aka ISO 8859-1) so perhaps it will insert the new Ö with a consistent encoding.
$ vim source-file
$ head -n1 test-file-2
�
$ head -n1 test-file-2 | hexdump -C
00000000 d6 0d 0a |...|
00000003
$ tail -n +102322 test-file-2 | head -n1 | hexdump -C -s85 -n2
00000055 d6 4d |.M|
00000057
Indeed vim used the correct/consistent ISO encoding when inserting the character at the beginning of the file.
Now the test: Does file do a better job of recognizing the encoding with special characters at the beginning of the file?
$ file -b --mime-encoding test-file-2
iso-8859-1
$ iconv -f iso-8859-1 -t utf8 test-file-2 > test-file-2-converted
$ file -b --mime-encoding test-file-2-converted
utf-8
Yes it does! Moral of the story. Don't trust file to always guess your encoding right. It is easy to mix encodings within the same file. When in doubt, look at the hex.
A hack that would address this specific limitation of file when dealing with large files would be to shorten the file to make sure that special (non-ascii) characters appear early in the file so file is more likely to find them.
$ first_special=$(pcregrep -o1 -n '()[^\x00-\x7F]' source-file | head -n1 | cut -d":" -f1)
$ tail -n +$first_special source-file > /tmp/source-file-shorter
$ file -b --mime-encoding /tmp/source-file-shorter
iso-8859-1
You could then use (presumably correct) detected encoding to feed as input to iconv to ensure you are converting correctly.
Update
Christos Zoulas updated file to make the amount of bytes looked at configurable. One day turn-around on the feature request, awesome!
http://bugs.gw.com/view.php?id=533
Allow altering how many bytes to read from analyzed files from the command line
The feature was released in file version 5.26.
Looking at more of a large file before making a guess about encoding takes time. However, it is nice to have the option for specific use-cases where a better guess may outweigh additional time and I/O.
Use the following option:
−P, −−parameter name=value
Set various parameter limits.
Name Default Explanation
bytes 1048576 max number of bytes to read from file
Something like...
file_to_check="myfile"
bytes_to_scan=$(wc -c < $file_to_check)
file -b --mime-encoding -P bytes=$bytes_to_scan $file_to_check
... it should do the trick if you want to force file to look at the whole file before making a guess. Of course, this only works if you have file 5.26 or newer.
Update 2023-02-06
Thanks @theprivileges for pointing out that the parameter behaviour has changed as of file 5.44. There is now an additional encoding parameter that specifies how many of the bytes read by file should be used for encoding determination.
e.g.
file_to_check="myfile"
bytes_to_scan=$(wc -c < $file_to_check)
file -b --mime-encoding -P bytes=$bytes_to_scan -P encoding=$bytes_to_scan $file_to_check
Note! It appears with this change, that the bytes of the file used for determining encoding is now capped to a max of 64k. So for very large files where special characters only occur late in the file, you may need to resort to a different workaround (e.g. moving special characters up in the file for proper detection).
Forcing file to display UTF-8 instead of US-ASCII
Some of the other answers seem to focus on trying to make file display UTF-8 even if the file only contains plain 7-bit ascii. If you think this through you should probably never want to do this.
If a file contains only 7-bit ascii but the file command is saying the file is UTF-8, that implies that the file contains some characters with UTF-8 specific encoding. If that isn't really true, it could cause confusion or problems down the line. If file displayed UTF-8 when the file only contained 7-bit ascii characters, this would be a bug in the file program.
Any software that requires UTF-8 formatted input files should not have any problem consuming plain 7-bit ascii since this is the same on a byte level as UTF-8. If there is software that is using the file command output before accepting a file as input and it won't process the file unless it "sees" UTF-8...well that is pretty bad design. I would argue this is a bug in that program.
If you absolutely must take a plain 7-bit ascii file and convert it to UTF-8, simply insert a single non-7-bit-ascii character into the file with UTF-8 encoding for that character and you are done. But I can't imagine a use-case where you would need to do this. The easiest UTF-8 character to use for this is the Byte Order Mark (BOM) which is a special non-printing character that hints that the file is non-ascii. This is probably the best choice because it should not visually impact the file contents as it will generally be ignored.
Microsoft compilers and interpreters, and many pieces of software on
Microsoft Windows such as Notepad treat the BOM as a required magic
number rather than use heuristics. These tools add a BOM when saving
text as UTF-8, and cannot interpret UTF-8 unless the BOM is present
or the file contains only ASCII.
This is key:
or the file contains only ASCII
So some tools on windows have trouble reading UTF-8 files unless the BOM character is present. However this does not affect plain 7-bit ascii only files. I.e. this is not a reason for forcing plain 7-bit ascii files to be UTF-8 by adding a BOM character.
Here is more discussion about potential pitfalls of using the BOM when not needed (it IS needed for actual UTF-8 files that are consumed by some Microsoft apps). https://stackoverflow.com/a/13398447/3616686
Nevertheless if you still want to do it, I would be interested in hearing your use case. Here is how. In UTF-8 the BOM is represented by hex sequence 0xEF,0xBB,0xBF and so we can easily add this character to the front of our plain 7-bit ascii file. By adding a non-7-bit ascii character to the file, the file is no longer only 7-bit ascii. Note that we have not modified or converted the original 7-bit-ascii content at all. We have added a single non-7-bit-ascii character to the beginning of the file and so the file is no longer entirely composed of 7-bit-ascii characters.
$ printf '\xEF\xBB\xBF' > bom.txt # put a UTF-8 BOM char in new file
$ file bom.txt
bom.txt: UTF-8 Unicode text, with no line terminators
$ file plain-ascii.txt # our pure 7-bit ascii file
plain-ascii.txt: ASCII text
$ cat bom.txt plain-ascii.txt > plain-ascii-with-utf8-bom.txt # put them together into one new file with the BOM first
$ file plain-ascii-with-utf8-bom.txt
plain-ascii-with-utf8-bom.txt: UTF-8 Unicode (with BOM) text
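Going the other way, a BOM that is causing trouble can be stripped from the first line. This is a sketch that assumes GNU sed (for the \xHH escapes); the function name is hypothetical. LC_ALL=C makes sed compare raw bytes rather than decoded characters.

```shell
strip_bom() {
  # Remove EF BB BF only if it sits at the very start of line 1.
  LC_ALL=C sed '1s/^\xEF\xBB\xBF//'
}

printf '\357\273\277hello\n' | strip_bom    # prints: hello
```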
People say you can't and I understand you may be frustrated when asking a question and getting such an answer.
If you really want it to show in UTF-8 instead of US ASCII then you need to do it in two steps.
First:
iconv -f us-ascii -t utf-16 yourfile > yourfileinutf16.*
Second:
iconv -f utf-16le -t utf-8 yourfileinutf16 > yourfileinutf8.*
Then if you do a file -i, you'll see the new character set is UTF-8.
I think Ned's got the core of the problem -- your files are not actually ASCII. Try
iconv -f ISO-8859-1 -t UTF-8 file.php > file-utf8.php
I'm just guessing that you're actually using ISO 8859-1. It is popular with most European languages.
There is no difference between US ASCII and UTF-8, so there isn't any need to reconvert it.
But here a little hint, if you have trouble with special-chars while recoding.
Add //TRANSLIT after the target charset parameter (-t).
Example:
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT filename.sql > utf8-filename.sql
This helps me with strange types of quotes, which are always breaking the character set reencode process.
Here's a script that will find all files matching a pattern you pass it and convert them from their current file encoding to UTF-8. If the encoding is US ASCII, then it will still show as US ASCII, since that is a subset of UTF-8.
#!/usr/bin/env bash
find . -type f -name "${1}" |
while IFS= read -r line;
do
echo "***************************"
echo "Converting ${line}"
# Ask file to guess the current encoding (the guess may be wrong).
encoding=$(file -b --mime-encoding "${line}")
echo "Found Encoding: ${encoding}"
# Write to a temp file first and replace the original only if iconv succeeds.
iconv -f "${encoding}" -t "utf-8" "${line}" -o "${line}.tmp" &&
mv "${line}.tmp" "${line}"
done
vim -es '+set fileencoding=utf-8' '+wq!' file
-es runs vim in ex and script mode, thus nothing is rendered. Then it executes the command where the file encoding is set (vim takes care of the details) and then the file is closed '+wq!'.
I am late to the question but the previous answers using iconv did simply not do the job and left the file in a state with non utf-8 characters even when adding -c to drop those.
You can use file -i file_name to check what exactly your original file format is.
Once you get that, you can do the following:
iconv -f old_format -t utf-8 input_file -o output_file
I accidentally encoded a file in UTF-7 and had a similar issue. When I typed file -i name.file I would get charset=us-ascii.
iconv -f us-ascii -t utf-8//translit name.file would not work since, as I've gathered, a UTF-7 file contains only ASCII bytes, so converting it as US-ASCII leaves it unchanged.
To solve this, I entered
iconv -f UTF-7 -t UTF-8//TRANSLIT name.file -o output.file
I'm not sure how to determine the encoding other than what others have suggested here.
The following converts all files in a folder.
Create backup folder of original files.
mkdir backup
Convert all files in US ASCII encoding to UTF-8 (single line command)
for f in $(file -i *.sql | grep us-ascii | cut -d ':' -f 1); do iconv -f us-ascii -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done
Convert all files in encoding ISO 8859-1 to UTF-8 (single line command)
for f in $(file -i *.sql | grep iso-8859-1 | cut -d ':' -f 1); do iconv -f iso-8859-1 -t utf-8 "$f" -o "$f.utf-8" && mv "$f" backup/ && mv "$f.utf-8" "$f"; done
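These loops split file names on whitespace and rely on file's guess. A more defensive sketch is shown here; it is not what the answer above does. It assumes that any file that is not already valid UTF-8 is ISO 8859-1 (adjust the -f value otherwise), and the function name and the .bak suffix are made up for illustration.

```shell
convert_to_utf8() {
  # Already valid UTF-8 (this includes pure ASCII): nothing to do.
  if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
    return 0
  fi
  # Keep the original as a backup, then convert from the backup copy.
  cp -p "$1" "$1.bak" &&
    iconv -f ISO-8859-1 -t UTF-8 "$1.bak" > "$1"
}
```

It could be driven by a glob loop such as for f in ./*.sql; do convert_to_utf8 "$f"; done, with the variable quoted so that names containing spaces survive.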
Inspired a lot by Mathieu's answer and Marcelo's answer:
I face the need to see file -i myfile.htm to show UTF-8 instead of US ASCII (yes, I know it is a subset of UTF-8).
So here is a one liner inspired from previous answers that will convert on Linux all *.htm file from US ASCII to UTF-8 so file -i will show you UTF-8. You can change *.htm (two places in the command below) to fit your need.
mkdir backup 2>/dev/null; for f in $(file -i *.htm | grep -i us-ascii | cut -d ':' -f 1); do iconv -f "us-ascii" -t "utf-16" $f > $f.tmp; iconv -f "utf-16le" -t "utf-8" $f.tmp > $f.utf8; cp $f backup/; mv $f.utf8 $f; rm $f.tmp; done; file -i *.htm
Just FYI, file doesn't check whole content (as already mentioned in the long answer from mattpr) to detect encoding of a file by default. To force the whole content to be scanned for charset detection, this code can be used...
file_to_check="myfile"
bytes_to_scan=$(wc -c < $file_to_check)
file -b --mime-encoding --parameter encoding=$bytes_to_scan $file_to_check
See also corresponding manual https://man7.org/linux/man-pages/man1/file.1.html

Remove non-English and accented characters from a flat file using Unix shell script

I have a file which contains lot of accented and some wild-card (?, *) characters. How do I replace these characters with space in Unix (using sed or similar utility). I tried it using sed but somehow it is ignoring accented characters.
Thanks
Using GNU sed, you can do the following:
sed 's/[^\o51-\o57\o64-\o89\o96-\o105\o112-\o121\o128-\o137\o144-\o145\o147\o150\o291-\o293]/ /g' inputfile
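Note that those are the letter "o" rather than the digit zero after the backslashes. GNU sed's \o escape expects octal digits (0-7), so values containing 8 or 9, such as \o89 or \o291, are not valid octal; check the ranges against an ASCII table for your data before relying on this command.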
This isn't a terribly specific answer, but it should give you a few keywords to search for.
First, the easy bit. It's straightforward to have sed match regexp characters. For example:
% echo 'one tw? f*ur' | sed 's/\*/ /'
one tw? f ur
% echo 'one tw? f*ur' | sed 's/[*?]/ /'
one tw f*ur
%
Handling the non-ASCII characters is messier.
Some seds can handle non-ASCII characters, usually unicode files. Some seds can't. Unfortunately, it may not be obvious from your sed's manpage which it is. Life is hard.
One thing you'll have to find out is what encoding the input file is in. A unicode file will be encoded in one or other of UTF-8 or UTF-16 (or possibly one of a couple of less common ones). This isn't the place for an expansion of unicode and encodings, but those are the keywords to scan the manpages for....
Even if you can't find a sed which can handle unicode, then you might be able to use perl, python, or some other scripting language to do the processing -- these generally have regexp engines which can do unicode. The perl -n option creates an implicit loop which might make the transformation you want a one-liner.
If your input document is in a different (non-unicode) encoding, such as one of the ISO-8859 ones, then I would guess that the best thing to do would be to convert it to UTF-8 using something like iconv, and proceed from there.
If your accented characters are single-byte you can use tr with character sets to accomplish this. If you can identify a range of characters to match, that's probably easiest. Note that tr escapes are octal, so the byte range 192-255 (decimal) is written \300-\377:
tr '\300-\377' ' ' < infile > outfile
If you're dealing with larger-than-8-bit characters, awk and sed can probably handle it, but you need to make sure your inputs are properly quoted. Try using the decimal or hexadecimal representations instead of the characters themselves.
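For the single-byte case, a sketch of the replacement the question asks for is shown below: every byte with the high bit set, plus the literal ? and * characters, becomes a space. It assumes a single-byte encoding such as ISO 8859-1; on UTF-8 input each accented letter is two or more bytes and would turn into that many spaces. The function name is made up.

```shell
blank_non_ascii() {
  # '\200-\377' = bytes 128-255 in octal; '[ *]' repeats the space as often as needed
  LC_ALL=C tr '\200-\377?*' '[ *]'
}

# "café w?ld*" in ISO 8859-1: the é byte, the ? and the * each become one space.
printf 'caf\351 w?ld*\n' | blank_non_ascii
```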

Resources