Is it possible to encode the output of a grep command in UTF-8 no matter what the encoding of the input file was?
I execute a grep command from a Python script (via subprocess) and I want to guarantee that the resulting bytes are UTF-8.
Example:
grep -P "ÄA" -m 1 file.txt
I don't know the input encoding of the file...
Grep follows the UNIX philosophy: it does one thing, and it does that one thing well. File encoding is not part of that one thing.
That's what other tools are for. There is another tool that does character decoding and encoding well, called iconv. Use that to change the encoding of the input file to UTF-8.
This does require you to know the input file encoding. If you don't know, you have to guess, based on heuristic analysis of the input file (it'll be hard to be certain, recognising that something has been decoded using the wrong codec usually requires a human to verify the result). There is a tool for that too, called enca. This tool can also do the conversion once a guess has been made. It usually is a separate install (it is not part of the common default POSIX toolset). See How to auto detect text file encoding? over on Super User for more options.
Note, however, that because codec-guessing tools rely on statistical analysis, it is better to do the guessing on the input file, not on the output of grep.
None of this has anything to do with Python, of course. Except if you wanted to do the encoding detection in Python instead, at which point you'd want to look at the chardet library.
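For illustration, here is a minimal Python sketch of that approach (assuming the chardet package is installed; the pattern and file name are simply the ones from the question): guess the input file's encoding from a sample, run grep, then transcode grep's output to UTF-8.

import subprocess
import chardet

def grep_as_utf8(pattern, path):
    # Guess the encoding from a sample of the *input* file, as suggested above.
    with open(path, "rb") as f:
        guess = chardet.detect(f.read(64 * 1024))
    encoding = guess["encoding"] or "utf-8"
    # grep hands back raw bytes in the file's original encoding.
    result = subprocess.run(["grep", "-P", "-m", "1", pattern, path],
                            capture_output=True)
    # Decode with the guessed codec, then re-encode as UTF-8.
    return result.stdout.decode(encoding, errors="replace").encode("utf-8")

print(grep_as_utf8("ÄA", "file.txt"))

Note that the pattern itself is still passed to grep in the script's own encoding, so matching non-ASCII patterns against a differently encoded file remains unreliable; converting the whole file with iconv first sidesteps that.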
While compiling a C file, gcc by default compiles it to a file called "a.out". My professor said that the output file contains the binaries, but when I open it I usually encounter unreadable text (VS Code says something like "This file contains unsupported text encoding").
I assumed that by 'binaries' I would be able to see literal zeroes and ones in the file, but that does not seem to be the case. So what exactly does the output file look like, what exactly does it contain, and what is 'text encoding'? Why can I not read it? What special characters might it contain? I'm aware of the fact that gcc first pre-processes the file, which means it removes all comments, expands all macros, and copies in the contents of any header files that might be included. You get the pre-processed output by running gcc -E <file_name>.c, and then this processed file is compiled into assembly. Up to this point, the output files are readable, i.e., I can open them with VS Code, but after this the assembled code and the object file thereafter are human-unreadable.
For reference, I have no prior experience with programming or any language for that matter and this is my first CS related course in my first sem of college, and I apologize if this is too trivial of a question to ask.
I actually had the same confusion early on. Not about that file type specifically, but about binary vs text files.
After all, aren't all files, even text ones, binary? In the sense that all information is 1s and 0s? Well, yes, all information can be stored/transmitted as 1s and 0s, but that's not what binary/text files refer to.
It refers to what that information, the content of the file, those 1s and 0s represent.
In a text file the bytes encode characters. In a binary file the bits encode some information that is not text. The format and semantics of that information are completely free; it can mean anything and use whatever encoding scheme. It's up to the application that writes/reads the file to properly understand the bit patterns.
Most text editors (like VS Code), when they open a file, treat it as a text file. I.e., they try to interpret the bit patterns as a text encoding scheme (e.g. ASCII or UTF-8). But not all bit patterns are valid ASCII/UTF-8, so that's why you get "unsupported text encoding".
If you want to inspect the actual 1s and 0s, for both text and binary files, you need to use a utility that shows you that, e.g. a hex viewer/editor.
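As a concrete illustration, here is a minimal Python sketch of what a hex viewer shows (assuming the compiled a.out sits in the current directory): each byte printed as hexadecimal, with a printable-ASCII column alongside.

# Dump the first 64 bytes of a.out, hex-viewer style.
with open("a.out", "rb") as f:
    data = f.read(64)

for offset in range(0, len(data), 16):
    chunk = data[offset:offset + 16]
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    ascii_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
    print(f"{offset:08x}  {hex_part:<47}  {ascii_part}")

On Linux the dump will typically start with the ELF magic number (7f 45 4c 46), and most of what follows is machine code and metadata rather than encoded characters, which is why a text editor refuses to render it.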
My little cross-platform Tcl/Tk pet project uses UTF-8 as the text encoding everywhere.
encoding system utf-8
fconfigure stderr -encoding utf-8
fconfigure stdout -encoding utf-8
#puts "汉语"
puts "foo"
This works great on Linux and macOS, where the standard encoding is UTF-8:
$ tclsh utf8test.tcl
foo
Unfortunately, it doesn't work at all on Windows, where the terminal typically uses cp1252, and it consequently outputs garbage:
$ wine tclsh85.exe utf8test.tcl
潦൯
As far as I understand, this is because we cannot really change the encoding of the Windows terminal, and, according to this SF issue, I need to detect for myself whether the output is going to a terminal or somewhere else.
I have no idea how to make this decision.
So how can I output printable characters to the console in a project that otherwise uses UTF-8 everywhere?
Edit 1
My original example would output some high Unicode code points (the Chinese characters 汉语).
However, the problem persists even if I try to output a plain ASCII string (foo).
Since most of the text I'm outputting is ASCII anyway, I would like to at least fix the output for that.
I thought that cp1252 was backwards compatible with ASCII...
Output to the real console on Windows is special (it has its own channel implementation) because the low-level API takes Unicode (I think it is actually UTF-16, but that isn't important here) and Tcl uses the low-level I/O API on Windows where it can. I do not know whether Tcl has automatically detected your console correctly. I know absolutely nothing about what interaction there is between the API, Wine's implementation of it, and whatever is going on in the host environment.
Console output is supposed to Just Work™ if you don't modify the encoding on its channel. Indeed, the usual rule is: don't change encodings except on files and sockets (where you know the encoding might not be the system one).
I get text files in random encoding formats: UCS-2LE, ANSI, UTF-8, UCS-2BE, etc. I have to convert these files to UTF-8.
For the conversion I am using the following command:
iconv [options] -f from-encoding -t utf-8 < inputfile > outputfile
But if an incorrect from-encoding is provided, an incorrect file is generated.
I want a way to detect the input file's encoding.
Thanks in advance
On Linux you could try using file(1) on your unknown input file. Most of the time it will guess the encoding correctly. Otherwise, try several encodings with iconv until you "feel" that the result is acceptable (for example, if you know that the file is some Russian poetry, you might try KOI-8, UTF-8, etc. until you recognize a good Russian poem).
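Here is a small Python sketch of that idea (the file names are placeholders; it simply shells out to file(1) for a guess and to iconv for the conversion, so the usual caveat applies: the guess may be wrong).

import subprocess

def convert_to_utf8(src, dst):
    # `file -b --mime-encoding` prints only the guessed charset, e.g. "iso-8859-1".
    guess = subprocess.run(["file", "-b", "--mime-encoding", src],
                           capture_output=True, text=True, check=True).stdout.strip()
    # Feed the guess to iconv; if the guess was wrong, the output will be wrong too.
    with open(dst, "wb") as out:
        subprocess.run(["iconv", "-f", guess, "-t", "UTF-8", src],
                       stdout=out, check=True)
    return guess

Be aware that for some inputs file reports values such as unknown-8bit or binary, which iconv will not accept as a source encoding.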
But character encoding is a nightmare and can be ambiguous. The provider of the file should tell you what encoding was used (and there is no way to get that encoding reliably in all cases: some byte sequences are valid in, and interpreted differently by, several encodings).
(notice that the HTTP protocol mentions the encoding and makes it explicit)
In 2017, better to use UTF-8 everywhere (and you should follow the http://utf8everywhere.org/ link), so ask your human partners to send you UTF-8 (hopefully most of your files are in UTF-8 already, since today they all should be).
(so encoding is more a social issue than a technical one)
I get text files in random encoding formats
Notice that a "random encoding" doesn't exist. You want and need to find out what character encoding (and file format) has been used by the provider of that file (so you mean an "unknown encoding", not a "random" one).
BTW, do you have a formal, unambiguous, sound and precise definition of a text file, beyond "a file without zero bytes" or "a file with few control characters"? LaTeX, C source, Markdown, SQL, UUencoded, shar, XPM, and HTML files are all text files, but very different ones!
You probably want to expect UTF-8, and you might use the file extension as some hint. Knowing the media-type could help.
(so if HTTP has been used to transfer the file, it is important to keep (and trust) the Content-Type...; read about HTTP headers)
[...] then an incorrect file is generated.
How do you know that the resulting file is incorrect? You can only know if you have some expectations about that result (e.g. that it contains Russian poetry, not junk characters; but perhaps these junk characters are some bytecode to some secret interpreter, or some music represented in weird fashion, or encrypted, etc....). Raw files are just sequences of bytes, you need some extra knowledge to use them (even if you know that they use UTF-8).
We do file encoding conversion with
vim -c "set encoding=utf8" -c "set fileencoding=utf8" -c "wq" filename
It works fine, and there is no need to give the source encoding.
I have come up with a method to determine encoding (or at least a guess at it) for a file that I pass in:
def encoding_type(file_path)
  File.read(file_path).encoding.name
end
The problem with this is that I have a file that is 15GB, so that means the entire file is being read into memory.
Is there any way to accomplish what I am doing in this method without needing to read the entire file into memory?
The file --mime command will return the MIME type and encoding of the file:
file --mime myfile
myfile: text/plain; charset=iso-8859-1
def detect_charset(file_path)
  `file --mime #{file_path}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
end
The method you suggest in your question will not do what you think. It will simply set the file to the Encoding.default_internal encoding, possibly after transcoding it from Encoding.default_external. These are both usually UTF-8. The encoding is going to always be Encoding.default_internal after you run that code, it is not guessing or determining the encoding from the actual file.
If you have a file and you really don't know what encoding it is, you indeed will have to guess. There's no way to be 100% sure you've gotten it right as the author intended (and some files are corrupt and mixed encoding or not legal in any encoding).
There are libraries with heuristics meant to try and guess (they won't be right all the time).
Here's one, which I've never actually used myself, but it was the likeliest prospect I found in 10 minutes of googling: https://github.com/oleander/rchardet. There might be other Ruby gems for this. You could also use Ruby's system() to call a Linux command-line utility that tries to do this; someone above mentions the Linux file command.
If you don't want to load the entire file in to test it, you can certainly just load part of it. The chardet library will probably work more reliably the more data it gets, but, sure, just read the first X bytes of the file and then ask chardet to guess its encoding.
require 'chardet19'
first1000bytes = File.read(file, 1000)
cd = CharDet.detect(first1000bytes)
cd.encoding
cd.confidence
You can also always check whether any string in Ruby is valid for the encoding it's set to:
str.valid_encoding?
So you could simply go through a variety of encodings and see if it's valid:
orig_encoding = str.encoding
str.force_encoding("ISO-8859-1").valid_encoding?
str.force_encoding("UTF-8").valid_encoding?
str.force_encoding(orig_encoding) # put it back to what it was
But it's certainly possible for a file to be valid in more than one encoding, or to be valid in a given encoding but read as nonsense by humans in that encoding.
If you have your best guess encoding, but it's still not valid_encoding? for that encoding, it may just have a few bad bytes in it. You can remove them with String.scrub in ruby 2.1, or with this pure-ruby backport of String.scrub in other ruby versions.
Hope this helps give you some idea of what you're dealing with and what your options are.
I have a bunch of files with a mixture of encodings, mainly ISO-8859-1 and UTF-8.
I would like to make all the files UTF-8, but when trying to batch-convert these files using iconv, some problems arise (files cut in half, etc.).
I suppose the reason is that iconv requires knowing the 'from' encoding, so if the command looks like this:
iconv -f ISO-8859-1 -t UTF-8 in.php -o out.php
but if 'in.php' is already UTF-8 encoded, that causes problems (correct me if I'm wrong).
Is there a way, that I can list all the files whose encoding is not UTF-8?
You can't find files that are definitely ISO-8859-1, but you can find files that are valid UTF-8 (which, unlike with most multibyte encodings, gives you a reasonable assurance that they are in fact UTF-8). moreutils has a tool, isutf8, which can do this for you. Or you could write your own; it would be fairly simple.
It's often hard to tell just by reading a text file whether it's in UTF-8 encoding or not. You could scan the file for certain indicator bytes which can never occur in UTF-8, and if you find them, you know the file is in ISO-8859-1. If you find a byte with its high-order bit set, where the bytes both immediately before and immediately after it don't have their high-order bit set, you know it's ISO encoded (because bytes >127 always occur in sequences in UTF-8). Beyond that, it's basically guesswork - you'll have to look at the sequences of bytes with that high bit set and see whether it would make sense for them to occur in ISO-8859-1 or not.
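For what it's worth, here is a rough Python sketch of that isolated-high-byte check (a heuristic only, under the assumption that the candidates are UTF-8 and a single-byte encoding such as ISO-8859-1):

def looks_like_single_byte_encoding(path):
    # An isolated byte >= 0x80 (both neighbours < 0x80) can never occur in
    # valid UTF-8, where bytes above 127 always come in runs of two to four.
    data = open(path, "rb").read()
    for i, b in enumerate(data):
        if b < 0x80:
            continue
        prev_low = (i == 0) or data[i - 1] < 0x80
        next_low = (i == len(data) - 1) or data[i + 1] < 0x80
        if prev_low and next_low:
            return True   # isolated high byte found: not UTF-8
    return False

A blunter but simpler test is just to attempt data.decode('utf-8') and treat a UnicodeDecodeError as "not UTF-8", which is essentially what the Python snippet further down this page does.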
The file program will make an attempt to guess the encoding of a text file it's processing, you could try that.
With find it's quite simple:
find . -print0 | xargs -0 file | grep 8859
Is there a way, that I can list all the files whose encoding is not UTF-8?
Perhaps not so easily in bash alone, but it's a trivial task from e.g. Python:
import os
import os.path

for child in os.listdir(TARGETDIR):
    child = os.path.join(TARGETDIR, child)
    if os.path.isfile(child):
        content = open(child, 'rb').read()
        try:
            unicode(content, 'utf-8')
        except UnicodeDecodeError:
            # Not valid UTF-8: assume ISO-8859-1 and rewrite the file as UTF-8
            open(child, 'wb').write(unicode(content, 'iso-8859-1').encode('utf-8'))
This assumes that any file that can be interpreted as a valid UTF-8 sequence is one (and so can be left alone), whilst anything that isn't must be ISO-8859-1.
This is a reasonable assumption if those two are the only possible encodings, because valid UTF-8 multibyte sequences (at least two top-bit-set bytes in a particular order) are relatively rare in real Latin text, where we tend only to use the odd single accented character here and there.
What kind of content? XML? Then yes, if properly tagged at the top. Generic text files? I don't know of any a-priori way to know what encoding is used, although it might be possible, sometimes, with clever code. "Tagged" UTF-8 text files, by which I mean UTF-8 text files with a byte order mark? (For UTF-8, that's the three-byte sequence EF BB BF.) Probably. The byte order mark characters will not commonly appear as the first three characters of an ISO-8859-1 encoded file. (Which bobince pointed out in a comment to this post, so I'm correcting my post.)
For your purposes, tools exist that can probably solve most of your question. Logan Capaldo pointed out one in his answer.
But after all, if it were always possible to figure out, unambiguously, what character encoding was used in a file, then the iconv utility wouldn't need you to provide the "from" encoding. :)