Detect UTF-8 encoding (How does MS IDE do it)?

A problem with various character encodings is that the containing file is not always clearly marked. There are inconsistent conventions for marking some of them using "byte order marks" (BOMs). But in essence you have to be told what the file encoding is in order to read it accurately.
We build programming tools that read source files, and this gives us grief. We have means to specify defaults, sniff for BOMs, etc., and we do pretty well with conventions and defaults. But one place where we (and, I assume, everybody else) get hung up is UTF-8 files that are not BOM-marked.
Recent MS IDEs (e.g., Visual Studio 2010) will apparently "sniff" a file to determine if it is UTF-8 encoded without a BOM. (Being in the tools business, we'd like to be compatible with MS because of their market share, even if it means having to go over the "stupid" cliff with them.) I'm specifically interested in what they use as a heuristic (although discussion of heuristics in general is fine). How can it be "right"? (Consider an ISO8859-x encoded string interpreted this way.)
EDIT: This paper on detecting character encodings/sets is pretty interesting:
http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html
EDIT December 2012: We ended up scanning the entire file to see if it contained any violations of UTF-8 sequences... and if it does not, we call it UTF-8. The bad part of this solution is that you have to process the characters twice if it is UTF-8. (If it isn't UTF-8, this test is likely to determine that fairly quickly, unless the file happens to be all 7-bit ASCII, at which point reading it as UTF-8 won't hurt.)
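The whole-file scan described in the edit could be sketched roughly like this in Python. This is only an illustration, not the code we actually run: the function name looks_like_utf8 and the chunked input are my own assumptions, and it leans on Python's strict UTF-8 decoder to do the validation.

```python
import codecs

def looks_like_utf8(chunks):
    """Return True if the concatenated byte chunks decode as strict UTF-8.

    `chunks` is any iterable of bytes objects (hypothetical input shape),
    e.g. blocks read from a file opened in binary mode.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        for chunk in chunks:
            # The incremental decoder buffers a sequence split across chunks.
            decoder.decode(chunk)
        # final=True reports a multi-byte sequence truncated at end of input.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

Pure 7-bit ASCII passes, as does any well-formed UTF-8. Note that the test can still be "wrong" in the sense asked about above: the two bytes 0xC3 0xA9 are well-formed UTF-8 and also a legal (if unlikely) pair of ISO8859-1 characters, so a clean scan is strong evidence rather than proof.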

If encoding is UTF-8, the first character you see over 0x7F must be the start of a UTF-8 sequence. So test it for that. Here is the code we use for that:
unc ::IsUTF8(unc *cpt)
{
    if (!cpt)
        return 0;
    if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80)
         && ((*(cpt + 3) & 0xC0) == 0x80))
            return 4;
    }
    else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
        if (((*(cpt + 1) & 0xC0) == 0x80)
         && ((*(cpt + 2) & 0xC0) == 0x80))
            return 3;
    }
    else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
        if ((*(cpt + 1) & 0xC0) == 0x80)
            return 2;
    }
    return 0;
}
If you get a return of 0, it is not valid UTF-8. Otherwise, skip the number of bytes returned and continue checking at the next byte over 0x7F.
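For anyone who wants to see the surrounding loop, here is a rough Python sketch of the same structural check plus the "skip and continue" driver. The names utf8_sequence_length and is_structurally_utf8 are made up for this illustration. Like the check above, it only looks at the lead-byte pattern and the 10xxxxxx continuation bytes, so it does not reject overlong encodings or surrogate code points.

```python
def utf8_sequence_length(data, i):
    """Length (2 to 4) of the multi-byte sequence starting at data[i], or 0 if malformed."""
    lead = data[i]
    if (lead & 0xF8) == 0xF0:      # 11110xxx: start of 4-byte sequence
        n = 4
    elif (lead & 0xF0) == 0xE0:    # 1110xxxx: start of 3-byte sequence
        n = 3
    elif (lead & 0xE0) == 0xC0:    # 110xxxxx: start of 2-byte sequence
        n = 2
    else:
        return 0                   # stray continuation byte or invalid lead
    tail = data[i + 1:i + n]
    # Every following byte must exist and look like 10xxxxxx.
    if len(tail) == n - 1 and all((b & 0xC0) == 0x80 for b in tail):
        return n
    return 0

def is_structurally_utf8(data):
    """Walk a bytes object, skipping ASCII and checking each sequence over 0x7F."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        n = utf8_sequence_length(data, i)
        if n == 0:
            return False
        i += n
    return True
```

As an example of the limitation, is_structurally_utf8(b"\xc0\x80") is True here even though 0xC0 0x80 is an overlong form that a strict decoder refuses.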

Visual Studio Code uses jschardet, which returns a guess and a confidence level. It's all open source, so you can inspect the code.
https://github.com/microsoft/vscode/issues/101930#issuecomment-655565813

We just found a solution to this.
Basically, when you don't know the encoding of a file/stream/source, you need to check the entire file and/or look at portions of the text to see if you get UTF-8 matches. I see this as similar to what some antivirus products do when they check for known viral substrings.
I'd suggest calling a function similar to the one we used, reading the file/stream line by line to determine whether UTF-8 encoding is found or not.
Please refer to our post below.
Ref.
- https://stackoverflow.com/questions/17283872/how-to-detect-utf-8-based-encoded-strings

Related

Print a code128 barcode starting with the character 'C'

I've written label printing software (Windows, WPF, C#, .net 4.5) that happily prints barcodes with a Datamax H-Class printer, with one exception: printing a barcode that starts with the character C.
When I attempt this, the barcode is truncated up until the first numeric character within it.
Lower case c works fine, but as some of our model codes do start with C, I need to find a way to work around this.
I guess there must be some sort of escape character that would allow this? But I've not managed to find it via Google.
I'm not 100% sure it's a code128 issue either, could it be related to the Datamax H-Class printer, the Datamax Windows C# SDK or possibly the code128 font we're using on the printer?
Sorry the details are so vague, any help or advice on what to check next would be very much appreciated.
Update.
Just in case this is of any use (I doubt it though sadly) the code I'm using to send barcodes to the printer (successfully in the case of all barcode strings not starting with C ) is as follows:
ParametersDPL paramDPL = new ParametersDPL();
paramDPL.Align = ParametersDPL.Alignment.Left;
paramDPL.Rotate = ParametersDPL.Rotation.Rotate_270;
paramDPL.IsUnicode = false;
paramDPL.TextEncoding = Encoding.ASCII;
paramDPL.WideBarWidth = 7;
paramDPL.NarrowBarWidth = 4;
paramDPL.SymbolHeight = 60;
//if the stockCode starts with 'C' the barcode will be truncated
docDPL.WriteBarCode("E", String.Format("{0} {1}", stockCode, serialNumber), COL_1, ROW_5, paramDPL);
The ParametersDPL object is from the Datamax C# SDK. The only possible problem I could see with the code is perhaps the setting of the IsUnicode or TextEncoding properties, but I've experimented with them quite a bit to no effect. None of the other properties on the ParametersDPL seemed like likely culprits either.
I'm unfamiliar with Datamax DPL, but the symptoms suggest that the "C" is being used to select subalphabet "C" of code128. It might be useful to try a stock code starting "A" or "ZB" and see whether the "A" or "B" disappears. If it does, then the first character may be being used to select a subalphabet ("A" is caps-only ASCII, "B" is no-controls ASCII.)
You'd then need to look very closely at the Datamax DPL format - it may be that there's a (possibly optional) formatting character there, which makes it leading-character-sensitive. Perhaps forcing in a leading "B" would cure the problem.

Determine escape sequence independent from terminal type

My app reads escape sequences from the terminal in raw mode. When it runs under xterm, I get F2 as "\eOQ". But when it runs on a Linux tty console (switched to with Ctrl-Alt-F1), I get "\e[[[B".
What is the correct way to determine that I got F2, independent of the terminal type the application is running on?
If you're wanting to read terminal keypresses, you likely want to look at something like libtermkey , which abstracts the general problem away for you. Internally it uses a combination of terminfo lookups, or hardcoded knowledge of the extended xterm-like model for modified keypresses, so it can understand things like Ctrl-Up, which a regular curses/etc... cannot.
while((ret = termkey_waitkey(tk, &key)) != TERMKEY_RES_EOF) {
  termkey_strfkey(tk, buffer, sizeof buffer, &key, TERMKEY_FORMAT_VIM);
  printf("You pressed key %s\n", buffer);
  if(key.type == TERMKEY_TYPE_FUNCTION &&
     !key.modifiers &&
     key.code.number == 2)   /* comparison, not assignment */
    printf("Got F2\n");
}
OK, as I understand it, the best way is to use the [n]curses library. It reads the terminfo (termcap) database and determines what the escape sequence you got means, depending on the terminal type.
You don't have to use its terminal graphics functions. To get correct escape sequences using curses, you can do the following:
newterm(NULL, stdout, stdin);
raw();
noecho();
keypad(stdscr, TRUE);   /* let curses translate function-key sequences */
int ch = getch();
if (ch == KEY_F(2)) printf("Got F2");
endwin();
It is also probably possible to do this manually by reading the terminfo database in your app.
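As a rough idea of what the manual terminfo route might look like, here is a Python sketch (Python only because its curses module exposes the terminfo calls compactly; the helper names are invented for this example). It asks terminfo which byte sequences the current $TERM sends for a few function keys, then matches raw input against that table. It assumes a POSIX system with a terminfo entry for the terminal in use.

```python
import curses

def function_key_table(capabilities=("kf1", "kf2", "kf3", "kf4")):
    """Map the byte sequences sent by the current terminal to terminfo capability names."""
    curses.setupterm()  # loads the terminfo entry for $TERM; does not start screen mode
    table = {}
    for cap in capabilities:
        seq = curses.tigetstr(cap)  # bytes, or None if the terminal lacks this key
        if seq:
            table[seq] = cap
    return table

def identify_key(raw, table):
    """Return the capability name (e.g. 'kf2' for F2) for a raw sequence, or None."""
    return table.get(raw)
```

In use, you would build the table once at startup and call identify_key on each sequence read in raw mode; a result of "kf2" means F2 regardless of which terminal produced the bytes.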

Applying a diff-patch to a string/file

For an offline-capable smartphone app, I'm creating a one-way text sync for Xml files. I'd like my server to send the delta/difference (e.g. a GNU diff-patch) to the target device.
This is the plan:
Time = 0
  Server: has version_1 of Xml file (~800 kiB)
  Client: has version_1 of Xml file (~800 kiB)
Time = 1
  Server: has version_1 and version_2 of Xml file (each ~800 kiB)
          computes delta of these versions (=patch) (~10 kiB)
          sends patch to Client (~10 kiB transferred)
  Client: computes version_2 from version_1 and patch <= this is the problem =>
Is there a Ruby library that can do this last step to apply a text patch to files/strings? The patch can be formatted as required by the library.
Thanks for your help!
(I'm using the Rhodes Cross-Platform Framework, which uses Ruby as programming language.)
Your first task is to choose a patch format. The hardest format for humans to read (IMHO) turns out to be the easiest format for software to apply: the ed(1) script. You can start off with a simple /usr/bin/diff -e old.xml new.xml to generate the patches; diff(1) will produce line-oriented patches but that should be fine to start with. The ed format looks like this:
36a
<tr><td class="eg" style="background: #182349;"> </td><td><tt>#182349</tt></td></tr>
.
34c
<tr><td class="eg" style="background: #66ccff;"> </td><td><tt>#xxxxxx</tt></td></tr>
.
20,23d
The numbers are line numbers, line number ranges are separated with commas. Then there are three single letter commands:
a: add the next block of text at this position.
c: change the text at this position to the following block. This is equivalent to a d followed by an a command.
d: delete these lines.
You'll also notice that the line numbers in the patch go from the bottom up so you don't have to worry about changes messing up the line numbers in subsequent chunks of the patch. The actual chunks of text to be added or changed follow the commands as a sequence of lines terminated by a line with a single period (i.e. /^\.$/ or patch_line == '.' depending on your preference). In summary, the format looks like this:
[line-number-range][command]
[optional-argument-lines...]
[dot-terminator-if-there-are-arguments]
So, to apply an ed patch, all you need to do is load the target file into an array (one element per line), parse the patch using a simple state machine, call Array#insert to add new lines and Array#delete_at to remove them. Shouldn't take more than a couple dozen lines of Ruby to write the patcher and no library is needed.
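To make the state machine concrete, here is a minimal sketch of the idea, written in Python purely for illustration (the same shape carries over to Ruby with Array#insert and Array#delete_at). The name apply_ed_patch is made up, and the sketch assumes every command carries an explicit line number and that no inserted line consists of a single period, so it does not handle the escaping diff uses for that case.

```python
import re

# An address (N or N,M) followed by one of the three commands.
_COMMAND = re.compile(r"^(\d+)(?:,(\d+))?([acd])$")

def apply_ed_patch(old_text, patch_text):
    """Apply a `diff -e` style patch (a/c/d commands only) to a string."""
    lines = old_text.split("\n")
    patch = iter(patch_text.split("\n"))
    for command_line in patch:
        if not command_line:
            continue  # tolerate the empty piece left by a trailing newline
        m = _COMMAND.match(command_line)
        if m is None:
            raise ValueError(f"unsupported ed command: {command_line!r}")
        first = int(m.group(1))
        last = int(m.group(2) or first)
        command = m.group(3)
        block = []
        if command in "ac":
            # Argument lines run until a line holding a single period.
            for text_line in patch:
                if text_line == ".":
                    break
                block.append(text_line)
        if command == "a":
            lines[first:first] = block        # insert after 1-based line `first`
        else:
            lines[first - 1:last] = block     # c replaces the range, d empties it
    return "\n".join(lines)
```

Because the patch is ordered from the bottom of the file upward, each command can be applied directly without adjusting the line numbers of the commands that follow.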
If you can arrange your XML to come out like this:
<tag>
blah blah
</tag>
<other-tag x="y">
mumble mumble
</other-tag>
rather than:
<tag>blah blah</tag><other-tag x="y">mumble mumble</other-tag>
then the above simple line-oriented approach will work fine; the extra EOLs aren't going to cost much space so go for easy implementation to start.
There are Ruby libraries for producing diffs between two arrays (google "ruby algorithm::diff" to start). Combining a diff library with an XML parser will let you produce patches that are tag-based rather than line-based, and this might suit you better. The important thing is the choice of patch format: once you choose the ed format (and realize the wisdom of the patch working from the bottom to the top), everything else pretty much falls into place with little effort.
I know this question is almost five years old, but I'm going to post an answer anyway. When searching for how to make and apply patches for strings in Ruby, even now, I was unable to find any resources that answer this question satisfactorily. For that reason, I'll show how I solved this problem in my application.
Making Patches
I'm assuming you're using Linux, or else have access to the program diff through Cygwin. In that case, you can use the excellent Diffy gem to create ed script patches:
patch_text = Diffy::Diff.new(old_text, new_text, :diff => "-e").to_s
Applying Patches
Applying patches is not quite as straightforward. I opted to write my own algorithm, ask for improvements in Code Review, and finally settle on using the code below. This code is identical to 200_success's answer except for one change to improve its correctness.
require 'stringio'
def self.apply_patch(old_text, patch)
  text = old_text.split("\n")
  patch = StringIO.new(patch)
  current_line = 1
  while patch_line = patch.gets
    # Grab the command
    m = %r{\A(?:(\d+))?(?:,(\d+))?([acd]|s/\.//)\Z}.match(patch_line)
    raise ArgumentError.new("Invalid ed command: #{patch_line.chomp}") if m.nil?
    first_line = (m[1] || current_line).to_i
    last_line = (m[2] || first_line).to_i
    command = m[3]
    case command
    when "s/.//"
      (first_line..last_line).each { |i| text[i - 1].sub!(/./, '') }
    else
      if ['d', 'c'].include?(command)
        text[first_line - 1 .. last_line - 1] = []
      end
      if ['a', 'c'].include?(command)
        current_line = first_line - (command=='a' ? 0 : 1) # Adds are 0-indexed, but Changes and Deletes are 1-indexed
        while (patch_line = patch.gets) && (patch_line.chomp! != '.') && (patch_line != '.')
          text.insert(current_line, patch_line)
          current_line += 1
        end
      end
    end
  end
  text.join("\n")
end

Ellipsizing a set of names

OK, I'm sure somebody, somewhere must have come up with an algorithm for this already, so I figured I'd ask before I go off to (re)invent it myself.
I have a list of arbitrary (user-entered) non-empty text strings. Each string can be any length (except 0), and they're all unique. I want to display them to the user, but I want to trim them to some fixed length that I decide, and replace part of them with an ellipsis (...). The catch is that I want all of the output strings to be unique.
For example, if I have the strings:
Microsoft Internet Explorer 6
Microsoft Internet Explorer 7
Microsoft Internet Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
then I wouldn't want to trim the ends of the strings, because that's the unique part (don't want to display "Microsoft Internet ..." 3 times), but it's OK to cut out the middle part:
Microsoft...rer 6
Microsoft...rer 7
Microsoft...rer 8
Mozilla Firefox 3
Mozilla Firefox 4
Google Chrome 14
Other times, the middle part might be unique, and I'd want to trim the end:
Minutes of Company Meeting, 5/25/2010 -- Internal use only
Minutes of Company Meeting, 6/24/2010 -- Internal use only
Minutes of Company Meeting, 7/23/2010 -- Internal use only
could become:
Minutes of Company Meeting, 5/25/2010...
Minutes of Company Meeting, 6/24/2010...
Minutes of Company Meeting, 7/23/2010...
I guess it should probably never ellipsize the very beginning of the strings, even if that would otherwise be allowed, since that would look weird. And I guess it could ellipsize more than one place in the string, but within reason -- maybe 2 times would be OK, but 3 or more seems excessive. Or maybe the number of times isn't as important as the size of the chunks that remain: less than about 5 characters between ellipses would be rather pointless.
The inputs (both number and size) won't be terribly large, so performance is not a major concern (well, as long as the algorithm doesn't try something silly like enumerating all possible strings until it finds a set that works!).
I guess these requirements seem pretty specific, but I'm actually fairly lenient -- I'm just trying to describe what I have in mind.
Has something like this been done before? Is there some existing algorithm or library that does this? I've googled some but found nothing quite like this so far (but maybe I'm just bad at googling). I have to believe somebody somewhere has wanted to solve this problem already!
It sounds like an application of the longest common substring problem.
Replace the longest substring common to all strings with ellipsis. If the string is still too long and you are allowed to have another ellipsis, repeat.
You have to realize that you might not be able to "ellipsize" a given set of strings enough to meet length requirements.
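A brute-force sketch of that idea in Python (the function names and the head/tail parameters are my own additions, not part of any library): find the longest substring shared by every string in a group, then replace its middle with an ellipsis while keeping a few characters from each end so the very beginning of the string is never cut.

```python
def longest_common_substring(strings):
    """Longest substring present in every string (brute force, fine for small inputs)."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def ellipsize_common(strings, head=9, tail=4, ellipsis="..."):
    """Shorten the shared part of each string to head + ellipsis + tail characters."""
    common = longest_common_substring(strings)
    if len(common) <= head + tail + len(ellipsis):
        return list(strings)  # the shared part is too short to be worth cutting
    shortened = common[:head] + ellipsis + common[len(common) - tail:]
    return [s.replace(common, shortened, 1) for s in strings]
```

Applied to the three "Microsoft Internet Explorer N" strings from the question, this gives "Microsoft...rer 6", "Microsoft...rer 7" and "Microsoft...rer 8". The strings need to be grouped first, since a substring common to the whole mixed list (Microsoft, Mozilla and Google entries together) would be too short to be useful.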
Sort the strings. Keep the first X characters of each string. If this prefix is not unique to the string before and after, then advance until unique characters (compared to the string before and after) are found. (If no unique characters are found, the string has no unique part, see bottom of post) Add ellipses before and after those unique characters.
Note that this still might look funny:
Microsoft Office -> Micro...ffice
Microsoft Outlook -> Micro...utlook
I don't know what language you're looking to do this in, but here's a Python implementation.
def unique_index(before, current, after, size):
    '''Returns the index of the first part of _current_ of length _size_ that is
    unique to it, _before_, and _after_. If _current_ has no part unique to it,
    _before_, and _after_, it returns the _size_ letters at the end of _current_'''
    before_unique = False
    after_unique = False
    for i in range(len(current)-size):
        #this will be incorrect in the case mentioned below
        if i > len(before)-1 or before[i] != current[i]:
            before_unique = True
        if i > len(after)-1 or after[i] != current[i]:
            after_unique = True
        if before_unique and after_unique:
            return i
    return len(current)-size
def ellipsize(entries, prefix_size, max_string_length):
    non_prefix_size = max_string_length - prefix_size #-len("...")? Post isn't clear about this.
    #If you want to preserve order then make a copy and make a mapping from the copy to the original
    entries.sort()
    ellipsized = []
    # you could probably remove all this indexing with something out of itertools
    for i in range(len(entries)):
        current = entries[i]
        #entry is already short enough, don't need to truncate
        if len(current) <= max_string_length:
            ellipsized.append(current)
            continue
        #grab empty strings if there's no string before/after
        if i == 0:
            before = ''
        else:
            before = entries[i-1]
        if i == len(entries)-1:
            after = ''
        else:
            after = entries[i+1]
        #Is the prefix unique? If so, we're done.
        current_prefix = entries[i][:prefix_size]
        if not before.startswith(current_prefix) and not after.startswith(current_prefix):
            ellipsized.append(current[:max_string_length] + '...') #again, possibly -3
        #Otherwise find the unique part after the prefix if it exists.
        else:
            index = prefix_size + unique_index(before[prefix_size:], current[prefix_size:], after[prefix_size:], non_prefix_size)
            if index == prefix_size:
                header = ''
            else:
                header = '...'
            if index + non_prefix_size == len(current):
                trailer = ''
            else:
                trailer = '...'
            ellipsized.append(entries[i][:prefix_size] + header + entries[i][index:index+non_prefix_size] + trailer)
    return ellipsized
Also, you mention the strings themselves are unique, but do they all have unique parts? For example, "Microsoft" and "Microsoft Internet Explorer 7" are two different strings, but the first has no part that is unique from the second. If this is the case, then you'll have to add something to your spec as to what to do to make this case unambiguous. (If you add "Xicrosoft", "MXcrosoft", "MiXrosoft", etc. to the mix with these two strings, there is no unique string shorter than the original string to represent "Microsoft".) (Another way to think about it: if you have all possible X-letter strings, you can't compress them all to strings of X-1 or fewer letters. Just like no compression method can compress all inputs, as this is essentially a compression method.)
Results from original post:
>>> for entry in ellipsize(["Microsoft Internet Explorer 6", "Microsoft Internet Explorer 7", "Microsoft Internet Explorer 8", "Mozilla Firefox 3", "Mozilla Firefox 4", "Google Chrome 14"], 7, 20):
    print(entry)
Google Chrome 14
Microso...et Explorer 6
Microso...et Explorer 7
Microso...et Explorer 8
Mozilla Firefox 3
Mozilla Firefox 4
>>> for entry in ellipsize(["Minutes of Company Meeting, 5/25/2010 -- Internal use only", "Minutes of Company Meeting, 6/24/2010 -- Internal use only", "Minutes of Company Meeting, 7/23/2010 -- Internal use only"], 15, 40):
    print(entry)
Minutes of Comp...5/25/2010 -- Internal use...
Minutes of Comp...6/24/2010 -- Internal use...
Minutes of Comp...7/23/2010 -- Internal use...

RS232c VB6 help

Hey all, I am trying to turn on an A/V receiver with an RS232 command using the VB6 comm32. To turn it on, it says to use:
Command code    Parameter code    CR      Code set example
PW              ON                <CR>    PWON<CR>
And this is the VB6 code I am currently using that doesn't seem to work...
MSComm.CommPort = 2
MSComm.Settings = "9600,n,8,1"
MSComm.PortOpen = True
If Not MSComm.PortOpen Then
    MsgBox "not opened"
Else
    MSComm.Output = "PWON" & Chr(13)
    Do While MSComm.InBufferCount > 0
        Text1.Text = Text1.Text & MSComm.Input
    Loop
End If
The receiver never turns on. What could I be doing incorrectly? I checked to make sure the COM port was 2, and it is.
David
You are just sending the characters <CR> rather than a real carriage return (ASCII code 13). Documentation for serial peripherals often puts the names of control characters in brackets (see Wikipedia for a list of them). You need the line:
MSComm.Output = "PWON" & Chr(13)
It also seems that the code that follows to read data from the serial port should be changed because if the data has not arrived in the serial port's buffer yet, it will read nothing. Take a look at Microsoft's example for how to do so. You could decide to stop reading once a particular substring in the input has been found, once a certain number of bytes have been read (Len function), etc.
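The "keep reading until a particular substring has arrived" approach might look roughly like the following. It is a language-neutral sketch written in Python rather than VB6, and read_available is a hypothetical callable standing in for whatever returns the bytes currently waiting in the port's input buffer (the role MSComm.Input plays above).

```python
import time

def read_until(read_available, terminator=b"\r", timeout=2.0, poll_interval=0.05):
    """Collect bytes until `terminator` shows up or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    buffer = b""
    while terminator not in buffer and time.monotonic() < deadline:
        chunk = read_available()  # returns b"" when nothing has arrived yet
        if chunk:
            buffer += chunk
        else:
            time.sleep(poll_interval)  # give the device time to answer
    return buffer
```

The point of the loop is that an empty buffer right after sending the command does not mean there is no reply; the reader keeps polling until the terminator is seen or the deadline passes, and the caller can tell which happened by checking whether the result contains the terminator.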

Resources