What does "Stream did not contain valid UTF-8" mean?

I'm creating a simple HTTP server. I need to read the requested image and send it to browser. I'm using this code:
fn read_file(mut file_name: String) -> String {
file_name = file_name.replace("/", "");
if file_name.is_empty() {
file_name = String::from("index.html");
}
let path = Path::new(&file_name);
if !path.exists() {
return String::from("Not Found!");
}
let mut file_content = String::new();
let mut file = File::open(&file_name).expect("Unable to open file");
let res = match file.read_to_string(&mut file_content) {
Ok(content) => content,
Err(why) => panic!("{}",why),
};
return file_content;
}
This works if the requested file is text based, but when I want to read an image I get the following message:
stream did not contain valid UTF-8
What does it mean and how to fix it?

The documentation for String describes it as:
A UTF-8 encoded, growable string.
The Wikipedia definition of UTF-8 will give you a great deal of background on what that is. The short version is that computers use a unit called a byte to represent data. Unfortunately, these blobs of data represented with bytes have no intrinsic meaning; that has to be provided from outside. UTF-8 is one way of interpreting a sequence of bytes, as are file formats like JPEG.
UTF-8, like most text encodings, has specific requirements and sequences of bytes that are valid and invalid. Whatever image you have tried to load contains a sequence of bytes that cannot be interpreted as a UTF-8 string; this is what the error message is telling you.
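As a minimal sketch of that point, the standard library's `std::str::from_utf8` accepts bytes that form valid UTF-8 and rejects the kind of bytes an image file starts with. The helper name `as_text` and the two sample constants are illustrative assumptions, not taken from the question.

```rust
use std::str;

/// Hypothetical helper: returns the bytes as text only if they are valid UTF-8.
fn as_text(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Bytes of the ASCII text "hello", which is also valid UTF-8.
const TEXT_BYTES: &[u8] = b"hello";

/// The first bytes of a typical JPEG file; the byte 0xFF never occurs in UTF-8.
const JPEG_HEADER: &[u8] = &[0xFF, 0xD8, 0xFF, 0xE0];
```

Calling `as_text(TEXT_BYTES)` yields `Some("hello")`, while `as_text(JPEG_HEADER)` yields `None`. `read_to_string` performs the same validation and reports the failure as an error.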
To fix it, you should not use a String to hold arbitrary collections of bytes. In Rust, that's better represented by a Vec<u8>:
use std::fs::File;
use std::io::Read;
use std::path::Path;

fn read_file(mut file_name: String) -> Vec<u8> {
    file_name = file_name.replace("/", "");
    if file_name.is_empty() {
        file_name = String::from("index.html");
    }
    let path = Path::new(&file_name);
    if !path.exists() {
        return String::from("Not Found!").into();
    }
    // Read the raw bytes; no UTF-8 validation is performed
    let mut file_content = Vec::new();
    let mut file = File::open(&file_name).expect("Unable to open file");
    file.read_to_end(&mut file_content).expect("Unable to read");
    file_content
}
To evangelize a bit, this is a great aspect of why Rust is a nice language. Because there is a type that represents "a set of bytes that is guaranteed to be a valid UTF-8 string", we can write safer programs since we know that this invariant will always be true. We don't have to keep checking throughout our program to "make sure" it's still a string.

Related

How do I retrieve a string from a PWSTR after a Win32 function succeeds?

I would like to get my username in an std::String using the windows-rs crate.
use bindings::Windows::Win32::{
System::WindowsProgramming::GetUserNameW,
Foundation::PWSTR,
};
fn main() {
let mut pcbbuffer: u32 = 255;
let mut helper: u16 = 0;
let lpbuffer = PWSTR(&mut helper);
println!("lpbuffer: {:?}\npcbbuffer: {:?}", lpbuffer, pcbbuffer);
unsafe {
let success = GetUserNameW(lpbuffer, &mut pcbbuffer);
println!("GetUserNameW succeeded: {:?}\nlpbuffer: {:?}\npcbbuffer: {:?}", success.as_bool(), lpbuffer, pcbbuffer);
}
}
produces the output:
lpbuffer: PWSTR(0xca20f5f76e)
pcbbuffer: 255
GetUserNameW succeeded: true
lpbuffer: PWSTR(0x7200650073)
pcbbuffer: 5
The username is "user" that's 4 + 1 terminating character = 5 which is good. I also see the GetUserNameW function succeeded and the pointer to the string changed.
What are the next steps?
The code as posted works by coincidence alone. It sports a spectacular buffer overflow, hardly what you'd want to see in Rust code. Specifically, you're taking the address of a single u16 value and passing it into an API, telling it that the pointed-to memory is 255 elements in size.
That needs to be solved: You will have to allocate a buffer large enough to hold the API's output first.
Converting a UTF-16 encoded string to a Rust String with its native encoding can be done in several ways, such as String::from_utf16_lossy().
The following code roughly sketches out the approach:
use std::slice;

use bindings::Windows::Win32::{
    Foundation::PWSTR,
    System::WindowsProgramming::GetUserNameW,
};

fn main() {
    // UNLEN (256) characters plus the terminating NUL
    let mut cb_buffer = 257_u32;
    // Create a buffer of the required size
    let mut buffer = Vec::<u16>::with_capacity(cb_buffer as usize);
    // Construct a `PWSTR` by taking the address to the first element in the buffer
    let lp_buffer = PWSTR(buffer.as_mut_ptr());
    let result = unsafe { GetUserNameW(lp_buffer, &mut cb_buffer) };
    // If the API returned success, and more than 0 characters were written
    if result.as_bool() && cb_buffer > 0 {
        // Construct a slice over the valid data (excluding the terminating NUL)
        let buffer = unsafe { slice::from_raw_parts(lp_buffer.0, cb_buffer as usize - 1) };
        // And convert from UTF-16 to Rust's native encoding
        let user_name = String::from_utf16_lossy(buffer);
        println!("User name: {}", user_name);
    }
}
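The UTF-16 conversion step can be tried on any platform without the Win32 call. The following is a minimal sketch under that assumption: `wide_to_string` and `sample_buffer` are hypothetical helper names, and the sample buffer merely stands in for what the API would write.

```rust
/// Hypothetical helper: converts a UTF-16 buffer to a `String`, stopping at
/// the first NUL terminator if one is present.
fn wide_to_string(buffer: &[u16]) -> String {
    let len = buffer
        .iter()
        .position(|&unit| unit == 0)
        .unwrap_or(buffer.len());
    // Invalid UTF-16 (e.g. an unpaired surrogate) becomes U+FFFD
    String::from_utf16_lossy(&buffer[..len])
}

/// Hypothetical helper: builds a NUL-terminated UTF-16 buffer from `text`,
/// standing in for the data an API such as GetUserNameW would write.
fn sample_buffer(text: &str) -> Vec<u16> {
    text.encode_utf16().chain(std::iter::once(0)).collect()
}
```

For the user name "user", `sample_buffer("user")` holds five code units (four characters plus the terminator), and `wide_to_string` returns the four-character string.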

How do I remove the \\?\ prefix from a canonical Windows path?

On Windows, Path::canonicalize() returns the path in the format:
\\\\?\\C:\\projects\\3rdparty\\rust...
This is because it is the correct canonical path, and allows 'long' paths on Windows (see Why does my canonicalized path get prefixed with \\?\).
However, this is not a user-friendly path, and people do not understand it.
For display and logging purposes how can I easily remove this prefix in a generic platform independent way?
Path::components will return a component \\?\C: as the first component...
Should I convert this to a &str and use a regex? Is there some other more ergonomic method for removing the prefix, e.g. some type with a Display implementation that automatically does the right thing?
My requirements specifically are:
Correctly displays X:\\... for a canonical path on Windows.
Doesn't screw up non-Windows platforms (e.g. strip or change path components)
Example:
use std::path::{Path, PathBuf};
fn simple_path<P: AsRef<Path>>(p: P) -> String {
String::from(p.as_ref().to_str().unwrap()) // <-- ?? What to do here?
}
pub fn main() {
let path = PathBuf::from(r"C:\temp").canonicalize().unwrap();
let display_path = simple_path(path);
println!("Output: {}", display_path);
}
Use the dunce crate:
extern crate dunce;
…
let compatible_path = dunce::canonicalize(&any_path);
Just stripping \\?\ may give wrong/invalid paths. The dunce crate checks whether the UNC path is compatible and converts the path accurately whenever possible. It passes through all other paths. It compiles to plain fs::canonicalize() on non-Windows.
The straightforward answer is to do platform-specific string munging:
use std::path::{Path, PathBuf};
#[cfg(not(target_os = "windows"))]
fn adjust_canonicalization<P: AsRef<Path>>(p: P) -> String {
p.as_ref().display().to_string()
}
#[cfg(target_os = "windows")]
fn adjust_canonicalization<P: AsRef<Path>>(p: P) -> String {
const VERBATIM_PREFIX: &str = r#"\\?\"#;
let p = p.as_ref().display().to_string();
if p.starts_with(VERBATIM_PREFIX) {
p[VERBATIM_PREFIX.len()..].to_string()
} else {
p
}
}
pub fn main() {
let path = PathBuf::from(r#"C:\Windows\System32"#)
.canonicalize()
.unwrap();
let display_path = adjust_canonicalization(path);
println!("Output: {}", display_path);
}
For the record, I don't agree that your premise is a good idea. Windows Explorer handles these verbatim paths just fine, and I think users are capable of handling it as well.
For [...] logging purposes
This sounds like a terrible idea. If you are logging something, you want to know the exact path, not some potentially incorrect path.
Here's a version that reconstructs the path from the components.
It helps with std::fs::canonicalize on Windows, but a naive Path::new(r"\\?\C:\projects\3rdparty\rust") at play.rust-lang.org will produce a single-component Path.
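use std::path::{Component, Path, PathBuf, Prefix};

// Should remove the “\\?” prefix from the canonical path
// in order to avoid CMD bailing with “UNC paths are not supported”.
// Wrapped in a function (the name is illustrative) so that `?` has a Result to return into.
fn strip_verbatim_disk(path: &Path) -> Result<PathBuf, &'static str> {
    let head = path.components().next().ok_or("empty path?")?;
    let diskˢ;
    let head = if let Component::Prefix(prefix) = head {
        if let Prefix::VerbatimDisk(disk) = prefix.kind() {
            // Rebuild the prefix as a plain drive letter, e.g. "C:"
            diskˢ = format!("{}:", disk as char);
            Path::new(&diskˢ).components().next().ok_or("empty path?")?
        } else {
            head
        }
    } else {
        head
    };
    println!("{:?}", head);
    // Reassemble the path from the new head and the remaining components
    let path = std::iter::once(head)
        .chain(path.components().skip(1))
        .collect::<PathBuf>();
    Ok(path)
}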

How to replace emoji characters with their descriptions in a Swift string

I'm looking for a way to replace emoji characters with their description in a Swift string.
Example:
Input "This is my string 😄"
I'd like to replace the 😄 to get:
Output "This is my string {SMILING FACE WITH OPEN MOUTH AND SMILING EYES}"
To date I'm using this code modified from the original code of this answer by MartinR, but it works only if I deal with a single character.
let myCharacter : Character = "😄"
let cfstr = NSMutableString(string: String(myCharacter)) as CFMutableString
var range = CFRangeMake(0, CFStringGetLength(cfstr))
CFStringTransform(cfstr, &range, kCFStringTransformToUnicodeName, Bool(0))
var newStr = "\(cfstr)"
// removing "\N" from the result: \N{SMILING FACE WITH OPEN MOUTH AND SMILING EYES}
newStr = newStr.stringByReplacingOccurrencesOfString("\\N", withString:"")
print("\(newStr)") // {SMILING FACE WITH OPEN MOUTH AND SMILING EYES}
How can I achieve this?
Simply do not use a Character in the first place but use a String as input:
let cfstr = NSMutableString(string: "This 😄 is my string 😄") as CFMutableString
that will finally output
This {SMILING FACE WITH OPEN MOUTH AND SMILING EYES} is my string {SMILING FACE WITH OPEN MOUTH AND SMILING EYES}
Put together:
func transformUnicode(input : String) -> String {
let cfstr = NSMutableString(string: input) as CFMutableString
var range = CFRangeMake(0, CFStringGetLength(cfstr))
CFStringTransform(cfstr, &range, kCFStringTransformToUnicodeName, Bool(0))
let newStr = "\(cfstr)"
return newStr.stringByReplacingOccurrencesOfString("\\N", withString:"")
}
transformUnicode("This 😄 is my string 😄")
Here is a complete implementation.
It avoids converting non-emoji characters to their descriptions as well (e.g. it does not convert “ to {LEFT DOUBLE QUOTATION MARK}). To accomplish this, it uses an extension, based on this answer by Arnold, that returns true or false depending on whether a string contains an emoji.
The other part of the code is based on this answer by MartinR and the answer and comments to this answer by luk2302.
var str = "Hello World 😄 …" // our string (with an emoji and a horizontal ellipsis)
let newStr = str.characters.reduce("") { // loop through str individual characters
var item = "\($1)" // string with the current char
let isEmoji = item.containsEmoji // true or false
if isEmoji {
item = item.stringByApplyingTransform(String(kCFStringTransformToUnicodeName), reverse: false)!
}
return $0 + item
}.stringByReplacingOccurrencesOfString("\\N", withString:"") // strips "\N"
extension String {
var containsEmoji: Bool {
for scalar in unicodeScalars {
switch scalar.value {
case 0x1F600...0x1F64F, // Emoticons
0x1F300...0x1F5FF, // Misc Symbols and Pictographs
0x1F680...0x1F6FF, // Transport and Map
0x2600...0x26FF, // Misc symbols
0x2700...0x27BF, // Dingbats
0xFE00...0xFE0F, // Variation Selectors
0x1F900...0x1F9FF: // Various (e.g. 🤖)
return true
default:
continue
}
}
return false
}
}
print (newStr) // Hello World {SMILING FACE WITH OPEN MOUTH AND SMILING EYES} …
Please note that some emoji may not be covered by the ranges in this code, so you should check that all the emoji you need are converted when you implement it.

Decoding quoted-printable messages in Swift

I have a quoted-printable string such as "The cost would be =C2=A31,000". How do I convert this to "The cost would be £1,000"?
I'm just converting text manually at the moment and this doesn't cover all cases. I'm sure there is just one line of code that will help with this.
Here is my code:
func decodeUTF8(message: String) -> String
{
var newMessage = message.stringByReplacingOccurrencesOfString("=2E", withString: ".", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=E2=80=A2", withString: "•", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=C2=A3", withString: "£", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=A3", withString: "£", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=E2=80=9C", withString: "\"", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=E2=80=A6", withString: "…", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=E2=80=9D", withString: "\"", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=92", withString: "'", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=3D", withString: "=", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=20", withString: "", options: NSStringCompareOptions.LiteralSearch, range: nil)
newMessage = newMessage.stringByReplacingOccurrencesOfString("=E2=80=99", withString: "'", options: NSStringCompareOptions.LiteralSearch, range: nil)
return newMessage
}
Thanks
An easy way would be to utilize the (NS)String method
stringByRemovingPercentEncoding for this purpose.
This was observed in
decoding quoted-printables,
so the first solution is mainly a translation of the answers in
that thread to Swift.
The idea is to replace the quoted-printable "=NN" encoding by the
percent encoding "%NN" and then use the existing method to remove
the percent encoding.
Continuation lines are handled separately.
Also, percent characters in the input string must be encoded first,
otherwise they would be treated as the leading character in a percent
encoding.
func decodeQuotedPrintable(message : String) -> String? {
return message
.stringByReplacingOccurrencesOfString("=\r\n", withString: "")
.stringByReplacingOccurrencesOfString("=\n", withString: "")
.stringByReplacingOccurrencesOfString("%", withString: "%25")
.stringByReplacingOccurrencesOfString("=", withString: "%")
.stringByRemovingPercentEncoding
}
The function returns an optional string which is nil for invalid input.
Invalid input can be:
A "=" character which is not followed by two hexadecimal digits,
e.g. "=XX".
A "=NN" sequence which does not decode to a valid UTF-8 sequence,
e.g. "=E2=64".
Examples:
if let decoded = decodeQuotedPrintable("=C2=A31,000") {
print(decoded) // £1,000
}
if let decoded = decodeQuotedPrintable("=E2=80=9CHello =E2=80=A6 world!=E2=80=9D") {
print(decoded) // “Hello … world!”
}
Update 1: The above code assumes that the message uses the UTF-8
encoding for quoting non-ASCII characters, as in most of your examples: C2 A3 is the UTF-8 encoding for "£", E2 80 A6 is the UTF-8 encoding for "…".
If the input is "Rub=E9n" then the message is using the
Windows-1252 encoding.
To decode that correctly, you have to replace
.stringByRemovingPercentEncoding
by
.stringByReplacingPercentEscapesUsingEncoding(NSWindowsCP1252StringEncoding)
There are also ways to detect the encoding from a "Content-Type"
header field, compare e.g. https://stackoverflow.com/a/32051684/1187415.
Update 2: The stringByReplacingPercentEscapesUsingEncoding
method is marked as deprecated, so the above code will always generate
a compiler warning. Unfortunately, it seems that no alternative method
has been provided by Apple.
So here is a new, completely self-contained decoding method which
does not cause any compiler warning. This time I have written it
as an extension method for String. Explanatory comments are in the
code.
extension String {
/// Returns a new string made by removing in the `String` all "soft line
/// breaks" and replacing all quoted-printable escape sequences with the
/// matching characters as determined by a given encoding.
/// - parameter encoding: A string encoding. The default is UTF-8.
/// - returns: The decoded string, or `nil` for invalid input.
func decodeQuotedPrintable(encoding enc : NSStringEncoding = NSUTF8StringEncoding) -> String? {
// Handle soft line breaks, then replace quoted-printable escape sequences.
return self
.stringByReplacingOccurrencesOfString("=\r\n", withString: "")
.stringByReplacingOccurrencesOfString("=\n", withString: "")
.decodeQuotedPrintableSequences(enc)
}
/// Helper function doing the real work.
/// Decode all "=HH" sequences with respect to the given encoding.
private func decodeQuotedPrintableSequences(enc : NSStringEncoding) -> String? {
var result = ""
var position = startIndex
// Find the next "=" and copy characters preceding it to the result:
while let range = rangeOfString("=", range: position ..< endIndex) {
result.appendContentsOf(self[position ..< range.startIndex])
position = range.startIndex
// Decode one or more successive "=HH" sequences to a byte array:
let bytes = NSMutableData()
repeat {
let hexCode = self[position.advancedBy(1) ..< position.advancedBy(3, limit: endIndex)]
if hexCode.characters.count < 2 {
return nil // Incomplete hex code
}
guard var byte = UInt8(hexCode, radix: 16) else {
return nil // Invalid hex code
}
bytes.appendBytes(&byte, length: 1)
position = position.advancedBy(3)
} while position != endIndex && self[position] == "="
// Convert the byte array to a string, and append it to the result:
guard let dec = String(data: bytes, encoding: enc) else {
return nil // Decoded bytes not valid in the given encoding
}
result.appendContentsOf(dec)
}
// Copy remaining characters to the result:
result.appendContentsOf(self[position ..< endIndex])
return result
}
}
Example usage:
if let decoded = "=C2=A31,000".decodeQuotedPrintable() {
print(decoded) // £1,000
}
if let decoded = "=E2=80=9CHello =E2=80=A6 world!=E2=80=9D".decodeQuotedPrintable() {
print(decoded) // “Hello … world!”
}
if let decoded = "Rub=E9n".decodeQuotedPrintable(encoding: NSWindowsCP1252StringEncoding) {
print(decoded) // Rubén
}
Update for Swift 4 (and later):
extension String {
/// Returns a new string made by removing in the `String` all "soft line
/// breaks" and replacing all quoted-printable escape sequences with the
/// matching characters as determined by a given encoding.
/// - parameter encoding: A string encoding. The default is UTF-8.
/// - returns: The decoded string, or `nil` for invalid input.
func decodeQuotedPrintable(encoding enc : String.Encoding = .utf8) -> String? {
// Handle soft line breaks, then replace quoted-printable escape sequences.
return self
.replacingOccurrences(of: "=\r\n", with: "")
.replacingOccurrences(of: "=\n", with: "")
.decodeQuotedPrintableSequences(encoding: enc)
}
/// Helper function doing the real work.
/// Decode all "=HH" sequences with respect to the given encoding.
private func decodeQuotedPrintableSequences(encoding enc : String.Encoding) -> String? {
var result = ""
var position = startIndex
// Find the next "=" and copy characters preceding it to the result:
while let range = range(of: "=", range: position..<endIndex) {
result.append(contentsOf: self[position ..< range.lowerBound])
position = range.lowerBound
// Decode one or more successive "=HH" sequences to a byte array:
var bytes = Data()
repeat {
let hexCode = self[position...].dropFirst().prefix(2)
if hexCode.count < 2 {
return nil // Incomplete hex code
}
guard let byte = UInt8(hexCode, radix: 16) else {
return nil // Invalid hex code
}
bytes.append(byte)
position = index(position, offsetBy: 3)
} while position != endIndex && self[position] == "="
// Convert the byte array to a string, and append it to the result:
guard let dec = String(data: bytes, encoding: enc) else {
return nil // Decoded bytes not valid in the given encoding
}
result.append(contentsOf: dec)
}
// Copy remaining characters to the result:
result.append(contentsOf: self[position ..< endIndex])
return result
}
}
Example usage:
if let decoded = "=C2=A31,000".decodeQuotedPrintable() {
print(decoded) // £1,000
}
if let decoded = "=E2=80=9CHello =E2=80=A6 world!=E2=80=9D".decodeQuotedPrintable() {
print(decoded) // “Hello … world!”
}
if let decoded = "Rub=E9n".decodeQuotedPrintable(encoding: .windowsCP1252) {
print(decoded) // Rubén
}
This encoding is called 'quoted-printable'. What you need to do is convert the string to NSData using ASCII encoding, iterate over the data replacing every 3-character sequence like '=A3' with the corresponding byte (0xA3), and then convert the resulting data to a string using NSUTF8StringEncoding.
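The byte-level approach described above is language-neutral, so here is a minimal sketch of it in Rust rather than Swift. The function name `decode_quoted_printable` is an assumption for illustration; the sketch also drops soft line breaks ("=" at the end of a line) and assumes the decoded bytes are UTF-8.

```rust
/// Decodes quoted-printable text into raw bytes, then interprets them as UTF-8.
/// Returns `None` for a malformed "=XX" escape or for bytes that are not UTF-8.
fn decode_quoted_printable(input: &str) -> Option<String> {
    let src = input.as_bytes();
    let mut out = Vec::with_capacity(src.len());
    let mut i = 0;
    while i < src.len() {
        if src[i] != b'=' {
            // Ordinary byte: copy it through unchanged
            out.push(src[i]);
            i += 1;
        } else if src[i + 1..].starts_with(b"\r\n") {
            i += 3; // soft line break "=\r\n"
        } else if src[i + 1..].starts_with(b"\n") {
            i += 2; // soft line break "=\n"
        } else {
            // Expect exactly two hexadecimal digits after '='
            let hex = src.get(i + 1..i + 3)?;
            if !hex.iter().all(u8::is_ascii_hexdigit) {
                return None;
            }
            let hex = std::str::from_utf8(hex).ok()?;
            out.push(u8::from_str_radix(hex, 16).ok()?);
            i += 3;
        }
    }
    // Validate the collected bytes as UTF-8 once, at the end
    String::from_utf8(out).ok()
}
```

Collecting bytes first and validating once at the end matters because a single character such as "£" is spread over several "=XX" escapes. A message in another encoding, such as Windows-1252, would need a different final conversion step.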
Unfortunately, I'm a bit late with my answer. It might be helpful for the others though.
var string = "The cost would be =C2=A31,000"
var finalString: String? = nil
if let regEx = try? NSRegularExpression(pattern: "={1}?([a-f0-9]{2}?)", options: NSRegularExpressionOptions.CaseInsensitive)
{
let intermediatePercentEscapedString = regEx.stringByReplacingMatchesInString(string, options: NSMatchingOptions.WithTransparentBounds, range: NSMakeRange(0, string.utf16.count), withTemplate: "%$1")
print(intermediatePercentEscapedString)
finalString = intermediatePercentEscapedString.stringByRemovingPercentEncoding
print(finalString)
}
In order to give an applicable solution, a little more information is required, so I will make some assumptions.
In an HTML or mail message, for example, you can apply one or more encodings to some kind of source data. For example, you could encode a binary file, e.g. a PNG file, with base64 and then zip it. The order is important.
In your example, as you say, the source data is a String and has been encoded via UTF-8.
In an HTTP message, your Content-Type is thus text/plain; charset=UTF-8. In your example there also seems to be an additional encoding applied,
a "Content-Transfer-Encoding": possibly Content-Transfer-Encoding is quoted-printable or base64 (not sure about that, though).
In order to revert it, you would need to apply the corresponding decodings in reverse order.
Hint:
You can view the headers (Content-Type and Content-Transfer-Encoding) of a mail message when viewing the raw source of the mail.
You can also look at this working solution - https://github.com/dunkelstern/QuotedPrintable
let result = QuotedPrintable.decode(string: quoted)

lifetime not long enough rust

I want to open a file, replace some characters, and make some splits. Then I want to return the list of strings. However, I get the error: broken does not live long enough. My code works when it is in main, so it is only an issue with lifetimes.
fn tokenize<'r>(fp: &'r str) -> Vec<&'r str> {
let data = match File::open(&Path::new(fp)).read_to_string(){
Ok(n) => n,
Err(e) => fail!("couldn't read file: {}", e.desc)
};
let broken = data.replace("'", " ' ").replace("\"", " \" ").replace(" ", " ");
let mut tokens = vec![];
for t in broken.as_slice().split_str(" ").filter(|&x| *x != "\n"){
tokens.push(t)
}
return tokens;
}
How can I make the value returned by this function live in the scope of the caller?
The problem is that your function signature says "the result has the same lifetime as the input fp", but that's simply not true. The result contains references to data, which is allocated inside your function; it has nothing to do with fp! As it stands, data will cease to exist at the end of your function.
Because you're effectively creating new values, you can't return references; you need to transfer ownership of that data out of the function. There are two ways I can think of to do this, off the top of my head:
Instead of returning Vec<&str>, return Vec<String>, where each token is a freshly-allocated string.
Return data inside a wrapper type which implements the splitting logic. Then, you can have fn get_tokens(&self) -> Vec<&str>; the lifetime of the slices can be tied to the lifetime of the object which contains data.
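Both options can be sketched in current Rust roughly as follows. This is an illustration under stated assumptions, not the asker's exact logic: file reading is left out, `split_whitespace` replaces the original splitting and filtering, and the names `tokenize_owned` and `Tokens` are hypothetical.

```rust
/// Option 1: return owned tokens, so nothing borrows from local data.
fn tokenize_owned(text: &str) -> Vec<String> {
    // Surround quote characters with spaces so they become separate tokens
    let spaced = text.replace('\'', " ' ").replace('"', " \" ");
    spaced.split_whitespace().map(String::from).collect()
}

/// Option 2: a wrapper that owns the processed text and lends out slices.
struct Tokens {
    data: String,
}

impl Tokens {
    fn new(text: &str) -> Self {
        Tokens {
            data: text.replace('\'', " ' ").replace('"', " \" "),
        }
    }

    /// The returned slices borrow from `self`, so they stay valid for as
    /// long as the `Tokens` value is alive.
    fn get_tokens(&self) -> Vec<&str> {
        self.data.split_whitespace().collect()
    }
}
```

The first option allocates one String per token but has the simplest signature. The second avoids those allocations, at the cost of requiring the caller to keep the `Tokens` value alive while the slices are in use.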
