Visual Studio C++ 2008 Manipulating Bytes?

I'm trying to write strictly binary data to files (no encoding). The problem is, when I hex dump the files, I'm noticing rather weird behavior. Using either one of the methods below to construct a file results in the same behavior. I even tried System::Text::Encoding::Default for the streams as a test.
StreamWriter^ binWriter = gcnew StreamWriter(gcnew FileStream("test.bin",FileMode::Create));
(Also used this method)
FileStream^ tempBin = gcnew FileStream("test.bin",FileMode::Create);
BinaryWriter^ binWriter = gcnew BinaryWriter(tempBin);
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
.
.
binWriter->Write(0x9F);
Writing that sequence of bytes, I noticed the only bytes that weren't converted to 0x3F in the hex dump were 0x81, 0x8D, 0x90, 0x9D, ..., and I have no idea why.
I also tried making character arrays, and a similar thing happens, i.e.,
array<wchar_t,1>^ OT_Random_Delta_Limits = {0x00,0x00,0x03,0x79,0x00,0x00,0x04,0x88};
binWriter->Write(OT_Random_Delta_Limits);
0x88 would be written as 0x3F.

If you want to stick to binary files then don't use StreamWriter. Just use a FileStream and Write/WriteByte. StreamWriters (and TextWriters in general) are expressly designed for text. Whether you want an encoding or not, one will be applied, because when you're calling StreamWriter.Write, that's writing a char, not a byte.
Don't create arrays of wchar_t values either - again, those are for characters, i.e. text.
BinaryWriter.Write should have worked for you unless it was promoting the values to char, in which case you'd have exactly the same problem.
By the way, without specifying any encoding, I'd expect you to get not 0x3F values but instead the bytes representing the UTF-8 encoding of those characters.
When you specified Encoding.Default, you'd have seen 0x3F for any Unicode values not in that encoding.
Anyway, the basic lesson is to stick to Stream when you want to deal with binary data rather than text.
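For example, here's a minimal sketch of writing the raw bytes through a FileStream only (file name and values taken from the question; this assumes the same using-namespace setup as the question's code, and is untested):
// Write raw bytes directly; no TextWriter is involved, so no encoding is ever applied.
FileStream^ fs = gcnew FileStream("test.bin", FileMode::Create);
fs->WriteByte(0x80);    // Stream::WriteByte writes exactly one byte
fs->WriteByte(0x81);
array<System::Byte>^ limits = { 0x00, 0x00, 0x03, 0x79, 0x00, 0x00, 0x04, 0x88 };
fs->Write(limits, 0, limits->Length);    // writes the byte array verbatim
fs->Close();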
EDIT: Okay, it would be something like:
public static void ConvertHex(TextReader input, Stream output)
{
    while (true)
    {
        int firstNybble = input.Read();
        if (firstNybble == -1)
        {
            return;
        }
        int secondNybble = input.Read();
        if (secondNybble == -1)
        {
            throw new IOException("Reader finished half way through a byte");
        }
        int value = (ParseNybble(firstNybble) << 4) + ParseNybble(secondNybble);
        output.WriteByte((byte) value);
    }
}
// value would actually be a char, but as we've got an int in the above code,
// it just makes things a bit easier
private static int ParseNybble(int value)
{
    if (value >= '0' && value <= '9') return value - '0';
    if (value >= 'A' && value <= 'F') return value - 'A' + 10;
    if (value >= 'a' && value <= 'f') return value - 'a' + 10;
    throw new ArgumentException("Invalid nybble: " + (char) value);
}
This is very inefficient in terms of buffering etc, but should get you started.

A BinaryWriter() class initialized with a stream will use a default encoding of UTF8 for any chars or strings that are written. I'm guessing that the
binWriter->Write(0x80);
binWriter->Write(0x81);
.
.
binWriter->Write(0x8F);
binWriter->Write(0x90);
binWriter->Write(0x91);
calls are binding to the Write(char) overload, so they're going through the character encoder. I'm not very familiar with C++/CLI, but it seems to me that these calls should bind to Write(Int32), which shouldn't have this problem (maybe your code is really calling Write() with a char variable set to the values in your example; that would account for this behavior).
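If you keep the BinaryWriter, an explicit cast should pin the call to the single-byte overload; a small sketch (assuming the same binWriter as in the question, untested):
// Write(System::Byte) emits exactly one byte and bypasses any character encoding.
binWriter->Write((System::Byte)0x80);
binWriter->Write((System::Byte)0x81);
// By contrast, binWriter->Write(0x80) binds to Write(Int32) and emits four
// little-endian bytes: 80 00 00 00.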

0x3F is the ASCII code for the character '?'; the characters that are mapping to it are control characters with no printable representation. As Jon points out, use a binary stream rather than a text-oriented output mechanism for raw binary data.
EDIT: actually your results look like the inverse of what I would expect. In the default code page 1252, the non-printable characters (i.e. the ones likely to map to '?') in that range are 0x81, 0x8D, 0x8F, 0x90 and 0x9D.

Integers converted to strings do not work as expected

I'm having an issue comparing the result of str() to what I'd expect its String form to be.
Code:
int i = 2;
String r = str(i);
println(r);
if (r == "2") {
  println("String works");
} else println("String doesnt work");
if (i == 2) {
  println("Integer works");
} else println("Integer doesnt work");
Prints :
2
String doesnt work
Integer works
The second if statement is a copy-paste of the first, with only the variable and value changed, so there's nothing wrong with my if statement.
Processing documentation states (about str()):
Converts a value of a primitive data type (boolean, byte, char, int, or float) to its String
representation. For example, converting an integer with str(3) will return the String value of "3",
converting a float with str(-12.6) will return "-12.6", and converting a boolean with str(true) will
return "true".
It also doesn't work with str(2) == "2" or str(i) == "2".
How do I fix this and get it to work (without converting it back to an integer, because that would make my code a bit ugly)?
You should not compare String values using ==. Use the equals() function instead:
if (r.equals("2")) {
From the reference:
To compare the contents of two Strings, use the equals() method, as in if (a.equals(b)), instead of if (a == b). A String is an Object, so comparing them with the == operator only compares whether both Strings are stored in the same memory location. Using the equals() method will ensure that the actual contents are compared. (The troubleshooting reference has a longer explanation.)
More info here: How do I compare strings in Java?

Why should we use OutputStream.write(byte[] b, int off, int len) instead of OutputStream.write(byte[] b)?

Sorry, everybody. It's a Java beginner question, but I think it will be helpful for a lot of Java learners.
FileInputStream fis = new FileInputStream(file);
OutputStream os = socket.getOutputStream();
byte[] buffer = new byte[1024];
int len;
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
The code above is part of a FileSenderClient class, which sends files from a client to a server using java.io and java.net.Socket.
My question is that: in the above code, why should we use
os.write(buffer, 0, len)
instead of
os.write(buffer)
In another way to ask this question: what is the point of having a "len" parameter for "OutputStream.write()" method?
It seems both codes are working fine.
while ((len = fis.read(buffer)) != -1) {
    os.write(buffer, 0, len);
}
Because you only want to write data that you actually read. Consider the case where the input consists of N buffers plus one byte. Without the len parameter you would write (N+1)*1024 bytes instead of N*1024+1 bytes. Consider also the case of reading from a socket, or indeed the general case of reading: the actual contract of InputStream.read() is that it transfers at least one byte, not that it fills the buffer. Often it can't, for one reason or another.
It seems both codes are working fine.
No they're not.
It actually does not work in the same way.
It is very likely you used a very small text file to test. But if you look carefully, you will still find that there are a lot of extra spaces at the end of the file you received, and that the file you received is larger than the file you sent.
The reason is that you created a byte array of size 1024, but you don't have that much data to put (or read()) into it. The end of the byte array is therefore still filled with NULLs, and when you write the whole array to the file, those NULLs are written too and show up as spaces " " in Windows Notepad...
If you use an advanced text editor like Notepad++ or Sublime Text to view the file you received, you will see these NULL characters.

Why does the following code print garbage values for input strings greater than 128 bytes?

This is a problem from CodeChef that I recently came across. The answer seems to be right for every test case where the input string is less than 128 bytes, as it passes a couple of test cases. For every input greater than 128 bytes it prints out a large value which seems to be garbage.
std::string str;
std::cin >> str;
vector<pair<char,int>> v;
v.push_back(make_pair('C',0));
v.push_back(make_pair('H',0));
v.push_back(make_pair('E',0));
v.push_back(make_pair('F',0));
int i = 0;
while (1)
{
    if (str[i] == 'C')
        v['C'].second++;
    else if (str[i] == 'H')
    {
        v['H'].second++;
        v['C'].second--;
    }
    else if (str[i] == 'E')
    {
        v['E'].second++;
        v['C'].second--;
    }
    else if (str[i] == 'F')
        v['F'].second++;
    else
        break;
    i++;
}
Even enclosing the same code within
/* reading the string values from a file and not the console */
std::string input;
std::ifstream infile("input.txt");
while (getline(infile, input))
{
    istringstream in(input);
    string str;
    in >> str;
    /* above code goes here */
}
generates the same result. I am not looking for any solution(s) or hint(s) to get to the right answer, as I want to test the correctness of my algorithm myself. But I want to know why this happens, as I am new to vector containers.
Regards.
if (str[i] == 'C')
    v['C'].second++;
You're modifying v[67]
... which is not contained in your vector, and thus either invalid memory or uninitialized
You seem to be trying to use a vector as an associative array. There is already such a structure in C++: a std::map. Use that instead.
By using v['C'] you actually access element 67 of the container ('C' is 67 in ASCII), which has only 4 items. Depending on the compiler and mode (debug vs release) you get undefined behavior from this code.
What you probably wanted was a map, i.e. map<char,int> v; instead of vector<pair<char,int>> v;, and simply v['C']++; instead of v['C'].second++;
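A minimal sketch of the map version, just to illustrate the container usage (it deliberately does not reproduce the original algorithm's increments and decrements):
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::string str;
    std::cin >> str;

    std::map<char, int> v;          // operator[] creates a zero-valued entry on first use
    for (std::size_t i = 0; i < str.size(); ++i)
    {
        char ch = str[i];
        if (ch != 'C' && ch != 'H' && ch != 'E' && ch != 'F')
            break;
        v[ch]++;                    // safe for any char key, unlike indexing a 4-element vector
    }

    std::cout << v['C'] << ' ' << v['H'] << ' '
              << v['E'] << ' ' << v['F'] << '\n';
}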

Last byte in Huffman compression

I am wondering what the best way is to handle the last byte in Huffman compression. I have some nice code in C++ that can compress text files very well, but currently I must also write the number of coded chars to my coded file (it equals the input file size), because I have no idea how to handle the last byte better.
For example, say the last char to compress is 'a', whose code is 011, and I am just starting a new byte to write, so the last byte will look like:
011 + 5 bits of trash (I fill them with zeros, for example).
And when I am decoding this coded file, it may happen that 00000 (or a code with fewer zeros) is the code for some char, so I will get some trash char at the end of my decoded file.
As I wrote in the first paragraph, I avoid this by saving the number of chars of the input file in the coded file, and while decoding, I read the coded file only up to that number (not to EndOfFile, so I don't decode those example 5 zeros).
It's not really efficient; the size of the coded file is increased by that long number.
How can I handle this in a better way?
Your approach (writing the number of encoded bytes to the file) is a perfectly reasonable one. If you want to try a different avenue, you could consider inventing a new "pseudo-EOF" character that marks the end of the input (I'll denote it as □). Whenever you want to compress a string s, you instead compress the string s□. This means that when you build up your encoding tree, you would include one copy of the □ character so that you have a unique encoding for □. Then, when you write out the string to the file, you would write out the bits for the characters of the string as normal, then write out the bit pattern for □. If there are leftover bits, you can just leave them set arbitrarily.
The advantage of this approach is that as you decode the file, if at any point you find the □ character, you can immediately stop decoding bits, because you know that you have hit the end of the file. This does not require you to store the number of bytes that were written out anywhere; the encoding implicitly marks its own endpoint.
The disadvantage of this setup is that it might increase the length of the bit patterns used by certain characters, since you will need to assign a bit pattern to □ in addition to all the other characters.
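As an illustration only, decoding can then stop as soon as the pseudo-EOF leaf is reached. The node type and names below are assumptions made for this sketch, not taken from your code:
#include <ostream>
#include <vector>

// Hypothetical Huffman tree node for this sketch.
struct Node
{
    bool isLeaf;
    bool isPseudoEof;       // true only for the extra end-of-input symbol
    unsigned char symbol;   // valid when isLeaf && !isPseudoEof
    Node* zero;             // child followed on bit 0
    Node* one;              // child followed on bit 1
};

// Walk the tree bit by bit; any padding bits after the pseudo-EOF are never decoded.
void Decode(const std::vector<bool>& bits, const Node* root, std::ostream& out)
{
    const Node* cur = root;
    for (bool bit : bits)
    {
        cur = bit ? cur->one : cur->zero;
        if (cur->isLeaf)
        {
            if (cur->isPseudoEof)
                return;                              // end of the real data
            out.put(static_cast<char>(cur->symbol));
            cur = root;                              // restart at the root for the next code
        }
    }
}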
I teach an introductory programming course and we use Huffman encoding as one of our assignments. We have students use the above approach, since it's a bit easier than having to write out the number of bits or bytes before the file contents. For more details, you could take a look at this handout or these lecture slides from the course.
Hope this helps!
I know this is an old question, but still, there's an alternative, so it might help someone.
When you're writing your compressed file to output, you probably have some integer keeping track of where you are in the current byte (for bit shifting).
char c, p;
p = '\0';
int curr = 7;
while (infile.get(c))
{
    std::string trav = GetTraversal(c);
    for (int i = 0; i < trav.size(); i++)
    {
        if (trav[i] == '1')
            p += (1 << curr);
        if (--curr < 0)
        {
            outfile.put(p);
            p = '\0';
            curr = 7;
        }
    }
}
if (curr < 7)
    outfile.put(p);
At the end of this block, (curr+1)%8 equals the number of trash bits in the last data byte. You can then store it at the end as a single extra byte, and just keep it in mind when you're decompressing.
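For example, the count could be appended right after the loop above; this is just a sketch reusing the same outfile and curr variables:
// One extra trailing byte: how many low bits of the final data byte are padding
// (0 when the last data byte was completely filled).
outfile.put(static_cast<char>((curr + 1) % 8));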

Easy way to get NUMBERFMT populated with defaults?

I'm using the Windows API GetNumberFormatEx to format some numbers for display with the appropriate localization choices for the current user (e.g., to make sure they have the right separators in the right places). This is trivial when you want exactly the user default.
But in some cases I sometimes have to override the number of digits after the radix separator. That requires providing a NUMBERFMT structure. What I'd like to do is to call an API that returns the NUMBERFMT populated with the appropriate defaults for the user, and then override just the fields I need to change. But there doesn't seem to be an API to get the defaults.
Currently, I'm calling GetLocaleInfoEx over and over and then translating that data into the form NUMBERFMT requires.
NUMBERFMT fmt = {0};
::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT,
                  LOCALE_IDIGITS | LOCALE_RETURN_NUMBER,
                  reinterpret_cast<LPWSTR>(&fmt.NumDigits),
                  sizeof(fmt.NumDigits)/sizeof(WCHAR));
::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT,
                  LOCALE_ILZERO | LOCALE_RETURN_NUMBER,
                  reinterpret_cast<LPWSTR>(&fmt.LeadingZero),
                  sizeof(fmt.LeadingZero)/sizeof(WCHAR));

WCHAR szGrouping[32] = L"";
::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_SGROUPING, szGrouping,
                  ARRAYSIZE(szGrouping));
if (::lstrcmp(szGrouping, L"3;0") == 0 ||
    ::lstrcmp(szGrouping, L"3") == 0) {
    fmt.Grouping = 3;
} else if (::lstrcmp(szGrouping, L"3;2;0") == 0) {
    fmt.Grouping = 32;
} else {
    assert(false);  // unexpected grouping string
}

WCHAR szDecimal[16] = L"";
::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_SDECIMAL, szDecimal,
                  ARRAYSIZE(szDecimal));
fmt.lpDecimalSep = szDecimal;

WCHAR szThousand[16] = L"";
::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT, LOCALE_STHOUSAND, szThousand,
                  ARRAYSIZE(szThousand));
fmt.lpThousandSep = szThousand;

::GetLocaleInfoEx(LOCALE_NAME_USER_DEFAULT,
                  LOCALE_INEGNUMBER | LOCALE_RETURN_NUMBER,
                  reinterpret_cast<LPWSTR>(&fmt.NegativeOrder),
                  sizeof(fmt.NegativeOrder)/sizeof(WCHAR));
Isn't there an API that already does this?
I just wrote some code to do this last week. Alas, there does not seem to be a GetDefaultNumberFormat(LCID lcid, NUMBERFMT* fmt) function; you will have to write it yourself as you've already started. On a side note, the grouping string has a well-defined format that can be easily parsed; your current code is wrong for "3" (should be 30) and obviously will fail on more exotic groupings (though this is probably not much of a concern, really).
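For what it's worth, here's a sketch of the general conversion, assuming the documented packing rule ("3;0" -> 3, "3" -> 30, "3;2;0" -> 32); the function name is made up:
// Pack a LOCALE_SGROUPING string into the UINT that NUMBERFMT::Grouping expects.
UINT GroupingStringToUINT(const WCHAR* s)
{
    UINT packed = 0;
    for (; *s != L'\0'; ++s)
    {
        if (*s >= L'0' && *s <= L'9')
            packed = packed * 10 + (*s - L'0');
        // the L';' separators are simply skipped
    }
    // A trailing ";0" (repeat the last group) is expressed by dropping the final 0;
    // no trailing zero (no repeat) is expressed by appending one.
    if (packed != 0 && packed % 10 == 0)
        packed /= 10;       // L"3;0" -> 30 -> 3,  L"3;2;0" -> 320 -> 32
    else
        packed *= 10;       // L"3"   -> 3  -> 30
    return packed;
}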
If all you want to do is cut off the fractional digits from the end of the string, you can go with one of the default formats (like LOCALE_NAME_USER_DEFAULT), then check for the presence of the fractional separator (comma in continental languages, point in English) in the resulting character string, and then chop off the fractional part by replacing it with a null byte:
#define cut_off_decimals(sz, cch) \
    if (cch >= 5 && (sz[cch-4] == _T('.') || sz[cch-4] == _T(','))) \
        sz[cch-4] = _T('\0');
(Hungarian alert: sz is the C string, cch is the character count, including the terminating null. And _T is the Windows generic-text macro for either char or wchar_t depending on whether UNICODE is defined, only needed for compatibility with Windows 9x/ME.)
Note that this will produce incorrect results for the very odd case of a user-defined format where the third-to-last character is a dot or a comma that has some special meaning to the user other than fractional separator. I have never seen such a number format in my whole life, and hence I conclude that this is good and safe enough.
And of course this won't do anything if the third-to-last character is neither a dot nor a comma.
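For completeness, a hypothetical usage of the macro, assuming the user's locale shows two fractional digits (the value and buffer size are just examples, error handling omitted):
WCHAR szNumber[64];
// GetNumberFormatEx returns the character count including the terminating null on success.
int cch = ::GetNumberFormatEx(LOCALE_NAME_USER_DEFAULT, 0, L"1234.56",
                              NULL, szNumber, ARRAYSIZE(szNumber));
if (cch != 0)
    cut_off_decimals(szNumber, cch);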
