I recently discovered the <codecvt> header, so I wanted to convert between UTF-8 and UTF-16.
I use the codecvt_utf8_utf16 facet with wstring_convert from C++11.
The issue is that when I convert a UTF-16 string to UTF-8 and then back to UTF-16, the endianness changes.
For this code:
#include <codecvt>
#include <string>
#include <locale>
#include <iostream>
using namespace std;
int main(int argc, char const *argv[])
{
wstring_convert<codecvt_utf8_utf16<char16_t>, char16_t>
convert;
u16string utf16 = u"\ub098\ub294\ud0dc\uc624";
cout << hex << "UTF-16\n\n";
for (char16_t c : utf16)
cout << "[" << c << "] ";
string utf8 = convert.to_bytes(utf16);
cout << "\n\nUTF-16 to UTF-8\n\n";
for (unsigned char c : utf8)
cout << "[" << int(c) << "] ";
cout << "\n\nConverting back to UTF-16\n\n";
utf16 = convert.from_bytes(utf8);
for (char16_t c : utf16)
cout << "[" << c << "] ";
cout << endl;
}
I get this output:
UTF-16
[b098] [b294] [d0dc] [c624]
UTF-16 to UTF-8
[eb] [82] [98] [eb] [8a] [94] [ed] [83] [9c] [ec] [98] [a4]
Converting back to UTF-16
[98b0] [94b2] [dcd0] [24c6]
When I change the third template argument of codecvt_utf8_utf16 to std::little_endian, the bytes are reversed.
What did I miss?
It was indeed a bug in GCC: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66855
It will be fixed in GCC 5.3.
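If you want to check what the round trip should produce without relying on the library facet, a hand-written conversion is one option. The following is only a minimal sketch: utf16_to_utf8 and utf8_to_utf16 are hypothetical helper names (not part of <codecvt>), and the code assumes well-formed input with no validation of unpaired surrogates or invalid byte sequences.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
// Assumes well-formed input (no unpaired surrogates).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // Combine a high/low surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t low = in[++i];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// The code units are plain integer values, so host byte order plays no role.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        assert(i + len <= in.size());  // the sequence must not be truncated
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out += static_cast<char16_t>(cp);
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
    }
    return out;
}
```

With the string from the question, utf8_to_utf16(utf16_to_utf8(utf16)) should compare equal to utf16, so the first code unit stays 0xb098 rather than 0x98b0.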
In this code, the string length changes unexpectedly. Before I introduced the char *file variable, strlen(str) was correct. Once I added it, the value that strlen reports for str changed.
#include <unistd.h>
#include <iostream>
#include <stdio.h>
#include <string.h>
using namespace std;
int main(){
char buf[BUFSIZ];
if(!getcwd(buf,BUFSIZ)){
perror("ERROR!");
}
cout << buf << endl;
char *str;
str = new char[strlen(buf)];
strcpy(str,buf);
strcat(str,"/");
strcat(str,"input/abcdefghijklmnop");
cout << str << endl;
cout << strlen(str) << endl;
char *file;
file = new char[strlen(str)];
cout << strlen(file) << endl;
strcpy(file,str);
cout << file << endl;
}
Your code has undefined behavior because of buffer overflow. You should be scared.
You should consider using std::string.
std::string sbuf;
{
char cwdbuf[BUFSIZ];
if (getcwd(cwdbuf, sizeof(cwdbuf)))
sbuf = cwdbuf;
else {
perror("getcwd");
exit(EXIT_FAILURE);
}
}
sbuf += "/input/abcdefghijklmnop";
You should compile with all warnings & debug info (e.g. g++ -Wall -Wextra -g) then use the debugger gdb. Don't forget that strings are zero-byte terminated. Your str is much too short. If you insist on avoiding std::string (which IMHO you should not), you need to allocate more space (and remember the extra zero byte).
str = new char[strlen(buf)+sizeof("/input/abcdefghijklmnop")];
strcpy(str, buf);
strcat(str, "/input/abcdefghijklmnop");
Remember that the sizeof some literal string is one byte more than its length (as measured by strlen). For instance sizeof("abc") is 4.
Likewise your file variable is one byte too short (missing space for the terminating zero byte).
file = new char[strlen(str)+1];
BTW on GNU systems (such as Linux) you could use asprintf(3) or strdup(3) (and use free not delete to release the memory) and consider using valgrind.
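If you do stay with raw C strings, the size arithmetic described above can be kept in one place. This is just a sketch: join_path is a hypothetical helper name, and returning std::unique_ptr<char[]> is my own choice so that the matching delete[] happens automatically.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: build "dir/rel" in a freshly allocated buffer.
std::unique_ptr<char[]> join_path(const char* dir, const char* rel) {
    assert(dir != nullptr && rel != nullptr);
    const std::size_t dir_len = std::strlen(dir);
    const std::size_t rel_len = std::strlen(rel);
    // One extra byte for the '/' separator, one for the terminating zero byte.
    std::unique_ptr<char[]> out(new char[dir_len + 1 + rel_len + 1]);
    std::memcpy(out.get(), dir, dir_len);
    out[dir_len] = '/';
    // Copy rel_len + 1 bytes so the terminating zero byte comes along.
    std::memcpy(out.get() + dir_len + 1, rel, rel_len + 1);
    return out;
}
```

Calling join_path(buf, "input/abcdefghijklmnop") then gives a buffer whose strlen is strlen(buf) + 1 + 22, with room for the terminator already accounted for.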
Is it possible to unzip previously zipped vectors using the C++ Range-v3 library? I would expect it to behave similarly to Haskell's unzip function or Python's zip(*list).
It would be convenient, for instance, when sorting a vector by values of another vector:
using namespace ranges;
std::vector<std::string> names {"john", "bob", "alice"};
std::vector<int> ages {32, 19, 35};
// zip names and ages
auto zipped = view::zip(names, ages);
// sort the zip by age
sort(zipped, [](auto &&a, auto &&b) {
return std::get<1>(a) < std::get<1>(b);
});
// put the sorted names back into the original vector
std::tie(names, std::ignore) = unzip(zipped);
When passed container arguments, view::zip in range-v3 creates a view consisting of tuples of references to the original elements. Passing the zipped view to sort sorts the elements in place. I.e., this program:
#include <vector>
#include <string>
#include <iostream>
#include <range/v3/algorithm.hpp>
#include <range/v3/view.hpp>
using namespace ranges;
template <std::size_t N>
struct get_n {
template <typename T>
auto operator()(T&& t) const ->
decltype(std::get<N>(std::forward<T>(t))) {
return std::get<N>(std::forward<T>(t));
}
};
namespace ranges {
template <class T, class U>
std::ostream& operator << (std::ostream& os, common_pair<T, U> const& p) {
return os << '(' << p.first << ", " << p.second << ')';
}
}
int main() {
std::vector<std::string> names {"john", "bob", "alice"};
std::vector<int> ages {32, 19, 35};
auto zipped = view::zip(names, ages);
std::cout << "Before: Names: " << view::all(names) << '\n'
<< " Ages: " << view::all(ages) << '\n'
<< " Zipped: " << zipped << '\n';
sort(zipped, less{}, get_n<1>{});
std::cout << " After: Names: " << view::all(names) << '\n'
<< " Ages: " << view::all(ages) << '\n'
<< " Zipped: " << zipped << '\n';
}
Outputs:
Before: Names: [john,bob,alice]
Ages: [32,19,35]
Zipped: [(john, 32),(bob, 19),(alice, 35)]
After: Names: [bob,john,alice]
Ages: [19,32,35]
Zipped: [(bob, 19),(john, 32),(alice, 35)]
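For comparison, if range-v3 is not available, the same reordering can be approximated with the standard library alone by sorting an index permutation. Note that this is a different technique from the in-place zip sort above: it copies into temporary vectors. sort_names_by_age is a hypothetical helper name used only for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: reorder both vectors together so that ages ascend.
void sort_names_by_age(std::vector<std::string>& names, std::vector<int>& ages) {
    assert(names.size() == ages.size());

    // order[i] is the index of the element that should end up at position i.
    std::vector<std::size_t> order(ages.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&ages](std::size_t a, std::size_t b) { return ages[a] < ages[b]; });

    // Apply the permutation to both vectors via temporaries.
    std::vector<std::string> sorted_names;
    std::vector<int> sorted_ages;
    sorted_names.reserve(order.size());
    sorted_ages.reserve(order.size());
    for (std::size_t i : order) {
        sorted_names.push_back(std::move(names[i]));
        sorted_ages.push_back(ages[i]);
    }
    names = std::move(sorted_names);
    ages = std::move(sorted_ages);
}
```

For the names and ages from the question, this leaves names as bob, john, alice and ages as 19, 32, 35, matching the output shown above.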
I am trying to use the Unicode letter character class, i.e. \p{L}, with Boost Spirit, but have had no luck so far. Below is an example where the NON_ASIAN_LETTER definition uses the \p{L} character class. When I replace that definition with the commented-out [A-Za-z0-9] version just above it, it works, but that is not the intended use, as I need to match any Unicode letter.
My use case is UTF-8 only. At the end of the day, what I am trying to do is subtract a Unicode range from all Unicode letters when using the Boost Spirit lexer.
PS
Of course, my example is trimmed down and may not make a lot of sense as a use case but I hope you get the idea.
#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/fusion/include/std_pair.hpp>
#include <iostream>
#include <fstream>
#include <chrono>
#include <vector>
using namespace boost;
using namespace boost::spirit;
using namespace std;
using namespace std::chrono;
std::vector<pair<string, string> > getTokenMacros() {
std::vector<pair<string, string> > tokenDefinitionsVector;
tokenDefinitionsVector.emplace_back("JAPANESE_HIRAGANA", "[\u3041-\u3096]");
tokenDefinitionsVector.emplace_back("JAPANESE_HIRAGANA1",
"[\u3099-\u309E]");
tokenDefinitionsVector.emplace_back("ASIAN_NWS", "{JAPANESE_HIRAGANA}|"
"{JAPANESE_HIRAGANA1}");
tokenDefinitionsVector.emplace_back("ASIAN_NWS_WORD", "{ASIAN_NWS}*");
//tokenDefinitionsVector.emplace_back("NON_ASIAN_LETTER", "[A-Za-z0-9]");
tokenDefinitionsVector.emplace_back("NON_ASIAN_LETTER", "[\\p{L}-[{ASIAN_NWS}]]");
tokenDefinitionsVector.emplace_back("WORD", "{NON_ASIAN_LETTER}+");
tokenDefinitionsVector.emplace_back("ANY", ".");
return tokenDefinitionsVector;
}
;
struct distance_func {
template<typename Iterator1, typename Iterator2>
struct result: boost::iterator_difference<Iterator1> {
};
template<typename Iterator1, typename Iterator2>
typename result<Iterator1, Iterator2>::type operator()(Iterator1& begin,
Iterator2& end) const {
return distance(begin, end);
}
};
boost::phoenix::function<distance_func> const distance_fctor = distance_func();
template<typename Lexer>
struct word_count_tokens: lex::lexer<Lexer> {
word_count_tokens() :
asianNwsWord("{ASIAN_NWS_WORD}", lex::min_token_id + 110), word(
"{WORD}", lex::min_token_id + 170), any("{ANY}",
lex::min_token_id + 3000) {
using lex::_start;
using lex::_end;
using boost::phoenix::ref;
std::vector<pair<string, string> > tokenMacros(getTokenMacros());
for (auto start = tokenMacros.begin(), end = tokenMacros.end();
start != end; start++) {
this->self.add_pattern(start->first, start->second);
}
this->self = asianNwsWord | word | any;
}
lex::token_def<> asianNwsWord, word, any;
};
int main(int argc, char* argv[]) {
typedef lex::lexertl::token<string::iterator> token_type;
typedef lex::lexertl::actor_lexer<token_type> lexer_type;
word_count_tokens<lexer_type> word_count_lexer;
// read in the file into memory
ifstream sampleFile("/home/dan/Documents/wikiSample.txt");
string str = "abc efg ぁあ";
string::iterator first = str.begin();
string::iterator last = str.end();
lexer_type::iterator_type iter = word_count_lexer.begin(first, last);
lexer_type::iterator_type end = word_count_lexer.end();
typedef boost::iterator_range<string::iterator> iterator_range;
vector<iterator_range> parsed_tokens;
while (iter != end && token_is_valid(*iter)) {
cout << (iter->id() - lex::min_token_id) << " " << iter->value()
<< endl;
const iterator_range range = get<iterator_range>(iter->value());
parsed_tokens.push_back(range);
++iter;
}
if (iter != end) {
string rest(first, last);
cout << endl << "!!!!!!!!!" << endl << "Lexical analysis failed\n"
<< "stopped at: \"" << rest << "\"" << endl;
cout << "#" << (int) rest.at(0) << "#" << endl;
}
return 0;
}
I am trying to do some simple box drawing in the terminal using Unicode characters. However, I noticed that wcout wouldn't output anything for the box-drawing characters, not even a placeholder. So I wrote the program below to find out which Unicode characters are supported, and found that wcout refuses to output anything above 255. Is there something I have to do to make wcout work properly? Why can't I access any of the extended Unicode characters?
#include <wchar.h>
#include <locale>
#include <iostream>
using namespace std;
int main()
{
for (wchar_t c = 0; c < 0xFFFF; c++)
{
cout << "Iteration " << (int)c << endl;
wcout << c << endl << endl;
}
return 0;
}
I don't recommend using wcout because it is non-portable, inefficient (always performs transcoding) and doesn't support all of Unicode (e.g. surrogate pairs).
Instead you can use the open-source {fmt} library to portably print Unicode text including box drawing characters, for example:
#include <fmt/core.h>
int main() {
fmt::print("┌────────────────────┐\n"
"│ Hello, world! │\n"
"└────────────────────┘\n");
}
prints (https://godbolt.org/z/4EP6Yo):
┌────────────────────┐
│ Hello, world! │
└────────────────────┘
Disclaimer: I'm the author of {fmt}.
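If adding a dependency is not an option, a rough standard-library-only alternative is to build the UTF-8 bytes yourself and write them to std::cout. This is a sketch under assumptions: the terminal must interpret output as UTF-8, the text is assumed to be ASCII so that its byte length equals its display width, and boxed is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: frame one line of ASCII text with box-drawing characters.
// The box characters are spelled as explicit UTF-8 bytes so the result does not
// depend on the source file encoding.
std::string boxed(const std::string& text) {
    for (unsigned char ch : text)
        assert(ch < 0x80);  // ASCII only: byte count equals display width

    const std::string horizontal = "\xE2\x94\x80";  // U+2500
    const std::string vertical   = "\xE2\x94\x82";  // U+2502

    std::string bar;
    for (std::size_t i = 0; i < text.size() + 2; ++i)
        bar += horizontal;

    return "\xE2\x94\x8C" + bar + "\xE2\x94\x90\n" +           // U+250C ... U+2510
           vertical + " " + text + " " + vertical + "\n" +
           "\xE2\x94\x94" + bar + "\xE2\x94\x98\n";             // U+2514 ... U+2518
}
```

Printing is then just std::cout << boxed("Hello, world!"); with <iostream> included. Unlike wcout, no transcoding happens; the bytes go to the terminal as they are.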
My Xcode project compiles, but the program will not let me input anything.
#include <iostream>
using namespace std;
bool truthStatement;
int main (int argc, const char * argv[])
{
string name;
cout << "What is your name?" << endl;
cin >> name;
if (name == "Matt"){
cout << "You're cool" << endl;
} else {
cout << "You suck" << endl;
}
}
I have had this problem before. Make sure you are using the Return key and not the Enter key on the keyboard.