wget and encoding: how to force UTF-8?

When I try to download this link (Ubuntu 16.04, wget 1.17.1):
wget --remote-encoding=UTF-8 http://www.altai_terr.vybory.izbirkom.ru/region/altai_terr?action=ik&vrn=4224065120534
I get a file:
>cat altai_terr\?action\=ik
...
<div class="center-colm">
<h2>????????????? ???????? ?????????? ????</h2>
<p>
<strong>????? ????????: </strong><span id="address_ik"><span>656035, ????? ???????, ???????? ?.?.??????, 59</span></span>
</p>
...
I check the file:
>file -bi altai_terr\?action\=ik
text/html; charset=iso-8859-1
I check installed locales:
…
en_US
en_US.iso88591
en_US.iso885915
en_US.utf8
…
ru_RU
ru_RU.cp1251
ru_RU.iso88595
ru_RU.koi8r
ru_RU.utf8
russian
ru_UA
ru_UA.koi8u
ru_UA.utf8
…
How can I download the file without "???"?
P.S.
If I run this code in Python 2.7:
import requests
x = 'http://www.altai_terr.vybory.izbirkom.ru/region/altai_terr?action=ik&vrn=4224065120534'
page_uik = requests.get(url=x)
print page_uik.text
I get:
...
<div class="center-colm">
<h2>Участковая избирательная комиссия №1767</h2>
<p>
<strong>Адрес комиссии: </strong><span id="address_ik"><span>659595, Алтайский край, Усть-Пристанский район, село Коробейниково, улица Комсомольская, дом 33а</span>, дом культуры</span>
</p>
...

I have to convert the file after downloading with this command:
iconv -f CP1251 -t UTF-8 altai_terr\?action\=ik

This is not wget's job.
You can ask the web server for a specific encoding, but it will probably ignore the request.
The web server will tell you what it thinks the encoding is, but never trust the server.
HTML also allows the author to specify the encoding inside the document itself (without needing to ask the system administrator or webmaster).
So it is your task, once you have the document, to determine its encoding, convert it, and handle errors and exceptions. You may encounter UTF-8 sites with invalid byte sequences and, quite often, sites that mix several encodings (usually because different parts are generated dynamically under wrong assumptions about the encoding).
So take what wget gives you and do the decoding yourself.
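The "check the encoding, convert, handle errors" steps above can be sketched in Python roughly as follows. The helper names and the fallback list (UTF-8, then CP1251) are assumptions chosen for this example, not part of wget or of any library. Note also that single-byte charsets such as ISO-8859-1 never raise a decoding error, so a wrong label of that kind passes silently.

```python
import re

# Hypothetical helpers; the fallback order is an assumption suited to Russian pages.
META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def sniff_meta_charset(raw):
    """Return the charset named in an HTML <meta> tag, or None if absent."""
    match = META_CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii") if match else None


def decode_page(raw, header_charset=None, fallbacks=("utf-8", "cp1251")):
    """Try the declared charsets first, then the fallbacks, in order."""
    declared = [c for c in (header_charset, sniff_meta_charset(raw)) if c]
    for name in declared + list(fallbacks):
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown label: try the next candidate
    # Nothing decoded cleanly: keep going, marking bad bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Calling decode_page(open(path, "rb").read()) on the downloaded file returns the decoded text together with the name of the charset that was actually used.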

Related

How to read a file in utf8 encoding and output in Windows 10?

What is the proper procedure for reading and outputting UTF-8 encoded data in Windows 10?
My attempt to read a UTF-8 encoded file in Windows 10 and output its lines to the terminal does not reproduce the symbols of some languages.
OS: Windows 10
Native codepage: 437
Switched codepage: 65001
In a cmd window I issued the command chcp 65001. The following Ruby code reads a UTF-8 encoded file and outputs its lines with puts.
fname = 'hello_world.dat'
File.open(fname,'r:UTF-8') do |f|
  puts f.read
end
hello_world.dat content
Afrikaans: Hello Wêreld!
Albanian: Përshendetje Botë!
Amharic: ሰላም ልዑል!
Arabic: مرحبا بالعالم!
Armenian: Բարեւ աշխարհ!
Basque: Kaixo Mundua!
Belarussian: Прывітанне Сусвет!
Bengali: ওহে বিশ্ব!
Bulgarian: Здравей свят!
Catalan: Hola món!
Chichewa: Moni Dziko Lapansi!
Chinese: 你好世界!
Croatian: Pozdrav svijete!
Czech: Ahoj světe!
Danish: Hej Verden!
Dutch: Hallo Wereld!
English: Hello World!
Estonian: Tere maailm!
Finnish: Hei maailma!
French: Bonjour monde!
Frisian: Hallo wrâld!
Georgian: გამარჯობა მსოფლიო!
German: Hallo Welt!
Greek: Γειά σου Κόσμε!
Hausa: Sannu Duniya!
Hebrew: שלום עולם!
Hindi: नमस्ते दुनिया!
Hungarian: Helló Világ!
Icelandic: Halló heimur!
Igbo: Ndewo Ụwa!
Indonesian: Halo Dunia!
Italian: Ciao mondo!
Japanese: こんにちは世界!
Kazakh: Сәлем Әлем!
Khmer: សួស្តី​ពិភពលោក!
Kyrgyz: Салам дүйнө!
Lao: ສະ​ບາຍ​ດີ​ຊາວ​ໂລກ!
Latvian: Sveika pasaule!
Lithuanian: Labas pasauli!
Luxemburgish: Moien Welt!
Macedonian: Здраво свету!
Malay: Hai dunia!
Malayalam: ഹലോ വേൾഡ്!
Mongolian: Сайн уу дэлхий!
Myanmar: မင်္ဂလာပါကမ္ဘာလောက!
Nepali: नमस्कार संसार!
Norwegian: Hei Verden!
Pashto: سلام نړی!
Persian: سلام دنیا!
Polish: Witaj świecie!
Portuguese: Olá Mundo!
Punjabi: ਸਤਿ ਸ੍ਰੀ ਅਕਾਲ ਦੁਨਿਆ!
Romanian: Salut Lume!
Russian: Привет мир!
Scots Gaelic: Hàlo a Shaoghail!
Serbian: Здраво Свете!
Sesotho: Lefatše Lumela!
Sinhala: හෙලෝ වර්ල්ඩ්!
Slovenian: Pozdravljen svet!
Spanish: ¡Hola Mundo!
Sundanese: Halo Dunya!
Swahili: Salamu Dunia!
Swedish: Hej världen!
Tajik: Салом Ҷаҳон!
Thai: สวัสดีชาวโลก!
Turkish: Selam Dünya!
Ukrainian: Привіт Світ!
Uzbek: Salom Dunyo!
Vietnamese: Chào thế giới!
Welsh: Helo Byd!
Xhosa: Molo Lizwe!
Yiddish: העלא וועלט!
Yoruba: Mo ki O Ile Aiye!
Zulu: Sawubona Mhlaba!
Steven Penny suggested using PowerShell without changing the code page. The following picture demonstrates that the issue persists.
Installing Windows Terminal (which is not part of the Windows distribution) solves the UTF-8 output issue; please see the included screen capture.
The problem is that you are using some methods and tools that are really old. First:
Native codepage: 437
Switched codepage: 65001
You don't need to mess with the code page any more; just leave it at the default. Also, from your picture I see you are using Console Host, which is also really old. Windows Terminal [1] has been available since 2019 and has built-in UTF-8 support. Using Windows Terminal, I can run your script, even without specifying UTF-8:
fname = 'hello_world.dat'
File.open(fname,'r') do |f|
  puts f.read
end
and I get a perfect result:
To use Windows Terminal, download the msixbundle file [2], then install it. Or, as it's essentially just a Zip file, you can rename it to file.zip and extract it with Windows, then run WindowsTerminal.exe. Or, since you are really having trouble with this process, you can use a portable version I just created [3] (at your own risk).
[1] https://github.com/microsoft/terminal
[2] https://github.com/microsoft/terminal/releases/tag/v1.8.1444.0
[3] https://github.com/microsoft/terminal/files/6563899/CascadiaPackage_1.8.1444.0_x64.zip

encode / decode binary data in a qr-code using qrencode and zbarimg in bash

I have some binary data that I want to encode in a QR code and then be able to decode, all of that in bash. After a search, it looks like I should use qrencode for encoding and zbarimg for decoding. After a bit of troubleshooting, I still cannot decode what I encoded.
Any idea why? The closest I have come to a solution is:
$ dd if=/dev/urandom bs=10 count=1 status=none > data.bin
$ xxd data.bin
00000000: b255 f625 1cf7 a051 3d07 .U.%...Q=.
$ cat data.bin | qrencode -l H -8 -o data.png
$ zbarimg --raw --quiet data.png | xxd
00000000: c2b2 55c3 b625 1cc3 b7c2 a051 3d07 0a ..U..%.....Q=..
It looks like I am not very far, but something is still off.
Edit 1: a possible fix is to use base64 wrapping, as explained in the answer by #leagris .
Edit 2: using base64 encoding doubles the size of the message. The reason why I use binary in the first place is to be size-efficient so I would like to avoid that. De-accepting the answer by #leagris as I would like to have it 'full binary', sorry.
Edit 3: as of 2020-03-03 it looks like this is a well-known issue of zbarimg and that a pull request to fix this is on its way:
https://github.com/mchehab/zbar/pull/64
Edit 4: if you know of another command-line tool on Linux that is able to decode QR codes with binary content, please feel free to let me know.
My pull request has been applied. ZBar version 0.23.1 and newer will be able to decode binary QR codes:
zbarimg --raw --oneshot -Sbinary qr.png
zbarcam --raw --oneshot -Sbinary
QR codes have several encoding modes. The simplest, most commonly used and widely supported is the alphanumeric encoding which is suitable for simple text. The byte encoding allows storing arbitrary 8 bit data in the QR code. The ECI mode is like 8 bit mode but with additional metadata that tells the decoder which character set to use in order to decode the binary data back to text. Here's a list of known ECI values and the character encodings they represent. For example, when a decoder encounters an ECI 26 mode QR code it knows to decode the binary data as UTF-8.
The qrencode tool is doing its job correctly: it is creating a byte mode QR code with the data you gave it as its contents. The problem is most decoders were explicitly designed to handle textual data first and foremost. The retrieval of the raw binary data is a detail at best.
Current versions of the zbar library will treat byte mode QR codes as if they were unknown ECI mode QR codes. If a character set isn't specified, it will attempt to guess the encoding and convert the data to it. This will most likely mangle the binary data. As you noted, I brought this up in issue #55 and after some time managed to submit a pull request to improve this. Should it be merged, the library will have a binary decoder option that instructs decoders to return the raw binary data without converting it. Another source of data mangling is the tendency of the command-line tools to append line feeds to the output. I submitted a pull request to allow users to prevent this, and it has already been merged.
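The two hex dumps in the question are consistent with this kind of mangling. The sketch below assumes the decoder guessed Latin-1 (ISO-8859-1) for this particular payload and that the command-line tool appended one line feed; other payloads may trigger a different guess, in which case the reversal shown here does not apply. The function names are made up for the example.

```python
# Bytes copied from the two xxd dumps in the question.
original = bytes.fromhex("b255f6251cf7a0513d07")
zbar_output = bytes.fromhex("c2b255c3b6251cc3b7c2a0513d070a")


def mangle(data):
    """Reinterpret bytes as Latin-1 text, emit UTF-8, append a line feed."""
    return data.decode("latin-1").encode("utf-8") + b"\n"


def unmangle(data):
    """Undo mangle(); only valid if the decoder really guessed Latin-1."""
    if data.endswith(b"\n"):
        data = data[:-1]  # drop the single line feed added by the CLI tool
    return data.decode("utf-8").encode("latin-1")


print(mangle(original) == zbar_output)    # True
print(unmangle(zbar_output) == original)  # True
```

This is only a workaround for output that has already been mangled in this specific way; it is not a substitute for a decoder that returns the raw bytes.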
The zxing-cpp library will also try to guess the encoding of binary data in QR codes. The comments suggest that the QR code specification requires that decoders pick an encoding without specifying a default or allowing them to return the raw binary data. In order to make that possible, the binary data is copied to a byte array which can be accessed through the DecoderResult. When I have some free time, I intend to write zximg and zxcam tools with binary decoding support for this library.
It's always possible to encode binary data as base 64 and encode the result as an alphanumeric QR code. However, base 64 encoding will increase the size of the data and the alphanumeric mode doesn't allow use of the QR code's maximum capacity. In a comment, you mentioned what you intend to use binary QR codes for:
I want to have a package to effectively dump some gpg stuff in a format that makes recovery easy.
That is the exact use case I'm attempting to enable with my pull request: an easier-to-restore paperkey. 4096 bit RSA secret keys can be directly QR encoded in 8 bit mode but not in alphanumeric mode as base 64-encoded data.
See also: Storing binary data in QR codes
It looks like zbarimg only supports printable characters and appends a newline:
printf '%s' 'Hello World!' >data.bin
xxd data.bin
qrencode -l H -8 -o data.png -r data.bin
zbarimg --raw --quiet data.png | xxd
I think a better, more portable option would be to base64-encode your binary data before QR-encoding it.
Like this:
dd if=/dev/urandom bs=10 count=1 status=none > data.bin
xxd data.bin
base64 <data.bin | qrencode -l H -8 -o data.png
zbarimg --raw --quiet data.png | base64 -d | xxd
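To put a number on the size cost of the base64 wrapping: padded base64 emits 4 output characters for every 3 input bytes, so the growth is about one third rather than a doubling. A small Python check, using a random 10-byte payload like the one in the question (the helper name is made up for the example):

```python
import base64
import math
import os


def base64_length(n_bytes):
    """Length of padded base64 output for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


payload = os.urandom(10)             # same size as data.bin in the question
encoded = base64.b64encode(payload)  # no trailing newline, unlike the base64 CLI

print(len(payload), len(encoded), base64_length(len(payload)))  # 10 16 16
```

The base64 command-line tool also appends a newline, so the piped variant above stores one more byte than b64encode produces.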

Image downloaded with wget has size of 4 bytes

I have a problem downloading a certain image.
I'm trying to download an image and save it to disk.
Here is the wget command I'm using. It works perfectly fine with almost every image, including this URL:
wget -O test.gif http://www.fmwconcepts.com/misc_tests/animation_example/lena_anim2.gif
Almost, because when I try to download the image from this URL: http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif
It fails. The downloaded file is 4 bytes long. I tried using curl instead of wget, but the results are the same.
I think the second image (the one that does not work) might be generated dynamically (the image changes automatically, depending on store reviews). I believe that has something to do with this issue.
It looks like some kind of misconfiguration on the server side: the server won't return the image unless you state that you accept gzip-compressed content. Most web browsers do this by default nowadays, so the image works fine in a browser, but for wget or curl you need to add the Accept-Encoding header manually. You will then get a gzip-compressed image, which you can pipe to gunzip to get a normal, uncompressed image.
You could save the image using:
wget --header='Accept-Encoding: gzip' -O- http://sklepymuzyczne24.pl/_data/ranking/686/e3991/ranking.gif | gunzip - > ranking.gif
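The same idea can be sketched in Python with only the standard library: send the Accept-Encoding header and decompress the body only when the server says it is gzip. The function names are made up for this example, and the sketch has not been run against the server in question.

```python
import gzip
import urllib.request


def decode_body(body, content_encoding):
    """Decompress the body if the server labelled it as gzip."""
    if (content_encoding or "").strip().lower() == "gzip":
        return gzip.decompress(body)
    return body


def fetch(url):
    """Request url while advertising gzip support, and return the raw bytes."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        # urllib does not decompress responses on its own.
        return decode_body(response.read(),
                           response.headers.get("Content-Encoding"))
```

Writing the result of fetch(url) to a file opened in "wb" mode would then give the uncompressed image.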

download all images on the page with WGET

I'm trying to download all the images that appear on a page with wget. Everything seems fine, but the command actually downloads only the first 6 images and no more. I can't figure out why.
The command I used:
wget -nd -r -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
It downloads only the first 6 relevant images on the page, plus all the other stuff that I don't need. Looking at the page, any idea why it only gets the first 6 relevant images?
Thanks in advance.
I think the main problem is that only 6 JPEGs are referenced directly on that page; all the other images are GIF placeholders. For example:
<img src="http://www.edpeers.com/wp-content/themes/prophoto5/images/blank.gif"
data-lazyload-src="http://www.edpeers.com/wp-content/uploads/2013/11/aa_umbria-italy-wedding_075.jpg"
class="alignnone size-full wp-image-12934 aligncenter" width="666" height="444"
alt="Umbria wedding photographer" title="Umbria wedding photographer" />
data-lazyload-src is an attribute read by a jQuery lazy-loading plugin, which swaps in the JPEG at run time. wget does not execute JavaScript, so it never requests those JPEGs; see http://www.appelsiini.net/projects/lazyload
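One way around this is to pull the real image URLs out of the lazy-load attribute yourself. The following Python sketch uses the standard html.parser module; the class and function names are made up for the example, and it assumes the attribute is called data-lazyload-src, as in the snippet above.

```python
from html.parser import HTMLParser


class LazyImageCollector(HTMLParser):
    """Collect the values of a lazy-load attribute found on <img> tags."""

    def __init__(self, attribute="data-lazyload-src"):
        super().__init__()
        self.attribute = attribute
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />.
        if tag != "img":
            return
        for name, value in attrs:
            if name == self.attribute and value:
                self.urls.append(value)


def lazy_image_urls(html):
    """Return every lazy-loaded image URL in the given HTML text."""
    collector = LazyImageCollector()
    collector.feed(html)
    collector.close()
    return collector.urls
```

The resulting URLs can be written one per line to a file and downloaded with wget -i.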
Try -p instead of -r
wget -nd -p -P . -A jpeg,jpg http://www.edpeers.com/2013/weddings/umbria-wedding-photographer/
see http://explainshell.com:
-p
--page-requisites
This option causes Wget to download all the files that are necessary to properly display a given HTML
page. This includes such things as inlined images, sounds, and referenced stylesheets.

How to fetch a binary file from a remote embedded system using telnet?

I have a remote embedded system that I can reach over telnet. How can I fetch a binary file from it using Ruby? If it were a text file, I could have used:
require 'net/telnet'
con = Net::Telnet::new("Host"=>ip,"Timeout"=>200) # the key is "Host", not "host"
File.open("fetched_file","w+") do |f|
  con.cmd("cat /etc/file") {|data| f.write(data)}
end
But this won't work for a binary file: you won't get usable data by cat-ing it.
Establish your telnet connection, then
send the command:
uuencode filename -
to the remote host, replacing filename with the name of the file.
Take the data you are sent and pass it to uudecode on your system.
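If uudecode is not available on the receiving side, the decoding step can be sketched in Python with the standard binascii module. The function name is made up for the example, and the sketch assumes the captured telnet output contains one complete begin ... end block, possibly surrounded by a banner or shell prompt.

```python
import binascii


def uudecode_text(text):
    """Decode the first uuencoded block found in captured terminal output."""
    out = bytearray()
    in_body = False
    for line in text.splitlines():
        if not in_body:
            # Skip banners, echoed commands and prompts before the header.
            in_body = line.startswith("begin ")
            continue
        if line == "end":
            break
        if line:  # ignore blank lines; a2b_uu must not be given empty input
            out += binascii.a2b_uu(line.encode("ascii"))
    return bytes(out)
```

The returned bytes can then be written to a local file opened in "wb" mode.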
If the device has uuencode installed, you could use that to 'wrap' the binary into printable characters. Another possibility is to run dd if=/etc/file 2>/dev/null to dump the data (however, I am not completely certain this will work any better...)

Resources