Japanese SRT file garbled, can't determine encoding to fix with iconv

I have an srt file, excerpt:
2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!
3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí;
Óïõ ðÞñá äþñï ãåíåèëßùí.
4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí
íá ìïõ ðÜñåéò êÜôé.
5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ.
Êáé èá ôï öáò.
6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò
èá Ýôñùãá êïñìü.
7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.
8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...
9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò
íá ìïõ êÜíåéò...
10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.
Supposedly these are Japanese subtitles, but the text is obviously garbled by an encoding issue. I am trying to figure out how to correct it and ultimately convert it to UTF-8. Does anyone have any ideas?
Output of the `file` command: UTF-8 Unicode (with BOM) text, with CRLF line terminators
File can be obtained here for testing:
http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja

What you have is a document that was transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, but the document source was actually coded in the ISO-8859-7 (Greek) character set. During the transcoding to UTF-8, a U+FEFF byte order mark (BOM) and a few curly quotation marks (U+201C, U+201D) were also added.
The language is Greek, and the second subtitle cue, once corrected, reads:
2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!
The English translation is "I'll kill you, Gouaintzelstin!".
To reverse/correct it:
1. Decode the document from the UTF-8 encoding scheme.
2. Remove all code points greater than U+00FF.
3. Encode the document using the ISO-8859-1 encoding.
4. Transcode the resulting document from the ISO-8859-7 encoding to the UTF-8 encoding scheme.
An implementation of the above in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw[];

(@ARGV == 1 && -f $ARGV[0])
  or die qq[Usage: $0 <file>];

my $file = shift @ARGV;
my ($octets, $string);

# Read all the octets from the file
$octets = do {
    open my $fh, '<:raw', $file
        or die qq[Could not open '$file' for reading: '$!'];
    local $/; <$fh>;
};

# Decode the octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);

# Remove all code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;

# Encode the string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);

# Decode the octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);

# Encode the string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);

# Output the octets on standard output
print $octets;
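For readers without Perl at hand, the same four steps can be sketched in Python (the `fix_mojibake` helper name is mine; step 2 is implemented as a simple filter on code points):

```python
import sys

def fix_mojibake(octets: bytes) -> bytes:
    # Step 1: decode the octets using the UTF-8 encoding scheme
    text = octets.decode("utf-8")
    # Step 2: drop the BOM, curly quotes, and any other code point above U+00FF
    text = "".join(ch for ch in text if ord(ch) <= 0xFF)
    # Steps 3-4: re-encode as ISO-8859-1 to recover the original byte values,
    # reinterpret those bytes as ISO-8859-7, then emit UTF-8
    return text.encode("iso-8859-1").decode("iso-8859-7").encode("utf-8")

if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        sys.stdout.buffer.write(fix_mojibake(fh.read()))
```

Running this over the excerpt turns "Èá óå óêïôþóù" into "Θα σε σκοτώσω", matching the corrected cue above.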

Related

How to get sha256 hash output as binary data instead of hex using powershell?

Is there a built-in method to generate the SHA256 of a given string as binary data? Basically, I am using the bash script below to first generate the hash as binary data and then Base64-encode it. I want the exact same thing in PowerShell so that both outputs are identical:
clearString="test"
payloadDigest=`echo -n "$clearString" | openssl dgst -binary -sha256 | openssl base64 `
echo ${payloadDigest}
In PowerShell, I can get the SHA256 in hex using the script below, but I am struggling to get it as binary data:
$ClearString= "test"
$hasher = [System.Security.Cryptography.HashAlgorithm]::Create('sha256')
$hash = $hasher.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($ClearString))
$hashString = [System.BitConverter]::ToString($hash)
$256String= $hashString.Replace('-', '').ToLower()
$sha256String="(stdin)= "+$256String
$sha256String
I can then use [Convert]::ToBase64String($Bytes) to convert to Base64, but how do I first get binary data similar to the bash output to pass into the Base64 conversion?
I think you are overdoing this: the ComputeHash method already returns a Byte[] array (binary data). To do what (I think) you are trying to achieve, don't convert the bytes to a string, because that yields a hexadecimal string.
$ClearString= "test"
$hasher = [System.Security.Cryptography.HashAlgorithm]::Create('sha256')
$hash = $hasher.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($ClearString))
# $hash is a Byte[] array
# convert to Base64
[Convert]::ToBase64String($hash)
returns
n4bQgYhMfWWaL+qgxVrQFaO/TxsrC4Is0V1sFbDwCgg=
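As a third cross-check, the same hash-then-Base64 pipeline can be sketched in Python with the standard-library hashlib and base64 modules:

```python
import base64
import hashlib

def sha256_b64(clear_string: str) -> str:
    # Hash the UTF-8 bytes of the string, then Base64-encode the raw digest
    digest = hashlib.sha256(clear_string.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

print(sha256_b64("test"))  # n4bQgYhMfWWaL+qgxVrQFaO/TxsrC4Is0V1sFbDwCgg=
```

This matches both the openssl output and the PowerShell one-liner above, confirming that all three operate on the same raw digest bytes.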

How to call CMD with utf8 arguments from perl?

How can I call CMD with UTF-8 arguments from Perl without mangling the argument's characters?
One of the things I've tried is to convert a string of unicode characters to its unicode character codes then use system($cmd):
use utf8;
`chcp 65001`;
binmode STDOUT, ":encoding(UTF-8)";
$string = "αω";
$converted_string = convert_to_unicode_code($string);
# gets $converted_string = '\x{03B1}\x{03C9}'
$cmd = 'program "'.$converted_string.'"';
# $cmd's value is: program "\x{03B1}\x{03C9}"
system($cmd);
sub convert_to_unicode_code {
    my $input = shift;
    $input =~ s/(.)/"\\x{" . (sprintf "%04X", ord $1) . "}"/eg;
    return $input;
}
Actually this solution doesn't work as expected and calls program "\x{03B1}\x{03C9}" instead of program "αω".
See Win32::Unicode.
αω.bat
@echo hiαω
so48996757.pl
use utf8;
use Win32::Unicode::Process qw(systemW);
system 'chcp 65001';
systemW 'αω';
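For contrast, Python's subprocess module passes an argument list to the child process without building a shell command string (on Windows this goes through the wide-character CreateProcessW), which sidesteps the mangling problem entirely; a minimal, portable sketch:

```python
import subprocess
import sys

# Spawn a child Python that prints its first argument back.
# The list form avoids shell quoting and preserves the Unicode argument.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", "αω"],
    capture_output=True, text=True, encoding="utf-8",
)
print(result.stdout.strip())  # αω
```

The Win32::Unicode approach above does the analogous thing for Perl on Windows.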

Non-determinism in encoding when using open() with scalar and I/O layers in Perl

For several hours now I have been fighting a bug in my Perl program. I am not sure whether the mistake is mine or the interpreter's, but the code is non-deterministic when it should, in my opinion, be deterministic. It also exhibits the same behavior on ancient Debian Lenny (Perl 5.10.0) and on a server just upgraded to Debian Wheezy (Perl 5.14.2). It boils down to this piece of Perl code:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
It initializes the Perl 5 interpreter in strict mode with warnings enabled, with character strings (as opposed to byte strings) and the named standard streams encoded in UTF8 (Perl's internal notion of UTF-8, but pretty close; changing to full UTF-8 makes no difference). Then it opens a file handle to an "in-memory file" (scalar variable), prints a single two-byte UTF-8 character into it, and examines the variable after closing the handle.
The scalar variable now always has its UTF8 flag turned off. However, it sometimes contains a byte string (convertible to a character string via utf8::decode()) and sometimes a character string that just needs its UTF8 flag turned on (Encode::_utf8_on()).
When I execute my code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with approximately equal frequency. When I change the string I write into the "file", e.g. add a newline at its end, Undecoded disappears. When utf8::decode succeeds and I try it on the same original string in a loop, it keeps succeeding within the same interpreter instance; likewise, if it fails, it keeps failing.
What is the explanation for the observed behavior? How can I use file handle to a scalar variable together with character strings?
Bash playground:
for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l
For reference and to be absolutely sure, I also made a version with pedantic error handling – same results.
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Examining $c in detail reveals that the behavior has nothing to do with the content of $c or its internals, and that the return value of decode accurately reflects what it did or did not do.
$ for i in {1..2}; do
perl -MDevel::Peek -we'
use strict; use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
Dump($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Dump($c)
'
echo
done
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x17d7a40 "\305\241"
CUR = 2
LEN = 16
Decoded
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
CUR = 2
LEN = 16
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
Undecoded
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
This was a bug in utf8::decode. It was fixed in 5.16.3 or earlier, probably in 5.16.0, since it was still present in 5.14.2.
A suitable workaround is to use Encode's decode_utf8 instead.
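Perl specifics aside, the general pattern here, writing character data through an encoding layer into an in-memory buffer and getting deterministic bytes back, can be sketched in Python with io.TextIOWrapper over a BytesIO:

```python
import io

buf = io.BytesIO()
# Wrap the byte buffer in a UTF-8 text layer, analogous to Perl's
# open C, ">:utf8", \$c on a scalar
text = io.TextIOWrapper(buf, encoding="utf-8")
text.write("š")
text.flush()

octets = buf.getvalue()          # raw UTF-8 bytes: b'\xc5\xa1'
print(octets)
print(octets.decode("utf-8"))    # back to the character string 'š'
```

The bytes b'\xc5\xa1' match the "\305\241" octal seen in the Devel::Peek dumps above, and decoding them is deterministic.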

How to change encoding for existing file with Vim

Here is a subtitle file at http://subscene.com/subtitles/crank/farsi_persian/281992. If you download it, you will see text like:
1
00:02:05,360 --> 00:02:07,430
åßÊæÑ¡ ãÇ åäæÒ ÏÇÑíã ãí ÑÎíã¿
ÎæÈå
2
00:02:07,600 --> 00:02:10,956
áíæÓ! ãÇ ÏÇÑíã í ÇÑ ãíäíã Èå ¿
æ Ïíå åí æÞÊ ãä Ñæ ÕÏÇ äãíÒäí
The text I expect is:
1
00:02:05,360 --> 00:02:07,430
هكتور، ما هنوز داريم مي چرخيم؟
خوبه
2
00:02:07,600 --> 00:02:10,956
چليوس! ما داريم چي کار ميکنيم بچه ؟
و ديگه هيچ وقت من رو صدا نميزني
I got the expected result by changing the file extension from .srt to .txt, opening it with the Chrome browser, changing the encoding to Arabic (Windows), and re-saving the file contents after selecting all the text.
I have no idea how to do this with Vim or a shell script. I tried :write ++enc=utf-8 russian.txt, and also set encoding and set fileencoding, but no luck.
Thanks, Mona
In Vim: after loading your file, and without making any modification, you can do:
:e ++enc=cp1256
To save in utf-8, just
:w ++enc=utf-8
or you could do it in shell:
iconv -cf WINDOWS-1256 -t utf-8 problem.srt -o correct.srt
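As a sanity check of the diagnosis, here is a small Python sketch: reinterpreting the garbled characters' Latin-1 byte values as Windows-1256 recovers the Persian text:

```python
garbled = "åßÊæÑ"  # first word of the broken subtitle
# The mojibake came from cp1256 bytes being displayed as Latin-1;
# reverse it: characters -> original byte values -> decode as cp1256
fixed = garbled.encode("latin-1").decode("cp1256")
print(fixed)  # هكتور
```

This is exactly the transformation that both :e ++enc=cp1256 in Vim and the iconv command perform on the whole file.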

Perl - Using Variables from an Input File in the URL when a Variable has a Space (two words)

What am I doing? The script loads a string from a .txt (locations.txt), and separates it into 6 variables. Each variable is separated by a comma. Then I go to a website, whose address depends on these 6 values.
What is the problem? If a variable in locations.txt contains a space (i.e. a two-word value such as a country name), the script does not build the correct URL.
The input file is:
locations.txt = Heinz,Weber,Sierra Leone,1915,M,White
Because Sierra Leone has a space, the url is:
https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3AHeinz%20%2Bsurname%3AWeber%20%2Bbirth_place%3A%22Sierra%20Leone%22%20%2Bbirth_year%3A1914-1918~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
But that does not get processed correctly in the code below.
I'm using the packages:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TableExtract;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
This is the beginning of the code :
open(my $l, '<', 'locations26.txt') or die "Can't open locations: $!";
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";

while (my $line = <$l>) {
    chomp $line;
    my %args;
    @args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;
    $args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);
    my $mech = WWW::Mechanize::Firefox->new(create => 1, activate => 1);
    $mech->get("https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219");
    # REST OF THE SCRIPT HERE. MANY LINES.
}
As another example, the following would work:
locations.txt = Benjamin,Schuvlein,Germany,1913,M,White
I have not used Mechanize, so I am not sure whether you need to encode the URL. Try encoding the space to %20 or + before calling $mech->get:
$url =~ s/ /+/g;
Or
$url =~ s/ /%20/g
whichever works :)
====
Edit:
my $url = "https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219";
$url =~ s/ /+/g;
$mech->get($url);
Try that.
If you have the error
Global symbol "$url" requires explicit package name.
this means that you forgot to declare $url with:
my $url;
Your use list seems excessive; I'm pretty sure you don't need all of those modules at the same time. If you use WWW::Mechanize, you don't need LWP::UserAgent or CGI, I guess...
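A more robust alternative to a blanket substitution is to percent-encode each field before interpolating it into the query string (the asker's script already imports CGI's escape, which plays this role in Perl); a Python sketch of the idea:

```python
from urllib.parse import quote

birth_place = "Sierra Leone"
# quote() percent-encodes spaces and other reserved characters;
# safe="" also encodes "/" so the value cannot break the URL structure
encoded = quote(birth_place, safe="")
url = ("https://familysearch.org/search/collection/results"
       "#count=20&query=%2Bbirth_place%3A" + encoded)
print(encoded)  # Sierra%20Leone
```

Encoding per field, rather than patching the assembled URL, avoids accidentally rewriting characters that were already legitimately part of the query.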