I have an srt file, excerpt:
2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!
3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí;
Óïõ ðÞñá äþñï ãåíåèëßùí.
4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí
íá ìïõ ðÜñåéò êÜôé.
5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ.
Êáé èá ôï öáò.
6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò
èá Ýôñùãá êïñìü.
7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.
8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...
9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò
íá ìïõ êÜíåéò...
10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.
Supposedly these are Japanese subtitles, but the text is obviously garbled by an encoding issue. I am trying to figure out how to correct it and ultimately convert it to UTF-8. Does anyone have any ideas?
File output: UTF-8 Unicode (with BOM) text, with CRLF line terminators
File can be obtained here for testing:
http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja
What you have is a document that has been transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, but the document source was coded in the ISO-8859-7 character set. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) has been added and a few quotation marks (U+201C, U+201D).
The language is Greek, and the 2nd subtitle sequence, when corrected, is:
2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!
The English translation is "I'll kill you, Gouaintzelstin!".
To reverse/correct it:
1. Decode the document from the UTF-8 encoding scheme.
2. Remove all code points greater than U+00FF.
3. Encode the document using the ISO-8859-1 encoding.
4. Transcode the document from the ISO-8859-7 encoding to the UTF-8 encoding scheme.
An implementation of the above in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw[];
(@ARGV == 1 && -f $ARGV[0])
or die qq[Usage: $0 <file>];
my $file = shift @ARGV;
my ($octets, $string);
# Read all the octets from the file
$octets = do {
open my $fh, '<:raw', $file
or die qq[Could not open '$file' for reading: '$!'];
local $/; <$fh>
};
# Decode the octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);
# Remove all code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;
# Encode the string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);
# Decode the octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);
# Encode the string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);
# Output the octets on standard output
print $octets;
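For comparison, the same four steps can be sketched in Python. This is only a minimal sketch, not the answerer's code: the function name fix_greek_mojibake is made up for illustration, and errors="replace" is an assumption added to tolerate the few byte values that ISO-8859-7 leaves undefined.

```python
def fix_greek_mojibake(data: bytes) -> str:
    """Undo ISO-8859-7 text that was wrongly transcoded as ISO-8859-1 to UTF-8."""
    # Step 1: decode the octets as UTF-8.
    text = data.decode("utf-8")
    # Step 2: drop code points above U+00FF (the BOM and the stray curly quotes).
    text = "".join(ch for ch in text if ord(ch) <= 0xFF)
    # Step 3: encode as ISO-8859-1 to recover the original octets.
    octets = text.encode("iso-8859-1")
    # Step 4: decode those octets as ISO-8859-7 (assumption: replace undefined bytes).
    return octets.decode("iso-8859-7", errors="replace")


# Second subtitle line from the excerpt, with a leading BOM as in the real file.
garbled = "\ufeffÈá óå óêïôþóù, ÃïõÜéíôæåëóôéí!".encode("utf-8")
fixed = fix_greek_mojibake(garbled)
# UTF-8 octets ready to be written back to disk.
fixed_utf8 = fixed.encode("utf-8")
```

Applied to a whole file, the function would take the raw bytes read in binary mode, and the result would be written back as UTF-8.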
Is there a built-in method to generate a SHA256 hash as binary data for a given string? Basically, I am using the bash script below to first generate a hash as binary data and then base64-encode it. All I want is the exact same thing in PowerShell, so that both outputs are identical:
clearString="test"
payloadDigest=`echo -n "$clearString" | openssl dgst -binary -sha256 | openssl base64 `
echo ${payloadDigest}
In PowerShell, I can get the SHA256 in hex using the script below, but I am struggling to get it as binary data:
$ClearString= "test"
$hasher = [System.Security.Cryptography.HashAlgorithm]::Create('sha256')
$hash = $hasher.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($ClearString))
$hashString = [System.BitConverter]::ToString($hash)
$256String= $hashString.Replace('-', '').ToLower()
$sha256String="(stdin)= "+$256String
$sha256String
I can then use [Convert]::ToBase64String($Bytes) to convert to base64, but in between, how do I get binary data similar to the bash output before I pass it to the base64 conversion?
I think you are over-doing this, since the ComputeHash method already returns a Byte[] array (binary data). To do what (I think) you are trying to achieve, don't convert the bytes to string, because that will result in a hexadecimal string.
$ClearString= "test"
$hasher = [System.Security.Cryptography.HashAlgorithm]::Create('sha256')
$hash = $hasher.ComputeHash([System.Text.Encoding]::UTF8.GetBytes($ClearString))
# $hash is a Byte[] array
# convert to Base64
[Convert]::ToBase64String($hash)
returns
n4bQgYhMfWWaL+qgxVrQFaO/TxsrC4Is0V1sFbDwCgg=
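As an independent cross-check of that value, here is a minimal Python sketch of the same pipeline (hash the UTF-8 bytes, then base64-encode the raw digest rather than its hex form). The function name sha256_base64 is hypothetical and only used for illustration.

```python
import base64
import hashlib


def sha256_base64(clear_string: str) -> str:
    """Return the base64 encoding of the raw (binary) SHA-256 digest."""
    # digest() yields the 32 raw bytes; hexdigest() would yield the hex string instead.
    digest = hashlib.sha256(clear_string.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")


result = sha256_base64("test")
```

For the input "test", result should match the string printed by both the bash and the PowerShell versions.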
How to call CMD with utf8 arguments from perl without messing the argument's characters?
One of the things I've tried is to convert a string of Unicode characters to its Unicode character codes and then use system($cmd):
use utf8;
`chcp 65001`;
binmode STDOUT, ":encoding(UTF-8)";
$string = "αω";
$converted_string = convert_to_unicode_code($string);
# gets $converted_string = '\x{03B1}\x{03C9}'
$cmd = 'program "'.$converted_string.'"';
# $cmd's value is: program "\x{03B1}\x{03C9}"
system($cmd);
sub convert_to_unicode_code {
my $input = shift;
$input =~ s/(.)/"\\x{" . (sprintf "%04X", ord $1) . "}"/eg;
return $input;
}
Actually this solution doesn't work as expected and calls program "\x{03B1}\x{03C9}" instead of program "αω".
See Win32::Unicode.
αω.bat
@echo hiαω
so48996757.pl
use utf8;
use Win32::Unicode::Process qw(systemW);
system 'chcp 65001';
systemW 'αω';
For several hours now I am fighting a bug in my Perl program. I am not sure if I do something wrong or the interpreter does, but the code is non-deterministic while it should be deterministic, IMO. Also it exhibits the same behavior on ancient Debian Lenny (Perl 5.10.0) and a server just upgraded to Debian Wheezy (Perl 5.14.2). It boiled down to this piece of Perl code:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
It initializes Perl 5 interpreter in strict mode with warnings enabled, with character strings (as opposed to byte strings) and named standard streams encoded in UTF8 (internal notion of UTF-8, but pretty close; changing to full UTF-8 makes no difference). Then it opens a file handle to an “in-memory file” (scalar variable), prints a single two-byte UTF-8 character into it and examines the variable upon closure.
The scalar variable now always has UTF8 bit flipped off. However it sometimes contains a byte string (converted to character string via utf8::decode()) and sometimes a character string that just needs to flip on its UTF8 bit (Encode::_utf8_on()).
When I execute my code repeatedly (1000 times, via Bash), it prints Undecoded and Decoded with approximately the same frequencies. When I change the string I write into the “file”, e.g. add a newline at its end, Undecoded disappears. When utf8::decode succeeds and I try it for the same original string in a loop, it keeps succeeding in the same instance of interpreter; however, if it fails, it keeps failing.
What is the explanation for the observed behavior? How can I use file handle to a scalar variable together with character strings?
Bash playground:
for i in {1..1000}; do perl -we 'use strict; use utf8; binmode STDOUT, ":utf8"; binmode STDERR, ":utf8"; my $c = ""; open C, ">:utf8", \$c; print C "š"; close C; die "Does not happen\n" if utf8::is_utf8($c); print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";'; done | grep Undecoded | wc -l
For reference and to be absolutely sure, I also made a version with pedantic error handling – same results.
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8" or die "Cannot binmode STDOUT\n";
binmode STDERR, ":utf8" or die "Cannot binmode STDERR\n";
my $c = "";
open C, ">:utf8", \$c or die "Cannot open: $!\n";
print C "š" or die "Cannot print: $!\n";
close C or die "Cannot close: $!\n";
die "Does not happen\n" if utf8::is_utf8($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Examining $c in detail reveals that it has nothing to do with the content of $c or its internals, and that the result of decode accurately represents what it did or didn't do.
$ for i in {1..2}; do
perl -MDevel::Peek -we'
use strict; use utf8;
binmode STDOUT, ":utf8";
binmode STDERR, ":utf8";
my $c = "";
open C, ">:utf8", \$c;
print C "š";
close C;
die "Does not happen\n" if utf8::is_utf8($c);
Dump($c);
print utf8::decode($c) ? "Decoded\n" : "Undecoded\n";
Dump($c)
'
echo
done
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x17d7a40 "\305\241"
CUR = 2
LEN = 16
Decoded
SV = PV(0x17c8470) at 0x17de990
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x17d7a40 "\305\241" [UTF8 "\x{161}"]
CUR = 2
LEN = 16
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
Undecoded
SV = PV(0x2d0fee0) at 0x2d26400
REFCNT = 1
FLAGS = (PADMY,POK,pPOK)
PV = 0x2d1f4b0 "\305\241"
CUR = 2
LEN = 16
This was a bug in utf8::decode, but it was fixed in 5.16.3 or earlier, probably 5.16.0 since it was still present in 5.14.2.
A suitable workaround is to use Encode's decode_utf8 instead.
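The byte-string versus character-string distinction at the heart of this question can be illustrated with a rough Python analogy. This is not a fix for the Perl bug and does not reproduce it; it only shows a UTF-8 encoding layer writing "š" into an in-memory buffer as the same two octets seen in the dumps above, followed by an explicit decode back to a character string.

```python
import io

# In-memory "file" that stores raw bytes, like the scalar behind the Perl handle.
buf = io.BytesIO()

# Text layer that encodes characters to UTF-8 on the way in (compare ">:utf8").
writer = io.TextIOWrapper(buf, encoding="utf-8")
writer.write("š")
writer.flush()

raw = buf.getvalue()           # the stored octets
decoded = raw.decode("utf-8")  # explicit decode back to a character string
```

Here raw holds two bytes (0xC5 0xA1, i.e. "\305\241" in octal) while decoded holds the single character U+0161.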
Here is a subtitle file: http://subscene.com/subtitles/crank/farsi_persian/281992. If you download it, you will see garbled text like:
1
00:02:05,360 --> 00:02:07,430
åßÊæÑ¡ ãÇ åäæÒ ÏÇÑíã ãí ÑÎíã¿
ÎæÈå
2
00:02:07,600 --> 00:02:10,956
áíæÓ! ãÇ ÏÇÑíã í ÇÑ ãíäíã Èå ¿
æ Ïíå åí æÞÊ ãä Ñæ ÕÏÇ äãíÒäí
What I expect is:
1
00:02:05,360 --> 00:02:07,430
هكتور، ما هنوز داريم مي چرخيم؟
خوبه
2
00:02:07,600 --> 00:02:10,956
چليوس! ما داريم چي کار ميکنيم بچه ؟
و ديگه هيچ وقت من رو صدا نميزني
I got there by changing the file extension from srt to txt, opening the file in the Chrome browser, changing the encoding to Arabic (Windows), and re-saving the file contents by selecting all the text.
I have no idea how to do this with vim or a shell script. I tried :write ++enc=utf-8 russian.txt, set encoding, and set fileencoding, but no luck.
thanks, mona
In vim:
After loading your file, don't make any modifications. Then you can run:
:e ++enc=cp1256
To save in utf-8, just
:w ++enc=utf-8
or you could do it in shell:
iconv -cf WINDOWS-1256 -t utf-8 problem.srt -o correct.srt
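The same conversion can be sketched in Python for cases where neither vim nor iconv is at hand. This is a minimal sketch: the function name cp1256_to_utf8 is hypothetical, and errors="ignore" is an assumption chosen to mirror the -c flag of iconv, which discards characters that cannot be converted.

```python
def cp1256_to_utf8(data: bytes) -> bytes:
    """Reinterpret Windows-1256 octets as text and re-encode them as UTF-8."""
    # Assumption: silently skip undecodable bytes, similar to `iconv -c`.
    return data.decode("cp1256", errors="ignore").encode("utf-8")


# The four bytes that show up as "ÎæÈå" when misread as a Western code page.
sample = b"\xce\xe6\xc8\xe5"
converted = cp1256_to_utf8(sample)
```

For a whole file, read the subtitle in binary mode, pass the bytes through the function, and write the result to a new file in binary mode.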
What am I doing? The script loads a string from a .txt (locations.txt), and separates it into 6 variables. Each variable is separated by a comma. Then I go to a website, whose address depends on these 6 values.
What is the problem? If one of the values in locations.txt contains a space, the script does not build the correct URL.
The input file is:
locations.txt = Heinz,Weber,Sierra Leone,1915,M,White
Because Sierra Leone has a space, the url is:
https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3AHeinz%20%2Bsurname%3AWeber%20%2Bbirth_place%3A%22Sierra%20Leone%22%20%2Bbirth_year%3A1914-1918~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219
But that does not get processed correctly in the code below.
I'm using the packages:
use strict;
use warnings;
use WWW::Mechanize::Firefox;
use HTML::TableExtract;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;
This is the beginning of the code :
open(my $l, 'locations26.txt') or die "Can't open locations: $!";
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
while (my $line = <$l>) {
chomp $line;
my %args;
@args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;
$args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);
my $mech = WWW::Mechanize::Firefox->new(create => 1, activate => 1);
$mech->get("https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219");
# REST OF THE SCRIPT HERE. MANY LINES.
}
As another example, the following would work:
locations.txt = Benjamin,Schuvlein,Germany,1913,M,White
I have not used Mechanize, so not sure whether you need to encode the URL. Try encoding space to %20 or + before running $mech->get
$url =~ s/ /+/g;
Or
$url =~ s/ /%20/g;
whichever works :)
====
Edit:
my $url = "https://familysearch.org/search/collection/results#count=20&query=%2Bgivenname%3A".$args{givenname}."%20%2Bsurname%3A".$args{surname}."%20%2Bbirth_place%3A".$args{birth_place}."%20%2Bbirth_year%3A".$args{birth_year}."~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219";
$url =~ s/ /+/g;
$mech->get($url);
Try that.
If you have the error
Global symbol "$url" requires explicit package name.
this means that you forgot to declare $url with:
my $url;
Your use section looks odd; I'm pretty sure you don't need all of those modules at the same time. If you use WWW::Mechanize, I guess there is no need for LWP::UserAgent and CGI.
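An alternative to patching spaces by hand is to build the query as plain text and percent-encode it in one pass. The following Python sketch illustrates the idea under some assumptions: the function name build_search_url is hypothetical, wrapping a multi-word birth place in double quotes is inferred from the %22 in the expected URL in the question, and the year range follows the question's code (year minus 2 to year plus 2).

```python
from urllib.parse import quote

BASE = "https://familysearch.org/search/collection/results"


def build_search_url(givenname, surname, birth_place, birth_year, gender, race):
    """Build the search URL from one comma-separated record of locations.txt."""
    # Assumption: multi-word places are wrapped in double quotes (%22 in the URL).
    if " " in birth_place:
        birth_place = f'"{birth_place}"'
    year_range = f"{int(birth_year) - 2}-{int(birth_year) + 2}"
    query = (
        f"+givenname:{givenname} +surname:{surname} "
        f"+birth_place:{birth_place} +birth_year:{year_range}~ "
        f"+gender:{gender} +race:{race}"
    )
    # quote() turns '+' into %2B, ':' into %3A, '"' into %22 and ' ' into %20.
    return f"{BASE}#count=20&query={quote(query, safe='')}&collection_id=2000219"


line = "Heinz,Weber,Sierra Leone,1915,M,White"
url = build_search_url(*line.split(","))
```

Encoding the whole query at once avoids having to remember which characters need escaping in each field.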