How to change encoding for existing file with Vim - shell

Here is a subtitle file at http://subscene.com/subtitles/crank/farsi_persian/281992. If you download it, you will see garbled text like:
1
00:02:05,360 --> 00:02:07,430
åßÊæÑ¡ ãÇ åäæÒ ÏÇÑíã ãí ÑÎíã¿
ÎæÈå
2
00:02:07,600 --> 00:02:10,956
áíæÓ! ãÇ ÏÇÑíã í ÇÑ ãíäíã Èå ¿
æ Ïíå åí æÞÊ ãä Ñæ ÕÏÇ äãíÒäí
What I expect is:
1
00:02:05,360 --> 00:02:07,430
هكتور، ما هنوز داريم مي چرخيم؟
خوبه
2
00:02:07,600 --> 00:02:10,956
چليوس! ما داريم چي کار ميکنيم بچه ؟
و ديگه هيچ وقت من رو صدا نميزني
I achieved this by changing the file extension from srt to txt, opening it with the Chrome browser, changing the encoding to Arabic (Windows), and re-saving the file contents after selecting all the text.
I have no idea how to do this with Vim or a shell script. I tried :write ++enc=utf-8 russian.txt, :set encoding, and :set fileencoding, but no luck.
Thanks, Mona

In Vim: after loading your file, don't make any modifications. Then reinterpret the buffer as cp1256:
:e ++enc=cp1256
To save in UTF-8:
:w ++enc=utf-8
Or you can do it in the shell:
iconv -cf WINDOWS-1256 -t utf-8 problem.srt -o correct.srt
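If you prefer Python, the same conversion is a decode/encode pair. A minimal sketch; the sample bytes stand in for the downloaded file, and the file names in the comment are illustrative:

```python
# Convert Windows-1256 (Arabic) encoded text to UTF-8.
# The sample is the word "خوبه" ("good") from the subtitle above,
# encoded in cp1256 to simulate the garbled file's contents.
cp1256_bytes = "خوبه".encode("cp1256")   # what the garbled file contains
text = cp1256_bytes.decode("cp1256")     # interpret the bytes as cp1256
utf8_bytes = text.encode("utf-8")        # re-encode as UTF-8

# For a real file, the same two steps, streamed through open():
# with open("problem.srt", "rb") as src, open("correct.srt", "wb") as dst:
#     dst.write(src.read().decode("cp1256").encode("utf-8"))
```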

Related

Windows batch loop saving results into a file line by line

I run a command on each file within a folder, and I would like to write the results line by line to a text file:
for %r in (*) do (magick identify -format "%f, %w, %h" %r >> out.txt)
(it returns the image name and its size)
Which gives:
1048.tif, 3175, 2802,1049.tif, 3175, 2802...
I would like something like
> 1048.tif, 3175, 2802
> 1049.tif, 3175, 2802...
I tried putting echo before magick identify, but that writes the command itself rather than its result.
With ImageMagick you can try putting a "\n" in your format string...
... -format "%f, %w, %h\n" ...
That will insert a line break after the height.

Broken text in pdf with sphinx-build

I use sphinx-build (Sphinx v1.6.3) on a Mac (Mojave 10.14.1) to generate a PDF in different languages.
All languages work, but Polish gives me broken characters.
The original text is stored in *.rst files (in German), which I then translate into *.po files.
One example word which does not work is:
Treść
This is the according PO-File:
# SOME DESCRIPTIVE TITLE.
# Copyright (C) Beat Gurtner
# This file is distributed under the same license as the Dokumentation des
# Sakkadentrainers package.
# FIRST AUTHOR <EMAIL#ADDRESS>, 2019.
#
msgid ""
msgstr ""
"Project-Id-Version: Dokumentation des Sakkadentrainers\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2018-07-26 14:43+0200\n"
"PO-Revision-Date: 2019-12-29 17:38+0000\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=utf-8\n"
"X-Poedit-SourceCharset: UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"
"Generated-By: Babel 2.4.0\n"
"Last-Translator: \n"
"Language-Team: \n"
"Language: pl\n"
"X-Generator: Poedit 2.2.4\n"
#: ../../index.rst:7
msgid "Willkommen zur Dokumentation des Sakkadentrainers"
msgstr "Witamy w dokumentacji Sakkadycznytrener"
#: ../../index.rst:9
msgid "`Zurück zum Training <https://www.sakkadentrainer.ch>`_"
msgstr "`Powrót do treningu <https://www.sakkadentrainer.ch>`_"
#: ../../index.rst:11
msgid "Inhalt:"
msgstr "Treść:"
The command to generate the PDF is:
sphinx-build -t pl -D language=pl -b pdf /Applications/MAMP/htdocs/sakkadentrainer/doc/ /Applications/MAMP/htdocs/sakkadentrainer_medical_doc/pdf/pl/
Any help is appreciated
This is the solution; put it into your conf.py file. The problem is that you need a font that supports the special characters:
pdf_stylesheets = ['sphinx','kerning','a4']
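For reference, a hedged sketch of what the relevant conf.py section might look like when building with the rst2pdf builder. The font paths and the pdf_language setting are illustrative assumptions, not taken from the question; the key point is pointing rst2pdf at a font that covers Polish glyphs:

```python
# conf.py (rst2pdf builder) -- a sketch; the paths below are
# assumptions and will differ per system.
pdf_stylesheets = ['sphinx', 'kerning', 'a4']
pdf_language = 'pl'
# Folders rst2pdf searches for fonts; pick ones containing a
# Unicode font with Polish coverage (e.g. DejaVu).
pdf_font_path = ['/Library/Fonts', '/usr/share/fonts']
```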

Pandoc cross-ref: adding empty line after last section title

I need to write articles and am switching from LaTeX to Pandoc (or rather: I intend to). My Markdown file looks like this:
....
bla bla
\noindent
\setlength{\parindent}{-0.2in}
\setlength{\leftskip}{0.2in}
\setlength{\parskip}{8pt}
# References
Note that # References really is the last line of the file.
I compile like this:
pandoc ../../bibliography/default.yaml -f markdown-tex_math_dollars -s --bibliography ../../bibliography/bibliography.bib --csl ../../bibliography/harvard-cite-them-right.csl -F pandoc-crossref $f -o $f.pdf
My default yaml file:
---
geometry: a4paper,verbose,tmargin=2.5cm,bmargin=2.5cm,lmargin=1.5cm,rmargin=1.5cm
inputenc: latin9
indent: true
sectionsDepth: 3
link-citations: true
numberSections: true
linestretch: 1.5
header-includes:
- \renewcommand{\familydefault}{\rmdefault}
- \usepackage{lineno}
- \linenumbers
---
And what I get in PDF is this:
As you can see, there's an empty line after References.
Q:
How can I get rid of this empty line?
How can I remove the numbering from section title References only, leaving the remaining section titles numbered?
Thanks

Python: will not read a certain file in a for loop

I have a directory of files, and all of them are processed by my the_script.py script except one, file2.txt.
Independently, I ran a simple for line in file2.txt: print line and it worked just fine; the lines were printed. So the file itself is not the problem; it is formatted just like the others (automatically, as the output of another script).
Here is the_script.py :
#!/usr/bin/python
import os
import glob

# [...] rest of the code not dealing with the files in question

for filename in glob.glob("outdir/*_mapp"):  # get all files in outdir/ with the *_mapp suffix
    infilemapp = open(filename)
    print "start"
    print infilemapp  # test: print the file object
    organism = (filename.split("/", 1)[1])[:-5]  # outdir/acorus.txt_mapp --> acorus.txt  IRRELEVANT PARSING LINE
    infilelpwe = organism + "_lpwe"  # acorus.txt --> acorus.txt_lpwe  IRRELEVANT PARSING LINE
    for line in infilemapp:
        print line
    print "end"
What I expected was to get, for ALL files: "start, filename, file content, end". Instead, the console shows:
bash-4.3$ ./the_script.py
start
<open file 'outdir/file1.txt_mapp', mode 'r' at 0x7fb5795ec930>
['3R', '2F', '0R', '3F', '1R', '4F', '1F']
end
start
<open file 'outdir/file3.txt_mapp', mode 'r' at 0x7fb5795eca50>
['0R', '5R', '7R', '4R', '1F', '6R', '2R', '6F', '1R', '4F', '7F', '5F', '0F', '3R']
end
start
<open file 'outdir/file2.txt_mapp', mode 'r' at 0x7fb5795ec930>
end
As you can see, nothing is printed for file2.txt_mapp.
bash-4.3$ cat outdir/file2.txt_mapp
['5F', '0F', '2F', '6F', '3R', '5R', '6R', '4F', '1R', '4R', '6F']
The file is alphabetically in the middle of all the files. Why does my script not work for this specific one? Any suggestions are welcome.
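One way to narrow this down (a general debugging sketch, not a diagnosis) is to bypass the text layer and inspect the raw bytes Python actually sees, which reveals things like a BOM, unusual line terminators, NUL bytes, or a truncated file. Here a stand-in file is built in a temporary directory; with the real data you would point this at outdir/file2.txt_mapp instead:

```python
# Inspect a file's raw bytes to debug why iterating over its lines
# yields nothing. The stand-in content mimics the question's data.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "file2.txt_mapp")
with open(path, "wb") as f:
    f.write(b"['5F', '0F', '2F']\n")

with open(path, "rb") as f:
    raw = f.read()

print(repr(raw))            # shows BOMs, \r\n vs \n, NULs, truncation
print(len(raw), "bytes")
lines = raw.splitlines()
print(len(lines), "line(s)")
```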

Japanese SRT files garbled, can't determine encoding to fix with iconv

I have an srt file, excerpt:
2
00:00:36,208 --> 00:00:39,667
Èá óå óêïôþóù, ÃïõÜéíôæåëóôéí!
3
00:00:57,917 --> 00:01:00,917
Ãéáôß ôñÝ÷åéò, ÃïõÜéíôæåëóôéí;
Óïõ ðÞñá äþñï ãåíåèëßùí.
4
00:01:00,958 --> 00:01:03,208
Äåí ðåéñÜæåé, äåí ÷ñåéáæüôáí
íá ìïõ ðÜñåéò êÜôé.
5
00:01:03,250 --> 00:01:06,375
Óïõ ðÞñá ëßãï êïñìü äÝíôñïõ.
Êáé èá ôï öáò.
6
00:01:06,417 --> 00:01:08,875
Ùñáßá. ¸ôóé êé áëëéþò
èá Ýôñùãá êïñìü.
7
00:01:08,917 --> 00:01:10,208
Äåí èá Ýôñùãåò.
8
00:01:10,208 --> 00:01:11,000
Íáé. ÂëÝðåéò...
9
00:01:11,000 --> 00:01:12,417
...üëá ôá ðñÜãìáôá ðïõ Þèåëåò
íá ìïõ êÜíåéò...
10
00:01:12,417 --> 00:01:13,958
...ó÷åäßáæá íá ôá êÜíù ìüíïò ìïõ.
Supposedly these are Japanese subtitles, but they are obviously garbled by an encoding issue. I am trying to figure out how to correct the text and ultimately convert it to UTF-8. Does anyone have any ideas?
Output of the file command: UTF-8 Unicode (with BOM) text, with CRLF line terminators
File can be obtained here for testing:
http://www.opensubtitles.org/en/subtitles/5040215/the-incredible-burt-wonderstone-ja
What you have is a document that has been transcoded from the ISO-8859-1 character set to the UTF-8 encoding scheme, but the document source was coded in the ISO-8859-7 character set. After the transcoding to UTF-8, a U+FEFF byte order mark (BOM) has been added and a few quotation marks (U+201C, U+201D).
The language is Greek, and the 2nd subtitle sequence, when corrected, is:
2
00:00:36,208 --> 00:00:39,667
Θα σε σκοτώσω, Γουάιντζελστιν!
The English translation is "I'll kill you, Gouaintzelstin!".
To reverse/correct it:
1. Decode the document from the UTF-8 encoding scheme.
2. Remove all code points greater than U+00FF.
3. Encode the document using the ISO-8859-1 encoding.
4. Decode the result using the ISO-8859-7 encoding and re-encode it with the UTF-8 encoding scheme.
An implementation of the above in Perl:
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw[];

(@ARGV == 1 && -f $ARGV[0])
  or die qq[Usage: $0 <file>];

my $file = shift @ARGV;
my ($octets, $string);

# Read all the octets from the file
$octets = do {
    open my $fh, '<:raw', $file
      or die qq[Could not open '$file' for reading: '$!'];
    local $/; <$fh>;
};

# Decode the octets using the UTF-8 encoding scheme
$string = Encode::decode('UTF-8', $octets, Encode::FB_CROAK);

# Remove all code points greater than U+00FF
$string =~ s/[^\x00-\xFF]//g;

# Encode the string using the ISO-8859-1 encoding
$octets = Encode::encode('ISO-8859-1', $string);

# Decode the octets using the ISO-8859-7 encoding
$string = Encode::decode('ISO-8859-7', $octets);

# Encode the string using the UTF-8 encoding
$octets = Encode::encode('UTF-8', $string);

# Output the octets on standard output
print $octets;
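The same steps can be sketched in Python. The sample string is the garbled form of the corrected subtitle shown earlier; for a real file you would read and write bytes instead of working on a literal:

```python
# Reverse the double-encoding: UTF-8 text that was originally
# ISO-8859-7 but was mis-decoded as ISO-8859-1.
garbled = "Èá óå óêïôþóù"  # how "Θα σε σκοτώσω" appears after the mix-up
# (step 1, decoding from UTF-8, is implicit: garbled is already a str)

# Step 2: keep only code points <= U+00FF (drops the BOM, curly quotes, etc.)
filtered = "".join(ch for ch in garbled if ord(ch) <= 0xFF)
# Step 3: encode with ISO-8859-1 to recover the original byte values
octets = filtered.encode("iso-8859-1")
# Step 4: decode those bytes as ISO-8859-7 to get the real Greek text,
# then re-encode as UTF-8 when writing out
fixed = octets.decode("iso-8859-7")
print(fixed)  # Θα σε σκοτώσω
```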
