I have a script like
a <- 1
# A very long comment, perhaps copy paste from somewhere containing the word fit.
and I want to search for non-UTF-8 characters. How can I do this in RStudio?
I realized the answer is really simple: just go to Edit => Find (Ctrl + F) and search for [^\x00-\x7F]+ with the Regex option enabled in the search bar.
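If you would rather do the check outside of RStudio, here is a minimal Python sketch using the same character class; the file name script.R is only a placeholder:
# Report the position of every character outside the ASCII range [\x00-\x7F]
with open("script.R", encoding="utf-8", errors="replace") as f:
    for lineno, line in enumerate(f, start=1):
        for colno, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                print(f"line {lineno}, column {colno}: {ch!r}")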
I would like to translate all the text inside a Microsoft Word .doc or .docx file without changing the formatting of the file itself.
Are there any gems or libraries that can help me with this?
The general case is extremely complicated, but for translating continuous runs of text that share the same formatting, you can use WIN32OLE to drive Word itself, as long as you are on Windows and have a copy of Word installed.
You can find documentation on Word's object model. You can also use the built-in Object Browser (start the macro editor and press F2).
The following short script can form the starting point for your exploration:
require 'win32ole'

# Path to the document on the current user's desktop
file = ENV['USERPROFILE'] + '/Desktop/' + 'This is a test.docx'

# Start Word and make it visible
word = WIN32OLE.new('Word.Application')
word.visible = true

# Open the document and print the text of each paragraph
doc = word.Documents.Open(file)
doc.paragraphs.each { |p| puts p.Range.Text }

doc.Close()
word.Quit()
I copy-pasted content from a Word document (.docx) into a .txt file and had an NLTK corpus reader count the number of paragraphs. It treated almost 30 paragraphs as a single paragraph. After I manually entered line breaks in the .txt file, it returned 30 paragraphs.
import nltk
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt")
print "Paragraphs =", len(corpusReader.paras())
1. Is it possible for PlaintextCorpusReader to read .docx?
2. While copy-pasting from .docx to .txt, how can I preserve the line breaks?
3. Is there a way, using Python, to open the .txt file, find "?", "!", "." or "..." followed by some blank spaces (4 in number), and insert a line break automatically?
Edit 1.
I went down the para_block_reader=read_line_block path, but it always gives a paragraph count that is one too high.
import nltk
from nltk.corpus.reader.util import *
corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "d.txt",para_block_reader=read_line_block)
print "Paragraphs =", len(corpusReader.paras())
The plaintext corpus reader can only read plain-text files. There are Python libraries that can read .docx, but that will not address your problem, which is that Word delimits paragraphs with a single line break, whereas plaintext documents traditionally understand a paragraph boundary to be a blank line, i.e., two successive newlines. In other words, your export method does preserve the newlines; there just aren't enough of them.
So there is an easy way to fix up your texts so that paragraphs are recognized without extra ado: once you've written out your plaintext file (which you can do from Word's Save As... menu or by cutting and pasting), post-process it like this (add encoding= arguments as necessary):
import re

with open("my_plaintext.txt") as oldfile:
    content = oldfile.read()
# Double every newline so each paragraph ends up separated by a blank line
content = re.sub("\n", "\n\n", content)
with open("my_plaintext_fixed.txt", "w") as newfile:
    newfile.write(content)
You can now read my_plaintext_fixed.txt with the PlaintextCorpusReader, and everything will work as expected.
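For example (assuming the fixed file is in the current directory, and using Python 3 print syntax), re-running the paragraph count should now report the expected number:
import nltk

corpusReader = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", "my_plaintext_fixed.txt")
print("Paragraphs =", len(corpusReader.paras()))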
The source code for PlaintextCorpusReader is the first class defined on this page; it is fairly simple.
It has sub-components; if you don't specify them in the constructor, it uses the NLTK defaults:
para_block_reader (default: read_blankline_block), which says how the document is broken up into paragraphs.
sent_tokenizer (default: the English Punkt tokenizer), which says how to break a paragraph into sentences.
word_tokenizer (default: WordPunctTokenizer()), which says how to break a sentence into tokens (words and symbols).
Note that the defaults may change between versions of NLTK. I feel like the default word_tokenizer used to be the Penn tokenizer.
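As a rough sketch of swapping these components out (the Treebank word tokenizer here is only an illustration, not one of the defaults listed above):
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus.reader.util import read_line_block
from nltk.tokenize import TreebankWordTokenizer

# Treat each line as its own paragraph and use a Penn-style word tokenizer
corpusReader = PlaintextCorpusReader(
    ".", "d.txt",
    word_tokenizer=TreebankWordTokenizer(),
    para_block_reader=read_line_block,
)
print("Paragraphs =", len(corpusReader.paras()))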
Re: 1.
No, PlaintextCorpusReader cannot read .docx. It only reads plain text.
I'm sure you can find a Python library to convert it.
Re: 2.
Copy and paste is off-topic for this site; try Super User.
I suggest, though, that you instead use option 1 and get a library to do the conversion.
Re: 3.
Yes, you can do a search and replace using a regex:
import re

def breakup(mystring):
    # Insert a newline after ".", "!" or "..." (keeping the punctuation itself)
    return re.sub(r"(\.\.\.|\.|\!) ", r"\1\n", mystring)
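For example, with this sketch:
print(breakup("First sentence. Second one! Last one."))
# First sentence.
# Second one!
# Last one.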
But perhaps you might instead want to swap out your para_block_reader or sent_tokenizer.
I am trying to write documentation with asciidoctor-pdf and I need to use characters like ă, â, î, ş, ţ. The PDF output is rendered, but those characters come out blank. I am not sure how to handle the issue.
For example:
I wrote this code:
= Document Title
Doc Writer <doc#example.com>
:doctype: book
:source-highlighter: coderay
:listing-caption: Listing
// Uncomment next line to set page size (default is Letter)
//:pdf-page-size: A4
A simple http://asciidoc.org[AsciiDoc] document.
== Introducţie
A paragraph followed by a simple list with square bullets.
And the result was the word Introducţie rendered as "Introduc ie", followed by this message:
/usr/local/rvm/gems/ruby-2.2.2/gems/pdf-core-0.2.5/lib/pdf/core/pdf_object.rb:55: warning: regexp match /.../n against to UTF-8 string
Could this be a system encoding configuration problem?
Do I need to set different encoding configuration in ruby?
Thank you.
I think that if you want to be sure, you can always use the decimal entity reference form. For the Latin small letter t with cedilla that is &#355; (which renders as ţ).
Check this table for the complete list:
List of Unicode characters
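If you'd rather not hunt through the table, a quick Python check prints the decimal entity for any character (ţ here is just the one from the question):
ch = "ţ"                  # Latin small letter t with cedilla, U+0163
print(f"&#{ord(ch)};")    # -> &#355;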
In addition, if you want to use this special char in a title, there was an issue with it:
Section id with characters outside of Windows-1252 encoding causes warning
It seems to be fixed now, but I did not verify it.
One possible way to write such special characters in titles is to declare them as an attribute in the header of your AsciiDoc document, for example:
:t-cedil: ţ
and then reference it in the main text:
== pass:normal[Test-{t-cedil}]
So your title will look like
Test-ţ
I have a huge .txt file made using Python. When I try to sort it using Notepad++/TextFX it returns the error: "This tool is not compatible with binary text. Please select text without [NUL] characters." Does this mean that I have non-printable characters in this .txt file? Is it possible to convert this file to a compatible format so I can sort it using TextFX?
EDIT: I used mode 'a' in Python to write this file.
Thank you for your advice.
Using TextFX in Notepad++ you could try the following:
Mark the suspicious part, or the whole text.
Select TextFX, TextFX Characters, Zap all nonprintable characters to #. (The last entry in that submenu.)
All the problematic characters should now have been replaced with "#"; you can then search for "#".
Another idea is the function Search, "Find characters in range". Check "My range:" and enter "0" and "0" as the range to find [NUL] characters.
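If you would rather clean the file up before opening it in Notepad++, a small Python sketch along these lines (the file names are placeholders) strips the NUL bytes so TextFX can sort the result:
# Read the file as raw bytes, drop every NUL byte, and write the result to a new file
with open("data.txt", "rb") as infile:
    data = infile.read()
with open("data_clean.txt", "wb") as outfile:
    outfile.write(data.replace(b"\x00", b""))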
Lars
I'm writing reST documents that will be rendered to HTML and PDF using Sphinx. My source files are UTF-8, and I expect my HTML to be served as UTF-8.
What's the best practice for writing the multiplication sign?
That is: ×, not x.
I know I can insert it as a Unicode character. If I were writing LaTeX, I'd use \times. In HTML there's &times;. Is the plain Unicode character going to survive the conversion process properly when I render everything with Sphinx? Even if I render to other, more exotic formats?
I just want to double-check that this isn't going to trip things up somewhere.
It turns out that the Sphinx documentation contains the answer; I just didn't read quite far enough:
Since the easiest way to include special characters like em dashes or copyright signs in reST is to directly write them as Unicode characters, one has to specify an encoding. Sphinx assumes source files to be encoded in UTF-8 by default; you can change this with the source_encoding config value.
https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#source-encoding
Edit:
Just to follow up here: while the multiplication sign works just fine, many other seemingly standard Unicode symbols (such as less-than-or-equal-to and greater-than-or-equal-to) are missing from the default font used for LaTeX (and thus PDF) rendering.
To add to @Paul McMillan's answer, if you're trying to publish your Sphinx documentation as a LaTeX PDF, you can often get around the missing Unicode symbols by including a preamble like this in your conf.py:
# Tell the LaTeX builder how to typeset U+00D7 (the multiplication sign)
_PREAMBLE = r"""
\usepackage[utf8]{inputenc}
\DeclareUnicodeCharacter{00D7}{\times}
"""

latex_elements = {
    'preamble': _PREAMBLE,
}
Here 00D7 is the Unicode code point, and \times is what you want it to be replaced by in LaTeX.
You can find the Unicode code point for your character on the fileformat website.
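If you don't want to look it up, a quick Python check prints the four-digit hex code point that \DeclareUnicodeCharacter expects (× here is just the character from the question):
ch = "×"                          # U+00D7 MULTIPLICATION SIGN
print(format(ord(ch), "04X"))     # -> 00D7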
More information here and here.
I would use the :math: role within reST: :math:`m \times p`
What's wrong with |times| after using .. include:: <isonum.txt> ?
From http://docutils.sourceforge.net/0.6/docutils/parsers/rst/include/isonum.txt :
.. |sup2| unicode:: U+000B2 .. SUPERSCRIPT TWO
.. |sup3| unicode:: U+000B3 .. SUPERSCRIPT THREE
.. |times| unicode:: U+000D7 .. MULTIPLICATION SIGN
.. |trade| unicode:: U+02122 .. TRADE MARK SIGN
.. |uarr| unicode:: U+02191 .. UPWARDS ARROW