Apache Tika strange whitespace symbols

I am trying to extract the text from the sample PDF located here by using Apache Tika.
Using the GUI of Apache Tika (run from the console: java -jar tika-app-1.22.jar) seems to produce plain text pretty well. The text looks like this:
The issue arises when I try to extract the text by running the following command:
java -jar tika-app-1.22.jar --text lorem-ipsum.pdf
This produces text that has the á character wherever whitespace is expected:
Any help configuring the command to produce normal whitespaces is appreciated, thank you.
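One way to narrow this down is to check which code point is actually appearing where the spaces should be. A minimal diagnostic sketch in Python, assuming the command's output has been redirected to a file (out.txt is only an example name):
# Print the code point of every non-ASCII character in the extracted text,
# to see whether the "á" is really a non-breaking space or something else.
with open('out.txt', 'rb') as f:  # redirected output of the tika-app command
    text = f.read().decode('utf-8', errors='replace')
for ch in sorted(set(text)):
    if ord(ch) > 127:
        print('U+%04X %r' % (ord(ch), ch))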

Related


Invalid character error in Eclipse on Docker
I have created a container using a Dockerfile similar to
https://github.com/batmat/docker-eclipse/blob/master/Dockerfile
on Docker installed on Windows 7. I did need to make one change: setting the locale first in the Dockerfile. I tried both en_US.UTF8 and en_IN.UTF8.
When I start the container I am able to open Eclipse in Xming successfully, but Eclipse gives an invalid character error on double quotes (and probably on some other characters, too).
Is there any other change or setting I need to make?
This does not look like a file encoding problem (because the syntax error is not reported at the first character), but rather that, instead of ASCII quotation marks ("), similar-looking characters (e.g. ", ײ, ״, ʺ, etc.) were used.
Replace line 17 with the following line:
System.out.println("Hello There");
See also: Java Language Specification - 3.10.5. String Literals
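If it is not obvious which characters are the offenders, a short script can point them out. A rough sketch in Python (the source file name is only an example):
# List every non-ASCII character in a source file with its line and column,
# so lookalike quotation marks can be found and replaced with plain ".
with open('MyClass.java', encoding='utf-8') as f:  # hypothetical file name
    for lineno, line in enumerate(f, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                print('line %d, col %d: %r (U+%04X)' % (lineno, col, ch, ord(ch)))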

UTF-8 issue with CoreNLP server

I run a Stanford CoreNLP Server with the following command:
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
I try to parse the sentence Who was Darth Vader’s son?. Note that the apostrophe after Vader is not an ASCII character.
The online demo successfully parses the sentence:
The server I run on localhost fails:
I also tried to perform the query using Python.
import requests
url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
r=requests.post(url, params={'properties' : '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'}, data=sentence.encode('utf8'))
tree = r.json()
The last command raises an exception:
ValueError: Invalid control character at: line 1 column 1172 (char 1171)
However, I noticed occurrences of the character \x00 in the text (i.e. r.text). If I remove them, the json parsing succeeds:
import json
tree = json.loads(r.text.replace('\x00', ''))
Finally, r.encoding is ISO-8859-1, even though I did not use the -strict option to run the server. Note that it does not change anything if I manually replace it with UTF-8.
If I run the same code replacing url = 'http://localhost:9000/' with url = 'http://corenlp.run/', then everything succeeds. The call r.json() returns a dict, r.encoding is indeed UTF-8, and there is no \x00 character in the text.
What is wrong with the CoreNLP server I run?
This is a known bug with the 3.6.0 release. If you build the server from GitHub, it should work properly with UTF-8 characters. Setting the appropriate Content-Type header in the request will also fix this issue (see https://github.com/stanfordnlp/CoreNLP/issues/125).
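If rebuilding the server is not convenient, the header fix can be applied to the Python snippet above. A sketch, assuming text/plain; charset=utf-8 is an appropriate value for a raw-text POST body (the exact value is an assumption, not taken from the CoreNLP documentation):
import requests
# Same request as before, but with an explicit Content-Type declaring UTF-8.
url = 'http://localhost:9000/'
sentence = 'Who was Darth Vader’s son?'
params = {'properties': '{"annotators": "tokenize,ssplit,pos,ner", "outputFormat": "json"}'}
headers = {'Content-Type': 'text/plain; charset=utf-8'}
r = requests.post(url, params=params, headers=headers, data=sentence.encode('utf-8'))
tree = r.json()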

Why can't I include images in pdf using Pandoc?

I can successfully produce the images when the output is HTML, but pandoc errors out when attempting PDF output.
The input file text for the image:
![](images\icon.png "test")
The error produced:
pandoc: Error producing PDF from TeX source.
! Undefined control sequence.
images\icon
l.535 \includegraphics{images\icon.png}
Note that pandoc produces the PDF via LaTeX, as the error message reveals. Your input
![](images\icon.png "test")
is converted into LaTeX
\includegraphics{images\icon.png}
\ in LaTeX has a special meaning: it begins a control sequence. So LaTeX is looking for an \icon command here and not finding it. The fix is to use a forward slash / instead of a backslash \ as the path separator. LaTeX allows you to use / for paths even on Windows.
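With a forward slash the same input
![](images/icon.png "test")
is converted into
\includegraphics{images/icon.png}
and the spurious \icon control sequence never arises.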
Of course, this may cause problems in some other output formats. I suppose I should change pandoc to convert backslashes in paths to forward slashes when writing LaTeX.
I've had a similar problem on Windows. My images are stored in a subdirectory named "figures". No matter what I tried, the path wasn't followed. I solved it by including --resource-path=.;figures in the call to pandoc.
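For reference, that option goes directly on the pandoc command line; a sketch of a full invocation on Windows (input and output names are only examples):
pandoc report.md -o report.pdf --resource-path=.;figures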

Restore diacritics in utf8 format - Linux

I've got several text files full of sentences like this: "Mais, tu n'as pas fait tes devoirs ?!" -\u00c9l\u00e8ve : "Ben non"
Is there a quick way (script or utility) to restore all the diacritics in utf8 format? (expected result: Élève : "Ben non")
I could do it manually with sed but since my text files contain diacritics peculiar to several languages, that would take too much time.
Thank you very much
I found it...
python -c "print (open('filetoconvert.txt','rb').read().decode('unicode-escape').encode('utf-8'))"
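That one-liner is Python 2 style; under Python 3 the print() call would show a bytes literal instead of the decoded text. A rough Python 3 equivalent, assuming the input file is plain ASCII containing \uXXXX escapes (the output file name is just an example):
# Decode the \uXXXX escapes and write the result back out as UTF-8.
with open('filetoconvert.txt', 'rb') as src:
    text = src.read().decode('unicode-escape')  # turns \u00c9 into É
with open('converted.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)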
If you have a Java JDK installed, there's a utility program called native2ascii for converting files to and from unicode escapes. For example:
native2ascii -reverse filetoconvert.txt > converted.txt

ruby mechanize: how to read a downloaded binary csv file

I'm not very familiar with handling binary data in ruby. I'm using mechanize to download a large number of csv files to my local disk. I then need to search these files for specific strings.
I use the save_as method in mechanize to save the file (which saves the file as binary). The content type of the file (according to mechanize) is:
application/vnd.ms-excel;charset=x-UTF-16LE-BOM
From here, I'm not sure how to read the file. I've tried reading it in as a normal file in ruby, but I just get the binary data. I've also tried standard unix tools (strings/grep) to search it, without any luck.
When I run the 'file' command on one of the files, I get:
foo.csv: Little-endian UTF-16 Unicode Pascal program text, with very long lines, with CRLF, CR, LF line terminators
I can see the data just fine with cat or vi. With vi I also see some control characters.
I've also tried both the csv and fastercsv ruby libraries, but I get an 'IllegalFormatError' exception from both. I've also tried this solution, without any luck.
Any help would be greatly appreciated. Thanks.
You can use the 'iconv' command to convert to UTF-8:
# iconv -f 'UTF-16LE' -t 'UTF-8' bad_file.csv > good_file.csv
There is also a wrapper for iconv in the standard library; you could use that to convert the file after reading it into your program.
