UTF-8 with lex (flex)

I have a lexer specified with the following definitions:
ws [ \t\n]+
punc (\.|\,|\!|\?)
word ({punc}|[a-zA-Z0-9])*
special (\%|\_|\&|\$|\#)
I have some UTF-8 files that I need to parse, and naturally the lexer blows up when it hits those characters. I know similar questions have been asked a few times in the past, but none of them helped. I tried to use the approach given in this answer, but I failed. I guess the problem is in the definition of word above?
It would be really helpful if someone could give details on the general concept of using UTF-8 encoding with flex.

Try these definitions (process with flex -8; note they go in the definitions section, before the first %%):
ws [ \t\n]+
punc (\.|\,|\!|\?)
word ({punc}|[a-zA-Z0-9\x80-\xf3])*
special (\%|\_|\&|\$|\#)
(The encoding handling here is a bit coarse-grained ...) The link mentioned by the OP, leading to Kaz's answer, is much more exact with respect to the allowed sequences.
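If you want something closer to the actual structure of UTF-8, you can spell out the legal lead/continuation byte ranges. The following is only a sketch along those lines (the u2/u3/u4 names are mine; it rejects stray continuation bytes, but unlike Kaz's answer it does not exclude overlong or surrogate encodings):
u2 [\xc2-\xdf][\x80-\xbf]
u3 [\xe0-\xef][\x80-\xbf]{2}
u4 [\xf0-\xf4][\x80-\xbf]{3}
word ({punc}|[a-zA-Z0-9]|{u2}|{u3}|{u4})*
As before, these are definitions; the rules section then matches {word} as usual.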

Related

ACM programming - Arithmetica 1.0 - any additional operators?

Has anyone in this forum attempted to solve the ACM programming problem http://acm.mipt.ru/judge/problems.pl?browse=yes&problem=024? It is one of the simpler problems in ACM MIPT, and the goal is to evaluate an expression consisting of +, -, * and parentheses. Despite the apparent simplicity, I haven't been able to get my solution accepted, apparently because one of the test-case expressions has an operator not stated in the problem. I even added support for division ('/'), but that didn't help either. Any idea what other operator needs to be supported? FYI, my program removes all whitespace from the input before processing, so spaces shouldn't be a problem. Is there anything not stated in the problem that needs to be taken care of?
You're being bitten by Ruby 1.8's handling of strings and characters.
curr_ch = @input[i]
gives you an integer: for the input you get, it is the ASCII code of the character at index i of the input.
curr_ch == '('
for example compares that integer to the string "(", which of course fails. The regex matches fail as well, because you pass them an integer where a string is expected.
Replacing all occurrences of some_var = @input[some_index] with some_var = @input[some_index...some_index+1] gives me a programme that seems to work (it works on the few test inputs I gave it). Someone who actually knows Ruby's quirks can probably give you a better fix.
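Equivalently, a one-character slice does the job and keeps working on 1.9+ (a sketch; @input stands for the OP's variable):
curr_ch = @input[some_index, 1]   # one-character String in both 1.8 and 1.9+
curr_ch == '('                    # a String comparison now, as intended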

Any ruby gems to do (Chinese) Transliterate (Romanization), especially for URL?

Generally speaking, it takes Unicode text and tries to represent it in US-ASCII characters (universally displayable, unaccented characters) by attempting to transliterate the pronunciation expressed by the text in some other writing system into Roman letters.
e.g.,
"一二三".ooxx => "yi-er-san"
After searching http://rubygems.org/search?utf8=%E2%9C%93&query=pinyin I found some rubygems, but none of them works robustly for this purpose.
Doing this perfectly is almost impossible, since some Chinese characters have two or more pronunciations: for example, 银行 = yin hang but 不行 = bu xing (the last character is identical, pronounced hang in one context and xing in the other). Other than that, you could probably roll your own using the Unicode database, which I think has pronunciation info as well. If you want to be fancier, there are some open-source input methods that have the mappings, and they have them for words too, so that when 银行 occurs together, it is known that the second character is hang, not xing. OpenVanilla might have databases you can work with (OSS).
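If you do roll your own, the word-before-character lookup can be as simple as this Ruby sketch (the tables are tiny stand-ins; a real dictionary or input-method database would supply them):
WORD_PINYIN = { "银行" => "yin-hang", "不行" => "bu-xing" }
CHAR_PINYIN = { "银" => "yin", "行" => "xing", "不" => "bu" }

def romanize(text)
  out = []
  i = 0
  while i < text.length
    word = text[i, 2]
    if WORD_PINYIN.key?(word)        # prefer the multi-character word
      out << WORD_PINYIN[word]
      i += 2
    else
      out << (CHAR_PINYIN[text[i]] || text[i])
      i += 1
    end
  end
  out.join("-")
end

romanize("银行")  # => "yin-hang", not "yin-xing"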

Using phrase_from_file to read a file's lines

I've been trying to parse a file containing lines of integers using phrase_from_file with the grammar rules
line --> I,line,{integer(I)}.
line --> ['\n'].
thusly: phrase_from_file(line,'input.txt').
It fails, and I got lost very quickly trying to trace it.
I've even tried to print I, but it doesn't even get there.
EDIT:
As none of the solutions below really fit my needs (using read/1 assumes you're reading terms, and writing that DCG might just take too long), I cannibalized this code I googled; the main changes were the addition of:
read_rest(-1, []) :- !.
read_word(C, [], C) :-
    ( C = 32
    ; C = -1
    ),
    !.
If you are using phrase_from_file/2 there is a very simple way to test your programs prior to reading actual files. Simply call the very same non-terminal with phrase/2. Thus, a goal
phrase(line,"1\n2").
is the same as calling
phrase_from_file(line,fichier)
when fichier is a file containing the above 3 characters. So you can test and experiment in a very compact manner with phrase/2.
There are further issues @Jan Burse already mentioned. SWI reads in character codes. So you have to write
newline --> "\n".
for a newline. And then you still have to parse the integers yourself. But all of that is much easier to test with phrase/2. The nice thing is that you can then switch to reading files without changing the actual DCG code.
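For instance, parsing a non-empty run of digits into an integer could look like this (a minimal sketch, not code from the thread; the digit//1, digits//1 and integer//1 names are mine):
digit(D) --> [D], { 0'0 =< D, D =< 0'9 }.
digits([D|Ds]) --> digit(D), digits(Ds).
digits([D]) --> digit(D).
integer(N) --> digits(Ds), { number_codes(N, Ds) }.
A quick test with phrase/2: phrase(integer(N), "123"). succeeds with N = 123 (assuming double-quoted strings denote code lists).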
I guess there is a conceptual problem here. Although I don't know the details of phrase_from_file/2 (i.e., which Prolog system you are using), I nevertheless assume that it will produce character codes. So for an integer 123 in the file you will get the character codes 0'1, 0'2 and 0'3. This is probably not what you want.
If you would like to process the characters, you need to use a non-terminal instead of a bare variable I to fetch them. And instead of the integer test, you need a character test, and you can do the test earlier:
line --> [I], {0'0=<I, I=<0'9}, line.
Best Regards
P.S.: Instead of going the DCG way, you could also use term read operations. See also:
read numbers from file in prolog and sorting
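Note that read/1 expects each number in the file to be written as a Prolog term ending with a period (1. 2. 3. and so on). Under that assumption, a minimal sketch of the term-reading route (the helper names are mine):
read_numbers(File, Ns) :-
    open(File, read, S),
    read_terms(S, Ns),
    close(S).

read_terms(S, Ns) :-
    read(S, T),
    (  T == end_of_file
    -> Ns = []
    ;  Ns = [T|Rest],
       read_terms(S, Rest)
    ).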

ruby from any encoding to ascii

I have to deal mainly with the English alphabet and all the punctuation marks; I don't have to worry about European accents. So my only concern is when a user pastes something he copied from the web that includes, for instance, an apostrophe; when I puts it in the console (on Win7), it outputs
"ItΓÇÖs" # whereas it actually is " It's "
So my main question is: is there an end-it-all conversion method I can use in Ruby that just properly replaces all the ,.;?!"'~` _- with their ASCII counterparts?
I really understand very little about encodings, so if you think this is the wrong question to ask, which can very likely be the case, please advise what I should look for instead.
Thank you
I work in publishing, where we deal with this a lot. We have had success with stringex (https://github.com/rsl/stringex). It has a to_ascii method that normalizes Unicode dashes etc.
And in ruby 2.0:
"ItΓÇÖs".encode("ASCII", invalid: :replace, undef: :replace, replace: '')
=> "Its"
For programmatically handling multibyte encodings, iconv is your friend. Also, James Gray wrote a series of blog articles about how to take the problem apart and convert encodings.
The problem gets more complicated when dealing with pasted-in text, because some characters could be in one multibyte encoding and other characters in another. You might have to walk the string checking for multibyte characters, ask Ruby what the encoding is, and, if it's not what you expect, convert it to the expected or desired encoding before moving to the next character. Gray's articles cover it all nicely and are good reading.
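As a sketch of that per-string conversion (the source encoding here, Windows-1252, is an assumption; pasted text could be anything, and pasted is a stand-in variable):
# Tell Ruby what the bytes most likely are, then transcode to UTF-8,
# replacing anything invalid or unmappable along the way.
raw = pasted.dup.force_encoding(Encoding::Windows_1252)
clean = raw.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "'")
If you truly want pure ASCII afterwards, a final clean.encode("ASCII", ...) call with the same options (or stringex's to_ascii) finishes the job.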

Putting spaces back into a string of text with unreliable space information

I need to parse some text from pdfs but the pdf formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and have a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into the string by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the pdf files are nicely formatted and some are not. An example of text mixed with positioning:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
and if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks so if I print each string as it is returned it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys; just getting to the point where I am asking the question clearly helped me answer it!
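For what it's worth, a minimal Ruby sketch of that heuristic (the -300 cutoff and the flat token list are assumptions based on the sample above):
SPACE_THRESHOLD = -300  # assumption: adjustments below this are real spaces

# Strings are glyph runs; numbers are positioning adjustments.
def join_tokens(tokens)
  tokens.each_with_object("") do |t, out|
    if t.is_a?(Numeric)
      out << " " if t < SPACE_THRESHOLD
    else
      out << t
    end
  end
end

join_tokens(["The", -571.3, "neural", -573.7, "system"])
# => "The neural system"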
Hmmmm... I'd have to say that guessing is never a good idea. Looking at the problem's root cause and solving that is the answer; anything else is a kludge.
If the spacing from the PDF is unreliable, how is it unreliable? The PDF viewer needs to be able to space the text reliably, so the data is there somewhere; you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option, really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word, etc.) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined, whereas English is somewhat woolly.
Why not look down the route of existing solutions like ps2ascii on Linux: call it from your Ruby and pick up the result.
PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF Reference (the big PDF at the bottom of the site); Chapter 9, "Text", should be what you're looking for.
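For illustration, the kind of operator involved looks like this (a hand-written sketch, not taken from the OP's files):
[(The) -571.3 (neural) -573.7 (system)] TJ
The numbers are in thousandths of a text-space unit and are subtracted from the pen position, so large negative values push the pen right and act as inter-word gaps, while small values are mere kerning adjustments; that is exactly the pattern visible in the OP's callback output above.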
EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it must be "me". The same goes for the next word, which could be "a", "an" or "and": only "and" makes sense.
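A Ruby sketch of that recursion (the word list here is a tiny stand-in; a real dictionary file would replace it):
require 'set'

WORDS = Set.new(%w[me mean a an and you])

# Return every way to split str into dictionary words.
def segmentations(str)
  return [[]] if str.empty?
  (1..str.length).flat_map do |i|
    word = str[0, i]
    next [] unless WORDS.include?(word)
    segmentations(str[i..-1]).map { |rest| [word, *rest] }
  end
end

segmentations("meandyou")  # => [["me", "and", "you"]]
For real text you would memoize on the string position and prefer longer words, but the backtracking idea is the same.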
If it were me, I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML-to-text conversion.
