Putting spaces back into a string of text with unreliable space information - ruby

I need to parse some text from pdfs but the pdf formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and have a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into the string by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the pdf files are nicely formatted and some are not. An example of text mixed with positioning:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
And if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks so if I print each string as it is returned it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination, it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys. Just getting to the point where I was asking the question clearly helped me answer it!
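A minimal sketch of that observation in Ruby, assuming the callback output has already been collected into one flat array of strings and numbers. The `rejoin_fragments` name, the array shape, and the -300 default are illustrative assumptions, not part of pdf-reader's API:

```ruby
# Rebuild readable text from string fragments interleaved with positioning
# adjustments. Assumption: an adjustment below `threshold` is a real word gap,
# anything else is kerning inside a word and is ignored.
def rejoin_fragments(items, threshold: -300)
  items.each_with_object(+"") do |item, out|
    if item.is_a?(Numeric)
      # Emit a single space for a large negative adjustment.
      out << " " if item < threshold && !out.end_with?(" ")
    else
      out << item.to_s
    end
  end.strip
end

sample = ["The", -571.3, "neural", -573.7, "system", -577.4,
          "underly", 13.9, "ing", -577.2, "face", -573.0, "perc", 13.7, "eption"]
puts rejoin_fragments(sample)
```

Two adjacent strings with no number between them are simply concatenated here, so a line break in the source would still need separate handling.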

Hmmmm... I'd have to say that guessing is never a good idea. Looking at the problem root cause and solving that is the answer, anything else is a kludge.
If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to reliably space the text so the data is there somewhere, you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word i.e. plural, etc.) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined whereas English is somewhat woolly.
Why not go down the route of existing solutions like ps2ascii on Linux: call it from your Ruby code and pick up the result?

PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF reference (the big PDF on the bottom of the site), Chapter 9 "Text" should be what you're looking for.
EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it will be "me", same for the next word that could be "a" or "an" or "and", only "and" makes sense.
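Here is a rough sketch of that recursive split in Ruby. The tiny `WORDS` set and the `segment` helper are made up for illustration; a real run would need a proper word list, and this version just returns the first segmentation it finds rather than the most plausible one:

```ruby
require "set"

# Hypothetical toy dictionary; swap in a real word list.
WORDS = Set.new(%w[a an and me mean you])

# Returns an array of dictionary words that concatenate to `text`,
# or nil when no split exists. `memo` caches results per suffix so
# dead ends such as "dyou" are only explored once.
def segment(text, words, memo = {})
  return [] if text.empty?
  return memo[text] if memo.key?(text)

  result = nil
  1.upto(text.length) do |len|
    head = text[0, len]
    next unless words.include?(head)

    rest = segment(text[len..-1], words, memo)
    if rest
      result = [head] + rest
      break
    end
  end
  memo[text] = result
end

p segment("meandyou", WORDS)
```

Trying "mean" fails because "dyou" cannot be split, so the recursion backs off to "me", exactly as described above.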

If it were me I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML to text conversion software method.

Related

ABCPDF.Net AddText Control hyphenation

I'm using ABCPDF.net for generating PDF Pages. We've got a problem with the hyphenation system.
For example if we add a text with long words using
doc.AddText("This is a Verylongwordwhichdoesntfit");
and the Rect is too small, we get:
this is a verylongwo
rdwhichdoesntfit.
My Question now is:
Can I control where it starts a new line, so that it breaks between "long" and "word"?
And can I tell it to put a - before the break, like this?
this is a verylongwo-
rdwhichdoesntfit.
Thanks a lot.
Details in the documentation here:
http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm
Firstly, with .AddText() there is no possibility of hyphenation at all. You'd have to switch to .AddHtml().
Secondly, no, abcpdf has no intelligence about hyphenating at all; it can be told to break lines after certain characters (default is space), but it has no knowledge of English words or syllables.
See http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm#stylerun (search for canBreakAfter at that link)
If you're able to edit your text, you can use soft hyphen characters
http://www.websupergoo.com/helppdfnet/source/3-concepts/b-htmlstyles.htm#stylerun, last line of the "Chars" section
If you require fine control over hyphenation you can make use of the soft hyphen character, &shy; (U+00AD). This character is invisible and indicates a point at which a chunk of text may reasonably be broken.
For example, you'd use this command, and it might break at any of the places where the &shy; appears:
doc.AddHtml("This is a Very&shy;long&shy;word&shy;which&shy;doesnt&shy;fit");
But even this won't add the visible hyphens at the break, I don't think.

XSL-FO: wrapping long words in table cell

I'm using Docbook-XSL and Apache FOP to generate PDF documents containing tables. With the default settings, tables have fixed-width columns and lines wrap at word boundaries. But if a word is longer than the cell width, it overflows the cell. I'd like to break up the words across multiple lines in such a case. How could this be done?
Hyphenation is not a solution since the words need not be in English. (Edit: hyphenation in other languages is not a solution either. It may not be known ahead of time what language the data is in, and there may be "words" that cannot be hyphenated, such as numeric strings.)
I found suggestions to use keep-together.within-column="always" for fo:table-rows, but that didn't seem to have any effect.
(Edit:) Another suggestion was to insert zero-width spaces between all characters. But this also breaks short words mid-word. I would need a solution that breaks at word boundaries whenever possible, and mid-word only when needed.
FOP, like just about every FO processor, can hyphenate languages other than English. See http://xmlgraphics.apache.org/fop/2.1/hyphenation.html
You could try using an FO processor, such as Antenna House AH Formatter, that implements 'auto' table layout and can adjust the widths of the table columns depending on where the text can break (as well as do hyphenation for multiple languages).
Other answers for breaking text in table cells are at:
Force line break after string length
XSL-FO: Force Wrap on Table Entries

Using an "uncommon" delimiter for creating arrays in Ruby on Rails

I am building an app in Ruby on Rails in which I am pulling in content from another file, and I wonder whether there's any simple way to create a unique delimiter for separating string content, or whether there's another approach I should take.
Let's say I have a paragraph of text, I'd like to pull in, and let's say I don't know what the text will contain.
What I would like to do is put some sort of delimiter at, let's say, 5 random points in the paragraph so that, later on, an array can be created in which content up to that delimiter can be separated out into an individual element.
For a bit of context, let's say I have a paragraph pulled in as a string:
Hello, this is a paragraph of text which will be delimited. Goodbye.
Now, let's say I add a delimiter at various points, as follows (I know how to do this in code):
Hello, this [DELIMITER] is a paragraph [DELIMITER] of text which [DELIMITER] will [DELIMITER] be delimited. Goodbye.
Again, I know how to do this, but let's say I'm able to use the above to create an array as follows:
my_array = ["Hello, this", "is a paragraph", "of text which", "will", "be delimited. Goodbye."]
I'm confident of achieving all of the above. The challenge I'm having is: what should my delimiter be?
Normally, commas are used as delimiters but, if the text already includes a comma, this will result in delimitations where I do not wish them to occur. In the above example, for example, the comma between "Hello" and "this" would cause the "Hello, this" element to be split up into "Hello" and "this"—not what I want.
What I have thought of doing is using a random (hex) number generator to create a new delimiter each time the page is loaded, e.g. "Hello, this 023ABCDEF is a paragraph 023ABCDEF...", but I'm not sure this is the correct approach.
Is there a simpler solution?
Multipart mime messages take (more or less) the approach of a GUID separator; it's adequate.
I view this as a different type of problem, though, closer to a text editor marking sections of text bold, or italic, etc. That can be handled via string parsing (a la Markdown, SO's formatting) or data structures.
The text editor approach is generally more flexible, and instead of a simple collection of strings, uses a collection (or tree) of structures that hold metadata about the section (type, formatting, whatever).
The best approach depends on your needs:
Are sections nestable?
Will this be rendered?
If so, do section "types" need specific rendering?
Are there section "types", or are they all the same?
Will the text in question be edited before, during, or after sectioning?
Etc.
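For the simple separator approach mentioned at the start of this answer, a sketch in Ruby might look like the following. `unique_delimiter` is a made-up helper; it draws random hex tokens until it finds one that does not occur in the text, which is the same idea as a MIME boundary:

```ruby
require "securerandom"

# Generate a bracketed random token that is guaranteed not to appear
# in `text`. Collisions are extremely unlikely; the loop just makes
# the guarantee explicit.
def unique_delimiter(text)
  loop do
    candidate = "[#{SecureRandom.hex(8)}]"
    return candidate unless text.include?(candidate)
  end
end

text = "Hello, this is a paragraph of text which will be delimited. Goodbye."
delim = unique_delimiter(text)

# Mark two split points (chosen by hand here), then split on the token.
marked = text.sub("this ", "this #{delim} ").sub("paragraph ", "paragraph #{delim} ")
parts = marked.split(delim).map(&:strip)
p parts
```

The delimiter has to be stored alongside the marked text so that whoever splits it later knows what to look for.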

What Ruby Regex code can I use for obtaining "out of sight" from the input "outofsight"?

I'm building an application that returns results based on a movie input from a user. If the user messes up and forgets to space out the title of the movie, is there a way I can still take the input and return the correct data? For example, "outofsight" should still be interpreted as "out of sight".
There is no regex that can do this in a good and reliable way. You could try a search server like Solr.
Alternatively, you could do auto-complete in the GUI (if you have one) on the input of the user, and this way mitigate some of the common errors users can end up doing.
Example:
User wants to search for "outofsight"
Starts typing "out"
Sees "out of sight" as suggestion
Selects "out of sight" from suggestions
????
PROFIT!!!
There's no regex that can tell you where the word breaks were supposed to be. For example, if the input is "offlight", is it supposed to return "Off Light" or "Of Flight"?
This is impossible without a dictionary and some kind of fuzzy-search algorithm. For the latter see How can I do fuzzy substring matching in Ruby?.
You could take a string and put \s* in between each character.
So outofsight would be converted to:
o\s*u\s*t\s*o\s*f\s*s\s*i\s*g\s*h\s*t
... and match out of sight.
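A sketch of building that pattern in Ruby. `spaced_pattern` is a hypothetical helper, and the anchoring and case-insensitivity are my own assumptions about what you'd want:

```ruby
# Build a regex that matches the input's characters in order, allowing
# any amount of whitespace between them. Each character is escaped so
# that punctuation in a title cannot break the pattern.
def spaced_pattern(input)
  chars = input.gsub(/\s+/, "").chars.map { |c| Regexp.escape(c) }
  Regexp.new("\\A" + chars.join('\s*') + "\\z", Regexp::IGNORECASE)
end

titles = ["Out of Sight", "Off Light", "Of Flight"]
pattern = spaced_pattern("outofsight")
p titles.select { |title| title.match?(pattern) }
```

Note that this only finds titles you already have stored. It cannot decide between candidates, so "offlight" would match both "Off Light" and "Of Flight".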
You can't do this with regular expressions, unless you want to store one or more patterns to match for each movie record. That would be silly.
A better approach for catching minor misspellings would be to calculate Levenshtein distances between what the user is typing and your movie titles. However, when your list of movies is large, this will become a rather slow operation, so you're better off using a dedicated search engine like Lucene/Solr that excels at this sort of thing.
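For a small list of titles, the Levenshtein idea can be sketched in plain Ruby as below. This is the textbook dynamic-programming version, written for illustration only; the `levenshtein` helper and the sample titles are assumptions, and it will not scale to a large catalogue:

```ruby
# Edit distance between two strings (insertions, deletions and
# substitutions each cost 1), computed row by row.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

titles = ["Out of Sight", "Outbreak", "Insight"]
query = "outofsight"
best = titles.min_by { |title| levenshtein(query, title.downcase) }
puts best
```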

Least used delimiter character in normal text < ASCII 128

For coding reasons which would horrify you (I'm too embarrassed to say), I need to store a number of text items in a single string.
I will delimit them using a character.
Which character is best to use for this, i.e. which character is the least likely to appear in the text? Must be printable and probably less than 128 in ASCII to avoid locale issues.
I would choose "Unit Separator" ASCII code "US": ASCII 31 (0x1F)
In the old, old days, most things were done serially, without random access. This meant that a few control codes were embedded into ASCII.
ASCII 28 (0x1C) File Separator - Used to indicate separation between files on a data input stream.
ASCII 29 (0x1D) Group Separator - Used to indicate separation between tables on a data input stream (called groups back then).
ASCII 30 (0x1E) Record Separator - Used to indicate separation between records within a table (within a group). These roughly map to a tuple in modern nomenclature.
ASCII 31 (0x1F) Unit Separator - Used to indicate separation between units within a record. These roughly map to fields in modern nomenclature.
Unit Separator is in ASCII, and there is Unicode support for displaying it (typically a "us" in the same glyph) but many fonts don't display it.
If you must display it, I would recommend displaying it in-application, after it was parsed into fields.
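As a quick sketch of how the separators could be used in practice (shown in Ruby; the sample records are made up, and this assumes the fields themselves never contain 0x1E or 0x1F):

```ruby
US = "\x1F" # Unit Separator: between fields
RS = "\x1E" # Record Separator: between records

records = [["alice", "a|b,c"], ["bob", "x y"]]

# Pack everything into a single string...
encoded = records.map { |fields| fields.join(US) }.join(RS)

# ...and unpack it again. The -1 limit keeps trailing empty fields.
decoded = encoded.split(RS, -1).map { |record| record.split(US, -1) }
p decoded
```

Commas, pipes and spaces in the data pass through untouched because none of them are used as separators.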
Assuming for some embarrassing reason you can't use CSV I'd say go with the data. Take some sample data, and do a simple character count for each value 0-127. Choose one of the ones which doesn't occur. If there is too much choice get a bigger data set. It won't take much time to write, and you'll get the answer best for you.
The answer will be different for different problem domains, so | (pipe) is common in shell scripts, ^ is common in math formulae, and the same is probably true for most other characters.
I personally think I'd go for | (pipe) if given a choice but going with real data is safest.
And whatever you do, make sure you've worked out an escaping scheme!
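The character count suggested above takes only a few lines. Here is a sketch in Ruby, limited to printable non-space ASCII (33 to 126) as an assumption about what "printable" means; `unused_ascii_chars` and the sample string are illustrative:

```ruby
require "set"

# Return the candidate characters that never occur in the sample data.
def unused_ascii_chars(sample, candidates = 33..126)
  seen = sample.each_byte.to_set
  candidates.reject { |code| seen.include?(code) }.map(&:chr)
end

sample = "Hello, world! Cost: $5 | 10% off"
puts unused_ascii_chars(sample).join(" ")
```

A character being absent from the sample does not mean it can never appear, so the escaping scheme is still needed.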
When using different languages, this symbol: ¬
proved to be the best. However I'm still testing.
Probably | or ^ or ~. You could also combine two characters.
You said "printable", but that can include characters such as a tab (0x09) or form feed (0x0c). I almost always choose tabs rather than commas for delimited files, since commas can sometimes appear in text.
(Interestingly enough the ascii table has characters GS (0x1D), RS (0x1E), and US (0x1F) for group, record, and unit separators, whatever those are/were.)
If by "printable" you mean a character that a user could recognize and easily type in, I would go for the pipe | symbol first, with a few other weird characters (# or ~ or ^ or \, or backtick which I can't seem to enter here) as a possibility. These characters +=!$%&*()-'":;<>,.?/ seem like they would be more likely to occur in user input. As for underscore _ and hash # and the brackets {}[] I don't know.
How about you use a CSV-style format? Characters can be escaped in a standard CSV format, and there are already a lot of parsers written.
Can you use a pipe symbol? That's usually the next most common delimiter after comma or tab delimited strings. It's unlikely most text would contain a pipe, and ord('|') returns 124 for me, so that seems to fit your requirements.
For fast escaping I use stuff like this:
Say you want to concatenate str1, str2 and str3.
What I do is:
delimitedStr=str1.Replace("#","#a").Replace("|","#p")+"|"+str2.Replace("#","#a").Replace("|","#p")+"|"+str3.Replace("#","#a").Replace("|","#p");
then to retrieve original use:
splitStr=delimitedStr.Split("|".ToCharArray());
str1=splitStr[0].Replace("#p","|").Replace("#a","#");
str2=splitStr[1].Replace("#p","|").Replace("#a","#");
str3=splitStr[2].Replace("#p","|").Replace("#a","#");
Note: the order of the replacements is important.
It's unbreakable and easy to implement.
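The same scheme ported to Ruby, as a sketch that generalises the three-string example to any number of fields. The helper names are mine, not from the code above:

```ruby
# Escape "#" first, then "|", so an escaped pipe is never re-escaped.
def escape_field(str)
  str.gsub("#", "#a").gsub("|", "#p")
end

# Undo the escapes in the reverse order.
def unescape_field(str)
  str.gsub("#p", "|").gsub("#a", "#")
end

def join_fields(fields)
  fields.map { |f| escape_field(f) }.join("|")
end

def split_fields(joined)
  # The -1 limit keeps trailing empty fields.
  joined.split("|", -1).map { |f| unescape_field(f) }
end

fields = ["a|b", "x#y", "#p", "plain"]
joined = join_fields(fields)
p joined
p split_fields(joined)
```

After escaping, every "#" in a field is followed by "a" or "p", which is why unescaping "#p" before "#a" cannot misread anything.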
Pipe for the win! |
We use ascii 0x7f which is pseudo-printable and hardly ever comes up in regular usage.
Well it's going to depend on the nature of your text to some extent but a vertical bar 0x7C doesn't crop up in text very often.
I don't think I've ever seen an ampersand followed by a comma in natural text, but you can check the file first to see if it contains the delimiter, and if so, use an alternative. If you want to always be able to know that the delimiter you use will not cause a conflict, then do a loop checking the file for the delimiter you want, and if it exists, then double the string until the file no longer has a match. It doesn't matter if there are similar strings because your program will only look for exact delimiter matches.
This can be good or bad (usually bad) depending on the situation and language, but keep in mind that you can always Base64-encode the whole thing. You then don't have to worry about escaping and unescaping various patterns on each side, and you can simply separate and split strings based on a character which isn't used in your Base64 charset.
I have had to resort to this solution when faced with putting XML documents into XML properties/nodes. Properties can't have CDATA blocks in them at all, and nodes escaped as CDATA obviously cannot have further CDATA blocks inside that without breaking the structure.
CSV is probably a better idea for most situations, though.
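A sketch of the Base64 approach in Ruby, using the core `pack("m0")` and `unpack1("m0")` directives so nothing needs to be required. The helper names are hypothetical, and decoding assumes the original strings were UTF-8:

```ruby
# Encode each field as strict Base64 (no line breaks) and join with "|",
# which is not part of the Base64 alphabet.
def pack_fields(fields)
  fields.map { |f| [f].pack("m0") }.join("|")
end

# Split on "|" and decode each part back to a UTF-8 string.
def unpack_fields(packed)
  packed.split("|", -1).map do |part|
    part.unpack1("m0").force_encoding(Encoding::UTF_8)
  end
end

fields = ["a|b", "line one\nline two", "caf\u00e9"]
packed = pack_fields(fields)
p unpack_fields(packed)
```

The cost is roughly a third more bytes and a string that is no longer human-readable.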
Both pipe and caret are the obvious choices. I would note that if users are expected to type the entire response, caret is easier to find on any keyboard than is pipe.
I've used double pipe and double caret before. The idea of a non-printable char works if you're not hand-creating or modifying the file. For quick random-access file storage and retrieval, field width is used. You don't even have to read the file: you're literally pulling from the file by reference. This is how databases do some storage, but they also manage the spaces between records and such. And it introduced the problem of max data element width. (In the original old days, indexes attached a header which defined the width of each element and its data type. Later they introduced compression with remapping chars. This allows a text file to get to about 1/8 the size in transmission.) Variable-length char encoding for the win.
make it dynamic : )
announce your control characters in the file header
for example
delimiter: ~
escape: \
wrapline: $
width: 19
hello world~this i$
s \\just\\ a sampl$
e text~$someVar$~h$
ere is some \~\~ma$
rkdown strikethrou$
gh\~\~ text
would give the strings
hello world
this is \just\ a sample text
$someVar$
here is some ~~markdown strikethrough~~ text
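A rough Ruby reading of that format, reconstructed only from the example above, so treat the rules as assumptions: header lines are `key: value`, a line that is exactly `width` characters long and ends in the wrap character continues on the next line, and the escape character makes the following character literal. `parse_container` is a made-up name:

```ruby
def parse_container(source)
  lines = source.lines.map(&:chomp)

  # Read the leading "key: value" header lines.
  header = {}
  while (m = lines.first&.match(/\A(delimiter|escape|wrapline|width): (.*)\z/))
    header[m[1]] = m[2]
    lines.shift
  end
  delim = header.fetch("delimiter")
  esc   = header.fetch("escape")
  wrap  = header.fetch("wrapline")
  width = header.fetch("width").to_i

  # Undo the wrapping: a full-width line ending in the wrap marker
  # continues directly into the next line.
  body = lines.map do |line|
    wrapped = line.length == width && line.end_with?(wrap)
    wrapped ? line.delete_suffix(wrap) : line + "\n"
  end.join.chomp

  # Split on unescaped delimiters, dropping the escape characters.
  fields = [+""]
  escaped = false
  body.each_char do |ch|
    if escaped
      fields.last << ch
      escaped = false
    elsif ch == esc
      escaped = true
    elsif ch == delim
      fields << +""
    else
      fields.last << ch
    end
  end
  fields
end

example = <<~'EOS'
  delimiter: ~
  escape: \
  wrapline: $
  width: 19
  hello world~this i$
  s \\just\\ a sampl$
  e text~$someVar$~h$
  ere is some \~\~ma$
  rkdown strikethrou$
  gh\~\~ text
EOS
puts parse_container(example)
```

The `$` characters in `$someVar$` survive because only a `$` in the final column of a full-width line counts as a wrap marker.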
i have implemented something similar:
a plaintar text container format,
to escape and wrap utf16 text in ascii,
as an alternative to mime multipart messages.
see https://github.com/milahu/live-diff-html-editor
