Chaining LINQ Take - linq

I recently wrote the following code to split a css file into chunks:
Dim seg = css.Take(css.Length / segmentCount).TakeWhile(Function(x) x <> "}"c).Take(1)
The idea is that I take a chunk of the CSS, then continue taking until I hit a closing brace, and then take the brace as well.
Obviously this didn't work, and I realised why almost immediately (the Take position isn't maintained between the Take calls).
My question is: is there a way to write this idea as a LINQ query efficiently, considering that the string might be 300,000 characters or so long?
(I ended up using a combination of Substring and IndexOf for this, but a one-liner in LINQ would be interesting.)

Related

Slow Ruby Regex Becomes Fast with Odd Change

I've been debugging a site to find the source of long page loading times, and I've narrowed it down to a regex that's used to extract URLs from text:
/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
This takes about 3 seconds to run on a large block of text. I found out that if I add the inverse of the first clause to the start of the regex ((?:[^\w+.-]|^)), it runs almost instantly:
/(?:[^\w+.-]|^)(?:([\w+.-]++):\/\/|(?:www\.))[^\s<]+/gx
It seems to me like the added clause shouldn't affect the regex at all, since nothing could cause that clause to fail (as those characters would be matched by the "[\w+.-]++" clause). Why does this make the regex run so much faster?
Edit
Some people have asked for an example of what I'm trying to do. To simplify things and to address the concerns people had in the comments, I'll be using the following two regexes:
# slow one
/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
# fast one
/[^\w+.-](?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/g
Fire up IRB/Pry and throw some text in a variable (this is a scrubbed version of what is actually searched against):
text = <<END_OF_TEXT
Unable to deliver message to email#example.com. Error message: request: <soap:Envelope xmlns:soap=";http://schemas.xmlsoap.org/soap/envelope/" xmlns:t=";http://schemas.microsoft.com/exchange/services/year/types" xmlns:m=";http://schemas.microsoft.com/exchange/services/year/messages"><soap:Header><t:RequestServerVersion Version="ExchangeYear"/></soap:Header><soap:Body><m:CreateItem MessageDisposition="SendAndSaveCopy"><m:SavedItemFolderId><t:DistinguishedFolderId Id="stuff"/></m:SavedItemFolderId><m:Items><t:Message><t:MimeContent>
END_OF_TEXT
Use the slow regex on it and note how slow it is:
text.gsub(/(?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/).to_a
Use the fast regex and note how fast it is:
text.gsub(/[^\w+.-](?:([\w+.-]+):\/\/|(?:www\.))[^\s<]+/).to_a
I figured out that this problem is specific to the type of data I used in the example (not a lot of spaces). If you run it against RFC 3986, which is much longer, both versions are equally fast.
The first pattern is slow because it starts with an alternation whose first branch is very permissive: it allows any number of word characters, plus signs, dots, or hyphens. As a consequence, this alternation takes a lot of time/steps before failing.
The second pattern is faster because (?:[^\w+.-]|^) (which is an alternation too) works like a kind of anchor. Even though it is an alternation, it is tested quickly because the first branch matches only one character and the second is a zero-width assertion, so it takes less time/fewer steps to fail (in particular because it must be followed by a word character, a dot, or a hyphen, which is a binding condition).
But you can write this pattern in a better way. Since you are looking for URLs, you can be more precise about the beginning: the URL can begin with, let's say, "http", "ftp", "sftp", "gopher", or "www" (feel free to add other schemes if needed).
So you can describe the start with:
(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
To limit the cost of the alternation (5 branches to test at each position in the string), you can use two tricks:
You can use a word boundary to quickly skip positions that are not the start or the end of a word:
\b(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
You can add a lookahead with the first letter of each branch, to quickly skip unneeded positions in the string without testing all five branches:
\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)
So you can write a more efficient pattern like this:
/\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<]+/
In short: a pattern is efficient when it fails fast at bad positions in the string.
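As a rough illustration of the pattern above, here is a minimal Ruby sketch. The sample string and the names URL_PATTERN, sample, and urls are made up for the example; only the regex itself comes from this answer.

```ruby
# The "fail fast" pattern from the answer: word boundary, then a cheap
# one-character lookahead, then the five-branch alternation.
URL_PATTERN = /\b(?=[fghsw])(?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<]+/

# Hypothetical sample text, not taken from the question.
sample = 'See http://example.com/a and www.example.org<br> or ftp://files.example.net/x.'

# With no capture groups, String#scan returns the full matches.
urls = sample.scan(URL_PATTERN)
# => ["http://example.com/a", "www.example.org", "ftp://files.example.net/x."]
```

Note that [^\s<]+ also swallows trailing punctuation (the final dot here), so real input may need some extra trimming.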
Another possible design, which uses more memory and requires checking whether the capture group exists for each match, but which is faster:
/[^ghsfw]*+(?:\B[ghsfw][^ghsfw]*)*+|\b((?:https?:\/\/|ftp:\/\/|sftp:\/\/|gopher:\/\/|www\.)[^\s<"&]+)/
(The idea is to divide the pattern into two main branches: the first describes everything you want to avoid, and the second describes the URLs. The effect is quick jumps to key positions in the string.)
Note: when patterns start to get long, you can use free-spacing mode (also called comment mode) for readability and maintainability:
/(?x)
\b (?=[fghsw])
(?:
https?:\/\/ |
ftp:\/\/ |
sftp:\/\/ |
gopher:\/\/ |
www\.
)
[^\s<]+/
Or you can use a formatted string and a join, as suggested by Cary Swoveland in the comments.
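One possible reading of the "join" idea, sketched in Ruby. The comment itself isn't quoted here, so the names SCHEMES, prefixes, and pattern are assumptions for illustration rather than what was actually suggested.

```ruby
# Scheme fragments, written as regex source (so "https?" keeps its meaning).
SCHEMES = %w[https? ftp sftp gopher].freeze

# Build each branch, then add the www. branch; single quotes keep the backslash.
prefixes = SCHEMES.map { |scheme| "#{scheme}://" } + ['www\.']

# Interpolated strings are treated as regex source, so "/" needs no escaping here.
pattern = /\b(?=[fghsw])(?:#{prefixes.join('|')})[^\s<]+/

matches = 'Visit https://example.com or gopher://old.example.net today'.scan(pattern)
# => ["https://example.com", "gopher://old.example.net"]
```

If you add a scheme whose first letter is not in [fghsw], the lookahead has to be updated as well.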

Line count in csv doesn't match

I have a large CSV with a large number of columns. I am trying to count the number of lines using
File.open(file).readlines.to_a.compact.count.to_i
It displays 57 although there are only 56 rows. Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
Upon close examination I found that a part of one line is wrapped to form the next line. How to get the correct count?
You need to show an example of the incoming data if you want us to help beyond generic answers.
To fix the problem, you have to be able to identify the line. We can't help you there because it could look like anything. Making a wild guess, I'd say that one of the columns had an embedded new-line in it, which forces the line to wrap.
If the file is a true CSV file, that column should be wrapped in double-quotes, so you could search the file for lines that do NOT end with whatever data type should be in the last column, then read the next line, join them, then rewrite the file. But, again, we have nothing to work with, because your file's format could be a huge number of different things.
Your best bet is to use the CSV class that comes with Ruby, and let it read the file, instead of trying to treat it like a text file. CSV files are text, but they are formatted to maintain the columns and rows, so using the CSV class will give you a better chance of getting at the data.
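As a small sketch of the difference, assuming the wrapped line really is a quoted field with an embedded newline (that is a guess, and the data below is invented):

```ruby
require 'csv'

# Hypothetical data: the second record has a quoted field containing a newline.
data = "id,note\n1,\"first line\nsecond line\"\n2,plain\n"

# Counting physical lines treats the embedded newline as a row break.
physical_lines = data.lines.count # => 4

# The CSV parser keeps the quoted field together, so this counts records.
records = CSV.parse(data).count # => 3 (header row included)
```

For a file on disk, CSV.foreach(path).count does the same without loading the whole file into memory at once.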
Looking at your code:
There are a number of ways to count the number of lines in a file, including the easiest which is:
`wc -l /path/to/file`.to_i
if you're using *nix.
Using File.open(file).readlines.to_a is horribly redundant and not fast or scalable if your file is big.
readlines returns an array.
to_a returns an array.
Why turn the array into an array?
readlines loads an entire file into memory, then splits it on line ends into an array. That process can be a lot slower than simply reading the file line-by-line and incrementing a counter, plus "slurping" can make your program crawl if the file is larger than available memory.
See "Why is "slurping" a file not a good practice?" for more information.
compact removes nils from an array. readlines should never return any nils so compact will iterate over the array looking for something that shouldn't exist.
count returns an integer.
to_i converts the receiver to an integer.
In other words, to_i is turning an integer into an integer. Why?
If you want to do it in Ruby instead of using wc -l, do something simple and fast:
lines_in_file = 0
File.foreach(some_file) { lines_in_file += 1 }
After running that, lines_in_file will contain the number of lines read. Memory won't be impacted and it'll run like blue blazes on huge files.

scripting logic : matching patterns

I am trying to figure out regex/scripting logic to parse out something like this:
RAW DATA
{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}
Here, the values are:
MedGen = C0432243
OMIM = 271640
SNOMED_CT = 254100000
Result: 271640
I am envisaging a convoluted if-else loop to get the result. Just wanted to know if there is any simple way of getting the same result. I'd much appreciate your answers.
Perhaps something like this (assuming there are always three fields):
(?<=[=:])(?<key>[^:;]+)(?=[:=;](?:[^:;=]+[=;:]){3}(?<val>[^:]+))
The idea is to capture the field values inside a lookahead assertion, so that the overlapping substrings don't interfere with each other.
However, there is probably a cleaner way that uses successive split.
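A minimal Ruby sketch of the successive-split idea, using the string from the question. The variable names are made up, and it assumes the two halves are always separated by ";" with the same number of ":"-separated fields in each.

```ruby
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'

# Drop the braces, then split into the "names" half and the "ids" half.
names_part, ids_part = str.delete('{}').split(';')

# Each half looks like KEY=a:b:c, so take what follows "=" and split on ":".
names = names_part.split('=', 2).last.split(':')
ids   = ids_part.split('=', 2).last.split(':')

# Pair names with ids by position.
lookup = names.zip(ids).to_h
lookup['OMIM'] # => "271640"
```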
It's difficult to tell from the question whether the input string is two lines or one:
str = 'RAW DATA
{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}
'
or
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'
but, in either case I'd use a simple pattern:
str = '{CLNDSDB=MedGen:OMIM:SNOMED_CT;CLNDSDBID=C0432243:271640:254100000}'
medgen, omim, snomed_ct = str.match(/(\w+):(\w+):(\w+)}/).captures
medgen # => "C0432243"
omim # => "271640"
snomed_ct # => "254100000"
Here's the pattern at Rubular.
I am envisaging a convoluted if-else loop to get the result.
Well, don't do that. Most programming solutions are surprisingly simple, so start simple. As you learn, your programming toolbox will grow as you become familiar with new ways of doing things, and you'll find certain tools are more useful for certain tasks. Still, always start from "simple", get the basics working, then carefully add to handle the corner cases.
In this case, when using a regular expression, it's important to look for landmarks in the string that you can use to locate your target text. In this case the trailing '}' is usable, so I wrote three simple captures to find \w strings separated by :.

Optimal Regular Expression: match sets of lines starting with

Alright, this one's interesting. I have a solution, but I don't like it.
The goal is to be able to find a set of lines that start with 3 periods - not an individual line, mind you, but a collection of all the lines in a row that match. For example, here are some matches (each match is separated by a blank line):
...
...hello
...
...hello
...world
...
...wazzup?
...
My solution is as follows:
^\.\.\..*(\n\.\.\..*)*$
It matches all those, so it's what I'm using for now - however, it looks kinda silly to repeat the \.\.\..* pattern. Is there a simpler way?
Please test your regex before submitting it, rather than submit what "should work." For example, I tried the following first:
(^\.\.\..*$)+
which only returned individual lines, even though in my mind it looks like it would do the trick - I guess I just don't understand regex internals. (And no, I didn't need to set any flags to get ^ and $ to match line boundaries, since I'm implementing this in Ruby.)
So I'm not totally sure there's a good answer, but one would be much appreciated - thanks in advance!
In most regex implementations you can shorten \.\.\. using \.{3} so your solution would turn into \.{3}.*(\n\.{3}.*)*.
What you already have is already simple and understandable. Keep in mind that more "clever" RegExps may very well be slower and undoubtedly less readable.
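For what it's worth, here is a quick Ruby check of the shortened pattern. The sample text is invented, and the group is made non-capturing because String#scan returns only the captures when a pattern contains capture groups.

```ruby
# Hypothetical input: three runs of "..." lines separated by other lines.
text = <<~TEXT
  ...
  ...hello

  plain line
  ...
  ...hello
  ...world

  ...wazzup?
TEXT

# In Ruby, ^ matches at the start of every line and . does not match a newline.
runs = text.scan(/^\.{3}.*(?:\n\.{3}.*)*/)
# => ["...\n...hello", "...\n...hello\n...world", "...wazzup?"]
```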
Assuming lines are terminated by a \n:
((^|\n)\.{3}[^\n]*)+
I am not familiar with Ruby, so depending on how it returns matches you might need to make the inner groups non-capturing:
((?:(?:^|\n)\.{3}[^\n]*)+)
^([.]{3}.*$\n?)+
This doesn't really need $ in there.
You are pretty close to a solution with (^\.\.\..*$)+, but because the + modifier is on the outside of the group, it is getting overwritten each time and you are only left with the last line. Try wrapping it in an outer group: ((^\.\.\..*$)+) and looking at the first submatch and ignoring the inner one.
Combined with the other suggestion: ((^\.{3}.*$)+)

Putting spaces back into a string of text with unreliable space information

I need to parse some text from pdfs but the pdf formatting results in extremely unreliable spacing. The result is that I have to ignore the spaces and have a continuous stream of non-space characters.
Any suggestions on how to parse the string and put spaces back into the string by guessing?
I'm using ruby. Or should I say I'musingruby?
Edit: I've pulled the text out using pdf-reader. Some of the pdf files are nicely formatted and some are not. An example of text mixed with positioning:
.7aspe-5.5cts-715.1o0.6f-708.5f-0.4aces-721.4that-716.3are-720.0i-1.8mportant-716.3in-713.9soc-5.5i-1.8alcommunica6.6tion6.3.-711.6Althoug6.3h-708.1m-1.9od6.3els-709.3o6.4f-702.8f5.4ace-707.9proc6.6essing-708.2haveproposed-611.2ways-615.5to-614.7deal-613.2with-613.0these-613.9diff10.4erent-613.7tasks,-611.9it-617.1remainsunclear-448.0how-450.7these-443.2mechanisms-451.7might-446.7be-447.7implemented-447.2in-450.3visualOne-418.9model-418.8of-417.3human-416.4face-421.9processing-417.5proposes-422.7that-419.8informa-tion-584.5is-578.0processed-586.1in-583.1specialised-584.7modules-577.0(Breen-584.4et-582.9al.,-582.32002;Bruce-382.1and-384.0Y92.0oung,-380.21986;-379.2Haxby-379.9et-380.5al.,-
and if I print just the string data (I added returns at the end of each line to keep it from messing up the layout here):
'Distinctrepresentationsforfacialidentityandchangeableaspectsoffacesinthehumantemporal
lobeTimothyJ.Andrews*andMichaelP.EwbankDepartmentofPsychology,WolfsonResearchInstitute,
UniversityofDurham,UKReceived23December2003;revised26March2004;accepted27July2004Availab
leonline14October2004Theneuralsystemunderlyingfaceperceptionmustrepresenttheunchanging
featuresofafacethatspecifyidentity,aswellasthechangeableaspectsofafacethatfacilitates
ocialcommunication.However,thewayinformationaboutfacesisrepresentedinthebrainremainsc
ontroversial.Inthisstudy,weusedfMRadaptation(thereductioninfMRIactivitythatfollowsthe
repeatedpresentationofidenticalimages)toaskhowdifferentface-andobject-selectiveregionsofvisualcortexcontributetospecificaspectsoffaceperception'
The data is spit out by callbacks so if I print each string as it is returned it looks like this:
'The
-571.3
neural
-573.7
system
-577.4
underly
13.9
ing
-577.2
face
-573.0
perc
13.7
eption
-574.9
must
-572.1
repr
20.8
esent
-577.0
the
unchangin
14.4
g
-538.5
featur
16.5
es
-529.5
of
-536.6
a
-531.4
face
'
On examination it looks like the true spaces are large negative numbers (< -300) and the false spaces are much smaller positive numbers. Thanks guys. Just getting to the point where I was asking the question clearly helped me answer it!
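A rough Ruby sketch of that observation, assuming the callback output has been collected into an array of strings and numbers. That shape, and the names SPACE_THRESHOLD and rebuild_text, are assumptions for illustration and not part of pdf-reader.

```ruby
# Offsets below this are treated as real word gaps (value taken from the
# observation above; other PDFs may need a different threshold).
SPACE_THRESHOLD = -300

# chunks: array mixing text fragments (String) and positioning offsets (Numeric).
def rebuild_text(chunks, threshold: SPACE_THRESHOLD)
  chunks.each_with_object(+'') do |chunk, out|
    if chunk.is_a?(Numeric)
      out << ' ' if chunk < threshold # large negative offset: a true space
    else
      out << chunk                    # text fragment: append as-is
    end
  end
end

chunks = ['The', -571.3, 'neural', -573.7, 'system', -577.4,
          'underly', 13.9, 'ing', -577.2, 'face']
rebuild_text(chunks) # => "The neural system underlying face"
```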
Hmmmm... I'd have to say that guessing is never a good idea. Looking at the root cause of the problem and solving that is the answer; anything else is a kludge.
If the spacing is unreliable from the PDF, how is it unreliable? The PDF viewer needs to be able to reliably space the text so the data is there somewhere, you just need to find it.
EDIT following comment:
The idea of parsing the file using a dictionary (your only other option really, apart from randomly inserting spaces and hoping for the best) and inserting spaces at identified word boundaries (a real problem when dealing with punctuation, plurals that don't alter the base word i.e. plural, etc) would, I believe, be a much greater programming challenge than correctly parsing the PDF in the first place. After all, PDF is clearly defined whereas English is somewhat wooly.
Why not go down the route of an existing solution such as ps2ascii on Linux: call it from your Ruby code and pick up the result?
PDF doesn't only store spaces as space characters, but also uses layout commands for spacing (so it doesn't print a space, but moves the "pen" to the right). Perhaps you should have a look at the PDF reference (the big PDF on the bottom of the site), Chapter 9 "Text" should be what you're looking for.
EDIT: After reading your comment to Lazarus' answer, this doesn't seem to be what you're looking for. I think you should try to get a word list from somewhere and try to split your text using it. A good strategy would be to do that using recursion, because for example:
"meandyou"
The first word could be "me" or "mean", but if you try "mean", "dyou" doesn't make sense, so it will be "me", same for the next word that could be "a" or "an" or "and", only "and" makes sense.
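A minimal Ruby sketch of that recursive idea, using a tiny made-up word list (WORDS) and a hypothetical helper called segment. A real word list would be far larger, and without memoization this can get slow on long inputs.

```ruby
require 'set'

# Toy dictionary for the example only.
WORDS = Set.new(%w[me mean a an and you]).freeze

# Returns an array of words covering the whole text, or nil if no split works.
def segment(text, words = WORDS)
  return [] if text.empty?

  (1..text.length).each do |len|
    head = text[0, len]
    next unless words.include?(head)

    rest = segment(text[len..-1], words) # try to split the remainder
    return [head] + rest if rest         # backtrack when the remainder fails
  end
  nil
end

segment('meandyou') # => ["me", "and", "you"]
```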
If it were me I'd go back to the source PDFs and try a different method of extracting the text, such as iText (for Java) or maybe some kind of PDF-to-HTML to text conversion software method.
