Extract substrings/values from a long text - ruby

I have a long string/text, e.g.
...blahblahblahblah,"shortcode":"Bk5z5Lgn1234",blahblahblablha...,"shortcode":"Wuipsz5Lgn1234",blahblahblablh...
I'm looking to extract all substrings of the following pattern:
"shortcode":"Bk5z5Lgn1234"
"shortcode":"Wuipsz5Lgn1234"
The values of the shortcodes, i.e. Bk5z5Lgn1234 and Wuipsz5Lgn1234, are of constant length (11 characters). Just getting the values will be fine. If getting all the occurrences of shortcode values is complicated, just getting the first value will be sufficient.
I know how to find the substrings (using the scan method), but I have no idea how to traverse the string and pull out the shortcode values.

If the code is always in the exact format that you specified, and 11 characters long, this regular expression will find them:
"shortcode":"(.{11})"
The following will return all the matches:
text.scan(/"shortcode":"(.{11})"/)
This is admittedly unlikely to be the most efficient solution, but it is simple and easy to use. Parsing HTML with regular expressions is never the best idea.
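A minimal sketch of the scan approach, run against the sample values from the question. Note that those sample values are actually 12 and 14 characters long rather than 11, so the fixed-length pattern finds nothing in them. The `[^"]+` variant is an assumption added here to accept any length, and the variable names are only illustrative.

```ruby
# Sample text built from the values quoted in the question.
text = 'blah,"shortcode":"Bk5z5Lgn1234",blah,"shortcode":"Wuipsz5Lgn1234",blah'

# scan with one capture group returns an array of one-element arrays,
# so flatten gives a plain list of values.
fixed_length = text.scan(/"shortcode":"(.{11})"/).flatten
# => [] (the sample values are not 11 characters long)

# Assumed alternative: capture everything up to the closing quote.
any_length = text.scan(/"shortcode":"([^"]+)"/).flatten
# => ["Bk5z5Lgn1234", "Wuipsz5Lgn1234"]

# If only the first value is needed, String#[] with a capture index is enough.
first_code = text[/"shortcode":"([^"]+)"/, 1]
# => "Bk5z5Lgn1234"
```

If the values really are always 11 characters, the original pattern works unchanged; the any-length version simply avoids depending on that.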

Related

How to get first character that is causing reg expression not to match

We have one quite complex regular expression which checks for string structure.
I wonder if there is an easy way to find out which character in the string is causing the regular expression not to match.
For example,
string.match(reg_exp).get_position_which_fails
Basically, the idea is to get the "position" of the state machine when it gave up.
Here is an example of regular expression:
%q^[^\p{Cc}\p{Z}]([^\p{Cc}\p{Zl}\p{Zp}]{0,253}[^\p{Cc}\p{Z}])?$
The short answer is: No.
The long answer is that a regular expression is a complicated finite state machine that may be in a state trying to match several different possible paths simultaneously. There's no way of getting a partial match out of a regular expression without constructing a regular expression that allows partial matches.
If you want to allow partial matches, either re-engineer your expression to support them, or write a parser that steps through the string using a more manual method.
You could try generating one of these automatically with Ragel if you have a particularly difficult expression to solve.
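As a rough sketch of the "more manual method", the example expression can be unrolled into per-character checks that report the first offending index. The method and constant names below are hypothetical. The sketch assumes the expression means: 1 to 255 characters, no control characters anywhere, no separators of any kind at either end, and no line or paragraph separators in the middle.

```ruby
MAX_LENGTH = 255
EDGE_FORBIDDEN  = /[\p{Cc}\p{Z}]/          # first and last character
INNER_FORBIDDEN = /[\p{Cc}\p{Zl}\p{Zp}]/   # characters in between

# Returns the index of the first character that breaks the rules,
# or nil when the whole string is acceptable.
def first_failing_index(str)
  return 0 if str.empty? # at least one character is required
  last = str.length - 1
  str.each_char.with_index do |ch, i|
    return i if i >= MAX_LENGTH # string is too long
    forbidden = (i.zero? || i == last) ? EDGE_FORBIDDEN : INNER_FORBIDDEN
    return i if ch.match?(forbidden)
  end
  nil
end

first_failing_index("a b")    # => nil (a space is allowed in the middle)
first_failing_index("a\tb")   # => 1   (tab is a control character)
first_failing_index("a ")     # => 1   (separator at the end)
```

This is not what the regex engine does internally; it is a hand-written check that happens to accept the same strings under the assumptions above.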

Julia: Strange characters in my string

I scraped some text from the internet, which I put in a UTF8String. I can use this string normally, but when I select some specific characters (strange characters with accents, like ú in my case), which are not part of the UTF8 standard, I get an error saying that I used invalid indexes. This only happens when the string contains strange characters; my code works with normal strings that do not contain strange characters.
Any way to solve this?
EDIT:
I have a variable word of type SubString{UTF8String}
When I do method(word), no problems occur. When I do method(word[2:end]) (assuming a length of at least 2), I get an error if the second character is strange (not in UTF8).
Julia indexes strings by byte position rather than by character position. That is much more efficient for a variable-length encoding like UTF-8, but it means some operations need a bit more boilerplate.
The problem is that some code points are encoded as multiple bytes, so when you slice the string from 2:end you can end up starting in the middle of the first character, which is invalid, and you get an error.
The solution is to get the second valid index instead of 2 in the slice. I think that is something like str[nextind(str, 1):end]
PS. Sorry for a less than clear answer on my phone.
EDIT:
I tried this, and it seems like SubString{UTF8String} and UTF8String have different behaviour on slicing. I've reported it as bug #7811 on GitHub.
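The byte-versus-character distinction is not specific to Julia. As a rough illustration in Ruby (this is not the Julia API), slicing a UTF-8 string at a byte offset that falls inside a multi-byte character yields invalid data, while slicing at the next character boundary is fine. The sample string is made up for the illustration.

```ruby
word = "\u00FAab"   # "úab"; ú occupies two bytes in UTF-8

char_count = word.length     # => 3
byte_count = word.bytesize   # => 4

# Cutting at byte offset 1 lands in the middle of ú.
broken = word.byteslice(1..-1)
broken_is_valid = broken.valid_encoding?   # => false

# The next valid boundary is the byte size of the first character,
# which plays the role of nextind(str, 1) in the answer above.
boundary = word[0].bytesize          # => 2
rest = word.byteslice(boundary..-1)  # => "ab"
```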

Jmeter : Removing Spaces using RegEx

Jmeter :
I have a JSON document from which I have to fetch the value of "ci".
I am using the following RegEx: ci:\s*(.*?)\" and getting the following result in the RegEx tester:
Match count: 1
Match1[0]=ci: 434547"
Match1=434547
The issue is that Match1[0] contains spaces, because of which the load test fails with:
: Server Error - Could not convert JSON to Object
I need help correcting this RegEx.
Basically, your RegEx is fine. This is the way I would look for it too: the first group (Match[1]) gives you 434547, which is the value you are looking for. As I don't know the piece of software you are using, I have no idea why using just that match doesn't work.
Here is an idea to work around that: if the value will always be the only numeric value in the string, you could simplify the RegEx to:
\d+
This will give you a numeric value that is at least 1 digit long. If there are other numeric values in the string though, but these have different lengths, try this:
\d{m,n} --> between m and n digits long
\d{n,} --> at least n digits long
\d{0,n} --> not more than n digits long
This is not as secure / reliable as the original RegEx (since it assumes some certain conditions), but it might work in your case, because you don't have to look for groups but just use the whole matched text. Tell me if it helped!
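To make the difference between the whole match and the first group concrete, here is a small sketch in Ruby rather than JMeter. The sample string is hypothetical, since the actual JSON is not shown in the question. In JMeter's Regular Expression Extractor, the Template field (for example $1$) is, as far as I know, what selects the group rather than the whole match.

```ruby
# Hypothetical fragment containing the text shown in the tester output.
sample = '{"data":"ci: 434547"}'

match = sample.match(/ci:\s*(.*?)"/)
whole_match = match[0]   # => 'ci: 434547"' (includes the prefix, space and quote)
first_group = match[1]   # => "434547"      (only the captured value)

# The simplified pattern from the answer: no group needed, but it assumes
# this is the first run of digits in the input.
digits_only = sample[/\d+/]   # => "434547"
```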

How do I use a regular expression to match a number as an integer?

When I match a number using a regular expression I get it as a string:
?> 'TestingSubject2981'.match /\d+$/
=> #<MatchData "2981">
Is it somehow possible to get the number as an integer without some to_is?
The issue is that regular expressions only work on strings, not on other data types.
A regex has patterns to match numbers, but those still only find the characters that represent the number, not the binary values that we'd use for math. Once the engine returns the matches, they're still characters, so we have to use to_i to convert them to their binary representations.
MMM-kay?
Regular expressions are not supposed to convert strings to integers (or any other class for that matter). The only way I can see is using the String#to_i method. And I can't see why you would avoid it.
You can also use scan to get the number from the string:
'TestingSubject2981'.scan(/\d+/)[0].to_i
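A short sketch of the usual conversions, using the string from the question. The variable names are illustrative. Either way the match comes back as a String, and the conversion is a separate step.

```ruby
subject = 'TestingSubject2981'

# String#[] with a regex returns the matched text (or nil), then to_i converts it.
number = subject[/\d+$/].to_i          # => 2981

# Integer() is stricter than to_i: it raises on input that is not a number.
strict = Integer(subject[/\d+$/], 10)  # => 2981

# When nothing matches, the lookup returns nil, and nil.to_i is 0.
missing = 'NoDigits'[/\d+$/]           # => nil
fallback = missing.to_i                # => 0
```

The silent 0 in the last line is worth keeping in mind if "no number" and "zero" need to be told apart.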

O(n) string handling in Scheme

Background: I've been writing a little interpreter in Scheme (R5RS).
The reader/lexer takes a (sometimes long) string from input and tokenises it. It does this by matching the first few characters of the string against some token and returning the token and the remaining unmatched part of the string.
Problem: to return the remaining portion of the string, a new string is created every time a token is read. This means the reader is O(n^2) in the number of tokens present in the string.
Possible solution: convert the string to a list, which can be done in time O(n), then pull tokens from the list instead of the string, returning the remainder of the list instead of the remainder of the string. But this seems terribly inefficient and artificial.
Question: am I imagining it, or is there just no other way to do this efficiently in Scheme due to its purely functional outlook?
Edit: in R5RS Scheme, there isn't a way to return a pointer into a string. The "substring" function is the only function which extracts an object which is itself a string. But the Scheme standard insists this be a newly allocated string. Why? Because strings are not immutable in Scheme R5RS, e.g. see the "string-set!" function!!
One solution suggested below which works is to store an index into the string. Then one can read off the characters one at a time from that index until a token is read. Too bad the regexp library I'm using for the tokenisation requires an actual string not an index into one...
Consider making a shared-substring implementation of strings (this is how older versions of Java implemented substring, for example). So when you want to grab a substring of a given string, rather than copying the characters, simply keep a pointer to (some location in) those characters, and a length.
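The "pointer plus position" idea can be sketched as follows. This is in Ruby rather than Scheme, purely as an illustration: the tokenizer passes around the original string and an index instead of building a new remainder string after each token. The token names and patterns are made up, and the sketch relies on a regexp facility that can start matching at a given offset (here String#match with a position and the \G anchor), which the library mentioned in the question apparently lacks.

```ruby
# Hypothetical token patterns; \G anchors each match at the start offset.
TOKEN_PATTERNS = [
  [:lparen, /\G\(/],
  [:rparen, /\G\)/],
  [:number, /\G\d+/],
  [:symbol, /\G[^\s()]+/]
].freeze

# Returns [token, new_position]; token is nil at end of input.
# The source string is never copied or sliced into a "rest" string.
def next_token(source, pos)
  pos = source.match(/\G\s*/, pos).end(0) # skip whitespace
  return [nil, pos] if pos >= source.length
  TOKEN_PATTERNS.each do |type, pattern|
    m = source.match(pattern, pos)
    return [[type, m[0]], m.end(0)] if m
  end
  # Defensive guard; the :symbol pattern accepts any remaining character.
  raise ArgumentError, "unexpected character at index #{pos}"
end

def tokenize(source)
  tokens = []
  pos = 0
  loop do
    token, pos = next_token(source, pos)
    break if token.nil?
    tokens << token
  end
  tokens
end

tokenize("(define x 42)")
# => [[:lparen, "("], [:symbol, "define"], [:symbol, "x"], [:number, "42"], [:rparen, ")"]]
```

Only the matched token text is allocated per step, so the work per token no longer depends on how much input remains.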
