I need to evaluate the output to see if it starts with a specific sequence.
For example if Cat1 = (A)
I want to verify that the entry begins with the value of Cat1 and can contain any text after it. If so then to output that entry.
I don't exactly know how to use wildcards in conjunction with the variable to allow entries such as
(A) First assignment
(A) Second assignment
to be selected and then to be transferred.
The portion that is in question is the following in my code:
if(assign.title == ){
SpreadsheetApp.openByUrl(url).getSheetByName(shet).appendRow([assign.title, marks.assignedGrade,
assign.maxPoints]);}
}
Your issue can be solved by using Regular Expressions which essentially are special text strings used to describe a search pattern.
Therefore, if you want to search for the entries which begin with (A) and appendRow() like you mentioned above, you should use the following code snippet:
function theFunction() {
var ss = SpreadsheetApp.openByUrl("YOUR_URL").getSheetByName("YOUR_SHEET_NAME");
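var regEx = /^\(A\).*/;
//Getting the assign & marks variables
if (assign.title.match(regEx))
ss.appendRow([assign.title, marks.assignedGrade, assign.maxPoints]);
}
The regular expression here is var regEx = /^\(A\).*/; which checks whether a string starts with the literal text (A). The ^ anchors the match to the start of the string, and the parentheses are escaped with backslashes because unescaped parentheses would form a capture group instead of matching the characters ( and ).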
Furthermore, I suggest you take a look at these links since they might be of help:
Syntax for Regular Expressions;
Regular Expressions Tester.
I need to check if any elements of a large (60,000+ elements) array are present in a long string of text. My current code looks like this:
if $TARGET_PARTLIST.any? { |target_pn| pdf_content_string.include? target_pn }
self.last_match_code = target_pn
self.is_a_match = true
end
I get a syntax error undefined local variable or method target_pn.
Could someone let me know the correct syntax to use for this block of code? Also, if anyone knows of a quicker way to do this, I'm all ears!
In this case, all your syntax is correct, you've just got a logic error. While target_pn is defined (as a parameter) inside the block passed to any?, it is not defined in the block of the if statement because the scope of the any?-block ends with the closing curly brace, and target_pn is not available outside its scope. A correct (and more idiomatic) version of your code would look like this:
self.is_a_match = $TARGET_PARTLIST.any? do |target_pn|
included = pdf_content_string.include? target_pn
self.last_match_code = target_pn if included
included
end
Alternately, as jvillian so kindly suggests, one could turn the string into an array of words, then do an intersection and see if the resulting set is nonempty. Like this:
self.is_a_match = !($TARGET_PARTLIST &
pdf_content_string.gsub(/[^A-Za-z ]/,"")
.split).empty?
Unfortunately, this approach loses self.last_match_code. As a note, pointed out by Sergio, if you're dealing with non-English languages, the above regex will have to be changed.
Hope that helps!
You should use Enumerable#find rather than Enumerable#any?.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string.include? target_pn }
if found
self.last_match_code = found
self.is_a_match = true
end
Note this does not ensure that the string contains a word that is an element of $TARGET_PARTLIST. For example, if $TARGET_PARTLIST contains the word "able", that string will be found in the string, "Are you comfortable?". If you only want to match words, you could do the following.
found = $TARGET_PARTLIST.find { |target_pn| pdf_content_string[/\b#{target_pn}\b/] }
Note this uses the method String#[].
\b is a word boundary in the regular expression, meaning that the first (last) character of the match cannot be preceded (followed) by a word character (a letter, digit or underscore).
If speed is important it may be faster to use the following.
found = $TARGET_PARTLIST.find { |target_pn|
pdf_content_string.include?(target_pn) && pdf_content_string[/\b#{target_pn}\b/] }
A probably more performant way would be to move all this into native code by letting Regexp search for it.
# needed only once
TARGET_PARTLIST_RE = Regexp.new("\\b(?:#{$TARGET_PARTLIST.sort.map { |pl| Regexp.escape(pl) }.join('|')})\\b")
# to check
self.last_match_code = pdf_content_string[TARGET_PARTLIST_RE]
self.is_a_match = !self.last_match_code.nil?
A much more performant way would be to build a prefix tree and create the regexp using the prefix tree (this optimises the regexp lookup), but this is a bit more work :)
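The prefix-tree idea could look roughly like the sketch below. The helper names (build_trie, trie_to_source, part_list_regexp) and the sample part numbers are made up for illustration, and no benchmarking is implied; treat it as a starting point rather than a tuned implementation.

```ruby
# Hypothetical helpers: build a nested-hash prefix tree from the part numbers,
# then emit a regexp source string that shares common prefixes.
def build_trie(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true # a complete part number ends at this node
  end
  root
end

def trie_to_source(node)
  branches = node.reject { |key, _| key == :end }
                 .map { |ch, child| Regexp.escape(ch) + trie_to_source(child) }
  return "" if branches.empty?

  group = "(?:#{branches.join('|')})"
  node[:end] ? "#{group}?" : group # a shorter part number makes the tail optional
end

def part_list_regexp(words)
  Regexp.new("\\b#{trie_to_source(build_trie(words))}\\b")
end

# Sample data, assumed for the example only.
parts_re = part_list_regexp(%w[AB1 AB12 AB13 XZ])
p "order AB13 today"[parts_re] # a listed part number
p "order AB14 today"[parts_re] # not in the list
```

Because shared prefixes are emitted once, the regexp engine does not have to retry every alternative from the start of each candidate position, which is where the speed-up over a flat a|b|c alternation is expected to come from.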
I need to get an array of floats (both positive and negative) from the multiline string. E.g.: -45.124, 1124.325 etc
Here's what I do:
text.scan(/(\+|\-)?\d+(\.\d+)?/)
Although it works fine on regex101 (capturing group 0 matches everything I need), it doesn't work in Ruby code.
Any ideas why it's happening and how I can improve that?
See scan documentation:
If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
You should remove capturing groups (if they are redundant), or make them non-capturing (if you just need to group a sequence of patterns to be able to quantify them), or use extra code/group in case a capturing group cannot be avoided.
In this scenario, the capturing groups are only used to quantify a pattern sequence, so all you need to do is make them non-capturing by replacing each unescaped ( with (?:. Here the sign group (\+|\-) can also be simplified to the character class [+-], which leaves only one group to convert:
text = " -45.124, 1124.325"
puts text.scan(/[+-]?\d+(?:\.\d+)?/)
See demo, output:
-45.124
1124.325
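For comparison, here is a small sketch of what the original pattern from the question returns next to the non-capturing version. The conversion with to_f is an addition for illustration, since the question asks for an array of floats.

```ruby
text = " -45.124, 1124.325"

# With capturing groups, each result is an array of the group captures,
# not the whole matched number.
with_groups = text.scan(/(\+|\-)?\d+(\.\d+)?/)
p with_groups # => [["-", ".124"], [nil, ".325"]]

# With a non-capturing group, each result is the whole match,
# which can then be converted to a Float.
floats = text.scan(/[+-]?\d+(?:\.\d+)?/).map(&:to_f)
p floats # => [-45.124, 1124.325]
```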
Well, if you need to also match floats like .04 you can use [+-]?\d*\.?\d+. See another demo
There are cases when you cannot get rid of a capturing group, e.g. when the regex contains a backreference to a capturing group. In that case, you may a) declare a variable to store all matches and collect them inside a scan block, b) enclose the whole pattern in another capturing group and map the results to get the first item from each match, or c) call gsub with just the regex as its single argument, which returns an Enumerator, and use .to_a to get the array of matches:
text = "11234566666678"
# Variant a:
results = []
text.scan(/(\d)\1+/) { results << Regexp.last_match(0) }
p results # => ["11", "666666"]
# Variant b:
p text.scan(/((\d)\2+)/).map(&:first) # => ["11", "666666"]
# Variant c:
p text.gsub(/(\d)\1+/).to_a # => ["11", "666666"]
See this Ruby demo.
([+-]?\d+\.\d+)
assumes there is a leading digit before the decimal point
see demo at Rubular
If you need capture groups for a complex pattern match, but want the entire expression returned by .scan, this can work for you.
Suppose you want to get the image urls in this string perhaps from a markdown text with html image tags:
str = %(
Before
<img src="https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71">
After
<img src="https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png">).strip
You may have a regular expression defined to match just the urls, and maybe used a Rubular example like this to build/test your Regexp
image_regex =
/https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b/
Now, if you don't need each sub-capture group but just the entire expression from your .scan, you can wrap the whole pattern inside a capture group and use it like this:
image_regex =
/(https\:\/\/(user-)?images.(githubusercontent|zenhubusercontent).com.*\b)/
str.scan(image_regex).map(&:first)
=> ["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71",
"https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png"]
How does this actually work?
Since the pattern now has 3 capture groups, .scan alone returns an array of arrays, where each inner array holds one entry per capture group:
str.scan(image_regex)
=> [["https://images.zenhubusercontent.com/11223344e051aa2c30577d9d17/110459e6-915b-47cd-9d2c-1842z4b73d71", nil, "zenhubusercontent"],
["https://user-images.githubusercontent.com/111222333/75255445-f59fb800-57af-11ea-9b7a-a235b84bf150.png", "user-", "githubusercontent"]]
Since we only want the 1st (outer) capture group, we can just call .map(&:first)
I want to match dynamicCast(header.get_0('(0008,0020)'), Q$String_$1):
header.containsKey('(0008,0020)')?(dateString = dynamicCast(header.get_0('(0008,0020)'), Q$String_$1)[0]):header.containsKey('(0008,0022)')?(dateString = dynamicCast(header.get_0('(0008,0022)'), Q$String_$1)[0]):header.containsKey('(0008,0021)')?(dateString = dynamicCast(header.get_0('(0008,0021)'), Q$String_$1)[0]):header.containsKey('(0008,0023)') && (dateString = dynamicCast(header.get_0('(0008,0023)'), Q$String_$1)[0]);
I managed to make it work with this regex
dynamicCast\(header.get.*, Q\$(String_|int_)\$1\)
The problem is, it matches the whole block. What is the proper regex magic spell to get the four matches I want?
I'm currently rewriting auto-generated JavaScript with a regex in Ruby. I'm then replacing each match with
header.get_0('(0008,0020)')
One problem is that I have to match several different flavors: inside the method get_0 there are many different possibilities. I might need to match every single possibility, and then, why use regex?
dynamicCast(header.get_0('(0028,' + element + ')'), Q$String_$1)
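The .* in your pattern is greedy, so it runs on to the last , Q$String_$1) on the line and swallows the whole block. Restrict what may appear inside get_0(...) instead. You can use the following to match: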
dynamicCast\(header\.get_0\('\([^)]+\)'\), Q\$(?:String_|int_)\$1\)
See DEMO
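Since the rewriting is done in Ruby, a minimal sketch of applying that pattern might look like this. The js string is a shortened, made-up sample of the generated code, and the extra capture group around the header.get_0(...) call is an addition so the replacement can reuse it.

```ruby
# Shortened, hypothetical sample of the generated JavaScript from the question.
js = %q{a = dynamicCast(header.get_0('(0008,0020)'), Q$String_$1)[0]; b = dynamicCast(header.get_0('(0028,' + element + ')'), Q$int_$1)[0];}

# Same pattern as above, plus one capture group around the header.get_0(...) call.
pattern = /dynamicCast\((header\.get_0\('\([^)]+\)'\)), Q\$(?:String_|int_)\$1\)/

# Each dynamicCast(...) is found separately instead of as one big block.
calls = js.scan(pattern).flatten
puts calls

# Replace every dynamicCast(...) with just the inner header.get_0(...) call.
rewritten = js.gsub(pattern) { Regexp.last_match(1) }
puts rewritten
```

Both flavors from the question match here because [^)]+ accepts anything up to the first closing parenthesis, including the ' + element + ' concatenation.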
I have a CSV file in the format below. "India,Inc" is a company name: a single value that contains a comma.
How can I get the values with LINQ?
12321,32432,423423,Kevin O'Brien,"India,Inc",234235,23523452,235235
Assuming that you will always have the columns that you specify and that the only variable is that company name can have commas inside, this UGLY code can help you achieve your goal.
var file = File.ReadLines("test.csv");
var value = from p in file
select new string[]
{ p.Split(',')[0],
p.Split(',')[1],
p.Split(',')[2],
p.Split(',')[3],
p.Split(',').Count() == 7 ? p.Split(',')[4] :
(p.Split(',').Count() > 7 ? String.Join(",",p.Split(',').Skip(4).Take(p.Split(',').Count() - 7).ToArray() ) : ""),
p.Split(',')[p.Split(',').Count() - 3],
p.Split(',')[p.Split(',').Count() - 2],
p.Split(',')[p.Split(',').Count() - 1]
};
A regular expression would work, bit nasty due to the recursive nature but it does achieve your goal.
List<string> matches = new List<string>();
string subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,23523452,235235";
// Character classes widened to 0-9 and a space so that values such as "Kevin O'Brien" stay in one match
Regex regexObj = new Regex(@"(?<="")\b[0-9a-z ,']+\b(?="")|[0-9a-z ']+", RegexOptions.IgnoreCase);
Match matchResults = regexObj.Match(subjectString);
while (matchResults.Success)
{
matches.Add(matchResults.Value);
// matched text: matchResults.Value
// match start: matchResults.Index
// match length: matchResults.Length
matchResults = matchResults.NextMatch();
}
This should suffice in most cases. It handles quoted strings, strings with double quotes within them, and embedded commas.
var subjectString = "12321,32432,423423,Kevin O'Brien,\"India,Inc\",234235,\"Test End\"\"\",\"\"\"Test Start\",\"Test\"\"Middle\",23523452,235235";
var result=Regex.Split(subjectString,@",(?=(?:[^""]*""[^""]*"")*[^""]*$)")
.Select(x=>x.StartsWith("\"") && x.EndsWith("\"")?x.Substring(1,x.Length-2):x)
.Select(x=>x.Replace("\"\"","\""));
It does however break, if you have a field with a single double quote inside it, and the string itself is not enclosed in double quotes -- this is invalid in most definitions of a CSV file, where any field that contains CR, LF, Comma, or Double quote must be enclosed in double quotes.
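To make the lookahead easier to follow, here is the same idea sketched in Ruby purely for illustration; the pattern itself is unchanged apart from using \z for end of string. The constant and method names are invented for this example. A comma is treated as a separator only when an even number of double quotes follows it, which means it sits outside any quoted field.

```ruby
# Split on commas that are followed by an even number of double quotes,
# i.e. commas that are outside any quoted field.
CSV_COMMA = /,(?=(?:[^"]*"[^"]*")*[^"]*\z)/

# Hypothetical helper mirroring the Split + Select + Select chain above.
def split_csv_line(line)
  line.split(CSV_COMMA, -1).map do |field|
    quoted = field.length >= 2 && field.start_with?('"') && field.end_with?('"')
    field = field[1..-2] if quoted # drop the enclosing quotes
    field.gsub('""', '"')          # un-double embedded quotes
  end
end

fields = split_csv_line(%q{12321,32432,423423,Kevin O'Brien,"India,Inc",234235,23523452,235235})
p fields
```

The same limitation applies: a stray double quote in an unquoted field makes the quote count odd and throws the split off.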
You should be able to reuse the same Regex expression to break on lines as well for small CSV files. Larger ones you would want a better implementation. Replace the double quotes with LF, and remove the matching ones (unquoted LF's). Then use the regular expression again replacing the quotes with CR, and split on matching.
Another option is to use CsvHelper and not try to reinvent the wheel
var csv = new CsvHelper.CsvReader(new StreamReader("test.csv"));
while (csv.Read())
{
Console.WriteLine(csv.GetField<int>(0));
Console.WriteLine(csv.GetField<string>(1));
Console.WriteLine(csv.GetField<string>(2));
Console.WriteLine(csv.GetField<string>(3));
Console.WriteLine(csv.GetField<string>(4));
}
Guide
I would recommend LINQ to CSV, because it is powerful enough to handle special characters including commas, quotes, and decimals. They have really worked a lot of these issues out for you.
It only takes a few minutes to set up and it is really worth the time because you won't run into these types of issues down the road like you would with custom code. Here are the basic steps, but definitely follow the instructions in the link above.
Install the Nuget package
Create a class to represent a line item (name the fields the way they're named in the csv)
Use CsvContext.Read() to read into an IEnumerable which you can easily manipulate with LINQ
Use CsvContext.Write() to write a List or IEnumerable to a CSV
This is very easy to setup, has very little code, and is much more scalable than doing it yourself.
Because you're only reading values delimited by commas, the spaces shouldn't cause an issue if you just treat them like any other character.
var values = File.ReadLines(path)
.SelectMany(line => line.Split(','));
Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses through a bunch of java based html template files to replace a period (.) with an underscore if it's contained within a specific tag. I use ruby all the time for these types of utility apps, and thought it would be no problem to whip up something using ruby's regex support. So, I create a Regexp.new... object, open a file, read it in line by line, then match each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then create another replacement as a whole string with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] and then the second tag in matchData[1]. What I'm really doing, because I don't know how many matches I'll get within any given line, is matchData.to_a.each. And in this case, matchData has two entries, but they're both the first tag match
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
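If the end goal is only the period-to-underscore replacement, a possible shortcut is to let gsub with a block handle every tag in one pass. This is a sketch, not the approach from the question: it assumes the simplified, case-insensitive \w-based pattern (which also accepts underscores in names), and the sample line is made up.

```ruby
tag_regex = /<webobject name=(?:\w+\.)+\w+>/i

# Made-up line: periods inside the tags should change, "some.text" should not.
line = "<WEBOBJECT NAME=admin.normalMode.more>some.text<WEBOBJECT NAME=admin.SecondLineMatch>end"

# gsub calls the block once per match; the block's return value replaces that match.
fixed = line.gsub(tag_regex) { |tag| tag.tr(".", "_") }
puts fixed
```

Text outside the matched tags is left alone, so there is no need to build a second escaped regexp per match.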
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? It seems like its behavior is a bug really, and certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
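On the MatchData question, a short sketch may help. Regexp#match stops at the first match in the string, and in the resulting MatchData index 0 is the whole matched text while index 1 onwards are that match's capture groups. Because the pattern wraps everything in one group, both entries hold the same tag. The pattern and line below are simplified stand-ins for the ones in the question.

```ruby
tag_regex = /(<webobject name=(?:\w+\.)+\w+>)/i
line = "<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more"

m = tag_regex.match(line)
p m[0]        # whole text of the first match
p m[1]        # capture group 1 of that same match (identical here)
p m.to_a.size # 2: the whole match plus one group, not two separate matches

# One way to visit every match as a MatchData is scan with a block.
all = []
line.scan(tag_regex) { all << Regexp.last_match }
p all.map { |md| md[0] }
```

So the "first result twice" is the whole match followed by its single capture group, and the second tag never appears because match does not continue past the first hit.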
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').