I am trying to write a matching XPath rule but I can't seem to pin point words with the exact letters in the 5th and 6th position.
example 'ab' in 'qwerabqwert'
/location1[Variable='variable1'][item1[contains(.,'AB')] or item1[contains(.,'ab')]
Please help.
You can use the substring() function in the below way:
/location1[Variable='variable1'][item1[substring(., 5, 2) = ('AB', 'ab')]]
I have a query to the Cognitive text keyphase API from Microsoft from '16 Excel Power Query - getting keywords from tweets. Works fine.
However, the JSON doc that's returned per query is converted by Power Query into a list of ~1-5 rows.
In the case of the pic, I want all responses returned to be in one cell/row, regardless of the number of items returned.
Here is my full M query (you need to put your own key in) if you're interested.
let
TweetCognitive = (TweetID as text, TweetText as text) =>
let
JsonRecords = Text.FromBinary(Json.FromValue([id=TweetID, text=TweetText])),
JsonRequest = "{""documents"": [" & JsonRecords & "]}",
JsonContent = Text.ToBinary(JsonRequest, TextEncoding.Ascii),
Response =
Web.Contents("https://westus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases?",
[
Headers = [#"Ocp-Apim-Subscription-Key"="yourkeyhere",
#"Content-Type"="application/json", Accept="application/json"],
Content=JsonContent
]),
JsonResponse = Json.Document(Response,1252)
in
JsonResponse
in
TweetCognitive
You can use List.Accumulate to turn a list of values into a single value. For example, this would combine the values in the list into a single text value with ". " separating each row's value:
List.Accumulate(JsonResponse, "", (state, current) => state & current & ". ")
This would generate "monday frank love happiness today. nice good kind. tomorrow. " in your example. If you want to get rid of the trailing space, you can surround the List.Accumulate expression with Text.Trim.
The basic function to concatenate elements in a list is Text.Combine. For instance:
Text.Combine(JsonResponse, " ")
This avoids the extra delimeter at the end you get with List.Accumulate. Note also List.Combine is for creating a longer combined list from shorter lists, and the similar naming there may cause confusion.
I scraped a website using Nokogiri and after using xpath I was left with the following string (which is a few td's pushed into one string).
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
My goal is to make this into an array that looks like the following(it will be a nested array):
["Total First Downs", "359", "274"]
The issue is creating a regex equation that removes the escaped characters, subs in one "," but does not sub in a "," after the last set of integers. If the comma after the last set of integers is necessary, I could use #compact to get rid of the nil that occurs in the array. If you need the code on how I scraped the website here it is: (please note i saved the webpage for testing in order for my ip address to not get burned during the trial phase)
f = File.open('page')
doc = Nokogiri::HTML:(f)
f.close
number = doc.xpath('//tr[#class="tbdy1"]').count
stats = Array.new(number) {Array.new}
i = 0
doc.xpath('//tr[#class="tbdy1"]').each do |tr|
stats[i] << tr.text
i += 1
end
Thanks for your help
I don't fully understand your problem, but the result can be easily achieved with this:
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t"
.split(/[\n\t]+/)
# => ["Total First Downs", "359", "274"]
Try with gsub
"Total First Downs\n\t\t\t\t\t\t\t\t359\n\t\t\t\t\t\t\t\t274\n\t\t\t\t\t\t\t".gsub("/[\n\t]+/",",")
I'm attempting to parse blocks of text and need a way to detect the difference between apostrophes in different contexts. Possession and abbreviation in one group, quotations in the other.
e.g.
"I'm the cars' owner" -> ["I'm", "the", "cars'", "owner"]
but
"He said 'hello there' " -> ["He","said"," 'hello there' "]
Detecting whitespace on either side won't help as things like " 'ello " and " cars' " would parse as one end of a quotation, same with matching pairs of apostrophes. I'm getting the feeling that there's no way of doing it other than an outrageously complicated NLP solution and I'm just going to have to ignore any apostrophes not occurring mid-word, which would be unfortunate.
EDIT:
Since writing I have realised this is impossible. Any regex-ish based parser would have to parse:
'ello there my mates' dogs
in 2 different ways, and could only do that with understanding of the rest of the sentence. Guess I'm for the inelegant solution of ignoring the least likely case and hoping it's rare enough to only cause infrequent anomalies.
Hm, I'm afraid this won't be easy. Here's a regex that kinda works, alas only for stuff like "I'm" and "I've":
>> s1 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> nil
>> s2 =~ /[\w\s]*((?<!I)'(?:[^']+)')[\w\s]*/
=> 0
>> $1
=> "'hello there'"
If you play around with it a bit more, you may be able to eliminate some other common contractions, which might still be better than nothing.
Some rules to think about:
Quotes will start with an apostrophe with a whitespace character or nothing before it.
Quotes will end with an apostrophe with punctuation or a whitespace character after it.
Some words may look like the end of quotes, e.g., peoples'.
Quote delimiting apostrophes will never have letters directly before and after them.
Use a very simple two-phase process.
In pass 1 of 2, start with this regular expression to break the text down into alternating segments of word and non-word characters.
/(\w+)|(\W+)/gi
Store the matches in a list like this (I'm using AS3-style pseudo-code, since I don't work with ruby):
class MatchedWord
{
var text:String;
var charIndex:int;
var isWord:Boolean;
var isContraction:Boolean = false;
function MatchedWord( text:String, charIndex:int, isWord:Boolean )
{
this.text = text; this.charIndex = charIndex; this.isWord = isWord;
}
}
var match:Object;
var matched_word:MatchedWord;
var matched_words:Vector.<MatchedWord> = new Vector.<MatchedWord>();
var words_regex:RegExp = /(\w+)|(\W+)/gi
words_regex.lastIndex = 0; //this is where to start looking for matches, and is updated to the end of the last match each time exec is called
while ((match = words_regex.exec( original_text )) != null)
matched_words.push( new MatchedWord( match[0], match.index, match[1] != null ) ); //match[0] is the entire match and match[1] is the first parenthetical group (if it's null, then it's not a word and match[2] would be non-null)
In pass 2 of 2, iterate over the list of matches to find contractions by checking to see if each (trimmed, non-word) match ENDS with an apostrophe. If it does, then check the next adjacent (word) match to see if it matches one of only 8 common contraction endings. Despite all the two-part contractions I could think of, there are only 8 common endings.
d
l
ll
m
re
s
t
ve
Once you've identified such a pair of matches (non-word)="'" and (word)="d", then you just include the preceding adjacent (word) match and concatenate the three matches to get your contraction.
Understanding the process just described, one modification you must make is expand that list of contraction endings to include contractions that start with apostrophe, such as "'twas" and "'tis". For those, you simply don't concatenate the preceding adjacent (word) match, and you look at the apostrophe match a little more closely to see if it included other non-word character before it (that's why it's important it ends with an apostrophe). If the trimmed string EQUALS an apostrophe, then merge it with the next match, and if it only ENDS with an apostrophe, then strip off the apostrophe and merge it with the following match. Likewise, conditions that will include the prior match should first check to ensure the (trimmed non-word) match ending with an apostrophe EQUALS an apostrophe, so there are no extra non-word characters included accidentally.
Another modification you may need to make is expand that list of 8 endings to include endings that are whole words such as "g'day" and "g'night". Again, it's a simple modification involving a conditional check of the preceding (word) match. If it's "g", then you include it.
That process should capture the majority of contractions, and is flexible enough to include new ones you can think of.
The data structure would look like this.
Condition(Ending, PreCondition)
where PreCondition is
"*", "!", or "<exact string>"
The final list of conditions would look like this:
new Condition("d","*") //if apostrophe d is found, include the preceding word string and count as successful contraction match
new Condition("l","*");
new Condition("ll","*");
new Condition("m","*");
new Condition("re","*");
new Condition("s","*");
new Condition("t","*");
new Condition("ve","*");
new Condition("twas","!"); //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
new Condition("tis","!");
new Condition("day","g"); //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
new Condition("night","g");
If you just process those conditions as I explained, that should cover all of these 86 contractions (and more):
'tis 'twas ain't aren't can't could've couldn't didn't doesn't don't
everybody's g'day g'night hadn't hasn't haven't he'd he'll he's how'd
how'll how's I'd I'll I'm I've isn't it'd it'll it's let's li'l
might've mightn't mustn't needn't nobody's nothing's shan't she'd
she'll she's should've shouldn't that'd that'll that's there's they'd
they'll they're they've wasn't we'd we'll we're we've weren't what'll
what're what'd what's what've when'd when'll when's where'd where'll
where's who's who'll who're who'd who'll who's who've why'd why'll
why's won't would've wouldn't you'd you'll you're you've
On a side note, don't forget about slang contractions that don't use apostrophes such as "gotta" > "got to" and "gonna" > "going to".
Here is the final AS3 code. Overall, you're looking at less than 50 lines of code to parse the text into alternating word and non-word groups, and identify and merge contractions. Simple. You could even add a Boolean "isContraction" variable to the MatchedWord class and set the flag in the code below when a contraction is identified.
//Automatically merge known contractions
var conditions:Array = [
["d","*"], //if apostrophe d is found, include the preceding word string and count as successful contraction match
["l","*"],
["ll","*"],
["m","*"],
["re","*"],
["s","*"],
["t","*"],
["ve","*"],
["twas","!"], //if apostrophe twas is found, exclude the preceding word string and count as successful contraction match
["tis","!"],
["day","g"], //if apostrophe day is found and preceding word string is g, then include preceding word string and count as successful contraction match
["night","g"]
];
for (i = 0; i < matched_words.length - 1; i++) //not a type-o, intentionally stopping at next to last index to avoid a condition check in the loop
{
var m:MatchedWord = matched_words[i];
var apostrophe_text:String = StringUtils.trim( m.text ); //check if this ends with an apostrophe first, then deal more closely with it
if (!m.isWord && StringUtils.endsWith( apostrophe_text, "'" ))
{
var m_next:MatchedWord = matched_words[i + 1]; //no bounds check necessary, since loop intentionally stopped at next to last index
var m_prev:MatchedWord = ((i - 1) >= 0) ? matched_words[i - 1] : null; //bounds check necessary for previous match, since we're starting at beginning, since we may or may not need to look at the prior match depending on the precondition
for each (var condition:Array in conditions)
{
if (StringUtils.trim( m_next.text ) == condition[0])
{
var pre_condition:String = condition[1];
switch (pre_condition)
{
case "*": //success after one final check, include prior match, merge current and next match into prior match and delete current and next match
if (m_prev != null && apostrophe_text == "'") //EQUAL apostrophe, not just ENDS with apostrophe
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
case "!": //success after one final check, do not include prior match, merge current and next match, and delete next match
if (apostrophe_text == "'")
{
m.text += m_next.text;
m.isWord = true; //match now includes word text so flip it to a "word" block for logical consistency
m.isContraction = true;
matched_words.splice( i + 1, 1 );
}
else
{ //strip apostrophe off end and merge with next item, nothing needs deleted
//preserve spaces and match start indexes by manipulating untrimmed strings
var apostrophe_end:int = m.text.lastIndexOf( "'" );
var apostrophe_ending:String = m.text.substring( apostrophe_end, m.text.length );
m.text = m.text.substring( 0, m.text.length - apostrophe_ending.length); //strip apostrophe and any trailing spaces
m_next.text = apostrophe_ending + m_next.text;
m_next.charIndex = m.charIndex + apostrophe_end;
m_next.isContraction = true;
}
break;
default: //conditional success, check prior match meets condition
if (m_prev != null && m_prev.text == pre_condition)
{
m_prev.text += m.text + m_next.text;
m_prev.isContraction = true;
matched_words.splice( i, 2 );
}
break;
}
}
}
}
}
I'm not very good with regex, but here's what I got (the string to parse and the regex are on this page) http://rubular.com/r/iIIYDHkwVF
It just needs to match that exact test string
The regular expression is
^"AddonInfo"$(\n\s*)+^\{\s*
It's looking for
^"AddonInfo"$ — a line containing only "AddonInfo"
(\n\s*)+ — followed by at least one newline and possibly many blank or empty lines
^\{\s* — and finally a line beginning with { followed by optional whitespace
To break down a regular expression into its component pieces, have a look at an answer that explains beginning with the basics.
To match the entire string, use
^"AddonInfo"$(\n\s*)+^\{(\s*".+?"\s+".+?"\s*\n)+^\}
So after the open curly, you're looking for one or more lines such that each contains a pair of quote-delimited simple strings (no escaping).
This one works:
^"AddonInfo"[^{]*{[^}]*}
Explanation:
^"AddonInfo" matches "AddonInfo" in the beginning of a line
[^{]* matches all the following non-{ characters
{ matches the following {
[^}]* matches all the following non-} characters
} matches the following }
^"AddonInfo"(\s*)+^\{\s*(?:"([^"]+)"\s+"([^"]*)"\s+)+\}
You will get $1 to point into first key, $2 first value, $3 second key, $4, second value, and so on.
Notice that key is to be non-empty ("([^"]+"), but value may be empty (uses * instead of +).