context search for punctuation marks R quanteda - tweets

While doing some research on punctuation marks in tweets, I am having difficulty getting the contexts of repeated punctuation marks with kwic.
With "." and "..." (a single character) it works nicely, but "!!!" does not give any results.
Even though
search_tweets("!!!", n=100, retryonratelimit = TRUE, include_rts=TRUE, lang="it")
returns ndoc=100, the result of
kwic(tokens(corp), pattern="!!!", window=10)
is "No data available in table".
What is wrong here?


Extract specific value from command in Windows PowerShell

I need help to extract some values from a command:
PS C:\Users\cs> c:\windows\system32\inetsrv\appcmd list sites
SITE "A" (id:1,bindings:http//:csdev.do.com,state:Stopped)
SITE "B" (id:2,bindings:tsd-gr2,state:Stopped)
SITE "C" (id:3,bindings:http/1028:8091:,http/19.28:80:ddprem.do.com,state:Stopped)
SITE "D" (id:4,bindings:http/109.149.232:80,state:Stopped)
I tried to extract first value as below:
PS C:\Users\cs> c:\windows\system32\inetsrv\appcmd list sites | %{ $_.Split('\"')[1]; }
A
B
C
D
I need two more fields: the ID and the URL (only if there is do.com in the bindings). There might be many URLs in the bindings. I need only the first one that contains do.com; in all other cases the field should be marked as null or left blank.
A,1,csdev.do.com
B,2,null
C,3,ddprem.do.com
D,4,null
Using the WebAdministration module is probably the best approach, but you could also use a regex: loop over the lines that c:\windows\system32\inetsrv\appcmd list sites returns and parse the values you need from each one.
Since I cannot test this against a real server, I'm using your example output from c:\windows\system32\inetsrv\appcmd list sites as a string array:
$siteList = 'SITE "A" (id:1,bindings:http//:csdev.do.com,state:Stopped)',
'SITE "B" (id:2,bindings:tsd-gr2,state:Stopped)',
'SITE "C" (id:3,bindings:http/1028:8091:,http/19.28:80:ddprem.do.com,state:Stopped)',
'SITE "D" (id:4,bindings:http/109.149.232:80,state:Stopped)'
$regex = [regex] '^SITE "(?<site>\w+)".+id:(?<id>\d+),bindings:(?:.+:(?<url>\w+\.do\.com))?'
$siteList | ForEach-Object {
    $match = $regex.Match($_)
    while ($match.Success) {
        $url = if ($match.Groups['url'].Value) { $match.Groups['url'].Value } else { 'null' }
        '{0},{1},{2}' -f $match.Groups['site'].Value, $match.Groups['id'].Value, $url
        $match = $match.NextMatch()
    }
}
Result:
A,1,csdev.do.com
B,2,null
C,3,ddprem.do.com
D,4,null
Regex details:
^ Assert position at the beginning of the string
SITE\ " Match the characters “SITE "” literally
(?<site> Match the regular expression below and capture its match into backreference with name “site”
\w Match a single character that is a “word character” (letters, digits, etc.)
+ Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
" Match the character “"” literally
. Match any single character that is not a line break character
+ Between one and unlimited times, as many times as possible, giving back as needed (greedy)
id: Match the characters “id:” literally
(?<id> Match the regular expression below and capture its match into backreference with name “id”
\d Match a single digit 0..9
+ Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
,bindings: Match the characters “,bindings:” literally
(?: Match the regular expression below
. Match any single character that is not a line break character
+ Between one and unlimited times, as many times as possible, giving back as needed (greedy)
: Match the character “:” literally
(?<url> Match the regular expression below and capture its match into backreference with name “url”
\w Match a single character that is a “word character” (letters, digits, etc.)
+ Between one and unlimited times, as many times as possible, giving back as needed (greedy)
\. Match the character “.” literally
do Match the characters “do” literally
\. Match the character “.” literally
com Match the characters “com” literally
)
)? Between zero and one times, as many times as possible, giving back as needed (greedy)
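For readers who want to check the pattern outside PowerShell, here is a minimal Python sketch of the same idea. It assumes the sample lines from the question as input; the names SITE_RE and parse_site are made up for this illustration, and Python spells named groups as (?P<name>...) rather than (?<name>...).

```python
import re

# Same pattern as above, using Python's (?P<name>...) syntax for named groups.
SITE_RE = re.compile(
    r'^SITE "(?P<site>\w+)".+id:(?P<id>\d+),bindings:(?:.+:(?P<url>\w+\.do\.com))?'
)


def parse_site(line):
    """Return 'site,id,url' for one appcmd output line, or None if it does not match."""
    match = SITE_RE.match(line)
    if match is None:
        return None
    # The url group is None when the optional part did not take part in the match.
    url = match.group("url") or "null"
    return "{},{},{}".format(match.group("site"), match.group("id"), url)


# Sample output copied from the question (not read from a live server).
site_list = [
    'SITE "A" (id:1,bindings:http//:csdev.do.com,state:Stopped)',
    'SITE "B" (id:2,bindings:tsd-gr2,state:Stopped)',
    'SITE "C" (id:3,bindings:http/1028:8091:,http/19.28:80:ddprem.do.com,state:Stopped)',
    'SITE "D" (id:4,bindings:http/109.149.232:80,state:Stopped)',
]

for line in site_list:
    print(parse_site(line))
```

Because the `.+` before the url group is greedy, a line with several do.com bindings would yield the last one, not the first; that case does not occur in the sample data.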

Replacing all but alphabetic characters with spaces in python, in any language

The code
phrase = "".join([c if c.isalpha() else " " for c in phrase])
substitutes every non-alphabetic character with a space. It works very well with strings made up of Western-language characters.
But giving it the value:
phrase = u'इसका स्वामित्व और नियंत्रण किया। इसके'
the result is u'इसक स व म त व और न य त रण क य इसक ', while it shouldn't change, since the string is only made of alphabetic characters and spaces.
I think the reason is that some of the characters are surrogate pairs.
Is it a bug with python's isalpha() method?
Or, if not, how can I deal properly with characters represented by surrogate pairs?
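One possible explanation, offered as a sketch and not as a confirmed diagnosis: str.isalpha() is true only for characters in the Unicode letter categories (Lu, Ll, Lt, Lm, Lo), while the characters dropped in the example are Devanagari vowel signs and the virama, which are combining marks (categories Mn and Mc). The snippet below assumes Python 3 and keeps marks as well as letters; keep_letters is a name invented for this illustration.

```python
import unicodedata


def keep_letters(text):
    """Replace every character that is neither a letter nor a combining mark with a space."""
    return "".join(
        # Major category "L" covers letters, "M" covers combining marks.
        ch if unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in text
    )


phrase = "इसका स्वामित्व और नियंत्रण किया। इसके"
print(keep_letters(phrase))
```

With this rule the vowel signs stay attached to their consonants, and only characters such as digits and punctuation (including the danda "।") turn into spaces.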

Complex requirements for string split around select commas

TL;DR
I need some help making a regex that will match any commas in a string that are side by side with unlimited white space around them and between them. The commas and their surrounding white space cannot be within matching single quotes or double quotes. I then need to capture the non-whitespace values from around those commas and count how many of those commas there are.
The values captured from around the commas will become their own values in the final array, while the commas that were counted will become nil values that are added to the final array.
Explanation of the problem:
This is a pretty complex problem so any help is greatly appreciated. I'm adding functionality to a library I've been using for a while now. I have this string that contains an array
"['d,og,f:asdf,:hello,",,\",,alsee',,,'ho,la', "-123,4,5.3", true, :good, false,,, "gr\'\'\'true,\',\'ee\"n", ":::testme", true]"
I would like to split this string only around select commas so that I have an array containing the following values
'd,og,f:asdf,:hello,",,\",,alsee'
nil
nil
'ho,la'
"-123,4,5.3"
true
:good
false
nil
nil
"gr\'\'\'true,\',\'ee\"n"
":::testme"
true
The nil values come from the side-by-side commas that are not contained in any string. I wrote the following regex to split the string above (I already got rid of the start and end brackets):
/(?<=(?:['\"]|false|true|^|,)),(?=(?:\s*(?:(?::[\w]+)|(?:(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*'))|(?:false|true)))\s*(?:,|$)))/
This splits the string so I get these values:
(0) "'d,og,f:asdf,:hello,",,\",,alsee',,"
(1) "'ho,la'"
(2) " "-123,4,5.3""
(3) " true"
(4) " :good, false,,"
(5) " "gr\'\'\'true,\',\'ee\"n""
(6) " ":::testme""
(7) " true"
All the values are strings as can be seen by their surrounding double quotes. They will not all end up that way though. A true or false will be converted to a boolean. The values surrounded by internal quotes will end up as strings. Then a value preceded with a : will end up as a symbol.
There are problems with the values at index 0 and 4. Index 0 should be this:
(0.0) "'d,og,f:asdf,:hello,",,\",,alsee'"
(0.1) nil
(0.2) nil
As you can see, the two commas at the end are gone. They have become the two nil values you see above. Then the string starts at the first single quote and ends at the last single quote, signifying that this value in the array is a string.
Then index 4 (" :good, false,,") should be this:
(4.0) " :good"
(4.1) " false"
(4.2) nil
(4.3) nil
The two commas at the end have become nil. Then " false" is its own value, which will later be converted to a boolean, while " :good" is also its own value and will later be converted to a symbol.
To fix the problem with index 4 I have all the values run through a second regex. Here it is:
/^(\s*:(?:(?:[\w]+|\"[\s\S]+\"|'[\s\S]+')\s*)),([\s\S]*)$/
Instead of splitting this one I get the capture groups. It ends up returning this array for the value at index 4:
(4.0) " :good"
(4.1) " false,,"
That's what I wanted except for one problem. The value at index 4.1 (" false,,") has the two trailing commas which should be nil values in the array.
"['d,og,f:asdf,:hello,"
,,\
",,alsee',,,'ho,la', "
-123,4,5.3
", true, :good, false,,, "
gr\
'\'
I count 4 strings: 3 in double quotes and the last one in single quotes?
You say this is broken down into smaller strings by your regex. But what about the characters outside the 4 strings?
Sorry, it looks like a bit of a mess.
Try putting it all in a here-document string and then breaking it down with a regex.
I finally figured it out myself. You can see how it fits in with the rest if you look at the description of the question above.
/^(([\s]*,)*)[\s]*((?::[\w]+)|(?::?(?:\"[\s\S]*\")|(?:'[\s\S]*')|false|true))?(([\s]*,)*)$/
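As an alternative to chaining regexes, the same splitting rule can be sketched as a small character scanner. This is an illustration in Python and not part of the Ruby library discussed above: split_top_level is a hypothetical name, None stands in for Ruby's nil, and the input is assumed to have its outer brackets already removed.

```python
def split_top_level(text):
    """Split on commas outside quotes; empty items become None."""
    items = []
    current = []      # characters of the item being collected
    quote = None      # the quote character that opened the current string, if any
    escaped = False   # True when the previous character was a backslash

    for ch in text:
        if escaped:
            # A character after a backslash is always taken literally.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            current.append(ch)
            escaped = True
        elif quote:
            current.append(ch)
            if ch == quote:
                quote = None
        elif ch in "'\"":
            current.append(ch)
            quote = ch
        elif ch == ",":
            items.append("".join(current).strip() or None)
            current = []
        else:
            current.append(ch)

    items.append("".join(current).strip() or None)
    return items


print(split_top_level("'a,b',, :good, false,,, \"x,y\""))
```

Each returned item still carries its quotes or leading colon, so the later conversion to strings, symbols, and booleans can stay as it is.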

How do I split this certain kind of string into an array in ruby [duplicate]

I would like to find a regex that will pick out all commas that fall outside quote sets.
For example:
'foo' => 'bar',
'foofoo' => 'bar,bar'
This would pick out the single comma on line 1, after 'bar',
I don't really care about single vs double quotes.
Has anyone got any thoughts? I feel like this should be possible with lookaheads, but my regex-fu is too weak.
This will match any string up to and including the first non-quoted ",". Is that what you are wanting?
/^([^"]|"[^"]*")*?(,)/
If you want all of them (and as a counter-example to the guy who said it wasn't possible) you could write:
/(,)(?=(?:[^"]|"[^"]*")*$)/
which will match all of them. Thus
'test, a "comma,", bob, ",sam,",here'.gsub(/(,)(?=(?:[^"]|"[^"]*")*$)/,';')
replaces all the commas not inside quotes with semicolons, and produces:
'test; a "comma,"; bob; ",sam,";here'
If you need it to work across line breaks just add the m (multiline) flag.
Either of the regexes below matches all commas that are outside double quotes:
,(?=(?:[^"]*"[^"]*")*[^"]*$)
OR (PCRE only)
"[^"]*"(*SKIP)(*F)|,
"[^"]*" matches a whole double-quoted block. For the input buz,"bar,foo" it matches "bar,foo" only. The (*SKIP)(*F) that follows then forces this match to fail and keeps the engine from retrying inside the skipped text. The engine then moves on to the alternative after the | symbol, which matches a comma in the rest of the string. So only the comma right after buz is matched; the comma inside the double quotes is not, because the double-quoted part has already been skipped.
The regex below matches all commas that are inside double quotes:
,(?!(?:[^"]*"[^"]*")*[^"]*$)
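The two lookahead patterns can be tried out with Python's standard re module, as in the short sketch below. The (*SKIP)(*F) variant is left out because the standard re module does not support those PCRE verbs. The sample string is the one used in the earlier answer; OUTSIDE and INSIDE are names chosen for this illustration.

```python
import re

# A comma followed by an even number of double quotes is outside quotes.
OUTSIDE = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')
# A comma followed by an odd number of double quotes is inside quotes.
INSIDE = re.compile(r',(?!(?:[^"]*"[^"]*")*[^"]*$)')

text = 'test, a "comma,", bob, ",sam,",here'

print(OUTSIDE.sub(";", text))  # test; a "comma,"; bob; ",sam,";here
print(INSIDE.sub(";", text))   # test, a "comma;", bob, ";sam;",here
```

Both patterns assume that every double quote in the input is properly closed and that quotes are never escaped.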
While it's possible to hack it with a regex (and I enjoy abusing regexes as much as the next guy), you'll get in trouble sooner or later trying to handle substrings without a more advanced parser. Possible ways to get in trouble include mixed quotes, and escaped quotes.
This function will split a string on commas, but not those commas that are within a single- or double-quoted string. It can be easily extended with additional characters to use as quotes (though character pairs like « » would need a few more lines of code) and will even tell you if you forgot to close a quote in your data:
function splitNotStrings(str){
  var parse=[], inString=false, escape=0, end=0
  for(var i=0, c; c=str[i]; i++){ // looping over the characters in str
    if(c==='\\'){ escape^=1; continue} // 1 when odd number of consecutive \
    if(c===','){
      if(!inString){
        parse.push(str.slice(end, i))
        end=i+1
      }
    }
    else if(splitNotStrings.quotes.indexOf(c)>-1 && !escape){
      if(c===inString) inString=false
      else if(!inString) inString=c
    }
    escape=0
  }
  // now we finished parsing, strings should be closed
  if(inString) throw SyntaxError('expected matching '+inString)
  if(end<i) parse.push(str.slice(end, i))
  return parse
}
splitNotStrings.quotes="'\"" // add other (symmetrical) quotes here
Try this regular expression:
(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*=>\s*(?:"(?:[^\\"]+|\\(?:\\\\)*[\\"])*"|'(?:[^\\']+|\\(?:\\\\)*[\\'])*')\s*,
This does also allow strings like “'foo\'bar' => 'bar\\',”.
MarkusQ's answer worked great for me for about a year, until it didn't. I just got a stack overflow error on a line with about 120 commas and 3682 characters total. In Java, like this:
String[] cells = line.split("[\t,](?=(?:[^\"]|\"[^\"]*\")*$)", -1);
Here's my extremely inelegant replacement that doesn't stack overflow:
private String[] extractCellsFromLine(String line) {
    List<String> cellList = new ArrayList<String>();
    while (true) {
        String[] firstCellAndRest;
        if (line.startsWith("\"")) {
            firstCellAndRest = line.split("([\t,])(?=(?:[^\"]|\"[^\"]*\")*$)", 2);
        }
        else {
            firstCellAndRest = line.split("[\t,]", 2);
        }
        cellList.add(firstCellAndRest[0]);
        if (firstCellAndRest.length == 1) {
            break;
        }
        line = firstCellAndRest[1];
    }
    return cellList.toArray(new String[cellList.size()]);
}
@SocialCensus, the example you gave in your comment to MarkusQ, where you throw in ' alongside the ", doesn't work with the example MarkusQ gave right above it if we change sam to sam's: (test, a "comma,", bob, ",sam's,",here) has no match against (,)(?=(?:[^"']|["|'][^"']")$). In fact, the problem itself, "I don't really care about single vs double quotes", is ambiguous. You have to be clear about what you mean by quoting with either " or '. For example, is nesting allowed or not? If so, to how many levels? If only one nested level is allowed, what happens to a comma outside the inner quotation but inside the outer quotation? You should also consider that single quotes occur on their own as apostrophes (i.e., as in the counter-example I gave earlier with sam's). Finally, the regex you made doesn't really treat single quotes on a par with double quotes, since it assumes the last quotation mark is necessarily a double quote. Replacing that last double quote with ['|"] also has a problem if the text isn't correctly quoted (or if apostrophes are used), though I suppose we could probably assume all quotes are correctly delimited.
MarkusQ's regexp answers the question: find all commas that have an even number of double quotes after them (i.e., are outside double quotes) and disregard all commas that have an odd number of double quotes after them (i.e., are inside double quotes). This is generally the solution you probably want, but let's look at a few anomalies. First, if someone leaves off a quotation mark at the end, this regexp finds all the wrong commas instead of finding the desired ones or failing to match any. Of course, if a double quote is missing, all bets are off, since it might not be clear whether the missing one belongs at the end or at the beginning. However, there is a case that is legitimate and where the regex could conceivably fail (this is the second "anomaly"). If you adjust the regexp to work across lines of text, be aware that quoting multiple consecutive paragraphs requires a single double quote at the beginning of each paragraph and no closing quote at the end of any paragraph except the very last one. This means that over the span of those paragraphs, the regex will fail in some places and succeed in others.
Examples and brief discussions of paragraph quoting and of nested quoting can be found here http://en.wikipedia.org/wiki/Quotation_mark .

How to handle Combining Diacritical Marks with UnicodeUtils?

I am trying to insert spaces into a string of IPA characters, e.g. to turn ɔ̃wɔ̃tɨ into ɔ̃ w ɔ̃ t ɨ. Using split/join was my first thought:
s = ɔ̃w̃ɔtɨ
s.split('').join(' ') #=> ̃ ɔ w ̃ ɔ p t ɨ
As I discovered by examining the results, letters with diacritics are in fact encoded as two characters. After some research I found the UnicodeUtils module, and used the each_grapheme method:
UnicodeUtils.each_grapheme(s) {|g| g + ' '} #=> ɔ ̃w ̃ɔ p t ɨ
This worked fine, except for the inverted breve mark. The code changes ̑a into ̑ a. I tried normalization (UnicodeUtils.nfc, UnicodeUtils.nfd), but to no avail. I don't know why the each_grapheme method has a problem with this particular diacritic mark, but I noticed that in gedit, the breve is also treated as a separate character, as opposed to tildes, accents etc. So my question is as follows: is there a straightforward method of normalization, i.e. turning the combination of Latin Small Letter A and Combining Inverted Breve into Latin Small Letter A With Inverted Breve?
I understand your question concerns Ruby, but I suppose the problem is much the same in Python. A simple solution is to test for combining diacritical marks explicitly:
import unicodedata

s = u"ɔ̃w̃ɔtɨ"
clusters = []
for char in s:
    # A combining mark is attached to the base character before it
    if unicodedata.combining(char) and clusters:
        clusters[-1] += char
    else:
        clusters.append(char)
print(" ".join(clusters))
# prints: ɔ̃ w̃ ɔ t ɨ
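On the normalization part of the question, here is a small Python sketch of what NFC composition does. It assumes that UnicodeUtils.nfc follows the same Unicode rules as Python's unicodedata.normalize, which has not been checked in Ruby here; compose is a name made up for this illustration. NFC merges a base letter and a following combining mark only when Unicode defines a precomposed character for the pair, and a mark that comes before the letter is never merged with it.

```python
import unicodedata


def compose(text):
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)


# a + COMBINING INVERTED BREVE (U+0311) has a precomposed form, U+0203.
composed = compose("a\u0311")
print(len(composed), unicodedata.name(composed))

# w + COMBINING TILDE (U+0303) has no precomposed form, so it stays two code points.
print(len(compose("w\u0303")))

# A mark placed before the letter is left alone.
print(len(compose("\u0311a")))
```

If the inverted breve really precedes the a in the data, as the rendering in the question suggests, that ordering may be why neither normalization nor grapheme splitting groups the two together.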
