Extract 11-character alphanumeric identifier from sentence - powerquery

I am attempting to extract an 11-character identifier from an imported sentence. The identifier is a combination of numbers and letters in a specific sequence. Presumably, it will be separated by spaces both before and after the identifier. I had previously created something similar in VBA, but I am now trying to do this in Power Query. Any ideas?
Snippet from VBA code for 11-character identifier is as follows:
Array("[1-9]", "[AC-HJKMNPQRT-Y]", "[AC-HJKMNPQRT-Y0-9]", "[0-9]", "[AC-HJKMNPQRT-Y]", _
"[AC-HJKMNPQRT-Y0-9]", "[0-9]", "[AC-HJKMNPQRT-Y]", "[AC-HJKMNPQRT-Y]", "[0-9]", "[0-9]")
The code above shows that the first character is a digit between 1 and 9, the second character is a letter from the set AC-HJKMNPQRT-Y, the third character is either a letter from that set or a digit between 0 and 9, and so on.
Anyway, it seems like PQ can't do regex, so any ideas? Thanks.
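For reference, the eleven character classes in the VBA array can be joined into a single pattern. The following is a minimal Python sketch of that idea, not a Power Query solution; the names `CLASSES`, `ID_PATTERN` and `extract_ids` and the sample sentences are made up for illustration, and it assumes the identifier is delimited by whitespace or by the start or end of the text.

```python
import re

# The eleven per-position character classes, copied from the VBA array.
CLASSES = [
    "[1-9]", "[AC-HJKMNPQRT-Y]", "[AC-HJKMNPQRT-Y0-9]", "[0-9]",
    "[AC-HJKMNPQRT-Y]", "[AC-HJKMNPQRT-Y0-9]", "[0-9]",
    "[AC-HJKMNPQRT-Y]", "[AC-HJKMNPQRT-Y]", "[0-9]", "[0-9]",
]

# Assumption: the identifier is surrounded by whitespace or text boundaries,
# so the lookarounds reject matches embedded in a longer token.
ID_PATTERN = re.compile(r"(?<!\S)" + "".join(CLASSES) + r"(?!\S)")


def extract_ids(sentence):
    """Return every whitespace-delimited 11-character identifier found."""
    return ID_PATTERN.findall(sentence)


print(extract_ids("Member 1EG4TE5MK73 enrolled last week"))
```

The same position-by-position check could be reproduced without regex by testing each of the eleven characters against its allowed set.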

Related

Formatting numbers with thousand separator as empty string on amcharts4

I need to change the chart number format so that numbers stop looking like this: 1,525 (comma separator) and start looking like this: 1 525 (empty string thousand separator). In addition, I need a dot as the decimal separator, but only if the number has a decimal part, like this: 1 525.4
The closest number format I was able to find for amCharts4 version is
chart.numberFormatter.numberFormat = '#,###.#';
Any ideas?
After some research I found a solution: you have to use locales.
This line of code helped me a lot:
chart.language.locale = am4lang_[locale];
For the empty string separator I used am4lang_ru_RU.
By the way, if you need your own formatting for numbers, strings, etc., you can create your own locales for that.

Extract Tweet ID from text

I have a large, 4.5M+ row CSV (commas are the separators) containing tweets. The CSV comes from some time ago, and has all manner of line breaks inside column data, characters, etc. It is likely malformed in some ways but it is difficult for me to discern exactly where and how with a file of this size.
I want to move through this CSV file as a large body of text, pull out all the Tweet IDs, and put each pulled ID into a line in a new file.
Doing this via bash, perl, or Python will work fine. Can anyone help here? I can't seem to even find info on the parameters for a tweet ID, though the ones in this corpus all seem to be 17-digit integers.
Since the only evidence in your question for a Tweet ID is that it's an integer of length 17, that is the only rule I am going to use.
Plus, I am going to use it as a hard-and-fast rule. Anything that is an integer of length 17 is a Tweet ID, nothing else.
After that it's a normal regular expression search.
import re
string = '''
12345678912345678, abcd, efgh
45645645645645645, ijkl, mnop
78944556677889900, qrst, uvwx
0, y, z
'''
m = re.findall('[0-9]{17}', string)
print(m)
re.findall searches for the regular expression (first argument) in the string (second argument).
(a):- [0-9] means any digit from 0 to 9
(b):- {m} means the regular expression that precedes it must repeat m times
(a)+(b):- [0-9]{17} matches a string of digits 0 to 9 repeated 17 times, i.e. a number of length 17
Find out more about the re module in the Python documentation.
This is as much as I can help without knowing anything more about the input file and tweet format.
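For a 4.5M+ row file it may be preferable not to load everything into one string. The following is a minimal sketch that scans line by line, assuming (as above) that a Tweet ID is exactly 17 digits; the lookarounds are an addition that stops the pattern from matching 17 digits inside a longer number. The function names and the file paths passed to `write_ids` are hypothetical.

```python
import re

# Exactly 17 digits, not preceded or followed by another digit.
TWEET_ID = re.compile(r"(?<!\d)\d{17}(?!\d)")


def extract_ids(lines):
    """Yield every 17-digit run found in an iterable of text lines."""
    for line in lines:
        yield from TWEET_ID.findall(line)


def write_ids(src_path, dst_path):
    """Stream src_path and write one ID per line to dst_path."""
    # errors="replace" is an assumption to survive malformed bytes.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tweet_id in extract_ids(src):
            dst.write(tweet_id + "\n")


sample = ["12345678912345678, abcd, efgh", "0, y, z", "123456789123456789, x"]
print(list(extract_ids(sample)))
```

Because the file is treated as plain text, broken quoting or stray line breaks inside columns do not matter to this approach.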

How do I write a regex for Excel cell range?

I need to validate that something is an Excel cell range in Ruby, i.e: "A4:A6". By looking at it, the requirement I am looking for is:
<Alphabetical, Capitalised><Integer>:<Alphabetical, Capitalised><Integer>
I am not sure how to form a RegExp for this.
I would appreciate a small explanation for a solution, as opposed to purely a solution.
A bonus would be to check that the range is restricted to within a row or column. I think this would be out of scope of Regular Expressions though.
I have tried /[A-Z]+[0-9]+:[A-Z]+[0-9]+/. This matches valid ranges, but it does not really work because it allows extra characters to be added on to the beginning or end.
"HELLOAA3:A7".match(/\A[A-Z]+[0-9]+:[A-Z]+[0-9]+\z/) also returns a match, but is more on the right track.
How would I limit the number range to 10000?
How would I limit the number of characters to 3?
This is my solution:
(?:(?:\'?(?:\[(?<wbook>.+)\])?(?<sheet>.+?)\'?!)?(?<colabs>\$)?(?<col>[a-zA-Z]+)(?<rowabs>\$)?(?<row>\d+)(?::(?<col2abs>\$)?(?<col2>[a-zA-Z]+)(?<row2abs>\$)?(?<row2>\d+))?|(?<name>[A-Za-z]+[A-Za-z\d]*))
It includes named ranges, but the R1C1 notation is not supported.
The pattern is written in perl compatible regex dialect (i.e. can also be used with C#), I'm not familiar with Ruby, so I can't tell the difference, but you may want to look here: What is the difference between Regex syntax in Ruby vs Perl?
This will do both: match an Excel range and check that both cells are in the same row or column.
^([A-Z]+)(\d+):(\1\d+|[A-Z]+\2)$
A4:A6 // ok
A5:B10 // not ok
B5:Z5 // ok
AZ100:B100hello // not ok
The magic here is the back-reference group:
([A-Z]+)(\d+) -- column is in capture group 1, row in group 2
(\1\d+|[A-Z]+\2) -- the first column followed by any row number; or
-- any column letters followed by the first row number
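The limits asked about in the question (rows up to 10000, at most 3 column letters) are easier to check after parsing than inside the pattern itself. The following is a minimal Python sketch of that approach rather than a Ruby answer; `RANGE`, `is_line_range` and `max_row` are illustrative names, and it assumes row numbers have no leading zeros.

```python
import re

# One to three capital column letters, then a row number without leading zeros.
RANGE = re.compile(r"([A-Z]{1,3})([1-9]\d*):([A-Z]{1,3})([1-9]\d*)")


def is_line_range(text, max_row=10000):
    """True if text is a cell range confined to a single row or column."""
    m = RANGE.fullmatch(text)  # fullmatch rejects extra characters at either end
    if m is None:
        return False
    col1, row1, col2, row2 = m.groups()
    if int(row1) > max_row or int(row2) > max_row:
        return False
    # Same column letters or same row number.
    return col1 == col2 or row1 == row2


for candidate in ["A4:A6", "A5:B10", "B5:Z5", "AZ100:B100hello"]:
    print(candidate, is_line_range(candidate))
```

Comparing the captured groups in code plays the same role as the back-references in the pattern above, while keeping the numeric limit readable.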

sscanf skips capital 'N' letter

I have a strange sscanf problem with the capital letter 'N' (maybe I misunderstand something; please correct me):
Example 1:
char cBuff[128];
sscanf("GUIDNameNENE","%*[GUIDName]%127s" ,cBuff);
returns cBuff:ENE
Example 2:
char cBuff[128];
sscanf("GUIDNamenENE","%*[GUIDName]%127s" ,cBuff);
returns cBuff:nENE
Example 3:
char cBuff[128];
sscanf("GUIDNaMENE","%*[GUIDNa]%127s" ,cBuff);
returns cBuff:ENE
I have tried many other variants but still always skips capital N.
Where is the problem?
Thank you in advance!
%[GUIDName] is not a weird way of quoting and matching an exact string. It defines a set of characters that will match. They will match in any order, and they will match repeatedly.
The longest match for the set %[GUIDName] in your input is GUIDNameN.
You could of course say %*[G]%*[U]%*[I]%*[D]%*[N]%*[a]%*[m]%*[e] and that would not eat any of the characters GUIDNam, but it would still eat multiple es.
I would guess it skips the capital N because N is part of the set of characters that you ignore. The key point is that what you specify between the brackets is a set of characters to match, not a string in a fixed order: sscanf matches the longest run consisting only of the characters listed between the '[' and the closing ']', if I recall correctly.
You could try specifying the size for the set of characters to be skipped like this:
sscanf("GUIDNameNENE","%*8[GUIDName]%127s" ,cBuff);
But that will of course only work if the prefix is always eight characters long, and if it is, you could simply ignore the first eight characters like this:
sscanf("GUIDNameNENE","%*8s%127s" ,cBuff);
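The difference between a character set and a literal prefix can be sketched in Python. This is only an emulation for illustration, not how sscanf is implemented; `skip_scanset` and `skip_literal` are made-up helper names.

```python
def skip_scanset(text, charset):
    """Mimic "%*[charset]%s": skip the longest leading run of characters
    that belong to charset, then return the next whitespace-delimited token."""
    i = 0
    while i < len(text) and text[i] in charset:
        i += 1
    if i == 0:
        return None  # the scanset must match at least one character
    rest = text[i:].split()
    return rest[0] if rest else None


def skip_literal(text, prefix):
    """Skip an exact prefix instead of a set of characters."""
    return text[len(prefix):] if text.startswith(prefix) else None


print(skip_scanset("GUIDNameNENE", "GUIDName"))   # the N after 'e' is in the set
print(skip_scanset("GUIDNamenENE", "GUIDName"))   # lowercase n is not in the set
print(skip_literal("GUIDNameNENE", "GUIDName"))
```

In the first call the run of set members is GUIDNameN, so only ENE remains; with the literal prefix the result keeps the capital N.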

Suggested variable system and procedures to incorporate this LEET table into my Free Pascal program

I code using Free Pascal and Lazarus.
I want to incorporate the LEET Table seen here (http://en.wikipedia.org/wiki/Leet#Orthography) into a new program, but I'm unsure of the best way to do so. Should I use array structures (one for each letter of the alphabet) or 'Set Types' for each letter or records for each letter? Any suggestions of how to implement an idea would be appreciated.
The aim of the program is to open and read a text file line by line (I've got this done already) using an OpenDialog and it will then say "For each word, if it finds the letters 'E', 'O' or 'I', replace them with values from the table for the letter found"
e.g. if strLineFromFile contains letter 'E', replace it with 3, £, + &....and so on
repeat
...
Readln(SourceFile, strLineFromFile);
{ Look for letters E, I and O in strLineFromFile }
{ Look up LEET table and switch chars }
until EOF(SourceFile);
I'm open to suggestions on the best way to optimise this process. I'm not expecting pure code, but pointers as to which functions or procedures would be best and what variable system to use for optimum performance.
Note : I'm still learning so nothing too complex please!
Ted
Sets are not ordered, so they don't make sense here.
Use an array['a'..'z'] of array of string instead. The first array level is indexed by the letters in the input; the second level holds the various translations of the same input letter.
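The two-level lookup described above can be sketched as follows. This is written in Python rather than Pascal, purely to show the shape of the data; `LEET`, `to_leet` and `choice` are illustrative names. The replacements for E come from the question, while those for I and O are assumed sample values, not a full copy of the table.

```python
# First level: input letter. Second level: its possible translations.
LEET = {
    "E": ["3", "£", "+", "&"],  # values listed in the question
    "I": ["1", "!"],            # assumed sample values
    "O": ["0"],                 # assumed sample value
}


def to_leet(line, table=LEET, choice=0):
    """Replace each letter found in the table with one of its translations.

    choice selects which translation to use and wraps around when a
    letter has fewer alternatives.
    """
    out = []
    for ch in line:
        options = table.get(ch.upper())
        out.append(options[choice % len(options)] if options else ch)
    return "".join(out)


print(to_leet("HELLO"))
print(to_leet("nice one"))
```

In Pascal the outer dictionary corresponds to the array indexed by 'a'..'z' and each list to the inner dynamic array of string.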
