AppleScript noob here. I'm trying to identify a date format in filenames and return the characters immediately preceding the date. The date in each filename is just 6 consecutive digits. The data before it indicates the length of the file and is also numeric. These files will never have 6 or more consecutive digits except for the date, so I don't have to worry about false positives. What I need to do is find the 6 consecutive digits so I can use them to locate the data before the date and group all those files together.
ex:
Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov
Initially it seemed like the numbers preceding the date had set values that I could have the code look out for with
if fileName contains "29" then
but now I'm stumped on how to approach this. My general idea was the following:
Looks like something’s eaten the last part of your question. At any rate, AppleScript is not the best language for text processing, but whatever language you use the standard technique is regular expression-based pattern matching.
For example, to match six digits you’d use the pattern \d{6}. The \d pattern matches any digit, the {6} matches the preceding pattern exactly six times.
If you want to extract the text from the start of a line up to the six digits, you’d use something like (?-s)^(.+?)\d{6}. The ^ matches the start of each line. The .+? matches one or more characters (.+) only up to the next pattern match (?); grouping it in parens extracts the matched text. By default, the . pattern matches any character including a line break, so add (?-s) to the start of the pattern to turn off the line break matching (-s).
Bit cryptic, but very powerful, and you’ll get the hang of it with a bit of practice. Tons of online docs and examples too; just search for “PCRE regular expression”. (Tip: build it up one pattern at a time, testing at every step.)
AppleScript doesn’t have built-in support for regular expressions, but it can use Cocoa’s NSRegularExpression class via the AppleScript-ObjC bridge. The syntax isn’t very friendly so you may want to use a library that wraps it for you:
use script "Text"
set theText to "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov"
search text theText for "^(.+?)\\d{6}" using pattern matching
returns:
{{class:matched text, startIndex:1, endIndex:39, foundText:"Barry_Waterson_Speech_1955_27.02_012219", foundGroups:{{class:matched group, startIndex:1, endIndex:33, foundText:"Barry_Waterson_Speech_1955_27.02_"}}},
{class:matched text, startIndex:67, endIndex:98, foundText:"Test Recording Iceland 19 040407", foundGroups:{{class:matched group, startIndex:67, endIndex:92, foundText:"Test Recording Iceland 19 "}}}}
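For quick experimentation outside AppleScript, the same pattern can be tried in Ruby. This is only a sketch: the method names prefix_before_date and group_by_prefix are made up for illustration, and the pattern is applied to one filename at a time, so \A (start of string) stands in for ^ and no (?-s) is needed.

```ruby
# Hypothetical helper: returns the text before the first run of six digits,
# or nil when the name contains no such run.
def prefix_before_date(file_name)
  match = /\A(.+?)\d{6}/.match(file_name)
  match && match[1]
end

# Hypothetical helper: groups file names by the prefix found above.
def group_by_prefix(file_names)
  file_names.group_by { |name| prefix_before_date(name) }
end

names = [
  "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov",
  "Test Recording Iceland 19 040407 low quality screener.mov"
]
names.each { |name| puts prefix_before_date(name).inspect }
```

Names without a six-digit run come back as nil, so they end up together under the nil key when grouped.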
The regex below:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
is what I initially used to validate email format. After finding that the format "name@email...com" was passing my tests, I copy/pasted a different piece of regex that limits the number of periods. This looks like:
EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\z/i
The main difference is the piece of regex below:
(?:\.[a-z\d\-]+)
I can't quite figure out how this bit works. Can someone break it down for me?
Notice that in this subexpression:
(?:\.[a-z\d\-]+)
The character class [a-z\d-] does not contain a period. The expression requires there to be at least one (+) of those characters after the period (\.) in order to match. Therefore, a series of periods with no letters or digits or hyphens between them won't match the repetition of the subexpression.
The problem with your regular expression here is that you're allowing for multiple dots:
/[a-z\.]+\.[a-z]+\z/
To fix this you need to make your repeating pattern more specific in terms of structure:
/(?:[a-z]+\.)+[a-z]+\z/
That means you can have one or more repeating groups of letters plus dot. That will exclude multiple dots in a row.
Do keep in mind that email addresses are getting increasingly insane with the introduction of new gTLDs that are often used without any sort of prefix. That is, example@google may be a valid address in the future. You can't expect there to be a dot in the domain.
You have [a-z\d\-]+(?:\.[a-z\d\-]+)*. The [a-z\d\-]+ part ensures that this part of the string starts with a sequence of at least one non-period character. A period is only allowed one per (?:\.[a-z\d\-]+) structure. In each (?:\.[a-z\d\-]+), the period \. is necessarily followed by [a-z\d\-]+, which includes at least one non-period character. This ensures that whenever a period appears, it has at least one non-period character on the left and on the right. In other words, consecutive periods are not allowed.
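The difference is easy to see by testing only the domain half of each pattern (the part after the separator). A minimal sketch in Ruby; the method names loose_domain? and strict_domain? are invented for this example, and the patterns are the domain portions of the two regexes from the question with \A and \z added.

```ruby
# Domain part of the first regex: periods are allowed inside the character
# class, so runs of consecutive periods slip through.
def loose_domain?(domain)
  /\A[a-z\d\-.]+\.[a-z]+\z/i.match?(domain)
end

# Domain part of the second regex: every period must be followed by at
# least one letter, digit, or hyphen.
def strict_domain?(domain)
  /\A[a-z\d\-]+(?:\.[a-z\d\-]+)*\.[a-z]+\z/i.match?(domain)
end

["email...com", "mail.example.com"].each do |domain|
  puts "#{domain}: loose=#{loose_domain?(domain)} strict=#{strict_domain?(domain)}"
end
```

With these inputs the loose pattern accepts both strings, while the strict one rejects the string with consecutive periods.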
I am relatively new to scripting, and within an InDesign script I am trying to change the first letter of every sentence to uppercase (many of them are lowercase, since I randomly generated the sentences from different text sources).
I am so far able to find the text parts with this Grep expression:
\.(\s)+\l
I also found this script by Peter Kahrel, that he shares on InDesign Secrets:
app.findGrepPreferences.findWhat = "^.";
found = app.activeDocument.findGrep();
for (i = 0; i < found.length; i++)
found[i].characters[0].changecase (ChangecaseMode.lowercase);
However, when I now replace the ^. with my own expression, and change lowercase to uppercase, the script does not work, which makes sense, since I do not want to change the first character of my findGrep results, but the last one. But how can I find the last character? The breaks between the sentences have different lengths, so I cannot simply type 2 instead of 0.
Any help would be very appreciated! Thank you!
Edit: I'm working on CS6.
Your GREP returns matches that start with a period, then have any number of spaces (including hard returns, probably), and always end with one lowercase character. So far, so good. You can access the last character (and in fact any last item in any InDesign object collection) in this way:
found[i].characters[-1].changecase (ChangecaseMode.lowercase);
which 'indexes' from the end, rather than from the start.
However! The only character in your matches, other than the period and spaces, is always going to be a lowercase letter. So you can skip the entire "how to find the correct index" thing, and probably slightly speed up the script as well, by simply applying lowercase (or, as you are using it, uppercase) to the entire match:
found[i].changecase (ChangecaseMode.lowercase);
because nothing will happen to not-lowercaseable characters (a word I declare to signify "having the Unicode-defined property of being lowercase and having an uppercase equivalent"). (Or the other way around, if I understand your purpose correctly.)
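The claim that changing the case of the whole match gives the same result as changing only its last character can be checked outside InDesign. The following is a sketch in plain Ruby, not ExtendScript; both method names are hypothetical, and [a-z] stands in for InDesign's \l.

```ruby
# Uppercase the entire match: the period and whitespace are unaffected,
# so only the trailing letter actually changes.
def capitalize_whole_match(text)
  text.gsub(/\.\s+[a-z]/) { |found| found.upcase }
end

# Uppercase only the last character of each match, indexing from the end.
def capitalize_last_char(text)
  text.gsub(/\.\s+[a-z]/) { |found| found[0...-1] + found[-1].upcase }
end

sample = "first one.  second one.\nthird one."
puts capitalize_whole_match(sample)
puts capitalize_whole_match(sample) == capitalize_last_char(sample)
```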
This seems like a simple one, but I am missing something.
I have a number of inputs coming in from a variety of sources and in different formats.
Number inputs
123
123.45
123,45 (note the comma used here to denote decimals)
1,234
1,234.56
12,345.67
12,345,67 (note the comma used here to denote decimals)
Additional info on the inputs
Numbers will always be less than 1 million
EDIT: These are prices, so will either be whole integers or go to the hundredths place
I am trying to write a regex and use gsub to strip out the thousands comma. How do I do this?
I wrote a regex: myregex = /\d+(,)\d{3}/
When I test it in Rubular, it shows that it captures the comma only in the test cases that I want.
But when I run gsub, I get an empty string: inputstr.gsub(myregex,"")
It looks like gsub is capturing everything, not just the comma in (). Where am I going wrong?
result = inputstr.gsub(/,(?=\d{3}\b)/, '')
removes commas only if exactly three digits follow.
(?=...) is a lookahead assertion: the enclosed pattern must match at the current position, but the text it matches does not become part of the overall match (and so is not replaced).
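Run against the sample inputs from the question, the lookahead version behaves as follows. This is just a sketch; the method name strip_thousands is made up for the example.

```ruby
# Remove a comma only when exactly three digits follow it.
def strip_thousands(input)
  input.gsub(/,(?=\d{3}\b)/, '')
end

inputs = ["123", "123.45", "123,45", "1,234", "1,234.56", "12,345.67", "12,345,67"]
inputs.each { |input| puts "#{input} -> #{strip_thousands(input)}" }
```

The decimal commas in 123,45 and 12,345,67 are left alone because only two digits follow them.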
You are confusing "match" with "capture": to "capture" means to save something so you can refer to it later. You want to capture not the comma, but everything else, and then use the captured portions to build your substitution string.
Try
myregex = /(\d+),(\d{3})/
inputstr.gsub(myregex,'\1\2')
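To make the match-versus-capture distinction concrete, here is a small sketch comparing the regex from the question with the capturing version. The two method names are hypothetical.

```ruby
# The question's regex: gsub replaces the WHOLE match (digits, comma,
# digits), not just the parenthesised comma.
def replace_whole_match(input)
  input.gsub(/\d+(,)\d{3}/, '')
end

# Capture the digits on both sides and write them back without the comma.
def keep_captured_digits(input)
  input.gsub(/(\d+),(\d{3})/, '\1\2')
end

puts replace_whole_match("1,234.56").inspect   # the digits vanish too
puts keep_captured_digits("1,234.56").inspect  # only the comma is dropped
```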
In your example, it is possible to tell from the number of digits after the last separator (either , or .) that it is a decimal point, since there are 2 lone digits. In most cases, if the last group of digits does not have 3 digits, you can assume that the separator in front of it is a decimal point. Another clue is that a separator appearing more than once in a big number must be a thousands separator, not a decimal point.
However, I can give a string 123,456 or 123.456 without any sort of context. It is impossible to tell whether they are "123 thousand 456" or "123 point 456".
You need to scan the document for clues as to whether , is used as the thousands separator or the decimal point, and vice versa for the period. With that context established, you can safely apply the same method to remove the thousands separators.
You may also want to check out the Wikipedia article on the less common ways to specify separators and decimal points. Knowing about them and deciding not to support them is better than assuming things will work.
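Under the asker's stated constraint (prices that are either whole numbers or carry exactly two decimal digits), the "two lone digits" rule can be sketched as below. The method name normalize_price is hypothetical, and the rule would misread inputs outside that constraint, such as 123,456 meant as a decimal.

```ruby
# Assumption: a final separator followed by exactly two digits is the
# decimal point; every other , or . is a thousands separator.
def normalize_price(input)
  match = /\A(.*)[.,](\d{2})\z/.match(input)
  whole = match ? match[1] : input
  digits = whole.delete('.,')          # drop thousands separators
  match ? "#{digits}.#{match[2]}" : digits
end

["123", "123,45", "1,234", "1,234.56", "12,345,67"].each do |input|
  puts "#{input} -> #{normalize_price(input)}"
end
```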
I am trying to create a spam filter using Regular Expressions that matches the following situation.
There is a group of exactly 8 alphanumeric characters to be matched.
It must contain 2 or more uppercase letters;
AND it must contain 2 or more lowercase letters;
AND it must contain 1 or more numbers.
So far, all I have been able to come up with is this:
(?i)[A-Za-z0-9]{8}
My code does match a mixed case group of 8, but does not force upper or lower case or specify how many times each type must occur. So, I couple it with other must-haves that are always present in the messages in question.
Here is a sample of the pattern I am trying to detect:
WbNDSk9e
This is part of a spam URL. Other groups I have seen follow the same pattern of at least 2 each UC and LC letters and 1 or more numbers and always have exactly 8 characters. I've seen no other characters or variations yet.
To my knowledge, the only switch I am able to use is to turn on Case Sensitivity, with (?i). Some of the other switches I have seen in some replies do not work in the program I use. Am I asking too much from a single line RegExpr rule?
I currently use RegEx Match to test my rules and my anti-spam program uses the same engine.
^(?=.*?[A-Z].*?[A-Z])(?=.*?[a-z].*?[a-z])(?=.*?\d).{8}$
Broken down:
(?=.*?[A-Z].*?[A-Z]) forces at least 2 upper-case letters.
(?=.*?[a-z].*?[a-z]) forces at least 2 lower-case letters.
(?=.*?\d) forces at least 1 digit.
The ^ and $ anchors force the pattern to match the whole string.
You don't want the (?i) flag because it will make it case-insensitive.
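A quick way to sanity-check the lookaheads is to run them against a few strings. The sketch below uses Ruby, with two deliberate changes that are my assumptions rather than part of the answer: \A and \z replace ^ and $ (which are line anchors in Ruby), and .{8} is narrowed to [A-Za-z0-9]{8} to enforce the "alphanumeric only" requirement from the question. The method name spam_token? is made up, and whether the asker's engine supports lookaheads is not known.

```ruby
# True when the string is exactly 8 alphanumerics with at least
# 2 uppercase letters, 2 lowercase letters, and 1 digit.
def spam_token?(candidate)
  pattern = /\A(?=.*?[A-Z].*?[A-Z])(?=.*?[a-z].*?[a-z])(?=.*?\d)[A-Za-z0-9]{8}\z/
  pattern.match?(candidate)
end

["WbNDSk9e", "WBNDSK9e", "WbNDSkxe", "WbNDSk9"].each do |candidate|
  puts "#{candidate}: #{spam_token?(candidate)}"
end
```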
I have the following
address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/,"")
Can this be done in one gsub? I would like to pass a list of patterns rather than just one pattern - they are all being replaced by the same thing.
You could combine them with an alternation operator (|):
address = '6 66-666 #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
# " 66-666  "
address = 'pancakes 6 66-666 # pancakes #99 11-23'
address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/,"")
# "pancakes 6 66-666 pancakes  "
You might want to add a little more whitespace cleanup. And you might want to switch to one of:
/\A\d*|\d*-?\d*\z|\# ?\d*/
/\A\d*|\d*-?\d*\Z|\# ?\d*/
depending on what your data really looks like and how you need to handle newlines.
Combining the regexes is a good idea--and relatively simple--but I'd like to recommend some additional changes. To wit:
address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
Of your original regexes, ^\d* and \d*-?\d*$ will always match, because they don't have to consume any characters. So you're guaranteed to perform two replacements on every line, even if that's just replacing empty strings with empty strings. Of my regexes, ^\d+ doesn't bother to match unless there's at least one digit at the beginning of the line, and \d+(?:-\d+)?$ matches what looks like an integer-or-range expression at the end of the line.
Your third regex, \# ?\d*, will match any # character, and if the # is followed by a space and some digits, it'll take those as well. Judging by your other regexes and my experience with other questions, I suspect you meant to match a # only if it's followed by one or more digits, with optional spaces intervening. That's what my third regex does.
If any of my guesses are wrong, please describe what you were trying to do, and I'll do my best to come up with the right regex. But I really don't think those first two regexes, at least, are what you want.
EDIT (in answer to the comment): When working with regexes, you should always be aware of the distinction between a regex that matches nothing and a regex that doesn't match. You say you're applying the regexes to street addresses. If an address doesn't happen to start with a house number, ^\d* will match nothing--that is, it will report a successful match, said match consisting of the empty string preceding the first character in the address.
That doesn't matter to you; you're just replacing it with another empty string anyway. But why bother doing the replacement at all? If you change the regex to ^\d+, it will report a failed match and no replacement will be performed. The result is the same either way, but the "matches nothing" scenario (^\d*) results in a lot of extra work that the "doesn't match" scenario avoids. In a high-throughput situation, that could be a life-saver.
The other two regexes bring additional complications: \d*-?\d*$ could match a hyphen at the end of the string (e.g. "123-", or even "-"); and \# ?\d* could match a hash symbol anywhere in the string, not just as part of an apartment/office number. You know your data, so you probably know neither of those problems will ever arise; I'm just making sure you're aware of them. My regex \d+(?:-\d+)?$ deals with the trailing-hyphen issue, and \# *\d+ at least makes sure there are digits after the hash symbol.
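The two complications can be seen side by side with a short sketch. The method names strip_loose and strip_tight are invented here, and the sample addresses are hypothetical.

```ruby
# The original three patterns, combined as an alternation.
def strip_loose(address)
  address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
end

# The tightened patterns: each alternative must consume at least one digit.
def strip_tight(address)
  address.gsub(/^\d+|\d+(?:-\d+)?$|\# *\d+/, "")
end

["Elm St 12-", "Main St #"].each do |address|
  puts "#{address.inspect}: loose=#{strip_loose(address).inspect} tight=#{strip_tight(address).inspect}"
end
```

The loose version eats the trailing "12-" and the bare "#", while the tight version leaves both inputs untouched.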
I think that if you combine them together in a single gsub() regex, as an alternation, it changes the context of the starting search position.
For example, each of these substitutions starts at the beginning of the result of the previous one:
s/^\d*//g
s/\d*-?\d*$//g
s/\# ?\d*//g
and this
s/^\d*|\d*-?\d*$|\# ?\d*//g
resumes search/replace where the last match left off and could potentially produce a different overall output, especially since a lot of the subexpressions search for similar if not the same characters, distinguished only by line anchors.
I think your regexes are unique enough in this case, and of course changing the order changes the result.
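One input where the two forms appear to diverge in Ruby is an address that starts with a hash. In the combined form, ^\d* succeeds with an empty match at position 0, and gsub then steps past the # before searching again, so the third alternative never sees it. This is a sketch with hypothetical method names and a made-up address.

```ruby
# Three passes, each starting from the beginning of the previous result.
def strip_sequential(address)
  address.gsub(/^\d*/, "").gsub(/\d*-?\d*$/, "").gsub(/\# ?\d*/, "")
end

# One pass with the same patterns as alternatives.
def strip_combined(address)
  address.gsub(/^\d*|\d*-?\d*$|\# ?\d*/, "")
end

address = "#99 Main St"
puts strip_sequential(address).inspect
puts strip_combined(address).inspect
```

For the sample address from the earlier answer, '6 66-666 #99 11-23', both forms give the same result, which fits the hunch that the difference only shows up in edge cases.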