Using Regex in Ruby to parse out a chunk of file names - ruby

I am trying to standardize file names in a directory that have some similarities, but are not always consistent. They are, however, standard enough.
Examples of file names (where the date is Month/Day/Year):
Weekly sales report 022213 LV.xls
Weekly sales report 091908 LV-F.xls
Weekly sales 072508.xls
Weekly U S sales V1.0 061308.xls
Weekly U.S. Sales Jan0606.xls
My current solution has been an effective but ugly find and replace for any possible string combinations:
x.gsub!(/^Weekly|sales|report|U S|U.S.|\s/,'')
However, I would assume that there would be a way to look at the file name string and grab the chunk that has all of the date information. This would be the chunk bounded by whitespace on the left and ends in at least 4 digits. Is there a straightforward way to accomplish this?

Your requirement as stated would suggest the following:
date_portion = x.match(/\s(\S*\d{4,8})/)[1]
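That's: match one whitespace char, then capture zero-or-more non-whitespace chars followed by 4 to 8 digits; return the captured text.
Note that match returns nil when nothing matches, so calling [1] on the result raises NoMethodError for such names. If that can happen, x[/\s(\S*\d{4,8})/, 1] returns nil instead of raising.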
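For comparison, here is a minimal Python sketch of the same pattern run against the file names from the question. It is not part of the original answer; the helper name date_chunk is made up for illustration, and it returns None rather than raising when a name has no matching chunk.

```python
import re
from typing import Optional

# Same pattern as the Ruby answer: one whitespace character, then a captured
# chunk of non-whitespace characters that ends in 4 to 8 digits.
DATE_CHUNK = re.compile(r"\s(\S*\d{4,8})")


def date_chunk(file_name: str) -> Optional[str]:
    """Return the date-bearing chunk of file_name, or None if there is none."""
    match = DATE_CHUNK.search(file_name)
    return match.group(1) if match else None


names = [
    "Weekly sales report 022213 LV.xls",
    "Weekly sales report 091908 LV-F.xls",
    "Weekly sales 072508.xls",
    "Weekly U S sales V1.0 061308.xls",
    "Weekly U.S. Sales Jan0606.xls",
]
for name in names:
    print(f"{name!r} -> {date_chunk(name)!r}")
```

The last name yields Jan0606 rather than a purely numeric date, so a second normalization step would still be needed for names like that.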

Related

How can I identify 6 consecutive numbers in a string?

AppleScript noob here. I'm trying to identify a date format in filenames and return the characters immediately preceding the date. The date is formatted in the files as just 6 consecutive digits. The data before the date indicates the length of the file and is also numeric. These files will never have 6 or more consecutive digits except for the date, so I don't have to worry about false positives. What I need to do is find the 6 consecutive digits so I can use that to find the data before the date and group all those files together.
ex:
Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov
Initially it seemed like the numbers preceding the date had set values that I could have the code look out for with
if fileName contains "29" then
but now I'm stumped on how to approach this. My general idea was the following:
Looks like something’s eaten the last part of your question. At any rate, AppleScript is not the best language for text processing, but whatever language you use the standard technique is regular expression-based pattern matching.
For example, to match six digits you’d use the pattern \d{6}. The \d pattern matches any digit, the {6} matches the preceding pattern exactly six times.
If you want to extract the text from the start of a line up to the six digits, you’d use something like (?-s)^(.+?)\d{6}. The ^ matches the start of each line. The .+? matches one or more characters (.+) only up to the next pattern match (?); grouping it in parens extracts the matched text. By default, the . pattern matches any character including a line break, so add (?-s) to the start of the pattern to turn off the line break matching (-s).
Bit cryptic, but very powerful and you’ll get the hang with a bit of practice. Tons of online docs and examples too; just search for “PCRE regular expression”. (Tip: build it up one pattern at a time, testing at every step.)
AppleScript doesn’t have built-in support for regular expressions, but it can use Cocoa’s NSRegularExpression class via the AppleScript-ObjC bridge. The syntax isn’t very friendly so you may want to use a library that wraps it for you:
use script "Text"
set theText to "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov"
search text theText for "^(.+?)\\d{6}" using pattern matching
returns:
{{class:matched text, startIndex:1, endIndex:39, foundText:"Barry_Waterson_Speech_1955_27.02_012219", foundGroups:{{class:matched group, startIndex:1, endIndex:33, foundText:"Barry_Waterson_Speech_1955_27.02_"}}},
{class:matched text, startIndex:67, endIndex:98, foundText:"Test Recording Iceland 19 040407", foundGroups:{{class:matched group, startIndex:67, endIndex:92, foundText:"Test Recording Iceland 19 "}}}}
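The same pattern can be tried in any regex-capable language. As a rough sketch (in Python rather than AppleScript, and not using the Text library above), the following extracts the prefix before the six-digit date from each line and groups file names by that prefix. The names PREFIX and groups are illustrative only.

```python
import re
from collections import defaultdict

# ^ anchors at the start of each line because of re.MULTILINE; .+? lazily
# captures everything before the first run of six digits. In Python "." does
# not match a newline unless re.DOTALL is given, so no (?-s) is needed here.
PREFIX = re.compile(r"^(.+?)\d{6}", re.MULTILINE)

text = (
    "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov\n"
    "Test Recording Iceland 19 040407 low quality screener.mov"
)

# Captured group 1 of every match: the text before the date.
prefixes = [m.group(1) for m in PREFIX.finditer(text)]
print(prefixes)

# Group the file names by the prefix that precedes the date.
groups = defaultdict(list)
for file_name in text.splitlines():
    m = PREFIX.match(file_name)
    if m:
        groups[m.group(1)].append(file_name)
print(dict(groups))
```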

Extract Tweet ID from text

I have a large, 4.5M+ row CSV (commas are the separators) containing tweets. The CSV comes from some time ago, and has all manner of line breaks inside column data, characters, etc. It is likely malformed in some ways but it is difficult for me to discern exactly where and how with a file of this size.
I want to move through this CSV file as a large body of text, pull out all the Tweet IDs, and put each pulled ID into a line in a new file.
Doing this via bash, Perl, or Python will work fine. Can anyone help here? I can't seem to even find info on the parameters for a tweet ID, though the ones in this corpus seem to all be 17 digits long.
Since the only evidence for a Tweet ID in your question is that it's an integer of length 17, that is the only rule I am going to use.
Plus, I am going to use it as a hard-and-fast rule: anything that is an integer of length 17 is a Tweet ID, nothing else.
After that it's a normal regular expression search.
import re
string = '''
12345678912345678, abcd, efgh
45645645645645645, ijkl, mnop
78944556677889900, qrst, uvwx
0, y, z
'''
m = re.findall('[0-9]{17}', string)
print(m)
re.findall searches the string (second argument) for every match of the regular expression (first argument) and returns the matches as a list.
(a): [0-9] matches any single digit from 0 to 9.
(b): {m} means the expression that precedes it must repeat exactly m times.
(a)+(b): [0-9]{17} matches a run of 17 digits, i.e. a number of length 17. Note that it also matches the first 17 digits of any longer run of digits.
See the documentation for Python's re module for more.
This is as much as I can help without knowing more about the input file and tweet format.
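For a 4.5M-row file it may be more comfortable to stream the input line by line and write each ID to a new file as it is found. The sketch below is one way to do that under the same assumption (an ID is a run of exactly 17 digits); the function names and the file paths passed to extract_to_file are placeholders, and the lookarounds are an addition that rejects 17-digit slices of longer numbers.

```python
import re
from typing import Iterable, Iterator

# Assumption taken from the question: a Tweet ID is exactly 17 digits.
# The lookarounds make sure the run is not part of a longer number.
TWEET_ID = re.compile(r"(?<!\d)\d{17}(?!\d)")


def iter_tweet_ids(lines: Iterable[str]) -> Iterator[str]:
    """Yield every 17-digit run found in the given lines, in order."""
    for line in lines:
        yield from TWEET_ID.findall(line)


def extract_to_file(src_path: str, dst_path: str) -> None:
    """Treat the CSV as plain text and write one ID per line to dst_path."""
    # errors="replace" keeps going if the old file contains bad bytes.
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for tweet_id in iter_tweet_ids(src):
            dst.write(tweet_id + "\n")


sample = [
    "12345678912345678, abcd, efgh",
    "123456789123456789, too long, skipped",
    "0, y, z",
]
print(list(iter_tweet_ids(sample)))
```

Because the file object is iterated lazily, only one line is held in memory at a time, and stray line breaks inside column data do not matter since the CSV structure is never parsed.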

Remove all letters from file names on a mac

I have about 300 photos whose names are a string of letters and numbers. I'm planning to data merge these into an InDesign document and would like to use the numbers only, as these IDs are related to customers and their quotes.
I need to remove every letter, space, or symbol from the file names, leaving only the ID number. Is this possible?
Here's an example of the file names:
148132durrnt-photojosh.jpg, 173722dumellphotojosh.jpg, 173816mxwell.jpg, 176764very.jpg, 176876pyumo.jpg, 178054plnt.jpg, engll170774pijosh.jpg, entley166282pijosh.jpg, hodgkinson169226pijosh.jpg
So there's a mixture of some with the number at the start, some at the end, some in the middle, and a couple don't actually have numbers at all, so those should be ignored.
I don't know of a way to do this other than manually, and wanted to try to save some time.
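One possible approach, sketched here in Python as a dry run that only computes the new names and renames nothing: strip every non-digit character from the part of the name before the extension, and skip names that contain no digits. The helper names and the file nonumber.jpg are made up for illustration.

```python
import os
import re
from typing import Dict, Iterable, Optional


def digits_only_name(file_name: str) -> Optional[str]:
    """Return file_name reduced to its digits plus extension, or None."""
    stem, ext = os.path.splitext(file_name)
    digits = re.sub(r"\D", "", stem)  # drop everything that is not a digit
    return digits + ext if digits else None


def plan_renames(file_names: Iterable[str]) -> Dict[str, str]:
    """Map each old name to its new name, ignoring names without digits."""
    plan = {}
    for name in file_names:
        new_name = digits_only_name(name)
        if new_name is not None:
            plan[name] = new_name
    return plan


examples = [
    "148132durrnt-photojosh.jpg",
    "engll170774pijosh.jpg",
    "hodgkinson169226pijosh.jpg",
    "nonumber.jpg",  # hypothetical name with no ID; it is skipped
]
for old, new in plan_renames(examples).items():
    print(old, "->", new)
```

Before applying such a plan to real files it would be worth checking that no two old names map to the same new name.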

BASH - How to delete all numerals from a text file, unless they are part of a specific string?

I have a text file, and I want to delete all the numerals in it. However, there are two key strings, "9/11" and "September 11", in which I want to keep the numerals. How can I delete all the numerals except when they are part of these key strings?
I use sed 's/[0-9]*//g' to get rid of the numerals. So for now, the sample text before processing would be something like this:
12 Aug. 2002, News Section. 9/11 was a terrible tragedy for the nation, in which 2,500 ...
And I want the file after processing to look like this:
Aug. , News Section. 9/11 was a terrible tragedy for the nation, in which ...
I tried searching for the answer, but to no avail. Thanks in advance for any suggestions.
This will do the job. It works by capturing the parts you want to keep and merely matching the parts you want to remove. Every match is replaced with the contents of group 1, so the captured key strings are put back unchanged, while any other digit is replaced with an empty group 1 and so disappears. Note that \| and \b are GNU sed extensions, so this may need adjusting for other sed implementations.
sed 's~\(\b9/11\b\|\bSeptember 11\b\)\|[[:digit:]]~\1~g' file
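The capture-and-put-back idea is not specific to sed. As an illustration only, here is a small Python sketch of the same technique; the function name strip_digits_except_keys is made up, and the replacement callback plays the role of \1.

```python
import re

# Alternative 1 (captured): the key strings to keep.
# Alternative 2 (not captured): any single digit, which gets removed.
PATTERN = re.compile(r"(\b9/11\b|\bSeptember 11\b)|\d")


def strip_digits_except_keys(text: str) -> str:
    """Delete all digits except those inside '9/11' and 'September 11'."""
    # group(1) is None when only a lone digit matched, so it becomes "".
    return PATTERN.sub(lambda m: m.group(1) or "", text)


line = ("12 Aug. 2002, News Section. 9/11 was a terrible tragedy "
        "for the nation, in which 2,500 ...")
print(strip_digits_except_keys(line))
```

Because the key strings are listed first in the alternation, the regex engine tries them before falling back to the single-digit branch at each position.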

Finding date in file, getting data after it

Help me brainstorm how I would solve this problem.
I have a file of dates with corresponding data, the format looks like this:
Date,data,data,data,data,data
Date,data,data,data,data,data
It's a plain csv file, only commas being used.
I need to be able to select a beginning date. And then get the data for the next 20 days beginning with the date selected.
Date format:
2007.05.21 (y,m,d)
So I think it would be best to search for the date, either by loading the entire file into memory first or by reading it line by line. The file is only 1 megabyte, but I might want to do this with a 100 megabyte file as well. Is that still small?
Also, I will want to do this many times, so I think I may want to keep the file in memory for the entire run of the program so I can access it repeatedly.
After finding the date, I need to be able to get column 2 of day 1, column 4 of day 4, etc. There is always the same number of columns for each day, so I guess if this is loaded into some kind of array I can always know at which array index each following day starts.
Any help would be greatly appreciated. Also any code examples provided would really help. This is not a homework problem or anything like that and I'm really new to programming.
You can use the csv library to parse your file line by line, like this:
require 'csv'
require 'date'

date_to_search = Date.new(2009, 10, 10)
# CSV.foreach yields one row at a time (CSV.read does not take a block)
CSV.foreach('yourfilename.txt', col_sep: ',') do |row|
  # row will be an array of strings which you can parse
  cur_date = Date.strptime(row[0], '%Y.%m.%d') # dates look like 2007.05.21
  if cur_date == date_to_search
    # you are set to read the next 19 lines
    # you can keep a counter and increment it after parsing each line (row here)
  end
  # compare and check if you need this line (and the next 19)
  # other calculations
end
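Since the question mentions keeping the file in memory and looking dates up repeatedly, here is a rough sketch of that variant in Python. It assumes one row per day with no gaps, so that 20 rows correspond to 20 days; the function names and the sample data are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def load_rows(handle) -> List[List[str]]:
    """Read every non-empty CSV row into a list (assumed: one row per day)."""
    return [row for row in csv.reader(handle) if row]


def build_index(rows: List[List[str]]) -> Dict[str, int]:
    """Map each date string (first column) to its position in rows."""
    return {row[0]: i for i, row in enumerate(rows)}


def window(rows, index, start_date: str, days: int = 20) -> List[List[str]]:
    """Return up to `days` consecutive rows beginning at start_date."""
    start = index.get(start_date)
    return [] if start is None else rows[start:start + days]


# Made-up sample data standing in for the real file.
sample = io.StringIO(
    "2007.05.21,1,2,3,4,5\n"
    "2007.05.22,6,7,8,9,10\n"
    "2007.05.23,11,12,13,14,15\n"
    "2007.05.24,16,17,18,19,20\n"
)
rows = load_rows(sample)
index = build_index(rows)
days = window(rows, index, "2007.05.21")
print(days[0][1])  # column 2 of day 1
print(days[3][3])  # column 4 of day 4
```

The dictionary makes each date lookup a constant-time operation after the file has been read once, which suits many repeated queries.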
Assuming your data is sorted by date, binary search is what you want to use.
Simply put, you look up an element near the middle of your CSV, compare its date to the one you're looking for, and continue recursively in the matching half of the file (see the Wikipedia article on binary search for details).
Binary search has a runtime complexity of O(log n), which means that the number of read operations on a file containing 1,000,000 lines (a reasonable estimate for 100 MB) will not exceed about 20 under normal circumstances (that is, when lines of different lengths are evenly distributed), since log2(1,000,000) is roughly 19.9.
Therefore, there is no need to keep the file in memory; quite the contrary. The operating system's disk cache will accelerate consecutive operations for you without running into memory shortage.
To read and process a line, you first need to find its first character, which is either the first character after a newline (\n) or the beginning of the file. Reading multiple lines can be achieved similarly.
To parse a line, I suggest you split the line at the separator characters and/or the date's dots. This is, of course, only appropriate if the CSV comes from a trustworthy source and never changes its layout.
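The steps above can be sketched as follows. This is only an illustration in Python under stated assumptions: the file is sorted by date, is opened in binary mode, and uses zero-padded dates such as 2007.05.21 so that comparing the raw bytes orders them correctly. The names seek_to_date and rows_from are made up.

```python
import io
from typing import BinaryIO, List


def seek_to_date(f: BinaryIO, target: bytes) -> int:
    """Return the offset of the first line whose date field is >= target."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid > 0:
            f.readline()  # discard the line we landed in the middle of
        line = f.readline()  # first line starting after mid
        if not line or line.split(b",", 1)[0] >= target:
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo > 0:
        f.readline()
    return f.tell()


def rows_from(f: BinaryIO, start_date: str, count: int = 20) -> List[List[str]]:
    """Read up to `count` rows starting at the line found for start_date."""
    f.seek(seek_to_date(f, start_date.encode("ascii")))
    rows = []
    for _ in range(count):
        line = f.readline()
        if not line:
            break
        rows.append(line.decode("ascii").rstrip("\r\n").split(","))
    return rows


# Made-up data: one line per day for May 2007, held in memory for the demo.
data = b"".join(
    f"2007.05.{d:02d},{d},{d * 2}\n".encode("ascii") for d in range(1, 31)
)
demo = io.BytesIO(data)
print(rows_from(demo, "2007.05.21", count=2))
```

If the requested date is missing from the file, this sketch starts at the next later date, so a caller that needs an exact match should compare the first returned row's date with the one requested.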
