I have a massive (400 MB) CSV file that I need to upload into a database.
The problem is that some lines contain 16 commas (",") and some 17.
I need to find the lines that contain 17 commas so that I can fix them (shouldn't be that many).
Is there a way to search in Sublime Text for every line that contains a particular character a given number of times?
This is a job for regular expressions!
Instructions on activating regexes in Sublime Text
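You want the regex (.*,){17} - i.e., seventeen instances of any old nonsense followed by a comma. On a file this size, ^([^,\n]*,){17} finds the same lines with far less backtracking, because each group can no longer run across a comma.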
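If the editor struggles with a file that size, the same check can be run outside it. Here is a minimal Python sketch, assuming one record per line and no quoted fields that contain commas. The function name and the sample rows are made up for illustration.

```python
def lines_with_comma_count(lines, wanted=17):
    """Yield (line_number, line) for every line holding exactly `wanted` commas."""
    for number, line in enumerate(lines, start=1):
        if line.count(",") == wanted:
            yield number, line.rstrip("\n")


# Small in-memory demo: one row with 16 commas, one with 17.
demo = ["a," * 16 + "end\n", "b," * 17 + "end\n"]
for number, line in lines_with_comma_count(demo):
    print(number, line)
```

Passing an open file handle instead of the list reads the file one line at a time, so the whole 400 MB never has to sit in memory.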
Related
I am using the Linux command pdftotext -layout *.pdf to extract text from some pdf files, for data mining. The resultant text files all reside in a single folder, but they need some pre-processing before they can be used.
Issues
Issue 1: The first value of each row in each file that I am trying to access is a barcode, which can be either a 13-digit GTIN code, or a 5-digit PLU code. The problem here is that the GTIN codes are followed by only a single space character, which is hard to replace with a script, as each row also contains a description field which, naturally, also contains single spaces between words. Here I will need to replace a set of 13 numerals plus a space with the same 13 numerals plus two spaces (at least), so that a later stage of the pre-processing can replace all multiple spaces with a tab character.
Issue 2: Another problem I am facing with this pre-processing is the newlines. There are many blank lines between data rows. Some are single blank lines between the data rows, and some are two or more lines. I want to end up with no blank lines between the data rows, with each row terminated by a single newline character.
Issue 3: The final resulting files each need to be tab separated value files, for importing into a spreadsheet. Some of the descriptions in the data rows may contain commas, so I am using TSV rather than CSV files. I only need a single tab between each value in the row.
Sample rows
(I have replaced spaces with • and newlines with ¶ characters here for clarity.)
9415077026340•Pams•Sour•Cream•&•Chives•Rice•Crackers•100g•••$1.19¶
¶
¶
9415077026296•Pams•BBQ•Chicken•Rice•Crackers•100g•••$1.19¶
¶
61424••••••••••••Yoghurt•Raisins•kg•••$23.90/kg¶
¶
9415077036349•Pams•Sliced•Peaches•In•Juice•410g•••$1.29¶
Intended result
(I have also replaced tabs with ⇥ characters here for clarity.)
9415077026340⇥Pams•Sour•Cream•&•Chives•Rice•Crackers•100g⇥$1.19¶
9415077026296⇥Pams•BBQ•Chicken•Rice•Crackers•100g⇥$1.19¶
61424⇥Yoghurt•Raisins•kg⇥$23.90/kg¶
9415077036349⇥Pams•Sliced•Peaches•In•Juice•410g⇥$1.29¶
What have I tried?
I am slowly learning more about the various Linux script utilities such as sed / grep / awk / tr, etc. There are many solutions posted in StackOverflow which resolve some of the issues that I am facing, but they are disparate and confusing when I attempt to string them all together in the way that I need them. Some are "close, but not quite" solutions, such as replacing all double newlines with a single newline between each data row. I don't need the extra row between them. I have been looking and trying several different options that are close to what I need. It would be helpful if someone could propose a solution which uses a single utility, such as sed, to solve all of the issues at once.
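For reference, the three issues can be handled in one pass. This is a minimal sketch in Python rather than sed, assuming every data row sits on a single line and that columns other than the GTIN are already separated by two or more spaces, as in the sample rows. The function name to_tsv is hypothetical.

```python
import re


def to_tsv(text):
    """Turn `pdftotext -layout` output into tab-separated rows."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Issue 2: drop blank lines between data rows
        # Issue 1: widen the single space that follows a 13-digit GTIN
        line = re.sub(r"^(\d{13}) (?=\S)", r"\1  ", line)
        # Issue 3: every run of two or more spaces becomes one tab
        rows.append(re.sub(r" {2,}", "\t", line))
    return "".join(row + "\n" for row in rows)
```

The order matters: the GTIN step has to run before the multiple-space step, otherwise the barcode stays glued to the description.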
AppleScript noob here. I'm trying to identify a date format in filenames and return the characters immediately preceding the date. The date is formatted in the filenames as just 6 consecutive digits. The data before it indicates the length of the file and is also numeric. These filenames will never have 6 or more consecutive digits except for the date, so I don't have to worry about false positives. What I need to do is find the 6 consecutive digits so I can use them to find the data before the date and group all those files together.
ex:
Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov
Initially it seemed like the numbers preceding the date had set values that I could have the code look out for with
if fileName contains "29" then
but now I'm stumped on how to approach this. My general idea was the following:
Looks like something’s eaten the last part of your question. At any rate, AppleScript is not the best language for text processing, but whatever language you use the standard technique is regular expression-based pattern matching.
For example, to match six digits you’d use the pattern \d{6}. The \d pattern matches any digit; the {6} matches the preceding pattern exactly six times.
If you want to extract the text from the start of a line up to the six digits, you’d use something like (?-s)^(.+?)\d{6}. The ^ matches the start of each line. The .+? matches one or more characters (.+) only up to the next pattern match (?); grouping it in parens extracts the matched text. By default, the . pattern matches any character including a line break, so add (?-s) to the start of the pattern to turn off the line break matching (-s).
Bit cryptic, but very powerful and you’ll get the hang with a bit of practice. Tons of online docs and examples too; just search for “PCRE regular expression”. (Tip: build it up one pattern at a time, testing at every step.)
AppleScript doesn’t have built-in support for regular expressions, but it can use Cocoa’s NSRegularExpression class via the AppleScript-ObjC bridge. The syntax isn’t very friendly so you may want to use a library that wraps it for you:
use script "Text"
set theText to "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov"
search text theText for "^(.+?)\\d{6}" using pattern matching
returns:
{{class:matched text, startIndex:1, endIndex:39, foundText:"Barry_Waterson_Speech_1955_27.02_012219", foundGroups:{{class:matched group, startIndex:1, endIndex:33, foundText:"Barry_Waterson_Speech_1955_27.02_"}}},
{class:matched text, startIndex:67, endIndex:98, foundText:"Test Recording Iceland 19 040407", foundGroups:{{class:matched group, startIndex:67, endIndex:92, foundText:"Test Recording Iceland 19 "}}}}
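The same pattern can be tried outside AppleScript while you are building it up. This is a minimal sketch using Python's re module instead of the Text library. The helper name prefixes_before_date is made up for illustration.

```python
import re

# In Python's re module "." does not match a line break by default, so no
# (?-s) prefix is needed; re.MULTILINE lets ^ match at the start of every line.
PREFIX_BEFORE_DATE = re.compile(r"^(.+?)\d{6}", re.MULTILINE)


def prefixes_before_date(text):
    """Return the text preceding the first six-digit run on each line."""
    return [match.group(1) for match in PREFIX_BEFORE_DATE.finditer(text)]


names = (
    "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov\n"
    "Test Recording Iceland 19 040407 low quality screener.mov"
)
print(prefixes_before_date(names))
# ['Barry_Waterson_Speech_1955_27.02_', 'Test Recording Iceland 19 ']
```

The captured groups match the foundGroups values in the AppleScript result above.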
I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42
Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, and if there are multiple blank lines after.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly, I'm sure there's a more succinct way to write that out.
Thanks.
Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i
Notes:
This requires at least one [-\w\s*.:',] character before the next newline.
I changed the issue IDs to allow one optional comma, space, and newline after each, in that order (up to one of each). If your engine supports lookaheads, the (?=[, \n]) makes sure each issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
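One thing to watch with any of these patterns: \s inside a character class also matches \n, so a description class containing \s can still absorb extra blank lines. Here is a stricter variant as a minimal Python sketch. It is not the regex above, and it assumes all issue IDs sit on the first line, separated by a comma and an optional space. The names ISSUE_ID, COMMIT_FORMAT and is_valid_commit_message are hypothetical.

```python
import re

ISSUE_ID = r"[a-z]{2,4}-\d{2,5}"

COMMIT_FORMAT = re.compile(
    # IDs on the first line, then exactly one blank line; the lookahead (?=\S)
    # stops the description from starting with further whitespace or newlines.
    rf"\A{ISSUE_ID}(?:, ?{ISSUE_ID})*\n\n(?=\S)[-\w\s*.:',]+\Z",
    re.IGNORECASE,
)


def is_valid_commit_message(message):
    """Return True if the message has IDs, one blank line, then a description."""
    return COMMIT_FORMAT.match(message) is not None
```

Because the pattern is anchored with \A and \Z, a message with no blank line or with two blank lines after the IDs is rejected rather than partially matched.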
I have a big file which is tab separated. The biggest problem is that I need to import the data into a database, but some of the columns are multi-line, which is causing some problems. What I would like to do is convert the file into a proper comma-separated file using bash.
Here is the example of the file (I will substitute the tabs with pipe |):
1|Some text|another text|12| Some big big big
text with lots of data and multiple lines
and commas|34|34
2|Some text|another text||Another big big big big
text with lots of characters like , and tab|33|25
In the above example there are basically two lines of data. What I would like to have is:
"1","Some text","another text","12"," Some big big big
text with lots of data and multiple lines
and commas","34","34"
"2","Some text","another text","","Another big big big big
text with lots of characters like , and tab","33","25"
In vim I can see that each full line of data (with multiple line column) is terminated by ^M$ so it looks like this:
1|Some text|another text|12| Some big big big
text with lots of data and multiple lines
and commas|34|34^M$
2|Some text|another text||Another big big big big
text with lots of characters like , and tab|33|25^M$
This is really tricky, and it depends on executing the right sequence of substitutions. The following seems to work (at least on your given example):
" Enclose non-multiline lines with quotes.
:g/\t/s/^\|\(\r\)$/"\1/g
" Undo the ending quote before / the beginning quote after a multiline.
:v/\t/-1s/"$//
:v/\t/+1s/^"//
" Undo the beginning quote after an incomplete (i.e. no ^M) previous record.
:g/\t/-1s/\r\@<!\n\zs"//
" Replace tabs with quotes and commas.
:%s/\t/","/g
" Finally, remove the ^M end-of-record marker.
:%s/\r$//
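Outside vim, the ^M markers make the job fairly mechanical, because they separate real record ends from line breaks inside a field. This is a minimal Python sketch, assuming every record ends in CR LF and that line breaks inside a field are bare LF. The function name tsv_records_to_csv is made up for illustration.

```python
import csv
import io


def tsv_records_to_csv(raw):
    """Convert tab-separated records terminated by CR LF into fully quoted CSV.

    Bare LF characters are kept as line breaks inside the quoted field.
    """
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for record in raw.split("\r\n"):
        if record:  # skip the empty piece after the final terminator
            writer.writerow(record.split("\t"))
    return out.getvalue()
```

When reading the real file, open it with newline="" so Python does not translate the CR LF terminators before the split. The csv module also doubles any double quotes found inside a field, which the substitution sequence above does not attempt.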
I need to do a find and replace, where I need to replace 2 lines at time. Does anyone know how to do this in the VS2008 IDE?
To clarify, I want to replace 2 lines with 1 line.
Many thanks
Thanks to František Žiačik for the answer to this one.
To perform a find/replace across multiple lines, you need to switch on regular expressions and use a line break (\n) between your lines, with :b* at the start of each line to handle any leading tabs or spaces.
So to find:
line one
line two
you would search for ":b*line one\n:b*line two" (without the quotes)
Try the Multiline Search and Replace macro for Visual Studio.
You can activate the 'Use regular expressions' in the find dialog, and use \n to match a newline.
In your case, type FirstLine\n:Zs*SecondLine.
:Zs* skips leading blanks on line 2.
For example ,\n:Zs*c matches a comma, newline, any number of blanks, a 'c'.