I need to find some specific characters in a cell and replace them with other characters.
So far I can do that by using:
=SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(A1,"★","•",0),"<b>","",0),"</b>","",0),"✔ ","",0)
However, this formula will become very long if I need to replace a lot of characters. Is there any way to reduce the duplicated parts, especially when I need to replace several different strings with the same one? Ex: replace <b>, </b>, ✔ with "" as in the example above.
Demo sheet: https://docs.google.com/spreadsheets/d/1wX9mEykCMjeotTRTg_jSMcm9Mm7WM0kPTetRwGLzaYU/edit#gid=0
Google Sheets (but not Excel) has a handy formula, REGEXREPLACE, that will let you do what you need:
=SUBSTITUTE(REGEXREPLACE(A1,"<b>|</b>|✔",""),"★","•")
If you need to remove any more characters, just add them after the checkmark, separated by |.
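The same alternation idea can be sketched outside Sheets. The following is a minimal Python illustration, not part of the original answer; the function name `clean_cell` is hypothetical. It removes every alternative in the pattern in one pass, then swaps the star for a bullet, mirroring the formula above.

```python
import re

# Alternation: any one of these substrings is removed in a single pass.
# To remove more strings, append them to the pattern separated by "|".
REMOVE_PATTERN = re.compile(r"<b>|</b>|✔")


def clean_cell(text: str) -> str:
    """Strip the unwanted markers, then replace the star with a bullet."""
    stripped = REMOVE_PATTERN.sub("", text)
    return stripped.replace("★", "•")


if __name__ == "__main__":
    print(clean_cell("★ <b>Hello</b> ✔world"))
```

Characters that have a special meaning in regular expressions (for example `.` or `*`) would need escaping before being added to the pattern.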
I am using the Linux command pdftotext -layout *.pdf to extract text from some pdf files, for data mining. The resultant text files all reside in a single folder, but they need some pre-processing before they can be used.
Issues
Issue 1: The first value of each row in each file is a barcode, which can be either a 13-digit GTIN code or a 5-digit PLU code. The problem is that each GTIN code is separated from the next field by only a single space character, which is hard to target with a script because each row also contains a description field that naturally has single spaces between words. I need to replace a run of 13 numerals plus a space with the same 13 numerals plus at least two spaces, so that a later stage of the pre-processing can replace all multiple spaces with a tab character.
Issue 2: Another problem I am facing with this pre-processing is the newlines. There are many blank lines between data rows. Some are single blank lines between the data rows, and some are two or more lines. I want to end up with no blank lines between the data rows, but each row will be delineated by a newline character.
Issue 3: The final resulting files each need to be tab separated value files, for importing into a spreadsheet. Some of the descriptions in the data rows may contain commas, so I am using TSV rather than CSV files. I only need a single tab between each value in the row.
Sample rows
(I have replaced spaces with • and newlines with ¶ characters here for clarity.)
9415077026340•Pams•Sour•Cream•&•Chives•Rice•Crackers•100g•••$1.19¶
¶
¶
9415077026296•Pams•BBQ•Chicken•Rice•Crackers•100g•••$1.19¶
¶
61424••••••••••••Yoghurt•Raisins•kg•••$23.90/kg¶
¶
9415077036349•Pams•Sliced•Peaches•In•Juice•410g•••$1.29¶
Intended result
(I have also replaced tabs with ⇥ characters here for clarity.)
9415077026340⇥Pams•Sour•Cream•&•Chives•Rice•Crackers•100g⇥$1.19¶
9415077026296⇥Pams•BBQ•Chicken•Rice•Crackers•100g⇥$1.19¶
61424⇥Yoghurt•Raisins•kg⇥$23.90/kg¶
9415077036349⇥Pams•Sliced•Peaches•In•Juice•410g⇥$1.29¶
What have I tried?
I am slowly learning more about the various Linux script utilities such as sed / grep / awk / tr, etc. There are many solutions posted in StackOverflow which resolve some of the issues that I am facing, but they are disparate and confusing when I attempt to string them all together in the way that I need them. Some are "close, but not quite" solutions, such as replacing all double newlines with a single newline between each data row. I don't need the extra row between them. I have been looking and trying several different options that are close to what I need. It would be helpful if someone could propose a solution which uses a single utility, such as sed, to solve all of the issues at once.
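For reference, the three issues can be handled in one pass over each file. The following is a rough sketch in Python rather than sed, under the assumptions that every data row starts with either a 13-digit or a 5-digit code and that fields other than the barcode are separated by two or more spaces, as in the sample rows. The function name `rows_to_tsv` is hypothetical.

```python
import re

# Leading barcode: a 13-digit GTIN or a 5-digit PLU, followed by whitespace.
BARCODE = re.compile(r"^(\d{13}|\d{5})\s+")
# Any run of two or more spaces is treated as a field separator.
MULTI_SPACE = re.compile(r" {2,}")


def rows_to_tsv(text: str) -> str:
    """Drop blank lines and convert each data row to tab-separated values."""
    out = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Issue 2: skip blank lines between data rows
        # Issue 1: the barcode may be followed by a single space only,
        # so it gets its tab explicitly.
        line = BARCODE.sub(lambda m: m.group(1) + "\t", line)
        # Issue 3: remaining runs of spaces become a single tab.
        line = MULTI_SPACE.sub("\t", line)
        out.append(line)
    return "".join(row + "\n" for row in out)


if __name__ == "__main__":
    sample = (
        "9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19\n"
        "\n\n"
        "61424            Yoghurt Raisins kg   $23.90/kg\n"
    )
    print(rows_to_tsv(sample), end="")
```

Single spaces inside the description are left untouched because only the leading barcode and runs of two or more spaces are rewritten.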
I have a Google Sheets document with data in this format:
Some data 10:5 Somemore Data
I am trying to separate the text from the numbers in separate columns based on the colon sign so that the output looks like this:
Some data | 10 | 5 | Somemore Data
I tried the SPLIT and RIGHT/LEFT functions but I can't get it to work.
This is what I have so far
=LEFT(C2,FIND(":",C2)-3)
This separates the text on the LEFT but using it on the right side doesn't work. My formula also doesn't separate the numbers. Looking for a formula that can achieve the above desired result.
My spreadsheet - https://docs.google.com/spreadsheets/d/1EmL4kzCGxRbwvNJntwMokqgt8yjjAqnZuUidTbZe6Z8/edit?usp=sharing
Thanks.
There is already a solution in your shared sheet with SPLIT and REGEXREPLACE.
Here is one a bit simpler with REGEXEXTRACT:
=ARRAYFORMULA(IF(A2:A="", "", REGEXEXTRACT(A2:A,"^(.+?)[ ]+(\d+)[ ]*:[ ]*(\d+)[ ]+(.+)$")))
Every group will be a cell in a row to the right.
Regex description and demo: link.
Edit: stripped spaces. You have some nasty characters in your strings: non-breaking spaces (nbsp, char 160), which look identical to regular spaces. That is why a simpler regex (^(.+?)\s+(\d+)\s*:\s*(\d+)\s+(.+)$) did not work: \s does not match nbsp here. Thus [ ] (containing both an nbsp and a regular space) instead of just \s.
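To make the grouping easier to see, here is a small Python sketch of the same pattern; it is an illustration only, and the function name `split_record` is hypothetical. Python's `\s` already matches a non-breaking space in text strings, so the explicit character class is used here purely to mirror the formula above.

```python
import re

# "[ \u00a0]" matches a regular space or a non-breaking space (char 160).
RECORD = re.compile(
    r"^(.+?)[ \u00a0]+(\d+)[ \u00a0]*:[ \u00a0]*(\d+)[ \u00a0]+(.+)$"
)


def split_record(cell: str):
    """Return (text, number, number, text) or None if the cell does not match."""
    match = RECORD.match(cell)
    return match.groups() if match else None


if __name__ == "__main__":
    # The first separator is a non-breaking space, the second a regular one.
    print(split_record("Some data\u00a010:5 Somemore Data"))
```

Each captured group corresponds to one output cell in the Sheets version.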
Applescript noob, I'm trying to identify a date format in filenames, and return the characters immediately preceding the date. The way the date is formatted in the files is just 6 consecutive numbers. The data before that is an indication of the length of the file and are also numbers. These files will never have 6 or more consecutive numbers, except for the date, so I don't have to worry about false positives. What I need to do is find the 6 consecutive numbers so I can use that to find the data before the date and group all those files together.
ex:
Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov
Initially it seemed like the numbers preceding the date had set values that I could have the code look out for with
if fileName contains "29" then
but now I'm stumped on how to approach this. My general idea was the following:
Looks like something’s eaten the last part of your question. At any rate, AppleScript is not the best language for text processing, but whatever language you use the standard technique is regular expression-based pattern matching.
For example, to match six digits you’d use the pattern \d{6}. The \d pattern matches any digit, the {6} matches the preceding pattern exactly six times.
If you want to extract the text from the start of a line up to the six digits, you’d use something like (?-s)^(.+?)\d{6}. The ^ matches the start of each line. The .+? matches one or more characters (.+) only up to the next pattern match (?); grouping it in parens extracts the matched text. By default, the . pattern matches any character including a line break, so add (?-s) to the start of the pattern to turn off the line break matching (-s).
Bit cryptic, but very powerful, and you’ll get the hang of it with a bit of practice. Tons of online docs and examples too; just search for “PCRE regular expression”. (Tip: build it up one pattern at a time, testing at every step.)
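As a quick way to experiment with the pattern, here is a minimal Python sketch (an illustration only; the function name `prefixes_before_date` is hypothetical). Python's `re` module does not let `.` match a line break unless asked to, so no `(?-s)` is needed; the `re.MULTILINE` flag makes `^` match at the start of every line.

```python
import re

# Lazily capture everything from the start of a line up to the first
# run of six consecutive digits.
DATE_PREFIX = re.compile(r"^(.+?)\d{6}", re.MULTILINE)


def prefixes_before_date(text: str) -> list:
    """Return the text preceding the six-digit date on each matching line."""
    return [match.group(1) for match in DATE_PREFIX.finditer(text)]


if __name__ == "__main__":
    names = (
        "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov\n"
        "Test Recording Iceland 19 040407 low quality screener.mov"
    )
    for prefix in prefixes_before_date(names):
        print(repr(prefix))
```

Lines without six consecutive digits simply produce no match.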
AppleScript doesn’t have built-in support for regular expressions, but it can use Cocoa’s NSRegularExpression class via the AppleScript-ObjC bridge. The syntax isn’t very friendly so you may want to use a library that wraps it for you:
use script "Text"
set theText to "Barry_Waterson_Speech_1955_27.02_012219_video_file_from_grdx1.mov
Test Recording Iceland 19 040407 low quality screener.mov"
search text theText for "^(.+?)\\d{6}" using pattern matching
returns:
{{class:matched text, startIndex:1, endIndex:39, foundText:"Barry_Waterson_Speech_1955_27.02_012219", foundGroups:{{class:matched group, startIndex:1, endIndex:33, foundText:"Barry_Waterson_Speech_1955_27.02_"}}},
{class:matched text, startIndex:67, endIndex:98, foundText:"Test Recording Iceland 19 040407", foundGroups:{{class:matched group, startIndex:67, endIndex:92, foundText:"Test Recording Iceland 19 "}}}}
I need to change the chart's number format so that numbers stop looking like 1,525 (comma as the thousands separator) and start looking like 1 525 (space as the thousands separator). Plus, I need a dot as the decimal separator, but only if the number has a decimal part, like this: 1 525.4
The closest number format I was able to find for amCharts4 version is
chart.numberFormatter.numberFormat = '#,###.#';
Any ideas?
So, after some research I've found a solution: you have to use locales.
This line of code helped me a lot:
chart.language.locale = am4lang_[locale];
To get that thousands separator I used am4lang_ru_RU.
Btw, if you need your own formatting for numbers, strings, etc., you can create your own locale for that.
I need to have Sphinx index some characters that are apparently part of the internal char_set, notably "," and "&".
As far as I understand there is no way (?!) to simply add these as indexable chars, I need to
Make a char_set table in my index
Not only include the "," and "&" as indexable characters but now manually add the char_set Sphinx uses
This is a little frustrating as it seems one should be able to add back in chars without manually recreating the char_set used internally. However if that is the situation that is the situation, yet in the documentation
http://sphinxsearch.com/docs/current/conf-charset-table.html
I don't see or understand how I would manually specify a char_set table and exclude / index the two or three characters I want.
You do need to duplicate the 'stock' charset_table before you can add to it.
At fewer than 100 characters, typing it out rather than copy/pasting would still be less effort than typing this whole question!
(Even less if you use the english shorthand.)
Just copy what's there and add your new chars to the end. If you don't know the Unicode mapping, something like asciitable.com is useful, at least for the chars in ASCII.