Extract each occurrence of a string into a separate line to build a list of URLs - emeditor

I would like to extract all occurrences of a URL string pattern (which can appear multiple times in a file) to build a list of all occurrences.
Currently I can identify each occurrence with the Find in Files feature, but I would like the Extract feature to list each occurrence on a new line. At present, Extract lists each line that contains the string, and a line can contain the string multiple times.
My goal is to get a list of every full URL that contains __data/assets/
In the example below, __data/assets/ occurs 48 times.
However, Extract returns only 44 lines, and I need to output all 48 occurrences (the full URLs).
I will be running this extract over 270 files in total.
View source of this example webpage:
https://www.walkerville.sa.gov.au/council/strategic-plans/2020-2024-living-in-the-town-of-walkerville-a-strategic-community-plan

It looks like all URLs are surrounded by double quotation marks.
If so, you can search for a regular expression:
[^\"]*__data/assets/[^\"]*
and select Display Matched Strings Only in the Extract Options dialog box.
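
Outside EmEditor, the same idea can be sketched in Python. This is only an illustration, assuming (as above) that every URL sits between double quotation marks; the function name and the sample HTML are made up for the example.

```python
import re

# Same pattern as above: a run of non-quote characters containing __data/assets/.
URL_PATTERN = re.compile(r'[^"]*__data/assets/[^"]*')

def extract_asset_urls(html):
    """Return every match as its own list item, even when one line holds several."""
    return URL_PATTERN.findall(html)

# Hypothetical one-line sample with two URLs on the same line.
sample = ('<a href="/__data/assets/pdf_file/a.pdf">A</a> '
          '<img src="https://example.org/__data/assets/image/b.png">')

for url in extract_asset_urls(sample):
    print(url)
```

Because the matches are collected rather than the lines, a line with two URLs contributes two entries, which is what Display Matched Strings Only does in the Extract dialog.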

Related

Pre-processing multiple text files from a pdf using just pdftotext and sed in a bash script, if possible

I am using the Linux command pdftotext -layout *.pdf to extract text from some pdf files, for data mining. The resultant text files all reside in a single folder, but they need some pre-processing before they can be used.
Issues
Issue 1: The first value of each row in each file that I am trying to access is a barcode, which can be either a 13-digit GTIN code, or a 5-digit PLU code. The problem here is that the GTIN codes are delineated with a single space character, which is hard to replace with a script, as each row also contains a description field which, naturally, also contains single spaces between words. Here I will need to replace a set of 13 numerals plus a space with the same 13 numerals plus two spaces (at least), so that a later stage of the pre-processing can replace all multiple spaces with a tab character.
Issue 2: Another problem I am facing with this pre-processing is the newlines. There are many blank lines between data rows. Some are single blank lines between the data rows, and some are two or more lines. I want to end up with no blank lines between the data rows, but each row will be delineated by a newline character.
Issue 3: The final resulting files each need to be tab separated value files, for importing into a spreadsheet. Some of the descriptions in the data rows may contain commas, so I am using TSV rather than CSV files. I only need a single tab between each value in the row.
Sample rows
(I have replaced spaces with • and newlines with ¶ characters here for clarity.)
9415077026340•Pams•Sour•Cream•&•Chives•Rice•Crackers•100g•••$1.19¶
¶
¶
9415077026296•Pams•BBQ•Chicken•Rice•Crackers•100g•••$1.19¶
¶
61424••••••••••••Yoghurt•Raisins•kg•••$23.90/kg¶
¶
9415077036349•Pams•Sliced•Peaches•In•Juice•410g•••$1.29¶
Intended result
(I have also replaced tabs with ⇥ characters here for clarity.)
9415077026340⇥Pams•Sour•Cream•&•Chives•Rice•Crackers•100g⇥$1.19¶
9415077026296⇥Pams•BBQ•Chicken•Rice•Crackers•100g⇥$1.19¶
61424⇥Yoghurt•Raisins•kg⇥$23.90/kg¶
9415077036349⇥Pams•Sliced•Peaches•In•Juice•410g⇥$1.29¶
What have I tried?
I am slowly learning more about the various Linux script utilities such as sed / grep / awk / tr, etc. There are many solutions posted in StackOverflow which resolve some of the issues that I am facing, but they are disparate and confusing when I attempt to string them all together in the way that I need them. Some are "close, but not quite" solutions, such as replacing all double newlines with a single newline between each data row. I don't need the extra row between them. I have been looking and trying several different options that are close to what I need. It would be helpful if someone could propose a solution which uses a single utility, such as sed, to solve all of the issues at once.
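
For reference, the transformation described above can be sketched in Python. This is not the single-sed solution asked for, just a hedged illustration of the three steps. It assumes a barcode is always exactly 13 or 5 digits at the start of a row and that the remaining columns are separated by two or more spaces; the function name is made up.

```python
import re

# A row starts with a 13-digit GTIN or a 5-digit PLU, followed by whitespace.
BARCODE = re.compile(r"^(\d{13}|\d{5})\s+(.*)$")

def to_tsv(text):
    """Drop blank lines and turn each data row into tab-separated fields."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # Issue 2: skip blank lines between rows
        match = BARCODE.match(line)
        if match:
            # Issue 1: split the barcode off explicitly, even after a single space
            fields = [match.group(1)] + re.split(r"\s{2,}", match.group(2))
        else:
            fields = re.split(r"\s{2,}", line)
        rows.append("\t".join(fields))  # Issue 3: one tab between values
    return "\n".join(rows) + "\n" if rows else ""

sample = (
    "9415077026340 Pams Sour Cream & Chives Rice Crackers 100g   $1.19\n"
    "\n\n"
    "61424            Yoghurt Raisins kg   $23.90/kg\n"
)
print(to_tsv(sample), end="")
```

Single spaces inside the description are left alone because only runs of two or more spaces are treated as column separators.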

Sublime show lines with specific repeated character

I have a massive (400Mb) CSV file that I need to upload into a database.
The problem is that some lines contain 16 commas (",") and some 17.
I need to find the lines that contain 17 commas so that I can fix them (shouldn't be that many).
Is there a way to search in Sublime so that it shows every line that contains a particular character a given number of times?
This is a job for regular expressions!
Instructions on activating regexes in Sublime Text
You want the regex (.*,){17} - i.e., seventeen instances of any old nonsense followed by a comma.
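
Note that this regex matches any line with at least 17 commas, which is enough here because the lines have either 16 or 17. If the editor struggles with a file this size, a streaming count is another option. The following Python sketch is an illustration only; the function name is made up, and it counts every comma, including any inside quoted fields, just as the regex does.

```python
def lines_with_comma_count(lines, wanted=17):
    """Yield (line_number, line) for each line with exactly `wanted` commas."""
    for number, line in enumerate(lines, start=1):
        if line.count(",") == wanted:
            yield number, line.rstrip("\n")

# Small in-memory demonstration: the second row has one comma too many.
rows = [
    ",".join(["x"] * 17) + "\n",  # 16 commas
    ",".join(["x"] * 18) + "\n",  # 17 commas
]
for number, line in lines_with_comma_count(rows):
    print(number, line)
```

Passing an open file object instead of `rows` reads the file one line at a time, so the whole 400 MB never has to be in memory at once.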

Findstr does not return second line

I am using FINDSTR -I to find a string in a number of files in a folder, and I also write the results to a new file.
I need to find the string "IDC" along with the number next to it in all files.
But on some lines, IDC and its number are split across two lines, and my search returns just the first line.
09:49:34.386 4;**IDC-200.0**;CA
13:07:39.987 87;T22.8,BT2;LI;VLT12.7;**IDC-**
13:07:39.995 **42.0**;CAP240.0/
Can someone help me copy the next line to the output file as well when IDC is split across two lines?
Microsoft's findstr works strictly line by line. It cannot search for a string that spans a line break and output all of the lines involved.
However, you can give it several search strings. Each line is tested against them one after the other, and the line is output as soon as one of them matches; if none matches, the line is skipped.
Example:
%SystemRoot%\System32\findstr.exe /R /C:"IDC-" /C:"^[0-2][0-9]:[0-5][0-9]:[0-5][0-9]\.[0-9][0-9][0-9] \*\*[0-9][0-9.]*\*\*;" *.txt
Findstr (SS64 article) searches with those options in all *.txt files of the current folder for
a line containing the case-sensitive string IDC- anywhere within the line, or
a line starting with a time in the expected format, a space, two asterisks, a floating-point number with at least one digit before the decimal point, two more asterisks and a semicolon.
With those two search strings all 3 lines of provided example are found and output in correct order and other lines not containing IDC- or matching the second regular expression search string are ignored by FINDSTR.
Note: SS64 article FINDSTR - Searching across Line Breaks explains how a search can be done which includes a line break. But output is nevertheless only the first line on which the found multi-line string begins.
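
If a script is an option, the "copy the next line too" behaviour can be sketched in Python. This is an illustration only, not something findstr does. It assumes the asterisks in the sample are emphasis markers rather than part of the log, and that a split entry always ends its first line with IDC-; the function name is made up.

```python
def idc_lines(lines):
    """Return lines mentioning IDC, plus the line after one that ends in 'IDC-'."""
    out = []
    carry = False  # True when the previous line ended with a dangling 'IDC-'
    for line in lines:
        text = line.rstrip("\r\n")
        lowered = text.lower()  # case-insensitive, like FINDSTR -I
        if carry or "idc" in lowered:
            out.append(text)
        carry = lowered.rstrip().endswith("idc-")
    return out

log = [
    "09:49:34.386 4;IDC-200.0;CA\n",
    "10:00:00.000 5;T1;CA\n",
    "13:07:39.987 87;T22.8,BT2;LI;VLT12.7;IDC-\n",
    "13:07:39.995 42.0;CAP240.0/\n",
]
print("\n".join(idc_lines(log)))
```

Unlike the two-pattern findstr call, this does not depend on the format of the continuation line, only on the first line ending with IDC-.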

Verifying searched text displayed is in a single line

How can I test whether a sentence (combination of four or five words) is displayed in a single line?
I have to search with a name or some other fields. After search results are displayed, I should test whether the displayed text is a single line. For example, the code below is used to verify the search result link:
//ol[contains(@class,'search results')]/li[contains(@class,'mod result') and contains(@class,'XXXXXX')]//a[contains(@href,'trk=XXXXXX')]
I am not familiar with Ruby, but the following Java approach should work in any language.
Assuming that your "sentence" is entirely contained in one element, you could find all occurrences with something like:
driver.findElements(By.xpath("//*[text()='your sentence']"))
Then simply check the size of the returned list.
Assuming that a single or multiple lines will be contained within a single DOM element, you could use the vertical component of the element size to check for the multiple line condition.
webElement.getSize()

How to use regular expression in fetching data from graphite?

I want to fetch data for several different counters from Graphite in a single request, like:
summarize(site.testing_server_2.triggers_unknown.count,'1hour','sum')&format=json
summarize(site.testing_server_2.requests_failed.count,'1hour','sum')&format=json
summarize(site.testing_server_2.core_network_bad_soap.count,'1hour','sum')&format=json
and so on, for about 20 more.
But I don't want to fetch
summarize(site.testing_server_2.module_xyz_abc.count,'1hour','sum')&format=json
in that request. How can I do that?
This is what I tried:
summarize(site.testing_server_2.*.count,'1hour','sum')&format=json&from=-24hour
It gets JSON data for 'module_xyz_abc' too, which I don't want.
You can't use regular expressions per se, but you can use some similar (in concept and somewhat in format) matching techniques available within the Graphite Render URL API. There are a few ways you can "match" within a target's "bucket" (i.e. between the dots).
Target Matching
Asterisk * match
The asterisk can be used to match any sequence of zero or more characters. It can be used to replace the entire bucket (site.*.test) or within the bucket (site.w*t.test). Here is an example:
site.testing_server_2.requests_*.count
This would match site.testing_server_2.requests_failed.count, site.testing_server_2.requests_success.count, site.testing_server_2.requests_blah123.count, and so forth.
Character range [a-z0-9] match
The character range match is used to match on a single character (site.w[0-9]t.test) in the target's bucket and is specified as a range or list. For example:
site.testing_server_[0-4].requests_failed.count
This would match site.testing_server_0.requests_failed.count, site.testing_server_1.requests_failed.count, site.testing_server_2.requests_failed.count, and so on up to site.testing_server_4.requests_failed.count.
Value list (group capture) {blah, test, ...} match
The value list match can be used to match anything in the list of values, in the specified portion of the target's bucket.
site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count
This would match site.testing_server_2.triggers_unknown.count, site.testing_server_2.requests_failed.count, and site.testing_server_2.core_network_bad_soap.count. But nothing else, so site.testing_server_2.module_xyz_abc.count would not match.
Answer
Without knowing all of your bucket values it is difficult to be surgical with the approach (perhaps with a combination of the matching options), so I'll recommend just going with a value list match. This should allow you to get all of the values in one -somewhat long- request. For example (and keep in mind you'd need to include all of your values):
summarize(site.testing_server_2.{triggers_unknown,requests_failed,core_network_bad_soap}.count,'1hour','sum')&format=json&from=-24hour
For more, see Graphite Paths and Wildcards
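
If the list of counters is long, the value list can be assembled programmatically. Here is a minimal Python sketch, assuming a hypothetical Graphite host at http://graphite.example.com and the counter names from the question; the helper name and its parameters are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

def build_render_url(base_url, prefix, names, excluded=()):
    """Build a render URL whose target uses a {a,b,c} value list."""
    wanted = [name for name in names if name not in excluded]
    target = "summarize(%s.{%s}.count,'1hour','sum')" % (prefix, ",".join(wanted))
    query = urlencode({"target": target, "format": "json", "from": "-24hour"})
    return base_url.rstrip("/") + "/render?" + query

counters = ["triggers_unknown", "requests_failed",
            "core_network_bad_soap", "module_xyz_abc"]
url = build_render_url(
    "http://graphite.example.com",   # hypothetical host
    "site.testing_server_2",
    counters,
    excluded={"module_xyz_abc"},
)
# Decode the query again to show the target that Graphite would receive.
print(parse_qs(urlsplit(url).query)["target"][0])
```

The braces, commas and quotes are percent-encoded by `urlencode`, so the target survives as a single query parameter.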