Grep page source for URL - terminal

I have a webpage's source in a text doc; there are a few lines like so:
"rid" : 'http://web.site/urlhere',
How do I use Linux/terminal to grep just the http://web.site/urlhere portion?

You can pass the -o option to grep to tell it to only display the matching pattern.
grep -o http://web.site/urlhere somefile.txt
Assuming you're looking for generic URLs, you could start with something like this (and probably improve it):
grep -o "'http.*'" someFile.txt | sed "s/'//g"
This will search for the text http after a single quote and will include all the characters from that line until the last single quote. It will then pipe the result (only the matching pattern) to sed and remove the single quotes.
Note: You could run into trouble if you have more single quotes after the url (but your question doesn't mention that)...
Since your question is very non-specific, there are probably many other input conditions that could cause problems, but the above should be a good starting point.
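If your grep supports Perl-compatible regexes (GNU grep's -P flag), a lookbehind can do the whole job in one step; a minimal sketch, assuming the URL itself never contains a single quote:
grep -oP "(?<=')http[^']*" somefile.txt
This prints http://web.site/urlhere directly, without the sed step.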
More info on grep: http://unixhelp.ed.ac.uk/CGI/man-cgi?grep

Related

Regex to match characters between two specific characters in shell script

I want to clean my file before/after saving, so I have to delete the unnecessary characters it contains. Sadly, even though my regex works in Regex101, it does not work in the shell script I wrote.
I am getting my list from Kubernetes via
kubectl get pods -n $1 -o jsonpath='{range .items[*]}{#.spec.containers[*].image}{","}{#.status.containerStatuses[*].imageID}{"\n"}{end}'
Then I save it to a temp file and use sed to clean it up - the regex should match, and sed should delete, every character between , and # (and also delete the # itself). I am escaping them since they are special characters.
sed -i 's/(?<=\,)(.*?)(?<=\#)//g' temp
The problem is that this regex works fine (for example in Regex101) but does not work with the sed command. I even tried awk but I am getting the same output.
awk '!/(?<=\,)(.*?)(?<=\#)/' temp
Am I missing something or is the regex acting differently somehow in Unix/shell?
Thanks for any input.
Example content of the file (for test):
docker.elastic.co/elasticsearch/elasticsearch:7.17.5,docker-pullable://docker.elastic.co/elasticsearch/elasticsearch#sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d
docker.io/bitnami/kafka:3.3.1-debian-11-r11,docker-pullable://bitnami/kafka#sha256:be29db0e37b6ab13df5fc14988a4aa64ee772c7f28b4b57898015cf7435ff662
docker.io/bitnami/mongodb:6.0.3-debian-11-r0,docker-pullable://bitnami/mongodb#sha256:e7438d7964481c0bcfcc8f31bca2d73022c0b7ba883143091a71ae01be6d9edb
docker.io/bitnami/postgresql:14.1.0-debian-10-r80,docker-pullable://bitnami/postgresql#sha256:6eb9c4ab3444e395df159e2cad21f283e4bf30802958467590c886f376dc9959
docker.io/bitnami/zookeeper:3.8.0-debian-11-r47,docker-pullable://bitnami/zookeeper#sha256:0f3169499c5ee02386c3cb262b2a0d3728998d9f0a94130a8161e389f61d1462
Expected output:
docker.elastic.co/elasticsearch/elasticsearch:7.17.5,sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d
docker.io/bitnami/kafka:3.3.1-debian-11-r11,sha256:be29db0e37b6ab13df5fc14988a4aa64ee772c7f28b4b57898015cf7435ff662
docker.io/bitnami/mongodb:6.0.3-debian-11-r0,sha256:e7438d7964481c0bcfcc8f31bca2d73022c0b7ba883143091a71ae01be6d9edb
docker.io/bitnami/postgresql:14.1.0-debian-10-r80,sha256:6eb9c4ab3444e395df159e2cad21f283e4bf30802958467590c886f376dc9959
docker.io/bitnami/zookeeper:3.8.0-debian-11-r47,sha256:0f3169499c5ee02386c3cb262b2a0d3728998d9f0a94130a8161e389f61d1462
You are trying to use Perl extensions (the lookbehind (?<=...) and the non-greedy .*?) which are not supported by more traditional regex tools like sed and Awk.
Perhaps see also Why are there so many different regular expression dialects? and the Stack Overflow regex tag info page.
If I can guess what you are trying to do, you want simply
sed -i 's/,[^#]*#/,/g' temp
The /g flag is unnecessary if you only expect one match per line.
Neither , nor # is a regex metacharacter; they do not require escaping.
Usually you would want to avoid using a temporary file or sed -i; perhaps simply
kubectl blah blah | sed 's/,[^#]*#/,/' > temp
to create the file, or remove the redirection if you want to pipe the results further.
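For example, feeding the first sample line from the question through that sed command:
printf '%s\n' 'docker.elastic.co/elasticsearch/elasticsearch:7.17.5,docker-pullable://docker.elastic.co/elasticsearch/elasticsearch#sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d' | sed 's/,[^#]*#/,/'
prints the expected
docker.elastic.co/elasticsearch/elasticsearch:7.17.5,sha256:76344d5f89b13147743db0487eb76b03a7f9f0cd55abe8ab887069711f2ee27d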

Grepping for exact string while ignoring regex for dot character

So here's my issue. I need to develop a small bash script that can grep a file containing account names (let's call it file.txt). The contents would be something like this:
accounttest
account2
account
accountbtest
account.test
Matching an exact line SHOULD be easy but apparently it's really not.
I tried:
grep "^account$" file.txt
The output is:
account
So in this situation the output is OK, only "account" is displayed.
But if I try:
grep "^account.test$" file.txt
The output is:
accountbtest
account.test
So the next obvious solution that comes to mind, in order to stop interpreting the dot character as "any character", is using fgrep, right?
fgrep account.test file.txt
The output, as expected, is correct this time:
account.test
But what if I try now:
fgrep account file.txt
Output:
accounttest
account2
account
accountbtest
account.test
This time the output is completely wrong, because I can't use the beginning/end line characters with fgrep.
So my question is, how can I properly grep a whole line, including the beginning and end of line special characters, while also matching exactly the "." character?
EDIT: Please note that I do know that the "." character needs to be escaped, but in my situation, escaping is not an option, because of further processing that needs to be done to the account name, which would make things too complicated.
The . is a special character in regex notation and needs to be escaped to be matched literally when passed to grep, so do
grep "^account\.test$" file.txt
Or, if you cannot afford to modify the search string, use the -F flag to make grep treat it as a literal string with no regex processing:
grep -Fx 'account.test' file.txt
From man grep
-F, --fixed-strings
Interpret PATTERN as a list of fixed strings (instead of regular expressions), separated by newlines, any of which is to be matched.
-x, --line-regexp
Select only those matches that exactly match the whole line. For a regular expression pattern, this is like parenthesizing the pattern and then surrounding it with ^ and $.
fgrep is the same as grep -F. grep also has the -x option which matches against whole lines only. You can combine these to get what you want:
grep -Fx account.test file.txt
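Run against the sample file.txt from the question, grep -Fx account.test file.txt outputs only:
account.test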

grep exact pattern from a file in bash

I have the following IP addresses in a file
3.3.3.1
3.3.3.11
3.3.3.111
I am using this file as an input file to another program. That program greps for each IP address. But when I grep the contents I am getting some wrong output.
like
cat testfile | grep -o 3.3.3.1
but I am getting output like
3.3.3.1
3.3.3.1
3.3.3.1
I just want to get the exact output. How can I do that with grep?
Use the following command:
grep -owF "3.3.3.1" testfile
-o returns only the match and not the whole line. -w greps for whole words, meaning the match must be enclosed in non-word characters like <space>, <tab>, "," or ";", or the start or end of the line. It prevents grep from matching 3.3.3.1 inside 3.3.3.111.
-F greps for fixed strings instead of patterns. This prevents the . in the IP address from being interpreted as "any character", meaning grep will not match something like 3a3b3c1.
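Against the test file from the question, the command prints the exact match only once:
grep -owF "3.3.3.1" testfile
which outputs only
3.3.3.1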
To match whole words only, use grep -ow 3.3.3.1 testfile
UPDATE: Use the solution provided by hek2mgl as it is more robust.
You may use anchors.
grep '^3\.3\.3\.1$' file
Since grep uses regex by default, you need to escape the dots in order to make grep match a literal dot character.

Is it possible to clean up an HTML file with grep to extract certain strings?

There is a website that I am a part of and I wanted to get the information out of the site on a daily basis. The page looks like this:
User1 added User2.
User40 added user3.
User13 added user71
User47 added user461
so on..
There's no JSON endpoint to get the information and parse it, so I have to wget the page and clean up the HTML:
User1 added user2
Is it possible to clean this up even though the username always changes?
I would divide that problem into two:
How to clean up your HTML
Yes it is possible to use grep directly, but I would recommend using a standard tool to convert HTML to text before using grep. I can think of two (html2text is a conversion utility, and w3m is actually a text browser), but there are more:
wget -O - http://www.stackoverflow.com/ | html2text | grep "How.*\?"
w3m http://www.stackoverflow.com/ | grep "How.*\?"
These examples will get the homepage of StackOverflow and display all questions found on that page starting with How and ending with ? (it displays about 20 such lines for me, but YMMV depending on your settings).
How to extract only the desired strings
Concerning your username, you can just tune your expression to match different users (-E is necessary due to the extended regular expression syntax, -o will make grep print only the matching part(s) of each line):
[...] | grep -o -E ".ser[0-9]+ added .ser[0-9]+"
This however assumes that users always have a name matching .ser[0-9]+. You may want to use a more general pattern like this one:
[...] | grep -o -E "[[:graph:]]+[[:space:]]+added[[:space:]]+[[:graph:]]+"
This pattern will match added surrounded by any two other words, delimited by an arbitrary number of whitespace characters. Or simpler (assuming that a word may contain everything but blank, and the words are delimited by exactly one blank):
[...] | grep -o -E "[^ ]+ added [^ ]+"
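For instance, applied to one of the sample lines:
echo 'User40 added user3.' | grep -o -E "[^ ]+ added [^ ]+"
which outputs
User40 added user3.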
Do you intend to just strip away the HTML tags?
Then try this:
sed 's/<[^>]*>//g' infile >outfile
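For example, with made-up markup around the usernames (the tags here are purely illustrative):
echo '<b>User1</b> added <b>User2</b>.' | sed 's/<[^>]*>//g'
which outputs
User1 added User2.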

Find and replace html code for multiple files within multiple directories

I have a very basic understanding of shell scripting, but what I need to do requires more complex commands.
For one task, I need to find and replace HTML code within the index.html files on my server. These files are in multiple directories with a consistent naming convention ([letter][3-digit number]). See the example below.
files: index.html
path: /www/mysite/board/today/[rsh][0-9]/
string to find: (div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I hope you don't mind the pseudo-regex. The folders containing my target index.html files look similar to r099, s017, h123. And suffice it to say, the HTML code I'm trying to replace is relatively long, but it's still just a string.
The second task is similar to the first, only the filename changes as well.
files: [rsh][0-9].html
path: www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/
string: (div id="id")[code](/div)<--include="(path)"-->(div id="id")[more code](/div)
string to replace with: (div id="id")<--include="(path)"-->(/div)
I've seen other examples on SO and elsewhere on the net that simply show scripts modifying files under a single directory to find & replace a string without any special characters, but I haven't seen an example similar to what I'm trying to do just yet.
Any assistance would be greatly appreciated.
Thank You.
You have three separate sub-problems:
replacing text in a file
coping with special characters
selecting files to apply the transformation to
1. The canonical text replacement tool is sed:
sed -e 's/PATTERN/REPLACEMENT/g' <INPUT_FILE >OUTPUT_FILE
If you have GNU sed (e.g. on Linux or Cygwin), pass -i to transform the file in place. You can act on more than one file in the same command line.
sed -i -e 's/PATTERN/REPLACEMENT/g' FILE OTHER_FILE…
If your sed doesn't have the -i option, you need to write to a different file and move that into place afterwards. (This is what GNU sed does behind the scenes.)
sed -e 's/PATTERN/REPLACEMENT/g' <FILE >FILE.tmp
mv FILE.tmp FILE
2. If you want to replace a literal string with a literal string, you need to prefix every special character with a backslash. For sed patterns, the special characters are . \ [ ^ $ * plus the separator for the s command (usually /). For sed replacement text, the special characters are \ and & (plus newlines). You can use sed to turn a string into a suitable pattern or replacement text.
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
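For example, escaping the made-up literal string a.b*c/d with the first command:
printf '%s\n' 'a.b*c/d' | sed -e 's![.\[^$*/]!\\&!g'
outputs
a\.b\*c\/d
with every metacharacter (., * and the / separator) now prefixed by a backslash, so sed matches it literally.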
3. To act on multiple files directly in one or more directories, use shell wildcards. Your requirements don't seem completely consistent; I think these are the patterns you're looking for, but be sure to review them.
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
This will match files like /www/mysite/board/today/r012/index.html and /www/mysite/person/4/5/6/card/2011/h7.html, but not /www/mysite/board/today/subdir/s012/index.html or /www/mysite/board/today/r1234/index.html.
If you need to act on files in subdirectories recursively, use find. It doesn't seem to be in your requirements and this answer is long enough already, so I'll stop here.
4. Putting it all together:
string_to_replace='(div id="id")[code](/div)<--#include="(path)"-->(div id="id")[more code](/div)'
replacement_string='(div id="id")<--include="(path)"-->(/div)'
pattern=$(printf %s "$string_to_replace" | sed -e 's![.\[^$*/]!\\&!g')
replacement=$(printf %s "$replacement_string" | sed -e 's![\&]!\\&!g')
sed -i -e "s/$pattern/$replacement/g" \
/www/mysite/board/today/[rsh][0-9][0-9][0-9]/index.html \
/www/mysite/person/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9].html
Final note: you seem to be working on HTML with regular expressions. That's often not a good idea.
Finding the files can easily be done using find -regex:
find www/mysite/board/today -regex ".*/[rsh][0-9][0-9][0-9]/index\.html"
find www/mysite/person -regex ".*/[0-9]/[0-9]/[0-9]/card/2011/[rsh][0-9]\.html"
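To apply the replacement directly from find, combine it with the sed command from the previous answer; a sketch, assuming GNU find and sed and the pattern/replacement variables built there:
find www/mysite/board/today -regex ".*/[rsh][0-9][0-9][0-9]/index\.html" -exec sed -i -e "s/$pattern/$replacement/g" {} +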
Due to the nature of HTML, replacing the content might not be very easy with sed, so I would suggest using an HTML or XML parsing library in a Perl script. Can you provide a short sample of an actual HTML file and the result of the replacements?
