Remove text between two tokens, including the tokens, in a makefile

I have a hosts file with two markers somewhere in it, and I need to remove all lines between the two markers, including the marker lines themselves.
I found this command in another question:
cat hostfile | grep -P '(?<=##STARTMARK).*(?=##ENDMARK)'
but that still leaves the markers in there.
I currently have this
127.0.0.1 home-host.dev
##STARTMARK
127.0.0.1 a-blocked-host.com
##ENDMARK
and I want this
127.0.0.1 home-host.dev

Try using sed:
sed '/##STARTMARK/,/##ENDMARK/d' hostfile
127.0.0.1 home-host.dev
Note that the second line (i.e., the blank line) is kept, since it is not enclosed by the tokens, per your criteria.
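To edit the file in place rather than print to stdout, GNU sed supports -i (BSD/macOS sed needs an explicit backup suffix argument, e.g. -i ''):

sed -i '/##STARTMARK/,/##ENDMARK/d' hostfile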

Related

Extract image URI from markdown files using sed/grep containing duplicates in a single line

I have some markdown files to process which contain links to images that I wish to download. e.g. a markdown file:
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
a lot of text
some more text...
[![](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif)
some more text
another URL but not image
[https://github.com]
so on
I am trying to parse this file and extract the list of image URLs, which I can later pass to wget to download.
So far I have used grep and sed and have got results:
$ sed -nE "/https?:\/\/[^ ]+.(jpg|png|gif)/p" $path
[![](https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png)
[![](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif)
$ grep -Eo "https?://[^ ]+.(jpg|png|gif)" $path
https://imgs.xkcd.com/comics/git.png)](https://imgs.xkcd.com/comics/git.png
https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s320/take_a_break_git.gif)](https://1.bp.blogspot.com/-Ze2SiBflkZ4/XbtF1TjELcI/AAAAAAAALL4/IDC6W-b5moU0eGu2eN60aZ4pxfXW1ybmQCLcBGAsYHQ/s1600/take_a_break_git.gif
The regex is essentially working, but because the same URL is present twice in the same line, the text selected runs from the first occurrence of https to the last occurrence of jpg|png|gif. I want the first occurrence of https and the first occurrence of jpg|png|gif.
How can I fix this?
P.S. I have also tried lynx -dump -image_links -listonly $path but this prints the entire file.
I am also open to other options that serve the purpose, as long as I can hook the code into my current shell script.
You may add square brackets into the negated bracket expression:
grep -Eo "https?://[^][ ]+\.(jpg|png|gif)"
Details:
https?:// - http:// or https://
[^][ ]+ - one or more chars other than ], [ and space
\. - a dot
(jpg|png|gif) - either of the three alternative substrings.
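To hook this into the download step, the matches can be piped straight to wget; a minimal sketch, assuming GNU grep and that $path holds the markdown file as in the question. sort -u drops the exact duplicates produced by the doubled xkcd URL:

grep -Eo "https?://[^][ ]+\.(jpg|png|gif)" "$path" | sort -u | xargs -n 1 wget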

Using egrep to copy URLs

I'm trying to make a script in bash that locates URLs in a text file (example.com, example.eu, etc.) and copies them over to another text file using egrep. My current output gives me the URLs that I want, but unfortunately also a lot that I don't want, such as 123.123 or example.3xx.
My script currently looks like this:
egrep -o '\w*\.[^\d\s]\w{2,3}\b' trace.txt > url.txt
I tried some online regex checkers, and there the same regex gives more correct results than I get locally.
Any help is appreciated.
If you know the domain suffixes, you can use a regex that looks for \.(com|eu|org).
Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of labels, separated by ., where each label can contain any character except .. Since you want only valid TLDs, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns:
grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF
Result:
aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
Note: this is a memory killer; the above example took 2 GB, since the list of TLDs is huge. You might consider searching for a list of commonly used TLDs and using that instead.
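A sketch of that lighter approach, assuming GNU grep (for \b) and that the hand-picked suffixes below are the TLDs you care about; the {1,63} cap is the label-length limit from the RFC:

grep -E -o '\b([[:alnum:]-]{1,63}\.)+(com|eu|org|net)\b' trace.txt > url.txt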

How can I filter out specific lines while still writing to the same file?

I set myself a small project that involved StevenBlack's host file. I know that he provides a way to make your own hosts file with his Python script, however I wanted to set myself a challenge.
The problem is as follows:
I have a script that gets the Fakenews+Gambling+Social hosts file.
However, I still want to access Reddit. And to make it worse, the file gets constantly updated. Meaning that I can't remove the lines with sed -e '123,456d'.
I think I got pretty close. But I'm not sure, here is the command
cat ./hosts | grep "# Reddit" -A10 | sed -e '1,11d'
While it does indeed remove the Reddit entries, I have no idea how to put the rest back together. That is, with the command above I can filter out the Reddit lines, but I don't know how to write the result back into the hosts file without ending up with an empty file.
It's my first post and I'm very bad at explaining problems. If there is any need for clarification, just say it. Also English isn't my first language, so that doesn't help.
EDIT: Example
cd /home/myname/Documents/git
wget https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/fakenews-gambling-social/hosts
At this point, I have the raw hosts file. Now I want to filter out Reddit. The entries I want to remove are:
# Reddit
0.0.0.0 reddit.com
0.0.0.0 www.reddit.com
0.0.0.0 i.reddit.com
0.0.0.0 redd.it
And now comes the problem. I don't know how to remove them from the hosts file, since the lines are changing constantly.
My approach was cat ./hosts | grep "# Reddit" -A10 | sed -e '1,11d', which is in hindsight pretty useless.
You can filter them as you download:
wget "$url" -O- | grep -v 'redd.\?it' > hosts

How do I grep for all lines without a "@" character in the line

I have a text file open in BBEdit/InDesign with email addresses on some lines (about a third of the lines) and name and date stuff on the other lines. I just want to keep the lines that have emails and remove all the others.
A simple pattern I can see to eliminate all the lines apart from those with email addresses on them is a negative match for the @ character.
I can't use grep -v pattern because the Find and Replace implementation of the grep dialogue box only has fields for a Find pattern and a Replace pattern; grep -something options don't exist in this context.
Note, I am not trying to construct a valid email address test at all, just using the presence of one (or more) @ characters to allow a line to stay; all other lines must be deleted from the list.
The closest I got was a pattern which hits only the email address lines (the opposite of my goal):
^((\w+|[ \.])\w+)[?@].*$
I tried various combinations of ^.*[^@].*$, more sophisticated \w and [\w|\.] groups in parentheses, escaping the @ as [^\@], and negative lookaheads like (?!).
I want to find these non-email-address lines and delete them using any of these apps on OS X (BBEdit/InDesign). I will use the command line if I have to, but I'd expect there must be a way using in-app Find and Replace with grep.
As stated in the comments, grep -v '@' filename lists all lines without an @ symbol. You could also use grep '@' filename > new_filename
The file new_filename will consist only of lines containing @. You can use this new file, or delete all lines in the old file and paste the contents of the new file into it.
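For doing this inside BBEdit's Find window (with Grep enabled) rather than on the command line, a sketch: match whole lines that contain no @ and replace them with nothing. BBEdit's grep flavour is PCRE-like, so ^ and $ anchor at line boundaries here:

Find:    ^[^@\n]*$\n?
Replace: (leave empty)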

Store into variable a string that changes dynamically

I'm quite new to Unix shell scripting, so I could use some help with the following issue:
I want to store in a variable a string from a log file that changes dynamically. E.g. one time you run the program that creates this log, the string may be server1, and the next time server238. I have found ways to find the first occurrence of this string through sed, or grep and cut. However, since the log file this software creates may differ from version to version, I can't count on a specific line containing the string. E.g. one version may log "The server you are using is server98" and the next one "Used server is server98". Is there a way, through sed or awk, to retrieve this string regardless of the log layout?
Thanks in advance.
I'd go with:
server=$(grep -Eo 'server[0-9]+' file | head -n 1)
to find any occurrence of the word server followed by some digits, e.g. server3, server98
-E means to use extended regular expressions; note that grep's ERE does not support \d, which is why the pattern spells out [0-9]+ for the digits
-o means only output the matching part of the string - not the whole line that contains it.
Here it is in action on OSX:
cat file
server9
fred server98 fred
3
/usr/bin/grep -Eo 'server[0-9]+' file
server9
server98
Try this:
MY_VAR="$(sed -n 's/^.*\(server[0-9][0-9]*\).*$/\1/p' my_file.log | sort -u)"
