Using egrep to copy URLs - bash

I'm trying to write a bash script that finds URLs in a text file (example.com, example.eu, etc.) and copies them to another text file using egrep. My current output contains the URLs that I want, but unfortunately also many strings that I don't want, such as 123.123 or example.3xx.
My script currently looks like this:
egrep -o '\w*\.[^\d\s]\w{2,3}\b' trace.txt > url.txt
I tried some regex checker sites, but the regex gives more accurate results there than it does in my script.
Any help is appreciated.

If you know the domain suffixes you are interested in, you can use a regex that ends in a fixed list of them, for example \.(com|eu|org)
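A minimal sketch of the suffix approach, assuming the only suffixes of interest are com, eu and org (an illustrative list) and using made-up sample input in place of trace.txt:

```shell
# Sample input standing in for trace.txt (contents are made up for illustration).
sample=$(mktemp)
printf '%s\n' 'visit example.com now' 'value 123.123' 'bad example.3xx' 'see shop.example.eu' > "$sample"

# -o prints only the matched text, -E enables extended regex syntax.
# [[:alnum:]-]+ is one label, (\.[[:alnum:]-]+)* allows subdomains,
# and the final group restricts the match to the chosen suffixes.
urls=$(grep -oE '[[:alnum:]-]+(\.[[:alnum:]-]+)*\.(com|eu|org)\b' "$sample")
printf '%s\n' "$urls"
rm -f "$sample"
```

The POSIX class [[:alnum:]] is used instead of \d and \s, which are not portable in grep's extended regex syntax.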

Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of labels separated by dots, and each label can contain any character except a dot. Since you want only names that end in a valid TLD, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns:
grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF
Result:
aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
Note: this is a memory killer. The example above took 2 GB because the list of TLDs is huge. Consider finding a list of commonly used TLDs and using that instead.
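A sketch of the lighter variant, assuming a short hand-picked TLD list (com, eu, org and zone are placeholders) and one candidate name per input line. Anchoring the patterns with ^ and $ is an addition that the original command does not have:

```shell
# Hypothetical short list of TLDs; replace with the suffixes you actually expect.
tlds=$(mktemp)
printf '%s\n' com eu org zone > "$tlds"

# Turn each TLD into an anchored pattern: one to four dot-terminated labels, then the TLD.
patterns=$(mktemp)
sed 's/.*/^([^.[:space:]]{1,63}\\.){1,4}&$/' "$tlds" > "$patterns"

# -f reads one pattern per line from the generated file.
matches=$(printf '%s\n' 'foo.bar.zone' 'fsfdsa.d.s' 'alpha flkafj' 'example.com' |
    grep -i -E -f "$patterns")
printf '%s\n' "$matches"
rm -f "$tlds" "$patterns"
```

With only a handful of patterns, grep no longer has to compile thousands of alternatives, which is where the memory went.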

Related

Search File Server for all files with email address using GREP

There has been a data breach and I need to find all file paths across a file server with email addresses.
I was trying
grep -lr --include='*.{csv,xls,xlsx,txt}' "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" * >output.txt
But, this is returning nothing.
I would be grateful for any suggestions. Thanks!
Your grep command is almost correct; only a few small glitches keep it from working.
First, to match your email regex, you should use grep's extended regex option -E.
Next, as explained in this answer to another question, your --include pattern will not work as written: brace expansion is done by the shell (bash or zsh), not by grep, and it is suppressed inside quotes. You need to put your closing quote before the braces, as follows: --include='*.'{csv,xls,xlsx,txt}
Finally, if you want to search every file on the server, run the command on the root directory / rather than on *, which only covers the files and directories in the directory you are in when you run it.
So your grep command should be:
grep -Elr --include='*.'{csv,xls,xlsx,txt} "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b" /
Some points to take into account:
You will not detect emails in Excel files (.xls and .xlsx): they are binary files, so grep will not be able to parse them.
Email pattern matching is rather hard, because email parsing has a lot of special cases. The pattern you're currently using will catch almost all emails, but not all of them.
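A small sketch of the corrected search, run against a throwaway directory instead of /. The directory layout and file contents are made up. The four --include options are written out explicitly so that the example does not depend on brace expansion; in bash or zsh, --include='*.'{csv,xls,xlsx,txt} expands to the same four options.

```shell
# Throwaway directory tree standing in for the file server (hypothetical contents).
root=$(mktemp -d)
mkdir -p "$root/a" "$root/b"
printf 'contact: jane.doe@example.org\n' > "$root/a/users.csv"
printf 'no addresses here\n' > "$root/a/notes.txt"
printf 'bob@example.com\n' > "$root/b/readme.md"   # extension not in the include list

# -E: extended regex, -l: print file names only, -r: recurse into directories.
found=$(grep -Elr --include='*.csv' --include='*.xls' --include='*.xlsx' --include='*.txt' \
    '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,6}\b' "$root")
printf '%s\n' "$found"
rm -rf "$root"
```

Only users.csv is reported: notes.txt contains no address, and readme.md is skipped by the include filters.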

grep -w -f is not returning all matches from list

I am trying to use a list that looks like this:
List file:
1mAF
2mAF
4mAF
7mAF
9mAF
10mAF
11mAF
13mAF
18mAF
27mAF
33mAF
36mAF
37mAF
38mAF
39mAF
40mAF
41mAF
45mAF
46mAF
47mAF
49mAF
57mAF
58mAF
60mAF
61mAF
62mAF
63mAF
64mAF
67mAF
82mAF
86mAF
87mAF
95mAF
96mAF
to grab out lines that contain a word-level match in a tab delimited file that looks like this:
File_of_interest:
11mAF-NODE_111-g7687-JEFF-tig00000037_arrow-g7396-AFLA_058530 11mAF cluster63
17mAF-NODE_343-g9350 17mAF cluster07
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
22mAF-NODE_36-g3735 22mAF cluster28
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
47mAF-NODE_30-g3242-JEFF-tig00000037_arrow-g7396-AFLA_058530 47mAF cluster21
67mAF-NODE_54-g4372 67mAF cluster06
69mAF-NODE_27-g2754 69mAF cluster39
71mAF-NODE_44-g4178 71mAF cluster25
73mAF-NODE_47-g4895 73mAF cluster57
78mAF-NODE_4-g688 78mAF cluster53
but when I do grep -w -f list file_of_interest these are the only ones I get:
18mAF-NODE_34-g3647-JEFF-tig00000037_arrow-g7396-AFLA_058530 18mAF cluster20
36mAF-NODE_107-g7427 36mAF cluster77
45mAF-NODE_151-g9067 45mAF cluster14
This misses a bunch of the values that are in the original list. For example, "67mAF" is in the list and in the file, but it isn't returned.
I have tried removing everything after "mAF" in the list and trying again: no change. I have rewritten the list in a completely new file, to no avail. Oddly, I get more of them if I "sort" the list into a new file and then do the grep, but I still don't get all of them. I have also removed all invisible characters using sed (sed $'s/[^[:print:]\t]//g'): no change.
I am on OSX and both files were created on OSX, but normally grep -w -f works in the fashion I'm describing above.
I am completely flummoxed. I thought grep -w -f would look for all word-level matches of the items in the list file in the target file... am I wrong?
Thanks!
My guess is that at least one of these files originates from a Windows machine and has CRLF line endings. file(1) can tell you whether that is so. If it is, do:
fromdos FILE
or, alternatively:
dos2unix FILE
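The effect can be reproduced with a short sketch. The file contents are trimmed from the question, and tr -d '\r' is used here as a substitute for fromdos/dos2unix in case neither is installed:

```shell
# Build a pattern list with Windows (CRLF) line endings to reproduce the symptom.
list=$(mktemp); data=$(mktemp); clean=$(mktemp)
printf '18mAF\r\n67mAF\r\n' > "$list"
printf '18mAF-NODE_34-g3647\t18mAF\tcluster20\n67mAF-NODE_54-g4372\t67mAF\tcluster06\n' > "$data"

# Each pattern is really "18mAF<CR>", which never occurs in the data file.
# grep exits non-zero when nothing matches, hence the "|| true".
before=$(grep -c -w -f "$list" "$data" || true)

# Delete the carriage returns, then search again with the cleaned list.
tr -d '\r' < "$list" > "$clean"
after=$(grep -c -w -f "$clean" "$data")

echo "matches before: $before, after: $after"
rm -f "$list" "$data" "$clean"
```

Running file on the list, as suggested above, is the quickest way to confirm the diagnosis before changing anything.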

Sed pacman.conf remove # for multilib & include

I've hit a wall with my custom installation script.
At one point in the script, I need to enable the 64-bit repository on 64-bit machines, so I need to get from this format:
#multilib-testing[...]
#include[...]
#multilib[...]
#include[...]
To this format:
#multilib-testing[...]
#include[...]
multilib[...]
include[...]
But as you can see, there are include lines everywhere, and I can't use a simple sed substitution because it would change every "include" in the file, which is not what I want...
I can't seem to find a solution with sed. I tried something I saw in another thread:
cat /etc/pacman.conf | grep -A 1 "multilib"
But I didn't really understand it, and I'm out of options...
Ideally, I would like a sed solution (but feel free to tell me what other options I have, as long as you explain them!).
The command should start with something like this:
sed -i '/multilib/ s/#//' /etc/pacman.conf
It should apply to the matching line and to the line after it (which is the include).
Also, I would be grateful if you could explain why you do each step, as I'm learning and I can't remember something if I don't understand why I did it that way. (Also, excuse my mid-level English.)
We can use a sed address range, /start/,/end/, to select a range of lines by patterns. Within that range, we match the # at the beginning of each line and remove it.
sed -i "/\[multilib\]/,/Include/"'s/^#//' /etc/pacman.conf
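To see what the range does without touching the real file, here is a sketch that runs the same expression (without -i) on a sample. The sample lines, including the mirrorlist path, are assumptions standing in for the elided [...] parts of the question:

```shell
# Sample standing in for the relevant part of /etc/pacman.conf (illustrative contents).
conf=$(mktemp)
printf '%s\n' \
    '#[multilib-testing]' \
    '#Include = /etc/pacman.d/mirrorlist' \
    '#[multilib]' \
    '#Include = /etc/pacman.d/mirrorlist' > "$conf"

# /start/,/end/ selects every line from a match of "start" through the next
# match of "end"; the s command then runs only on those lines.
# \[multilib\] requires the literal brackets, so [multilib-testing] is not selected.
result=$(sed '/\[multilib\]/,/Include/ s/^#//' "$conf")
printf '%s\n' "$result"
rm -f "$conf"
```

Without -i, sed writes to standard output, which makes it easy to check the result before editing the file in place.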

Using bash to list files with a certain combination of characters

So I have a directory with ~50 files, and each contain different things. I often find myself not remembering which files contain what. (This is not a problem with the naming -- it is sort of like having a list of programs and not remembering which files contain conditionals).
Anyways, so far, I've been using
cat * | grep "desiredString"
for a string that I know is in there. However, this just gives me the lines which contain the desired string. This is usually enough, but I'd like it to give me the file names instead, if at all possible.
How could I go about doing this?
It sounds like you want grep -l, which will list the files that contain a particular string. You can also just pass the filename arguments directly to grep and skip cat.
grep -l "desiredString" *
In the directory containing the files you want to search, run:
grep -rn "desiredString" .
This searches recursively and prints every match of "desiredString" with its file name, line number and the matching line.
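The difference between the two suggestions can be sketched on a throwaway directory. The file names and contents are invented, and "if" stands in for "desiredString":

```shell
# Three small files in a temporary directory (hypothetical contents).
dir=$(mktemp -d)
printf 'if x then y\n' > "$dir/one.txt"
printf 'plain text\n' > "$dir/two.txt"
printf 'line\nelse if z\n' > "$dir/three.txt"

# -l prints the name of each file that contains a match, once per file.
names=$(cd "$dir" && grep -l 'if' *)
# -n prints file name, line number and the matching line instead.
lines=$(cd "$dir" && grep -n 'if' *)

printf '%s\n' "$names" "$lines"
rm -rf "$dir"
```

Use -l when you only need to know which files to open, and -n when you also want to jump to the matching line.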

Bash: find references to filenames in other files

Problem:
I have a list of filenames, filenames.txt:
Eg.
/usr/share/important-library.c
/usr/share/youneedthis-header.h
/lib/delete/this-at-your-peril.c
I need to rename or delete these files and I need to find references to these files in a project directory tree: /home/noob/my-project/ so I can remove or correct them.
My thought is to use bash to extract the filename: basename filename, then grep for it in the project directory using a for loop.
FILELISTING=listing.txt
PROJECTDIR=/home/noob/my-project/
for f in $(cat "$FILELISTING"); do
extension=$(basename ${f##*.})
filename=$(basename ${f%.*})
pattern="$filename"\\."$extension"
grep -r "$pattern" "$PROJECTDIR"
done
I could royally screw up this project -- does anyone see a flaw in my logic; better: do you see a more reliable scalable way to do this over a huge directory tree? Let's assume that revision control is off the table ( it is, in fact ).
A few comments:
Instead of
for f in $(cat "$FILELISTING") ; do
...
done
it's somewhat safer to write
while IFS= read -r f ; do
...
done < "$FILELISTING"
That way, your code will have no problem with spaces, tabs, asterisks, and so on in the filenames (though it still won't support newlines).
Your goal in separating f into extension and filename, and then reassembling them with \., seems to be that you want the filename to be treated as a literal string; right? Like, you're worried that grep will treat the . as meaning "any character" rather than as "one dot". A more general solution is to use grep's -F option, which tells it to treat the pattern as a fixed string rather than a regex:
grep -r -F "$f" "$PROJECTDIR"
Your introduction mentions using basename, but then you don't actually use it. Is that intentional?
If your non-use of basename is intentional, then filenames.txt really just contains a list of patterns to search for; you don't even need to write a loop, in this case, since grep's -f option tells it to take a newline-separated list of patterns from a file:
grep -r -F -f "$FILELISTING" "$PROJECTDIR"
You should back up your project, using something like tar -czf backup.tar.gz "$PROJECTDIR". "Revision control is off the table" doesn't mean you can't have a rollback strategy!
Edited to add:
To pass all your base names to grep at once, in the hope that it can do something smarter with them than a loop of separate calls would, you can write something like:
grep -r -F "$(sed 's#.*/##g' "$FILELISTING")" "$PROJECTDIR"
(I used sed rather than while+basename for brevity's sake, but you can put an entire loop inside the "$(...)" if you prefer.)
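As a sketch of the base-name search on a throwaway project tree: the project files below are invented, the patterns go through a temporary file, and -l is added only to keep the output short.

```shell
# Throwaway project and listing (hypothetical contents).
proj=$(mktemp -d)
listing=$(mktemp)
patterns=$(mktemp)
printf '%s\n' '/usr/share/important-library.c' '/usr/share/youneedthis-header.h' > "$listing"
printf '#include "youneedthis-header.h"\n' > "$proj/main.c"
# This line would match "important-library.c" if "." were treated as a regex wildcard.
printf 'SRC = important-libraryXc\n' > "$proj/Makefile"

# Strip the directory part of every listed path, one base name per line.
sed 's#.*/##' "$listing" > "$patterns"

# -F: fixed strings, -f: read patterns from a file, -r: recurse, -l: list file names.
hits=$(grep -r -l -F -f "$patterns" "$proj")
printf '%s\n' "$hits"
rm -rf "$proj" "$listing" "$patterns"
```

Because of -F, only main.c is reported; the Makefile line is not a false positive.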
This is a job for an IDE.
You're right that this is a perilous task, and unless you know the build process and the search directories and the order of the directories, you really can't say what header is with which file.
Let's take something as simple as this:
# include "sql.h"
You have a file in the project headers/sql.h. Is that file needed? Maybe it is. Maybe not. There's also a /usr/include/sql.h. Maybe that's the one that's actually used. You can't tell without looking at the Makefile and seeing the order of the include directories which is which.
Then, there are the libraries that get included and may need their own header files in order to be able to compile. And, once you get to the C preprocessor, you really will have a hard time.
This is a task for an IDE (Integrated Development Environment). An IDE builds the project and tracks file and other resource dependencies. In the Java world, most people use Eclipse, and there is a C/C++ plugin for those developers. However, there are more than two dozen IDEs listed on Wikipedia, and almost all of them are open source. The best one will depend on your environment.
