How to extract active domains - bash

Is there any bash command/script in Linux that can extract the active domains from a long list?
For example, I have a CSV file (domains.csv) listing 55 million domains, and I need only the active domains written to another CSV file (active.csv).
Here "active" means a domain that serves at least a web page, not a domain that is merely not expired. For example, whoisdatacenter.info is not expired, but it has no web page, so we consider it non-active.
I checked Google and Stack Overflow and saw that we can check a domain in two ways, like
$ curl -Is google.com | grep -i location
Location: http://www.google.com/
or
nslookup google.com | grep -i name
Name: google.com
but I have no idea how to write a bash program that does this for 55 million domains.
The commands below give no result, which is how I concluded that nslookup and curl are the way to tell active domains from inactive ones:
$ nslookup whoisdatacenter.info | grep -i name
$ curl -Is whoisdatacenter.info | grep -i location
The first 25 lines of the file:
$ head -25 domains.csv
"
"0----0.info"
"0--0---------2lookup.com"
"0--0-------free2lookup.com"
"0--0-----2lookup.com"
"0--0----free2lookup.com"
"0--1.xyz"
"0--123456789.com"
"0--123456789.net"
"0--6.com"
"0--7.com"
"0--9.info"
"0--9.net"
"0--9.world"
"0--a.com"
"0--a.net"
"0--b.com"
"0--m.com"
"0--mm.com"
"0--reversephonelookup.com"
"0--z.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0-0.com"
"0-0-0-0-0-0-0-0-0-0-0-0-0-10-0-0-0-0-0-0-0-0-0-0-0-0-0.info"
The code I am running:
while read line;
do nslookup "$line" | awk '/Name/';
done < domains.csv > active3.csv
The result I am getting:
sh -x ravi2.sh
+ read line
+ nslookup ''
+ awk /Name/
nslookup: '' is not a legal name (unexpected end of input)
+ read line
+ nslookup '"'
+ awk /Name/
+ read line
+ nslookup '"0----0.info"'
+ awk /Name/
+ read line
+ nslookup '"0--0---------2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-------free2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0-----2lookup.com"'
+ awk /Name/
+ read line
+ nslookup '"0--0----free2lookup.com"'
+ awk /Name/
Still, active3.csv is empty.
The script below is working, but something is stopping the bulk lookup; either it's my host or something else.
while read line
do
nslookup $(echo "$line" | awk '{gsub(/\r/,"");gsub(/.*-|"$/,"")} 1') | awk '/Name/{print}'
done < input.csv >> output.csv
The bulk nslookup shows errors like the one below:
server can't find facebook.com\013: NXDOMAIN
[Solved]
Ravi's script is working perfectly fine. I was running it on my Mac, which gave an nslookup error; on my CentOS Linux server, nslookup works great with Ravi's script.
Thanks a lot!

EDIT: Please try my edited solution, as per the OP's shown samples.
while read line
do
nslookup $(echo "$line" | awk '{gsub(/\r/,"");gsub(/.*-|"$/,"")} 1') | awk '/Name/{found=1;next} found && /Address/{print $NF}'
done < "Input_file"
Could you please try the following.
The OP has control-M characters in the Input_file, so run the following command first to remove them:
tr -d '\r' < Input_file > temp && mv temp Input_file
Then run the following code:
while read line
do
nslookup "$line" | awk '/Name/{found=1;next} found && /Address/{print $NF}'
done < "Input_file"
I am assuming that since you are passing domain names, you need their IP addresses in the output. Also, since you are using a huge Input_file, it may be a bit slow in producing output, but trust me, this is a simpler way.
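If the goal is just a list of domains that answer in DNS (the active.csv from the question), a minimal variation of the same loop could print the domain itself instead of its address. This is only a sketch: it assumes the quotes and carriage returns have already been stripped from domains.csv, and a DNS answer alone still does not prove a web page exists.
while read -r domain
do
  # keep the domain only if nslookup's answer contains a "Name:" line
  if nslookup "$domain" 2>/dev/null | grep -qi '^name'
  then
    echo "$domain"
  fi
done < domains.csv > active.csv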

nslookup simply indicates whether or not the domain name has a record in DNS. Having one or more IP addresses does not automatically mean you have a web site; many IP addresses are allocated for different purposes altogether (but might coincidentally host a web site for another domain name entirely!)
(Also, nslookup is not particularly friendly to scripting; you will want to look at dig instead for automation.)
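As a sketch of the difference (not part of the original answer): dig +short prints only the answer records, so an empty result means the name did not resolve, which is trivial to test in a script.
# Sketch: test one domain with dig; non-empty "dig +short" output means it has a DNS answer.
domain="google.com"
if [ -n "$(dig +short "$domain" A)" ]
then
  echo "$domain has a DNS record"
else
  echo "$domain does not resolve"
fi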
There is no simple way to visit 55 million possible web sites in a short time, and probably you should not be using Bash if you want to. See e.g. https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html for an exposition of various approaches based on Python.
The immediate error message indicates that you have DOS carriage returns in your input file; this is a common FAQ which is covered very well over at Are shell scripts sensitive to encoding and line endings?
You can run multiple curl instances in parallel, but you will probably eventually saturate your network. Experiment with various degrees of parallelism; maybe split up your file into smaller pieces and run each piece on a separate host with a separate network connection (perhaps in the cloud). To quickly demonstrate, though,
tr -d '\r' <file |
xargs -P 256 -i sh -c 'curl -Is {} | grep Location'
to run 256 instances of curl in parallel. You will still need to figure out which output corresponds to which input, so maybe refactor to something like
tr -d '\r' <file |
xargs -P 256 -i sh -c 'curl -Is {} | sed -n "s/Location/{}:&/p"'
to print the input domain name in front of each output.
(Maybe also note that just a domain name is not a complete URL. curl will helpfully attempt to add a "http://" in front and then connect to that, but that still doesn't give you an accurate result if the domain only has a "https://" website and no redirect from the http:// one.)
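One hedged way around that (a sketch, not part of the original answer) is to ask curl for the HTTP status code over both schemes and accept the domain if either one answers; the variable domain and the timeout are illustrative assumptions.
# Sketch: try http:// then https://; curl reports 000 when there was no response at all.
# -m 10 keeps dead hosts from stalling the run.
domain="example.com"
for scheme in http https
do
  code=$(curl -sI -o /dev/null -m 10 -w '%{http_code}' "$scheme://$domain")
  if [ "$code" != "000" ]
  then
    echo "$domain answers over $scheme with HTTP $code"
    break
  fi
done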
If you are on a Mac, where xargs doesn't understand -i, try -I {} or something like
tr -d '\r' <file |
xargs -P 256 sh -c 'for url; do curl -Is "$url" | sed -n "s/Location/$url:&/p"; done' _
The examples assume you didn't already fix the DOS carriage returns once and for all; you probably really should (and consider dropping Windows from the equation entirely).

Related

nslookup/dig/drill commands on a file that contains websites to add ip addresses

UPDATE: Still open for solutions using nslookup, without parallel, dig, or drill.
I need to write a script that scans a file containing a web page address on each line and adds to each line the IP address corresponding to that name, using the nslookup command. The script looks like this at the moment:
#!/usr/bin/env bash
while read ip
do
nslookup "$ip" |
awk '/Name:/{val=$NF;flag=1;next} /Address:/ && flag{print val,$NF;val=""}' |
sed -n 'p;n'
done < is8.input
The input file contains the following websites :
www.edu.ro
vega.unitbv.ro
www.wikipedia.org
The final output should look like :
www.edu.ro 193.169.21.181
vega.unitbv.ro 193.254.231.35
www.wikipedia.org 91.198.174.192
The main problem I have with the current state of the script is that it takes the names from nslookup (which is fine for www.edu.ro) instead of keeping the aliases when those are available. My output looks like this:
www.edu.ro 193.169.21.181
etc.unitbv.ro 193.254.231.35
dyna.wikimedia.org 91.198.174.192
I was thinking about implementing an if-else for aliases, but I don't know how to do one with the current command. The script can also be changed if anyone has a better understanding of how to format nslookup output to match the output given above.
Minimalist workaround quasi-answer. Here's a one-liner replacement for the script using GNU parallel, host (less work to parse than nslookup), and sed:
parallel "host {} 2> /dev/null |
sed -n '/ has address /{s/.* /'{}' /p;q}'" < is8.input
...or using nslookup at the cost of added GNU sed complexity.
parallel "nslookup {} 2> /dev/null |
sed -n '/^A/{s/.* /'{}' /;T;p;q;}'" < is8.input
...or using xargs:
xargs -I '{}' sh -c \
"nslookup {} 2> /dev/null |
sed -n '/^A/{s/.* /'{}' /;T;p;q;}'" < is8.input
Output of any of those:
www.edu.ro 193.169.21.181
vega.unitbv.ro 193.254.231.35
www.wikipedia.org 208.80.154.224
Replace your complete nslookup line with:
echo "$IP $(dig +short "$IP" | grep -m 1 -E '^[0-9.]{7,15}$')"
This might work for you (GNU sed and host):
sed '/\S/{s#.*#host & | sed -n "/ has address/{s///p;q}"#e}' file
For all non-empty lines: invoke the host command on the supplied host name and pipe the results to another invocation of sed which strips out text and quits after the first result.

bash: cURL from a file, increment filename if duplicate exists

I'm trying to curl a list of 7000+ URLs to aggregate the tabular data on them. The URLs are in a .txt file. My goal was to cURL each line and save the page to a local folder, after which I would grep and parse out the HTML tables.
Unfortunately, because of the format of the URLs in the file, duplicates exist (e.g. example.com/State/City.html). When I ran a short while loop, I got back fewer than 5500 files, so there are at least 1500 dupes in the list. As a result, I tried to grep the "/State/City.html" section of the URL and pipe it to sed to remove the leading / and substitute a hyphen, for use with curl -O.
Here's a sample of what I tried:
while read line
do
FILENAME=$(grep -o -E '\/[A-z]+\/[A-z]+\.htm' | sed 's/^\///' | sed 's/\//-/')
curl $line -o '$FILENAME'
done < source-url-file.txt
It feels like I'm missing something fairly straightforward. I've scanned the man page because I worried I had confused -o and -O which I used to do a lot.
When I run the loop in the terminal, the output is:
Warning: Failed to create the file State-City.htm
I think you don't need a multitude of seds and greps; just one sed should suffice:
urls=$(echo -e 'example.com/s1/c1.html\nexample.com/s1/c2.html\nexample.com/s1/c1.html')
for u in $urls
do
FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
if [[ ! -f "$FN" ]]
then
touch "$FN"
echo "$FN"
fi
done
This script should work and also take care of not downloading the same file multiple times.
Just replace the touch command with your curl one.
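For instance, a sketch of the same loop reading the question's source-url-file.txt, with touch swapped for curl (curl will prepend http:// to the bare hostnames itself):
# Sketch: download each URL once, naming the file State-City.html style;
# assumes source-url-file.txt holds one URL per line as in the question.
while read -r u
do
  FN=$(echo "$u" | sed -E 's/^(.*)\/([^\/]+)\/([^\/]+)$/\2-\3/')
  if [[ ! -f "$FN" ]]
  then
    curl -s "$u" -o "$FN"
  fi
done < source-url-file.txt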
First: you didn't pass the URL to grep.
Second: try this line instead:
FILENAME=$(echo $line | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
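Dropped back into the question's loop, that becomes the sketch below (note $FILENAME is now in double quotes so it actually expands):
while read -r line
do
  # pipe the URL into the pattern match and quote the expansion
  FILENAME=$(echo "$line" | egrep -o '\/[^\/]+\/[^\/]+\.html' | sed 's/^\///' | sed 's/\//-/')
  curl "$line" -o "$FILENAME"
done < source-url-file.txt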

Bash script working with second column from txt but keep first column in result as relevant

I am trying to write a bash script to ease the process of gathering IP information.
Right now I have a script which runs through one column of IP addresses in multiple files, looks up geo and host information, and stores it in a new file.
What would also be nice is a script that generates a result from files with 3 columns: date, time, IP address. The separator is a space.
I tried this and that, but no luck. I am a total newbie :)
This is my original script:
#!/usr/bin/env bash
find *.txt -print0 | while read -d $'\0' file;
do
for i in $( cat "$file")
do echo -e "$i,"$( geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" $i | cut -d' ' -f6,8-9)" "$(nslookup $i | grep name | awk '{print $4}')"" >> "res/res-"$file".txt";
done
done
Input file example
2014-03-06 12:13:27 213.102.145.172
2014-03-06 12:18:24 83.177.253.118
2014-03-25 15:42:01 213.102.155.173
2014-03-25 15:55:47 213.101.185.223
2014-03-26 15:21:43 90.130.182.2
Can you please help me on this?
It's not entirely clear what the current code is attempting to do, but here is a hopefully useful refactoring which could be at least a starting point.
#!/usr/bin/env bash
find *.txt -print0 | while read -d $'\0' file;
do
while read date time ip; do
geo=$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" "$ip" |
cut -d' ' -f6,8-9)
addr=$(nslookup "$ip" | awk '/name/ {print $4}')
#addr=$(dig +short -x "$ip")
echo "$ip $geo $addr"
done <"$file" >"res/res-$file.txt"
done
My copy of nslookup does not output four fields, but I assume that part of your script is correct. The output from dig +short is better suited to machine processing, so maybe switch to that instead. Perhaps geoiplookup also offers an option to output machine-readable results, or maybe there is an alternative interface which does.
I assume it was a mistake that your script would output partially comma-separated, partially whitespace-separated results, so I changed that, too. Maybe you should use CSV or JSON instead if you intend for other tools to be able to read this output.
Trying to generate a file named res/res-$file.txt will only work if file is not in any subdirectory, so I'm guessing you will want to fix that with basename; or perhaps the find loop should be replaced with a simple for file in *.txt instead.
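That simpler variant might look like the following sketch (still assuming an existing res/ directory, and using the dig alternative mentioned above):
#!/usr/bin/env bash
# Sketch: iterate over *.txt directly; $file then has no directory prefix,
# so "res/res-$file.txt" is always a valid output path.
for file in *.txt
do
  while read -r date time ip
  do
    geo=$(geoiplookup -f "/usr/share/GeoIP/GeoLiteCity.dat" "$ip" | cut -d' ' -f6,8-9)
    addr=$(dig +short -x "$ip")
    echo "$ip $geo $addr"
  done < "$file" > "res/res-$file.txt"
done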

How to append string to file if it is not included in the file?

Protagonists
The Admin
Pipes
The Cron Daemon
A bunch of text processing utilities
netstat
>> the Scribe
Setting
The Cron Daemon is repeatedly performing the same job where he forces an innocent netstat to show the network status (netstat -n). Pipes then have to pick up the information and deliver it to bystanding text processing utilities (| grep tcp | awk '{ print $5 }' | cut -d "." -f-4). >> has to scribe the important results to a file. As his highness, The Admin, is a lazy and easily annoyed ruler, >> only wants to scribe new information to the file.
*/1 * * * * netstat -n | grep tcp | awk '{ print $5 }' | cut -d "." -f-4 >> /tmp/file
Soliloquy by >>
To append, or not append, that is the question:
Whether 'tis new information to bother The Admin with
and earn an outrageous Fortune,
Or to take Arms against `netstat` and the others,
And by opposing, ignore them? To die: to sleep;
Note by the publisher: for all those who had problems understanding Hamlet, as I did, the question is: how do I check whether the string is already included in the file and, if not, append it to the file?
Unless you are dealing with a very big file, you can use the uniq command to remove the duplicate lines from the file. This means you will also have the file sorted; I don't know whether this is an advantage or a disadvantage for you:
netstat -n | grep tcp | awk '{ print $5 }' | cut -d "." -f-4 >> /tmp/file && sort /tmp/file | uniq > /tmp/file.uniq
This will give you the sorted results without duplicates in /tmp/file.uniq
What a piece of work is piping, how easy to reason about,
how infinite in use cases, in bash and script,
how elegant and admirable in action,
how like a vim in flexibility,
how like a gnu!
Here is a slightly different take:
netstat -n | awk -F"[\t .]+" '/tcp/ {print $9"."$10"."$11"."$12}' | sort -nu | while read ip; do if ! grep -q $ip /tmp/file; then echo $ip >> /tmp/file; fi; done;
Explanation:
awk -F"[\t .]+" '/tcp/ {print $9"."$10"."$11"."$12}'
Awk splits the input string by tabs and ".". The input string is filtered (instead of using a separate grep invocation) by lines containing "tcp". Finally the resulting output fields are concatenated with dots and printed out.
sort -nu
Sorts the IPs numerically and creates a set of unique entries. This eliminates the need for the separate uniq command.
if ! grep -q $ip /tmp/file; then echo $ip >> /tmp/file; fi;
Greps for the ip in the file, if it doesn't find it, the ip gets appended.
Note: This solution does not remove old entries and clean up the file after each run - it merely appends - as your question implied.

Get open ports as an array

So, I'm using netstat -lt to get open ports. However, I'm not interested in certain values (like SSH or 22), so I want to be able to exclude them. I also want to get them as an array in bash. So far I have netstat -lt | sed -r 's/tcp[^:]+://g' | cut -d' ' -f1 but they're not an array, nor am I excluding anything.
Try using the ss command, which replaces netstat.
ss -atu | awk '{print $5}' | awk -F: '{print $NF}'
The ss command gives you all TCP and UDP ports on the local machine (the only sockets that would have ports). The first awk extracts the column containing the local address and port number. The second awk takes only the last field following a colon; this is necessary in case you have IPv6 sockets on your machine, whose IP address will also include colons.
Once you've done this, you can grep out the ports you don't want. Also, see the documentation referred to by the ss man page for information on filters, which may let you filter out unwanted sockets from the output of ss.
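Putting that together as a bash array with some ports excluded might look like this sketch (ports 22 and 53 are just example exclusions):
# Sketch: collect local ports from ss into an array (bash 4+ for mapfile),
# skipping the header line and dropping ports 22 and 53; sort -nu removes duplicates.
mapfile -t ports < <(ss -atu | awk 'NR>1 {print $5}' | awk -F: '{print $NF}' |
  grep -vxE '22|53' | sort -nu)
echo "${ports[@]}"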
Add ($( )) around your statement to make it an array:
port=($(netstat -ltn | sed -rne '/^tcp/{/:(22|25)\>/d;s/.*:([0-9]+)\>.*/\1/p}'))
Filtering ports 22 and 25.
a=( $(netstat -ltn --inet | sed -r -e 's/tcp[^:]+://g' | cut -d' ' -f1 | sed -e '1,2d' | grep -v "22\|33\|25") )
The second sed command removes the headers, if your version of netstat prints them; I have "Active" and "Proto" as the first two lines. Use grep to filter out unwanted ports. Add -n to netstat to see port numbers instead of names. --inet forces IPv4; otherwise you may see IPv6 entries, which may confuse your script.
By the way, I'm not sure you need an array. Usually arrays are needed only if you are going to work on a subset of the values you have. If you work on all values, there are simpler constructs, but I'm not sure what you're going to do.
Regards.
Update: you can use a single sed command with two expressions instead of two separate invocations:
sed -r -e '1,2d' -e 's/tcp[^:]+://g'
