Just a little disclaimer: I am not very familiar with programming, so please excuse me if I use any terms incorrectly or in a confusing way.
I want to extract specific information from a webpage and tried doing this by piping the output of curl into grep. Oh, and this is in Cygwin, if that matters.
When just typing in
$ curl www.ncbi.nlm.nih.gov/gene/823951
The terminal prints the whole webpage in what I believe to be HTML. From there I thought I could just pipe this output into grep with whatever search term I want:
$ curl www.ncbi.nlm.nih.gov/gene/823951 | grep "Gene Symbol"
But instead of printing any of the webpage, the terminal gives me:
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 142k 0 142k 0 0 41857 0 --:--:-- 0:00:03 --:--:-- 42083
Can anyone explain why it does this, and how I can search for specific lines of text in a webpage? I eventually want to compile information like gene names, types, and descriptions into a database, so after that I was hoping to export the results from grep into a text file.
Any help is extremely appreciated, thanks in advance!
curl detects that its output is not going to a terminal and shows you the progress meter instead. You can suppress the progress meter with -s.
The HTML data is indeed being sent to grep. However, that page does not contain the text "Gene Symbol": grep is case-sensitive (unless invoked with -i), and the page has "Gene symbol".
$ curl -s www.ncbi.nlm.nih.gov/gene/823951 | grep "Gene symbol"
<dt class="noline"> Gene symbol </dt>
You probably also want the next line of HTML, which you can make grep output with the -A option:
$ curl -s www.ncbi.nlm.nih.gov/gene/823951 | grep -A1 "Gene symbol"
<dt class="noline"> Gene symbol </dt>
<dd class="noline">AT3G47960</dd>
See man curl and man grep for more information about these and other options.
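Since the goal is to compile results into a text file later, you can simply redirect grep's output; a minimal sketch (the file name genes.txt is just an example):
$ curl -s www.ncbi.nlm.nih.gov/gene/823951 | grep -A1 "Gene symbol" >> genes.txt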
I am currently attempting to make a script where, when I enter the name of a vulnerability, it returns the CVSS3 scores from Tenable.
So far my plan is:
Curl the page
Grep the content I want
Output the grepped CVSS3 score
When running my script, however, grep throws the following error:
~/Documents/Tools/Scripts ❯ ./CVSS3-Grabber.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 30964 0 30964 0 0 28355 0 --:--:-- 0:00:01 --:--:-- 28355
grep: unrecognized option '-->Nessus<!--'
Usage: grep [OPTION]... PATTERNS [FILE]...
Try 'grep --help' for more information.
This has me very confused, as when I run this on the command line I curl the content to sample.txt and then use the exact same grep syntax:
grep $pagetext -e CVSS:3.0/E:./RL:./RC:.
it returns the content I need; however, when I run it via my script below...
#! /bin/bash
pagetext=$(curl https://www.tenable.com/plugins/nessus/64784)
cvss3_temporal=$(grep $pagetext -e CVSS:3.0/E:./RL:./RC:.)
echo $cvss3_temporal
I receive the errors above!
I believe this is because the '--' makes grep think that text from the page is an option it doesn't recognize, hence the error. I have tried copying the output of curl to a text file and then grepping that, rather than grepping straight from curl, but still no joy. Does anyone know of a method to get grep to ignore '--' or any other flag-like text when reading input? Or, alternatively, can I configure curl so that it only brings back text and no symbols?
Thanks in advance!
You don't need to store the curl response in a variable; just pipe curl into grep like this:
cvss3_temporal=$(curl -s https://www.tenable.com/plugins/nessus/64784 |
grep -F 'CVSS:3.0/E:./RL:./RC:.')
Note the use of -s in curl to suppress the progress meter and -F in grep to make sure you are searching for a fixed string.
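For reference, the original error came from the unquoted $pagetext: it is word-split, and any resulting token that begins with -- (such as '-->Nessus<!--') is parsed by grep as an option. A minimal sketch of the whole script using the pipe (keeping the dots as wildcards, since they look like single-character placeholders in the CVSS vector; use -F instead if they are meant literally):
#!/bin/bash
# Sketch of a corrected CVSS3-Grabber.sh, assuming the same URL and pattern
cvss3_temporal=$(curl -s https://www.tenable.com/plugins/nessus/64784 |
  grep -o 'CVSS:3.0/E:./RL:./RC:.')
echo "$cvss3_temporal"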
grep filters a given file, or standard input if none was given. In bash, you can use the <<< here-string syntax to send the variable's content to grep's input:
grep -e 'CVSS:3.0/E:./RL:./RC:.' <<< "$pagetext"
Or, if you don't need the page anywhere else, you can pipe the output from curl directly to grep:
curl https://www.tenable.com/plugins/nessus/64784 | grep -e 'CVSS:3.0/E:./RL:./RC:.'
I have a text file, mylinks.txt, with thousands of hyperlinks in the format "URL = http://examplelink.com".
What I want to do is search through all of these links and check whether any of them contain certain keywords, like "2018" or "2017". If a link contains a keyword, I want to save it in the file "yes.txt"; if it doesn't, it goes to the file "no.txt".
So at the end, I would end up with two files: one with the links that send me to pages containing the keywords I'm searching for, and another with the links that don't.
I was thinking about doing this with curl, but I don't even know if it's possible, and I also don't know how to "filter" the links by keywords.
What I have so far is:
curl -K mylinks.txt >> output.txt
But this only creates a super large file with the HTML of the links it fetches.
I've searched and read through various curl tutorials and haven't found anything that "selectively" searches pages and saves the links (not the content) of the pages matching the criteria.
-Untested-
For the links in lines containing "2017" or "2018":
cat mylinks.txt | grep -E '2017|2018' | grep -o 'http[^ ]*' >> yes.txt
To get the URLs of the lines that don't contain the keywords:
cat mylinks.txt | grep -vE '2017|2018' | grep -o 'http[^ ]*' >> no.txt
This is Unix piping. The pipe character | takes the stdout of the program on its left and feeds it to the stdin of the program on its right.
In Unix-like computer operating systems, a pipeline is a sequence of
processes chained together by their standard streams, so that the
output of each process (stdout) feeds directly as input (stdin) to the
next one. https://en.wikipedia.org/wiki/Pipeline_(Unix)
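If you prefer a single pass over the file, a rough sketch with awk (routing each line to yes.txt or no.txt) would be:
awk '/2017|2018/ { print > "yes.txt"; next } { print > "no.txt" }' mylinks.txt
This keeps the whole "URL = ..." line; add the same grep -o step afterwards if you only want the bare URLs.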
Here is my take on it (kind of tested on a URL file with a few examples).
This is supposed to be saved as a script; it's too long to type into the console directly.
#!/bin/bash
urlFile="/path/to/myLinks.txt"
# strip the "URL = " prefix, keeping only the third field (the URL itself)
cut -d' ' -f3 "$urlFile" | \
while read url
do
    echo "checking url $url"
    # -s silences curl's progress meter; the if tests grep's exit status
    if (curl -s "$url" | grep "2017")
    then
        echo "$url" >> /tmp/yes.txt
    else
        echo "$url" >> /tmp/no.txt
    fi
done
Explanation: the cut is necessary to strip the prefix "URL = " from each line. The URLs are then fed into the while-read loop. For each URL, we curl it and grep for the interesting keyword (in this case "2017"); if grep returns 0 (a match), we append the URL to the file with the interesting URLs, otherwise to the other file.
Obviously, you should adjust the paths and the keyword.
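If both keywords matter and you want the loop to stay quiet, the if inside the script could be widened like this (a small sketch meant to replace the if block above):
# -s silences curl's progress meter; -q makes grep report a match via its exit status only
if curl -s "$url" | grep -qE '2017|2018'
then
    echo "$url" >> /tmp/yes.txt
else
    echo "$url" >> /tmp/no.txt
fi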
I have created a small program consisting of a couple of shell scripts that work together. It is almost finished and everything seems to work fine, except for one thing that I'm not really sure how to do,
and which I need in order to finish this project...
There seem to be many routes that could be taken, but I just can't get there...
I have some curl results with a lot of unused data, including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (ignoring the always-same text).
The links are all similar and have this structure:
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
I was thinking about something like:
content="$(curl -s "$url/$param")"
linksArray= get from $content all links that are in the href section of the links
that contain "always same text"
declare highestnumber;
for file in $linksArray
do
href=${file##*/}
fullname=${href%.html}
OIFS="$IFS"
IFS='_'
read -a nameparts <<< "${fullname}"
IFS="$OIFS"
if (( ${nameparts[1]} > highestnumber ));
then
highestnumber=${nameparts[1]}
fi
done
echo ${nameparts[1]}_${highestnumber}.html
result:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is OK...
Thanks...
Update
I found this nice program; it is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use xidel, or use $ sed -n 's/.*href="\([^"]*\)".*/\1/p' file as suggested in that link, but tweaked to get the links with their HTML tags, like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep/sort approach mentioned below.
Continuing from the comment, you can use grep, sort and tail to isolate the highest-numbered link from your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest number in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is all one line, split by the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
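One caveat: sort compares the URLs lexically, so same-name_9.html would sort after same-name_19.html. If the numeric suffix can have a varying number of digits, GNU sort's version sort (-V) is a safer bet; a sketch:
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort -V | tail -n1); \
echo "myvar : '$myvar'"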
Why not use Xidel with xquery to sort the links and return the last?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
The input-format parameter makes sure you don't need any HTML tags at the start and end of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).
I am attempting to call an API for a series of IDs, and then leverage those IDs in a bash script using curl to query a machine for some information, and then scrub the data for only a select few things before outputting it.
#!/bin/bash
url="http://<myserver:myport>/ws/v1/history/mapreduce/jobs"
for a in $(cat jobs.txt); do
content="$(curl "$url/$a/counters" "| grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+'" )"
echo "$content" >> output.txt
done
This is for a MapR project I am currently working on to peel some fields out of the API.
In the example above, I only care about 6 fields, though the output that comes from the curl command gives me about 30 fields and their values, many of which are irrelevant.
If I use the curl command in a standard prompt, I get the fields I am looking for, but when I add it to the script I get nothing.
The quotes are in the wrong place: the pipe and the grep end up inside the argument passed to curl. Close the quote right after "$url/$a/counters" and drop the quotes around the pipe, like the following:
content="$(curl "$url/$a/counters" | grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+')"
Given this curl command:
curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"
* Spelling is intentionally incorrect. I want to grab the suggestion as my result.
I want to be able to either grep the page.html file, perhaps with grep -oE, or pipe it right from curl and never store a file.
The result should be: 'instantiate'
I need only the word 'instantiate', or the phrase; whatever Google is auto-correcting is what I am after.
Here is the basic html that is returned:
<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>
So perhaps match from/to the string below, which I hope is unique enough to cover all my bases.
class=spell><b><i>instantiate</i></b></a>
I keep running into issues with greedy grep; perhaps I should run it through an HTML prettify tool first to get a line break or 50 in there. I don't know of any simple way to do that in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up Perl and making sure I have the correct module.
Any suggestions? Thank you!
As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception since it relies on the specific structure of the page which could change at any time without notice.
grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'
In a pipe:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
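If your grep supports Perl-compatible regular expressions (-P), a somewhat less brittle variant that keys directly on the <b><i>...</i></b> wrapper (still a guess at the page structure) is:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" |
  grep -oP '(?<=<b><i>)[^<]+(?=</i></b>)'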
Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?
If you have ispell or aspell installed, you can do:
echo insansiate | ispell -a
and parse the result.
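In aspell's pipe mode, suggestions come back on lines of the form "& word count offset: suggestion1, suggestion2, ...", so the first suggestion can be pulled out with a short awk step; a sketch, assuming aspell is installed:
# Print the first (most likely) suggestion for the misspelled word
echo insansiate | aspell -a | awk -F': ' '/^&/ { split($2, s, ", "); print s[1] }'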
xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).
In the case at hand, the simple CSS selector a.spell will do the trick.
xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'
Note how xidel does its own page retrieval, so no need for curl in this case.
If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):
curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'
curl --> tidy -asxml --> xmlstarlet sel
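A rough sketch of that pipeline (the tidy flags and the XHTML namespace handling here are assumptions, and the page markup may well have changed):
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" |
  tidy -asxml -numeric -quiet 2>/dev/null |
  xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -v '//x:a[@class="spell"]'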
Edit: Sorry, did not see your Perl notice.
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $arg = shift // 'insansiate';
my $lwp = LWP::UserAgent->new(agent => 'Mozilla');
my $c = $lwp->get("http://www.google.com/search?q=$arg") or die $!;
my @content = split(/:/, $c->content);
for(@content) {
if(m;<b><i>(.+)</i></b>;) {
print "$1\n";
exit;
}
}
Running:
> perl google.pl
instantiate
> perl google.pl disconect
disconnect