How to get the highest numbered link from curl result? - bash

I have created a small program consisting of a couple of shell scripts that work together. It is almost finished
and everything seems to work fine, except for one thing that I'm not really sure how to do,
which I need in order to finish this project.
There seem to be many routes that can be taken, but I just can't get there.
I have some curl results with lots of unused data, including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (without the "always same text" part).
The links are all similar, and have this structure:
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
I was thinking about something like:
content="$(curl -s "$url/$param")"
# the part I don't know how to do: get from $content all links that are in
# the href attribute of the anchors that contain "always same text"
linksArray=
declare highestnumber=0
for file in $linksArray
do
  href=${file##*/}
  fullname=${href%.html}
  OIFS="$IFS"
  IFS='_'
  read -a nameparts <<< "${fullname}"
  IFS="$OIFS"
  if (( ${nameparts[1]} > highestnumber ))
  then
    highestnumber=${nameparts[1]}
  fi
done
echo "${nameparts[0]}_${highestnumber}.html"
result:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is okay.
Thanks.
Update
I found this nice program; it is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use xidel, or use  $ sed -n 's/.*href="\([^"]*\)".*/\1/p' file  as suggested in this link, but tweak it so it works on links with HTML tags like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep sort as mentioned below.
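A rough sketch of that plan (untested, and assuming each matching anchor sits on its own line in the curl output, the anchor text is literally "always same text", and every link ends in a single _NUMBER.html) could look like:
content="$(curl -s "$url/$param")"
# keep only the anchors with the expected text, pull out the href value,
# sort numerically on the part after the underscore, take the last (highest) one
highest=$(printf '%s\n' "$content" \
  | grep '>always same text</a>' \
  | sed -n 's/.*href="\([^"]*\)".*/\1/p' \
  | sort -t_ -k2 -n \
  | tail -n1)
echo "$highest"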

Continuing from the comment, you can use grep, sort and tail to isolate the highest number of your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest number in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is one logical command, split across two lines with the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
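One caveat: plain sort compares the URLs lexically, so same-name_9.html would sort above same-name_10.html. If the numbers can have different digit counts, a version sort is safer (sort -V is a GNU sort extension), e.g.:
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort -V | tail -n1); \
echo "myvar : '$myvar'"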

Why not use Xidel with xquery to sort the links and return the last?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
The input-format parameter makes sure you don't need any html tags at the start and ending of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).
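If you want to skip the intermediate file, xidel can also read from standard input with - (as shown further below), so something along these lines should work against the curl output directly; note the anchor-text filter and the lexical ordering are assumptions here, same as with the file-based version:
curl -s "$url/$param" | xidel - -s --xquery '(for $i in //a[.="always same text"]/@href order by $i return $i)[last()]'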

Related

how to copy all the URLs of a certain column of a web page?

I want to import a number of files into my server using wget; the 492 files are here:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736
so I want to copy the URLs of all files in the "File Name" column, save them into a file, and import them with wget.
So how can I copy all those URLs from that column?
Thanks for reading :)
Since you've tagged bash, this should work.
wget -O- is used to output the data to the standard output, where it's greppable. (curl would do that by default.)
grep -oE is used to capture the URLs (which happily are in a regular enough format that a simple regexp works).
Then, wget -i is used to read URLs from the file generated. You might wish to add -nc or other suitable partial-fetch flags; those files are pretty hefty.
wget -O- https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' > urls.txt
wget -i urls.txt
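For example, if the batch gets interrupted and you re-run it, -nc makes wget skip files that were already downloaded rather than fetching them again:
wget -nc -i urls.txt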
First, I recommend using a more specific and robust implementation...
but in case you are against a wall and in a hurry -
$: curl -s https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 |
sed -En '/href="http:\/\/.*clean.fastq.gz"/{s/^.*href="([^"]+)".*/\1/;p;}' |
while read url; do wget "$url"; done
This is a quick and dirty rough first pass, but it will give you something to work with.
If you aren't in a screaming hurry, try writing something more robust and step-wise in perl or python.

Pull information from wget data

I want to pull and display the twitter username of various accounts that I specify via the ID. I figured I could do this, in part, with wget.
echo what id would you like to search
read ID
wget https://twitter.com/intent/user?user_id=$ID > ~/temp/$ID
This is really as far as I got, as I can't figure out how to pull the data from it. I have tried this:
read ID
source ~/temp/$ID
echo $value
To echo anything that was labeled as "value" (the username is labeled as "value" several times).
Examples:
Stack Overflow's Twitter account is @stackoverflow, and their Twitter ID is 128700677, so I can run
wget https://twitter.com/intent/user?user_id=128700677
and the document will be a nice 248-line-long HTML document; you can try it and see. So basically, is there a way to have the script either go through and find the most common value="" or just go to/display <title>Stack Overflow (@StackOverflow) on Twitter</title> without the <title></title> tags and the " on Twitter" part?
PS: Would this count as bootstrapping?
EDIT-----------------------------
This needs to be able to work with bash because I already have a system set up in bash. This will just help confirm #s
As that-other-guy said, it would be better to use the Twitter API to find that out. However, you can try and push your method a little bit further, like
wget -O - "https://twitter.com/intent/user?user_id=${ID}" | grep -Po "(?<=screen_name=).*(?=')" | head -n 1
to filter out strings like href='/intent/user?screen_name=StackOverflow' and extract what comes after the screen_name= part in the first such string.
P.S. I didn't notice a lot of value= in the HTML, to be honest, and sourcing something like HTML in your script is not the best thing to do, as you may end up executing something destructive that way.
screen_name could be fetched with:
read -r ID ;\
screen_name=$(wget -q -O - http://twitter.com/intent/user?user_id="$ID" | sed -n 's/^.*button follow".*screen_name=\([^"]*\)".*$/\1/p')
printf "%s\n" "$screen_name"
nickname could be fetched with:
read -r ID ;\
nickname=$(wget -q -O - "https://twitter.com/intent/user?user_id=$ID" | sed -n 's/^.*"nickname">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$nickname"
title could be fetched with:
read -r ID ;\
title=$(wget -q -O - "https://twitter.com/intent/user?user_id=$ID" | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"
Using the REST API sounds like a better idea.
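For reference, a minimal sketch of the API route might look like the following. This is purely illustrative: it assumes you have a valid bearer token for the v1.1 users/show endpoint (access requirements and endpoints have changed over time) and that jq is installed.
read -r ID
# BEARER_TOKEN is a placeholder for your own app credential
curl -s -H "Authorization: Bearer $BEARER_TOKEN" \
  "https://api.twitter.com/1.1/users/show.json?user_id=$ID" | jq -r '.screen_name'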

using curl to call data, and grep to scrub output

I am attempting to call an API for a series of IDs, then use those IDs in a bash script with curl to query a machine for some information, and then scrub the data for only a select few fields before outputting it.
#!/bin/bash
url="http://<myserver:myport>/ws/v1/history/mapreduce/jobs"
for a in $(cat jobs.txt); do
content="$(curl "$url/$a/counters" "| grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+'" )"
echo "$content" >> output.txt
done
This is for a MapR project I am currently working on to peel some fields out of the API.
In the example above, I only care about 6 fields, though the output that comes from the curl command gives me about 30 fields and their values, many of which are irrelevant.
If I use the curl command in a standard prompt, I get the fields I am looking for, but when I add it to the script I get nothing.
Remove the quotes around the pipe and the grep command; as written, the pipe and the grep pattern are passed to curl as part of its URL argument instead of being run as a separate command. Like the following:
content="$(curl "$url/$a/counters" | grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+')"

Grab just the first filename from a zip file stream?

I want to extract just the first filename from a remote zip archive without downloading the entire zip. In particular, I'm trying to get the build number of dartium (link to zip file). Since the file is quite large, I don't want to download the entire thing.
If I download the entire thing, unzip -l reports the first file as being: 0 2013-04-07 12:18 dartium-lucid64-inc-21033.0/. I want to get just this filename so I can parse out the 21033 portion as the build number.
I was doing this (total hack):
_url="https://storage.googleapis.com/dartium-archive/continuous/dartium-lucid64.zip"
curl -s $_url | head -c 256 | sed -n "s:.*dartium-lucid64-inc-\([0-9]\+\).*:\1:p"
It was working when I had my shell in ASCII mode, but I recently switched it to UTF-8 and it seems sed is now honoring that, which breaks my script.
I thought about hacking it by doing:
export LANG=
curl -s ...
But that seemed like an even bigger hack.
Is there a better way?
First, you can request a byte range using curl.
Next, use "strings" to extract all printable strings from the binary stream.
Add "q" after "p" to quit after the first occurrence is found.
curl -s $_url -r0-256 | strings | sed -n "s:.*dartium-lucid64-inc-\([0-9]\+\).*:\1:p;q"
Or this:
curl -s $_url -r0-256 | strings | sed -n "/dartium-lucid64/{s:.*-\([^-]\+\)\/.*:\1:p;q}"
It should be a bit faster and more reliable. It also extracts the full version, including the subversion (if you need it).

Extract a specific string from a curl'd result

Given this curl command:
curl --user-agent "fogent" --silent -o page.html "http://www.google.com/search?q=insansiate"
* Spelling is intentionally incorrect. I want to grab the suggestion as my result.
I want to be able to either grep into the page.html file perhaps with grep -oE or pipe it right from curl and never store a file.
The result should be: 'instantiate'
I need only the word 'instantiate' (or the phrase, whatever Google is auto-correcting); that is what I am after.
Here is the basic html that is returned:
<span class=spell style="color:#cc0000">Did you mean: </span><a href="/search?hl=en&ie=UTF-8&&sa=X&ei=VEMUTMDqGoOINraK3NwL&ved=0CB0QBSgA&q=instantiate&spell=1"class=spell><b><i>instantiate</i></b></a> <span class=std>Top 2 results shown</span>
So perhaps from/to of the string below, which I hope is unique enough to cover all my bases.
class=spell><b><i>instantiate</i></b></a>
I keep running into issues with greedy grep; perhaps I should run it through an HTML prettify tool first to get a line break or 50 in there. I don't know of any simple way to do so in bash, which is what I would ideally like this to be in. I really don't want to deal with firing up Perl and making sure I have the correct module.
Any suggestions? Thank you.
As I'm sure you're aware, screen scraping is a delicate business. This command sequence is no exception since it relies on the specific structure of the page which could change at any time without notice.
grep -o 'Did you mean:\([^>]*>\)\{5\}' page.html | sed 's/.*<i>\([^<]*\)<.*/\1/'
In a pipe:
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" | grep -o 'Did you mean:\([^>]*>\)\{5\}' | sed 's/.*<i>\([^<]*\)<.*/\1/'
This relies on finding five ">" characters between "Did you mean:" and the "</i>" after the word you're looking for.
Have you considered other methods of getting spelling suggestions or are you specifically interested in what Google provides?
If you have ispell or aspell installed, you can do:
echo insansiate | ispell -a
and parse the result.
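For example, aspell's pipe mode prints suggestion lines that start with '&', with the candidates listed after the colon, so (assuming aspell is installed) the first suggestion could be pulled out roughly like this:
echo insansiate | aspell -a | awk -F': ' '/^&/ {split($2, s, ", "); print s[1]; exit}'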
xidel is a great utility for scraping web pages; it supports retrieving pages and extracting information in various query languages (CSS selectors, XPath).
In the case at hand, the simple CSS selector a.spell will do the trick.
xidel --user-agent "fogent" "http://google.com/search?q=insansiate" -e 'a.spell'
Note how xidel does its own page retrieval, so no need for curl in this case.
If, however, you needed curl for more exotic retrieval options, here's how you'd combine the two tools (line break for readability):
curl --user-agent "fogent" --silent "http://google.com/search?q=insansiate" |
xidel - -e 'a.spell'
curl --> tidy -asxml --> xmlstarlet sel
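Spelled out, that pipeline could look roughly like the sketch below (untested; tidy -asxml emits XHTML, so the XHTML namespace has to be bound for xmlstarlet, and the class="spell" anchor from the sample HTML above is assumed):
curl --user-agent "fogent" --silent "http://www.google.com/search?q=insansiate" |
  tidy -q -asxml --numeric-entities yes 2>/dev/null |
  xmlstarlet sel -N x="http://www.w3.org/1999/xhtml" -t -v '//x:a[@class="spell"]'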
Edit: Sorry, did not see your Perl notice.
#!/usr/bin/perl
use strict;
use LWP::UserAgent;
my $arg = shift // 'insansiate';
my $lwp = LWP::UserAgent->new(agent => 'Mozilla');
my $c = $lwp->get("http://www.google.com/search?q=$arg") or die $!;
my @content = split(/:/, $c->content);
for(@content) {
if(m;<b><i>(.+)</i></b>;) {
print "$1\n";
exit;
}
}
Running:
> perl google.pl
instantiate
> perl google.pl disconect
disconnect