using curl to call data, and grep to scrub output - bash

I am attempting to call an API for a series of IDs, and then use those IDs in a bash script with curl to query a machine for some information, scrubbing the data for only a select few fields before writing the output.
#!/bin/bash
url="http://<myserver:myport>/ws/v1/history/mapreduce/jobs"
for a in $(cat jobs.txt); do
content="$(curl "$url/$a/counters" "| grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+'" )"
echo "$content" >> output.txt
done
This is for a MapR project I am currently working on to peel some fields out of the API.
In the example above, I only care about 6 fields, though the output that comes from the curl command gives me about 30 fields and their values, many of which are irrelevant.
If I use the curl command in a standard prompt, I get the fields I am looking for, but when I add it to the script I get nothing.

The quoting is the problem: the pipe and the grep are being passed to curl as a second quoted argument instead of being run as a pipeline, so grep never sees the output. Move the pipe outside the quotes, like the following:
content="$(curl "$url/$a/counters" | grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+')"
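For completeness, here is a minimal sketch of the full loop with the quoting fixed (the URL, jobs.txt, and the counter names are taken from the question; -s is only added to keep curl's progress output out of the captured text, and a while read loop replaces the for $(cat ...) pattern):
#!/bin/bash
url="http://<myserver:myport>/ws/v1/history/mapreduce/jobs"
while read -r a; do
    # run curl and grep as a real pipeline, keeping only the six counters of interest
    content="$(curl -s "$url/$a/counters" \
        | grep -oP '(FILE_BYTES_READ[^:]+:\d+)|FILE_BYTES_WRITTEN[^:]+:\d+|GC_TIME_MILLIS[^:]+:\d+|CPU_MILLISECONDS[^:]+:\d+|PHYSICAL_MEMORY_BYTES[^:]+:\d+|COMMITTED_HEAP_BYTES[^:]+:\d+')"
    echo "$content" >> output.txt
done < jobs.txt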


How to use grep/awk/sed to print until a certain character?

I am a complete beginner at shell scripting and I am trying to iterate through a set of JSON files and extract a certain field out of each one. Each JSON file has a "country":"xxx" field. In each JSON file there are about 10k occurrences of the field with the same country name, so I only need the first occurrence, which I can get using "-m 1".
I tried to use grep for this but could not figure out how to extract the whole field, including the country name, from each file at the first occurrence.
for FILE in *.json;
do
grep -o -a -m 1 -h -r '"country":"' $FILE;
done
I tried adding another pipe with the pattern below, but it did not work:
| egrep -o '^[^"]+'
Actual Output:
"country":"
"country":"
"country":"
Desired Output:
"country:"romania"
"country:"united kingdom"
"country:"tajikistan"
but I need the whole thing. Any help would be great. Thanks
There is one general answer to the question "I only want the first occurrence", and that answer is:
... | head -n 1
This means: whatever you do, take the head (the first lines); the -n switch lets you say how many lines you want (one in this case).
The same can be done for the last occurrence(s), but then you use tail instead of head (it also takes the -n switch).
After trying many things, I found the pattern I was looking for.
grep -Po '"country":.*?[^\\]",' "$FILE" | head -n 1;
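Wrapped in the loop from the question, a minimal sketch looks like this (it assumes the same *.json files and the field format shown above):
for FILE in *.json; do
    # keep only the first complete "country":"..." field from each file
    grep -Po '"country":.*?[^\\]",' "$FILE" | head -n 1
done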

Pull information from wget data

I want to pull and display the twitter username of various accounts that I specify via the ID. I figured I could do this, in part, with wget.
echo what id would you like to search
read ID
wget https://twitter.com/intent/user?user_id=$ID > ~/temp/$ID
This is really as far as I got, as I can't figure out how to pull the data from it. I have tried this:
read ID
source ~/temp/$ID
echo $value
To echo anything that was labeled as "value" (the username is labeled as "value" several times).
Examples:
Stack Overflow's Twitter account is @stackoverflow, and their Twitter ID is 128700677, so I can run
wget https://twitter.com/intent/user?user_id=128700677
and the result will be a nice 248-line HTML document; you can try it and see. So basically, is there a way to have the script either go through and find the most common value="", or just display <title>Stack Overflow (@StackOverflow) on Twitter</title> without the <title></title> tags and the "on Twitter" part?
PS: Would this count as bootstrapping?
EDIT-----------------------------
This needs to be able to work with bash because I already have a system set up in bash. This will just help confirm #s
As that-other-guy said, it would be better to use the Twitter API to find that out. However, you can try to push your method a little bit further, like
wget -O - "https://twitter.com/intent/user?user_id=${ID}" | grep -Po "(?<=screen_name=).*(?=')" | head -n 1
to filter out strings like href='/intent/user?screen_name=StackOverflow' and extract what comes after the screen_name= part in the first such string.
P.S. I didn't notice a lot of value= in the HTML, to be honest, and sourcing HTML in your script is not the best thing to do, as you may end up executing something destructive that way.
screen_name could be fetched with:
read -r ID ;\
screen_name=$(wget -q -O - "http://twitter.com/intent/user?user_id=$ID" | sed -n 's/^.*button follow".*screen_name=\([^"]*\)".*$/\1/p')
printf "%s\n" "$screen_name"
nickname could be fetched with:
read -r ID ;\
nickname=$(wget -q -O - "https://twitter.com/intent/user?user_id=$ID" | sed -n 's/^.*"nickname">\([^<]*\)<.*$/\1/p')
printf "%s\n" "$nickname"
title could be fetched with:
read -r ID ;\
title=$(wget -q -O - "https://twitter.com/intent/user?user_id=$ID" | sed -n 's/^.*<title>\(.*\) on Twitter<.title>.*$/\1/p')
printf "%s\n" "$title"
The use of the REST API sounds like a better idea.
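If it helps, the screen_name lookup can be wrapped in a small function so it can be reused for each ID (a minimal sketch; the function name is only illustrative, and the wget/sed line is the one shown above):
#!/bin/bash
# Print the Twitter screen name for a numeric user ID, or nothing if no match is found.
screen_name_for_id() {
    wget -q -O - "https://twitter.com/intent/user?user_id=$1" \
        | sed -n 's/^.*button follow".*screen_name=\([^"]*\)".*$/\1/p'
}

read -r ID
screen_name_for_id "$ID"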

How to get the highest numbered link from curl result?

I have created a small program consisting of a couple of shell scripts that work together; it is almost finished
and everything seems to work fine, except for one thing that I'm not really sure how to do,
and which I need in order to finish this project.
There seem to be many routes that can be taken, but I just can't get there.
I have some curl results with lots of unused data, including different links, and among all that data there is a bunch of similar links.
I only need to get (into a variable) the link with the highest number (without the always-same text).
The links are all similar, and have this structure:
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
I was thinking about something like:
content="$(curl -s "$url/$param")"
linksArray= get from $content all links that are in the href section of the links
that contain "always same text"
declare highestnumber;
for file in $linksArray
do
href=${1##*/}
fullname=${href%.html}
OIFS="$IFS"
IFS='_'
read -a nameparts <<< "${fullname}"
IFS="$OIFS"
if ${nameparts[1]} > $highestnumber;
then
highestnumber=${nameparts[1]}
fi
done
echo ${nameparts[1]}_${highestnumber}.html
result:
https://always/same/link/unique-name_19.html
This was just my guess; any working code that can be run from a bash script is okay...
thanks...
Update
I found this nice program, Xidel, which is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
It looks awesome, but I'm not really sure how to tweak it to my needs.
Based on that link and the answer below, I guess a possible solution would be:
use Xidel, or use "sed -n 's/.*href="\([^"]*\)".*/\1/p' file" as suggested in that link, but tweaked to get the link together with its HTML tags, like:
<a href="https://always/same/link/same-name_17.html">always same text</a>
then filter out everything that doesn't end with ">always same text</a>",
and then use the grep/sort approach mentioned below.
Continuing from the comment, you can use grep, sort and tail to isolate the highest-numbered link in your list of similar links without too much trouble. For example, if your list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest-numbered link in a variable:
Example List
$ cat dat/links.txt
<a href="https://always/same/link/same-name_17.html">always same text</a>
<a href="https://always/same/link/same-name_18.html">always same text</a>
<a href="https://always/same/link/same-name_19.html">always same text</a>
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is a single command, split across lines by the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)
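Note that a plain sort orders these URLs lexically, so same-name_9.html would sort after same-name_19.html. If the numeric suffix can grow past one digit, a version sort is a safer bet, assuming GNU sort with the -V option is available:
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort -V | tail -n1); \
echo "myvar : '$myvar'"
With -V the digit runs in the filenames are compared numerically rather than character by character.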
Why not use Xidel with XQuery to sort the links and return the last one?
xidel -q links.txt --xquery '(for $i in //@href order by $i return $i)[last()]' --input-format xml
The --input-format parameter makes sure you don't need any html tags at the start and end of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).

Using grep (on Windows) to find a string contained by 's

I'm trying to write a shell script on Windows, which is why I'm not using something like awk or grep -o, etc.
I'm trying to parse my angular files for the controllers being used. For example, I'll have a line like this in a file.
widgetList.controller('widgetListController', [
What I want is to pull out widgetListController
Here's what I've got so far:
grep -h "[[:alpha:]]*Controller[[:alpha:]]*" C:/workspace/AM/$file | tr ' ' '\n' | grep -h "[[:alpha:]]*Controller[[:alpha:]]*"
It works decently well, but it will pull out the entire line like so:
widgetList.controller('widgetListController', rather than just the word.
Also in instances where the controller is formatted as so:
controller : 'widgetListController',
It returns 'widgetListController',
How can I adjust this to simply return whatever is between the 's? I've tried various ways of escaping that character but it doesn't seem to be working.
You can use this sed command:
sed "/Controller/s/.*'\([^']*\)'.*$/\1/" "C:/workspace/AM/$file"
Output:
widgetListController
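To sanity-check it against both formats from the question, you can feed the two sample lines quoted above straight into the same sed expression:
printf "%s\n" "widgetList.controller('widgetListController', [" "controller : 'widgetListController'," \
    | sed "/Controller/s/.*'\([^']*\)'.*$/\1/"
Both lines should print widgetListController, since the substitution drops everything outside the quoted controller name.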

bash script to perform dig -x

Good day. I was reading another post regarding resolving hostnames to IPs and only using the first IP in the list.
I want to do the opposite and used the following script:
#!/bin/bash
IPLIST="/Users/mymac/Desktop/list2.txt"
for IP in 'cat $IPLIST'; do
domain=$(dig -x $IP +short | head -1)
echo -e "$domain" >> results.csv
done < domainlist.txt
I would like to give the script a list of 1000+ IP addresses collected from a firewall log and resolve the list of destination IPs to domains. I only want one entry per IP in the response file, since I will be adding this to the CSV I exported from the firewall as another "column" in Excel. I could even use multiple responses as semicolon-separated values on one line (or /, |, \, *, etc.). The list2.txt file is standard ASCII; I have tried EOF formats from Mac, Linux, and Windows.
216.58.219.78
206.190.36.45
173.252.120.6
What I am getting now:
domainlist.txt ends up as an exact duplicate of list2.txt, while the results file has nothing in it. No errors come up on the screen when I run the script either.
I am running Mac OS X with Macports.
Your script has a number of syntax and stylistic errors. The minimal fix is to change the quotes around the cat:
for IP in `cat $IPLIST`; do
Single quotes produce a literal string; backticks (or the much preferred syntax $(cat $IPLIST)) perform a command substitution, i.e. run the command and insert its output. But you should fix your quoting, and preferably read the file line by line instead. We can also get rid of the useless echo.
#!/bin/bash
IPLIST="/Users/mymac/Desktop/list2.txt"
while read IP; do
dig -x "$IP" +short | head -1
done < "$IPLIST" >results.csv
It seems that in your /etc/resolv.conf you have configured a nameserver which does not support reverse lookups, and that's why the responses are empty.
You can pass the DNS server you want to use to the dig command. Let's say 8.8.8.8 (Google), for example:
dig @8.8.8.8 -x "$IP" +short | head -1
The command returns the domain with a . appended. If you want to remove that, you can additionally pipe to sed:
... | sed 's/.$//'
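Put together, a minimal sketch of the whole script might look like this (it assumes the list2.txt path from the question and the 8.8.8.8 resolver suggested above):
#!/bin/bash
IPLIST="/Users/mymac/Desktop/list2.txt"
while read -r IP; do
    # reverse-lookup each IP via 8.8.8.8, keep only the first answer, strip the trailing dot
    dig @8.8.8.8 -x "$IP" +short | head -1 | sed 's/\.$//'
done < "$IPLIST" > results.csv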
