how can I scrap data from reddit (in bash)

how can I scrap data from reddit (in bash) - bash

I want to scrap titles and date from http://www.reddit.com/r/movies.json in bash
wget -q -O - "http://www.reddit.com/r/movies.json" | grep -Po '(?<="title": ").*?(?=",)' | sed 's/\"/"/'
I have titles but I don't know how to add dates, can someone help?

wget -q -O - "http://www.reddit.com/r/movies.json" | grep -Po
'(?<="title": ").*?(?=",)' | sed 's/"/"/'
As extension suggest it is JSON (application/json) file, therefore grep and sed are poorly suited for working with it, as they are mainly for using regular expressions. If you are allowed to install tools, jq should be handy here. Try using your system package manager to install it, if it succeed you should get pretty printed version of movies.json by doing
wget -q -O - "http://www.reddit.com/r/movies.json" | jq
and then find where interesting values are placed which should allow you to grab it. See jq Cheat Sheet for example of jq usage. If you are limited to already installed tools I suggest taking look at json module of python.

Related

Grep title of a page which is written with spaces [duplicate]

This question already has answers here:
Parsing XML using unix terminal
(9 answers)
Closed last month.
I am trying to get the meta title of some website...
some people write title like
`<title>AllHeart Web INC, IT Services Digital Solutions Technology
</title>
`
`<title>AllHeart Web INC, IT Services Digital Solutions Technology</title>`
`<title>
AllHeart Web INC, IT Services Digital Solutions Technology
</title>`
some like more ways... my current focus on above 3 ways...
I wrote a simple code, it only capture 2nd way of title written, but i am not sure how can I grep the other ways,
`curl -s https://allheartweb.com/ | grep -o '<title>.*</title>'`
I also made a code (very bad i guess)
where i can grep number of line like
`
% curl -s https://allheartweb.com/ | grep -n '<title>'
7:<title>AllHeart Web INC, IT Services Digital Solutions Technology
% curl -s https://allheartweb.com/ | grep -n '</title>'
8:</title>
`
and store it and run loop to get title item... which i guess a bad idea...
any help I can get all possible of getting title?

Try this:
curl -s https://allheartweb.com/ | tr -d '\n' | grep -m 1 -oP '(?<=<title>).+?(?=</title>)'
You can remove newlines from HTML via tr because they have no meaning in the title. The next step returns the first match of the shortest string enclosed in <title> </title>.
This is quite a simple approach of course. xmllint would be better but that's not available to all platforms by default.

'grep' is not a very good tool to match multiple lines. It is processing line-by-line. You could hack that by making your incoming text one line like
curl -s https://allheartweb.com/ | xargs | grep -o -E "<title>.*</title>"
This is probably what you want.

Try this sed:
curl -s https://allheartweb.com/ | sed -n "{/<title>/,/<\/title>/p}"

how to copy all the URLs of a certain column of a web page?

I want to import several number of files into my server using wget , the 492 files are here:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736
so I want to copy the URL of all files in "File Name" column to save them into a file and import them with wget.
So how can I copy all those URLs from that column ?
Thanks for reading :)

Since you've tagged bash, this should work.
wget -O- is used to output the data to the standard output, where it's greppable. (curl would do that by default.)
grep -oE is used to capture the URLs (which happily are in a regular enough format that a simple regexp works).
Then, wget -i is used to read URLs from the file generated. You might wish to add -nc or other suitable partial-fetch flags; those files are pretty hefty.
wget -O- https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' > urls.txt
wget -i urls.txt

First, I recommend using a more specific and robust implementation...
but, in the case you are against a wall and in a hurry -
$: curl -s https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 |
sed -En '/href="http:\/\/.*clean.fastq.gz"/{s/^.*href="([^"]+)".*/\1/;p;}' |
while read url; do wget "$url"; done
This is a quick and dirty rough first pass, but it will give you something to work with.
If you aren't in a screaming hurry, try writing something more robust and step-wise in perl or python.

How to get the highest numbered link from curl result?

i have create small program consisting of a couple of shell scripts that work together, almost finished
and everything seems to work fine, except for one thing of which i'm not really sure how to do..
which i need, to be able to finish this project...
there seem to be many routes that can be taken, but i just can't get there...
i have some curl results with lots of unused data including different links, and between all data there is a bunch of similar links
i only need to get (into a variable) the link of the highest number (without the always same text)
the links are all similar, and have this structure:
always same text
always same text
always same text
i was thinking about something like;
content="$(curl -s "$url/$param")"
linksArray= get from $content all links that are in the href section of the links
that contain "always same text"
declare highestnumber;
for file in $linksArray
do
href=${1##*/}
fullname=${href%.html}
OIFS="$IFS"
IFS='_'
read -a nameparts <<< "${fullname}"
IFS="$OIFS"
if ${nameparts[1]} > $highestnumber;
then
highestnumber=${nameparts[1]}
fi
done
echo ${nameparts[1]}_${highestnumber}.html
result:
https://always/same/link/unique-name_19.html
this was just my guess, any working code that can be run from bash script is oke...
thanks...
update
i found this nice program, it is easily installed by:
# 64bit version
wget -O xidel/xidel_0.9-1_amd64.deb https://sourceforge.net/projects/videlibri/files/Xidel/Xidel%200.9/xidel_0.9-1_amd64.deb/download
apt-get -y install libopenssl
apt-get -y install libssl-dev
apt-get -y install libcrypto++9
dpkg -i xidel/xidel_0.9-1_amd64.deb
it looks awsome, but i'm not really sure how to tweak it to my needs.
based on that link and the below answer, i guess a possible solution would be..
use xidel, or use "$ sed -n 's/.href="([^"]).*/\1/p' file" as suggested in this link, but then tweak it to get the link with html tags like:
< a href="https://always/same/link/same-name_17.html">always same text< /a>
then filter out all that doesn't end with ( ">always same text< /a> )
and then use the grep sort as mentioned below.

Continuing from the comment, you can use grep, sort and tail to isolate the highest number of your list of similar links without too much trouble. For example, if you list of links is as you have described (I've saved them in a file dat/links.txt for the purpose of the example), you can easily isolate the highest number in a variable:
Example List
$ cat dat/links.txt
always same text
always same text
always same text
Parsing the Highest Numbered Link
$ myvar=$(grep -o 'https:.*[.]html' dat/links.txt | sort | tail -n1); \
echo "myvar : '$myvar'"
myvar : 'https://always/same/link/same-name_19.html'
(note: the command above is all one line separate by the line-continuation '\')
Applying Directly to Results of curl
Whether your list is in a file, or returned by curl -s, you can apply the same approach to isolate the highest number link in the returned list. You can use process substitution with the curl command alone, or you can pipe the results to grep. E.g. as noted in my original comment,
$ myvar=$(grep -o 'https:.*[.]html' < <(curl -s "$url/$param") | sort | tail -n1); \
echo "myvar : '$myvar'"
or pipe the result of curl to grep,
$ myvar=$(curl -s "$url/$param" | grep -o 'https:.*[.]html' | sort | tail -n1); \
echo "myvar : '$myvar'"
(same line continuation note.)

Why not use Xidel with xquery to sort the links and return the last?
xidel -q links.txt --xquery "(for $i in //#href order by $i return $i)[last()]" --input-format xml
The input-format parameter makes sure you don't need any html tags at the start and ending of your txt file.
If I'm not mistaken, in the latest Xidel the -q (quiet) param is replaced by -s (silent).

Get Macbook screen size from terminal/bash

Does anyone know of any possible way to determine or glean this information from the terminal (in order to use in a bash shell script)?
On my Macbook Air, via the GUI I can go to "About this mac" > "Displays" and it tells me:
Built-in Display, 13-inch (1440 x 900)
I can get the screen resolution from the system_profiler command, but not the "13-inch" bit.
I've also tried with ioreg without success. Calculating the screen size from the resolution is not accurate, as this can be changed by the user.
Has anyone managed to achieve this?

I think you could only get the display model-name which holds a reference to the size:
ioreg -lw0 | grep "IODisplayEDID" | sed "/[^<]*</s///" | xxd -p -r | strings -6 | grep '^LSN\|^LP'
will output something like:
LP154WT1-SJE1
which depends on the display manufacturer. But as you can see the first three numbers in this model name string imply the display-size: 154 == 15.4''
EDIT
Found a neat solution but it requires an internet connection:
curl -s http://support-sp.apple.com/sp/product?cc=`system_profiler SPHardwareDataType | awk '/Serial/ {print $4}' | cut -c 9-` |
sed 's|.*<configCode>\(.*\)</configCode>.*|\1|'
hope that helps

The next script:
model=$(system_profiler SPHardwareDataType | \
/usr/bin/perl -MLWP::Simple -MXML::Simple -lane '$c=substr($F[3],8)if/Serial/}{
print XMLin(get(q{http://support-sp.apple.com/sp/product?cc=}.$c))->{configCode}')
echo "$model"
will print for example:
MacBook Pro (13-inch, Mid 2010)
Or the same without perl but more command forking:
model=$(curl -s http://support-sp.apple.com/sp/product?cc=$(system_profiler SPHardwareDataType | sed -n '/Serial/s/.*: \(........\)\(.*\)$/\2/p')|sed 's:.*<configCode>\(.*\)</configCode>.*:\1:')
echo "$model"
It is fetched online from apple site by serial number, so you need internet connection.

I've found that there seem to be several different Apple URLs for checking this info. Some of them seem to work for some serial numbers, and others for other machines.
e.g:
https://selfsolve.apple.com/wcResults.do?sn=$Serial&Continue=Continue&num=0
https://selfsolve.apple.com/RegisterProduct.do?productRegister=Y&country=USA&id=$Serial
http://support-sp.apple.com/sp/product?cc=$serial (last 4 digits)
https://selfsolve.apple.com/agreementWarrantyDynamic.do
However, the first two URLs are the ones that seem to work for me. Maybe it's because the machines I'm looking up are in the UK and not the US, or maybe it's due to their age?
Anyway, due to not having much luck with curl on the command line (The Apple sites redirect, sometimes several times to alternative URLs, and the -L option doesn't seem to help), my solution was to bosh together a (rather messy) PHP script that uses PHP cURL to check the serials against both URLs, and then does some regex trickery to report the info I need.
Once on my web server, I can now curl it from the terminal command line and it's bringing back decent results 100% of the time.
I'm a PHP novice so I won't embarrass myself by posting the script up in it's current state, but if anyone's interested I'd be happy to tidy it up and share it on here (though admittedly it's a rather long winded solution to what should be a very simple query).
This info really should be simply made available in system_profiler. As it's available through System Information.app, I can't see a reason why not.

Hi there for my bash script , under GNU/Linux : I make the follow to save
# Resolution Fix
echo `xrandr --current | grep current | awk '{print $8}'` >> /tmp/width
echo `xrandr --current | grep current | awk '{print $10}'` >> /tmp/height
cat /tmp/height | sed -i 's/,//g' /tmp/height
WIDTH=$(cat /tmp/width)
HEIGHT=$(cat /tmp/height)
rm /tmp/width /tmp/height
echo "$WIDTH"'x'"$HEIGHT" >> /tmp/Resolution
Resolution=$(cat /tmp/Resolution)
rm /tmp/Resolution
# Resolution Fix
and the follow in the same script for restore after exit from some app / game
in some S.O
This its execute command directly
ResolutionRestore=$(xrandr -s $Resolution)
But if dont execute call the variable with this to execute the varible content
$($ResolutionRestore)
And the another way you can try its with the follow for example
RESOLUTION=$(xdpyinfo | grep -i dimensions: | sed 's/[^0-9]*pixels.*(.*).*//' | sed 's/[^0-9x]*//')
VRES=$(echo $RESOLUTION | sed 's/.*x//')
HRES=$(echo $RESOLUTION | sed 's/x.*//')

Retrieve latest twitter tweet with a particular hashtag via CLI

I'm looking for a way to retrieve the latest tweet that contains a particular twitter hashtag via the cli (bash). Something that would run like: "./get-tweet.sh blah" and return "Dude I'm feeling so #blah" Thanks!
Looks like I can get the rss feed by doing this:
curl -s 'http://search.twitter.com/search.rss?q=%23blah&rpp=1'
I would then just need to cut out the correct xml

I'd look at TTYtter for doing that. In particular it allows for scripting like so:
ttytter -runcommand="/search #haiku"
You'll need to do the initial setup interactively though for oAuth.

OK, I hacked together a solution:
!/bin/bash
curl -s "http://search.twitter.com/search.rss?q=%23$1&rpp=1" > /tmp/hashtag.xml
xmllint --xpath '/rss/channel/item/title/text()' /tmp/hashtag.xml | sed 's/http\://t.co/*//g' | sed 's/#//g' | sed 's/#//g' |xargs -i -0 echo -n "{}"
echo
rm /tmp/hashtag.xml

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio