BASH extract links from youtube html file - bash

I'm trying to make a YouTube music player on my Raspberry Pi, and I'm stuck at this point:
Wget downloads a page, for example https://www.youtube.com/results?search_query=test, to the file output.html.
Links on that page are stored in strings like this: <a href="/watch?v=DDzfeTTigKo"
Now when I try to grep them with cat site | grep -B 0 -A 0 watch?v=
it prints a wall of text from that file, and I just want the specific lines like the one mentioned above, saved to a file named site2.
Is this possible?

Try this with GNU grep:
grep -o '"/watch?v=[^"]*"' file.html

Related

how to copy all the URLs of a certain column of a web page?

I want to import several files into my server using wget; the 492 files are here:
https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736
So I want to copy the URLs of all the files in the "File Name" column, save them to a file, and import them with wget.
How can I copy all those URLs from that column?
Thanks for reading :)
Since you've tagged bash, this should work.
wget -O- is used to output the data to the standard output, where it's greppable. (curl would do that by default.)
grep -oE is used to capture the URLs (which happily are in a regular enough format that a simple regexp works).
Then, wget -i is used to read URLs from the file generated. You might wish to add -nc or other suitable partial-fetch flags; those files are pretty hefty.
wget -O- 'https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736' | grep -oE 'http://ftp.sra.ebi.ac.uk/[^"]+' > urls.txt
wget -i urls.txt
First, I recommend using a more specific and robust implementation...
but, in case you are up against a wall and in a hurry -
$: curl -s https://trace.ncbi.nlm.nih.gov/Traces/sra/?study=ERP001736 |
sed -En '/href="http:\/\/.*clean.fastq.gz"/{s/^.*href="([^"]+)".*/\1/;p;}' |
while read url; do wget "$url"; done
This is a quick and dirty rough first pass, but it will give you something to work with.
If you aren't in a screaming hurry, try writing something more robust and step-wise in perl or python.

Print links to all pdfs using bash

I'm writing a bash script that should download an HTML page and extract all links to PDF files from it.
I have to say that I'm a newbie to bash, so for now I can only grep all lines that contain <a href and afterwards grep the lines that contain the word pdf.
I can barely use awk, and I don't know how to write the right regex to get only the text in <a href="*.pdf"> where I want to have *.pdf.
EDIT: grep "... is not found.
Try applying this line to the whole HTML string. It works perfectly for me.
grep -io "<a[[:space:]]*href=\"[^\"]\+\.pdf\">" | awk 'BEGIN{FS="\""}{print $2}'

Use full URL as saved file name with wget

I'm using wget in terminal to download a large list of images.
example — $ wget -i images.txt
I have all the image URLS in the images.txt file.
However, the image URLs tend to look like example.com/uniqueNumber/images/main_250.jpg,
which means that all the images come out named main_250.jpg
What I really need is for the images to be saved with their entire URLs as filenames, so that the 'unique number' is part of each filename.
Any suggestions?
Presuming the urls for the images are in a text file named images.txt with one url per line you can run
cat images.txt | sed 'p;s/\//-/g' | sed 'N;s/\n/ -O /' | xargs wget
to download each and every image with a filename that was formed out of the url.
Now for the explanation:
in this example I'll use https://www.newton.ac.uk/files/covers/968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
as images.txt (you can add as many images as you like to your file, as long as they are in this same format).
cat images.txt pipes the content of the file to standard output
sed 'p;s/\//-/g' prints the file to stdout with the url on one line and then the intended filename on the next line, like so:
https://www.newton.ac.uk/files/covers/968361.jpg
https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY
https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
sed 'N;s/\n/ -O /' combines the two lines of each image (the url and the intended filename) into one line and adds the -O option in between (this is how wget knows that the second argument is the intended filename); the result for this part looks like this:
https://www.newton.ac.uk/files/covers/968361.jpg -O https:--www.newton.ac.uk-files-covers-968361.jpg
https://www.moooi.com/sites/default/files/styles/large/public/product-images/random_detail.jpg?itok=ErJveZTY -O https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY
and finally xargs wget runs wget once per line, using that line as its arguments; the end result in this example is two images in the current directory named https:--www.newton.ac.uk-files-covers-968361.jpg and https:--www.moooi.com-sites-default-files-styles-large-public-product-images-random_detail.jpg?itok=ErJveZTY respectively.
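If you'd rather skip the sed/xargs juggling, a plain while-read loop with bash parameter expansion does the same slash-to-dash substitution; a minimal sketch, assuming the same images.txt:
while read -r url; do
    # replace every / in the url with - and use the result as the output filename
    wget -O "${url//\//-}" "$url"
done < images.txt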
With GNU Parallel you can do:
cat images.txt | parallel wget -O '{= s:/:-:g; =}' {}
I have a not so elegant solution, that may not work everywhere.
You probably know that if your URL ends in a query, wget will use that query in the filename. e.g. if you have http://domain/page?q=blabla, you will get a file called page?q=blabla after download. Usually, this is annoying, but you can turn it to your advantage.
Suppose, you wanted to download some index.html pages, and wanted to keep track of their origin, as well as, avoid ending up with index.html, index.html.1, index.html.2, etc. in your download folder. Your input file urls.txt may look something like the following:
https://google.com/
https://bing.com/
https://duckduckgo.com/
If you call wget -i urls.txt you end up with those numbered index.html files. But if you "doctor" your urls with a fake query, you get useful file names.
Write a script that appends each url as a query to itself, e.g.
https://google.com/?url=https://google.com/
https://bing.com/?url=https://bing.com/
https://duckduckgo.com/?url=https://duckduckgo.com/
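A minimal sketch of such a script, assuming the original list is in urls.txt and that none of those URLs already carry a query string of their own:
while read -r url; do
    # append the url to itself as a fake query parameter
    printf '%s?url=%s\n' "$url" "$url"
done < urls.txt > urls.doctored.txt && mv urls.doctored.txt urls.txt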
Looks cheesy, right? But if you now execute wget -i urls.txt, you get the following files:
index.html?url=https:%2F%2Fbing.com%2F
index.html?url=https:%2F%2Fduckduckgo.com%2F
index.html?url=https:%2F%2Fgoogle.com%2F
instead of non-descript numbered index.htmls. Sure, they look ugly, but you can clean up the filenames, and voilà! Each file will have its origin as its name.
The approach probably has some limitations, e.g. if the site you are downloading from actually executes the query and parses the parameters, etc.
Otherwise, you'll have to solve the file name/source url problem outside of wget, either with a bash script or in other programming languages.

How can I find text after some string in bash

I have this bash script, and it works:
DIRECTORY='1.20_TRUNK/mips-tuxbox-oe1.6'
# Download the HTML page and save it to the file ump.tmp
wget -O 'ump.tmp' "http://download.oscam.cc/index.php?&direction=0&order=mod&directory=$DIRECTORY&"
ft='index.php?action=downloadfile&filename=oscam-svn'
st="-webif-Distribution.tar.gz&directory=$DIRECTORY&"
The file ump.tmp contains e.g. three links.
I need to find only the number 10082 in the first "a" link of the page. But this number changes; when you run the script, e.g. a month later, it may be different.
I do not have the "cat" command. I have a receiver, not a Linux box; the receiver runs the Enigma system and "cat" isn't implemented.
I tried "sed", but it does not work.
sed -n "/filename=oscam-svn/,/-mips-tuxbox-webif/p" ump.tmp
Using a proper XHTML parser:
$ xmllint --html --xpath '//a/@href[contains(., "downloadfile")]' ump.tmp 2>/dev/null |
grep -oP "oscam-svn\K\d+"
But that string is not in the given HTML file.
"Find" is kind of vague, but you can use grep to get the link with the number 10082 in it from the temp file.
$ grep "10082" ump.tmp

Need the name for a URL that contains lots of garbage except the name. (Advanced BASH)

http://romhustler.net/file/54654/RFloRzkzYjBxeUpmSXhmczJndVZvVXViV3d2bjExMUcwRmdhQzltaU5UUTJOVFE2TVRrM0xqZzNMakV4TXk0eU16WTZNVE01TXpnME1UZ3pPRHBtYVc1aGJGOWtiM2R1Ykc5aFpGOXNhVzVy <-- Url that needs to be identified
http://romhustler.net/rom/ps2/final-fantasy-x-usa <-- Parent url
If you copy-paste this URL you will see the browser identify the file's name. How can I get a bash script to do the same?
I need to WGET the first URL, but because it will be for 100 more items I can't copy-paste each URL.
I currently have the menu set up for all the files. I just don't know how to mass download each file individually, as the URLs for the files have no matching patterns.
Bits of my working menu:
#Raw gamelist grabber
w3m http://romhustler.net/roms/ps2 |cat|egrep "/5" > rawmenu.txt
# splits the initial file into files of 10 lines each (games00, games01, ...)
# -d uses numeric suffixes (00, 01, ...) for those files
split -l 10 -d rawmenu.txt games
# s/ /_/g - replaces spaces with underscores
# s/__.*//g - removes anything after two underscores
select opt in\
$(cat games0$num|sed -e 's/ /_/g' -e 's/__.*//g')\
"Next"\
"Quit" ;
if [[ "$opt" =~ "${lines[0]}" ]];
then
### Here the URL needs to be grabbed ###
This has to be done in BASH. Is this possible?
It appears that romhustler.net uses some JavaScript on their full download pages to hide the final download link for a few seconds after the page loads, possibly to prevent this kind of web scraping.
However, if they were using direct links to ZIP files for example, we could do this:
# Use curl to get the HTML of the page and egrep to match the hyperlinks to each ROM
curl -s http://romhustler.net/roms/ps2 | egrep -o "rom/ps2/[a-zA-Z0-9_-]+" > rawmenu.txt
# Loop through each of those links and extract the full download link
while read LINK
do
# Extract full download link
FULLDOWNLOAD=`curl -s "http://romhustler.net$LINK" | egrep -o "/download/[0-9]+/[a-zA-Z0-9]+"`
# Download the file
wget "http://romhustler.net$FULLDOWNLOAD"
done < "rawmenu.txt"
