I am trying to set up a stream for some scanners I have found on Broadcastify. The problem is that the URLs they use are dynamic and only stay the same for a few hours at a time. I would like to create a shell script that can simply scan the page from which the stream is accessed (which does have a static URL) and return the current URL of the stream, which can then be fed to the audio player.
For instance, right now the following stream at https://www.broadcastify.com/listen/feed/30185/web has a stream at http://audio12.broadcastify.com/kq2ydfr1jz98shw.mp3
However, that stream link will only work for a short period of time. I need an MP3 stream like the one above.
I only have minor experience with shell scripting, so I'm wondering what the best approach would be here. Specifically, my first problem is that if I simply "View page source" and search for "mp3", there are no results. I can only find the URL by inspecting the element (F12 developer tools) and, in Chrome for instance, going to Application → Frames → Media. I thought I could do a "view frame source" on the audio player in the past, but that option isn't there now.
I imagine I could use grep if I was able to curl the source code, but I'm not sure what I would need to curl here, if that makes sense.
UPDATE
Thanks mk12 for the insight. Based on that, here is my shell script:
#!/bin/bash
curl "https://www.broadcastify.com/listen/feed/$1/web" | grep webAuth > /var/tmp/broadcastifyauth$1.txt
pta=`cat /var/tmp/broadcastifyauth$1.txt | sed -i 's/$.ajaxSetup({ headers: { "webAuth": "//g' /var/tmp/broadcastifyauth$1.txt`
pta=`cat /var/tmp/broadcastifyauth$1.txt | sed -i 's/" }});//g' /var/tmp/broadcastifyauth$1.txt`
auth=`cat /var/tmp/broadcastifyauth$1.txt`
echo $auth
curl "https://www.broadcastify.com/listen/webpl.php?feedId=$1" --request POST --header "webAuth: $auth" --data 't=14' >/var/tmp/broadcastify$1.txt
pta=`cat /var/tmp/broadcastify$1.txt | grep -o 'http://[^"]*' > /var/tmp/broadcastify$1.b.txt`
pta=`cat /var/tmp/broadcastify$1.b.txt`
echo $pta
#pta=`cat /var/tmp/broadcastify$1.txt | sed -n '/<audio/s/^.*<audio width="300px" id="mePlayer_$1" src="\([^"]*\)".*/\1/p' > /var/tmp/broadcastify$1.b.txt`
#ptb=`cat /var/tmp/broadcastify$1.b.txt`
#echo $ptb
Here is its output:
root@na01:/etc/asterisk/scripts/music# ./broadcastify.sh 30185
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9175 100 9175 0 0 51843 0 --:--:-- --:--:-- --:--:-- 52130
74f440ad812f0cc2192ab782e27608cc
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 946 0 942 100 4 3851 16 --:--:-- --:--:-- --:--:-- 3844
http://relay.broadcastify.com/b94hfrp5k1s0tvy.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
Works!
The mp3 URL is not present in the original HTML document; it's added to the DOM later by JavaScript code. That's why you can't find it in "View page source," but you can with "Inspect element."
If you run curl https://www.broadcastify.com/listen/feed/30185/web, you will see the following somewhere in the middle:
<div id="fp" width="300px"></div>
<script>
$.ajaxSetup({ headers: { "webAuth": "74f440ad812f0cc2192ab782e27608cc" }});
$('#fp').load('/listen/webpl.php?feedId=30185',{t:14});
</script>
Note in particular that it loads content (using jQuery .load) into the initially empty <div id="fp"> just above. When you use "Inspect element" to find the audio player, you'll find it gets placed inside that div.
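Since that webAuth token sits in an inline script on the page, you can pull it out with curl and sed before making the second request. A minimal sketch, with the feed ID hard-coded for illustration:
curl -s "https://www.broadcastify.com/listen/feed/30185/web" \
  | sed -n 's/.*"webAuth": "\([^"]*\)".*/\1/p'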
Before trying to reproduce this request with curl, I looked in the Network tab of the developer tools to see what the browser did. Filtering for "listen," I found the webpl.php request. Here is the relevant information from the "Headers" tab:
URL: https://www.broadcastify.com/listen/webpl.php?feedId=30185
Request
POST /listen/webpl.php HTTP/1.1
Content-Type: application/x-www-form-urlencoded
webAuth: 74f440ad812f0cc2192ab782e27608cc
Query String Parameters
feedId: 30185
Request Data
MIME Type: application/x-www-form-urlencoded
t: 14
Let's reproduce this request with curl:
curl 'https://www.broadcastify.com/listen/webpl.php?feedId=30185' \
--request POST \
--header 'webAuth: 74f440ad812f0cc2192ab782e27608cc' \
--data 't=14'
Here's the result:
<script src="/scripts/me_4.2.9/mediaelement-and-player.min.js"></script>
<link rel="stylesheet" href="/scripts/me_4.2.9/mediaelementplayer.min.css"/>
<audio width="300px" id="mePlayer_30185" src="http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ" type="audio/mp3" controls="controls"
autoplay="true">
</audio>
<script>
$('audio').mediaelementplayer({
features: ['playpause', 'current', 'volume'],
error: function () {
alert("Feed has disconnected from the server. This could be due to a power outage, network connection problem, or server problem. Click OK to restart the player. If the player fails to connect then the feed might be down for an extended timeframe.");
location.reload();
}
});
</script>
<br />
<div class="c">If the feed does not automatically play, click or touch the play icon in the player above.</div>
There's your mp3 link, in the src attribute of the <audio> tag. If we try to get it:
$ curl http://relay.broadcastify.com/9wzfd3hrpyctvqx.mp3?xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
Moved Temporarily. Redirecting to http://audio13.broadcastify.com/9wzfd3hrpyctvqx.mp3?nocache=2623053&xan=DCJP4HvtwMoXdH9HvtwMJ5vv342DfleDptcoX3dH9H48vtwMJ
If you try to access that URL (or the original one with -L, instructing curl to follow redirects), the mp3 stream will start printing to your terminal as a bunch of nonsense characters.
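If you just want to confirm the stream is alive without filling your terminal with binary, you could instead have curl follow the redirect and write a few seconds of audio to a file. A rough sketch, where the time limit is arbitrary and $relay_url stands for whatever URL the previous step returned:
curl -sL --max-time 10 "$relay_url" -o sample.mp3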
So your shell script should hit the /listen/webpl.php endpoint rather than trying to scrape the web player HTML page for the mp3 link; it only needs to scrape that page first to get the webAuth token.
Update
In response to your update with the shell script, here is a simplified script that does the same thing and also strips the "Moved Temporarily" prefix to get just the audio URL. Note that there's no need to use a temporary file, and the $(...) syntax is preferred over the `...` syntax:
#!/bin/bash
# I always start my scripts with this. See https://sipb.mit.edu/doc/safe-shell/
set -eufo pipefail
# Scrape the player page for the webAuth token in the inline script.
auth=$(curl -s "https://www.broadcastify.com/listen/feed/$1/web" \
  | grep webAuth \
  | head -n 1 \
  | sed 's/^.*"webAuth": "//;s/".*$//')
# POST to the webpl.php endpoint and pull the relay URL out of the <audio> tag.
relay_url=$(curl -s "https://www.broadcastify.com/listen/webpl.php?feedId=$1" \
  -H "webAuth: $auth" -d 't=14' \
  | grep -o 'http://[^"]*')
# The relay answers "Moved Temporarily. Redirecting to <url>"; field 5 is the audio URL.
audio_url=$(curl -s "$relay_url" | cut -d' ' -f5)
echo "$audio_url"
Related
Figuring out a new API.
I'm trying to call an endpoint and populate a JSON file with the data. I see the following output:
admin@server:~$ curl -H "Authorization: Bearer <NOT DISPLAYED FOR THIS POST>" -o /home/admin/result.json https://www.endpoint.com/manage/query/run?id=55408&cmd=service&output=json
[1] 14493
[2] 14494
admin@server:~$ % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
It hangs here seemingly indefinitely until I hit Enter, and then the following displays:
[1]- Done curl -H "Authorization: Bearer <NOT DISPLAYED FOR THIS POST>" -o /home/admin/result.json https://www.endpoint.com/manage/query/run?id=55408
[2]+ Done cmd=service
No error messages, and no data in result.json. Calling the same cURL command without the -o option also returns the same results, when normally I would expect to see the data pop up in my terminal. If I visit the endpoint URL in browser (auth token can be a URL parameter as well for this API), I see the exact data I want to download. Changing the Auth Token makes no difference in the output.
I know every API is different, and there are a hundred different troubleshooting questions I haven't addressed in this post, but has anyone experienced this type of output before with a cURL command? I've never seen this behavior before.
The API is for Slate, an SIS for universities, if that helps. Thank you!
Your URL has & characters in it, and & is bash syntax for running a command in the background. The shell split your command line at each &, running the pieces as separate background jobs (which is why you saw the [1] and [2] job numbers and the Done cmd=service line), so the full query string never reached curl. Quote the URL to prevent it from being interpreted as shell syntax:
admin#server:~$ curl -H "Authorization: Bearer <NOT DISPLAYED FOR THIS POST>" -o /home/admin/result.json "https://www.endpoint.com/manage/query/run?id=55408&cmd=service&output=json"
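Alternatively, you could let curl build the query string itself with -G and --data-urlencode, which sidesteps the quoting issue entirely; a sketch using the same placeholder token and output path:
curl -G -H "Authorization: Bearer <NOT DISPLAYED FOR THIS POST>" \
  -o /home/admin/result.json \
  --data-urlencode "id=55408" \
  --data-urlencode "cmd=service" \
  --data-urlencode "output=json" \
  "https://www.endpoint.com/manage/query/run"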
I've been trying to learn how to use the Google Drive API to update a file in Google Drive by using a resumable session.
I received a 'Forbidden' response to the upload content request.
Could you help me find missing or misused steps?
User is authorized with permissions:
drive.file (https://www.googleapis.com/auth/drive.file)
Execute a request to create a resumable session:
PATCH https://www.googleapis.com/upload/drive/v3/files/1XIU63B-U8b9Fe1_UFFVvd7OOdS_ANqAj?uploadType=resumable
Retrieve the session URL:
https://www.googleapis.com/upload/drive/v3/files/1XIU63B-U8b9Fe1_UFFVvd7OOdS_ANqAj?uploadType=resumable&upload_id=AEnB2Uqew...
Send content by using resumable session:
PUT https://www.googleapis.com/upload/drive/v3/files/1XIU63B-U8b9Fe1_UFFVvd7OOdS_ANqAj?uploadType=resumable&upload_id=AEnB2Uqew...
I didn't find anything specific about this step in the documentation, so I used the regular upload documentation https://developers.google.com/drive/api/v3/manage-uploads#upload-resumable to update the file in multiple chunks.
I get a 403 status code with a 'Forbidden' reason, and a header with the upload_id:
X-GUploader-UploadID: AEnB2Uqewr...
You want to update the existing file in Google Drive with the resumable upload method.
Unfortunately, your question doesn't include the details of the request body you used, so I cannot replicate your situation. Instead, in this answer I would like to propose a sample flow for updating an existing file with the resumable upload.
Sample situation:
In this answer, as a sample situation, suppose that a text file in Google Drive is updated by the resumable upload in multiple chunks, and that the requests are made with the curl command.
I prepared 2 files for 2 chunks. In this test, the 2 chunks are 262,144 bytes and 37,856 bytes, so the total upload size is 300,000 bytes.
When you use the resumable upload, please be careful about the following point.
Add the chunk's data to the request body. Create chunks in multiples of 256 KB (256 x 1024 bytes) in size, except for the final chunk that completes the upload. Keep the chunk size as large as possible so that the upload is efficient. Ref
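If you need to cut a larger file into chunk files of exactly that size, split can do it; a sketch where source.txt and the chunk_ prefix are just example names:
split -b 262144 source.txt chunk_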
Flow for updating a file with the resumable upload:
1. Initiate a resumable upload session
Create the session for uploading with the resumable upload. In this case, an existing file is updated, so the endpoint is PUT https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable. As an important point, however, please use the PATCH method instead of PUT: when PUT is used, location is not included in the response header. (The official document might be incorrect on this point.)
$ curl -X PATCH -i \
-H "Authorization: Bearer ###accessToken###" \
"https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable"
If you also want to update the file's metadata when initiating the session (in this case, the filename is changed), use the following sample command.
$ curl -X PATCH -i \
-H "Authorization: Bearer ###accessToken###" \
-H "Content-Type: application/json; charset=UTF-8" \
-d '{"name":"updatedFilename.txt"}' \
"https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable"
When the above sample command is run, 200 OK is returned, and the response header includes a location like location: https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable&upload_id=###. That location is used as the endpoint for uploading the data.
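If you are scripting this, one way to capture that location value is to have curl dump the response headers and filter them; a sketch, assuming the access token and file ID are held in environment variables:
location=$(curl -s -X PATCH -D - -o /dev/null \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  "https://www.googleapis.com/upload/drive/v3/files/$FILE_ID?uploadType=resumable" \
  | tr -d '\r' | sed -n 's/^[Ll]ocation: //p')
echo "$location"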
2. Upload the 1st chunk
$ curl -X PUT -i \
-H "Content-Length: 262144" \
-H "Content-Range: bytes 0-262143/300000" \
-H "Content-Type: text/plain" \
-F "file=#data1.txt" \
"https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable&upload_id=###"
When this curl command is run, 308 Resume Incomplete is returned, which indicates that the chunk was uploaded correctly.
3. Upload the 2nd chunk (This is the last chunk of this sample flow.)
$ curl -X PUT -i \
-H "Content-Length: 37856" \
-H "Content-Range: bytes 262144-299999/300000" \
-H "Content-Type: text/plain" \
-F "file=#data2.txt" \
"https://www.googleapis.com/upload/drive/v3/files/[FILE_ID]?uploadType=resumable&upload_id=###"
When this curl command is run, 200 OK is returned along with the file metadata, which indicates that the resumable upload completed correctly.
Note:
In this case, the file content is overwritten, so please be careful about this.
In my environment, the above flow also worked when PUT was changed to PATCH for uploading the chunks.
If an error occurs in your environment, please try that modification.
For the sample situation above, if you want to upload a single chunk of 300,000 bytes, use -H "Content-Length: 300000" -H "Content-Range: bytes 0-299999/300000".
References:
Perform a resumable upload
I want to make a very simple bash script for downloading files from Google Drive via the Drive API. There is a big file on Google Drive, and I used the OAuth 2.0 Playground with my Google Drive account; in the Select the Scope box, I chose Drive API v3 and https://www.googleapis.com/auth/drive.readonly to generate a token.
After clicking Authorize APIs and then Exchange authorization code for tokens, I copied the Access token as below.
#! /bin/bash
read -p 'Enter your id : ' id
read -p 'Enter your new token : ' token
read -p 'Enter your file name : ' file
curl -H "Authorization: Bearer $token" "https://www.googleapis.com/drive/v3/files/$id?alt=media" -o "$file"
but it doesn't work. Any ideas?
For example, my file is 12 GB. When I run the script I get the output below, and after a second it returns to the prompt. I checked it on two computers with two different IP addresses. (I also added alt=media to the URL.)
-bash-3.2# bash mycode.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 166 100 166 0 0 80 0 0:00:02 0:00:02 --:--:-- 80
-bash-3.2#
The content of the file it created looks like this:
{
"error": {
"errors": [
{
"domain": "global",
"reason": "downloadQuotaExceeded",
"message": "The download quota for this file has been exceeded."
}
],
"code": 403,
"message": "The download quota for this file has been exceeded."
}
}
You want to download a file from Google Drive using the curl command with the access token.
If my understanding is correct, how about this modification?
Modified curl command:
Please add the query parameter of alt=media.
curl -H "Authorization: Bearer $token" "https://www.googleapis.com/drive/v3/files/$id?alt=media" -o "$file"
Note:
This modified curl command supposes that your access token can be used for downloading the file.
With this modification, files other than Google Docs files can be downloaded. If you want to download Google Docs files, please use the Files: export method of the Drive API. Ref
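As an illustration of that last point, exporting a Google Docs file goes through the files.export endpoint instead, with a target MIME type; the PDF type here is just an example:
curl -H "Authorization: Bearer $token" \
  "https://www.googleapis.com/drive/v3/files/$id/export?mimeType=application%2Fpdf" \
  -o "$file"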
Reference:
Download files
If I misunderstood your question and this was not the direction you want, I apologize.
UPDATE AS OF MARCH 2021
Simply follow this guide here. It worked for me.
In summary:
To download small files, run
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O FILENAME
If you are trying to download a quite large file, you should instead run
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=FILEID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=FILEID" -O FILENAME && rm -rf /tmp/cookies.txt
Simply substitute FILEID and FILENAME with your custom values.
FILEID can be found in your file share link (after the /d/, as illustrated in the article mentioned above).
FILENAME is simply the name you want to save the download as. Remember to include the right extension; for example, FILENAME = my_file.pdf if the file is a PDF.
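For instance, if your share link were https://drive.google.com/file/d/1AbCDefGhIJkLmNoPqRsTuVwXyZ/view (a made-up ID for illustration), the small-file command would look like this:
wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1AbCDefGhIJkLmNoPqRsTuVwXyZ' -O my_file.pdf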
This is a known bug
It has been reported in this Issue Tracker post. This happens because, as you can read in the documentation:
(about download url)
Short lived download URL for the file. This field is only populated
for files with content stored in Google Drive; it is not populated for
Google Docs or shortcut files.
So you should use another field.
You can follow the report by clicking on the star next to the issue
number to give more priority to the bug and to receive updates.
As you can read in the comments of the report, the current workaround is:
Use webContentLink instead
or
Change www.googleapis.com to content.googleapis.com
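A sketch of the first workaround: ask the files.get endpoint for just the webContentLink field via the fields parameter (token and file ID are placeholders):
curl -H "Authorization: Bearer $token" \
  "https://www.googleapis.com/drive/v3/files/$id?fields=webContentLink"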
I made a script that downloads several files located in my professional OneDrive. This script works perfectly from a French computer and a US computer, but it doesn't work from a Japanese computer.
To help you understand the problem, I will detail the program:
1- I establish the token system (inspired by Jay Lee's detailed answer) and retrieve the token into the access_token variable.
2- To download the file, in my case I cannot use
curl -w %{time_total} https://graph.microsoft.com/v1.0/me/drive/items/01M...WU/content -H "Authorization: Bearer $access_token"
Thus, this is how I proceed:
#I get the item properties
itemProperties=$(curl ${ODf1Mb} -H "Authorization: Bearer $access_token")
# From these properties I select the downloadUrl that lets me download the file
downloadUrl=$(echo -e "$itemProperties" | grep "#microsoft.graph.downloadUrl" | awk -F'[",]' '{ print $9 }')
# Finally I request this URL, storing the download time in a variable (this is the whole point of the exercise)
dload=$(curl -w %{time_total} ${downloadUrl} -H "Authorization: Bearer $access_token")
As I said at the beginning, it works on the French and US computers, but on the Japanese machine it doesn't. I do get the itemProperties and the downloadUrl, but when I call the downloadUrl with curl it seems that it cannot reach the server, because I get this:
As you can see, we do not even get the total size to be downloaded. For comparison, this is the result on a French machine:
I know there is a warning relating to command substitution, but I haven't tried to fix it yet because it does its job.
Note -> the downloadUrl has this format:
https://lpl-my.sharepoint.com/personal/{user}_{company infra domain}_com/_layouts/15/download.aspx?
I just cannot figure out what the problem is. I can access https://lpl-my.sharepoint.com through the browser, so I don't think the server IP is banned.
Check your ping / traceroute to see if lpl-my.sharepoint.com resolves to the same network location.
Also, I have seen other folks run curl with -v to get verbose traces and see what the difference is.
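A sketch of what you might run on both the French and Japanese machines to compare, where $downloadUrl is the URL your script already extracts:
ping -c 3 lpl-my.sharepoint.com
traceroute lpl-my.sharepoint.com
curl -v -o /dev/null "$downloadUrl"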
curl -v -r 0-500 http://somefile -o localfile
It should download just the first 501 bytes, no? Instead, it downloads the entire thing, all 67 megabytes. Thanks, curl! Could my company's proxy servers be blocking this feature somehow? I am skeptical about that, since the downloads themselves do work, just not the range feature. Am I missing something?
As a client you could always abort the download when you have received what you want.
By using head, you can limit the download to 500 bytes, even if the server does not honor the range header:
curl -v -r 0-500 http://somefile |head -c 500 > localfile
It should download just the first 501 bytes, no?
It depends on the server. From man curl:
You should also be aware that many HTTP/1.1 servers do not have this feature enabled, so that when you attempt to get a range, you'll instead get the whole document.
As you can see in the response from the server, it's using HTTP/1.1. So it's not surprising that the range feature is not supported at the server side.
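One quick way to see whether the server even advertises range support is to make a HEAD request and look for an Accept-Ranges header; a sketch against the same hypothetical URL:
curl -sI http://somefile | grep -i '^accept-ranges'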
Please use the following command:
curl -H "range: bytes=354-500" -O http://example.com/file.extension