As of 4 days ago, you could send a GET request to (or visit) https://video.google.com/timedtext?lang=en&v={youtubeVideoId} and receive an XML response containing the caption track of a given YouTube video. Does anyone know if this support has been removed? As of tonight it no longer provides the XML response with the captions; the page is simply empty for every video. Numerous videos this worked for 4 days ago no longer work. Thanks in advance
Captions in the default language (the single one available, or English, it seems):
To get the captions of a YouTube video, just use this Linux command (using curl and base64):
curl -s 'https://www.youtube.com/youtubei/v1/get_transcript?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8' -H 'Content-Type: application/json' --data-raw "{\"context\":{\"client\":{\"clientName\":\"WEB\",\"clientVersion\":\"2.2021111\"}},\"params\":\"$(printf '\n\x0bVIDEO_ID' | base64)\"}"
Replace VIDEO_ID with the ID of the video you are interested in.
Note: the key isn't a YouTube Data API v3 key; it is the first public one (tested on some computers in different countries) that shows up if you run curl https://www.youtube.com/ | grep AIzaSy
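A hedged one-liner for pulling such a key out of the homepage (the character class is an assumption about the key format):
key=$(curl -s https://www.youtube.com/ | grep -o 'AIzaSy[0-9A-Za-z_-]*' | head -1)
echo "$key"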
Note: if you are interested in how I reverse-engineered this YouTube feature, say so in the comments and I will write a paragraph explaining it.
Captions in desired language if available:
YouTube made things tricky, maybe to lose you at this step, so follow me: the only thing we have to change is the params value. It is base64-encoded data that, in addition to some raw bytes, contains another base64 string, which in turn contains characters that need escaping.
Get the two-letter language code, like "ru" for Russian.
Encode \n\x00\x12\x02LANGUAGE_INITIALS\x1a\x00 in base64, for instance with A=$(printf '\n\x00\x12\x02LANGUAGE_INITIALS\x1a\x00' | base64) (don't forget to replace LANGUAGE_INITIALS with the language code you want, ru for instance). The result for ru is CgASAnJ1GgA=
URL-encode the result by replacing the = with %3D, for instance with B=$(printf %s $A | jq -sRr @uri). The result for ru is CgASAnJ1GgA%3D
Only if using shell commands: replace the single % with two %, so that the printf in the next step treats it literally, for instance with C=$(echo $B | sed 's/%/%%/'). The result for ru is CgASAnJ1GgA%%3D
Encode \n\x0bVIDEO_ID\x12\x0e$C in base64 (don't forget to replace VIDEO_ID with your video id; $C is the result of the previous step), for instance with D=$(printf "\n\x0bVIDEO_ID\x12\x0e$C" | base64). The result for ru and lo0X2ZdElQ4 is CgtsbzBYMlpkRWxRNBIOQ2dBU0FuSjFHZ0ElM0Q=
Use this params value in the command from the Captions in default language section: curl -s 'https://www.youtube.com/youtubei/v1/get_transcript?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8' -H 'Content-Type: application/json' --data-raw "{\"context\":{\"client\":{\"clientName\":\"WEB\",\"clientVersion\":\"2.2021111\"}},\"params\":\"$D\"}"
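Putting the five steps together, here is a rough end-to-end sketch (assuming bash with curl, base64, jq and sed available; the \x0b and \x0e length bytes match an 11-character video id and a two-letter language code, exactly as in the example above, and would differ otherwise):
VIDEO_ID='lo0X2ZdElQ4'    # 11 characters, matching the \x0b length byte
LANG_CODE='ru'            # two letters, matching the inner \x02 length byte
A=$(printf "\n\x00\x12\x02${LANG_CODE}\x1a\x00" | base64)    # inner payload, base64-encoded
B=$(printf %s "$A" | jq -sRr @uri)                           # URL-encode (= becomes %3D)
C=$(printf %s "$B" | sed 's/%/%%/')                          # escape % for the printf below
D=$(printf "\n\x0b${VIDEO_ID}\x12\x0e$C" | base64)           # outer payload, base64-encoded
curl -s 'https://www.youtube.com/youtubei/v1/get_transcript?key=AIzaSyAO_FJ2SlqU8Q4STEHLGCilw_Y9_11qcW8' -H 'Content-Type: application/json' --data-raw "{\"context\":{\"client\":{\"clientName\":\"WEB\",\"clientVersion\":\"2.2021111\"}},\"params\":\"$D\"}"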
I recommend that anyone using Python try the youtube_transcript_api module. I used to send a GET request to https://video.google.com/timedtext?lang=en&v={videoId}, but now the page is blank. The following is a code example. In addition, this method does not need an API key.
from youtube_transcript_api import YouTubeTranscriptApi
# returns a list of dicts like {'text': ..., 'start': ..., 'duration': ...}
srt = YouTubeTranscriptApi.get_transcript("videoId", languages=['en'])
The old API currently returns 404 on every request. YouTube now uses a new version of this API:
https://www.youtube.com/api/timedtext?v={youtubeVideoId}&asr_langs=de%2Cen%2Ces%2Cfr%2Cid%2Cit%2Cja%2Cko%2Cnl%2Cpt%2Cru%2Ctr%2Cvi&caps=asr&exp=xftt%2Cxctw&xoaf=5&hl=en&ip=0.0.0.0&ipbits=0&expire=1637102374&sparams=ip%2Cipbits%2Cexpire%2Cv%2Casr_langs%2Ccaps%2Cexp%2Cxoaf&signature=0BEBD68A2638D8A18A5BC78E1851D28300247F93.7D5E6D26397D8E8A93F65CCA97260D090C870462&key=yt8&kind=asr&lang=en&fmt=json3
The main problem with this API is calculating the signature field of the request. Unfortunately I couldn't find its algorithm. Maybe someone can reverse-engineer it from the YouTube player.
The YouTube API change around captions caused me a lot of hassle, which I circumvented through use of youtube-dl, which was reinstated by GitHub after the DMCA takedown and is now again available for download/clone.
The software is available as a source or binary download for all major platforms; details are on their GitHub page, linked above.
Sample use is as simple as this:
youtube-dl --write-sub --sub-lang en --skip-download --sub-format vtt https://www.youtube.com/watch?v=E-lZ8lCG7WY
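That writes the captions next to the (skipped) video download as a .vtt file. If you want plain text afterwards, a rough sketch (the glob and the header lines are assumptions about youtube-dl's default "<title>-<id>.en.vtt" naming and YouTube's VTT headers; it only strips the obvious header, timing and blank lines):
sed -e '/^WEBVTT/d' -e '/^Kind:/d' -e '/^Language:/d' -e '/-->/d' -e '/^$/d' ./*.en.vtt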
Related
I'm trying to fetch the Yammer followers using the REST API below.
https://www.yammer.com/api/v1/users.json
The API response contains details for each user. From this I need to extract only the followers count.
{"type":"user","id":1517006975,"network_id":461,"stats":{"following":0,"followers":0,"updates":0}}
The page size limit is 50, and as we have 100,000+ users I need to iterate 2,000+ times to get the whole dump, which is slow.
So I need a method to extract only the necessary data directly.
I am using shell scripts + Pentaho.
I think you have two options.
If you are bound to the shell, you could run the JSON response through a series of sed silliness to get to a list that you can then parse more effectively with shell tools. Something like curl http://foo.com | sed 's/,/\n/g' will get you something more row-based, and then you can start to parse it out from there using more sed or awk or cut and tr.
Or look at jq? It is a statically linked, standalone C binary that allows really nice filtering of JSON.
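For example, a minimal jq sketch against the endpoint from the question (the page parameter and the Bearer token header are assumptions about the Yammer setup; .stats.followers matches the JSON shown above):
curl -s 'https://www.yammer.com/api/v1/users.json?page=1' -H 'Authorization: Bearer YOUR_TOKEN' | jq '.[] | {id: .id, followers: .stats.followers}'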
I am writing a program in bash and want to use curl to display the contents of a website. The website URL is http://examplewebsite.com/example.php?pass=hello hello:world. I am using:
curl http://examplewebsite.com/example.php?pass=hello hello:world
However this returns:
Couldn't resolve host 'hello:world'
The error is caused by the space between "hello" and "world": bash splits the link into two different tokens, which curl then interprets as two different URLs.
You should at least quote the URL parameters, as Explosion Pills did.
However, it is not good style to pass arguments directly like that, because you might end up with characters that need escaping (the space is one of those, though it seems to be handled automatically by curl).
To handle this, you can use --data-urlencode:
curl "http://examplewebsite.com/example.php" --data-urlencode "pass=hello hello:world" --get
(--data-urlencode switches the method to POST instead of GET, so you can use --get to switch back to GET)
Just wrap the URL in quotes:
curl "http://examplewebsite.com/example.php?pass=hello hello:world"
Your mileage may vary as to whether this works properly or not, so you should also URL-encode the value:
curl "http://examplewebsite.com/example.php?pass=hello%20hello%3Aworld"
I've been trying to get a response over HTTP with curl. The response is in JSON format and contains numbers.
When I get the reply, there are fields with numeric values, but the floating-point notation has been changed as follows:
"value": 2.7123123E7 instead of just "value": 27123123
Why is this happening and how can I disable it? I do not want to parse the file a second time to change it back; I just want to disable this behavior. For example, my web browser, where I submit the same query, does not have this behavior, but I cannot use my browser because the response I want to gather is very big and it gets stuck :S
Thank you
It looks like jq will do this for you if you want a simple filter to convert the notation:
$ echo '{"value":2.7123123E7}' | jq '.'
{
"value": 27123123
}
See the manual for more info. So the simplest fix would just be to pipe the output of curl through jq.
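For instance (the endpoint is a placeholder):
curl -s 'http://example.com/api' | jq '.'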
Does anyone know how Facebook encodes emoji with high-surrogate pairs in the Graph API?
Low surrogate pairs seem fine. For example, ❤️ (HEAVY BLACK HEART, though it is red in iOS/OSX, link to image if you can't see the emoji) comes through as \u2764\ufe0f which appears to match the UTF-16 hex codes / "Formal Unicode Notation" shown here at iemoji.com.
And indeed, in Ruby when parsing the JSON output from the API:
ActiveSupport::JSON.decode('"\u2764\ufe0f"')
you correctly get:
"❤️"
However, to pick another emoji, 💤 (SLEEPING SYMBOL, link to image here), Facebook returns \udbba\udf59. This seems to correspond to nothing I can find in any Unicode resource, for example this one at iemoji.com.
And when I attempt to decode in Ruby using the same method above:
ActiveSupport::JSON.decode('"\udbba\udf59"')
I get:
""
Any idea what's going on here?
Answering my own question, though most of the credit belongs to @bobince for showing me the way in the comments above.
The answer is that Facebook encodes emoji using the "Google" encoding as seen on this Unicode table.
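For the curious, the surrogate pair can be decoded by hand; a minimal shell sketch of the UTF-16 arithmetic (the result, U+FEB59, falls in a Private Use Area plane, which is why public Unicode resources don't list it):
# combine the high surrogate \udbba and the low surrogate \udf59 into one code point
hi=0xDBBA; lo=0xDF59
printf 'U+%X\n' $(( 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00) ))
# prints U+FEB59, a private-use code point from Google's legacy emoji mapping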
I have created a Ruby gem called emojivert that can convert from one encoding to another, including from "Google" to "Unified". It is based on another existing project called rails-emoji.
So the failing example above would be fixed by doing:
string = ActiveSupport::JSON.decode('"\udbba\udf59"')
> ""
fixed = Emojivert.google_to_unified(string)
> "💤"
I have a big string (the HTML code of a web page).
Now the problem is how to parse out the links to images.
I want to make an array of all the links to images on that web page.
I know how to do this in Java, but I do not know how to parse strings and do string manipulation in the shell. I know there are many tricks, and I guess this can be done easily.
In the end I want to get something like this:
#!/bin/bash
BIG_STRING=$(curl -s some_web_page_with_links_to_images.com)
#parse the big string and fill the LINKS variable
# fill this with the links to image somewhow (.jpg and .png only)
#after the parsing the LINKS should look like this
LINKS=("www.asd.com/asd1.jpg" "www.asd.com/asd.jpg" "www.asd.com/asd2123.jpg")
#I need the parsing and to fill the LINKS variable with the links from the web page
# get length of an array
tLen=${#LINKS[@]}
for (( i=0; i<${tLen}; i++ ));
do
echo ${LINKS[$i]}
done
Thanks for the responses, you saved me days of frustration.
Why not start with the right tool? Parsing HTML is hard, especially with sed. If you have the mojo tool from the Mojolicious project you can do this:
mojo get http://example.com a attr href
And then just check whether each line ends with jpg, png, or whatever.
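Putting the two steps together (the grep pattern is just one way to do the extension check):
mojo get http://example.com a attr href | grep -Ei '\.(jpe?g|png)$'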
It's hard to offer more than approximations. Let's assume all interesting links are in href="" attributes, that there's at most one href attribute per line, and that each link fits on one line (actually I'm not sure whether newlines are allowed inside URLs).
Let's assume your sourcefile is called test.html.
The following should print all links under these assumptions:
sed -n 's/.*\<href="\([^"]*\)".*/\1/p' test.html
To understand how this works, you should know what regular expressions are and have read a tutorial on sed (particularly how the s (substitute) command works).
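To tie this back to the script in the question, a sketch that fills the LINKS array (it assumes GNU sed, at most one href per line, and no spaces in the URLs):
LINKS=($(curl -s some_web_page_with_links_to_images.com | sed -n 's/.*\<href="\([^"]*\)".*/\1/p' | grep -Ei '\.(jpe?g|png)$'))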