I have a directory full of .mp3 files with filenames that each contain a YouTube video ID.
In particular, the YouTube watch-URL part of each filename starts right after a - and ends at .mp3.
However, there is a problem.
Some YouTube IDs have -'s in them, and some of the titles have -'s in them too.
I need to extract only the ID part of the filename, i.e. what goes after
https://www.youtube.com/watch?v= (here, dQw4w9WgXcQ)
The filename of the video downloaded with youtube-dl is:
Rick Astley - Never Gonna Give You Up-dQw4w9WgXcQ.mp3
The title of the video is:
Rick Astley - Never Gonna Give You Up
What I am trying to accomplish is to collect all the IDs I have already downloaded and put them in a text file that tells youtube-dl not to re-download them (a download archive).
How would I go about doing this? (Preferably with a bash sed command, but at this point I am willing to try anything.)
It's easier than you think: a greedy .* followed by - will eat all the -s up to the last one:
# first get the titles and ids into a tab-separated multiline string
both=$(find * -name "*.mp3" | sed 's/\(.*\)-\(.*\)\.mp3/\1\t\2/')
# then cut it into two multiline strings
titles=$(echo "$both" | cut -f1)
ids=$(echo "$both" | cut -f2)
# or process each title-id pair one-by-one
echo "$both" | while IFS=$'\t' read -r title id; do
    echo "$title"
    echo "$id"
done
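To go from there to a youtube-dl download archive (the part you actually ask for): as far as I know, the file read by --download-archive simply contains one "extractor id" pair per line, so for YouTube that is "youtube <id>". A minimal sketch, assuming every filename really ends in -<id>.mp3 as described:
# build a download-archive file from the already-downloaded mp3s
for f in *.mp3; do
    id="${f##*-}"      # keep everything after the last '-'
    id="${id%.mp3}"    # drop the extension
    echo "youtube $id"
done > archive.txt
# then tell youtube-dl to skip anything listed there:
# youtube-dl --download-archive archive.txt <playlist or channel URL>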
I have a large set of .mp3 files, each with a similar pattern of text attached to the filename, like this:
'Yethamaiyaa Yetham HD Extreme Quality .-V7NEP5gnTTY.mp3'
where the actual track name is only
'Yethamaiyaa Yetham.mp3'
and this additional string
'HD Extreme Quality .-V7NEP5gnTTY' is attached to each file.
How do I remove this unnecessary string starting with HD and ending just before .mp3? The issue is that there is an additional dot (.) between the marker strings. Also, the pattern of markers is the same for all 400+ files. Any help to solve the issue is appreciated.
ls *.mp3 | sed -n "s/^\(.*\) HD .*/mv -- '&' '\1.mp3'/p" | bash
The above code uses sed to remove everything from " HD " to the end of the filename. The portion of the filename before " HD " is captured by the parens so it can be used later as \1. The entire line is replaced with the required mv command. I quoted it very carefully to account for the spaces in the filename.
If you want to see the commands it will perform without executing them, leave off the pipe to bash.
Preview commands:
ls *.mp3 | sed -n "s/^\(.*\) HD .*/mv -- '&' '\1.mp3'/p"
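If you would rather avoid parsing the output of ls, a plain loop over the same " HD " marker does the job too. This is just a sketch under the same assumption (every file to rename contains " HD "):
for f in *.mp3; do
    [[ "$f" == *" HD "* ]] || continue    # skip files that are already clean
    mv -- "$f" "${f%% HD *}.mp3"          # keep the part before " HD ", re-add .mp3
done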
I have a text file called mylinks.txt with thousands of hyperlinks in the format "URL = http://examplelink.com".
What I want to do is search through all of these links and check whether any of them contain certain keywords, like "2018" or "2017". If a link contains the keyword, I want to save it in the file "yes.txt"; if it doesn't, it goes to the file "no.txt".
So at the end, I would end up with two files: one with the links that send me to pages with the keywords I'm searching for, and another with the links that don't.
I was thinking about doing this with curl, but I don't even know if it's possible, and I also don't know how to "filter" the links by keywords.
What I have got until now is:
curl -K mylinks.txt >> output.txt
But this only creates one huge file with the HTML of the pages it fetches.
I've searched and read through various curl tutorials and haven't found anything that "selectively" searches pages and saves the links (not the content) of the pages matching the criteria.
-Untested-
For the URLs in lines containing "2017" or "2018":
grep -E '2017|2018' mylinks.txt | grep -o 'http[^ ]*' >> yes.txt
To get the URLs of lines that don't contain the keywords:
grep -vE '2017|2018' mylinks.txt | grep -o 'http[^ ]*' >> no.txt
This is Unix piping: the | character takes the stdout of the program on its left and feeds it to the stdin of the program on its right.
"In Unix-like computer operating systems, a pipeline is a sequence of processes chained together by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one." (https://en.wikipedia.org/wiki/Pipeline_(Unix))
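A quick way to sanity-check the two filters is to feed them a couple of made-up lines (the example URLs below are invented):
printf 'URL = http://example.com/report-2017\nURL = http://example.com/report-2016\n' \
    | grep -E '2017|2018'
# prints only the 2017 line; add -v to see the complement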
Here is my take on it (kind of tested on a URL file with a few examples).
This is supposed to be saved as a script; it's too long to type into the console directly.
#!/bin/bash
urlFile="/path/to/myLinks.txt"
cut -d' ' -f3 "$urlFile" | \
while read -r url
do
    echo "checking url $url"
    if curl -s "$url" | grep -q "2017"
    then
        echo "$url" >> /tmp/yes.txt
    else
        echo "$url" >> /tmp/no.txt
    fi
done
Explanation: the cut is necessary to cut away the prefix "URL = " in each line. Then the URLs are fed into the while-read loop. For each URL, we curl it and grep for the interesting keyword (in this case "2017"); if grep exits with status 0, we append the URL to the file with the interesting URLs.
Obviously, you should adjust the paths and the keyword.
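If both keywords matter, the grep inside the loop can be extended; a small, untested variation of the same test, under the same assumptions:
if curl -s "$url" | grep -qE "2017|2018"
then
    echo "$url" >> /tmp/yes.txt
else
    echo "$url" >> /tmp/no.txt
fi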
When I do
echo $filename
I get
Pew Pew.mp4
However,
echo "${#filename}"
Returns 19
How do I delete all characters after the file extension? It needs to work no matter what the file extension is, because the file name in the variable will not always match *.mp4.
You should try to find out why you have such strange files before fixing them.
Once you know, you can rename the files.
When you just want to rename one file, use the command
mv "Pew Pew.mp4"* "Pew Pew.mp4"
Cutting off the complete extension (with filename=${filename%%.*}) won't help you if you still want to use the extension (mp4 or jpg or ...) afterwards.
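For reference, the two suffix-stripping expansions behave differently; a tiny demonstration with a made-up value:
filename="Pew Pew.mp4.backup"      # hypothetical value, just to show the difference
echo "${filename%%.*}"             # -> Pew Pew       (cuts at the first dot, extension gone)
echo "${filename%.*}"              # -> Pew Pew.mp4   (cuts at the last dot only)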
EDIT:
I think the OP wants a work-around, so I'll give it another try.
When you have a short list of extensions, you can try
for ext in mpeg mpg jpg avi mov; do
    for filename in *.${ext}*; do
        mv "${filename%%.*}.${ext}"* "${filename%%.*}.${ext}"
    done
done
You can try strings to get the readable string.
echo "${filename}" | strings | wc
# Rename file
mv "${filename}" "$(echo "${filename}"| strings)"
EDIT:
strings gives more than one line as a result, plus unwanted spaces. Since "Pew Pew" has a space in it, I hope that all spaces, underscores and minus signs are in front of the dot.
The new name can be constructed with something like
tmpname=$(echo "${filename}" | strings | head -1)
newname=${tmpname% *}
# or another way
newname=$(echo "${filename}" | sed 's/\([[:alnum:]_ -]*\.[[:alnum:]]*\).*/\1/')
# or another (the best?) way (hoping that the first unwanted character is not a space)
newname="${filename%%[^[:alnum:] ._-]*}"
# resulting in
mv "${filename}" "${filename%%[^[:alnum:] ._-]*}"
http://romhustler.net/file/54654/RFloRzkzYjBxeUpmSXhmczJndVZvVXViV3d2bjExMUcwRmdhQzltaU5UUTJOVFE2TVRrM0xqZzNMakV4TXk0eU16WTZNVE01TXpnME1UZ3pPRHBtYVc1aGJGOWtiM2R1Ykc5aFpGOXNhVzVy <-- Url that needs to be identified
http://romhustler.net/rom/ps2/final-fantasy-x-usa <-- Parent url
If you copy-paste this URL, you will see the browser identify the file's name. How can I get a bash script to do the same?
I need to wget the first URL, but because this will be done for 100 more items, I can't copy-paste each URL.
I currently have the menu set up for all the files. I just don't know how to mass-download each file individually, as the URLs for the files have no matching patterns.
Bits of my working menu:
#Raw gamelist grabber
w3m http://romhustler.net/roms/ps2 | egrep "/5" > rawmenu.txt
#splits the initial file into files (games00, games01, ...) of 10 lines each
#-d numbers the output files with digits instead of letters
split -l 10 -d rawmenu.txt games
#s/ /_/g - replaces spaces with underscore
#s/__.*//g - removes anything after two underscores
select opt in\
$(cat games0$num|sed -e 's/ /_/g' -e 's/__.*//g')\
"Next"\
"Quit" ;
if [[ "$opt" =~ "${lines[0]}" ]];
then
### Here the URL needs to be grabbed ###
This has to be done in bash. Is this possible?
It appears that romhustler.net uses some JavaScript on their full download pages to hide the final download link for a few seconds after the page loads, possibly to prevent this kind of web scraping.
However, if they were using direct links to ZIP files for example, we could do this:
# Use curl to get the HTML of the page and egrep to match the hyperlinks to each ROM
curl -s http://romhustler.net/roms/ps2 | egrep -o "rom/ps2/[a-zA-Z0-9_-]+" > rawmenu.txt
# Loop through each of those links and extract the full download link
while read -r LINK
do
    # Extract full download link
    FULLDOWNLOAD=$(curl -s "http://romhustler.net$LINK" | egrep -o "/download/[0-9]+/[a-zA-Z0-9]+")
    # Download the file
    wget "http://romhustler.net$FULLDOWNLOAD"
done < "rawmenu.txt"
I need to extract the .co.uk URLs from a file with lots of entries, some .com, .us, etc. I need only the .co.uk ones. Any way to do that?
PS: I'm learning bash.
Edit:
Code sample:
32
<tr><td id="Table_td" align="center">23<a name="23"></a></td><td id="Table_td"><input type="text" value="http://www.ultraguia.co.uk/motets.php?pg=2" size="57" readonly="true" style="border: none"></td>
Note that some repeat.
Important: I need all links, broken or 404 ones too.
I found this code somewhere on the net:
cat file.html | tr " " "\n" | grep .co.uk
output:
href="http://www.domain1.co.uk/"
value="http://www.domain1.co.uk/"
href="http://www.domain2.co.uk/"
value="http://www.domain2.co.uk/"
I think I'm close.
Thanks!
The following approach uses a real HTML engine to parse your HTML, and will thus be more reliable when faced with CDATA sections or other syntax that is hard to parse:
links -dump http://www.google.co.uk/ -html-numbered-links 1 -anonymous \
| tac \
| sed -e '/^Links:/,$ d' \
-e 's/[0-9]\+\.[[:space:]]//' \
| grep '^https\?://[^/]\+[.]co[.]uk'
It works as follows:
links (a text-based web browser) actually retrieves the site.
Using -dump causes the rendered page to be emitted to stdout.
Using -html-numbered-links requests a numbered table of links.
Using -anonymous tweaks defaults for added security.
tac reverses the output from Links in a line-ordered list
sed -e '/^Links:/,$ d' deletes everything after (pre-reversal, before) the table of links, ensuring that actual page content can't be misparsed
sed -e 's/[0-9]\+.[[:space:]]//' removes the numbered headings from the individual links.
grep '^https\?://[^/]\+[.]co[.]uk' finds only those links with their host parts ending in .co.uk.
One way using awk:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html
output:
http://www.mysite.co.uk/
http://www.ultraguia.co.uk/motets.php?pg=2
http://www.ultraguia.co.uk/motets.php?pg=2
If you are only interested in unique URLs, pipe the output into sort -u.
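For example, the whole pipeline would then be:
awk -F "[ \"]" '{ for (i = 1; i<=NF; i++) if ($i ~ /\.co\.uk/) print $i }' file.html | sort -u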
HTH
Since there is no answer yet, I can provide you with an ugly but robust solution. You can exploit the wget command to grab the URLs in your file. Normally, wget is used to download from those URLs, but by denying wget the time to look them up via DNS, it will not resolve anything and will just print the URLs. You can then grep for those URLs that have .co.uk in them. The whole story becomes:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 | grep -e "^\-\-.*\\.co\\.uk/.*"
If you want to get rid of the remaining timestamp information on each line, you can pipe the output through sed, as in | sed 's/.*-- //'.
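Putting the pieces together, the complete pipeline (untested, same caveats as above) would look like:
wget --force-html --input-file=yourFile.html --dns-timeout=0.001 --bind-address=127.0.0.1 2>&1 \
    | grep -e "^\-\-.*\\.co\\.uk/.*" \
    | sed 's/.*-- //'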
If you do not have wget, you can get it from the GNU wget homepage (https://www.gnu.org/software/wget/).