rename multiple mp3 filenames from greek to english with bash script - bash

I have dozens of mp3 files where the filename contains Greek letters. I would like to rename them to "latin only characters" so that the title etc. is displayed correctly on all common playback devices.
It takes a long time to do this manually, so I need your help.
Is there a simple bash script that can do this job?
as example:
I want the script to rename the file from σαγαπώ.mp3 to sagapo.mp3
edit://
I was now able to rename the file name with a python script.
Of:
Βασίλης Μπατής - Ζημιά _ Vasilis Mpatis - Zimia _ Official Video Clip HQ 2017.mp3
would:
Basilis Mpatis - Zimia _ Vasilis Mpatis - Zimia _ Official Video Clip HQ 2017.mp3
So far so good, now the question is how do I get rid of all "unnecessary" information from the file name, so that in the end only the artist and title remain as file names.
This is what the file name should look like at the end.
Basilis Mpatis - Zimia.mp3
Anyone an idea?
Here is my Python script:
import os
# Pfad zum Ordner mit den MP3-Dateien
path = '/home/sakis/mp3'
# Alle MP3-Dateien im Ordner durchlaufen
for file in os.listdir(path):
if file.endswith('.mp3'):
# Aktuellen Dateinamen speichern und in Unicode umwandeln
old_name = file.encode('utf-8').decode('utf-8')
# Dateinamen in zwei Teile trennen
name, extension = old_name.rsplit('.', 1)
# Griechische Buchstaben im Dateinamen ersetzen
new_name = old_name.replace('Ά', 'A').replace('Έ', 'E').replace('Ή', 'H').replace('Ί', 'I').replace('Ό', 'O').replace('Ύ', 'Y').replace('Ώ', 'W').replace('ΐ', 'I').replace('Α', 'A').replace('Β', 'B').replace('Γ', 'G').replace('Δ', 'D').replace('Ε', 'E').replace('Ζ', 'Z').replace('Η', 'H').replace('Θ', 'TH').replace('Ι', 'I').replace('Κ', 'K').replace('Λ', 'L').replace('Μ', 'M').replace('Ν', 'N').replace('Ξ', 'X').replace('Ο', 'O').replace('Π', 'P').replace('Ρ', 'R').replace('Σ', 'S').replace('Τ', 'T').replace('Υ', 'Y').replace('Φ', 'F').replace('Χ', 'X').replace('Ψ', 'PS').replace('Ω', 'O').replace('ά', 'a').replace('έ', 'e').replace('ή', 'i').replace('ί', 'i').replace('ό', 'o').replace('ύ', 'y').replace('ώ', 'w').replace('ϊ', 'i').replace('ϋ', 'u').replace('ό', 'o').replace('α', 'a').replace('β', 'b').replace('γ', 'g').replace('δ', 'd').replace('ε', 'e').replace('ζ', 'z').replace('η', 'i').replace('θ', 'th').replace('ι', 'i').replace('κ', 'k').replace('λ', 'l').replace('μ', 'm').replace('ν', 'n').replace('ξ', 'x').replace('ο', 'o').replace('π', 'p').replace('ρ', 'r').replace('ς', 's').replace('σ', 's').replace('τ', 't').replace('υ', 'y').replace('φ', 'f').replace('χ', 'x').replace('ψ', 'ps').replace('ω', 'o')
# Alle weiteren Zeichen im Dateinamen entfernen
name = ''.join(c for c in name if c.isalnum() or c in [' ', '-', '_'])
# Neuen Dateinamen setzen
os.rename(os.path.join(path, old_name), os.path.join(path, new_name))
print('Done!')

You need to define your own tranlsation table, because nobody can guess how you want to translate the names. Assume that the greek name is stored in variable greek_name, something like this could do:
english_name=$(tr αβΓγΔδεΖζ... avGgDdeZz... $greek_name)
Of course you have to make compromises: Since for instance the letter υ can be pronounced as "i", "f" or as "w" depending on the context, you have to settle for one.
Another problem is that several greek letters are pronounced the same; for instance, Ο and Ω. If you don't manage to map them uniquely, it might happen that two greek file names map to the same english file name. Therefore, when you do the renaming, make sure that you get at least an error message in this case:
if ! mv -n "$greek_name" "$english_name"
then
echo Can not rename "$greek_name", because "$english_name" already exists
fi
UPDATE:
It's not clear how you would translate i.e. ψ, as the most natural mapping would be to use two charcaters, "ps". You could either use an english letter which has no equivalent in Greek anyway ('c' comes to my mind), or you translate these special cases in a separate step, for instance:
# english_name could still contain a ψ because this
# was not handled by `tr`
english_name=${english_name//ψ/ps}
You have of course make up your mind whether you want the upper case Ψ being translated into PS or Ps.
You have not specified how you want to use in translation the English letter b. In Greek, this sound is written as νπ, i.e. two Greek letters map to a single English one. If you want to implement this mapping, you have to do it before the one-to-one translation done by tr, for instance:
# Already preprocess νπ before translating the other
# Greek letters:
greek_name=${greek_name//νπ/b}
greek_name=${greek_name//Ν[πΠ]/B}
This reflects the idea that a Greek word starting with Νπ is meant to be a word starting with an upper case letter, and ΝΠ is meant to start an all-upper-case word, both corresponding to an upper-case B in English.

Related

How to print unicode charaters in Command Prompt with Ruby

I was wondering how to print unicode characters, such as Japanese or fun characters like 📦.
I can print hearts with:
hearts = "\u2665"
puts hearts.encode('utf-8')
How can I print more unicode charaters with Ruby in Command Prompt?
My method works with some characters but not all.
Code examples would be greatly appreciated.
You need to enclose the unicode character in { and } if the number of hex digits isn't 4 (credit : /u/Stefan) e.g.:
heart = "\u2665"
package = "\u{1F4E6}"
fire_and_one_hundred = "\u{1F525 1F4AF}"
puts heart
puts package
puts fire_and_one_hundred
Alternatively you could also just put the unicode character directly in your source, which is quite easy at least on macOS with the Emoji & Symbols menu accessed by Ctrl + Command + Space by default (a similar menu can be accessed on Windows 10 by Win + ; ) in most applications including your text editor/Ruby IDE most likely:
heart = "♥"
package = "📦"
fire_and_one_hundred = "🔥💯"
puts heart
puts package
puts fire_and_one_hundred
Output:
♥
📦
🔥💯
How it looks in the macOS terminal:

How to decoding IFC using Ruby

In Ruby, I'm reading an .ifc file to get some information, but I can't decode it. For example, the file content:
"'S\X2\00E9\X0\jour/Cuisine'"
should be:
"'Séjour/Cuisine'"
I'm trying to encode it with:
puts ifcFileLine.encode("Windows-1252")
puts ifcFileLine.encode("ISO-8859-1")
puts ifcFileLine.encode("ISO-8859-5")
puts ifcFileLine.encode("iso-8859-1").force_encoding("utf-8")'
But nothing gives me what I need.
I don't know anything about IFC, but based solely on the page Denis linked to and your example input, this works:
ESCAPE_SEQUENCE_EXPR = /\\X2\\(.*?)\\X0\\/
def decode_ifc(str)
str.gsub(ESCAPE_SEQUENCE_EXPR) do
$1.gsub(/..../) { $&.to_i(16).chr(Encoding::UTF_8) }
end
end
str = 'S\X2\00E9\X0\jour/Cuisine'
puts "Input:", str
puts "Output:", decode_ifc(str)
All this code does is replace every sequence of four characters (/..../) between the delimiters, which will each be a Unicode code point in hexadecimal, with the corresponding Unicode character.
Note that this code handles only this specific encoding. A quick glance at the implementation guide shows other encodings, including an \X4 directive for Unicode characters outside the Basic Multilingual Plane. This ought to get you started, though.
See it on eval.in: https://eval.in/776980
If someone is interested, I wrote here a Python Code that decode 3 of the IFC encodings : \X, \X2\ and \S\
import re
def decodeIfc(txt):
# In regex "\" is hard to manage in Python... I use this workaround
txt = txt.replace('\\', 'µµµ')
txt = re.sub('µµµX2µµµ([0-9A-F]{4,})+µµµX0µµµ', decodeIfcX2, txt)
txt = re.sub('µµµSµµµ(.)', decodeIfcS, txt)
txt = re.sub('µµµXµµµ([0-9A-F]{2})', decodeIfcX, txt)
txt = txt.replace('µµµ','\\')
return txt
def decodeIfcX2(match):
# X2 encodes characters with multiple of 4 hexadecimal numbers.
return ''.join(list(map(lambda x : chr(int(x,16)), re.findall('([0-9A-F]{4})',match.group(1)))))
def decodeIfcS(match):
return chr(ord(match.group(1))+128)
def decodeIfcX(match):
# Sometimes, IFC files were made with old Mac... wich use MacRoman encoding.
num = int(match.group(1), 16)
if (num <= 127) | (num >= 160):
return chr(num)
else:
return bytes.fromhex(match.group(1)).decode("macroman")

Moving chunks of data in a file with awk

I'm moving my bookmarks from kippt.com to pinboard.in.
I exported my bookmarks from Kippt and for some reason, they were storing tags (preceded by #) and description within the same field. Pinboard keeps tags and description separated.
This is what a Kippt bookmark looks like after export:
<DT>This is a title
<DD>#tag1 #tag2 This is a description
This is what it should look like before importing into Pinboard:
<DT>This is a title
<DD>This is a description
So basically, I need to replace #tag1 #tag2 by TAGS="tag1,tag2" and move it on the first line within <A>.
I've been reading about moving chunks of data here: sed or awk to move one chunk of text betwen first pattern pair into second pair?
I haven't been to come up with a good recipe so far. Any insight?
Edit:
Here's an actual example of what the input file looks like (3 entries out of 3500):
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
This might not be the most beautiful solution, but since it seems to be a one-time-thing it should be sufficient.
import re
dt = re.compile('^<DT>')
dd = re.compile('^<DD>')
with open('bookmarks.xml', 'r') as f:
for line in f:
if re.match(dt, line):
current_dt = line.strip()
elif re.match(dd, line):
current_dd = line
tags = [w for w in line[4:].split(' ') if w.startswith('#')]
current_dt = re.sub('(<A[^>]+)>', '\\1 TAGS="' + ','.join([t[1:] for t in tags]) + '">', current_dt)
for t in tags:
current_dd = current_dd.replace(t + ' ', '')
if current_dd.strip() == '<DD>':
current_dd = ""
else:
print current_dt
print current_dd
current_dt = ""
current_dd = ""
print current_dt
print current_dd
If some parts of the code are not clear, just tell me. You can of course use python to write the lines to a file instead of printing them, or even modify the original file.
Edit: Added if-clause so that empty <DD> lines won't show up in the result.
script.awk
BEGIN{FS="#"}
/^<DT>/{
if(d==1) print "<DT>"s # for printing lines with no tags
s=substr($0,5);tags="" # Copying the line after "<DT>". You'll know why
d=1
}
/^<DD>/{
d=0
m=match(s,/>/) # Find the end of the HREF descritor first match of ">"
for(i=2;i<=NF;i++){sub(/ $/,"",$i);tags=tags","$i} # Concatenate tags
td=match(tags,/ /) # Parse for tag description (marked by a preceding space).
if(td==0){ # No description exists
tags=substr(tags,2)
tagdes=""
}
else{ # Description exists
tagdes=substr(tags,td)
tags=substr(tags,2,td-2)
}
print "<DT>" substr(s,1,m-1) ", TAGS=\"" tags "\"" substr(s,m)
print "<DD>" tagdes
}
awk -f script.awk kippt > pinboard
INPUT
<DT>Phabricator
<DD>#bug #tracking
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD>#iceland #tour #car #drive #self Self-driving tour of Iceland
OUTPUT:
<DT>Phabricator
<DD>
<DT>The hidden commands for diagnosing and improving your Netflix streaming quality – Quartz
<DT>Icelandic Farm Holidays | Local experts in Iceland vacations
<DD> Self-driving tour of Iceland

text processing for IPv4 dotted decimal notation conversion to /8 or /16 format

I have an input file that contains a list of ip addresses and the ip_counts(some parameter that I use internally.)The file looks somewhat like this.
202.124.127.26 2135869
202.124.127.25 2111217
202.124.127.17 2058082
202.124.127.16 2014958
202.124.127.20 1949323
202.124.127.24 1933773
202.124.127.27 1932076
202.124.127.22 1886466
202.124.127.18 1882955
202.124.127.21 1803528
202.124.127.23 1786348
119.224.129.200 1776592
119.224.129.211 1639325
202.124.127.19 1479198
119.224.129.201 1145426
202.49.175.110 1133354
119.224.129.210 1119525
68.232.45.132 1085491
119.224.129.209 1015078
131.203.3.8 857951
202.162.73.4 817197
207.123.58.125 785326
202.7.6.18 762603
117.121.253.254 718022
74.125.237.120 710448
68.232.44.219 693002
202.162.73.2 671559
205.128.75.126 611301
119.161.91.17 604393
119.224.129.202 559930
8.27.241.126 528862
74.125.237.152 517516
8.254.9.254 514341
As you can see the ip addresses themselves are unsorted.So I use the sort command on the file to sort the ip addresses as below
cat address_count.txt | sort -t . -k 1,1n -k 2,2n -k 3,3n -k 4,4n > sorted_address.txt
Which gives me an output with ip addresses in the sorted order.The partial output of that file is shown below.
4.23.63.126 15731
4.26.254.254 320705
4.27.8.254 25174
8.12.129.50 176141
8.12.223.125 11800
8.19.32.65 15854
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
8.26.195.253 28568
8.27.152.253 104446
8.27.233.125 115856
8.27.235.126 16102
8.27.235.254 25628
8.27.238.254 108485
8.27.240.125 169262
8.27.241.126 528862
8.27.241.252 197302
8.27.248.125 14926
8.254.9.254 514341
12.129.210.71 89663
15.192.45.21 20139
15.192.45.26 35265
15.193.0.148 10313
15.193.113.29 40318
15.201.49.136 14243
15.240.238.52 57163
17.250.248.95 28166
23.33.125.13 19179
23.33.125.37 17953
31.151.163.60 72709
38.99.42.37 192356
38.99.68.180 41251
38.99.68.181 10272
38.104.237.74 74012
38.108.112.103 37034
38.108.112.115 69698
38.108.112.121 92173
38.108.112.122 99230
38.112.63.238 39958
38.119.130.62 42159
46.4.28.22 19769
Now I want to parse the file given above and convert it to aaa.bbb.ccc.0/8 format and
aaa.bbb.0.0/16 format and I also want to count the number of occurences of the ip's in each subnet.I want to do this using bash.I am open to using sed or awk.How do I achieve this.
For example
8.19.240.53 11013
8.19.240.70 11915
8.19.240.72 31541
8.19.240.73 23304
8.20.213.28 96434
8.20.213.32 108191
8.20.213.34 170058
8.20.213.39 23512
8.20.213.41 10420
8.20.213.61 24809
The about input portion should produce 8.19.240.0/8 and 8.20.213.0/8 and similarly for /16 domains.I also want to count the occurences of machines in the subnet.
For example In the above output this subnet should have the count 4 in the next column beside it.It should also add the already displayed count.i.e (11013 + 11915 + 31541 + 23304) in another column.
8.19.240.0/8 4 (11013 + 11915 + 31541 + 23304)
8.20.213.0/8 6 (96434 + 108191 + 170058 + 23512 + 10420 + 24809)
It would be great if someone could suggest some way to achieve this.
The main problem here is that without having the routing table from the individual moments the packets arrived, you have no idea what netblock they were originally in. Sure, you can put them in the class-full blocks they would be in, in a class-full routing situation, but all that will give you is a nice presentation (and, admittedly, a shorter file).
Furthermore, your example looks a bit broken. You have a bunch of IP addresses in 8.0.0.0/8 and you are aggregating them into what looks like /24 routes and presenting them with a /8 at the end.
Nonetheless, in awk you can use sub() to do text replacement (or you can use index to find occurrences of ., or you can use split to split at dots). It should be relatively easy to go from that to "drop last digit, add the string "0/24" and use that as a key to update an IP-count and a hit-count dictionary, then drop the last two octets and the slash, replace with "0.0/16" and do the same" (all arrays in awk are associative arrays, so essentially dicts). No need to sort in advance, when you loop through the result, you'll get the keys in a random order, but on average there will be fewer of them, so sorting afterwards will be cheaper.
I seem to not have an awk at hand, so I cannot give you a code example.
This might work for you:
awk '{a=$1;sub(/\.[^.]*$/,"",a);ac[a]++;at[a]+=$2};END{for(x in ac)print x".0/8",ac[x],at[x]}' file
This prints the '0/8 addresses to get the 0/16 duplicate the code i.e. b=a;sub(/\.[^.]*$/,"",b);ba[b]++ etc, etc.

Replace only within a XML tag; exporting from Referencer .reflib to bibtex format with filenames intact and URL-encoding removed, with a bash command

I have many references in Referencer. I'm trying to include filenames in my bibtex file when exporting from Referencer. Since the software doesn't do this by default I'm trying to use a sed command to include the filename as a bibtex information in the XML file before I export and thus include the filename.
Input
<doc>
<filename>file:///home/dwickrama/Desktop/stevenJonesLab/papers/Transcription%20Factor%20Binding/A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</filename>
<relative_filename>A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</relative_filename>
<key>Sadowski93</key>
<notes></notes>
<bib_type>article</bib_type>
<bib_doi></bib_doi>
<bib_title>A common nuclear signal transduction pathway activated by growth factor and cytokine receptors.</bib_title>
<bib_authors>Sadowski, H B and Shuai, K and Darnell, J E and Gilman, M Z</bib_authors>
<bib_journal>Science</bib_journal>
<bib_volume>261</bib_volume>
<bib_number>5129</bib_number>
<bib_pages>1739-44</bib_pages>
<bib_year>1993</bib_year>
<bib_extra key="pmid">8397445</bib_extra>
Ouput
<doc>
<filename>file:///home/dwickrama/Desktop/stevenJonesLab/papers/Transcription%20Factor%20Binding/A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</filename>
<bib_extra key="File">article:../Transcription\ Factor\ Binding/A\ Common\ Nuclear\ Signal\ Transduction\ Pathway\ Activated\ by\ Growth\ Factor\ and\ Cytokine.pdf:pdf</bib_extra>
<relative_filename>A%20Common%20Nuclear%20Signal%20Transduction%20Pathway%20Activated%20by%20Growth%20Factor%20and%20Cytokine.pdf</relative_filename>
<key>Sadowski93</key>
<notes></notes>
<bib_type>article</bib_type>
<bib_doi></bib_doi>
<bib_title>A common nuclear signal transduction pathway activated by growth factor and cytokine receptors.</bib_title>
<bib_authors>Sadowski, H B and Shuai, K and Darnell, J E and Gilman, M Z</bib_authors>
<bib_journal>Science</bib_journal>
<bib_volume>261</bib_volume>
<bib_number>5129</bib_number>
<bib_pages>1739-44</bib_pages>
<bib_year>1993</bib_year>
<bib_extra key="pmid">8397445</bib_extra>
I can use the following sed command to partially do what I want, but the URL encoding "%20" remains. How do I get rid of that in only the bibtex tag ?
sed -e 's/\(\ \ \ \ <filename>file:\/\/\/home\/dwickrama\/Desktop\/stevenJonesLab\/papers\)\([^.]*\)\(\.\?\)\(.*\)\(<\/filename>\)/\1\2\3\4\5\n\ \ \ \ <bib_extra\ key=\"File\">article:\.\.\2\3\4:\4<\/bib_extra>/g' NewPapers.reflib > NewPapers.new.reflib
Regex and sed are not very good tools for processing XML, or URL-decoding.
A quick script in more complete scripting language would be able to do it more clearly and reliably. For example in Python:
import urllib, urlparse
from xml.dom import minidom
doc= minidom.parse('NewPapers.reflib')
el= doc.getElementsByTagName('filename')[0]
path= urlparse.urlparse(el.firstChild.data)[2]
foldername, filename= map(urllib.unquote, path.split('/')[-2:])
extra= doc.createElement('bib_extra')
extra.setAttribute('key', 'File')
extra.appendChild(document.createTextNode('article:../%s/%s:pdf' % (foldername, filename)))
el.parentNode.insertBefore(extra, el.nextSibling)
doc.writexml(open('NewPapers.new.reflib'))
(I haven't included a function to reproduce the backslash-escaping in the given example output as it's not clearly exactly what format that is. The simplest approach would be filename= filename.replace(' ', '\\ '), but I'm not sure that would be correct.)
all you need is to add a line after right?? So just print it out after is searched.
#!/bin/bash
s='<bib_extra key="File">article:../Transcription\\ Factor\\ Binding/A\\ Common\\ Nuclear\\ Signal\\ Transduction\\ Pathway\\ Activated\\ by\\ Growth\\ Factor\\ and\\ Cytokine.pdf:pdf</bib_extra>'
awk -vstr="$s" '
/<filename>/{
print
print str;next
}
{print}' file

Resources