How to remove emoticons from text using bash

Is there a way to remove anything that is not a token, punctuation, or a special character from text using awk or sed? What I really want to get rid of are the emoticons and symbols like 󾌧.
Sample input:
Si tú no estáss yo no voy a lloraar por tiii🎶🎶
Me respondes porfavor?? 😭❤ piensas venir a Ecuador
cosas veredes!!!! Ay Papá. 😂😂😂
👀 🔵🔴 what y'all know about this?
🇲🇽👑❤️‼️ 🇲🇽👑❤️‼️ tag they make the final decision 🇲🇽🙏🏼👑
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 😉👍👏👏👏🇫🇮💕
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse 😂😂😂😂😭
ja mir fällt nix mehr ein😂😂
Někdo v pátek semnou na flédu na Moju reč??? 󾌧
Sample output:
Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor?? piensas venir a Ecuador
cosas veredes!!!! Ay Papá.
what y'all know about this?
‼️ ‼️ tag they make the final decision
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa.
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč???

My best solution uses Python; the Python file must be saved as UTF-8.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
text = u"""Si tú no estáss yo no voy a lloraar por tiii🎶🎶
Me respondes porfavor?? 😭❤ piensas venir a Ecuador
cosas veredes!!!! Ay Papá. 😂😂😂
👀 🔵🔴 what y'all know about this?
🇲🇽👑❤️‼️ 🇲🇽👑❤️‼️ tag they make the final decision 🇲🇽🙏🏼👑
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa. 😉👍👏👏👏🇫🇮💕
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse 😂😂😂😂😭
ja mir fällt nix mehr ein😂😂
Někdo v pátek semnou na flédu na Moju reč???
"""
emoji_pattern = re.compile(
"["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002760-\U0000276F" # emoticons
"]+", flags=re.UNICODE
)
print(emoji_pattern.sub(r'', text))
Output:
Si tú no estáss yo no voy a lloraar por tiii
Me respondes porfavor?? piensas venir a Ecuador
cosas veredes!!!! Ay Papá.
what y'all know about this?
‼️ ️‼️ tag they make the final decision
Vähän on twiitattavaa muuta kuin että aijjai ja oijjoi sekä nannaa.
Binta On est arrivé au chicken elle voulait pleuré carrément tellement elle était heureuse
ja mir fällt nix mehr ein
Někdo v pátek semnou na flédu na Moju reč???

This command will remove every character that is not alphabetic, numeric, punctuation or white space:
sed 's/[^[:alnum:][:punct:][:space:]]//g' input
Limitation: Some of those funny characters might be valid Unicode alphabetic characters for which your computer simply lacks an installed font. This command won't remove them.
How it works
[:alnum:], [:punct:], and [:space:] are character classes that match, respectively, any alphanumeric, punctuation, or white space character. The regex [^[:alnum:][:punct:][:space:]] matches any character that does not belong to one of those three classes. The sed substitution command s/[^[:alnum:][:punct:][:space:]]//g does a global search-and-replace: it finds every character not in one of those classes and replaces it with nothing, that is, removes it.

You might be able to use tr:
% tr -dc '[:print:]' < emoji.txt
Si t no estss yo no voy a lloraar por tiiiMe respondes porfavor?? piensas venir a Ecuadorcosas veredes!!!! Ay Pap. what y'all know about this? tag they make the final decision Vhn on twiitattavaa muuta kuin ett aijjai ja oijjoi sek nannaa. Binta On est arriv au chicken elle voulait pleur carrment tellement elle tait heureuse ja mir fllt nix mehr einNkdo v ptek semnou na fldu na Moju re???
As you can see, this also removes the newline characters and the accented letters. The newlines can be preserved with:
% tr -dc '[:print:]\n' < emoji.txt
Si t no estss yo no voy a lloraar por tiii
Me respondes porfavor?? piensas venir a Ecuador
cosas veredes!!!! Ay Pap.
what y'all know about this?
tag they make the final decision
Vhn on twiitattavaa muuta kuin ett aijjai ja oijjoi sek nannaa.
Binta On est arriv au chicken elle voulait pleur carrment tellement elle tait heureuse
ja mir fllt nix mehr ein
Nkdo v ptek semnou na fldu na Moju re???
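The accented letters disappear because tr (at least the GNU implementation) works on bytes rather than characters, so each byte of a multibyte UTF-8 sequence counts as non-printable. A minimal sketch that reproduces this on one line of the sample; `strip_non_ascii` is a hypothetical helper name, and forcing LC_ALL=C is an assumption made here to keep the behavior the same across systems:

```shell
# Delete every byte that is neither printable ASCII nor a newline.
# LC_ALL=C (an assumption, not part of the answer above) forces
# byte-wise processing regardless of the caller's locale.
strip_non_ascii() {
    LC_ALL=C tr -dc '[:print:]\n'
}

printf 'Ay Papá. 😂\n' | strip_non_ascii
```

The emoji is gone, but so is the á, so this approach only suits text that is expected to be plain ASCII.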

Related

cat with loop can't read long lines [duplicate]

I have a problem with my script. I tried to read an XML file with cat and process each line in a loop. For example:
cat file.xml | while read line; do echo $line; done
But my XML files contain very long lines without backslashes, and it seems that cat file.xml does not pass the long lines on to the loop. However, when I run cat file.xml without the 'while read line', it works.
Is cat limited by the line length, or did I do something wrong? What should I do to get these lines?
Thanks and bye.
Here is my script that does not work (in French):
#!/bin/bash
## SCRIPT THAT TAKES A TXT SOURCE WITH ONE PIECE OF TEXT PER LINE AND, USING A KEYWORD, INSERTS THOSE LINES INTO THE FILES FOUND IN A FOLDER WHOSE PATH IS GIVEN BY THE USER.
## EXAMPLE ##
## The user has a folder "X" containing XML files, each of which holds the keyword "motclefnumero1". With this script, the keyword can be replaced by the lines of a text file.
#### USER INPUT ####
echo 'Quel est le fichier source TXT (Possedant ce que vous voulez mettre)'
read textSource
echo 'Quel est le folder où les fichiers que vous souhaitez traiter sont placés?'
read folderSource
echo 'Indiquer le mot clé souhaité (Exemple : motclef1)'
read motClef
# cat file | cut -c1-80
# ARRAY CONTAINING THE LINES OF OUR TXT SOURCE
myArray=()
while IFS= read -r line; do
myArray+=("$line")
done < "$textSource"
i=0
## PROCESS
ls -1 "$folderSource" | while read file; do
cat "$folderSource/$file" | while read texte; do
# In case the folder folderSource/resultat does not exist
if [ ! -d "$folderSource/resultat" ]; then
mkdir "$folderSource/resultat"
fi
## Replace the keyword in the current line with our replacement text
echo ${texte//$motClef/${myArray[$i]}} >> "$folderSource/resultat/$file"
echo "Line $i : $texte"
## CONSOLE LOG
echo ${myArray[$i]} $folderSource/$file
echo $i
done
## Increment i var
i=$((i+1))
done
RESOLVED:
Hello, I've solved my problem. Instead of this:
cat "$folderSource/$file" | while read texte; do
use IFS= together with read -r to read each line, and it works:
while IFS='' read -r texte || [[ -n "$texte" ]]; do
done < "$folderSource/$file"
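A minimal, self-contained sketch of the pattern from the fix above, assuming a hypothetical function name `print_lines`. `IFS=` keeps leading and trailing whitespace, `-r` keeps backslashes literal, and the `|| [ -n "$line" ]` part handles a last line that has no trailing newline:

```shell
# Print each line of the file given as $1, prefixed with its line number.
print_lines() {
    n=0
    # IFS=  : do not trim leading/trailing whitespace
    # -r    : do not treat backslashes as escape characters
    # || .. : still process a final line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        n=$((n + 1))
        printf '%d: %s\n' "$n" "$line"
    done < "$1"
}
```

Called as `print_lines file.xml`, it reads from the file through a redirection instead of a `cat` pipe, and quoting `"$line"` in the output keeps the content from being word-split.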

Scantailor CLI output

I am working with scantailor-cli and I can't get any output images. Only the project file is created from the input images, and even that does not respect the configuration.
The sample bash script is:
#!/bin/bash
# This script requires: xsane, perl-rename, Scan Tailor
impresora="hpaio:/usb/Deskjet_F4400_series?serial=CN01BC111V05C5" # Printer name: use scanimage -L to list the available devices
dpi=150 # DPI to use
directorio_padre="scan" # Name of the folder where everything will be created
nombre_proyecto="proyecto" # Name of the Scan Tailor project
orientacion=left # Orientation used to rotate the pages in Scan Tailor; options: left, right, upsidedown and none
plantilla=2 # Project layout in Scan Tailor; options: 0 (automatic), 1 (single page), 1.5 (page and a half) and 2 (two pages)
contenido=normal # Content detection mode in Scan Tailor; options: cautious, normal and aggressive
margenes=10 # Margin added on all sides in Scan Tailor
alineacion_vertical=center # Vertical alignment of the content in Scan Tailor; options: top, center and bottom
alineacion_horizontal=center # Horizontal alignment of the content in Scan Tailor; options: left, center and right
# Get the absolute path of the repository; from http://stackoverflow.com/questions/59895/can-a-bash-script-tell-which-directory-it-is-stored-in
SCRIPT_PATH="${BASH_SOURCE[0]}";
if ([ -h "${SCRIPT_PATH}" ]) then
while([ -h "${SCRIPT_PATH}" ]) do SCRIPT_PATH=`readlink "${SCRIPT_PATH}"`; done
fi
pushd . > /dev/null
cd `dirname ${SCRIPT_PATH}` > /dev/null
SCRIPT_PATH=`pwd`;
popd > /dev/null
# Go to the folder where the script is located
echo "Yendo a «$SCRIPT_PATH»."
cd $SCRIPT_PATH
# Check whether a directory with the chosen name already exists; from https://stackoverflow.com/questions/59838/check-if-a-directory-exists-in-a-shell-script
if [ -d "$directorio_padre" ]; then
echo "ERROR: Ya existe el directorio con nombre «$directorio_padre»."
exit
fi
# Check that an integer was given; from https://unix.stackexchange.com/questions/151654/checking-if-an-input-number-is-an-integer
if ! [[ "$1" =~ ^[0-9]+$ ]]; then
echo "ERROR: Un número entero es necesario para el número de páginas a escanear."
exit
fi
# Scan with xsane
echo "Iniciando escaneando en nueva carpeta llamada «$directorio_padre»..."
mkdir $directorio_padre && cd $directorio_padre
mkdir originales && cd originales
echo "Escaneando portada a color..."
scanimage -d $impresora -v -p --resolution $dpi --format tiff > out0.tif
echo "Escaneando interiores en grises..."
scanimage -d $impresora -v -p --resolution $dpi --format tiff --mode Gray --batch --batch-start=1 --batch-count=$1
# Rename the files with perl-rename
echo "Cambiando nombres de los archivos..."
perl-rename -v "s/out(\d\d\.tif)/p_0\1/" *.tif
perl-rename -v "s/out(\d\.tif)/p_00\1/" *.tif
# Post-processing with Scan Tailor
cd ..
scantailor-cli -v --orientation=$orientacion --layout=$plantilla --deskew=auto --content-detection=$contenido --margins=$margenes --alignment-vertical=$alineacion_vertical --alignment-horizontal=$alineacion_horizontal --output-dpi=$dpi -o=$SCRIPT_PATH/$directorio_padre/$nombre_proyecto.ScanTailor $SCRIPT_PATH/$directorio_padre/originales $SCRIPT_PATH/$directorio_padre/scan-tailor
The Scan Tailor command in this script is: scantailor-cli -v --orientation=left --layout=2 --deskew=auto --content-detection=normal --margins=10 --alignment-vertical=center --alignment-horizontal=center --output-dpi=150 -o=path/to/proyecto.ScanTailor path/to/originales path/to/scan-tailor.
Is it possible to run the whole workflow through the CLI?
I just had the same problem. As far as I understand the logic, this is currently (version 0.9.12.2-1, Arch community repo) a bug in the program (I have now filed it here).
These are the steps, called "filters":
1. Fix Orientation
2. Split Pages
3. Deskew
4. Select Content
5. Margins
6. Output
According to scantailor-cli -h the default filter range is 4..6, but it really is 1..4, as you can see with -v. Hence you need to set --start-filter=4 --end-filter=6 explicitly.

wget.sh: line 124: syntax error: unexpected end of file

I have a big problem that I can't solve. I'm coding an application for my company; as you can see, my code consists of two bash functions.
When I try to run it I get the same error every time: wget.sh: line 124: syntax error: unexpected end of file (wget.sh is my file). I don't know why. I searched a lot, and it doesn't seem to be an obvious syntax error such as a forgotten fi after an if. Furthermore, I looked at my file and there is no line after 123...
Please help me solve this!
#!/bin/bash
#----------------------------------------------------ApplicationTaxa----------------------------------------------------------
#------------------------------------------------Author: Axel Bonnafoux-------------------------------------------------------
#Project requirement: the build.Xml file must be in the folder in order to run the Java code.
# ---------------------------------------------Project part 1: Concatenation (bash)--------------------------------------------
Annee=$(date +%Y)
Mois2=$(date +%m)
Mot="init"
Mot2="maj"
Mot3="Facture"
# Create a Year folder
if [ ! -d taxa/$Annee ]
then
mkdir -p taxa/$Annee
fi
When I run this without functions, it actually works! Help me understand why.
#Concat function
#Concatenates the client files fetched from the FTP server
Concat()
{
for Month in $F
do
# Create a Month folder
Mois=$(echo $FILES |cut -d '/' -f4 )
mkdir -p taxa/$Annee/$Mois
# Walk through the available files and concatenate them per month and per client
for Day in $Month'/*'
do
for file in $Day'/*'
do
filename1=$(echo $file |cut -d '/' -f6 )
filename2=$(echo $filename1|cut -d '-' -f1|cut -d '_' -f1)
# If the file does not exist, create it and copy the content into it
if [ ! -e taxa/$Annee/$Mois/$filename2.csv ]
then
touch taxa/$Annee/$Mois/$filename2.csv
cat $file >> taxa/$Annee/$Mois/$filename2.csv
# Concatenate the new client file with the existing one
else
cat $file |sed '1d' >> taxa/$Annee/$Mois/$filename2.csv
fi
done
done
done}
#-----------------------------------------Project part 2: Data processing (bash&&Java)------------------------------------------
#Traitement (processing) function
#Runs the Java part for each file, sums the costs and the durations of the calls
Traitement()
{
for FILES2 in $F2
do
#Get the current month
Mois=$(echo $FILES2 |cut -d '/' -f4 )
for D in $FILES2'/*'
do
#Create one Excel file per client
filename1=$(echo $D |cut -d '/' -f6 )
filename2=$(echo $filename1|cut -d '-' -f1|cut -d '_' -f1)
touch taxa/$Annee/$Mois/$filename2.xls
java -classpath Taxa2 WriteMatriceFG taxa/$Annee/$Mois/$filename2.csv taxa/$Annee/$Mois/$filename2.xls
#Initialize a correspondence table to be filled in manually later
#It contains the plans and hourly prices for each client
touch TableauCorrespondance_$Mois.xls
touch TableauCorrespondance_$Mois_2.xls
java -classpath Taxa2 WriteMatricePrix TableauCorrespondance_$Mois_2.xls filename2
cat TableauCorrespondance_$Mois_2.xls >> TableauCorrespondance_$Mois
rm TableauCorrespondance_$Mois_2.xls
#Check that the number of lines is correct and that the file is complete (that there are no gaps, in short)
NF = $(ls *csv | wc -l)
nbligne=$(wc -l TableauCorrespondance_$Mois_2.xls|cut -d ' ' -f1)
Res=java -classpath Taxa2 verification TableauCorrespondance_$Mois.xls
if [$NF=$((nbligne*3)) && Res]
then
#Finally, compute the invoice the client has to pay, based on the correspondence table, which must be filled in.
java -classpath Taxa2 MatriceTreatment filename2.xls TableauCorrespondance_$Mois.xls
else
echo "votre tableau de correspondance nest pas complet"
fi
done
done}
# fetch the data from the FTP server if we have nothing yet (with the n option), fetch only the current month's data with the maj option, and only process the data with any other option
if [ $1 = "$mot" ]
then
wget -m --ftp-user=********* --ftp-password=********* ftp://ftp-openvno.alphalink.fr/valo/$Annee
F=ftp-openvno.alphalink.fr/valo/$Annee'/*'
Concat
else
if [ -d taxa/$Annee/$Mois2 ] && [ $1 = "$Mot2" ]
then
rm -r taxa/$Annee/$Mois2
wget -m --ftp-user=*********** --ftp-password=******** ftp://ftp-openvno.alphalink.fr/valo/$Annee/$Mois2
F=ftp-openvno.alphalink.fr/valo/$Annee'/*'
Concat
else
F2=taxa/$Annee'/*'
Traitement
fi
fi
#remove the downloaded files, which are now obsolete
rm -r ftp-openvno.alphalink.fr
exit 0
This is most likely caused by a compound statement (an if, a loop, a function body) that is not closed correctly somewhere in your script. As mentioned in the comments, you can paste your script into shellcheck.net to get a useful report.
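One concrete candidate is visible in the script from the question: both functions end with `done}`. Bash only recognizes `done` and `}` as reserved words when they stand alone as separate words, so `done}` is read as an ordinary word, the `for` loop is never closed, and the parser runs into the end of the file. `bash -n` parses a script without executing it, which makes it easy to check this. A minimal sketch, where the helper name `syntax_ok` and the two sample snippets are hypothetical:

```shell
# Return success if bash can parse the script text passed as $1.
# bash -n reads the commands and checks the syntax without running them.
syntax_ok() {
    printf '%s\n' "$1" | bash -n 2>/dev/null
}

# Same shape as the functions in the question: "done}" glued together.
broken='f() {
for x in 1 2; do
echo "$x"
done}'

# "done" and "}" as separate words.
fixed='f() {
for x in 1 2; do
echo "$x"
done
}'

syntax_ok "$broken" || echo "broken: syntax error"
syntax_ok "$fixed" && echo "fixed: parses cleanly"
```

Running `bash -n wget.sh` directly on the real file gives the same kind of check. Putting `done` and `}` on separate lines (or writing `done; }`) is the change to try first.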

How to format a text file the way man pages are formatted (justified text, nothing more) using bash

What I would like to do is the following.
Text file content :
This is a simple text file
containing lines of text
with different width
but I would like to justify
them. Any idea ?
Expected result :
This is a simple text file containing
lines of text with different width
but I would like to justify them.
Any Idea ?
I already can split my files at the required width using :
cat textfile|fmt -s -w 37
But in that case, there is no justification...
EDIT: Using par as suggested, I found a problem with accented characters.
This is what par 37j1 gives me:
This is à simplé text file
containing lines of tèxt with
different wïdth but I woùld like to
justîfy them. Any idéà ?
Not justified anymore... But spaces are altered anyway...
Thanks for your help,
Slander
You can use nroff, the same tool that man uses:
(echo '.ll 37'
echo '.pl 0'
cat orig.txt) | nroff
from your input produces:
This is a simple text file containing
lines of text with different width
but I would like to justify them. Any
idea ?
The above WORKS ONLY WITH ASCII.
EDIT
If you want to handle UTF-8 text with nroff, you can try the following:
cat orig.txt | ( #yes, I know - UUOC
echo '.ll 37' #line length
echo '.pl 0' #page length (0 disables empty lines)
echo '.nh' #no hyphenation
preconv -e utf8 -
) | groff -Tutf8
From this utf8 encoded input:
Voix ambiguë d'un cœur qui au zéphyr préfère les jattes de kiwi.
Voyez le brick géant que j'examine près du wharf.
Monsieur Jack, vous dactylographiez bien mieux que votre ami Wolf.
Eble ĉiu kvazaŭ-deca fuŝĥoraĵo ĝojigos homtipon..
Laŭ Ludoviko Zamenhof bongustas freŝa ĉeĥa manĝaĵo kun spicoj.
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a
quickstepu.
produces:
Voix ambiguë d’un cœur qui au zéphyr
préfère les jattes de kiwi. Voyez le
brick géant que j’examine près du
wharf. Monsieur Jack, vous
dactylographiez bien mieux que votre
ami Wolf. Eble ĉiu kvazaŭ‐deca
fuŝĥoraĵo ĝojigos homtipon.. Laŭ
Ludoviko Zamenhof bongustas freŝa
ĉeĥa manĝaĵo kun spicoj. Nechť již
hříšné saxofony ďáblů rozezvučí síň
úděsnými tóny waltzu, tanga a
quickstepu.
If you delete the line
echo '.nh'
you will get hyphenated text:
Voix ambiguë d’un cœur qui au zéphyr
préfère les jattes de kiwi. Voyez le
brick géant que j’examine près du
wharf. Monsieur Jack, vous dactylo‐
graphiez bien mieux que votre ami
Wolf. Eble ĉiu kvazaŭ‐deca fuŝĥoraĵo
ĝojigos homtipon.. Laŭ Ludoviko Za‐
menhof bongustas freŝa ĉeĥa manĝaĵo
kun spicoj. Nechť již hříšné saxo‐
fony ďáblů rozezvučí síň úděsnými
tóny waltzu, tanga a quickstepu.
You could use par:
par -j -w37 < inputfile
The -j option would justify paragraphs.
-w denotes max output line length.
For your input, it'd produce:
This is a simple text file containing
lines of text with different width
but I would like to justify them. Any
idea ?
An alternative would be to use emacs:
emacs -batch inputfile --eval '(set-fill-column 37)' --eval '(fill-region (point-min) (point-max))' -f save-buffer
This would also produce:
This is a simple text file containing
lines of text with different width
but I would like to justify them. Any
idea ?

Bash - Rename part of Filenames

I need some help with my mini-script, which fixes Spanish filenames that are encoded in ISO_8859-1 and/or contain fragments like "&#00243".
The script is here: http://www.pastebin.com/vT5Z2BqE
Yesterday it worked with three substitutions. I added more and it doesn't work anymore, and I don't understand why.
Look, this is what happens if I run the commands in a Bash shell / gnome-terminal:
inukaze#Inukaze:~$ cd Filenames_to_fix
inukaze#Inukaze:~/Filenames_to_fix$
inukaze#Inukaze:~/Filenames_to_fix$ expresion='&#00176'
inukaze#Inukaze:~/Filenames_to_fix$ sustituto='°'
inukaze#Inukaze:~/Filenames_to_fix$ ls *$expresion*
01 - La Espada del Augurio &#00176.avi
inukaze#Inukaze:~/Filenames_to_fix$ for i in $( ls *$expresion* ); do
> orig=$i
> dest=$(echo $i | sed -e "s/$expresion/$sustituto/")
> mv $orig $dest
> done
mv: no se puede efectuar `stat' sobre «01»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «-»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «La»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «Espada»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «del»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «Augurio»: No existe el fichero o el directorio
mv: no se puede efectuar `stat' sobre «°»: No existe el fichero o el directorio
I need to replace part of the filename, for example "&#00176" with "°".
Can somebody explain why this error occurs and how to fix it?
I don't want an interactive mode, and I don't want to replace the extension. I want to rename the bad part of the filename, putting the correct character in its place :D.
Thank you for reading, and sorry for my bad English. Thank you for any help you can give me with this script.
You do not quote $orig and $dest, which causes problems when the filename contains spaces: mv receives the file name as several separate arguments, which is why it prints several error messages, one for each part of the name. Try to use
mv "$orig" "$dest"
instead.
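Combining the quoting fix with a glob, rather than looping over the output of ls, gives a loop that copes with spaces in file names. A minimal sketch, assuming a hypothetical function name `fix_names`; like the sed call in the question, it replaces only the first occurrence of the expression in each name, and it works on the current directory:

```shell
# Rename every file in the current directory whose name contains $1,
# replacing the first occurrence of $1 with $2.
fix_names() {
    expresion=$1
    sustituto=$2
    # The glob expands to whole file names, so spaces are preserved.
    for orig in *"$expresion"*; do
        # If nothing matches, the pattern stays unexpanded: skip it.
        [ -e "$orig" ] || continue
        # Text before the first match + substitute + text after it.
        dest=${orig%%"$expresion"*}$sustituto${orig#*"$expresion"}
        mv -- "$orig" "$dest"
    done
}
```

With the values from the question, `fix_names '&#00176' '°'` would turn `01 - La Espada del Augurio &#00176.avi` into `01 - La Espada del Augurio °.avi`. The expression is quoted inside the expansions so that it is matched literally instead of being treated as a glob pattern.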
The for loop splits the output of ls on whitespace. Since your file name contains whitespace, you need to iterate with a different delimiter.
Here is the equivalent using find and while, with NUL-delimited names:
find . -maxdepth 1 -name "*${expresion}*" -print0 | while IFS= read -r -d '' file
do
orig="$file"
dest=$(echo "$file" | sed -e "s/${expresion}/${sustituto}/")
mv "$orig" "$dest"
done
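HOWEVER, a better solution is probably to use the rename command (shown here in the util-linux syntax; the Perl-based rename expects a Perl expression instead):
rename "$expresion" "$sustituto" *"${expresion}"*
Is the rename command available?
rename "$expresion" "$sustituto" *"$expresion"*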

Resources