Replace 5 dots with a single space - bash

I have a title that has 5 consecutive dots which I'd like replaced with just one space using bash script. Doing this is not helping:
tr '.....' ' '
Obviously because it's replacing the five dots with 5 spaces.
Basically, I have a title that I want changed to a slug. So I'm using:
tr A-Z a-z | tr '[:punct:] [:blank:]' '-'
to change everything to lowercase and change any punctuation mark and spaces to a hyphen, but I'm stuck with the dots.
The title I'm using is something like: Believe.....Right Now
So I want that turned into believe-right-now
How do I change the 5 dots to a single space?

You don't need sed or awk. Your original tr command should do the trick, you just need to add the -s flag. After tr translates the desired characters into hyphens, -s will squeeze all repeated hyphens into one:
tr A-Z a-z | tr -s '[:punct:] [:blank:]' '-'
I'm not sure what the input/output context is for you, but I tested the above as follows, and it worked for me:
tr A-Z a-z <<< "Believe.....Right Now" | tr -s '[:punct:] [:blank:]' '-'
output:
believe-right-now
See http://www.ss64.com/bash/tr.html for reference.

Transliterations are performed with mappings, which means each character is mapped into something else -- or delete, with tr -d. This is the reason why tr '.....' ' ' does not work.
Replacing five dots with space:
using sed with extended regular expressions:
$ sed -r 's/\.{5}/ /g' <<< "Believe.....Right Now"
Believe Right Now
using sed without -r:
$ sed 's/\.\{5\}/ /g' <<< "Believe.....Right Now"
Believe Right Now
using parameter expansion:
$ text="foo.....bar.....zzz" && echo "${text//...../ }"
foo bar zzz
Replacing five dots and spaces with -:
$ sed -r 's/\.{5}| /-/g' <<< "Believe.....Right Now"
Believe-Right-Now
Full replacement -- ditching tr usage:
$ sed -re 's/\.{5}| /-/g' -e 's/([A-Z])/\l&/g' <<< "Believe.....Right Now"
believe-right-now
or, in case your sed version does not support the flag -r, you may use:
$ sed -e 's/\.\{5\}\| /-/g' -e 's/\([A-Z]\)/\l&/g' <<< "Believe.....Right Now"
believe-right-now

$ cat file
Believe.....Right Now
$ awk '{gsub(/[[:punct:][:space:].]+/,"-"); print tolower($0)}' file
believe-right-now

Related

Delete words in a line using grep or sed

I want to delete three words with a special character on a line such as
Input:
\cf4 \cb6 1749,1789 \cb3 \
Output:
1749,1789
I have tried a couple sed and grep statements but so far none have worked, mainly due to the character \.
My unsuccessful attempt:
sed -i 's/ [.\c ] //g' inputfile.ext >output file.ext
Awk accepts a regex Field Separator (in this case, comma or space):
$ awk -F'[ ,]' '$0 = $3 "." $4' <<< '\cf4 \cb6 1749,1789 \cb3 \'
1749.1789
-F'[ ,]' - Use a single character from the set space/comma as Field Separator
$0 = $3 "." $4 - If we can set the entire line $0 to Field 3 $4 followed by a literal period "." followed by Field 4 $4, do the default behavior (print entire line)
Replace <<< 'input' with file if every line of that file has the same delimeters (spaces/comma) and number of fields. If your input file is more complex than the sample you shared, please edit your question to show actual input.
The backslash is a special meta-character that confuses bash.
We treat it like any other meta-character, by escaping it, with--you guessed it--a backslash!
But first, we need to grep this pattern out of our file
grep '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file # Close enough!
Now, just sed out those pesky backslashes
| sed -e 's/\\//g' # Don't forget the g, otherwise it'll only strip out 1 backlash
Now, finally, sed out the clusters of 2 alpha followed by a number and a space!
| sed -e 's/[a-z][a-z][0-9] //g'
And, finally....
grep '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file | sed -e 's/\\//g' | sed -e 's/[a-z][a-z][0-9] //g'
Output:
1749,1789
My guess is you are having trouble because you have backslashes in input and can't figure out how to get backslashes into your regex. Since backslashes are escape characters to shell and regex you end up having to type four backslashes to get one into your regex.
Ben Van Camp already posted an answer that uses single quotes to make the escaping a little easier; however I shall now post an answer that simply avoids the problem altogether.
grep -o '[0-9]*,[0-9]*' | tr , .
Locks on to the comma and selects the digits on either side and outputs the number. Alternately if comma is not guaranteed we can do it this way:
egrep -o ' [0-9,]*|^[0-9,]*' | tr , . | tr -d ' '
Both of these assume there's only one usable number per line.
$ awk '{sub(/,/,".",$3); print $3}' file
1749.1789
$ sed 's/\([^ ]* \)\{2\}\([^ ]*\).*/\2/; s/,/./' file
1749.1789

understanding SED commands

I need to understand a shell code which uses the following command to fetch directions from a source to destination using GOOGLE MAPS API:
wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false > new.txt
Next we fetch the following line of the output:
**"html_instructions" : "Head \u003cb\u003enorthwest\u003c/b\u003e"**
grep -n html_instructions new.txt > new1.txt
Can somebody please tell me the meaning of using:
sed -e 's/\\u003cb//g'
etc in the following command:
sed -e 's/\\u003cb//g' -e 's/\\u003e//g' -e 's/\\u003c\/b//g' -e 's/\\u003c//g' -e 's/div.*div//g' -e 's/.*://g' -e 's/"//g' -e 's/ "//g' new1.txt > new2.txt
Which outputs Head northwest only.
Thanks in advance!
sed -e 's/\\u003cb//g' -e 's/\\u003e//g' -e 's/\\u003c\/b//g' -e 's/\\u003c//g' -e 's/div.*div//g' -e 's/.*://g' -e 's/"//g' -e 's/ "//g' new1.txt > new2.txt
The string after each -e is a sed command. The sed command s/\\u003cb//g searches for all occurrences of the unicode character 003CB (which is a greek small letter upsilon with dialytika) and replaces it with nothing. In other words, it remove the character from the string.
Thus, that sed command removes every occurrence of unicode characters 003cb, u003e, and u003c from the lines and new1.txt and sends the output to new2.txt.
Additionally, s/div.*div//g causes any string that begins and ends with "div" to be removed. The command s/.*://g removes any text from the beginning of the line to the last colon in the line. s/"//g removes the every occurence of the double-quote character. s/ "//g removes every occurrence of space followed by double-quote.
In general, the sed command s/new/old/ searches for the first occurrence of new and replaces it with old. With a g appended at the end, as in s/new/old/g, it makes the substitution globally: looks for every occurrence of new and replaces it with old. Adding a lot of power to these commands, new may be a regular expression. Consider s/.*://g. The dot character has the special meaning of "any character at all". The star character means zero or more of the preceding character. Thus the regular expression.*:` means zero or more of any characters followed by a colon.
You can take all in one go with awk:
awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}'
Head northwest
So whole line should be:
wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false | awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}'
Head northwest
to get it into a variable
d=$(wget --no-parent -O - https://maps.googleapis.com/maps/api/directions/json?origin=$begin\&destination=$finish\&sensor=false | awk -F\" '/html_instructions/ {gsub(/(\\u003(c|cb|e)|\/b)/,x);print $4}')
echo $d
Head northwest

escaping newlines in sed replacement string

Here are my attempts to replace a b character with a newline using sed while running bash
$> echo 'abc' | sed 's/b/\n/'
anc
no, that's not it
$> echo 'abc' | sed 's/b/\\n/'
a\nc
no, that's not it either. The output I want is
a
c
HELP!
Looks like you are on BSD or Solaris. Try this:
[jaypal:~/Temp] echo 'abc' | sed 's/b/\
> /'
a
c
Add a black slash and hit enter and complete your sed statement.
$ echo 'abc' | sed 's/b/\'$'\n''/'
a
c
In Bash, $'\n' expands to a single quoted newline character (see "QUOTING" section of man bash). The three strings are concatenated before being passed into sed as an argument. Sed requires that the newline character be escaped, hence the first backslash in the code I pasted.
You didn't say you want to globally replace all b. If yes, you want tr instead:
$ echo abcbd | tr b $'\n'
a
c
d
Works for me on Solaris 5.8 and bash 2.03
In a multiline file I had to pipe through tr on both sides of sed, like so:
echo "$FILE_CONTENTS" | \
tr '\n' ¥ | tr ' ' ∑ | mySedFunction $1 | tr ¥ '\n' | tr ∑ ' '
See unix likes to strip out newlines and extra leading spaces and all sorts of things, because I guess that seemed like the thing to do at the time when it was made back in the 1900s. Anyway, this method I show above solves the problem 100%. Wish I would have seen someone post this somewhere because it would have saved me about three hours of my life.
echo 'abc' | sed 's/b/\'\n'/'
you are missing '' around \n

Add Tab Separator to Grep

I am new to grep and awk, and I would like to create tab separated values in the "frequency.txt" file output (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus - I modified it for the Khmer language). I've looked around ( grep a tab in UNIX ), but I can't seem to find an example that makes sense to me for this bash script (I'm too much of a newbee).
I am using this bash script in cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well), between the frequency and the term?
Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា
អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា
អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក
នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន
មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ
ឥត​ចេញ​ថ្លៃ​ទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង 25043 ជា 22004 បាន 20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង 25043TABជា 22004TABបាន 20515TABនោះ
Thanks for your help!
You should be able to replace the whole lengthy sed command with this:
tr -d '[a-zA-Z][0-9]«»:;.,()-?។”“|០១២៣៤៥៦៧៨៩'
tr '\t' ' '
Comments:
's/​/ /g' - the first two slashes mean re-use the previous match which was [a-z][A-Z] and replace them with spaces, but they were deleted so this is a no-op
's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets, they are literal (and more than one is redundant), the equivalent would be 's/[«»:;.,()-?។”“|]//g' (leaving one pipe in case you really want to delete them)
's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:
sed 's/^ *\([0-9]\+\) /\1\t/'
If you want the AWK command to output a tab:
awk 'BEGIN{OFS='\t'} {print $2, $1}'
What about writing awk to file with "<"?
The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile
#!/bin/sh
sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?”“-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
END{for(item in a)printf "%s\t%d ", item, a[item]}' | \
tee ./outfile

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need to get such output.txt
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to #Jonathan's comment to #eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used be very good but when it comes to file parsing and regex you can't beat perl
(Not sure why people think that sed and tr are closer to shell than perl... )
perl -pe 's/\n/$1,/' your_file
if you want pure shell to do it then look at string matching
${string/#substring/replacement}
Use paste command. Here is using pipes:
echo "1\n2\n3\n4\n5" | paste -s -d, /dev/stdin
Here is using a file:
echo "1\n2\n3\n4\n5" > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per man pages the s concatenates all lines and d allows to define the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5, as compared to 1,2,3,4,5, (note the comma after 5, most of the solutions above also include the trailing comma), here are two more versions with Awk (with wc and sed) to get rid of the last comma:
i='input.txt'; awk -v c=$(wc -l $i | cut -d' ' -f1) '{printf("%s",$0);if(NR<c){printf(",")}}' $i
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
cat input.txt | sed -e 's|$|,|' | xargs -i echo "{}"

Resources