Add Tab Separator to Grep - bash

I am new to grep and awk, and I would like to create tab-separated values in the "frequency.txt" output file (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus - I modified it for the Khmer language). I've looked around (grep a tab in UNIX), but I can't find an example that makes sense to me for this bash script (I'm too much of a newbie).
I am using this bash script in cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk prints with a comma, but only on-screen. How can I place a tab (a comma would work as well) between the frequency and the term in frequency.txt?
Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា
អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា
អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក
នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន
មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ
ឥត​ចេញ​ថ្លៃ​ទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង
25043 ជា
22004 បាន
20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ
Thanks for your help!

You should be able to replace the whole lengthy sed command with a pair of tr calls (tr sets are plain lists of characters rather than regexes, so the brackets aren't needed, and the hyphen goes last so it isn't read as a range):
tr -d 'a-zA-Z0-9«»:;.,()?។”“|០១២៣៤៥៦៧៨៩-' | tr '\t' ' '
Comments:
's/​/ /g' - this looks like an empty pattern between the first two slashes (an empty regex re-uses the previous match, [a-zA-Z], which was already deleted, so that would be a no-op), but it actually contains the invisible word separator and is what turns those separators into spaces
's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets, they are literal (and more than one is redundant); the equivalent would be 's/[«»:;.,()?។”“|-]//g' (one pipe left in case you really want to delete it, and the hyphen moved to the end so it isn't taken as a range)
's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:
sed 's/^ *\([0-9]\+\) /\1\t/'
If you want the AWK command to output a tab:
awk 'BEGIN{OFS="\t"} {print $2, $1}'
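For instance, a small self-contained check of the sed fix (a sketch, assuming GNU sed so that \t in the replacement is understood; the letters stand in for Khmer words):
printf 'b\na\nb\nb\na\n' | sort | uniq -c | sort -rn | sed 's/^ *\([0-9]\+\) /\1\t/'
# 3TABb
# 2TABa
# (where TAB is an actual tab character, as in your desired output)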

What about writing the awk output to a file with ">"?

The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile
#!/bin/sh
sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?”“-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
END{for(item in a)printf "%s\t%d ", item, a[item]}' | \
tee ./outfile

Related

Sed output a value between two matching strings in a url

I have multiple urls as input
https://drive.google.com/a/domain.com/file/d/1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O/view?usp=drivesdk
https://drive.google.com/a/domain.com/file/d/1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD/view?usp=drivesdk
How can I create a sed command to simply return only the file ID?
desired output:
1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O
1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD
Looks like I need to start between /d/ and stop at /view but I'm not quite sure how to do that.
I've tried: sed -e 's/d\(.*\)\/view/\1/'
I was able to do this with cut -d '/' -f 8
also awk -F/ '{print $8}' file worked, thanks!
Your command was almost right:
# Wrong
sed -e 's/d\(.*\)\/view/\1/'
# better, removing unmatched stuff including the / after the d
sed -e 's/.*d\/\(.*\)\/view.*/\1/'
# better: using # for making the command easier to read
sed -e 's#.*d/\(.*\)/view.*#\1#'
# Alternative: using grep and cut when you don't know which field /d/ is
some_stream | grep -Eo '/d/.*/view' | cut -d/ -f3
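As a quick check against the sample input (a sketch, assuming the two URLs are saved in a file called urls.txt):
sed -e 's#.*d/\(.*\)/view.*#\1#' urls.txt
# 1OR9QLGsxiLrJIz3JAdbQRACd-G9ZfL3O
# 1sEWMFqGW9p2qT-8VIoBesPlVJ4xvOzXD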

Delete words in a line using grep or sed

I want to delete three words with a special character on a line such as
Input:
\cf4 \cb6 1749,1789 \cb3 \
Output:
1749,1789
I have tried a couple sed and grep statements but so far none have worked, mainly due to the character \.
My unsuccessful attempt:
sed -i 's/ [.\c ] //g' inputfile.ext >output file.ext
Awk accepts a regex Field Separator (in this case, comma or space):
$ awk -F'[ ,]' '$0 = $3 "." $4' <<< '\cf4 \cb6 1749,1789 \cb3 \'
1749.1789
-F'[ ,]' - Use a single character from the set space/comma as Field Separator
$0 = $3 "." $4 - If we can set the entire line $0 to Field 3 $4 followed by a literal period "." followed by Field 4 $4, do the default behavior (print entire line)
Replace <<< 'input' with a file name if every line of that file has the same delimiters (space/comma) and number of fields. If your input file is more complex than the sample you shared, please edit your question to show actual input.
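For example, to run the same one-liner over a whole file and save the result (a sketch; the file names here are only placeholders taken from your attempt):
awk -F'[ ,]' '$0 = $3 "." $4' inputfile.ext > outputfile.ext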
The backslash is a special meta-character that confuses bash.
We treat it like any other meta-character, by escaping it, with--you guessed it--a backslash!
But first, we need to grep this pattern out of our file
grep -E '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file # Close enough! (-E so that + means "one or more")
Now, just sed out those pesky backslashes
| sed -e 's/\\//g' # Don't forget the g, otherwise it'll only strip out 1 backslash
Now, finally, sed out the clusters of 2 alpha followed by a number and a space!
| sed -e 's/[a-z][a-z][0-9] //g'
And, finally....
grep -E '\\... \\... [0-9]+,[0-9]+ \\... \\' our_file | sed -e 's/\\//g' | sed -e 's/[a-z][a-z][0-9] //g'
Output:
1749,1789
My guess is you are having trouble because you have backslashes in input and can't figure out how to get backslashes into your regex. Since backslashes are escape characters to shell and regex you end up having to type four backslashes to get one into your regex.
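To illustrate the quoting difference (a small sketch; inputfile.ext stands in for your file, and \cf4 is taken from your sample line):
grep '\\cf4' inputfile.ext    # single quotes: two backslashes reach grep and match one literal backslash
grep "\\\\cf4" inputfile.ext  # double quotes: four backslashes are needed for the same match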
Ben Van Camp already posted an answer that uses single quotes to make the escaping a little easier; however I shall now post an answer that simply avoids the problem altogether.
grep -o '[0-9]*,[0-9]*' | tr , .
Locks on to the comma and selects the digits on either side and outputs the number. Alternately if comma is not guaranteed we can do it this way:
egrep -o ' [0-9,]*|^[0-9,]*' | tr , . | tr -d ' '
Both of these assume there's only one usable number per line.
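For example, against the sample line (a sketch; printf is used instead of echo so the backslashes pass through untouched):
printf '%s\n' '\cf4 \cb6 1749,1789 \cb3 \' | grep -o '[0-9]*,[0-9]*' | tr , .
# 1749.1789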
$ awk '{sub(/,/,".",$3); print $3}' file
1749.1789
$ sed 's/\([^ ]* \)\{2\}\([^ ]*\).*/\2/; s/,/./' file
1749.1789

Replace 5 dots with a single space

I have a title that has 5 consecutive dots which I'd like replaced with just one space using bash script. Doing this is not helping:
tr '.....' ' '
Obviously because it's replacing the five dots with 5 spaces.
Basically, I have a title that I want changed to a slug. So I'm using:
tr A-Z a-z | tr '[:punct:] [:blank:]' '-'
to change everything to lowercase and change any punctuation mark and spaces to a hyphen, but I'm stuck with the dots.
The title I'm using is something like: Believe.....Right Now
So I want that turned into believe-right-now
How do I change the 5 dots to a single space?
You don't need sed or awk. Your original tr command should do the trick, you just need to add the -s flag. After tr translates the desired characters into hyphens, -s will squeeze all repeated hyphens into one:
tr A-Z a-z | tr -s '[:punct:] [:blank:]' '-'
I'm not sure what the input/output context is for you, but I tested the above as follows, and it worked for me:
tr A-Z a-z <<< "Believe.....Right Now" | tr -s '[:punct:] [:blank:]' '-'
output:
believe-right-now
See http://www.ss64.com/bash/tr.html for reference.
Transliterations are performed with mappings, which means each character is mapped to something else (or deleted, with tr -d). This is the reason why tr '.....' ' ' does not work.
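You can see the one-to-one mapping by running the original command on the sample title; each of the five dots becomes its own space:
echo 'Believe.....Right Now' | tr '.....' ' '
# Believe     Right Now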
Replacing five dots with space:
using sed with extended regular expressions:
$ sed -r 's/\.{5}/ /g' <<< "Believe.....Right Now"
Believe Right Now
using sed without -r:
$ sed 's/\.\{5\}/ /g' <<< "Believe.....Right Now"
Believe Right Now
using parameter expansion:
$ text="foo.....bar.....zzz" && echo "${text//...../ }"
foo bar zzz
Replacing five dots and spaces with -:
$ sed -r 's/\.{5}| /-/g' <<< "Believe.....Right Now"
Believe-Right-Now
Full replacement -- ditching tr usage:
$ sed -re 's/\.{5}| /-/g' -e 's/([A-Z])/\l&/g' <<< "Believe.....Right Now"
believe-right-now
or, in case your sed version does not support the flag -r, you may use:
$ sed -e 's/\.\{5\}\| /-/g' -e 's/\([A-Z]\)/\l&/g' <<< "Believe.....Right Now"
believe-right-now
$ cat file
Believe.....Right Now
$ awk '{gsub(/[[:punct:][:space:].]+/,"-"); print tolower($0)}' file
believe-right-now

How to compose custom command-line argument from file lines?

I know about the xargs utility, which allows me to convert lines into multiple arguments, like this:
echo -e "a\nb\nc\n" | xargs
Results in:
a b c
But I want to get:
a:b:c
The character : is used for an example. I want to be able to insert any separator between lines to get a single argument. How can I do it?
If you have a file with multiple lines that you want to turn into a single argument, replacing the newlines with a single character, the paste command is what you need:
$ echo -en "a\nb\nc\n" | paste -s -d ":"
a:b:c
Then, your command becomes:
your_command "$(paste -s -d ":" your_file)"
EDIT:
If you want to insert more than a single character as a separator, you could use sed before paste:
your_command "$(sed -e '2,$s/^/<you_separator>/' your_file | paste -s -d "")"
Or use a single more complicated sed:
your_command "$(sed -n -e '1h;2,$H;${x;s/\n/<you_separator>/gp}' your_file)"
The example you gave was not working for me without the -e flag (plain echo does not interpret the \n escapes). You would need:
echo -e "a\nb\nc\n" | xargs
to get a b c.
Coming back to your need, you could do this:
echo "a b c" | awk 'OFS=":" {print $1, $2, $3}'
it will change the separator from space to : or whatever you want it to be.
You can also use sed:
echo "a b c" | sed -e 's/ /:/g
that will output a:b:c.
After all these data processing, you can use xargs to perform the command you want to. Just | xargs and do whatever you want.
Hope it helps.
You can join the lines using xargs and then replace the spaces (' ') using sed.
echo -e "a\nb\nc"|xargs| sed -e 's/ /:/g'
will result in
a:b:c
Obviously you can use this output as an argument for another command using another xargs.
echo -e "a\nb\nc"|xargs| sed -e 's/ /:/g'|xargs

shell replace cr\lf by comma

I have input.txt
1
2
3
4
5
I need to get such output.txt
1,2,3,4,5
How to do it?
Try this:
tr '\n' ',' < input.txt > output.txt
With sed, you could use:
sed -e 'H;${x;s/\n/,/g;s/^,//;p;};d'
The H appends the pattern space to the hold space (saving the current line in the hold space). The ${...} surrounds actions that apply to the last line only. Those actions are: x swap hold and pattern space; s/\n/,/g substitute embedded newlines with commas; s/^,// delete the leading comma (there's a newline at the start of the hold space); and p print. The d deletes the pattern space - no printing.
You could also use, therefore:
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}'
The -n suppresses default printing so the final d is no longer needed.
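A quick check with the sample input (input.txt holding the numbers 1 through 5, one per line):
sed -n -e 'H;${x;s/\n/,/g;s/^,//;p;}' input.txt
# 1,2,3,4,5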
This solution assumes that the CRLF line endings are the local native line ending (so you are working on DOS) and that sed will therefore generate the local native line ending in the print operation. If you have DOS-format input but want Unix-format (LF only) output, then you have to work a bit harder - but you also need to stipulate this explicitly in the question.
It worked OK for me on MacOS X 10.6.5 with the numbers 1..5, and 1..50, and 1..5000 (23,893 characters in the single line of output); I'm not sure that I'd want to push it any harder than that.
In response to @Jonathan's comment on @eumiro's answer:
tr -s '\r\n' ',' < input.txt | sed -e 's/,$/\n/' > output.txt
tr and sed used to be very good, but when it comes to file parsing and regex you can't beat perl
(Not sure why people think that sed and tr are closer to shell than perl...)
perl -pe 's/\n/,/' your_file
if you want pure shell to do it then look at string matching
${string/#substring/replacement}
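For instance, a pure-bash sketch (assuming bash 4+ for mapfile) that reads the lines into an array and joins them with a comma:
# Read the lines of input.txt into an array (mapfile -t strips the trailing newlines);
# if the file really has CRLF endings, strip the \r first, e.g. with tr -d '\r'.
mapfile -t lines < input.txt
# Join the array elements with commas via IFS, in a subshell so IFS is not changed globally.
(IFS=,; echo "${lines[*]}" > output.txt)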
Use the paste command. Here it is using a pipe:
echo -e "1\n2\n3\n4\n5" | paste -s -d, /dev/stdin
Here it is using a file:
echo -e "1\n2\n3\n4\n5" > /tmp/input.txt
paste -s -d, /tmp/input.txt
Per the man page, -s concatenates all lines and -d defines the delimiter character.
Awk versions:
awk '{printf("%s,",$0)}' input.txt
awk 'BEGIN{ORS=","} {print $0}' input.txt
Output - 1,2,3,4,5,
Since you asked for 1,2,3,4,5 as opposed to 1,2,3,4,5, (note the trailing comma after 5; most of the solutions above include it), here are two more versions with Awk (with wc and sed) that get rid of the last comma:
i='input.txt'; awk -v c=$(wc -l $i | cut -d' ' -f1) '{printf("%s",$0);if(NR<c){printf(",")}}' $i
awk '{printf("%s,",$0)}' input.txt | sed 's/,\s*$//'
printf "1\n2\n3" | tr '\n' ','
if you want to output that to a file just do
printf "1\n2\n3" | tr '\n' ',' > myFile
if you have the content in a file do
cat myInput.txt | tr '\n' ',' > myOutput.txt
python version:
python -c 'import sys; print(",".join(sys.stdin.read().splitlines()))'
Doesn't have the trailing comma problem (because join works that way), and splitlines splits data on native line endings (and removes them).
cat input.txt | sed -e 's|$|,|' | xargs | tr -d ' '
