Shorten sed substitution or possible alternative - bash

I have some data being fed into a few files. The requirement is to format the text in these files and add newlines after formatting.
Requirement of substitution:
Text | Substituted
-----------------------
#Network | #Network
# Network | #Network
#Daemon | #Daemon
# Daemon | #Daemon
#Service | #Service
# Service | #Service
----------------------
I've tried using sed to do this, but the command gets huge and cluttered: the substitution is not limited to the letters N, D and S, and more capital letters are added to the requirement day by day.
cat results_090316.out | sed -e 's/ //g' -e 's/#N/#N/g' -e 's/#S/#S/g' -e 's/#D/#D/g' -e 's/# N/#N/g' -e 's/# S/#S/g' -e 's/# D/#D/g' | tr '#' '\n'
If sed is not the proper tool to perform such substitutions, could you suggest an alternative?
The code is written in bash on RHEL 6 / Solaris 10 OS.

You can shorten it using a character class and optional space matching:
sed 's/ //g; s/# *\([NDS]\)/#\1/g' results_090316.out

Your choice of tool is alright, but you're not using the full power of regular expressions. For example, below I use a "character class" to create a custom group of characters to match, e.g. [NSD], and then use a "backreference" (\1) by first "capturing" a piece of the search (with \( and \)):
cat results_090316.out | sed -e 's/ //g' -e 's/#\([NSD]\)/#\1/g' -e 's/# \([NSD]\)/#\1/g' | tr '#' '\n'
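But we can do better and use the ? "quantifier" (zero or one of the preceding atom) to combine the no-space and space cases. Note that \? in a basic regular expression is a GNU sed extension, so this form may not work with the stock Solaris sed: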
cat results_090316.out | sed -e 's/ //g' -e 's/# \?\([NSD]\)/#\1/g' | tr '#' '\n'
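Since the question says new capital letters keep being added, one option is to match any capital letter rather than listing them. The following is only a sketch: normalize_tags is a hypothetical helper name, the sample lines are made up, and it deliberately does not strip every space the way the original 's/ //g' does. It uses " *" instead of "\?" so that it does not depend on GNU extensions.

```shell
#!/bin/sh
# Hypothetical helper: collapse "#   X" to "#X" for any capital letter X.
# [[:upper:]] is a POSIX character class; " *" means zero or more spaces.
normalize_tags() {
    sed 's/# *\([[:upper:]]\)/#\1/g'
}

# Made-up sample input, one tag per line.
printf '%s\n' '# Network up' '#Daemon ok' '#  Service down' | normalize_tags
```

The output can still be piped to tr '#' '\n' as in the original command.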

Related

How to (optimally) pick a single normalized random word from a file with bash / sed / shuf?

I'm looking to remove any non-alphabetic (English) characters and make the output lower-case from /usr/share/dict/words. Here's what I have so far:
sed "$(shuf -i "1-$(cat /usr/share/dict/words | wc -l)" -n 1)q;d" /usr/share/dict/words | tr '[:upper:]' '[:lower:]' | sed 's/[^-a-z]//g'
This works fine but is it possible to do it all in the one sed command?
EDIT: The American word file looks like this:
A
A's
AMD
AMD's
AOL
AOL's
AWS
AWS's
Aachen
Aachen's
I'm looking to make this lower-case and remove any non-alphabetic characters (as mentioned in my original question). The solution I have works fine but I'm hoping to reduce the number of commands (maybe just sed?). Output of the above would then be:
a
as
amd
amds
aol
aols
aws
awss
aachen
aachens
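You don't need sed and wc: shuf can pick a random line from a file by itself.
tr can remove the non-alphabetic characters, so the second sed isn't needed either (keeping \n in the set preserves the trailing newline):
shuf -n1 /usr/share/dict/words | tr -dc '[:alpha:]\n' | tr '[:upper:]' '[:lower:]'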
This single awk command should do the job:
awk '{gsub(/[^[:alpha:]]+/, ""); print tolower($0)}' file
a
as
amd
amds
aol
aols
aws
awss
aachen
aachens
This might work for you (GNU sed and shuf):
shuf -n1 /usr/share/dict/words | sed 's/[^[:alpha:]-]//g;s/.*/\L&/'
Choose a random line, remove any non-alpha (except hyphen) characters and lowercase the result.
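The \L replacement above is specific to GNU sed. As a rough sketch of the same normalization without it, the two tr calls can be wrapped in a function and checked against fixed input before putting shuf in front. normalize_words is a hypothetical name, and unlike the sed version this variant also drops hyphens.

```shell
#!/bin/sh
# Hypothetical helper: keep only letters and newlines, then lowercase.
normalize_words() {
    tr -dc '[:alpha:]\n' | tr '[:upper:]' '[:lower:]'
}

# Fixed sample taken from the word list shown in the question.
printf '%s\n' "AMD's" 'Aachen' "AOL's" | normalize_words
```

For a random word, feed it from shuf: shuf -n1 /usr/share/dict/words | normalize_words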

Replace file line with multi-line, special char string

I'm trying to automate generating a README.md.
The idea is:
1. Generate markdown table string like...
table="| Image | Name | Description | Notes |\n"
table+="| --- | --- | --- | --- |\n"
table+="| $img1 | $name1 | $desc1 | $notes1 |\n"
table+="| $img2 | $name2 | $desc2 | $notes2 |\n"
...
*simplified
*contains special characters like e.g. |-()[]/<>
2. Replace <!-- insert-table-here --> in a readme_template.md file with the full table
## Header
<!-- insert-table-here -->
<sub>More info...</sub>
3. Save new file as README.md
I can't get step 2 working.
How do you replace a line in a file with a multi-line, special char ridden string?
Every sed, awk, perl, or even head/tail command I try seems to not work. Are heredocs the better approach?
I have found some hack solutions for specific cases with specific chars but I want to identify a more robust method.
EDIT: Thanks to @potong, this is what ended up working for me.
echo -e "${table}" | sed -e '/<!-- insert-table-here -->/{r /dev/stdin' -e 'd}' readme_template.md > README.md
EDIT 2: After spending some more time on this, I found a nice multi-match option through awk
awk \
-v t1="$(generate_table1)" \
-v t2="$(generate_table2)" \
'{
gsub(/<!-- insert-table-1 -->/,t1)
gsub(/<!-- insert-table-2 -->/,t2)
}1' \
readme_template.md > README.md
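One caveat with the awk approach: awk processes backslash escapes in values passed with -v, and gsub treats & in the replacement as "the matched text", so a table containing those characters can come out altered. A possible workaround, sketched below, passes the strings through the environment and replaces the whole marker line. replace_marker is a hypothetical helper, and it assumes the marker sits on a line of its own.

```shell
#!/bin/sh
# Hypothetical helper: replace_marker MARKER TEXT FILE
# Prints FILE with every line containing MARKER replaced by TEXT, literally.
# ENVIRON values are not escape-processed, and print does not interpret "&".
replace_marker() {
    marker=$1 text=$2 awk '
        index($0, ENVIRON["marker"]) { print ENVIRON["text"]; next }
        { print }
    ' "$3"
}

# Demo on a throwaway copy of the template from the question.
template=$(mktemp)
printf '%s\n' '## Header' '<!-- insert-table-here -->' '<sub>More info...</sub>' > "$template"
replace_marker '<!-- insert-table-here -->' '| a & b | C:\new |' "$template"
rm -f "$template"
```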
This might work for you (GNU sed and bash):
cat <<\! | sed -e '/<!-- insert-table-here -->/{r /dev/stdin' -e 'd}' file
Here is a heredoc
with special symbols
{}-()[]/<>
!
The heredoc is piped through to the sed command using /dev/stdin as a file for the r command, then the original line is deleted using the d command.
N.B. The use of the -e command line option to split the two parts of the sed script (oneliner). This is necessary because the r command needs to be terminated by a newline and the -e option provides this functionality.
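To try the r /dev/stdin technique in isolation, here is a small sketch that wraps it in a function and runs it against a temporary copy of the template. fill_template and the sample table are assumptions for illustration, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical helper: fill_template TEMPLATE
# Reads the replacement text on stdin and prints TEMPLATE with the marker
# line swapped for that text. sed never parses the text, so characters
# such as | & / \ pass through unchanged.
fill_template() {
    sed -e '/<!-- insert-table-here -->/{r /dev/stdin' -e 'd}' "$1"
}

template=$(mktemp)
printf '%s\n' '## Header' '<!-- insert-table-here -->' '<sub>More info...</sub>' > "$template"

# Made-up table rows containing special characters.
table='| Image | Name |
| --- | --- |
| ![x](a.png) | A & B |'

printf '%s\n' "$table" | fill_template "$template"
rm -f "$template"
```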

Unable to echo variable value

I'm assigning the output of a sed pipeline to a variable but am unable to print its value. The command itself works fine:
uptime | sed -e 's/^.*up //' -e 's/[^0-9:].*//' | sed 's/:/*60+/g'
but when I assign it to a variable like this:
abc=uptime | sed -e 's/^.*up //' -e 's/[^0-9:].*//' | sed 's/:/*60+/g'
the variable does not hold the value.
I tried printing it like this:
echo {"$abc"}
printf "$abc"
echo "${abc}"
Kindly suggest the syntax for output.
What we actually need is the number of days of uptime on AIX servers, to build a report showing the uptime in days for each server. I need to know how to read the variable's value and embed it in a shell script.
Depends on your shell, but for most sh-ish variants:
abc=$(uptime | sed -e 's/^.*up //' -e 's/[^0-9:].*//' | sed 's/:/*60+/g')
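Because you are not storing the output of the pipeline in abc: without command substitution, abc=uptime just assigns the literal string "uptime" and nothing is run.
Try: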
abc=$(uptime | sed -e 's/^.*up //' -e 's/[^0-9:].*//' | sed 's/:/*60+/g')
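Since the stated goal is the number of days, a sketch along these lines may be closer to what the report needs. uptime_days is a hypothetical helper, and the sample line is an assumed uptime format that should be checked against real AIX output. It prints nothing when the line contains no "N day(s)" part.

```shell
#!/bin/sh
# Hypothetical helper: extract the day count from an uptime-style line on stdin.
uptime_days() {
    sed -n 's/^.*up  *\([0-9][0-9]*\) day.*/\1/p'
}

# Assumed sample line; real output differs between systems.
sample=' 10:15AM   up 42 days,  3:07,  2 users,  load average: 0.10, 0.20, 0.30'

# $( ... ) captures the standard output of the whole pipeline.
days=$(printf '%s\n' "$sample" | uptime_days)
echo "Uptime in days: ${days}"
```

On a live system the same function would be used as days=$(uptime | uptime_days).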

Removing text in unix shell

Sorry, I'm pretty new to coding. I'm just trying to remove the CST that follows the end of the string. The final output that I'm trying to get says "Sunset: 4:38 PM CST". Exclude the quotation marks.
Here is the code that I'm using within the shell.
curl http://m.wund.com/US/MN/Winona.html | grep 'Sunset' | sed -e :a -e 's/<[^>]*>//g;/</N;//ba' | sed -e 's/Sunset/Sunset: /g' | sed -e 's/PST//g'
Just change:
... | sed -e 's/PST//g'
to
... | sed -e 's/CST//g'
You might also want to invoke curl -s instead of just curl to suppress the progress meter.
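If the time zone abbreviation can vary, a slightly more general pattern is possible. This is only a sketch: strip_tz is a hypothetical helper and it assumes the abbreviation is three or four capital letters at the very end of the line, so a trailing "PM" on its own is left alone.

```shell
#!/bin/sh
# Hypothetical helper: drop a trailing 3-4 letter upper-case abbreviation
# (for example CST or PST) together with the spaces in front of it.
strip_tz() {
    sed 's/ *[[:upper:]]\{3,4\}$//'
}

printf '%s\n' 'Sunset: 4:38 PM CST' | strip_tz
```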

Add Tab Separator to Grep

I am new to grep and awk, and I would like to create tab separated values in the "frequency.txt" file output (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus - I modified it for the Khmer language). I've looked around ( grep a tab in UNIX ), but I can't seem to find an example that makes sense to me for this bash script (I'm too much of a newbie).
I am using this bash script in cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well), between the frequency and the term?
Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា
អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា
អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក
នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន
មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ
ឥត​ចេញ​ថ្លៃ​ទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង
25043 ជា
22004 បាន
20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ
Thanks for your help!
You should be able to replace the whole lengthy sed command with this:
tr -d 'a-zA-Z0-9«»:;.,()?។”“|០១២៣៤៥៦៧៨៩-'
tr '\t' ' '
Comments:
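's/​/ /g' - if the pattern between the first two slashes is really empty, it means "re-use the previous regular expression", which was [a-zA-Z], and since those characters were already deleted this is a no-op; if it actually contains an invisible character (such as the separator between the words in your sample), it replaces that character with a space
's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets, they are literal (and more than one is redundant), and a hyphen between two characters forms a range; a cleaner version would be 's/[«»:;.,()?។”“|-]//g' (leaving one pipe in case you really want to delete them, and moving the hyphen to the end so that it is taken literally)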
's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
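The bracket-expression rules mentioned in these comments can be checked on a small ASCII sample. In this sketch, strip_punct is a hypothetical helper and the input string is made up. Inside [...] the characters ( ) . ? | are all literal, and the hyphen is literal because it comes last.

```shell
#!/bin/sh
# Hypothetical helper: delete a fixed set of punctuation characters.
# No "|" separators are needed between the characters in the set.
strip_punct() {
    sed 's/[():;.,?|-]//g'
}

printf '%s\n' 'a|b,c(d)-e?f' | strip_punct
```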
You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:
sed 's/^ *\([0-9]\+\) /\1\t/'
If you want the AWK command to output a tab:
awk 'BEGIN{OFS="\t"} {print $2, $1}'
What about writing the awk output to a file with ">"?
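As a self-contained sketch of the tab-separated tally, the counting pipeline and the awk formatting step can be combined and tried on a few ASCII words. tally_tsv is a hypothetical name, and the sample input stands in for the one-word-per-line stream that the sed step produces.

```shell
#!/bin/sh
# Hypothetical helper: count identical input lines and print
# "count<TAB>word", most frequent first.
tally_tsv() {
    sort | uniq -c | sort -rn | awk -v OFS='\t' '{ print $1, $2 }'
}

# Made-up sample: "b" occurs twice, "a" once.
printf '%s\n' b a b | tally_tsv
```

To store the result instead of only showing it on screen, redirect the output, for example tally_tsv > frequency.txt.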
The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile
#!/bin/sh
sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?”“-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
END{for(item in a)printf "%s\t%d ", item, a[item]}' | \
tee ./outfile

Resources