I'm puzzled here about awk, sed, etc - bash

I've been trying to work this out for a while, with no success so far
I have a command output that I need to chew to make it suitable for further processing
The text I have is:
1/2 [3] (27/03/2012 19:32:54) word word word word 4/5
What I need is to extract only the numbers 1/2 [3] 4/5 so it will look:
1 2 3 4 5
So, basically I was trying to exclude all characters that are not digits, like "/", "[", "]", etc.
I tried awk with FS, tried using regexp, but none of my tries were successful.
I would then add something to it like
first:1 second:2 third:3 .... etc
Please keep in mind I'm talking about a file that contains a lot of lines with the same structure, but I already thought about using awk to sum every column with
awk '{sum1+=$1 ; sum2+=$2 ;......etc} END {print "first:"sum1 " second:"sum2.....etc}'
But first I will need to extract only the relevant numbers,
The date that is in between "( )" can be omitted completely but they are numbers too, so filtering merely by digits won't be enough as it will match them too
Hope you can help me out
Thanks in advance!

This: sed -r 's/[(][^)]*[)]/ /g; s/[^0-9]+/ /g' should work. It makes two passes, removing parenthesized expressions first and then replacing all runs of non-digits with single spaces.
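A minimal sketch of that two-pass command wrapped in a function, assuming GNU sed (-E is used here as an equivalent spelling of -r). The name extract_numbers is made up for illustration:

```shell
# extract_numbers: hypothetical helper name, not part of the original answer.
# Pass 1 replaces anything in parentheses with a space,
# pass 2 squeezes every run of non-digits into a single space.
extract_numbers() {
  sed -E 's/[(][^)]*[)]/ /g; s/[^0-9]+/ /g'
}

echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' | extract_numbers
```

On the sample line this prints 1 2 3 4 5, which awk can then sum column by column.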

You can do something like sed -e 's/(.*)//' -e 's/[^0-9]/ /g'. It deletes everything inside the round brackets, then substitutes all non-digit characters with a space. To get rid of extra spaces you can feed it to column -t:
$ echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' | sed -e 's/(.*)//' -e 's/[^0-9]/ /g' | column -t
1 2 3 4 5

TXR:
@(collect)
@one/@two [@three] (@date @time) @(skip :greedy) @four/@five
@(filter :tonumber one two three four five)
@(end)
@(bind (first second third fourth fifth)
       @(mapcar (op apply +) (list one two three four five)))
@(output)
first:@first second:@second third:@third fourth:@fourth fifth:@fifth
@(end)
data:
1/2 [3] (27/03/2012 19:32:54) word word word word 4/5
10/20 [30] (27/03/2012 19:32:54) word word 40/50
run:
$ txr data.txr data.txt
first:11 second:22 third:33 fourth:44 fifth:55
Easy to add some error checking:
@(collect)
@  (cases)
@one/@two [@three] (@date @time) @(skip :greedy) @four/@five
@  (or)
@line
@    (throw error `badly formatted line: @line`)
@  (end)
@  (filter :tonumber one two three four five)
@(end)
@(bind (first second third fourth fifth)
       @(mapcar (op apply +) (list one two three four five)))
@(output)
first:@first second:@second third:@third fourth:@fourth fifth:@fifth
@(end)
$ txr data.txr -
foo bar junk
txr: unhandled exception of type error:
txr: ("badly formatted line: foo bar junk")
Aborted
TXR is for robust programming. There is strong typing, so you can't treat strings as numbers just because they contain digits. Variables have to be bound before use, and so misspelled variables do not silently default to zero or blank, but rather produce an unbound variable <name> in <file>:<line> type error. Text extraction is performed with lots of specific context to guard against misinterpreting input in one format as being in another format.

see below, if it is what you want:
kent$ echo "1/2 [3] (27/03/2012 19:32:54) word word word word 4/5"|sed -r 's/\([^)]*\)//g; s/[^0-9]/ /g'
1 2 3 4 5
if you want it to look better:
kent$ echo "1/2 [3] (27/03/2012 19:32:54) word word word word 4/5"|sed -r 's/\([^)]*\)//g; s/[^0-9]/ /g;s/ +/ /g'
1 2 3 4 5

awk '{ first+=gensub("^([0-9]+)/.*","\\1","g",$0)
second+=gensub("^[0-9]+/([0-9]+) .*","\\1","g",$0)
third+=gensub("^[0-9]+/[0-9]+ \\[([0-9]+)\\].*","\\1","g",$0)
fourth+=gensub("^.* ([0-9]+)/[0-9]+ *$","\\1","g",$0)
fifth+=gensub("^.* [0-9]+/([0-9]+) *$","\\1","g",$0)
}
END { print "first: " first " second: " second " third: " third " fourth: " fourth " fifth: " fifth
}' file
Might work for you.

This will give you the digits, excluding the text in parentheses:
digits=$(echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' |\
sed 's/(.*)//' | grep -o '[0-9][0-9]*')
echo $digits
or pure sed solution:
echo '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' |\
sed -e 's/(.*)//' -e 's/[^0-9]/ /g' -e 's/[ \t][ \t]*/ /g'
OUTPUT:
1 2 3 4 5

One pass with awk is sufficient if you set a fancy field separator: any one of slash, space, open bracket or close bracket separates fields. The space followed by "[" leaves an empty $3, which is why the third number is read from $4:
awk -F '[][/ ]' '
{s1+=$1; s2+=$2; s3+=$4; s4+=$(NF-1); s5+=$NF}
END {printf("first:%d second:%d third:%d fourth:%d fifth:%d\n", s1, s2, s3, s4, s5)}
'
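As a quick check, here is the same awk program run against the two sample lines used in the TXR answer. The function name sum_columns is only an illustrative wrapper:

```shell
# sum_columns: hypothetical wrapper around the awk program above.
# Fields are split on ']', '[', '/' or a space; $3 is empty, so the
# third number is $4, and the last two numbers are $(NF-1) and $NF.
sum_columns() {
  awk -F '[][/ ]' '
    {s1+=$1; s2+=$2; s3+=$4; s4+=$(NF-1); s5+=$NF}
    END {printf("first:%d second:%d third:%d fourth:%d fifth:%d\n", s1, s2, s3, s4, s5)}
  '
}

printf '%s\n' \
  '1/2 [3] (27/03/2012 19:32:54) word word word word 4/5' \
  '10/20 [30] (27/03/2012 19:32:54) word word 40/50' | sum_columns
```

This prints first:11 second:22 third:33 fourth:44 fifth:55. It relies on every line ending with the fourth/fifth pair, with no trailing space.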

Related

Sed to add color to column for a specific pattern?

I figured out how to colorize column 3 in green like this:
green=$'\033[1;32m';off=$'\e[m';echo -e "num co1umn1 column2 column3\n=== === === ===\n1 this is me\n2 that is you"|column -t|sed "s/[^[:blank:]]\{1,\}/$green&$off/3";unset green off
CLI result
How do I need to alter my sed command to colorize the pattern 'is' only within column 3 so that the output becomes:
Wanted result
If you want to color the whole word is, you can use (with GNU sed):
sed "s/\bis\b/$green&$off/"
sed "s/\<is\>/$green&$off/"
Here, \b is a word boundary, \< is a leading word boundary and \> is a trailing word boundary.
Else, you can tell sed to start looking for matches from the third line:
sed "3,$ s/[^[:blank:]]\{1,\}/$green&$off/3"
One way to do this is to ignore the first two lines of the output in sed:
sed "1,2 ! s/[^[:blank:]]\{1,\}/$green&$off/3";
Using sed
$ ... | sed "/^[[:digit:]]/s/\(\([^ ]* \+\)\{2\}\)\([^ ]*\)/\1$green\3$off/"
Modifying the echo to cover a couple other instances of is ...
$ echo -e "num co1umn1 column2 column3\n=== === === ===\n1 is is me\n2 that isn't you" | column -t
num co1umn1 column2 column3
=== === === ===
1 is is me # only colorize the 2nd occurrence of "is"
2 that isn't you # don't colorize "isn't" in 3rd column
Extending OP's current sed solution:
sed -r "3,$ s/[^[:blank:]]{1,}/XXX&XXX/3; s/XXXisXXX/${green}is${off}/; s/XXX//g"
Where:
3,$ - apply following sed scripts against line numbers 3-to-EOF (ie, skip 1st 2 lines)
first we offset the 3rd column values with XXX bookends (choose a set of characters that you know won't show up anywhere in the data)
then colorize XXXisXXX (removing the XXXs at the same time)
then remove any remaining XXX (from 3rd column in other rows)
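A small sketch of the bookend technique that can be checked without a colour terminal. The markers << and >> stand in for $green and $off, the function name is made up, and -E is used in place of -r (GNU sed assumed):

```shell
# color_is_col3: hypothetical helper; '<<' and '>>' replace the ANSI codes
# so that the result is visible in plain text.
color_is_col3() {
  green='<<'
  off='>>'
  # 1) from line 3 on, wrap the 3rd column in XXX bookends
  # 2) turn XXXisXXX into the highlighted word
  # 3) drop the bookends left on rows whose 3rd column is not "is"
  sed -E "3,\$ s/[^[:blank:]]+/XXX&XXX/3; s/XXXisXXX/${green}is${off}/; s/XXX//g"
}

printf '%s\n' 'num co1umn1 column2 column3' '=== === === ===' \
  '1 is is me' "2 that isn't you" | color_is_col3
```

The third line comes out as 1 is <<is>> me, and the isn't row is left untouched.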

How to convert a line into camel case?

This picks all the text on a single line after a pattern match and converts it to camel case, using non-alphanumeric characters as separators and removing the spaces at the beginning and end of the resulting string. Two questions: (1) it doesn't replace when there are 2 consecutive non-alphanumeric chars, e.g. "2, " in the example below; (2) is there a way to do everything with a single sed command instead of using grep, cut, sed and tr?
$ echo " hello
world
title: this is-the_test string with number 2, to-test CAMEL String
end! " | grep -o 'title:.*' | cut -f2 -d: | sed -r 's/([^[:alnum:]])([0-9a-zA-Z])/\U\2/g' | tr -d ' '
ThisIsTheTestStringWithNumber2,ToTestCAMELString
To answer your first question, change [^[:alnum:]] to [^[:alnum:]]+ to match one or more non-alnum chars.
You may combine all the commands into a GNU sed solution like
sed -En '/.*title: *(.*[[:alnum:]]).*/{s//\1/;s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/gp}'
See the online demo
Details
-En - POSIX ERE syntax is on (E) and default line output is suppressed with n
/.*title: *(.*[[:alnum:]]).*/ - matches a line having title: capturing all after it up to the last alnum char into Group 1 and matching the rest of the line
{s//\1/;s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/gp} - if the line is matched,
s//\1/ - remove all but Group 1 pattern (received above)
s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/ - match and capture start of string or 1+ non-alnum chars into Group 1 (with ([^[:alnum:]]+|^)) and then capture an alnum char into Group 2 (with ([0-9a-zA-Z])) and replace with uppercased Group 2 contents (with \U\2).
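To try the combined command on the question's input, it can be wrapped in a function. The name camel_title is an assumption for illustration, and \U requires GNU sed:

```shell
# camel_title: hypothetical wrapper around the GNU sed one-liner above.
# Only lines containing "title:" are printed; everything after it is
# converted to camel case.
camel_title() {
  sed -En '/.*title: *(.*[[:alnum:]]).*/{s//\1/;s/([^[:alnum:]]+|^)([0-9a-zA-Z])/\U\2/gp}'
}

printf '%s\n' ' hello' 'world' \
  'title: this is-the_test string with number 2, to-test CAMEL String' \
  'end! ' | camel_title
```

This prints ThisIsTheTestStringWithNumber2ToTestCAMELString. The comma after 2 is gone because runs of non-alphanumeric characters are now matched as a whole.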

How to remove all whitespace in front of and behind 3 consecutive periods

I'm trying to remove all white space before and after 3 consecutive periods and replace it with the actual ellipsis symbol.
I've tried the following code:
sed 's/[[:space:]]*\.\.\.[[:space:]]*/…/g'
It replaces the 3 periods with the ellipsis symbol, but the spaces before and after remain.
Sample Input.
hello ... world
Desired output
hello…world
Bracket expressions with character classes such as [[:space:]] are valid in both basic (BRE) and extended (ERE) regular expressions, so your command should already strip the surrounding whitespace. If it does not on your system, try the ERE form with the -E option:
sed -E 's/[[:space:]]*\.\.\.[[:space:]]*/.../g' Input_file
Without -E try:
sed 's/ *\.\.\. */.../g' Input_file
Here is another sed
echo "hello ... world" | sed -E 's/ +(\.\.\.) +/\1/g'
hello...world
4 dots, do nothing?
echo "hello .... world" | sed -E 's/ +(\.\.\.) +/\1/g'
hello .... world
In bash, just use parameter substitution (the +( ) pattern needs the extglob shell option):
shopt -s extglob
foo="hello ... world"
foo="${foo//+( )...+( )/...}"
Now, echo "$foo", outputs:
hello...world
The syntax for Bash pattern substitution in a variable is as follows (the patterns are globs, not regexes):
${var-name/search/replace}
A single / replaces only the first occurrence from the left, while a double // replaces every occurrence.
With extglob enabled, one of ?*+@! followed by (pattern-list) matches a specified number of occurrences of the patterns in pattern-list as follows:
? Zero or one occurrence
* Zero or more occurrences
+ One or more occurrences
@ Exactly one occurrence
! Anything that *doesn't* match one of the patterns
The pattern list can be any combination of literal strings or character classes, separated by the pipe character |
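A minimal bash sketch of this substitution that inserts the ellipsis symbol the question asks for. The function name squeeze_ellipsis is hypothetical, and extglob must be switched on before the function is called:

```shell
# Requires bash. extglob enables the +( ), *( ), ?( ), @( ) and !( ) operators.
shopt -s extglob

# squeeze_ellipsis: hypothetical helper; replaces "spaces...spaces"
# with a single ellipsis character everywhere in its argument.
squeeze_ellipsis() {
  s=$1
  printf '%s\n' "${s//+( )...+( )/…}"
}

squeeze_ellipsis 'hello   ...   world'
```

This prints hello…world. Because +( ) needs at least one space on each side, a bare ... with no surrounding spaces is left as it is; *( ) would cover that case too.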

Replacing one space with two spaces in Unix

I am trying to replace every single space with two spaces in Unix. We are just reading from standard input and writing to standard output. I also have to avoid using awk and perl. For example, if I read in something like San Diego it should print San  Diego. If there are already multiple spaces, it should just leave them alone.
How about bash only? First test file:
$ cat file
1
2 3
4 5
San Diego NO
Then:
$ cat file |
while IFS= read -r line
do
  # keep doubling until no lone space is left on the line
  while [[ "$line" =~ (^|.+[^ ])\ ([^ ].*) ]]
  do
    line="${BASH_REMATCH[1]}  ${BASH_REMATCH[2]}"
  done
  echo "$line"
done
1
2 3
4 5
San Diego NO
You have to be a bit careful here not to forget spaces at the beginning or end.
I present three solutions for educational purpose:
sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g' # solution 1
sed 's/\( \+\)/ \1/g;s/ \(  \+\)/\1/g' # solution 2
sed 's/ \( \+\)/\1/g;s/\( \+\)/ \1/g' # solution 3
All three solutions make use of subexpressions:
9.3.6 BREs Matching Multiple Characters
A subexpression can be defined within a BRE by enclosing it between
the character pairs \( and \). Such a subexpression shall match
whatever it would have matched without the \( and \), except that
anchoring within subexpressions is optional behavior; see BRE
Expression Anchoring. Subexpressions can be arbitrarily nested.
The back-reference expression '\n' shall match the same (possibly
empty) string of characters as was matched by a subexpression enclosed
between "\(" and "\)" preceding the '\n'. The character n shall be a
digit from 1 through 9, specifying the nth subexpression (the one that
begins with the nth \( from the beginning of the pattern and ends
with the corresponding paired \) ). The expression is invalid if
less than n subexpressions precede the \n. For example, the
expression "\(.*\)\1$" matches a line consisting of two adjacent
appearances of the same string, and the expression "\(a\)*\1" fails to
match a. When the referenced subexpression matched more than one
string, the back-referenced expression shall refer to the last matched
string. If the subexpression referenced by the back-reference matches
more than one string because of an asterisk (*) or an interval
expression (see item (5)), the back-reference shall match the last
(rightmost) of these strings.
Solution 1: sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g'
Here there are two subexpressions. The first subexpression \(^\|[^ ]\) matches the beginning of the line (^) or (\|) a non-space character ([^ ]). The second subexpression \($\|[^ ]\) is similar but with the end-of-line ($).
Solution 2: sed 's/\( \+\)/ \1/g;s/ \(  \+\)/\1/g'
This replaces one or more spaces by the same amount of spaces and an extra one. Afterwards we correct the ones with 3 spaces or more by removing a single space from those.
Solution 3: sed 's/ \( \+\)/\1/g;s/\( \+\)/ \1/g'
This does the same thing as solution 2 but inverts the logic. First remove a space from all sequences that have more than one space, and afterwards add a space. This one-liner is just one character shorter than solution 2.
Example: based on solution 1
The following commands are nothing more than echo "string" | sed ..., but to show the spaces, wrapped into a printf statement.
# default string
$ printf "|%s|" " foo bar car "
| foo bar car |
# spaces replaced
$ printf "|%s|" "$(echo " foo bar car " | sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g')"
|  foo  bar  car  |
# 3 spaces in front and back
$ printf "|%s|" "$(echo "   foo bar car   " | sed 's/\(^\|[^ ]\) \($\|[^ ]\)/\1  \2/g')"
|   foo  bar  car   |
note: If you want to replace any form of blanks (spaces and tabs in any encoding) by the same doubled blank, you could use :
sed 's/\(^\|[^[:blank:]]\)\([[:blank:]]\)\($\|[^[:blank:]]\)/\1\2\2\3/g'
sed 's/\(^\|[[:graph:]]\)\([[:blank:]]\)\($\|[[:graph:]]\)/\1\2\2\3/g'
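A short sketch for checking the edge cases (leading, trailing and already-doubled spaces) with solution 3. The function name double_single_spaces is made up for illustration, and \+ in a basic regular expression is a GNU sed extension:

```shell
# double_single_spaces: hypothetical wrapper around solution 3.
# Step 1 removes one space from every run of two or more spaces,
# step 2 adds one space to every run, so only lone spaces grow.
double_single_spaces() {
  sed 's/ \( \+\)/\1/g;s/\( \+\)/ \1/g'
}

# Input has runs of 1, 1, 2, 3 and 1 spaces; the bars make the edges visible.
printf '|%s|\n' "$(printf '%s\n' ' a b  c   d ' | double_single_spaces)"
```

This prints |  a  b  c   d  |, so the lone spaces are doubled and the runs of two and three are unchanged.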
Something along the lines of
cat input.txt | sed 's,\([[:alnum:]]\) \([[:alnum:]]\),\1  \2,g'
should work for that purpose.
Replace only an occurrence of 1 space between 2 chars that are not white space with 2 spaces:
sed 's/\([^ ]\) \([^ ]\)/\1  \2/g' file
1) [^ ] - a non-space char
2) \1  \2 - the first parenthesized expression, 2 spaces, the second parenthesized expression
3) sed used with s///g replaces every match of the regex in the first // with the value in the second //

Reverse four length of letters with sed in unix

How can I reverse every four-character word with sed?
For example:
the year was 1815.
Reverse to:
the raey was 5181.
This is my attempt:
cat filename | sed 's/\([a-z]*\) *\([a-z]*\)/\2, \1/'
But it does not work as I intended.
Not sure it is possible to do it with GNU sed for all cases. If _ doesn't occur immediately before/after four-letter words, you can use
sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
\b is word boundary, word definition being any alphabet or digit or underscore character. So \b will ensure to match only whole words not part of words
$ echo 'the year was 1815.' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
the raey was 5181.
$ echo 'two time five three six good' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
two emit evif three six doog
$ # but won't work if there are underscores around the words
$ echo '_good food' | sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
_good doof
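The word-boundary command can be wrapped in a function for reuse. The name reverse_four is hypothetical, and the \b and i flag usage assumes GNU sed:

```shell
# reverse_four: hypothetical wrapper around the GNU sed command above.
# Each of the four characters is captured separately and emitted in
# reverse order; \b keeps longer or shorter words untouched.
reverse_four() {
  sed -E 's/\b([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])\b/\4\3\2\1/gi'
}

echo 'the year was 1815.' | reverse_four
```

This prints the raey was 5181. as in the examples above.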
A tool with lookaround support would work for all cases
$ echo '_good food' | perl -pe 's/(?<![a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])([a-z0-9])(?![a-z0-9])/$4$3$2$1/gi'
_doog doof
(?<![a-z0-9]) and (?![a-z0-9]) are negative lookbehind and negative lookahead respectively
Can be shortened to
perl -pe 's/(?<![a-z0-9])[a-z0-9]{4}(?![a-z0-9])/reverse $&/gie'
which uses the e modifier to place Perl code in substitution section. This form is suitable to easily change length of words to be reversed
Possibly the shortest sed solution, which also works when the four-character word contains underscores:
sed -r 's/\<(.)(.)(.)(.)\>/\4\3\2\1/g'
The following awk may help. It was tested in GNU awk, and only with the provided sample input:
echo "the year was 1815." |
awk '
function reverse(val){
num=split(val, array,"");
i=array[num]=="."?num-1:num;
for(;i>q;i--){
var=var?var array[i]:array[i]
};
printf (array[num]=="."?var".":var);
var=""
}
{
for(j=1;j<=NF;j++){
printf("%s%s",j==NF||j==2?reverse($j):$j,j==NF?RS:FS)
}}'
This might work for you (GNU sed):
sed -r '/\<\w{4}\>/!b;s//\n&\n/g;s/^[^\n]/\n&/;:a;/\n\n/!s/(.*\n)([^\n])(.*\n)/\2\1\3/;ta;s/^([^\n]*)(.*)\n\n/\2\1/;ta;s/\n//' file
If there are no strings of the length required to reverse, bail out.
Prepend and append newlines to all required strings.
Insert a newline at the start of the pattern space (PS). The PS is divided into two parts, the first line will contain the current word being reversed. The remainder will contain the original line.
Each character of the word to be reversed is inserted at the front of the first line and removed from the original line. When all the characters in the word have been processed, the original word will have gone and only the bordering newlines will exist. These double newlines are then replaced by the word in the first line and the process is repeated until all words have been processed. Finally the newline introduced to separate the working line and the original is removed and the PS is printed.
N.B. This method can reverse strings of other lengths by changing the first regexp. Strings within a range of lengths can also be reversed, e.g. /\<\w{2,4}\>/ will change all words between 2 and 4 characters long.
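It's a recurring problem, so there is a standard utility called "rev" (an external program, not a bash builtin). Note that both commands below reverse every word, not only the four-character ones.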
echo "$(echo the | rev) $(echo year | rev) $(echo was | rev) $(echo 1815 | rev)".
OR
echo "the year was 1815." | rev | tr ' ' '\n' | tac | tr '\n' ' '

Resources