I have a string containing duplicate words, for example:
abc, def, abc, def
How can I remove the duplicates? The string that I need is:
abc, def
We have this test file:
$ cat file
abc, def, abc, def
To remove duplicate words:
$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def
How it works
:a
This defines a label a.
s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g
This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.
ta
If the last substitution command resulted in a change, this jumps back to label a to try again.
In this way, the code keeps looking for duplicates until none remain.
s/(, )+/, /g; s/, *$//
These two substitution commands clean up any left over comma-space combinations.
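As a quick check of the word-boundary behaviour described above, here is a minimal sketch that wraps the same command in a function. The name dedupe_sed is made up for illustration, and the sketch assumes GNU sed (for \b and for ; after a label). It suggests that \b keeps abc from being treated as a duplicate inside abcd:

```shell
#!/usr/bin/env bash
# dedupe_sed is a hypothetical wrapper name; this assumes GNU sed.
dedupe_sed() {
    sed -E ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
}

# abcd is left alone because \b only allows whole-word matches.
echo 'abc, abcd, abc' | dedupe_sed
```

With GNU sed this should print abc, abcd.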
Mac OSX or other BSD System
For Mac OSX or other BSD system, try:
sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file
Using a string instead of a file
sed easily handles input either from a file, as shown above, or from a shell string as shown below:
$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef
You can use awk to do this. Note that the example relies on RT and a regular-expression RS, which are GNU awk features.
Example:
#!/bin/bash
string="abc, def, abc, def"
string=$(printf '%s\n' "$string" | awk -v RS='[,[:space:]]+' '!a[$0]++{printf "%s%s", $0, RT}')
string="${string%,*}"
echo "$string"
Output:
abc, def
This can also be done in pure Bash:
#!/bin/bash
string="abc, def, abc, def"
declare -A words
IFS=", "
for w in $string; do
words+=( [$w]="" )
done
echo "${!words[@]}"
Output
def abc
Explanation
words is an associative array (declare -A words) and every word is added as
a key to it:
words+=( [${w}]="" )
(The value is not needed, so an empty string is used.)
The list of unique words is the list of keys (${!words[@]}).
There is one caveat, though: the output is not separated by ", ", and the
order of the keys is not guaranteed to match the input order. (You will
have to iterate again. IFS is only used with ${words[*]}, and even then only
the first character of IFS is used.)
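One way to address both caveats (separator and order) is to build the output string while walking the words, using the associative array only as a "seen" set. This is a rough sketch rather than a definitive solution; the function name dedupe_words is invented here, and it assumes bash 4 or later for associative arrays:

```shell
#!/usr/bin/env bash
# dedupe_words is a hypothetical helper name; requires bash 4+ (declare -A).
dedupe_words() {
    local input=$1 w out=""
    local -A seen=()
    local -a parts=()
    # Split on commas and spaces without changing the global IFS.
    IFS=', ' read -r -a parts <<< "$input"
    for w in "${parts[@]}"; do
        # Skip empty fields and words that were already emitted.
        [[ -z $w || -n ${seen[$w]+x} ]] && continue
        seen[$w]=1
        out+="${out:+, }$w"   # add ", " only between words
    done
    printf '%s\n' "$out"
}

dedupe_words "abc, def, abc, def"
```

This prints abc, def, keeping the words in order of first appearance.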
I found another way for this case. I changed my input string as shown below and ran this command to edit it:
#string="abc def abc def"
$ echo "abc def abc def" | xargs -n1 | sort -u | xargs | sed "s# #, #g"
abc, def
Thanks for all support!
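The problem with the associative array and the xargs/sort approaches in the other examples is that the original word order is lost: sort -u sorts the words, and the keys of an associative array come back in no particular order. My solution keeps the order of first appearance and only skips words that have already been processed. The associative array map keeps this information.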
Bash function
function uniq_words() {
local string="$1"
local delimiter=", "
local words=""
declare -A map
while read -r word; do
# skip already processed words
if [ ! -z "${map[$word]}" ]; then
continue
fi
# mark the found word
map[$word]=1
# don't add a delimiter, if it is the first word
if [ -z "$words" ]; then
words=$word
continue
fi
# add a delimiter and the word
words="$words$delimiter$word"
# split the string into lines so that we don't have
# to overwrite the $IFS system field separator
done <<< "$(sed -e "s/$delimiter/\n/g" <<< "$string")"
echo "$words"
}
Example 1
uniq_words "abc, def, abc, def"
Output:
abc, def
Example 2
uniq_words "1, 2, 3, 2, 1, 0"
Output:
1, 2, 3, 0
Example with xargs and sort
In this example, the output is sorted.
echo "1 2 3 2 1 0" | xargs -n1 | sort -u | xargs | sed "s# #, #g"
Output:
0, 1, 2, 3
Related
This is what my input looks like:
string="a 1,a 2,a 3"
This is how I generate a list from the input:
sed -e 's/[^,]*/"&"/g' <<< ${string}
Above command gives me the desired output as:
"a 1","a 2","a 3"
How do I trim each element so that if the input is " a 1, a 2, a 3", my output still comes back as "a 1","a 2","a 3"?
I think it is important to understand that in bash, the double quotes have a special meaning.
string="a 1,a 2,a 3" represents the string a 1,a 2,a 3 (no quotes)
sed -e 's/[^,]*/"&"/g' <<< ${string} is equivalent to the variable out='"a 1","a 2","a 3"'
To accomplish what you want, you can do:
$ string=" a 1, a 2, a 3 "
$ echo "\"$(echo ${string//*( ),*( )/\",\"})\""
"a 1","a 2","a 3"
This is only using bash builtin operations.
replace all combinations of multiple spaces and commas by the quoted comma ${string//*( ),*( )/\",\"}
use word splitting to remove all leading and trailing blanks $(echo ...) (note: this is a bit ugly and will fail on cases like a 1 , a 2 as it will remove the double space between a and 1)
print two extra double-quotes at the beginning and end of the string.
A better way is to use a double substitution:
$ string=" a 1, a 2, a 3 "
$ foobar="\"${string//,/\",\"}\""
$ echo "${foobar//*( )\"*( )/\"}"
"a 1","a 2","a 3"
note: here we make use of KSH-globs which can be enabled with the extglob setting (shopt -s extglob)
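As an alternative sketch that trims each element separately (so spaces inside an element are kept), the string can be split on commas and each piece trimmed with the same *( ) extglob pattern. The function name quote_list is made up for this example:

```shell
#!/usr/bin/env bash
shopt -s extglob   # enables the *( ) pattern used for trimming below

# quote_list is a hypothetical helper name, not part of the answer above.
quote_list() {
    local -a items=()
    local item out=""
    # Split on commas only, so spaces stay inside each element.
    IFS=',' read -r -a items <<< "$1"
    for item in "${items[@]}"; do
        item=${item##*( )}          # strip leading spaces
        item=${item%%*( )}          # strip trailing spaces
        out+="${out:+,}\"$item\""   # wrap in double quotes, join with commas
    done
    printf '%s\n' "$out"
}

quote_list " a 1,  a 2 , a 3 "
```

This prints "a 1","a 2","a 3" for the input shown.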
Here is an answer that extends your sed command with some basic preprocessing to remove the unwanted spaces:
sed -E -e 's/ *(^|,) */\1/g;s/[^,]*/"&"/g' <<< ${string}
The -E option enables extended regular expression which saves some \.
EDIT: Since the OP asked for the output to be wrapped in double quotes, I have added that now.
echo "$string" | sed -E 's/^ +/"/;s/$/"/;s/, +/"\,"/g'
Output will be as follows.
echo "$string" | sed -E 's/^ +/"/;s/$/"/;s/, +/"\,"/g'
"a 1","a 2","a 3"
Could you please try the following and let me know if it helps?
awk '{sub(/^\" +/,"\"");gsub(/, +/,"\",\"")} 1' Input_file
If you want to save the output back into Input_file itself, append > temp_file && mv temp_file Input_file.
Second solution: using sed.
sed -E 's/^" +/"/;s/, +/"\,"/g' Input_file
Instead of complex sed pattern, you can use grep -ow option.
[nooka@lori ~]$ string1="a 1,a 2,a 3"
[nooka@lori ~]$ string2=" a 1, a 2, a 3, a 4"
[nooka@lori ~]$
[nooka@lori ~]$ echo $(echo $string1 | grep -ow "[a-zA-Z] [0-9]"|sed "s/^/\"/;s/$/\"/")|sed "s/\" /\",/g"
"a 1","a 2","a 3"
[nooka@lori ~]$ echo $(echo $string2 | grep -ow "[a-zA-Z] [0-9]"|sed "s/^/\"/;s/$/\"/")|sed "s/\" /\",/g"
"a 1","a 2","a 3","a 4"
1) use grep -ow to get only those words as per the pattern defined above. You can tweak the pattern per your needs (for ex: [a-zA-Z] [0-9][0-9]* etc) for more patterns cases.
2) Then you wrap the output (a 1 or a 2 etc) with a " using the first sed cmd.
3) Then you just put , between 2 " and you get what you wanted. This assumes your pattern always has a single space between the string and the number value.
Ok. Finally got it working.
string=" a 1 b 2 c 3 , something else , yet another one with spaces , totally working "
trimmed_string=$(awk '{gsub(/[[:space:]]*,[[:space:]]*/,",")}1' <<< $string)
echo ${trimmed_string}
a 1 b 2 c 3,something else,yet another one with spaces,totally working
string_as_list=$(sed -e 's/[^,]*/"&"/g' <<< ${trimmed_string})
echo ${string_as_list}
"a 1 b 2 c 3","something else","yet another one with spaces","totally working"
This whole thing had to be done because terraform expects list
variables to be passed like that. They must be surrounded by double quotes (" "),
delimited by comma( , ) inside square brackets([ ]).
I have a text file like the one below.
abc
def
ghi

123
456
789
expected output is
abc|def|ghi
123|456|789
I want to replace each newline with a pipe symbol (|) so that I can use the result in egrep. After an empty line, a new output line should start.
You can try this with awk:
awk -v RS= -v OFS="|" '{$1=$1}1' file
You get:
abc|def|ghi
123|456|789
Explanation
Set RS to a null/blank value to get awk to operate on sequences of blank lines.
From the POSIX specification for awk:
RS
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
$1=$1 forces awk to rebuild the record using OFS as the separator; the trailing 1 is an always-true pattern, so every record is printed.
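One thing to be aware of: with the default FS, spaces inside a line are field separators too, so a line such as abc def would come out as abc|def. A possible variation, sketched here under the assumption that only newlines should be turned into pipes, is to set FS to a newline as well. The function name join_paragraphs is made up for the example:

```shell
#!/usr/bin/env bash
# join_paragraphs is a hypothetical helper name.
# RS=    : paragraph mode, records are separated by blank lines
# FS='\n': fields are whole lines, so spaces inside a line are preserved
# OFS='|': fields are joined with a pipe when the record is rebuilt
join_paragraphs() {
    awk -v RS= -v FS='\n' -v OFS='|' '{ $1 = $1 } 1'
}

printf '%s\n' 'abc def' 'ghi' '' '123' '456' | join_paragraphs
```

For this input the result is abc def|ghi on the first line and 123|456 on the second.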
Here's one using GNU sed:
cat file | sed ':a; N; $!ba; s/\n/|/g; s/||/\n/g'
If you're using BSD sed (the flavor packaged with Mac OS X), you will need to pass in each expression separately, and use a literal newline instead of \n:
cat file | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/|/g' -e 's/||/\
/g'
If file is:
abc
def
ghi

123
456
789
You get:
abc|def|ghi
123|456|789
This replaces each newline with a |, and then each || (i.e. what was a pair of newlines in the original input) with a newline.
The caveat here is that | can't appear at the beginning or end of a line in your input; otherwise, the second sed will add newlines in the wrong places. To work around that, you can use another character that won't be in your input as an intermediate value, and then replace singletons of that character with | and pairs with \n.
EDIT
Here's an example that implements the workaround above, using the NUL character \x00 (which should be highly unlikely to appear in your input) as the intermediate character:
cat file | sed ':a;N;$!ba; s/\n/\x00/g; s/\x00\x00/\n/g; s/\x00/|/g'
Explanation:
:a;N;$!ba; puts the entire file in the pattern space, including newlines
s/\n/\x00/g; replaces all newlines with the NUL character
s/\x00\x00/\n/g; replaces all pairs of NULs with a newline
s/\x00/|/g replaces the remaining singletons of NULs with a |
BSD version:
sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/\x00/g' -e 's/\x00\x00/\
/g' -e 's/\x00/|/g'
EDIT 2
For a more direct approach (GNU sed only), provided by @ClaudiuGeorgiu:
sed -z 's/\([^\n]\)\n\([^\n]\)/\1|\2/g; s/\n\n/\n/g'
Explanation:
-z uses NUL characters as line-endings (so newlines are not given special treatment and can be matched in the regular expression)
s/\([^\n]\)\n\([^\n]\)/\1|\2/g; replaces every 3-character sequence of <non-newline><newline><non-newline> with <non-newline>|<non-newline>
s/\n\n/\n/g replaces all pairs of newlines with a single newline
In native bash:
#!/usr/bin/env bash
curr=
while IFS= read -r line; do
if [[ $line ]]; then
curr+="|$line"
else
printf '%s\n' "${curr#|}"
curr=
fi
done
[[ $curr ]] && printf '%s\n' "${curr#|}"
Tested:
$ f() { local curr= line; while IFS= read -r line; do if [[ $line ]]; then curr+="|$line"; else printf '%s\n' "${curr#|}"; curr=; fi; done; [[ $curr ]] && printf '%s\n' "${curr#|}"; }
$ f < <(printf '%s\n' 'abc' 'def' 'ghi' '' 123 456 789)
abc|def|ghi
123|456|789
Use rs. For example:
rs -C'|' 2 3 < file
rs = reshape data array. Here I'm specifying that I want 2 rows, 3 columns, and the output separator to be pipe.
I have a string like
string = ionworldionfriendsionPeople
How can I split it and store in to array based on the pattern ion as
array[0]=ionworld
array[1]=ionfriends
array[2]=ionPeople
I tried IFS, but I am unable to split it correctly. Can anyone help with this?
Edit:
I tried
test=ionworldionfriendsionPeople
IFS='ion' read -ra array <<< "$test"
Also my string may sometimes contains spaces like
string = ionwo rldionfri endsionPeo ple
You can use some POSIX parameter expansion operators to build up the array in reverse order.
foo=ionworldionfriendsionPeople
tmp="$foo"
while [[ -n $tmp ]]; do
# tail is set to the result of dropping the shortest suffix
# matching ion*
tail=${tmp%ion*}
# Drop everything from tmp matching the tail, then prepend
# the result to the array
array=("${tmp#"$tail"}" "${array[@]}")
# Repeat with the tail, until it's empty
tmp="$tail"
done
The result is
$ printf '%s\n' "${array[@]}"
ionworld
ionfriends
ionPeople
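The same parameter-expansion idea can also be run front to back, which avoids rebuilding the array on every step and copes with spaces in the input. The following is only a sketch: the function name split_keep_delim and the global result array parts are names invented for this example.

```shell
#!/usr/bin/env bash
# split_keep_delim is a hypothetical helper; it fills the global array "parts".
# Usage: split_keep_delim STRING DELIMITER
split_keep_delim() {
    local rest=$1 delim=$2 piece
    parts=()
    while [[ -n $rest ]]; do
        piece=${rest#"$delim"}        # drop a leading delimiter, if present
        piece=${piece%%"$delim"*}     # keep the text up to the next delimiter
        if [[ $rest == "$delim"* ]]; then
            parts+=("$delim$piece")
            rest=${rest#"$delim$piece"}
        else
            # Text before the first delimiter becomes its own element.
            parts+=("$piece")
            rest=${rest#"$piece"}
        fi
    done
}

split_keep_delim 'ionwo rldionfri endsionPeo ple' ion
printf '%s\n' "${parts[@]}"
```

For this input the three elements are "ionwo rld", "ionfri ends" and "ionPeo ple".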
If your input string never contains whitespace, you can use parameter expansion:
#! /bin/bash
string=ionworldionfriendsionPeople
array=(${string//ion/ })
for m in "${array[@]}" ; do
echo ion"$m"
done
If the string contains whitespace, find another character and use it:
ifs=$IFS
IFS=#
array=(${string//ion/#})
IFS=$ifs
You'll need to skip the first element in the array which will be empty, though.
Using grep -oP with lookahead regex:
s='ionworldionfriendsionPeople'
grep -oP 'ion.*?(?=ion|$)' <<< "$s"
Will give output:
ionworld
ionfriends
ionPeople
To populate an array:
arr=()
while read -r; do
arr+=("$REPLY")
done < <(grep -oP 'ion.*?(?=ion|$)' <<< "$s")
Check array content:
declare -p arr
declare -a arr='([0]="ionworld" [1]="ionfriends" [2]="ionPeople")'
If your grep doesn't support -P (PCRE) then you can use this gnu-awk:
awk -v RS='ion' 'RT{p=RT} $1!=""{print p $1}' <<< "$s"
Output:
ionworld
ionfriends
ionPeople
# To split string :
# -----------------
string=ionworldionfriendsionPeople
echo "$string" | sed -e "s/\(.\)ion/\1\nion/g"
# To set in Array:
# ----------------
string=ionworldionfriendsionPeople
array=(`echo "$string" | sed -e "s/\(.\)ion/\1 ion/g"`)
# To check array content :
# ------------------------
echo ${array[*]}
I am trying to store the whole user input in a bash variable (by appending to it),
and then sort the words, etc.
The problem is that for input such as:
sdsd fff sss
asdasds
It creates this output:
fff
sdsd
sssasdasds
Expected output is:
asdasds
fff
sdsd
sss
Code follows:
content=''
while read line
do
content+=$(echo "$line")
done
result=`echo "$content" | sed -r 's/[^a-zA-Z ]+/ /g' | tr '[:upper:]' '[:lower:]' | tr ' ' '\n' | sort -u | sed '/^$/d' | sed 's/[^[:alpha:]]/\n/g'`
echo "$result" >> "$dictionary"
You aren't providing a space when you are appending.
content+=$(echo "$line")
You need to make sure there is a space between the end of the old value and the new value.
content+=" $line"
(There's no need for echo here either, as @gniourf_gniourf correctly pointed out.)
Something that will achieve what you're showing in your example:
words_ary=()
while read -r -a line_ary; do
(( ${#line_ary[@]} )) || continue # skip empty lines
words_ary+=( "${line_ary[@],,}" ) # The ,, is to convert to lower-case
done
printf '%s\n' "${words_ary[#]}" | sort -u >> "$dictionary"
We split each input line into words at whitespace and put these words in the array line_ary.
We check that the line is not empty.
We append each word, converted to lowercase, to the array words_ary.
Finally, we sort the words from words_ary, drop duplicates, and append the result to the file $dictionary.
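If the words do not need to be held in a shell variable at all, a plain pipeline may be enough. The sketch below also drops non-letter characters, as the sed in the question tries to do. The function name unique_words is invented for the example, and LC_ALL=C is an assumption made only to get a predictable sort order:

```shell
#!/usr/bin/env bash
# unique_words is a hypothetical helper name; it reads text on stdin.
unique_words() {
    tr -cs '[:alpha:]' '\n' |        # every run of non-letters becomes one newline
        tr '[:upper:]' '[:lower:]' | # convert to lower case
        LC_ALL=C sort -u |           # sort and drop duplicates
        sed '/^$/d'                  # remove an empty line, if any
}

printf '%s\n' 'sdsd fff sss' 'asdasds' | unique_words
```

For the input from the question this prints asdasds, fff, sdsd and sss, one per line. The output can be appended to the dictionary file with >> "$dictionary".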
How can I extract all the words from a file, every word on a single line?
Example:
test.txt
This is my sample text
Output:
This
is
my
sample
text
The tr command can do this...
tr '[:blank:]' '\n' < test.txt
This asks the tr program to replace each space or tab with a newline. The character class is quoted so that the shell cannot expand it as a glob pattern.
The output goes to stdout, but it can be redirected to another file, result.txt:
tr '[:blank:]' '\n' < test.txt > result.txt
And here the obvious bash line:
for i in $(< test.txt)
do
printf '%s\n' "$i"
done
EDIT Still shorter:
printf '%s\n' $(< test.txt)
That's all there is to it, no special (pathetic) cases included (And handling multiple subsequent word separators / leading / trailing separators is by Doing The Right Thing (TM)). You can adjust the notion of a word separator using the $IFS variable, see bash manual.
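One thing the unquoted $(< test.txt) form does not protect against is filename expansion: a word such as * in the file would be replaced by the names of the files in the current directory. A possible variation, sketched here with an invented function name words_per_line, lets read -a do the splitting instead, which does not expand globs:

```shell
#!/usr/bin/env bash
# words_per_line is a hypothetical helper name; it reads text on stdin.
words_per_line() {
    local line
    local -a words
    # The || clause also handles a final line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        # read -a splits on $IFS and performs no filename expansion.
        read -r -a words <<< "$line"
        if (( ${#words[@]} )); then
            printf '%s\n' "${words[@]}"
        fi
    done
}

printf 'This  is\tmy\n\nsample *\n' | words_per_line
```

This prints This, is, my, sample and a literal *, one per line, and skips the empty line.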
The above answer doesn't handle multiple spaces and such very well. An alternative would be
perl -p -e '$_ = join("\n",split);' test.txt
which would. E.g.
esben@mosegris:~/ange/linova/build master $ echo "test test" | tr [:blank:] '\n'
test
test
But
esben@mosegris:~/ange/linova/build master $ echo "test test" | perl -p -e '$_ = join("\n",split);'
test
test
This might work for you:
# echo -e "this is\tmy\nsample text" | sed 's/\s\+/\n/g'
this
is
my
sample
text
perl answer will be :
pearl.214> cat file1
a b c d e f
pearl.215> perl -p -e 's/ /\n/g' file1
a
b
c
d
e
f
pearl.216>