How to force uniq to distinguish between em- and en-dashes?


uniq (GNU coreutils 8.5) does not seem to distinguish between em- and en-dashes:
$ echo -e "a–b\na—b" | uniq -c
2 a–b
Is there any way to force this distinction? I've tried various settings for LC_COLLATE with no luck.

Setting LC_COLLATE=C for the uniq call worked for me:
echo -e "a–b\na—b" | LC_COLLATE=C uniq -c
1 a–b
1 a—b
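If LC_COLLATE=C alone has no effect, one possible cause (an assumption, since the question does not show the environment) is that LC_ALL is set, because LC_ALL takes precedence over LC_COLLATE. A minimal sketch that sets LC_ALL for the uniq call only; the helper name uniq_bytewise is hypothetical:

```shell
# Hypothetical helper: count adjacent duplicate lines, comparing them
# byte by byte. LC_ALL overrides LC_COLLATE and LANG, so this still
# applies when the environment already exports LC_ALL.
uniq_bytewise() {
    LC_ALL=C uniq -c
}

# The en-dash (U+2013) is the bytes E2 80 93 in UTF-8 and the em-dash
# (U+2014) is E2 80 94; octal escapes keep the printf call portable.
printf 'a\342\200\223b\na\342\200\224b\n' | uniq_bytewise
```

Running locale in the same shell shows which LC_* variables are currently in effect.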

Related

Bash - Variable with variables and commands

I've been trying to create a variable that holds the values of other variables as well as the output of commands. I've found individual answers for both of those situations, but I'm struggling to put them together.
$DID and $SECTOR below are two variables that are already established.
DATAPARMS=$($DID,$SECTOR,date,uptime | sed 's/^.* up \+\(.\+\), \+[0-9] user.*$/\1/',ls -I README.txt /var/www/html | wc -l,grep "VERSION" /root/config | grep -o '".*"' | sed 's/"//g')
The individual commands (grep, sed, etc.) can be ignored.
I would then like to call this with echo:
echo "$DATAPARMS"
Any pointers on how best to accomplish this? Thank you very much!

As described in BashFAQ #50, variables should not be used to store code. Use a function instead, as follows:
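getDataParams() {
    local up count version
    # Run each command separately and capture its output.
    up=$(uptime | sed 's/^.* up \+\(.\+\), \+[0-9] user.*$/\1/')
    count=$(ls -I README.txt /var/www/html | wc -l)
    version=$(grep "VERSION" /root/config | grep -o '".*"' | sed 's/"//g')
    # Join the values with commas, as in the question.
    echo "$DID,$SECTOR,$(date),$up,$count,$version"
}
DATAPARMS=$(getDataParams)
echo "$DATAPARMS"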

Display interface + ip list nice way

I have to display the network interfaces and the IPs attached to them.
I came up with this code:
if [ -f intf ]; then
    rm -I intf
fi &&
if [ -f ipl ]; then
    rm -I ipl
fi &&
ip ntable | grep dev | sort | uniq | sed -e 's/^.*dev //;/^lo/d' >> intf &&
ip a | grep -oP "inet\s+\K[\w./]+" | grep -v 127 >> ipl &&
paste <(cat intf) <(cat ipl)
It does the job, but I think it's ugly :) It creates temporary files and is, IMHO, a total mess :)
Can anyone suggest a shorter, more efficient way to get exactly the same result?
If there are several interfaces, I'm thinking about looping, but that would make the code even bigger and probably uglier :) What would you suggest?

As a first step, you can eliminate the need for temporary files with process substitution:
paste <(ip ntable | grep dev | sort -u | sed -e 's/^.*dev //;/^lo/d') <(ip a | grep -oP "inet\s+\K[\w./]+" | grep -v 127)
Here sort -u does the same thing as sort | uniq.
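A possible alternative avoids pairing two separately generated lists. With the -o (one line per record) option, ip -4 addr show prints the interface name in the second field and the address in the fourth, so a single awk call can print both. This is a sketch under that assumption. The helper name iface_ips is hypothetical, and it reads the ip output on stdin so it can also be tried on saved output:

```shell
# Hypothetical helper: expects `ip -o -4 addr show` output on stdin and
# prints "interface<TAB>address/prefix" for every interface except lo.
iface_ips() {
    awk '$2 != "lo" { print $2 "\t" $4 }'
}

# Only call ip when it is actually installed on this machine.
if command -v ip >/dev/null 2>&1; then
    ip -o -4 addr show | iface_ips
fi
```

Because the name and the address come from the same input line, an interface with several addresses, or with none, cannot shift the two columns against each other.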

This one-liner outputs each interface name and its IP address:
ifconfig |\
grep -e 'Link' -A 1 |\
paste -d" " - - - |\
grep ' addr' |\
sed -e 's/  */ /g' -e 's/Link.*addr://' |\
cut -d" " -f1,2
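Here is an explanation of each command, in order:
ifconfig: shows the network configuration.
grep -e 'Link' -A 1: keeps each line containing Link and the line after it.
paste -d" " - - -: joins every three lines into one (the two kept lines plus the -- separator that grep prints between groups).
grep ' addr': keeps only interfaces that have an assigned address.
sed: collapses runs of spaces and removes the irrelevant information.
cut -d" " -f1,2: splits the remaining data and keeps only the interface name and IP address.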
Example output:
br-2065e5d2fc59 172.18.0.1
docker0 172.17.0.1
lo 127.0.0.1
wlp3s0

uniq -c without additional spaces

Is there an option in uniq -c (or an alternative) that doesn't add additional whitespaces around the count number? Currently I generally pipe it through sed, like so:
sort | uniq -c | sed 's/^ *\([0-9]*\) /\1 /'
But this seems kinda redundant, particularly given how frequently I have to do this.

You can try to make the sed command as short as possible with
sort | uniq -c | sed 's/^ *//'
If you have GNU grep, you can also use the -P flag:
sort | uniq -c | grep -Po '\d.*'
(Do not use awk '{$1=$1};1': it trims more than you want, because it also collapses runs of whitespace inside the counted lines.)
When you need this often, you can make a function or script calling
sort | uniq -c | sed 's/^ *//'
or only
uniq -c | sed 's/^ *//'
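Such a function could look like the following sketch. The name count_uniq is hypothetical; put it in your shell startup file if you need it often:

```shell
# Hypothetical wrapper: sort stdin, count duplicate lines, and strip the
# leading padding that uniq -c puts in front of each count.
count_uniq() {
    sort | uniq -c | sed 's/^ *//'
}

printf 'b\na\nb\n' | count_uniq
# prints:
# 1 a
# 2 b
```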

sort -R is not an option in my OS

I have a couple of systems whose sort does not support -R, which I want to use to generate a random list from a text file. For example, I am trying to use the following command:
sort -R file | head -20000 > newfile
I checked the man pages on these systems and, sure enough, the -R option is not listed.
What is an alternative that can generate a random list from a file and print to a new file?
CentOS 5

Try:
shuf file | head -n 20000 > newfile
or:
cat file | perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);'

You can use the shuf command, if it is installed.
shuf can either take a file as its input
shuf file | head -n 20000 > newfile
or read from stdin
cat file | shuf | head -n 20000 > newfile

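cat file | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- | head -20000 > newfile
This is working for me. (cut -f2- rather than cut -f2 keeps lines that themselves contain tabs intact.)
cat ALLEMAILS.txt | awk 'BEGIN{srand();}{print rand()"\t"$0}' | sort -k1 -n | cut -f2- | head -20000 | tee 20000random.txt
This variant uses tee so that the output is also shown on the terminal while it is written to the file.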
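The same decorate, sort, undecorate idea can be wrapped in a function for systems that have neither sort -R nor shuf. This is a sketch, and the name shuffle_lines is hypothetical:

```shell
# Hypothetical helper: shuffle the lines of stdin using only awk, sort and cut.
# 1. awk prefixes every line with a random key and a tab.
# 2. sort orders the lines numerically by that key.
# 3. cut drops the key again; -f2- keeps tabs that were part of the line.
shuffle_lines() {
    awk 'BEGIN { srand() } { printf "%f\t%s\n", rand(), $0 }' |
        sort -k1,1n |
        cut -f2-
}

printf '%s\n' one two three four five | shuffle_lines | head -n 3
```

Note that srand() without an argument seeds from the time of day, so two runs within the same second produce the same order.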

How do I obtain only digits from a string?

Suppose I have a string like this:
blah=-Xms512m
I want the output as 512.
I know I can get it using grep on Linux like this:
echo $blah | grep -o -e [0-9]\\+
But this doesn't work on Solaris.
Are there any nice solutions that are compatible with both Linux and Solaris?
Or at least with Solaris?

If you know the numbers will be together like that:
pax> echo 'blah=-Xms512m' | sed 's/[^0-9]//g'
512
It basically replaces all non-numeric characters with nothing. Of course, it won't do sensible stuff with:
pax> echo 'blah77=-Xms512m' | sed 's/[^0-9]//g'
77512
but, if you've only got one number it will work fine.
If you just need the first number, you can use:
pax> echo 'blah77=-Xms512m' | sed -e 's/^[^0-9]*//' -e 's/[^0-9].*$//'
77
For the last:
pax> echo 'blah77=-Xms512m' | sed -e 's/[^0-9]*$//' -e 's/^.*[^0-9]//'
512

If you want to be completely brute-force, try using tr:
echo "blah=-Xms512m" | tr -c -d '[0-9]'
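Another option that should be portable, since expr and its : match operator are part of POSIX, is to let expr extract the first run of digits. This is a sketch rather than something tested on Solaris, and the function name first_number is hypothetical:

```shell
# Hypothetical helper: print the first run of digits found in the argument.
# The leading "x" on both sides keeps expr from mistaking a string that
# starts with "-" or that looks like an operator for part of its syntax.
first_number() {
    expr "x$1" : 'x[^0-9]*\([0-9]*\)'
}

blah=-Xms512m
first_number "$blah"
# prints 512
```

Note that expr returns a non-zero exit status when the result is empty or 0, which matters in scripts that run under set -e.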
