Bash Cut diamond question mark symbol � - bash

I am trying to display the 2nd and 7th character from each line of text.
while read line
do
x=`echo $line | cut -c2,7`
echo $x
done
Sample Input:
C.B - Cantonment Board/Cantonment
C.M.C – City Municipal Council
C.T – Census Town
E.O – Estate Office
Expected Output:
.C
.â
.“
.“
My output:
.C
.�
.�
.�
Anyone knows why this happens?

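cut does not really support Unicode: GNU cut's -c currently selects bytes rather than multibyte characters, so it cuts through the 3-byte UTF-8 encoding of – and emits an invalid fragment, which the terminal shows as �. You might want to use Perl instead (adapted from this Unix & Linux post):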
perl -CIO -ne 'print substr($_, 1, 1) . substr($_, 6, 1) . "\n"'
For example:
$ perl -CIO -ne 'print substr($_, 1, 1) . substr($_, 6, 1) . "\n"' < foo
.C
.â
.“
.“
-CIO tells perl that both input and output are in Unicode. substr(var, m, n) extracts the substring of length n beginning at index m (starting from 0). So the second character is the substring of length 1 at index 1. $_ is the variable holding the current input line.

You can use bash's substring parameter expansion.
while read line; do
x=${line:1:1}${line:6:1} # 0-based counting
echo "$x"
done <<EOF
C.B - Cantonment Board/Cantonment
C.M.C – City Municipal Council
C.T – Census Town
E.O – Estate Office
EOF
The form ${var:offset:length} returns length characters starting at position offset in the value of var. Strings are 0-indexed, like arrays.
(I am not sure, though, if bash always handles utf-8 correctly, or if it depends on how it was compiled.)
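A minimal sketch that wraps the expansion above in a function reading standard input. The name pick_chars is hypothetical, and the sketch assumes bash is running in a UTF-8 locale when the input contains multibyte characters, since offsets are counted in characters of the current locale.

```bash
#!/bin/bash
# pick_chars is a hypothetical helper name, not part of the answer above.
# Prints the 2nd and 7th characters of each line read from standard input.
pick_chars() {
    local line
    # IFS= and -r keep leading blanks and backslashes intact;
    # the || test also handles a last line without a trailing newline.
    while IFS= read -r line || [[ -n $line ]]; do
        printf '%s\n' "${line:1:1}${line:6:1}"   # offsets are 0-based
    done
}

pick_chars <<'EOF'
C.B - Cantonment Board/Cantonment
EOF
```

Using printf instead of echo avoids surprises when the extracted characters happen to look like an echo option such as -n.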


How can I find the missing integers in a unique and sequential list (one per line) in a unix terminal?

Suppose I have a file as follows (a sorted, unique list of integers, one per line):
1
3
4
5
8
9
10
I would like the following output (i.e. the missing integers in the list):
2
6
7
How can I accomplish this within a bash terminal (using awk or a similar solution, preferably a one-liner)?
Using awk you can do this:
awk '{for(i=p+1; i<$1; i++) print i} {p=$1}' file
2
6
7
Explanation:
{p = $1}: Variable p contains value from previous record
{for ...}: We loop from p+1 to the current row's value (excluding current value) and print each value which is basically the missing values
Using seq and grep:
seq $(head -n1 file) $(tail -n1 file) | grep -vwFf file -
seq creates the full sequence, grep removes from it the lines that exist in the file.
perl -nE 'say for $a+1 .. $_-1; $a=$_'
Calling no external program (if filein contains the list of numbers):
#!/bin/bash
i=0
while read num; do
while (( ++i<num )); do
echo $i
done
done <filein
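The loop above starts counting at 0, so a list that begins at 5 would also report 1 to 4. A possible variant, sketched under the assumption that the input is sorted, unique and free of leading zeros (bash arithmetic would read those as octal), only reports integers between the first and the last value. The name missing_ints is hypothetical.

```bash
#!/bin/bash
# missing_ints is a hypothetical helper name.
# Reads sorted, unique integers (one per line) from standard input and
# prints every integer missing between the first and the last value.
missing_ints() {
    local num i prev=
    while read -r num; do
        # Skip the gap check for the very first number: nothing precedes it.
        if [[ -n $prev ]]; then
            for (( i = prev + 1; i < num; i++ )); do
                printf '%d\n' "$i"
            done
        fi
        prev=$num
    done
}

missing_ints <<'EOF'
1
3
4
5
8
9
10
EOF
```

For the sample list this prints 2, 6 and 7, one per line.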
To adapt choroba's clever answer for my own use case, I needed my sequence to deal with zero-padded numbers.
The -w switch to seq is the magic here - it automatically pads the first number with the necessary number of zeroes to keep it aligned with the second number:
-w, --equal-width equalize width by padding with leading zeroes
My integers go from 0 to 9999, so I used the following:
seq -w 0 9999 | grep -vwFf "file.txt"
...which finds the missing integers in a sequence from 0000 to 9999. Or to put it back into the more universal solution in choroba's answer:
seq -w $(head -n1 "file.txt") $(tail -n1 "file.txt") | grep -vwFf "file.txt"
I didn't personally find that the - in his answer was necessary, but there may be use cases which make it so.
Using Raku (formerly known as Perl_6)
raku -e 'my @a = lines.map: *.Int; say @a.Set (^) @a.minmax.Set;'
Sample Input:
1
3
4
5
8
9
10
Sample Output:
Set(2 6 7)
I'm sure there's a Raku solution similar to @JJoao's clever Perl5 answer, but in thinking about this problem my mind naturally turned to Set operations.
The code above reads lines into the @a array, mapping each line so that elements in the @a array are Ints, not strings. In the second statement, @a.Set converts the array to a Set on the left-hand side of the (^) operator. Also in the second statement, @a.minmax.Set converts the array to a second Set, on the right-hand side of the (^) operator, but this time because the minmax operator is used, all Int elements from the min to max are included. Finally, the (^) symbol is the symmetric set-difference (infix) operator, which finds the difference.
To get an unordered whitespace-separated list of missing integers, replace the above say with put. To get a sequentially-ordered list of missing integers, add the explicit sort below:
~$ raku -e 'my @a = lines.map: *.Int; .put for (@a.Set (^) @a.minmax.Set).sort.map: *.key;' file
2
6
7
The advantage of all Raku code above is that finding "missing integers" doesn't require a "sequential list" as input, nor is the input required to be unique. So hopefully this code will be useful for a wide variety of problems in addition to the explicit problem stated in the Question.
OTOH, Raku is a Perl-family language, so TMTOWTDI. Below, a @a.minmax array is created, and grepped so that none of the elements of @a are returned (none junction):
~$ raku -e 'my @a = lines.map: *.Int; .put for @a.minmax.grep: none @a;' file
2
6
7
https://docs.raku.org/language/setbagmix
https://docs.raku.org/type/Junction
https://raku.org

increment a letter sequence to represent a whole number where a=0 and z=25

I have tried several different search terms but have not found exactly what I want. I am sure there is already an answer for this, so please point me to it if so.
I would like to understand how to increment a letter code given a standard number convention in a bash script.
Starting with AAAA=0, or with leading zeros AAAA=000000 (26x26x26x26), I would like to increment the value by one each time, so aaab=000001, aaac=000002, aaba=000026, aaca=000052, etc.
Thanks Art!
I guess this is what you want
echo {a..z}{a..z}{a..z}{a..z} | tr ' ' '\n' | nl
The output will be too long, so perhaps test with this first:
echo {a..z}{a..z} | tr ' ' '\n' | nl
if you don't need the line numbers remove last pipe and nl
If you need the output in xxxx=nnnnnn format, you can use awk
echo {a..z}{a..z}{a..z}{a..z} | tr ' ' '\n' | awk '{printf "%s=%06d\n", $0, NR-1}'
aaaa=000000
aaab=000001
aaac=000002
aaad=000003
aaae=000004
aaaf=000005
aaag=000006
aaah=000007
aaai=000008
aaaj=000009
...
zzzv=456971
zzzw=456972
zzzx=456973
zzzy=456974
zzzz=456975
Fast
If you are aiming for speed and simplicity:
#!/bin/bash
i=0
for text in {a..z}{a..z}{a..z}{a..z}; do
printf '%06d %5.5s\n' "$i" "$text"
(( i++ ))
done
Precise
Aiming at a function that converts any number to the character string:
We must understand that what you are describing is a number written in base 26, using the character a as 0, b as 1, c as 2, etc.
Thus, aaaa means 0000, aaab means 0001, aaac means 0002, .... aaaz means 0025
and aaba means 0026, aaca means 0052.
bc could do the base conversion directly (as numbers):
$ echo 'obase=26; 199'|bc
07 17
Counting from zero, letter 7 is h (a0, b1, c2, d3, e4, f5, g6, h7),
and letter 17 is r.
If we set the variable list to: list=$(printf '%s' {a..z}) or list=abcdefghijklmnopqrstuvwxyz
We could get each letter from the number with: ${list:7:1} and ${list:17:1}
$ echo "${list:7:1} and ${list:17:1}"
h and r
$ printf '%s' "${list:7:1}" "${list:17:1}" # Using printf:
hr
Script
All together, inside a script:
#!/bin/bash
list=$(printf '%s' {a..z})
getletters(){
local numbers
numbers="$(echo "obase=26; $1"|bc)"
for number in $numbers; do
printf '%s' "${list:10#$number:1}";
done;
echo;
}
count=2
limit=$(( 26**$count - 1 ))
for (( i=0; i<=$limit; i++)); do
printf '%06d %-5.5s\n' "$i" "$(getletters "$i")"
done
Please change count from 2 to 4 to get the whole list. Be aware that such a list has 456,976 lines (the limit is 456,975) and will take some time to generate.
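The script above calls bc once per number, which is where most of the time goes. As a rough sketch of the same base-26 idea using only bash arithmetic, the functions below convert in both directions. The names to_letters and from_letters, the default width of 4 and the lowercase-only input are assumptions made for this illustration.

```bash
#!/bin/bash
# to_letters and from_letters are hypothetical helper names.
letters=abcdefghijklmnopqrstuvwxyz

# to_letters NUMBER [WIDTH]: print NUMBER in base 26 with a=0 .. z=25,
# left-padded with "a" (the zero digit) to WIDTH letters (default 4).
to_letters() {
    local n=$1 width=${2:-4} out=
    while (( n > 0 )); do
        out=${letters:$(( n % 26 )):1}$out   # lowest digit first, prepended
        n=$(( n / 26 ))
    done
    while (( ${#out} < width )); do
        out=a$out
    done
    printf '%s\n' "$out"
}

# from_letters STRING: inverse conversion, assuming lowercase a-z only.
from_letters() {
    local s=$1 n=0 i c prefix
    for (( i = 0; i < ${#s}; i++ )); do
        c=${s:i:1}
        prefix=${letters%%"$c"*}   # the letters before c; their count is c's value
        n=$(( n * 26 + ${#prefix} ))
    done
    printf '%d\n' "$n"
}

to_letters 199      # the value used in the bc example above
from_letters aaca
```

Incrementing a code then becomes a round trip: convert the letters to a number, add one, and convert back.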
With perl, you can ++ a string to increment the letter:
for (my ($n,$s) = (0,"aaaa"); $n < 200; $n++, $s++) {
printf "%s=%0*d\n", $s, length($s), $n;
}
outputs
aaaa=0000
aaab=0001
aaac=0002
aaad=0003
...
aaby=0050
aabz=0051
aaca=0052
aacb=0053
...
aahp=0197
aahq=0198
aahr=0199

select variable letter index from file [duplicate]

Say I have a file
3 boy
2 hello
3 bus
and I want to select the ith letter from each line, where i is the number in front of the line (resulting in y, e, s). Is there an easy way to do this with sed/cut? I tried matching for example the substring with the first that many letters with
cat test.txt | sed -e 's/\([0-9]\) \(.*\)\{\1\}.*/\2/'
to then cut it afterwards, but this yields an error Invalid content of \{\}. What is the proper way to do this (preferably with just sed/cut/... so without for-loops etc.)?
I am looking for a way that can be done as pipelining, i.e. starting the line with cat test.txt | ....
a single awk script can be written as
awk '{print substr($2,$1,1)}' inputFile
gives output as
y
e
s
The substr(str, pos, len) function returns the substring starting at position pos with length len.
You can use a loop:
while read -r number name
do
echo "${name:$number - 1:1}"
done < file
This takes advantage of the ${string:position:length} syntax, which extracts a substring of $length characters from $string starting at $position. As the first character is at position 0, we have to subtract 1 to get the one we need.
For your given input it returns:
$ while read -r number name; do echo "${name:$number - 1:1}"; done < a
y
e
s

Length of string in bash

How do you get the length of a string stored in a variable and assign that to another variable?
myvar="some string"
echo ${#myvar}
# 11
How do you set another variable to the output 11?
To get the length of a string stored in a variable, say:
myvar="some string"
size=${#myvar}
To confirm it was properly saved, echo it:
$ echo "$size"
11
Edit 2023-02-13: Use of printf %n instead of locales...
UTF-8 string length
In addition to fedorqui's correct answer, I would like to show the difference between string length and byte length:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
LANG=$oLang LC_ALL=$oLcAll
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
will render:
Généralités is 11 char len, but 14 bytes len.
you could even have a look at stored chars:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
printf -v myreal "%q" "$myvar"
LANG=$oLang LC_ALL=$oLcAll
printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"
will answer:
Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').
Note: Following Isabell Cowan's comment, I've added setting $LC_ALL along with $LANG.
Same, but without having to play with locales
I recently learned about the %n format of the printf command (builtin):
myvar='Généralités'
chrlen=${#myvar}
printf -v _ %s%n "$myvar" bytlen
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
Généralités is 11 char len, but 14 bytes len.
The syntax is a little counter-intuitive, but this is very efficient! (The strU8DiffLen function further down is about 2 times quicker using printf than the previous version using local LANG=C.)
Length of an argument, working sample
Arguments work the same as regular variables:
showStrLen() {
local -i chrlen=${#1} bytlen
printf -v _ %s%n "$1" bytlen
printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
}
will work as
showStrLen théorème
String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'.
Useful printf correction tool:
If you run:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
printf " - %-14s is %2d char length\n" "'$string'" ${#string}
done
- 'Généralités' is 11 char length
- 'Language' is 8 char length
- 'Théorème' is 8 char length
- 'Février' is 7 char length
- 'Left: ←' is 7 char length
- 'Yin Yang ☯' is 10 char length
Not really pretty output!
For this, here is a little function:
strU8DiffLen() {
local -i bytlen
printf -v _ %s%n "$1" bytlen
return $(( bytlen - ${#1} ))
}
or written in one line:
strU8DiffLen() { local -i _bl;printf -v _ %s%n "$1" _bl;return $((_bl-${#1}));}
Then now:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
strU8DiffLen "$string"
printf " - %-$((14+$?))s is %2d chars length, but uses %2d bytes\n" \
"'$string'" ${#string} $((${#string}+$?))
done
- 'Généralités' is 11 chars length, but uses 14 bytes
- 'Language' is 8 chars length, but uses 8 bytes
- 'Théorème' is 8 chars length, but uses 10 bytes
- 'Février' is 7 chars length, but uses 8 bytes
- 'Left: ←' is 7 chars length, but uses 9 bytes
- 'Yin Yang ☯' is 10 chars length, but uses 12 bytes
Unfortunately, this is not perfect!
Some strange UTF-8 behaviour remains, such as double-width characters, zero-width characters, reverse displacement and others, that cannot be handled so simply...
Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.
I wanted the simplest case, finally this is a result:
echo -n 'Tell me the length of this sentence.' | wc -m;
36
You can use:
MYSTRING="abc123"
MYLENGTH=$(printf "%s" "$MYSTRING" | wc -c)
wc -c or wc --bytes gives the byte count: a non-ASCII Unicode character counts as 2, 3 or more.
wc -m or wc --chars gives the character count: each character counts once, however many bytes it uses.
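The same byte versus character distinction can be sketched without wc, using only the ${#var} expansion. The function names char_len and byte_len are hypothetical, and the sketch assumes a bash version that restores the previous locale when a local LC_ALL goes out of scope.

```bash
#!/bin/bash
# char_len and byte_len are hypothetical helper names.

# Number of characters, as counted in the caller's current locale.
char_len() {
    printf '%d\n' "${#1}"
}

# Number of bytes: in the C locale every byte counts as one character.
# LC_ALL is declared local so the change is confined to this call.
byte_len() {
    local LC_ALL=C
    printf '%d\n' "${#1}"
}

s=$'G\xc3\xa9n\xc3\xa9ralit\xc3\xa9s'   # "Généralités" spelled out as UTF-8 bytes
printf 'chars=%s bytes=%s\n' "$(char_len "$s")" "$(byte_len "$s")"
```

The character count depends on the locale the script runs in, while the byte count does not.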
In response to the post starting:
If you want to use this with command line or function arguments...
with the code:
size=${#1}
There might be the case where you just want to check for a zero length argument and have no need to store a variable. I believe you can use this sort of syntax:
if [ -z "$1" ]; then
#zero length argument
else
#non-zero length
fi
See GNU and wooledge for a more complete list of Bash conditional expressions.
If you want to use this with command line or function arguments, make sure you use size=${#1} instead of size=${#$1}. The second one may be more instinctual but is incorrect syntax.
Using your example provided
#KISS (Keep it simple stupid)
size=${#myvar}
echo $size
Here are a couple of ways to calculate the length of a variable:
echo "${#VAR}"
echo -n "$VAR" | wc -m
echo -n "$VAR" | wc -c
printf '%s' "$VAR" | wc -m
expr length "$VAR"
expr "$VAR" : '.*'
To store the result in another variable, wrap any of the commands above in a command substitution:
otherVar=$(echo -n "$VAR" | wc -m)
echo "$otherVar"
http://techopsbook.blogspot.in/2017/09/how-to-find-length-of-string-variable.html
I know that the Q and A's are old enough, but today I faced this task for the first time. Usually I use the ${#var} combination, but it fails with Unicode: most text I process with bash is in Cyrillic...
Based on @atesin's answer, I made a short (and ready to be shortened further) function which may be usable for scripting. The task that led me to this question was to show a message of variable length in a pseudo-graphics box. So, here it is:
$ cat draw_border.sh
#!/bin/sh
#based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
border()
{
local BPAR="$1"
local BPLEN=`echo $BPAR|wc -m`
local OUTLINE=\|\ "$1"\ \|
# line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
# comment of Bit Twiddler Jun 5, 2021 @ 8:47
local OUTBORDER=\+`head -c $(($BPLEN+1))</dev/zero|tr '\0' '-'`\+
echo $OUTBORDER
echo $OUTLINE
echo $OUTBORDER
}
border "Généralités"
border 'А вот еще одна '$LESSCLOSE' '
border "pure ENGLISH"
And what this sample produces:
$ draw_border.sh
+-------------+
| Généralités |
+-------------+
+----------------------------------+
| А вот еще одна /usr/bin/lesspipe |
+----------------------------------+
+--------------+
| pure ENGLISH |
+--------------+
First example (in French?) was taken from someone's example above.
The second one combines Cyrillic and the value of some variable. The third one is self-explanatory: only the 1st half of the ASCII chars.
I used echo $BPAR|wc -m instead of printf ... in order not to rely on whether printf is built in or not.
Above I saw talk about the trailing newline and the -n parameter for echo. I did not use it, so wc also counts the newline and I add only one to $BPLEN. Had I used -n, I would have to add 2.
To explain the difference between wc -m and wc -c, see the same script with only one minor change: -m was replaced with -c
$ draw_border.sh
+----------------+
| Généralités |
+----------------+
+---------------------------------------------+
| А вот еще одна /usr/bin/lesspipe |
+---------------------------------------------+
+--------------+
| pure ENGLISH |
+--------------+
Accented Latin characters and most Cyrillic characters take two bytes, so the drawn horizontal lines are longer than the real length of the message.
Hope it will save someone some time :-)
p.s. Russian text says "here is one more"
p.p.s. Working "two-liner"
#!/bin/sh
#based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
border()
{
# line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
# comment of Bit Twiddler Jun 5, 2021 @ 8:47
local OUTBORDER=\+`head -c $(( $(echo "$1"|wc -m) +1))</dev/zero|tr '\0' '-'`\+
echo $OUTBORDER"\n"\|\ "$1"\ \|"\n"$OUTBORDER
}
border "Généralités"
border 'А вот еще одна '$LESSCLOSE' '
border "pure ENGLISH"
In order to not clutter the code with repetitive OUTBORDER's drawing, I put the forming of OUTBORDER into separate command
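For comparison, a bash-only sketch of the same box that needs neither wc nor head. It keeps the border name from the answer above, and assumes bash (not plain sh) running in a UTF-8 locale, so that ${#text} counts characters rather than bytes.

```bash
#!/bin/bash
# border: draws a box around its first argument.
# Assumes a UTF-8 locale so that ${#text} is a character count.
border() {
    local text=$1 line
    # Build a run of spaces as wide as the text plus two padding blanks,
    # then turn every space into a dash.
    printf -v line '%*s' "$(( ${#text} + 2 ))" ''
    line=${line// /-}
    printf '+%s+\n| %s |\n+%s+\n' "$line" "$text" "$line"
}

border "pure ENGLISH"
```

Like the wc -m version, this counts characters, not screen columns, so double-width or zero-width characters would still misalign the box.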
Maybe just use wc -c, which counts bytes (the same as the number of characters only for plain ASCII text):
myvar="Hello, I am a string."
echo -n "$myvar" | wc -c
Result:
21
Length of string in bash
str="Welcome to Stackoveflow"
length=`expr length "$str"`
echo "Length of '$str' is $length"
OUTPUT
Length of 'Welcome to Stackoveflow' is 23

Finding gaps in sequential numbers

I don’t do this stuff for a living so forgive me if it’s a simple question (or more complicated than I think). I‘ve been digging through the archives and found a lot of tips that are close but being a novice I’m not sure how to tweak for my needs or they are way beyond my understanding.
I have some large data files that I can parse out to generate a list of coordinates that are mostly sequential
5
6
7
8
15
16
17
25
26
27
What I want is a list of the gaps
1-4
9-14
18-24
I don’t know perl, SQL or anything fancy but thought I might be able to do something that would subtract one number from the next. I could then at least grep the output where the difference was not 1 or -1 and work with that to get the gaps.
With awk :
awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt
explanations
$1 is the first column from current input line
p is the value from the previous line
so ($1!=p+1) is a condition: if $1 is different from the previous value + 1, then:
this part is executed: {print p+1 "-" $1-1}: print the previous value + 1, the - character and the first column - 1
{p=$1} is executed for every line: p is assigned the current first column
interesting question.
sputnick's awk one-liner is nice. I cannot write a simpler one than his. I just add another way using diff:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
the output with your example would be:
1,4
9,14
18,24
I know there is a comma in it instead of -. You could replace the grep with sed to get -, since grep cannot change the input text, but the idea is the same.
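One way the sed replacement suggested above might look, as a sketch. The function name gaps_via_diff is hypothetical, and the sketch assumes a diff that emits the traditional "normal" output format by default (lines such as 1,4d0), as GNU and BSD diff do.

```bash
#!/bin/bash
# gaps_via_diff is a hypothetical helper name.
# FILE holds sorted, unique positive integers, one per line.
gaps_via_diff() {
    local file=$1
    seq "$(tail -n 1 "$file")" | diff - "$file" |
        sed -n -e 's/^\([0-9]*\),\([0-9]*\)d.*/\1-\2/p' \
               -e 's/^\([0-9]*\)d.*/\1/p'
    # First expression: "9,14d4" becomes "9-14".
    # Second expression: a one-line deletion such as "2d1" becomes "2".
}

tmp=$(mktemp)
printf '%s\n' 5 6 7 8 15 16 17 25 26 27 > "$tmp"
gaps_via_diff "$tmp"
rm -f "$tmp"
```

For the sample data this prints 1-4, 9-14 and 18-24, one per line.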
hope it helps.
A Ruby Answer
Perhaps someone else can give you the Bash or Awk solution you asked for. However, I think any shell-based answer is likely to be extremely localized for your data set, and not very extendable. Solving the problem in Ruby is fairly simple, and provides you with flexible formatting and more options for manipulating the data set in other ways down the road. YMMV.
#!/usr/bin/env ruby
# You could read from a file if you prefer,
# but this is your provided corpus.
nums = [5, 6, 7, 8, 15, 16, 17, 25, 26, 27]
# Find gaps between zero and first digit.
nums.unshift 0
# Create array of arrays containing missing digits.
missing_nums = nums.each_cons(2).map do |array|
(array.first.succ...array.last).to_a unless
array.first.succ == array.last
end.compact
# => [[1, 2, 3, 4], [9, 10, 11, 12, 13, 14], [18, 19, 20, 21, 22, 23, 24]]
# Format the results any way you want.
puts missing_nums.map { |ary| "#{ary.first}-#{ary.last}" }
Given your current corpus, this yields the following on standard output:
1-4
9-14
18-24
Just remember the previous number and verify that the current one is the previous plus one:
#! /bin/bash
previous=0
while read n ; do
if (( n != previous + 1 )) ; then
echo $(( previous + 1 ))-$(( n - 1 ))
fi
previous=$n
done
You might need to add some checking to prevent lines like 28-28 for single number gaps.
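A sketch of that extra check, assuming sorted, unique positive integers without leading zeros on standard input. The function name print_gaps is hypothetical. A gap that is only one number wide is printed as that number alone.

```bash
#!/bin/bash
# print_gaps is a hypothetical helper name.
# Prints each gap as START-END, or as a single number when only one
# value is missing.
print_gaps() {
    local n previous=0 first last
    while read -r n; do
        if (( n != previous + 1 )); then
            first=$(( previous + 1 ))
            last=$(( n - 1 ))
            if (( first == last )); then
                printf '%d\n' "$first"
            else
                printf '%d-%d\n' "$first" "$last"
            fi
        fi
        previous=$n
    done
}

print_gaps <<'EOF'
5
6
7
8
15
16
17
25
26
27
EOF
```

For the sample data this prints 1-4, 9-14 and 18-24, matching the output asked for in the question.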
Perl solution similar to awk solution from StardustOne:
perl -ane 'if ($F[0] != $p+1) {printf "%d-%d\n",$p+1,$F[0]-1}; $p=$F[0]' file.txt
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace. Fields are indexed starting with 0.
-e execute the perl code
Given input file, use the numinterval util and paste its output beside file, then munge it with tr, xargs, sed and printf:
gaps() { paste <(echo; numinterval "$1" | tr 1 '-' | tr -d '[02-9]') "$1" |
tr -d '[:blank:]' | xargs echo |
sed 's/ -/-/g;s/-[^ ]*-/-/g' | xargs printf "%s\n" ; }
Output of gaps file:
5-8
15-17
25-27
How it works. The output of paste <(echo; numinterval file) file looks like:
5
1 6
1 7
1 8
7 15
1 16
1 17
8 25
1 26
1 27
From there we mainly replace things in column #1, and tweak the spacing. The 1s are replaced with -s, and the higher numbers are blanked. Remove some blanks with tr. Replace runs of hyphens like "5-6-7-8" with a single hyphen "5-8", and that's the output.
This one lists the numbers that break the sequence in a list.
Idea taken from @choroba but done with a for loop.
#! /bin/bash
previous=0
n=$( cat listaNums.txt )
for number in $n
do
numListed=$(($number - 1))
if [ $numListed != $previous ] && [ $number != 2147483647 ]; then
echo $numListed
fi
previous=$number
done
