Bash: concatenated variables derived from text file using grep gives confused output - bash

In my directory, I have a multiple nifti files (e.g., WIP944_mp2rage-0.75iso_TR5.nii) from my MRI scanner accompanied by text files (e.g., WIP944_mp2rage-0.75iso_TR5_info.txt) containing information on the acquisition parameters (e.g., "Series description: WIP944_mp2rage-0.75iso_TR5_INV1_PHS_ND"). Based on these parameters (e.g., INV1_PHS_ND), I need to change the nifti file name, which are echoed in $niftibase. I used grep to do this. When echoing all variables individually, it gives me what I want, but when I try to concatenate them into one filename, the variables are mixed together, instead of delimited by a dot.
I tried multiple forms of sed to cut away potentially invisible characters and identified the source of the problems: the "INV1_PHS_ND" part of 'series description' gives me troubles, which is the $struct component, potentially due to the fact that this part varies in how many fields are extracted. Sometimes this is 3 (in the case of INV1_PHS_ND), but it can be 2 as well (INV1_ND). When I introduce this variable into the filename, everything goes haywire.
for infofile in ${PWD}/*.txt; do
# General characteristics of subjects (i.e., date of session, group number, and subject number)
reco=$(grep -A0 "Series description:" ${infofile} | cut -d ' ' -f 3 | cut -d '_' -f 1)
date=$(grep -A0 "Series date:" ${infofile} | cut -c 16-21)
group=$(grep -A0 "Subject:" ${infofile} | cut -d '^' -f 2 | cut -d '_' -f 1 )
number=$(grep -A0 "Subject:" ${infofile} | cut -d '^' -f 2 | cut -d '_' -f 2)
ScanNr=$(grep -A0 "Series number:" ${infofile} | cut -d ' ' -f 3)
# Change name if reco has structural prefix
if [[ $reco = *WIP944* ]]; then
struct=$(grep -A0 "Series description: WIP944" ${infofile} | cut -d '_' -f 4,5,6)
niftibase=$(basename $infofile _info.txt).nii
#echo ${subStudy}.struct.${date}.${group}.${protocol}.${paradigm}.nii
echo ${subStudy}.struct.${struct}.${date}.${group}.${protocol}${number}.${paradigm}.n${ScanNr}.nii
#mv ${niftibase} ${subStudy}.struct.${struct}.${date}.${group}.${protocol}${number}.${paradigm}.n${ScanNr}.nii
fi
done
This gives me output like this:
.niit47.n4lot.Noc002
.niit47.n5lot.Noc002D
.niit47.n6lot.Noc002
.niit47.n8lot.Noc002
.niit47.n9lot.Noc002
.niit47.n10ot.Noc002
.niit47.n11ot.Noc002D
for all 7 WIP944 files. However, it needs to be in the direction of this:
H1.struct.INV2_PHS_ND.190523.Pilot.Noc001.Heat47.n11.nii, where H1, Noc, and Heat47 are loaded in from a setup file.
EDIT: I tried to use awk in the following way:
reco=$(awk 'FNR==8 {print;exit}' $infofile | cut -d ' ' -f 3 | cut -d '_' -f 1)
date=$(awk 'FNR==2 {print;exit}' $infofile | cut -c 15-21)
group=$(awk 'FNR==6 {print;exit}' $infofile | cut -d '^' -f 2 | cut -d '_' -f 1 )
number=$(awk 'FNR==6 {print;exit}' $infofile | cut -d '^' -f 2 | cut -d '_' -f 2)
ScanNr=$(awk 'FNR==14 {print;exit}' $infofile | cut -d ' ' -f 3)
which again gave me the correct output when echoing the variables individually, but not when I tried to combine them: .niit47.n11022_PHS_ND.
I used echo "$struct" | tr -dc '[:print:]' | od -c to see if there were hidden characters due to line endings, which resulted in:
0000000 I N V 2 _ P H S _ N D
0000013
EDIT: This is how the text file looks like:
Series UID: 1.3.12.2.1107.5.2.34.18923.2019052316005066316714852.0.0.0
Study date: 20190523
Study time: 153529.718000
Series date: 20190523
Series time: 160111.750000
Subject: MDC-0153,pilot_003^pilot_003
Subject birth date: 19970226
Series description: WIP944_mp2rage-0.75iso_TR5_INV1_PHS_ND
Image type: ORIGINAL\PRIMARY\P\ND
Manufacturer: SIEMENS
Model name: Investigational_Device_7T
Software version: syngo MR B17
Study id: 1
Series number: 5
Repetition time (ms): 5000
Echo time[1] (ms): 2.51
Inversion time (ms): 900
Flip angle: 7
Number of averages: 1
Slice thickness (mm): 0.75
Slice spacing (mm):
Image columns: 320
Image rows: 320
Phase encoding direction: ROW
Voxel size x (mm): 0.75
Voxel size y (mm): 0.75
Number of volumes: 1
Number of slices: 240
Number of files: 240
Number of frames: 0
Slice duration (ms) : 0
Orientation: sag
PixelBandwidth: 248
I have one of these for each nifti file. subStudy is hardcoded in a setup file, which is loaded in prior to running the for loop. When I echo this, it shows the correct value. I need to change the names of multiple files with a specific prefix, which are stored in $reco.

As confirmed in comments, the input files have DOS carriage returns, which are basically invalid in Unix files. Also, you should pay attention to proper quoting.
As a general overhaul, I would recommend replacing the entire Bash script with a simple Awk script, which is both simpler and more idiomatic.
for infofile in ./*.txt; do # no need to use $(PWD)
# Pre-filter with a simple grep
grep -q '^Series description: [^ _]*WIP944' "$infofile" && continue
# Still here? Means we want to rename
suffix="$(awk -F : '
BEGIN { split("Series description:Series date:Subject:Series number", f, /:/) }
{ sub(/\r/, ""); } # get rid of pesky DOS carriage return
NR == 1 { nifbase = FILENAME; sub(/_info\.txt$/, ".nii", nifbase) }
$1 in f { x[$1] = substring($0, length($1)+2) }
END {
split(x["Series description"], t, /_/); struct=t[4] "_" t[5] "_" t[6]
split(x["Series description"], t, /_/); reco = t[1]
date=substr(x["Series date"], 16, 5)
split(x["Subject"], t, /\^/); split(t[2], tt, /_/); group=tt[1]
number=tt[2]
ScanNr=x["Series number"]
### FIXME: protocol and paradigm are still undefined
print struct "." date "." group "." protocol number "." paradigm ".n" ScanNr
}' "$infofile")"
echo mv "$infofile" "$subStudy.struct.$suffix"
done
This probably still requires some tweaking (at least "protocol" and "paradigm" are still undefined). Once it seems to print the correct values, you can remove the echo before mv and have it actually rename files for you.
(Probably still better test on a copy of your real data files first!)

Related

Replace value of a specific column from a file

I have file whose size is approx 1 GB and that file has a data in below format .
A|CD|44123|0|0
B|CD|44124|0|0
C|CD|44125|0|0
D|CD|44126|0|0
E|CD|44127|0|0
F|CD|44128|0|0
J|CD|44129|0|0
I|CD|44130|0|0
In this file I have to replace the third column value from a value which i will get after conversion . For which i have to open this file and then read the file and replace it . This process is taking around 5 hours .Below is the code which i am using
cat $FILE_NAME |\
while read REC
do
DATE=`echo "$REC" | cut -d\| -f3`
DATE_NEW=`$UTIL $DATE | head -1 |cut -d" " -f12`
RECORD="$DATE_NEW,"
echo "$RECORD" >> $New_File
done
Is there a way we can make this more better and fast.
Desired output will be like this where DATE_NEW value will be placed on each 3rd column DATE_NEW value will be the converted value which I will get from this
DATE_NEW=`$UTIL $DATE | head -1 |cut -d" " -f12`
A|CD|10/20/2020|0|0
B|CD|10/25/2020|0|0
C|CD|10/25/2020|0|0
D|CD|10/25/2020|0|0
E|CD|11/15/2020|0|0
F|CD|11/14/2020|0|0
J|CD|11/16/2020|0|0
I|CD|11/17/2020|0|0
After the comment from #Sundeep Why is using a shell loop to process text considered bad practice? I wrote the logic in Perl and from 5-7 hours processing time in Perl it took 99 Seconds to get the job done.
Give this a try:
awk -v cmd="Cmd2GetNEWDATE" 'BEGIN{FS=OFS="|"}{cmd|getline v;close(cmd)}$3=v' file

How to write a bash function that can detect if a given input ends in Kilobytes `K` or Megabytes `M`?

I have a bash function that is currently set up as:
MB=$(( $(echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) | cut -d "K" -f 1 | sed 's/^.*- //') / 1000 ))
where the middle portion echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) returns a value that ends in K or M, (for example: 515223 K or 36326 M) for Kilobytes or Megabytes. I currently have designed the function to strip the trailing units indicator for K, and then divide by 1000 to convert to megabytes. However, when the inside part of it ends in M, it fails. How can I write a function that detects if its in kilobytes or megabytes?
Don't reinvent the wheel - there is numfmt:
function_that_returns_Kb_or_Mb() { echo "515223 K"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=504
function_that_returns_Kb_or_Mb() { echo "36326 M"; }
mb=$(function_that_returns_Kb_or_Mb | numfmt -d '' --from=iec --to-unit=Mi)
# mb=36326
Notes:
echo $(FUNCTION_THAT_RETURNS_Kb_OR_Mb) is a useless use of echo. It's like echo $(echo $(echo $(...)))). Just FUNCTION_THAT_RETURNS_Kb_OR_Mb | blabla.
By convention UPPERCASE VARIABLES are used for exported variables, like PATH COLUMNS UID PWD etc. - use lower case identifiers in your scripts.
I assumed input and output is using IEC scale, for SI scale use --from=si --to-unit=M.

while loops in parallel with input from splited file

I am stuck on that. So I have this while-read loop within my code that is taking so long and I would like to run it in many processors. But, I'd like to split the input file and run 14 loops (because I have 14 threads), one for each splited file, in parallel. Thing is that I don't know how to tell the while loop which file to get and work with.
For example, in a regular while-read loop I would code:
while read line
do
<some code>
done < input file or variable...
But in this case I would like to split the above input file in 14 files and run 14 while loops in parallel, one for each splited file.
I tried :
split -n 14 input_file
find . -name "xa*" | \
parallel -j 14 | \
while read line
do
<lot of stuff>
done
also tried
split -n 14 input_file
function loop {
while read line
do
<lot of stuff>
done
}
export -f loop
parallel -j 14 ::: loop
But neither I was able to tell which file would be the input to the loop so parallel would understand "take each of those xa* files and place into individual loops in parallel"
An example of the input file (a list of strings)
AEYS01000010.10484.12283
CVJT01000011.50.2173
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
EDIT
This is the code.
The output is a table (741100 lines) with some statistics regarding DNA sequences alignments already made.
The loop takes an input_file (no broken lines, varies from 500 to ~45000 lines, 800Kb) with DNA sequence acessions, reads it line-by-line and look for each correspondent full taxonomy for those acessions in a databank (~45000 lines). Then, it does a few sums/divisions. Output is a .tsv and looks like this (an example for sequence "KF625180.1.1799"):
Rate of taxonomies for this sequence in %: KF625180.1.1799 D_6__Bacillus_atrophaeus
Taxonomy %aligned number_ocurrences_in_the_alignment num_ocurrences_in_databank %alingment/databank
D_6__Bacillus_atrophaeus 50% 1 20 5%
D_6__Bacillus_amyloliquefaciens 50% 1 154 0.649351%
$ head input file
AEYS01000010.10484.12283
CVJT01000011.50.217
KF625180.1.1799
KT949922.1.1791
LOBZ01000025.54942.57580
Two additional files are also used inside the loop. They are not the loop input.
1) a file called alnout_file that only serves for finding how many hits (or alignments) a given sequence had against the databank. It was also previously made outside this loop. It can vary in the number of lines from hundreads to thousands. Only columns 1 and 2 matters here. Column1 is the name of the sequence and col2 is the name of all sequences it matched in the databnk. It looks like that:
$ head alnout_file
KF625180.1.1799 KF625180.1.1799 100.0 431 0 0 1 431 1 431 -1 0
KF625180.1.1799 KP143082.1.1457 99.3 431 1 2 1 431 1 429 -1 0
KP143082.1.1457 KF625180.1.1799 99.3 431 1 2 1 429 1 431 -1 0
2) a databank .tsv file containing ~45000 taxonomies correspondent to the DNA sequences. Each taxonomy is in one line:
$ head taxonomy.file.tsv
KP143082.1.1457 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_amyloliquefaciens
KF625180.1.1799 D_0__Bacteria;D_1__Firmicutes;D_2__Bacilli;D_3__Bacillales;D_4__Bacillaceae;D_5__Bacillus;D_6__Bacillus_atrophaeus
So, given sequence KF625180.1.1799. I previously aligned it against a databank containing ~45000 other DNA sequences and got an output whis has all the accessions to sequences that it matched. What the loop does is that it finds the taxonomies for all those sequences and calculates the "statistics" I mentionded previously. Code does it for all the DNA-sequences-accesions I have.
TAXONOMY=path/taxonomy.file.tsv
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done < input_file
It is a lot of greps and parsing so it takes about ~12h running in one processor for doing it to all the 45000 DNA sequence accessions. The, I would like to split input_file and do it in all the processors I have (14) because it would the time spend in that.
Thank you all for being so patient with me =)
You are looking for --pipe. In this case you can even use the optimized --pipepart (version >20160621):
export TAXONOMY=path/taxonomy.file.tsv
doit() {
while read line
do
#find hits
hits=$(grep $line alnout_file | cut -f 2)
completename=$(grep $line $TAXONOMY | sed 's/D_0.*D_4/D_4/g')
printf "\nRate of taxonomies for this sequence in %%:\t$completename\n"
printf "Taxonomy\t%aligned\tnumber_ocurrences_in_the_alignment\tnum_ocurrences_in_databank\t%alingment/databank\n"
#find hits and calculate the frequence (%) of the taxonomy in the alignment output
# ex.: Bacillus_subtilis 33
freqHits=$(grep "${hits[#]}" $TAXONOMY | \
cut -f 2 | \
awk '{a[$0]++} END {for (i in a) {print i, "\t", a[i]/NR*100, "\t", a[i]}}' | \
sed -e 's/D_0.*D_5/D_5/g' -e 's#\s\t\s#\t#g' | \
sort -k2 -hr)
# print frequence of each taxonomy in the databank
freqBank=$(while read line; do grep -c "$line" $TAXONOMY; done < <(echo "$freqHits" | cut -f 1))
#print cols with taxonomy and calculations
paste <(printf %s "$freqHits") <(printf %s "$freqBank") | awk '{print $1,"\t",$2"%","\t",$3,"\t",$4,"\t",$3/$4*100"%"}'
done
}
export -f doit
parallel -a input_file --pipepart doit
This will chop input_file into 10*ncpu blocks (where ncpu is the number of CPU threads), pass each block to doit, run ncpu jobs in parallel.
That said I think your real problem is spawning too many programs: If you rewrite doit in Perl or Python I will expect you will see a major speedup.
As an alternative I threw together a quick test.
#! /bin/env bash
mkfifo PIPELINE # create a single queue
cat "$1" > PIPELINE & # supply it with records
{ declare -i cnt=0 max=14
while (( ++cnt <= max )) # spawn loop creates worker jobs
do printf -v fn "%02d" $cnt
while read -r line # each work loop reads common stdin...
do echo "$fn:[$line]"
sleep 1
done >$fn.log 2>&1 & # these run in background in parallel
done # this one exits
} < PIPELINE # *all* read from the same queue
wait
cat [0-9][0-9].log
Doesn't need split, but does need a mkfifo.
Obviously, change the code inside the internal loop.
This answers what you asked, namely how to process in parallel the 14 files you get from running split. However, I don't think it is the best way of doing whatever it is that you are trying to do - but we would need some answers from you for that.
So, let's make a million line file and split it into 14 parts:
seq 1000000 > 1M
split -n 14 1M part-
That gives me 14 files called part-aa through part-an. Now your question is how to process those 14 parts in parallel - (read the last line first):
#!/bin/bash
# This function will be called for each of the 14 files
DoOne(){
# Pick up parameters
job=$1
file=$2
# Count lines in specified file
lines=$(wc -l < "$file")
echo "Job No: $job, file: $file, lines: $lines"
}
# Make the function above known to processes spawned by GNU Parallel
export -f DoOne
# Run 14 parallel instances of "DoOne" passing job number and filename to each
parallel -k -j 14 DoOne {#} {} ::: part-??
Sample Output
Job No: 1, file: part-aa, lines: 83861
Job No: 2, file: part-ab, lines: 72600
Job No: 3, file: part-ac, lines: 70295
Job No: 4, file: part-ad, lines: 70295
Job No: 5, file: part-ae, lines: 70294
Job No: 6, file: part-af, lines: 70295
Job No: 7, file: part-ag, lines: 70295
Job No: 8, file: part-ah, lines: 70294
Job No: 9, file: part-ai, lines: 70295
Job No: 10, file: part-aj, lines: 70295
Job No: 11, file: part-ak, lines: 70295
Job No: 12, file: part-al, lines: 70294
Job No: 13, file: part-am, lines: 70295
Job No: 14, file: part-an, lines: 70297
You would omit the -k argument to GNU Parallel normally - I only added it so the output comes in order.
I think that using a bunch of grep and awk commands is the wrong approach here - you would be miles better off using Perl, or awk. As you have not provided any sample files I generated some using this code:
#!/bin/bash
for a in {A..Z} {0..9} ; do
for b in {A..Z} {0..9} ; do
for c in {A..Z} {0..9} ; do
echo "${a}${b}${c}"
done
done
done > a
# Now make file "b" which has the same stuff but shuffled into a different order
gshuf < a > b
Note that there are 26 letters in the alphabet, so if I add the digits 0..9 to the letters of the alphabet, I get 36 alphanumeric digits and if I nest 3 loops of that I get 36^3 or 46,656 lines which matches your file sizes roughly. File a now looks like this:
AAA
AAB
AAC
AAD
AAE
AAF
File b looks like this:
UKM
L50
AOC
79U
K6S
6PO
12I
XEV
WJN
Now I want to loop through a finding the corresponding line in b. First, I use your approach:
time while read thing ; do grep $thing b > /dev/null ; done < a
That takes 9 mins 35 seconds.
If I now exit grep on the first match, on average I will find it in the middle, which means the time will be halved since I won't continue to needlessly read b after I find what I want.
time while read thing ; do grep -m1 $thing b > /dev/null ; done < a
That improves the time down to 4 mins 30 seconds.
If I now use awk to read the contents of b into an associative array (a.k.a. hash) and then read the elements of a and find them in b like this:
time awk 'FNR==NR{a[$1]=$1; next} {print a[$1]}' b a > /dev/null
That now runs in 0.07 seconds. Hopefully you get the idea of what I am driving at. I expect Perl would do this in the same time and also provide more expressive facilities for the maths in the middle of your loop too.
I hope this small script helps you out:
function process {
while read line; do
echo "$line"
done < $1
}
function loop {
file=$1
chunks=$2
dir=`mktemp -d`
cd $dir
split -n l/$chunks $file
for i in *; do
process "$i" &
done
rm -rf $dir
}
loop /tmp/foo 14
It runs the process loop on the specified file with the specified number of chunks (without splitting lines) in parallel (using & to put each invocation in the background). I hope it gets you started.
This can do the job for You, I am not familiar with parallel instead using native bash spawning processes &:
function loop () {
while IFS= read -r -d $'\n'
do
# YOUR BIG STUFF
done < "${1}"
}
arr_files=(./xa*)
for i in "${arr_files[#]}"
do loop "${i}" &
done
wait

unix find the difference from a file row wise

I have some data like
[09359]0000.365604| =>SttSasph_Hmbm_bSPO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365687| =>Hmbm_bSPO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365879| =>SttSasph_Hmbm_quOuO_PhQmOm (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365890| =>Hmbm_quOuO_PhQmOm_Wd (Hmbm_PhQmOm_utWmP.asp)
[09359]0000.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
[09359]0001.625300| db_HOt_POPPWon_Wd: aspuQQOnt POPPWon WD WP 1016,59
[09359]0002.365979| WSmmOT SDDQ vSQWSbmO not POt, QOvOQtWnH to Onv mOthod
Every Line starts with a process number (Which can change) in square brackets
Then Seconds after the module (0001) in this case
Then MicroSeconds after the fullstop.
Then a Pipe to terminate.
Rest part can be ignored
What I need is to caluclate
Convert Seconds into MircoSeconds
Add the Microsconds to Converted Microseconds (From 1)
Find out the difference in microseconds. for eg. line2-line1 , line3-line2, line4- line3 and so.
Print the result in seperate file.
I tried to use this logic. But, it didnt work.
May I get suggestions with optimised way to do it or
improvement in my existing logic
sec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f1)
msec=$(grep '^\[.\{1,\}\]' mass.May28.1 | cut -d "| " -f1 | cut -c8- | cut -d"." -f2)
$f_msec=$((sec * 1000000 + msec)) > final_difference_file
If you are comfortable with awk, then you can use this script:
script.awk
BEGIN{ FS="[\\[\\]\\|]+" }
{ printf("[%s]%011.6f|%s\n", $2,$3-prev,$4)
prev = $3 }
Use it like this: awk -f script.awk yourfile
The first line setups the fieldsplitting to use the brackets and pipe (ignore the backslashes they are need to escape the symbols that are regexp metacharacters). The second line prints the fields and calculates the timediff. The last line stores the current time for the calculation in the next line.
This can also be done with a bash script. Since bash lacks floating point arithmetic, we have to gather seconds and microseconds seperately (or call an external tool like bc for each line):
script.sh
IFS='|[].'
factor=1000000
prev=0
while read dummy pid secs msecs text;
do
msecs=$(( $secs * $factor + $msecs ))
timediff=$(( $msecs - $prev ))
prev=$msecs
secs=$(( $timediff / $factor ))
msecs=$(( $timediff - $secs * $factor ))
printf "[%s]%04d.%06d|%s\n" "$pid" "$secs" "$msecs" "$text"
done
Use it like this: bash script.sh yourfile

Length of string in bash

How do you get the length of a string stored in a variable and assign that to another variable?
myvar="some string"
echo ${#myvar}
# 11
How do you set another variable to the output 11?
To get the length of a string stored in a variable, say:
myvar="some string"
size=${#myvar}
To confirm it was properly saved, echo it:
$ echo "$size"
11
Edit 2023-02-13: Use of printf %n instead of locales...
UTF-8 string length
In addition to fedorqui's correct answer, I would like to show the difference between string length and byte length:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
LANG=$oLang LC_ALL=$oLcAll
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
will render:
Généralités is 11 char len, but 14 bytes len.
you could even have a look at stored chars:
myvar='Généralités'
chrlen=${#myvar}
oLang=$LANG oLcAll=$LC_ALL
LANG=C LC_ALL=C
bytlen=${#myvar}
printf -v myreal "%q" "$myvar"
LANG=$oLang LC_ALL=$oLcAll
printf "%s has %d chars, %d bytes: (%s).\n" "${myvar}" $chrlen $bytlen "$myreal"
will answer:
Généralités has 11 chars, 14 bytes: ($'G\303\251n\303\251ralit\303\251s').
Nota: According to Isabell Cowan's comment, I've added setting to $LC_ALL along with $LANG.
Same, but without having to play with locales
I recently learn %n format of printf command (builtin):
myvar='Généralités'
chrlen=${#myvar}
printf -v _ %s%n "$myvar" bytlen
printf "%s is %d char len, but %d bytes len.\n" "${myvar}" $chrlen $bytlen
Généralités is 11 char len, but 14 bytes len.
Syntax is a little counter-intuitive, but this is very efficient! (further function strU8DiffLen is about 2 time quicker by using printf than previous version using local LANG=C.)
Length of an argument, working sample
Argument work same as regular variables
showStrLen() {
local -i chrlen=${#1} bytlen
printf -v _ %s%n "$1" bytlen
LANG=$oLang LC_ALL=$oLcAll
printf "String '%s' is %d bytes, but %d chars len: %q.\n" "$1" $bytlen $chrlen "$1"
}
will work as
showStrLen théorème
String 'théorème' is 10 bytes, but 8 chars len: $'th\303\251or\303\250me'
Useful printf correction tool:
If you:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
printf " - %-14s is %2d char length\n" "'$string'" ${#string}
done
- 'Généralités' is 11 char length
- 'Language' is 8 char length
- 'Théorème' is 8 char length
- 'Février' is 7 char length
- 'Left: ←' is 7 char length
- 'Yin Yang ☯' is 10 char length
Not really pretty output!
For this, here is a little function:
strU8DiffLen() {
local -i bytlen
printf -v _ %s%n "$1" bytlen
return $(( bytlen - ${#1} ))
}
or written in one line:
strU8DiffLen() { local -i _bl;printf -v _ %s%n "$1" _bl;return $((_bl-${#1}));}
Then now:
for string in Généralités Language Théorème Février "Left: ←" "Yin Yang ☯";do
strU8DiffLen "$string"
printf " - %-$((14+$?))s is %2d chars length, but uses %2d bytes\n" \
"'$string'" ${#string} $((${#string}+$?))
done
- 'Généralités' is 11 chars length, but uses 14 bytes
- 'Language' is 8 chars length, but uses 8 bytes
- 'Théorème' is 8 chars length, but uses 10 bytes
- 'Février' is 7 chars length, but uses 8 bytes
- 'Left: ←' is 7 chars length, but uses 9 bytes
- 'Yin Yang ☯' is 10 chars length, but uses 12 bytes
Unfortunely, this is not perfect!
But there left some strange UTF-8 behaviour, like double-spaced chars, zero spaced chars, reverse deplacement and other that could not be as simple...
Have a look at diffU8test.sh or diffU8test.sh.txt for more limitations.
I wanted the simplest case, finally this is a result:
echo -n 'Tell me the length of this sentence.' | wc -m;
36
You can use:
MYSTRING="abc123"
MYLENGTH=$(printf "%s" "$MYSTRING" | wc -c)
wc -c or wc --bytes for byte counts = Unicode characters are counted with 2, 3 or more bytes.
wc -m or wc --chars for character counts = Unicode characters are counted single until they use more bytes.
In response to the post starting:
If you want to use this with command line or function arguments...
with the code:
size=${#1}
There might be the case where you just want to check for a zero length argument and have no need to store a variable. I believe you can use this sort of syntax:
if [ -z "$1" ]; then
#zero length argument
else
#non-zero length
fi
See GNU and wooledge for a more complete list of Bash conditional expressions.
If you want to use this with command line or function arguments, make sure you use size=${#1} instead of size=${#$1}. The second one may be more instinctual but is incorrect syntax.
Using your example provided
#KISS (Keep it simple stupid)
size=${#myvar}
echo $size
Here is couple of ways to calculate length of variable :
echo ${#VAR}
echo -n $VAR | wc -m
echo -n $VAR | wc -c
printf $VAR | wc -m
expr length $VAR
expr $VAR : '.*'
and to set the result in another variable just assign above command with back quote into another variable as following:
otherVar=`echo -n $VAR | wc -m`
echo $otherVar
http://techopsbook.blogspot.in/2017/09/how-to-find-length-of-string-variable.html
I know that the Q and A's are old enough, but today I faced this task for first time. Usually I used the ${#var} combination, but it fails with unicode: most text I process with the bash is in Cyrillic...
Based on #atesin's answer, I made short (and ready to be more shortened) function which may be usable for scripting. That was a task which led me to this question: to show some message of variable length in pseudo-graphics box. So, here it is:
$ cat draw_border.sh
#!/bin/sh
#based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
border()
{
local BPAR="$1"
local BPLEN=`echo $BPAR|wc -m`
local OUTLINE=\|\ "$1"\ \|
# line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
# comment of Bit Twiddler Jun 5, 2021 # 8:47
local OUTBORDER=\+`head -c $(($BPLEN+1))</dev/zero|tr '\0' '-'`\+
echo $OUTBORDER
echo $OUTLINE
echo $OUTBORDER
}
border "Généralités"
border 'А вот еще одна '$LESSCLOSE' '
border "pure ENGLISH"
And what this sample produces:
$ draw_border.sh
+-------------+
| Généralités |
+-------------+
+----------------------------------+
| А вот еще одна /usr/bin/lesspipe |
+----------------------------------+
+--------------+
| pure ENGLISH |
+--------------+
First example (in French?) was taken from someone's example above.
Second one combines Cyrillic and the value of some variable. Third one is self-explaining: only 1s 1/2 of ASCII chars.
I used echo $BPAR|wc -m instead of printf ... in order to not rely on if the printf is buillt-in or not.
Above I saw talks about trailing newline and -n parameter for echo. I did not used it, thus I add only one to the $BPLEN. Should I use -n, I must add 2.
To explain the difference between wc -m and wc -c, see the same script with only one minor change: -m was replaced with -c
$ draw_border.sh
+----------------+
| Généralités |
+----------------+
+---------------------------------------------+
| А вот еще одна /usr/bin/lesspipe |
+---------------------------------------------+
+--------------+
| pure ENGLISH |
+--------------+
Accented characters in Latin, and most of characters in Cyrillic are two-byte, thus the length of drawn horizontals are greater than the real length of the message.
Hope, it will save some one some time :-)
p.s. Russian text says "here is one more"
p.p.s. Working "two-liner"
#!/bin/sh
#based on https://stackoverflow.com/questions/17368067/length-of-string-in-bash
border()
{
# line below based on https://www.cyberciti.biz/faq/repeat-a-character-in-bash-script-under-linux-unix/
# comment of Bit Twiddler Jun 5, 2021 # 8:47
local OUTBORDER=\+`head -c $(( $(echo "$1"|wc -m) +1))</dev/zero|tr '\0' '-'`\+
echo $OUTBORDER"\n"\|\ "$1"\ \|"\n"$OUTBORDER
}
border "Généralités"
border 'А вот еще одна '$LESSCLOSE' '
border "pure ENGLISH"
In order to not clutter the code with repetitive OUTBORDER's drawing, I put the forming of OUTBORDER into separate command
Maybe just use wc -c to count the number of characters:
myvar="Hello, I am a string."
echo -n $myvar | wc -c
Result:
21
Length of string in bash
str="Welcome to Stackoveflow"
length=`expr length "$str"`
echo "Length of '$str' is $length"
OUTPUT
Length of 'Welcome to Stackoveflow' is 23

Resources