itsmejitu@itsmejitu:~$ numbers=(47 -78 12 45 6)
itsmejitu@itsmejitu:~$ printf "%d \n" ${numbers[@]} | sort -n
-78
6
12
45
47
itsmejitu@itsmejitu:~$ declare -a letters
itsmejitu@itsmejitu:~$ letters=(a c e z l s q a d c v)
itsmejitu@itsmejitu:~$ printf "%s \0" ${letters[@]} | sort -z | xargs -0n1
a
a
c
c
d
e
l
q
s
v
z
itsmejitu@itsmejitu:~$ printf "%s \n" ${letters[@]} | sort -z | xargs -0n1
a
c
e
z
l
s
q
a
d
c
v
Sorting integers is straightforward.
I tried to sort letters in bash and couldn't do it, so my friend sent me this. He couldn't explain it, though. I looked through the printf and xargs manuals, but the terms used there are beyond my understanding (I'm not a CS student).
Is there a simpler way to understand this?
Thanks!
In the first example, sort sees 5 different numbers separated by line feeds.
In the second example, sort and xargs see 11 different two-character strings (each has a trailing space) separated by null characters.
In the third example, sort and xargs see a single string (containing embedded line feeds and spaces) "separated" by a null character.
It might help to pipe the output of printf through hexdump -C or od to see what sort sees in each case.
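As a rough illustration of that suggestion, the sketch below dumps both streams with od -c and counts the NUL bytes in each. The three-element array and the variable names are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical three-element array, just to keep the dumps short.
letters=(a c e)

# Second example: every element is followed by a space and a NUL byte.
printf "%s \0" "${letters[@]}" | od -c

# Third example: the separator is a line feed, so the stream contains no NUL.
printf "%s \n" "${letters[@]}" | od -c

# Count the NUL bytes, i.e. the record separators that sort -z looks for.
nul_count=$(( $(printf "%s \0" "${letters[@]}" | tr -cd '\000' | wc -c) ))
newline_format_nul_count=$(( $(printf "%s \n" "${letters[@]}" | tr -cd '\000' | wc -c) ))
printf 'NUL bytes: %s with the \\0 format, %s with the \\n format\n' \
  "$nul_count" "$newline_format_nul_count"
```

With sort -z, the first stream holds three records, while the second holds a single record that happens to contain line feeds, which is why the third example comes back in its original order.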
Background
Just as a preemptive TL;DR: the tabs in my input file that happen to have one extra whitespace character (because they justify to a certain position regardless of that extra character) end up being converted to a single space in the output file.
I have a .xyz file from which I need to remove a specific set of lines, as well as do some text replacements. I have a separate .txt file containing a list of integers that correspond to the line numbers to be removed, and another one for the lines that need replacing; the two files are structured similarly. The removal file is called atomremove.txt and looks as follows.
14
13
11
10
4
The xyz file from which I need to remove lines looks something like this.
24
Comment block
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
B 17.99018 17.98940 2.24243
C 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
H 20.41328 14.02079 2.31959
H 22.06640 8.65013 2.27145
C 19.33725 17.20040 2.26894
H 13.96336 17.42048 2.19342
H 21.69450 3.68090 2.22196
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
C 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
My Code
I am successful in doing the line removal and the replacements, although the output is not as expected. It appears to replace some of the tabs with a single space, specifically on lines that have a 'y' coordinate with only 5 decimals. I am going to share the resulting output first, and then my code.
Here is the output
19
Comment Block
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
Here is my code.
atomstorefile="./extract_internal/atomremove.txt"
atomchangefile="./extract_internal/atomchange.txt"
temp="temp.txt"
tempp="tempp.txt"
temppp="temppp.txt"
filestoreloc="./"$basefilename"_xyzoutputs/chops"
#get number of files in directory and set a loop for that # of files
numfiles=$( ls "./"$basefilename"_xyzoutputs/splits" | wc -l )
numfiles=$(( numfiles/2 ))
counter=1
while [ $counter -lt $(( numfiles + 1 )) ];
do
#set a loop for each split half
splithalf=1
while [ $splithalf -lt 3 ];
do
#storing the xyz file in a temp file for edits (non destructive)
cat ./"$basefilename"_xyzoutputs/splits/split"$splithalf"-geometry$counter.xyz > $temp
#changin specified atoms
while read line;
do
line=$(( line + 2 ))
sed -i "${line}s/C/H/" $temp
done < $atomchangefile
# removing specified atoms
while read line;
do
line=$(( line + 2 ))
sed -i "${line}d" $temp
done < $atomstorefile
remainatoms=$( wc -l $temp | awk '{print $1}' )
remainatoms=$(( remainatoms - 2 ))
tail -n $remainatoms $temp > $tempp
echo $remainatoms > "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
echo Comment Block >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
cat $tempp >> "$filestoreloc"/split"$splithalf"-geometry$counter.xyz
splithalf=$(( splithalf + 1 ))
done
counter=$(( counter + 1 ))
done
I am sure the solution is simple. Any insight into what is causing this issue would be very appreciated.
Not sure what you are doing, but your file can be fixed using the column -t < filename command.
Example :
❯ cat test
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
~
❯ column -t < test
H 18.38385 15.26701 2.28399
C 19.32295 15.80772 2.28641
O 16.69023 17.37471 2.23138
H 22.72612 1.13322 2.17619
C 14.47116 18.37823 2.18809
C 15.85803 18.42398 2.20614
C 20.51484 15.08859 2.30584
C 22.77653 3.65203 2.19000
C 19.33725 17.20040 2.26894
C 23.01832 9.16815 2.25575
C 23.48143 2.42830 2.16161
H 22.07113 11.03567 2.32659
C 13.75496 19.59644 2.16380
O 23.01248 6.08053 2.20226
C 12.41476 19.56937 2.14732
C 16.54400 19.61620 2.20021
C 23.50500 4.83405 2.17735
H 23.03249 10.56089 2.28599
O 17.87129 19.42333 2.22107
~
❯
The reason you wreck your whitespace is that you need to quote your strings. But a much superior solution is to refactor all of this monumentally overcomplicated shell script to a simple sed or Awk script.
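To see the quoting effect in isolation, here is a minimal sketch with a made-up coordinate line (tab-separated, with one field padded by a space). The variable names are only for illustration.

```shell
# Hypothetical coordinate line: fields separated by tabs, one of them padded with a space.
line=$(printf 'C\t23.01832 \t9.16815')

# Unquoted: the shell splits the value on whitespace and echo rejoins the words
# with single spaces, so tabs and padding are lost.
unquoted=$(echo $line)

# Quoted: the value is passed through untouched, tabs included.
quoted=$(echo "$line")

# Dump both results byte by byte to make the difference visible.
printf '%s\n' "$unquoted" "$quoted" | od -c
```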
Assuming the line numbers all indicate line numbers in the original input file, try this.
tmp=$(mktemp -t atomtmpXXXXXXXXX) || exit
trap 'rm -f "$tmp"' ERR EXIT
( sed 's%$%s/C/H/%' extract_internal/atomchange.txt
sed 's%$%d%' extract_internal/atomremove.txt ) >"$tmp"
ls -l "$tmp"; nl "$tmp" # debugging
for file in "$basefilename"_xyzoutputs/splits/*; do
dst="$basefilename"_xyzoutputs/chops/${file#*/splits/}
sed -f "$tmp" "$file" >"$dst"
done
This combines the two input files into a new sed script (remarkably, by way of sed); the debugging line lets you inspect the result (probably remove it once you understand how this works).
Your question doesn't really explain how the input files relate to the output files so I had to guess a bit. One of the important changes is to avoid sed -i when you are not modifying an existing file; but above all, definitely avoid repeatedly overwriting the same file with sed -i.
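To make the generated script concrete, here is a small self-contained sketch. The four-line input and the line numbers are invented, and it ignores the two header lines of a real .xyz file, so the offsets in your data will differ.

```shell
workdir=$(mktemp -d)

# Hypothetical inputs: change C to H on line 3, delete lines 4 and 2.
printf '%s\n' 3 > "$workdir/change.txt"
printf '%s\n' 4 2 > "$workdir/remove.txt"
printf 'H\t1.0\nC\t2.0\nC\t3.0\nO\t4.0\n' > "$workdir/in.xyz"

# Append a sed command to every line number, producing one combined sed script.
{ sed 's%$%s/C/H/%' "$workdir/change.txt"
  sed 's%$%d%' "$workdir/remove.txt"; } > "$workdir/script.sed"

generated=$(cat "$workdir/script.sed")
result=$(sed -f "$workdir/script.sed" "$workdir/in.xyz")
rm -rf "$workdir"

printf 'Generated script:\n%s\n\nResult:\n%s\n' "$generated" "$result"
```

Because every address refers to a line number of the original input, the order of the commands in the generated script does not matter, and the tabs pass through untouched.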
I have a question. I have a file with coordinates (TAB separated)
2 10
35 50
90 200
400 10000
...
I would like to subtract the second column of the first line from the first column of the second line, and so on for each following pair of lines, i.e. calculate the distance. In other words, I would like a file with
25
40
200
...
How could I do that using awk?
Thank you very much in advance.
Here is an awk one-liner that may help you:
kent$ awk 'a{print $1-a}{a=$2}' file
25
40
200
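Spelled out, the one-liner does roughly the following. This expanded sketch tests NR > 1 instead of the truth value of a, which also copes with a second column equal to 0; prev_end is just an illustrative name.

```shell
# Sample input from the question (tab separated).
distances=$(printf '2\t10\n35\t50\n90\t200\n400\t10000\n' |
  awk '
    NR > 1 { print $1 - prev_end }   # start of this line minus end of the previous line
           { prev_end = $2 }         # remember the end column for the next line
  ')
printf '%s\n' "$distances"
```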
Here's a pure bash solution:
{
read _ ps
while read f s; do
echo $((f-ps))
((ps=s))
done
} < input_file
This only works if you have (small) integers, as it uses bash's arithmetic. If you want to deal with arbitrary sized integers or floats, you can use bc (with only one fork):
{
read _ ps
while read f s; do
printf '%s-%s\n' "$f" "$ps"
ps=$s
done
} < input_file | bc
Now I'll let the others give an awk answer!
Alright, since nobody wants to upvote my answer, here's a really funny solution that uses bash and bc:
a=( $(<input_file) )
printf -- '-(%s)+(%s);\n' "${a[@]:1:${#a[@]}-2}" | bc
or the same with dc (shorter but doesn't work with negative numbers):
a=( $(<input_file) )
printf '%s %sr-pc' "${a[@]:1:${#a[@]}-2}" | dc
using sed and ksh for evaluation
sed -n "
1x
1!H
$ !b
x
s/^ *[0-9]\{1,\} \(.*\) [0-9]\{1,\} *\n* *$/\1 /
s/\([0-9]\{1,\}\)\(\n\)\([0-9]\{1,\}\) /echo \$((\3 - \1))\2/g
s/\n *$//
w /tmp/Evaluate.me
"
. /tmp/Evaluate.me
rm /tmp/Evaluate.me
How can I add spaces between every character or symbol within a UTF-8 document? E.g. 123hello! becomes 1 2 3 h e l l o !.
I have BASH, OpenOffice.org, and gedit, if any of those can do that.
I don't care if it sometimes leaves extra spaces in places (e.g. 2 or 3 spaces in a single place is no problem).
Shortest sed version
sed 's/./& /g'
Output
$ echo '123hello!' | sed 's/./& /g'
1 2 3 h e l l o !
Obligatory awk version
awk '$1=$1' FS= OFS=" "
Output
$ echo '123hello!' | awk '$1=$1' FS= OFS=" "
1 2 3 h e l l o !
sed(1) can do this:
$ sed -e 's/\(.\)/\1 /g' < /etc/passwd
r o o t : x : 0 : 0 : r o o t : / r o o t : / b i n / b a s h
d a e m o n : x : 1 : 1 : d a e m o n : / u s r / s b i n : / b i n / s h
It works well on e.g. UTF-8 encoded Japanese content:
$ file japanese
japanese: UTF-8 Unicode text
$ sed -e 's/\(.\)/\1 /g' < japanese
E X I F 中 の 画 像 回 転 情 報 対 応 に よ り 、 一 部 画 像 ( 特 に 『
$
sed is ok but this is pure bash
string=hello
for ((i=0; i<${#string}; i++)); do
string_new+="${string:$i:1} "
done
Since you have bash, I will assume that you have access to sed. The following command line will do what you wish.
$ sed -e 's:\(.\):\1 :g' < input.txt > output.txt
I like these solutions because they do not leave a trailing space like the rest here.
GNU awk:
echo 123hello! | awk NF=NF FS=
GNU awk:
echo 123hello! | awk NF=NF FPAT=.
POSIX awk:
echo 123hello! | awk '{while(a=substr($0,++b,1))printf b-1?FS a:a}'
This might work for you:
echo '1 23h ello ! ' | sed 's/\s*/ /g;s/^\s*\(.*\S\)\s*$/\1/;l'
1 2 3 h e l l o !$
1 2 3 h e l l o !
In retrospect a far better solution:
sed 's/\B/ /g' file
This inserts a space at every position that is not a word boundary, for example between two adjacent letters or digits (\B is a GNU sed extension).
string='abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
echo ${string} | sed -r 's/(.{1})/\1 /g'
Pure POSIX Shell version:
addspace() {
__addspace_str="$1"
while [ -n "${__addspace_str#?}" ]; do
printf '%c ' "$__addspace_str"
__addspace_str="${__addspace_str#?}"
done
printf '%c' "$__addspace_str"
}
Or if you need to put it in a variable:
addspace_var() {
addspace_result=""
__addspace_str="$1"
while [ -n "${__addspace_str#?}" ]; do
addspace_result="$addspace_result${__addspace_str%${__addspace_str#?}} "
__addspace_str="${__addspace_str#?}"
done
addspace_result="$addspace_result$__addspace_str"
}
addspace_var abc
echo "$addspace_result"
Tested with dash, ksh, zsh, bash (+ bash --posix), and busybox ash.
Explanation
${x#?}
This parameter expansion removes the first character of x. ${x#...} in general removes a prefix given by a pattern, and ? matches any single character.
printf '%c ' "$str"
The %c format parameter transforms the string argument into its first character, so the full format string '%c ' prints the first character of the string followed by a space. Note that if the string was empty this would cause issues, but we already checked that it wasn't before, so it's fine. To print the first character safely in any situation we can use '%.1s', but I like living dangerously :3j
${x%${x#?}}
This is an alternate way to get the first character of the string. We already know that ${x#?} is all but the first character. Well, ${x%...} removes ... from the end of x, so ${x%${x#?}} removes all but the first character from the end of x, leaving only the first one.
__prefixed_variable_names
POSIX doesn't define local, so to avoid variable conflicts it's safer to create unique names that are unlikely to clobber each other. I am starting to experiment using M4 to generate unique names while not having to destroy my code every time but it's probably overkill for people who don't use shell as much as me.
[ -n "${str#?}" ]
Why not just [ -n "$str" ]? It's to avoid the dreaded trailing space, which is also why we have a little statement guy at the bottom there outside the loop. The loop runs until the string is one character long, then we finish outside of it so we can append this last character without adding a space.
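The expansions described above can be tried out on their own. This is only a small sketch with an arbitrary sample string; unlike the functions above it quotes the inner expansion, so that any pattern characters in the remainder are taken literally.

```shell
x=hello

rest=${x#?}                        # remove the shortest prefix matching ? (any one character)
first=${x%"$rest"}                 # remove that remainder from the end, leaving the first character
first_safe=$(printf '%.1s' "$x")   # %.1s prints at most one character, even for an empty string

printf 'rest=%s first=%s first_safe=%s\n' "$rest" "$first" "$first_safe"
```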
When should I use this?
This is good for small inputs in long-running loops, since it avoids the overhead of calling an external process, but for larger inputs it starts lagging behind fast, especially the var version. (I blame the ${x%${x#?}} trick.)
Benchmark Commands
# addspace
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do addspace \"$input\" >/dev/null; done"
# addspace_var
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do addspace_var \"$input\" >/dev/null; done"
# sed for comparison
time dash -c ". ./addspace.sh; for x in $(seq -s ' ' 1 10000); do echo \"$input\" | sed 's/./& /g' >/dev/null; done"
Input Length = 3
addspace addspace_var sed
real 0m0,106s 0m0,106s 0m10,651s
user 0m0,077s 0m0,075s 0m9,349s
sys 0m0,029s 0m0,031s 0m3,030s
Input Length = 200
addspace addspace_var sed
real 0m6,050s 0m47,115s 0m11,049s
user 0m5,557s 0m46,919s 0m9,727s
sys 0m0,488s 0m0,068s 0m3,085s
Input Length = 1000
addspace addspace_var sed
real 0m55,989s TBD 0m11,534s
user 0m53,560s TBD 0m10,214s
sys 0m2,428s TBD 0m2,975s
(Yeah, I was waiting a bit for that last var one.)
In situations like this you can simply check the length of the input and call the appropriate function for maximum performance.
addspace() {
if [ ${#1} -lt 100 ]; then
addspace_builtins "$1"
else
addspace_proccess "$1"
fi
}
I have two files, numbers.txt (1 \n 2 \n 3 \n 4 \n 5 \n) and alpha.txt (a \n n \n c \n d \n e \n).
Now I want to iterate over both files at the same time, something like this:
for num in `cat numbers.txt` && alpha in `cat alpha.txt`
do
echo $num "blah" $alpha
done
Or other idea I was having is
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
break
done
done
but this kind of code always takes the first value of $alpha.
I hope my problem is clear enough.
Thanks in advance.
Here is what I actually intended to do (it's just an example).
I have one more file, say template.txt, with this content:
variable1= NUMBER
variable2= ALPHA
I wanted to take the output from the two files, numbers.txt and alpha.txt (one line from each at a time), and replace NUMBER and ALPHA with the respective content from those two files.
So here is what I did once I learned how to iterate over both files together.
paste number.txt alpha.txt | while read num alpha
do
cp template.txt temp.txt
sed -i "{s/NUMBER/$num/g}" temp.txt
sed -i "{s/ALPHA/$alpha/g}" temp.txt
cat temp.txt >> final.txt
done
Now what i am having in final.txt is:
variable1= 1
variable2= a
variable1= 2
variable2= b
variable1= 3
variable2= c
variable1= 4
variable2= d
variable1= 5
variable2= e
variable1= 6
variable2= f
variable1= 7
variable2= g
variable1= 8
variable2= h
variable1= 9
variable2= i
variable1= 10
variable2= j
It's a very simple and stupid approach. I wanted to know whether there is any other way to do this.
Any suggestion will be appreciated.
No, your question isn't clear enough. Specifically, the way you wish to iterate through your files is unclear, but assuming you want to have an output such as:
1 blah a
2 blah b
3 blah c
4 blah d
5 blah e
you can use the paste utility, like this:
paste numbers.txt alpha.txt | while read num alpha ; do
echo "$num blah $alpha"
done
or even:
paste -d# numbers.txt alpha.txt | sed 's/#/ blah /'
Your first loop is impossible in bash. Your second one, without the break, would combine each line from numbers.txt with each line from alpha.txt, like this:
1 AND a
1 AND n
1 AND c
...
2 AND a
...
3 AND a
...
4 AND a
...
Your break makes it skip all lines from alpha.txt except the first one (bmk has already explained this in his answer).
It should be possible to write the correct loop using a while construct, but it would be rather ugly.
There are lots of easier alternatives that may be a better choice, depending on the specifics of your task. For example, you could try this:
paste numbers.txt alpha.txt
or, if you really want your "AND"s, then, something like this:
paste numbers.txt alpha.txt | sed 's/\t/ AND /'
And if your numbers are really sequential (and you can live without 'AND'), you can simply do:
cat -n alpha.txt
Here is an alternate solution according to the first model you suggested:
while read -u 5 a && read -u 6 b
do
echo $a $b
done 5<numbers.txt 6<alpha.txt
The notation 5<numbers.txt tells the shell to open numbers.txt on file descriptor 5. read -u 5 a means: read a value for a from file descriptor 5, which has been associated with numbers.txt.
The advantage of this approach over paste is that it gives you fine-grain control over how you merge the two files. For example you could read one line from the first file and twice from the second file.
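As a sketch of that last point, the loop below reads one line from the first file and two lines from the second on every iteration. The file contents and the temporary directory are invented for the example.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf '%s\n' 1 2 > "$workdir/numbers.txt"
printf '%s\n' a b c d > "$workdir/alpha.txt"

# One read from descriptor 5, two reads from descriptor 6 per iteration.
merged=$(
  while read -r -u 5 num && read -r -u 6 first && read -r -u 6 second; do
    echo "$num $first $second"
  done 5<"$workdir/numbers.txt" 6<"$workdir/alpha.txt"
)
rm -rf "$workdir"

printf '%s\n' "$merged"
```

The loop stops as soon as any of the three reads hits end of file, so a leftover line in either file is silently dropped.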
In your second example the inner loop is executed only once because of the break. It will simply jump out of the loop, i.e. you will always only get the first element of alpha.txt. Therefore I think you should remove it:
for num in `cat numbers.txt`
do
for alpha in `cat alpha.txt`
do
echo $num 'and' $alpha
done
done
If a nested loop isn't specifically your requirement and you just want the corresponding lines, you may try the following code:
for line in `cat numbers.txt`
do
echo $line "and" $(cat alpha.txt| head -n$line | tail -n1 )
done
head returns the first $line lines of alpha.txt and tail keeps the last of them, i.e. the line whose number equals the value of line.
@tollboy, I think the answer you are looking for is this:
count=1
for item in $(paste number.txt alpha.txt); do
if [[ "${item}" =~ [a-zA-Z] ]]; then
echo "variable${count}= ${item}" >> final.txt
elif [[ "${item}" =~ [0-9] ]]; then
echo "variable${count}= ${item}" >> final.txt
fi
count=$((count+1))
done
When you type paste number.txt alpha.txt in your console, you see:
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
From bash's point of view, $(paste number.txt alpha.txt) looks like this:
1 a 2 b 3 c 4 d 5 e 6 f 7 g 8 h 9 i 10 j
So for each item in that list, figure out if it is alpha or numeric, and print it to the output file.
Lastly, increment the count.
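Note that the loop above numbers every item consecutively (variable1, variable2, variable3, and so on). If the goal is the alternating variable1/variable2 layout shown in the question, a sketch along these lines may be closer. It assumes exactly one number and one letter per row, and uses shortened sample files in a temporary directory.

```shell
#!/usr/bin/env bash
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 > "$workdir/number.txt"
printf '%s\n' a b c > "$workdir/alpha.txt"

# Read one number and one letter per row, then fill both slots of the template.
paste "$workdir/number.txt" "$workdir/alpha.txt" |
  while IFS=$'\t' read -r num alpha; do
    printf 'variable1= %s\nvariable2= %s\n' "$num" "$alpha"
  done > "$workdir/final.txt"

final=$(cat "$workdir/final.txt")
rm -rf "$workdir"

printf '%s\n' "$final"
```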
I have tab delimited files with several columns. I want to count the frequency of occurrence of the different values in a column for all the files in a folder and sort them in decreasing order of count (highest count first). How would I accomplish this in a Linux command line environment?
It can use any common command line language like awk, perl, python etc.
To see a frequency count for column two (for example):
awk -F '\t' '{print $2}' * | sort | uniq -c | sort -nr
fileA.txt
z z a
a b c
w d e
fileB.txt
t r e
z d a
a g c
fileC.txt
z r a
v d c
a m c
Result:
3 d
2 r
1 z
1 m
1 g
1 b
Here is a way to do it in the shell:
FIELD=2
cut -f $FIELD * | sort| uniq -c |sort -nr
This is the sort of thing bash is great at.
The GNU site suggests this nice awk script, which prints both the words and their frequency.
Possible changes:
You can pipe through sort -nr (and reverse word and freq[word]) to see the result in descending order.
If you want a specific column, you can omit the for loop and simply write freq[$3]++ (replace 3 with the column number).
Here goes:
# wordfreq.awk --- print list of word frequencies
{
$0 = tolower($0) # remove case distinctions
# remove punctuation
gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
for (i = 1; i <= NF; i++)
freq[$i]++
}
END {
for (word in freq)
printf "%s\t%d\n", word, freq[word]
}
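Combining the two suggested changes (a single column, sorted by descending count) might look like the sketch below. The inline sample rows are invented, and the column number 2 is only an example.

```shell
column_counts=$(printf 'z\tz\ta\na\tb\tc\nw\td\te\nt\tr\te\nz\td\ta\n' |
  awk -F '\t' '
    { freq[$2]++ }                       # count each value seen in column 2
    END {
      for (value in freq)
        printf "%d %s\n", freq[value], value
    }
  ' | sort -k1,1nr -k2,2)

printf '%s\n' "$column_counts"
```

The second sort key only serves to give ties a stable, alphabetical order.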
Perl
This code computes the occurrences of all columns, and prints a sorted report for each of them:
# columnvalues.pl
while (<>) {
@Fields = split /\s+/;
for $i ( 0 .. $#Fields ) {
$result[$i]{$Fields[$i]}++
};
}
for $j ( 0 .. $#result ) {
print "column $j:\n";
@values = keys %{$result[$j]};
@sorted = sort { $result[$j]{$b} <=> $result[$j]{$a} || $a cmp $b } @values;
for $k ( @sorted ) {
print " $k $result[$j]{$k}\n"
}
}
Save the text as columnvalues.pl
Run it as: perl columnvalues.pl files*
Explanation
In the top-level while loop:
* Loop over each line of the combined input files
* Split the line into the @Fields array
* For every column, increment the result array-of-hashes data structure
In the top-level for loop:
* Loop over the result array
* Print the column number
* Get the values used in that column
* Sort the values by the number of occurrences
* Secondary sort based on the value (for example b vs g vs m vs z)
* Iterate through the result hash, using the sorted list
* Print the value and number of each occurrence
Results based on the sample input files provided by @Dennis
column 0:
a 3
z 3
t 1
v 1
w 1
column 1:
d 3
r 2
b 1
g 1
m 1
z 1
column 2:
c 4
a 3
e 2
.csv input
If your input files are .csv, change /\s+/ to /,/
Obfuscation
In an ugly contest, Perl is particularly well equipped.
This one-liner does the same:
perl -lane 'for $i (0..$#F){$g[$i]{$F[$i]}++};END{for $j (0..$#g){print "$j:";for $k (sort{$g[$j]{$b}<=>$g[$j]{$a}||$a cmp $b} keys %{$g[$j]}){print " $k $g[$j]{$k}"}}}' files*
Ruby(1.9+)
#!/usr/bin/env ruby
Dir["*"].each do |file|
h=Hash.new(0)
open(file).each do |row|
row.chomp.split("\t").each do |w|
h[ w ] += 1
end
end
h.sort{|a,b| b[1]<=>a[1] }.each{|x,y| print "#{x}:#{y}\n" }
end
Here is a tricky one approaching linear time (but probably not faster!) by avoiding sort and uniq, except for the final sort. It is based on... tee and wc instead!
$ FIELD=2
$ values="$(cut -f $FIELD *)"
$ mkdir /tmp/counts
$ cd /tmp/counts
$ echo | tee -a $values
$ wc -l * | sort -nr
9 total
3 d
2 r
1 z
1 m
1 g
1 b
$
Pure-Bash version:
FIELD=1
declare -A results
while read -a line; do
results[${line[$FIELD]:-(empty)}]=$((results[${line[$FIELD]:-(empty)}]+1));
done < file.txt
echo ${results[@]@A}
The key logic is to fill an associative array whose keys are the values found in the file and whose values are the number of occurrences:
$FIELD is the selected column number (zero-based, because read -a fills the array starting at index 0)
${line[$FIELD]} is the column value from that line in the file
${...:-(empty)} is a special case for empty values (what happens if there are fewer columns than expected?)
To have the output sorted in the expected OP format, a little more work is needed:
sort -rn < <(
for k in "${!results[@]}"; do
echo "${results[$k]} $k";
done
)
Warning: it works well for tab-delimited and space-delimited files, but badly for values that contain spaces.
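For tab-delimited input, one possible way around that limitation is to restrict IFS to a tab while reading. This is only a sketch: it needs bash 4 or later for associative arrays, the sample rows are invented, and because a tab counts as IFS whitespace, consecutive tabs (empty columns) are still collapsed.

```shell
#!/usr/bin/env bash
declare -A results
field=1   # zero-based index, so 1 selects the second column

# Split on tabs only, so a value such as "new york" stays in one piece.
while IFS=$'\t' read -r -a line; do
  key=${line[field]:-(empty)}
  results[$key]=$(( ${results[$key]:-0} + 1 ))
done < <(printf 'x\tnew york\ny\tparis\nz\tnew york\n')

# Print the counts, highest first.
for key in "${!results[@]}"; do
  printf '%d\t%s\n' "${results[$key]}" "$key"
done | sort -rn
```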