I'm trying to retrieve items from Node01.pc and put them into a table.
echo ${NodeCPU[0]} prints the item from the line correctly.
But when I use printf or echo to build the table, the output either breaks or does not show the array items.
The table formatting seems to work, and it displays correctly as long as the values don't come from the arrays. Could it be that there's more to the file than I can see?
Node01.pc contains
But I only need lines 3,5,7,9
I'm not sure what the best way to do this is, or whether I even need to store the items in arrays.
I thought about retrieving all the text from the text files and making a new file that contains all the data, but I'm not sure how to do that.
This is the code that I have right now.
Node01=($(cat Node01.pc))
Node02=($(cat Node02.pc))
Node03=($(cat Node03.pc))
Node04=($(cat Node04.pc))
Node05=($(cat Node05.pc))
NodeCPU=("${Node01[2]}" "${Node02[2]}" "${Node03[2]}" "${Node04[2]}" "${Node05[2]}")
NodeMEM=("${Node01[4]}" "${Node02[4]}" "${Node03[4]}" "${Node04[4]}" "${Node05[4]}")
NodeHDD=("${Node01[6]}" "${Node02[6]}" "${Node03[6]}" "${Node04[6]}" "${Node05[6]}")
NodeNET=("${Node01[8]}" "${Node02[8]}" "${Node03[8]}" "${Node04[8]}" "${Node05[8]}")
rows="%-10s| %-7s| %-7s| %-7s| %-7s\n"
printf "%-10s| %-7s| %-7s| %-7s| %-7s\n" NodeNumber CPU MEM HDD NET
printf "%.${TableWidth}s\n" "$seperator"
printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"
This is an example of what I want to display
NodeNumber | CPU | MEM | HDD | NET
1 | 10 | 20 | 20 | 40
2 | 10 | 20 | 20 | 40
3 | 10 | 20 | 20 | 40
4 | 10 | 20 | 20 | 40
5 | 10 | 20 | 20 | 40
EDIT This is what I'm currently getting:
NodeNumber| CPU | MEM | HDD | NET
| 4 | 70
| 5 | 90
| 6 | 100
| 6 | 70
| 40 | 40
Issue I'm having is with
printf "$rows" "$(( $i+1 ))" "${NodeCPU[i]}" "${NodeMEM[i]}" "${NodeHDD[i]}" "${NodeNET[i]}"

Why worry about all the separate arrays? Simply loop over all "Node*.pc" files in the current directory, read the contents of each file into an array with readarray, and then output the file count and array elements 2, 4, 6 and 8 in the proper format (adjust the elements as needed), e.g.
cnt=1 ## file counter
## print heading
printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
for i in Node*.pc; do ## loop over all Node*.pc files in directory
    readarray -t node < "$i" ## read contents into array
    ## output count and elements 2, 4, 6, 8 in proper format
    printf "%-11s| %-4s| %-4s| %-4s| %s\n" $((cnt++)) \
        "${node[2]}" "${node[4]}" "${node[6]}" "${node[8]}"
done
Example Use/Output
With the example data shown copied to the file Node01.pc in the current directory, you would get:
$ bash
NodeNumber | CPU | MEM | HDD | NET
1 | 70 | 80 | 4 | 4
(I called the script
It would output the information from each file as separate lines numbered 1, 2, ... Look things over and let me know if this is what you intended. (You can also do the same thing faster with awk by setting FS=\n and treating the lines as columns in a single record.)
You can do the same thing in awk with:
awk '
BEGIN {
    RS=""; FS="\n"
    printf "NodeNumber | CPU | MEM | HDD | NET\n----------------------------------\n"
}
NF >= 9 {
    printf "%-11s| %-4s| %-4s| %-4s| %s\n",++cnt,$3,$5,$7,$9
}
' Node*.pc
(note: in awk the field numbers are 1-based, while in bash the array indexes are 0-based)
Output is the same.
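The question asks whether the file might hold more than is visible. One possibility, and it is only an assumption because the raw file is not shown, is DOS line endings: a trailing carriage return on each value makes printf jump back to the start of the line and overwrite earlier columns. Below is a minimal sketch of the readarray loop wrapped in a hypothetical function, node_table, that also strips such a character if it is present.

```shell
#!/usr/bin/env bash
# node_table FILE...: print one table row per file, using lines 3, 5, 7 and 9
# (array indexes 2, 4, 6 and 8). The function name is hypothetical.
node_table() {
    local f cnt=1 cr=$'\r'
    local -a node
    printf '%-11s| %-4s| %-4s| %-4s| %s\n' NodeNumber CPU MEM HDD NET
    for f in "$@"; do
        mapfile -t node < "$f"        # one array element per line
        # Assumption: the files may have DOS line endings; drop a trailing CR.
        node=("${node[@]%$cr}")
        printf '%-11s| %-4s| %-4s| %-4s| %s\n' "$((cnt++))" \
            "${node[2]}" "${node[4]}" "${node[6]}" "${node[8]}"
    done
}
```

It would be called as node_table Node*.pc. Running cat -A Node01.pc (GNU cat) shows a ^M before each line end if carriage returns really are present.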


Inconsistency in output field separator

We have to find the difference (d) between the last two numbers in each row and display the rows with the highest values of d in ascending order.
1 | Latha | Third | Vikas | 90 | 91
2 | Neethu | Second | Meridian | 92 | 94
3 | Sethu | First | DAV | 86 | 98
4 | Theekshana | Second | DAV | 97 | 100
5 | Teju | First | Sangamithra | 89 | 100
6 | Theekshitha | Second | Sangamithra | 99 |100
Required OUTPUT
awk 'BEGIN{FS="|";OFS="$";}{
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
4 $ Theekshana $ Second $ DAV $ 97 $ 100$3
5 $ Teju $ First $ Sangamithra $ 89 $ 100$11
3 $ Sethu $ First $ DAV $ 86 $ 98$12
As you can see, there is a space before and after the $ sign, but for the last column (avg) there is no space. Please explain why this is happening.
awk 'BEGIN{FS=" | ";OFS="$";}{
print $1,$2,$3,$4,$5,$6,avg
}'|sort -nk7 -t "$"| tail -3
I have not specified | as the output field separator, but it still appears. Why is this happening, and why is the difference zero too?
I am just 6 days old in Unix, so please answer even if it's easy.
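Your field separator is only the pipe symbol, so the surrounding whitespace is part of each field, and that's what you see in the output. The last column (avg) is computed rather than read from the input, so it has no padding around it. When FS is longer than one character it is treated as a regular expression, where the pipe means alternation and needs to be escaped to match literally. In your second case, " | " means "space or space", so a single space is the field separator and each | becomes a field of its own.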
$ awk 'BEGIN {FS=" *\\| *"; OFS="$"}
{d=sqrt(($NF-$(NF-1))^2); $1=$1;
print d "\t" $0,d}' file | sort -n | tail -3 | cut -f2-
A slight rewrite eliminates the dependency on the number of fields and fixes the format.
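To see the three separator settings side by side, here is a small sketch. The function names are made up for illustration, and the sample line is one row from the question. Each function rebuilds the record with $1=$1 so that OFS is applied.

```shell
#!/usr/bin/env bash
# Illustrative helpers (hypothetical names); each reads lines on stdin.

# FS="|": a single character is taken literally, so spaces stay in the fields.
split_pipe() { awk 'BEGIN { FS = "|"; OFS = "$" } { $1 = $1; print }'; }

# FS=" | ": longer than one character, so it is a regex meaning "space or space".
split_space_or_space() { awk 'BEGIN { FS = " | "; OFS = "$" } { $1 = $1; print }'; }

# FS=" *\\| *": an escaped pipe with optional spaces on both sides.
split_trimmed() { awk 'BEGIN { FS = " *\\| *"; OFS = "$" } { $1 = $1; print }'; }

line='3 | Sethu | First | DAV | 86 | 98'
for f in split_pipe split_space_or_space split_trimmed; do
    printf '%-22s %s\n' "$f:" "$(printf '%s\n' "$line" | "$f")"
done
```

The first form keeps the padding, the second turns every | into its own field, and the third yields clean fields.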

bash looping and extracting of the fragment of txt file

I am dealing with the analysis of big number of dlg text files located within the workdir. Each file has a table (usually located in different positions of the log) in the following format:
File 1:
| | | | |
Clus | Lowest | Run | Mean | Num | Histogram
-ter | Binding | | Binding | in |
Rank | Energy | | Energy | Clus| 5 10 15 20 25 30 35
1 | -5.78 | 11 | -5.78 | 1 |#
2 | -5.53 | 13 | -5.53 | 1 |#
3 | -5.47 | 17 | -5.44 | 2 |##
4 | -5.43 | 20 | -5.43 | 1 |#
5 | -5.26 | 19 | -5.26 | 1 |#
6 | -5.24 | 3 | -5.24 | 1 |#
7 | -5.19 | 4 | -5.19 | 1 |#
8 | -5.14 | 16 | -5.14 | 1 |#
9 | -5.11 | 9 | -5.11 | 1 |#
10 | -5.07 | 1 | -5.07 | 1 |#
11 | -5.05 | 14 | -5.05 | 1 |#
12 | -4.99 | 12 | -4.99 | 1 |#
13 | -4.95 | 8 | -4.95 | 1 |#
14 | -4.93 | 2 | -4.93 | 1 |#
15 | -4.90 | 10 | -4.90 | 1 |#
16 | -4.83 | 15 | -4.83 | 1 |#
17 | -4.82 | 6 | -4.82 | 1 |#
18 | -4.43 | 5 | -4.43 | 1 |#
19 | -4.26 | 7 | -4.26 | 1 |#
The aim is to loop over all the dlg files and take the single line from the table that corresponds to the widest cluster (the one with the largest number of hashes in the Histogram column). In the above example this is the third line of the table.
3 | -5.47 | 17 | -5.44 | 2 |##
Then I need to add this line to the final_log.txt together with the name of the log file (that should be specified before the line). So in the end I should have something in following format (for 3 different log files):
"Name of the file 1": 3 | -5.47 | 17 | -5.44 | 2 |##
"Name_of_the_file_2": 1 | -5.99 | 13 | -5.98 | 16 |################
"Name_of_the_file_3": 2 | -4.78 | 19 | -4.44 | 3 |###
A possible model of my BASH workflow would be:
file_name2=$(basename "$f")
echo "Processing of $f..."
# take a name of the file and save it in the log
echo "$file_name" >> $PWD/final_results.log
# search of the beginning of the table inside of each file and save it after its name
cat $f |grep 'CLUSTERING HISTOGRAM' >> $PWD/final_results.log
# check whether it works
gedit $PWD/final_results.log
Here I need to replace the combination of echo and grep with something that extracts the selected line of the table.
You can use this one, expected to be fast enough. Extra lines in your files, besides the tables, are not expected to be a problem.
grep "#$" *.dlg | sort -rk11 | awk '!seen[$1]++'
grep fetches all the histogram lines which are then sorted in reverse order by last field, that means lines with most # on the top, and finally awk removes the duplicates. Note that when grep is parsing more than one file, it has -H by default to print the filenames at the beginning of the line, so if you test it for one file, use grep -H.
Result should be like this:
file1.dlg: 3 | -5.47 | 17 | -5.44 | 2 |##########
file2.dlg: 3 | -5.47 | 17 | -5.44 | 2 |####
file3.dlg: 3 | -5.47 | 17 | -5.44 | 2 |#######
Here is a modification to get the first appearance in case of several equal maximum lines in a file:
grep "#$" *.dlg | sort -s -rk11 | awk '!seen[$1]++'
The -s option makes the sort stable, so lines that compare equal on the key keep their original input order. Without it, sort breaks ties by comparing the whole line, and that last-resort comparison is reversed by -r as well.
Second solution
Here using only awk:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) print i ":" row[i]}' *.dlg
Update: if you execute it from a different directory and want to keep only the basename of every file, remove the path prefix:
awk -F"|" '/#$/ && $NF > max[FILENAME] {max[FILENAME]=$NF; row[FILENAME]=$0}
END {for (i in row) {name=i; sub(".*/","",name); print name ":" row[i]}}'
Probably makes more sense as an Awk script.
This picks the first line with the widest histogram in the case of a tie within an input file.
awk 'FNR == 1 { if(sel) print sel; sel = ""; max = 0 }
FNR < 9 { next }
length($10) > max { max = length($10); sel = FILENAME ":" $0 }
END { if (sel) print sel }' ./"$prot"/*.dlg
This assumes the histograms are always the tenth field; if your input format is even messier than the lump you show, maybe adapt to taste.
In some more detail, the first line triggers on the first line of each input file. If we have collected a previous line (meaning this is not the first input file), print that, and start over. Otherwise, initialize for the first input file. Set sel to nothing and max to zero.
The second line skips lines 1-8 which contain the header.
The third line checks if the current line's histogram is longer than max. If it is, update max to this histogram's length, and remember the current line in sel.
The last line is spillover for when we have processed all files. We never printed the sel from the last file, so print that too, if it's set.
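For comparison, the same selection can be sketched in plain bash. This is a minimal sketch, not a replacement for the awk script: the function name widest_cluster is hypothetical, and it assumes that table rows start with a rank number and end with a run of # right after the last |. Instead of skipping a fixed number of header lines, it ignores every line that does not match that shape.

```shell
#!/usr/bin/env bash
# widest_cluster FILE: print "FILE:ROW" for the first table row whose
# histogram (the trailing run of '#') is the longest in that file.
widest_cluster() {
    local file=$1 line best="" max=0
    # Assumed row shape: optional spaces, a rank number, '|', ..., '|', hashes.
    local re='^ *[0-9]+ *[|].*[|](#+)$'
    while IFS= read -r line; do
        [[ $line =~ $re ]] || continue
        if (( ${#BASH_REMATCH[1]} > max )); then   # strictly longer: first row wins a tie
            max=${#BASH_REMATCH[1]}
            best=$line
        fi
    done < "$file"
    if [[ -n $best ]]; then
        printf '%s:%s\n' "$file" "$best"
    fi
}
```

A loop such as for f in ./"$prot"/*.dlg; do widest_cluster "$f"; done >> final_results.log would collect one line per file. Expect it to be slower than awk on large logs.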
If you mean to say we should find the lines between CLUSTERING HISTOGRAM and the end of the table, we should probably have more information about what the surrounding lines look like. Maybe something like this, though;
awk '/CLUSTERING HISTOGRAM/ { if (sel) print sel; looking = 1; sel = ""; max = 0 }
!looking { next }
looking > 1 && $1 != looking { looking = 0; nextfile }
$1 == looking && length($10) > max { max = length($10); sel = FILENAME ":" $0 }
$1 == looking { looking++ }
END { if (sel) print sel }' ./"$prot"/*.dlg
This sets looking to 1 when we see CLUSTERING HISTOGRAM, then counts up to the first line where looking is no longer increasing.
I would suggest processing using awk:
for i in $FILES
do
    echo -n \""$i\": "
    awk 'BEGIN { outputlength = 0 }
    /^ *[0-9]+/ { # process only lines that start with a number
        if (length(substr($10, 2)) > outputlength) { # if line has more hashes, store it
            outputlength = length(substr($10, 2))
            output = $0
        }
    }
    END { print output } # output the resulting line
    ' "$i"
done

Join two csv files if value is between interval in file 2

I have two csv files that I need to join. F1 has millions of lines, F2 has thousands of lines. I need to join these files if the position in file F1 (F1.pos) is between F2.start and F2.end. Is there any way to do this in bash? I have code in Python that goes from pandas to sqlite3, and I am looking for something quicker.
Table F1 looks like:
| name | pos |
|------ |------ |
| a | 1020 |
| b | 1200 |
| c | 1800 |
Table F2 looks like:
| interval_name | start | end |
|--------------- |------- |------ |
| int1 | 990 | 1090 |
| int2 | 1100 | 1150 |
| int3 | 500 | 2000 |
Result should look like:
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int1 | 990 | 1090 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
DISCLAIMER: Use dedicated/local tools if available, this is hacking:
There is an apparent error in your desired output: name b should not match int1.
$ tail -n+1 *.csv
==> f1.csv <==
==> f2.csv <==
$ awk -F, -vOFS=, '
BEGIN {print "name,pos,interval_name,start,end"}
FNR==1 {next}
NR==FNR {Int[$1] = $2 "," $3; next}
{
    for(i in Int) {
        split(Int[i], I)
        if($2 >= I[1] && $2 <= I[2]) print $0, i, Int[i]
    }
}
' f2.csv f1.csv
This is not particularly efficient in any way; the only sorting used is to ensure that the Int array is parsed in the correct order, which changes if your sample data is not indicative of the actual schema. I would be very interested to know how my solution performs vs pandas.
Here's one in awk. It hashes the smaller file records to arrays and for each of the bigger file records it iterates thru the hashes so it is slow:
$ awk '
NR==FNR { # hash f2 records
    start[NR]=$4
    end[NR]=$6
    data[NR]=substr($0,2)
    next
}
FNR<=2 { # mind the front matter
    print $0 data[FNR]
    next
}
{ # check if in range and output
    for(i in start)
        if($4>start[i] && $4<end[i])
            print $0 data[i]
}' f2 f1
| name | pos | interval_name | start | end |
|------ |------ |--------------- |------- |------ |
| a | 1020 | int1 | 990 | 1090 |
| a | 1020 | int3 | 500 | 2000 |
| b | 1200 | int3 | 500 | 2000 |
| c | 1800 | int3 | 500 | 2000 |
I doubt that a bash script would be faster than a python script. Just don't import the files into a database – write a custom join function instead!
The best way to join depends on your input data. If nearly all F1.pos are inside of nearly all intervals then a naive approach would be the fastest. The naive approach in bash would look like this:
#! /bin/bash
join --header -t, -j99 F1 F2 |
sed 's/^,//' |
awk -F, 'NR>1 && $2 >= $4 && $2 <= $5'
# NR>1 is only there to skip the column headers
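The same naive comparison of every F1 line against every F2 interval can also be sketched without join. This is a plain bash nested loop rather than the pipeline above; the function name interval_join is hypothetical. It assumes comma-separated files with one header line each and plain decimal integers without leading zeros, and it will be slow on millions of lines.

```shell
#!/usr/bin/env bash
# interval_join F1 F2: print name,pos,interval_name,start,end for every
# F1 row whose pos lies within [start, end] of an F2 row.
interval_join() {
    local f1=$1 f2=$2 name pos iname start end i
    local -a inames=() starts=() ends=()
    # Load the smaller file (F2) into parallel arrays, skipping its header.
    {
        read -r _
        while IFS=, read -r iname start end; do
            inames+=("$iname"); starts+=("$start"); ends+=("$end")
        done
    } < "$f2"
    echo "name,pos,interval_name,start,end"
    # Stream the larger file (F1) and test each position against each interval.
    {
        read -r _
        while IFS=, read -r name pos; do
            for i in "${!inames[@]}"; do
                if (( pos >= starts[i] && pos <= ends[i] )); then
                    printf '%s,%s,%s,%s,%s\n' "$name" "$pos" \
                        "${inames[i]}" "${starts[i]}" "${ends[i]}"
                fi
            done
        done
    } < "$f1"
}
```

Called as interval_join F1 F2, it prints the header followed by the matching pairs in F1 order.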
However, this will be very slow if there are only a few intersections, for instance, when the average F1.pos is in only 5 intervals. In this case the following approach will be much faster. Implement it in a programming language of your choice; bash is not appropriate for this:
1. Sort F1 by pos in ascending order.
2. Sort F2 by start and then by end in ascending order.
3. For each sorted file, keep a pointer to a line, starting at the first line.
4. Repeat until F1's pointer reaches the end:
   a. For the current F1.pos advance F2's pointer until F1.pos ≥ F2.start.
   b. Lock F2's pointer, but continue to read lines until F1.pos ≤ F2.end. Print the read lines in the output format name,pos,interval_name,start,end.
   c. Advance F1's pointer by one line.
Only sorting the files could be actually faster in bash. Here is a script to sort both files.
#! /bin/bash
sort -t, -k2,2n F1-without-headers > F1-sorted
sort -t, -k2,2n -k3,3n F2-without-headers > F2-sorted
Consider using LC_ALL=C, -S N% and --parallel N to speed up the sorting process.

Sorting tab delimited numbers by column with pure bash script.

I'm stuck on some homework. The requirements of the assignment are to accept an input file and perform some statistics on the values. The user may specify whether to calculate the statistics by row or by column. The shell script must be pure bash, so I can't use awk, sed, perl, python etc.
sample input:
1 1 1 1 1 1 1
39 43 4 3225 5 2 2
6 57 8 9 7 3 4
3 36 8 9 14 4 3
3 4 2 1 4 5 5
6 4 4814 7 7 6 6
I can't figure out how to sort and process the data by column. My code for processing the rows works fine.
while read -r line
do
    echo $(printf "%d\n" $line | sort -n) | tr ' ' \\t > sorted.txt
    #I perform the stats calculations
    # for row line by working with the temp file sorted.txt
done
How could I process this data by column? I've never worked with shell script so I've been staring at this for hours.
If you wanted to analyze by columns you'll need the cols value first (number of columns). head -n 1 gives you the first row, and NF counts the number of fields, giving us the number of columns.
cols=$(head -n 1 input.txt | awk '{print NF}');
Then you can use cut with the '\t' delimiter to grab every column from input.txt, and run it through sort -n, as you did in your original post.
$ for i in $(seq 1 "$cols"); do cut -f"$i" -d$'\t' input.txt; done | sort -n > output.txt
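Since the assignment rules out awk, here is a minimal sketch that counts the columns and extracts a single column using only bash built-ins, with sort as the one external command (the question already uses it). The function names count_columns and column_sorted are hypothetical, and the input is assumed to be tab-delimited with no empty fields.

```shell
#!/usr/bin/env bash
# count_columns FILE: number of tab-separated fields on the first line.
count_columns() {
    local -a fields
    IFS=$'\t' read -r -a fields < "$1"
    echo "${#fields[@]}"
}

# column_sorted FILE N: values of column N (1-based), one per line,
# sorted numerically.
column_sorted() {
    local file=$1 n=$2
    local -a fields
    while IFS=$'\t' read -r -a fields; do
        printf '%s\n' "${fields[n-1]}"    # bash arrays are 0-based
    done < "$file" | sort -n
}
```

A loop such as for ((c = 1; c <= $(count_columns input.txt); c++)); do column_sorted input.txt "$c"; done would then process the columns one at a time, so the statistics can be computed per column instead of over all values at once.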
For rows, you can use the shell built-in printf with the format specifier %d for integers. The sort command works on lines of input, so we replace spaces ' ' with newlines \n using the tr command:
$ cat input.txt | while read line; do echo $(printf "%d\n" $line); done | tr ' ' '\n' | sort -n > output.txt
Now take the output file to gather our statistics:
Min: cat output.txt | head -n 1
Max: cat output.txt | tail -n 1
Sum: (courtesy of Dimitre Radoulov): cat output.txt | paste -sd+ - | bc
Mean: (courtesy of porges): cat output.txt | awk '{ total += $1 } END { print total/NR }'
Median: (courtesy of maxschlepzig): cat output.txt | awk ' { a[i++]=$1; } END { print a[int(i/2)]; }'
Histogram: cat output.txt | uniq -c
8 1
3 2
4 3
6 4
3 5
4 6
3 7
2 8
2 9
1 14
1 36
1 39
1 43
1 57
1 3225
1 4814
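Because the question asks for pure bash, the statistics above can also be sketched without awk, paste or bc. The function name stats is hypothetical. It assumes integer input, one value per line on stdin, so the mean is truncated by integer division. The median follows the same convention as the awk one-liner above (element at index n/2 of the sorted list).

```shell
#!/usr/bin/env bash
# stats: read integers (one per line) from stdin and print
# min, max, sum, truncated mean and median on one line.
stats() {
    local -a v
    local x n sum=0
    mapfile -t v < <(sort -n)      # sort is the only external command
    n=${#v[@]}
    if (( n == 0 )); then
        return 1                   # nothing to summarise
    fi
    for x in "${v[@]}"; do
        sum=$(( sum + x ))
    done
    echo "min=${v[0]} max=${v[n-1]} sum=$sum mean=$(( sum / n )) median=${v[n/2]}"
}
```

For example, printf '3\n1\n2\n4\n' | stats prints min=1 max=4 sum=10 mean=2 median=3.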

Print out the value with the highest number of occurrences in a file

In a bash shell script, I want to go through a list of numbers and then print out the number that occurs most often. If there are several different numbers appearing an equal amount of times, I want to print the highest number. For example, in a file like this:
I want to print the value 20.
How can I achieve this?
If the numbers are in a file, one per line:
sort -n myfile | uniq -c | sort -k1,1nr -k2,2nr | head -1
without the count:
A=$(sort -n myfile | uniq -c | sort -k1,1nr -k2,2nr | head -1)
set -- $A
echo $2
You can use this command -
echo 10 10 10 15 15 20 20 20 20 | sed 's/ /\n/g' | sort | uniq -c | sort -V | tail -n 1 | awk '{print $2}'
It will print the number you want.
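The tie-breaking rule from the question (equal counts, so pick the highest number) can also be made explicit with a bash associative array. This is a minimal sketch under a few assumptions: the function name most_frequent is hypothetical, the input is one integer per line on stdin, and bash 4 or later is required for associative arrays.

```shell
#!/usr/bin/env bash
# most_frequent: read integers (one per line) from stdin and print the
# one that occurs most often; on a tie, print the largest value.
most_frequent() {
    local -A count=()
    local n c best="" bestcount=0
    while read -r n; do
        [[ -n $n ]] || continue                  # skip blank lines
        count[$n]=$(( ${count[$n]:-0} + 1 ))
    done
    for n in "${!count[@]}"; do
        c=${count[$n]}
        # Prefer a higher count; with equal counts prefer the larger number.
        if (( c > bestcount || (c == bestcount && n > best) )); then
            best=$n
            bestcount=$c
        fi
    done
    if [[ -n $best ]]; then
        echo "$best"
    fi
}
```

With the values from the echo example above placed one per line, most_frequent < myfile prints 20.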
