Reading a log file into 2D/1D array in bash script - bash

I have a file log.txt as below
1 0.694003 5.326995 7.500997 6.263974 0.633941 36.556128
2 2.221990 4.422010 4.652992 5.964420 0.660997 51.874905
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
4 1.914000 5.451991 0.668012 6.355688 0.634081 106.733134
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
6 1.997993 1.406001 7.977006 3.923551 0.517551 37.611894
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
8 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
I want to read the 6th column of this file into an array myArray so that it gives the following:
echo ${myArray[9]} = 0.396635
Thank you.

Here's a way to do this (bash 4+), assuming log.txt's first column starts at 1 and doesn't skip any numbers.
readarray -t myArray < <(tr -s ' ' < log.txt | cut -d' ' -f6)
echo ${myArray[8]}
tr -s ' ' collapses the whitespace, for easier manipulation
cut -d' ' -f6 selects the 6th space separated column
<(...) is process substitution: it exposes the command's output as a file-like path that can be read
readarray reads lines from the file into the variable myArray
Note that the array is zero-indexed, so the 9th line's value is at [8] rather than [9].
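The same extraction could be done with a single awk call instead of tr + cut; a quick sketch, assuming the same log.txt layout:
readarray -t myArray < <(awk '{print $6}' log.txt)
echo "${myArray[8]}"   # prints 0.396635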

Assumption:
first column of file is an integer
first column of file may not be sequential
OP wants (needs?) the array index to match the value in the first column
Sample data file:
$ cat log.txt
3 4.376005 7.440002 6.260000 6.238917 0.728308 10.927455
5 2.530005 0.000000 8.084005 3.916278 0.687023 2252.538670
7 0.971998 1.823007 8.804005 4.110159 0.567905 905.995133
9 1.015001 3.996992 7.781004 3.547329 0.396635 16.883011
23 0.480005 3.109009 8.711002 4.060954 0.508963 553.712280
One idea using awk (to parse the input file):
$ awk '{print $1,$6}' log.txt
3 0.728308
5 0.687023
7 0.567905
9 0.396635
23 0.508963
We can then feed this into a while loop to build the array:
unset myArray
while read -r ndx value
do
myArray["${ndx}"]="${value}"
done < <(awk '{print $1,$6}' log.txt)
Verify contents of array:
$ typeset -p myArray
declare -a myArray=([3]="0.728308" [5]="0.687023" [7]="0.567905" [9]="0.396635" [23]="0.508963")
$ for ndx in "${!myArray[@]}"
do
echo "index = ${ndx} ; value = ${myArray[${ndx}]}"
done
index = 3 ; value = 0.728308
index = 5 ; value = 0.687023
index = 7 ; value = 0.567905
index = 9 ; value = 0.396635
index = 23 ; value = 0.508963

Another approach, using just bash 4+ builtins (if that's acceptable):
#!/usr/bin/env bash
mapfile -t rows < log.txt
read -ra column <<< "${rows[8]}"
echo "${column[5]}"

How to print 1-10,11-20 and so on number of rows of a file in loop using shell? [closed]

I have a file consisting of 4000 rows. I need to iterate over the records of that file in a shell script, extract the first 10 rows and send them to my Java code (which I already wrote), then the next 10 rows, and so on.
To pass 10 lines at a time as arguments to your script:
< file xargs -d$'\n' -n 10 myscript
To pipe 10 lines at a time as input to your script:
< file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | myscript' {}
Assuming your input is in a file named file which I'm creating with 30 instead of 4000 lines of input:
$ seq 30 > file
and modifying to have some lines that contain spaces, some that contain shell variables, and some that contain regexp and globbing chars to show no type of shell expansion is being done:
$ head -10 file
1
here is a multi-field line
3
4
$HOME
6
.*
8
9
10
Here's 10 args at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 awk 'BEGIN{for (i=1; i<ARGC; i++) print i, "<" ARGV[i] ">"; exit} END{print "---"}'
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
and here's 10 lines of input at a time being passed to an awk script:
$ < file xargs -d$'\n' -n 10 sh -c 'printf "%s\n" "$@" | awk '\''{print NR, "<" $0 ">"} END{print "---"}'\''' {}
1 <1>
2 <here is a multi-field line>
3 <3>
4 <4>
5 <$HOME>
6 <6>
7 <.*>
8 <8>
9 <9>
10 <10>
---
1 <11>
2 <12>
3 <13>
4 <14>
5 <15>
6 <16>
7 <17>
8 <18>
9 <19>
10 <20>
---
1 <21>
2 <22>
3 <23>
4 <24>
5 <25>
6 <26>
7 <27>
8 <28>
9 <29>
10 <30>
---
Considering that the OP wants to pass the lines as arguments to their code, the following could be tried (untested, since I don't have the OP's Java code):
awk '
FNR%10==0{
system("your_java_code " value OFS $0)
value=""
}
{
value=(value?value OFS:"")$0
}
END{
if(value){
system("your_java_code " value)
}
}
' Input_file
OR
awk '
{
value=(value?value OFS:"")$0
}
FNR%10==0{
system("your_java_code " value)
value=""
}
END{
if(value){
system("your_java_code " value)
}
}
' Input_file
PS: To be on the safe side, I kept the END section of the awk code so that, in case there are leftover lines (say the total number of lines is not evenly divisible by 10), it will still call the Java program with the remaining lines.
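A quick dry run of the second variant, with echo standing in for the Java program and seq providing stand-in data (a sketch, not the OP's actual setup):
seq 25 | awk '
{
  value=(value?value OFS:"")$0
}
FNR%10==0{
  system("echo " value)
  value=""
}
END{
  if(value){
    system("echo " value)
  }
}
'
This prints three lines (1..10, 11..20, and the leftover 21..25), confirming that the END block handles a partial last group.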
This might work for you (GNU parallel):
parallel -kN10 javaProgram :::: file
This will pass the lines 1-10, 11-20, ... as arguments to program javaProgram
If you want to pass 10 lines at a time, use:
parallel -kN10 --cat javaProgram :::: file
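To see the grouping before wiring in the real program, echo can stand in for javaProgram (a sketch, assuming GNU parallel is installed):
seq 30 > file
parallel -kN10 echo :::: file   # prints three lines of 10 numbers each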
Sounds to me like you want to slice out rows from a file, then pipe those rows to java. This interpretation differs from the other answers, so let me know if I'm not understanding you:
$ file=/etc/services
$ count=$(wc -l < "${file}")
$ start=1
$ stride=10
$ for ((i=start; i<=count; i+=stride)); do
awk -v i="${i}" -v stride="${stride}" \
'NR > (i+stride) { exit } NR >= i && NR < (i + stride)' "${file}" \
| java ...
done
file holds the path to the data rows. count is the total count of rows in that file. start is the first row, stride is how many you want to slice out in each iteration.
The for loop then performs the stride addition, while awk slices out the rows so numbered, and we pipe them to the java program on standard input.
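The same slice could also be taken with sed instead of awk, if preferred (an untested sketch reusing the file/count/start/stride variables above):
for ((i=start; i<=count; i+=stride)); do
  sed -n "${i},$((i + stride - 1))p" "${file}" \
  | java ...
done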
Assuming that you are passing the 10-line groups from your file to your script as command-line arguments, here is one way:
rows=4000 # the number of rows in file
groupsize=10 # the size of lines groups
OIFS="$IFS"; IFS=$'\n' # use newline as input field separator to avoid `for` splitting on spaces
groups=$(($rows / $groupsize)) # the number of groups of lines
for i in $(seq 1 $groups); do # loop through each group of lines
from=$((($i * $groupsize) - $groupsize + 1))
to=$(($i * $groupsize))
# build the arguments for each script invocation by concatenating each group of lines
arguments=""
for line in `sed -n -e ${from},${to}p file`; do # 'file' is your input file name
arguments="$arguments \"$line\""
done
echo script $arguments # remove echo and replace 'script' with your script name
done
IFS="$OIFS" # restore original input field separator
Like this :
for ((i=0; i<=4000; i+=10)); do
arr=( ) # create a new empty array
for ((j=i; j<i+10; j++)); do
arr+=( $j ) # add id to array
done
printf '%s\n' "${arr[#]}" # or execute command with all the id
done

How to find the max value in array without using sort cmd in shell script

I have an array=(4,2,8,9,1,0) and I don't want to sort the array to find the highest number, because I need the index of the highest number as it is, so I can use it for further reference.
Expected output:
9 index value => 3
Can somebody help me to achieve this?
Slight variation with a loop using the ternary conditional operator and no assumptions about range of values:
arr=(4 2 8 9 1 0)
max=${arr[0]}
maxIdx=0
for ((i = 1; i < ${#arr[@]}; ++i)); do
maxIdx=$((arr[i] > max ? i : maxIdx))
max=$((arr[i] > max ? arr[i] : max))
done
printf '%s index => values %s\n' "$maxIdx" "$max"
The only assumption is that array indices are contiguous. If they aren't, it becomes a little more complex:
arr=([1]=4 [3]=2 [5]=8 [7]=9 [9]=1 [11]=0)
indices=("${!arr[#]}")
maxIdx=${indices[0]}
max=${arr[maxIdx]}
for i in "${indices[#]:1}"; do
((arr[i] <= max)) && continue
maxIdx=$i
max=${arr[i]}
done
printf '%s index => values %s\n' "$maxIdx" "$max"
This first gets the indices into a separate array and sets the initial maximum to the value corresponding to the first index; then, it iterates over the indices, skipping the first one (the :1 notation), checks if the current element is a new maximum, and if it is, stores the index and the maximum.
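For the sparse sample array above, the loop ends with maxIdx=7 and max=9, so the final printf prints:
7 index => values 9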
Without using sort, you can use a simple loop in shell. Here is a sample bash code:
#!/usr/bin/env bash
array=(4 2 8 9 1 0)
for i in "${!array[#]}"; do
[[ -z $max ]] || (( ${array[i]} > $max )) && { max="${array[i]}"; maxind=$i; }
done
echo "max=$max, maxind=$maxind"
max=9, maxind=3
arr=(4 2 8 9 1 0)
paste <(printf "%s\n" "${arr[@]}") <(seq 0 $((${#arr[@]} - 1)) ) |
sort -n -k1,1 |
tail -n1 |
sed 's/\t/ index value => /'
Print each array element on a newline with printf
Print array indexes with seq
Join both streams using paste
Numerically sort the lines on the first field (i.e. the array value) using sort -n
Print the last line tail -n1
The array value and index are separated by a tab. Substitute the tab with whatever output string you want using sed. One could also use e.g. cut -f2 to get only the index, or read a b < <( ... ) to read the numbers into variables, as sketched below.
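A sketch of that read variant (note the process substitution needs < <( ... ), and -n keeps the sort numeric):
arr=(4 2 8 9 1 0)
read -r max maxIdx < <(
  paste <(printf '%s\n' "${arr[@]}") <(seq 0 $((${#arr[@]} - 1))) |
  sort -n -k1,1 |
  tail -n1
)
echo "$max index value => $maxIdx"   # 9 index value => 3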
Using Perl
$ export data=4,2,8,9,1,0
$ echo $data | perl -ne ' map{$i++; if($_>$x) {$x=$_;$id=$i} } split(","); print "max=$x", " index=",--${id},"\n" '
max=9 index=3
$

Renumbering numbers in a text file based on an unique mapping

I have a big txt file with 2 columns and more than 2 million rows. Every value represents an id and there may be duplicates. There are about 100k unique ids.
1342342345345 34523453452343
0209239498238 29349203492342
2349234023443 99203900992344
2349234023443 182834349348
2923000444 9902342349234
I want to identify each id and re-number all of them starting from 1. It should re-number duplicates also using the same new id. If possible, it should be done using bash.
The output could be something like:
123 485934
34 44834
167 34564
167 2345
2 34564
Doing this in pure bash will be really slow. I'd recommend:
tr -s '[:blank:]' '\n' <file |
sort -un |
awk '
NR == FNR {id[$1] = FNR; next}
{for (i=1; i<=NF; i++) {$i = id[$i]}; print}
' - file
4 8
3 7
5 9
5 2
1 6
With bash and sort:
#!/bin/bash
shopt -s lastpipe
declare -A hash # declare associative array
index=1
# read file and fill associative array
while read -r a b; do
echo "$a"
echo "$b"
done <file | sort -nu | while read -r x; do
hash[$x]="$((index++))"
done
# read file and print values from associative array
while read -r a b; do
echo "${hash[$a]} ${hash[$b]}"
done < file
Output:
4 8
3 7
5 9
5 2
1 6
See: man bash and man sort
Pure Bash, with a single read of the file:
declare -A hash
index=1
while read -r a b; do
[[ ${hash[$a]} ]] || hash[$a]=$((index++)) # assign index only if not set already
[[ ${hash[$b]} ]] || hash[$b]=$((index++)) # assign index only if not set already
printf '%s %s\n' "${hash[$a]}" "${hash[$b]}"
done < file > file.indexed
Notes:
the index is assigned in the order read (not based on sorting)
we make a single pass through the file (not two as in other solutions)
Bash's read is slower than awk; however, if the same logic is implemented in Perl or Python, it will be much faster
this solution is more CPU bound because of the hash lookups
Output:
1 2
3 4
5 6
5 7
8 9
Just keep a monotonic counter and a table of seen numbers; when you see a new id, give it the value of the counter and increment:
awk '!a[$1]{a[$1]=++N} {$1=a[$1]} !a[$2]{a[$2]=++N} {$2=a[$2]} 1' input
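On the sample input above this produces the same order-of-appearance numbering as the single-pass bash solution:
1 2
3 4
5 6
5 7
8 9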
awk 'NR==FNR { ids[$1] = ++c; next }
{ print ids[$1], ids[$2] }
' <( { cut -d' ' -f1 renum.in; cut -d' ' -f2 renum.in; } | sort -nu ) renum.in
Join the two columns into one, sort that into numerical order (-n) and make it unique (-u), then use awk to turn this sequence into an array mapping old ids to new ones. Then, for each line of the input, swap the ids and print.
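On the sample data above (saved as renum.in) this produces the same numbering as the first sort-based answer: 4 8, 3 7, 5 9, 5 2, 1 6.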

Use Bash scripting to select columns and rows with specific name

I'm working with a very large text file (4GB) and I want to make a smaller file with only the data I need in it. It is a tab-delimited file and there are row and column headers. I basically want to select a subset of the data that has a given column and/or row name.
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
I'm planning to have a file with a list of the columns I want.
colname_1 colname_3
I'm a newbie to bash scripting and I really don't know how to do this. I saw other examples, but they all knew what column number they wanted in advance and I don't. Sorry if this is a repeat question; I tried to search.
I would want the result to be
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Bash works best as "glue" between standard command-line utilities. You can write loops which read each line in a massive file, but it's painfully slow because bash is not optimized for speed. So let's see how to use a few standard utilities -- grep, tr, cut and paste -- to achieve this goal.
For simplicity, let's put the desired column headings into a file, one per line. (You can always convert a tab-separated line of column headings to this format; we're going to do just that with the data file's column headings. But one thing at a time.)
$ printf '%s\n' colname_{1,3} > columns
$ cat columns
colname_1
colname_3
An important feature of the printf command-line utility is that it repeats its format until it runs out of arguments.
Now, we want to know which column in the data file each of these column headings corresponds to. We could try to write this as a loop in awk or even in bash, but if we convert the header line of the data file into a file with one header per line, we can use grep to tell us, by using the -n option (which prefixes the output with the line number of the match).
Since the column headers are tab-separated, we can turn them into separate lines just by converting tabs to newlines using tr:
$ head -n1 giga.dat | tr '\t' '\n'

colname_1
colname_2
colname_3
colname_4
Note the blank line at the beginning. That's important, because colname_1 actually corresponds to column 2, since the row headers are in column 1.
So let's look up the column names. Here, we will use several grep options:
-F The pattern argument consists of several patterns, one per line, which are interpreted as ordinary strings instead of regexes.
-x The pattern must match the complete line.
-n The output should be prefixed by the line number of the match.
We could also use -f columns to read the patterns from the file named columns, or, if we're using bash, the bashism "$(<columns)" to insert the contents of the file as a single argument to grep; both forms show up below. For now, we'll use $(cat columns):
$ head -n1 giga.dat | tr '\t' '\n' | grep -Fxn "$(cat columns)"
2:colname_1
4:colname_3
OK, that's pretty close. We just need to get rid of everything other than the line number; comma-separate the numbers, and put a 1 at the beginning.
$ { echo 1
> grep -Fxn "$(<columns)" < <(head -n1 giga.dat | tr '\t' '\n')
> } | cut -f1 -d: | paste -sd,
1,2,4
cut -f1 Select field 1. The argument could be a comma-separated list, as in cut -f1,2,4.
cut -d: Use : instead of tab as a field separator ("delimiter")
paste -s Concatenate the lines of a single file instead of corresponding lines of several files
paste -d, Use a comma instead of tab as a field separator.
So now we have the argument we need to pass to cut in order to select the desired columns:
$ cut -f"$({ echo 1
> head -n1 giga.dat | tr '\t' '\n' | grep -Fxn -f columns
> } | cut -f1 -d: | paste -sd,)" giga.dat
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
You can actually do this by keeping track of the array indexes of the columns that match the column names in your column-list file. After you have found those indexes in the data file's header, you simply read the data file (beginning at the second line) and output the row label plus the data for the columns at the indexes you recorded.
There are probably several ways to approach this, and the following assumes the data in each column does not contain any whitespace. The use of arrays presumes bash (or another advanced shell supporting arrays), not POSIX shell.
The script takes two file names as input. The first is your original data file. The second is your column list file. An approach could be:
#!/bin/bash
declare -a cols ## array holding original columns from original data file
declare -a csel ## array holding columns to select (from file 2)
declare -a cpos ## array holding array indexes of matching columns
cols=( $(head -n 1 "$1") ) ## fill cols from 1st line of data file
csel=( $(< "$2") ) ## read select columns from file 2
## fill column position array
for ((i = 0; i < ${#csel[@]}; i++)); do
for ((j = 0; j < ${#cols[@]}; j++)); do
[ "${csel[i]}" = "${cols[j]}" ] && cpos+=( $j )
done
done
printf " "
for ((i = 0; i < ${#csel[@]}; i++)); do ## output header row
printf " %s" "${csel[i]}"
done
printf "\n" ## output newline
unset cols ## unset cols to reuse in reading lines below
while read -r line; do ## read each data line in data file
cols=( $line ) ## separate into cols array
printf "%s" "${cols[0]}" ## output row label
for ((j = 0; j < ${#cpos[@]}; j++)); do
[ "$j" -eq "0" ] && { ## handle format for first column
printf "%5s" "${cols[$((${cpos[j]}+1))]}"
continue
} ## output remaining columns
printf "%13s" "${cols[$((${cpos[j]}+1))]}"
done
printf "\n"
done < <( tail -n+2 "$1" )
Using your example data as follows:
Data File
$ cat dat/col+data.txt
colname_1 colname_2 colname_3 colname_4
row_1 1 2 3 5
row_2 4 6 9 1
row_3 2 3 4 2
Column Select File
$ cat dat/col.txt
colname_1 colname_3
Example Use/Output
$ bash colnum.sh dat/col+data.txt dat/col.txt
colname_1 colname_3
row_1 1 3
row_2 4 9
row_3 2 4
Give it a try and let me know if you have any questions. Note, bash isn't known for its blinding speed handling large files, but as long as the column list isn't horrendously long, the script should be reasonably fast.

How to sum a row of numbers from text file-- Bash Shell Scripting

I'm trying to write a bash script that calculates the average of numbers by rows and columns. An example of a text file that I'm reading in is:
1 2 3 4 5
4 6 7 8 0
There is an unknown number of rows and unknown number of columns. Currently, I'm just trying to sum each row with a while loop. The desired output is:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
And so on and so forth with each row. Currently this is the code I have:
while read i
do
echo "num: $i"
(( sum=$sum+$i ))
echo "sum: $sum"
done < $2
To call the program it's stats -r test_file. "-r" indicates rows; I haven't started columns quite yet. My current code actually just takes the first number of each column and adds them together, and then the rest of the numbers error out as a syntax error. It says the error comes from line 16, which is the (( sum=$sum+$i )) line, but I honestly can't figure out what the problem is. I should tell you I'm extremely new to bash scripting and I have googled and searched high and low for the answer for this and can't find it. Any help is greatly appreciated.
You are reading the file line by line, and a whole line is not a valid operand for an arithmetic operation. Try this:
while read i
do
sum=0
for num in $i
do
sum=$(($sum + $num))
done
echo "$i Sum: $sum"
done < $2
Just split out each number from every line using a for loop. I hope this helps.
Another, non-bash way (con: the OP asked for bash; pro: it does not depend on bashisms and works with floats).
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c}'
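Since the question ultimately asks for averages as well, the same one-liner could be extended (a sketch; it assumes every line has at least one field, otherwise c/NF divides by zero):
awk '{c=0;for(i=1;i<=NF;++i){c+=$i};print $0, "Sum:", c, "Avg:", c/NF}' test_file
On the sample input this prints 1 2 3 4 5 Sum: 15 Avg: 3 and 4 6 7 8 0 Sum: 25 Avg: 5.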
Another way (not a pure bash):
while read line
do
sum=$(sed 's/[ ]\+/+/g' <<< "$line" | bc -q)
echo "$line Sum = $sum"
done < filename
Using the numsum -r util covers the row addition, but the output format needs a little glue, by inefficiently paste-ing a few utils:
paste "$2" \
<(yes "Sum =" | head -$(wc -l < "$2") ) \
<(numsum -r "$2")
Output:
1 2 3 4 5 Sum = 15
4 6 7 8 0 Sum = 25
Note -- to run the above line on a given file foo, first initialize $2 like so:
set -- "" foo
paste "$2" <(yes "Sum =" | head -$(wc -l < "$2") ) <(numsum -r "$2")
