Reading numeric values from grep output in bash

I have a file filled with rows of text. I'm interested in a group of these lines: each starts with the same word, and each contains two numbers (always in the same positions) that I have to process later. For example:
Round trip time was 49.9721 milliseconds in repetition 5 tcp_ping received 128 bytes back
I was thinking of using grep to grab the wanted rows into a new file, and then putting the contents of that file into an array so I can access it easily during processing, but this isn't working. Any tips?
#!/bin/bash
InputFile="../data/N.dat"
grep "Round" ../data/tcp_16.out > "$InputFile"
IFS=' ' read -a array <<< "$InputFile"

If the numbers are all you care about, you can read only the numbers in.
I'd also strongly suggest extracting the values you're going to be analyzing into arrays, like so, rather than storing the full lines as strings:
ms_time_arr=( )   # array: map repetitions to ms_time
bytes_arr=( )     # array: map repetitions to bytes
while read -r ms_time repetition bytes_back _; do
    # log to stderr to show that we read the data
    echo "At $ms_time ms, repetition $repetition, got $bytes_back back" >&2
    ms_time_arr[$repetition]=$ms_time
    bytes_arr[$repetition]=$bytes_back
done < <(grep -e 'Round' <../data/N.dat | tr -d '[:alpha:]_')
# more logging, to show that array contents survive the loop
declare -p ms_time_arr bytes_arr
This works by using tr to delete all alphabetic characters (plus the underscore in tcp_ping), leaving only the numbers, punctuation, and whitespace for read to split.
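For the sample line from the question, the tr stage leaves nothing but whitespace and the three numbers, so read gets exactly the fields we want:

$ echo 'Round trip time was 49.9721 milliseconds in repetition 5 tcp_ping received 128 bytes back' | tr -d '[:alpha:]_'
    49.9721    5   128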

Related

Efficient substring parsing of large fixed length text file in Bash

I have a large text file (millions of records) of fixed-length data and need to extract unique substrings and create a number of arrays with those values. I have a working version; however, I'm wondering if performance can be improved, since I need to run the script iteratively.
$_file5 looks like:
138000010065011417865201710152017102122
138000010067710416865201710152017102133
138000010131490417865201710152017102124
138000010142349413865201710152017102154
138400010142356417865201710152017102165
130000101694334417865201710152017102176
Here is what I have so far:
while IFS='' read -r line || [[ -n "$line" ]]; do
    _in=0
    _set=${line:15:6}
    _startDate=${line:21:8}
    _id="$_account-$_set-$_startDate"
    for element in "${_subsets[@]}"; do
        if [[ $element == "$_set" ]]; then
            _in=1
            break
        fi
    done
    # If we find a new one and it's not 504721
    if [ $_in -eq 0 ] && [ $_set != "504721" ] ; then
        _subsets=("${_subsets[@]}" "$_set")
        _ids=("${_ids[@]}" "$_id")
    fi
done < $_file5
And this yields:
_subsets=("417865","416865","413865")
_ids=("9899-417865-20171015", "9899-416865-20171015", "9899-413865-20171015")
I'm not sure if sed or awk would be better here and can't find a way to implement either. Thanks.
EDIT: Benchmark Tests
So I benchmarked my original solution against the two provided. I ran each over 10 times and all results were similar to those below.
# Bash read
real 0m8.423s
user 0m8.115s
sys 0m0.307s
# Using sort -u (@randomir)
real 0m0.719s
user 0m0.693s
sys 0m0.041s
# Using awk (@shellter)
real 0m0.159s
user 0m0.152s
sys 0m0.007s
Looks like awk wins this one. Regardless, the performance improvement from my original code is substantial. Thank you both for your contributions.
I don't think you can beat the performance of sort -u with bash loops (except in corner cases, as this one turned out to be; see the footnote ✻).
To reduce the list of strings you have in file to a list of unique strings (set), based on a substring:
sort -k1.16,1.21 -u file >set
Then, to filter out the unwanted id, 504721, which starts at position 16, you can use grep -v:
grep -vE '.{15}504721' set
Finally, reformat the remaining lines and store them in arrays with cut/sed/awk/bash.
So, to populate the _subsets array, for example:
$ _subsets=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | cut -c16-21))
$ printf "%s\n" "${_subsets[@]}"
413865
416865
417865
or, to populate the _ids array:
$ _ids=($(sort -k1.16,1.21 -u file | grep -vE '.{15}504721' | sed -E 's/^.{15}(.{6})(.{8}).*/9899-\1-\2/'))
$ printf "%s\n" "${_ids[@]}"
9899-413865-20171015
9899-416865-20171015
9899-417865-20171015
✻ If the input file is huge, but it contains only a small number (~40) of unique elements (for the relevant field), then it makes perfect sense for the awk solution to be faster. sort needs to sort a huge file (O(N log N)), then filter the dupes (O(N)), all for a large N. On the other hand, awk needs to pass through the large input only once, checking for dupes along the way via set-membership testing. Since the set of uniques is small, membership testing takes only O(1) (on average; for such a small set, practically constant even in the worst case), making the overall time O(N).
If there were fewer dupes, awk would have O(N log N) amortized and O(N²) worst case, not to mention the higher constant per-operation overhead.
In short: you have to know what your data looks like before choosing the right tool for the job.
Here's an awk solution embedded in a bash script:
#!/bin/bash
fn_parser() {
    awk '
    BEGIN { _account = "9899" }
    {
        _set = substr($0, 16, 6)
        _startDate = substr($0, 22, 8)
        #dbg print "#dbg:_set=" _set "\t_startDate=" _startDate
        if (_set != "504721") {
            _id = _account "-" _set "-" _startDate
            ids[_id] = _id
            sets[_set] = _set
        }
    }
    END {
        printf "_subsets=("
        for (s in sets) { printf("%s\"%s\"", (commaCtr++ ? "," : ""), sets[s]) }
        print ");"
        printf "_ids=("
        for (i in ids) { printf("%s\"%s\"", (commaCtr2++ ? "," : ""), ids[i]) }
        print ")"
    }
    ' "${@}"
}
#dbg set -vx
eval $( echo $(fn_parser *.txt) )
echo "_subsets="$_subsets
echo "_ids="$_ids
output
_subsets=413865,417865,416865
_ids=9899-416865-20171015,9899-413865-20171015,9899-417865-20171015
I believe this is the same output your script would get if you did an echo on your variable names.
I didn't see _account being extracted from your file, so I assume it is passed in from a previous step in your batch. But until I know whether that is a critical piece, I'll have to come back to figuring out how to pass a variable in to a function that calls awk.
People won't like using eval, but hopefully no one will embed /bin/rm -rf / into your data set ;-)
I use eval so that the extracted data is available via shell variables. You can uncomment the #dbg line before the eval to see how the code executes through the "layers" of function call, eval, and var=value assignments.
Hopefully, you see how the awk script is a transcription of your code into awk.
It does rely on the fact that awk arrays can hold only one copy of a given key/value pair.
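As a quick illustration of that dedup-by-key behaviour (a standalone sketch, not part of the script above):

printf '%s\n' 417865 416865 417865 413865 | awk '{ sets[$1] = $1 } END { for (s in sets) print s }'

Each key is stored only once, so the repeated 417865 collapses to a single entry; note that the for (s in sets) output order is unspecified.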
I'd really appreciate it if you posted timings for all the solutions submitted. (You could reduce the file size by half and still have a good test.) Be sure to run each version several times, and discard the first run.
IHTH

Argument length is too big. How to chunk it up?

I have this Python code which takes a filename and a set of comma-separated offsets and reads the corresponding lines defined by the offsets.
do
python fileOffset.py /mnt/media1/file $offsets >> tmpfile
done
$offsets provides a comma-separated string containing the file pointers (e.g. 12,123,121134). This works fine until I get a very long string of offsets, which throws an "argument list too long" error. As a solution, I have written the following code, which splits the offsets and calls fileOffset.py once per offset.
IFS=', ' read -a array <<< $offsets
for element in "${array[@]}"
do
    python fileOffset.py /mnt/media1/$file $element >> tmpfile
done
But this makes processing of the file very slow. How could I make it faster?
You can use xargs to batch the offsets into as few invocations as the argument-length limit allows, e.g.:
tr ',' '\n' <<< "$offsets" | xargs python fileOffset.py /mnt/media1/$file >> tmpfile
(this assumes fileOffset.py will accept the offsets as separate arguments rather than as one comma-separated string).
However, I'm with @FrederikPihil's comment: do it all in Python, since you are already spawning a Python process on each iteration anyway.

Bash Columns SED and BASH Commands without AWK?

I wrote two different scripts but I am stuck on the same problem.
I am making a table from a file ($2) that I get as an argument, and $1 is the number of columns. It's a little hard to explain, so I'll show you the input and output.
The problem is that I don't know how to save every column in a different variable so I can use it in my HTML code later:
#printf <TR><TD>$...</TD><TD>$...</TD><TD>$..</TD></TR><TD>$...
So the input looks like this:
Name\tSize\tType\tprobe
bla\t4711\tfile\t888888888
abcde\t4096\tdirectory\t5555
eeeee\t333333\tblock\t6666
aaaaaa\t111111\tpackage\t7777
sssss\t44444\tfile\t8888
bbbbb\t22222\tfolder\t9999
Code:
c=1
column=$1
file=$2
echo "$( < $file)" | while read Line ; do
    Name=$(sed "s/\\\t/ /g" $file | cut -d' ' -f$c,-$column)
    printf "$Name \n"
    #let c=c+1
    #printf "<TR><TD>$Name</TD><TD>$Size</TD><TD>$Type</TD></TR>\n"
    exit 0
done
Output:
Name Size Type probe
bla 4711 file 888888888
abcde 4096 directory 5555
eeeee 333333 block 6666
aaaaaa 111111 package 7777
sssss 44444 file 8888
bbbbb 22222 folder 9999
This is a tailor-made job for awk. See this script:
awk -F'\t' '{printf "<tr>";for(i=1;i<=NF;i++) printf "<td>%s</td>", $i;print "</tr>"}' input
<tr><td>bla</td><td>4711</td><td>file</td><td>888888888</td></tr>
<tr><td>abcde</td><td>4096</td><td>directory</td><td>5555</td></tr>
<tr><td>eeeee</td><td>333333</td><td>block</td><td>6666</td></tr>
<tr><td>aaaaaa</td><td>111111</td><td>package</td><td>7777</td></tr>
<tr><td>sssss</td><td>44444</td><td>file</td><td>8888</td></tr>
<tr><td>bbbbb</td><td>22222</td><td>folder</td><td>9999</td></tr>
In bash:
celltype=th
while IFS=$'\t' read -a columns; do
    rowcontents=$( printf "<$celltype>%s</$celltype>" "${columns[@]}" )
    printf '<tr>%s</tr>\n' "$rowcontents"
    celltype=td
done < <( sed $'s/\\\\t/\t/g' "$2" )
Some explanations:
IFS=$'\t' read -a columns reads a line from standard input, using only the tab character to separate fields, and putting each field into a separate element of the array columns. We change IFS so that other whitespace, which could occur in a field, is not treated as a field delimiter.
On the first line read from standard input, <th> elements will be output by the printf line. After resetting the value of celltype at the end of the loop body, all subsequent rows will consist of <td> elements.
When setting the value of rowcontents, we take advantage of the fact that the first argument (the format string) is repeated as many times as necessary to consume all the remaining arguments; see the short demonstration after these notes.
Input is via process substitution from the sed command, which requires a crazy amount of quoting. First, the entire argument is quoted with $'...', which tells bash to replace escaped characters. bash converts this to the literal string s/\\t/^T/g, where I am using ^T to represent a literal ASCII 09 tab character. When sed sees this argument, it performs its own escape replacement, so the search text is a literal backslash followed by a literal t, to be replaced by a literal tab character.
The first argument, the column count, is unnecessary and is ignored.
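As a quick demonstration of that format-string reuse:

$ printf '<td>%s</td>' one two three; echo
<td>one</td><td>two</td><td>three</td>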
Normally, you avoid making the while loop part of a pipeline because you set parameters in the loop that you want to use later. Here, all the variables are truly local to the while loop, so you could avoid the process substitution and use a pipeline if you wish:
sed $'s/\\\\t/\t/g' "$2" | while IFS=$'\t' read -a columns; do
...
done

Save a newline separated list into several bash variables

I'm relatively new to shell scripting and am writing a script to organize my music library. I'm using awk to parse the id3 tag info and am generating a newline separated list like so:
Kanye West
College Dropout
All Falls Down
I want to store each field in a separate variable so I can easily compose some mkdir and mv commands. I've tried piping the output to IFS=$'\n' read artist album title but each variable remains empty. I'm open to producing a different output from awk, but I still want to know how to parse a newline separated list using bash.
Edit:
It turns out that piping directly into read, like this:
id3info "$filename" | awk "$awkscript" | { read artist; read album; read title; }
will NOT work: the pipeline runs the reads in a subshell, so the variables end up in a different scope. I found that using a here-string works best:
{ read artist; read album; read title; } <<< "$(id3info "$filename" | awk "$awkscript")"
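A process-substitution form works just as well, since it also runs the reads in the current shell (same id3info/awk pipeline as above):

{ read artist; read album; read title; } < <(id3info "$filename" | awk "$awkscript")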
read normally reads one line at a time. So, if your id3 info is in the file testfile.txt, you can read it in as follows:
{ read artist ; read album ; read song ; } <testfile.txt
echo "artist='$artist' album='$album' song='$song'"
# insert your mkdir and mv commands....
When run on your test file, the above outputs:
artist='Kanye West' album='College Dropout' song='All Falls Down'
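From there, a hypothetical sketch of the mkdir/mv step (the .mp3 extension and the artist/album directory layout are assumptions, not taken from the question):

mkdir -p "$artist/$album"                   # create the target directory if needed
mv "$filename" "$artist/$album/$song.mp3"   # move the track into place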
You can just read the file into a bash array and loop through the array like so:
IFS=$'\r\n' content=($(cat ${filepath}))
for ((idx = 0; idx < ${#content[@]}; idx+=3)); do
    artist=${content[idx]}
    album=${content[idx+1]}
    title=${content[idx+2]}
done
Or read three lines in a loop.
yourscript |
while read artist; do   # read first line of input
    read album          # read second line of input
    read song           # read third line of input
    : self-destruct if the genre is rap
done
This loop will consume input lines in groups of three. If there is not an even multiple of three lines of input, the reads after that inside the loop will simply fail and the variables will be empty.
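If you would rather have the loop stop cleanly as soon as a group of three is incomplete, one variant (a sketch, keeping the same variable names):

yourscript |
while read -r artist && read -r album && read -r song; do
    printf 'artist=%s album=%s song=%s\n' "$artist" "$album" "$song"
done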
You can read the output from awk into an array. E.g.
readarray -t array <<< "$(printf '%s\n' 'Kanye West' 'College Dropout' 'All Falls Down')"
for ((i=0; i<${#array[@]}; i++ )) ; do
    echo "array[$i]=${array[$i]}"
done
Produces:
array[0]=Kanye West
array[1]=College Dropout
array[2]=All Falls Down

Iterate over a file using two values on the same line

I need to pass a series of value pairs as arguments to a C++ program, so I wrote this script:
while read randomNumbers; do
    lambda = $randomNumbers | cut -f1 -d ' '
    mi = $randomNumbers | cut -f2 -d ' '
    ./queueSim mm1-queue $lambda $mi
done < "randomNumbers"
where the first argument is the first value on each line of the file "randomNumbers" and the second argument is the second value (of course). I get a segfault and a "command not found" error.
How can I assign the values taken from each line to lambda and mi and pass these variables to the C++ program?
There's no need for cut. Let read split the line for you:
while read lambda mi; do
    ./queueSim mm1-queue $lambda $mi
done < randomNumbers
Note that it is also commonly used in conjunction with IFS to split the input line on different fields. For example, to parse /etc/passwd ( a file with colon separated lines ), you will often see:
while IFS=: read username passwd uid gid info home shell; do ...
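For instance, a minimal runnable version of that /etc/passwd loop:

# print each account's user name and login shell
while IFS=: read -r username passwd uid gid info home shell; do
    printf '%s logs in with %s\n' "$username" "$shell"
done < /etc/passwd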
I would recommend assigning the values like this:
lambda=$(echo $randomNumbers | cut -f1 -d ' ')
mi=$(echo $randomNumbers | cut -f2 -d ' ')
The way you do it, you actually try to run a command whose name is whatever the current content of $randomNumbers happens to be.
Edit:
Another thing: since your columns are delimited by a whitespace character, you could also just read the entire line into an array whose elements are separated by whitespaces as well. One way to achieve this is:
columns=( $(echo "$randomNumbers" | grep -o "[^ ]*") )
./queueSim mm1-queue ${columns[@]::2}
The first line separately matches every substring that contains no spaces and puts them into the array columns. The second line does the same thing as the corresponding line in your implementation: it passes the first two columns as parameters. This is done with slicing: ${columns[@]} expands to the entire array, and the ::2 bound selects only the elements from position 0 up to (but not including) position 2.
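A tiny demonstration of that slice syntax:

columns=(12 99 7 42)
echo "${columns[@]::2}"    # prints: 12 99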
