I am currently working on big datasets (typically around 10 GB each) that prevent me from using R (RStudio) and dealing with data frames as I used to.
In order to deal with a restricted amount of memory (and CPU power), I've tried Julia and Bash (shell scripting) to process those files.
My question is the following: I've concatenated my files (more or less 1 million individual files merged into one big file) and I would like to process those big files in this way. Let's say that I have something like:
id,latitude,longitude,value
18,1,2,100
18,1,2,200
23,3,5,132
23,3,5,144
23,3,5,150
I would like to process my file so that for id = 18 it computes the max (200), the min (100), or some other properties, then moves on to the next id and does the same. I guess some sort of nested loop in bash would work, but I'm having trouble doing it in an elegant way, and the answers I've found on the Internet so far were not really helping. I cannot process it in Julia because it's too big/heavy, which is why I'm looking for answers mostly in bash.
However, I wanted to do this because I thought it would be faster to process one huge file rather than open a file, calculate, close the file, and go on to the next one again and again. I'm not sure at all, though!
Finally, which one would be better to use? Julia or Bash? Or something else?
Thank you!
Julia or Bash?
If you are talking about using plain bash itself, and not commands that could be executed from any other shell, then the answer is clearly Julia: plain bash is orders of magnitude slower than Julia.
However, I would recommend using an existing tool instead of writing your own.
GNU datamash could be what you need. You can call it from bash or any other shell.
for id = 18, compute the max (200), the min (100) [...] then go to next id and do the same
With datamash, you could use the following bash command:
< input.csv datamash -Ht, -g 1 min 4 max 4
which would print:
GroupBy(id),min(value),max(value)
18,100,200
23,132,150
Loops in bash are slow, so I think Julia is a much better fit in this case. Here is what I would do:
(Ideally) convert your data into a binary format, like NetCDF or HDF5.
Load a chunk of data (e.g. 100,000 rows; not everything, unless all the data fits in RAM) and compute the min/max per id as you propose.
Go to the next chunk and update the min/max for every id.
Do not load all the data into memory at once if you can avoid it. Simple statistics such as the minimum, maximum, sum, mean, and standard deviation can all be computed this way, in a single streaming pass (a sketch is given below).
In my opinion, the memory overhead of Julia (versus bash) is probably quite small given the size of the problem.
Be sure to read the performance tips for Julia, and in particular place hot loops inside functions rather than in global scope.
https://docs.julialang.org/en/v1/manual/performance-tips/index.html
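To make the chunked/streaming idea above concrete, here is a minimal sketch in plain Julia (no packages) that keeps only one running (min, max) pair per id instead of the whole file in memory; it streams line by line rather than in explicit chunks, but the update logic is the same. The file name input.csv and the column layout (id in column 1, value in column 4, one header line) are assumptions taken from the example above.
# Minimal sketch, plain Julia: one pass over the file, keeping only a running (min, max) per id.
function minmax_per_id(path)
    stats = Dict{String,Tuple{Float64,Float64}}()   # id => (min, max)
    open(path) do io
        readline(io)                                # skip the header line
        for line in eachline(io)
            isempty(line) && continue
            cols = split(line, ',')
            id   = cols[1]
            val  = parse(Float64, cols[4])
            lo, hi = get(stats, id, (Inf, -Inf))
            stats[id] = (min(lo, val), max(hi, val))
        end
    end
    return stats
end

for (id, (lo, hi)) in minmax_per_id("input.csv")
    println(id, ",", lo, ",", hi)
end
On the sample above this would print something like 18,100.0,200.0 and 23,132.0,150.0 (in no particular order, since a Dict is unordered).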
Alternatively, such operations can also be done with specific queries in a SQL database.
Bash is definitely not the best option. (Fortran, baby!)
Anyway, the following can be translated to any language you want.
#!/bin/bash
function postprocess(){
# Do whatever statistics you want on the arrays.
echo "id: $last_id"
echo "lats: ${lat[#]}"
echo "lons: ${lon[#]}"
echo "vals: ${val[#]}"
}
# Set dummy start variable
last_id="not a valid id"
count=0
while read line; do
id=$( echo $line | cut -d, -f1 )
# Ignore first line
[ "$id" == "id" ] && continue
# If this is a new id, post-process the old one
if [ $id -ne $last_id -a $count -ne 0 ] 2> /dev/null; then
# Do post processing of data
postprocess
# Reset counter
count=0
# Reset value arrays
unset lat
unset lon
unset val
fi
# Increment counter
(( count++ ))
# Set last_id
last_id=$id
# Get values into arrays
lat+=($( echo $line | cut -d, -f2 ))
lon+=($( echo $line | cut -d, -f3 ))
val+=($( echo $line | cut -d, -f4 ))
done < test.txt
[ $count -gt 0 ] && postprocess
For this kind of problem, I'd be wary of using bash, because it isn't suited to line-by-line processing. And awk is too line-oriented for this kind of job, making the code complicated.
Something like this in perl might do the job, with a loop of loops grouping lines together by their id field.
IT070137 ~/tmp $ cat foo.pl
#!/usr/bin/perl -w
use strict;
my ($id, $latitude, $longitude, $value) = read_data();
while (defined($id)) {
my $group_id = $id;
my $min = $value;
my $max = $value;
($id, $latitude, $longitude, $value) = read_data();
while (defined($id) && $id eq $group_id) {
if ($value < $min) {
$min = $value;
}
if ($value > $max) {
$max = $value;
}
($id, $latitude, $longitude, $value) = read_data();
}
print $group_id, " ", $min, " ", $max, "\n";
}
sub read_data {
my $line = <>;
if (!defined($line)) {
return (undef, undef, undef, undef);
}
chomp($line);
my ($id, $latitude, $longitude, $value) = split(/,/, $line);
return ($id, $latitude, $longitude, $value);
}
IT070137 ~/tmp $ cat foo.txt
id,latitude,longitude,value
18,1,2,100
18,1,2,200
23,3,5,132
23,3,5,144
23,3,5,150
IT070137 ~/tmp $ perl -w foo.pl foo.txt
id value value
18 100 200
23 132 150
Or if you prefer Python:
#!/usr/bin/python -tt
from __future__ import print_function
import fileinput

def main():
    data = fileinput.input()
    (id, lattitude, longitude, value) = read(data)
    while id:
        group_id = id
        min = value
        (id, lattitude, longitude, value) = read(data)
        while id and group_id == id:
            if value < min:
                min = value
            (id, lattitude, longitude, value) = read(data)
        print(group_id, min)

def read(data):
    line = data.readline()
    if line == '':
        return (None, None, None, None)
    line = line.rstrip()
    (id, lattitude, longitude, value) = line.split(',')
    return (id, lattitude, longitude, value)

main()
Related
I want to compare two numbers from two different files using a Bash script. The files are tmp$i and tmp$(($i-1)). I have tried the script below, but it is not working:
#!/bin/bash
for i in `seq 1 5`
do
if [ $tmp$i -lt $tmp$(($i-1)) ];then
cat tmp$i >> inf
else
cat tmp$i >> sup
fi
done
Sample data
Tmp1:
0.8856143905954186 0.8186070632371812 0.7624440603372680 0.7153352945456424 0.6762383806114797 0.6405457936981878
Tmp2:
0.5809579333203458 0.5567050091247218 0.5329405222386163 0.5115305043007474 0.4963898045543342 0.4846139486344327
You are not setting $tmp, so you end up simply comparing whether i is smaller than i-1, which of course it isn't.
Removing the dollar sign nominally fixes that, but it will just compare two strings (for which numeric comparison isn't well-defined, so in practice it is always false), not access the contents of the files named by those strings: tmp2 is neither numerically larger nor smaller than tmp1. (Bash can perform lexical comparison, but test ... -lt isn't the tool for that.)
Try this instead:
if [ $(cat "tmp$i") -lt $(cat "tmp$((i - 1))") ]; then
In response to the observation that you want to do this on decimal numbers, you need a different tool, because Bash only supports integer arithmetic. My approach would be to write a simple Awk script which performs the comparison.
In order to be able to use it as a conditional, it should exit(0) if the condition is true, exit(1) otherwise.
In order to keep the main script readable, I would encapsulate it in a function, like this:
smaller_first_line () {
awk 'NR==1 && FNR==1 { i=$1; next } FNR==1 { exit($1 < i) }' "$1" "$2"
}
if smaller_first_line "tmp$i" "tmp$((i - 1))"; then
:
I have a log file that is grouping http requests in 5 minute increments based on a unique set of characteristics. Format is as follows:
beginTime endTime platform hostname osVersion os requestType httpStatus nbInstances
So a sample log line could be:
1423983600 1423983900 platform1 test01 8.1 win createAcct 200 15
This indicates that in that 5-minute timeframe there were 15 requests with this unique attribute set. What I would like to do is take this and generate 15 identical lines in an output file.
Right now I have a very simple script that gets the job done but is probably not very efficient:
#!/bin/bash
file=$1
count=0
cat $file | while read line
do
string=`echo $line | awk '{print $1,$2,$3,$4,$5,$6,$7,$8}'`
nbInst=`echo $line | awk '{print $9}'`
while [[ $count -lt $nbInst ]]
do
echo "$string" >> test_data.log
count=`expr $count + 1`
done
count=0
done
Any ideas on a faster solution in bash or perl? Thanks.
As mentioned in the comments, it seems unusual that you need to de-coalesce your events in order to process and index them.
However, this should do what you're asking:
#!/usr/bin/perl
use strict;
use warnings;
#uses DATA segment from below as file. You'll probably want either STDIN
#or open a file handle.
while (<DATA>) {
#separate line on whitespace
my @line = split;
#grab the last element of the line (pop returns the value, and removes
#from the list)
for ( 1 .. pop(@line) ) {
print join( " ", @line ), "\n";
}
}
__DATA__
1423983600 1423983900 platform1 test01 8.1 win createAcct 200 15
Intro
I have a file named data.dat with the following structure:
1: 67: 1 :s
1: 315: 1 :s
1: 648: 1 :ns
1: 799: 1 :s
1: 809: 1 :s
1: 997: 1 :ns
2: 32: 1 :s
Algorithm
The algorithm that I'm looking for is:
Generate a random number between 1 and number of lines in this file.
Delete that line if the fourth column is "s".
Otherwise, generate another random number and repeat this until the number of lines reaches a certain value.
Technical Concepts
Though the technical concepts are not essential to this algorithm, I will try to explain the problem. The data shows the connectivity table of a network. This algorithm allows us to run it over different initial conditions and study general properties of these networks. In particular, because bonds are deleted at random, any behavior common to these networks can be interpreted as a fundamental law.
Update: Another good reason to produce a random number in each step is that after removing each line, it's possible that the s/ns property of the remaining lines changes.
Code
Here is the code I have until now:
#!/bin/bash
# bash in OSX
While ((#there is at least 1 s in the fourth column)); do
LEN=$(grep -c "." data.dat) # number of lines
RAND=$((RANDOM%${LEN}+1)) # generating random number
if [[awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- == "s"]]; then
sed '$RANDd' data.txt
else
#go back and produce another random
done
exit
I try to find the fourth column with awk -F, "NR==$RAND" 'data.dat' | cut -d ':' -f 4- and to delete the line with sed '$RANDd' data.txt.
Questions
How should I check that there are s pairs left in my file?
I am not sure if the condition in if is correct.
Also, I don't know how to force the loop after else to go back and generate another random number.
Thank you,
I really appreciate your help.
Personally, I would recommend against doing this in bash unless you have absolutely no choice.
Here's another way you could do it in Perl (quite similar in functionality to Alex's answer but a bit simpler):
use strict;
use warnings;
my $filename = shift;
open my $fh, "<", $filename or die "could not open $filename: $!";
chomp (my @lines = <$fh>);
my $sample = 0;
my $max_samples = 10;
while ($sample++ < $max_samples) {
my $line_no = int rand @lines;
my $line = $lines[$line_no];
if ($line =~ /:s\s*$/) {
splice @lines, $line_no, 1;
}
}
print "$_\n" for #lines;
Usage: perl script.pl data.dat
Read the file into the array @lines. Pick a random line from the array and, if it ends with :s (followed by any number of spaces), remove it. Print the remaining lines at the end.
This does what you want but I should warn you that relying on built-in random number generators in any language is not a good way to arrive at statistically significant conclusions. If you need high-quality random numbers, you should consider using a module such as Math::Random::MT::Perl to generate them, rather than the built-in rand.
#!/usr/bin/env perl
# usage: $ excise.pl < data.dat > smaller_data.dat
my $sampleLimit = 10; # sample up to ten lines before printing output
my $dataRef;
my $flagRef;
while (<>) {
chomp;
push (@{$dataRef}, $_);
push (@{$flagRef}, 1);
}
my $lineCount = scalar @{$dataRef};
my $sampleIndex = 0;
while ($sampleIndex < $sampleLimit) {
my $sampleLineIndex = int(rand($lineCount));
my @sampleElems = split(":", $dataRef->[$sampleLineIndex]);
if ($sampleElems[3] eq "s") {
$flagRef->[$sampleLineIndex] = 0;
}
$sampleIndex++;
}
# print data.dat to standard output, minus any sampled lines that had an 's' in them
foreach my $lineIndex (0..(scalar @{$dataRef} - 1)) {
if ($flagRef->[$lineIndex] == 1) {
print STDOUT $dataRef->[$lineIndex]."\n";
}
}
NumLine=$( grep -c "" data.dat )
while [ ${NumLine} -gt ${TargetLine} ]
do
# echo "Line at start: ${NumLine}"
RndLine=$(( ( ${RANDOM} % ${NumLine} ) + 1 ))
RndValue="$( echo " ${RANDOM}" | sed 's/.*\(.\{6\}\)$/\1/' )"
sed "${RndLine} {
s/^\([^:]*:\)[^:]*\(:.*:ns$\)/\1${RndValue}\2/
t
d
}" data.dat > /tmp/data.dat
mv /tmp/data.dat data.dat
NumLine=$( grep -c "" data.dat )
#cat data.dat
#echo "- Next Iteration -------"
done
Tested on AIX (so not GNU sed). Under Linux, use the --posix option for sed, and in this case you can use -i in place of the temporary file + redirection + move.
Don't forget that RANDOM is NOT truly random, so a study of network behavior based on such values may reflect a specific case rather than reality.
I am trying to write a script to count the number of zero fill sectors for a dd image file. This is what I have so far, but it is throwing an error saying it cannot open file #hashvalue#. Is there a better way to do this or what am I missing? Thanks in advance.
count=1
zfcount=0
while read Stuff; do
count+=1
if [ $Stuff == "bf619eac0cdf3f68d496ea9344137e8b" ]; then
zfcount+=1
fi
echo $Stuff
done < "$(dd if=test.dd bs=512 2> /dev/null | md5sum | cut -d ' ' -f 1)"
echo "Total Sector Count Is: $count"
echo "Zero Fill Sector Count is: $zfcount"
Doing this in bash is going to be extremely slow -- on the order of 20 minutes for a 1GB file.
Use another language, like Python, which can do this in a few seconds (if storage can keep up):
python3 -c '
import sys
total = 0
zero = 0
# read the image 512 bytes (one sector) at a time, in binary mode
f = open(sys.argv[1], "rb")
while True:
    a = f.read(512)
    if a:
        total = total + 1
        if all(x == 0 for x in a):
            zero = zero + 1
    else:
        break
print("Total sectors: " + str(total))
print("Zeroed sectors: " + str(zero))
' yourfilehere
Your error message comes from this line:
done < "$(dd if=test.dd bs=512 2> /dev/null | md5sum | cut -d ' ' -f 1)"
What that does is read your entire test.dd, calculate the md5sum of that data, and parse out just the hash value; then, because it is enclosed in $( ... ), that hash value is substituted in place, so you end up with that line essentially acting like this:
done < e6e8c42ec6d41563fc28e50080b73025
(except, of course, with a different hash). So your shell attempts to read from a file whose name is the hash of your test.dd image, can't find the file, and complains.
Also, it appears that you are under the assumption that dd if=test.dd bs=512 ... will feed you 512-byte blocks one at a time to iterate over. This is not the case: dd will read the file in bs-sized blocks and write it out in blocks of the same size, but it does not insert any separator or synchronize in any way with whatever is on the other side of its pipe.
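If you do want to iterate over fixed-size blocks yourself rather than rely on dd to hand them to you one at a time, the pattern looks like the minimal sketch below. It is written in Julia only to stay consistent with the sketches earlier in this post (any language with binary reads works the same way); the 512-byte sector size and the test.dd file name come from the question, and everything else (function and variable names) is an assumption.
# Sketch: count all-zero 512-byte sectors by reading fixed-size blocks directly (no dd needed).
function count_zero_sectors(path; sector = 512)
    total = 0
    zeroed = 0
    open(path, "r") do io
        buf = Vector{UInt8}(undef, sector)
        while !eof(io)
            n = readbytes!(io, buf, sector)      # n can be < sector for the last block
            total += 1
            all(==(0x00), view(buf, 1:n)) && (zeroed += 1)
        end
    end
    return total, zeroed
end

total, zeroed = count_zero_sectors("test.dd")
println("Total sectors: ", total)
println("Zero-fill sectors: ", zeroed)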
I have a sample.dat file which contains experiment values for 10 different fields, recorded over time. Using sed, awk, or any other shell tool, I need to write a script that reads in the sample.dat file and, for each field, computes the max, min, and average.
sample.dat
field1:experiment1: 10.0
field2:experiment1: 12.5
field1:experiment2: 5.0
field2:experiment2: 14.0
field1:experiment3: 18.0
field2:experiment3: 3.5
Output
field1: MAX = 18.0, MIN = 5.0, AVERAGE = 11.0
field2: MAX = 14.0, MIN = 3.5, AVERAGE = 10.0
awk -F: '
{
sum[$1]+=$3;
if(!($1 in min) || (min[$1]>$3))
min[$1]=$3;
if(!($1 in max) || (max[$1]<$3))
max[$1]=$3;
count[$1]++
}
END {
for(element in sum)
printf("%s: MAX=%.1f, MIN=%.1f, AVARAGE=%.1f\n",
element,max[element],min[element],sum[element]/count[element])
}' sample.dat
Output
field1: MAX=18.0, MIN=5.0, AVERAGE=11.0
field2: MAX=14.0, MIN=3.5, AVERAGE=10.0
Here is a Perl solution I made (substitute the file name for whatever file you use):
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max min sum);
open( my $fh, "<", "sample.dat" ) or die $!;
my %fields;
while (<$fh>) {
chomp;
$_ =~ s/\s+//g;
my #line = split ":";
push @{ $fields{ $line[0] } }, $line[2];
}
close($fh);
foreach ( keys %fields ) {
print "$_: MAX="
. max @{ $fields{$_} };
print ", MIN="
. min @{ $fields{$_} };
print ", AVERAGE="
. ( (sum @{ $fields{$_} }) / @{ $fields{$_} } ) . "\n";
}
In bash with bc:
#!/bin/bash
declare -A min
declare -A max
declare -A avg
declare -A avgCnt
while read line; do
key="${line%%:*}"
value="${line##*: }"
if [ -z "${max[$key]}" ]; then
max[$key]="$value"
min[$key]="$value"
avg[$key]="$value"
avgCnt[$key]=1
else
larger=`echo "$value > ${max[$key]}" | bc`
smaller=`echo "$value < ${min[$key]}" | bc`
avg[$key]=`echo "$value + ${avg[$key]}" | bc`
((avgCnt[$key]++))
if [ "$larger" -eq "1" ]; then
max[$key]="$value"
fi
if [ "$smaller" -eq "1" ]; then
min[$key]="$value"
fi
fi
done < "$1"
for i in "${!max[#]}"
do
average=`echo "scale=1; ${avg[$i]} / ${avgCnt[$i]}" | bc`
echo "$i: MAX = ${max[$i]}, MIN = ${min[$i]}, AVERAGE = $average"
done
You can make use of this Python code:
from collections import defaultdict
d = defaultdict(list)
[d[(line.split(":")[0])].append(float(line.split(":")[2].strip("\n "))) for line in open("sample.dat")]
for f in d: print(f, ": MAX=", max(d[f]), ", MIN=", min(d[f]), ", AVG=", sum(d[f]) / float(len(d[f])))
You can use gnu-R for something like this:
echo "1" > foo
echo "2" >> foo
cat foo \
| r -e \
'
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
write(max(v),stdout())
'
2
For summary statistics,
cat foo \
| r -e \
'
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
write(summary(v),stdout())
'
# Max, Min, Mean, median, quartiles, deviation, etc.
...
And in json:
... | r -e \
'
library(rjson)
f <- file("stdin")
open(f)
v <- read.csv(f,header=F)
json_summary <- toJSON(summary(v))
write(json_summary,stdout())
'
# same stats
| jq '.Max'
# for maximum
If you are working in the Linux command-line environment, you probably don't want to reinvent wheels; you want to stay vectorized and keep your code clean, easy to read and develop, and built from standard, composable functions.
In this case, you don't need an object-oriented language (using Python tends to bring interface and code bloat, plus rounds of googling, pip, and conda depending on the libraries you need, and type conversions you have to code by hand), you don't need verbose syntax, and you probably want to handle data frames/vectors/rows/columns of numerical data by default.
You probably also want scripts that can move around your machine without issues. If you are on Linux, that probably means gnu-R; install dependencies via apt-get.