Sample 10000 random rows from a 200GB dataset - bash

I am trying to sample 10000 random rows from a large dataset with ~3 billion rows (with headers). I've considered using shuf -n 10000 input.file > output.file but this seems quite slow (>2 hours of run time with my currently available resources).
I've also used awk 'BEGIN{srand();} {a[NR]=$0} END{for(i=1; i<=10; i++){x=int(rand()*NR) + 1; print a[x];}}' input.file > output.file from this answer to take a subset of lines from smaller files, though I am new to awk and don't know how to include headers.
I wanted to know if there was a more efficient solution to sampling a subset (e.g. 10000 rows) of data from the 200GB dataset.

I don't think any program written in a scripting language can beat shuf in the context of this question. Anyway, this is my attempt in bash. Run it with ./scriptname input.file > output.file
#!/bin/bash
samplecount=10000
datafile=$1
[[ -f $datafile && -r $datafile ]] || {
    echo "Data file does not exist or is not readable" >&2
    exit 1
}
linecount=$(wc -l "$datafile")
linecount=${linecount%% *}              # keep only the number from "count filename"
pickedlinnum=(-1)                       # sentinel at index 0 so the skip math works for the first pick
mapfile -t -O1 pickedlinnum < <(
    for ((i = 0; i < samplecount;)); do
        # build a ~60-bit random number from four 15-bit $RANDOM values
        rand60=$((RANDOM + 32768*(RANDOM + 32768*(RANDOM + 32768*RANDOM))))
        linenum=$((rand60 % linecount))
        if [[ -z ${slot[linenum]} ]]; then # no collision
            slot[linenum]=1
            echo ${linenum}
            ((++i))
        fi
    done | sort -n)
for ((i = 1; i <= samplecount; ++i)); do
    # skip forward to the next picked line number, read that line, print it
    mapfile -n1 -s$((pickedlinnum[i] - pickedlinnum[i-1] - 1))
    echo -n "${MAPFILE[0]}"
done < "$datafile"

Something in awk. Supply it with a random seed ($RANDOM in Bash) and the number n of wanted records. It counts the lines with wc -l and uses that count to randomly select n values between 1 and lines[1], then outputs the matching records. Can't really say anything about speed, I don't even have 200 GBs of disk. (:
$ awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd="wc -l " ARGV[1]                         # use wc for line counting
    if(ARGV[1]==""||n==""||(cmd | getline t)<=0) # require all parameters
        exit 1                                   # else exit
    split(t,lines)                               # wc -l returns "lines filename"
    srand(seed)                                  # use the seed
    while(c<n) {                                 # keep looping n times
        v=int((lines[1]) * rand())+1             # get a random line number
        if(!(v in a)) {                          # if it is not used yet
            a[v]                                 # use it
            ++c
        }
    }
}
(NR in a)' file                                  # print if NR is in the selected set
Testing with a dataset generated by seq 1 100000000: shuf -n 10000 file took about 6 seconds, whereas the awk above took about 18 s.
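The question also mentions headers. As an untested variant (my own condensation of the same program), making the final pattern also match the first record prints the header along with the sample:
awk -v seed=$RANDOM -v n=10000 '
BEGIN {
    cmd="wc -l " ARGV[1]
    if(ARGV[1]==""||n==""||(cmd | getline t)<=0) exit 1
    split(t,lines); srand(seed)
    while(c<n) { v=int(lines[1]*rand())+1; if(!(v in a)) { a[v]; ++c } }
}
FNR==1 || NR in a' file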

Related

Is there a way to compute permutations in bash?

I have multiple files (nearly 1000 of them), space-separated, and I need to compute permutations with the join function between each pair of them, using the second column only. The important thing is that no comparison must be repeated; that's why I speak of permutations.
For instance, a small example with 3 files A.txt B.txt and C.txt
The general idea is to get A B comparison, A C and B C. Neither B A nor C A nor C B
The 101 code would be
join -1 2 -2 2 A.txt B.txt | cut -d ' ' -f1 > AB.txt
join -1 2 -2 2 A.txt C.txt | cut -d ' ' -f1 > AC.txt
join -1 2 -2 2 B.txt C.txt | cut -d ' ' -f1 > BC.txt
Is there a way to accomplish this for thousand of files? I tried using a for loop, but toasted my brains out, and now I'm trying with a while loop. But I better get some orientation first.
As the number of iterations is quite large, performance becomes an issue. Here is an optimized version of Matty's answer, using an array, which halves the number of iterations (half a million instead of a million) and avoids a test:
declare -a files=( *.txt )
declare -i len=${#files[@]}
declare -i lenm1=$(( len - 1 ))
for (( i = 0; i < lenm1; i++ )); do
    a="${files[i]}"
    ab="${a%.txt}"
    for (( j = i + 1; j < len; j++ )); do
        b="${files[j]}"
        join -1 2 -2 2 "$a" "$b" | cut -d ' ' -f1 > "$ab$b"
    done
done
But consider that bash was not designed for such intensive tasks with half a million iterations. There might be a better (more efficient) way to accomplish what you want.
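If even that is too slow, one option worth trying (an assumption of mine, not part of the answers here; it needs GNU xargs for -P and filenames without whitespace) is to emit the pairs once and run the joins a few at a time in parallel:
# hedged sketch: generate the distinct pairs, then run the joins in parallel
declare -a files=( *.txt )
len=${#files[@]}
for (( i = 0; i < len - 1; i++ )); do
    for (( j = i + 1; j < len; j++ )); do
        printf '%s %s\n' "${files[i]}" "${files[j]}"
    done
done |
xargs -n 2 -P 4 sh -c 'join -1 2 -2 2 "$1" "$2" | cut -d " " -f1 > "${1%.txt}$2"' _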
It looks like what you are after can be accomplished with two nested for loops and a lexicographic comparison to maintain alphabetical order?
# prints pairs of filenames
for f in dir/*; do
    for g in dir/*; do
        if [[ "$f" < "$g" ]]; then # ensure alphabetical order
            echo "$f" "$g"
        fi
    done
done
Here's why you don't want to use bash for this:
First create 1000 files
seq 1000 | xargs touch
Now, distinct pairs with bash
time {
    files=(*)
    len=${#files[@]}
    for ((i=0; i<len-1; i++)); do
        a=${files[i]}
        for ((j=i+1; j<len; j++)); do
            b=${files[j]}
            echo "$a $b"
        done
    done >/dev/null
}
real 0m5.091s
user 0m4.818s
sys 0m0.262s
Versus, for example, the same in perl:
time {
    perl -e '
        opendir my $dh, ".";
        my @files = sort grep {$_ ne "." && $_ ne ".."} readdir $dh;
        closedir $dh;
        for (my $i = 0; $i < @files - 1; $i++) {
            my $a = $files[$i];
            for (my $j = $i + 1; $j < @files; $j++) {
                my $b = $files[$j];
                print "$a $b\n";
            }
        }
    ' > /dev/null
}
real 0m0.131s
user 0m0.120s
sys 0m0.006s
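To go from printed pairs to the actual join output files, the pair list can be fed back into the join command from the question; a sketch (assuming the pairs arrive on stdin, one "A.txt B.txt" pair per line):
# consume the pairs and run the join/cut step for each of them
while read -r a b; do
    join -1 2 -2 2 "$a" "$b" | cut -d ' ' -f1 > "${a%.txt}${b}"
done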

How to get values from one file that fall in a list of ranges from another file

I have a bunch of files with sorted numerical values, for example:
cat tag_1_file.val
234
551
626
cat tag_2_file.val
12
1023
1099
etc.
And one file with tags and value ranges that fit my needs. Values are sorted first by tag, then by 2nd column, then by 3rd. Ranges may overlap.
cat ranges.val
tag_1 200 300
tag_1 600 635
tag_2 421 443
and so on.
So I try to loop through the file with ranges and then, for each range, look for all values that fall into it in the file with the appropriate tag:
cat ~/blahblah/ranges.val | while read -a line;
# read the line as an array
do
    cat ~/blahblah/${line[0]}_file.val | while read number;
    # get the tag name and cat the appropriate file
    do
        if [[ "$number" -ge "${line[1]}" ]] && [[ "$number" -le "${line[2]}" ]]
        # check if the current value falls into the range
        then
            echo $number >> ${line[0]}.output
            # toss values that fall into the interval into another file
        elif [[ "$number" -gt "${line[2]}" ]]
        then
            break
        fi
    done
done
But these two nested while loops are deadly slow with huge files containing 100M+ lines.
I think, there must be more efficient way of doing such things and I'd be grateful for any hint.
UPD: The expected output based on this example is:
cat tag_1.output
234
626
Have you tried recoding the inner loop in something more efficient than Bash? Perl would probably be good enough:
while read tag low hi; do
    perl -nle "print if \$_ >= ${low} && \$_ <= ${hi}" \
        <${tag}_file.val >>${tag}.output
done <ranges.val
The behaviour of this version is slightly different in two ways - the loop doesn't bail out once the high point is reached, and the output file is created even if it is empty. Over to you if that isn't what you want!
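If the early exit matters (the question says the values in each tag file are sorted), a hedged tweak of the same one-liner restores it:
# stop reading a tag file once the values exceed the high end of the range
while read tag low hi; do
    perl -nle "last if \$_ > ${hi}; print if \$_ >= ${low}" \
        <${tag}_file.val >>${tag}.output
done <ranges.val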
Another, not so efficient, implementation with awk:
$ awk 'NR==FNR {t[NR]=$1; s[NR]=$2; e[NR]=$3; next}
       {for(k in t)
          if(t[k]==FILENAME) {
            inout = t[k] "." ((s[k]<=$1 && $1<=e[k]) ? "in" : "out");
            print > inout;
            next}}' ranges tag_1 tag_2
$ head tag_?.*
==> tag_1.in <==
234
==> tag_1.out <==
551
626
==> tag_2.out <==
12
1023
1099
Note that I renamed the files to match the tag names, otherwise you would have to add tag extraction from the filenames. The suffix ".in" is for values in range and ".out" for those that are not. This depends on the sorted order of the files. If you have thousands of tag files, adding another layer to filter the ranges per tag will speed it up; right now it iterates over all ranges.
I'd write
while read -u3 -r tag start end; do
    f="${tag}_file.val"
    if [[ -r $f ]]; then
        while read -u4 -r num; do
            (( start <= num && num <= end )) && echo "$num"
        done 4< "$f"
    fi
done 3< ranges.val
I'm deliberately reading the files on separate file descriptors, otherwise the inner while-read loop will also slurp up the rest of "ranges.val".
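For contrast, a minimal sketch of that pitfall (my illustration, not part of the answer): with a single descriptor, both reads compete for the same stream.
# broken sketch: both reads share stdin, so the inner loop drains ranges.val
while read -r tag start end; do
    while read -r num; do   # consumes the remaining lines of ranges.val
        :
    done
done < ranges.val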
bash while-read loops are very slow. I'll be back in a few minutes with an alternate solution
here's a GNU awk answer (requires, I believe, a fairly recent version)
gawk '
@load "filefuncs"
function read_file(tag, start, end,    file, number, statdata) {
    file = tag "_file.val"
    if (stat(file, statdata) != -1) {
        while ((getline number < file) > 0) {
            if (start <= number && number <= end) print number
        }
        close(file)   # so the same tag file can be re-read for the next range
    }
}
{read_file($1, $2, $3)}
' ranges.val
perl
perl -Mautodie -ane '
    $file = $F[0] . "_file.val";
    next unless -r $file;
    open $fh, "<", $file;
    while ($num = <$fh>) {
        print $num if $F[1] <= $num and $num <= $F[2]
    }
    close $fh;
' ranges.val
I have a solution for you from bioinformatics:
We have a format and a tool for this kind of task.
The format, called .bed, is used to describe ranges on chromosomes, but it should work with your tags too.
The best toolset for this format is bedtools, which is lightning fast.
The specific tool, which might help you is intersect.
With this installed, it becomes a task of formatting the data for the tool:
#!/bin/bash
# reformatting your positions to .bed format:
#  1 adding the tag to each line
#  2 repeating the position to make it a range
#  3 converting to tab-separation
awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g' >all_positions_in_one_range_file.bed

# making your range file tab-separated
sed 's/ /\t/g' ranges.val >ranges_with_tab.bed

# doing the real comparison of the ranges with bedtools
bedtools intersect -a all_positions_in_one_range_file.bed -b ranges_with_tab.bed >all_positions_intersected.bed

# splitting the one result file back into files named by your tag
awk -F $'\t' '{print $2 >$1".out"}' all_positions_intersected.bed
Or if you prefer oneliners:
bedtools intersect -a <(awk -F $'\t' 'BEGIN {OFS = FS} {print FILENAME, $0, $0}' *_file.val | sed 's/_file.val//g') -b <(sed 's/ /\t/g' ranges.val) | awk -F $'\t' '{print $2 >$1".out"}'

KSH Shell script - Process file by blocks of lines

I am trying to write a bash script in a KSH environment that would iterate through a source text file and process it in blocks of lines.
So far I have come up with this code, although it seems to run indefinitely, since the tail command does not return 0 lines when asked to retrieve lines beyond those in the source text file.
i=1
while [[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
do
    lc=$((i * 1000))
    DA=ProcessingResult_$i.csv
    head -$lc /path/to/source.file | tail -1000 > /path/to/block.file
    cd /path/to/processing/batch
    ./process.sh   # This will process /path/to/block.file
    mv /output/directory/ProcessingResult.csv /output/directory/$DA
    i=$((i + 1))
done
Before launching the above script I perform a manual 'first injection': head -$lc /path/to/source.file | tail -1000 > /path/to/temp.source.file
Any idea on how to get the script to stop after processing the last lines from the source file?
Thanks in advance to you all
If you do not want to create so many temporary files up front before beginning to process each block, you could try the solution below. It can save a lot of space when processing huge files.
#!/usr/bin/ksh
range=$1
file=$2
b=0; e=0; seq=1
while true
do
    b=$((e+1)); e=$((range*seq));
    sed -n ${b},${e}p $file > ${file}.temp
    [ $(wc -l ${file}.temp | cut -d " " -f 1) -eq 0 ] && break
    ## process ${file}.temp as per your need ##
    ((seq++))
done
The above code generates only one temporary file at a time.
You could pass the range (block size) and the filename as command-line args to the script.
example: extractblock.sh 1000 inputfile.txt
Have a look at man split:
NAME
       split - split a file into pieces
SYNOPSIS
       split [OPTION]... [INPUT [PREFIX]]

       -l, --lines=NUMBER
              put NUMBER lines per output file
For example
split -l 1000 source.file
Or, for example, to extract the 3rd chunk (here 1000 is not a number of lines; it is the number of chunks, so a chunk is 1/1000 of source.file):
split -nl/3/1000 source.file
A note on the condition:
[[ `wc -l /path/to/block.file | awk -F' ' '{print $1}'` -gt $((i * 1000)) ]]
Maybe it should be source.file instead of block.file. It is also quite inefficient on a big file, because it re-reads (counts the lines of) the file on each iteration; the number of lines can be stored in a variable once, and reading the file on standard input with wc avoids the need for awk:
nb_lines=$(wc -l </path/to/source.file )
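Putting those pieces together, a sketch of the original loop that stops after the last (possibly partial) block could look like this (paths and process.sh are taken from the question; the sketch itself is untested):
# precompute the line count once and stop after the last block
nb_lines=$(wc -l </path/to/source.file)
i=1
while (( (i - 1) * 1000 < nb_lines )); do
    DA=ProcessingResult_$i.csv
    start=$(( (i - 1) * 1000 + 1 ))
    tail -n +"$start" /path/to/source.file | head -n 1000 > /path/to/block.file
    cd /path/to/processing/batch || exit 1
    ./process.sh    # processes /path/to/block.file
    mv /output/directory/ProcessingResult.csv "/output/directory/$DA"
    i=$((i + 1))
done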
With Nahuel's recommendation I was able to build the script like this:
i=1
cd /path/to/sourcefile/
split source.file -l 1000 SF
for sf in /path/to/sourcefile/SF*
do
    DA=ProcessingResult_$i.csv
    cd /path/to/sourcefile/
    cat $sf > /path/to/block.file
    rm $sf
    cd /path/to/processing/batch
    ./process.sh   # This will process /path/to/block.file
    mv /output/directory/ProcessingResult.csv /output/directory/$DA
    i=$((i + 1))
done
This worked great

adding numbers without grep -c option

I have a txt file like
Peugeot:406:1999:Silver:1
Ford:Fiesta:1995:Red:2
Peugeot:206:2000:Black:1
Ford:Fiesta:1995:Red:2
I am looking for a command that counts the number of red Ford Fiesta cars.
The last number in each line is the amount of that particular car.
The command I am looking for CANNOT use the -c option of grep.
So this command should just output the number 4.
Any help would be welcome, thank you.
A simple bit of awk would do the trick:
awk -F: '$1=="Ford" && $4=="Red" { c+=$5 } END { print c }' file
Output:
4
Explanation:
The -F: switch means that the input field separator is a colon, so the car manufacturer is $1 (the 1st field), the model is $2, etc.
If the 1st field is "Ford" and the 4th field is "Red", then add the value of the 5th (last) field to the variable c. Once the whole file has been processed, print out the value of c.
For a native bash solution:
c=0
while IFS=":" read -ra col; do
    [[ ${col[0]} == Ford ]] && [[ ${col[3]} == Red ]] && (( c += col[4] ))
done < file && echo $c
Effectively applies the same logic as the awk one above, without any additional dependencies.
Methods:
1.) Use some scripting language for counting, such as awk or perl. An awk solution is already posted; here is a perl solution.
perl -F: -lane '$s+=$F[4] if m/Ford:.*:Red/}{print $s' < carfile
#or
perl -F: -lane '$s+=$F[4] if ($F[0]=~m/Ford/ && $F[3]=~/Red/)}{print $s' < carfile
Both examples print
4
2.) The second method is based on shell pipelining:
- filter out the right rows
- extract the column with the count
- sum the numbers
Some examples:
grep 'Ford:.*:Red:' carfile | cut -d: -f5 | paste -sd+ | bc
- the grep filters out the right rows
- the cut extracts the last column
- the paste creates a line like 2+2
- the bc sums it up
Another example:
sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile | paste -sd+ | bc
the sed filters and extracts in one step
Another example, a different way of summing:
(echo 0 ; sed -n 's/\(Ford:.*:Red\):\(.*\)/\2+/p' carfile ;echo p )| dc
The numbers are summed by the RPN calculator dc; it works like 0 2 +, i.e. first come the values and last the operation.
- the first echo pushes 0 onto the stack
- the sed creates a stream of numbers like 2+ 2+
- the last echo p prints the result
Many other possibilities exist for summing a stream of numbers,
e.g. summing with bash:
while read -r num
do
    sum=$(( $sum + $num ))
done < <(sed -n 's/\(Ford:.*:Red\):\(.*\)/\2/p' carfile)
and pure bash:
while IFS=: read -r maker model year color count
do
    if [[ "$maker" == "Ford" && "$color" == "Red" ]]
    then
        (( sum += $count ))
    fi
done < carfile
echo $sum

Get 20% of lines in File randomly

This is my code:
nb_lignes=`wc -l $1 | cut -d " " -f1`
for i in $(seq $nb_lignes)
do
    m=`head $1 -n $i | tail -1`
    # command
done
Please, how can I change it to get 20% of the lines of the file randomly and apply "command" to each line?
It could be 20%, 40% or 60% (it's a parameter).
Thank you.
This will randomly get 20% of the lines in the file:
awk -v p=20 'BEGIN {srand()} rand() <= p/100' filename
So something like this for the whole solution (assuming bash):
#!/bin/bash
filename="$1"
pct="${2:-20}" # specify percentage
while read line; do
    : # some command with "$line"
done < <(awk -v p="$pct" 'BEGIN {srand()} rand() <= p/100' "$filename")
If you're using a shell without process substitution (the <(...) bit), you can do this instead - but the body of the loop won't be able to have any side effects in the outer script (e.g. any variables it sets won't still be set once the loop completes):
#!/bin/sh
filename="$1"
pct="${2:-20}" # specify percentage
awk -v p="$pct" 'BEGIN {srand()} rand() <= p/100' "$filename" |
while read line; do
    : # some command with "$line"
done
Try this:
file=$1
nb_lignes=$(wc -l $file | cut -d " " -f1)
num_lines_to_get=$((20*${nb_lignes}/100))
for (( i=0; i < $num_lines_to_get; i++ ))
do
    line=$(head -$((${RANDOM} % $nb_lignes)) $file | tail -1)
    echo "$line"
done
Note that ${RANDOM} only generates numbers less than 32768 so this approach won't work for large files.
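One workaround (my assumption, echoing the $RANDOM-combining trick used in the sampling script earlier on this page) is to build a larger random number from two $RANDOM values:
# two 15-bit $RANDOM values give a 30-bit number, enough for ~10^9 lines
rand30=$(( RANDOM * 32768 + RANDOM ))
line=$(head -n $(( rand30 % nb_lignes + 1 )) "$file" | tail -1)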
If you have shuf installed, you can use the following to get a random line instead of using $RANDOM.
line=$(shuf -n 1 $file)
You can do it with awk. See below:
awk -v b=20 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' all.log
The above command prints 20% of all the lines, starting from the beginning of the file.
You just have to change the value of b on the command line to get the required percentage of lines.
Tested below:
> cat temp
1
2
3
4
5
6
7
8
9
10
> awk -v b=10 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' temp
1
> awk -v b=20 '{a[NR]=$0}END{val=((b/100)*NR)+1;for(i=1;i<val;i++)print a[i]}' temp
1
2
>
shuf will produce the file in a randomized order; if you know how many lines you want, you can give that to the -n parameter. No need to get them one at a time. So:
shuf -n $(( $(wc -l < "$file") * $pct / 100 )) "$file" |
while read line; do
    # do something with $line
done
shuf comes standard with GNU/Linux distros afaik.
