change random lines with a shellscript

How can I easily (quick and dirty) change, say, 10 random lines of a file with a simple shellscript?
I thought about abusing ed and generating random commands and line ranges, but I'd like to know if there is a better way.
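For reference, the ed idea could look something like this (a rough sketch, assuming GNU shuf to pick the line numbers, with replacement standing in for whatever edit you want):
c=$(wc -l < file)
{
    # one "Ns/.*/replacement/" command per randomly chosen line
    shuf -i 1-"$c" -n 10 | sed 's|$|s/.*/replacement/|'
    echo w    # write the file back and quit
    echo q
} | ed -s file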

awk 'BEGIN{srand()}
{ lines[++c]=$0 }
END{
    while(d<10){
        RANDOM = int(1 + rand() * c)
        if( !( RANDOM in r) ) {
            r[RANDOM]
            print "do something with " lines[RANDOM]
            ++d
        }
    }
}' file
Or, if you have the shuf command:
shuf -n 10 "$file" | while read -r line
do
    sed -i "s/$line/replacement/" "$file"
done
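Note that this substitutes by content: a picked line that contains regex metacharacters, or that occurs on more than one line, can misfire. A variant that edits by line number instead (a sketch, assuming GNU shuf for the -i range option):
c=$(wc -l < "$file")
shuf -i 1-"$c" -n 10 | while read -r n
do
    sed -i "${n}s/.*/replacement/" "$file"
done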

Playing off @Dennis's version, this will always output 10. Picking random
line numbers into a separate array can create duplicates and, consequently,
fewer than 10 modifications. This version instead decides for each line, as it
is read, whether to pick it, with probability (picks still needed) / (lines
still remaining), which yields exactly 10 picks in a single pass.
file=~/testfile
c=$(wc -l < "$file")
awk -v c=$c '
BEGIN {
    srand();
    count = 10;
}
{
    if (c*rand() < count) {
        --count;
        print "do something with " $0;
    } else
        print;
    --c;
}
' "$file"

This seems to be quite a bit faster, though note that the ten line numbers are drawn independently, so duplicates are possible and it may modify fewer than 10 lines:
file=/your/input/file
c=$(wc -l < "$file")
awk -v c=$c 'BEGIN {
    srand();
    for (i=0;i<10;i++) lines[i] = int(1 + rand() * c);
    asort(lines);    # gawk extension
    p = 1
}
{
    if (NR == lines[p]) {
        ++p
        print "do something with " $0
    }
    else print
}' "$file"

awk to process the first two lines then the next two and so on

Suppose I have a file which I created from two files, one old and the other the updated version, by using cat & sort on the primary key.
File1
102310863||7097881||6845193||271640||06007709532577||||
102310863||7097881||6845123||271640||06007709532577||||
102310875||7092992||6840808||023740||10034500635650||||
102310875||7092992||6840818||023740||10034500635650||||
So the pattern of this file is: line 1 = old value, line 2 = updated value, and so on.
Now I want to process the file in such a way that awk first processes the first two lines of the file and finds the difference, then moves on to the next two lines.
The process is:
if($[old record]!=$[new record])
i= [new record]#[old record];
Desired output
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
$ cat tst.awk
BEGIN { FS="[|][|]"; OFS="||" }
NR%2 { split($0,old); next }        # odd lines: stash the old record
{
    for (i=1;i<=NF;i++) {
        if (old[i] != $i) {
            $i = $i "#" old[i]      # changed field: print as new#old
        }
    }
    print
}
$
$ awk -f tst.awk file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
This awk could help:
$ awk -F '\\|\\|' '{
      getline new;                      # read the updated line of the pair
      split(new, new_array, "\\|\\|");
      for(i=1;i<=NF;i++) {
          if($i != new_array[i]) {
              $i = new_array[i]"#"$i;   # changed field: new#old
          }
      }
  } 1' OFS="||" < input_file
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||
The getline call pulls the updated line into new while the old line is still in $0, and each differing field becomes new#old; I'll skip a fuller explanation.
Updated version, and thanks @martin for the double | trick:
$ cat join.awk
BEGIN {new=0; FS="[|]{2}"; OFS="||"}
new==0 {
    split($0, old_data, "[|]{2}")
    new=1
    next
}
new==1 {
    split($0, new_data, "[|]{2}")
    for (i = 1; i <= 7; i++) {
        if (new_data[i] != old_data[i]) new_data[i] = new_data[i] "#" old_data[i]
    }
    print new_data[1], new_data[2], new_data[3], new_data[4], new_data[5], new_data[6], new_data[7]
    new = 0
}
$ awk -f join.awk data.txt
102310863||7097881||6845123#6845193||271640||06007709532577||||
102310875||7092992||6840818#6840808||023740||10034500635650||||

addition of variables combined with >/< test BASH

So I am trying to write a bash script to check whether all values in a data set are within a certain margin of the average.
So far:
#!/bin/bash
cat massbuild.csv
while IFS=, read col1 col2
do
    x=$(grep "$col2" $col1.pdb | grep "HETATM" | awk '{ sum += $7; n++ } END { if (n > 0) print sum / n; }')
    i=$(grep "$col2" $col1.pdb | grep "HETATM" | awk '{print $7;}')
    if $(($i > $[$x + 15])); then
        echo "OUTSIDE THE RANGE!"
    fi
done < massbuild.csv
So far I have broken it down component by component to test, and found that the values of x and i are read correctly, but it seems that adding 15 to x, or the comparison to i, doesn't work.
I have read around online and I am stumped.
Without sample input and expected output we're just guessing, but MAYBE this is the right starting point for your script (untested, of course, since no in/out was provided):
#!/bin/bash
awk -F, '
NR==FNR {
    file = $1 ".pdb"
    if (!(file in file2col2s))    # queue each .pdb file for processing once
        ARGV[ARGC++] = file
    file2col2s[file] = (file2col2s[file] ? file2col2s[file] FS : "") $2
    next
}
FNR==1 { split(file2col2s[FILENAME],col2s) }
/HETATM/ {
    for (i=1;i in col2s;i++) {
        col2 = col2s[i]
        if ($0 ~ col2) {
            sum[FILENAME,col2] += $7
            cnt[FILENAME,col2]++
        }
    }
}
END {
    for (file in file2col2s) {
        split(file2col2s[file],col2s)
        for (i=1;i in col2s;i++) {
            col2 = col2s[i]
            print sum[file,col2]
            print cnt[file,col2]
        }
    }
}
' massbuild.csv
Does this help?
a=4; b=0; if [ "$a" -lt "$(( $b + 5 ))" ]; then echo "a < b + 5"; else echo "a >= b + 5"; fi
Ref: http://www.tldp.org/LDP/abs/html/comparison-ops.html
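Applied to the loop in the question, the test might look like this (a sketch; note that bash arithmetic is integer-only, so decimal coordinates from a .pdb file would need awk or bc instead, and $i must hold a single value, whereas the grep | awk pipeline can return several):
if [ "$i" -gt "$(( x + 15 ))" ]; then
    echo "OUTSIDE THE RANGE!"
fi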

how to sum each column in a file using bash

I have a file on the following format
id_1,1,0,2,3,lable1
id_2,3,2,2,1,lable1
id_3,5,1,7,6,lable1
and I want the summation of each column (I have over 300 columns):
9,3,11,10,lable1
How can I do that using bash?
I tried using what is described here, but it didn't work.
Using awk:
$ awk -F, '{for (i=2;i<NF;i++)a[i]+=$i}END{for (i=2;i<NF;i++) printf a[i]",";print $NF}' file
9,3,11,10,lable1
This will print the sum of each column (from i=2 up to i=NF-1) as a comma-separated list, followed by the value of the last column from the last row (i.e. lable1).
If the totals would need to be grouped by the label in the last column, you could try this:
awk -F, '
{
    L[$NF]
    for(i=2; i<NF; i++) T[$NF,i]+=$i
}
END{
    for(i in L){
        s=i
        for(j=NF-1; j>1; j--) s=T[i,j] FS s
        print s
    }
}
' file
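With the sample data above, where every row carries the same label, this prints a single line:
9,3,11,10,lable1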
If the labels in the last column are sorted then you could try without arrays and save memory:
awk -F, '
function labelsum(){
    s=p
    for(i=NF-1; i>1; i--) s=T[i] FS s
    print s
    split(x,T)    # empty the totals array for the next label
}
p!=$NF{
    if(p) labelsum()
    p=$NF
}
{
    for(i=2; i<NF; i++) T[i]+=$i
}
END {
    labelsum()
}
' file
Here's a Perl one-liner:
<file perl -lanF, -E 'for ( 0 .. $#F ) { $sums{ $_ } += $F[ $_ ]; } END { say join ",", map { $sums{ $_ } } sort { $a <=> $b } keys %sums; }'
It will only do sums, so the first and last column in your example will be 0.
This version will follow your example output:
<file perl -lanF, -E 'for ( 1 .. $#F - 1 ) { $sums{ $_ } += $F[ $_ ]; } END { $sums{ $#F } = $F[ -1 ]; say join ",", map { $sums{ $_ } } sort { $a <=> $b } keys %sums; }'
A modified version based on the solution you linked:
#!/bin/bash
colnum=6
filename="temp"
for ((i=2;i<$colnum;++i))
do
    sum=$(cut -d ',' -f $i $filename | paste -sd+ | bc)
    echo -n $sum','
done
head -1 $filename | cut -d ',' -f $colnum
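To see what the paste -sd+ trick is doing, here is column 2 of the sample data at each stage (assuming the sample rows above are saved as temp):
$ cut -d ',' -f 2 temp | paste -sd+
1+3+5
$ cut -d ',' -f 2 temp | paste -sd+ | bc
9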
Pure bash solution:
#!/usr/bin/bash
while IFS=, read -r -a arr
do
    for((i=1;i<${#arr[*]}-1;i++))
    do
        ((farr[$i]=${farr[$i]}+${arr[$i]}))
    done
    farr[$i]=${arr[$i]}
done < file
(IFS=,;echo "${farr[*]}")

How to sort the lines in a file from shortest to longest?

Similar to Sorting lines from longest to shortest, how can I sort all of the lines in a file from shortest to longest? E.g.:
This is a long sentence.
This is not so long.
This is not long.
That becomes:
This is not long.
This is not so long.
This is a long sentence.
It's almost exactly the same as in the link you gave:
awk '{ print length($0) " " $0; }' $file | sort -n | cut -d ' ' -f 2-
The -r option there was for reversing the sort.
perl -ne 'push @a, $_ } { print sort { length $a <=> length $b } @a' input
(On my box, this runs about 4 times faster than the awk | sort | cut solution.)
Note that this uses a terrible perl idiom and abuses the semantics of -n to save a few keystrokes. It would be better to write this as:
perl -ne '{ push @a, $_ } END { print sort { length $a <=> length $b } @a }' input
Note that this solution does not perform well on large input, since it holds the whole file in memory.
You could also do the sorting within awk:
cat << EOF > file
This is a long sentence.
This is not so long.
This is not long.
EOF
sort.awk
# Only find the length once
{ len = length($0) }
# If we haven't seen this length before, add the line to the lines array
# and move on to the next record
lines[len] == "" { lines[len] = $0; next }
# A line of duplicate length: append it to the previous record
{ lines[len] = lines[len] RS $0 }
END {
    # Sort the length indices into the indices array; the "@ind_num_asc"
    # specifier (gawk 4+) forces numeric order, since the default string
    # sort would put e.g. length 10 before length 2
    n = asorti(lines, indices, "@ind_num_asc")
    for(i = 1; i <= n; i++)
        print lines[indices[i]]
}
Run like this:
awk -f sort.awk file
Or as a one-liner:
< file awk '{ len = length($0) } lines[len] == "" { lines[len] = $0; next } { lines[len] = lines[len] RS $0 } END { n = asorti(lines, indices, "@ind_num_asc"); for(i = 1; i <= n; i++) print lines[indices[i]] }'
Output:
This is not long.
This is not so long.
This is a long sentence.
Another perl implementation:
perl -ne 'print length($_)." $_"' file | sort -n | cut -d ' ' -f 2-
$_ is the current line, similar to awk's $0
With POSIX Awk:
{
    c = length
    m[c] = m[c] ? m[c] RS $0 : $0
    if (c > max) max = c
} END {
    # "for (c in m)" order is unspecified in POSIX awk,
    # so walk the lengths numerically instead
    for (c = 1; c <= max; c++)
        if (c in m) print m[c]
}
Example, assuming the block above is saved as len.awk (the name is just for illustration):
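$ awk -f len.awk file
This is not long.
This is not so long.
This is a long sentence.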

How to parse data vertically in shell script?

I have data printed out in the console like this:
A B C D E
1 2 3 4 5
I want to manipulate it so A:1 B:2 C:3 D:4 E:5 is printed.
What is the best way to go about it? Should I tokenize the two lines and then print it out using arrays?
How do I go about it in bash?
Awk is good for this.
awk 'NR==1{for(i=1;i<=NF;i++){row[i]=$i}} NR==2{for(i=1;i<=NF;i++){printf "%s:%s%s",row[i],$i,(i<NF?" ":"\n")}}' oldfile > newfile
A slightly more readable version for scripts:
#!/usr/bin/awk -f
NR == 1 {
    for(i = 1; i <= NF; i++) {
        first_row[i] = $i
    }
}
NR == 2 {
    for(i = 1; i <= NF; i++) {
        printf "%s:%s", first_row[i], $i
        if (i < NF) printf " "
    }
    print ""
}
If you want it to scale vertically, you'll have to say how.
For two lines with any number of elements:
(read LINE;
LINE_ONE=($LINE);
read LINE;
LINE_TWO=($LINE);
for i in `seq 0 $((${#LINE_ONE[@]} - 1))`;
do
echo ${LINE_ONE[$i]}:${LINE_TWO[$i]};
done)
To do pairs of lines, just wrap it in a loop, as sketched below.
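A minimal sketch of that wrapping (assuming the pairs keep coming until input runs out):
while read -r LINE; do
    LINE_ONE=($LINE)
    read -r LINE || break
    LINE_TWO=($LINE)
    for i in $(seq 0 $((${#LINE_ONE[@]} - 1))); do
        echo "${LINE_ONE[$i]}:${LINE_TWO[$i]}"
    done
done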
This might work for you:
echo -e "A B C D E FFF GGG\n1 2 3 4 5 666 7" |
sed 's/ \|$/:/g;N;:a;s/\([^:]*:\)\([^\n]*\)\n\([^: ]\+ *\)/\2\1\3\n/;ta;s/\n//'
A:1 B:2 C:3 D:4 E:5 FFF:666 GGG:7
Perl one-liner:
perl -lane 'if($.%2){@k=@F}else{print join" ",map{"$k[$_]:$F[$_]"}0..$#F}'
Somewhat more legible version:
#!/usr/bin/perl
my @keys;
while (<>) {
    chomp;
    if ($. % 2) { # odd lines are keys
        @keys = split ' ', $_;
    } else { # even lines are values
        my @values = split ' ', $_;
        print join(' ', map { "$keys[$_]:$values[$_]" } 0..$#values), "\n";
    }
}
