I have a file that I want to import into a database table, with one piece of the file in each row. In the import, I need to indicate for each row the offset (first byte) and length (number of bytes) of its piece.
I have the following files:
*line_numbers.txt* -> Each row contains the line number of the last line of a record in *plans.txt*.
*plans.txt* -> All the information required for all the rows.
I have the following code:
# Starting line number of the record
sLine=0
# Starting byte value of the record
offSet=0
while read line
do
    endByte=`awk -v fline=${sLine} -v lline=${line} \
        '{ if (NR > fline && NR < lline) sum += length($0) } \
        END { print sum }' plans.txt`
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    sLine=$((line+1))
    offSet=$((endByte+offSet))
done < line_numbers.txt
This code will write something similar to the following into lobs.in:
"plans.txt.0.504/"
"plans.txt.505.480/"
"plans.txt.984.480/"
"plans.txt.1464.1159/"
"plans.txt.2623.515/"
This means, for example, that the first record starts at byte 0 and continues for the next 504 bytes. The next starts at byte 505 and continues for the next 480 bytes.
I still have to run more tests, but it seems to be working.
My problem is that it is very, very slow for the volume I need to process.
Do you have any performance tips?
I looked for a way to move the loop into awk, but I need two input files and I don't know how to process them without the while loop.
Thank you!
Doing this all in awk would be much faster.
Suppose you have:
$ cat lines.txt
100
200
300
360
10000
50000
And:
$ awk -v maxl=50000 'BEGIN{for (i=1;i<=maxl;i++) printf "Line %d\n", i}' >data.txt
(So you have Line 1\nLine 2\n...Line maxl in the file data.txt)
You would do something like:
awk 'FNR==NR { lines[FNR]=$1; next }   # first file: end line of each record
     { data[FNR]=length($0) }          # second file: length of each line
     END {
         sl = 1
         for (i=1; i in lines; i++) {
             bc = 0
             for (j=sl; j<=lines[i]; j++)
                 bc += data[j]
             printf "line %d to %d is %d bytes\n", sl, j-1, bc
             sl = lines[i]+1
         }
     }' lines.txt data.txt
line 1 to 100 is 1392 bytes
line 101 to 200 is 1500 bytes
line 201 to 300 is 1500 bytes
line 301 to 360 is 900 bytes
line 361 to 10000 is 153602 bytes
line 10001 to 50000 is 680000 bytes
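Mapping that back to your two files, here is a sketch that keeps a running byte total while reading plans.txt, so the END block is one subtraction per record and can emit the lobs.in lines directly. One assumption to verify: the +1 treats each newline as part of the record, since awk's length($0) does not count it; drop the +1 if your loader wants lengths without newlines.
awk 'FNR==NR { last[++n] = $1; next }            # line_numbers.txt: end line of each record
     { cum[FNR] = cum[FNR-1] + length($0) + 1 }  # plans.txt: running byte count
                                                 # (+1 counts the newline each line ends with)
     END {
         sl = 1
         for (i = 1; i <= n; i++) {
             printf "\"plans.txt.%d.%d/\"\n", cum[sl-1], cum[last[i]] - cum[sl-1]
             sl = last[i] + 1
         }
     }' line_numbers.txt plans.txt > lobs.in
This reads each file exactly once, instead of re-scanning plans.txt for every record, which is where the original loop spends its time.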
A simple improvement: never redirect with >> inside a loop when you can redirect once, outside the loop. Worse:
while read line
do
    # .... stuff omitted ...
    echo "\"plans.txt.${offSet}.${endByte}/\"" >> lobs.in
    # ....
done < line_numbers.txt
Note how the only line in the loop that produces output is the echo. Better:
while read line
do
    # .... stuff omitted ...
    echo "\"plans.txt.${offSet}.${endByte}/\""
    # ....
done < line_numbers.txt >> lobs.in
I have about 54,000 packets to analyze, and I am trying to determine the average number of packets per second (as well as the minimum and maximum number of packets during a given second).
My input file is a single column of the packet times (see sample below):
0.004
0.015
0.030
0.050
..
..
1999.99
I've used awk to determine the timing deltas but can't figure out a way to parse out the chunks of time to get an output of:
0-1s = 10 packets
1-2s = 15 packets
etc
Here is an example of how you can use awk to get the desired output.
Suppose your original input file is sample.txt. The first thing to do is reverse-sort it (sort -nr); then you can supply awk with the newly sorted data along with the time variable through awk's -v argument. Perform your tests inside awk, making use of next to skip lines and exit to quit the awk script when needed.
#!/bin/bash
#
for i in 0 1 2 3
do
    sort -nr sample.txt | awk -v time=$i '
        $1 >= time+1 { next }              # above the window: skip
        $1 >= time   { number++; next }    # inside [time, time+1): count it
                     { exit }              # below the window: stop reading early
        # awk still runs END after exit, so every bucket gets printed
        END { printf "[ %d - %d [ : %d records\n", time, time+1, number }
    '
done
Here's the sample file:
0.1
0.2
0.8
.
.
0.94
.
.
1.5
1.9
.
3.0
3.6
Here's the program's output:
[ 1 - 2 [ : 5 records
[ 2 - 3 [ : 8 records
[ 3 - 4 [ : 2 records
Hope this helps!
Would you please try the following:
With bash:
max=0
while read -r line; do
    i=${line%.*}                  # extract the integer part
    a[$i]=$(( ${a[$i]} + 1 ))     # increment the array element
    (( i > max )) && max=$i       # update the maximum index
done < sample.txt

# report the summary
for (( i=0; i<=max; i++ )); do
    printf "%d-%ds = %d packets\n" "$i" $(( i+1 )) "${a[$i]}"
done
With AWK:
awk '
{
i = int($0)
a[i]++
if (i > max) max = i
}
END {
for (i=0; i<=max; i++)
printf("%d-%ds = %d packets\n", i, i+1, a[i])
}' sample.txt
sample.txt:
0.185
0.274
0.802
1.204
1.375
1.636
1.700
1.774
1.963
2.044
2.112
2.236
2.273
2.642
2.882
3.000
3.141
5.023
5.082
Output:
0-1s = 3 packets
1-2s = 6 packets
2-3s = 6 packets
3-4s = 2 packets
4-5s = 0 packets
5-6s = 2 packets
Hope this helps.
I have a column of several rows like this:
20.000
15.000
42.500
42.500
45.000
45.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
and I need to end up with a file where:
the first element is 20/2
the second element is the previous value + 15/2
the third element is the previous value + 42.5/2
and so on until the end
My problem is how to do the "loop".
Perl to the rescue:
perl -lne 'print $s += $_ / 2' input-file > output-file
-l removes newlines from input and adds them to output
-n reads the input line by line, executing the code for each
$_ is the value read from each line
/ 2 is division by 2
+= is the operator that adds its right-hand side to its left-hand side and stores the result in the left-hand side, returning the new value. I named the variable $s as in "sum".
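As a quick sanity check, feeding the first three values from the question through it:
$ printf '20.000\n15.000\n42.500\n' | perl -lne 'print $s += $_ / 2'
10
17.5
38.75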
simply,
$ awk '{print v+=$1/2}' file
10
17.5
38.75
60
82.5
105
130
155
180
205
230
255
280
305
330
You can add printf formatting if needed.
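For example, to print two decimal places:
awk '{ printf "%.2f\n", (v += $1/2) }' file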
Try this:
awk '{prev += ($0) / 2; printf("%.3f\n", prev);}' a2.txt
Input:
20.000
15.000
42.500
42.500
45.000
45.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
50.000
Output:
10.000
17.500
38.750
60.000
82.500
105.000
130.000
155.000
180.000
205.000
230.000
255.000
280.000
305.000
330.000
I guess you need output to be one line:
awk '{s+=$1/2; out = out s " ";} END{print out}' file
#=> 10 17.5 38.75 60 82.5 105 130 155 180 205 230 255 280 305 330
There's an extra space at the end, which I think does no harm; you can remove it if you don't want it.
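One way to drop it is to build the separator as you go, a small variation on the same one-liner:
awk '{ out = out sep (s += $1/2); sep = " " } END { print out }' file
The separator is empty before the first value and a single space afterwards, so nothing trails the last value.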
I think you might be looking for a for loop:
awk '{for (i = 1; i <= NF; i++) print temp = temp + $i/2 }' filename
Remember one thing: i refers to a column number, so if you want to run this operation on only one column, you can change the loop bounds to
i = column number; i <= column number;
You can use this loop for more complex scenarios.
If you want to change the field separator, you can use the -F parameter:
awk -F ":" '{}' filename
I had a problem solved in a previous post using awk, but now I want to add an if condition to it, and I am getting an error.
Here's the problem:
I had a lot of files that looked like this:
Header
175566717.000
175570730.000
175590376.000
175591966.000
175608932.000
175612924.000
175614836.000
.
.
.
175680016.000
175689679.000
175695803.000
175696330.000
And I wanted to extract the first 2000 lines (line 1 to 2000), then extract the lines 1500 to 3500, then 3000 to 5000 and so on... What I mean is: extract a window of 2000 lines with an overlap of 500 lines between contiguous windows until the end of the file.
This is the awk command used for it:
awk -v i=1 -v t=2000 -v d=501 'NR>1{a[NR-1]=$0}END{
    while (i < NR-1) {
        ++n
        for (k=i; k<i+t; k++) print a[k] > "win"n".txt"
        close("win"n".txt")
        i = i+t-d
    }
}' myfile.txt
And I get several files with names win1.txt, win2.txt, win3.txt, etc...
My problem now is that because the file was not a multiple of 2000, my last window has less than 2000 lines. How can I add a condition that would do this: if the last window has less than 2000 digital numbers, the previous window should have all the lines until the end of the file.
EXTRA INFO
When the windows are created, there is a line break at the end. That is why I needed the condition to take into account a window of less than 2000 digital numbers, and not just lines.
If you don't have to use awk for some other reason, try this sed approach:
#!/bin/bash
file="$(sed '/^\s*$/d' myfile.txt)"    # drop blank lines first
sed -n 1,2000p <<< "$file"
first=1500
last=3500
max=$(wc -l <<< "$file" | awk '{print $1}')
while [[ $max -ge 2000 && $last -lt $((max+1500)) ]]; do
    sed -n "$first","$last"p <<< "$file"
    ((first+=1500))
    ((last+=1500))
done
Obviously this is going to be slower than awk and more error-prone for gigantic files, but it should work in most cases.
Change the while condition to make it stop earlier:
while (i+t <= NR) {
Change the end condition of the for loop to compensate for the last output file being potentially bigger:
for (k = i; k < (i+t+t-d <= NR ? i+t : NR); k++)
The rest of your code can stay the same, although I took the liberty of removing the close statement (why was that?) and of setting d=500, to make the output files really overlap by 500 lines.
awk -v i=1 -v t=2000 -v d=500 'NR>1{a[NR-1]=$0}END{
    while (i+t <= NR) {
        ++n
        for (k=i; k < (i+t+t-d <= NR ? i+t : NR); k++) print a[k] > "win"n".txt"
        i = i+t-d
    }
}' myfile.txt
I tested it with small values of t and d, and it seems to work as requested.
One final remark: for big input files, I wouldn't encourage storing the whole thing in array a.
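For what it's worth, here is a rough streaming sketch of that idea: each line is written to the (at most two) fixed-size windows it belongs to as it is read, so nothing is buffered. It does not implement the merge-the-short-last-window behaviour asked for above, so treat it as a starting point only:
awk -v t=2000 -v d=500 '
NR > 1 {
    ln = NR - 1                        # data line number (header skipped)
    w = int((ln - 1) / (t - d)) + 1    # newest window containing this line
    for (n = w; n >= 1 && ln <= (n-1)*(t-d) + t; n--) {
        print > ("win" n ".txt")
        if (ln == (n-1)*(t-d) + t)     # window n is complete: free its handle
            close("win" n ".txt")
    }
}' myfile.txt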
I have a script that reads from /proc/stat and calculates CPU usage. There are three relevant lines in /proc/stat:
cpu 1312092 24 395204 12582958 77712 456 3890 0 0 0
cpu0 617029 12 204802 8341965 62291 443 2718 0 0 0
cpu1 695063 12 190402 4240992 15420 12 1172 0 0 0
Currently, my script only reads the first line and calculates usage from that:
cpu=($( cat /proc/stat | grep '^cpu[^0-9] ' ))
unset cpu[0]
idle=${cpu[4]}
total=0
for value in "${cpu[@]}"; do
    let total=$(( total+value ))
done
let usage=$(( (1000*(total-idle)/total+5)/10 ))
echo "$usage%"
This works as expected, because the script only parses this line:
cpu 1312092 24 395204 12582958 77712 456 3890 0 0 0
It's easy enough to get only the lines starting with cpu0 and cpu1
cpu=$( cat /proc/stat | grep '^cpu[0-9] ' )
but I don't know how to iterate over each line and apply this same process. I've tried resetting the internal field separator inside a subshell, like this:
cpus=$( cat /proc/stat | grep '^cpu[0-9] ' )
(
    IFS=$'\n'
    for cpu in $cpus; do
        cpu=($cpu)
        unset cpu[0]
        idle=${cpu[4]}
        total=0
        for value in "${cpu[@]}"; do
            let total=$(( total+value ))
        done
        let usage=$(( (1000*(total-idle)/total+5)/10 ))
        echo -n "$usage%"
    done
)
but this gets me an error:
line 18: (1000*(total-idle)/total+5)/10 : division by 0 (error token is "+5)/10 ")
If I echo the cpu variable in the loop, it looks like it's separating the lines properly. I looked at this thread and I think I'm assigning the cpu variable to an array properly, but is there another error I'm not seeing?
I put my script into "whats wrong with my script" and it doesn't show me any errors apart from a warning about using cat within $(), so I'm stumped.
Change this line in the middle of your loop:
IFS=' ' cpu=($cpu)
You need this because outside of your loop you're setting IFS=$'\n', but with that setting cpu=($cpu) won't do what you expect.
Btw, I would write your script like this:
#!/bin/bash -e
grep ^cpu /proc/stat | while IFS=$'\n' read cpu; do
    cpu=($cpu)
    name=${cpu[0]}
    unset cpu[0]
    idle=${cpu[4]}
    total=0
    for value in "${cpu[@]}"; do
        ((total+=value))
    done
    ((usage=(1000 * (total - idle) / total + 5) / 10))
    echo "$name $usage%"
done
The equivalent using awk:
awk '/^cpu/ { total=0; idle=$5; for (i=2; i<=NF; ++i) { total += $i }; print $1, int((1000 * (total - idle) / total + 5) / 10) }' < /proc/stat
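Applied to the three /proc/stat lines quoted in the question, this should print:
cpu 12
cpu0 10
cpu1 18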
Because the OP asked, an awk program.
awk '
/cpu[0-9] .*/ {
    total = 0
    idle = $5
    for (i = 2; i <= NF; i++) { total += $i }   # sum the time fields, $2..$NF
    printf("%s: %f%%\n", $1, 100*(total-idle)/total)
}
' /proc/stat
The /cpu[0-9] .*/ means "execute for every line matching this expression".
The variables like $1 do what you'd expect, but the 1st field has index 1, not 0: $0 means the whole line in awk.
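A quick illustration of that indexing, using one of the lines above:
$ echo 'cpu1 695063 12' | awk '{ print $0 " / " $1 " / " $2 }'
cpu1 695063 12 / cpu1 / 695063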
Let's say I want to collect some statistics over a .txt file that looks like this:
misses 15
hit 18
misses 20
hit 31
I wrote a bash script that just prints every line:
#!/bin/bash
while read line
do
    echo $line
done < t.txt
What I want now is this (in pseudocode):
read every line
if first column is miss add the number to total misses
if hits add the number to total hits
print total hits, and total misses in a file
I have searched and this can be done with awk.
Can you help me do this with awk?
No need for any bash gubbins, awk will do it all
awk 'BEGIN{ hits=0; misses=0 } /^hit/{ hits+=$NF } /^misses/{ misses+=$NF } END{ print "Hits: " hits "\nMisses: " misses }' txtfile
OK, three parts to this one-liner:
BEGIN{ hits=0; misses=0 }
is only run once, before awk reads any lines from txtfile, and initialises the two variables hits and misses.
/^hit/{ hits+=$NF } /^misses/{ misses+=$NF }
is run on each line: if the line begins with hit, the last column is added to the hits variable, and if it begins with misses then the misses variable gets the last column added to it.
END {print "Hits: " hits "\nMisses: " misses' }
Runs only once all lines have been processed and prints out a message detailing hits and misses.
Just translate your algorithm to bash:
#!/bin/bash
while read what count; do
    case $what in
        (hit)    let total_h+=count ;;
        (misses) let total_m+=count ;;
        (*)      echo Invalid input >&2; exit ;;
    esac
done < t.txt
echo Hits: $total_h, Misses: $total_m
awk '/misses/||/hit/{a[$1]+=$2}END{print "total hits",a["hit"],",and total misses",a["misses"]}' your_file
tested:
> cat temp
misses 15
hit 18
misses 20
hit 31
hait 31
> awk '/misses/||/hit/{a[$1]+=$2}END{print "total hits",a["hit"],",and total misses",a["misses"]}' temp
total hits 49 ,and total misses 35
declare misses=0 hit=0
while read var val; do
    let $var+=val
done < f.txt
echo "Misses: $misses"
echo "Hits: $hit"