Fast editing of a subtitle file - bash

I like GNU/Linux and writing bash scripts to automate my tasks, but I am a beginner and still have a lot of problems with it. I have a subtitle file in a format like this (I'm Polish, so the subtitles are in Polish):
00:00:27:W zamierzchłych czasach|ziemia pokryta była lasami.
00:00:31:Od wieków mieszkały|w nich duchy bogów.
00:00:37:Człowiek żył w harmonii ze zwierzętami.
I think you understand this simple format. The problem is that in the movie file there is 1:15 of introduction before the movie starts, so I want to add 1:15 to each line of the subtitle file. The example should then look like this:
00:01:43:W zamierzchłych czasach|ziemia pokryta była lasami.
00:01:46:Od wieków mieszkały|w nich duchy bogów.
00:01:52:Człowiek żył w harmonii ze zwierzętami.
Could you help me to write this script?
BTW, I'm Polish and still learning English, so if you cannot understand me, let me know.

Here's a solution in awk - probably easier than bash for this kind of problem:
#!/usr/bin/awk -f
BEGIN {
    FS = ":"                      # fields: hours, minutes, seconds, text
}
{
    hr  = $1
    min = $2
    sec = $3 + 15                 # add the 15 seconds of the 1:15 offset
    if (sec >= 60) {              # carry seconds over into minutes
        sec -= 60
        min++
    }
    min++                         # add the 1 minute of the 1:15 offset
    if (min >= 60) {              # carry minutes over into hours
        min -= 60
        hr++
    }
    text = $4
    for (i = 5; i <= NF; i++)     # rejoin the rest in case the subtitle
        text = text ":" $i        # text itself contains ":"
    printf "%02d:%02d:%02d:%s\n", hr, min, sec, text
}
Suggestions for improvement welcome!
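If you'd rather not hard-code the 1:15, you could pass the offset in seconds with -v. A minimal sketch (the variable name "offset" and the file name shift.awk are my own choices, not from the question):
#!/usr/bin/awk -f
# Usage: awk -v offset=75 -f shift.awk subtitles.txt > shifted.txt
BEGIN { FS = ":" }
{
    total = 3600*$1 + 60*$2 + $3 + offset       # everything in seconds
    text = $4
    for (i = 5; i <= NF; i++)                   # keep ":" inside the text
        text = text ":" $i
    printf "%02d:%02d:%02d:%s\n", total/3600, (total % 3600)/60, total % 60, text
}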

Related

Extract bibtex entries based on the year

Okay, I have a file.bib file with multiple entries such as:
@Book{Anley:2007:shellcoders-handbook-2nd-ed,
author = {Chris Anley and John Heasman and Felix Lindner and Gerardo
Richarte},
title = "{The Shellcoder's Handbook}",
publisher = {Wiley},
year = 2007,
edition = 2,
month = aug,
}
There you can find the "year = 2007" line. My task is to filter out the years that are greater than 2020 ($currentyear) or lower than 1900 ($minyear); the result should also output the month "may" that stands behind a "year" line in this file (which is a mistake by the admin). (BTW, the file is over 4000 lines long.)
It is better to use awk for this. Similar to your line, it would read:
awk -v t1="1900" -v t2="$(date "+%Y")" \
    '!match($0, /year.*=.*/) { next }
     { t = substr($0, RSTART, RLENGTH)
       match(t, /[0-9][0-9][0-9][0-9]/)
       y = substr(t, RSTART, RLENGTH)
     }
     (y > t1) && (y <= t2) { print y }' file
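The question also asks to print the month that stands behind a matching year line. A sketch of one way to do that (assuming the month line always appears later in the same entry):
awk -v t1=1900 -v t2="$(date +%Y)" '
    match($0, /year[^0-9]*[0-9][0-9][0-9][0-9]/) {
        y = substr($0, RSTART, RLENGTH)
        gsub(/[^0-9]/, "", y)                  # keep just the 4-digit year
        hit = (y+0 > t1 && y+0 <= t2)          # does this entry qualify?
        if (hit) print y
    }
    hit && $1 == "month" {                     # month line of a hit entry
        m = $NF; sub(/,$/, "", m)
        print m
        hit = 0
    }
' file.bib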

Time Delta problem on HackerRank not accepting a correct answer / Python 3

The HackerRank challenge is at the following URL: https://www.hackerrank.com/challenges/python-time-delta/problem
I got test case 0 correct, but the website says my answers for test cases 1 and 2 are wrong. However, in PyCharm I copied the expected output from the website and compared it with my output, and they were exactly the same.
Please have a look at my code.
#!/usr/bin/env python3
# Complete the time_delta function below.
from datetime import datetime

def time_delta(tmp1, tmp2):
    dicto = {'Jan': 1, 'Feb': 2, 'Mar': 3,
             'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9,
             'Oct': 10, 'Nov': 11, 'Dec': 12}
    # extracting t1 from the first timestamp without -xxxx
    t1 = datetime(int(tmp1[2]), dicto[tmp1[1]], int(tmp1[0]),
                  int(tmp1[3][:2]), int(tmp1[3][3:5]), int(tmp1[3][6:]))
    # extracting t2 from the second timestamp without -xxxx
    t2 = datetime(int(tmp2[2]), dicto[tmp2[1]], int(tmp2[0]),
                  int(tmp2[3][:2]), int(tmp2[3][3:5]), int(tmp2[3][6:]))
    # converting -xxxx of timestamp 1
    t1_utc = int(tmp1[4][:3])*3600 + int(tmp1[4][3:])*60
    # converting -xxxx of timestamp 2
    t2_utc = int(tmp2[4][:3])*3600 + int(tmp2[4][3:])*60
    # absolute difference
    return abs(int((t1 - t2).total_seconds() - (t1_utc - t2_utc)))

if __name__ == '__main__':
    # fptr = open(os.environ['OUTPUT_PATH'], 'w')
    t = int(input())
    for t_itr in range(t):
        tmp1 = input().split(' ')[1:]
        tmp2 = input().split(' ')[1:]
        delta = time_delta(tmp1, tmp2)
        print(delta)
t1_utc = int(tmp1[4][:3])*3600 + int(tmp1[4][3:])*60
For a time zone like +0715, you correctly add "7 hours of seconds" and "15 minutes of seconds".
For a time zone like -0715, you are adding "-7 hours of seconds" and "+15 minutes of seconds", resulting in -6h45m instead of -7h15m.
You need to either use the same sign for both parts or apply the sign afterwards.
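In code, the fix could look like this (a sketch; utc_offset_seconds is a helper name of my own, not from the challenge):
def utc_offset_seconds(tz):
    # tz looks like "-0715" or "+0530"; apply one sign to the whole offset
    sign = -1 if tz[0] == '-' else 1
    return sign * (int(tz[1:3]) * 3600 + int(tz[3:5]) * 60)

t1_utc = utc_offset_seconds(tmp1[4])
t2_utc = utc_offset_seconds(tmp2[4])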

AWK performance while processing big files

I have an awk script that I use to calculate how long some transactions take to complete. The script gets the unique ID of each transaction and stores the minimum and maximum timestamp for each one. Then it calculates the difference, and at the end it shows those results that are over 60 seconds.
It works very well with a few hundred thousand lines (200k), but it takes more time in the real world. I tested it several times, and it takes about 15 minutes to process about 28 million lines. Can I consider this good performance, or is it possible to improve it?
I'm open to any kind of suggestion.
Here is the complete code:
zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk '{
    gsub("[()]|:.*", "", $4)        # just removing ugly chars
    ++cont
    min = $4"min"                   # name for minimum value of current transaction
    max = $4"max"                   # same as previous, just for readability
    split($2, secs, /[:,]/)         # split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3]   # turn everything into seconds
    if (arr[min] > seconds || arr[min] == 0)
        arr[min] = seconds
    if (arr[max] < seconds)
        arr[max] = seconds
    dif = arr[max] - arr[min]
    if (dif > 60)
        result[$4] = dif
}
END {
    for (x in result)
        print x" - "result[x]
    print ":Processed "cont" lines"
}'
You don't need to calculate dif every time you read a record; just do it once in the END section.
You don't need that cont variable; just use NR.
You don't need to populate the min and max keys separately; string concatenation is slow in awk.
You shouldn't change $4, as that will force the record to be recompiled.
Try this:
awk '{
    name = $4
    gsub(/[()]|:.*/, "", name)      # just removing ugly chars
    split($2, secs, /[:,]/)         # split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3]   # turn everything into seconds
    if (!(name in min)) {           # first time we see this transaction
        min[name] = max[name] = seconds
    }
    else {
        if (min[name] > seconds) {
            min[name] = seconds
        }
        if (max[name] < seconds) {
            max[name] = seconds
        }
    }
}
END {
    for (name in min) {
        diff = max[name] - min[name]
        if (diff > 60) {
            print name, "-", diff
        }
    }
    print ":Processed", NR, "lines"
}'
After making some tests, and with the suggestions given by Ed Morton (both for code improvement and performance testing), I found that the bottleneck was the zgrep command. Here is an example that does several things:
Checks if we have a transaction line (first if)
Cleans the transaction id
Checks if it has already been registered (second if) by looking it up in the array
If it is not registered, checks whether it is the appropriate type of transaction and, if so, registers the timestamp in seconds
If it is already registered, saves the new timestamp as the maximum
After all that, it does the necessary operations to calculate the time difference
Thank you very much to all who helped me.
zcat /veryBigLog.gz | awk '
{
    if ($4 ~ /^\([[:alnum:]]/) {
        name = $4; gsub(/[()]|:.*/, "", name)
        if (!(name in min)) {
            if ($0 ~ /TypeOFTransaction/) {
                split($2, secs, /[:,]/)
                seconds = 3600*secs[1] + 60*secs[2] + secs[3]
                max[name] = min[name] = seconds
                print length(min)" new "name" start at "seconds
            }
        } else {
            split($2, secs, /[:,]/)
            seconds = 3600*secs[1] + 60*secs[2] + secs[3]
            if (max[name] < seconds) max[name] = seconds
            print name" new max "max[name]
        }
    }
}
END {
    for (x in min) {
        dif = max[x] - min[x]
        print max[x]" max - min "min[x]" : "dif
    }
    print "Processed "NR" Records"
    print "Found "length(min)" MOs"
}'

Awk Calc Avg Rows Below Certain Line

I'm having trouble using awk to calculate the average of specific numbers in a column BELOW a specific text identifier. I have two columns of data, and I want the average to start at a common identifier that repeats, 01/1991. So awk should calculate the average of the line beginning with 01/1991 and the 21 lines below it, 22 rows in total, one per year from 1991 to 2012. The desired output is an average for each TextID/Name entry over all the Januarys (01) of 1991-2012, as shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts being calculated BEFORE the specific identifier line, not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks - explanations of the solution are greatly appreciated! I have edited the original question with more description; thank you again.
If you look at your file, the first field is "01/1991," with a trailing comma, not "01/1991". (Note also that $1="01/1991" with a single "=" is an assignment, which is always true, not the "==" comparison you meant.) Also, NR%22==0 will look at line numbers divisible by 22, not at the 22 lines after the point you care about.
You can do something like this instead:
awk '
BEGIN { l = -1; }
$1 == "01/1991," {
    l = 22;
    s = 0;
}
l > 0 { s += $2; l--; }
l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.
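For example, keying on the TextID headers instead of a fixed line count (a sketch, assuming every line between headers is a "MM/YYYY, value" pair):
awk '
/^TextID/ {                                # a new block starts
    if (count) printf "Avg: %.2f\n", sum / count
    print                                  # echo the TextID/Name line
    sum = count = 0
    next
}
{ sum += $2; count++ }                     # accumulate the value column
END { if (count) printf "Avg: %.2f\n", sum / count }
' myfile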
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
use strict;
use warnings;

my $have_started = 0;
my $count = 0;
my $sum = 0;

while (<>) {
    # Start summing once we reach the first 01/1991 line
    $have_started = 1 if /01\/1991,/;
    # If we have started, grab the value after the date and comma
    if ($have_started && /\d+\/\d+,\s+([\d.]+)/) {
        $count++;
        $sum += $1;
    }
}
print "Average of all values = ", $sum / $count, "\n";
Run it like so:
$ ./above-perl-script.pl your-text-file.txt

How to analyze time tracking reports with awk?

I'm tracking my time with two great tools, todotxt and punch. With these one
can generate reports that look like this:
2012-11-23 (2 hours 56 minutes):
first task (52 minutes)
second task (2 hours 4 minutes)
2012-11-24 (2 hours 8 minutes):
second task (2 hours 8 minutes)
My question: what's a convenient way to analyze this kind of output? E.g., how could I sum up the time spent on "first task"/"second task", or find out my total working hours for a longer period such as "2012-11-*"?
So, I'd like to have a command such as punch.sh report /regex-for-date-or-task-i'm-interested-in/.
I've read that this should be possible with awk, but I don't know how to 1) sum minutes and hours and 2) pass "task names with spaces" to awk as variables.
UPDATE:
I'm also tagging my tasks with +tags to mark different projects (as in first task +projecttag). So it would also be great to sum the time spent on all tasks with a certain tag.
Thanks for any help!
Before running this script, uncomment the appropriate gsub() line for your input. Then run it like:
awk -f script.awk file
Contents of script.awk:
BEGIN {
    FS = "[( \t)]"
}

/^\t/ {
    line = $0

    #### If your input has...
    ## No tags, show hrs in each task
    # gsub(/^[\t ]*| *\(.*/, "", line)
    ## Tags, show hrs in each task
    # gsub(/^[\t ]*| *\+.*/, "", line)
    ## Tags, show hrs in each tag
    # gsub(/^[^+]*\+| *\(.*/, "", line)
    ####

    for (i = 1; i <= NF; i++) {
        if ($i == "hours")   h[line] += $(i-1)
        if ($i == "minutes") m[line] += $(i-1)
    }
}

END {
    for (i in m) {
        while (m[i] >= 60) { m[i] -= 60; h[i]++ }
        print i ":", (h[i] ? h[i] : "0") " hrs", m[i], "mins"
    }
}
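For the sample report above, with the first gsub() (no tags) uncommented, the output should look something like this (order may vary, since awk's "for (i in m)" iteration is unordered):
first task: 0 hrs 52 mins
second task: 4 hrs 12 mins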
