Extract bibtex entries based on the year - bash

Okay, I have a file.bib file with multiple entries such as:
@Book{Anley:2007:shellcoders-handbook-2nd-ed,
author = {Chris Anley and John Heasman and Felix Lindner and Gerardo
Richarte},
title = "{The Shellcoder's Handbook}",
publisher = {Wiley},
year = 2007,
edition = 2,
month = aug,
}
There you can find the "year = 2007" line. My task is to filter out the years which are greater than 2020 ($currentyear) or lower than 1900 ($minyear); the result should also include the output of the month "may", which stands behind a "year" line in this file (which is a mistake by the admin). By the way, the file is over 4000 lines long.

It is better to use awk for this. Similar to your line, it would read:
awk -v t1="1900" -v t2="$(date "+%Y")" \
'!match($0,/year.*=.*/){next}
{t=substr($0,RSTART,RLENGTH)
match(t,/[0-9][0-9][0-9][0-9]/)
y=substr(t,RSTART,RLENGTH)
}
(y+0 > t1) && (y+0 <= t2) { print y }' file
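For the stray month line mentioned in the question, a rough extension (an untested sketch, assuming the misplaced month sits on the line directly after its year line, as the question describes) could flag out-of-range years and print the line that follows each one:
awk -v minyear=1900 -v maxyear="$(date +%Y)" '
flag { print; flag=0 }                       # the line right after a flagged year line
match($0,/year[[:space:]]*=[[:space:]]*[0-9][0-9][0-9][0-9]/) {
    y=substr($0,RSTART,RLENGTH)
    sub(/.*=[[:space:]]*/,"",y)              # keep only the 4-digit year
    if (y+0 < minyear || y+0 > maxyear) {    # year outside the allowed range
        print                                # print the offending year line...
        flag=1                               # ...and remember to print the next line too
    }
}' file.bib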

Related

script to loop through and combine two text files

I have two .csv files which I am trying to 'multiply' out via a script. The first file is person information and looks basically like this:
First Name, Last Name, Email, Phone
Sally,Davis,sdavis@nobody.com,555-555-5555
Tom,Smith,tsmith@nobody.com,555-555-1212
The second file is account numbers and looks like this:
AccountID
1001
1002
Basically I want to get every name with every account Id. So if I had 10 names in the first file and 10 account IDs in the second file, I should end up with 100 rows in the resulting file and have it look like this:
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Any help would be greatly appreciated
You could simply write a for loop that repeats each person row once per account ID and appends that ID, just in the reverse order.
Has that not worked, or have you not tried that?
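For reference, a minimal sketch of that nested-loop idea in plain bash (assuming the person rows are in info.csv and the IDs in ids.csv, both without header lines) might look like:
# minimal sketch, assuming info.csv and ids.csv have no header lines
while IFS= read -r id; do
    while IFS= read -r person; do
        printf '%s, %s\n' "$person" "$id"   # every person row gets the current account ID appended
    done < info.csv
done < ids.csv > combined.csv
The ID loop is on the outside so the account IDs vary slowest, matching the order in the desired output.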
If python works for you, here's a script which does that:
def main():
    f1 = open("accounts.txt", "r")
    f1_total_lines = sum(1 for line in open('accounts.txt'))
    f2_total_lines = sum(1 for line in open('info.txt'))
    f1_line_counter = 1
    f2_line_counter = 1
    f3 = open("result.txt", "w")
    f3.write('First Name, Last Name, Email, Phone, AccountID\n')
    for line_account in f1.readlines():
        f2 = open("info.txt", "r")
        for line_info in f2.readlines():
            parsed_line_account = line_account
            parsed_line_info = line_info.rstrip()  # we have to trim the newline character from every line from the 'info' file
            if f2_line_counter == f2_total_lines:  # ...for every but the last line in the file (because it doesn't have a newline character)
                parsed_line_info = line_info
            f3.write(parsed_line_info + ',' + parsed_line_account)
            if f1_line_counter == f1_total_lines:
                f3.write('\n')
            f2_line_counter = f2_line_counter + 1
        f1_line_counter = f1_line_counter + 1
        f2_line_counter = 1  # reset the line counter to the first line
    f1.close()
    f2.close()
    f3.close()

if __name__ == '__main__':
    main()
And the files I used are as follows:
info.txt:
Sally,Davis,sdavis@nobody.com,555-555-555
Tom,Smith,tsmith@nobody.com,555-555-1212
John,Doe,jdoe@nobody.com,555-555-3333
accounts.txt:
1001
1002
1003
If You Intended to Duplicate Account_ID
If you intended to add each Account_ID to every record in your information file then a short awk solution will do, e.g.
$ awk -F, '
FNR==NR{a[i++]=$0}
FNR!=NR{b[j++]=$0}
END{print a[0] ", " b[0]
for (k=1; k<j; k++)
for (m=1; m<i; m++)
print a[m] ", " b[k]}
' info id
First Name, Last Name, Email, Phone, AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555, 1001
Tom,Smith,tsmith@nobody.com,555-555-1212, 1001
Sally,Davis,sdavis@nobody.com,555-555-5555, 1002
Tom,Smith,tsmith@nobody.com,555-555-1212, 1002
Above, the lines in the first file (when the file record number equals the record number, i.e. FNR==NR) are stored in array a, the lines from the second file (when FNR!=NR) are stored in array b, and then they are combined and output in the END rule in the desired order.
Without Duplicating Account_ID
Since Account_ID is usually a unique bit of information, if you did not intend to duplicate every ID at the end of each record, then there is no need to loop. The paste command does that for you. In your case, with your information file as info and your account ID file as id, it is as simple as:
$ paste -d, info id
First Name, Last Name, Email, Phone,AccountID
Sally,Davis,sdavis@nobody.com,555-555-5555,1001
Tom,Smith,tsmith@nobody.com,555-555-1212,1002
(note: the -d, option just sets the delimiter to a comma)
Seems a lot easier than trying to reinvent the wheel.
Can be easily done with arrays
OLD=$IFS; IFS=$'\n'
ar1=( $(cat file1) )
ar2=( $(cat file2) )
IFS=$OLD
ind=${!ar1[@]}
for i in $ind; { echo "${ar1[$i]}, ${ar2[$i]}"; }

AWK performance while processing big files

I have an awk script that I use to calculate how long some transactions take to complete. The script gets the unique ID of each transaction and stores the minimum and maximum timestamp of each one. Then it calculates the difference, and at the end it shows those results that are over 60 seconds.
It works very well with a couple hundred thousand lines (200k), but it takes much longer on real-world data. I tested it several times and it takes about 15 minutes to process about 28 million lines. Can I consider this good performance, or is it possible to improve it?
I'm open to any kind of suggestion.
Here is the complete code:
zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log | awk '{
    gsub("[()]|:.*","",$4)                          # just removing ugly chars
    ++cont
    min=$4"min"                                     # key for the minimum value of the current transaction
    max=$4"max"                                     # same as previous, just for readability
    split($2,secs,/[:,]/)                           # split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3]   # turn everything into seconds
    if(arr[min] > seconds || arr[min] == 0)
        arr[min]=seconds
    if(arr[max] < seconds)
        arr[max]=seconds
    dif=arr[max] - arr[min]
    if(dif > 60)
        result[$4] = dif
}
END{
    for(x in result)
        print x" - "result[x]
    print ":Processed "cont" lines"
}'
You don't need to calculate the dif every time you read a record. Just do it once in the END section.
You don't need that cont variable, just use NR.
You don't need to build separate min and max keys by string concatenation; string concatenation is slow in awk.
You shouldn't change $4 as that will force the record to be recompiled.
Try this:
awk '{
    name = $4
    gsub(/[()]|:.*/,"",name)                        # just removing ugly chars
    split($2,secs,/[:,]/)                           # split hours, minutes and seconds
    seconds = 3600*secs[1] + 60*secs[2] + secs[3]   # turn everything into seconds
    if (!(name in min)) {
        min[name] = max[name] = seconds
    }
    else {
        if (min[name] > seconds) {
            min[name] = seconds
        }
        if (max[name] < seconds) {
            max[name] = seconds
        }
    }
}
END {
    for (name in min) {
        diff = max[name] - min[name]
        if (diff > 60) {
            print name, "-", diff
        }
    }
    print ":Processed", NR, "lines"
}'
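As a side note on the performance-testing suggestion, one rough way to see which stage dominates (the log path is from the question; transactions.awk is only a placeholder name for the script above) is to time the filter stage alone and then the full pipeline:
# time the filter stage by itself, then the whole pipeline (transactions.awk is a placeholder)
time zgrep -E "\(([a-z0-9]){15,}:" /path/to/very/big/log > /dev/null
time zcat /path/to/very/big/log | awk -f transactions.awk > /dev/null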
After running some tests, and with the suggestions given by Ed Morton (both for code improvement and for performance testing), I found that the bottleneck was the zgrep command. Here is an example that does several things:
Checks if we have a transaction line (first if)
Cleans the transaction id
Checks if it has already been registered (second if) by looking it up in the array
If it is not registered, it checks whether it is the appropriate type of transaction and, if so, registers the timestamp in seconds
If it is already registered, it saves the new timestamp as the maximum
After all that, it performs the necessary operations to calculate the time difference
Thank you very much to all who helped me.
zcat /veryBigLog.gz | awk '
{
    if($4 ~ /^\([[:alnum:]]/ ){
        name=$4; gsub(/[()]|:.*/,"",name)
        if(!(name in min)){
            if($0 ~ /TypeOFTransaction/ ){
                split($2,secs,/[:,]/)
                seconds = 3600*secs[1] + 60*secs[2] + secs[3]
                max[name] = min[name] = seconds
                print length(min)" new "name" start at "seconds
            }
        }else{
            split($2,secs,/[:,]/)
            seconds = 3600*secs[1] + 60*secs[2] + secs[3]
            if( max[name] < seconds) max[name]=seconds
            print name " new max " max[name]
        }
    }
}
END{
    for(x in min){
        dif=max[x] - min[x]
        print max[x]" max - min "min[x]" : "dif
    }
    print "Processed "NR" Records"
    print "Found "length(min)" MOs"
}'

Awk Calc Avg Rows Below Certain Line

I'm having trouble using awk to calculate an average of specific numbers in a column BELOW a specific text identifier. I have two columns of data and I'm trying to start the average keying on a common identifier that repeats, which is 01/1991. So awk should calculate the average of all lines beginning with 01/1991, which repeats, using the next 21 lines, for a total of 22 rows per average, covering the years 1991-2012. The desired output is an average for each TextID/Name entry over all the Januarys (01) for the years 1991-2012, shown below:
TextID/Name 1
Avg: 50.34
TextID/Name 2
Avg: 45.67
TextID/Name 3
Avg: 39.97
...
sample data:
TextID/Name 1
01/1991, 57.67
01/1992, 56.43
01/1993, 49.41
..
01/2012, 39.88
TextID/Name 2
01/1991, 45.66
01/1992, 34.77
01/1993, 56.21
..
01/2012, 42.11
TextID/Name 3
01/1991, 32.22
01/1992, 23.71
01/1993, 29.55
..
01/2012, 35.10
continues with the same data for TextID/Name 4
I'm getting an answer using the code shown below, but the average starts calculating BEFORE the specific identifier line, not on and below that line (01/1991).
awk '$1="01/1991" {sum+=$2} (NR%22==0){avg=sum/22;print"Average: "avg;sum=0;next}' myfile
Thanks, and explanations of the solution are greatly appreciated! I have edited the original post with more description - thank you again.
If you look at your file, the first field is "01/1991," with a comma at the end, not "01/1991". Also, NR%22==0 will look at line numbers divisible by 22, not 22 lines after the point it thinks you care about.
You can do something like this instead:
awk '
BEGIN { l=-1; }
$1 == "01/1991," {
l=22;
s=0;
}
l > 0 { s+=$2; l--; }
l == 0 { print s/22; l--; }'
It has a counter l that it sets to the number of lines to count, then it sums up that number of lines.
You may want to consider simply summing all lines from one 01/1991 to the next though, which might be more robust.
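A rough sketch of that more robust variant (untested; it assumes every data line starts with an MM/YYYY date followed by a comma, and myfile is the input file from the question):
awk '
$1 == "01/1991," { if (n) print "Avg: " s/n; s = n = 0 }      # a new block starts at each 01/1991 marker
$1 ~ /^[0-9][0-9]\/[0-9][0-9][0-9][0-9],$/ { s += $2; n++ }   # sum every date line in the current block
END { if (n) print "Avg: " s/n }                              # report the last block
' myfile
You could also print the preceding TextID/Name line with each average by capturing it in a separate rule.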
If you're allowed to use Perl instead of Awk, you could do:
#!/usr/bin/env perl
$have_started = 0;
$count = 0;
$sum = 0;
while (<>) {
    $line = $_;
    # Start summing values once the 01/1991 marker is seen
    if ($line =~ /01\/1991,/) {
        $have_started = 1;
    }
    # If we have started counting, grab the value after the date and comma
    if ($have_started && $line =~ /\d+\/\d+,\s+([\d\.]+)/) {
        $count++;
        $sum += $1;
    }
}
print "Average of all values = " . $sum/$count . "\n";
Run it like so:
$ cat your-text-file.txt | above-perl-script.pl

Comparing many files in Bash

I'm trying to automate a task at work that I normally do by hand: taking database output of the permissions of multiple users and comparing them to see what they have in common. I have a script right now that uses comm and paste, but it's not giving me all the output I'd like.
Part of the problem is that comm only deals with two files at once, and I need to compare at least three to find a trend. I also need to determine if two out of the three have something in common, but the third one doesn't have it (so comparing the output of two comm commands doesn't work). I need these in comma-separated values so they can be imported into Excel. Each user has a column, and at the end is a listing of everything they have in common. comm would work perfectly if it could compare more than two files (and show two-out-of-three comparisons).
In addition to the code I use to clean all the extra cruft off the raw CSV file, here's what I have so far for comparing four users. It's highly inefficient, but it's what I know.
cat foo1 | sort > foo5
cat foo2 | sort > foo6
cat foo3 | sort > foo7
cat foo4 | sort > foo8
comm foo5 foo6 > foomp
comm foo7 foo8 > foomp2
paste foomp foomp2 > output2
sed 's/[\t]/,/g' output2 > output4.csv
cat output4.csv
Right now this outputs two users, their similarities and differences, then does the same for another two users and pastes it together. This works better than doing it by hand, but I know I could be doing more.
An example input file would be something like:
User1
Active Directory
Internet
S: Drive
Sales Records
User2
Active Directory
Internet
Pricing Lookup
S: Drive
User3
Active Directory
Internet
Novell
Sales Records
where they have AD and Internet in common, two out of three have Sales Records access and S: Drive permission, and only one user each has Novell and Pricing Lookup access.
Can someone give me a hand in what I'm missing?
Using GNU AWK (gawk) you can print a table that shows how multiple users' permissions correlate. You could also do the same thing in any language that supports associative arrays (hashes), such as Bash 4, Python, Perl, etc.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}

END {
    pcount = asort(perms)
    ucount = asort(users)
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u = 1; u <= ucount; u++) {
        printf("%-*s", colwidth, users[u])
    }
    printf("\n")
    for (p = 1; p <= pcount; p++) {
        printf("%-*s", maxplen, perms[p])
        for (u = 1; u <= ucount; u++) {
            if (array[users[u], perms[p]]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
Save this file, perhaps calling it "correlate", then set it to be executable:
$ chmod u+x correlate
Then, assuming that the filenames correspond to the usernames or are otherwise meaningful (your examples are "user1" through "user3" so that works well), you can run it like this:
$ ./correlate user*
and you would get the following output based on your sample input:
                  user1   user2   user3
Active Directory   X       X       X
Internet           X       X       X
Novell                             X
Pricing Lookup             X
S: Drive           X       X
Sales Records      X               X
Edit:
This version doesn't use asort() and so it should work on non-GNU versions of AWK. The disadvantage is that the order of rows and columns is unpredictable.
#!/usr/bin/awk -f
{
    array[FILENAME, $0] = $0
    perms[$0] = $0
    if (length($0) > maxplen) {
        maxplen = length($0)
    }
    users[FILENAME] = FILENAME
}

END {
    maxplen += 2
    colwidth = 8
    printf("%*s", maxplen, "")
    for (u in users) {
        printf("%-*s", colwidth, u)
    }
    printf("\n")
    for (p in perms) {
        printf("%-*s", maxplen, p)
        for (u in users) {
            if (array[u, p]) {
                printf("%-*s", colwidth, " X")
            } else {
                printf("%-*s", colwidth, "")
            }
        }
        printf("\n")
    }
}
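And since any language with associative arrays will do, as noted above, here is a rough Bash 4 sketch of the same idea (untested; pass the user files as arguments, the column widths are fixed for simplicity, and as with the portable awk version the row order is unspecified):
#!/usr/bin/env bash
# rough sketch: ./correlate.sh user1 user2 user3
declare -A has perms
users=("$@")
for u in "${users[@]}"; do
    while IFS= read -r p; do
        has["$u|$p"]=1                     # this user has this permission
        perms["$p"]=1                      # remember every permission seen
    done < "$u"
done
printf '%-20s' ""                          # blank corner cell (fixed width for simplicity)
printf '%-8s'  "${users[@]}"               # one column per user file
printf '\n'
for p in "${!perms[@]}"; do
    printf '%-20s' "$p"
    for u in "${users[@]}"; do
        if [[ ${has["$u|$p"]} ]]; then printf '%-8s' " X"; else printf '%-8s' ""; fi
    done
    printf '\n'
done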
You can use the diff3 program. From the man page:
diff3 - compare three files line by line
Given your sample inputs, above, running diff3 results in:
====
1:3,4c
S: Drive
Sales Records
2:3,4c
Pricing Lookup
S: Drive
3:3,4c
Novell
Sales Records
Does this get you any closer to what you're looking for?
I would use the strings command to remove any binary from the files, cat them together, sort the result, and then use uniq -c on it to get a count of occurrences of each string.
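A quick sketch of that approach (assuming the permission lists are in the files user1, user2 and user3):
# count how many of the three users have each entry; 3 = common to all, 2 = two of three, 1 = unique
strings user1 user2 user3 | sort | uniq -c | sort -rn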

Fast editing subtitles file

I like GNU/Linux and writing bash scripts to automate my tasks. But I am a beginner and have a lot of problems with it. So, I have a subtitle file in a format like this (I'm Polish, so these are Polish subtitles):
00:00:27:W zamierzchłych czasach|ziemia pokryta była lasami.
00:00:31:Od wieków mieszkały|w nich duchy bogów.
00:00:37:Człowiek żył w harmonii ze zwierzętami.
I think you understand this simple format. The problem is that in the movie file there is 1:15 of introduction before the movie starts. I want to add 1:15 to each line of the subtitle file. So the example should look like this:
00:01:42:W zamierzchłych czasach|ziemia pokryta była lasami.
00:01:46:Od wieków mieszkały|w nich duchy bogów.
00:01:52:Człowiek żył w harmonii ze zwierzętami.
Could you help me to write this script?
BTW I'm Polish and I'm still learning English. So if you cannot understand me, please say so.
Here's a solution in awk - probably easier than bash for this kind of problem:
#!/usr/bin/awk -f
BEGIN {
    FS = ":"
}
{
    hr = $1
    min = $2
    sec = $3
    sec = sec + 15        # the 15-second part of the 1:15 offset
    if (sec >= 60) {
        sec = sec - 60
        min = min + 1
    }
    min = min + 1         # the 1-minute part of the 1:15 offset
    if (min >= 60) {
        min = min - 60
        hr = hr + 1
    }
    printf "%02d:%02d:%02d:%s\n", hr, min, sec, $4
}
Suggestions for improvement welcome!
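One possible variant (an untested sketch) converts each timestamp to total seconds so the shift is a single number; the offset value and file name are assumptions:
# shift every timestamp by a configurable number of seconds (75 s = 1:15)
awk -v offset=75 'BEGIN { FS = OFS = ":" }
{
    total = $1*3600 + $2*60 + $3 + offset       # timestamp plus the shift, in seconds
    $1 = sprintf("%02d", int(total / 3600))
    $2 = sprintf("%02d", int(total % 3600 / 60))
    $3 = sprintf("%02d", total % 60)
    print                                       # rebuilt with OFS=":", so the subtitle text is kept
}' subtitles.txt
Because only the first three fields are reassigned and OFS is ":", any colons inside the subtitle text are preserved when the line is rebuilt.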
