On an AIX machine without Perl, I need to filter records that are considered duplicates if they have the same id and were registered within a period of four hours.
I implemented this filter using AWK and it works pretty well, but I need a much faster solution:
# Generate list of duplicates
awk 'BEGIN {
    FS=","
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
/OK/ && x[$8]>1 && f[$8]-old[$8] < 4*3600 {
    print $0
}'
Any suggestions? Are there ways to improve the environment (preloading the file or something like that)?
The input file is already sorted.
With the corrections suggested by jj33 I made a new version with better handling of dates, while still keeping it simple enough to incorporate more operations later:
awk 'BEGIN {
    FS=",";
    SECSPERMINUTE=60;
    SECSPERHOUR=3600;
    SECSPERDAY=86400;
    split("0 31 59 90 120 151 181 212 243 273 304 334", DAYSTOMONTH, " ");
    split("0 366 731 1096 1461 1827 2192 2557 2922 3288 3653 4018 4383 4749 5114 5479 5844 6210 6575 6940 7305", DAYSTOYEAR, " ");
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
/OK/ && x[$8]>1 && f[$8]-old[$8] < 4*3600 {
    print $0
}
function mktime(y, m, d, hh, mm, ss) {
    d2m = DAYSTOMONTH[ m * 1 ];
    if ( (m > 2 ) && ( ((y % 4 == 0) && (y % 100 != 0)) || (y % 400 == 0) ) ) {
        d2m = d2m + 1;
    }
    d2y = DAYSTOYEAR[ y - 1999 ];
    return ss + (mm*SECSPERMINUTE) + (hh*SECSPERHOUR) + (d*SECSPERDAY) + (d2m*SECSPERDAY) + (d2y*SECSPERDAY);
}
'
This sounds like a job for an actual database. Even something like SQLite could probably serve you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding-window problem, which means you can't simply quantize all the data into 4-hour segments; you have to compare every element against all of its "nearby" elements separately. Ugh.
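For illustration only, here is a minimal sketch of that idea in SQLite, driven from the shell; the table name, columns and sample rows are all made up, and loading the real CSV (e.g. with .mode csv / .import) is left out:
sqlite3 <<'SQL'
-- Hypothetical schema: one row per record, ts = time in seconds since the epoch.
CREATE TABLE records (id TEXT, ts INTEGER);
INSERT INTO records VALUES ('A', 1000), ('A', 5000), ('B', 20000), ('A', 90000);
-- A record is a "duplicate" if an earlier record with the same id lies
-- less than four hours (14400 s) before it.
SELECT a.id, a.ts
FROM records a
JOIN records b
  ON b.id = a.id
 AND b.ts < a.ts
 AND a.ts - b.ts < 4*3600;
SQL
With an index on (id, ts) that self-join stays cheap, but for a one-off job the awk approach is probably still less work.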
If your data file contains all your records (i.e. it includes records that do not have duplicate ids within the file), you could pre-process it and produce a file that only contains records with duplicate ids.
If that is the case, it would reduce the size of the file you need to process with your AWK program.
How is the input file sorted? Like cat file|sort, or sorted by a single specific field, or by multiple fields? If multiple fields, which fields and in what order? It appears the hour fields are on a 24-hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09")?
Without taking performance into account, it looks like your code has problems with month boundaries, since it assumes all months are 30 days long. Take the two dates 2008-05-31/12:00:00 and 2008-06-01/12:00:00. Those are 24 hours apart, but your code produces the same time code for both (63339969600).
I think you would also need to consider leap years. I didn't do the math, but I think that during a leap year, with February hard-coded to 28 days, a comparison of noon on 2/29 and noon on 3/1 would result in the same duplicate timestamp as before. Although it looks like you didn't implement it like that. The way you implemented it, I think you still have the problem, but it's between dates on 12/31 of $leapyear and 1/1 of $leapyear+1.
I think you might also have some collisions during time changes if your code has to handle time zones that observe them.
The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it were year, month, day, I think there could be some optimizations there. There still might be, but my brain's going in a different direction right now.
If there are a small number of duplicate keys in proportion to total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
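A rough sketch of that first stage, assuming field 8 is the id and using input.txt / dups.txt as placeholder file names:
# Pass 1 (NR==FNR): count ids on the /OK/ lines.
# Pass 2: print only /OK/ lines whose id occurs more than once.
awk -F, 'NR == FNR { if (/OK/) count[$8]++; next }
         /OK/ && count[$8] > 1' input.txt input.txt > dups.txt
dups.txt can then be fed to the stripped-down duplicate-window script.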
If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with. I'd bite the bullet and write it in C. Tons more lines of code, much faster than the awk script.
On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.
On my Mac, which has GNU sort, it's:
sort -k 8 < input.txt > output.txt
to sort on the ID field. You can sort on a second field too, by saying (e.g.) 8,3 instead, but ONLY 2 fields. So a unix-style time_t timestamp might not be a bad idea in the file - it's easy to sort, and saves you all those date calculations. Also (again, at least in GNU awk), there is a mktime function that makes the time_t for you from the components.
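If gawk does happen to be on the box, its mktime takes a single "YYYY MM DD HH MM SS" string and returns a time_t, so the whole check collapses to something like this sketch (field layout taken from the question's script; input.txt is a placeholder):
# Assumes $8 = id, $2 = day, $3 = month, $4 = year, $5/$6/$7 = HH/MM/SS.
gawk -F, '/OK/ {
    t = mktime($4 " " $3 " " $2 " " $5 " " $6 " " $7)   # seconds since epoch
    if (($8 in last) && t - last[$8] < 4*3600)
        print                                           # duplicate within 4 hours
    last[$8] = t
}' input.txt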
#AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seems to be comma delimited and (Solaris 8's) uniq doesn't give you any way to specify the field delimiter, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:
cut -d, -f8 <infile.txt | sort | uniq -d | xargs -i grep {} infile.txt >outfile.txt
That's not very good because it executes grep once for every line containing a duplicate key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posts the expected ratio of lines containing suspected duplicate keys to total lines in the file.
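One way to avoid running grep once per duplicate key, still with standard tools, is to collect the keys in a file and hand them to grep in a single pass (file names are placeholders):
# -F treats the keys as fixed strings, -f reads one pattern per line.
# Caveat: this matches a key anywhere on the line, not just in field 8.
cut -d, -f8 infile.txt | sort | uniq -d > dupkeys.txt
grep -F -f dupkeys.txt infile.txt > outfile.txt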
Related
I have a school project that gives me several lines of text in a file, like this:
team1-team2:2-1
team3-team1:2-2
etc
It wants me to determine which team won (or drew) and then make a league table with them, awarding points for wins/draws.
This is my first time using bash. What I did was save the team1/team2 names in a variable and then do the same for the goals. How should I make the table? I managed to make my script create a new file that saves all the team names in there (and checks for duplicates), but I don't know how to continue. Should I make an array for each team, saving their results in it? And then how do I implement the rankings, for example
team1 3p
team2 1p
etc.
I'm not asking for actual code, just a guide as to how I should implement it. Is making a new file the right move? Should I try making a new array with the teams instead? Or something else?
The problem can be divided into 3 parts:
Read the input data into memory in a format that can be manipulated easily.
Manipulate the data in memory
Output the results in the desired format.
When reading the data into memory, you might decide to read all the data in one go before manipulating it. Or you might decide to read the input data one line at a time and manipulate each line as it is read. When using shell scripting languages, like bash, the second option usually results in simpler code.
The most important decision to make here is how you want to structure the data in memory. You normally want to avoid duplication of data, and you usually want a data structure that is easy to transform into your desired output. In this case, the most logical data structure is an associative array, using the team name as the key.
Assuming that you have to use bash, here is a framework for you to build upon:
#!/bin/bash
declare -A results
while IFS=':-' read team1 team2 score1 score2; do
if [ ${score1} -gt ${score2} ]; then
((results[${team1}]+=2))
elif [ ...next test... ]; then
...
else
...
fi
done < scores.txt
# Now you have an associative array containing the points for each team.
# You can either output it as it stands, or sort it by piping through the
# 'sort' command.
for key in "${!results[@]}"; do
echo ...
done
I would use awk for this
AWK is an interpreted programming language (AWK stands for Aho, Weinberger, Kernighan) designed for text processing and typically used as a data extraction and reporting tool. AWK is used largely on Unix systems.
Using pure bash scripting is often messy for this kind of job.
Let me show you how easy it can be using awk
Input file : scores.txt
team1-team2:2-1
team3-team1:2-2
Code :
awk -F'[:-]' ' # set delimiters to ':' or '-'
{
if($3>$4){teams[$1] += 3} # first team gets 3 points
else if ($3<$4){teams[$2] += 3} # second team gets 3 points
else {teams[$1]+=1; teams[$2]+=1} # both teams get 1 point
}
END{ # after scanning input file
for(team in teams){
print(team OFS teams[team]) # print total points per team
}
}' scores.txt | sort -rnk 2 > ranking.txt # sort by nb of points
Output (ranking.txt):
team1 4
team3 1
I am trying to work with some oddly created 'dumps' of some tables in Postgres. Because the tables contain specific data, I will have to refrain from posting the exact information, but I can give an example.
To give a bit more information, someone thought that this exact command was a good way to back up a table:
echo 'select * from test1'|psql > test1.date.txt
However, in this one example that gives a lot of information that no one needs. To make it even more fun, the person saw fit to remove the | that is normally seen with the data.
So what I end up with is something like this.
rowid test1
-------+----------------------
1 hi
2 no
(2 rows)
Also note that, for this customer, there are multiple tables here. My thought was to use some simple Python to figure out where in each line the + was and mark those points, then apply those positions to each line throughout the file.
I was able to make this work for one set of files, but for some reason the next set of files just doesn't work. What happens instead is that on most lines a pipe gets thrown into the middle of the data.
Maybe there is something I'm missing here, but does anyone see an easy way to put something like the above back into a normally delimited file that I could then just load into the database?
Any python or bash related suggestions would also work in this case. Thank you.
As mentioned above, without a real example of where the '|' characters that are causing problems are, or a real example of where you are having problems, it is hard to know whether we are addressing your actual issue. That said, your two primary swiss-army knives for text processing are sed and awk. If you have data similar to your example, with pipes between data fields that you need to discard, then awk provides a fairly easy solution.
Take for example your short example and add a pipe in the middle that needs to be discarded, e.g.
$ cat dat/pgsql2.txt
rowid test1
-------+----------------------
1 | hi
2 | no
To process the file in awk discarding the '|' and outputting the remaining records in comma-separated-value format, you could do something like the following:
awk '{
    if (NR > 2) {
        for (i = 1; i <= NF; i++) {
            if ($i != "|") {
                if (i == 1)
                    printf "%s", $i
                else
                    printf ",%s", $i
            }
        }
        printf "\n"
    }
}' inputfile
This simply reads from inputfile (last line) and loops over the number of fields (NF, 3 in this case). If the row number is greater than 2 (to omit the heading) and the field $i is not "|", it checks whether this is the first field and outputs it without a comma; all other fields are output with a preceding comma.
Example Output
1,hi
2,no
awk is a bit awkward at first, but as far as text processing goes, there isn't much that will top it.
After trying multiple methods, sadly the only way I could make this work was to just use the import feature in Excel and then play with that to get the columns I needed.
I have a tab delimited text file with several million rows and with 2 columns that looks like this:
1 693731
1 729679
1 730087
1 731718
1 734349
I want to add an additional column to the file that is equal to the value of column 2 + 1. So for the above example it would look like this:
1 693731 693732
1 729679 729680
1 730087 730088
1 731718 731719
1 734349 734350
What would be the best way to do this using unix shell? Any input would be greatly appreciated.
When you're dealing with columnar data, "awk" is a good tool. It also has math capabilities built-in, so it's a natural for this:
awk '{ print $1"\t"$2"\t"($2+1); }' < data.tsv
awk runs this code for every line in the input. For each of those lines, the $ notation indicates a column: $1 is the first column in the current row, $2 is the second, and so on.
Though I prefer explicit column enumeration, you may use ghoti's optimization, where $0 represents the entire current row:
awk '{ print $0"\t"($2+1); }' < data.tsv
Because of the UNIX toolbox approach, there are many ways to solve this problem. Whether or not this is "best" depends on many factors: speed, maintainability, portability, etc.
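As one more variation on the same idea, you can let awk's output field separator supply the tabs instead of writing "\t" by hand:
# Same output; OFS is inserted between the comma-separated print arguments.
awk 'BEGIN { OFS = "\t" } { print $1, $2, $2 + 1 }' data.tsv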
I am currently working on a script that processes and combines several different files, and for one part it is necessary that I find the difference between two times in order to determine a "total" amount of time that someone has worked. The times themselves are in the following format:
34:18:00,40:26:00,06:08:00
with the first being the start time, the second the end time, and the third the total time. Although this one is displayed correctly, there are some entries that need to be double-checked and corrected (the total time is not correct based on the start/end time). I have found several different solutions in other posts, but most of them also include dates (most of them using awk). I am not experienced with awk, so I am not sure how to go about removing the date portion from those examples. I have also heard that I could convert the times to Unix epoch time, but I was curious whether there are any other ways to accomplish this. Thanks!
Something like this might help you:
#!/bin/bash
# Convert HH:MM:SS to a number of seconds.
time2seconds() {
    a=( ${1//:/ } )
    # Force base 10 so components with a leading zero (e.g. "08") aren't read as octal.
    echo $(( 10#${a[0]}*3600 + 10#${a[1]}*60 + 10#${a[2]} ))
}
# Convert a number of seconds back to HH:MM:SS.
seconds2time() {
    printf "%.2d:%.2d:%.2d" $(($1/3600)) $((($1/60)%60)) $(($1%60))
}
IFS=, read start stop difference <<< "34:18:00,40:26:00,06:08:00"
printf "Start=%s\n" "$start"
printf "Stop=%s\n" "$stop"
printf "Difference=%s (given in file: %s)\n" $(seconds2time $(($(time2seconds $stop)-$(time2seconds $start)))) "$difference"
Output is:
Start=34:18:00
Stop=40:26:00
Difference=06:08:00 (given in file: 06:08:00)
Note: there's nothing that checks if the times are in a valid format, I don't know how reliable your data are.
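To run the same check over a whole comma-separated file rather than a single here-string, the read can sit in a loop; times.csv is a placeholder name and the two functions above are assumed to be defined:
# Flag every line whose stored total does not match stop - start.
while IFS=, read -r start stop difference; do
    actual=$(seconds2time $(( $(time2seconds "$stop") - $(time2seconds "$start") )))
    if [ "$actual" != "$difference" ]; then
        echo "mismatch: $start,$stop,$difference (recomputed: $actual)"
    fi
done < times.csv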
I need to extract some information from a log file using a shell script (bash). A line from the log file usually looks like this:
2009-10-02 15:41:13,796| some information
Occasionally, such a line is followed by a few more lines giving details about the event. These additional lines do not have a specific format (in particular they don't start with a timestamp).
I know how to use grep to filter the file based on keywords and expressions. Basically what I'm having trouble with is that sometimes I need to look at specific intervals only. For example I need to look only at the events which happened during the last X minutes. I'm not experienced with shell scripting, but due to the complexity of the time format, this seems to be a rather difficult task for me. On the other hand, I can imagine that this is something not too unusual, so I'm wondering if there are some tools which can make this easier for me or if you can give me some hints on how to tackle this problem?
gawk -F"[-: ]" 'BEGIN{
    fivemin = 60 * 5          # last 5 minutes
    now = systime()
    difference = now - fivemin
}
/^20/{
    yr = $1
    mth = $2
    day = $3
    hr = $4
    min = $5
    sec = $6
    sub(/,.*/, "", sec)       # strip the ",796|..." part after the seconds
    t1 = mktime(yr" "mth" "day" "hr" "min" "sec)
    if (t1 >= difference) {
        print
    }
}' file
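Since the question mentions detail lines that follow an event without their own timestamp, a small variation keeps a flag set until the next timestamped line, so those detail lines are printed together with their event (a sketch, not tested against real data):
gawk -F"[-: ]" 'BEGIN { cutoff = systime() - 5*60 }        # last 5 minutes
/^20/ { s = $6; sub(/,.*/, "", s)                          # strip ",796|..."
        show = (mktime($1" "$2" "$3" "$4" "$5" "s) >= cutoff) }
show' file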
"Basically what I'm having trouble with is that sometimes I need to look at specific intervals only."
You could use date to convert the timestamp for you with the %s format specifier:
%s seconds since 1970-01-01 00:00:00 UTC
With it we can make a small demonstration:
#!/bin/bash
timespan_seconds=300 # 5 minutes
time_specified=$(date +"%s" -d "2010-08-25 14:54:40")
time_now=$(date +"%s")
time_diff=$(( time_now - timespan_seconds ))

if [ "$time_specified" -ge "$time_diff" ]; then
    echo "Time is within range"
fi
Note that this doesn't address future time.
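If gawk is not available, the same %s conversion can be applied line by line from bash, at the cost of one date call per timestamped line (app.log is a placeholder; detail lines without a timestamp are simply skipped here):
threshold=$(( $(date +%s) - 300 ))                  # 5 minutes ago
while IFS= read -r line; do
    if [[ $line =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2} ]]; then
        ts=${line%%,*}                              # "2009-10-02 15:41:13"
        (( $(date -d "$ts" +%s) >= threshold )) && printf '%s\n' "$line"
    fi
done < app.log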
You might want to take a look at my Python program which extracts data from log files based on a range of times. The specification of dates is not yet implemented (it is designed to look at roughly the most recent 24 hours). The time format that it expects (e.g. Jan 14 04:10:13) looks a little different than what you want, but that could be adapted. I haven't tested it with non-timestamped lines, but it should print everything within the specified range of times.
This will give you some usage information:
timegrep.py --help