Find the most hit url in a large file - algorithm

I was reading this Yelp interview on Glassdoor
"We have a fairly large log file, about 5GB. Each line of the log file contains an url which a user has visited on our site. We want to figure out what's the most popular 100 urls visited by our users. "
and one of the solutions is
cat log | sort | uniq -c | sort -k2n | head 100
Can someone explain to me what the purpose of the second sort (sort -k2n) is?
Thanks!

It looks like the stages are:
1) get the log file into the filter
2) get identical filenames together
3) count the number of occurrences of each different filename
4) Sort the pairs (filename, number of occurrences) by number of occurrences
5) Print out the 100 most common filenames
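For what it's worth, uniq -c puts the count in the first field of its output, so the "sort by occurrences" step is more commonly written as sort -rn (or sort -k1,1nr), and head usually wants -n. A minimal sketch of that variant:
# group identical URLs, collapse each group to "count url", order by count descending, keep the top 100
sort log | uniq -c | sort -rn | head -n 100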

Related

Explain how this command finds the 5 most CPU intensive processes

If I have the command
ps | sort -k 3,3 | tail -n 5
How does this find the 5 most CPU-intensive processes?
I get that it is taking all the processes, sorting them based on a column through the -k option, but what does 3,3 mean?
You can find what you're looking for in the official manual of sort (info sort on Linux); in particular, you are interested in the following extracts:
‘-k POS1[,POS2]’
‘--key=POS1[,POS2]’
Specify a sort field that consists of the part of the line between
POS1 and POS2 (or the end of the line, if POS2 is omitted),
_inclusive_.
and, skipping a few paragraphs,
Example: To sort on the second field, use ‘--key=2,2’ (‘-k 2,2’).
See below for more notes on keys and more examples. See also the
‘--debug’ option to help determine the part of the line being used
in the sort.
So, basically, 3,3 specifies that only the third column is considered for sorting, and the others are ignored.
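As a quick illustration with a made-up three-column input, -k 3,3n sorts numerically on the third whitespace-separated field and nothing else:
$ printf 'procA 500 2.0\nprocB 300 9.5\nprocC 800 0.3\n' | sort -k 3,3n
procC 800 0.3
procA 500 2.0
procB 300 9.5
GNU sort's --debug option, mentioned in the excerpt above, will also mark exactly the part of each line that the key covers.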

How to format output of a select statement in Postgres back into a delimited file?

I am trying to work with some oddly created 'dumps' of some tables in Postgres. Since the tables contain specific data, I will have to refrain from posting the exact information, but I can give an example.
To give a bit more information, someone thought that this exact command was a good way to back up a table.
echo 'select * from test1'|psql > test1.date.txt
However, in this one example that gives a lot of information that no one needs. To make it even more fun, the person saw fit to remove the | that is normally seen with the data.
So what I end up with is something like this.
rowid test1
-------+----------------------
1 hi
2 no
(2 rows)
Also note that for this customer there are multiple tables here. My thought was to use some simple Python to figure out where in each line the + was and mark those points, then apply those points to each line throughout the file.
I was able to make this work for one set of files, but for some reason the next set of files just doesn't work. What happens instead is that on most lines a pipe gets thrown into the middle of the data.
Maybe there is something I am missing here, but does anyone see an easy way to put something like the above back into a normal delimited file that I could then just load into the database?
Any python or bash related suggestions would also work in this case. Thank you.
As mentioned above, without a real example of where the '|' characters that are causing problems sit, or a real example of where you are having trouble, it is hard to know whether we are addressing your actual issue. That said, your two primary swiss-army knives for text processing are sed and awk. If you have data similar to your example, with pipes between data fields that you need to discard, then awk provides a fairly easy solution.
Take for example your short example and add a pipe in the middle that needs to be discarded, e.g.
$ cat dat/pgsql2.txt
rowid test1
-------+----------------------
1 | hi
2 | no
To process the file in awk discarding the '|' and outputting the remaining records in comma-separated-value format, you could do something like the following:
awk '{
    if (NR > 2) {
        for (i = 1; i <= NF; i++) {
            if ($i != "|") {
                if (i == 1)
                    printf "%s", $i
                else
                    printf ",%s", $i
            }
        }
        printf "\n"
    }
}' inputfile
This simply reads from inputfile (named on the last line) and loops over the number of fields (NF, 3 in this case). If the row number is greater than 2 (to omit the heading) and the field $i is not "|", it checks whether this is the first field and outputs it without a comma; all other fields are output with a preceding comma, and a newline is printed after each record.
Example Output
1,hi
2,no
awk is a bit awkward at first, but as far as text processing goes, there isn't much that will top it.
After trying multiple methods, the only way I could make this work, sadly, was to use the import feature in Excel and then play with that to get the columns I needed.

Use cut and grep to separate data while printing multiple fields

I want to be able to separate data by week. The week is stated in a specific field on every line, and I would like to know how to use grep, cut, or anything else that's relevant on JUST the field the week is specified in, while still keeping the rest of the data on each line. I need to be able to pipe the information into it via | because that's how the rest of my program needs it.
as the output gets processed, it should look something like this
asset.14548.extension 0
asset.40795.extension 0
asset.98745.extension 1
I want to be able to separate those lines by their week number while still keeping the asset name in my output, because the number of times each asset shows up gets counted. My problem is that I can't make my program smart enough to take just the "1" from the week number while ignoring the "1" that appears in the asset name.
UPDATE
The closest answer I found was
grep "^.........................$week" ;
That's good, but it relies on every string being the same length. Is there a way I can have it start from the right instead of the left? Because if so then that'd answer my question.
^ tells grep to anchor the match at the start of the line, and each . matches (and effectively ignores) whatever single character is in that position.
I found what I was looking for in some documentation. Anchor matches!
grep "$week$" file
would output this if $week was 0
asset.14548.extension 0
asset.40795.extension 0
I couldn't find my exact question or a closely similar question with a simple answer, so hopefully it helps the next person scratching their head on this.
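If the week number ever stops being the last thing on the line, a field-based test is a bit more robust than an anchored regex. A minimal awk sketch, assuming the week is the final whitespace-separated field as in the sample above:
awk -v wk="$week" '$NF == wk' file
$NF is the last field, so a "1" inside the asset name is never even looked at.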

Automatic update data file and display?

I am trying to develop a system where the application allows users to book ticket seats. I am trying to implement an automatic system (a function) where the app can choose the best seats for the user.
My current database (seats.txt) file is stored this way (not sure if it's a good format):
X000X
00000
0XXX0
where X means the seat is occupied and 0 means the seat is empty.
After the user logs in to my system and chooses "Choose best for you", the user will be prompted to enter how many seats he/she wants (I have done this part). Now, if the user enters 2, I will check from the first row and see if there are any empty seats; if yes, then I assign them (this is a simple approach; once I get this working, I will write a better "automatic-booking" algorithm).
I have tried playing with sed, awk, and grep, but I just can't make it work (I am new to bash programming; I only started learning bash 3 days ago).
Anyone can help?
FYI: The seats.txt format doesn't have to be that way. It could also store all the seats in one row, like: X000X0XXX00XXX
Thanks =)
Here's something to get you started. This script reads in each seat from your file and displays if it's taken or empty, keeping track of the row and column number all the while.
#!/bin/bash
let ROW=1
let COL=1
# Read one character at a time into the variable $SEAT.
while read -n 1 SEAT; do
    # Check if $SEAT is an X, 0, or other.
    case "$SEAT" in
        # Taken.
        X)  echo "Row $ROW, col $COL is taken"
            let COL++
            ;;
        # Empty.
        0)  echo "Row $ROW, col $COL is EMPTY"
            let COL++
            ;;
        # Must be a new line ('\n').
        *)  let ROW++
            let COL=1
            ;;
    esac
done < seats.txt
Notice that we feed in seats.txt at the end of the script, not at the beginning. It's weird, but that's UNIX for ya. Curiously, the entire while loop behaves like one big command:
while read -n 1 SEAT; do {stuff}; done < seats.txt
The < at the end feeds in seats.txt to the loop as a whole, and specifically to the read command.
It's not really clear what help you're asking for here. "Anyone can help?" is a very broad question.
If you're asking if you're using the right tools then yes, the text processing tools (sed/awk/grep et al) are ideal for this given the initial requirement that it be done in bash in the first place. I'd personally choose a different baseline than bash but, if that's what you've decided, then your tool selection is okay.
I should mention that bash itself can do a lot of the things you'll probably be doing with the text processing tools and without the expense of starting up external processes. But, since you're using bash, I'm going to assume that performance is not your primary concern (don't get me wrong, bash will probably be fast enough for your purposes).
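For example, here is a minimal pure-bash sketch of one such check, flagging any row that has two adjacent empty seats using [[ ]] pattern matching and no external commands:
while read -r row; do
    # *00* matches two adjacent empty seats anywhere in the row
    [[ $row == *00* ]] && echo "row '$row' has two adjacent empty seats"
done < seats.txt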
I would probably stick with the multi-line data representation for two reasons. The first is that simple text searches for two seats together will be easier if you keep the rows separate from each other. Otherwise, in the 5-seat-by-2-row XXXX00XXXX, a simplistic search would consider those two 0 seats together despite the fact that they're nowhere near each other:
XXXX0
0XXXX
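You can see the difference with a quick check (grep -c just counts matching lines):
$ echo 'XXXX00XXXX' | grep -c 00       # one flattened row: false positive
1
$ printf 'XXXX0\n0XXXX\n' | grep -c 00   # one row per line: correctly no match
0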
Secondly, some people consider the row to be very important. I won't sit in the first five rows at the local cinema simply because I have to keep moving my head to see all the action.
By way of example, you can get the front-most row with two consecutive seats with (commands are split for readability):
pax> cat seats.txt
X000X
00000
0XXX0
pax> expr $(
       (echo '00000'; cat seats.txt) |
       grep -n 00 |
       tail -1 |
       sed 's/:.*//'
     ) - 1
2
The expr magic and extra echo are to ensure you get back 0 if no seats are available. And you can get the first position in that row with:
pax> cat seats.txt |
       grep 00 |
       tail -1 |
       awk '{print index($0,"00")}'
3
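To see the expr/echo fallback mentioned above in action, here is the same pipeline against a hypothetical fully booked file (all X, called full.txt here); only the dummy 00000 line matches, so the row number comes back as 0:
$ cat full.txt
XXXXX
XXXXX
XXXXX
$ expr $(
    (echo '00000'; cat full.txt) |
    grep -n 00 |
    tail -1 |
    sed 's/:.*//'
  ) - 1
0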

Faster way to find duplicates conditioned by time

On an AIX machine without Perl, I need to filter records that are considered duplicates if they have the same id and were registered within a period of four hours.
I implemented this filter using AWK and it works pretty well, but I need a much faster solution:
# Generate list of duplicates
awk 'BEGIN {
    FS=","
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
# the "duplicate" window: 4 hours = 14400 seconds (assumed from the description above)
/OK/ && x[$8] > 1 && f[$8]-old[$8] < 14400 {
    print $0
}'
Any suggestions? Are there ways to improve the environment (preloading the file or something like that)?
The input file is already sorted.
With the corrections suggested by jj33, I made a new version with better handling of dates, while still keeping it simple enough to incorporate more operations later:
awk 'BEGIN {
    FS=",";
    SECSPERMINUTE=60;
    SECSPERHOUR=3600;
    SECSPERDAY=86400;
    split("0 31 59 90 120 151 181 212 243 273 304 334", DAYSTOMONTH, " ");
    split("0 366 731 1096 1461 1827 2192 2557 2922 3288 3653 4018 4383 4749 5114 5479 5844 6210 6575 6940 7305", DAYSTOYEAR, " ");
}
/OK/ {
    old[$8] = f[$8];
    f[$8] = mktime($4, $3, $2, $5, $6, $7);
    x[$8]++;
}
# the "duplicate" window: 4 hours = 14400 seconds (assumed from the description above)
/OK/ && x[$8] > 1 && f[$8]-old[$8] < 14400 {
    print $0
}
# argument order assumed to match the mktime($4, $3, $2, $5, $6, $7) call:
# year, month, day, hours, minutes, seconds
function mktime(y, m, d, hh, mm, ss) {
    # days from the start of the year to the start of month m
    d2m = DAYSTOMONTH[ m + 0 ];
    # add the leap day once the date is past February of a leap year
    if ( ( m > 2 ) && ( ((y % 4 == 0) && (y % 100 != 0)) || (y % 400 == 0) ) ) {
        d2m = d2m + 1;
    }
    # whole days from the start of 2000 to the start of year y
    d2y = DAYSTOYEAR[ y - 1999 ];
    return ss + (mm*SECSPERMINUTE) + (hh*SECSPERHOUR) + (d*SECSPERDAY) + (d2m*SECSPERDAY) + (d2y*SECSPERDAY);
}
'
This sounds like a job for an actual database. Even something like SQLite could probably help you reasonably well here. The big problem I see is your definition of "within 4 hours". That's a sliding window problem, which means you can't simply quantize all the data to 4 hour segments... you have to compute all "nearby" elements for every other element separately. Ugh.
If your data file contains all your records (i.e. it includes records that do not have duplicate ids within the file), you could pre-process it and produce a file that only contains records with duplicate ids.
If this is the case that would reduce the size of file you need to process with your AWK program.
How is the input file sorted? Like, cat file|sort, or sorted via a single specific field, or multiple fields? If multiple fields, what fields and what order? It appears the hour fields are a 24 hour clock, not 12, right? Are all the date/time fields zero-padded (would 9am be "9" or "09"?)
Without taking into account performance it looks like your code has problems with month boundaries since it assumes all months are 30 days long. Take the two dates 2008-05-31/12:00:00 and 2008-06-01:12:00:00. Those are 24 hours apart but your code produces the same time code for both (63339969600)
I think you would need to consider leap years. I didn't do the math, but I think during a leap year, with February hard-coded to 28 days, a comparison of noon on 2/29 and noon on 3/1 would result in the same duplicate timestamp as before. Although it looks like you didn't implement it like that. The way you implemented it, I think you still have the problem, but it's between dates on 12/31 of $leapyear and 1/1 of $leapyear+1.
I think you might also have some collisions during daylight-saving time changes if your code has to handle time zones that observe them.
The file doesn't really seem to be sorted in any useful way. I'm guessing that field $1 is some sort of status (the "OK" you're checking for). So it's sorted by record status, then by DAY, then MONTH, YEAR, HOURS, MINUTES, SECONDS. If it was year,month,day I think there could be some optimizations there. Still might be but my brain's going in a different direction right now.
If there are a small number of duplicate keys in proportion to total number of lines, I think your best bet is to reduce the file your awk script works over to just duplicate keys (as David said). You could also preprocess the file so the only lines present are the /OK/ lines. I think I would do this with a pipeline where the first awk script only prints the lines with duplicate IDs and the second awk script is basically the one above but optimized to not look for /OK/ and with the knowledge that any key present is a duplicate key.
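One possible shape for that first stage, as a rough sketch (the file is read twice so the first pass can count ids; $8 and the comma delimiter are taken from the script above, and infile.txt/dups.txt are stand-in names):
awk -F, 'NR == FNR { if (/OK/) seen[$8]++; next }
         /OK/ && seen[$8] > 1' infile.txt infile.txt > dups.txt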
If you know ahead of time that all or most lines will have repeated keys, it's probably not worth messing with. I'd bite the bullet and write it in C. Tons more lines of code, much faster than the awk script.
On many unixen, you can get sort to sort by a particular column, or field. So by sorting the file by the ID, and then by the date, you no longer need to keep the associative array of when you last saw each ID at all. All the context is there in the order of the file.
On my Mac, which has GNU sort, it's:
sort -k 8 < input.txt > output.txt
to sort on the ID field. You can sort on a second field too, by saying (e.g.) 8,3 instead, but ONLY 2 fields. So a unix-style time_t timestamp might not be a bad idea in the file - it's easy to sort, and saves you all those date calculations. Also, (again at least in GNU awk), there is a mktime function that makes the time_t for you from the components.
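For reference, GNU awk's built-in mktime takes a single "YYYY MM DD HH MM SS" string; pinning TZ makes the result reproducible:
$ TZ=UTC gawk 'BEGIN { print mktime("2008 05 31 12 00 00") }'
1212235200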
#AnotherHowie, I thought the whole preprocessing could be done with sort and uniq. The problem is that the OP's data seems to be comma delimited and (Solaris 8's) uniq doesn't give you any way to specify the record separator, so there wasn't a super clean way to do the preprocessing using standard unix tools. I don't think it would be any faster, so I'm not going to look up the exact options, but you could do something like:
cut -d, -f8 <infile.txt | sort | uniq -d | xargs -i grep {} infile.txt >outfile.txt
That's not very good because it executes grep for every line containing a duplicate key. You could probably massage the uniq output into a single regexp to feed to grep, but the benefit would only be known if the OP posts expected ratio of lines containing suspected duplicate keys to total lines in the file.
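A middle ground between per-line greps and one hand-built regexp is grep's -f option, which reads its patterns from a file; with -F they are treated as fixed strings. A sketch (infile.txt/outfile.txt as above):
cut -d, -f8 infile.txt | sort | uniq -d > dupkeys.txt
grep -F -f dupkeys.txt infile.txt > outfile.txt
This still matches a key anywhere on the line rather than only in field 8, so it's a pre-filter, not an exact test.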

Resources