Parse Apache Formatted URLs in Ruby

How can I take in an Apache Common Log file and list all of the URLs in it as a neat histogram like:
/favicon.ico ##
/manual/mod/mod_autoindex.html #
/ruby/faq/Windows/ ##
/ruby/faq/Windows/index.html #
/ruby/faq/Windows/RubyonRails #
/ruby/rubymain.html #
/robots.txt ########
Sample of test file:
65.54.188.137 - - [03/Sep/2006:03:50:20 -0400] "GET /~longa/geomed/ppa/doc/localg/localg.htm HTTP/1.0" 200 24834
65.54.188.137 - - [03/Sep/2006:03:50:32 -0400] "GET /~longa/geomed/modules/sv/scen1.html HTTP/1.0" 200 1919
65.54.188.137 - - [03/Sep/2006:03:53:51 -0400] "GET /~longa/xlispstat/code/statistics/introstat/axis/code/axisDens.lsp HTTP/1.0" 200 15962
65.54.188.137 - - [03/Sep/2006:04:03:03 -0400] "GET /~longa/geomed/modules/cluster/lab/nm.pop HTTP/1.0" 200 66302
65.54.188.137 - - [03/Sep/2006:04:11:15 -0400] "GET /~longa/geomed/data/france/names.txt HTTP/1.0" 200 20706
74.129.13.176 - - [03/Sep/2006:04:14:35 -0400] "GET /~jbyoder/ambiguouslyyours/ambig.rss HTTP/1.1" 304 -
This is what I have right now (but I'm not sure how to make the histogram):
...
---
APACHE_LINE = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
parts = APACHE_LINE.match(log_line)   # log_line: a single line from the log
p parts[:ip_address], parts[:status], parts[:method], parts[:url]

def get_url(file)
  hits = Hash.new { |h, k| h[k] = 0 }
  File.read(file).each_line do |line|
    parts = APACHE_LINE.match(line)
    next unless parts
    k = parts[:url]
    hits[k] += 1
    # This prints a line every time a URL is hit, not the final histogram
    puts "%-15s %s" % [k, '#' * hits[k]]
  end
end
...
---
Here is the full question: http://pastebin.com/GRPS6cTZ. Pseudocode is fine.

You can create a hash mapping each path to the number of hits. For convenience, I suggest using a Hash that sets the value to 0 when you ask for a path it hasn't seen before. For example:
hits = Hash.new{ |h,k| h[k]=0 }
...
hits["/favicon.ico"] += 1
hits["/ruby/faq/Windows/"] += 1
hits["/favicon.ico"] += 1
p hits
#=> {"/favicon.ico"=>2, "/ruby/faq/Windows/"=>1}
In case the log file is really huge, instead of slurping the whole thing into memory, process the lines one at a time. (Look through the methods of the File class.)
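For instance, a minimal sketch (the access.log path is just a placeholder) that streams the file one line at a time instead of slurping it:
# File.foreach yields each line without loading the whole file into memory
File.foreach("access.log") do |line|
  # parse and count the line here
end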
Because Apache log file formats don't have standard delimiters, I'd suggest using a regular expression to take each line and separate it into the chunks you want. Assuming you're using Ruby 1.9, I'm going to use named captures for clean access to the matched parts later on. For example:
apache_line = /\A(?<ip_address>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>GET|POST) (?<url>\S+) \S+?" (?<status>\d+) (?<bytes>\S+)/
...
parts = apache_line.match(log_line)
p parts[:ip_address], parts[:status], parts[:method], parts[:url]
You might want to filter these based on the status code. For example, do you want your graph to include all the 404 hits where someone mistyped a URL? If you're not slurping all the lines into memory, you won't be using Array#select; instead you'll skip those lines during your loop, as in the sketch below.
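A minimal sketch of that kind of filtering (the 200-only rule is just an assumption for illustration; apache_line and hits are as defined above):
File.foreach("access.log") do |line|
  parts = apache_line.match(line)
  next unless parts                    # skip lines the regex doesn't recognize
  next unless parts[:status] == "200"  # skip errors, redirects, etc.
  hits[parts[:url]] += 1
end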
After you have gathered all your hits, it's time to write out the results. Some helpful tips:
Hash#keys can give you all the keys of the hash (the paths) at once. You probably want to write out all the paths padded to the same width, so you need to figure out which is the longest. Perhaps you want to map the paths to their lengths and then get the max element, or perhaps you want to use max_by to find the longest path and then take its length.
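For example (a small sketch, assuming hits is the hash of counts built above):
longest = hits.keys.map(&:length).max        # map paths to lengths, take the max
longest = hits.keys.max_by(&:length).length  # or find the longest path, then its length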
Although geeky, using sprintf or String#% is a great way to lay out formatted reports. For example:
puts "%-15s %s" % ["Hello","####"]
#=> "Hello ####"
Just like you needed to find the longest name for good formatting, you might want to find the URL with the most hits so that you can scale the longest run of hashes to that value. Hash#values will give you an array of all the values. Alternatively, perhaps you have a requirement that one # must always represent 100 hits, or something similar.
Note that String#* lets you create a string by repetition:
p '#'*10
#=> "##########"
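Putting those pieces together, a rough sketch of the report loop might look like this (one '#' per hit, as in your sample output; hits is assumed to be the hash of counts, and the padding width comes from the longest path):
width = hits.keys.map(&:length).max
hits.sort.each do |url, count|
  puts "%-#{width}s %s" % [url, '#' * count]
end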
If you have specific questions with your code, ask more questions!

Since this is homework, I won't give you the exact answer, but Simone Carletti has implemented a Ruby class to parse Apache log files. You might start there and look at how he does things.

Related

Extract data from URL with Ruby

I'm new to Ruby and I'm trying to return a list of ASINs and corresponding prices. I was able to get pretty close to what I need, but I need help answering two questions:
How can I get rid of the [[" and \n"]] around the ASIN (see result below)?
Is there a simpler way to extract the ASIN from the URL than using this regex?
Thanks so much for your help!
Here is what I get in the Terminal from the current code:
[["B00EJDIG8M\n"]] - $7.00
[["B00KJ07SEM\n"]] - $26.99
[["B000FAR33M\n"]] - $119.00
[["B00LLMKPVK\n"]] - $22.99
[["B007NXPAQG\n"]] - $9.47
[["B004W5WAMU\n"]] - $22.43
[["B00LFUNGU0\n"]] - $17.99
[["B0052G14E8\n"]] - $54.99
[["B002MPLYEW\n"]] - $212.99
[["B00009W3G7\n"]] - $6.61
[["B000NCTOUM\n"]] - $3.04
[["B009SANIDO\n"]] - $12.29
[["B0052G51AQ\n"]] - $67.99
[["B003XEUEPQ\n"]] - $26.74
[["B00CYH9HRO\n"]] - $25.75
[["B00KV0SKQK\n"]] - $21.99
[["B009PCI2JU\n"]] - $56.66
[["B00LLM6ZFK\n"]] - $24.99
[["B004RQDY60\n"]] - $18.40
[["B000JLNBW4\n"]] - $49.14
Here is the code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
PAGE_URL = "http://www.amazon.com/Best-Sellers-Appliances/zgbs/appliances/ref=zg_bs_nav_0"
page = Nokogiri::HTML(open(PAGE_URL))
page.css(".zg_itemWrapper").each do |item|
price = item.at_css(".zg_price .price").text
asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
puts "#{asin} - #{price}"
end
Rather than cleaning up your Nokogiri search, the easiest thing to do at this point is just clean up your current asin values during interpolation. For example:
puts "#{asin.flatten.pop.chomp} - #{price}"
Regarding question 2, I realized I don't really need a regex and found a way to get the same result with a much shorter line of code,
replacing
asin = item.at_css(".zg_title a")[:href].scan(/http:\/\/(?:www\.|)amazon\.com\/(?:gp\/product|[^\/]+\/dp|dp)\/([^\/]+)/)
with
asin = item.at_css(".zg_title a")[:href].split("/")[5].chomp

Ruby paging over API response dataset causes memory spike

I'm experiencing an issue with a large memory spike when I page through a dataset returned by an API. The API is returning ~150k records, I'm requesting 10k records at a time and paging through 15 pages of data. The data is an array of hashes, each hash containing 25 keys with ~50-character string values. This process kills my 512MB Heroku dyno.
I have a method used for paging an API response dataset.
def all_pages value_key = 'values', &block
  response = {}
  values = []
  current_page = 1
  total_pages = 1
  offset = 0
  begin
    response = yield offset
    # The following seems to be the culprit
    values += response[value_key] if response.key? value_key
    offset = response['offset']
    total_pages = (response['totalResults'].to_f / response['limit'].to_f).ceil if response.key? 'totalResults'
  end while (current_page += 1) <= total_pages
  values
end
I call this method like so:
all_pages("items") do |current_page|
get "#{data_uri}/data", query: {offset: current_page, limit: 10000}
end
I know it's the concatenation of the arrays that is causing the issue, as removing that line allows the process to run with no memory issues. What am I doing wrong? The whole dataset is probably no larger than 20MB - how is that consuming all the dyno memory? What can I do to improve the efficiency here?
Update
Response looks like this: {"totalResults":208904,"offset":0,"count":1,"hasMore":true,"limit":"10000","items":[...]}
Update 2
Running with memory reporting shows the following:
[HTTParty] [2014-08-13 13:11:22 -0700] 200 "GET 29259/data" -
Memory 171072KB
[HTTParty] [2014-08-13 13:11:26 -0700] 200 "GET 29259/data" -
Memory 211960KB
... removed for brevity ...
[HTTParty] [2014-08-13 13:12:28 -0700] 200 "GET 29259/data" -
Memory 875760KB
[HTTParty] [2014-08-13 13:12:33 -0700] 200 "GET 29259/data" -
Errno::ENOMEM: Cannot allocate memory - ps ax -o pid,rss | grep -E "^[[:space:]]*23137"
Update 3
I can recreate the issue with the basic script below. The script is hard coded to only pull 100k records and already consumes over 512MB of memory on my local VM.
#! /usr/bin/ruby
require 'uri'
require 'net/http'
require 'json'

uri = URI.parse("https://someapi.com/data")
offset = 0
values = []

begin
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.set_debug_output($stdout)

  request = Net::HTTP::Get.new(uri.request_uri + "?limit=10000&offset=#{offset}")
  request.add_field("Content-Type", "application/json")
  request.add_field("Accept", "application/json")

  response = http.request(request)
  json_response = JSON.parse(response.body)

  values << json_response['items']
  offset += 10000
end while offset < 100_000

values
Update 4
I've made a couple of improvements which seem to help but not completely alleviate the issue.
1) Using symbolize_keys turned out to consume less memory. This is because the keys of each hash are the same and it's cheaper to symbolize them than to parse them as separate Strings.
2) Switching to ruby-yajl for JSON parsing consumes significantly less memory as well.
Memory consumption of processing 200k records:
JSON.parse(response.body): 861080KB (Before completely running out of memory)
JSON.parse(response.body, symbolize_keys: true): 573580KB
Yajl::Parser.parse(response.body): 357236KB
Yajl::Parser.parse(response.body, symbolize_keys: true): 264576KB
This is still an issue though.
Why does a dataset that's no more than 20MB take that much memory to process?
What is the "right way" to process large datasets like this?
What does one do when the dataset becomes 10x larger? 100x larger?
I will buy a beer for anyone who can thoroughly answer these three questions!
Thanks a lot in advance.
You've identified the problem to be using += with your array. So the likely solution is to add the data without creating a new array each time.
values.push response[value_key] if response.key? value_key
Or use the << operator:
values << response[value_key] if response.key? value_key
You should only use += if you actually want a new array. It doesn't appear that you do; you just want all the elements collected in a single array.
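One caveat worth noting: push and << append response[value_key] as a single nested element, whereas += copied its elements into a brand-new array. If what you want is a flat array without allocating a new one on every page, Array#concat appends the elements in place. A tiny sketch of the difference:
values = []
values << [1, 2]        #=> [[1, 2]]       (the page is nested as one element)
values = []
values.concat([1, 2])   #=> [1, 2]         (elements appended in place)
values.concat([3, 4])   #=> [1, 2, 3, 4]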

Match Multiple Patterns in a String and Return Matches as Hash

I'm working with some log files, trying to extract pieces of data.
Here's an example of a file which, for the purposes of testing, I'm loading into a variable named sample. NOTE: The column layout of the log files is not guaranteed to be consistent from one file to the next.
sample = "test script result
Load for five secs: 70%/50%; one minute: 53%; five minutes: 49%
Time source is NTP, 23:25:12.829 UTC Wed Jun 11 2014
D
MAC Address IP Address MAC RxPwr Timing I
State (dBmv) Offset P
0000.955c.5a50 192.168.0.1 online(pt) 0.00 5522 N
338c.4f90.2794 10.10.0.1 online(pt) 0.00 3661 N
990a.cb24.71dc 127.0.0.1 online(pt) -0.50 4645 N
778c.4fc8.7307 192.168.1.1 online(pt) 0.00 3960 N
"
Right now, I'm just looking for IPv4 and MAC address; eventually the search will need to include more patterns. To accomplish this, I'm using two regular expressions and passing them to Regexp.union
patterns = Regexp.union(/(?<mac_address>\h{4}\.\h{4}\.\h{4})/, /(?<ip_address>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/)
As you can see, I'm using named groups to identify the matches.
The result I'm trying to achieve is a Hash. The key should equal the capture group name, and the value should equal what was matched by the regular expression.
Example:
{"mac_address"=>"0000.955c.5a50", "ip_address"=>"192.168.0.1"}
{"mac_address"=>"338c.4f90.2794", "ip_address"=>"10.10.0.1"}
{"mac_address"=>"990a.cb24.71dc", "ip_address"=>"127.0.0.1"}
{"mac_address"=>"778c.4fc8.7307", "ip_address"=>"192.168.1.1"}
Here's what I've come up with so far:
sample.split(/\r?\n/).each do |line|
hashes = []
line.split(/\s+/).each do |val|
match = val.match(patterns)
if match
hashes << Hash[match.names.zip(match.captures)].delete_if { |k,v| v.nil? }
end
end
results = hashes.reduce({}) { |r,h| h.each {|k,v| r[k] = v}; r }
puts results if results.length > 0
end
I feel like there should be a more "elegant" way to do this. My chief concern, though, is performance.
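For what it's worth, a slightly more compact sketch (using the same patterns, and not benchmarked) scans each whole line for the union pattern instead of splitting on whitespace first:
sample.each_line do |line|
  result = {}
  line.scan(patterns) do
    m = Regexp.last_match              # MatchData for the current scan hit
    m.names.zip(m.captures).each { |k, v| result[k] = v unless v.nil? }
  end
  puts result unless result.empty?
end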

How do I parse Google image URLs using Ruby and Nokogiri?

I'm trying to make an array of all the image files on a Google images webpage.
I want a regular expression to pull everything after "imgurl=" and ending before "&amp;", as seen in this HTML:
<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>
I feel like I can do this with a regex, but I can't find a way to search my parsed document using one.
str = '<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>'
str.split('imgurl=')[1].split('&amp')[0]
#=> "http://www.trendytree.com/old-world- christmas/images/20031chapel20031-silent-night-chapel.jpg"
Is that what you're looking for?
The problem with using a regex is that you assume too much knowledge about the order of parameters in the URL. If the order changes, or the & disappears, the regex won't work.
Instead, parse the URL, then split the values out:
# encoding: UTF-8
require 'nokogiri'
require 'cgi'
require 'uri'
doc = Nokogiri::HTML.parse('<img height="124" width="124" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRLy5inpSdHxWuE7z3QSZw35JwN3upbBaLr11LR25noTKbSMn9-qrySSg"><br><cite title="trendytree.com">trendytree.com</cite><br>Silent Night Chapel <b>20031</b><br>400 × 400 - 58k - jpg</td>')
doc.search('a').each do |a|
  query_params = CGI::parse(URI(a['href']).query)
  puts query_params['imgurl']
end
Which outputs:
http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg
Both URI and CGI are used because URI's decode_www_form raises an exception when trying to decode the query.
I've also been known to decode the query string into a hash using something like:
Hash[URI(a['href']).query.split('&').map{ |p| p.split('=') }]
That will return:
{"imgurl"=>
"http://www.trendytree.com/old-world-christmas/images/20031chapel20031-silent-night-chapel.jpg",
"imgrefurl"=>
"http://www.trendytree.com/old-world-christmas/silent-night-chapel-20031-christmas-ornament-old-world-christmas.html",
"usg"=>"__YJdf3xc4ydSfLQa9tYnAzavKHYQ",
"h"=>"400",
"w"=>"400",
"sz"=>"58",
"hl"=>"en",
"start"=>"19",
"zoom"=>"1",
"tbnid"=>"ajDcsGGs0tgE9M:",
"tbnh"=>"124",
"tbnw"=>"124",
"ei"=>"qagfUbXmHKfv0QHI3oG4CQ",
"itbs"=>"1",
"sa"=>"X",
"ved"=>"0CE4QrQMwEg"}
To get all the img URLs you want, do:
require 'nokogiri'
require 'open-uri'

# get all links
url = 'some-google-images-url'
links = Nokogiri::HTML( open(url) ).css('a')
# get regex match or nil on desired img
img_urls = links.map {|a| a['href'][/imgurl=(.*?)&/, 1] }
# get rid of nils
img_urls.compact
The regex you want is /imgurl=(.*?)&/ because you want a non-greedy match between imgurl= and &, otherwise the greedy .* would take everything to the last & in the string.
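A quick illustration of the difference, using a made-up href purely for demonstration:
href = "/url?imgurl=http://example.com/pic.jpg&imgrefurl=http://example.com/&h=400"
p href[/imgurl=(.*?)&/, 1]   # non-greedy: stops at the first &
#=> "http://example.com/pic.jpg"
p href[/imgurl=(.*)&/, 1]    # greedy: runs to the last &
#=> "http://example.com/pic.jpg&imgrefurl=http://example.com/"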

Highlight numbers like keywords in a Notepad++ custom language (for access logs)

I want to write a custom language for access logs in Notepad++.
The problem is that numbers (here: HTTP status codes) won't be highlighted like real keywords (e.g. GET). Notepad++ only provides one highlight color for numbers in general.
How do I handle numbers like text?
Sample log file
192.23.0.9 - - [10/Sep/2012:13:46:42 +0200] "GET /js/jquery-ui.custom.min.js HTTP/1.1" 200 206731
192.23.0.9 - - [10/Sep/2012:13:46:43 +0200] "GET /js/onmediaquery.min.js HTTP/1.1" 200 1229
192.23.0.9 - - [10/Sep/2012:13:46:43 +0200] "GET /en/contact HTTP/1.1" 200 12836
192.23.0.9 - - [10/Sep/2012:13:46:44 +0200] "GET /en/imprint HTTP/1.1" 200 17380
192.23.0.9 - - [10/Sep/2012:13:46:46 +0200] "GET /en/nothere HTTP/1.1" 404 2785
Sample custom languages
http://sourceforge.net/apps/mediawiki/notepad-plus/index.php?title=User_Defined_Language_Files
I also tried editing and importing a predefined language like this:
http://notepad-plus.sourceforge.net/commun/userDefinedLang/Log4Net.xml
I thought the custom language should look like this:
<KeywordLists>
[...]
<Keywords name="Words1">404 501</Keywords>
<Keywords name="Words2">301 303</Keywords>
<Keywords name="Words3">200</Keywords>
</KeywordLists>
<Styles>
<WordsStyle name="DEFAULT" styleID="11" fgColor="000000" bgColor="FFFFFF" colorStyle="0" fontName="Courier New" fontStyle="0"/>
[...]
<WordsStyle name="KEYWORD1" styleID="5" fgColor="FF0000" bgColor="FFFFFF" colorStyle="1" fontName="" fontStyle="0"/>
<WordsStyle name="KEYWORD2" styleID="6" fgColor="0000FF" bgColor="FFFFFF" colorStyle="1" fontName="" fontStyle="1"/>
<WordsStyle name="KEYWORD3" styleID="7" fgColor="00FF00" bgColor="FFFFFF" colorStyle="1" fontName="" fontStyle="0"/>
[...]
<!-- This line causes the number highlighting. Deleting it doesn't work either. -->
<WordsStyle name="NUMBER" styleID="4" fgColor="0F7F00" bgColor="FFFFFF" fontName="" fontStyle="0"/>
</Styles>
Unfortunately all the numbers will be colored in the same color. I'd like to color them individually, e.g. 404 and 501 in red, 301 and 303 in blue, 200 in green, and so on.
Any suggestions? How to handle the numbers like keywords?
It isn't possible to highlight numbers as keywords, because the built-in lexers (parsers/language definitions) treat a numeric as a single token. The only way to differentiate between a numeric and your keyword would be to parse the whole numeric block and then compare it to the keyword list, and in that case you would also need to parse the delimiters around the numeric block to ensure that .200. doesn't highlight as 200. This is why your numbers all highlighted as the same color, namely the 'number' color.
While this could be done with a custom lexer using either fixed-position tokens or regex matching, the user-defined languages (the last I heard) do not have this capability.
As your request is actually fairly simple, from what I understand, and keeping it as general as possible (as requested in your comment)...
Highlight space delimited numeric values contained in a given set.
[[:space:]](200|301|404)[[:space:]]
We can use the 'Mark' feature of the 'Find' dialog with that regex, but then everything is marked in the same color, just like in your failed experiment.
Perhaps what would be simple, and would suit your needs, is to use an npp PythonScript together with the Mark Style settings in the Style Configurator to get the desired result.
Something like this crude macro-style script:
from Npp import *

def found(line, m):
    global first
    pos = editor.positionFromLine(line)
    if first:
        editor.setSelection(pos + m.end(), pos + m.start())
        first = False
    else:
        editor.addSelection(pos + m.end(), pos + m.start())

editor.setMultipleSelection(True)
lines = editor.getUserLineSelection()

# Use space padded search since MARKALLEXT2 will act just
# like the internal lexer if only the numeric is selected
# when it is called.
first = True
editor.pysearch( " 200 ", found, 0, lines[0], lines[1])
notepad.menuCommand(MENUCOMMAND.SEARCH_MARKALLEXT1)

first = True
editor.pysearch( " 301 ", found, 0, lines[0], lines[1])
notepad.menuCommand(MENUCOMMAND.SEARCH_MARKALLEXT2)

first = True
editor.pysearch( " 404 ", found, 0, lines[0], lines[1])
notepad.menuCommand(MENUCOMMAND.SEARCH_MARKALLEXT3)
To use it, install Python Script via the Plugin Manager, go to the plugin menu and select New Script, paste in the code and save, select the tab for the document you want to parse, and execute the script (once again from the plugin menu).
Obviously you could use all 5 Mark styles for different terms, you could assign the script to a shortcut, and you could get more into the 'scripting' -vs- 'macro' style of nppPython and write a full-blown script to parse whatever you want... shoot, having a script trigger whenever you select a particular lexer style is doable too.
