Ruby check uniqueness - ruby

I built some code that goes through a text file (a webserver log file). It mostly works, but I have two questions. First, the first username in the log file is neither printed nor counted. Does anyone know why?
My second question is about my count_unique. What do I need to do to count only the unique usernames?
My Code:
count_tot = 0
count_unique = 0
file = File.new("text.txt", "r")
line = file.gets
while (line = file.gets)
substrings = line.split("&")
substrings.each do |sub|
if sub.include? 'username'
puts sub
count_tot += 1
else
end
end
end
file.close
puts ""
puts "Total found input values:"
puts count_tot
puts count_unique
Example input (2 lines)
[11/Mar/2014:00:15:02 +0100] "GET /web/show/id=568296 HTTP/1.1" 200 8499 "https://www.site.com/csc/default.aspx?sid=ertett4353452445.orker2&username=username1&timestamp=20140311001443&hashkey=847823786547385243678&" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.74.9 (KHTML, like Gecko) Version/7.0.2 Safari/537.74.9" 52345 1FD323C0D681D2F10AE789F8A6C0900D.wm9worker5
[11/Mar/2014:00:35:50 +0100] "GET /web/show/id=568296 HTTP/1.1" 200 8499 "https://www.site.com/csc/default.aspx?sid=gfdgdfdgfgdfdfg._worker1&username=username2&timestamp=20140311003517&hashkey=fdsfsdffsffds&" "Mozilla/5.0 (iPad; CPU OS 7_0_6 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/33.0.1750.14 Mobile/11B651 Safari/9537.53" 62415 5852920B165D2E39559241BA8B5FB36A.wm9worker6

only the first username visible in the logfile is not printed and not counted. Does anyone know why?
For that you need to do
line = file.gets # remove this.
while (line = file.gets) # keep only this.
The line read by line = file.gets before the while loop is never processed: the loop condition immediately overwrites line with the next line, so the first line's data is lost before the loop body runs.
update
string = <<_
[11/Mar/2014:00:15:02 +0100] "GET /web/show/id=568296 HTTP/1.1" 200 8499 "https://www.site.com/csc/default.aspx?sid=ertett4353452445.orker2&username=username1&timestamp=20140311001443&hashkey=847823786547385243678&" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.74.9 (KHTML, like Gecko) Version/7.0.2 Safari/537.74.9" 52345 1FD323C0D681D2F10AE789F8A6C0900D.wm9worker5
[11/Mar/2014:00:35:50 +0100] "GET /web/show/id=568296 HTTP/1.1" 200 8499 "https://www.site.com/csc/default.aspx?sid=gfdgdfdgfgdfdfg._worker1&username=username2&timestamp=20140311003517&hashkey=fdsfsdffsffds&" "Mozilla/5.0 (iPad; CPU OS 7_0_6 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/33.0.1750.14 Mobile/11B651 Safari/9537.53" 62415 5852920B165D2E39559241BA8B5FB36A.wm9worker6
[11/Mar/2014:00:35:50 +0100] "GET /web/show/id=568296 HTTP/1.1" 200 8499 "https://www.site.com/csc/default.aspx?sid=gfdgdfdgfgdfdfg._worker1&username=username2&timestamp=20140311003517&hashkey=fdsfsdffsffds&" "Mozilla/5.0 (iPad; CPU OS 7_0_6 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/33.0.1750.14 Mobile/11B651 Safari/9537.53" 62415 5852920B165D2E39559241BA8B5FB36A.wm9worker6
_
File.write('f1',string)
@usernames = []
File.foreach('f1') do |line|
  # collect all the usernames
  @usernames << line[/username=(\w+)/,1]
  # do other tasks with *line*
end
@usernames # => ["username1", "username2", "username2"]
# to get the uniq usernames
@usernames.uniq # => ["username1", "username2"]
# if you want to see, which username present how many times, think something
# like below
Hash[@usernames.group_by { |s| s }.map { |k,v| [k,v.size]}]
# => {"username1"=>1, "username2"=>2}
Look at the documentation for IO::foreach to understand why I used it. Also check out Array#uniq and Enumerable#group_by; their documentation is quite clear.

First of all, the IO class, and by extension File, has an each method which yields lines to the block. There is also a foreach class method that makes it even more concise.
File.foreach 'text.txt' do |line|
# Count stuff ...
end
Regarding your first question, that happens because you read the first line into a variable and then proceed to overwrite said variable immediately after in the while loop's clause. This effectively skips the first line. The example above gets rid of that problem.
It is hard to answer the second question without looking at the input we're dealing with.
A simple String#scan-based solution might suffice:
line.scan /[?&]username=([^&]*)/ do |user_name|
puts user_name
end
Everything can thus be simplified to:
user_names = File.foreach('text.txt').map do |line|
line.scan /[?&]username=([^&]*)/
end.flatten
user_name_counts = user_names.uniq.inject Hash.new do |hash, user_name|
hash.tap do |hash|
hash[user_name] = user_names.count user_name
end
end
p user_name_counts
# => {"username1"=>1, "username2"=>2}
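To tie this back to the two counters in the question, here is a minimal sketch that returns both the total and the unique count. The LOG_LINES sample and the count_usernames helper are made up for illustration; they are not part of the original code.

```ruby
# Hypothetical sample standing in for the lines of text.txt.
LOG_LINES = [
  '"https://www.site.com/csc/default.aspx?sid=a&username=username1&timestamp=1&"',
  '"https://www.site.com/csc/default.aspx?sid=b&username=username2&timestamp=2&"',
  '"https://www.site.com/csc/default.aspx?sid=b&username=username2&timestamp=3&"'
].freeze

# Hypothetical helper: accepts anything enumerable that yields lines.
def count_usernames(lines)
  # scan returns one single-element array per match, so flatten each result.
  names = lines.flat_map { |line| line.scan(/[?&]username=([^&]*)/).flatten }
  { total: names.size, unique: names.uniq.size }
end

counts = count_usernames(LOG_LINES)
puts "Total found input values: #{counts[:total]}"
puts "Unique usernames: #{counts[:unique]}"
```

For a real file, passing File.foreach('text.txt') instead of LOG_LINES should work the same way, since foreach without a block returns an enumerator of lines.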

Related

Return line number with Ruby readline

I am looking for a simple Ruby/Bash solution to investigate a log file, for example an Apache access log.
My log contains lines beginning with the string "authorization:".
The goal of the script is to return the whole next-but-one line after this match, which contains the string "x-forwarded-for".
host: 10.127.5.12:8088^M
accept: */*^M
date: Wed, 19 Apr 2019 22:12:36 GMT^M
authorization: FOP ASC-amsterdam-b2c-v7:fkj9234f$t34g34rf=^M
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0)
x-forwarded-for: 195.99.33.222, 10.127.72.254^M
x-forwarded-host: my.luckyhost.com^M
x-forwarded-server: server.luckyhost.de^M
connection: Keep-Alive^M
^M
My question relates to the if condition.
How can I get the line number/caller from readlines and, in a second step, return the whole line containing x-forwarded-for?
file = File.open(args[:apache_access_log], "r")
log_snapshot = file.readlines
file.close
log_snapshot.reverse_each do |line|
if line.include? "authorization:"
puts line
end
end
Maybe something along these lines:
log_snapshot.each_with_index.reverse_each do |line, n|
case (line)
when /authorization:/
puts '%d: %s' % [ n + 1, line ]
end
end
Where each_with_index is used to generate 0-indexed line numbers. I've switched to a case style so you can have more flexibility in matching different conditions. For example, you can add the /i flag to do a case-insensitive match really easily or add \A at the beginning to anchor it at the beginning of the string.
Another thing to consider is using the block form of File.open, like this:
File.open(args[:apache_access_log], "r") do |f|
f.readlines.each_with_index.reverse_each do |line, n|
# ...
end
end
That eliminates the need for an explicit close call. The end of the block closes the file for you automatically.
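For the second step (returning the x-forwarded-for line that follows a match), one possible approach is to look ahead from each authorization line. This is only a sketch: the sample array stands in for the result of readlines, the credential value is invented, and forwarded_for_after_authorization is a hypothetical helper name.

```ruby
# Hypothetical sample standing in for file.readlines.
log_snapshot = [
  "host: 10.127.5.12:8088\r\n",
  "accept: */*\r\n",
  "authorization: FOP example-credentials\r\n",
  "user-agent: Mozilla/5.0\n",
  "x-forwarded-for: 195.99.33.222, 10.127.72.254\r\n",
  "x-forwarded-host: my.luckyhost.com\r\n"
]

# For every "authorization:" line, find the first following
# "x-forwarded-for:" line and record both 1-based line numbers.
def forwarded_for_after_authorization(lines)
  results = []
  lines.each_with_index do |line, n|
    next unless line =~ /\Aauthorization:/i
    offset = lines[(n + 1)..-1].index { |l| l =~ /\Ax-forwarded-for:/i }
    next if offset.nil?
    found = n + 1 + offset
    results << [n + 1, found + 1, lines[found].chomp]
  end
  results
end

matches = forwarded_for_after_authorization(log_snapshot)
matches.each do |auth_no, xff_no, text|
  puts "authorization at line #{auth_no}, line #{xff_no}: #{text}"
end
```

If the header is always exactly two lines below the match, lines[n + 2] would be enough; searching forward is just a bit more tolerant of extra headers in between.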

Multiple matches in hash table

I am learning Ruby (v. 2.5) on Coursera.
The aim is to write a simple parser in Ruby that counts which IP host is responsible for the most queries in the Apache logs.
Apache logs:
87.99.82.183 - - [01/Feb/2018:18:50:06 +0000] "GET /favicon.ico HTTP/1.1" 404 504 "http://35.225.14.147/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
87.99.82.183 - - [01/Feb/2018:18:50:52 +0000] "GET /secret.html HTTP/1.1" 404 505 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36"
Ruby code:
class ApacheLogAnalyzer
def initialize
@total_hits_by_ip = {}
end
def analyze(file_name)
ip_regex = /^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}/
file = File.open(file_name , "r")
file.each_line do |line|
count_hits(ip_regex.match(line))
end
end
def count_hits(ip)
if ip
if @total_hits_by_ip[ip]
@total_hits_by_ip[ip] += 1
else
@total_hits_by_ip[ip] = 1
end
end
end
Result is following:
{#<MatchData "87.99.82.183">=>1, #<MatchData "87.99.82.183">=>1}
The result contains duplicates (it should contain one key "87.99.82.183" with value 2). Where could the issue be?
The result contains duplicates because the hash keys are different MatchData objects that happen to hold the same matched text. Look at this example:
a = "hello world foo".match(/he/) # => #<MatchData "he">
b = "hello world bar".match(/he/) # => #<MatchData "he">
a == b # => false
You can use plain strings as the hash keys to avoid this, for example:
class ApacheLogAnalyzer
  def analyze(file_name)
    File.open(file_name).each_line.inject(Hash.new(0)) do |result, line|
      ip = line.split.first # the IP is the first whitespace-separated field
      result[ip] += 1
      result
    end
  end
end
Thank you for your comment. I found that calling to_s on the match resolves the issue.
So the improved code looks like this:
count_hits(ip_regex.match(line).to_s)
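Here is a small sketch of the same idea with string keys, which also picks out the busiest IP. Note that nil.to_s is an empty string, so with to_s a line without an IP would be counted under ""; indexing the line with the regex returns nil instead and lets you skip it. The names IP_REGEX, hits_by_ip, and the third sample line are assumptions made for this example.

```ruby
# Dots are escaped so they only match literal dots (an assumption about intent).
IP_REGEX = /\A\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/

# Hypothetical helper: counts hits per IP using String keys.
def hits_by_ip(lines)
  lines.each_with_object(Hash.new(0)) do |line, hits|
    ip = line[IP_REGEX] # String or nil, never a MatchData
    hits[ip] += 1 if ip
  end
end

sample = [
  '87.99.82.183 - - [01/Feb/2018:18:50:06 +0000] "GET /favicon.ico HTTP/1.1" 404 504',
  '87.99.82.183 - - [01/Feb/2018:18:50:52 +0000] "GET /secret.html HTTP/1.1" 404 505',
  '10.0.0.7 - - [01/Feb/2018:18:51:10 +0000] "GET / HTTP/1.1" 200 120'
]

hits = hits_by_ip(sample)
p hits
top_ip, top_count = hits.max_by { |_ip, count| count }
puts "#{top_ip} made the most requests (#{top_count})"
```

Two equal strings are equal hash keys, which is why the duplicates disappear.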

Ruby: How to set feedjira configuration options?

In the Feedjira 2.0 announcement blog post, it says that if you want to set the user agent, that should be a configuration option, but it is not clear how to do this. Ideally, I would like to mimic the options previously provided in Feedjira 1.0, including user_agent, if_modified_since, timeout, and ssl_verify_peer.
http://feedjira.com/blog/2014/04/14/thoughts-on-version-two-point-oh.html
With Feedjira 1.0, you could set those options by making the following call (as described here):
feed_parsed = Feedjira::Feed.fetch_and_parse("http://sports.espn.go.com/espn/rss/news", {:if_modified_since => Time.now, :ssl_verify_peer => false, :timeout => 5, :user_agent => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"})
The only example I have seen where configuration options are set was from a comment in a github pull request, which is as follows:
Feedjira::Feed.configure do |faraday|
faraday.request :user_agent, app: "MySite", version: APP_VERSION
end
But when I tried something similar, I received the following error:
undefined method `configure' for Feedjira::Feed:Class
It looks like a patch was added to allow a timeout option to be passed to the fetch_and_parse function:
https://github.com/feedjira/feedjira/pull/318/commits/fbdb85b622f72067683508b1d7cab66af6303297#diff-a29beef397e3d8624e10af065da09a14
However, until that is released, timeout and open_timeout options can be set by bypassing Feedjira for the fetch and using Faraday directly (or any other HTTP client library, such as Net::HTTP). You can also disable SSL verification and set the user agent, like this:
require 'feedjira'
require 'faraday' # used directly below
require 'pp'
url = "http://www.espn.com/espnw/rss/?sectionKey=athletes-life"
conn = Faraday.new :ssl => {:verify => false}
response = conn.get do |request|
request.url url
request.options.timeout = 5
request.options.open_timeout = 5
request.headers = {'User-Agent' => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"}
end
feed_parsed = Feedjira::Feed.parse response.body
pp feed_parsed.entries.first
I haven't seen a way to check for "if_modified_since", but I will update this answer if I do.
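One possibility, offered as an assumption rather than a Feedjira feature, is to send the standard HTTP If-Modified-Since header yourself when fetching. The sketch below only builds the header hash with the standard library; conditional_headers and the 'MySite/1.0' agent string are hypothetical.

```ruby
require 'time' # adds Time#httpdate

# Hypothetical helper: builds request headers for a conditional GET.
def conditional_headers(user_agent, modified_since)
  {
    'User-Agent' => user_agent,
    # HTTP dates are always expressed in GMT; httpdate handles the format.
    'If-Modified-Since' => modified_since.httpdate
  }
end

headers = conditional_headers('MySite/1.0', Time.utc(2014, 3, 11, 0, 15, 2))
puts headers['If-Modified-Since']
```

In the Faraday block above, this hash could be assigned with request.headers = headers. A server that honours the header would answer with status 304 and an empty body when the feed has not changed, so the status should be checked before parsing.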

Trouble scraping Google trends using Capybara and Poltergeist

I want to get the top trending queries in a particular category on Google Trends. I could download the CSV for that category but that is not a viable solution because I want to branch into each query and find the trending sub-queries for each.
I am unable to capture the contents of the following table, which contains the top 10 trending queries for a topic. Also for some weird reason taking a screenshot using capybara returns a darkened image.
<div id="TOP_QUERIES_0_0table" class="trends-table">
Please run the code on the Ruby console to see it working. Capturing elements/screenshot works fine for facebook.com or google.com but doesn't work for trends.
I am guessing this has to do with the table getting generated dynamically on page load but I'm not sure if that should block capybara from capturing the elements already loaded on the page. Any hints would be very valuable.
require 'capybara/poltergeist'
require 'capybara/dsl'
require 'csv'
class PoltergeistCrawler
include Capybara::DSL
def initialize
Capybara.register_driver :poltergeist_crawler do |app|
Capybara::Poltergeist::Driver.new(app, {
:js_errors => false,
:inspector => false,
phantomjs_logger: open('/dev/null')
})
end
Capybara.default_wait_time = 3
Capybara.run_server = false
Capybara.default_driver = :poltergeist_crawler
page.driver.headers = {
"DNT" => 1,
"User-Agent" => "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:22.0) Gecko/20100101 Firefox/22.0"
}
end
# handy to peek into what the browser is doing right now
def screenshot(name="screenshot")
page.driver.render("public/#{name}.jpg",full: true)
end
# find("path") and all("path") work ok for most cases. Sometimes I need more control, like finding hidden fields
def doc
Nokogiri.parse(page.body)
end
end
crawler = PoltergeistCrawler.new
url = "http://www.google.com/trends/explore#cat=0-45&geo=US&date=today%2012-m&cmpt=q"
crawler.visit url
crawler.screenshot
crawler.find(:xpath, "//div[@id='TOP_QUERIES_0_0table']")
Capybara::ElementNotFound: Unable to find xpath "//div[@id='TOP_QUERIES_0_0table']"
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:41:in `block in find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/base.rb:84:in `synchronize'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/node/finders.rb:30:in `find'
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/session.rb:676:in `block (2 levels) in '
from /Users/karan/.rvm/gems/ruby-1.9.3-p545/gems/capybara-2.4.4/lib/capybara/dsl.rb:51:in `block (2 levels) in <module:DSL>'
from (irb):45
from /Users/karan/.rbenv/versions/1.9.3-p484/bin/irb:12:in'
The JavaScript error was due to an incorrect User-Agent. Once I changed the User-Agent to that of my Chrome browser, it worked!
"User-Agent" => "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36"

Can't read HTTP Request Header correctly with Ruby 1.9.3

I'm writing a small webserver. I want to read the HTTP Request. It works when there is no body involved. But when a body is sent then I can't read the content of the body in a satisfying manner.
I read the data coming from the client via TCPSocket. The TCPSocket::gets method reads until the data for the body is received. There is no delimiter or EOF sent to signal the end of the HTTP request body. The HTTP/1.1 Specification - Section 4.4 lists five cases to get the message length. Point 1) works. Points 2) and 4) are not relevant for my application. Point 5) is not an option because I need to send a response.
I can read the value of the Content-Length field. But when I try to "persuade" the TCPSocket to read the last part of the HTTP request via read(contentlength) or recv(contentlength), I have no success. Reading line by line until the \r\n that separates header and body works, but after that I'm stuck, at least in the way I want to do it.
So my questions are:
Is there a possibility to do it like I intended in the code?
Are there better ways to achieve my goal of reading the HTTP request correctly (which I really hope for)?
Here is runnable code. The parts that I want to work are commented out.
#!/usr/bin/ruby
require 'socket'
server = TCPServer.new 2000
loop do
Thread.start(server.accept) do |client|
hascontent = false
contentlength = 0
content = ""
request = ""
#This seems to work, but I'm not really happy with it, too much is happening in
#the loop
while(buf = client.readpartial(4096))
request = request + buf
split = request.split("\r\n")
puts request
puts request.dump
puts split.length
puts split.inspect
if request.include?("\r\n\r\n") # index returns nil when absent, and nil > 0 would raise
break
end
end
#This part is commented out because it doesn't work
=begin
while(line = client.gets)
puts ":" + line
request = request + line
if(line.start_with?("Content-Length"))
hascontent = true
split = line.split(' ')
contentlength = split[1]
end
if(line == "\r\n" and !hascontent)
break
end
if(line == "\r\n" and hascontent)
puts "Trying to get content :P"
puts contentlength
puts content.length
puts client.inspect
#tried read, with and without parameter, rcv, also with and
#without param and their nonblocking couterparts
#where does my thought process go in the wrong direction
while(readin = client.readpartial(contentlength))
puts readin
content = content + readin
end
break
end
end
=end
puts request
client.close
end
end
So... I have just had this issue for the past 2 hours also, and so I did some digging into the Socket API. Turns out Socket extends BasicSocket which has a method recvmsg. When I tried calling it I got the following:
["GET / HTTP/1.1\r\nHost: localhost:12357\r\nConnection: keep-alive\r\nCache-Control: max-age=0\r\nUpgrade-Insecure-Requests: 1\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36\r\nAccept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8\r\nAccept-Encoding: gzip, deflate, br\r\nAccept-Language: en-US,en;q=0.9\r\n\r\n", #<Addrinfo: empty-sockaddr SOCK_STREAM>, 0]
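I.e. the complete HTTP request, the sender's address information, and any flags set on the received message.
You can use recvmsg to read the request; it returns an array whose first element is the data. Note that on a TCP socket a single call is not guaranteed to return the whole request, especially when a large body is sent:
raw_request, _sender, _flags = client.recvmsg
request = /(?<METHOD>\w+) \/(?<RESOURCE>[^ ]*) HTTP\/1\.\d\r\n(?<HEADERS>(.+\r\n)*)(?:\r\n)?(?<BODY>(.|\s)*)/i.match(raw_request)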
p request["BODY"]
I have no idea how to do it without recvmsg but I am glad the functionality exists.
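For what it's worth, here is a sketch of the line-by-line approach from the question using only gets and read. The detail that seems to matter is that the Content-Length value arrives as a String and has to be converted to an Integer before it is passed to read. StringIO stands in for the client socket so the example runs without a network; read_http_request is a hypothetical helper name, and chunked transfer encoding is not handled.

```ruby
require 'stringio'

# Reads one HTTP request from any IO-like object (TCPSocket, StringIO).
# Returns [request_line, headers_hash, body].
def read_http_request(io)
  request_line = io.gets("\r\n").to_s.chomp
  headers = {}
  # The header section ends at the first empty line ("\r\n" on its own).
  while (line = io.gets("\r\n")) && line != "\r\n"
    name, value = line.chomp.split(':', 2)
    headers[name.downcase] = value.to_s.strip
  end
  # Content-Length is text; read needs an Integer byte count.
  length = headers.fetch('content-length', '0').to_i
  # read(n) blocks until n bytes have arrived or the peer closes.
  body = length > 0 ? io.read(length) : ''
  [request_line, headers, body]
end

# StringIO stands in for the accepted client socket.
raw = "POST /submit HTTP/1.1\r\nHost: localhost:2000\r\n" \
      "Content-Length: 11\r\n\r\nhello=world"
request_line, headers, body = read_http_request(StringIO.new(raw))
puts request_line
puts body
```

Inside the Thread.start block of the question, the call would be read_http_request(client). Because the body length is known up front, there is no need to loop on readpartial and wait for an EOF that never comes.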
