Pig script to parse AWS ELB log

I am trying to parse this ELB log with Pig:
2016-07-16T00:00:41.700161Z testelb 11.11.17.2:50883 192.168.1.94:80 0.00002 0.001392 0.000019 200 200 0 43 "GET http://test.example.com:80/bac?aid=b5cf542d74&cid=etrsewtp&bid=23c45c543&dte=Sat%20Jul%2016%202016%2008:00:41%20GMT+0800%20(HKT) HTTP/1.1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69" - -
I am able to parse it successfully using this script:
A = LOAD '/tmp/one.log' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(
        REGEX_EXTRACT_ALL(
            line,
            '^(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) "(.+?)" "(.+?)" (\\S+) (\\S+)'
        )
    ) AS (
        timestamp:chararray, elb:chararray, client_port:chararray,
        backend_port:chararray, request_processing_time:float,
        backend_processing_time:float, response_processing_time:float,
        elb_status_code:int, backend_status_code:int, received_bytes:int,
        sent_bytes:int, request:chararray, user_agent:chararray,
        ssl_cipher:chararray, ssl_protocol:chararray
    );
DUMP B;
Now I want to extract the request URL and the aid, bid, cid values, but I am not able to get the regex to match. Can someone help me get these details?
Apart from the regex method above, if there is any other method to get the complete ELB log details, I would like to know it.
NOTE: The positions of aid, bid, and cid are not fixed in the request log.

Your question has already been answered here.
An alternate way to do the same task requires a custom loader.
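Since the parameter positions vary, one option (a sketch building on B above, not tested against your data) is to pull each value out of the request field individually with REGEX_EXTRACT; unlike REGEX_EXTRACT_ALL, it only needs to find the pattern somewhere in the string, so the order inside the query string does not matter:
-- Each REGEX_EXTRACT returns the first capture group, or null when the parameter is absent
C = FOREACH B GENERATE
    REGEX_EXTRACT(request, '^\\S+ (\\S+)', 1)    AS url,
    REGEX_EXTRACT(request, '[?&]aid=([^& ]+)', 1) AS aid,
    REGEX_EXTRACT(request, '[?&]bid=([^& ]+)', 1) AS bid,
    REGEX_EXTRACT(request, '[?&]cid=([^& ]+)', 1) AS cid;
DUMP C;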

Related

Json Extractor | How to avoid initial few lines

I have a JSON path extractor and its response (given below), using Match No. -1:
$..[?(@.unitName == 'Prod')].name
My JSON path gives the output as:
Result[0]= Jon
Result[1]= Flip
Result[2]= Athar
Result[3]= Bobby
Result[4]= Azra
Result[5]= Colton
Result[6]= Sony
...
Result[1000]= Maik
I want to randomly skip the first few entries for each user.
Ex: For user 1, if the first three lines are randomly ignored, the output should be as follows:
Result[0]= Bobby
Result[1]= Azra
Result[2]= Colton
Result[3]= Sony
...
Result[997]= Maik
I tried the following expressions, but they don't work:
$..[?(@.unitName == 'Prod')].name[${__Random(1,500)}]
OR
$..[?(@.unitName == 'Prod')].name[299]
Maybe you should consider switching to the JSON JMESPath Extractor; there you could do something like:
[?unitName=='Prod'].name | @[3:6]
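If you also need the random per-user offset from the original question, an untested sketch is to feed JMeter's __Random function into the slice, so each user skips a different number of leading entries:
[?unitName=='Prod'].name | @[${__Random(1,500)}:]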
More information:
JMESPath - Filters and Multiselect Lists
JMESPath - Slicing
How to Performance Test Web Services Using JMeter

AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'

I was trying to deploy my NLP model to Heroku, and I got the following error in the logs upon trying to predict the result of the inputs:
2022-09-07T15:36:35.497488+00:00 app[web.1]: if self.n_features_ != n_features:
2022-09-07T15:36:35.497488+00:00 app[web.1]: AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
2022-09-07T15:36:35.498198+00:00 app[web.1]: 10.1.22.85 - - [07/Sep/2022:15:36:35 +0000] "POST /predict HTTP/1.1" 500 290 "https://stocksentimentanalysisapp.herokuapp.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
This specific line is strange considering I never used a Decision Tree Classifier, only Random Forest:
AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
The model runs perfectly well in Jupyter Notebook. This issue began only when I tried to deploy it.
Here is my model:
import pandas as pd
import pickle

df = pd.read_csv(r'D:\Sa\Projects\Stock Sentiment Analysis\data\data.csv', encoding='ISO-8859-1')
train = df[df['Date'] < '20150101']
test = df[df['Date'] > '20141231']

# Removing non-alphabetic characters
data = train.iloc[:, 2:27]
data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)

# Renaming columns to numerical index
idx_list = [i for i in range(25)]
new_index = [str(i) for i in idx_list]
data.columns = new_index
for index in new_index:
    data[index] = data[index].str.lower()

combined_headlines = []
for row in range(0, len(data.index)):
    combined_headlines.append(' '.join(str(x) for x in data.iloc[row, 0:25]))

from sklearn.ensemble import RandomForestClassifier

# Bag of words
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
train_data = count_vectorizer.fit_transform(combined_headlines)
pickle.dump(count_vectorizer, open('countVectorizer.pkl', 'wb'))

rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label'])

# NOTE: this reuses the train-derived `data`; presumably `test` should be
# preprocessed the same way before being transformed here.
test_transform = []
for row in range(0, len(data.index)):
    test_transform.append(' '.join(str(x) for x in data.iloc[row, 2:27]))
test_data = count_vectorizer.transform(test_transform)
predictions = rfc.predict(test_data)

# Saving model to disk
pickle.dump(rfc, open('randomForestClassifier.pkl', 'wb'))
Please help me understand what is going wrong.
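For what it's worth, this traceback usually indicates a scikit-learn version mismatch rather than a modeling problem: a RandomForestClassifier is an ensemble of DecisionTreeClassifier estimators (which is why the tree class appears in the error even though you only used a random forest), and the n_features_ attribute was deprecated in scikit-learn 1.0 and removed in 1.2, so a model pickled under an older version breaks when unpickled under a newer one. A quick check (a sketch; run it both in the notebook and on Heroku):
import sklearn
# If the versions differ between the training machine and Heroku, pin the
# training version in requirements.txt, or re-train and re-pickle the model
# under the version that Heroku installs.
print(sklearn.__version__)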

Webscraping Nokogiri unable to pick any classes

I am using this page:
https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8
I am trying to get this element: class="_XWk"
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('_XWk')
Here I can see the whole page in parse_page, but when I try .css('classname') I don't see anything. Am I using the method the wrong way?
Check out the SelectorGadget Chrome extension to grab css selectors by clicking on the desired element in the browser.
It's because of a simple typo: a missing . (dot) before the selector, as ran already mentioned.
In addition, the next problem might occur because no HTTP User-Agent is specified; Google will eventually block the request and you'll receive completely different HTML that contains an error message or something similar, without the actual data you were looking for. What is my user-agent.
Pass a user-agent:
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
HTTParty.get("https://www.google.com/search", headers: headers)
Iterate over the container to extract titles from Google Search:
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
end
Code and example in the online IDE:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
  q: "ford fusion msrp",
  num: "20"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)

data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text

  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out what the correct selector is or why results are different in the output since it's already done for the end-user.
Basically, the only thing that needs to be done is just to iterate over structured JSON and get the data you were looking for.
Example code:
require 'google_search_results'

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "ford fusion msrp",
  hl: "en",
  num: "20"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]

  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
It looks like something is swapping the classes, so what you see in the browser is not what you get from the HTTP call. In this case, from _XWk to _tA.
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('._tA').map(&:text)
# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]
Change parse_page.css('_XWk') to parse_page.css('._XWk').
Note the dot (.) difference: the dot references a class.
With parse_page.css('_XWk'), Nokogiri doesn't know whether _XWk is a class, an id, a data attribute, etc.
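A minimal illustration of the difference (made-up HTML, not Google's actual markup):
require 'nokogiri'

doc = Nokogiri::HTML('<div class="_XWk">From $22,610</div>')
doc.css('_XWk').size  # => 0, looks for a <_XWk> element
doc.css('._XWk').size # => 1, looks for any element with class="_XWk"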

How to get taxonomic specific ids for kingdom, phylum, class, order, family, genus and species from taxid?

I have a list of taxids that looks like this:
1204725
2162
1300163
420247
I am looking to get a file with taxonomic ids in order from the taxids above:
kingdom_id phylum_id class_id order_id family_id genus_id species_id
I am using the package "ete3", specifically its tool ete-ncbiquery, which tells you the lineage of the IDs above. (I run it from my Linux laptop with the command below.)
ete3 ncbiquery --search 1204725 2162 13000163 420247 --info
The result looks like this:
# Taxid Sci.Name Rank Named Lineage Taxid Lineage
2162 Methanobacterium formicicum species root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobacterium,Methanobacterium formicicum 1,131567,2157,28890,183925,2158,2159,2160,2162
1204725 Methanobacterium formicicum DSM 3637 no rank root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobacterium,Methanobacterium formicicum,Methanobacterium formicicum DSM 3637 1,131567,2157,28890,183925,2158,2159,2160,2162,1204725
420247 Methanobrevibacter smithii ATCC 35061 no rank root,cellular organisms,Archaea,Euryarchaeota,Methanobacteria,Methanobacteriales,Methanobacteriaceae,Methanobrevibacter,Methanobrevibacter smithii,Methanobrevibacter smithii ATCC 35061 1,131567,2157,28890,183925,2158,2159,2172,2173,420247
I have no idea which items (IDs) correspond to what I am looking for (if any).
The following code:
import csv
from ete3 import NCBITaxa

ncbi = NCBITaxa()

def get_desired_ranks(taxid, desired_ranks):
    lineage = ncbi.get_lineage(taxid)
    lineage2ranks = ncbi.get_rank(lineage)
    ranks2lineage = dict((rank, taxid) for (taxid, rank) in lineage2ranks.items())
    return {'{}_id'.format(rank): ranks2lineage.get(rank, '<not present>') for rank in desired_ranks}

def main(taxids, desired_ranks, path):
    with open(path, 'w') as csvfile:
        fieldnames = ['{}_id'.format(rank) for rank in desired_ranks]
        writer = csv.DictWriter(csvfile, delimiter='\t', fieldnames=fieldnames)
        writer.writeheader()
        for taxid in taxids:
            writer.writerow(get_desired_ranks(taxid, desired_ranks))

if __name__ == '__main__':
    taxids = [1204725, 2162, 1300163, 420247]
    desired_ranks = ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']
    path = 'taxids.csv'
    main(taxids, desired_ranks, path)
Produces a file that looks like this:
kingdom_id phylum_id class_id order_id family_id genus_id species_id
<not present> 28890 183925 2158 2159 2160 2162
<not present> 28890 183925 2158 2159 2160 2162
<not present> 28890 183925 2158 2159 2160 2162
<not present> 28890 183925 2158 2159 2172 2173
With the Taxid Lineage numbers in your results, try using them in ete3's get_rank method. As an example:
from ete3 import NCBITaxa
ncbi = NCBITaxa()
print(ncbi.get_rank([9606, 9443]))
# {9443: 'order', 9606: 'species'}
Presumably the resulting dictionary should contain the rank information of all IDs, including any intermediate "no rank" IDs that you may want to eliminate.
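For example, to keep only the named ranks for one of your taxids, something along these lines (a sketch using the same NCBITaxa calls as your own script) should work:
from ete3 import NCBITaxa

ncbi = NCBITaxa()
lineage = ncbi.get_lineage(2162)   # full taxid lineage
ranks = ncbi.get_rank(lineage)     # {taxid: rank, ...}
# Drop the intermediate "no rank" entries and index by rank name
named = {rank: taxid for taxid, rank in ranks.items() if rank != 'no rank'}
print(named.get('phylum'), named.get('genus'), named.get('species'))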
You can also use the R package taxonomizr. The package takes a bit of time to download the necessary files, but after that it's quite fast and easy.
library("taxonomizr")
getNamesAndNodes()
taxaNodes <- read.nodes('nodes.dmp')
taxaNames <- read.names('names.dmp')
taxaID <- c("1204725", "2162", "1300163", "420247")
getNamesAndNodes downloads the names.dmp and nodes.dmp files from NCBI.
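The snippet stops before the actual lookup; presumably the next step (a sketch, assuming the pre-SQLite taxonomizr API that matches the read.nodes/read.names calls above) is:
# Look up the named ranks (kingdom, phylum, ..., species) for each taxid
taxa <- getTaxonomy(taxaID, taxaNodes, taxaNames)
print(taxa)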

What is the standard format for a browser's User-Agent string?

Is there an RFC, official standard, or template for creating a User-Agent string? The iPhone's user-agent string seems strange...
Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7D11 Safari/528.16
The User-Agent header is part of RFC 7231, an improved version of RFC 1945, where it states:
The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes,
the tracing of protocol violations, and automated recognition of user
agents for the sake of tailoring responses to avoid particular user
agent limitations. User agents SHOULD include this field with
requests. The field can contain multiple product tokens (section 3.8)
and comments identifying the agent and any subproducts which form a
significant part of the user agent. By convention, the product tokens
are listed in order of their significance for identifying the
application.
EBNF Definitions:
User-Agent = "User-Agent" ":" 1*( product | comment )
Where product is defined as:
product = token ["/" product-version]
product-version = token
token = 1*<any CHAR except CTLs or separators>
And comment as:
comment = "(" *( ctext | quoted-pair | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
And other rules, for reference:
CTL = <control characters, i.e. ASCII 0x00 through 0x1F, plus 0x7F>
separators = "(" | ")" | "<" | ">" | "@"
           | "," | ";" | ":" | "\" | <">
           | "/" | "[" | "]" | "?" | "="
           | "{" | "}" | SP | HT
SP = <ASCII space 0x20, i.e. " ">
HT = <ASCII horizontal tab 0x09, aka '\t'>
Note that this means that product strings cannot contain spaces, but comment strings can.
Examples:
Here are some valid examples of product strings (with and without product-version strings):
# Single `product` without product-version:
Foobar
Foobar-baz
# Single `product` with product-version:
Foobar/abc
Foobar/1.0.0
Foobar/2021.44.30.15-b917dc
Here are some valid examples of comment strings; note how all strings are enclosed in matched parentheses ( ):
# This was the default `comment` used by Internet Explorer 11:
(Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)
# You can put almost any text inside a comment:
(Why are you looking at HTTP headers? Go outside, find love, do some good in the world)
# Note that `comment` strings can also be nested, provided their delimiting parentheses are matched, for example:
(Outer comment (Inner comment))
As a User-Agent header's value is composed of arbitrary product and comment strings, these are all valid User-Agent headers:
User-Agent: Foobar
User-Agent: Foobar/2021.44.30.15-b917dc
User-Agent: MyProduct Foobar/2021.44.30.15-b917dc
User-Agent: Tsom/OfraHaza (Life is short and love is always over in the morning) AnotherProduct
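Applying this grammar to the iPhone string from the question, it decomposes into alternating product and comment tokens:
Mozilla/5.0                                            # product/product-version
(iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us)  # comment
AppleWebKit/528.18                                     # product/product-version
(KHTML, like Gecko)                                    # comment
Version/4.0 Mobile/7D11 Safari/528.16                  # three more product/product-version tokens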
This is specified in RFC 1945 in the section on Request Headers. It is not a very standardized format, though, and user agents tend to put whatever they want in there.
Yes, see the Mozilla website, but as mentioned before, you can basically put whatever you want there. For statistical/analytical purposes, the most important thing is that every browser/OS keeps its own format standardized.
