Web scraping with Nokogiri: unable to pick any classes - Ruby

I am using this page:
https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8
I am trying to get this element: class="_XWk"
require 'httparty'
require 'nokogiri'

page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('_XWk')
Here I can see the whole page in parse_page, but when I try .css('classname') I don't see anything. Am I using the method the wrong way?

Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in the browser.
It's because of a simple typo: a missing . (dot) before the selector, as ran already mentioned.
In addition, the next problem might occur because no HTTP User-Agent is specified, so Google will eventually block the request and you'll receive completely different HTML that contains an error message or something similar without the actual data you were looking for. See What is my user-agent.
Pass a user-agent:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

HTTParty.get("https://www.google.com/search", headers: headers)
Iterate over a container to extract titles from Google Search:
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
end
Code and example in the online IDE:
require 'httparty'
require 'nokogiri'

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  q: "ford fusion msrp",
  num: "20"
}

response = HTTParty.get("https://www.google.com/search", query: params, headers: headers)

doc = Nokogiri::HTML(response.body)

data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text

  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
Alternatively, you can achieve this by using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out what the correct selector is or why results are different in the output since it's already done for the end-user.
Basically, the only thing that needs to be done is just to iterate over structured JSON and get the data you were looking for.
Example code:
require 'google_search_results'

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "ford fusion msrp",
  hl: "en",
  num: "20"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]

  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

It looks like something is swapping the classes, so what you see in the browser is not what you get from the HTTP call. In this case, from _XWk to _tA.
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('._tA').map(&:text)
# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]
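If you want to see which class names actually come back from the HTTP call (as opposed to what the browser's inspector shows), one quick check is to list them from the parsed document; a small sketch reusing parse_page from above:
classes = parse_page.css('[class]').flat_map { |node| node['class'].split }.uniq.sort
puts classes  # scan this list for the value elements, e.g. "_tA" rather than "_XWk"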

Change parse_page.css('_XWk') to parse_page.css('._XWk')
Note the dot (.) difference. The dot references a class.
Using parse_page.css('_XWk'), Nokogiri doesn't know whether _XWk is a class, an id, a data attribute, etc.
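To make the difference concrete, here is a minimal sketch (the HTML fragment is made up for illustration) of what each selector form matches:
require 'nokogiri'

doc = Nokogiri::HTML('<div id="_XWk" class="_XWk" data-x="_XWk">From $22,610</div>')

doc.css('._XWk').size            # => 1, "." matches the class attribute
doc.css('#_XWk').size            # => 1, "#" matches the id attribute
doc.css('[data-x="_XWk"]').size  # => 1, bracket syntax matches an attribute value
doc.css('_XWk').size             # => 0, no prefix means "an element named <_XWk>", which doesn't exist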

Related

AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'

I was trying to deploy my NLP model to Heroku, and I got the following error in the logs upon trying to predict the result of the inputs:
2022-09-07T15:36:35.497488+00:00 app[web.1]: if self.n_features_ != n_features:
2022-09-07T15:36:35.497488+00:00 app[web.1]: AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
2022-09-07T15:36:35.498198+00:00 app[web.1]: 10.1.22.85 - - [07/Sep/2022:15:36:35 +0000] "POST /predict HTTP/1.1" 500 290 "https://stocksentimentanalysisapp.herokuapp.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
This specific line is strange considering I never used a Decision Tree Classifier, only a Random Forest:
AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
The model runs perfectly well in Jupyter Notebook. This issue began only when I tried to deploy it.
Here is my model:
import pandas as pd
import pickle

df = pd.read_csv(r'D:\Sa\Projects\Stock Sentiment Analysis\data\data.csv', encoding='ISO-8859-1')

train = df[df['Date'] < '20150101']
test = df[df['Date'] > '20141231']

# Removing non-alphabetic characters
data = train.iloc[:, 2:27]
data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)

# Renaming columns to numerical index
idx_list = [i for i in range(25)]
new_index = [str(i) for i in idx_list]
data.columns = new_index

for index in new_index:
    data[index] = data[index].str.lower()

combined_headlines = []
for row in range(0, len(data.index)):
    combined_headlines.append(' '.join(str(x) for x in data.iloc[row, 0:25]))

from sklearn.ensemble import RandomForestClassifier

# Bag of words
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(ngram_range=(2, 2))
train_data = count_vectorizer.fit_transform(combined_headlines)

pickle.dump(count_vectorizer, open('countVectorizer.pkl', 'wb'))

rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label'])

test_transform = []
for row in range(0, len(data.index)):
    test_transform.append(' '.join(str(x) for x in data.iloc[row, 2:27]))

test_data = count_vectorizer.transform(test_transform)
predictions = rfc.predict(test_data)

# Saving model to disk
pickle.dump(rfc, open('randomForestClassifier.pkl', 'wb'))
Please help me understand what is going wrong.

crawl realtime google finance price

I want to create a small Excel sheet, sort of like Bloomberg's Launchpad, to monitor live stock market prices. So far, out of all the available free data sources, I have only found that Google Finance provides real-time prices for the list of exchanges I need. The issue with Google Finance is that they have already closed down their finance API. I am looking for a way to programmatically retrieve the real-time price that I circled in the chart below and have it update live in my Excel sheet.
I have been searching around, to no avail as of now. I read this post: How does Google Finance update stock prices? but the method suggested in the answer retrieves a time series of data for the chart, instead of the live-updating price I need. I have been examining the web page's network traffic in Chrome's inspector and didn't find any request that returns the real-time price I need. Any help is greatly appreciated; some sample code (it can be in languages other than VBA) would be very beneficial. Thanks everyone!
There are so many ways to do this: VBA, VB, C#, R, Python, etc. Below is a way to download statistics from Yahoo Finance.
Sub DownloadData()
    Set ie = CreateObject("InternetExplorer.application")

    With ie
        .Visible = True
        .navigate "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"

        ' Wait for the page to fully load; you can't do anything if the page is not fully loaded
        Do While .Busy Or .readyState <> 4
            DoEvents
        Loop

        ' Set a reference to the data elements that will be downloaded. We can download
        ' either 'td' or 'tr' data elements. This site happens to use 'tr' data elements.
        Set Links = ie.document.getElementsByTagName("tr")
        RowCount = 1

        ' Scrape out the innerText of each 'tr' element.
        With Sheets("DataSheet")
            For Each lnk In Links
                .Range("A" & RowCount) = lnk.innerText
                RowCount = RowCount + 1
            Next
        End With
    End With

    MsgBox ("Done!!")
End Sub
I will leave it up to you to find other technologies that do the same thing. For instance, R and Python can do exactly the same, although the scripts will be a bit different from the VBA scripts that do this kind of work.
It's fairly easy to make it work in Python. You'll need a few libraries:
requests - to make a request to Google Finance and then return HTML.
bs4 - to process the returned HTML.
pandas - to easily save to CSV/Excel.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
import pandas as pd
from itertools import zip_longest


def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en",  # language
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    }

    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    ticker_data = {"right_panel_data": {},
                   "ticker_info": {}}

    ticker_data["ticker_info"]["title"] = soup.select_one(".zzDege").text
    ticker_data["ticker_info"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text

    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")

    for key, value in zip_longest(right_panel_keys, right_panel_values):
        key_value = key.text.lower().replace(" ", "_")
        ticker_data["right_panel_data"][key_value] = value.text

    return ticker_data


# tickers to iterate over
tickers = ["DIS:NYSE", "TSLA:NASDAQ", "AAPL:NASDAQ", "AMZN:NASDAQ", "NFLX:NASDAQ"]

# temporarily store the data before saving to the file
tickers_prices = []

for ticker in tickers:
    # extract ticker data
    ticker_data = scrape_google_finance(ticker=ticker)

    # append to temporary list
    tickers_prices.append({
        "ticker": ticker_data["ticker_info"]["title"],
        "price": ticker_data["ticker_info"]["current_price"]
    })

# create dataframe and save to csv/excel
df = pd.DataFrame(data=tickers_prices)

# to save to excel use to_excel()
df.to_csv("google_finance_live_stock.csv", index=False)
Outputs:
ticker,price
Walt Disney Co,$137.06
Tesla Inc,"$1,131.21"
Apple Inc,$176.99
"Amazon.com, Inc.","$3,321.61"
Netflix Inc,$384.93
Returned data from ticker_data:
{
  "right_panel_data": {
    "previous_close": "$138.61",
    "day_range": "$136.66 - $139.20",
    "year_range": "$128.38 - $191.67",
    "market_cap": "248.81B USD",
    "volume": "9.98M",
    "p/e_ratio": "81.10",
    "dividend_yield": "-",
    "primary_exchange": "NYSE",
    "ceo": "Bob Chapek",
    "founded": "Oct 16, 1923",
    "headquarters": "Burbank, CaliforniaUnited States",
    "website": "thewaltdisneycompany.com",
    "employees": "166,250"
  },
  "ticker_info": {
    "title": "Walt Disney Co",
    "current_price": "$136.66"
  }
}
If you want to scrape more data with a line-by-line explanation, there's a Scrape Google Finance Ticker Quote Data in Python blog post of mine that also covers scraping time-series chart data.

pig script to parse aws elb log

I am trying to parse these ELB logs with Pig, and I am able to parse them successfully using this script:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2016-07-16T00:00:41.700161Z testelb 11.11.17.2:50883 192.168.1.94:80 0.00002 0.001392 0.000019 200 200 0 43 "GET http://test.example.com:80/bac?aid=b5cf542d74&cid=etrsewtp&bid=23c45c543&dte=Sat%20Jul%2016%202016%2008:00:41%20GMT+0800%20(HKT) HTTP/1.1" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69" - -
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
***************************************************************
A = LOAD '/tmp/one.log' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN (
REGEX_EXTRACT_ALL(
line,'^(\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) (\\S+) "(.+?)" "(.+?)" (\\S+) (\\S+)')
) AS (
timestamp:chararray, elb:int, client_port:chararray, backend_port:chararray, request_processing_time:float, backend_processing_time:float, response_processing_time:float, elb_status_code:int, backend_status_code:int, received_bytes:int, sent_bytes:int, request:chararray, user_agent:chararray, ssl_cipher:chararray, ssl_protocol:chararray
);
DUMP B;
Now I want to extract the request URL, aid, bid, cid, etc., but I am not able to match the regex. Can someone help me get these details?
Apart from the above regex method, if there is any other method to get the complete ELB log details, I would like to know about it.
NOTE: The positions of aid, bid and cid are not fixed in the request log.
Your question has already been answered here.
An alternate way to do the same task requires a custom loader.
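Not Pig, but to illustrate the idea: once the request field has been pulled out by the script above, the query parameters can be read by name, so their position in the URL no longer matters. A minimal Ruby sketch using the sample log line (aid, bid and cid are the parameter names from that sample):
require 'cgi'
require 'uri'

# the "request" field already extracted by the script, e.g.:
request = 'GET http://test.example.com:80/bac?aid=b5cf542d74&cid=etrsewtp&bid=23c45c543 HTTP/1.1'

url    = request.split(' ')[1]                  # "method URL protocol" -> take the URL
params = CGI.parse(URI.parse(url).query || '')  # name => [values], order-independent

aid = params['aid'].first   # => "b5cf542d74"
bid = params['bid'].first   # => "23c45c543"
cid = params['cid'].first   # => "etrsewtp"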

Search in patterns list

I have a list of objects like this:
{
  "pattern": "Mozilla/5.0 (*Mac OS X 10?4*) AppleWebKit/* (KHTML, like Gecko) Chrome/46.*Safari/*",
  "name": "Macintosh",
  "brand": "Apple"
}
{
  "pattern": "Mozilla/5.0 (*Windows NT 5.1*rv:46.0*) Gecko/*/",
  "name": "Windows",
  "brand": "Microsoft"
}
or like this (the same, but in regular expression notation):
{
  "pattern": "mozilla/5\.0 \(.*linux.*android.4\.4.*gxt_dongle_3188 build/.*\) applewebkit/.* \(khtml, like gecko\) version/.* chrome/.*safari/.* bdbrowserhd_i18n/1\.(\d).*",
  "name": "Macintosh",
  "brand": "Apple"
}
This is a dictionary of browser user agents with 7000 items. I have a user agent string, for example:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
And I need to find the name and brand as fast as possible. Currently I split the dictionary into chunks (100 patterns each) and glue each chunk into one big regexp. Then I try to match the user agent against this big regexp; if it matches, I walk all the items of that chunk.
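A minimal sketch of that chunk-and-match approach in Ruby, assuming the patterns are already in the regular-expression notation shown above (the two-entry dictionary here is just a stand-in for the real 7000-item one):
# hypothetical, trimmed-down dictionary; the real one would have ~7000 entries
entries = [
  { "pattern" => 'mozilla/5\.0 \(.*mac os x.*\) applewebkit/.* \(khtml, like gecko\) chrome/46\..*safari/.*',
    "name" => "Macintosh", "brand" => "Apple" },
  { "pattern" => 'mozilla/5\.0 \(.*windows nt 5\.1.*rv:46\.0.*\) gecko/.*',
    "name" => "Windows", "brand" => "Microsoft" }
]

chunk_size = 100

# pre-compile one big alternation per chunk; a single match tells us which chunk to walk
chunks = entries.each_slice(chunk_size).map do |slice|
  compiled = slice.map { |e| Regexp.new(e["pattern"], Regexp::IGNORECASE) }
  [Regexp.union(compiled), compiled.zip(slice)]
end

def lookup(chunks, user_agent)
  chunks.each do |big_regexp, pairs|
    next unless big_regexp.match?(user_agent)   # cheap rejection of a whole chunk
    _re, entry = pairs.find { |re, _| re.match?(user_agent) }
    return [entry["name"], entry["brand"]] if entry
  end
  nil
end

p lookup(chunks, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36")
# => ["Macintosh", "Apple"]
Whether this beats a plain linear scan depends on how selective each chunk's alternation is; keying chunks by a literal, wildcard-free prefix of each pattern is another common way to cut down the number of regexps tried.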
Would you recommend some DB engine which can help me with this? Or simply algorithm which can help me do it faster?

What is the standard format for a browser's User-Agent string?

Is there an RFC, official standard, or template for creating a User-Agent string? The iPhone's user-agent string seems strange...
Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7D11 Safari/528.16
The User-Agent header is part of RFC 7231, which is an improved version of RFC 1945, where it states:
The User-Agent request-header field contains information about the
user agent originating the request. This is for statistical purposes,
the tracing of protocol violations, and automated recognition of user
agents for the sake of tailoring responses to avoid particular user
agent limitations. User agents SHOULD include this field with
requests. The field can contain multiple product tokens (section 3.8)
and comments identifying the agent and any subproducts which form a
significant part of the user agent. By convention, the product tokens
are listed in order of their significance for identifying the
application.
EBNF Definitions:
User-Agent = "User-Agent" ":" 1*( product | comment )
Where product is defined as:
product = token ["/" product-version]
product-version = token
token = 1*<any CHAR except CTLs or separators>
And comment as:
comment = "(" *( ctext | quoted-pair | comment ) ")"
ctext = <any TEXT excluding "(" and ")">
And other rules, for reference:
CTL = <control characters, i.e. ASCII 0x00 through 0x1F, and 0x7F>
separators = "(" | ")" | "<" | ">" | "@"
           | "," | ";" | ":" | "\" | <">
           | "/" | "[" | "]" | "?" | "="
           | "{" | "}" | SP | HT
SP = <ASCII space 0x20, i.e. " ">
HT = <ASCII horizontal tab 0x09, aka '\t'>
Note that this means that product strings cannot contain spaces, but comment strings can.
Examples:
Here are some valid examples of product strings (with and without product-version strings):
# Single `product` without product-version:
Foobar
Foobar-baz
# Single `product` with product-version:
Foobar/abc
Foobar/1.0.0
Foobar/2021.44.30.15-b917dc
Here are some valid examples of comment strings; note how all strings are enclosed in matched parentheses ( ):
# This was the default `comment` used by Internet Explorer 11:
(Windows NT 6.1; WOW64; Trident/7.0; rv:11.0)
# You can put almost any text inside a comment:
(Why are you looking at HTTP headers? Go outside, find love, do some good in the world)
# Note that `comment` strings can also be nested, provided their delimiting parentheses are matched, for example:
(Outer comment (Inner comment))
As a User-Agent header's value is composed of arbitrary product and comment strings, these are all valid User-Agent headers:
User-Agent: Foobar
User-Agent: Foobar/2021.44.30.15-b917dc
User-Agent: MyProduct Foobar/2021.44.30.15-b917dc
User-Agent: Tsom/OfraHaza (Life is short and love is always over in the morning) AnotherProduct
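To make the grammar concrete, here is a rough Ruby sketch (not a strict RFC parser; nested comments are not handled) that splits a User-Agent value into its product and comment parts:
ua = "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us) AppleWebKit/528.18 (KHTML, like Gecko) Version/4.0 Mobile/7D11 Safari/528.16"

# parenthesized runs are comments; everything else is a whitespace-separated
# product token, optionally carrying a "/version" suffix
comments = ua.scan(/\(([^)]*)\)/).flatten
products = ua.gsub(/\([^)]*\)/, "").split.map do |token|
  name, version = token.split("/", 2)
  { name: name, version: version }
end

p comments  # => ["iPhone; U; CPU iPhone OS 3_1_2 like Mac OS X; en-us", "KHTML, like Gecko"]
p products  # => product/version pairs such as Mozilla/5.0, AppleWebKit/528.18, Safari/528.16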
This is specified in RFC 1945 in the section on Request Headers. It is not a very standardized format, though, and user agents tend to put whatever they want in there.
Yes, see the Mozilla website, but as was mentioned before, you can basically put whatever you want there. For statistical/analytical purposes, the most important thing is that every browser/OS keeps its own format consistent.
