Search in patterns list - user-agent

I have a list of objects like this:
{
  "pattern": "Mozilla/5.0 (*Mac OS X 10?4*) AppleWebKit/* (KHTML, like Gecko) Chrome/46.*Safari/*",
  "name": "Macintosh",
  "brand": "Apple"
}
{
  "pattern": "Mozilla/5.0 (*Windows NT 5.1*rv:46.0*) Gecko/*/",
  "name": "Windows",
  "brand": "Microsoft"
}
or like this (the same, but in regular expression notation):
{
  "pattern": "mozilla/5\.0 \(.*linux.*android.4\.4.*gxt_dongle_3188 build/.*\) applewebkit/.* \(khtml, like gecko\) version/.* chrome/.*safari/.* bdbrowserhd_i18n/1\.(\d).*",
  "name": "Macintosh",
  "brand": "Apple"
}
This is a dictionary of browser user-agent patterns with 7,000 items. I have a user agent string, for example:
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36"
And I need to find the name and brand as fast as possible. Currently I split the dictionary into chunks of 100 patterns and glue each chunk into one big regexp. Then I try to match the user agent against each big regexp; if it matches, I walk through all items of that chunk.
Would you recommend a DB engine that could help me with this, or simply an algorithm that would help me do it faster?
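For reference, here is a minimal Python sketch of the chunked approach I described above (it assumes the wildcard patterns use shell-style * and ?, so fnmatch.translate can turn them into regexes; the function names are just for illustration):

import re
from fnmatch import translate  # converts shell-style "*" / "?" wildcards into a regex string

def build_chunks(patterns, chunk_size=100):
    # `patterns` is the 7000-item list of {"pattern", "name", "brand"} dicts
    chunks = []
    for start in range(0, len(patterns), chunk_size):
        chunk = patterns[start:start + chunk_size]
        # one big alternation, used only as a cheap pre-filter for the whole chunk
        combined = re.compile("|".join("(?:%s)" % translate(p["pattern"]) for p in chunk), re.IGNORECASE)
        # the individual regexes, walked only when the combined regex matches
        singles = [(re.compile(translate(p["pattern"]), re.IGNORECASE), p) for p in chunk]
        chunks.append((combined, singles))
    return chunks

def lookup(chunks, user_agent):
    for combined, singles in chunks:
        if combined.match(user_agent):
            for regex, item in singles:
                if regex.match(user_agent):
                    return item["name"], item["brand"]
    return None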

Related

AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'

I was trying to deploy my NLP model to Heroku, and I got the following error in the logs upon trying to predict the result of the inputs-
2022-09-07T15:36:35.497488+00:00 app[web.1]: if self.n_features_ != n_features:
2022-09-07T15:36:35.497488+00:00 app[web.1]: AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
2022-09-07T15:36:35.498198+00:00 app[web.1]: 10.1.22.85 - - [07/Sep/2022:15:36:35 +0000] "POST /predict HTTP/1.1" 500 290 "https://stocksentimentanalysisapp.herokuapp.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
This specific line is strange considering I never used a Decision Tree Classifier, only Random Forest-
AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
The model runs perfectly well in Jupyter Notebook. This issue began only when I tried to deploy it.
Here is my model-
import pandas as pd
import pickle
df = pd.read_csv(r'D:\Sa\Projects\Stock Sentiment Analysis\data\data.csv', encoding='ISO-8859-1')
train = df[df['Date']<'20150101']
test = df[df['Date']>'20141231']
#Removing non-alphabetic characters
data = train.iloc[:,2:27]
data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)
#Renaming columns to numerical index
idx_list = [i for i in range(25)]
new_index = [str(i) for i in idx_list]
data.columns = new_index
for index in new_index:
    data[index] = data[index].str.lower()
combined_headlines = []
for row in range(0, len(data.index)):
    combined_headlines.append(' '.join(str(x) for x in data.iloc[row, 0:25]))
from sklearn.ensemble import RandomForestClassifier
#Bag of words
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(ngram_range=(2,2))
train_data = count_vectorizer.fit_transform(combined_headlines)
pickle.dump(count_vectorizer, open('countVectorizer.pkl', 'wb'))
rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label'])
test_transform = []
for row in range(0, len(data.index)):
    test_transform.append(' '.join(str(x) for x in data.iloc[row, 2:27]))
test_data = count_vectorizer.transform(test_transform)
predictions = rfc.predict(test_data)
# Saving model to disk
pickle.dump(rfc, open('randomForestClassifier.pkl', 'wb'))
Please help me understand what is going wrong.

r studio debug location is approximate because source is not available

I get this error
r studio debug location is approximate because source is not available
although I'm on the newest version of RStudio and there are no backslashes in my function.
My function:
calculate_depreciation_percentage_v1 = function (end_of_life_value_perc_temp, immediate_loss_of_value_percent, theoretische_afschrijfperiode, restwaarde_bouw_perc, bouwwaarde_perc, perceelwaarde_perc, current_nr_of_years, groeivoet_waarde_perceel, cash_flow_year_1, maintanance_cost_weights) {
  # browser()
  # in-function parameters: moved all of them to main script
  # declare variables
  end_of_life_value_perc_local = repmat(-1234, size(immediate_loss_of_value_percent,1), size(immediate_loss_of_value_percent,2))
  # CALCULATIONS
  for (RV_loop_local in 1:size(immediate_loss_of_value_percent,1)) { # for the first calculation, calculate...
    jaarlijks_afschrijfpercentage_bouwwaarde = (1-immediate_loss_of_value_percent[RV_loop_local,1] - restwaarde_bouw_perc) / theoretische_afschrijfperiode # ok
    depecriation_vec_bouwwaarde = seq(from = 0, to=(jaarlijks_afschrijfpercentage_bouwwaarde*size(immediate_loss_of_value_percent,2))-jaarlijks_afschrijfpercentage_bouwwaarde , by=jaarlijks_afschrijfpercentage_bouwwaarde ) + immediate_loss_of_value_percent[RV_loop_local,1]
    # depecriation_vec_bouwwaarde[1] = immediate_loss_of_value_percent[RV_loop,1]  maybe apply this to the whole vector, that is possible
    remaining_value_vec_bouwwaarde = (-cash_flow_year_1)*(1-depecriation_vec_bouwwaarde) # it is assumed you cannot go under 0, but the program does not detect if you do...
    # this is not right
    #
    # (-cash_flow_year_1*bouwwaarde_perc)*
    # the remaining value
    weight_vector_perceelwaarde_FV = repmat(-1234, 1, size(immediate_loss_of_value_percent,2))
    for (current_nr_of_years_local in 1:size(immediate_loss_of_value_percent,2)) {
      # (-perceelwaarde_perc *cash_flow_year_1)
      weight_vector_perceelwaarde_FV[current_nr_of_years_local] = (-perceelwaarde_perc *cash_flow_year_1)*((1+groeivoet_waarde_perceel)^current_nr_of_years_local)
      # weight_vector_perceelwaarde_FV[current_nr_of_years_local] = (-perceelwaarde_perc *cash_flow_year_1)*((1+groeivoet_waarde_perceel)^current_nr_of_years_local)
      # weight_vector_perceelwaarde_FV
      # ,current_nr_of_years,groeivoet_waarde_perceel #
      # -bouwwaarde_perc*cash_flow_year_1
      # end_of_life_value_perc_local = [RV_loop, current_nr_of_years] =
    }
    # depecriation_vec_bouwwaarde*bouwwaarde_perc
    # weight_vector_perceelwaarde_FV*perceelwaarde_perc
    weighted_average_residual_value_vec = weight_vector_perceelwaarde_FV + (bouwwaarde_perc*remaining_value_vec_bouwwaarde)
    weight_vector_perceelwaarde_FV
    (bouwwaarde_perc*remaining_value_vec_bouwwaarde)
    browser()
  }
  # weighted_average_depreciation_matrix[RV_loop_local,:] = weighted_average_depreciation_vector
  # weighted_average_depreciation_matrix <- cbind(x, y, z)
  # simulation_bouwwaarde = end_of_life_value_perc_temp[1]-jaarlijks_afschrijfpercentage_bouwwaarde*20
  # finVal = 1 * (0.0125 + 1)^20
  # return(...)
}
Version
RStudio 2021.09.2+382 "Ghost Orchid" Release (fc9e217980ee9320126e33cdf334d4f4e105dc4f, 2022-01-04) for Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36

Unable to iterate over multiple pages while web scraping

I am trying to scrape
https://www.maybank.co.id/others/locate-us?Keyword=&LocType=branch&LocSubType=all
to obtain the branch name and address for all bank branches. There are 44 pages I need to scrape, but the URL doesn't change between them, so I can't iterate over the pages.
for page_no in range(1,45):
    payload='page='+str(page_no)+'&PageSize=9&id=%7B5066AC98-FE40-407A-B4FE-03C814BED5F5%7D&keyword=&LocType=branch&LocSubType=all'
    response = requests.post(url, data=payload)
    page = requests.post(url,data=payload)
    print('Page',page_no)
    for i in soup.find_all('div',class_="col-md-4 col-sm-6 col-xs-12 property-item"):
        Branch=i.find_all('h3') if i.find_all('h3') else ''
        Address=i.find_all('p') if i.find_all('p') else ''
        for j in Address:
            j = re.sub(r'<(.*?)>', '', str(j))
            j = j.strip()
            Address_list.append(j)
        for k in Branch:
            k=re.sub(r'<(.*?)>', '', str(k))
            Branch_list.append(k)
Can someone suggest what should be done here?
You should use the API to get what you need.
Try this:
from urllib.parse import urlencode
import requests
from bs4 import BeautifulSoup

api_url = "https://www.maybank.co.id/api/sitecore/MapsLocation/MapsLocationListPaging?"
payload = {
    "page": "1",
    "PageSize": "9",
    "id": "{5066AC98-FE40-407A-B4FE-03C814BED5F5}",
    "keyword": "",
    "LocType": "branch",
    "LocSubType": "all",
}
headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.183 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}
for page_no in range(1, 45):
    payload["page"] = page_no  # advance the page number, not the page size
    page = requests.get(f"{api_url}{urlencode(payload)}", headers=headers).text
    soup = BeautifulSoup(page, "html.parser").find("div", {"class": "col-md-4 col-sm-6 col-xs-12 property-item"})
    branch_data = [
        soup.find("h3").getText(strip=True),
        [p.getText(strip=True) for p in soup.find_all("p")],
        soup.find("a")["href"],
    ]
    print(branch_data)
Output:
['KC MANADO', ['Jl. Kawasan Mega Mas Jl. Pierre Tendean Boulevard Blok I C1 No. 24,25,26 dan Blok I C2 No. 27,28,29 Manado', 'Closed until 03.30 PM0431 - 860543'], '/others/locate-us/locate-us-detail?id=337&loctype=Branch&locsubtype=']
['KC SUNSET ROAD, DPS', ['Jl. Sunset Road No 811, Kuta - Badung, Bali', 'Closed until 03.30 PM0361 - 3003811'], '/others/locate-us/locate-us-detail?id=294&loctype=Branch&locsubtype=']
['KCP BSB CITY', ['Ruko Taman Niaga Bukit Semarang Baru (BSB) Blok E No. 3A, Semarang', 'Closed until 03.30 PM(024) 76670611'], '/others/locate-us/locate-us-detail?id=217&loctype=Branch&locsubtype=']
['KCP GRAHA IRAMA', ['Jl. HR Rasuna Said Kav. 1-2 Ground Floor Blok B Jakarta Selatan', 'Closed until 03.30 PM021-5261330-4'], '/others/locate-us/locate-us-detail?id=111&loctype=Branch&locsubtype=']
['KCP KLP. GADING BULEVARD II', ['Jl. Raya Boulevard I-3 no. 4, Jakarta', 'Closed until 03.30 PM021 - 4515253'], '/others/locate-us/locate-us-detail?id=199&loctype=Branch&locsubtype=']
['KCP PALM SPRING BATAM CENTER', ['Komplek Palm Spring BTC Blok D1 No. 10, Batam Centre', 'Closed until 03.30 PM0778 - 6053070'], '/others/locate-us/locate-us-detail?id=26&loctype=Branch&locsubtype=']
and so on...

Webscraping Nokogiri unable to pick any classes

I am using this page:
https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8
I am trying to get this element: class="_XWk"
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('_XWk')
Here I can see the whole page in parse_page, but when I try .css('classname') I don't see anything. Am I using the method the wrong way?
Check out the SelectorGadget Chrome extension to grab css selectors by clicking on the desired element in the browser.
It's because of a simple typo: a missing . (dot) before the selector, as ran already mentioned.
In addition, the next problem might occur because no HTTP user-agent is specified, so Google will eventually block the request and you'll receive completely different HTML that contains an error message or something similar, without the actual data you were looking for. What is my user-agent.
Pass a user-agent:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
HTTParty.get("https://www.google.com/search", headers: headers)
Iterate over container to extract titles from Google Search:
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
end
Code and example in the online IDE:
require "httparty"
require "nokogiri"

headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
  q: "ford fusion msrp",
  num: "20"
}

response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)

doc = Nokogiri::HTML(response.body)

data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out what the correct selector is or why results are different in the output since it's already done for the end-user.
Basically, the only thing that needs to be done is just to iterate over structured JSON and get the data you were looking for.
Example code:
require 'google_search_results'

params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "ford fusion msrp",
  hl: "en",
  num: "20"
}

search = GoogleSearch.new(params)
hash_results = search.get_hash

data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
It looks like something is swapping the classes so what you see in the browser is not what you are getting from the http call. In this case from _XWk to _tA
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('._tA').map(&:text)
# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]
Change parse_page.css('_XWk') to parse_page.css('._XWk')
Note the dot (.) difference. The dot references a class.
Using parse_page.css('_XWk'), Nokogiri doesn't know whether _XWk is a class, id, data attribute, etc.

Logstash custom date format and irregular spaces

Receiving a parsing failure with my grok match. I can't seem to find anything that will match my log.
Here is my log:
2016-06-14 14:03:42 1.1.1.1 GET /origin-www.site.com/ScriptResource.axd?d= jEHA4v5Z26oA-nbsKDVsBINPydW0esbNCScJdD-RX5iFGr6qqeyJ69OnKDoJgTsDcnI1&t=5f9d5645 200 26222 0 "http://site/ layouts/CategoryPage.aspx?dsNav=N:10014" "Mozilla/5.0 (Linux; Android 4.4.4; SM-G318HZ Build/KTU84P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.95 Mobile Safari/537.36" "cookie"
Here is my grok match. It works fine in the grok debugger.
filter {
  grok {
    match => { 'message' => '%{DATE:date} %{TIME:time} %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:status} %{NUMBER:bytes} %{NUMBER:time_taken} %{QUOTEDSTRING:referrer} %{QUOTEDSTRING:user_agent} %{QUOTEDSTRING:cookie}' }
  }
}
EDIT: I decided to take a screenshot of what my log file looks like, since the spaces don't come over when copying and pasting. They appear to be single spaces when I copy/paste.
Besides the space in that log line you posted, which I assume won't exist in your actual logs, your pattern is incorrect on the date parsing. Logstash's DATE follows this pattern:
DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
DATE %{DATE_US}|%{DATE_EU}
This doesn't match your YYYY-MM-dd format. I recommend using a pattern file and defining a custom date format:
CUST_DATE %{YEAR}-%{MONTHNUM2}-%{MONTHDAY}
then your pattern can be
%{CUST_DATE:date} %{TIME:time} %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:status} %{NUMBER:bytes} %{NUMBER:time_taken} %{QUOTEDSTRING:referrer} %{QUOTEDSTRING:user_agent} %{QUOTEDSTRING:cookie}
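A minimal sketch of how this could be wired together, assuming the pattern file lives in a ./patterns directory (adjust patterns_dir to wherever you keep it):
# contents of ./patterns/extra (one NAME PATTERN definition per line):
#   CUST_DATE %{YEAR}-%{MONTHNUM2}-%{MONTHDAY}

filter {
  grok {
    patterns_dir => ["./patterns"]
    match => { 'message' => '%{CUST_DATE:date} %{TIME:time} %{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:status} %{NUMBER:bytes} %{NUMBER:time_taken} %{QUOTEDSTRING:referrer} %{QUOTEDSTRING:user_agent} %{QUOTEDSTRING:cookie}' }
  }
}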
EDIT:
You may be able to handle the weird whitespace with a gsub. This won't remove whitespace, but it will normalize any run of whitespace to a single space " ":
mutate {
  gsub => [
    # replace all whitespace characters or multiple adjacent whitespace characters with one space
    "message", "\s+", " "
  ]
}
