AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_' - heroku

I was trying to deploy my NLP model to Heroku, and I got the following error in the logs when trying to predict the result from the inputs:
2022-09-07T15:36:35.497488+00:00 app[web.1]: if self.n_features_ != n_features:
2022-09-07T15:36:35.497488+00:00 app[web.1]: AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
2022-09-07T15:36:35.498198+00:00 app[web.1]: 10.1.22.85 - - [07/Sep/2022:15:36:35 +0000] "POST /predict HTTP/1.1" 500 290 "https://stocksentimentanalysisapp.herokuapp.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
This specific line is strange considering I never used a Decision Tree Classifier, only Random Forest-
AttributeError: 'DecisionTreeClassifier' object has no attribute 'n_features_'
The model runs perfectly well in Jupyter Notebook. This issue began only when I tried to deploy it.
Here is my model-
import pandas as pd
import pickle
df = pd.read_csv(r'D:\Sa\Projects\Stock Sentiment Analysis\data\data.csv', encoding='ISO-8859-1')  # raw string so the backslashes in the Windows path aren't treated as escapes
train = df[df['Date']<'20150101']
test = df[df['Date']>'20141231']
#Removing non-alphabetic characters
data = train.iloc[:,2:27]
data.replace('[^a-zA-Z]', ' ', regex=True, inplace=True)
#Renaming columns to numerical index
idx_list = [i for i in range(25)]
new_index = [str(i) for i in idx_list]
data.columns = new_index
for index in new_index:
    data[index] = data[index].str.lower()
combined_headlines = []
for row in range(0, len(data.index)):
    combined_headlines.append(' '.join(str(x) for x in data.iloc[row, 0:25]))
from sklearn.ensemble import RandomForestClassifier
#Bag of words
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(ngram_range=(2,2))
train_data = count_vectorizer.fit_transform(combined_headlines)
pickle.dump(count_vectorizer, open('countVectorizer.pkl', 'wb'))
rfc = RandomForestClassifier(n_estimators=200, criterion='entropy')
rfc.fit(train_data, train['Label'])
test_transform = []
for row in range(0, len(data.index)):
    test_transform.append(' '.join(str(x) for x in data.iloc[row, 2:27]))
test_data = count_vectorizer.transform(test_transform)
predictions = rfc.predict(test_data)
# Saving model to disk
pickle.dump(rfc, open('randomForestClassifier.pkl', 'wb'))
Please help me understand what is going wrong.
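For what it's worth, a RandomForestClassifier is an ensemble of DecisionTreeClassifier objects, which is why the traceback names a class the code never imports directly. The n_features_ attribute was deprecated in scikit-learn 1.0 and removed in 1.2, so a mismatch between the training and Heroku scikit-learn versions would be a plausible cause of an error like this. A minimal sketch (the version file name is hypothetical) for recording the training version next to the pickles and checking it before unpickling:
import pickle
import sklearn
# at training time: record the exact scikit-learn version next to the model
with open('sklearn_version.txt', 'w') as f:
    f.write(sklearn.__version__)
# at serving time (e.g. on Heroku): compare before unpickling
with open('sklearn_version.txt') as f:
    train_version = f.read().strip()
if train_version != sklearn.__version__:
    print(f"scikit-learn mismatch: trained on {train_version}, running {sklearn.__version__}; pin the version in requirements.txt")
rfc = pickle.load(open('randomForestClassifier.pkl', 'rb'))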

Related

ValueError with Sklearn LinearRegression model.predict()

I am trying to build a simple linear regression model to estimate the sales price of an item for borrowers we don't have contract information on. I'm using data from borrowers we do have price and payment info on, and sklearn's LinearRegression model, but I get an error when I call the predict() method on the model. The exact error:
ValueError: X has 844 features, but LinearRegression is expecting 2529 features as input.
Here is my code; I feel like it's fairly straightforward. build_customer_df is a function call that returns the dataframe with some column formatting, nothing fancy:
from pathlib import Path
from sklearn import linear_model
from sklearn.model_selection import train_test_split
fp = Path('master_borrower.xlsx')
df = build_customer_df(fp)
df = df[['payment', 'trailer_sales_price']]
df = df[df['payment'] != 0]
X = df['payment'].values
y = df['trailer_sales_price'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
X_test = X_test.reshape(1, -1)
X_train = X_train.reshape(1, -1)
y_train = y_train.reshape(1, -1)
y_test = y_test.reshape(1, -1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
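A likely culprit here (my reading of the error, not confirmed above) is the reshape(1, -1) calls: they turn each array into a single row, so fit() sees one sample with 2529 features and predict() then receives one sample with 844 features, which matches the message exactly. For a single-feature regression the inputs should be column vectors; a minimal sketch with synthetic stand-in data:
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
# synthetic stand-in for the payment/price columns
X = np.random.rand(100)
y = 3 * X + np.random.rand(100)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# one row per sample, one column per feature
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)  # y can stay 1-D
predictions = model.predict(X_test)  # no ValueError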

r studio debug location is approximate because source is not available

I get this error
r studio debug location is approximate because source is not available
even though I'm on the newest version of RStudio and there are no backslashes in my function
My function:
# Note: repmat() and size() are not base R; I assume they come from the pracma package.
library(pracma)
calculate_depreciation_percentage_v1 = function(end_of_life_value_perc_temp, immediate_loss_of_value_percent, theoretische_afschrijfperiode, restwaarde_bouw_perc, bouwwaarde_perc, perceelwaarde_perc, current_nr_of_years, groeivoet_waarde_perceel, cash_flow_year_1, maintanance_cost_weights) {
  # browser()
  # in-function parameters: moved all of them to the main script
  # declare variables
  end_of_life_value_perc_local = repmat(-1234, size(immediate_loss_of_value_percent, 1), size(immediate_loss_of_value_percent, 2))
  # CALCULATIONS
  for (RV_loop_local in 1:size(immediate_loss_of_value_percent, 1)) { # for the first calculation, calculate...
    jaarlijks_afschrijfpercentage_bouwwaarde = (1 - immediate_loss_of_value_percent[RV_loop_local, 1] - restwaarde_bouw_perc) / theoretische_afschrijfperiode # ok
    depecriation_vec_bouwwaarde = seq(from = 0, to = (jaarlijks_afschrijfpercentage_bouwwaarde * size(immediate_loss_of_value_percent, 2)) - jaarlijks_afschrijfpercentage_bouwwaarde, by = jaarlijks_afschrijfpercentage_bouwwaarde) + immediate_loss_of_value_percent[RV_loop_local, 1]
    # depecriation_vec_bouwwaarde[1] = immediate_loss_of_value_percent[RV_loop, 1] -- maybe apply it to the whole vector instead, that could work
    remaining_value_vec_bouwwaarde = (-cash_flow_year_1) * (1 - depecriation_vec_bouwwaarde) # it is assumed you cannot go under 0, but the program does not detect if you do...
    # this is not right
    # (-cash_flow_year_1 * bouwwaarde_perc) * the remaining value
    weight_vector_perceelwaarde_FV = repmat(-1234, 1, size(immediate_loss_of_value_percent, 2))
    for (current_nr_of_years_local in 1:size(immediate_loss_of_value_percent, 2)) {
      # (-perceelwaarde_perc * cash_flow_year_1)
      weight_vector_perceelwaarde_FV[current_nr_of_years_local] = (-perceelwaarde_perc * cash_flow_year_1) * ((1 + groeivoet_waarde_perceel)^current_nr_of_years_local)
    }
    weighted_average_residual_value_vec = weight_vector_perceelwaarde_FV + (bouwwaarde_perc * remaining_value_vec_bouwwaarde)
    # leftover debug expressions; their values are discarded inside a function
    weight_vector_perceelwaarde_FV
    (bouwwaarde_perc * remaining_value_vec_bouwwaarde)
    browser()
  }
  # weighted_average_depreciation_matrix[RV_loop_local, ] = weighted_average_depreciation_vector
  # weighted_average_depreciation_matrix <- cbind(x, y, z)
  # simulation_bouwwaarde = end_of_life_value_perc_temp[1] - jaarlijks_afschrijfpercentage_bouwwaarde * 20
  # finVal = 1 * (0.0125 + 1)^20
  # return(...)
}
Version
RStudio 2021.09.2+382 "Ghost Orchid" Release (fc9e217980ee9320126e33cdf334d4f4e105dc4f, 2022-01-04) for Windows
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) QtWebEngine/5.12.8 Chrome/69.0.3497.128 Safari/537.36

Data reshaping in sklearn (Linear regression)

input code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
data = pd.read_csv('test.csv')
data.head()
data['Density'] = data['Flow [Veh/h]'] / data['Speed [km/h]']
data = data.replace(np.nan, 1)
X = data['Density']
y = data['Speed [km/h]']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)  # HERE I GOT AN ERROR
The error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
You can try changing your variable X as follows:
X = data['Density'].values.reshape((-1, 1))
I faced the same error when my feature set had only one variable; the above change solved the issue for me.
Try using double brackets [[ ]] when selecting the column, so X stays a DataFrame:
X = data[['Density']]
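Both suggestions enforce the same requirement: scikit-learn expects X to be two-dimensional with shape (n_samples, n_features), even when there is only one feature. A minimal sketch with synthetic stand-in data showing that the two fixes are equivalent:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.DataFrame({'Density': np.random.rand(50),
                     'Speed [km/h]': np.random.rand(50) * 100})
y = data['Speed [km/h]']
# option 1: reshape the underlying 1-D array into one column
X1 = data['Density'].values.reshape(-1, 1)  # shape (50, 1)
# option 2: double brackets keep a one-column DataFrame instead of a Series
X2 = data[['Density']]  # shape (50, 1)
LinearRegression().fit(X1, y)  # works
LinearRegression().fit(X2, y)  # works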

Webscraping Nokogiri unable to pick any classes

I am using this page:
https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.2942j0j7&sourceid=chrome&ie=UTF-8
I am trying to get this element: class="_XWk"
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('_XWk')
Here I can see the whole page in parse_page, but when I try .css('classname') I don't see anything. Am I using the method the wrong way?
Check out the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in the browser.
It's because of a simple typo: a missing . (dot) before the selector, as ran already mentioned.
In addition, the next problem might occur because no HTTP user-agent is specified, so Google will eventually block the request and you'll receive completely different HTML containing an error message or something similar, without the actual data you were looking for. What is my user-agent.
Pass a user-agent:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
HTTParty.get("https://www.google.com/search", headers: headers)
Iterate over the container to extract titles from Google Search:
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
end
Code and example in the online IDE:
require 'httparty'
require 'nokogiri'
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
  q: "ford fusion msrp",
  num: "20"
}
response = HTTParty.get("https://www.google.com/search",
                        query: params,
                        headers: headers)
doc = Nokogiri::HTML(response.body)
data = doc.css(".tF2Cxc").map do |result|
  title = result.at_css(".DKV0Md")&.text
  link = result.at_css(".yuRUbf a")&.attr("href")
  displayed_link = result.at_css(".tjvcx")&.text
  snippet = result.at_css(".VwiC3b")&.text
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
Alternatively, you can achieve this by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out what the correct selector is or why the results differ in the output, since that's already done for the end user.
Basically, the only thing that needs to be done is just to iterate over structured JSON and get the data you were looking for.
Example code:
require 'google_search_results'
params = {
  api_key: ENV["API_KEY"],
  engine: "google",
  q: "ford fusion msrp",
  hl: "en",
  num: "20"
}
search = GoogleSearch.new(params)
hash_results = search.get_hash
data = hash_results[:organic_results].map do |result|
  title = result[:title]
  link = result[:link]
  displayed_link = result[:displayed_link]
  snippet = result[:snippet]
  puts "#{title}#{snippet}#{link}#{displayed_link}\n\n"
end
-------
'''
2020 Ford Fusion Prices, Reviews, & Pictures - Best Carshttps://cars.usnews.com/cars-trucks/ford/fusionhttps://cars.usnews.com › Cars › Used Cars › Used Ford
Ford® Fusion Retired | Now What?Not all vehicles qualify for A, Z or X Plan. All Mustang Shelby GT350® and Shelby® GT350R prices exclude gas guzzler tax. 2. EPA-estimated city/hwy mpg for the ...https://www.ford.com/cars/fusion/https://www.ford.com › cars › fusion
...
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.
It looks like something is swapping the classes, so what you see in the browser is not what you get from the HTTP call; in this case, from _XWk to _tA.
page = HTTParty.get('https://www.google.com/search?q=ford+fusion+msrp&oq=ford+fusion+msrp&aqs=chrome.0.0l6.11452j0j7&sourceid=chrome&ie=UTF-8')
parse_page = Nokogiri::HTML(page)
parse_page.css('._tA').map(&:text)
# >>["Up to 23 city / 34 highway", "From $22,610", "175 to 325 hp", "192″ L x 73″ W x 58″ H", "3,431 to 3,681 lbs"]
Change parse_page.css('_XWk') to parse_page.css('._XWk').
Note the dot (.) difference: the dot references a class. With parse_page.css('_XWk'), Nokogiri doesn't know whether _XWk is a class, an id, a data attribute, etc.

crawl realtime google finance price

I want to create a small Excel sheet, sort of like Bloomberg's Launchpad, to monitor live stock market prices. So far, out of all the available free data sources, I have only found that Google Finance provides real-time prices for the list of exchanges I need. The issue with Google Finance is that they have already shut down their finance API. I am looking for a way to programmatically retrieve the real-time price that I circled in the chart below and have it update live in my Excel sheet.
I have been searching around, to no avail so far. I read this post:
How does Google Finance update stock prices? but the method suggested in the answer retrieves a time series of data for the chart, instead of the live-updating price I need. I have also examined the web page's network traffic in Chrome's inspector and didn't find any request that returns the real-time price. Any help is greatly appreciated; sample code (in languages other than VBA as well) would be very helpful. Thanks everyone!
There are so many ways to do this: VBA, VB, C#, R, Python, etc. Below is a way to download statistics from Yahoo Finance.
Sub DownloadData()
    Set ie = CreateObject("InternetExplorer.Application")
    With ie
        .Visible = True
        .navigate "https://finance.yahoo.com/quote/AAPL/key-statistics?p=AAPL"
        ' Wait for the page to fully load; you can't do anything if the page is not fully loaded
        Do While .Busy Or _
                 .readyState <> 4
            DoEvents
        Loop
        ' Set a reference to the data elements that will be downloaded.
        ' We can download either 'td' or 'tr' data elements; this site happens to use 'tr' elements.
        Set Links = ie.document.getElementsByTagName("tr")
        RowCount = 1
        ' Scrape out the innerText of each 'tr' element.
        With Sheets("DataSheet")
            For Each lnk In Links
                .Range("A" & RowCount) = lnk.innerText
                RowCount = RowCount + 1
            Next
        End With
    End With
    MsgBox ("Done!!")
End Sub
I will leave it up to you to find other technologies that do the same thing. For instance, R and Python can do exactly the same, although the scripts will be a bit different from the VBA script above.
It's fairly easy to make it work in Python. You'll need a few libraries:
requests - to make a request to Google Finance and return the HTML.
bs4 - to process the returned HTML.
pandas - to easily save to CSV/Excel.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml, json
import pandas as pd
from itertools import zip_longest

def scrape_google_finance(ticker: str):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "hl": "en",  # language
    }
    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    # https://www.whatismybrowser.com/detect/what-is-my-user-agent
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    }
    html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")
    ticker_data = {"right_panel_data": {},
                   "ticker_info": {}}
    ticker_data["ticker_info"]["title"] = soup.select_one(".zzDege").text
    ticker_data["ticker_info"]["current_price"] = soup.select_one(".AHmHk .fxKbKc").text
    right_panel_keys = soup.select(".gyFHrc .mfs7Fc")
    right_panel_values = soup.select(".gyFHrc .P6K39c")
    for key, value in zip_longest(right_panel_keys, right_panel_values):
        key_value = key.text.lower().replace(" ", "_")
        ticker_data["right_panel_data"][key_value] = value.text
    return ticker_data

# tickers to iterate over
tickers = ["DIS:NYSE", "TSLA:NASDAQ", "AAPL:NASDAQ", "AMZN:NASDAQ", "NFLX:NASDAQ"]
# temporarily store the data before saving to the file
tickers_prices = []
for ticker in tickers:
    # extract ticker data
    ticker_data = scrape_google_finance(ticker=ticker)
    # append to the temporary list
    tickers_prices.append({
        "ticker": ticker_data["ticker_info"]["title"],
        "price": ticker_data["ticker_info"]["current_price"]
    })
# create a dataframe and save to CSV/Excel
df = pd.DataFrame(data=tickers_prices)
# to save to Excel, use to_excel()
df.to_csv("google_finance_live_stock.csv", index=False)
Outputs:
ticker,price
Walt Disney Co,$137.06
Tesla Inc,"$1,131.21"
Apple Inc,$176.99
"Amazon.com, Inc.","$3,321.61"
Netflix Inc,$384.93
Returned data from ticker_data:
{
  "right_panel_data": {
    "previous_close": "$138.61",
    "day_range": "$136.66 - $139.20",
    "year_range": "$128.38 - $191.67",
    "market_cap": "248.81B USD",
    "volume": "9.98M",
    "p/e_ratio": "81.10",
    "dividend_yield": "-",
    "primary_exchange": "NYSE",
    "ceo": "Bob Chapek",
    "founded": "Oct 16, 1923",
    "headquarters": "Burbank, CaliforniaUnited States",
    "website": "thewaltdisneycompany.com",
    "employees": "166,250"
  },
  "ticker_info": {
    "title": "Walt Disney Co",
    "current_price": "$136.66"
  }
}
If you want to scrape more data with a line-by-line explanation, there's a Scrape Google Finance Ticker Quote Data in Python blog post of mine that also covers scraping time-series chart data.
