How to pivot in Pandas dataframe - azure-databricks

Please advise how this can be done in a Pandas dataframe.
The current Pandas dataframe (df):
Id   Phone  Email        workplace  Mailing city  Mailing State
123  12345  123#1233.de  test       New York      New York
abc  45678  abc#ab.de    test       New York      New York
def  78019  def#def.de   test       New York      New York
I am looking for the format below in the Pandas dataframe:
Id   Attribute  Value        Mailing city  Mailing State
123  Phone      12345        New York      New York
123  Email      123#1233.de  New York      New York
123  workplace  test         New York      New York
abc  Phone      45678        New York      New York
abc  Email      abc#ab.de    New York      New York
abc  workplace  test

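I reproduced this and got the long format by using the melt function.
Code for the required result:
import pandas as pd
value_cols = ["Phone", "Email", "workplace"]
# Keep Id and the mailing columns as identifiers so they appear in the result
id_cols = [c for c in df.columns if c not in value_cols]
result_df = pd.melt(df, id_vars=id_cols, value_vars=value_cols, var_name="Attribute", value_name="Value")
# mergesort is stable, so Phone, Email, workplace keep their order within each Id
result_df = result_df.sort_values("Id", kind="mergesort")
result_df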
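To make the reshaping concrete, here is a plain-Python sketch that mimics the wide-to-long step on a list of dictionaries. It is only an illustration and not how pandas implements melt; `melt_rows` is a hypothetical helper name, and the sample rows are taken from the question.

```python
def melt_rows(rows, id_cols, value_cols):
    """Unpivot value_cols of every row into Attribute/Value pairs.

    Unlike pandas.melt, this walks row by row, so the output is already
    grouped by Id and needs no sorting afterwards.
    """
    long_rows = []
    for row in rows:
        ids = {col: row[col] for col in id_cols}
        for col in value_cols:
            long_rows.append({**ids, "Attribute": col, "Value": row[col]})
    return long_rows


rows = [
    {"Id": "123", "Phone": "12345", "Email": "123#1233.de",
     "workplace": "test", "Mailing city": "New York"},
    {"Id": "abc", "Phone": "45678", "Email": "abc#ab.de",
     "workplace": "test", "Mailing city": "New York"},
]

long_rows = melt_rows(rows, id_cols=["Id", "Mailing city"],
                      value_cols=["Phone", "Email", "workplace"])
for r in long_rows:
    print(r)
```

Each input row yields one output row per unpivoted column, so two input rows with three value columns give six output rows.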


Autocomplete search for MongoDb and

I'm trying to build an efficient autocomplete search for my Spring Boot app, but I'm not getting the proper results.
I have a cars database with multiple car models.
This is my current search aggregation:
val agg = Document(
    "\$search", Document(
        "compound", Document(
            "must", searchInput.split(" ").map {
                Document(
                    "autocomplete", Document("query", it)
                        .append("path", "fullName")
                )
            }
        )
    )
)
fullName represents the full name of the car: Brand + Model + Year + Power
The search works, but not well enough yet.
For instance, if I search for "Ford GT", the first results that come up are:
Ford Mustang GT V8 5.0
Ford Mustang GT 390 Fastback
Ford Mustang GT 302 HO
and so on.
But the first results should be:
Ford GT - 655 PS
Ford GT40 Mk III
and only afterwards, the ones above.
How can I achieve this?

ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers

from os import listdir
from os.path import isfile, join
from datasets import load_dataset
from transformers import BertTokenizer
test_files = [join('./test/', f) for f in listdir('./test') if isfile(join('./test', f))]
dataset = load_dataset('json', data_files={"test": test_files}, cache_dir="./.cache_dir")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
def encode(batch):
    return tokenizer.encode_plus(batch["abstract"], max_length=32, add_special_tokens=True, pad_to_max_length=True,
                                 return_attention_mask=True, return_token_type_ids=False, return_tensors="pt")
dataset.set_transform(encode)
When I run this code, I get:
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.
Instead of having a list of strings, I have a list of lists of strings. Here is the content of batch["article"]:
[['eleven politicians from 7 parties made comments in letter to a newspaper .', "said dpp alison saunders had ` damaged public confidence ' in justice .", 'ms saunders ruled lord janner unfit to stand trial over child abuse claims .', 'the cps has pursued at least 19 suspected paedophiles with dementia .'], ['an increasing number of surveys claim to reveal what makes us happiest .', 'but are these generic lists really of any use to us ?', 'janet street-porter makes her own list - of things making her unhappy !'], ["author of ` into the wild ' spoke to five rape victims in missoula , montana .", "` missoula : rape and the justice system in a college town ' was released april 21 .", "three of five victims profiled in the book sat down with abc 's nightline wednesday night .", 'kelsey belnap , allison huguet and hillary mclaughlin said they had been raped by university of montana football players .', "huguet and mclaughlin 's attacker , beau donaldson , pleaded guilty to rape in 2012 and was sentenced to 10 years .", 'belnap claimed four players gang-raped her in 2010 , but prosecutors never charged them citing lack of probable cause .', 'mr krakauer wrote book after realizing close friend was a rape victim .'], ['tesco announced a record annual loss of £ 6.38 billion yesterday .', 'drop in sales , one-off costs and pensions blamed for financial loss .', 'supermarket giant now under pressure to close 200 stores nationwide .', 'here , retail industry veterans , plus mail writers , identify what went wrong .'], ..., ['snp leader said alex salmond did not field questions over his family .', "said she was not ` moaning ' but also attacked criticism of women 's looks .", 'she made the remarks in latest programme profiling the main party leaders .', 'ms sturgeon also revealed her tv habits and recent image makeover .', 'she said she relaxed by eating steak and chips on a saturday night .']]
How could I fix this issue?
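One possible way to turn a list of lists of strings into the flat list of strings that the error message asks for is to join each inner list into a single string before tokenizing. The sketch below is plain Python and does not call the tokenizer; `join_sentences` is a hypothetical helper name, and it assumes that each inner list holds the sentences of one document.

```python
def join_sentences(batch_column):
    """Turn a list of lists of strings into a flat list of strings.

    Assumption: each inner list holds the sentences of one document, so
    joining them with a space gives one string per document.
    """
    return [" ".join(sentences) for sentences in batch_column]


# Shortened sample with the same shape as the batch content shown above
articles = [
    ["an increasing number of surveys claim to reveal what makes us happiest .",
     "but are these generic lists really of any use to us ?"],
    ["drop in sales , one-off costs and pensions blamed for financial loss .",
     "supermarket giant now under pressure to close 200 stores nationwide ."],
]

flat = join_sentences(articles)
print(len(flat))
print(flat[0])
```

The resulting list has one string per document. Whether joining the sentences or tokenizing each sentence separately is the right choice depends on the task.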

Removing categories with patsy and statsmodels

I am using statsmodels and patsy for building a logistic regression model. I'll use pseudocode here. Let's assume I have a dataframe containing a categorical variable, say Country, with 200 levels. I have reasons to believe some of them would be predictive, so I build a model as in
formula = 'outcome ~ C(Country)'
patsy splits Country into its levels and the model is built using all countries. I then see that the coefficient for GB is high, so I want to remove only GB. Can I do something like this in patsy?
formula = 'outcome ~ C(Country) - C(Country)[GB]'
I tried and it did not change anything.
I don't know whether a patsy formula can subset a categorical variable, but you can do it in the DataFrame.
For example:
import numpy as np
import pandas as pd
import statsmodels.api as sm
# sample data
size = 100
np.random.seed(1)
countries = ['IT', 'UK', 'US', 'FR', 'ES']
df = pd.DataFrame({
    'outcome': np.random.random(size),
    'Country': np.random.choice(countries, size)
})
df['Country'] = df.Country.astype('category')
print(df.Country)
0 ES
1 IT
2 UK
3 US
4 UK
..
95 FR
96 UK
97 ES
98 UK
99 US
Name: Country, Length: 100, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Suppose we want to remove the category "US":
# create a deep copy excluding 'US'
_df = df[df.Country!='US'].copy(deep=True)
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (5, object): ['ES', 'FR', 'IT', 'UK', 'US']
Even though the DataFrame no longer contains any rows with category "US", the category itself is still there. If we used this DataFrame in a statsmodels model, we would get a singular matrix error, so we need to remove the unused categories:
# remove unused category 'US'
_df['Country'] = _df.Country.cat.remove_unused_categories()
print(_df.Country)
0 ES
1 IT
2 UK
4 UK
5 ES
..
94 UK
95 FR
96 UK
97 ES
98 UK
Name: Country, Length: 83, dtype: category
Categories (4, object): ['ES', 'FR', 'IT', 'UK']
Now we can fit a model:
mod = sm.Logit.from_formula('outcome ~ Country', data=_df)
fit = mod.fit()
print(fit.summary())
Optimization terminated successfully.
         Current function value: 0.684054
         Iterations 4
                           Logit Regression Results
==============================================================================
Dep. Variable:                outcome   No. Observations:                   83
Model:                          Logit   Df Residuals:                       79
Method:                           MLE   Df Model:                            3
Date:                Sun, 16 May 2021   Pseudo R-squ.:                 0.01179
Time:                        22:43:37   Log-Likelihood:                -56.776
converged:                       True   LL-Null:                       -57.454
Covariance Type:            nonrobust   LLR p-value:                    0.7160
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept        -0.1493      0.438     -0.341      0.733      -1.007       0.708
Country[T.FR]     0.4129      0.614      0.673      0.501      -0.790       1.616
Country[T.IT]    -0.1223      0.607     -0.201      0.840      -1.312       1.068
Country[T.UK]     0.1027      0.653      0.157      0.875      -1.178       1.383
=================================================================================

Integrate google search

I am trying to integrate Google search results into my Rails application, but I am unable to get a proper response. I am following this link:
https://github.com/wiseleyb/google_custom_search_api
Here is my code
class UsersController < ApplicationController
  def index
    results = GoogleCustomSearchApi.search("restaurant near sector 2 noida")
    results["items"].each do |item|
      puts item["title"], item["link"]
    end
  end
end
The response which I am getting is this
Youth Thrash Restaurant Owner In Noida | Full Video - YouTube
https://www.youtube.com/watch?v=4PSKUacsmnY
Karim's Hotel Restaurant - Delhi - YouTube
https://www.youtube.com/watch?v=AKqOlthRO2w
Bikanervala Bliss Fine Dine Restaurant at Noida Sec-18 - YouTube
https://www.youtube.com/watch?v=s9J4R-sC3-w
Top 5 | Fine Dining Restaurants In Noida - YouTube
https://www.youtube.com/watch?v=DLXno_bYhRY
Dev's Bar & Restaurant, Sector-62, Noida - YouTube
https://www.youtube.com/watch?v=4KZ2pz_azFA
Noida Kids- Sector 38 McDonalds - YouTube
https://www.youtube.com/watch?v=L-JCZF1FkCc
Review of Theos, Noida | Bakery & Cake Shops Restaurants- Italian ...
https://www.youtube.com/watch?v=G1LFabXPeks
Bistro 37, Noida | Coffee Shops/Cafes /Restaurants- Continental ...
https://www.youtube.com/watch?v=7pVstxQ0EvQ
Review of Desi Vibes, Noida | Restaurants- North Indian ...
https://www.youtube.com/watch?v=ZRwJPuv8vC4
sector 62 noida
https://www.google.com/mymaps/viewer?mid=1Y_e1x5df5tCuGjKOCRzx3bXBv9w&hl=en
These results are wrong. Is there any way I can integrate Google search so that it returns proper answers?
I should get the same results that Google itself shows.

Apache PIG - How to get the Flop 10 data records?

I have data records like this:
Name           customerID  revenue(Mio)  premium
Michael James  078932832   2.7           y
Susan Miller   024383490   3.9           n
John Cooper    021023023   2.1           y
How do I get the 10 records with the lowest revenue (the "Flop 10") for each value of the premium flag?
The result should be given as:
Nr  Name           customerID  revenue(Mio)  premium
1   John Cooper    021023023   2.1           y
2   Michael James  078932832   2.7           y
3   Andrew Murs    044834399   3.0           y
.   ...            .....       ...           .
10  (10th entry with flag y)
1   Susan Miller   024383490   3.9           n
.   ...            .....       ...           .
10  (10th entry with flag n)
As you can see, the list is ordered ascending within each flag (beginning with the lowest revenue).
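I suggest using split.
Assuming A is your load statement:
A = load 'data' as (Name:chararray, customerID:chararray, revenue:double, premium:chararray);
-- split is a standalone statement; its result is not assigned to an alias
split A into PRE if premium == 'y', NONPRE if premium == 'n';
C = order PRE by revenue asc;
D = order NONPRE by revenue asc;
-- keep only the 10 lowest-revenue records of each group
C10 = limit C 10;
D10 = limit D 10;
Disclaimer: Be careful when using split: records whose premium is null match neither condition and are dropped. I have not compiled this code.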
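For comparison, the same idea (split by flag, order ascending by revenue, keep the first 10, number the rows) can be sketched in plain Python. This is only an illustration of the logic, not Pig code; `flop_n` is a hypothetical helper name, and the sample records are the three rows from the question.

```python
def flop_n(records, n=10, flags=("y", "n")):
    """Return, for each premium flag, the n records with the lowest revenue.

    Rows are numbered from 1 within each flag. Records whose flag is not
    listed in `flags` (for example None) are dropped.
    """
    result = []
    for flag in flags:
        group = sorted(
            (r for r in records if r["premium"] == flag),
            key=lambda r: r["revenue"],
        )
        for nr, rec in enumerate(group[:n], start=1):
            result.append({"Nr": nr, **rec})
    return result


records = [
    {"Name": "Michael James", "customerID": "078932832", "revenue": 2.7, "premium": "y"},
    {"Name": "Susan Miller", "customerID": "024383490", "revenue": 3.9, "premium": "n"},
    {"Name": "John Cooper", "customerID": "021023023", "revenue": 2.1, "premium": "y"},
]

for row in flop_n(records):
    print(row["Nr"], row["Name"], row["customerID"], row["revenue"], row["premium"])
```

The numbering restarts at 1 for each flag, which matches the Nr column in the expected result.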
