TypeError: First element of field tuple is neither a tuple nor str - rapids

I am a beginner to RAPIDS. I am trying to run the following code on Colab. It is resulting in an error.
TypeError: First element of field tuple is neither a tuple nor str
Similar code runs fine with pandas, but it fails when using apply on a cuDF DataFrame.
Code Snippet:
dataSet_formatted_gdf = cudf.DataFrame.from_pandas(pd.DataFrame(dataset_formatted_npa))
conAttrs_gdf = cudf.DataFrame.from_pandas(pd.DataFrame(dataset_formatted_npa[:, 0]))
equivalenceClasses = conAttrs_gdf.groupby(0).groups
decAttr_gdf = cudf.DataFrame.from_pandas(pd.DataFrame(dataset_formatted_npa[:, 1]))
decisionClasses = decAttr_gdf.groupby(0).groups
rough_membership_values = dataSet_formatted_gdf.apply(lambda x: calculateRoughMembership(x, equivalenceClasses, decisionClasses), axis=1)

multiprocessing a geopandas.overlay() throws no error but seemingly never completes

I'm trying to pass a geopandas.overlay() to multiprocessing to speed it up.
I have used custom functions and functools to partially fill function inputs and then pass the iterative component to the function to produce a series of dataframes that I then concat into one.
def taska(id, points, crs):
    return make_break_points((vms_points[points.ID == id]).reset_index(drop=True), crs)
# points_gdf = geodataframe of points with an id field
# grid_gdf = geodataframe polygon grid
partialA = functools.partial(taska, points=points_gdf, crs=grid_gdf.crs)
partialA_results = []
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialA, list(points_gdf.ID.unique())):
        partialA_results.append(results)
bpts_gdf = pd.concat(partialA_results)
In the example above I use the list of unique values to subset the df and pass it to a processor to perform the function and return the results. In the end all the results are combined using pd.concat.
When I apply the same approach to a list of dataframes created using numpy.array_split() the process starts with a number of processors, then they all close and everything hangs with no indication that work is being done or that it will ever exit.
def taskc(tracks, grid):
    return gpd.overlay(tracks, grid, how='union').explode().reset_index(drop=True)
# tracks_gdf = geodataframe of points with an id field
dfs = np.array_split(tracks_gdf, (cpu_count() - 4))
# grid_gdf = geodataframe polygon grid
partialC_results = []
partialC = functools.partial(taskc, grid=grid_gdf)
with Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfs):
        partialC_results.append(results)
results_df = pd.concat(partialC_results)
I tried using with get_context('spawn').Pool(cpu_count() - 4) as pool: based on the information here https://pythonspeed.com/articles/python-multiprocessing/ with no change in behavior.
Additionally, if I simply run geopandas.overlay(tracks_gdf, grid_gdf) the process is successful and the script carries on to the end with expected results.
Why does the partial function approach work on a list of items but not a list of dataframes?
Is the numpy.array_split() not an iterable object like a list?
How can I pass a single df into geopandas.overlay() in chunks to utilize multiprocessing capabilities and get back a single dataframe or a series of dataframes to concat?
This is my workaround, but I am also interested in whether there is a better way to perform this and similar tasks. Essentially, I moved the DataFrame split into the partial function and then used a list of values from range() as my iterable.
def taskc(num, tracks, grid):
    return gpd.overlay(np.array_split(tracks, cpu_count() - 4)[num], grid, how='union').explode().reset_index(drop=True)
partialC = functools.partial(taskc, tracks=tracks_gdf, grid=grid_gdf)
dfrange = list(range(0, cpu_count() - 4))
partialC_results = []
with get_context('spawn').Pool(cpu_count() - 4) as pool:
    for results in pool.map(partialC, dfrange):
        partialC_results.append(results)
results_gdf = pd.concat(partialC_results)
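The pattern in the workaround, where each worker receives a small picklable description of its chunk rather than the chunk itself, can be sketched with the standard library alone. This is only a minimal sketch under stated assumptions, not the original code: chunk_bounds, process_chunk, and run_parallel are hypothetical names, and squaring integers stands in for the gpd.overlay() call.

```python
from multiprocessing import Pool


def chunk_bounds(n_rows, n_chunks):
    """Split range(n_rows) into n_chunks contiguous (start, stop) pairs of near-equal size."""
    base, extra = divmod(n_rows, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional row.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def process_chunk(bounds):
    """Hypothetical stand-in for the overlay step: square the numbers in one slice."""
    start, stop = bounds
    return [x * x for x in range(start, stop)]


def run_parallel(n_rows, n_workers):
    """Map process_chunk over the slices and flatten the per-chunk results in order."""
    with Pool(n_workers) as pool:
        parts = pool.map(process_chunk, chunk_bounds(n_rows, n_workers))
    return [value for part in parts for value in part]
```

In a script, run_parallel(10, 3) would be called under an `if __name__ == "__main__":` guard, which the 'spawn' start method needs because each worker re-imports the main module. Since pool.map preserves input order, the flattened result lines up with the original row order.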

Python: Printing vertically

The final code will print the distance between states. I'm trying to print the menu with the names of the states numbered and vertically. I really struggle to find my mistakes.
This code doesn't raise any error; it just prints nothing.
state_data = """
LA 34.0522°N 118.2437°W
Florida 27.6648°N 81.5158°W
NY 40.7128°N 74.0060°W"""
states = []
import re
state_data1 = re.sub("[°N#°E]", "", state_data)
def process_states(string):
    states_temp = string.split()
    states = [(states_temp[x], float(states_temp[x + 1]), float(states_temp[x + 2])) for x in
              range(0, len(states_temp), 3)]
    return states
def menu():
    for state_data in range(state_data1):
        print(f'{state_data + 1} {name[number]}')
My first guess is that your code prints nothing and raises no error because you never actually call process_airports() or menu().
You have to call them like this at the end of your script:
something = process_airports(airport_data1)
menu()
This will now raise some errors though. So let's address them.
The menu() function will raise an error because neither name nor number is defined, and because you are calling range() on a string (airport_data1) instead of an integer.
First, the range error: you mixed two ideas in your for loop, iterating over the elements of your data and iterating over the indexes of those elements.
You have to choose one (we'll see later that you can do both at once). In this example, I chose to iterate over the indexes of the list.
Then, since neither name nor number exists anywhere, they would raise a NameError. Variables always have to be defined before they are used; in this case, however, they are not needed at all, so let's just remove them:
def menu(data):
    for i in range(len(data)):
        print(f'{i + 1} {data[i]}')
processed_airports = process_airports(airport_data1)
menu(processed_airports)
Here, data is the output of process_airports().
Now for some general advice and improvements.
First, global variables.
Notice how you can access airport_data1 inside the menu() function just fine. While this works, it is not recommended; it is usually better to pass variables explicitly as arguments.
Notice how, in the function I proposed above, every variable is either defined in the function itself or passed in as an argument, so no information comes from a higher scope. Again, this is not mandatory, but it makes the code much easier to work with and understand.
airport_data = """
Alexandroupoli 40.855869°N 25.956264°E
Athens 37.936389°N 23.947222°E
Chania 35.531667°N 24.149722°E
Chios 38.343056°N 26.140556°E
Corfu 39.601944°N 19.911667°E"""
airports = []
import re
airport_data1 = re.sub("[°N#°E]", "", airport_data)
def process_airports(string):
    airports_temp = string.split()
    airports = [(airports_temp[x], float(airports_temp[x + 1]), float(airports_temp[x + 2])) for x in
                range(0, len(airports_temp), 3)]
    return airports
def menu(data):
    for i in range(len(data)):
        print(f'{i + 1} {data[i]}')
# I'm adding the call to the functions for clarity
data = process_airports(airport_data1)
menu(data)
The printed menu now looks like this:
1 ('Alexandroupoli', 40.855869, 25.956264)
2 ('Athens', 37.936389, 23.947222)
3 ('Chania', 35.531667, 24.149722)
4 ('Chios', 38.343056, 26.140556)
5 ('Corfu', 39.601944, 19.911667)
Second, and this is mostly FYI: you can access both the index of an element and the element itself by looping over enumerate(). The following function prints exactly the same thing as the one using range(len(data)). This is handy if you need to work with both the element and its index.
def menu(data):
    for the_index, the_element in enumerate(data):
        print(f'{the_index + 1} {the_element}')
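To get the menu the question asks for (only the names, numbered, one per line), enumerate() can also start counting at 1, and each tuple can be unpacked in the loop. The sketch below is one possible way to do it, not the original answer's code: parse_places, menu_lines, and LINE_RE are hypothetical names, and treating south and west as negative is an assumption. It parses with a regular expression because the character class "[°N#°E]" also strips N and E from names (NY would become Y) and leaves the W suffix in place, which float() cannot convert.

```python
import re

# Name, latitude with an N/S suffix, longitude with an E/W suffix.
LINE_RE = re.compile(r"(\S+)\s+([\d.]+)°([NS])\s+([\d.]+)°([EW])")


def parse_places(text):
    """Return (name, lat, lon) tuples; south and west become negative (an assumption)."""
    places = []
    for name, lat, ns, lon, ew in LINE_RE.findall(text):
        lat = float(lat) if ns == "N" else -float(lat)
        lon = float(lon) if ew == "E" else -float(lon)
        places.append((name, lat, lon))
    return places


def menu_lines(places):
    """Number the names starting at 1, one entry per line."""
    return [f"{number}. {name}"
            for number, (name, _lat, _lon) in enumerate(places, start=1)]


state_data = """
LA 34.0522°N 118.2437°W
Florida 27.6648°N 81.5158°W
NY 40.7128°N 74.0060°W"""

print("\n".join(menu_lines(parse_places(state_data))))
```

Returning the lines from menu_lines() instead of printing inside it keeps the formatting easy to test and reuse; the final print() is the only place that writes to the screen.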

Use of TfidfVectorizer on dataframe

I have a dataframe with 3 columns (positive reviews, negative reviews, and score):
   negative                                    Positive                      Label
0  [there, were, issues, with, the, wifi, c]   [no, positive]                1
1  [rooms, could, do, with, a, bit, of, a]     [the, well, meaning, staff]   2.5
I want to apply the TfidfVectorizer on the DF.
I have written the following code.
from sklearn.feature_extraction.text import TfidfVectorizer
df_x=train_df["Positive"]
df_y=train_df["Score"]
cv = TfidfVectorizer()
df_xcv = cv.fit_transform(df_x)
a=df_xcv.toarray()
cv.get_feature_names()
which is giving an error:
AttributeError: 'list' object has no attribute 'lower'
Why is this throwing an error?
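You are passing cv.fit_transform() a pd.Series whose elements are lists of tokens, but it expects a list or Series of strings. The vectorizer lowercases each document, and a list has no lower() method. What you can do is join the list in each row into a single string and then pass those strings to the vectorizer: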
df['joined_positive'] = df['Positive'].apply(' '.join)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['joined_positive'])
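The cause of the error can be reproduced without scikit-learn, since the failure is simply a call to .lower() on a list. The sketch below uses plain Python only; the two token lists are copied from the example rows, and rows, docs, and error_message are hypothetical names.

```python
# Token lists copied from the "Positive" column in the question.
rows = [["no", "positive"], ["the", "well", "meaning", "staff"]]

# A list has no .lower() method, which is the same failure the vectorizer reports.
try:
    rows[0].lower()
except AttributeError as exc:
    error_message = str(exc)

# Joining the tokens gives one string per document, the input a vectorizer expects.
docs = [" ".join(tokens) for tokens in rows]

print(error_message)
print(docs)
```

If the tokens should be kept exactly as they are, another option is to pass a callable such as analyzer=lambda tokens: tokens to TfidfVectorizer, since a callable analyzer replaces the built-in preprocessing and tokenization.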

get_document_topics() returns empty list of topics

I have a one-word document that I want to transform to its bag-of-words representation:
so doc is ['party'] and id2word.doc2bow(doc) is [(229, 1)] which means the word is known.
However, if I call get_document_topics() with doc_bow, the result is an empty list:
id2word = lda.id2word
# ..
doc_bow = id2word.doc2bow(doc)
t = lda.get_document_topics(doc_bow)
try:
    label, prob = sorted(t, key=lambda x: -x[1])[0]
except Exception as e:
    print('Error!')
    raise e
The only possible explanation I'd have here is that this document (the single word) cannot be assigned to any topic. Is this the reason why I am seeing this?
Here is how I solved this issue:
In the file gensim/models/ldamodel.py, you need to change the value of epsilon to a larger value.
DTYPE_TO_EPS = {
    np.float16: 1e-5,
    np.float32: 1e-35,  # line to change
    np.float64: 1e-100,
}
Also, make sure to set the minimum_probability parameter of get_document_topics to a very small value.
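The effect of minimum_probability can be illustrated without gensim. The sketch below is a plain-Python stand-in for the thresholding idea, not gensim's implementation: filter_topics and best_topic are hypothetical helpers, and the uniform 250-topic distribution is made-up data. It shows how a threshold can filter out every topic, and how to pick the top topic without an IndexError when the list comes back empty.

```python
def filter_topics(distribution, minimum_probability):
    """Keep (topic_id, probability) pairs at or above the threshold."""
    return [(topic_id, p) for topic_id, p in enumerate(distribution)
            if p >= minimum_probability]


def best_topic(topics):
    """Return the most probable (topic_id, probability) pair, or None if empty."""
    return max(topics, key=lambda pair: pair[1], default=None)


# Made-up example: 250 topics, each with probability 1/250 = 0.004.
distribution = [1 / 250] * 250

strict = filter_topics(distribution, 0.01)  # every topic falls below 0.01
loose = filter_topics(distribution, 1e-8)   # every topic is kept

print(len(strict), best_topic(strict))
print(len(loose), best_topic(loose))
```

Using max(..., default=None) instead of sorted(...)[0] means an empty result can be handled as a normal case rather than through an exception.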

Cannot convert object of type 'list' to a numeric value

I am building a Pyomo model in which I want to use random numbers for my two-dimensional parameters. I wrote a small Python script that generates random numbers in exactly the shape I want for a two-dimensional parameter. However, I get a TypeError: Cannot convert object of type 'list' (value = [[....]]) to a numeric value in my objective function. Below are my objective function and the random-number script.
model.obj = Objective(expr=sum(model.C[v,l] * model.T[v,l] for v in model.V for l in model.L) + \
sum(model.OC[o,l] * model.D[o,l] for o in model.O for l in model.L), sense=minimize)
import random
C = [[] for i in range(7)]
for i in range(7):
    for j in range(5):
        C[i] += [random.randint(100, 500)]
model.C = Param(model.V, model.L, initialize=C)
Please let me know if someone can help fixing this.
You should initialize your parameter with a function instead of a nested list. Pyomo calls the function once for each index pair:
import random
def init_c(m, i, j):
    return random.randint(100, 500)
model.C = Param(model.V, model.L, initialize=init_c)
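If the random values need to exist before the model is built, another option is a dict keyed by (v, l) index tuples, which Param also accepts for initialize. The sketch below is plain Python and assumes hypothetical index sets V = 1..7 and L = 1..5, since the question does not show how model.V and model.L are defined; c_values is a hypothetical name as well.

```python
import random

# Hypothetical index sets; the question does not show model.V and model.L.
V = range(1, 8)
L = range(1, 6)

# One random cost per (v, l) pair, keyed by the index tuple.
c_values = {(v, l): random.randint(100, 500) for v in V for l in L}

# With Pyomo available, this dict could then be passed as:
#   model.C = Param(model.V, model.L, initialize=c_values)
print(len(c_values))
```

Compared with the nested list, the dict makes the mapping from index pair to value explicit, so the keys only have to match the members of model.V and model.L.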
