How to change an individual value in a Koalas dataframe? - spark-koalas

I'm looking to update a single value in a koalas dataframe by passing its positional index.
I have tried using iat but keep running into errors. For example, if I tried:
df.iat[2,3]=5
I get an error saying 'iAtIndexer' does not support item assignment. This worked previously in pandas, and I am not sure whether there is a different method in Koalas.
Thanks
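One workaround, sketched below, is to round-trip through pandas, since Koalas' iAtIndexer does not support assignment; this only makes sense when the frame is small enough to collect to the driver, and the column data here is made up for illustration:
import databricks.koalas as ks

kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

pdf = kdf.to_pandas()        # collect into a plain pandas DataFrame on the driver
pdf.iat[2, 1] = 5            # positional assignment works in pandas
kdf = ks.from_pandas(pdf)    # convert back to a Koalas DataFrame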

Related

Error in azureml "Non numeric value(s) were encountered in the target column."

I am using Automated ML to run a time series forecasting pipeline.
When the AutoMLStep gets triggered, I get this error: Non numeric value(s) were encountered in the target column.
The data for this step is passed through an OutputTabularDatasetConfig, after applying read_delimited_files() to an OutputFileDatasetConfig. I've inspected the prior step, and the data comprises a 'Date' column and a numeric column called 'Place' with 80+ observations at a monthly frequency.
Nothing seems to be wrong with the column type or the data. I've also applied a number of techniques on the data prep side, e.g. pd.to_numeric() and astype(float), to ensure it is numeric.
I've also tried forcing this through FeaturizationConfig().add_column_purpose('Place', 'Numeric'), but in this case I get another error: Expected column(s) Place in featurization config's column purpose not found in X.
Any thoughts on how to solve this?
So, a few learnings on this from interacting with the stellar Azure Machine Learning engineering team:
When calling the read_delimited_files() method, make sure the output folder does not contain many other inputs or files. If all intermediate outputs are saved to a common folder, the method may read all of the prior files in that folder and, depending on the shape of the data, borrow the schema from the first file or mix them all together, which leads to inconsistencies and errors. In my case I was dumping many files to the same location, and that was confusing this method. The fix is either to distinctly mark the output folder (e.g. with a UUID) or to use different paths.
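As a sketch of that first fix, assuming the v1 azureml-sdk APIs already used above (the datastore name and folder layout here are illustrative, not from the original pipeline), each run can write its intermediate output to its own UUID-marked folder:
import uuid
from azureml.core import Workspace, Datastore
from azureml.data import OutputFileDatasetConfig

ws = Workspace.from_config()
datastore = Datastore.get(ws, "workspaceblobstore")

# give every run its own output folder so read_delimited_files() only sees
# this run's files, not earlier intermediate outputs dumped to a shared path
output_folder = "prepped/" + str(uuid.uuid4())
prepped_data = OutputFileDatasetConfig(
    name="prepped_data",
    destination=(datastore, output_folder),
)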
The dataframe from read_delimited_files() may treat all columns as object type, which can lead to a data type check failure (i.e. label_column needs to be numeric). To mitigate this, explicitly state the type. For example:
from azureml.data import DataType
prepped_data = prepped_data.read_delimited_files(set_column_types={"Place":DataType.to_float()})

Issue with choice action when running transform map

I'm trying to insert records into a table by using transform maps. I have a field in the target table which is a choice type, and I have set the choice action on the source table's field to reject if no matching value is found. But when I tried inserting the record using the transform map with the correct value, which exists in the choice list of the target field, it still got rejected and hence the records were not inserted.
I have tried searching for possible reasons why it still gets rejected even with the correct value in the source field. Here's a link that I found: https://hi.service-now.com/kb_view.do?sysparm_article=KB0677334
It says that if a choice list value has more than 40 characters it will be truncated and might not match any of the choices. But the choices in the target field have only 20 characters or fewer.
I first tried running the transform map in the lower environments before proceeding to production. In the lower environment it works fine and the records get inserted, but when I tried it in production the records got rejected.
There is a difference between a choice and a choice list. Within a choice list the values are comma-separated sys_ids. I could imagine that you have multiple values for import and then the maximum character length is reached, or the values do not match, etc.
You could use this approach:
Instead of a direct assignment from the source to the target field, use a script to target. Then you gain the full scripting power ;)
Here you could add some logic like a switch case or whatever; I guess you get the point.
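As a rough sketch of that idea (the source field name u_status and the choice values below are hypothetical, not from the original tables), a field map with "Use source script" enabled could map the incoming value onto the exact target choice values:
answer = (function transformEntry(source) {
    var value = '' + source.u_status;   // coerce the GlideElement to a plain string
    switch (value) {
        case 'Open':
            return 'open';              // must match the target choice value exactly
        case 'Closed':
            return 'closed';
        default:
            return '';                  // let the choice action handle anything else
    }
})(source);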

SSRS URL Access passing a parameter that has a dataset

I am trying to build a URL to make getting to the SSRS page a bit faster and to eliminate all the input a user has to do (he has already done it in another program).
When I use this URL:
http://localhost/ReportServer/Pages/ReportViewer.aspx?/folder/subfolder/reportname&rs:Command=Render&customerId=1000002
I can't fill out the parameter whose default values come from a query against the database. When I remove the query behind the parameter, the textbox is filled out and the report works as expected. I am sure the value of customerId is in the database.
How can I solve this?
I found it.
The data I got back from the database contained some trailing spaces. After removing those, it worked like a charm.

Pandas DataFrame Sort: Want to sum and sort, but keep column names

Right now, I am running a sum and sort on a DataFrame object:
games_tags.groupby(['GameID', 'GameName', 'Tag']).sum().sort(['Count'], ascending=False)
The issue I'm running into is that afterwards, I want to be able to still grab each row's GameID, GameName, and Tag via row['GameID'], etc. However, I noticed that after I use the sum() method, it creates a column named 'Count', but I can no longer access any of the original columns.
I was wondering if anyone knows a work around or some intricacy to the sum() method that I am missing. Any help is appreciated. Thanks!
You can reset the index after the groupby to restore the columns:
games_tags.reset_index(inplace=True)
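As a fuller sketch with made-up data (and using sort_values, the current name for the older .sort method), reset_index turns the grouped keys back into regular columns that can be accessed per row:
import pandas as pd

games_tags = pd.DataFrame({
    "GameID": [1, 1, 2],
    "GameName": ["Chess", "Chess", "Go"],
    "Tag": ["strategy", "classic", "strategy"],
    "Count": [10, 5, 7],
})

summed = (
    games_tags.groupby(["GameID", "GameName", "Tag"])
    .sum()
    .reset_index()                        # turn the grouped keys back into columns
    .sort_values("Count", ascending=False)
)

for _, row in summed.iterrows():
    print(row["GameID"], row["GameName"], row["Tag"], row["Count"])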

rethinkdb: "RqlRuntimeError: Array over size limit" even when using limit()

I'm trying to access a constant number of the latest documents in a table, ordered by a "date" key. Note that the date, unfortunately, was implemented (not by me...) such that the value is set as a string, e.g. "2014-01-14", or sometimes "2014-01-14 22:22:22". I'm getting a "RqlRuntimeError: Array over size limit 102173" error message when using the following query:
r.db('awesome_db').table("main").orderBy(r.desc("date"))
I tried to overcome this problem by specifying a constant limit, since for now I only need the latest 50:
r.db('awesome_db').table("main").orderBy(r.desc("date")).limit(50)
Which ended with the same error. So, my questions are:
How can I get a constant number of the latest documents by date?
Is ordering by a string-based date field possible? Does this issue have something to do with my first question?
The reason you get an error here is that the orderBy gets evaluated before the limit, so it orders the entire table in memory, which is over the array limit. The way to fix this is by using an index. Try doing the following:
table.indexCreate("date")
table.indexWait()
table.orderBy({index: r.desc("date")}).limit(50)
That should be equivalent to what you have there but uses an index so it doesn't require loading the entire table into memory.
Another option is to raise the array limit through the run options. In Go (rethinkdb-go):
ro := r.RunOpts{ArrayLimit: 500000}
r.DB("wrk").Table("log").Run(sessionArray[0], ro)
And the equivalent for Python, where the option is passed as a keyword argument:
r.db('awesome_db').table("main").run(session, array_limit=500000)
