How can I create a dataframe from this dict?

I know this is a rather basic question, but for the life of me I can't get it to work, and I am running out of time, so:
I have a dict that looks like this:
data_dict = {'timestamp': '2019-05-01T06:00:00-04:00', 'data': [0.37948282157787916, 1.5890471705541012, 2.1883813840381885], '_id': '62377385587e549976adfda0'}
How can I create a dataframe from it? I tried:
schema = StructType([
    StructField('timestamp', TimestampType(), True),
    StructField('data', ArrayType(DecimalType()), True),
    StructField('_id', StringType(), True)
])
df = spark.createDataFrame(data=data_dict, schema=schema)
this gives me error:
TypeError: StructType can not accept object 'timestamp' in type <class 'str'>
But even when I shrink the dict and remove the timestamp from both the dict and the schema, I get a similar error:
TypeError: StructType can not accept object 'data' in type <class 'str'>
Any help greatly appreciated, thanks a lot in advance!
Edit: I just figured out that by putting [] around the dict I get it to work. However, if anyone has a less ugly solution, I'll buy it.
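In other words, roughly this (a minimal sketch, assuming the explicit schema is dropped and the timestamp cast happens afterwards):
# createDataFrame expects an iterable of rows, so the single dict is wrapped
# in a list. The explicit schema is dropped here (TimestampType rejects plain
# strings) and the timestamp is cast after the dataframe exists.
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

df = spark.createDataFrame([data_dict])
df = df.withColumn('timestamp', col('timestamp').cast(TimestampType()))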

You can cast the columns to the desired types once you have the data in the dataframe, and then, if needed, explode the data column to spread the array values across rows.
from pyspark.sql.functions import col, explode
from pyspark.sql.types import TimestampType

data_dict = {'timestamp': '2019-05-01T06:00:00-04:00', 'data': [0.37948282157787916, 1.5890471705541012, 2.1883813840381885], '_id': '62377385587e549976adfda0'}
df = spark.createDataFrame([data_dict]).select('_id', explode('data').alias('data'), col('timestamp').cast(TimestampType()))
_id                      | data                | timestamp
-------------------------|---------------------|--------------------------
62377385587e549976adfda0 | 0.37948282157787916 | 2019-05-01T06:00:00-04:00
62377385587e549976adfda0 | 1.5890471705541012  | 2019-05-01T06:00:00-04:00
62377385587e549976adfda0 | 2.1883813840381885  | 2019-05-01T06:00:00-04:00

Related

Type error with Hasura array data type: "A string is expected for type : _int4"

I have a table I created in the Hasura console. A few of the columns are integer types, and I get the expected data type: Maybe<Scalars['Int']>.
However, a few needed to be an array of integers so I created those in the Hasura SQL tab:
ALTER TABLE my_table
ADD my_ids integer[];
If I populate those in GraphiQL with the following query variable everything works just fine:
{
  "my_ids": "{1200, 1201, 1202}"
}
However, when I try to make the same request from my front-end client, I receive the following error: A string is expected for type : _int4. Looking at the data type, it is slightly different from the preset (in the data type dropdown) integer types: Maybe<Scalars['_int4']>.
Is there a way to get the array of integers to be Maybe<Scalars['Int']> like the preset integer types? Or if there isn't, how can I resolve the issue with my request throwing errors for this _int4 type?
Hasura cannot process a plain JSON array here, but it can process array literals.
I've written a function to help you transform your array into an array literal before the mutation:
// Note: replace swaps only the first '[' and the first ']', so this is only
// safe for flat (non-nested) arrays.
const toArrayLiteral = (arr) => JSON.stringify(arr).replace('[', '{').replace(']', '}')
...
myObject.array = toArrayLiteral(myObject.array) // turns ['friend'] into {'friend'}
Good luck
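If the array can hold nested values or strings containing brackets, a slightly safer variant (a hypothetical sketch, flat arrays only) builds the literal explicitly instead of patching the stringified JSON:
// Hypothetical alternative: build the Postgres array literal directly.
// JSON.stringify double-quotes each string element, which matches the
// quoting used in array literal syntax.
const toArrayLiteralSafe = (arr) => `{${arr.map((v) => JSON.stringify(v)).join(',')}}`

console.log(toArrayLiteralSafe([1200, 1201, 1202])) // {1200,1201,1202}
console.log(toArrayLiteralSafe(['friend']))         // {"friend"}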

split observable array into two observable arrays

I have an API call which returns data as Observable<User[]> (where User is
{ id: string, status: string })
What I would like to do is split the observable into two different observables, based on the status (active/inactive).
I googled and searched Stack Overflow, but the only samples I could find showed arrays of single values, [1,2,3,4] etc.
I tried to apply the same technique to the array of objects:
const foo = this.userApi
.find()
.pipe(partition(item => item.status === 'active'));
hoping that this would return foo[0] holding the active and foo[1] holding the inactive users
However, the vsCode editor complains about
[ts]
Argument of type 'UnaryFunction<Observable<User[]>, [Observable<User[]>, Observable<User[]>]>' is not assignable to parameter of type 'OperatorFunction<any, any>'.
Type '[Observable<User[]>, Observable<User[]>]' is not assignable to type 'Observable<any>'.
Property '_isScalar' is missing in type '[Observable<User[]>, Observable<User[]>]'.
[ts] Property 'status' does not exist on type 'User[]'.
any
which implies that the item is an array ...
I also tried groupBy and hit the same problem
const foo = this.userApi
.find<User>()
.pipe(groupBy(item => item.status));
[ts] Property 'status' does not exist on type 'User[]'
as you can probably tell, I'm no rxjs expert and have hit a wall at this point, and would appreciate any pointers
In my opinion you had the correct idea. Using partition seems like the right way to do it. But partition is really awkward for now (and may be deprecated in favour of something better, splitBy; see here for more info: https://github.com/ReactiveX/rxjs/issues/3807).
Anyway here's how to use partition:
const [activeUsers$, inactiveUsers$]: [Observable<User>, Observable<User>] =
  partition((x: User) => x.status === 'active')(from(mockUsers));
Stackblitz example here: https://stackblitz.com/edit/typescript-ziafpr
See also https://github.com/ReactiveX/rxjs/issues/2995
You are not passing the individual items to groupBy, but the entire array. Try this:
const foo = this.userApi
  .find<User>()
  .pipe(
    mergeMap(users => from(users)),
    groupBy(item => item.status)
  );
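The partition approach from the first answer needs the same flattening step, because this.userApi.find() emits a whole User[] at once. A minimal self-contained sketch (assuming rxjs 6.5+ with the static partition function, and a mocked source in place of the API):
import { of, from, partition } from 'rxjs';
import { mergeMap } from 'rxjs/operators';

interface User { id: string; status: string; }

// Stand-in for this.userApi.find(), which emits the whole array at once.
const find$ = of<User[]>([
  { id: '1', status: 'active' },
  { id: '2', status: 'inactive' },
]);

// Flatten User[] into individual User emissions, then split the stream.
const users$ = find$.pipe(mergeMap(users => from(users)));
const [activeUsers$, inactiveUsers$] = partition(users$, u => u.status === 'active');

activeUsers$.subscribe(u => console.log('active:', u.id));
inactiveUsers$.subscribe(u => console.log('inactive:', u.id));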

Transform an org.apache.spark.rdd.RDD[String] into a parallelized collection

I have a CSV file in HDFS with a collection of products like:
[56]
[85,66,73]
[57]
[8,16]
[25,96,22,17]
[83,61]
I'm trying to apply the Association Rules algorithm in my code. For that I need to run this:
scala> val data = sc.textFile("/user/cloudera/data")
data: org.apache.spark.rdd.RDD[String] = /user/cloudera/data MapPartitionsRDD[294] at textFile at <console>:38
scala> val distData = sc.parallelize(data)
But when I submit this I'm getting this error:
<console>:40: error: type mismatch;
found : org.apache.spark.rdd.RDD[String]
required: Seq[?]
Error occurred in an application involving default arguments.
val distData = sc.parallelize(data)
How can I transform an RDD[String] into a Seq collection?
Many thanks!
What you are facing is simple; the error shows it to you.
To use sc.parallelize you have to pass it a Seq, and you are passing an RDD[String].
But the RDD is already parallelized: the textFile method distributes the file across your cluster, line by line, so there is no need to parallelize it again.
You can check the method description here:
https://spark.apache.org/docs/latest/programming-guide.html
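So a minimal sketch (assuming the bracketed line format shown above) is to map over the RDD directly rather than calling parallelize again:
// textFile already returns a distributed RDD, so it can be mapped directly.
val data = sc.textFile("/user/cloudera/data")

// Parse each line like "[25,96,22,17]" into an Array[Int].
val transactions = data.map { line =>
  line.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toInt)
}

transactions.take(3).foreach(arr => println(arr.mkString(",")))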

MongoDB C# LINQ serialization error

I have a class AttributeValue containing two string fields:
string DataType
string Category
My mongo query is as below:
var test19 = _repo.All().Where(p => p.Rule.Any(r =>
    r.Target.AnyOf.Any(an => an.AllOf.Any(av =>
        av.Match != null && policyFilters2.Contains(
            string.Concat(av.Match.AttributeValue.DataType,
                          av.Match.AttributeValue.Category))))));
where policyFilters2 is a List<string>.
The above query gives me an error:
"Unable to determine the serialization information for the expression:
String.Concat(av.Match.AttributeValue.DataType,
av.Match.AttributeValue.Category)."
I am not sure what needs to be done to resolve this.
Any help is greatly appreciated.
I don't think MongoDB can search on values concatenated on the fly like that. Try storing the concatenated value in its own field and then querying against that field.
Eventually, I had to search on both fields using a logical AND rather than concatenating them.
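A minimal sketch of that shape (dataTypeFilter and categoryFilter are hypothetical stand-ins for a single pair of filter values taken from policyFilters2):
// Plain equality on each field, combined with &&, stays within what the
// MongoDB LINQ provider can translate; string.Concat does not.
var test = _repo.All().Where(p => p.Rule.Any(r =>
    r.Target.AnyOf.Any(an => an.AllOf.Any(av =>
        av.Match != null
        && av.Match.AttributeValue.DataType == dataTypeFilter
        && av.Match.AttributeValue.Category == categoryFilter))));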

Filter records using Linq on an Enum type

I'm hoping this has a simple solution. I have a field (PressType) in a table (Stocks) that is seed-populated using an enum. The table stores the data as an integer. However, when I want to query some data via LINQ it gives me issues. I can filter any other field in the table using this format; however, on the enum-populated field it says
the "==" operator cannot be applied to operands of type "Models.PressType" and "string".
Any help you could give would be appreciated, thanks.
var test = db.Stocks.Where(x => x.PressType == myValue);
There is nothing wrong with your LINQ. Your problem is that myValue is of type string. You need to convert the string to the enum first:
string myValue = SomeControl.Text;
Models.PressType myValueAsEnum =
    (Models.PressType)Enum.Parse(typeof(Models.PressType), myValue);
IQueryable<Stock> test = db.Stocks.Where(x => x.PressType == myValueAsEnum);
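If the string comes from user input, a variant sketch with Enum.TryParse avoids the exception Enum.Parse throws on unrecognized values:
string myValue = SomeControl.Text;
// TryParse returns false instead of throwing when the text is not a valid enum name.
if (Enum.TryParse<Models.PressType>(myValue, out var myValueAsEnum))
{
    IQueryable<Stock> test = db.Stocks.Where(x => x.PressType == myValueAsEnum);
}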
