pyspark: Sort a PythonRDD using the object attribute - sorting

I have a result RDD, which was created using pyspark.mllib.fpm.
The result RDD looks like:
print(result)
result.take(5)
PythonRDD[32] at RDD at PythonRDD.scala:48
Out[18]:
[FreqSequence(sequence=[['John']], freq=18980),
FreqSequence(sequence=[['Mary']], freq=106),
FreqSequence(sequence=[['John-Mary']], freq=381),
FreqSequence(sequence=[['John-Ann']], freq=158),
FreqSequence(sequence=[['Ann']], freq=433)]
How do I sort the above result RDD based on the freq attribute? Thanks!

You can use the keyfunc argument of sortBy:
rdd.sortBy(lambda x: x.freq)
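For example, to list the most frequent sequences first, sortBy also takes an ascending flag; a minimal sketch, assuming result is the RDD shown above:
# sort by the freq attribute, largest first, and take the top 5
top5 = result.sortBy(lambda x: x.freq, ascending=False).take(5)
for fs in top5:
    print(fs.sequence, fs.freq)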

Related

pass dataframe column as parameter in xpath

I am using xpath in pyspark to extract values from XML that is stored as a column in a table.
The following works fine:
entity_id="D8"
dfquestionstep=df_source_xml.selectExpr("disclosure_entity_id",
f'xpath(**xml**,"*/entities/entity[#type=\'TI\']/entity[#type=\'UNDERWRITING\']/entity[#type=\'DISCLOSURES\']/entity[**#id=\'{entity_id}\'**]/entity[#type=\'DECISION_PATH\']/entity[#type=\'QUESTION_STEP\']/#id") QUESTION_STEP_ID'
)
PROBLEM
Now I want to pass disclosure_entity_id, which is a column in the dataframe with values like D8, D9, etc., in place of entity_id, i.e. entity[@id=disclosure_entity_id].
But all I get is [] as the result when I execute it like this, i.e. xpath fails to find anything.
Is there a way to pass the DF column directly as an argument to xpath like above?
Some testdata:
data = [
['a','<x><a>a1</a><b>b1</b><c>c1</c></x>'],
['b','<x><a>a2</a><b>b2</b><c>c2</c></x>'],
['c','<x><a>a3</a><b>b3</b><c>c3</c></x>'],
]
df = spark.createDataFrame(data, ['col','data'])
Attempt 1:
Creating a column that holds the XPath expression is straightforward:
from pyspark.sql import functions as f
df.withColumn('my_path', f.concat(f.lit('//'), f.col('col'))) \
    .selectExpr('xpath(data, my_path)').show()
But unfortunately the code above only yields the error message
AnalysisException: cannot resolve 'xpath(`data`, `my_path`)' due to data type mismatch:
path should be a string literal; line 1 pos 0;
The path parameter of the xpath function has to be a constant string. This string is parsed before Spark even looks at the data.
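For comparison, a constant path is accepted, because it can be parsed up front; a minimal sketch against the test data above (the alias a_values is just an illustrative name):
# a literal path is parsed at analysis time, so this works and returns
# an array of the matching text nodes per row
df.selectExpr("xpath(data, '//a/text()') AS a_values").show()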
Attempt 2:
Another option is to use a UDF and process the XPath expression with standard Python functions inside the UDF:
import xml.etree.ElementTree as ET
from pyspark.sql import types as T

def find_val(col, data):
    # find the first element whose tag matches the value in `col`
    result = ET.fromstring(data).find(f'.//{col}')
    if result is not None:
        return result.text

find_val_udf = f.udf(find_val, returnType=T.StringType())

df.select('col', 'data', find_val_udf('col', 'data')).show(truncate=False)
Output:
+---+----------------------------------+-------------------+
|col|data |find_val(col, data)|
+---+----------------------------------+-------------------+
|a |<x><a>a1</a><b>b1</b><c>c1</c></x>|a1 |
|b |<x><a>a2</a><b>b2</b><c>c2</c></x>|b2 |
|c |<x><a>a3</a><b>b3</b><c>c3</c></x>|c3 |
+---+----------------------------------+-------------------+
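The same pattern carries over to the original problem: substitute the per-row disclosure_entity_id into the path inside the UDF before evaluating it. A rough sketch only, assuming the XML column is named xml and that the entity nesting is exactly as shown in the question (ElementTree's limited XPath support does include [@attr='value'] predicates):
import xml.etree.ElementTree as ET
from pyspark.sql import functions as f, types as T

def find_question_step_ids(disclosure_entity_id, xml):
    # build the path for this row and collect the id attribute of every matching QUESTION_STEP
    path = (".//entities/entity[@type='TI']/entity[@type='UNDERWRITING']"
            "/entity[@type='DISCLOSURES']"
            f"/entity[@id='{disclosure_entity_id}']"
            "/entity[@type='DECISION_PATH']/entity[@type='QUESTION_STEP']")
    return [e.get('id') for e in ET.fromstring(xml).findall(path)]

find_ids_udf = f.udf(find_question_step_ids, T.ArrayType(T.StringType()))

dfquestionstep = df_source_xml.select(
    'disclosure_entity_id',
    find_ids_udf('disclosure_entity_id', 'xml').alias('QUESTION_STEP_ID'))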

InfluxDB Task to Calculate/Insert Value

I am trying to calculate/insert a value into my InfluxDB 2.0.8 on a regular interval. Can this be done with a task?
For an oversimplified example: how can I evaluate 1+1 once an hour and insert the result into the database?
I assume you are well aware of the general intrinsics of tasks and how to set one up. The only thing missing then is how to generate data rows that do not result from a query; the array.from() function comes in handy for this.
Here is an example (I did not try it):
import "array"
option task = {name: "oneplusone", every: 1h}
array.from(rows: [
{_time: now(), _measurement: "calculation", _field: "result", _value: 1+1}
])|> to(bucket: "mybucket")

SWIFT - String based Key-Value Array decoding?

I have a String-based key-value array inside of a String, and I want to decode it and assign the value to an existing array in Swift 4.2. For example:
var array: [String: String] = [:]
let stringToDecode = "[\"Hello\":\"World\", \"Key\":\"Value\"]"
// I want 'array' to be assigned
// to the value that is inside
// 'stringToDecode'
I’ve tried the JSON decoder, but it couldn’t decode it. Is there a simple way to do this? Thank you.
Try using a library like SwiftyJSON; it makes working with JSON much easier.

how to get objects which param contains substring in value using JSONata?

Here is the data:
[
  {"name": "Hello"},
  {"name": "World"},
  {"name": "Hello World"}
]
How do I build a proper JSONata query to get all entries whose name contains "World"?
I tried "'World' in name", but it returns undefined.
Thanks.
Use the $contains() function within a filter expression ($[...] filters the input array):
$[$contains(name, "World")]
See this in the JSONata Exerciser: http://try.jsonata.org/BJDPGXzEG

Converting multiple arrays into a single hash

I'm working on a configuration file parser and I need help parsing key: value pairs into a hash.
I have data in the form of: key: value key2: value2 another_key: another_value.
So far I have code in the form of
line = line.strip!.split(':\s+')
which returns an array in the form of
["key:value"]["key2: value2"]["another_key: another_value"]
How can I turn these arrays into a single hash in the form of
{key=>value, key2=>value2, another_key=>another_value}
I'm not sure if the key:value pairs need to be in the form of a string or not. Whatever is easiest to work with.
Thanks for your help!
This is the solution I found:
line = line.strip.split(':')
hash = Hash[*line]
which results in the output {"key"=>"value"}, {"key2"=>"value2"}
Very very close to Cary's solution:
Hash[*line.delete(':').split]
Even simpler:
Hash[*line.gsub(':',' ').split]
# => {"key"=>"value", "key2"=>"value2", "another_key"=>"another_value"}
Assuming the key and value are single words, I'd probably do something like this:
Hash[line.scan(/(\w+):\s?(\w+)/)]
You can change the regex if it's not quite what you are looking for.
