SparkR::gapply returns fewer rows than expected - sparkr

See the example below. I have a DataFrame with 2 columns and 1000 rows. Z simply adds 10 to one of the columns using gapply; the output is another SparkDataFrame with 1000 rows -- that's good. newZ does the same, but returns NULL when key == 10.
I would have expected the output to have 999 rows. Why is it fewer than that?
library(SparkR)
SparkR::sparkR.session()

sdf <- as.DataFrame(data.frame(x = 1:1000, y = 1), numPartitions = 10)

Z <- gapply(sdf, "x", function(key, d) {
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")
count(Z)
# [1] 1000

newZ <- gapply(sdf, "x", function(key, d) {
  if (as.integer(key[[1]]) == 10) return(NULL)
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")
count(newZ)
# [1] 993
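As a possible workaround while this stays unexplained (an untested sketch; newZ2 and newZ3 are just illustrative names), I can return a zero-row data.frame instead of NULL, or filter the group out before calling gapply, so the UDF never returns NULL at all:

# Sketch only: skip the key == 10 group without ever returning NULL.
newZ2 <- gapply(sdf, "x", function(key, d) {
  if (as.integer(key[[1]]) == 10) {
    return(data.frame(x = integer(0), newy = integer(0)))  # zero rows, same columns
  }
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")

# Or drop the unwanted rows before grouping at all.
newZ3 <- gapply(filter(sdf, sdf$x != 10), "x", function(key, d) {
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")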
Some Spark config:
> sparkR.conf()
$eventLog.rolloverIntervalSeconds
[1] "3600"
$spark.akka.frameSize
[1] "256"
$spark.app.name
[1] "Databricks Shell"
$spark.databricks.cloudProvider
[1] "Azure"
$spark.databricks.clusterUsageTags.clusterMaxWorkers
[1] "12"
$spark.databricks.clusterUsageTags.clusterMetastoreAccessType
[1] "RDS_DIRECT"
$spark.databricks.clusterUsageTags.clusterMinWorkers
[1] "2"
$spark.databricks.clusterUsageTags.clusterPythonVersion
[1] "3"
$spark.databricks.clusterUsageTags.clusterResourceClass
[1] "Serverless"
$spark.databricks.clusterUsageTags.clusterScalingType
[1] "autoscaling"
$spark.databricks.clusterUsageTags.clusterTargetWorkers
[1] "2"
$spark.databricks.clusterUsageTags.clusterWorkers
[1] "2"
$spark.databricks.clusterUsageTags.driverNodeType
[1] "Standard_E8s_v3"
$spark.databricks.clusterUsageTags.enableElasticDisk
[1] "true"
$spark.databricks.clusterUsageTags.numPerClusterInitScriptsV2
[1] "1"
$spark.databricks.clusterUsageTags.sparkVersion
[1] "latest-stable-scala2.11"
$spark.databricks.clusterUsageTags.userProvidedRemoteVolumeCount
[1] "0"
$spark.databricks.clusterUsageTags.userProvidedRemoteVolumeSizeGb
[1] "0"
$spark.databricks.delta.multiClusterWrites.enabled
[1] "true"
$spark.databricks.driverNodeTypeId
[1] "Standard_E8s_v3"
$spark.databricks.r.cleanWorkspace
[1] "true"
$spark.databricks.workerNodeTypeId
[1] "Standard_DS13_v2"
$spark.driver.maxResultSize
[1] "4g"
$spark.eventLog.enabled
[1] "false"
$spark.executor.id
[1] "driver"
$spark.executor.memory
[1] "40658m"
$spark.hadoop.databricks.dbfs.client.version
[1] "v2"
$spark.hadoop.fs.s3a.connection.maximum
[1] "200"
$spark.hadoop.fs.s3a.multipart.size
[1] "10485760"
$spark.hadoop.fs.s3a.multipart.threshold
[1] "104857600"
$spark.hadoop.fs.s3a.threads.max
[1] "136"
$spark.hadoop.fs.wasb.impl.disable.cache
[1] "true"
$spark.hadoop.fs.wasbs.impl
[1] "shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem"
$spark.hadoop.fs.wasbs.impl.disable.cache
[1] "true"
$spark.hadoop.hive.server2.idle.operation.timeout
[1] "7200000"
$spark.hadoop.hive.server2.idle.session.timeout
[1] "900000"
$spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
[1] "2"
$spark.hadoop.parquet.memory.pool.ratio
[1] "0.5"
$spark.home
[1] "/databricks/spark"
$spark.logConf
[1] "true"
$spark.r.numRBackendThreads
[1] "1"
$spark.rdd.compress
[1] "true"
$spark.scheduler.mode
[1] "FAIR"
$spark.serializer.objectStreamReset
[1] "100"
$spark.shuffle.manager
[1] "SORT"
$spark.shuffle.memoryFraction
[1] "0.2"
$spark.shuffle.reduceLocality.enabled
[1] "false"
$spark.shuffle.service.enabled
[1] "true"
$spark.sql.catalogImplementation
[1] "hive"
$spark.sql.hive.convertCTAS
[1] "true"
$spark.sql.hive.convertMetastoreParquet
[1] "true"
$spark.sql.hive.metastore.jars
[1] "/databricks/hive/*"
$spark.sql.hive.metastore.version
[1] "0.13.0"
$spark.sql.parquet.cacheMetadata
[1] "true"
$spark.sql.parquet.compression.codec
[1] "snappy"
$spark.sql.ui.retainedExecutions
[1] "100"
$spark.sql.warehouse.dir
[1] "/user/hive/warehouse"
$spark.storage.blockManagerTimeoutIntervalMs
[1] "300000"
$spark.storage.memoryFraction
[1] "0.5"
$spark.streaming.driver.writeAheadLog.allowBatching
[1] "true"
$spark.task.reaper.enabled
[1] "true"
$spark.task.reaper.killTimeout
[1] "60s"
$spark.worker.cleanup.enabled
[1] "false"

Related

Ruby: fill elements of an array into a set of nested arrays and remove duplicates

I have the arrays months and monthly_doc_count_for_topic.
months = ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
monthly_doc_count_for_topic = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694"]
]
goal = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
]
I'd like to fill the elements of the array months into the arrays inside monthly_doc_count_for_topic so it looks like the array goal.
My attempt:
monthly_doc_count_for_topic.map do |topic_set|
  months.each { |month| topic_set << month }
end
But I'm getting:
=> [
[0] [
[0] "2019-01-01",
[1] "2019-02-01",
[2] "2019-03-01",
[3] "2019-04-01"
],
[1] [
[0] "2019-01-01",
[1] "2019-02-01",
[2] "2019-03-01",
[3] "2019-04-01"
]
]
It's not appending to the values in monthly_doc_count_for_topic; instead it's replacing them with the elements of months. How can I modify my code to get output like the goal array? Thank you very much!
In your attempt, replace
monthly_doc_count_for_topic.map
with
monthly_doc_count_for_topic.each
and it works fine:
goal = monthly_doc_count_for_topic.each do |topic_set|
  months.each { |month| topic_set << month }
end
But I'd prefer CarySwoveland's solution from the comments; it's less verbose:
monthly_doc_count_for_topic.map { |topic_set| topic_set + months }
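To spell out why the original map version printed months twice (a quick sketch using nothing beyond the arrays above): map collects each block's return value, and months.each returns months itself, so the mutated topic_set arrays never make it into the result. The non-mutating version sidesteps this:

months = ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
monthly_doc_count_for_topic = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694"]
]

# Array#+ builds a new array, so the originals stay untouched and
# map collects exactly the concatenated arrays we want.
goal = monthly_doc_count_for_topic.map { |topic_set| topic_set + months }
goal.first
# => ["foo", "2019-02-01: 186904", "2019-03-01: 196961",
#     "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]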

My jsonpath expression returns two results when I expect just one

This piece of Ruby code returns [1,1], but I expect to get just [1]. If I put the same text and jsonpath expression through http://jsonpath.com then I get [1]. Is this a bug in the 'jsonpath' gem?
require 'jsonpath'
string = <<-HERE_DOC
[
{"processId":1,"process":"XX"},
{"processId":2,"process":"YY"}
]
HERE_DOC
jsonpath = "$..[?(#.process=='XX')].processId"
path = JsonPath.new(jsonpath)
result = path.on(string)
puts "result: #{result}"
It seems that the problem is the extra dot in your jsonpath expression; without it, both behave the same way. You only need to go down one step:
[1] pry(main)> require 'jsonpath'
=> true
[2] pry(main)> jsonpath = "$.[?(@.process=='XX')].processId"
=> "$.[?(@.process=='XX')].processId"
[3] pry(main)> path = JsonPath.new(jsonpath)
=> #<JsonPath:0x00007f8c5bf42f10
 @opts={},
 @path=["$", "[?(@.process=='XX')]", "['processId']"]>
[4] pry(main)> string = <<-HERE_DOC
[4] pry(main)* [
[4] pry(main)* {"processId":1,"process":"XX"},
[4] pry(main)* {"processId":2,"process":"YY"}
[4] pry(main)* ]
[4] pry(main)* HERE_DOC
=> "[\n {\"processId\":1,\"process\":\"XX\"},\n {\"processId\":2,\"process\":\"YY\"}\n]\n"
[5] pry(main)> result = path.on(string)
=> [1]
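For completeness, the same fix outside of pry (same gem and document as above; only the leading $.. is changed to $.):

require 'jsonpath'

string = <<-HERE_DOC
[
  {"processId":1,"process":"XX"},
  {"processId":2,"process":"YY"}
]
HERE_DOC

# Descend exactly one level into the top-level array before filtering,
# instead of the recursive descent ($..) that produced the duplicate.
path = JsonPath.new("$.[?(@.process=='XX')].processId")
puts path.on(string).inspect # => [1]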

Calling flatten on a hash in Ruby: oddities

Say I have the following hash:
error_hash = {
  :base => [
    ["Address is required to activate"]
  ]
}
Are these results odd?
[18] pry(#<Api::UsersController>)> error_hash.flatten
[
[0] :base,
[1] [
[0] [
[0] "Address is required to activate"
]
]
]
[19] pry(#<Api::UsersController>)> error_hash.flatten(1)
[
[0] :base,
[1] [
[0] [
[0] "Address is required to activate"
]
]
]
[20] pry(#<Api::UsersController>)> error_hash.flatten(2)
[
[0] :base,
[1] [
[0] "Address is required to activate"
]
]
[21] pry(#<Api::UsersController>)> error_hash.flatten(3)
[
[0] :base,
[1] "Address is required to activate"
]
I would have expected .flatten to be equal to .flatten(3); in other words, I would have expected .flatten to flatten recursively until everything was in a single array.
Why would you expect flatten to act recursively when the documentation suggests otherwise? Hash#flatten defaults to a depth of 1, unlike Array#flatten, which is fully recursive by default.
You can extend the capability of Hash with the following:
class Hash
  def flatten_deepest
    self.each_with_object({}) do |(key, val), h|
      if val.is_a? Hash
        val.flatten_deepest.map do |hash_key, hash_val|
          h["#{key}.#{hash_key}".to_sym] = hash_val
        end
      else
        h[key] = val
      end
    end
  end
end
and then do:
error_hash.flatten_deepest
I think you got the idea.
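As a quick check of what the corrected method does (the nested hash below is just an illustrative example, not from the question):

nested = { a: { b: 1, c: { d: 2 } }, e: 3 }

# Nested hash keys are joined with dots; non-hash values pass through.
nested.flatten_deepest
# => {:"a.b"=>1, :"a.c.d"=>2, :e=>3}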

Ruby - Prevent auto escape characters

I have, e.g., r = "\t" and a = "thisisabigbad\wolf".
How can I prevent Ruby from auto-escaping my string and also count the \ at the same time?
a.count r #=> this should return 2 instead of 0
I wish to call a.count(r) and receive 2.
You can use single quotes:
[17] pry(main)> r = '\t'
=> "\\t"
[18] pry(main)> r.size
=> 2
[20] pry(main)> a = 'thisisabigbad\wolf'
=> "thisisabigbad\\wolf"
[21] pry(main)> a.size
=> 18
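Equivalently, you can keep double quotes and escape the backslash yourself; a small sketch (same strings as above, nothing new assumed):

r = "\\t"                   # two characters: a backslash and a "t"
a = "thisisabigbad\\wolf"   # 18 characters, with a literal backslash

r == '\t'                   # => true
a == 'thisisabigbad\wolf'   # => true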

How do I split the responses from an SNMP array?

I've got the following sample response from a system when walking the tree:
[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]
I've got two distinct sets of data here. I don't know how many rows I will eventually get, and as you can also see, the first pair of values is also part of the OID.
Printing them nicely obviously tidies it up, but if I want to use them one set per line, what's the best way to split them?
I might get up to eight distinct sets of values that I'll have to work with, so each line would be for example:
8650, QNewsAK, 46835255, 8442357, 0
Which are the "ID", "Name", "Size", "Free", and "Status", where status is ordinarily non-zero.
Here's a starting point using group_by to do the heavy-lifting:
SNMP_RESPONSE = [
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]',
'[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]',
]
SNMP_RESPONSE.group_by{ |s| s.split(',').first[/\d+$/] }
Which returns:
{
"8650" => [
[0] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]",
[1] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]",
[2] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]",
[3] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]",
[4] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]"
],
"8651" => [
[0] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]",
[1] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]",
[2] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]",
[3] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]",
[4] "[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]"
]
}
The hash can be manipulated further:
groups = SNMP_RESPONSE.group_by{ |s| s.split(',').first[/\d+$/] }
values = groups.map{ |key, ary| ary.map{ |s| s[/value=(\S+)/, 1] } }
values looks like:
[
[0] [
[0] "8650",
[1] "QNewsAK",
[2] "46835255",
[3] "8442357",
[4] "0"
],
[1] [
[0] "8651",
[1] "QSuite4AK",
[2] "11041721",
[3] "5717570",
[4] "0"
]
]
A bit more massaging gives:
puts values.map{ |a| a.join(', ') }
Which outputs:
8650, QNewsAK, 46835255, 8442357, 0
8651, QSuite4AK, 11041721, 5717570, 0
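If named fields are handier than positional values, one more step works on the values array above (the FIELDS list is an assumption, taken from the field names mentioned in the question):

# Field names as described in the question, in the assumed column order.
FIELDS = %w[ID Name Size Free Status]

records = values.map { |row| FIELDS.zip(row).to_h }
records.first
# => {"ID"=>"8650", "Name"=>"QNewsAK", "Size"=>"46835255", "Free"=>"8442357", "Status"=>"0"}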
