SparkR::gapply returns fewer rows than expected
See the example below. I have a SparkDataFrame with 2 columns and 1000 rows. Z simply adds 10 to one of the columns using gapply; the output is another SparkDataFrame with 1000 rows -- that's good. newZ does the same, but returns NULL when key == 10.
I would have expected the output to have 999 rows. Why is it fewer than that?
library(SparkR)
SparkR::sparkR.session()

sdf <- as.DataFrame(data.frame(x = 1:1000, y = 1), numPartitions = 10)

Z <- gapply(sdf, 'x', function(key, d) {
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")
count(Z)
# [1] 1000

newZ <- gapply(sdf, 'x', function(key, d) {
  if (as.integer(key[[1]]) == 10) return(NULL)
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")
count(newZ)
# [1] 993
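Two rough sketches for narrowing this down (hypothetical diagnostics, not verified on this cluster; newZ2 and missing_keys are made-up names, and only the standard SparkR functions gapply, except, select, collect and count are assumed):

# Sketch 1 (assumption: the NULL return is what eats the extra groups):
# return an empty, schema-compatible data.frame instead of NULL.
# If that assumption holds, this variant should count 999.
newZ2 <- gapply(sdf, 'x', function(key, d) {
  if (as.integer(key[[1]]) == 10) {
    return(data.frame(x = integer(0), newy = integer(0)))
  }
  data.frame(x = key[[1]], newy = d$y + 10)
}, schema = "x int, newy int")
count(newZ2)

# Sketch 2: list the keys present in sdf but missing from newZ.
# Only x == 10 should show up; any other keys were dropped silently.
missing_keys <- except(select(sdf, "x"), select(newZ, "x"))
collect(missing_keys)

If sketch 2 turns up keys other than 10, the missing rows are not tied to the NULL branch at all.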
Some Spark config:
> sparkR.conf()
$eventLog.rolloverIntervalSeconds
[1] "3600"
$spark.akka.frameSize
[1] "256"
$spark.app.name
[1] "Databricks Shell"
$spark.databricks.cloudProvider
[1] "Azure"
$spark.databricks.clusterUsageTags.clusterMaxWorkers
[1] "12"
$spark.databricks.clusterUsageTags.clusterMetastoreAccessType
[1] "RDS_DIRECT"
$spark.databricks.clusterUsageTags.clusterMinWorkers
[1] "2"
$spark.databricks.clusterUsageTags.clusterPythonVersion
[1] "3"
$spark.databricks.clusterUsageTags.clusterResourceClass
[1] "Serverless"
$spark.databricks.clusterUsageTags.clusterScalingType
[1] "autoscaling"
$spark.databricks.clusterUsageTags.clusterTargetWorkers
[1] "2"
$spark.databricks.clusterUsageTags.clusterWorkers
[1] "2"
$spark.databricks.clusterUsageTags.driverNodeType
[1] "Standard_E8s_v3"
$spark.databricks.clusterUsageTags.enableElasticDisk
[1] "true"
$spark.databricks.clusterUsageTags.numPerClusterInitScriptsV2
[1] "1"
$spark.databricks.clusterUsageTags.sparkVersion
[1] "latest-stable-scala2.11"
$spark.databricks.clusterUsageTags.userProvidedRemoteVolumeCount
[1] "0"
$spark.databricks.clusterUsageTags.userProvidedRemoteVolumeSizeGb
[1] "0"
$spark.databricks.delta.multiClusterWrites.enabled
[1] "true"
$spark.databricks.driverNodeTypeId
[1] "Standard_E8s_v3"
$spark.databricks.r.cleanWorkspace
[1] "true"
$spark.databricks.workerNodeTypeId
[1] "Standard_DS13_v2"
$spark.driver.maxResultSize
[1] "4g"
$spark.eventLog.enabled
[1] "false"
$spark.executor.id
[1] "driver"
$spark.executor.memory
[1] "40658m"
$spark.hadoop.databricks.dbfs.client.version
[1] "v2"
$spark.hadoop.fs.s3a.connection.maximum
[1] "200"
$spark.hadoop.fs.s3a.multipart.size
[1] "10485760"
$spark.hadoop.fs.s3a.multipart.threshold
[1] "104857600"
$spark.hadoop.fs.s3a.threads.max
[1] "136"
$spark.hadoop.fs.wasb.impl.disable.cache
[1] "true"
$spark.hadoop.fs.wasbs.impl
[1] "shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem"
$spark.hadoop.fs.wasbs.impl.disable.cache
[1] "true"
$spark.hadoop.hive.server2.idle.operation.timeout
[1] "7200000"
$spark.hadoop.hive.server2.idle.session.timeout
[1] "900000"
$spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version
[1] "2"
$spark.hadoop.parquet.memory.pool.ratio
[1] "0.5"
$spark.home
[1] "/databricks/spark"
$spark.logConf
[1] "true"
$spark.r.numRBackendThreads
[1] "1"
$spark.rdd.compress
[1] "true"
$spark.scheduler.mode
[1] "FAIR"
$spark.serializer.objectStreamReset
[1] "100"
$spark.shuffle.manager
[1] "SORT"
$spark.shuffle.memoryFraction
[1] "0.2"
$spark.shuffle.reduceLocality.enabled
[1] "false"
$spark.shuffle.service.enabled
[1] "true"
$spark.sql.catalogImplementation
[1] "hive"
$spark.sql.hive.convertCTAS
[1] "true"
$spark.sql.hive.convertMetastoreParquet
[1] "true"
$spark.sql.hive.metastore.jars
[1] "/databricks/hive/*"
$spark.sql.hive.metastore.version
[1] "0.13.0"
$spark.sql.parquet.cacheMetadata
[1] "true"
$spark.sql.parquet.compression.codec
[1] "snappy"
$spark.sql.ui.retainedExecutions
[1] "100"
$spark.sql.warehouse.dir
[1] "/user/hive/warehouse"
$spark.storage.blockManagerTimeoutIntervalMs
[1] "300000"
$spark.storage.memoryFraction
[1] "0.5"
$spark.streaming.driver.writeAheadLog.allowBatching
[1] "true"
$spark.task.reaper.enabled
[1] "true"
$spark.task.reaper.killTimeout
[1] "60s"
$spark.worker.cleanup.enabled
[1] "false"
Related
Ruby: fill elements of an array into a set of nested arrays and remove duplicates
I have the arrays months and monthly_doc_count_for_topic:

months = ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]

monthly_doc_count_for_topic = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694"]
]

goal = [
  ["foo", "2019-02-01: 186904", "2019-03-01: 196961", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
  ["bar", "2019-01-01: 8876", "2019-04-01: 8694", "2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
]

I'd like to fill the elements of months into the arrays inside monthly_doc_count_for_topic so that it looks like goal. My attempt:

monthly_doc_count_for_topic.map do |topic_set|
  months.each { |month| topic_set << month }
end

But I'm getting:

[
  ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"],
  ["2019-01-01", "2019-02-01", "2019-03-01", "2019-04-01"]
]

It's not appending the values to monthly_doc_count_for_topic; instead it seems to replace them with the elements of months. How can I modify my code to achieve the goal output? Thank you very much!
In your attempt, replace monthly_doc_count_for_topic.map with monthly_doc_count_for_topic.each and it works perfectly fine:

goal = monthly_doc_count_for_topic.each do |topic_set|
  months.each { |month| topic_set << month }
end

But I'd prefer CarySwoveland's solution in the comments; it's less verbose:

monthly_doc_count_for_topic.map { |topic_set| topic_set + months }
My jsonpath expression returns two results when I expect just one
This piece of Ruby code returns [1, 1], but I expect to get just [1]. If I put the same text and JSONPath expression through http://jsonpath.com, I get [1]. Is this a bug in the 'jsonpath' gem?

require 'jsonpath'

string = <<-HERE_DOC
[
 {"processId":1,"process":"XX"},
 {"processId":2,"process":"YY"}
]
HERE_DOC

jsonpath = "$..[?(@.process=='XX')].processId"
path = JsonPath.new(jsonpath)
result = path.on(string)
puts "result: #{result}"
It seems the problem is the extra dot in your JSONPath expression; without it, both behave the same. You only need to go down one level:

[1] pry(main)> require 'jsonpath'
=> true
[2] pry(main)> jsonpath = "$.[?(@.process=='XX')].processId"
=> "$.[?(@.process=='XX')].processId"
[3] pry(main)> path = JsonPath.new(jsonpath)
=> #<JsonPath:0x00007f8c5bf42f10 @opts={}, @path=["$", "[?(@.process=='XX')]", "['processId']"]>
[4] pry(main)> string = <<-HERE_DOC
[4] pry(main)* [
[4] pry(main)*  {"processId":1,"process":"XX"},
[4] pry(main)*  {"processId":2,"process":"YY"}
[4] pry(main)* ]
[4] pry(main)* HERE_DOC
=> "[\n {\"processId\":1,\"process\":\"XX\"},\n {\"processId\":2,\"process\":\"YY\"}\n]\n"
[5] pry(main)> result = path.on(string)
=> [1]
Calling flatten on a hash in Ruby: oddities
Say I have the following hash:

error_hash = { :base => [["Address is required to activate"]] }

Are these results odd?

[18] pry(#<Api::UsersController>)> error_hash.flatten
[:base, [["Address is required to activate"]]]
[19] pry(#<Api::UsersController>)> error_hash.flatten(1)
[:base, [["Address is required to activate"]]]
[20] pry(#<Api::UsersController>)> error_hash.flatten(2)
[:base, ["Address is required to activate"]]
[21] pry(#<Api::UsersController>)> error_hash.flatten(3)
[:base, "Address is required to activate"]

I would have expected .flatten to be equal to .flatten(3); in other words, I would have expected .flatten to flatten recursively until everything was in a single array.
Why would you expect flatten to act recursively when the documentation suggests otherwise? You can extend Hash with something like the following:

class Hash
  def flatten_deepest
    self.each_with_object({}) do |(key, val), h|
      if val.is_a? Hash
        val.flatten_deepest.map do |hash_key, hash_val|
          h["#{key}.#{hash_key}".to_sym] = hash_val
        end
      else
        h[key] = val
      end
    end
  end
end

and then do:

error_hash.flatten_deepest

I think you get the idea.
Ruby - Prevent auto-escaping of characters
I have, e.g.:

r = "\t"
a = "thisisabigbad\wolf"

How can I prevent Ruby from auto-escaping my string and also count the \ at the same time?

a.count r  #=> this should return 2 instead of 0

I wish to do a.count r and receive 2.
You can use single quotes:

[17] pry(main)> r = '\t'
=> "\\t"
[18] pry(main)> r.size
=> 2
[20] pry(main)> a = 'thisisabigbad\wolf'
=> "thisisabigbad\\wolf"
[21] pry(main)> a.size
=> 18
How do I split the responses from an SNMP array?
I've got the following sample response from a system when walking the tree:

[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]
[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]

I've got two distinct sets of data here. I don't know how many rows I will eventually get, and as you can also see, the first pair of values is also part of the OID. Printing them nicely obviously tidies it up, but if I want to use them once on each line, what's the best way to split it? I might get up to eight distinct sets of values that I'll have to work with, so each line would be, for example:

8650, QNewsAK, 46835255, 8442357, 0

which are the "ID", "Name", "Size", "Free", and "Status", where Status is ordinarily non-zero.
Here's a starting point using group_by to do the heavy lifting:

SNMP_RESPONSE = [
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]',
  '[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]',
]

SNMP_RESPONSE.group_by{ |s| s.split(',').first[/\d+$/] }

Which returns:

{
  "8650" => [
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8650, value=8650 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8650, value=QNewsAK (OCTET STRING)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8650, value=46835255 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8650, value=8442357 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8650, value=0 (INTEGER)]"
  ],
  "8651" => [
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.1.8651, value=8651 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.2.8651, value=QSuite4AK (OCTET STRING)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.3.8651, value=11041721 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.4.8651, value=5717570 (INTEGER)]",
    "[name=1.3.6.1.4.1.15248.2.5.1.3.1.5.8651, value=0 (INTEGER)]"
  ]
}

The hash can be manipulated further:

groups = SNMP_RESPONSE.group_by{ |s| s.split(',').first[/\d+$/] }
values = groups.map{ |key, ary| ary.map{ |s| s[/value=(\S+)/, 1] } }

values looks like:

[
  ["8650", "QNewsAK", "46835255", "8442357", "0"],
  ["8651", "QSuite4AK", "11041721", "5717570", "0"]
]

A bit more massaging gives:

puts values.map{ |a| a.join(', ') }

Which outputs:

8650, QNewsAK, 46835255, 8442357, 0
8651, QSuite4AK, 11041721, 5717570, 0