I have made a small implementation to read Parquet files and store the items in a cache. So I wrote:
import scala.collection.JavaConverters._

val df = sqlContext.read.
  parquet(hdfsFolder).
  select("a", "b", "c", "d", "e", "f")
val columnsSeq = Seq("a", "b", "c", "d", "e", "f")
val values = df.map(row => (row.getAs[String]("a"), row.getValuesMap(columnsSeq))).
  groupByKey(1024).
  map(row => (row._1, row._2.toList.asJava))
//put them into cache
val igniteContext = new IgniteContext(sc, cacheConfigPath)
val sharedRdd = igniteContext.fromCache(cacheName)
sharedRdd.savePairs(values)
But the last line, "sharedRdd.savePairs(values)", gives the compile error:
found   : org.apache.spark.rdd.RDD[(String, java.util.List[Map[String,Nothing]])]
required: org.apache.spark.rdd.RDD[(Nothing, Nothing)]
Note: (String, java.util.List[Map[String,Nothing]]) >: (Nothing, Nothing), but class RDD is invariant in type T.
You may wish to define T as -T instead. (SLS 4.5)
    sharedRdd.savePairs(values)
I could not find a way to overcome this error.
Any ideas?
You should create the IgniteRDD with proper typing:
val sharedRdd = igniteContext.fromCache[String, java.util.List[Map[String,Nothing]]](cacheName)
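With both type parameters specified, the key/value types of the cache RDD match the element type of values (RDD[(String, java.util.List[Map[String,Nothing]])]), so savePairs no longer sees the default (Nothing, Nothing) types and the call compiles.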
I have this dataframe
structure(list(plate = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "C", "C", "C", "C", "C"), marker = c("IL-1", "IL-2",
"IL-3", "IL-4", "IL-5", "IL-1", "IL-2", "IL-3", "IL-4", "IL-5",
"IL-1", "IL-2", "IL-3", "IL-4", "IL-5"), sample = c(1, 2, 3,
4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), result = c(1.94000836127381,
0.426353706529969, 2.07418521661429, 1.58200029160696, 0.812685661255674,
0.546681932009987, 0.199532997122114, 0.100208840148698, 0.720956738045624,
0.444814277410285, 2.25080298569014, 1.61429066532657, 1.1066027850052,
0.927880542016121, 4.1487948134003), LOD = c(0.810456546400942,
0.614177278086376, 0.98739611371029, 0.315142822914328, 0.221497734151459,
0.0191136249820546, 0.364139946842526, 0.983763479804491, 0.982034953153209,
0.851687364910033, 0.893324689832074, 0.978609354294382, 0.62613140416969,
0.0310439168600307, 0.729966088361143)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -15L))
As you can see, I have different LOD values for each marker in each plate. So I calculate the mean LOD for each marker using
lod <- dummy_2 |>
  group_by(marker) |>
  summarise(lod = mean(LOD))
which results in the following mean LOD per marker for all plates
structure(list(marker = c("IL-1", "IL-2", "IL-3", "IL-4", "IL-5"
), lod = c(0.57429828707169, 0.652308859741095, 0.865763665894824,
0.442740564309189, 0.601050395807545)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -5L))
So far so good. Now I want to check whether the result for each marker is above or below the mean LOD. If the result is above the mean LOD it must not be changed; if it is below the mean LOD it must be changed to LOD/2.
I tried to use a for loop and mutate in combination with ifelse, but that did not work. I also looked at the across function, but that did not work either. My latest try was...
marker <- unique(dummy_2$marker)
for (i in marker){
dummy_2 <- mutate(result = ifelse(i %in% dummy_2$result < dummy_2$LOD, (i %in% lod$LOD)/2), dummy_2$result)}
Is a for loop the right way to go, or is there a better solution?
Any help would be appreciated.
I already found a solution by creating a new dataframe with the mean values and joining it to my dataset with left_join. I would still like to know whether this is possible with a for loop, but for now the problem is solved.
I am using PyYAML to load a (large) YAML file which has duplicate keys. I would like to preserve all keys and modify the duplicate keys according to project need. But it seems PyYAML silently overwrites entries with the last key, and I never get a chance to modify them (loss of information), resulting in this dict: {'blocks': {'a': 'b2:11 c2:22'}}
simple example YAML:
import yaml

given_str = '''
blocks:
  a:
    b1:1
    c1:2
  a:
    b2:11
    c2:22'''
p = yaml.load(given_str, Loader=yaml.SafeLoader)
How can I load the YAML with duplicate keys so that I get a chance to recursively traverse it and modify the keys as needed? I need to load the YAML and then transfer it into a database.
Assuming your input YAML has no merge keys ('<<'), no tags and no comments you want
to preserve, you can use the following:
import sys
from pathlib import Path
from typing import Any, Dict
from collections.abc import Hashable

import ruamel.yaml
from ruamel.yaml.constructor import ConstructorError

file_in = Path('input.yaml')

class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
    def construct_mapping(self, node, deep=False):
        """deep is True when creating an object/mapping recursively,
        in that case want the underlying elements available during construction
        """
        if not isinstance(node, ruamel.yaml.nodes.MappingNode):
            raise ConstructorError(
                None, None, f'expected a mapping node, but found {node.id!s}', node.start_mark,
            )
        total_mapping = self.yaml_base_dict_type()
        if getattr(node, 'merge', None) is not None:
            todo = [(node.merge, False), (node.value, False)]
        else:
            todo = [(node.value, True)]
        for values, check in todo:
            mapping: Dict[Any, Any] = self.yaml_base_dict_type()
            for key_node, value_node in values:
                # keys can be list -> deep
                key = self.construct_object(key_node, deep=True)
                # lists are not hashable, but tuples are
                if not isinstance(key, Hashable):
                    if isinstance(key, list):
                        key = tuple(key)
                if not isinstance(key, Hashable):
                    raise ConstructorError(
                        'while constructing a mapping',
                        node.start_mark,
                        'found unhashable key',
                        key_node.start_mark,
                    )
                value = self.construct_object(value_node, deep=deep)
                if key in mapping:
                    pat = key + '_undup_{}'
                    index = 0
                    while True:
                        nkey = pat.format(index)
                        if nkey not in mapping:
                            key = nkey
                            break
                        index += 1
                mapping[key] = value
            total_mapping.update(mapping)
        return total_mapping
yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False
yaml.Constructor = MyConstructor
data = yaml.load(file_in)
yaml.dump(data, sys.stdout)
which gives:
blocks:
  a: b1:1 c1:2
  a_undup_0: b2:11 c2:22
Please note that the values for both a keys are multiline plain scalars. For b1 and c1 to be keys,
the mapping value indicator (:, the colon) needs to be followed by a whitespace character:
a:
  b1: 1
  c1: 2
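As an illustration, here is a minimal sketch (my own addition, not part of the original answer; the corrected string and the yaml2 object are made up, reusing the MyConstructor class defined above) of what the loader then returns once the colons are followed by spaces:
# Hypothetical follow-up: load a corrected version of the YAML with the same
# MyConstructor; each 'a' now maps to a nested dict, and the second occurrence
# is renamed instead of overwriting the first.
corrected = """\
blocks:
  a:
    b1: 1
    c1: 2
  a:
    b2: 11
    c2: 22
"""
yaml2 = ruamel.yaml.YAML(typ='safe')
yaml2.Constructor = MyConstructor
data = yaml2.load(corrected)
print(data)
# roughly: {'blocks': {'a': {'b1': 1, 'c1': 2}, 'a_undup_0': {'b2': 11, 'c2': 22}}}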
After reading many forums, I think the best solution is to create a wrapper for the YAML loader that handles the duplicates. @Anthon - any comment?
import yaml
from collections import defaultdict, Counter
####### Preserving Duplicate ###################
def parse_preserving_duplicates(input_file):
    class PreserveDuplicatesLoader(yaml.CLoader):
        pass

    def map_constructor(loader, node, deep=False):
        """Walk tree, removing degeneracy in any duplicate keys"""
        keys = [loader.construct_object(key_node, deep=deep) for key_node, _ in node.value]
        vals = [loader.construct_object(value_node, deep=deep) for _, value_node in node.value]
        key_count = Counter(keys)
        data = defaultdict(dict)  # map all data, renaming duplicates
        c = 0
        for key, value in zip(keys, vals):
            if key_count[key] > 1:
                data[f'{key}{c}'] = value
                c += 1
            else:
                data[key] = value
        return data

    PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
                                             map_constructor)
    return yaml.load(input_file, PreserveDuplicatesLoader)
##########################################################
with open(inputf, 'r') as file:
    fp = parse_preserving_duplicates(file)
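A quick usage sketch (my own addition, not part of the original follow-up: it assumes the given_str from the question is in scope and that yaml.CLoader is available, i.e. libyaml is installed; otherwise subclass yaml.SafeLoader in the wrapper instead):
import io

# Hypothetical check against the string from the question: the duplicate 'a'
# keys come back renamed instead of being silently overwritten.
parsed = parse_preserving_duplicates(io.StringIO(given_str))
print(dict(parsed['blocks']))
# roughly: {'a0': 'b1:1 c1:2', 'a1': 'b2:11 c2:22'}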
The documentation says "delete cannot work with partial keys". What do you recommend to solve this? For example, create a new index, delete in a loop, or some other way?
You can delete values in a loop using a primary key.
#!/usr/bin/env tarantool
local json = require('json')
local function key_from_tuple(tuple, key_parts)
    local key = {}
    for _, part in ipairs(key_parts) do
        table.insert(key, tuple[part.fieldno] or box.NULL)
    end
    return key
end
box.cfg{}
box.once('init', function()
    box.schema.space.create('s')
    box.space.s:create_index('pk')
    box.space.s:create_index('sk', {
        unique = false,
        parts = {
            {2, 'number'},
            {3, 'number'},
        }
    })
end)
box.space.s:truncate()
box.space.s:insert{1, 1, 1}
box.space.s:insert{2, 1, 1}
print('before delete')
print('---')
box.space.s:pairs():each(function(tuple)
    print(json.encode(tuple))
end)
print('...')
local key_parts = box.space.s.index.pk.parts
for _, tuple in box.space.s.index.sk:pairs({1}) do
    local key = key_from_tuple(tuple, key_parts)
    box.space.s.index.pk:delete(key)
end
print('after delete')
print('---')
box.space.s:pairs():each(function(tuple)
    print(json.encode(tuple))
end)
print('...')
os.exit()
In the example above the common case is handled using the key_from_tuple function. Things may be simpler when you know which fields form the primary key. Say, if it is the first field:
for _, tuple in box.space.s.index.sk:pairs({1}) do
    box.space.s.index.pk:delete(tuple[1])
end
The new key_def module that was added in tarantool-2.2.0-255-g22db9c264 (not released yet, but available from our 2.2 repository) simplifies extracting a key from a tuple, especially in the case of JSON path indexes:
#!/usr/bin/env tarantool
local json = require('json')
local key_def_lib = require('key_def')
box.cfg{}
box.once('init', function()
    box.schema.space.create('s')
    box.space.s:create_index('pk')
    box.space.s:create_index('sk', {
        unique = false,
        parts = {
            {2, 'number', path = 'a'},
            {2, 'number', path = 'b'},
        }
    })
end)
box.space.s:truncate()
box.space.s:insert{1, {a = 1, b = 1}}
box.space.s:insert{2, {a = 1, b = 2}}
print('before delete')
print('---')
box.space.s:pairs():each(function(tuple)
    print(json.encode(tuple))
end)
print('...')
local key_def = key_def_lib.new(box.space.s.index.pk.parts)
for _, tuple in box.space.s.index.sk:pairs({1}) do
    local key = key_def:extract_key(tuple)
    box.space.s.index.pk:delete(key)
end
print('after delete')
print('---')
box.space.s:pairs():each(function(tuple)
    print(json.encode(tuple))
end)
print('...')
os.exit()
(source of the code)
Starting from Tarantool 2.1, you can use SQL syntax for that ('delete from ... where ...').
However, be aware that Tarantool will try to perform this in a transaction, so if you are deleting too many tuples it will lock the transaction thread for some time.
Spark Version: '2.0.0.2.5.0.0-1245'
So, my original question changed a bit but it's still the same issue.
What I want to do is load a huge number of JSON files and transform them into a DataFrame, and probably also save them as CSV or Parquet files for further processing. Each JSON file represents one row in the final DataFrame.
import os
import glob

from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

HDFS_MOUNT = # ...
DATA_SET_BASE = # ...

schema = StructType([
    StructField("documentId", StringType(), True),
    StructField("group", StringType(), True),
    StructField("text", StringType(), True)
])

# Get the file paths
file_paths = glob.glob(os.path.join(HDFS_MOUNT, DATA_SET_BASE, '**/*.json'))
file_paths = [f.replace(HDFS_MOUNT + '/', '') for f in file_paths]
print('Found {:d} files'.format(len(file_paths)))  # 676 files

sql = SQLContext(sc)
df = sql.read.json(file_paths, schema=schema)
print('Loaded {:d} rows'.format(df.count()))  # 9660 rows (what !?)
Besides the fact that there are 9660 rows instead of 676 (the number of available files), I also have the problem that the content seems to be None:
df.head(2)[0].asDict()
gives
{
    'documentId': None,
    'group': None,
    'text': None
}
Example Data
This is just fake data of course, but it resembles the actual data.
Note: Some fields may be missing, e.g. text is not always present.
a.json
{
    "documentId": "001",
    "group": "A",
    "category": "indexed_document",
    "linkIDs": ["adiojer", "asdi555", "1337"]
}
b.json
{
    "documentId": "002",
    "group": "B",
    "category": "indexed_document",
    "linkIDs": ["linkId", "1000"],
    "text": "This is the text of this document"
}
Assuming that all your files have the same structure and are in the same directory:
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json')
There might be a problem if any of the columns has null values for all rows. Then Spark will not be able to determine the schema, so you have the option of telling Spark which schema to use:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType

sc = SparkContext(appName="My app")
sql_cntx = SQLContext(sc)

schema = StructType([
    StructField("field1", StringType(), True),
    StructField("field2", LongType(), True)
])
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json', schema=schema)
UPD:
In case the files contain multi-line (pretty-printed) JSON, you can try this code:
sc = SparkContext(appName='Test')
sql_context = SQLContext(sc)
rdd = sc.wholeTextFiles('/tmp/test/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()
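Here sc.wholeTextFiles reads every file as a single (path, content) pair, so each JSON document reaches the reader as one string and becomes exactly one row. The line-oriented spark.read.json used in the original attempt treats every physical line of a pretty-printed file as a separate record, which would explain why it produced 9660 mostly-null rows instead of 676.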
I am trying to read this JSON file into a Hive table; the top-level keys, i.e. 1, 2, ..., are not consistent.
{
    "1": "{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
    "2": "{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
    "position": 0
}
I only need time and readings 1 and 2 as columns in my Hive table; position can be ignored.
I can also do a combination of Hive queries and Spark map-reduce code.
Thank you for the help.
Update: here is what I am trying
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work if you rename "1" and "2" (the key names) to "x1" and "x2" (inside the JSON file or in the RDD):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading2, x2.time, x2.reading1, x2.reading2 from table123")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)

val recordRdd = resultrdd.flatMap(row => Array(Record(row.getLong(0), row.getDouble(1), row.getDouble(2)),
                                               Record(row.getLong(3), row.getDouble(4), row.getDouble(5))))
val schrdd = sqlContext.createSchemaRDD(recordRdd)
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap(row => {
  var recs = Array[Record]()
  for (col <- 0 to row.length - 1) {
    row(col) match {
      case r: Row => recs = recs :+ Record(r.getLong(2), r.getDouble(0), r.getDouble(1))
      case _ => ;
    }
  }
  recs
})