Transform a list of files (JSON) to a dataframe - hadoop

Spark Version: '2.0.0.2.5.0.0-1245'
So, my original question changed a bit but it's still the same issue.
What I want to do is load a huge amount of JSON files and transform those to a DataFrame - also probably save them as CSV or parquet file for further processing. Each JSON file represents one row in the final DataFrame.
import os
import glob
HDFS_MOUNT = # ...
DATA_SET_BASE = # ...
schema = StructType([
StructField("documentId", StringType(), True),
StructField("group", StringType(), True),
StructField("text", StringType(), True)
])
# Get the file paths
file_paths = glob.glob(os.path.join(HDFS_MOUNT, DATA_SET_BASE, '**/*.json'))
file_paths = [f.replace(HDFS_MOUNT + '/', '') for f in file_paths]
print('Found {:d} files'.format(len(file_paths))) # 676 files
sql = SQLContext(sc)
df = sql.read.json(file_paths, schema=schema)
print('Loaded {:d} rows'.format(df.count())) # 9660 rows (what !?)
Besides the fact that there are 9660 rows instead of 676 (number of available files) I also have the problem that the content seems to be None:
df.head(2)[0].asDict()
gives
{
'documentId': None,
'group': None,
'text': None,
}
Example Data
This is just fake data of course but it resembles the actual data.
Note: Some fields may be missing e.g. text must not always be present.
a.json
{
"documentId" : "001",
"group" : "A",
"category" : "indexed_document",
"linkIDs": ["adiojer", "asdi555", "1337"]
}
b.json
{
"documentId" : "002",
"group" : "B",
"category" : "indexed_document",
"linkIDs": ["linkId", "1000"],
"text": "This is the text of this document"
}

assuming that all your files has the same structure and are in the same directory:
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json')
There might be a problem if any of the columns has Null values for all rows. Then spark will not be able to determine schema, so you have an option to tell spark which schema to use:
from pyspark import SparkContext, SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType
sc = SparkContext(appName="My app")
sql_cntx = SQLContext(sc)
schema = StructType([
StructField("field1", StringType(), True),
StructField("field2", LongType(), True)
])
df = sql_cntx.read.json('/hdfs/path/to/folder/*.json', schema=schema)
UPD:
in case if file has multirows formatted json you can try this code:
sc = SparkContext(appName='Test')
sql_context = SQLContext(sc)
rdd = sc.wholeTextFiles('/tmp/test/*.json').values()
df = sql_context.read.json(rdd, schema=schema)
df.show()

Related

PyYAML loader with duplicate keys

Using PyYAML for loading a YAML (large) file which has duplicate keys. I would like to preserve all keys and would modify duplicate key according to project need. But it seems PyYAML is silently overwrites results with the last key and not getting a chance to modify it as my need (loss of information), resulting in this dict: {'blocks':{'a':'b2:11 c2:22'}}
simple example YAML:
import yaml
given_str = '''
blocks:
a:
b1:1
c1:2
a:
b2:11
c2:22'''
p = yaml.load(given_str)
How can I load the YAML with duplicate keys so that I get a chance to recursively traverse it and modify keys as my need. I need to load YAML and then transfer it into a database.
Assuming your input YAML has no merge keys ('<<'), no tags and no comments you want
to preserve, you can use the following:
import sys
import ruamel.yaml
from pathlib import Path
from collections.abc import Hashable
file_in = Path('input.yaml')
class MyConstructor(ruamel.yaml.constructor.SafeConstructor):
def construct_mapping(self, node, deep=False):
"""deep is True when creating an object/mapping recursively,
in that case want the underlying elements available during construction
"""
if not isinstance(node, ruamel.yaml.nodes.MappingNode):
raise ConstructorError(
None, None, f'expected a mapping node, but found {node.id!s}', node.start_mark,
)
total_mapping = self.yaml_base_dict_type()
if getattr(node, 'merge', None) is not None:
todo = [(node.merge, False), (node.value, False)]
else:
todo = [(node.value, True)]
for values, check in todo:
mapping: Dict[Any, Any] = self.yaml_base_dict_type()
for key_node, value_node in values:
# keys can be list -> deep
key = self.construct_object(key_node, deep=True)
# lists are not hashable, but tuples are
if not isinstance(key, Hashable):
if isinstance(key, list):
key = tuple(key)
if not isinstance(key, Hashable):
raise ConstructorError(
'while constructing a mapping',
node.start_mark,
'found unhashable key',
key_node.start_mark,
)
value = self.construct_object(value_node, deep=deep)
if key in mapping:
pat = key + '_undup_{}'
index = 0
while True:
nkey = pat.format(index)
if nkey not in mapping:
key = nkey
break
index += 1
mapping[key] = value
total_mapping.update(mapping)
return total_mapping
yaml = ruamel.yaml.YAML(typ='safe')
yaml.default_flow_style = False
yaml.Constructor = MyConstructor
data = yaml.load(file_in)
yaml.dump(data, sys.stdout)
which gives:
blocks:
a: b1:1 c1:2
a_undup_0: b2:11 c2:22
Please note that the values for both a keys are multiline plain scalars. For b1 and c1 to be a key
the mapping value indicator (:, the colon) needs to be followed by a whitespace character:
a:
b1: 1
c1: 2
After reading many forums, I think best solution is create a wrapper for yml loader (removing duplicates) is the solution. #Anthon - any comment?
import yaml
from collections import defaultdict, Counter
####### Preserving Duplicate ###################
def parse_preserving_duplicates(input_file):
class PreserveDuplicatesLoader(yaml.CLoader):
pass
def map_constructor(loader, node, deep=False):
"""Walk tree, removing degeneracy in any duplicate keys"""
keys = [loader.construct_object(node, deep=deep) for node, _ in node.value]
vals = [loader.construct_object(node, deep=deep) for _, node in node.value]
key_count = Counter(keys)
data = defaultdict(dict) # map all data removing duplicates
c = 0
for key, value in zip(keys, vals):
if key_count[key] > 1:
data[f'{key}{c}'] = value
c += 1
else:
data[key] = value
return data
PreserveDuplicatesLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
map_constructor)
return yaml.load(input_file, PreserveDuplicatesLoader)
##########################################################
with open(inputf, 'r') as file:
fp = parse_preserving_duplicates(input_file)

How to use Spark DataFrame column value to update the elasticsearch

Problem description: I am trying to update documents in elastic search. I have data in a dataframe. Dataframe has esIndex, id and body attributes. I need to use the value in index attribute to update the document in the right index. The following code below does not work. What is possible fix?
Command 1
val esUrl = "test.elasticsearch.com"
Command 2
import spark.implicits._
val df = Seq(
("test_datbricks001","1221866122238136328", "TEST DATA 001"),
("test_datbricks001","1221866122238136329", "TEST DATA 002"),
("test_datbrick","1221866122238136329", "TEST DATA 002")
).toDF("esIndex", "id", "body")
Command 3
display(df)
Command 4
import org.apache.spark.sql.functions.col
val retval = df.toDF.write
.format("org.elasticsearch.spark.sql")
.option("es.nodes.wan.only","true")
.option("es.mapping.id","id")
.option("es.mapping.exclude", "id, esIndex")
.option("es.port","80")
.option("es.nodes", esUrl)
.option("es.write.operation", "update")
.option("es.index.auto.create", "no")
.mode("Append")
.save("{esIndex}") // <<== This index is not taking value from "index" column of df
I am getting the following error EsHadoopIllegalArgumentException: Target index [{esIndex}] does not exist and auto-creation is disabled [setting 'es.index.auto.create' is 'false']
Question: How do I get code in Command 4 to use the value in esIndex column of df

How to change column datatype with pyarrow

I am reading a set of arrow files and am writing them to a parquet file:
import pathlib
from pyarrow import parquet as pq
from pyarrow import feather
import pyarrow as pa
base_path = pathlib.Path('../mydata')
fields = [
pa.field('value', pa.int64()),
pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
]
schema = pa.schema(fields)
with pq.ParquetWriter('sample.parquet', schema) as pqwriter:
for file_path in base_path.glob('*.arrow'):
table = feather.read_table(file_path)
pqwriter.write_table(table)
My problem is that the code field in the arrow files is defined with an int8 index instead of int32. The range of int8 however is insufficient. Hence I defined a schema with a int32 index for the field code in the parquet file.
However, writing the arrow table to parquet now complains that the schemas do not match.
How can I change the datatype of the arrow column? I checked the pyarrow API and did not find a way to change the schema. Can this be done without roundtripping to pandas?
Arrow ChunkedArray has got a cast function, but unfortunately it doesn't work for what you want to do:
>>> table['code'].cast(pa.dictionary(pa.int32(), pa.uint64(), ordered=False))
Unsupported cast from dictionary<values=uint64, indices=int8, ordered=0> to dictionary<values=uint64, indices=int32, ordered=0> (no available cast function for target type)
Instead you can cast to pa.uint64() and encode it to dictionary:
>>> table['code'].cast(pa.uint64()).dictionary_encode().type
DictionaryType(dictionary<values=uint64, indices=int32, ordered=0>)
Here's a self contained example:
import pyarrow as pa
source_schema = pa.schema([
pa.field('value', pa.int64()),
pa.field('code', pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
])
source_table = pa.Table.from_arrays([
pa.array([1, 2, 3], pa.int64()),
pa.array([1, 2, 1000], pa.dictionary(pa.int8(), pa.uint64(), ordered=False)),
], schema=source_schema)
destination_schema = pa.schema([
pa.field('value', pa.int64()),
pa.field('code', pa.dictionary(pa.int32(), pa.uint64(), ordered=False)),
])
destination_data = pa.Table.from_arrays([
source_table['value'],
source_table['code'].cast(pa.uint64()).dictionary_encode(),
], schema=destination_schema)

Display selected features after Gridsearch

I'm using GridSearchCV to perform feature selection (SelectKBest) for a linear regression. The results show that 10 features are selected (using .best_params_), but I'm unsure how to display which features this are.
The code is pasted below. I'm using a pipeline because the next models will also need hyperparameter selection. x_train is a dataframe with 12 columns that I cannot share due to data restrictions.
cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)), ('regr',
LinearRegression())]
pipe = Pipeline(steps)
search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]
clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=5, verbose=0)
clf = clf.fit(x_train, y_train)
print(clf.best_params_)
You can access the information about feature_selection step like this:
<GridSearch_model_variable>.best_estimater_.named_steps[<feature_selection_step>]
So, in your case, it would be like this:
print(clf.best_estimator_.named_steps['feature_selection'])
#Output: SelectKBest(k=8, score_func=<function mutual_info_regression at 0x13d37b430>)
Next you can use the get_support function to get the boolean map of the selected features:
print(clf.best_estimator_.named_steps['feature_selection'].get_support())
# Output: array([ True, False, True, False, True, True, True, False, False,
True, True, False, True])
Now provide this map over the original columns:
data_columns = X.columns # List of columns in your dataset
# This is the original list of columns
print(data_columns)
# Output: ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
'PTRATIO', 'B', 'LSTAT']
# Now print the select columns
print(data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()])
# Output: ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT']
So you can see out of 13 features only 8 were selected ( as in my data k=4 was the best case)
Here is the full code with boston dataset:
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, mutual_info_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
boston_dataset = load_boston()
X = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
y = boston_dataset.target
cv_folds = KFold(n_splits=5, shuffle=False)
steps = [('feature_selection', SelectKBest(mutual_info_regression, k=3)),
('regr', LinearRegression())]
pipe = Pipeline(steps)
search_space = [{'feature_selection__k': [1,2,3,4,5,6,7,8,9,10,11,12]}]
clf = GridSearchCV(pipe, search_space, scoring='neg_mean_squared_error', cv=5, verbose=0)
clf = clf.fit(X, y)
print(clf.best_params_)
data_columns = X.columns
selected_features = data_columns[clf.best_estimator_.named_steps['feature_selection'].get_support()]
print(selected_features)
# Output : Index(['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'TAX', 'PTRATIO', 'LSTAT'], dtype='object')
References:
https://stackoverflow.com/a/33378303/8160718
https://stackoverflow.com/a/38788087/8160718

read json key-values with hive/sql and spark

I am trying to read this json file into a hive table, the top level keys i.e. 1,2.., here are not consistent.
{
"1":"{\"time\":1421169633384,\"reading1\":130.875969,\"reading2\":227.138275}",
"2":"{\"time\":1421169646476,\"reading1\":131.240628,\"reading2\":226.810211}",
"position": 0
}
I only need the time and readings 1,2 in my hive table as columns ignore position.
I can also do a combo of hive query and spark map-reduce code.
Thank you for the help.
Update , here is what I am trying
val hqlContext = new HiveContext(sc)
val rdd = sc.textFile(data_loc)
val json_rdd = hqlContext.jsonRDD(rdd)
json_rdd.registerTempTable("table123")
println(json_rdd.printSchema())
hqlContext.sql("SELECT json_val from table123 lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val ").foreach(println)
It throws the following error :
Exception in thread "main" org.apache.spark.sql.hive.HiveQl$ParseException: Failed to parse: SELECT json_val from temp_hum_table lateral view explode_map( json_map(*, 'int,string')) x as json_key, json_val
at org.apache.spark.sql.hive.HiveQl$.createPlan(HiveQl.scala:239)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:50)
at org.apache.spark.sql.hive.ExtendedHiveQlParser$$anonfun$hiveQl$1.apply(ExtendedHiveQlParser.scala:49)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
This would work, if you rename "1" and "2" (key names) to "x1" and "x2" (inside the json file or in the rdd):
val resultrdd = sqlContext.sql("SELECT x1.time, x1.reading1, x1.reading1, x2.time, x2.reading1, x2.reading2 from table123 ")
resultrdd.flatMap(row => (Array( (row(0),row(1),row(2)), (row(3),row(4),row(5)) )))
This would give you an RDD of tuples with time, reading1 and reading2. If you need a SchemaRDD, you would map it to a case class inside the flatMap transformation, like this:
case class Record(time: Long, reading1: Double, reading2: Double)
resultrdd.flatMap(row => (Array( Record(row.getLong(0),row.getDouble(1),row.getDouble(2)),
Record(row.getLong(3),row.getDouble(4),row.getDouble(5)) )))
val schrdd = sqlContext.createSchemaRDD(resultrdd)
Update:
In the case of many nested keys, you can parse the row like this:
val allrdd = sqlContext.sql("SELECT * from table123")
allrdd.flatMap(row=>{
var recs = Array[Record]();
for(col <- (0 to row.length-1)) {
row(col) match {
case r:Row => recs = recs :+ Record(r.getLong(2),r.getDouble(0),r.getDouble(1));
case _ => ;
}
};
recs
})

Resources