I have a Spark Graph that I created this way
val graph = Graph(vertices, edges, defaultArticle).cache
My vertices are an RDD[(Long, (String, Option[String], List[String], Option[String]))] and my edges are an RDD[Edge[Long]].
How do I save this graph (or its edges and vertices) to Hive, a text file, or anything else, and how would I read it back? I looked into the Spark SQL docs and the Spark core docs, but I still haven't succeeded. If I do saveAsTextFile(), then when I read it back I get an RDD[String], which is not what I need.
EDIT: Daniel has provided an answer for saving as an object file. I'm still interested in understanding how to save and read the object above as a Hive table. Thanks!
Instead of rdd.saveAsTextFile()/sc.textFile(), use rdd.saveAsObjectFile()/sc.objectFile(). This uses normal Java serialization for each record, stored in a Hadoop SequenceFile.
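The question is about Scala/Spark, but the round-trip idea (binary object serialization instead of text, so the structure survives) can be sketched in plain Python with pickle. This is purely illustrative, not Spark code; the sample vertices mimic the tuple shape from the question:

```python
import pickle

# Vertices shaped like the question's RDD elements:
# (id, (title, Option[String] ~ None, List[String], Option[String]))
vertices = [
    (1, ("article-a", None, ["tag1", "tag2"], None)),
    (2, ("article-b", "summary", [], "author")),
]

# Serialize to a binary file, then read it back with types intact.
with open("vertices.bin", "wb") as f:
    pickle.dump(vertices, f)

with open("vertices.bin", "rb") as f:
    restored = pickle.load(f)

# Unlike saveAsTextFile, nothing degrades to strings on the way back.
assert restored == vertices
```

saveAsObjectFile/objectFile give you the same property on the Spark side: the deserialized RDD has the original element type, not RDD[String].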
Related
I have a requirement to persist flowcharts built by users into a database. Until now, I've been using BLOBs to store the flowchart image, but I want to serialize the flowchart and persist it as structured data.
For example, a flowchart will have start and end nodes, process boxes, input/output boxes, decision branches, etc. All these elements are connected by arrows. I'm blanking on how to model a data structure that will store this information. Any tips would be appreciated.
Sorry, but this question is too vague.
Perhaps the best question to start off asking yourself:
Q: Do I really want to invent my own format, or find an existing "flowchart"
format that does what I need?
If you want to invent your own, I'd strongly suggest:
Using JSON or XML (instead of something completely ad-hoc), and
Finding JSON or XML examples for storing "graphs" (nodes, edges, etc).
Here is one such example: http://jsongraphformat.info/
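As a sketch of what such an encoding might look like, here is a flowchart stored as a nodes-and-edges JSON graph. The node types and field names are made up for illustration, loosely following the JSON Graph format linked above:

```python
import json

# A flowchart as a directed JSON graph: nodes carry a "type"
# (start, process, decision, end), edges carry optional labels
# for the decision branches.
flowchart = {
    "graph": {
        "directed": True,
        "nodes": [
            {"id": "n1", "type": "start", "label": "Start"},
            {"id": "n2", "type": "decision", "label": "Valid input?"},
            {"id": "n3", "type": "process", "label": "Handle request"},
            {"id": "n4", "type": "end", "label": "End"},
        ],
        "edges": [
            {"source": "n1", "target": "n2"},
            {"source": "n2", "target": "n3", "label": "yes"},
            {"source": "n2", "target": "n4", "label": "no"},
            {"source": "n3", "target": "n4"},
        ],
    }
}

# Round-trip through a string, as you would through a database column.
stored = json.dumps(flowchart)
loaded = json.loads(stored)
assert loaded == flowchart
```

Stored this way, the chart is queryable and editable as data, which a BLOB image never is.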
But then your next challenge would be how to DISPLAY your flowchart after you've read it from the database.
ANOTHER SUGGESTION:
Assume, for example, you're using Visio. I'd just store the .vsd in your database!
A common problem in Big Data is getting data into a Big Data-friendly format (Parquet or TSV).
In Spark, wholeTextFiles, which currently returns RDD[(String, String)] (path -> whole file as a string), is a useful method for this, but it causes many issues when the files are large (mainly memory issues).
In principle it ought to be possible to write a method as follows using the underlying Hadoop API
def wholeTextFilesIterators(path: String): RDD[(String, Iterator[String])]
where the iterator yields the file's lines (assuming newline as the delimiter) and encapsulates the underlying file reading and buffering.
After reading through the code for a while I think a solution would involve creating something similar to WholeTextFileInputFormat and WholeTextFileRecordReader.
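The proposal above is Scala/Hadoop, but the core idea, returning a lazy per-file line iterator instead of the whole file as one string, can be sketched in plain Python. This is illustrative only (the file and function names are made up), not a Spark implementation:

```python
import os
import tempfile

def line_iterators(paths):
    """Yield (path, iterator-of-lines) pairs. Each iterator streams
    its file line by line, so the whole file is never held in memory."""
    for path in paths:
        def lines(p=path):
            with open(p, "r") as f:
                for line in f:
                    yield line.rstrip("\n")
        yield path, lines()

# Demo with a small temporary file.
tmp = tempfile.mkdtemp()
p = os.path.join(tmp, "a.txt")
with open(p, "w") as f:
    f.write("one\ntwo\nthree\n")

for path, it in line_iterators([p]):
    first = next(it)   # only one line has been materialized so far
    rest = list(it)

assert first == "one"
assert rest == ["two", "three"]
```

The shuffle problem noted in the UPDATE shows up here too: the iterator holds an open file handle, which is exactly what makes it hard to serialize and ship to another machine.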
UPDATE:
After some thought, this probably also means implementing a custom org.apache.hadoop.io.BinaryComparable so the iterator can survive a shuffle (the iterator is hard to serialize since it holds a file handle).
See also https://issues.apache.org/jira/browse/SPARK-22225
Spark-Obtaining file name in RDDs
As per Hyukjin's comment on the JIRA, something close to what is wanted is given by
spark.read.format("text").load("...").selectExpr("value", "input_file_name()")
I'm currently looking at the following example: https://bl.ocks.org/mbostock/7607999. The JSON file the chart gets its data from contains entries in the following format: {"name":"flare.analytics.cluster.MergeEdge","size":743,"imports":[]}
Instead of the above format, is it possible to have it as
{"imports":[],"name":"flare.analytics.cluster.MergeEdge","size":743}
Many thanks in advance for taking the time to read this.
Basically, D3.js accesses the JSON data by property name (e.g. obj.name), so it doesn't matter how your JSON keys are ordered. I tried a demo, modifying
{"name":"flare.vis.data.render.ArrowType", "size":698,"imports":[]}
to
{"size":698,"imports":[],"name":"flare.vis.data.render.ArrowType"}
and the graph still rendered the same.
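A quick way to convince yourself that key order is irrelevant (plain Python here, but JavaScript behaves the same way, since JSON objects are unordered by specification):

```python
import json

# The same object with keys in two different orders.
a = json.loads('{"name":"flare.vis.data.render.ArrowType","size":698,"imports":[]}')
b = json.loads('{"size":698,"imports":[],"name":"flare.vis.data.render.ArrowType"}')

# Lookup is by key, not by position, so the parsed objects are equal.
assert a == b
assert b["name"] == "flare.vis.data.render.ArrowType"
```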
I want to use Parquet in one of my projects as columnar storage, but I don't want to depend on the hadoop/hdfs libs. Is it possible to use Parquet outside of HDFS? Or what is the minimal dependency?
Investigating the same question I found that apparently it's not possible for the moment.
I found this git issue, which proposes decoupling parquet from the hadoop api. Apparently it has not been done yet.
In the Apache Jira I found an issue which asks for a way to read a Parquet file outside Hadoop. It was unresolved at the time of writing.
EDIT:
Issues are no longer tracked on GitHub (the first link above is dead). A newer issue I found is located in Apache's Jira with the following headline:
make it easy to read and write parquet files in java without depending on hadoop
Since it is just a file format, it is obviously possible to decouple Parquet from the Hadoop ecosystem. Nowadays the simplest approach I could find is through Apache Arrow; see here for a Python example.
Here a small excerpt from the official PyArrow docs:
Writing
In [2]: import numpy as np
In [3]: import pandas as pd
In [4]: import pyarrow as pa
In [5]: df = pd.DataFrame({'one': [-1, np.nan, 2.5],
...: 'two': ['foo', 'bar', 'baz'],
...: 'three': [True, False, True]},
...: index=list('abc'))
...:
In [6]: table = pa.Table.from_pandas(df)
In [7]: import pyarrow.parquet as pq
In [8]: pq.write_table(table, 'example.parquet')
Reading
In [11]: pq.read_table('example.parquet', columns=['one', 'three'])
EDIT:
With Pandas directly
It is also possible to use pandas directly to read and write DataFrames. This makes it as simple as my_df.to_parquet("myfile.parquet") and my_df = pd.read_parquet("myfile.parquet").
You don't need HDFS/Hadoop to consume Parquet files. There are different ways to consume Parquet.
You could access it using Apache Spark.
If you are on AWS, you can directly load or access it from Redshift or Athena.
If you are on Azure, you can load or access it from SQL Data Warehouse or SQL Server.
Similarly in GCP as well.
Late to the party, but I've been working on something that should make this possible: https://github.com/jmd1011/parquet-readers.
This is still under development, but a final implementation should be out within a month or two of writing this.
Edit: Months later, and still working on this! It is under active development, just taking longer than expected.
What type of data do you have in Parquet? You don't require HDFS to read Parquet files; it is definitely not a prerequisite. We use Parquet files at Incorta for our staging tables. We do not ship with a dependency on HDFS; however, you can store the files on HDFS if you want. Obviously, we at Incorta can read directly from the Parquet files, but you can also use Apache Drill to connect: use file:/// as the connection instead of hdfs:///. See below for an example.
To read or write Parquet data, you need to include the Parquet format in the storage plugin format definitions. The dfs plugin definition includes the Parquet format.
{
  "type" : "file",
  "enabled" : true,
  "connection" : "file:///",
  "workspaces" : {
    "json_files" : {
      "location" : "/incorta/tenants/demo//drill/json/",
      "writable" : false,
      "defaultInputFormat" : "json"
    }
  }
}
Nowadays you don't need to rely on Hadoop as heavily as before.
Please see my other post: How to view Apache Parquet file in Windows?
As the name suggests, I'm looking for some tool which will convert the existing data from hadoop sequence file to json format.
My initial googling has only shown results related to jaql, which I'm desperately trying to get to work.
Is there any tool from Apache available for this very purpose?
NOTE:
I have a Hadoop sequence file sitting on my local machine and would like to get the data in the corresponding JSON format.
So, in effect, I'm looking for a tool/utility which takes a Hadoop sequence file as input and produces output in JSON format.
Thanks
Apache Hadoop might be a good tool for reading sequence files.
All kidding aside, though, why not write the simplest possible Mapper java program that uses, say, Jackson to serialize each key and value pair it sees? That would be a pretty easy program to write.
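A sketch of that mapper idea in plain Python (the records list here is a made-up stand-in for the SequenceFile's key/value pairs; a real job would read them via the Hadoop API and serialize with Jackson in Java):

```python
import json

# Stand-in for iterating a SequenceFile's (key, value) records.
records = [("k1", {"count": 3}), ("k2", {"count": 7})]

# The "simplest possible mapper": serialize each pair to one JSON line.
json_lines = [json.dumps({"key": k, "value": v}) for k, v in records]

# Each output line parses back to the original pair.
assert json.loads(json_lines[0]) == {"key": "k1", "value": {"count": 3}}
```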
I thought there must be some tool that does this, given that it's such a common requirement. Yes, it should be pretty easy to code, but why do so if something already exists that does just that?
Anyway, I figured out how to do it using jaql. A sample query that worked for me:
read({type: 'hdfs', location: 'some_hdfs_file', inoptions: {converter: 'com.ibm.jaql.io.hadoop.converter.FromJsonTextConverter'}});