Caching vs Tempview - caching

I have a parquet file which I reading atleast 4-5 times within my application. I was wondering what is most efficient thing to do.
Option 1. While writing parquet file read it back on dataset and call cache. I am assuming by doing an immediate read I might use some existing hdfs/spark cache as part from write process.
Option 2. In my application when I need the dataset first time, after reading it cache it.
Option 3. While writing parquet file, after completion create a temp view out of it. In all subsequent usage, use the view.
I am also not very clear about efficiency of reading from tempview vs parquet dataset.
The datasets doesn't fit all into memory.

You should cache dataset (Option 2).
writing to disk will provide no improvements over Spark in-memory format
temporary views don't cache.

Related

Effectively merge big parquet files

I'm using parquet-tools to merge parquet files. But it seems that parquet-tools needs an amount of memory as big as the merged file. Do we have other ways or configurable options in parquet-tools to use memory more effectively? Cause I run the merge job in as a map job on hadoop env. And the container gets killed every time cause it used more memory than it is provided.
Thank you.
I wouldn't recommend using parquet-tools merge, since it just places row groups one after the another, so you will still have small groups, just packed together in a single file. The resulting file will typically not have noticably better performance, and under certain circumstances it may even perform worse than separate files. See PARQUET-1115 for details.
Currently the only proper way to merge Parquet files is to read all data from them and write it to a new Parquet file. You can do it with a MapReduce job (requires writing custom code for this purpose) or using Spark, Hive or Impala.

How do I stream parquet using pyarrow?

I'm trying to read in a large dataset of parquet files piece by piece, do some operation and then move on to the next one without holding them all in memory. I need to do this because the entire dataset doesn't fit into memory. Previously I used ParquetDataset and I'm aware of RecordBatchStreamReader but I'm not sure how to combine them.
How can I use Pyarrow to do this?
At the moment, the Parquet APIs only support complete reads of individual files, so we can only limit reads at the granularity of a single file. We would like to create an implementation of arrow::RecordBatchReader (the streaming data interface) that reads from Parquet files, see https://issues.apache.org/jira/browse/ARROW-1012. Patches would be welcome.

Data format and database choices Spark/hadoop

I am working on structured data (one value per field, the same fields for each row) that I have to put in a NoSql environment with Spark (as analysing tool) and Hadoop. Though, I am wondering what format to use. i was thinking about json or csv but I'm not sure. What do you think and why? I don't have enough experience in this field to properly decide.
2nd question : I have to analyse these data (stored in an HDFS). So, as far as I know I have two possibilities to query them (before the analysis):
direct reading and filtering. i mean that it can be done with Spark, for exemple:
data = sqlCtxt.read.json(path_data)
Use Hbase/Hive to properly make a query and then process the data.
So, I don't know what is the standard way of doing all this and above all, what will be the fastest.
Thank you by advance!
Use Parquet. I'm not sure about CSV but definitely don't use JSON. My personal experience using JSON with spark was extremely, extremely slow to read from storage, after switching to Parquet my read times were much faster (e.g. some small files took minutes to load in compressed JSON, now they take less than a second to load in compressed Parquet).
On top of improving read speeds, compressed parquet can be partitioned by spark when reading, whereas compressed JSON cannot. What this means is that Parquet can be loaded onto multiple cluster workers, whereas JSON will just be read onto a single node with 1 partition. This isn't a good idea if your files are large and you'll get Out Of Memory Exceptions. It also won't parallelise your computations, so you'll be executing on one node. This isn't the 'Sparky' way of doing things.
Final point: you can use SparkSQL to execute queries on stored parquet files, without having to read them into dataframes first. Very handy.
Hope this helps :)

What is meant by "HDFS lacks random read and write access"?

Any file system should provide an API to access its files and directories, etc.
So, what is meant by "HDFS lacks random read and write access"?
So, we should use HBase.
The default HDFS block size is 128 MB. So you cannot read one line here, one line there. You always read and write 128 MB blocks. This is fine when you want to process the whole file. But it makes HDFS unsuitable for some applications, like where you want to use an index to look up small records.
HBase on the other hand is great for this. If you want to read a small record, you will only read that small record.
HBase uses HDFS as its backing store. So how does it provide efficient record-based access?
HBase loads the tables from HDFS to memory or local disk, so most reads do not go to HDFS. Mutations are stored first in an append-only journal. When the journal gets large, it is built into an "addendum" table. When there are too many addendum tables, they all get compacted into a brand new primary table. For reads, the journal is consulted first, then the addendum tables, and at last the primary table. This system means that we only write a full HDFS block when we have a full HDFS block's worth of changes.
A more thorough description of this approach is in the Bigtable whitepaper.
In a typical database where the data is stored in tables in RDBMS format you can read or write to any record from any table without having to know what is there in other records. This is called random writing/reading.
But in HDFS data is stored in the file format(generally) rather than table format. So if you are reading/writing its not as easy as is in RDBMS.

Running a MR Job on a portion of the HDFS file

Imagine you have a big file stored in hdtf which contains structured data. Now the goal is to process only a portion of data in the file like all the lines in the file where second column value is between so and so. Is it possible to launch the MR job such that hdfs only stream the relevant portion of the file versus streaming everything to the mappers.
The reason is that I want to expedite the job speed by only working on the portion that I need. Probably one approach is to run a MR job to get create a new file but I am wondering if one can avoid that?
Please note that the goal is to keep the data in HDFS and I do not want to read and write from database.
HDFS stores files as a bunch of bytes in blocks, and there is no indexing, and therefore no way to only read in a portion of your file (at least at the time of this writing). Furthermore, any given mapper may get the first block of the file or the 400th, and you don't get control over that.
That said, the whole point of MapReduce is to distribute the load over many machines. In our cluster, we run up to 28 mappers at a time (7 per node on 4 nodes), so if my input file is 1TB, each map slot may only end up reading 3% of the total file, or about 30GB. You just perform the filter that you want in the mapper, and only process the rows you are interested in.
If you really need filtered access, you might want to look at storing your data in HBase. It can act as a native source for MapReduce jobs, provides filtered reads, and stores its data on HDFS, so you are still in the distributed world.
One answer is looking at the way that hive solves this problem. The data is in "tables" which are really just meta data about files on disk. Hive allows you to set columns on which a table is partitioned. This creates a separate folder for each partition so if you were partitioning a file by date you would have:
/mytable/2011-12-01
/mytable/2011-12-02
Inside of the date directory would be you actual files. So if you then ran a query like:
SELECT * FROM mytable WHERE dt ='2011-12-01'
Only files in /mytable/2011-12-01 would be fed into the job.
Tho bottom line is that if you want functionality like this you either want to move to a higher level language (hive/pig) or you need to roll your own solutions.
Big part of the processing cost - is data parsing to produce Key-Values to the Mapper. We create there (usually) one java object per value + some container. It is costly both in terms of CPU and garbage collector pressure
I would suggest the solution "in the middle". You can write input format which will read the input stream and skip non-relevant data in the early stage (for example by looking into few first bytes of the string). As a result you will read all data, but actually parse and pass to the Mapper - only portion of it.
Another approach I would consider - is to use RCFile format (or other columnar format), and take care that relevant and non relevant data will sit in the different columns.
If the files that you want to process have some unique attribute about their filename (like extension or partial filename match), you can also use the setInputPathFilter method of FileInputFormat to ignore all but the ones you want for your MR job. Hadoop by default ignores all ".xxx" and _xxx" files/dirs, but you can extend with setInputPathFilter.
As others have noted above, you will likely get sub-optimal performance out of your cluster doing something like this which breaks the "one block per mapper" paradigm, but sometimes this is acceptable. Can sometimes take more to "do it right", esp if you're dealing with a small amount of data & the time to re-architect and/or re-dump into HBase would eclipse the extra time required to run your job sub-optimally.

Resources