Filestream detecting duplicate records - spark-streaming

I am developing a Spark Streaming application based on fileStream, and I need to detect duplicate records in the DStream (RDDs).
I would like to do it using memory only, if possible. My first thought was an accumulator, but I'm not sure accumulators can hold a large number of records (a lookup table keyed by a hash of each CSV record).
How can I keep a large, global, mutable collection in my Spark Streaming application?
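For what it's worth, the usual way to keep that kind of per-record state in Spark Streaming is a stateful operator rather than an accumulator. Below is a minimal sketch using mapWithState, assuming the Spark 2.x Java API (on 1.6 the Optional type comes from Guava); the checkpoint directory, input path, the use of the raw line as the key, and the one-hour state timeout are all illustrative assumptions.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.api.java.function.Function3;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.State;
import org.apache.spark.streaming.StateSpec;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaMapWithStateDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class FileStreamDedup {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("file-stream-dedup");
    JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(30));
    ssc.checkpoint("hdfs:///tmp/dedup-checkpoint");   // required for stateful operators

    // Each CSV line becomes a key; a hash of the line could be used instead to save memory.
    JavaDStream<String> lines = ssc.textFileStream("hdfs:///data/incoming");
    JavaPairDStream<String, Long> keyed = lines.mapToPair(line -> new Tuple2<>(line, 1L));

    // State per key: a Boolean flag meaning "already seen". Emit the record only the first time.
    Function3<String, Optional<Long>, State<Boolean>, String> dedup =
        (record, one, seen) -> {
          if (seen.exists()) {
            return null;                 // duplicate: drop it
          }
          seen.update(Boolean.TRUE);
          return record;                 // first occurrence: pass it through
        };

    JavaMapWithStateDStream<String, Long, Boolean, String> firstSeen =
        keyed.mapWithState(StateSpec.function(dedup)
                                    .timeout(Durations.minutes(60)));  // forget keys after an hour

    firstSeen.filter(r -> r != null).print();

    ssc.start();
    ssc.awaitTermination();
  }
}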

Related

How does GreenPlum handle multiple large joins and simultaneous workloads?

Our product is extracts from our database; they can be 300 GB+ in file format. To produce them we join multiple large tables (close to 1 TB each in some cases). We do not aggregate data at all; these are pure extracts. How does Greenplum handle these kinds of large data sets? The join keys are composite keys of three or more columns, not every table has the same keys to join on, the only common key is the first one, and if the data were distributed by that key there would be a lot of skew because the data itself is not balanced.
You should use writable external tables for these kinds of large data extracts because they can leverage gpfdist and write data in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew is either storage skew from distributing the data on a poor column choice like gender_code, or processing skew where you filter on a column or columns for which only a few segments have the data.
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database: if you use SQL (or COPY), everything has to go through the master to the client, which takes time and is slow.
As Jon pointed out, consider using a writable external table and writing the data out as it comes out of the query. Also avoid any kind of sort operation in your query, if possible; it is unnecessary, because the rows arrive in the external table file unordered anyway.

Saving ordered dataframe in Spark

I'm trying to save ordered dataframe into HDFS. My code looks like this:
dataFrame.orderBy("index").write().mode(SaveMode.Overwrite).parquet(getPath());
I run the same code on two different clusters: one uses Spark 1.5.0, the other 1.6.0. On the cluster with Spark 1.5.0, the ordering is not preserved after saving to disk.
Are there any specific cluster settings to preserve ordering when saving data to disk, or is this a known problem with that Spark version? I've searched the Spark documentation but couldn't find any information about it.
Update:
I've checked the Parquet files, and in both cases the files are sorted. So the problem occurs while reading: Spark 1.5.0 doesn't preserve ordering while reading, and 1.6.0 does.
So my question now is: is it possible to read a sorted file and preserve its ordering in Spark 1.5.0?
There are several things going on here:
When you write, Spark splits the data into several partitions, and those are written separately, so even if the data is ordered it is split across files.
When you read, the ordering between partitions is not preserved, so the data is only sorted in blocks. Worse, there might be something other than a 1:1 mapping of files to partitions:
Several files might be mapped to a single partition in the wrong order, so the sorting inside the partition only holds in blocks.
A single file might be divided between partitions (if it is larger than the block size).
Based on the above, the easiest solution would be to repartition (or rather coalesce) to a single partition when writing and thus have one file. When that file is read, the data will be ordered as long as the file is smaller than the block size (you can even make the block size very large to ensure this).
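As a rough sketch of that single-file approach, using the Java DataFrame API from the question (getPath() is the asker's own helper):

dataFrame.orderBy("index")
         .coalesce(1)                       // one partition => one output file
         .write()
         .mode(SaveMode.Overwrite)
         .parquet(getPath());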
The problem with this solution is that it reduces your parallelism: when you write you need to coalesce, and when you read you need to repartition again to regain parallelism, and the coalesce/repartition can be costly.
The second problem with this solution is that it doesn't scale well (you might end up with a huge file).
A better solution depends on your use case; the basic idea is to use partitioning before sorting. For example, if you are planning to do a custom aggregation that requires the ordering, then as long as you keep a 1:1 mapping between files and partitions, you are assured of the ordering within each partition, which might be enough for you (see the sketch below). You can also record the maximum value inside each partition as a second column, group by it, and do a secondary sort.
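A sketch of that per-partition approach, assuming Spark 1.6 (where sortWithinPartitions and expression-based repartition were added); the "key" column is a hypothetical partitioning column:

dataFrame.repartition(dataFrame.col("key"))   // hypothetical column that defines the partitions
         .sortWithinPartitions("index")       // each output file is sorted on "index"
         .write()
         .mode(SaveMode.Overwrite)
         .parquet(getPath());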

Is it generally better to transform semi-structured into structured data on Hadoop if the possibility exists?

I have large and growing datasets of semi-structured data in JSON files on a Hadoop cluster. The data is fairly benign, but one of the keys, which holds a list of maps, varies heavily in size: it can contain anywhere from zero to a few thousand of those maps, each with a few dozen keys of its own.
However, the data could be transformed into two separate tables of structured data linked by foreign keys. Both would be narrow tables; one of them would be roughly ten times as long as the other.
I could either keep the data in its semi-structured format and store it in a wide-column store like HBase, or use a columnar format like Parquet to store the data in two large relational tables.
It is unlikely the data format will change, but it can't be ruled out.
I'm new to Hadoop and Big Data, so which of the two possibilities is generally preferable? Should semi-structured data be changed into structured data if the possibility exists and the data format is fairly constant?
EDIT: Additional info as requested by Rahul Sharma.
The data consists of shopping carts from shopping software; the variable length comes from the variable number of items in the carts. The data is initially in XML format and is then transformed into JSON, but not by me; I have no control over that step.
No realtime analytics planned, only batch analytics.
The relationship between the two tables is that one holds the customer/purchase info while the other holds the purchased items; both would be linked by a suitable key.
I hope this helps.
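For illustration, the two-table transformation described above could look roughly like this with Spark's Java DataFrame API (Spark 1.6 style). The paths and the column names (cart_id, customer, purchase_date, items, sku, quantity, price) are assumptions about the JSON layout, not something taken from the actual data:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

public class CartsToParquet {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("carts-to-parquet"));
    SQLContext sqlContext = new SQLContext(sc);

    DataFrame carts = sqlContext.read().json("hdfs:///data/carts/*.json");

    // Table 1: one row per cart, holding the customer/purchase info.
    carts.select(col("cart_id"), col("customer"), col("purchase_date"))
         .write().parquet("hdfs:///warehouse/carts");

    // Table 2: one row per purchased item, linked back to its cart by cart_id (the foreign key).
    carts.select(col("cart_id"), explode(col("items")).as("item"))
         .select(col("cart_id"), col("item.sku"), col("item.quantity"), col("item.price"))
         .write().parquet("hdfs:///warehouse/cart_items");

    sc.stop();
  }
}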

How to compare two large data sets using hadoop mapreduce?

I am new to Hadoop and MapReduce. We have a plain Java application in which we read a file (8 GB in size) from HDFS and apply some rules to that data. After applying the rules we get a Java HashMap (which is huge) and keep that data in a cache or buffer. At the same time we fetch data from Hive with a query and prepare another Java HashMap, which is again huge. We then compare the two HashMaps to prepare a final report that checks data accuracy.
Because we are doing all of this in a plain Java program, we face the following problems:
Processing this much data takes ages: the input file contains tens of millions of records, we need to apply rules to each row to extract the data, and the job takes days to complete. Hive holds the same amount of data, and the query takes too long to return it.
Since we keep the data in a buffer, we face memory issues.
Now we are trying to implement the same process in Hadoop MapReduce.
What is the best way to achieve the above scenario?
What are the best ways to implement the above scenario in MapReduce?
How can I increase the application's performance by using MapReduce?
8 GB is a tiny data set. I can fit four of these 'data sets' into my laptop's RAM! Just dump it into any relational engine and massage it as you see fit until the cows come home. This is not 'big data'.
For the record, the way to process two truly large datasets (say 1 TB+ each) in Hive is a sort-merge-bucket join (aka SMB join). Read LanguageManual JoinOptimization and watch Join Strategies in Hive.
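Since the question asks what this comparison could look like in MapReduce itself, here is a minimal, hedged sketch of a reduce-side join: one mapper tags records from the rules-processed file, another tags an HDFS export of the Hive data, and the reducer compares the two sides per key. All class and tag names are illustrative, and the CSV layout (join key in the first column) is an assumption:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RecordCompareJob {

  // Tags each record with its origin so the reducer can tell the two sides apart.
  public static class FileMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");        // assume CSV with the key in column 0
      ctx.write(new Text(fields[0]), new Text("FILE\t" + value));
    }
  }

  public static class HiveExportMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      ctx.write(new Text(fields[0]), new Text("HIVE\t" + value));
    }
  }

  // For each key, compares the record from the file side with the record from the Hive side.
  public static class CompareReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String fromFile = null, fromHive = null;
      for (Text v : values) {
        String[] parts = v.toString().split("\t", 2);
        if ("FILE".equals(parts[0])) fromFile = parts[1]; else fromHive = parts[1];
      }
      if (fromFile == null || !fromFile.equals(fromHive)) {
        ctx.write(key, new Text("MISMATCH file=" + fromFile + " hive=" + fromHive));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "record-compare");
    job.setJarByClass(RecordCompareJob.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, FileMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, HiveExportMapper.class);
    job.setReducerClass(CompareReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}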

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records that match specific logic, then take that matching subset and insert it into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
MapReduce is designed for algorithms that can be parallelized, where local results can be computed and then aggregated. A typical example is counting words in a document: you split the work into multiple parts, count some of the words on one node and some on another, and then add up the totals (obviously this is a trivial example, but it illustrates the type of problem; a sketch follows below).
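For reference, that word-count example looks roughly like this as Hadoop MapReduce code (mapper and reducer only; the driver is the same job-setup boilerplate shown in the comparison sketch further up):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: each node emits (word, 1) for every word in its share of the input.
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      for (String word : line.toString().split("\\s+")) {
        if (!word.isEmpty()) {
          ctx.write(new Text(word), ONE);
        }
      }
    }
  }

  // Reduce: add up the partial counts produced for each word across all mappers.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      ctx.write(word, new IntWritable(total));
    }
  }
}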
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of non-uniformly structured data, you might consider a NoSQL database, which is designed to handle data where a lot of the columns are null (such as MongoDB).
Hadoop/MapReduce is designed for batch processing, not real-time processing, so other alternatives like Twitter Storm or HStreaming have to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and a lot of improvement/work remains to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not great, and millions of records do not sound like they are, I would suggest trying to get the most out of an RDBMS, even if your schema will not be properly normalized.
I think even a table with the structure K1, K2, K3, Blob would be more useful.
In NoSQL, key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is MongoDB/CouchDB's ability to index schemaless data; you would be able to fetch records by some attribute value.
Regarding Hadoop MapReduce: I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.