Is it generally better to transform semi-structured data into structured data on Hadoop if the possibility exists? - hadoop

I have large and growing datasets of semi-structured data in JSON files on a Hadoop cluster. The data is fairly benign, but one of the keys holds a list of maps that can vary heavily in size: anywhere from zero up to a few thousand of those maps, each with a few dozen keys of its own.
However, the data could be transformed into two separate tables of structured data linked by foreign keys. Both would be narrow tables, and one would be roughly ten times as long as the other.
I could either keep the data in a semi-structured format and use a wide-column store like HBase, or use a columnar storage format like Parquet to store the data in two large relational tables.
It is unlikely the data format will change, but it can't be ruled out.
I'm new to Hadoop and Big Data, so which of the two possibilities is generally preferable? Should semi-structured data be changed into structured data if the possibility exists and the data format is fairly constant?
EDIT: Additional info as requested by Rahul Sharma.
The data consists of shopping carts from a shopping software package; the variable length comes from the variable number of items in each cart. The data is initially in XML format and is then transformed into JSON, but not by me; I have no control over that step.
No real-time analytics are planned, only batch analytics.
The relationship between the two tables would be that one holds the customer/purchase info while the other holds the purchased items, linked by a suitable key.
I hope this helps.
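For illustration, here is a minimal PySpark sketch of the Parquet route described above, splitting each cart record into a carts table and an items table linked by a key. The paths, the cart_id key, and the items key are assumptions, not details from the actual data:

```python
# Minimal sketch, assuming each JSON record has a "cart_id", some flat
# customer/purchase fields, and an "items" key holding the variable-length list of maps.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName("carts-to-parquet").getOrCreate()

raw = spark.read.json("hdfs:///data/carts/*.json")   # semi-structured input

# Narrow "carts" table: one row per cart, without the variable-length list.
carts = raw.drop("items")

# Narrow "items" table: one row per purchased item, keyed back to its cart.
items = (raw
         .select("cart_id", explode(col("items")).alias("item"))
         .select("cart_id", "item.*"))

carts.write.mode("overwrite").parquet("hdfs:///warehouse/carts")
items.write.mode("overwrite").parquet("hdfs:///warehouse/cart_items")
```

With the data laid out this way, Parquet's column pruning keeps queries on the narrow carts table cheap, even though the items table ends up roughly ten times longer.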

Related

All else held equal, which is the fastest querying option: Milvus, RocksDB, or Apache HBase?

I have a requirement to store billions of records (with capacity up to one trillion records) in a database (total size is in terms of petabytes). The records are textual fields with about 5 columns representing transactional information.
I want to be able to query data in the database incredibly quickly, so I was researching Milvus, Apache HBase, and RocksDB. Based on my research, all three are incredibly fast and work well with large amounts of data. All else equal, which of these three is the fastest?
What type of data are you storing in the database?
Milvus is used for vector storage and computation. If you want to search by the semantics of the text, Milvus is the fastest option.
HBase and RocksDB are both key-value databases. If you want to search by the key columns, these two would be faster.
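To make "search by the key columns" concrete, here is a rough sketch of a single-key lookup against HBase using the happybase client; the host, table name, and column family are invented for the example:

```python
# Sketch of a key-based lookup in HBase via the happybase client.
# Host, table name, and column family are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # requires the HBase Thrift server
table = connection.table("transactions")

# Write one record under an explicit row key, then read it back by that key.
table.put(b"txn#0001", {b"cf:amount": b"42.50", b"cf:account": b"A-17"})
row = table.row(b"txn#0001")          # single-key lookup: the fast path
print(row[b"cf:amount"])
```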

In Spark, if one column contains a lot of data, does this affect the performance of queries on other columns?

Let's assume that I'm using Spark and storing data in Parquet format. I have a table with multiple columns, one of which may contain a relatively large amount of data (for example, it could contain a 10,000-word string).
Now I want to make a simple query on the number of rows for a given partition of data. I would expect that it wouldn't matter how much data is in one of the columns. As I understand it, Parquet is a columnar format, so it stores and operates on data column-wise. So it could just load one of the columns with a small amount of data and count the rows. But I'm finding that when a column contains large data, it slows down query performance. In the Spark UI I can see that the input data is much larger. This is true even if the query explicitly excludes that column.
Is this the expected behavior? Isn't the advantage of columnar data specifically to improve performance in cases like this?
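For reference, a small sketch of how one might check what the scan actually reads in this situation, assuming a Parquet table with a partition column dt, a small id column, and a large body column (all names invented here):

```python
# Sketch: verify column pruning, i.e. which Parquet columns a count actually touches.
# Paths and the "dt", "id", "body" column names are illustrative, not from the question.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pruning-check").getOrCreate()
df = spark.read.parquet("hdfs:///warehouse/events")

# Count rows in one partition while projecting a single small column first.
subset = df.filter(df.dt == "2024-01-01").select("id")
print(subset.count())

# The physical plan's ReadSchema shows which columns the scan will load;
# a plain count() should not need the large "body" column at all.
subset.explain(True)
```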

How does GreenPlum handle multiple large joins and simultaneous workloads?

Our product is extracts from our database; they can be as large as 300 GB+ in file format. To achieve that we join multiple large tables (tables close to 1 TB in size in some cases). We do not aggregate data at all; it's pure extracts. How does Greenplum handle these kinds of large data sets? (The join keys are 3+ column keys, and not every table has the same keys to join with; the only common key is the first key, and if the data were distributed by that there would be a lot of skew, since the data itself is not balanced.)
You should use writable external tables for those types of large data extracts because they can leverage gpfdist and write data in parallel. It will be very fast.
https://gpdb.docs.pivotal.io/510/ref_guide/sql_commands/CREATE_EXTERNAL_TABLE.html
Also, your use case doesn't really indicate skew. Skew would be either storage skew from a poor distribution column choice like gender_code, or processing skew where you filter by a column or columns for which only a few segments have the data.
In general, Greenplum Database handles this kind of load just fine. The query is executed in parallel on the segments.
Your bottleneck is likely the final export from the database - if you use SQL (or COPY), everything has to go through the master to the client. That takes time, and is slow.
As Jon pointed out, consider using an external table and write out the data as it comes out of the query. Also avoid any kind of sort operation in your query if possible; sorting is pointless here because the data arrives unsorted in the external table file anyway.
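As a hedged sketch of that approach, the following Python script drives the unload through a writable external table with psycopg2; the connection details, table definitions, and gpfdist URLs are assumptions and would need to match your environment (see the CREATE EXTERNAL TABLE reference linked above):

```python
# Sketch: unload a large join through a writable external table so each segment
# writes to gpfdist in parallel instead of funneling through the master.
# Connection details, column lists, table names, and gpfdist URLs are assumptions.
import psycopg2

conn = psycopg2.connect(host="gp-master", dbname="analytics", user="etl")
cur = conn.cursor()

cur.execute("""
    CREATE WRITABLE EXTERNAL TABLE ext_extract
        (k1 text, k2 text, k3 text, attr1 text, attr2 text)
    LOCATION ('gpfdist://etl-host-1:8081/extract.out',
              'gpfdist://etl-host-2:8081/extract.out')
    FORMAT 'TEXT' (DELIMITER '|')
    DISTRIBUTED RANDOMLY
""")

# Unsorted INSERT ... SELECT: the join runs in parallel on the segments and
# each segment streams its rows straight to a gpfdist target.
cur.execute("""
    INSERT INTO ext_extract
    SELECT s.k1, s.k2, s.k3, d.attr1, d.attr2
    FROM sales s
    JOIN detail d ON (d.k1 = s.k1 AND d.k2 = s.k2 AND d.k3 = s.k3)
""")

conn.commit()
cur.close()
conn.close()
```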

What is the opposite of ETL?

ETL (extract, transform, load) is the process of getting data into a data warehouse from various sources.
Is there a name for the opposite process? Extracting data from a data warehouse, transforming it and putting it into a table - usually to feed a reporting tool.
Technically speaking, the opposite of an ETL is an ELT.
Instead of extract, transform, then load, an ELT is an extract, load, then transform. The choice between which of the two pipelines should be used depends on the system and the nature of the data. For example, the process of bringing data into a relational database necessarily requires a transformation before loading, but other frameworks, such as Hadoop, are better able to handle unstructured data and apply structure to it after loading takes place.
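A toy sketch of the ordering difference, with plain Python lists standing in for the warehouse and a made-up cents-to-dollars transformation:

```python
# Toy sketch contrasting ETL and ELT: same steps, different order.
# The "warehouse" is just a list; the transformation is invented for illustration.
def extract(source):
    return list(source)

def transform(rows):
    return [{**r, "amount_usd": r["amount_cents"] / 100} for r in rows]

def etl(source, warehouse):
    warehouse.extend(transform(extract(source)))   # shape the data before it lands

def elt(source, warehouse):
    warehouse.extend(extract(source))              # land the raw data first...
    warehouse[:] = transform(warehouse)            # ...then transform inside the store

source = [{"amount_cents": 4250}]
wh_a, wh_b = [], []
etl(source, wh_a)
elt(source, wh_b)
print(wh_a == wh_b)  # same end state, different order of operations
```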
Since this question was asked, 6 years ago (!), a lot has changed in the ETL landscape.
There is a new trend called "Reverse ETL", which is the idea of taking cleaned/transformed/modeled data from your warehouse back into the SaaS applications (Salesforce, Marketo, Zendesk, HubSpot, etc.) that your teams use.
The main tools are
getCensus
Seekwell
Grouparoo
You can read more about this nascent trend here and here too
The ETL abbreviation applies to any extract, transform and load sequence. It can be applied to extracting data from a data warehouse, transforming the data and loading the transformed data into a table.
In your question you have two ETL sequences: one that loads the data into the data warehouse, and one that extracts information from the data warehouse and loads this data into the table.

Is Hadoop the right tech for this?

If I had millions of records of data that are constantly being updated and added to every day, and I needed to comb through all of the data for records matching specific logic and then insert that matching subset into a separate database, would I use Hadoop and MapReduce for such a task, or is there some other technology I am missing? The main reason I am looking for something other than a standard RDBMS is that all of the base data comes from multiple sources and is not uniformly structured.
Map-Reduce is designed for algorithms that can be parallelized, where local results can be computed and then aggregated. A typical example would be counting words in a document: you can split this up into multiple parts, count some of the words on one node and some on another node, etc., and then add up the totals (obviously this is a trivial example, but it illustrates the type of problem).
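A toy sketch of that pattern in Python, with in-memory shards standing in for the per-node splits:

```python
# Toy sketch of the word-count pattern: each "node" counts its own shard
# locally, then the partial counts are merged into a global total.
from collections import Counter

def map_count(shard):
    """Local word counts for one shard of the input (one 'node')."""
    return Counter(word for line in shard for word in line.split())

def reduce_counts(partials):
    """Aggregate the per-shard totals into a global count."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

shards = [["the quick brown fox", "the lazy dog"],
          ["the fox jumps"]]
print(reduce_counts(map_count(s) for s in shards))
```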
Hadoop is designed for processing large data files (such as log files). The default block size is 64MB, so having millions of small records wouldn't really be a good fit for Hadoop.
To deal with the issue of having non-uniformly structured data, you might consider a NoSQL database (such as MongoDB), which is designed to handle data where a lot of columns are null.
Hadoop/MR is designed for batch processing and not for real-time processing, so some other alternative like Twitter Storm or HStreaming has to be considered.
Also, look at Hama for real-time processing of data. Note that real-time processing in Hama is still crude and a lot of improvement/work remains to be done.
I would recommend Storm or Flume. In either of these you may analyze each record as it comes in and decide what to do with it.
If your data volumes are not that great (and millions of records do not sound like much), I would suggest trying to get the most out of an RDBMS, even if your schema is not properly normalized.
I think even a table with the structure K1, K2, K3, Blob will be more useful.
NoSQL key-value stores are built to support schemaless data in various flavors, but their query capabilities are limited.
The only case I can think of as useful is MongoDB/CouchDB's capability to index schemaless data; you will be able to get records by some attribute value.
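As a sketch of that indexing capability, using the pymongo client; the connection string, collection, and field names are made up, and a running MongoDB instance is assumed:

```python
# Sketch: indexing one attribute across schemaless documents in MongoDB.
# Connection string, database, collection, and field names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
records = client["staging"]["records"]

# Documents need not share a schema; only some of them carry "customer_id".
records.insert_many([
    {"customer_id": 42, "source": "webshop", "total": 19.99},
    {"source": "legacy_feed", "raw": "<order>...</order>"},
])

records.create_index("customer_id")            # index the attribute you query by
for doc in records.find({"customer_id": 42}):  # fetch records by attribute value
    print(doc)
```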
Regarding Hadoop MapReduce, I think it is not useful unless you want to harness a lot of CPUs for your processing, have a lot of data, or need distributed sort capability.
