How to query a file in HDFS which has XML as one column - hadoop

Context:
I have data in a MySQL table with XML as one column.
For Ex: Table application has 3 fields.
id(integer) , details(xml) , address(text)
(In the real case I have 10-12 fields here.)
Now we want to query the whole table, with all of its fields, using Pig.
I transferred the data from MySQL into HDFS using Sqoop, with '\u0005' as the record delimiter and "`" as the column delimiter, to /x.xml.
Then I load the data from /x.xml into Pig using:
app = LOAD '/x.xml' USING PigStorage('\u0005') AS (id:int , details:chararray , address:chararray);
What is the best way to query such data?
Solutions that I can currently think of:
Use a custom loader: extend LoadFunc to read the data.
Load a particular column with an XML path loader and load the rest normally, if there is some way to do that. Please suggest whether this can be done.
All the examples I have seen that use XPath rely on the XMLLoader while loading the file.
For Ex:
A = LOAD 'xmls/hadoop_books.xml' using org.apache.pig.piggybank.storage.XMLLoader('BOOK') as (x:chararray);
Is Pig a good fit for querying this kind of data? Please suggest any alternative technologies that handle it more effectively.
The size of the data is around 500 GB.
FYI, I am new to the Hadoop ecosystem and might be missing something trivial.

Load a specific column:
Some other Stack Overflow answers suggest preprocessing the data with awk (generating a new input that contains only the XML part).
A nicer workaround is to generate the specific data with an extra FOREACH over the XML column, like:
B = FOREACH app GENERATE details;
and store it so that it can be loaded with an XML loader.
Check the StreamingXMLLoader.
(You can also check Apache Drill; it may support this case out of the box.)
Or use a UDF for the XML processing; in Pig you just hand over the relevant XML field.
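For the UDF route, here is a minimal sketch of what such an XML-extracting EvalFunc might look like. The class name and the two-argument convention (XML string plus XPath expression) are illustrative assumptions, not an existing library:

import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Takes (xmlString, xpathExpression) and returns the matching text, or null.
public class ExtractFromXml extends EvalFunc<String> {
    private final XPath xpath = XPathFactory.newInstance().newXPath();

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String xml = (String) input.get(0);        // e.g. the 'details' column
        String expression = (String) input.get(1); // e.g. '/application/name/text()'
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            // A malformed XML value should not fail the whole job.
            return null;
        }
    }
}

In Pig you would then DEFINE this class and call it inside a FOREACH over app, keeping id and address as ordinary fields.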

Related

Read data from multiple tables at a time and combine the data based on a where clause using NiFi

I have a scenario where I need to extract data from multiple database tables, including the schema, combine the data, and then write it to an Excel file.
In NiFi the general strategy is to read from something like a fact table with ExecuteSQL or some other SQL processor, then use LookupRecord to enrich the data with a lookup table. The catch in NiFi is that you can only do one table at a time, so you'd need one LookupRecord for each enrichment table. You could then write to a CSV file that you could open in Excel. There might be extensions elsewhere that can write directly to Excel, but I'm not aware of any in the standard NiFi distribution.

Write Hive Table using Spark SQL and JDBC

I am new to Hadoop and I am using a single node cluster (for development) to pull some data from a relational database.
Specifically, I am using Spark (version 1.4.1), via the Java API, to pull data for a query and write it to Hive. I have run into various problems (and have read the manuals and tried searching online), but I think I might be misunderstanding some fundamental part of this.
First, I thought I'd be able to read data into Spark, optionally run some Spark methods to manipulate the data and then write it to Hive through a HiveContext object. But, there doesn't seem to be any way to write straight from Spark to Hive. Is that true?
So I need an intermediate step. I have tried a few different methods of storing the data before writing to Hive and settled on writing an HDFS text file, since it seemed to work best for me. However, when writing the HDFS file, I get square brackets in the files, like this: [A,B,C]
So, when I load the data into Hive using the "LOAD DATA INPATH..." HiveQL statement, I get the square brackets in the Hive table!!
What am I missing? Or more appropriately, can someone please help me understand the steps I need to do to:
Run a SQL on SQL Server or Oracle DB
Write the data out to a Hive table that can be accessed by a dashboard tool.
My code right now looks something like this:
DataFrame df = sqlContext.read().format("jdbc").options(getSqlContextOptions(driver, dburl, query)).load(); // This step seems to work fine.
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsTextFile(getHdfsUri() + pathToFile); // This works, but writes the rows in square brackets, like: [1, AAA].
hiveContext.sql("CREATE TABLE BLAH (MY_ID INT, MY_DESC STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE");
hiveContext.sql("LOAD DATA INPATH '" + getHdfsUri() + hdfsFile + "' OVERWRITE INTO TABLE `BLAH`"); // Get's written like:
MY_INT MY_DESC
------ -------
AAA]
The INT column doesn't get written at all because the leading [ makes it no longer a numeric value and the last column shows the "]" at the end of the row in the HDFS file.
Please help me understand why this isn't working or what a better way would be. Thanks!
I am not locked into any specific approach, so all options would be appreciated.
Ok, I figured out what I was doing wrong. I needed to create a DataFrame with the HiveContext and use the DataFrame's write function with the com.databricks.spark.csv format so the table is written in a form Hive can read. This does not require an intermediate step of saving a file in HDFS, which is great, and it writes to Hive successfully.
DataFrame df = hiveContext.createDataFrame(rdd, struct);
df.select(cols).write().format("com.databricks.spark.csv").mode(SaveMode.Append).saveAsTable("TABLENAME");
I did need to create a StructType object, though, to pass into the createDataFrame method for the proper mapping of the data types (something like what is shown in the middle of this page: Support for User Defined Types for Java in Spark). And the cols variable is an array of Column objects, which is really just an array of column names (i.e. something like Column[] cols = {new Column("COL1"), new Column("COL2")};).
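For reference, a minimal sketch of how that struct and cols pair might be built for the two-column example above (the column names are simply taken from the earlier CREATE TABLE; adjust to your own query):

import java.util.Arrays;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Schema matching the rows produced by the JDBC query.
StructType struct = DataTypes.createStructType(Arrays.asList(
        DataTypes.createStructField("MY_ID", DataTypes.IntegerType, true),
        DataTypes.createStructField("MY_DESC", DataTypes.StringType, true)));

// The Column objects are really just the column names to select when writing.
Column[] cols = { new Column("MY_ID"), new Column("MY_DESC") };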
I think "Insert" is not yet supported.
http://spark.apache.org/docs/latest/sql-programming-guide.html#compatibility-with-apache-hive
To get rid of the brackets in the text file, you should avoid saveAsTextFile. Instead, try writing the contents using the HDFS API, i.e. FSDataOutputStream.
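If you do want to keep an intermediate text file, another option (a hedged sketch, not from the answer above; getHdfsUri() and pathToFile are the question's own helpers) is to format each Row into a delimited string yourself, so Row.toString() and its square brackets never reach HDFS:

import org.apache.spark.api.java.JavaRDD;

// Turn each Row into a comma-delimited line before writing it out.
JavaRDD<String> lines = df.javaRDD().map(row -> {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < row.length(); i++) {
        if (i > 0) sb.append(',');
        sb.append(row.get(i)); // null handling and quoting omitted for brevity
    }
    return sb.toString();
});
lines.saveAsTextFile(getHdfsUri() + pathToFile);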

Hive cannot query the tables save by calling saveAsTable in Spark

I was trying to use Hive to query the tables I saved using saveAsTable() on a Spark DataFrame. Everything works well when I query with hiveContext.sql(). However, when I switch to Hive and describe the table, the schema shows up as something like col array<string>, and the table is no longer queryable.
Any ideas on how to work through this? Is there a reliable way to make Hive understand the metadata defined in Spark, instead of explicitly defining the schema?
Sometimes I use Spark to infer the schema from the raw data, or read it from file formats like Parquet, so I don't want to define by hand tables that could be inferred automatically.
Thanks a lot for any advice!
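One commonly used workaround (a hedged sketch, not taken from this thread; the table and temp-table names are illustrative) is to register the DataFrame as a temporary table and let Hive itself create the table through a CTAS statement, so the metadata ends up in a form plain Hive can read rather than in Spark-specific table properties:

// Register the DataFrame for HiveQL, then create a real Hive table from it.
df.registerTempTable("tmp_df");
hiveContext.sql("CREATE TABLE my_hive_table STORED AS PARQUET AS SELECT * FROM tmp_df");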

Is there a common place to store data schemas in Hadoop?

I've been doing some investigation lately around using Hadoop, Hive, and Pig to do some data transformation. As part of that I've noticed that the schema of data files doesn't seem to be attached to the files at all. The data files are just flat files (unless you use something like a SequenceFile). Each application that wants to work with those files has its own way of representing their schema.
For example, I load a file into the HDFS and want to transform it with Pig. In order to work effectively with it I need to specify the schema of the file when I load the data:
EMP = LOAD 'myfile' USING PigStorage() AS (first_name: chararray, last_name: chararray, deptno: int);
Now, I know that when storing a file using PigStorage, the schema can optionally be written out alongside it, but in order to get a file into Pig in the first place it seems like you need to specify a schema.
If I want to work with the same file in Hive, I need to create a table and specify the schema with that too:
CREATE EXTERNAL TABLE EMP ( first_name string
, last_name string
, empno int)
LOCATION 'myfile';
It seems to me like this is extremely fragile. If the file format changes even slightly then the schema must be manually updated in each application. I'm sure I'm being naive but wouldn't it make sense to store the schema with the data file? That way the data is portable between applications and the barrier to using another tool would be lower since you wouldn't need to re-code the schema for each application.
So the question is: Is there a way to specify the schema of a data file in Hadoop/HDFS or do I need to specify the schema for the data file in each application?
It looks like you are looking for Apache Avro. With Avro your schema is embedded in your data, so you can read it without having to worry about schema issues and it makes schema evolution really easy.
The great thing about Avro is that it is completely integrated in Hadoop and you can use it with a lot of Hadoop sub-projects like Pig and Hive.
For example with Pig you could do:
EMP = LOAD 'myfile.avro' using AvroStorage();
I would advise looking at the documentation for AvroStorage for more details.
You can also work with Avro in Hive, as described here; I have not used that personally, but it should work the same way.
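To make the "schema travels with the data" point concrete, here is a small hedged Java sketch of writing an Avro data file; the record name and fields mirror the EMP example above, and readers such as Pig's AvroStorage or Hive's Avro SerDe pick the schema up from the file header:

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEmpAvro {
    public static void main(String[] args) throws Exception {
        // The schema is embedded in the file header, so no consumer has to re-declare it.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"EMP\",\"fields\":["
            + "{\"name\":\"first_name\",\"type\":\"string\"},"
            + "{\"name\":\"last_name\",\"type\":\"string\"},"
            + "{\"name\":\"empno\",\"type\":\"int\"}]}");

        GenericRecord rec = new GenericData.Record(schema);
        rec.put("first_name", "Grace");
        rec.put("last_name", "Hopper");
        rec.put("empno", 1);

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("myfile.avro"));
            writer.append(rec);
        }
    }
}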
What you need is HCatalog, which is:
"Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
Providing a shared schema and data type mechanism.
Providing a table abstraction so that users need not be concerned with where or how their data is stored.
Providing interoperability across data processing tools such as Pig, MapReduce, and Hive."
You can take a look at the "data flow example" in the docs to see exactly the scenario you are talking about
Apache Zebra seems to be a tool that could provide a common schema definition across MapReduce, Pig, and Hive. It has its own schema store. An MR job can use its built-in TableStore to write to HDFS.

Using Apache Hive as a MapReduce Input Format and/or Scraping Hive Metadata

Our environment leans heavily on storing data in Hive. I currently find myself working on something that is outside that scope, though. I have a MapReduce job written, but it requires a lot of direct user input for information that could easily be scraped from Hive. That said, when I query Hive for extended table data, all of the extended information is dumped into one or two columns as a giant blob of almost-JSON. Is there either a convenient way to parse this information or, better yet, a more direct way to get it?
Alternatively, if I could get pointed to documentation on manually using the CombinedHiveInputFormat, that would simplify my code a lot. But it seems like that InputFormat is only used inside of Hive, with its custom structs.
Ultimately, what I want is to know table names, columns (not including partitions), and partition locations for the split a mapper is working on. If there is yet another way to accomplish this, I am eager to know.
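Rather than parsing that "extended" blob, one option (a hedged sketch; the database and table names are placeholders) is to ask the metastore directly through its Thrift client, which exposes table names, columns, and partition locations as structured objects:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

public class ScrapeHiveMetadata {
    public static void main(String[] args) throws Exception {
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

        // Table names in a database.
        for (String name : client.getAllTables("default")) {
            System.out.println("table: " + name);
        }

        // Columns (partition keys are stored separately, so they are not mixed in here).
        Table table = client.getTable("default", "my_table");
        for (FieldSchema col : table.getSd().getCols()) {
            System.out.println("column: " + col.getName() + " " + col.getType());
        }

        // Partition locations, which can be matched against a mapper's input split path.
        for (Partition p : client.listPartitions("default", "my_table", (short) -1)) {
            System.out.println("partition location: " + p.getSd().getLocation());
        }

        client.close();
    }
}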
