In HDFS, partitioned data is stored as multiple files like
hdfs://user/hive/warehouse/TABLE_NAME/column_1="VALUE"/column_2="VALUE"/000000
Does BigQuery support loading these files as they are, or is it necessary to flatten the data into a single file?
Nothing is mentioned in the documentation regarding loading the files as they are.
Multiple files under the same directory can be loaded into BigQuery, so there is no need to flatten them.
Below is the sample code:
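# A few notes on the flags, as I understand them: --replace overwrites any existing data in
# the target partition, --quote "" disables the quote character, -F"\t" sets a tab field
# delimiter, the $ decorator after the table name targets a single partition, and the
# trailing * wildcard matches every file under the folder.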
bq load --replace --quote "" -F"\t" ${db_name}.${tgt_table_name}\$${bq_partition} gs://bucket_name/folder/*
Let me know if it helps or not.
I have many project reports in text format (Word and PDF). These files contain data that I want to extract, such as references, keywords, and names mentioned.
I want to process these files with Apache Spark and save the result to Hive, using the power of DataFrames (with the table of contents as the schema). Is that possible?
Could you share any ideas about how to process these files?
As far as I understand, you will need to parse the files using Tika and manually create custom schemas, as described here.
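For example, here is a rough sketch of that approach in Spark/Scala. It assumes Tika and Spark with Hive support are available; the directory path, the table name and the column names (file_name, raw_text) are just placeholders, and extracting references/keywords from the text would be custom parsing on top of this.
import java.io.File
import org.apache.tika.Tika
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

// Tika auto-detects the file type (Word, PDF, ...) and extracts plain text.
val tika = new Tika()
val reports = new File("/path/to/reports").listFiles().toSeq
val docs = reports.map(f => (f.getName, tika.parseToString(f)))

// Turn the (file name, text) pairs into a DataFrame and save it to Hive;
// a richer schema (e.g. one driven by the table of contents) would replace raw_text.
val df = docs.toDF("file_name", "raw_text")
df.write.mode("overwrite").saveAsTable("reports_raw")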
Let me know if this helps. Cheers.
I have a bunch of tables in Hive, stored as ORC. I want to index their data in a SolrCloud collection.
Is there any support for indexing data stored in ORC format in Solr?
I've googled around but nothing came up.
Looks like you want Solr to read data from a specific Hive file format.
You might look at the problem the other way around, i.e. use Hive to write data to Solr -- and thus let Hive take care of the complexity of the actual input file format (whether ORC, Parquet, Avro, whatever -- even HBase data files).
In the LucidWorks GitHub repo you will find a project labeled hive-solr. Have a look.
I'll accept Samson's answer.
Anyway, I'm not fully satisfied with this solution. In fact, I still need to create an external table, manually declaring all the fields of the original table. In terms of operations, it is not different from creating a new table (stored as textfile) from the original one, indexing the new text files, and finally dropping them (of course, this may be a problem for very large tables, which is not my case).
Since ORC is a self-describing format, it would be great if Solr could read both field names and data directly from the compressed files.
I am working on a Spark application that has to read multiple directories (i.e. multiple paths) from an S3 bucket and HDFS. I read that newHadoopAPI provides a great way to read Lzo compressed / indexed files in a performant way. But how do we read multiple folder paths / directories containing several Lzo files and index files into an RDD using newHadoopAPI?
The folder structure is like a Hive table partitioned on two columns, date and batch. For example:
/rootDirectory/date=20161002/batch=5678/001_0.lzo
/rootDirectory/date=20161002/batch=5678/001_0.lzo.index
/rootDirectory/date=20161002/batch=5678/002_0.lzo
/rootDirectory/date=20161002/batch=5678/002_0.lzo.index
/rootDirectory/date=20161002/batch=8765/001_0.lzo
/rootDirectory/date=20161002/batch=8765/001_0.lzo.index
/rootDirectory/date=20161002/batch=8765/002_0.lzo
/rootDirectory/date=20161002/batch=8765/002_0.lzo.index
..... and so on.
Now I use the code below to read the data from S3. It treats both the .lzo and .lzo.index files as input, which crashes my application; I don't want to read the .lzo.index files, just the .lzo files, using the index for speed.
val impInput = sparkSession.sparkContext.newAPIHadoopFile("s3://my-bucket/myfolder/*/*", classOf[NonSplittableTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
val impRDD = impInput.map(_._2.toString)
Could anyone please help me understand how I can do the following?
1). Read all the (multiple) folders under the root for the Lzo files using the newHadoopAPI, so that I can use the .index files to my benefit.
2). Read the data from HDFS in a similar fashion.
Adding a suffix to your path pattern may help, so that only the .lzo files are matched:
val impInput = sparkSession.sparkContext.newAPIHadoopFile("s3://my-bucket/myfolder/*/*.lzo", classOf[NonSplittableTextInputFormat],classOf[org.apache.hadoop.io.LongWritable],classOf[org.apache.hadoop.io.Text])
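If you want the .lzo.index files to actually be used for splitting rather than just filtered out, another option is the LzoTextInputFormat that ships with the hadoop-lzo library. A rough sketch, assuming hadoop-lzo is on the classpath and the paths below are placeholders:
import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

// Read the S3 partitions; only *.lzo files match the glob, and the sibling
// .lzo.index files are used to split them instead of being read as data.
val s3Input = sparkSession.sparkContext.newAPIHadoopFile(
  "s3://my-bucket/myfolder/*/*.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

// Same thing for the HDFS copy of the data.
val hdfsInput = sparkSession.sparkContext.newAPIHadoopFile(
  "hdfs:///rootDirectory/date=*/batch=*/*.lzo",
  classOf[LzoTextInputFormat], classOf[LongWritable], classOf[Text])

val impRDD = s3Input.union(hdfsInput).map(_._2.toString)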
I am learning to use Hadoop for performing Big Data related operations.
I need to perform some queries on a collection of data sets split across 8 CSV files. Each source file has multiple sheets, and the query concerns only one of the sheets (sheet name: Table4).
The dataset can be downloaded here : http://www.census.gov/hhes/www/hlthins/data/utilization/tables.html
A sample data snapshot is attached for quick reference.
I have already converted the above xls files to CSV.
I am not sure how to group the data while creating the table in Hive.
It will be really helpful if you can guide me here.
Note: I am a novice with Hadoop and Big Data, so if anyone could guide me with how to proceed further I'd be very grateful.
If you need information on the queries or anything else let me know.
Thanks!
I have a file on HDFS that is 78 GB in size.
I need to create an Impala external table over it to perform some grouping and aggregation on the available data.
Problem
The file contains headers.
Question
Is there any way to skip the headers while reading the file, and query only the rest of the data?
I do have a way to solve the problem: copy the file to a local machine, remove the headers, and copy the updated file back to HDFS. But that is not feasible, as the file size is too large.
Please suggest if anyone has any ideas...
Any suggestions will be appreciated....
Thanks in advance
UPDATE or DELETE row operations are not available in Hive/Impala, so you should simulate the DELETE as follows:
Load the data file into a temporary Hive/Impala table
Use INSERT INTO or CREATE TABLE AS on the temp table to create the required table
A straightforward approach would be to run the HDFS data through Pig to filter out the headers and generate a new HDFS dataset formatted so that Impala could read it cleanly.
A more arcane approach would depend on the format of the HDFS data. For example, if both header and data lines are tab-delimited, then you could read everything using a schema with all STRING fields and then filter or partition out the headers before doing aggregations.
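Not Pig, but since Spark comes up elsewhere in these threads, the same header-filtering pass can be sketched in Spark/Scala (assuming a SparkSession named spark; the paths are placeholders):
// Read the raw file; the header is simply its first line.
val raw = spark.sparkContext.textFile("hdfs:///data/bigfile.csv")
val header = raw.first()

// Drop every line equal to the header (this also covers the case where
// several part files each start with their own header line).
val data = raw.filter(_ != header)

// Write a cleaned copy that the Impala external table can point at.
data.saveAsTextFile("hdfs:///data/bigfile_noheader")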