Is it possible to make a loop in Hive to insert a bunch of random values into a table?
I understand that I can create a script in some programming language to create a CSV file with the needed number of rows and then load the CSV into Hive as an external table.
I want the table to have 1000000 rows. The schema:
name String,
s_name String,
age int
Thanks in advance.
The proper way is to use a CSV (or any other file format) to load data into Hive. If you don't want to use a programming language, you can use Excel (or any other analogue) to generate as many rows of random data as you need and then save them to a CSV file. Hope this helps.
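For reference, a minimal sketch of exposing such a CSV to Hive as an external table; the table name and HDFS location are assumptions, while the schema is the one from the question:
CREATE EXTERNAL TABLE random_people (
  name   STRING,
  s_name STRING,
  age    INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/random_people_csv/';  -- hypothetical directory holding the generated CSV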
Related
I have a Hive table (test) that I need to create in the Parquet format. I will be using a bunch of SEQUENCE files in order to create and insert into the table.
Once the table is created, is there a way to convert it into Parquet? I mean, I know we could have done, say
CREATE TABLE default.test( user_id STRING, location STRING)
PARTITIONED BY ( dt INT ) STORED AS PARQUET
initially, while creating the table itself. However, in my case I am forced to use SEQUENCE files to create the table first, because that is the format I have to begin with and I cannot directly convert it to Parquet.
Is there a way I could convert to Parquet after the table is created and the data inserted?
To convert from SequenceFile to Parquet you need to load the data (CTAS) into a new table.
The question is tagged with presto, so I am giving you Presto syntax for this. I am including partitioning, because the example in the question contains it.
CREATE TABLE test_parquet WITH(format='PARQUET', partitioned_by=ARRAY['dt']) AS
SELECT * FROM test_sequencefile;
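If you are working in Hive itself rather than Presto, a rough equivalent is the sketch below, assuming dynamic partitioning is enabled; table and column names are taken from the question:
CREATE TABLE default.test_parquet (user_id STRING, location STRING)
PARTITIONED BY (dt INT)
STORED AS PARQUET;

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE default.test_parquet PARTITION (dt)
SELECT user_id, location, dt FROM default.test;  -- partition column goes last in the SELECT list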
Problem statement
I need to replace certain dates in my SQL dump file and create another dump file. First I need to parse the create table definition and store information about date-type columns; I may need to skip certain columns. After that I need to parse the "insert into table" statements, break each statement into rows and then into columns, and then replace the dates.
Solution: I am using Spring Batch with a composite reader. I read the entire table definition (drop statement, create statement and insert statements) into memory and then pass it to a processor to replace the dates. During reading I also split the insert statements into rows and columns.
Problem: This solution works fine for small dumps but runs out of memory for large ones, e.g. one table contains long BLOBs and is 2 GB in size.
Any idea how I can fix this, or is Spring Batch not the right solution for it? Any help will be highly appreciated.
I'm trying to read a large gzip file into Hive through the Spark runtime to convert it into SequenceFile format, and I want to do this efficiently.
As far as I know, Spark supports only one mapper per gzip file, the same as it does for text files.
Is there a way to change the number of mappers for a gzip file being read? Or should I choose another format like Parquet?
I'm currently stuck.
The problem is that my log file is JSON-like data saved in txt format and then gzipped, so for reading I used org.apache.spark.sql.json.
The examples I have seen that show converting data into SequenceFile use simple delimiters, as in CSV format.
I used to execute this query:
CREATE TABLE table_1
USING org.apache.spark.sql.json
OPTIONS (path 'dir_to/file_name.txt.gz');
But now I have to rewrite it as something like this:
CREATE TABLE table_1(
ID BIGINT,
NAME STRING
)
COMMENT 'This is table_1 stored as sequencefile'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS SEQUENCEFILE;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' OVERWRITE INTO TABLE table_1;
LOAD DATA INPATH 'dir_to/file_name.txt.gz' INTO TABLE table_1;
INSERT OVERWRITE TABLE table_1 SELECT id, name FROM table_1_text;
INSERT INTO TABLE table_1 SELECT id, name FROM table_1_text;
Is this the optimal way of doing this, or is there a simpler approach to this problem?
Please help!
As a gzipped text file is not splittable, only one mapper will be launched; you have to choose another data format if you want to use more than one mapper.
If you have huge JSON files and want to save storage on HDFS, use bzip2 compression to compress your JSON files on HDFS. You can query bzip2-compressed JSON files from Hive without modifying anything.
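For illustration, a sketch of that approach with an external Hive table over the compressed JSON; the HCatalog JsonSerDe is used here, while the HDFS path and field names are assumptions:
CREATE EXTERNAL TABLE table_1_json (
  id   BIGINT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/logs/json_bz2/';  -- hypothetical directory holding the .bz2 files

-- bzip2 is splittable, so several mappers can read it; from here the
-- SequenceFile table can be populated with the INSERT ... SELECT shown above.
INSERT OVERWRITE TABLE table_1 SELECT id, name FROM table_1_json;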
I have written a Pig script that generates tuples for a Hive table. I am trying to dump the results to a specific partition in HDFS where Hive stores the table data. As of now the partition value I am using is a timestamp string that is generated inside the Pig script. I have to use this timestamp string to store my Pig script results, but I have no idea how to do that. Any help would be greatly appreciated.
If I understand it right, you read some data from a partition of a Hive table and want to store it into another Hive table's partition, right?
A Hive partition (from the HDFS perspective) is just a subfolder whose name is constructed like this: fieldname_the_partitioning_is_based_on=value
For example, if you have a date partition, it looks like this: hdfs_to_your_hive_table/date=20160607/
So all you need is to specify this output location in the STORE statement:
STORE mydata INTO '$HIVE_DB.$TABLE' USING org.apache.hive.hcatalog.pig.HCatStorer('date=$today');
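Note that HCatStorer expects the target Hive table, including its partition column, to exist already. A minimal sketch of such a DDL; the column names, types and storage format are assumptions, only the date partition column comes from the example:
CREATE TABLE my_hive_db.my_table (
  col1 STRING,
  col2 STRING
)
PARTITIONED BY (`date` STRING)  -- matches the 'date=$today' partition in the STORE statement
STORED AS ORC;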
Based on the CSV file's column header it should create a table dynamically and also insert the records of that CSV file into the newly created table.
Ex:
1) If I upload a file TEST.csv with 3 columns, it should create a table dynamically with three columns.
2) Again, if I upload a new file called TEST2.csv with 5 columns, it should create a table dynamically with five columns.
Every time it should create a table based on the uploaded CSV file's header.
How can I achieve this in Oracle APEX?
Thanks in advance.
Without creating new tables you can treat the CSVs as tables using a TABLE function you can SELECT from. If you download the packages from the Alexandria Project you will find a function that does just that inside CSV_UTIL_PKG (clob_to_csv is this function, but you will find other goodies in there).
You would just upload the CSV, store it in a CLOB column, and then build reports on it using the CSV_UTIL_PKG code.
If you must create a new table for the upload you could still use this parser. Upload the file and then select just the first row (e.g. SELECT * FROM csv_util_pkg.clob_to_csv(your_clob) WHERE ROWNUM = 1). You could insert this row into an Apex Collection using APEX_COLLECTION.CREATE_COLLECTION_FROM_QUERY to make it easy to then iterate over each column.
You would need to determine the datatype for each column but could just use VARCHAR2 for everything.
But if you are just using generic columns, you could just as easily store one additional column holding a name for this collection of records and keep all of the uploads in the same table. Just build another table to store the column names.
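For illustration, a hedged sketch of the parser-based approach; the csv_uploads table, its columns and the bind variable are hypothetical, and older database versions may need a TABLE() wrapper around the pipelined function (check the CSV_UTIL_PKG spec for the exact return columns):
-- Parse the uploaded CLOB and keep only the header row
SELECT *
  FROM csv_util_pkg.clob_to_csv(
         (SELECT csv_body FROM csv_uploads WHERE id = :P1_UPLOAD_ID))
 WHERE ROWNUM = 1;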
Simply store this file as a BLOB if the structure is "dynamic".
You can use the XML data type for this use case too, but it won't be very different from a BLOB column.
There has been a SecureFile feature since 11g. It is a new BLOB implementation that performs better than the regular BLOB, and it is good for unstructured or semi-structured data.
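A minimal sketch of such a table; the table and column names are assumptions:
CREATE TABLE uploaded_files (
  id        NUMBER PRIMARY KEY,
  filename  VARCHAR2(400),
  file_body BLOB
)
LOB (file_body) STORE AS SECUREFILE;  -- SecureFile LOB storage, available since 11g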