Magento: How to load large tables into custom db table?

I have a sql installer file for my custom Magento module. It attempts to insert many thousands of rows into a custom database table but it runs out of memory and the module doesn't install.
Everything works fine if I load the table manually with the normal mysql client, and there is no 'memory balloon' doing it that way.
I would like my module to work as a module, without having to do anything manually on the command line. Is there any way I can break down my installer file or call some external routine to get the data in?

You could distribute a CSV file containing the data with your module and use MySQL's LOAD DATA command to load the data into the table you create in your upgrade script.
Maybe something like:
// write connection from the core resource singleton
$db = Mage::getSingleton('core/resource')->getConnection('core_write');
// path to the CSV shipped inside the module's sql/ directory
$filename = Mage::getBaseDir('code').'/local/Your/Module/sql/your_module_setup/foo.csv';
$sql = "LOAD DATA LOCAL INFILE '".$filename."' INTO TABLE foo FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\r\n'";
$db->query($sql);
You can, of course, run further queries if you need to process the data somehow.
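For instance, a small follow-up query (purely illustrative, against the same foo table as above) could be run through the same $db connection to tidy the imported values:
UPDATE foo SET name = TRIM(name);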

You can also split the table creation and the inserts across separate upgrade scripts, so that no single script has to load all of the rows at once.
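For example, a rough sketch of that layout (the file names assume Magento 1's plain-SQL setup script naming, and the foo table and its columns are just placeholders):
-- sql/your_module_setup/mysql4-install-0.1.0.sql: create the table only
CREATE TABLE foo (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
-- sql/your_module_setup/mysql4-upgrade-0.1.0-0.1.1.sql: first batch of rows
INSERT INTO foo (name) VALUES ('row 1'), ('row 2');
-- sql/your_module_setup/mysql4-upgrade-0.1.1-0.1.2.sql: next batch, and so on
INSERT INTO foo (name) VALUES ('row 1001'), ('row 1002');
Each script then stays small enough to run within the PHP memory limit.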

Related

Creating txt file using Pentaho

I'm currently trying to create txt files from all tables in the dbo schema.
I have around 200 to 300 tables there, so it would take too much time to create them manually.
I was thinking of creating a loop.
So, as an example (using AdventureWorks2019):
select t.name as table_name
from sys.tables t
where schema_name(t.schema_id) = 'Person'
order by table_name;
This would get all the table names within the Person schema.
So I would loop:
Table input: select * from ${table_name}
But then I realized that for txt files I need to declare all the fields and their data types in Pentaho, which becomes a problem.
Any ideas how to produce these "backup" txt files?
Use Metadata Injection together with further queries against the schema catalog tables in SQL Server. You not only need to retrieve the table names; for each table you also need to retrieve its columns and their data types, and inject that information (metadata) into the Text file output step.
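For example, a minimal sketch of the column metadata query, using the standard INFORMATION_SCHEMA views (the schema name is the one from the question; adjust as needed):
SELECT c.TABLE_NAME,
       c.COLUMN_NAME,
       c.DATA_TYPE,
       c.CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS c
WHERE c.TABLE_SCHEMA = 'Person'
ORDER BY c.TABLE_NAME, c.ORDINAL_POSITION;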
The samples directory of your Spoon installation contains an example of how to use Metadata Injection; use it, along with the documentation, to build a simple example (the option to generate a transformation with the metadata you have injected is very useful for debugging).
I have something similar for copying data from one database to another, both in Oracle, but SQL Server has catalog tables similar to Oracle's for retrieving the information you need. I created a simple, almost empty transformation that reads one table and writes to another. This transformation has almost no information in it, only the source database in the Table input step and the target database in the Table output step.
Then I have a second transformation where I fill in all the information (metadata) to inject: the query to perform in the Table input step, and everything the Table output step needs: the target table, whether to truncate before inserting, and the mapping of columns from (stream field) to (Table field).

SQL Loader in Oracle

I am inserting data from a CSV file into an Oracle table using SQL*Loader, and it is working fine.
LOAD DATA
INFILE DataOut.txt
BADFILE dataFile.bad
APPEND INTO TABLE ASP_Net_C_SHARP_Articles
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(ID,Name,Category)
The above settings are being used to do that, but I do not want to specify any of the column names, e.g. (ID,Name,Category).
Is this possible, and if so, can anybody tell me how?
In SQL*Loader you need to specify the column names. If you still want to avoid hard-coding them in the control file, I would suggest using SQL to "discover" the names of the columns, dynamically generating the control file, and wrapping the whole thing in a shell script to make it more automated.
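For example, a minimal sketch of such a discovery query (it assumes you are connected as the table owner, that the table was created without quoted identifiers so its name is stored in upper case, and that you are on Oracle 11gR2 or later for LISTAGG); its output can then be spliced into a generated control file:
SELECT LISTAGG(column_name, ',') WITHIN GROUP (ORDER BY column_id) AS column_list
FROM user_tab_columns
WHERE table_name = 'ASP_NET_C_SHARP_ARTICLES';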
Alternatively, you can consider external tables, which use the SQL*Loader engine, so you will still have to perform some dynamic generation for your input file as suggested above. But you can create a script that scans the input file and generates the CREATE TABLE ... ORGANIZATION EXTERNAL command for you. The data then becomes available as if it were a table in your database.
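A minimal sketch of such an external table, assuming a directory object named DATA_DIR pointing at the folder that holds DataOut.txt (the column types are guesses based on the question):
CREATE TABLE asp_net_c_sharp_articles_ext (
    id       NUMBER,
    name     VARCHAR2(100),
    category VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
    TYPE ORACLE_LOADER
    DEFAULT DIRECTORY data_dir
    ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    )
    LOCATION ('DataOut.txt')
);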
You can also partially skip the columns if that would help you, by using FILLER. BOUNDFILLER (available with Oracle 9i and above) can be used if the skipped column's value will be required later again.

Using bash to send hive script a variable number of fields

I'm automating a data pipeline by using a bash script to move csvs to HDFS and build external Hive tables on them. Currently, this only works when the format of the table is predefined in an .hql file. But I want to be able to read the headers from the CSV and send them as arguments to Hive. So currently I do this inside a loop through the files:
# bash
hive -S -hiveconf VAR1=$target_db -hiveconf VAR2=$filename -hiveconf VAR3=$target_folder/$filename -f create_tables.hql
Which is sent to this...
-- hive
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
individual_pkey INT,
response CHAR(1))
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
I want the hive script to look more like this...
CREATE DATABASE IF NOT EXISTS ${hiveconf:VAR1};
CREATE EXTERNAL TABLE IF NOT EXISTS ${hiveconf:VAR1}.${hiveconf:VAR2}(
${hiveconf:ROW1} ${hiveconf:TYPE1},
... ...
${hiveconf:ROW_N} ${hiveconf:TYPE_N})
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/${hiveconf:VAR3}';
Is it possible to send it some kind of array that it would parse? Is this feasible or advisable?
I eventually figured out a way around this.
You can't really write an HQL script that takes in a variable number of fields. You can, however, write a bash script that generates an HQL script of variable length. I've implemented this for my team, but the general idea is to write out how you want the HQL to look as a string in bash, then use something like Rscript to read in and identify the data types of your CSV. Store the data types as an array along with the CSV headers and then loop through those arrays, writing the information to the HQL.

Inserting local csv to a Hive table from Qubole

I have a csv on my local machine, and I access Hive through the Qubole web console. I am trying to upload the csv as a new table, but couldn't figure out how. I have tried the following:
LOAD DATA LOCAL INPATH <path> INTO TABLE <table>;
I get an error saying No files matching path file.
I am guessing that the csv has to be on some remote server where Hive is actually running, and not on my local machine. The solutions I saw don't explain how to handle this issue. Can someone help me out regarding this?
Qubole allows you to define Hive external/managed tables on data sitting in your cloud storage (S3 or Azure storage), so LOAD from your local box won't work. You will have to upload the file to your cloud storage and then define an external table against it:
CREATE External TABLE orc1ext(
`itinid` string, itinid1 string)
stored as ORC
LOCATION
's3n://mybucket/def.us.qubole.com/warehouse/testing.db/orc1';
INSERT INTO TABLE orc1ext SELECT itinid, itinid
FROM default.default_qubole_airline_origin_destination LIMIT 5;
First, create a table in Hive using the field names present in your csv file; the syntax you are using seems correct.
Use the syntax below for creating the table:
CREATE TABLE foobar(key string, stats map<string, bigint>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':' ;
and then load the data using the format below, making sure the path name is correct:
LOAD DATA LOCAL INPATH '/yourfilepath/foobar.csv' INTO TABLE foobar;

What is the best way to produce large results in Hive

I've been trying to run some Hive queries with largish result sets. My normal approach is to submit a job through the WebHCat API, and read the results from the resulting stdout file, or to just run hive at the console and pipe stdout to a file. However, with large results (more than one reducer used), the stdout is blank or truncated.
My current solution is to create a new table from the results (CREATE TABLE ... AS SELECT), which introduces an extra step and leaves a table to clean up afterwards if I don't want to keep the result set.
Does anyone have a better method for capturing all the results from such a Hive query?
You can write the data directly to a directory on either hdfs or the local file system, then do what you want with the files. For example, to generate CSV files:
INSERT OVERWRITE DIRECTORY '/hive/output/folder'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;
This is essentially the same as CREATE TABLE ... AS SELECT, but you don't have to clean up the table. Here's the full documentation:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries
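There is also a LOCAL variant, which (as far as I recall) writes the files to the local file system of the machine running the Hive client rather than to HDFS; the output path below is just a placeholder:
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive_output'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
SELECT ... FROM ...;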
