Hive: remove files from the distributed cache - Hadoop

I can add stuff to the distributed cache via
add file largelookuptable
and then run a bunch of HQL.
Now when I have a series of commands like the following:
add file largelookuptable1;
select blah from blahness using somehow largelookuptable1;
add file largelookuptable2;
select newblah from otherblah using largelookuptable2;
In this case largelookuptable1 is unnecessarily available for the second query. Is there a way I can get rid of it before the second query runs?

On the Hive CLI, type:
delete file largelookuptable1;
Same thing applies to jars added to distributed cache.
Syntax (from Hive CLI):
Usage: delete [FILE|JAR|ARCHIVE] <value> [<value>]*
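Applied to the sequence from the question (keeping its placeholder query syntax), that would look like:
add file largelookuptable1;
select blah from blahness using somehow largelookuptable1;
delete file largelookuptable1;
add file largelookuptable2;
select newblah from otherblah using largelookuptable2;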

Related

How to use Airflow for real-time data processing

I have a scenario where I want to process a CSV file and load it to some other database:
Cases
pick a csv file and load it to mysql with the same name as the csv
then do some modification on the loaded rows using a python task file
after that extract data from mysql and load it to some other database
CSV files arrive from a remote server into a folder on the Airflow server.
We have to pick up these csv files and process them through a python script.
Suppose I pick one csv file; then I need to pass this csv file to the rest of the operators in a dependency manner, like
filename : abc.csv
task1 >> task2 >> task3 >> task4
So abc.csv should be available for all the tasks.
Please tell me how to proceed.
Your scenarios don't have anything to do with real time. This is ingesting on a schedule/interval. Or perhaps you could use a Sensor operator to detect data availability.
Implement each of your requirements as functions and call them from operator instances.
Add the operators to a DAG with a schedule appropriate for your incoming feed.
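As a rough illustration only (not the answer's own code: the DAG id, schedule and function bodies are placeholders), the three requirements could be wired up along these lines:
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def load_csv_to_mysql(**context):
    # requirement 1: pick up the csv file and load it into mysql
    pass

def modify_loaded_rows(**context):
    # requirement 2: modify the loaded rows
    pass

def export_to_other_db(**context):
    # requirement 3: extract from mysql and load into the other database
    pass

dag = DAG(
    'csv_pipeline',                    # placeholder dag id
    schedule_interval='*/15 * * * *',  # match this to your incoming feed
    start_date=datetime(2019, 1, 1),
)

task1 = PythonOperator(task_id='load_csv_to_mysql', python_callable=load_csv_to_mysql, provide_context=True, dag=dag)
task2 = PythonOperator(task_id='modify_loaded_rows', python_callable=modify_loaded_rows, provide_context=True, dag=dag)
task3 = PythonOperator(task_id='export_to_other_db', python_callable=export_to_other_db, provide_context=True, dag=dag)

task1 >> task2 >> task3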
How you pass and access params (see the sketch below):
- kwargs to the python_callable when initializing an operator
- context['param_key'] in the execute method when extending an operator
- jinja templates
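A minimal sketch of the first and third mechanisms (the function name, task id and op_kwargs values are invented for illustration, and dag refers to an already defined DAG such as the one above); the second mechanism only applies when you extend BaseOperator and read the values from the context inside execute:
from airflow.operators.python_operator import PythonOperator

def process_file(filename, **context):
    # filename arrives via op_kwargs; run_date is rendered from a jinja template
    run_date = context['templates_dict']['run_date']
    print("processing %s for %s" % (filename, run_date))

process_task = PythonOperator(
    task_id='process_file',
    python_callable=process_file,
    op_kwargs={'filename': 'abc.csv'},        # kwargs passed to the python_callable
    templates_dict={'run_date': '{{ ds }}'},  # jinja template, rendered at runtime
    provide_context=True,
    dag=dag,
)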
relevant...
airflow pass parameter from cli
execution_date in airflow: need to access as a variable
The way tasks communicate in Airflow is XCom, but it is meant for small values, not for file content.
If you want your tasks to work with the same csv file, you should save it to some location and then pass the path to that location via XCom.
We are using the LocalExecutor, so the local file system is fine for us.
We decided to create a folder for each dag, named after the dag. Inside that folder we generate a folder for each execution date (we do this in the first task, which we always call start_task). Then we pass the path of this folder to the subsequent tasks via XCom.
Example code for the start_task:
import os

# DATE_FORMAT and _create_folder_delete_if_exists are defined elsewhere in our project
def start(share_path, **context):
    # build <share_path>/<dag name>/<execution date> and recreate it from scratch
    execution_date_as_string = context['execution_date'].strftime(DATE_FORMAT)
    execution_folder_path = os.path.join(share_path, 'my_dag_name', execution_date_as_string)
    _create_folder_delete_if_exists(execution_folder_path)
    # publish the folder path so downstream tasks can pick it up
    task_instance = context['task_instance']
    task_instance.xcom_push(key="execution_folder_path", value=execution_folder_path)

start_task = PythonOperator(
    task_id='start_task',
    provide_context=True,
    python_callable=start,
    op_args=[share_path],
    dag=dag
)
The share_path is the base directory for all dags; we keep it in the Airflow Variables.
Subsequent tasks can get the execution folder with:
execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
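For example, a downstream task could read a file from that folder like this (a sketch only: the file name abc.csv and the task id are assumptions, and dag and start_task come from the DAG file above):
import csv
import os

from airflow.operators.python_operator import PythonOperator

def load_csv(**context):
    task_instance = context['task_instance']
    execution_folder_path = task_instance.xcom_pull(task_ids='start_task', key='execution_folder_path')
    csv_path = os.path.join(execution_folder_path, 'abc.csv')  # hypothetical file name
    with open(csv_path) as csv_file:
        for row in csv.reader(csv_file):
            pass  # process the row

load_task = PythonOperator(
    task_id='load_task',
    provide_context=True,
    python_callable=load_csv,
    dag=dag,
)
start_task >> load_task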

Export data to CSV using Hive SQL

How do I export a Hive table/select query to CSV? I have tried the command below, but it creates the output as multiple files. Any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1, field2, field3 FROM table1
Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, add an ORDER BY to force everything onto a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1, field2, field3 FROM table1 ORDER BY field1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
You can also try to merge the files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
You can also concatenate the files after getting them from Hadoop. For example, the
hadoop fs -cat /hdfspath > some.csv
command gives you the output in one file.
If you want a header, you can use sed along with Hive. See this link, which discusses various options for exporting Hive to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229

Informatica Post command task

I am working with multiple source files with a single source instance. I created three flat files and one destination table to experiment with multiple sources. I am using the 'File list' concept; for that I created a text file which contains all the flat file names.
Example:
Filename: File_list.txt
File contents:
Price1.txt
Price2.txt
Price3.txt
In the above example Price1.txt, Price2.txt and Price3.txt are flat file names. I specified File_list.txt as the source file while running the workflow in Informatica, so it iterates through all the flat files listed in File_list.txt and inserts all the values into the destination table.
Now, once the data is inserted into the destination, I need to delete those source files from that directory location.
How can I achieve this?
You'll need to write a custom script that takes File_list.txt as input and performs the delete operations. You can then call it from the Post-Session Success Command session component, or as a separate Command Task in the workflow, linked using a $YourSessionName.Status = SUCCEEDED condition.
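A minimal sketch of such a script in Python, assuming the listed files live next to File_list.txt (the script name and its argument are hypothetical; call it from the Post-Session Success Command, e.g. python delete_sources.py /path/to/File_list.txt):
import os
import sys

def delete_listed_files(file_list_path):
    # resolve each listed file name relative to the directory of the file list
    base_dir = os.path.dirname(os.path.abspath(file_list_path))
    with open(file_list_path) as file_list:
        for line in file_list:
            name = line.strip()
            if not name:
                continue
            source_file = os.path.join(base_dir, name)
            if os.path.isfile(source_file):
                os.remove(source_file)

if __name__ == '__main__':
    delete_listed_files(sys.argv[1])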

Working with zips in PySpark

I have n zips in a directory and I want to extract each one of them, pull some data out of a file or two inside each zip, and add it to a graph DB. I have a sequential Python script for this whole thing, but I am stuck at converting it to Spark. All of my zips are in an HDFS directory, and the graph DB is Neo4j. I have yet to learn about connecting Spark with Neo4j, but I am stuck at a more initial step.
I am thinking my code should be along these lines.
# Names of all my zips
zip_names = ["a.zip", "b.zip", "c.zip"]
# function extract_and_populate_graphdb() returns 1 after doing all the work.
# This was done so that a closure can be applied to start the spark job.
sc.parallelize(zip_names).map(extract_and_populate_graphdb).reduce(lambda a, b: a + b)
What I can't figure out is how to extract the zips and read the files within. I was able to read a zip with sc.textFile, but running take(1) on it returned hex data.
So, is it possible to read in a zip and extract the data? Or should I extract the data before putting it into HDFS? Or maybe there's some other approach to deal with this?
Updated answer:
If you'd like to use Gzip compressed files, there are parameters you can set when you configure your Spark shell or Spark job that allow you to read and write compressed data.
--conf spark.hadoop.mapred.output.compress=true \
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
--conf spark.hadoop.mapred.output.compression.type=BLOCK
Add those to the bash script you are currently using to create a shell (e.g. pyspark) and you can read and write compressed data.
Unfortunately, there is no innate support for Zip files, so you'll need to do a bit more legwork to get there.
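One workaround, sketched under the assumptions that the zip members are plain text and that the HDFS path below is adjusted to yours: since zips are not splittable anyway, read each archive whole with binaryFiles and unpack it in memory with the standard zipfile module.
import io
import zipfile

def read_zip_members(path_and_bytes):
    # yield (zip_path, member_name, member_text) for every regular file inside one zip
    zip_path, raw_bytes = path_and_bytes
    with zipfile.ZipFile(io.BytesIO(raw_bytes)) as archive:
        for member in archive.namelist():
            if not member.endswith('/'):  # skip directory entries
                yield zip_path, member, archive.read(member).decode('utf-8')

# binaryFiles returns one (path, bytes) pair per zip, so each archive stays intact
members = sc.binaryFiles('hdfs:///path/to/zips/*.zip').flatMap(read_zip_members)
From there each record can be parsed and written to the graph DB, for example from a mapPartitions step so that one connection is opened per partition.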

Is it possible to reference another SQL file from an SQL script?

Basically I want to execute an SQL file from an SQL file in Postgres.
Similar question for mysql: is it possible to call a sql script from a stored procedure in another sql script?
Why?
Because I have 2 data files in a project and I want to have one line that can be commented/un-commented to load the second file.
Clarification:
I want to call B.SQL from A.SQL
Clarification 2:
This is for a Spring project that uses Hibernate to create the database from the initial SQL file (A.SQL).
On further reflection it seems I may have to handle this from Java/Spring/Hibernate.
Below is the configuration file:
spring.datasource.url=jdbc:postgresql://localhost:5432/dbname
spring.datasource.username=postgres
spring.datasource.password=root
spring.datasource.driver-class-name=org.postgresql.Driver
spring.datasource.data=classpath:db/migration/postgres/data.sql
spring.jpa.hibernate.ddl-auto=create
Importing other files is not supported in SQL itself, but if you execute the script with psql you can use the \i syntax:
SELECT * FROM table_1;
\i other_script.sql
SELECT * FROM table_2;
This will probably not work if you execute the SQL with clients other than psql.
Hibernate just:
reads all your SQL files line by line
strips any comment (lines starting with --, // or /*)
removes any ; at the end
executes the result as a single statement
(see SchemaExport.importScript and SingleLineSqlCommandExtractor)
There is no support for an include here.
What you can do:
Define your own ImportSqlCommandExtractor which knows how to include a file - you can set that extractor with hibernate.hbm2ddl.import_files_sql_extractor=(fully qualified class name)
Define your optional file as an additional import file with hibernate.hbm2ddl.import_files=prefix.sql,optional.sql,postfix.sql; you can add and remove the file reference as you like, or even exclude the file from your artifact - a missing file will only produce a debug message (see the example below).
Create an Integrator which sets the hibernate.hbm2ddl.import_files property dynamically, depending on some environment property.
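For the second option in a Spring Boot setup like the one in the question, the Hibernate property would presumably be passed through via the spring.jpa.properties.* prefix; a hedged example, with file names mirroring the A.SQL/B.SQL from the question:
# A.sql always runs; B.sql is optional - if it is missing, only a debug message is logged
spring.jpa.properties.hibernate.hbm2ddl.import_files=A.sql,B.sql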
