How do I use Pentaho Kitchen to connect to my central database repository from the command line?
Set up your connection in repositories.xml; you probably already have one of these if you have been using Spoon. Make sure repositories.xml exists in the .kettle directory of the installation where you are running Kitchen.
Then simply use these command-line options:
/rep "YOUR REPO NAME"
/user "REPO USER"
/pass "REPO PSS"
Below is a Windows batch script example that runs a Pentaho Data Integration (Kettle) job:
@echo off
SET LOG_PATHFILE=C:\logs\KITCHEN_name_of_job_%DATETIME%.log
call Kitchen.bat /rep:"name_repository" /job:"name_of_job" /dir:/foo/sub_foo1 /user:dark /pass:vador /level:Detailed >> %LOG_PATHFILE%
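For comparison, a minimal sketch of the equivalent call on Linux with kitchen.sh (same placeholder repository, job, directory and credentials as in the batch example above; the log path is just an assumed example):
# Sketch only: placeholders taken from the Windows example above.
LOG_PATHFILE=/var/log/kitchen_name_of_job_$(date +%Y%m%d_%H%M%S).log
./kitchen.sh -rep="name_repository" -job="name_of_job" -dir=/foo/sub_foo1 \
             -user=dark -pass=vador -level=Detailed >> "$LOG_PATHFILE" 2>&1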
The repository "name_repository" must be defined in /users/.kettle/repositories.xml. Just below is an example of this file:
<?xml version="1.0" encoding="UTF-8"?>
<repositories>
  <connection>
    <name>name_repository</name>
    <server>hostname</server>
    <type>MYSQL</type>
    <access>Native</access>
    <database>name_database_repository</database>
    <port>9090</port>
    <username>[name]</username>
    <password>[password]</password>
    <servername/>
    <data_tablespace/>
    <index_tablespace/>
    <attributes>
      <attribute><code>EXTRA_OPTION_MYSQL.defaultFetchSize</code><attribute>500</attribute></attribute>
      <attribute><code>EXTRA_OPTION_MYSQL.useCursorFetch</code><attribute>true</attribute></attribute>
      <attribute><code>FORCE_IDENTIFIERS_TO_LOWERCASE</code><attribute>N</attribute></attribute>
      <attribute><code>FORCE_IDENTIFIERS_TO_UPPERCASE</code><attribute>N</attribute></attribute>
      <attribute><code>IS_CLUSTERED</code><attribute>N</attribute></attribute>
      <attribute><code>PORT_NUMBER</code><attribute>9090</attribute></attribute>
      <attribute><code>QUOTE_ALL_FIELDS</code><attribute>N</attribute></attribute>
      <attribute><code>STREAM_RESULTS</code><attribute>Y</attribute></attribute>
      <attribute><code>SUPPORTS_BOOLEAN_DATA_TYPE</code><attribute>N</attribute></attribute>
      <attribute><code>USE_POOLING</code><attribute>N</attribute></attribute>
    </attributes>
  </connection>
  <repository>
    <id>KettleDatabaseRepository</id>
    <name>name_repository</name>
    <description>The Pentaho Data Integration (Kettle) repository</description>
    <connection>name_repository</connection>
  </repository>
</repositories>
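If Kitchen cannot find the repository, it is usually reading a different .kettle directory. A quick check on a Unix-like system might look like this (a sketch; KETTLE_HOME is only needed when your .kettle directory is not in the user's home, and /opt/pentaho is just an assumed example path):
# Kitchen reads repositories.xml from $HOME/.kettle by default.
ls ~/.kettle/repositories.xml
# If your .kettle lives elsewhere, point KETTLE_HOME at its parent directory
# before calling kitchen.sh / Kitchen.bat:
export KETTLE_HOME=/opt/pentaho        # Kitchen then looks in /opt/pentaho/.kettle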
I am trying to get my Pact Broker working in my environment. I have the broker running in Kubernetes under https://mydomain/pactbroker (image: dius/pactbroker).
I am able to publish to the broker with the Maven plugin. However, when I try to verify, I get an error: Request to path '/' failed with response 'HTTP/1.1 401 Unauthorized'.
Can someone help me out?
<build>
  <plugins>
    <plugin>
      <groupId>au.com.dius</groupId>
      <artifactId>pact-jvm-provider-maven</artifactId>
      <version>4.0.10</version>
      <configuration>
        <serviceProviders>
          <!-- You can define as many as you need, but each must have a unique name -->
          <serviceProvider>
            <name>FaqService</name>
            <protocol>http</protocol>
            <host>localhost</host>
            <port>8080</port>
            <pactBroker>
              <url>https://mydomain/pactbroker/</url>
              <authentication>
                <scheme>basic</scheme>
                <username>user</username>
                <password>pass</password>
              </authentication>
            </pactBroker>
          </serviceProvider>
        </serviceProviders>
      </configuration>
    </plugin>
  </plugins>
</build>
Added information (Jun 18, 12:52 CET):
When going through the logs, it seems the plugin tries to fetch the HAL root information via the path "/". However, it responds with:
[WARNING] Could not fetch the root HAL document
When I enable preemptive authentication, I can see that it gives a warning like:
[WARNING] Using preemptive basic authentication with the pact broker at https://mydomain
i.e. the host only, without the /pactbroker path.
Have you confirmed you can use the broker correctly outside of Maven?
e.g. can you curl --user user:pass https://mydomain/pactbroker/ and get back an API result? Can you visit it in the browser?
You may also need to make sure all relative links etc. work. See https://docs.pact.io/pact_broker/configuration#running-the-broker-behind-a-reverse-proxy and docs for whatever proxy you have in front of it.
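As a quick sanity check outside Maven, something along these lines should return the broker's HAL index (a sketch; user:pass and the URL are the placeholders from the question, and the second request follows what is typically the pacticipants link, with the endpoint name assumed here):
# The HAL index document should come back as JSON with an "_links" section.
curl --user user:pass -H "Accept: application/hal+json" https://mydomain/pactbroker/
# Relative links should also resolve under the /pactbroker sub-path, e.g.:
curl --user user:pass https://mydomain/pactbroker/pacticipants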
The issue was with Pact itself. An issue was raised and the fix should be merged into the next release soon (4.1.4).
Using HDP 2.4 and HAWQ 2.0.
I want to read JSON data kept in an HDFS path into a HAWQ external table.
I followed the steps below to add the new JSON plugin to PXF and read the data.
Download the plugin "json-pxf-ext-3.0.1.0-1.jar" from
https://bintray.com/big-data/maven/pxf-plugins/view#
Copy the plugin into /usr/lib/pxf.
Create the external table:
CREATE EXTERNAL TABLE ext_json_mytestfile ( created_at TEXT,
id_str TEXT, text TEXT, source TEXT, "user.id" INTEGER,
"user.location" TEXT,
"coordinates.type" TEXT,
"coordinates.coordinates[0]" DOUBLE PRECISION,
"coordinates.coordinates[1]" DOUBLE PRECISION)
LOCATION ('pxf://localhost:51200/tmp/hawq_test.json'
'?FRAGMENTER=org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter'
'&ACCESSOR=org.apache.hawq.pxf.plugins.json.JsonAccessor'
'&RESOLVER=org.apache.hawq.pxf.plugins.json.JsonResolver'
'&ANALYZER=org.apache.hawq.pxf.plugins.hdfs.HdfsAnalyzer')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import')
LOG ERRORS INTO err_json_mytestfile SEGMENT REJECT LIMIT 10 ROWS;
When I execute the above DDL, the table is created successfully. After that I try to execute a select query:
select * from ext_json_mytestfile;
But I get the following error:
ERROR: remote component error (500) from 'localhost:51200': type Exception report message java.lang.ClassNotFoundException: org.apache.hawq.pxf.plugins.json.JsonAccessor description The server encountered an internal error that prevented it from fulfilling this request. exception javax.servlet.ServletException: java.lang.ClassNotFoundException: org.apache.hawq.pxf.plugins.json.JsonAccessor (libchurl.c:878) (seg4 sandbox.hortonworks.com:40000 pid=117710) (dispatcher.c:1801)
DETAIL: External table ext_json_mytestfile
Any help would be much appreciated.
It seems the referenced jar file still uses the old package name com.pivotal.*. The JSON PXF extension is still incubating; the jar pxf-json-3.0.0.jar was built for JDK 1.7 (since the single-node HDB VM uses JDK 1.7) and uploaded to Dropbox:
https://www.dropbox.com/s/9ljnv7jiin866mp/pxf-json-3.0.0.jar?dl=0
Echoing the details of the above comments so the steps are performed correctly and the PXF service recognizes the jar file. The steps below assume that HAWQ/HDB is managed by Ambari; if not, the manual steps mentioned in the previous updates should work.
Copy the pxf-json-3.0.0.jar to /usr/lib/pxf/ of all your HAWQ nodes (master and segments).
In Ambari-managed PXF, add the line below under Ambari Admin -> PXF -> Advanced pxf-public-classpath:
/usr/lib/pxf/pxf-json-3.0.0.jar
In Ambari-managed PXF, add this snippet at the end of your PXF profiles XML under Ambari Admin -> PXF -> Advanced pxf-profiles:
<profile>
  <name>Json</name>
  <description>JSON Accessor</description>
  <plugins>
    <fragmenter>org.apache.hawq.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
    <accessor>org.apache.hawq.pxf.plugins.json.JsonAccessor</accessor>
    <resolver>org.apache.hawq.pxf.plugins.json.JsonResolver</resolver>
  </plugins>
</profile>
Restart PXF service via Ambari
Did you add the jar file location to /etc/pxf/conf/pxf-public.classpath?
Did you try the following? (A consolidated command-line sketch follows this list.)
copying the PXF JSON jar file to /usr/lib/pxf
updating /etc/pxf/conf/pxf-profiles.xml to include the Json plug-in profile if not already present
(per the comment above) updating /etc/pxf/conf/pxf-public.classpath
restarting the PXF service, either via Ambari or the command line (sudo service pxf-service restart)
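Roughly, on a manually managed node the steps above might look like this (a sketch; the jar name and paths follow this thread, so adjust them to your cluster, and repeat the copy on every HAWQ node):
# Sketch only: paths and jar name taken from this thread.
sudo cp pxf-json-3.0.0.jar /usr/lib/pxf/
echo "/usr/lib/pxf/pxf-json-3.0.0.jar" | sudo tee -a /etc/pxf/conf/pxf-public.classpath
# Add the Json <profile> block to /etc/pxf/conf/pxf-profiles.xml if it is not already there,
# then restart PXF so the new classpath and profile are picked up:
sudo service pxf-service restart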
You likely didn't add the JSON jar to the classpath.
The CREATE EXTERNAL TABLE DDL will always succeed, as it is just a definition.
Only when you run queries does HAWQ check the runtime jar dependencies.
Yes, the jar json-pxf-ext-3.0.1.0-1.jar from https://bintray.com/big-data/maven/pxf-plugins/view# uses the old com.pivotal.* package name. The previous update has been edited with details on downloading the correct jar from Dropbox.
I am using the Hortonworks (HDP) sandbox with HAWQ 2.0 installed on top of it.
I'm trying to select a Hive table using HCatalog but am not able to access Hive tables from HAWQ. I am executing the steps below, as mentioned in the Pivotal docs.
postgres=# SET pxf_service_address TO "localhost:51200";
SET
postgres=# select count(*) from hcatalog.default.sample_07;
ERROR: remote component error (500) from 'localhost:51200': type Exception report message Internal server error. Property "METADATA" has no value in current request description The server encountered an internal error that prevented it from fulfilling this request. exception java.lang.IllegalArgumentException: Internal server error. Property "METADATA" has no value in current request (libchurl.c:878)
LINE 1: select count(*) from hcatalog.default.sample_07;
I think there's a missing property in pxf-profiles.xml.
Check whether you have the <metadata> property under the Hive profile;
it is newly added, and if you are using a legacy build it might not be there.
<profile>
  <name>Hive</name>
  <description>This profile is suitable for using when connecting to Hive</description>
  <plugins>
    <fragmenter>org.apache.hawq.pxf.plugins.hive.HiveDataFragmenter</fragmenter>
    <accessor>org.apache.hawq.pxf.plugins.hive.HiveAccessor</accessor>
    <resolver>org.apache.hawq.pxf.plugins.hive.HiveResolver</resolver>
    <metadata>org.apache.hawq.pxf.plugins.hive.HiveMetadataFetcher</metadata>
  </plugins>
</profile>
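After adding the <metadata> line, a restart and a quick re-test might look roughly like this (a sketch; it assumes PXF is managed from the command line rather than Ambari, and it reuses the database, PXF address and query from the question):
sudo service pxf-service restart
psql -d postgres -c "SET pxf_service_address TO 'localhost:51200'; select count(*) from hcatalog.default.sample_07;"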
I am trying to run Spark SQL:
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
But the error I'm getting is below:
... 125 more
Caused by: java.sql.SQLException: Another instance of Derby may have already booted the database /root/spark/bin/metastore_db.
at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
... 122 more
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /root/spark/bin/metastore_db.
at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
I see that a metastore_db folder already exists.
My Hive metastore uses MySQL as the metastore, but I'm not sure why the error shows a Derby exception.
I was getting the same error while creating DataFrames in the Spark shell:
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /metastore_db.
Cause:
I found that this happens when multiple other instances of spark-shell are already running and holding the Derby DB, so when I started yet another spark-shell and created a DataFrame in it using RDD.toDF(), it threw this error.
Solution:
I ran the ps command to find the other instances of spark-shell:
ps -ef | grep spark-shell
and I killed them all using the kill command:
kill -9 <spark-shell-process-id>   (example: kill -9 4848)
After all the spark-shell instances were gone, I started a new spark-shell, reran my DataFrame code, and it ran just fine :)
If you're running in the Spark shell, you shouldn't instantiate a HiveContext; one is created automatically, called sqlContext (the name is misleading: if you compiled Spark with Hive, it will be a HiveContext). See the similar discussion here.
If you're not running in the shell, this exception means you've created more than one HiveContext in the same JVM, which shouldn't be possible; you can only create one.
An .lck (lock) file is an access-control file that locks the database so that only a single user can access or update it.
The error suggests that another instance is using the same database, so you need to delete the .lck files.
In your home directory, go into metastore_db and delete any .lck files.
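For example (a sketch; it assumes metastore_db sits in your home directory as in the error message, and that no Spark process is still using it):
# Make sure no spark-shell/Spark job is still holding the lock before removing it.
ls ~/metastore_db/*.lck     # typically db.lck and dbex.lck
rm ~/metastore_db/*.lck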
Another case where you can see the same error is the Spark REPL of an AWS Glue dev endpoint, when you are trying to convert a dynamic frame into a DataFrame.
There are actually several different exceptions like:
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
ERROR XSDB6: Another instance of Derby may have already booted the database /home/glue/metastore_db.
java.sql.SQLException: Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader
The solution is hard to find with Google, but eventually it is described here.
The loaded REPL contains an instantiated SparkSession in the variable spark, and you just need to stop it before creating a new SparkContext:
>>> spark.stop()
>>> from pyspark.context import SparkContext
>>> from awsglue.context import GlueContext
>>>
>>> glue_context = GlueContext(SparkContext.getOrCreate())
>>> glue_frame = glue_context.create_dynamic_frame.from_catalog(database=DB_NAME, table_name=T_NAME)
>>> df = glue_frame.toDF()
If you are facing this issue while bringing up a WAS application on a Windows machine:
kill the Java processes using Task Manager
delete the db.lck file present in WebSphere\AppServer\profiles\AppSrv04\databases\EJBTimers\server1\EJBTimerDB (my DB is EJBTimerDB, which was causing the issue)
restart the application.
I was facing the same issue while creating a table:
sqlContext.sql("CREATE TABLE....
I could see many entries from ps -ef | grep spark-shell, so I killed all of them and restarted spark-shell. It worked for me.
This happened to me when using pyspark ml Word2Vec, while trying to load a previously built model. The trick is to create an empty DataFrame in PySpark or Scala using sqlContext. The following is the Python syntax:
from pyspark.sql.types import StructType
schema = StructType([])
empty = sqlContext.createDataFrame(sc.emptyRDD(), schema)
This is a workaround; my problem was fixed after using this block.
Note: it only occurs when you instantiate sqlContext from HiveContext, not SQLContext.
I got this error by running sqlContext._get_hive_ctx().
This was caused by initially trying to load a pipelined RDD into a DataFrame.
I got the error:
Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly", Py4JJavaError(u'An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.\n', JavaObject id=o29))
So you could try running this before rebuilding, but FYI I have seen others report that this did not help them.
I was getting this error while running test cases in my multi-module Maven Spark setup.
I was creating the SparkSession in my test classes separately, since the unit test cases required different Spark parameters every time, which I pass in through a configuration file.
To resolve this, I followed this approach.
While creating the SparkSession in Spark 2.2.0:
// This is present in my parent trait.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

def createSparkSession(master: String, appName: String, configList: List[(String, String)]): SparkSession = {
  val sparkConf = new SparkConf().setAll(configList)
  val spark = SparkSession
    .builder()
    .master(master)
    .config(sparkConf)
    .enableHiveSupport()
    .appName(appName)
    .getOrCreate()
  spark
}
In my test classes:
// metastore_db_test will be a test-class-specific folder in my modules.
val metaStoreConfig = List(("javax.jdo.option.ConnectionURL", "jdbc:derby:;databaseName=hiveMetaStore/metastore_db_test;create=true"))
val configList = configContent.convertToListFromConfig(sparkConfigValue) ++ metaStoreConfig
val spark = createSparkSession("local[*]", "testing", configList)
And after that, in the Maven clean plugin, I clean this hiveMetaStore directory:
Parent POM:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-clean-plugin</artifactId>
  <version>3.1.0</version>
  <configuration>
    <filesets>
      <fileset>
        <directory>metastore_db</directory>
      </fileset>
      <fileset>
        <directory>spark-warehouse</directory>
      </fileset>
    </filesets>
  </configuration>
</plugin>
Child module POM:
<plugin>
  <artifactId>maven-clean-plugin</artifactId>
  <configuration>
    <filesets>
      <fileset>
        <directory>hiveMetaStore</directory>
        <includes>
          <include>**</include>
        </includes>
      </fileset>
      <fileset>
        <directory>spark-warehouse</directory>
      </fileset>
    </filesets>
  </configuration>
</plugin>
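With these filesets in place, each build removes the per-module metastore and warehouse directories, so repeated test runs do not trip over a stale Derby lock. Usage is just the standard Maven lifecycle (nothing specific to this setup is assumed):
# clean runs the maven-clean-plugin configured above, then the tests start fresh
mvn clean test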
The error comes either from multiple spark-shells that you are trying to run on the same node, or from a system failure that shut it down without exiting the spark-shell properly. In either case, just find the process ID and kill it; for that, use:
[hadoop@localhost ~]$ ps -ef | grep spark-shell
hadoop 11121 9197 0 17:54 pts/0 00:00:00 grep --color=auto spark-shell
[hadoop@localhost ~]$ kill 9197
It is very difficult to find out where your Derby metastore_db is being accessed by another thread; if you are able to find the process, you can kill it using the kill command.
The simplest solution is to restart the system.
I'm experiencing some issues at the moment when deploying a Maven site to Alfresco.
In my company, we use Alfresco as the ECM in our forge.
Since this tool supports FTP and indexes the content of any kind of text document, I'd like to push my Maven site into it.
But even though I'm able to deploy the site manually through FTP to Alfresco, and to upload it automatically using Maven, I'm not able to combine both:
Here is the relevant part of my pom.xml:
<distributionManagement>
  [...]
  <site>
    <id>forge-alfresco</id>
    <name>Serveur Alfresco de la Forge</name>
    <url>ftp://alfresco.mycompany.corp/Alfresco/doc/site</url>
  </site>
</distributionManagement>
<build>
  <extensions>
    <!-- Enabling the use of FTP -->
    <extension>
      <groupId>org.apache.maven.wagon</groupId>
      <artifactId>wagon-ftp</artifactId>
      <version>2.2</version>
    </extension>
  </extensions>
</build>
And here is the relevant part of my settings.xml:
<servers>
  <server>
    <id>forge-alfresco</id>
    <username>jrrevy</username>
    <password>xxxxxxxx</password>
  </server>
</servers>
When I try to deploy using site:deploy, I'm facing this:
[INFO] [site:deploy {execution: default-cli}]
Reply received: 220 FTP server ready
Command sent: USER jrrevy
Reply received: 331 User name okay, need password for jrrevy
Command sent: PASS xxxxxx
Reply received: 230 User logged in, proceed
Command sent: SYST
Reply received: 215 UNIX Type: Java FTP Server
Remote system is UNIX Type: Java FTP Server
Command sent: TYPE I
Reply received: 200 Command OK
ftp://alfresco.mycompany.corp/Alfresco/doc/site/ - Session: Opened
[INFO] Pushing D:\project\workspaces\yyyyy\myproject\target\site
[INFO] >>> to ftp://alfresco.mycompany.corp/Alfresco/doc/site/./
Command sent: CWD /Alfresco/doc/site/
Reply received: 250 Requested file action OK
Recursively uploading directory D:\project\workspaces\yyyyy\myproject\target\site as ./
processing = D:\project\workspaces\yyyyy\myproject\target\site as ./
Command sent: CWD ./
Reply received: 550 Invalid path ./
Command sent: MKD ./
Reply received: 250 /Alfresco/doc/site/.
Command sent: CWD ./
Reply received: 550 Invalid path ./
ftp://alfresco.mycompany.corp/Alfresco/doc/site/ - Session: Disconnecting
ftp://alfresco.mycompany.corp/Alfresco/doc/site/ - Session: Disconnected
[INFO] ------------------------------------------------------------------------
[ERROR] BUILD ERROR
[INFO] ------------------------------------------------------------------------
[INFO] Error uploading site
Embedded error: Unable to change cwd on ftp server to ./ when processing D:\project\workspaces\yyyyy\myproject\target\site
I can't figure out what the problem is. Maybe the plugin version is not compatible... Maybe Alfresco's FTP implementation is not fully compatible (forgive me for this outrage ;)), or maybe there is a configuration setting in the server properties that I missed.
I don't really know where to look, and after some time googling, I can't find what the matter is.
I already have some workarounds: I'll try to upload the website using the WebDAV protocol, and I can use some extra features (like deploying artifacts from Jenkins) on our CI platform, but I really want to know what the problem is.
Can someone help me?
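One way to confirm that the 550 on "CWD ./" comes from the server rather than from Maven/Wagon is to reproduce it with a plain FTP client (a sketch using the host, path and username from the question; the password placeholder is the one from settings.xml):
# A directory listing works, so basic FTP access and the path are fine:
curl --user jrrevy:xxxxxxxx ftp://alfresco.mycompany.corp/Alfresco/doc/site/
# Then connect interactively and try changing into "." to see whether the
# server itself rejects the relative path the way the Wagon log shows:
ftp alfresco.mycompany.corp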
Indeed, it looks like an Alfresco issue: issues.alfresco.com/jira/browse/ALF-4724.
I'm running Alfresco 3.1, and this issue seems to be solved in 3.3.5 and above.