Inserting large amounts of data into Hive (around 64 MB) - hadoop

OK, so I have a Hive table on a remote Hadoop node set up on a Linux machine. I'm having an issue when attempting to insert a large JSON string, large as in possibly 64 MB or more, given that MapReduce won't work well unless I approach that limit. I've successfully transferred 8-9 MB, but that's as high as it gets; if I attempt any more, the query fails. I also had to override C#'s default JSON serializer to do this, which I know is not good practice, but I really don't know any other way to do it.
Anyway, this is how I store data in Hive:
using System;
using System.Data.Odbc;
using System.IO;
using System.Web.Mvc;

namespace HadoopWebService.Controllers
{
    public class LogsController : Controller
    {
        // POST: HadoopRequest
        [HttpPost]
        public ContentResult Create(string json)
        {
            OdbcConnection hiveConnection = new OdbcConnection("DSN=Hadoop Server;UID=XXXX;PWD=XXXX");
            hiveConnection.Open();
            // Re-read the raw request body to get the full JSON payload
            Stream req = Request.InputStream;
            req.Seek(0, SeekOrigin.Begin);
            string request = new StreamReader(req).ReadToEnd();
            ContentResult response;
            string query;
            try
            {
                // The whole payload is spliced into the SQL text, so the
                // statement grows as large as the JSON itself
                query = "INSERT INTO TABLE error_log (json_error_log) VALUES('" + request + "')";
                OdbcCommand command = new OdbcCommand(query, hiveConnection);
                command.ExecuteNonQuery();
                response = new ContentResult { Content = "{status: 1}", ContentType = "application/json" };
                hiveConnection.Close();
                return response;
            }
            catch (Exception error)
            {
                response = new ContentResult { Content = "{status: 0, message: " + error.ToString() + "}" };
                System.Diagnostics.Debug.WriteLine(error.Message);
                hiveConnection.Close();
                return response;
            }
        }
    }
}
Is there some setting I can use to insert larger amounts of data? I assume there must be some buffer that is failing to load everything. I've checked on Google but haven't found anything, mainly because this probably isn't the proper way to insert into Hadoop, but I'm really out of options right now; I can't use HDInsight, and all I've got is the ODBC connection.
EDIT: This is the error I get:
System.Data.Odbc.OdbcException (0x80131937): ERROR [HY000] [Microsoft][HiveODBC] (35) Error from Hive: error code: '0' error message: 'ExecuteStatement finished with operation state: ERROR_STATE'.
   at System.Data.Odbc.OdbcConnection.HandleError(OdbcHandle hrHandle, RetCode retcode)
   at System.Data.Odbc.OdbcCommand.ExecuteReaderObject(CommandBehavior behavior, String method, Boolean needReader, Object[] methodArguments, SQL_API odbcApiMethod)
   at System.Data.Odbc.OdbcCommand.ExecuteReaderObject(CommandBehavior behavior, String method, Boolean needReader)
   at System.Data.Odbc.OdbcCommand.ExecuteNonQuery()
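For what it's worth, the usual way to get payloads of this size into Hive is to stage them in a file and load the file, rather than inlining megabytes of text into a single SQL literal that the ODBC driver and HiveQL parser must buffer. Below is a minimal sketch of that approach in Java against a hypothetical HiveServer2 JDBC endpoint; the table name error_log comes from the question, while the host and staging path are assumptions, and the same two HiveQL statements could equally be issued over the existing ODBC DSN:

import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveBulkLoad {
    // Stages a large JSON string as a file, then asks Hive to load the file,
    // so the payload never has to travel inside a single SQL statement.
    public static void insertLargeJson(String json) throws Exception {
        String stagingPath = "/tmp/error_log_staging.json"; // hypothetical path
        try (FileWriter out = new FileWriter(stagingPath)) {
            out.write(json);
        }
        // Hypothetical HiveServer2 endpoint; credentials as in the question's DSN.
        // Note: LOCAL INPATH is resolved on the HiveServer2 host, so the file
        // must be staged somewhere that host can read.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://hadoop-server:10000/default", "XXXX", "XXXX");
             Statement stmt = conn.createStatement()) {
            stmt.execute("LOAD DATA LOCAL INPATH '" + stagingPath
                    + "' INTO TABLE error_log");
        }
    }
}

LOAD DATA moves or copies the file into the table's storage directory, so no 64 MB string ever passes through the SQL parser.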

Related

In Spring Boot, how do I get the column names from a StoredProcedureQuery result?

I am working on creating a simple utility that allows our users to execute a pre-selected list of stored procedures that return a simple list result set as a JSON string. The result set varies based on the selected procedure. I am able to get the results easily enough (and pass back as JSON as required), but the results don't include the column names.
The most common answer I found online is to use ResultSetMetaData or NativeQuery, but I couldn't figure out how to extract the metadata or transform the query properly using a StoredProcedureQuery object. How do I get the column names from a StoredProcedureQuery result?
Here is my code:
@SuppressWarnings("unchecked")
public String executeProcedure(String procedure, String jsonData) {
    // Set up a call to the stored procedure
    StoredProcedureQuery query = entityManager.createStoredProcedureQuery(procedure);
    // Register and set the parameters
    query.registerStoredProcedureParameter(0, String.class, ParameterMode.IN);
    query.setParameter(0, jsonData);
    String jsonResults = "[{}]";
    try {
        // Execute the query and store the results
        query.execute();
        List list = query.getResultList();
        jsonResults = new Gson().toJson(list);
    } finally {
        try {
            // Cleanup
            query.unwrap(ProcedureOutputs.class).release();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return jsonResults;
}
The challenge is to get a ResultSet: to list the column names you need a ResultSet, because column names are metadata, accessed like this:
ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
System.out.println("Column name: "+resultSetMetaData.getColumnName(1));
System.out.println("Column type: "+resultSetMetaData.getColumnTypeName(1));
You can't get a ResultSet (or its metadata) from javax.persistence.StoredProcedureQuery, nor from Spring Data JPA (see "Support JPA 2.1 stored procedures returning result sets").
You can with low-level JDBC as follows:
CallableStatement stmt = conn.prepareCall("{call demoSp(?, ?)}");
stmt.setString(1, "abcdefg");
ResultSet resultSet1 = stmt.executeQuery();
resultSet1.getMetaData(); // etc.
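If Hibernate is the JPA provider, one way to bridge the two worlds is to unwrap the EntityManager down to the JDBC layer and read the metadata there. A minimal sketch, assuming Hibernate, and assuming demoSp (borrowed from the snippet above) takes a single IN parameter and returns a result set directly:

import java.sql.CallableStatement;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import javax.persistence.EntityManager;
import org.hibernate.Session;

public class ProcedureMetadataReader {
    // Unwraps the EntityManager to a Hibernate Session and runs the call over
    // plain JDBC, where ResultSetMetaData (and thus column names) is available.
    public static void printColumns(EntityManager entityManager, String jsonData) {
        Session session = entityManager.unwrap(Session.class);
        session.doWork(connection -> {
            try (CallableStatement stmt = connection.prepareCall("{call demoSp(?)}")) {
                stmt.setString(1, jsonData);
                try (ResultSet rs = stmt.executeQuery()) {
                    ResultSetMetaData meta = rs.getMetaData();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        System.out.println("Column " + i + ": " + meta.getColumnName(i)
                                + " (" + meta.getColumnTypeName(i) + ")");
                    }
                }
            }
        });
    }
}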

How to use the ES Java API to create a new type in an index

I have succeeded in creating an index using Client; the code looks like this:
public static boolean addIndex(Client client, String index) throws Exception {
    if (client == null) {
        client = getSettingClient();
    }
    CreateIndexRequestBuilder requestBuilder = client.admin().indices().prepareCreate(index);
    CreateIndexResponse response = requestBuilder.execute().actionGet();
    return response.isAcknowledged();
    //client.close();
}

public static boolean addIndexType(Client client, String index, String type) throws Exception {
    if (client == null) {
        client = getSettingClient();
    }
    // TypesExistsRequestBuilder only checks whether the type exists; it does not create it
    TypesExistsAction action = TypesExistsAction.INSTANCE;
    TypesExistsRequestBuilder requestBuilder = new TypesExistsRequestBuilder(client, action, index);
    requestBuilder.setTypes(type);
    TypesExistsResponse response = requestBuilder.get();
    return response.isExists();
}
However, the addIndexType method has no effect; the type is not created.
How do I create the type?
You can create types when you create the index by providing a proper mapping configuration. Alternatively, a type gets created when you index a document of that type. However, the first suggestion is the better one, because then you control the full mapping of that type instead of relying on dynamic mapping.
You can set types in the following way:
// JSON schema is the JSON mapping which you want to give for the index.
JSONObject builder = new JSONObject().put(type, JSONSchema);
// And then just fire the below command
client.admin().indices().preparePutMapping(indexName)
.setType(type)
.setSource(builder.toString(), XContentType.JSON)
.execute()
.actionGet();
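If the index does not exist yet, the mapping can instead be attached at index-creation time, which is the first suggestion above. A minimal sketch under the same assumptions (jsonMapping plays the role of JSONSchema above, the mapping JSON for the type):

// Create the index and its type mapping in one call
client.admin().indices().prepareCreate(indexName)
        .addMapping(type, jsonMapping, XContentType.JSON)
        .execute()
        .actionGet();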

Hive QL Driver: how to specify a database name other than default

I am writing a sample program to connect to the Hive metastore using the org.apache.hadoop.hive.ql.Driver class. A sample snippet is below:
String userName = "test";
HiveConf conf = new HiveConf(SessionState.class);
conf.set("fs.default.name", "hdfs://" + hadoopMasterHost + ":8020");
conf.set("hive.metastore.local", "false");
conf.set("hive.metastore.warehouse.dir", "/user/hive/warehouse");
conf.set("hive.metastore.uris", "thrift://" + hadoopMasterHost + ":9083");
conf.set("hadoop.bin.path", "/usr/hdp/2.2.0.0-2041/hadoop/bin");
conf.set("yarn.nodemanager.hostname", hadoopMasterHost);
conf.set("yarn.resourcemanager.hostname", hadoopMasterHost);
ss = new MyCliSessionState(conf);
ss.out = new PrintStream(System.out, true, "UTF-8");
ss.err = new PrintStream(System.err, true, "UTF-8");
SessionState.start(ss);
driver = new Driver(conf);
query = "show tables";
if (userName == null || userName.isEmpty())
    return driver.run(query);
UserGroupInformation ugi = createUgi(userName);
CommandProcessorResponse response = ugi.doAs(new PrivilegedExceptionAction<CommandProcessorResponse>() {
    public CommandProcessorResponse run() throws Exception {
        CliSessionState ss = null;
        ss = new MyCliSessionState(conf);
        ss.out = new PrintStream(System.out, true, "UTF-8");
        ss.err = new PrintStream(System.err, true, "UTF-8");
        // refresh thread-local SessionState and Hive
        SessionState.start(ss);
        Hive.get(conf, true);
        return driver.run(query);
    }
});
return response;
I am able to connect to the default database and get the list of all tables. But how can I connect to databases other than default? I tried searching for a Hive configuration property, but could not find one that specifies the database name. Can somebody please help?
Looks like you want to do things the hard way and re-implement the Beeline utility. To most people that would seem a masochistic endeavour, but who am I to judge?
Anyway, at this point you have to execute HQL commands, like anyone else... and anyone should know about the "use" command:
driver.run("use " +argDatabase) ;
// check status
driver.run("show tables") ;
// check status, parse output
driver.run("describe extended " +argTable) ;
// check status, parse output
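For completeness, here is a minimal sketch of checking the status and reading the output back, assuming the Hive 1.x Driver API used in the question, where getResults fills a caller-supplied list with one line per output row:

// Switch databases, then list that database's tables and read the output back
CommandProcessorResponse use = driver.run("use " + argDatabase);
if (use.getResponseCode() != 0) {
    throw new RuntimeException("use failed: " + use.getErrorMessage());
}
driver.run("show tables");
List<String> tables = new ArrayList<String>();
driver.getResults(tables); // Driver appends one String per output row
for (String table : tables) {
    System.out.println(table);
}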

Astyanax/Cassandra - Getting "Re-preparing already prepared query" warning with caching enabled

I'm trying to insert some data into Cassandra with Astyanax, but I'm getting a lot of "Re-preparing already prepared query" warnings even though I have caching enabled:
22:08:03,703 WARN Cluster:1702 - Re-preparing already prepared query INSERT INTO test.test (key,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9) VALUES (?,?,?,?,?,?,?,?,?,?,?) . Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
22:08:03,707 WARN Cluster:1702 - Re-preparing already prepared query INSERT INTO test.test (key,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9) VALUES (?,?,?,?,?,?,?,?,?,?,?) . Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
22:08:03,708 WARN Cluster:1702 - Re-preparing already prepared query INSERT INTO test.test (key,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9) VALUES (?,?,?,?,?,?,?,?,?,?,?) . Please note that preparing the same query more than once is generally an anti-pattern and will likely affect performance. Consider preparing the statement only once.
Source code:
Connect: (executed once)
@Override
public void connect() throws ClientException {
    AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
            .forCluster(clusterName)
            .forKeyspace(keyspaceName)
            .withHostSupplier(new Supplier<List<Host>>() {
                @Override
                public List<Host> get() {
                    return Collections.singletonList(new Host(host, 9160));
                }
            })
            .withAstyanaxConfiguration(
                    new AstyanaxConfigurationImpl().setDiscoveryType(NodeDiscoveryType.DISCOVERY_SERVICE)
                            .setDiscoveryDelayInSeconds(60000))
            .withConnectionPoolConfiguration(new JavaDriverConfigBuilder().build())
            .buildKeyspace(CqlFamilyFactory.getInstance());
    context.start();
    keyspace = context.getClient();
    columnFamilyTemplate = new ColumnFamily<String, String>(columnFamily,
            StringSerializer.get(), StringSerializer.get());
    try {
        columnFamilyTemplate.describe(keyspace);
    } catch (ConnectionException e) {
        throw new ClientException(e);
    }
    insert = keyspace.prepareMutationBatch().withCaching(true);
}
Insert: (executed multiple times)
insert.discardMutations();
final ColumnListMutation<String> row = insert.withRow(columnFamilyTemplate, key);
for (Map.Entry<String, String> pair : columnValues.entrySet()) {
    final String column = pair.getKey();
    final String value = pair.getValue();
    row.putColumn(column, value, null);
}
try {
    insert.withCaching(true).execute();
} catch (ConnectionException e) {
    throw new ClientException(e);
}
The warning message suggests that the caching is not actually working. Any idea how to fix it?
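One workaround, if the Astyanax-level caching cannot be made to stick, is to follow the warning's own advice and prepare the statement only once by dropping down to the underlying DataStax Java driver. A minimal sketch, assuming direct access to a com.datastax.driver.core.Session; the table and columns are borrowed from the warning message, trimmed to three columns:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PreparedInserter {
    private final PreparedStatement insert;

    // Prepare the statement exactly once, at construction time
    public PreparedInserter(Session session) {
        insert = session.prepare(
                "INSERT INTO test.test (key, c0, c1) VALUES (?, ?, ?)");
    }

    // Every insert reuses the cached PreparedStatement, so the driver
    // never re-prepares and the warning goes away
    public void insertRow(Session session, String key, String c0, String c1) {
        BoundStatement bound = insert.bind(key, c0, c1);
        session.execute(bound);
    }
}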

Query returning values from Oracle but no records when run from Java

This query returns the record with the minimum create timestamp for the person PERS_ID when I run it in SQL Developer, but the same query returns no value through a Java JDBC connection.
Can you please help?
select PERS_ID,CODE,BEG_DTE
from PRD_HIST H
where PERS_ID='12345'
and CODE='ABC'
and CRTE_TSTP=(
select MIN(CRTE_TSTP)
from PRD_HIST S
where H.PERS_ID=S.PERS_ID
and PERS_ID='12345'
and EFCT_END_DTE is null
)
Java Code
public static List<String[]> getPersonwithMinCreateTSTP(final String PERS_ID, final String Category, final Connection connection) {
    final List<String[]> personRecords = new ArrayList<String[]>();
    ResultSet resultSet = null;
    Statement statement = null;
    String PersID = null;
    String ReportCode = null;
    String effBegDate = null;
    try {
        statement = connection.createStatement();
        final String query = "select PERS_ID,CODE,EFCT_BEG_DTE from PRD_HIST H where PERS_ID='" + PERS_ID + "'and CODE='" + Category + "'and CRTE_TSTP=(select MIN(CRTE_TSTP) from PRD_HIST S where H.PERS_ID=S.PERS_ID and PERS_ID='" + PERS_ID + "' and EFCT_END_DTE is null)";
        if (!statement.execute(query)) {
            // print error
        }
        resultSet = statement.getResultSet();
        while (resultSet.next()) {
            PersID = resultSet.getString("PERS_ID");
            ReportCode = resultSet.getString("CODE");
            effBegDate = resultSet.getString("EFCT_BEG_DTE");
            final String[] personDetails = { PersID, ReportCode, effBegDate };
            personRecords.add(personDetails);
        }
    } catch (SQLException sqle) {
        CTLoggerUtil.logError(sqle.getMessage());
    } finally { // finally is added to close the result set and statement
        try {
            if (resultSet != null) {
                resultSet.close();
            }
            if (statement != null) {
                statement.close();
            }
        } catch (SQLException e) {
            // print error
        }
    }
    return personRecords;
}
Print out your SQL SELECT statement from your Java program and paste it into SQL*Plus to see what is happening. It's likely you're not getting your variables set to what you think you are. In fact, you're likely to see the error just by reading the printed SELECT statement without even running it: lower-case values where upper case is needed, missing spaces around keywords, etc.
If you still can't see it, post the actual query from your Java code here.
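Once the statement looks right, binding the values with a PreparedStatement instead of concatenating them sidesteps quoting, casing, and whitespace mistakes entirely. A minimal sketch of the same query rewritten with bind variables, names taken from the question's code:

final String query =
        "select PERS_ID, CODE, EFCT_BEG_DTE from PRD_HIST H "
      + "where PERS_ID = ? and CODE = ? "
      + "and CRTE_TSTP = (select MIN(CRTE_TSTP) from PRD_HIST S "
      + "where H.PERS_ID = S.PERS_ID and PERS_ID = ? and EFCT_END_DTE is null)";
try (PreparedStatement ps = connection.prepareStatement(query)) {
    ps.setString(1, PERS_ID);   // bound values need no quoting or escaping
    ps.setString(2, Category);
    ps.setString(3, PERS_ID);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            personRecords.add(new String[] {
                    rs.getString("PERS_ID"),
                    rs.getString("CODE"),
                    rs.getString("EFCT_BEG_DTE")
            });
        }
    }
}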
I came here with a similar problem; just thought I'd post my solution for others following. I hadn't run COMMIT after the inserts I'd made (via SQL*Plus). Doh!
If the database table has records but the JDBC client can't retrieve them, the JDBC client doesn't have SELECT privileges. Please run the query below on the command line:
grant all on emp to hr;
