Ok, so I have a hive table on a remote hadoop node set up on a linux machine. I'm having an issue when attempting to insert a large json string, large as in possibly 64MB or more given that map reduce won't work well unless I approach that limit. I've successfully transfered over 8 - 9MB, but that's as high as it gets, if I attempt to do more than the query fails. I also had to override C#'s default json serializer to do this, not a good practice I know, but I really don't know any other way to do this.
Anyway this is how I store data into Hive:
namespace HadoopWebService.Controllers
{
public class LogsController : Controller
{
// POST: HadoopRequest
[HttpPost]
public ContentResult Create(string json)
{
OdbcConnection hiveConnection = new OdbcConnection("DSN=Hadoop Server;UID=XXXX;PWD=XXXX");
hiveConnection.Open();
Stream req = Request.InputStream;
req.Seek(0, SeekOrigin.Begin);
string request = new StreamReader(req).ReadToEnd();
ContentResult response;
string query;
try
{
query = "INSERT INTO TABLE error_log (json_error_log) VALUES('" + request + "')";
OdbcCommand command = new OdbcCommand(query, hiveConnection);
command.ExecuteNonQuery();
command.CommandText = query;
response = new ContentResult { Content = "{status: 1}", ContentType = "application/json" };
hiveConnection.Close();
return response;
}
catch(Exception error)
{
response = new ContentResult { Content = "{status: 0, message:" + error.ToString()+ "}" };
System.Diagnostics.Debug.WriteLine(error.Message.ToString());
hiveConnection.Close();
return response;
}
}
}
}
Is there some setting which I can use to insert larger amounts of data? I assume there must be some buffer that is failing to load everything. I've checked on google but I haven't found anything, mainly because this probably isn't the way to insert properly into Hadoop, but I'm really out of options right now, I can't use HDInsight, all I've got is the ODBC connection.
EDIT: This is the error I get:
System.Data.Odbc.OdbcException (0x80131937): ERROR [HY000][HiveODBC]
(35) Error from Hive: error code: ‘0’ error message: ‘ExecuteStatement
finished with operation state: ERROR_STATE’.
message:System.Data.Odbc.OdbcException (0x80131937): ERROR [HY000]
[Microsoft][HiveODBC] (35) Error from Hive: error code: '0' error
message: 'ExecuteStatement finished with operation state:
ERROR_STATE'. at
System.Data.Odbc.OdbcConnection.HandleError(OdbcHandle hrHandle,
RetCode retcode) at
System.Data.Odbc.OdbcCommand.ExecuteReaderObject(CommandBehavior
behavior, String method, Boolean needReader, Object[] methodArguments,
SQL_API odbcApiMethod) at
System.Data.Odbc.OdbcCommand.ExecuteReaderObject(CommandBehavior
behavior, String method, Boolean needReader) at
System.Data.Odbc.OdbcCommand.ExecuteNonQuery()
So this is my first time to ever use Pig and I'm having a hard time getting it to interpret my data correctly. I dont want to have to define a schema for my input files until run time, so I wrote a super simple custom loader where the only changes I made to PigStorage were changing the GetSchema Method to read the first two lines of my file and create a schema off of it:
public ResourceSchema getSchema(String location,
Job job) throws IOException {
BufferedReader br = new BufferedReader(new FileReader(location.replace("file://", "")));
String[] line = br.readLine().split(",");
String[] data = br.readLine().split(",");
List<FieldSchema> fields = new ArrayList<FieldSchema>();
for(int f = 0; f< line.length; f++)
{
Byte type = GetType(data[f].replace("\"", ""));
fields.add(new FieldSchema(line[f].replace("\"", ""), type));
}
schema = new ResourceSchema(new Schema(fields));
return schema;
}
private Byte GetType(Object Data)
{
try{
int number = Integer.parseInt(Data.toString());
return org.apache.pig.data.DataType.INTEGER;
}
catch(Exception e){}
try{
double dnumber = Double.parseDouble(Data.toString());
return org.apache.pig.data.DataType.DOUBLE;
}
catch(Exception e){}
return org.apache.pig.data.DataType.CHARARRAY;
}
When I load a file and run DESCRIBE on it, it looks like what I want, for instance:
{CU_NUMBER: int,CYCLE_DATE: chararray,JOIN_NUMBER: int,RSSD: int,CU_TYPE: int,CU_NAME: chararray}
And the first 10 Rows look like this:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
However, when I try to do stuff with the data like:
FOICU = LOAD 'file:///home/biadmin/NCUA/foicu.txt' USING org.apache.pig.builtin.PigStorageInferSchema(',', '-schema');
FirstSixColumns = FOREACH FOICU GENERATE CU_NUMBER, CYCLE_DATE, JOIN_NUMBER, RSSD, CU_TYPE, CU_NAME;
TopTen = LIMIT FirstSixColumns 10;
FOICUFiltered = FILTER TopTen BY CU_NUMBER > 20;
CU_FIVE = FILTER TopTen BY CU_NUMBER == 5;
DUMP FOICUFiltered;
DUMP CU_FIVE;
FOICUFiltered returns all 10 rows even though 7 of them have a CU_NUMBER less than 20:
(1,9/30/2013 0:00:00,2,"50377","1","MORRIS SHEPPARD TEXARKANA")
(5,9/30/2013 0:00:00,6,"859879","1","FIRST CASTLE")
(6,9/30/2013 0:00:00,7,"54571","1","THE NEW ORLEANS FIREMEN'S")
(12,9/30/2013 0:00:00,11,"56678","1","FRANKLIN TRUST")
(13,9/30/2013 0:00:00,12,"861676","1","E")
(16,9/30/2013 0:00:00,14,"59277","1","WOODMEN")
(19,9/30/2013 0:00:00,16,"863773","1","NEW HAVEN TEACHERS")
(22,9/30/2013 0:00:00,17,"61074","1","WATERBURY CONNECTICUT TEACHER")
(26,9/30/2013 0:00:00,19,"866372","1","FARMERS")
(28,9/30/2013 0:00:00,21,"953375","1","CENTRIS")
And CU_FIVE returns no rows at all.
Does anybody know what I've done wrong here and is there a better way to dynamically load the schema at run time without using schema files?
I am using the OObjectDatabaseTx implementation of OrientDB to store my POJOs in the database. When I try to retrieve some POJOs with a SQL commant, I get the result set but the attributes of the POJOs seem to be empty (getters regurning null).
I register my classes properly with
db.getEntityManager().registerEntityClass(MyUser.class);
The following code describes my problem:
Map<String, String> params = new HashMap<String, String>();
params.put("name", username);
List<MyUser> users = db.command(
new OSQLSynchQuery<MyUser>(
"select * from MyUser where "
+ "name = :name"))
.execute(params);
for (MyUser founduser : users) {
ODocument doc = db.getRecordByUserObject(founduser, false);
String pass = doc.field("pwd");
assertEquals(pass != null, true); // passes
assertEquals(founduser.getPwd() != null, true); // fails
}
How can I get the method getPwd to return the proper value?
I am now using Version 1.3.0 and this has worked before (afaik in 1.1.0).
Can you see if the POJO has the "pwd" field set inside of it?
Note: I'm specifically not using Fluent NHibernate but am using 3.x's built-in mapping style. However, I am getting a blank recordset when I think I should be getting records returned.
I'm sure I'm doing something wrong and it's driving me up a wall. :)
Background / Setup
I have an Oracle 11g database for a product by IBM called Maximo
This product has a table called workorder which lists workorders; that table has a field called "wonum" which represents a unique work order number.
I have a "reporting" user which can access the table via the maximo schema
e.g. "select * from maximo.workorder"
I am using Oracle's Managed ODP.NET DLL to accomplish data tasks, and using it for the first time.
Things I've Tried
I created a basic console application to test this
I added the OracleManagedClientDriver.cs from the NHibernate.Driver on the master branch (it is not officially in the release I'm using).
I created a POCO called WorkorderBriefBrief, which only has a WorkorderNumber field.
I created a class map, WorkorderBriefBriefMap, which maps only that value as a read-only value.
I created a console application with console output to attempt to write the lines of work orders.
The session and transaction appear to open correct,
I tested a standard ODP.NET OracleConnection to my connection string
The Code
POCO: WorkorderBriefBrief.cs
namespace PEApps.Model.WorkorderQuery
{
public class WorkorderBriefBrief
{
public virtual string WorkorderNumber { get; set; }
}
}
Mapping: WorkorderBriefBriefMap.cs
using NHibernate.Mapping.ByCode;
using NHibernate.Mapping.ByCode.Conformist;
using PEApps.Model.WorkorderQuery;
namespace ConsoleTests
{
public class WorkorderBriefBriefMap : ClassMapping<WorkorderBriefBrief>
{
public WorkorderBriefBriefMap()
{
Schema("MAXIMO");
Table("WORKORDER");
Property(x=>x.WorkorderNumber, m =>
{
m.Access(Accessor.ReadOnly);
m.Column("WONUM");
});
}
}
}
Putting it Together: Program.cs
namespace ConsoleTests
{
class Program
{
static void Main(string[] args)
{
NHibernateProfiler.Initialize();
try
{
var cfg = new Configuration();
cfg
.DataBaseIntegration(db =>
{
db.ConnectionString = "[Redacted]";
db.Dialect<Oracle10gDialect>();
db.Driver<OracleManagedDataClientDriver>();
db.KeywordsAutoImport = Hbm2DDLKeyWords.AutoQuote;
db.BatchSize = 500;
db.LogSqlInConsole = true;
})
.AddAssembly(typeof(WorkorderBriefBriefMap).Assembly)
.SessionFactory().GenerateStatistics();
var factory = cfg.BuildSessionFactory();
List<WorkorderBriefBrief> query;
using (var session = factory.OpenSession())
{
Console.WriteLine("session opened");
Console.ReadLine();
using (var transaction = session.BeginTransaction())
{
Console.WriteLine("transaction opened");
Console.ReadLine();
query =
(from workorderbriefbrief in session.Query<WorkorderBriefBrief>() select workorderbriefbrief)
.ToList();
transaction.Commit();
Console.WriteLine("Transaction Committed");
}
}
Console.WriteLine("result length is {0}", query.Count);
Console.WriteLine("about to write WOs");
foreach (WorkorderBriefBrief wo in query)
{
Console.WriteLine("{0}", wo.WorkorderNumber);
}
Console.WriteLine("DONE!");
Console.ReadLine();
// Test a standard connection below
string constr = "[Redacted]";
OracleConnection con = new OracleConnection(constr);
con.Open();
Console.WriteLine("Connected to Oracle Database {0}, {1}", con.ServerVersion, con.DatabaseName.ToString());
con.Dispose();
Console.WriteLine("Press RETURN to exit.");
Console.ReadLine();
}
catch (Exception ex)
{
Console.WriteLine("Error : {0}", ex);
Console.ReadLine();
}
}
}
}
Thanks in advance for any help you can give!
Update
The following code (standard ADO.NET with OracleDataReader) works fine, returning the 16 workorder numbers that it should. To me, this points to my use of NHibernate more than the Oracle Managed ODP.NET. So I'm hoping it's just something stupid that I did above in the mapping or configuration.
// Test a standard connection below
string constr = "[Redacted]";
OracleConnection con = new Oracle.ManagedDataAccess.Client.OracleConnection(constr);
con.Open();
Console.WriteLine("Connected to Oracle Database {0}, {1}", con.ServerVersion, con.DatabaseName);
var cmd = new OracleCommand();
cmd.Connection = con;
cmd.CommandText = "select wonum from maximo.workorder where upper(reportedby) = 'MAXADMIN'";
cmd.CommandType = CommandType.Text;
Oracle.ManagedDataAccess.Client.OracleDataReader reader = cmd.ExecuteReader();
while (reader.Read())
{
Console.WriteLine(reader.GetString(0));
}
con.Dispose();
When configuring NHibernate, you need to tell it about your mappings.
I found the answer -- thanks to Oskar's initial suggestion, I realized it wasn't just that I hadn't added the assembly, I also needed to create a new mapper.
to do this, I added the following code to the configuration before building my session factory:
var mapper = new ModelMapper();
//define mappingType(s) -- could be an array; in my case it was just 1
var mappingType = typeof (WorkorderBriefBriefMap);
//use AddMappings instead if you're mapping an array
mapper.AddMapping(mappingType);
//add the compiled results of the mapper to the configuration
cfg.AddMapping(mapper.CompileMappingForAllExplicitlyAddedEntities());
var factory = cfg.BuildSessionFactory();
I have built a common app that works with PostgreSQL and should work on Oracle.
However i'm getting strange errors when inserting records through a parametrized query.
My formatted query looks like this:
"INSERT INTO layer_mapping VALUES (#lm_id,#lm_layer_name,#lm_layer_file);"
Unlike Npgsql which documents how to use the parameters, i could not found how Oracle "prefers" them to be used. I could only find :1, :2, :3, for example.
I do not wanto use sequential parameters, i want to use them in a named way.
Is there a way to do it? Am i doing something wrong?
Thanks
You can use named parameters with ODP.NET like so:
using (var cx=new OracleConnection(connString)){
using(var cmd=cx.CreateCommand()){
cmd.CommandText="Select * from foo_table where bar=:bar";
cmd.BindByName=true;
cmd.Parameters.Add("bar",barValue);
///...
}
}
I made this lib https://github.com/pedro-muniz/ODPNetConnect/blob/master/ODPNetConnect.cs
so you can do parameterized write and read like this:
ODPNetConnect odp = new ODPNetConnect();
if (!String.IsNullOrWhiteSpace(odp.ERROR))
{
throw new Exception(odp.ERROR);
}
//Write:
string sql = #"INSERT INTO TABLE (D1, D2, D3) VALUES (:D1, :D2, :D3)";
Dictionary<string, object> params = new Dictionary<string, object>();
params["D1"] = "D1";
params["D2"] = "D2";
params["D3"] = "D3";
int affectedRows = odp.ParameterizedWrite(sql, params);
if (!String.IsNullOrWhiteSpace(odp.ERROR))
{
throw new Exception(odp.ERROR);
}
//read
string sql = #"SELECT * FROM TABLE WHERE D1 = :D1";
Dictionary<string, object> params = new Dictionary<string, object>();
params["D1"] = "D1";
DataTable dt = odp.ParameterizedRead(sql, params);
if (!String.IsNullOrWhiteSpace(odp.ERROR))
{
throw new Exception(odp.ERROR);
}
Notes: you have to change these lines in ODPNetConnect.cs to set connection string:
static private string devConnectionString = "SET YOUR DEV CONNECTION STRING";
static private string productionConnectionString = "SET YOUR PRODUCTION CONNECTION STRING";
And you need to change line 123 to set environment to dev or prod.
public OracleConnection GetConnection(string env = "dev", bool cacheOn = false)