parallel programming and datasets with datatables - visual-studio-2010

I have a TaskFactory application that starts new tasks to complete processing using multiple classes. Everything seems to work fine apart from occasional IOException errors. The latest errors involve DataSets and DataTables: in each process that runs, I access a database table to extract 100000 IDs which I use in my process. In doing so I add a DataTable to the DataSet and use the DataTable in a different portion of the process.
When creating a new task I assumed that it would generate its own DataTable and DataSet in memory, since a new process involves a new instance of the classes involved, and that this would be thread-safe. However, I get the error
'Datatable already belongs in this dataset'
After researching, I found that you cannot create multiple datatables in the same set. So is there any way around this, or should I consider a different data structure, perhaps a list?
Public Class tprocess
    ....
    ....
    Public Shared dsCountry As New DataSet
    Public Shared dsTable As New DataTable
    Public Shared Sub Main()
        do_processing()
    End Sub
    Public Shared Sub do_processing()
        Try
            dsCountry.Tables.Add(dsTable)
            ... ' sql steps to get data with SqlDataReader

It sounds like you're sharing the same DataSet and creating new DataTable instances in that one DS. Is that the case?
It can be tricky. For example, this would use the same instance of a DataSet:
void Method()
{
    DataSet ds = null;
    Task t1 = Task.Factory.StartNew(delegate { ds = new DataSet(); /* work with ds */ });
    Task t2 = Task.Factory.StartNew(delegate { ds = new DataSet(); /* work with ds */ });
}

Related

Is there a way to run the NRules engine asynchronously?

I want to run all rules asynchronously to make it thread safe.
When I am performing a load test, why is the rule engine taking so much time to execute all rules?
NRuleRepository repository = null;
foreach (var rule in rules)
{
    repository = new NRuleRepository();
    repository.LoadRules(rule.Rule);
    var factory = repository.Compile();
    var session = factory.CreateSession();
    NRuleBody data = null;
    foreach (var fact in rule.RuleDataList)
    {
        data = new NRuleBody();
        data.Rule = rule.Rule;
        data.RuleData = fact;
        session.Insert(data);
    }
    result += session.Fire();
}
Can I make the call as below?
session.FireAsync();
Or is there any other option to fire multiple rules asynchronously?
And should the NRuleRepository class be reinitialized on every request?
At the very least, you could probably use Task.Run() to process each instance of the repository on its own thread-pool thread, but what you're doing seems very inefficient.
Why are you inserting the rule with the data in the session? You've already added the rule to the repository.
If you only ever have a single rule in the repository, NRules is almost certainly overkill, and you would be better off doing almost anything else.

DbContext per request aspboilerplate

I need to implement a multithreaded background job for importing a file.
I have implemented it as a background job (Hangfire), but with a single thread it runs very slowly.
The function looks like this.
I am using a non-transactional unit of work to save changes to the database immediately.
var contactFound = await _contactRepository.FirstOrDefaultAsync(x => x.Email.ToLower() == contact.Email.ToLower());
if (contactFound != null)
{
    await _bjInfoManager.AddLog(args.JobId, "Found duplicated email: " + contact.Email);
}
else
{
    contact.ContactListId = args.ContactListId;
    contact.Email = contact.Email.ToLower();
    await _contactRepository.InsertAsync(contact);
    // Save changes in db
    await CurrentUnitOfWork.SaveChangesAsync();
}
The problem occurs when I try to use this with the producer-consumer Dataflow pattern. It throws the exception "A second operation started on this context before a previous asynchronous operation completed."
The question is how to create an isolated DbContext inside this method.
Please help me.
Transactions should not be multi-threaded. If you create a new task/thread inside a UOW, you can create a separate UOW for it using IUnitOfWorkManager.Begin(TransactionScopeOption.RequiresNew) in a using block.
See the links
https://github.com/aspnetboilerplate/aspnetboilerplate/issues/619
Does Entity Framework support Multi-Threading?
Entity Framework and Multi threading
If you are using Microsoft SQL Server, then I recommend using bulk insert. It is much faster than inserting rows through Entity Framework.
https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql

Save Spark Dataframe into Elasticsearch - Can’t handle type exception

I have designed a simple job to read data from MySQL and save it in Elasticsearch with Spark.
Here is the code:
JavaSparkContext sc = new JavaSparkContext(
    new SparkConf().setAppName("MySQLtoEs")
        .set("es.index.auto.create", "true")
        .set("es.nodes", "127.0.0.1:9200")
        .set("es.mapping.id", "id")
        .set("spark.serializer", KryoSerializer.class.getName()));
SQLContext sqlContext = new SQLContext(sc);
// Data source options
Map<String, String> options = new HashMap<>();
options.put("driver", MYSQL_DRIVER);
options.put("url", MYSQL_CONNECTION_URL);
options.put("dbtable", "OFFERS");
options.put("partitionColumn", "id");
options.put("lowerBound", "10001");
options.put("upperBound", "499999");
options.put("numPartitions", "10");
// Load MySQL query result as DataFrame
LOGGER.info("Loading DataFrame");
DataFrame jdbcDF = sqlContext.load("jdbc", options);
DataFrame df = jdbcDF.select("id", "title", "description",
    "merchantId", "price", "keywords", "brandId", "categoryId");
df.show();
LOGGER.info("df.count : " + df.count());
EsSparkSQL.saveToEs(df, "offers/product");
You can see the code is very straightforward. It reads the data into a DataFrame, selects some columns and then performs a count as a basic action on the Dataframe. Everything works fine up to this point.
Then it tries to save the data into Elasticsearch, but it fails because it cannot handle some type. You can see the error log here.
I'm not sure about why it can't handle that type. Does anyone know why this is occurring?
I'm using Apache Spark 1.5.0, Elasticsearch 1.4.4 and elasticsearch-hadoop 2.1.1.
EDIT:
I have updated the gist link with a sample dataset along with the source code.
I have also tried to use the elasticsearch-hadoop dev builds as mentioned by @costin on the mailing list.
The answer for this one was tricky, but thanks to samklr, I have managed to figure out what the problem was.
The solution isn't straightforward, however, and might involve some "unnecessary" transformations.
First let's talk about Serialization.
There are two aspects of serialization to consider in Spark: serialization of data and serialization of functions. In this case, it's about data serialization and thus de-serialization.
From Spark’s perspective, the only thing required is setting up serialization - Spark relies by default on Java serialization which is convenient but fairly inefficient. This is the reason why Hadoop itself introduced its own serialization mechanism and its own types - namely Writables. As such, InputFormat and OutputFormats are required to return Writables which, out of the box, Spark does not understand.
With the elasticsearch-spark connector one must enable a different serialization (Kryo) which handles the conversion automatically and also does this quite efficiently.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
Kryo does not require that a class implement a particular interface to be serialized, which means POJOs can be used in RDDs without any further work beyond enabling Kryo serialization.
That said, @samklr pointed out to me that Kryo needs to register classes before using them.
This is because Kryo writes a reference to the class of the object being serialized (one reference is written for every object written), which is just an integer identifier if the class has been registered but is the full classname otherwise. Spark registers Scala classes and many other framework classes (like Avro Generic or Thrift classes) on your behalf.
Registering classes with Kryo is straightforward. Create a class that implements KryoRegistrator, and override the registerClasses() method:
public class MyKryoRegistrator implements KryoRegistrator, Serializable {
    @Override
    public void registerClasses(Kryo kryo) {
        // Product POJO associated to a product Row from the DataFrame
        kryo.register(Product.class);
    }
}
Finally, in your driver program, set the spark.kryo.registrator property to the fully qualified classname of your KryoRegistrator implementation:
conf.set("spark.kryo.registrator", "MyKryoRegistrator")
Secondly, even though the Kryo serializer is set and the class registered, with the changes made in Spark 1.5 Elasticsearch for some reason couldn't de-serialize the DataFrame, because the connector can't infer the SchemaType of the DataFrame.
So I had to convert the DataFrame to a JavaRDD:
JavaRDD<Product> products = df.javaRDD().map(new Function<Row, Product>() {
    public Product call(Row row) throws Exception {
        long id = row.getLong(0);
        String title = row.getString(1);
        String description = row.getString(2);
        int merchantId = row.getInt(3);
        double price = row.getDecimal(4).doubleValue();
        String keywords = row.getString(5);
        long brandId = row.getLong(6);
        int categoryId = row.getInt(7);
        return new Product(id, title, description, merchantId, price, keywords, brandId, categoryId);
    }
});
Now the data is ready to be written into Elasticsearch:
JavaEsSpark.saveToEs(products, "test/test");
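The Product class used in the mapping above is not shown in the answer. A minimal sketch of what it might look like, assuming a plain serializable bean whose fields simply mirror the constructor call (the field types and getter names are assumptions, not the original class):

```java
import java.io.Serializable;

// Hypothetical POJO: fields mirror the arguments passed to the Product constructor above.
// Declared package-private so the sketch compiles in any file;
// make it public in its own Product.java for real use.
class Product implements Serializable {
    private static final long serialVersionUID = 1L;

    private final long id;
    private final String title;
    private final String description;
    private final int merchantId;
    private final double price;
    private final String keywords;
    private final long brandId;
    private final int categoryId;

    Product(long id, String title, String description, int merchantId,
            double price, String keywords, long brandId, int categoryId) {
        this.id = id;
        this.title = title;
        this.description = description;
        this.merchantId = merchantId;
        this.price = price;
        this.keywords = keywords;
        this.brandId = brandId;
        this.categoryId = categoryId;
    }

    // Bean-style getters, one per field.
    public long getId() { return id; }
    public String getTitle() { return title; }
    public String getDescription() { return description; }
    public int getMerchantId() { return merchantId; }
    public double getPrice() { return price; }
    public String getKeywords() { return keywords; }
    public long getBrandId() { return brandId; }
    public int getCategoryId() { return categoryId; }
}
```

The getters are included on the assumption that the connector reads bean properties when turning each object into a document.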
References:
Elasticsearch's Apache Spark support documentation.
Hadoop Definitive Guide, Chapter 19. Spark, ed. 4 – Tom White.
User samklr.

Is paging broken with spring data solr when using group fields?

I currently use the Spring Data Solr library and implement its repository interfaces. I'm trying to add functionality to one of my custom queries that uses a Solr template with a SimpleQuery. It currently uses paging, which appears to be working well. However, I want to use a group field so that sibling products are only counted once, at their first occurrence. I have set the group field on the query and it works well, but it still seems to use the ungrouped number of documents when constructing the page attributes.
Is there a known workaround for this?
The query syntax provides the following parameter for this purpose, but it would seem that Spring Data Solr isn't taking advantage of it: &group.ngroups=true should return the number of groups in the result and thus give correct page numbering.
Any other info would be appreciated.
There are actually two ways to add this parameter.
Queries are converted to the solr format using QueryParsers, so it would be possible to register a modified one.
QueryParser modifiedParser = new DefaultQueryParser() {
    @Override
    protected void appendGroupByFields(SolrQuery solrQuery, List<Field> fields) {
        super.appendGroupByFields(solrQuery, fields);
        solrQuery.set(GroupParams.GROUP_TOTAL_COUNT, true);
    }
};
solrTemplate.registerQueryParser(Query.class, modifiedParser);
Using a SolrCallback would be a less intrusive option:
final Query query = //...whatever query you have.
List<DomainType> result = solrTemplate.execute(new SolrCallback<List<DomainType>>() {
    @Override
    public List<DomainType> doInSolr(SolrServer solrServer) throws SolrServerException, IOException {
        SolrQuery solrQuery = new QueryParsers().getForClass(query.getClass()).constructSolrQuery(query);
        // add missing params
        solrQuery.set(GroupParams.GROUP_TOTAL_COUNT, true);
        return solrTemplate.convertQueryResponseToBeans(solrServer.query(solrQuery), DomainType.class);
    }
});
Please feel free to open an issue.
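To illustrate why the group count matters for paging, here is a small arithmetic sketch. The numbers and the totalPages helper are hypothetical and not part of Spring Data Solr; the point is only that the page count should be derived from the number of groups rather than from the number of matching documents.

```java
// Hypothetical helper class, for illustration only.
class GroupPaging {

    // Number of pages needed to show totalElements items at pageSize items per page.
    static int totalPages(long totalElements, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Integer ceiling division.
        return (int) ((totalElements + pageSize - 1) / pageSize);
    }

    public static void main(String[] args) {
        long matches = 95; // assumed ungrouped document count
        long ngroups = 42; // assumed group count, as reported with group.ngroups=true
        int pageSize = 10;

        System.out.println(totalPages(matches, pageSize)); // prints 10
        System.out.println(totalPages(ngroups, pageSize)); // prints 5
    }
}
```

With one row per group, paging by the ungrouped count would advertise ten pages here although only five contain results.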

Separation of Concerns: Returning Projected Data between layers From a Linq Query

I'm using Linq and having trouble doing something that I believe should be trivial. I want to return data from one layer so it can be used independently of linq in another layer.
Suppose I have a Data Access Layer. It knows about the entity framework and how to interact with it. But, it doesn't care who accesses it. The one interesting requirement I have is that the queries in the entity framework return projected data that is not part of the Entity Model itself. Please don't ask me to change this part of the requirement and make POCOs for each return type, as it is not the best design given the problem I am trying to solve. Below is an example.
public class ChartData
{
    public <<returnType??>> GetData()
    {
        MyEntities context = new MyEntities();
        var results = from v in context.vManyColumnsOfData
                      where v.CompanyName == "acme"
                      select new { Year = v.SalesYear, Income = v.Income };
        return ??;
    }
}
Then, I would like to have an ASP.Net UI layer be able to call into the Data Access Layer to get the data in order to bind it to a control. The UI layer should have no notion of where the data came from. It should only know that it has the data it needs to bind. Below is an example.
protected void chart_Load(object sender, EventArgs e)
{
    // set some chart properties
    chart.Skin = "Default";
    ...
    // Set the data source
    ChartData dataMgr = new ChartData();
    <<returnType?>> data = dataMgr.GetData();
    chart.DataSource = data;
    chart.DataBind();
}
What is the best way to send linq projected data back to another layer?
If you don't need to use the projected type statically, just return IEnumerable<object>.
"Please don't ask me to change this part of the requirement and make POCOs for each return type, as it is not the best design given the problem I am trying to solve."
I feel like I should rightly ignore this, as the best thing to do is to return a defined type. Anonymous types are useful when they are wholly contained within the method that creates them. Once you start passing them around, it is time to go ahead and give them the proper class treatment.
However, to live within your imposed limitations, you can return IEnumerable<object> from the method and use that or var at the callsite and rely upon the dynamic binding of the control to get at the data. It's not going to help you if you need to deal with the object programmatically, but it will serve fine for databinding.
You cannot return an anonymous type as such, so basically for this you will need POCOs even though you don't want them.
"not the best design given the problem I am trying to solve"
Could you explain what you are trying to achieve a little more? It might be possible to return some type of list containing a dictionary of items (i.e. rows and columns). Think of something like an untyped dataset (yuck).
Your GetData method can use IEnumerable (the "old" non-generic interface) as its return type.
Any dynamic resolution (e.g. ASP.NET or XAML bindings) should work as expected, which seems to be what you want to do.
However, if you want to use the results in your code, you will probably have to resort to .NET 4's dynamic keyword.
The following example can be run in LINQPad (in "C# Program" mode) and illustrates this:
void Main()
{
    var v = GetData();
    foreach (dynamic element in v)
    {
        ((string)element.Name).Dump();
    }
}
public IEnumerable GetData()
{
    return from i in Enumerable.Range(1, 10)
           select new
           {
               Name = "Item " + i,
               Value = i
           };
}
Keep in mind that, design-wise, coding like this will make most people frown and can affect performance.
