I'm trying to write a Cascading (v1.2) cascade (http://docs.cascading.org/cascading/1.2/userguide/htmlsingle/#N20844) consisting of two flows:
1) The first flow outputs URLs to a DB table, in which they are automatically assigned IDs via an auto-incrementing column.
This flow also outputs pairs of URLs into a SequenceFile with the field names "urlTo", "urlFrom".
2) The second flow reads from both these sources and tries to do a CoGroup on "urlTo" (from the SequenceFile) and "url" (from the db source) to get the db record "id" for each "urlTo".
It then does a CoGroup on "urlFrom" and "url" to get the db record "id" for each "urlFrom".
The two flows work individually - if I call flow.complete() on the first before running the second, everything is fine. But if I put the two flows in a Cascade object, I get the error
cascading.cascade.CascadeException: no loops allowed in cascade, flow: urlLink*url*url, source: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='urls', columnNames=null, columnDefs=null, primaryKeys=null}}, sink: JDBCTap{connectionUrl='jdbc:mysql://localhost:3306/mydb', driverClassName='com.mysql.jdbc.Driver', tableDesc=TableDesc{tableName='url_link', columnNames=[urlLinkFrom, urlLinkTo], columnDefs=[bigint(20), bigint(20)], primaryKeys=[urlLinkFrom, urlLinkTo]}}
on trying to configure the cascade.
I can see it's coming from the addEdgeFor function of the CascadeConnector, but I'm not clear on how to resolve this problem.
I've never used Cascade / CascadeConnector before. Is there something I'm missing?
It seems like some of your paths for sources and sinks are the same.
A Cascade is built internally as a directed graph, so a flow whose source and sink point to the same location in essence creates a loop, which the directed graph of flows does not allow:
it does not go from
Source Location A to Sink Location B
but instead goes from
Source Location A to Sink Location A.
"A Tap is not given an explicit name by design. This is so a given Tap instance can be re-used in different {#link Flow}s that may expect a source or sink by a different logical name, but are the same physical resource."
"In general, two instances of the same Tap class must have differing Identifiers (and different #equals)."
It turns out that JDBCTaps generate their identifier from the connection URL alone (they do not include the table name). So, as I was reading from one table and writing to a different table in the same database, it looked like I was reading from and writing to the same Tap, creating a loop.
As a work-around, I'm going to subclass the JDBCTap and override the getIdentifier() method to include the table name.
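Something like this minimal sketch - assuming the constructor signature below matches your cascading-jdbc version and that JDBCTap exposes getTableName() (otherwise read the name from the TableDesc):

import cascading.jdbc.JDBCScheme;
import cascading.jdbc.JDBCTap;
import cascading.jdbc.TableDesc;

public class TableAwareJDBCTap extends JDBCTap {

    public TableAwareJDBCTap(String connectionUrl, String driverClassName,
                             TableDesc tableDesc, JDBCScheme scheme) {
        super(connectionUrl, driverClassName, tableDesc, scheme);
    }

    @Override
    public String getIdentifier() {
        // Append the table name so two taps on the same database no longer
        // share an identifier (and no longer look like a loop to the
        // CascadeConnector).
        return super.getIdentifier() + "/" + getTableName();
    }
}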
I am simulating an assembly process in Arena. To keep things simple, suppose a model that assigns bags to passengers.
Each entity has a different ID (Entity.SerialNumber): I would like to check which bag has been assigned to which passenger.
How could I write a log file saving the ID of passenger and ID of the assigned bag?
First, declare an attribute on the original entity (in this case, the passenger), for example ID = ID + 1. Then duplicate the original entity with a Separate module; both entities can then follow their own independent flows. When you want to bring them back together, use a Batch module with the rule "By Attribute", or a Match module with the type "Based on Attribute".
I'm trying to save an h2o model with a specific name that differs from the model's model_id field, but calling something like...
h2o.save_model(model=model,
               path='/some/path/then/filename',
               force=False)
just creates a dir/file structure like
some
|__path
   |__then
      |__filename
         |__<model_id>
as opposed to
some
|__path
   |__then
      |__filename
Is this possible to do from the save_model method?
I can't (or at least hesitate to) simply change the model_id before calling the save method, because the model names have timestamps appended to them to avoid name collisions with other models that may be on the h2o cluster. I'm trying to remove these timestamps when saving to disk, and simplifying the name on the cluster before saving would create a window in which a naming collision could occur if another process were also trying to save such a model (with, say, a different timestamp).
Any way to get this behavior or other common alternatives / workarounds?
This is currently not possible; however, I created a feature request here. There is a related question here which shows a solution for R (it could be adapted to Python). The work-around is just to rename the file manually using a few lines of R/Python code.
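In Python, the work-around can look like this minimal sketch (the paths are the hypothetical ones from the question, and h2o.save_model returns the full path it actually wrote, i.e. path plus the model_id):

import os
import h2o

# 'model' is the trained model from the question context.
# Save under the model_id first...
saved_path = h2o.save_model(model=model, path='/some/path/then', force=False)

# ...then move the file to the desired name. os.rename is atomic on POSIX
# filesystems, so there is no window where the saved model is half-named.
os.rename(saved_path, '/some/path/then/filename')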
RocksDBStore<K,V> stores keys and values as byte[] on disk. It converts to/from K- and V-typed objects using the Serdes provided when the RocksDBStore<K,V> object is constructed.
Given this, please help me understand the purpose of the following code in RocksDbKeyValueBytesStoreSupplier:
return new RocksDBStore<>(name,
                          Serdes.Bytes(),
                          Serdes.ByteArray());
Providing Serdes.Bytes() and Serdes.ByteArray() looks redundant.
RocksDbKeyValueBytesStoreSupplier was introduced in KAFKA-5650 (Kafka Streams 1.0.0) as part of KIP-182: Reduce Streams DSL overloads and allow easier use of custom storage engines.
In KIP-182, there is the following sentence:
The new Interface BytesStoreSupplier supersedes the existing StateStoreSupplier (which will remain untouched). This so we can provide a convenient way for users creating custom state stores to wrap them with caching/logging etc if they chose. In order to do this we need to force the inner most store, i.e, the custom store, to be a store of type <Bytes, byte[]>.
Please help me understand why we need to force custom stores to be of type <Bytes, byte[]>?
Another place (KAFKA-5749) where I found a similar sentence:
In order to support bytes store we need to create a MeteredSessionStore and ChangeloggingSessionStore. We then need to refactor the current SessionStore implementations to use this. All inner stores should by of type < Bytes, byte[] >
Why?
Your observation is correct -- the PR implementing KIP-182 missed removing the Serdes from RocksDBStore, which are no longer required. This was already fixed in the 1.1 release.
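As for why the innermost store is forced to <Bytes, byte[]>: it lets all the generic wrapping layers the DSL adds (caching, change-logging, metrics) share a single byte-level contract, with the user-facing Serdes applied exactly once in the outer metered layer. Roughly, as a sketch using the public API (the store name is made up; this is not the internals):

import java.util.Collections;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.state.KeyValueBytesStoreSupplier;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.StoreBuilder;
import org.apache.kafka.streams.state.Stores;

public class StoreBuilderSketch {
    public static void main(String[] args) {
        // The supplier produces the innermost <Bytes, byte[]> store.
        KeyValueBytesStoreSupplier supplier = Stores.persistentKeyValueStore("my-store");

        // The builder stacks byte-based wrappers on top; <String, Long> is
        // translated to raw bytes only at the outer edge.
        StoreBuilder<KeyValueStore<String, Long>> builder =
                Stores.keyValueStoreBuilder(supplier, Serdes.String(), Serdes.Long())
                      .withCachingEnabled()                        // caches bytes
                      .withLoggingEnabled(Collections.emptyMap()); // changelogs bytes

        builder.build(); // roughly: metered -> caching -> change-logging -> RocksDB
    }
}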
I have a choice between two ways of scanning through a key level in a large global array and am trying to figure out if one method is more efficient than the other.
This is a vendor-supplied application and database on the InterSystems Caché database platform. It is written in the old MUMPS style and does not use any of Caché's object persistence features: all data is stored directly in globals, and any indexes are application-maintained.
There is a common convention for repeating data elements attached to entities where the first record will contain a count of child records and then each child record is numbered sequentially at the next key level. For example:
^GBDATA(12345,100)="3"
^GBDATA(12345,100,1)="A^Record"
^GBDATA(12345,100,2)="B^Record"
^GBDATA(12345,100,3)="C^Record"
Where "12345" is the entity key, and "100" is one of the attached detail types. Note that the first "100" record with no other keys has the count of subrecords. There could be anywhere between 0 and hundreds of subrecords attached. The entities are often very wide and there is a lot of other data besides this subrecord type (not shown in example).
Given an entity key, I want to scan through all the subrecords of one type. Would it be faster to use $ORDER to go through the subkeys or to use a FOR loop to anticipate the key values? Does it matter?
$ORDER method:
SET EKEY=12345
SET SEQ=""
FOR
{
    SET SEQ=$ORDER(^GBDATA(EKEY,100,SEQ), 1, ROWDATA)
    QUIT:SEQ=""
    WRITE ROWDATA,!
}
FOR count method:
SET EKEY=12345
SET LIM=^GBDATA(EKEY,100)
FOR SEQ=1:1:LIM
{
    WRITE ^GBDATA(EKEY,100,SEQ),!
}
Does anyone know how $ORDER vs $GET is implemented internally in Caché?
I'm having trouble testing this empirically since we only have one production instance with appropriate data and I can't take it offline to clear the cache. I'm most interested in from-disk performance as opposed to from-cache performance.
You could use %SYS.MONLBL to figure out definitively. My guess is that $ORDER is slightly better.
http://docs.intersystems.com/cache20122/csp/docbook/DocBook.UI.Page.cls?KEY=GCM_monlbl
In regards to your question, "Does anyone know how $ORDER vs $GET is implemented internally in Caché?" - the two are completely different functions.
$ORDER is used to step through the subscripts of a ^Global in a given direction.
$GET is used to pull the data stored at a ^Global node. Below is an example of its use. I use Caché ObjectScript; however, this should give you a general idea.
Global Structure
^People(LastName,FirstName)="Phone"
Global Data
^People("Doe","John")="1035001234"
^People("Smith","Jane")="7405241305"
^People("Wood","Edgar")="7555127598"
Code Sample
SET LASTNAME=""
FOR
{
    SET LASTNAME=$ORDER(^People(LASTNAME))
    QUIT:LASTNAME=""
    SET FIRSTNAME=""
    FOR
    {
        SET FIRSTNAME=$ORDER(^People(LASTNAME,FIRSTNAME))
        QUIT:FIRSTNAME=""
        ; $GET pulls the data stored at the node, i.e. the phone number
        SET PHONE=$GET(^People(LASTNAME,FIRSTNAME))
    }
}
In the sample above, $ORDER walks through each last name in the ^People global and, within each last name, through each first name. $GET then fetches the data stored at the ^People(LASTNAME,FIRSTNAME) node, which is the phone number.
I'm using Grails with an Oracle database. Most of the data in my application is part of a hierarchy that goes something like this (each item containing the following one):
Direction
Group
Building site
Contract
Inspection
Non-conformity
Data visible to a user is filtered according to their access rights, which can be at the Direction, Group, or Building Site level depending on the user's role.
We easily accomplished this by creating a listWithSecurity method for the BuildingSite domain class, which we use instead of list across most of the system. We created another listWithSecurity method for Contract; it basically does a Contract.findAllByBuildingSiteInList(BuildingSite.listWithSecurity()), and so on with the other classes. This has the advantage of keeping all the actual access logic in BuildingSite.listWithSecurity.
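To illustrate, the delegation pattern looks roughly like this sketch (the property and method names are assumptions based on the description above):

class Contract {
    BuildingSite buildingSite

    static List<Contract> listWithSecurity() {
        // All the real access logic lives in BuildingSite.listWithSecurity();
        // each level just filters by the parent objects the user may see.
        // This is also what ends up feeding 1000+ literals into one IN (...).
        Contract.findAllByBuildingSiteInList(BuildingSite.listWithSecurity())
    }
}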
The problem came when we started getting real data in the system: we quickly hit the "ORA-01795: maximum number of expressions in a list is 1000" error. Fair enough - passing a list of over 1000 literals is not the most efficient thing to do - so I tried other ways, even though it meant I would have to move the security logic into each controller.
The obvious way seemed to use a criteria such as this (I only put the Direction level access here for simplicity):
def c = NonConformity.createCriteria()
def listToReturn = c.list(max: params.max, offset: params.offset?.toInteger() ?: 0)
{
    inspection {
        contract {
            buildingSite {
                group {
                    'in'("direction", listOfOneOrTwoDirections)
                }
            }
        }
    }
}
I was expecting Grails to generate a single query with joins that would avoid the ORA-01795 error, but it seems to run a separate query for each level and pass the results back to Oracle as literals in an 'in' clause for the next level. In other words, it does exactly what I was doing, so I get the same error.
Actually, it might be optimising a bit: it seems to solve the problem, but only for one level. In the previous example, I wouldn't get an error for 1001 inspections, but I would for 1001 contracts or building sites.
I also tried doing basically the same thing with findAll and a single HQL where statement to which I passed a single direction, to get the nonConformities in one query. Same thing: it solves the first level, but I get the same error on the other levels.
I did manage to patch it by splitting my 'in' criteria into several 'in's inside an 'or', so no single list of literals is more than 1000 long, but that's profoundly ugly code: a single findAllBy[…]In becomes over 10 lines of code. And in the long run it will probably cause performance problems, since we're stuck doing queries with a very large number of parameters.
Has anyone encountered and solved this problem in a more elegant and efficient way?
This won't win any efficiency awards, but I thought I'd post it as an option for when you just plainly need to query a list of more than 1000 items and none of the more efficient options are available or appropriate. (This Stack Overflow question is at the top of Google's results for "grails oracle 1000".)
In a Grails criteria you can make use of Groovy's collate() method to break up your list...
Instead of this:
def result = MyDomain.createCriteria().list {
    'in'('id', idList)
}
...which throws this exception:
could not execute query
org.hibernate.exception.SQLGrammarException: could not execute query
at grails.orm.HibernateCriteriaBuilder.invokeMethod(HibernateCriteriaBuilder.java:1616)
at TempIntegrationSpec.oracle 1000 expression max in a list(TempIntegrationSpec.groovy:21)
Caused by: java.sql.SQLSyntaxErrorException: ORA-01795: maximum number of expressions in a list is 1000
at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:440)
You'll end up with something like this:
def result = MyDomain.createCriteria().list {
    or {
        idList.collate(1000).each {
            'in'('id', it)
        }
    }
}
It's unfortunate that Hibernate or Grails doesn't do this for you behind the scenes when you try to do an inList of > 1000 items and you're using an Oracle dialect.
I agree with the many discussions on this topic of refactoring your design to not end up with 1000+ item lists but regardless, the above code will do the job.
Along the same lines as Juergen's comment, I've approached a similar problem by creating a DB view that flattens out user/role access rules at their most granular level (Building Site, in your case?). At a minimum, this view might contain just two columns: a Building Site ID and a user/group name. So, in the case where a user has Direction-level access, he/she would have many rows in the security view - one row for each child Building Site of the Direction(s) that the user is permitted to access.
Then, it would be a matter of creating a read-only GORM class that maps to your security view, joining this to your other domain classes, and filtering using the view's user/role field. With any luck, you'll be able to do this entirely in GORM (a few tips here: http://grails.1312388.n4.nabble.com/Grails-Domain-Class-and-Database-View-td3681188.html)
You might, however, need to have some fun with Hibernate: http://grails.org/doc/latest/guide/hibernate.html
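For example, the mapped class could look something like this sketch (the view name and columns are assumptions, not details from your post):

class BuildingSiteAccess implements Serializable {
    Long buildingSiteId
    String username

    static mapping = {
        table 'v_building_site_access' // the flattened user-access view
        version false                  // a view has no optimistic-lock column
        id composite: ['buildingSiteId', 'username']
    }
}

A criteria query can then join from NonConformity down to BuildingSite and filter on the view's username column, letting Oracle do the access filtering in one joined query instead of receiving a huge IN list.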