Pig: Multiple join statements in a single statement - hadoop

Please help me get this done in Pig.
Input:
record1: ("Ammit", 123, 234, 345)
record2: (map : [
"123" : ("accountNo": 123, "bank": "ICICI Bank", "branch" : "Delhi"),
"234" : ("accountNo": 234, "bank": "HDFC Bank", "branch" : "Mumbai"),
"345" : ("accountNo": 345, "bank": "SBI", "branch" : "Bangalore"),
])
The above data represents Amit's bank accounts, with the details accountNo, bank and branch. Record1 contains the name followed by 3 account number ids, which are ordered (i.e. they represent the order in which Amit opened the accounts).
output: ("Amit",
"123" : ("accountNo": 123, "bank": "ICICI Bank", "branch" : "Delhi"),
"234" : ("accountNo": 234, "bank": "HDFC Bank", "branch" : "Mumbai"),
"345" : ("accountNo": 345, "bank": "SBI", "branch" : "Bangalore"),
)
How do I achieve this?

You could flatten the map, and then a merge join will maintain the order:
https://wiki.apache.org/pig/PigMergeJoin

I solved it using a UDF, MapToBag, defined here. That gave me access to the values of the map in record2, which I used to join with the ids from record1.
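Putting the two answers together, a rough, untested Pig sketch of the approach might look like the following; the load statements, field names and the MapToBag output schema are assumptions, not something from the question.

-- Turn record1 into (name, accountNo) rows, flatten record2's map into
-- (accountNo, info) rows with a UDF such as MapToBag, then join.
r1  = LOAD 'record1' AS (name:chararray, a1:long, a2:long, a3:long);
ids = FOREACH r1 GENERATE name, FLATTEN(TOBAG(a1, a2, a3)) AS accountNo;

r2 = LOAD 'record2' AS (details:map[]);
kv = FOREACH r2 GENERATE FLATTEN(MapToBag(details)) AS (accountNo, info);  -- schema depends on the UDF

-- 'merge' keeps the sort order but requires both inputs already sorted on accountNo;
-- with unsorted inputs, use a plain JOIN and re-order afterwards.
joined = JOIN ids BY accountNo, kv BY accountNo USING 'merge';
result = FOREACH joined GENERATE ids::name, kv::info;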

Related

Iterate through a child map in a for loop in groovy

Let us assume I have a map like the one below, which contains another map (a child map) within it. I would like to print the child map as individual rows for each key.
def map = [
    1: [
        [name: "Jerry", age: 42, city: "New York"],
        [name: "Long", age: 25, city: "New York"]
    ],
    2: [
        [name: "Dustin", age: 29, city: "New York"],
        [name: "Bob", age: 34, city: "New York"]
    ]
]
Currently, I iterate through the parent map and am able to print the names, but that is not actually what I want. I wanted to show that I can at least iterate through the parent map; I do not see errors so far.
for (allentries in map) {
    loggerApi.info("${allentries.key}: ${allentries.value.name}") // this prints [Jerry, Long] for key 1 & [Dustin, Bob] for key 2
    loggerApi.info("${allentries.key}: ${allentries.value.age}")
    loggerApi.info("${allentries.key}: ${allentries.value.city}")
}
Can you please let me know what I should add to the above code to see individual rows like these:
Name Age City
Jerry 42 New York
Long 25 New York
Dustin 29 New York
Bob 34 New York
I'd appreciate any help.
Thank you.
Br,
Noor.
Assuming you got the input structure wrong in the question, and you have a list of maps inside each value in map, you can just iterate through them in an inner loop:
def map = [
    1: [
        [name: "Jerry", age: 42, city: "New York"],
        [name: "Long", age: 25, city: "New York"]
    ],
    2: [
        [name: "Dustin", age: 29, city: "New York"],
        [name: "Bob", age: 34, city: "New York"]
    ]
]

map.each { key, value ->
    value.each { submap ->
        println "$key $submap.name $submap.age $submap.city"
        // Or with no prefixing number:
        // println "$submap.name $submap.age $submap.city"
    }
}
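If you also want the header row from your expected output, one variation (just a sketch; the column widths are arbitrary and printf writes to standard out rather than your loggerApi) is:

printf('%-8s %-4s %s%n', ['Name', 'Age', 'City'])
map.each { key, value ->
    value.each { submap ->
        // left-align name and age into fixed-width columns
        printf('%-8s %-4d %s%n', [submap.name, submap.age, submap.city])
    }
}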

Creating a new column in JMP using an if condition from another column

I am very new to JMP, so I am still feeling my way around.
I want to create a new column called "Status" in JMP. This status is character and depends on the value of the column "Grade". If the value of the entry in column "Grade" is zero, the value of the entry in column "Status" should be "fail". If the "Grade" value is greater than 100, the entry in column "Status" should be "invalid". If the "Grade" value is less than 0, the "Status" value should be "invalid". This should be simple, but somehow my script won't work:
dt = Current Data Table();
dt << New Column("Status", Character, Formula(
If(:Name( "Grade" )==0, "fail",
:Name( "Grade" )>100, "invalid",
:Name( "Grade" )<0, "invalid")
));
Can you help me debug this script?
I just tried the script, and the formula is working for me.
Here is some JSL which is a bit more complete, and which also adds the "Grade" column upon which "Status" depends:
dt = Current Data Table();
dt << New Column( "Grade",
    Numeric,
    "Continuous",
    Format( "Best", 12 ),
);
dt << New Column( "Status",
    Character,
    "Nominal",
    Formula(
        If(
            :Grade == 0, "fail",
            :Grade > 100, "invalid",
            :Grade < 0, "invalid"
        )
    )
);
Perhaps the issue is that you don't already have a data table open with a Grade column? Here's a script to create a brand-new table with the formula and some values:
New Table( "Grading Test",
Add Rows( 7 ),
New Column( "Grade",
Numeric,
"Continuous",
Format( "Best", 12 ),
Set Selected,
Set Values( [45, 20, 100, 101, -4, 0, 120] )
),
New Column( "Status",
Character,
"Nominal",
Formula(
If(
:Grade == 0, "fail",
:Grade > 100, "invalid",
:Grade < 0, "invalid"
)
)
)
);
I created that by building the table interactively and then selecting "Copy Table Script" from the red-triangle menu.
I tried this in JMP 12.0; which version are you using?
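One more note: JSL's If() also accepts a trailing else result, so you could cover ordinary grades as well. The "pass" label below is just an illustration, not something from the question:
dt << New Column( "Status",
    Character,
    "Nominal",
    Formula(
        If(
            :Grade == 0, "fail",
            :Grade > 100, "invalid",
            :Grade < 0, "invalid",
            "pass" // else result for any other grade; label is an assumption
        )
    )
);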

Can't match digits in haystack Elasticsearch

I have some products that I'm indexing that go something like "99% chocolate". If I search for chocolate, it matches this particular item, but if I search for "99", it doesn't match. I came across this question, Using django haystack autocomplete with elasticsearch to search for digits/numbers?, which had the same issue, but nobody has answered it. Can someone please help?
Edit2: I'm sorry I neglected to include an important detail. The numeric search itself works, but the autocomplete doesn't work. I'm including the relevant lines:
#the relevant line in my index
name_auto = indexes.EdgeNgramField(model_attr='name')
#the relevant line in my view
prodSqs = SearchQuerySet().models(Product).autocomplete(name_auto=request.GET.get('q', ''))
Edit: following are the results of running the analyser:
curl -XGET 'localhost:9200/haystack/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
"tokens" : [ {
"token" : "99",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "chocolate",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
Finally found the answer here: ElasticSearch: EdgeNgrams and Numbers.
Add the following classes, and change the engine under HAYSTACK_CONNECTIONS in the settings file to use CustomElasticsearchSearchEngine below instead of the default haystack one:
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend, ElasticsearchSearchEngine


class CustomElasticsearchBackend(ElasticsearchSearchBackend):
    """
    The default ElasticsearchSearchBackend settings don't tokenize strings of digits the same way
    as words, so they get lost: the lowercase tokenizer is the culprit. Switching to the standard
    tokenizer and doing the case-insensitivity in the filter seems to do the job.
    """
    def __init__(self, connection_alias, **connection_options):
        # see https://stackoverflow.com/questions/13636419/elasticsearch-edgengrams-and-numbers
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['tokenizer'] = 'standard'
        self.DEFAULT_SETTINGS['settings']['analysis']['analyzer']['edgengram_analyzer']['filter'].append('lowercase')
        super(CustomElasticsearchBackend, self).__init__(connection_alias, **connection_options)


class CustomElasticsearchSearchEngine(ElasticsearchSearchEngine):
    backend = CustomElasticsearchBackend
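For reference, the settings change would look roughly like this; the dotted path to the engine class depends on where you put the classes above, so treat it as a placeholder:

# settings.py (sketch; 'myapp.search_backends' is a made-up module path)
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'myapp.search_backends.CustomElasticsearchSearchEngine',
        'URL': 'http://127.0.0.1:9200/',
        'INDEX_NAME': 'haystack',
    },
}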
Running your string "99% chocolate" through the standard analyser gives the right results (99 is a term on its own), so if you're not using it currently, you should switch to it:
curl -XGET 'localhost:9200/myindex/_analyze?analyzer=standard&pretty' -d '99% chocolate'
{
"tokens" : [ {
"token" : "99",
"start_offset" : 0,
"end_offset" : 2,
"type" : "<NUM>",
"position" : 1
}, {
"token" : "chocolate",
"start_offset" : 4,
"end_offset" : 13,
"type" : "<ALPHANUM>",
"position" : 2
} ]
}
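Once the engine is switched, keep in mind that the analyzer only changes when the index mapping is rebuilt, so existing data needs to be re-indexed, e.g. with haystack's management command:

python manage.py rebuild_index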

How to pluck by key name matching?

Say I have the following table:
[
{numberOfRedStripes: 7, numberOfBlueStripes: 6, stars: 50, foo: "bar"},
{numberOfRedStripes: 1, numberOfBlueStripes: 1, stars: 0, something: "else"}
]
How can I use a regex to pluck only those fields whose keys start with the string 'numberOf', so that the result would be:
[
{numberOfRedStripes: 7, numberOfBlueStripes: 6},
{numberOfRedStripes: 1, numberOfBlueStripes: 1}
]
?
Does this work?
table.map {|row|
row.pluck(r.args(row.keys().filter{|key| key.match("^numberOf")}))
}
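That looks like the Ruby driver syntax; roughly the same thing with the JavaScript driver (an untested sketch; 'flags' is a made-up table name) would be:
// For each row, keep only the fields whose key matches ^numberOf
r.table('flags').map(function (row) {
  return row.pluck(r.args(row.keys().filter(function (key) {
    return key.match('^numberOf');
  })));
});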

MongoDB geospatial query with sort - performance issues

I have a query which is very slow (~2.5s):
db.markers.find({ latlng: { '$within': { '$box': [ [ -16, -140 ], [ 75, 140 ] ] } } }).sort({_id: -1}).limit(1000)
When I run explain for this query I get
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 242331,
"nscanned" : 242331,
"nscannedObjectsAllPlans" : 242331,
"nscannedAllPlans" : 242331,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 1383,
"nChunkSkips" : 0,
"millis" : 2351,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(262221),
"matchesPerfd" : NumberLong(242331),
"objectsLoaded" : NumberLong(242331),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
When I remove sort({_id: -1}), explain gives me (fast query, 5 ms):
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 1000,
"nscanned" : 1000,
"nscannedObjectsAllPlans" : 1000,
"nscannedAllPlans" : 1000,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 5,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(1000),
"matchesPerfd" : NumberLong(1000),
"objectsLoaded" : NumberLong(1000),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
I have a compound index, a 2d index on latlng, and a descending index on _id:
db.markers.ensureIndex({latlng: '2d', _id:-1})
db.markers.ensureIndex({ latlng: '2d' })
db.markers.ensureIndex({ _id: -1 })
What I want to achieve is to get markers from a particular area sorted from newest.
Any ideas or suggestions on how to get this well under 2.5 seconds?
If someone wants to do their own tests:
var i = 0,
lat = 0,
lng = 0;
for (i; i < 260000; i++) {
lat = parseFloat(Math.min(-90 + (Math.random() * 180), 90).toFixed(6));
lng = parseFloat(Math.min(-180 + (Math.random() * 360), 180).toFixed(6));
collection.insert({latlng: [lat, lng]}, function () {});
}
collection.find({ latlng: { '$within': { '$box': [ [ -90, -180 ], [ 90, 180 ] ] } } }, {latlng: 1, _id: 1 }).sort({_id: -1}).limit(1000).explain()
On my local machine I receive (~2.6s):
{
"cursor" : "GeoBrowse-box",
"isMultiKey" : false,
"n" : 1000,
"nscannedObjects" : 260000,
"nscanned" : 260000,
"nscannedObjectsAllPlans" : 260000,
"nscannedAllPlans" : 260000,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 1612,
"nChunkSkips" : 0,
"millis" : 2613,
"indexBounds" : {
"latlng" : [ ]
},
"lookedAt" : NumberLong(260000),
"matchesPerfd" : NumberLong(260000),
"objectsLoaded" : NumberLong(260000),
"pointsLoaded" : NumberLong(0),
"pointsSavedForYield" : NumberLong(0),
"pointsChangedOnYield" : NumberLong(0),
"pointsRemovedOnYield" : NumberLong(0),
"server" : "xx:27017"
}
Thx
Do you actually have the following three indexes defined on your collection?
db.markers.ensureIndex({ latlng: '2d', _id:-1 })
db.markers.ensureIndex({ latlng: '2d' })
db.markers.ensureIndex({ _id: -1 })
The geospatial indexing docs advise against creating multiple geo indexes on the same collection. Although MongoDB will allow it, the behavior may be undesirable. My guess for your case is that the non-compound {latlng: '2d'} index may have been selected instead of the compound index. The explain() output doesn't really help us here, since it simply reports GeoBrowse-box instead of the index name; however, I would suggest manually hinting that the cursor use the compound index and seeing if the results improve. Alternatively, simply get rid of the non-compound index, so that {latlng: '2d', _id: -1} becomes the obvious and only choice for the query optimizer.
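For example, something along these lines (a sketch; the hint document must match the key pattern the compound index was created with):
db.markers.find({ latlng: { $within: { $box: [[-16, -140], [75, 140]] } } })
          .sort({ _id: -1 })
          .limit(1000)
          .hint({ latlng: '2d', _id: -1 })
          .explain()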
Lastly, the {_id: -1} index is redundant and can be removed. Per the compound index documentation, direction is only relevant when dealing with indexes comprised of multiple fields. For a single-key index, we can walk the index backwards or forwards easily enough. Since MongoDB already creates an {_id: 1} index for us by default, it's more efficient to simply rely on that.
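Dropping it is a one-liner (the default { _id: 1 } index itself stays in place):
db.markers.dropIndex({ _id: -1 })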
Now, with indexing out of the way: one caveat with your query is that limits are applied to the geospatial query component before sorting by non-geo criteria (_id in your case). I believe this means that, while your results will indeed be sorted by _id, that sort may not be considering all documents within the matched bounds. This is mentioned in the compound index bit of the documentation, which references SERVER-4247 as a pending solution.
Edit: Following up with your benchmark
I populated the example data, which are 260k random points between ±90 and ±180. I then ran your query:
db.markers.find(
{ latlng: { $within: { $box: [[-90, -180], [90, 180]] }}},
{ latlng: 1, _id: 1 }
).sort({_id: -1}).limit(1000).explain()
That took 1713ms (I'll use that as a baseline of comparison instead of your time of 2351ms). I'll also note that the query matched all 260k documents, and scanned the same number of index entries. It appears the limit didn't factor in until the _id sort, which is not what I would have expected based on the note here. I then tweaked the query a bit to examine some other cases:
Original query without the _id sort and limit: nscanned is 260k and time is 1470ms.
Original query without the _id sort: nscanned is 1000 and time is 9ms.
Original query without the limit: nscanned is 260k and time is 2567ms.
I also wanted to test sorting on an unindexed field alone to simulate what might happen for the _id sort after a geo match; however, I couldn't use _id since the default index will always exist. To do this, I deleted the compound geo index and then sorted by the latlng object. This resulted in nscanned of 260k and a time of 1039ms. If I add a limit of 1000, the time was 461ms.
If we add that to the 1470ms above (geo query without a sort and limit), it's very close to the original query without a limit, which was 2567ms. Likewise, if we add 461ms (limited sort) to 1470ms, it's near the original benchmark result of 1713ms. Based on that correlation, I'd wager that the _id sort in your benchmark isn't taking advantage of the compound index at all.
In any event, one other reason the benchmark is slow is due to a very wide geo match. Tighter bounds would definitely result in less data to sort, even with that sort being unindexed. That said, I do think SERVER-4247 would help you, since it would likely process the non-geo sort first before performing the geo match.
Are your indexes using compound keys?
db.markers.ensureIndex({latlng: '2d', _id:-1})
