Count and flatten in Pig - Hadoop

Hi, I have data like this:
{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press", "authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
{"user_id": "knuth86a", "type": "Book", "title": "TeX: The Program", "year": "1986", "publisher": "Addison-Wesley", "authors": [{"name":"Donald E. Knuth"}], "source": "DBLP"}
...
I would like to get the publisher and title, and then apply a count on each group, but I got the error 'a column needs to be...' with this script:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
doc = group books by publisher;
res = foreach doc generate group,books.title,count(books.publisher);
DUMP res;
For a second query, I would like to have a structure like this: (name, year), title
So I tried this one:
books = load 'data/book-seded-workings-reduced.json'
using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
flat =group books by (generate FLATTEN((authors.name),year);
tab = foreach flat generate group, books.title;
DUMP tab;
But it also doesn't work...
Any idea please?

What is the error you are getting on trying out the first query?
COUNT, being a built-in function, has to be in all caps. Also, you cannot invoke COUNT(group); group is an internal identifier generated by Pig.
I get the following result on running your first query:
(Academic Press,{(Inequalities: Theory of Majorization and Its Application.)},1)
(Addison-Wesley,{(TeX: The Program)},1)
(ACM Press and Addison-Wesley,{(Modern Database Systems: The Object Model, Interoperability, and Beyond.)},1)
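For reference, here is the first script with that one change applied (COUNT in caps):
books = load 'data/book-seded-workings-reduced.json'
    using JsonLoader('user_id:chararray,type:chararray,title:chararray,year:chararray,publisher:chararray,authors:{(name:chararray)},source:chararray');
doc = group books by publisher;
-- COUNT is a built-in UDF and must be written in upper case
res = foreach doc generate group, books.title, COUNT(books.publisher);
DUMP res;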
The expected (name, year), title structure can also be achieved this way:
flat = foreach books generate FLATTEN(authors.name) as authorName, year, title;
tab = group flat by (authorName, year);
finaltab = foreach tab generate group, flat.title;

The only problem I could see in your first script is count instead of COUNT (caps on).
If you use lowercase count, you will get the error:
Could not resolve count using imports:

Related

Couchbase returns null after saving and reading a document using N1QL

I insert a document into Couchbase using repository.save(),
and after that I run a query to find duplicates against another document.
The query is:
SELECT ARRAY_AGG(i.serialnumber) serialNumbers
FROM default tempItem
UNNEST items i
WHERE tempItem.class = "com.inventory.model.item.TempItem"
AND META(tempItem).id = '4390dd9e-e392-4432-939f-ebf046570086'
and i.serialnumber in (select raw serialnumber from default where class = 'com.inventory.model.item.Item'
AND status != 'DELETED' and serialnumber is not missing)
The result of the query is:
[
{
"serialNumbers": [
"9121945901",
"9121955901",
"9211965901"
]
}
]
The saved document looks like this:
[
{
"tempItem": {
"class": "com.inventory.model.item.TempItem",
"items": [
{
"categoryId": "67aaca7b-90b1-43e4-a6c6-0e9567bf283e",
"clientIds": [
"919d0ca7-c8d4-4283-8b0a-b6f2a7b39753"
],
"description": "bla bla",
"initial": 1,
"productId": "db5c81c4-0fec-407e-8703-6f5fb69a070c",
"serialnumber": "9121945901",
"simType": "PREPAID",
"status": "ACTIVE",
"stock": 1,
"title": "bla bla"
}
]
}
}
]
and the other document to check against is:
{
"categoryId": "67aaca7b-90b1-43e4-a6c6-0e9567bf283e",
"class": "com.inventory.model.item.Item",
"clientIds": [
"919d0ca7-c8d4-4283-8b0a-b6f2a7b39753"
],
"createdts": 1601801989176,
"creator": "919d0ca7-c8d4-4283-8b0a-b6f2a7b39753",
"description": "bla bla",
"initial": 1,
"prefix1": "912",
"prefix2": "194",
"productId": "db5c81c4-0fec-407e-8703-6f5fb69a070c",
"serialnumber": "9121945901",
"simType": "PREPAID",
"status": "ACTIVE",
"stock": 1,
"title": "bla bla"
}
In Spring Boot, when I run the query immediately after saving the document, it returns a null result,
but if I sleep for a few milliseconds between the save and the query, it returns the value.
What is the problem? Can anybody help with this issue?
This is probably because of ScanConsistency. Indexes in Couchbase are built asynchronously. So if you are using the default "NOT_BOUNDED" consistency and query the data with N1QL immediately after you write it, it may not be indexed yet.
I don't know how to change this in Spring, but the other options are:
REQUEST_PLUS - will likely take a bit longer to return the results, but the query engine will make sure the result is as up-to-date as possible.
consistentWith(MutationState), a.k.a. AT_PLUS - a more narrowly scoped scan consistency; depending on the index update rate, this might give a speedier response.
Again, not sure about Spring, but you don't have to set this globally. Each query can use a different scan consistency. So, if you value maximum performance over up-to-the-second accuracy, you can go with the default. If you value up-to-the-second accuracy over maximum performance, you can go with REQUEST_PLUS or AT_PLUS.
When using a CouchbaseRepository, it is possible to annotate each query method with @ScanConsistency(query = QueryScanConsistency.REQUEST_PLUS) to enforce the desired scan consistency per query. Have a look at the official documentation: Querying with consistency.
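A minimal sketch of how that can look with Spring Data Couchbase (the repository interface and query method here are made up for illustration):
import java.util.List;
import org.springframework.data.couchbase.repository.CouchbaseRepository;
import org.springframework.data.couchbase.repository.ScanConsistency;
import com.couchbase.client.java.query.QueryScanConsistency;

// Hypothetical repository for the TempItem entity from the question.
public interface TempItemRepository extends CouchbaseRepository<TempItem, String> {

    // REQUEST_PLUS makes the query wait for the index to catch up with
    // mutations made before the query was issued, so no sleep is needed.
    @ScanConsistency(query = QueryScanConsistency.REQUEST_PLUS)
    List<TempItem> findByStatus(String status);
}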

Use Kafka Connect to update Elasticsearch field on existing document instead of creating new

I have a Kafka setup running with the Elasticsearch connector, and I am successfully indexing new documents into an ES index based on the incoming messages on a particular topic.
However, based on incoming messages on another topic, I need to append data to a field on a specific document in the same index.
Pseudo-schema below:
{
"_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"title": "A title",
"body": "A body",
"created_at": 164584548,
"views": []
}
^ This document is being created fine in ES based on the data in the topic mentioned above.
However, how do I then add items to the views field using messages from another topic? Like so:
article-view topic schema:
{
"article_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"user_id": 123456,
"timestamp: 136389734
}
and instead of simply creating a new document in an article-view index (which I don't even want to have), it should append this to the views field on the article document whose _id equals the article_id from the message.
so the end result after one message would be:
{
"_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"uuid": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
"title": "A title",
"body": "A body",
"created_at": 164584548,
"views": [
{
"user_id": 123456,
"timestamp: 136389734
}
]
}
Using the ES API, it is possible with a script, like so:
{
"script": {
"lang": "painless",
"params": {
"newItems": [{
"timestamp": 136389734,
"user_id": 123456
}]
},
"source": "ctx._source.views.addAll(params.newItems)"
}
}
I can generate scripts like above dynamically in bulk, and then use the helpers.bulk function in the ES Python library to bulk update documents this way.
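For reference, a sketch of that bulk approach with the Python client (the index name and client setup are assumptions, not from the question):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# One update action per view event; _op_type "update" runs the script
# against the existing document instead of indexing a new one.
actions = [{
    "_op_type": "update",
    "_index": "articles",  # hypothetical index name
    "_id": "6993e0a6-271b-45ef-8cf5-1c0d0f683acc",
    "script": {
        "lang": "painless",
        "params": {"newItems": [{"timestamp": 136389734, "user_id": 123456}]},
        "source": "ctx._source.views.addAll(params.newItems)"
    }
}]
helpers.bulk(es, actions)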
Is this possible with Kafka Connect / Elasticsearch? I haven't found any documentation on Confluent's website to explain how to do this.
It seems like a fairly standard requirement and an obvious thing people would need to do with Kafka and a sink connector like ES.
Thanks!
Edit: Partial updates are possible with write.method=upsert (src)
The Elasticsearch connector doesn't support this. You can update documents in place, but you need to send the full document, not a delta for appending, which I think is what you're after.
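For the upsert route mentioned in the edit, a sketch of the sink connector configuration (connector name, topic, and URL are placeholders):
{
  "name": "es-sink-articles",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "articles",
    "connection.url": "http://localhost:9200",
    "key.ignore": "false",
    "write.method": "upsert"
  }
}
With key.ignore=false the record key becomes the document _id, and upsert merges the fields you send into the existing document; per the answer above, it still replaces fields rather than appending to the views array.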

Is there any way I can apply group and pagination using createQuery?

I have a query like this:
http://localhost:3030/dflowzdata?$skip=0&$group=uuid&$limit=2
and the dflowzdata service contains data like:
[
{
"uuid": 123456,
"id": 1
},
{
"uuid": 123456,
"id": 2
},
{
"uuid": 7890,
"id": 3
},
{
"uuid": 123456,
"id": 4
},
{
"uuid": 4567,
"id": 5
}
]
My before hook for find looks like this:
if (hook.params.query.$group !== undefined) {
let value = hook.params.query.$group
delete hook.params.query.$group
const query = hook.service.createQuery(hook.params.query);
hook.params.rethinkdb = query.group(value)
}
It gives the correct result but without pagination: I need only two records, but it gives me all of them.
The result is:
{"total":[{"group":"123456","reduction":3},{"group":"7890","reduction":1},{"group":"4567","reduction":3}],"data":[{"group":"123456","reduction":[{"uuid":"123456","id":1},{"uuid":"123456","id":2},{"uuid":"123456","id":4}]},{"group":"7890","reduction":[{"uuid":"7890","id":3}]},{"group":"4567","reduction":[{"uuid":"4567","id":5}]}],"limit":2,"skip":0}
Can anyone tell me how I should get the correct records using $limit?
According to the documentation on data types, ReQL commands called on GROUPED_DATA operate on each group individually. For more details, read the group documentation. So limit won't apply to the result of group.
The page for group says: to operate on all the groups rather than operating on each group [...], you can use ungroup to turn a grouped stream or grouped data into an array of objects representing the groups.
Hence, use ungroup to apply functions to group's result:
r.db('db').table('table')
.group('uuid')
.ungroup()
.limit(2)
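Applied to the hook from the question, that might look like this (a sketch, with the limit hard-coded to 2 as in the example above):
if (hook.params.query.$group !== undefined) {
  let value = hook.params.query.$group
  delete hook.params.query.$group
  const query = hook.service.createQuery(hook.params.query)
  // ungroup() turns the grouped data back into a sequence,
  // so limit() now applies to the groups themselves
  hook.params.rethinkdb = query.group(value).ungroup().limit(2)
}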

RethinkDB simple filter not working?

I'm learning ReQL in the web UI "Data Explorer" and have created the following "cars" table with 2 documents, in the provided "test" database:
[{
"brand": "Nissan" ,
"id": 1 ,
"model": "Murano" ,
"year": 2009
} ,
{
"brand": "Nissan" ,
"id": 2 ,
"model": "Qashqai" ,
"year": 2014
}
]
While the following query returns both documents correctly:
r.table("cars")
...the following should return only the second document, so why does it instead return an empty array?
r.table("cars").filter(
r.row["year"] > 2010
)
I got this filter query straight out of the official samples at http://www.rethinkdb.com/docs/sql-to-reql/
The examples in the SQL to ReQL cheat sheet are in Python.
However, the Data Explorer uses JavaScript. In JavaScript, .gt() must be used instead of >, and () instead of []:
r.table("cars").filter(r.row("year").gt(2010))

Loading a JSON file with a SerDe in Cloudera

I am trying to work with a JSON file with this bag structure:
{
"user_id": "kim95",
"type": "Book",
"title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.",
"year": "1995",
"publisher": "ACM Press and Addison-Wesley",
"authors": [
{
"name": "null"
}
],
"source": "DBLP"
}
{
"user_id": "marshallo79",
"type": "Book",
"title": "Inequalities: Theory of Majorization and Its Application.",
"year": "1979",
"publisher": "Academic Press",
"authors": [
{
"name": "Albert W. Marshall"
},
{
"name": "Ingram Olkin"
}
],
"source": "DBLP"
}
I tried to use a SerDe to load the JSON data into Hive. I followed both approaches that I saw here: http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/
With this code :
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
user_id:string,
type:string,
title:string,
year:string,
publisher:string,
authors:array<struct<name:string>>,
source:string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/user/hdfs/data/book-seded_workings-reduced.json';
I got this error:
error while compiling statement: failed: parseexception line 2:17 cannot recognize input near ':' 'string' ',' in column type
I also tried this version: https://github.com/rcongiu/Hive-JSON-Serde
which gave a different error :
Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde: org.openx.data.jsonserde.JsonSerde
Any idea?
I also want to know what the alternatives are for working with JSON like this, to run queries on the 'name' field inside 'authors': Pig or Hive?
I have already converted it into a "tsv" file. But since my authors column is a tuple, I don't know how to query 'name' with Hive if I build a table from this file. Should I change my script for the "tsv" conversion or keep it? Or are there alternatives with Hive or Pig?
Hive does not have built-in support for JSON, so to use JSON with Hive we need a third-party JAR such as:
https://github.com/rcongiu/Hive-JSON-Serde
You have a couple of issues with the CREATE TABLE statement. It should look like this:
CREATE EXTERNAL TABLE IF NOT EXISTS serd (
user_id string,
type string,
title string,
year string,
publisher string,
authors array<struct<name:string>>,
source string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION...
The JSON records you are using should keep each record on a single line, like this:
{"user_id": "kim95", "type": "Book", "title": "Modern Database Systems: The Object Model, Interoperability, and Beyond.", "year": "1995", "publisher": "ACM Press and Addison-Wesley", "authors": [{"name":"null"}], "source": "DBLP"}
{"user_id": "marshallo79", "type": "Book", "title": "Inequalities: Theory of Majorization and Its Application.", "year": "1979", "publisher": "Academic Press","authors": [{"name":"Albert W. Marshall"},{"name":"Ingram Olkin"}], "source": "DBLP"}
After downloading the project from Git, you need to compile it, which will create a JAR. You need to add this JAR to the Hive session before running the CREATE TABLE statement.
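For example, a session might look like this (the JAR path is illustrative); the LATERAL VIEW at the end shows one way to reach the nested name field asked about above:
ADD JAR /path/to/json-serde-1.3.6-jar-with-dependencies.jar;

-- after creating the table, explode the authors array to query the names
SELECT a.name
FROM serd
LATERAL VIEW explode(authors) t AS a;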
Hope it helps...!!!
ADD JAR only adds the JAR to the current session, so it won't be available later, and eventually you get an error.
Get the JAR loaded on all the nodes at the Hive and MapReduce paths, like the locations below, so that the Hive and MapReduce components pick it up whenever they are called:
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hive/lib/json-serde-1.3.6-jar-with-dependencies.jar
/hadoop/CDH_5.2.0_Linux_parcel/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/lib/hadoop-mapreduce/lib/json-serde-1.3.6-jar-with-dependencies.jar
Note: this path varies per cluster.
