cassandra query on map in select clause - cassandra-2.0

i am new to cassandra and i am trying to read a row from database which contains values
siteId | country | someMap
1 | US | {a:b, x:z}
2 | PR | {a:b, x:z}
I have also created an index on table using create index on columnfamily(keys(someMap));
but still when i query as select * from table where siteId=1 and someMap contains key 'a'
it returns an entiremap as
1 | US | {a:b, x:z}
Can somebody help me on what should i do to get the value as
1 | US | {a:b}

You can not: even if internally each entry of a Map|List|Set is stored as a column you can only retrieve the whole collection but not part of it. You are not asking cassandra give me the entry of the map containing X, but the row whom map contains X.
HTH,
Carlo

Related

Query a large list of partition ids in dynamo db where partition key is unique

I am new to dynamo db.
The table looks like below
| id |rangekey |timestamp |dimensions
| -----| --------|-------------|----------
| of1 | ACTIVE |1631460979529|{"type":"test","content":"abc"}
| of2 | ACTIVE |1631499979529|{"type":"test","content":"bxh"}
| of3 | ACTIVE |1631499979520|{"type":"practice","content":"xyz"}
| of4 | ACTIVE |1631499979528|{"type":"lecture","content":"lll"}
| of5 | ACTIVE |1631460979927|{"type":"practice","content":"olp","component":"one"}
| .. |.. |... |...
so on.
The id is the partition key and range key is the sort key.The id values are unique.
It seems like poorly designed table when it comes to querying all the id for which the dimensions contains (or begins with)
"type":"test" or "type":"practice"
I am aware of the below approaches:
Scan the table with filter expression like below
contains(dimensions,'"type":"test"') or contains(dimensions,'"type":"practice"')
Query the partition id one
by one with filter expression as above.This seems like a problem
because i have a large list of id (partition keys) approx up-to
5000 .But this could be run in parallel to reduce time
Or can i create a dynamo db stream sort of a materialized view which has the
view containing all id whose dimension is of type test or
practice.Need more insight on this one.
Does any of the above approach seems good cost wise or efficiency.Are there better ways of doing this.Thanks in advance !

How to groupBy on one column in laravel?

I have a section table and class Table
class table is designed in this way
(id,class_name,section_id)
one class has many sections like
--------------------------------------------
| SN | ClassName | Section_id |
--------------------------------------------
| 1 | ClassOne | 1 |
| 2 | ClassOne | 2 |
| 3 | ClassOne | 3 |
| 4 | ClassOne | 4 |
--------------------------------------------
Now i want to groupBy Only ClassName and display all the sections of that class
$data['classes'] = SectionClass::groupBy('class_name')->paginate(10);
i have groupby like this but it only gives me one section id
Try this way...
$things = SectionClass::paginate(10);
$data['classes']= $things->groupBy('class_name');
You are getting just one row because that is what GROUP BY does, groups a set of rows into a set of summary rows and returns one row for each group. In standard SQL, a query that includes a GROUP BY clause cannot refer to nonaggregated columns in the select list that are not named in the GROUP BY clause. For example, in SQL Server if you try the next clause
SELECT * FROM [Class] GROUP BY [ClassName]
You'll get the next error
"Column 'SN' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause"
Think about it, you are grouping by ClassName, and following your sample data, this will return just one row. Your SELECT clause includes column ClassName, which is easy to get because is the same in every single row, but when you are selecting another, which one should be return if only one has to be selected?
Now, things change a little bit in MySQL. MySQL extends the standard SQL use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. This means that the preceding query is legal in MySQL. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are nondeterministic. You can find a complete explanation about this topic here https://dev.mysql.com/doc/refman/5.6/en/group-by-handling.html
If you are expecting a result in one row, you can use GROUP_CONCAT() function to get something like
--------------------------------
| ClassName | Sections |
--------------------------------
| ClassOne | 1,2,3,4 |
--------------------------------
Your query must be something like:
select `ClassName`, group_concat(Section_id) from `class` group by `ClassName`
You can get this with a raw query in laravel or its up to you to find a way to get the same result using query builder ;)

Do UDF (which another spark job is needed) to each element of array column in SparkSQL

The structure of a hive table (tbl_a) is as follows:
name | ids
A | [1,7,13,25168,992]
B | [223, 594, 3322, 192928]
C | null
...
Another hive table (tbl_b) have the corresponding mapping between id to new_id. This table is big so cannot be loaded into memory
id | new_id
1 | 'aiks'
2 | 'ficnw'
...
I intend to create a new hive table to have the same structure as tbl_a, but convert the array of id to the array of new_id:
name | ids
A | ['aiks','fsijo','fsdix','sssxs','wie']
B | ['cx', 'dds', 'dfsexx', 'zz']
C | null
...
Could anyone give me some idea on how to implement this scenario in spark sql or in spark DataFrame? Thanks!
This is an expensive operation but you can make it using a coalesce, explode and a left outer join as followed :
tbl_a
.withColumn("ids", coalesce($"ids", array(lit(null).cast("int"))))
.select($"name", explode($"ids").alias("id"))
.join(tbl_b, Seq("id"), "leftouter")
.groupBy("name").agg(collect_list($"new_id").alias("ids"))
.show

Hive: SemanticException [Error 10002]: Line 3:21 Invalid column reference 'name'

I am using the following hive query script for the version 0.13.0
DROP TABLE IF EXISTS movies.movierating;
DROP TABLE IF EXISTS movies.list;
DROP TABLE IF EXISTS movies.rating;
DROP DATABASE IF EXISTS movies;
ADD JAR /usr/local/hadoop/hive/hive/lib/RegexLoader.jar;
CREATE DATABASE IF NOT EXISTS movies;
CREATE EXTERNAL TABLE IF NOT EXISTS movies.list (id STRING, name STRING, genre STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s");
CREATE EXTERNAL TABLE IF NOT EXISTS movies.rating (id STRING, userid STRING, rating STRING, timestamp STRING)
ROW FORMAT SERDE 'com.cisco.hadoop.loaders.RegexSerDe'
with SERDEPROPERTIES(
"input.regex"="^(.*)\\:\\:(.*)\\:\\:(.*)\\:\\:(.*)$",
"output.format.string"="%1$s %2$s %3$s %4$s");
LOAD DATA LOCAL INPATH 'ml-10M100K/movies.dat' into TABLE movies.list;
LOAD DATA LOCAL INPATH 'ml-10M100K/ratings.dat' into TABLE movies.rating;
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating STRING);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, rating.rating from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id;
The issue is when I execute the script without the "GROUP BY" clause it works fine.
But when I execute it with the "GROUP BY" clause, I get the following error
FAILED: SemanticException [Error 10002]: Line 4:21 Invalid column reference 'name'
Any ideas what is happening here?
Appreciate your help
Thanks!
If you group by a column, your select statement can only select a) that column, b) columns derived only from that column, or c) a UDAF applied to other columns.
In this case, you're only grouping by list.id, so when you try to select list.name, that's invalid. Think about it this way: what if your list table contained the following two entries:
id|name |genre
--+-----+------
01|name1|comedy
01|name2|horror
What would you expect this query to return:
select list.id, list.name, list.genre from list group by list.id;
In this case it's nonsensical. I'm guessing that id in reality is a primary key, but note that hive does not know this, so the above data set is perfectly valid.
With all that in mind, it's not clear to me how to fix it because I don't know the desired output. For example, let's say without the group by (just the join), you have as output:
id|name |genre |rating
--+-----+------+-------
01|name1|comedy|'pretty good'
01|name1|comedy|'bad'
02|name2|horror|'9/10'
03|name3|action|NULL
What would you want the output to be with the group by? What are you trying to accomplish by doing the group by?
OK let me see if I can ask this in a better way.
Here are my two tables
Movies list table - Consists of movies information
ID | Movie Name | Genre
1 | Movie 1 | comedy
2 | movie 2 | action
3 | movie 3 | thriller
And I have ratings table
MOVIE_ID | USER ID | RATING on 5 | TIMESTAMP
1 | xyz | 5 | 12345612
1 | abc | 4 | 23232312
2 | zvc | 1 | 12321123
2 | zyx | 2 | 12312312
What I would like to do is get the output in the following way:
Movie ID | Movie Name | Genre | Rating Average
1 | Movie 1 | comedy | 4.5
2 | Movie 2 | action | 1.5
I am not a db expert but I understand this, when you group the data together you need to convert the multiple values to the scalar values or all the values, if string should be same right?
For example in my previous case, I was grouping them together as a string. So which is okay for list.id, list.name and list.genre, but the list.rating, well that is always going to give some problem here (I just learnt PIG along with hive, so grouping works differently there)
So to tackle the problem, I casted the rating and averaged it out and stored it in the float table. Have a look at my code below:
CREATE TABLE movies.movierating(id STRING, name STRING, genre STRING, rating FLOAT);
INSERT OVERWRITE TABLE movies.movierating
SELECT list.id, list.name, list.genre, AVG(cast(rating.rating as FLOAT)) from movies.list list LEFT JOIN movies.rating rating ON (list.id=rating.id) GROUP BY list.id, list.name,list.genre order by list.id DESC;
Thank you for your explanation. I might save the following question for the next thread but here is my observation:
The performance of the Overall job is reduced when performing Grouping and Joining together than to do it in two separate queries. For the same job, I had changed the code a bit to perform the grouping first and then joining the data and the over all time was reduced by 40 seconds. Earlier it was taking 140 seconds and now it is taking 100 seconds. Any reasons to that?
Once again thank you for your explanation.
I came across same issue:
org.apache.hadoop.hive.ql.parse.SemanticException: Invalid column reference "charge_province"
After I put the "charge_province" in the group by, the issue is gone. I don't know why.

Will this type of pagination scale?

I need to paginate on a set of models that can/will become large. The results have to be sorted so that the latest entries are the ones that appear on the first page (and then, we can go all the way to the start using 'next' links).
The query to retrieve the first page is the following, 4 is the number of entries I need per page:
SELECT "relationships".* FROM "relationships" WHERE ("relationships".followed_id = 1) ORDER BY created_at DESC LIMIT 4 OFFSET 0;
Since this needs to be sorted and since the number of entries is likely to become large, am I going to run into serious performance issues?
What are my options to make it faster?
My understanding is that an index on 'followed_id' will simply help the where clause. My concern is on the 'order by'
Create an index that contains these two fields in this order (followed_id, created_at)
Now, how large is the large we are talking about here? If it will be of the order of millions.. How about something like the one that follows..
Create an index on keys followed_id, created_at, id (This might change depending upon the fields in select, where and order by clause. I have tailor-made this to your question)
SELECT relationships.*
FROM relationships
JOIN (SELECT id
FROM relationships
WHERE followed_id = 1
ORDER BY created_at
LIMIT 10 OFFSET 10) itable
ON relationships.id = itable.id
ORDER BY relationships.created_at
An explain would yield this:
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables |
| 2 | DERIVED | relationships | ref | sample_rel2 | sample_rel2 | 5 | | 1 | Using where; Using index |
+----+-------------+---------------+------+---------------+-------------+---------+------+------+-----------------------------------------------------+
If you examine carefully, the sub-query containing the order, limit and offset clauses will operate on the index directly instead of the table and finally join with the table to fetch the 10 records.
It makes a difference when at one point your query makes a call like limit 10 offset 10000. It will retrieve all the 10000 records from the table and fetch the first 10. This trick should restrict the traversal to just the index.
An important note: I tested this in MySQL. Other database might have subtle differences in behavior, but the concept holds good no matter what.
you can index these fields. but it depends:
you can assume (mostly) that the created_at is already ordered. So that might by unnecessary. But that more depends on you app.
anyway you should index followed_id (unless its the primary key)

Resources