records that belong to multiple categories - dc.js

I have a list of persons and the languages they speak:
name; language
John; english,italian
Jane; french, spanish, english
...
I want to list them (table) and have a barChart of the languages... and I'm stuck
To be able to draw the language, I would pre-process the data to change the format to
name; language
John; english
John; italian
Jane; french
Jane; spanish
Jane; english
and use the language as the dimension for the barChart. However, I then have duplicates in the table, where John and Jane should each appear only once.
How can I handle that?

This is a rather short answer, but use the new array dimension feature in version 1.4 ( https://github.com/crossfilter/crossfilter/wiki/API-Reference#dimension_with_arrays ). That allows you to count each record in more than one group, without having to reprocess your data or do any awkward aggregation to deal with the duplicates.
(This is in the community fork at https://github.com/crossfilter/crossfilter/ )
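To illustrate the idea behind an array dimension without relying on the library itself, here is a small Python sketch. It is not crossfilter's API; the names `people` and `count_by_language` are made up for illustration. Each person stays a single record for the table, while one record can contribute to several groups for the bar chart.

```python
from collections import Counter

# Each person is one record; "languages" holds an array of values.
people = [
    {"name": "John", "languages": ["english", "italian"]},
    {"name": "Jane", "languages": ["french", "spanish", "english"]},
]

def count_by_language(records):
    """Count each record once for every language it lists."""
    counts = Counter()
    for record in records:
        for language in record["languages"]:
            counts[language] += 1
    return counts

language_counts = count_by_language(people)       # data for the bar chart
table_rows = [person["name"] for person in people]  # data for the table, no duplicates
```

The point is that the grouping step fans a record out over its languages, so the source data never needs one row per (person, language) pair.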


Advanced Filter on PowerQuery

Question
I am trying to filter some excel data I have on PowerQuery but am really struggling to figure out the best way to do this. On Excel I would usually do this using an Advanced Filter but I don't believe Power Query has the same functionality.
I would like to filter across a number of columns, using an OR condition for:
a range of values in some columns; and
a range of 'wild card' searches in other columns
Example
As an example if my dataset is as below
| Name | Function | Sub-Function | Position | Country |
|---|---|---|---|---|
| Andy | Sales | Omni-Channel | Sales Manager | Brazil |
| Bob | Marketing | eCommerce | Web Design | Argentina |
| Rakesh | HR | Business Partnering | HRBP | Italy |
| Tom | Finance | Reporting | Finance Manager | UK |
| Chris | Sales | Trade Marketing | Sales support | US |
| Raj | Legal | Legal | Para-legal | Brazil |
I might want to filter for anyone meeting any of the following criteria:
working in the sales function OR
working in the Trade Marketing or Omni-Channel sub-function OR
has the words 'Manager' or 'Sales' in their Position title OR
is based in Brazil
The desired output would then be
| Name | Function | Sub-Function | Position | Country |
|---|---|---|---|---|
| Andy | Sales | Omni-Channel | Sales Manager | Brazil |
| Tom | Finance | Reporting | Finance Manager | UK |
| Chris | Sales | Trade Marketing | Sales support | US |
| Raj | Legal | Legal | Para-legal | Brazil |
My current approach
My approach was to create a table holding all my criteria (not including the 'wild card' searches), load it into Power Query, and create multiple lists from this table (called, for example, Filter1, Filter2, etc.). I then use the following formula against my main dataset:
= Table.SelectRows(#"Filtered Rows1", each (List.Contains(Filter1,[Function]) = true) or (List.Contains(Filter2,[Sub-Function]) = true) or (List.Contains(Filter3,[Country]) = true) or Text.Contains([Position], "Manager") or Text.Contains([Position], "Sales"))
Issues
The formula above works for really small data sets however not on my 80,000 line data set, or rather it does not work within a reasonable time frame
I have a long list of 'wild card' searches which apply to three columns so typing in the Text.Contain(...... etc.. formula multiple times into the formula seems very inefficient, is prone to mistakes and is not dynamic in any way.
I'm sure there must be a better way to do this but I have not found many helpful discussions or tutorials on this online so am reaching out to the community.
Thank you
I think part of the reason the query is slow is that, with your current approach, the filter conditions are evaluated separately before the rows are filtered. Power BI can handle a filter on multiple columns in a single step, like the following:
Table.SelectRows(
    #"Changed Type1",
    each ([Function] = "Sales")
        or ([#"Sub-Function"] = "Trade Marketing" or [#"Sub-Function"] = "Omni-Channel")
        or (
            Text.Contains([Position], "Manager")
            or Text.Contains([Position], "Sales")
            or ([Country] = "Brazil")
        )
)
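The question also asks for criteria that are not typed out by hand. As a rough illustration of that logic only (this is Python, not Power Query M, and `exact_criteria`, `contains_criteria` and `matches` are hypothetical names), the OR filter can be driven by lists of exact values and of substrings:

```python
# Sample rows taken from the question's table.
rows = [
    {"Name": "Andy", "Function": "Sales", "Sub-Function": "Omni-Channel", "Position": "Sales Manager", "Country": "Brazil"},
    {"Name": "Bob", "Function": "Marketing", "Sub-Function": "eCommerce", "Position": "Web Design", "Country": "Argentina"},
    {"Name": "Rakesh", "Function": "HR", "Sub-Function": "Business Partnering", "Position": "HRBP", "Country": "Italy"},
    {"Name": "Tom", "Function": "Finance", "Sub-Function": "Reporting", "Position": "Finance Manager", "Country": "UK"},
    {"Name": "Chris", "Function": "Sales", "Sub-Function": "Trade Marketing", "Position": "Sales support", "Country": "US"},
    {"Name": "Raj", "Function": "Legal", "Sub-Function": "Legal", "Position": "Para-legal", "Country": "Brazil"},
]

# Column -> set of exact values that qualify a row.
exact_criteria = {
    "Function": {"Sales"},
    "Sub-Function": {"Trade Marketing", "Omni-Channel"},
    "Country": {"Brazil"},
}

# Column -> list of substrings ("wild card" searches, case-sensitive here).
contains_criteria = {"Position": ["Manager", "Sales"]}

def matches(row):
    """Return True if the row satisfies at least one criterion (OR logic)."""
    if any(row[column] in allowed for column, allowed in exact_criteria.items()):
        return True
    return any(
        needle in row[column]
        for column, needles in contains_criteria.items()
        for needle in needles
    )

selected = [row["Name"] for row in rows if matches(row)]
```

Keeping the criteria in data structures means that adding a new value or search term does not require editing the filter expression itself.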

Design for columns and column families in Hbase

Hi, I am new to HBase and want to ask about columns and column families.
This is for an assignment, and I am stuck on the design. I have to store month names in HBase in different formats and in different languages.
Every record should have:
lang_id,
format,
language,
translation.
Now lang_id for:
January=1,
February=2
......
format can be:
full(means January)
3figure(means Jan)
language can be:
eng
arabic
urdu
etc...
Now translation will have further columns like:
id
content
timestamp
id means id of translation
content is the actual data
e.g for lang_id =1 format=full language =english
the content should store January
e.g for lang_id =1 format=3figure language =english
the content should store Jan
Now I am stuck on the design: which columns and which column families should I create?
lang_id,format, language, translation
But translation will again have some more columns... id,content,timestamp
Any help with an example would be much appreciated.
I think storing just 12 rows (one per month), or even several years of year/month combinations (or x times the number of languages), in HBase is a mistake. HBase is designed to handle millions or billions of rows, and you are contemplating it as a store for a small reference table.
The overhead of handling this data in HBase is simply too large. I would just use an RDBMS.

How to quickly search book titles?

I have a database of about 200k books. I wish to give my users a way to quickly search a book by the title. Now, some titles might have prefix like A, THE, etc. and also can have numbers in the title, so search for 12 should match books with "12", "twelve" and "dozen" in the title. This will work via AJAX, so I need to make sure database query is really fast.
I assume that most of the users will try to search using some words of the title, so I'm thinking to split all the titles into words and create a separate database table which would map words to titles. However, I fear this might not give the best results. For example, the book title could be some 2 or 3 commonly used words, and I might get a list of books with longer titles that contain all 2-3 words and the one I'm looking for lost like a needle in a haystack. Also, searching for a book with many words in the title might slow down the query because of a lot of OR clauses.
Basically, I'm looking for a way to:
find the results quickly
sort them by relevance.
I assume this is not the first time someone needs something like this, and I'd hate to reinvent the wheel.
P.S. I'm currently using MySQL, but I could switch to anything else if needed.
I think using SOUNDEX is the best way.
SELECT
    id,
    title
FROM books AS b
WHERE b.title SOUNDS LIKE 'Shaw'
-- This will match 'Saw' etc.
For the best database performance, calculate the SOUNDEX value of each title once and store it in a new column. You can calculate the value with SOUNDEX('Hello').
Example usage:
UPDATE `books` SET `soundex_title` = SOUNDEX(title);
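To show what precomputing a phonetic key buys you, here is a minimal Python sketch of the classic four-character Soundex together with a lookup table keyed by it. This is an illustration under assumptions: MySQL's own SOUNDEX() can return longer codes than this variant, the sample titles are made up, and `soundex_index` is a hypothetical stand-in for the extra column.

```python
from collections import defaultdict

# Letter -> digit table of the classic Soundex algorithm.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Classic Soundex: first letter plus three digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code
    return (result + "000")[:4]

# Precompute the key once per title, like the soundex_title column.
titles = ["Saw", "Shaw", "Rupert"]
soundex_index = defaultdict(list)
for title in titles:
    soundex_index[soundex(title)].append(title)

matches = soundex_index[soundex("Shaw")]
```

Note that a four-character code is mostly determined by the first word, so for multi-word titles it may be worth computing a key per word instead of per title.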
You might want to have a look at Apache Lucene. It is a high-performance, Java-based information retrieval system.
You would create an IndexWriter and index all your titles, and you can add parameters (have a look at the class) linking to the actual book.
When searching, you would need an IndexReader and an IndexSearcher, and use the search() operation on them.
Have a look at the sample in src/demo and at http://lucene.apache.org/java/2_4_0/demo2.html
Using information retrieval techniques makes indexing take longer, but a search will not need to go through most of the titles, so overall you can expect better search performance.
Also, choosing a good Analyzer lets you ignore words such as "the" and "a".
One solution that would easily accommodate your volume of data and speed requirement is to use the Redis key-value store.
The way I see it, you can go ahead with your solution of mapping titles to keywords and store them in the form:
keyword : set of book titles
Redis already has a built-in set data type that you can use.
Next, to get the titles of the books that contain the search keywords, you can use the SINTER command, which will perform the set intersection for you.
Everything is done in memory; therefore the response time is very fast.
Also, if you want to save your index, Redis has a number of different persistence/caching mechanisms.
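The keyword-to-titles mapping and the intersection step can be sketched with plain Python sets. This is only a stand-in for Redis sets and SINTER, not Redis client code; the sample titles, the stop-word list and the helper names are assumptions for illustration.

```python
from collections import defaultdict

# Made-up sample titles.
titles = ["A Dime in a Dozen", "The Dirty Dozen", "Cheaper by the Dozen"]

# Assumed stop words, so that prefixes like "A" and "The" are ignored.
STOP_WORDS = {"a", "an", "the"}

def keywords(text):
    """Lowercase the text and drop stop words."""
    return {word for word in text.lower().split() if word not in STOP_WORDS}

# keyword -> set of book titles (one Redis set per keyword in the answer).
index = defaultdict(set)
for title in titles:
    for word in keywords(title):
        index[word].add(title)

def search(query):
    """Return the titles containing every keyword of the query."""
    words = keywords(query)
    if not words:
        return set()
    # Equivalent in spirit to SINTER over the per-keyword sets.
    return set.intersection(*(index.get(word, set()) for word in words))
```

For example, `search("dirty dozen")` intersects the sets stored under "dirty" and "dozen", so only titles containing both words come back.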
Apache Lucene with Solr is definitely a very good option for your problem.
You can link Solr/Lucene to index your MySQL database directly. Here is a simple tutorial on how to link your MySQL database with Lucene/Solr: http://www.cabotsolutions.com/2009/05/using-solr-lucene-for-full-text-search-with-mysql-db/
Here are the advantages and pains of using Lucene-Solr instead of MySQL full text search: http://jayant7k.blogspot.com/2006/05/mysql-fulltext-search-versus-lucene.html
Keep it simple. Create an index on the title field and use wildcard pattern matching. You can not possibly make it any faster as your bottleneck is not the string matching but the number of strings you want to match against the title.
I also just came up with a different idea. You say that some words can be interpreted differently, like 12, twelve, and dozen. Instead of creating a query with different interpretations, why not store the different interpretations of the titles in a separate table with a one-to-many relationship to the books? You can then GROUP BY book_id to get unique book titles.
Say the book "A dime in a dozen". In books table it will be:
book_id=356
book_title='A dime in a dozen'
In titles table will be stored:
titles_id=123
titles_book_id=356
titles_title='A dime in a dozen'
--
titles_id=124
titles_book_id=356
titles_title='A dime in a 12'
--
titles_id=125
titles_book_id=356
titles_title='A dime in a twelve'
The query for this:
SELECT b.book_id, b.book_title
FROM books b JOIN titles t ON b.book_id=t.titles_book_id
WHERE t.titles_title LIKE '%twelve%'
GROUP BY b.book_id
Now, insertion becomes a much bigger task, but the variants can be created outside the database and inserted in one go.
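The variants-table idea can be tried end to end with a small script. The sketch below uses Python's built-in sqlite3 module with an in-memory database instead of MySQL, so it is an approximation of the setup above; `find_books` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (book_id INTEGER PRIMARY KEY, book_title TEXT);
CREATE TABLE titles (
    titles_id INTEGER PRIMARY KEY,
    titles_book_id INTEGER REFERENCES books(book_id),
    titles_title TEXT
);
""")

# One book, three title variants generated outside the database.
conn.execute("INSERT INTO books VALUES (356, 'A dime in a dozen')")
variants = ["A dime in a dozen", "A dime in a 12", "A dime in a twelve"]
conn.executemany(
    "INSERT INTO titles (titles_book_id, titles_title) VALUES (356, ?)",
    [(variant,) for variant in variants],
)

def find_books(term):
    """Return each matching book once, whichever variant matched."""
    sql = """
        SELECT b.book_id, b.book_title
        FROM books b JOIN titles t ON b.book_id = t.titles_book_id
        WHERE t.titles_title LIKE ?
        GROUP BY b.book_id, b.book_title
    """
    return conn.execute(sql, (f"%{term}%",)).fetchall()
```

Searching for "twelve", "12" or "dozen" returns the same single row, because the GROUP BY collapses the matching variants back into their book.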

Sort on last name, first name, or both?

I have a dilemma that I've encountered before. What's the best in terms of usability when one displays personal names in a table? Should there be a single column for the name? If so, is "firstname lastname" or "lastname, firstname" preferable? Or would a column for "firstname" and a column for "lastname" be best? I'm thinking in terms of the user's desire to sort the columns. I like having a column for each name component because I can imagine that in some cases the first name will be more important to the user whereas in other cases the last name would be more important.
I would assume that many out there have had this dilemma and am looking for pearls of wisdom based on past experience.
Definitely have a column for each part. That gives you much more flexibility. So you could sort by surname, but print "firstname surname", for example.
If you don't have the screen real estate to have a column for each part, you can combine them into a single string whose format represents the sorting order. Each click on the column header cycles to the next sort order. For example:
Default: sort by last, first (ASC)
Bimbleman, Wally P.
Zonkenstein, Arnold Q.
1st click: sort by last, first (DESC)
Zonkenstein, Arnold Q.
Bimbleman, Wally P.
2nd click: sort by first, middle, last (ASC)
Arnold Q. Zonkenstein
Wally P. Bimbleman
3rd click: sort by first, middle, last (DESC)
Wally P. Bimbleman
Arnold Q. Zonkenstein
etc...
Easier to read an entire name this way (vs. having it span across columns), takes up less screen real estate, and frees you from having to decide upon a single format & sort.
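The click cycle described above can be sketched as a small state table. This is a Python illustration only; the tuple layout, `SORT_STATES` and `render` are hypothetical names, and sorting is done on the displayed string so that the format and the sort order always agree.

```python
# Hypothetical data: (first, middle, last) tuples.
people = [("Wally", "P.", "Bimbleman"), ("Arnold", "Q.", "Zonkenstein")]

def last_first(person):
    first, middle, last = person
    return f"{last}, {first} {middle}"

def first_last(person):
    first, middle, last = person
    return f"{first} {middle} {last}"

# Each header click moves to the next (format, descending) state.
SORT_STATES = [
    (last_first, False),  # default: last, first (ASC)
    (last_first, True),   # 1st click: last, first (DESC)
    (first_last, False),  # 2nd click: first, middle, last (ASC)
    (first_last, True),   # 3rd click: first, middle, last (DESC)
]

def render(clicks):
    """Return the column contents after the given number of header clicks."""
    formatter, descending = SORT_STATES[clicks % len(SORT_STATES)]
    return sorted((formatter(p) for p in people), reverse=descending)
```

A fourth click wraps around to the default state again.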
As far as I know, each country has its own conventions for sorting names: some sort by first name and some by last name. I believe the right answer depends on your app. What is it about? How many users will appear in those columns? And which users (age, nationality, context) are going to use it?
Really, I agree with Skilldrick - a good UI has at least separate columns for first and last names...
But don't forget that CONSISTENCY in a UI is actually more important and makes things usable: giving the end user an implied expectation of how things are done.
You might consider calling the fields "Given Name" and "Family Name" to account for people who put their family name first. Of course this doesn't cover everyone (some people only have a given name) but it might reduce potential confusion with Chinese and Japanese names, for example.
In most cases you will find that these fields will cover for most scenarios: Title, Firstname, Middlename, Lastname
In most systems that I have worked with here in Australia, data is sorted by last name in the default display. Also, if you provide search on the screen, the Lastname field usually comes before Firstname. Sorting by first name is just as common, so your system should always allow the view to switch to sorting by Firstname.
Here is a solution for a single column. I don't think separate columns can be scanned and read as quickly, although I don't have any data to back that up.
The primary focus of a user-oriented solution should be to display names as they would be read aloud, i.e. Title Firstname Middlename Lastname.
For most domains where the names are known to the user, sorting by first name is acceptable. Here is an example where a person's title is ignored in the sorting, and the sort field is clear because it is highlighted:
Arnold Q. Zonkenstein
Mr. David Cliff
Marty P. Bimbleman
For formal business oriented applications, the default sorting could be by surname. You can preserve reading order, while still sorting by last name, again using highlighting:
Marty P. Bimbleman
Mr. David Cliff
Arnold Q. Zonkenstein
If you want the sort field to be configurable, use an explicit checkbox. Clicking multiple times on the column heading to cycle between sort fields will be jarring to the user (toggling the sort direction by clicking on the heading is more acceptable).
IMO this is the simplest solution without any compromises.

data validation for date data type in unix?

For example, suppose I am getting various inputs from the keyboard, i.e.
Book accession number,
Subject code,
Book_id,
Author,
Year of Publication,
Title of the book,
Publisher’s name,
Price,
and I want to do validations such that
year of publication must be before 1996,
Book_id must be unique,
Publisher, Author and the title of the Book cannot be entered blank,
Subject code can only be either UNIX or C,
The Book accession numbers must be in ascending order,
How do I store all the values first and then do validations for data types like date?
I guess it would make more sense to put this data into a database and then query it with a SELECT. It's also possible in shell programming, but it would be a lot of work and could perform badly.
If you don't have DB access, Perl is another good option, since it has date modules and is faster.
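Whichever tool is used, the pattern is the same: collect all records first, then run the checks over the whole list. Here is a minimal sketch in Python (not shell or Perl, and not from the answers above); the field names, the sample records and the assumption that "ascending" means strictly increasing are all mine.

```python
def validate(records):
    """Return a list of (record_number, message) for every failed rule."""
    errors = []
    seen_ids = set()
    previous_accession = None
    for number, rec in enumerate(records, start=1):
        # Year must be numeric and before 1996.
        if not (rec["year"].isdigit() and int(rec["year"]) < 1996):
            errors.append((number, "year of publication must be before 1996"))
        # Book_id must be unique across all records.
        if rec["book_id"] in seen_ids:
            errors.append((number, "Book_id must be unique"))
        seen_ids.add(rec["book_id"])
        # Publisher, author and title cannot be blank.
        for field in ("publisher", "author", "title"):
            if not rec[field].strip():
                errors.append((number, f"{field} cannot be blank"))
        # Subject code is restricted to two values.
        if rec["subject_code"] not in ("UNIX", "C"):
            errors.append((number, "subject code must be UNIX or C"))
        # Accession numbers must be numeric and strictly ascending (assumed).
        accession = rec["accession"]
        if not accession.isdigit():
            errors.append((number, "accession number must be numeric"))
        else:
            if previous_accession is not None and int(accession) <= previous_accession:
                errors.append((number, "accession numbers must be ascending"))
            previous_accession = int(accession)
    return errors

# Made-up sample input: the second record breaks five rules.
records = [
    {"accession": "1", "subject_code": "UNIX", "book_id": "B1", "author": "Author A",
     "year": "1990", "title": "Title A", "publisher": "Publisher A"},
    {"accession": "1", "subject_code": "Java", "book_id": "B1", "author": " ",
     "year": "1999", "title": "Title B", "publisher": "Publisher B"},
]
errors = validate(records)
```

Storing everything first is what makes the cross-record checks (unique Book_id, ascending accession numbers) possible; the per-field checks could also run as each value is typed.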
