What determines whether a boxplot or only outliers are shown? (dc.js)

I have 40 boxplots that each represent 110 companies on page load. The user can filter the companies that drive the boxplots down to a single company. When there's only one company in the filter, I can only see outlier circles and no boxplots, which is what I'd expect. But even when there are 25 companies in the filter, I can still only see outlier circles and no boxplots.
I am going to be asked about this when I present my app, but I can't currently explain what the criteria are for a full boxplot to show instead of outliers. Is it simply that the count must exceed some threshold, e.g. 30? Should I be looking in the D3 docs for this, since I can't see much in the dc.js docs? Thanks

Related

Algorithm to find areas of support in a candlestick chart

I am in the process of designing an algorithm that will calculate regions in a candlestick chart where strong areas of support exist. An "area of support" in this case is defined as an area in the chart where the price of a stock rises by a large amount in a short period of time. (Please see the diagram below; the blue dots represent these strong areas of support.)
The data I am working with is a list of over 6000 TOHLC (timestamp, open price, high price, low price, close price) values. For example, the first entry in this list of data is:
[1555286400, 83.7, 84.63, 83.7, 84.27]
The way I have structured the algorithm to work is as follows:
1.) The list of 6000+ TOHLC values is split into sub-lists of 30 TOHLC values each (30 is a number that I chose arbitrarily). The lowest low price (LLP) is then obtained from each of these sub-lists. The purpose of this method is to find areas in the chart where prices dip.
2.) The next step is to determine how high the price rose from each of these lows. For this, I take the next 30 candlestick values after the low and determine the highest high price (HHP). Then, if HHP / LLP >= 1.03, the low price is accepted; otherwise it is discarded. Again, 1.03 is a value that I chose arbitrarily, by analysing the stock chart manually and determining how much the price rose on average from these lows.
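In rough Python, the two steps read something like this (the row layout [timestamp, open, high, low, close] follows the example above; the function and variable names are my own, not part of the question):

    def find_supports(candles, chunk=30, rise=1.03):
        """Sketch of the chunked approach described in steps 1 and 2."""
        supports = []
        for i in range(0, len(candles), chunk):
            sub = candles[i:i + chunk]
            # Step 1: find the lowest low price (LLP) in this sub-list.
            low_idx = min(range(len(sub)), key=lambda j: sub[j][3])
            llp = sub[low_idx][3]
            # Step 2: highest high price (HHP) over the next 30 candles.
            start = i + low_idx + 1
            following = candles[start:start + chunk]
            if not following:
                continue
            hhp = max(row[2] for row in following)
            if hhp / llp >= rise:  # accept if the price rose >= 3%
                supports.append((candles[i + low_idx][0], llp))
        return supports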
The blue dots in the chart above represent the areas of support accepted by the algorithm. It appears to be working well, in terms of what I am trying to achieve.
So the question I have is: does anyone have any improvements they can suggest for this algorithm, or point out any faults in it?
Thanks!
I may have misunderstood, but from your explanation it seems like you are doing your calculation on separate 30-element sub-lists and then combining the results.
So, what if the LLP is the 30th element of sub-list N and the HHP is the 1st element of sub-list N+1? If you have taken that into account, then it's fine.
If you haven't taken that into account, I would suggest a moving-window approach to reading the data: start from the 0th element of the 6000+ TOHLC values with a window size of 30 and slide it one element at a time. This way, you won't miss any values.
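As a sketch of that moving-window idea (same assumed row layout as above; the local-minimum test is just one way to avoid flagging every bar as a low):

    def find_supports_sliding(candles, window=30, rise=1.03):
        supports = []
        for i in range(len(candles) - window):
            llp = candles[i][3]
            # Treat bar i as a candidate low only if no nearby bar is lower.
            nearby = range(max(0, i - window), i + window)
            if any(candles[j][3] < llp for j in nearby):
                continue
            # Check how high the price rises over the next `window` bars.
            hhp = max(row[2] for row in candles[i + 1:i + 1 + window])
            if hhp / llp >= rise:
                supports.append((candles[i][0], llp))
        return supports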
Some of the selected blue dots have a deeper dip than others. Why is that? I would separate them with another classifier. If you store them in an object, store the dip rate as well.
Floating-point numbers are discouraged in finance. If possible, I'd use a different approach, and perhaps a different classifier, using only integers. It may not bother you or your project right now, but it will surely begin to create false results as the numbers add up in the future.
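For example, with prices quoted to two decimal places you can carry everything as integer cents (parsing via Python's decimal module) and even do the 1.03 test without a float division; a minimal sketch with made-up numbers:

    from decimal import Decimal

    # Carry prices as integer cents rather than binary floats.
    llp_cents = int(Decimal("83.70") * 100)   # 8370
    hhp_cents = int(Decimal("86.31") * 100)   # 8631 (hypothetical HHP)

    # hhp / llp >= 1.03  <=>  100 * hhp >= 103 * llp, in pure integers.
    accepted = 100 * hhp_cents >= 103 * llp_cents   # True here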

Calculating Relative Recall

When calculating relative recall using TREC and 'K' pooling, does the total number of relevant documents reflect the relevant documents from all participating systems per query, or across all the queries?
And does this approach not invalidate recall calculations? Say I have the top 50 documents between two systems, but collectively there are 75 relevant documents; then, irrespective of how good either system is, they will never be able to reach 100% recall.
When calculating relative recall using TREC and 'K' pooling, does the total number of relevant documents reflect the relevant documents from all participating systems per query, or across all the queries?
The set of relevant documents comprises documents that are judged relevant by human assessors, who are asked to look at the union of the top-100 documents retrieved by each participating system. Note the stress on the word union, which indicates that the assessors are not shown this set in any particular order. So this pool is indeed a set (and not an ordered set).
The set of relevant documents is different for each query. So you might imagine that if R represents the relevant set of documents, it has an argument q (the query). In effect, you have R(q) and not just R.
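In sketch form (the data layout is my own assumption), the judged pool for one query is just the union of each run's top k:

    def judging_pool(runs, k=100):
        # runs: {system_name: ranked list of doc ids for one query q}
        pooled = set()                    # a set: order is discarded
        for ranked in runs.values():
            pooled |= set(ranked[:k])     # union of each system's top k
        return pooled                     # assessors judge these documents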
And does this approach not invalidate recall calculations? Say I have the top 50 documents between two systems, but collectively there are 75 relevant documents; then, irrespective of how good either system is, they will never be able to reach 100% recall.
They can, in principle, achieve 100% recall if they retrieve at least 75 documents each. Obviously, if you're allowed to retrieve 10 documents and there are a total of 20 relevant documents, the maximum recall you can achieve is only 50%.
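That bound is just the following arithmetic:

    # Max achievable recall when you may retrieve at most k documents
    # and |R(q)| relevant documents exist for the query:
    k, num_relevant = 10, 20
    max_recall = min(k, num_relevant) / num_relevant   # 0.5, i.e. 50%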

Ranking/ weighing search result

I am trying to build an application that has a smart, adaptive search engine (let's say for cars). If I search for 4x4, the DB will return all the 4x4 cars I have (100 cars). But as time goes by and I start checking out cars, liking them, commenting on them, etc., the order of the search results should be different. That means that one month later, when searching for 4x4, I should get the same result set ordered differently, based on my previous interaction with the site. If I was mainly liking and commenting on German cars, a BMW should be at the top and a Land Cruiser further down.
This ranking should be based on attributes that I capture during user interaction (e.g. car origin, user age, user location, car type [4x4, coupe, hatchback], price range). So for each car in the result set, I will weigh it based on how well it performs on the five attributes above.
I intend to use the DB just as a repository and do the ranking and the thinking on the server. My question is, what kind of algorithm should I be using to weigh/rank my search result?
Thanks.
You're basically saying that you already have several ordering schemes:
Keyword search result
number of likes for the car's category
likely others, such as popularity, some form of date, etc.
What you do then is make up a new scheme, call it relevance:
relevance = W1 * keyword_score + W2 * likes_score + ...
and sort by relevance. Experiment with the weights W1, W2, ..., until you get something you find useful.
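A minimal sketch of that weighted sum, with made-up weights and pre-normalised scores (all names here are illustrative):

    # Each score is assumed normalised to [0, 1]; weights are arbitrary.
    WEIGHTS = {"keyword_score": 0.5, "likes_score": 0.3, "popularity": 0.2}

    def relevance(car):
        return sum(w * car[attr] for attr, w in WEIGHTS.items())

    cars = [
        {"name": "BMW X5", "keyword_score": 0.9, "likes_score": 0.8, "popularity": 0.6},
        {"name": "Land Cruiser", "keyword_score": 0.9, "likes_score": 0.2, "popularity": 0.7},
    ]
    cars.sort(key=relevance, reverse=True)   # the BMW ranks first here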
From my understanding, search engines work on this principle. It has long been said that Google has on the order of 200 different inputs into its relevance score, PageRank being just one. The beauty of this approach is that it lets you fine-tune the importance of everything (even individually for every query), and it lets you add additional inputs without breaking everything.

How to create a rating based off social scores

I have a thousand recipes, each with a tweet count and a Facebook like count. What I want to do is create an overall rating out of 100 based on these two scores (and perhaps other social network counts too).
Assuming both Facebook and Twitter are equally weighted, how can I go about this?
One way to do this for any given network would be something like this:
this_recipes_facebook_count / max_facebook_count_in_db * 100.0
and average it with the Twitter result.
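In code, that naive version reads roughly as follows (names are hypothetical):

    def naive_rating(fb, tw, max_fb, max_tw):
        # Scale each network by the highest count seen, then average.
        return (fb / max_fb * 100.0 + tw / max_tw * 100.0) / 2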
However, what happens if there is a recipe with a freakishly high score? It unfairly punishes other recipes with lower, yet still relatively high, scores.
I feel I need to take standard deviation into account, perhaps some dampening function... but it's been 14 years since I took stats in high school.
Can anyone help? I'd prefer simple over complex, as it is only recipe ratings after all.
Instead of increasing the popularity count linearly, you might do something like this: (1 - p^x)
where p is a pre-selected value (say 0.99) and x is the number of mentions.
Initially, an increase in mentions speeds up the score a lot, but after some time the effect becomes smaller and smaller.
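A small sketch of that, averaging the two networks as in the question (the counts are made up):

    def saturating_score(mentions, p=0.99):
        # 1 - p**x climbs quickly for small counts, then flattens out,
        # so one freakishly popular recipe cannot dwarf everything else.
        return (1 - p ** mentions) * 100.0

    fb_likes, tweets = 350, 120   # hypothetical counts
    rating = (saturating_score(fb_likes) + saturating_score(tweets)) / 2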

How to implement a real estate recommendation engine?

I am talking about something like movie/item recommendation, but it seems that real estate (RE) is trickier. When visiting a website and searching for RE, the user should be presented with some suggestions. Let's separate this into two tasks:
a) the user has not yet entered any personal info: item-based recommendation
b) the user has already entered details such as income, location, etc.: item/user-based recommendation
The first thing that comes to my mind for task a) is to start modelling RE features, but using ranges instead of exact values. For example:
Area in m2:
40 - 50 is marked "1"
50 - 70 is "2"
etc.
Price:
20 - 30 thousand € is marked "1"
30 - 40 is "2"
etc.
Proximity to city center:
1 for the RE being within the city center
2 for Zone 2, or up to 2-3 kilometers from the center
3 for Zone 3, or about 7 kilometers from the center
So having ranges lets us assign a vector to each RE property, which allows us to use Euclidean distance, Pearson correlation, and nearest-neighbour algorithms.
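A minimal sketch of that binning, with hypothetical bin edges and field names:

    import math

    def bin_value(value, upper_edges):
        # Return 1 for the first range, 2 for the second, and so on.
        for i, edge in enumerate(upper_edges):
            if value < edge:
                return i + 1
        return len(upper_edges) + 1

    def to_vector(prop):
        return [
            bin_value(prop["area_m2"], [50, 70, 90]),
            bin_value(prop["price_k_eur"], [30, 40, 60]),
            prop["zone"],                 # 1 = city center, 2, 3, ...
        ]

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))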
Please comment on my approach or suggest a new one.
If you already have a website with enough traffic, you can try a pure collaborative filtering approach, i.e. people who viewed this property also viewed these other properties. You could use the Pearson correlation there for good results.
Similarity between two RE properties can be defined as

          number of people who viewed both RE1 and RE2
    sim = ---------------------------------------------
          number of people who viewed either one or both
When a user is viewing a property, you can sort all other RE properties by their similarity score to the property being shown and display the top few.
You could add some obvious filters on top of this like the location of the property, the price range etc.
You can also define the similarity as you have suggested, and mix the results from both to get good representation for new RE entries, which would otherwise have little chance of getting in if a pure collaborative filtering algorithm is used.
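For the view-based similarity above, a sketch (views_by_property maps a property id to the set of user ids who viewed it; the names are my own):

    def similarity(viewers_a, viewers_b):
        both = len(viewers_a & viewers_b)      # viewed RE1 and RE2
        either = len(viewers_a | viewers_b)    # viewed one or both
        return both / either if either else 0.0

    def recommend(current_id, views_by_property, top_n=5):
        current = views_by_property[current_id]
        scored = sorted(
            ((similarity(current, viewers), pid)
             for pid, viewers in views_by_property.items()
             if pid != current_id),
            reverse=True,
        )
        return [pid for _, pid in scored[:top_n]]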
