How to find S&P 500 constituents history (dates added and removed, etc.) - quantitative-finance

I am trying to get the historical makeup of the S&P 500: all tickers, the dates they were added to the index, the dates they were removed, and the resulting constituent mix for each period over the years. I did some searching but haven't had any luck.
If anyone can suggest some good search keywords, or a place to look, it would be appreciated.
This is something very specific.
I currently use backtrader to work on some data. If there is a systematic way to get the data, please let me know as well.
Many thanks.

You can access this data systematically in QuantRocket, via data provider Sharadar:
https://www.quantrocket.com/data/?filter=sharadar
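If you end up assembling the additions/removals yourself instead (for example from the change table on Wikipedia's "List of S&P 500 companies" page), here is a minimal pandas sketch of how point-in-time membership could be reconstructed; the CSV layout (ticker, date_added, date_removed) is an assumption for illustration, not an actual dataset:

import pandas as pd

# Assumed input: a CSV you have assembled yourself, one row per membership
# spell, with columns ticker, date_added, date_removed (blank = still a member).
changes = pd.read_csv(
    "sp500_membership.csv",
    parse_dates=["date_added", "date_removed"],
)

def constituents_on(date):
    """Return the tickers that were in the index on a given date."""
    date = pd.Timestamp(date)
    active = changes[
        (changes["date_added"] <= date)
        & (changes["date_removed"].isna() | (changes["date_removed"] > date))
    ]
    return sorted(active["ticker"].unique())

# Example: the constituent mix as of the start of 2015
print(constituents_on("2015-01-01"))

The resulting function gives you the membership mix for any date, which you can then use to filter the universe you feed into backtrader for each period.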

Related

elasticsearch mapping for Friend to Friend list

We have started using Elasticsearch in our project. We store user data with each user's friend list as a nested object, and each friend's own friend list as a nested object inside that, because we need this data when doing a global search.
We are syncing this data with our database in real time at 50-100 TPS. Is it fine to keep syncing in real time at this rate, or will it create problems in the future?
We also need to create complex queries for updating the data, because we manage the friend list at the second level. How do we write advanced Painless scripts for this? I have searched Google but not found anything in detail.
If my approach to doing this is wrong, please let me know.
To answer your first question:
Is it fine to keep syncing in real time at 50-100 TPS, or will it create problems in the future?
As of Elasticsearch 6.0, multi-level nesting is automatically supported and detected, so an inner nested query automatically matches the relevant nesting level (and not the root) if it sits inside another nested query. But there is a caveat: indexing a document with 100 nested fields actually indexes 101 documents, as each nested document is indexed as a separate document. To safeguard against ill-defined mappings, the number of nested fields that can be defined per index is limited to 50 by default via the index.mapping.nested_fields.limit setting. This setting lets you cap the number of field mappings that can be created manually or dynamically, in order to prevent bad documents from causing a mapping explosion. So to answer your question: this is fine, but as your data grows it becomes more complicated to manage, and you risk a mapping explosion.
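For illustration, here is a minimal sketch of the kind of mapping being described, using the Elasticsearch Python client (8.x-style keyword arguments assumed); the index and field names are made up:

from elasticsearch import Elasticsearch  # assumes elasticsearch-py 8.x

es = Elasticsearch("http://localhost:9200")

# Each nested "friends" entry is indexed as its own hidden Lucene document, so a
# user with 100 friends costs 101 documents. nested_fields.limit caps how many
# nested *field mappings* (not entries) the index may define; 50 is the default.
es.indices.create(
    index="users",
    settings={"index.mapping.nested_fields.limit": 50},
    mappings={
        "properties": {
            "name": {"type": "keyword"},
            "friends": {
                "type": "nested",
                "properties": {
                    "id": {"type": "keyword"},
                    "name": {"type": "keyword"},
                    # second level: a friend's own friend list
                    "friends": {
                        "type": "nested",
                        "properties": {
                            "id": {"type": "keyword"},
                            "name": {"type": "keyword"},
                        },
                    },
                },
            },
        }
    },
)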
To answer your second question:
How do we write advanced Painless scripts to update the data, given that we manage the friend list at the second level? I have searched Google but not found anything in detail.
You might need to provide some more context so we can understand why your approach is necessary, but basically, in a social-profile context, managing a friend list the way you are doing is generally a bad idea, especially if you anticipate scaling in the future. It may work for smaller use cases, but it does not work well at scale: the relationships become more sophisticated and you end up with too many multi-level nested objects. All else being equal, you might want to look at a graph database for this kind of scenario. You could, however, have other reasons for your approach, which is why you might want to spell out your context so we can advise you better.
Hope this helps!!

Updating nested documents en masse

We've been using Elasticsearch to deliver the 700,000 or so pieces of content to the readers of our site for a couple of years but some circumstances have changed and we need to work out whether or not the service can adapt with us... (sorry this post is so long, I tried to anticipate all questions!)
We use Elasticsearch to store "snapshots" of our content to avoid duplicating work and slowing down our apps by making them fetch data and resolve all resources from our content APIs. We also take advantage of Elasticsearch's search API to retrieve the content in all sorts of ways.
To maintain content in our cluster we run a service that receives notifications of content changes from our APIs which triggers a content "ingest" (fetching the data, doing any necessary transformation and indexing it). The same service also periodically "reingests" content over time. Typically a new piece of content will be ingested in <30 seconds of publishing and touched every 5 days or so thereafter.
The most common method our applications use to retrieve content is by "tag". We have list pages to view content by tag and our users can subscribe to content updates for a tag. Every piece of content has one or more tags.
Tags have several properties: ID, name, taxonomy, and their relationship to the content. They're indexed as nested objects so that we can aggregate on them, etc.
This is where it gets interesting... tags used to be immutable, but we have recently changed metadata systems and they may now change - names will be updated, IDs may shift as tags move around the taxonomy, etc.
We have around 65,000 tags in use, the vast majority of which are used only in relatively small numbers. If and when these tags change we can trigger a reingest of all the associated content without requiring any changes to our infrastructure.
However, we also have some tags which are very common, the most popular of which is used more than 180,000 times. And we've just received warning that it, along with a few others used in tens of thousands of documents, is due to change! So we need to be able to cope with these updates now and into the future.
Triggering a reingest of all the associated content and queuing it up is not the problem, but this could take quite some time, at least 3-5 hours in some cases, and we would like to try and avoid our list pages becoming orphaned or duplicated while this occurs.
If you've got this far, thank you! I have two questions:
Is there a more optimal mapping we could use for our documents knowing now that nested objects - often duplicated thousands of times - may change? Could a parent/child mapping work with so many relations?
Is there an efficient way to update a large number of nested objects? Hacks are fine, at least to cover us in the short term. Could the update by query API and a script handle it?
Thanks
I've already answered a similar question about your use case of the nested datatype.
Here is the link to the answer on maintaining parent-child relational data in ES using the nested datatype.
Try this, and do let me know if this solution helps solve your problem.
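On the update-by-query part of your question: it can handle this kind of in-place nested update. Below is a rough sketch using the Elasticsearch Python client (8.x-style assumed); the index name, field names, and tag IDs are placeholders rather than your actual schema:

from elasticsearch import Elasticsearch  # assumes elasticsearch-py 8.x

es = Elasticsearch("http://localhost:9200")

# Rename one tag in place across every document that carries it, without a
# full reingest. "content", "tags", and the IDs below are illustrative only.
es.update_by_query(
    index="content",
    query={
        "nested": {
            "path": "tags",
            "query": {"term": {"tags.id": "old-tag-id"}},
        }
    },
    script={
        "lang": "painless",
        "source": """
            for (tag in ctx._source.tags) {
              if (tag.id == params.old_id) {
                tag.id = params.new_id;
                tag.name = params.new_name;
              }
            }
        """,
        "params": {
            "old_id": "old-tag-id",
            "new_id": "new-tag-id",
            "new_name": "New tag name",
        },
    },
    conflicts="proceed",          # skip documents reingested mid-run
    wait_for_completion=False,    # run as a background task for large tags
)

Running it as a background task keeps a 180,000-document update from blocking the client, though list pages will still see the old tag on not-yet-updated documents until the task finishes.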

MarkLogic 8 - Reporting and Aggregation from Large Collection

Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate, which looks fine. But then I need to group as well.
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using corb to run through the records and then do something with the output from that. Is this the right approach with MarkLogic and reporting?
Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options, which are discussed in that post:
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency

rails algorithm visitors count

What is the best way to implement visitor-counting logic?
Create a visitors table |ip|resource_type|resource_id|
Create a serialized field in the records (Post, Pet, Event, Ad, etc...)
Use a NoSQL solution
Any other idea?
In the 1st case, the table grows with every visit.
In the 2nd, we end up with a very long field.
In the 3rd, I have trouble with Mongoid in production (CentOS).
Not sure I'm answering your question exactly, but I would not implement that myself; I'd rather take a look at existing solutions. For basic counting:
Vanity
Google Analytics
For more detailed metrics about what each user does, I would go toward cohort.
A completely different option could be to use just the logs, with something like lograge to log each request. It is very easy to add fields (such as the IP). You can then extract all the information from your logs.
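As a rough illustration of the log-based approach, here is a small Python sketch that counts unique visitor IPs per path; the exact line format (lograge-style key=value pairs with an ip field added) is an assumption:

import re
from collections import defaultdict

# Assumed lograge-style lines with a custom ip field appended, e.g.:
#   method=GET path=/posts/123 status=200 duration=58.33 ip=203.0.113.7
LINE = re.compile(r"method=GET path=(?P<path>\S+) .*\bip=(?P<ip>\S+)")

unique_visitors = defaultdict(set)

with open("production.log") as log:
    for line in log:
        match = LINE.search(line)
        if match:
            unique_visitors[match.group("path")].add(match.group("ip"))

# Most-visited resources first
for path, ips in sorted(unique_visitors.items(), key=lambda kv: -len(kv[1])):
    print(f"{len(ips):6d} unique IPs  {path}")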

How does Facebook do it?

Have you ever noticed how Facebook says “3 friends and 33 others liked this”? I was wondering what the best approach to doing this is. I don’t think going through the friends list and the list of users who “liked this” and comparing them is efficient at all! Do they keep track of this in the database? That would make the database very large.
What do you guys think?
Thanks!
I would guess they outer join their friends table with their likes table to count both regular likes and friend likes at the same time.
With the proper indexes, it wouldn't be a slow query at all. Huge databases aren't necessarily slow, so there's really no reason to not store all of this information in a database. The trick is to make sure the indexes and partitions (if any) are set up well.
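As a toy illustration of that join-and-count idea (schema and data invented for the example, with SQLite standing in for brevity):

import sqlite3

# Minimal schema for illustration: who likes a post, and who is friends with whom.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    CREATE INDEX idx_likes_post ON likes(post_id);
    CREATE INDEX idx_friends_user ON friends(user_id);
""")
conn.executemany("INSERT INTO likes VALUES (?, ?)", [(1, u) for u in range(10, 46)])
conn.executemany("INSERT INTO friends VALUES (1, ?)", [(10,), (11,), (12,)])

# LEFT (outer) join the post's likes against the viewer's friend rows: rows that
# match are friend likes, the rest are "others", both counted in a single pass.
row = conn.execute("""
    SELECT COUNT(f.friend_id)            AS friend_likes,
           COUNT(*) - COUNT(f.friend_id) AS other_likes
    FROM likes l
    LEFT JOIN friends f
      ON f.user_id = ? AND f.friend_id = l.user_id
    WHERE l.post_id = ?
""", (1, 1)).fetchone()

print(f"{row[0]} friends and {row[1]} others liked this")  # 3 friends and 33 others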
Facebook uses Cassandra, a NoSQL database for at least some things. Here's a more detailed discussion of what some of the bigger social media sites do to solve these problems:
http://www.25hoursaday.com/weblog/2009/09/10/BuildingScalableDatabasesDenormalizationTheNoSQLMovementAndDigg.aspx
Lots of interesting reading in there if you follow the links from it to the Digg blog post, etc.
Yes, they definitely keep it in their database, as they have more than one server that needs to access the data.
As for scalability, I'm sure they use a lot of caching.
Here is an example:
If you have 1 million rows to go through, an index can find what you need in O(log n) ≈ 20 operations (in the worst case).
For 2 million, you only need 21 operations (in the worst case) to find what you need.
Every time you double the number of users to go through, you need only 1 more operation (in the worst case) with an O(log n) index.
They also have a distributed architecture or a clustered database.
Facebook must be using a trigger (which automatically gets executed as soon as an event occurs).
For example, suppose a trigger is created to store the count and names of people who liked the status; it will then be executed every time someone likes your status, and implicitly (automatically) at that.
This makes the operation much easier, and Facebook doesn't have to manually update the database or store a huge database for this. Also, this approach is a faster one.
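As a toy sketch of that trigger idea (schema invented for the example, with SQLite standing in), a counter column is kept up to date automatically on every insert:

import sqlite3

# A like_count column maintained by a trigger, so reads never have to count rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, like_count INTEGER DEFAULT 0);
    CREATE TABLE likes (post_id INTEGER, user_id INTEGER);

    CREATE TRIGGER bump_like_count AFTER INSERT ON likes
    BEGIN
        UPDATE posts SET like_count = like_count + 1 WHERE id = NEW.post_id;
    END;
""")

conn.execute("INSERT INTO posts (id) VALUES (1)")
conn.executemany("INSERT INTO likes VALUES (1, ?)", [(u,) for u in range(36)])

print(conn.execute("SELECT like_count FROM posts WHERE id = 1").fetchone()[0])  # 36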
In designing social networking software (mothsorchid.com) I found the only way to address this is to pre-cache streams of notifications. One doesn't query the database at page-load time to count how many friends and others 'liked this'; when someone 'likes' something, that is recorded on the object, and when retrieving the object one can compare it with the current user's friend list. If someone updates their profile/makes a comment/etc., it sends notification objects to friends, which are pre-cached in their feeds. This cuts down tremendously on database work at the expense of disk space, but disk space is cheap.
As to how Facebook does this, they use Cassandra DBMS, which is probably a little different to what you have in mind.
Keep in mind that Facebook strongly utilizes memcached, so they're retaining a lot of data in memory and only refreshing it when absolutely necessary. See this blog post for some scalability discussion around this:
http://www.facebook.com/note.php?note_id=39391378919
Each entry that somebody can like probably contains a list of everybody who does like it (all of this is, of course, in a database). When you view that entry, they match it against your friends list to see which of them are your friends. Voila.
A lot of this is explained by Facebook's Director of Engineering in this QCon presentation:
http://www.infoq.com/presentations/Facebook-Software-Stack
A great presentation to watch.
