I recently finished a machine learning course and would like to make a forum sentiment analysis tool, to apply it in stock-related forums.
The idea is to:
Capture (via text mining) users and their comments, and evaluate each comment's sentiment (positive, negative, neutral).
Capture what happens in the stock market after those comments, and assign a weight to each user accordingly (a bigger weight if the user's sentiment is spot-on and the market moves in the same direction).
Use the comments as a tool to predict market direction.
Actually, I already do this myself (paying attention to forums) alongside my own technical analysis and the obligatory due diligence, and it has been working very well for me. I just want to try to automate it a little and maybe even let a program trade with some of my accounts (paper trading first, and if it performs decently, some money in a real account).
This would be my first machine learning project (just as a proof-of-concept) so any comments would be very kindly appreciated.
The biggest problem I've found is that I would like to use unsupervised training, and I need a sample dataset to train on.
Question: Is there any known forum-sentiment dataset available to be used for unsupervised training?
I've found several sentiment datasets (Twitter, IMDb, Amazon reviews), but they are very specific to their niche (short messages, movies, products...), and I'm looking for something more general.
Since you are looking for an unsupervised approach, you can use any data set that matches your "real case scenario". Text mining and sentiment analysis are often tailored to the problem at hand, so it is easy to start directly with the real data. The best approach is to build a scraper that directly grabs the forum posts you want to analyze. You can build the scraper easily enough with Python (beautifulsoup/selenium). There are plenty of good tutorials online, e.g. https://www.dataquest.io/blog/web-scraping-tutorial-python/
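To give a rough idea of what the extraction step of such a scraper looks like, here is a minimal sketch that uses only the standard library's html.parser instead of beautifulsoup/selenium. The `<div class="post">` markup and the names `PostExtractor` and `extract_posts` are assumptions made up for illustration; a real forum will use different markup, and you would first download the page before parsing it.

```python
from html.parser import HTMLParser


class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.posts = []      # finished post texts
        self._depth = 0      # > 0 while we are inside a post <div>
        self._chunks = []    # text fragments of the current post

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1                      # nested <div> inside a post
        elif "post" in (dict(attrs).get("class") or "").split():
            self._depth = 1                       # entering a new post

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:                  # leaving the post
                text = " ".join("".join(self._chunks).split())
                self.posts.append(text)
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_posts(html):
    """Return the whitespace-normalised text of each post in an HTML string."""
    parser = PostExtractor()
    parser.feed(html)
    return parser.posts


# Toy page standing in for a downloaded forum thread.
sample = """
<html><body>
  <div class="post"><b>alice</b>: Earnings look great, buying more.</div>
  <div class="sidebar">ads</div>
  <div class="post">bob: I think this one <i>drops</i> next week.</div>
</body></html>
"""
posts = extract_posts(sample)
```

The resulting list of strings is what you would then feed into the sentiment step.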
I inspected the page and found the following content, which I want to extract.
<code style="display: none" id="bpr-guid-1441788">
{"companyDetails":{"com.linkedin.voyager.jobs.JobPostingCompany":{"companyResolutionResult":{"entityUrn":"urn:li:fs_normalized_company:166973","name":"World Wildlife Fund","logo":{"image":{"com.linkedin.voyager.common.MediaProcessorImage":{"id":"/p/3/000/093/367/1651958.png"}},"type":"LOGO_LEGACY"}},"company":"urn:li:fs_normalized_company:166973"}},"entityUrn":"urn:li:fs_normalized_jobPosting:324588733","formattedLocation":"Bozeman, Montana","jobState":"LISTED","description":{"attributes":[{"start":572,"length":1,"type":{"com.linkedin.pemberly.text.LineBreak":{}}},{"start":0,"length":574,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":574,"length":1,"type":{"com.linkedin.pemberly.text.LineBreak":{}}},{"start":574,"length":2,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":576,"length":18,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":594,"length":316,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":910,"length":134,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":1044,"length":160,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":1204,"length":342,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":1546,"length":270,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":594,"length":1222,"type":{"com.linkedin.pemberly.text.List":{"ordered":false}}},{"start":1817,"length":1,"type":{"com.linkedin.pemberly.text.LineBreak":{}}},{"start":1834,"length":1,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":1835,"length":147,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":1982,"length":129,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":2111,"length":130,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":2241,"length":92,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":2333,"length":189,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":1835,"length":687,"type":{"com.linkedin.pemberly.text.List":{"ordered":false}}},{"start":2522,"
length":1,"type":{"com.linkedin.pemberly.text.LineBreak":{}}},{"start":2522,"length":1,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2522,"length":2,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":2524,"length":12,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2524,"length":66,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":2590,"length":1,"type":{"com.linkedin.pemberly.text.LineBreak":{}}},{"start":2590,"length":1,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2590,"length":2,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":2592,"length":9,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2592,"length":10,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":2602,"length":17,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2619,"length":12,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2631,"length":78,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2602,"length":108,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":2710,"length":88,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2710,"length":89,"type":{"com.linkedin.pemberly.text.ListItem":{}}},{"start":2602,"length":197,"type":{"com.linkedin.pemberly.text.List":{"ordered":false}}},{"start":2799,"length":177,"type":{"com.linkedin.pemberly.text.Bold":{}}},{"start":2799,"length":177,"type":{"com.linkedin.pemberly.text.Paragraph":{}}},{"start":2976,"length":0,"type":{"com.linkedin.pemberly.text.Paragraph":{}}}],"text":"World Wildlife Fund (WWF), the world’s leading conservation organization, seeks a Data Analyst. Under the direction of the supervisor, this position is responsible for providing data synthesis and analysis for the Northern Great Plains (NGP). 
S/he will assist the NGP program in communicating its goals and successes through the development of data synthesis products and developing statistical interpretation of existing and new datasets, with a focus on informing grassland conservation. S/he will develop products to disseminate information to NGP staff and partners. \n \n Responsibilities Provide data synthesis and interpretation for existing and new datasets to support grassland conservation goals. Work with existing datasets to develop new ways of interpreting the data and communicating it to partners. Work with new datasets to help answer key science questions as outlined by the Program Manager. Given the list of science priorities for the program, develop methods for answering pressing questions using the best available data. Develop spatial data for use in projects. Collect and process datasets for use by NGP staff and partners, as needed and in partnership with the GIS Specialist. Support NGP Program by developing in-depth knowledge of grassland conservation and researching and developing skills in other approaches necessary to ensure success of WWF’s conservation strategies in the region. Build knowledge through research to keep up to date with the state of the art knowledge and apply the knowledge to WWF projects. The candidate will report to the Program Manager. S/he will also maintain strong relationships with the Managing Director, Deputy Director, NGO partner organizations, federal, state and provincial agency planning personnel and corporate and foundations staff at WWF-US. \n Qualifications A Master of Science Degree in Biostatistics, Biology, Conservation Biology, Zoology, Ecology, Wildlife Management, or a related field, is required 4+ years of experience in spatial analysis and data synthesis is required. A PhD will substitute for 3 years of work experience. 
Substantial and demonstrated experience in spatial analysis; data synthesis; and managing an independent work program is required Experience in biodiversity conservation and grassland-focused spatial datasets is preferred Candidates should have a strong commitment to the mission, goals, and values of WWF, good interpersonal and relationship-building skills, energy and enthusiasm, and high ethical standards. \n Please Note: This is a 2-year position based in Bozeman, Montana. \n To Apply: Please visit our Careers Page, job#17065, to submit an online application including resume and cover letter Due to the high volume of applications we are not able to respond to inquiries via phone As an EOE/AA employer, WWF will not discriminate in its employment practices due to an applicant’s race, color, religion, sex, national origin, and veteran or disability status."},"applyMethod":{"com.linkedin.voyager.jobs.OffsiteApply":{"applyStartersPreferenceVoid":true,"companyApplyUrl":"https://careers-wwfus.icims.com/jobs/1727/data-analyst---17065/job"}},"title":"Data Analyst","listedAt":1496950791000}
</code>
I tried several different ways to extract the content, especially the longest text part, such as
body.xpath('//code[@id="bpr-guid-1441788"]/text()').extract()
But I get nothing back; Scrapy returns null.
Can anyone help me out?
Traditional software metrics deal with the quality of software. I'm looking for metrics that can be used to identify developers by their code, in the same vein as plagiarism software and stylometry can be used to identify authors by their writing style. I can imagine that certain existing metrics can be used here as well, such as comment ratio. I can also imagine metrics that would be irrelevant from a quality point of view, such as the (over)use of certain methods or design patterns, average length of variable names, etc.
I'm interested either in a pointer to a collection of such metrics or studies, or individual metrics. They may be language-agnostic or related to a language or programming paradigm.
I want to use it to understand and analyze different coding styles, not to detect plagiarism.
I see there are already a couple of studies that looked into this. They might help.
Kothari, J., Shevertalov, M., Stehle, E., Mancoridis, S., "A probabilistic approach to source code authorship identification", In Proceedings of the International Conference on Information Technology, pp.243-248, IEEE, 2007.
Available online here
Quoting from the abstract:
We begin by computing a set of metrics to build profiles for a population of known authors using code samples that are verified to be authentic. We then compute metrics on unidentified source code to determine the closest matching profile. [...] In our case study we are able to determine authorship with greater than 70% accuracy in choosing the single nearest match and greater than 90% accuracy in choosing the top three ordered nearest matches.
Shevertalov, M., Kothari, J., Stehle, E., Mancoridis, S., "On the use of discretized source code metrics for author identification", In Proceedings of the 1st International Symposium on Search Based Software Engineering, pp.69-78, IEEE, 2009.
Available online here, this is a follow-up of the previous study.
Lange, R., Mancoridis, S., "Using code metric histograms and genetic algorithms to perform author identification for software forensics", In Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, pp.2082-2089, ACM, 2007.
Available online here
This is also related to the first reference (common author), and discusses the metrics in more detail. Again quoting from the abstract:
Our method involves measuring the differences in histogram distributions for code metrics. Identifying a combination of metrics that is effective in distinguishing developer styles is key to the utility of the technique. Our case study involves 18 metrics.
You can also use Google Scholar for other references, and for finding other papers based on the ones above (using the "cited by" option).
If you're looking for potential metrics, you might try reviewing some coding standards. Since these dictate a particular style, it follows that the things they talk about (spacing, placement of braces, identifier lengths, mandatory comments, etc.) are things that might be used to identify developers from their code.
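As a rough illustration of how a few such style features could be measured, here is a minimal sketch for Python source using the standard tokenize module. The function name `style_metrics` and the particular four metrics are my own choices for the example, not metrics taken from the papers above.

```python
import io
import keyword
import tokenize


def style_metrics(source):
    """Return a few simple layout/naming metrics for a Python source string."""
    comment_count = 0
    identifiers = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comment_count += 1
        elif tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            identifiers.append(tok.string)      # names, excluding keywords

    lines = [line for line in source.splitlines() if line.strip()]
    n_lines = max(len(lines), 1)                # avoid dividing by zero
    n_ids = max(len(identifiers), 1)
    return {
        "comments_per_line": comment_count / n_lines,
        "avg_identifier_length": sum(len(name) for name in identifiers) / n_ids,
        "underscore_name_ratio": sum("_" in name for name in identifiers) / n_ids,
        "max_line_length": max((len(line) for line in lines), default=0),
    }


# Two toy "authors" writing the same function in different styles.
terse = "def add(a, b):\n    return a + b  # sum\n"
verbose = (
    "def add_numbers(first_value, second_value):\n"
    "    # add the two values\n"
    "    return first_value + second_value\n"
)
terse_profile = style_metrics(terse)
verbose_profile = style_metrics(verbose)
```

Comparing profiles like these across many files per developer is the kind of input the histogram-based approaches quoted above work from.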
Also, if you're interested in .NET code, you might find NDepend to be a useful tool. It enables you to run queries against a code base, and supports 82 metrics.
I have been learning a lot about using graphs for machine learning by watching Christopher Bishop's videos (http://videolectures.net/mlss04_bishop_gmvm/). I find the topic very interesting and have watched a few others in the same category (machine learning/graphs), but I was wondering if anyone had recommendations for ways of learning more.
My problem is that, although the videos gave a great high-level understanding, I don't have many practical skills yet. I've read Bishop's book on machine learning and pattern recognition as well as Norvig's AI book, but neither seems to say much about using graphs specifically. With the emergence of search engines and social networking, I would think machine learning on graphs would be popular.
If possible, can anyone suggest a resource to learn from? (I'm new to this field and development is a hobby for me, so I'm sorry in advance if there's a super obvious resource to learn from. I tried Google and university sites.)
Thanks in advance!
First, I would strongly recommend the book Social Network Analysis for Startups by Maksim Tsvetovat and Alexander Kouznetsov. A book like this is a godsend for programmers who need to quickly acquire a basic fluency in a specific discipline (in this case, graph theory) so that they can begin writing code to solve problems in this domain. Both authors are academically trained graph theoreticians, but the intended audience of their book is programmers. Nearly all of the numerous examples presented in the book are in Python using the networkx library.
Second, for the projects you have in mind, two kinds of libraries are very helpful if not indispensable:
- graph analysis: e.g., the excellent networkx (Python) or igraph (Python, R, et al.) are two that I can recommend highly; and
- graph rendering: the excellent graphViz, which can be used stand-alone from the command line, but more likely you will want to use it as a library; there are graphViz bindings in all major languages (e.g., for Python there are at least three I know of, though pygraphviz is my preference; for R there is rgraphviz, which is part of the bioconductor package suite). Rgraphviz has excellent documentation (see in particular the Vignette included with the Package).
It is very easy to install and begin experimenting with these libraries, and in particular to use them
- to learn the essential graph theoretic lexicon and units of analysis (e.g., degree sequence distribution, nodes traversal, graph operators);
- to distinguish critical nodes in a graph (e.g., degree centrality, eigenvector centrality, assortativity); and
- to identify prototype graph substructures (e.g., bipartite structure, triangles, cycles, cliques, clusters, communities, and cores).
The value of using a graph-analysis library to quickly understand these essential elements of graph theory is that for the most part there is a 1:1 mapping between the concepts I just mentioned and functions in the (networkx or igraph) library.
So, e.g., you can quickly generate two random graphs of equal size (node number), render and then view them, then easily calculate for instance the average degree or betweenness centrality for both and observe first-hand how changes in the value of those parameters affect the structure of a graph.
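To make that experiment concrete without requiring any installation, here is a minimal pure-Python sketch of the same idea. It builds two G(n, p) random graphs of equal size and compares their average degree and degree centrality (degree centrality is swapped in for betweenness centrality purely to keep the example short). In networkx the corresponding calls would be `nx.gnp_random_graph`, `nx.degree_centrality` and `nx.betweenness_centrality`; the helper functions below are stand-ins written for this example.

```python
import itertools
import random


def gnp_random_graph(n, p, seed=None):
    """G(n, p) random graph: each possible edge is included with probability p."""
    rng = random.Random(seed)
    adjacency = {node: set() for node in range(n)}
    for u, v in itertools.combinations(range(n), 2):
        if rng.random() < p:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


def average_degree(adjacency):
    """Mean number of neighbours per node."""
    return sum(len(nbrs) for nbrs in adjacency.values()) / len(adjacency)


def degree_centrality(adjacency):
    """Degree of each node divided by the maximum possible degree, n - 1."""
    n = len(adjacency)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adjacency.items()}


# Two graphs with the same number of nodes but different edge probabilities.
sparse = gnp_random_graph(50, 0.05, seed=1)
dense = gnp_random_graph(50, 0.30, seed=1)

print("average degree:", average_degree(sparse), "vs", average_degree(dense))
print("most central node in dense graph:",
      max(degree_centrality(dense).items(), key=lambda kv: kv[1]))
```

Changing `p` and re-running is a quick way to see how edge density shifts these numbers.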
W/r/t the combination of ML and graph theoretic techniques, here's my limited personal experience. I use ML in my day-to-day work and graph theory less often, but rarely together. This is just an empirical observation limited to my personal experience, so it may simply be that I haven't found a problem in which it has seemed natural to combine techniques from these two domains. Most often graph theoretic analysis is useful in ML's blind spot, which is the lack of a substantial amount of labeled training data; supervised ML techniques depend heavily on having it.
One class of problems that illustrates this point is online fraud detection/prediction. It's almost never possible to gather data (e.g., sets of online transactions attributed to a particular user) that you can, with reasonable certainty, separate and label as "fraudulent account." If the fraudsters were particularly clever and effective, you will mislabel their accounts as "legitimate"; and for accounts where fraud was suspected, the first-level diagnostics (e.g., additional id verification or an increased waiting period to cash out) are quite often enough to make them cease further activity, which is the very activity that would have allowed a definite classification. Finally, even if you somehow manage to gather a reasonably noise-free data set for training your ML algorithm, it will certainly be seriously unbalanced (i.e., many more "legitimate" than "fraud" data points); this problem can be managed with statistical pre-processing (resampling) and by algorithm tuning (weighting), but it's still a problem that will likely degrade the quality of your results.
So while I have never been able to successfully use ML techniques for these types of problems, in at least two instances I have used graph theory with some success; in the most recent instance, by applying a model adapted from a project by a group at Carnegie Mellon that was initially directed at detecting online auction fraud on eBay.
MacArthur Genius Grant recipient and Stanford Professor Daphne Koller co-authored a definitive textbook on Bayesian networks entitled Probabilistic Graphical Models, which contains a rigorous introduction to graph theory as applied to AI. It may not exactly match what you're looking for, but in its field it is very highly regarded.
You can attend free online classes at Stanford for machine learning and artificial intelligence:
https://www.ai-class.com/
http://www.ml-class.org/
The classes are not simply focused on graph theory, but include a broader introduction in the field and they will give you a good idea of how and when you should apply which algorithm. I understand that you've read the introductory books on AI and ML, but I think that the online classes will provide you with a lot of exercises that you can try.
Although this is not an exact match to what you are looking for, textgraphs is a workshop that focuses on the link between graph theory and natural language processing. Here is a link. I believe the workshop also generated this book.
I've always been curious as to how these systems work. For example, how do Netflix or Amazon determine what recommendations to make based on past purchases and/or ratings? Are there any algorithms to read up on?
Just so there are no misperceptions here, there's no practical reason for me asking. I'm just asking out of sheer curiosity.
(Also, if there's an existing question on this topic, point me to it. "Recommendations system" is a difficult term to search for.)
At its most basic, most recommendation systems work by saying one of two things.
User-based recommendations:
If User A likes Items 1,2,3,4, and 5,
And User B likes Items 1,2,3, and 4
Then User B is quite likely to also like Item 5
Item-based recommendations:
If Users who purchase item 1 are also disproportionately likely to purchase item 2
And User A purchased item 1
Then User A will probably be interested in item 2
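The two rules above can be sketched in a few lines. This is a toy illustration only: the `likes` data, the third user "C", and the function names `user_based` and `item_based` are made up for the example, and real systems use far more careful similarity measures.

```python
from collections import Counter

# Toy data matching the example: which items each user likes.
likes = {
    "A": {1, 2, 3, 4, 5},
    "B": {1, 2, 3, 4},
    "C": {6, 7},
}


def user_based(target, likes):
    """Recommend items liked by the user who shares the most items with target."""
    mine = likes[target]
    others = [user for user in likes if user != target]
    nearest = max(others, key=lambda user: len(likes[user] & mine))
    if not likes[nearest] & mine:
        return []                     # nobody overlaps with the target at all
    return sorted(likes[nearest] - mine)


def item_based(target, likes):
    """Rank unseen items by how often they co-occur with the target's items."""
    mine = likes[target]
    counts = Counter()
    for user, items in likes.items():
        if user != target and items & mine:
            counts.update(items - mine)   # items bought alongside the target's items
    return [item for item, _ in counts.most_common()]


print(user_based("B", likes))   # items User B has not seen but User A likes
print(item_based("B", likes))
```

With this data both functions suggest Item 5 to User B, which is the conclusion the first rule draws by hand.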
And here's a brain dump of algorithms you ought to know:
- Set similarity (Jaccard index & Tanimoto coefficient)
- n-Dimensional Euclidean distance
- k-means algorithm
- Support Vector Machines
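For the first two entries in that list, the definitions are short enough to write out directly. This is a minimal sketch; returning 0.0 for two empty sets is a convention chosen here, not part of the definition.

```python
import math


def jaccard(a, b):
    """Jaccard index of two sets: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def euclidean(p, q):
    """Straight-line distance between two equal-length vectors (e.g. ratings)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


# Users A and B from the example above share 4 of 5 distinct items.
similarity = jaccard({1, 2, 3, 4, 5}, {1, 2, 3, 4})

# Two users' ratings of the same three items.
distance = euclidean((5, 3, 0), (1, 0, 0))
```

A high Jaccard index or a small Euclidean distance both mean "these two users (or items) look alike", which is the signal the rules above rely on.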
This is such a commercially important application that Netflix introduced a $1 million prize for improving their recommendations by 10%.
After a couple of years people are getting close (I think they're up around 9% now) but it's hard for many, many reasons. Probably the biggest single factor behind the initial improvements in the Netflix Prize was the use of a statistical technique called singular value decomposition.
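To give a feel for the idea, here is a minimal sketch of the kind of low-rank matrix factorization that is often loosely called "SVD" in the recommender context. Note that this is not a literal singular value decomposition: it fits user and item factor vectors to the known ratings by stochastic gradient descent, which is an assumption on my part about the flavour of technique meant. The ratings, hyperparameters and function names are all made up for illustration.

```python
import random


def factorize(ratings, n_users, n_items, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Fit k-dimensional user factors P and item factors Q to (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(steps):
        for u, i, r in ratings:
            # Error of the current prediction for this known rating.
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularisation on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    """Predicted rating: dot product of the user's and the item's factor vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    """Root mean squared error over the known ratings."""
    total = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (total / len(ratings)) ** 0.5


# Toy data: (user index, item index, rating on a 1-5 scale).
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1),
           (2, 1, 1), (2, 2, 5), (3, 0, 1), (3, 2, 4)]

P0, Q0 = factorize(ratings, n_users=4, n_items=3, steps=0)   # untrained factors
P, Q = factorize(ratings, n_users=4, n_items=3)              # trained factors
print("RMSE before:", rmse(ratings, P0, Q0), "after:", rmse(ratings, P, Q))
print("predicted rating of user 0 for unseen item 2:", predict(P, Q, 0, 2))
```

The point of the factorization is the last line: once every user and item has a factor vector, you can predict ratings for user-item pairs that were never observed.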
I highly recommend you read If You Liked This, You’re Sure to Love That for an in-depth discussion of the Netflix Prize in particular and recommendation systems in general.
Basically though the principle of Amazon and so on is the same: they look for patterns. If someone bought the Star Wars Trilogy well there's a better than even chance they like Buffy the Vampire Slayer more than the average customer (purely made up example).
The O'Reilly book "Programming Collective Intelligence" has a nice chapter showing how it works. Very readable.
The code examples are all written in Python, but that's not a big problem.
GroupLens Research at the University of Minnesota studies recommender systems and generously shares their research and datasets.
Their research expands a bit each year and now considers specifics like online communities, social collaborative filtering, and the UI challenges in presenting complex data.
The Netflix algorithm for its recommendation system is actually a competitive endeavor in which programmers continue to compete to make gains in the accuracy of the system.
But in the most basic terms, a recommendation system would examine the choices of users who closely match another user's demographic/interest information.
So if you are a white male, 25 years old, from New York City, the recommendation system might try and bring you products purchased by other white males in the northeast United States in the age range of 21-30.
Edit: It should also be noted that the more information you have about your users, the more closely you can refine your algorithms to match what other people are doing to what may interest the user in question.
This is a classification problem - that is, the classification of users into groups of users who are likely to be interested in certain items.
Once classified into such a group, it is easy to examine the purchases/likes of other users in that group and recommend them.
Therefore, Bayesian classification, neural networks (multilayer perceptrons, radial basis function networks), and support vector machines are worth reading up on.
One technique is to group users into clusters and recommend products from other users in the same cluster.
There are two main types of recommender systems, which work differently:
1. Content-based.
These systems make recommendations based on characteristic information. This is information about the items (keywords, categories, etc.) and users (preferences, profiles, etc.).
2. Collaborative filtering.
These systems are based on user-item interactions. This is information such as ratings, number of purchases, likes, etc.
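To show how the first type differs in practice, here is a minimal sketch of a content-based recommender: it looks only at item keywords and a user's stated preferences, never at what other users did. The catalogue, the keywords and the function name `content_based` are toy assumptions for the example.

```python
# Toy catalogue: each item is described by a set of keywords.
items = {
    "Star Wars": {"sci-fi", "space", "adventure"},
    "Alien": {"sci-fi", "space", "horror"},
    "Notting Hill": {"romance", "comedy"},
}


def content_based(profile, items, seen=()):
    """Rank unseen items by the share of their keywords that match the user's profile."""
    scores = {
        title: len(tags & profile) / len(tags)
        for title, tags in items.items()
        if title not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)


# A user who likes sci-fi adventure and has already seen Star Wars.
ranking = content_based({"sci-fi", "adventure"}, items, seen={"Star Wars"})
```

A collaborative filter would instead ignore the keywords entirely and work from the ratings or purchases of many users.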
This article (published by the company I work at) provides an overview of the two systems, some practical examples, and suggests when it makes sense to implement them.
Of course there are algorithms that will recommend your preferred items. Different data mining techniques have been implemented for that. If you want more basic details on recommender systems, visit this blog; it covers the basics you need to know about them.