What is the Rails way to log visits in order to collect data for a recommendation engine?

For a summer internship, I have been asked to collect specific data about the pages a user visits on the startup's website.
To simplify things, we can think of the website as a dating site, where each user has their own profile page and is tagged under certain categories (hair color, city, etc.).
I would like to know the best way, in the Rails framework, to keep track of each visit a user makes to a profile or a tag page.
Should it be logged to a file or stored in a database, and where exactly in the code should the functions be called?
Maybe a gem already exists for this specific purpose?
The question is about both where the functions should be called in Rails and how the data should be stored, because the ultimate goal is to build a recommendation system.

There is a wide range of options available to you. I'd recommend one of the following:
Instrument detailed logging of the relevant controller actions. Periodically run a rake task that aggregates data from the log files and makes it available to your relevance engine.
Use a key/value store such as Redis to increment user/action specific counters during requests. Your relevance engine can query this store for the required metrics. Again, periodic aggregation of metrics is advised.
Both approaches lend themselves well to before_filter statements (named before_action in Rails 4 and later). You can inspect the incoming params before the controller action executes and collect the statistics transparently.
I wouldn't recommend using a relational database to store the raw data.
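As a rough illustration of the counter approach, here is a minimal plain-Ruby sketch. The VisitCounter class, its key format, and the in-memory Hash standing in for Redis are all assumptions made for this example; in a real application the record call would live in a before_filter, and the Hash would be replaced by increments against the key/value store.

```ruby
# Hypothetical sketch: per-user visit counters. A Hash stands in for a
# key/value store such as Redis, where an atomic increment would be used.
class VisitCounter
  def initialize(store = Hash.new(0))
    @store = store
  end

  # Build a key such as "visits:user:42:profile:7" (format is an assumption).
  def key_for(user_id, resource_type, resource_id)
    "visits:user:#{user_id}:#{resource_type}:#{resource_id}"
  end

  # Increment the counter for one visit and return the new count.
  def record(user_id, resource_type, resource_id)
    @store[key_for(user_id, resource_type, resource_id)] += 1
  end

  # Read the current count without modifying the store.
  def count(user_id, resource_type, resource_id)
    @store[key_for(user_id, resource_type, resource_id)]
  end

  # Aggregate one user's counts by resource type, as a periodic job might.
  def totals_for(user_id)
    prefix = "visits:user:#{user_id}:"
    totals = Hash.new(0)
    @store.each do |key, value|
      next unless key.start_with?(prefix)
      resource_type = key.delete_prefix(prefix).split(":").first
      totals[resource_type] += value
    end
    totals
  end
end

counter = VisitCounter.new
counter.record(42, "profile", 7)
counter.record(42, "profile", 7)
counter.record(42, "tag", "paris")
```

The relevance engine would then read either the individual counters or the aggregated totals, rather than scanning raw request data.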

Related

Seeking Advice For Oracle Data-Intensive Application

I'm developing an application that uses Oracle as the database back-end. The application will calculate several statistics from the various tables in the database. The front-end will most likely be a web application that displays various charts and calculated statistics. I imagine it would be more efficient to perform the calculations in the database rather than in the service layer, because in the service layer the calculations would need to be performed for every web request. That being the case, I'm not sure which mechanism to use (e.g. stored procedure, function, view).
To illustrate what I'm going for, suppose I want to keep statistics of student grades for many students. I would like to have a web interface that lets me view those statistics on a student-by-student basis and also on an all-inclusive basis. Some of the stats depend on aggregates (e.g. average, min, max) of all of the student grades, and some depend only on an individual student. In this situation, every time a record is added or updated, the aggregates would have to be recalculated.
So I am speculating that if I had a special table holding all of the calculated values I need, plus one or more triggers to recalculate everything when a record is added or updated, then all the service layer would need to do for a web request is pull the desired values from this special table. I'm not sure whether this is the best way to go, so I am asking the community for any input or advice.
Note: Although I'm using Oracle, I'm open to using PostgreSQL or MySQL.
Thanks in advance

The scenario you are describing would be ideal for using materialized views. They can be designed to refresh automatically (and incrementally) every time the source data is updated by your application. The calculations would be built in to the view definition. No triggers required, and likely no stored procedures unless your calculations involve multiple steps. Check here: https://oracle-base.com/articles/misc/materialized-views and here: https://medium.com/oracledevs/lightning-fast-sql-with-real-time-materialized-views-12-things-developers-will-love-about-oracle-54bcc9eac358 for more info.

Data integration for Magento to Quick Book

I'm new to Talend and am learning through videos and documentation, so I'm not sure how to implement this according to best practices.
Goal
Integrate Magento and QuickBooks using Talend.
My thoughts
My first thought was to set up a direct DB connection to Magento, pull the relevant data, process it, and send it to QuickBooks using its REST APIs (specifically the bulk APIs, in batches).
But then I thought that querying the Magento database would be cumbersome (multiple joins), so another option is to use Magento's REST API.
As I'm not very familiar with the tool, I'm struggling to find the most suitable approach, so any help is appreciated.
What I've done so far
I've saved my auth (for QB) and DB (Magento) credentials in a file, and I load them into context variables using tFileInputDelimited and tContextLoad so they are accessible globally.
I've successfully configured the database connection and the DB input, but I have not used metadata for the connection (should I use it, and if so, how can I pass dynamic values there?). I've used my context variables in the DB connection settings.
I've selected the relevant fields for now, but if I want more fields a simple query is not enough, because Magento stores data such as customers across multiple tables. That is not a big deal, but I think it might increase my work.
That's what I've built so far. My next step is to send the data to QB over REST, obtaining the access_token and saving it to a context variable, and then to store the QB reference back in the Magento DB.
I've also decided to use the QB bulk APIs, but I'm not sure how to process data in chunks in Talend (I checked multiple resources but had no luck). For example, if Magento returns 500 rows, I want to process them in chunks of 30, since the QB batch limit is 30. I will send each chunk to QB over REST, and as I said, I also want to store the QB reference ID back in Magento (so I can update it later).
Also, all of this runs locally for now. How can I do the same in production, and how can I maintain separate development and production environments?
Resources I'm referring to
For REST and Auth best practices - https://community.talend.com/t5/How-Tos-and-Best-Practices/Using-OAuth-2-0-with-Talend-to-Access-Goo...

Nice example for batch processing here:
https://community.talend.com/t5/Design-and-Development/Batch-processing-in-talend-job/td-p/51952
Redirect your input to a tFileOutputDelimited.
Enter the output filename, tick the option "Split output in several files" under "Advanced settings", and enter the value 1000 in the field "Rows in each output file". This will create n files based on the filename, with 1000 rows in each.
On the next subjob, use a tFileList to iterate over this file list to get records from each file.
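The chunking arithmetic itself is independent of Talend. As an illustration only (this is Ruby, not Talend code), the sketch below splits 500 rows into batches of 30, the batch limit quoted in the question. The row values are made up, and the loop body marks where a real job would call the bulk API.

```ruby
# Hypothetical data: 500 row ids standing in for the Magento result set.
rows = (1..500).to_a

# Batch limit taken from the question; adjust to the API's actual limit.
BATCH_LIMIT = 30

# each_slice yields consecutive groups of at most BATCH_LIMIT rows.
batches = rows.each_slice(BATCH_LIMIT).to_a

batches.each_with_index do |batch, index|
  # A real job would send `batch` to the bulk API here and store the
  # returned reference IDs back in the source system.
  puts "batch #{index + 1}: #{batch.size} rows"
end
```

With 500 rows this yields 17 batches: 16 full batches of 30 and a final batch of 20. In the Talend job above, the equivalent is setting "Rows in each output file" to 30 instead of 1000.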

Getting into designing dashboards and need some help identifying each technical layer along the way

I am about to start designing a dashboard that will display KPIs and other relevant information for my team. Since I am in the early stages of this project and am not very familiar with the technical process behind designing a dashboard, I would like some questions answered before I go shopping for solutions, to avoid reinventing the wheel.
Here are some of my questions:
We want a dashboard that can provide live-time information via our data sources (or as close to live-time as possible). What function allows a dashboard to update itself with concurrent datasources? From a conceptual standpoint, I can understand creating a dashboard out of Microsoft Excel, and having the dashboard dependent on the values you may have set within your pivot table.
How do you make a dashboard request information from multiple datasources on its own? As in the Excel example, a user may have to go into the pivot tables to update values, but I want to know how a dashboard would request this by itself, and what the exact method is from a programming standpoint. Does the code execute every time you refresh the webpage?
How do you create datasources organically? I know that some solutions, such as SharePoint BI Center, have pre-supported datasources like an Excel sheet or SharePoint, and it's as easy as uploading your document and letting the design handle the rest. However, I know there will be some datasources that need to be fetched. Do I need to understand something else, like an event recorder, in order to handle this?

Introduction
The dashboard (or a report, respectively) is usually the result of a long chain of steps. Very much simplified it could look like this:
src1  ------\                   /---- Dashboards
src2  ------+---[DWH]---[BR]---+
src n ------/     |             \---- Reports etc.
              [Big Data]
Keep in mind, this is only a very, very simple structure of a data backend / frontend.
DWH means Data Warehouse, where data might be stored temporarily (you referred to this as fetching). This could be a database, could be a Big Data engine, could be a combination of both...
Afterwards, there are Business Rules (BR). Those might be specific rules in how different departments calculate and relate to data, but also simple things like algebra.
Questions
So, the main question should not be about the technology:
What software should we choose?
How can we create a dashboard?
but should instead focus on your business processes (think of it as a top-down view):
What does our core process look like? Where would I like to measure data?
How would department A calculate sales differently from department B? Should everyone use the same rule?
Where does everyone store the data? Can we access it? Do we need structural data?
And, very easy to forget but often one of the biggest issues: is the identifier of a business object (say, a sales ID) built and formatted the same way everywhere?
Conclusion
When those questions are at least in the back of your head and you keep working in this direction, more or less automatically data will spill out at certain points of that process.
Then it won't matter whether you use Excel, a small-to-medium app like Tableau, Tibco Spotfire, QlikView, or Power BI, or go full scale with a big Hadoop backend, databases, and JasperReports, Apache Drill, Pentaho, or SSIS on top of it: the data will come out eventually.
TL;DR
Focus on the processes first. Make sure to understand them. Draft in Excel. Then proceed in getting the data and the tools you need to help your use cases. It will work out much better from a "top-down" approach than trying to solve your requirements with tools only.

rails algorithm visitors count

Which is the best way to implement visitor's logic?
Create a visitors table |ip|resource_type|resource_id|
Create a serialized field in the records (Post, Pet, Event, Ad, etc.)
Use a NoSQL solution
Any other idea
In the 1st case, the table grows with every visit.
In the 2nd, we end up with a very long field.
In the 3rd, I have had trouble with Mongoid in production (CentOS).

I'm not sure this answers the question, but I would not implement this myself; I would look at existing solutions instead. For basic counting:
Vanity
Google Analytics
For more detailed metrics about what each user does, I would look at cohort.
A completely different option is to use just the logs, with something like lograge to log each request. It is very easy to add fields (such as the IP). You can then extract all the information from your logs.
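To make the log-based option concrete, here is a small sketch that counts distinct visitor IPs per path from key=value log lines of the kind lograge emits. The sample lines, the custom ip field, and the helper names parse_line and unique_visitors are assumptions for illustration; a real log would be read from a file and may carry different fields.

```ruby
# Sample lines in a key=value style; the `ip` field and all values are
# made up for this example.
LOG_LINES = [
  "method=GET path=/posts/1 format=html controller=PostsController action=show status=200 duration=12.5 ip=10.0.0.1",
  "method=GET path=/posts/1 format=html controller=PostsController action=show status=200 duration=9.8 ip=10.0.0.2",
  "method=GET path=/posts/1 format=html controller=PostsController action=show status=200 duration=11.1 ip=10.0.0.1",
  "method=GET path=/pets/3 format=html controller=PetsController action=show status=200 duration=7.4 ip=10.0.0.1"
].freeze

# Turn "key=value key2=value2" into a Hash of strings.
def parse_line(line)
  line.split(" ").each_with_object({}) do |pair, fields|
    key, value = pair.split("=", 2)
    fields[key] = value if value
  end
end

# Count distinct IPs per path, skipping lines that lack either field.
def unique_visitors(lines)
  visitors = Hash.new { |hash, path| hash[path] = [] }
  lines.each do |line|
    fields = parse_line(line)
    next unless fields["path"] && fields["ip"]
    visitors[fields["path"]] << fields["ip"]
  end
  visitors.transform_values { |ips| ips.uniq.size }
end

counts = unique_visitors(LOG_LINES)
```

Because the counting happens offline, this adds no table growth and no extra write per request, which addresses the concern raised about the visitors table.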

Database-driven routing in Rails 3

I'd like to ask a theoretical question about routing and using a database to look up specific routes.
Let's say that I have a product system. I know all about to_param and how it works (thanks to Ryan Bates about a zillion years ago) and even that I can take the ID out of the param if I set it statically on the model itself.
The problem is that I want to take the ID out of it for search engine optimization and other purposes. I can't put it on the Product model itself, because the slug has the tendency to change with product changes, etc. Therefore, I need to keep slugs in something, e.g. a Route model. I have a table with a product_id column, and a slug column. The slug is indexed, so it would be fast for Rails to look up.
I have a few concerns, though:
Is this even the best way to do this?
Is there a way to cache the slugs in memory so that there isn't a roundtrip to the database every time?
Will my database take a huge hit if it is just looking up a simple slug in a table which is indexed? I suppose I will need to do some performance testing on this myself, but I am curious as to whether anyone else has ever measured this before.
Thanks!

Use http://memcached.org/
Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering.
Memcached is simple yet powerful. Its simple design promotes quick deployment, ease of development, and solves many problems facing large data caches. Its API is available for most popular languages.
Ruby gem is http://rubygems.org/gems/memcache
You can also pick any of the gems listed here: http://rubygems.org/search?utf8=%E2%9C%93&query=memcache
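The caching idea can be sketched as a read-through cache: look in memory first, and only fall back to the database on a miss. The SlugCache class, the ROUTES hash standing in for the routes table, and the sample slugs are all hypothetical. In a Rails app backed by memcached, Rails.cache.fetch would typically play the role of fetch here.

```ruby
# Hypothetical read-through cache for slug -> product_id lookups.
class SlugCache
  attr_reader :misses

  # The block is the slow lookup (e.g. a database query) used on a miss.
  def initialize(&lookup)
    @lookup = lookup
    @cache = {}
    @misses = 0
  end

  # Return the cached value, or run the lookup once and remember the result.
  def fetch(slug)
    return @cache[slug] if @cache.key?(slug)
    @misses += 1
    @cache[slug] = @lookup.call(slug)
  end

  # Drop an entry, for example when a product's slug changes.
  def invalidate(slug)
    @cache.delete(slug)
  end
end

# Stand-in for the indexed routes table (slug, product_id).
ROUTES = { "blue-widget" => 1, "red-widget" => 2 }.freeze

slug_cache = SlugCache.new { |slug| ROUTES[slug] }
slug_cache.fetch("blue-widget") # first call runs the lookup
slug_cache.fetch("blue-widget") # second call is served from memory
```

Since slugs change when products change, the invalidation step matters as much as the lookup: a stale entry would route an old slug to the wrong place until it is dropped or expires.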
