Golang Cache HTTP GET Results In Memory

I am working on a CLI in Go that scrapes a webpage to collect the href attributes of all the links on the page into a slice. I want to store this slice in memory for some time so that the scraper is not being called on every execution of the CLI command. Ideally, the scraper would only be called after the cache expires or the user provides some sort of --update flag.
I came across the library go-cache and other similar libraries, but from what I could tell they only work for something that is continuously running, like a server.
I thought about writing the links to a file, but then how would I expire the results after a specific duration? Would it make sense to create a small server in the background that shuts down after a while in order to use a library like go-cache? Any help is appreciated.

There are two main approaches in these scenarios:
Create a daemon, service or background application that acts as your data repository. You can run it as an HTTP server / RPC server depending on your requirements. Your CLI application then interacts with this daemon as required;
Implement a persistence mechanism that allows data to be written and read across multiple CLI application executions. You may use plain text files, databases, or even Go's encoding/gob to write and read your slice (a map would probably be better) to and from a binary file.
You can timestamp entries and expire them after their TTL, either by explicitly deleting them or simply by not rewriting them during subsequent executions, according to the strategy / approach selected above.
The scope and number of possible examples for such an open-ended question is too broad for a single answer and will most likely require multiple specific questions, but a minimal sketch of the second approach is below.
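A minimal sketch, assuming the cache lives in a gob-encoded file next to the CLI (the file path and TTL are arbitrary):

```go
package main

import (
	"encoding/gob"
	"errors"
	"os"
	"time"
)

// cacheEntry pairs the scraped links with the time they were fetched.
type cacheEntry struct {
	FetchedAt time.Time
	Links     []string
}

// save writes the links plus a timestamp to path using encoding/gob.
func save(path string, links []string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(cacheEntry{FetchedAt: time.Now(), Links: links})
}

// load reads the entry back; it returns an error if the file is missing
// or the entry is older than ttl, in which case the caller should re-run
// the scraper.
func load(path string, ttl time.Duration) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	var e cacheEntry
	if err := gob.NewDecoder(f).Decode(&e); err != nil {
		return nil, err
	}
	if time.Since(e.FetchedAt) > ttl {
		return nil, errors.New("cache expired")
	}
	return e.Links, nil
}
```

The CLI would try load first and fall back to running the scraper (or be forced to by an --update flag) whenever it returns an error.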

Use a database and store as much detail as you can (fetched_at, host, path, title, meta_desc, anchors, etc.). You'll be able to query the data later, and it will be useful to have it in a structured format. If you don't want to deal with a DB dependency, you could embed something like boltdb (pure Go) or SQLite (cgo).
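As a rough illustration of the embedded route, here is a sketch using bbolt, the maintained fork of boltdb (the bucket name, key scheme, and record fields are made up for the example):

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// page is an illustrative record; store as much detail as you can.
type page struct {
	FetchedAt time.Time `json:"fetched_at"`
	Host      string    `json:"host"`
	Path      string    `json:"path"`
	Title     string    `json:"title"`
	Anchors   []string  `json:"anchors"`
}

func main() {
	db, err := bolt.Open("scrape.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	p := page{FetchedAt: time.Now(), Host: "example.com", Path: "/"}
	err = db.Update(func(tx *bolt.Tx) error {
		// One bucket for all scraped pages, keyed by host+path.
		b, err := tx.CreateBucketIfNotExists([]byte("pages"))
		if err != nil {
			return err
		}
		v, err := json.Marshal(p)
		if err != nil {
			return err
		}
		return b.Put([]byte(p.Host+p.Path), v)
	})
	if err != nil {
		log.Fatal(err)
	}
}
```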

Related

Off-Chain Worker Framework

I haven’t entirely given up on the idea of validators moonlighting as oracles for off-chain computation…based on this extensive discussion: https://gov.near.org/t/off-chain-computation-framework/1400/6
So far from studying Sputnik’s code, I have figured out the mechanics of how to upload a blob to a smart contract. Let's say that a blob represents a storage-less contract, having only stateless functions that act only on input to the function, and return those inputs modified.
Now I’m missing the piece of how Validators can download and execute the blob. As mentioned by Ilya in the link above, the NearSDK would be able to interpret the blob (if the blob is essentially a compiled contract), but it needs to be a modified version of the SDK...
Think of this like sandbox mode…blob cannot modify state of any other contract, but can read state (forget about the internet access part for now). Results of the blob execution are then fed back to a smart contract, where they have to match the results of every other validator who executed the blob. This can be done by hash comparison (rather than looping through the results individually), so it’s not an expensive comparison, especially because it’s all or nothing.
Question: how can a Validator download the blob and execute it via a sandboxed SDK, and post the result via the regular SDK to the blockchain? I am missing a lot of architectural context…and this is bringing me to the edge of giving up on the idea. Please help prevent that from happening!
If you are implementing this as a separate binary, your binary will need to do the following:
Use RPC to load the WASM file from the blockchain (see the RPC reference; a sketch follows this list).
Use runtime-standalone to run this WASM with specific inputs. An example of using runtime-standalone is here, but you will need to customize it with a few things.
The result should be sent as a transaction signed by this binary, again via RPC.
If you want these WASM files to have access to state, you will need to load state inside this binary. There are two options:
Modify a nearcore node to also do the above items
Run nearcore in parallel, and open the database read-only when you are initializing the Trie (e.g. here, load from disk instead).
If you want to add more host functions (like accessing internet), you will need to fork runtime-standalone to expose those functions.
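As a hedged sketch of the first step only, in Go: assuming the standard NEAR JSON-RPC query method with request_type view_code (the endpoint and account ID below are placeholders), fetching the blob could look roughly like this:

```go
package main

import (
	"bytes"
	"encoding/base64"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// viewCodeResult is the subset of the RPC response we care about.
type viewCodeResult struct {
	Result struct {
		CodeBase64 string `json:"code_base64"`
	} `json:"result"`
}

func main() {
	// Placeholder endpoint and account; adjust for your network.
	body, _ := json.Marshal(map[string]interface{}{
		"jsonrpc": "2.0",
		"id":      "dontcare",
		"method":  "query",
		"params": map[string]string{
			"request_type": "view_code",
			"finality":     "final",
			"account_id":   "blob-holder.testnet",
		},
	})
	resp, err := http.Post("https://rpc.testnet.near.org", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var r viewCodeResult
	if err := json.NewDecoder(resp.Body).Decode(&r); err != nil {
		log.Fatal(err)
	}
	wasm, err := base64.StdEncoding.DecodeString(r.Result.CodeBase64)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes of WASM\n", len(wasm))
	// Next steps: hand wasm to a sandboxed runtime (runtime-standalone)
	// and post the result back as a signed transaction via RPC.
}
```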

Is it advisable to use Redis or Memcached as a cache for FILES?

I have multiple configuration files which I need to read from disk and apply to many records.
I need to improve this to increase performance.
I have two processes.
Process1: Update Configuration:
This updates content configuration files.
This can run from multiple locations.
Process2: Apply Configuration:
This uses content of configuration files.
This can run from multiple locations.
At present, this uses direct file and network I/O to read the updated configuration files.
Both processes are back-end and there is no browser involved here.
Should I use Redis or Memcached as a cache for FILES?
Note that the files need to be read from a common location. They are updated by another background process, and an update can happen at any time. The configuration files are 1 KB to 10 KB in size.
I want Process2 to access the updated configuration files in the fastest way possible.
Redis is a good choice, as it keeps data in memory with optional persistence, so this approach never has to touch the hard drive.
The problem I can see here is that every client needs to understand Redis and use some client library, e.g. in Java or whatever language you use.
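For completeness, the Redis route is only a few lines from Go with the go-redis client (the address and key name are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

	// Process1: push the updated configuration file into Redis.
	data, err := os.ReadFile("app.conf")
	if err != nil {
		log.Fatal(err)
	}
	if err := rdb.Set(ctx, "config:app.conf", data, 0).Err(); err != nil {
		log.Fatal(err)
	}

	// Process2: read the latest configuration straight from memory.
	cfg, err := rdb.Get(ctx, "config:app.conf").Bytes()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("loaded %d bytes of config\n", len(cfg))
}
```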
Why not use HTTP itself, e.g. deploy an HTTP file server? You can also add version checking and caching: the client stores the version of the file it got from the server, reuses its cached content while the server still has the same file, and downloads it only when it has changed. This is what the HEAD method is for; see http://www.tutorialspoint.com/http/http_methods.htm
You would just be using the same approach the web itself uses. Every browser downloads content: HTML, CSS, images, etc. The best improvement for you is client-side caching, e.g. CSS and images are stored in the browser's cache and downloaded only on first access or when they change.
And if you want, you can take this all the way to exactly the REST approach itself.
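To make the HTTP idea concrete, here is a sketch in Go of a conditional GET using ETag/If-None-Match, which achieves the same version check without a separate HEAD round trip (the URL is a placeholder):

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// cached holds the last known version of the file.
var cached struct {
	etag string
	body []byte
}

// fetchConfig re-downloads the file only when the server reports a new
// version; otherwise it serves the locally cached copy.
func fetchConfig(url string) ([]byte, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if cached.etag != "" {
		req.Header.Set("If-None-Match", cached.etag)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusNotModified {
		return cached.body, nil // unchanged: use the cache
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	cached.etag = resp.Header.Get("ETag")
	cached.body = body
	return body, nil
}

func main() {
	cfg, err := fetchConfig("http://config-server.local/app.conf")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("config: %d bytes\n", len(cfg))
}
```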

Write to shared txt file or DB table from web service?

I am developing a web service that will be invoked (using JSON) from the client side each time the selection of a drop-down changes.
The goal is to register each "intermediate" change (on the client side) using the "OnSelectedIndexChanged" event, before the form is submitted to the server.
Each new selected value will be written to a shared txt file by calling the relevant web method via Ajax/JSON.
Would it be better to write these changes to a txt file (having to implement a lock/unlock policy to assure exclusive access) or rather define a DB table and save the changes there?
Every day the web app will have around 10 to 20 active users who might change the DropDownLists; usually the right value is selected on the first try, so generally no more than one "intermediate" entry would be registered.
Thanks.
Don't use the filesystem. It's slow. Use MongoDB via a Node.js web server.
http://howtonode.org/express-mongodb
Good Luck!
This sounds exactly like what you would want to use a database for, since ACID is already implemented there.
If you want a real headache (and a programming challenge!) trying to debug overlapping writes, resource starvation and deadlocks, by all means, go with a shared text file!
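Whatever the stack, the table can be trivial. A sketch in Go with database/sql (the table layout and SQLite driver are illustrative; the same single-row INSERT applies from an ASP.NET web method):

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3" // illustrative driver; any SQL DB works
)

func main() {
	db, err := sql.Open("sqlite3", "changes.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One row per intermediate selection change; the database handles
	// concurrent writers, so no lock/unlock policy is needed.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS selection_changes (
		user_id    TEXT,
		control_id TEXT,
		value      TEXT,
		changed_at TIMESTAMP
	)`)
	if err != nil {
		log.Fatal(err)
	}

	_, err = db.Exec(
		`INSERT INTO selection_changes (user_id, control_id, value, changed_at)
		 VALUES (?, ?, ?, ?)`,
		"user42", "ddlCountry", "Italy", time.Now(),
	)
	if err != nil {
		log.Fatal(err)
	}
}
```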

proxy for scale, performance (to load external content)?

I am sure the answer to this question will be very subjective; I simply want to know what the options are out there (for building a proxy to load external content).
Typically I use cURL in PHP and pass a variable like proxy.url to fetch content, then make an AJAX call with JavaScript to populate the contents.
EDIT:
YQL (Yahoo Query language) seems a very promising solution to me, however, it has a daily usage limit which essentially prevents me from using it for large scale projects.
What other options do I have? I am open to any language, any platform, key criteria are: performance and scalability.
Please share your ideas, thoughts and experience on this topic.
Thanks,
You don't need a proxy server or anything like that.
Just create a cronjob to fetch the contents every 5 minutes (or whenever you want).
All you need is a script, started by the cronjob, that grabs the content from the web and saves it (to a file, a database, ...).
When somebody requests your page, you just send the cached content out and do with it whatever you want.
I think scalability and performance will be no problem.
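The fetch script can be a few lines in any language; for example, a Go sketch (the URL and cache path are placeholders) that a crontab entry such as */5 * * * * /usr/local/bin/fetchcache would run:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Fetch the external content and save it locally; your page then serves
// the cached file instead of proxying the remote site on every request.
func main() {
	resp, err := http.Get("http://example.com/external-content")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	// Write atomically: dump to a temp file, then rename over the old
	// cache so readers never see a half-written file.
	tmp := "/var/cache/external-content.tmp"
	if err := os.WriteFile(tmp, body, 0644); err != nil {
		log.Fatal(err)
	}
	if err := os.Rename(tmp, "/var/cache/external-content"); err != nil {
		log.Fatal(err)
	}
}
```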
Depending on what you need to do with the content, you might consider Erlang. It's lightning fast, ridiculously reliable, and great for scaling.

Streaming, Daemons, Cronjobs, how do you use them? (in Ruby)

I've finally had a second to look into streaming, daemons, and cron tasks and all the neat gems built around them! But I'm not clear on how/when to use these things.
I have a few questions:
1) If I wanted to have a website that stayed constantly updated, realtime, with my Facebook friends' activity feeds, up-to-the-minute Amazon book reviews on my favorite books, and my Twitter feed, would I just create some custom streaming implementation using the Daemon gem, the ruby-yali gem for streaming the content, and the Whenever gem, which could, say, check those sites every 3-10 seconds to see if the content I'm looking for has changed? Is that how it would work? Or is it typically/preferably done differently?
2) Is (1) too processor-intensive? Is there a better way to do it, a better way for live content streaming, given that the website you want realtime updates from doesn't have a streaming API? I'm thinking about just sending a request every few seconds from a separate small Ruby app (with daemons and cronjobs), getting the JSON/XML result, using Nokogiri to remove the stuff I don't need, and then just going through the small list of comments/books/posts/etc., building a feed of what's changed, and using Juggernaut or something to push those changes to some Rails app. Would that work?
I guess it all boils down to the question:
How does real-time streaming of the latest content of some website work? How do YOU do it?
...so if someone is on my site, they can see in real time the new message or new book that just came out?
Looking forward to your answers,
Lance
Well, first: if a website doesn't provide an API, that is a strong indication that it may not be legal to parse and extract their data; you'd better check their terms of use and privacy policy.
Personally I'm not aware of something called a "streaming API", but supposing they have an API, you still need to pull the results it provides (XML, JSON, ...), parse them, and present them back to the user. The strategy will vary depending on your app type:
Desktop app: you can just pull the data directly, parse it, and present it to the user; many apps work that way, like Twhirl.
Web app: then you need to cut down the time spent extracting the data. Typically you will pull the data from the API and store it. However, storing the data is a bit tricky! You don't want your database to become a bottleneck under the heavy pull queries it is going to get to retrieve the data. One way to handle this is push methodology: follow option 2 in this case to get the data, then push it to the user. If you want instant updates, like chat for example, have a look at Orbited. If it's okay to save the data to some kind of user and followers' "inboxes", then the simplest way, as far as I can tell, is to use IMAP to send the updates to the user's inbox.
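The question is Ruby-flavored, but the polling loop looks the same in any language. A sketch in Go (the URL and interval are arbitrary) that detects changes by hashing the response, and would hand changed payloads to whatever push channel you use (Juggernaut, Orbited, ...):

```go
package main

import (
	"crypto/sha256"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	var lastHash [32]byte
	ticker := time.NewTicker(10 * time.Second) // arbitrary poll interval
	defer ticker.Stop()

	for range ticker.C {
		resp, err := http.Get("https://example.com/feed.json")
		if err != nil {
			log.Println("poll failed:", err)
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			log.Println("read failed:", err)
			continue
		}
		h := sha256.Sum256(body)
		if h != lastHash {
			lastHash = h
			// Content changed: parse/diff it here and push the update
			// to connected clients (Juggernaut, Orbited, ...).
			log.Printf("feed changed (%d bytes)", len(body))
		}
	}
}
```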
