How should I store data(100 x 2 array) in Gatsby? - graphql

I'm developing a map which loads 100 location pinpoints. To do so, I want to save double array of 100 rows and 2 column, which saves 100 locations.(latitude/longitude)
I am considering CMS or GraphQL-filesystem, but I don't know which to use. I want to to prioritize webpage load time, but if possible, it'd be nice to conceal my data.
Where should I store my data? Please give me advise. Other options are also welcome Thank you.

100 locations with latitude and longitude is still only a few KB of data. You do not need sophisticated data storage solutions for this.
The easiest way is to use your local filesystem i.e. an array and store it in a javascript file inside your project.
The next easiest way is to use any external solution that Gatsby allows. Take a look at their possible external sources.

Related

Making sure my Go page view counter isn't abused

I believe I have a found a very good and fast solution for efficiently counting page views:
Working example in go playground here: https://play.golang.org/p/q_mYEYLa1h
My idea is to push this to the database every X minutes, and after pushing a key then delete it from the page map.
My question now is, what would be the optimal way to ensure that this isn't abused? Ideally, I would only want to increase page count from the same person if there was a time interval of 2 hours since last visiting the page.
As far as I know, it would be ideal to store and compare both IP and user agent (I don't want to rely on cookie/localstorage), but I'm not quite sure how to efficiently store and compare this information.
I'd likely get both the IP (req.Header.Get("x-forwarded-for")) and UserAgent (req.UserAgent()) from http.Request.
I was thinking making a visitor struct similar to my page struct that would look like this:
type visitor struct {
mutex sync.Mutex
urlIPUAAndTime map[string]time
}
This way should make it possible to do something similar to before. However, imagine if the website had so many requests that there would be hundreds of millions of unique visitor maps being stored, and each of these could only be deleted after 2 (or more) hours. I therefore think this is not a good solution.
I guess it would be ideal/necessary to write to and read from some file, but not sure how this should be done efficiently. Help would be greatly appreciated
One of optimization ways is to add a Bloom filter before this map. Bloom filter is a probabilistic structure which can say one of these:
this user is definitely new
and this user possibly was here
This is a way to cut off computation on early stage. If many of your users are new then you save requests to database to check all of them.
What if structure says "user is possibly non-unique"? Then you go the database and check it.
Here's one more optimization: if you do not need very accurate information and can agree with mistake about several percent, you may use the sole bloom filter. I guess many large sites use this technique for estimation.

Loading 30000 documents from wikipedia

I have a wikipedia url and I want to load the content from that page and other referenced pages upto 30000 documents using wiki API, I can loop through the urls and do that but that is not an effiecient way of doing it. Is there any other way through which I can acheive this. I need this to populate my HDFS in hadoop.
You can download the wikimedia software and a database image, set up the wikipedia and access it locally. This is well described and should be a lot more efficient then requesting that number of pages through the net. see: http://www.igeek.co.za/2009/10/16/how-to-mirror-wikipedia/
There are also many other sources and also preprocessed pages. Here comes the question what you plan to do with the content in the next step.
There are a few ways to go about this. Toolserver users have direct database query access to all the metadata, but not text. If that suits you, you might be able to ask one of them to run a query through the query service. This is a pretty straight-forward way to find out what pages are linked, etc. and build a map of page ids or revision ids.
Otherwise, take a look at database dumps which are great for bulk work but will take some processing on your end.
Finally, Wikipedia is used to tons of bots and API scrapes. It's not ideal, but if nothing else suits you then run a timer that starts a new query once every second and you'll be done in 8 hours.
As it is said by Jeff and NilsB you have a wrong intent to crawl wikipedia for filling your HDFS. The right thing to do it is to download the whole wiki as a single file and to load it to HDFS.
But if we abstracted away from some details in your question it would transform into more general: How to crawl some sites specified by url using Hadoop?
So the answer is you should upload the file(s) with url to hdfs, write a mapper (accepting url, downloading a page and yielding it as key=url and value=page's body) and configure a Job to use NLineInputFormat for controlling the count of url each mapper'll process.
By control of that parameter you'll be able to control a level of parallelism through itself and the map slots count.

Transferring lots of objects with Guid IDs to the client

I have a web app that uses Guids as the PK in the DB for an Employee object and an Association object.
One page in my app returns a large amount of data showing all Associations all Employees may be a part of.
So right now, I am sending to the client essentially a bunch of objects that look like:
{assocation_id: guid, employees: [guid1, guid2, ..., guidN]}
It turns out that many employees belong to many associations, so I am sending down the same Guids for those employees over and over again in these different objects. For example, it is possible that I am sending down 30,000 total guids across all associations in some cases, of which there are only 500 unique employees.
I am wondering if it is worth me building some kind of lookup index that I also send to the client like
{ 1: Guid1, 2: Guid2 ... }
and replacing all of the Guids in the objects I send down with those ints,
or if simply gzipping the response will compress it enough that this extra effort is not worth it?
Note: please don't get caught up in the details of if I should be sending down 30,000 pieces of data or not -- this is not my choice and there is nothing I can do about it (and I also can't change Guids to ints or longs in the DB).
Your wrote at the end of your question the following
Note: please don't get caught up in the details of if I should be
sending down 30,000 pieces of data or not -- this is not my choice and
there is nothing I can do about it (and I also can't change Guids to
ints or longs in the DB).
I think it's your main problem. If you don't solve the main problem you will be able to reduce the size of transferred data to 10 times for example, but you still don't solve the main problem. Let us we think about the question: Why so many data should be sent to the client (to the web browser)?
The data on the client side are needed to display some information to the user. The monitor is not so large to show 30,000 total on one page. No user are able to grasp so much information. So I am sure that you display only small part of the information. In the case you should send only the small part of information which you display.
You don't describe how the guids will be used on the client side. If you need the information during row editing for example. You can transfer the data only when the user start editing. In the case you need transfer the data only for one association.
If you need display the guids directly, then you can't display all the information at once. So you can send the information for one page only. If the user start to scroll or start "next page" button you can send the next portion of data. In the way you can really dramatically reduce the size of transferred data.
If you do have no possibility to redesign the part of application you can implement your original suggestion: by replacing of GUID "{7EDBB957-5255-4b83-A4C4-0DF664905735}" or "7EDBB95752554b83A4C40DF664905735" to the number like 123 you reduce the size of GUID from 34 characters to 3. If you will send additionally array of "guid mapping" elements like
123:"7EDBB95752554b83A4C40DF664905735",
you can reduce the original size of data 30000*34 = 1020000 (1 MB) to 300*39 + 30000*3 = 11700+90000 = 101700 (100 KB). So you can reduce the size of data in 10 times. The usage of compression of dynamic data on the web server can reduce the size of data additionally.
In any way you should examine why your page is so slowly. If the program works in LAN, then the transferring of even 1MB of data can be quick enough. Probably the page is slowly during placing of the data on the web page. I mean the following. If you modify some element on the page the position of all existing elements have to be recalculated. If you would be work with disconnected DOM objects first and then place the whole portion of data on the page you can improve the performance dramatically. You don't posted in the question which technology you use in you web application so I don't include any examples. If you use jQuery for example I could give some example which clear more what I mean.
The lookup index you propose is nothing else than a "custom" compression scheme. As amdmax stated, this will increase your performance if you have a lot of the same GUIDs, but so will gzip.
IMHO, the extra effort of writing the custom coding will not be worth it.
Oleg states correctly, that it might be worth fetching the data only when the user needs it. But this of course depends on your specific requirements.
if simply gzipping the response will compress it enough that this extra effort is not worth it?
The answer is: Yes, it will.
Compressing the data will remove redundant parts as good as possible (depending on the algorithm) until decompression.
To get sure, just send/generate the data uncompressed and compressed and compare the results. You can count the duplicate GUIDs to calculate how big your data block would be with the dictionary compression method. But I guess gzip will be better because it can also compress the syntactic elements like braces, colons, etc. inside your data object.
So what you are trying to accomplish is Dictionary compression, right?
http://en.wikibooks.org/wiki/Data_Compression/Dictionary_compression
What you will get instead of Guids which are 16 bytes long is int which is 4 bytes long. And you will get a dictionary full of key value pairs that will associate each guid to some int value, right?
It will decrease your transfer time when there're many objects with the same id used. But will spend CPU time before transfer to compress and after transfer to decompress. So what is the amount of data you transfer? Is it mb / gb / tb? And is there any good reason to compress it before sending?
I do not know how dynamic is your data, but I would
on a first call send two directories/dictionaries mapping short ids to long GUIDS, one for your associations and on for your employees e.g. {1: AssoGUID1, 2: AssoGUID2,...} and {1: EmpGUID1, 2:EmpGUID2,...}. These directories may also contain additional information on the Associations and Employees instances; I suspect you do not simply display GUIDs
on subsequent calls just send the index of Employees per Association { 1: [2,4,5], 3:[2,4], ...}, the key being the association short id and the ids in the array value, the short ids of the employees. Given your description building the reverse index: Employee to Associations may give better result size wise (but higher processing)
Then its all down to associative arrays manipulations which is straightforward in JS.
Again, if your data is (very) dynamic server side, the two directories will soon be obsolete and maintaining synchronization may cost you a lot.
I would start by answering the following questions:
What are the performance requirements? Are there size requirements? Speed requirements? What is the minimum performance that is truly needed?
What are the current performance metrics? How far are you from the requirements?
You characterized the data as possibly being mostly repeats. Is that the normal case? If not, what is?
The 2 options you listed above sound reasonable and trivial to implement. Try creating a look-up table and see what performance gains you get on actual queries. Try zipping the results (with look-ups and without), and see what gains you get.
In my experience if you're not TOO far from the goal, performance requirements are often trial and error.
If those options don't get you close to the requirements, I would take a step back and see if the requirements are reasonable in the time you have to solve the problem.
What you do next depends on which performance goals are lacking. If it is size, you're starting to be limited if you're required to send the entire association list ever time. Is that truly a requirement? Can you send the entire list once, and then just updates?

MVC-3 User-Image Management - Best Practices

Developing using MVC-3, Razor, C#
Been searching around and cannot find advice I'm looking for. My site will contain user-uploaded images (possibly a high number). What is the best practice for managing these pictures (placement, breakdown into sub-folders, etc...)? Where do I place them that will prevent them from getting accidentally blown away if I republish my site periodically?
If there are any good articles or blog posts, that would be helpful. Also, any advice/tips anyone wants to add would be great.
Thanks for your time!
Rob
EDIT
Also would like to know what people do to prevent hot linking.
A site that I run and has a high volume of images, has all of the images stored in a date folder structure. i.e. 2010/Dec/31/image.jpg
There are two reasons for this.
The first is the limited amount of DB space (200 MB) came with my hosting plan. Obviously if I had gigabytes of space I would have stored them in the DB.
The second reason is to keep the number of images in the folders to a minimum. Directory listings take longer with the more files that are contained in them so a new directory every 24 hours was my workaround.
Can you perhaps tell us more about what resources you have or how many images you estimate will be uploaded daily?
If you are using SQL Server 2005 or above you may use FileStreams. If the files are under 1MB in size you might even have better performance if you store them as VARBINARY(MAX). The best part about storing in the database is you may easily use transactions.
As for replication and backup you may use standard database replication and backup with the files.
If you have the space in your DB, then I recommend that, as backup/restore becomes much easier. If you have limited space for your DB, then a folder structure would work, though I would not store more than 1000 files in a single folder. So you'll want to come up with a solution that helps keep a folder from not holding more than 1000 images and folders in one place. If you think you'll have less than 1000 images per day, then a variation on what Sir Psycho suggested would probably work well which would be a folder for each year, then a sub folder under the year with month and day to store all the images for that day.
To answer your question about hot linking: your best bet is to check the referrer website (which should be found in the head of the request for the image) and make sure it's coming from your domain. If it's not, you can either not send back any information, or you send back an image that let's the user know they cannot see the image from the 3rd party site.
The header data can be spoofed, but odds are random visitors coming to the 3rd party site will not only not have done this, but probably wouldn't know/care how to.

can browsers [especially IE6] handle big Json object

I am returning a big Json object [ 5000 records and 10 elements per record]
from the controller [asp.net mvc] using Jquery and Ajaxpost. till now I was dealing with just 20 records [testing] and it is working fine. But in production there are 5000 records so i am wondering if browser can handle huge amount of data. Especially IE6. I have to display all the 5000 records on a single page. I am confused. I dont have that much data now to test. I need you expert advice whether to Use or Not to Use Jquery Json AjaxPost to return huge amount of data. Thank you.
You should really page this data. You might want to have a look at this:
PagedList
A simple way to find out and probably good practice is to make sure you have that data to test and it will make it much easier to choose between differnt options. paging is one option but there are others. You may be surprised how quickly modern browsers can load that amount of data.
it should be easy enough with a vb.net page to generate all the data you need using random data. This is really essential if you want to be able to investigate what to do.

Resources