I am trying to retrieve images with Nutch. The plugin is just searching for the required images and retrieving their urls. What I get at the end contains too many duplicate urls. It retrieved 43 thousand urls and 39 thousand of them were duplicates.
Is this normal, or could there be some fault in the code I wrote (which I don't think is the case), or otherwise some problem with Nutch itself?
Could it be, for instance, that the same images are referenced multiple times? In that case your results could be perfectly normal. I guess that running a test on a given/known set of URLs would give you a better answer: limit your crawl to only the URLs in the seed file, run a test and check which images are being crawled. What is the size of your crawl? Are you re-fetching already fetched pages or focusing on pages not yet visited? Are you ignoring small images like icons?
Keep in mind that on a typical website a lot of image assets are reused over and over again, especially when many pages share the same template or theme, so a high duplicate count is not surprising. De-duplicating inside the plugin, as in the sketch below, is usually enough.
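Since we can't see the plugin code, here is a minimal, hedged sketch of that fix: collect the extracted image URLs into a Set so each URL is only kept once (the class and method names are illustrative, not part of the Nutch API).

```java
// Minimal sketch: de-duplicate image URLs before emitting them from the plugin.
// Assumes you already have the <img> src values extracted from each parsed page.
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ImageUrlCollector {

    // Keeps insertion order and silently drops exact duplicates across pages.
    private final Set<String> seen = new LinkedHashSet<>();

    public void addAll(List<String> imageUrls) {
        for (String url : imageUrls) {
            String normalized = url.trim();   // a real plugin would also normalize
            if (!normalized.isEmpty()) {      // scheme/host case and strip fragments
                seen.add(normalized);
            }
        }
    }

    public Set<String> uniqueUrls() {
        return seen;
    }
}
```

With something like this in place, 43 thousand raw references collapsing to roughly 4 thousand unique URLs would look expected rather than like a bug.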
I have a situation where images all take exactly 5 seconds to load on my localhost. It's not happening on live sites, and the pages themselves, which fetch data from a MySQL database, all load quickly - it seems to be just the image assets.
What I find odd is:
Regardless of image size (the largest image is about 200KB) it is always exactly 5 seconds (see screenshot).
When I check with Google's Lighthouse it scores well for performance and these issues aren't being picked up.
The issue seems to be the painting of the image?
It is only happening on the latest version of MAMP.
I've noticed as well that the phpMyAdmin URL has changed to http://localhost:8888/phpMyAdmin5/ ... I accept this may just be a versioned URL and not related to the issue, but I thought I'd mention it.
I also had the following issue, which I fixed with the solution given (I don't know if this is related to the problem): Wrong permissions on configuration file, should not be world writable! MAMP
I've never come across anything like this before, and I can't get my head around why it always takes exactly 5 seconds to load an uncached image (multiple images also take a total of 5 seconds).
The other questions on here (StackOverflow) about MAMP and page-load relate to general page load and caching issues.
Any suggestions most welcome.
I have an app to create reports with some data and images (min 1 img, max 6). These reports stay saved in my app until the user sends them to the API (which can happen the same day the report is registered, or a week later).
But my question is: what's the proper way to store these images (I'm using Realm) - saving the path (URI) or a base64 string? My current version keeps the base64 for these images (500-800 KB image size), and after my users send their reports to the API, I delete this base64 string.
I was developing a way to save the path to the image and then display it. But the URI returned by image-picker is temporary, so to do this I need to copy the file somewhere else and save that path. Doing that, I end up (for around 2 or 3 days) with the image stored twice on the phone (using memory).
So before I develop all this stuff, I was wondering: will copying the image to another path and then saving the path be more performant than saving a base64 string on the phone, or shouldn't it make much difference?
I try to avoid text-only answers; including code is best practice, but the question about storing images comes up frequently and it's not really covered in the documentation, so I thought it should be addressed at a high level.
Generally speaking, Realm is not a solution for storing blob-type data - images, PDFs, etc. There are a number of technical reasons for that, but most importantly, an image can go well beyond the capacity of a Realm field. Additionally, it can significantly impact performance (especially in a syncing use case).
If this is a local-only app, store the images on disk on the device and keep a reference to where they are stored (their path) in Realm, as sketched below. That will keep the app fast and responsive with a minimal footprint.
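As a minimal sketch of that path-based approach (using the Realm Java SDK on Android purely for illustration; the question's React Native SDK has an analogous API, and the ReportImage class, field names and .jpg extension here are assumptions): copy the picker's temporary file into the app's own files directory and persist only the resulting path.

```java
// Minimal sketch, assuming the Realm Java SDK on Android; names are illustrative.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.UUID;

import io.realm.Realm;
import io.realm.RealmObject;
import io.realm.annotations.PrimaryKey;

// Model (in its own file): stores only the path to the image, never the image bytes.
public class ReportImage extends RealmObject {
    @PrimaryKey
    private String id;
    private String imagePath;

    public String getImagePath() { return imagePath; }
    public void setImagePath(String imagePath) { this.imagePath = imagePath; }
}

class ImageStore {
    // Copy the picker's temporary file into permanent app storage, then save its path in Realm.
    static void saveImage(File pickerTempFile, File appFilesDir) throws IOException {
        File dest = new File(appFilesDir, UUID.randomUUID() + ".jpg");
        try (FileInputStream in = new FileInputStream(pickerTempFile);
             FileOutputStream out = new FileOutputStream(dest)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }
        try (Realm realm = Realm.getDefaultInstance()) {
            realm.executeTransaction(r -> {
                ReportImage image = r.createObject(ReportImage.class, UUID.randomUUID().toString());
                image.setImagePath(dest.getAbsolutePath());
            });
        }
    }
}
```

Once the report has been sent to the API, you can delete the copied file and the Realm record together, which also addresses the "stored twice" concern in the question.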
If this is a synced solution where you want to share images across devices or with other users, there are several cloud-based solutions to accommodate image storage; you then store a URL to the image in Realm.
One option, part of the MongoDB family of products (which also includes MongoDB Realm), is GridFS. Another option, a solid product we've leveraged for years, is Firebase Cloud Storage.
Now that I've made those statements, I'll backtrack just a bit and refer you to the article Realm Data and Partitioning Strategy Behind the WildAid O-FISH Mobile Apps, which is a fantastic article about implementing Realm in a real-world application and, in particular, how to deal with images.
In that article, note they do store the images in Realm for a short time. However, one thing they left out of that (which was revealed in a forum post) is that the images are compressed to ensure they don't go above the Realm field size limit.
I am not totally on board with general use of that technique but it works for that specific use case.
One more note: the image sizes mentioned in the question are pretty small (500-800 KB), which is a tiny amount of data that would really not have an impact, so storing them in Realm as a data object would work fine. The caveat is future expansion: if you decide later to store larger images, it would require a complete re-write of the code, so why not plan for that up front.
I have to work with some rather large images, and multiples of them. During my testing I uploaded several that I didn't apply to an actual object. Now it says I have taken up 0.7% of my 20 GB with only 3 images on an object.
Does Parse keep all the images I uploaded previously even though I never applied them to an object? Is there a way to clear this data out?
Yes, it still counts against your storage, because the file is still there even if you haven't saved a reference to it on a ParseObject. You can clear out files that have not been assigned to Parse Objects by going into the Settings page and scrolling down to "Clean Up Files".
I have a Wikipedia URL and I want to load the content from that page and other referenced pages, up to 30,000 documents, using the wiki API. I can loop through the URLs and do that, but that is not an efficient way of doing it. Is there any other way I can achieve this? I need this to populate my HDFS in Hadoop.
You can download the MediaWiki software and a database image, set up your own Wikipedia mirror and access it locally. This is well described and should be a lot more efficient than requesting that number of pages over the net. See: http://www.igeek.co.za/2009/10/16/how-to-mirror-wikipedia/
There are also many other sources, as well as preprocessed pages. That raises the question of what you plan to do with the content in the next step.
There are a few ways to go about this. Toolserver users have direct database query access to all the metadata, but not text. If that suits you, you might be able to ask one of them to run a query through the query service. This is a pretty straight-forward way to find out what pages are linked, etc. and build a map of page ids or revision ids.
Otherwise, take a look at database dumps which are great for bulk work but will take some processing on your end.
Finally, Wikipedia is used to tons of bots and API scrapes. It's not ideal, but if nothing else suits you then run a timer that starts a new query once every second and you'll be done in 8 hours.
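To make that last option concrete, here is a hedged sketch of a one-request-per-second loop against the standard MediaWiki action=parse endpoint (the class name and the two example titles are illustrative; in practice the list would hold your ~30,000 pages, which at this rate takes a little over 8 hours).

```java
// Minimal sketch: politely fetch rendered pages through the MediaWiki API, one per second.
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class WikiFetcher {
    private static final String API = "https://en.wikipedia.org/w/api.php";

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        // In practice this would be your ~30,000 page titles.
        List<String> titles = List.of("Apache_Hadoop", "Apache_Nutch");
        for (String title : titles) {
            String url = API + "?action=parse&format=json&page="
                    + URLEncoder.encode(title, StandardCharsets.UTF_8);
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // Write response.body() to HDFS (or a local staging directory) here.
            Thread.sleep(1000); // one request per second: ~30,000 pages in a bit over 8 hours
        }
    }
}
```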
As Jeff and NilsB said, crawling Wikipedia page by page is the wrong way to fill your HDFS. The right thing to do is to download the whole wiki as a single dump file and load it into HDFS.
But if we abstract away some details, your question turns into a more general one: how do you crawl a set of sites, specified by URL, using Hadoop?
So the answer is: upload the file(s) containing the URLs to HDFS, write a mapper (which accepts a URL, downloads the page and yields key = url, value = page body), and configure the job to use NLineInputFormat to control how many URLs each mapper will process.
By controlling that parameter you control the level of parallelism and, through it, the number of map slots used. A minimal sketch follows.
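This sketch assumes the org.apache.hadoop.mapreduce API; the class names and the 100-URLs-per-mapper figure are illustrative, not prescriptive.

```java
// Minimal sketch: a map-only job that reads one URL per input line and stores the fetched page.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlFetchJob {

    public static class FetchMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String url = line.toString().trim();
            if (url.isEmpty()) {
                return;
            }
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream(), StandardCharsets.UTF_8))) {
                String s;
                while ((s = in.readLine()) != null) {
                    body.append(s).append('\n');
                }
            }
            // key = url, value = page body, as described above
            context.write(new Text(url), new Text(body.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "fetch-urls");
        job.setJarByClass(UrlFetchJob.class);
        job.setMapperClass(FetchMapper.class);
        job.setNumReduceTasks(0);                       // map-only job
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100); // 100 URLs per mapper: tunes parallelism
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        NLineInputFormat.addInputPath(job, new Path(args[0]));   // HDFS file(s) listing the URLs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```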
Should a highly dynamic website that is constantly generating new pages use a sitemap? If so, how does a site like stackoverflow.com go about regenerating a sitemap? It seems like it would be a drain on precious server resources if it was constantly regenerating a sitemap every time someone adds a question. Does it generate a new sitemap at set intervals (e.g. every four hours)? I'm very curious how large, dynamic websites make this work.
On Stackoverflow (and all Stack Exchange sites), a sitemap.xml file is created which contains a link to every question posted on the system. When a new question is posted, they simply append another entry to the end of the sitemap file. It isn't that resource intensive to add to the end of the file but the file is quite large.
That is the only way search engines like Google can effectively crawl the site.
Jeff Atwood talks about it in a blog post: The Importance of Sitemaps
This is from Google's webmaster help page on sitemaps:
Sitemaps are particularly helpful if:
Your site has dynamic content.
Your site has pages that aren't easily discovered by Googlebot during the crawl process - for example, pages featuring rich AJAX or Flash.
Your site is new and has few links to it. (Googlebot crawls the web by following links from one page to another, so if your site isn't well linked, it may be hard for us to discover it.)
Your site has a large archive of content pages that are not well linked to each other, or are not linked at all.
There's no need to regenerate the Google sitemap XML each time a question is posted. It's far simpler just to have the XML file generated on-demand directly from the database (and a little caching).
To reduce load, the sitemap can be split into many sitemaps. Partitioning it by day/month would allow you to tell Google to retrieve today's sitemap frequently, but only fetch the sitemap from six months ago once in a while.
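As an illustration of that split, here is a hedged sketch that emits a sitemap index with one sitemap file per month (the URL scheme and the month-sized chunks are assumptions, not how Stack Overflow actually does it). Old months never change, so their entries stay stable and are rarely re-fetched.

```java
// Minimal sketch: build a sitemap index that points at one per-month sitemap file.
import java.time.YearMonth;
import java.util.List;

public class SitemapIndexBuilder {

    public static String build(String baseUrl, List<YearMonth> months) {
        StringBuilder xml = new StringBuilder();
        xml.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        xml.append("<sitemapindex xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">\n");
        for (YearMonth month : months) {
            xml.append("  <sitemap>\n")
               .append("    <loc>").append(baseUrl).append("/sitemap-").append(month).append(".xml</loc>\n")
               // Past months are frozen; the current month would use today's date instead.
               .append("    <lastmod>").append(month.atEndOfMonth()).append("</lastmod>\n")
               .append("  </sitemap>\n");
        }
        xml.append("</sitemapindex>\n");
        return xml.toString();
    }
}
```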
I'd like to share my solution here just in case it helps someone as well.
It took me reading this question and many others to decide what to do.
My site structure.
Static pages
Home (Highly dynamic. Cached for 30 mins)
Artists, Albums, Songs and Playlists (paginated lists)
Legal (Static page with Terms etc)
...etc
Dynamic Pages
Artist, Album, Song and Playlist detail pages
My approach.
sitemap.xml: This URL generates a <sitemapindex /> with the first item being /sitemap-main.xml. The number of Artists, Albums, Songs, etc. are counted and divided by 1,000 (the number of URLs I want in each sitemap; the limit is 50,000), and I round this number up.
So, for example, 1,900 songs = 1.9 = 2.
I then add the URLs /sitemap-songs-0.xml and /sitemap-songs-1.xml to the index. I repeat this for all the other items. Basically, I am paginating.
The output is returned uncached. I want this to always be fresh.
sitemap-main.xml: This lists all the static pages. You can actually use a static file for this as you will only need to update it once in a while.
sitemap-songs-0.xml, sitemap-albums-0.xml, etc: I use a single route for this in SlimPhp 2.
$app->get('/sitemap-:type-:page.xml', function ($type, $page) use ($app) { /* ... */ });
I use a simple switch statement to generate the relevant file. If a page has 1,000 items (the limit specified above), I cache the file for 2 weeks; otherwise, I only cache it for a few hours (see the sketch below).
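To make the arithmetic and the caching rule above concrete, here is a small sketch (the 6-hour value merely stands in for "a few hours").

```java
// Minimal sketch of the paging and caching rule described above.
import java.time.Duration;

public class SitemapPager {

    static final int URLS_PER_FILE = 1_000;

    // e.g. 1,900 songs -> 2 files: /sitemap-songs-0.xml and /sitemap-songs-1.xml
    static long fileCount(long itemCount) {
        return (itemCount + URLS_PER_FILE - 1) / URLS_PER_FILE;
    }

    // A full page of 1,000 URLs no longer changes, so it can be cached far longer
    // than the last, still-growing page.
    static Duration cacheFor(long urlsOnThisPage) {
        return urlsOnThisPage == URLS_PER_FILE ? Duration.ofDays(14) : Duration.ofHours(6);
    }
}
```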
I guess this can help anyone else implement their own system.
Even on something like StackOverflow, there is a certain amount of static organization; there are FAQs, tag pages, question pages, user pages, badge pages, etc.; I'd say in a very dynamic site, the best way to approach a sitemap would be to have a map of the categorizations; each node in the sitemap can point to a page of the dynamically generated data (a node for a question page, a node for a user page, etc.).
Of course, a sitemap may not even be appropriate for a given site; there's a certain amount of judgment call required there.
For a highly dynamic site, I wrote a cron job on my server which runs daily. It makes a REST call to my backend, generates a new sitemap from all the newly created content, and returns the sitemap as an XML file. This new sitemap overrides the previous one and keeps my website up to date with all the changes. Regenerating the sitemap for each newly added piece of dynamic content is not a good approach, I think.