Issues with subfolders not being crawled correctly - google-search-appliance

I have an issue with a couple of folders on one of my sites. There is one folder called Publications_A and one called Publications_B. Each of these folders contains a few subfolders (basically archive folders for past years), an index page, and a few documents that are shown on the index page.
Issue 1: The GSA crawls a bunch of documents in the Publications_A folder that throw a "not found" error. The error is accurate, because those documents are not actually there; they are located in one of the subfolders. Even after resetting the index, these URLs keep showing up here.
Issue 2: The documents that are showing up in the main folder are not showing up in the subfolders where they are actually located! I don't get it. In one of the subfolders (named 2014) the GSA is only picking up 5 documents even though there are actually 10 in there. Even if I feed the GSA the full path to these missing documents, it doesn't index them. They are all PDF documents, and there are links to them inside the index.asp file in the 2014 folder. I've checked, and there isn't a robots no-crawl tag in any of them.
I've been playing around with this for hours and can't figure it out for the life of me. Anyone have any ideas?

I would use real-time diagnostics to attempt to fetch one of the 'missing' documents and see if you get a "200" response.
Pages that show up but should not be there can be due to relative links within other content. For example, a PDF document might contain a non-fully-qualified URL, which can cause the GSA to crawl a link that does not really exist.
Make sure your index pages list all of the content you want crawled.

Issue 1: If it's crawling folders that are not there, then you have your follow pattern set at a higher level, which will follow any subfolders.
Fix: change the follow pattern, or add a do-not-follow pattern.
In addition, as stated by Terry Chambers, if your follow and do-not-follow patterns are correctly listed, then your page content must have a "link" (in some form) to the undesired content (the subfolder showing up for A or B).
If folder A has a link that takes you to folder B, then yes, the GSA will crawl and index it.
Remove the link to avoid the undesired effects.
Hope this helps.

"Issue 2: The documents that are showing up in the main folder are not showing in the sub folders where they are actually located! I don't get it. In one of the subfolders (named 2014) the GSA is only picking up 5 documents even though there are actually 10 in there. Even if I feed the GSA the full path to these missing documents, it doesn't index them. They are all PDF documents, and there are links to them inside the index.asp file in the 2014 folder. I've checked and there isn't a robots no-crawl tag in any of them."
PDF documents can have crawling/indexing issues if their content is not selectable text, in other words if the pages are "flat" images (for example, scans with no text layer).
You can also try embedding text headers/footers (inside the documents themselves or via the surrounding HTML) in documentation, image-type files, etc. That gives the GSA some text to work with and should allow those documents to be crawled and indexed.
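If you want a quick way to check whether a given PDF actually contains selectable text, here is a minimal Python sketch; it assumes the pypdf package and uses a hypothetical file path:

from pypdf import PdfReader  # assumes the pypdf package is installed

reader = PdfReader("2014/example-document.pdf")  # hypothetical path to one of the missing PDFs
text = "".join((page.extract_text() or "") for page in reader.pages)

if text.strip():
    print("The PDF has a text layer; the GSA can index its content.")
else:
    print("No extractable text; the PDF is a flat image and needs OCR or embedded text.")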
Hope this helps.

Related

Can files be deleted in a folder if they don't contain a specific word using Power Automate?

I currently have a folder into which photos are dumped. I am looking to delete all files whose names do not contain a specific word (the word is present in the names of all the files I want to keep).
I am hoping this can be done with Power Automate, as there are hundreds of photos and I want to make the process more efficient.
I look forward to learning from somebody!
Update: it seems the flow ran successfully.
You could use a Get files (properties only) action and a Filter array action afterwards. In the Filter array you can check whether the Name field does not contain your keyword.
After that you can loop through the results of the filter array and delete the files based on the {Identifier} field.
Test it properly, because you are deleting files. If something goes wrong, you can restore the files from the first or second stage recycle bin ;)
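Purely as an illustration of the same keep-or-delete logic (not a Power Automate flow), here is a minimal Python sketch that keeps files whose names contain a keyword and deletes everything else in a local folder; the keyword and folder path are hypothetical:

from pathlib import Path

KEEP_WORD = "holiday"             # hypothetical word present in the names of files to keep
folder = Path("/path/to/photos")  # hypothetical local folder

for entry in folder.iterdir():
    # Delete only plain files whose name does not contain the keyword.
    if entry.is_file() and KEEP_WORD not in entry.name:
        print(f"Deleting {entry.name}")
        entry.unlink()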

How to extract a list of URLs from specific domain?

I'm using Firefox 53 with Scrapbook X, and I want to save a lot of pages using the Save Multiple URLs feature. Before I do that, I want to extract a specific list of URLs without having to collect them manually.
The site I'm looking at extracting data from is www.address-data.co.uk - namely this page.
What I want is to extract only the URLs of the subpages linked from that page (all the sub-pages for the EH postcodes), but not the privacy policy or contact us page.
Is there a way to do this online, or any tool for Mac OS X that can find all related URLs before I copy them into Scrapbook's Save Multiple URLs (where I save them in a subfolder of Scrapbook)?
I assume that EH45 is typical of those you want to extract from the page you mentioned. Like its siblings, it's of the form https://address-data.co.uk/postcode-district-EH<postcode number>.
This means that you can make a complete list of the URLs if you have a list of the numbers, or of the postcodes.
My main difficulty in answering is that I don't know what tools (especially programming tools) you might have at your disposal. I will assume only that you have, or can obtain, access to an editor that can do macros or that can edit columns. On Windows I would use Emerald (now known as Crimson).
Then copy the contents of the table in the EH page (not the table headings) and remove everything except the first column. Finally, prepend every item in the column with 'https://address-data.co.uk/postcode-district-'.
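If you would rather script it than edit columns by hand, here is a minimal Python sketch of the same idea. It assumes the requests and beautifulsoup4 packages, that the EH listing page lives at the URL shown (an assumption on my part), and that the district pages are linked with ordinary <a href> tags:

import re
import requests
from bs4 import BeautifulSoup

BASE = "https://address-data.co.uk"
listing = requests.get(BASE + "/postcode-district-EH")  # assumed URL of the EH listing page
soup = BeautifulSoup(listing.text, "html.parser")

urls = set()
for a in soup.find_all("a", href=True):
    href = a["href"]
    # Keep only links that look like EH district pages, made absolute.
    if re.search(r"postcode-district-EH\d+", href):
        urls.add(href if href.startswith("http") else BASE + "/" + href.lstrip("/"))

print("\n".join(sorted(urls)))  # paste the output into Scrapbook's Save Multiple URLs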
PS: This might also be a good question to put on SuperUser.

Crawl depth for URLs added through metadata-and-url feed

We need to add specific URLs through a metadata-and-url feed and prevent the GSA from following links found on those pages. URLs found on these pages must be ignored even if they match the Follow Patterns rules.
Is it possible to specify a crawl depth for URLs added through a metadata-and-url feed, or is there some other way to prevent the GSA from following URLs found on specific pages?
You can't solve this problem with just a metadata-and-URL feed. The GSA is going to crawl the links that it finds, unless you can specify patterns to block them.
There are a couple of possible solutions I can think of.
You could replace the metadata-and-URL feed with a content feed. You'd then have to fetch whatever you want to index and include that in the feed. Your fetch program could remove all of the links, or it could "break" relative links by specifying an incorrect URL for each of the documents. You'd then have to rewrite the incorrect URLs back to the correct URLs in your search result display page. I've done the second approach before, and that's pretty easy to do.
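As a rough sketch of the "remove all of the links" step under the content-feed approach (assuming BeautifulSoup; building the actual feed XML is omitted), the fetch program could do something like this:

import requests
from bs4 import BeautifulSoup

url = "http://www.mysite.com/some/page.html"  # a page you would otherwise feed by URL
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Unwrap every anchor: the link target disappears, but the anchor text
# stays behind as indexable content, so the GSA finds nothing to follow.
for a in soup.find_all("a"):
    a.unwrap()

clean_html = str(soup)  # this is what would go into the content feed record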
You could use a crawl proxy to block access to any of the links you don't want the GSA to follow.
The easiest method to prevent this is to add the following robots meta tag to the "HEAD" section of your HTML:
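<meta name="robots" content="nofollow">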
This will prevent the GSA (and any other search engine) from following any links on the page.
Since you say that you can't add the relevant nofollow meta tags to your content, you can handle this using your follow and crawl patterns.
From the official documentation:
Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.
For example, the following URL patterns cause the search appliance to crawl the top three subdirectories on the site www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
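If you want to sanity-check locally which URLs such depth patterns admit, here is a small Python sketch; the patterns are rewritten as plain Python raw strings, where a single backslash is enough to escape the dots:

import re

depth_patterns = [
    r"www\.mysite\.com/[^/]*$",
    r"www\.mysite\.com/[^/]*/[^/]*$",
    r"www\.mysite\.com/[^/]*/[^/]*/[^/]*$",
]

def within_three_levels(url):
    # A URL is accepted if it matches any of the three depth patterns.
    return any(re.search(p, url) for p in depth_patterns)

print(within_three_levels("www.mysite.com/a/b/index.html"))      # True: within three levels
print(within_three_levels("www.mysite.com/a/b/c/d/index.html"))  # False: too deep to be crawled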

Is there an efficient way in docpad to keep static and to-be-rendered files in the same directory?

I am rebuilding a site with docpad and it's very liberating to form a folder structure that makes sense with my content-creation workflow, but I'm running into a problem with docpad's hard division between content-to-be-rendered and 'static' content.
Docpad recommends that you put things like images in /files instead of /documents, and the documentation makes it sound as if otherwise there will be some processing overhead incurred.
First, I'd like an explanation, if anyone has it, of why a file with a single extension (therefore no rendering) and no YAML front matter, such as a .jpg, would impact site-regeneration time when placed within /documents.
Second, the real issue: if it does indeed create a performance hit, is there a way to mitigate it? For example, by specifying an 'ignore' list with regexes, etc.
My use case
I would like to do this for posts and their associated images to make authoring a post more natural. I can easily see the images I have to work with, and all the related files are in one place.
I am also doing this for artwork I am displaying. In that case it's an even stronger use case: the only data in my .html.eco file is YAML front matter with various metadata, and my layout automatically generates the gallery from the images located in a folder with the same name as the post. I can mirror the output path with a matching folder in my /files directory, but that's error-prone, because you're in one folder (src/files/artworks/) when creating the folder of images and another (src/documents/artworks/) when creating the HTML file, so typos are far more likely (you can never see the folder and the HTML file side by side)...
Even without justifying a use case, I can't see why docpad should put forth such a hard division. A performance consideration should not be passed on to the end user like that if it can be avoided in any way; since with docpad I am likely to be managing my blog through the file system, I ought to have full control over that structure, and I certainly don't want my content divided up based on some framework limitation or performance concern instead of based on logical content divisions.
I think the key is the line about "metadata". Even though a file does NOT have a double extension, it can still have metadata at the top of the file, which needs to be scanned and read. The double extension really just tells docpad to convert the file from one format and output it as another. If I create a plain HTML file in the documents folder, I can still include the metadata header in the form:
---
tags: ['tag1','tag2','tag3']
title: 'Some title'
---
When the file is copied to the out directory, this metadata will be removed. If I do the same thing to an HTML file in the files directory, the file will be copied to the out directory with the metadata header intact. So, the answer to your question is that even though your file has a single extension and is not "rendered" as such, it still needs to be opened and processed.
The point you make, however, is a good one. Keeping images and documents together. I can see a good argument for excluding certain file extensions (like image files) from being processed. Or perhaps, only including certain file extensions.

Diagnosing Mac Help keyword indexing?

I am having difficulty coaxing the "Help → Search" function of my application to show topics related to the useful (and unique) keywords in my application. Only one keyword shows up.
Background: I created several html help pages (examples: index, accuracy, convert) in a subdirectory of my program. If I invoke the master help, the index.html file shows up fine. From there, I can click through to any of the other topic pages.
Problem: If I try using the keyword search function, only "Accuracy" and a blank indicator (that pulls up the index.html) show up. I have other keywords like "coordinates" that should point to a specific page, but aren't showing up.
What I've done so far: In addition to re-skimming the documentation (by this point I am a little bleary-eyed), I have run each page through BBEdit's syntax checker. I also searched StackOverflow for information related to the problem. Because the keywords are rather ubiquitous, this was the primary topical match, but I'm well past that.
The Help Indexer log notes that it's indexed all of the html files, finding KEYWORDS and DESCRIPTION meta tags in each (as recommended by the Help Book):
droot.html -- File has KEYWORDS meta tag content being indexed.
gc.html -- File has KEYWORDS meta tag content being indexed.
index.html -- File has KEYWORDS meta tag content being indexed.
droot.html -- File has DESCRIPTION meta tag used for abstract.
gc.html -- File has DESCRIPTION meta tag used for abstract.
index.html -- Finished parsing
droot.html -- Finished parsing
gc.html -- Finished parsing
(etc)
The *.helpindex file in the Release package (?/Contents/Resources/MacFizzyCalcHelp/ directory) is ~25k. I do not know how to inspect its contents, though.
Any thoughts on what I'm missing?
I found the following post on the Apple Support site useful when I ran into a similar problem with the Help topics of my Help Book not appearing in Search:
https://discussions.apple.com/thread/3442044
There can be many reasons for problems like this. Once I found that the Apple developer documentation had a mistake (in describing anchors). Did you register the help book in the application's Info.plist? Does the help book contain its own Info.plist file? You can check the .helpindex file using hiutil. I hope this helps you.
I just had the issue of the blank Apple Help entry and, after several days of trying everything I could think of, finally found the solution. Add <META NAME="ROBOTS" CONTENT="NOINDEX"> to the blank entry's page (in my case it was index.html, aka the landing page or access page). I then re-indexed the HTML pages and lo, no more blank entry. No need to even delete the help viewer cache.
