
Find Ghost Pages

     
12:26 am on Nov 19, 2017 (gmt 0)

New User

joined:Nov 19, 2017
posts: 4
votes: 0


I have a pretty big site with 3610 pages. Over the years I've abandoned quite a few pages (with text and images) by just deleting the links to them.


Is there any way to find them other than manually, so I can delete most or all of them?

[edited by: engine at 4:34 pm (utc) on Nov 19, 2017]
[edit reason] Please see WebmasterWorld TOS - no urls [/edit]

5:19 pm on Nov 19, 2017 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 552
votes: 54


This may sound weird from a webmaster site, but most Linux open source bot scraping software will also crawl a site for URLs and save them in a file. They call it "intelligence gathering", or reconnaissance. We call it very annoying and ban them. You may call it getting a list of all your used URLs.

I've used OWASP WebScarab, but there is Scrapy, Nutch, and so many more. Unfortunately if you are on the receiving end of these bots, they are too easily available and too easy to run.

Your ghost pages do not sound very spooky.
5:30 pm on Nov 19, 2017 (gmt 0)

New User

joined:Nov 19, 2017
posts: 4
votes: 0


Will any of those find the pages that don't have a link to them on my website?
5:38 pm on Nov 19, 2017 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 552
votes: 54


These are web crawlers that collect links to existing pages, so that later they can attack or scrape them. So no, if a page is not linked then it will not be collected.

You could use the list from the bot crawler against the directory listing, and with some shell scripting hocus pocus create a list of pages that are not linked.
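Something along these lines would cover the directory-listing half, assuming a Linux host and a document root of /var/www/html (the path and the output filename are just placeholders):

    # List every .htm file under the document root as a site-relative path,
    # sorted so it can later be compared against the crawler's list.
    find /var/www/html -type f -name '*.htm' \
        | sed 's|^/var/www/html/||' \
        | sort > all_pages.txt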

[edited by: TorontoBoy at 6:06 pm (utc) on Nov 19, 2017]

6:05 pm on Nov 19, 2017 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member graeme_p is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2950
votes: 192


> Linux open source bot scraping software

Minor point, but there is nothing "Linux" about it - most, including the ones you mention, are cross platform.

> I've used OWASP WebScarab, but there is Scrapy, Nutch

WebScarab is a bit different because it is a vulnerability scanner - it is something you should run against your own site to test security, and if someone else is running it against you, you should be suspicious.

The others are just spiders. I have used Scrapy quite a bit, for entirely legitimate purposes.
6:08 pm on Nov 19, 2017 (gmt 0)

Senior Member from GB 

WebmasterWorld Senior Member graeme_p is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 16, 2005
posts:2950
votes: 192


The solution for this has to come from the inside of the site.

I think you will have to write a script that crawls the site and compares the crawl to the site content.

If it is a static site I would wget the whole site and diff it against the site content.

If it is not a static site I would use a crawler like Scrapy to create a list of URLs that are linked to, write a script that compiles a list of URLs that are on the site, and diff the two.
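To make the wget version concrete, a rough sketch. The domain and document root are placeholders, and the grep pattern may need tweaking for the log format your wget version produces:

    # 1. Spider the live site: follow every internal link, save nothing,
    #    and log each URL that gets visited.
    wget --spider --recursive --no-parent --no-verbose \
         --output-file=crawl.log https://www.example.com/

    # 2. Reduce the logged URLs to sorted, site-relative paths.
    grep -o 'https://www\.example\.com/[^ ]*' crawl.log \
        | sed 's|^https://www\.example\.com/||' \
        | sort -u > linked.txt

    # 3. List every page that exists on disk in the same relative form.
    find /var/www/html -type f -name '*.htm*' \
        | sed 's|^/var/www/html/||' \
        | sort -u > on_disk.txt

    # 4. Lines marked "<" exist on disk but were never reached by the crawl.
    diff on_disk.txt linked.txt | grep '^<'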
6:29 pm on Nov 19, 2017 (gmt 0)

New User

joined:Nov 19, 2017
posts: 4
votes: 0


Oh, if only I could, but I know nothing about coding. I'm going to have my site converted over to WordPress and thought it a good idea to get rid of all those pages I've just abandoned.
7:57 pm on Nov 19, 2017 (gmt 0)

New User

joined:Nov 19, 2017
posts: 4
votes: 0


Can anyone here write that (paid) script for me?
8:18 pm on Nov 19, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member lucy24 is a WebmasterWorld Top Contributor of All Time 5+ Year Member Top Contributors Of The Month

joined:Apr 9, 2011
posts:15705
votes: 812


It's much easier to find pages that are linked than pages that are not linked. If nothing else, you run the w3c link checker with “check recursively”* enabled, and you'll end up with a list of everything that has been visited. The spbot/OpenLinkProfiler is also useful, if it happens to have visited recently, because of its exact crawling pattern. Just find its most recent visit in your logs and you'll have a full listing of all accessible pages. (Quick detour to raw logs tells me that, as of earlier this month, I have 404 pages, not counting the ones in roboted-out directories. That is, ahem, the actual number, not the response code.)
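If spbot does turn up in the logs, pulling its requests out is a one-liner. This assumes an Apache-style combined log at a placeholder path; adjust the path and the user-agent string to whatever your logs actually show:

    # Print every distinct path that spbot requested.
    grep 'spbot' /var/log/apache2/access.log \
        | awk '{print $7}' \
        | sort -u > crawled.txt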

It isn't clear from your post whether you already have a listing of all current pages. If yes, then all you have to do is compare: subtract List B from List A.

3610 is not actually a huge number. That is, it's a lot, but not impossible, even if you have to do some of the work by hand.

Wouldn't you also, though, need to know if anyone--other than the googlebot--is still visiting those orphan pages? Removing links doesn't do anything about bookmarks.


* That's assuming recursion only applies within the original site. You don't want to trigger a cascade of requests all over the internet.
8:31 pm on Nov 19, 2017 (gmt 0)

Preferred Member from CA 

Top Contributors Of The Month

joined:Feb 7, 2017
posts: 552
votes: 54


There are so many ways to do this. Assuming you are running Linux, you will need to generate two files. File 1: a list of all the .htm files in your directory. File 2: from the bot scraper program, a list of all the .htm files that are linked to.

Search Google for "shell script compare 2 files find difference". The first link [stackoverflow.com] is pretty simple to follow. If you cannot do this yourself you should hire someone.
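In case it helps, the comparison itself can be a single command. This assumes both files hold one site-relative path per line and are already sorted (comm needs sorted input); the filenames are placeholders:

    # comm -23 prints lines that appear only in the first file:
    # pages that exist on disk but never appeared in the crawler's list.
    comm -23 all_pages.txt linked_pages.txt > ghost_pages.txt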

Also note that while WordPress is pretty friendly for content people, there are disadvantages. WP runs ~28% of the world's web sites (yay!), is easy to use (yay!), and has thus become the easiest type of site to hack into and scrape (boo!). Make sure you are OK with keeping the WP core and plugins up to date, or go with a WP-specific host provider. If you have not been keeping up with maintenance of the old site, will you maintain the new one? Leave a WP site alone for too long and it will get hacked, especially through its plugins.

There are many alternatives to WordPress. Do you need a dynamic content management system? Something that is faster, and possibly just as effective, is a static site generator. You maintain your static .htm pages, feed them into the SSG, and it publishes the site. Move those files to your host provider and "Bob's your uncle". Of course there is some learning involved. I have used Hexo, but there are many others.