I seem to be having a few issues with Google not removing old content from my site and ignoring robots.txt.
If you run
and run through some of the pages, you will notice that pages like:
still show up even though they are old pages that now return a 404 error.
They have now been gone for a long time (~3 months).
In Google Webmaster Tools I have also manually removed them to try to help out, but this didn't make a difference.
Also, if you click on "repeat the search with the omitted results included" at the end of the results and then go towards the end of the indexed pages, you will notice pages like:
are being indexed even though they are excluded in robots.txt!
The robots.txt file is being fetched OK and its syntax is valid:
# Don't bother crawling any online forms
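In case it helps anyone reproduce this, rules like the one above can be sanity-checked with Python's standard-library robots.txt parser. This is a minimal sketch; the `User-agent` and `Disallow` lines below are hypothetical stand-ins, since the site's actual rules aren't reproduced in the thread:

```python
# Sanity-check robots.txt rules with the stdlib parser.
# The "/forms/" path is a made-up example of a blocked section.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /forms/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A URL under the disallowed prefix should not be fetchable.
print(parser.can_fetch("Googlebot", "https://example.com.au/forms/apply"))
# A URL outside it should be.
print(parser.can_fetch("Googlebot", "https://example.com.au/jobs/123"))
```

Note that, as the thread goes on to confirm, a disallowed URL can still appear as a bare URL-only entry in `site:` results if other pages link to it; robots.txt blocks crawling, not awareness of the URL.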
Another issue I have noticed: Google says it has indexed 172,000 pages when you do a "site:example.com.au" search, but the results only go up to 270 pages (if you navigate manually to the last page). This is especially worrying.
Any help will be greatly appreciated ... it's driving me nuts.
[edited by: pageoneresults at 2:52 am (utc) on May 19, 2008]
[edit reason] Examplified URI References - Please Refer to TOS [/edit]
I think there is a simple solution for removing your old content or pages: apply a 301 redirect from the old pages to your new pages. That way requests for the old content will automatically be sent to the new pages, and the old URLs will stop showing up in Google.
Those are pages that don't exist anymore, so they return a 404 (Not Found) response (hence Google shouldn't still be indexing them).
Doing a 301 on these pages doesn't really make sense because the content doesn't exist anymore.
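For pages that are deliberately and permanently gone, a 410 (Gone) response is an even more explicit signal to crawlers than a 404. Here is a minimal sketch of that idea as a stdlib WSGI app; the "/jobs/old/" prefix is a hypothetical stand-in for wherever the removed listings lived:

```python
# Sketch: answer permanently removed job URLs with "410 Gone" instead of 404.
# A 410 tells crawlers the removal is deliberate and permanent, which can
# encourage faster de-indexing. "/jobs/old/" is a made-up example prefix.

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/jobs/old/"):
        # Removed listing: signal permanent removal.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This job listing has been removed permanently."]
    # Everything else is served normally.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

Whether 404 or 410, the key point stands: a redirect is for content that moved, not content that no longer exists.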
I have put a 301 redirect in place for some other URL formats though:
now goes to a more SEO-friendly:
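Since the actual old and new URL formats were redacted by the moderators, here is only a hedged sketch of that kind of permanent redirect as a stdlib WSGI app. The `/job.php?id=N` and `/jobs/N` patterns are invented for illustration, not the site's real formats:

```python
# Sketch: 301 (permanent) redirect from an old query-string URL format to a
# cleaner path. "/job.php?id=N" -> "/jobs/N" is a hypothetical pair of
# formats standing in for the redacted URLs in the thread.
from urllib.parse import parse_qs

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if path == "/job.php" and "id" in query:
        # Old-format request: send the client (and crawlers) to the new URL.
        new_location = "/jobs/" + query["id"][0]
        start_response("301 Moved Permanently", [("Location", new_location)])
        return [b""]
    # Everything else is served normally.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

A 301 like this passes most of the old URL's link equity to the new one, which is why it is the right tool for moved content but not for deleted content.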
Yahoo doesn't seem to be having any issues indexing the site; the problem is only with Google.
Any other suggestions will be much appreciated.
First Issue - The 404 pages of your site that Google has indexed were valid pages when they were indexed several days ago. I checked the cache on the pages and it was dated May 4th. These 404 pages should drop out of the index as Google revisits them. If you do not have a lot of links or traffic to these pages, it may take some time for Google to revisit.
Second Issue - Google did not crawl the pages that are blocked with robots.txt. They are aware of the URLs (probably because there are links pointing to the pages). Since they are aware of the URLs, they display the URL only in the site: search results.
Third Issue - A "site:" search reports a high number of pages but you are only seeing a small number. The main reason Google is only returning a few pages is that your pages are very similar to each other.
thanks for the feedback.
1. Googlebot still seems to be accessing these pages, which is strange. I see what you are saying though; I'll leave it a bit longer.
2. OK thanks for letting me know this.
3. I'm not entirely sure that's the problem, because even after you click "repeat the search with the omitted results included" the result set is limited (to around 280 results). It should be limited to 1,000 results like every other site.
Even if duplicate pages are the problem, can you point me in the right direction as to which pages? There should be > 9,028 pages (because there are 9,028 unique jobs on the site, all with different content and URLs).
thanks for the help.
As for my statement about your pages being similar, I still think it is true. I took some lines from the unique job descriptions and found that companies reused large sections for multiple postings. When you combine that with your template code it lowers the uniqueness that Google sees. Good luck.
I didn't actually think my backlink count was that low (compared to some other sites I see that don't have the same problem).
After doing a "link:thedomain" query I get 3,510 links.
I will try making the job display templates different to see if this makes a difference.
thank you very much for the help.
I don't think the issue here is with robots.txt. It is more about Google not valuing your site and therefore not robustly indexing it. This is not to say your site is not valuable. It is just that Google is not seeing the quality signals it generally looks for: a large number of inbound links from a variety of relevant websites, and very unique content.
As a member of this forum, I'd like to add my comments on your post.
404 errors generally occur when Google finds your pages broken, and you seem to have given Google something newly added. Apply a 301 redirect on the old pages and point them to the new pages, or to the pages you want to rank higher in the SERPs. Blocking removed URLs with robots.txt is in vain; use a 301 redirect instead. The robots.txt file should only be used to block unnecessary or restricted areas of the site, not to block removed URLs. You can take advantage of the old URLs by pointing them at new pages or at existing low-ranking pages.
The second issue, 270 pages showing with the rest appearing only in the omitted results, comes down to Google's indexing. Google has indexed 270 pages of your site, and the rest are held back as duplicates or older pages. Google has removed the supplemental-index label for sites, but it still works the same way: the old pages get fetched and filed as supplemental, duplicate results. One more possible reason is that the links pointing to those URLs carry little weight.