I seem to be having a few issues with Google not removing old content from my site and ignoring robots.txt.
If you run
and run through some of the pages, you will notice that pages like:
still show up even though they are old pages that now return a 404 error.
They have now been gone for a long time (~3 months).
In Google Webmaster Tools I have also manually removed them to try to help out, but this didn't make a difference.
Also, if you click on "repeat the search with the omitted results included" at the end of the results and then go towards the end of the indexed pages, you will notice pages like:
are being indexed even though they are excluded in robots.txt!
The robots.txt file is being fetched OK and its syntax is valid:
# Don't bother crawling any online forms
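In case it helps anyone reproduce this, rules like the one above can be sanity-checked with Python's standard-library robots.txt parser. This is a minimal sketch; the `User-agent` and `Disallow` lines below are hypothetical stand-ins, since the site's actual rules aren't reproduced in the thread:

```python
# Sanity-check robots.txt rules with the stdlib parser.
# The "/forms/" path is a made-up example of a blocked section.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /forms/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A URL under the disallowed prefix should not be fetchable.
print(parser.can_fetch("Googlebot", "https://example.com.au/forms/apply"))
# A URL outside it should be.
print(parser.can_fetch("Googlebot", "https://example.com.au/jobs/123"))
```

Note that, as the thread goes on to confirm, a disallowed URL can still appear as a bare URL-only entry in `site:` results if other pages link to it; robots.txt blocks crawling, not awareness of the URL.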
Another issue I have noticed: Google says it has indexed 172,000 pages when you do a "site:example.com.au" search, but the results only go up to 270 pages (if you navigate manually to the last page). This is especially worrying.
Any help will be greatly appreciated ... it's driving me nuts.
[edited by: pageoneresults at 2:52 am (utc) on May 19, 2008]
[edit reason] Examplified URI References - Please Refer to TOS [/edit]
I think there is a simple solution for removing your old content or pages: apply a 301 redirect from the old pages to your new pages. That way requests for the old content will automatically be sent to the new pages, and the old URLs will stop showing up in Google.
Those are pages that don't exist anymore, so they return a 404 (Not Found) response (hence Google shouldn't still be indexing them).
Doing a 301 on these pages doesn't really make sense because the content doesn't exist anymore.
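For pages that are deliberately and permanently gone, a 410 (Gone) response is an even more explicit signal to crawlers than a 404. Here is a minimal sketch of that idea as a stdlib WSGI app; the "/jobs/old/" prefix is a hypothetical stand-in for wherever the removed listings lived:

```python
# Sketch: answer permanently removed job URLs with "410 Gone" instead of 404.
# A 410 tells crawlers the removal is deliberate and permanent, which can
# encourage faster de-indexing. "/jobs/old/" is a made-up example prefix.

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    if path.startswith("/jobs/old/"):
        # Removed listing: signal permanent removal.
        start_response("410 Gone", [("Content-Type", "text/plain")])
        return [b"This job listing has been removed permanently."]
    # Everything else is served normally.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

Whether 404 or 410, the key point stands: a redirect is for content that moved, not content that no longer exists.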
I have put a 301 redirect in place for some other URL formats though:
now goes to a more SEO-friendly:
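Since the actual old and new URL formats were redacted by the moderators, here is only a hedged sketch of that kind of permanent redirect as a stdlib WSGI app. The `/job.php?id=N` and `/jobs/N` patterns are invented for illustration, not the site's real formats:

```python
# Sketch: 301 (permanent) redirect from an old query-string URL format to a
# cleaner path. "/job.php?id=N" -> "/jobs/N" is a hypothetical pair of
# formats standing in for the redacted URLs in the thread.
from urllib.parse import parse_qs

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")
    query = parse_qs(environ.get("QUERY_STRING", ""))
    if path == "/job.php" and "id" in query:
        # Old-format request: send the client (and crawlers) to the new URL.
        new_location = "/jobs/" + query["id"][0]
        start_response("301 Moved Permanently", [("Location", new_location)])
        return [b""]
    # Everything else is served normally.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"OK"]
```

A 301 like this passes most of the old URL's link equity to the new one, which is why it is the right tool for moved content but not for deleted content.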
Yahoo doesn't seem to be having any issues indexing the site; the problem is only with Google.
Any other suggestions will be much appreciated.
First Issue - The 404 pages of your site that Google has indexed were valid pages when they were indexed several days ago. I checked the cache on the pages and it was dated May 4th. These 404 pages should drop out of the index as Google revisits them. If you do not have a lot of links or traffic to these pages, it may take some time for Google to revisit.
Second Issue - Google did not crawl the pages that are blocked with robots.txt. They are aware of the URLs (probably because there are links pointing to the pages). Since they are aware of the URLs, they display the URL only in the site: search results.
Third Issue - A "site:" search reports a high number of pages but you are only seeing a small number. The main reason Google is only returning a few pages is that your pages are very similar to each other.
thanks for the feedback.
1. Googlebot still seems to be accessing these pages, which is strange. I see what you are saying though; I'll leave it a bit longer.
2. OK thanks for letting me know this.
3. I'm not entirely sure that's the problem, because even after you click "repeat the search with the omitted results included" the result set is limited (to around 280 results). It should be limited to 1,000 results like every other site.
Even if duplicate pages are the problem, can you point me in the right direction as to which pages? There should be > 9,028 pages (because there are 9,028 unique jobs on the site, all with different content and URLs).
thanks for the help.
As for my statement about your pages being similar, I still think it is true. I took some lines from the unique job descriptions and found that companies reused large sections for multiple postings. When you combine that with your template code it lowers the uniqueness that Google sees. Good luck.
I didn't actually think my backlink count was that low (compared to some other sites I see that don't have the same problem).
After doing a "link:thedomain" query I get 3,510 links.
I will try making the job display templates different to see if this makes a difference.
thank you very much for the help.
I don't think the issue here is with robots.txt. It is more about Google not valuing your site and therefore not robustly indexing it. This is not to say your site is not valuable. It is just that Google is not seeing the quality signals it generally looks for: a large number of inbound links from a variety of relevant websites, and very unique content.
As a member of this forum, I'd like to add my comments on your post.
404 errors generally occur when Google finds your pages broken, and you seem to have given Google something newly added. Apply a 301 redirect on the old pages and point them to the new pages, or to the pages you want to rank higher in the SERPs. Blocking removed URLs with robots.txt is in vain; use a 301 redirect instead. The robots.txt file should only be used to block unnecessary or restricted areas of the site, not to block removed URLs. You can take advantage of the old URLs by pointing them at new pages or at existing low-ranking pages.
The second issue, 270 pages showing with the rest appearing only in the omitted results, comes down to Google's indexing. Google has indexed 270 pages of your site, and the rest are held back as duplicates or older pages. Google has removed the supplemental-index label for sites, but it still works the same way: the old pages get fetched and filed as supplemental, duplicate results. One more possible reason is that the links pointing to those URLs carry little weight.