|Crawl rate since mid-December|
The number of pages crawled per day on my website has more than doubled since mid-December (according to the crawl stats in Webmaster Tools).
Is it the same for everyone or just me, and what does it mean?
Using Webmaster Tools:
Go to "Google Index", then "Index Status".
Click the Advanced tab and check "Ever crawled".
If this number has jumped up, Google has found a new way to access your content.
One very common recent occurrence is that Google discovers your site's pages can be crawled over both https (SSL) and the conventional http protocol.
To Google this can apparently double the number of pages you have, and perhaps (though I doubt it) lead to a duplicate content penalty.
The best fix is to use the "rel=canonical" mechanism.
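For the http/https duplication described above, a canonical link in the head of each page might look like this (the domain here is a placeholder, not a real site):

```html
<!-- Served on BOTH the http and https versions of the page,
     pointing Google at the one URL you want indexed: -->
<link rel="canonical" href="http://www.example.com/topic.html">
```

With that tag in place, Google should fold the https copy's signals into the http URL rather than treating them as two pages.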
I just checked, and the number has been steady; it hasn't moved since the 7th of April 2013.
The only time I saw a spike was on the 10th of March 2013, when I went from 1,500 to 2,700 pages crawled in less than two weeks.
However, that turned out to be due to a bug in my CMS when upgrading to a newer version.
By the way, my website is only 38 pages, yet it says over 1,500 ever crawled. Where does Google find that number, and why is it so high compared to my website's number of pages?
Google has been crawling our site a lot more in the past few months. It used to be that there would be a deep crawl, followed by an update (such as a Panda update) a few days later, then nothing much until the next month. Now we see deep crawls over a one- or two-day period every week or so. The monthly graph has gone from zzzzzzzzzzz!zzzzzzzzz to a roller coaster, with higher minimum and average crawl figures than before.
I suppose it's possible that our sitemap has something to do with this. After a long period of having given up on sitemaps, I found a sitemap generator that I liked and began using it late last year. (I've got "change frequency" set to "monthly," for whatever that might be worth.) Could the sitemap be encouraging Google to crawl the site more often? Beats me. (The site should be easy for Googlebot to crawl with or without the sitemap--it's a static site of only 5,000+ pages with mostly evergreen content and plain-vanilla internal links.)
If your webpages do not filter out bad "parameters" or "HTTP query strings", the "ever crawled" number can get very large.
|By the way my website is only 38 pages and it says over 1500 ever crawled when does it find that number and why is it so high in comparison to my website number of pages ? |
If the server responds with a page to a URL like this: http://www.example.com/topic.htmlxxx
your "ever crawled" count is likely to climb and climb.
Google is very resourceful at inventing new bad URLs to access your content.
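The query-string side of this can be tamed at the application level by collapsing junk URLs into one canonical form. A minimal sketch in Python (the `ALLOWED_PARAMS` whitelist and the `canonicalize` name are my own illustration, not anything Google or a particular CMS requires):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical whitelist: only the query parameters the site actually uses.
ALLOWED_PARAMS = {"page", "q"}

def canonicalize(url):
    """Drop unknown query parameters so junk URLs collapse to one canonical URL."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS]
    return urlunparse(parts._replace(query=urlencode(kept)))

# A junk tracking parameter is stripped; a whitelisted one survives.
print(canonicalize("http://www.example.com/topic.html?utm_junk=x&page=2"))
# → http://www.example.com/topic.html?page=2
```

Emitting the result of such a function in a rel=canonical tag (or redirecting to it) keeps the invented variants from counting as separate pages.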
As I mentioned before, https is another source of these "ever crawled" stats. Also, many webhosts create a secret subdomain for accessing your site, something like "YourDomainAcronym.YourWebHostDomain.com". Google finds these "secret" domains, crawls them, and adds them to its index, sometimes before a webmaster has even published the actual site on the desired domain!
Then, to make some money, the webhost makes sure that all the error pages resulting from misspellings and the like have advertisements on them.
How do I make sure my website doesn't have a secret subdomain?
Usually the webhost does tell you the secret subdomain name on their domain.
One method that works for my specific host:
The -site: operator partially removes the webhost's own content from the results.
This search produces 1,470,000 results, mostly excluding my webhost. If I then click on some of these sites, I can find their real domain name, usually in links, or perhaps in the source code itself.
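The search I run has roughly this shape (the host domain below is a placeholder; I'm not naming my actual webhost):

```
site:YourWebHostDomain.com -site:www.YourWebHostDomain.com
```

The site: part matches everything Google has indexed under the webhost's domain, including customer subdomains, while the -site: part excludes the host's own www pages, leaving mostly the "secret" customer subdomains.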
Perhaps many of these websites don't know about this, or they don't care about the duplication in Google's index. My host once claimed to host 700,000 domains; they have since been bought by a conglomerate that owns many other webhosts, which also use secret domains as an aid to new webmasters creating sites. These secret domains can be assets, but webmasters have to take steps to secure them.
I often take a sentence from one of my pages and search for it in quotes (perhaps adding the site: command above). This can pop up the secret pages in Google's index. And, great: I just found another idiot blatantly copying my entire article, which I chose at random! You try to put up free content to help people, and you end up fighting both Google (to get the content to the people who want the help) and idiots trying to take advantage of you!
Excellent, thank you. Is there any command I can type that will show me all the pages Google has indexed (even the ones blocked by robots.txt or in the supplemental index)?