Technically speaking, how does googlebot crawl?

heretic

9:17 pm on Nov 2, 2002 (gmt 0)

10+ Year Member



If I were Googlebot, this is how I would crawl the web. If someone knows the real way, let me know how far off I am:

Assumption: I have a database of urls as a starting point.

1. I would start with the first domain, crawl the index page, and put all links to the same domain into one list (let's call it List A) for immediate crawling, and all links to outside domains into a second list (List B) for future crawling.

2. Starting with List A, check that I haven't already crawled the link, then follow it. Go to #1.

3. Once I'm done with this domain, move on to domain #2 in my list. Go to #1.

4. Once I'm done with all the domains that I started with that were in my database, start to crawl the domains from List B. Go to #1.

Once the domains are done, figure out PageRank etc., run through the standard list of tests for the new index (especially checking the top few hundred search terms), and go live.

5. Send a freshness bot to recheck all sites that had a timestamp within the last couple of days. Decide how frequently to visit. Don't touch the PR until next month, though.

How did I do?
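
To make steps 1 to 4 concrete, here is a minimal Python sketch of that loop. It is only an illustration of the scheme described in this post, not how Googlebot actually works. The names `crawl`, `domain_of`, and `fetch_links` are hypothetical, and `fetch_links(url)` is assumed to return the absolute URLs found on a page; here it is backed by a small in-memory dictionary rather than real HTTP requests.

```python
from collections import deque
from urllib.parse import urlparse


def domain_of(url):
    """Return the host part of a URL, used here as the 'domain'."""
    return urlparse(url).netloc


def crawl(seed_urls, fetch_links):
    """Crawl domain by domain and return the pages in crawl order.

    fetch_links is a hypothetical callable: url -> list of absolute URLs.
    """
    pending = deque(seed_urls)  # seed database first, then List B entries
    queued = set(seed_urls)     # everything ever placed in `pending`
    seen = set()                # pages already crawled (step 2's check)
    crawled = []

    while pending:
        start = pending.popleft()
        domain = domain_of(start)
        list_a = deque([start])  # List A: same-domain links, crawled now

        while list_a:
            url = list_a.popleft()
            if url in seen:
                continue
            seen.add(url)
            crawled.append(url)

            for link in fetch_links(url):
                if link in seen:
                    continue
                if domain_of(link) == domain:
                    list_a.append(link)
                elif link not in queued:
                    # List B: outside links wait until the seeds are done
                    queued.add(link)
                    pending.append(link)

    return crawled


# A toy three-site web standing in for real pages.
web = {
    "http://a.example/": ["http://a.example/about", "http://b.example/"],
    "http://a.example/about": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
    "http://c.example/": ["http://a.example/"],
}

order = crawl(["http://a.example/", "http://b.example/"],
              lambda url: web.get(url, []))
print(order)
```

Because List B entries are appended behind the seeds in the same queue, every seed domain is finished before any newly discovered domain is started, which matches step 4. The `seen` set is what keeps pages that link back to each other from being crawled twice.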

Yidaki

9:30 pm on Nov 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you know programming languages other than html and seo? ;) I'm open to new employees ... Since I programmed my own robots, I know there are a lot more things to do. Many of them are really hard to figure out, and a lot are impossible to fix. Thinking about how a crawler technically works is a good way to understand a search engine and why it always seems to webmasters that search engines are paranoid about spam issues.

heretic

9:55 pm on Nov 2, 2002 (gmt 0)

10+ Year Member



Actually I don't know SEO anymore. I knew a bit when it didn't yet have a name and Google wasn't even a thought in Larry's or Sergey's brain yet...

I dabbled in a few languages like PHP/Perl and some databases, but I'm not looking for a job, sorry.

There is of course a lot more to a crawler than that. They need to check the status of pages, worry about whether they are crawling human-generated (or "based on human-generated") content, make sure that they are not crawling into an infinite loop, not to mention a million spam-related issues... I'm sure it's really, really tough. I bet 10% of the job is dealing with the WWW's good citizens, who make up 99.9% of the people. And 90% is dealing with the .1% of spammers and the .01% of very, very clever spammers with lots of very, very clever techniques up their sleeves.

If I were them, I'd cut off crawling any content within a domain beyond a certain number of links and have a human check those domains to make sure the content isn't dynamically generated.
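
A rough Python sketch of that cutoff idea, purely as an illustration: the `DomainBudget` class, its `allow` method, and the choice to count pages fetched per domain are all my own assumptions, not anything a real search engine is known to do.

```python
from collections import Counter
from urllib.parse import urlparse


class DomainBudget:
    """Cap the pages crawled per domain; flag domains that hit the cap."""

    def __init__(self, max_pages_per_domain):
        self.max_pages = max_pages_per_domain
        self.counts = Counter()    # pages allowed so far, per domain
        self.needs_review = set()  # domains a human should look at

    def allow(self, url):
        """Return True if the URL may be crawled, False once over the cap."""
        domain = urlparse(url).netloc
        if self.counts[domain] >= self.max_pages:
            self.needs_review.add(domain)
            return False
        self.counts[domain] += 1
        return True


# An endless calendar is a typical way to crawl forever on one site.
budget = DomainBudget(max_pages_per_domain=2)
decisions = [budget.allow(f"http://cal.example/?day={n}") for n in range(1, 4)]
print(decisions, budget.needs_review)
```

A crawler would call `allow(url)` before each fetch and skip the page when it returns False. The flagged domains are only candidates for a human check; hitting the cap does not by itself say the content is bad.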

Yidaki

10:18 pm on Nov 2, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Sorry, heretic, I didn't want to lecture you or treat you like an SEO kid, seriously! But, as you said, there are some big problems you'll have when you run a crawler-based search engine, and not only regarding spam. The biggest problem I see is how to separate the real cheating from "honest mistakes". There are many webmasters who have great (often the greatest) content but have no idea about SEO or even about the factors that get a site penalized. You, robot, should separate it, huhh ...

<added>BTW, "dynamically generated" is a phrase like "search engine optimization": why good? why bad? I wouldn't like to have my dynamically generated sites cut off! My content sits inside a well-organized database. This doesn't mean it's no good. But if you speak of dynamically generated subdomains (wildcards) or generated doorway / simply cheating pages, I'd agree!</added>

heretic

10:47 pm on Nov 2, 2002 (gmt 0)

10+ Year Member



Hi Yidaki,

I think we may have a little miscommunication... I wasn't offended at all. I was just saying that I know very little about SEO. I'm just enjoying this for the fun of it... I like to understand the technology...

I'm also not building a robot or crawler. Google seems to do a great job :)

But I'm just curious how their crawler works...

Sounds like you have built your own crawler? Just for fun? You have a search engine?