Forum Moderators: phranque
I have a few questions.
Many Web sites (SEO tools, directory scripts) make automated queries to Google to pull useful information for site analysis and ranking reports. The same goes for many other search engines. Google's terms of service say this shouldn't be done.
As an SEO strategist and programmer at an SEO company, I've built a dozen such tools that query many search engines, including Google, either when a user manually submits a form or automatically.
Can someone tell me whether there is another way to do the same kind of thing without violating the search engines' policies?
My second, and most important, question: I'm looking to build a Web directory cum search engine that displays results, supports search, ranks listings, etc. The foundation of the idea is using data retrieved automatically from Google and other engines. I think Alexa, AltaVista and the like also do this. I may have a crawler of my own, but it is not feasible (financially either) to do everything from scratch.
What are your ideas, friends? I hope to hear (and learn) from you. Thanks.
Folks do scrape Google and the other SE's, but it's a cat-and-mouse game as the SE's try to shut you down. But it can be done (no, I don't know how specifically).
Meta search engines do this legitimately somehow - but I'm not really sure how. My guess is they've got an agreement with the SE's, but those more knowledgeable can step in with details. One place to start investigating anyway.
In terms of a search engine/directory, if you're doing it on a large scale here are some suggestions:
- Nutch is an OSS project that provides both a crawler and a search/ranking/index tool. If you're building from scratch, that's a good place to start looking.
- Alexa has made their crawl info available. Some combination of Nutch or your own software with their data could give you something to work with.
Thank you very much for the helpful reply.
I have a commercial search engine project in mind.
I don't expect it to be the work of just a few months, either.
As you know, the Google API is not sufficient for this.
For the last 2 months I have used the Snoopy script from SourceForge to power my employer's SEO tools.
It is a nice script, and I really haven't had to code much, as it has handled everything so far.
I'm just wondering whether such a simple ~37 KB Snoopy file can fulfil my dream!
Also, I know only PHP, and neither Perl nor Python.
Is PHP a suitable language for such a large-scale project?
Besides Google, I need results from Yahoo!, MSN and Alexa as well.
My concept is simple, though: bring something from all of these into one directory.
I've seen many such tools (like Web Position Gold) that send automated queries to Google.
Thousands of Web sites make their 'SEO tools' available to the public, and those tools query Google.
And Google is not banning these?
Can you tell me whether what I'm looking for can be done using LAMP?
Can you give a rough estimate of the bandwidth and database size required to hold information about 1 million Web sites per month?
I'm just exploring and estimating ..
I'm a lonely PHP programmer, and I'm not set on doing it in just 6 months or a year.
First, I think I should get the basic ideas in place ..
>>I'm just wondering whether such a simple ~37 KB Snoopy file can fulfil my dream!
No
>>Is PHP a suitable language for such a large-scale project?
No
>>My concept is simple, though: bring something from all of these into one directory.
As I stated, the meta search engines are doing this. How? Technically it's not hard. How they have permission to do this, I don't know. Somehow they've got the SE's approval.
>>I've seen many such tools (like Web Position Gold) that send automated queries to Google. Thousands of Web sites make their 'SEO tools' available to the public, and those tools query Google. And Google is not banning these?
You'll find that most of these tools are either of very limited use or require you to get your own Google API key and use that. Then you're running on your own 1000 searches per day.
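To make the per-key limit concrete, here is a rough sketch of how a tool might keep itself under 1000 searches a day. It's in Python rather than PHP just to keep it short. The `DailyQuota` class and its method names are made up for illustration; this is not part of the Google API, only local bookkeeping you would wrap around whatever client you use.

```python
import datetime

class DailyQuota:
    """Local counter that refuses work once a per-day limit is reached.

    Hypothetical helper: the 1000/day default comes from the post above,
    everything else is an assumption for illustration.
    """

    def __init__(self, limit=1000):
        self.limit = limit
        self.day = None   # date the current count belongs to
        self.used = 0     # searches counted so far on that date

    def try_consume(self, today=None):
        """Return True and count one search if quota remains, else False."""
        today = today or datetime.date.today()
        if today != self.day:
            # New day: start counting from zero again.
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True

quota = DailyQuota()
if quota.try_consume():
    pass  # run one search with your own API key here
```

Once `try_consume()` returns False, the tool should stop querying until the next day instead of falling back to scraping.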
>>Can you tell me whether what I'm looking for can be done using LAMP?
Nope. You'll need to be looking at C or Java or something a lot faster than a basic scripting language like PHP. (Don't get your pee all hot, I love PHP, but it ain't the tool to use to index and search tens of millions of docs.)
>>Can you give a rough estimate of the bandwidth and database size required to hold information about 1 million Web sites per month?
Sure. 1 million sites, say 10 pages per site, say 50 KB per page. That's 1,000,000 x 10 x 50 KB = 500 gigs of bandwidth a month if you only spider the pages once per month.
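If you want to play with the numbers, the same back-of-the-envelope estimate can be written as a few lines of Python. The page count and page size are the guesses from above, not measurements, and the function names are just for illustration. Decimal units are assumed (1 GB = 1,000,000 KB).

```python
SITES = 1_000_000        # sites to cover
PAGES_PER_SITE = 10      # guessed average, from the estimate above
KB_PER_PAGE = 50         # guessed average page size, from the estimate above
CRAWLS_PER_MONTH = 1     # each page spidered once a month

def monthly_crawl_gb(sites, pages_per_site, kb_per_page, crawls_per_month=1):
    """Monthly download volume in decimal gigabytes."""
    total_kb = sites * pages_per_site * kb_per_page * crawls_per_month
    return total_kb / 1_000_000

def average_mbit_per_s(gb_per_month, days=30):
    """Average sustained rate in megabits per second over a month of `days` days."""
    bits = gb_per_month * 1e9 * 8
    return bits / (days * 24 * 3600) / 1e6

gb = monthly_crawl_gb(SITES, PAGES_PER_SITE, KB_PER_PAGE, CRAWLS_PER_MONTH)
print(f"{gb:.0f} GB per month")                       # prints "500 GB per month"
print(f"{average_mbit_per_s(gb):.2f} Mbit/s sustained")
```

Spread evenly over 30 days that works out to roughly 1.5 Mbit/s sustained, and that is crawl traffic only. It doesn't count the bandwidth for serving searches, and re-crawling more often multiplies it directly.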
>>I'm a lonely PHP programmer;
Sorry, I'm married :).
You're looking at two very different beasts here. A meta search engine can be done in PHP, and isn't a big processing/bandwidth hog relatively speaking. Your only real hurdle is getting permission from the SE's to scrape their results. If you get that, then you're going to query the SE's for their results only when a user does a search - so you don't need the big index/crawler etc. It'll just be a data feed captured when someone runs a search.
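Here is a rough idea of what the "data feed captured when someone runs a search" part could look like once you have the results in hand. It's a Python sketch (the logic ports to PHP easily). The engine names and URLs are placeholders, and how you actually obtain each feed with the SE's permission is deliberately left out. Round-robin interleaving is just one simple way to merge; it is not how any particular meta search engine does it.

```python
from itertools import zip_longest

def merge_results(feeds):
    """Interleave ranked result lists from several engines, dropping duplicate URLs.

    `feeds` maps an engine name to its ranked list of URLs for one query.
    """
    merged, seen = [], set()
    # Take every engine's rank 1, then every engine's rank 2, and so on.
    for tier in zip_longest(*feeds.values()):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Placeholder data standing in for per-query feeds from two engines.
sample_feeds = {
    "engine_a": ["http://example.com/1", "http://example.com/2", "http://example.com/3"],
    "engine_b": ["http://example.com/2", "http://example.com/4"],
}
print(merge_results(sample_feeds))
```

Nothing here is stored between searches, which is why the meta approach stays light: no index, no crawler, just a merge per query.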
If you're going to do your own crawling, now you've got bigger problems. At that point you'll have to worry about bandwidth for searches and crawling as well as some serious processing power to index all that. That's why I suggested you investigate Alexa, they've got all the data crawled (and likely indexed) and plenty of computing power available, cheap.