If they are a legit site, it should be up to them to provide a reasonable search engine once you get there. I think IBM can afford to index their own pages with a site search. Maybe that is pea-brain logic as well.
And I seriously doubt Amazon has 2 mil+ real pages. Crawling a database and calling the results real pages is also a dumb thing to do IMO.
It is true that, thanks to weaknesses in current algos, the more pages a site can create for the same content, the more hits it will get in most cases. Moreover, for many sites, there is the added bonus of exposing visitors to more ads.
For example, recently I visited a very highly ranked and highly respected site. It ranked #1 for <widget> <operation>. I went to a page; it had widget-industry-related words, all clickable. I clicked on the word <widget-word> and was taken to a page whose entire content was:
<widget-word>: <widget-word> is a part of <widget> that controls its <function>.
Besides this 10-word content, the page had AdSense at the side, sponsor links at the bottom, various other partner and sponsor links here and there, banner ads, etc. to make it a full-sized page. Anchor texts, brief descriptions of the links, anchor texts for internal links, and so on seemed designed to do only one thing - come out on top of SE searches.
I think if SEs stated that only sites with 100 pages or fewer would rank in the top ten in commercial categories, the very next day all these mega sites would shrink down to 100 pages while retaining all their content.
When I checked this morning it was already being found for some search terms. It does not yet have a high ranking but this is really excellent performance from Yahoo.
Some people have been doing Google-SEO for years and built a portfolio of high-PR sites on different IPs that they can use to quickly create links and push new sites into the Google index.
People with somewhat less experience and less power at their disposal get fewer links.
The group that says "Google is faster" is group 1, the group that says "Yahoo is faster" is group 2.
Personally, I don't think that Yahoo's results will suffer for it. There are probably a lot of quality sites in group #2.
It's pure speculation on my part, but might explain the difference.
Google, Yahoo, MSN, etc, etc should not be deep crawling a 100k+ page site. It is just plain stupid.
Seems to me that a site with dynamic content fed from a DB (news site, directories, e-commerce catalogs, etc.) should indeed be indexed that deeply, as the deeper the crawler goes, the more localized and specific the user searches it can satisfy.
Our site consistently gets over 300K pages indexed by Google, and people searching Google for very specific or localized terms in our industry are greeted with a page/link that is EXACTLY what they were looking for, because the pages are driven by the DB and the search terms entered in Google map precisely to the query parameters fed to the engine that displays the pages on our site.
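To make that mapping concrete, here is a minimal sketch of the pattern (not our actual code; the table, columns, and filter names are hypothetical): a crawled URL's query string is whitelisted, turned into a parameterized DB query, and rendered as the page the searcher lands on.

# Minimal sketch: turning URL query parameters into a DB-driven catalog page.
# Table, columns, and filter names are hypothetical.
import sqlite3
from urllib.parse import parse_qs

ALLOWED_FILTERS = {"color", "style", "category"}  # whitelist; never put raw user input in SQL

def render_catalog_page(query_string, db_path="catalog.db"):
    params = {k: v[0] for k, v in parse_qs(query_string).items() if k in ALLOWED_FILTERS}
    where = " AND ".join(f"{k} = :{k}" for k in params) or "1=1"
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(f"SELECT name, description FROM products WHERE {where}", params).fetchall()
    items = "\n".join(f"<li>{name}: {desc}</li>" for name, desc in rows)
    return f"<html><body><ul>{items}</ul></body></html>"

# A crawler following /catalog?color=pink&style=reverse-dangle gets a page built from
# exactly the rows matching those terms, so a specific search lands on a specific page
# rather than the home page.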
Therefore there is a very good reason for googlebot to deeply index dynamically driven sites. I have found it unfortunate that Yahoo!'s new(?) crawler has yet to exhibit the same behaviour. Our company made the unfortunate mistake of signing up for SiteMatch, and urls that had previously been free-crawled are now completely gone from the index.
Repeated conversations, emails, and correspondence to Overture/PositionTech/Yahoo have been mostly in vain; after 2 months of harassing PositionTech to get an answer we were finally put through to an insider at Yahoo. After another 2 weeks, we were in the index, but no deep crawls even after 3 months now. Very disappointing, considering we submit (manually) urls for key directory and content pages on a daily basis.
While I understand both camps (the ones who like G and the ones who like Y), in our experience, our specific industry has been hurt significantly by Yahoo!'s inability to remove even the most egregious spammers from its index in our categories. Even our best competitors are suffering as spam sites simply flood (by Yahoo and PositionTech's own admission) the first 2-3 pages of results in Yahoo.
I do believe there are some categories/industries where G delivers better results and some where Y does. It may just be a waiting game until both companies clean up those categories which are suffering.
Oh, and for those who believe that Y does not manually manipulate its index, we have seen hard evidence to the contrary.
Sorry for rambling...
I think if SEs stated that only sites with 100 pages or fewer would rank in the top ten in commercial categories, the very next day all these mega sites would shrink down to 100 pages while retaining all their content.
What a ridiculous notion. Our site has over 350 products, each of which has its own page with a more detailed description (linked from that product's category).
By your reasoning... because we have a selection larger than 100 products, we would somehow get penalized?
I didn't write about products but pages. Is there anything preventing you from showing all the 350 descriptions on one page?
What if Google gave a bonus for one-page sites? Would you then do it? Do the current 350 pages - one for each product - exist because of SE rankings, knowing that each page can be optimized for its product?
It's just a silly thing to say. iluvsearchengines, go have a look at microsoft.com, hp.com, yahoo.com, perhaps even google.com, and tell me how much of the information people would want is on those main pages.
My main page is the LEAST active page on my website. Why? Because it is about us instead of our products, just like any real business that offers a large volume of services and products.
It would be very hard to find a driver or opinion from a forum or any info from a large site if this was the case.
I do agree though that these 'advertisement trap' sites shouldn't even be in the index...
I didn't write about products but pages. Is there anything preventing you from showing all the 350 descriptions on one page? What if Google gave a bonus for one-page sites? Would you then do it? Do the current 350 pages - one for each product - exist because of SE rankings, knowing that each page can be optimized for its product?
NO...the individual product pages DO NOT just exist for the search engines. Putting 350 products on one page (with accompanying thumbnails) would not only make that page non-user-friendly (due to load time, etc.), but would also exceed Google's current "guideline" that pages stay under 100k.
Further, with a growing product line, those 350 products grow in number by 10+ new products per week. By your reasoning, we should just keep slapping them all up on one page regardless of size, etc.
Our site is designed (as is any SUCCESSFUL commercial site) with the user in mind. Great rankings do little for you if the users don't enjoy the experience. Factors such as load time, ease of navigation, and verbose product descriptions give us a robust 3.8% conversion rate.
Having said that, for a search engine to truly be "GREAT", it needs to be able to judge a site as a human would. If a site is of high quality, is on par with and/or better than its competition, and results in a great user experience, then the engine in question should rank that site accordingly. Until they get that right, it's all a moot point.
My comment was more geared towards 1 million page sites, one page for each 'product' - blue small deluxe widget, green small deluxe widget, ..., red medium deluxe widget, ... yellow medium super-deluxe widget, ...
Just like email spamming, where the cost of sending any incremental email is almost zero but even one in ten thousand conversions makes them profitable, having a million- or two-million-page website makes economic sense. However, if Google started charging some annual fee per page, say $1/page for every page after the first 100 free pages, I would expect many site owners to get only 100 pages indexed thus easing the load on Google.
However, if Google started charging some annual fee per page, say $1/page for every page after the first 100 free pages, I would expect many site owners to get only 100 pages indexed thus easing the load on Google.
This is a great suggestion even though we know it won't happen. It would be a surefire way of reducing the amount of **** on the net. People would be much more selective about what they uploaded and obviously the value of what was made available would greatly increase.
However, as Google found new backlinks my sites steadily moved up the results to where they're now all in the top 5 for their respective keywords. Yahoo on the other hand, has not changed the ranking from where the sites first appeared - which is not in the top 100.
So I would say that Yahoo's crawling and indexing is comparable to Google, but its ranking is much slower.
However, if Google started charging some annual fee per page, say $1/page for every page after the first 100 free pages, I would expect many site owners to get only 100 pages indexed thus easing the load on Google.
So sites like this one would have to pay hundreds of thousands of dollars or none of this info would be indexed on Google? That would suck.
Is it necessary for all the pages to be indexed?
In terms of the users, I think yes.
When one of our visitors types "pink reverse dangle widget" into Yahoo, and our page comes up for the product that matches that description perfectly, the user is better served (IMHO) than if they had to go thru the main page of our site, find the relevant category, and drill down thru a few pages to find the product they want.
Let us not forget also that the internet is (and should be) one of the world's best educational tools. As a former student myself, I think saying that "all pages of a site don't need to be indexed" severely limits the potential of the net as a whole.
Further, as searchers get more and more "educated" about more advanced search techniques, their searches are getting more and more specific, leading to the need to be able to serve up EXACTLY the pages/products that are relevant.
How would you feel if a librarian simply pointed to a row of books as opposed to helping you find that obscure text you REALLY need? Limiting the number of pages indexed on a site is doing that very thing. I don't need Google to tell me that SOMEWHERE on that site is what I need. I need it to tell me the EXACT page it is on. Isn't that what a search engine is for in the first place?
A better solution would/will be to finally eliminate all the affiliate/spam sites (which I know is a never ending battle), which will in itself bring about much greater relevancy.
Frankly, I don't want the engines to index less... I want them to index more... much more.
Seems to me that a site with dynamic content fed from a DB (news site, directories, e-commerce catalogs, etc.) should indeed be indexed that deeply, as the deeper the crawler goes, the more localized and specific the user searches it can satisfy.
If you had 5,000 people working for your company you may have a point. But in every case I can think of, companies with DB driven 'local' sites are just plain spammers trying to hit every city of every state. One little nerd doing this can create a mess if he makes several of these sites for several industries.
Here is an example. Let's say xyz book store makes a page for every single word in the English language. You do a search for 'best internet job site new york'.
In the top result you get the book store with a DB driven page that says...
Looking for the book 'best internet job site new york'? We have tons of books about 'best internet job site new york'. Click here for books about 'best internet job site new york'.
You click their link and land on their site, and it says, 'Sorry, could not find your search phrase. Please try another search.'
Another example, the city search spam DB site. They claim to be the number one city search site in the US. They have every city. You do a search for 'city anything' or 'anything city state' and they come up!
Trouble is, they have no content, just links to national data pretending to be local. Maybe a local news station link or a few generic links they ripped such as the local library. The local news section has national stories ripped from a national news feed, etc.
spam spam spam
Those are not local sites, they are spam. Google is stupid to index databases IMO. They should only index top level real pages.
Also, if I want to find something on MSN I go to MSN search. I don't need Google to index MSN.
If I want to find a cruise to Mexico I will go to a cruise site and search, I don't need Google or Yahoo or anyone else to do that for me. Search engines should not be in the business of trying to find spam. They should be trying to find sites.
People assume, and rightly so, that a search engine with a larger index will give them more pages, sites, and options relating to their search.
If they didn't crawl DB pages, then their index would drop to (in a lot of cases) useless homepages and outdated static pages.
How often do you do a search and get directed to a home page, unless it's a company, site or brand you're searching for? In which case that's the best result.
iluvsearchengines, if you want to search yourself, then why do you use a search engine like Google? Why don't you just guess which domains have information about your search and then find the pages? Maybe in your industry it is feasible to just have home pages listed and let people fiddle around looking for the correct page on the site, but when I do a search I want results that match that search - and most of that info is nowhere near the homepage.
Do a search on an ASP function and you will find pages that match that query all on the first page, none of which are homepages. Try searching "Response.Write" or "windows xp benchmarking" for examples...
Anyone can create a huge homepage with a billion keywords on it, which is exactly what would happen with crawler patterns like that.
Anyone can create a huge homepage with a billion keywords on it, which is exactly what would happen with crawler patterns like that.
Anyone can create a billion-page DB keyword-driven virtual website much more easily, since single pages are limited to about 100k of indexability from my understanding.
And that is exactly the problem. You get a little nerd who learns how to create unlimited pages with a DB and he will always go nuts, pushing as many keyword-spammed pages as he can into a website. I have downloaded a few of their websites and I see what they are doing. (There are tools that allow you to do this.)
If you have the right page-generating tools you could make a million-page website in a few hours. Just create a template page or two, insert some fields, point it at the SQL database, and start hammering those spam pages out. Instant mega site.
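As a rough illustration of how little effort that takes (the keyword table and template here are made up, but the pattern is the same one those page-generator tools follow):

# Rough sketch of template-driven mass page generation from a database.
# Table name, column, and output layout are invented for illustration.
import sqlite3
from pathlib import Path

TEMPLATE = """<html>
<head><title>{keyword} - Widget Megastore</title></head>
<body>
  <h1>{keyword}</h1>
  <p>Everything you need to know about {keyword}. Browse our huge {keyword} selection.</p>
</body>
</html>"""

def generate_pages(db_path="keywords.db", out_dir="site"):
    Path(out_dir).mkdir(exist_ok=True)
    count = 0
    with sqlite3.connect(db_path) as conn:
        for (keyword,) in conn.execute("SELECT phrase FROM keywords"):
            slug = keyword.replace(" ", "-")
            Path(out_dir, f"{slug}.html").write_text(TEMPLATE.format(keyword=keyword))
            count += 1
    return count  # one thin page per database row, at near-zero marginal cost

Point it at a big enough keyword table and there is your instant mega site.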
If I were to do the same thing and generate 1 million emails in a few hours, most people would call that spam.
I am not saying just index the main page. I am saying index the top level real pages of a site. That could be a few hundred pages maybe. If they have real hand made pages, I would not be opposed to thousands being indexed. But hundreds of thousands of DB driven pages like I mentioned are worthless spam IMO.
If I want technical info I go to www.dejanews.com (yes I used them before Google bought the company) or I go to MSN and search for 'windows issue blah blah blah'
I go to the source and search there. I understand what you are saying and there are certainly two sides to the coin. I just don't personally think deep crawling DB driven sites accomplishes as much as it hurts. There is more spam down in those deep crawls (from punks with DB page creating technology) than data, IMO.
I am not saying just index the main page. I am saying index the top level real pages of a site. That could be a few hundred pages maybe. If they have real hand made pages, I would not be opposed to thousands being indexed. But hundreds of thousands of DB driven pages like I mentioned are worthless spam IMO.
Absolutely!
It is unreasonable for commercial sites with hundreds of thousands of product pages to expect Google to find and present the exact information within their sites that their visitors want. Why should they? If you have a site with a million pages you should have the capability to do this yourself through good design and an efficient search facility.
Do you really expect Google to provide the hardware to cache a million pages of advertising on your site for nothing? If and when Google are forced to stop doing this you will see better content because webmasters will be driven to provide more balanced, useful information on the pages that are indexed.
The vast majority of sites of this size are purely commercial so they should be responsible for guiding their visitors wherever they want to go. @mazon have four million pages in the index. This is totally unnecessary because it is a well designed site. Even a complete drongo should be able to find what they want there within two or three clicks.
Anyway, this is getting off topic!