580,899,012: is this how many pages they have indexed?

I would like to know how to get at the size of their db

         

han solo

6:26 pm on Jan 23, 2001 (gmt 0)



So, any takers?

How do you get that data from Altavista? It occurs to me that given the right range of queries, a merge, etc. one could get a figure.

Is there an easier way? I think I might have found one, but I'd like to hear expert opinions before giving it all away...:)

Cheers,

Han Solo

Brett_Tabke

6:54 pm on Jan 23, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



You got me. I spent 15 minutes trying to come up with something that would work. I used to use common keywords. Even http:// used to work. You could do the variations on ".com" to get a feel for each top-level domain and add them up from there. I can't get anything to come up right now.

rencke

7:01 pm on Jan 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Almost an impossible question. AV seems to have reduced the size of the main index at Altavista.com to make room for more pages, and has done so by moving the bulk of all non-dot-coms into its international indexes. So 581 million may well be right for the main index, with lots more internationally. Example: a search for * and the se-domain will give you 1.5 million pages in the main index, but 6.8 million in the Swedish one. You would have to visit each international site separately, I guess.
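
To put the .se example in proportion, here is a minimal Python sketch using only the two figures quoted in the post. The function and constant names are made up for illustration, and the counts are the poster's, not independently checked.

```python
# Figures quoted in the post above for a "*" search restricted to the se-domain.
SE_PAGES_MAIN_INDEX = 1_500_000      # reported by the main index
SE_PAGES_SWEDISH_INDEX = 6_800_000   # reported by the Swedish index

def coverage_ratio(main_count: int, regional_count: int) -> float:
    """Return the main-index count as a fraction of the regional-index count."""
    if regional_count <= 0:
        raise ValueError("regional_count must be positive")
    return main_count / regional_count

ratio = coverage_ratio(SE_PAGES_MAIN_INDEX, SE_PAGES_SWEDISH_INDEX)
print(f"Main index reports {ratio:.0%} of the .se pages the Swedish index reports")
```

On these numbers the main index shows roughly a fifth of the .se pages, which is why a single main-index count probably understates the total.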

han solo

7:07 pm on Jan 23, 2001 (gmt 0)



Thanx for the replies.

For all of the lucky contestants, go to the listings altavista site at listings.altavista.com, and try searching for

-rse

Don't ask me why this does what it does... does this do something in Unix or OpenVMS I should know about?

And if there are other letter combos that give a higher number I'd like to know about it...but this was the biggest number I could get.

Cheers,
Han Solo

msgraph

7:09 pm on Jan 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I did a url:http:// and received

582,225,646

Everything else I tried is less than this number.

This was on listings.altavista.com
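
If you wanted to script a query like this rather than type it in, the colon and slashes in url:http:// need percent-encoding. A rough Python sketch follows; the endpoint and the parameter name are placeholders (assumptions), not the real search form, which this thread does not describe.

```python
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name, used only for illustration.
BASE_URL = "http://search.example/query"
QUERY_PARAM = "q"

def build_search_url(query: str) -> str:
    """Return BASE_URL with the query percent-encoded into the query string."""
    return f"{BASE_URL}?{urlencode({QUERY_PARAM: query})}"

print(build_search_url("url:http://"))
# http://search.example/query?q=url%3Ahttp%3A%2F%2F
```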

han solo

7:34 pm on Jan 23, 2001 (gmt 0)



Wow!!!

I bow before the wisdom of the mighty...and the grand prize goes to msgraph.

Thanx for the tip.

The question remains: why does this work, and what are we seeing, really? And also, thanks for the information.

Cheers,

Han Solo

msgraph

7:46 pm on Jan 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



No, please.

Well, for one thing it is the "listings" DB; it seems like everything is in this one.

URL:http:// is pulling
http://domain.com - Last modified on: 03-Sep-00 - 28154 bytes - English

As for the -rse displaying 2 million fewer pages, I think it has something to do with what you said (Unix or OpenVMS). Maybe those 2M do not have whatever that code is on their server?
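
For reference, the gap between the two counts can be worked out directly. This Python sketch assumes the figure in the thread title (580,899,012) is the -rse result, which the thread implies but never states outright; the helper name is made up for illustration.

```python
def parse_count(text: str) -> int:
    """Convert a comma-grouped count such as '582,225,646' to an int."""
    return int(text.replace(",", ""))

url_http_count = parse_count("582,225,646")  # url:http:// on listings.altavista.com
rse_count = parse_count("580,899,012")       # assumption: the -rse result from the thread title

difference = url_http_count - rse_count
print(f"{difference:,}")  # 1,326,634
```

Under that assumption the gap is about 1.3 million pages, so "2 million" is a rounded-up figure.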

I guess it now shows that AV pulls whatever it needs from "listings" to fill up the whatever_specific AV indexes around the globe.

CaveToad

3:28 pm on Jan 24, 2001 (gmt 0)



Can you do something similar with Google to see how close the results come to its advertised size? Or with any other engines to see how they compare?
This could be an interesting trick to compare engines and popularity.

Brett_Tabke

4:09 pm on Jan 24, 2001 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Engine had a trick he was using with Northern Light, but I forget what it is.

I knew the http:// thing would work. What is the difference between "listings.altavista.com" and just the stock "av.com"?

man av sucks since they added goto...

msgraph

4:17 pm on Jan 24, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Still looking into a large part of it. But I do know this. What was listed in the top of "listings" under many keywords 2 weeks ago was placed into the top for the Main AV.

I'm keeping an eye out for the next re-index of Main AV to see if they match up the current "listings" results. If they do then it is a good prediction tool to see how you will be listed in the Main index next roundup.

I figure that Main AV pulls about 100M of "listings" results into theirs.

han solo

6:19 pm on Jan 24, 2001 (gmt 0)



Well, I do know that you can find stuff that doesn't make it into the av.com database in the listings database.

The reason for this is some of it is considered spam, and therefore has an added penalty to it...I'm not going to give specifics, but I will say if you search for the highly contested industry term that so many, many companies compete over (just look at the bids on goto to see how crazy some are)...

Then compare the results between the 2 sources. Notice anything changed? Should be...and you can figure out who made the cut, and who didn't...aka, what I believe to be the companies they marked as "spammers", and the ones who play by their rules.

To me, listings represents everything they could store off of the internet, and successfully categorize...I don't honestly believe they have the resources to duplicate what google/fast do with the # of documents they have sorted.

Cheers,

Han Solo

gmiller

5:05 am on Jan 28, 2001 (gmt 0)

10+ Year Member



Interesting... It occurred to me that searching for "http://" wouldn't get all the FTP URLs, so I searched for "ftp://" and got about 8.9 million results. Paging through them, though, it appears they're all web pages with links to FTP files.

Did AV stop indexing submitted ftp URLs at some point? I haven't paid any attention in a long time, but I guess I missed that change.

bartek

3:57 pm on Jan 28, 2001 (gmt 0)

10+ Year Member



>Engine had a trick he was using with Northern Light, but I forget what it is.

I think it's the "search or not search" search, since NL supports boolean searches. Won't work on AV though, due to their timeouts and only partial support for boolean logic.
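
The reason a "term OR NOT term" query can report the size of the whole index is that every document either contains the term or does not. A toy Python sketch of that logic is below; the three-document corpus is invented for illustration and says nothing about how NL actually evaluates queries.

```python
# Toy corpus: document id -> text. Purely illustrative, not real index data.
documents = {
    1: "altavista index size",
    2: "search engine comparison",
    3: "northern light boolean search",
}

def matching(term: str) -> set:
    """Return the ids of documents whose text contains the term as a word."""
    return {doc_id for doc_id, text in documents.items() if term in text.split()}

def or_not(term: str) -> set:
    """Evaluate 'term OR NOT term': the hits plus every document that is not a hit."""
    all_docs = set(documents)
    hits = matching(term)
    return hits | (all_docs - hits)

print(len(matching("search")))  # 2
print(len(or_not("search")))    # 3, the whole corpus
```

Whatever term is used, the result set is the full corpus, so the reported hit count equals the number of indexed documents, provided the engine really evaluates the NOT branch.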

I searched for http on the power search [altavista.com] with "The search should include all the words".
Search found 0 returns, but ignored 1,098,877,583. These are the results using Opera and Netscape.

The kicker is that the same search with IE "ignores" only 383,496,896.
Did it about 3 times with consistent results...
Go figure...

han solo

3:15 pm on Jan 29, 2001 (gmt 0)



Interesting, just tried it, and netscape and explorer are both giving me the small number.

That would be interesting, though. Why don't they publicize that they have a billion documents, too? I'm thinking they might not, can anyone else duplicate the feat of getting the billion to show up?

Thanks for the info, I'll try it again later, to see if the numbers change.

Cheers,

Han solo