In the past year, Google has received wide press recognition and praise for the quality, relevant results that they serve to web surfers.
One of the major reasons for this success is their PageRank Technology [citeseer.nj.nec.com], developed by Sergey Brin and Lawrence Page. PageRank is a technology that scores web pages by how "important" they are in relation to other web pages.
Another major reason is that Google does not over-populate, or flood, their search results pages with two or three third-party databases like various other popular search engines. With the exception of data collected from the Open Directory Project [dmoz.org], Google can claim that their results, both advertised and crawled, are their own.
Google is now in a position to take their search engine to another level that can make them soar even higher above their competitors. A lot of speculation has arisen regarding what lies ahead in Google's future. Much of it concerns how and when Google will display personalized results.
This paper examines a few of the methods that Google might employ to provide personalized results. These methods are examined in great detail because last year Google acquired a company, Outride, Inc., that researched heavily into personalizing data. Throughout this paper, I will examine some of the research, people, and technology behind Outride's past projects and software systems. We cannot know how Google will implement this technology in their search system, but at least we can look into the various possible uses that it can offer.
Google and Outride
In December 2000, a press release announced the mission of a barely known company, Outride, Inc., whose main goal was to provide the internet community with a system that would deliver highly relevant personalized results.
September 20, 2001
Google Acquires Technology Assets of Outride Inc [google.com]
Outride, Inc's Founders
Dr. James E. Pitkow - President and co-founder of Outride, Inc.
Dr. Pitkow, previously, was a research scientist at Xerox (PARC).
Dr. Hinrich Schuetze - VP of Advanced Development and co-founder of Outride
Dr. Schuetze was also a former research scientist at Xerox (PARC). His specialties are in language processing and information retrieval.
Dr. Todd Cass - VP of Engineering and co-founder of Outride, Inc.
Like his two other colleagues, Dr. Cass was a research scientist at Xerox (PARC). Dr. Cass specializes in computer vision and pattern recognition.
Since all three of Outride's founders came out of Xerox (PARC), the source of their relevance technology must lie somewhere within PARC's research center as well as in various patents. Here is where we shall begin our journey into Outride's technology.
Looking at Outride's founding goals, the best candidate to dig up information on is Dr. Jim Pitkow. Dr. Pitkow is a recent alumnus of the User Interface Research Group at Xerox (PARC).
The three main projects of this group are:
- Information Foraging
- Information Visualization
(edited by: msgraph at 5:25 pm (utc) on Feb. 13, 2002)
To understand a web user's behavior and the usability of a site, you have to develop a way to visualize it.
... there has been a rapid expansion in on-line information, creating a need for computer-aid in finding and understanding them. Information Visualization is a form of external cognition, using resources in the world outside the mind to amplify what the mind can do.
One key point of the paper is the analysis of how links are structured both inbound and outbound. In a fashion similar to Google, they look at what page has the most inbound links from outside sources to determine if that page is the preferred point of entry.
Another key point is that they have researched the method of how to determine similar goals across a spectrum of users. They want to analyze a sorted list of keywords to determine the goals of similar groups of users. This is another clue on predicting relevance and it will also reflect on Outride's claims further on in this paper.
SYSTEM FOR WEB SCENT VISUALIZATION
What this section refers to is how they analyzed two data sets so that their application could predict a user's surfing pattern as closely as possible. One system is set up to analyze how real users traverse through a web site in order to find the information they require. The next system is to replace that user with a simulated user to see if it performed just as well, if not better.
For example, let's say they monitored a user who is surfing a web site looking for alpha widgets. The system might analyze the number of links this user had to pass through as well as how the pages were related to widgets.
Now that they have this model analysis set up of an actual user's trails, they can apply a simulated user to this model to look for alpha widgets too. This simulated user analyzes the first page it comes across for any information related to alpha widgets. If there is not a large amount of text on this page related to alpha widgets, the simulated user will look for bits of info in the form of links. This simulated user will most likely follow the link that is somehow related to alpha widgets. This does not necessarily mean that the link has to contain the words alpha widgets; most likely a related word would be sufficient.
Let's say you are surfing a web directory, such as the Open Directory Project, and you are looking for web sites with information on the topic of HTML editors. A real user will click on the Computers/Software link located on the main index page even though the words HTML editors do not exist on that page. The simulated user is set up to know this as well. The system knows the words are related and will therefore follow that link.
After the simulated user finds a page containing information on alpha widgets, the two data sets from both users would be compared.
Think about this: if the results were very consistent with each other, it would be a great success. They could apply any information goal to this system and have it traverse any site to find the related information. The sites that offered the best information in the shortest time would be the leaders for that information goal.
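To make the simulated user a little more concrete, here is a minimal sketch in Python. It is not PARC's code. The toy site, the hand-made table of related words, and the names SITE, RELATED, scent and simulate are all assumptions made up for illustration. The simulated user simply follows whichever link carries the strongest hint toward its goal, and the trail it leaves can then be compared with a real user's trail.

```python
# Hypothetical toy site: page -> {link text: target page}. Not from the paper.
SITE = {
    "home": {"Computers and Software": "software", "Arts and Music": "arts"},
    "software": {"HTML editors and web tools": "editors", "Games": "games"},
    "arts": {},
    "games": {},
    "editors": {},
}

# Hand-made table of related words, standing in for whatever the real
# system uses to know that "software" is related to "HTML editors".
RELATED = {"html": {"software", "web"}, "editors": {"software", "tools"}}


def scent(link_text, goal):
    """Count how strongly the words of a link hint at the goal."""
    words = set(link_text.lower().split())
    score = 0
    for term in goal.lower().split():
        if term in words:
            score += 2  # a direct match is the strongest cue
        elif words & RELATED.get(term, set()):
            score += 1  # a related word is a weaker cue
    return score


def simulate(start, goal, target, max_clicks=5):
    """Greedily follow the strongest-scent link and return the trail."""
    trail, page = [start], start
    while page != target and SITE[page] and len(trail) <= max_clicks:
        text = max(SITE[page], key=lambda t: scent(t, goal))
        page = SITE[page][text]
        trail.append(page)
    return trail


simulated_trail = simulate("home", "HTML editors", "editors")
real_trail = ["home", "software", "editors"]  # made-up log of a real visitor
trails_agree = simulated_trail == real_trail
```

In this toy case the simulated user clicks "Computers and Software" on the index page even though the words HTML editors do not appear in that link, which is the behavior described in the directory example above.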
Let's look a little bit more into how they might achieve this....
Network Representations of CUT
The similarities, the number of users following the same links, and the actual links themselves could dictate the relevancy of pages in relation to the interest of the user. If you have links pointing from a page on Alpha widgets to Beta widgets yet the content does not have any noticeable similarities, you may lose points on how usable your site is to a user. A page should not be cluttered with irrelevant links to irrelevant pages, at least not internally within a site. That could give a user more information to dig through, which would then mean the user has to spend more time finding what they want.
(edited by: msgraph at 2:28 am (utc) on Feb. 13, 2002)
The results of the case studies are a good reassurance that some links should not be under-valued. With some of the examples above in previous quotes, it appears that only relevant links should be contained within relevant content pages. This is not always the case. A site should have a well-structured site map even though the traffic to this page is poor. A site map is a good backup for finding content that cannot be easily found by traversing various links. So basically, have a map located off the main page that will lead to the pages with high levels of importance.
Another interesting topic they focus on is how users react to other information they come across while surfing a site. While navigating a site, they might find some related info that catches their eye and they branch off to check it out. After their curiosity is fulfilled, they resume their journey to seek out the information they originally wanted to find.
Well what this tidbit states is that when a user branches off to some related page, there should be a link on that page that makes it easier for the user to find their interest. This way a user will not have to hit their browser back button or click on a link to return to their previous page. Possibly the best way to do this would be to have highly desired related links on all related pages in order to keep a surfer moving forward on relevant topical information. You do not want your users to move back to an origin unless they desire something on another topic, for which you always provide that link just in case.
Recommended Reading 2
Ed H. Chi, Peter Pirolli, Kim Chen, James Pitkow. Using Information Scent to Model User Information Needs and Actions on the Web. In Proc. of ACM CHI 2001 Conference on Human Factors in Computing Systems, pp. 490--497. ACM Press, April 2001. Seattle, WA. [www-users.cs.umn.edu]
This paper can get pretty technical but it is a good read nonetheless. The main subject that should be digested is the WUFIS algo (Web User Flow by Information Scent.) If Google uses Outride's technology like it is supposed to be used then this could come in handy. Knowing this will be very useful when wanting to boost the usability of a site. The main function is to determine the quality of links and how a user might traverse those links while in search of their wants and needs.
The main details and descriptions of the WUFIS algo are listed on pages 2-5 in the PDF document linked above. Here are some basic translations of the important variables' functions. If you plan on reading the translations below, you should read the pages mentioned or at least have them open in another window. If you have no interest in this subject then skip over it to the next section.
Don't worry, it is a lot simpler than it looks on paper. If you look at Figure 2 on page 3 you will understand it better. Here is what they do:
1. They extract all the content and links from a web site. Almost all search engines do this when they grab your pages.
2.(T) From the links that they extracted from your site, they create the linkage topology. This means that they create a sort of "family tree" design of your link structure. A layout to know what links point to what pages in your site.
3.(W) This is how many times a keyword(s) appears in your page. Think of keyword density here.
4.(TF.IDF) "Term Frequency by Inverse Document Frequency". This weights a keyword by how many times it occurs in a page, discounted by how many pages in the set contain it.
5.(WTF.IDF) They will look at the density of keywords in PageA and calculate how many times these keywords occur throughout the rest of your site. This would be used to figure out how much weight they have in relation to the information you provide.
6.(i,j) For example, let's say they want to figure out how important KeywordA is in relation to PageA. All they have to do is apply it to the formula (WTF.IDF) and bingo, they have it.
7.(Q) This would be WebSurferA's information need. WebSurferA is looking for information on KeywordA.
8.(K) These would be some sort of hints, either within or surrounding a LinkA, that would provide a clue into what LinkA would take UserA to if they clicked on it.
9.(PS) With all this information above they are able to determine the user's need within a link. First they figure in UserA's need for KeywordA. Since it is figured that LinkA is the most likely source to have information on KeywordA, they combine the two to form PS.
So to sum this all up, first they need to set what UserA's needs are. UserA's needs are KeywordA. It is figured that PageC offers the best amount of information on KeywordA. On PageA they find that LinkA has the highest probability of leading UserA to PageC.
When UserA reaches PageB, the calculation shows that LinkX has the highest probability of leading UserA to PageC.
The site is determined to have a high rate of usability for KeywordA.
Now let's say that while running their simulation, they discover that LinkX on PageB does not provide enough clues to UserA that PageC has information on KeywordA. The system knows that LinkX will lead to PageC yet they determine that if UserA is a real person they might not realize it right away. This is because the calculation of K on LinkY and LinkZ shows the same amount of hints that their links will lead to information on KeywordA.
The usability of a site in this scenario is a lot worse than the previous one.
There is a flaw to all this and it has to do with image-based links. It is explained in more detail at the end of page 3 and at the beginning of page 4.
Also be sure to read up on the spreading activation algo they describe on page 4 as well. That will come in handy down below when it gets mentioned again.
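The two scenarios above can be played out with a small Python sketch. This is a simplification for illustration only, not the exact WUFIS formulas from the paper: it weights the hint words around each link with a plain TF.IDF, adds up the weights of the words in the user's need, and normalizes the totals into click probabilities. The function names and the sample link texts are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: {name: text}. Returns {name: {term: TF.IDF weight}}."""
    counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
    n = len(docs)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in c.items()}
        for name, c in counts.items()
    }


def link_probabilities(link_cues, need):
    """link_cues: {link: hint text in or around the link}; need: query string.

    Scores each link by the TF.IDF weight of the need's terms in its hints,
    then normalizes the scores so they sum to 1.
    """
    weights = tfidf(link_cues)
    terms = need.lower().split()
    scores = {l: sum(w.get(t, 0.0) for t in terms) for l, w in weights.items()}
    total = sum(scores.values())
    if total == 0:
        # No link stands out: every link looks equally promising.
        return {l: 1 / len(scores) for l in scores}
    return {l: s / total for l, s in scores.items()}


# First scenario: only LinkX hints at the need.
clear = link_probabilities(
    {"LinkX": "alpha widgets catalogue", "LinkY": "company history", "LinkZ": "contact us"},
    "alpha widgets",
)

# Second scenario: all three links carry the same hint, so none stands out.
unclear = link_probabilities(
    {"LinkX": "widgets info", "LinkY": "widgets page", "LinkZ": "widgets list"},
    "alpha widgets",
)
```

In the first case all of the probability lands on LinkX. In the second case the shared word gets an inverse document frequency of zero, the three links tie, and the simulated user has no better than a one-in-three chance of picking LinkX.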
Archive.org's stored version of Outride.net's former site [web.archive.org]
The Outride SurfAlong product is what really catches the eye on how Outride's tech might fit into Google. Outride claimed that this package would be the most optimal software to use with their relevance technology. It was designed to come in the form of a browser add-on product.
If Google implements this technology it might very well fit in with Google's toolbar.
Below is an analysis of Outride's publication (see note 1) on this bookmark software system. However Google uses this technology, if they use it at all, this analysis should be taken as a way to predict and prepare for what is to come.
Bookmark software system
As mentioned above, the software is meant to be integrated within a web browser. This is more feasible because it would cut down on Google's server processing. They would be able to track users' surfing patterns as well as make it easier to establish bookmarks by drag-and-drop.
They have stated that this invention can be customized to fit different scenarios depending on the need so nothing is set in stone yet.
1. Users can have the option of browsing through their own collection of bookmarks or be able to browse the bookmark collection of a variety of users.
Perhaps we could see a peer-to-peer network search engine that would be similar to the Open Directory Project, except for the fact that you do not need to rely on one editor's opinion. Note: Some system would have to be enabled to prevent massive auto account creations by spammers to inflate the popularity of bookmarks.
2. Users have the ability to search for their needs by using a search box to scour the database of bookmarks for the best results. Bookmark search results would depend on category, title, and URL naming conventions.
If mixed in with PageRank factors this could provide highly relevant results to users as well as eliminating spam.
Adding and Editing Bookmarks
Users have the ability to add their bookmarks either by manually typing in the URL or by drag-and-drop, similar to Netscape and Internet Explorer.
Link Display Information
1. To let the user be aware of inactive links.
2. To let the user know the freshness of a document. When it was updated last, or if it was updated after the user last logged in.
3. To let the user know the popularity of a document. Built by how many other users have this site bookmarked as well. (Perhaps PageRank could be equated here as well)
(edited by: msgraph at 5:26 pm (utc) on Feb. 13, 2002)
By collecting the bookmarked information gathered by all registered users, they are able to easily find out a user's needs.
One option: users' bookmark collections are sent back to the database for further processing on the users' average wants and needs.
The other option: users' bookmark collections are processed and analyzed on the user's machine. This would put the invention along the lines of a peer-to-peer network.
The preferred method would be to use the database as the central processor in order to easily and rapidly supply search and retrieval functions. With the peer-to-peer approach, if a large number of users are not on-line at the time then this would hinder other users' needs for relevant information. Also, with a central database, a user who logs on to another computer would easily be able to access their personalized data.
If a user bookmarks various collections relating to a specific topic then the user's profile will be updated to reflect on this. Various calculations will be made to note that the user likes to view information pertaining to this specific topic. It can also be noted that since a large collection of bookmarks have been recorded on this specific topic they might be of use to other users or groups with the same interests.
A user is given the option of joining a particular interest group that contains similar interests. In this way all members of this group can be provided with personal results relating to what other users have found in their paths for information.
Two documents they reference are:
These references give an idea about how they might discover users' similarities by clustering their collected documents (bookmarks). If a user does not select a particular interest group to join, they can be profiled into a group automatically.
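As a rough illustration of automatic grouping, here is a minimal Python sketch. The referenced documents are not quoted here, so this is not their method. It assumes a simple Jaccard overlap between bookmark sets and a greedy single pass, and the names jaccard and assign_groups and the 0.3 threshold are made up.

```python
def jaccard(a, b):
    """Share of bookmarks two users have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_groups(bookmarks, threshold=0.3):
    """bookmarks: {user: set of URLs}.

    Greedy single pass: a user joins the first group whose founding
    member has a similar enough collection, otherwise starts a new group.
    """
    groups = []  # list of (founding user, [members])
    for user, urls in bookmarks.items():
        for founder, members in groups:
            if jaccard(urls, bookmarks[founder]) >= threshold:
                members.append(user)
                break
        else:
            groups.append((user, [user]))
    return [members for _, members in groups]


example_groups = assign_groups({
    "ann": {"a.example", "b.example", "c.example"},
    "bob": {"b.example", "c.example", "d.example"},
    "cy": {"x.example", "y.example"},
})
```

Here ann and bob share half of their combined bookmarks and end up in the same group, while cy starts a group of their own. A real system would more likely cluster on page content as well as on shared URLs.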
Either way, when users search for their specific topic of interest they can be provided with enhanced results that reflect their interest.
The basic principles would be that the user searches for their information need. Then the user is provided with a results list that best matches their interest if at all possible. The results are then matched through the basics of the bookmark's status:
-how popular the document is.
-how many times the document is accessed
-how recently the document has been accessed.
They have also claimed that relevancy of search results can pertain to link structure, similar to Google, where inbound and outbound links related to the bookmarked documents in question will be factored in. Link text and surrounding text can be factored in as well in determining link structure. This was covered in the research papers that were referenced earlier in this article.
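One way to picture how these three bookmark signals might be combined is the Python sketch below. The publication does not give a formula here, so the weighted sum, the normalization, and the field names saved_by, visits and days_since_visit are all assumptions for illustration.

```python
def rank(bookmarks, weights=(1.0, 1.0, 1.0)):
    """Order bookmarks by a weighted mix of popularity, access count and recency.

    bookmarks: list of dicts with the hypothetical keys
      "saved_by" (users who bookmarked it), "visits" (times accessed),
      "days_since_visit" (days since the last access).
    """
    w_pop, w_freq, w_rec = weights
    max_pop = max(b["saved_by"] for b in bookmarks) or 1
    max_vis = max(b["visits"] for b in bookmarks) or 1

    def score(b):
        return (
            w_pop * b["saved_by"] / max_pop       # popularity, scaled to 0..1
            + w_freq * b["visits"] / max_vis      # access count, scaled to 0..1
            + w_rec / (1 + b["days_since_visit"])  # recency, 1.0 for today
        )

    return sorted(bookmarks, key=score, reverse=True)


docs = [
    {"url": "old-popular.example", "saved_by": 100, "visits": 50, "days_since_visit": 30},
    {"url": "fresh.example", "saved_by": 40, "visits": 20, "days_since_visit": 0},
    {"url": "stale.example", "saved_by": 5, "visits": 1, "days_since_visit": 365},
]
default_order = [b["url"] for b in rank(docs)]
recency_order = [b["url"] for b in rank(docs, weights=(1.0, 1.0, 3.0))]
```

With equal weights the popular page comes first. Tripling the recency weight moves the freshly visited page to the top.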
Well now you can see that since Outride is part of Google this will DEFINITELY factor in if the need arises.
Personalized recommendations can also be displayed while surfing web pages according to a few other claims. Perhaps you are browsing a page and this system recognizes that one of the links on that page is determined popular by the bookmark system. The bookmark software tool can then alert you that this bookmark is in your interest.
This could be very helpful to a person who is browsing a site filled with links but has no clue to what link would be best to follow. If a certain link is bookmarked enough then they could become aware that this link will most likely provide sufficient data to their need.
Another way I can see this being used is if your need is buried deep within a site you visit and there is no clear-cut path provided for you. Perhaps they can enact a system to flash the internal link before you to provide quick access to that buried page.
Another recommendation that could arise is if you are searching for a particular interest and the bookmark that fits your needs is currently inactive. This recommendation could be information that pertains to the bookmark in question. For example, other pages that contain similar types of information based on on-page criteria.
Another option that Google's database could provide.
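A minimal Python sketch of recommending a stand-in for an inactive bookmark follows. It assumes the system still holds the last known text of the dead page and compares it with other pages by cosine similarity over plain word counts. That measure is my choice for illustration and the publication may use something else. The URLs and function names are made up.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using raw word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


def suggest_replacements(dead_text, candidates, k=1):
    """dead_text: last known text of the inactive bookmark.
    candidates: {url: page text}. Returns the k most similar URLs."""
    ranked = sorted(
        candidates, key=lambda u: cosine(dead_text, candidates[u]), reverse=True
    )
    return ranked[:k]


suggestion = suggest_replacements(
    "cheap las vegas hotel rates",
    {
        "hotels.example": "las vegas hotel rates and deals",
        "recipes.example": "italian pasta recipes",
    },
)
```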
Updating User Bookmarks' Status
How often a page gets modified affects how often they check to see if the bookmarked page has been refreshed. They state that the preferred method is to keep a cached copy of the page on the server and compare it with the actual page to check for updates.
I can see Google's cache copy working really well here.
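The cached-copy comparison could look something like the Python sketch below. The publication does not say how the comparison is done or how the checking schedule adapts, so hashing the page text and halving or doubling the interval are assumptions, as are the function names.

```python
import hashlib


def fingerprint(html):
    """Hash of the page text with runs of whitespace collapsed."""
    normalized = " ".join(html.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(cached_html, live_html):
    """True when the live page differs from the cached copy."""
    return fingerprint(cached_html) != fingerprint(live_html)


def next_check_interval(current_hours, changed, lo=1, hi=24 * 30):
    """Check sooner after a change, wait longer when nothing changed."""
    hours = current_hours / 2 if changed else current_hours * 2
    return max(lo, min(hi, hours))
```

A page that keeps changing gets checked more and more often, down to once an hour in this sketch, while a page that never changes drifts out to once a month. That keeps the retrieval load on web sites low.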
What is not clear is how often they will check to see the availability of a document. However they handle this, any page that is not active at the time of verification will be noted and checked at a later time.
The question arises that if they do proceed with this type of method, how often will it occur? This could cause a rise in constant retrieval activity on various web sites throughout the net. This would have to be performed server-side in order to keep the retrievals to a low level.
Popularity is determined by the constant additions of a bookmark by all users. Each time a bookmark is added, it is logged and given a popularity rank.
Bookmark relevance can also be determined by on-page criteria. This could come from meta data and keyword vectors.
Here is another instance that Google is already set up to handle. Applying Google's page criteria algos to bookmark sets can easily determine the relevancy of a page.
There is a Xerox Patent that they refer to titled: "Method and apparatus for automatic document summarization" U.S. Patent 5,638,543. Basically it calls for analyzing sentences and bits of keyword text to formulate an abstract of the document. It is your typical run-of-the-mill page relevancy analysis patent.
Again I feel that Google's own page criteria technology could handle this even better.
Looking at the explanations above, how do you follow through with a user's search to provide the results they need if in fact they do not find their need in the first set of results?
Well for example, when a user searches for a bit of information, you first give them generalized, yet somewhat specific, listings of results that best match up to the topic. The user selects the topic that most likely fits their needs and then they are provided with a more detailed sub-set.
The generalized topics can include information that might interest the user in a detailed or vague manner. The specifics of the results will depend on the user's selection.
How would they personalize this?
Think of some major search engines out there that give you a subset of results that have either been searched by other users or are determined relevant to the topic.
If the user selects the more detailed topic that relates to their interest, then the results will be narrowed down from there with the addition of the keywords in that topic.
If the user does not select the detailed topics, then the user will be provided results that do not reflect on those details.
To break this down, imagine that you are searching for "vacation packages." You are provided with a list of categories that are popular like Hawaii, Las Vegas, Bahamas, etc. There are other listings that offer vacation packages in general. If you click on a general category, the system would realize that you are not interested in Hawaii, Las Vegas, and Bahamas so they would exclude them from the results.
Another option is to give a user the ability to sort the results based on the various results that are stated above: recency, frequency, popularity, link structure, and context.
This ability to sort results is a popular method used by a variety of sites these days and they appear to be very successful in grabbing the user's interest in specific data. For example, Download.com lets the user sort software packages by popularity, date, title, etc. This is great for users who are looking for popularity of a product or the latest release of an application.
It could also be possible to set up an automatic ranking system without user input in order to learn the user's traits. This would help to identify if the current ranking methods do not interest the user. If the user tends to select results that are further on down the list then perhaps some changes in the ranking systems need to take place. These changes can either be made on a global scale for all users or to just a specific user.
Depending on the resources involved for modifying a specific user's results, this would be a perfect option to provide. For example, let's say a user is given results with a higher popularity first, yet this user tends to click on links that point to fresher pages. The results can be customized so that the user gets a list of fresher yet still popular results.
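Here is a small Python sketch of how such learning from clicks might work. Nothing in the publication spells out an update rule, so the rule below (nudge each ranking weight toward whatever the clicked result had more of than the top result) and the names update_weights, popularity and freshness are assumptions.

```python
def update_weights(weights, shown, clicked_index, rate=0.1):
    """Nudge ranking weights toward what the user actually clicks.

    weights: {"popularity": w, "freshness": w}
    shown: result dicts with "popularity" and "freshness" scores in 0..1,
           in the order they were displayed.
    """
    top, clicked = shown[0], shown[clicked_index]
    new = dict(weights)
    for feature in new:
        # Positive when the clicked result beats the top result on this feature.
        delta = clicked[feature] - top[feature]
        new[feature] = max(0.0, new[feature] + rate * delta)
    total = sum(new.values()) or 1.0
    return {f: w / total for f, w in new.items()}


shown = [
    {"popularity": 0.9, "freshness": 0.2},  # ranked first
    {"popularity": 0.6, "freshness": 0.9},  # ranked second, but clicked
]
learned = update_weights({"popularity": 0.5, "freshness": 0.5}, shown, clicked_index=1)
```

After the user skips the popular result for the fresher one, the freshness weight ends up above the popularity weight, though popularity still counts. Clicking the top result leaves the weights as they were.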
Spreading activation can also be factored in with how the results are displayed.
This could be used to learn the terms that a user likes to associate with yet are not always provided in the best pattern.
For example, some users might use the terms "travel bargains" while others use the term "vacation getaways." They both have similar meanings yet there is the possibility that there are differences.
A user might be looking for any kind of vacation package so they search for "vacation getaways." The use of "spreading activation" could list some results that show "travel deals" within the title yet offer vacation getaways on their site.
The user tends to select the "spreading activation" results just as much as their original search terms and bookmarks the pages just the same. The system will notice this and will offer these results just as much in the future.
Another user searches for "travel deals" but rarely, if ever, clicks on any of the terms provided by "spreading activation", such as "vacation getaways." This user is interested more in specific travel deals and not vacation getaways so therefore they ignore any other listing that does not contain their search query. The system can identify that the user dislikes the use of "spreading activation" in their results so therefore it is not much of a priority in ranking sites.
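For readers who have not met spreading activation before, here is a minimal Python sketch of the general idea. It is not the version described in the paper. The term graph, the link strengths, and the decay value are all made up for illustration.

```python
def spread(graph, seeds, decay=0.5, steps=2):
    """Spread activation from the seed terms to their neighbours.

    graph: {term: {neighbouring term: link strength 0..1}}
    seeds: {term: starting activation}, e.g. the user's search terms.
    Each step passes on a decayed share of the energy along every link.
    """
    activation = dict(seeds)
    frontier = dict(seeds)
    for _ in range(steps):
        nxt = {}
        for term, energy in frontier.items():
            for neighbour, strength in graph.get(term, {}).items():
                nxt[neighbour] = nxt.get(neighbour, 0.0) + energy * strength * decay
        for term, energy in nxt.items():
            activation[term] = activation.get(term, 0.0) + energy
        frontier = nxt
    return activation


TERMS = {
    "vacation getaways": {"travel deals": 0.8, "cheap hotels": 0.4},
    "travel deals": {"vacation getaways": 0.8},
}
activated = spread(TERMS, {"vacation getaways": 1.0}, steps=1)
```

A search for "vacation getaways" lights up "travel deals" at less than half the strength of the original term. One way to picture the per-user behavior described above is a personal decay value: high for the user who happily clicks the related results, close to zero for the user who ignores them.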
This would be a great system to provide for many users in displaying personal relevant results. Imagine if you are going to Las Vegas for a night or two and you are searching for cheap hotel rates. All you want to find are sites offering cheap hotel rates, yet you are given a list of sites that offer various 2-3 day packages with other inclusions. In the future when you are searching for hotel rates, the top sites you view will be for those types of rates and not some vacation package. But, there could be users that tend to look at them as well and their personalized results can be modified to fit their pattern.
An ideal method to finalize how the results are displayed would be to let the users decide what results they want displayed. Do they want to see only the bookmark data, Web-crawled data, or a mixture of the two? If both sets are blended it would also be ideal to have the bookmark data highlighted somehow to make the user aware of that data.
(edited by: msgraph at 8:15 pm (utc) on Feb. 12, 2002)
When a user decides to bookmark a document, the system could be configured to automatically categorize the document into the appropriate category.
To explain this in better detail they refer to a few patents. One that really catches the eye is:
"System for categorizing documents in a linked collection of documents"; U.S Patent No. 5,895,470; Pirolli et al.
James Pitkow's name is on the patent too just in case you were wondering.
Remember the section above that mentions the "Network Representations of CUT" from the paper, "The Scent of a Site: A System for Analyzing and Predicting Information Scent, Usage, and Usability of a Web Site" [parc.xerox.com] ?
Well, read that section very carefully because it explains a lot about automatic document categorization.
This is one application they will HAVE to implement. One discouraging factor I have considered from the beginning is that people like to store bookmarks but they tend to be lazy about it. They end up with a long list of bookmarks that have to be scrolled through in order to find the document when it is needed. What better system would there be other than having it organized automatically for easy retrieval?
Another plus for this method of automatically categorizing bookmarks is that it will be easier to make automatic inclusions into the public database. This in turn will make it easier for the public bookmark search system to find relevant results by category.
Summary of Claims
A.) Main Objective
To provide users a way to search through a collection of bookmarks from both private and public databases.
- To profile users into groups from a larger group of users.
- To profile a single user from a larger group of users.
- Profiling of users begins the moment they collect their first bookmark.
C.) Rankings In General
- To rank results based on the context of the collected documents.
- To rank results based on the popularity of the shared bookmarks.
- To rank results based on how often the page is accessed.
- To rank results based on how recently a page was accessed.
- To rank results based on the link structure of the page and site.
- To rank results based on inbound/outbound link weight. (Think of PageRank)
- To rank results based on the "spreading activation analysis" of a collection of bookmarks contained in a related group/topic.
D.) Rankings In Respect To User Groups
- All of part C depending on how each document is determined based on a group topic association.
E.) Search Query
- When a user performs a search query, one or more keywords will be added to the query, based on a user group profile.
Let's say you search for "Miami restaurants" and you are part of a group that contains a large collection of Italian recipes. You could be given results containing information on Italian restaurants in Miami.
- Same as above yet additional keywords can be assigned based on your own profile and not that of your groups'.
Like the previous example, let's say you search for "Miami restaurants." Although one of your groups has a large collection of Italian recipes, you have a large private collection of Italian, Mexican, Spanish, French, and Japanese food related bookmarks. You can then be given restaurant results based on these food preferences as well.
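The query claim can be sketched in a few lines of Python. The claim only says that keywords are added from a profile, so the way they are chosen here (the dominant topic labels among a profile's bookmarks) and the names expand_query, max_terms and min_share are assumptions.

```python
from collections import Counter


def expand_query(query, bookmark_topics, max_terms=1, min_share=0.3):
    """Append the dominant profile topics to a search query.

    bookmark_topics: one topic label per bookmark in the user's or
    group's profile. A topic is added only if it covers at least
    min_share of the bookmarks and is not already in the query.
    """
    if not bookmark_topics:
        return query
    counts = Counter(t.lower() for t in bookmark_topics)
    total = sum(counts.values())
    present = set(query.lower().split())
    extra = [
        topic
        for topic, count in counts.most_common()
        if count / total >= min_share and topic not in present
    ][:max_terms]
    return " ".join([query] + extra)


group_profile = ["Italian"] * 6 + ["Gardening"] * 2
expanded = expand_query("Miami restaurants", group_profile)
```

With a profile dominated by Italian recipe bookmarks the search becomes "Miami restaurants italian". Raising max_terms lets a private profile with several strong food interests add more than one term.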
F.) Unavailable Bookmarks
- A system to monitor the availability of bookmarks. If a bookmark is unavailable, then one or more bookmarks that contain similar data will be recommended to take the place of the unavailable bookmark.
(edited by: msgraph at 5:24 pm (utc) on Feb. 13, 2002)
After reading into the research and patented technology of Outride's application, the question still remains on how Google will implement it into their search platform. Would Google use the technology as a browser add-on tool?
Outride claimed that the ideal method for this application would be to have it as a downloadable browser companion. With Google working hard to push their current Toolbar, this could be their best option for delivering the application. This would allow Google to use all of Outride's technology as it was meant to be used.
1 This analysis was based on the Published International Application: WO 00/67159; SYSTEM AND METHOD FOR SEARCHING AND RECOMMENDING DOCUMENTS IN A COLLECTION USING SHARED BOOKMARKS; Xerox Corporation.
We never talk about future plans outside of Google. The only rule of thumb I can give is that we're going to tackle the things that help the user most, in the order that we think will give a good return for time invested. I'm not sure that personalization is the biggest win at this time. If we start to think it would be a big win for users, we would do it.
An added benefit is that it allows search services to collect valuable relevance information about the results shown to the user. In the context of each query SearchPad can log the actions taken by the user, and in particular record the links that were considered relevant by the user in the context of the query
Toolbar on steroids?
Another option is to give a user the ability to sort the results based on the various results that are stated above: recency, frequency, popularity, link structure, and context.
This would be such a great advanced feature. Google has a tremendous database, but their algo is really geared to favor larger corporate type sites. This would allow you so much more control when deciding what variables are important for your particular search.
let's say a user is given results with a higher popularity first, yet this user tends to click on links that point to fresher pages. The results can be customized so that the user gets a list of fresher yet still popular results
This is very similar to the above but automatic. I think this could provide valuable help for neophytes lost on the WWW, but you would definitely need to provide a way to disable the feature.
"personalisation be made availible to the general searching public?"
Other search engines are already talking about implementing it. I would expect that Google would have to provide it to the general public if they want to stay King of the SE's.