Search Tool Data Analysis

by yourActual name (tmuir in BIT330, Fall 2008)

Questions and queries

Web search engines

For the web search engines, I’m using a search that I just did last weekend for my first MKT300 assignment. The assignment was to find an example of unethical marketing from the past. In this assignment, I am going to mimic the actual search that I did for that assignment because it gives me some clue about what I actually got out of it and perhaps what I could have gotten out of it.

My rough standard for considering a result ‘applicable’ is whether the web site provides either complete information that can be used as an example of unethical marketing, or at least enough to lead me to whatever little I might need to fill in the gaps.

For the query, I kept it very simple to mimic my actual search. On all of the search engines, I used exactly what I used in my actual assignment:

Unethical marketing example

Blog search engines

For the blog search engines, I’m going with a more personal interest search. My brother is a game developer in San Francisco working on a game called Brutal Legend. Their original publisher Vivendi was recently acquired by Activision, and in the company review Brutal Legend did not make the cut for continued funding. As such, they are out of a publisher and shopping it around, and I’m actually very interested to see what people have been saying about the game and its limbo status of the last month or so.

I had to put this one in quotes to make sure that I was getting the proper information. If I weren't personally interested in what people had to say about the game, I probably would have left it without quotes to show that even what appears to be a well-worded search can come up with nothing the searcher is actually looking for.

"Brutal Legend"

Data that I collected

Search engine overlap data

Web search     Live    Google    Yahoo Web
Live           45      20        20
Google                 60        20
Yahoo Web                        50
All            10

Blog search    Technorati    Google Blog    Bloglines
Technorati     85            10             10
Google Blog                  100            15
Bloglines                                   75
All            10

Search engine ranking overlap data

This table provides a measure of how much of Google's responses are reproduced by Yahoo.

GY        Yahoo
Google    5     10    20
5         1     1     1
10        3     4     4
20        3     4     4

This table provides a measure of how much of Yahoo's responses are reproduced by Google.

YG        Google
Yahoo     5     10    20
5         1     3     3
10        1     4     4
20        1     4     4

This table provides a measure of how much of Bloglines' responses are reproduced by Google Blog Search.

BG           Google
Bloglines    5     10    20
5            0     0     0
10           0     0     0
20           1     3     3

This table provides a measure of how much of Google Blog Search's responses are reproduced by Bloglines.

GB       Bloglines
GBlog    5     10    20
5        0     0     1
10       0     0     3
20       0     0     3

Results

Web search

Search Engine Overlap Data

Prec. Live Prec. Google Prec. Yahoo Web Overlap Y/G Overlap L/Y Overlap G/Y Overlap L/G/Y
Mean 42.77777778 54.44444444 51.66666667 18.33333333 20 20.55555556 10
Median 42.5 57.5 52.5 20 20 20 10
Std. Dev 22.76621738 20.06525303 22.42635005 9.548637106 11.37592918 7.838233761 7.475450016
Maximum 80 90 85 35 45 35 25

For each category, I have given the values of the pertinent statistics, as shown in the rows: The mean, the median, the standard deviation and the maximum for each. Precision is the precision of each respective search engine (Microsoft Live Search, Google and Yahoo) given as a percentage of relevant documents out of the total number examined (20 in this case). Overlap is the measure of how many documents that appeared in the search on one engine also appeared on either of the other two. The result is reported as a percentage. All class data can be found here.
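As a rough illustration of the two definitions above, the following Python sketch computes precision and overlap as percentages of the 20 results examined. The function names, the example URLs, and the choice to divide the overlap count by the number of results examined are assumptions made for illustration; the class data may well have been tallied by hand.

```python
def precision_pct(relevance_flags):
    """Percentage of examined results that were judged relevant."""
    return 100 * sum(relevance_flags) / len(relevance_flags)


def overlap_pct(results_a, results_b, examined=20):
    """Percentage of the examined results that appear in both result lists."""
    top_a = set(results_a[:examined])
    top_b = set(results_b[:examined])
    return 100 * len(top_a & top_b) / examined


# Hypothetical data: 20 results per engine, identified by made-up URLs.
engine_a = [f"http://example.com/a{i}" for i in range(16)] + [
    f"http://example.com/shared{i}" for i in range(4)
]
engine_b = [f"http://example.com/shared{i}" for i in range(4)] + [
    f"http://example.com/b{i}" for i in range(16)
]
# Hypothetical relevance judgments for engine_a: 9 of the 20 are relevant.
judged_a = [True] * 9 + [False] * 11

print(precision_pct(judged_a))          # 45.0
print(overlap_pct(engine_a, engine_b))  # 20.0
```

Note that the relevance flags still have to come from a person's judgment; the code only does the counting.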

Search Engine Overlap Ranking Data

Google / Yahoo

o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 1.058823529 1.352941176 1.647058824 1.294117647 2 2.647058824 1.647058824 2.470588235 3.705882353
Median 1 1 2 1 2 3 1 3 4
Std. Dev 1.197423705 1.32009358 1.411611511 1.212678125 1.322875656 1.729926894 1.221739358 1.545867356 2.114376559
Maximum 4 4 4 4 4 5 4 5 7

Yahoo / Google

o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 1.058823529 1.176470588 1.647058824 1.470588235 1.941176471 2.470588235 1.882352941 2.647058824 3.764705882
Median 1 1 1 1 2 3 2 3 4
Std. Dev 1.197423705 1.286239389 1.366618842 1.23073388 1.390619836 1.58578242 1.268973647 1.729926894 2.077540967
Maximum 4 4 4 4 4 5 4 5 7

Again, relevant statistics are given in rows and each category has been given its own column. This is a more specific look inside the overlap data from the previous table. The notation for the columns should be read as follows: for the Google / Yahoo chart, o(5,10) would be "Of the first five documents returned by Google, how many of those also appear (overlap) in the first ten documents returned by Yahoo?". All results were recorded as integers and can be found here.
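The o(m,n) notation can be sketched in a few lines of Python, reading o(m,n) as the number of the top m results from the first engine that also appear in the top n results from the second. The function names and the ranked lists below are hypothetical and only stand in for real result URLs.

```python
def rank_overlap(first, second, m, n):
    """o(m, n): how many of the top m results in `first` are in the top n of `second`."""
    return len(set(first[:m]) & set(second[:n]))


def overlap_grid(first, second, depths=(5, 10, 20)):
    """Rows are depths m into `first`; columns are depths n into `second`."""
    return [[rank_overlap(first, second, m, n) for n in depths] for m in depths]


# Hypothetical ranked result lists (short labels stand in for URLs).
google = [f"g{i}" for i in range(20)]
yahoo = [f"y{i}" for i in range(20)]
google[2] = yahoo[7] = "shared-1"  # rank 3 in Google, rank 8 in Yahoo
google[8] = yahoo[1] = "shared-2"  # rank 9 in Google, rank 2 in Yahoo

for row in overlap_grid(google, yahoo):
    print(row)
# [0, 1, 1]
# [1, 2, 2]
# [1, 2, 2]
```

Swapping the two arguments gives the grid for the opposite direction (Yahoo / Google), which is why the two charts need not match cell for cell.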

Blog search

Blog Overlap Data

Prec. Technorati Prec. Google Blog Prec. Bloglines Overlap Y/G Overlap L/Y Overlap G/Y Overlap L/G/Y
Mean 33.05555556 52.5 44.44444444 3.611111111 9.166666667 6.944444444 1.388888889
Median 30 42.5 47.5 0 7.5 5 0
Std. Dev 21.15342337 22.17908395 14.33720878 7.030512398 7.717436331 6.448640734 3.34556579
Maximum 85 100 75 25 25 20 10

See Web Search Data section for descriptions and where to find the original data.

Blog Overlap Ranking Data

Google / Bloglines

o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 0.294117647 0.352941176 0.470588235 0.411764706 0.470588235 0.823529412 0.705882353 0.764705882 1.058823529
Median 0 0 0 0 0 0 0 0 1
Std. Dev 0.469668218 0.606339063 0.624264273 0.618346942 0.717430054 1.014599312 0.919558718 1.091410313 1.197423705
Maximum 1 2 2 2 2 3 3 4 4

Bloglines / Google

o(5,5) o(10,5) o(20,5) o(5,10) o(10,10) o(20,10) o(5,20) o(10,20) o(20,20)
Mean 0.294117647 0.352941176 0.588235294 0.411764706 0.529411765 0.823529412 0.529411765 0.882352941 1.117647059
Median 0 0 0 0 0 1 0 1 1
Std. Dev 0.469668218 0.606339063 0.870260272 0.618346942 0.717430054 1.074435556 0.624264273 0.992619825 1.166316474
Maximum 1 2 3 2 2 4 2 3 4

See Web Search Data section for descriptions and where to find the original data.

Discussion

Web search

The first thing to take note of here is that nothing said here is by any means conclusive. For all of the data above, the total number of data points collected was a woefully inadequate 18: 18 data points to judge the effectiveness of the three biggest web search engines on the planet. To say this is not enough is a massive understatement. Another major problem with the data collected was that there was no restriction placed on the search terms to be used. Some of us may have used very detailed searches, and others may have used only the vaguest of search terms. There are a couple of other issues with the data itself, which will be addressed when discussing the relevant statistics.

The first statistic to be looked at is Precision. There is a major problem with calculating precision, but it is not specific to our experiment. To calculate precision, someone has to determine whether or not a page returned is relevant. Who is to judge that? By what criteria? What is relevant to one person may seem entirely irrelevant to the next person, and there is no completely objective way to tell which one is right. Again, this is not a problem specific to our experiment but it is certainly worth noting.

Back to precision as it pertains to our data, the first thing that jumps out is how low the numbers are. An average precision of only about 50 percent across the three major search engines used today is simply astonishing. As was discussed in class, this number can be brought up to near 100 percent fairly easily, but there is a certain amount of knowledge that is needed to do it. This is knowledge that the average user does not possess. If we think for a moment about how much the average person trusts a search engine, or how common the phrase "Just Google it!" is, then these numbers look very disappointing. Also consider for a moment the number of documents returned on an average search: ten thousand at least, up to multiple millions. When, at best, 90 percent of those are relevant, that means there are thousands upon thousands of documents returned that have no relevance. Not very efficient.

The next thing to consider is the basic Overlap Data. These numbers are interesting mainly for what they say about how people think of search engines. Most people think that all search engines are interchangeable: that putting one set of search terms into Google would spit out the exact same results as putting them into Yahoo instead. What is illustrated very nicely here is that this is very obviously not true. The average overlap between any two of the search engines is only about 20 percent, and only about 10 percent of documents appear in all three.

This raises some very interesting questions about what the search engine actually does. What is it that would account for these huge discrepancies? Is it in the way that the search engines spider and find new pages? Is it in the way that they rank them? Is it related to how the search terms are interpreted by each specific engine? I think that the correct answer here is actually "All of the above as well as many other things". The search terms entered into a search engine, any search engine, go into a metaphorical black box and then out comes a set of results. Each search engine has its own processes that it goes through, and all the user gets to see is the result. Without being able to see what goes on in that black box there isn't any way to be sure, and even then it might not tell everything that is going on.

Next, look at the Standard Deviations given for each column. This is where it starts to become clear that only 18 results isn't nearly enough. Standard deviation is given in the same units as the rest of the data (percentage points, here), so take a look at the Precision of Microsoft Live Search: an average of 42.8% and a standard deviation of 22.8%. This is a representative example of the rest of the data. The more closely the data points cluster around the mean, the smaller the standard deviation; the further they spread from the mean, the larger it grows. As can be seen from the tables, the standard deviations are very large compared to the respective means.

What this means is that there is very little central tendency in the data. It is sort of 'scatter shot' across the map, from high to low. This would generally be considered a bad thing, as there can't be many useful inferences drawn from data so random, but given the problems with the experiment as a whole mentioned above it is not at all surprising that the standard deviations would be so large.
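To put a number on how large the standard deviations are relative to the means, one option is to divide each standard deviation by its mean (often called the coefficient of variation). The sketch below does this for the three web precision columns, using the mean and standard deviation figures exactly as reported in the Search Engine Overlap Data table. The ratio is not a statistic used in the class data; it is introduced here only as an illustration.

```python
# (mean, standard deviation) of precision in percent, copied from the
# Search Engine Overlap Data table for the web search engines.
precision_stats = {
    "Live": (42.77777778, 22.76621738),
    "Google": (54.44444444, 20.06525303),
    "Yahoo Web": (51.66666667, 22.42635005),
}


def relative_spread(mean, std_dev):
    """Standard deviation as a fraction of the mean (coefficient of variation)."""
    return std_dev / mean


for engine, (mean, std_dev) in precision_stats.items():
    print(f"{engine}: {relative_spread(mean, std_dev):.2f}")
# Live: 0.53
# Google: 0.37
# Yahoo Web: 0.43
```

In other words, the typical deviation from the mean is somewhere between a third and a half of the mean itself for every engine.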

A very interesting thing to observe, particularly in the Search Engine Overlap Data table, is that the mean and the median are very close. With a standard deviation so high, this is not something that would be expected. Taken together with the high standard deviation, it suggests that the data is spread fairly evenly on either side of the mean/median but over a very wide range. Looking at the raw data, it can be seen that this is indeed true. That check is only practical because of the small number of data points, but it is still interesting to notice.

Now, let's take a closer look at how well two search engines mimic each other in detail. The most interesting data here will certainly come out of the o(5,5) column of the tables. The question here is "Of the top five results in Google, how many of them were returned in the top five results in Yahoo?" and vice versa. This will be the most interesting statistic, possibly of the whole experiment, because, as was discussed in class, the average user does not look much beyond the first three or five results, let alone down to the twentieth. However, given what was discussed about how search engines differ from each other, it is not clear whether this number should be high or low. If it came out to be five (all five results from one reproduced in the top five of the other) then there would be no clear benefit to using one over the other. Also, it should be noted that the overlap ranking data made no distinction between relevant and irrelevant results. It could be that the two search engines are mirroring each other's irrelevant search results, which would surely be the worst outcome of all.

What is seen in o(5,5) is that there is an average of about one result overlapping between Google and Yahoo in the top five results from either. To me, this seems like it is about "right". It gives a sense of reassurance that both search engines are doing similar things when they search for a particular set of terms, but there is still a difference in the results. And because of that, there is still a reason to use a different search engine.

There is one last thing to note from the Overlap Ranking Data. It is not surprising, but as the number of documents compared from either search engine is increased (five to ten, ten to twenty), the number of overlapping results increases. This is another fact that increases confidence across web search engines while still giving a reason to try any search in a different search engine just to see what might come out.

Blog search

The blog search engines are statistically analogous to the web search engines across the board but… worse. Where we saw an average of about 50 percent precision in the web search engines, here in the blog search engines it is closer to 40 percent. In overlap data, the blog search engines performed even worse. Comparing any two search engines yields an average overlap of less than ten percent, and across all three it drops to about one percent.

However, this is not the slightest bit surprising given the information that is being searched. The web search engines have the benefit of being able to search just through web pages to decide what they are about, and often this involves facts and definitions and other concrete information. Blogs are about people. People writing their opinions and reacting to things. This makes them inherently much harder to catalog, and subsequently much harder to search. Still, that is not to say it makes it impossible. Note that the only 100 percent precision across both web searches and blog searches comes from the blog searches, Google Blog Search to be exact.

Other than that, there isn't a whole lot interesting to discuss that wasn't already covered above. What was said about the web searches applies almost verbatim to the blog searches. Compared to the means, the standard deviations are very high. Also, the means and medians are again very close. As before this leads us to believe that the data is distributed fairly evenly above and below the mean/median over a very large range. At best there was a 25% overlap between any two blog searches. This lower number was just explained above. Similarly, when doing detailed comparisons of the results from Google Blog Search and Bloglines the numbers are much smaller.

The most interesting thing to notice in the Blog Search Data is found in the overlap ranking data. To get an average of even one overlapping result, it is necessary to compare the top twenty results from each, that is to say all the way to o(20,20). In o(5,5) there is an average of only about 0.3 results being returned by both. The median for almost all the overlap statistics in the blog search was zero, and in the few cases where it was not zero, it was only one.

One last thing to note from the Overlap Ranking Data is that zero is well within one standard deviation of the mean in almost all of the results. This tells us that there is very little basis for confidence that one blog search engine will reproduce another's results.
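The "zero within one standard deviation" observation is easy to check directly against the reported figures. The sketch below does so for the Google / Bloglines ranking table; the helper name is hypothetical, and the numbers are copied from the Mean and Std. Dev rows of that table.

```python
# Mean and standard deviation rows copied from the Google / Bloglines
# ranking table, in column order o(5,5) through o(20,20).
labels = ["o(5,5)", "o(10,5)", "o(20,5)", "o(5,10)", "o(10,10)",
          "o(20,10)", "o(5,20)", "o(10,20)", "o(20,20)"]
means = [0.294117647, 0.352941176, 0.470588235, 0.411764706, 0.470588235,
         0.823529412, 0.705882353, 0.764705882, 1.058823529]
std_devs = [0.469668218, 0.606339063, 0.624264273, 0.618346942, 0.717430054,
            1.014599312, 0.919558718, 1.091410313, 1.197423705]


def zero_within_one_sd(mean, std_dev):
    """True when zero lies no more than one standard deviation below the mean."""
    return mean - std_dev <= 0


covered = [label for label, m, s in zip(labels, means, std_devs)
           if zero_within_one_sd(m, s)]
print(f"{len(covered)} of {len(labels)} columns include zero")
# 9 of 9 columns include zero
```

The same check can be run on the Bloglines / Google rows by swapping in that table's figures.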

Again though, it is hard to say if this is a bad thing. When each individual search engine only has a precision of about 40 percent, it is almost better that they are not mimicking each other too closely.

That last point is what brings me to what I would call the conclusion of this whole analysis: Use different search engines. It is going to be more efficient to go through the first ten results from three search engines than it will be to go through the top thirty results from one. Assuming 50% precision and 20% overlap, this is how I would suggest making the most of time spent searching. Of course, an emphasis needs to be placed on making searches as accurate as possible. Expecting any search engine to return perfectly relevant results every time when inputting incredibly vague terms is silly. However, a combination of good search terms and different search engines should yield the best results in the smallest time. I know that is what I will be doing.
