Searches more complex; engines keeping up?
Via Greg Sterling comes a OneStat report that general Web search queries are shifting away from single keywords and toward two, three and four words a pop.
Of all the search phrases worldwide, 28.91 percent of the people use two word phrases, 27.85 percent use three word phrases and 17.11 percent use four word phrases. Less and less people use now one keyword since the last measurement in July 2005.
I just went through an exercise with Scripps colleagues and a vendor we use for some of our search products, laying out how I believed we should handle consumer queries in a search system where we don't have features such as Google's PageRank to help determine relevant, credible results.
Here's how I think searches should work, broadly, by default:
- Use an OR search, meaning a search for "pizza hut" returns documents that contain "pizza" or "hut," not only those that contain both.
- Support word stemming and word form expansion. That means, for example, a search for "president" would also bring up matches for "preside," "presiding," "presidents," "president's," "presidents'" and "presidential."
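Here's a minimal Python sketch of those two defaults. The `WORD_FORMS` table and the function names are my own placeholders, not part of any particular search product; a real system would likely use a stemmer or a morphological dictionary instead of a hand-built mapping.

```python
import re

# Hypothetical word-form table, for illustration only. A production system
# would derive these forms with a stemmer or morphological dictionary.
WORD_FORMS = {
    "president": {
        "president", "presidents", "president's", "presidents'",
        "presidential", "preside", "presiding",
    },
}


def tokenize(text):
    """Lowercase the text and split it into words, keeping apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def expand(term):
    """Return the term plus any word forms listed for it in WORD_FORMS."""
    return WORD_FORMS.get(term, {term})


def or_search(query, documents):
    """Return indexes of documents that contain ANY query term or word form."""
    wanted = set()
    for term in tokenize(query):
        wanted |= expand(term)
    return [i for i, doc in enumerate(documents) if wanted & set(tokenize(doc))]
```

With this sketch, a query for "pizza hut" returns documents that mention either word, and a query for "president" also returns a document that only says "presidential."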
Those two defaults mean a fairly liberal match set if the database you're searching is large (if it isn't, are you sure you need a search engine? ;-)). Now comes the hard part -- sorting:
- Before matching any search string, start by throwing out non-content words (e.g., "a," "an," "the," "or," "and").
- After that filtering, sort first by number of words matched in the string. That means in a search for "pizza hut" we would first see matches of both words, then matches of only "pizza" or only "hut."
- In single-word matches from multiword strings, more weight should be given to the first word in the search string than to the rest.
- Within the groups of "all keywords matched" and "more than one keyword matched," sort next by word proximity. So in a search for "farmers market" results that match both words where the two words are together should rank higher than where the two words have many words between them. This obviously does not matter when only one keyword is matched.
- Within the resulting subgroups of results, sort next by ratio of keyword occurrences to the total word count of the matching document.
- Within the resulting sub-subgroups of results, sort next by date of last known document update, most recent at top.
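The sort order above can be sketched as a single tuple sort key in Python. This is a rough illustration under my own assumptions: documents are `(text, last_updated)` pairs, matching is exact (no stemming), proximity is the smallest word window that covers one occurrence of each matched term, and the names `rank` and `rank_key` are made up for the example. The brute-force proximity step is fine for short documents but would need a smarter algorithm at scale.

```python
import re
from datetime import date
from itertools import product

STOP_WORDS = {"a", "an", "the", "or", "and"}


def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def rank_key(terms, text, updated):
    """Build a sort key for one document, or None if no term matches."""
    words = tokenize(text)
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    matched = [t for t in terms if positions[t]]
    if not matched:
        return None
    if len(matched) == 1:
        # Single-word match: earlier query words outrank later ones.
        first_word_rank = terms.index(matched[0])
        span = 0
    else:
        first_word_rank = 0
        # Smallest distance covering one occurrence of each matched term.
        span = min(
            max(combo) - min(combo)
            for combo in product(*(positions[t] for t in matched))
        )
    density = sum(len(positions[t]) for t in matched) / len(words)
    # Tuples sort left to right; negate values where bigger should come first.
    return (-len(matched), first_word_rank, span, -density, -updated.toordinal())


def rank(query, documents):
    """Return document indexes, best match first."""
    # Drop non-content words and duplicate terms, keeping query order.
    terms = list(dict.fromkeys(t for t in tokenize(query) if t not in STOP_WORDS))
    scored = []
    for i, (text, updated) in enumerate(documents):
        key = rank_key(terms, text, updated)
        if key is not None:
            scored.append((key, i))
    return [i for _, i in sorted(scored)]
```

Because each rule is one position in the tuple, reordering or dropping a rule is a one-line change, which makes it easy to experiment with the priorities.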
A good search service addresses synonyms, acronyms, abbreviations and common misspellings with a built-in common-word thesaurus and administrator-configurable match tables. Search administrators should be able to add both "this equals that" and "this does not equal that" rules.
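As a rough sketch of what I mean by match tables, here is one way to layer administrator rules over a built-in thesaurus in Python. The `Thesaurus` class and its method names are hypothetical, and the example pairs are ones I made up; it handles single terms only and does not chain synonyms.

```python
class Thesaurus:
    """Built-in synonym pairs plus administrator-configurable rules."""

    def __init__(self, built_in=None):
        self.equals = {}         # term -> set of equivalent terms
        self.not_equals = set()  # unordered pairs that must never match
        for a, b in (built_in or []):
            self.add_equal(a, b)

    def add_equal(self, a, b):
        """Record a 'this equals that' rule in both directions."""
        self.equals.setdefault(a, set()).add(b)
        self.equals.setdefault(b, set()).add(a)

    def add_not_equal(self, a, b):
        """Record a 'this does not equal that' rule, overriding any match."""
        self.not_equals.add(frozenset((a, b)))

    def expand(self, term):
        """Return the term plus every allowed equivalent."""
        candidates = self.equals.get(term, set())
        allowed = {
            c for c in candidates
            if frozenset((term, c)) not in self.not_equals
        }
        return {term} | allowed
```

An administrator could call `add_equal("recieve", "receive")` to cover a common misspelling, or `add_not_equal("pop", "soda")` to switch off a built-in pairing that doesn't make sense for a given site.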
Many of us have seen Google's "Did You Mean...?" keyword suggestion links at the top of search results -- a clever way to help us recognize misspelled keywords or spelling/phrasing alternatives. A search thesaurus enables similar functionality, making a better and more usable alternative to "Advanced Search" forms for query refinement.
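A small site could approximate that kind of suggestion with Python's standard `difflib` module. To be clear, this is not how Google does it; it's a sketch that assumes you already have a list of known-good terms (the `vocabulary` argument), and the function name and the 0.8 cutoff are arbitrary choices of mine.

```python
import difflib


def did_you_mean(query, vocabulary, cutoff=0.8):
    """Suggest a corrected query, or return None if nothing needs changing."""
    words = query.lower().split()
    corrected = []
    for word in words:
        if word in vocabulary:
            corrected.append(word)
            continue
        # Closest known term by similarity ratio; empty list if none is close.
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(close[0] if close else word)
    suggestion = " ".join(corrected)
    return suggestion if suggestion != " ".join(words) else None
```

Given a vocabulary of `["farmers", "market", "pizza", "hut"]`, the query "farmers markt" would come back with the suggestion "farmers market," while a correctly spelled query returns no suggestion at all.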
Assuming administrators don't try to game the system, I think a keyword search engine that uses these tools and follows this sort order will be more likely to deliver a high perceived quality of match in its results. That perception -- "I searched for something and got results that met my expectations" -- propelled Google to the top of general Web search.
That's not to say this little formula will unseat Google, just that it may help organize smaller search services, such as those on local or vertical sites. What am I missing? How do you think keyword search engines achieve the best perceived quality of match?