Uniquing Search Results

You can use the Unique parameter to filter duplicates out of the results that match a Search query. (The Unique parameter is not supported for the StartSearch action.) Duplicates can be filtered on any combination of the following fields: siteprefix, subsite, site, port, path, title, size, date, contentcheck, and shingle.

The default Unique value is

Unique=site,2;shingle,2/6;siteprefix.massage(subsite).site.lower(path)

which limits the results to two per site, drops documents whose content is mostly the same as a document already returned, and returns only one document per URL. The following examples describe the Unique value attributes in greater detail and show how they are applied.

Note

When a query searches a specific site, such as cats+site:yahoo.com (look for pages about cats on yahoo.com), the user probably wants as many results as possible from yahoo.com. However, the default Unique value includes site,2, which returns a maximum of 2 results from any given site. For this reason, site,2 is automatically removed from the default Unique value if there is a site: in the Query parameter. Passing in a value for the Unique parameter disables this behavior.
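The rule in the note can be sketched in Python as follows. This is only an illustration of the described behavior, not the service's implementation: the function name effective_unique is hypothetical, and detecting site: by looking for a term that starts with "site:" is an assumption.

```python
DEFAULT_UNIQUE = "site,2;shingle,2/6;siteprefix.massage(subsite).site.lower(path)"


def effective_unique(query, unique=None):
    """Return the Unique value that would apply to a request (hypothetical helper)."""
    # An explicitly passed Unique value is used as-is; the site: rule is disabled.
    if unique is not None:
        return unique

    # Assumption: a site: restriction appears as a query term beginning with "site:".
    terms = query.replace("+", " ").split()
    if not any(term.startswith("site:") for term in terms):
        return DEFAULT_UNIQUE

    # Drop only the per-site limit; keep the remaining default criteria.
    criteria = [c for c in DEFAULT_UNIQUE.split(";") if c != "site,2"]
    return ";".join(criteria)
```

With this sketch, effective_unique("cats+site:yahoo.com") yields the default value without site,2, while effective_unique("cats+site:yahoo.com", "site,2") returns site,2 unchanged.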

Return only 1 result per site:

Unique=site

Return 2 results per site:

Unique=site,2
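To make the effect of site and site,2 concrete, the following Python sketch applies a per-site limit to a ranked result list on the client side. The result shape (a dict with a "site" key) and the function name limit_per_site are assumptions for illustration, as is the choice to keep the highest-ranked results for each site.

```python
from collections import Counter


def limit_per_site(results, max_per_site=1):
    """Keep at most max_per_site results for each site, preserving order."""
    seen = Counter()
    kept = []
    for result in results:
        site = result["site"]
        if seen[site] < max_per_site:
            kept.append(result)
            seen[site] += 1
    return kept


# Made-up sample data: three results from one site, one from another.
sample = [
    {"site": "example.com", "path": "/a"},
    {"site": "example.com", "path": "/b"},
    {"site": "example.org", "path": "/c"},
    {"site": "example.com", "path": "/d"},
]
one_each = limit_per_site(sample)        # behaves like Unique=site
two_each = limit_per_site(sample, 2)     # behaves like Unique=site,2
```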

The URL of the document returned for a given search result can be constructed from the siteprefix, subsite, site, and path result fields. Return only one document with the exact same URL:

Unique=siteprefix.subsite.site.path

Return only one document with a given URL, but "massage away" a "www" in subsite, and ignore case differences in the URL path:

Unique=siteprefix.massage(subsite).site.lower(path)
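The idea behind this value can be illustrated with a small Python sketch that builds a comparison key from the four URL fields. The exact behavior of massage() and the exact contents of each field are not specified here, so this sketch assumes that massaging simply drops a subsite equal to "www" and that results are dicts with string fields; all helper names are hypothetical.

```python
def massage(subsite):
    """Assumed behavior: treat a "www" subsite as if it were absent."""
    return "" if subsite.rstrip(".").lower() == "www" else subsite


def url_key(result):
    """Build a key mirroring siteprefix.massage(subsite).site.lower(path)."""
    return (
        result["siteprefix"],
        massage(result["subsite"]),
        result["site"],
        result["path"].lower(),  # ignore case differences in the path
    )


def unique_by_url(results):
    """Keep the first result for each URL key, preserving order."""
    seen = set()
    kept = []
    for result in results:
        key = url_key(result)
        if key not in seen:
            seen.add(key)
            kept.append(result)
    return kept
```

Under these assumptions, a result with subsite "www" and path "/Cats.html" and another with an empty subsite and path "/cats.html" on the same site produce the same key, so only the first is kept.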

Shingles are used to measure document similarity. Each document is represented by 6 "shingles", which are checksums of portions of that page's content. Don't return documents with mostly the same content:

Unique=shingle,2/6

The contentcheck is an MD5 checksum of the document content. Don't return more than one document with the exact same content:

Unique=contentcheck
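As a rough illustration of exact-content filtering, the Python sketch below computes an MD5 checksum with the standard hashlib module and keeps only the first document for each checksum. How the service prepares the content before hashing is not stated, so hashing the raw bytes is an assumption, and the function names are hypothetical.

```python
import hashlib


def content_checksum(content):
    """Return the MD5 hex digest of a document's content (bytes)."""
    return hashlib.md5(content).hexdigest()


def unique_by_content(documents):
    """Keep the first document for each distinct checksum, preserving order."""
    seen = set()
    kept = []
    for content in documents:
        digest = content_checksum(content)
        if digest not in seen:
            seen.add(digest)
            kept.append(content)
    return kept
```

Because the comparison is on a checksum of the whole document, two pages that differ by even one byte are treated as different; near-duplicates are the job of the shingle criterion.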

You can also combine multiple criteria by separating them with semicolons. The value below would return two results per site, with mostly unique document content, and only one result for a given URL.

Unique=site,2;shingle,2/6;siteprefix.subsite.site.path
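A combined value contains commas, semicolons, and a slash, so it should be URL-encoded when placed in a request query string. The Python sketch below joins criteria and encodes them with the standard urllib.parse.urlencode function. The helper name build_unique is hypothetical, and the other parameters a real request needs (such as the action and authentication fields) are deliberately left out.

```python
from urllib.parse import urlencode


def build_unique(*criteria):
    """Join individual Unique criteria with semicolons."""
    return ";".join(criteria)


unique_value = build_unique(
    "site,2",                         # at most two results per site
    "shingle,2/6",                    # drop mostly identical content
    "siteprefix.subsite.site.path",   # one result per exact URL
)

# Only Query and Unique are shown; a real request carries more parameters.
query_string = urlencode({"Query": "cats", "Unique": unique_value})
```

Here unique_value is the same string as the combined example above, and query_string holds its percent-encoded form.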