When a document is first added to the Alexa search engine, about fifty different document attributes are indexed in search fields. When a query is submitted, the search engine retrieves documents pertinent to that query by matching the query against the search fields. By default, the anchor text pointing to the document, the document title, its URL, its DMOZ category, and the document text are examined. In general, the more search fields that match, the better the document's score and the earlier it appears in the results list. All other things being equal, the more popular a page is, the earlier it appears in the list.
You may also narrow a query by explicitly specifying individual search fields in the Query parameter. The search fields listed in the table below let you limit a search by document attributes such as URL, size, language, category, and many more. To search an individual field, prefix the search terms with the field name and a colon, as shown in the examples.
Search for all JPEG images on yahoo.com:
type:image/jpeg site:yahoo.com
Search for documents containing the words "cat" and "dog" in the title that are in the English language:
title:(cat dog) lang:en
Search for documents about java crawled in April 2007 or May 2007:
java date:(200704|200705)
Search for documents about cats in all documents that are in English or where the language is unknown:
cats lang:(en|unknown)
Search for documents containing the word cats in the document text:
text:cats
Search for documents about cats with moderate adult filtering:
cats porn:(-yes)
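The examples above all follow one pattern: a field name, a colon, and either a single term or a parenthesized group. As a minimal sketch, assuming the Query parameter is assembled as a plain string on the client side, the hypothetical helpers below (their names and arguments are illustrative, not part of the Alexa API) build such clauses. They only reproduce the forms shown in the examples: space-separated terms, `|` for alternatives, and a leading `-` for exclusion.

```python
def field_clause(field, terms, any_of=False, exclude=False):
    """Build one `field:value` clause for the Query parameter.

    Hypothetical helper: it only reproduces the syntax shown in the
    examples (space-separated terms, `|` alternatives, `-` exclusion).
    """
    if isinstance(terms, str):
        terms = [terms]
    if not terms:
        raise ValueError("at least one term is required")
    if exclude:
        # Exclusion is written inside parentheses, e.g. porn:(-yes).
        # Joining several excluded terms with spaces is an assumption;
        # the examples only show a single excluded term.
        body = " ".join("-" + t for t in terms)
        return f"{field}:({body})"
    if len(terms) == 1:
        return f"{field}:{terms[0]}"
    sep = "|" if any_of else " "
    return f"{field}:({sep.join(terms)})"


def build_query(*parts):
    """Join free-text terms and field clauses with single spaces."""
    return " ".join(p for p in parts if p)


print(build_query(field_clause("type", "image/jpeg"), field_clause("site", "yahoo.com")))
print(build_query(field_clause("title", ["cat", "dog"]), field_clause("lang", "en")))
print(build_query("java", field_clause("date", ["200704", "200705"], any_of=True)))
print(build_query("cats", field_clause("porn", "yes", exclude=True)))
```

Each printed line matches one of the example queries above, which is a quick way to check that the helper has not drifted from the documented syntax.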
Boolean search fields are used to search for string tokens that either are or are not present in a given document. The tokens could be words, such as the words in a page title, or they could be other attributes of the page, such as the HTTP code returned when the page was retrieved.
Phrase search fields are used to search for a list of string tokens that should be searchable as word phrases. Phrase search also supports the notion of relevance. Records with more occurrences of tokens are considered more relevant.
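To make the distinction concrete, here is a toy model, not Alexa's implementation: a boolean field is treated as a set of tokens that a term either belongs to or not, while a phrase field keeps token order so that a multi-word phrase can be matched and its occurrences counted. The function names and the whitespace tokenization are assumptions made for illustration.

```python
def boolean_match(field_tokens, token):
    """Boolean field: the token is either present or absent."""
    return token.lower() in {t.lower() for t in field_tokens}


def phrase_occurrences(field_text, phrase):
    """Phrase field: count how often the phrase appears as consecutive tokens."""
    tokens = field_text.lower().split()
    wanted = phrase.lower().split()
    if not wanted:
        return 0
    n = len(wanted)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == wanted)


# A boolean field such as Code holds discrete tokens.
print(boolean_match(["200"], "200"))   # True
print(boolean_match(["200"], "404"))   # False

# A phrase field such as Title holds ordered words; a record with more
# occurrences would be considered more relevant under the rule above.
print(phrase_occurrences("black cat and white cat", "cat"))        # 2
print(phrase_occurrences("black cat and white cat", "white cat"))  # 1
```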
The table below contains the complete list of search fields. The most commonly used search fields are text, site, lang, type, magic, title, porn, and pagetype. The field names are case-insensitive.
| Type | Field Name | Field Type | Description |
|---|---|---|---|
| Document | Anchor | phrase | Inbound anchor text, that is, the anchor text pointing to this document. |
| | Charset | boolean | Character set (big5, big5-hkscs, cp874, cp949, euc-jp, euc-kr, euc-tw, gb18030, gb2312, gbk, iso-2022, iso-2022-cn, iso-2022-cn-ext, iso-2022-jp, iso-2022-jp-2, iso-2022-kr, iso-8459-1, iso-8859-1, ..., iso-8859-14, koi8, koi8-r, koi8-u, s-ascii, u25ufreei, unknown, us-ascii, utf-16be, utf-16le, utf-32be, utf-32le, utf-8, viscii, vuiso-8859-1, wdexows-31j, windows-1250, ..., windows-1257, windows-1) |
| | ClassTag | boolean | Value of class= attributes |
| | Code | boolean | HTTP response code returned by server at crawl time |
| | Date | boolean | Date document was crawled (2007, 200710, 20071025) |
| | HasText | boolean | Does the document have text? (yes, no, none) "no" if the document is not one of the text types, "none" if it is of a text type but has no text, "yes" otherwise. |
| | Lang | boolean | Two character language code (af, am, an, ar, arc, az, be, bg, bn, bo, br, bs, bug, ca, chr, co, cop, cs, csb, cv, cy, da, de, dv, el, en, eo, es, et, eu, fa, fi, fo, fr, fy, ga, gd, gsw, gu, he, hi, hr, ht, hu, hy, id, ii, io, is, it, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, li, lo, lt, lv, mk, ml, mn, mr, my, nap, nds, ne, nl, nn, no, oc, or, os, pa, pl, ps, pt, rm, ro, ru, sc, scn, si, sk, sl, sq, sr, sv, sw, ta, te, th, tk, tl, tr, uk, unknown, ur, uz, vi, wa, x, yi, zh) |
| | LinkText | phrase | Outbound anchor text |
| | Magic | boolean | MIME type determined by analyzing the document content (aiff, application, audio, bitmap, bmp, compress, css, dvi, elc, flash, frame, gif, greymap, gzip, html, image, javascript, jpeg, message, midi, mpeg, msword, news, octet, pdf, pixmap, plain, png, portable, postscript, prs, quicktime, rfc822, rtf, sc, shockwave, sid, stream, tar, text, tiff, unknown, video, x, xbm, xhtml, xml). Note: As of November 2007, about 93% of the pages in the search index were html. |
| | PageType | boolean | One of (robots, redirect, homepage, irrelevant). Homepage means that the page is hosted on a personal site. Irrelevant means that the page had a non-200 response code, had no text, or was a redirect or robots.txt page. Note: By default, pagetype:(-irrelevant) is passed in along with the query in a When using the "Million Search Results" |
| | Porn | boolean | Document contains adult content (yes, no, maybe) For strict adult content filtering use |
| | Region | boolean | Two character sub-language (bn, cn, hk, tw, unknown) |
| | RelTag | boolean | Value of rel= attributes. See http://microformats.org/wiki/reltag |
| | SizeAtLeast | boolean | Minimum document size in bytes (0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1k, 2k, 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m, 2m, 4m, 8m, 16m) |
| | Text | phrase | Text of document, excluding markup |
| | Title | phrase | Document title |
| | Type | boolean | MIME type from header (text/plain, image/jpeg, jpeg, ...) |
| Website | Dmoz | phrase | Open Directory Project categories (arts, business, computers, games, health, home, "kids and teens", news, recreation, reference, regional, science, shopping, society, sports, world, ... and about 150,000 more terms) Search within a specific category: Search within several categories: |
| | Traffic | boolean | Alexa traffic rank (top5, top10, top50, top100, top500, top1000, top5000, top10000, top50000, top100000, top500000, top1000000, top5000000, top10000000) To get sites ranked from 1001 to 5000 you would use: |
| URL | Site | boolean | Site (my.careers.yahoo.com, careers.yahoo.com, yahoo.com, com) Search only on specific sites: Search only .org sites: Don't search .co.uk sites and .org sites: |
| | Cache | boolean | Normalized URL ("aol.com/test.cgi?foo=bar" from www.aol.com/test.cgi?foo=bar) You can use this field to see if a specific document is in the search index. |
| | Url | phrase | URL ("aol.com/login?loc=us" from my.name.aol.com/login?loc=us) |
| | SubSite | boolean | Sub-site ("name" from my.name.aol.com) |
| | SitePrefix | phrase | Site prefix ("my" from my.name.aol.com) |
| | SLD | boolean | Second level domain ("amazon" from www.amazon.co.uk/path?) |
| | Suffix | boolean | URL suffix ("doc" from aol.com/test.cgi?foo=bar.doc) |
| | CSuffix | boolean | Pre-query suffix ("cgi" from aol.com/test.cgi?foo=bar.doc) |
| | Host | phrase | Host ("my.name.aol.com" from my.name.aol.com/login?loc=us) |
| Redirecting to | Redirect | boolean | Normalized URL ("aol.com/test.cgi?foo=bar" from www.aol.com/test.cgi?foo=bar) |
| | RSite | boolean | Site (members.aol.com, yahoo.com, ...) |
| | RUrl | phrase | URL ("aol.com/login?loc=us" from my.name.aol.com/login?loc=us) |
| | RSubSite | boolean | Sub-site ("name" from my.name.aol.com) |
| | RSitePrefix | phrase | Site prefix ("my" from my.name.aol.com) |
| | RSLD | boolean | Second level domain ("amazon" from www.amazon.co.uk/path?) |
| | RSuffix | boolean | URL suffix ("doc" from aol.com/test.cgi?foo=bar.doc) |
| | RCSuffix | boolean | Pre-query suffix ("cgi" from aol.com/test.cgi?foo=bar.doc) |
| | RHost | phrase | Host ("my.name.aol.com" from my.name.aol.com/login?loc=us) |
| Linking to | Link | boolean | Normalized URL ("aol.com/test.cgi?foo=bar" from www.aol.com/test.cgi?foo=bar) |
| | LSite | boolean | Linking to Site (members.aol.com, yahoo.com, ...) |
| | LUrl | phrase | URL ("aol.com/login?loc=us" from my.name.aol.com/login?loc=us) |
| | LSubSite | boolean | Sub-site ("name" from my.name.aol.com) |
| | LSitePrefix | phrase | Site prefix ("my" from my.name.aol.com) |
| | LSLD | boolean | Second level domain ("amazon" from www.amazon.co.uk/path?) |
| | LSuffix | boolean | URL suffix ("doc" from aol.com/test.cgi?foo=bar.doc) |
| | LCSuffix | boolean | Pre-query suffix ("cgi" from aol.com/test.cgi?foo=bar.doc) |
| | LHost | phrase | Host ("my.name.aol.com" from my.name.aol.com/login?loc=us) |
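
Several fields in the table index the same document at more than one level of detail: Date accepts a year, a year and month, or a full day, and Site accepts any trailing portion of the host name. The sketch below shows how such value lists could be derived. It is inferred from the examples in the table rather than taken from Alexa's code, and the function names are hypothetical. The Suffix and CSuffix helpers likewise just reproduce the two examples given.

```python
from datetime import date


def date_tokens(crawled):
    """Return the Date values a crawl date would match: year, year+month, full day."""
    full = crawled.strftime("%Y%m%d")
    return [full[:4], full[:6], full]


def site_tokens(host):
    """Return every trailing portion of a host name, longest first."""
    labels = host.lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))]


def suffix(url):
    """Text after the last dot in the whole URL (query string included)."""
    return url.rsplit(".", 1)[1] if "." in url else ""


def csuffix(url):
    """Text after the last dot in the part of the URL before any '?'."""
    return suffix(url.split("?", 1)[0])


print(date_tokens(date(2007, 10, 25)))
print(site_tokens("my.careers.yahoo.com"))
print(suffix("aol.com/test.cgi?foo=bar.doc"))
print(csuffix("aol.com/test.cgi?foo=bar.doc"))
```

Under this reading, a query such as `date:200710` matches every document crawled in October 2007, and `site:yahoo.com` matches documents on any host under yahoo.com.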
The fields in the table below refer to information about the web server that served the page.
| Type | Field Name | Field Type | Description |
|---|---|---|---|
| Crawl | IP1 | boolean | First octet of server IP address (207) |
| | IP2 | boolean | First two octets of server IP address (207.171) |
| | IP3 | boolean | First three octets of server IP address (207.171.166) |
| | IP4 | boolean | Full server IP address (207.171.166.102) |
| Geography | Country | boolean | Two-character country code from server IP address (us, de, ...) |
| | State | boolean | Two-character state code from server IP address (IL, NY, CA, ...) |
| | City | boolean | City from server IP address |
| | ZipCode | boolean | Zip code from server IP address |
| | DmaCode | boolean | Designated Market Area from server IP address (U.S. only) |
| | AreaCode | boolean | Area code from server IP address (U.S. only) |
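
The IP1 through IP4 fields index progressively longer prefixes of the server's dotted-quad address. A minimal sketch of that mapping, assuming IPv4 addresses only (the function name is hypothetical and the lowercase field names rely on field names being case-insensitive):

```python
import ipaddress


def ip_field_values(address):
    """Map an IPv4 address to the values of the IP1..IP4 search fields."""
    # Validate first; raises ValueError for anything that is not IPv4.
    octets = str(ipaddress.IPv4Address(address)).split(".")
    return {f"ip{n}": ".".join(octets[:n]) for n in range(1, 5)}


values = ip_field_values("207.171.166.102")
print(values)

# Restrict a query to servers whose address starts with 207.171.
print(f"cats ip2:{values['ip2']}")
```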