The StartGrep action finds documents matching a regular expression. When a document is initially added to the Alexa search index, about fifty different document attributes are indexed in separate search fields. The StartGrep action allows you to filter your search results using criteria that Alexa has not indexed. The regular expression is run against the actual document. You could use this action, for example, to select documents containing a specific HTML tag.
Some limitations:
Not all documents indexed by the search engine are available for post-processing with StartGrep. Add CachedDocumentsOnly=true to your StartSearch request to limit the results to documents that have been cached and are available for post-processing.
Only the first 20,000 characters of a document will be examined.
Only the first five matches are returned.
The extracted text is truncated at 250 characters.
All whitespace characters in the extracted text, such as tabs, newlines or carriage returns, are replaced with spaces.
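The limitations above can be approximated locally when trying out a pattern. The following is a minimal sketch, not the service's implementation: it uses Python's `re` module, whose syntax differs from Java's in some details, and the function name `preview_grep` and the fallback to the whole match when the pattern has no capturing group are assumptions made for illustration.

```python
import re

MAX_CHARS_EXAMINED = 20_000  # only the first 20,000 characters are examined
MAX_MATCHES = 5              # only the first five matches are returned
MAX_EXTRACT_LEN = 250        # extracted text is truncated at 250 characters


def preview_grep(pattern: str, document: str) -> list[str]:
    """Approximate what would be extracted from a single document."""
    examined = document[:MAX_CHARS_EXAMINED]
    extracts = []
    for match in re.finditer(pattern, examined):
        # Assumption: use the first capturing group if there is one,
        # otherwise fall back to the whole match.
        text = match.group(1) if match.re.groups else match.group(0)
        text = (text or "")[:MAX_EXTRACT_LEN]
        # Replace each whitespace character (tab, newline, ...) with a space.
        extracts.append(re.sub(r"\s", " ", text))
        if len(extracts) == MAX_MATCHES:
            break
    return extracts
```

For example, `preview_grep(r"(cow\s\w+)", "cow\tone cow\ntwo")` yields two extracts with the tab and the newline each replaced by a space.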
This action reads the output file from a Million Search Results query and then further filters the matching documents. The steps are:
Use this test page to fine-tune a regular expression that will extract or filter what you are interested in.
Make a Million Search Results StartSearch request. Your request should include CachedDocumentsOnly=true to select documents whose content is available.
Use GetStatus to check the status of your search process.
Once your search process is completed, GetStatus will return a DownloadUrl containing your query results.
Pass the DownloadUrl in as the InputFileURL for your StartGrep request.
Use GetStatus to check the status of your grep process.
Once your grep process is completed, GetStatus will return a DownloadUrl containing your grep results.
Download your results.
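The steps above amount to two start-then-poll cycles, with the first DownloadUrl fed into the second request. The sketch below shows only that control flow. The callables `start_search`, `start_grep`, and `get_status` are hypothetical placeholders supplied by the caller; in particular, `get_status` is assumed to return the DownloadUrl once the process has completed and `None` before that, which is a simplification of whatever the real GetStatus response contains.

```python
import time
from typing import Callable, Optional


def wait_for_download_url(
    get_status: Callable[[str], Optional[str]],
    action_request_id: str,
    poll_seconds: float = 60.0,
    max_polls: int = 1000,
) -> str:
    """Poll the (hypothetical) get_status callable until it yields a DownloadUrl."""
    for _ in range(max_polls):
        download_url = get_status(action_request_id)
        if download_url is not None:
            return download_url
        time.sleep(poll_seconds)
    raise TimeoutError(f"request {action_request_id} did not finish in time")


def search_then_grep(
    start_search: Callable[[], str],
    start_grep: Callable[[str], str],
    get_status: Callable[[str], Optional[str]],
    poll_seconds: float = 60.0,
) -> str:
    """Run a search, wait for it, grep its output, and wait for the grep."""
    # The StartSearch request should include CachedDocumentsOnly=true.
    search_id = start_search()
    search_url = wait_for_download_url(get_status, search_id, poll_seconds)
    # The search DownloadUrl becomes the InputFileURL of the StartGrep request.
    grep_id = start_grep(search_url)
    return wait_for_download_url(get_status, grep_id, poll_seconds)
```

Passing the operations in as callables keeps the sketch independent of how requests are signed and sent, which this page does not describe.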
| Name | Description | Required |
|---|---|---|
| Action | Set the Action parameter to StartGrep. | Yes |
| Version | Pass in the version number to ensure that requests succeed even if the API changes in future versions. | Yes |
| InputFileURL | The DownloadUrl from a Million Search Results StartSearch query. | Yes |
| RegExPattern | A regular expression used to post-filter the search results. If the regular expression matches the document, the corresponding line from the input file is echoed to the output file. If a capturing group is specified using parentheses, the first match is returned as the last column in the output file. The regular expression is matched against the whole document, not per line. The syntax is that of Java regular expressions. A typical use is finding documents that contain OBJECT tags. See the article on regular expressions for more examples. | Yes |
| MaxNumberOfDocuments | Maximum number of documents to process. The value must be between 1 and 10000000 (inclusive). | Yes |
| MaxNumberOfCPUHours | The maximum runtime in CPU hours, after which any available results are returned as if the search terminated normally. The value must be between 0.5 and 100 (inclusive). Processing one million documents with a simple regular expression will typically take 10 CPU hours. Multiple CPUs are used, so the actual clock time will typically be less. | Yes |
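The documented ranges can be checked on the client before a request is sent. The sketch below validates the two numeric parameters and assembles the StartGrep query parameters. The function names are illustrative, and the CPU-hour estimate assumes that the "typical" figure of 10 CPU hours per million documents scales linearly, which the service does not promise. The authentication parameters (AWSAccessKeyId, Timestamp, Signature) are deliberately left out because their calculation is not covered here.

```python
from urllib.parse import urlencode

ENDPOINT = "http://msearch.amazonaws.com/"  # endpoint used in the sample request
CPU_HOURS_PER_MILLION_DOCS = 10.0           # typical figure for a simple pattern


def estimate_cpu_hours(num_documents: int) -> float:
    """Rough linear estimate, clamped to the allowed 0.5 to 100 range."""
    estimate = num_documents / 1_000_000 * CPU_HOURS_PER_MILLION_DOCS
    return min(100.0, max(0.5, estimate))


def build_start_grep_params(input_file_url, regex_pattern, max_documents,
                            max_cpu_hours, version="2007-03-15"):
    """Validate the documented ranges and return the query parameters."""
    if not 1 <= max_documents <= 10_000_000:
        raise ValueError("MaxNumberOfDocuments must be between 1 and 10000000")
    if not 0.5 <= max_cpu_hours <= 100:
        raise ValueError("MaxNumberOfCPUHours must be between 0.5 and 100")
    return {
        "Action": "StartGrep",
        "Version": version,
        "InputFileURL": input_file_url,
        "RegExPattern": regex_pattern,
        "MaxNumberOfDocuments": str(max_documents),
        "MaxNumberOfCPUHours": str(max_cpu_hours),
    }


def build_unsigned_url(params: dict) -> str:
    """URL-encode the parameters; signing still has to be added separately."""
    return ENDPOINT + "?" + urlencode(params)
```

URL-encoding matters here because regular expressions routinely contain characters such as parentheses, `+`, and `&` that are not safe in a query string.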
| Name | Description |
|---|---|
| ActionRequestId | The ID associated with this request. Pass this ID to the GetStatus action to find out whether your grep process has completed. |
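Pulling the ActionRequestId out of a response can be sketched with Python's standard XML parser. The embedded XML is the sample response from this page; the function name `extract_action_request_id` is an illustrative choice, and matching on the local element name is an assumption that keeps the code working whether or not the elements carry a namespace.

```python
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """\
<StartGrepResponse xmlns:aws="http://websearch.amazonaws.com/doc/2007-03-15/">
  <StartGrepResult>
    <ActionRequestId>2167c9cb-cf4c-4ebd-82e5-832a7754c0d8</ActionRequestId>
  </StartGrepResult>
  <ResponseMetadata>
    <RequestId>2167c9cb-cf4c-4ebd-82e5-832a7754c0d8</RequestId>
  </ResponseMetadata>
</StartGrepResponse>"""


def extract_action_request_id(response_xml: str) -> str:
    """Return the text of the first ActionRequestId element."""
    root = ET.fromstring(response_xml)
    for element in root.iter():
        # ElementTree writes namespaced tags as "{uri}name"; keep only the name.
        if element.tag.rsplit("}", 1)[-1] == "ActionRequestId":
            return (element.text or "").strip()
    raise ValueError("response has no ActionRequestId element")


if __name__ == "__main__":
    print(extract_action_request_id(SAMPLE_RESPONSE))
```

The returned ID is the value to hand to GetStatus when checking on the grep process.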
The following example shows a Query-style request and response:
http://msearch.amazonaws.com/?
Action=StartGrep
&Version=2007-03-15
&AWSAccessKeyId=[Your AWS Access Key ID]
&Timestamp=[Current timestamp]
&Signature=[Calculated request signature]
&InputFileURL=[A URL that contains the output of a Million Search Results query]
&RegExPattern=[A regular expression used to filter/extract matching text]
&MaxNumberOfDocuments=[The maximum number of documents to process]
&MaxNumberOfCPUHours=[The maximum number of CPU hours to spend]
<StartGrepResponse xmlns:aws="http://websearch.amazonaws.com/doc/2007-03-15/">
<StartGrepResult>
<ActionRequestId>2167c9cb-cf4c-4ebd-82e5-832a7754c0d8</ActionRequestId>
</StartGrepResult>
<ResponseMetadata>
<RequestId>2167c9cb-cf4c-4ebd-82e5-832a7754c0d8</RequestId>
</ResponseMetadata>
</StartGrepResponse>
The results file is a tab-delimited text file with UTF-8 encoding. Lines starting with a hash mark (#) are comments.
# Grepping for (cow.....)
# Processed 94 documents
# Results for Query: text:(cow) lang:en
http://ca.news.vahoo.com:80/s/odd_germany_cow_dc	Cow runs riot across city	us-ascii	9059	/2007/05/27/42/0/42_0_20070527152929_crawl23.arc.gz	57817921	cow lash	cow_dcrt	cow+runs	cow_dcrh	cow_dc
http://ca.news.vahoo.com:80/f/canada_cow_col	Vahoo! Canada News	us-ascii	9269	/2007/05/04/42/0/42_0_20070504062900_crawl23.arc.gz	40321556	cow's he	cow_colg	cow+case	cow_colt	cow_col
The following table describes the sequence of attributes in the delimited file. The first 6 columns are copied directly from the input file to the output file.
| Column | Document Attribute |
|---|---|
| 1 | URL |
| 2 | Title |
| 3 | Character set |
| 4 | Size in bytes |
| 5 | Internal document identifier used by the StartGrep action |
| 6 | Offset used by the StartGrep action |
| 7 to 11 | Captured text, if any, returned from the StartGrep action |
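Reading the results file follows directly from the column layout above: skip comment lines, split each remaining line on tabs, and treat everything after the sixth column as captured text. The sketch below is one way to do that; the `GrepResult` class and its field names are illustrative and not part of the service, and treating the size and offset as integers is an assumption based on the sample output.

```python
from dataclasses import dataclass, field


@dataclass
class GrepResult:
    url: str
    title: str
    charset: str
    size_bytes: int
    document_id: str   # internal identifier used by StartGrep
    offset: int        # offset used by StartGrep
    captures: list[str] = field(default_factory=list)


def parse_grep_results(text: str) -> list[GrepResult]:
    """Parse the tab-delimited results, ignoring '#' comment lines."""
    results = []
    for raw_line in text.split("\n"):
        line = raw_line.rstrip("\r")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 6:
            raise ValueError(f"expected at least 6 columns, got {len(cols)}")
        results.append(GrepResult(
            url=cols[0], title=cols[1], charset=cols[2],
            size_bytes=int(cols[3]), document_id=cols[4],
            offset=int(cols[5]), captures=cols[6:],
        ))
    return results


if __name__ == "__main__":
    sample = "# Processed 94 documents\n" + "\t".join([
        "http://ca.news.vahoo.com:80/s/odd_germany_cow_dc",
        "Cow runs riot across city", "us-ascii", "9059",
        "/2007/05/27/42/0/42_0_20070527152929_crawl23.arc.gz", "57817921",
        "cow lash", "cow_dcrt", "cow+runs", "cow_dcrh", "cow_dc",
    ]) + "\n"
    for result in parse_grep_results(sample):
        print(result.url, result.captures)
```

Splitting on the tab character alone, rather than on any whitespace, keeps titles and captured text that contain spaces intact.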
Use the GetStatus action to get the status of your grep process and the download URL where you can pick up your results.