Managing Non-Human Visitor Detection

Overview

Spiders, robots, crawlers, pingers, and any other computer programs that generate traffic to a Web site are referred to as non-human visitors (NHV). A spider (a search engine bot, for example) traverses the Web site by following its links to determine the contents of the Web pages. NHVs have behavioral characteristics that make it possible to identify them, such as clicking at a rate faster than humanly possible or pinging at an exact interval.
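The two timing characteristics named above can be illustrated with a minimal Python sketch. It is not part of the product and does not reproduce its detection logic. The function name looks_non_human and the limits min_human_gap and max_jitter are hypothetical values chosen only for illustration.

```python
from statistics import mean, pstdev


def looks_non_human(click_times, min_human_gap=0.5, max_jitter=0.05):
    """Return True if click timestamps (in seconds) show spider-like timing.

    min_human_gap and max_jitter are illustrative assumptions, not
    values taken from the product.
    """
    if len(click_times) < 3:
        return False  # too few clicks to judge the timing

    ordered = sorted(click_times)
    # Time elapsed between each pair of consecutive clicks.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]

    too_fast = mean(gaps) < min_human_gap    # faster than humanly possible
    too_regular = pstdev(gaps) <= max_jitter  # pinging at an exact interval
    return too_fast or too_regular
```

With these assumed limits, a visitor that requests a page exactly every 60 seconds is flagged as too regular, while a visitor with irregular gaps of several seconds is not.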
Activity from NHVs is handled in two locations. The first is the Clickstream Parse transformation, which uses the Filter Spiders by User Agent rule. This rule matches commonly known strings found in the user agent of well-behaved NHVs that identify themselves as such. By default, this rule deletes activity for these NHVs. The purpose of this detection is to eliminate NHV clicks as early as possible.
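The idea behind user-agent matching can be sketched in Python as follows. This illustrates the general technique only, not the Filter Spiders by User Agent rule itself. The marker list, the function names, and the user_agent record key are all assumptions made for the example.

```python
# Illustrative substrings only; the rule's actual list of strings is
# not reproduced here.
SPIDER_MARKERS = ("bot", "crawler", "spider", "slurp")


def is_self_identified_nhv(user_agent):
    """Return True if the user agent contains a known spider marker."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in SPIDER_MARKERS)


def filter_spiders(records):
    """Drop click records whose user agent identifies an NHV.

    Each record is assumed to be a dict with a 'user_agent' key.
    """
    return [r for r in records
            if not is_self_identified_nhv(r.get("user_agent"))]
```

Substring matching of this kind catches only NHVs that announce themselves. A program that sends a browser-like user agent passes through this filter and has to be caught by its behavior instead.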
The second location is the Clickstream Sessionize transformation. This transformation uses a proprietary behavioral detection approach that examines the behavior of the visitor within a session and decides whether that behavior is likely to be that of a human or a non-human visitor. This process is known as Behavioral Identification of Non-Human Sessions (BINS), and it is configured using the spider-related options on the Clickstream Sessionize transformation. See the Clickstream Sessionize Options tab help for details about how to configure this functionality.
If you have already filtered and removed the NHVs found by the Clickstream Parse transformation using the rule that examines the User Agent string, you might want to analyze the visitor behavior to ensure that none of the remaining sessions were created by NHVs. To perform this analysis, set the options in the Clickstream Sessionize properties window to detect any NHVs.

Set Clickstream Sessionize Transformation Options

Perform the following steps to set the options in the Clickstream Sessionize properties window:
  1. Open the Tuning category on the Options tab in the Clickstream Sessionize Properties window.
  2. Specify a value in the Spider detection threshold, Spider force threshold, and Maximum average time between spider clicks fields. For example, suppose that the Web site's administrator determines that no human visitor to the site is likely to perform more than 50 clicks in a session. In that case, you might set the Spider force threshold to 50, which forces the detection of an NHV when the number of clicks in a session reaches 50 or more.
  3. Select a value in the Spider Action field. This value determines whether the session is isolated, deleted, or left unchanged once the spider is identified.
    Although the Spider Action does not directly affect the detection of NHVs, it does determine what happens to the data for any NHV. The default of ISOLATE is useful because it moves the non-human data into a separate table, which enables you to validate that the detection heuristics are accurate. This table is found in the library specified by the Additional output library option in the Tables group on the Clickstream Sessionize transformation Options tab. The DELETE action is useful once the heuristics are considered accurate and you just want the non-human data discarded. The final option of NONE means that the non-human sessions are not identified, so they are treated as any other session data.
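
The way the three threshold fields and the Spider Action might interact can be sketched in Python. This is not the BINS algorithm, which is proprietary. The only behavior taken from the steps above is that a session with at least as many clicks as the Spider force threshold is always treated as an NHV. The interpretation of the other two fields, every default value, and both function names are assumptions made for illustration.

```python
def classify_session(click_times, detection_threshold=10,
                     force_threshold=50, max_avg_gap=2.0):
    """Return True if a session (click timestamps in seconds) looks non-human.

    Assumed interpretation, not the product's logic:
    - force_threshold: at or above this click count, always an NHV.
    - detection_threshold: below this click count, timing is not examined.
    - max_avg_gap: at or below this average gap, the session is an NHV.
    """
    n = len(click_times)
    if n >= force_threshold:
        return True
    if n < detection_threshold or n < 2:
        return False
    ordered = sorted(click_times)
    avg_gap = (ordered[-1] - ordered[0]) / (n - 1)
    return avg_gap <= max_avg_gap


def apply_spider_action(sessions, action="ISOLATE", **thresholds):
    """Split sessions into (kept, isolated) according to the action."""
    if action not in ("ISOLATE", "DELETE", "NONE"):
        raise ValueError(f"unknown spider action: {action}")
    kept, isolated = [], []
    for session in sessions:
        # NONE skips identification, so every session is kept.
        if action != "NONE" and classify_session(session, **thresholds):
            if action == "ISOLATE":
                isolated.append(session)  # set aside for validation
            # DELETE: the session is discarded
        else:
            kept.append(session)
    return kept, isolated
```

Running the sketch with ISOLATE first and inspecting the isolated sessions mirrors the validation workflow described above: once the flagged sessions look right, switching to DELETE discards them instead.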