Using Character Decoding Options to Process Multi-Byte Data

Overview

Two types of decoding occur during the ETL process for SAS Data Surveyor for Clickstream Data. It is important that you fully understand the implications in order to ensure the best data quality. The following decoding types are available:
URL Encoding
URL encoding is also known as percent encoding. This encoding type is a mechanism used to encode certain reserved characters (or unsafe characters) that might appear in a URL. These reserved characters might have special meaning if not otherwise encoded. For example, a form on a Web page could have a text field where customers can leave a short comment that includes a question mark character. Submitting this form results in the following URL:
/submit.php?text=Is this really true?
A question mark is a reserved character in a URL that is used to separate the requested file from the query string. In order to remove any confusion as to which question mark is the separator, the URL is automatically encoded as follows:
/submit.php?text=Is%20this%20really%20true%3F
This is the encoded form of the URL. Each space character is encoded as %20, and the question mark that belongs to the comment text (not the separator) is encoded as %3F.
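As a rough illustration of the percent encoding described above, the following Python sketch uses the standard library functions urllib.parse.quote and urllib.parse.unquote. Python is used only for illustration and is not part of SAS Data Surveyor for Clickstream Data. The variable names are hypothetical.

```python
from urllib.parse import quote, unquote

# Comment text typed into the form field (taken from the example above).
comment = "Is this really true?"

# Percent-encode the value; spaces and the question mark are not left as-is.
encoded = quote(comment)

# The separator question mark stays literal; only the field value is encoded.
url = "/submit.php?text=" + encoded
print(url)  # /submit.php?text=Is%20this%20really%20true%3F

# Decoding reverses the process and restores the readable text.
decoded = unquote(encoded)
print(decoded)  # Is this really true?
```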
Character Encoding
Character encoding is required when the URL sent to the Web server contains characters that are not part of the US-ASCII character set. For example, a Japanese customer who searches for the word Rugby would type the characters ラグビー. These characters, if passed as part of the URL, would be encoded as %E3%83%A9%E3%82%B0%E3%83%93%E3%83%BC when the Web page uses UTF-8. The actual percent-encoded values depend on the character set used in the Web page.
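The dependence on the page's character set can be sketched in Python as follows. This is an illustration only, assuming the standard library's urllib.parse module. The choice of Shift_JIS as the second character set is an assumption made for the example and does not come from the product documentation.

```python
from urllib.parse import quote, unquote

word = "ラグビー"

# Same text, percent-encoded from two different character sets.
utf8_form = quote(word, encoding="utf-8")
sjis_form = quote(word, encoding="shift_jis")  # assumed alternative charset

print(utf8_form)  # %E3%83%A9%E3%82%B0%E3%83%93%E3%83%BC
print(sjis_form)  # a different byte sequence for the same characters

# Each form decodes back to the original text only with its own encoding.
from_utf8 = unquote(utf8_form, encoding="utf-8")
from_sjis = unquote(sjis_form, encoding="shift_jis")
```

Both round trips return ラグビー, but only because each value is decoded with the encoding that produced it.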
As a user, you should consider the following topics:

Using the Decode Option

By default, the template jobs have decoding turned on. Furthermore, the incoming encoding of the Web log is set to UTF-8. The primary benefit of decoding the incoming data is that it becomes readable by humans. Because the data has been decoded, you no longer see encoded values in filtered data or reports. For information about the SAS Unicode Server, see Using a Unicode SAS Application Server.

Not Using the Decode Option

You should not turn on decoding when you process a standard Web log that contains records from pages that used different encodings or character sets. Character set information for each record is not present in a standard Web log. Therefore, all of the records are assumed to use the same encoding. Records from Web pages whose encoding does not match the encoding selected for the incoming data will not be decoded correctly. SAS page tag logs do not have this problem because the page tag code automatically encodes all data records using UTF-8.
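The following Python sketch shows what can go wrong when one encoding is assumed for a log with mixed encodings. It is an illustration under stated assumptions, not the product's implementation: the two request strings, the Shift_JIS page, and the ASSUMED_ENCODING name are all hypothetical.

```python
from urllib.parse import quote, unquote

# Hypothetical log: two requests for the same search term, one from a
# UTF-8 page and one from a Shift_JIS page. The log itself does not
# record which encoding each page used.
records = [
    "/search?q=" + quote("ラグビー", encoding="utf-8"),
    "/search?q=" + quote("ラグビー", encoding="shift_jis"),
]

# A single encoding is assumed for every record, as with a standard Web log.
ASSUMED_ENCODING = "utf-8"
decoded = [unquote(r, encoding=ASSUMED_ENCODING) for r in records]

for line in decoded:
    print(line)
# The first record is readable. The second contains U+FFFD replacement
# characters because its bytes are not valid UTF-8.
```

With mixed input like this, leaving the values undecoded keeps the original bytes intact instead of replacing them with unrecoverable characters.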

Multi-Byte Data Options in Clickstream Transformations

The following table covers decoding options in Clickstream transformations:
Multi-Byte Data Options in Clickstream Transformations

Option: Decode input data?
Location: Input pane of the Options tab in the Clickstream Log transformation
Setting: Yes
Notes: Determines whether decoding should be performed for each line of data as it is read from the input. When set to the default value of Yes, you must set the encoding of the incoming data correctly and use a SAS Unicode Application Server (if necessary).

Option: Encoding of the incoming data
Location: Input pane of the Options tab in the Clickstream Log transformation
Setting: Not displayed unless Decode input data? is Yes
Notes: Specifies the original encoding of the data in the input file. All SAS page tag logs are encoded in UTF-8 irrespective of the encoding used on the originating Web page. However, you must ensure that all incoming records come from Web pages with the same encoding when you process standard Web logs. Otherwise, data could be transcoded incorrectly.

Option: Decode referrer?
Location: Input pane of the Options tab in the Clickstream Log transformation
Setting: No
Notes: Determines whether the data contained in the referrer is decoded as it is processed. The referrer field has to be treated as a separate decoding entity because the encoding of the referrer string is often unknown, since it might come from an external site. If the encoding of all the referrers within your data is known, then this option can be set to Yes. Otherwise, it should remain set to No.

Option: Sort options
Location: Table pane of the Options tab in the Clickstream Sessionize transformation
Setting: Blank
Notes: Specifies sort options that can be useful for changing the linguistic collation of output when MBCS data is processed. See PROC SORT documentation for details about valid syntax and available options.
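The way the input and referrer decoding settings interact can be sketched in Python as follows. This is a hypothetical model, not the Clickstream Log transformation's actual logic: the DecodeOptions class, its field names, and the process_record function are invented for illustration. It also assumes that the referrer, when decoded, uses the same encoding as the rest of the record.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import unquote


@dataclass
class DecodeOptions:
    """Hypothetical stand-in for the options listed above."""
    decode_input: bool = True         # "Decode input data?" (Yes)
    incoming_encoding: str = "utf-8"  # "Encoding of the incoming data"
    decode_referrer: bool = False     # "Decode referrer?" (No)


def process_record(requested: str, referrer: str,
                   opts: DecodeOptions) -> Tuple[str, str]:
    """Decode the requested URL and, only if enabled, the referrer."""
    if opts.decode_input:
        requested = unquote(requested, encoding=opts.incoming_encoding)
    if opts.decode_referrer:
        # Assumption: the referrer shares the incoming encoding.
        referrer = unquote(referrer, encoding=opts.incoming_encoding)
    return requested, referrer


# With the settings shown above, the request is decoded and the referrer,
# whose encoding is unknown, is left exactly as it was logged.
req, ref = process_record(
    "/search?q=%E3%83%A9%E3%82%B0%E3%83%93%E3%83%BC",
    "http://example.com/?q=%83%89",
    DecodeOptions(),
)
```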