Speedy alternative to sort/sort/merge
SAS® PROC Format reduces CPU up to 80% at Wells Fargo Bank
Users of SAS software who work with large data sets are regularly confronted with processing-time challenges. The question often arises: How might it be possible to reduce the amount of CPU required to retrieve specific data?
Jenine Milum found an answer.
Milum, now a Vice President and Analytics Manager at Wells Fargo Bank, experienced a breakthrough moment a little more than a decade ago while working in another industry.
As the website became more popular, the volumes of data coming in for us to analyze grew at a mammoth rate. We were constantly looking for ways to speed up our daily processes and make them more efficient.
"We were processing log activity for our website on a daily basis," says Milum, who's been a SAS user for more than 20 years. "As the website became more popular, the volumes of data coming in for us to analyze grew at a mammoth rate. We were constantly looking for ways to speed up our daily processes and make them more efficient.
"In analyzing where the drain on resources was coming from, it was evident it came from matching this massive data to other files. Every time we did it, we had to sort and then merge that massive file. That meant reading it twice, and we needed a way to avoid doing so."
Her innovative solution involves a matching method utilizing PROC Format, replacing the CPU-heavy sort/sort/merge.
"It's ideal for situations in which a key from one file is needed to extract data from another," Milum says. "It's best suited when you're dealing with at least one large-volume data set and need to merge it with a smaller amount of data."
Milum now regularly uses this method in her work at Wells Fargo. It's effective on all platforms utilizing SAS, and has proven time and again to decrease CPU by 70 percent to 80 percent.
Users of SAS System software are often confronted with the challenge of retrieving specific information from very large data sets. How can the desired information be extracted effectively while reducing the amount of time required to retrieve the data?
Milum's matching method utilizing PROC Format provides the solution.
To appreciate the effectiveness of the PROC Format method, it helps to first look at the sort/sort/merge that it's replacing:
proc sort data=largefile;
proc sort data=keyfile (keep=keyvar) nodupkey;
data match; merge largefile (in=large)
if key and large;
Sort/sort/merge is used when key values from one file are needed to extract records from another file containing the same key, or BY variable.
Milum offers the following example:
Suppose you have a small file (keyfile) that contains a list of Social Security numbers (the key) of individuals in a particular division of your company. In order to extract additional information on each individual, these numbers need to be matched to another file (largefile) representing the entire company. You want only the records that can be found in both files.
For a clean merge, both data sets have to be sorted. Both files have to be processed twice, once in the sort and once in the merge step. If either or both of the files are large, CPU time can be considerable.
There are other basic concerns when running any sort/sort/merge. Are there any duplicate records in either data set being used in the merge? Is the merge logic being handled properly so that required data from matching data sets will not be accidentally overlaid?
"These are no longer issues with the PROC Format method," Milum says.
Before utilizing the PROC Format method, a basic understanding of the "parts" of a format is necessary. Milum offers, as an example, a basic format she calls $CTYST. that links cities and states. Using the "fmtlib" option of PROC Format reveals the metadata surrounding this existing format:
Proc Format library=work fmtlib;
"The result shows several of the critical pieces used to create a format out of one of the KEYFILE data sets used in this sort/merge," Milum says.
The value of the FORMAT is $CTYST. START is the value of the variable used to match – in this case, the city name. Variable END is not necessary since numeric ranges aren't being used. LABEL represents the value START, which will be assigned in the instance of a match – in this case, the state.
Working with these three metadata items in a standard format, you can create a format from the KEYFILE data set that you can use for matching purposes. When a record has a match to the value in START, the value in LABEL is linked to it for additional use.
"Understanding these basics about formatting allows for a slick trick utilizing the selection ability of PROC Format," Milum says.
She offers the following PROC Format method:
data key; set keyfile(keep=keyvar);
/* These variables translate to the FORMAT values in the metadata */
fmtname = '$key';
label = '*';
rename keyvar = start;
proc sort data=key nodupkey;
Proc Format cntlin=key;
data match; set largefile;
if put (keyvar,$key.) = '*';
In the first DATA step, a SAS data set needs to be created from the input file with the required valid format-variable names: LABEL, START and FMTNAME. Doing so prepares the information to be turned into a format:
• START is set to the variable assigned as key.
• FMTNAME becomes the format name to be referenced later. (Formatting naming conventions do not allow for a FMTNAME to end in a numeric value.)
• LABEL becomes the symbol that the desired key values are associated with – in this case, asterisk (*).
The LABEL is assigned an asterisk as the formatted value, which acts as the hook into the bigger file. Using this simple line of code in any DATA step within your program will select those records that match the formatted values.
if put (key,$key.) = '*';
The variable name KEY can be any name in a data set as long as the values in the variable will have matches to the values in the format. You can use this line of code to select these matches anywhere in your program.
It's very important, Milum points out, that the symbol assigned to LABEL never has an occurrence in the key character string of the master file; otherwise, an unwanted match will result. The asterisk symbol works in most situations, as it's rarely contained in character-variable values.
"This pre-format dataset needs to be sorted and any duplicates eliminated," Milum says. "SAS formats won't allow duplicate values."
Running PROC Format with duplicate values will create the SAS system message "ERROR: This range is repeated, or values overlap," and processing will be halted.
To create a working format, execute PROC FORMAT with the CNTLIN=option using the sorted data set Key as the input-control data set. This converts the data stored in the pre-format SAS data set (Key) to a SAS format and adds it to the format catalog in the Work library.
Once a format is created, it can be used anywhere else in the program for selecting matches to the key. In essence, the assignment statement gives the value of LABEL – in this case, asterisk (*) – to a matching key. This in turn can be used for additional coding. In the above scripts, it's used to select records matching the key.
"Using the PROC Format method, only one file is processed twice – the smaller file – and only one variable is needed from it," Milum says. "The biggest savings in CPU is that the sorting of the large file isn't required."
For those still skeptical, Milum provides results showing several different methods for comparison. The same input files were used in all the examples. All tests were run using the same UNIX Sun platform. The key file had 730 observations; the larger file had 1.5 million records.
Milum points out that there are other uses for her method.
For example, merging frequently requires more than one variable. While still using the PROC Format methodology, two additional solutions provide the desired results. One involves concatenating the matching variables. The second suggests creating additional formats or key variables for matching.
These approaches allow you to select records by matching cities and state:
The Variable Concatenation Format approach: Concatenating the variables you would normally sort and then merge by creates just one unique variable by which a format can be created. This single variable can then be used to create a format to select records. Taking the value of CITY as Atlanta and the value of STATE as Georgia, the value of the new variable would be AtlantaGeorgia.
set keyfile (keep=city state);
start = trim (city) || trim (state);
fmtname = ‘$ctyst’;
label = ‘*’;
proc sort data=key nodupkey;
Proc Format cntlin=key;
if put (trim (city) || trim (state),$ctyst.) = ‘*’;
The Multiple Format approach: You can use one DATA step to create multiple pre-formatted data sets. Notice that START and FMTNAME are assigned and then output. The second format variable name is being reset and then output. You can do this up to as many key variables as necessary. You'll need one Format SAS procedure to create each of the key formats. Once the formats are created, use them in any future DATA step to identify matching records.
data keycity keystate;
set keyfile (keep=City State);
label = ‘*’;
start = City;
fmtname = ‘$city’;
start = State;
fmtname = ‘$state’;
proc sort data=keycity nodupkey;
proc sort data=keystate nodupkey;
Proc Format cntlin=keycity;
Proc Format cntlin=keystate;
data matchfile; set largefile;
if put (city,$city.) = ‘*’ and
put (state,$state.) = ‘*’;
Some additional tips
Milum offers a few tips and suggestions in using the PROC Format method:
Make the LABEL more meaningful.
In the first DATA step:
label = 'Match';
Use multiple formats to include or exclude records.
data newfile; set bigfile;
if put (key1,$keyone.) = ‘*’ and put (key2,$keytwo.) ne ‘*’;
Create one format with multiple values.
In the first DATA step:
if key = 'one' then label = 'one';
if key = 'two' then label = 'two';
Then in the last DATA step:
data newfileone newfiletwo; set bigfile;
if put (key,$key.) = ‘one’ then output newfileone; else
if put (key,$key.) = ‘two’ then output newfiletwo; else
Utilize the Format in a WHERE clause.
data newfile ;
merge bigfileone (where=(put(key,$key.) = ‘one’))
bigfiletwo (where=(put(key,$key.) = ‘two’));
by newvar ;
Milum first presented her PROC Format method findings in 2001 at a SAS users group, and has since done so on a number of occasions, having twice won best paper awards.
"I do a 'Where's Waldo?' spin on the method," she says. "You're trying to find something small in a sea of information."
Time and again, her method has proven its value.