Grubbs検定 (Grubbs-Smirnov検定) を行なう方法

[OS]ALL

[リリース] ALL

[キーワード] Grubbs-Smirnov, outlier, Bonferroni

[質問]

外れ値を検定するGrubbs検定（Grubbs-Smirnov検定）を実行する方法はありますか？

[回答]

Grubbs検定を実行できるプロシジャはありません。ただし、検定統計量はMEANSプロシジャおよびDATAステップで簡単に求めることができます。

また、「National Institute of Standards and Technology」サイトの「Engineering Statistics Handbook」のページなどに記載されている Bonferroni調整に基づく近似式を使うと、比較的容易にGrubbs-Smirnov検定を実現することができます。下記の URLに、Grubbs-Smirnov検定の計算式（近似検定）が記載されております。

http://www.itl.nist.gov/div898/handbook/index.html

（「Grubbs Test」のページをご参照ください。リンクを許可していただいたNational Institute of Standards and Technology, Statistical Engineering Divisionに感謝いたします)。

Grubbs検定の棄却限界値や p 値をBonferroni調整によって計算するためのサンプルマクロ（%grubbs）を下記に記載します。


******************************************************************************;
*  サンプルマクロ　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　*;
*                                                                            *;
*      作成日：2000-06-07                                                    *;
*                                                                            *;
*                                                                            *;
******************************************************************************;
*                                                                            *;
*   このサンプルは、あくまで「サンプル」として提供されているものであり、正式 *;
*  なサポート対象のものではありません。                                      *;
*                                                                            *;
*  このサンプルマクロは、Grubbs検定( Grubbs-Sminorv検定) のp値と指定した     *;
* 有意水準に対しての棄却域を計算します。計算式は、下記のNational Institute   *;
* of Standards and Technologyなどに記載されているものを用いています。        *;
*                                                                            *;
*  http://www.itl.nist.gov/div898/handbook/index.html                        *;
*                                                                            *;
*                                                                            *;
*   検定方法として                                                           *;
*    両側検定 ：最大値 もしくは最小値が外れ値であるかどうかの検定            *;
*    上側検定 ：最大値が外れ値であるかどうかの検定                           *;
*    下側検定 ：最小値が外れ値であるかどうかの検定                           *;
*   を選択することができます。                                               *;
*                                                                            *;
******************************************************************************;
*   ＊パラメータの指定                                                       *;
*    %macro grubbs(data= データセット名を指定します。,                       *;
*                   var= 検定をおこなう変数を指定します。,                   *;
*                   alpha=有意水準を指定します。,                            *;
*                   test= 両側検定→two                                      *;
*                         上側検定→max                                      *;
*                         下側検定→min と指定します。);                     *;
*                                                                            *;
*  ＊データセット名が DATA1 変数がX 有意水準を0.05 で両側検定を             *;
*    行う場合は、                                                            *;
*                                                                            *;
*    %macro grubbs(data=data1,var=x,alpha=0.05,test=two);                    *;
*                                                                            *;
*    というように指定します。                                                *;
*                                                                            *;
*  ＊__OUT__という名前の結果を含むデータセットを、WORKライブラリに出力します   *;
******************************************************************************;

***************** Begin of Sample Macro PGM **********************************;
%macro grubbs(data=,var=,alpha=,test=);
  ******  Submit PROC MEANS *************************;
  proc means data=&data noprint;
    var &var;
    output out=__out__ mean=mean std=std max=max min=min n=n;
  run;
  ****** Calculate p-value and critical value ******;
  data __out__;
    set __out__;
    test="&test"; alpha=&alpha;
   ******* Two-tail **********************************;
    %if &test=two %then %do;
      g=max(max-mean,mean-min)/std;
      t=tinv(1-alpha/(2*n),n-2);
      c=((n-1)/sqrt(n))*sqrt(t**2/(n-2+t**2));
      gg=(g**2*n-(n-1)**2);
           if gg =.   then goto END;
      else if gg < 0  then st=sqrt(-1*(g**2*n*(n-2)/gg));
      else if gg  = 0 then st=sqrt(g**2*n*(n-2)/gg);
      if st=. then goto END;
      p=n*(2*(1-probt(st,n-2)));
    %end;
   ******* For Max ************************************;
    %if &test=max %then %do;
      g=(max-mean)/std;
      t=tinv(1-alpha/n,n-2);
      c=((n-1)/sqrt(n))*sqrt(t**2/(n-2+t**2));
      gg=(g**2*n-(n-1)**2);
            if gg =.   then goto END;
      else  if gg <  0 then st=sqrt(-(g**2*n*(n-2)/gg));
      else  if gg  = 0 then st=sqrt(g**2*n*(n-2)/gg);
      if st=. then goto END;
      p=n*(1-probt(st,n-2));
    %end;
  ******* For Min ***********************************;
    %if &test=min %then %do;
      g=(mean-min)/std;
      t=tinv(alpha/n,n-2);
      c=((n-1)/sqrt(n))*sqrt(t**2/(n-2+t**2));
      gg=(g**2*n-(n-1)**2);
           if gg =.   then goto END;
      else if gg < 0 then st=sqrt(-(g**2*n*(n-2)/gg));
      else if gg  = 0 then st=sqrt(g**2*n*(n-2)/gg);
      if st=. then goto END;
      p=n*(1-probt(st,n-2));
    %end;
    if p > 1 then p=1;
    END: if p=. then put 'ERROR: The value become missing because of some reasons.';
  run;

  title 'Grubbs test';
  proc print data=__out__;
    var test alpha max mean min std g c p;
    format p pvalue.;
  run;
  title;
%mend;
***************** End of Sample Macro PGM *********************************;

■ 実行例


 **サンプルデータの作成***;
  data data1;
   input x@@;
  cards;
   10 11 12
   14 15 17
   19 20 70
  ;
  run;

 ****マクロの指定 *******************;
 * 有意水準は、0.05で両側検定を指定;

  %grubbs(data=data1,var=x, alpha=0.05,test=two)

■ 実行結果の一部


                                  Grubbs test

  OBS    test    alpha    max      mean     min     std         g          c         p

   1     two      0.05     70    20.8889     10    18.7380    2.62094    2.21500    <.0001

出力される変数は以下のようになります。

TEST	両側か片側か
ALPHA	有意水準
MAX	最大値
MIN	最小値
MEAN	平均
STD	標準偏差
G	検定統計量
C	棄却限界
P	p値

なお、p値だけを求めたいのであれば、回帰分析を行なうREGプロシジャのRSTUDENT（1オブザベーションを除いて計算するスチューデント化残差）を用いたプログラミングの方が簡単です。
下記は、そのプログラム例です。下記のプログラムも、上プログラム同様、Bonferroni調整に基づいています。


%let data=data1; /*** データセット名 ***/
%let var=x;      /*** 変数名 ***********/

proc reg data=&data noprint;
  model &var=;
  output out=__out2__ rstudent=r;
run;

proc means data=__out2__ noprint;
 var r;
  output out=__out3__ max=max min=min n=n;
run;

data __out3__; 
  set __out3__;
  amin=abs(min);
  max_p=min((1-probt(max,n-2))*n,1);             /*** One-tail (for max) *****/
  min_p=min((1-probt(amin,n-2))*n,1);            /*** One-tail (for min) *****/
  two_p=min((1-probt(max(max,amin),n-2))*2*n,1); /*** two-tail ****/
run;

proc print data=__out3__;
run;

■ 実行結果の一部


     OBS       max_p        min_p        two_p

      1     .000009619        1      .000019237