본문 바로가기
통계프로그램 비교 시리즈/데이터 전처리 비교

[데이터 관리] 8. By 또는 Split 파일 프로세싱

by 기서무나구물 2022. 1. 11.

포스팅 목차

    [데이터 관리] 8. By 또는 Split 파일 프로세싱

     


    1. Proc SQL

    • By 또는 Split 파일 프로세싱을 위한 PROC SQL 프로그램
    • Group by 옵션을 이용하여, 성별을 기준으로 지정된 각 함수 처리
    SAS Programming
    proc sql;
      create table withmooc as
        select gender,
               sum(q1)  as q1_sum,
               mean(q2) as q2_mean,
               min(q3)  as q3_min,
               max(q4)  as q4_max
        from   BACK.mydata a
        group by gender;
    
      select * from withmooc;
    quit;

     

    Results
    gender       q1_sum   q2_mean    q3_min    q4_max
    ------------------------------------------------
    f                8      1.25         4         3
    m               18      4.25         2         5

     

     

    SAS Programming
    proc sql;
      create table withmooc as
        select workshop,
               gender,
               sum(q1)  as q1_sum,
               mean(q2) as q2_mean,
               min(q3)  as q3_min,
               max(q4)  as q4_max
        from   BACK.mydata a
        group by workshop,
                 gender;
    
      select * from withmooc;
    quit;

     

    Results
    workshop  gender       q1_sum   q2_mean    q3_min    q4_max
    ----------------------------------------------------------
           1  f                3       1.5         4         3
           1  m                9         4         2         4
           2  f                5         1         4         3
           2  m                9       4.5         5         5

     


    2. SAS Programming

    • By 또는 Split 파일 프로세싱을 위한 SAS 프로그램
    • 성별을 기준으로 하여 데이터 분석;
    SAS Programming
    * 성별을 기준으로 하여 데이터 소트;
    
    PROC SORT DATA=BACK.withmooc;
         BY gender;
    run;
    
    * 성별을 기준으로 하여 데이터 분석; 
    
    PROC MEANS DATA=BACK.withmooc;
         BY gender;
    run;

     

    Results
    ---------------------------------- gender=f ----------------------------------
    
                                    MEANS 프로시저
    
    
    변수        N          평균값        표준편차          최소값          최대값
    -----------------------------------------------------------------------------
    id          4       2.5000000       1.2909944       1.0000000       4.0000000
    workshop    4       1.5000000       0.5773503       1.0000000       2.0000000
    q1          4       2.0000000       0.8164966       1.0000000       3.0000000
    q2          4       1.2500000       0.5000000       1.0000000       2.0000000
    q3          3       4.3333333       0.5773503       4.0000000       5.0000000
    q4          4       2.0000000       1.1547005       1.0000000       3.0000000
    -----------------------------------------------------------------------------
    
    ---------------------------------- gender=m ----------------------------------
    변수        N          평균값        표준편차          최소값          최대값
    -----------------------------------------------------------------------------
    id          4       6.5000000       1.2909944       5.0000000       8.0000000
    workshop    4       1.5000000       0.5773503       1.0000000       2.0000000
    q1          4       4.5000000       0.5773503       4.0000000       5.0000000
    q2          4       4.2500000       0.9574271       3.0000000       5.0000000
    q3          4       4.0000000       1.4142136       2.0000000       5.0000000
    q4          4       4.5000000       0.5773503       4.0000000       5.0000000
    -----------------------------------------------------------------------------

     


    • Class 절을 사용하여 변수 소트 없이 데이터 분석 가능.;
    SAS Programming
    PROC MEANS DATA=BACK.withmooc;
         CLASS gender;
    run;

     

    Results
                관측치
    gender          수    변수        N          평균값        표준편차          최소값          최대값
    ---------------------------------------------------------------------------------------------------
    f                4    id          4       2.5000000       1.2909944       1.0000000       4.0000000
                          workshop    4       1.5000000       0.5773503       1.0000000       2.0000000
                          q1          4       2.0000000       0.8164966       1.0000000       3.0000000
                          q2          4       1.2500000       0.5000000       1.0000000       2.0000000
                          q3          3       4.3333333       0.5773503       4.0000000       5.0000000
                          q4          4       2.0000000       1.1547005       1.0000000       3.0000000
    
    m                4    id          4       6.5000000       1.2909944       5.0000000       8.0000000
                          workshop    4       1.5000000       0.5773503       1.0000000       2.0000000
                          q1          4       4.5000000       0.5773503       4.0000000       5.0000000
                          q2          4       4.2500000       0.9574271       3.0000000       5.0000000
                          q3          4       4.0000000       1.4142136       2.0000000       5.0000000
                          q4          4       4.5000000       0.5773503       4.0000000       5.0000000
    ---------------------------------------------------------------------------------------------------

     


    3. SPSS

    • By 또는 Split 파일 프로세싱을 위한 SPSS 프로그램.
    SPSS Programming
    GET FILE="C:\mydata.sav".
    SORT CASES BY gender .
    SPLIT FILE
      SEPARATE BY gender .
    
    DESCRIPTIVES
      VARIABLES=q1 q2 q3 q4
      /STATISTICS=MEAN STDDEV MIN MAX .

     


    4. R Programming (R-PROJECT)

    • By 또는 Split 파일 프로세싱을 위한 프로그램.
    R Programming
    from rpy2.robjects import r
    %load_ext rpy2.ipython

     

    Results
    The rpy2.ipython extension is already loaded. To reload it, use:
      %reload_ext rpy2.ipython

     


     

    R Programming
    %%R
    
    library(tidyverse)
    library(psych)
    load(file="C:/work/data/mydata.Rdata")
    
    withmooc = mydata
    
    attach(withmooc) # mydata를 기본 데이터 세트로 지정.
    
    withmooc

     

    Results
    R[write to console]: The following objects are masked from withmooc (pos = 3):
    
        gender, q1, q2, q3, q4, workshop
    
    
    
    
      workshop gender q1 q2 q3 q4
    1        1      f  1  1  5  1
    2        2      f  2  1  4  1
    3        1      f  2  2  4  3
    4        2      f  3  1 NA  3
    5        1      m  4  5  2  4
    6        2      m  5  4  5  5
    7        1      m  5  3  4  4
    8        2      m  4  5  5  5

     


    • 관측치와 모든 변수의 요약 통계 구하기
    R Programming
    %%R
    
    summary(withmooc)

     

    Results
        workshop      gender                q1             q2             q3              q4  
     Min.   :1.0   Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
     1st Qu.:1.0   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
     Median :1.5   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
     Mean   :1.5                      Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
     3rd Qu.:2.0                      3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
     Max.   :2.0                      Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                    NA's   :1      

     


    psych::describe 함수

     

    R Programming
    %%R
    
    psych::describe(withmooc)

     

    Results
             vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
    workshop    1 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
    gender*     2 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
    q1          3 8 3.25 1.49    3.5    3.25 2.22   1   5     4 -0.14    -1.73 0.53
    q2          4 8 2.75 1.75    2.5    2.75 2.22   1   5     4  0.19    -1.91 0.62
    q3          5 7 4.14 1.07    4.0    4.14 1.48   2   5     3 -0.93    -0.52 0.40
    q4          6 8 3.25 1.58    3.5    3.25 1.48   1   5     4 -0.36    -1.59 0.56

     


    • 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
    R Programming
    %%R
    
    by(withmooc, gender, summary)

     

    Results
    gender: f
        workshop      gender                q1             q2             q3              q4   
     Min.   :1.0   Length:4           Min.   :1.00   Min.   :1.00   Min.   :4.000   Min.   :1  
     1st Qu.:1.0   Class :character   1st Qu.:1.75   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:1  
     Median :1.5   Mode  :character   Median :2.00   Median :1.00   Median :4.000   Median :2  
     Mean   :1.5                      Mean   :2.00   Mean   :1.25   Mean   :4.333   Mean   :2  
     3rd Qu.:2.0                      3rd Qu.:2.25   3rd Qu.:1.25   3rd Qu.:4.500   3rd Qu.:3  
     Max.   :2.0                      Max.   :3.00   Max.   :2.00   Max.   :5.000   Max.   :3  
                                                                    NA's   :1      
    
    ------------------------------------------------------------ 
    gender: m
        workshop      gender                q1            q2             q3            q4     
     Min.   :1.0   Length:4           Min.   :4.0   Min.   :3.00   Min.   :2.0   Min.   :4.0  
     1st Qu.:1.0   Class :character   1st Qu.:4.0   1st Qu.:3.75   1st Qu.:3.5   1st Qu.:4.0  
     Median :1.5   Mode  :character   Median :4.5   Median :4.50   Median :4.5   Median :4.5  
     Mean   :1.5                      Mean   :4.5   Mean   :4.25   Mean   :4.0   Mean   :4.5  
     3rd Qu.:2.0                      3rd Qu.:5.0   3rd Qu.:5.00   3rd Qu.:5.0   3rd Qu.:5.0  
     Max.   :2.0                      Max.   :5.0   Max.   :5.00   Max.   :5.0   Max.   :5.0  

     


    • 열 이름에 의해 선택된 변수에 대하여 성별에 대하여 각 값에 대한 요약 통계.
    R Programming
    %%R
    
    by( withmooc[c("q1","q2","q3","q4")] , gender, summary)

     

    Results
    gender: f
           q1             q2             q3              q4   
     Min.   :1.00   Min.   :1.00   Min.   :4.000   Min.   :1  
     1st Qu.:1.75   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:1  
     Median :2.00   Median :1.00   Median :4.000   Median :2  
     Mean   :2.00   Mean   :1.25   Mean   :4.333   Mean   :2  
     3rd Qu.:2.25   3rd Qu.:1.25   3rd Qu.:4.500   3rd Qu.:3  
     Max.   :3.00   Max.   :2.00   Max.   :5.000   Max.   :3  
                                   NA's   :1                  
    ------------------------------------------------------------ 
    gender: m
           q1            q2             q3            q4     
     Min.   :4.0   Min.   :3.00   Min.   :2.0   Min.   :4.0  
     1st Qu.:4.0   1st Qu.:3.75   1st Qu.:3.5   1st Qu.:4.0  
     Median :4.5   Median :4.50   Median :4.5   Median :4.5  
     Mean   :4.5   Mean   :4.25   Mean   :4.0   Mean   :4.5  
     3rd Qu.:5.0   3rd Qu.:5.00   3rd Qu.:5.0   3rd Qu.:5.0  
     Max.   :5.0   Max.   :5.00   Max.   :5.0   Max.   :5.0  

     


    • 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
    • 데이터는 workshop과 gender로 정렬될 필요가 없다.
    • workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계.
    R Programming
    %%R
    
    by(withmooc[c("q1","q2","q3","q4")],
      data.frame(workshop,gender), summary)

     

    Results
    workshop: 1
    gender: f
           q1             q2             q3             q4     
     Min.   :1.00   Min.   :1.00   Min.   :4.00   Min.   :1.0  
     1st Qu.:1.25   1st Qu.:1.25   1st Qu.:4.25   1st Qu.:1.5  
     Median :1.50   Median :1.50   Median :4.50   Median :2.0  
     Mean   :1.50   Mean   :1.50   Mean   :4.50   Mean   :2.0  
     3rd Qu.:1.75   3rd Qu.:1.75   3rd Qu.:4.75   3rd Qu.:2.5  
     Max.   :2.00   Max.   :2.00   Max.   :5.00   Max.   :3.0  
    ------------------------------------------------------------ 
    workshop: 2
    gender: f
           q1             q2          q3          q4     
     Min.   :2.00   Min.   :1   Min.   :4   Min.   :1.0  
     1st Qu.:2.25   1st Qu.:1   1st Qu.:4   1st Qu.:1.5  
     Median :2.50   Median :1   Median :4   Median :2.0  
     Mean   :2.50   Mean   :1   Mean   :4   Mean   :2.0  
     3rd Qu.:2.75   3rd Qu.:1   3rd Qu.:4   3rd Qu.:2.5  
     Max.   :3.00   Max.   :1   Max.   :4   Max.   :3.0  
                                NA's   :1                
    ------------------------------------------------------------ 
    workshop: 1
    gender: m
           q1             q2            q3            q4   
     Min.   :4.00   Min.   :3.0   Min.   :2.0   Min.   :4  
     1st Qu.:4.25   1st Qu.:3.5   1st Qu.:2.5   1st Qu.:4  
     Median :4.50   Median :4.0   Median :3.0   Median :4  
     Mean   :4.50   Mean   :4.0   Mean   :3.0   Mean   :4  
     3rd Qu.:4.75   3rd Qu.:4.5   3rd Qu.:3.5   3rd Qu.:4  
     Max.   :5.00   Max.   :5.0   Max.   :4.0   Max.   :4  
    ------------------------------------------------------------ 
    workshop: 2
    gender: m
           q1             q2             q3          q4   
     Min.   :4.00   Min.   :4.00   Min.   :5   Min.   :5  
     1st Qu.:4.25   1st Qu.:4.25   1st Qu.:5   1st Qu.:5  
     Median :4.50   Median :4.50   Median :5   Median :5  
     Mean   :4.50   Mean   :4.50   Mean   :5   Mean   :5  
     3rd Qu.:4.75   3rd Qu.:4.75   3rd Qu.:5   3rd Qu.:5  
     Max.   :5.00   Max.   :5.00   Max.   :5   Max.   :5  

     


    • 위 예제에서 by 문 안의 옵션을 사전에 정의하여 처리.
    R Programming
    %%R
    
    myVars <- c("q1","q2","q3","q4")
    
    myBys <- data.frame(workshop,gender)
    
    by( withmooc[myVars], myBys, summary)

     

    Results
    workshop: 1
    gender: f
           q1             q2             q3             q4     
     Min.   :1.00   Min.   :1.00   Min.   :4.00   Min.   :1.0  
     1st Qu.:1.25   1st Qu.:1.25   1st Qu.:4.25   1st Qu.:1.5  
     Median :1.50   Median :1.50   Median :4.50   Median :2.0  
     Mean   :1.50   Mean   :1.50   Mean   :4.50   Mean   :2.0  
     3rd Qu.:1.75   3rd Qu.:1.75   3rd Qu.:4.75   3rd Qu.:2.5  
     Max.   :2.00   Max.   :2.00   Max.   :5.00   Max.   :3.0  
    ------------------------------------------------------------ 
    workshop: 2
    gender: f
           q1             q2          q3          q4     
     Min.   :2.00   Min.   :1   Min.   :4   Min.   :1.0  
     1st Qu.:2.25   1st Qu.:1   1st Qu.:4   1st Qu.:1.5  
     Median :2.50   Median :1   Median :4   Median :2.0  
     Mean   :2.50   Mean   :1   Mean   :4   Mean   :2.0  
     3rd Qu.:2.75   3rd Qu.:1   3rd Qu.:4   3rd Qu.:2.5  
     Max.   :3.00   Max.   :1   Max.   :4   Max.   :3.0  
                                NA's   :1                
    ------------------------------------------------------------ 
    workshop: 1
    gender: m
           q1             q2            q3            q4   
     Min.   :4.00   Min.   :3.0   Min.   :2.0   Min.   :4  
     1st Qu.:4.25   1st Qu.:3.5   1st Qu.:2.5   1st Qu.:4  
     Median :4.50   Median :4.0   Median :3.0   Median :4  
     Mean   :4.50   Mean   :4.0   Mean   :3.0   Mean   :4  
     3rd Qu.:4.75   3rd Qu.:4.5   3rd Qu.:3.5   3rd Qu.:4  
     Max.   :5.00   Max.   :5.0   Max.   :4.0   Max.   :4  
    ------------------------------------------------------------ 
    workshop: 2
    gender: m
           q1             q2             q3          q4   
     Min.   :4.00   Min.   :4.00   Min.   :5   Min.   :5  
     1st Qu.:4.25   1st Qu.:4.25   1st Qu.:5   1st Qu.:5  
     Median :4.50   Median :4.50   Median :5   Median :5  
     Mean   :4.50   Mean   :4.50   Mean   :5   Mean   :5  
     3rd Qu.:4.75   3rd Qu.:4.75   3rd Qu.:5   3rd Qu.:5  
     Max.   :5.00   Max.   :5.00   Max.   :5   Max.   :5  

     


     

    R Programming
    %%R
    
    aggregate(withmooc[c("q1","q2","q3","q4")],list(gender = withmooc$gender),mean)

     

    Results
      gender  q1   q2 q3  q4
    1      f 2.0 1.25 NA 2.0
    2      m 4.5 4.25  4 4.5

     


     

    R Programming
    %%R
    
    psych::describeBy(withmooc, withmooc$gender)

     

    Results
     Descriptive statistics by group 
    group: f
             vars n mean   sd median trimmed  mad min max range skew kurtosis   se
    workshop    1 4 1.50 0.58    1.5    1.50 0.74   1   2     1 0.00    -2.44 0.29
    gender*     2 4 1.00 0.00    1.0    1.00 0.00   1   1     0  NaN      NaN 0.00
    q1          3 4 2.00 0.82    2.0    2.00 0.74   1   3     2 0.00    -1.88 0.41
    q2          4 4 1.25 0.50    1.0    1.25 0.00   1   2     1 0.75    -1.69 0.25
    q3          5 3 4.33 0.58    4.0    4.33 0.00   4   5     1 0.38    -2.33 0.33
    q4          6 4 2.00 1.15    2.0    2.00 1.48   1   3     2 0.00    -2.44 0.58
    ------------------------------------------------------------ 
    group: m
             vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
    workshop    1 4 1.50 0.58    1.5    1.50 0.74   1   2     1  0.00    -2.44 0.29
    gender*     2 4 1.00 0.00    1.0    1.00 0.00   1   1     0   NaN      NaN 0.00
    q1          3 4 4.50 0.58    4.5    4.50 0.74   4   5     1  0.00    -2.44 0.29
    q2          4 4 4.25 0.96    4.5    4.25 0.74   3   5     2 -0.32    -2.08 0.48
    q3          5 4 4.00 1.41    4.5    4.00 0.74   2   5     3 -0.53    -1.88 0.71
    q4          6 4 4.50 0.58    4.5    4.50 0.74   4   5     1  0.00    -2.44 0.29

     


    5. R - Tidyverse

    • 관측치와 모든 변수의 요약 통계 구하기.
    R Programming
    %%R
    
    withmooc %>%
        summary()

     

    Results
        workshop      gender                q1             q2             q3              q4      
     Min.   :1.0   Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
     1st Qu.:1.0   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
     Median :1.5   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
     Mean   :1.5                      Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
     3rd Qu.:2.0                      3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
     Max.   :2.0                      Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                    NA's   :1      

     

     

    R Programming
    %%R
    
    withmooc %>% 
      purrr::keep(.p = is.numeric) %>% # 숫자형 데이터만 남기기
      purrr::map_df(.x = .,
                    .f = psych::describe)

     

    Results
           vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
    X1...1    1 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
    X1...2    1 8 3.25 1.49    3.5    3.25 2.22   1   5     4 -0.14    -1.73 0.53
    X1...3    1 8 2.75 1.75    2.5    2.75 2.22   1   5     4  0.19    -1.91 0.62
    X1...4    1 7 4.14 1.07    4.0    4.14 1.48   2   5     3 -0.93    -0.52 0.40
    X1...5    1 8 3.25 1.58    3.5    3.25 1.48   1   5     4 -0.36    -1.59 0.56

     


    • 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
    R Programming
    %%R
    
    withmooc %>%
      group_by(gender) %>%
      summarise(mean = mean(q1), n = n())

     

    Results
    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 2 x 3
      gender  mean     n
      <chr>  <dbl> <int>
    1 f        2       4
    2 m        4.5     4

     

     

    R Programming
    %%R
    
    withmooc %>%
      group_by(gender) %>%
      summarise_each(funs(min, median, mean, max, n()), q1,q2,q3,q4)

     

    Results
    # A tibble: 2 x 21
      gender q1_min q2_min q3_min q4_min q1_median q2_median q3_median q4_median
      <chr>   <int>  <int>  <int>  <int>     <dbl>     <dbl>     <dbl>     <dbl>
    1 f           1      1     NA      1       2         1        NA         2  
    2 m           4      3      2      4       4.5       4.5       4.5       4.5
    # ... with 12 more variables: q1_mean <dbl>, q2_mean <dbl>, q3_mean <dbl>,
    #   q4_mean <dbl>, q1_max <int>, q2_max <int>, q3_max <int>, q4_max <int>,
    #   q1_n <int>, q2_n <int>, q3_n <int>, q4_n <int>

     

     

    R Programming
    %%R
    
    withmooc %>%
      group_by(gender) %>%
      summarise_each(funs(mean, n()), q1,q2,q3)

     

    Results
    # A tibble: 2 x 7
      gender q1_mean q2_mean q3_mean  q1_n  q2_n  q3_n
      <chr>    <dbl>   <dbl>   <dbl> <int> <int> <int>
    1 f          2      1.25      NA     4     4     4
    2 m          4.5    4.25       4     4     4     4

     


     

    R Programming
    %%R
    
    withmooc %>%
      group_by(gender) %>%
      purrr::keep(.p = is.numeric) %>% # 숫자형 데이터만 남기기
      purrr::map_df(.x = .,
                    .f = psych::describe)

     

    Results
           vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
    X1...1    1 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
    X1...2    1 8 3.25 1.49    3.5    3.25 2.22   1   5     4 -0.14    -1.73 0.53
    X1...3    1 8 2.75 1.75    2.5    2.75 2.22   1   5     4  0.19    -1.91 0.62
    X1...4    1 7 4.14 1.07    4.0    4.14 1.48   2   5     3 -0.93    -0.52 0.40
    X1...5    1 8 3.25 1.58    3.5    3.25 1.48   1   5     4 -0.36    -1.59 0.56

     


     

    R Programming
    %%R
    
    withmooc %>%
      summarise_if(is.numeric, mean, na.rm = TRUE)

     

    Results
      workshop   q1   q2       q3   q4
    1      1.5 3.25 2.75 4.142857 3.25

     


     

    R Programming
    %%R
    
    withmooc %>%
      summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))

     

    Results
      workshop   q1   q2       q3   q4
    1      1.5 3.25 2.75 4.142857 3.25

     

     

    R Programming
    %%R
    
    withmooc %>% 
      group_by(gender) %>% 
      summarise(across(q1:q4, mean))

     

    Results
    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 2 x 5
      gender    q1    q2    q3    q4
      <chr>  <dbl> <dbl> <dbl> <dbl>
    1 f        2    1.25    NA   2  
    2 m        4.5  4.25     4   4.5

     

     

    R Programming
    %%R
    
    withmooc %>% 
      group_by(gender) %>% 
      summarise(across(where(is.numeric), mean))

     

    Results
    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 2 x 6
      gender workshop    q1    q2    q3    q4
      <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>
    1 f           1.5   2    1.25    NA   2  
    2 m           1.5   4.5  4.25     4   4.5

     


    • 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
    • 데이터는 workshop과 gender로 정렬될 필요가 없다.
    • workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계
    R Programming
    %%R
    
    withmooc %>%
      group_by(workshop, gender) %>%
      summarise_each(funs(min, median, mean, sd, max, n()), q1,q2,q3,q4)

     

    Results
    # A tibble: 4 x 26
    # Groups:   workshop [2]
      workshop gender q1_min q2_min q3_min q4_min q1_median q2_median q3_median
         <int> <chr>   <int>  <int>  <int>  <int>     <dbl>     <dbl>     <dbl>
    1        1 f           1      1      4      1       1.5       1.5       4.5
    2        1 m           4      3      2      4       4.5       4         3  
    3        2 f           2      1     NA      1       2.5       1        NA  
    4        2 m           4      4      5      5       4.5       4.5       5  
    # ... with 17 more variables: q4_median <dbl>, q1_mean <dbl>, q2_mean <dbl>,
    #   q3_mean <dbl>, q4_mean <dbl>, q1_sd <dbl>, q2_sd <dbl>, q3_sd <dbl>,
    #   q4_sd <dbl>, q1_max <int>, q2_max <int>, q3_max <int>, q4_max <int>,
    #   q1_n <int>, q2_n <int>, q3_n <int>, q4_n <int>

     


    6. Python - Pandas

     

    Python Programming
    import pandas as pd
    import numpy as np
    import sweetviz as sv
    
    mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
                         dtype={'id':object,'workshop':object,
                                'q1':int, 'q2':int, 'q3':float, 'q4':int},
                         na_values=['NaN'],skipinitialspace =True)
    
    withmooc= mydata.copy()
    
    withmooc

     

    Results
    	id	workshop	gender	q1	q2	q3	q4
    0	1	1		f	1	1	5.0	1
    1	2	2		f	2	1	4.0	1
    2	3	1		f	2	2	4.0	3
    3	4	2		f	3	1	NaN	3
    4	5	1		m	4	5	2.0	4
    5	6	2		m	5	4	5.0	5
    6	7	1		m	5	3	4.0	4
    7	8	2		m	4	5	5.0	5
     

    By 또는 Split 파일 프로세싱을 위한 파이썬 Pandas 프로그램.

    • describe() 함수
    Python Programming
    # 관측치와 모든 변수의 요약 통계 구하기.
    
    withmooc.describe()

     

    Results
    	q1		q2		q3		q4
    count	8.000000	8.000000	7.000000	8.000000
    mean	3.250000	2.750000	4.142857	3.250000
    std	1.488048	1.752549	1.069045	1.581139
    min	1.000000	1.000000	2.000000	1.000000
    25%	2.000000	1.000000	4.000000	2.500000
    50%	3.500000	2.500000	4.000000	3.500000
    75%	4.250000	4.250000	5.000000	4.250000
    max	5.000000	5.000000	5.000000	5.000000
     

    • summary_cont() 함수
    Python Programming
    import researchpy as rp
    
    rp.summary_cont(withmooc[['q1','q2','q3','q4']])

     

    Results
    	Variable	N	Mean	SD	SE	95% Conf.	Interval
    0	q1		8.0	3.2500	1.4880	0.5261	2.0060		4.4940
    1	q2		8.0	2.7500	1.7525	0.6196	1.2848		4.2152
    2	q3		7.0	4.1429	1.0690	0.4041	3.1542		5.1316
    3	q4		8.0	3.2500	1.5811	0.5590	1.9281		4.5719
     
     

     
    • 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
    Python Programming
    withmooc.groupby(withmooc['gender']).describe()

     

    Results
    	q1									q2		...	q3		q4
    	count	mean	std		min	25%	50%	75%	max	count	mean	...	75%	max	count	mean	std		min	25%	50%	75%	max
    gender																					
    f	4.0	2.0	0.816497	1.0	1.75	2.0	2.25	3.0	4.0	1.25	...	4.5	5.0	4.0	2.0	1.154701	1.0	1.0	2.0	3.0	3.0
    m	4.0	4.5	0.577350	4.0	4.00	4.5	5.00	5.0	4.0	4.25	...	5.0	5.0	4.0	4.5	0.577350	4.0	4.0	4.5	5.0	5.0

    2 rows × 32 columns

     


    • 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
    Python Programming
    withmooc.groupby(withmooc['gender']).agg([np.mean, np.std])

     

    Results
    	q1			q2			q3				q4
    	mean	std		mean	std		mean		std		mean	std
    gender								
    f	2.0	0.816497	1.25	0.500000	4.333333	0.577350	2.0	1.154701
    m	4.5	0.577350	4.25	0.957427	4.000000	1.414214	4.5	0.577350

     

    Python Programming
    withmooc.groupby('gender').agg({'q1':pd.Series.nunique, 'q2':['mean', 'max','std']})

     

    Results
    	q1	q2
    	nunique	mean	max	std
    gender				
    f	3	1.25	2	0.500000
    m	2	4.25	5	0.957427

     


    • 열 이름에 의해 선택된 변수에 대하여 성별에 대하여 각 값에 대한 요약 통계.
    Python Programming
    withmooc[['q1','q2','q3','q4']].groupby(withmooc['gender']).agg([np.mean, np.std])

     

    Results
    	q1			q2			q3				q4
    	mean	std		mean	std		mean		std		mean	std
    gender								
    f	2.0	0.816497	1.25	0.500000	4.333333	0.577350	2.0	1.154701
    m	4.5	0.577350	4.25	0.957427	4.000000	1.414214	4.5	0.577350
     

     

    Python Programming
    rp.summary_cont(withmooc.groupby(['gender'])[['q1','q2','q3','q4']])

     

    Results
    	q1								q2				...	q3						q4
    	N	Mean	SD	SE	95% Conf.	Interval	N	Mean	SD	SE	...	SD	SE	95% Conf.	Interval	N	Mean	SD	SE	95% Conf.	Interval
    gender																					
    f	4	2.0	0.8165	0.4082	0.7008		3.2992		4	1.25	0.5000	0.2500	...	0.5774	0.3333	2.8991		5.7676		4	2.0	1.1547	0.5774	0.1626		3.8374
    m	4	4.5	0.5774	0.2887	3.5813		5.4187		4	4.25	0.9574	0.4787	...	1.4142	0.7071	1.7497		6.2503		4	4.5	0.5774	0.2887	3.5813		5.4187

    2 rows × 24 columns


    • 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
    • 데이터는 workshop과 gender로 정렬될 필요가 없다.
    • workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계.
    Python Programming
    withmooc.groupby(['workshop','gender'])['q1','q2','q3','q4'].agg([np.mean, np.std, np.median, np.max ])

     

    Results
    <ipython-input-32-667e0b1b1789>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
      withmooc.groupby(['workshop','gender'])['q1','q2','q3','q4'].agg([np.mean, np.std, np.median, np.max ])

     

    Results
    			q1					q2					q3					q4
    			mean	std		median	amax	mean	std		median	amax	mean	std		median	amax	mean	std		median	amax
    workshop	gender																
    1		f	1.5	0.707107	1.5	2	1.5	0.707107	1.5	2	4.5	0.707107	4.5	5.0	2	1.414214	2	3
    		m	4.5	0.707107	4.5	5	4.0	1.414214	4.0	5	3.0	1.414214	3.0	4.0	4	0.000000	4	4
    2		f	2.5	0.707107	2.5	3	1.0	0.000000	1.0	1	4.0	NaN		4.0	4.0	2	1.414214	2	3
    		m	4.5	0.707107	4.5	5	4.5	0.707107	4.5	5	5.0	0.000000	5.0	5.0	5	0.000000	5	5

     


    7. Python - dfply

     

    Python Programming
    import pandas as pd
    from dfply import *
    
    mydata   = pd.read_csv("c:/work/data/mydata.csv",sep=",",
                            dtype={'id':object,'workshop':object,
                                   'q1':int, 'q2':int, 'q3':float, 'q4':int},
                            na_values=['NaN'],skipinitialspace =True)
    
    withmooc= mydata.copy()
    
    # 모든 변수 선택하기.
    withmooc

     

    Results
    	id	workshop	gender	q1	q2	q3	q4
    0	1	1		f	1	1	5.0	1
    1	2	2		f	2	1	4.0	1
    2	3	1		f	2	2	4.0	3
    3	4	2		f	3	1	NaN	3
    4	5	1		m	4	5	2.0	4
    5	6	2		m	5	4	5.0	5
    6	7	1		m	5	3	4.0	4
    7	8	2		m	4	5	5.0	5
     

     
    • 관측치와 모든 변수의 요약 통계 구하기.
    Python Programming
    withmooc >> group_by(X.gender) >> summarize(q1_n=n(X.q1), q2_n=n(X.q2))
    
    withmooc >> group_by(X.gender) >> summarize_each([np.mean, np.std, np.median, np.max], X.q1, X.q2)

     

    Results
    	gender	q1_mean	q1_std		q1_median	q1_amax	q2_mean	q2_std		q2_median	q2_amax
    0	f	2.0	0.707107	2.0		3	1.25	0.433013	1.0		2
    1	m	4.5	0.500000	4.5		5	4.25	0.829156	4.5		5
     

     

    Python Programming
    (withmooc>> group_by(X.gender)).describe()

     

    Results
    	q1		q2		q3		q4
    count	8.000000	8.000000	7.000000	8.000000
    mean	3.250000	2.750000	4.142857	3.250000
    std	1.488048	1.752549	1.069045	1.581139
    min	1.000000	1.000000	2.000000	1.000000
    25%	2.000000	1.000000	4.000000	2.500000
    50%	3.500000	2.500000	4.000000	3.500000
    75%	4.250000	4.250000	5.000000	4.250000
    max	5.000000	5.000000	5.000000	5.000000
     
     
     
    • 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
    • 데이터는 workshop과 gender로 정렬될 필요가 없다.
    • workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계
    Python Programming
    withmooc >> \
      group_by(X.workshop, X.gender) >> \
      summarize_each([np.min, np.median, np.mean, np.std, np.max, np.count_nonzero], X.q1, X.q2, X.q3)

     

    Results
    	gender	workshop	q1_amin	q1_median	q1_mean	q1_std	q1_amax	q1_count_nonzero	q2_amin	q2_median	q2_mean	q2_std	q2_amax	q2_count_nonzero	q3_amin	q3_median	q3_mean	q3_std	q3_amax	q3_count_nonzero
    0	f	1		1	1.5		1.5	0.5	2	2			1	1.5		1.5	0.5	2	2			4.0	4.5		4.5	0.5	5.0	2
    1	m	1		4	4.5		4.5	0.5	5	2			3	4.0		4.0	1.0	5	2			2.0	3.0		3.0	1.0	4.0	2
    2	f	2		2	2.5		2.5	0.5	3	2			1	1.0		1.0	0.0	1	2			4.0	NaN		4.0	0.0	4.0	2
    3	m	2		4	4.5		4.5	0.5	5	2			4	4.5		4.5	0.5	5	2			5.0	5.0		5.0	0.0	5.0	2

     


     


     

    통계프로그램 비교 목록(Proc sql, SAS, SPSS, R 프로그래밍, R Tidyverse, Python Pandas, Python Dfply)
    [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크
    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE]
    SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크
    반응형

    댓글