본문 바로가기
통계프로그램 비교 시리즈/데이터 전처리 비교

[데이터 관리] 11. Aggregating Or Summarizing 데이터

by 기서무나구물 2022. 1. 15.

포스팅 목차

    [데이터 관리] 11. Aggregating Or Summarizing 데이터

     


    1. Proc SQL

     

    SAS Programming
    proc sql;
      create table withmooc as
        select GENDER,
               MEAN(q1) as q1_mean
        from   BACK.mydata
        group by GENDER;
    
      select * from withmooc;
    quit;

     

    Results
     gender      q1_mean
     ------------------
     f                2
     m              4.5

     


     

    SAS Programming
    proc sql;
      create table withmooc as
        select WORKSHOP,
               GENDER,
               MEAN(q1) as q1_mean
        from   BACK.mydata
        group by 1,2;
    
      select * from withmooc;
    quit;

     

    Results
      workshop  gender     q1_mean
     ----------------------------
            1  f              1.5
            1  m              4.5
            2  f              2.5
            2  m              4.5

     


    • Sub-Query를 이용하여 평균을 가로 결합
    SAS Programming
    proc sql;
      create table withmooc as
        select a.*,b.q1_mean
        from   BACK.mydata a , ( select workshop,gender,
                                        mean(q1) as q1_mean
                                 from   BACK.mydata
                                 group by workshop,2 ) b
        where  a.workshop = b.workshop
        and    a.gender   = b.gender;
    
      select * from withmooc;
    quit;

     

    Results
    id  workshop  gender          q1        q2        q3        q4   q1_mean
    ------------------------------------------------------------------------
     1         1  f                1         1         5         1       1.5
     2         2  f                2         1         4         1       2.5
     3         1  f                2         2         4         3       1.5
     4         2  f                3         1         .         3       2.5
     5         1  m                4         5         2         4       4.5
     6         2  m                5         4         5         5       4.5
     7         1  m                5         3         4         4       4.5
     8         2  m                4         5         5         5       4.5

     


    2. SAS Programming

    • 각 성별에 대한 q1의 평균을 구한다.;
    SAS Programming
    PROC SUMMARY DATA=BACK.mydata MEAN NWAY;
         CLASS GENDER;
         VAR q1;
         OUTPUT OUT=withmooc;
    RUN;
    
    PROC PRINT; RUN;

     

    Results
    OBS    gender    _TYPE_    _FREQ_    _STAT_       q1
      1      f          1         4       N        4.00000
      2      f          1         4       MIN      1.00000
      3      f          1         4       MAX      3.00000
      4      f          1         4       MEAN     2.00000
      5      f          1         4       STD      0.81650
      6      m          1         4       N        4.00000
      7      m          1         4       MIN      4.00000
      8      m          1         4       MAX      5.00000
      9      m          1         4       MEAN     4.50000
     10      m          1         4       STD      0.57735

     

     

    SAS Programming
    DATA withmooc;
     SET withmooc;
         WHERE _STAT_='MEAN';
         KEEP gender q1;
    RUN;
    
    PROC PRINT; RUN;

     

    Results
    OBS    gender     q1
     1       f       2.0
     2       m       4.5

     

    • workshop와 성별을 기준으로 q1의 평균을 구한다.
    SAS Programming
    PROC SUMMARY DATA=BACK.mydata MEAN NWAY;
         CLASS WORKSHOP GENDER;
         VAR Q1;
         OUTPUT OUT=withmooc;
    RUN;
    
    PROC PRINT; RUN;

     

    Results
    OBS    workshop    gender    _TYPE_    _FREQ_    _STAT_       q1
      1        1         f          3         2       N        2.00000
      2        1         f          3         2       MIN      1.00000
      3        1         f          3         2       MAX      2.00000
      4        1         f          3         2       MEAN     1.50000
      5        1         f          3         2       STD      0.70711
      6        1         m          3         2       N        2.00000
      7        1         m          3         2       MIN      4.00000
      8        1         m          3         2       MAX      5.00000
      9        1         m          3         2       MEAN     4.50000
     10        1         m          3         2       STD      0.70711
     11        2         f          3         2       N        2.00000
     12        2         f          3         2       MIN      2.00000
     13        2         f          3         2       MAX      3.00000
     14        2         f          3         2       MEAN     2.50000
     15        2         f          3         2       STD      0.70711
     16        2         m          3         2       N        2.00000
     17        2         m          3         2       MIN      4.00000
     18        2         m          3         2       MAX      5.00000
     19        2         m          3         2       MEAN     4.50000
     20        2         m          3         2       STD      0.70711

     

    • 평균을 선택한다.
    SAS Programming
    DATA withmooc;
     SET withmooc;
         WHERE _STAT_='MEAN';
         KEEP workshop gender q1;
         RENAME q1=meanQ1;
    RUN;
    
    PROC PRINT; RUN;

     

    • mydata에 집계된 데이터를 Merge 한다.
    SAS Programming
    PROC SORT DATA=BACK.mydata out=mydata;
         BY workshop gender;
    RUN;
    
    PROC SORT DATA=withmooc;
         BY workshop gender;
    RUN;
    
    DATA mydata2;
     MERGE mydata withmooc;
         BY workshop gender;
    run;
    
    PROC PRINT; RUN;

     


    3. SPSS

    • SPSS Program for Aggregating/Summarizing Data.
    • Get mean of q1 by gender.
    SPSS Programming
    GET FILE='C:\mydata.sav'.
    
    AGGREGATE
      /OUTFILE='C:\myAgg.sav'
      /BREAK=gender
      /q1_mean = MEAN(q1).
    
    GET FILE='C:\myAgg.sav'.
    LIST.
    EXECUTE.
    
    
    
    * Get mean of q1 by workshop and gender.
    
    GET FILE='C:\mydata.sav'.
    
    AGGREGATE
      /OUTFILE='C:\myAgg.sav'
      /BREAK=workshop gender
      /q1_mean = MEAN(q1).
    
    GET FILE='C:\myAgg.sav'.
    LIST.
    EXECUTE.
    
    
    
    * Merge aggregated data back into mydata.
    
    GET FILE='C:\mydata.sav'.
    SORT CASES BY  workshop (A) gender (A) .
    
    MATCH FILES /FILE=*
     /TABLE='C:\myAgg.sav'
     /BY workshop gender.
    
    SAVE OUTFILE='C:\mydata.sav'.
    EXECUTE.

     


    4. R Programming (R-PROJECT)

     

    R Programming
    from rpy2.robjects import r
    %load_ext rpy2.ipython

     

    Results
    The rpy2.ipython extension is already loaded. To reload it, use:
      %reload_ext rpy2.ipython

     

     

    R Programming
    %%R
    
    library(tidyverse)
    library(psych)
    mydata <- read_csv("C:/work/data/mydata.csv", 
      col_types = cols( id       = col_double(),
                        workshop = col_character(),
                        gender   = col_character(),
                        q1       = col_double(),
                        q2       = col_double(),
                        q3       = col_double(),
                        q4       = col_double()
      )
    )
    
    withmooc = mydata
    
    attach(withmooc) # mydata를 기본 데이터 세트로 지정.
    
    withmooc

     

    Results
    R[write to console]: -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
    
    ---------------------------------------
    ---------------------------------------
    
        s = ffi.string(c, maxlen).decode('utf-8')
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 1: invalid start byte
        R[write to console]: The following objects are masked from 'package:ggplot2':
    
        %+%, alpha

    Pop함수를 이용하여 q3와 q4변수를 삭제.

    # A tibble: 8 x 7
         id workshop gender    q1    q2    q3    q4
      <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>
    1     1 1        f          1     1     5     1
    2     2 2        f          2     1     4     1
    3     3 1        f          2     2     4     3
    4     4 2        f          3     1    NA     3
    5     5 1        m          4     5     2     4
    6     6 2        m          5     4     5     5
    7     7 1        m          5     3     4     4
    8     8 2        m          4     5     5     5

     

     

    • Aggregating / Summarizing 데이터를 위한 R-Project 프로그램.
    • [참 고] RESHAPE 라이브러리
    • 필요한 패키지 로드. 필요시 사전에 인스톨해야 한다.
    R Programming
    %%R
    
    library(Hmisc)
    
    library(reshape)

     

     

    • Aggregate함수는 R-Project에서 만들어진 함수이다.
    • Aggregate함수는 변수에 대하여 새 변수명을 생성한다.
    • Gender는 단일 오브젝트임에 불구하고 List함수 내에 들어가 있다.
    R Programming
    %%R
    
    myAgg<-aggregate(q1, by=list(gender), FUN=mean, na.rm=TRUE)
    
    print(myAgg)

     

    Results
      Group.1   x
    1       f 2.0
    2       m 4.5

     


    • workshop과 gender를 기준으로 집계.
    R Programming
    %%R
    
    myAgg<-aggregate(q1, by=list(workshop,gender), FUN=mean, na.rm=TRUE)
    
    myAgg

     

    Results
      Group.1 Group.2   x
    1       1       f 1.5
    2       2       f 2.5
    3       1       m 4.5
    4       2       m 4.5

     


    • Summarize함수는 Hmisc라이브러리에 존재.
    • Summarize함수는 원 변수명을 유지하고, 라벨 존재 시 라벨 역시 유지한다.
    • 성별에 대하여.
    R Programming
    %%R
    
    myAgg<-Hmisc::summarize(q1, by=gender, FUN=mean, na.rm=TRUE)
    
    myAgg

     

    Results
      gender  q1
    1      f 2.0
    2      m 4.5

     


    • Workshop과 gender에 대하여 Summary
    R Programming
    %%R
    
    myAgg<-summarize(q1, by=llist(workshop,gender), FUN=mean, na.rm=TRUE)
    
    myAgg

     

    Results
      workshop gender  q1
    1        1      f 1.5
    2        1      m 4.5
    3        2      f 2.5
    4        2      m 4.5

     

     

    • Mydata에 집계된 값을 Merge 한다.
    • 먼저, q1변수명을 mean.q1으로 변경.
    R Programming
    %%R
    
    myAgg<-rename(myAgg, c(q1="mean.q1"))
    
    myAgg

     

    Results
      workshop gender mean.q1
    1        1      f     1.5
    2        1      m     4.5
    3        2      f     2.5
    4        2      m     4.5

     


    • Mydata에 위의 값을 Merge 한다.
    R Programming
    %%R
    
    withmooc2<-merge(withmooc,myAgg,by=c("workshop","gender") )
    
    withmooc2

     

    Results
      workshop gender id q1 q2 q3 q4 mean.q1
    1        1      f  1  1  1  5  1     1.5
    2        1      f  3  2  2  4  3     1.5
    3        1      m  5  4  5  2  4     4.5
    4        1      m  7  5  3  4  4     4.5
    5        2      f  2  2  1  4  1     2.5
    6        2      f  4  3  1 NA  3     2.5
    7        2      m  6  5  4  5  5     4.5
    8        2      m  8  4  5  5  5     4.5

     

     


    5. R - Tidyverse

     

    R Programming
    from rpy2.robjects import r
    %load_ext rpy2.ipython

     

    Results
    The rpy2.ipython extension is already loaded. To reload it, use:
      %reload_ext rpy2.ipython

     


    R Programming
    %%R
    
    library(tidyverse)
    library(psych)
    mydata <- read_csv("C:/work/data/mydata.csv", 
      col_types = cols( id       = col_double(),
                        workshop = col_character(),
                        gender   = col_character(),
                        q1       = col_double(),
                        q2       = col_double(),
                        q3       = col_double(),
                        q4       = col_double()
      )
    )
    
    withmooc = mydata
    
    attach(withmooc) # mydata를 기본 데이터 세트로 지정.
    
    withmooc

     

    Results
    R[write to console]: The following objects are masked from withmooc (pos = 8):
    
        gender, id, q1, q2, q3, q4, workshop
    
    
    
    
    # A tibble: 8 x 7
         id workshop gender    q1    q2    q3    q4
      <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>
    1     1 1        f          1     1     5     1
    2     2 2        f          2     1     4     1
    3     3 1        f          2     2     4     3
    4     4 2        f          3     1    NA     3
    5     5 1        m          4     5     2     4
    6     6 2        m          5     4     5     5
    7     7 1        m          5     3     4     4
    8     8 2        m          4     5     5     5

     

     

    • Aggregate함수는 R-Project에서 만들어진 함수이다.
    • Aggregate함수는 변수에 대하여 새 변수명을 생성한다.
    • Gender는 단일 오브젝트임에 불구하고 List함수 내에 들어가 있다.
    • 성별을 기준으로.
    R Programming
    %%R
    
    withmooc %>%
      group_by(gender) %>%
      summarise( x=mean(q1))

     

    Results
    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 2 x 2
      gender     x
      <chr>  <dbl>
    1 f        2  
    2 m        4.5

     


    • [참고] 연산 후 group 변수에 대한 처리 방식 지정
    R Programming
    %%R
    
    withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1)) %>%
      group_vars()
    
    withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1), .groups = "drop_last") %>%
      group_vars()
    
    withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1), .groups = "drop") %>%
      group_vars()
    
    withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1), .groups = "keep") %>%
      group_vars()
    
    # 행 방향 그룹이됩니다. 즉, 각 행이 그룹입니다.
    withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1), .groups = "rowwise") %>%
      group_vars()

     

    Results
    `summarise()` regrouping output by 'workshop' (override with `.groups` argument)
    [1] "workshop" "gender"  

     


    • workshop과 gender를 기준으로 집계.
    R Programming
    %%R
    
    myAgg <- withmooc %>%
      group_by(workshop,gender) %>%
      summarise( q1=mean(q1), .groups = 'keep') %>%
      ungroup()
    
    myAgg

     

    Results
    # A tibble: 4 x 3
      workshop gender    q1
      <chr>    <chr>  <dbl>
    1 1        f        1.5
    2 1        m        4.5
    3 2        f        2.5
    4 2        m        4.5

     

     

    • Mydata에 집계된 값을 Merge 한다.
    • 먼저, q1변수명을 mean.q1으로 변경.
    R Programming
    %%R
    
    myAgg <- myAgg %>%
      dplyr::rename(mean.q1=q1)
    myAgg

     

    Results
    # A tibble: 4 x 3
      workshop gender mean.q1
      <chr>    <chr>    <dbl>
    1 1        f          1.5
    2 1        m          4.5
    3 2        f          2.5
    4 2        m          4.5

     

     

    • Mydatam 위의 값을 Merge 한다.
    R Programming
    %%R
    
    withmooc2 <- withmooc %>%
      left_join(myAgg, by=c("workshop","gender")) %>%
      arrange(workshop,gender)
    
    withmooc2

     

    Results
    # A tibble: 8 x 8
         id workshop gender    q1    q2    q3    q4 mean.q1
      <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>   <dbl>
    1     1 1        f          1     1     5     1     1.5
    2     3 1        f          2     2     4     3     1.5
    3     5 1        m          4     5     2     4     4.5
    4     7 1        m          5     3     4     4     4.5
    5     2 2        f          2     1     4     1     2.5
    6     4 2        f          3     1    NA     3     2.5
    7     6 2        m          5     4     5     5     4.5
    8     8 2        m          4     5     5     5     4.5

     


    6. Python - Pandas

     

    Python Programming
    import pandas as pd
    import numpy as np
    import sweetviz as sv
    
    mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
                         dtype={'id':object,'workshop':object,
                                'q1':int, 'q2':int, 'q3':float, 'q4':int},
                         na_values=['NaN'],skipinitialspace =True)
    
    withmooc= mydata.copy()
    
    withmooc

     

    Results
    	id	workshop	gender	q1	q2	q3	q4
    0	1	1		f	1	1	5.0	1
    1	2	2		f	2	1	4.0	1
    2	3	1		f	2	2	4.0	3
    3	4	2		f	3	1	NaN	3
    4	5	1		m	4	5	2.0	4
    5	6	2		m	5	4	5.0	5
    6	7	1		m	5	3	4.0	4
    7	8	2		m	4	5	5.0	5

     


    • Aggregating / Summarizing 데이터를 위한 R-Project 프로그램.
    • Aggregate함수는 R-Project에서 만들어진 함수이다.
    • Aggregate함수는 변수에 대하여 새 변수명을 생성한다.
    • Gender는 단일 오브젝트임에 불구하고 List함수 내에 들어가 있다.
    Python Programming
    withmooc.groupby(['gender'])['q1'].agg([np.mean])

     

    Results
    	mean
    gender	
    f	2.0
    m	4.5

     


    • workshop과 gender를 기준으로 집계.
    Python Programming
    withmooc.groupby(['workshop','gender'])['q1'].mean()
    
    withmooc.groupby(['workshop','gender'])['q1'].agg([np.mean])

     

    Results
    		mean
    workshop	gender	
    1	f	1.5
    	m	4.5
    2	f	2.5
    	m	4.5

     

     

    Python Programming
     
    myAgg = withmooc.groupby(['workshop','gender'])['q1'].mean().reset_index()
    myAgg

     

    Results
    	workshop	gender	q1
    0	1		f	1.5
    1	1		m	4.5
    2	2		f	2.5
    3	2		m	4.5

     


    • withmooc에 집계된 값을 Merge 한다.
    • 먼저, q1변수명을 mean.q1으로 변경.
    Python Programming
    myAgg = myAgg.rename(columns = {'q1' : 'mean.q1'})
    myAgg

     

    Results
    	workshop	gender	mean.q1
    0	1		f	1.5
    1	1		m	4.5
    2	2		f	2.5
    3	2		m	4.5

     


    • Mydata에 위의 값을 Merge 한다.
    Python Programming
    pd.merge(withmooc, myAgg, on=['workshop','gender'])

     

    Results
    	id	workshop	gender	q1	q2	q3	q4	mean.q1
    0	1	1		f	1	1	5.0	1	1.5
    1	3	1		f	2	2	4.0	3	1.5
    2	2	2		f	2	1	4.0	1	2.5
    3	4	2		f	3	1	NaN	3	2.5
    4	5	1		m	4	5	2.0	4	4.5
    5	7	1		m	5	3	4.0	4	4.5
    6	6	2		m	5	4	5.0	5	4.5
    7	8	2		m	4	5	5.0	5	4.5

     

     


    7. Python - dfply

     

    Python Programming
    import pandas as pd
    from dfply import *
    
    mydata   = pd.read_csv("c:/work/data/mydata.csv",sep=",",
                           dtype={'id':object,'workshop':object,
                                  'q1':int, 'q2':int, 'q3':float, 'q4':int},
                           na_values=['NaN'],skipinitialspace =True)
    
    withmooc= mydata.copy()
    
    # 모든 변수 선택하기.
    withmooc

     

    Results
    	id	workshop	gender	q1	q2	q3	q4
    0	1	1		f	1	1	5.0	1
    1	2	2		f	2	1	4.0	1
    2	3	1		f	2	2	4.0	3
    3	4	2		f	3	1	NaN	3
    4	5	1		m	4	5	2.0	4
    5	6	2		m	5	4	5.0	5
    6	7	1		m	5	3	4.0	4
    7	8	2		m	4	5	5.0	5

     


    • Aggregate함수는 R-Project에서 만들어진 함수이다.
    • Aggregate함수는 변수에 대하여 새 변수명을 생성한다.
    • Gender는 단일 오브젝트임에 불구하고 List함수 내에 들어가 있다.
    • 성별에 대하여.
    Python Programming
    withmooc >> \
      group_by(X.gender) >> \
      summarise( x=mean(X.q1))

     

    Results
    	gender	x
    0	f	2.0
    1	m	4.5

     


    • workshop과 gender를 기준으로 집계.
    Python Programming
    myAgg = withmooc >> \
      group_by(X.workshop,X.gender) >> \
      summarise( q1=mean(X.q1))
    
    myAgg

     

    Results
    	gender	workshop	q1
    0		f	1	1.5
    1		m	1	4.5
    2		f	2	2.5
    3		m	2	4.5

     

     
    • Mydata에 집계된 값을 Merge 한다.
    • 먼저, q1변수명을 mean.q1으로 변경.
    Python Programming
    myAgg = myAgg >> \
      rename(mean_q1=X.q1)
    
    myAgg

     

    Results
    	gender	workshop	mean_q1
    0		f	1	1.5
    1		m	1	4.5
    2		f	2	2.5
    3		m	2	4.5

     

     

    • withmooc에 위의 값을 Merge 한다.
    Python Programming
    withmooc2 = withmooc >> \
      left_join(myAgg,left_on  = ["workshop","gender"],
                      right_on = ["workshop","gender"]) >> \
      arrange(X.workshop,X.gender)
    withmooc2

     

    Results
    	id	workshop	gender	q1	q2	q3	q4	mean_q1
    0	1	1		f	1	1	5.0	1	1.5
    2	3	1		f	2	2	4.0	3	1.5
    4	5	1		m	4	5	2.0	4	4.5
    6	7	1		m	5	3	4.0	4	4.5
    1	2	2		f	2	1	4.0	1	2.5
    3	4	2		f	3	1	NaN	3	2.5
    5	6	2		m	5	4	5.0	5	4.5
    7	8	2		m	4	5	5.0	5	4.5
     

     


     

    통계프로그램 비교 목록(Proc sql, SAS, SPSS, R 프로그래밍, R Tidyverse, Python Pandas, Python Dfply)
    [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크
    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE]
    SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크
    반응형

    댓글