14. Batch Computation of Statistics by Variable Type & Value Labels or Formats (& Measurement Level)

1. Proc SQL

SAS Program to Assign Value Labels (formats)

SAS Programming

options linesize=150;

* SAS Program to Assign Value Labels (formats);

PROC FORMAT;
     VALUE workshop_f 1="Control" 2="Treatment";
     VALUE $gender_f "m"="Male" "f"="Female";
     VALUE agreement 1='Strongly Disagree'
                     2='Disagree'
                     3='Neutral'
                     4='Agree'
                     5='Strongly Agree'.;

run;



proc sql;
  select id,
         workshop format=workshop_f.,
         gender   format=$gender_f.  ,
         q1       format=agreement. ,
         q2       format=agreement. ,
         q3       format=agreement. ,
         q4       format=agreement.
  from   BACK.mydata;

quit;

Results

id   workshop  gender                 q1                 q2                 q3                 q4
-------------------------------------------------------------------------------------------------
 1  Control    Female  Strongly Disagree  Strongly Disagree  Strongly Agree.    Strongly Disagree
 2  Treatment  Female  Disagree           Strongly Disagree  Agree              Strongly Disagree
 3  Control    Female  Disagree           Disagree           Agree              Neutral
 4  Treatment  Female  Neutral            Strongly Disagree                  .  Neutral
 5  Control    Male    Agree              Strongly Agree.    Disagree           Agree
 6  Treatment  Male    Strongly Agree.    Agree              Strongly Agree.    Strongly Agree.
 7  Control    Male    Strongly Agree.    Neutral            Agree              Agree
 8  Treatment  Male    Agree              Strongly Agree.    Strongly Agree.    Strongly Agree.

2. SAS Programming

SAS program to assign value labels (formats).

SAS Programming

PROC FORMAT;
     VALUE workshop_f 1="Control" 2="Treatment";
     VALUE $gender_f "m"="Male" "f"="Female";
     VALUE agreement 1='Strongly Disagree'
                     2='Disagree'
                     3='Neutral'
                     4='Agree'
                     5='Strongly Agree'.;
run;



DATA withmooc;
 SET BACK.mydata;
     FORMAT workshop workshop_f. gender $gender_f.
            q1-q4 agreement.;
run;

proc print;run;

Results

OBS id workshop  gender q1                q2                q3                q4
 1   1 Control   Female Strongly Disagree Strongly Disagree Strongly Agree.   Strongly Disagree
 2   2 Treatment Female Disagree          Strongly Disagree Agree             Strongly Disagree
 3   3 Control   Female Disagree          Disagree          Agree             Neutral
 4   4 Treatment Female Neutral           Strongly Disagree                 . Neutral
 5   5 Control   Male   Agree             Strongly Agree.   Disagree          Agree
 6   6 Treatment Male   Strongly Agree.   Agree             Strongly Agree.   Strongly Agree.
 7   7 Control   Male   Strongly Agree.   Neutral           Agree             Agree
 8   8 Treatment Male   Agree             Strongly Agree.   Strongly Agree.   Strongly Agree.

3. SPSS

SPSS program to assign value labels.

SPSS Programming

GET FILE="c:\mydata.sav".

VARIABLE LEVEL workshop (NOMINAL)
 /q1 TO q4 (SCALE).

VALUE LABELS  workshop 1 'Control'  2 'Treatment'
 /q1 TO q4
 1 'Strongly Disagree'
 2 'Disagree'
 3 'Neutral'
 4 'Agree'
 5 'Strongly Agree'.

SAVE OUTFILE="C:\mydata.sav".

4. R Programming (R-PROJECT)

R Programming

from rpy2.robjects import r
%load_ext rpy2.ipython

Results

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

R Programming

%%R

options(width = 200)

library(tidyverse)
library(psych)
library(Hmisc)

mydata <- read_csv("C:/work/data/mydata.csv", 
  col_types = cols( id       = col_double(),
                    workshop = col_character(),
                    gender   = col_character(),
                    q1       = col_double(),
                    q2       = col_double(),
                    q3       = col_double(),
                    q4       = col_double()
  )
)

withmooc = mydata

attach(withmooc) # attach withmooc so that its columns can be referenced by name.

withmooc

Results

# A tibble: 8 x 7
     id workshop gender    q1    q2    q3    q4
  <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>
1     1 1        f          1     1     5     1
2     2 2        f          2     1     4     1
3     3 1        f          2     2     4     3
4     4 2        f          3     1    NA     3
5     5 1        m          4     5     2     4
6     6 2        m          5     4     5     5
7     7 1        m          5     3     4     4
8     8 2        m          4     5     5     5

R program to assign value labels and factor status.
With the col_types given above, workshop and gender are both read as character.
summary therefore reports only the length, class, and mode of these two columns; it counts levels only after they are converted to factors.

R Programming

%%R

base::summary(withmooc)

Results

       id         workshop            gender                q1             q2             q3              q4      
 Min.   :1.00   Length:8           Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
 1st Qu.:2.75   Class :character   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
 Median :4.50   Mode  :character   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
 Mean   :4.50                                         Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
 3rd Qu.:6.25                                         3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
 Max.   :8.00                                         Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                                    NA's   :1

R Programming

%%R

dlookr::diagnose_numeric(mydata)

Results

# A tibble: 5 x 10
  variables   min    Q1  mean median    Q3   max  zero minus outlier
  <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <int> <int>   <int>
1 id            1  2.75  4.5     4.5  6.25     8     0     0       0
2 q1            1  2     3.25    3.5  4.25     5     0     0       0
3 q2            1  1     2.75    2.5  4.25     5     0     0       0
4 q3            2  4     4.14    4    5        5     0     0       1
5 q4            1  2.5   3.25    3.5  4.25     5     0     0       0

R Programming

%%R

withmooc %>%
  dlookr::diagnose_numeric()

Results

# A tibble: 5 x 10
  variables   min    Q1  mean median    Q3   max  zero minus outlier
  <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <int> <int>   <int>
1 id            1  2.75  4.5     4.5  6.25     8     0     0       0
2 q1            1  2     3.25    3.5  4.25     5     0     0       0
3 q2            1  1     2.75    2.5  4.25     5     0     0       0
4 q3            2  4     4.14    4    5        5     0     0       1
5 q4            1  2.5   3.25    3.5  4.25     5     0     0       0

R Programming

%%R

withmooc %>% 
  dlookr::describe() %>%
  as.data.frame()

Results

  variable n na     mean       sd   se_mean  IQR   skewness  kurtosis p00  p01  p05 p10 p20  p25
1       id 8  0 4.500000 2.449490 0.8660254 3.50  0.0000000 -1.200000   1 1.07 1.35 1.7 2.4 2.75
2       q1 8  0 3.250000 1.488048 0.5261043 2.25 -0.2167811 -1.410198   1 1.07 1.35 1.7 2.0 2.00
3       q2 8  0 2.750000 1.752549 0.6196197 3.25  0.2919336 -1.914116   1 1.00 1.00 1.0 1.0 1.00
4       q3 7  1 4.142857 1.069045 0.4040610 1.00 -1.5200483  2.712500   2 2.12 2.60 3.2 4.0 4.00
5       q4 8  0 3.250000 1.581139 0.5590170 1.75 -0.5421047 -1.024000   1 1.00 1.00 1.0 1.8 2.50
  p30 p40 p50 p60 p70  p75 p80 p90  p95  p99 p100
1 3.1 3.8 4.5 5.2 5.9 6.25 6.6 7.3 7.65 7.93    8
2 2.1 2.8 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00    5
3 1.1 1.8 2.5 3.2 3.9 4.25 4.6 5.0 5.00 5.00    5
4 4.0 4.0 4.0 4.6 5.0 5.00 5.0 5.0 5.00 5.00    5
5 3.0 3.0 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00    5

R Programming

%%R

withmooc %>%
  purrr::keep(.p = is.numeric) %>% # keep only numeric columns
  dlookr::describe()

Results

# A tibble: 5 x 26
  variable     n    na  mean    sd se_mean   IQR skewness kurtosis   p00   p01   p05   p10   p20
  <chr>    <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id           8     0  4.5   2.45   0.866  3.5     0        -1.2      1  1.07  1.35   1.7   2.4
2 q1           8     0  3.25  1.49   0.526  2.25   -0.217    -1.41     1  1.07  1.35   1.7   2  
3 q2           8     0  2.75  1.75   0.620  3.25    0.292    -1.91     1  1     1      1     1  
4 q3           7     1  4.14  1.07   0.404  1      -1.52      2.71     2  2.12  2.6    3.2   4  
5 q4           8     0  3.25  1.58   0.559  1.75   -0.542    -1.02     1  1     1      1     1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
#   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>

R Programming

%%R

withmooc %>%
  purrr::keep(.p = is.numeric) %>% # keep only numeric columns
  dlookr::describe() %>%
  as.data.frame()

Results

  variable n na     mean       sd   se_mean  IQR   skewness  kurtosis p00  p01  p05 p10 p20  p25
1       id 8  0 4.500000 2.449490 0.8660254 3.50  0.0000000 -1.200000   1 1.07 1.35 1.7 2.4 2.75
2       q1 8  0 3.250000 1.488048 0.5261043 2.25 -0.2167811 -1.410198   1 1.07 1.35 1.7 2.0 2.00
3       q2 8  0 2.750000 1.752549 0.6196197 3.25  0.2919336 -1.914116   1 1.00 1.00 1.0 1.0 1.00
4       q3 7  1 4.142857 1.069045 0.4040610 1.00 -1.5200483  2.712500   2 2.12 2.60 3.2 4.0 4.00
5       q4 8  0 3.250000 1.581139 0.5590170 1.75 -0.5421047 -1.024000   1 1.00 1.00 1.0 1.8 2.50
  p30 p40 p50 p60 p70  p75 p80 p90  p95  p99 p100
1 3.1 3.8 4.5 5.2 5.9 6.25 6.6 7.3 7.65 7.93    8
2 2.1 2.8 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00    5
3 1.1 1.8 2.5 3.2 3.9 4.25 4.6 5.0 5.00 5.00    5
4 4.0 4.0 4.0 4.6 5.0 5.00 5.0 5.0 5.00 5.00    5
5 3.0 3.0 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00    5

Convert the workshop variable to a factor.

R Programming

%%R

withmooc$workshop <- factor( withmooc$workshop,
                             levels=c(1,2,3,4),
                             labels=c("R","SAS","SPSS","Stata") )

withmooc

Results

# A tibble: 8 x 7
     id workshop gender    q1    q2    q3    q4
  <dbl> <fct>    <chr>  <dbl> <dbl> <dbl> <dbl>
1     1 R        f          1     1     5     1
2     2 SAS      f          2     1     4     1
3     3 R        f          2     2     4     3
4     4 SAS      f          3     1    NA     3
5     5 R        m          4     5     2     4
6     6 SAS      m          5     4     5     5
7     7 R        m          5     3     4     4
8     8 SAS      m          4     5     5     5

The summary function now counts the occurrences of each workshop level.
A mean of the workshop codes would be meaningless, so none is reported.

R Programming

%%R

summary(withmooc)

Results

       id        workshop    gender                q1             q2             q3              q4      
 Min.   :1.00   R    :4   Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
 1st Qu.:2.75   SAS  :4   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
 Median :4.50   SPSS :0   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
 Mean   :4.50   Stata:0                      Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
 3rd Qu.:6.25                                3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
 Max.   :8.00                                Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                           NA's   :1

Use the describe function from the Hmisc package.
Unlike summary, describe reports frequencies, means, and proportions for the q variables.
The Hmisc package must be installed before describe can be used.

R Programming

%%R

Hmisc::describe(withmooc)

Results

withmooc 

 7  Variables      8  Observations
------------------------------------------------------------------------------------------------------------------------------------------------------
id 
       n  missing distinct     Info     Mean      Gmd 
       8        0        8        1      4.5        3 

lowest : 1 2 3 4 5, highest: 4 5 6 7 8

Value          1     2     3     4     5     6     7     8
Frequency      1     1     1     1     1     1     1     1
Proportion 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
------------------------------------------------------------------------------------------------------------------------------------------------------
workshop 
       n  missing distinct 
       8        0        2 

Value        R SAS
Frequency    4   4
Proportion 0.5 0.5
------------------------------------------------------------------------------------------------------------------------------------------------------
gender 
       n  missing distinct 
       8        0        2 

Value        f   m
Frequency    4   4
Proportion 0.5 0.5
------------------------------------------------------------------------------------------------------------------------------------------------------
q1 
       n  missing distinct     Info     Mean      Gmd 
       8        0        5    0.964     3.25    1.786 

lowest : 1 2 3 4 5, highest: 1 2 3 4 5

Value          1     2     3     4     5
Frequency      1     2     1     2     2
Proportion 0.125 0.250 0.125 0.250 0.250
------------------------------------------------------------------------------------------------------------------------------------------------------
q2 
       n  missing distinct     Info     Mean      Gmd 
       8        0        5     0.94     2.75    2.071 

lowest : 1 2 3 4 5, highest: 1 2 3 4 5

Value          1     2     3     4     5
Frequency      3     1     1     1     2
Proportion 0.375 0.125 0.125 0.125 0.250
------------------------------------------------------------------------------------------------------------------------------------------------------
q3 
       n  missing distinct     Info     Mean      Gmd 
       7        1        3    0.857    4.143    1.143 

Value          2     4     5
Frequency      1     3     3
Proportion 0.143 0.429 0.429
------------------------------------------------------------------------------------------------------------------------------------------------------
q4 
       n  missing distinct     Info     Mean      Gmd 
       8        0        4    0.952     3.25    1.857 

Value         1    3    4    5
Frequency     2    2    2    2
Proportion 0.25 0.25 0.25 0.25
------------------------------------------------------------------------------------------------------------------------------------------------------

R Programming

%%R

describeData(withmooc)

Results

n.obs =  8 of which  7   are complete cases.   Number of variables =  7  of which all are numeric  FALSE  
          variable # n.obs type H1  H2 H3   H4 T1  T2 T3  T4
id*                1     8    4  1   2  3    4  5   6  7   8
workshop*          2     8    4  R SAS  R  SAS  R SAS  R SAS
gender*            3     8    4  f   f  f    f  m   m  m   m
q1*                4     8    4  1   2  2    3  4   5  5   4
q2*                5     8    4  1   1  2    1  5   4  3   5
q3*                6     7    4  5   4  4 <NA>  2   5  4   5
q4*                7     8    4  1   1  3    3  4   5  4   5

Check how the levels map to the underlying values.

R Programming

%%R

unclass(withmooc$workshop)

Results

[1] 1 2 1 2 1 2 1 2
attr(,"levels")
[1] "R"     "SAS"   "SPSS"  "Stata"
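
The mapping between codes, labels, and unused levels can be sketched in base R as follows. The vector `workshop` here is a hypothetical stand-in for `withmooc$workshop`, and `droplevels()` is shown only as one possible way to remove the levels that never occur; it is not part of the original program.

```r
# Hypothetical codes standing in for withmooc$workshop
workshop <- c(1, 2, 1, 2)

ws <- factor(workshop, levels = 1:4, labels = c("R", "SAS", "SPSS", "Stata"))

print(as.integer(ws))   # underlying integer codes
print(levels(ws))       # all four declared labels
print(table(ws))        # SPSS and Stata appear with a count of 0

# droplevels() removes levels that never occur in the data
ws_used <- droplevels(ws)
print(levels(ws_used))
```

Keeping the unused levels is useful when the same labels are shared across data sets; dropping them gives tidier frequency tables.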

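Label m as male and f as female, listing male first so that it becomes the first level.
If the values were uppercase, they would not match these levels and would become missing values.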

R Programming

%%R

withmooc$genderF <- factor( withmooc$gender,
                            levels=c("m","f"),labels=c("male","female") )

withmooc

Results

# A tibble: 8 x 8
     id workshop gender    q1    q2    q3    q4 genderF
  <dbl> <fct>    <chr>  <dbl> <dbl> <dbl> <dbl> <fct>  
1     1 R        f          1     1     5     1 female 
2     2 SAS      f          2     1     4     1 female 
3     3 R        f          2     2     4     3 female 
4     4 SAS      f          3     1    NA     3 female 
5     5 R        m          4     5     2     4 male   
6     6 SAS      m          5     4     5     5 male   
7     7 R        m          5     3     4     4 male   
8     8 SAS      m          4     5     5     5 male

Print gender and genderF to check the match.

R Programming

%%R

withmooc[ ,c("gender","genderF")]

Results

# A tibble: 8 x 2
  gender genderF
  <chr>  <fct>  
1 f      female 
2 f      female 
3 f      female 
4 f      female 
5 m      male   
6 m      male   
7 m      male   
8 m      male
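
The note about uppercase values can be illustrated with a small sketch. The vector `raw` is hypothetical mixed-case input, and normalising with `tolower()` is just one possible remedy, assumed here for illustration.

```r
raw <- c("M", "f", "m", "F")   # hypothetical mixed-case input

# Values that are not listed in `levels` silently become NA
strict <- factor(raw, levels = c("m", "f"), labels = c("male", "female"))
print(strict)

# Normalising the case first avoids the silent NAs
clean <- factor(tolower(raw), levels = c("m", "f"), labels = c("male", "female"))
print(clean)
```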

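Extract the underlying values of each.
Because gender is a character vector here, as.numeric(gender) returns NA for every row, so genderNums is all NA; alphabetical codes (f = 1, m = 2) would be assigned only if gender were a factor.
genderFNums follows the order of the levels given to factor() above, so m is 1 and f is 2.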

R Programming

%%R

withmooc$genderNums  <- as.numeric(withmooc$gender)

withmooc$genderFNums <- as.numeric(withmooc$genderF)

withmooc

Results

# A tibble: 8 x 10
     id workshop gender    q1    q2    q3    q4 genderF genderNums genderFNums
  <dbl> <fct>    <chr>  <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>       <dbl>
1     1 R        f          1     1     5     1 female          NA           2
2     2 SAS      f          2     1     4     1 female          NA           2
3     3 R        f          2     2     4     3 female          NA           2
4     4 SAS      f          3     1    NA     3 female          NA           2
5     5 R        m          4     5     2     4 male            NA           1
6     6 SAS      m          5     4     5     5 male            NA           1
7     7 R        m          5     3     4     4 male            NA           1
8     8 SAS      m          4     5     5     5 male            NA           1
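
A minimal base R sketch of the three cases, using a hypothetical toy vector `gender` in place of `withmooc$gender`: coercing character data, coercing a factor with default (sorted) levels, and coercing a factor with explicitly ordered levels.

```r
# Hypothetical toy vector standing in for withmooc$gender
gender <- c("f", "f", "m", "m")

# Character data has no underlying codes: coercion yields NA (with a warning)
codes_chr <- suppressWarnings(as.numeric(gender))

# Default factor(): levels are sorted, so f = 1 and m = 2
codes_default <- as.numeric(factor(gender))

# Explicit levels: the order given decides the codes, so m = 1 and f = 2
codes_custom <- as.numeric(factor(gender, levels = c("m", "f")))

print(codes_chr)      # NA NA NA NA
print(codes_default)  # 1 1 2 2
print(codes_custom)   # 2 2 1 1
```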

Create copies of the q variables to use as factors so that their values can be counted.
Store the levels and labels so that they can be reused.

R Programming

%%R

myQlevels <- c(1,2,3,4,5)

myQlabels <- c("Strongly Disagree",
               "Disagree",
               "Neutral",
               "Agree",
               "Strongly Agree")

Create the new set of variables with the factor function.

R Programming

%%R

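# Reference the columns explicitly instead of relying on attach().
withmooc$q1f <- factor(withmooc$q1, myQlevels, myQlabels)

withmooc$q2f <- factor(withmooc$q2, myQlevels, myQlabels)

withmooc$q3f <- factor(withmooc$q3, myQlevels, myQlabels)

withmooc$q4f <- factor(withmooc$q4, myQlevels, myQlabels)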

as.data.frame(withmooc)

Results

  id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums               q1f               q2f            q3f               q4f
1  1        R      f  1  1  5  1  female         NA           2 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2  2      SAS      f  2  1  4  1  female         NA           2          Disagree Strongly Disagree          Agree Strongly Disagree
3  3        R      f  2  2  4  3  female         NA           2          Disagree          Disagree          Agree           Neutral
4  4      SAS      f  3  1 NA  3  female         NA           2           Neutral Strongly Disagree           <NA>           Neutral
5  5        R      m  4  5  2  4    male         NA           1             Agree    Strongly Agree       Disagree             Agree
6  6      SAS      m  5  4  5  5    male         NA           1    Strongly Agree             Agree Strongly Agree    Strongly Agree
7  7        R      m  5  3  4  4    male         NA           1    Strongly Agree           Neutral          Agree             Agree
8  8      SAS      m  4  5  5  5    male         NA           1             Agree    Strongly Agree Strongly Agree    Strongly Agree

Result of the summary function.

R Programming

%%R

summary( withmooc[ c("q1f","q2f","q3f","q4f") ] )

Results

                q1f                   q2f                   q3f                   q4f   
 Strongly Disagree:1   Strongly Disagree:3   Strongly Disagree:0   Strongly Disagree:2  
 Disagree         :2   Disagree         :1   Disagree         :1   Disagree         :0  
 Neutral          :1   Neutral          :1   Neutral          :0   Neutral          :2  
 Agree            :2   Agree            :1   Agree            :3   Agree            :2  
 Strongly Agree   :2   Strongly Agree   :2   Strongly Agree   :3   Strongly Agree   :2  
                                             NA's             :1

Create copies of the q variables to use as factors so that their values can be counted.
When there are many variables, the approach below automates the work.

R Programming

%%R

myQlevels <- c(1,2,3,4,5)

myQlabels <- c("Strongly Disagree",
               "Disagree",
               "Neutral",
               "Agree",
               "Strongly Agree")

print(myQlevels)

print(myQlabels)

Results

[1] 1 2 3 4 5
[1] "Strongly Disagree" "Disagree"          "Neutral"           "Agree"             "Strongly Agree"

Create two sets of variable names to use.

R Programming

%%R

myQnames  <- paste( "q",  1:4, sep="")

myQFnames <- paste( "qf", 1:4, sep="")

print(myQnames) # original variable names.

print(myQFnames)  # names for the new factor variables.

Results

[1] "q1" "q2" "q3" "q4"
[1] "qf1" "qf2" "qf3" "qf4"

Extract the q variables into a separate data frame.

R Programming

%%R

myQFvars <- withmooc[ ,myQnames]

print(myQFvars)

Results

# A tibble: 8 x 4
     q1    q2    q3    q4
  <dbl> <dbl> <dbl> <dbl>
1     1     1     5     1
2     2     1     4     1
3     2     2     4     3
4     3     1    NA     3
5     4     5     2     4
6     5     4     5     5
7     5     3     4     4
8     4     5     5     5

Rename all of the variables so that each name contains an f (for factor).

R Programming

%%R

names(myQFvars) <- myQFnames

print(myQFvars)

Results

# A tibble: 8 x 4
    qf1   qf2   qf3   qf4
  <dbl> <dbl> <dbl> <dbl>
1     1     1     5     1
2     2     1     4     1
3     2     2     4     3
4     3     1    NA     3
5     4     5     2     4
6     5     4     5     5
7     5     3     4     4
8     4     5     5     5

Create a function to apply the labels to many variables.

R Programming

%%R

myLabeler <- function(x) { factor(x, myQlevels, myQlabels) }

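Check how the function works on a single variable.

R Programming

%%R

# [[ ]] extracts the column as a vector; [ ] would pass a one-column data frame to factor().
summary( myLabeler(myQFvars[["qf1"]]) )

Results

Strongly Disagree          Disagree           Neutral             Agree    Strongly Agree 
                1                 2                 1                 2                 2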
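
When a single column is passed to a labelling function, the indexing operator matters: single brackets return a data frame, while double brackets return the column vector that `factor()` expects. A minimal sketch, assuming a hypothetical one-column data frame `df` and made-up labels:

```r
# Hypothetical one-column data frame standing in for myQFvars
df <- data.frame(qf1 = c(1, 2, 2, 3))

one_col_df <- df["qf1"]    # single bracket: still a data frame
column_vec <- df[["qf1"]]  # double bracket: the numeric vector itself

print(class(one_col_df))   # "data.frame"
print(class(column_vec))   # "numeric"

# factor() on the vector gives one labelled value per row
labelled <- factor(column_vec, levels = 1:3, labels = c("Low", "Mid", "High"))
print(table(labelled))
```

`lapply()` over a data frame hands each column to the function as a vector, which is why applying the labeller to all columns at once works without double brackets.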

Apply it to all of the variables.

R Programming

%%R

myQFvars[ ,myQFnames] <- lapply( myQFvars[ ,myQFnames ], myLabeler )

myQFvars

Results

# A tibble: 8 x 4
  qf1               qf2               qf3            qf4              
  <fct>             <fct>             <fct>          <fct>            
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree          Strongly Disagree Agree          Strongly Disagree
3 Disagree          Disagree          Agree          Neutral          
4 Neutral           Strongly Disagree <NA>           Neutral          
5 Agree             Strongly Agree    Disagree       Agree            
6 Strongly Agree    Agree             Strongly Agree Strongly Agree   
7 Strongly Agree    Neutral           Agree          Agree            
8 Agree             Strongly Agree    Strongly Agree Strongly Agree

Result of the summary function.

R Programming

%%R

summary(myQFvars)

Results

                qf1                   qf2                   qf3                   qf4   
 Strongly Disagree:1   Strongly Disagree:3   Strongly Disagree:0   Strongly Disagree:2  
 Disagree         :2   Disagree         :1   Disagree         :1   Disagree         :0  
 Neutral          :1   Neutral          :1   Neutral          :0   Neutral          :2  
 Agree            :2   Agree            :1   Agree            :3   Agree            :2  
 Strongly Agree   :2   Strongly Agree   :2   Strongly Agree   :3   Strongly Agree   :2  
                                             NA's             :1

Bind the new variables onto withmooc.

R Programming

%%R

withmooc<-cbind(withmooc,myQFvars)

withmooc

Results

  id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums               q1f               q2f            q3f               q4f
1  1        R      f  1  1  5  1  female         NA           2 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2  2      SAS      f  2  1  4  1  female         NA           2          Disagree Strongly Disagree          Agree Strongly Disagree
3  3        R      f  2  2  4  3  female         NA           2          Disagree          Disagree          Agree           Neutral
4  4      SAS      f  3  1 NA  3  female         NA           2           Neutral Strongly Disagree           <NA>           Neutral
5  5        R      m  4  5  2  4    male         NA           1             Agree    Strongly Agree       Disagree             Agree
6  6      SAS      m  5  4  5  5    male         NA           1    Strongly Agree             Agree Strongly Agree    Strongly Agree
7  7        R      m  5  3  4  4    male         NA           1    Strongly Agree           Neutral          Agree             Agree
8  8      SAS      m  4  5  5  5    male         NA           1             Agree    Strongly Agree Strongly Agree    Strongly Agree
                qf1               qf2            qf3               qf4
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2          Disagree Strongly Disagree          Agree Strongly Disagree
3          Disagree          Disagree          Agree           Neutral
4           Neutral Strongly Disagree           <NA>           Neutral
5             Agree    Strongly Agree       Disagree             Agree
6    Strongly Agree             Agree Strongly Agree    Strongly Agree
7    Strongly Agree           Neutral          Agree             Agree
8             Agree    Strongly Agree Strongly Agree    Strongly Agree
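
The extract, rename, label, and bind steps can also be collapsed into a single assignment. The sketch below is one possible compact form in base R; the data frame `survey` and the `f` suffix naming are hypothetical and chosen only for illustration.

```r
q_levels <- 1:5
q_labels <- c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree")

# Hypothetical two-item survey; NA marks a missing answer
survey <- data.frame(q1 = c(1, 2, 5), q2 = c(4, NA, 4))

q_names  <- c("q1", "q2")
qf_names <- paste0(q_names, "f")   # "q1f" "q2f"

# lapply() returns a list of factors; assigning to new names adds the columns
survey[qf_names] <- lapply(survey[q_names],
                           function(x) factor(x, levels = q_levels, labels = q_labels))

str(survey)
```

Because the new columns are created in place, no separate data frame or `cbind()` call is needed, and the original numeric columns remain available for computing means.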

5. R - Tidyverse

R Programming

from rpy2.robjects import r
%load_ext rpy2.ipython

Results

The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython

R Programming

%%R

library(tidyverse)
library(psych)
mydata <- read_csv("C:/work/data/mydata.csv", 
  col_types = cols( id       = col_double(),
                    workshop = col_character(),
                    gender   = col_character(),
                    q1       = col_double(),
                    q2       = col_double(),
                    q3       = col_double(),
                    q4       = col_double()
  )
)

withmooc = mydata

attach(withmooc) # attach withmooc so that its columns can be referenced by name.

withmooc

Results

# A tibble: 8 x 7
     id workshop gender    q1    q2    q3    q4
  <dbl> <chr>    <chr>  <dbl> <dbl> <dbl> <dbl>
1     1 1        f          1     1     5     1
2     2 2        f          2     1     4     1
3     3 1        f          2     2     4     3
4     4 2        f          3     1    NA     3
5     5 1        m          4     5     2     4
6     6 2        m          5     4     5     5
7     7 1        m          5     3     4     4
8     8 2        m          4     5     5     5

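Store the data as one long text string, to be read with read.table below.
By default, read.table reads workshop (the group variable) as numeric.
Gender is text, so it is read as a factor in R versions before 4.0 and as character from R 4.0 onward.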

R Programming

%%R

mystring<-("id,workshop,gender,q1,q2,q3,q4
1,1,f,1,1,5,1
2,2,f,2,1,4,1
3,1,f,2,2,4,3
4,2,f,3,1, ,3
5,1,m,4,5,2,4
6,2,m,5,4,5,5
7,1,m,5,3,4,4
8,2,m,4,5,5,5")

mystring

Results

[1] "id,workshop,gender,q1,q2,q3,q4\n1,1,f,1,1,5,1\n2,2,f,2,1,4,1\n3,1,f,2,2,4,3\n4,2,f,3,1, ,3\n5,1,m,4,5,2,4\n6,2,m,5,4,5,5\n7,1,m,5,3,4,4\n8,2,m,4,5,5,5"

Instead of a file path, use the textConnection function to read mystring (one long character string defined in the program) as though it were a text file.

R Programming

%%R

withmooc<-read.table(textConnection(mystring),
                   header=TRUE,sep=",",row.names="id")

withmooc

Results

  workshop gender q1 q2 q3 q4
1        1      f  1  1  5  1
2        2      f  2  1  4  1
3        1      f  2  2  4  3
4        2      f  3  1 NA  3
5        1      m  4  5  2  4
6        2      m  5  4  5  5
7        1      m  5  3  4  4
8        2      m  4  5  5  5
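
As an alternative to `textConnection`, `read.table` also accepts the string directly through its `text` argument. The sketch below uses a hypothetical three-row string `csv_text` and sets `stringsAsFactors` explicitly, so that the column types do not depend on the R version.

```r
csv_text <- "id,gender,q1,q2
1,f,1,5
2,f,,4
3,m,4,2"

# `text=` lets read.table parse a string directly; stringsAsFactors is set
# explicitly so the result does not depend on the R version
demo <- read.table(text = csv_text, header = TRUE, sep = ",",
                   stringsAsFactors = FALSE)

str(demo)
print(is.na(demo$q1))   # the empty field in row 2 is read as NA
```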

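By default, summary treats a numeric workshop as numeric and counts the levels of gender when it is a factor.
The output below is from the read_csv version of withmooc, where both columns are character, so only their length, class, and mode are reported.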

R Programming

%%R

summary(withmooc)

Results

       id         workshop            gender                q1             q2             q3              q4      
 Min.   :1.00   Length:8           Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
 1st Qu.:2.75   Class :character   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
 Median :4.50   Mode  :character   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
 Mean   :4.50                                         Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
 3rd Qu.:6.25                                         3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
 Max.   :8.00                                         Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                                    NA's   :1

R Programming

%%R

withmooc %>%
  dlookr::diagnose_numeric()

Results

# A tibble: 5 x 10
  variables   min    Q1  mean median    Q3   max  zero minus outlier
  <chr>     <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <int> <int>   <int>
1 id            1  2.75  4.5     4.5  6.25     8     0     0       0
2 q1            1  2     3.25    3.5  4.25     5     0     0       0
3 q2            1  1     2.75    2.5  4.25     5     0     0       0
4 q3            2  4     4.14    4    5        5     0     0       1
5 q4            1  2.5   3.25    3.5  4.25     5     0     0       0

R Programming

%%R

withmooc %>% 
  dlookr::describe()

Results

# A tibble: 5 x 26
  variable     n    na  mean    sd se_mean   IQR skewness kurtosis   p00   p01   p05   p10   p20
  <chr>    <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id           8     0  4.5   2.45   0.866  3.5     0        -1.2      1  1.07  1.35   1.7   2.4
2 q1           8     0  3.25  1.49   0.526  2.25   -0.217    -1.41     1  1.07  1.35   1.7   2  
3 q2           8     0  2.75  1.75   0.620  3.25    0.292    -1.91     1  1     1      1     1  
4 q3           7     1  4.14  1.07   0.404  1      -1.52      2.71     2  2.12  2.6    3.2   4  
5 q4           8     0  3.25  1.58   0.559  1.75   -0.542    -1.02     1  1     1      1     1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
#   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>

R Programming

%%R

withmooc %>%
  purrr::keep(.p = is.numeric) %>% # keep only numeric columns
  dlookr::describe()

Results

# A tibble: 5 x 26
  variable     n    na  mean    sd se_mean   IQR skewness kurtosis   p00   p01   p05   p10   p20
  <chr>    <int> <int> <dbl> <dbl>   <dbl> <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id           8     0  4.5   2.45   0.866  3.5     0        -1.2      1  1.07  1.35   1.7   2.4
2 q1           8     0  3.25  1.49   0.526  2.25   -0.217    -1.41     1  1.07  1.35   1.7   2  
3 q2           8     0  2.75  1.75   0.620  3.25    0.292    -1.91     1  1     1      1     1  
4 q3           7     1  4.14  1.07   0.404  1      -1.52      2.71     2  2.12  2.6    3.2   4  
5 q4           8     0  3.25  1.58   0.559  1.75   -0.542    -1.02     1  1     1      1     1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
#   p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>

R Programming

%%R

print(packageVersion("tidyr"))
print(packageVersion("dplyr"))

Results

[1] '1.1.1'
[1] '1.0.2'

If the error below occurs, restart the session and run the code again; the exact cause is unknown.

Error: Input must be a vector, not a describe object.
Run rlang::last_error() to see where the error occurred.

[Reference] [R Data Exploration] 2. Checking descriptive statistics for every variable at once with the map function (suzin.log) [Link]

R Programming

%%R

withmooc %>% 
  purrr::keep(.p = is.numeric) %>%                  # keep only numeric columns
  purrr::map_df(.x = ., .f = psych::describe) %>%  # apply the descriptive-statistics function to each column
  base::transform(vars = colnames(purrr::keep(.x = withmooc,
                                         .p = is.numeric)))

Results

       vars n     mean       sd median  trimmed    mad min max range       skew
X1...1   id 8 4.500000 2.449490    4.5 4.500000 2.9652   1   8     7  0.0000000
X1...2   q1 8 3.250000 1.488048    3.5 3.250000 2.2239   1   5     4 -0.1422626
X1...3   q2 8 2.750000 1.752549    2.5 2.750000 2.2239   1   5     4  0.1915814
X1...4   q3 7 4.142857 1.069045    4.0 4.142857 1.4826   2   5     3 -0.9306418
X1...5   q4 8 3.250000 1.581139    3.5 3.250000 1.4826   1   5     4 -0.3557562
         kurtosis        se
X1...1 -1.6510417 0.8660254
X1...2 -1.7276762 0.5261043
X1...3 -1.9113964 0.6196197
X1...4 -0.5165816 0.4040610
X1...5 -1.5868750 0.5590170

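Convert the workshop variable to a factor.
The summary function then counts the occurrences of each workshop level.
Any mean reported for workshop so far is meaningless, because its values are category codes.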

R Programming

%%R

withmooc %>%
  mutate(workshop = factor(workshop,
                            levels=c(1,2,3,4),
                            labels=c("R","SAS","SPSS","Stata"))) %>%
  summary()

Results

       id        workshop    gender                q1             q2             q3              q4      
 Min.   :1.00   R    :4   Length:8           Min.   :1.00   Min.   :1.00   Min.   :2.000   Min.   :1.00  
 1st Qu.:2.75   SAS  :4   Class :character   1st Qu.:2.00   1st Qu.:1.00   1st Qu.:4.000   1st Qu.:2.50  
 Median :4.50   SPSS :0   Mode  :character   Median :3.50   Median :2.50   Median :4.000   Median :3.50  
 Mean   :4.50   Stata:0                      Mean   :3.25   Mean   :2.75   Mean   :4.143   Mean   :3.25  
 3rd Qu.:6.25                                3rd Qu.:4.25   3rd Qu.:4.25   3rd Qu.:5.000   3rd Qu.:4.25  
 Max.   :8.00                                Max.   :5.00   Max.   :5.00   Max.   :5.000   Max.   :5.00  
                                                                           NA's   :1

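Use the describe function from the Hmisc package.
Unlike summary, Hmisc's describe reports frequencies, means, and percentages for the q variables.
The Hmisc library must be installed before describe can be used.
Note that the call below is unqualified and the output shown has the layout of psych::describe (vars, n, mean, sd, median, trimmed, mad, ...), so psych's describe appears to be the one that ran in this session; write Hmisc::describe() to call the Hmisc version explicitly.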

R Programming

%%R

withmooc %>%
  mutate(workshop = factor(workshop,
                           levels=c(1,2,3,4),
                           labels=c("R","SAS","SPSS","Stata"))) %>%
  describe()

Results

          vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
id           1 8 4.50 2.45    4.5    4.50 2.97   1   8     7  0.00    -1.65 0.87
workshop*    2 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
gender*      3 8 1.50 0.53    1.5    1.50 0.74   1   2     1  0.00    -2.23 0.19
q1           4 8 3.25 1.49    3.5    3.25 2.22   1   5     4 -0.14    -1.73 0.53
q2           5 8 2.75 1.75    2.5    2.75 2.22   1   5     4  0.19    -1.91 0.62
q3           6 7 4.14 1.07    4.0    4.14 1.48   2   5     3 -0.93    -0.52 0.40
q4           7 8 3.25 1.58    3.5    3.25 1.48   1   5     4 -0.36    -1.59 0.56

Check how the levels are matched to the values.

R Programming

%%R

unclass(withmooc$gender)

Results

[1] "f" "f" "f" "f" "m" "m" "m" "m"

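Relabel m as male and f as female, and change the level order.
A value that does not match a listed level exactly (for example an uppercase "M") becomes a missing value.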

R Programming

%%R

withmooc<-withmooc %>%
  mutate(gender  = factor(gender,levels=c("f","m"),labels=c("f","m")),
         genderF = factor(gender,levels=c("m","f"),labels=c("male","female")))
withmooc

Results

# A tibble: 8 x 8
     id workshop gender    q1    q2    q3    q4 genderF
  <dbl> <chr>    <fct>  <dbl> <dbl> <dbl> <dbl> <fct>  
1     1 1        f          1     1     5     1 female 
2     2 2        f          2     1     4     1 female 
3     3 1        f          2     2     4     3 female 
4     4 2        f          3     1    NA     3 female 
5     5 1        m          4     5     2     4 male   
6     6 2        m          5     4     5     5 male   
7     7 1        m          5     3     4     4 male   
8     8 2        m          4     5     5     5 male

R Programming

%%R

print(unclass(withmooc$gender))

unclass(withmooc$genderF)

Results

[1] 1 1 1 1 2 2 2 2
attr(,"levels")
[1] "f" "m"
[1] 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "male"   "female"

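Extract the underlying integer code of each factor.
genderNums follows the alphabetical order of the values: f is 1 and m is 2.
genderFNums follows the order given in levels in the factor call above (levels=c("m","f")): m is 1 and f is 2.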

R Programming

%%R

withmooc$genderNums  <- as.numeric(withmooc$gender)
withmooc$genderFNums <- as.numeric(withmooc$genderF)

# Check the values that were actually assigned.
withmooc

Results

# A tibble: 8 x 10
     id workshop gender    q1    q2    q3    q4 genderF genderNums genderFNums
  <dbl> <chr>    <fct>  <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>       <dbl>
1     1 1        f          1     1     5     1 female           1           2
2     2 2        f          2     1     4     1 female           1           2
3     3 1        f          2     2     4     3 female           1           2
4     4 2        f          3     1    NA     3 female           1           2
5     5 1        m          4     5     2     4 male             2           1
6     6 2        m          5     4     5     5 male             2           1
7     7 1        m          5     3     4     4 male             2           1
8     8 2        m          4     5     5     5 male             2           1

Create copies of the q variables as factors so that their levels can be counted.
Store the levels and labels so they can be reused.

R Programming

%%R

myQlevels <- c(1,2,3,4,5)

# Store the labels for repeated use.

myQlabels <- c("Strongly Disagree",
               "Disagree",
               "Neutral",
               "Agree",
               "Strongly Agree")

Create a new set of variables with the factor function.

R Programming

%%R

withmooc %>%
  mutate(q1f = factor(q1, myQlevels, myQlabels), 
         q2f = factor(q2, myQlevels, myQlabels), 
         q3f = factor(q3, myQlevels, myQlabels), 
         q4f = factor(q4, myQlevels, myQlabels) ) %>%
  select(q1f,q2f,q3f,q4f) %>%
  summary()

Results

                q1f                   q2f                   q3f                   q4f   
 Strongly Disagree:1   Strongly Disagree:3   Strongly Disagree:0   Strongly Disagree:2  
 Disagree         :2   Disagree         :1   Disagree         :1   Disagree         :0  
 Neutral          :1   Neutral          :1   Neutral          :0   Neutral          :2  
 Agree            :2   Agree            :1   Agree            :3   Agree            :2  
 Strongly Agree   :2   Strongly Agree   :2   Strongly Agree   :3   Strongly Agree   :2  
                                             NA's             :1

Create factor copies of the q variables so that they can be counted.
When there are many variables, the steps below automate the conversion.

R Programming

%%R

myQlevels <- c(1,2,3,4,5)

myQlabels <- c("Strongly Disagree",
               "Disagree",
               "Neutral",
               "Agree",
               "Strongly Agree")

print(myQlevels)

print(myQlabels)

Results

[1] 1 2 3 4 5
[1] "Strongly Disagree" "Disagree"          "Neutral"           "Agree"             "Strongly Agree"

Create the two sets of variable names that will be used.

R Programming

%%R

myQnames  <- paste( "q",  1:4, sep="")
myQFnames <- paste( "qf", 1:4, sep="")

print(myQnames)   # original variable names
print(myQFnames)  # names of the new factor variables

Results

[1] "q1" "q2" "q3" "q4"
[1] "qf1" "qf2" "qf3" "qf4"

Create a function that applies the labels, so it can be used on many variables.

R Programming

%%R

myLabeler <- function(x) { factor(x, myQlevels, myQlabels) }

Check how the function works on a single variable.

R Programming

%%R

withmooc %>%
  mutate(qf1 = myLabeler(q1))

Results

# A tibble: 8 x 11
     id workshop gender    q1    q2    q3    q4 genderF genderNums genderFNums qf1              
  <dbl> <chr>    <fct>  <dbl> <dbl> <dbl> <dbl> <fct>        <dbl>       <dbl> <fct>            
1     1 1        f          1     1     5     1 female           1           2 Strongly Disagree
2     2 2        f          2     1     4     1 female           1           2 Disagree         
3     3 1        f          2     2     4     3 female           1           2 Disagree         
4     4 2        f          3     1    NA     3 female           1           2 Neutral          
5     5 1        m          4     5     2     4 male             2           1 Agree            
6     6 2        m          5     4     5     5 male             2           1 Strongly Agree   
7     7 1        m          5     3     4     4 male             2           1 Strongly Agree   
8     8 2        m          4     5     5     5 male             2           1 Agree

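Apply the function to all variables.
map: applies the function to each variable; the results are then reassembled into a single table.
transmute(): creates the new variables and drops the existing ones.
Because the first example below keeps every numeric column, id, genderNums, and genderFNums are relabeled as well; the later examples restrict the conversion to the q variables.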

R Programming

%%R

withmooc %>%
  purrr::keep(.p = is.numeric) %>% # keep only numeric columns
  purrr::map(myLabeler) %>%
  as_tibble()

Results

# A tibble: 8 x 7
  id                q1                q2                q3             q4                genderNums        genderFNums      
  <fct>             <fct>             <fct>             <fct>          <fct>             <fct>             <fct>            
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree Strongly Disagree Disagree         
2 Disagree          Disagree          Strongly Disagree Agree          Strongly Disagree Strongly Disagree Disagree         
3 Neutral           Disagree          Disagree          Agree          Neutral           Strongly Disagree Disagree         
4 Agree             Neutral           Strongly Disagree <NA>           Neutral           Strongly Disagree Disagree         
5 Strongly Agree    Agree             Strongly Agree    Disagree       Agree             Disagree          Strongly Disagree
6 <NA>              Strongly Agree    Agree             Strongly Agree Strongly Agree    Disagree          Strongly Disagree
7 <NA>              Strongly Agree    Neutral           Agree          Agree             Disagree          Strongly Disagree
8 <NA>              Agree             Strongly Agree    Strongly Agree Strongly Agree    Disagree          Strongly Disagree

R Programming

%%R

withmooc %>%
  select(starts_with("q")) %>% # keep only the columns whose names start with "q"
  purrr::map_dfc(myLabeler)

Results

# A tibble: 8 x 4
  q1                q2                q3             q4               
  <fct>             <fct>             <fct>          <fct>            
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree          Strongly Disagree Agree          Strongly Disagree
3 Disagree          Disagree          Agree          Neutral          
4 Neutral           Strongly Disagree <NA>           Neutral          
5 Agree             Strongly Agree    Disagree       Agree            
6 Strongly Agree    Agree             Strongly Agree Strongly Agree   
7 Strongly Agree    Neutral           Agree          Agree            
8 Agree             Strongly Agree    Strongly Agree Strongly Agree

R Programming

%%R

withmooc %>%
  purrr::keep(.p = is.numeric) %>% # keep only numeric columns
  purrr::map_df(.x = .,
                .f = myLabeler)

Results

# A tibble: 8 x 7
  id                q1                q2                q3             q4                genderNums        genderFNums      
  <fct>             <fct>             <fct>             <fct>          <fct>             <fct>             <fct>            
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree Strongly Disagree Disagree         
2 Disagree          Disagree          Strongly Disagree Agree          Strongly Disagree Strongly Disagree Disagree         
3 Neutral           Disagree          Disagree          Agree          Neutral           Strongly Disagree Disagree         
4 Agree             Neutral           Strongly Disagree <NA>           Neutral           Strongly Disagree Disagree         
5 Strongly Agree    Agree             Strongly Agree    Disagree       Agree             Disagree          Strongly Disagree
6 <NA>              Strongly Agree    Agree             Strongly Agree Strongly Agree    Disagree          Strongly Disagree
7 <NA>              Strongly Agree    Neutral           Agree          Agree             Disagree          Strongly Disagree
8 <NA>              Agree             Strongly Agree    Strongly Agree Strongly Agree    Disagree          Strongly Disagree

R Programming

%%R

withmooc %>% mutate_at( (withmooc %>%
                         select(starts_with("q")) %>%
                         colnames()),
                         myLabeler)

Results

# A tibble: 8 x 10
     id workshop gender q1                q2                q3             q4                genderF genderNums genderFNums
  <dbl> <chr>    <fct>  <fct>             <fct>             <fct>          <fct>             <fct>        <dbl>       <dbl>
1     1 1        f      Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female           1           2
2     2 2        f      Disagree          Strongly Disagree Agree          Strongly Disagree female           1           2
3     3 1        f      Disagree          Disagree          Agree          Neutral           female           1           2
4     4 2        f      Neutral           Strongly Disagree <NA>           Neutral           female           1           2
5     5 1        m      Agree             Strongly Agree    Disagree       Agree             male             2           1
6     6 2        m      Strongly Agree    Agree             Strongly Agree Strongly Agree    male             2           1
7     7 1        m      Strongly Agree    Neutral           Agree          Agree             male             2           1
8     8 2        m      Agree             Strongly Agree    Strongly Agree Strongly Agree    male             2           1

R Programming

%%R

withmooc %>% mutate_at( vars(starts_with("q")),
                        myLabeler)

Results

# A tibble: 8 x 10
     id workshop gender q1                q2                q3             q4                genderF genderNums genderFNums
  <dbl> <chr>    <fct>  <fct>             <fct>             <fct>          <fct>             <fct>        <dbl>       <dbl>
1     1 1        f      Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female           1           2
2     2 2        f      Disagree          Strongly Disagree Agree          Strongly Disagree female           1           2
3     3 1        f      Disagree          Disagree          Agree          Neutral           female           1           2
4     4 2        f      Neutral           Strongly Disagree <NA>           Neutral           female           1           2
5     5 1        m      Agree             Strongly Agree    Disagree       Agree             male             2           1
6     6 2        m      Strongly Agree    Agree             Strongly Agree Strongly Agree    male             2           1
7     7 1        m      Strongly Agree    Neutral           Agree          Agree             male             2           1
8     8 2        m      Agree             Strongly Agree    Strongly Agree Strongly Agree    male             2           1

Writing the function inline

R Programming

%%R

withmooc %>% mutate_at( vars(starts_with("q")),
                        ~ factor(.x, myQlevels, myQlabels))  # formula lambda; funs() is deprecated since dplyr 0.8.0

Results

# A tibble: 8 x 10
     id workshop gender q1                q2                q3             q4                genderF genderNums genderFNums
  <dbl> <chr>    <fct>  <fct>             <fct>             <fct>          <fct>             <fct>        <dbl>       <dbl>
1     1 1        f      Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female           1           2
2     2 2        f      Disagree          Strongly Disagree Agree          Strongly Disagree female           1           2
3     3 1        f      Disagree          Disagree          Agree          Neutral           female           1           2
4     4 2        f      Neutral           Strongly Disagree <NA>           Neutral           female           1           2
5     5 1        m      Agree             Strongly Agree    Disagree       Agree             male             2           1
6     6 2        m      Strongly Agree    Agree             Strongly Agree Strongly Agree    male             2           1
7     7 1        m      Strongly Agree    Neutral           Agree          Agree             male             2           1
8     8 2        m      Agree             Strongly Agree    Strongly Agree Strongly Agree    male             2           1

R Programming

%%R

withmooc %>%
  select(q1,q2,q3,q4) %>%
  purrr::map_dfc(~ withmooc %>% transmute( {{.x}} := myLabeler(.x))) %>%
  set_names(c('q1','q2','q3','q4'))

Results

R[write to console]: New names:
* `..1` -> ...1
* `..1` -> ...2
* `..1` -> ...3
* `..1` -> ...4



# A tibble: 8 x 4
  q1                q2                q3             q4               
  <fct>             <fct>             <fct>          <fct>            
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree          Strongly Disagree Agree          Strongly Disagree
3 Disagree          Disagree          Agree          Neutral          
4 Neutral           Strongly Disagree <NA>           Neutral          
5 Agree             Strongly Agree    Disagree       Agree            
6 Strongly Agree    Agree             Strongly Agree Strongly Agree   
7 Strongly Agree    Neutral           Agree          Agree            
8 Agree             Strongly Agree    Strongly Agree Strongly Agree

6. Python - Pandas

Python Programming

import pandas as pd
import numpy as np
import sweetviz as sv

mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
                     dtype={'id':object,'workshop':object,
                            'q1':int, 'q2':int, 'q3':float, 'q4':int},
                     na_values=['NaN'],skipinitialspace =True)

withmooc= mydata.copy()

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	1		f	1	1	5.0	1
1	2	2		f	2	1	4.0	1
2	3	1		f	2	2	4.0	3
3	4	2		f	3	1	NaN	3
4	5	1		m	4	5	2.0	4
5	6	2		m	5	4	5.0	5
6	7	1		m	5	3	4.0	4
7	8	2		m	4	5	5.0	5

Summary statistics for the numeric variables

Python Programming

withmooc= mydata.copy()

withmooc.describe()

Results

	q1		q2		q3		q4
count	8.000000	8.000000	7.000000	8.000000
mean	3.250000	2.750000	4.142857	3.250000
std	1.488048	1.752549	1.069045	1.581139
min	1.000000	1.000000	2.000000	1.000000
25%	2.000000	1.000000	4.000000	2.500000
50%	3.500000	2.500000	4.000000	3.500000
75%	4.250000	4.250000	5.000000	4.250000
max	5.000000	5.000000	5.000000	5.000000

Summary statistics for the character (object) variables

Python Programming

withmooc= mydata.copy()

withmooc.describe(include=[object])  # np.object was removed in NumPy 1.24; the builtin object works everywhere

Results

	id	workshop	gender
count	8	8		8
unique	8	2		2
top	2	1		f
freq	1	4		4

Python Programming

withmooc= mydata.copy()

withmooc.apply(lambda x : x.describe())

Results

	id	workshop	gender	q1	q2	q3	q4
25%	NaN	NaN	NaN	2.000000	1.000000	4.000000	2.500000
50%	NaN	NaN	NaN	3.500000	2.500000	4.000000	3.500000
75%	NaN	NaN	NaN	4.250000	4.250000	5.000000	4.250000
count	8	8	8	8.000000	8.000000	7.000000	8.000000
freq	1	4	4	NaN		NaN		NaN		NaN
max	NaN	NaN	NaN	5.000000	5.000000	5.000000	5.000000
mean	NaN	NaN	NaN	3.250000	2.750000	4.142857	3.250000
min	NaN	NaN	NaN	1.000000	1.000000	2.000000	1.000000
std	NaN	NaN	NaN	1.488048	1.752549	1.069045	1.581139
top	2	1	f	NaN		NaN		NaN		NaN
unique	8	2	2	NaN		NaN		NaN		NaN
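
The describe() calls above stop at the quartiles. As a rough counterpart to the psych::describe and dlookr::describe tables in the R section, the following sketch collects the count, number of missing values, mean, standard deviation, median, skewness, and kurtosis in one table. The data frame is rebuilt inline from the values printed above so the block runs on its own; the names df, num, and stats are illustrative and not part of the original post. pandas uses bias-corrected estimators for skew and kurt, so those two columns need not match the psych defaults.

```python
import numpy as np
import pandas as pd

# Survey answers copied from the output shown earlier in this post.
df = pd.DataFrame({
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

num = df.select_dtypes(include="number")  # numeric columns only

# One row per variable, one column per statistic (NaN is skipped).
stats = num.agg(["count", "mean", "std", "median", "skew", "kurt"]).T
stats.insert(1, "missing", num.isna().sum())  # number of missing values

print(stats.round(3))
```

Transposing puts one variable per row, which is the same orientation as the R tables.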

Python Programming

withmooc= mydata.copy()

labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}

withmooc['workshop'] = withmooc['workshop'].apply(lambda x: labels2.get(x))

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc= mydata.copy()
withmooc['workshop'] = withmooc['workshop'].map(labels2)

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc= mydata.copy()

withmooc['workshop'] = withmooc['workshop'].astype('category')
withmooc['workshop'] = withmooc['workshop'].cat.rename_categories(["R", "SAS"])

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc.groupby(withmooc['workshop']).describe()

Results

	q1							q2		...	q3		q4
	count	mean	std	min		25%	50%	75%	max	count	mean	...	75%	max	count	mean	std		min	25%	50%	75%	max
workshop																					
R	4.0	3.0	1.825742	1.0	1.75	3.0	4.25	5.0	4.0	2.75	...	4.25	5.0	4.0	3.0	1.414214	1.0	2.5	3.5	4.0	4.0
SAS	4.0	3.5	1.290994	2.0	2.75	3.5	4.25	5.0	4.0	2.75	...	5.00	5.0	4.0	3.5	1.914854	1.0	2.5	4.0	5.0	5.0

2 rows × 32 columns
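
The grouped describe() output is 32 columns wide and gets truncated on screen. When only a few statistics are needed, a narrower table is often easier to read. The sketch below is one way to do that with groupby and agg; the data is rebuilt inline from the values shown above, and the variable name by_ws is hypothetical.

```python
import pandas as pd

# workshop, q1 and q4 copied from the data shown earlier in this post.
df = pd.DataFrame({
    "workshop": ["1", "2", "1", "2", "1", "2", "1", "2"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

# Attach the value labels, then summarise per workshop.
df["workshop"] = df["workshop"].map({"1": "R", "2": "SAS"})
by_ws = df.groupby("workshop")[["q1", "q4"]].agg(["mean", "std"])

print(by_ws)
```

The result has a two-level column index, so a single cell is addressed as by_ws.loc["R", ("q1", "mean")].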

Check how the levels are matched to the values.

Python Programming

withmooc.info()
withmooc.dtypes
withmooc['workshop'].dtype

Results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   id        8 non-null      object  
 1   workshop  8 non-null      category
 2   gender    8 non-null      object  
 3   q1        8 non-null      int32   
 4   q2        8 non-null      int32   
 5   q3        7 non-null      float64 
 6   q4        8 non-null      int32   
dtypes: category(1), float64(1), int32(3), object(2)
memory usage: 520.0+ bytes





CategoricalDtype(categories=['R', 'SAS'], ordered=False)

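Relabel f as female and m as male.
rename_categories keeps the existing category order (f, m). Unlike R's factor(levels=...), astype('category') does not turn an unexpected value such as an uppercase "M" into a missing value; it becomes an additional category. Values become missing only when the categories are fixed explicitly.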

Python Programming

withmooc['gender']  = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4	genderF
0	1	R		f	1	1	5.0	1	female
1	2	SAS		f	2	1	4.0	1	female
2	3	R		f	2	2	4.0	3	female
3	4	SAS		f	3	1	NaN	3	female
4	5	R		m	4	5	2.0	4	male
5	6	SAS		m	5	4	5.0	5	male
6	7	R		m	5	3	4.0	4	male
7	8	SAS		m	4	5	5.0	5	male
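
To reproduce the R behaviour in which an unlisted value such as an uppercase "M" becomes missing, the categories have to be fixed explicitly. The sketch below uses pd.Categorical with a categories argument on a small made-up Series (raw is hypothetical data, not from mydata.csv) and shows one way to normalise the case before labeling.

```python
import pandas as pd

# Hypothetical input with inconsistent case; not part of mydata.csv.
raw = pd.Series(["f", "M", "m", "F"])

# Fixed categories: anything that is not exactly "f" or "m" becomes NaN.
strict = pd.Series(pd.Categorical(raw, categories=["f", "m"]))

# Lower-casing first keeps every row, then the labels are attached.
cleaned = pd.Series(
    pd.Categorical(raw.str.lower(), categories=["f", "m"])
).cat.rename_categories({"f": "female", "m": "male"})

print(strict.isna().sum())   # number of values lost to the strict match
print(cleaned.tolist())
```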

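Extract the underlying integer code of each category.
genderNums holds the codes of gender, whose categories are in alphabetical order: f is 0 and m is 1.
genderFNums holds the codes of genderF; because rename_categories preserves the category order, female is 0 and male is 1. Unlike R's factor codes, pandas codes start at 0.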

Python Programming

withmooc= mydata.copy()

withmooc['gender']  = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])

withmooc["genderNums"]  = withmooc["gender"].cat.codes
withmooc["genderFNums"] = withmooc["genderF"].cat.codes

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums
0	1	1		f	1	1	5.0	1	female	0		0
1	2	2		f	2	1	4.0	1	female	0		0
2	3	1		f	2	2	4.0	3	female	0		0
3	4	2		f	3	1	NaN	3	female	0		0
4	5	1		m	4	5	2.0	4	male	1		1
5	6	2		m	5	4	5.0	5	male	1		1
6	7	1		m	5	3	4.0	4	male	1		1
7	8	2		m	4	5	5.0	5	male	1		1

Python Programming

withmooc= mydata.copy()


withmooc['gender']  = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])

withmooc['genderNums']  = pd.factorize(withmooc.gender)[0]
withmooc['genderFNums'] = pd.factorize(withmooc.genderF)[0]
withmooc

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums
0	1	1		f	1	1	5.0	1	female	0		0
1	2	2		f	2	1	4.0	1	female	0		0
2	3	1		f	2	2	4.0	3	female	0		0
3	4	2		f	3	1	NaN	3	female	0		0
4	5	1		m	4	5	2.0	4	male	1		1
5	6	2		m	5	4	5.0	5	male	1		1
6	7	1		m	5	3	4.0	4	male	1		1
7	8	2		m	4	5	5.0	5	male	1		1
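
In the data above every f comes before every m, so cat.codes and pd.factorize happen to return the same numbers. They follow different rules, though: cat.codes uses the category order, while pd.factorize numbers the values in order of first appearance. The small Series below is made up to make the difference visible, and the last two lines sketch how an explicit category order, like levels=c("m","f") in the R section, can be reproduced.

```python
import pandas as pd

# Hypothetical values where "m" appears before "f".
s = pd.Series(["m", "f", "f", "m"])

# Categories are sorted, so f -> 0 and m -> 1.
cat_codes = s.astype("category").cat.codes

# factorize numbers values by first appearance, so m -> 0 and f -> 1.
fact_codes, uniques = pd.factorize(s)

# An explicit category order gives m -> 0, f -> 1; adding 1 mimics R's 1-based codes.
custom = pd.Categorical(s, categories=["m", "f"]).codes
r_style = custom + 1

print(cat_codes.tolist(), fact_codes.tolist(), r_style.tolist())
```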

Python Programming

withmooc= mydata.copy()

withmooc['gender']  = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])

from sklearn.preprocessing import LabelEncoder

number = LabelEncoder()

withmooc['genderNums']  = number.fit_transform(withmooc['gender'])
withmooc['genderFNums'] = number.fit_transform(withmooc['genderF'])

withmooc.info()

withmooc

Results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           8 non-null      object  
 1   workshop     8 non-null      object  
 2   gender       8 non-null      category
 3   q1           8 non-null      int32   
 4   q2           8 non-null      int32   
 5   q3           7 non-null      float64 
 6   q4           8 non-null      int32   
 7   genderF      8 non-null      category
 8   genderNums   8 non-null      int32   
 9   genderFNums  8 non-null      int32   
dtypes: category(2), float64(1), int32(5), object(2)
memory usage: 688.0+ bytes

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums
0	1	1		f	1	1	5.0	1	female	0		0
1	2	2		f	2	1	4.0	1	female	0		0
2	3	1		f	2	2	4.0	3	female	0		0
3	4	2		f	3	1	NaN	3	female	0		0
4	5	1		m	4	5	2.0	4	male	1		1
5	6	2		m	5	4	5.0	5	male	1		1
6	7	1		m	5	3	4.0	4	male	1		1
7	8	2		m	4	5	5.0	5	male	1		1

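Create copies of the q variables as categoricals so that they can be counted.
Build the new set of variables with astype('category') and rename_categories, the pandas counterpart of R's factor function. The labels are written out for each variable here; a later step stores them for reuse.
A list of new names must have as many items as there are categories in the column, so q3 and q4, which do not contain every level from 1 to 5, use a dict that maps each level to its label.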

Python Programming

withmooc= mydata.copy()

withmooc['q1f']=withmooc['q1'].astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])
withmooc['q2f']=withmooc['q2'].astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])
withmooc['q3f']=withmooc['q3'].astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})
withmooc['q4f']=withmooc['q4'].astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4	q1f			q2f			q3f		q4f
0	1	1		f	1	1	5.0	1	Strongly Disagree	Strongly Disagree	Strongly Agree	Strongly Disagree
1	2	2		f	2	1	4.0	1	Disagree		Strongly Disagree	Agree		Strongly Disagree
2	3	1		f	2	2	4.0	3	Disagree		Disagree		Agree		Neutral
3	4	2		f	3	1	NaN	3	Neutral			Strongly Disagree	NaN		Neutral
4	5	1		m	4	5	2.0	4	Agree			Strongly Agree		Disagree	Agree
5	6	2		m	5	4	5.0	5	Strongly Agree		Agree			Strongly Agree	Strongly Agree
6	7	1		m	5	3	4.0	4	Strongly Agree		Neutral			Agree		Agree
7	8	2		m	4	5	5.0	5	Agree			Strongly Agree		Strongly Agree	Strongly Agree

Python Programming

withmooc.info()

Results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   id        8 non-null      object  
 1   workshop  8 non-null      object  
 2   gender    8 non-null      object  
 3   q1        8 non-null      int32   
 4   q2        8 non-null      int32   
 5   q3        7 non-null      float64 
 6   q4        8 non-null      int32   
 7   q1f       8 non-null      category
 8   q2f       8 non-null      category
 9   q3f       7 non-null      category
 10  q4f       8 non-null      category
dtypes: category(4), float64(1), int32(3), object(3)
memory usage: 1.2+ KB

Python Programming

pd.DataFrame([(withmooc[val].value_counts()) for val in ['q1f','q2f','q3f','q4f']]).T

Results

			q1f	q2f	q3f	q4f
Agree			2.0	1.0	3.0	2.0
Disagree		2.0	1.0	1.0	NaN
Neutral			1.0	1.0	NaN	2.0
Strongly Agree		2.0	2.0	3.0	2.0
Strongly Disagree	1.0	3.0	NaN	2.0
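
In the table above, levels that never occur show up as NaN and the rows are sorted alphabetically, whereas R's summary reports a count of 0 and keeps the Likert order. One way to get closer to the R output, sketched here under the assumption that all five levels should always be present, is to fix the categories with pd.Categorical before counting. Only q4 is used, with values copied from the data above.

```python
import pandas as pd

levels = [1, 2, 3, 4, 5]
labels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# q4 copied from the data above; level 2 never occurs.
q4 = pd.Series([1, 1, 3, 3, 4, 5, 4, 5], name="q4")

# Declare all five levels first, then attach the labels in the same order.
q4f = pd.Series(
    pd.Categorical(q4, categories=levels, ordered=True).rename_categories(labels),
    name="q4f",
)

# sort=False keeps the category order instead of sorting by frequency.
counts = q4f.value_counts(sort=False)
print(counts)
```

Because the categories are declared up front, the same list of labels works for every q column regardless of which levels happen to occur.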

Create factor copies of the q variables so that they can be counted.
When there are many variables, the steps below automate the conversion.

Python Programming

myQlevels = [1,2,3,4,5]

myQlabels =   {1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}

print(myQlevels)
print(myQlabels)

Results

[1, 2, 3, 4, 5]
{1: 'Strongly Disagree', 2: 'Disagree', 3: 'Neutral', 4: 'Agree', 5: 'Strongly Agree'}

Python Programming

myQnames  = ["q" + str(i) for i in range(1,5)]
myQFnames = ["q" + str(i) + "f" for i in range(1,5)]

print(myQnames)   # original variable names

print(myQFnames)  # names of the new factor variables

Results

['q1', 'q2', 'q3', 'q4']
['q1f', 'q2f', 'q3f', 'q4f']

Extract the q variables into a separate data frame.

Python Programming

myQFvars = withmooc.loc[:,myQnames].copy()  # copy so that later changes do not touch withmooc

myQFvars

Results

	q1	q2	q3	q4
0	1	1	5.0	1
1	2	1	4.0	1
2	2	2	4.0	3
3	3	1	NaN	3
4	4	5	2.0	4
5	5	4	5.0	5
6	5	3	4.0	4
7	4	5	5.0	5

Rename the columns to the names ending in f, which mark the factor versions.

Python Programming

myQFvars.columns = myQFnames
myQFvars

Results

	q1f	q2f	q3f	q4f
0	1	1	5.0	1
1	2	1	4.0	1
2	2	2	4.0	3
3	3	1	NaN	3
4	4	5	2.0	4
5	5	4	5.0	5
6	5	3	4.0	4
7	4	5	5.0	5

Python Programming

withmooc['q4'].astype('category').cat.rename_categories(myQlabels)

Results

0    Strongly Disagree
1    Strongly Disagree
2              Neutral
3              Neutral
4                Agree
5       Strongly Agree
6                Agree
7       Strongly Agree
Name: q4, dtype: category
Categories (4, object): ['Strongly Disagree', 'Neutral', 'Agree', 'Strongly Agree']

Python Programming

def categories(x):
    return x.astype('category').cat.rename_categories(myQlabels)

categories(withmooc['q4'])

Results

0    Strongly Disagree
1    Strongly Disagree
2              Neutral
3              Neutral
4                Agree
5       Strongly Agree
6                Agree
7       Strongly Agree
Name: q4, dtype: category
Categories (4, object): ['Strongly Disagree', 'Neutral', 'Agree', 'Strongly Agree']

Python Programming

myQFvars = myQFvars.apply(categories)  # myQFvars holds only the q*f columns, so convert them all
myQFvars

Results

	q1f			q2f			q3f			q4f
0	Strongly Disagree	Strongly Disagree	Strongly Agree		Strongly Disagree
1	Disagree		Strongly Disagree	Agree			Strongly Disagree
2	Disagree		Disagree		Agree			Neutral
3	Neutral			Strongly Disagree	NaN			Neutral
4	Agree			Strongly Agree		Disagree		Agree
5	Strongly Agree		Agree			Strongly Agree		Strongly Agree
6	Strongly Agree		Neutral			Agree			Agree
7	Agree			Strongly Agree		Strongly Agree		Strongly Agree

Counts per category, the counterpart of the output of R's summary function.

Python Programming

pd.DataFrame([(myQFvars[val].value_counts()) for val in ['q1f','q2f','q3f','q4f']]).T

Results

			q1f	q2f	q3f	q4f
Agree			2.0	1.0	3.0	2.0
Disagree		2.0	1.0	1.0	NaN
Neutral			1.0	1.0	NaN	2.0
Strongly Agree		2.0	2.0	3.0	2.0
Strongly Disagree	1.0	3.0	NaN	2.0

Python Programming

pd.merge(withmooc, myQFvars, how='inner')

Results

	id	workshop	gender	q1	q2	q3	q4	q1f			q2f			q3f			q4f
0	1	1		f	1	1	5.0	1	Strongly Disagree	Strongly Disagree	Strongly Agree		Strongly Disagree
1	2	2		f	2	1	4.0	1	Disagree		Strongly Disagree	Agree			Strongly Disagree
2	3	1		f	2	2	4.0	3	Disagree		Disagree		Agree			Neutral
3	4	2		f	3	1	NaN	3	Neutral			Strongly Disagree	NaN			Neutral
4	5	1		m	4	5	2.0	4	Agree			Strongly Agree		Disagree		Agree
5	6	2		m	5	4	5.0	5	Strongly Agree		Agree			Strongly Agree		Strongly Agree
6	7	1		m	5	3	4.0	4	Strongly Agree		Neutral			Agree			Agree
7	8	2		m	4	5	5.0	5	Agree			Strongly Agree		Strongly Agree		Strongly Agree
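
pd.merge without an on argument joins on every column the two frames share, so the call above matches rows by their q1f to q4f values and only returns 8 rows because no two respondents gave identical answers. A sketch of an alternative that attaches the labeled columns by row index instead is shown below. The data is rebuilt inline from the values above, and labeler, label_map, factors, and combined are hypothetical names.

```python
import numpy as np
import pandas as pd

# id and q1..q4 copied from the data shown earlier in this post.
df = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

label_map = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
             4: "Agree", 5: "Strongly Agree"}
qnames = [f"q{i}" for i in range(1, 5)]

def labeler(col):
    """Return col as a categorical that always has all five labels."""
    mapped = col.map(lambda v: label_map.get(v))  # NaN has no entry and becomes None
    cat = pd.Categorical(mapped, categories=list(label_map.values()))
    return pd.Series(cat, index=col.index)

# Build q1f..q4f and attach them by row index rather than by value.
factors = pd.DataFrame({name + "f": labeler(df[name]) for name in qnames})
combined = df.join(factors)

print(combined)
```

Joining on the index keeps exactly one output row per input row even when two respondents answer identically.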

7. Python - dfply

By default, summary treats the group variable as numeric, but it assumes gender is a factor and counts its levels.

Python Programming

import pandas as pd
import numpy as np  # used below by select_dtypes(include=np.number)
from dfply import *

mydata   = pd.read_csv("c:/work/data/mydata.csv",sep=",",
                       dtype={'id':object,'workshop':object,
                              'q1':int, 'q2':int, 'q3':float, 'q4':int},
                       na_values=['NaN'],skipinitialspace =True)

withmooc= mydata.copy()

# Display all variables.
withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	1		f	1	1	5.0	1
1	2	2		f	2	1	4.0	1
2	3	1		f	2	2	4.0	3
3	4	2		f	3	1	NaN	3
4	5	1		m	4	5	2.0	4
5	6	2		m	5	4	5.0	5
6	7	1		m	5	3	4.0	4
7	8	2		m	4	5	5.0	5

Python Programming

withmooc >> summarize(**{
  **{f"{x}_mean": X[x].mean() for x in mydata.select_dtypes(int).columns},
  **{f"{x}_std" : X[x].std() for x in mydata.select_dtypes(int).columns},
  **{f"{x}_var" : X[x].var() for x in mydata.select_dtypes(int).columns},
  **{f"{x}_median" : X[x].median() for x in mydata.select_dtypes(int).columns}
  })

Results

	q1_mean	q2_mean	q4_mean	q1_std		q2_std		q4_std		q1_var		q2_var		q4_var	q1_median	q2_median	q4_median
0	3.25	2.75	3.25	1.488048	1.752549	1.581139	2.214286	3.071429	2.5	3.5		2.5		3.5
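The same batch of statistics can also be sketched in plain pandas, without dfply. This is a minimal sketch, assuming the eight-row mydata table shown above is typed in by hand rather than read from the CSV file; `int_cols` and `stats` are names chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the mydata table shown in this post.
mydata = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "workshop": ["1", "2", "1", "2", "1", "2", "1", "2"],
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

# Pick the integer columns (q1, q2, q4); q3 is float because of the NaN.
int_cols = mydata.select_dtypes(include=np.integer).columns

# One call computes every statistic for every selected column.
stats = mydata[int_cols].agg(["mean", "std", "var", "median"])
print(stats)
```

Here the statistics are the rows and the variables are the columns, so `stats.loc["mean", "q1"]` reads a single cell.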

Python Programming

(withmooc >> select(withmooc.select_dtypes(include=np.number).columns.tolist())).describe()

Results

	q1		q2		q3		q4
count	8.000000	8.000000	7.000000	8.000000
mean	3.250000	2.750000	4.142857	3.250000
std	1.488048	1.752549	1.069045	1.581139
min	1.000000	1.000000	2.000000	1.000000
25%	2.000000	1.000000	4.000000	2.500000
50%	3.500000	2.500000	4.000000	3.500000
75%	4.250000	4.250000	5.000000	4.250000
max	5.000000	5.000000	5.000000	5.000000

Python Programming

(withmooc >> select(withmooc.select_dtypes(include=np.number).columns.tolist())).describe().T

Results

	count	mean		std		min	25%	50%	75%	max
q1	8.0	3.250000	1.488048	1.0	2.0	3.5	4.25	5.0
q2	8.0	2.750000	1.752549	1.0	1.0	2.5	4.25	5.0
q3	7.0	4.142857	1.069045	2.0	4.0	4.0	5.00	5.0
q4	8.0	3.250000	1.581139	1.0	2.5	3.5	4.25	5.0

Python Programming

withmooc= mydata.copy()

labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}

withmooc >> mutate(workshop = X['workshop'].apply(lambda x: labels2.get(x)))

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc= mydata.copy()

labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}

withmooc >> mutate(workshop = X['workshop'].map(labels2))

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc= mydata.copy()

withmooc = withmooc >> mutate(workshop = X['workshop'].astype('category'))
withmooc = withmooc >> mutate(workshop = X['workshop'].cat.rename_categories(["R", "SAS"]))

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4
0	1	R		f	1	1	5.0	1
1	2	SAS		f	2	1	4.0	1
2	3	R		f	2	2	4.0	3
3	4	SAS		f	3	1	NaN	3
4	5	R		m	4	5	2.0	4
5	6	SAS		m	5	4	5.0	5
6	7	R		m	5	3	4.0	4
7	8	SAS		m	4	5	5.0	5

Python Programming

withmooc >> group_by('workshop') >> \
  summarize(**{
  **{f"{x}_mean"   : X[x].mean() for x in withmooc.select_dtypes(int).columns},
  **{f"{x}_std"    : X[x].std() for x in withmooc.select_dtypes(int).columns},
  **{f"{x}_var"    : X[x].var() for x in withmooc.select_dtypes(int).columns},
  **{f"{x}_median" : X[x].median() for x in withmooc.select_dtypes(int).columns}
  })

Results

	workshop	q1_mean	q2_mean	q4_mean	q1_std		q2_std		q4_std		q1_var		q2_var		q4_var		q1_median	q2_median	q4_median
0	R		3.0	2.75	3.0	1.825742	1.707825	1.414214	3.333333	2.916667	2.000000	3.0		2.5		3.5
1	SAS		3.5	2.75	3.5	1.290994	2.061553	1.914854	1.666667	4.250000	3.666667	3.5		2.5		4.0

Python Programming

# Each statistic must reference its own column (q1 to q4).
withmooc >> group_by('workshop') >> \
  summarize(q1_mean=X.q1.mean(), q1_std=X.q1.std(),
            q2_mean=X.q2.mean(), q2_std=X.q2.std(),
            q3_mean=X.q3.mean(), q3_std=X.q3.std(),
            q4_mean=X.q4.mean(), q4_std=X.q4.std())

Results

	workshop	q1_mean	q1_std		q2_mean	q2_std		q3_mean		q3_std		q4_mean	q4_std
0	R		3.0	1.825742	2.75	1.707825	3.750000	1.258306	3.0	1.414214
1	SAS		3.5	1.290994	2.75	2.061553	4.666667	0.577350	3.5	1.914854
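Typing every statistic by hand is error-prone when there are many q variables. As a hedged alternative, the grouped means and standard deviations can be sketched with a plain pandas `groupby` and `agg`. The data frame below is an inline copy of the mydata table (an assumption, since the CSV file is not available here), and `by_workshop` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the relevant mydata columns.
mydata = pd.DataFrame({
    "workshop": ["1", "2", "1", "2", "1", "2", "1", "2"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

q_cols = ["q1", "q2", "q3", "q4"]

# Compute mean and std of every q column within each workshop.
by_workshop = mydata.groupby("workshop")[q_cols].agg(["mean", "std"])

# Flatten the two-level column index into names such as q1_mean.
by_workshop.columns = [f"{col}_{stat}" for col, stat in by_workshop.columns]
print(by_workshop)
```

Missing values are skipped by default, so the q3 statistics for workshop 2 are based on three observations.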

Python Programming

@pipe
@symbolic_evaluation()
def symbolic_double(df, *serieses):
    result = []
    for series in serieses:
        result.append(series.describe())
    return pd.DataFrame(result)

# withmooc >> symbolic_double(X.q1,X.q2,X.q3,X.q4)

withmooc >> symbolic_double(X.q1,X.q2,X.q3,X.q4)

Results

	count	mean		std		min	25%	50%	75%	max
q1	8.0	3.250000	1.488048	1.0	2.0	3.5	4.25	5.0
q2	8.0	2.750000	1.752549	1.0	1.0	2.5	4.25	5.0
q3	7.0	4.142857	1.069045	2.0	4.0	4.0	5.00	5.0
q4	8.0	3.250000	1.581139	1.0	2.5	3.5	4.25	5.0

Python Programming

@pipe
@symbolic_evaluation()
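def num_variable(df,serieses):
    result = []

    for series in serieses:
        # Describe numeric columns only; is_numeric_dtype avoids hard-coding
        # int32, which is platform dependent.
        if pd.api.types.is_numeric_dtype(df[series]):
            result.append(df[series].describe())
    return pd.DataFrame(result)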

withmooc >> num_variable(mydata.columns.tolist())

Results

	count	mean		std		min	25%	50%	75%	max
q1	8.0	3.250000	1.488048	1.0	2.0	3.5	4.25	5.0
q2	8.0	2.750000	1.752549	1.0	1.0	2.5	4.25	5.0
q3	7.0	4.142857	1.069045	2.0	4.0	4.0	5.00	5.0
q4	8.0	3.250000	1.581139	1.0	2.5	3.5	4.25	5.0

Python Programming

@pipe
@symbolic_evaluation()
def num_variable(df,serieses):
    result = []

    for series in serieses:
        # Numeric and object columns are both described; describe() picks the
        # matching statistics (mean/std or unique/top/freq) from the dtype.
        if pd.api.types.is_numeric_dtype(df[series]) or pd.api.types.is_object_dtype(df[series]):
            result.append(df[series].describe())
    return pd.DataFrame(result)

withmooc >> num_variable(mydata.columns.tolist())

Results

	count	unique	top	freq	mean		std		min	25%	50%	75%	max
id	8.0	8.0	2	1.0	NaN		NaN		NaN	NaN	NaN	NaN	NaN
gender	8.0	2.0	f	4.0	NaN		NaN		NaN	NaN	NaN	NaN	NaN
q1	8.0	NaN	NaN	NaN	3.250000	1.488048	1.0	2.0	3.5	4.25	5.0
q2	8.0	NaN	NaN	NaN	2.750000	1.752549	1.0	1.0	2.5	4.25	5.0
q3	7.0	NaN	NaN	NaN	4.142857	1.069045	2.0	4.0	4.0	5.00	5.0
q4	8.0	NaN	NaN	NaN	3.250000	1.581139	1.0	2.5	3.5	4.25	5.0
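A similar split by variable type can be sketched with the `include` and `exclude` arguments of `DataFrame.describe`, without writing a custom function. This is only a sketch under the assumption that the mydata table is recreated inline; `numeric_summary` and `text_summary` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the mydata table shown in this post.
mydata = pd.DataFrame({
    "id": ["1", "2", "3", "4", "5", "6", "7", "8"],
    "workshop": ["1", "2", "1", "2", "1", "2", "1", "2"],
    "gender": ["f", "f", "f", "f", "m", "m", "m", "m"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

# Numeric columns get count, mean, std and quantiles.
numeric_summary = mydata.describe(include=[np.number]).T

# Everything else gets count, unique, top and freq.
text_summary = mydata.describe(exclude=[np.number]).T

print(numeric_summary)
print(text_summary)
```

Keeping the two summaries apart avoids the NaN-filled cells that appear when both kinds of statistics share one table.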

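Relabel the values so that m becomes male and f becomes female.
Note that a value written in uppercase does not match either level and in effect becomes a missing value.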

Python Programming

withmooc = withmooc \
    >> mutate(gender = X.gender.astype('category')) \
    >> mutate(genderF = X.gender.cat.rename_categories(["female", "male"]))

print(withmooc.dtypes)

withmooc

Results

id            object
workshop    category
gender      category
q1             int32
q2             int32
q3           float64
q4             int32
genderF     category
dtype: object

Results

	id	workshop	gender	q1	q2	q3	q4	genderF
0	1	R		f	1	1	5.0	1	female
1	2	SAS		f	2	1	4.0	1	female
2	3	R		f	2	2	4.0	3	female
3	4	SAS		f	3	1	NaN	3	female
4	5	R		m	4	5	2.0	4	male
5	6	SAS		m	5	4	5.0	5	male
6	7	R		m	5	3	4.0	4	male
7	8	SAS		m	4	5	5.0	5	male
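To illustrate how an uppercase value ends up missing, here is a small sketch using `pd.Categorical` with explicit categories, which behaves much like declaring factor levels. The series `raw` is made-up illustration data and is not part of mydata.

```python
import pandas as pd

# Hypothetical input: one value was typed in uppercase.
raw = pd.Series(["f", "F", "m", "m"])
labels = {"f": "female", "m": "male"}

# Values outside the declared categories become NaN.
strict = pd.Series(pd.Categorical(raw, categories=["f", "m"]))
strict = strict.cat.rename_categories(labels)

# Lowercasing first keeps the observation.
cleaned = pd.Series(pd.Categorical(raw.str.lower(), categories=["f", "m"]))
cleaned = cleaned.cat.rename_categories(labels)

print(strict.isna().sum())
print(cleaned.tolist())
```

Checking `isna().sum()` before and after relabeling is a cheap way to notice values that silently fell outside the declared levels.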

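Extract the underlying numeric code of each variable.
genderNums is assigned following the alphabetical order of the variable's values.
genderFNums follows the order of the levels given to the factor function above. In R this gives f = 1 and m = 2; pandas category codes are zero-based, so here f is 0 and m is 1.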

Python Programming

mydata1= mydata.copy()

withmooc = withmooc \
    >> mutate(gender = X.gender.astype('category')) \
    >> mutate(genderF = X.gender.cat.rename_categories(["female", "male"])) \
    >> mutate(genderNums = X.gender.cat.codes) \
    >> mutate(genderFNums = X.genderF.cat.codes)

print(withmooc.dtypes)

withmooc

Results

id               object
workshop       category
gender         category
q1                int32
q2                int32
q3              float64
q4                int32
genderF        category
genderNums         int8
genderFNums        int8
dtype: object

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums
0	1	R		f	1	1	5.0	1	female	0		0
1	2	SAS		f	2	1	4.0	1	female	0		0
2	3	R		f	2	2	4.0	3	female	0		0
3	4	SAS		f	3	1	NaN	3	female	0		0
4	5	R		m	4	5	2.0	4	male	1		1
5	6	SAS		m	5	4	5.0	5	male	1		1
6	7	R		m	5	3	4.0	4	male	1		1
7	8	SAS		m	4	5	5.0	5	male	1		1
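The codes depend on the category order, not on the labels. The following sketch, using a short made-up `gender` series, compares the default alphabetical order with an explicitly declared order and shows how to get one-based codes as in R.

```python
import pandas as pd

# Hypothetical four-row example.
gender = pd.Series(["f", "f", "m", "m"])

# Default: categories are sorted, so f -> 0 and m -> 1.
alphabetical = gender.astype("category")

# Explicit order: m is declared first, so m -> 0 and f -> 1.
custom = pd.Series(pd.Categorical(gender, categories=["m", "f"]))

# Adding 1 gives one-based codes comparable to R's as.numeric(factor).
one_based = custom.cat.codes + 1

print(alphabetical.cat.codes.tolist())
print(custom.cat.codes.tolist())
print(one_based.tolist())
```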

Python Programming

mydata1= mydata.copy()

withmooc = withmooc \
    >> mutate(gender = X.gender.astype('category')) \
    >> mutate(genderF = X.gender.cat.rename_categories(["female", "male"])) \
    >> mutate(genderNums = pd.DataFrame(pd.factorize(withmooc.gender)[0])) \
    >> mutate(genderFNums = pd.DataFrame(pd.factorize(withmooc.genderF)[0]))

print(withmooc.dtypes)

withmooc

Results

id               object
workshop       category
gender         category
q1                int32
q2                int32
q3              float64
q4                int32
genderF        category
genderNums        int64
genderFNums       int64
dtype: object

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums
0	1	R		f	1	1	5.0	1	female	0		0
1	2	SAS		f	2	1	4.0	1	female	0		0
2	3	R		f	2	2	4.0	3	female	0		0
3	4	SAS		f	3	1	NaN	3	female	0		0
4	5	R		m	4	5	2.0	4	male	1		1
5	6	SAS		m	5	4	5.0	5	male	1		1
6	7	R		m	5	3	4.0	4	male	1		1
7	8	SAS		m	4	5	5.0	5	male	1		1

Create copies of the q variables to use as factors, so that their values can be counted.
Store the labels so they can be reused.
Create the new set of variables using the factor function.

Python Programming

mydata1= mydata.copy()

withmooc = withmooc \
  >> mutate(q1f = X.q1.astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])) \
  >> mutate(q2f = X.q2.astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])) \
  >> mutate(q3f = X.q3.astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})) \
  >> mutate(q4f = X.q4.astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}))

withmooc

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums	q1f			q2f			q3f		q4f
0	1	R		f	1	1	5.0	1	female	0		0		Strongly Disagree	Strongly Disagree	Strongly Agree	Strongly Disagree
1	2	SAS		f	2	1	4.0	1	female	0		0		Disagree		Strongly Disagree	Agree		Strongly Disagree
2	3	R		f	2	2	4.0	3	female	0		0		Disagree		Disagree		Agree		Neutral
3	4	SAS		f	3	1	NaN	3	female	0		0		Neutral			Strongly Disagree	NaN		Neutral
4	5	R		m	4	5	2.0	4	male	1		1		Agree			Strongly Agree		Disagree	Agree
5	6	SAS		m	5	4	5.0	5	male	1		1		Strongly Agree		Agree			Strongly Agree	Strongly Agree
6	7	R		m	5	3	4.0	4	male	1		1		Strongly Agree		Neutral			Agree		Agree
7	8	SAS		m	4	5	5.0	5	male	1		1		Agree			Strongly Agree		Strongly Agree	Strongly Agree

Python Programming

withmooc.info()

Results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   id           8 non-null      object  
 1   workshop     8 non-null      category
 2   gender       8 non-null      category
 3   q1           8 non-null      int32   
 4   q2           8 non-null      int32   
 5   q3           7 non-null      float64 
 6   q4           8 non-null      int32   
 7   genderF      8 non-null      category
 8   genderNums   8 non-null      int64   
 9   genderFNums  8 non-null      int64   
 10  q1f          8 non-null      category
 11  q2f          8 non-null      category
 12  q3f          7 non-null      category
 13  q4f          8 non-null      category
dtypes: category(7), float64(1), int32(3), int64(2), object(1)
memory usage: 1.5+ KB

Python Programming

@pipe
@symbolic_evaluation()
def qf_counts(df,serieses):
    result = []

    for series in serieses:
        result.append(df[series].value_counts())
    return pd.DataFrame(result).T

withmooc >> qf_counts(['q1f','q2f','q3f','q4f'])

Results

			q1f	q2f	q3f	q4f
Agree			2.0	1.0	3.0	2.0
Disagree		2.0	1.0	1.0	NaN
Neutral			1.0	1.0	NaN	2.0
Strongly Agree		2.0	2.0	3.0	2.0
Strongly Disagree	1.0	3.0	NaN	2.0

Create copies of the q variables to use as factors. When there are many variables, the following approach automates the work.
Once the copies are factors, their values can be counted.

Python Programming

myQlevels = [1,2,3,4,5]

myQlabels =   {1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}

print(myQlevels)
print(myQlabels)

Results

[1, 2, 3, 4, 5]
{1: 'Strongly Disagree', 2: 'Disagree', 3: 'Neutral', 4: 'Agree', 5: 'Strongly Agree'}

Extract the q variables into a separate data frame.

Python Programming

myQFvars = withmooc >> select(num_range("q", range(1,5)))

Python Programming

# Rename every variable with an f suffix to mark it as a factor.
myQFnames = ['q1f', 'q2f', 'q3f', 'q4f']

myQFvars.columns = myQFnames
myQFvars

Results

	q1f	q2f	q3f	q4f
0	1	1	5.0	1
1	2	1	4.0	1
2	2	2	4.0	3
3	3	1	NaN	3
4	4	5	2.0	4
5	5	4	5.0	5
6	5	3	4.0	4
7	4	5	5.0	5

Python Programming

mydata1= mydata.copy()

withmooc \
  >> mutate(q4f = X.q4.astype('category').cat.rename_categories(myQlabels))

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums	q1f			q2f			q3f			q4f
0	1	R		f	1	1	5.0	1	female	0		0		Strongly Disagree	Strongly Disagree	Strongly Agree	Strongly Disagree
1	2	SAS		f	2	1	4.0	1	female	0		0		Disagree		Strongly Disagree	Agree		Strongly Disagree
2	3	R		f	2	2	4.0	3	female	0		0		Disagree		Disagree		Agree		Neutral
3	4	SAS		f	3	1	NaN	3	female	0		0		Neutral			Strongly Disagree	NaN		Neutral
4	5	R		m	4	5	2.0	4	male	1		1		Agree			Strongly Agree		Disagree	Agree
5	6	SAS		m	5	4	5.0	5	male	1		1		Strongly Agree		Agree			Strongly Agree	Strongly Agree
6	7	R		m	5	3	4.0	4	male	1		1		Strongly Agree		Neutral			Agree		Agree
7	8	SAS		m	4	5	5.0	5	male	1		1		Agree			Strongly Agree		Strongly Agree	Strongly Agree

Python Programming

withmooc["q1"].astype('category').cat.rename_categories(myQlabels)

Results

0    Strongly Disagree
1             Disagree
2             Disagree
3              Neutral
4                Agree
5       Strongly Agree
6       Strongly Agree
7                Agree
Name: q1, dtype: category
Categories (5, object): ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']

Python Programming

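@pipe
@symbolic_evaluation()
def qf_labels(df,serieses):
    result = []

    # Convert each q column to a category and replace the codes with labels.
    for series in serieses:
        result.append(df[series].astype('category').cat.rename_categories(myQlabels))
    return pd.DataFrame(result).T

# Apply the labels to the extracted copy (myQFvars); the q1f-q4f columns
# in withmooc are already labeled.
myQFvars = myQFvars >> qf_labels(myQFnames)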

Python Programming

myQFvars

Results

	q1f			q2f			q3f			q4f
0	Strongly Disagree	Strongly Disagree	Strongly Agree		Strongly Disagree
1	Disagree		Strongly Disagree	Agree			Strongly Disagree
2	Disagree		Disagree		Agree			Neutral
3	Neutral			Strongly Disagree	NaN			Neutral
4	Agree			Strongly Agree		Disagree		Agree
5	Strongly Agree		Agree			Strongly Agree		Strongly Agree
6	Strongly Agree		Neutral			Agree			Agree
7	Agree			Strongly Agree		Strongly Agree		Strongly Agree
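The relabeling above can also be sketched in plain pandas with a shared `CategoricalDtype`, which keeps all five levels on every variable even when a level is never observed. This is a minimal sketch under the assumption that the q columns are retyped inline; `answers` and `likert` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumption: inline copy of the q columns from mydata.
answers = pd.DataFrame({
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q2": [1, 1, 2, 1, 5, 4, 3, 5],
    "q3": [5, 4, 4, np.nan, 2, 5, 4, 5],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
          4: "Agree", 5: "Strongly Agree"}

# One ordered dtype shared by all labeled copies.
likert = pd.CategoricalDtype(categories=list(labels.values()), ordered=True)

# Create q1f ... q4f in a loop instead of one statement per variable.
for col in ["q1", "q2", "q3", "q4"]:
    answers[col + "f"] = answers[col].map(labels).astype(likert)

print(answers.dtypes)
print(answers["q3f"].value_counts())
```

Because the dtype is shared, `value_counts` reports a zero for levels such as Neutral in q3f instead of dropping them.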

Python Programming

@pipe
@symbolic_evaluation()
def qf_counts(df,serieses):
    result = []

    for series in serieses:
        result.append(df[series].value_counts())
    return pd.DataFrame(result).T

myQFvars >> qf_counts(['q1f','q2f','q3f','q4f'])

Results

			q1f	q2f	q3f	q4f
Strongly Agree		2.0	2.0	3.0	2.0
Disagree		2.0	1.0	1.0	NaN
Agree			2.0	1.0	3.0	2.0
Strongly Disagree	1.0	3.0	NaN	2.0
Neutral			1.0	1.0	NaN	2.0
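The rows of a count table like this come back in no meaningful order when the columns hold plain strings. One possible sketch for restoring the scale order and showing zeros instead of NaN is given below; `answers` holds hand-typed labels for q1 and q3 only, and `order` is an illustrative name.

```python
import pandas as pd

order = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

# Assumption: labels for q1 and q3 typed in from the table in this post.
answers = pd.DataFrame({
    "q1f": ["Strongly Disagree", "Disagree", "Disagree", "Neutral",
            "Agree", "Strongly Agree", "Strongly Agree", "Agree"],
    "q3f": ["Strongly Agree", "Agree", "Agree", None,
            "Disagree", "Strongly Agree", "Agree", "Strongly Agree"],
})

# Count each column, then align the rows to the scale order.
counts = (answers.apply(pd.Series.value_counts)
                 .reindex(order)
                 .fillna(0)
                 .astype(int))
print(counts)
```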

Python Programming

both = withmooc >> bind_cols(myQFvars)
both

Results

	id	workshop	gender	q1	q2	q3	q4	genderF	genderNums	genderFNums	q1f			q2f			q3f		q4f			q1f			q2f			q3f		q4f
0	1	R		f	1	1	5.0	1	female	0		0		Strongly Disagree	Strongly Disagree	Strongly Agree	Strongly Disagree	Strongly Disagree	Strongly Disagree	Strongly Agree	Strongly Disagree
1	2	SAS		f	2	1	4.0	1	female	0		0		Disagree		Strongly Disagree	Agree		Strongly Disagree	Disagree		Strongly Disagree	Agree		Strongly Disagree
2	3	R		f	2	2	4.0	3	female	0		0		Disagree		Disagree		Agree		Neutral			Disagree		Disagree		Agree		Neutral
3	4	SAS		f	3	1	NaN	3	female	0		0		Neutral			Strongly Disagree	NaN		Neutral			Neutral			Strongly Disagree	NaN		Neutral
4	5	R		m	4	5	2.0	4	male	1		1		Agree			Strongly Agree		Disagree	Agree			Agree			Strongly Agree		Disagree	Agree
5	6	SAS		m	5	4	5.0	5	male	1		1		Strongly Agree		Agree			Strongly Agree	Strongly Agree		Strongly Agree		Agree			Strongly Agree	Strongly Agree
6	7	R		m	5	3	4.0	4	male	1		1		Strongly Agree		Neutral			Agree		Agree			Strongly Agree		Neutral			Agree		Agree
7	8	SAS		m	4	5	5.0	5	male	1		1		Agree			Strongly Agree		Strongly Agree	Strongly Agree		Agree			Strongly Agree		Strongly Agree	Strongly Agree
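Binding columns this way leaves two sets of columns with identical names, which makes later selection ambiguous. A small pandas sketch of the problem and one possible workaround follows; the two-row frames and the `_new` suffix are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the two data frames.
left = pd.DataFrame({"id": [1, 2], "q1f": ["Agree", "Neutral"]})
right = pd.DataFrame({"q1f": ["Agree", "Neutral"]})

# Plain column binding keeps both q1f columns under the same name.
both = pd.concat([left, right], axis=1)
print(both.columns.duplicated().any())   # True: a name is repeated
print(type(both["q1f"]))                 # a DataFrame, not a Series

# Adding a suffix to one side keeps every column name unique.
safe = pd.concat([left, right.add_suffix("_new")], axis=1)
print(list(safe.columns))
```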


License: Attribution, NonCommercial, NoDerivatives
