Contents
14. Batch Statistics by Variable Type & Value Labels or Formats (& Measurement Level)
1. Proc SQL
- SAS Program to Assign Value Labels (formats)
SAS Programming |
options linesize=150;
* SAS Program to Assign Value Labels (formats);
PROC FORMAT;
VALUE workshop_f 1="Control" 2="Treatment";
VALUE $gender_f "m"="Male" "f"="Female";
VALUE agreement 1='Strongly Disagree'
2='Disagree'
3='Neutral'
4='Agree'
5='Strongly Agree';
run;
proc sql;
select id,
workshop format=workshop_f.,
gender format=$gender_f. ,
q1 format=agreement. ,
q2 format=agreement. ,
q3 format=agreement. ,
q4 format=agreement.
from BACK.mydata;
quit;
Results |
id  workshop   gender  q1                 q2                 q3                 q4
-------------------------------------------------------------------------------------------------
 1  Control    Female  Strongly Disagree  Strongly Disagree  Strongly Agree     Strongly Disagree
 2  Treatment  Female  Disagree           Strongly Disagree  Agree              Strongly Disagree
 3  Control    Female  Disagree           Disagree           Agree              Neutral
 4  Treatment  Female  Neutral            Strongly Disagree  .                  Neutral
 5  Control    Male    Agree              Strongly Agree     Disagree           Agree
 6  Treatment  Male    Strongly Agree     Agree              Strongly Agree     Strongly Agree
 7  Control    Male    Strongly Agree     Neutral            Agree              Agree
 8  Treatment  Male    Agree              Strongly Agree     Strongly Agree     Strongly Agree
2. SAS Programming
- SAS program to assign value labels (formats)
SAS Programming |
PROC FORMAT;
VALUE workshop_f 1="Control" 2="Treatment";
VALUE $gender_f "m"="Male" "f"="Female";
VALUE agreement 1='Strongly Disagree'
2='Disagree'
3='Neutral'
4='Agree'
5='Strongly Agree';
run;
DATA withmooc;
SET BACK.mydata;
FORMAT workshop workshop_f. gender $gender_f.
q1-q4 agreement.;
run;
proc print;run;
Results |
OBS  id  workshop   gender  q1                 q2                 q3                 q4
  1   1  Control    Female  Strongly Disagree  Strongly Disagree  Strongly Agree     Strongly Disagree
  2   2  Treatment  Female  Disagree           Strongly Disagree  Agree              Strongly Disagree
  3   3  Control    Female  Disagree           Disagree           Agree              Neutral
  4   4  Treatment  Female  Neutral            Strongly Disagree  .                  Neutral
  5   5  Control    Male    Agree              Strongly Agree     Disagree           Agree
  6   6  Treatment  Male    Strongly Agree     Agree              Strongly Agree     Strongly Agree
  7   7  Control    Male    Strongly Agree     Neutral            Agree              Agree
  8   8  Treatment  Male    Agree              Strongly Agree     Strongly Agree     Strongly Agree
3. SPSS
- SPSS program to assign value labels.
SPSS Programming |
GET FILE="c:\mydata.sav".
VARIABLE LEVEL workshop (NOMINAL)
/q1 TO q4 (SCALE).
VALUE LABELS workshop 1 'Control' 2 'Treatment'
/q1 TO q4
1 'Strongly Disagree'
2 'Disagree'
3 'Neutral'
4 'Agree'
5 'Strongly Agree'.
SAVE OUTFILE="C:\mydata.sav".
4. R Programming (R-PROJECT)
R Programming |
from rpy2.robjects import r
%load_ext rpy2.ipython
Results |
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
R Programming |
%%R
options(width = 200)
library(tidyverse)
library(psych)
library(Hmisc)
mydata <- read_csv("C:/work/data/mydata.csv",
col_types = cols( id = col_double(),
workshop = col_character(),
gender = col_character(),
q1 = col_double(),
q2 = col_double(),
q3 = col_double(),
q4 = col_double()
)
)
withmooc = mydata
attach(withmooc) # make the columns of withmooc available by name.
withmooc
Results |
# A tibble: 8 x 7
id workshop gender q1 q2 q3 q4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 f 1 1 5 1
2 2 2 f 2 1 4 1
3 3 1 f 2 2 4 3
4 4 2 f 3 1 NA 3
5 5 1 m 4 5 2 4
6 6 2 m 5 4 5 5
7 7 1 m 5 3 4 4
8 8 2 m 4 5 5 5
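- R program to assign value labels and factor status.
- Because read_csv() is called with explicit col_types, workshop and gender are both read as character; neither is a factor yet.
- summary() therefore reports only length, class, and mode for workshop and gender, and numeric summaries for id and q1 to q4.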
R Programming |
%%R
base::summary(withmooc)
Results |
id workshop gender q1 q2 q3 q4
Min. :1.00 Length:8 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:2.75 Class :character Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :4.50 Mode :character Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :4.50 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:6.25 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :8.00 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
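How summary() treats a column depends on its class. The following is a minimal base R sketch of that behaviour; the vector names are hypothetical, and the Control/Treatment labels are borrowed from the SAS format above.

```r
# Hypothetical workshop codes, stored three different ways.
codes_chr <- c("1", "2", "1", "2")
codes_num <- as.numeric(codes_chr)
codes_fct <- factor(codes_chr, levels = c("1", "2"),
                    labels = c("Control", "Treatment"))

summary(codes_chr)  # character: length, class, and mode only
summary(codes_num)  # numeric: min, quartiles, mean, max
summary(codes_fct)  # factor: one count per level
```

Only the factor version yields counts per category, which is why the later steps convert workshop and the q variables to factors.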
R Programming |
%%R
dlookr::diagnose_numeric(mydata)
Results |
R[write to console]: Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
# A tibble: 5 x 10
variables min Q1 mean median Q3 max zero minus outlier
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 id 1 2.75 4.5 4.5 6.25 8 0 0 0
2 q1 1 2 3.25 3.5 4.25 5 0 0 0
3 q2 1 1 2.75 2.5 4.25 5 0 0 0
4 q3 2 4 4.14 4 5 5 0 0 1
5 q4 1 2.5 3.25 3.5 4.25 5 0 0 0
R Programming |
%%R
withmooc %>%
dlookr::diagnose_numeric()
Results |
# A tibble: 5 x 10
variables min Q1 mean median Q3 max zero minus outlier
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 id 1 2.75 4.5 4.5 6.25 8 0 0 0
2 q1 1 2 3.25 3.5 4.25 5 0 0 0
3 q2 1 1 2.75 2.5 4.25 5 0 0 0
4 q3 2 4 4.14 4 5 5 0 0 1
5 q4 1 2.5 3.25 3.5 4.25 5 0 0 0
R Programming |
%%R
withmooc %>%
dlookr::describe() %>%
as.data.frame()
Results |
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25
1 id 8 0 4.500000 2.449490 0.8660254 3.50 0.0000000 -1.200000 1 1.07 1.35 1.7 2.4 2.75
2 q1 8 0 3.250000 1.488048 0.5261043 2.25 -0.2167811 -1.410198 1 1.07 1.35 1.7 2.0 2.00
3 q2 8 0 2.750000 1.752549 0.6196197 3.25 0.2919336 -1.914116 1 1.00 1.00 1.0 1.0 1.00
4 q3 7 1 4.142857 1.069045 0.4040610 1.00 -1.5200483 2.712500 2 2.12 2.60 3.2 4.0 4.00
5 q4 8 0 3.250000 1.581139 0.5590170 1.75 -0.5421047 -1.024000 1 1.00 1.00 1.0 1.8 2.50
p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
1 3.1 3.8 4.5 5.2 5.9 6.25 6.6 7.3 7.65 7.93 8
2 2.1 2.8 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00 5
3 1.1 1.8 2.5 3.2 3.9 4.25 4.6 5.0 5.00 5.00 5
4 4.0 4.0 4.0 4.6 5.0 5.00 5.0 5.0 5.00 5.00 5
5 3.0 3.0 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00 5
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
dlookr::describe()
Results |
# A tibble: 5 x 26
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id 8 0 4.5 2.45 0.866 3.5 0 -1.2 1 1.07 1.35 1.7 2.4
2 q1 8 0 3.25 1.49 0.526 2.25 -0.217 -1.41 1 1.07 1.35 1.7 2
3 q2 8 0 2.75 1.75 0.620 3.25 0.292 -1.91 1 1 1 1 1
4 q3 7 1 4.14 1.07 0.404 1 -1.52 2.71 2 2.12 2.6 3.2 4
5 q4 8 0 3.25 1.58 0.559 1.75 -0.542 -1.02 1 1 1 1 1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
# p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
dlookr::describe() %>%
as.data.frame()
Results |
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20 p25
1 id 8 0 4.500000 2.449490 0.8660254 3.50 0.0000000 -1.200000 1 1.07 1.35 1.7 2.4 2.75
2 q1 8 0 3.250000 1.488048 0.5261043 2.25 -0.2167811 -1.410198 1 1.07 1.35 1.7 2.0 2.00
3 q2 8 0 2.750000 1.752549 0.6196197 3.25 0.2919336 -1.914116 1 1.00 1.00 1.0 1.0 1.00
4 q3 7 1 4.142857 1.069045 0.4040610 1.00 -1.5200483 2.712500 2 2.12 2.60 3.2 4.0 4.00
5 q4 8 0 3.250000 1.581139 0.5590170 1.75 -0.5421047 -1.024000 1 1.00 1.00 1.0 1.8 2.50
p30 p40 p50 p60 p70 p75 p80 p90 p95 p99 p100
1 3.1 3.8 4.5 5.2 5.9 6.25 6.6 7.3 7.65 7.93 8
2 2.1 2.8 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00 5
3 1.1 1.8 2.5 3.2 3.9 4.25 4.6 5.0 5.00 5.00 5
4 4.0 4.0 4.0 4.6 5.0 5.00 5.0 5.0 5.00 5.00 5
5 3.0 3.0 3.5 4.0 4.0 4.25 4.6 5.0 5.00 5.00 5
- Convert the workshop variable to a factor.
R Programming |
%%R
withmooc$workshop <- factor( withmooc$workshop,
levels=c(1,2,3,4),
labels=c("R","SAS","SPSS","Stata") )
withmooc
Results |
# A tibble: 8 x 7
id workshop gender q1 q2 q3 q4
<dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 R f 1 1 5 1
2 2 SAS f 2 1 4 1
3 3 R f 2 2 4 3
4 4 SAS f 3 1 NA 3
5 5 R m 4 5 2 4
6 6 SAS m 5 4 5 5
7 7 R m 5 3 4 4
8 8 SAS m 4 5 5 5
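- summary() now counts how often each workshop level occurs.
- A mean of the workshop codes would be meaningless, so counts are the appropriate summary; the unused levels SPSS and Stata appear with a count of 0.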
R Programming |
%%R
summary(withmooc)
Results |
id workshop gender q1 q2 q3 q4
Min. :1.00 R :4 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:2.75 SAS :4 Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :4.50 SPSS :0 Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :4.50 Stata:0 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:6.25 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :8.00 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
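The zero counts for SPSS and Stata come from levels that were declared but never observed. A small base R sketch, using a hypothetical four-value vector, shows how such levels appear in a table and how droplevels() removes them if they are not wanted:

```r
# Declare four levels although only codes 1 and 2 occur in the data.
workshop <- factor(c("1", "2", "1", "2"),
                   levels = c(1, 2, 3, 4),
                   labels = c("R", "SAS", "SPSS", "Stata"))

table(workshop)                  # SPSS and Stata are listed with a count of 0

workshop_used <- droplevels(workshop)
levels(workshop_used)            # only the levels that actually occur
```

Keeping the unused levels can be useful when later data may contain them; dropping them keeps tables and plots compact.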
- Use the describe function from the Hmisc package.
- Unlike summary, describe reports frequencies and proportions as well as the mean for the q variables.
- The Hmisc package must be installed before describe can be used.
R Programming |
%%R
Hmisc::describe(withmooc)
Results |
withmooc
7 Variables 8 Observations
------------------------------------------------------------------------------------------------------------------------------------------------------
id
n missing distinct Info Mean Gmd
8 0 8 1 4.5 3
lowest : 1 2 3 4 5, highest: 4 5 6 7 8
Value 1 2 3 4 5 6 7 8
Frequency 1 1 1 1 1 1 1 1
Proportion 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
------------------------------------------------------------------------------------------------------------------------------------------------------
workshop
n missing distinct
8 0 2
Value R SAS
Frequency 4 4
Proportion 0.5 0.5
------------------------------------------------------------------------------------------------------------------------------------------------------
gender
n missing distinct
8 0 2
Value f m
Frequency 4 4
Proportion 0.5 0.5
------------------------------------------------------------------------------------------------------------------------------------------------------
q1
n missing distinct Info Mean Gmd
8 0 5 0.964 3.25 1.786
lowest : 1 2 3 4 5, highest: 1 2 3 4 5
Value 1 2 3 4 5
Frequency 1 2 1 2 2
Proportion 0.125 0.250 0.125 0.250 0.250
------------------------------------------------------------------------------------------------------------------------------------------------------
q2
n missing distinct Info Mean Gmd
8 0 5 0.94 2.75 2.071
lowest : 1 2 3 4 5, highest: 1 2 3 4 5
Value 1 2 3 4 5
Frequency 3 1 1 1 2
Proportion 0.375 0.125 0.125 0.125 0.250
------------------------------------------------------------------------------------------------------------------------------------------------------
q3
n missing distinct Info Mean Gmd
7 1 3 0.857 4.143 1.143
Value 2 4 5
Frequency 1 3 3
Proportion 0.143 0.429 0.429
------------------------------------------------------------------------------------------------------------------------------------------------------
q4
n missing distinct Info Mean Gmd
8 0 4 0.952 3.25 1.857
Value 1 3 4 5
Frequency 2 2 2 2
Proportion 0.25 0.25 0.25 0.25
------------------------------------------------------------------------------------------------------------------------------------------------------
R Programming |
%%R
describeData(withmooc)
Results |
n.obs = 8 of which 7 are complete cases. Number of variables = 7 of which all are numeric FALSE
variable # n.obs type H1 H2 H3 H4 T1 T2 T3 T4
id* 1 8 4 1 2 3 4 5 6 7 8
workshop* 2 8 4 R SAS R SAS R SAS R SAS
gender* 3 8 4 f f f f m m m m
q1* 4 8 4 1 2 2 3 4 5 5 4
q2* 5 8 4 1 1 2 1 5 4 3 5
q3* 6 7 4 5 4 4 <NA> 2 5 4 5
q4* 7 8 4 1 1 3 3 4 5 4 5
- Check how the levels map to the underlying values.
R Programming |
%%R
unclass(withmooc$workshop)
Results |
[1] 1 2 1 2 1 2 1 2
attr(,"levels")
[1] "R" "SAS" "SPSS" "Stata"
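- Label m as male and f as female, listing m first so that the level order changes.
- Level matching is case-sensitive: if the values were uppercase (M, F), they would match no level and become missing values.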
R Programming |
%%R
withmooc$genderF <- factor( withmooc$gender,
levels=c("m","f"),labels=c("male","female") )
withmooc
Results |
# A tibble: 8 x 8
id workshop gender q1 q2 q3 q4 genderF
<dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <fct>
1 1 R f 1 1 5 1 female
2 2 SAS f 2 1 4 1 female
3 3 R f 2 2 4 3 female
4 4 SAS f 3 1 NA 3 female
5 5 R m 4 5 2 4 male
6 6 SAS m 5 4 5 5 male
7 7 R m 5 3 4 4 male
8 8 SAS m 4 5 5 5 male
- Print gender and genderF side by side to verify the mapping.
R Programming |
%%R
withmooc[ ,c("gender","genderF")]
Results |
# A tibble: 8 x 2
gender genderF
<chr> <fct>
1 f female
2 f female
3 f female
4 f female
5 m male
6 m male
7 m male
8 m male
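The case-sensitivity point can be checked directly. This is a sketch with a hypothetical mixed-case vector (gender_raw is not part of the post's data); lowercasing with tolower() before the conversion is one possible remedy, not the only one.

```r
gender_raw <- c("f", "F", "m", "M")   # hypothetical mixed-case input

# Values that match none of the declared levels become NA.
strict <- factor(gender_raw, levels = c("m", "f"),
                 labels = c("male", "female"))
strict

# Normalising the case first avoids the missing values.
relaxed <- factor(tolower(gender_raw), levels = c("m", "f"),
                  labels = c("male", "female"))
relaxed
```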
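- Extract the underlying numeric values of each.
- genderNums is NA in every row because gender is still a character vector here, and as.numeric cannot convert f or m; had gender been a factor, its codes would follow alphabetical order (f = 1, m = 2).
- genderFNums follows the order given in levels in the factor call above, so m is 1 and f is 2.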
R Programming |
%%R
withmooc$genderNums <- as.numeric(withmooc$gender)
withmooc$genderFNums <- as.numeric(withmooc$genderF)
withmooc
Results |
# A tibble: 8 x 10
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
<dbl> <fct> <chr> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 1 R f 1 1 5 1 female NA 2
2 2 SAS f 2 1 4 1 female NA 2
3 3 R f 2 2 4 3 female NA 2
4 4 SAS f 3 1 NA 3 female NA 2
5 5 R m 4 5 2 4 male NA 1
6 6 SAS m 5 4 5 5 male NA 1
7 7 R m 5 3 4 4 male NA 1
8 8 SAS m 4 5 5 5 male NA 1
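The difference between the two new columns can be reproduced on a short vector. The sketch below uses a hypothetical four-element gender vector and suppresses the coercion warning that as.numeric() raises for character input.

```r
gender <- c("f", "f", "m", "m")

# Character strings that are not numbers cannot be converted: all NA.
gender_nums <- suppressWarnings(as.numeric(gender))

# A factor with default levels is coded alphabetically: f = 1, m = 2.
gender_fct <- factor(gender)
as.numeric(gender_fct)

# With levels = c("m", "f") the coding follows that order: m = 1, f = 2.
genderF <- factor(gender, levels = c("m", "f"), labels = c("male", "female"))
genderF_nums <- as.numeric(genderF)
genderF_nums
```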
- Create factor copies of the q variables so that their values can be counted.
- Store the levels and labels so they can be reused.
R Programming |
%%R
myQlevels <- c(1,2,3,4,5)
myQlabels <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
- Create the new set of variables with the factor function.
R Programming |
%%R
withmooc$q1f <- factor(withmooc$q1, myQlevels, myQlabels)
withmooc$q2f <- factor(withmooc$q2, myQlevels, myQlabels)
withmooc$q3f <- factor(withmooc$q3, myQlevels, myQlabels)
withmooc$q4f <- factor(withmooc$q4, myQlevels, myQlabels)
as.data.frame(withmooc)
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums q1f q2f q3f q4f
1 1 R f 1 1 5 1 female NA 2 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 2 SAS f 2 1 4 1 female NA 2 Disagree Strongly Disagree Agree Strongly Disagree
3 3 R f 2 2 4 3 female NA 2 Disagree Disagree Agree Neutral
4 4 SAS f 3 1 NA 3 female NA 2 Neutral Strongly Disagree <NA> Neutral
5 5 R m 4 5 2 4 male NA 1 Agree Strongly Agree Disagree Agree
6 6 SAS m 5 4 5 5 male NA 1 Strongly Agree Agree Strongly Agree Strongly Agree
7 7 R m 5 3 4 4 male NA 1 Strongly Agree Neutral Agree Agree
8 8 SAS m 4 5 5 5 male NA 1 Agree Strongly Agree Strongly Agree Strongly Agree
- Result of the summary function.
R Programming |
%%R
summary( withmooc[ c("q1f","q2f","q3f","q4f") ] )
Results |
q1f q2f q3f q4f
Strongly Disagree:1 Strongly Disagree:3 Strongly Disagree:0 Strongly Disagree:2
Disagree :2 Disagree :1 Disagree :1 Disagree :0
Neutral :1 Neutral :1 Neutral :0 Neutral :2
Agree :2 Agree :1 Agree :3 Agree :2
Strongly Agree :2 Strongly Agree :2 Strongly Agree :3 Strongly Agree :2
NA's :1
- Create factor copies of the q variables so that they can be counted; with many variables, the steps below automate the work.
R Programming |
%%R
myQlevels <- c(1,2,3,4,5)
myQlabels <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
print(myQlevels)
print(myQlabels)
Results |
[1] 1 2 3 4 5
[1] "Strongly Disagree" "Disagree" "Neutral" "Agree" "Strongly Agree"
- Create two sets of variable names to use.
R Programming |
%%R
myQnames <- paste( "q", 1:4, sep="")
myQFnames <- paste( "qf", 1:4, sep="")
print(myQnames) # original variable names.
print(myQFnames) # names for the new factor variables.
Results |
[1] "q1" "q2" "q3" "q4"
[1] "qf1" "qf2" "qf3" "qf4"
- Extract the q variables into a separate data frame.
R Programming |
%%R
myQFvars <- withmooc[ ,myQnames]
print(myQFvars)
Results |
# A tibble: 8 x 4
q1 q2 q3 q4
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 1
2 2 1 4 1
3 2 2 4 3
4 3 1 NA 3
5 4 5 2 4
6 5 4 5 5
7 5 3 4 4
8 4 5 5 5
- Rename the variables, adding f to mark them as factors.
R Programming |
%%R
names(myQFvars) <- myQFnames
print(myQFvars)
Results |
# A tibble: 8 x 4
qf1 qf2 qf3 qf4
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 1
2 2 1 4 1
3 2 2 4 3
4 3 1 NA 3
5 4 5 2 4
6 5 4 5 5
7 5 3 4 4
8 4 5 5 5
- Create a function to apply the labels to many variables.
R Programming |
%%R
myLabeler <- function(x) { factor(x, myQlevels, myQlabels) }
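- Check how the function applies to a single variable. Use double brackets so that a vector, not a one-column tibble, is passed to factor.
R Programming |
%%R
summary( myLabeler(myQFvars[["qf1"]]) )
Results |
Strongly Disagree          Disagree           Neutral             Agree    Strongly Agree
                1                 2                 1                 2                 2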
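Whether single or double brackets are used matters when one column is handed to factor(). The sketch below uses a plain data.frame holding the q1 values from this post; the object names q, wrong, and right are illustrative.

```r
q <- data.frame(qf1 = c(1, 2, 2, 3, 4, 5, 5, 4))   # q1 values from the post
levs <- 1:5
labs <- c("Strongly Disagree", "Disagree", "Neutral",
          "Agree", "Strongly Agree")

# q["qf1"] is a one-column data frame (a list), so factor() sees a single
# element that matches no level and returns one NA.
wrong <- factor(q["qf1"], levs, labs)
length(wrong)

# q[["qf1"]] is the numeric vector itself, so each value is labelled.
right <- factor(q[["qf1"]], levs, labs)
summary(right)
```

A summary consisting only of zeros plus a single NA is therefore a sign that a data frame, rather than a vector, reached factor().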
- Apply it to all of the variables.
R Programming |
%%R
myQFvars[ ,myQFnames] <- lapply( myQFvars[ ,myQFnames ], myLabeler )
myQFvars
Results |
# A tibble: 8 x 4
qf1 qf2 qf3 qf4
<fct> <fct> <fct> <fct>
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree Strongly Disagree Agree Strongly Disagree
3 Disagree Disagree Agree Neutral
4 Neutral Strongly Disagree <NA> Neutral
5 Agree Strongly Agree Disagree Agree
6 Strongly Agree Agree Strongly Agree Strongly Agree
7 Strongly Agree Neutral Agree Agree
8 Agree Strongly Agree Strongly Agree Strongly Agree
- Result of the summary function.
R Programming |
%%R
summary(myQFvars)
Results |
qf1 qf2 qf3 qf4
Strongly Disagree:1 Strongly Disagree:3 Strongly Disagree:0 Strongly Disagree:2
Disagree :2 Disagree :1 Disagree :1 Disagree :0
Neutral :1 Neutral :1 Neutral :0 Neutral :2
Agree :2 Agree :1 Agree :3 Agree :2
Strongly Agree :2 Strongly Agree :2 Strongly Agree :3 Strongly Agree :2
NA's :1
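The extract, rename, label, and bind steps can also be collapsed into one assignment. This is a sketch in base R on a hypothetical four-row data frame called survey; the grep() pattern assumes the item columns are named q followed by digits.

```r
survey <- data.frame(q1 = c(1, 2, 2, 3),
                     q2 = c(1, 1, 2, 1),
                     q3 = c(5, 4, 4, NA))
labs <- c("Strongly Disagree", "Disagree", "Neutral",
          "Agree", "Strongly Agree")
labeler <- function(x) factor(x, levels = 1:5, labels = labs)

# Find the item columns by name, then add a labelled copy of each one.
q_cols <- grep("^q[0-9]+$", names(survey), value = TRUE)
survey[paste0(q_cols, "f")] <- lapply(survey[q_cols], labeler)

str(survey)
```

The original numeric columns stay in place, so means can still be computed on them while the factor copies are used for counts.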
- Bind the new variables to withmooc.
R Programming |
%%R
withmooc<-cbind(withmooc,myQFvars)
withmooc
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums q1f q2f q3f q4f
1 1 R f 1 1 5 1 female NA 2 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 2 SAS f 2 1 4 1 female NA 2 Disagree Strongly Disagree Agree Strongly Disagree
3 3 R f 2 2 4 3 female NA 2 Disagree Disagree Agree Neutral
4 4 SAS f 3 1 NA 3 female NA 2 Neutral Strongly Disagree <NA> Neutral
5 5 R m 4 5 2 4 male NA 1 Agree Strongly Agree Disagree Agree
6 6 SAS m 5 4 5 5 male NA 1 Strongly Agree Agree Strongly Agree Strongly Agree
7 7 R m 5 3 4 4 male NA 1 Strongly Agree Neutral Agree Agree
8 8 SAS m 4 5 5 5 male NA 1 Agree Strongly Agree Strongly Agree Strongly Agree
qf1 qf2 qf3 qf4
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree Strongly Disagree Agree Strongly Disagree
3 Disagree Disagree Agree Neutral
4 Neutral Strongly Disagree <NA> Neutral
5 Agree Strongly Agree Disagree Agree
6 Strongly Agree Agree Strongly Agree Strongly Agree
7 Strongly Agree Neutral Agree Agree
8 Agree Strongly Agree Strongly Agree Strongly Agree
5. R - Tidyverse
R Programming |
%%R
library(tidyverse)
library(psych)
mydata <- read_csv("C:/work/data/mydata.csv",
col_types = cols( id = col_double(),
workshop = col_character(),
gender = col_character(),
q1 = col_double(),
q2 = col_double(),
q3 = col_double(),
q4 = col_double()
)
)
withmooc = mydata
attach(withmooc) # make the columns of withmooc available by name.
withmooc
Results |
# A tibble: 8 x 7
id workshop gender q1 q2 q3 q4
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 f 1 1 5 1
2 2 2 f 2 1 4 1
3 3 1 f 2 2 4 3
4 4 2 f 3 1 NA 3
5 5 1 m 4 5 2 4
6 6 2 m 5 4 5 5
7 7 1 m 5 3 4 4
8 8 2 m 4 5 5 5
- As in section 4, read_csv() with explicit col_types reads workshop and gender as character, not as factors.
- Alternatively, the data can be stored in one long text string.
R Programming |
%%R
mystring<-("id,workshop,gender,q1,q2,q3,q4
1,1,f,1,1,5,1
2,2,f,2,1,4,1
3,1,f,2,2,4,3
4,2,f,3,1, ,3
5,1,m,4,5,2,4
6,2,m,5,4,5,5
7,1,m,5,3,4,4
8,2,m,4,5,5,5")
mystring
Results |
[1] "id,workshop,gender,q1,q2,q3,q4\n1,1,f,1,1,5,1\n2,2,f,2,1,4,1\n3,1,f,2,2,4,3\n4,2,f,3,1, ,3\n5,1,m,4,5,2,4\n6,2,m,5,4,5,5\n7,1,m,5,3,4,4\n8,2,m,4,5,5,5"
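- Instead of a file location, use the textConnection function to read mystring (a long character vector defined in the program) as if it were a text file. The result is stored under a separate name so that the withmooc tibble read above remains available for the steps that follow.
R Programming |
%%R
withmooc_txt<-read.table(textConnection(mystring),
header=TRUE,sep=",",row.names="id")
withmooc_txt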
Results |
workshop gender q1 q2 q3 q4
1 1 f 1 1 5 1
2 2 f 2 1 4 1
3 1 f 2 2 4 3
4 2 f 3 1 NA 3
5 1 m 4 5 2 4
6 2 m 5 4 5 5
7 1 m 5 3 4 4
8 2 m 4 5 5 5
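read.table() also accepts the string directly through its text argument, which opens the text connection internally. A minimal sketch with a shortened, hypothetical three-row version of the data, in which one q3 field is left empty:

```r
csv_text <- "id,workshop,gender,q1,q2,q3,q4
1,1,f,1,1,5,1
4,2,f,3,1,,3
5,1,m,4,5,2,4"

# text = is read exactly like the contents of a file.
from_text <- read.table(text = csv_text, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)

from_text
sapply(from_text, class)   # gender is character, the rest are numeric types
```

The empty q3 field in the second data row is read as NA because blank fields in numeric columns are treated as missing.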
- workshop and gender are still character columns, so summary reports only their length, class, and mode; the numeric columns get quartiles and the mean.
R Programming |
%%R
summary(withmooc)
Results |
id workshop gender q1 q2 q3 q4
Min. :1.00 Length:8 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:2.75 Class :character Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :4.50 Mode :character Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :4.50 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:6.25 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :8.00 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
R Programming |
%%R
withmooc %>%
dlookr::diagnose_numeric()
Results |
# A tibble: 5 x 10
variables min Q1 mean median Q3 max zero minus outlier
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int>
1 id 1 2.75 4.5 4.5 6.25 8 0 0 0
2 q1 1 2 3.25 3.5 4.25 5 0 0 0
3 q2 1 1 2.75 2.5 4.25 5 0 0 0
4 q3 2 4 4.14 4 5 5 0 0 1
5 q4 1 2.5 3.25 3.5 4.25 5 0 0 0
R Programming |
%%R
withmooc %>%
dlookr::describe()
Results |
# A tibble: 5 x 26
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id 8 0 4.5 2.45 0.866 3.5 0 -1.2 1 1.07 1.35 1.7 2.4
2 q1 8 0 3.25 1.49 0.526 2.25 -0.217 -1.41 1 1.07 1.35 1.7 2
3 q2 8 0 2.75 1.75 0.620 3.25 0.292 -1.91 1 1 1 1 1
4 q3 7 1 4.14 1.07 0.404 1 -1.52 2.71 2 2.12 2.6 3.2 4
5 q4 8 0 3.25 1.58 0.559 1.75 -0.542 -1.02 1 1 1 1 1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
# p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
dlookr::describe()
Results |
# A tibble: 5 x 26
variable n na mean sd se_mean IQR skewness kurtosis p00 p01 p05 p10 p20
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 id 8 0 4.5 2.45 0.866 3.5 0 -1.2 1 1.07 1.35 1.7 2.4
2 q1 8 0 3.25 1.49 0.526 2.25 -0.217 -1.41 1 1.07 1.35 1.7 2
3 q2 8 0 2.75 1.75 0.620 3.25 0.292 -1.91 1 1 1 1 1
4 q3 7 1 4.14 1.07 0.404 1 -1.52 2.71 2 2.12 2.6 3.2 4
5 q4 8 0 3.25 1.58 0.559 1.75 -0.542 -1.02 1 1 1 1 1.8
# ... with 12 more variables: p25 <dbl>, p30 <dbl>, p40 <dbl>, p50 <dbl>, p60 <dbl>, p70 <dbl>,
# p75 <dbl>, p80 <dbl>, p90 <dbl>, p95 <dbl>, p99 <dbl>, p100 <dbl>
R Programming |
%%R
print(packageVersion("tidyr"))
print(packageVersion("dplyr"))
Results |
[1] '1.1.1'
[1] '1.0.2'
- If the error below occurs, restart the session and run the code again; the exact cause is unknown.
- Error: Input must be a vector, not a describe object.
- Run rlang::last_error() to see where the error occurred.
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
purrr::map_df(.x = ., .f = psych::describe) %>% # apply the descriptive-statistics function to each column
base::transform(vars = colnames(purrr::keep(.x = withmooc,
.p = is.numeric)))
Results |
vars n mean sd median trimmed mad min max range skew
X1...1 id 8 4.500000 2.449490 4.5 4.500000 2.9652 1 8 7 0.0000000
X1...2 q1 8 3.250000 1.488048 3.5 3.250000 2.2239 1 5 4 -0.1422626
X1...3 q2 8 2.750000 1.752549 2.5 2.750000 2.2239 1 5 4 0.1915814
X1...4 q3 7 4.142857 1.069045 4.0 4.142857 1.4826 2 5 3 -0.9306418
X1...5 q4 8 3.250000 1.581139 3.5 3.250000 1.4826 1 5 4 -0.3557562
kurtosis se
X1...1 -1.6510417 0.8660254
X1...2 -1.7276762 0.5261043
X1...3 -1.9113964 0.6196197
X1...4 -0.5165816 0.4040610
X1...5 -1.5868750 0.5590170
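The same idea, selecting columns by type and then computing statistics for all of them in one pass, can be sketched in base R without purrr or psych. The data frame below repeats a subset of this post's columns; the choice of n, mean, and sd is only an illustration.

```r
mydata <- data.frame(
  id     = 1:8,
  gender = c("f", "f", "f", "f", "m", "m", "m", "m"),
  q1     = c(1, 2, 2, 3, 4, 5, 5, 4),
  q3     = c(5, 4, 4, NA, 2, 5, 4, 5)
)

# Keep only the numeric columns, then summarise each one.
num_cols <- Filter(is.numeric, mydata)
stats <- t(sapply(num_cols, function(x) {
  c(n    = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    sd   = sd(x, na.rm = TRUE))
}))

stats   # one row per numeric variable; gender is excluded
```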
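- Convert the workshop variable to a factor.
- summary() then counts how often each workshop level occurs.
- A mean of the workshop codes would be meaningless; the unused levels SPSS and Stata appear with a count of 0.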
R Programming |
%%R
withmooc %>%
mutate(workshop = factor(workshop,
levels=c(1,2,3,4),
labels=c("R","SAS","SPSS","Stata"))) %>%
summary()
Results |
id workshop gender q1 q2 q3 q4
Min. :1.00 R :4 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:2.75 SAS :4 Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :4.50 SPSS :0 Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :4.50 Stata:0 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:6.25 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :8.00 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
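- Use the describe function from the psych package, the only describe loaded in this section.
- Unlike summary, psych's describe reports n, mean, sd, median, skew, kurtosis, and se for every column.
- Factor and character columns are marked with an asterisk, and their statistics are computed on the underlying integer codes, so they should not be interpreted.
R Programming |
%%R
withmooc %>%
mutate(workshop = factor(workshop,
levels=c(1,2,3,4),
labels=c("R","SAS","SPSS","Stata"))) %>%
psych::describe()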
Results |
vars n mean sd median trimmed mad min max range skew kurtosis se
id 1 8 4.50 2.45 4.5 4.50 2.97 1 8 7 0.00 -1.65 0.87
workshop* 2 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
gender* 3 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
q1 4 8 3.25 1.49 3.5 3.25 2.22 1 5 4 -0.14 -1.73 0.53
q2 5 8 2.75 1.75 2.5 2.75 2.22 1 5 4 0.19 -1.91 0.62
q3 6 7 4.14 1.07 4.0 4.14 1.48 2 5 3 -0.93 -0.52 0.40
q4 7 8 3.25 1.58 3.5 3.25 1.48 1 5 4 -0.36 -1.59 0.56
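- Check how the levels map to the values. gender is still a character vector at this point, so unclass simply returns the strings.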
R Programming |
%%R
unclass(withmooc$gender)
Results |
[1] "f" "f" "f" "f" "m" "m" "m" "m"
- Convert gender to a factor, and create genderF with m labelled male and f labelled female, listing m first so that the level order changes.
- Level matching is case-sensitive: if the values were uppercase, they would match no level and become missing values.
R Programming |
%%R
withmooc<-withmooc %>%
mutate(gender = factor(gender,levels=c("f","m"),labels=c("f","m")),
genderF = factor(gender,levels=c("m","f"),labels=c("male","female")))
withmooc
Results |
# A tibble: 8 x 8
id workshop gender q1 q2 q3 q4 genderF
<dbl> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
1 1 1 f 1 1 5 1 female
2 2 2 f 2 1 4 1 female
3 3 1 f 2 2 4 3 female
4 4 2 f 3 1 NA 3 female
5 5 1 m 4 5 2 4 male
6 6 2 m 5 4 5 5 male
7 7 1 m 5 3 4 4 male
8 8 2 m 4 5 5 5 male
R Programming |
%%R
print(unclass(withmooc$gender))
unclass(withmooc$genderF)
Results |
[1] 1 1 1 1 2 2 2 2
attr(,"levels")
[1] "f" "m"
[1] 2 2 2 2 1 1 1 1
attr(,"levels")
[1] "male" "female"
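- Extract the underlying numeric values of each.
- genderNums follows the alphabetical level order of gender: f is 1 and m is 2.
- genderFNums follows the order given in levels in the factor call above: m is 1 and f is 2.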
R Programming |
%%R
withmooc$genderNums <- as.numeric(withmooc$gender)
withmooc$genderFNums <- as.numeric(withmooc$genderF)
# Check the values actually assigned.
withmooc
Results |
# A tibble: 8 x 10
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
<dbl> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 1 1 f 1 1 5 1 female 1 2
2 2 2 f 2 1 4 1 female 1 2
3 3 1 f 2 2 4 3 female 1 2
4 4 2 f 3 1 NA 3 female 1 2
5 5 1 m 4 5 2 4 male 2 1
6 6 2 m 5 4 5 5 male 2 1
7 7 1 m 5 3 4 4 male 2 1
8 8 2 m 4 5 5 5 male 2 1
- Create factor copies of the q variables so that their values can be counted.
- Store the levels and labels so they can be reused.
R Programming |
%%R
myQlevels <- c(1,2,3,4,5)
# Store the labels so they can be reused.
myQlabels <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
- Create the new set of variables with the factor function.
R Programming |
%%R
withmooc %>%
mutate(q1f = factor(q1, myQlevels, myQlabels),
q2f = factor(q2, myQlevels, myQlabels),
q3f = factor(q3, myQlevels, myQlabels),
q4f = factor(q4, myQlevels, myQlabels) ) %>%
select(q1f,q2f,q3f,q4f) %>%
summary()
Results |
q1f q2f q3f q4f
Strongly Disagree:1 Strongly Disagree:3 Strongly Disagree:0 Strongly Disagree:2
Disagree :2 Disagree :1 Disagree :1 Disagree :0
Neutral :1 Neutral :1 Neutral :0 Neutral :2
Agree :2 Agree :1 Agree :3 Agree :2
Strongly Agree :2 Strongly Agree :2 Strongly Agree :3 Strongly Agree :2
NA's :1
- Create factor copies of the q variables so that they can be counted; with many variables, the steps below automate the work.
R Programming |
%%R
myQlevels <- c(1,2,3,4,5)
myQlabels <- c("Strongly Disagree",
"Disagree",
"Neutral",
"Agree",
"Strongly Agree")
print(myQlevels)
print(myQlabels)
Results |
[1] 1 2 3 4 5
[1] "Strongly Disagree" "Disagree" "Neutral" "Agree" "Strongly Agree"
- Create two sets of variable names to use.
R Programming |
%%R
myQnames <- paste( "q", 1:4, sep="")
myQFnames <- paste( "qf", 1:4, sep="")
print(myQnames) # original variable names.
print(myQFnames) # names for the new factor variables.
Results |
[1] "q1" "q2" "q3" "q4"
[1] "qf1" "qf2" "qf3" "qf4"
- Create a function to apply the labels to many variables.
R Programming |
%%R
myLabeler <- function(x) { factor(x, myQlevels, myQlabels) }
- Check how the function applies to a single variable.
R Programming |
%%R
withmooc %>%
mutate(qf1 = myLabeler(q1))
Results |
# A tibble: 8 x 11
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums qf1
<dbl> <chr> <fct> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <fct>
1 1 1 f 1 1 5 1 female 1 2 Strongly Disagree
2 2 2 f 2 1 4 1 female 1 2 Disagree
3 3 1 f 2 2 4 3 female 1 2 Disagree
4 4 2 f 3 1 NA 3 female 1 2 Neutral
5 5 1 m 4 5 2 4 male 2 1 Agree
6 6 2 m 5 4 5 5 male 2 1 Strongly Agree
7 7 1 m 5 3 4 4 male 2 1 Strongly Agree
8 8 2 m 4 5 5 5 male 2 1 Agree
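- Apply it to all of the variables.
- map applies the function to each variable, and the results are reassembled into a single table.
- transmute() creates new variables and drops the existing ones.
- keep(is.numeric) also passes id, genderNums, and genderFNums to myLabeler, so id values above 5 become NA; selecting only the q columns, as in later blocks, avoids this.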
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
purrr::map(myLabeler) %>%
as_tibble()
Results |
# A tibble: 8 x 7
id q1 q2 q3 q4 genderNums genderFNums
<fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree Strongly Disagree Disagree
2 Disagree Disagree Strongly Disagree Agree Strongly Disagree Strongly Disagree Disagree
3 Neutral Disagree Disagree Agree Neutral Strongly Disagree Disagree
4 Agree Neutral Strongly Disagree <NA> Neutral Strongly Disagree Disagree
5 Strongly Agree Agree Strongly Agree Disagree Agree Disagree Strongly Disagree
6 <NA> Strongly Agree Agree Strongly Agree Strongly Agree Disagree Strongly Disagree
7 <NA> Strongly Agree Neutral Agree Agree Disagree Strongly Disagree
8 <NA> Agree Strongly Agree Strongly Agree Strongly Agree Disagree Strongly Disagree
R Programming |
%%R
withmooc %>%
select(starts_with("q")) %>% # keep only the columns whose names start with q
purrr::map_dfc(myLabeler)
Results |
# A tibble: 8 x 4
q1 q2 q3 q4
<fct> <fct> <fct> <fct>
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree Strongly Disagree Agree Strongly Disagree
3 Disagree Disagree Agree Neutral
4 Neutral Strongly Disagree <NA> Neutral
5 Agree Strongly Agree Disagree Agree
6 Strongly Agree Agree Strongly Agree Strongly Agree
7 Strongly Agree Neutral Agree Agree
8 Agree Strongly Agree Strongly Agree Strongly Agree
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # keep only numeric columns
purrr::map_df(.x = .,
.f = myLabeler)
Results |
# A tibble: 8 x 7
id q1 q2 q3 q4 genderNums genderFNums
<fct> <fct> <fct> <fct> <fct> <fct> <fct>
1 Strongly Disagree Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree Strongly Disagree Disagree
2 Disagree Disagree Strongly Disagree Agree Strongly Disagree Strongly Disagree Disagree
3 Neutral Disagree Disagree Agree Neutral Strongly Disagree Disagree
4 Agree Neutral Strongly Disagree <NA> Neutral Strongly Disagree Disagree
5 Strongly Agree Agree Strongly Agree Disagree Agree Disagree Strongly Disagree
6 <NA> Strongly Agree Agree Strongly Agree Strongly Agree Disagree Strongly Disagree
7 <NA> Strongly Agree Neutral Agree Agree Disagree Strongly Disagree
8 <NA> Agree Strongly Agree Strongly Agree Strongly Agree Disagree Strongly Disagree
R Programming |
%%R
withmooc %>% mutate_at( (withmooc %>%
select(starts_with("q")) %>%
colnames()),
myLabeler)
Results |
# A tibble: 8 x 10
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
<dbl> <chr> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
1 1 1 f Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female 1 2
2 2 2 f Disagree Strongly Disagree Agree Strongly Disagree female 1 2
3 3 1 f Disagree Disagree Agree Neutral female 1 2
4 4 2 f Neutral Strongly Disagree <NA> Neutral female 1 2
5 5 1 m Agree Strongly Agree Disagree Agree male 2 1
6 6 2 m Strongly Agree Agree Strongly Agree Strongly Agree male 2 1
7 7 1 m Strongly Agree Neutral Agree Agree male 2 1
8 8 2 m Agree Strongly Agree Strongly Agree Strongly Agree male 2 1
R Programming |
%%R
withmooc %>% mutate_at( vars(starts_with("q")),
myLabeler)
Results |
# A tibble: 8 x 10
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
<dbl> <chr> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
1 1 1 f Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female 1 2
2 2 2 f Disagree Strongly Disagree Agree Strongly Disagree female 1 2
3 3 1 f Disagree Disagree Agree Neutral female 1 2
4 4 2 f Neutral Strongly Disagree <NA> Neutral female 1 2
5 5 1 m Agree Strongly Agree Disagree Agree male 2 1
6 6 2 m Strongly Agree Agree Strongly Agree Strongly Agree male 2 1
7 7 1 m Strongly Agree Neutral Agree Agree male 2 1
8 8 2 m Agree Strongly Agree Strongly Agree Strongly Agree male 2 1
- Writing the function inline
R Programming |
%%R
withmooc %>% mutate_at( vars(starts_with("q")),
~ factor(.x, myQlevels, myQlabels))
Results |
# A tibble: 8 x 10
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
<dbl> <chr> <fct> <fct> <fct> <fct> <fct> <fct> <dbl> <dbl>
1 1 1 f Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree female 1 2
2 2 2 f Disagree Strongly Disagree Agree Strongly Disagree female 1 2
3 3 1 f Disagree Disagree Agree Neutral female 1 2
4 4 2 f Neutral Strongly Disagree <NA> Neutral female 1 2
5 5 1 m Agree Strongly Agree Disagree Agree male 2 1
6 6 2 m Strongly Agree Agree Strongly Agree Strongly Agree male 2 1
7 7 1 m Strongly Agree Neutral Agree Agree male 2 1
8 8 2 m Agree Strongly Agree Strongly Agree Strongly Agree male 2 1
R Programming |
%%R
withmooc %>%
select(q1,q2,q3,q4) %>%
purrr::map_dfc(~ withmooc %>% transmute( {{.x}} := myLabeler(.x))) %>%
set_names(c('q1','q2','q3','q4'))
Results |
R[write to console]: New names:
* `..1` -> ...1
* `..1` -> ...2
* `..1` -> ...3
* `..1` -> ...4
# A tibble: 8 x 4
q1 q2 q3 q4
<fct> <fct> <fct> <fct>
1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
2 Disagree Strongly Disagree Agree Strongly Disagree
3 Disagree Disagree Agree Neutral
4 Neutral Strongly Disagree <NA> Neutral
5 Agree Strongly Agree Disagree Agree
6 Strongly Agree Agree Strongly Agree Strongly Agree
7 Strongly Agree Neutral Agree Agree
8 Agree Strongly Agree Strongly Agree Strongly Agree
6. Python - Pandas
Python Programming |
import pandas as pd
import numpy as np
import sweetviz as sv
mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
- Summary statistics for numeric variables
Python Programming |
withmooc= mydata.copy()
withmooc.describe()
Results |
q1 q2 q3 q4
count 8.000000 8.000000 7.000000 8.000000
mean 3.250000 2.750000 4.142857 3.250000
std 1.488048 1.752549 1.069045 1.581139
min 1.000000 1.000000 2.000000 1.000000
25% 2.000000 1.000000 4.000000 2.500000
50% 3.500000 2.500000 4.000000 3.500000
75% 4.250000 4.250000 5.000000 4.250000
max 5.000000 5.000000 5.000000 5.000000
- Summary statistics for character variables
Python Programming |
withmooc= mydata.copy()
withmooc.describe(include=[object])
Results |
id workshop gender
count 8 8 8
unique 8 2 2
top 2 1 f
freq 1 4 4
Python Programming |
withmooc= mydata.copy()
withmooc.apply(lambda x : x.describe())
Results |
id workshop gender q1 q2 q3 q4
25% NaN NaN NaN 2.000000 1.000000 4.000000 2.500000
50% NaN NaN NaN 3.500000 2.500000 4.000000 3.500000
75% NaN NaN NaN 4.250000 4.250000 5.000000 4.250000
count 8 8 8 8.000000 8.000000 7.000000 8.000000
freq 1 4 4 NaN NaN NaN NaN
max NaN NaN NaN 5.000000 5.000000 5.000000 5.000000
mean NaN NaN NaN 3.250000 2.750000 4.142857 3.250000
min NaN NaN NaN 1.000000 1.000000 2.000000 1.000000
std NaN NaN NaN 1.488048 1.752549 1.069045 1.581139
top 2 1 f NaN NaN NaN NaN
unique 8 2 2 NaN NaN NaN NaN
Python Programming |
withmooc= mydata.copy()
labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}
withmooc['workshop'] = withmooc['workshop'].apply(lambda x: labels2.get(x))
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc= mydata.copy()
withmooc['workshop'] = withmooc['workshop'].map(labels2)
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc= mydata.copy()
withmooc['workshop'] = withmooc['workshop'].astype('category')
withmooc['workshop'] = withmooc['workshop'].cat.rename_categories(["R", "SAS"])
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc.groupby(withmooc['workshop']).describe()
Results |
q1 q2 ... q3 q4
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
workshop
R 4.0 3.0 1.825742 1.0 1.75 3.0 4.25 5.0 4.0 2.75 ... 4.25 5.0 4.0 3.0 1.414214 1.0 2.5 3.5 4.0 4.0
SAS 4.0 3.5 1.290994 2.0 2.75 3.5 4.25 5.0 4.0 2.75 ... 5.00 5.0 4.0 3.5 1.914854 1.0 2.5 4.0 5.0 5.0
2 rows × 32 columns
- Check how the levels are matched to the values.
Python Programming |
withmooc.info()
withmooc.dtypes
withmooc['workshop'].dtype
Results |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8 non-null object
1 workshop 8 non-null category
2 gender 8 non-null object
3 q1 8 non-null int32
4 q2 8 non-null int32
5 q3 7 non-null float64
6 q4 8 non-null int32
dtypes: category(1), float64(1), int32(3), object(2)
memory usage: 520.0+ bytes
CategoricalDtype(categories=['R', 'SAS'], ordered=False)
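One way to see how the category levels line up with the stored integer codes is to enumerate `cat.categories`. The following is a minimal sketch rather than part of the original post; the short `workshop` series is made up to mirror the column above.

```python
import pandas as pd

# Hypothetical stand-in for the workshop column above (strings '1' and '2').
workshop = pd.Series(['1', '2', '1', '2'], dtype='category')
workshop = workshop.cat.rename_categories({'1': 'R', '2': 'SAS'})

# The position of a level in cat.categories is the integer code stored for it.
code_to_level = dict(enumerate(workshop.cat.categories))
print(code_to_level)

# Side-by-side view of each value and its code.
check = pd.DataFrame({'level': workshop.astype(str), 'code': workshop.cat.codes})
print(check)
```

Passing a dict to `rename_categories` ties each label to an explicit old value, so the result does not depend on the sorted order of the categories.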
- Relabel m as male and f as female.
- If the values were uppercase, matching them against the lowercase levels f and m would in effect generate missing values.
Python Programming |
withmooc['gender'] = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])
withmooc
Results |
id workshop gender q1 q2 q3 q4 genderF
0 1 R f 1 1 5.0 1 female
1 2 SAS f 2 1 4.0 1 female
2 3 R f 2 2 4.0 3 female
3 4 SAS f 3 1 NaN 3 female
4 5 R m 4 5 2.0 4 male
5 6 SAS m 5 4 5.0 5 male
6 7 R m 5 3 4.0 4 male
7 8 SAS m 4 5 5.0 5 male
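The note about uppercase values can be reproduced in pandas when the categories are declared up front: any value outside the declared categories becomes missing. This is a minimal sketch under that assumption, using a made-up `gender` series that contains one uppercase `M`.

```python
import pandas as pd

# Hypothetical data: the last value is uppercase and is not a declared category.
gender = pd.Series(['f', 'f', 'm', 'M'])

# Declaring the categories up front turns any undeclared value into NaN.
fixed = pd.Series(pd.Categorical(gender, categories=['f', 'm']))
labeled = fixed.cat.rename_categories({'f': 'female', 'm': 'male'})
print(labeled)

# Normalizing the case first avoids the accidental missing value.
clean = pd.Series(pd.Categorical(gender.str.lower(), categories=['f', 'm']))
clean = clean.cat.rename_categories({'f': 'female', 'm': 'male'})
print(clean)
```

By contrast, `astype('category')` infers the categories from the data, so an uppercase `M` would become a third category instead of a missing value.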
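- Extract the underlying integer code of each level.
- genderNums holds the codes of gender, assigned in alphabetical order of the values (f = 0, m = 1).
- genderFNums holds the codes of genderF. Because rename_categories keeps the category order, female = 0 and male = 1, as the results show. Unlike R, pandas codes start at 0.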
Python Programming |
withmooc= mydata.copy()
withmooc['gender'] = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])
withmooc["genderNums"] = withmooc["gender"].cat.codes
withmooc["genderFNums"] = withmooc["genderF"].cat.codes
withmooc
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
0 1 1 f 1 1 5.0 1 female 0 0
1 2 2 f 2 1 4.0 1 female 0 0
2 3 1 f 2 2 4.0 3 female 0 0
3 4 2 f 3 1 NaN 3 female 0 0
4 5 1 m 4 5 2.0 4 male 1 1
5 6 2 m 5 4 5.0 5 male 1 1
6 7 1 m 5 3 4.0 4 male 1 1
7 8 2 m 4 5 5.0 5 male 1 1
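The codes follow the order of the categories, not the order in which the values appear. The sketch below, which uses a made-up `gender` series, shows how reordering the categories changes the codes, and how adding 1 gives R-style 1-based codes.

```python
import pandas as pd

gender = pd.Series(['f', 'f', 'm', 'm'], dtype='category')

# Default: inferred categories are sorted, so f -> 0 and m -> 1.
default_codes = gender.cat.codes.tolist()

# Reordering the categories changes the codes but not the values.
reordered = gender.cat.reorder_categories(['m', 'f'])
reordered_codes = reordered.cat.codes.tolist()

print(default_codes)
print(reordered_codes)

# Adding 1 gives 1-based codes comparable to R's as.numeric() on a factor.
r_style = (reordered.cat.codes + 1).tolist()
print(r_style)
```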
Python Programming |
withmooc= mydata.copy()
withmooc['gender'] = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])
withmooc['genderNums'] = pd.factorize(withmooc.gender)[0]
withmooc['genderFNums'] = pd.factorize(withmooc.genderF)[0]
withmooc
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
0 1 1 f 1 1 5.0 1 female 0 0
1 2 2 f 2 1 4.0 1 female 0 0
2 3 1 f 2 2 4.0 3 female 0 0
3 4 2 f 3 1 NaN 3 female 0 0
4 5 1 m 4 5 2.0 4 male 1 1
5 6 2 m 5 4 5.0 5 male 1 1
6 7 1 m 5 3 4.0 4 male 1 1
7 8 2 m 4 5 5.0 5 male 1 1
Python Programming |
withmooc= mydata.copy()
withmooc['gender'] = withmooc['gender'].astype('category')
withmooc['genderF'] = withmooc['gender'].cat.rename_categories(["female", "male"])
from sklearn.preprocessing import LabelEncoder
number = LabelEncoder()
withmooc['genderNums'] = number.fit_transform(withmooc['gender'])
withmooc['genderFNums'] = number.fit_transform(withmooc['genderF'])
withmooc.info()
withmooc
Results |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8 non-null object
1 workshop 8 non-null object
2 gender 8 non-null category
3 q1 8 non-null int32
4 q2 8 non-null int32
5 q3 7 non-null float64
6 q4 8 non-null int32
7 genderF 8 non-null category
8 genderNums 8 non-null int32
9 genderFNums 8 non-null int32
dtypes: category(2), float64(1), int32(5), object(2)
memory usage: 688.0+ bytes
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
0 1 1 f 1 1 5.0 1 female 0 0
1 2 2 f 2 1 4.0 1 female 0 0
2 3 1 f 2 2 4.0 3 female 0 0
3 4 2 f 3 1 NaN 3 female 0 0
4 5 1 m 4 5 2.0 4 male 1 1
5 6 2 m 5 4 5.0 5 male 1 1
6 7 1 m 5 3 4.0 4 male 1 1
7 8 2 m 4 5 5.0 5 male 1 1
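- Create copies of the q variables as factors (categories) so that their levels can be counted.
- Store the labels for repeated use.
- Create the new set of variables with astype('category'), the pandas counterpart of the factor function.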
Python Programming |
withmooc= mydata.copy()
withmooc['q1f']=withmooc['q1'].astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])
withmooc['q2f']=withmooc['q2'].astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])
withmooc['q3f']=withmooc['q3'].astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})
withmooc['q4f']=withmooc['q4'].astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})
withmooc
Results |
id workshop gender q1 q2 q3 q4 q1f q2f q3f q4f
0 1 1 f 1 1 5.0 1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 2 2 f 2 1 4.0 1 Disagree Strongly Disagree Agree Strongly Disagree
2 3 1 f 2 2 4.0 3 Disagree Disagree Agree Neutral
3 4 2 f 3 1 NaN 3 Neutral Strongly Disagree NaN Neutral
4 5 1 m 4 5 2.0 4 Agree Strongly Agree Disagree Agree
5 6 2 m 5 4 5.0 5 Strongly Agree Agree Strongly Agree Strongly Agree
6 7 1 m 5 3 4.0 4 Strongly Agree Neutral Agree Agree
7 8 2 m 4 5 5.0 5 Agree Strongly Agree Strongly Agree Strongly Agree
Python Programming |
withmooc.info()
Results |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8 non-null object
1 workshop 8 non-null object
2 gender 8 non-null object
3 q1 8 non-null int32
4 q2 8 non-null int32
5 q3 7 non-null float64
6 q4 8 non-null int32
7 q1f 8 non-null category
8 q2f 8 non-null category
9 q3f 7 non-null category
10 q4f 8 non-null category
dtypes: category(4), float64(1), int32(3), object(3)
memory usage: 1.2+ KB
Python Programming |
pd.DataFrame([(withmooc[val].value_counts()) for val in ['q1f','q2f','q3f','q4f']]).T
Results |
q1f q2f q3f q4f
Agree 2.0 1.0 3.0 2.0
Disagree 2.0 1.0 1.0 NaN
Neutral 1.0 1.0 NaN 2.0
Strongly Agree 2.0 2.0 3.0 2.0
Strongly Disagree 1.0 3.0 NaN 2.0
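The NaN cells in the count table above appear because each column only carries the levels that actually occur in it. Declaring all five levels before relabeling gives zero counts instead. This is a minimal sketch of that idea; the `q4` values are re-typed from the data above.

```python
import pandas as pd

labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
          4: "Agree", 5: "Strongly Agree"}

# Same q4 values as in the data above; level 2 never occurs.
q4 = pd.Series([1, 1, 3, 3, 4, 5, 4, 5])

# Declare every level first, then relabel.
q4f = pd.Series(pd.Categorical(q4, categories=list(labels)))
q4f = q4f.cat.rename_categories(labels)

# Unused levels are counted as 0; sort_index() orders rows by the declared levels.
counts = q4f.value_counts().sort_index()
print(counts)
```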
- Create copies of the q variables to use as factors. When there are many variables, the following approach automates the work.
- With the copies stored as factors, their levels can then be counted.
Python Programming |
myQlevels = [1,2,3,4,5]
myQlabels = {1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}
print(myQlevels)
print(myQlabels)
Results |
[1, 2, 3, 4, 5]
{1: 'Strongly Disagree', 2: 'Disagree', 3: 'Neutral', 4: 'Agree', 5: 'Strongly Agree'}
Python Programming |
myQnames = ["q" + str(i) for i in range(1,5)]
myQFnames = ["q" + str(i) + "f" for i in range(1,5)]
print(myQnames) # print the variable names
print(myQFnames) # names of the new factor variables
Results |
['q1', 'q2', 'q3', 'q4']
['q1f', 'q2f', 'q3f', 'q4f']
- Extract the q variables into a separate data frame.
Python Programming |
myQFvars = withmooc.loc[:,myQnames].copy()
myQFvars
Results |
q1 q2 q3 q4
0 1 1 5.0 1
1 2 1 4.0 1
2 2 2 4.0 3
3 3 1 NaN 3
4 4 5 2.0 4
5 5 4 5.0 5
6 5 3 4.0 4
7 4 5 5.0 5
- Rename all variables with an f suffix to mark them as factors.
Python Programming |
myQFvars.columns = myQFnames
myQFvars
Results |
q1f q2f q3f q4f
0 1 1 5.0 1
1 2 1 4.0 1
2 2 2 4.0 3
3 3 1 NaN 3
4 4 5 2.0 4
5 5 4 5.0 5
6 5 3 4.0 4
7 4 5 5.0 5
Python Programming |
withmooc['q4'].astype('category').cat.rename_categories(myQlabels)
Results |
0 Strongly Disagree
1 Strongly Disagree
2 Neutral
3 Neutral
4 Agree
5 Strongly Agree
6 Agree
7 Strongly Agree
Name: q4, dtype: category
Categories (4, object): ['Strongly Disagree', 'Neutral', 'Agree', 'Strongly Agree']
Python Programming |
def categories(x):
    return x.astype('category').cat.rename_categories(myQlabels)
categories(withmooc['q4'])
Results |
0 Strongly Disagree
1 Strongly Disagree
2 Neutral
3 Neutral
4 Agree
5 Strongly Agree
6 Agree
7 Strongly Agree
Name: q4, dtype: category
Categories (4, object): ['Strongly Disagree', 'Neutral', 'Agree', 'Strongly Agree']
Python Programming |
myQFvars = myQFvars.apply(categories)
myQFvars
Results |
q1f q2f q3f q4f
0 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 Disagree Strongly Disagree Agree Strongly Disagree
2 Disagree Disagree Agree Neutral
3 Neutral Strongly Disagree NaN Neutral
4 Agree Strongly Agree Disagree Agree
5 Strongly Agree Agree Strongly Agree Strongly Agree
6 Strongly Agree Neutral Agree Agree
7 Agree Strongly Agree Strongly Agree Strongly Agree
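Likert items are ordinal, and pandas can record that measurement level with an ordered categorical dtype. The sketch below is one possible way to do so and is not taken from the original post; the helper name `to_likert` and the small frame are hypothetical.

```python
import pandas as pd

labels = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
likert = pd.CategoricalDtype(categories=labels, ordered=True)

def to_likert(s):
    """Map codes 1-5 to labels and mark the result as ordinal (hypothetical helper)."""
    return s.map(dict(zip(range(1, 6), labels))).astype(likert)

# Hypothetical frame with the q1 and q3 values from above (q3 has one missing value).
df = pd.DataFrame({"q1": [1, 2, 2, 3, 4, 5, 5, 4],
                   "q3": [5, 4, 4, None, 2, 5, 4, 5]})
qf = pd.DataFrame({col + "f": to_likert(df[col]) for col in df.columns})

print(qf.dtypes)
print(qf["q1f"].min(), "|", qf["q1f"].max())  # order-aware minimum and maximum
n_agree = int((qf["q3f"] >= "Agree").sum())   # comparisons respect the level order
print(n_agree)
```

Because the dtype carries all five levels, every converted column shares the same categories even when some levels do not occur in it.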
- Level counts, corresponding to the result of the summary function.
Python Programming |
pd.DataFrame([(myQFvars[val].value_counts()) for val in ['q1f','q2f','q3f','q4f']]).T
Results |
q1f q2f q3f q4f
Agree 2.0 1.0 3.0 2.0
Disagree 2.0 1.0 1.0 NaN
Neutral 1.0 1.0 NaN 2.0
Strongly Agree 2.0 2.0 3.0 2.0
Strongly Disagree 1.0 3.0 NaN 2.0
Python Programming |
pd.merge(withmooc, myQFvars, how='inner')
Results |
id workshop gender q1 q2 q3 q4 q1f q2f q3f q4f
0 1 1 f 1 1 5.0 1 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 2 2 f 2 1 4.0 1 Disagree Strongly Disagree Agree Strongly Disagree
2 3 1 f 2 2 4.0 3 Disagree Disagree Agree Neutral
3 4 2 f 3 1 NaN 3 Neutral Strongly Disagree NaN Neutral
4 5 1 m 4 5 2.0 4 Agree Strongly Agree Disagree Agree
5 6 2 m 5 4 5.0 5 Strongly Agree Agree Strongly Agree Strongly Agree
6 7 1 m 5 3 4.0 4 Strongly Agree Neutral Agree Agree
7 8 2 m 4 5 5.0 5 Agree Strongly Agree Strongly Agree Strongly Agree
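When `pd.merge` is called without `on`, it joins on every column name the two frames share (here q1f to q4f), so the result above depends on those value combinations being unique per row. If the aim is simply to attach columns side by side, an index-aligned `pd.concat` is one alternative. This is a minimal sketch with made-up frames.

```python
import pandas as pd

# Hypothetical frames that share the same default RangeIndex.
base = pd.DataFrame({"id": ["1", "2", "3"], "q1": [1, 2, 2]})
labels = {1: "Strongly Disagree", 2: "Disagree"}
factors = pd.DataFrame({"q1f": base["q1"].map(labels).astype("category")})

# Column bind: rows are matched by index, not by column values.
both = pd.concat([base, factors], axis=1)
print(both)
```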
7. Python - dfply
- By default, summary treats group as numeric, while gender is assumed to be a factor and its levels are counted.
Python Programming |
import numpy as np
import pandas as pd
from dfply import *
mydata = pd.read_csv("c:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
# Select all variables.
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
Python Programming |
withmooc >> summarize(**{
**{f"{x}_mean": X[x].mean() for x in mydata.select_dtypes(int).columns},
**{f"{x}_std" : X[x].std() for x in mydata.select_dtypes(int).columns},
**{f"{x}_var" : X[x].var() for x in mydata.select_dtypes(int).columns},
**{f"{x}_median" : X[x].median() for x in mydata.select_dtypes(int).columns}
})
Results |
q1_mean q2_mean q4_mean q1_std q2_std q4_std q1_var q2_var q4_var q1_median q2_median q4_median
0 3.25 2.75 3.25 1.488048 1.752549 1.581139 2.214286 3.071429 2.5 3.5 2.5 3.5
Python Programming |
(withmooc >> select(withmooc.select_dtypes(include=np.number).columns.tolist())).describe()
Results |
q1 q2 q3 q4
count 8.000000 8.000000 7.000000 8.000000
mean 3.250000 2.750000 4.142857 3.250000
std 1.488048 1.752549 1.069045 1.581139
min 1.000000 1.000000 2.000000 1.000000
25% 2.000000 1.000000 4.000000 2.500000
50% 3.500000 2.500000 4.000000 3.500000
75% 4.250000 4.250000 5.000000 4.250000
max 5.000000 5.000000 5.000000 5.000000
Python Programming |
(withmooc >> select(withmooc.select_dtypes(include=np.number).columns.tolist())).describe().T
Results |
count mean std min 25% 50% 75% max
q1 8.0 3.250000 1.488048 1.0 2.0 3.5 4.25 5.0
q2 8.0 2.750000 1.752549 1.0 1.0 2.5 4.25 5.0
q3 7.0 4.142857 1.069045 2.0 4.0 4.0 5.00 5.0
q4 8.0 3.250000 1.581139 1.0 2.5 3.5 4.25 5.0
Python Programming |
withmooc= mydata.copy()
labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}
withmooc >> mutate(workshop = X['workshop'].apply(lambda x: labels2.get(x)))
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc= mydata.copy()
labels2={'1':'R','2':'SAS','3':'SPSS', '4':'Python'}
withmooc >> mutate(workshop = X['workshop'].map(labels2))
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc= mydata.copy()
withmooc = withmooc >> mutate(workshop = X['workshop'].astype('category'))
withmooc = withmooc >> mutate(workshop = X['workshop'].cat.rename_categories(["R", "SAS"]))
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 R f 1 1 5.0 1
1 2 SAS f 2 1 4.0 1
2 3 R f 2 2 4.0 3
3 4 SAS f 3 1 NaN 3
4 5 R m 4 5 2.0 4
5 6 SAS m 5 4 5.0 5
6 7 R m 5 3 4.0 4
7 8 SAS m 4 5 5.0 5
Python Programming |
withmooc >> group_by('workshop') >> \
summarize(**{
**{f"{x}_mean" : X[x].mean() for x in withmooc.select_dtypes(int).columns},
**{f"{x}_std" : X[x].std() for x in withmooc.select_dtypes(int).columns},
**{f"{x}_var" : X[x].var() for x in withmooc.select_dtypes(int).columns},
**{f"{x}_median" : X[x].median() for x in withmooc.select_dtypes(int).columns}
})
Results |
workshop q1_mean q2_mean q4_mean q1_std q2_std q4_std q1_var q2_var q4_var q1_median q2_median q4_median
0 R 3.0 2.75 3.0 1.825742 1.707825 1.414214 3.333333 2.916667 2.000000 3.0 2.5 3.5
1 SAS 3.5 2.75 3.5 1.290994 2.061553 1.914854 1.666667 4.250000 3.666667 3.5 2.5 4.0
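For comparison, similar grouped statistics can be sketched with plain pandas `groupby().agg()`. This is an alternative to the dfply pipeline above, not the method used in the post; the data are re-typed from the tables above, and the flattened column names are just one possible naming choice.

```python
import pandas as pd

# q1 and q4 values re-typed from the data above; workshop is already labeled.
df = pd.DataFrame({
    "workshop": ["R", "SAS", "R", "SAS", "R", "SAS", "R", "SAS"],
    "q1": [1, 2, 2, 3, 4, 5, 5, 4],
    "q4": [1, 1, 3, 3, 4, 5, 4, 5],
})

stats = df.groupby("workshop")[["q1", "q4"]].agg(["mean", "std", "median"])

# Flatten the two-level column index into names such as q1_mean.
stats.columns = [f"{var}_{stat}" for var, stat in stats.columns]
print(stats.round(6))
```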
Python Programming |
withmooc >> group_by('workshop') >> \
summarize(q1_mean=X.q1.mean(), q1_std=X.q1.std(),
q2_mean=X.q2.mean(), q2_std=X.q2.std(),
q3_mean=X.q3.mean(), q3_std=X.q3.std(),
q4_mean=X.q4.mean(), q4_std=X.q4.std())
Results |
workshop q1_mean q1_std q2_mean q2_std q3_mean q3_std q4_mean q4_std
0 R 3.0 1.825742 2.75 1.707825 3.750000 1.258306 3.0 1.414214
1 SAS 3.5 1.290994 2.75 2.061553 4.666667 0.577350 3.5 1.914854
Python Programming |
@pipe
@symbolic_evaluation()
def symbolic_double(df, *serieses):
    # Collect the describe() output of each column passed as X.<name>.
    result = []
    for series in serieses:
        result.append(series.describe())
    return pd.DataFrame(result)
withmooc >> symbolic_double(X.q1,X.q2,X.q3,X.q4)
Results |
count mean std min 25% 50% 75% max
q1 8.0 3.250000 1.488048 1.0 2.0 3.5 4.25 5.0
q2 8.0 2.750000 1.752549 1.0 1.0 2.5 4.25 5.0
q3 7.0 4.142857 1.069045 2.0 4.0 4.0 5.00 5.0
q4 8.0 3.250000 1.581139 1.0 2.5 3.5 4.25 5.0
Python Programming |
@pipe
@symbolic_evaluation()
def num_variable(df,serieses):
    # Describe only the numeric columns among the given column names.
    result = []
    for series in serieses:
        if pd.api.types.is_numeric_dtype(df[series]):
            result.append(df[series].describe())
    return pd.DataFrame(result)
withmooc >> num_variable(mydata.columns.tolist())
Results |
count mean std min 25% 50% 75% max
q1 8.0 3.250000 1.488048 1.0 2.0 3.5 4.25 5.0
q2 8.0 2.750000 1.752549 1.0 1.0 2.5 4.25 5.0
q3 7.0 4.142857 1.069045 2.0 4.0 4.0 5.00 5.0
q4 8.0 3.250000 1.581139 1.0 2.5 3.5 4.25 5.0
Python Programming |
@pipe
@symbolic_evaluation()
def num_variable(df,serieses):
    # Describe numeric and object (string) columns; other dtypes are skipped.
    result = []
    for series in serieses:
        if pd.api.types.is_numeric_dtype(df[series]):
            result.append(df[series].describe())
        elif pd.api.types.is_object_dtype(df[series]):
            result.append(df[series].describe())
    return pd.DataFrame(result)
withmooc >> num_variable(mydata.columns.tolist())
Results |
count unique top freq mean std min 25% 50% 75% max
id 8.0 8.0 2 1.0 NaN NaN NaN NaN NaN NaN NaN
gender 8.0 2.0 f 4.0 NaN NaN NaN NaN NaN NaN NaN
q1 8.0 NaN NaN NaN 3.250000 1.488048 1.0 2.0 3.5 4.25 5.0
q2 8.0 NaN NaN NaN 2.750000 1.752549 1.0 1.0 2.5 4.25 5.0
q3 7.0 NaN NaN NaN 4.142857 1.069045 2.0 4.0 4.0 5.00 5.0
q4 8.0 NaN NaN NaN 3.250000 1.581139 1.0 2.5 3.5 4.25 5.0
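The same split by variable type can be sketched in plain pandas with `select_dtypes`, which keeps the numeric and non-numeric summaries in separate tables instead of one table padded with NaN. The helper name `describe_by_kind` and the small frame are hypothetical.

```python
import pandas as pd

def describe_by_kind(df):
    """Return separate summaries for numeric and non-numeric columns (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    other = df.select_dtypes(exclude="number")
    return {
        "numeric": numeric.describe().T if not numeric.empty else None,
        "other": other.describe().T if not other.empty else None,
    }

# Hypothetical frame mirroring part of the data above.
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "gender": ["f", "f", "m", "m"],
    "q1": [1, 2, 2, 3],
    "q3": [5.0, 4.0, None, 2.0],
})
out = describe_by_kind(df)
print(out["numeric"])
print(out["other"])
```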
- Relabel m as male and f as female.
- If the values were uppercase, matching them against the lowercase levels f and m would in effect generate missing values.
Python Programming |
withmooc = withmooc \
>> mutate(gender = X.gender.astype('category')) \
>> mutate(genderF = X.gender.cat.rename_categories(["female", "male"]))
print(withmooc.dtypes)
withmooc
Results |
id object
workshop category
gender category
q1 int32
q2 int32
q3 float64
q4 int32
genderF category
dtype: object
Results |
id workshop gender q1 q2 q3 q4 genderF
0 1 R f 1 1 5.0 1 female
1 2 SAS f 2 1 4.0 1 female
2 3 R f 2 2 4.0 3 female
3 4 SAS f 3 1 NaN 3 female
4 5 R m 4 5 2.0 4 male
5 6 SAS m 5 4 5.0 5 male
6 7 R m 5 3 4.0 4 male
7 8 SAS m 4 5 5.0 5 male
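- Extract the underlying integer code of each level.
- genderNums holds the codes of gender, assigned in alphabetical order of the values (f = 0, m = 1).
- genderFNums holds the codes of genderF. Because rename_categories keeps the category order, female = 0 and male = 1, as the results show. Unlike R, pandas codes start at 0.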
Python Programming |
mydata1= mydata.copy()
withmooc = withmooc \
>> mutate(gender = X.gender.astype('category')) \
>> mutate(genderF = X.gender.cat.rename_categories(["female", "male"])) \
>> mutate(genderNums = X.gender.cat.codes) \
>> mutate(genderFNums = X.genderF.cat.codes)
print(withmooc.dtypes)
withmooc
Results |
id object
workshop category
gender category
q1 int32
q2 int32
q3 float64
q4 int32
genderF category
genderNums int8
genderFNums int8
dtype: object
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
0 1 R f 1 1 5.0 1 female 0 0
1 2 SAS f 2 1 4.0 1 female 0 0
2 3 R f 2 2 4.0 3 female 0 0
3 4 SAS f 3 1 NaN 3 female 0 0
4 5 R m 4 5 2.0 4 male 1 1
5 6 SAS m 5 4 5.0 5 male 1 1
6 7 R m 5 3 4.0 4 male 1 1
7 8 SAS m 4 5 5.0 5 male 1 1
Python Programming |
mydata1= mydata.copy()
withmooc = withmooc \
>> mutate(gender = X.gender.astype('category')) \
>> mutate(genderF = X.gender.cat.rename_categories(["female", "male"])) \
>> mutate(genderNums = pd.DataFrame(pd.factorize(withmooc.gender)[0])) \
>> mutate(genderFNums = pd.DataFrame(pd.factorize(withmooc.genderF)[0]))
print(withmooc.dtypes)
withmooc
Results |
id object
workshop category
gender category
q1 int32
q2 int32
q3 float64
q4 int32
genderF category
genderNums int64
genderFNums int64
dtype: object
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums
0 1 R f 1 1 5.0 1 female 0 0
1 2 SAS f 2 1 4.0 1 female 0 0
2 3 R f 2 2 4.0 3 female 0 0
3 4 SAS f 3 1 NaN 3 female 0 0
4 5 R m 4 5 2.0 4 male 1 1
5 6 SAS m 5 4 5.0 5 male 1 1
6 7 R m 5 3 4.0 4 male 1 1
7 8 SAS m 4 5 5.0 5 male 1 1
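- Create copies of the q variables as factors (categories) so that their levels can be counted.
- Store the labels for repeated use.
- Create the new set of variables with astype('category'), the pandas counterpart of the factor function.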
Python Programming |
mydata1= mydata.copy()
withmooc = withmooc \
>> mutate(q1f = X.q1.astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])) \
>> mutate(q2f = X.q2.astype('category').cat.rename_categories(["Strongly Disagree","Disagree","Neutral","Agree","Strongly Agree"])) \
>> mutate(q3f = X.q3.astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"})) \
>> mutate(q4f = X.q4.astype('category').cat.rename_categories({1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}))
withmooc
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums q1f q2f q3f q4f
0 1 R f 1 1 5.0 1 female 0 0 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 2 SAS f 2 1 4.0 1 female 0 0 Disagree Strongly Disagree Agree Strongly Disagree
2 3 R f 2 2 4.0 3 female 0 0 Disagree Disagree Agree Neutral
3 4 SAS f 3 1 NaN 3 female 0 0 Neutral Strongly Disagree NaN Neutral
4 5 R m 4 5 2.0 4 male 1 1 Agree Strongly Agree Disagree Agree
5 6 SAS m 5 4 5.0 5 male 1 1 Strongly Agree Agree Strongly Agree Strongly Agree
6 7 R m 5 3 4.0 4 male 1 1 Strongly Agree Neutral Agree Agree
7 8 SAS m 4 5 5.0 5 male 1 1 Agree Strongly Agree Strongly Agree Strongly Agree
Python Programming |
withmooc.info()
Results |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 8 non-null object
1 workshop 8 non-null category
2 gender 8 non-null category
3 q1 8 non-null int32
4 q2 8 non-null int32
5 q3 7 non-null float64
6 q4 8 non-null int32
7 genderF 8 non-null category
8 genderNums 8 non-null int64
9 genderFNums 8 non-null int64
10 q1f 8 non-null category
11 q2f 8 non-null category
12 q3f 7 non-null category
13 q4f 8 non-null category
dtypes: category(7), float64(1), int32(3), int64(2), object(1)
memory usage: 1.5+ KB
Python Programming |
@pipe
@symbolic_evaluation()
def qf_counts(df,serieses):
    result = []
    for series in serieses:
        result.append(df[series].value_counts())
    return pd.DataFrame(result).T
withmooc >> qf_counts(['q1f','q2f','q3f','q4f'])
Results |
q1f q2f q3f q4f
Agree 2.0 1.0 3.0 2.0
Disagree 2.0 1.0 1.0 NaN
Neutral 1.0 1.0 NaN 2.0
Strongly Agree 2.0 2.0 3.0 2.0
Strongly Disagree 1.0 3.0 NaN 2.0
- Create copies of the q variables to use as factors. When there are many variables, the following approach automates the work.
- With the copies stored as factors, their levels can then be counted.
Python Programming |
myQlevels = [1,2,3,4,5]
myQlabels = {1:"Strongly Disagree",2:"Disagree",3:"Neutral",4:"Agree",5:"Strongly Agree"}
print(myQlevels)
print(myQlabels)
Results |
[1, 2, 3, 4, 5]
{1: 'Strongly Disagree', 2: 'Disagree', 3: 'Neutral', 4: 'Agree', 5: 'Strongly Agree'}
- Extract the q variables into a separate data frame.
Python Programming |
myQFvars = withmooc >> select(num_range("q", range(1,5)))
Python Programming |
# Rename all variables with an f suffix to mark them as factors.
myQFnames = ['q1f', 'q2f', 'q3f', 'q4f']
myQFvars.columns = myQFnames
myQFvars
Results |
q1f q2f q3f q4f
0 1 1 5.0 1
1 2 1 4.0 1
2 2 2 4.0 3
3 3 1 NaN 3
4 4 5 2.0 4
5 5 4 5.0 5
6 5 3 4.0 4
7 4 5 5.0 5
Python Programming |
mydata1= mydata.copy()
withmooc \
>> mutate(q4f = X.q4.astype('category').cat.rename_categories(myQlabels))
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums q1f q2f q3f q4f
0 1 R f 1 1 5.0 1 female 0 0 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 2 SAS f 2 1 4.0 1 female 0 0 Disagree Strongly Disagree Agree Strongly Disagree
2 3 R f 2 2 4.0 3 female 0 0 Disagree Disagree Agree Neutral
3 4 SAS f 3 1 NaN 3 female 0 0 Neutral Strongly Disagree NaN Neutral
4 5 R m 4 5 2.0 4 male 1 1 Agree Strongly Agree Disagree Agree
5 6 SAS m 5 4 5.0 5 male 1 1 Strongly Agree Agree Strongly Agree Strongly Agree
6 7 R m 5 3 4.0 4 male 1 1 Strongly Agree Neutral Agree Agree
7 8 SAS m 4 5 5.0 5 male 1 1 Agree Strongly Agree Strongly Agree Strongly Agree
Python Programming |
withmooc["q1"].astype('category').cat.rename_categories(myQlabels)
Results |
0 Strongly Disagree
1 Disagree
2 Disagree
3 Neutral
4 Agree
5 Strongly Agree
6 Strongly Agree
7 Agree
Name: q1, dtype: category
Categories (5, object): ['Strongly Disagree', 'Disagree', 'Neutral', 'Agree', 'Strongly Agree']
Python Programming |
@pipe
@symbolic_evaluation()
def qf_counts(df,serieses):
    result = []
    for series in serieses:
        result.append(df[series].astype('category').cat.rename_categories(myQlabels))
    return pd.DataFrame(result).T
myQFvars = withmooc >> qf_counts(['q1f','q2f','q3f','q4f'])
Python Programming |
myQFvars
Results |
q1f q2f q3f q4f
0 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 Disagree Strongly Disagree Agree Strongly Disagree
2 Disagree Disagree Agree Neutral
3 Neutral Strongly Disagree NaN Neutral
4 Agree Strongly Agree Disagree Agree
5 Strongly Agree Agree Strongly Agree Strongly Agree
6 Strongly Agree Neutral Agree Agree
7 Agree Strongly Agree Strongly Agree Strongly Agree
Python Programming |
@pipe
@symbolic_evaluation()
def qf_counts(df,serieses):
    result = []
    for series in serieses:
        result.append(df[series].value_counts())
    return pd.DataFrame(result).T
myQFvars >> qf_counts(['q1f','q2f','q3f','q4f'])
Results |
q1f q2f q3f q4f
Strongly Agree 2.0 2.0 3.0 2.0
Disagree 2.0 1.0 1.0 NaN
Agree 2.0 1.0 3.0 2.0
Strongly Disagree 1.0 3.0 NaN 2.0
Neutral 1.0 1.0 NaN 2.0
Python Programming |
both = withmooc >> bind_cols(myQFvars)
both
Results |
id workshop gender q1 q2 q3 q4 genderF genderNums genderFNums q1f q2f q3f q4f q1f q2f q3f q4f
0 1 R f 1 1 5.0 1 female 0 0 Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree Strongly Disagree Strongly Disagree Strongly Agree Strongly Disagree
1 2 SAS f 2 1 4.0 1 female 0 0 Disagree Strongly Disagree Agree Strongly Disagree Disagree Strongly Disagree Agree Strongly Disagree
2 3 R f 2 2 4.0 3 female 0 0 Disagree Disagree Agree Neutral Disagree Disagree Agree Neutral
3 4 SAS f 3 1 NaN 3 female 0 0 Neutral Strongly Disagree NaN Neutral Neutral Strongly Disagree NaN Neutral
4 5 R m 4 5 2.0 4 male 1 1 Agree Strongly Agree Disagree Agree Agree Strongly Agree Disagree Agree
5 6 SAS m 5 4 5.0 5 male 1 1 Strongly Agree Agree Strongly Agree Strongly Agree Strongly Agree Agree Strongly Agree Strongly Agree
6 7 R m 5 3 4.0 4 male 1 1 Strongly Agree Neutral Agree Agree Strongly Agree Neutral Agree Agree
7 8 SAS m 4 5 5.0 5 male 1 1 Agree Strongly Agree Strongly Agree Strongly Agree Agree Strongly Agree Strongly Agree Strongly Agree