포스팅 목차
[데이터 관리] 6. 관측값 포맷팅 & 관측값 일괄 변경하기
1. Proc SQL
CASE함수를 이용하여 RECODING;
SAS Programming |
options linesize=120;
proc sql;
create table withmooc as
select a.*,
case q1 when 1 then 2
when 5 then 4
else q1 end as qr1,
case q2 when 1 then 2
when 5 then 4
else q2 end as qr2,
case q3 when 1 then 2
when 5 then 4
else q3 end as qr3,
case q4 when 1 then 2
when 5 then 4
else q4 end as qr4
from BACK.mydata a;
select * from withmooc;
quit;
Results |
id workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4
1 1 f 1 1 5 1 2 2 4 2
2 2 f 2 1 4 1 2 2 4 2
3 1 f 2 2 4 3 2 2 4 3
4 2 f 3 1 . 3 3 2 . 3
5 1 m 4 5 2 4 4 4 2 4
6 2 m 5 4 5 5 4 4 4 4
7 1 m 5 3 4 4 4 3 4 4
8 2 m 4 5 5 5 4 4 4 4
SAS Programming |
proc sql;
select min(q1) as q1_min,
mean(q1) as q1_mean,
max(q1) as q1_max,
sum(q1) as q1_sum,
min(q2) as q2_min,
mean(q2) as q2_mean,
max(q2) as q2_max,
sum(q2) as q2_sum
from withmooc;
quit;
Results |
q1_min q1_mean q1_max q1_sum q2_min q2_mean q2_max q2_sum
----------------------------------------------------------------------------
1 3.25 5 26 1 2.75 5 22
2. SAS Programming
SAS Programming |
DATA withmooc;
INFILE 'c:\data\mydata.csv' delimiter = ','
MISSOVER DSD LRECL=32767 firstobs=2 ;
INPUT id workshop gender $ q1 q2 q3 q4;
run;
PROC PRINT; RUN;
PROC FORMAT;
VALUE Agreement 1="Disagree"
2="Disagree"
3="Neutral"
4="Agree"
5="Agree";
run;
# 5점 척도를 3점 척도로 변환. 1과 2를 하나의 척도로 4와 5를 또하나의 척도로 변환.;
DATA withmooc1;
SET withmooc;
ARRAY q q1-q4;
ARRAY qr qr1-qr4; *r for recoded;
DO i=1 to 4;
qr{i}=q{i};
if q{i}=1 then qr{i}=2;
else if q{i}=5 then qr{i}=4;
END;
FORMAT q1-q4 qr1-qr4 Agreement.;
RUN;
proc print;run;
Results |
id workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4 i
1 1 f Disagree Disagree Agree Disagree Disagree Disagree Agree Disagree 5
2 2 f Disagree Disagree Agree Disagree Disagree Disagree Agree Disagree 5
3 1 f Disagree Disagree Agree Neutral Disagree Disagree Agree Neutral 5
4 2 f Neutral Disagree . Neutral Neutral Disagree . Neutral 5
5 1 m Agree Agree Disagree Agree Agree Agree Disagree Agree 5
6 2 m Agree Agree Agree Agree Agree Agree Agree Agree 5
7 1 m Agree Neutral Agree Agree Agree Neutral Agree Agree 5
8 2 m Agree Agree Agree Agree Agree Agree Agree Agree 5
SAS Programming |
proc print data=withmooc1;
format q1-q4 qr1-qr4;
run;
Results |
OBS id workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4 i
1 1 1 f 1 1 5 1 2 2 4 2 5
2 2 2 f 2 1 4 1 2 2 4 2 5
3 3 1 f 2 2 4 3 2 2 4 3 5
4 4 2 f 3 1 . 3 3 2 . 3 5
5 5 1 m 4 5 2 4 4 4 2 4 5
6 6 2 m 5 4 5 5 4 4 4 4 5
7 7 1 m 5 3 4 4 4 3 4 4 5
8 8 2 m 4 5 5 5 4 4 4 4 5
- 자동적으로 포맷 이용.
SAS Programming |
PROC FREQ data=withmooc1;
TABLES q1-q4;
RUN;
Results |
누적 누적
q1 빈도 백분율 빈도 백분율
----------------------------------------------------
Disagree 3 37.50 3 37.50
Neutral 1 12.50 4 50.00
Agree 4 50.00 8 100.00
누적 누적
q2 빈도 백분율 빈도 백분율
----------------------------------------------------
Disagree 4 50.00 4 50.00
Neutral 1 12.50 5 62.50
Agree 3 37.50 8 100.00
누적 누적
q3 빈도 백분율 빈도 백분율
----------------------------------------------------
Disagree 1 14.29 1 14.29
Agree 6 85.71 7 100.00
결측값의 개수 = 1
누적 누적
q4 빈도 백분율 빈도 백분율
----------------------------------------------------
Disagree 2 25.00 2 25.00
Neutral 2 25.00 4 50.00
Agree 4 50.00 8 100.00
- q변수들은 포맷을 적용 안 함.;
- high/low 값은 1/5로 된다.;
SAS Programming |
PROC UNIVARIATE data=withmooc1;
VAR q1-q4;
RUN;
Results |
분위수(정의 5)
분위수 추정값
100% 최대값 5.0
99% 5.0
95% 5.0
90% 5.0
75% Q3 4.5
50% 중위수 3.5
25% Q1 2.0
10% 1.0
5% 1.0
1% 1.0
0% 최소값 1.0
- qr변수들은 3점 척도 적용한다.;
- high/low 값은 1/5로 된다.
SAS Programming |
PROC UNIVARIATE data=withmooc1;
VAR qr1-qr4;
RUN;
Results |
분위수 추정값
100% 최대값 4.0
99% 4.0
95% 4.0
90% 4.0
75% Q3 4.0
50% 중위수 3.5
25% Q1 2.0
10% 2.0
5% 2.0
1% 2.0
0% 최소값 2.0
3. SPSS
- SPSS Program for Recoding Variables.
SPSS Programming |
GET FILE='C:\mydata.sav'.
RECODE q1 to q4 (1=2) (5=4).
SAVE OUTFILE='C:\myleft.sav'.
EXECUTE .
4. R Programming (R-PROJECT)
- Recoding 변수를 위한 R-Project 프로그램.
R Programming |
from rpy2.robjects import r
%load_ext rpy2.ipython
Results |
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
R Programming |
%%R
load(file="C:/work/data/mydata.Rdata")
withmooc = mydata
attach(withmooc) # mydata를 기본 데이터 세트로 지정.
withmooc
Results |
workshop gender q1 q2 q3 q4
1 1 f 1 1 5 1
2 2 f 2 1 4 1
3 1 f 2 2 4 3
4 2 f 3 1 NA 3
5 1 m 4 5 2 4
6 2 m 5 4 5 5
7 1 m 5 3 4 4
8 2 m 4 5 5 5
- recode함수를 이용하여 관측값 변경;
R Programming |
%%R
# install.packages("car")
library(car)
Results |
From cffi callback <function _consolewrite_ex at 0x0000023AC5B63820>:
Traceback (most recent call last):
File "C:\Users\BACK\anaconda3\lib\site-packages\rpy2\rinterface_lib\callbacks.py", line 131, in _consolewrite_ex
s = conversion._cchar_to_str_with_maxlen(buf, maxlen=n)
File "C:\Users\BACK\anaconda3\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 130, in _cchar_to_str_with_maxlen
s = ffi.string(c, maxlen).decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 0: invalid continuation byte
R Programming |
%%R
withmooc$q1<-car::recode(withmooc$q1,"1=2;5=4")
withmooc$q2<-car::recode(withmooc$q2,"1=2;5=4")
withmooc$q3<-car::recode(withmooc$q3,"1=2;5=4")
withmooc$q4<-car::recode(withmooc$q4,"1=2;5=4")
withmooc
Results |
workshop gender q1 q2 q3 q4
1 1 f 2 2 4 2
2 2 f 2 2 4 2
3 1 f 2 2 4 3
4 2 f 3 2 NA 3
5 1 m 4 4 2 4
6 2 m 4 4 4 4
7 1 m 4 3 4 4
8 2 m 4 4 4 4
- 변수명의 두 개의 세트 생성.
R Programming |
%%R
myQnames <- base::paste( "q", 1:4, sep="")
print(myQnames) # 원 변수명.
myQRnames <- base::paste( "qr", 1:4, sep="")
myQRnames # Recoded될 변수의 이름.
Results |
[1] "q1" "q2" "q3" "q4"
[1] "qr1" "qr2" "qr3" "qr4"
- 데이터 프레임에서 q변수를 추출.
R Programming |
%%R
myQRvars <- withmooc[ ,myQnames]
myQRvars
Results |
q1 q2 q3 q4
1 2 2 4 2
2 2 2 4 2
3 2 2 4 3
4 3 2 NA 3
5 4 4 2 4
6 4 4 4 4
7 4 3 4 4
8 4 4 4 4
- 변수명을 변경.
R Programming |
%%R
names(myQRvars) <- myQRnames
myQRvars
Results |
qr1 qr2 qr3 qr4
1 2 2 4 2
2 2 2 4 2
3 2 2 4 3
4 3 2 NA 3
5 4 4 2 4
6 4 4 4 4
7 4 3 4 4
8 4 4 4 4
- 많은 변수에 라벨을 적용하기 위해 함수 생성.
R Programming |
%%R
myRecoder <- function(x) { recode(x,"1=2;5=4") }
- 하나의 변수에 함수 적용.
R Programming |
%%R
myRecoder(myQRvars$qr1)
Results |
[1] 2 2 2 3 4 4 4 4
- lapply함수를 사용하여 모든 변수에 적용하여 관측값 변경하기.
R Programming |
%%R
myQRvars[ ,myQRnames] <- lapply( myQRvars[ ,myQRnames ], myRecoder )
myQRvars
Results |
qr1 qr2 qr3 qr4
1 2 2 4 2
2 2 2 4 2
3 2 2 4 3
4 3 2 NA 3
5 4 4 2 4
6 4 4 4 4
7 4 3 4 4
8 4 4 4 4
- 원 변수와 새 변수 합치기.
R Programming |
%%R
withmooc <- cbind(withmooc,myQRvars)
withmooc
Results |
workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4
1 1 f 2 2 4 2 2 2 4 2
2 2 f 2 2 4 2 2 2 4 2
3 1 f 2 2 4 3 2 2 4 3
4 2 f 3 2 NA 3 3 2 NA 3
5 1 m 4 4 2 4 4 4 2 4
6 2 m 4 4 4 4 4 4 4 4
7 1 m 4 3 4 4 4 3 4 4
8 2 m 4 4 4 4 4 4 4 4
R Programming |
%%R
options(width = 150)
summary(withmooc)
Results |
workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4
Min. :1.0 Length:8 Min. :2.000 Min. :2.000 Min. :2.000 Min. :2.00 Min. :2.000 Min. :2.000 Min. :2.000 Min. :2.00
1st Qu.:1.0 Class :character 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:2.75 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:4.000 1st Qu.:2.75
Median :1.5 Mode :character Median :3.500 Median :2.500 Median :4.000 Median :3.50 Median :3.500 Median :2.500 Median :4.000 Median :3.50
Mean :1.5 Mean :3.125 Mean :2.875 Mean :3.714 Mean :3.25 Mean :3.125 Mean :2.875 Mean :3.714 Mean :3.25
3rd Qu.:2.0 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:4.00
Max. :2.0 Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.00 Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.00
NA's :1 NA's :1
5. R - Tidyverse
- Tip : Jupyter Notebook 상에서 R에서 생성된 데이터를 파이썬(ipython)으로 전달하기
- %%R -o 파이썬으로 전 송하기 위한 데이터명
R Programming |
%%R -o back
back = mydata
R Programming |
back
Results |
workshop gender q1 q2 q3 q4
1 1 f 1 1 5 1
2 2 f 2 1 4 1
3 1 f 2 2 4 3
4 2 f 3 1 -2147483648 3
5 1 m 4 5 2 4
6 2 m 5 4 5 5
7 1 m 5 3 4 4
8 2 m 4 5 5 5
R Programming |
from rpy2.robjects import r
%load_ext rpy2.ipython
Results |
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
R Programming |
%%R
library(tidyverse)
load(file="C:/work/data/mydata.Rdata")
withmooc = mydata
attach(withmooc) # mydata를 기본 데이터 세트로 지정.
withmooc
R[write to console]: The following objects are masked from withmooc (pos = 8):
gender, q1, q2, q3, q4, workshop
workshop gender q1 q2 q3 q4
1 1 f 1 1 5 1
2 2 f 2 1 4 1
3 1 f 2 2 4 3
4 2 f 3 1 NA 3
5 1 m 4 5 2 4
6 2 m 5 4 5 5
7 1 m 5 3 4 4
8 2 m 4 5 5 5
- recode함수를 이용하여 관측값 변경;
R Programming |
%%R
# install.packages("car")
library(tidyverse)
library(car)
- dplyr의 Record 함수와 구분하기 위하여 car 패키지 지정
R Programming |
%%R
withmooc %>% mutate(q1=car::recode(q1,"1=2;5=4"),
q2=car::recode(q2,"1=2;5=4"),
q3=car::recode(q3,"1=2;5=4"),
q4=car::recode(q4,"1=2;5=4")
)
Results |
workshop gender q1 q2 q3 q4
1 1 f 2 2 4 2
2 2 f 2 2 4 2
3 1 f 2 2 4 3
4 2 f 3 2 NA 3
5 1 m 4 4 2 4
6 2 m 4 4 4 4
7 1 m 4 3 4 4
8 2 m 4 4 4 4
- 많은 변수에 라벨을 적용하기 위해 함수 생성.
R Programming |
%%R
myQnames <- base::paste( "q", 1:4, sep="")
print(myQnames) # 원 변수명.
myQRvars <- withmooc[ ,myQnames]
myQRvars
Results |
[1] "q1" "q2" "q3" "q4"
q1 q2 q3 q4
1 1 1 5 1
2 2 1 4 1
3 2 2 4 3
4 3 1 NA 3
5 4 5 2 4
6 5 4 5 5
7 5 3 4 4
8 4 5 5 5
R Programming |
%%R
withmooc = mydata
myRecoder <- function(x) { car::recode(x,"1=2;5=4") }
print( myRecoder(withmooc$q1) )
Results |
[1] 2 2 2 3 4 4 4 4
- map : 각 변수 별로 함수 적용 후 하나의 테이블로 재구성됨.
R Programming |
%%R
withmooc %>%
select(q1,q2,q3,q4) %>%
map(~ transmute(withmooc, {{.}} := myRecoder(.))) %>%
as_tibble()
Results |
# A tibble: 8 x 4
q1$..1 q2$..1 q3$..1 q4$..1
<dbl> <dbl> <dbl> <dbl>
1 2 2 4 2
2 2 2 4 2
3 2 2 4 3
4 3 2 NA 3
5 4 4 2 4
6 4 4 4 4
7 4 3 4 4
8 4 4 4 4
R Programming |
%%R
withmooc %>%
select(q1,q2,q3,q4) %>%
purrr::map_df(myRecoder)
Results |
# A tibble: 8 x 4
q1 q2 q3 q4
<dbl> <dbl> <dbl> <dbl>
1 2 2 4 2
2 2 2 4 2
3 2 2 4 3
4 3 2 NA 3
5 4 4 2 4
6 4 4 4 4
7 4 3 4 4
8 4 4 4 4
R Programming |
%%R
myQRvars <- withmooc %>%
purrr::keep(.p = is.numeric) %>% # 숫자형 변수 선택
purrr::map_df(.x = .,
.f = myRecoder)
myQRvars
Results |
# A tibble: 8 x 5
workshop q1 q2 q3 q4
<dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2 2 4 2
2 2 2 2 4 2
3 2 2 2 4 3
4 2 3 2 NA 3
5 2 4 4 2 4
6 2 4 4 4 4
7 2 4 3 4 4
8 2 4 4 4 4
- 원 변수와 새 변수 합치기.
R Programming |
%%R
withmooc <- cbind(withmooc,myQRvars)
withmooc
Results |
workshop gender q1 q2 q3 q4 workshop q1 q2 q3 q4
1 1 f 1 1 5 1 2 2 2 4 2
2 2 f 2 1 4 1 2 2 2 4 2
3 1 f 2 2 4 3 2 2 2 4 3
4 2 f 3 1 NA 3 2 3 2 NA 3
5 1 m 4 5 2 4 2 4 4 2 4
6 2 m 5 4 5 5 2 4 4 4 4
7 1 m 5 3 4 4 2 4 3 4 4
8 2 m 4 5 5 5 2 4 4 4 4
- 통계량 일괄 산출
R Programming |
%%R
library(psych)
library(mnormt)
library(skimr)
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # 숫자형 데이터만 남기기
purrr::map_df(.x = .,
.f = psych::describe)
Results |
vars n mean sd median trimmed mad min max range skew kurtosis se
X1...1 1 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
X1...2 1 8 3.25 1.49 3.5 3.25 2.22 1 5 4 -0.14 -1.73 0.53
X1...3 1 8 2.75 1.75 2.5 2.75 2.22 1 5 4 0.19 -1.91 0.62
X1...4 1 7 4.14 1.07 4.0 4.14 1.48 2 5 3 -0.93 -0.52 0.40
X1...5 1 8 3.25 1.58 3.5 3.25 1.48 1 5 4 -0.36 -1.59 0.56
X1...6 1 8 2.00 0.00 2.0 2.00 0.00 2 2 0 NaN NaN 0.00
X1...7 1 8 3.12 0.99 3.5 3.12 0.74 2 4 2 -0.20 -2.07 0.35
X1...8 1 8 2.88 0.99 2.5 2.88 0.74 2 4 2 0.20 -2.07 0.35
X1...9 1 7 3.71 0.76 4.0 3.71 0.00 2 4 2 -1.62 0.80 0.29
X1...10 1 8 3.25 0.89 3.5 3.25 0.74 2 4 2 -0.40 -1.75 0.31
R Programming |
%%R
str(withmooc)
Results |
'data.frame': 8 obs. of 11 variables:
$ workshop: int 1 2 1 2 1 2 1 2
$ gender : chr "f" "f" "f" "f" ...
$ q1 : int 1 2 2 3 4 5 5 4
$ q2 : int 1 1 2 1 5 4 3 5
$ q3 : int 5 4 4 NA 2 5 4 5
$ q4 : int 1 1 3 3 4 5 4 5
$ workshop: num 2 2 2 2 2 2 2 2
$ q1 : num 2 2 2 3 4 4 4 4
$ q2 : num 2 2 2 2 4 4 3 4
$ q3 : num 4 4 4 NA 2 4 4 4
$ q4 : num 2 2 3 3 4 4 4 4
R Programming |
%%R
withmooc = mydata
skimr::skim(withmooc)
Results |
-- Data Summary ------------------------
Values
Name withmooc
Number of rows 8
Number of columns 6
_______________________
Column type frequency:
character 1
numeric 5
________________________
Group variables None
-- Variable type: character ----------------------------------------------------
# A tibble: 1 x 8
skim_variable n_missing complete_rate min max empty n_unique whitespace
* <chr> <int> <dbl> <int> <int> <int> <int> <int>
1 gender 0 1 1 1 0 2 0
-- Variable type: numeric ------------------------------------------------------
# A tibble: 5 x 11
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75
* <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 workshop 0 1 1.5 0.535 1 1 1.5 2
2 q1 0 1 3.25 1.49 1 2 3.5 4.25
3 q2 0 1 2.75 1.75 1 1 2.5 4.25
4 q3 1 0.875 4.14 1.07 2 4 4 5
5 q4 0 1 3.25 1.58 1 2.5 3.5 4.25
p100 hist
* <dbl> <chr>
1 2 <U+2587><U+2581><U+2581><U+2581><U+2587>
2 5 <U+2583><U+2587><U+2583><U+2587><U+2587>
3 5 <U+2587><U+2582><U+2582><U+2582><U+2585>
4 5 <U+2582><U+2581><U+2581><U+2587><U+2587>
5 5 <U+2587><U+2581><U+2587><U+2587><U+2587>
6. Python - Pandas
Python Programming |
import pandas as pd
import numpy as np
import sweetviz as sv
from IPython.display import display # DataFrame()을 HTML로 출력
mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
- Recoding 변수를 위한 R-Project 프로그램.
Python Programming |
withmooc= mydata.copy()
withmooc
mapper = {1:2, 2:2, 3:3,4:4,5:4}
withmooc['q1'] = withmooc['q1'].map(mapper)
withmooc['q2'] = withmooc['q2'].map(mapper)
withmooc['q3'] = withmooc['q3'].map(mapper)
withmooc['q4'] = withmooc['q4'].map(mapper)
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
Python Programming |
withmooc= mydata.copy()
display(withmooc)
withmooc[['q1','q2','q3','q4']]= withmooc[['q1','q2','q3','q4']].applymap(lambda x: 2 if x == 1 else 4 if x==5 else x)
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
- 변수명의 두 개의 세트 생성.
- R 패키지의 paste 함수
Python Programming |
myQnames = ["q" + str(x + 1) for x in range(4)]
display(myQnames)
# Recoded될 변수의 이름.
myQRnames = ["qr" + str(x + 1) for x in range(4)]
display(myQRnames)
# 데이터 프레임에서 q변수를 추출.
myQRvars = withmooc.loc[: ,myQnames]
display(myQRvars)
# 변수명을 변경.
myQRvars.columns = myQRnames
myQRvars
Results |
['q1', 'q2', 'q3', 'q4']
['qr1', 'qr2', 'qr3', 'qr4']
Results |
q1 q2 q3 q4
0 2 2 4.0 2
1 2 2 4.0 2
2 2 2 4.0 3
3 3 2 NaN 3
4 4 4 2.0 4
5 4 4 4.0 4
6 4 3 4.0 4
7 4 4 4.0 4
Results |
qr1 qr2 qr3 qr4
0 2 2 4.0 2
1 2 2 4.0 2
2 2 2 4.0 3
3 3 2 NaN 3
4 4 4 2.0 4
5 4 4 4.0 4
6 4 3 4.0 4
7 4 4 4.0 4
- 많은 변수에 라벨을 적용하기 위해 함수 생성.
Python Programming |
def myRecoder(x):
if x == 1:
return 2
elif x == 5:
return 4
else:
return x
# 하나의 변수에 함수 적용.
myQRvars['qr1'].apply(lambda x:myRecoder(x))
Results |
0 2
1 2
2 2
3 3
4 4
5 4
6 4
7 4
Name: qr1, dtype: int64
- lapply함수를 사용하여 모든 변수에 적용하기.
Python Programming |
withmooc= mydata.copy()
display(withmooc)
withmooc[['q1','q2','q3','q4']]= withmooc[['q1','q2','q3','q4']].applymap(myRecoder)
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
- 원 변수와 새 변수 합치기.
Python Programming |
withmooc = pd.concat([withmooc,myQRvars],axis = 1)
withmooc
Results |
id workshop gender q1 q2 q3 q4 qr1 qr2 qr3 qr4
0 1 1 f 2 2 4.0 2 2 2 4.0 2
1 2 2 f 2 2 4.0 2 2 2 4.0 2
2 3 1 f 2 2 4.0 3 2 2 4.0 3
3 4 2 f 3 2 NaN 3 3 2 NaN 3
4 5 1 m 4 4 2.0 4 4 4 2.0 4
5 6 2 m 4 4 4.0 4 4 4 4.0 4
6 7 1 m 4 3 4.0 4 4 3 4.0 4
7 8 2 m 4 4 4.0 4 4 4 4.0 4
- 통계량 산출
Python Programming |
withmooc.describe()
Results |
q1 q2 q3 q4 qr1 qr2 qr3 qr4
count 8.000000 8.000000 7.000000 8.000000 8.000000 8.000000 7.000000 8.000000
mean 3.125000 2.875000 3.714286 3.250000 3.125000 2.875000 3.714286 3.250000
std 0.991031 0.991031 0.755929 0.886405 0.991031 0.991031 0.755929 0.886405
min 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000
25% 2.000000 2.000000 4.000000 2.750000 2.000000 2.000000 4.000000 2.750000
50% 3.500000 2.500000 4.000000 3.500000 3.500000 2.500000 4.000000 3.500000
75% 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
max 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000 4.000000
7. Python - dfply
Python Programming |
import pandas as pd
from dfply import *
mydata = pd.read_csv("c:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
# 모든 변수 선택하기.
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
- replace 함수를 사용하여서 1은 2로 5는 4로 변경
Python Programming |
withmooc >> mutate(q11 = X.q1.replace({1, 5},{2,4}))
Results |
id workshop gender q1 q2 q3 q4 q11
0 1 1 f 1 1 5.0 1 2
1 2 2 f 2 1 4.0 1 2
2 3 1 f 2 2 4.0 3 2
3 4 2 f 3 1 NaN 3 3
4 5 1 m 4 5 2.0 4 4
5 6 2 m 5 4 5.0 5 4
6 7 1 m 5 3 4.0 4 4
7 8 2 m 4 5 5.0 5 4
Python Programming |
withmooc >> mutate(q1=make_symbolic(np.where)( (X.q1==1) ,2, make_symbolic(np.where)( (X.q1==5) ,4,X.q1 )),
q2=make_symbolic(np.where)( (X.q2==1) ,2, make_symbolic(np.where)( (X.q2==5) ,4,X.q2 )),
q3=make_symbolic(np.where)( (X.q3==1) ,2, make_symbolic(np.where)( (X.q3==5) ,4,X.q3 )),
q4=make_symbolic(np.where)( (X.q4==1) ,2, make_symbolic(np.where)( (X.q4==5) ,4,X.q4 ))
)
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
Python Programming |
# [참고]
withmooc >> mutate(a06=if_else((X['q3'].apply(pd.isnull)),300,1))
Results |
id workshop gender q1 q2 q3 q4 a06
0 1 1 f 1 1 5.0 1 1
1 2 2 f 2 1 4.0 1 1
2 3 1 f 2 2 4.0 3 1
3 4 2 f 3 1 NaN 3 300
4 5 1 m 4 5 2.0 4 1
5 6 2 m 5 4 5.0 5 1
6 7 1 m 5 3 4.0 4 1
7 8 2 m 4 5 5.0 5 1
Python Programming |
@make_symbolic
def My5isNA(x):
return make_symbolic(np.where)( (x==1) ,2, make_symbolic(np.where)( (x==5) ,4,x ))
withmooc >> mutate(a06=X['q2'].apply(My5isNA))
Results |
id workshop gender q1 q2 q3 q4 a06
0 1 1 f 1 1 5.0 1 2
1 2 2 f 2 1 4.0 1 2
2 3 1 f 2 2 4.0 3 2
3 4 2 f 3 1 NaN 3 2
4 5 1 m 4 5 2.0 4 4
5 6 2 m 5 4 5.0 5 4
6 7 1 m 5 3 4.0 4 3
7 8 2 m 4 5 5.0 5 4
Python Programming |
@make_symbolic
def My5isNA(x):
return make_symbolic(np.where)( (x==1) ,2, make_symbolic(np.where)( (x==5) ,4,x ))
@pipe
@symbolic_evaluation()
def symbolic_double(df, serieses):
for series in serieses:
df[series.name] = My5isNA(series)
return df
withmooc >> symbolic_double([X.q1, X.q2, X.q3, X.q4])
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
- 수치형 변수에 대한 자동 포맷팅
Python Programming |
@make_symbolic
def My5isNA(x):
return make_symbolic(np.where)( (x==1) ,2 , make_symbolic(np.where)( (x==5) ,4,x ))
withmooc >> mutate(**{
**{f"{x}": X[x].apply(My5isNA) for x in withmooc.select_dtypes([int,float]).columns}
})
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 2 2 4.0 2
1 2 2 f 2 2 4.0 2
2 3 1 f 2 2 4.0 3
3 4 2 f 3 2 NaN 3
4 5 1 m 4 4 2.0 4
5 6 2 m 4 4 4.0 4
6 7 1 m 4 3 4.0 4
7 8 2 m 4 4 4.0 4
- 통계량 일괄 산출
Python Programming |
withmooc >> summarize(**{
**{f"{x}_mean": X[x].mean() for x in mydata.select_dtypes(int).columns},
**{f"{x}_std" : X[x].std() for x in mydata.select_dtypes(int).columns}
})
Results |
q1_mean q2_mean q4_mean q1_std q2_std q4_std
0 3.25 2.75 3.25 1.488048 1.752549 1.581139

통계프로그램 비교 목록(Proc sql, SAS, SPSS, R 프로그래밍, R Tidyverse, Python Pandas, Python Dfply) |
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 |
반응형
'통계프로그램 비교 시리즈 > 데이터 전처리 비교' 카테고리의 다른 글
[데이터 관리] 8. By 또는 Split 파일 프로세싱 (0) | 2022.01.11 |
---|---|
[데이터 관리] 7. 변수 Keeping과 Dropping (0) | 2022.01.10 |
통계프로그램 비교 시리즈 – [데이터 관리] 5. 변수명 변경(Rename) (0) | 2022.01.10 |
통계프로그램 비교 시리즈 – [데이터 관리] 4. 다중 조건에 의한 변환 (0) | 2022.01.10 |
통계프로그램 비교 시리즈 – [데이터 관리] 3. 결측값 할당을 위한 조건 변환 (0) | 2022.01.10 |
댓글