포스팅 목차
[데이터 관리] 8. By 또는 Split 파일 프로세싱
1. Proc SQL
- By 또는 Split 파일 프로세싱을 위한 PROC SQL 프로그램
- Group by 옵션을 이용하여, 성별을 기준으로 지정된 각 함수 처리
SAS Programming |
proc sql;
create table withmooc as
select gender,
sum(q1) as q1_sum,
mean(q2) as q2_mean,
min(q3) as q3_min,
max(q4) as q4_max
from BACK.mydata a
group by gender;
select * from withmooc;
quit;
Results |
gender q1_sum q2_mean q3_min q4_max
------------------------------------------------
f 8 1.25 4 3
m 18 4.25 2 5
SAS Programming |
proc sql;
create table withmooc as
select workshop,
gender,
sum(q1) as q1_sum,
mean(q2) as q2_mean,
min(q3) as q3_min,
max(q4) as q4_max
from BACK.mydata a
group by workshop,
gender;
select * from withmooc;
quit;
Results |
workshop gender q1_sum q2_mean q3_min q4_max
----------------------------------------------------------
1 f 3 1.5 4 3
1 m 9 4 2 4
2 f 5 1 4 3
2 m 9 4.5 5 5
2. SAS Programming
- By 또는 Split 파일 프로세싱을 위한 SAS 프로그램
- 성별을 기준으로 하여 데이터 분석;
SAS Programming |
* 성별을 기준으로 하여 데이터 소트;
PROC SORT DATA=BACK.withmooc;
BY gender;
run;
* 성별을 기준으로 하여 데이터 분석;
PROC MEANS DATA=BACK.withmooc;
BY gender;
run;
Results |
---------------------------------- gender=f ----------------------------------
MEANS 프로시저
변수 N 평균값 표준편차 최소값 최대값
-----------------------------------------------------------------------------
id 4 2.5000000 1.2909944 1.0000000 4.0000000
workshop 4 1.5000000 0.5773503 1.0000000 2.0000000
q1 4 2.0000000 0.8164966 1.0000000 3.0000000
q2 4 1.2500000 0.5000000 1.0000000 2.0000000
q3 3 4.3333333 0.5773503 4.0000000 5.0000000
q4 4 2.0000000 1.1547005 1.0000000 3.0000000
-----------------------------------------------------------------------------
---------------------------------- gender=m ----------------------------------
변수 N 평균값 표준편차 최소값 최대값
-----------------------------------------------------------------------------
id 4 6.5000000 1.2909944 5.0000000 8.0000000
workshop 4 1.5000000 0.5773503 1.0000000 2.0000000
q1 4 4.5000000 0.5773503 4.0000000 5.0000000
q2 4 4.2500000 0.9574271 3.0000000 5.0000000
q3 4 4.0000000 1.4142136 2.0000000 5.0000000
q4 4 4.5000000 0.5773503 4.0000000 5.0000000
-----------------------------------------------------------------------------
- Class 절을 사용하여 변수 소트 없이 데이터 분석 가능.;
SAS Programming |
PROC MEANS DATA=BACK.withmooc;
CLASS gender;
run;
Results |
관측치
gender 수 변수 N 평균값 표준편차 최소값 최대값
---------------------------------------------------------------------------------------------------
f 4 id 4 2.5000000 1.2909944 1.0000000 4.0000000
workshop 4 1.5000000 0.5773503 1.0000000 2.0000000
q1 4 2.0000000 0.8164966 1.0000000 3.0000000
q2 4 1.2500000 0.5000000 1.0000000 2.0000000
q3 3 4.3333333 0.5773503 4.0000000 5.0000000
q4 4 2.0000000 1.1547005 1.0000000 3.0000000
m 4 id 4 6.5000000 1.2909944 5.0000000 8.0000000
workshop 4 1.5000000 0.5773503 1.0000000 2.0000000
q1 4 4.5000000 0.5773503 4.0000000 5.0000000
q2 4 4.2500000 0.9574271 3.0000000 5.0000000
q3 4 4.0000000 1.4142136 2.0000000 5.0000000
q4 4 4.5000000 0.5773503 4.0000000 5.0000000
---------------------------------------------------------------------------------------------------
3. SPSS
- By 또는 Split 파일 프로세싱을 위한 SPSS 프로그램.
SPSS Programming |
GET FILE="C:\mydata.sav".
SORT CASES BY gender .
SPLIT FILE
SEPARATE BY gender .
DESCRIPTIVES
VARIABLES=q1 q2 q3 q4
/STATISTICS=MEAN STDDEV MIN MAX .
4. R Programming (R-PROJECT)
- By 또는 Split 파일 프로세싱을 위한 프로그램.
R Programming |
from rpy2.robjects import r
%load_ext rpy2.ipython
Results |
The rpy2.ipython extension is already loaded. To reload it, use:
%reload_ext rpy2.ipython
R Programming |
%%R
library(tidyverse)
library(psych)
load(file="C:/work/data/mydata.Rdata")
withmooc = mydata
attach(withmooc) # mydata를 기본 데이터 세트로 지정.
withmooc
Results |
R[write to console]: The following objects are masked from withmooc (pos = 3):
gender, q1, q2, q3, q4, workshop
workshop gender q1 q2 q3 q4
1 1 f 1 1 5 1
2 2 f 2 1 4 1
3 1 f 2 2 4 3
4 2 f 3 1 NA 3
5 1 m 4 5 2 4
6 2 m 5 4 5 5
7 1 m 5 3 4 4
8 2 m 4 5 5 5
- 관측치와 모든 변수의 요약 통계 구하기
R Programming |
%%R
summary(withmooc)
Results |
workshop gender q1 q2 q3 q4
Min. :1.0 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:1.0 Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :1.5 Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :1.5 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:2.0 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :2.0 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
psych::describe 함수
R Programming |
%%R
psych::describe(withmooc)
Results |
vars n mean sd median trimmed mad min max range skew kurtosis se
workshop 1 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
gender* 2 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
q1 3 8 3.25 1.49 3.5 3.25 2.22 1 5 4 -0.14 -1.73 0.53
q2 4 8 2.75 1.75 2.5 2.75 2.22 1 5 4 0.19 -1.91 0.62
q3 5 7 4.14 1.07 4.0 4.14 1.48 2 5 3 -0.93 -0.52 0.40
q4 6 8 3.25 1.58 3.5 3.25 1.48 1 5 4 -0.36 -1.59 0.56
- 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
R Programming |
%%R
by(withmooc, gender, summary)
Results |
gender: f
workshop gender q1 q2 q3 q4
Min. :1.0 Length:4 Min. :1.00 Min. :1.00 Min. :4.000 Min. :1
1st Qu.:1.0 Class :character 1st Qu.:1.75 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:1
Median :1.5 Mode :character Median :2.00 Median :1.00 Median :4.000 Median :2
Mean :1.5 Mean :2.00 Mean :1.25 Mean :4.333 Mean :2
3rd Qu.:2.0 3rd Qu.:2.25 3rd Qu.:1.25 3rd Qu.:4.500 3rd Qu.:3
Max. :2.0 Max. :3.00 Max. :2.00 Max. :5.000 Max. :3
NA's :1
------------------------------------------------------------
gender: m
workshop gender q1 q2 q3 q4
Min. :1.0 Length:4 Min. :4.0 Min. :3.00 Min. :2.0 Min. :4.0
1st Qu.:1.0 Class :character 1st Qu.:4.0 1st Qu.:3.75 1st Qu.:3.5 1st Qu.:4.0
Median :1.5 Mode :character Median :4.5 Median :4.50 Median :4.5 Median :4.5
Mean :1.5 Mean :4.5 Mean :4.25 Mean :4.0 Mean :4.5
3rd Qu.:2.0 3rd Qu.:5.0 3rd Qu.:5.00 3rd Qu.:5.0 3rd Qu.:5.0
Max. :2.0 Max. :5.0 Max. :5.00 Max. :5.0 Max. :5.0
- 열 이름에 의해 선택된 변수에 대하여 성별에 대하여 각 값에 대한 요약 통계.
R Programming |
%%R
by( withmooc[c("q1","q2","q3","q4")] , gender, summary)
Results |
gender: f
q1 q2 q3 q4
Min. :1.00 Min. :1.00 Min. :4.000 Min. :1
1st Qu.:1.75 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:1
Median :2.00 Median :1.00 Median :4.000 Median :2
Mean :2.00 Mean :1.25 Mean :4.333 Mean :2
3rd Qu.:2.25 3rd Qu.:1.25 3rd Qu.:4.500 3rd Qu.:3
Max. :3.00 Max. :2.00 Max. :5.000 Max. :3
NA's :1
------------------------------------------------------------
gender: m
q1 q2 q3 q4
Min. :4.0 Min. :3.00 Min. :2.0 Min. :4.0
1st Qu.:4.0 1st Qu.:3.75 1st Qu.:3.5 1st Qu.:4.0
Median :4.5 Median :4.50 Median :4.5 Median :4.5
Mean :4.5 Mean :4.25 Mean :4.0 Mean :4.5
3rd Qu.:5.0 3rd Qu.:5.00 3rd Qu.:5.0 3rd Qu.:5.0
Max. :5.0 Max. :5.00 Max. :5.0 Max. :5.0
- 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
- 데이터는 workshop과 gender로 정렬될 필요가 없다.
- workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계.
R Programming |
%%R
by(withmooc[c("q1","q2","q3","q4")],
data.frame(workshop,gender), summary)
Results |
workshop: 1
gender: f
q1 q2 q3 q4
Min. :1.00 Min. :1.00 Min. :4.00 Min. :1.0
1st Qu.:1.25 1st Qu.:1.25 1st Qu.:4.25 1st Qu.:1.5
Median :1.50 Median :1.50 Median :4.50 Median :2.0
Mean :1.50 Mean :1.50 Mean :4.50 Mean :2.0
3rd Qu.:1.75 3rd Qu.:1.75 3rd Qu.:4.75 3rd Qu.:2.5
Max. :2.00 Max. :2.00 Max. :5.00 Max. :3.0
------------------------------------------------------------
workshop: 2
gender: f
q1 q2 q3 q4
Min. :2.00 Min. :1 Min. :4 Min. :1.0
1st Qu.:2.25 1st Qu.:1 1st Qu.:4 1st Qu.:1.5
Median :2.50 Median :1 Median :4 Median :2.0
Mean :2.50 Mean :1 Mean :4 Mean :2.0
3rd Qu.:2.75 3rd Qu.:1 3rd Qu.:4 3rd Qu.:2.5
Max. :3.00 Max. :1 Max. :4 Max. :3.0
NA's :1
------------------------------------------------------------
workshop: 1
gender: m
q1 q2 q3 q4
Min. :4.00 Min. :3.0 Min. :2.0 Min. :4
1st Qu.:4.25 1st Qu.:3.5 1st Qu.:2.5 1st Qu.:4
Median :4.50 Median :4.0 Median :3.0 Median :4
Mean :4.50 Mean :4.0 Mean :3.0 Mean :4
3rd Qu.:4.75 3rd Qu.:4.5 3rd Qu.:3.5 3rd Qu.:4
Max. :5.00 Max. :5.0 Max. :4.0 Max. :4
------------------------------------------------------------
workshop: 2
gender: m
q1 q2 q3 q4
Min. :4.00 Min. :4.00 Min. :5 Min. :5
1st Qu.:4.25 1st Qu.:4.25 1st Qu.:5 1st Qu.:5
Median :4.50 Median :4.50 Median :5 Median :5
Mean :4.50 Mean :4.50 Mean :5 Mean :5
3rd Qu.:4.75 3rd Qu.:4.75 3rd Qu.:5 3rd Qu.:5
Max. :5.00 Max. :5.00 Max. :5 Max. :5
- 위 예제에서 by 문 안의 옵션을 사전에 정의하여 처리.
R Programming |
%%R
myVars <- c("q1","q2","q3","q4")
myBys <- data.frame(workshop,gender)
by( withmooc[myVars], myBys, summary)
Results |
workshop: 1
gender: f
q1 q2 q3 q4
Min. :1.00 Min. :1.00 Min. :4.00 Min. :1.0
1st Qu.:1.25 1st Qu.:1.25 1st Qu.:4.25 1st Qu.:1.5
Median :1.50 Median :1.50 Median :4.50 Median :2.0
Mean :1.50 Mean :1.50 Mean :4.50 Mean :2.0
3rd Qu.:1.75 3rd Qu.:1.75 3rd Qu.:4.75 3rd Qu.:2.5
Max. :2.00 Max. :2.00 Max. :5.00 Max. :3.0
------------------------------------------------------------
workshop: 2
gender: f
q1 q2 q3 q4
Min. :2.00 Min. :1 Min. :4 Min. :1.0
1st Qu.:2.25 1st Qu.:1 1st Qu.:4 1st Qu.:1.5
Median :2.50 Median :1 Median :4 Median :2.0
Mean :2.50 Mean :1 Mean :4 Mean :2.0
3rd Qu.:2.75 3rd Qu.:1 3rd Qu.:4 3rd Qu.:2.5
Max. :3.00 Max. :1 Max. :4 Max. :3.0
NA's :1
------------------------------------------------------------
workshop: 1
gender: m
q1 q2 q3 q4
Min. :4.00 Min. :3.0 Min. :2.0 Min. :4
1st Qu.:4.25 1st Qu.:3.5 1st Qu.:2.5 1st Qu.:4
Median :4.50 Median :4.0 Median :3.0 Median :4
Mean :4.50 Mean :4.0 Mean :3.0 Mean :4
3rd Qu.:4.75 3rd Qu.:4.5 3rd Qu.:3.5 3rd Qu.:4
Max. :5.00 Max. :5.0 Max. :4.0 Max. :4
------------------------------------------------------------
workshop: 2
gender: m
q1 q2 q3 q4
Min. :4.00 Min. :4.00 Min. :5 Min. :5
1st Qu.:4.25 1st Qu.:4.25 1st Qu.:5 1st Qu.:5
Median :4.50 Median :4.50 Median :5 Median :5
Mean :4.50 Mean :4.50 Mean :5 Mean :5
3rd Qu.:4.75 3rd Qu.:4.75 3rd Qu.:5 3rd Qu.:5
Max. :5.00 Max. :5.00 Max. :5 Max. :5
R Programming |
%%R
aggregate(withmooc[c("q1","q2","q3","q4")],list(gender = withmooc$gender),mean)
Results |
gender q1 q2 q3 q4
1 f 2.0 1.25 NA 2.0
2 m 4.5 4.25 4 4.5
R Programming |
%%R
psych::describeBy(withmooc, withmooc$gender)
Results |
Descriptive statistics by group
group: f
vars n mean sd median trimmed mad min max range skew kurtosis se
workshop 1 4 1.50 0.58 1.5 1.50 0.74 1 2 1 0.00 -2.44 0.29
gender* 2 4 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN NaN 0.00
q1 3 4 2.00 0.82 2.0 2.00 0.74 1 3 2 0.00 -1.88 0.41
q2 4 4 1.25 0.50 1.0 1.25 0.00 1 2 1 0.75 -1.69 0.25
q3 5 3 4.33 0.58 4.0 4.33 0.00 4 5 1 0.38 -2.33 0.33
q4 6 4 2.00 1.15 2.0 2.00 1.48 1 3 2 0.00 -2.44 0.58
------------------------------------------------------------
group: m
vars n mean sd median trimmed mad min max range skew kurtosis se
workshop 1 4 1.50 0.58 1.5 1.50 0.74 1 2 1 0.00 -2.44 0.29
gender* 2 4 1.00 0.00 1.0 1.00 0.00 1 1 0 NaN NaN 0.00
q1 3 4 4.50 0.58 4.5 4.50 0.74 4 5 1 0.00 -2.44 0.29
q2 4 4 4.25 0.96 4.5 4.25 0.74 3 5 2 -0.32 -2.08 0.48
q3 5 4 4.00 1.41 4.5 4.00 0.74 2 5 3 -0.53 -1.88 0.71
q4 6 4 4.50 0.58 4.5 4.50 0.74 4 5 1 0.00 -2.44 0.29
5. R - Tidyverse
- 관측치와 모든 변수의 요약 통계 구하기.
R Programming |
%%R
withmooc %>%
summary()
Results |
workshop gender q1 q2 q3 q4
Min. :1.0 Length:8 Min. :1.00 Min. :1.00 Min. :2.000 Min. :1.00
1st Qu.:1.0 Class :character 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:4.000 1st Qu.:2.50
Median :1.5 Mode :character Median :3.50 Median :2.50 Median :4.000 Median :3.50
Mean :1.5 Mean :3.25 Mean :2.75 Mean :4.143 Mean :3.25
3rd Qu.:2.0 3rd Qu.:4.25 3rd Qu.:4.25 3rd Qu.:5.000 3rd Qu.:4.25
Max. :2.0 Max. :5.00 Max. :5.00 Max. :5.000 Max. :5.00
NA's :1
R Programming |
%%R
withmooc %>%
purrr::keep(.p = is.numeric) %>% # 숫자형 데이터만 남기기
purrr::map_df(.x = .,
.f = psych::describe)
Results |
vars n mean sd median trimmed mad min max range skew kurtosis se
X1...1 1 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
X1...2 1 8 3.25 1.49 3.5 3.25 2.22 1 5 4 -0.14 -1.73 0.53
X1...3 1 8 2.75 1.75 2.5 2.75 2.22 1 5 4 0.19 -1.91 0.62
X1...4 1 7 4.14 1.07 4.0 4.14 1.48 2 5 3 -0.93 -0.52 0.40
X1...5 1 8 3.25 1.58 3.5 3.25 1.48 1 5 4 -0.36 -1.59 0.56
- 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
summarise(mean = mean(q1), n = n())
Results |
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
gender mean n
<chr> <dbl> <int>
1 f 2 4
2 m 4.5 4
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
summarise_each(funs(min, median, mean, max, n()), q1,q2,q3,q4)
Results |
# A tibble: 2 x 21
gender q1_min q2_min q3_min q4_min q1_median q2_median q3_median q4_median
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 f 1 1 NA 1 2 1 NA 2
2 m 4 3 2 4 4.5 4.5 4.5 4.5
# ... with 12 more variables: q1_mean <dbl>, q2_mean <dbl>, q3_mean <dbl>,
# q4_mean <dbl>, q1_max <int>, q2_max <int>, q3_max <int>, q4_max <int>,
# q1_n <int>, q2_n <int>, q3_n <int>, q4_n <int>
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
summarise_each(funs(mean, n()), q1,q2,q3)
Results |
# A tibble: 2 x 7
gender q1_mean q2_mean q3_mean q1_n q2_n q3_n
<chr> <dbl> <dbl> <dbl> <int> <int> <int>
1 f 2 1.25 NA 4 4 4
2 m 4.5 4.25 4 4 4 4
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
purrr::keep(.p = is.numeric) %>% # 숫자형 데이터만 남기기
purrr::map_df(.x = .,
.f = psych::describe)
Results |
vars n mean sd median trimmed mad min max range skew kurtosis se
X1...1 1 8 1.50 0.53 1.5 1.50 0.74 1 2 1 0.00 -2.23 0.19
X1...2 1 8 3.25 1.49 3.5 3.25 2.22 1 5 4 -0.14 -1.73 0.53
X1...3 1 8 2.75 1.75 2.5 2.75 2.22 1 5 4 0.19 -1.91 0.62
X1...4 1 7 4.14 1.07 4.0 4.14 1.48 2 5 3 -0.93 -0.52 0.40
X1...5 1 8 3.25 1.58 3.5 3.25 1.48 1 5 4 -0.36 -1.59 0.56
R Programming |
%%R
withmooc %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
Results |
workshop q1 q2 q3 q4
1 1.5 3.25 2.75 4.142857 3.25
R Programming |
%%R
withmooc %>%
summarise(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
Results |
workshop q1 q2 q3 q4
1 1.5 3.25 2.75 4.142857 3.25
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
summarise(across(q1:q4, mean))
Results |
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 5
gender q1 q2 q3 q4
<chr> <dbl> <dbl> <dbl> <dbl>
1 f 2 1.25 NA 2
2 m 4.5 4.25 4 4.5
R Programming |
%%R
withmooc %>%
group_by(gender) %>%
summarise(across(where(is.numeric), mean))
Results |
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 6
gender workshop q1 q2 q3 q4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 1.5 2 1.25 NA 2
2 m 1.5 4.5 4.25 4 4.5
- 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
- 데이터는 workshop과 gender로 정렬될 필요가 없다.
- workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계
R Programming |
%%R
withmooc %>%
group_by(workshop, gender) %>%
summarise_each(funs(min, median, mean, sd, max, n()), q1,q2,q3,q4)
Results |
# A tibble: 4 x 26
# Groups: workshop [2]
workshop gender q1_min q2_min q3_min q4_min q1_median q2_median q3_median
<int> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl>
1 1 f 1 1 4 1 1.5 1.5 4.5
2 1 m 4 3 2 4 4.5 4 3
3 2 f 2 1 NA 1 2.5 1 NA
4 2 m 4 4 5 5 4.5 4.5 5
# ... with 17 more variables: q4_median <dbl>, q1_mean <dbl>, q2_mean <dbl>,
# q3_mean <dbl>, q4_mean <dbl>, q1_sd <dbl>, q2_sd <dbl>, q3_sd <dbl>,
# q4_sd <dbl>, q1_max <int>, q2_max <int>, q3_max <int>, q4_max <int>,
# q1_n <int>, q2_n <int>, q3_n <int>, q4_n <int>
6. Python - Pandas
Python Programming |
import pandas as pd
import numpy as np
import sweetviz as sv
mydata = pd.read_csv("C:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
By 또는 Split 파일 프로세싱을 위한 파이썬 Pandas 프로그램.
- describe() 함수
Python Programming |
# 관측치와 모든 변수의 요약 통계 구하기.
withmooc.describe()
Results |
q1 q2 q3 q4
count 8.000000 8.000000 7.000000 8.000000
mean 3.250000 2.750000 4.142857 3.250000
std 1.488048 1.752549 1.069045 1.581139
min 1.000000 1.000000 2.000000 1.000000
25% 2.000000 1.000000 4.000000 2.500000
50% 3.500000 2.500000 4.000000 3.500000
75% 4.250000 4.250000 5.000000 4.250000
max 5.000000 5.000000 5.000000 5.000000
- summary_cont() 함수
Python Programming |
import researchpy as rp
rp.summary_cont(withmooc[['q1','q2','q3','q4']])
Results |
Variable N Mean SD SE 95% Conf. Interval
0 q1 8.0 3.2500 1.4880 0.5261 2.0060 4.4940
1 q2 8.0 2.7500 1.7525 0.6196 1.2848 4.2152
2 q3 7.0 4.1429 1.0690 0.4041 3.1542 5.1316
3 q4 8.0 3.2500 1.5811 0.5590 1.9281 4.5719
- 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
Python Programming |
withmooc.groupby(withmooc['gender']).describe()
Results |
q1 q2 ... q3 q4
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
gender
f 4.0 2.0 0.816497 1.0 1.75 2.0 2.25 3.0 4.0 1.25 ... 4.5 5.0 4.0 2.0 1.154701 1.0 1.0 2.0 3.0 3.0
m 4.0 4.5 0.577350 4.0 4.00 4.5 5.00 5.0 4.0 4.25 ... 5.0 5.0 4.0 4.5 0.577350 4.0 4.0 4.5 5.0 5.0
2 rows × 32 columns
- 성별을 기준으로 하여 각 변수에 대한 요약 통계 구하기.
Python Programming |
withmooc.groupby(withmooc['gender']).agg([np.mean, np.std])
Results |
q1 q2 q3 q4
mean std mean std mean std mean std
gender
f 2.0 0.816497 1.25 0.500000 4.333333 0.577350 2.0 1.154701
m 4.5 0.577350 4.25 0.957427 4.000000 1.414214 4.5 0.577350
Python Programming |
withmooc.groupby('gender').agg({'q1':pd.Series.nunique, 'q2':['mean', 'max','std']})
Results |
q1 q2
nunique mean max std
gender
f 3 1.25 2 0.500000
m 2 4.25 5 0.957427
- 열 이름에 의해 선택된 변수에 대하여 성별에 대하여 각 값에 대한 요약 통계.
Python Programming |
withmooc[['q1','q2','q3','q4']].groupby(withmooc['gender']).agg([np.mean, np.std])
Results |
q1 q2 q3 q4
mean std mean std mean std mean std
gender
f 2.0 0.816497 1.25 0.500000 4.333333 0.577350 2.0 1.154701
m 4.5 0.577350 4.25 0.957427 4.000000 1.414214 4.5 0.577350
Python Programming |
rp.summary_cont(withmooc.groupby(['gender'])[['q1','q2','q3','q4']])
Results |
q1 q2 ... q3 q4
N Mean SD SE 95% Conf. Interval N Mean SD SE ... SD SE 95% Conf. Interval N Mean SD SE 95% Conf. Interval
gender
f 4 2.0 0.8165 0.4082 0.7008 3.2992 4 1.25 0.5000 0.2500 ... 0.5774 0.3333 2.8991 5.7676 4 2.0 1.1547 0.5774 0.1626 3.8374
m 4 4.5 0.5774 0.2887 3.5813 5.4187 4 4.25 0.9574 0.4787 ... 1.4142 0.7071 1.7497 6.2503 4 4.5 0.5774 0.2887 3.5813 5.4187
2 rows × 24 columns
- 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
- 데이터는 workshop과 gender로 정렬될 필요가 없다.
- workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계.
Python Programming |
withmooc.groupby(['workshop','gender'])['q1','q2','q3','q4'].agg([np.mean, np.std, np.median, np.max ])
Results |
<ipython-input-32-667e0b1b1789>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
withmooc.groupby(['workshop','gender'])['q1','q2','q3','q4'].agg([np.mean, np.std, np.median, np.max ])
Results |
q1 q2 q3 q4
mean std median amax mean std median amax mean std median amax mean std median amax
workshop gender
1 f 1.5 0.707107 1.5 2 1.5 0.707107 1.5 2 4.5 0.707107 4.5 5.0 2 1.414214 2 3
m 4.5 0.707107 4.5 5 4.0 1.414214 4.0 5 3.0 1.414214 3.0 4.0 4 0.000000 4 4
2 f 2.5 0.707107 2.5 3 1.0 0.000000 1.0 1 4.0 NaN 4.0 4.0 2 1.414214 2 3
m 4.5 0.707107 4.5 5 4.5 0.707107 4.5 5 5.0 0.000000 5.0 5.0 5 0.000000 5 5
7. Python - dfply
Python Programming |
import pandas as pd
from dfply import *
mydata = pd.read_csv("c:/work/data/mydata.csv",sep=",",
dtype={'id':object,'workshop':object,
'q1':int, 'q2':int, 'q3':float, 'q4':int},
na_values=['NaN'],skipinitialspace =True)
withmooc= mydata.copy()
# 모든 변수 선택하기.
withmooc
Results |
id workshop gender q1 q2 q3 q4
0 1 1 f 1 1 5.0 1
1 2 2 f 2 1 4.0 1
2 3 1 f 2 2 4.0 3
3 4 2 f 3 1 NaN 3
4 5 1 m 4 5 2.0 4
5 6 2 m 5 4 5.0 5
6 7 1 m 5 3 4.0 4
7 8 2 m 4 5 5.0 5
- 관측치와 모든 변수의 요약 통계 구하기.
Python Programming |
withmooc >> group_by(X.gender) >> summarize(q1_n=n(X.q1), q2_n=n(X.q2))
withmooc >> group_by(X.gender) >> summarize_each([np.mean, np.std, np.median, np.max], X.q1, X.q2)
Results |
gender q1_mean q1_std q1_median q1_amax q2_mean q2_std q2_median q2_amax
0 f 2.0 0.707107 2.0 3 1.25 0.433013 1.0 2
1 m 4.5 0.500000 4.5 5 4.25 0.829156 4.5 5
Python Programming |
(withmooc>> group_by(X.gender)).describe()
Results |
q1 q2 q3 q4
count 8.000000 8.000000 7.000000 8.000000
mean 3.250000 2.750000 4.142857 3.250000
std 1.488048 1.752549 1.069045 1.581139
min 1.000000 1.000000 2.000000 1.000000
25% 2.000000 1.000000 4.000000 2.500000
50% 3.500000 2.500000 4.000000 3.500000
75% 4.250000 4.250000 5.000000 4.250000
max 5.000000 5.000000 5.000000 5.000000
- 다중 범주 변수는 리스트에서 이용되어야 하고, data.frame 함수는 리스트를 취할 수 있다.
- 데이터는 workshop과 gender로 정렬될 필요가 없다.
- workshop과 gender 변수를 기준으로 각 값에 대한 요약 통계
Python Programming |
withmooc >> \
group_by(X.workshop, X.gender) >> \
summarize_each([np.min, np.median, np.mean, np.std, np.max, np.count_nonzero], X.q1, X.q2, X.q3)
Results |
gender workshop q1_amin q1_median q1_mean q1_std q1_amax q1_count_nonzero q2_amin q2_median q2_mean q2_std q2_amax q2_count_nonzero q3_amin q3_median q3_mean q3_std q3_amax q3_count_nonzero
0 f 1 1 1.5 1.5 0.5 2 2 1 1.5 1.5 0.5 2 2 4.0 4.5 4.5 0.5 5.0 2
1 m 1 4 4.5 4.5 0.5 5 2 3 4.0 4.0 1.0 5 2 2.0 3.0 3.0 1.0 4.0 2
2 f 2 2 2.5 2.5 0.5 3 2 1 1.0 1.0 0.0 1 2 4.0 NaN 4.0 0.0 4.0 2
3 m 2 4 4.5 4.5 0.5 5 2 4 4.5 4.5 0.5 5 2 5.0 5.0 5.0 0.0 5.0 2
통계프로그램 비교 목록(Proc sql, SAS, SPSS, R 프로그래밍, R Tidyverse, Python Pandas, Python Dfply) |
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 |
반응형
'통계프로그램 비교 시리즈 > 데이터 전처리 비교' 카테고리의 다른 글
[데이터 관리] 10. Joining & Merging 데이터 프레임 (0) | 2022.01.14 |
---|---|
[데이터 관리] 9. Stacking & Concatenating & Adding Data Sets (0) | 2022.01.14 |
[데이터 관리] 7. 변수 Keeping과 Dropping (0) | 2022.01.10 |
[데이터 관리] 6. 관측값 포맷팅 & 관측값 일괄 변경하기 (0) | 2022.01.10 |
통계프로그램 비교 시리즈 – [데이터 관리] 5. 변수명 변경(Rename) (0) | 2022.01.10 |
댓글