포스팅 목차
* 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
[ CUME_DIST Oracle Function ]
CUME_DIST함수는 값의 그룹에 있는 값의 누적 분포치를 계산한다. 반환되는 값의 범위는 0보다 크고 1보다 작다. 이값은 항상 같은 누적 값에 대하여 평가한다.
- 함수 설명 : CUME_DIST 오라클 함수 링크
1. Oracle(오라클)
Oracle Programming |
SELECT JOB, ENAME, SAL,
CUME_DIST() OVER (PARTITION BY JOB ORDER BY SAL) AS cume_dist
FROM EMP
ORDER
BY JOB, SAL;
Results |
JOB ENAME SAL CUME_DIST
-------------------------------------------
ANALYST FORD 3000 1
ANALYST SCOTT 3000 1
CLERK SMITH 800 .25
CLERK JAMES 950 .5
CLERK ADAMS 1100 .75
CLERK MILLER 1300 1
MANAGER CLARK 2450 .33333
MANAGER BLAKE 2850 .66666
MANAGER JONES 2975 1
PRESIDENT KING 5000 1
SALESMAN MARTIN 1250 .5
SALESMAN WARD 1250 .5
SALESMAN TURNER 1500 .75
SALESMAN ALLEN 1600 1
2. Python Pandas(파이썬)
Python Programming |
withmooc = emp.copy()
withmooc['cume_dist'] = emp.groupby('job')['sal'].transform(lambda x: x.rank(method='max') / len(x))
withmooc.head()
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 0.25
1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1.00
2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 0.50
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1.00
4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 0.50
Python Programming |
withmooc = emp.copy()
withmooc['cume_dist'] = emp.groupby('job')['sal'].apply(lambda x: x.rank(method='max') / len(x))
withmooc.head()
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 0.25
1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1.00
2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 0.50
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1.00
4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 0.50
3. R Programming (R Package)
R Programming |
%%R
withmooc <- emp
withmooc['sal_cume_dist'] = ave(x = withmooc[, 'sal'], withmooc[, 'job'], FUN = dplyr::cume_dist)
withmooc[1:5, ]
Results |
# A tibble: 5 x 9
empno ename job mgr hiredate sal comm deptno sal_cume_dist
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.5
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.5
R Programming |
%%R
withmooc <- emp
withmooc['cume_dist'] = ave(withmooc[, 'sal'], withmooc[, 'job'], FUN = function (x) { rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x)) } )
withmooc[1:5, ]
Results |
# A tibble: 5 x 9
empno ename job mgr hiredate sal comm deptno cume_dist
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.5
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.5
R Programming |
%%R
library(plyr)
ddply(emp,.(job),transform,cume_dist = dplyr::cume_dist(sal))[1:5, ]
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
1 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 1.00
2 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1.00
3 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
4 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 0.75
5 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 0.50
R Programming |
%%R
do.call(rbind,lapply(split(emp,emp$job),transform, cume_dist = cume_dist(sal)))[1:5, ]
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
ANALYST.1 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 1.00
ANALYST.2 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1.00
CLERK.1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
CLERK.2 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 0.75
CLERK.3 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 0.50
R Programming |
%%R
withmooc <- emp[order(emp$job,emp$sal),]
withmooc['cume_dist'] <- as.vector(unlist(with(withmooc, tapply(sal, job, function(x) cume_dist(x)))))
withmooc[1:5, ]
Results |
# A tibble: 5 x 9
empno ename job mgr hiredate sal comm deptno cume_dist
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 1
2 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1
3 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
4 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 0.5
5 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 0.75
[참고] dplyr 함수 구문 참고
R Programming |
%%R
dplyr::cume_dist
Results |
function (x)
{
rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x))
}
<bytecode: 0x000001b0d7fcbfa0>
<environment: namespace:dplyr>
[참고] 2개 통계치 적용 방식
R Programming |
%%R
aggregate(sal ~ job, data = emp, FUN = plyr::each(avg = mean, n = length) )
Results |
job sal.avg sal.n
1 ANALYST 3000.000 2.000
2 CLERK 1037.500 4.000
3 MANAGER 2758.333 3.000
4 PRESIDENT 5000.000 1.000
5 SALESMAN 1400.000 4.000
4. R Dplyr Package
R Programming |
%%R
emp %>%
dplyr::group_by(job) %>%
dplyr::mutate( sal_cume_dist = cume_dist(sal)) %>%
head()
Results |
# A tibble: 6 x 9
# Groups: job [3]
empno ename job mgr hiredate sal comm deptno sal_cume_dist
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.5
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.5
6 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 0.667
R Programming |
%%R
emp %>%
dplyr::group_by(job) %>%
dplyr::mutate( sal_cume_dist = rank(sal, ties.method = "max", na.last = "keep")/sum(!is.na(sal)) ) %>%
head()
Results |
# A tibble: 6 x 9
# Groups: job [3]
empno ename job mgr hiredate sal comm deptno sal_cume_dist
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.5
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.5
6 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 0.667
5. R sqldf Package
R Programming |
%%R
sqldf(" SELECT job, ename, sal, CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist
FROM emp")[1:7, ]
Results |
job ename sal cume_dist
1 ANALYST SCOTT 3000 1.0000000
2 ANALYST FORD 3000 1.0000000
3 CLERK SMITH 800 0.2500000
4 CLERK JAMES 950 0.5000000
5 CLERK ADAMS 1100 0.7500000
6 CLERK MILLER 1300 1.0000000
7 MANAGER CLARK 2450 0.3333333
6. Python pandasql Package
Python Programming |
ps.sqldf(" SELECT job, ename, sal, CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist FROM emp ").head()
Results |
job ename sal cume_dist
0 ANALYST SCOTT 3000 1.00
1 ANALYST FORD 3000 1.00
2 CLERK SMITH 800 0.25
3 CLERK JAMES 950 0.50
4 CLERK ADAMS 1100 0.75
7. R data.table Package
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[, cume_dist := dplyr::cume_dist(sal),by=job][1:5, ]
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
1: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2: 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1.00
3: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.50
4: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1.00
5: 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.50
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[, cume_dist := sapply(.SD , function (x) {rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x)) } ) , by = job ,.SDcols=c("sal")][1:5, ]
Results |
empno ename job mgr hiredate sal comm deptno cume_dist
1: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 0.25
2: 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1.00
3: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 0.50
4: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1.00
5: 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 0.50
8. Python DuckDB의 SQL
Python Programming |
%%sql
SELECT job, ename, sal,
CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist
FROM emp
Python Programming |
duckdb.sql(" SELECT job, ename, sal, \
CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist \
FROM emp \
LIMIT 6 ").df()
Results |
job ename sal cume_dist
0 ANALYST SCOTT 3000 1.00
1 ANALYST FORD 3000 1.00
2 SALESMAN WARD 1250 0.50
3 SALESMAN MARTIN 1250 0.50
4 SALESMAN TURNER 1500 0.75
5 SALESMAN ALLEN 1600 1.00
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록? 링크 |
오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크 |
반응형
댓글