본문 바로가기
통계프로그램 비교 시리즈/오라클함수 비교(R & Python)

CUME_DIST 오라클 함수 [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table, DuckDB]

by 기서무나구물 2021. 11. 22.

포스팅 목차

    * 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크


    [ CUME_DIST Oracle Function ]

     


    CUME_DIST함수는 값의 그룹에 있는 값의 누적 분포치를 계산한다. 반환되는 값의 범위는 0보다 크고 1보다 작다. 이값은 항상 같은 누적 값에 대하여 평가한다.

     

     


    1. Oracle(오라클)

     

    Oracle Programming
    SELECT JOB, ENAME, SAL, 
           CUME_DIST() OVER (PARTITION BY JOB ORDER BY SAL) AS cume_dist
    FROM   EMP
    ORDER 
       BY  JOB, SAL;

     

    Results
    JOB		ENAME	SAL	CUME_DIST
    -------------------------------------------
    ANALYST		FORD	3000	1
    ANALYST		SCOTT	3000	1
    CLERK		SMITH	800	.25
    CLERK		JAMES	950	.5
    CLERK		ADAMS	1100	.75
    CLERK		MILLER	1300	1
    MANAGER		CLARK	2450	.33333
    MANAGER		BLAKE	2850	.66666
    MANAGER		JONES	2975	1
    PRESIDENT	KING	5000	1
    SALESMAN	MARTIN	1250	.5
    SALESMAN	WARD	1250	.5
    SALESMAN	TURNER	1500	.75
    SALESMAN	ALLEN	1600	1

     


    2. Python Pandas(파이썬)

     

    Python Programming
    withmooc = emp.copy()
    
    withmooc['cume_dist'] = emp.groupby('job')['sal'].transform(lambda x: x.rank(method='max') / len(x))
    
    withmooc.head()

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	cume_dist
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	0.25
    1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1.00
    2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	0.50
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	1.00
    4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	0.50​

     


     

     

    Python Programming
    withmooc = emp.copy()
    
    withmooc['cume_dist'] = emp.groupby('job')['sal'].apply(lambda x: x.rank(method='max') / len(x))
    
    withmooc.head()

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	cume_dist
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	0.25
    1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1.00
    2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	0.50
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	1.00
    4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	0.50​

     

     


    3. R Programming (R Package)

     

    R Programming
    %%R
    
    withmooc <- emp
    
    withmooc['sal_cume_dist'] = ave(x = withmooc[, 'sal'], withmooc[, 'job'], FUN = dplyr::cume_dist)
    
    withmooc[1:5, ]

     

    Results
    # A tibble: 5 x 9
      empno ename  job        mgr hiredate     sal  comm deptno sal_cume_dist
      <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>         <dbl>
    1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20          0.25
    2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30          1   
    3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30          0.5 
    4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20          1   
    5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30          0.5 

     


     

    R Programming
    %%R
    
    withmooc <- emp
    
    withmooc['cume_dist'] = ave(withmooc[, 'sal'], withmooc[, 'job'], FUN = function (x) { rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x)) }  )
    
    withmooc[1:5, ]

     

    Results
    # A tibble: 5 x 9
      empno ename  job        mgr hiredate     sal  comm deptno cume_dist
      <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
    1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20      0.25
    2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30      1   
    3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30      0.5 
    4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20      1   
    5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30      0.5 

     


     

    R Programming
    %%R
    
    library(plyr)
    ddply(emp,.(job),transform,cume_dist = dplyr::cume_dist(sal))[1:5, ]

     

    Results
      empno ename     job  mgr   hiredate  sal comm deptno cume_dist
    1  7788 SCOTT ANALYST 7566 1982-12-09 3000   NA     20      1.00
    2  7902  FORD ANALYST 7566 1981-12-03 3000   NA     20      1.00
    3  7369 SMITH   CLERK 7902 1980-12-17  800   NA     20      0.25
    4  7876 ADAMS   CLERK 7788 1983-01-12 1100   NA     20      0.75
    5  7900 JAMES   CLERK 7698 1981-12-03  950   NA     30      0.50

     


     

    R Programming
    %%R
    
    do.call(rbind,lapply(split(emp,emp$job),transform, cume_dist = cume_dist(sal)))[1:5, ]

     

    Results
              empno ename     job  mgr   hiredate  sal comm deptno cume_dist
    ANALYST.1  7788 SCOTT ANALYST 7566 1982-12-09 3000   NA     20      1.00
    ANALYST.2  7902  FORD ANALYST 7566 1981-12-03 3000   NA     20      1.00
    CLERK.1    7369 SMITH   CLERK 7902 1980-12-17  800   NA     20      0.25
    CLERK.2    7876 ADAMS   CLERK 7788 1983-01-12 1100   NA     20      0.75
    CLERK.3    7900 JAMES   CLERK 7698 1981-12-03  950   NA     30      0.50

     


     

    R Programming
    %%R
    withmooc <- emp[order(emp$job,emp$sal),]
    
    withmooc['cume_dist'] <- as.vector(unlist(with(withmooc, tapply(sal, job, function(x) cume_dist(x)))))
    
    withmooc[1:5, ]

     

    Results
    # A tibble: 5 x 9
      empno ename job       mgr hiredate     sal  comm deptno cume_dist
      <dbl> <chr> <chr>   <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
    1  7788 SCOTT ANALYST  7566 1982-12-09  3000    NA     20      1   
    2  7902 FORD  ANALYST  7566 1981-12-03  3000    NA     20      1   
    3  7369 SMITH CLERK    7902 1980-12-17   800    NA     20      0.25
    4  7900 JAMES CLERK    7698 1981-12-03   950    NA     30      0.5 
    5  7876 ADAMS CLERK    7788 1983-01-12  1100    NA     20      0.75

     

     


    [참고] dplyr 함수 구문 참고

     

    R Programming
    %%R
    
    dplyr::cume_dist

     

    Results
    function (x) 
    {
        rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x))
    }
    <bytecode: 0x000001b0d7fcbfa0>
    <environment: namespace:dplyr>

     

     


    [참고] 2개 통계치 적용 방식

     

    R Programming
    %%R
    
    aggregate(sal ~ job, data = emp, FUN = plyr::each(avg = mean, n = length) )

     

    Results
            job  sal.avg    sal.n
    1   ANALYST 3000.000    2.000
    2     CLERK 1037.500    4.000
    3   MANAGER 2758.333    3.000
    4 PRESIDENT 5000.000    1.000
    5  SALESMAN 1400.000    4.000

     

     


    4. R Dplyr Package

     

    R Programming
    %%R
    
    emp %>%
      dplyr::group_by(job) %>%
      dplyr::mutate( sal_cume_dist = cume_dist(sal)) %>%
      head()

     

    Results
    # A tibble: 6 x 9
    # Groups:   job [3]
      empno ename  job        mgr hiredate     sal  comm deptno sal_cume_dist
      <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>         <dbl>
    1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20         0.25 
    2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30         1    
    3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30         0.5  
    4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20         1    
    5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30         0.5  
    6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30         0.667

     


     

    R Programming
    %%R
    
    emp %>%
      dplyr::group_by(job) %>%
      dplyr::mutate( sal_cume_dist = rank(sal, ties.method = "max", na.last = "keep")/sum(!is.na(sal)) ) %>%
      head()

     

    Results
    # A tibble: 6 x 9
    # Groups:   job [3]
      empno ename  job        mgr hiredate     sal  comm deptno sal_cume_dist
      <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>         <dbl>
    1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20         0.25 
    2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30         1    
    3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30         0.5  
    4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20         1    
    5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30         0.5  
    6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30         0.667

     

     


    5. R sqldf Package

     

    R Programming
    %%R
    
    sqldf(" SELECT job, ename, sal, CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist
            FROM emp")[1:7, ]

     

    Results
          job  ename  sal cume_dist
    1 ANALYST  SCOTT 3000 1.0000000
    2 ANALYST   FORD 3000 1.0000000
    3   CLERK  SMITH  800 0.2500000
    4   CLERK  JAMES  950 0.5000000
    5   CLERK  ADAMS 1100 0.7500000
    6   CLERK MILLER 1300 1.0000000
    7 MANAGER  CLARK 2450 0.3333333

     

     


    6. Python pandasql Package

     

    Python Programming
    ps.sqldf(" SELECT job, ename, sal, CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist FROM emp ").head()

     

    Results
    	job	ename	sal	cume_dist
    0	ANALYST	SCOTT	3000	1.00
    1	ANALYST	FORD	3000	1.00
    2	CLERK	SMITH	800	0.25
    3	CLERK	JAMES	950	0.50
    4	CLERK	ADAMS	1100	0.75​

     

     


    7. R data.table Package

     

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[, cume_dist := dplyr::cume_dist(sal),by=job][1:5, ]

     

    Results
       empno  ename      job  mgr   hiredate  sal comm deptno cume_dist
    1:  7369  SMITH    CLERK 7902 1980-12-17  800   NA     20      0.25
    2:  7499  ALLEN SALESMAN 7698 1981-02-20 1600  300     30      1.00
    3:  7521   WARD SALESMAN 7698 1981-02-22 1250  500     30      0.50
    4:  7566  JONES  MANAGER 7839 1981-04-02 2975   NA     20      1.00
    5:  7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400     30      0.50

     


     

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[, cume_dist := sapply(.SD , function (x) {rank(x, ties.method = "max", na.last = "keep")/sum(!is.na(x)) } ) , by = job ,.SDcols=c("sal")][1:5, ]

     

    Results
       empno  ename      job  mgr   hiredate  sal comm deptno cume_dist
    1:  7369  SMITH    CLERK 7902 1980-12-17  800   NA     20      0.25
    2:  7499  ALLEN SALESMAN 7698 1981-02-20 1600  300     30      1.00
    3:  7521   WARD SALESMAN 7698 1981-02-22 1250  500     30      0.50
    4:  7566  JONES  MANAGER 7839 1981-04-02 2975   NA     20      1.00
    5:  7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400     30      0.50

     


    8. Python DuckDB의 SQL

     

    Python Programming
    %%sql
      SELECT job, ename, sal,
             CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist
      FROM   emp
    Python Programming
    duckdb.sql(" SELECT job, ename, sal,                                                \
                        CUME_DIST() OVER (PARTITION BY job ORDER BY sal) AS cume_dist   \
                 FROM   emp                                                             \
                 LIMIT 6 ").df()

     

     

    Results
            job   ename   sal  cume_dist
    0   ANALYST   SCOTT  3000       1.00
    1   ANALYST    FORD  3000       1.00
    2  SALESMAN    WARD  1250       0.50
    3  SALESMAN  MARTIN  1250       0.50
    4  SALESMAN  TURNER  1500       0.75
    5  SALESMAN   ALLEN  1600       1.00

     

     

     


    101 daehak ro, ihwa dong, jongno gu, seoul, south korea (https://unsplash.com/photos/p_D5pbQG5TE)

     

     

    [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록? 링크

     

    오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크

     

    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크
    반응형

    댓글