본문 바로가기
통계프로그램 비교 시리즈/오라클함수 비교(R & Python)

PERCENTILE_DISC 오라클 함수 [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table, DuckDB]

by 기서무나구물 2021. 12. 11.

포스팅 목차

    * 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크


    [ PERCENTILE_DISC  Oracle Function ]

     


    PERCENTILE_DISC함수는 이산 분포 모형을 가정하는 역 분포 함수이다. 사용자가 지정한 백분위수에 해당하는 데이터 값을 검색하고, 정확하게 일치하는 데이터 값이 없으면 지정한 백분위 값의 주변 데이터 정렬 옵션에 따라 우선 순위가 빠른 바로 앞의 관측치 값을 반환한다. 계산 과정에서 Null은 무시된다.

     

    • 함수설명 : PERCENTILE_DISC 오라클 함수 링크
    • PERCENTILE_CONT는 데이터 세트에 있거나 없을 수 있는 적절한 값을 보간법을 사용하여서 선택하는 반면, PERCENTILE_DISC는 항상 해당 세트에서 백분위수 바로 앞에 실제로 존재하는 값을 반환합니다. 데이터 계수가 짝수인 경우 반환하는 값이 차이가 날 수 있다.

     

     


    1. Oracle(오라클)

     

    1) 집계함수

    • PERCENTILE_DISC() 함수

    개별 부서 내에서 15% 백분위에 해당하는 급여를 반환한다.

     

    Oracle Programming
    SELECT DEPTNO, 
           PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC 
    FROM   EMP 
    GROUP  BY DEPTNO

     

    Results
    DEPTNO	PERCENTILE_DISC
    ----------------------------
    10	1300
    20	800
    30	950

    2) 분석함수

    • PERCENTILE_DISC() 함수

    파티션에 해당하는 개별 부서 내에서 15% 백분위에 해당하는 급여를 기준 관측치에 반환한다.

     

    Oracle Programming
    SELECT DEPTNO, 
           SAL,
           PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) AS PERCENTILE_DISC 
    FROM   EMP
    ORDER  BY DEPTNO, SAL;

     

    Results
    DEPTNO	SAL	PERCENTILE_DISC
    --------------------------------
    10	1300	1300
    10	2450	1300
    10	5000	1300
    20	800	800
    20	1100	800
    20	2975	800
    20	3000	800
    20	3000	800
    30	950	950
    30	1250	950
    30	1250	950
    30	1500	950
    30	1600	950
    30	2850	950

    2. Python Pandas(파이썬)

     

    1) 집계함수

    • np.percentile() 함수

    전체 관측치 중에서 백분위 15%에 해당하는 급여를 반환한다.

     

    Python Programming
    np.percentile(emp['sal'], 15)

     

    Results
    1092.5

     


    • np.percentile() 함수

    개별 부서 내에서 15% 백분위 수에 해당하는 급여를 반환한다.

     

    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc.groupby('deptno')['sal'].apply(lambda x: np.percentile(x,0.15))

     

    Results
    deptno
    10    1303.45
    20     801.80
    30     952.25
    Name: sal, dtype: float64

     


    2) 분석함수

    • np.percentile() 함수

     

    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: np.percentile(x, q=15))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	1645
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	1645
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	1645
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	980
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	980
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	980
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	980

     


    • transform()과 np.percentile() 함수

     

    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.sort_values('sal').groupby('deptno')['sal'].transform(lambda x: np.percentile(x,50))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	2450
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	2450
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	2450
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	2975
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	2975
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	2975
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	2975

     


    • scoreatpercentile() 함수

     

    Python Programming
    from scipy.stats import tmean, scoreatpercentile
    
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: scoreatpercentile(x,50))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	2450
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	2450
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	2450
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	2975
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	2975
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	2975
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	2975

     


    3. R Programming (R Package)

     

    • dplyr::cume_dist()와 사용자 정의 함수

     

    R Programming
    %%R
    
    library(plyr)   # dplyr 보다 먼저 로딩
    library(dplyr)
    
    withmooc <- emp[order(emp$sal),]
    
    withmooc['cume_dist'] = ave(x = withmooc[, 'sal'], withmooc[, 'deptno'], FUN = function (x) { dplyr::cume_dist(x) }  )
    
    withmooc <- plyr::ddply(withmooc,.(deptno),transform, PERCENTILE_DISC = min(ifelse(cume_dist >= 0.15, sal , 'NaN')) )
    withmooc[order(withmooc$deptno,withmooc$sal),]
    
    withmooc$ename <- withmooc$job <- withmooc$hiredate <- NULL
    
    withmooc[1:10, ]

     

    Results
       empno  mgr  sal comm deptno cume_dist PERCENTILE_DISC
    1   7934 7782 1300   NA     10 0.3333333            1300
    2   7782 7839 2450   NA     10 0.6666667            1300
    3   7839   NA 5000   NA     10 1.0000000            1300
    4   7369 7902  800   NA     20 0.2000000             800
    5   7876 7788 1100   NA     20 0.4000000             800
    6   7566 7839 2975   NA     20 0.6000000             800
    7   7788 7566 3000   NA     20 1.0000000             800
    8   7902 7566 3000   NA     20 1.0000000             800
    9   7900 7698  950   NA     30 0.1666667             950
    10  7521 7698 1250  500     30 0.5000000             950

     

     


    4. R Dplyr Package

     

    • cume_dist() 함수

     

    R Programming
    %%R
    
    emp %>%
      dplyr::arrange(deptno, sal)  %>%
      dplyr::group_by(deptno) %>%
      dplyr::mutate( sal_cume_dist = cume_dist(sal) ) %>%
      dplyr::mutate( PERCENTILE_DISC = min(sal[sal_cume_dist >= 0.75]) ) %>%
      dplyr::select(-c(hiredate)) %>%
      head(10)

     

    Results
    # A tibble: 10 x 9
    # Groups:   deptno [3]
       empno ename  job         mgr   sal  comm deptno sal_cume_dist PERCENTILE_DISC
       <dbl> <chr>  <chr>     <dbl> <dbl> <dbl>  <dbl>         <dbl>           <dbl>
     1  7934 MILLER CLERK      7782  1300    NA     10         0.333            5000
     2  7782 CLARK  MANAGER    7839  2450    NA     10         0.667            5000
     3  7839 KING   PRESIDENT    NA  5000    NA     10         1                5000
     4  7369 SMITH  CLERK      7902   800    NA     20         0.2              3000
     5  7876 ADAMS  CLERK      7788  1100    NA     20         0.4              3000
     6  7566 JONES  MANAGER    7839  2975    NA     20         0.6              3000
     7  7788 SCOTT  ANALYST    7566  3000    NA     20         1                3000
     8  7902 FORD   ANALYST    7566  3000    NA     20         1                3000
     9  7900 JAMES  CLERK      7698   950    NA     30         0.167            1600
    10  7521 WARD   SALESMAN   7698  1250   500     30         0.5              1600

     


    [참고]

     

    R Programming
    %%R
    
    emp %>%
      dplyr::arrange(deptno, sal)  %>%
      dplyr::group_by(deptno) %>%
      dplyr::mutate( sal_cume_dist = cume_dist(sal),
                     sal_lag       = lag(dplyr::cume_dist(sal)) , n=1, default = 0, order_by = sal) %>%
      dplyr::filter(0.15 >= ifelse(is.na(sal_lag), 0 , sal_lag)  & 0.15 < sal_cume_dist) %>%
      dplyr::select(-c("ename","job","hiredate"))

     

    Results
    # A tibble: 3 x 10
    # Groups:   deptno [3]
      empno   mgr   sal  comm deptno sal_cume_dist sal_lag     n default order_by
      <dbl> <dbl> <dbl> <dbl>  <dbl>         <dbl>   <dbl> <dbl>   <dbl>    <dbl>
    1  7934  7782  1300    NA     10         0.333      NA     1       0     1300
    2  7369  7902   800    NA     20         0.2        NA     1       0      800
    3  7900  7698   950    NA     30         0.167      NA     1       0      950

     

     


    5. R sqldf Package

     

    함수 구현 : 전체에서 35% 값

     

    R Programming
    %%R
    
    sqldf(" select min(sal) Per_35p
            from ( select sal 
                   from emp
                   order by sal
                   limit floor(.35 * (select count(*) from emp))
                  ) ")

     

    Results
      Per_35p
    1     800

     


    그룹 10에서 35% 값

     

    R Programming
    %%R
    
    sqldf(" select min(sal) Per_35p
            from ( select sal 
                   from   emp
                   where  deptno=10
                   order by sal
                   limit ceil( .35 * (select count(*) from emp where deptno=10) ) )
            ")

     

    Results
      Per_35p
    1    1300

     


    first_value() 와 percent_rank() 함수

     

    R Programming
    %%R
    
    sqldf(" SELECT DISTINCT
                   deptno,
                   first_value(sal) OVER ( ORDER BY CASE WHEN p1 <= 0.15 THEN p1 END NULLS LAST ) x1,
                   first_value(sal) OVER ( PARTITION BY deptno 
                                           ORDER BY CASE WHEN p2 <= 0.15 THEN p2 END NULLS LAST) x2
            FROM ( SELECT deptno,
                          sal,
                          percent_rank() OVER (ORDER BY sal) p1,
                          percent_rank() OVER (PARTITION BY deptno ORDER BY sal) p2
                   FROM emp
                 ) t
           ORDER BY deptno")

     

    Results
      deptno  x1   x2
    1     10 800 1300
    2     20 800  800
    3     30 800  950

     


     

    R Programming
    %%R
    
    sqldf(" select deptno,
                   sal,CUME_DIST,
                   min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15,
                   min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25,
                   min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50,
                   min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75
    from (select x.*,
                 row_number() over (partition by x.deptno
                                    order by x.sal
                                   ) as seqnum,
                 CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,
                 count(*) over (partition by x.deptno) as cnt
          from emp x
         ) x   ")

     

    Results
       deptno  sal CUME_DIST percent_15 percent_25 percent_50 percent_75
    1      10 1300 0.3333333       1300       1300       2450       5000
    2      10 2450 0.6666667       1300       1300       2450       5000
    3      10 5000 1.0000000       1300       1300       2450       5000
    4      20  800 0.2000000        800       1100       2975       3000
    5      20 1100 0.4000000        800       1100       2975       3000
    6      20 2975 0.6000000        800       1100       2975       3000
    7      20 3000 1.0000000        800       1100       2975       3000
    8      20 3000 1.0000000        800       1100       2975       3000
    9      30  950 0.1666667        950       1250       1250       1600
    10     30 1250 0.5000000        950       1250       1250       1600
    11     30 1250 0.5000000        950       1250       1250       1600
    12     30 1500 0.6666667        950       1250       1250       1600
    13     30 1600 0.8333333        950       1250       1250       1600
    14     30 2850 1.0000000        950       1250       1250       1600

     


    6. Python pandasql Package

     

    Python Programming
    ps.sqldf(" select deptno,             \
                   sal,cnt,CUME_DIST,     \
                   min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15,     \
                   min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25,     \
                   min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50,     \
                   min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75      \
    from (select x.*,                                         \
                 row_number() over (partition by x.deptno     \
                                    order by x.sal            \
                                   ) as seqnum,               \
                 CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,     \
                 count(*) over (partition by x.deptno) as cnt                         \
          from emp x                                                                  \
         ) x   ").head()

     

    Results
    	deptno	sal	cnt	CUME_DIST	percent_15	percent_25	percent_50	percent_75
    0	10	1300	3	0.333333	1300		2450		2450		5000
    1	10	2450	3	0.666667	1300		2450		2450		5000
    2	10	5000	3	1.000000	1300		2450		2450		5000
      3	20	800	5	0.200000	800		1100		2975		3000
    4	20	1100	5	0.400000	800		1100		2975		3000

     


    7. R data.table Package

     

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[ order(sal), PERCENTILE_DISC := sapply(.SD , function (x) { ifelse( dplyr::cume_dist(x) >= 0.75, as.list(x) , NULL )  }  ),    ## as.list에서 최초값을 가져오는 로직 이해 아직 안됨.
        by = deptno ,.SDcols=c("sal")][order(deptno,sal),][1:10, ]

     

    Results
        empno  ename       job  mgr   hiredate  sal comm deptno PERCENTILE_DISC
     1:  7934 MILLER     CLERK 7782 1982-01-23 1300   NA     10            5000
     2:  7782  CLARK   MANAGER 7839 1981-01-09 2450   NA     10            5000
     3:  7839   KING PRESIDENT   NA 1981-11-17 5000   NA     10            5000
     4:  7369  SMITH     CLERK 7902 1980-12-17  800   NA     20            3000
     5:  7876  ADAMS     CLERK 7788 1983-01-12 1100   NA     20            3000
     6:  7566  JONES   MANAGER 7839 1981-04-02 2975   NA     20            3000
     7:  7788  SCOTT   ANALYST 7566 1982-12-09 3000   NA     20            3000
     8:  7902   FORD   ANALYST 7566 1981-12-03 3000   NA     20            3000
     9:  7900  JAMES     CLERK 7698 1981-12-03  950   NA     30            1600
    10:  7521   WARD  SALESMAN 7698 1981-02-22 1250  500     30            1600

     


     

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[order(sal), `:=` ( PER_DISC_15 = min(.SD[dplyr::cume_dist(.SD) >= 0.15]),
                          PER_DISC_25 = min(.SD[dplyr::cume_dist(.SD) >= 0.25]),
                          PER_DISC_50 = min(.SD[dplyr::cume_dist(.SD) >= 0.50]),
                          PER_DISC_75 = min(.SD[dplyr::cume_dist(.SD) >= 0.75])  ), by=deptno,.SDcols=c("sal")][order(deptno,sal),][, c("ename","job","mgr","hiredate") := NULL]

     

    Results
        empno  sal comm deptno PER_DISC_15 PER_DISC_25 PER_DISC_50 PER_DISC_75
     1:  7934 1300   NA     10        1300        1300        2450        5000
     2:  7782 2450   NA     10        1300        1300        2450        5000
     3:  7839 5000   NA     10        1300        1300        2450        5000
     4:  7369  800   NA     20         800        1100        2975        3000
     5:  7876 1100   NA     20         800        1100        2975        3000
     6:  7566 2975   NA     20         800        1100        2975        3000
     7:  7788 3000   NA     20         800        1100        2975        3000
     8:  7902 3000   NA     20         800        1100        2975        3000
     9:  7900  950   NA     30         950        1250        1250        1600
    10:  7521 1250  500     30         950        1250        1250        1600
    11:  7654 1250 1400     30         950        1250        1250        1600
    12:  7844 1500    0     30         950        1250        1250        1600
    13:  7499 1600  300     30         950        1250        1250        1600
    14:  7698 2850   NA     30         950        1250        1250        1600

     


    참고용 : * * duplicated(.SD) 기능도 있음

     

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[order(sal),  `:=` (  test      = duplicated(.SD),
                            test1     = dplyr::cume_dist(duplicated(.SD)),
                            var_list  = ifelse( dplyr::cume_dist(duplicated(.SD)) >= 0.75,.SD, NA_real_) ,
                            cume_dist = lapply( ifelse( dplyr::cume_dist(duplicated(.SD)) >= 0.75,.SD, NA_real_) ,max ) # 리스트에서 조건을 만족하는 최소값
                          )   ,
                     by=deptno, .SDcols = c("sal")][order(deptno,sal),][, c("empno","ename","comm","job","mgr","hiredate") := NULL]

     

    Results
         sal deptno  test     test1                      var_list cume_dist
     1: 1300     10 FALSE 1.0000000                1300,2450,5000      5000
     2: 2450     10 FALSE 1.0000000                1300,2450,5000      5000
     3: 5000     10 FALSE 1.0000000                1300,2450,5000      5000
     4:  800     20 FALSE 0.8000000       800,1100,2975,3000,3000        NA
     5: 1100     20 FALSE 0.8000000       800,1100,2975,3000,3000        NA
     6: 2975     20 FALSE 0.8000000       800,1100,2975,3000,3000        NA
     7: 3000     20 FALSE 0.8000000       800,1100,2975,3000,3000        NA
     8: 3000     20  TRUE 1.0000000       800,1100,2975,3000,3000      3000
     9:  950     30 FALSE 0.8333333  950,1250,1250,1500,1600,2850        NA
    10: 1250     30 FALSE 0.8333333  950,1250,1250,1500,1600,2850        NA
    11: 1250     30  TRUE 1.0000000  950,1250,1250,1500,1600,2850      2850
    12: 1500     30 FALSE 0.8333333  950,1250,1250,1500,1600,2850        NA
    13: 1600     30 FALSE 0.8333333  950,1250,1250,1500,1600,2850        NA
    14: 2850     30 FALSE 0.8333333  950,1250,1250,1500,1600,2850        NA

     


    8. Python Duckdb의 SQL

     

    Python Programming
    %%sql
    SELECT DEPTNO,
           PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC
    FROM   emp
    GROUP  BY DEPTNO

     

    Python Programming
    duckdb.sql(" SELECT DEPTNO,                                                                   \
                        PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC      \
                 FROM   emp                                                                       \
                 GROUP  BY DEPTNO ").df()

     

    Results
       deptno  PERCENTILE_DISC
    0      20              800
    1      30              950
    2      10             1300

     


    2) 집계 함수

    • 테이블 조인

    PARTITION BY 은 현재 지원 안 함.

    : PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) AS PERCENTILE_DISC

    Python Programming
    %%sql
    select a.*,
           PERCENT_DISC
    from   emp a
           left join ( SELECT DEPTNO,
                              percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC
                       FROM   emp
                       group by deptno) b
                on a.deptno = b.deptno
    order  by a.deptno
    LIMIT  6

     

    Python Programming
    duckdb.sql(" select a.deptno,a.empno,a.ename,job,sal,                                                                                      \
                        PERCENT_DISC                                                                              \
                 from   emp a                                                                                     \
                        left join ( SELECT DEPTNO,                                                                \
                                           percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC      \
                                    FROM   emp                                                                    \
                                    group by deptno) b                                                            \
                             on a.deptno = b.deptno                                                               \
                 order  by a.deptno                                                                               \
                 LIMIT  6 ").df()

     

    Results
       deptno  empno   ename        job   sal  PERCENT_DISC
    0      10   7782   CLARK    MANAGER  2450          5000
    1      10   7839    KING  PRESIDENT  5000          5000
    2      10   7934  MILLER      CLERK  1300          5000
    3      20   7369   SMITH      CLERK   800          3000
    4      20   7566   JONES    MANAGER  2975          3000
    5      20   7788   SCOTT    ANALYST  3000          3000

     


    • 서브쿼리 방
    Python Programming
    %%sql
    select a.*,
           (  SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC
              FROM   emp b
              where  a.deptno = b.deptno) PERCENT_DISC
    from   emp a
    order  by a.deptno
    LIMIT  6

     

    Python Programming
    duckdb.sql(" select a.deptno,a.empno,a.ename,job,sal,                                               \
                        (  SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC     \
                           FROM   emp b                                                                 \
                           where  a.deptno = b.deptno) PERCENT_DISC                                     \
                 from   emp a                                                                           \
                 order  by a.deptno                                                                     \
                 LIMIT  6 ").df()

     

    Results
       deptno  empno   ename        job   sal  PERCENT_DISC
    0      10   7782   CLARK    MANAGER  2450          5000
    1      10   7839    KING  PRESIDENT  5000          5000
    2      10   7934  MILLER      CLERK  1300          5000
    3      20   7369   SMITH      CLERK   800          3000
    4      20   7566   JONES    MANAGER  2975          3000
    5      20   7788   SCOTT    ANALYST  3000          3000

     


    •  
    Python Programming
    %%sql
      select deptno,
             sal,cnt,CUME_DIST,
             min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15,
             min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25,
             min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50,
             min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75
      from ( select x.*,
                    row_number() over (partition by x.deptno
                                       order by x.sal
                                      ) as seqnum,
                    CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,
                    count(*) over (partition by x.deptno) as cnt
             from emp x
          ) x
      order by x.deptno
      LIMIT 6

     

    Python Programming
    duckdb.sql(" select deptno,                                                                                               \
                        sal,cnt,CUME_DIST,                                                                                    \
                        min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as perc_15,              \
                        min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as perc_25,              \
                        min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as perc_50,              \
                        min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as perc_75               \
                 from ( select x.*,                                                                                           \
                               row_number() over (partition by x.deptno                                                       \
                                                  order by x.sal                                                               \
                                                 ) as seqnum,                                                                 \
                               CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,                               \
                               count(*) over (partition by x.deptno) as cnt                                                   \
                        from emp x                                                                                            \
                      ) x                                                                                                      \
                 order by x.deptno                                                                                            \
                 LIMIT 6 ").df()

     

    Results
       deptno   sal  cnt  CUME_DIST  perc_15  perc_25  perc_50  perc_75
    0      10  1300    3   0.333333     1300     1300     2450     5000
    1      10  2450    3   0.666667     1300     1300     2450     5000
    2      10  5000    3   1.000000     1300     1300     2450     5000
    3      20   800    5   0.200000      800     1100     2975     3000
    4      20  1100    5   0.400000      800     1100     2975     3000
    5      20  2975    5   0.600000      800     1100     2975     3000

     


    https://unsplash.com/photos/OhL_qEqpef4

      --------------------------------------------  

     

    [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크

     

    오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크

     

    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크
    반응형

    댓글