본문 바로가기
통계프로그램 비교 시리즈/오라클함수 비교(R & Python)

PERCENTILE_CONT 오라클 함수 [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table, DuckDB]

by 기서무나구물 2022. 1. 3.

포스팅 목차

    * 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크


    [ PERCENTILE_CONT Oracle Function ]

     


    PERCENTILE_CONT 함수는 연속된 분포 모델을 가정한 역 분포 함수(inverse distribution function)이다. 이 함수는 백분위수 값과 소트 지정을 취하고, 소트 지정에 해당하는 백분위수 값에 해당하는 보간된 값을 반환한다.

     

    현재 아래 예제는 Percent Rank 를 구현함.

     

     


    1. Oracle(오라클)

     

    1) 집계함수

    • percentile_cont() 함수

     

    Oracle Programming
    SELECT DEPTNO, PERCENTILE_CONT(0.35) WITHIN GROUP (ORDER BY SAL) PERCENTILE_CONT_FUNC
    FROM   EMP 
    GROUP  BY DEPTNO

     

    Results
    DEPTNO	PERCENTILE_CONT_FUNC
    ------------------------------
    10	2105
    20	1850
    30	1250

     


    2) 분석함수

    • percentile_cont() 함수

     

     

    Oracle Programming
    SELECT DEPTNO, EMPNO, SAL,
           PERCENTILE_CONT(0.35) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) PERCENTILE_CONT_FUNC
    FROM   EMP
    ORDER  BY DEPTNO, SAL DESC

     

    Results
    DEPTNO	EMPNO	SAL	PERCENTILE_CONT_FUNC
    ---------------------------------------------
    10	7839	5000	2105
    10	7782	2450	2105
    10	7934	1300	2105
    20	7788	3000	1850
    20	7902	3000	1850
    20	7566	2975	1850
    20	7876	1100	1850
    20	7369	800	1850
    30	7698	2850	1250
    30	7499	1600	1250
    30	7844	1500	1250
    30	7521	1250	1250
    30	7654	1250	1250
    30	7900	950	1250

     


    2. Python Pandas(파이썬)

     

    2) 분석함수

    • pandas패키지의 quantile(x,q=0.35)
    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: x.quantile(q=0.35))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,False])

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	2105.0
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	2105.0
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	2105.0
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	1850.0
    12	7902	FORD	ANALYST		7566.0	1981/12/03	3000	NaN	20	1850.0
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	1850.0
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	1850.0
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	1850.0
    5	7698	BLAKE	MANAGER		7839.0	1981/03/01	2850	NaN	30	1250.0
    1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1250.0
    9	7844	TURNER	SALESMAN	7698.0	1981/09/08	1500	0.0	30	1250.0
    2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	1250.0
    4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	1250.0
    11	7900	JAMES	CLERK		7698.0	1981/12/03	950	NaN	30	1250.0​

     


    • Numpy 패키지의 np.quantile(x,q=0.35)
    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: np.quantile(x,q=0.35))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,False])

     

    Results
     
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	2105.0
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	2105.0
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	2105.0
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	1850.0
    12	7902	FORD	ANALYST		7566.0	1981/12/03	3000	NaN	20	1850.0
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	1850.0
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	1850.0
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	1850.0
    5	7698	BLAKE	MANAGER		7839.0	1981/03/01	2850	NaN	30	1250.0
    1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1250.0
    9	7844	TURNER	SALESMAN	7698.0	1981/09/08	1500	0.0	30	1250.0
    2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	1250.0
    4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	1250.0
    11	7900	JAMES	CLERK		7698.0	1981/12/03	950	NaN	30	1250.0

     


    3. R Programming (R Package)

     

    2) 분석함수

    R Programming
    %%R
    withmooc <- emp[order(-emp$sal), ]
    withmooc['prec_cont'] = with(withmooc, ave(sal,deptno, FUN=function(x) { quantile(x, 0.35) }))
    
    withmooc[order(withmooc$deptno,-withmooc$sal),]

     

    Results
    # A tibble: 14 × 9
       empno ename  job         mgr hiredate     sal  comm deptno prec_cont
       <dbl> <chr>  <chr>     <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
     1  7839 KING   PRESIDENT    NA 1981-11-17  5000    NA     10      2105
     2  7782 CLARK  MANAGER    7839 1981-01-09  2450    NA     10      2105
     3  7934 MILLER CLERK      7782 1982-01-23  1300    NA     10      2105
     4  7788 SCOTT  ANALYST    7566 1982-12-09  3000    NA     20      1850
     5  7902 FORD   ANALYST    7566 1981-12-03  3000    NA     20      1850
     6  7566 JONES  MANAGER    7839 1981-04-02  2975    NA     20      1850
     7  7876 ADAMS  CLERK      7788 1983-01-12  1100    NA     20      1850
     8  7369 SMITH  CLERK      7902 1980-12-17   800    NA     20      1850
     9  7698 BLAKE  MANAGER    7839 1981-03-01  2850    NA     30      1250
    10  7499 ALLEN  SALESMAN   7698 1981-02-20  1600   300     30      1250
    11  7844 TURNER SALESMAN   7698 1981-09-08  1500     0     30      1250
    12  7521 WARD   SALESMAN   7698 1981-02-22  1250   500     30      1250
    13  7654 MARTIN SALESMAN   7698 1981-09-28  1250  1400     30      1250
    14  7900 JAMES  CLERK      7698 1981-12-03   950    NA     30      1250

     


    4. R Dplyr Package

     

    2) 분석함수

    R Programming
    %%R
    
    emp %>%
      dplyr::group_by(deptno) %>%
      dplyr::mutate( quantile  = scales::percent(c(0.35)) ,
                     prec_cont = stats::quantile(sal, 0.35)) %>%
      dplyr::arrange(deptno, sal)

     

    Results
    # A tibble: 14 × 10
    # Groups:   deptno [3]
       empno ename  job         mgr hiredate     sal  comm deptno quantile prec_cont
       <dbl> <chr>  <chr>     <dbl> <date>     <dbl> <dbl>  <dbl> <chr>        <dbl>
     1  7934 MILLER CLERK      7782 1982-01-23  1300    NA     10 35%           2105
     2  7782 CLARK  MANAGER    7839 1981-01-09  2450    NA     10 35%           2105
     3  7839 KING   PRESIDENT    NA 1981-11-17  5000    NA     10 35%           2105
     4  7369 SMITH  CLERK      7902 1980-12-17   800    NA     20 35%           1850
     5  7876 ADAMS  CLERK      7788 1983-01-12  1100    NA     20 35%           1850
     6  7566 JONES  MANAGER    7839 1981-04-02  2975    NA     20 35%           1850
     7  7788 SCOTT  ANALYST    7566 1982-12-09  3000    NA     20 35%           1850
     8  7902 FORD   ANALYST    7566 1981-12-03  3000    NA     20 35%           1850
     9  7900 JAMES  CLERK      7698 1981-12-03   950    NA     30 35%           1250
    10  7521 WARD   SALESMAN   7698 1981-02-22  1250   500     30 35%           1250
    11  7654 MARTIN SALESMAN   7698 1981-09-28  1250  1400     30 35%           1250
    12  7844 TURNER SALESMAN   7698 1981-09-08  1500     0     30 35%           1250
    13  7499 ALLEN  SALESMAN   7698 1981-02-20  1600   300     30 35%           1250
    14  7698 BLAKE  MANAGER    7839 1981-03-01  2850    NA     30 35%           1250

     


    R Programming
    %%R
    
    emp %>%
      dplyr::group_by(deptno) %>%
      dplyr::mutate( enframe( quantile(sal, 0.35),'quantile','prec_cont')) %>%
      dplyr::arrange(deptno, sal)

     

    Results
    # A tibble: 14 × 10
    # Groups:   deptno [3]
       empno ename  job         mgr hiredate     sal  comm deptno quantile prec_cont
       <dbl> <chr>  <chr>     <dbl> <date>     <dbl> <dbl>  <dbl> <chr>        <dbl>
     1  7934 MILLER CLERK      7782 1982-01-23  1300    NA     10 35%           2105
     2  7782 CLARK  MANAGER    7839 1981-01-09  2450    NA     10 35%           2105
     3  7839 KING   PRESIDENT    NA 1981-11-17  5000    NA     10 35%           2105
     4  7369 SMITH  CLERK      7902 1980-12-17   800    NA     20 35%           1850
     5  7876 ADAMS  CLERK      7788 1983-01-12  1100    NA     20 35%           1850
     6  7566 JONES  MANAGER    7839 1981-04-02  2975    NA     20 35%           1850
     7  7788 SCOTT  ANALYST    7566 1982-12-09  3000    NA     20 35%           1850
     8  7902 FORD   ANALYST    7566 1981-12-03  3000    NA     20 35%           1850
     9  7900 JAMES  CLERK      7698 1981-12-03   950    NA     30 35%           1250
    10  7521 WARD   SALESMAN   7698 1981-02-22  1250   500     30 35%           1250
    11  7654 MARTIN SALESMAN   7698 1981-09-28  1250  1400     30 35%           1250
    12  7844 TURNER SALESMAN   7698 1981-09-08  1500     0     30 35%           1250
    13  7499 ALLEN  SALESMAN   7698 1981-02-20  1600   300     30 35%           1250
    14  7698 BLAKE  MANAGER    7839 1981-03-01  2850    NA     30 35%           1250

     


    5. R sqldf Package

     

    • 현재 sqlite에서 지원 안 함 :
    • Miscellaneous Extensions : https://www.sqlite.org/src/file/ext/misc
    • 값 사이의 보간법이 필요
    • SQLITE에서는 PERCENTILE_CONT 함수를 직접 제공 안함. 대략 하단과 같이 계산이 가능. 중복건(1250)이나 양쪽 극단치에 대한 별도 처리는 필요
    R Programming
    %%R
    
    sqldf(" SELECT DEPTNO, ENAME, SAL,
                   sal_lag,sal_Per_lag,sal_PerRank,
                   sal_lag + ( ( SAL - sal_lag) *  (0.35 - sal_Per_lag) / (sal_PerRank - sal_Per_lag) ) Per_35p
            FROM   ( SELECT DEPTNO, ENAME, SAL,
                            lag(sal) over (partition by deptno order by sal_perRank)         sal_lag,
                            lag(sal_PerRank) over (partition by deptno order by sal_perRank) sal_Per_lag,
                            sal_PerRank
                     FROM   ( SELECT DEPTNO, ENAME, SAL,
                                     PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as sal_PerRank
                              FROM emp
                              order by deptno, sal
                            )
                     order by deptno, sal
                   )
            WHERE 0.35 BETWEEN sal_Per_lag AND sal_PerRank
          ")

     

    Results

     


    2) 분석함수

    R Programming
    %%R
    
    gsubfn::fn$sqldf("select `quantile( emp[emp$deptno==10,]$sal , 0.35)` ,
                             `quantile( emp[emp$deptno==20,]$sal , 0.35)` ,
                             `quantile( emp[emp$deptno==30,]$sal , 0.35)`
                       from emp ")

     

    Results

     


    • 전체에서 35% 값
    R Programming
    %%R
    
    sqldf(" select min(sal) from (
       select sal from emp
       order by sal desc
       limit floor(.35 * (select count(*) from emp))
    ) ")

     

    Results

     


    • 그룹 10에서 35% 값
    R Programming
    %%R
    
    sqldf(" select min(sal) 
            from ( select sal from emp
                   where deptno=10
                   order by sal desc
                   limit ceil( .35 * (select count(*) from emp where deptno=10) ) )
            ")
    Results

     


    •  
    R Programming
    %%R
    
    sqldf(" WITH scores(score, deptno, prev_rank, curr_rank, next_rank) AS (
        SELECT sal,deptno,
            (ROW_NUMBER() OVER ( ORDER BY sal ) - 1.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1)  [prev_rank],
            (ROW_NUMBER() OVER ( ORDER BY sal ) + 0.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1)  [curr_rank],
            (ROW_NUMBER() OVER ( ORDER BY sal ) + 1.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1)  [next_rank]
        FROM emp
    where deptno=10
    )
      SELECT
        CASE WHEN t1.score = t2.score
            THEN t1.score
        ELSE
            t1.score + (t2.score - t1.score) * ((0.35 - t1.curr_rank) / (t2.curr_rank - t1.curr_rank))
        END Per_value
        FROM scores t1, scores t2
        WHERE (t1.curr_rank = 0.35 OR (t1.curr_rank < 0.35 AND t1.next_rank > 0.35))
            AND (t2.curr_rank = 0.35 OR (t2.curr_rank > 0.35 AND t2.prev_rank < 0.35))
    and t1.deptno=10
      ")​
    Results

     


    R Programming
    %%R
    
    sqldf(" SELECT DISTINCT
      deptno,
      first_value(sal) OVER (
        ORDER BY CASE WHEN p1 <= 0.35 THEN p1 END DESC NULLS LAST) x1,
      first_value(sal) OVER (
        PARTITION BY deptno
        ORDER BY CASE WHEN p2 <= 0.35 THEN p2 END DESC NULLS LAST) x2
    FROM (
      SELECT
        deptno,
        sal,
        percent_rank() OVER (ORDER BY sal) p1,
        percent_rank() OVER (PARTITION BY deptno ORDER BY sal) p2
      FROM emp
    ) t
    ORDER BY deptno")
    Results

     


    R Programming
    %%R
    
    sqldf(" SELECT Date, MIN(Abs90) AS Selector 
            FROM   ( SELECT mydata.Date, Value, COUNT(Value) as Pct, (ABS(0.9-(COUNT(Value)/d.DateTotal))) AS Abs90
                     FROM   mydata
                            INNER JOIN (SELECT Date, COUNT(*) AS DateTotal FROM mydata GROUP BY Date) AS d
                                  ON d.Date = mydata.Date
                     GROUP BY Date, Value ")​

     

    Results

     


    6. Python pandasql Package

     

    1) 집계함수

     

    Python Programming
    withmooc = copy.copy(emp)
    
    withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: x.quantile(q=0.35))
    
    withmooc.sort_values(['deptno','sal'],ascending = [True,False])

     

    Results
    	empno	ename	job		mgr	hiredate	sal	comm	deptno	pct_first
    8	7839	KING	PRESIDENT	NaN	1981/11/17	5000	NaN	10	2105.0
    6	7782	CLARK	MANAGER		7839.0	1981/01/09	2450	NaN	10	2105.0
    13	7934	MILLER	CLERK		7782.0	1982/01/23	1300	NaN	10	2105.0
    7	7788	SCOTT	ANALYST		7566.0	1982/12/09	3000	NaN	20	1850.0
    12	7902	FORD	ANALYST		7566.0	1981/12/03	3000	NaN	20	1850.0
    3	7566	JONES	MANAGER		7839.0	1981/04/02	2975	NaN	20	1850.0
    10	7876	ADAMS	CLERK		7788.0	1983/01/12	1100	NaN	20	1850.0
    0	7369	SMITH	CLERK		7902.0	1980/12/17	800	NaN	20	1850.0
    5	7698	BLAKE	MANAGER		7839.0	1981/03/01	2850	NaN	30	1250.0
    1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1250.0
    9	7844	TURNER	SALESMAN	7698.0	1981/09/08	1500	0.0	30	1250.0
    2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	1250.0
    4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	1250.0
    11	7900	JAMES	CLERK		7698.0	1981/12/03	950	NaN	30	1250.0​

     


    7. R data.table Package

     

    2) 분석함수

    R Programming
    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[,prec_cont := stats::quantile(sal, 0.35),by =deptno][order(deptno,sal),]

     

    Results
        empno  ename       job  mgr   hiredate  sal comm deptno prec_cont
     1:  7934 MILLER     CLERK 7782 1982-01-23 1300   NA     10      2105
     2:  7782  CLARK   MANAGER 7839 1981-01-09 2450   NA     10      2105
     3:  7839   KING PRESIDENT   NA 1981-11-17 5000   NA     10      2105
     4:  7369  SMITH     CLERK 7902 1980-12-17  800   NA     20      1850
     5:  7876  ADAMS     CLERK 7788 1983-01-12 1100   NA     20      1850
     6:  7566  JONES   MANAGER 7839 1981-04-02 2975   NA     20      1850
     7:  7788  SCOTT   ANALYST 7566 1982-12-09 3000   NA     20      1850
     8:  7902   FORD   ANALYST 7566 1981-12-03 3000   NA     20      1850
     9:  7900  JAMES     CLERK 7698 1981-12-03  950   NA     30      1250
    10:  7521   WARD  SALESMAN 7698 1981-02-22 1250  500     30      1250
    11:  7654 MARTIN  SALESMAN 7698 1981-09-28 1250 1400     30      1250
    12:  7844 TURNER  SALESMAN 7698 1981-09-08 1500    0     30      1250
    13:  7499  ALLEN  SALESMAN 7698 1981-02-20 1600  300     30      1250
    14:  7698  BLAKE   MANAGER 7839 1981-03-01 2850   NA     30      1250

     


    8. Python Duckdb의 SQL

     

    PERCENTILE_CONT() 함수

     

    Python Programming
    %%sql
    SELECT deptno,
          PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal) "Median cont",
          PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal) "Median disc"
    FROM  emp
    GROUP BY deptno
    ORDER BY 1

     

    Python Programming
    duckdb.sql(" SELECT deptno,                                                           \
                        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal) Median_cont,     \
                        PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal) Median_disc      \
                 FROM  emp                                                               \
                 GROUP BY deptno                                                         \
                 ORDER BY 1 ").df()

     

    Results
       deptno  Median_cont  Median_disc
    0      10       2450.0         2450
    1      20       2975.0         2975
    2      30       1375.0         1250

     


     

    Python Programming
    %%sql
    Student << SELECT '1' grp, sal from emp
               union all
               select '1' grp, 1000 sal
    %%sql
      SELECT sal, PERCENT_RANK() OVER (PARTITION BY grp ORDER BY sal) as sal_PerRank from  Student

     

    Python Programming
    duckdb.sql(" SELECT sal, PERCENT_RANK() OVER (PARTITION BY grp ORDER BY sal) as sal_PerRank from  Student ").df()

     

    Results
         sal  sal_PerRank
    0    800     0.000000
    1    950     0.071429
    2   1000     0.142857
    3   1100     0.214286
    4   1250     0.285714
    5   1250     0.285714
    6   1300     0.428571
    7   1500     0.500000
    8   1600     0.571429
    9   2450     0.642857
    10  2850     0.714286
    11  2975     0.785714
    12  3000     0.857143
    13  3000     0.857143
    14  5000     1.000000

     


     

    Python Programming
    %%sql
      SELECT DEPTNO, ENAME, SAL,
             PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as PerRank
      FROM   emp
      order  by deptno, sal
      LIMIT  7

     

    Python Programming
    duckdb.sql(" SELECT DEPTNO, ENAME, SAL,                                                     \
                        PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as PerRank       \
                 FROM   emp                                                                     \
                 order  by deptno, sal                                                          \
                 LIMIT  7 ").df()

     

    Results
       deptno   ename   sal  PerRank
    0      10  MILLER  1300     0.00
    1      10   CLARK  2450     0.50
    2      10    KING  5000     1.00
    3      20   SMITH   800     0.00
    4      20   ADAMS  1100     0.25
    5      20   JONES  2975     0.50
    6      20   SCOTT  3000     0.75

     

     


    ( https://unsplash.com/photos/OhL_qEqpef4 )

      --------------------------------------------  

     

     

    [Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크

     

    오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크

     

    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크
    반응형

    댓글