포스팅 목차
* 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
[ PERCENTILE_CONT Oracle Function ]
PERCENTILE_CONT 함수는 연속된 분포 모델을 가정한 역 분포 함수(inverse distribution function)이다. 이 함수는 백분위수 값과 소트 지정을 취하고, 소트 지정에 해당하는 백분위수 값에 해당하는 보간된 값을 반환한다.
현재 아래 예제는 Percent Rank 를 구현함.
- 함수 설명 : PERCENTILE_CONT 오라클 함수 링크
1. Oracle(오라클)
1) 집계함수
- percentile_cont() 함수
Oracle Programming |
SELECT DEPTNO, PERCENTILE_CONT(0.35) WITHIN GROUP (ORDER BY SAL) PERCENTILE_CONT_FUNC
FROM EMP
GROUP BY DEPTNO
Results |
DEPTNO PERCENTILE_CONT_FUNC
------------------------------
10 2105
20 1850
30 1250
2) 분석함수
- percentile_cont() 함수
Oracle Programming |
SELECT DEPTNO, EMPNO, SAL,
PERCENTILE_CONT(0.35) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) PERCENTILE_CONT_FUNC
FROM EMP
ORDER BY DEPTNO, SAL DESC
Results |
DEPTNO EMPNO SAL PERCENTILE_CONT_FUNC
---------------------------------------------
10 7839 5000 2105
10 7782 2450 2105
10 7934 1300 2105
20 7788 3000 1850
20 7902 3000 1850
20 7566 2975 1850
20 7876 1100 1850
20 7369 800 1850
30 7698 2850 1250
30 7499 1600 1250
30 7844 1500 1250
30 7521 1250 1250
30 7654 1250 1250
30 7900 950 1250
2. Python Pandas(파이썬)
2) 분석함수
- pandas패키지의 quantile(x,q=0.35)
Python Programming |
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: x.quantile(q=0.35))
withmooc.sort_values(['deptno','sal'],ascending = [True,False])
Results |
empno ename job mgr hiredate sal comm deptno pct_first
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2105.0
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2105.0
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2105.0
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 1850.0
12 7902 FORD ANALYST 7566.0 1981/12/03 3000 NaN 20 1850.0
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1850.0
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 1850.0
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 1850.0
5 7698 BLAKE MANAGER 7839.0 1981/03/01 2850 NaN 30 1250.0
1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1250.0
9 7844 TURNER SALESMAN 7698.0 1981/09/08 1500 0.0 30 1250.0
2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 1250.0
4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 1250.0
11 7900 JAMES CLERK 7698.0 1981/12/03 950 NaN 30 1250.0
- Numpy 패키지의 np.quantile(x,q=0.35)
Python Programming |
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: np.quantile(x,q=0.35))
withmooc.sort_values(['deptno','sal'],ascending = [True,False])
Results |
empno ename job mgr hiredate sal comm deptno pct_first
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2105.0
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2105.0
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2105.0
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 1850.0
12 7902 FORD ANALYST 7566.0 1981/12/03 3000 NaN 20 1850.0
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1850.0
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 1850.0
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 1850.0
5 7698 BLAKE MANAGER 7839.0 1981/03/01 2850 NaN 30 1250.0
1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1250.0
9 7844 TURNER SALESMAN 7698.0 1981/09/08 1500 0.0 30 1250.0
2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 1250.0
4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 1250.0
11 7900 JAMES CLERK 7698.0 1981/12/03 950 NaN 30 1250.0
3. R Programming (R Package)
2) 분석함수
R Programming |
%%R
withmooc <- emp[order(-emp$sal), ]
withmooc['prec_cont'] = with(withmooc, ave(sal,deptno, FUN=function(x) { quantile(x, 0.35) }))
withmooc[order(withmooc$deptno,-withmooc$sal),]
Results |
# A tibble: 14 × 9
empno ename job mgr hiredate sal comm deptno prec_cont
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2105
2 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2105
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2105
4 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 1850
5 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1850
6 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1850
7 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 1850
8 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 1850
9 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 1250
10 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1250
11 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 1250
12 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1250
13 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 1250
14 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1250
4. R Dplyr Package
2) 분석함수
R Programming |
%%R
emp %>%
dplyr::group_by(deptno) %>%
dplyr::mutate( quantile = scales::percent(c(0.35)) ,
prec_cont = stats::quantile(sal, 0.35)) %>%
dplyr::arrange(deptno, sal)
Results |
# A tibble: 14 × 10
# Groups: deptno [3]
empno ename job mgr hiredate sal comm deptno quantile prec_cont
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <chr> <dbl>
1 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 35% 2105
2 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 35% 2105
3 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 35% 2105
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 35% 1850
5 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 35% 1850
6 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 35% 1850
7 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 35% 1850
8 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 35% 1850
9 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 35% 1250
10 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 35% 1250
11 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 35% 1250
12 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 35% 1250
13 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 35% 1250
14 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 35% 1250
R Programming |
%%R
emp %>%
dplyr::group_by(deptno) %>%
dplyr::mutate( enframe( quantile(sal, 0.35),'quantile','prec_cont')) %>%
dplyr::arrange(deptno, sal)
Results |
# A tibble: 14 × 10
# Groups: deptno [3]
empno ename job mgr hiredate sal comm deptno quantile prec_cont
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <chr> <dbl>
1 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 35% 2105
2 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 35% 2105
3 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 35% 2105
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 35% 1850
5 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 35% 1850
6 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 35% 1850
7 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 35% 1850
8 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 35% 1850
9 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 35% 1250
10 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 35% 1250
11 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 35% 1250
12 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 35% 1250
13 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 35% 1250
14 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 35% 1250
5. R sqldf Package
- 현재 sqlite에서 지원 안 함 :
- Miscellaneous Extensions : https://www.sqlite.org/src/file/ext/misc
- 값 사이의 보간법이 필요
- SQLITE에서는 PERCENTILE_CONT 함수를 직접 제공 안함. 대략 하단과 같이 계산이 가능. 중복건(1250)이나 양쪽 극단치에 대한 별도 처리는 필요
R Programming |
%%R
sqldf(" SELECT DEPTNO, ENAME, SAL,
sal_lag,sal_Per_lag,sal_PerRank,
sal_lag + ( ( SAL - sal_lag) * (0.35 - sal_Per_lag) / (sal_PerRank - sal_Per_lag) ) Per_35p
FROM ( SELECT DEPTNO, ENAME, SAL,
lag(sal) over (partition by deptno order by sal_perRank) sal_lag,
lag(sal_PerRank) over (partition by deptno order by sal_perRank) sal_Per_lag,
sal_PerRank
FROM ( SELECT DEPTNO, ENAME, SAL,
PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as sal_PerRank
FROM emp
order by deptno, sal
)
order by deptno, sal
)
WHERE 0.35 BETWEEN sal_Per_lag AND sal_PerRank
")
Results |
2) 분석함수
R Programming |
%%R
gsubfn::fn$sqldf("select `quantile( emp[emp$deptno==10,]$sal , 0.35)` ,
`quantile( emp[emp$deptno==20,]$sal , 0.35)` ,
`quantile( emp[emp$deptno==30,]$sal , 0.35)`
from emp ")
Results |
- 전체에서 35% 값
R Programming |
%%R
sqldf(" select min(sal) from (
select sal from emp
order by sal desc
limit floor(.35 * (select count(*) from emp))
) ")
Results |
- 그룹 10에서 35% 값
R Programming |
%%R
sqldf(" select min(sal)
from ( select sal from emp
where deptno=10
order by sal desc
limit ceil( .35 * (select count(*) from emp where deptno=10) ) )
")
Results |
R Programming |
%%R
sqldf(" WITH scores(score, deptno, prev_rank, curr_rank, next_rank) AS (
SELECT sal,deptno,
(ROW_NUMBER() OVER ( ORDER BY sal ) - 1.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1) [prev_rank],
(ROW_NUMBER() OVER ( ORDER BY sal ) + 0.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1) [curr_rank],
(ROW_NUMBER() OVER ( ORDER BY sal ) + 1.0) / ((SELECT COUNT(*) FROM emp where deptno=10) + 1) [next_rank]
FROM emp
where deptno=10
)
SELECT
CASE WHEN t1.score = t2.score
THEN t1.score
ELSE
t1.score + (t2.score - t1.score) * ((0.35 - t1.curr_rank) / (t2.curr_rank - t1.curr_rank))
END Per_value
FROM scores t1, scores t2
WHERE (t1.curr_rank = 0.35 OR (t1.curr_rank < 0.35 AND t1.next_rank > 0.35))
AND (t2.curr_rank = 0.35 OR (t2.curr_rank > 0.35 AND t2.prev_rank < 0.35))
and t1.deptno=10
")
Results |
R Programming |
%%R
sqldf(" SELECT DISTINCT
deptno,
first_value(sal) OVER (
ORDER BY CASE WHEN p1 <= 0.35 THEN p1 END DESC NULLS LAST) x1,
first_value(sal) OVER (
PARTITION BY deptno
ORDER BY CASE WHEN p2 <= 0.35 THEN p2 END DESC NULLS LAST) x2
FROM (
SELECT
deptno,
sal,
percent_rank() OVER (ORDER BY sal) p1,
percent_rank() OVER (PARTITION BY deptno ORDER BY sal) p2
FROM emp
) t
ORDER BY deptno")
Results |
R Programming |
%%R
sqldf(" SELECT Date, MIN(Abs90) AS Selector
FROM ( SELECT mydata.Date, Value, COUNT(Value) as Pct, (ABS(0.9-(COUNT(Value)/d.DateTotal))) AS Abs90
FROM mydata
INNER JOIN (SELECT Date, COUNT(*) AS DateTotal FROM mydata GROUP BY Date) AS d
ON d.Date = mydata.Date
GROUP BY Date, Value ")
Results |
6. Python pandasql Package
1) 집계함수
Python Programming |
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: x.quantile(q=0.35))
withmooc.sort_values(['deptno','sal'],ascending = [True,False])
Results |
empno ename job mgr hiredate sal comm deptno pct_first
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2105.0
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2105.0
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2105.0
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 1850.0
12 7902 FORD ANALYST 7566.0 1981/12/03 3000 NaN 20 1850.0
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1850.0
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 1850.0
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 1850.0
5 7698 BLAKE MANAGER 7839.0 1981/03/01 2850 NaN 30 1250.0
1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1250.0
9 7844 TURNER SALESMAN 7698.0 1981/09/08 1500 0.0 30 1250.0
2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 1250.0
4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 1250.0
11 7900 JAMES CLERK 7698.0 1981/12/03 950 NaN 30 1250.0
7. R data.table Package
2) 분석함수
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[,prec_cont := stats::quantile(sal, 0.35),by =deptno][order(deptno,sal),]
Results |
empno ename job mgr hiredate sal comm deptno prec_cont
1: 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2105
2: 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2105
3: 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2105
4: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 1850
5: 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 1850
6: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 1850
7: 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 1850
8: 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1850
9: 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1250
10: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1250
11: 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 1250
12: 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 1250
13: 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1250
14: 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 1250
8. Python Duckdb의 SQL
PERCENTILE_CONT() 함수
Python Programming |
%%sql
SELECT deptno,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal) "Median cont",
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal) "Median disc"
FROM emp
GROUP BY deptno
ORDER BY 1
Python Programming |
duckdb.sql(" SELECT deptno, \
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY sal) Median_cont, \
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY sal) Median_disc \
FROM emp \
GROUP BY deptno \
ORDER BY 1 ").df()
Results |
deptno Median_cont Median_disc
0 10 2450.0 2450
1 20 2975.0 2975
2 30 1375.0 1250
Python Programming |
%%sql
Student << SELECT '1' grp, sal from emp
union all
select '1' grp, 1000 sal
%%sql
SELECT sal, PERCENT_RANK() OVER (PARTITION BY grp ORDER BY sal) as sal_PerRank from Student
Python Programming |
duckdb.sql(" SELECT sal, PERCENT_RANK() OVER (PARTITION BY grp ORDER BY sal) as sal_PerRank from Student ").df()
Results |
sal sal_PerRank
0 800 0.000000
1 950 0.071429
2 1000 0.142857
3 1100 0.214286
4 1250 0.285714
5 1250 0.285714
6 1300 0.428571
7 1500 0.500000
8 1600 0.571429
9 2450 0.642857
10 2850 0.714286
11 2975 0.785714
12 3000 0.857143
13 3000 0.857143
14 5000 1.000000
Python Programming |
%%sql
SELECT DEPTNO, ENAME, SAL,
PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as PerRank
FROM emp
order by deptno, sal
LIMIT 7
Python Programming |
duckdb.sql(" SELECT DEPTNO, ENAME, SAL, \
PERCENT_RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL) as PerRank \
FROM emp \
order by deptno, sal \
LIMIT 7 ").df()
Results |
deptno ename sal PerRank
0 10 MILLER 1300 0.00
1 10 CLARK 2450 0.50
2 10 KING 5000 1.00
3 20 SMITH 800 0.00
4 20 ADAMS 1100 0.25
5 20 JONES 2975 0.50
6 20 SCOTT 3000 0.75
--------------------------------------------
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크 |
오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크 |
반응형
댓글