포스팅 목차
* 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
[ PERCENTILE_DISC Oracle Function ]
PERCENTILE_DISC함수는 이산 분포 모형을 가정하는 역 분포 함수이다. 사용자가 지정한 백분위수에 해당하는 데이터 값을 검색하고, 정확하게 일치하는 데이터 값이 없으면 지정한 백분위 값의 주변 데이터 정렬 옵션에 따라 우선 순위가 빠른 바로 앞의 관측치 값을 반환한다. 계산 과정에서 Null은 무시된다.
- 함수설명 : PERCENTILE_DISC 오라클 함수 링크
- PERCENTILE_CONT는 데이터 세트에 있거나 없을 수 있는 적절한 값을 보간법을 사용하여서 선택하는 반면, PERCENTILE_DISC는 항상 해당 세트에서 백분위수 바로 앞에 실제로 존재하는 값을 반환합니다. 데이터 계수가 짝수인 경우 반환하는 값이 차이가 날 수 있다.
1. Oracle(오라클)
1) 집계함수
- PERCENTILE_DISC() 함수
개별 부서 내에서 15% 백분위에 해당하는 급여를 반환한다.
Oracle Programming |
SELECT DEPTNO,
PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC
FROM EMP
GROUP BY DEPTNO
Results |
DEPTNO PERCENTILE_DISC
----------------------------
10 1300
20 800
30 950
2) 분석함수
- PERCENTILE_DISC() 함수
파티션에 해당하는 개별 부서 내에서 15% 백분위에 해당하는 급여를 기준 관측치에 반환한다.
Oracle Programming |
SELECT DEPTNO,
SAL,
PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) AS PERCENTILE_DISC
FROM EMP
ORDER BY DEPTNO, SAL;
Results |
DEPTNO SAL PERCENTILE_DISC
--------------------------------
10 1300 1300
10 2450 1300
10 5000 1300
20 800 800
20 1100 800
20 2975 800
20 3000 800
20 3000 800
30 950 950
30 1250 950
30 1250 950
30 1500 950
30 1600 950
30 2850 950
2. Python Pandas(파이썬)
1) 집계함수
- np.percentile() 함수
전체 관측치 중에서 백분위 15%에 해당하는 급여를 반환한다.
Python Programming |
np.percentile(emp['sal'], 15)
Results |
1092.5
- np.percentile() 함수
개별 부서 내에서 15% 백분위 수에 해당하는 급여를 반환한다.
Python Programming |
withmooc = copy.copy(emp)
withmooc.groupby('deptno')['sal'].apply(lambda x: np.percentile(x,0.15))
Results |
deptno
10 1303.45
20 801.80
30 952.25
Name: sal, dtype: float64
2) 분석함수
- np.percentile() 함수
Python Programming |
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: np.percentile(x, q=15))
withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)
Results |
empno ename job mgr hiredate sal comm deptno pct_first
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 1645
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 1645
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 1645
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 980
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 980
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 980
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 980
- transform()과 np.percentile() 함수
Python Programming |
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.sort_values('sal').groupby('deptno')['sal'].transform(lambda x: np.percentile(x,50))
withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)
Results |
empno ename job mgr hiredate sal comm deptno pct_first
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2450
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2450
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2450
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 2975
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 2975
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 2975
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 2975
- scoreatpercentile() 함수
Python Programming |
from scipy.stats import tmean, scoreatpercentile
withmooc = copy.copy(emp)
withmooc['pct_first'] = withmooc.groupby('deptno')['sal'].transform(lambda x: scoreatpercentile(x,50))
withmooc.sort_values(['deptno','sal'],ascending = [True,True]).head(7)
Results |
empno ename job mgr hiredate sal comm deptno pct_first
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2450
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2450
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2450
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 2975
10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 2975
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 2975
7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 2975
3. R Programming (R Package)
- dplyr::cume_dist()와 사용자 정의 함수
R Programming |
%%R
library(plyr) # dplyr 보다 먼저 로딩
library(dplyr)
withmooc <- emp[order(emp$sal),]
withmooc['cume_dist'] = ave(x = withmooc[, 'sal'], withmooc[, 'deptno'], FUN = function (x) { dplyr::cume_dist(x) } )
withmooc <- plyr::ddply(withmooc,.(deptno),transform, PERCENTILE_DISC = min(ifelse(cume_dist >= 0.15, sal , 'NaN')) )
withmooc[order(withmooc$deptno,withmooc$sal),]
withmooc$ename <- withmooc$job <- withmooc$hiredate <- NULL
withmooc[1:10, ]
Results |
empno mgr sal comm deptno cume_dist PERCENTILE_DISC
1 7934 7782 1300 NA 10 0.3333333 1300
2 7782 7839 2450 NA 10 0.6666667 1300
3 7839 NA 5000 NA 10 1.0000000 1300
4 7369 7902 800 NA 20 0.2000000 800
5 7876 7788 1100 NA 20 0.4000000 800
6 7566 7839 2975 NA 20 0.6000000 800
7 7788 7566 3000 NA 20 1.0000000 800
8 7902 7566 3000 NA 20 1.0000000 800
9 7900 7698 950 NA 30 0.1666667 950
10 7521 7698 1250 500 30 0.5000000 950
4. R Dplyr Package
- cume_dist() 함수
R Programming |
%%R
emp %>%
dplyr::arrange(deptno, sal) %>%
dplyr::group_by(deptno) %>%
dplyr::mutate( sal_cume_dist = cume_dist(sal) ) %>%
dplyr::mutate( PERCENTILE_DISC = min(sal[sal_cume_dist >= 0.75]) ) %>%
dplyr::select(-c(hiredate)) %>%
head(10)
Results |
# A tibble: 10 x 9
# Groups: deptno [3]
empno ename job mgr sal comm deptno sal_cume_dist PERCENTILE_DISC
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7934 MILLER CLERK 7782 1300 NA 10 0.333 5000
2 7782 CLARK MANAGER 7839 2450 NA 10 0.667 5000
3 7839 KING PRESIDENT NA 5000 NA 10 1 5000
4 7369 SMITH CLERK 7902 800 NA 20 0.2 3000
5 7876 ADAMS CLERK 7788 1100 NA 20 0.4 3000
6 7566 JONES MANAGER 7839 2975 NA 20 0.6 3000
7 7788 SCOTT ANALYST 7566 3000 NA 20 1 3000
8 7902 FORD ANALYST 7566 3000 NA 20 1 3000
9 7900 JAMES CLERK 7698 950 NA 30 0.167 1600
10 7521 WARD SALESMAN 7698 1250 500 30 0.5 1600
[참고]
R Programming |
%%R
emp %>%
dplyr::arrange(deptno, sal) %>%
dplyr::group_by(deptno) %>%
dplyr::mutate( sal_cume_dist = cume_dist(sal),
sal_lag = lag(dplyr::cume_dist(sal)) , n=1, default = 0, order_by = sal) %>%
dplyr::filter(0.15 >= ifelse(is.na(sal_lag), 0 , sal_lag) & 0.15 < sal_cume_dist) %>%
dplyr::select(-c("ename","job","hiredate"))
Results |
# A tibble: 3 x 10
# Groups: deptno [3]
empno mgr sal comm deptno sal_cume_dist sal_lag n default order_by
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 7934 7782 1300 NA 10 0.333 NA 1 0 1300
2 7369 7902 800 NA 20 0.2 NA 1 0 800
3 7900 7698 950 NA 30 0.167 NA 1 0 950
5. R sqldf Package
함수 구현 : 전체에서 35% 값
R Programming |
%%R
sqldf(" select min(sal) Per_35p
from ( select sal
from emp
order by sal
limit floor(.35 * (select count(*) from emp))
) ")
Results |
Per_35p
1 800
그룹 10에서 35% 값
R Programming |
%%R
sqldf(" select min(sal) Per_35p
from ( select sal
from emp
where deptno=10
order by sal
limit ceil( .35 * (select count(*) from emp where deptno=10) ) )
")
Results |
Per_35p
1 1300
first_value() 와 percent_rank() 함수
R Programming |
%%R
sqldf(" SELECT DISTINCT
deptno,
first_value(sal) OVER ( ORDER BY CASE WHEN p1 <= 0.15 THEN p1 END NULLS LAST ) x1,
first_value(sal) OVER ( PARTITION BY deptno
ORDER BY CASE WHEN p2 <= 0.15 THEN p2 END NULLS LAST) x2
FROM ( SELECT deptno,
sal,
percent_rank() OVER (ORDER BY sal) p1,
percent_rank() OVER (PARTITION BY deptno ORDER BY sal) p2
FROM emp
) t
ORDER BY deptno")
Results |
deptno x1 x2
1 10 800 1300
2 20 800 800
3 30 800 950
R Programming |
%%R
sqldf(" select deptno,
sal,CUME_DIST,
min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15,
min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25,
min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50,
min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75
from (select x.*,
row_number() over (partition by x.deptno
order by x.sal
) as seqnum,
CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,
count(*) over (partition by x.deptno) as cnt
from emp x
) x ")
Results |
deptno sal CUME_DIST percent_15 percent_25 percent_50 percent_75
1 10 1300 0.3333333 1300 1300 2450 5000
2 10 2450 0.6666667 1300 1300 2450 5000
3 10 5000 1.0000000 1300 1300 2450 5000
4 20 800 0.2000000 800 1100 2975 3000
5 20 1100 0.4000000 800 1100 2975 3000
6 20 2975 0.6000000 800 1100 2975 3000
7 20 3000 1.0000000 800 1100 2975 3000
8 20 3000 1.0000000 800 1100 2975 3000
9 30 950 0.1666667 950 1250 1250 1600
10 30 1250 0.5000000 950 1250 1250 1600
11 30 1250 0.5000000 950 1250 1250 1600
12 30 1500 0.6666667 950 1250 1250 1600
13 30 1600 0.8333333 950 1250 1250 1600
14 30 2850 1.0000000 950 1250 1250 1600
6. Python pandasql Package
Python Programming |
ps.sqldf(" select deptno, \
sal,cnt,CUME_DIST, \
min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15, \
min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25, \
min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50, \
min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75 \
from (select x.*, \
row_number() over (partition by x.deptno \
order by x.sal \
) as seqnum, \
CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST, \
count(*) over (partition by x.deptno) as cnt \
from emp x \
) x ").head()
Results |
deptno sal cnt CUME_DIST percent_15 percent_25 percent_50 percent_75
0 10 1300 3 0.333333 1300 2450 2450 5000
1 10 2450 3 0.666667 1300 2450 2450 5000
2 10 5000 3 1.000000 1300 2450 2450 5000
3 20 800 5 0.200000 800 1100 2975 3000
4 20 1100 5 0.400000 800 1100 2975 3000
7. R data.table Package
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[ order(sal), PERCENTILE_DISC := sapply(.SD , function (x) { ifelse( dplyr::cume_dist(x) >= 0.75, as.list(x) , NULL ) } ), ## as.list에서 최초값을 가져오는 로직 이해 아직 안됨.
by = deptno ,.SDcols=c("sal")][order(deptno,sal),][1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno PERCENTILE_DISC
1: 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 5000
2: 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 5000
3: 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 5000
4: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 3000
5: 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 3000
6: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 3000
7: 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 3000
8: 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 3000
9: 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1600
10: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1600
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[order(sal), `:=` ( PER_DISC_15 = min(.SD[dplyr::cume_dist(.SD) >= 0.15]),
PER_DISC_25 = min(.SD[dplyr::cume_dist(.SD) >= 0.25]),
PER_DISC_50 = min(.SD[dplyr::cume_dist(.SD) >= 0.50]),
PER_DISC_75 = min(.SD[dplyr::cume_dist(.SD) >= 0.75]) ), by=deptno,.SDcols=c("sal")][order(deptno,sal),][, c("ename","job","mgr","hiredate") := NULL]
Results |
empno sal comm deptno PER_DISC_15 PER_DISC_25 PER_DISC_50 PER_DISC_75
1: 7934 1300 NA 10 1300 1300 2450 5000
2: 7782 2450 NA 10 1300 1300 2450 5000
3: 7839 5000 NA 10 1300 1300 2450 5000
4: 7369 800 NA 20 800 1100 2975 3000
5: 7876 1100 NA 20 800 1100 2975 3000
6: 7566 2975 NA 20 800 1100 2975 3000
7: 7788 3000 NA 20 800 1100 2975 3000
8: 7902 3000 NA 20 800 1100 2975 3000
9: 7900 950 NA 30 950 1250 1250 1600
10: 7521 1250 500 30 950 1250 1250 1600
11: 7654 1250 1400 30 950 1250 1250 1600
12: 7844 1500 0 30 950 1250 1250 1600
13: 7499 1600 300 30 950 1250 1250 1600
14: 7698 2850 NA 30 950 1250 1250 1600
참고용 : * * duplicated(.SD) 기능도 있음
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[order(sal), `:=` ( test = duplicated(.SD),
test1 = dplyr::cume_dist(duplicated(.SD)),
var_list = ifelse( dplyr::cume_dist(duplicated(.SD)) >= 0.75,.SD, NA_real_) ,
cume_dist = lapply( ifelse( dplyr::cume_dist(duplicated(.SD)) >= 0.75,.SD, NA_real_) ,max ) # 리스트에서 조건을 만족하는 최소값
) ,
by=deptno, .SDcols = c("sal")][order(deptno,sal),][, c("empno","ename","comm","job","mgr","hiredate") := NULL]
Results |
sal deptno test test1 var_list cume_dist
1: 1300 10 FALSE 1.0000000 1300,2450,5000 5000
2: 2450 10 FALSE 1.0000000 1300,2450,5000 5000
3: 5000 10 FALSE 1.0000000 1300,2450,5000 5000
4: 800 20 FALSE 0.8000000 800,1100,2975,3000,3000 NA
5: 1100 20 FALSE 0.8000000 800,1100,2975,3000,3000 NA
6: 2975 20 FALSE 0.8000000 800,1100,2975,3000,3000 NA
7: 3000 20 FALSE 0.8000000 800,1100,2975,3000,3000 NA
8: 3000 20 TRUE 1.0000000 800,1100,2975,3000,3000 3000
9: 950 30 FALSE 0.8333333 950,1250,1250,1500,1600,2850 NA
10: 1250 30 FALSE 0.8333333 950,1250,1250,1500,1600,2850 NA
11: 1250 30 TRUE 1.0000000 950,1250,1250,1500,1600,2850 2850
12: 1500 30 FALSE 0.8333333 950,1250,1250,1500,1600,2850 NA
13: 1600 30 FALSE 0.8333333 950,1250,1250,1500,1600,2850 NA
14: 2850 30 FALSE 0.8333333 950,1250,1250,1500,1600,2850 NA
8. Python Duckdb의 SQL
Python Programming |
%%sql
SELECT DEPTNO,
PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC
FROM emp
GROUP BY DEPTNO
Python Programming |
duckdb.sql(" SELECT DEPTNO, \
PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) AS PERCENTILE_DISC \
FROM emp \
GROUP BY DEPTNO ").df()
Results |
deptno PERCENTILE_DISC
0 20 800
1 30 950
2 10 1300
2) 집계 함수
- 테이블 조인
PARTITION BY 은 현재 지원 안 함.
: PERCENTILE_DISC(0.15) WITHIN GROUP (ORDER BY SAL) OVER (PARTITION BY DEPTNO) AS PERCENTILE_DISC
Python Programming |
%%sql
select a.*,
PERCENT_DISC
from emp a
left join ( SELECT DEPTNO,
percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC
FROM emp
group by deptno) b
on a.deptno = b.deptno
order by a.deptno
LIMIT 6
Python Programming |
duckdb.sql(" select a.deptno,a.empno,a.ename,job,sal, \
PERCENT_DISC \
from emp a \
left join ( SELECT DEPTNO, \
percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC \
FROM emp \
group by deptno) b \
on a.deptno = b.deptno \
order by a.deptno \
LIMIT 6 ").df()
Results |
deptno empno ename job sal PERCENT_DISC
0 10 7782 CLARK MANAGER 2450 5000
1 10 7839 KING PRESIDENT 5000 5000
2 10 7934 MILLER CLERK 1300 5000
3 20 7369 SMITH CLERK 800 3000
4 20 7566 JONES MANAGER 2975 3000
5 20 7788 SCOTT ANALYST 3000 3000
- 서브쿼리 방
Python Programming |
%%sql
select a.*,
( SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC
FROM emp b
where a.deptno = b.deptno) PERCENT_DISC
from emp a
order by a.deptno
LIMIT 6
Python Programming |
duckdb.sql(" select a.deptno,a.empno,a.ename,job,sal, \
( SELECT percentile_disc(0.75) WITHIN GROUP (ORDER BY SAL) AS PERCENT_DISC \
FROM emp b \
where a.deptno = b.deptno) PERCENT_DISC \
from emp a \
order by a.deptno \
LIMIT 6 ").df()
Results |
deptno empno ename job sal PERCENT_DISC
0 10 7782 CLARK MANAGER 2450 5000
1 10 7839 KING PRESIDENT 5000 5000
2 10 7934 MILLER CLERK 1300 5000
3 20 7369 SMITH CLERK 800 3000
4 20 7566 JONES MANAGER 2975 3000
5 20 7788 SCOTT ANALYST 3000 3000
Python Programming |
%%sql
select deptno,
sal,cnt,CUME_DIST,
min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as percent_15,
min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as percent_25,
min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as percent_50,
min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as percent_75
from ( select x.*,
row_number() over (partition by x.deptno
order by x.sal
) as seqnum,
CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST,
count(*) over (partition by x.deptno) as cnt
from emp x
) x
order by x.deptno
LIMIT 6
Python Programming |
duckdb.sql(" select deptno, \
sal,cnt,CUME_DIST, \
min(case when CUME_DIST >= 0.15 then x.sal end) over (partition by x.deptno) as perc_15, \
min(case when CUME_DIST >= 0.25 then x.sal end) over (partition by x.deptno) as perc_25, \
min(case when CUME_DIST >= 0.50 then x.sal end) over (partition by x.deptno) as perc_50, \
min(case when CUME_DIST >= 0.75 then x.sal end) over (partition by x.deptno) as perc_75 \
from ( select x.*, \
row_number() over (partition by x.deptno \
order by x.sal \
) as seqnum, \
CUME_DIST() OVER (PARTITION BY x.deptno ORDER BY sal) CUME_DIST, \
count(*) over (partition by x.deptno) as cnt \
from emp x \
) x \
order by x.deptno \
LIMIT 6 ").df()
Results |
deptno sal cnt CUME_DIST perc_15 perc_25 perc_50 perc_75
0 10 1300 3 0.333333 1300 1300 2450 5000
1 10 2450 3 0.666667 1300 1300 2450 5000
2 10 5000 3 1.000000 1300 1300 2450 5000
3 20 800 5 0.200000 800 1100 2975 3000
4 20 1100 5 0.400000 800 1100 2975 3000
5 20 2975 5 0.600000 800 1100 2975 3000
--------------------------------------------
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크 |
오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크 |
반응형
댓글