포스팅 목차
* 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
[ MEDIAN Oracle Function ]
MEDIAN함수는 연속형 분포 모형을 가정하는 역 분포 함수이다. 이 함수는 수치 또는 날자값을 취하고, 중앙값 또는 값의 정렬 후에 중앙값이 되는 보간된 값을 반환한다. Null값들은 계산 시 무시된다.
- 함수 설명 : MEDIAN 오라클 함수 링크
- Windows 함수는 first_value 함수 참조
1. Oracle(오라클)
Median 함수
다음 예제는 employees 테이블의 개별 직무 내에서 급여의 중앙값을 계산한다.
Oracle Programming |
SELECT deptno, MEDIAN(sal) sal_median
FROM emp
GROUP BY deptno
Results |
DEPTNO SAL_MEDIAN
-----------------
10 2450
20 2975
30 1375
2. Python Pandas(파이썬)
1) 집계 함수
- Median 함수
Python Programming |
#Total
display( emp.groupby('deptno').agg({'sal':['median']}).reset_index() )
display( emp.groupby('deptno')['sal'].median() )
Results |
deptno sal
median
0 10 2450
1 20 2975
2 30 1375
Results |
deptno
10 2450
20 2975
30 1375
Name: sal, dtype: int64
2) 분석함수
- Transform()과 Median 함수
Python Programming |
withmooc = emp.copy()
withmooc['sal_median'] = emp.groupby('deptno')['sal'].transform(lambda x:x.median())
withmooc.sort_values('deptno').head()
Results |
empno ename job mgr hiredate sal comm deptno sal_median
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2450
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2450
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2450
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 2975
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 2975
- groupby()와 Median 함수
Python Programming |
withmooc = emp.copy()
group_median = withmooc.groupby('deptno')['sal'].median()
withmooc['sal_median'] = withmooc['deptno'].map(group_median)
withmooc.sort_values('deptno').head()
Results |
empno ename job mgr hiredate sal comm deptno sal_median
6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 2450
8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10 2450
13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10 2450
0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 2975
3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 2975
3. R Programming (R Package)
1) 집계 함수
- Aggregate()와 Median
R Programming |
%%R
aggregate(sal ~ deptno, data=emp , FUN=median)
Results |
deptno sal
1 10 2450
2 20 2975
3 30 1375
- ave()와 Median
R Programming |
%%R
with(emp, ave(sal, deptno, FUN=median))
Results |
[1] 2975 1375 1375 2975 1375 1375 2450 2975 2450 1375 2975 1375 2975 2450
- by()와 Median
R Programming |
%%R
unlist(with(emp, by(sal, deptno, median)))
Results |
deptno: 10
[1] 2450
------------------------------------------------------------
deptno: 20
[1] 2975
------------------------------------------------------------
deptno: 30
[1] 1375
- Map()과 Median
R Programming |
%%R
do.call(rbind, Map(median, split(emp$sal, emp$deptno)))
Results |
[,1]
10 2450
20 2975
30 1375
- purrr:map_dfr()와 Median
R Programming |
%%R
# purrr::map_chr(split(emp,emp$deptno), ~median(.x$sal))
purrr::map_dfr(split(emp,emp$deptno), ~median(.x$sal))
Results |
# A tibble: 1 x 3
`10` `20` `30`
<dbl> <dbl> <dbl>
1 2450 2975 1375
- purrr:map_dfr()와 Median
R Programming |
%%R
do.call(rbind, purrr::map_dfr(split(emp,emp$deptno), ~median(.x$sal)) )
Results |
[,1]
10 2450
20 2975
30 1375
- lapply()와 Median
R Programming |
%%R
do.call(rbind, lapply(split(emp$sal, emp$deptno), median))
Results |
[,1]
10 2450
20 2975
30 1375
2) 분석함수
- ave()와 Median
R Programming |
%%R
withmooc <- emp
withmooc['sal_median'] = with(withmooc, ave(x = sal, deptno, FUN =function(x) { median(x) })) # FUN = median
withmooc[order(withmooc$job),][1:10, ]
Results |
# A tibble: 10 x 9
empno ename job mgr hiredate sal comm deptno sal_median
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
2 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
3 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
4 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
5 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1375
6 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
7 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
8 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 1375
9 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
10 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
- ave()와 Median
R Programming |
%%R
withmooc <- emp
withmooc['sal_median'] = ave(x = withmooc$sal, withmooc$deptno, FUN = function (x) { median(x) } ) # FUN = median
withmooc[order(withmooc$deptno),][1:10, ]
Results |
# A tibble: 10 x 9
empno ename job mgr hiredate sal comm deptno sal_median
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
7 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
8 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
9 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1375
10 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1375
- purrr:map_chr()과 Median
R Programming |
%%R
withmooc <- emp
# split 되는 그룹내의 관측치 순서 보존을 위하여 데이터와 group 데이터 양쪽 재 정렬 필요
Median_var <- with(withmooc, purrr::map_chr(base::split(withmooc,withmooc$deptno), ~first(.x$sal)) )
data.frame(withmooc, Median_value =rep(Median_var, table(withmooc$deptno)))[1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno Median_value
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2450.000000
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 2450.000000
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 2450.000000
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 800.000000
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 800.000000
6 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 800.000000
7 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 800.000000
8 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 800.000000
9 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 1600.000000
10 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 1600.000000
- tapply()와 Median
R Programming |
%%R
withmooc <- emp
Median_var <- with(withmooc, tapply(sal, deptno, median))
data.frame(emp, Median_value =rep(Median_var, table(withmooc$deptno)))
# with(withmooc, data.frame(emp, Median_value=rep(tapply(sal, deptno, median), table(deptno))) )[1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno Median_value
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2450
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 2450
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 2450
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
5 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 2975
6 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 2975
7 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2975
8 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
9 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 1375
10 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 1375
11 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 1375
12 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1375
13 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 1375
14 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 1375
- plyr::ddply()와 Median
R Programming |
%%R
library(plyr) # dplyr 보다 먼저 로딩
library(dplyr)
plyr::ddply(emp,.(deptno),transform,sal_median = median(sal))[1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno sal_median
1 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
7 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
8 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
9 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1375
10 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1375
- purrr:map_dfc()와 Median
R Programming |
%%R
withmooc <- emp
mapping <- t( purrr::map_dfc( split(withmooc,withmooc$deptno), ~median(.x$sal) ) )
trans_data <- melt(mapping, value.name="sal_median")
merge(emp,trans_data,by.x='deptno',by.y='Var1')[1:10, ]
Results |
deptno empno ename job mgr hiredate sal comm Var2 sal_median
1 10 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 1 2450
2 10 7839 KING PRESIDENT NA 1981-11-17 5000 NA 1 2450
3 10 7934 MILLER CLERK 7782 1982-01-23 1300 NA 1 2450
4 20 7369 SMITH CLERK 7902 1980-12-17 800 NA 1 2975
5 20 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 1 2975
6 20 7566 JONES MANAGER 7839 1981-04-02 2975 NA 1 2975
7 20 7902 FORD ANALYST 7566 1981-12-03 3000 NA 1 2975
8 20 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 1 2975
9 30 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 1 1375
10 30 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 1 1375
4. R Dplyr Package
1) 집계 함수
- dplyr::summarize()와 median
R Programming |
%%R
emp %>%
dplyr::group_by(deptno) %>%
dplyr::summarize(sal_median = median(sal))
Results |
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 3 x 2
deptno sal_median
<dbl> <dbl>
1 10 2450
2 20 2975
3 30 1375
- map()과 median
R Programming |
%%R
emp %>%
tbl_df() %>%
nest(-deptno) %>%
mutate(sal_median = map(data, ~ median(.$sal))) %>%
unnest(sal_median)
Results |
# A tibble: 3 x 3
deptno data sal_median
<dbl> <list> <dbl>
1 20 <tibble [5 x 7]> 2975
2 30 <tibble [6 x 7]> 1375
3 10 <tibble [3 x 7]> 2450
2) 분석함수
- Median() 함수
R Programming |
%%R
emp %>%
dplyr::group_by(deptno) %>%
dplyr::mutate(first_value = median(sal)) %>%
dplyr::arrange(deptno) %>%
head()
Results |
# A tibble: 6 x 9
# Groups: deptno [2]
empno ename job mgr hiredate sal comm deptno median_value
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <dbl> <dbl>
1 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
- purrr::map_dfr()과 median
R Programming |
%%R
withmooc <- emp %>% mutate(deptno = as.character(deptno))
withmooc %>%
dplyr::left_join( ( purrr::map_dfr(split(withmooc,withmooc$deptno), ~median(.x$sal)) %>%
gather(key = deptno, value = median_sal )
),
id = "deptno"
) %>%
dplyr::arrange(deptno) %>%
head()
Results |
Joining, by = "deptno"
# A tibble: 6 x 9
empno ename job mgr hiredate sal comm deptno median_sal
<dbl> <chr> <chr> <dbl> <date> <dbl> <dbl> <chr> <dbl>
1 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
5. R sqldf Package
1) 집계 함수
- Median() 함수
R Programming |
%%R
sqldf(" SELECT deptno, MEDIAN(sal) sal_median
FROM emp
GROUP BY deptno ")
Results |
deptno sal_median
1 10 2450
2 20 2975
3 30 1375
2) 분석함수
- Median() 함수
윈도우 함수 현재 지원 안 함.
R Programming |
%%R
sqldf(" select a.*,b.sal_median
from emp a,
( SELECT deptno, MEDIAN(sal) sal_median
FROM emp
GROUP BY deptno ) b
where a.deptno = b.deptno
order by a.deptno")
Results |
empno ename job mgr hiredate sal comm deptno sal_median
1 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
7 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
8 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
9 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1375
10 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1375
11 7654 MARTIN SALESMAN 7698 1981-09-28 1250 1400 30 1375
12 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 1375
13 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 1375
14 7900 JAMES CLERK 7698 1981-12-03 950 NA 30 1375
6. Python pandasql Package
- Median() 함수
Python Programming |
ps.sqldf(" SELECT deptno, median(sal) sal_median \
FROM emp \
GROUP BY deptno ")
7. R data.table Package
1) 집계 함수
- By = 와 Median() 함수
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[, median(sal), by = deptno]
Results |
deptno V1
1: 20 2975
2: 30 1375
3: 10 2450
2) 분석함수
- By = 와 Median() 함수
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[, sal_median := median(sal), by = deptno][order(deptno),][1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno sal_median
1: 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2: 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3: 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6: 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
7: 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
8: 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
9: 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1375
10: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1375
- By = 와 Median() 함수
R Programming |
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[, `:=` ( sal_median = median(unlist(.SD)) ),
by = deptno,
.SDcols = c("sal")][order(deptno),][1:10, ]
Results |
empno ename job mgr hiredate sal comm deptno sal_median
1: 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 2450
2: 7839 KING PRESIDENT NA 1981-11-17 5000 NA 10 2450
3: 7934 MILLER CLERK 7782 1982-01-23 1300 NA 10 2450
4: 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 2975
5: 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 2975
6: 7788 SCOTT ANALYST 7566 1982-12-09 3000 NA 20 2975
7: 7876 ADAMS CLERK 7788 1983-01-12 1100 NA 20 2975
8: 7902 FORD ANALYST 7566 1981-12-03 3000 NA 20 2975
9: 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 1375
10: 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 1375
8. Python DuckDB의 SQL
1) 집계 함수
Python Programming |
%%sql
SELECT deptno,
MEDIAN(sal) sal_median
FROM emp
GROUP BY 1
ORDER BY 1
Python Programming |
duckdb.sql(" SELECT deptno, \
MEDIAN(sal) sal_median \
FROM emp \
GROUP BY 1 \
ORDER BY 1 ").df()
Results |
deptno sal_median
0 10 2450.0
1 20 2975.0
2 30 1375.0
2) 분석함수
Python Programming |
%%sql
SELECT A.*,
B.SAL_MEDIAN
FROM emp A
LEFT JOIN ( SELECT deptno,
MEDIAN(sal) sal_median
FROM emp
GROUP BY 1 ) B
ON A.DEPTNO = B.DEPTNO
ORDER BY 1
LIMIT 5
Python Programming |
duckdb.sql(" SELECT A.empno,a.ename,a.sal, \
B.SAL_MEDIAN \
FROM emp A \
LEFT JOIN ( SELECT deptno, \
MEDIAN(sal) sal_median \
FROM emp \
GROUP BY 1 ) B \
ON A.DEPTNO = B.DEPTNO \
ORDER BY 1 \
LIMIT 5 ").df()
Results |
empno ename sal sal_median
0 7369 SMITH 800 2975.0
1 7499 ALLEN 1600 1375.0
2 7521 WARD 1250 1375.0
3 7566 JONES 2975 2975.0
4 7654 MARTIN 1250 1375.0
Python Programming |
%%sql
SELECT A.*,
( SELECT MEDIAN(sal) sal_median
FROM emp B
WHERE A.DEPTNO = B.DEPTNO) AS SAL_MEDIAN
FROM emp A
ORDER BY 1
LIMIT 5
Python Programming |
duckdb.sql(" SELECT A.empno,a.ename,a.sal, \
( SELECT MEDIAN(sal) sal_median \
FROM emp B \
WHERE A.DEPTNO = B.DEPTNO) AS SAL_MEDIAN \
FROM emp A \
ORDER BY 1 \
LIMIT 5 ").df()
Results |
empno ename sal SAL_MEDIAN
0 7369 SMITH 800 2975.0
1 7499 ALLEN 1600 1375.0
2 7521 WARD 1250 1375.0
3 7566 JONES 2975 2975.0
4 7654 MARTIN 1250 1375.0
--------------------------------------------
[Oracle, Pandas, R Prog, Dplyr, Sqldf, Pandasql, Data.Table] 오라클 함수와 R & Python 비교 사전 목록 링크 |
오라클 SQL 함수(Oracle SQL Function) 목록 리스트 링크 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트 링크 링크 |
반응형
댓글