포스팅 목차
12. Display the names of employees who are working in the company for the past 5 years.
* 최근 5년 이상 근무한 직원의 이름을 표시하시오
- 파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
- [날짜 데이터] 근속 기간 계산 후 데이터 필터 - 현재 날짜, 날짜 기간 차이 계산
- 조회를 위하여 2020년 현재 39년으로 예제 변경
- 날짜처리 예제 :12 / 13 / 123번
|
1. 오라클(Oracle)
Oracle Programming |
select ename from emp
where sysdate-hiredate > 39*365;
2. 파이썬(Pandas)
- dateutil 패키지를 사용하여 처리
Python Programming |
from datetime import datetime # 날짜모듈
# 날짜 처리 예시
date1 = datetime.strptime(str('2013-08-15 12:00:00'), '%Y-%m-%d %H:%M:%S')
date2 = datetime.strptime(str('2012-02-15'), '%Y-%m-%d')
from dateutil import relativedelta
# date1과 date2 날짜 사이의 기간을 계산한다.
r = relativedelta.relativedelta(date1, date2)
print(r)
print(r.days)
Results |
relativedelta(years=+1, months=+6, hours=+12)
0
to_datetime 함수를 사용하여서 날짜형 데이터로 변수 형변환
Python Programming |
pd.to_datetime(emp['hiredate'], format='%Y-%m-%d %H:%M:%S', errors='raise')
Results |
0 1980-12-17
1 1981-02-20
2 1981-02-22
3 1981-04-02
4 1981-09-28
5 1981-03-01
6 1981-01-09
7 1982-12-09
8 1981-11-17
9 1981-09-08
10 1983-01-12
11 1981-12-03
12 1981-12-03
13 1982-01-23
Name: hiredate, dtype: datetime64[ns]
Python Programming |
import datetime
import dateutil
# Get the current date
now = datetime.datetime.utcnow()
now = now.date()
# Get the difference between the current date and the birthday
age = dateutil.relativedelta.relativedelta(now, pd.to_datetime(emp['hiredate'][0], format='%Y %m %d'))
age = age.years
age
Results |
39
dateutil의 relativedelta : 현재 날짜와 입사일(hiredate) 사이의 GAP을 계산하여서 39년 이상인 데이터 추출
Python Programming |
from dateutil import relativedelta
emp[emp.apply(lambda row: relativedelta.relativedelta(pd.Timestamp.now(),
pd.to_datetime(row['hiredate'], format='%Y %m %d')).years, axis=1) >= 39][["ename"]]
Results |
ename | |
0 | SMITH |
1 | ALLEN |
2 | WARD |
3 | JONES |
5 | BLAKE |
6 | CLARK |
9 | TURNER |
두 날짜 사이의 GAP을 계산 후 39년 이상 근무한 직원을 선택한다.
Python Programming |
emp[ (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days > 39*365 ][["ename"]]
Results |
ename | |
0 | SMITH |
1 | ALLEN |
2 | WARD |
3 | JONES |
5 | BLAKE |
6 | CLARK |
9 | TURNER |
Python Programming |
emp.loc[((pd.Timestamp.now() - pd.to_datetime(emp.hiredate)).dt.days > 39 *365 ) , ["ename"]]
Results |
ename | |
0 | SMITH |
1 | ALLEN |
2 | WARD |
3 | JONES |
5 | BLAKE |
6 | CLARK |
9 | TURNER |
- [참고] Pandas의 query 구문 사용 예제
Python Programming |
start_date, end_date = '1980/01/02', '1981/06/04'
emp.query('hiredate >= @start_date and hiredate <= @end_date')
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | hiredate_D | |
0 | 7369 | SMITH | CLERK | 7902.0 | 1980/12/17 | 800 | NaN | 20 | 1980-12-17 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981/02/20 | 1600 | 300.0 | 30 | 1981-02-20 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981/02/22 | 1250 | 500.0 | 30 | 1981-02-22 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981/04/02 | 2975 | NaN | 20 | 1981-04-02 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981/03/01 | 2850 | NaN | 30 | 1981-03-01 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981/01/09 | 2450 | NaN | 10 | 1981-01-09 |
- pd.Timestamp를 사용하여서 문자 데이터를 날짜로 변경
Python Programming |
ts = pd.Timestamp
emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
emp.query('hiredate_D >= @ts("19820201T071320")')
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | hiredate_D | |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1982/12/09 | 3000 | NaN | 20 | 1982-12-09 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1983/01/12 | 1100 | NaN | 20 | 1983-01-12 |
- hiredate 변수를 날짜형으로 변경하여서 신규 변수 생성. 생성된 변수와 현재 날짜 비교
- query 상에서 현재 날짜와 hiredate 사이의 비교(-) 처리를 못하여서 현재 날짜를 기준으로 - 39년 날짜 계산 후 처리
Python Programming |
from dateutil.relativedelta import relativedelta
emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
now = datetime.datetime.now() - relativedelta(years=39)
emp.query("@now > hiredate_D", engine='python')[["ename"]]
Results |
ename | |
0 | SMITH |
1 | ALLEN |
2 | WARD |
3 | JONES |
5 | BLAKE |
6 | CLARK |
9 | TURNER |
- pd.to_datetime를 적용하여서 hiredate를 날짜형으로 변경 후 현재 날짜를 기준으로 -39년 날짜와 비교
Python Programming |
from dateutil.relativedelta import relativedelta
ts_chg = pd.to_datetime
base_dt = datetime.datetime.now() - relativedelta(years=39)
emp.query('@base_dt > @ts_chg(hiredate)')[["ename"]]
Results |
ename | |
0 | SMITH |
1 | ALLEN |
2 | WARD |
3 | JONES |
5 | BLAKE |
6 | CLARK |
9 | TURNER |
3. R Programming (R Package)
- difftime 함수를 사용하여서 현재 날짜와 입사일 사이의 기간을 계산
R Programming |
%%R
difftime(Sys.Date(), emp$hiredate) / 365
Results |
Time differences in days
[1] 39.75342 39.57534 39.56986 39.46301 38.97260 39.55068 39.69041 37.77534
[9] 38.83562 39.02740 37.68219 38.79178 38.79178 38.65205
R Programming |
%%R
as.numeric(Sys.Date() - as.Date(emp$hiredate)) / 365
Results |
[1] 39.75342 39.57534 39.56986 39.46301 38.97260 39.55068 39.69041 37.77534
[9] 38.83562 39.02740 37.68219 38.79178 38.79178 38.65205
R Programming |
%%R
emp[Sys.Date() - emp$hiredate > 39*365 , c("empno","ename") ]
Results |
# A tibble: 7 x 2
empno ename
<dbl> <chr>
1 7369 SMITH
2 7499 ALLEN
3 7521 WARD
4 7566 JONES
5 7698 BLAKE
6 7782 CLARK
7 7844 TURNER
4. R Dplyr Package
*현재 날짜와 입사일 사이의 기간을 계산 후 filter 구문을 사용하여서 39년 이상 근무한 직원을 선택한다.
R Programming |
%%R
emp %>% filter( (Sys.Date() - hiredate) > 39*365 ) %>% dplyr::select(ename)
Results |
# A tibble: 7 x 1
ename
<chr>
1 SMITH
2 ALLEN
3 WARD
4 JONES
5 BLAKE
6 CLARK
7 TURNER
5. R sqldf Package
- 현재날짜를 인수로 지정하여서 출력
R Programming |
%%R
sprintf("select * from DF where a >= %d", Sys.Date())
Results |
[1] "select * from DF where a >= 18580"
- 위에서 산출된 현재 날짜를 sqldf 쿼리에 직접 입력하여 처리(수작업)
R Programming |
%%R
require(sqldf)
sqldf("select ename,
strftime('%Y-%m-%d', 'now') now_day,
hiredate,
(strftime('%d-%m-%Y', 'now') - hiredate) Gap,
(hiredate - strftime('%d-%m-%Y', 'now')) Gap_1
from emp
where (18580 - hiredate) > 39*365")
Results |
ename now_day hiredate Gap Gap_1
1 SMITH 2020-11-13 1980-12-17 -3990 3990
2 ALLEN 2020-11-13 1981-02-20 -4055 4055
3 WARD 2020-11-13 1981-02-22 -4057 4057
4 JONES 2020-11-13 1981-04-02 -4096 4096
5 BLAKE 2020-11-13 1981-03-01 -4064 4064
6 CLARK 2020-11-13 1981-01-09 -4013 4013
7 TURNER 2020-11-13 1981-09-08 -4255 4255
- [참고] gsubfn 패키지 fn$ 을 사용하여서 sqldf에 파이썬의 함수와 파라미터의 값을 전달 방식
- Note that we must use the fn$ prefix to invoke the interpolation functionality
- gsubfn 패키지 fn$ 참고(Insert Variables) : https://github.com/ggrothendieck/sqldf
R Programming |
# %%R
# 파라미터 처리
# p1 <- 7
# fn$sqldf("select * from iris where Sepal_Length > $p1")
R Programming |
emp
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | |
0 | 7369 | SMITH | CLERK | 7902.0 | 1980/12/17 | 800 | NaN | 20 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981/02/20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981/02/22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981/04/02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981/09/28 | 1250 | 1400.0 | 30 |
5 | 7698 | BLAKE | MANAGER | 7839.0 | 1981/03/01 | 2850 | NaN | 30 |
6 | 7782 | CLARK | MANAGER | 7839.0 | 1981/01/09 | 2450 | NaN | 10 |
7 | 7788 | SCOTT | ANALYST | 7566.0 | 1982/12/09 | 3000 | NaN | 20 |
8 | 7839 | KING | PRESIDENT | NaN | 1981/11/17 | 5000 | NaN | 10 |
9 | 7844 | TURNER | SALESMAN | 7698.0 | 1981/09/08 | 1500 | 0.0 | 30 |
10 | 7876 | ADAMS | CLERK | 7788.0 | 1983/01/12 | 1100 | NaN | 20 |
11 | 7900 | JAMES | CLERK | 7698.0 | 1981/12/03 | 950 | NaN | 30 |
12 | 7902 | FORD | ANALYST | 7566.0 | 1981/12/03 | 3000 | NaN | 20 |
13 | 7934 | MILLER | CLERK | 7782.0 | 1982/01/23 | 1300 | NaN | 10 |
Sys.Date()
를 fn$sqldf 기능과 함께 사용- [참고] strftime('%d-%m-%Y', 'now') : 현재 날짜를 문자로 반환
R Programming |
%%R
gsubfn::fn$sqldf("select *, strftime('%d-%m-%Y', 'now') now_char from emp where (`Sys.Date()` - hiredate) > 39 * 365 ")
Results |
empno ename job mgr hiredate sal comm deptno now_char
1 7369 SMITH CLERK 7902 1980-12-17 800 NA 20 09-09-2020
2 7499 ALLEN SALESMAN 7698 1981-02-20 1600 300 30 09-09-2020
3 7521 WARD SALESMAN 7698 1981-02-22 1250 500 30 09-09-2020
4 7566 JONES MANAGER 7839 1981-04-02 2975 NA 20 09-09-2020
5 7698 BLAKE MANAGER 7839 1981-03-01 2850 NA 30 09-09-2020
6 7782 CLARK MANAGER 7839 1981-01-09 2450 NA 10 09-09-2020
7 7844 TURNER SALESMAN 7698 1981-09-08 1500 0 30 09-09-2020
6. Python pandasql Package
[참고] format 구문
Python Programming |
import pandasql as ps
a = 'v2'
b = 'v10'
q = "SELECT {} FROM df1 ".format(a+','+b)
q
Results |
'SELECT v2,v10 FROM df1 '
- [참고] format 구문
Python Programming |
import pandasql as ps
var = 'hiredate'
query = "SELECT {} FROM emp ".format(var)
ps.sqldf(query, globals())
Results |
hiredate | |
0 | 1980/12/17 |
1 | 1981/02/20 |
2 | 1981/02/22 |
3 | 1981/04/02 |
4 | 1981/09/28 |
5 | 1981/03/01 |
6 | 1981/01/09 |
7 | 1982/12/09 |
8 | 1981/11/17 |
9 | 1981/09/08 |
10 | 1983/01/12 |
11 | 1981/12/03 |
12 | 1981/12/03 |
13 | 1982/01/23 |
- 반복구문 처리
Python Programming |
import pandasql as ps
test = ['''CLERK''',"""SALESMAN"""]
for i in test:
query = """SELECT sal FROM emp where job = '""" + i + """'"""
print(ps.sqldf(query, globals()))
print()
Results |
sal
0 800
1 1100
2 950
3 1300
sal
0 1600
1 1250
2 1250
3 1500
- [참고] cast 함수 : 형변환 함수
Python Programming |
# emp.dtypes
Python Programming |
import pandasql as ps
query = """SELECT cast(hiredate as integer) FROM emp """
ps.sqldf(query, globals()).head()
Results |
cast(hiredate as integer) | |
0 | 1980 |
1 | 1981 |
2 | 1981 |
3 | 1981 |
4 | 1981 |
- pandasql 연습
Python Programming |
import datetime
nowdate= datetime.datetime.now()
format= "%Y-%m-%d"
import time
emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
emp['to_date'] = pd.Timestamp.now()
emp['day_gap'] = (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days
query = """SELECT to_date 현재날짜1,
date('now') 현재날짜2,
datetime('now') 현재날짜3,
typeof(hiredate) 변수Type확인,
sqlite_version() sqlite버젼확인,
date('2020-09-10', 'start of month') 초일자,
CAST(STRFTIME('%Y %m %d', hiredate_D) AS date) cast함수,
STRFTIME('%Y %m %d', hiredate_D) 문자반환,
CAST(hiredate AS date) x2,
(cast(strftime('%s',hiredate_D) as interger))/(24*60*60) x3
FROM emp
WHERE day_gap > 39*365"""
# print(query)
query = ps.sqldf(query, globals())
query.head()
Results |
현재날짜 | 현재날짜2 | 현재날짜3 | 변수Type확인 | sqlite버젼확인 | 초일자 | cast함수 | 문자반환 | x2 | x3 | |
0 | 2020-09-10 09:45:49.454895 | 2020-09-10 | 2020-09-10 00:45:49 | text | 3.32.3 | 2020-09-01 | 1980 | 1980 12 17 | 1980 | 4003 |
1 | 2020-09-10 09:45:49.454895 | 2020-09-10 | 2020-09-10 00:45:49 | text | 3.32.3 | 2020-09-01 | 1981 | 1981 02 20 | 1981 | 4068 |
2 | 2020-09-10 09:45:49.454895 | 2020-09-10 | 2020-09-10 00:45:49 | text | 3.32.3 | 2020-09-01 | 1981 | 1981 02 22 | 1981 | 4070 |
3 | 2020-09-10 09:45:49.454895 | 2020-09-10 | 2020-09-10 00:45:49 | text | 3.32.3 | 2020-09-01 | 1981 | 1981 04 02 | 1981 | 4109 |
4 | 2020-09-10 09:45:49.454895 | 2020-09-10 | 2020-09-10 00:45:49 | text | 3.32.3 | 2020-09-01 | 1981 | 1981 03 01 | 1981 | 4077 |
- 데이터에 날짜형 변수를 직접 생성 후 처리
Python Programming |
import datetime
nowdate= datetime.datetime.now()
format= "%Y-%m-%d"
import time
emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
emp['to_date'] = pd.Timestamp.now()
emp['day_gap'] = (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days
query = """SELECT to_date,
hiredate_D,
day_gap,
ename
FROM emp
WHERE day_gap > 39*365"""
ps.sqldf(query, globals())
Results |
to_date | hiredate_D | day_gap | ename | |
0 | 2020-09-10 09:39:10.707798 | 1980-12-17 00:00:00.000000 | 14512 | SMITH |
1 | 2020-09-10 09:39:10.707798 | 1981-02-20 00:00:00.000000 | 14447 | ALLEN |
2 | 2020-09-10 09:39:10.707798 | 1981-02-22 00:00:00.000000 | 14445 | WARD |
3 | 2020-09-10 09:39:10.707798 | 1981-04-02 00:00:00.000000 | 14406 | JONES |
4 | 2020-09-10 09:39:10.707798 | 1981-03-01 00:00:00.000000 | 14438 | BLAKE |
5 | 2020-09-10 09:39:10.707798 | 1981-01-09 00:00:00.000000 | 14489 | CLARK |
6 | 2020-09-10 09:39:10.707798 | 1981-09-08 00:00:00.000000 | 14247 | TURNER |
- 문자형 날짜 데이터를 날짜로 변경하여 처리
Python Programming |
import datetime
nowdate= datetime.datetime.now()
format= "%Y-%m-%d"
import time
query = """SELECT to_date,
date('now'),
ename,
(strftime('%s','now') -
strftime('%s',(substr(hiredate,1,4)||'-'||substr(hiredate,6,2)||'-'||substr(hiredate,9,2))))/60/60/24 second_gap
FROM emp
WHERE (strftime('%s','now') -
strftime('%s',(substr(hiredate,1,4)||'-'||substr(hiredate,6,2)||'-'||substr(hiredate,9,2))))/60/60/24 > 39*365"""
# print(query)
ps.sqldf(query, globals())
Results |
to_date | date('now') | ename | second_gap | |
0 | 2020-09-10 09:45:49.454895 | 2020-09-10 | SMITH | 14512 |
1 | 2020-09-10 09:45:49.454895 | 2020-09-10 | ALLEN | 14447 |
2 | 2020-09-10 09:45:49.454895 | 2020-09-10 | WARD | 14445 |
3 | 2020-09-10 09:45:49.454895 | 2020-09-10 | JONES | 14406 |
4 | 2020-09-10 09:45:49.454895 | 2020-09-10 | BLAKE | 14438 |
5 | 2020-09-10 09:45:49.454895 | 2020-09-10 | CLARK | 14489 |
6 | 2020-09-10 09:45:49.454895 | 2020-09-10 | TURNER | 14247 |
7. R data.table Package
%%R
DT <- data.table(emp)
dept_DT <- data.table(dept)
DT[(Sys.Date() - hiredate) > 39 * 365, .(ename)]
Results |
ename
1: SMITH
2: ALLEN
3: WARD
4: JONES
5: MARTIN
6: BLAKE
7: CLARK
8: TURNER
8. SAS Proc SQL
%%SAS sas
PROC SQL;
CREATE TABLE STATSAS_1 AS
SELECT ENAME,
TODAY() - hiredate as day_gap,
datdif(hiredate, TODAY(), 'act/act') as day_gap_1
FROM EMP
WHERE TODAY() - hiredate > 40*365;
QUIT;
PROC PRINT;RUN;
Results |
OBS | ename | day_gap | day_gap_1 |
1 | SMITH | 14727 | 14727 |
2 | ALLEN | 14662 | 14662 |
3 | WARD | 14660 | 14660 |
4 | JONES | 14621 | 14621 |
5 | BLAKE | 14653 | 14653 |
6 | CLARK | 14704 | 14704 |
9. SAS Data Step
%%SAS sas
DATA STATSAS_2;
SET EMP;
To_day = TODAY();
Day_Gap = TODAY() - hiredate;
Day_gap_1 = datdif(hiredate, TODAY(), 'act/act');
KEEP ENAME To_day Day_Gap Day_gap_1 HIREDATE;
WHERE datdif(hiredate, TODAY(), 'act/act') > 40*365;
RUN;
PROC PRINT;RUN;
Results |
OBS | ename | hiredate | To_day | Day_Gap | Day_gap_1 |
1 | SMITH | 1980-12-17 | 22383 | 14727 | 14727 |
2 | ALLEN | 1981-02-20 | 22383 | 14662 | 14662 |
3 | WARD | 1981-02-22 | 22383 | 14660 | 14660 |
4 | JONES | 1981-04-02 | 22383 | 14621 | 14621 |
5 | BLAKE | 1981-03-01 | 22383 | 14653 | 14653 |
6 | CLARK | 1981-01-09 | 22383 | 14704 | 14704 |
10. Python Dfply Package
- 별도 함수 정의(make_symbolic)
Python Programming |
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
emp >> filter_by( (pd.Timestamp.now() - to_datetime(X.hiredate)).dt.days > 39*365 ) >> head()
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | |
0 | 7369 | SMITH | CLERK | 7902.0 | 1980/12/17 | 800 | NaN | 20 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981/02/20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981/02/22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981/04/02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981/09/28 | 1250 | 1400.0 | 30 |
Python Programming |
emp >> \
filter_by( (pd.Timestamp.now() - make_symbolic(to_datetime)(X.hiredate)).dt.days > 39*365 ) >> \
head()
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | |
0 | 7369 | SMITH | CLERK | 7902.0 | 1980/12/17 | 800 | NaN | 20 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981/02/20 | 1600 | 300.0 | 30 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981/02/22 | 1250 | 500.0 | 30 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981/04/02 | 2975 | NaN | 20 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981/09/28 | 1250 | 1400.0 | 30 |
Python Programming |
@make_symbolic
def to_datetime(series, infer_datetime_format=True):
return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
emp >> mutate(pd_date=to_datetime(X.hiredate)) >> head()
Results |
empno | ename | job | mgr | hiredate | sal | comm | deptno | pd_date | |
0 | 7369 | SMITH | CLERK | 7902.0 | 1980/12/17 | 800 | NaN | 20 | 1980-12-17 |
1 | 7499 | ALLEN | SALESMAN | 7698.0 | 1981/02/20 | 1600 | 300.0 | 30 | 1981-02-20 |
2 | 7521 | WARD | SALESMAN | 7698.0 | 1981/02/22 | 1250 | 500.0 | 30 | 1981-02-22 |
3 | 7566 | JONES | MANAGER | 7839.0 | 1981/04/02 | 2975 | NaN | 20 | 1981-04-02 |
4 | 7654 | MARTIN | SALESMAN | 7698.0 | 1981/09/28 | 1250 | 1400.0 | 30 | 1981-09-28 |
[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트
반응형
댓글