본문 바로가기
통계프로그램 비교 시리즈/프로그래밍비교(Oracle,Python,R,SAS)

[데이터 선택 - 날자 기간 차이] 근속 기간 계산 후 데이터 필터 - 12 (오라클 SQL, R, Python, SAS)

by 기서무나구물 2021. 8. 5.

포스팅 목차

     

    12. Display the names of employees who are working in the company for the past 5 years.

     

    * 최근 5년 이상 근무한 직원의 이름을 표시하시오


    • Oracle : sysdate
    • 파이썬 Pandas : 날짜 형변환, pd.to_datetime(), datetime.strptime, relativedelta.relativedelta, now.date(), datetime.datetime.utcnow(), pd.Timestamp.now(), .query 구문
    • R 프로그래밍 : Sys.Date(), difftime 함수, as.numeric
    • R Dplyr Package : filter 구문, Sys.Date()
    • R sqldf Package : 날짜 인수, sprintf, Sys.Date(), strftime 함수, 파라미터 처리, gsubfn 패키지(gsubfn::fn$sqldf) 외부 변수 참조
    • Python pandasql Package : format 구문, 반복구문 처리, 형변환(cast 함수), 날짜 형변환
    • R data.table Package : Sys.Date()
    • SAS Proc SQL : TODAY() 함수, datdif 함수
    • SAS Data Step : TODAY() 함수, datdif 함수
    • Python Dfply Package : filter_by, 사용자함수 정의, pd.Timestamp.now(), to_datetime 함수, make_symbolic 함수
    • 파이썬 Base 프로그래밍 :

     


    1. 오라클(Oracle)

     

    Oracle Programming
    select ename from emp 
    where  sysdate-hiredate > 39*365;

     


    2. 파이썬(Pandas)

    • dateutil 패키지를 사용하여 처리
    Python Programming
    from datetime import datetime   # 날짜모듈 
    
    # 날짜 처리 예시
    date1 = datetime.strptime(str('2013-08-15 12:00:00'), '%Y-%m-%d %H:%M:%S')
    
    date2 = datetime.strptime(str('2012-02-15'), '%Y-%m-%d')
    
    from dateutil import relativedelta
    
    # date1과 date2 날짜 사이의 기간을 계산한다.
    r = relativedelta.relativedelta(date1, date2)
    print(r)
    print(r.days)

     

    Results
    relativedelta(years=+1, months=+6, hours=+12)
    0

     


    to_datetime 함수를 사용하여서 날짜형 데이터로 변수 형변환

    Python Programming
    pd.to_datetime(emp['hiredate'], format='%Y-%m-%d %H:%M:%S', errors='raise')

     

    Results
    0    1980-12-17
    1    1981-02-20
    2    1981-02-22
    3    1981-04-02
    4    1981-09-28
    5    1981-03-01
    6    1981-01-09
    7    1982-12-09
    8    1981-11-17
    9    1981-09-08
    10   1983-01-12
    11   1981-12-03
    12   1981-12-03
    13   1982-01-23
    Name: hiredate, dtype: datetime64[ns]

     


     

    Python Programming
    import datetime
    import dateutil
    
    # Get the current date
    now = datetime.datetime.utcnow()
    now = now.date()
    
    # Get the difference between the current date and the birthday
    age = dateutil.relativedelta.relativedelta(now, pd.to_datetime(emp['hiredate'][0], format='%Y %m %d'))
    age = age.years
    age

     

    Results
    39

     


    dateutil의 relativedelta : 현재 날짜와 입사일(hiredate) 사이의 GAP을 계산하여서 39년 이상인 데이터 추출

    Python Programming
    from dateutil import relativedelta
    
    emp[emp.apply(lambda row: relativedelta.relativedelta(pd.Timestamp.now(), 
                                                           pd.to_datetime(row['hiredate'], format='%Y %m %d')).years, axis=1) >= 39][["ename"]]

     

    Results
      ename
    0 SMITH
    1 ALLEN
    2 WARD
    3 JONES
    5 BLAKE
    6 CLARK
    9 TURNER

     


    두 날짜 사이의 GAP을 계산 후 39년 이상 근무한 직원을 선택한다.

    Python Programming
    emp[ (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days > 39*365 ][["ename"]]

     

    Results
      ename
    0 SMITH
    1 ALLEN
    2 WARD
    3 JONES
    5 BLAKE
    6 CLARK
    9 TURNER

     


     

    Python Programming
    emp.loc[((pd.Timestamp.now() - pd.to_datetime(emp.hiredate)).dt.days > 39 *365 ) , ["ename"]]

     

    Results
      ename
    0 SMITH
    1 ALLEN
    2 WARD
    3 JONES
    5 BLAKE
    6 CLARK
    9 TURNER

    • [참고] Pandas의 query 구문 사용 예제
    Python Programming
    start_date, end_date = '1980/01/02', '1981/06/04'
    
    emp.query('hiredate >= @start_date and hiredate <= @end_date')

     

    Results
      empno ename job mgr hiredate sal comm deptno hiredate_D
    0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 1980-12-17
    1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1981-02-20
    2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 1981-02-22
    3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1981-04-02
    5 7698 BLAKE MANAGER 7839.0 1981/03/01 2850 NaN 30 1981-03-01
    6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10 1981-01-09

     


    • pd.Timestamp를 사용하여서 문자 데이터를 날짜로 변경
    Python Programming
    ts = pd.Timestamp
    
    emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
    
    emp.query('hiredate_D >= @ts("19820201T071320")')

     

    Results
      empno ename job mgr hiredate sal comm deptno hiredate_D
    7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20 1982-12-09
    10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20 1983-01-12

     


    • hiredate 변수를 날짜형으로 변경하여서 신규 변수 생성. 생성된 변수와 현재 날짜 비교
    • query 상에서 현재 날짜와 hiredate 사이의 비교(-) 처리를 못하여서 현재 날짜를 기준으로 - 39년 날짜 계산 후 처리
    Python Programming
    from dateutil.relativedelta import relativedelta
    
    emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
    
    now = datetime.datetime.now() - relativedelta(years=39)
    
    emp.query("@now > hiredate_D", engine='python')[["ename"]]

     

    Results
      ename
    0 SMITH
    1 ALLEN
    2 WARD
    3 JONES
    5 BLAKE
    6 CLARK
    9 TURNER

     


    • pd.to_datetime를 적용하여서 hiredate를 날짜형으로 변경 후 현재 날짜를 기준으로 -39년 날짜와 비교
    Python Programming
    from dateutil.relativedelta import relativedelta
    
    ts_chg = pd.to_datetime
    
    base_dt = datetime.datetime.now() - relativedelta(years=39)
    
    emp.query('@base_dt > @ts_chg(hiredate)')[["ename"]]

     

    Results
      ename
    0 SMITH
    1 ALLEN
    2 WARD
    3 JONES
    5 BLAKE
    6 CLARK
    9 TURNER

     


    3. R Programming (R Package)

    • difftime 함수를 사용하여서 현재 날짜와 입사일 사이의 기간을 계산
    R Programming
    %%R
    
    difftime(Sys.Date(), emp$hiredate) / 365

     

    Results
    Time differences in days
     [1] 39.75342 39.57534 39.56986 39.46301 38.97260 39.55068 39.69041 37.77534
     [9] 38.83562 39.02740 37.68219 38.79178 38.79178 38.65205

     


     

    R Programming
    %%R
    
    as.numeric(Sys.Date() - as.Date(emp$hiredate)) / 365

     

    Results
     [1] 39.75342 39.57534 39.56986 39.46301 38.97260 39.55068 39.69041 37.77534
     [9] 38.83562 39.02740 37.68219 38.79178 38.79178 38.65205

     


     

    R Programming
    %%R
    emp[Sys.Date() - emp$hiredate > 39*365 , c("empno","ename") ]

     

    Results
    # A tibble: 7 x 2
      empno ename 
      <dbl> <chr> 
    1  7369 SMITH 
    2  7499 ALLEN 
    3  7521 WARD  
    4  7566 JONES 
    5  7698 BLAKE 
    6  7782 CLARK 
    7  7844 TURNER

     


    4. R Dplyr Package

    *현재 날짜와 입사일 사이의 기간을 계산 후 filter 구문을 사용하여서 39년 이상 근무한 직원을 선택한다.

    R Programming
    %%R
    emp %>% filter( (Sys.Date() - hiredate) > 39*365 ) %>% dplyr::select(ename)

     

    Results
    # A tibble: 7 x 1
      ename 
      <chr> 
    1 SMITH 
    2 ALLEN 
    3 WARD  
    4 JONES 
    5 BLAKE 
    6 CLARK 
    7 TURNER

     


    5. R sqldf Package

    • 현재날짜를 인수로 지정하여서 출력
    R Programming
    %%R
    
    sprintf("select * from DF where a >= %d", Sys.Date())

     

    Results
    [1] "select * from DF where a >= 18580"

     


    • 위에서 산출된 현재 날짜를 sqldf 쿼리에 직접 입력하여 처리(수작업)
    R Programming
    %%R
    require(sqldf)
    sqldf("select ename,
                  strftime('%Y-%m-%d', 'now') now_day,
                  hiredate, 
                  (strftime('%d-%m-%Y', 'now') - hiredate) Gap,
                  (hiredate - strftime('%d-%m-%Y', 'now')) Gap_1
           from emp
           where (18580 - hiredate) > 39*365")

     

    Results
       ename    now_day   hiredate   Gap Gap_1
    1  SMITH 2020-11-13 1980-12-17 -3990  3990
    2  ALLEN 2020-11-13 1981-02-20 -4055  4055
    3   WARD 2020-11-13 1981-02-22 -4057  4057
    4  JONES 2020-11-13 1981-04-02 -4096  4096
    5  BLAKE 2020-11-13 1981-03-01 -4064  4064
    6  CLARK 2020-11-13 1981-01-09 -4013  4013
    7 TURNER 2020-11-13 1981-09-08 -4255  4255

     


    • [참고] gsubfn 패키지 fn$ 을 사용하여서 sqldf에 파이썬의 함수와 파라미터의 값을 전달 방식
    R Programming
    # %%R
    # 파라미터 처리
    # p1 <- 7
    # fn$sqldf("select * from iris where Sepal_Length > $p1")

     

    R Programming
    emp

     

    Results
      empno ename job mgr hiredate sal comm deptno
    0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20
    1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30
    2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30
    3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20
    4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30
    5 7698 BLAKE MANAGER 7839.0 1981/03/01 2850 NaN 30
    6 7782 CLARK MANAGER 7839.0 1981/01/09 2450 NaN 10
    7 7788 SCOTT ANALYST 7566.0 1982/12/09 3000 NaN 20
    8 7839 KING PRESIDENT NaN 1981/11/17 5000 NaN 10
    9 7844 TURNER SALESMAN 7698.0 1981/09/08 1500 0.0 30
    10 7876 ADAMS CLERK 7788.0 1983/01/12 1100 NaN 20
    11 7900 JAMES CLERK 7698.0 1981/12/03 950 NaN 30
    12 7902 FORD ANALYST 7566.0 1981/12/03 3000 NaN 20
    13 7934 MILLER CLERK 7782.0 1982/01/23 1300 NaN 10

     


    • Sys.Date() 를 fn$sqldf 기능과 함께 사용
    • [참고] strftime('%d-%m-%Y', 'now') : 현재 날짜를 문자로 반환

     

    R Programming
    %%R
    
    gsubfn::fn$sqldf("select *, strftime('%d-%m-%Y', 'now') now_char from emp where (`Sys.Date()` - hiredate) > 39 * 365  ")

     

    Results
      empno  ename      job  mgr   hiredate  sal comm deptno   now_char
    1  7369  SMITH    CLERK 7902 1980-12-17  800   NA     20 09-09-2020
    2  7499  ALLEN SALESMAN 7698 1981-02-20 1600  300     30 09-09-2020
    3  7521   WARD SALESMAN 7698 1981-02-22 1250  500     30 09-09-2020
    4  7566  JONES  MANAGER 7839 1981-04-02 2975   NA     20 09-09-2020
    5  7698  BLAKE  MANAGER 7839 1981-03-01 2850   NA     30 09-09-2020
    6  7782  CLARK  MANAGER 7839 1981-01-09 2450   NA     10 09-09-2020
    7  7844 TURNER SALESMAN 7698 1981-09-08 1500    0     30 09-09-2020

     


    6. Python pandasql Package

    [참고] format 구문

    Python Programming
    import pandasql as ps
    a = 'v2'
    b = 'v10'
    
    q = "SELECT {} FROM df1 ".format(a+','+b)
    q

     

    Results
    'SELECT v2,v10 FROM df1 '

     


    • [참고] format 구문
    Python Programming
    import pandasql as ps
    var = 'hiredate'
    
    query = "SELECT {} FROM emp ".format(var)
    ps.sqldf(query, globals())

     

    Results
      hiredate
    0 1980/12/17
    1 1981/02/20
    2 1981/02/22
    3 1981/04/02
    4 1981/09/28
    5 1981/03/01
    6 1981/01/09
    7 1982/12/09
    8 1981/11/17
    9 1981/09/08
    10 1983/01/12
    11 1981/12/03
    12 1981/12/03
    13 1982/01/23

     


    • 반복구문 처리
    Python Programming
    import pandasql as ps
    
    test = ['''CLERK''',"""SALESMAN"""]
    for i in test:
        query = """SELECT sal FROM emp where job = '""" + i + """'"""
        print(ps.sqldf(query, globals()))
        print()

     

    Results
        sal
    0   800
    1  1100
    2   950
    3  1300
    
        sal
    0  1600
    1  1250
    2  1250
    3  1500

     


    • [참고] cast 함수 : 형변환 함수
    Python Programming
    # emp.dtypes

     

    Python Programming
    import pandasql as ps
    
    query = """SELECT cast(hiredate as integer) FROM emp """
    
    ps.sqldf(query, globals()).head()

     

    Results
      cast(hiredate as integer)
    0 1980
    1 1981
    2 1981
    3 1981
    4 1981

     


    • pandasql 연습
    Python Programming
    import datetime
    
    nowdate= datetime.datetime.now()
    format= "%Y-%m-%d" 
    
    import time
    
    emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
    emp['to_date']    = pd.Timestamp.now()
    emp['day_gap']    = (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days
    
    query = """SELECT to_date          현재날짜1,
                      date('now')      현재날짜2,
                      datetime('now')  현재날짜3,
                      typeof(hiredate) 변수Type확인,
                      sqlite_version() sqlite버젼확인,                  
                      date('2020-09-10', 'start of month') 초일자,
    
                      CAST(STRFTIME('%Y %m %d', hiredate_D) AS date) cast함수, 
                      STRFTIME('%Y %m %d', hiredate_D) 문자반환,
                      CAST(hiredate AS date) x2,
                      (cast(strftime('%s',hiredate_D) as interger))/(24*60*60) x3
               FROM emp
               WHERE day_gap > 39*365"""
    # print(query)
    query = ps.sqldf(query, globals())
    
    query.head()

     

    Results
      현재날짜 현재날짜2 현재날짜3 변수Type확인 sqlite버젼확인 초일자 cast함수 문자반환 x2 x3
    0 2020-09-10 09:45:49.454895 2020-09-10 2020-09-10 00:45:49 text 3.32.3 2020-09-01 1980 1980 12 17 1980 4003
    1 2020-09-10 09:45:49.454895 2020-09-10 2020-09-10 00:45:49 text 3.32.3 2020-09-01 1981 1981 02 20 1981 4068
    2 2020-09-10 09:45:49.454895 2020-09-10 2020-09-10 00:45:49 text 3.32.3 2020-09-01 1981 1981 02 22 1981 4070
    3 2020-09-10 09:45:49.454895 2020-09-10 2020-09-10 00:45:49 text 3.32.3 2020-09-01 1981 1981 04 02 1981 4109
    4 2020-09-10 09:45:49.454895 2020-09-10 2020-09-10 00:45:49 text 3.32.3 2020-09-01 1981 1981 03 01 1981 4077

     


    • 데이터에 날짜형 변수를 직접 생성 후 처리
    Python Programming
    import datetime
    
    nowdate= datetime.datetime.now()
    format= "%Y-%m-%d" 
    
    import time
    
    emp['hiredate_D'] = pd.to_datetime(emp['hiredate'])
    emp['to_date']    = pd.Timestamp.now()
    emp['day_gap']    = (pd.Timestamp.now() - pd.to_datetime(emp['hiredate'])).dt.days
    
    query = """SELECT to_date,
                      hiredate_D,
                      day_gap,
                      ename
               FROM emp
               WHERE  day_gap > 39*365"""
    ps.sqldf(query, globals())

     

    Results
      to_date hiredate_D day_gap ename
    0 2020-09-10 09:39:10.707798 1980-12-17 00:00:00.000000 14512 SMITH
    1 2020-09-10 09:39:10.707798 1981-02-20 00:00:00.000000 14447 ALLEN
    2 2020-09-10 09:39:10.707798 1981-02-22 00:00:00.000000 14445 WARD
    3 2020-09-10 09:39:10.707798 1981-04-02 00:00:00.000000 14406 JONES
    4 2020-09-10 09:39:10.707798 1981-03-01 00:00:00.000000 14438 BLAKE
    5 2020-09-10 09:39:10.707798 1981-01-09 00:00:00.000000 14489 CLARK
    6 2020-09-10 09:39:10.707798 1981-09-08 00:00:00.000000 14247 TURNER

     


    • 문자형 날짜 데이터를 날짜로 변경하여 처리
    Python Programming
    import datetime
    
    nowdate= datetime.datetime.now()
    format= "%Y-%m-%d" 
    
    import time
    
    query = """SELECT to_date,
                      date('now'),                  
                      ename,
    
                      (strftime('%s','now') -
    
                      strftime('%s',(substr(hiredate,1,4)||'-'||substr(hiredate,6,2)||'-'||substr(hiredate,9,2))))/60/60/24 second_gap
               FROM emp
               WHERE (strftime('%s','now') -                  
                      strftime('%s',(substr(hiredate,1,4)||'-'||substr(hiredate,6,2)||'-'||substr(hiredate,9,2))))/60/60/24 > 39*365"""
    # print(query)
    ps.sqldf(query, globals())

     

    Results
      to_date date('now') ename second_gap
    0 2020-09-10 09:45:49.454895 2020-09-10 SMITH 14512
    1 2020-09-10 09:45:49.454895 2020-09-10 ALLEN 14447
    2 2020-09-10 09:45:49.454895 2020-09-10 WARD 14445
    3 2020-09-10 09:45:49.454895 2020-09-10 JONES 14406
    4 2020-09-10 09:45:49.454895 2020-09-10 BLAKE 14438
    5 2020-09-10 09:45:49.454895 2020-09-10 CLARK 14489
    6 2020-09-10 09:45:49.454895 2020-09-10 TURNER 14247

     


    7. R data.table Package

     

    %%R
    
    DT          <- data.table(emp)
    dept_DT     <- data.table(dept)
    
    DT[(Sys.Date() - hiredate) > 39 * 365, .(ename)]

     

    Results
        ename
    1:  SMITH
    2:  ALLEN
    3:   WARD
    4:  JONES
    5: MARTIN
    6:  BLAKE
    7:  CLARK
    8: TURNER

     


    8. SAS Proc SQL

     

    %%SAS sas
    
    PROC SQL;
      CREATE TABLE STATSAS_1 AS
        SELECT ENAME,
               TODAY() - hiredate as day_gap,
               datdif(hiredate, TODAY(), 'act/act') as day_gap_1
        FROM   EMP
        WHERE  TODAY() - hiredate > 40*365;
    QUIT;
    PROC PRINT;RUN;

     

    Results
    OBS ename day_gap day_gap_1
    1 SMITH 14727 14727
    2 ALLEN 14662 14662
    3 WARD 14660 14660
    4 JONES 14621 14621
    5 BLAKE 14653 14653
    6 CLARK 14704 14704

     


    9. SAS Data Step

     

    %%SAS sas
    
    DATA STATSAS_2;
     SET EMP;
         To_day    = TODAY();
         Day_Gap   = TODAY() - hiredate;
         Day_gap_1 = datdif(hiredate, TODAY(), 'act/act');
         KEEP ENAME To_day Day_Gap Day_gap_1 HIREDATE;
         WHERE datdif(hiredate, TODAY(), 'act/act') > 40*365;
    RUN;
    
    PROC PRINT;RUN;

     

    Results
    OBS ename hiredate To_day Day_Gap Day_gap_1
    1 SMITH 1980-12-17 22383 14727 14727
    2 ALLEN 1981-02-20 22383 14662 14662
    3 WARD 1981-02-22 22383 14660 14660
    4 JONES 1981-04-02 22383 14621 14621
    5 BLAKE 1981-03-01 22383 14653 14653
    6 CLARK 1981-01-09 22383 14704 14704

     


    10. Python Dfply Package

    • 별도 함수 정의(make_symbolic)
    Python Programming
    @make_symbolic
    def to_datetime(series, infer_datetime_format=True):
        return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
    
    emp >> filter_by( (pd.Timestamp.now() - to_datetime(X.hiredate)).dt.days > 39*365 ) >> head()

     

    Results
      empno ename job mgr hiredate sal comm deptno
    0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20
    1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30
    2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30
    3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20
    4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30

     


     

    Python Programming
    emp >> \
      filter_by( (pd.Timestamp.now() - make_symbolic(to_datetime)(X.hiredate)).dt.days > 39*365 ) >> \
      head()

     

    Results
      empno ename job mgr hiredate sal comm deptno
    0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20
    1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30
    2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30
    3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20
    4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30

     


     

    Python Programming
    @make_symbolic
    def to_datetime(series, infer_datetime_format=True):
        return pd.to_datetime(series, infer_datetime_format=infer_datetime_format)
    
    emp >> mutate(pd_date=to_datetime(X.hiredate)) >> head()

     

    Results
      empno ename job mgr hiredate sal comm deptno pd_date
    0 7369 SMITH CLERK 7902.0 1980/12/17 800 NaN 20 1980-12-17
    1 7499 ALLEN SALESMAN 7698.0 1981/02/20 1600 300.0 30 1981-02-20
    2 7521 WARD SALESMAN 7698.0 1981/02/22 1250 500.0 30 1981-02-22
    3 7566 JONES MANAGER 7839.0 1981/04/02 2975 NaN 20 1981-04-02
    4 7654 MARTIN SALESMAN 7698.0 1981/09/28 1250 1400.0 30 1981-09-28

     


     

    [SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE]   SQL EMP 예제로 만나는 테이블 데이터 처리 방법 리스트

    반응형

    댓글