[문자변수 생성] 신규 변수 생성 - 문자 추출과 문자 결합

포스팅 목차

129. Display those employees whose first 2 characters from hire date-last 2 characters of salary?

* 입사일자에서 첫 2글자와 급여의 마지막 2글자로 구성된 신규 변수를 생성하시오.

파이썬 & R 패키지 호출 및 예제 데이터 생성 링크
substr 함수 : 69 / 129 / 189
concat 함수 : 68 / 129 / 142 / 181
[변수 생성] 신규 변수 생성 - 문자 추출과 문자 결합
데이터 전처리 - SQL, Pandas, R Prog, Dplyr, Dfply, SQLDF, PANDASQL, DATA.TABLE, SAS, Proc Sql

Oracle : substr(), to_char(), 문자 결합 연산자('||')
파이썬 Pandas : str.slice(), 문자 결합 연산자('+'), astype(str)
R 프로그래밍 : paste(), paste0(), base::substr(), stringr::str_sub()
R Dplyr Package : dplyr::mutate(), stringi::stri_join(), stringr::str_c(), paste(), paste0(), stringr::str_sub()
R sqldf Package : typeof() 변수 형태 확인, datetime(), substr(), strftime(), cast(), 문자 결합 연산자('||')
Python pandasql Package : typeof() 변수 형태 확인, datetime(), substr(), strftime(), cast(), 문자 결합 연산자('||')
R data.table Package : stringi::stri_join(), stringr::str_c(), paste(), paste0(), stringr::str_sub()
SAS Proc SQL : put(), strip(), substr(), reverse(), prxchange() 정규식 함수, 문자 결합 연산자('||'), cat 함수, cats 함수
SAS Data Step : put(), strip(), substr(), reverse(), prxchange() 정규식 함수, 문자 결합 연산자('||'), cat 함수, cats 함수
Python Dfply Package : .str.slice(), astype(str), 문자 결합 연산자('+')
파이썬 Base 프로그래밍 :

1. Oracle(오라클)

파이프 연산자를 사용하여서 입사일자의 첫 2글자와 급여의 마지막 2글자를 연결하여서 신규 문자 변수를 생성한다.

Oracle Programming

select ename,hiredate,sal,
       SUBSTR(to_char(hiredate,'yyyymmdd'),1,2)||ENAME||substr(sal,-2,2) as ename_str
from   emp

2. Python Pandas(파이썬)

Python Programming

import copy

withmooc = copy.copy(emp)
withmooc

withmooc['ename_string'] = withmooc['hiredate'].str.slice(0,2) + withmooc['ename'] + withmooc['sal'].astype(str).str.slice(-2,)
display(withmooc.head())

Results

	empno	ename	job	mgr	hiredate	sal	comm	deptno	ename_string
0	7369	SMITH	CLERK	7902.0	1980/12/17	800	NaN	20	19SMITH00
1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	19ALLEN00
2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	19WARD50
3	7566	JONES	MANAGER	7839.0	1981/04/02	2975	NaN	20	19JONES75
4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	19MARTIN50

3. R Programming (R Package)

R Programming

%%R

withmooc <- emp

withmooc['str_combined_1'] <- paste(base::substr(withmooc$hiredate,1,2),withmooc$ename,stringr::str_sub(withmooc$sal,-2) ,sep="")
withmooc['str_combined_2'] <- paste0(base::substr(withmooc$hiredate,1,2),'',withmooc$ename,'', stringr::str_sub(withmooc$sal,-2))
head(withmooc)

Results

# A tibble: 6 x 10
  empno ename  job        mgr hiredate     sal  comm deptno str_combined_1 str_combined_2
  <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl> <chr>          <chr>         
1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20 19SMITH00      19SMITH00     
2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30 19ALLEN00      19ALLEN00     
3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30 19WARD50       19WARD50      
4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20 19JONES75      19JONES75     
5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30 19MARTIN50     19MARTIN50    
6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30 19BLAKE50      19BLAKE50

4. R Dplyr Package

R Programming

%%R

library(stringi)

emp %>% 
  dplyr::mutate(str_combined_2 = stringi::stri_join(stringr::str_sub(withmooc$hiredate,1,2),ename,stringr::str_sub(withmooc$sal,-2),sep="")) %>%
  dplyr::mutate(str_combined_3 = stringr::str_c(stringr::str_sub(withmooc$hiredate,1,2),ename,stringr::str_sub(withmooc$sal,-2),sep="")) %>%
  dplyr::mutate(str_combined_4 = base::paste(stringr::str_sub(withmooc$hiredate,1,2),ename,stringr::str_sub(withmooc$sal,-2),sep="")) %>%
  dplyr::mutate(str_combined_5 = base::paste0(stringr::str_sub(withmooc$hiredate,1,2),"",ename,"",stringr::str_sub(withmooc$sal,-2))) %>%
  head()

Results

# A tibble: 6 x 12
  empno ename  job        mgr hiredate     sal  comm deptno str_combined_2 str_combined_3 str_combined_4 str_combined_5
  <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl> <chr>          <chr>          <chr>          <chr>         
1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20 19SMITH00      19SMITH00      19SMITH00      19SMITH00     
2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30 19ALLEN00      19ALLEN00      19ALLEN00      19ALLEN00     
3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30 19WARD50       19WARD50       19WARD50       19WARD50      
4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20 19JONES75      19JONES75      19JONES75      19JONES75     
5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30 19MARTIN50     19MARTIN50     19MARTIN50     19MARTIN50    
6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30 19BLAKE50      19BLAKE50      19BLAKE50      19BLAKE50

5. R sqldf Package

현재 hiredate가 “real” 형태로 저장되어 있음. / sqlite에는 날짜형이 존재 안 함 (129번 sqldf)

R Programming

%%R

sqldf(" select hiredate, 
               typeof(hiredate) hire_type,

               datetime(hiredate * 3600 * 24,'unixepoch') real_to_datetext,
               typeof(datetime(hiredate * 3600 * 24,'unixepoch')) type2,

               substr(datetime(hiredate * 3600 * 24,'unixepoch'),1,2) k,

               strftime('%Y/%M/%d', datetime(hiredate * 3600 * 24,'unixepoch')) dateformat_chagne,
               strftime('%Y-%M-%d', hiredate) date_error,

               substr(datetime(hiredate * 3600 * 24,'unixepoch'),1,2)||ENAME||substr(cast(sal as int),-2,2) ename_1 
        from emp") %>% head(10)

Results

     hiredate hire_type    real_to_datetext type2  k dateformat_chagne date_error    ename_1
1  1980-12-17      real 1980-12-17 00:00:00  text 19        1980/00/17 -470-00-09  19SMITH00
2  1981-02-20      real 1981-02-20 00:00:00  text 19        1981/00/20 -470-00-13  19ALLEN00
3  1981-02-22      real 1981-02-22 00:00:00  text 19        1981/00/22 -470-00-15   19WARD50
4  1981-04-02      real 1981-04-02 00:00:00  text 19        1981/00/02 -470-00-23  19JONES75
5  1981-09-28      real 1981-09-28 00:00:00  text 19        1981/00/28 -470-00-21 19MARTIN50
6  1981-03-01      real 1981-03-01 00:00:00  text 19        1981/00/01 -470-00-22  19BLAKE50
7  1981-01-09      real 1981-01-09 00:00:00  text 19        1981/00/09 -470-00-02  19CLARK50
8  1982-12-09      real 1982-12-09 00:00:00  text 19        1982/00/09 -470-00-01  19SCOTT00
9  1981-11-17      real 1981-11-17 00:00:00  text 19        1981/00/17 -470-00-10   19KING00
10 1981-09-08      real 1981-09-08 00:00:00  text 19        1981/00/08 -470-00-01 19TURNER00

6. Python pandasql Package

Python Programming

ps.sqldf(" select hiredate, sal, \
                  substr(datetime(hiredate * 3600 * 24,'unixepoch'),1,2)||ENAME||substr(cast(sal as int),-2,2) ename_1 \
           from emp ").head()

Results

	hiredate	sal	ename_1
0	1980/12/17	800	19SMITH00
1	1981/02/20	1600	19ALLEN00
2	1981/02/22	1250	19WARD50
3	1981/04/02	2975	19JONES75
4	1981/09/28	1250	19MARTIN50

7. R data.table Package

R Programming

%%R

DT          <- data.table(emp)
dept_DT     <- data.table(dept)
salgrade_DT <- data.table(salgrade)


DT[, c('str_combined_2','str_combined_3','str_combined_4','str_combined_5') 
            := list( stringi::stri_join(stringr::str_sub(hiredate,1,2),ename,stringr::str_sub(sal,-2),sep="") ,
                      stringr::str_c(stringr::str_sub(hiredate,1,2),ename,stringr::str_sub(sal,-2),sep=""),
                      base::paste(stringr::str_sub(hiredate,1,2),ename,stringr::str_sub(sal,-2),sep=""),
                      base::paste0(stringr::str_sub(hiredate,1,2),"",ename,"",stringr::str_sub(sal,-2))
                   ) ]

DT

Results

    empno  ename       job  mgr   hiredate  sal comm deptno str_combined_2 str_combined_3 str_combined_4 str_combined_5
 1:  7369  SMITH     CLERK 7902 1980-12-17  800   NA     20      19SMITH00      19SMITH00      19SMITH00      19SMITH00
 2:  7499  ALLEN  SALESMAN 7698 1981-02-20 1600  300     30      19ALLEN00      19ALLEN00      19ALLEN00      19ALLEN00
 3:  7521   WARD  SALESMAN 7698 1981-02-22 1250  500     30       19WARD50       19WARD50       19WARD50       19WARD50
 4:  7566  JONES   MANAGER 7839 1981-04-02 2975   NA     20      19JONES75      19JONES75      19JONES75      19JONES75
 5:  7654 MARTIN  SALESMAN 7698 1981-09-28 1250 1400     30     19MARTIN50     19MARTIN50     19MARTIN50     19MARTIN50
 6:  7698  BLAKE   MANAGER 7839 1981-03-01 2850   NA     30      19BLAKE50      19BLAKE50      19BLAKE50      19BLAKE50
 7:  7782  CLARK   MANAGER 7839 1981-01-09 2450   NA     10      19CLARK50      19CLARK50      19CLARK50      19CLARK50
 8:  7788  SCOTT   ANALYST 7566 1982-12-09 3000   NA     20      19SCOTT00      19SCOTT00      19SCOTT00      19SCOTT00
 9:  7839   KING PRESIDENT   NA 1981-11-17 5000   NA     10       19KING00       19KING00       19KING00       19KING00
10:  7844 TURNER  SALESMAN 7698 1981-09-08 1500    0     30     19TURNER00     19TURNER00     19TURNER00     19TURNER00
11:  7876  ADAMS     CLERK 7788 1983-01-12 1100   NA     20      19ADAMS00      19ADAMS00      19ADAMS00      19ADAMS00
12:  7900  JAMES     CLERK 7698 1981-12-03  950   NA     30      19JAMES50      19JAMES50      19JAMES50      19JAMES50
13:  7902   FORD   ANALYST 7566 1981-12-03 3000   NA     20       19FORD00       19FORD00       19FORD00       19FORD00
14:  7934 MILLER     CLERK 7782 1982-01-23 1300   NA     10     19MILLER00     19MILLER00     19MILLER00     19MILLER00

8. SAS Proc SQL

SAS Programming

%%SAS sas

PROC SQL;
  CREATE TABLE STATSAS_1 AS
    select ename,
           put(hiredate, yymmddn.)                                       as head_base,
           strip(put(sal,5.))                                            as tail_base,
           substr( strip(put(sal,5.)) ,length(strip(put(sal,5.))) -1,2 ) as last_char_1,
           reverse( substr( reverse( strip(put(sal,5.)) ),1,2) )         as last_char_2,
           prxchange('s/.*(\d{2})/$1/',-1, strip(put(sal,5.))  )         as last_char_3,

           SUBSTR( calculated head_base , 1,2)||compress(ENAME)|| calculated last_char_1 as concat_str format=$12.

    from emp;
QUIT;
PROC PRINT data=STATSAS_1(obs=5);RUN;

Results

OBS	ename	head_base	tail_base	last_char_1	last_char_2	last_char_3	concat_str
1	SMITH	19801217	800	00	00	00	19SMITH00
2	ALLEN	19810220	1600	00	00	00	19ALLEN00
3	WARD	19810222	1250	50	50	50	19WARD50
4	JONES	19810402	2975	75	75	75	19JONES75
5	MARTIN	19810928	1250	50	50	50	19MARTIN50

9. SAS Data Step

SAS Programming

%%SAS sas

DATA STATSAS_2; 
 SET emp; 
     head_base   = put(hiredate, yymmddn.);
     tail_base   = strip(put(sal,5.));
     last_char_1 = substr( strip(put(sal,5.)) ,length(strip(put(sal,5.))) -1,2 );
     last_char_2 = reverse( substr( reverse( strip(put(sal,5.)) ),1,2) );
     last_char_3 = prxchange('s/.*(\d{2})/$1/',-1, strip(put(sal,5.))  );

     concat_str = SUBSTR( head_base , 1,2)||compress(ENAME)|| last_char_1;
     keep empno head_base tail_base last_: concat_str;
RUN;

PROC PRINT Data=STATSAS_2(Obs=5);RUN;

Results

OBS	empno	head_base	tail_base	last_char_1	last_char_2	last_char_3	concat_str
1	7369	19801217	800	00	00	00	19SMITH00
2	7499	19810220	1600	00	00	00	19ALLEN00
3	7521	19810222	1250	50	50	50	19WARD50
4	7566	19810402	2975	75	75	75	19JONES75
5	7654	19810928	1250	50	50	50	19MARTIN50

10. Python Dfply Package

Python Programming

emp >> mutate( ename_string = X.hiredate.str.slice(0,2) + X.ename + X.sal.astype(str).str.slice(-2,) ) >> \
  head()

Results

	empno	ename	job	mgr	hiredate	sal	comm	deptno	ename_string
0	7369	SMITH	CLERK	7902.0	1980/12/17	800	NaN	20	19SMITH00
1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	19ALLEN00
2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	19WARD50
3	7566	JONES	MANAGER	7839.0	1981/04/02	2975	NaN	20	19JONES75
4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	19MARTIN50

[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] SQL EMP 예제로 만나는 테이블 데이터 전처리 방법 리스트

저작자표시 비영리 변경금지

'통계프로그램 비교 시리즈 > 프로그래밍비교(Oracle,Python,R,SAS)' 카테고리의 다른 글

[비상관 서브쿼리(In)] 특정 조건을 만족하는 데이터 추출 - 131 (오라클 SQL, R, Python, SAS) (0)	2022.12.18
[데이터 타입 변환] 특정 조건을 만족하는 데이터 추출 - 문자 추출, 문자형 수치를 수치형으로 변환 - 130 (오라클 SQL, R, Python, SAS) (0)	2022.12.17
[데이터 추출] 특정 조건을 만족하는 데이터 추출 - 비상관 서브쿼리(In), 중복 제거(Distinct) - 128 (오라클 SQL, R, Python, SAS) (0)	2022.12.16
[문자 함수] 문자열에서 문자 검색 - 문자 검색 함수(Instr 함수) - 127 (오라클 SQL, R, Python, SAS) (0)	2022.12.16
[날짜 함수] 특정 날짜에 해당하는 데이터 추출 - 영문월(Month) 처리, 날자형 데이터 날짜형 문자로 변환 - 126 (오라클 SQL, R, Python, SAS) (0)	2022.12.15

[문자변수 생성] 신규 변수 생성 - 문자 추출과 문자 결합 - 129 (오라클 SQL, R, Python, SAS)

129. Display those employees whose first 2 characters from hire date-last 2 characters of salary?

1. Oracle(오라클)

2. Python Pandas(파이썬)

3. R Programming (R Package)

4. R Dplyr Package

5. R sqldf Package

6. Python pandasql Package

7. R data.table Package

8. SAS Proc SQL

9. SAS Data Step

10. Python Dfply Package

'통계프로그램 비교 시리즈 > 프로그래밍비교(Oracle,Python,R,SAS)' 카테고리의 다른 글

댓글

티스토리툴바

[문자변수 생성] 신규 변수 생성 - 문자 추출과 문자 결합 - 129 (오라클 SQL, R, Python, SAS)

129. Display those employees whose first 2 characters from hire date-last 2 characters of salary?

1. Oracle(오라클)

2. Python Pandas(파이썬)

3. R Programming (R Package)

4. R Dplyr Package

5. R sqldf Package

6. Python pandasql Package

7. R data.table Package

8. SAS Proc SQL

9. SAS Data Step

10. Python Dfply Package

'통계프로그램 비교 시리즈 > 프로그래밍비교(Oracle,Python,R,SAS)' 카테고리의 다른 글

관련글

댓글

티스토리툴바