[Data Preprocessing - String Function Examples] Searching for a Specific Character in a String


70. Find the first occurrence of character a from the following string 'computer maintenance corporation'.

* Search the string 'computer maintenance corporation' for the character 'a' and return the position of its first occurrence.

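Link: loading the Python & R packages and creating the example data
Regular expressions could also handle this problem, but they are not used here (apart from the prxmatch() comparison in the SAS sections).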

[String functions] Searching for a specific character in a string

Oracle : instr() function
Python Pandas : .str.find()
R programming : unlist(), lapply(), function(x) user-defined function, survPen::instr(), strsplit(), coalesce(), which(), stringi::stri_locate_all(), stringi::stri_locate_first()
R Dplyr Package : coalesce(), stringi::stri_locate_first()
R sqldf Package : instr() function
Python pandasql Package : instr() function
R data.table Package : stringi::stri_locate_first()
SAS Proc SQL : find() function, index() function, prxmatch() function (regular-expression function)
SAS Data Step : find() function, index() function, prxmatch() function (regular-expression function)
Python Dfply Package : .str.find() function
Python base programming :

1. Oracle

Search the string 'computer maintenance corporation' for the character 'a' and return the position of its first occurrence.

Oracle Programming

select instr('computer maintenance corporation','a',1,1) str_instr 
from   dual
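
The four-argument call above passes the string, the search text, the start position, and the occurrence number. A minimal Python sketch of those semantics follows, assuming only positive start and occurrence values; the helper name instr is hypothetical and is not a library function.

```python
def instr(text: str, sub: str, start: int = 1, nth: int = 1) -> int:
    """Return the 1-based position of the nth occurrence of sub in text, or 0."""
    if start < 1 or nth < 1:
        raise ValueError("this sketch only supports start >= 1 and nth >= 1")
    pos = start - 1  # convert the 1-based start to Python's 0-based index
    for _ in range(nth):
        pos = text.find(sub, pos)
        if pos == -1:
            return 0  # fewer than nth occurrences were found
        pos += 1  # step past the match start; pos now equals the 1-based position
    return pos


s = "computer maintenance corporation"
print(instr(s, "a", 1, 1))  # 11
print(instr(s, "a", 1, 2))  # 17
```

In this sketch the first 'a' is the one in 'maintenance', at 1-based position 11.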

Search each employee name and return the position of the first occurrence of the character 'A'.

Oracle Programming

select instr(ename,'A') from emp

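2. Python Pandas

The find function searches each employee name and returns the position of the first occurrence of the character 'A'. Positions are 0-based, and -1 is returned when the character does not exist.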

Python Programming

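# Work on a copy so that the original emp DataFrame is left unchanged.
withmooc = emp.copy()

# str.find returns the 0-based index of the first 'A', or -1 when it is absent.
withmooc['ename_string'] = withmooc['ename'].str.find('A')
withmooc.head()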

Results

	empno	ename	job	mgr	hiredate	sal	comm	deptno	ename_string
0	7369	SMITH	CLERK	7902.0	1980/12/17	800	NaN	20	-1
1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	0
2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	1
3	7566	JONES	MANAGER	7839.0	1981/04/02	2975	NaN	20	-1
4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	1
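
The same lookup can be sketched in base Python without pandas. This is only an illustration: the names list below is a hypothetical stand-in for the ename column, copied from the five rows shown above.

```python
# Hypothetical sample: the first five employee names from emp.
names = ["SMITH", "ALLEN", "WARD", "JONES", "MARTIN"]

# str.find gives the 0-based index of the first 'A', or -1 when 'A' is absent.
zero_based = [name.find("A") for name in names]

# Adding 1 turns -1 into 0 and shifts matches to 1-based positions,
# which lines up with the convention used by Oracle's instr.
one_based = [pos + 1 for pos in zero_based]

print(zero_based)  # [-1, 0, 1, -1, 1]
print(one_based)   # [0, 1, 2, 0, 2]
```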

3. R Programming (R Package)

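stringr vs. base R functions : https://stringr.tidyverse.org/articles/from-base.html

The survPen::instr function searches each employee name and returns the position of the first occurrence of the character 'A'. Positions are 1-based, and 0 is returned when the character is not found.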

R Programming

%%R

withmooc <- emp

withmooc['ename_str'] = unlist(lapply(withmooc$ename, function(x) survPen::instr(x, 'A',1) ))

head(withmooc)

Results

# A tibble: 6 x 9
  empno ename  job        mgr hiredate     sal  comm deptno ename_str
  <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20         0
2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30         1
3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30         2
4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20         0
5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30         2
6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30         3

Split each employee name into single characters, compare each one with 'A', and return the position of the first match (0 when there is no match).

R Programming

%%R

withmooc <- emp

withmooc['ename_str'] = unlist(lapply(strsplit(withmooc$ename, ''), function(x) coalesce( which(x == 'A')[1],0) ))
head(withmooc)

Results

# A tibble: 6 x 9
  empno ename  job        mgr hiredate     sal  comm deptno ename_str
  <dbl> <chr>  <chr>    <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
1  7369 SMITH  CLERK     7902 1980-12-17   800    NA     20         0
2  7499 ALLEN  SALESMAN  7698 1981-02-20  1600   300     30         1
3  7521 WARD   SALESMAN  7698 1981-02-22  1250   500     30         2
4  7566 JONES  MANAGER   7839 1981-04-02  2975    NA     20         0
5  7654 MARTIN SALESMAN  7698 1981-09-28  1250  1400     30         2
6  7698 BLAKE  MANAGER   7839 1981-03-01  2850    NA     30         3
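
The character-by-character comparison above can be sketched in Python as follows. The helper name first_position and the short names list are assumptions made for this illustration.

```python
def first_position(text: str, target: str) -> int:
    """Return the 1-based position of the first character equal to target, else 0."""
    # enumerate(..., start=1) numbers the characters the way R's which() does.
    return next((i for i, ch in enumerate(text, start=1) if ch == target), 0)


# Hypothetical sample of employee names.
names = ["SMITH", "ALLEN", "WARD", "BLAKE"]
positions = [first_position(name, "A") for name in names]
print(positions)  # [0, 1, 2, 3]
```

The default value 0 passed to next() plays the role that coalesce(..., 0) plays in the R code.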

stringi::stri_locate_all (as of stringr version 1.0)

The stringi::stri_locate_all function searches the base string for the search text and returns the start and end positions of every match as a list of matrices.

R Programming

%%R
print(stringi::stri_locate_all("aab", fixed = "a"))
print(stringi::stri_locate_all("aab", fixed = "a")[[1]][,1])  # start positions of "a"

print(stringi::stri_locate_all("aab", fixed = "b")[[1]])
print(stringi::stri_locate_all("aab", fixed = "c",omit_no_match=FALSE)[[1]]) # omit_no_match: FALSE reports a no-match as an NA row, TRUE as an empty matrix

Results

[[1]]
     start end
[1,]     1   1
[2,]     2   2

[1] 1 2
     start end
[1,]     3   3
     start end
[1,]    NA  NA
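
To make the start/end output concrete, here is a rough Python sketch that collects every non-overlapping match of a fixed substring as 1-based (start, end) pairs. The function name locate_all is hypothetical, and unlike the R call with omit_no_match=FALSE it returns an empty list rather than an NA row when nothing matches.

```python
from typing import List, Tuple


def locate_all(text: str, sub: str) -> List[Tuple[int, int]]:
    """Return 1-based (start, end) pairs for each non-overlapping match of sub."""
    if not sub:
        raise ValueError("sub must be non-empty")
    matches = []
    pos = text.find(sub)
    while pos != -1:
        matches.append((pos + 1, pos + len(sub)))  # inclusive 1-based bounds
        pos = text.find(sub, pos + len(sub))       # continue after this match
    return matches


print(locate_all("aab", "a"))  # [(1, 1), (2, 2)]
print(locate_all("aab", "b"))  # [(3, 3)]
print(locate_all("aab", "c"))  # []
```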

Reference: the stringi::stri_locate_first function

R Programming

%%R

stringi::stri_locate_first(pattern = 'A', "bbbA", fixed = FALSE)

Results

     start end
[1,]     4   4

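The stringi::stri_locate_first function searches the base string for the search character (pattern) and returns the start and end positions of the first match as a two-column matrix. When the search character ('A') is not found, NA is returned, so the coalesce function is used to convert NA to 0. Note that the output listed below still shows NA in ename_str (type <int>), so it appears to have been captured before coalesce was applied; with coalesce in effect those rows would be 0, as in the dplyr section.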

R Programming

%%R

withmooc <- emp

withmooc['ename_str'] = lapply(withmooc['ename'], function(x) coalesce( stringi::stri_locate_first(pattern = 'A', x, fixed = FALSE)[,1], 0) ) 
withmooc

Results

# A tibble: 14 x 9
   empno ename  job         mgr hiredate     sal  comm deptno ename_str
   <dbl> <chr>  <chr>     <dbl> <date>     <dbl> <dbl>  <dbl>     <int>
 1  7369 SMITH  CLERK      7902 1980-12-17   800    NA     20        NA
 2  7499 ALLEN  SALESMAN   7698 1981-02-20  1600   300     30         1
 3  7521 WARD   SALESMAN   7698 1981-02-22  1250   500     30         2
 4  7566 JONES  MANAGER    7839 1981-04-02  2975    NA     20        NA
 5  7654 MARTIN SALESMAN   7698 1981-09-28  1250  1400     30         2
 6  7698 BLAKE  MANAGER    7839 1981-03-01  2850    NA     30         3
 7  7782 CLARK  MANAGER    7839 1981-01-09  2450    NA     10         3
 8  7788 SCOTT  ANALYST    7566 1982-12-09  3000    NA     20        NA
 9  7839 KING   PRESIDENT    NA 1981-11-17  5000    NA     10        NA
10  7844 TURNER SALESMAN   7698 1981-09-08  1500     0     30        NA
11  7876 ADAMS  CLERK      7788 1983-01-12  1100    NA     20         1
12  7900 JAMES  CLERK      7698 1981-12-03   950    NA     30         2
13  7902 FORD   ANALYST    7566 1981-12-03  3000    NA     20        NA
14  7934 MILLER CLERK      7782 1982-01-23  1300    NA     10        NA

4. R Dplyr Package

Use coalesce to replace NA values with 0.

R Programming

%%R

emp %>% 
  dplyr::mutate(ename_str = coalesce( stringi::stri_locate_first(pattern = 'A', ename, fixed = FALSE)[,1], 0))

Results

# A tibble: 14 x 9
   empno ename  job         mgr hiredate     sal  comm deptno ename_str
   <dbl> <chr>  <chr>     <dbl> <date>     <dbl> <dbl>  <dbl>     <dbl>
 1  7369 SMITH  CLERK      7902 1980-12-17   800    NA     20         0
 2  7499 ALLEN  SALESMAN   7698 1981-02-20  1600   300     30         1
 3  7521 WARD   SALESMAN   7698 1981-02-22  1250   500     30         2
 4  7566 JONES  MANAGER    7839 1981-04-02  2975    NA     20         0
 5  7654 MARTIN SALESMAN   7698 1981-09-28  1250  1400     30         2
 6  7698 BLAKE  MANAGER    7839 1981-03-01  2850    NA     30         3
 7  7782 CLARK  MANAGER    7839 1981-01-09  2450    NA     10         3
 8  7788 SCOTT  ANALYST    7566 1982-12-09  3000    NA     20         0
 9  7839 KING   PRESIDENT    NA 1981-11-17  5000    NA     10         0
10  7844 TURNER SALESMAN   7698 1981-09-08  1500     0     30         0
11  7876 ADAMS  CLERK      7788 1983-01-12  1100    NA     20         1
12  7900 JAMES  CLERK      7698 1981-12-03   950    NA     30         2
13  7902 FORD   ANALYST    7566 1981-12-03  3000    NA     20         0
14  7934 MILLER CLERK      7782 1982-01-23  1300    NA     10         0
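
The NA-to-0 step can be illustrated with a small Python sketch. Both helpers are hypothetical: coalesce here works on scalars rather than whole vectors as dplyr's does, and None stands in for NA.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


def locate_first(text: str, sub: str):
    """Return the 1-based start of the first match, or None (standing in for NA)."""
    pos = text.find(sub)
    return None if pos == -1 else pos + 1


# Hypothetical sample of employee names.
names = ["SMITH", "ALLEN", "WARD"]
raw = [locate_first(name, "A") for name in names]
filled = [coalesce(pos, 0) for pos in raw]

print(raw)     # [None, 1, 2]
print(filled)  # [0, 1, 2]
```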

5. R sqldf Package

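Comparison of SQL string functions across databases : https://en.wikibooks.org/wiki/SQL_Dialects_Reference/Functions_and_expressions/String_functions
SQLite's instr function takes only two arguments, so neither the search start position nor the n-th occurrence can be specified.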

R Programming

%%R

sqldf("select instr(ename,'A') from emp;")

Results

   instr(ename,'A')
1                 0
2                 1
3                 2
4                 0
5                 2
6                 3
7                 3
8                 0
9                 0
10                0
11                1
12                2
13                0
14                0
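
Because sqldf runs on SQLite by default, the two-argument instr can also be tried with Python's built-in sqlite3 module. The sketch below uses a hypothetical four-row emp table and shows one possible way to emulate a start position with substr; it is a workaround under those assumptions, not a built-in feature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text)")
conn.executemany(
    "insert into emp values (?)",
    [("SMITH",), ("ALLEN",), ("WARD",), ("ADAMS",)],  # hypothetical sample rows
)

# Plain two-argument instr: 1-based position, 0 when 'A' is absent.
first = conn.execute(
    "select ename, instr(ename, 'A') from emp order by rowid"
).fetchall()

# Emulated start position: search the tail of the string and add the offset back.
sql = """
select ename,
       case when instr(substr(ename, :start), 'A') > 0
            then instr(substr(ename, :start), 'A') + :start - 1
            else 0
       end
from emp
order by rowid
"""
from_second = conn.execute(sql, {"start": 2}).fetchall()
conn.close()

print(first)        # [('SMITH', 0), ('ALLEN', 1), ('WARD', 2), ('ADAMS', 1)]
print(from_second)  # [('SMITH', 0), ('ALLEN', 0), ('WARD', 2), ('ADAMS', 3)]
```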

6. Python pandasql Package

Search the employee name column ('ename') of the emp table for the character 'A' and return the position of its first occurrence.

Python Programming

ps.sqldf("select instr(ename,'A') from emp")

Results

	instr(ename,'A')
0	0
1	1
2	2
3	0
4	2
5	3
6	3
7	0
8	0
9	0
10	1
11	2
12	0
13	0

7. R data.table Package

R Programming

%%R

DT <- data.table(emp)

DT[,ename_str := stringi::stri_locate_first(pattern = 'A', ename, fixed = FALSE)[,1]]

Results

    empno  ename       job  mgr   hiredate  sal comm deptno ename_str
 1:  7369  SMITH     CLERK 7902 1980-12-17  800   NA     20        NA
 2:  7499  ALLEN  SALESMAN 7698 1981-02-20 1600  300     30         1
 3:  7521   WARD  SALESMAN 7698 1981-02-22 1250  500     30         2
 4:  7566  JONES   MANAGER 7839 1981-04-02 2975   NA     20        NA
 5:  7654 MARTIN  SALESMAN 7698 1981-09-28 1250 1400     30         2
 6:  7698  BLAKE   MANAGER 7839 1981-03-01 2850   NA     30         3
 7:  7782  CLARK   MANAGER 7839 1981-01-09 2450   NA     10         3
 8:  7788  SCOTT   ANALYST 7566 1982-12-09 3000   NA     20        NA
 9:  7839   KING PRESIDENT   NA 1981-11-17 5000   NA     10        NA
10:  7844 TURNER  SALESMAN 7698 1981-09-08 1500    0     30        NA
11:  7876  ADAMS     CLERK 7788 1983-01-12 1100   NA     20         1
12:  7900  JAMES     CLERK 7698 1981-12-03  950   NA     30         2
13:  7902   FORD   ANALYST 7566 1981-12-03 3000   NA     20        NA
14:  7934 MILLER     CLERK 7782 1982-01-23 1300   NA     10        NA

8. SAS Proc SQL

The find family, the index family, and the regular-expression function prxmatch are compared.

SAS Programming

%%SAS sas

PROC SQL;
  CREATE TABLE STATSAS_1 AS
    select ename,
           find(ename,'A')        as ename_find,
           index(ename,'A')       as ename_index,
           prxmatch('/A/', ename) as ename_prxmatch
    from   emp;
QUIT;
PROC PRINT data=STATSAS_1(obs=3);RUN;

Results

OBS	ename	ename_find	ename_index	ename_prxmatch
1	SMITH	0	0	0
2	ALLEN	1	1	1
3	WARD	2	2	2

9. SAS Data Step

SAS Programming

%%SAS sas

DATA STATSAS_2; 
 SET emp;
     ename_find     = find(ename,'A');
     ename_index    = index(ename,'A');
     ename_prxmatch = prxmatch('/A/', ename);
     keep ename empno ename_find ename_index ename_prxmatch;
RUN;

PROC PRINT data=STATSAS_2(obs=3);RUN;

Results

OBS	empno	ename	ename_find	ename_index	ename_prxmatch
1	7369	SMITH	0	0	0
2	7499	ALLEN	1	1	1
3	7521	WARD	2	2	2
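
In the output above, all three SAS functions report the same 1-based position, or 0 when 'A' is absent. A rough Python analogue of the plain-text and regular-expression variants is sketched below; the function names are hypothetical, and Python's re takes the bare pattern 'A' rather than the Perl-style '/A/' that prxmatch expects.

```python
import re


def find_like(text: str, sub: str) -> int:
    """Plain substring search: 1-based position of the first match, or 0."""
    return text.find(sub) + 1  # str.find returns -1 on no match, so this yields 0


def prxmatch_like(pattern: str, text: str) -> int:
    """Regular-expression search: 1-based position of the first match, or 0."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0


# Hypothetical sample: the three names shown in the SAS output.
names = ["SMITH", "ALLEN", "WARD"]
rows = [(name, find_like(name, "A"), prxmatch_like("A", name)) for name in names]
print(rows)  # [('SMITH', 0, 0), ('ALLEN', 1, 1), ('WARD', 2, 2)]
```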

10. Python Dfply Package

Python Programming

emp >> \
  mutate( ename_str = X.ename.str.find('A') + 1 ) >> head()

Results

	empno	ename	job	mgr	hiredate	sal	comm	deptno	ename_str
0	7369	SMITH	CLERK	7902.0	1980/12/17	800	NaN	20	0
1	7499	ALLEN	SALESMAN	7698.0	1981/02/20	1600	300.0	30	1
2	7521	WARD	SALESMAN	7698.0	1981/02/22	1250	500.0	30	2
3	7566	JONES	MANAGER	7839.0	1981/04/02	2975	NaN	20	0
4	7654	MARTIN	SALESMAN	7698.0	1981/09/28	1250	1400.0	30	2

[SQL, Pandas, R Prog, Dplyr, SQLDF, PANDASQL, DATA.TABLE] List of table data processing methods illustrated with the SQL EMP examples
