stringr


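Regular expression options
str_sub() Extract a substring
str_starts() Search for strings that start with a pattern
str_ends() Search for strings that end with a pattern
str_c() Concatenate strings
str_which() Find the row numbers that contain a string
str_detect() Test whether a matching string is present
str_extract() Extract the text matched by a pattern
str_replace(string, pattern, replacement)
str_remove(string, pattern)
str_remove_all(string, pattern)
str_count(string, '\\w+') Count words

References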

A wrapper around stringi
Included in the tidyverse

Regular expression options

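Pass these as arguments to regex():
ignore_case = TRUE (match regardless of upper/lower case)
multiline = TRUE (^ and $ match at the start and end of each line)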
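
As a minimal sketch of how these two options change matching, the example below wraps a pattern in regex(). The vector x and its contents are made up for illustration and are not from the original notes.

```r
library(stringr)

# Hypothetical sample strings (the third one contains a line break)
x <- c("Apple", "apple pie", "banana\napple")

# ignore_case = TRUE: "^apple" also matches "Apple"
res_ignore <- str_detect(x, regex("^apple", ignore_case = TRUE))
res_ignore
# [1]  TRUE  TRUE FALSE

# multiline = TRUE: ^ matches at the start of every line,
# so the "apple" after the line break is found as well
res_multi <- str_detect(x, regex("^apple", multiline = TRUE))
res_multi
# [1] FALSE  TRUE  TRUE
```

Both options can be combined in a single regex() call when a case-insensitive, line-by-line match is needed.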

str_sub() Extract a substring

mutate(KID = str_sub(DateID, start = 6, end = 9))

Extracts the 6th through 9th characters of "2107_1901". Result: 1901
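
The same extraction can be tried on a plain character vector, without a data frame. This is only a sketch: the vector date_id and its second value are invented for illustration.

```r
library(stringr)

# Hypothetical IDs in the same "YYMM_NNNN" shape as the example above
date_id <- c("2107_1901", "2107_1902")

# Characters 6 through 9 (positions are 1-based and inclusive)
kid <- str_sub(date_id, start = 6, end = 9)
kid
# [1] "1901" "1902"

# Negative positions count from the end of the string,
# so start = -4 takes the last four characters
kid_from_end <- str_sub(date_id, start = -4)
```

Counting from the end can be more robust when the part before the underscore does not always have the same length.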

str_starts() Search for strings that start with a pattern

sp.dat.long.score %>% dplyr::filter(str_starts(name, "Mostafa"))

From the long-format data, this keeps only the rows whose name column
holds a string that starts with "Mostafa".
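
A self-contained version of this filter might look like the sketch below. The data frame scores is a made-up stand-in for sp.dat.long.score, and its column values are assumptions chosen only to show which rows survive.

```r
library(stringr)
library(dplyr)

# Hypothetical long-format data (stand-in for sp.dat.long.score)
scores <- data.frame(
  name  = c("Mostafa_DET", "Mostafa_PREP", "Other_DET"),
  value = c(1, 2, 3)
)

# Keep only the rows whose name starts with "Mostafa"
starts_mostafa <- scores %>% dplyr::filter(str_starts(name, "Mostafa"))
starts_mostafa$name
# [1] "Mostafa_DET"  "Mostafa_PREP"
```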

str_ends() Search for strings that end with a pattern

sp.dat.long.score %>% dplyr::filter(str_ends(name, "_DET"))
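
The call above keeps only the rows whose name ends with "_DET". The sketch below shows the logical vector that str_ends() returns, using a made-up character vector, and the negate argument for the opposite selection.

```r
library(stringr)

# Hypothetical names, invented for illustration
name <- c("Mostafa_DET", "Mostafa_PREP", "Other_DET")

# TRUE where the string ends with "_DET"
ends_det <- str_ends(name, "_DET")
ends_det
# [1]  TRUE FALSE  TRUE

# negate = TRUE flips the result: TRUE where the string does NOT end with "_DET"
not_det <- str_ends(name, "_DET", negate = TRUE)
name[not_det]
# [1] "Mostafa_PREP"
```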

str_c() Concatenate strings

Example: ID = str_c(Lang, Year, PID, SID, sep = "_")
- Joins Lang, Year, PID, and SID with underscores to create a new string column called ID

SbyS.dat %>% mutate(ID = str_c(Lang,Year,PID,SID,  sep="_")) %>% head()

Lang Year PID SID sentence SL MDD Omega MHD ID
CN 2 2001 1 大学に入ってもう一年が経ちました 5 1.50 0.5000000 1.50 0.5000000 1.50 CN_2_2001_1
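
To try the same idea without the original data, here is a small sketch. The data frame toy is invented and keeps only the four columns that go into the ID; it also shows that a missing value in any piece makes the whole result NA.

```r
library(stringr)
library(dplyr)

# Hypothetical stand-in for SbyS.dat with only the ID components
toy <- data.frame(
  Lang = c("CN", "CN"),
  Year = c(2L, 2L),
  PID  = c(2001L, 2001L),
  SID  = c(1L, 2L)
)

# Numbers are converted to character before being joined with "_"
toy_id <- toy %>% mutate(ID = str_c(Lang, Year, PID, SID, sep = "_"))
toy_id$ID
# [1] "CN_2_2001_1" "CN_2_2001_2"

# A missing piece propagates: the combined string becomes NA
with_na <- str_c("CN", NA, sep = "_")
```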

str_which() Find the row numbers that contain a string

str_which(column, "string")

> head(df1)
    A B  C
1 AAA 2 20
2 BBB 3 30
3 AAA   40
4 BBB 4 30
5 CCC 5 60
> str_which(df1$A, "AAA")
[1] 1 3
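
The console session above can be reproduced with the sketch below. The data frame is rebuilt by hand from the printed output, so the column types (for example, B as character with one empty entry) are assumptions.

```r
library(stringr)

# df1 reconstructed from the printed output; column types are assumed
df1 <- data.frame(
  A = c("AAA", "BBB", "AAA", "BBB", "CCC"),
  B = c("2", "3", "", "4", "5"),
  C = c(20, 30, 40, 30, 60)
)

# Positions (row numbers) where column A matches "AAA"
rows <- str_which(df1$A, "AAA")
rows
# [1] 1 3

# The positions can be used directly to pull out those rows
matched <- df1[rows, ]
```

str_which() gives the same positions as which(str_detect(...)), so it is mainly a convenience when the row numbers themselves are needed.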

str_detect() Test whether a matching string is present

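str_detect(data, "regular expression")

Convenient when combined with subset()
- Check whether a particular column of a data frame contains a certain kind of string, and keep only the rows that do.
  - Example of "a certain kind of string": entries that contain more than one "word" written as a run of lowercase letters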

fragJBnozeroMW <- subset(fragJBnozero, str_detect(fragJBnozero[,3], "[:lower:]+ +[^a-z]*[:lower:]+"), select=c("total","MW"))

> fragJBnozeroMW
     total                                MW
251    256                    a [NN] of [NP]
278    253                       do n't [VP]
285    252                  the [NN] of [NP]
291    219                       do not [VP]
306    235                      want to [VP]
325    240                             a lot
330    213                       for example
341    210                     a lot of [NP]
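
A reduced version of this selection can be run on a few hand-made rows. The data frame frag below is hypothetical (its totals and entries are invented, apart from two strings copied from the output above), and the column is referred to by name rather than by position.

```r
library(stringr)

# Hypothetical stand-in for fragJBnozero
frag <- data.frame(
  total = c(256, 240, 300, 280),
  MW    = c("a [NN] of [NP]", "a lot", "[NP]", "example"),
  stringsAsFactors = FALSE
)

# Same pattern as above: a lowercase word, one or more spaces,
# optional non-lowercase material, then another lowercase word
pattern <- "[:lower:]+ +[^a-z]*[:lower:]+"

# Keep only rows where MW contains at least two lowercase words
multi_word <- subset(frag, str_detect(MW, pattern), select = c("total", "MW"))
multi_word$MW
# [1] "a [NN] of [NP]" "a lot"
```

Entries with no lowercase word ("[NP]") or only one ("example") are dropped, which mirrors the filtering shown in the output above.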