Data Science

January 14, 2023

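1 Grouped operations
2 Operating on multiple columns at once
3 Operating on rows / filtering rows
4 Operating on columns / filtering columns
5 Regular expressions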
- 5.1 za

Read a tab-separated file (.text) into R

The key argument to specify correctly is sep: to represent the tab character in R, write sep = "\t". The backslash is the escape character; it signals that the character following \ has a special meaning.
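
As a minimal sketch of the sep = "\t" argument described above, the following writes a small tab-separated file to a temporary path and reads it back with base R's read.table(). The file contents and column names are hypothetical and only serve the illustration.

```r
# Write a small tab-separated file to a temporary location (hypothetical data)
path <- tempfile(fileext = ".txt")
writeLines(c("id\tname\tscore", "1\tAnn\t90.5", "2\tBob\t78"), path)

# sep = "\t" tells read.table() that fields are separated by tab characters
dat <- read.table(path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
dat
```

In a tidyverse workflow, readr::read_tsv() is an alternative that assumes the tab separator by default.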

Two rows count as duplicates when they contain the same elements with the same frequencies, regardless of order; for example, (0, 4) and (4, 0). How can we remove such duplicate rows?

library(tidyverse)
df= tibble(x= c(0, 18, 4, 9, 88), y= c(4, 9, 0, 18, 40))
df
## # A tibble: 5 × 2
##       x     y
##   <dbl> <dbl>
## 1     0     4
## 2    18     9
## 3     4     0
## 4     9    18
## 5    88    40
df %>% 
  mutate(z= map2(x, y, ~ c(.x, .y) |> sort())) %>% 
  distinct(z, .keep_all = TRUE)
## # A tibble: 3 × 3
##       x     y z        
##   <dbl> <dbl> <list>   
## 1     0     4 <dbl [2]>
## 2    18     9 <dbl [2]>
## 3    88    40 <dbl [2]>

1 Grouped operations

Group a data frame and add a group ID

tibble(col= c("A", "A", "B", "B", "C")) %>% 
  group_by(col) %>% 
  mutate(grp = cur_group_id())
## # A tibble: 5 × 2
## # Groups:   col [3]
##   col     grp
##   <chr> <int>
## 1 A         1
## 2 A         1
## 3 B         2
## 4 B         2
## 5 C         3

Batch t-tests

# Load the packages and read the demo data
library(tidyverse)
library(rstatix)
df = read_csv("F:/Learning_materials/R/正则/Demo_t.test.csv")
df
## # A tibble: 38 × 7
##    compoundID case_1 case_2 case_3 control_1 control_2 control_3
##    <chr>       <dbl>  <dbl>  <dbl>     <dbl>     <dbl>     <dbl>
##  1 com_001       485    154    268       350       432       425
##  2 com_002       208    372    219       457       324       392
##  3 com_003       219    125    345       473       480       403
##  4 com_004       289    356    116       489       376       500
##  5 com_005       248    456    279       457       426       436
##  6 com_006       323    142    462       451       354       452
##  7 com_007       259    148    374       397       346       383
##  8 com_008       428    262    226       436       499       308
##  9 com_009       327    494    244       316       368       401
## 10 com_010       480    343    495       383       471       387
## # ℹ 28 more rows
df= df %>% 
  pivot_longer(-1, names_pattern = "(.*)_", names_to = c(".value")) %>% 
  pivot_longer(-1, names_to = "trt", values_to = "val")
df
## # A tibble: 228 × 3
##    compoundID trt       val
##    <chr>      <chr>   <dbl>
##  1 com_001    case      485
##  2 com_001    control   350
##  3 com_001    case      154
##  4 com_001    control   432
##  5 com_001    case      268
##  6 com_001    control   425
##  7 com_002    case      208
##  8 com_002    control   457
##  9 com_002    case      372
## 10 com_002    control   324
## # ℹ 218 more rows
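
The two pivot_longer() calls above can probably be collapsed into one. The sketch below assumes the same group_replicate naming scheme and uses a hypothetical two-compound table: names_sep splits each column name at the underscore, and the NA in names_to discards the replicate number. The rows come out in column order rather than alternating case/control, which does not affect a grouped t-test.

```r
library(tidyr)
library(tibble)

# Hypothetical subset with the same naming scheme (group_replicate)
wide <- tibble(
  compoundID = c("com_001", "com_002"),
  case_1 = c(485, 208), case_2 = c(154, 372),
  control_1 = c(350, 457), control_2 = c(432, 324)
)

long <- pivot_longer(
  wide,
  cols      = -compoundID,
  names_to  = c("trt", NA),  # keep the group name, discard the replicate number
  names_sep = "_",
  values_to = "val"
)
long
```
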
df %>% 
  group_by(compoundID) %>% 
  t_test(val ~ trt, detailed = TRUE)
## # A tibble: 38 × 16
##    compoundID estimate estimate1 estimate2 .y.   group1 group2     n1    n2
##  * <chr>         <dbl>     <dbl>     <dbl> <chr> <chr>  <chr>   <int> <int>
##  1 com_001     -100         302.      402. val   case   control     3     3
##  2 com_002     -125.        266.      391  val   case   control     3     3
##  3 com_003     -222.        230.      452  val   case   control     3     3
##  4 com_004     -201.        254.      455  val   case   control     3     3
##  5 com_005     -112         328.      440. val   case   control     3     3
##  6 com_006     -110         309       419  val   case   control     3     3
##  7 com_007     -115         260.      375. val   case   control     3     3
##  8 com_008     -109         305.      414. val   case   control     3     3
##  9 com_009       -6.67      355       362. val   case   control     3     3
## 10 com_010       25.7       439.      414. val   case   control     3     3
## # ℹ 28 more rows
## # ℹ 7 more variables: statistic <dbl>, p <dbl>, df <dbl>, conf.low <dbl>,
## #   conf.high <dbl>, method <chr>, alternative <chr>

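The cyl column splits the data into 3 groups; use group_modify() to modify the data frame group by group.

group_modify() returns a grouped tibble. In that case .f must return a data frame.

Compute the minimum of every variable within each group
and append the result as the last row of that group.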

df= mtcars[2:10, 1:4] 
df
##                    mpg cyl  disp  hp
## Mazda RX4 Wag     21.0   6 160.0 110
## Datsun 710        22.8   4 108.0  93
## Hornet 4 Drive    21.4   6 258.0 110
## Hornet Sportabout 18.7   8 360.0 175
## Valiant           18.1   6 225.0 105
## Duster 360        14.3   8 360.0 245
## Merc 240D         24.4   4 146.7  62
## Merc 230          22.8   4 140.8  95
## Merc 280          19.2   6 167.6 123
unique(df$cyl) # cyl, has 3 levels
## [1] 6 4 8
df %>% 
  group_by(cyl) %>% 
  summarise(across(.fns = min))
## # A tibble: 3 × 4
##     cyl   mpg  disp    hp
##   <dbl> <dbl> <dbl> <dbl>
## 1     4  22.8   108    62
## 2     6  18.1   160   105
## 3     8  14.3   360   175
df %>% 
  group_by(cyl) %>% 
  group_modify(., ~ .x %>% bind_rows(apply(.x, 2, min)))
## # A tibble: 12 × 4
## # Groups:   cyl [3]
##      cyl   mpg  disp    hp
##    <dbl> <dbl> <dbl> <dbl>
##  1     4  22.8  108     93
##  2     4  24.4  147.    62
##  3     4  22.8  141.    95
##  4     4  22.8  108     62
##  5     6  21    160    110
##  6     6  21.4  258    110
##  7     6  18.1  225    105
##  8     6  19.2  168.   123
##  9     6  18.1  160    105
## 10     8  18.7  360    175
## 11     8  14.3  360    245
## 12     8  14.3  360    175

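Here . refers to the whole grouped data frame, while .x refers to the smaller data frame of each group.

apply() applies min to every column and returns a vector, one vector per group. To append that vector to the end of each group's data, we use bind_rows().

Each group does the same two things:

apply(.x, 2, min) computes the minimum of every variable;
bind_rows() then combines the group's own data with that result.

A possible question:
Each group's data is a data frame, but the result is a vector. Is it a problem to combine two different types?

bind_rows() and bind_cols() return the same type as the first input.

The function's help page shows that a vector passed as the second argument is converted to a data frame. So…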
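
One way to sidestep the type question altogether is to build the extra row as a data frame before binding. This is a sketch of an alternative, not the approach used above: summarise() with across() returns a one-row data frame of column minima, which bind_rows() then appends to the group's data.

```r
library(dplyr)

df <- mtcars[2:10, 1:4]

res <- df %>%
  group_by(cyl) %>%
  group_modify(~ bind_rows(
    .x,
    # one-row data frame holding the minimum of each column in this group
    summarise(.x, across(everything(), min))
  ))
res
```

Both inputs to bind_rows() are data frames here, so no implicit conversion is involved.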

Group by month: observations from the same year and month form one group

head(economics, 3)
## # A tibble: 3 × 6
##   date         pce    pop psavert uempmed unemploy
##   <date>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
## 1 1967-07-01  507. 198712    12.6     4.5     2944
## 2 1967-08-01  510. 198911    12.6     4.7     2945
## 3 1967-09-01  516. 199113    11.9     4.6     2958
economics %>% 
  group_by(ym= tsibble::yearmonth(date)) %>% 
  summarise(pce= mean(pce))
## # A tibble: 574 × 2
##           ym   pce
##        <mth> <dbl>
##  1  1967 7月  507.
##  2  1967 8月  510.
##  3  1967 9月  516.
##  4 1967 10月  512.
##  5 1967 11月  517.
##  6 1967 12月  525.
##  7  1968 1月  531.
##  8  1968 2月  534.
##  9  1968 3月  544.
## 10  1968 4月  544 
## # ℹ 564 more rows
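
If the tsibble package is not available, a similar year-month key can be sketched with lubridate::floor_date(), which maps every date to the first day of its month. The sales table below is hypothetical daily data, used only to show that dates from the same year and month collapse into one group.

```r
library(dplyr)
library(lubridate)

# Hypothetical daily data covering two months in 2020 and one in 2021
sales <- tibble(
  date  = as.Date(c("2020-01-03", "2020-01-20", "2020-02-11", "2021-01-05")),
  value = c(10, 20, 30, 40)
)

monthly <- sales %>%
  # floor_date() rounds each date down to the first day of its month,
  # so January 2020 and January 2021 remain separate groups
  group_by(ym = floor_date(date, "month")) %>%
  summarise(value = mean(value))
monthly
```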

2 Operating on multiple columns at once

Replace every . with 0

Anonymous-function (lambda) syntax.

The parameters of across() are .cols, .fns, and ..., in that order. To omit the parameter names, we generally have to supply the arguments in this order.

The first parameter, .cols, has a default value, everything(). If we do not need to change .cols, we can omit it entirely. (Since dplyr 1.1.0, omitting .cols is deprecated, so newer code should write across(everything(), ...).)

However, the function that follows, ~ str_replace_all(), is then the first argument supplied and would be mistaken for .cols. So we must name it explicitly: .fns = ~ str_replace_all(). Without .fns =, the function is taken as the value of .cols and the call fails.

df
##    A    B
## 1 X. X..X
## 2 Y. Y..Y
df %>% 
  mutate(across(.fns= ~ str_replace_all(.x, "\\.", "0")))
##    A    B
## 1 X0 X00X
## 2 Y0 Y00Y
df %>% 
  mutate(across(.fns= str_replace_all, pattern= "\\.", replacement= "0"))
##    A    B
## 1 X0 X00X
## 2 Y0 Y00Y
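
For newer dplyr versions, the same replacement can be sketched with .cols supplied explicitly, so the lambda is matched to .fns by position and no argument name is needed. The data frame is rebuilt here from the printed output above so that the example runs on its own.

```r
library(dplyr)
library(stringr)

df <- data.frame(A = c("X.", "Y."), B = c("X..X", "Y..Y"))

out <- df %>%
  # .cols is given explicitly, so the lambda is matched to .fns by position
  mutate(across(everything(), ~ str_replace_all(.x, "\\.", "0")))
out
```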

3 Operating on rows / filtering rows

Remove rows containing repeated elements

df= data.frame(A= 1:4, B= c(2, 3, 4, 3), C= c(10, 10, 4, 1), D= c(4, 2, 4, 6))
df
##   A B  C D
## 1 1 2 10 4
## 2 2 3 10 2
## 3 3 4  4 4
## 4 4 3  1 6
df %>% 
   filter(! pmap_lgl(., ~ duplicated(c(...)) %>% any() ))
##   A B  C D
## 1 1 2 10 4
## 2 4 3  1 6
df %>% 
   filter(pmap_lgl(., ~ length(unique(c(...))) == length(c(...)) ))
##   A B  C D
## 1 1 2 10 4
## 2 4 3  1 6

Repeat each row N times

The first row appears twice; the second and third rows appear three times and once, respectively.

df= tibble(A= c(0.56, 4.33, 5.81), N= c(2, 3, 1))
df
## # A tibble: 3 × 2
##       A     N
##   <dbl> <dbl>
## 1  0.56     2
## 2  4.33     3
## 3  5.81     1
df %>% 
  slice(rep(1:n(), times= N)) # slice(1, 1, 2, 2, 2, 3)
## # A tibble: 6 × 2
##       A     N
##   <dbl> <dbl>
## 1  0.56     2
## 2  0.56     2
## 3  4.33     3
## 4  4.33     3
## 5  4.33     3
## 6  5.81     1
df[rep(1:nrow(df), df$N), ] # base R equivalent
## # A tibble: 6 × 2
##       A     N
##   <dbl> <dbl>
## 1  0.56     2
## 2  0.56     2
## 3  4.33     3
## 4  4.33     3
## 5  4.33     3
## 6  5.81     1
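
tidyr also provides uncount() for this task. The sketch below repeats each row according to the weights in N; .remove = FALSE keeps the N column so the result matches the output above.

```r
library(tibble)
library(tidyr)

df <- tibble(A = c(0.56, 4.33, 5.81), N = c(2, 3, 1))

# uncount() duplicates each row N times; .remove = FALSE keeps column N
out <- uncount(df, N, .remove = FALSE)
out
```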

Merge the elements of each row into one new column, skipping NA.

df
## # A tibble: 4 × 4
##       A     B     C     D
##   <int> <dbl> <dbl> <dbl>
## 1     1     2    10     4
## 2     2    NA    10     2
## 3    NA     4     4     4
## 4     4    NA     1    NA
f= function(x) {
  x[!is.na(x)] %>% 
    paste0(., collapse= "-")
}
df %>% 
  mutate(new= pmap_chr(., ~ f(c(...))))
## # A tibble: 4 × 5
##       A     B     C     D new     
##   <int> <dbl> <dbl> <dbl> <chr>   
## 1     1     2    10     4 1-2-10-4
## 2     2    NA    10     2 2-10-2  
## 3    NA     4     4     4 4-4-4   
## 4     4    NA     1    NA 4-1

Replace the last non-NA value of each row with NA

df= tibble(A= c(200.79, NA, 193.2, NA), B= c(NA, NA, "C9LL", "WP45"), C= NA, D= c(4.326, NA, NA, NA))
df
## # A tibble: 4 × 4
##       A B     C         D
##   <dbl> <chr> <lgl> <dbl>
## 1  201. <NA>  NA     4.33
## 2   NA  <NA>  NA    NA   
## 3  193. C9LL  NA    NA   
## 4   NA  WP45  NA    NA
f= function(x) {
  if (all(is.na(x))) x
  else {
    n= length(x)
    while(is.na(x[n])) n= n-1
    x[n]= NA
    x
  }
}
df %>% 
  pmap_dfr(., ~ f(c(...)))
## # A tibble: 4 × 4
##   A      B     C     D    
##   <chr>  <chr> <chr> <chr>
## 1 200.79 <NA>  <NA>  <NA> 
## 2 <NA>   <NA>  <NA>  <NA> 
## 3 193.2  <NA>  <NA>  <NA> 
## 4 <NA>   <NA>  <NA>  <NA>

Because c(...) combines each row into a single vector, mixed column types are coerced to character, which is why every column in the result is <chr>.

4 Operating on columns / filtering columns

Remove columns in which every element is "AAA"

df= tibble(x= rep("AAA", 5), y = 1:5, z= c(rep("AAA", 3), "b", "c"))
df
## # A tibble: 5 × 3
##   x         y z    
##   <chr> <int> <chr>
## 1 AAA       1 AAA  
## 2 AAA       2 AAA  
## 3 AAA       3 AAA  
## 4 AAA       4 b    
## 5 AAA       5 c
df %>% 
  select(where(~ !all(.x == "AAA")))
## # A tibble: 5 × 2
##       y z    
##   <int> <chr>
## 1     1 AAA  
## 2     2 AAA  
## 3     3 AAA  
## 4     4 b    
## 5     5 c

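.x == "AAA" is a test of whether each element of a column equals "AAA"; it returns a logical vector of the same length as the column. For example, column z gives TRUE TRUE TRUE FALSE FALSE.

Combine this with all(). all() returns TRUE only when every element of the logical vector is TRUE, so columns y and z both give FALSE. Without negation, only the columns that return TRUE would be kept and all others dropped.

We want the opposite: drop column x and keep columns y and z. So we negate the condition with !.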
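
The column-by-column logic can be inspected in base R. This is a sketch that mirrors what where() evaluates for each column, using a plain data frame with the same values as above; all_aaa and kept are names introduced here for illustration.

```r
df <- data.frame(x = rep("AAA", 5), y = 1:5, z = c(rep("AAA", 3), "b", "c"))

# Element-wise comparison for one column
df$z == "AAA"

# One TRUE/FALSE per column: is every element equal to "AAA"?
all_aaa <- sapply(df, function(col) all(col == "AAA"))
all_aaa

# Negate to keep the columns that are NOT entirely "AAA"
kept <- df[, !all_aaa, drop = FALSE]
names(kept)
```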

Remove columns in which every element is NA

df
## # A tibble: 3 × 4
##   x         y z     w    
##   <lgl> <int> <lgl> <chr>
## 1 NA        1 NA    <NA> 
## 2 NA        2 NA    B    
## 3 NA        3 NA    C
df %>% 
  select(where(~ !all(is.na(.x))))
## # A tibble: 3 × 2
##       y w    
##   <int> <chr>
## 1     1 <NA> 
## 2     2 B    
## 3     3 C

5 Regular expressions

Extract all numbers
Extract the number immediately following b
Extract the numbers that appear anywhere after b??

tt = c("ab1", "vf2", "aaba2", "dd9b76", "d8p", "a0b3e4")
str_extract_all(tt, "\\d+") 
## [[1]]
## [1] "1"
## 
## [[2]]
## [1] "2"
## 
## [[3]]
## [1] "2"
## 
## [[4]]
## [1] "9"  "76"
## 
## [[5]]
## [1] "8"
## 
## [[6]]
## [1] "0" "3" "4"
str_extract(tt, "(?<=b)\\d+")
## [1] "1"  NA   NA   "76" NA   "3"

5.1 za

`length<-` is a built-in R function that sets the length of an object; the max expression that follows it is passed in as the argument.
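
The code this note refers to is not shown, so the following is only a guess at a typical use: calling `length<-` as an ordinary function to pad vectors of unequal length with NA up to the longest one. The list pieces is hypothetical.

```r
# Called as a function, `length<-` returns the vector with its new length;
# missing positions are filled with NA
`length<-`(1:2, 4)

# Hypothetical list of vectors with different lengths
pieces <- list(a = 1:2, b = 1:4, c = 7L)

# max(lengths(pieces)) is passed to `length<-` as its second argument
padded <- lapply(pieces, `length<-`, max(lengths(pieces)))

# Equal lengths make it possible to bind the vectors into a matrix
mat <- do.call(rbind, padded)
mat
```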

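Sum the numbers in each cell of the data frame, excluding the first number.

Teacher Zhang: I used to wonder why str_split() was designed to return a list, which is so awkward to access. Only recently did I realize how convenient that is inside a data frame.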

TEST <- tibble(a_AD = c('1,2','0,2,3','2,0','0,0,2,3'), 
b_AD = c('1,2','0,0,2,3','0,2,0,3','2,0'))
TEST
## # A tibble: 4 × 2
##   a_AD    b_AD   
##   <chr>   <chr>  
## 1 1,2     1,2    
## 2 0,2,3   0,0,2,3
## 3 2,0     0,2,0,3
## 4 0,0,2,3 2,0
TEST %>% 
  mutate(across(1:2, ~ str_split(.x, ",") %>% 
                  map(as.numeric) %>% 
                  map_dbl(~ sum(.x[-1]))))
## # A tibble: 4 × 2
##    a_AD  b_AD
##   <dbl> <dbl>
## 1     2     2
## 2     5     5
## 3     0     5
## 4     5     0

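What does the function inside across() do to each column?

str_split() splits each element (a string) into a character vector, producing a list-column. map() with as.numeric then converts each character vector in that list to numeric; here .x refers to a single vector in the list. Finally, map_dbl() iterates over the list and sums each vector without its first element. The _dbl suffix makes the result an ordinary numeric vector, which becomes the new column.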
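
The three steps can be followed on a plain character vector outside of across(). This is a small sketch using the first two values of column a_AD; parts, nums, and sums are names introduced here for illustration.

```r
library(stringr)
library(purrr)

x <- c("1,2", "0,2,3")

parts <- str_split(x, ",")              # list of character vectors
nums  <- map(parts, as.numeric)         # list of numeric vectors
sums  <- map_dbl(nums, ~ sum(.x[-1]))   # one number per element, first value dropped
sums
```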

library(lubridate)
df = tibble(x = as.Date(c("2005/1--20", "2018/9--3"), 
                        format = "%Y/%m--%d"))
df
## # A tibble: 2 × 1
##   x         
##   <date>    
## 1 2005-01-20
## 2 2018-09-03
df %>% 
  mutate(y = if_else(x <= as.Date("2005-1-20"), x, NA_Date_))
## # A tibble: 2 × 2
##   x          y         
##   <date>     <date>    
## 1 2005-01-20 2005-01-20
## 2 2018-09-03 NA

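When using if_else(), make sure the recoded values have the same type as the original data. To replace values with NA, strictly speaking you need a typed missing value such as NA_character_ or NA_real_ (a missing double). Using NA_Date_ requires loading the lubridate package. Since dplyr 1.1.0, if_else() casts its inputs to a common type, so a plain NA is also accepted.

Text mining: the quanteda package

How to modify a function's source code? Use the trace() function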

trace(rstatix:::as_tidy_cor, edit = TRUE)