Data Science
Read a tab-separated file (.txt) into R
The key argument to specify correctly is sep, and the tab character is written in R as sep = "\t". The backslash is the escape character: it means that the character following \ is special.
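A self-contained sketch (the file contents here are made up):

```r
# Write a small tab-separated file to a temporary path
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tvalue", "a\t1", "b\t2"), tmp)

# sep = "\t" marks fields as tab-separated; header = TRUE reads
# the first line as column names
df <- read.table(tmp, sep = "\t", header = TRUE)
df

# readr::read_tsv(tmp) is the tidyverse equivalent
```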
Rows count as duplicates as long as they contain the same elements with the same frequencies, regardless of column order. How can we remove such duplicate rows?
df= tibble(x= c(0, 18, 4, 9, 88), y= c(4, 9, 0, 18, 40))
df
## # A tibble: 5 × 2
## x y
## <dbl> <dbl>
## 1 0 4
## 2 18 9
## 3 4 0
## 4 9 18
## 5 88 40
df %>%
mutate(z= map2(x, y, ~ c(.x, .y) |> sort())) %>%
distinct(z, .keep_all = TRUE)
## # A tibble: 3 × 3
## x y z
## <dbl> <dbl> <list>
## 1 0 4 <dbl [2]>
## 2 18 9 <dbl [2]>
## 3 88 40 <dbl [2]>
1 Grouped operations
Group a data frame and add a group number to each row
tibble(col= c("A", "A", "B", "B", "C")) %>%
group_by(col) %>%
mutate(grp = cur_group_id())
## # A tibble: 5 × 2
## # Groups: col [3]
## col grp
## <chr> <int>
## 1 A 1
## 2 A 1
## 3 B 2
## 4 B 2
## 5 C 3
Batch t-tests
library(tidyverse)
library(rstatix)
df = read_csv("F:/Learning_materials/R/正则/Demo_t.test.csv")
df
## # A tibble: 38 × 7
## compoundID case_1 case_2 case_3 control_1 control_2 control_3
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 com_001 485 154 268 350 432 425
## 2 com_002 208 372 219 457 324 392
## 3 com_003 219 125 345 473 480 403
## 4 com_004 289 356 116 489 376 500
## 5 com_005 248 456 279 457 426 436
## 6 com_006 323 142 462 451 354 452
## 7 com_007 259 148 374 397 346 383
## 8 com_008 428 262 226 436 499 308
## 9 com_009 327 494 244 316 368 401
## 10 com_010 480 343 495 383 471 387
## # ℹ 28 more rows
df= df %>%
pivot_longer(-1, names_pattern = "(.*)_", names_to = c(".value")) %>%
pivot_longer(-1, names_to = "trt", values_to = "val")
df
## # A tibble: 228 × 3
## compoundID trt val
## <chr> <chr> <dbl>
## 1 com_001 case 485
## 2 com_001 control 350
## 3 com_001 case 154
## 4 com_001 control 432
## 5 com_001 case 268
## 6 com_001 control 425
## 7 com_002 case 208
## 8 com_002 control 457
## 9 com_002 case 372
## 10 com_002 control 324
## # ℹ 218 more rows
df %>%
group_by(compoundID) %>%
t_test(val ~ trt, detailed = TRUE)
## # A tibble: 38 × 16
## compoundID estimate estimate1 estimate2 .y. group1 group2 n1 n2
## * <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int>
## 1 com_001 -100 302. 402. val case control 3 3
## 2 com_002 -125. 266. 391 val case control 3 3
## 3 com_003 -222. 230. 452 val case control 3 3
## 4 com_004 -201. 254. 455 val case control 3 3
## 5 com_005 -112 328. 440. val case control 3 3
## 6 com_006 -110 309 419 val case control 3 3
## 7 com_007 -115 260. 375. val case control 3 3
## 8 com_008 -109 305. 414. val case control 3 3
## 9 com_009 -6.67 355 362. val case control 3 3
## 10 com_010 25.7 439. 414. val case control 3 3
## # ℹ 28 more rows
## # ℹ 7 more variables: statistic <dbl>, p <dbl>, df <dbl>, conf.low <dbl>,
## # conf.high <dbl>, method <chr>, alternative <chr>
The cyl column divides the data into 3 groups; to modify the data frame group by group, use group_modify
group_modify() returns a grouped tibble. In that case .f must return a data frame.
- Compute the minimum of every variable within each group
- and append the results as the last row of each group
df= mtcars[2:10, 1:4]
df
## mpg cyl disp hp
## Mazda RX4 Wag 21.0 6 160.0 110
## Datsun 710 22.8 4 108.0 93
## Hornet 4 Drive 21.4 6 258.0 110
## Hornet Sportabout 18.7 8 360.0 175
## Valiant 18.1 6 225.0 105
## Duster 360 14.3 8 360.0 245
## Merc 240D 24.4 4 146.7 62
## Merc 230 22.8 4 140.8 95
## Merc 280 19.2 6 167.6 123
unique(df$cyl) # cyl has 3 levels
## [1] 6 4 8
df %>%
group_by(cyl) %>%
summarise(across(.fns = min))
## # A tibble: 3 × 4
## cyl mpg disp hp
## <dbl> <dbl> <dbl> <dbl>
## 1 4 22.8 108 62
## 2 6 18.1 160 105
## 3 8 14.3 360 175
df %>%
group_by(cyl) %>%
group_modify(., ~ .x %>% bind_rows(apply(.x, 2, min)))
## # A tibble: 12 × 4
## # Groups: cyl [3]
## cyl mpg disp hp
## <dbl> <dbl> <dbl> <dbl>
## 1 4 22.8 108 93
## 2 4 24.4 147. 62
## 3 4 22.8 141. 95
## 4 4 22.8 108 62
## 5 6 21 160 110
## 6 6 21.4 258 110
## 7 6 18.1 225 105
## 8 6 19.2 168. 123
## 9 6 18.1 160 105
## 10 8 18.7 360 175
## 11 8 14.3 360 245
## 12 8 14.3 360 175
Inside group_modify(), . refers to the whole grouped data frame, while .x refers to each group's small data frame.
apply() applies min to every column and returns a named vector, so each group yields one vector. We then need to append this computed vector after each group's own data, which is what bind_rows() does.
Each group does the same two things:
- apply(.x, 2, min) computes the minimum of each variable;
- bind_rows() then combines the group's own data with that result.
A possible question:
Each group's data is a data frame, while the result is a vector. Is it a problem to combine objects of different types?
bind_rows() and bind_cols() return the same type as the first input.
The help page shows that when a vector is passed as the second argument, it is converted to a data frame. So…
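A quick check of that coercion with toy values (assuming a current dplyr version):

```r
library(dplyr)

d <- tibble(a = 1, b = 2)
v <- c(a = 3, b = 4)   # a named numeric vector, not a data frame

# The named vector is coerced to a one-row data frame before binding
res <- bind_rows(d, v)
res
```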
Group by month: data from the same year and month form one group
head(economics, 3)
## # A tibble: 3 × 6
## date pce pop psavert uempmed unemploy
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1967-07-01 507. 198712 12.6 4.5 2944
## 2 1967-08-01 510. 198911 12.6 4.7 2945
## 3 1967-09-01 516. 199113 11.9 4.6 2958
economics %>%
group_by(ym= tsibble::yearmonth(date)) %>%
summarise(pce= mean(pce))
## # A tibble: 574 × 2
## ym pce
## <mth> <dbl>
## 1 1967 7月 507.
## 2 1967 8月 510.
## 3 1967 9月 516.
## 4 1967 10月 512.
## 5 1967 11月 517.
## 6 1967 12月 525.
## 7 1968 1月 531.
## 8 1968 2月 534.
## 9 1968 3月 544.
## 10 1968 4月 544
## # ℹ 564 more rows
2 Operating on multiple columns at once
Replace every . with 0
Anonymous-function (lambda) syntax.
The parameters of across() are .cols, .fns, and ..., in that order. In general, arguments must be supplied in this order when their names are omitted.
The first parameter, .cols, has a default value (everything()), so if we don't need to change it we can omit it entirely.
The function that follows (~ str_replace_all()) would then be mistakenly matched to the .cols parameter, because it is supplied first. So we must name it explicitly: .fns = ~ str_replace_all(). Without .fns =, the function is taken as the .cols argument and the call fails.
df
## A B
## 1 X. X..X
## 2 Y. Y..Y
df %>%
mutate(across(.fns= ~ str_replace_all(.x, "\\.", "0")))
## A B
## 1 X0 X00X
## 2 Y0 Y00Y
df %>%
mutate(across(.fns= str_replace_all, pattern= "\\.", replacement= "0"))
## A B
## 1 X0 X00X
## 2 Y0 Y00Y
3 Operating on / filtering rows
- Remove rows that contain repeated elements
df= data.frame(A= 1:4, B= c(2, 3, 4, 3), C= c(10, 10, 4, 1), D= c(4, 2, 4, 6))
df
## A B C D
## 1 1 2 10 4
## 2 2 3 10 2
## 3 3 4 4 4
## 4 4 3 1 6
df %>%
filter(! pmap_lgl(., ~ duplicated(c(...)) %>% any() ))
## A B C D
## 1 1 2 10 4
## 2 4 3 1 6
df %>%
filter(pmap_lgl(., ~ length(unique(c(...))) == length(c(...)) ))
## A B C D
## 1 1 2 10 4
## 2 4 3 1 6
- Repeat each row N times
The first row is duplicated twice; the second and third rows are repeated three times and once, respectively.
df= tibble(A= c(0.56, 4.33, 5.81), N= c(2, 3, 1))
df
## # A tibble: 3 × 2
## A N
## <dbl> <dbl>
## 1 0.56 2
## 2 4.33 3
## 3 5.81 1
df %>%
slice(rep(1:n(), times= N)) # slice(1, 1, 2, 2, 2, 3)
## # A tibble: 6 × 2
## A N
## <dbl> <dbl>
## 1 0.56 2
## 2 0.56 2
## 3 4.33 3
## 4 4.33 3
## 5 4.33 3
## 6 5.81 1
df[rep(1:nrow(df), df$N), ] # base R equivalent
## # A tibble: 6 × 2
## A N
## <dbl> <dbl>
## 1 0.56 2
## 2 0.56 2
## 3 4.33 3
## 4 4.33 3
## 5 4.33 3
## 6 5.81 1
- Merge each row's elements into one new column, excluding NAs.
df
## # A tibble: 4 × 4
## A B C D
## <int> <dbl> <dbl> <dbl>
## 1 1 2 10 4
## 2 2 NA 10 2
## 3 NA 4 4 4
## 4 4 NA 1 NA
f= function(x) {
x[!is.na(x)] %>%
paste0(., collapse= "-")
}
df %>%
mutate(new= pmap_chr(., ~ f(c(...))))
## # A tibble: 4 × 5
## A B C D new
## <int> <dbl> <dbl> <dbl> <chr>
## 1 1 2 10 4 1-2-10-4
## 2 2 NA 10 2 2-10-2
## 3 NA 4 4 4 4-4-4
## 4 4 NA 1 NA 4-1
- Replace the last non-NA value of each row with NA
df= tibble(A= c(200.79, NA, 193.2, NA), B= c(NA, NA, "C9LL", "WP45"), C= NA, D= c(4.326, NA, NA, NA))
df
## # A tibble: 4 × 4
## A B C D
## <dbl> <chr> <lgl> <dbl>
## 1 201. <NA> NA 4.33
## 2 NA <NA> NA NA
## 3 193. C9LL NA NA
## 4 NA WP45 NA NA
f= function(x) {
if (all(is.na(x))) x
else {
n= length(x)
while(is.na(x[n])) n= n-1
x[n]= NA
x
}
}
df %>%
pmap_dfr(., ~ f(c(...)))
## # A tibble: 4 × 4
## A B C D
## <chr> <chr> <chr> <chr>
## 1 200.79 <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA>
## 3 193.2 <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA>
4 Operating on / filtering columns
- Remove columns in which every element is "AAA"
df= tibble(x= rep("AAA", 5), y = 1:5, z= c(rep("AAA", 3), "b", "c"))
df
## # A tibble: 5 × 3
## x y z
## <chr> <int> <chr>
## 1 AAA 1 AAA
## 2 AAA 2 AAA
## 3 AAA 3 AAA
## 4 AAA 4 b
## 5 AAA 5 c
df %>%
select(where(~ !all(.x == "AAA")))
## # A tibble: 5 × 2
## y z
## <int> <chr>
## 1 1 AAA
## 2 2 AAA
## 3 3 AAA
## 4 4 b
## 5 5 c
.x == "AAA" is a logical test: it compares each column vector element-wise with "AAA" and returns a logical vector of the same length as the column. For column z, for example, the result is T T T F F.
This is combined with all(): all(logical vector) is TRUE only when every element is TRUE, so columns y and z both give FALSE. select(where()) keeps the columns whose predicate returns TRUE and drops the rest.
But we want the opposite: drop column x and keep columns y and z. So we negate the predicate with !.
- Remove columns in which every element is NA
df
## # A tibble: 3 × 4
## x y z w
## <lgl> <int> <lgl> <chr>
## 1 NA 1 NA <NA>
## 2 NA 2 NA B
## 3 NA 3 NA C
df %>%
select(where(~ !all(is.na(.x))))
## # A tibble: 3 × 2
## y w
## <int> <chr>
## 1 1 <NA>
## 2 2 B
## 3 3 C
5 Regular expressions
- Extract the numbers
- Extract the number immediately following b
- Extract the numbers appearing anywhere after b??
tt = c("ab1", "vf2", "aaba2", "dd9b76", "d8p", "a0b3e4")
str_extract_all(tt, "\\d+")
## [[1]]
## [1] "1"
##
## [[2]]
## [1] "2"
##
## [[3]]
## [1] "2"
##
## [[4]]
## [1] "9" "76"
##
## [[5]]
## [1] "8"
##
## [[6]]
## [1] "0" "3" "4"
str_extract(tt, "(?<=b)\\d+")
## [1] "1" NA NA "76" NA "3"
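The third task (digits appearing anywhere after b) was left open above. One possible sketch, my own approach rather than the original's, strips everything up to and including the first b (a variable-length lookbehind like (?<=b.*) is not allowed in stringr's ICU engine) and then extracts every run of digits:

```r
library(stringr)

tt <- c("ab1", "vf2", "aaba2", "dd9b76", "d8p", "a0b3e4")

# Remove everything up to and including the first "b"
# (strings with no "b" are reduced to ""), then pull all digit runs
res <- tt |>
  str_remove("^[^b]*b?") |>
  str_extract_all("\\d+")
res
```

Strings without a b yield character(0), distinguishing "no b present" from "no digits after b".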
5.1 za
length<- is a base R replacement function: it assigns a new length to a vector. The max expression that follows is passed in as the actual argument.
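A hedged illustration of the `length<-` idiom (toy list; padding every element to the longest length is my reading of the context):

```r
lst <- list(1:2, 1:5, 1:3)

# `length<-` is the replacement function behind length(x) <- n;
# assigning a longer length pads the vector with NAs.
# max(lengths(lst)) is passed as the new length for every element.
padded <- lapply(lst, `length<-`, max(lengths(lst)))
padded
```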
Sum the numbers within each cell of the data frame, excluding the first number
Teacher Zhang: I used to wonder why str_split was designed to return a list that is so awkward to access; only recently did I realize how convenient it is when used inside a data frame.
TEST <- tibble(a_AD = c('1,2','0,2,3','2,0','0,0,2,3'),
b_AD = c('1,2','0,0,2,3','0,2,0,3','2,0'))
TEST
## # A tibble: 4 × 2
## a_AD b_AD
## <chr> <chr>
## 1 1,2 1,2
## 2 0,2,3 0,0,2,3
## 3 2,0 0,2,0,3
## 4 0,0,2,3 2,0
TEST %>%
mutate(across(1:2, ~ str_split(.x, ",") %>%
map(as.numeric) %>%
map_dbl(~ sum(.x[-1]))))
## # A tibble: 4 × 2
## a_AD b_AD
## <dbl> <dbl>
## 1 2 2
## 2 5 5
## 3 0 5
## 4 5 0
What does the function inside across() do to each column?
str_split() splits each element (a string) into a character vector. Then map() + as.numeric converts each character vector in the resulting list-column to numeric; here .x refers to a single vector within the list-column. Finally map() iterates over the list-column, summing each vector while dropping its first element (.x[-1]); the _dbl suffix simplifies the sums into an ordinary numeric column.
library(lubridate)
df = tibble(x = as.Date(c("2005/1--20", "2018/9--3"),
format = "%Y/%m--%d"))
df
## # A tibble: 2 × 1
## x
## <date>
## 1 2005-01-20
## 2 2018-09-03
df %>%
mutate(y = if_else(x <= as.Date("2005-1-20"), x, NA_Date_))
## # A tibble: 2 × 2
## x y
## <date> <date>
## 1 2005-01-20 2005-01-20
## 2 2018-09-03 NA
When using if_else(), the type of the recoded value must match the original column's type. Strictly speaking, to replace values with NA you should use the typed missing values NA_character_ or NA_real_ (a missing double). NA_Date_ is provided by the lubridate package, which must be loaded.
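A minimal sketch of this type-matching rule, with a toy vector:

```r
library(dplyr)

x <- c(5, 10, 15)

# x is double, so the replacement NA must be double too: NA_real_.
# A bare NA is logical and, in older dplyr versions, raises a type error.
y <- if_else(x > 8, x, NA_real_)
y
```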
Text mining: the quanteda package
How do you modify a function's source code? Use the trace() function:
trace(rstatix:::as_tidy_cor, edit = TRUE)