Programming Projects
You may use any language: python, perl, matlab, R, C, C++, Java, etc.
Your solution should include
- Explanation of your approach, implementation, and results in well organized and grammatical text.
- source code
- instructions to install and run the code
- test data and expected results
- Answers to any specific questions
You may use existing packages/libraries to read and write sequences, but, unless otherwise instructed, generate your own code. Do not use solutions you look up on the internet.
Alignments
- DNA/protein alignment
Align a DNA sequence and a protein sequence based on the translated sequences of the DNA sequences.
Only the forward reading frames need be considered. If desired, you may use existing packages for
reading/writing sequences, scoring tables, and formatted results. Display the results in conventional pairs of
lines format showing both DNA and translated protein sequences.
- Use a local alignment algorithm
- Implement both global and local alignments
- Assume the sequences are in fasta format
- Use a scoring table such as Blosum or PAM
- Allow affine gaps within reading frames
- Allow length independent gaps between reading frames.
- allow affine gaps between reading frames
- Align two DNA sequences, same requirements as above, based on their protein translations
- Suboptimal alignment
For DNA or protein sequences, report n non-intersecting suboptimal alignments. As discussed in class, sub-optimal alignments use no pair of letters used in earlier alignments. This can be achieved by recalculating the score matrix, setting the score for all letter-pairs in previous alignments to be zero.
- allow selection of an appropriate scoring table for DNA or protein
- use a local algorithm with affine gap penalties
- display aligned sequences in conventional pairs of lines format
Simulations
- Sequence randomization
Randomize a sequence preserving its n-mer word content (i.e, 1 letter, 2-letter, 3-l;etter, …). This may be solved simply using sampling with or without replacement, but the n-mer content of the randomized sequence will not exactly match the original sequence.
- input and output in Fasta format. Your program should work for any alphanumeric letter sequence, but especially DNA and protein alphabets.
- n-mer should be selectable in range 1 to 5 (or more)
- Additional output should show original and randomized n-mer word content
- Exact solution (all n-mer counts exactly the same) using an implementation of Euler’s algorithm.
- Sequence evolution – Is a simple i.i.d. randomization a good model for proteins? Choose a random set of 50 protein sequences (from ncbi or uniprot). This problem can be combined with Simulation #1
- For the test data set, calculate the distribution of word scores using words of various length (minimum: 8, 12, 16) using a Blosum scoring table. Display as a frequency table (score vs count).
- Display as a histogram plot
- Compare to the scores distributions generated for the random sequences with different n-mer word lengths (n=1..5) preserved. Use the code in problem 1, above, or the program Ushuffle.
- Mean and standard deviation are sufficient for the comparison, although you may include more sophisticated tests.
- repeat for DNA using word lengths 4, 6, and 8.
- What n-mer model best fits the unrelated protein data?
Evolution/Trees
- Starting with a protein sequence (at least 100 residues), apply a random mutation process based on the PAM model. At random times, duplicate the sequences creating speciation events. At random times, remove some species (extinctions).
- Input parameters should include total evolution time (% true mutations), and number of leaf sequences to be generated.
- speciation and extinctions can follow simple a simple probability, i.e., probability of speciation or extinction at each step is constant.
- Implement a more sophisticated model where these events follow a distribution such as normal or gamma
- report the final set of sequences as a multiple alignment. No alignment algorithm is necessary, since the sequences are generated by random mutation, they are already aligned.
- report the tree in Newick format.
- report the true number of mutations between each pair of sequences (from the simulation history) and the observed number of differences (n x n table, where n is number of leaf nodes)
- parsimony – requires solving Evolution #1, above.
- Implement the parsimony algorithm.
- for 50 random trees with 20 leaf nodes, calculate the tree lengths and report the true tree lengths (known from the simulation), estimated tree length (from parsimony). Calculate the mean and standard deviation of the difference between the known and estimated tree lengths.
- Repeat for evolutionary distances of 50%, 100%, 200%. Are there significant differences between the known and estimated trees?
- Same as above for UPGMA trees. – requires solving Evolution #1, above.
- use 50 random trees with 20 leaf nodes
- Implement UPGMA
- Use the Fitch-Margoliash method to estimate branch lengths
- Use the observed mutational distances to construct trees, compare the estimated distances from the tree to the known true distances from the simulation.
- Calculate the mean and standard deviation of the RMS difference between the known and estimated numbers of mutations.
- Repeat for evolutionary distances of 50%, 100%, 200%. Are there significant differences between the known and estimated trees?
成果展示,部分。如需需要成品,左边二维码联系我们的客服。
Simulation 1 The function seq.Rand shuffles a string sequence. The argument seq for a sequence, k for n-mer, and seed for setting set to make the result repeatable. The output will include the original one, the randamized one, and the Eulerian path. fasta.form() function combines seq.Rand and the function for dealing with fasta format. Codes: # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("Biostrings", version = "3.8") # BiocManager::install("msa") library(Biostrings) library(msa) library(knitr) data("BLOSUM62") seed <- 2018 # an example k <- 5 # an example seq <- "ACTGTGCAGTCGATGCTAGGTTCCGATATCTAGCTCGCTAGATC" # an example seq.Rand <- function(seq, k, seed = NULL){ if(!is.null(seed)) set.seed(seed) # set seed, if any if(k==1){ # if k = 1, sampling the vector without replacement seq.v <- unlist(strsplit(seq,"")) # break a sequence to a vector, each entry only has one char seq.v[-1] <- seq.v[sample(2:length(seq.v))] # first entry remains the same seq.new <- Eulerian <- paste(seq.v,collapse = "") } else{ seq.v <- unlist(strsplit(seq,"")) # break a sequence to a vector, each entry only has one char letter.dict <- unique(seq) len <- length(seq.v) kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) seq.new <- Eulerian <- NULL seq.new[1] <- Eulerian[1] <- kmer.o[1] kmer <- kmer.o[-1] for(i in 2:(length(kmer)+1)){ # if k>=2 # replace without relacement temp0 <- seq.new[i-1] overlap <- substr(temp0, 2, k) dict.temp <- which(substr(kmer, 1, k-1) == overlap) if (length(dict.temp) == 0) break ind.temp <- sample(dict.temp,1) seq.new[i] <- kmer[ind.temp] kmer <- kmer[-ind.temp] # print(kmer) } #Eulerian kmer <- kmer.o[-1] for(i in 2:(length(kmer)+1)){ temp0 <- Eulerian[i-1] overlap <- substr(temp0, 2, k) dict.temp <- which(substr(kmer, 1, k-1) == overlap) if (length(dict.temp) == 0) { # to make sure every node is visited dict.temp <- which(substr(kmer.o, 1, k-1) == overlap) if (length(dict.temp) == 0) break ind.temp <- sample(dict.temp,1) Eulerian[i] <- kmer.o[ind.temp] # print(kmer) } else { ind.temp <- sample(dict.temp,1) Eulerian[i] <- kmer[ind.temp] kmer <- kmer[-ind.temp] # print(kmer) } } # format the sequence seq.new <- paste(seq.new[1], paste(substr(seq.new[2:length(seq.new)],k-1,k),collapse = ""), sep = "") Eulerian <- paste(Eulerian[1], paste(substr(Eulerian[2:length(Eulerian)],k-1,k),collapse = ""), sep = "") } # output: return(list(randseq = seq.new, Eulerian = Eulerian, original = seq)) } seq.Rand(seq, k,2018) ## $randseq ## [1] "ACTGTTGGCCAAGGTTCCGGAATTAGGTAAGGACGTCAGGACCTC" ## ## $Eulerian ## [1] "ACTGTTGGCCAAGGTTCCGGAATTGGCCTTAAGGCCTGGGTTTTCCCCGGAATTAATTCTGGTTGTGGTGTTGTGGCTGTGGT" ## ## $original ## [1] "ACTGTGCAGTCGATGCTAGGTTCCGATATCTAGCTCGCTAGATC" #fasta fasta.form <- function(fasta, k, seed=NULL){ seq.name <- names(fasta) # deal with fasta format seq <- paste(fasta) result <- seq.Rand(seq, k,seed) return(list(name = seq.name, randseq = result$randseq, Eulerian = result$Eulerian, original = result$original)) } data <- readDNAStringSet("./sequence.fasta-3.txt") ## Warning in .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, : ## reading FASTA file ./sequence.fasta-3.txt: ignored 12146 invalid one-letter ## sequence codes fasta.form(data[1], k,2018) ## $name ## [1] "BAG70982.1 phenylalanine ammonia-lyase [Musa balbisiana]" ## ## $randseq ## [1] "MAKAVVVVNNGGAACCKKAADDNNWWKKAAAASSTTGGSSHHDDVVKKRRMMVVRRKKVVRRGGAATTTTSSVVAAAAVVAAAAAARRAAAAVANRSSVVRRVVSSAARRDDGGVVRRAASSSASTATKHRVMWV" ## ## $Eulerian ## [1] "MAKAVVVVNNGGAACCKKAADDNNWWKKAAAASSTTGGSSHHDDVVKKRRMMVVRRKKVVRRGGAATTTTSSVVAAAAVVAAAAGSGSAARRSSVVRAAKAACCKNGNGNGVVVVVNVVVNVVVVVVAVNTATYGDGGVVRRDKVVRRGNWVVVVVVVVVNVVVVVVAVRASSNAMNSSRVVSAVTTSTVVVNAVSMRGRGRKCKACACNGNGGAAVNGGTTDSWWVARAAAGNKRAASTDDHSAGAVRSTVVAVYSAATSSGKRKGDSGTASKAASHRGGVRKAVVAVTTGGGARTTTVMMSVRGSASKAACAVSGTKVVVNVNAVASAASHHDKANGAVAKVRAAASGANGNGVNVVVVAVAATGNWDNVVVVAVCRRRCKVNNGGAGAVNVVVVVVAVSGAKTTVTDVNGGAGAAVTANATGSMKRVNAVVRMVATSYDGVRDNACAVGTWKKAACACVNVNVVVNVVVVVNVNVVVVVVVNVVVNAVVSGSATACCKGANGVVVNVVAVVNKRCKCKNGGAVVVNVNAVAAGVAATTADKAADVVAVSYTCAARAMGNTVRVNNGNGVVVVAVGRDVAANWVVAVAGGNKAKAADNGNGVNVNVVVVVNVNVNNGAVRNVTAAVNNGGAVNVVVNAVSGRVVNGSDNKAVNNGNGVNVVVVAVAGNAGGTDARAAVNNGGAVVVNVNVNAVVGGYKAKACKVVVVAVVVVNVVVNAVVGTGKGAVAADNVNNGGAACAVVGYAATTSRDSHKACKAVVVVNVNVNNGGAAVVGNNVMKVWKWKDNGAVNNGGAVVAVVGTGWKACCKKAACNGAVVGGRGDKANTNKKAWKVNVVAVVVAVVGGDGSRDTSVVVNNGNGVVVNNGNGAVVGTGRGRVAAAAAVVVVNNGVVVNVVVVVVVVAVVVVVVN" ## ## $original ## [1] "MAKAVVNGACKADNWKAASTGSHDVKRMVRKVRGATTSVAAVAAARSVRVSARDGVRASSWVMSMNKGTDSYGVTTGGATSHRRTKGGAKRNAGGSGSGNTSSAAKAAMVRVNTGYSGRAASNNGTCRGTTASGDVSYAGTGRNAKAVGDGKVGAAARASADGKGAVNGTAVGSGASMVANVAVSVSAVCVMGKTDHTHKKHHGAAAMHGSSYMKMAKKHDKKDRYARTSWGVRASTKSRNSVNDNDVSRSKAHGGNGTGVSMDNTRAVAAGKMASVNDYNNGSNSGGRNSDYGKGAAMAAYCSANVTNHVSAHNDVNSGSARKTAAVDKMSTTYVACAVDRHNKSAVKSTVSVAKRVTMGANGHARCKKVVDRHVTYVDDCSATYMKRVVAHANGDKKDAGSSKATTAKVAARAAVGGKAANRCRSYYRVRKTGTGKVRSGDKVDACGKVDCKWNGAC" 2 the distribution of word scores Please refer to the following code and out put for the frequency tables and histgrams for the scores derived by different lengths of word. The mean and Variance are summarized as followings.
length of word |
Mean |
Variance |
8 |
32.31412 |
41.9872 |
12 |
47.30298 |
104.3903 |
16 |
62.61114 |
190.2871 |
The larger the length of word, the higher mean and variance since the maginitude of the scores increaase. shuffle Please refer to the following code and out put for the shuffled word, which derives the the frequency tables and histgrams for shuffled words with different n-mer. The mean and Variance of scores for different n-mer are summarized as followings.
n |
Mean |
Variance |
1 |
25.559 |
7.6516 |
2 |
25.761 |
9.493 |
3 |
25.66886 |
8.859 |
4 |
24.917 |
4.986 |
5 |
26.48 |
8.758 |
What n-mer model best fits the unrelated protein data? The 5-mer obtains larger mean, indicating 5-mer model is the best one among the above five mdoels to fit the unrelated protein data. Redo for DNA Please refer to the code and out put for DNA for details. The mean and Variance are summarized as followings.
length of word |
Mean |
Variance |
4 |
32.31412 |
41.9872 |
6 |
47.30298 |
104.3903 |
8 |
62.61114 |
190.2871 |
code and out of the score. lengthof word: 8 data <- readDNAStringSet("./sequence.fasta-3.txt") ## Warning in .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec, : ## reading FASTA file ./sequence.fasta-3.txt: ignored 12146 invalid one-letter ## sequence codes seq.name <- names(data) seq <- paste(data) data <- data.frame(seq.name, seq) data$seq <- as.character(data$seq) data$seq.name <- as.character(data$seq.name) k <- 8 word.list <- NULL for(i in 1:50){ seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) } word.list <- unique(word.list) # seq.v <- unlist(strsplit(seq,"")) # letter.dict <- unique(seq) # # len <- length(seq.v) # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) scores.list <- list() i <- 1 seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.8 <- unlist(scores.list) T.8 <- quantile(socres.8,.999) #set a T socres.8 <- socres.8[socres.8>=T.8] table(socres.8) ## socres.8 ## 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 ## 238 223 191 152 163 129 120 131 114 95 111 103 115 107 105 89 82 88 ## 42 43 44 45 46 47 48 49 50 51 52 53 55 56 ## 63 57 45 34 20 16 10 4 5 4 2 2 1 1 hist(socres.8) mean(socres.8) ## [1] 32.31412 var(socres.8) ## [1] 41.9872 lengthof word: 12 # 12 k <- 12 word.list <- NULL for(i in 1:50){ seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) } word.list <- unique(word.list) # seq.v <- unlist(strsplit(seq,"")) # letter.dict <- unique(seq) # # len <- length(seq.v) # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) scores.list <- list() i <- 1 seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.12 <- unlist(scores.list) T.12 <- quantile(socres.12,.999) socres.12<- socres.12[socres.12>=T.12] table(socres.12) ## socres.12 ## 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 ## 135 125 118 115 105 108 114 90 102 109 93 88 114 90 97 72 92 81 ## 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 ## 97 97 73 94 87 88 90 74 86 72 81 70 75 51 50 46 25 32 ## 68 69 70 71 72 73 74 75 77 80 ## 15 11 9 4 2 1 2 3 1 1 hist(socres.12) mean(socres.12) ## [1] 47.30298 var(socres.12) ## [1] 104.3903 lengthof word: 16 #16 k <- 16 word.list <- NULL for(i in 1:50){ seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) } word.list <- unique(word.list) # seq.v <- unlist(strsplit(seq,"")) # letter.dict <- unique(seq) # # len <- length(seq.v) # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) scores.list <- list() i <- 1 seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] table(socres.16) ## socres.16 ## 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 ## 131 113 91 100 99 85 93 100 90 99 73 75 74 75 77 86 72 72 ## 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ## 62 72 64 60 66 70 56 67 66 60 71 75 72 75 61 92 71 72 ## 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 94 95 96 ## 67 88 71 63 57 52 51 49 44 28 25 13 5 15 6 2 3 2 ## 97 98 99 ## 2 1 1 hist(socres.16) mean(socres.16) ## [1] 62.61114 var(socres.16) ## [1] 190.2871 code and out put for shuffle ## shuffle shuffle.1 <- seq.Rand(seq[1],1,2018)$randseq shuffle.2 <- seq.Rand(seq[1],2,2018)$randseq shuffle.3 <- seq.Rand(seq[1],3,2018)$randseq shuffle.4 <- seq.Rand(seq[1],4,2018)$randseq shuffle.5 <- seq.Rand(seq[1],5,2018)$randseq # 1mer seq.v <- unlist(strsplit(shuffle.1,"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] hist(socres.16) mean(socres.16) ## [1] 25.559 var(socres.16) ## [1] 7.651598 kable(table(socres.16))
socres.16 |
Freq |
23 |
965 |
24 |
724 |
25 |
564 |
26 |
420 |
27 |
322 |
28 |
237 |
29 |
149 |
30 |
103 |
31 |
61 |
32 |
53 |
33 |
49 |
34 |
26 |
35 |
11 |
36 |
8 |
37 |
8 |
38 |
2 |
39 |
2 |
40 |
1 |
41 |
4 |
42 |
1 |
43 |
2 |
# 2mer seq.v <- unlist(strsplit(shuffle.2,"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] hist(socres.16) mean(socres.16) ## [1] 25.76066 var(socres.16) ## [1] 9.493101 kable(table(socres.16))
socres.16 |
Freq |
23 |
2026 |
24 |
1541 |
25 |
1055 |
26 |
843 |
27 |
648 |
28 |
467 |
29 |
363 |
30 |
251 |
31 |
184 |
32 |
134 |
33 |
86 |
34 |
89 |
35 |
49 |
36 |
32 |
37 |
19 |
38 |
10 |
39 |
9 |
40 |
4 |
41 |
4 |
42 |
5 |
43 |
3 |
44 |
2 |
45 |
3 |
51 |
1 |
54 |
2 |
# 3mer seq.v <- unlist(strsplit(shuffle.3,"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] hist(socres.16) mean(socres.16) ## [1] 25.66886 var(socres.16) ## [1] 8.858734 kable(table(socres.16))
socres.16 |
Freq |
23 |
930 |
24 |
732 |
25 |
553 |
26 |
378 |
27 |
288 |
28 |
220 |
29 |
168 |
30 |
118 |
31 |
77 |
32 |
56 |
33 |
34 |
34 |
21 |
35 |
19 |
36 |
10 |
37 |
12 |
38 |
6 |
39 |
8 |
40 |
4 |
41 |
2 |
42 |
1 |
43 |
2 |
44 |
2 |
45 |
1 |
# 4mer seq.v <- unlist(strsplit(shuffle.4,"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] hist(socres.16) mean(socres.16) ## [1] 24.91729 var(socres.16) ## [1] 4.985532 kable(table(socres.16))
socres.16 |
Freq |
23 |
48 |
24 |
25 |
25 |
19 |
26 |
15 |
27 |
9 |
28 |
7 |
29 |
2 |
30 |
3 |
31 |
3 |
32 |
1 |
33 |
1 |
# 5mer seq.v <- unlist(strsplit(shuffle.5,"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.16 <- unlist(scores.list) T.16 <- quantile(socres.16,.999) socres.16 <- socres.16[socres.16>=T.16] #table(socres.16) hist(socres.16) mean(socres.16) ## [1] 26.48 var(socres.16) ## [1] 8.757766 kable(table(socres.16))
socres.16 |
Freq |
24 |
262 |
25 |
204 |
26 |
144 |
27 |
98 |
28 |
79 |
29 |
39 |
30 |
32 |
31 |
28 |
32 |
20 |
33 |
14 |
34 |
9 |
35 |
4 |
36 |
4 |
37 |
4 |
38 |
3 |
39 |
2 |
40 |
2 |
44 |
1 |
51 |
1 |
For DNA data: lengthof word: 4 data <- readDNAStringSet("./sequence.dna.txt") seq.name <- names(data) seq <- paste(data) seq <- seq[1:50] k <- 4 word.list <- NULL for(i in 1:50){ seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) } word.list <- unique(word.list) # seq.v <- unlist(strsplit(seq,"")) # letter.dict <- unique(seq) # # len <- length(seq.v) # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) scores.list <- list() i <- 1 seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.4 <- unlist(scores.list) T.4 <- quantile(socres.4,.999) #set a T socres.4 <- socres.4[socres.4>=T.4] table(socres.4) ## socres.4 ## 26 27 28 29 30 31 32 33 36 ## 71 65 59 30 7 3 9 6 1 hist(socres.4) mean(socres.4) ## [1] 27.68127 var(socres.4) ## [1] 3.066008 lengthof word: 6 # 12 k <- 6 word.list <- NULL for(i in 1:50){ seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) } word.list <- unique(word.list) # seq.v <- unlist(strsplit(seq,"")) # letter.dict <- unique(seq) # # len <- length(seq.v) # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) scores.list <- list() i <- 1 seq.v <- unlist(strsplit(seq[i],"")) len <- length(seq.v) scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) for(j in 1:length(word.list)){ scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) } scores.temp <- as.vector(scores.temp) scores.list[[i]] <- scores.temp socres.6 <- unlist(scores.list) T.6 <- quantile(socres.6,.999) socres.6<- socres.6[socres.12>=T.6] table(socres.6) ## socres.6 ## -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 ## 1 19 122 502 1616 4148 9135 17839 30029 46708 ## -8 -7 -6 -5 -4 -3 -2 -1 0 1 ## 65803 82215 101207 113498 121924 137414 146644 158432 172517 178802 ## 2 3 4 5 6 7 8 9 10 11 ## 179153 177808 173615 162771 163037 151507 145695 133913 121480 109037 ## 12 13 14 15 16 17 18 19 20 21 ## 98724 88369 76383 67296 55174 49082 42024 34643 27584 22694 ## 22 23 24 25 26 27 28 29 30 31 ## 18865 15985 12269 9021 6915 5965 4384 3356 2229 1766 ## 32 33 34 35 36 37 38 39 40 41 ## 1335 1054 637 405 278 238 173 94 47 37 ## 42 43 44 45 46 47 48 ## 38 13 3 2 6 4 1 hist(socres.6) mean(socres.6) ## [1] 4.324427 var(socres.6) ## [1] 62.15408 lengthof word: 8 # # #16 # k <- 8 # word.list <- NULL # for(i in 1:50){ # seq.v <- unlist(strsplit(seq[i],"")) # len <- length(seq.v) # word.list <- c(word.list, sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")})) # } # # word.list <- unique(word.list) # # # seq.v <- unlist(strsplit(seq,"")) # # letter.dict <- unique(seq) # # # # len <- length(seq.v) # # kmer.o <- sapply(1:(len-k+1), function(x){paste(seq.v[x:(x+k-1)],collapse="")}) # # scores.list <- list() # i <- 1 # seq.v <- unlist(strsplit(seq[i],"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.16 <- socres.8[socres.8>=T.8] # table(socres.8) # hist(socres.8) # mean(socres.8) # var(socres.8) shuffle # ## shuffle # shuffle.1 <- seq.Rand(seq[1],1,2018)$randseq # shuffle.2 <- seq.Rand(seq[1],2,2018)$randseq # shuffle.3 <- seq.Rand(seq[1],3,2018)$randseq # shuffle.4 <- seq.Rand(seq[1],4,2018)$randseq # shuffle.5 <- seq.Rand(seq[1],5,2018)$randseq # # # 1mer # seq.v <- unlist(strsplit(shuffle.1,"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.8 <- socres.8[socres.8>=T.8] # hist(socres.8) # mean(socres.8) # var(socres.8) # kable(table(socres.8)) # # # 2mer # seq.v <- unlist(strsplit(shuffle.2,"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.8 <- socres.8[socres.8>=T.8] # hist(socres.8) # mean(socres.8) # var(socres.8) # kable(table(socres.8)) # # # 3mer # seq.v <- unlist(strsplit(shuffle.3,"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.8 <- socres.8[socres.8>=T.8] # hist(socres.8) # mean(socres.8) # var(socres.8) # kable(table(socres.8)) # # # 4mer # seq.v <- unlist(strsplit(shuffle.4,"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.8 <- socres.8[socres.8>=T.8] # hist(socres.8) # mean(socres.8) # var(socres.8) # kable(table(socres.8)) # # # 5mer # seq.v <- unlist(strsplit(shuffle.5,"")) # len <- length(seq.v) # scores.temp <- matrix(rep(0,length(word.list)*(len+1-k)),length(word.list)) # for(j in 1:length(word.list)){ # scores.temp[j, ] <- sapply(1:(len+1-k),function(x){sum(diag(BLOSUM62[unlist(strsplit(word.list[j],"")), seq.v[x:(x+k-1)]]))}) # } # scores.temp <- as.vector(scores.temp) # scores.list[[i]] <- scores.temp # # socres.8 <- unlist(scores.list) # T.8 <- quantile(socres.8,.999) # socres.8 <- socres.8[socres.8>=T.8] # #table(socres.8) # hist(socres.8) # mean(socres.8) # var(socres.8) # kable(table(socres.8))