[Spezielle Probleme der Bioinformatik]

술먹기딱좋은날 2019. 4. 11. 16:51

Bash scripting language

https://blog.naver.com/aries84/220338118102

- Assignment:

- Read access:

[Intro_R]

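30. Vectors: c() creates a vector. The vector is said to be the basic data structure of R.

31. Below are the variants that can be expressed.

32. It is supposedly important to spot NA values early so that they do not show up later; I did not quite follow this point.

33. An index of 0 is ignored.

35. Negative indices:

- Positive and negative indices must not be mixed.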
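
A small sketch of the indexing rules above. The vector x is made up for illustration and is not from the lecture.

```r
x <- c(10, 20, 30, 40, 50)   # made-up example vector

x[c(0, 2)]    # index 0 is silently dropped, only element 2 comes back
x[-1]         # negative index: everything except the first element
x[-c(1, 5)]   # drop the first and the last element

# Mixing positive and negative indices is an error.
# try() catches it so the rest of the script keeps running.
mixed <- try(x[c(-1, 2)], silent = TRUE)
inherits(mixed, "try-error")
```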

37. Logical: once you are used to R, logical vectors are apparently what we use most.

38. Selecting elements by a condition like this is very common.

- If two vectors of different lengths are used together, the shorter one is recycled.

- With x <- 5:10 and y <- 1:3, y has 3 elements and x has 6. x + y then adds y twice in a row.
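
The recycling rule and selection by condition can be tried out directly. This sketch reuses x and y from the note; the threshold 7 is an arbitrary choice for illustration.

```r
x <- 5:10
y <- 1:3

x + y          # y is used twice: 5+1, 6+2, 7+3, 8+1, 9+2, 10+3

big <- x > 7   # logical vector, one TRUE/FALSE per element
big
x[big]         # keep only the elements where the condition holds
sum(big)       # TRUE counts as 1, so this is the number of hits

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning, e.g. for 1:5 + 1:2.
```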

39. Names: assignment

- Elements that do not exist show up as NA.

41. seq and rep: generating sequences

https://m.blog.naver.com/PostView.nhn?blogId=coder1252&logNo=220952289447&proxyReferer=https%3A%2F%2Fwww.google.com%2F

- The by argument of seq does not have to be a whole number; values such as 0.2 work too.
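
A few calls that show how seq and rep behave. The concrete numbers are only illustrative.

```r
seq(1, 10, by = 3)          # integer step
seq(0, 1, by = 0.2)         # a fractional step works as well
seq(0, 1, length.out = 5)   # fix the number of values instead of the step

rep(1:3, times = 2)         # repeat the whole vector: 1 2 3 1 2 3
rep(1:3, each = 2)          # repeat each element: 1 1 2 2 3 3
```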

42. Vector arithmetic

- sqrt: square root

- ^2

- almost all mathematical functions

43. rnorm: 100 random numbers drawn from the standard normal distribution. I do not fully understand this yet.

- order: this gives us the sorted ...
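
To make rnorm and order more concrete, here is a short sketch. The seed and the small vector v are assumptions added for reproducibility and illustration.

```r
set.seed(1)       # assumed seed so the random draw can be repeated
x <- rnorm(100)   # 100 draws from a normal distribution with mean 0 and sd 1
mean(x)           # close to 0, but not exactly 0
sd(x)             # close to 1

v <- c(30, 10, 20)
order(v)          # positions that would sort v: 2 3 1
v[order(v)]       # indexing with them gives the sorted vector
sort(v)           # same result in one step
```

order returns index positions rather than values, which is why it can also be used to sort the rows of a table by one of its columns.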

44. Vectors of strings: we apparently use paste a lot as well.

[Funktion R]

1. Max

2. Declaration:

3. Default values: defaults
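
The slides are not reproduced here, so the following is only a guess at what a "Max" function with a default value could look like. The name my_max and the argument na.rm are made up for illustration.

```r
# Hypothetical example: a hand-written maximum with one default argument
my_max <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]       # optionally drop missing values first
  if (length(x) == 0) return(NA)     # nothing left to compare
  best <- x[1]
  for (value in x) {
    if (is.na(value)) return(NA)     # a missing value makes the result unknown
    if (value > best) best <- value
  }
  best
}

my_max(c(3, 9, 2))                   # default na.rm = FALSE is used
my_max(c(3, NA, 9))                  # NA, because a value is missing
my_max(c(3, NA, 9), na.rm = TRUE)    # default overridden by name
```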

4. Attributes: apparently not much different from names. They are not used regularly.

5. Arrays: vectors with a dim attribute

- A matrix is an array with two dimensions

- Example: matrix(x, nrow=2)
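
A minimal sketch of "a matrix is a vector with a dim attribute". The values 1:6 and 1:24 are arbitrary.

```r
x <- 1:6
m <- matrix(x, nrow = 2)     # filled column by column
m
dim(m)                       # 2 rows, 3 columns
m[2, 3]                      # row 2, column 3

attr(x, "dim") <- c(2, 3)    # setting dim by hand turns the vector into a matrix
is.matrix(x)
all(x == m)                  # same layout as matrix(1:6, nrow = 2)

a <- array(1:24, dim = c(2, 3, 4))   # three dimensions
a[1, 2, 3]
```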

6. Arrays - creating matrices

7. Arrays - access: selection.

- sometimes handy

- arrays with index vectors

8. Arrays - combination

- Construction with cbind and rbind. Used often.

- e.g. cbind(4:1, 1:4), rbind(4:1, 1:4)
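
The two calls from the note, run side by side. The column names down and up in the last call are made up.

```r
a <- cbind(4:1, 1:4)   # vectors become columns: 4 rows, 2 columns
b <- rbind(4:1, 1:4)   # vectors become rows: 2 rows, 4 columns
a
b
dim(a)
dim(b)
all(t(a) == b)         # one is the transpose of the other

named <- cbind(down = 4:1, up = 1:4)   # argument names become column names
colnames(named)
```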

9. Arrays - computing with arrays

10. Arrays - matrix operations

- Matrix multiplication: x %*% y

- Diagonal matrix: diag

- upper.tri()
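
A short sketch of the matrix operations listed above, on a made-up 2 x 2 matrix.

```r
x <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
y <- diag(2)                 # 2 x 2 identity matrix

x * x        # element-wise product
x %*% x      # true matrix product
x %*% y      # multiplying by the identity returns x

diag(x)             # diagonal elements of x: 1 and 4
upper.tri(x)        # logical mask, TRUE strictly above the diagonal
x[upper.tri(x)]     # the single element above the diagonal
```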

11. Lists: apparently needed for the exercises.

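[R3]
#81 Bioinformatics on microarrays apparently started in 1999. Microarrays are cheaper than one might think, so they are still widely used.
# The premise is that the amount of mRNA is a meaningful indicator of gene activity. There are various sources of error.
#84
golub <- read.table("3_golub")   # the file name must be quoted, and an object name cannot start with a digit; add the extension if the file has one
?read.table                      # help page with the import options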

plot(golub[[3]])
plot(golub[[3]], type = "l")   # type "l" (the letter l) draws a line
plot(golub[[3]], type = "l", axes = FALSE, xlab = "", ylab = "expression")
axis(side = 1)
axis(side = 2)
axis(side = 1, at = 1:10, labels = golub[1:10, 1])
# adjust the orientation of the tick labels
axis(side = 1, at = 1:10, labels = golub[1:10, 1], las = 1)
axis(side = 1, at = 1:10, labels = golub[1:10, 1], las = 2)
axis(side = 1, at = 1:10, labels = golub[1:10, 1], las = 3)
#87
par("mar")
# prints the current (default) margin settings
par(mar = c(10, 4, 4, 2))
plot(golub[[3]][1:10], type = "l", axes = FALSE)
# Histograms
hist(golub[[3]])
par(mar = c(5, 4, 4, 2) + 0.1)
hist(golub[[3]], freq = FALSE)
hist(golub[[3]], probability = TRUE)
# for a finer split use breaks; a single number is only a suggestion, so it is not always honoured exactly
hist(golub[[3]], freq = FALSE, breaks = 20)
hist(golub[[3]], freq = FALSE, breaks = 30)
hist(golub[[3]], freq = FALSE, breaks = 50)
hist(golub[[3]], freq = FALSE, breaks = 100)
# "break" is a reserved word in R, so the vector is called breaks
breaks <- seq(min(golub[[3]]), max(golub[[3]]), length.out = 30)
hist(golub[[3]], freq = FALSE, breaks = breaks)
a <- seq(min(golub[[3]]), max(golub[[3]]), length.out = 30)
# mean and sd are arguments of dnorm(); col and lwd belong to lines()
lines(a, dnorm(a, mean = mean(golub[[3]]), sd = sd(golub[[3]])), col = "red", lwd = 3)
hist(golub[[3]], freq = FALSE, breaks = 100)
# bw is an argument of density(), not of plot()
plot(density(golub[[3]], bw = 1))
plot(density(golub[[3]], bw = 0.1))
plot(density(golub[[3]], bw = 0.01))
# now it looks very much like the histogram
plot(density(golub[[3]], bw = 0.1))
plot(density(golub[[3]], bw = 2))
plot(density(golub[[3]]))
lines(density(golub[[4]]), col = "red")
lines(density(golub[[5]]), col = "blue")
plot(ecdf(golub[[3]]), verticals = TRUE, do.points = FALSE)   # draw the ECDF as a continuous step line
lines(ecdf(golub[[4]]), col = "red")
#93
x<-c(2,4,5,7,1)
x
barplot(x)
x <- c(a = 2, b = 4, c = 5, d = 7, e = 1)
barplot(x)
# las did not seem to work for me: inside c() it only creates another bar named "las"; it belongs in the barplot() call
barplot(x, las = 3)
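# Quantiles are said to be a bit tricky; the idea is to sort the data. In a box plot the box marks the quartiles: the bottom edge is the 1st quartile, the top edge the 3rd quartile, and the line in the middle is of course the median.
# Whiskers are the lines sticking out of the box.
boxplot(golub[[3]]) # the extra arguments added below are called parameters
boxplot(golub[[3]], horizontal = TRUE)
boxplot(golub[, 3:5])
# compare whether the two calls below differ
boxplot(golub[, c(-1)], las = 3)
boxplot(golub[1:10, 3:5])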
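
The numbers behind a box plot can be inspected without plotting. This is a sketch on a small made-up vector, since the golub file is not part of this post.

```r
x <- c(2, 4, 5, 7, 1, 9, 3)   # made-up data
sort(x)                       # quantiles are read off the sorted data

q <- quantile(x, probs = c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile
q
median(x)                     # the line inside the box
IQR(x)                        # height of the box: 3rd minus 1st quartile

# The five numbers boxplot() draws: lower whisker, box bottom, median, box top, upper whisker
boxplot.stats(x)$stats
```

Note that boxplot() uses hinges, which can differ slightly from the default quantile() values for some sample sizes.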
plot(golub[[3]],golub[[4]])
pairs(golub[3:6])
qqplot(golub[[3]],golub[[4]])
abline(a=0,b=1,col="red",lwd=3)
qqnorm(golub[[3]])
x<-rnorm(10)
x<-rnorm(10000)
qqnorm( x)

par(mfrow = c(2, 1))   # mfrow needs c(rows, columns); two rows and one column is assumed here
hist(golub[[3]])

[R4]
# Clustering is being explained: how to compute distances, single linkage and the other linkage methods, and so on.
d<-dist(golub[,c(-1,-2)])
d
read.table("../uebungen..")
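# cor.dist() is not part of base R; an add-on package providing it (for example bioDist from Bioconductor) has to be loaded first
d<-cor.dist(x=t(golub[,c(-1,-2)]))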
h<-hclust(d,method = "single")
plot(h)
# single linkage is usually not used; complete linkage is said to be the common choice
h<-hclust(d,method = "complete")
plot(h)
h<-hclust(d,method = "average")
kmeans(golub[,c(-1,-2)], centers = 2)
hist(golub[[3]])
postscript("plot.ps")
hist(golub[[3]])
# the plot ends up in the file plot.ps once the device is closed
dev.off()
pdf("plot.pdf", width=10, height = 5)
hist(golub[[3]])
hist(golub[[4]])
hist(golub[[5]])
dev.off()
# PDF is a vector format and PNG is pixel-based. For data with a huge number of points PNG can be the better choice; otherwise the PDF takes forever to open.
png("plot.png")

x<-rnorm(10)
plot(x)
x<-rnorm(10)+3
# logarithmic scaling
plot(x, log="y")
# text(): for marking something special in the plot
# title, mtext,
mtext(text = "ss")
mtext(text = "ss", side = 1)
mtext(text = "ss", side = 1, line = 3)
# graphics options: pch ...
# adjusting the size of points, numbers, and so on
plot(x, main="main lot", cex=2)
plot(x, main="main lot", cex=2, cex.axis=2)
# Bioconductor is a collection of bioinformatics packages for analysing sequencing, microarray and similar data
BiocManager::install("affy")
library(affy)

Data<-ReadAffy()
pd <- read.table('covdesc.txt',header=T)
batch<-ReadAffy(phenoData = pd, celfile.path = ".")
data(batch)
fn<-featureNames(batch)
ex <- exprs(batch)
# to compare two experiments with each other: y-axis M = x - y, x-axis A = (x + y)/2
# plotted this way, the points scatter around the horizontal line at y = 0 and show the differences between the two experiments.
MAplot(batch)
# this renders the differences as a smooth cloud of pixels.
MAplot(batch, plot.method="smoothScatter")
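
The M and A values can also be computed by hand. The sketch below uses simulated log intensities for two hypothetical arrays, not the CEL files from the lecture.

```r
set.seed(42)                        # assumed seed for reproducibility
x <- rnorm(1000, mean = 8)          # simulated log intensities, array 1
y <- x + rnorm(1000, sd = 0.3)      # array 2: same genes plus measurement noise

M <- x - y                          # difference between the experiments (y-axis)
A <- (x + y) / 2                    # average intensity (x-axis)

plot(A, M, pch = ".")
abline(h = 0, col = "red")          # without systematic bias the cloud is centred on 0
```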
hist(batch)
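# Normalization: RMA works quite well, MAS5 rather poorly. The reason is said to be that the company never published the actual method, so it was apparently rebuilt by reverse engineering.
# RMA sorts the data -> ?? (rest of the note is unclear)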
eset.rma<-rma(batch)
eset.mas5<-mas5(batch)

hist(exprs(eset.rma))
hist(exprs(eset.mas5))

# if you want to publish something, you have to state which normalization method was applied.
exprs<-exprs(eset.rma)
plotDensity(exprs)
boxplot(exprs)
# example that is not in the slides. oligo has to be installed
library(oligo)
MAplot(eset.rma,plot.method="smoothScatter")

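# 116. Our goal is to look at changes in the values m_{g,k}.
# the variance is computed with the var function.
# 118. The null hypothesis of the t-test is that there is no difference between experiment and test; it is one way to detect differences
# 122. Significance level: at 1 %, the probability of rejecting a null hypothesis that is actually true is 1 %
# 124. The exact (Student) t-test is run with t.test(x, y, var.equal = TRUE); without var.equal, R uses the Welch variant.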
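
A small sketch comparing the two variants of t.test on simulated data. The group sizes, means and seed are assumptions chosen only for illustration.

```r
set.seed(1)                       # assumed seed
x <- rnorm(10, mean = 0)          # group 1, simulated
y <- rnorm(10, mean = 1)          # group 2, simulated with a shifted mean

var(x)                            # the variance mentioned in the note
var(y)

welch   <- t.test(x, y)                     # default: variances may differ
student <- t.test(x, y, var.equal = TRUE)   # classic t-test with pooled variance

welch$p.value
student$p.value
student$parameter                 # degrees of freedom: 10 + 10 - 2
```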
# loaded die (Würfel gezinkt): let's compute in R how often a 6 comes up when throwing a die. Empirically, with 10 throws.
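
One way the die experiment could look in R. The seed is an assumption, and the die simulated here is fair, so this only shows the mechanics of the test.

```r
set.seed(7)                                    # assumed seed
rolls <- sample(1:6, size = 10, replace = TRUE)   # 10 throws of a fair die
rolls
k <- sum(rolls == 6)                           # number of sixes
mean(rolls == 6)                               # empirical frequency of a six

# Probability of seeing at least k sixes in 10 throws if the die is fair
p_fair <- pbinom(k - 1, size = 10, prob = 1/6, lower.tail = FALSE)
p_fair

# binom.test() performs the same one-sided test
bt <- binom.test(k, n = 10, p = 1/6, alternative = "greater")
bt$p.value
```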
?p.adjust
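
What p.adjust does can be seen on a handful of made-up p-values (these are not results from the lecture data).

```r
p <- c(0.001, 0.01, 0.02, 0.04, 0.5)   # made-up raw p-values

bonf <- p.adjust(p, method = "bonferroni")   # multiply by the number of tests, cap at 1
bh   <- p.adjust(p, method = "BH")           # Benjamini-Hochberg, controls the false discovery rate

bonf
bh
sum(p < 0.05)      # 4 tests look significant before correction
sum(bonf < 0.05)   # only 1 survives the Bonferroni correction
```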

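# 128. RNA-seq data: discrete counts. Negative binomial distribution instead of normal distribution
BiocManager::install("DESeq2")   # DESeq2 is a Bioconductor package, so install.packages() does not find it
library("DESeq2")
# se must already exist: a SummarizedExperiment holding the counts and the sample columns cell and dex
dds <- DESeqDataSet(se, design = ~ cell + dex)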

# how many reads per experiment
colSums(counts(dds))

# how do the distributions of the count values differ
plotDensity(counts(dds))
# -> not much to see
# therefore, on a log scale
plotDensity(log(counts(dds)+1))

# boxplots
# 134. In practice there are rarely many replicates, so this is seldom applied directly. That is why it is done as on slide 135.
# Really understanding this takes a lot more than Statistics 1.
# apply the whole DESeq pipeline to the data
dds <- DESeq(dds)

raw <- counts(dds)
colSums(raw)
norm <- counts(dds, normalized = TRUE)
colSums(norm)

# this is how we get the results
res <- results(dds)
head(res)
sum(res$pvalue < 0.05, na.rm = TRUE)

# Multiple testing!!
sum(res$padj < 0.05, na.rm = TRUE)

# pick out the significant genes
resSig <- res[!is.na(res$padj) & res$padj < 0.05, ]
# and order them by log2FoldChange, upregulated first (I could not finish typing this line in the lecture)
head(resSig[order(resSig$log2FoldChange, decreasing = TRUE), ])
# and downregulated
head(resSig[order(resSig$log2FoldChange), ])
# rlog transformation
rld <- rlog(dds)

[R-Aufgaben]

#Exercise 1.3
a<- 113:-12
b<- seq(113,-12,-3)
c<- rep(c(TRUE,FALSE),56)
d<- rep(1:7,each=3)

#Exercise 1.4
a<- letters[26:1]
b<- letters[seq(1,26,2)]
c<- letters[seq(26,1,-2)]
loesung<- c(b,c)

#Exercise 2.1
x<-rep(c("a","c","g","t"),each=4)
y<-rep(c("a","c","g","t"),4)
paste(x,y,".fasta")   # the default separator is a space, so the names contain blanks
paste(x,y,".fasta", sep = "", collapse = NULL)

#Exercise 2.2
# year of birth, shoe size, weight
#2.2 (a)
a<-sample(1980:2000,10)
a
b<-sample(270:290,10)
b
c<-sample(60:100,10)
c
x <- c(a,b,c)
x<-matrix(x,10)
x

#attr(x,"dim") <- c(10, 3)

#2.2 (b)
a <- paste("student", 1:10)
a
rownames(x)<-a
colnames(x)<-c("Geburtsjahr","Schuhgroesse","Gewicht")
x
mode(x)

#2.2 (c)
d<-sample(c("w","m"),10,replace = TRUE)
d
df=data.frame(d,x)
colnames(df)<-c("Geschlechte","Geburtsjahr","Schuhgroesse","Gewicht")

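# why does this not work?? y<-cbind(d,x)
# if numbers and strings end up in the same vector (or matrix), everything is converted to character.
# here I only want to add one column name, not replace all of them.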

#2.2(d)

man <-subset(df, Geschlechte=="m")
man
woman <-subset(df, Geschlechte=="w")
woman

#2.2(e)

Kg_Mittel<- mean(df[,4])
Kg_Mittel # mean weight
Kg_man<-mean(man[,4])
Kg_man # mean weight of the men
Kg_woman<-mean(woman[,4])
Kg_woman # mean weight of the women

#Exercise 2.3
#2.3 (a)

iris <- read.table("2_iris.txt", header = TRUE)
iris

#2.3 (b)
SL<-mean(iris[,1]) # mean sepal length
SB<-mean(iris[,2]) # mean sepal width
PL<-mean(iris[,3]) # mean petal length
PB<-mean(iris[,4]) # mean petal width

#2.3 (c)
# what does "do not rely on the row order" mean here?
setosa<-subset(iris,Species=="setosa")
setosa
versicolor<-subset(iris,Species=="versicolor")
versicolor
virginica<-subset(iris,Species=="virginica")
virginica

sts1<-mean(setosa[,1])
sts2<-mean(setosa[,2])
sts3<-mean(setosa[,3])
sts4<-mean(setosa[,4])
sts<-matrix(c(sts1,sts2,sts3,sts4),nrow = 1)
colnames(sts)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
rownames(sts)<-"setosa"
sts

vsc1<-mean(versicolor[,1])
vsc2<-mean(versicolor[,2])
vsc3<-mean(versicolor[,3])
vsc4<-mean(versicolor[,4])
vsc<-matrix(c(vsc1,vsc2,vsc3,vsc4), nrow = 1)
colnames(vsc)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
rownames(vsc)<-"versicolor"
vsc

vgn1<-mean(virginica[,1])
vgn2<-mean(virginica[,2])
vgn3<-mean(virginica[,3])
vgn4<-mean(virginica[,4])
vgn<-matrix(c(vgn1,vgn2,vgn3,vgn4),nrow = 1)
colnames(vgn)<-c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
rownames(vgn)<-"virginica"
vgn

#2.3 (d)
iris<-iris[c(order(-iris$Sepal.Width)),]
iris
# not finished

#2.3 (e)

#=====Expression data analysis============
# precedence of the data types
#character > numeric > logical
# converting from a lower-precedence type to a higher one is always possible.
# e.g. numeric can be turned into character, and logical into numeric
# converting from a higher-precedence type to a lower one works only in some cases.
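
The precedence rule can be checked with a few one-liners. The example values are made up, and the last lines revisit the cbind question from exercise 2.2 with a smaller matrix.

```r
v <- c(1, TRUE, "a")   # mixed input: everything becomes character
v
class(v)

class(c(1, TRUE))      # logical is promoted to numeric
as.numeric(TRUE)       # going up the hierarchy always works

as.numeric("3.5")                        # going down works if the text is a number
suppressWarnings(as.numeric("abc"))      # otherwise the result is NA (with a warning)
as.logical("TRUE")
as.logical("yes")                        # not a recognised spelling, so NA

d <- c("w", "m")
x <- matrix(1:4, nrow = 2)
mode(cbind(d, x))      # a matrix has one type only, so the numbers become text
mode(data.frame(d, x)[[2]])   # a data frame keeps one type per column
```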

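#3.1
# dist() computes the distances between the rows of the data. The default is euclidean; the method argument changes it.
# The relationship between continuous numeric variables (x = 1, 2, 3, ... and y = 4, 5, 6, ...) is split into correlation and causation. Correlation does not say that one variable influences the other; it only states that an association exists. If it is positive, both rise together. If one variable actually influences the other, that is causation.
# Correlation is usually examined with correlation analysis, dependence of one variable on another with regression
# Pearson CC: the most common one. Used when both variables are continuous.
# r = correlation coefficient. The Spearman CC is used when the continuous variables deviate strongly from a normal distribution or are measured on an ordinal (rank) scale. For example: rank the grades per subject, then analyse the association between subjects.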
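
A sketch of dist() and of the difference between Pearson and Spearman. All values are made up so that the results are easy to check by hand.

```r
# Three points in the plane, one per row
m <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))
dist(m)                         # euclidean by default: a-b 5, a-c 10, b-c 5
dist(m, method = "manhattan")   # sum of absolute differences: 7, 14, 7

# A relationship that is perfectly monotone but not linear
x <- 1:10
y <- x^3
pearson  <- cor(x, y)                        # measures linear association, below 1 here
spearman <- cor(x, y, method = "spearman")   # works on ranks, exactly 1 here
pearson
spearman
```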
