Hi,
We are biology students at Avans Breda, and we are trying to run a machine learning script in R on a dataset of human genome sequences. We ran into some errors and hope that one of you can help us with them.
The scripts we are trying to run are located at https://github.com/cancer-genomics/delfi_scripts.
The error occurs while running the 04 script from that repository on a single BAM file from the original dataset.
As far as we understand, this script joins the output of the previous scripts (an .rds file) with the sample_reference.csv file from the GitHub repository, and then splits the data into 5 Mb bins, which are later used in a stochastic gradient boosted algorithm. The problem is in this bit of code:
df.fr <- readRDS("../.../.../ourfilespecification_frags_bin_100kb.rds")
master <- read_csv("sample_reference.csv")
df.fr2 <- inner_join(df.fr, master, by=c("sample"="WGS ID"))
hic.eigen <- (df.fr2 %>% filter(sample=="PGDX10346P1"))$hic.eigen
But while joining, it gives the following error message:
Error in UseMethod("inner_join") :
no applicable method for 'inner_join' applied to an object of class "c('GRanges', 'GenomicRanges', 'Ranges', 'GenomicRanges_OR_missing', 'GenomicRanges_OR_GenomicRangesList', 'GenomicRanges_OR_GRangesList', 'List', 'Vector', 'list_OR_List', 'Annotated', 'vector_OR_Vector')"
Calls: inner_join
We assumed this meant that the inner_join function is not compatible with the GRanges class, so we tried converting the GRanges object to a data frame first:
df.fr <- data.frame(readRDS("../.../.../ourfilespecification_frags_bin_100kb.rds"))
When we ran the script again, the error message changed to this:
Error: by can't contain join column sample which is missing from LHS
Backtrace:
█
├─dplyr::inner_join(df.fr, master, by = c(sample = "WGS ID"))
└─dplyr:::inner_join.data.frame(df.fr, master, by = c(sample = "WGS ID"))
├─base::as.data.frame(...)
├─dplyr::inner_join(tbl_df(x), y, by = by, copy = copy, ...)
└─dplyr:::inner_join.tbl_df(...)
├─dplyr::common_by(by, x, y)
└─dplyr:::common_by.character(by, x, y)
└─dplyr:::common_by.list(by, x, y)
└─dplyr:::bad_args(...)
└─dplyr:::glubort(fmt_args(args), ..., .envir = .envir)
Execution halted
It seems to us that there are no identical keys to match up the two dataframes. When we looked at the input file, created from the previous scripts, there is no column called sample. The sample_reference.csv file does have a "WGS ID" column.
Is it possible to join these files and continue with the scripts?
Another thing which bothers us is that the PGDX10346P1 in the code is the name of a bam file. Do we have to change it to the bam file which we use to run the scripts? But first, the inner_join problem.
Could anyone help us fix this error message? The authors of the paper told us the man who wrote the script is currently unavailable because of medical reasons, so we can't ask them.
Please keep in mind that we are biology students, not informatics or mathematics students, so we are not very good with coding.
With kind regards, School of Life Sciences, Avans University of Applied Sciences, Breda, The Netherlands.
This is a pretty abstract question, but since GPT-2 is good at predicting the next word, is it possible to ask the model whether a sentence is idiomatic, well formed, or normal by asking it for the probability of each word appearing at its position in the sentence?
Does something like this exist? I was unable to figure out an easy way to do this using the source here
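To make the idea concrete, here is a minimal sketch of scoring a sentence by the conditional probability of each word given what came before. It does not use GPT-2: a tiny add-one-smoothed bigram model stands in for it, and the corpus and the function names (`train_bigram`, `token_probabilities`, `perplexity`) are made up for illustration. With a neural language model the per-word probabilities would come from the model's predicted distribution at each position instead of from bigram counts, but the scoring step would look the same.

```python
import math
from collections import Counter

# Tiny hypothetical corpus; a real language model is trained on far more text.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, with <s> marking the start of a sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams, set(unigrams)


def token_probabilities(sentence, unigrams, bigrams, vocab):
    """Return (word, P(word | previous word)) pairs using add-one smoothing."""
    tokens = ["<s>"] + sentence.split()
    vocab_size = len(vocab)
    pairs = []
    for prev, word in zip(tokens, tokens[1:]):
        # Add-one smoothing keeps unseen bigrams from getting probability zero.
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        pairs.append((word, p))
    return pairs


def perplexity(sentence, unigrams, bigrams, vocab):
    """Exponentiated average negative log probability; lower means more expected."""
    probs = [p for _, p in token_probabilities(sentence, unigrams, bigrams, vocab)]
    avg_log_prob = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(-avg_log_prob)


unigrams, bigrams, vocab = train_bigram(CORPUS)
for text in ["the cat sat on the mat", "mat the on sat cat the"]:
    print(text, "->", round(perplexity(text, unigrams, bigrams, vocab), 2))
    for word, p in token_probabilities(text, unigrams, bigrams, vocab):
        print(f"    {word}: {p:.3f}")
```

Words with an unusually low probability point at the spot where the sentence stops looking normal, and the perplexity gives a single length-normalised score for comparing sentences. Any threshold for "well formed" would have to be chosen empirically.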
Hey ML folks,
I'd like to invite you all to a live webinar, "MLOps and Data Quality: How to Deploy Reliable ML Models in Production", by Provectus and AWS. Time and place: Online; February 24, 11AM PT | 2PM ET.
Register: https://provectus.com/webinar-mlops-and-data-quality-deploying-reliable-ml-models-feb-2021/
We will discuss what goes into building such fundamental components of machine learning infrastructure as: