Mike Stone - How Reliable and Accurate are Genomes?

Steven Avery

Mike Stone

HOW RELIABLE AND ACCURATE ARE GENOMES?:

One argument people offer as proof of "viruses" is the existence of "viral" genomes. They believe that if a genome can be sequenced from unpurified cell culture soup where a "virus" is assumed to exist, this is proof enough that a "virus" physically exists. Looking beyond the irony of claiming that random A, C, G, and T's in a computer database can somehow serve as evidence for the physical existence of an unseen entity, there are numerous reasons to question the reliability and accuracy of genomes. These include, but are not limited to, the reliance on inaccurate reference genomes, the inability to replicate results, the technological hurdles specific to each sequencing platform, the introduction of biases, errors, and artefacts, the uncurated databases, and the various assumptions that are made. It is utterly ridiculous to believe that these non-reproducible and error-prone sequences from unpurified cell culture soup can be used as INDIRECT proof of a "virus" when the DIRECT proof, i.e. purified/isolated particles taken directly from sick humans and proven pathogenic in a natural way, has yet to be scientifically established.

Below are highlights from one article and one review showcasing many of these faults:

ACCURACY OF HUMAN DNA SEQUENCING
"But just HOW ACCURATE IS DNA SEQUENCING AND ITS DATA STORAGE TECHNIQUES? What effect do these INACCURACIES HAVE ON GENOMICS and their use in pharmacogenetics?

Throughout the course of the Human Genome Project, there have been VARYING LEVELS OF TARGET ACCURACIES that the research institutes have aimed for. In 2000, the first draft was released with an ERROR RATE OF ONE ERROR PER EVERY 1,000 BASE PAIRS. In 2003, the official results were cited to have an ERROR RATE OF ONE PER EVERY 10,000 BASE PAIRS [1]. Currently, this requires going through and sequencing the DNA a total of ten times to achieve that level of accuracy [3]. Known as the Bermuda Standards, the international standard for accuracy is currently held at one error per 10,000 base pairs for the entire contiguous sequence – THE DNA IS SEQUENCED IN PARTS, AND OFTEN TIMES, GAPS EXIST BETWEEN THESE DIFFERENT PARTS. Regardless of how accurate this process of sequencing MAY SEEM, through the sequencing of the entire human genome, THIS YIELDS A TOTAL OF APPROXIMATELY 300,000 BASE PAIR ERRORS.

But how significant is a 0.01% error rate? The Human Genome Project has brought attention to the significance of single nucleotide polymorphisms (SNPs). SNPs are NATURAL DNA SEQUENCING VARIATIONS of a single nucleotide (A, T, C or G) that occur every 100 to 300 base pairs [5]. THE VARIATIONS CAUSED BY SNP CAN DRAMATICALLY AFFECT HOW HUMANS REACT DIFFERENTLY TO THINGS SUCH AS DRUGS, VACCINES, OR DISEASES. However, BECAUSE OF THE INHERENT AND ALLOWABLE ERRORS for companies such as 23andMe that sequence DNA, THEIR RESULTS WILL CERTAINLY SEQUENCE SOME SNPs INACCURATELY. The problem is that companies like 23andMe expect to use their DNA sequencing results to provide medical advice for the participants and their doctors so that they can better prescribe more accurate drug dosages. However, WITH OVER 300,000 BASE PAIR ERRORS, HOW ACCURATE CAN THIS MEDICAL ADVICE BE? If the capabilities and limitations of the human body are sensitive down to the individual nucleotide (as with SNP), CAN HUMAN GENOME SEQUENCING BE RELIABLE ENOUGH TO SERVE ITS PURPOSE AS A SOURCE FOR PERSONALIZED MEDICAL INFORMATION COMPLETELY DEPENDENT ON HUMAN DNA?"
https://cs.stanford.edu/.../2010-11/Genomics/accuracy.html
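The figures in the excerpt above can be checked with a few lines of arithmetic. This is only a rough sketch: the genome size of 3 billion base pairs is an assumption (it is the round figure implied by the quoted 300,000-error total), and the function names are made up for illustration.

```python
# Assumption: a human genome of roughly 3 billion base pairs, the round
# figure implied by the quoted total of about 300,000 errors.
GENOME_SIZE_BP = 3_000_000_000


def expected_errors(genome_size_bp, bases_per_error):
    """Expected erroneous bases at a rate of one error per `bases_per_error`."""
    return genome_size_bp // bases_per_error


def error_rate_percent(bases_per_error):
    """Convert 'one error per N bases' into a percentage."""
    return 100.0 / bases_per_error


draft_2000 = expected_errors(GENOME_SIZE_BP, 1_000)      # 2000 draft
finished_2003 = expected_errors(GENOME_SIZE_BP, 10_000)  # 2003 standard

# SNPs are quoted as occurring once every 100 to 300 base pairs.
snp_sites_low = GENOME_SIZE_BP // 300
snp_sites_high = GENOME_SIZE_BP // 100

print("errors at 1 per 1,000:", draft_2000)
print("errors at 1 per 10,000:", finished_2003)
print("error rate (%):", error_rate_percent(10_000))
print("SNP sites (range):", snp_sites_low, "to", snp_sites_high)
```

Under these assumptions, one error per 10,000 base pairs works out to 0.01% and about 300,000 errors genome-wide, set against somewhere between 10 and 30 million SNP positions.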

In Summary (Part 1):
-there have been VARYING LEVELS OF TARGET ACCURACIES that the research institutes have aimed for
-in 2000, the first draft was released with an error rate of ONE ERROR PER EVERY 1,000 BASE PAIRS
-in 2003, the official results were cited to have an ERROR RATE OF ONE PER EVERY 10,000 BASE PAIRS
-the DNA is sequenced in parts, and often times, GAPS EXIST BETWEEN THESE DIFFERENT PARTS
-this yields a total of approximately 300,000 BASE PAIR ERRORS
-SNPs are NATURAL DNA SEQUENCING VARIATIONS of a single nucleotide (A, T, C or G) that occur every 100 to 300 base pairs
-the VARIATIONS caused by SNP can DRAMATICALLY AFFECT how humans react differently to things such as drugs, vaccines, or diseases
-because of the INHERENT AND ALLOWABLE ERRORS for companies such as 23andMe that sequence DNA, their results will certainly sequence some SNPs INACCURATELY
-with over 300,000 base pair errors, HOW ACCURATE can this medical advice be?
-can human genome sequencing BE RELIABLE ENOUGH to serve its purpose as a source for personalized medical information completely dependent on human DNA?
From a 2019 Review:

IS RELIANCE ON AN INACCURATE GENOME SEQUENCE SABOTAGING YOUR EXPERIMENTS?
"However, new technologies and algorithmic advances DO NOT GUARANTEE FLAWLESS GENOMICS SEQUENCES OR ANNOTATION. BIAS, ERRORS, AND ARTIFACTS can enter at any stage of the process from library preparation to annotation."

"ALL GENOME SEQUENCES HAVE “ISSUES”
There are MANY FACTORS that can affect the ultimate genome sequence and annotation that are produced, and both SHOULD BE CONSIDERED “WORKS IN PROGRESS.”"
"What is the origin of the sample used to generate the genome sequence?

THE ORIGIN MATTERS. Did the sample originate from a clone, a mixed population (common with microbes), or possibly a hybrid? Differences between individuals can be single nucleotide polymorphisms (SNPs), but often they involve INSERTIONS OR DELETIONS (indels) OF VARIOUS SIZES, COPY NUMBER VARIATIONS (CNV), AND EVEN SMALL REARRANGEMENTS. Hybrids can have dramatic differences between orthologous chromosomes [1]. Genome sequences derived from a heterogenous population, especially when CNVs exist, COMPLICATE GENOME ASSEMBLY, and often THE SEQUENCE PRODUCED IS A COMPOSITE of the major alleles present in the sequenced sample. Genome sequences derived from clonal laboratory strains are often easier to assemble, BUT THEY MAY NOT BE TRULY REPRESENTATIVE OF CIRCULATING WILD TYPE STRAINS because they are adapted to culture and, if propagated for a long time, MAY HAVE LOST GENES OR ACCUMULATED MUTATIONS [2]."

"Does the genome have troublesome characteristics?
Some genome sequences are physically difficult to sequence BECAUSE OF EXTREME NUCLEOTIDE BIAS."
"Long homopolymeric runs of any base are PARTICULARLY TROUBLESOME for some sequencing technologies [4] and MAY LEAD TO AN INCORRECT NUMBER OF NUCLEOTIDES, resulting in frame-shifts if the sequence is coding."

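To make the homopolymer point concrete, here is a small sketch of how a single miscounted base in a long run shifts the reading frame. The sequence is invented for illustration, and the threshold of 6 bases for a "long" run is an assumption, not a figure from the review.

```python
import re


def homopolymer_runs(seq, min_len=6):
    """Return (start, base, length) for each run of one base >= min_len long."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group())) for m in pattern.finditer(seq)]


def codons(seq):
    """Split a sequence into complete triplets, dropping any trailing bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


# Hypothetical coding sequence with a run of seven A's and six T's.
true_seq = "ATG" + "A" * 7 + "GC" + "T" * 6 + "GA"
# The same sequence as a platform might report it if it undercounts the A run.
miscalled = "ATG" + "A" * 6 + "GC" + "T" * 6 + "GA"

print(homopolymer_runs(true_seq))
print(codons(true_seq))
print(codons(miscalled))
```

Every codon downstream of the miscounted run differs between the two versions, which is the frame-shift the review describes.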
"If the genome sequence CONTAINS NUMEROUS REPETITIVE SEQUENCES, retrotransposons or mobile elements, or large, highly similar gene families, THE GENOME ASSEMBLY WILL BE AFFECTED (Fig 1), especially if only short-read sequences were used."

"Repetitive sequences are a HUGE CHALLENGE for most assembly algorithms."

"Low-coverage, LESS ACCURATE, long-molecule reads can be used as a framework upon which shorter-read sequences can be mapped"

"There is an easy way to assess the quality of your organism’s genome assembly. Map the reads from the sequencing project back to the ASSEMBLED GENOME SEQUENCE and have a look."

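The "map the reads back and have a look" check can be sketched as a per-base coverage count. This toy version places reads by exact string match only, which is an assumption made to keep it short; real read mappers tolerate mismatches and work on far larger inputs. The assembly and reads are invented.

```python
def coverage(assembly, reads):
    """Count, for each assembly position, how many exact read placements cover it."""
    depth = [0] * len(assembly)
    for read in reads:
        start = assembly.find(read)
        while start != -1:
            for i in range(start, start + len(read)):
                depth[i] += 1
            start = assembly.find(read, start + 1)
    return depth


# Hypothetical assembled sequence and the reads it was built from.
assembly = "ACGTTGCAAGGCTA"
reads = ["ACGTTG", "GTTGCA", "TGCAAG", "AAGGCT"]

depth = coverage(assembly, reads)
uncovered = [i for i, d in enumerate(depth) if d == 0]

print(depth)
print("positions with no read support:", uncovered)
```

Positions with zero depth have no read support, and positions with unusually high depth relative to their neighbours are the kind of signal that can point to a collapsed repeat.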
"The reference genome assembly for the apicomplexan parasite Toxoplasma gondii ME49 contains several collapsed regions that vary by strain (Fig 1C) [8]. DESPITE THE HIGH QUALITY OF THIS GENOME SEQUENCE AND ITS CORRESPONDENCE TO GENETIC MAPS, ISSUES RELATED TO THE NUMBER OF CHROMOSOMES STILL EXIST [13, 14]."

"GENOME SEQUENCES THAT RELIED ON CLONING AND BIOLOGICAL REPLICATION HAVE ADDITIONAL ISSUES THAT NEED TO BE CONSIDERED. SOME SEQUENCES SIMPLY CANNOT BE CLONED; they are toxic to the organism used for cloning and replication and thus, WILL BE MISSING IN THE GENOME SEQUENCE PRODUCED. Unclonable sequences often contain a few select genes and heterochromatin. The inverse is also true; A DNA SEQUENCE FROM THE CLONING VECTOR OR ORGANISM USED TO CONSTRUCT THE LIBRARY CAN END UP IN THE ASSEMBLED TARGET GENOME SEQUENCE."
"HIGH-THROUGHPUT NGS LIBRARY PREPARATION PLAYS A CRITICAL ROLE WITH RESPECT TO THE QUALITY OF THE GENOME SEQUENCE PRODUCED. Many protocols contain amplification steps, WHICH CAN INTRODUCE BIAS. For example, single cells can be used for genome sequencing but via the application of whole genome amplification (WGA). The approach is powerful when material is limited, but the amplification process is biased, and several different WGA reactions (on different cells or populations of like cells) are necessary to fully identify and remove the amplification bias [15, 16]. IT SHOULD BE NOTED THAT BIAS IS RARELY REMOVED FROM THE READS SUBMITTED TO ARCHIVES, so it is imperative to know if WGA was utilized."
"What sequencing platform was used?
DIFFERENT SEQUENCING PLATFORMS HAVE DIFFERENT STRENGTHS AND WEAKNESSES [9], and they continue to evolve rapidly and often complement each other if several different approaches are applied. Genome sequences assembled with Sanger chemistry will have good quality sequence, BUT THE ASSEMBLED GENOME SEQUENCE WILL BE AFFECTED BY THE LIBRARY ISSUES MENTIONED PREVIOUSLY. Genome sequences generated with legacy systems, e.g., 454 and Ion Torrent, WILL HAVE HOMOPOLYMER MISCOUNT ISSUES. Newer genome sequences will consist of highly accurate Illumina short-read technology, BUT THE ASSEMBLED SEQUENCE, especially if repeats are present, WILL BE INCOMPLETE AND CONTAIN GAPS AND MIS-ASSEMBLIES unless a hybrid assembly using long-read technologies like PacBio or Oxford Nanopore are utilized.
How was the genome assembled?
Sequence assemblies are of two types: de novo, assembled from scratch, and reference-based. THE LATTER IS NORMALLY USED WHEN AN ESTABLISHED ORGANISMAL REFERENCE GENOME ALREADY EXISTS AND THE EXPERIMENTAL GOAL IS TO DETERMINE VARIATION WITH RESPECT TO IT. IT IS NOT A GOOD APPROACH TO DETECT REARRANGEMENTS OR SYNTENIC BREAKS, but it is ideal to detect SNPs, some indels, and CNV. REFERENCE-BASED APPROACHES WILL NOT REVEAL GENOME FEATURES NOT PRESENT IN THE REFERENCE, A SIGNIFICANT DRAWBACK. Due to the large volume of population studies focused on SNPs, MOST GENOME SEQUENCE DATA, SADLY, REMAIN AS UNASSEMBLED FILES OF READS.
DE NOVO ASSEMBLIES ARE THE ONLY OPTION FOR AN ORGANISM’S FIRST GENOME SEQUENCE, and when possible, they should be performed as a matter of practice to permit discovery of new features. In the case of eukaryotic genome sequences, especially when the karyotype is UNKNOWN AND PHYSICAL MAPS DO NOT EXIST, READS CAN ONLY BE PARTIALLY ASSEMBLED into contiguous reads, “contigs,” or scaffolds of contigs, CONTAINING GAPS. Contigs often contain sequences that are fairly unique because REPETITIVE SEQUENCES ARE OFTEN “MASKED” in a de novo assembly because of the issues they cause. As a result, contigs often end at, or are separated by, MISSING REPETITIVE REGIONS THAT WERE NOT UTILIZED (e.g., masked) OR COULD NOT BE RESOLVED DURING THE ASSEMBLY. VARIATION FOUND AT THE ENDS OF CONTIGS SHOULD BE TREATED WITH CAUTION. Gaps between contigs that have been ordered and oriented into scaffolds are often indicated by exactly 100 “N’s” to indicate a gap of unknown size. In some cases, scaffolds representative of whole chromosomes are assembled, but these, too, often contain numerous gaps or ambiguous bases (Table 1). SOME ASSEMBLERS ALSO CREATE A SCAFFOLD THAT LINKS TOGETHER ALL “LEFTOVER” CONTIGS. BEWARE OF THIS SCAFFOLD, often named “scaffold 0,” AS THE ORDER AND ORIENTATION OF THESE CONTIGS BEARS NO RESEMBLANCE TO THEIR BIOLOGICAL LOCATION; it is simply a convenient mechanism to make sure all contigs are available to those using or searching the genome sequence."
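Since gaps of unknown size are often written as a run of exactly 100 N's, they can be located in a scaffold with a simple scan. The sketch below uses an invented scaffold string, and the helper names are hypothetical; it only illustrates the convention described in the passage above.

```python
import re


def find_gaps(scaffold):
    """Return (start, length) for every run of N/n characters in the scaffold."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"[Nn]+", scaffold)]


def split_contigs(scaffold):
    """Split a scaffold back into its contigs at the gap placeholders."""
    return [c for c in re.split(r"[Nn]+", scaffold) if c]


# Hypothetical scaffold: three short contigs joined by 100-N placeholders.
scaffold = "ACGTACGT" + "N" * 100 + "GGCCTTAA" + "N" * 100 + "TTGACA"

gaps = find_gaps(scaffold)
contigs = split_contigs(scaffold)
unknown_size_gaps = [g for g in gaps if g[1] == 100]

print(gaps)
print(contigs)
print("gaps using the 100-N convention:", len(unknown_size_gaps))
```

A gap length of exactly 100 says nothing about the real distance between the flanking contigs, so any analysis that measures distances across such a placeholder inherits that uncertainty.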
"If a reference genome sequence is already available, you can use unassembled reads to detect sequence variants and CNVs much faster without assembly."
"EACH TYPE OF SEQUENCE ASSEMBLY COMES WITH A SET OF INHERENT ISSUES, and most genome sequence projects produce an assortment of leftover reads and contigs THAT DO NOT ASSEMBLE. In some cases, THESE READS CAN BE IDENTIFIED AS CONTAMINATION, AN UNEXPECTED SYMBIONT, OR ORGANELLAR GENOME SEQUENCE. In other cases, THE LEFTOVER BITS ARE A TELL-TALE SIGN OF PARTICULAR TYPES OF ASSEMBLY ERRORS OR UNEXPECTED GENOME SEQUENCE VARIATION, e.g., CNV (Fig 1) OR HIGH LEVELS OF HETEROZYGOSITY BETWEEN ALLELES (especially if a population was sequenced, rather than an individual)."
"Was the genome sequence “corrected,” and if so, how?
ERROR-PRONE LONG-SEQUENCE READS can be corrected prior to assembly using proovread [21]. CORRECTION PRIOR TO ASSEMBLY CAN FACILITATE ASSEMBLY WHEN THE ERROR RATE IS HIGH, e.g., in low-coverage PacBio reads. ASSEMBLED GENOME SEQUENCES CAN ALSO BE “POLISHED.” Polishing involves base call correction, and ICORN2 [22] is a popular tool. Polishing is performed using highly accurate Illumina reads mapped back against the final genome assembly. Read correction and polishing are useful and recommended steps, but THEY ARE HIGHLY DEPENDENT ON THE PERFORMANCE OF THE ALIGNER, and the end user must be aware that the CORRECTED AND POLISHED SEQUENCES WILL REPRESENT THE MOST ABUNDANT ALLELES PRESENT IN THE READS. In other words, ISOFORMS AND RARE VARIANTS OF REPETITIVE SEQUENCES WILL BE “CORRECTED,” i.e., OVERWRITTEN, IN THE FINAL ASSEMBLY BY MORE ABUNDANT SEQUENCE VARIANTS."
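The statement that polished sequences represent the most abundant alleles can be illustrated with a column-wise majority vote. This is a deliberate simplification and not how ICORN2 or any specific polishing tool is implemented; the reads are invented and assumed to be already aligned and of equal length.

```python
from collections import Counter


def majority_consensus(aligned_reads):
    """Column-wise majority vote over equal-length, pre-aligned reads."""
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Hypothetical pile-up: three reads agree, one carries a rare variant (G -> T).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACTTAC"]
rare_variant = reads[3]

consensus = majority_consensus(reads)

print(consensus)
print("rare variant kept in consensus:", consensus == rare_variant)
```

The minority read is real data, yet it leaves no trace in the consensus, which is the sense in which rare variants are "overwritten" by more abundant ones.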
"GENE PREDICTIONS ARE GENOME-ASSEMBLY DEPENDENT, WHICH MEANS IF A REGION IS MISSING, IT CANNOT BE ANNOTATED. Likewise, IF THE REGION IS POORLY ASSEMBLED OR MISSING IN A REFERENCE GENOME SEQUENCE USED FOR ORTHOLOGS, IT MAY END UP MISSING IN THE GENOME SEQUENCE THAT IS BEING ANNOTATED. A good example is Cryptosporidium. The genome sequence for C. parvum was released in 2004, with a state-of-the-art assembly and annotation for the time [27]. This genome sequence was used as the reference sequence for several additional Cryptosporidium strains and species [28, 29]. This practice can be dangerous, as one of the genome features that facilitates speciation is genome rearrangement, which affects chromosome pairing during reproduction. AS THERE ARE NO GENETIC SYSTEMS FOR MANY PATHOGENS THAT CAN BE USED TO GENERATE A PHYSICAL MAP, reference mapping is useful, BUT IT IS EASY TO FORGET THE ORIGINS OF GENOME SEQUENCE ASSEMBLIES AND ANNOTATION CREATED OR PROPAGATED IN THIS WAY, SO CARE MUST BE EXERCISED WHEN USING REFERENCE-MAPPED GENOME ASSEMBLIES AS THE BASIS FOR EXPERIMENTS."
"The gene is annotated as single copy, is it?
Additional copies of genes can thwart experiments designed to target, clone, delete, or modify a particular gene. The annotation may indicate a single-copy gene, but DEPENDING ON THE TECHNOLOGY USED TO GENERATE YOUR GENOME SEQUENCE, NEARLY IDENTICAL COPIES OF GENES CAN BECOME ASSEMBLED AS ONE GENE (short-read only assemblies are most prone to this issue), and slightly divergent gene family members, especially if they are in tandem repeats, OFTEN DON'T ASSEMBLE AND CAN BE FOUND IN THE LEFTOVER READS OR SMALL UNASSEMBLED CONTIGS (Fig 1)."
"The annotation doesn’t describe your gene. Is it really missing from the genome?
IT IS EASY TO BE MISLED ON THE BASIS OF EXISTING ANNOTATION that a gene is missing. Genes can be lost, and they do decay or evolve beyond recognition, BUT THEY MAY ALSO BE MISSING BECAUSE OF A SEQUENCE ASSEMBLY GAP."
"Alternatively, the region may be missing from the genome assembly, i.e., a gap relative to the comparator sequence. MISASSEMBLIES AND GAPS CAN PROVIDE THE ILLUSION OF MISSING GENES, WHEN IN REALITY, THEY ARE MISSING FROM THE ASSEMBLY, HAVE EVOLVED INTO PSEUDO GENES, OR, IN SOME CASES, HAVE BEEN REPLACED BY A HORIZONTAL GENE TRANSFER LOCATED ELSEWHERE IN THE GENOME.
GENOME SEQUENCE GAPS HAVE MANY DOWNSTREAM CONSEQUENCES. The number of genes MAY BE REDUCED relative to the actual number, and ironically, the number of genes CAN ALSO BE INFLATED because a portion of the same gene can be found on each side of the gap, RESULTING IN TWO PARTIAL PREDICTIONS. Small assembly gaps often lead to frameshifts in coding sequences, which, in turn, LEAD TO AN ARTIFICIAL INCREASE IN THE NUMBER OF PSEUDOGENES, when, in reality, the culprit is an assembly gap. Gaps can also indicate the location of a missing tandem array of genes or repeat sequences that COULD NOT BE PROPERLY ASSEMBLED (Fig 1C)."
"Can I trust the annotation?
Some organismal genome sequences are continuously curated by the community or experts and have a good, recent genome annotation (Table 1). However, ANNOTATORS CANNOT ANNOTATE WHAT DOES NOT EXIST (e.g., GAPS). Eukaryotic genome sequences, especially from animal, vector, or plant hosts, are complex, and even with continuous curation, there is much more to be fixed and discovered as new sequence technology, assembly algorithms, and experimental evidence appear. For example, UNTRANSLATED REGIONS AND NONCODING RNAs AREN'T ROUTINELY ANNOTATED. ALL GENOME SEQUENCES AND THEIR ANNOTATION ARE “WORKS IN PROGRESS” AND ARE STATIC REPRESENTATIVES OF ONE POINT IN TIME FOR A CONTINUOUSLY EVOLVING MOLECULE WITHIN A GENETICALLY DIVERSE POPULATION."
"Does the annotation affect pathway analyses?
Yes. Studies aimed at drug target discovery often look for a gene that appears to be essential to a pathway. Once discovered, the gene is knocked out, and to everyone’s dismay, it was not essential, and the organism survives in the presence of drug. There are many reasons this may have happened, which range from the ability of the drug to reach the target to the possibility that the assessment of essentiality is flawed. ERRORS IN THE ANNOTATION OR THE ASSEMBLY CAN ALSO LEAD TO THIS RESULT. For example, the gene may not be single copy, or the knockout construct behaved oddly and targeted a related or additional gene copy of the target, producing unusual or hard to interpret results. Alternatively, THE LARGE PROPORTION OF GENES OF UNKNOWN FUNCTION (AS HIGH AS 40% IN SOME ORGANISMS) ENCODE FUNCTIONS THAT ALLOW THE ORGANISM TO CIRCUMVENT THE KNOCKOUT. Much work is still needed on this important class of genes."
"Some genome sequences will require additional approaches beyond long reads, such as Hi-C (chromatin conformation capture) [35], Chicago library methodologies [36], or optical mapping [37]. Truly difficult genome sequences can be hexaploid (like wheat), have enormous numbers of scaffolds (like Ixodes scapularis, which has >350,000), be littered with highly similar repeat elements (like T. vaginalis), or suffer from extreme heterogeneity and length differences between sister chromosomes (as in the hybrid T. cruzi). SOME GENOME SEQUENCES HAVE ALREADY BEEN “FIXED” WITH THESE NEW TECHNOLOGIES, BUT THERE IS STILL SIGNIFICANT WORK REQUIRED TO MAKE THEM AS GOOD AS THEY CAN BE."
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6742220/
In Summary (Part 2):
-new technologies and algorithmic advances DO NOT GUARANTEE flawless genomic sequences or annotation
-BIAS, ERRORS, and ARTEFACTS can enter at any stage of the process from library preparation to annotation
-ALL genome sequences have “ISSUES”
-there are MANY FACTORS that can affect the ultimate genome sequence and annotation that are produced, and both SHOULD BE CONSIDERED “WORKS IN PROGRESS”
-the ORIGIN of the genome MATTERS -whether it originates from a clone, a mixed population (common with microbes), or possibly a hybrid
-differences between individuals can be single nucleotide polymorphisms (SNPs), but often they involve INSERTIONS OR DELETIONS (indels) OF VARIOUS SIZES, COPY NUMBER VARIATIONS (CNV), AND EVEN SMALL REARRANGEMENTS
-hybrids can have DRAMATIC DIFFERENCES between orthologous chromosomes
-genome sequences derived from a heterogenous (diverse in content) population, especially when CNVs exist, COMPLICATE GENOME ASSEMBLY, and often THE SEQUENCE PRODUCED IS A COMPOSITE of the major alleles present in the sequenced sample
-genome sequences derived from CLONAL laboratory strains are often easier to assemble, BUT THEY MAY NOT BE TRULY REPRESENTATIVE OF CIRCULATING WILD TYPE STRAINS because they are adapted to culture and, if propagated for a long time, MAY HAVE LOST GENES OR ACCUMULATED MUTATIONS
-some genome sequences are physically difficult to sequence BECAUSE OF EXTREME NUCLEOTIDE BIAS
-long homopolymeric runs of any base are PARTICULARLY TROUBLESOME for some sequencing technologies and MAY LEAD TO AN INCORRECT NUMBER OF NUCLEOTIDES, resulting in frame-shifts if the sequence is coding
-if the genome sequence CONTAINS NUMEROUS REPETITIVE SEQUENCES, retrotransposons or mobile elements, or large, highly similar gene families, THE GENOME ASSEMBLY WILL BE AFFECTED, especially if only short-read sequences were used
-repetitive sequences are a HUGE CHALLENGE for most assembly algorithms
-low-coverage, LESS ACCURATE, long-molecule reads can be USED AS A FRAMEWORK upon which shorter-read sequences can be mapped
-they state that there is an easy way to assess the quality of the organism’s genome assembly, which is to map the reads from the sequencing project BACK TO THE ASSEMBLED GENOME SEQUENCE and have a look (however, if the reference genome is inaccurate... 🤷‍♂️)
-case in point: the REFERENCE GENOME ASSEMBLY for the apicomplexan parasite Toxoplasma gondii ME49 contains several collapsed regions that VARY BY STRAIN and DESPITE THE HIGH QUALITY of this genome sequence and its correspondence to genetic maps, ISSUES RELATED TO THE NUMBER OF CHROMOSOMES STILL EXIST
-genome sequences that relied on CLONING AND BIOLOGICAL REPLICATION HAVE ADDITIONAL ISSUES that need to be considered
-some sequences SIMPLY CANNOT BE CLONED; they are TOXIC to the organism used for cloning and replication and thus, WILL BE MISSING IN THE GENOME SEQUENCE PRODUCED
-a DNA sequence from the cloning vector or organism used to construct the library CAN END UP IN THE ASSEMBLED TARGET GENOME SEQUENCE
-in other words, unwanted DNA sequences from other organisms used for cloning find their way into the new genome
-high-throughput NGS library preparation plays a CRITICAL ROLE WITH RESPECT TO THE QUALITY OF THE GENOME SEQUENCE produced and many protocols contain amplification steps, WHICH CAN INTRODUCE BIAS
-it should be noted that BIAS IS RARELY REMOVED FROM THE READS SUBMITTED TO ARCHIVES
-different sequencing platforms have different strengths and weaknesses
-genome sequences assembled with Sanger chemistry will have good quality sequence, BUT THE ASSEMBLED GENOME SEQUENCE WILL BE AFFECTED BY THE LIBRARY ISSUES MENTIONED PREVIOUSLY
-genome sequences generated with legacy systems, e.g., 454 and Ion Torrent, WILL HAVE HOMOPOLYMER MISCOUNT ISSUES
-newer genome sequences will consist of highly accurate Illumina short-read technology, BUT THE ASSEMBLED SEQUENCE, especially if repeats are present, WILL BE INCOMPLETE AND CONTAIN GAPS AND MIS-ASSEMBLIES
-REFERENCE-BASED ASSEMBLIES are normally used when an ESTABLISHED ORGANISMAL REFERENCE GENOME ALREADY EXISTS and the experimental goal is to determine variation with respect to it
-Drawbacks to Reference-Based Assemblies:
1. It is not a good approach to detect rearrangements or syntenic breaks
2. Reference-based approaches WILL NOT REVEAL GENOME FEATURES NOT PRESENT IN THE REFERENCE, a significant drawback
-due to the large volume of population studies focused on SNPs, MOST GENOME SEQUENCE DATA, SADLY, REMAIN AS UNASSEMBLED FILES OF READS
-De novo assemblies are the ONLY OPTION FOR AN ORGANISM’S FIRST GENOME SEQUENCE
-in the case of eukaryotic genome sequences, especially when the karyotype is UNKNOWN AND PHYSICAL MAPS DO NOT EXIST, READS CAN ONLY BE PARTIALLY ASSEMBLED into contiguous reads, “contigs,” or scaffolds of contigs, CONTAINING GAPS
-contigs often contain sequences that are fairly unique because REPETITIVE SEQUENCES ARE OFTEN “MASKED” in a de novo assembly BECAUSE OF THE ISSUES THEY CAUSE
-as a result, contigs often end at, or are separated by, MISSING REPETITIVE REGIONS THAT WERE NOT UTILIZED (e.g., masked) OR COULD NOT BE RESOLVED DURING THE ASSEMBLY
-variation found at the ends of contigs should be treated with caution
-some assemblers also create a scaffold that links together all “leftover” contigs, often named “scaffold 0,” but the order and orientation of these contigs BEARS NO RESEMBLANCE TO THEIR BIOLOGICAL LOCATION
-each type of sequence assembly COMES WITH A SET OF INHERENT ISSUES, and most genome sequence projects produce an assortment of leftover reads and contigs THAT DO NOT ASSEMBLE
-in some cases, these reads can be identified as:
1. CONTAMINATION
2. UNEXPECTED SYMBIONT
3. ORGANELLAR GENOME SEQUENCE
-in other cases, the leftover bits are a TELL-TALE SIGN of:
1. PARTICULAR TYPES OF ASSEMBLY ERRORS
2. UNEXPECTED GENOME SEQUENCE VARIATION, e.g., CNV (Fig 1) or HIGH LEVELS OF HETEROZYGOSITY between alleles (especially if a population was sequenced, rather than an individual)
-assembled genome sequences can also be “polished”
-however, "polishing" is HIGHLY DEPENDENT ON THE PERFORMANCE OF THE ALIGNER, and the end user must be aware that the corrected and polished sequences will represent the most abundant alleles present in the reads
-in other words, isoforms and rare variants of repetitive sequences will be “CORRECTED,” i.e., OVERWRITTEN, in the final assembly by more abundant sequence variants
-gene predictions are GENOME-ASSEMBLY DEPENDENT, which means if a region is missing, it cannot be annotated
-IF THE REGION IS POORLY ASSEMBLED OR MISSING IN A REFERENCE GENOME sequence used for orthology, IT MAY END UP MISSING IN THE GENOME SEQUENCE THAT IS BEING ANNOTATED
-as there are NO GENETIC SYSTEMS FOR MANY PATHOGENS THAT CAN BE USED TO GENERATE A PHYSICAL MAP, reference mapping is useful, BUT IT IS EASY TO FORGET THE ORIGINS OF GENOME SEQUENCE ASSEMBLIES AND ANNOTATION CREATED OR PROPAGATED IN THIS WAY, so care must be exercised when using reference-mapped genome assemblies as the basis for experiments
-depending on the technology used to generate the genome sequence, NEARLY IDENTICAL COPIES OF GENES CAN BECOME ASSEMBLED AS ONE GENE (short-read only assemblies are most prone to this issue)
-slightly divergent gene family members, especially if they are in tandem repeats, OFTEN DON'T ASSEMBLE AND CAN BE FOUND IN THE LEFTOVER READS OR SMALL UNASSEMBLED CONTIGS
-IT IS EASY TO BE MISLED ON THE BASIS OF EXISTING ANNOTATION that a gene is missing
-genes can be lost, and they do decay or evolve beyond recognition, BUT THEY MAY ALSO BE MISSING BECAUSE OF A SEQUENCE ASSEMBLY GAP
-MISASSEMBLIES AND GAPS CAN PROVIDE THE ILLUSION OF MISSING GENES, when in reality, THEY ARE MISSING FROM THE ASSEMBLY, have evolved into pseudogenes, or, in some cases, have been replaced by a horizontal gene transfer located elsewhere in the genome
-genome sequence gaps have many downstream consequences:
1. The number of genes MAY BE REDUCED relative to the actual number, and ironically, the number of genes CAN ALSO BE INFLATED because a portion of the same gene can be found on each side of the gap, RESULTING IN TWO PARTIAL PREDICTIONS
2. Small assembly gaps often lead to frameshifts in coding sequences, which, in turn, LEAD TO AN ARTIFICIAL INCREASE IN THE NUMBER OF PSEUDOGENES, when, in reality, the culprit is an assembly gap
3. Gaps can also indicate the location of a missing tandem array of genes or repeat sequences that COULD NOT BE PROPERLY ASSEMBLED
-annotators cannot annotate what does not exist (e.g., GAPS)
-untranslated regions and noncoding RNAs aren’t routinely annotated
-highlighting important statement:
"ALL GENOME SEQUENCES AND THEIR ANNOTATION ARE “WORKS IN PROGRESS” AND ARE STATIC REPRESENTATIVES OF ONE POINT IN TIME FOR A CONTINUOUSLY EVOLVING MOLECULE WITHIN A GENETICALLY DIVERSE POPULATION."
-errors in the annotation or the assembly can also affect pathway analyses
-the large proportion of genes OF UNKNOWN FUNCTION (as high as 40% in some organisms) encode functions that allow the organism to circumvent the knockout
-in other words, if a drug doesn't perform as expected and they don't want to blame genome assembly, they can blame the unknown functions of certain genes
-some genome sequences have already been “FIXED” with these new technologies, BUT THERE IS STILL SIGNIFICANT WORK REQUIRED TO MAKE THEM AS GOOD AS THEY CAN BE
Finally, some excerpts from a 2008 article by molecular biologist Ulrich Bahnsen about the ever-changing genome:
"The genome was considered to be the unchangeable blueprint of the human being, which is determined at the beginning of our life. SCIENCE MUST BID FAREWELL TO THIS IDEA. IN REALITY, OUR GENETIC MAKE-UP IS IN A STATE OF CONSTANT CHANGE."
"The experts believed they had understood how a gene looks and functions, which functional principles the human or microbial genome follows. "IN RETROSPECT, OUR ASSUMPTIONS ABOUT HOW THE GENOME WORKS BACK THEN WERE SO NAIVE THAT IT IS ALMOST EMBARRASSING," says Craig Venter, who was involved in the project with his company Celera."
"Until then, THE ASSUMPTION HAD BEEN THAT THE GENETIC MATERIAL OF ANY TWO PEOPLE DIFFERED ONLY BY ABOUT ONE PER MILLE OF ALL DNA BUILDING BLOCKS. But the differences in the genetic makeup of humans are in reality so great that science now confirms what the vernacular has long known: "EVERY MAN IS DIFFERENT. COMPLETELY DIFFERENT!"
"THE IDEA THAT THE GENOME REPRESENTS A NATURAL CONSTANT, A FIXED SOURCE CODE OF THE HUMAN BEING, IS NOW CRUMBLING UNDER THE WEIGHT OF THE FINDINGS. The US geneticist Matthew Hahn already compared the genome with a revolving door: "GENES CONSTANTLY COME, OTHERS GO."
https://telegra.ph/Genetics-Genome-in-Dissolution-11-01
After reading the laundry list of problems associated with the creation of genomes and the breakdown in the assumptions of a static genome, how reliable and accurate do you believe these "WORKS IN PROGRESS" truly are?
Reproducibility Crisis in Genomics:
https://m.facebook.com/story.php?story_fbid=10158323807473576&id=502548575
Problems with Reference Genomes:
https://m.facebook.com/story.php?story_fbid=10158058147763576&id=502548575
Problems with "Viral" Genomics:
https://m.facebook.com/story.php?story_fbid=10158051667393576&id=502548575
Problems with "SARS-COV-2" Genomes:
https://m.facebook.com/story.php?story_fbid=10158049233488576&id=502548575
 