Proposed Algorithm to Study DNA Faster

Scientists propose an algorithm to study DNA faster and more accurately

January 18, 2016
Stylized image of DNA. Credit:

A team of scientists from Germany, the United States and Russia, including Dr. Mark Borodovsky, Chair of the Department of Bioinformatics at MIPT, has proposed an algorithm that automates the search for genes and makes it more efficient. The new development combines the advantages of the most advanced tools for working with genomic data. The new method will enable scientists to analyse DNA sequences faster and more accurately and identify the full set of genes in a genome.

Although the paper describing the algorithm only appeared recently in the journal Bioinformatics, which is published by Oxford Journals, the proposed method has already proven to be very popular: the computer program has been downloaded by more than 1500 different centres and laboratories worldwide. Tests of the algorithm have shown that it is considerably more accurate than other similar algorithms.

The development involves applications of the cross-disciplinary field of bioinformatics. Bioinformatics combines mathematics, statistics and computer science to study biological molecules, such as DNA, RNA and protein structures. DNA, which is fundamentally an information molecule, is even sometimes depicted in computerized form (see Fig. 1) in order to emphasize its role as a molecule of biological memory. Bioinformatics is a very topical subject; every new sequenced genome raises so many additional questions that scientists simply do not have time to answer them all. So automating processes is key to the success of any bioinformatics project, and these algorithms are essential for solving a wide variety of problems.

One of the most important areas of bioinformatics is annotating genomes: determining which particular regions of a DNA molecule are used to synthesize RNA and proteins (see Fig. 2). These parts, the genes, are of great scientific interest. In many studies, scientists do not need information about the entire genome (which is around 2 metres long for a single human cell), but only about its most informative part, the genes. Gene sections are identified by searching for similarities between sequence fragments and known genes, or by detecting consistent patterns in the nucleotide sequence. This process is carried out using predictive algorithms.

Locating gene sections is no easy task, especially in eukaryotic organisms, which include almost all widely known types of organism except bacteria. In these cells, the transfer of genetic information is complicated by “gaps” in the coding regions (introns), and there are no definite indicators to determine whether a region is a coding region or not.

Diagram showing the transmission of hereditary information in a cell. Credit:

The algorithm proposed by the scientists determines which regions in the DNA are genes and which are not. The scientists used a Markov chain, which is a sequence of random events in which the probability of each event depends on the events that came before it. The states of the chain in this case are either nucleotides or nucleotide words (k-mers). The algorithm determines the most probable division of a genome into coding and noncoding regions, classifying the genomic fragments in the best possible way according to their ability to encode proteins or RNA. Experimental data obtained from RNA give additional useful information, which can be used to train the model used in the algorithm. Certain gene prediction programs can use this data to improve the accuracy of finding genes. However, these algorithms require type-specific training of the model. For the AUGUSTUS software program, for example, which has a high level of accuracy, a training set of genes is needed. This set can be obtained using another program – GeneMark-ET – which is a self-training algorithm. These two algorithms were combined in the BRAKER1 algorithm, which was proposed jointly by the developers of AUGUSTUS and GeneMark-ET.
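
To make the Markov chain idea more concrete, here is a minimal Python sketch. It is not the model used by GeneMark-ET, AUGUSTUS or BRAKER1; it assumes a first-order chain over single nucleotides, add-one smoothing, and tiny training sequences invented purely for illustration, and all function names are hypothetical. It trains one chain on "coding" examples and one on "noncoding" examples, then scores a fragment by the difference of the two log-likelihoods.

```python
import math

ALPHABET = "ACGT"

def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | current)."""
    # Start every transition at the pseudocount so unseen pairs keep a nonzero probability.
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    probs = {}
    for cur, row in counts.items():
        total = sum(row.values())
        probs[cur] = {nxt: c / total for nxt, c in row.items()}
    return probs

def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[cur][nxt]) for cur, nxt in zip(seq, seq[1:]))

def log_odds(seq, coding, noncoding):
    """Positive values favour the coding model, negative the noncoding one."""
    return log_likelihood(seq, coding) - log_likelihood(seq, noncoding)

# Toy training data (an assumption for illustration, not real genes).
coding_examples = ["ATGGCCGCGGCCGGC", "ATGGGCGCCGCGGCC"]
noncoding_examples = ["ATATATTTAAATATA", "TTTAAATATATATTA"]

coding_model = train_markov(coding_examples)
noncoding_model = train_markov(noncoding_examples)

gc_rich_score = log_odds("GCCGCGGCC", coding_model, noncoding_model)
at_rich_score = log_odds("ATATTTAAA", coding_model, noncoding_model)
print(gc_rich_score > 0, at_rich_score < 0)
```

A real gene finder would use higher-order chains (k-mers as states), separate models for each codon position, and a search over all possible segmentations of the genome rather than a score for one fragment.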

BRAKER1 has demonstrated a high level of efficiency. The developed program has already been downloaded by more than 1500 different centres and laboratories. Tests of the algorithm have shown that it is considerably more accurate than other similar algorithms. As an example, BRAKER1 running on a single processor takes ∼17.5 hours to train and predict genes in a genome 120 megabases long. This is a good result, considering that the time can be reduced significantly by running on parallel processors, so in the future the algorithm might work even faster and more efficiently.

Tools such as these solve a variety of problems. Accurately annotating genes in a genome is extremely important – an example of this is the global 1000 Genomes Project, the initial results of which have already been published. Launched in 2008, the project involves researchers from 75 different laboratories and companies. Sequences of rare gene variants and gene substitutions were discovered, some of which can cause disease. When diagnosing genetic diseases, it is very important to know which substitutions in gene sections cause the disease to develop. The project mapped the genomes of different people, noting their coding sections, and rare nucleotide substitutions were identified. In the future, this will help doctors to diagnose complex diseases such as heart disease, diabetes, and cancer.

BRAKER1 enables scientists to work effectively with the genomes of new organisms, speeding up the process of annotating genomes and acquiring essential knowledge about life sciences.



The Children of Adam – National Geographic



In the name of God, “The Most Gracious”, “The Dispenser of Grace”

In the name of the Father, Son, and Holy Ghost, “the Lord, The Almighty, the Creator, the Maker, the Godhead”

Jehovah, Yahweh, Catholic, Jew, Baptist, Methodist, Buddhist and other religious groups

God is the Ruler of the Universe and the creator of all life, which began in Africa and spread around the world. God makes no distinction of religion, color, or sexual orientation. He simply said: follow my teaching, and I give you free will to choose me, your Father, or Satan.

This is a fact that lies within all of us in a special place in our DNA.

Africa is the beginning and the end; take your place with God.

When reviewing material for this article, I found so much hate and rejection of scientifically validated facts: challenges based on religious preferences, made without an open mind, without understanding, and without wanting to seek validation. This is the position of some of the world today, but it is changing. We are one and you are his people.

Coming soon: DNA spirituality, health, disease, relationships and mental health. I wrote a blog post that came to my mind while flying to DC-Bal. For five hours, I wrote while deep in thought and prayer. It has nothing to do with Trump’s tweet today regarding North Korea.

I met a man in the Dollar Store today. I turned to him and said, “I know you somehow.” He said to me, “I have been waiting for you.” We talked a little about our Father, our God. He wrote down a book he wanted me to read, and we continued a discussion about our connection. Finally, he said we shall meet again and others will be coming to give you greetings. I asked his name and he just smiled. He said, “I know your name.” He also said to me, “There is no burden too heavy or too hard for you to bear if you believe in God.” This is the third experience since my transition and return to this life.

May our God bless you all and keep you safe.


Do You Know Who You Are? Do You Know Your Ancestors?



Those of us who had ancestors living in slavery in the South Carolina Low Country, North Carolina and Georgia likely descend from people from Senegal and Sierra Leone. Do you know who you are? Did you have ancestors on a plantation, particularly in these areas? A slave who managed to escape did not go north but to Seminole territory in Florida.

What pulled me in this direction was the multitude of matches from Brazil, Mexico, Iran, Syria and the Caribbean. I started to see names such as Sadi, Jahid, Fahid, Dajzar, and Raza.

Wrong or right, the surnames we are using are not our own. They bind us to the earth as human beings. Looking beyond that, our ancestors used different names. So when I see the strange names, I want to dig for more. DNA has given us the opportunity to see the world internationally, not narrowly focused.

The picture below is of Abraham, a Black Seminole Leader in the Second Seminole War.


Genetic Inheritance Follows Rules Concept 5


Ref: DNA for, access Aug 1, 2017

When Mendel proposed that each trait is determined by a pair of genes, it presented a potential problem. If parents pass on both copies of a gene pair, then offspring would end up with four genes for each trait. Mendel deduced that sex cells — sperm and eggs — contain only one parental gene of each pair. The half-sets of genes contributed by sperm and egg restore a whole set of genes in the offspring.

Mendel found that different gene combinations from the parents resulted in specific ratios of dominant-to-recessive traits. The results of a cross between two hybrid parents — each carrying one dominant and one recessive gene — were key to his synthesis. For example, a cross between two yellow-seed hybrids produces three times as many yellow seeds as green seeds. This is Mendel’s famous 3 to 1 ratio.
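
The 3 to 1 ratio can be checked by listing every equally likely pairing of parental genes. The short Python sketch below does this for the yellow-seed hybrid cross; the allele symbols "Y" (dominant, yellow) and "y" (recessive, green) and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product

def cross(parent1, parent2):
    """Count offspring genotypes over all equally likely gamete pairings."""
    offspring = Counter()
    for a, b in product(parent1, parent2):
        # Sort the two alleles so that "yY" and "Yy" count as the same genotype.
        genotype = "".join(sorted(a + b))
        offspring[genotype] += 1
    return offspring

def phenotype_counts(genotypes, dominant="Y"):
    """A seed is yellow if it carries at least one dominant gene."""
    counts = Counter()
    for genotype, n in genotypes.items():
        counts["yellow" if dominant in genotype else "green"] += n
    return counts

genotypes = cross("Yy", "Yy")            # two hybrid parents
phenotypes = phenotype_counts(genotypes)
print(dict(genotypes))
print(dict(phenotypes))
```

Of the four pairings, one gives YY, two give Yy and one gives yy, so three show the dominant yellow trait for every one that shows green.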

DNA Triangulation, What?

Triangulation is a term derived from surveying to describe a method of determining the Y-STR or mitochondrial DNA ancestral haplotype using two or more known data points. The term “Genetic Triangulation” was coined by genetic genealogist Bill Hurst in 2004.

Here is a 3-step process for Triangulation: Collect, Arrange, Compare/Group.

  1. Collect all the Match-segments you can. I recommend testing at all three companies (23andMe, FTDNA, and AncestryDNA), and using GEDmatch. But, wherever you test, get all of your segments into a spreadsheet. If you are using more than one company, you need to download, and then arrange, the data in the same format as your spreadsheet. Downloading/arranging is best when starting a new spreadsheet. Downloading avoids typing errors, but direct typing is sometimes easier for updates. I recommend deleting all segments under 7 cM – most of them will be IBC/IBS (false segments) anyway, and even the ones which may be IBD are very difficult to confirm as such. You are much better off doing as much Triangulation as you can with segments over 7 cM (or use a 10 cM threshold if you wish), and then adding smaller segments back in later, if you want to analyze them. NB: Some of your closer Matches will share multiple segments with you – each segment must be entered as a separate row in your spreadsheet. The minimum requirement for a Triangulation with a spreadsheet includes columns for MatchName, Chromosome, SegmentStartLocation, SegmentEndLocation, cMs and TG. Most of us also have columns for SNPs, company, testee, and any other information of interest to you. Perhaps I need a separate blog post about spreadsheets… ;>j
  2. Arrange the segments by sorting the entire spreadsheet (Ctrl-A) by Chromosome and SegmentStartLocation. This is one sort with two levels – the Chromosome column is the first level. This puts all of your segments in order – from the first one on Chromosome 1 to the last one on Chromosome 23 (for sorting purposes I recommend changing Chromosome X to 23 or 23X so it will sort after 22). This serves the purpose of putting overlapping segments close to each other in the spreadsheet where they are easy to compare.
  3. Compare/Group overlapping segments. All of these segments are shared segments with you. So with segments that overlap each other, you want to know if they match each other at this location. If so, this is Triangulation. This comparison is done a little differently at each company, but the goal is the same: two segments either match each other, or they don’t (or there isn’t enough overlapping segment information to determine a match). All the Matches who match each other will form a Triangulated Group, on one chromosome – call this TG A (or any other name you want). Go through the same process with the segments that didn’t match TG A. They will often match each other and will form a second, overlapping TG, on the other chromosome – call this TG B. [Remember you have two of each numbered chromosome.] So to review, and put it all a different way: All of your segments (every row of your spreadsheet) will go into one of 4 categories:
  • TG A [the first one with segments which match each other]
  • TG B [the other, overlapping, one with segments which match each other]
  • IBC/IBS [the segments don’t match either TG A or TG B]
  • Undetermined [there are not enough segments to form both TG A and TG B and/or there isn’t enough overlapping data to determine a match.]
  • NB: None of the segments in TG A should match any of the segments in TG B.
  1. At GEDmatch – the comparisons are easy. Just compare two kit numbers using the one-to-one utility to see if they match each other on the appropriate segment. The ones that do are Triangulated. You may also use the Tier1 Triangulation utility or the Segment utility. I prefer using the one-to-one utility and Chrome.
  2. At 23andMe you have several different utilities:
  • Family Inheritance: Advanced lets you compare up to 5 Matches at a time. You may also request a spreadsheet of all your shared segments; sort that by chromosome and SegmentStart, and check to see if two of your Matches match each other. The ones that do are Triangulated.
  • Countries of Ancestry: Sort a Match’s spreadsheet by chromosome and SegmentStart, search for your own name, and highlight the overlapping segments. The Matches on this highlighted list who are also on overlapping segments in your spreadsheet are Triangulated (the CoA spreadsheet confirms the match between two of your Matches).
  3. At FTDNA it’s a little trickier, because they don’t have a utility to compare two of your Matches. So the most positive method is to contact the Matches and ask them to confirm if they match your overlapping Matches, or not. The ones that do are Triangulated. An almost-as-good alternative is to use the InCommonWith utility. Look for the 2-squiggly-arrows icon next to a Match’s name, click that, and select In Common With to get a list of your Matches who also match the Match you started with. Compare that list of Matches with the list of Matches with overlapping segments in your spreadsheet. Matches on both lists are considered to be Triangulated. Although this is not a foolproof method, it works most of the time. And if you find three or four ICW Matches in the same TG, the odds are much closer to 100%. Remember, every segment in your spreadsheet must go in one TG or the other, or be IBC/IBS, or be undetermined. If a particular Match, in one TG, is critical to your analysis, then try hard to confirm the Triangulation by contacting the Matches.
  4. AncestryDNA has no DNA analysis utilities. You need to convince your Matches to upload their raw data to GEDmatch (for free) or FTDNA (for a fee), and see the paragraphs above.
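
For readers who keep their segments in a script rather than a spreadsheet, the Collect, Arrange, Compare/Group steps above can be sketched in Python. This is only an illustration under stated assumptions: the sample rows are invented, the `Segment` fields simply mirror the minimum spreadsheet columns, and the `confirmed` set stands in for the answers you would get by checking whether two Matches match each other (for example with a one-to-one comparison). None of the names below come from any company's tools.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Segment:
    match_name: str
    chromosome: int   # X recoded as 23 so it sorts after 22
    start: int
    end: int
    cm: float

def prepare(segments, min_cm=7.0):
    """Steps 1 and 2: drop small segments, then sort by chromosome and start."""
    kept = [s for s in segments if s.cm >= min_cm]
    return sorted(kept, key=lambda s: (s.chromosome, s.start))

def overlapping_pairs(segments):
    """Yield pairs of segments on the same chromosome whose ranges overlap."""
    for a, b in combinations(segments, 2):
        if a.chromosome == b.chromosome and a.start < b.end and b.start < a.end:
            yield a, b

def triangulated_groups(segments, matches_each_other):
    """Step 3: group overlapping segments whose owners also match each other."""
    groups = []
    for a, b in overlapping_pairs(segments):
        if frozenset((a.match_name, b.match_name)) not in matches_each_other:
            continue
        # Merge with any existing group that already holds one of the two segments.
        touching = [g for g in groups if a in g or b in g]
        merged = {a, b}.union(*touching)
        groups = [g for g in groups if g not in touching] + [merged]
    return groups

# Invented example rows (assumptions for illustration only).
segments = [
    Segment("Ann", 1, 1_000_000, 9_000_000, 12.0),
    Segment("Bob", 1, 2_000_000, 8_000_000, 10.5),
    Segment("Cat", 1, 3_000_000, 7_000_000, 8.1),
    Segment("Dan", 1, 4_000_000, 5_000_000, 3.2),   # under 7 cM, dropped
    Segment("Eve", 2, 1_000_000, 6_000_000, 15.0),
]
confirmed = {frozenset(("Ann", "Bob"))}  # pairs found to match each other

ordered = prepare(segments)
groups = triangulated_groups(ordered, confirmed)
for group in groups:
    print(sorted(s.match_name for s in group))
```

In this toy data Ann and Bob form one TG. Cat overlaps them but matches neither, so her segment would still need to be placed in the other TG, marked IBC/IBS, or left undetermined, and Eve has no overlapping segments to compare against.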

Comments to improve this blog post are welcomed.
