Basic Robot

From Eterna Wiki

by User:Guzz

Overview

Warning: the concepts expressed here are likely to be underwelming to those knowledgeable in the art of RNA folding.

When I start working on a puzzle, I like to take its structure, copy it to a text editor and hand edit a starting sequence. This saves me me from effectively doing what I would do if I were taking the more traditional route of clicking around, and it does save time: especially on large puzzles. It took me a while to come up with this technique, but I would assume that the technique has occurred to anyone who has been doing this for a while.

I am interested in automating this technique because performing it by hand is error prone. First, it is easy to make typos: afterall, in many respects this is equally as mind numbing as doing it with a mouse in EteRNA. Second, there are issues of loop balance which can lead to much rework. Removing the loop balance errors require a great deal of looking forwards and backwards which is better suited for a computer than a human.

 

Puzzles Solved

I was mostly just interested in capturing what I have learned in code, so I never really had the intention of this Robot solving problems for me. However, it has solved puzzles, so I am creating this section. Note: I include neither puzzles with unpaired bases of any number (which solve with a space bar) nor puzzles that are simple hairpins. There are far too many of each.

I don't list simple puzzles like (one-N unpaired nucleotides or simple hairpins).

2016-01-25 Cafeteria virus tRNA  

2016-01-25 puzzle 1

2016-01-25 the key

2016-01-24 Big & Little Loops  

2016-01-23 Should not take too long to solve

2016-01-23 Monoecious II

2016-01-10 Hooked

2016-01-09 Uplift

2016-01-09 First structure

2016-01-09 Mat - Broken Propeller Mod

2016-01-09 Infobot test 19

2016-01-09 First-time puzzle

2016-01-08 InfoRNA bulge test 7

2016-01-05 Bulge Practice

2016-01-04 Synthetic 5

2016-01-04 Rings Around the Little Stack

2016-01-04 (RNA-R) 3

2016-01-04 Mat - Bug test

2016-01-02 600!

2015-12-31 Broken Propeller

2015-12-31 Little star 

2015-12-30 Big and Easy +

2015-12-30 Steps for lab

2015-12-24 Lab design for newer players with 2-2 loops

2015-12-24 Ebola Zaire (Partial)

2015-12-24 Cantor Knot

2015-12-24 hardest ever

2015-12-24 "The Asymmetry" with out the open loop   

2015-12-23 Mat - 2-2 loop energy challenge

2015-12-23 The pick-axe

2015-12-23 InfoRNA bulge test 5

2015-12-23 Easy Loops

2015-12-23 Tripple Stack

2015-12-23 Big Loop Balancing Act

2015-12-23 Fold back

2015-12-23 Puba

2015-12-23 Ball with arms

2015-12-23 2-2 loop test

2015-12-23 Tripod

2015-12-23 Venus disguised as a Pogo stick.

2015-12-20 repeat

2015-12-21 (RNA-Rb) AU1

2015-12-21 V for Vienna

2015-12-21 Snowman

2015-12-21 First Puzzle   

2015-12-20 The Scissor

2015-12-20 Easy series 2

2015-12-20 Really easy

2015-12-20 Lab design for newer players featuring 4-4 loops

2015-12-19 JRA HP loop 

2015-12-19 Stack Strength Challenge

2015-12-19 Dancing Doll

2015-12-19 Easy Bot Test

2015-12-19 cel-mir-1

2015-12-19 Two stacks

2015-12-18 Lock

2015-12-14 Arabidopsis Thaliana 7 - Difficulty Level 4

 

Unit Test Puzzles

These are the puzzles that I run as part of the robot's unit tests for regression testing purposes. I have posted them to EteRNA for others to use too.

2015-12-15 JC - Basic Robot Test #0 - My first puzzle

2015-12-17 JC - Basic Robot Test #1: Isolated Loop

2015-12-17 JC - Basic Robot Test #2: Isolated 1x1 Loop

2015-12-17 JC - Basic Robot Test #3: Isolated 1x2/2x1 Loops

2015-12-19 JC - Basic Robot Test #4: Isolated 2x2 Loop

 

Puzzles that Demonstrate Bugs or a Possible Improvement

The following puzzles should work (or at least work better), but don't because of a bug or an error in logic.

2016-01-25 Nothing else to do  

2016-01-12 miRNA  

2016-01-10 [Lab Tutorial Learn the RNA alphabet - Uracil  ]

2016-01-10 Dreamcatcher

2016-01-10 Random Number Association (RNA) 1

2016-01-09 Tri-loop

2016-01-09 Brourd - Complicated 2-2 loop energy challenge

2016-01-09 Asymetrical fun

2016-01-04 Walking Along

2015-12-28 Adenine

2015-12-20 n'importe quoi 2

2015-12-19 Key


My Technique

When I start on a puzzle, I first copy the structure.
..(((..(.).)))...

 
I then paste it into a text editor and overlay U's and A's because UA pairs are strong and well suited for long runs.
..(((..(.).)))...

UAUAUAUAUAUAUAUAU

I then replace the dots with A's, the end '('s with C's and the end ')'s with G's. The middle '(' and ')' (those between the ends) retain their original U and A nucleotides. I do this because it places GC pairs at the closing base pairs. This again is a strong configuration which works nicely. It also make sure that there tends to be A's in the non-terminal elements of a loop.

 
..(((..(.).)))...

AACACAACAGAGUGAAA


I then paste it into the puzzle. As mentioned before, its not perfect and it does require some tweeking once back in the puzzle. This technique guaranties that there are no runs of greater than two. I can then focus on reducing GC pairs, increasing UG pairs and other typical puzzle requirements. Note: this does result in a cub scout sequence, so the likelyhood of a stable structure is not great.

 

Case Study:

For this case study, I randomly select Synthetic RNA-Difficulty Level 2.

(((((((.(((((((((((((((((((((((((((....))))))))))) )))))))......(((((.....(((((((((((....))))))))))). ..))))).)))))))))..)))))))

UAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUA

CAUAUACACAUAUAUAUAUAUAUAUAUAUAUAUACAAAAGUAUAUAUAUGAGUAUAUGAAAAAACAUACAAAAACAUAUAUAUACAAAAGUAUAUAUAUGAAAAGAUAGAGAUAUAUAGAAGUAUAUG

CAUAUACACAUAUAUACCAUAUACCAUAUAUAUACAAAAGUAUAUAUAUGAGUAUAUGAAAAAACAUACAAAAACAUAUAUAUACAAAAGUAUAUAUAUGAAAAGUAUGAGUAUAUAUGAAGUAUAUG


If you paste the third line into that puzzle, you will notice the two mechanical problems with my strategy:
  1. When two columns join a loop directly adjacent to each other, there is a GA and a GU pair. The A and U come from the fact that they do not appear to be endpoints. 
  2. Sometimes columns come out correct with AU pairs. Other times they come out with AA and UU pairs. 
The solution to both of these problems are associated with an awareness of what is transpiring on either side of a loop or bulge.

The fourth line above shows what would have been the desired outcome. In this case, that structure is stable (not always the case) but doesn't satisfy the puzzle. Even so, it is a good Kickstart.

11/27/2015 Updates

I have refined my thinking and my algorythms since the original posting above. The solution for loop balancing was actually simple: pushing indexes of '(' in the DBN and popping them when I find a ')'. This gives me the current position with its mate. Since the mate already has an assigned nucleotide, I can set the current nucleotide to whatever works best. No looking forwards and backwards: always looking forward.

Here are the steps which I now take in my code:

1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.

2) Loop through the normalize structure:

3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.

4) If a '.' is found, make the nucleotide an 'A'

5a) If a '(' is found, push its index on a program stack (not to be confused with an RNA stack).

5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).

6a) If a ')' is found, pop the index at the top of the program stack (the mate of the current position)

6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G)

6c) Go head and set the mate to cytosine (C): which it might well be already.

6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').

11/28/2015 Observations

So my algorythm has been working to my satisfaction, but I just discovered a case where it doesn't work so well. This is the case where three RNA stacks join at a loop such that the center RNA stack is immediately adjacent to the other two RNA stacks (i.e., no dot in the DBN) between the RNA stacks. For example Sprinzl tRNA RF9331 - Difficulty Level 1 exhibits this behaviour. In this puzzle, nucleotide 7 and 67 should be sequenced as cytosine (C) and guanine (G) respectively, but rather, they are sequenced as uracil (U) and adenine (A).

(((((((((((........))))....(((((.......))))).....(((((.......)))))))))))).......

CAUAUAUCUACAAAAAAAAGUAGAAAACUAUCAAAAAAAGAUAGAAAAACUAUCAAAAAAAGAUAGAUAUAUGAAAAAAA

CAUAUACCUACAAAAAAAAGUAGAAAACUAUCAAAAAAAGAUAGAAAAACUAUCAAAAAAAGAUAGGUAUAUGAAAAAAA

As shown above, structurally, they are burried in a group of brackets such that they appear to be one long RNA stack. I see what is going on here, but will need to think about the best way to solve it. I love stuff like this! :) My current thinking is that I should be suspecious of the point in the program stack where I see a dot. (If this is true, I believe that it will simplify my algorythm. Because I will see dots both when I am climbing and decending the stack.)

My assertion appears to be correct, so I have modified my algorythm as follows:

1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.

2) Loop through the normalize structure:

3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.

4a) If a '.' is found, make the nucleotide an adenine (A)

4b) and make the nucleotide at the position indicated by the top of the program stack a cytosine (C)

5a) If a '(' is found, push its index on a program stack (not to be confused with an RNA stack).

5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).

6a) If a ')' is found, pop the index at the top of the program stack (the mate of the current position)

6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G)

6c) Go head and set the mate to cytosine (C): which it might well be already.

6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').

Thanksgiving Enhancements

Well over thanksgiving break, I had an idea that was rolling around in my head which I implemented but have been putting off documenting for a while. The time has come.

Some vocabulary:

ASSENDING- climbing up a group of consecutive OPENING BRACKETS and associated trailing intermediate dots in an RNA structure.

ASSENDING BRACKET- a bracket represented by an '(' in an RNA sequence. When an ASSENDING BRACKET is encountered, a new BRACKET CLASS INSTANCE is instantiated and pushed onto the FILO.

ASSENDING DOTS- the dots that appear after an ASSENDING BRACKET until the next bracket is encountered.

BRACKET CLASS INSTANCE- an instance of a class which represents ASSENDING BRACKET in the FILO. The BCI is used to track the index in an RNA sequence of an ASSENDING BRACKET, the ASSENDING DOTS, the DECENDING DOTS and the BRACKET INSTANCE MODE.

BRACKET INSTANCE MODE- each BRACKET CLASS INSTANCE is in one of two modes: assending or decending. The BCI starts and remains in assending mode until a new ASSENDING BRACKET is seen: at which time it is put in decending mode. The BIM is used to determine if dots should be counted as assending or decending dots.

DECENDING- climbing down a group of consecutive CLOSING BRACKETS and leading intermediate dots in an RNA structure.

DECENDING BRACKET- a bracket represented by an ')' in an RNA sequence. When a DECENDING BRACKET is encountered, the corresponding  BRACKET CLASS INSTANCE (associated with the base-paired ASSENDING BRACKET  )   is popped from the top of the FILO.

DECENDING DOTS- the dots that appear in a RNA structure leading up to a DECENDING BRACKET following the previous DECENDING BRACKET encountered.

FILO- standand for First-in Last-out. This is a computer science term. Rather than saying structure stacks and computer stacks, I will now use FIFO when refering to the computer stack. In a computer program, one pushes things onto a FILO, pops things off of a FILO and sometimes peeks at (looks at but doesn't pop) the top item on a FILO . The last thing pushed on a FILO is the first thing popped off. Formerly referred to as a computer stack.

STACK- a traditional stack of base pairs in an RNA sequence. Formerly referred to as structure stack.

Alogorythm Updates

1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.

2) Loop through the normalize structure:

3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.

4a) If a '.' is found, make the nucleotide an adenine (A)

4b) and make the nucleotide at the position indicated by the top of the program stack a cytosine (C)

5a) If a '(' is found, create a new Bracket Class Instance, set the assending bracket index and push it on the FIFO

5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).

5c) Lock the Bracket Class Instance at the top of the FIFO: switch it from assending mode to decending mode.

6a) If a ')' is found, pop the Bracket Class Instance from the top of the stack.

6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G).

6b) From the Bracket Class Instance, use the assending dots and decending dots to invoke methods (as describe below) to effect changes to the sequence.

6c) Go head and set the mate to cytosine (C): which it might well be already.

6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').

Loop Method

A loop is identified when the Brack Class Instance at the top of the FIFO is unlocked (not in decending mode). Consider the following sequence:

(((...)))

This would result in three BCI's being pushed onto the stack: first three assending brackets. Next, the three dots would be counted as assending dots on the BCI on the top of the FIFO. Next, when the decending bracket is popped off of the FIFO, it will not have been locked, so it remains in assending mode: indicating a unlocked 3x0 configuration or a loop with 3 unpaired bases.

The method for handling this situation is to set the nucleotide associated with the assending bracket to urical (U), the decending bracket to guanine (G), and the lowest number dot to guanine (G) to boost the loop.

Default Method:

A consecutive base-pair is identified when the Bracket Class Instance at the top of the FIFO is locked (in decending mode) and both the assending and decending dot counts are zero.