Basic Robot
by User:Guzz
Overview
Warning: the concepts expressed here are likely to be underwelming to those knowledgeable in the art of RNA folding.
When I start working on a puzzle, I like to take its structure, copy it to a text editor and hand edit a starting sequence. This saves me me from effectively doing what I would do if I were taking the more traditional route of clicking around, and it does save time: especially on large puzzles. It took me a while to come up with this technique, but I would assume that the technique has occurred to anyone who has been doing this for a while.
I am interested in automating this technique because performing it by hand is error prone. First, it is easy to make typos: afterall, in many respects this is equally as mind numbing as doing it with a mouse in EteRNA. Second, there are issues of loop balance which can lead to much rework. Removing the loop balance errors require a great deal of looking forwards and backwards which is better suited for a computer than a human.
Puzzles Solved
I was mostly just interested in capturing what I have learned in code, so I never really had the intention of this Robot solving problems for me. However, it has solved puzzles, so I am creating this section. Note: I include neither puzzles with unpaired bases of any number (which solve with a space bar) nor puzzles that are simple hairpins. There are far too many of each.
I don't list simple puzzles like (one-N unpaired nucleotides or simple hairpins).
2016-01-25 Cafeteria virus tRNA
2016-01-25 puzzle 1
2016-01-25 the key
2016-01-24 Big & Little Loops
2016-01-23 Should not take too long to solve
2016-01-23 Monoecious II
2016-01-10 Hooked
2016-01-09 Uplift
2016-01-09 First structure
2016-01-09 Mat - Broken Propeller Mod
2016-01-09 Infobot test 19
2016-01-09 First-time puzzle
2016-01-08 InfoRNA bulge test 7
2016-01-05 Bulge Practice
2016-01-04 Synthetic 5
2016-01-04 Rings Around the Little Stack
2016-01-04 (RNA-R) 3
2016-01-04 Mat - Bug test
2016-01-02 600!
2015-12-31 Broken Propeller
2015-12-31 Little star
2015-12-30 Big and Easy +
2015-12-30 Steps for lab
2015-12-24 Lab design for newer players with 2-2 loops
2015-12-24 Ebola Zaire (Partial)
2015-12-24 Cantor Knot
2015-12-24 hardest ever
2015-12-24 "The Asymmetry" with out the open loop
2015-12-23 Mat - 2-2 loop energy challenge
2015-12-23 The pick-axe
2015-12-23 InfoRNA bulge test 5
2015-12-23 Easy Loops
2015-12-23 Tripple Stack
2015-12-23 Big Loop Balancing Act
2015-12-23 Fold back
2015-12-23 Puba
2015-12-23 Ball with arms
2015-12-23 2-2 loop test
2015-12-23 Tripod
2015-12-23 Venus disguised as a Pogo stick.
2015-12-20 repeat
2015-12-21 (RNA-Rb) AU1
2015-12-21 V for Vienna
2015-12-21 Snowman
2015-12-21 First Puzzle
2015-12-20 The Scissor
2015-12-20 Easy series 2
2015-12-20 Really easy
2015-12-20 Lab design for newer players featuring 4-4 loops
2015-12-19 JRA HP loop
2015-12-19 Stack Strength Challenge
2015-12-19 Dancing Doll
2015-12-19 Easy Bot Test
2015-12-19 cel-mir-1
2015-12-19 Two stacks
2015-12-18 Lock
2015-12-14 Arabidopsis Thaliana 7 - Difficulty Level 4
Unit Test Puzzles
These are the puzzles that I run as part of the robot's unit tests for regression testing purposes. I have posted them to EteRNA for others to use too.
2015-12-15 JC - Basic Robot Test #0 - My first puzzle
2015-12-17 JC - Basic Robot Test #1: Isolated Loop
2015-12-17 JC - Basic Robot Test #2: Isolated 1x1 Loop
2015-12-17 JC - Basic Robot Test #3: Isolated 1x2/2x1 Loops
2015-12-19 JC - Basic Robot Test #4: Isolated 2x2 Loop
Puzzles that Demonstrate Bugs or a Possible Improvement
The following puzzles should work (or at least work better), but don't because of a bug or an error in logic.
2016-01-25 Nothing else to do
2016-01-12 miRNA
2016-01-10 [Lab Tutorial Learn the RNA alphabet - Uracil ]
2016-01-10 Dreamcatcher
2016-01-10 Random Number Association (RNA) 1
2016-01-09 Tri-loop
2016-01-09 Brourd - Complicated 2-2 loop energy challenge
2016-01-09 Asymetrical fun
2016-01-04 Walking Along
2015-12-28 Adenine
2015-12-20 n'importe quoi 2
2015-12-19 Key
My Technique
When I start on a puzzle, I first copy the structure.
..(((..(.).)))...
I then paste it into a text editor and overlay U's and A's because UA pairs are strong and well suited for long runs.
..(((..(.).)))...
UAUAUAUAUAUAUAUAU
I then replace the dots with A's, the end '('s with C's and the end ')'s with G's. The middle '(' and ')' (those between the ends) retain their original U and A nucleotides. I do this because it places GC pairs at the closing base pairs. This again is a strong configuration which works nicely. It also make sure that there tends to be A's in the non-terminal elements of a loop.
..(((..(.).)))...
AACACAACAGAGUGAAA
I then paste it into the puzzle. As mentioned before, its not perfect and it does require some tweeking once back in the puzzle. This technique guaranties that there are no runs of greater than two. I can then focus on reducing GC pairs, increasing UG pairs and other typical puzzle requirements. Note: this does result in a cub scout sequence, so the likelyhood of a stable structure is not great.
Case Study:
For this case study, I randomly select Synthetic RNA-Difficulty Level 2.
(((((((.(((((((((((((((((((((((((((....))))))))))) )))))))......(((((.....(((((((((((....))))))))))). ..))))).)))))))))..)))))))
UAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUAUA
CAUAUACACAUAUAUAUAUAUAUAUAUAUAUAUACAAAAGUAUAUAUAUGAGUAUAUGAAAAAACAUACAAAAACAUAUAUAUACAAAAGUAUAUAUAUGAAAAGAUAGAGAUAUAUAGAAGUAUAUG
CAUAUACACAUAUAUACCAUAUACCAUAUAUAUACAAAAGUAUAUAUAUGAGUAUAUGAAAAAACAUACAAAAACAUAUAUAUACAAAAGUAUAUAUAUGAAAAGUAUGAGUAUAUAUGAAGUAUAUG
- When two columns join a loop directly adjacent to each other, there is a GA and a GU pair. The A and U come from the fact that they do not appear to be endpoints.
- Sometimes columns come out correct with AU pairs. Other times they come out with AA and UU pairs.
11/27/2015 Updates
I have refined my thinking and my algorythms since the original posting above. The solution for loop balancing was actually simple: pushing indexes of '(' in the DBN and popping them when I find a ')'. This gives me the current position with its mate. Since the mate already has an assigned nucleotide, I can set the current nucleotide to whatever works best. No looking forwards and backwards: always looking forward.
Here are the steps which I now take in my code:
1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.
2) Loop through the normalize structure:
3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.
4) If a '.' is found, make the nucleotide an 'A'
5a) If a '(' is found, push its index on a program stack (not to be confused with an RNA stack).
5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).
6a) If a ')' is found, pop the index at the top of the program stack (the mate of the current position)
6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G)
6c) Go head and set the mate to cytosine (C): which it might well be already.
6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').
11/28/2015 Observations
So my algorythm has been working to my satisfaction, but I just discovered a case where it doesn't work so well. This is the case where three RNA stacks join at a loop such that the center RNA stack is immediately adjacent to the other two RNA stacks (i.e., no dot in the DBN) between the RNA stacks. For example Sprinzl tRNA RF9331 - Difficulty Level 1 exhibits this behaviour. In this puzzle, nucleotide 7 and 67 should be sequenced as cytosine (C) and guanine (G) respectively, but rather, they are sequenced as uracil (U) and adenine (A).
(((((((((((........))))....(((((.......))))).....(((((.......)))))))))))).......
CAUAUAUCUACAAAAAAAAGUAGAAAACUAUCAAAAAAAGAUAGAAAAACUAUCAAAAAAAGAUAGAUAUAUGAAAAAAA
CAUAUACCUACAAAAAAAAGUAGAAAACUAUCAAAAAAAGAUAGAAAAACUAUCAAAAAAAGAUAGGUAUAUGAAAAAAA
As shown above, structurally, they are burried in a group of brackets such that they appear to be one long RNA stack. I see what is going on here, but will need to think about the best way to solve it. I love stuff like this! :) My current thinking is that I should be suspecious of the point in the program stack where I see a dot. (If this is true, I believe that it will simplify my algorythm. Because I will see dots both when I am climbing and decending the stack.)
My assertion appears to be correct, so I have modified my algorythm as follows:
1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.
2) Loop through the normalize structure:
3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.
4a) If a '.' is found, make the nucleotide an adenine (A)
4b) and make the nucleotide at the position indicated by the top of the program stack a cytosine (C)
5a) If a '(' is found, push its index on a program stack (not to be confused with an RNA stack).
5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).
6a) If a ')' is found, pop the index at the top of the program stack (the mate of the current position)
6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G)
6c) Go head and set the mate to cytosine (C): which it might well be already.
6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').
Thanksgiving Enhancements
Well over thanksgiving break, I had an idea that was rolling around in my head which I implemented but have been putting off documenting for a while. The time has come.
Some vocabulary:
ASSENDING- climbing up a group of consecutive OPENING BRACKETS and associated trailing intermediate dots in an RNA structure.
ASSENDING BRACKET- a bracket represented by an '(' in an RNA sequence. When an ASSENDING BRACKET is encountered, a new BRACKET CLASS INSTANCE is instantiated and pushed onto the FILO.
ASSENDING DOTS- the dots that appear after an ASSENDING BRACKET until the next bracket is encountered.
BRACKET CLASS INSTANCE- an instance of a class which represents ASSENDING BRACKET in the FILO. The BCI is used to track the index in an RNA sequence of an ASSENDING BRACKET, the ASSENDING DOTS, the DECENDING DOTS and the BRACKET INSTANCE MODE.
BRACKET INSTANCE MODE- each BRACKET CLASS INSTANCE is in one of two modes: assending or decending. The BCI starts and remains in assending mode until a new ASSENDING BRACKET is seen: at which time it is put in decending mode. The BIM is used to determine if dots should be counted as assending or decending dots.
DECENDING- climbing down a group of consecutive CLOSING BRACKETS and leading intermediate dots in an RNA structure.
DECENDING BRACKET- a bracket represented by an ')' in an RNA sequence. When a DECENDING BRACKET is encountered, the corresponding BRACKET CLASS INSTANCE (associated with the base-paired ASSENDING BRACKET ) is popped from the top of the FILO.
DECENDING DOTS- the dots that appear in a RNA structure leading up to a DECENDING BRACKET following the previous DECENDING BRACKET encountered.
FILO- standand for First-in Last-out. This is a computer science term. Rather than saying structure stacks and computer stacks, I will now use FIFO when refering to the computer stack. In a computer program, one pushes things onto a FILO, pops things off of a FILO and sometimes peeks at (looks at but doesn't pop) the top item on a FILO . The last thing pushed on a FILO is the first thing popped off. Formerly referred to as a computer stack.
STACK- a traditional stack of base pairs in an RNA sequence. Formerly referred to as structure stack.
Alogorythm Updates
1) Normalize the Dot-Bracket Notation (DBN) by replacing '{'/'[' with '(' and '}'/']' with ')'. These are used to indicate psudoknots which are not of interest to me at this time.
2) Loop through the normalize structure:
3) Give each nucleotide a default value of uracil (U) for even nucleotide indexes and adenine (A) for odd.
4a) If a '.' is found, make the nucleotide an adenine (A)
4b) and make the nucleotide at the position indicated by the top of the program stack a cytosine (C)
5a) If a '(' is found, create a new Bracket Class Instance, set the assending bracket index and push it on the FIFO
5b) If the '(' is at index 1 or it is at the start or end of a cluster of '(' set the nucleotide to cytosine (C).
5c) Lock the Bracket Class Instance at the top of the FIFO: switch it from assending mode to decending mode.
6a) If a ')' is found, pop the Bracket Class Instance from the top of the stack.
6b) If the ')' is at the end of the structure or at the start or end of a cluster of ')' set the nucleotide to guanine (G).
6b) From the Bracket Class Instance, use the assending dots and decending dots to invoke methods (as describe below) to effect changes to the sequence.
6c) Go head and set the mate to cytosine (C): which it might well be already.
6d) Regardless of steps 6b and 6c, if the mate is 'A' set the nucleotide to 'U', if 'C' set it to 'G' or if 'U' set it to 'A' (it won't be 'G').
Loop Method
A loop is identified when the Brack Class Instance at the top of the FIFO is unlocked (not in decending mode). Consider the following sequence:
(((...)))
This would result in three BCI's being pushed onto the stack: first three assending brackets. Next, the three dots would be counted as assending dots on the BCI on the top of the FIFO. Next, when the decending bracket is popped off of the FIFO, it will not have been locked, so it remains in assending mode: indicating a unlocked 3x0 configuration or a loop with 3 unpaired bases.
The method for handling this situation is to set the nucleotide associated with the assending bracket to urical (U), the decending bracket to guanine (G), and the lowest number dot to guanine (G) to boost the loop.
Default Method:
A consecutive base-pair is identified when the Bracket Class Instance at the top of the FIFO is locked (in decending mode) and both the assending and decending dot counts are zero.