PDA

View Full Version : Combinatorics question


Paul C. Anagnostopoulos
27th June 2006, 05:46 PM
Okay, I have a sequence of n capital letters. I need to know how many of the possible combinations include at least 1 subsequence of m As. There can be any number of such subsequences and they can overlap arbitrarily.

Or, looking at it from the other direction, how many combinations include no occurrence of m consecutive As?

~~ Paul

Vorticity
27th June 2006, 06:19 PM
Okay, I have a sequence of n capital letters. I need to know how many of the possible combinations include at least 1 subsequence of m As. There can be any number of such subsequences and they can overlap arbitrarily.

Or, looking at it from the other direction, how many combinations include no occurrence of m As?

~~ Paul
OK, I like combinatorics. Let's see what we can figure out. First, three obvious things, just so we're on the same page:

1) For each n, there are (n choose m) possible m-subsequences.

2) For each n, there are 26^n possible sequences of the type you describe.

3) (# n-sequences with at least 1 m-A subsequence) = 26^n - (# n-sequences with no m-A subsequences)

It's pretty clear that it is easier to look at it from the "no m-A" direction. Furthermore, the set of all n-sequences with no m-A subsequences is (I think) the same as the set of all n-sequences with less than m A's in them. This will be the same as:

(# n-sequences with no m-A subsequences) = 26^n * (Probability that any given n-sequence will have less than m A's)

Each term in the sequence has a probability 1/26 of being A. So:

P(<m)
= (Probability that any given n-sequence will have less than m A's)
= Sum_{k=0}^{m-1} (n choose k) (1/26)^k (25/26)^(n-k)

So:

(# n-sequences with at least 1 m-A subsequence)
= 26^n * (1 - P(<m))
= 26^n * P(>=m)
= 26^n * Sum_{k=m}^{n}(n choose k) (1/26)^k (25/26)^(n-k)

I think that's right. Let me chug on that expression a bit and see if it can be simplified.

Paul C. Anagnostopoulos
27th June 2006, 06:34 PM
Furthermore, the set of all n-sequences with no m-A subsequences is (I think) the same as the set of all n-sequences with less than m A's in them.
No, I don't think so. Say m=3:

AAXXXAA

~~ Paul

Vorticity
27th June 2006, 06:46 PM
No, I don't think so. Say m=3:

AAXXXAA

~~ Paul
OK, did you mean contiguous subsequences?

I was going with the more general of the two definitions of subsequence found at http://mathworld.wolfram.com/Subsequence.html

More generally, the word subsequence is sometimes used to mean a sequence derived from a sequence by discarding some of its terms.

This, I believe, is the standard definition of subsequence. If you meant contiguous subsequences, then I'll have to think about it some more. Stay tuned.

Vorticity
27th June 2006, 06:50 PM
Ouch, it's much more difficult that way...

Paul C. Anagnostopoulos
27th June 2006, 06:59 PM
Oh yeah, it's more difficulter for sure.

The real problem involves DNA sequences rather than letters: How many combinations of n DNA bases include no occurrences of AAAAAA?

~~ Paul

Vorticity
27th June 2006, 07:23 PM
Hey, does this have anything to do with that 4^x guy?

I love combinatorics problems. They get stuck in my head. So I'll work on this one a bit more...

Paul C. Anagnostopoulos
27th June 2006, 07:29 PM
Yeah, it does have something to do with that pesky 4^x guy.

Thanks for pondering it.

~~ Paul

Unnamed
27th June 2006, 07:31 PM
Paul and Vorticity,

Maybe it helps to consider the AA...A block fixed and calculate the number of permutations of the remaining (n-m) spots. Then you can insert the repeated block in one of the (n-m+1) positions.

For N capital letters (already assuming you will want 4 letters), that's (n-m+1)*N^(n-m) combinations including such repeated block, out of N^n total combinations.

I'll check with a few examples...

Unnamed
27th June 2006, 07:37 PM
I am double-counting the cases where there are sequences longer than m letters, and multiple m-letter sequences :(

Vorticity
27th June 2006, 07:41 PM
I am double-counting the cases where there are sequences longer than m letters, and multiple m-letter sequences :(
Yeah, I was just having the same problem.

Paul C. Anagnostopoulos
27th June 2006, 07:58 PM
Yeah, me, too. :D

~~ Paul

Unnamed
27th June 2006, 08:06 PM
I got rid of double-counting when there are long sequences, by taking all m-long sequences and removing the double-counted (m+1)-long sequences.

(n-m+1)*N^(n-m) - (n-m)*N^(n-m-1)

Since the (m+1)-long sequences also double-counted the (m+2)-long sequences (and so on...), this actually cancels out all sequences longer than m.

Now, how can we get rid of disjoint sequences AAAxxxxAAA? :confused:

EBU
27th June 2006, 08:26 PM
If you have k letters to choose from (for alphabet k=26; for DNA, k=4?), and the sequence is n letters long, then the number of subsequences containing exactly j A's is: (n choose j) times (k-1) to the (n-j) power.

So if you sum over j from m to n, this will give you the number of subsequences with at least m A's. But, unfortunately, this includes non-contiguous groups of A's.

Vorticity
27th June 2006, 08:43 PM
Hmmm. Maybe we can do this inductively.

For n>=m, define

S(n,m) = # of n-sequences with no m-A subsequences

If there are N "letters", then the number the OP seeks will be N^n - S(n,m).

A little thought shows that S(m,m) = [(N-1)/N]^m. Can we derive an expression for S(n+1,m) in terms of S(n,m) and/or perhaps lower terms?

It seems right that

S(n+1,m) = Some kind of weighted sum over S(k,m) for m<=k<=n.

But what sum?

Vorticity
27th June 2006, 08:56 PM
Ok. Imagine we create the n+1 sequence by tacking on an extra letter at the end of one of the "n,m" sequences described above (of which there are S(n,m)). To first order there are N S(n,m) of these. But we also need to exclude those cases where tacking on an extra "A" at the end creates an m-A subsequence at the end, where there was none before. In other words:

S(n+1,m) = N S(n,m) - (Number of n-sequences with no m-A subsequences, but that have exactly m-1 A's right at the end)

We can envision this latter string as being made of 3 chunks:

{An n-m sequence with no m-A subsequence} {One letter that's not A} {m-1 A's}

How many of these are there? The number is S(n-m,m)*(N-1), unless n-m < m, in which case there are zero. In other words:

S(n+1,m) = N S(n,m) - (N-1) S(n-m,m) if n-m >= m

else

S(n+1,m) = N S(n,m)

But how to solve the induction in closed form?

Unnamed
27th June 2006, 08:59 PM
Vorticity,

I think that's it. I have the following:

S(n,m) = N^n, for n < m
S(n,m) = S(n,m+1) - 1, for n = m (I'm not sure why we disagree here)
S(n,m) = S(n,m+1) - 2(N-1), for n = m + 1
S(n,m) = S(n,m+1) - (N-1)^2 - 2(N-1)N, for n = m + 2
S(n,m) = S(n,m+1) - 2(N-1)S(n-m-1,m) - sum(a = 0; a <= n-m-2) 2(N-1)^2*S(n-m-2-a,m)*S(a,m), for n > m + 2

The last formula is recursive, but eventually converges to of of the first cases.

Now let me try to explain, so you can check the formulas.

Unnamed
27th June 2006, 09:07 PM
My idea is exactly the same as yours. Let me expand the formula for n > m+2:

S(n,m+1) = all possible combinations that don't have a m+1 block (and therefore don't have a m+2 block, etc. - so we only need to exclude possibilities with blocks of length m)

2*(N-1)*S(n-m-1,m) = {block of m A's}{letter not A}{n-m-1 letters, no A-blocks} and {n-m-1 letters, no A-blocks}{letter not A}{block of m A's}

2*(N-1)^2*S(n-m-2-a,m)*S(a,m) = {string of a letters, without any A-block}{letter, not-A}{block of m A's}{letter, not-A}{string with remaining lettes, without an A-block}

Well, I think the answer will look like this, unless you the induction happens to have a closed form.

Vorticity
27th June 2006, 09:08 PM
I said:
A little thought shows that S(m,m) = [(N-1)/N]^m.
That's clearly wrong.

As you say, we have S(m,m) = N^m - 1.

delphi_ote
27th June 2006, 09:12 PM
Okay, I have a sequence of n capital letters. I need to know how many of the possible combinations include at least 1 subsequence of m As. There can be any number of such subsequences and they can overlap arbitrarily.

Or, looking at it from the other direction, how many combinations include no occurrence of m consecutive As?

~~ Paul
I'm confused. Would subsequence include just a single A? Can m = 1?

Nessus
27th June 2006, 09:33 PM
You may find that generating functions give you the answer no problem at all.
Well i tell a lie, generating functions will give you the exact answer, and handy enough I just did a paper on them.

Okay, I have a sequence of n capital letters. I need to know how many of the possible combinations include at least 1 subsequence of m As. There can be any number of such subsequences and they can overlap arbitrarily.

Or, looking at it from the other direction, how many combinations include no occurrence of m consecutive As?

To cut a long story short, (i wont bother reproducing the formulas)
let:
S(z) := the counting generating function of strings without AA...A (m A's )
c(z) := the autocorrelation polynomial
= 1 + z + z^2 + ... + z^(m-1)
S(z) = c(z) / ( z^m + (1 - 26z) * c(z) )

(the coefficient of z^n of : S(z) ) / 26^n= the probability that a random 26 character string of length n does not contain m A's in a row, if you times by the number of possible combinations of 26 character n length string ( 26^n) then you have how many strings contain repeating A's of size no more than (m-1)

Working out :
S(z)
and:
26^(-m) * (the coefficient of z^n of : S(z) )
can be done in any sort of math program automatically, and you have your answer!

edit: did i put that thumbs up there? oh well

Unnamed
27th June 2006, 10:00 PM
It's much simpler when you know what you are doing, isn't it :) ?

My formulas still don't account for multiple non-overlapping sequences. I am giving up, since there is a solution now.

Cyphermage
27th June 2006, 10:33 PM
To cut a long story short, (i wont bother reproducing the formulas)
let:
S(z) := the counting generating function of strings without AA...A (m A's )
c(z) := the autocorrelation polynomial
= 1 + z + z^2 + ... + z^(m-1)
S(z) = c(z) / ( z^m + (1 - 26z) * c(z) )

(the coefficient of z^n of : S(z) ) / 26^n= the probability that a random 26 character string of length n does not contain m A's in a row, if you times by the number of possible combinations of 26 character n length string ( 26^n) then you have how many strings contain repeating A's of size no more than (m-1)

So if I wanted to find out how many 10 character strings from "ABCD" don't contain 4 A's in a row somewhere, I would construct

S(z) = (1+z+z^2+z^3)/[z^4 + (1+z+z^2+z^3)*(1-4z)]

and the coefficients of the taylor series around zero would be the numbers of strings of various sizes having this property?

Let's see for strings up to length 10.

<Fires up Mathematica>


1+4\multsp z+16\multsp {z^2}+64\multsp {z^3}+255\multsp {z^4}+1017\multsp {z^5}+4056\multsp {z^6}+ \\
\noalign{\vspace{0.604167ex}}
\hspace{1.em} 16176\multsp {z^7}+64512\multsp {z^8}+257283\multsp {z^9}+1026081\multsp {z^{10}}


Goodness, it works. Exactly 1026081 strings of length 10 made from "ABCD" don't have "AAAA" anywhere in them.

Cute! I'm impressed.

Art Vandelay
27th June 2006, 11:02 PM
Let F(n) be the number of sequences of DNA of length n without six A’s in a row. Suppose the first letter is not A. Then there are F(n-1) sequences after that. If there is one A at the beginning, then there are F(n-2) sequences. And so on. So:

F(n)=3[F(n-1)+F(n-2)+F(n-3)+F(n-4)+F(n-5)+F(n-6)]

Now, I could solve this in closed form, but I think that it would be simpler to just approximate it with ar^n.
So ar^(n+6)=3a[r^(n-1)…]
Divide both sides by ar^n
r^6-3r^5-3r^4…=0

This has only two real solutions: 3.99927 and -.9208
I’ll take the positive solution. A little calculation gives me a=1.00104. So within rounding errors, none of them have six A’s in a row.

and the coefficients of the taylor series around zero would be the numbers of strings of various sizes having this property?AKA MacLaurin Series.

Goodness, it works. Exactly 1026081 strings of length 10 made from "ABCD" don't have "AAAA" anywhere in them.

Cute! I'm impressed.

Yes, cute. Generating functions are a really cool idea, but often they're just a way to get a sequence in a closed form, rather than actually make it easier to find the terms. For instance, you mention that you used Mathematica. Well, if you've got that much computing power, you could just calculate the terms directly.

Cyphermage
27th June 2006, 11:22 PM
Let F(n) be the number of sequences of DNA of length n without six A’s in a row. Suppose the first letter is not A. Then there are F(n-1) sequences after that.

I'll buy that.

If there is one A at the beginning, then there are F(n-2) sequences. And so on. So:

F(n)=3[F(n-1)+F(n-2)+F(n-3)+F(n-4)+F(n-5)+F(n-6)]

Here I become skeptical. Could you illustrate this part with some actual numbers.

Art Vandelay
27th June 2006, 11:55 PM
Do you agree that the first five numbers should be as follows?
4
16
64
256
1024

For the sixth number, it's 4096 minus one, or 4095. For the seventh number, there are 16384 total combinations. Of those, there are three that start with a non-A, three that end with a non-A, and one that has all A's. That's a total of seven, and 16384 minus seven is 16377. If you add up the first six numbers, you get 5459. Mulitply by three, and you get 16377.

Cyphermage
28th June 2006, 01:02 AM
Do you agree that the first five numbers should be as follows?
4
16
64
256
1024

For the sixth number, it's 4096 minus one, or 4095. For the seventh number, there are 16384 total combinations. Of those, there are three that start with a non-A, three that end with a non-A, and one that has all A's. That's a total of seven, and 16384 minus seven is 16377. If you add up the first six numbers, you get 5459. Mulitply by three, and you get 16377.

The recursion formula is correct. It wasn't immediately clear from your explanation how you got exactly three of the prior six things, but after working it out, I see that is in fact what happens.

Paul C. Anagnostopoulos
28th June 2006, 05:55 AM
Holy cow! How can you not love this place?

I would be indebted to Nessus or anyone else if you would explain this magic a bit. Consider me relatively stoopid about it.

Also, Cyphermage, could you do the Mathematica for strings up to, say, 24 letters in length? Same 4 letters with no AAAA anywhere.

Thanks, guys!

~~ Paul

Paul C. Anagnostopoulos
28th June 2006, 06:04 AM
I'm confused. Would subsequence include just a single A? Can m = 1?
Not in the real world situation we're simulating. I can't imagine m less than about 4.

~~ Paul

Paul C. Anagnostopoulos
28th June 2006, 06:08 AM
Art, can Mathematica evaluate your recursion formula for large numbers?

~~ Paul

Edited to add: Hold on, I'm confused. Are Art's answers agreeing with Nesus's?

~~ Paul

69dodge
28th June 2006, 08:48 AM
Also, Cyphermage, could you do the Mathematica for strings up to, say, 24 letters in length? Same 4 letters with no AAAA anywhere.I don't have Mathematica. But I have a C++ compiler. I used Art's recursion formula (modified for 4 A's instead of 6).

0: 1
1: 4
2: 16
3: 64
4: 255
5: 1017
6: 4056
7: 16176
8: 64512
9: 257283
10: 1026081
11: 4092156
12: 16320096
13: 65086848
14: 259575543
15: 1035223929
16: 4128619248
17: 16465516704
18: 65666806272
19: 261888498459
20: 1044448322049
21: 4165407430452
22: 16612233171696
23: 66251932267968
24: 264222063576495
25: 1.05375490933983e+015
26: 4.20252341506798e+015
27: 1.67602569607568e+016
28: 6.68422720462234e+016
29: 2.66576421994164e+017
30: 1.06314442324864e+018
31: 4.23997012274934e+018
32: 1.69095997201151e+019
33: 6.74378720643217e+019
34: 2.68951758991304e+020
35: 1.07261760269547e+021
36: 4.27775050041364e+021
37: 1.70602732024942e+022
38: 6.80387791937839e+022
39: 2.71348261498162e+023
40: 1.08217519318456e+024
41: 4.315867521237e+024
42: 1.72122892653405e+025
43: 6.86450407237807e+025
44: 2.73766118110628e+026
45: 1.09181794686296e+027
46: 4.35432418488813e+027
47: 1.73656598717565e+028
48: 6.92567043648546e+028
49: 2.76205519105087e+029
50: 1.10154662257976e+030
51: 4.39312351776436e+030
52: 1.75203970914422e+031
53: 6.98738182526742e+031
54: 2.78666656453382e+032
55: 1.11136198594579e+033
56: 4.43226857322985e+033
57: 1.76765131016451e+034
58: 7.04964309518223e+034
59: 2.81149723837929e+035
60: 1.12126480939388e+036
61: 4.47176243185583e+036
62: 1.78340201881184e+037
63: 7.1124591459618e+037
64: 2.83654916666958e+038
65: 1.13125587223965e+039
66: 4.51160820166304e+039
67: 1.79929307460878e+040
68: 7.17583492099723e+040
69: 2.86182432089888e+041
70: 1.14133596074284e+042
71: 4.55180901836635e+042
72: 1.81532572812271e+043
73: 7.23977540772787e+043
74: 2.88732469012845e+044
75: 1.15150586816915e+045
76: 4.5923680456215e+045
77: 1.83150124106423e+046
78: 7.30428563803375e+046
79: 2.91305228114311e+047
80: 1.16176639485274e+048
81: 4.63328847527409e+048
82: 1.84782088638644e+049
83: 7.36937068863167e+049
84: 2.93900911860924e+050
85: 1.17211834825914e+051
86: 4.67457352761073e+051
87: 1.86428594838513e+052
88: 7.43503568147463e+052
89: 2.96519724523402e+053
90: 1.18256254304883e+054
91: 4.7162264516125e+054
92: 1.88089772279984e+055
93: 7.50128578415495e+055
94: 2.99161872192628e+056
95: 1.19309980114136e+057
96: 4.75825052521062e+057
97: 1.89765751691585e+058
98: 7.56812621031093e+058
99: 3.01827562795859e+059

69dodge
28th June 2006, 09:01 AM
Let F(n) be the number of sequences of DNA of length n without six A’s in a row. Suppose the first letter is not A. Then there are F(n-1) sequences after that. If there is one A at the beginning, then there are F(n-2) sequences. And so on. So:

F(n)=3[F(n-1)+F(n-2)+F(n-3)+F(n-4)+F(n-5)+F(n-6)]I agree.

Just to clarify, for those who may have made the same mistake I did when I first read this: "If there is one A at the beginning" means "if there is exactly one A at the beginning". In other words, there's an A followed by a non-A (which, in turn, is followed by some sequence of length n - 2 that does not contain six A's in a row anywhere). There are three possible non-A's; that's where the 3 in the formula comes from. And there are F(n - 2) sequences of length n - 2 that do not contain six A's in a row anywhere.

Yllanes
28th June 2006, 09:35 AM
Art, can Mathematica evaluate your recursion formula for large numbers?

~~ Paul

It sure can. Mathematica has a built in recursion limit, to protect against infinite loops, but you can get around that. Art's formula is actually quite good, if you use it wisely. It would be silly to calculate the possibilities with N = 1000 directly, but you can save previous results, thus avoiding recursion. For example, calculating the possibilities for string lengths 1-10000 took 0.25 seconds on my computer. Some numbers

N = 500: 2.4·10300

24476075492917547009981012225213598894351831448090 4890913058323793926340624629\
51161820122302116396212620936559867289597075185454 4505784162630233032190271691\
41456546242868400023162975563815272241085945088850 7581634569827727494142100792\
51656351531234415297419927483082834507737656830381 19253281660175873

N = 1000: 5.94·10600

59430205144802670072928276322763935668279411243598 8788618357583460934031329724\
93474229894020518052719585405679346903616205839603 4231160690608849889695729166\
12011010574517673185340119200130122978373603494553 5970751564070746945939755236\
52584878952191926722776945137889282781450458507631 6546041263666228038561146793\
57259946757579773033860166321031848192261569242613 6410764499654358259624709041\
89155769901024555492848170272128600698035605231044 5509148983970583403000096598\
22077508342633590259112810709021213489623282367827 2778239514816704253086711759\
36900269024452073431938482450379726457751601465814 63809

N=10000: 5·106007

(And I won't clutter the board with the full number).

If you want some specific string length, let me know.


Edited to add: Hold on, I'm confused. Are Art's answers agreeing with Nesus's?

~~ Paul
Yes. Look at n=10 in 69Dodge's list.

Yllanes
28th June 2006, 09:44 AM
N=10000: 5·106007.

Compare with 410000 ~ 3.98·106020. So roughly one in 1013 doesn't have any subsequence of 4 A's in a row.

Paul C. Anagnostopoulos
28th June 2006, 10:35 AM
You guys are great! Now let me present the real problem:

I have a chromosome of length n composed of the four bases A, C, G, T. Placed on this chromosome, in arbitrary positions possibly overlapping, are s sites of length m each. All these sites have the same subsequence of bases; any subsequence will do. None of the rest of the chromosome contains that subsequence. Thus a gene can match the sites without accidently matching anywhere else on the chromosome.

What fraction of all possible chromosomes fit this description?

Typical values are n=32k; s=16; m=6.

Can we just apply the above solutions to $n-sm$ and then divide by $4^n$?

~~ Paul

Yllanes
28th June 2006, 11:05 AM
I'm not sure I understand. This would only give the fraction of chromosomes that don't have this subsequence outside the sites. We don't know what's in them. Anyway, applying the above solution to the numbers you give returns (I'm using the formula for word length 6)

Number of strings of length n-sm=32000-16*6 that don't have the subsequence: 3.8 · 1019205.

Number of possible strings of length n-sm: 4n-sm ~1.3·1019208

Fraction of chromosomes that don't have the word anywhere else, provided they do have it on the s sites ~ 0.003.

If we want to know how many chromosomes have the word in 16 places and nowhere else, the problem is different. We would have to calculate the fraction of strings that have certain word s times.

Alternatively, we already have the fraction of chromosomes that don't have the word outside the sites. So, if we calculate the probability of a string of length sm containing nothing but the word repeated s times and then mutiply by this fraction, we would get the answer. With no overlapping, only one in 4sm strings fit the description, so the final answer would be 0.003 · 4-sm ~ 4.6·10-61.

Edited because I mixed up the numbers

Edited again: Ahh, I see you wrote 'then divide by 4n', which is exactly what I have done. Sorry.

Paul C. Anagnostopoulos
28th June 2006, 11:26 AM
I think I left out important information, as usual. Thanks for your patience, guys.

We aren't picking an a priori subsequence for the sites and then waiting for the rest of the chromosome not to contain that subsequence. We're starting with a random chromosome and evolving one that contains the same subsequence (whatever it is) at all the sites, while not containing that subsequence anywhere else.*

So, as Yllanes says, we need to know the fraction of chromosomes that contain some m-letter word in s places and nowhere else.

~~ Paul

* More or less. Absolutely perfect uniqueness of sites is not necessary.

Yllanes
28th June 2006, 11:31 AM
We aren't picking an a priori subsequence for the sites and then waiting for the rest of the chromosome not to contain that subsequence. We're starting with a random chromosome and evolving one that contains the same subsequence (whatever it is) at all the sites, while not containing that subsequence anywhere else.

See my edited post... I misunderstood you at first.

Paul C. Anagnostopoulos
28th June 2006, 12:19 PM
No, that doesn't seem right. If the probability were that low, I don't think we'd ever evolve perfect creatures. The mathematical model must be too simplistic.

Let me give the complete story. I don't expect anyone to build me a mathematical model, but at least you'll know what's going on.

The chromosome of length n is divided into two sections. The first section contains a weighting matrix and a threshold. The second section contains an array of s binding sites of length m.

The weighting matrix is used to compute a weight for every position on the chromosome. If a position's weight exceeds the threshold, then the position matches. A score is computed by tallying a mistake for each of the s binding sites that do not match and a mistake for each of the other positions that do match. The creatures are sorted by mistakes, the worst half die, and the best half procreate. Mutations are applied at random throughout the chromosomes.

If you watch the simulation one generation at a time, the best creature is almost immediately one with exactly s mistakes, because no position exceeds the threshold. If s=16, the worst creature might have this number of mistakes from generation to generation: 229, 146, 89, 79, 47, 33, 28, 17, 20, 18, ...

From this point we watch the best creature slowly decrease in number of mistakes, eventually reaching 0 mistakes. Our standard model requires about 650 generations to reach 0 mistakes (large variance). At this point, the binding sites all have a similar subsequence of bases, while the rest of the chromosome does not have that sequence. Not really a similar subsequence, but one that fits a pattern such as:

x x A/T A/C A/C/G A

That pattern does not occur anywhere else on the chromosome, at least not with a close enough match to exceed the threshold.

So the model is actually quite complex, and evolves a step-by-step increase in the number of binding sites that fit the pattern, perhaps with the pattern itself changing as the chromosome evolves.

I was a fool to think a simple model would tell us the fraction of genomes that fit this situation. Thanks for putting up with my foolishness.

~~ Paul

Art Vandelay
28th June 2006, 02:19 PM
I would be indebted to Nessus or anyone else if you would explain this magic a bit. Consider me relatively stoopid about it.You mean generating functions? I think that the clearest case is a coin flip. Suppose you flip a coin, and you don’t know that it’s fair, so you just assign variables to the probabilities. H= probability of heads, T= probability of tails. Obviously H+T=1 (as long as nothing weird is going on). If you have three coin flips, each one can be represented with H+T. So all of them together are
(H+T)^3=H^3+3H^2T+3HT^2+T^3=1.
Notice that each term represents a different result; 3HT^2, for instance, refers to a head and two tails. You have a coefficient of 3 because there are three ways of getting that result.

There are a whole bunch of tricks that you can use to go from a sequence to a generating function and vice versa. For instance, suppose you want the generating function for the Fibbonochi sequence. That is, each coefficient is equal to the sum of the two previous coefficient.

Then

f(x)= 1x^0+1x^1+2x^2+3x^3+5x^4+8x^5…
xf(x)= 1x^1+1x^2+2x^3+3x^4+5x^5+8x^6…
x^2f(x)= 1x^2+1x^3+2x^4+3x^5+5x^6+8x^7…
xf(x)+x^2f(x)= 1x^1+2x^2+3x^3+5x^4+8x^5+13x^6…
=f(x)-1

So f(x)-1=xf(x)+x^2f(x)
x^2f(x)+xf(x)-f(x)=1
(x^2+x-1)f(x)=1
f(x)=1/(x^2+x-1)

So if you take the MacLaurin series of the above function,
the coefficients will be the Fibonnochi sequence

You could do something similar to my formula to get a generating function.

Art Vandelay
28th June 2006, 02:27 PM
Not in the real world situation we're simulating. I can't imagine m less than about 4.And if m is 1, then it’s rather trivial; that just means that there’s one fewer letter in the alphabet. I guess that could be used to check a general formula.

Art, can Mathematica evaluate your recursion formula for large numbers?Probably. It’s really easy to set up an Excel spreadsheet. And since each term is a linear combination of the previous six, there’s a matrix which determines each term given the six previous. If you diagonalize that matrix and plug in the initial six terms, you’ll get an equation in terms of powers of the eigenvalues.

You guys are great! Now let me present the real problem:

I have a chromosome of length n composed of the four bases A, C, G, T. Placed on this chromosome, in arbitrary positions possibly overlapping, are s sites of length m each. Suppose the sequence is ACAC. Does ACACACAC count as two sequences, or three?

Nessus
28th June 2006, 05:01 PM
I'll first explain what generating functions are, then how to modify the formula given to calculate a reasonable number of things (deriving the formula is not too difficult... but i really dont want to have to do it if nessary :) )

Take a sequence of numbers, for instance:
1, 1, 1, 1, 1 ...a(i), 1,...
where a(i) is the ith number in the sequence (better that its a subscript i, but i cant do that here)

The generating function for this (infinite) sequence is F(z) which is:

F(z) = 1 + z + z + ... + z^i + ... = 1 / ( 1 - z ) = (sum over n of : a(n)*z^n)

What is amazing is that from the function F(z) = 1 / (1 - z) all sorts of information can be taken out, if you know how.

Basically, generating fuctions in this case count things!, anything that can be counted you can find a generating function for, then extract out the mean of something, the variance.. the first expected position of a subsequence pattern in a string, etc etc.

You can go straight from a recursive definition of something eg the fib sequence:
f(n) = f(n-1) + f(n-2)
to its generating function F(z), involving a few simple tricks :
F(z) = z / (1 - z - z^2)
and from this you can find the value of a(i) for any i!

If you can count it, you can use a generating function to help you greatly.


Ok onto the formula, this is for counting subsequence patterns inside an n length random string where there are x characters of the string (26 for english alphabet) with equal probability for each character)

let:
S(z) := the counting generating function of strings without the subsequence a(0)a(1)..a(m-1) (an m length substring, eg: ABABA
I say substring because the characters are all next to each other, you can find the generating function where the string has any number of chracters inbetween all the pattern characters, actually its simpler to do it that way)

c(z) := the autocorrelation polynomial described later

S(z) = c(z) / ( z^m + (1 - x*z) * c(z) )


All you need to do is figure out the autocorrelation polynomial! This tells how much the end of the substring looks like the beginning.

Formally, if you shift the subsequence j and when the subsequence and shifted subsequence are a perfect match, v(j) = 1, otherwise v(j) = 0
C(z) = sum of all (v(j) * z^j) for j = 0 to j = m-1)


For example: ABABA has an autocorrelation polynomial c(z) = 1 + z^2 + z^4
Shift it 0 times, you obviously get a match (1)
Shift it 1 times, you have BABA overlapping ABAB
Shift it 2 times, you have ABA overlapping ABA (z^2)
Shift it 3 times, you have BA overlapping AB
Shift it 4 times, you have A overlapping A (z^4)
Thus c(z) = 1 + z^2 + z^4

When you have your c(z) then you have your S(z) for your pattern, use :

(The coefficient of z^n of : S(z) ) / x^n) = the probability that a random x character string of length n does not contain the pattern a(0)a(1)...a(m-1), if you times by the number of possible combinations of x character n length string ( x^n) then you have how many strings dont contain the pattern.

Obviously:
1-(The coefficient of z^n of : S(z) ) / x^n) = the probability that a random x character string of length n DOES contain the pattern a(0)a(1)...a(m-1)

The hard part is working out your c(z) :)
Hope this helps.

Nessus
28th June 2006, 05:27 PM
Let F(n) be the number of sequences of DNA of length n without six A’s in a row. Suppose the first letter is not A. Then there are F(n-1) sequences after that. If there is one A at the beginning, then there are F(n-2) sequences. And so on. So:

F(n)=3[F(n-1)+F(n-2)+F(n-3)+F(n-4)+F(n-5)+F(n-6)]

Now, I could solve this in closed form, but I think that it would be simpler to just approximate it with ar^n.
So ar^(n+6)=3a[r^(n-1)…]
Divide both sides by ar^n
r^6-3r^5-3r^4…=0

This has only two real solutions: 3.99927 and -.9208
I’ll take the positive solution. A little calculation gives me a=1.00104. So within rounding errors, none of them have six A’s in a row.

AKA MacLaurin Series.


Yes, cute. Generating functions are a really cool idea, but often they're just a way to get a sequence in a closed form, rather than actually make it easier to find the terms. For instance, you mention that you used Mathematica. Well, if you've got that much computing power, you could just calculate the terms directly.

I disagree, generating functions can be automatically done by the computer.. thats one of the great things about them is that you DONT have to use much thinking, trust me on this. Besides your method (may) be alright for patters of the form XX...X but what about something like AGCTGATCA?
A simpler way to describe the sequence is to say:
let S be the set of strings not containing (k) A's from lets just say a binary alphabet {A,B}
then you can say that
S = ( (empty set) union {A,AA,...,(k-1)A's} ) * ( (B*S) union (empty set) )

Using something called the symbolic method.. directly translating from the above to the below
S = ( 1 + z + z^2 .. + z^(k-1) ) * ( 1 + z*S)

Making S the subject
S(z) = 1 + z + z^2 + .. + z^(k-1) / ( 1 - z - z^2 - .. - z^(k) )

And from that can find out the probability (the z^n coefficient of s(z) / 2^n),
the first expected position of the pattern: S(1/2)

I didnt have to think at all to work that out, it was recursive definiton straight to generating function.
Certianlly working out the nth coefficient of this is simpler than working out ALL the strings of length n then adding up those that have a pattern and dividing that by the total... There are simple tricks that you can pop out the coefficient from generating functions, much quicker than talor series expansion.

Suppose the sequence is ACAC. Does ACACACAC count as two sequences, or three?
I wasnt taught how do find distinct sequences, much too hard so they say. So for my generating function it would be 3...

Art Vandelay
28th June 2006, 05:50 PM
Take a sequence of numbers, for instance:
1, 1, 1, 1, 1 ...a(i), 1,...
where a(i) is the ith number in the sequence (better that its a subscript i, but i cant do that here)
a[/b]]1=a1.

Basically, generating fuctions in this case count things!, anything that can be counted you can find a generating function for,Not explicitly.

In fact, a generating function similar to a Laplace transform, isn't? Does everything have a Laplcase transform?

I disagree, generating functions can be automatically done by the computer.. You mean, given a generating function, the computer can give the terms? Yes, but usually the computer can calculate the terms directly.

thats one of the great things about them is that you DONT have to use much thinking, trust me on this.Seems like it takes a lot of thinking to figure out the generating function in this case.

Besides your method (may) be alright for patters of the form XX...X but what about something like AGCTGATCA?I think that it works with a few modifications. Instead of "A", read "letters that match".
So if the first letter doesn't match, there are f(n-1) sequences that work.
If only the first letter matches, then there are f(n-2), and so on.

Nessus
28th June 2006, 07:52 PM
In fact, a generating function similar to a Laplace transform, isn't? Does everything have a Laplcase transform?
It's been a few years since I have done any Laplace transforms, but they do seem to have similarities
Seems like it takes a lot of thinking to figure out the generating function in this case.
Though I would be hard pressed to find applications, but as I was told by my lecturer, generating functions are very handy for automatic computation of recursive formulas, finding out information on string patterns, trees, and so on.

T'ai Chi
28th June 2006, 08:45 PM
Does everything have a Laplcase transform?


No. There are conditions on the function being integrated.

See 2) at http://www.sosmath.com/diffeq/laplace/basic/basic.html

Yllanes
29th June 2006, 01:01 AM
Though I would be hard pressed to find applications, but as I was told by my lecturer, generating functions are very handy for automatic computation of recursive formulas, finding out information on string patterns, trees, and so on.
Generating functions do have applications. See Knuth's Art of Computer Programming, for example.

An application from mathematics is the partition function, p(n). It measures the number of ways a natural number can be decomposed as a sum of smaller natural numbers. For example, 4 = 1+1+1+1 = 1+1+2 = 1 + 3 = 2 + 2, so p(4)=5. This function seems rather naive, but is actually very complicated and no reasonable closed formula exists for it. Ramanujan conjectured (and then proved with Hardy), that an asymptotic formula is

http://www.randi.org/latexrender/latex.php?%3Cbr%20/%3E%0A%20p%28n%29%20%5Csim%20%5Cfrac%7B1%7D%7B4n%5 Csqrt%7B3%7D%7D%5Cmathrm%7Be%7D%5E%7B%5Cpi%5Csqrt% 7B2n/3%7D%7D,%5Cqquad%20n%5Cgg1%5C%20.%3Cbr%20/%3E

It was later proven that an exact formula is the following

http://www.randi.org/latexrender/latex.php?%5Cdisplaystyle%3Cbr%20/%3E%0A%20p%28n%29%20=%20%5Cfrac%7B1%7D%7B%5Cpi%5Cs qrt2%7D%5Csum_%7Bk=1%7D%5E%5Cinfty%20A_k%28n%29%20 k%5E%7B1/2%7D%20%5Cleft%5B%5Cfrac%7B%5Cmathrm%7Bd%7D%7D%7B% 5Cmathrm%7Bd%7D%20x%7D%20%5Cfrac%7B%5Csinh%5Cleft% 28%20%28%5Cpi/k%29%5Csqrt%7B%5Cfrac23%28x-%5Cfrac1%7B24%7D%29%7D%5Cright%29%7D%7B%5Csqrt%7Bx-%5Cfrac%7B1%7D%7B24%7D%7D%7D%5Cright%5D_%7Bx=n%7D% 3Cbr%20/%3E

where

http://www.randi.org/latexrender/latex.php?A_k%28n%29=%5Csum_%7Bh=1%7D%5Ek%20%5Cdel ta_%7B%5Cmathrm%7Bgcd%7D%28h,k%29,%201%7D%5C%20%5C omega_%7Bb,k%7D%20%5Cmathrm%7Be%7D%5E%7B-2%5Cpi%20%5Cii%20n%20h/k%7D

and w is a primitive 24th root of unity. It's madness to think one can do anything with this formula. However, a simple generating function for p(n) exists (Euler):

http://www.randi.org/latexrender/latex.php?%5Cdisplaystyle%3Cbr%20/%3E%0A%5Cprod_%7Bm=1%7D%5E%5Cinfty%20%5Cfrac%7B1%7 D%7B1-x%5Em%7D=%5Csum_%7Bn=0%7D%5E%5Cinfty%20p%28n%29%20 x%5En%5C%3Cbr%20/%3E

It's very easy to prove formally that this works. With formally I mean neglecting things like convergence, etc. which is how Euler did it, but which isn't really a proof by today's standards. Using this formula it is possible to show countless theorems about p(n) and related functions. A simple one, known to Euler

http://www.randi.org/latexrender/latex.php?%3Cbr%20/%3E%0A%281+x%29%281+x%5E3%29%281+x%5E5%29%5Ccdots= 1+%5Cfrac%7Bx%7D%7B1-x%5E2%7D+%5Cfrac%7Bx%5E4%7D%7B%281-x%5E2%29%281-x%5E4%29%7D+%5Cfrac%7Bx%5E9%7D%7B%281-x%5E2%29%281-x%5E4%29%281-x%5E6%29%7D+%5Cldots%3Cbr%20/%3E

easy to prove with the help of the generating functions. A crazy theorem, stated without proof by Ramanujan (later proven, but this one is certainly not easy),

http://www.randi.org/latexrender/latex.php?%3Cbr%20/%3E%0A%20%5Csum_%7Bm=0%7D%5E%5Cinfty%20p%287m+5%29 %20x%5Em%20=%207%20%5Cfrac%7B%5Cphi%28x%5E7%29%5E3 %7D%7B%5Cphi%28x%29%5E4%7D%20+%2049x%5Cfrac%7B%5Cp hi%28x%5E7%29%5E7%7D%7B%5Cphi%28x%29%5E8%7D,%5Cqqu ad%20%5Cphi%28x%29=%5Cprod_%7Bn=1%7D%5E%5Cinfty%28 1-x%5En%29%3Cbr%20/%3E

For some introductory info on the partition function, check Hardy & Wright, Introduction to the Theory of Numbers, or Apostol's Introduction to Analytic Number Theory. The partition function tends to come rather late in these books, but it doesn't matter, because the rest of the text is devoted to multiplicative number theory (primes, etc.) so the chapter on partitions is usually self contained.

(Sorry for the LaTeX excess).