Quantcast

probabilistic approach to tinderboxing

classic Classic list List threaded Threaded
14 messages Options
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

probabilistic approach to tinderboxing

Hi all,

this is a proposal to get as much focused tinderboxing as possible with limited
resources and unreliable build clients. This also assumes patches under
consideration to be independant (that is: without conflicts) and not dependant
on a specific order. gerrit allow us to have patches independant of order as
they are not yet merged/picked to the branch. We can also easily exclude any
patch on gerrit that conflicts against the existing patches under
consideration: To do so we cherry-pick the new patch on the tip of the branch
and then cherry-pick all patches under consideration on top of that. If it
conflicts, we can report that back.(*)

== Testcase ==

Tinderboxes will be grouped in "testcases": All tinderboxes in belonging to one
"testcase" are assumed to provide the same results for the same input. The
simplest testcase is: "completing a build on platform foo" while more complex
testcases include "completed a build on platform foo and run checks".  A
testcase is assumed to have a rather stable long term empiric value that one
patch does not breaks it: p-innocent(testcase). Whenever a new patch is
commited for consideration we assume its inital "innocence" to be average:

 p-innocent(patch) = p-innocent(testcase)

== Giving a tinderbox a subset to test ==

There are two possibly useful results out of a tinderbox run:
- If the tinderbox run is successful, we can mark all patches as good and
  remove them from the patches under consideration
- If the tinderbox run is a failure and only contains one patch, we can mark
  that patch as bad and remove it from the patches under consideration
- All other cases have no directly deductable action

=== optimistic tinderbox run ===

A completed tinderbox run provides us with the most information, if we assume
it to fail half of the time (biggest insecurity). So, this is the first thing
to give a tinderbox to test:
 - Pick the most innocent looking patches (highest p-innocent(patch))
   - for patches with the same innocence prefer the oldest ones first
 - Add patches as long as the innocence of the patchset:
   p-innocent(patchset) = product(p-innocent(patchn) ...)
   is above 0.5
After sending off the tinderbox with this patchset, do:
   p-innocent(patch) *= p-innocent(testcase)
reducing the innocence of the patches without waiting for results. So we assume
the patches of the set to be partially guilty until proven innocent.

If the tinderbox succeeds, mark all patches as good and remove them from the
patches under consideration.

=== pessimistic tinderbox run ===

A failed tinderbox run provides us with most information, if it contains only
one patch. So if the innocence of one or more patches drops below 0.5:

 p-innocent(dirtypatch)<= 0.5

or if there is currently only one patch under consideration instead of doing
the optimistic approach above, do a pessimistic tinderbox run with just one
patch. Select the patch with the lowest p-innocent(patch) and, if there are
multiple patches with equal p-innocent(patch), select the oldest candidate.
After sending off the tinderbox with just this one patch, do:

 p-innocent(patch)/=p-innocent(testcase)

for all patches in each tinderbox run with this patch (once for each run)
without waiting for the tinderbox to return with a result. So we assume most of
those patches to be more innocent now (hoping the blame is on the dirtypatch).
The suspicous patch should now be back to the initial value:

 p-innocent(dirtypatch)=p-innocent(testcase)

The tinderbox will hopefully mark that patch as good or bad soon anyway and
remove it from the patches under consideration.

== advantages ==
This approach:
- Does not rely on the tinderboxes to ever come back with a result
  -- it should be robust against lost results.
- does not need to test each patch individually, if the average patch quality
  is not too bad
- requires only little bookkeeping on the side of the dispatcher
- should allow us to confidently mark all patches as 'doesnt break master for
  this testcase'

Opinions/Additions/Corrections? Objections? If not, we only need somebody to script
this ;)

Best,

Bjoern

P.S.: Dont show my careless use of 'propabilities' and such to the physics
department at the university where I got my degree. It will just result in
snorty laughs and cruel teasing.

(*) Conflicts might be rare though, so we might even get away with the
optimistic approach: just hoping that considered patches never conflict.Hi all,

this is a proposal to get as much focused tinderboxing as possible with limited
resources and unreliable build clients. This also assumes patches under
consideration to be independant (that is: without conflicts) and not dependant
on a specific order. gerrit allow us to have patches independant of order as
they are not yet merged/picked to the branch. We can also easily exclude any
patch on gerrit that conflicts against the existing patches under
consideration: To do so we cherry-pick the new patch on the tip of the branch
and then cherry-pick all patches under consideration on top of that. If it
conflicts, we can report that back.(*)

== Testcase ==

Tinderboxes will be grouped in "testcases": All tinderboxes in belonging to one
"testcase" are assumed to provide the same results for the same input. The
simplest testcase is: "completing a build on platform foo" while more complex
testcases include "completed a build on platform foo and run checks".  A
testcase is assumed to have a rather stable long term empiric value that one
patch does not breaks it: p-innocent(testcase). Whenever a new patch is
commited for consideration we assume its inital "innocence" to be average:

 p-innocent(patch) = p-innocent(testcase)

== Giving a tinderbox a subset to test ==

There are two possibly useful results out of a tinderbox run:
- If the tinderbox run is successful, we can mark all patches as good and
  remove them from the patches under consideration
- If the tinderbox run is a failure and only contains one patch, we can mark
  that patch as bad and remove it from the patches under consideration
- All other cases have no directly deductable action

=== optimistic tinderbox run ===

A completed tinderbox run provides us with the most information, if we assume
it to fail half of the time (biggest insecurity). So, this is the first thing
to give a tinderbox to test:
 - Pick the most innocent looking patches (highest p-innocent(patch))
   - for patches with the same innocence prefer the oldest ones first
 - Add patches as long as the innocence of the patchset:
   p-innocent(patchset) = product(p-innocent(patchn) ...)
   is above 0.5
After sending of the tinderbox with this patchset, do:
   p-innocent(patch) *= p-innocent(testcase)
reducing the innocence of the patches without waiting for results. So we assume
the patches of the set to be partially guilty until proven innocent.

If the tinderbox succeeds, mark all patches as good and remove them from the
patches under consideration.

=== pessimistic tinderbox run ===

A failed tinderbox run provides us with most information, if it contains only
one patch. So if the innocence of one or more patches drops below 0.5:

 p-innocent(dirtypatch)<= 0.5

instead of doing the optimistic approach above, do a pessimistic tinderbox run
with just one patch. Select the patch with the lowest p-innocent(patch) and, if
there are multiple patches with equal p-innocent(patch), select the oldest
candidate.
After sending of the tinderbox with just this one patch, do:

 p-innocent(patch)/=p-innocent(testcase)

for all patches in each tinderbox run with this patch (once for each run)
without waiting for the tinderbox to return with a result. So we assume most of
those patches to be more innocent now (hoping the blame is on the dirtypatch).
The suspicous patch should now be back to the initial value:

 p-innocent(dirtypatch)=p-innocent(testcase)

The tinderbox will hopefully mark that patch as good or bad soon anyway and
remove it from the patches under consideration.

== advantages ==
This approach:
- Does not rely on the tinderboxes to ever come back with a result
  -- it should be robust against lost results.
- does not need to test each patch individually, if the average patch quality
  is not too bad
- requires only little bookkeeping on the side of the dispatcher
- should allow us to confidently mark all patches as 'doesnt break master for
  this testcase'

Opinions/Additions/Corrections? Objections? If not, we only need somebody to script
this ;)

Best,

Bjoern

P.S.: Dont show my careless use of 'propabilities' and such to the physics
department at the university where I got my degree. It will just result in
snorty laughs and cruel teasing.

(*) Conflicts might be rare though, so we might even get away with the
optimistic approach: just hoping that considered patches never conflict.
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

On Tue, Jun 12, 2012 at 04:45:06PM +0200, Bjoern Michaelsen wrote:
> After sending off the tinderbox with this patchset, do:
>    p-innocent(patch) *= p-innocent(testcase)
> reducing the innocence of the patches without waiting for results. So we assume
> the patches of the set to be partially guilty until proven innocent.

On a second thought that should actually be:
    p-innocent(patch) *= (number of patches in patchset-1)/(number of patches in patchset)

Best,

Bjoern
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by Bjoern Michaelsen
On Tue, Jun 12, 2012 at 04:45:06PM +0200, Bjoern Michaelsen wrote:
> After sending off the tinderbox with just this one patch, do:
>
>  p-innocent(patch)/=p-innocent(testcase)
>
> for all patches in each tinderbox run with this patch (once for each run)

... and this should probably then be:
>  p-innocent(patch)/=0.5*(number of patches in patchset-1)/(numbers of patches in patchset)

as we assume p-innocent(dirtypatch) ~ 0.5 and this reverts half of the 'guilt'
from tinderbox runs with both patches in it.

Best,

Bjoern
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Eike Rathke-2 Eike Rathke-2
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by Bjoern Michaelsen
Hi Bjoern,

On Tuesday, 2012-06-12 16:45:06 +0200, Bjoern Michaelsen wrote:

> This also assumes patches under consideration to be independant (that
> is: without conflicts) and not dependant on a specific order.

Maybe I misunderstood, but the independency and arbitrary order I see as
a problem. People do work in iterative steps.. so if we could at least
assure that patches by the same author are ordered correctly that might
work.

  Eike

--
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD

_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice

attachment0 (205 bytes) Download Attachment
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

Hi,

On Wed, Jun 13, 2012 at 02:03:31PM +0200, Eike Rathke wrote:
> Maybe I misunderstood, but the independency and arbitrary order I see as
> a problem. People do work in iterative steps.. so if we could at least
> assure that patches by the same author are ordered correctly that might
> work.

Well, I simplified a bit here. When using gerrit you would push one patch
separate if it is independant and multiple patches as a series if they are
dependant. gerrit handles that correctly as a 'patch series'. I would not want
to testbuild in the middle of one such series of patches -- so whenever there
are multiple dependant patches in gerrit they would be viewed as 'one' by the
tinderboxes. If one has dependant patches _and_ wants to have them tinderboxed
separately one could push them in multiple sets. However, for now I consider
that scenario a cornercase (its usually good enough to test the final state of
a series of dependant patches).

Best,

Bjoern
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Norbert Thiebaud Norbert Thiebaud
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

On Wed, Jun 13, 2012 at 7:20 AM, Bjoern Michaelsen
<[hidden email]> wrote:
>
> Well, I simplified a bit here. When using gerrit you would push one patch
> separate if it is independant and multiple patches as a series if they are
> dependant. gerrit handles that correctly as a 'patch series'. I would not want
> to testbuild in the middle of one such series of patches

Well... it _should_ be buildable at each step of the patch series...
if not, clean-it up, it is not yet 'published' in the git sens of the
term so you still have a chance to git rebase -i to fix things up.

Building at each commit is very important to have useful bisections.

Norbert
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

On Wed, Jun 13, 2012 at 08:00:12AM -0500, Norbert Thiebaud wrote:
> Well... it _should_ be buildable at each step of the patch series...
> if not, clean-it up, it is not yet 'published' in the git sens of the
> term so you still have a chance to git rebase -i to fix things up.
>
> Building at each commit is very important to have useful bisections.

But we are doing that not now as pushes are atomic (or more precise:
ref-updates are), so tinderboxes wouldnt see a half-push. Also all patches in a
series have the same owner, so blaming is easy. The first goal of the
tinderboxes is to have the master tip buildable all times -- having it
buildable at every random commit is secondary. Also: bibisect etc. could be
taught to only look for commits where the next commit date is at least 5 minutes
off.

Best,

Bjoern
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Norbert Thiebaud Norbert Thiebaud
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

On Wed, Jun 13, 2012 at 8:53 AM, Bjoern Michaelsen
<[hidden email]> wrote:
> But we are doing that not now as pushes are atomic (or more precise:
> ref-updates are), so tinderboxes wouldnt see a half-push.

true. but just because we don't, doesn't mean we shouldn't

> Also all patches in a
> series have the same owner, so blaming is easy.

blaming is one thing, quality is another.

> The first goal of the
> tinderboxes is to have the master tip buildable all times -- having it
> buildable at every random commit is secondary.

secondary but important nonetheless.

> Also: bibisect etc. could be
> taught to only look for commits where the next commit date is at least 5 minutes
> off.
No, that would not help. two commit cane within 5 minutes and yet
independant and 2 commit can be 5 minutes apart and yet belong to the
same patch-serie
iow, the timestamp of a commit is not very helpful in general wrt to tinderbox

Norbert
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by Norbert Thiebaud
On 06/13/2012 03:00 PM, Norbert Thiebaud wrote:
> On Wed, Jun 13, 2012 at 7:20 AM, Bjoern Michaelsen
> <[hidden email]>  wrote:
>>
>> Well, I simplified a bit here. When using gerrit you would push one patch
>> separate if it is independant and multiple patches as a series if they are
>> dependant. gerrit handles that correctly as a 'patch series'. I would not want
>> to testbuild in the middle of one such series of patches
>
> Well... it _should_ be buildable at each step of the patch series...

I think you are confusing buildability after each patch of a series has
been applied in sequence (which is a desirable property indeed) vs.
buildability with each patch of a series applied individually ("out of
context," what Eike's concern was).

Stephan
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Lubos Lunak Lubos Lunak
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by Bjoern Michaelsen
On Tuesday 12 of June 2012, Bjoern Michaelsen wrote:

> Hi all,
>
> this is a proposal to get as much focused tinderboxing as possible with
> limited resources and unreliable build clients. This also assumes patches
> under consideration to be independant (that is: without conflicts) and not
> dependant on a specific order. gerrit allow us to have patches independant
> of order as they are not yet merged/picked to the branch. We can also
> easily exclude any patch on gerrit that conflicts against the existing
> patches under
> consideration: To do so we cherry-pick the new patch on the tip of the
> branch and then cherry-pick all patches under consideration on top of that.
> If it conflicts, we can report that back.(*)
>
> == Testcase ==
>
> Tinderboxes will be grouped in "testcases": All tinderboxes in belonging to
> one "testcase" are assumed to provide the same results for the same input.
> The simplest testcase is: "completing a build on platform foo" while more
> complex testcases include "completed a build on platform foo and run
> checks".

 Where are we going to get these tinderboxes? The page for master lists 17
tinderboxes, out of which only about 11 build regularly. Out of which there
are 4 linux-gcc tinderboxes, one linux-clang, 1(2?) macosx, 3 windows and 1
mingw. That, not counting any further division like running or not running
checks, gives 5 testcases if I understand it correctly, and building
successfully in any of them does not guarantee that the rest builds as well.

 So, the thing I do not understand, how is this (complex, as far as I can say)
setup supposed to improve anything? There does not seem to be any big
difference between dedicating X tinderboxes to it or just 1-2 tinderboxes. If
we put a number of tinderboxes on this, they still won't test the patches
enough, while there won't be that many left for checking that the actual
repository builds fine, which will be needed anyway.

 If patches in gerrit[*] need to be tested, it should work just as well to use
one of the really fast tinderboxes for checking that it builds, which will
cover a majority of possible problems, then maybe one more can build
including checks, and if any problem gets through, it'll be caught by the
normal tinderboxes, which are needed anyway (for direct commits, some other
seemingly unrelated commit breaking the patch meanwhile, etc.).


[*] Speaking of which, when will we get told about this gerrit thing, besides
random remarks, or have I missed it?

--
 Lubos Lunak
 [hidden email]
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

Hi,

On Thu, Jun 14, 2012 at 12:36:12PM +0200, Lubos Lunak wrote:
>  Where are we going to get these tinderboxes? The page for master lists 17
> tinderboxes, out of which only about 11 build regularly. Out of which there
> are 4 linux-gcc tinderboxes, one linux-clang, 1(2?) macosx, 3 windows and 1
> mingw. That, not counting any further division like running or not running
> checks, gives 5 testcases if I understand it correctly, and building
> successfully in any of them does not guarantee that the rest builds as well.

Right -- although I would count the linux-gcc-with-subsequentcheck tinderbox as
a single test case.

>  So, the thing I do not understand, how is this (complex, as far as I can say)
> setup supposed to improve anything? There does not seem to be any big
> difference between dedicating X tinderboxes to it or just 1-2 tinderboxes. If
> we put a number of tinderboxes on this, they still won't test the patches
> enough, while there won't be that many left for checking that the actual
> repository builds fine, which will be needed anyway.

I dont think we should split up tinderboxes between this and repository builds
(I asssume you mean testbuilding the head of master with that). As the patches
are cherrypicked on the head of master with this, master is already tested
along with this. However, while people are in the beginning still pushing even
risky patches directly to master we cant assume master to be flawless all of
the time -- so the proposed algo should be extended with "build the master you
cherry-picked upon once after each failed build".

What is this setup improving? Simple:
- find buildbreakers _before_ they hit master
- with gerrit, you can push your patch, wait for the tinderbox result you need
  and _then_ put it on master
  (so this gives everyone easy access to windows test builders, which as we all
  know are a pain to set up, and also subsequentcheck runs)
- reduce tinderbox spammage as the list of possible offenders is smaller -- and
  esp. is not growing with every build
- allows greenlighting bigger sets of patches with one compile (essential on
  slower platforms, f.e. Windows)

>  If patches in gerrit[*] need to be tested, it should work just as well to use
> one of the really fast tinderboxes for checking that it builds, which will
> cover a majority of possible problems, then maybe one more can build
> including checks, and if any problem gets through, it'll be caught by the
> normal tinderboxes, which are needed anyway (for direct commits, some other
> seemingly unrelated commit breaking the patch meanwhile, etc.).

That doesnt scale that well. Also note that the tinderboxes building plain
master should end up a scenario that is very, very rarely needed (it will still
be done of course)

Best,

Bjoern

> [*] Speaking of which, when will we get told about this gerrit thing, besides
> random remarks, or have I missed it?

Look for a mail with "ACTION REQUIRED" in the subject in the libreoffice
mailing list archive.
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Norbert Thiebaud Norbert Thiebaud
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by sberg
On Thu, Jun 14, 2012 at 1:59 AM, Stephan Bergmann <[hidden email]> wrote:
> On 06/13/2012 03:00 PM, Norbert Thiebaud wrote:
>
>
> I think you are confusing buildability after each patch of a series has been
> applied in sequence (which is a desirable property indeed) vs. buildable
> with each patch of a series applied individually ("out of context," what
> Eike's concern was).

When you apply a series in sequence, at any point the result must be
buildable...
1/ it is good practice
2/ if that is not the case bisection is much harder

If you pushed directly to master and happened to push something that
is not buildable, then of course you can't go back and fix it...
but with gerrit, as long as your patches series is not 'merged' then
you can git rebase interactive to fix the offending commit.

Yes, you cannot pick a random patch in a series, apply just that and
expect it to be buildable
but if my series has 3 patches 1,2,3

the If I apply (1) (1,2) or (1,2,3) I expect the result to be buildable

Norbert
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
sberg sberg
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

On 06/14/2012 04:57 PM, Norbert Thiebaud wrote:

> On Thu, Jun 14, 2012 at 1:59 AM, Stephan Bergmann<[hidden email]>  wrote:
>> On 06/13/2012 03:00 PM, Norbert Thiebaud wrote:
>>
>>
>> I think you are confusing buildability after each patch of a series has been
>> applied in sequence (which is a desirable property indeed) vs. buildable
>> with each patch of a series applied individually ("out of context," what
>> Eike's concern was).
>
> When you apply a series in sequence, at any point the result must be
> buildable...
> 1/ it is good practice
> 2/ if that is not the case bisection is much harder
>
> If you pushed directly to master and happened to push something that
> is not buildable, then of course you can't go back and fix it...
> but with gerrit, as long as your patches series is not 'merged' then
> you can git rebase interactive to fix the offending commit.
>
> Yes, you cannot pick a random patch in a series, apply just that and
> expect it to be buildable
> but if my series has 3 patches 1,2,3
>
> the If I apply (1) (1,2) or (1,2,3) I expect the result to be buildable

Yes sure (and I think nobody is arguing with that).  Its just that I
felt you guys where talking past each other, one talking about an "out
of context buildability" problem while one discussing "sequential
buildability."  So, best forget that I said anything, anyway---seems to
produce more confusion than clarification...  ;)

Stephan
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Bjoern Michaelsen Bjoern Michaelsen
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: probabilistic approach to tinderboxing

In reply to this post by Norbert Thiebaud
On Thu, Jun 14, 2012 at 09:57:48AM -0500, Norbert Thiebaud wrote:
> When you apply a series in sequence, at any point the result must be
> buildable...
> 1/ it is good practice
> 2/ if that is not the case bisection is much harder

... but it has a lower importance than having the tip of master buildable all
the time. Which is why a tinderbox should leave testing intermediate states
aside as long as there are still things to test that ensure the tip of master
to stay buildable.

Best,

Bjoern
_______________________________________________
LibreOffice mailing list
[hidden email]
http://lists.freedesktop.org/mailman/listinfo/libreoffice
Loading...