Professor is hosted by Hepforge, IPPP Durham


I've generated some data to test whether the interpolation quality depends on the distribution of the parameter points used. At the moment I've only used a paraboloid with a Gaussian error added.

Each data set was generated as follows:

  1. Choose a random point in the N-dim unit cube as the center of the paraboloid. The paraboloid evaluates to (\vec{p}-\vec{p}_{center})^2.
  2. Run many interpolations (I ran 5000):
    1. Choose M random points from the unit cube and evaluate the paraboloid at them, with errors added.
    2. Create a BinDistribution and an Interpolation and actually interpolate.
    3. Compare the interpolation result with the original paraboloid using T randomly located test points.
    4. Save the resulting "chi2" and the points used in a file.
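
The steps above can be sketched roughly as follows. This is a minimal illustration, not the Professor code: it assumes the interpolation is a full second-order polynomial fitted by least squares with NumPy instead of using BinDistribution and Interpolation, it uses the smearing width `sigma` as the error in the chi2 (the actual studies may use the interpolation's own error estimate), and it keeps the results in memory instead of writing a file. The names `quadratic_features`, `paraboloid` and `run_one_interpolation` are hypothetical.

```python
import itertools

import numpy as np


def quadratic_features(points):
    """Design matrix of a full second-order polynomial: 1, p_i, p_i * p_j (i <= j)."""
    points = np.atleast_2d(points)
    n_dim = points.shape[1]
    columns = [np.ones(len(points))]
    columns += [points[:, i] for i in range(n_dim)]
    columns += [points[:, i] * points[:, j]
                for i, j in itertools.combinations_with_replacement(range(n_dim), 2)]
    return np.column_stack(columns)


def paraboloid(points, center):
    """Test function (p - p_center)^2, evaluated for every row of points."""
    return np.sum((np.atleast_2d(points) - center) ** 2, axis=1)


def run_one_interpolation(n_dim, n_runs, n_test, sigma, center, rng):
    """One pass of steps 2.1 to 2.3; returns (chi2, anchor points)."""
    # Step 2.1: M random anchor points, paraboloid values smeared with Gaussian noise.
    anchors = rng.random((n_runs, n_dim))
    noisy = paraboloid(anchors, center) + rng.normal(0.0, sigma, n_runs)
    # Step 2.2: stand-in for the interpolation (assumption: least-squares quadratic fit).
    coeffs, *_ = np.linalg.lstsq(quadratic_features(anchors), noisy, rcond=None)
    # Step 2.3: compare with the true paraboloid at T random test points.
    test_points = rng.random((n_test, n_dim))
    residuals = quadratic_features(test_points) @ coeffs - paraboloid(test_points, center)
    chi2 = np.sum((residuals / sigma) ** 2)  # assumption: sigma used as the error
    return chi2, anchors


rng = np.random.default_rng(0)
n_dim = 2
center = rng.random(n_dim)  # step 1: random center in the unit cube
n_coeffs = quadratic_features(np.zeros((1, n_dim))).shape[1]
# Step 2.4 (saving) is replaced by collecting the results in a list.
results = [run_one_interpolation(n_dim, n_coeffs + 5, 1000, 0.05, center, rng)
           for _ in range(20)]
print("median chi2/ndf:", np.median([chi2 for chi2, _ in results]) / 1000)
```

Keeping the anchor points together with each chi2 is what allows the chi2 to be plotted against distance measures of the anchor distribution afterwards.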

I've created data varying N (the dimension of the parameter space) from 1 to 10 and M (the number of MC runs used for the interpolation) from the minimal number to the minimal number + 10.

Results (Jan. 17.) - two minima distribution

I've created data and plots with an underlying distribution that is not parabolic but has two valleys. The meaning of the histograms is the same as below in the results for December 10. A page with all histograms is here:

The conclusions do not differ from the older results:

  1. The location of the anchor points has no influence on the interpolation quality.
  2. The number of MC runs influences the chi2 values: I looked at the 7D data, and using 37+4 runs changed the chi2 scale from 1e0...1e6 (37 MC) to 1e0...1e3 (42 MC), and the limits of the chi2 band dropped from 1e1...1e3 (37 MC) to ≤ 1e0...≤ 1e1. Further increasing the number of runs did not have such a strong impact.

Results (Dec. 10.)

I've created new plots with a log-scale chi2/ndf axis and a third column which displays the unnormalised histogram entries. Unfortunately, some tick labels still overlap.

For the ndf I took 10000 (= the number of test points). I did not use any cuts.

The main conclusions which can be drawn are:

  1. There is no strong dependence of the chi2/ndf on any of the distance measures that would require some kind of parameter point selection.
  2. However, the number of MC runs strongly influences the chi2/ndf values.

Below I'll compare some plots in more detail. Unless mentioned otherwise, "histogram" means the normalised histograms (i.e. the first column). A page with only the discussed plots is under:

2D case

Comparing the plots with 7 and 11 MC runs, the following can be observed. First, when using the minimal number of MC runs (here 7), the chi2/ndf results are very poor. A reason for this might be that some of the test points lie outside the region of parameter space covered by the points used for the interpolation. There we are no longer interpolating but extrapolating, which results in an underestimation of the error and thus in a chi2/ndf that is too large. Second, the chi2/ndf band is quite flat in the two minimal-distance histograms compared to the average-distance histograms.

When the number of MC runs is increased by 4, the chi2/ndf improves: the scale changes from 1e-2...1e6 to 1e-2...1e3, and the chi2/ndf values in the regions where most entries lie (i.e. the yellow spots in the third column) are approximately 1. Further increasing the number of MC runs results in even lower chi2/ndf values.

7D case

When the number of dimensions is increased, the chi2/ndf bands in the histograms become flat. No strong dependence of the chi2/ndf value on any of the distance measures can be observed.

As in the 2D case, increasing the number of MC runs results in better chi2/ndf values: the chi2/ndf scale changes from 1e0...1e6 (37 MC runs) to 1e-1...1e2 (47 MC runs). But the chi2/ndf values do not cluster around 1 as desired.

The reason for this behaviour might be that the number of interpolation coefficients scales with D^2, D being the number of dimensions, so the number of additional MC runs needed would scale the same way.
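
To make the D^2 scaling concrete, here is a small sketch that counts the coefficients, assuming the interpolation is a full second-order polynomial (1 constant, D linear and D(D+1)/2 quadratic terms). The function name `n_quadratic_coefficients` is hypothetical and not part of Professor.

```python
def n_quadratic_coefficients(n_dim):
    """Number of coefficients of a full second-order polynomial in n_dim parameters.

    Assumption: 1 constant + n_dim linear + n_dim * (n_dim + 1) / 2 quadratic terms.
    """
    return 1 + n_dim + n_dim * (n_dim + 1) // 2


for n_dim in (1, 2, 7, 10):
    print(n_dim, n_quadratic_coefficients(n_dim))
# prints:
# 1 3
# 2 6
# 7 36
# 10 66
```

Under this assumption the counts for 2D and 7D are 6 and 36, one less than the minimal run numbers 7 and 37 quoted on this page. That would be consistent with requiring one run more than the number of coefficients, but this is not stated here.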

Results (Dec. 5.)

I've created new chi2 vs. dp plots for all 4 different dp's, with errors attached to the bin entries and error estimates used in the chi2 calculation. An HTML page with links to all images is available. Unfortunately, the tick labels are messed up; I am trying to fix this.

To calculate the chi2 I used T=10000 test points, so I would expect a chi2 of 10000. The histograms have a chi2 cut at 1e6. With increasing dimension, and when using the minimal number of MC runs, the number of interpolations whose chi2 ends up in the overflow bins increases, reaching 1844 of 5000 interpolations for 10D.

A reason for this behaviour might be that the error used to smear the paraboloid is big: at the borders of the unit cube it is 5% or bigger. The exact scaling is: 0.05 * 0.5 * sqrt(dims).
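
For orientation, the quoted expression can simply be evaluated for the dimensions used here. The sketch below does only that; the function name `smearing_error` is hypothetical, and how the value enters the smearing is not specified beyond the formula above. Note that 0.5 * sqrt(dims) is the distance from the center of the unit cube to one of its corners.

```python
import math


def smearing_error(n_dim):
    """Evaluate the quoted scaling 0.05 * 0.5 * sqrt(dims)."""
    return 0.05 * 0.5 * math.sqrt(n_dim)


for n_dim in range(1, 11):
    print(f"{n_dim:2d}  {smearing_error(n_dim):.4f}")
```

The value grows from 0.025 in 1D to about 0.079 in 10D.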

On the other hand, the chi2 improves with an increasing number of MC runs: almost all chi2 entries are found in the lower 2 histogram rows. So I think the expected behaviour is observed: the interpolation result strongly depends on the number of MC runs, and using the minimal necessary number is not a good idea.

However, there is still no method to decide how many more runs are needed.


Old lego plots without errors can be found under OldResults.

Last modified on Jan 17, 2008, 4:26:39 PM