Search of 17 clues puzzles in a solution grid

CONTEXT and General overview page 4

H Process optimization

H1 timing considerations

H1 sorting entries

H2 Adding a layer 2cells X;2 cells Y

H3 Preparing the main filter

H4 Optimizing cache

 H Process optimization


In chapters B,C,D we have seen how valid bands 1+2 are  produced filtered and checked.

Optimization of the corresponding process can reduce the run time in a factor 1:10.


Optimization factors are divided in 3 groups:


1) selection of band1 and band2 order

2) Layer with 2 cells X and 2 cells Y un changed

3) Cache sensitivity


Timing considerations


Till now, preliminary tasks including :


generation of an entry band 1+2 and of the relevant band3 set

Collection of all start UAs and GUAs

Search of valid bands 1 and valid band1


Stays below 50 ms and usually around 30 ms.


Processing one entry, the number of XY can vary in a very large range,

but an average  2x10000x50000 (one billion XY) seems a good rough first estimate


For each XY, we have to see if all UAs are hit.

Less UAs at the start will give more XY hitting all UAs, but most of them will not give a valid bands 1+2.

With more UAs, the best chance to find a UA not hit by an XY is in the set of small UAs


All tests show that the best results are achieved by far using as band 3 the band having the highest number of valid band with 6 clues. The reason is quite simple less XY and less valid bands 1+2 to process.


The process requires to combine each valid band 1 to each valid band 2. The simplest way to do it (2  loops) gives poor results with modern processors using caches. In this case, the optimum is likely different from one processor to another one, but similar options have to be applied.



H1 Selection of band1 band2 order


Here the key factor is the small weight of the preliminary tasks compared to the effect on the XY count.


Changing the priority in the band order changes the number of ED bands 1+2, and likely in a similar ratio the number of valid bands 1+2.


And when the number of ED bands grows, the average number of bands 3 per “entry” is smaller leading to less GUAs and more Band1+2 killed by lack of band 3 active after the first filters.


Using bands ordered of the lowest number of valid band with 6 clues, enumeration gives 610 163 364 ED bands for the 5 472 730 538 ED solution grids.


In blue’s process, the corresponding table 983 959 110  ED bands 1+2 is loaded in memory. Here, we use the enumeration program to produce in a batch the chunk of ED bands to study. Producing the next entry is expected to be in average less than 1ms (entry plus relevant bands 3).


In fact, but this is not yet ready, it is nearly sure that a tailored made file has to be prepared to process the case 665 exchanging band 2 and band3.


Note: a very small number of entries will produce nothing (both bands 1 and 2 requiring 6 clues).



H2 layer with same 2Xcells same 2Ycells


This is an important feature having a global very positive effect on performance although leading to a small penalty for solution grids having a small number of XY.


To stay as simple as possible and due to the fact that the penalty is small, the layer is applied to all solution grids.


The basic idea is the following.


Band valid puzzles are created using UAs. Doing so, a kind of native index appears, each cell used in the first UAs becoming a dead cell in the branch.


The minimum size for that index is 16 items (2 disjoint UAs of size 4) and the maximum seen size is 116 items. Most often, the number of items is in the range 30-50.


As in all UAs expansion, The first cell will appear at the start and never again ... (see the concept of “dead branch” in the Gary McGuire article).


Based on that index, we can define a first loop for subgroups ‘X’’Y’ having the same four cells. The minimum number of such subgroups is 256 (16*16) so as we will see with what is done for each group; the result can be a penalty if the count of XY is small.


Without that layer, cell vectors are built once for the entire list of  UAs and GUAs. (a vector is as in the McGuire process a bit field having one bit for each UA).  Here the bit is set to 1 except is the corresponding UA is hit by the cell. Combining (logical and) the vectors for a set of cells gives the list of UAs not hit in a very quick way.


Adding the layer, 2 important points can be considered


For a group 2X2Y part of the UAs will always be hit and we can work on a reduced list,


We have now evidence of “dead cells” “dead branches”. And the UAs not hit can be reduced to cells not yet dead. This can produce empty UAs showing a dead branch (such UAs would never be hit), but it does not seem very frequent in that process.


So in general, we can build the cell vectors for each 2X2Y group on the reduced list of UAs limited to the “active” cells.


The UAs bands1+2 table size is limited to 256 UAs for a given 2X2Y, but most often, the actual size will be less than 128.


The same process is applied to the GUAs table and the same 256 limit is applied. In this case, vectors are also created for each active GUAs id (81 GUA2s id, 81 GUA3s id) to accelerate the search of active GUAs for a given XY.


Doing all this when the average number of XY per 2X2Y is small is more expensive than to do it once, but the differential is not big enough to push to a double process.


Another important possibility appears with UAs collected when the validity of the bands 1+2 is checked with a negative result;


These UAs can be added to the main table and filtered in vector mode in the next 2X2Y. This is done for the  smallest UAs  If the added UA is big, it has a smaller chance to be of any use out of the close vicinity of the XY checked.


H3 Preparing the main filter


 Having produced a XY, the first task will be to check if the list of UAs bands 1+2 is fully hit. On UA not hit is enough to reject the corresponding XY.


We combine in the process a valid band ‘X’ with ‘x’ clues to a valid band ‘Y’ with ‘y’ clues.

For each 2X2Y we have up to 256 UAs so “cell vectors“ have up to 256 bits.


As soon as cell vectors are produced, we can prepare for each ’X’ and each ’Y’ of the 2X2Y group the combined situation of the ’x’ and ’y’ cells.


Doing so, the final test when a XY is analyzed is to “and”  the ‘X’ vector and the ‘Y’ vector. (v[X] & v[y]).


If ‘X’ is the outer loop and ‘Y’ the inner loop, v[Y] will be done in once for each 2X2Y, but as we will see in the next chapter, the v[X] will be done later.


The  frame for the process will equivalent to


for(int i=0;i<nX;i++) for (int j=0;j<nY;j++)

If(v[X] & v[Y]   null) this is a possible XY


And the answer will be “no” for  more than 95% of the XY.


The process uses 64 bits fields. The memory location and the code are optimized for each valid number of such fields (1 to 4 with  a limit of 256 UAs)


H4 Cache Optimization


The task described in H3 (main filter) is highly cache sensitive. Cache optimization is not easy to implement as it will be tightly linked to the processor structure. What is proposed here should be good for all modern processors


In the program, the “cache sensitivity” is managed in the following way :


The XY generation is split in chunks of small size (1-256 X; 1-256 Y)



A buffer of  XY passing the UAs filter is built  in a 256 x 256 bits field.

Then the buffer is scanned to send corresponding XY to the next step.


Data for the UAs filter are collected or built in a fix area to have the inner loop in cache if possible.


The v[Y] have been created in an area specific to the 2X2Y group. This has been done with dynamic handling of the size to have the active part of the vectors occupying a continuous area.

 The v[Y] corresponding to the chunk is moved in block in the chunk ‘Y’ memory buffer.


The v[X] is directly created in the chunk ‘Y’ memory buffer


Then one of the four 2 loops is executed. Here above, the loop for the case where the number of UAx band 1+2 is in the range 64-127


              case 2:{// likely the highest frequency

                     for (int ix = 0; ix < nxc; ix++, px = &px[2], tbits = &tbits[4]){

                            register uint64_t *py = vyc;

                            for (iy = 0; iy < nyc; iy++, py = &py[2]){

                                   if (px[0] & py[0]) continue;

                                   if (px[1] & py[1]) continue;

                                   {// passing UAs filters flag it in       chunkbuf[256][4];

                                          register uint64_t B = 1; B <<= iy & 0x3f;// 1<<(iy%64)

                                          tbits[iy >> 6] |= B;







And the buffer is scanned using the following code


       for (ix = 0; ix < nxc; ix++){// primary loop on 'X'

              g17xy.xxe = txec[ix];

              __movsq(g17xy.vxg, &vxgc[indexstep.n64vgua * ix], 4);

              uint64_t * tbits = cbuf[ix];

              for (int iyc = 0,dy=0; iyc < 4; iyc++,dy+=64){

                     uint64_t x = tbits[iyc];

                     unsigned long iy;

                     while (_BitScanForward64(&iy, x)){

                            register uint64_t b=1;


                            x ^= b;///clear bit

                            g17xy.yye = tyec[iy + dy];

                            __movsq(g17xy.vyg, &vygc[indexstep.n64vgua * iy], 4);






One can note here the use of intrinsic instructions




Use of such instructions play a key role in the global performance

And the ‘X’ ‘Y’ other data are again sent in a fix area for the next steps in the process.


We have covered the optimization in the critical loop. Other optimization factors have a local effect and will be discussed in the corresponding chapters.


I UAs generation


In this process all UA generation is done using brute force a process that I learned from mladen Dobrichev.

The topic has a wider scope so I open recently a specific thread to explain how it works



A serious update of the thread has to be done for 2 main reasons


UAs generation for bands 1+2 has been extended to more digits and to special patterns

The process has been simplified and must still be improved.


Here the main difference with blue’s process is that many more UAs are produced at the start for bands 1+2. The current limit set is 64+2*128 = 320 UAs. (64 to 66 in blue’s process)


64 UAs for the first filter

256 UAs for the complementary filter.


As we are in the area where the cache sensitivity is very high, it can be difficult to define the optimum.