Search of 17 clues puzzles in a solution grid


N 2X2Y loop


As explained in the overview, after band1 and band 2 expansion, the process is split using pieces of the expansion tables having the same first 2 cells.


Note: One could think, for solutions grids having huge amount of  valid bands, of another split, for example 3X2Y no test has been made in that direction


The expansion process sort the valid bands in the appropriate order to do that and the index to use is built during the expansion phase.


The 2X2Y loop is still a preparation phase (most of the job done here was in the preliminary tasks when the 2X2Y loop did not exist).


The work is heavily oriented to speed up the “part 1”,  finding all XY (valid band1 ; valid band2) and applying filters to come to a potential valid set of bands 3.


The main task in the loop is to shrink UAs and GUAs tables, building cells vectors to use later and preparing each ‘Y’ (preparation of ‘X’ will come later) to apply the filters on UAs GUAs.


The 2X2Y process is called twice, once for the 566 distribution, and once for the 656 distribution.


N0 Moving tables in a fix area


As said in the overview, the 2X2Y loop is part of the cache optimization. Expanded tables used in the loop are moved in permanent areas belonging to the G17INDEXSTEP structure.


The move is done using the intrinsic instruction

__movsq(Rd, Ro, g17b.xyexpand_size*nw1);


This is possible only if the elementary item in the XY_EXPAND is done of 64 bits values, what is true in the code.


Here Rd is a permanent area holding the relevant  XY_EXPAND  table.

Ro is the start address of the piece of the expanded table corresponding to the 2X2Y loop

g17b.xyexpand_size is the number of 64 bits blocs in  XY_EXPAND  worked out in the G17B constructor


N1 shrinking tables


Inside the loop, the 2X first cells and the 2 Y first cells will always be part of the solution. All UAs GUAs having the corresponding cells will be hit in any XY of that 2X2Y.


Second key point, dead cells (cells that will never come again) can be identified here as in the Gary McGuire process.


From the UAs and GUAs tables, we can extract items not hit by the 2X;2Y and reduce each UA to cells not yet dead.


If an empty UA appears, We have a dead branch; This UA can not be hit in this 2X2Y;


If an empty GUA appears, we know that the corresponding socket will stay valid for the entire 2X2Y step;


The logic would be to keep a separate list of such sockets and to erase all GUAs of the socket in the 2X2Y step.


This in not yet implemented, the UA is kept empty, giving indirectly the same result, in a more expensive way.


N2 Limits for the  2X2Y UAs GUAs tables;


I started with low limits (256 UAs) for the UAs and GUAs tables in a 2X2Y pass.

To-day, I am convinced that best results will come with a flexible limit accepting bigger tables.

I already pushed the limit to 640 UAs for the GUAs table. As soon as possible, I’ll increase the limit for bands 1+2 UAs..

In the current code, increasing the UAs limit has several  main consequences :


UAs vector are bigger (can be)

The fix area storing ‘X’ and ‘Y’ vectors is bigger

The process (chunk loops) must be extended to the biggest size.


The run time for small numbers of valid UAs is not sensitive to the upper limit, so I have no doubt that the optimum is with a high limit.

Note: it is necessary to have a complete view of the topic to cover the “brute force check of an XY”, in the inner most loop of the “part 1”.


N3 building vectors and applying them in the ‘X’ ‘Y’ tables


For readers not familiar with the “vector concept”, intensively used in the Gary McGuire proof that no 16clues puzzle exists, a short reminder.


Let’s assume that we have UAs in the range 1-64 UAs and that you just want to check that all UAs are hit by a group of cells.


One way is to look at the list of UAs, checking whether each of them is hit by the set of cells.


This is not too hard and is done in the program in many places. A bit field of the set of cells is created and each item of the table is compared to the set of cells.


When the table can contain a significant number of items and if the number of cells is relatively small, we can have a faster way to do that.


For each cell, (we are in a range 1-64 UAs), a 64 bits field is created.”64 bits vector”, each bit corresponding to one UA of the table.


In our case, the start status is  ‘all bits to 1” but  a base vector is worked out to initiate the “logical and” with all unused bits sets to 0.


If the UAs of rank ‘n’ contains the cells ‘x’, the bit of rank ‘n’ in the 64 bits field is set to 0. Once all cells vectors have been created, To know if a set of  cells hits all UAs, we have just to combine cells vectors in a “logical and” to have the final status.


If we are looking for a null status (valid solution for bands 1+2) the final test is very simple, but even if we expect to have still valid UAs at the end (as for GUAs), it is better to use vectors and to process the “and” vector using the _BitScanForward intrinsic instruction to extract the bits ‘1’. So vectors are built for both tables of UAs.


This is for 1-64 UAs, for more UAs, we just extend the bits field in as many 64 bits pieces as necessary.


N 3.1 GUA vectors per socket


For GUAs, the first GUA not it of a given socket is enough to make the socket valid.

In the XY process, the final GUAs vector will be scanned for bit still ‘1’.


As soon as such a bit is seen, all the other bits of the same socket can be ignored. As all UAs of a given socket are in a continuous area, this will be done applying a “socket vector”  to the 64 bits area being processed and the following one. (one dummy 64 bits area has been added to the working table to avoid the need to test if it is the last bloc)


So here, the 81+81 socket vectors are built as well.


N4 applying them in the ‘X’ ‘Y’ tables


The final filter would be to do the “logical and” for each XY set of cells (11 cells here 6+5). At the right moment, the task can be done partially for the ‘X’ set of cells and the ‘Y’ set of cells and moreover, the 2X 2Y permanent part of the loop can be prepared once.


The “right moment” is within the 656 or 566 process of the 2X2Y loop for the ‘Y’ part and in the chunk process for the ‘X’ part if X is the outer loop in the ‘X’ x ‘Y’ generation. The target here is to do it once, but to keep as low as possible the volume of data to transfer.


At the end, for a given XY, we just have to combine the X vector v[X] and the y vector v[Y].

 Again, this is done for both tables, UAs and GUAs.


N5 Memory optimization for the cache


To optimize the transfer of data in the ‘Y’ field, but mainly to optimize the cache view, the ‘X’ vectors and the ‘Y vectors have to be as much as possible in a contiguous area.


Using for example a vc[a][b] table where b is the dimension for the maximum size of the bits field is not good for the cache view.


Using as many vc[a] sub tables is slightly better for the cache view, but not very good for the innermost loop of the XY process (part 1).


The v[X] and v[Y] are built in a dynamic mode to have, at the end something equivalent to vc[a][b] with b set to the actual size of the vector.


This is done in buffers tvuas;tvguas for the ‘Y’ part (located in the structure G17INDEXSTEP and in buffers vcx;vcxg for the ‘X’ part (located in the G17CHUNK structure). The ‘Y’ chunk component is sent in block to the chunk area (vyc,vygc)


N6 Launching the chunks


This is the last task in the 2X2Y loop, The code is in Do_Common3(). The 2X2Y interval is split in chunk depending on the chunk option; here a chunk has usually 256 ‘X’ and 256’Y’, less for the last ‘X’ and the last ‘Y’.


For each chunk, the relevant part of the XY-EXPAND index file is sent to the chunk area

Same for the ‘Y’ vectors for UASQ and GUAs


The ‘X’ file of vectors id built directly in the chunk area


And the control is given to the chunk process


N7 Last changes to have flexibility and test other cache options


In the code, part of the options were “in hard”. This is true for the  maximum size of the tables of UAs Gus accepted if the 2X2Y process, but also for the chunk size.


The code has been revised to have these driven by 3define constants.

In the same time, the alternative chunk organization that I had in mind has been drafted and coded under control of the

#define G17CHUNMODEA


And to cover all changes minor changes have been made during the cleaning of the code and errors could have been done in the cleaning;


At the end, the released code has to be retested. To have a change to meet the due date for the final delivery of the first version, I’ll start the validation tests in parallel and most of them will be done after the final fist release.