A trie encodes a set of strings, represented by concatenating the characters of the edges along the path from the root node to each corresponding leaf. The common prefixes are collapsed such that each string prefix corresponds to a unique path.
4.1. The Trie Solution
The problem specification stipulates a grid of size $n\times m$, where $n$ is the length of the given word that must be placed in the first column, and $m$ is the number of columns of the optimal solution. If the optimisation function $\Phi $ is defined to maximize the sum of the characters' weights, then an optimal solution is very likely to be obtained for large values of $m$, so a search could start from the maximal value of $m$ and decrease it successively.
The proposed solution follows a letter-by-letter approach, and the design analysis considers a fixed grid size $n\times m$.
Data representation: The words of the dictionary are represented using a complex data structure that contains $max\_size$ tries, where $max\_size$ is the maximum length of the words in the dictionary (e.g., $max\_size=16$ for our dictionary). The dictionary has words that contain 1, 2, ..., $max\_size$ letters. This way, words are represented only as full paths, from root to leaves, in the tries.
The roots of the corresponding tries are stored in an array of length $max\_size$, $Tries[1..max\_size]$; so, $Tries\left[n\right]$ is the root of the trie that stores the words of length $n$.
The decision of using $max\_size$ tries increases the space complexity, but facilitates a very good time complexity, since in order to fill a crossword grid, as specified in the problem specification, we need words of specific lengths. These tries could be created one time and stored, potentially as the first step of the algorithm. The time complexity of creating these tries is not very high since a single dictionary traversal is required; for each new word, the path corresponding to its letters is followed and new nodes (corresponding to the contained letters) are added when necessary. So, the time complexity is $O\left(N\right)$ for this operation.
For each trie, the first level contains an array of 26 roots for the words starting with each alphabet letter. We propose a special trie representation, in which each node of a trie contains:
A letter—$letter$;
An array with a length equal to 26—$array\_link$—for the children nodes (one for each letter);
A number, $code$, which is obtained from a binary representation with 26 digits that reflects the possible continuations:
0 in position $i$ means that there is no subtrie corresponding to the $i$th letter;
1 in position $i$ means that there is a subtrie corresponding to the $i$th letter.
Example: if for one particular node there are subtries that continue only with the letters A and D, the $code$ will be 37748736 ($2^{25}+2^{22}$), which has the following binary representation: 10010000000000000000000000. The leftmost digit corresponds to the letter A.
This binary code is very important in the fast verification of the possible letters that could be placed in a new position.
For the node that represents the root of such a trie, the attribute $letter$ is empty.
The words are obtained from the leaf nodes and correspond to the letters on the path from root to them.
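The node layout and the per-length array of tries described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper names `letter_index`, `letter_bit`, `insert` and `build_tries` are hypothetical, and the sketch assumes that words use only the uppercase letters A to Z.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALPHABET_SIZE = 26

def letter_index(ch: str) -> int:
    """0-based index of an uppercase letter: 'A' -> 0, ..., 'Z' -> 25."""
    return ord(ch) - ord("A")

def letter_bit(ch: str) -> int:
    """Bit mask of a letter; 'A' is the most significant of the 26 bits."""
    return 1 << (ALPHABET_SIZE - 1 - letter_index(ch))

@dataclass
class Node:
    letter: str = ""  # empty for the root of a trie
    array_link: List[Optional["Node"]] = field(
        default_factory=lambda: [None] * ALPHABET_SIZE)
    code: int = 0     # 26-bit mask of the possible continuations

def insert(root: Node, word: str) -> None:
    """Follow the path of `word`, adding nodes (and code bits) when needed."""
    node = root
    for ch in word:
        i = letter_index(ch)
        if node.array_link[i] is None:
            node.array_link[i] = Node(letter=ch)
            node.code |= letter_bit(ch)
        node = node.array_link[i]

def build_tries(words):
    """Return a list `tries` where tries[n] holds the words of length n."""
    max_size = max(len(w) for w in words)
    tries = [Node() for _ in range(max_size + 1)]  # index 0 is unused
    for w in words:
        insert(tries[len(w)], w)
    return tries

# Length-3 words start only with A and D, as in the example in the text.
tries = build_tries(["ACE", "APE", "DOG", "TO"])
print(tries[3].code, format(tries[3].code, "026b"))
```

A single pass over the dictionary builds all the tries, which matches the construction described in the text.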
In order to fill the crossword grid, three matrices ($M1,M2$ and $M3$), which all have the same size as the grid ($n\times m$), are used:
$M1$ memorizes pointers to the nodes from $Tries\left[m\right]$ that correspond to each letter of the words written horizontally (in rows);
$M2$ memorizes pointers to the nodes from $Tries\left[n\right]$ that correspond to each letter of the words written vertically (in columns);
$M3$ memorizes, for each position, the index of the last letter put into the grid, $M3[i,j].index$, and the binary code of the possible letters that could be placed in that position, $M3[i,j].code$; this code is obtained through the intersection (the and operation) of the binary codes stored in the corresponding nodes pointed to by the first two matrices. The type of the $M3$ elements is defined by a data structure with two fields, $(index,code)$.
In addition, a variable $position$, which indicates the current position in the grid completion, is used.
The algorithm: The algorithm is in essence a backtracking algorithm that fills the cells step by step, column by column, starting after the first column, which is known. So, it is a letter-by-letter algorithm that relies on prefix verification.
The $M1$ and $M2$ matrices are first initialized with the pointers to the nodes that correspond to the letters of the given word that should appear on the first column. In $M2$, the nodes (from $Tries\left[n\right]$) follow a path that represents the given word, and in $M1$ we will have pointers to the nodes from $Tries\left[m\right]$ that correspond to the specified first letters in the rows.
The letter in a new position $[i,j]$ is chosen from the intersection of two sets: the set of letters that allow horizontal words (they create valid prefixes in rows), given by the code stored in the node $M1[i,j-1]$, and the set of letters that allow vertical words (they create valid prefixes in columns), given by the code stored in the node $M2[i-1,j]$. The intersection is obtained by applying the $bit\_and$ operation (denoted by $\&\&$) to the codes $M1[i,j-1].code$ and $M2[i-1,j].code$, and it is stored in $M3[i,j].code$. The letters from this intersection are considered in order, and the index of the currently chosen letter is stored in $M3[i,j].index$.
For the first letter in a column $j$, the set of possible letters is given by $M1[0,j-1].code$; no intersection operation is needed.
If the result of the $\&\&$ operation between the codes of the nodes stored in $M1[i,j-1]$ and $M2[i-1,j]$ is zero (the intersection is empty), then a step back is executed, and the letter in the previous position ($[i-1,j]$ if $i\ge 1$, and $[n-1,j-1]$ otherwise) is changed by taking the next possible letter in the corresponding intersection set. Additionally, when all the possible letters from the intersection have been tried and moving forward is not possible, a step back is executed again.
So, the back_step procedure is defined as:
procedure back_step( )
    $position=\left\{\begin{array}{ll}(position.i-1,\ position.j), & \mathrm{if}\ position.i>0\\ (n-1,\ position.j-1), & \mathrm{if}\ position.i=0\end{array}\right.$

The function set_letter has the responsibility of setting a new letter in the current position.
function set_letter( )
    $M3\left[position\right].index=$ first_index$\left(M3\left[position\right].code\right)$
    if $\left(M3\left[position\right].index\ne 0\right)$ then
        $M1\left[position\right]=M1\left[\mathrm{LEFT}\left(position\right)\right].array\_link\left[M3\left[position\right].index\right]$
        if $position.i>0$ then
            $M2\left[position\right]=M2\left[\mathrm{UP}\left(position\right)\right].array\_link\left[M3\left[position\right].index\right]$
        else ▹ the root node corresponding to the $M3\left[position\right].index$ letter in $\mathbf{Tries}\left[\mathbf{n}\right]$, which stores the vertical words
            $M2\left[position\right]=\mathbf{Tries}\left[\mathbf{n}\right].array\_link\left[M3\left[position\right].index\right]$
        return $true$
    else
        return $false$

The function first_index returns the position of the first bit equal to 1 in the binary representation of $M3\left[position\right].code$ and sets that bit to 0; clearing the bit marks the corresponding letter as already tried. In this way, the next call of first_index returns the position of the next bit equal to 1. A returned value of 0 means that no bit is set.
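The behaviour of first_index, together with the intersection of two codes, can be illustrated with a short Python sketch. It rests on several assumptions that are not stated in the text: indices are 1-based (so that 0 can signal "no letter left"), bits are scanned starting from the one for A, and, because Python integers are immutable, the updated code is returned instead of being modified in place. The helper `mask` is hypothetical. Note also that the pseudocode writes the bitwise operation as $\&\&$, whereas Python spells it `&`.

```python
ALPHABET_SIZE = 26

def mask(letters: str) -> int:
    """Hypothetical helper: 26-bit code with one bit per letter ('A' leftmost)."""
    code = 0
    for ch in letters:
        code |= 1 << (ALPHABET_SIZE - 1 - (ord(ch) - ord("A")))
    return code

def first_index(code: int):
    """Return (index, new_code).

    index is the 1-based position of the first bit equal to 1 (1 = 'A'),
    or 0 when the code is empty; new_code is the code with that bit cleared.
    """
    if code == 0:
        return 0, 0
    index = ALPHABET_SIZE - code.bit_length() + 1
    return index, code & ~(1 << (ALPHABET_SIZE - index))

# Letters allowed by the row prefix and by the column prefix, respectively.
row_code = mask("ADE")
col_code = mask("DEZ")
code = row_code & col_code  # intersection: only D and E remain

index, code = first_index(code)
while index != 0:
    print(index, chr(ord("A") + index - 1))  # candidate letters, in order
    index, code = first_index(code)
```

Each loop iteration consumes one candidate, which is what allows the backtracking step to resume from the next untried letter.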
There is always a left position since we have a word in the first column, and, within set_letter, the UP function is called only when $position.i>0$.
function up($position$)
    $up\_position=\left\{\begin{array}{ll}(position.i-1,\ position.j), & \mathrm{if}\ position.i>0\\ (n-1,\ position.j-1), & \mathrm{if}\ position.i=0\end{array}\right.$
    return $up\_position$
function left($position$)
    $left\_position=(position.i,\ position.j-1)$, defined for $position.i\ge 0\wedge position.j>0$
    return $left\_position$

The procedure MOVE_NEXT moves the current position to the next cell in the grid, and sets the corresponding value of the $code$ attribute in $M3$:
procedure move_next( )
    $position=\left\{\begin{array}{ll}(position.i+1,\ position.j), & \mathrm{if}\ position.i<n-1\\ (0,\ position.j+1), & \mathrm{if}\ position.i=n-1\end{array}\right.$
    if $position.i>0$ then
        $M3\left[position\right].code=M1\left[\mathrm{LEFT}\left(position\right)\right].code\ \&\&\ M2\left[\mathrm{UP}\left(position\right)\right].code$
    else
        $M3\left[position\right].code=M1\left[\mathrm{LEFT}\left(position\right)\right].code$

Using these functions, we may define the scheme of the overall algorithm—Algorithm 1.
The algorithm is parameterized by the start position so that it can be reused in the parallel implementation; in the sequential case, $start\_pos=[0,1]$.
Algorithm 1 $\mathbf{Tries}\_\mathbf{Crosswords}\_\mathbf{Seq}(\mathbf{Sol}\_\mathbf{List},\mathbf{Optimal}\_\mathbf{Sol},\mathbf{start}\_\mathbf{pos})$
▹ $\mathbf{Sol}\_\mathbf{List}$: the list of the resulting solutions
▹ $\mathbf{Optimal}\_\mathbf{Sol}$: stores the optimal solution
▹ matrices initialisation
@set the nodes from $\mathbf{Tries}\left[\mathbf{m}\right]$ in $\mathbf{M}\mathbf{1}$ corresponding to the given word (first column)
@set the nodes from $\mathbf{Tries}\left[\mathbf{n}\right]$ in $\mathbf{M}\mathbf{2}$ corresponding to the given word (first column)
$position=\mathrm{UP}(\mathbf{start}\_\mathbf{pos})$
MOVE_NEXT()
repeat
    $letter\_found=$ set_letter(position)
    if $letter\_found$ then
        if ($position=[n-1,m-1]$) then ▹ if solution
            @save the solution into the list of solutions, $\mathbf{Sol}\_\mathbf{List}$
            if optimal_solution() then
                @save the solution as optimal, $\mathbf{Optimal}\_\mathbf{Sol}$
        else
            MOVE_NEXT()
    else
        BACK_STEP()
until $M3[\mathbf{start}\_\mathbf{pos}].code=0$
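To make the control flow of Algorithm 1 concrete, the following Python sketch implements a simplified variant of it. It is an illustration under stated assumptions, not the authors' code: the names `Node`, `build_trie` and `solve` are hypothetical, words use only the uppercase letters A to Z, $m\ge 2$, and all solutions are collected without evaluating $\Phi$. The sketch also deviates from the pseudocode in three labelled ways: the loop stops when the search backs out of the start cell, the first row is additionally intersected with the root code of $Tries[n]$ as a safeguard against letters that start no vertical word, and the first column of $M2$ is not filled because the search never reads it.

```python
A = 26  # alphabet size

class Node:
    def __init__(self):
        self.link = [None] * A  # children, one slot per letter
        self.code = 0           # bit 25 = 'A', ..., bit 0 = 'Z'

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            k = ord(ch) - 65
            if node.link[k] is None:
                node.link[k] = Node()
                node.code |= 1 << (A - 1 - k)
            node = node.link[k]
    return root

def solve(first_word, m, dictionary):
    """Return every n x m grid (as a list of row strings) whose rows and
    columns are dictionary words and whose first column is first_word."""
    n = len(first_word)
    rows_trie = build_trie([w for w in dictionary if len(w) == m])  # Tries[m]
    cols_trie = build_trie([w for w in dictionary if len(w) == n])  # Tries[n]
    M1 = [[None] * m for _ in range(n)]  # row-prefix nodes
    M2 = [[None] * m for _ in range(n)]  # column-prefix nodes
    M3 = [[0] * m for _ in range(n)]     # remaining candidate codes
    grid = [[""] * m for _ in range(n)]
    for i, ch in enumerate(first_word):
        M1[i][0] = rows_trie.link[ord(ch) - 65]
        grid[i][0] = ch
        if M1[i][0] is None:             # no row word starts with this letter
            return []

    def candidates(i, j):
        # Assumption: row 0 uses the root of Tries[n] as its "up" node.
        up = M2[i - 1][j] if i > 0 else cols_trie
        return M1[i][j - 1].code & up.code

    solutions = []
    i, j = 0, 1
    M3[i][j] = candidates(i, j)
    while True:
        code = M3[i][j]
        if code == 0:                    # BACK_STEP
            if (i, j) == (0, 1):
                break                    # start cell exhausted: search is over
            i, j = (i - 1, j) if i > 0 else (n - 1, j - 1)
            continue
        k = A - code.bit_length()        # 0-based index of the first set bit
        M3[i][j] = code & ~(1 << (A - 1 - k))  # mark the letter as tried
        grid[i][j] = chr(65 + k)
        M1[i][j] = M1[i][j - 1].link[k]
        M2[i][j] = (M2[i - 1][j] if i > 0 else cols_trie).link[k]
        if (i, j) == (n - 1, m - 1):     # the grid is complete
            solutions.append(["".join(row) for row in grid])
        else:                            # MOVE_NEXT
            i, j = (i + 1, j) if i < n - 1 else (0, j + 1)
            M3[i][j] = candidates(i, j)
    return solutions

print(solve("CAT", 3, ["CAT", "ARE", "TEN"]))
```

Because every placed letter comes from the intersection of a row code and a column code, each partial grid contains only valid prefixes, so no word lookup is needed when the last cell is filled.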

The most important characteristics and benefits of the algorithm are:
The usage of tries to find the possible prefixes;
The codes attached to each node of the tries (which reflect the possible continuation); using the codes in the nodes of the tries facilitates a very fast verification of the possible paths to solutions using the $and$ operation on bits;
The codes saved in the cells of the $M3$ matrix, as an intersection of the corresponding cells in $M1$ and $M2$ matrices.
The performance of this algorithm depends on the size of the grid, but most importantly on the number of words of a certain size.
4.2. Parallel Implementation
The algorithm has good potential for parallelisation, and we have developed a hybrid parallel implementation using MPI (Message Passing Interface) and multithreading. We have chosen this hybrid parallelisation in order to support distributed memory architectures and not only shared memory ones.
Let $P$ be the number of MPI processes, each of which uses $T$ threads (a thread pool of size $T$). Each process creates the necessary tries from the dictionary file. The space complexity of these structures is not very high, so this duplication is worthwhile.
The first parallelisation that we may identify relies on the observation that there are several possibilities to set the cell value in the position $(0,1)$; these possible letters are given by the bit code of the node stored in $M1[0,0]$. Still, in order to allow a parallelisation control independent of the given word that should be placed in the first column, the entire set of alphabet letters is distributed among the processes. An example is given in Figure 2. In this way, the responsibility of one process is to find the solutions that have one of the letters assigned to that process in the position $(0,1)$.
Since the proposed solution is a hybrid one (multiprocessing combined with multithreading), each process will use a thread pool of size T able to execute the associated tasks that lead to the solution finding.
For the parallelisation at the threads level, the tasks are defined based on the construction of a list of pairs of letters $(L1,L2)$, where $L1$ is a potential letter to be placed in the position $(0,1)$ and $L2$ in the position $(1,1)$ (Procedure PAIR_TASK). In this way, the maximal parallelisation degree increases to an adequate value—${26}^{2}$. Each task will define and use its own matrices $M1,M2$ and $M3$.
More concretely, if we consider the case of four processes with the letter distribution described in Figure 2, the first process will create $7\times 26$ tasks that correspond to all the pairs formed by the Cartesian product of the following two sets: $\{A,E,I,M,Q,U,Y\}$ and $\{A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z\}$.
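The letter set listed above for the first process is consistent with a cyclic distribution of the alphabet over four processes. A minimal Python sketch of such a distribution, with the hypothetical helper names `letters_for_process` and `tasks_for_process`, could look as follows.

```python
import string
from itertools import product

ALPHABET = string.ascii_uppercase  # the 26 letters A..Z

def letters_for_process(rank: int, num_processes: int) -> str:
    """Cyclic distribution: process `rank` gets letters rank, rank + P, ..."""
    return ALPHABET[rank::num_processes]

def tasks_for_process(rank: int, num_processes: int):
    """All (L1, L2) pairs of the Cartesian product A_rank x A."""
    return list(product(letters_for_process(rank, num_processes), ALPHABET))

print(letters_for_process(0, 4))     # letters assigned to the first process
print(len(tasks_for_process(0, 4)))  # number of tasks it creates
```

With four processes, ranks 0 and 1 receive seven letters each and ranks 2 and 3 receive six, so the task counts differ by at most 26 between processes.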
The program executed by each MPI process is described by Algorithm 2.
Algorithm 2 $\mathbf{Tries}\_\mathbf{Crosswords}\_\mathbf{Parallel}\to $ MPI program
$\mathbf{Load}\_\mathbf{dictionary}$  ▹ each process reads the dictionary and creates the tries
$\mathbf{Create}\_\mathbf{thread}\_\mathbf{pool}$  ▹ each process creates a thread pool with $T$ threads
$\mathbf{Sol}\_\mathbf{List}$  ▹ create an empty list where solutions will be placed
$\mathbf{Optimal}\_\mathbf{Solution}$  ▹ set a variable where the optimal solution will be placed
$\mathbf{Letter}\_\mathbf{distribution}$  ▹ balanced distribution of the set $A$ over the $P$ processes
▹ ($A$ contains all 26 letters of the alphabet)
▹ each process $i$ will have a subset ${A}_{i}$ of $A$ (${A}_{i}\subseteq A$)
▹ for each pair of ${A}_{i}\times A$ a task is created and submitted to the thread pool
for each pair $(L1,L2)$ of the Cartesian product ${A}_{i}\times A$ do
    @submit pair_task$(L1,L2,Sol\_List,Optimal\_Solution,[1,1])$ to the thread pool
@wait for all tasks to finalize
@aggregate all the solutions and the optimal solution in the process with $ID=0$
The final aggregation of all the solutions found, together with the optimal solution, is obtained through a reduce-type operation and is carried out in the first MPI process ($ID=0$). This means that, after all the tasks are executed, each process sends to the process with rank 0 the best value of the optimisation function obtained for the solutions it found. Process 0 computes the global optimal value and sends it back to all the other processes. The processes whose solutions attain this value save them as optimal solutions.
procedure Pair_Task$(L1,L2,Sol\_List,Optimal\_Sol,start\_position)$
    if $L1$ eligible to be placed on $M3[0,1]$ then
        @set the values $M1[0,1],M2[0,1],M3[0,1]$ to correspond to letter $L1$
    else
        EXIT
    if $L2$ eligible to be placed on $M3[1,1]$ then
        @set the values $M1[1,1],M2[1,1],M3[1,1]$ to correspond to letter $L2$
    else
        EXIT
    @call $\mathbf{Tries}\_\mathbf{Crosswords}\_\mathbf{Seq}(Sol\_List,Optimal\_Sol,start\_position)$
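The thread-level part of this scheme, in which one task is submitted to a pool for every pair $(L1,L2)$, can be sketched with Python's standard `concurrent.futures` module. The function `run_process` and its parameters are hypothetical, and `pair_task` stands for any callable that takes the two letters and returns the list of solutions it found; the MPI layer and the sequential search itself are not part of this sketch. In CPython, threads do not speed up CPU-bound work because of the global interpreter lock, so the sketch only illustrates the task structure.

```python
import string
from concurrent.futures import ThreadPoolExecutor
from itertools import product

ALPHABET = string.ascii_uppercase

def run_process(rank, num_processes, num_threads, pair_task):
    """Submit one task per (L1, L2) pair of A_rank x A and gather results.

    pair_task(L1, L2) is assumed to return a list of solutions; ineligible
    pairs simply return an empty list (the EXIT branches of Pair_Task).
    """
    my_letters = ALPHABET[rank::num_processes]  # cyclic letter distribution
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(pair_task, l1, l2)
                   for l1, l2 in product(my_letters, ALPHABET)]
        results = [f.result() for f in futures]  # wait for all tasks
    # Flatten the per-task lists into one local solution list.
    return [solution for task_result in results for solution in task_result]

# Stub task used only for illustration: "solves" pairs of equal letters.
def stub_task(l1, l2):
    return [l1 + l2] if l1 == l2 else []

print(run_process(0, 4, 3, stub_task))
```

Returning the solutions from each task, rather than appending to a shared list, avoids the need for explicit locking in this sketch.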

Analysis
The decision to distribute the letters among the processes was based on our intention to define a parallel algorithm whose degree of parallelism can be controlled independently of the given word to be placed in the first column. The value of $M1[0,0].code$ gives, more precisely, the letters that could be placed in the position $(0,1)$, but these depend on the given word, and so they are not appropriate for defining the MPI processes.
Even if, theoretically, the algorithm allows us to define 26 MPI processes, it is more efficient from the cost point of view to define fewer processes (to decrease the probability of having processes that do not have effective tasks to execute). A good average would be to assign 3–4 letters per process, and also to use a cyclic distribution for assigning letters to processes.
The presented solution leads to a multiprocessing parallelisation degree equal to 26, and the hybrid degree of parallelism (through multiprocessing and multithreading) is $26\times 26$. The degree of parallelism could be improved by distributing pairs of letters to the processes instead of single letters. Using this variant, each of the $P$ processes will receive a list that contains about $676/P$ pairs of letters. In this case, a task created by a process will be defined by three letters: the first two come from a pair distributed to the process and are used for setting the positions $(0,1)$ and $(1,1)$, and the third is taken from the entire alphabet and is meant to be placed in the position $(2,1)$. This would allow a degree of parallelism bounded by $676\times 26$. This idea could be generalized to tuples of $k$ letters, and the obtained degree of parallelism would be bounded by ${26}^{k+1}$. So, the number of MPI processes could be increased by defining a more general algorithm that distributes tuples of letters and not just single letters to the processes.
The reason for this generalisation is the possible need to engage many more processes in the computation, which is very plausible when using big clusters.
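The generalisation to tuples of $k$ letters can be illustrated with a short Python sketch. The function names `parallelism_bound` and `tuples_for_process` are hypothetical, and the cyclic assignment of tuples to processes is an assumption carried over from the single-letter case; the text does not prescribe a particular distribution for tuples.

```python
import string
from itertools import islice, product

ALPHABET = string.ascii_uppercase

def parallelism_bound(k: int) -> int:
    """Upper bound on the number of tasks when k-letter tuples are
    distributed and each task adds one more letter: 26^(k+1)."""
    return len(ALPHABET) ** (k + 1)

def tuples_for_process(rank: int, num_processes: int, k: int):
    """Cyclic distribution of all 26^k k-letter tuples over the processes."""
    all_tuples = product(ALPHABET, repeat=k)
    return list(islice(all_tuples, rank, None, num_processes))

print(parallelism_bound(1), parallelism_bound(2))  # k = 1 and k = 2
print(len(tuples_for_process(0, 4, 2)))            # pairs held by process 0
```

For $k=1$ this reduces to the single-letter scheme with its bound of $26\times 26$, and for $k=2$ it gives the bound of $676\times 26$ mentioned above.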