Thursday, March 26, 2015

TCC 2015: Who Punctured my KEM? Understanding CCA-Secure PKE

The twelfth edition of the Theory of Cryptography Conference (TCC) was held in the beautiful city of Warsaw this week amidst uncertainty over the future (placement in the cryptographic calendar) of the event. The next edition will be held in Tel Aviv in January 2016 ahead of a planned shift to the latter end of the calendar year. This blog post lives in the present, and will take a closer look at 'Constructing and Understanding Chosen Ciphertext Security via Puncturable Key Encapsulation Mechanisms' [link] by Takahiro Matsuda and Goichiro Hanaoka, part of the broad-reaching session titled 'Encryption and Key Exchange'. Enhancing our understanding of a concept as fundamental as chosen ciphertext security can only be a good thing, particularly when much of the remainder of the TCC program was so heavily focused on obfuscation.

Sending messages via public-key encryption is expensive and slow, so we want to achieve the speed benefits of symmetric-key primitives (such as blockciphers) without having to exchange symmetric keys between all communicating parties. A natural approach for sending long messages is to construct a ciphertext of two components: an encryption of a (randomly chosen) symmetric key under the recipient's public key, and an encryption of the message under the symmetric key. The process of creating this first ciphertext component is known as a Key Encapsulation Mechanism, and is reversed by the recipient performing 'decapsulation' using their secret key. This two-tier framework is known as hybrid encryption, the component that encrypts the plaintext is called a Data Encapsulation Mechanism (DEM), and if we combine a CCA-secure KEM with a CCA-secure DEM then we get a CCA-secure public-key encryption scheme.

Many constructions of Chosen-Ciphertext Attack (CCA)-secure PKE (/KEMs) are inspired by the Dolev/Dwork/Naor framework, (incl. Lossy Trapdoor Functions by Peikert and Waters, Correlated-Input secure TDFs by Rosen and Segev among others), so the paper attempts to generalise this style of technique. Taking some inspiration from recent work on puncturable PRFs and puncturable tag-based encryption, the authors give a definition of puncturable KEMs which comprise the classical key generation, encapsulation and decapsulation algorithms plus three additional algorithms: $\hat{sk_{c^*}}\gets\mathsf{Punc}(sk,c^*)$ that punctures $sk$ by a ciphertext $c^*$; punctured decapsulation $K/\bot\gets\mathsf{PDecap}(\hat{sk_{c^*}},c)$ and a predicate $0/1\gets\mathsf{F}(c,c')$ that decides whether ciphertexts $c$ and $c'$ are 'close to' each other.

The punctured secret key $\hat{sk_{c^*}}$ can be used by $\mathsf{PDecap}$ to decapsulate all ciphertexts that are 'far from' $c^*$, yet is useless for ciphertexts 'close to' $c^*$ (including $c^*$ itself). This means that a 'normal' secret key $sk$ defines a map whose domain is entire ciphertext space, whereas a punctured secret key $\hat{sk_{c^*}}$ only defines a map whose domain has a hole punctured by $c^*$. If ciphertexts are in this 'hole' punctured by $c^*$ then they will not decrypt. The authors describe how many existing CCA-secure KEMs fit the puncturable KEM framwork, and show that the following ingredients yield a CCA-secure KEM:
  • Decapsulation soundness (the only valid ciphertext that is 'close' to $c$ is $c$ itself),
  • Punctured decapsulation soundness (punctured decapsulation works as well as the 'normal' decapsulation algorithm for all ciphertexts 'far' from $c$),
  • Extended CPA (eCPA) security (CPA security still holds even in the presence of a punctured secret key corresponding to the challenge ciphertext)
In addition, they show that Puncturable KEMs can be constructed from sender non-committing encryption (security holds even when sender's randomness used to create challenge ciphertext is corrupted) and one-time key-dependent message-secure symmetric encryption (secure even when an adversary can acquire encryptions of certain functions of the secret key).

Wednesday, March 25, 2015

52 Things: Number 25: Methods for modular reduction using "special" primes that define $GF(p)$ and $GF(2^n)$

This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography: a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. Continuing the section on implementation details, we discuss one method of implementing efficient modular reduction: using "special" primes modulo. 

As we have seen in the previous blogs, when implementing cryptographic schemes, one of the mostly frequently invoked operation is the modulo operation. Unfortunately, despite its massive usage modulo cannot be performed easily like some other arithmetic operations such as additions and multiplications. Montgomery representation provides one solution to this problem, and we will discuss another solution, Psudo-Mersenne Prime reduction.

Definition: A prime number $p$ is called a Psudo-Mersenne Prime if $p$ can be written in the form:
p = b^n - c
0 < |c| < 2^{\lfloor n/2 \rfloor}
Practically, $b$ is always $2$ and we pick $c$ within a word size which is usually $32$ or $64$ bits.

It can then be easily derived from its definition that
p \equiv b^n - c \equiv 0 (\mod p)\\
b^n \equiv c (\mod p)

Therefore given an integer $z$ of $k$ bits, we let $z'$ be its Least Significant $n$ bits and $z''$ its Most Significant $k-n$ bits, i.e. $z = z''2^n + z'$, we can then rewrite $z \mod p$ as: 
z \equiv z''b^n + z' \equiv z''c + z'  (\mod p)
The modulo of $z$ can then be computed by recursively apply this until it yields in $\mathbb{Z}_p$.

There are several points might worth mentioned here:
  1. Both $z'$ and $z''$ can be computed efficiently by simply shifting $z$.
  2. Since $c$ is picked within a word size, the multiplication can be done very efficiently.
  3. Each iteration shrinks the left hand side $k$ bits to the right hand side $max(k-n+w, n)$ bits where $w$ is the word size.
So generally speaking, the reduction can be done very efficiently when the modulo is a Psudo-Mersenne Prime as it only requires shifting, addition and multiplication.

Nevertheless, drawback of using such method is also obvious as such implementation will usually requires multiple parties to use a fixed setup which will potentially results into both interoperability and security problems.

More details about this solution can be found in [1] and [2].

$GF(2^n)$ is another filed of common usage in cryptography. Trinomial and pentanomial are the mostly used modular in this scenario. We will show how  trinomial simplifies the reduction. Same technique can be applied to pentanomial straight forward.

The idea is very similar to the prime field one. Presume we have the trinomial modulo $f(x)$ such that
f(x) = x^n + x^t + 1
where $0 < t < n/2$.

We immediately have
x^n \equiv x^t + 1 (\mod f(x))

Given a polynomial $z(x)$ with degree greater to $n$, we rewrite $z(x)$ as
z(x) = z''(x) x^n + z'(x)
where $z'(x)$ is the polynomial represented by the Least Significant $n-1$ bits of $z(x)$ and $z''(x)$ the others.

Then just like what we have done on $GF(p)$, we compute the modular by
z(x) = z''(x) x^n + z'(x) \equiv z''(x) (x^t + 1) + z'(x)\\
\equiv z''(x) x^t + z''(x) + z'(x) (\mod f(x))
This can be done very efficiently as $t$ is a "small" number.

[2] also described another optimization to the standard reduction. Consider during a standard procedure of reducing $z(x)$ of degree $m$ to $f(x)$ of degree $n$:
z(x) = a_m x^m + a_{m-1} x^{m-1} + ... a_1 x^1 + a_0 x^0\\
f(x) = x^n + x^t + 1

When we try to reduce $a_ix^{i}$, there are two cases:
  • If $a_{i} = 0$, then nothing will be changed to the remainder, or
  • If $a_{i} = 1$, then $1$ will be added to the bits aligned to $x^t$ and $1$ in $f(x)$, namely $a_{i - n + t}$ and $a_{i-n}$.
Since adding $0$ does not change the remainder, these two cases can be generalised to one; therefore we can write the standard reduction procedure as:

INPUT: $z(x)$
OUTPUT: $z(x)$
1. for $i = m$ to $n$ by $-1$
2. {
3.     $a_{i - n + t} += a_i$
4.     $a_{i - n} += a_i$
5. }
6. return $z(x)$

The advantage of using such algorithm does not seem obvious from a software perspective; however, it can be implemented very efficiently by hardware as it simply updates $z(x)$ and requires no extra storage.

Another advantage is that such code requires only $0 < t < n$ and can be executed in a constant time.

[1]Menezes, Alfred J., Paul C. Van Oorschot, and Scott A. Vanstone. Handbook of applied cryptography. CRC press, 1996.

[2]Blake, Ian F., Gadiel Seroussi, and Nigel Smart. Elliptic curves in cryptography. Vol. 265. Cambridge university press, 1999.

Monday, March 23, 2015

But soft! what byte through yonder crypto leaks?

Dan M led the latest study group, on 'Soft Analytical Side-Channel Attacks' (Veyrat-Charvillon, Gérard and Standaert, Asiacrypt 2014 [link; eprint]). The paper (which features in our 2014 Side-Channel Almanac) aims to 'bridge the gap' between two broad classes of strategies for side-channel analysis with a new method (SASCA) combining many of the advantages typically considered unique to one or the other.

Differential Power Analysis (DPA) and Template Attacks (TA) operate on a 'divide-and-conquer' (DC) basis: the dependency of the power consumption on small, 'guessable' parts of the intermediate state in the first or last round is exploited by comparing the leakage measurements with attacker predictions based on key guesses and approximate (or detailed, in the case of TA) knowledge of the functional form of the leakage. The full key is then recovered by enumerating over the highest-ranked subkeys. DC methods are computationally efficient and adapt straightforwardly to a wide range of algorithms and scenarios requiring minimal ('grey box') familiarity with implementation details. They are robust to noise (from measurement error or non-targeted system processes), which simply increase the number of traces needed to maintain the success rate. However, they generally discard all but a fraction of the information present in a given trace, so their overall data complexity is far from optimal and, importantly, inadequate to indicate the true extent of the vulnerability of a device for evaluation purposes. This drawback is exacerbated by the sensitivity of the data complexity to the quality of the leakage predictions (or ‘templates’, in the case of TA).

By contrast, analytical methods such as 'Simple' Power Analysis (SPA) aim to recover the whole key simultaneously by exploiting a large number of leaked intermediate values (possibly incorporating multiple rounds) from just one or a small number of traces. Some methods achieve this by reducing the key candidate space and then enumerating; others, by solving (e.g. via a SAT solver) a system of algebraic equations in function of the intermediate values and the side-channel auxiliary information in which variables representing the key are treated as unknowns to be recovered. The obvious advantage of such strategies is their low data complexity. However, as well as being far more complicated to implement than DC methods, and much more dependent on detailed knowledge of the implementation, they also suffer from weak noise resistance, generally requiring 'hard' error-free classifications of all intermediate values into sufficiently small subsets of the output space.

Wouldn’t it be neat, then, to have a strategy that efficiently exploits all the useful intermediate values in a trace, just like analytical attacks, whilst requiring only ‘soft’ (probabilistic) information on those values (and thereby remaining robust to noise) just like DC attacks?

Such is the spec that Veyrat-Charvillon et al. seek to meet, and to do so they re-express the side-channel key recovery problem as a "Low Density Parity Check"-like code to be decoded. This allows them to apply Belief Propagation (BP) — an algorithm which passes messages around a factor graph representing the code until convergence is reached. As long as the factor graph is tree-like, BP is able to find reasonable approximations in reasonable time for solutions that take exponential time (in the number of variables) to compute exactly.

The factor graph comprises nodes for variables and nodes for functions, with edges connecting functions to the variables appearing in them. In the case at hand — 'decoding' an AES key using plaintexts, ciphertexts and side-channel traces — the variables xi are the intermediate values handled by the device, and there are two categories of function: those corresponding to probabilistic information on unknown values derived from the observed side-channel leakages, f(x) = P(x = L); and indicator functions representing the known executed operations, thereby enforcing the correctness of AES, e.g. f(xi,xj,xk) = 1{OP(xi,xj) = xk} (where OP might be, say, an XOR). The parts of the key appear as xi's which are a priori unknown; propagating the known information around the factor graph refines the distribution on the keys, with increasing certainty towards the solution.

In cases where a single trace is not sufficient to identify the key (for example, because of a low signal-to-noise ratio (SNR)) additional traces can be incorporated by 'chaining together' several factor graphs with a shared key schedule (i.e. because the xi’s representing the key remain the same in each run of the algorithm). This becomes quite memory intensive — 16MB of space per trace — but is an improvement on previous SPA techniques which are unable to handle the added complexity of multiple observations (although they do benefit from repeat observations relating to the same algorithm inputs, which can be averaged to produce an enhanced SNR). A space-saving alternative would be to build and propagate the factor graphs for each trace in turn, retaining the posterior distribution on the key after exploiting one trace as the prior distribution when exploiting the next. However, the authors state (in their conclusion) that this requires first adapting the method to only propagate messages towards the key schedule, something they leave for future work.

The advantages over analytic strategies are obvious: SASCA is far more robust to low SNR, both because it can handle 'soft’ information about the intermediates and because it can easily incorporate multiple traces. Moreover, the authors state that BP decoding improves on the time and memory complexity of SAT solvers and optimisers. The advantages over DC-style strategies are verified experimentally in a tightly controlled simulated attack environment based on the AES FURIOUS implementation with Hamming weight leakage as SNR varies. Template attacks are the appropriate benchmark, widely-recognised as the ‘optimal’ approach as far as DC methods go; they rely on a similar profiling phase to SASCA, although for far fewer intermediate values, and requiring less detailed knowledge of the assembly code. As expected, SASCA, with its option to incorporate numerous intermediates, is experimentally shown to substantially outperform TA (in terms of the number of traces needed) at all tested SNR levels, and to succeed even with unknown inputs and outputs.

In all, SASCA is an interesting, elegant proposal, which certainly does go considerable way — theoretically, at least — towards closing the gap between DC and analytic side-channel strategies. However, its practical usefulness is not yet established: so far, it has only been tested in a carefully constructed simulated scenario; its ready transferability beyond that scenario is certainly not obvious. For one thing, the profiling phase is high-effort and requires in-depth assembly code-level knowledge of the attacked implementation; for another, BP is not guaranteed to converge when applied to “non-tree-like” factor graphs (i.e., graphs with cycles) and the authors have not yet been able to indicate the extent to which real-world cryptographic implementations of interest are or are not suitable targets for the methodology. Ideas developed concurrently by Oren et al. and presented at CHES 2014 [link] appear a more 'practical' approach to handling 'soft' side-channel information, but are only able to exploit leakage from the first encryption round and the key schedule. So, the actual contribution of subsequent rounds to the performance of SASCA needs to be clarified. Fortunately, the topic is one of continuing interest to many, so all of these avenues for further investigation are likely to be explored enthusiastically…

Wednesday, March 18, 2015

52 Things: Number 24: Describe the binary, m-ary and sliding window exponentiation algorithms.

This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography: a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. Continuing the section on implementation details, we consider some different methods for calculating modular exponentiations.

Brief reminder of what we already know

A large number of cryptographic algorithms are based around the security of exponentiation-based problems, such as RSA or DH. Therefore, having an efficient implementation of Modular exponentiation is required for modern cryptography. We should begin by addressing the naive solution: to calculate $x^a \pmod N$ we use our favourite exponentiation algorithm to calculate $x^a$, then reduce the answer modulo $N$. However, for most cryptographic uses $x^a$ will be excessively large. Now, most traditional methods can be modified simply by reducing modulo $N$ at every possible stage, and this leads to some of the common techniques.  Here we present a few possible methods for calculating $X^E \mod N$.



The binary modular exponentiation method closely resembles the traditional square-and-multiply method for calculating exponentiation. Indeed, the only difference is that at each stage we reduce modulo $N$ (preventing any of the intermediate values getting any larger than necessary). We can do this from left-to-right or right-to-left (either going up or down the bits of the exponent), and there are subtitles that may affect your choice.


The m-ary method works in a similar way, but instead of considering the exponent as a sequence of bits, we consider them as elements of some alphabet of $M=2^m$ elements (ie m-bit strings). In fact, the binary method can be thought of as the m-ary method with M=2 (hence the name). So, how does it work? Well, first we calculate a lookup table for all the $X^i$ for $i=1$ to $2^m-1$.
Then, we work our way through the exponent $E$ (in base $M$). Then, for each 'term' in $E$ we simply look up the appropriate value from our table, and then instead of doubling we shift by $m$.
Compared to the binary technique, this means more precomputation (ie calculate more powers of X) and so requires more space, but in return have to make fewer multiplications.

Sliding Window

So, the m-ary window certainly reduces the number of multiplications we had to do, but can we do better? The answer is (unsurprisingly given the retorical question) yes, in the right circumstances. Suppose we are working with $m=4$, and $E=22=(0,0,0,1,0,1,1,0)_2=(1,6)_{2^{4}}$. Then we could use the 4-ary method, but perhaps if we 'reposition' our window, we can do even better: after all there are only 3 set bits, and they're all within a window of 4. If we knew this in advance, we could apply our lookup table at that position, and so only have to do one lookup.  So, for the sliding window method, we first look through the binary representation of $X$, and using this find a representation of $X=\sum x_i 2^i$ where $0\leq x_i \leq 2^m-1$ and as many $x_i$ are zero as possible. This leads to even more precomputation work than the m-ary method, since we have to calculate an efficient representation of $X$ as well as the lookup table used previously. However, if $X$ is known to be sufficiently sparse, a sliding-window method will almost certainly be more efficient.

Thursday, March 12, 2015

52 Things: Number 23: Write a C program to implement Montgomery arithmetic.

This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography: a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. This next blog continues the topic of 'Cryptographic Implementation Details'.

In this post I will aim to compliment what we saw last week regarding the more theoretical aspects of Montgomery arithmetic with a practical implementation of it. The implementation is written in C and written for a computer with a word size of 64 bits. The moduls m can therefore be as large as 264 – 1 and a and b can be as large as m – 1. We will take r = 264. As in the previous post, most of what is given here is derived from [1] so please refer to this for more information.
You will remember from the last blog post, that four steps were given to the algorithm (please see post for a full description of the algorithm if this is hazy). For the purposes of our implementation, we will eamine each of these stages separately.

1.       The GCD Operation
This function uses the binary extended gcd operation to find the integers r-1 and m’ such that rr-1= 1 + mm’. These integers are required for the subsequent steps of the Montgomery reduction. The algorithm takes r and m and pointers to values r-1and m’ which the algorithm derives. This is done using the extended gcd algorithm which could be implemented in a number of ways (the purpose of which this blog is not about) and I refer the reader to [1] or [2] for detailed descriptions of it.

2.       Transform the Multipliers
The second stage aims to compute abar = ar mod m and bbar = br mod m. As r = 264, no multiplication is required but only a shifting by 64 bits (due to our selection of r = 264), giving a 128 bit output with 64 0s tailing the value of a or b. The remainder of the division by m is then given as abar or bbar.  A function which takes the high 64 bits (x) and the low 64 bits (y) and the value for m (z) and returns the 64 bit result could be written to do this. Such a function could be defined as follows:

   uint64 modul64(uint64 x, uint64 y, uint64 z);
Where uint64 is a type defined as follows:
   typedef unsigned long long uint64;       
      3.       Montgomery Multiplication
The function which carries out the Montgomery multiplication can be defined as a function which takes the 64 bit values abar, bbar, m and mprine and returns a 64 bit value for the output of the Montgomery multiplication step.

The first sub-stage to of the function is to calculate t = abar*bbar. This is done by multiplying abar and bbar together to give a 128 bit product. An additional function could be written to do this which takes abar and bbar and returns two 64bit values, one with the high 64 bits (thi) and one with the low 64 bits (tlow).

The next stage is the computation of u = (t + (tm’ mod r)m)/r. Here t is a 128 bit integer and m’ is a 64 bit integer. As we mod by r, only the lower 64 bits of tm’ are required, meaning that we can disregard the top 64 bits of t.

    tm = tlo*mprime; // Disregard thi
   mulul64(tm, m, &tmmhi, &tmmlo); // Function which performs 128 bit multiplication

The subsequent multiplication by m gives an output of 128 bits and the addition of t an output of 129 bits, which can be carried out as 128bit + 128bit = 128bit output and compute the carry separately. In C:

   ulo = tlo + tmmlo;
   uhi = thi + tmmhi;
   if (ulo < tlo) uhi = uhi +1; // test for overflow from ulo and add if necessary to uhi
   ov = (uhi < thi) | ((uhi == thi) & (ulo < tlo)); // check for carry

The last step is the calculation of if(u >=m) then return u – m; else return u. This is shown below:
   ulo = uhi;
   uhi = 0;
   if(ov > 0 || ulo >= m) // test if there was overflow or ulo is higher that
                ulo = ulo – m;
   return ulo;

4.       The Inverse Transformation
In the final stage we compute ur-1mod m which is the product of a and b modulo m. This could be done by calling the following functions:

    mulul64(p, rinv, &phi, &plo); // performs multiplication and returns two 64 bit values phi and plo
   p = modul64(phi, plo, m); // returns value of 128bit input mod m

This gives the output p which is the 64 bit output equal to ab mod m and the Montgomery reduction is complete.


Wednesday, March 11, 2015

Side channel research in 2014: overview, statistics, and new directions?

Many times I have hard from fellow academic researchers, and also from industrial collaborators, how useful it would be to have somebody produce an overview for the last year in side channel research. Now being a well established topic, relevant papers appear across many different conferences, workshops, and journals. Keeping track of the important innovations in any year becomes then a time consuming task. Luckily I have plenty of PhD students at present!

Over January and February, we have (re) read many papers (that were published in 2014) to identify emerging themes and hot topics. We tirelessly queried Google Scholar to obtain some relevant (?) citation statistics, and eventually, we typed this up in a reasonably readable manner: the first Side Channel Almanac is hence now available on the SILENT web site for public enjoyment!

The 2014 Bristol Side Channel Almanac features four research themes:

  • Efficiently extracting all information from leakage traces.
  • Developing theory around the fundamental statistical methods for information extraction.
  • Exploring and advancing new (or less well-researched) side-channels or attacks.
  • Making cryptography provably leakage resilient.

  • For each theme, we highlight (what we think are) the key contributions this year: this also mean a big pat on the back of the authors of these!

    Tuesday, March 10, 2015

    What a time to be alive

    Today marks the start of the 18th DATE conference and exhibition, which takes place in the French capital of walnuts (I am sure I didn't make the latter up, but somehow can't find any good reference for it). I am here presenting a paper co-authored with Elisabeth and Carolyn (check it out, it's got some eye-catching plots).

    Joke and self promotion attempts aside, it's great to be here. In a single word, the conference is overwhelming: as the main design and engineering event in Europe, it attracts a lot of worldwide interest from both academia and industry. This is reflected in the steadily increasing number of submissions (935, out of which 205 accepted). The technical program is divided into six parallel sessions, but all participants can attend 
    lunchtime keynote talks. I will write about two such talks that I attended today.

    Drones that fly for you
    Gerard* of Parrot Paris showed us the Bebop drone in action; impressive but short-lived action, as it quickly fell on the ground (things always go wrong when you least need them to). However, he then held the Bebop up in the air with one hand and started rotating it while the audience could still see the  video streaming on the laptop screen, and I have to say -- it was remarkably stable.
    Drones have evolved from toys to pro tools, and it was exciting to hear about putting such devices to good use (e.g., thermal mapping of properties, aiding agriculture). And guess how much a really really pro drone can cost? 30,000.

    Robots that live with you
    Vincent* of Aldebaran introduced us to Romeo, Pepper and NAO, the company's three humanoid robots. Pepper has been available for sale in Japan, and reportedly 1,000 exemplars (priced at around $2,000 each) have been sold within the hour, with 300 in just the first minute.
    The robots are similar in terms of software but designed for different purposes (research, entertainment, assisting disabled persons), and can currently perform simple (for humans!) physical tasks such as grasping objects and maintaining their balance while moving. Emotion recognition is also a big part of making them `human', and we've been shown videos of NAO recognizing joy and discontent. 
    It is expected that Pepper becomes available in Europe in a matter of 2-3 years.

    [*] It must have been some sort of agreement or theme of the day that registered speakers could not attend, and colleagues ended up giving talks on their behalf; the actual names of some speakers are not updated on the conference website as of the time of publishing this, and I can only remember the first names.

    Friday, March 6, 2015

    52 Things: Number 22: How do you represent a number and multiply numbers in Montgomery arithmetic?

    This is the latest in a series of blog posts to address the list of '52 Things Every PhD Student Should Know' to do Cryptography: a set of questions compiled to give PhD candidates a sense of what they should know by the end of their first year. This next blog continues the topic of 'Cryptographic Implementation Details'.

    Most of this blog is derived from [1].

    Secure and Efficient?

    While the goal of cryptography is to design highly secure cryptographic protocols, it must also be possible to implement these protocols efficiently so that they can be performed over and over again without slowing users down too much as they, for example, shop online or transfer money via internet banking. Therefore steps are taken to minimise the computational cost of each and every algorithm used in these protocols. One of the most expensive operations used in such algorithms is the reduction of integers modulo some $n>0$, since it is effectively a division operation.

    The Cost of Modular Reductions

    From now on we fix the notation that $x \: \mathrm{mod} \: n$ is the remainder after $x$ has been divided by $n$, so it's a particular non-negative integer which is less than $n$ (as opposed to the equivalence class of $x$ in $\mathbb{Z}/n\mathbb{Z}$, etc.)

    Let $a,b \in \mathbb{Z}$ and say we know $a \: \mathrm{mod} \: n$ and $b \: \mathrm{mod} \: n$. In many cryptographic applications we wish to compute the product $ab \: \mathrm{mod} \: n$. The naive approach is to compute the usual integer product $ab$ and then reduce this modulo $n$. While this is fine for just one multiplication, we may need to carry out several consecutive multiplications which would lead to several (expensive) modular reductions. For example, in RSA the ciphertext of a message $m$ is $c = m^e \: \mathrm{mod} \: pq$ for some large primes $p,q$ and a public exponent $e$. Computing $c$ according to the naive approach means $e$ iterations of multiplying by $m$ and reducing modulo $pq$. It would be much better to somehow defer the tricky modular reduction to the end of the calculation. But at the same time, if we just compute $m^e$ over the integers and then reduce modulo $n$ at the end, we will have to store (and calculate with) extremely large numbers. So this requires a little more thought.

    "Montgomery Space"

    The insight, due to Montgomery, is to work over a convenient modulus, typically a power of 2. So to compute $ab \: \mathrm{mod} \: n$:
    1. Move $a$ and $b$ into "Montgomery space" (remainders modulo some convenient $r$)
    2. Calculate products and perform modular reductions in this nicer space
    3. Convert the answer into the desired remainder modulo $n$.
    Why is a power of 2 convenient? Say $x$ is an integer written in binary (as is typical for computers) and $r = 2^k$ for some integer $k>0$. Then,
    • to compute $x \: \mathrm{mod} \: r$ you just discard all but the rightmost $k$ bits of $x$
    • to compute $xr$ you just shift the digits of $x$ by $k$ places to the left
    • to compute $x/r$ you just shift the digits of $x$ by $k$ places to the right.
    The Algorithm

    Let $a, b, n, r$ be positive integers with $\mathrm{gcd}(n,r)=1$ and $r > n$. We compute $ab \: \mathrm{mod} \: n$ as follows:
    1. Use the extended Euclidean algorithm to find integers $r^{-1},n'$ with $rr^{-1} = 1 + nn'$
    2. Compute $\overline{a} = ar \: \mathrm{mod} \: n$ and $\overline{b} = br \: \mathrm{mod} \: n$
    3. Compute $ u = abr \: \mathrm{mod} \: n$ via:
      1. $t \leftarrow \overline{a}\overline{b}$
      2. $u \leftarrow \left ( t + \left ( n't \: \mathrm{mod} \: r \right ) \: n \right ) / r$
      3. If $u \geq n$ then $u \leftarrow (u - n)$
      4. Output $u$
    4. Multiply $u$ by $r^{-1}$ and reduce modulo $n$.
    Some remarks on speed:
    • Step 1 is always possible since $r$ is coprime to $n$, and can be done very quickly when $r$ is a power of 2 and $m$ is odd.
    • In Step 2, while the integer multiplication (e.g. $a \times r$) is quick (especially if $r = 2^k$), there is an expensive modular reduction for each multiplier $a$ and $b$. There is also an expensive reduction in Step 4. So for just one multiplication this algorithm is probably slower than the naive approach. But calculating many products modulo $n$ of a small number of multipliers (e.g. $a^mb^{m'}$ modulo $n$ for some integers $m,m'$) is much better this way since we can reuse the outputs of Steps 2 and 3 in subsequent multiplications.
    • Step 3 involves three integer multiplications, one integer addition and, if $r=2^k$, discarding all but $k$ bits and a shift $k$ places to the right. All these processes are efficient.

    Proof of Correctness 
    It should be clear that the algorithm does output $ab \: \mathrm{mod} \: n$ provided that $ u = abr \: \mathrm{mod} \: n$ as is claimed in Step 3. To prove this claim, let $t = \overline{a}\overline{b}$ so that
    $u = abr \: \mathrm{mod} \: n$
    $= arbrr^{-1} \: \mathrm{mod} \: n$
    $= tr^{-1} \: \mathrm{mod} \: n$
    $ = trr^{-1}/r \: \mathrm{mod} \: n$
    $ = t(1 + nn')/r \: \mathrm{mod} \: n $
    $ =((t + nn't)/r + mn) \: \mathrm{mod} \: n = (t + (n't + mr)n)/r \: \mathrm{mod} \: n$, for any integer $m$.
    In particular, when $m$ is chosen so that $n't + mr = n't \: \mathrm{mod} \: r$, we find
    $ u = abr \: \mathrm{mod} \: n = (t + (n't \: \mathrm{mod} \: r)n)/r \: \mathrm{mod} \: n. $

    Note that the quantity computed in the algorithm is like the above but without the reduction modulo $n$. This is because, since $n > r > 0$, we have
    $n^2 < rn \Rightarrow n^2 + rn < 2rn \Rightarrow \left( n^2 + rn \right) / r < 2n$ which shows that $(t + (n't \: \mathrm{mod} \: r)n)/r < 2n$ since $t$ is the product of two remainders modulo $n$. So instead of reducing $(t + (n't \: \mathrm{mod} \: r)n)/r$ modulo $n$ we can just compare it against $n$ (which is an efficient operation) and subtract $n$ once if necessary (also quick). Thus the algorithm given is correct.

    [1] -