BYU logo Computer Science

To start this assignment, download this zip file.

Project 5 - Cipher

Background

Julius Caesar was known (among many things) for encoding important military instructions using what became known as the Caesar Cipher.

The Caesar Cipher was a simple mapping from one letter of the alphabet to another: A -> D, B -> E, C -> F, etc.

A cipher that performs a simple character substitution is called a simple substitution cipher. Many other simple substitution ciphers have since been invented, each one using a different strategy for coming up with the letter mappings.

Text that has not been encoded is called plaintext. Text that has been encoded is called ciphertext.

These ciphers are easy to generate and easy to use. They are also fairly easy to break.

One strategy for breaking a simple substitution cipher is to do a character frequency analysis.

In regular English text, each letter is used at a fairly consistent frequency. By analyzing how frequent a letter is in plaintext and ciphertext, you can create reasonable predictions for which characters map to which.

Instructions

In this project, you will implement a simple substitution cipher using a randomized alphabet. You will use that cipher to encode and decode some messages. Finally, you will perform a character frequency analysis on encoded and decoded text as a first step towards breaking the cipher.

1. build_codex.py

Write a program called build_codex.py that generates a randomized simple substitution cipher which it saves to CSV file.

The program should have the following commandline arguments:

  • The name of the CSV file in which to save the cipher

The output file should have the following format:

  • each line is a mapping from one letter to another, separated by a comma (no spaces)

For example:

a,x
b,y
c,z

TIPS

  • Use random.sample(alphabet, len(alphabet)) to generate a randomized version of the alphabet
  • You can import string and then use alphabet = string.ascii_lowercase
    • Or you can simply hardcode the alphabet: alphabet = 'abcdefg...'

2. encode.py

Write a program named encode.py that takes

  • A CSV file containing a simple substitution cipher (in the format described above)
  • An input file containing plaintext to encode
  • An output file to write with the ciphertext

This program then encodes the input file and writes the output file using the provided cipher.

Characters that are not in the cipher should be preserved.

Preserve casing of the input characters; for example, i -> x and I -> X.

3. decode.py

Write a program named decode.py that takes

  • A CSV file containing a simple substitution cipher (in the format described above)
  • An input file containing ciphertext to decode
  • An output file to write with the decoded plaintext

This program then decodes the input file and writes the output file using the provided cipher.

As with encode.py, preserve casing and characters not found in the cipher.

NOTE The cipher CSV is the same one as used by encode.py. To decode using this cipher, the cipher will need to be inverted.

4. frequency.py

Write a program named frequency.py that performs a character-frequency analysis on an input file.

Every distinct character in the file should be counted, and the final counts for each character should be divided by the total number of characters in the file.

The frequencies should be rounded to 3 decimal places.

The program should print a dictionary mapping each character to its associated frequency.

Example

For a file containing

a
bbb
ccc

The frequencies should be

{'a': 0.111, 'b': 0.333, 'c': 0.333, '\n': '0.222'}

Another file might have

a
bb

The frequencies should be

{'a': 0.25, 'b': 0.5, '\n': '0.25'}

Testing

The test_files folder contains several files you can use to test your code. These files are used by the autograder.

You are also encouraged to use your own files to test these programs.

  • Generate a cipher CSV
  • Encode and decode custom messages
  • Perform your own frequence analysis

Grading

ActivityPoints
build_cipher.py10
encode.py30
decode.py30
frequency.py30

Just for fun

More ciphers

There are many other ways to generate a simple substitution cipher. Which ones can you implement?

For example, the ROT ciphers (e.g. ROT13) use a rotation to generate the cipher, and you only need the rotation factor to decode the message.

Crack the code

Can you use the character frequency analysis to derive the cipher from a ciphertext?

Try building a custom cipher CSV by hand based on the frequencies you see. How well does the custom cipher decode the message?

Here is a ciphertext you can practice on:

Bmansopvwopimah! Gmv uorz tiaihuzd puz bicuzs csmfzbp, puz wohp csmfzbp mt BH110. 

Lz umcz gmv uorz zafmgzd puih bwohh. 
Oad xmsz ixcmspoapwg, lz umcz gmv uorz nsmla tsmx puih zkczsizabz.

Lz nirz gmv hmxz cospian lmsdh mt mvs csmcuzp pm bmahidzs:

  I cwzod lipu gmv pm poez buosnz mt gmvs pzhpixmag mt Fzhvh Busihp. 
  Lmse tms ip. Mla ip. Bosz tms ip. 
  Avspvsz ip hm puop ip liww nsml. 
  Puza lopbu tms xisobwzh pm uoccza ia gmvs witz.
  --Svhhzww X. Azwhma
  
Oh gmv bmxz vapm Busihp liww oww gmvs uzosp wmrz Uix ia toipu,
Uz liww ywzhh gmv oad czstmsx xisobwzh ia gmvs witz.

Xog gmv uorz Uih ywzhhianh lipu gmv ia gmvs tvpvsz zadzormsh.

Hiabzszwg, BH 110 iahpsvbpmsh oad POh