Elemental Words

Last night, my colleague Matthew Hall tweeted:

Is there a app/site dictionary of all words possible with element symbols? #RealTimeChem

— Matthew Hall (@cispt2) January 3, 2016

With the recent news of the 7th row of the periodic table being filled, I figured this would be a good time to follow up on Matthew's request and identify such elemental words.

There are a lot of word lists available online. Being an ex-Scrabble addict, the OSPD came to mind. So, using the SOWPODS word list of 267,751 words, I put together a quick Python program to identify words that can be constructed from 1- and 2-letter element symbols. (The newly confirmed elements – Uut, Uuo, Uup & Uus – don’t occur in any English words.) Importantly, the two letters of a 2-letter symbol must be contiguous in the word. This means that a word like ABRI (a shelter) is not an elemental word: it contains Boron & Iodine, but the A and R are not contiguous and so can’t correspond to Argon. (It could also be read as containing Bromine and Iodine, but then the remaining A doesn’t match any element.)
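The contiguity rule can be sketched as a small dynamic-programming check. This is only an illustration, not the program used for the counts in this post: the `is_elemental` function and the tiny `SYMBOLS` subset are hypothetical names made up for the example, and a real run would load the full symbol list.

```python
# Small illustrative subset of element symbols (assumption: the full list
# would normally be loaded from a file such as elements.txt).
SYMBOLS = {"b", "i", "br", "ar", "c", "o", "u", "s", "ti", "al", "ac"}


def is_elemental(word, symbols=SYMBOLS):
    """Return True if `word` splits entirely into contiguous element symbols."""
    word = word.lower()
    n = len(word)
    # reachable[i] is True when the prefix word[:i] can be tiled with symbols
    reachable = [True] + [False] * n
    for i in range(n):
        if not reachable[i]:
            continue
        for size in (1, 2):
            if i + size <= n and word[i:i + size] in symbols:
                reachable[i + size] = True
    # An empty string is not counted as a word
    return n > 0 and reachable[n]


print(is_elemental("abri"))        # False: the leading A matches no symbol
print(is_elemental("acoustical"))  # True: Ac-O-U-S-Ti-C-Al
```

Each position is visited once and only 1- and 2-letter slices are tried, so the check is linear in the word length.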

The code below takes ~~4.1s~~ 2.0s to process SOWPODS and identifies ~~19,698~~ 40,989 “elemental words”. Thanks to Noel O’Boyle for suggesting the use of a regex and directly extracting matches (so avoiding looping over individual words) and Rich Lewis for generating output in element-case.

from __future__ import print_function
import re
import sys

if len(sys.argv) != 2:
    print('Usage: code.py WORD_LIST_FILE_NAME')
    sys.exit(1)

wordlist = sys.argv[1]
with open(wordlist, 'r') as f:
    words = f.read()
# One word per line, so the newline count is the word count
print('Dictionary has %d words' % words.count('\n'))

# Map lower-case symbol -> properly cased symbol, e.g. 'ac' -> 'Ac'
with open('elements.txt', 'r') as eles:
    elems = {e.lower(): e for e in eles.read().split()}
symbols = '|'.join(elems.keys())

# A line is an elemental word if it consists solely of element symbols
valid_w = re.findall('(^(?:' + symbols + ')+?$)', words, re.I | re.M)
print('Found %d elemental words' % len(valid_w))

# Rewrite each match in element-case (e.g., AcOUSTiCAl), whatever its input case
pattern = re.compile(symbols, re.I)
elementify = lambda s: pattern.sub(lambda x: elems[x.group().lower()], s)
with open('elemental-%s' % wordlist, 'w') as o:
    for w in valid_w:
        o.write(elementify(w) + '\n')
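To see how the multiline regex picks out whole lines without looping over words, here is a small self-contained sketch that runs the same idea on an in-memory word list. The symbol subset and the five words are made up for illustration and stand in for elements.txt and the real dictionary.

```python
import re

# Hypothetical miniature inputs standing in for elements.txt and the word list
symbols = ["B", "I", "Br", "Ar", "C", "O", "U", "S",
           "Ti", "Al", "Ac", "N", "Ba", "Re"]
words = "abri\nacoustical\nbare\ncobra\nnib\n"

# With re.M, ^ and $ anchor at every line, so each match is one whole word
# made up of nothing but element symbols. The regex engine backtracks, so
# "bare" is still found via Ba-Re after the B-Ar-e attempt fails.
alternation = "|".join(re.escape(s) for s in symbols)
line_pattern = re.compile(r"^(?:%s)+$" % alternation, re.I | re.M)

matches = line_pattern.findall(words)
print(matches)  # ['acoustical', 'bare', 'nib']
```

Here "abri" fails because A is not a symbol, and "cobra" fails only because Co and Ra were left out of the toy symbol list.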

Just for fun I also extracted all the titles from Wiktionary, irrespective of language. That gives me a list of 2,726,436 words to examine. After ~~35s~~ 20s I got ~~148,211~~ 370,724 “elemental words”.

You can find the code along with the element symbol list and input files in this repository

Update: Thanks to Noel’s suggestion of a regex, I realized my initial implementation had a bug and did not identify all elemental words in a dictionary. The updated code now does, and does it 50% faster.

Update: Thanks to Rich Lewis for providing a patch to output matching words in element-case (e.g., AcOUSTiCAl).
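One caveat on element-case output: a single left-to-right regex substitution does not backtrack across matches, so depending on the order of the symbols in the pattern, a word such as "bare" may come out as BAre (B, Ar, then a stray e) rather than BaRe. A hedged alternative is to segment the word explicitly. The sketch below is not the patch mentioned above; `element_case` and the small `SYMBOLS` map are hypothetical names used only for illustration.

```python
def element_case(word, symbols):
    """Return `word` in element-case, or None if it cannot be segmented.

    `symbols` maps lower-case symbol -> cased symbol, e.g. {'ba': 'Ba'}.
    """
    word = word.lower()
    n = len(word)
    # tail[i] is a list of cased symbols tiling word[i:], or None if impossible
    tail = [None] * n + [[]]
    for i in range(n - 1, -1, -1):
        for size in (2, 1):  # prefer 2-letter symbols when both choices work
            piece = word[i:i + size]
            if i + size <= n and piece in symbols and tail[i + size] is not None:
                tail[i] = [symbols[piece]] + tail[i + size]
                break
    if n == 0 or tail[0] is None:
        return None
    return "".join(tail[0])


# Illustrative subset only (assumption: real use would load the full list)
SYMBOLS = {s.lower(): s for s in
           ["B", "Ba", "Ar", "Re", "C", "Ca", "Ac", "O", "U", "S", "Ti", "Al"]}

print(element_case("bare", SYMBOLS))        # BaRe
print(element_case("bar", SYMBOLS))         # BAr
print(element_case("acoustical", SYMBOLS))  # AcOUSTiCAl
print(element_case("abri", SYMBOLS))        # None
```

Many words have more than one valid segmentation; this sketch returns just one of them, preferring a 2-letter symbol whenever the remainder of the word can still be completed.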
