Last night, my colleague Matthew Hall tweeted
Is there a app/site dictionary of all words possible with element symbols? #RealTimeChem
— Matthew Hall (@cispt2) January 3, 2016
With the recent news of the 7th row of the periodic table being filled, I figured this would be a good time to follow up on Matthew’s request and identify such elemental words.
There are a lot of word lists available online. As an ex-Scrabble addict, I thought of the OSPD first. So, using the SOWPODS word list of 267,751 words, I put together a quick Python program to identify words that can be constructed from 1- and 2-letter element symbols. (The newly confirmed elements, Uut, Uuo, Uup & Uus, don’t occur in any English words.) Importantly, the two letters of a 2-letter symbol must be adjacent in the word. This means that a word like ABRI (a shelter) is not an elemental word: it contains B (boron) and I (iodine), but the A and R are not adjacent and so can’t correspond to Ar (argon). (It could also be read as Br (bromine) followed by I (iodine), but then the remaining A doesn’t match any element.)
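To make the contiguity rule concrete, here is a minimal sketch of the check for a single word. It is not the program used for the results in this post; the function name `is_elemental` and the small `SYMBOLS` set are assumptions made for illustration, and a real run would load the full symbol list.

```python
# Illustrative subset of element symbols (an assumption for this sketch;
# a real run would use the complete list of symbols).
SYMBOLS = {"b", "i", "br", "ar", "ba", "c", "o", "n", "co"}


def is_elemental(word, symbols=SYMBOLS):
    """Return True if `word` splits entirely into adjacent 1- and 2-letter symbols."""
    w = word.lower()
    # reachable[i] is True when the first i letters can be covered by symbols.
    reachable = [True] + [False] * len(w)
    for i in range(len(w)):
        if not reachable[i]:
            continue
        for size in (1, 2):
            if i + size <= len(w) and w[i:i + size] in symbols:
                reachable[i + size] = True
    # An empty string is not counted as a word.
    return bool(w) and reachable[len(w)]


print(is_elemental("ABRI"))   # False: no symbol covers the leading A
print(is_elemental("BACON"))  # True: Ba + C + O + N
```

Walking left to right and recording which prefixes can be covered avoids trying every split separately, which matters little for one word but keeps the idea easy to follow.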
The code below takes 2.0s (down from 4.1s in my first version) to process SOWPODS and identifies 40,989 “elemental words” (up from 19,698 in the first version). Thanks to Noel O’Boyle for suggesting the use of a regex and directly extracting matches (so avoiding looping over individual words) and Rich Lewis for generating output in element-case.
```python
from __future__ import print_function
import re
import sys

if len(sys.argv) != 2:
    print('Usage: code.py WORD_LIST_FILE_NAME')
    sys.exit(0)

wordlist = sys.argv[1]
# Read the whole word list (one word per line) into a single string.
with open(wordlist, 'r') as f:
    words = f.read()
print('Dictionary has %d words' % (len(re.findall('\n', words))))

# Map each lowercase symbol to its proper-case form, e.g. 'ac' -> 'Ac'.
with open('elements.txt', 'r') as eles:
    elems = {e.lower(): e for e in eles.read().split() if e != ''}

# A line is an elemental word if it consists only of symbols placed end to end.
valid_w = re.findall('(^(?:'+'|'.join(elems.keys())+')+?$)', words, re.I|re.M)
print('Found %d elemental words' % (len(valid_w)))

# Replace each lowercase symbol match with its proper-case form.
pattern = re.compile('|'.join(elems.keys()))
elementify = lambda s: pattern.sub(lambda x: elems[x.group()], s)
with open('elemental-%s' % (wordlist), 'w') as o:
    for w in valid_w:
        o.write(elementify(w)+"\n")
```
Just for fun I also extracted all the titles from Wiktionary, irrespective of language. That gives me a list of 2,726,436 words to examine. After 20s (35s with my first version) I got 370,724 “elemental words” (148,211 with the first version).
You can find the code along with the element symbol list and input files in this repository
Update: Thanks to Noel’s suggestion of a regex, I realized my initial implementation had a bug and did not identify all elemental words in a dictionary. The updated code now does, and does it 50% faster.
Update: Thanks to Rich Lewis for providing a patch to output matching words in element-case (e.g., AcOUSTiCAl).
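For readers who want to see how an element-case spelling can be built for a single word, the sketch below works backwards from the end of the word and records one valid split. It is not Rich Lewis's patch: the function name `element_case`, the small `SYMBOLS` list, and the choice to prefer 2-letter symbols when both sizes work are all assumptions made for illustration.

```python
# Illustrative subset of symbols in their proper case (an assumption for
# this sketch; a real run would use the complete list of symbols).
SYMBOLS = ["Ac", "O", "U", "S", "Ti", "C", "Al", "Ca", "Co", "Cu", "Si", "I"]
LOOKUP = {s.lower(): s for s in SYMBOLS}


def element_case(word, lookup=LOOKUP):
    """Return one element-case spelling of `word`, or None if there is none."""
    w = word.lower()
    # best[i] is a list of symbols spelling w[i:], or None if impossible.
    # The empty suffix (index len(w)) is spelled by the empty list.
    best = [None] * len(w) + [[]]
    for i in range(len(w) - 1, -1, -1):
        for size in (2, 1):  # try the 2-letter symbol first
            end = i + size
            if end > len(w):
                continue
            piece = w[i:end]
            if piece in lookup and best[end] is not None:
                best[i] = [lookup[piece]] + best[end]
                break
    if not w or best[0] is None:
        return None
    return "".join(best[0])


print(element_case("acoustical"))  # AcOUSTiCAl
print(element_case("abri"))        # None
```

Because each position only accepts a symbol when the rest of the word is already known to be spellable, the result is always a complete split rather than a partial substitution.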