DATA SIMPLIFICATION: Building Word Lists

Over the next few weeks, I will be writing on topics related to my latest book, Data Simplification: Taming Information With Open Source Tools (release date March 23, 2016). I hope I can convince you that this is a book worth reading. Blog readers can use the discount code COMP315 at checkout for a 30% discount.

Word lists are easy to create for just about any written language that has an electronic literature. Here is a short Python script, words.py, that prompts the user to enter a line of text. The script converts the line to lowercase, removes the newline at the end of the line, splits the result into an alphabetized list, removes duplicate terms from the list, and prints the list, one term per line of output. This words.py script can be easily modified to create word lists from plain-text files (See Glossary item, Metasyntactic variable).

    #!/usr/local/bin/python
    import re
    import sys

    print("Enter a line of text to be parsed into a word list")
    line = sys.stdin.readline()
    line = line.lower()    # convert to lowercase
    line = line.rstrip()   # remove the trailing newline
    # split on runs of spaces, remove duplicates, and alphabetize
    linearray = sorted(set(re.split(r' +', line)))
    for word in linearray:
        print(word)

Here is a sample of the output, when the input is the first line of Joyce's Finnegans Wake:

    c:\ftp>words.py
    Enter a line of text to be parsed into a word list
    a way a lone a last a loved a long the riverrun, past Eve and Adam's, from swerve of shore to bend of bay, brings us by a commodius vicus
    a
    adam's,
    and
    bay,
    b...
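The modification for plain-text files mentioned above might look something like the following. This is only a minimal sketch, not the version from the book: the function names word_list and word_list_from_file, the script name words_from_file.py, and the choice of UTF-8 encoding are all assumptions made for illustration. It splits on any whitespace rather than on spaces alone, so that line breaks in the file also separate words. As in words.py, punctuation stays attached to the terms.

```python
import re
import sys


def word_list(text):
    """Return the alphabetized, de-duplicated, lowercased terms in text."""
    # Split on any run of whitespace so that newlines and tabs in a
    # file separate words, not just spaces (an assumption of this sketch).
    terms = re.split(r'\s+', text.lower().strip())
    # Discard the empty string that re.split yields for empty input.
    return sorted(set(term for term in terms if term))


def word_list_from_file(path):
    """Read a plain-text file (assumed UTF-8) and return its word list."""
    with open(path, encoding="utf-8") as infile:
        return word_list(infile.read())


if __name__ == "__main__":
    # Hypothetical usage: python words_from_file.py mytext.txt
    for term in word_list_from_file(sys.argv[1]):
        print(term)
```

Reading the whole file at once keeps the sketch short; for very large files, it may be preferable to read line by line and add each line's terms to a running set.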
Source: Specified Life - Category: Information Technology Tags: complexity computer science data analysis data repurposing data simplification data wrangling information science simplifying data taming data word lists Source Type: blogs