Text Processing with Ruby

Extract Value from the Data That Surrounds You

by: Rob Miller

Published 2015-09-25
Internal code rmtpruby
Print status In Print
Pages 272
User level Beginner
Keywords ruby, text processing, XML, HTML, CSV, regular expression, unicode, encoding
Related titles

Programming Ruby
Build Awesome Command-Line Applications with Ruby 2

ISBN 9781680500707
Other ISBN Channel epub: 9781680504927
Channel PDF: 9781680504934
Kindle: 9781680501568
Safari: 9781680501575
BISACs COM018000 COMPUTERS / Data Processing
COM051410 COMPUTERS / Programming Languages / Ruby

Highlight

Whatever you want to do with text, Ruby is up to the job. No matter what the source – web pages, databases, the contents of files – learn how to acquire the text and get it into your program. Explore techniques to process that text and then output the transformed or extracted text. Cut even the most complex text-based tasks down to size and learn how to master regular expressions, scrape information from Web pages, develop reusable utilities to process text in pipelines, and more.

Description

Most information in the world is in text format, and programmers often find themselves needing to make sense of the data hiding within. You want to do this efficiently, avoiding labor-intensive, manual work—and Ruby is ideally suited to this task.

Text Processing with Ruby takes a practical approach to working with text.

You’ll soon be able to tackle even the most enormous and entangled text with ease, scything through gigabytes of data and effortlessly extracting the bits that matter.

Top Five Text Processing Tips
by Rob Miller, author of Text Processing with Ruby

Clean up your data first
Data in the real world is messy. It almost always pays off to take some
time to normalize different sources of data and to get them into the
same format before you begin whatever actual processing you need to do.
You’ll have fewer exceptions and special cases in your code, and it’ll be
a lot more resilient.
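
As a rough illustration of this tip, the sketch below maps two differently formatted input lines onto one common shape before any further processing. The `normalize` helper, the field names, and the sample rows are all hypothetical, chosen only for the example.

```ruby
# Hypothetical normalizer: accepts "name<sep>email" lines where the
# separator may be a comma, semicolon, or tab, and returns a uniform Hash.
def normalize(line)
  name, email = line.split(/[,;\t]/, 2).map(&:strip)
  {
    # Collapse stray whitespace and fix capitalization word by word.
    name:  name.split.map(&:capitalize).join(" "),
    # Lowercase the address so the same email always looks the same.
    email: email.downcase
  }
end

rows = ["  rob MILLER , Rob@Example.com", "jane doe;JANE@EXAMPLE.COM  "]
cleaned = rows.map { |line| normalize(line) }
cleaned.each { |record| puts record.inspect }
```

Once every record has the same shape, the code that follows no longer needs a special case for each source.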

Master regular expressions
There are definitely some text processing problems that can’t be solved
with regular expressions, but not that many. While they’re not always
the best or most readable option, knowing regular expressions well will
get you out of many tight spots, and even more often than that will be
the first step towards a more robust solution.
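
One way this can look in practice is a regular expression with named captures that pulls fields out of a web server log line. The pattern, the `parse_log_line` helper, and the sample line are assumptions made for this sketch, and the pattern is deliberately loose rather than a full log format parser.

```ruby
# Illustrative pattern for a common-log-style line; named groups make the
# extracted fields self-describing.
LOG_LINE = /\A(?<ip>\S+) \S+ \S+ \[(?<time>[^\]]+)\] "(?<method>[A-Z]+) (?<path>\S+)/

# Returns a Hash of the captured fields, or nil when the line does not match.
def parse_log_line(line)
  match = LOG_LINE.match(line)
  return nil unless match

  { ip: match[:ip], time: match[:time], method: match[:method], path: match[:path] }
end

sample = '127.0.0.1 - - [25/Sep/2015:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
puts parse_log_line(sample).inspect
```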

Break your problem into discrete steps
Almost all text processing tasks, no matter how complicated they seem on
the face of it, are really a series of small transformations. Figuring
out how to frame your problem in this way will make it easy to take a
pipeline approach, where your text flows through a series of small,
discrete steps, each of which transforms the data in a particular way
and then passes it on. Such programs are both easier to reason about and
easier to modify and extend.
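
A minimal sketch of the pipeline idea, assuming each step is a lambda that takes a list of lines and returns a new list. The step list and the `run_pipeline` name are hypothetical.

```ruby
# Each step performs one small transformation and hands its result on.
STEPS = [
  ->(lines) { lines.map(&:strip) },      # trim surrounding whitespace
  ->(lines) { lines.reject(&:empty?) },  # drop blank lines
  ->(lines) { lines.map(&:downcase) }    # normalize case
].freeze

# Feed the input through every step in order.
def run_pipeline(lines, steps = STEPS)
  steps.reduce(lines) { |data, step| step.call(data) }
end

puts run_pipeline(["  Hello ", "", " WORLD"]).inspect
```

Because each step only sees the output of the one before it, adding, removing, or reordering steps means editing the list rather than rewriting the program.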

Figure out a strategy for missing data
Data in the real world, as well as being messy, also frequently has gaps. Decide early on how you’re going to cope with that — how you’ll represent the absence of particular fields or properties — and you’ll
avoid messiness later on.
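
One possible strategy, sketched here under assumptions: treat a fixed set of placeholder strings as "missing" and represent them all as `nil` at the point of parsing. The marker list and the `presence` and `parse_row` helpers are hypothetical.

```ruby
# Strings that this sketch treats as "no value"; adjust to suit your data.
MISSING_MARKERS = ["", "n/a", "na", "-", "null"].freeze

# Returns the trimmed text, or nil when the value is absent or a placeholder.
def presence(value)
  text = value.to_s.strip
  MISSING_MARKERS.include?(text.downcase) ? nil : text
end

# Every field passes through presence, so gaps always look the same downstream.
def parse_row(row)
  { name: presence(row[0]), phone: presence(row[1]) }
end

puts parse_row(["Rob", "N/A"]).inspect
```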

Make the most of existing tools
There are hundreds of command-line tools that exist solely to process
textual data. Each of them is capable of performing a particular
transformation, which means you don’t need to reinvent the wheel. If you
use existing tools for the parts of your problem that have already been
solved, all that remains is to solve the unique problem that you have.
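
As a small example of leaning on an existing tool, the sketch below hands de-duplication and sorting to the standard `sort` utility through Ruby's `Open3` library. It assumes a Unix-like system with `sort` on the PATH, and the `unique_sorted_lines` helper is hypothetical.

```ruby
require "open3"

# Pipes text through `sort -u` and returns the unique, sorted lines.
# Assumes a POSIX `sort` is installed; LC_ALL=C keeps the ordering predictable.
def unique_sorted_lines(text)
  output, status = Open3.capture2({ "LC_ALL" => "C" }, "sort", "-u", stdin_data: text)
  raise "sort failed" unless status.success?

  output.lines.map(&:chomp)
end

puts unique_sorted_lines("b\na\nb\n").inspect
```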

Contents and Extracts