Unstructured Data

noisy text

nlp

Unstructured data is information that is not organized in a predefined manner. Properly formatted computerized data is stored in a database (making it easily retrievable) and labeled with metadata (‘data about data,’ e.g., author, subject, size). Unstructured information has missing or conflicting metadata and may lack contextual clues, which makes it difficult to understand using traditional programs.

Techniques such as data mining, Natural Language Processing (NLP), and ‘noisy-text’ analytics provide different methods to find patterns in, or otherwise interpret, this information. NLP is a field of artificial intelligence, related to linguistics, that attempts to program computers to understand human languages. There is considerable commercial interest in the field because of its application to news-gathering, text categorization, voice-activation, archiving, and large-scale content-analysis.

‘Noisy-text’ is text with a poor signal-to-noise ratio. Noise in this case refers to the differences between the surface form of a coded representation of the text and the intended, correct, or original text. It can be due to, for example, typographic errors or the colloquialisms always present in natural language, and it usually lowers data quality in a way that makes the text less accessible to automated processing such as natural language processing. Noise can also be introduced through an extraction process (e.g., transcription or OCR) from media other than original electronic texts. Noisy text analytics is a process of information extraction whose goal is to automatically extract structured or semistructured information from noisy unstructured text data.
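To make the idea of noise concrete, here is a minimal sketch in Python that maps misspelled tokens back to their closest intended word using the standard library's difflib. The vocabulary, the `denoise` function, and the 0.7 similarity cutoff are assumptions made for illustration; a real noisy-text pipeline would rely on a much larger lexicon and on context, not on string similarity alone.

```python
import difflib
import re

# Hypothetical vocabulary of "intended" words (an assumption for this sketch);
# a real system would use a large lexicon or a language model.
VOCABULARY = ["please", "send", "the", "invoice", "tomorrow", "morning"]


def denoise(text, vocabulary=VOCABULARY, cutoff=0.7):
    """Replace each token with its closest vocabulary word, if one is similar enough."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    cleaned = []
    for token in tokens:
        if token in vocabulary:
            # Already a known word: keep it unchanged.
            cleaned.append(token)
            continue
        # get_close_matches returns at most n candidates whose similarity
        # ratio is at least `cutoff`, best match first.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        # Tokens with no sufficiently close match pass through untouched.
        cleaned.append(matches[0] if matches else token)
    return " ".join(cleaned)


noisy = "Pleese send the invoise tomorow mornin asap"
print(denoise(noisy))
```

With this toy vocabulary the misspelled words are pulled back to their intended forms, while "asap" is left alone because nothing in the vocabulary is close enough to it. That is also the weakness of the approach: colloquialisms and abbreviations that are missing from the lexicon stay noisy.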

While text analytics is a growing and increasingly mature field that has great value because of the huge amounts of data being produced, processing of noisy text is gaining in importance because many common applications produce noisy text data. In 1998, Merrill Lynch cited a rule of thumb that somewhere around 80-90% of all potentially usable business information may originate in unstructured form. This rule of thumb is not based on primary or any quantitative research, but it is nonetheless accepted by some. A data analytics firm projected that between 2010 and 2020 data usage would grow 50-fold (from 800 exabytes to 40 zettabytes). ‘Computer World’ states that unstructured information might account for more than 70%–80% of all data in organizations.

Common techniques for structuring text usually involve manual tagging with metadata or part-of-speech tagging for further text mining-based structuring. Unstructured Information Management Architecture (UIMA) provides a common framework for processing this information to extract meaning and create structured data about the information. Software that creates machine-processable structure exploits the linguistic, auditory, and visual structure inherent in all forms of human communication. Algorithms can infer this inherent structure from text, for instance, by examining word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to address ambiguities, and relevancy-based techniques can then be used to facilitate search and discovery.
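As a rough illustration of inferring structure from word morphology and small-scale patterns, the Python sketch below guesses a coarse word class from common English suffixes and pulls ISO-style dates out of free text with a regular expression. The suffix rules, the tag names, and the `guess_tag` and `structure` functions are hypothetical and invented for this example; it is not UIMA and not how a production part-of-speech tagger works, since those use trained models and sentence context.

```python
import re

# Hypothetical suffix rules (assumptions for this sketch); real taggers use
# trained models, lexicons, and sentence context rather than suffixes alone.
SUFFIX_RULES = [
    ("ing", "VERB"),
    ("ed", "VERB"),
    ("ly", "ADVERB"),
    ("tion", "NOUN"),
    ("ness", "NOUN"),
]


def guess_tag(word):
    """Guess a coarse word class from the word's ending (its morphology)."""
    lower = word.lower()
    for suffix, tag in SUFFIX_RULES:
        # Require a stem of at least three letters so short words
        # such as "red" or "fly" are not tagged by accident.
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return tag
    return "UNKNOWN"


def structure(text):
    """Turn a free-text sentence into a small structured record."""
    return {
        # Small-scale pattern: dates written as YYYY-MM-DD.
        "dates": re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text),
        # Morphology: one (word, guessed tag) pair per alphabetic token.
        "tags": [(w, guess_tag(w)) for w in re.findall(r"[A-Za-z]+", text)],
    }


record = structure("The organization quickly archived the reports on 2020-03-15.")
print(record["dates"])
print(record["tags"])
```

The resulting dictionary is the kind of structured data that can then be stored alongside the original text as metadata and used for search. Words the suffix rules cannot classify are tagged UNKNOWN, which is where the ambiguities mentioned above would need further enrichment.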

Several commercial solutions are available for analyzing and understanding unstructured data for business applications. These include products from companies like ZL Technologies, Brainspace, SAS, Provalis Research, Inxight, and IBM’s SPSS or Watson, as well as more specialized offerings such as Attensity, Clarabridge, and Sysomos, which focus on analyzing unstructured social media data. Other vendors such as IRI (CoSort) can find and structure data in unstructured sources, then integrate and transform it along with structured data for business intelligence and analytic purposes.