Compare commits

..

7 Commits

Author SHA1 Message Date
Louis Mullie 65dfc01e74 Fix little bug. 2012-07-25 02:42:46 -04:00
Louis Mullie d40c6326b8 Renamed doable to chainable. 2012-07-25 02:30:32 -04:00
Louis Mullie 21dd3bf2f7 Rename do() to chain. 2012-07-25 02:07:07 -04:00
Louis Mullie c6c59d7dbe No need for this ugly patch anymore. 2012-07-25 02:06:57 -04:00
Louis Mullie f50718e647 Add support for lower case builders. 2012-07-25 02:06:48 -04:00
Louis Mullie 0bbcea4a6d Add comment on beauty of code. 2012-07-25 02:06:32 -04:00
Louis Mullie e069fa9248 Rename do() to chain() for more intuitive syntax. 2012-07-25 02:06:21 -04:00
235 changed files with 4366 additions and 6756 deletions

2
.gitignore vendored
View File

@@ -12,6 +12,4 @@
*.html
*.yaml
spec/sandbox.rb
coverage/*
benchmark/*
TODO

2
.rspec
View File

@@ -1,2 +1,2 @@
--format d -c
--format s -c
--order rand

View File

@@ -1,18 +1,11 @@
language: ruby
rvm:
- 1.9.2
- 1.9.3
- 2.0
- 2.1
- 2.2
before_install:
- export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386/"
before_script:
- export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk/"
before_script:
- sudo apt-get install antiword
- sudo apt-get install poppler-utils
- rake treat:install[travis] --trace
script: rake treat:spec --trace
- rake treat:install[travis]
script: rake treat:spec

35
.treat
View File

@@ -1,35 +0,0 @@
# A boolean value indicating whether to silence
# the output of external libraries (e.g. Stanford
# tools, Enju, LDA, Ruby-FANN, Schiphol).
Treat.core.verbosity.silence = false
# A boolean value indicating whether to explain
# the steps that Treat is performing.
Treat.core.verbosity.debug = true
# A boolean value indicating whether Treat should
# try to detect the language of newly input text.
Treat.core.language.detect = false
# A string representing the language to default
# to when detection is off.
Treat.core.language.default = 'english'
# A symbol representing the finest level at which
# language detection should be performed if language
# detection is turned on.
Treat.core.language.detect_at = :document
# The directory containing executables and JAR files.
Treat.paths.bin = '##_INSTALLER_BIN_PATH_##'
# The directory containing trained models
Treat.paths.models = '##_INSTALLER_MODELS_PATH_##'
# Mongo database configuration.
Treat.databases.mongo.db = 'your_database'
Treat.databases.mongo.host = 'localhost'
Treat.databases.mongo.port = '27017'
# Include the DSL by default.
include Treat::Core::DSL

39
Gemfile
View File

@@ -1,45 +1,12 @@
source 'https://rubygems.org'
source :rubygems
gemspec
gem 'birch'
gem 'schiphol'
gem 'yomu'
gem 'ruby-readability'
gem 'nokogiri'
gem 'sourcify'
group :test do
gem 'rspec'
gem 'rake'
gem 'terminal-table'
gem 'simplecov'
end
=begin
gem 'linguistics'
gem 'engtagger'
gem 'open-nlp'
gem 'stanford-core-nlp'
gem 'rwordnet'
gem 'scalpel'
gem 'fastimage'
gem 'decisiontree'
gem 'whatlanguage'
gem 'zip'
gem 'nickel'
gem 'tactful_tokenizer'
gem 'srx-english'
gem 'punkt-segmenter'
gem 'chronic'
gem 'uea-stemmer'
gem 'rbtagger'
gem 'ruby-stemmer'
gem 'activesupport'
gem 'rb-libsvm'
gem 'tomz-liblinear-ruby-swig'
gem 'ruby-fann'
gem 'fuzzy-string-match'
gem 'levenshtein-ffi'
gem 'tf-idf-similarity'
gem 'kronic'
=end
end

View File

@@ -1,4 +1,4 @@
Treat - Text Retrieval, Extraction and Annotation Toolkit, v. 2.0.0
Treat - Text Retrieval, Extraction and Annotation Toolkit, v. 1.1.2
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@@ -15,7 +15,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2011-12.
A non-trivial amount of code has been incorporated and modified from other libraries:
Non-trivial amount of code has been incorporated and modified from other libraries:
- formatters/readers/odt.rb - Mark Watson (GPL license)
- processors/tokenizers/tactful.rb - Matthew Bunday (GPL license)

View File

@@ -1,43 +1,34 @@
[![Build Status](https://secure.travis-ci.org/louismullie/treat.png)](http://travis-ci.org/#!/louismullie/treat)
[![Code Climate](https://codeclimate.com/github/louismullie/treat.png)](https://codeclimate.com/github/louismullie/treat)
[![Dependency Status](https://gemnasium.com/louismullie/treat.png)](https://gemnasium.com/louismullie/treat)
Treat is a framework for natural language processing and computational linguistics in Ruby. It provides a common API for a number of gems and external libraries for document retrieval, parsing, annotation, and information extraction.
![Treat Logo](http://www.louismullie.com/treat/treat-logo.jpg)
**New in v2.0.5: [OpenNLP integration](https://github.com/louismullie/treat/commit/727a307af0c64747619531c3aa355535edbf4632) and [Yomu support](https://github.com/louismullie/treat/commit/e483b764e4847e48b39e91a77af8a8baa1a1d056)**
Treat is a toolkit for natural language processing and computational linguistics in Ruby. The Treat project aims to build a language- and algorithm- agnostic NLP framework for Ruby with support for tasks such as document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition. Learn more by taking a [quick tour](https://github.com/louismullie/treat/wiki/Quick-Tour) or by reading the [manual](https://github.com/louismullie/treat/wiki/Manual).
**Features**
**Current features**
* Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus).
* Text chunkers, sentence segmenters, tokenizers, and parsers (Stanford & Enju).
* Lexical resources (WordNet interface, several POS taggers for English).
* Language, date/time, topic words (LDA) and keyword (TF*IDF) extraction.
* Text retrieval with indexation and full-text search (Ferret).
* Text chunkers, sentence segmenters, tokenizers, and parsers for several languages (Stanford & Enju).
* Word inflectors, including stemmers, conjugators, declensors, and number inflection.
* Serialization of annotated entities to YAML, XML or to MongoDB.
* Lexical resources (WordNet interface, several POS taggers for English, Stanford taggers for several languages).
* Language, date/time, topic words (LDA) and keyword (TF*IDF) extraction.
* Serialization of annotated entities to YAML, XML formats or to MongoDB.
* Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats.
* Linguistic resources, including language detection and tag alignments for several treebanks.
* Machine learning (decision tree, multilayer perceptron, LIBLINEAR, LIBSVM).
* Text retrieval with indexation and full-text search (Ferret).
* Decision tree and multilayer perceptron classification (liblinear coming soon!)
**Contributing**
<br>
I am actively seeking developers that can help maintain and expand this project. You can find a list of ideas for contributing to the project [here](https://github.com/louismullie/treat/wiki/Contributing).
**Resources**
**Authors**
Lead developer: @louismullie [[Twitter](https://twitter.com/LouisMullie)]
Contributors:
- @bdigital
- @automatedtendencies
- @LeFnord
- @darkphantum
- @whistlerbrk
- @smileart
- @erol
* Read the [latest documentation](http://rubydoc.info/github/louismullie/treat/frames).
* See how to [install Treat](https://github.com/louismullie/treat/wiki/Installation).
* Learn how to [use Treat](https://github.com/louismullie/treat/wiki/Manual).
* Help out by [contributing to the project](https://github.com/louismullie/treat/wiki/Contributing).
* View a list of [papers](https://github.com/louismullie/treat/wiki/Papers) about tools included in this toolkit.
* Open an [issue](https://github.com/louismullie/treat/issues).
<br>
**License**
This software is released under the [GPL License](https://github.com/louismullie/treat/wiki/License-Information) and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
This software is released under the [GPL License](https://github.com/louismullie/treat/wiki/License-Information) and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
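Not part of the diff: a minimal usage sketch of the DSL the README describes, assuming the lowercase builders that Treat::Config.sweeten! (lib/treat/config.rb, below) defines for each entity type. The tokenize/tag calls are inferred from the feature list above, not from this changeset.

require 'treat'            # entry point; lib/treat.rb below calls Treat::Config.sweeten!
include Treat::Core::DSL   # same include used by the .treat template above

# Each lowercase builder wraps Treat::Entities::<Type>.build, so
# sentence(...) returns a Treat::Entities::Sentence.
s = sentence 'Obama wrote a book about his life.'
s.tokenize   # assumed worker call (processors/tokenizers in the language configs)
s.tag        # assumed worker call (lexicalizers/taggers)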

14
RELEASE
View File

@@ -41,16 +41,4 @@ Treat - Text Retrieval, Extraction and Annotation Toolkit
1.1.0
* Complete refactoring of the core of the library.
* Separated all configuration stuff from dynamic stuff.
1.2.0
* Added LIBSVM and LIBLINEAR classifier support.
* Added support for serialization of documents and data sets to MongoDB.
* Added specs for most of the core classes.
* Several bug fixes.
2.0.0rc1
* MAJOR CHANGE: the old DSL is no longer supported. A new DSL style using
lowercase keywords is now used and must be required explicitly.
* Separated all configuration stuff from dynamic stuff.
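Not in the release notes: a short sketch of the DSL change the 2.0.0rc1 entry describes, based on the builder methods defined in lib/treat/config.rb further down; both spellings are defined there, and the capitalized form is flagged "THIS WILL BE DEPRECATED IN 2.0".

include Treat::Core::DSL   # the new DSL has to be pulled in explicitly

w = Word('hello')   # old style: capitalized builder (deprecated)
w = word 'hello'    # new style: lowercase keyword, same Treat::Entities::Word.build call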

View File

@@ -1,47 +1,29 @@
# All commands are prefixed with "treat:".
require 'date'
require 'rspec/core/rake_task'
task :default => :spec
namespace :treat do
# Require the Treat library.
require_relative 'lib/treat'
# Sandbox a script, for development.
# Syntax: rake treat:sandbox
task :sandbox do
require_relative 'spec/sandbox'
RSpec::Core::RakeTask.new do |t|
task = ARGV[0].scan(/\[([a-z]*)\]/)
if task && task.size == 0
t.pattern = "./spec/*.rb"
else
t.pattern = "./spec/#{task[0][0]}.rb"
end
end
# Prints the current version of Treat.
# Syntax: rake treat:version
task :version do
puts Treat::VERSION
vpath = '../lib/treat/version.rb'
vfile = File.expand_path(vpath, __FILE__)
contents = File.read(vfile)
puts contents[/VERSION = "([^"]+)"/, 1]
end
# Installs a language pack (default to english).
# A language pack is a set of gems, binaries and
# model files that support the various workers
# that are available for that particular language.
# Syntax: rake treat:install (installs english)
# - OR - rake treat:install[some_language]
task :install, [:language] do |t, args|
language = args.language || 'english'
Treat::Core::Installer.install(language)
end
# Runs 1) the core library specs and 2) the
# worker specs for a) all languages (default)
# or b) a specific language (if specified).
# Also outputs the coverage for the whole
# library to treat/coverage (using SimpleCov).
# N.B. the worker specs are dynamically defined
# following the examples found in spec/workers.
# (see /spec/language/workers for more info)
# Syntax: rake treat:spec (core + all langs)
# - OR - rake treat:spec[some_language]
task :spec, [:language] do |t, args|
require_relative 'spec/helper'
Treat::Specs::Helper.start_coverage
Treat::Specs::Helper.run_library_specs
Treat::Specs::Helper.run_language_specs(args.language)
require './lib/treat'
Treat.install(args.language || 'english')
end
end
end

View File

@@ -1,23 +1,36 @@
# Treat is a toolkit for natural language
# processing and computational linguistics
# in Ruby. The Treat project aims to build
# a language- and algorithm- agnostic NLP
# framework for Ruby with support for tasks
# such as document retrieval, text chunking,
# segmentation and tokenization, natural
# language parsing, part-of-speech tagging,
# keyword mining and named entity recognition.
#
# Author: Louis-Antoine Mullie (c) 2010-12.
#
# Released under the General Public License.
module Treat
# * Load all the core classes. * #
require_relative 'treat/version'
require_relative 'treat/exception'
require_relative 'treat/autoload'
require_relative 'treat/modules'
require_relative 'treat/builder'
# Treat requires Ruby >= 1.9.2
if RUBY_VERSION < '1.9.2'
raise "Treat requires Ruby version 1.9.2 " +
"or higher, but current is #{RUBY_VERSION}."
end
# Custom exception class.
class Exception < ::Exception; end
# Load configuration options.
require 'treat/config'
# Load all workers.
require 'treat/helpers'
# Require library loaders.
require 'treat/loaders'
# Require all core classes.
require 'treat/core'
# Require all entity classes.
require 'treat/entities'
# Lazy load worker classes.
require 'treat/workers'
# Require proxies last.
require 'treat/proxies'
# Turn sugar on.
Treat::Config.sweeten!
# Install packages for a given language.
def self.install(language = :english)
require 'treat/installer'
Treat::Installer.install(language)
end
end

View File

@@ -1,44 +0,0 @@
# Basic mixin for all the main modules;
# takes care of requiring the right files
# in the right order for each one.
#
# If a module's folder (e.g. /entities)
# contains a file with a corresponding
# singular name (e.g. /entity), that
# base class is required first. Then,
# all the files that are found directly
# under that folder are required (but
# not those found in sub-folders).
module Treat::Autoload
# Loads all the files for the base
# module in the appropriate order.
def self.included(base)
m = self.get_module_name(base)
d = self.get_module_path(m)
n = self.singularize(m) + '.rb'
f, p = File.join(d, n), "#{d}/*.rb"
require f if File.readable?(f)
Dir.glob(p).each { |f| require f }
end
# Returns the path to a module's dir.
def self.get_module_path(name)
file = File.expand_path(__FILE__)
dirs = File.dirname(file).split('/')
File.join(*dirs[0..-1], name)
end
# Return the downcased form of the
# module's last name (e.g. "entities").
def self.get_module_name(mod)
mod.to_s.split('::')[-1].downcase
end
# Helper method to singularize words.
def self.singularize(w)
if w[-3..-1] == 'ies'; w[0..-4] + 'y'
else; (w[-1] == 's' ? w[0..-2] : w); end
end
end
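Not in the diff: a sketch of how the Treat::Autoload mixin above is meant to be consumed (presumably from lib/treat/modules.rb, which the new lib/treat.rb requires); the module name here is only an example.

module Treat::Entities
  # Including the mixin fires Treat::Autoload.included(self), which
  # requires lib/treat/entities/entity.rb first (the singularized
  # base file) and then every other *.rb file directly under
  # lib/treat/entities/; sub-folders are not touched.
  include Treat::Autoload
end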

View File

@@ -1,6 +0,0 @@
class Treat::Builder
include Treat::Core::DSL
def initialize(&block)
instance_exec(&block)
end
end
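Not in the diff: a hypothetical use of Treat::Builder. All it does is instance_exec the block against a Builder instance that mixes in Treat::Core::DSL, so DSL builder calls can be written inside the block.

Treat::Builder.new do
  # Evaluated in the context of the Builder instance; the lowercase
  # entity builders are assumed to come from Treat::Core::DSL.
  w = word 'obama'
  puts w.to_s
end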

141
lib/treat/config.rb Normal file
View File

@@ -0,0 +1,141 @@
module Treat::Config
Paths = [ :tmp, :lib, :bin,
:files, :data, :models, :spec ]
class << self
attr_accessor :config
end
Treat.module_eval do
# Handle all missing methods as conf options.
def self.method_missing(sym, *args, &block)
super(sym, *args, &block) if sym == :to_ary
Treat::Config.config[sym]
end
end
def self.configure
# Temporary configuration hash.
config = { paths: {} }
confdir = get_full_path(:lib) + 'treat/config'
# Iterate over each directory in the config.
Dir[confdir + '/*'].each do |dir|
name = File.basename(dir, '.*').intern
config[name] = {}
# Iterate over each file in the directory.
Dir[confdir + "/#{name}/*.rb"].each do |file|
key = File.basename(file, '.*').intern
config[name][key] = eval(File.read(file))
end
end
# Get the path config.
Paths.each do |path|
config[:paths][path] = get_full_path(path)
end
# Get the tag alignments.
configure_tags!(config[:tags][:aligned])
# Convert hash to structs.
self.config = self.hash_to_struct(config)
end
def self.get_full_path(dir)
File.dirname(__FILE__) +
'/../../' + dir.to_s + "/"
end
def self.configure_tags!(config)
ts = config[:tag_sets]
config[:word_tags_to_category] =
align_tags(config[:word_tags], ts)
config[:phrase_tags_to_category] =
align_tags(config[:phrase_tags], ts)
end
# Align tag configuration.
def self.align_tags(tags, tag_sets)
wttc = {}
tags.each_slice(2) do |desc, tags|
category = desc.gsub(',', ' ,').
split(' ')[0].downcase
tag_sets.each_with_index do |tag_set, i|
next unless tags[i]
wttc[tags[i]] ||= {}
wttc[tags[i]][tag_set] = category
end
end
wttc
end
def self.hash_to_struct(hash)
return hash if hash.keys.
select { |k| !k.is_a?(Symbol) }.size > 0
struct = Struct.new(
*hash.keys).new(*hash.values)
hash.each do |key, value|
if value.is_a?(Hash)
struct[key] =
self.hash_to_struct(value)
end
end
struct
end
# Turn on syntactic sugar.
def self.sweeten!
# Undo this in unsweeten! - # Fix
Treat::Entities.module_eval do
self.constants.each do |type|
define_singleton_method(type) do |value='', id=nil|
const_get(type).build(value, id)
end
end
end
return if Treat.core.syntax.sweetened
Treat.core.syntax.sweetened = true
Treat.core.entities.list.each do |type|
kname = type.to_s.capitalize.intern
klass = Treat::Entities.const_get(kname)
Object.class_eval do
define_method(kname.to_s.downcase.intern) do |val, opts={}|
klass.build(val, opts)
end
# THIS WILL BE DEPRECATED IN 2.0
define_method(kname) do |val, opts={}|
klass.build(val, opts)
end
end
end
Treat::Core.constants.each do |kname|
Object.class_eval do
klass = Treat::Core.const_get(kname)
define_method(kname.to_s.downcase.intern) do |*args|
klass.new(*args)
end
# THIS WILL BE DEPRECATED IN 2.0
define_method(kname) do |*args|
klass.new(*args)
end
end
end
end
# Turn off syntactic sugar.
def self.unsweeten!
return unless Treat.core.syntax.sweetened
Treat.core.syntax.sweetened = false
Treat.core.entities.list.each do |type|
name = cc(type).intern
Object.class_eval { remove_method(name) }
end
end
# Run all configuration.
self.configure
end
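Not in the diff: a small worked example of what the align_tags helper above produces, assuming the library is loaded and using the layout of the aligned tag data further down (a description string followed by an array holding one tag per tag set).

tag_sets  = [:claws_c5, :brown, :penn]
word_tags = [
  'Adverb, comparative', ['AV0', 'RBR', 'RBR'],
  'Noun, plural',        ['NN2', 'NNS', 'NNS']
]
Treat::Config.align_tags(word_tags, tag_sets)
# => { 'AV0' => { claws_c5: 'adverb' },
#      'RBR' => { brown: 'adverb', penn: 'adverb' },
#      'NN2' => { claws_c5: 'noun' },
#      'NNS' => { brown: 'noun', penn: 'noun' } }
# The category is just the first word of the description, downcased.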

View File

@@ -1,38 +0,0 @@
# This module uses structs to represent the
# configuration options that are stored in
# the /config folder.
module Treat::Config
# Require configurable mix in.
require_relative 'importable'
# Make all configuration importable.
extend Treat::Config::Importable
# Core configuration options for entities.
class Treat::Config::Entities; end
# Configuration for paths to models, binaries,
# temporary storage and file downloads.
class Treat::Config::Paths; end
# Configuration for all Treat workers.
class Treat::Config::Workers; end
# Helpful linguistic options.
class Treat::Config::Linguistics; end
# Supported workers for each language.
class Treat::Config::Languages; end
# Configuration options for external libraries.
class Treat::Config::Libraries; end
# Configuration options for database
# connectivity (host, port, etc.)
class Treat::Config::Databases; end
# Configuration options for Treat core.
class Treat::Config::Core; end
end

View File

@@ -1,51 +0,0 @@
# Provide default functionality to load configuration
# options from flat files into their respective modules.
module Treat::Config::Configurable
# When extended, add the .config property to
# the class that is being operated on.
def self.extended(base)
class << base; attr_accessor :config; end
base.class_eval { self.config = {} }
end
# Provide base functionality to configure
# all modules. The behaviour is as follows:
#
# 1 - Check if a file named data/$CLASS$.rb
# exists; if so, load that file as the base
# configuration, i.e. "Treat.$CLASS$"; e.g.
# "Treat.core"
#
# 2 - Check if a folder named data/$CLASS$
# exists; if so, load each file in that folder
# as a suboption of the main configuration,
# i.e. "Treat.$CLASS$.$FILE$"; e.g. "Treat.workers"
#
# (where $CLASS$ is the lowercase name of
# the concrete class being extended by this.)
def configure!
path = File.dirname(File.expand_path( # FIXME
__FILE__)).split('/')[0..-4].join('/') + '/'
main_dir = path + 'lib/treat/config/data/'
mod_name = self.name.split('::')[-1].downcase
conf_dir = main_dir + mod_name
base_file = main_dir + mod_name + '.rb'
if File.readable?(base_file)
self.config = eval(File.read(base_file))
elsif FileTest.directory?(conf_dir)
self.config = self.from_dir(conf_dir)
else; raise Treat::Exception,
"No config file found for #{mod_name}."
end
end
# * Helper methods for configuraton * #
def from_dir(conf_dir)
Hash[Dir[conf_dir + '/*'].map do |path|
name = File.basename(path, '.*').intern
[name, eval(File.read(path))]
end]
end
end

View File

@@ -0,0 +1,4 @@
['xml', 'html', 'txt', 'odt',
'abw', 'doc', 'yaml', 'uea',
'lda', 'pdf', 'ptb', 'dot',
'ai', 'id3', 'svo', 'mlp' ]

View File

@@ -0,0 +1,8 @@
{language_to_code: {
arabic: 'UTF-8',
chinese: 'GB18030',
english: 'UTF-8',
french: 'ISO_8859-1',
german: 'ISO_8859-1',
hebrew: 'UTF-8'
}}

View File

@@ -0,0 +1,2 @@
{list: [:entity, :unknown, :email, :url, :symbol, :sentence, :punctuation, :number, :enclitic, :word, :token, :fragment, :phrase, :paragraph, :title, :zone, :list, :block, :page, :section, :collection, :document],
order: [:token, :fragment, :phrase, :sentence, :zone, :section, :document, :collection]}

View File

@@ -0,0 +1,3 @@
{default: :english,
detect: false,
detect_at: :document}

View File

@@ -0,0 +1,8 @@
{description: {
:tmp => 'temporary files',
:lib => 'class and module definitions',
:bin => 'binary files',
:files => 'user-saved files',
:models => 'model files',
:spec => 'spec test files'
}}

View File

@@ -0,0 +1 @@
{sweetened: false}

View File

@@ -0,0 +1 @@
{debug: false, silence: true}

View File

@@ -1,54 +0,0 @@
{
acronyms:
['xml', 'html', 'txt', 'odt',
'abw', 'doc', 'yaml', 'uea',
'lda', 'pdf', 'ptb', 'dot',
'ai', 'id3', 'svo', 'mlp',
'svm', 'srx', 'nlp'],
encodings:
{language_to_code: {
arabic: 'UTF-8',
chinese: 'GB18030',
english: 'UTF-8',
french: 'ISO_8859-1',
german: 'ISO_8859-1',
hebrew: 'UTF-8'
}},
entities:
{list:
[:entity, :unknown, :email,
:url, :symbol, :sentence,
:punctuation, :number,
:enclitic, :word, :token, :group,
:fragment, :phrase, :paragraph,
:title, :zone, :list, :block,
:page, :section, :collection,
:document],
order:
[:token, :fragment, :group,
:sentence, :zone, :section,
:document, :collection]},
language: {
default: :english,
detect: false,
detect_at: :document
},
paths: {
description: {
tmp: 'temporary files',
lib: 'class and module definitions',
bin: 'binary files',
files: 'user-saved files',
models: 'model files',
spec: 'spec test files'
}
},
learning: {
list: [:data_set, :export, :feature, :tag, :problem, :question]
},
syntax: { sweetened: false },
verbosity: { debug: false, silence: true}
}

View File

@@ -1,10 +0,0 @@
{
default: {
adapter: :mongo
},
mongo: {
host: 'localhost',
port: '27017',
db: nil
}
}

View File

@@ -1,15 +0,0 @@
{
list:
[:entity, :unknown, :email,
:url, :symbol, :sentence,
:punctuation, :number,
:enclitic, :word, :token,
:fragment, :phrase, :paragraph,
:title, :zone, :list, :block,
:page, :section, :collection,
:document],
order:
[:token, :fragment, :phrase,
:sentence, :zone, :section,
:document, :collection]
}

View File

@@ -1,33 +0,0 @@
{
dependencies: [
'ferret', 'bson_ext', 'mongo', 'lda-ruby',
'stanford-core-nlp', 'linguistics',
'ruby-readability', 'whatlanguage',
'chronic', 'kronic', 'nickel', 'decisiontree',
'rb-libsvm', 'ruby-fann', 'zip', 'loggability',
'tf-idf-similarity', 'narray', 'fastimage',
'fuzzy-string-match', 'levenshtein-ffi'
],
workers: {
learners: {
classifiers: [:id3, :linear, :mlp, :svm]
},
extractors: {
keywords: [:tf_idf],
language: [:what_language],
topic_words: [:lda],
tf_idf: [:native],
distance: [:levenshtein],
similarity: [:jaro_winkler, :tf_idf]
},
formatters: {
serializers: [:xml, :yaml, :mongo],
unserializers: [:xml, :yaml, :mongo],
visualizers: [:dot, :standoff, :tree]
},
retrievers: {
searchers: [:ferret],
indexers: [:ferret]
}
}
}

View File

@@ -1,95 +0,0 @@
{
dependencies: [
'rbtagger',
'ruby-stemmer',
'punkt-segmenter',
'tactful_tokenizer',
'nickel',
'rwordnet',
'uea-stemmer',
'engtagger',
'activesupport',
'srx-english',
'scalpel'
],
workers: {
extractors: {
time: [:chronic, :kronic, :ruby, :nickel],
topics: [:reuters],
name_tag: [:stanford]
},
inflectors: {
conjugators: [:linguistics],
declensors: [:english, :linguistics],
stemmers: [:porter, :porter_c, :uea],
ordinalizers: [:linguistics],
cardinalizers: [:linguistics]
},
lexicalizers: {
taggers: [:lingua, :brill, :stanford],
sensers: [:wordnet],
categorizers: [:from_tag]
},
processors: {
parsers: [:stanford],
segmenters: [:scalpel, :srx, :tactful, :punkt, :stanford],
tokenizers: [:ptb, :stanford, :punkt, :open_nlp]
}
},
stop_words:
[
"about",
"also",
"are",
"away",
"because",
"been",
"beside",
"besides",
"between",
"but",
"cannot",
"could",
"did",
"etc",
"even",
"ever",
"every",
"for",
"had",
"have",
"how",
"into",
"isn",
"maybe",
"non",
"nor",
"now",
"should",
"such",
"than",
"that",
"then",
"these",
"this",
"those",
"though",
"too",
"was",
"wasn",
"were",
"what",
"when",
"where",
"which",
"while",
"who",
"whom",
"whose",
"will",
"with",
"would",
"wouldn",
"yes"
]
}

View File

@@ -1,148 +0,0 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer',
'stanford-core-nlp'
],
workers: {
processors: {
segmenters: [:scalpel],
tokenizers: [:ptb,:stanford],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
},
stop_words:
[
"ailleurs",
"ainsi",
"alors",
"aucun",
"aucune",
"auquel",
"aurai",
"auras",
"aurez",
"aurons",
"auront",
"aussi",
"autre",
"autres",
"aux",
"auxquelles",
"auxquels",
"avaient",
"avais",
"avait",
"avec",
"avez",
"aviez",
"avoir",
"avons",
"celui",
"cependant",
"certaine",
"certaines",
"certains",
"ces",
"cet",
"cette",
"ceux",
"chacun",
"chacune",
"chaque",
"comme",
"constamment",
"davantage",
"depuis",
"des",
"desquelles",
"desquels",
"dessous",
"dessus",
"donc",
"dont",
"duquel",
"egalement",
"elles",
"encore",
"enfin",
"ensuite",
"etaient",
"etais",
"etait",
"etes",
"etiez",
"etions",
"etre",
"eux",
"guere",
"ici",
"ils",
"jamais",
"jusqu",
"laquelle",
"legerement",
"lequel",
"les",
"lesquelles",
"lesquels",
"leur",
"leurs",
"lors",
"lui",
"maintenant",
"mais",
"malgre",
"moi",
"moins",
"notamment",
"parce",
"plupart",
"pourtant",
"presentement",
"presque",
"puis",
"puisque",
"quand",
"quant",
"que",
"quel",
"quelqu",
"quelque",
"quelques",
"qui",
"quoi",
"quoique",
"rien",
"selon",
"serai",
"seras",
"serez",
"serons",
"seront",
"soient",
"soit",
"sommes",
"sont",
"sous",
"suis",
"telle",
"telles",
"tels",
"toi",
"toujours",
"tout",
"toutes",
"tres",
"trop",
"une",
"vos",
"votre",
"vous"
]
}

View File

@@ -1,137 +0,0 @@
#encoding: UTF-8
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer',
'stanford-core-nlp'
],
workers: {
processors: {
segmenters: [:tactful, :punkt, :stanford, :scalpel],
tokenizers: [:stanford, :punkt],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
},
stop_words:
[
"alle",
"allem",
"alles",
"andere",
"anderem",
"anderen",
"anderer",
"anderes",
"auf",
"bei",
"beim",
"bist",
"dadurch",
"dein",
"deine",
"deiner",
"deines",
"deins",
"dem",
"denen",
"der",
"deren",
"des",
"deshalb",
"dessen",
"diese",
"diesem",
"diesen",
"dieser",
"dieses",
"ein",
"eine",
"einem",
"einen",
"einer",
"eines",
"euer",
"euere",
"eueren",
"eueres",
"für",
"haben",
"habt",
"hatte",
"hatten",
"hattest",
"hattet",
"hierzu",
"hinter",
"ich",
"ihr",
"ihre",
"ihren",
"ihrer",
"ihres",
"indem",
"ist",
"jede",
"jedem",
"jeden",
"jeder",
"jedes",
"kann",
"kannst",
"können",
"könnt",
"konnte",
"konnten",
"konntest",
"konntet",
"mehr",
"mein",
"meine",
"meiner",
"meines",
"meins",
"nach",
"neben",
"nicht",
"nichts",
"seid",
"sein",
"seine",
"seiner",
"seines",
"seins",
"sie",
"sind",
"über",
"und",
"uns",
"unser",
"unsere",
"unter",
"vor",
"warst",
"weil",
"wenn",
"werde",
"werden",
"werdet",
"willst",
"wir",
"wird",
"wirst",
"wollen",
"wollt",
"wollte",
"wollten",
"wolltest",
"wolltet",
"zum",
"zur"
]
}

View File

@@ -1,162 +0,0 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
}
},
stop_words:
[
"affinche",
"alcun",
"alcuna",
"alcune",
"alcuni",
"alcuno",
"allora",
"altra",
"altre",
"altri",
"altro",
"anziche",
"certa",
"certe",
"certi",
"certo",
"che",
"chi",
"chiunque",
"comunque",
"con",
"cosa",
"cose",
"cui",
"dagli",
"dai",
"dall",
"dalla",
"dalle",
"darsi",
"degli",
"del",
"dell",
"della",
"delle",
"dello",
"dunque",
"egli",
"eppure",
"esse",
"essi",
"forse",
"gia",
"infatti",
"inoltre",
"invece",
"lui",
"malgrado",
"mediante",
"meno",
"mentre",
"mie",
"miei",
"mio",
"modo",
"molta",
"molte",
"molti",
"molto",
"negli",
"nel",
"nella",
"nelle",
"nessun",
"nessuna",
"nessuno",
"niente",
"noi",
"nostra",
"nostre",
"nostri",
"nostro",
"nulla",
"occorre",
"ogni",
"ognuno",
"oltre",
"oltretutto",
"oppure",
"ovunque",
"ovvio",
"percio",
"pertanto",
"piu",
"piuttosto",
"poca",
"poco",
"poiche",
"propri",
"proprie",
"proprio",
"puo",
"qua",
"qual",
"qualche",
"qualcuna",
"qualcuno",
"quale",
"quali",
"qualunque",
"quando",
"quant",
"quante",
"quanti",
"quanto",
"quantunque",
"quegli",
"quei",
"quest",
"questa",
"queste",
"questi",
"questo",
"qui",
"quindi",
"sebbene",
"sembra",
"sempre",
"senza",
"soltanto",
"stessa",
"stesse",
"stessi",
"stesso",
"sugli",
"sui",
"sul",
"sull",
"sulla",
"sulle",
"suo",
"suoi",
"taluni",
"taluno",
"tanta",
"tanti",
"tanto",
"tra",
"tuo",
"tuoi",
"tutt",
"tutta",
"tutte",
"tutto",
"una",
"uno",
"voi"
]
}

View File

@@ -1,11 +0,0 @@
{
dependencies: [
'punkt-segmenter',
'srx-polish'
],
workers: {
processors: {
segmenters: [:srx, :punkt]
}
}
}

View File

@@ -1,291 +0,0 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
}
},
stop_words:
[
"abans",
"aca",
"acerca",
"ahora",
"aixo",
"algo",
"algu",
"alguien",
"algun",
"alguna",
"algunas",
"algunes",
"alguno",
"algunos",
"alguns",
"alla",
"alli",
"allo",
"altra",
"altre",
"altres",
"amb",
"amunt",
"antes",
"aquel",
"aquell",
"aquella",
"aquellas",
"aquelles",
"aquellos",
"aquells",
"aquest",
"aquesta",
"aquestes",
"aquests",
"aqui",
"asimismo",
"aun",
"aunque",
"avall",
"cada",
"casi",
"com",
"como",
"con",
"cosas",
"coses",
"cual",
"cuales",
"cualquier",
"cuando",
"damunt",
"darrera",
"davant",
"debe",
"deben",
"deber",
"debia",
"debian",
"decia",
"decian",
"decir",
"deia",
"deien",
"del",
"demasiado",
"des",
"desde",
"despues",
"dicen",
"diciendo",
"dins",
"dir",
"diu",
"diuen",
"doncs",
"ell",
"ellas",
"elles",
"ells",
"els",
"encara",
"entonces",
"ese",
"esos",
"esser",
"esta",
"estan",
"estando",
"estant",
"estar",
"estaria",
"estarian",
"estarien",
"estas",
"estos",
"farien",
"feia",
"feien",
"fent",
"fue",
"fueron",
"gaire",
"gairebe",
"hace",
"hacia",
"hacian",
"haciendo",
"haran",
"hauria",
"haurien",
"hemos",
"hola",
"junto",
"lejos",
"les",
"lloc",
"los",
"menos",
"menys",
"meva",
"mias",
"mio",
"misma",
"mismas",
"mismo",
"mismos",
"molt",
"molta",
"moltes",
"mon",
"mucha",
"mucho",
"muy",
"nadie",
"ningu",
"nomes",
"nosaltres",
"nosotros",
"nostra",
"nostre",
"nuestra",
"nuestras",
"nuestro",
"nuestros",
"nunca",
"otra",
"pasa",
"pasan",
"pasara",
"pasaria",
"passara",
"passaria",
"passen",
"perque",
"poc",
"pocas",
"pocos",
"podem",
"poden",
"podeu",
"podria",
"podrian",
"podrien",
"poques",
"porque",
"potser",
"puc",
"pudieron",
"pudo",
"puede",
"pueden",
"puesto",
"qualsevol",
"quan",
"que",
"queria",
"querian",
"qui",
"quien",
"quienes",
"quiere",
"quieren",
"quin",
"quina",
"quines",
"quins",
"quizas",
"segueent",
"segun",
"sempre",
"seran",
"seria",
"serian",
"seu",
"seva",
"sido",
"siempre",
"siendo",
"siguiente",
"sino",
"sobretodo",
"solamente",
"sovint",
"suya",
"suyas",
"suyo",
"suyos",
"tambe",
"tambien",
"tanmateix",
"tanta",
"tanto",
"tendran",
"tendria",
"tendrian",
"tenen",
"teu",
"teva",
"tiene",
"tienen",
"tindran",
"tindria",
"tindrien",
"toda",
"todavia",
"todo",
"tota",
"totes",
"tras",
"traves",
"tuvieron",
"tuvo",
"tuya",
"tuyas",
"tuyo",
"tuyos",
"unas",
"unes",
"unos",
"uns",
"usaba",
"usaban",
"usada",
"usades",
"usado",
"usan",
"usando",
"usant",
"usar",
"usat",
"usava",
"usaven",
"usen",
"vaig",
"varem",
"varen",
"vareu",
"vegada",
"vegades",
"vez",
"volem",
"volen",
"voleu",
"vora",
"vos",
"vosaltres",
"vosotros",
"vostra",
"vostre",
"voy",
"vuestra",
"vuestras",
"vuestro",
"vuestros",
"vull"
]
}

View File

@@ -1,289 +0,0 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
}
},
stop_words:
[
"atminstone",
"an",
"anda",
"aven",
"aldrig",
"alla",
"alls",
"allt",
"alltid",
"allting",
"alltsa",
"andra",
"annan",
"annars",
"antingen",
"att",
"bakom",
"bland",
"blev",
"bli",
"bliva",
"blivit",
"bort",
"bortom",
"bredvid",
"dar",
"darav",
"darefter",
"darfor",
"dari",
"darigenom",
"darvid",
"dedar",
"definitivt",
"del",
"den",
"dendar",
"denhar",
"denna",
"deras",
"dessa",
"dessutom",
"desto",
"det",
"detta",
"dylik",
"efterat",
"efter",
"eftersom",
"eller",
"emellertid",
"enbart",
"endast",
"enligt",
"ens",
"ensam",
"envar",
"eran",
"etc",
"ett",
"exakt",
"fatt",
"fastan",
"fick",
"fler",
"flera",
"foljande",
"foljde",
"foljer",
"for",
"fore",
"forhoppningsvis",
"formodligen",
"forr",
"forra",
"forutom",
"forvisso",
"fran",
"framfor",
"fullstandigt",
"gang",
"gar",
"gatt",
"ganska",
"gav",
"genom",
"genomgaende",
"ger",
"gick",
"gjorde",
"gjort",
"gor",
"hade",
"har",
"harav",
"har",
"hej",
"hela",
"helst",
"helt",
"hitta",
"hon",
"honom",
"hur",
"huruvida",
"huvudsakligen",
"ibland",
"icke",
"ickedestomindre",
"igen",
"ihop",
"inat",
"ingen",
"ingenstans",
"inget",
"innan",
"innehalla",
"inre",
"inte",
"inuti",
"istaellet",
"kanske",
"klart",
"knappast",
"knappt",
"kom",
"komma",
"kommer",
"kraver",
"kunde",
"kunna",
"lata",
"later",
"lagga",
"langre",
"laet",
"lagd",
"leta",
"letar",
"manga",
"maste",
"med",
"medan",
"medans",
"mellan",
"mest",
"min",
"mindre",
"minst",
"mittemellan",
"motsvarande",
"mycket",
"nagon",
"nagongang",
"nagonsin",
"nagonstans",
"nagonting",
"nagorlunda",
"nagot",
"namligen",
"nar",
"nara",
"nasta",
"nastan",
"nedat",
"nedanfor",
"nerat",
"ner",
"nog",
"normalt",
"nummer",
"nuvarande",
"nytt",
"oavsett",
"och",
"ocksa",
"oppna",
"over",
"overallt",
"ofta",
"okej",
"olika",
"ovanfor",
"ratt",
"redan",
"relativt",
"respektive",
"rimlig",
"rimligen",
"rimligt",
"salunda",
"savida",
"saga",
"sager",
"sakert",
"sand",
"sarskilt",
"satt",
"sak",
"samma",
"samtliga",
"sedd",
"senare",
"senaste",
"ser",
"sig",
"sista",
"sjaelv",
"ska",
"skall",
"skickad",
"skriva",
"skulle",
"snabb",
"snarare",
"snart",
"som",
"somliga",
"speciellt",
"stalla",
"stallet",
"starta",
"strax",
"stundom",
"tackar",
"tanka",
"taga",
"tagen",
"tala",
"tanke",
"tidigare",
"tills",
"tog",
"totalt",
"trolig",
"troligen",
"tvaers",
"tvars",
"tycka",
"tyckte",
"tyvarr",
"understundom",
"upp",
"uppenbarligen",
"uppenbart",
"utan",
"utanfor",
"uteslutande",
"utom",
"var",
"varan",
"vad",
"val",
"varde",
"vanlig",
"vanligen",
"var",
"vare",
"varenda",
"varfor",
"varifran",
"varit",
"varje",
"varken",
"vars",
"vart",
"vem",
"verkligen",
"vidare",
"vilken",
"vill",
"visar",
"visst",
"visste"
]
}

View File

@@ -1,16 +0,0 @@
{
punkt: {
model_path: nil
},
reuters: {
model_path: nil
},
stanford: {
jar_path: nil,
model_path: nil
},
open_nlp: {
jar_path: nil,
model_path: nil
}
}

View File

@@ -1,44 +0,0 @@
{
categories:
['adjective', 'adverb', 'noun',
'verb', 'interjection', 'clitic',
'coverb', 'conjunction', 'determiner',
'particle', 'preposition', 'pronoun',
'number', 'symbol', 'punctuation',
'complementizer'],
punctuation: {
punct_to_category: {
'.' => 'period',
',' => 'comma',
';' => 'semicolon',
':' => 'colon',
'?' => 'interrogation',
'!' => 'exclamation',
'"' => 'double_quote',
"'" => 'single_quote',
'$' => 'dollar',
'%' => 'percent',
'#' => 'hash',
'*' => 'asterisk',
'&' => 'ampersand',
'+' => 'plus',
'-' => 'dash',
'/' => 'slash',
'\\' => 'backslash',
'^' => 'caret',
'_' => 'underscore',
'`' => 'tick',
'|' => 'pipe',
'~' => 'tilde',
'@' => 'at',
'[' => 'bracket',
']' => 'bracket',
'{' => 'brace',
'}' => 'brace',
'(' => 'parenthesis',
')' => 'parenthesis',
'<' => 'tag',
'>' => 'tag'
}}
}

View File

@@ -1,328 +0,0 @@
{
aligned: {
tag_sets: [
:claws_c5, :brown, :penn,
:stutgart, :chinese, :paris7
],
phrase_tags: [
'Adjectival phrase', ['', '', 'ADJP', '', '', 'AP'],
'Adverbial phrase', ['', '', 'ADVP', '', '', 'AdP'],
'Conjunction phrase', ['', '', 'CONJP', '', '', 'Ssub'],
'Fragment', ['', '', 'FRAG', '', '', ''],
'Interjectional phrase', ['', '', 'INTJ', '', '', ''],
'List marker', ['', '', 'LST', '', '', ''],
'Not a phrase', ['', '', 'NAC', '', '', ''],
'Noun phrase', ['', '', 'NP', '', '', 'NP'],
'Verbal nucleus', ['', '', '', '', '', 'VN'],
'Head of noun phrase', ['', '', 'NX', '', '', ''],
'Prepositional phrase', ['', '', 'PP', '', '', 'PP'],
'Parenthetical', ['', '', 'PRN', '', '', ''],
'Particle', ['', '', 'PRT', '', '', ''],
'Participial phrase', ['', '', '', '', '', 'VPart'],
'Quantifier phrase', ['', '', 'QP', '', '', ''],
'Relative clause', ['', '', 'RRC', '', '', 'Srel'],
'Coordinated phrase', ['', '', 'UCP', '', '', 'COORD'],
'Infinitival phrase', ['', '', '', '', '', 'VPinf'],
'Verb phrase', ['', '', 'VP', '', '', ''],
'Inverted yes/no question', ['', '', 'SQ', '', '', ''],
'Wh adjective phrase', ['', '', 'WHADJP', '', '', ''],
'Wh adverb phrase', ['', '', 'WHADVP', '', '', ''],
'Wh noun phrase', ['', '', 'WHNP', '', '', ''],
'Wh prepositional phrase', ['', '', 'WHPP', '', '', ''],
'Unknown', ['', '', 'X', '', '', ''],
'Phrase', ['', '', 'P', '', '', 'Sint'],
'Sentence', ['', '', 'S', '', '', 'SENT'],
'Phrase', ['', '', 'SBAR', '', '', ''] # Fix
],
word_tags: [
# Aligned tags for the Claws C5, Brown and Penn tag sets.
'Adjective', ['AJ0', 'JJ', 'JJ', '', 'JJ', 'A'],
'Adjective', ['AJ0', 'JJ', 'JJ', '', 'JJ', 'ADJ'],
'Adjective, adverbial or predicative', ['', '', '', 'ADJD', '', 'ADJ'],
'Adjective, attribute', ['', '', '', 'ADJA', 'VA', 'ADJ'],
'Adjective, ordinal number', ['ORD', 'OD', 'JJ', '', 'OD', 'ADJ'],
'Adjective, comparative', ['AJC', 'JJR', 'JJR', 'KOKOM', '', 'ADJ'],
'Adjective, superlative', ['AJS', 'JJT', 'JJS', '', 'JJ', 'ADJ'],
'Adjective, superlative, semantically', ['AJ0', 'JJS', 'JJ', '', '', 'ADJ'],
'Adjective, cardinal number', ['CRD', 'CD', 'CD', 'CARD', 'CD', 'ADJ'],
'Adjective, cardinal number, one', ['PNI', 'CD', 'CD', 'CARD', 'CD', 'ADJ'],
'Adverb', ['AV0', 'RB', 'RB', 'ADV', 'AD', 'ADV'],
'Adverb, negative', ['XX0', '*', 'RB', 'PTKNEG', '', 'ADV'],
'Adverb, comparative', ['AV0', 'RBR', 'RBR', '', 'AD', 'ADV'],
'Adverb, superlative', ['AV0', 'RBT', 'RBS', '', 'AD', 'ADV'],
'Adverb, particle', ['AVP', 'RP', 'RP', '', '', 'ADV'],
'Adverb, question', ['AVQ', 'WRB', 'WRB', '', 'AD', 'ADV'],
'Adverb, degree & question', ['AVQ', 'WQL', 'WRB', '', 'ADV'],
'Adverb, degree', ['AV0', 'QL', 'RB', '', '', 'ADV'],
'Adverb, degree, postposed', ['AV0', 'QLP', 'RB', '', '', 'ADV'],
'Adverb, nominal', ['AV0', 'RN', 'RB', 'PROP', '', 'ADV'],
'Adverb, pronominal', ['', '', '', '', 'PROP', '', 'ADV'],
'Conjunction, coordination', ['CJC', 'CC', 'CC', 'KON', 'CC', 'COOD'],
'Conjunction, coordination, and', ['CJC', 'CC', 'CC', 'KON', 'CC', 'ET'],
'Conjunction, subordination', ['CJS', 'CS', 'IN', 'KOUS', 'CS', 'CONJ'],
'Conjunction, subordination with to and infinitive', ['', '', '', 'KOUI', '', ''],
'Conjunction, complementizer, that', ['CJT', 'CS', 'IN', '', '', 'C'],
'Determiner', ['DT0', 'DT', 'DT', '', 'DT', 'D'],
'Determiner, pronoun', ['DT0', 'DTI', 'DT', '', '', 'D'],
'Determiner, pronoun, plural', ['DT0', 'DTS', 'DT', '', '', 'D'],
'Determiner, prequalifier', ['DT0', 'ABL', 'DT', '', '', 'D'],
'Determiner, prequantifier', ['DT0', 'ABN', 'PDT', '', 'DT', 'D'],
'Determiner, pronoun or double conjunction', ['DT0', 'ABX', 'PDT', '', '', 'D'],
'Determiner, pronoun or double conjunction', ['DT0', 'DTX', 'DT', '', '', 'D'],
'Determiner, article', ['AT0', 'AT', 'DT', 'ART', '', 'D'],
'Determiner, postdeterminer', ['DT0', 'AP', 'DT', '', '', 'D'],
'Determiner, possessive', ['DPS', 'PP$', 'PRP$', '', '', 'D'],
'Determiner, possessive, second', ['DPS', 'PP$', 'PRPS', '', '', 'D'],
'Determiner, question', ['DTQ', 'WDT', 'WDT', '', 'DT', 'D'],
'Determiner, possessive & question', ['DTQ', 'WP$', 'WP$', '', '', 'D'],
'Interjection', ['', '', '', '', '', 'I'],
'Localizer', ['', '', '', '', 'LC'],
'Measure word', ['', '', '', '', 'M'],
'Noun, common', ['NN0', 'NN', 'NN', 'N', 'NN', 'NN'],
'Noun, singular', ['NN1', 'NN', 'NN', 'NN', 'NN', 'N'],
'Noun, plural', ['NN2', 'NNS', 'NNS', 'NN', 'NN', 'N'],
'Noun, proper, singular', ['NP0', 'NP', 'NNP', 'NE', 'NR', 'N'],
'Noun, proper, plural', ['NP0', 'NPS', 'NNPS', 'NE', 'NR', 'N'],
'Noun, adverbial', ['NN0', 'NR', 'NN', 'NE', '', 'N'],
'Noun, adverbial, plural', ['NN2', 'NRS', 'NNS', '', 'N'],
'Noun, temporal', ['', '', '', '', 'NT', 'N'],
'Noun, verbal', ['', '', '', '', 'NN', 'N'],
'Pronoun, nominal (indefinite)', ['PNI', 'PN', 'PRP', '', 'PN', 'CL'],
'Pronoun, personal, subject', ['PNP', 'PPSS', 'PRP', 'PPER'],
'Pronoun, personal, subject, 3SG', ['PNP', 'PPS', 'PRP', 'PPER'],
'Pronoun, personal, object', ['PNP', 'PPO', 'PRP', 'PPER'],
'Pronoun, reflexive', ['PNX', 'PPL', 'PRP', 'PRF'],
'Pronoun, reflexive, plural', ['PNX', 'PPLS', 'PRP', 'PRF'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WP', 'PWAV'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WPS', 'PWAV'], # FIXME
'Pronoun, question, object', ['PNQ', 'WPO', 'WP', 'PWAV', 'PWAT'],
'Pronoun, existential there', ['EX0', 'EX', 'EX'],
'Pronoun, attributive demonstrative', ['', '', '', 'PDAT'],
'Prounoun, attributive indefinite without determiner', ['', '', '', 'PIAT'],
'Pronoun, attributive possessive', ['', '', '', 'PPOSAT', ''],
'Pronoun, substituting demonstrative', ['', '', '', 'PDS'],
'Pronoun, substituting possessive', ['', '', '', 'PPOSS', ''],
'Prounoun, substituting indefinite', ['', '', '', 'PIS'],
'Pronoun, attributive relative', ['', '', '', 'PRELAT', ''],
'Pronoun, substituting relative', ['', '', '', 'PRELS', ''],
'Pronoun, attributive interrogative', ['', '', '', 'PWAT'],
'Pronoun, adverbial interrogative', ['', '', '', 'PWAV'],
'Pronoun, substituting interrogative', ['', '', '', 'PWS'],
'Verb, main, finite', ['', '', '', 'VVFIN', '', 'V'],
'Verb, main, infinitive', ['', '', '', 'VVINF', '', 'V'],
'Verb, main, imperative', ['', '', '', 'VVIMP', '', 'V'],
'Verb, base present form (not infinitive)', ['VVB', 'VB', 'VBP', '', '', 'V'],
'Verb, infinitive', ['VVI', 'VB', 'VB', 'V', '', 'V'],
'Verb, past tense', ['VVD', 'VBD', 'VBD', '', '', 'V'],
'Verb, present participle', ['VVG', 'VBG', 'VBG', 'VAPP', '', 'V'],
'Verb, past/passive participle', ['VVN', 'VBN', 'VBN', 'VVPP', '', 'V'],
'Verb, present, 3SG, -s form', ['VVZ', 'VBZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary', ['', '', '', 'VAFIN', '', 'V'],
'Verb, imperative', ['', '', '', 'VAIMP', '', 'V'],
'Verb, imperative infinitive', ['', '', '', 'VAINF', '', 'V'],
'Verb, auxiliary do, base', ['VDB', 'DO', 'VBP', '', '', 'V'],
'Verb, auxiliary do, infinitive', ['VDB', 'DO', 'VB', '', '', 'V'],
'Verb, auxiliary do, past', ['VDD', 'DOD', 'VBD', '', '', 'V'],
'Verb, auxiliary do, present participle', ['VDG', 'VBG', 'VBG', '', '', 'V'],
'Verb, auxiliary do, past participle', ['VDN', 'VBN', 'VBN', '', '', 'V'],
'Verb, auxiliary do, present 3SG', ['VDZ', 'DOZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary have, base', ['VHB', 'HV', 'VBP', 'VA', '', 'V'],
'Verb, auxiliary have, infinitive', ['VHI', 'HV', 'VB', 'VAINF', '', 'V'],
'Verb, auxiliary have, past', ['VHD', 'HVD', 'VBD', 'VA', '', 'V'],
'Verb, auxiliary have, present participle', ['VHG', 'HVG', 'VBG', 'VA', '', 'V'],
'Verb, auxiliary have, past participle', ['VHN', 'HVN', 'VBN', 'VAPP', '', 'V'],
'Verb, auxiliary have, present 3SG', ['VHZ', 'HVZ', 'VBZ', 'VA', '', 'V'],
'Verb, auxiliary be, infinitive', ['VBI', 'BE', 'VB', '', '', 'V'],
'Verb, auxiliary be, past', ['VBD', 'BED', 'VBD', '', '', 'V'],
'Verb, auxiliary be, past, 3SG', ['VBD', 'BEDZ', 'VBD', '', '', 'V'],
'Verb, auxiliary be, present participle', ['VBG', 'BEG', 'VBG', '', '', 'V'],
'Verb, auxiliary be, past participle', ['VBN', 'BEN', 'VBN', '', '', 'V'],
'Verb, auxiliary be, present, 3SG', ['VBZ', 'BEZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary be, present, 1SG', ['VBB', 'BEM', 'VBP', '', '', 'V'],
'Verb, auxiliary be, present', ['VBB', 'BER', 'VBP', '', '', 'V'],
'Verb, modal', ['VM0', 'MD', 'MD', 'VMFIN', 'VV', 'V'],
'Verb, modal', ['VM0', 'MD', 'MD', 'VMINF', 'VV', 'V'],
'Verb, modal, finite', ['', '', '', '', 'VMFIN', 'V'],
'Verb, modal, infinite', ['', '', '', '', 'VMINF', 'V'],
'Verb, modal, past participle', ['', '', '', '', 'VMPP', 'V'],
'Particle', ['', '', '', '', '', 'PRT'],
'Particle, with adverb', ['', '', '', 'PTKA', '', 'PRT'],
'Particle, answer', ['', '', '', 'PTKANT', '', 'PRT'],
'Particle, negation', ['', '', '', 'PTKNEG', '', 'PRT'],
'Particle, separated verb', ['', '', '', 'PTKVZ', '', 'PRT'],
'Particle, to as infinitive marker', ['TO0', 'TO', 'TO', 'PTKZU', '', 'PRT'],
'Preposition, comparative', ['', '', '', 'KOKOM', '', 'P'],
'Preposition, to', ['PRP', 'IN', 'TO', '', '', 'P'],
'Preposition', ['PRP', 'IN', 'IN', 'APPR', 'P', 'P'],
'Preposition, with article', ['', '', '', 'APPART', '', 'P'],
'Preposition, of', ['PRF', 'IN', 'IN', '', '', 'P'],
'Possessive', ['POS', '$', 'POS'],
'Postposition', ['', '', '', 'APPO'],
'Circumposition, right', ['', '', '', 'APZR', ''],
'Interjection, onomatopoeia or other isolate', ['ITJ', 'UH', 'UH', 'ITJ', 'IJ'],
'Onomatopoeia', ['', '', '', '', 'ON'],
'Punctuation', ['', '', '', '', 'PU', 'PN'],
'Punctuation, sentence ender', ['PUN', '.', '.', '', '', 'PN'],
'Punctuation, semicolon', ['PUN', '.', '.', '', '', 'PN'],
'Punctuation, colon or ellipsis', ['PUN', ':', ':'],
'Punctuation, comma', ['PUN', ',', ',', '$,'],
'Punctuation, dash', ['PUN', '-', '-'],
'Punctuation, dollar sign', ['PUN', '', '$'],
'Punctuation, left bracket', ['PUL', '(', '(', '$('],
'Punctuation, right bracket', ['PUR', ')', ')'],
'Punctuation, quotation mark, left', ['PUQ', '', '``'],
'Punctuation, quotation mark, right', ['PUQ', '', '"'],
'Punctuation, left bracket', ['PUL', '(', 'PPL'],
'Punctuation, right bracket', ['PUR', ')', 'PPR'],
'Punctuation, left square bracket', ['PUL', '(', 'LSB'],
'Punctuation, right square bracket', ['PUR', ')', 'RSB'],
'Punctuation, left curly bracket', ['PUL', '(', 'LCB'],
'Punctuation, right curly bracket', ['PUR', ')', 'RCB'],
'Unknown, foreign words (not in lexicon)', ['UNZ', '(FW-)', 'FW', '', 'FW'],
'Symbol', ['', '', 'SYM', 'XY'],
'Symbol, alphabetical', ['ZZ0', '', ''],
'Symbol, list item', ['', '', 'LS'],
# Not sure about these tags from the Chinese PTB.
'Aspect marker', ['', '', '', '', 'AS'], # ?
'Ba-construction', ['', '', '', '', 'BA'], # ?
'In relative', ['', '', '', '', 'DEC'], # ?
'Associative', ['', '', '', '', 'DER'], # ?
'In V-de or V-de-R construct', ['', '', '', '', 'DER'], # ?
'For words ? ', ['', '', '', '', 'ETC'], # ?
'In long bei-construct', ['', '', '', '', 'LB'], # ?
'In short bei-construct', ['', '', '', '', 'SB'], # ?
'Sentence-final particle', ['', '', '', '', 'SB'], # ?
'Particle, other', ['', '', '', '', 'MSP'], # ?
'Before VP', ['', '', '', '', 'DEV'], # ?
'Verb, ? as main verb', ['', '', '', '', 'VE'], # ?
'Verb, ????', ['', '', '', '', 'VC'] # ?
]},
enju: {
cat_to_category: {
'ADJ' => 'adjective',
'ADV' => 'adverb',
'CONJ' => 'conjunction',
'COOD' => 'conjunction',
'C' => 'complementizer',
'D' => 'determiner',
'N' => 'noun',
'P' => 'preposition',
'PN' => 'punctuation',
'SC' => 'conjunction',
'V' => 'verb',
'PRT' => 'particle'
},
cat_to_description: [
['ADJ', 'Adjective'],
['ADV', 'Adverb'],
['CONJ', 'Coordination conjunction'],
['C', 'Complementizer'],
['D', 'Determiner'],
['N', 'Noun'],
['P', 'Preposition'],
['SC', 'Subordination conjunction'],
['V', 'Verb'],
['COOD', 'Part of coordination'],
['PN', 'Punctuation'],
['PRT', 'Particle'],
['S', 'Sentence']
],
xcat_to_description: [
['COOD', 'Coordinated phrase/clause'],
['IMP', 'Imperative sentence'],
['INV', 'Subject-verb inversion'],
['Q', 'Interrogative sentence with subject-verb inversion'],
['REL', 'A relativizer included'],
['FREL', 'A free relative included'],
['TRACE', 'A trace included'],
['WH', 'A wh-question word included']
],
xcat_to_ptb: [
['ADJP', '', 'ADJP'],
['ADJP', 'REL', 'WHADJP'],
['ADJP', 'FREL', 'WHADJP'],
['ADJP', 'WH', 'WHADJP'],
['ADVP', '', 'ADVP'],
['ADVP', 'REL', 'WHADVP'],
['ADVP', 'FREL', 'WHADVP'],
['ADVP', 'WH', 'WHADVP'],
['CONJP', '', 'CONJP'],
['CP', '', 'SBAR'],
['DP', '', 'NP'],
['NP', '', 'NP'],
['NX', 'NX', 'NAC'],
['NP', 'REL', 'WHNP'],
['NP', 'FREL', 'WHNP'],
['NP', 'WH', 'WHNP'],
['PP', '', 'PP'],
['PP', 'REL', 'WHPP'],
['PP', 'WH', 'WHPP'],
['PRT', '', 'PRT'],
['S', '', 'S'],
['S', 'INV', 'SINV'],
['S', 'Q', 'SQ'],
['S', 'REL', 'SBAR'],
['S', 'FREL', 'SBAR'],
['S', 'WH', 'SBARQ'],
['SCP', '', 'SBAR'],
['VP', '', 'VP'],
['VP', '', 'VP'],
['', '', 'UK']
]},
paris7: {
tag_to_category: {
'C' => :complementizer,
'PN' => :punctuation,
'SC' => :conjunction
}
# Paris7 Treebank functional tags
=begin
SUJ (subject)
OBJ (direct object)
ATS (predicative complement of a subject)
ATO (predicative complement of a direct object)
MOD (modifier or adjunct)
A-OBJ (indirect complement introduced by à)
DE-OBJ (indirect complement introduced by de)
P-OBJ (indirect complement introduced by another preposition)
=end
},
ptb: {
escape_characters: {
'(' => '-LRB-',
')' => '-RRB-',
'[' => '-LSB-',
']' => '-RSB-',
'{' => '-LCB-',
'}' => '-RCB-'
},
phrase_tag_to_description: [
['S', 'Paris7 declarative clause'],
['SBAR', 'Clause introduced by a (possibly empty) subordinating conjunction'],
['SBARQ', 'Direct question introduced by a wh-word or a wh-phrase'],
['SINV', 'Inverted declarative sentence'],
['SQ', 'Inverted yes/no question']
]
}
}

View File

@@ -0,0 +1 @@
{adapter: :mongo}

View File

@@ -0,0 +1 @@
{host: 'localhost', port: '27017', db: nil }

View File

@@ -1,31 +0,0 @@
# Mixin that is extended by Treat::Config
# in order to provide a single point of
# access method to trigger the import.
module Treat::Config::Importable
# Import relies on each configuration.
require_relative 'configurable'
# Store all the configuration in self.config
def self.extended(base)
class << base; attr_accessor :config; end
end
# Main function; loads all configuration options.
def import!
config, c = {}, Treat::Config::Configurable
definition = :define_singleton_method
Treat::Config.constants.each do |const|
next if const.to_s.downcase.is_mixin?
klass = Treat::Config.const_get(const)
klass.class_eval { extend c }.configure!
name = const.to_s.downcase.intern
config[name] = klass.config
Treat.send(definition, name) do
Treat::Config.config[name]
end
end
self.config = config.to_struct
end
end

View File

@@ -0,0 +1,34 @@
{
dependencies: [
'psych',
'nokogiri',
'ferret',
'bson_ext',
'mongo',
'lda-ruby',
'stanford-core-nlp',
'linguistics',
'ruby-readability',
'whatlanguage',
'chronic',
'nickel',
'decisiontree',
'ai4r'
],
workers: {
extractors: {
keywords: [:tf_idf],
language: [:what_language]
},
formatters: {
serializers: [:xml, :yaml, :mongo]
},
lexicalizers: {
categorizers: [:from_tag]
},
inflectors: {
ordinalizers: [:linguistics],
cardinalizers: [:linguistics]
}
}
}

View File

@@ -6,7 +6,7 @@
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1,60 @@
{
dependencies: [
'rbtagger',
'ruby-stemmer',
'punkt-segmenter',
'tactful_tokenizer',
'nickel',
'rwordnet',
'uea-stemmer',
'engtagger',
'activesupport',
'english'
],
workers: {
extractors: {
time: [:chronic, :ruby, :nickel],
topics: [:reuters],
keywords: [:tf_idf],
name_tag: [:stanford],
coreferences: [:stanford]
},
inflectors: {
conjugators: [:linguistics],
declensors: [:english, :linguistics, :active_support],
stemmers: [:porter, :porter_c, :uea],
ordinalizers: [:linguistics],
cardinalizers: [:linguistics]
},
lexicalizers: {
taggers: [:lingua, :brill, :stanford],
sensers: [:wordnet]
},
processors: {
parsers: [:stanford, :enju],
segmenters: [:tactful, :punkt, :stanford],
tokenizers: [:ptb, :stanford, :tactful, :punkt]
}
},
info: {
stopwords:
['the', 'of', 'and', 'a', 'to', 'in', 'is',
'you', 'that', 'it', 'he', 'was', 'for', 'on',
'are', 'as', 'with', 'his', 'they', 'I', 'at',
'be', 'this', 'have', 'from', 'or', 'one', 'had',
'by', 'word', 'but', 'not', 'what', 'all', 'were',
'we', 'when', 'your', 'can', 'said', 'there', 'use',
'an', 'each', 'which', 'she', 'do', 'how', 'their',
'if', 'will', 'up', 'other', 'about', 'out', 'many',
'then', 'them', 'these', 'so', 'some', 'her', 'would',
'make', 'like', 'him', 'into', 'time', 'has', 'look',
'two', 'more', 'write', 'go', 'see', 'number', 'no',
'way', 'could', 'people', 'my', 'than', 'first', 'been',
'call', 'who', 'its', 'now', 'find', 'long', 'down',
'day', 'did', 'get', 'come', 'made', 'may', 'part',
'say', 'also', 'new', 'much', 'should', 'still',
'such', 'before', 'after', 'other', 'then', 'over',
'under', 'therefore', 'nonetheless', 'thereafter',
'afterwards', 'here', 'huh', 'hah', "n't", "'t", 'here']
}
}

View File

@@ -0,0 +1,18 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer',
'stanford-core-nlp'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
}
}

View File

@@ -0,0 +1,18 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer',
'stanford'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
}
}

View File

@@ -6,7 +6,7 @@
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1,12 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1,12 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful]
}
}
}

View File

@@ -6,7 +6,7 @@
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
tokenizers: [:tactful]
}
}
}

View File

@@ -6,7 +6,7 @@
workers: {
processors: {
segmenters: [:punkt],
tokenizers: []
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1,12 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1,12 @@
{
dependencies: [
'punkt-segmenter',
'tactful_tokenizer'
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [:tactful]
}
}
}

View File

@@ -0,0 +1 @@
{jar_path: nil, model_path: nil}

View File

@@ -0,0 +1,4 @@
['adjective', 'adverb', 'noun', 'verb', 'interjection',
'clitic', 'coverb', 'conjunction', 'determiner', 'particle',
'preposition', 'pronoun', 'number', 'symbol', 'punctuation',
'complementizer']

View File

@@ -0,0 +1,33 @@
{punct_to_category: {
'.' => 'period',
',' => 'comma',
';' => 'semicolon',
':' => 'colon',
'?' => 'interrogation',
'!' => 'exclamation',
'"' => 'double_quote',
"'" => 'single_quote',
'$' => 'dollar',
'%' => 'percent',
'#' => 'hash',
'*' => 'asterisk',
'&' => 'ampersand',
'+' => 'plus',
'-' => 'dash',
'/' => 'slash',
'\\' => 'backslash',
'^' => 'caret',
'_' => 'underscore',
'`' => 'tick',
'|' => 'pipe',
'~' => 'tilde',
'@' => 'at',
'[' => 'bracket',
']' => 'bracket',
'{' => 'brace',
'}' => 'brace',
'(' => 'parenthesis',
')' => 'parenthesis',
'<' => 'tag',
'>' => 'tag'
}}

View File

@@ -1,23 +0,0 @@
# Generates the following path config options:
# Treat.paths.tmp, Treat.paths.bin, Treat.paths.lib,
# Treat.paths.models, Treat.paths.files, Treat.paths.spec.
class Treat::Config::Paths
# Get the path configuration based on the
# directory structure loaded into Paths.
# Note that this doesn't call super, as
# there is no external config files to load.
def self.configure!
root = File.dirname(File.expand_path( # FIXME
__FILE__)).split('/')[0..-4].join('/') + '/'
self.config = Hash[
# Get a list of directories in treat/
Dir.glob(root + '*').select do |path|
FileTest.directory?(path)
# Map to pairs of [:name, path]
end.map do |path|
[File.basename(path).intern, path + '/']
end]
end
end

View File

@@ -1,37 +0,0 @@
# Handles all configuration related
# to understanding of part of speech
# and phrasal tags.
class Treat::Config::Tags
# Generate a map of word and phrase tags
# to their syntactic category, keyed by
# tag set.
def self.configure!
super
config = self.config[:aligned].dup
word_tags, phrase_tags, tag_sets =
config[:word_tags], config[:phrase_tags]
tag_sets = config[:tag_sets]
config[:word_tags_to_category] =
align_tags(word_tags, tag_sets)
config[:phrase_tags_to_category] =
align_tags(phrase_tags, tag_sets)
self.config[:aligned] = config
end
# Helper methods for tag set config.
# Align tag tags in the tag set
def self.align_tags(tags, tag_sets)
wttc = {}
tags.each_slice(2) do |desc, tags|
category = desc.gsub(',', ' ,').
split(' ')[0].downcase
tag_sets.each_with_index do |tag_set, i|
next unless tags[i]
wttc[tags[i]] ||= {}
wttc[tags[i]][tag_set] = category
end
end; return wttc
end
end

View File

@@ -0,0 +1,221 @@
{tag_sets: [
:claws_c5, :brown, :penn, :stutgart, :chinese, :paris7
],
phrase_tags: [
'Adjectival phrase', ['', '', 'ADJP', '', '', 'AP'],
'Adverbial phrase', ['', '', 'ADVP', '', '', 'AdP'],
'Conjunction phrase', ['', '', 'CONJP', '', '', 'Ssub'],
'Fragment', ['', '', 'FRAG', '', '', ''],
'Interjectional phrase', ['', '', 'INTJ', '', '', ''],
'List marker', ['', '', 'LST', '', '', ''],
'Not a phrase', ['', '', 'NAC', '', '', ''],
'Noun phrase', ['', '', 'NP', '', '', 'NP'],
'Verbal nucleus', ['', '', '', '', '', 'VN'],
'Head of noun phrase', ['', '', 'NX', '', '', ''],
'Prepositional phrase', ['', '', 'PP', '', '', 'PP'],
'Parenthetical', ['', '', 'PRN', '', '', ''],
'Particle', ['', '', 'PRT', '', '', ''],
'Participial phrase', ['', '', '', '', '', 'VPart'],
'Quantifier phrase', ['', '', 'QP', '', '', ''],
'Relative clause', ['', '', 'RRC', '', '', 'Srel'],
'Coordinated phrase', ['', '', 'UCP', '', '', 'COORD'],
'Infinitival phrase', ['', '', '', '', '', 'VPinf'],
'Verb phrase', ['', '', 'VP', '', '', ''],
'Wh adjective phrase', ['', '', 'WHADJP', '', '', ''],
'Wh adverb phrase', ['', '', 'WHAVP', '', '', ''],
'Wh noun phrase', ['', '', 'WHNP', '', '', ''],
'Wh prepositional phrase', ['', '', 'WHPP', '', '', ''],
'Unknown', ['', '', 'X', '', '', ''],
'Phrase', ['', '', 'P', '', '', 'Sint'],
'Sentence', ['', '', 'S', '', '', 'SENT'],
'Phrase', ['', '', 'SBAR', '', '', ''] # Fix
],
word_tags: [
# Aligned tags for the Claws C5, Brown and Penn tag sets.
# Adapted from Manning, Christopher and Schütze, Hinrich,
# 1999. Foundations of Statistical Natural Language
# Processing. MIT Press, p. 141-142;
# http://www.isocat.org/rest/dcs/376;
'Adjective', ['AJ0', 'JJ', 'JJ', '', 'JJ', 'A'],
'Adjective', ['AJ0', 'JJ', 'JJ', '', 'JJ', 'ADJ'],
'Adjective, adverbial or predicative', ['', '', '', 'ADJD', '', 'ADJ'],
'Adjective, attribute', ['', '', '', 'ADJA', 'VA', 'ADJ'],
'Adjective, ordinal number', ['ORD', 'OD', 'JJ', '', 'OD', 'ADJ'],
'Adjective, comparative', ['AJC', 'JJR', 'JJR', 'KOKOM', '', 'ADJ'],
'Adjective, superlative', ['AJS', 'JJT', 'JJS', '', 'JJ', 'ADJ'],
'Adjective, superlative, semantically', ['AJ0', 'JJS', 'JJ', '', '', 'ADJ'],
'Adjective, cardinal number', ['CRD', 'CD', 'CD', 'CARD', 'CD', 'ADJ'],
'Adjective, cardinal number, one', ['PNI', 'CD', 'CD', 'CARD', 'CD', 'ADJ'],
'Adverb', ['AV0', 'RB', 'RB', 'ADV', 'AD', 'ADV'],
'Adverb, negative', ['XX0', '*', 'RB', 'PTKNEG', '', 'ADV'],
'Adverb, comparative', ['AV0', 'RBR', 'RBR', '', 'AD', 'ADV'],
'Adverb, superlative', ['AV0', 'RBT', 'RBS', '', 'AD', 'ADV'],
'Adverb, particle', ['AVP', 'RP', 'RP', '', '', 'ADV'],
'Adverb, question', ['AVQ', 'WRB', 'WRB', '', 'AD', 'ADV'],
'Adverb, degree & question', ['AVQ', 'WQL', 'WRB', '', 'ADV'],
'Adverb, degree', ['AV0', 'QL', 'RB', '', '', 'ADV'],
'Adverb, degree, postposed', ['AV0', 'QLP', 'RB', '', '', 'ADV'],
'Adverb, nominal', ['AV0', 'RN', 'RB', 'PROP', '', 'ADV'],
'Adverb, pronominal', ['', '', '', '', 'PROP', '', 'ADV'],
'Conjunction, coordination', ['CJC', 'CC', 'CC', 'KON', 'CC', 'COOD'],
'Conjunction, coordination, and', ['CJC', 'CC', 'CC', 'KON', 'CC', 'ET'],
'Conjunction, subordination', ['CJS', 'CS', 'IN', 'KOUS', 'CS', 'CONJ'],
'Conjunction, subordination with to and infinitive', ['', '', '', 'KOUI', '', ''],
'Conjunction, complementizer, that', ['CJT', 'CS', 'IN', '', '', 'C'],
'Determiner', ['DT0', 'DT', 'DT', '', 'DT', 'D'],
'Determiner, pronoun', ['DT0', 'DTI', 'DT', '', '', 'D'],
'Determiner, pronoun, plural', ['DT0', 'DTS', 'DT', '', '', 'D'],
'Determiner, prequalifier', ['DT0', 'ABL', 'DT', '', '', 'D'],
'Determiner, prequantifier', ['DT0', 'ABN', 'PDT', '', 'DT', 'D'],
'Determiner, pronoun or double conjunction', ['DT0', 'ABX', 'PDT', '', '', 'D'],
'Determiner, pronoun or double conjunction', ['DT0', 'DTX', 'DT', '', '', 'D'],
'Determiner, article', ['AT0', 'AT', 'DT', 'ART', '', 'D'],
'Determiner, postdeterminer', ['DT0', 'AP', 'DT', '', '', 'D'],
'Determiner, possessive', ['DPS', 'PP$', 'PRP$', '', '', 'D'],
'Determiner, possessive, second', ['DPS', 'PP$', 'PRPS', '', '', 'D'],
'Determiner, question', ['DTQ', 'WDT', 'WDT', '', 'DT', 'D'],
'Determiner, possessive & question', ['DTQ', 'WP$', 'WP$', '', '', 'D'],
'Interjection', ['', '', '', '', '', 'I'],
'Localizer', ['', '', '', '', 'LC'],
'Measure word', ['', '', '', '', 'M'],
'Noun, common', ['NN0', 'NN', 'NN', 'N', 'NN', 'NN'],
'Noun, singular', ['NN1', 'NN', 'NN', 'NN', 'NN', 'N'],
'Noun, plural', ['NN2', 'NNS', 'NNS', 'NN', 'NN', 'N'],
'Noun, proper, singular', ['NP0', 'NP', 'NNP', 'NE', 'NR', 'N'],
'Noun, proper, plural', ['NP0', 'NPS', 'NNPS', 'NE', 'NR', 'N'],
'Noun, adverbial', ['NN0', 'NR', 'NN', 'NE', '', 'N'],
'Noun, adverbial, plural', ['NN2', 'NRS', 'NNS', '', 'N'],
'Noun, temporal', ['', '', '', '', 'NT', 'N'],
'Noun, verbal', ['', '', '', '', 'NN', 'N'],
'Pronoun, nominal (indefinite)', ['PNI', 'PN', 'PRP', '', 'PN', 'CL'],
'Pronoun, personal, subject', ['PNP', 'PPSS', 'PRP', 'PPER'],
'Pronoun, personal, subject, 3SG', ['PNP', 'PPS', 'PRP', 'PPER'],
'Pronoun, personal, object', ['PNP', 'PPO', 'PRP', 'PPER'],
'Pronoun, reflexive', ['PNX', 'PPL', 'PRP', 'PRF'],
'Pronoun, reflexive, plural', ['PNX', 'PPLS', 'PRP', 'PRF'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WP', 'PWAV'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WPS', 'PWAV'], # Hack
'Pronoun, question, object', ['PNQ', 'WPO', 'WP', 'PWAV', 'PWAT'],
'Pronoun, existential there', ['EX0', 'EX', 'EX'],
'Pronoun, attributive demonstrative', ['', '', '', 'PDAT'],
'Pronoun, attributive indefinite without determiner', ['', '', '', 'PIAT'],
'Pronoun, attributive possessive', ['', '', '', 'PPOSAT', ''],
'Pronoun, substituting demonstrative', ['', '', '', 'PDS'],
'Pronoun, substituting possessive', ['', '', '', 'PPOSS', ''],
'Pronoun, substituting indefinite', ['', '', '', 'PIS'],
'Pronoun, attributive relative', ['', '', '', 'PRELAT', ''],
'Pronoun, substituting relative', ['', '', '', 'PRELS', ''],
'Pronoun, attributive interrogative', ['', '', '', 'PWAT'],
'Pronoun, adverbial interrogative', ['', '', '', 'PWAV'],
'Pronoun, substituting interrogative', ['', '', '', 'PWS'],
'Verb, main, finite', ['', '', '', 'VVFIN', '', 'V'],
'Verb, main, infinitive', ['', '', '', 'VVINF', '', 'V'],
'Verb, main, imperative', ['', '', '', 'VVIMP', '', 'V'],
'Verb, base present form (not infinitive)', ['VVB', 'VB', 'VBP', '', '', 'V'],
'Verb, infinitive', ['VVI', 'VB', 'VB', 'V', '', 'V'],
'Verb, past tense', ['VVD', 'VBD', 'VBD', '', '', 'V'],
'Verb, present participle', ['VVG', 'VBG', 'VBG', 'VAPP', '', 'V'],
'Verb, past/passive participle', ['VVN', 'VBN', 'VBN', 'VVPP', '', 'V'],
'Verb, present, 3SG, -s form', ['VVZ', 'VBZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary', ['', '', '', 'VAFIN', '', 'V'],
'Verb, imperative', ['', '', '', 'VAIMP', '', 'V'],
'Verb, imperative infinitive', ['', '', '', 'VAINF', '', 'V'],
'Verb, auxiliary do, base', ['VDB', 'DO', 'VBP', '', '', 'V'],
'Verb, auxiliary do, infinitive', ['VDB', 'DO', 'VB', '', '', 'V'],
'Verb, auxiliary do, past', ['VDD', 'DOD', 'VBD', '', '', 'V'],
'Verb, auxiliary do, present participle', ['VDG', 'VBG', 'VBG', '', '', 'V'],
'Verb, auxiliary do, past participle', ['VDN', 'VBN', 'VBN', '', '', 'V'],
'Verb, auxiliary do, present 3SG', ['VDZ', 'DOZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary have, base', ['VHB', 'HV', 'VBP', 'VA', '', 'V'],
'Verb, auxiliary have, infinitive', ['VHI', 'HV', 'VB', 'VAINF', '', 'V'],
'Verb, auxiliary have, past', ['VHD', 'HVD', 'VBD', 'VA', '', 'V'],
'Verb, auxiliary have, present participle', ['VHG', 'HVG', 'VBG', 'VA', '', 'V'],
'Verb, auxiliary have, past participle', ['VHN', 'HVN', 'VBN', 'VAPP', '', 'V'],
'Verb, auxiliary have, present 3SG', ['VHZ', 'HVZ', 'VBZ', 'VA', '', 'V'],
'Verb, auxiliary be, infinitive', ['VBI', 'BE', 'VB', '', '', 'V'],
'Verb, auxiliary be, past', ['VBD', 'BED', 'VBD', '', '', 'V'],
'Verb, auxiliary be, past, 3SG', ['VBD', 'BEDZ', 'VBD', '', '', 'V'],
'Verb, auxiliary be, present participle', ['VBG', 'BEG', 'VBG', '', '', 'V'],
'Verb, auxiliary be, past participle', ['VBN', 'BEN', 'VBN', '', '', 'V'],
'Verb, auxiliary be, present, 3SG', ['VBZ', 'BEZ', 'VBZ', '', '', 'V'],
'Verb, auxiliary be, present, 1SG', ['VBB', 'BEM', 'VBP', '', '', 'V'],
'Verb, auxiliary be, present', ['VBB', 'BER', 'VBP', '', '', 'V'],
'Verb, modal', ['VM0', 'MD', 'MD', 'VMFIN', 'VV', 'V'],
'Verb, modal', ['VM0', 'MD', 'MD', 'VMINF', 'VV', 'V'],
'Verb, modal, finite', ['', '', '', '', 'VMFIN', 'V'],
'Verb, modal, infinite', ['', '', '', '', 'VMINF', 'V'],
'Verb, modal, past participle', ['', '', '', '', 'VMPP', 'V'],
'Particle', ['', '', '', '', '', 'PRT'],
'Particle, with adverb', ['', '', '', 'PTKA', '', 'PRT'],
'Particle, answer', ['', '', '', 'PTKANT', '', 'PRT'],
'Particle, negation', ['', '', '', 'PTKNEG', '', 'PRT'],
'Particle, separated verb', ['', '', '', 'PTKVZ', '', 'PRT'],
'Particle, to as infinitive marker', ['TO0', 'TO', 'TO', 'PTKZU', '', 'PRT'],
'Preposition, comparative', ['', '', '', 'KOKOM', '', 'P'],
'Preposition, to', ['PRP', 'IN', 'TO', '', '', 'P'],
'Preposition', ['PRP', 'IN', 'IN', 'APPR', 'P', 'P'],
'Preposition, with article', ['', '', '', 'APPART', '', 'P'],
'Preposition, of', ['PRF', 'IN', 'IN', '', '', 'P'],
'Possessive', ['POS', '$', 'POS'],
'Postposition', ['', '', '', 'APPO'],
'Circumposition, right', ['', '', '', 'APZR', ''],
'Interjection, onomatopoeia or other isolate', ['ITJ', 'UH', 'UH', 'ITJ', 'IJ'],
'Onomatopoeia', ['', '', '', '', 'ON'],
'Punctuation', ['', '', '', '', 'PU', 'PN'],
'Punctuation, sentence ender', ['PUN', '.', '.', '', '', 'PN'],
'Punctuation, semicolon', ['PUN', '.', '.', '', '', 'PN'],
'Punctuation, colon or ellipsis', ['PUN', ':', ':'],
'Punctuation, comma', ['PUN', ',', ',', '$,'],
'Punctuation, dash', ['PUN', '-', '-'],
'Punctuation, dollar sign', ['PUN', '', '$'],
'Punctuation, left bracket', ['PUL', '(', '(', '$('],
'Punctuation, right bracket', ['PUR', ')', ')'],
'Punctuation, quotation mark, left', ['PUQ', '', '``'],
'Punctuation, quotation mark, right', ['PUQ', '', '"'],
'Punctuation, left bracket', ['PUL', '(', 'PPL'],
'Punctuation, right bracket', ['PUR', ')', 'PPR'],
'Punctuation, left square bracket', ['PUL', '(', 'LSB'],
'Punctuation, right square bracket', ['PUR', ')', 'RSB'],
'Punctuation, left curly bracket', ['PUL', '(', 'LCB'],
'Punctuation, right curly bracket', ['PUR', ')', 'RCB'],
'Unknown, foreign words (not in lexicon)', ['UNZ', '(FW-)', 'FW', '', 'FW'],
'Symbol', ['', '', 'SYM', 'XY'],
'Symbol, alphabetical', ['ZZ0', '', ''],
'Symbol, list item', ['', '', 'LS'],
# Not sure about these tags from the Chinese PTB.
'Aspect marker', ['', '', '', '', 'AS'], # ?
'Ba-construction', ['', '', '', '', 'BA'], # ?
'In relative', ['', '', '', '', 'DEC'], # ?
'Associative', ['', '', '', '', 'DER'], # ?
'In V-de or V-de-R construct', ['', '', '', '', 'DER'], # ?
'For words ? ', ['', '', '', '', 'ETC'], # ?
'In long bei-construct', ['', '', '', '', 'LB'], # ?
'In short bei-construct', ['', '', '', '', 'SB'], # ?
'Sentence-final particle', ['', '', '', '', 'SB'], # ?
'Particle, other', ['', '', '', '', 'MSP'], # ?
'Before VP', ['', '', '', '', 'DEV'], # ?
'Verb, ? as main verb', ['', '', '', '', 'VE'], # ?
'Verb, ????', ['', '', '', '', 'VC'] # ?
]}

View File

@ -0,0 +1,71 @@
{cat_to_category: {
'ADJ' => 'adjective',
'ADV' => 'adverb',
'CONJ' => 'conjunction',
'COOD' => 'conjunction',
'C' => 'complementizer',
'D' => 'determiner',
'N' => 'noun',
'P' => 'preposition',
'PN' => 'punctuation',
'SC' => 'conjunction',
'V' => 'verb',
'PRT' => 'particle'
},
cat_to_description: [
['ADJ', 'Adjective'],
['ADV', 'Adverb'],
['CONJ', 'Coordination conjunction'],
['C', 'Complementizer'],
['D', 'Determiner'],
['N', 'Noun'],
['P', 'Preposition'],
['SC', 'Subordination conjunction'],
['V', 'Verb'],
['COOD', 'Part of coordination'],
['PN', 'Punctuation'],
['PRT', 'Particle'],
['S', 'Sentence']
],
xcat_to_description: [
['COOD', 'Coordinated phrase/clause'],
['IMP', 'Imperative sentence'],
['INV', 'Subject-verb inversion'],
['Q', 'Interrogative sentence with subject-verb inversion'],
['REL', 'A relativizer included'],
['FREL', 'A free relative included'],
['TRACE', 'A trace included'],
['WH', 'A wh-question word included']
],
xcat_to_ptb: [
['ADJP', '', 'ADJP'],
['ADJP', 'REL', 'WHADJP'],
['ADJP', 'FREL', 'WHADJP'],
['ADJP', 'WH', 'WHADJP'],
['ADVP', '', 'ADVP'],
['ADVP', 'REL', 'WHADVP'],
['ADVP', 'FREL', 'WHADVP'],
['ADVP', 'WH', 'WHADVP'],
['CONJP', '', 'CONJP'],
['CP', '', 'SBAR'],
['DP', '', 'NP'],
['NP', '', 'NP'],
['NX', 'NX', 'NAC'],
['NP', 'REL', 'WHNP'],
['NP', 'FREL', 'WHNP'],
['NP', 'WH', 'WHNP'],
['PP', '', 'PP'],
['PP', 'REL', 'WHPP'],
['PP', 'WH', 'WHPP'],
['PRT', '', 'PRT'],
['S', '', 'S'],
['S', 'INV', 'SINV'],
['S', 'Q', 'SQ'],
['S', 'REL', 'SBAR'],
['S', 'FREL', 'SBAR'],
['S', 'WH', 'SBARQ'],
['SCP', '', 'SBAR'],
['VP', '', 'VP'],
['VP', '', 'VP'],
['', '', 'UK']
]}

View File

@ -0,0 +1,17 @@
{tag_to_category: {
'C' => :complementizer,
'PN' => :punctuation,
'SC' => :conjunction
}
# Paris7 Treebank functional tags
=begin
SUJ (subject)
OBJ (direct object)
ATS (predicative complement of a subject)
ATO (predicative complement of a direct object)
MOD (modifier or adjunct)
A-OBJ (indirect complement introduced by à)
DE-OBJ (indirect complement introduced by de)
P-OBJ (indirect complement introduced by another preposition)
=end
}

View File

@ -0,0 +1,15 @@
{escape_characters: {
'(' => '-LRB-',
')' => '-RRB-',
'[' => '-LSB-',
']' => '-RSB-',
'{' => '-LCB-',
'}' => '-RCB-'
},
phrase_tag_to_description: [
['S', 'Simple declarative clause'],
['SBAR', 'Clause introduced by a (possibly empty) subordinating conjunction'],
['SBARQ', 'Direct question introduced by a wh-word or a wh-phrase'],
['SINV', 'Inverted declarative sentence'],
['SQ', 'Inverted yes/no question']
]}

View File

@ -6,7 +6,7 @@
},
time: {
type: :annotator,
targets: [:group]
targets: [:phrase]
},
topics: {
type: :annotator,
@ -22,18 +22,18 @@
},
name_tag: {
type: :annotator,
targets: [:group]
targets: [:phrase, :word]
},
coreferences: {
type: :annotator,
targets: [:zone]
},
tf_idf: {
type: :annotator,
targets: [:word]
},
similarity: {
type: :computer,
targets: [:entity]
},
distance: {
type: :computer,
targets: [:entity]
summary: {
type: :annotator,
targets: [:document]
}
}

View File

@ -1,12 +1,11 @@
{
taggers: {
type: :annotator,
targets: [:group, :token],
recursive: true
targets: [:phrase, :token]
},
categorizers: {
type: :annotator,
targets: [:group, :token],
targets: [:phrase, :token],
recursive: true
},
sensers: {
@ -15,5 +14,5 @@
preset_option: :nym,
presets: [:synonyms, :antonyms,
:hyponyms, :hypernyms],
}
}
}

View File

@ -0,0 +1 @@
[:extractors, :inflectors, :formatters, :learners, :lexicalizers, :processors, :retrievers]

View File

@ -1,7 +1,7 @@
{
chunkers: {
type: :transformer,
targets: [:document, :section],
targets: [:document],
default: :autoselect
},
segmenters: {
@ -10,10 +10,10 @@
},
tokenizers: {
type: :transformer,
targets: [:group]
targets: [:sentence, :phrase]
},
parsers: {
type: :transformer,
targets: [:group]
targets: [:sentence, :phrase]
}
}

5
lib/treat/core.rb Normal file
View File

@ -0,0 +1,5 @@
# Contains the core classes used by Treat.
module Treat::Core
p = Treat.paths.lib + 'treat/core/*.rb'
Dir.glob(p).each { |f| require f }
end

108
lib/treat/core/data_set.rb Normal file
View File

@ -0,0 +1,108 @@
# A DataSet contains an entity classification
# problem as well as data for entities that
# have already been classified, complete with
# references to these entities.
class Treat::Core::DataSet
# Used to serialize Procs.
silence_warnings do
require 'sourcify'
end
# The classification problem this
# data set holds data for.
attr_accessor :problem
# Items that have been already
# classified (training data).
attr_accessor :items
# References to the IDs of the
# original entities contained
# in the data set.
attr_accessor :entities
# Initialize the DataSet. Can be
# done with a Problem entity
# (thereby creating an empty set)
# or with a filename (representing
# a serialized data set which will
# then be deserialized and loaded).
def initialize(prob_or_file)
if prob_or_file.is_a?(String)
ds = self.class.
unserialize(prob_or_file)
@problem = ds.problem
@items = ds.items
@entities = ds.entities
else
@problem = prob_or_file
@items, @entities = [], []
end
end
# Add an entity to the data set.
# The entity's relevant features
# are calculated based on the
# classification problem, and a
# line with the results of the
# calculation is added to the
# data set, along with the ID
# of the entity.
def <<(entity)
@items << @problem.
export_item(entity)
@entities << entity.id
end
# Marshal the data set to the supplied
# file name. Marshal is used for speed;
# other serialization options may be
# provided in later versions. This
# method relies on the sourcify gem
# to transform Feature procs to strings,
# since procs/lambdas can't be serialized.
def serialize(file)
problem = @problem.dup
problem.features.each do |feature|
next unless feature.proc
feature.proc = feature.proc.to_source
end
data = [problem, @items, @entities]
File.open(file, 'w') do |f|
f.write(Marshal.dump(data))
end
problem.features.each do |feature|
next unless feature.proc
source = feature.proc[5..-1]
feature.proc = eval("Proc.new #{source}")
end
end
# Merge another data set into this one.
def merge(data_set)
if data_set.problem != @problem
raise Treat::Exception,
"Cannot merge two data sets that " +
"don't reference the same problem."
else
@items.concat(data_set.items)
@entities.concat(data_set.entities)
end
end
# Unserialize a data set file created
# by using the #serialize method.
def self.unserialize(file)
data = Marshal.load(File.binread(file))
problem, items, entities = *data
problem.features.each do |feature|
next unless feature.proc
source = feature.proc[5..-1]
feature.proc = eval("Proc.new #{source}")
end
data_set = Treat::Core::DataSet.new(problem)
data_set.items = items
data_set.entities = entities
data_set
end
end
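
A minimal usage sketch of the class above, assuming the treat gem is loaded, that problem is an existing Treat::Core::Problem and sentence an already-built entity matching its target; the file name is arbitrary.

# Sketch only: `problem` and `sentence` are assumed to exist already.
ds = Treat::Core::DataSet.new(problem)    # empty set for this problem
ds << sentence                            # stores the feature row and entity id
ds.serialize('training.dump')             # Marshal to disk (procs via sourcify)

reloaded = Treat::Core::DataSet.new('training.dump')         # reload by file name
other    = Treat::Core::DataSet.unserialize('training.dump') # equivalent
ds.merge(other)   # only allowed when both sets reference the same problem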

View File

@ -1,23 +0,0 @@
module Treat::Core::DSL
# Map all classes in Treat::Entities to
# a global builder function (entity, word,
# phrase, punctuation, symbol, list, etc.)
def self.included(base)
def method_missing(sym,*args,&block)
@@entities ||= Treat.core.entities.list
@@learning ||= Treat.core.learning.list
if @@entities.include?(sym)
klass = Treat::Entities.const_get(sym.cc)
return klass.build(*args)
elsif @@learning.include?(sym)
klass = Treat::Learning.const_get(sym.cc)
return klass.new(*args)
else
super(sym,*args,&block)
raise "Uncaught method ended up in Treat DSL."
end
end
end
end
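
For context, the module above (removed in this changeset) is what turns lowercase entity and learning class names into global builders. A sketch of the syntax it enables, assuming the treat gem is loaded, the DSL is included, and :word, :phrase and :sentence are among the registered entity types.

# Sketch of the lowercase builders mapped by Treat::Core::DSL.
include Treat::Core::DSL

w = word('hello')                         # Treat::Entities::Word.build('hello')
g = phrase('a phrase with no ender')      # detected as a Phrase
s = sentence('This one ends with a period.')
puts s.class                              # => Treat::Entities::Sentence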

42
lib/treat/core/feature.rb Normal file
View File

@ -0,0 +1,42 @@
# Represents a feature to be used
# in a classification task.
class Treat::Core::Feature
# The name of the feature. If no
# proc is supplied, this assumes
# that the target of your classification
# problem responds to the method
# corresponding to this name.
attr_reader :name
# A proc that can be used to perform
# calculations before storing a feature.
attr_accessor :proc
# The default value to be used when the
# feature cannot be computed for an entity.
attr_reader :default
# Initialize a feature for a classification
# problem. If two arguments are supplied,
# the second argument is assumed to be the
# default value. If three arguments are
# supplied, the second argument is the
# callback to generate the feature, and
# the third one is the default value.
def initialize(name, proc_or_default = nil, default = nil)
@name = name
if proc_or_default.is_a?(Proc)
@proc, @default =
proc_or_default, default
else
@proc = nil
@default = proc_or_default
end
end
# Custom comparison operator for features.
def ==(feature)
@name == feature.name &&
@proc == feature.proc &&
@default == feature.default
end
end
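
A short sketch of the two construction styles accepted by the initializer above; the feature names, the proc and the defaults are invented for illustration.

# Name + default only: assumes the target entity responds to #word_count.
f1 = Treat::Core::Feature.new(:word_count, 0)

# Name + proc + default: the proc is called with the entity,
# and 0 is stored when it returns nil.
f2 = Treat::Core::Feature.new(:ends_with_qmark,
  ->(e) { e.to_s.end_with?('?') ? 1 : 0 }, 0)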

251
lib/treat/core/node.rb Executable file
View File

@ -0,0 +1,251 @@
# This module provides an abstract tree structure.
module Treat::Core
# This class is a node for an N-ary tree data structure
# with a unique identifier, text value, children, features
# (annotations) and dependencies.
#
# This class was partly based on the 'rubytree' gem.
# RubyTree is licensed under the BSD license and can
# be found at http://rubytree.rubyforge.org/rdoc/.
# I have made several modifications in order to better
# suit this library and to avoid ugly monkey patching.
class Node
# A string containing the node's value (or empty).
attr_accessor :value
# A unique identifier for the node.
attr_reader :id
# An array containing the children of this node.
attr_reader :children
# A hash containing the features of this node.
attr_accessor :features
# An array containing the dependencies that link this
# node to other nodes.
attr_accessor :dependencies
# A struct for dependencies. # Fix
Struct.new('Dependency',
:target, :type, :directed, :direction)
# The parent of the node.
attr_accessor :parent
# Initialize the node with its value and id.
# Setup containers for the children, features
# and dependencies of this node.
def initialize(value, id = nil)
@parent = nil
@value, @id = value, id
@children = []
@children_hash = {}
@features = {}
@dependencies = []
end
# Iterate over each children in the node.
# Non-recursive.
def each
@children.each { |child| yield child }
end
# Boolean - does the node have dependencies?
def has_dependencies?; !(@dependencies.size == 0); end
# Boolean - does the node have children?
def has_children?; !(@children.size == 0); end
# Boolean - does the node have a parent?
def has_parent?; !@parent.nil?; end
# Boolean - does the node have features?
def has_features?; !(@features.size == 0); end
# Does the entity have a feature ?
def has_feature?(feature); @features.has_key?(feature); end
# Boolean - does the node not have a parent?
def is_root?; @parent.nil?; end
# Remove this node from its parent and set as root.
def set_as_root!; @parent = nil; self; end
# Boolean - is this node a leaf ?
# This is overriden in leaf classes.
def is_leaf?; !has_children?; end
# Add the supplied node(s) as children of this node.
# This may be used with several nodes,
# for example: node << [child1, child2, child3]
def <<(nodes)
nodes = [nodes] unless nodes.is_a? Array
if nodes.include?(nil)
raise Treat::Exception,
'Trying to add a nil node.'
end
nodes.each do |node|
node.parent = self
@children << node
@children_hash[node.id] = node
end
nodes[0]
end
# Retrieve a child node by name or index.
def [](name_or_index)
if name_or_index == nil
raise Treat::Exception,
'Non-nil name or index needs to be provided.'
end
if name_or_index.kind_of?(Integer) &&
name_or_index < 1000
@children[name_or_index]
else
@children_hash[name_or_index]
end
end
# Remove the supplied node or id of a
# node from the children.
def remove!(ion)
return nil unless ion
if ion.is_a? Treat::Core::Node
@children.delete(ion)
@children_hash.delete(ion.id)
ion.set_as_root!
else
@children.delete(@children_hash[ion])
@children_hash.delete(ion)
end
end
# Remove all children.
def remove_all!
@children.each do |child|
child.set_as_root!
end
@children = []
@children_hash = {}
self
end
# Return the sibling with position #pos
# versus this one.
# #pos can be ... -1, 0, 1, ...
def sibling(pos)
return nil if is_root?
id = @parent.children.index(self)
@parent.children.at(id + pos)
end
# Return the sibling N positions to
# the left of this one.
def left(n = 1); sibling(-1*n); end
alias :previous_sibling :left
# Return the sibling N positions to the
# right of this one.
def right(n = 1); sibling(1*n); end
alias :next_sibling :right
# Return all brothers and sisters of this node.
def siblings
r = @parent.children.dup
r.delete(self)
r
end
# Total number of nodes in the subtree,
# including this one.
def size
@children.inject(1) do |sum, node|
sum += node.size
end
end
# Set the feature to the supplied value.
def set(feature, value)
@features ||= {}
@features[feature] = value
end
# Return a feature.
def get(feature)
return @value if feature == :value
return @id if feature == :id
@features[feature]
end
# Unset a feature.
def unset(*features)
if features.size == 1
@features.delete(features[0])
else
features.each do |feature|
@features.delete(feature)
end
end
end
# Return the depth of this node in the tree.
def depth
return 0 if is_root?
1 + parent.depth
end
alias :has? :has_feature?
# Link this node to the target node with
# the supplied dependency type.
def link(id_or_node, type = nil,
directed = true, direction = 1)
if id_or_node.is_a?(Treat::Core::Node)
id = root.find(id_or_node).id
else
id = id_or_node
end
@dependencies.each do |d|
return if d.target == id
end
@dependencies <<
Struct::Dependency.new(
id, type,
directed, direction
)
end
# Find the node in the tree with the given id.
def find(id_or_node)
if id_or_node.is_a?(Treat::Core::Node)
id = id_or_node.id
else
id = id_or_node
end
if @children_hash[id]
return @children_hash[id]
end
self.each do |child|
r = child.find(id)
return r if r.is_a? Treat::Core::Node
end
nil
end
# Find the root of the tree within which
# this node is contained.
def root
return self if !has_parent?
ancestor = @parent
while ancestor.has_parent?
ancestor = ancestor.parent
end
ancestor
end
end
end
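
A small sketch exercising the tree API above; apart from Treat::Exception the class is plain Ruby, so this only needs the treat gem loaded. The :dependent type label is arbitrary.

# Build a three-node tree and try a few of the methods above.
root  = Treat::Core::Node.new('root',  1)
left  = Treat::Core::Node.new('left',  2)
right = Treat::Core::Node.new('right', 3)

root << [left, right]       # add several children at once
root.set(:lang, :english)   # feature storage on any node
left.link(3, :dependent)    # dependency from `left` to the node with id 3

puts root.size              # => 3 (nodes in the subtree, including the root)
puts root.find(3).value     # => "right"
puts left.right == right    # => true (sibling one position to the right)
puts left.depth             # => 1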

49
lib/treat/core/problem.rb Normal file
View File

@ -0,0 +1,49 @@
# Defines a classification problem.
# - What question are we trying to answer?
# - What features are we going to look at
# to attempt to answer that question?
class Treat::Core::Problem
# The question we are trying to answer.
attr_reader :question
# An array of features that will be
# looked at in trying to answer the
# problem's question.
attr_reader :features
# Just the labels from the features.
attr_reader :labels
# Initialize the problem with a question
# and an arbitrary number of features.
def initialize(question, *features)
@question = question
@features = features
@labels = @features.map { |f| f.name }
end
# Custom comparison for problems.
def ==(problem)
@question == problem.question &&
@features == problem.features
end
# Return an array of all the entity's
# features, as defined by the problem.
# If include_answer is set to true, will
# append the answer to the problem after
# all of the features.
def export_item(e, include_answer = true)
line = []
@features.each do |feature|
r = feature.proc ?
feature.proc.call(e) :
e.send(feature.name)
line << (r || feature.default)
end
return line unless include_answer
line << (e.has?(@question.name) ?
e.get(@question.name) : @question.default)
line
end
end
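
Tying the learning classes together, a hedged end-to-end sketch: the question, features and expected output are invented for illustration, the Question signature follows the new (name, target, type, default) order shown just below, and :word_count relies on the magic methods of an already tokenized sentence.

# Sketch: is a given sentence a question?
question = Treat::Core::Question.new(:is_question, :sentence, :discrete, 0)

features = [
  Treat::Core::Feature.new(:word_count, 0),
  Treat::Core::Feature.new(:ends_with_qmark,
    ->(s) { s.to_s.end_with?('?') ? 1 : 0 }, 0)
]

problem = Treat::Core::Problem.new(question, *features)

# For a built, tokenized sentence `s`:
# problem.export_item(s)         # e.g. [7, 1, 0] - feature row plus default answer
# problem.export_item(s, false)  # e.g. [7, 1]    - feature row only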

View File

@ -1,6 +1,6 @@
# Defines a question to answer in the
# context of a classification problem.
class Treat::Learning::Question
class Treat::Core::Question
# Defines an arbitrary label for the
# question we are trying to answer
@ -8,32 +8,20 @@ class Treat::Learning::Question
# also be used as the annotation name
# for the answer to the question.
attr_reader :name
# Defines the target of the question
# (e.g. :sentence, :paragraph, etc.)
attr_reader :target
# Can be :continuous or :discrete,
# depending on the features used.
attr_reader :type
# Defines the target of the question
# (e.g. :sentence, :paragraph, etc.)
attr_reader :target
# Default for the answer to the question.
attr_reader :default
# Initialize the question.
def initialize(name, target, default = nil, type = :continuous)
unless name.is_a?(Symbol)
raise Treat::Exception,
"Question name should be a symbol."
end
unless Treat.core.entities.list.include?(target)
raise Treat::Exception, "Target type should be " +
"a symbol and should be one of the following: " +
Treat.core.entities.list.inspect
end
unless [:continuous, :discrete].include?(type)
raise Treat::Exception, "Type should be " +
"continuous or discrete."
end
@name, @target, @type, @default =
name, target, type, default
def initialize(name, target,
type = :continuous, default = nil)
@name, @target = name, target
@type, @default = type, default
end
# Custom comparison operator for questions.

View File

@ -1,41 +0,0 @@
class Treat::Core::Server
# Refer to http://rack.rubyforge.org/doc/classes/Rack/Server.html
# for possible options to configure.
def initialize(handler = 'thin', options = {})
raise "Implementation not finished."
require 'json'; require 'rack'
@handler, @options = handler.capitalize, options
end
def start
handler = Rack::Handler.const_get(@handler)
handler.run(self, @options)
end
def call(env)
headers = { 'content-type' => 'application/json' }
rack_input = env["rack.input"].read
if rack_input.strip == ''
return [500, headers, {
'error' => 'Empty JSON request.'
}]
end
rack_json = JSON.parse(rack_input)
unless rack_json['type'] &&
rack_json['value'] && rack_json['do']
return [500, headers, {
'error' => 'Must specify "type", "value" and "do".'
}]
end
if rack_json['conf']
# Set the configuration.
end
method = rack_json['type'].capitalize.intern
resp = send(method, rack_json['value']).do(rack_json['do'])
response = [resp.to_json]
[200, headers, response]
end
end

6
lib/treat/entities.rb Normal file
View File

@ -0,0 +1,6 @@
# Contains the textual model used by Treat.
module Treat::Entities
require 'treat/entities/entity'
p = Treat.paths.lib + 'treat/entities/*.rb'
Dir.glob(p).each { |f| require f }
end

View File

@ -3,7 +3,7 @@
# a string or a numeric object. This class
# is pretty much self-explanatory.
# FIXME how can we make this language independent?
module Treat::Entities::Entity::Buildable
module Treat::Entities::Abilities::Buildable
require 'schiphol'
require 'fileutils'
@ -15,21 +15,7 @@ module Treat::Entities::Entity::Buildable
PunctRegexp = /^[[:punct:]\$]+$/
UriRegexp = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix
EmailRegexp = /.+\@.+\..+/
Enclitics = [
# EXAMPLE:
"'d", # I'd => I would
"'ll", # I'll => I will
"'m", # I'm => I am
"'re", # We're => We are
"'s", # There's => There is
# Let's => Let us
"'t", # 'Twas => Archaic ('Twas the night)
"'ve", # They've => They have
"n't" # Can't => Can not
]
# Accepted formats of serialized files
AcceptedFormats = ['.xml', '.yml', '.yaml', '.mongo']
Enclitics = %w['ll 'm 're 's 't 've]
# Reserved folder names
Reserved = ['.index']
@ -37,38 +23,23 @@ module Treat::Entities::Entity::Buildable
# Build an entity from anything (can be
# a string, numeric,folder, or file name
# representing a raw or serialized file).
def build(*args)
# This probably needs some doc.
if args.size == 0
file_or_value = ''
elsif args[0].is_a?(Hash)
file_or_value = args[0]
elsif args.size == 1
if args[0].is_a?(Treat::Entities::Entity)
args[0] = [args[0]]
end
file_or_value = args[0]
else
file_or_value = args
end
def build(file_or_value, options = {})
fv = file_or_value.to_s
if fv == ''; self.new
elsif file_or_value.is_a?(Array)
from_array(file_or_value)
elsif file_or_value.is_a?(Hash)
if file_or_value.is_a?(Hash)
from_db(file_or_value)
elsif self == Treat::Entities::Document || (is_serialized_file?(fv))
elsif self == Treat::Entities::Document ||
(fv.index('yml') || fv.index('yaml') ||
fv.index('xml') || fv.index('mongo'))
if fv =~ UriRegexp
from_url(fv)
from_url(fv, options)
else
from_file(fv)
from_file(fv, options)
end
elsif self == Treat::Entities::Collection
if FileTest.directory?(fv)
from_folder(fv)
from_folder(fv, options)
else
create_collection(fv)
end
@ -92,34 +63,27 @@ module Treat::Entities::Entity::Buildable
# is user-created (i.e. by calling build
# instead of from_string directly).
def from_string(string, enforce_type = false)
# If calling using the build syntax (i.e. user-
# called), enforce the type that was supplied.
enforce_type = true if caller_method == :build
unless self == Treat::Entities::Entity
return self.new(string) if enforce_type
end
e = anything_from_string(string)
if enforce_type && !e.is_a?(self)
raise "Asked to build a #{self.mn.downcase} "+
"from \"#{string}\" and to enforce type, "+
"but type detected was #{e.class.mn.downcase}."
end
e
end
# Build a document from an array
# of builders.
def from_array(array)
obj = self.new
array.each do |el|
el = el.to_entity unless el.is_a?(Treat::Entities::Entity)
obj << el
e = anything_from_string(string)
if enforce_type && !e.is_a?(self)
raise "Asked to build a #{cl(self).downcase} "+
"from \"#{string}\" and to enforce type, "+
"but type detected was #{cl(e.class).downcase}."
end
obj
e
end
# Build a document from an URL.
def from_url(url)
def from_url(url, options)
unless self ==
Treat::Entities::Document
raise Treat::Exception,
@ -127,22 +91,16 @@ module Treat::Entities::Entity::Buildable
'else than a document from a url.'
end
begin
folder = Treat.paths.files
if folder[-1] == '/'
folder = folder[0..-2]
end
f = Schiphol.download(url,
download_folder: folder,
show_progress: !Treat.core.verbosity.silence,
rectify_extensions: true,
max_tries: 3)
rescue
raise Treat::Exception,
"Couldn't download file at #{url}."
end
f = Schiphol.download(url,
:download_folder => Treat.paths.files,
:show_progress => Treat.core.verbosity.silence,
:rectify_extensions => true,
:max_tries => 3
)
e = from_file(f,'html')
options[:default_to] ||= 'html'
e = from_file(f, options)
e.set :url, url.to_s
e
@ -165,7 +123,7 @@ module Treat::Entities::Entity::Buildable
# Build an entity from a folder with documents.
# Folders will be searched recursively.
def from_folder(folder)
def from_folder(folder, options)
return if Reserved.include?(folder)
@ -191,43 +149,39 @@ module Treat::Entities::Entity::Buildable
c = Treat::Entities::Collection.new(folder)
folder += '/' unless folder[-1] == '/'
if !FileTest.directory?(folder)
FileUtils.mkdir(folder)
end
c.set :folder, folder
i = folder + '/.index'
c.set :index, i if FileTest.directory?(i)
Dir[folder + '*'].each do |f|
if FileTest.directory?(f)
c2 = Treat::Entities::Collection.
from_folder(f)
from_folder(f, options)
c.<<(c2, false) if c2
else
c.<<(Treat::Entities::Document.
from_file(f), false)
from_file(f, options), false)
end
end
return c
c
end
# Build a document from a raw or serialized file.
def from_file(file,def_fmt=nil)
def from_file(file, options)
if is_serialized_file?(file)
from_serialized_file(file)
if file.index('yml') ||
file.index('yaml') ||
file.index('xml') ||
file.index('mongo')
from_serialized_file(file, options)
else
fmt = Treat::Workers::Formatters::Readers::Autoselect.detect_format(file,def_fmt)
from_raw_file(file, fmt)
fmt = Treat::Workers::Formatters::Readers::Autoselect.
detect_format(file, options[:default_to])
options[:_format] = fmt
from_raw_file(file, options)
end
end
# Build a document from a raw file.
def from_raw_file(file, def_fmt='txt')
def from_raw_file(file, options)
unless self ==
Treat::Entities::Document
@ -241,40 +195,32 @@ module Treat::Entities::Entity::Buildable
"Path '#{file}' does not "+
"point to a readable file."
end
options = {default_format: def_fmt}
d = Treat::Entities::Document.new
d.set :file, file
d = Treat::Entities::Document.new(file)
d.read(:autoselect, options)
end
# Build an entity from a serialized file.
def from_serialized_file(file)
def from_serialized_file(file, options)
unless File.readable?(file)
raise Treat::Exception,
"Path '#{file}' does not "+
"point to a readable file."
end
doc = Treat::Entities::Document.new
doc.set :file, file
format = nil
if File.extname(file) == '.yml' ||
File.extname(file) == '.yaml'
format = :yaml
elsif File.extname(file) == '.xml'
format = :xml
if file.index('mongo')
options[:id] = file.scan( # Consolidate this
/([0-9]+)\.mongo/).first.first
from_db(:mongo, options)
else
raise Treat::Exception,
"Unreadable serialized format for #{file}."
unless File.readable?(file)
raise Treat::Exception,
"Path '#{file}' does not "+
"point to a readable file."
end
d = Treat::Entities::Document.new(file)
d.unserialize(:autoselect, options)
d.children[0].set_as_root! # Fix this
d.children[0]
end
doc.unserialize(format)
doc.children[0].set_as_root! # Fix this
doc.children[0]
end
def is_serialized_file?(path_to_check)
(AcceptedFormats.include? File.extname(path_to_check)) && (File.file?(path_to_check))
end
def from_db(hash)
@ -292,28 +238,15 @@ module Treat::Entities::Entity::Buildable
# Build any kind of entity from a string.
def anything_from_string(string)
case self.mn.downcase.intern
when :document
folder = Treat.paths.files
if folder[-1] == '/'
folder = folder[0..-2]
end
now = Time.now.to_f
doc_file = folder+ "/#{now}.txt"
string.force_encoding('UTF-8')
File.open(doc_file, 'w') do |f|
f.puts string
end
from_raw_file(doc_file)
when :collection
case cl(self).downcase.intern
when :document, :collection
raise Treat::Exception,
"Cannot create a " +
"Cannot create a document or " +
"collection from a string " +
"(need a readable file/folder)."
when :phrase
group_from_string(string)
sentence_or_phrase_from_string(string)
when :token
token_from_string(string)
when :zone
@ -325,7 +258,7 @@ module Treat::Entities::Entity::Buildable
if string.gsub(/[\.\!\?]+/,
'.').count('.') <= 1 &&
string.count("\n") == 0
group_from_string(string)
sentence_or_phrase_from_string(string)
else
zone_from_string(string)
end
@ -336,14 +269,15 @@ module Treat::Entities::Entity::Buildable
end
# This should be improved on.
def check_encoding(string)
string.encode("UTF-8", undef: :replace) # Fix
end
# Build a phrase from a string.
def group_from_string(string)
def sentence_or_phrase_from_string(string)
check_encoding(string)
if !(string =~ /[a-zA-Z]+/)
Treat::Entities::Fragment.new(string)
elsif string.count('.!?') >= 1
@ -351,6 +285,7 @@ module Treat::Entities::Entity::Buildable
else
Treat::Entities::Phrase.new(string)
end
end
# Build the right type of token
@ -396,7 +331,7 @@ module Treat::Entities::Entity::Buildable
end
end
def create_collection(fv)
FileUtils.mkdir(fv)
Treat::Entities::Collection.new(fv)

View File

@ -1,8 +1,8 @@
# Implement support for the functions #do and #do_task.
module Treat::Entities::Entity::Applicable
module Treat::Entities::Abilities::Chainable
# Perform the supplied tasks on the entity.
def apply(*tasks)
def do(*tasks)
tasks.each do |task|
if task.is_a?(Hash)
@ -25,8 +25,8 @@ module Treat::Entities::Entity::Applicable
end
self
end
alias :do :apply
alias :chain :do
# Perform an individual task on an entity
# given a worker and options to pass to it.
@ -35,7 +35,7 @@ module Treat::Entities::Entity::Applicable
entity_types = group.targets
f = nil
entity_types.each do |t|
f = true if is_a?(Treat::Entities.const_get(t.cc))
f = true if is_a?(Treat::Entities.const_get(cc(t)))
end
if f || entity_types.include?(:entity)
send(task, worker, options)
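
In use, the module above lets several workers be applied to an entity in one call; a sketch, assuming the treat gem, its models and the usual English task names (:segment, :tokenize, :tag) are available.

# Sketch: chain a few processing tasks on a paragraph.
para = Treat::Entities::Paragraph.build(
  'It was a bright cold day in April. The clocks were striking thirteen.')
para.do(:segment, :tokenize, :tag)   # aliased as #chain in this changeset
puts para.sentence_count             # magic method over the children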

View File

@ -1,7 +1,7 @@
# This module implements methods that are used
# by workers to determine if an entity is properly
# formatted before working on it.
module Treat::Entities::Entity::Checkable
module Treat::Entities::Abilities::Checkable
# Check if the entity has the given feature,
# and if so return it. If not, calculate the
@ -15,7 +15,7 @@ module Treat::Entities::Entity::Checkable
g2 = Treat::Workers.lookup(feature)
raise Treat::Exception,
"#{g1.type.to_s.capitalize} " +
"#{g1.type.to_s.capitalize} #{task} " +
"requires #{g2.type} #{g2.method}."
end

View File

@ -1,21 +1,21 @@
# Allow comparison of entity hierarchy in DOM.
module Treat::Entities::Entity::Comparable
module Treat::Entities::Abilities::Comparable
# Determines whether the receiving class
# is smaller, equal or greater in the DOM
# hierarchy compared to the supplied one.
def compare_with(klass)
i = 0; rank_a = nil; rank_b = nil
Treat.core.entities.order.each do |type|
klass2 = Treat::Entities.const_get(type.cc)
klass2 = Treat::Entities.const_get(cc(type))
rank_a = i if self <= klass2
rank_b = i if klass <= klass2
next if rank_a && rank_b
i += 1
end
return -1 if rank_a < rank_b
return 0 if rank_a == rank_b
return 1 if rank_a > rank_b
end
end

View File

@ -0,0 +1,47 @@
module Treat::Entities::Abilities::Copyable
require 'fileutils'
# What happens when it is a database-stored
# collection or document ?
def copy_into(collection)
unless collection.is_a?(
Treat::Entities::Collection)
raise Treat::Exception,
"Cannot copy an entity into " +
"something else than a collection."
end
if type == :document
copy_document_into(collection)
elsif type == :collection
copy_collection_into(collection)
else
raise Treat::Exception,
"Can only copy a document " +
"or collection into a collection."
end
end
def copy_collection_into(collection)
copy = dup
f = File.dirname(folder)
f = f.split(File::SEPARATOR)[-1]
f = File.join(collection.folder, f)
FileUtils.mkdir(f) unless
FileTest.directory(f)
FileUtils.cp_r(folder, f)
copy.set :folder, f
copy
end
def copy_document_into(collection)
copy = dup
return copy unless file
f = File.basename(file)
f = File.join(collection.folder, f)
FileUtils.cp(file, f)
copy.set :file, f
copy
end
end

View File

@ -1,4 +1,4 @@
module Treat::Entities::Entity::Countable
module Treat::Entities::Abilities::Countable
# Find the position of the current entity
# inside the parent entity, starting at 1.
@ -41,7 +41,6 @@ module Treat::Entities::Entity::Countable
# Returns the frequency of the given value
# in the this entity.
def frequency_of(value)
value = value.downcase
if is_a?(Treat::Entities::Token)
raise Treat::Exception,
"Cannot get the frequency " +

View File

@ -0,0 +1,83 @@
# When Treat.debug is set to true, each call to
# #call_worker will result in a debug message being
# printed by the #print_debug function.
module Treat::Entities::Abilities::Debuggable
@@prev = nil
@@i = 0
# Explains what Treat is currently doing.
def print_debug(entity, task, worker, group, options)
targs = group.targets.map do |target|
target.to_s
end
if targs.size == 1
t = targs[0]
else
t = targs[0..-2].join(', ') +
' and/or ' + targs[-1]
end
genitive = targs.size > 1 ?
'their' : 'its'
doing = ''
human_task = task.to_s.gsub('_', ' ')
if group.type == :transformer ||
group.type == :computer
tt = human_task
tt = tt[0..-2] if tt[-1] == 'e'
ed = tt[-1] == 'd' ? '' : 'ed'
doing = "#{tt.capitalize}#{ed} #{t}"
elsif group.type == :annotator
if group.preset_option
opt = options[group.preset_option]
form = opt.to_s.gsub('_', ' ')
human_task[-1] = ''
human_task = form + ' ' + human_task
end
doing = "Annotated #{t} with " +
"#{genitive} #{human_task}"
end
if group.to_s.index('Formatters')
curr = doing +
' in format ' +
worker.to_s
else
curr = doing +
' using ' +
worker.to_s.gsub('_', ' ')
end
curr.gsub!('ss', 's') unless curr.index('class')
curr += '.'
if curr == @@prev
@@i += 1
else
if @@i > 1
Treat.core.entities.list.each do |e|
@@prev.gsub!(e.to_s, e.to_s + 's')
end
@@prev.gsub!('its', 'their')
@@prev = @@prev.split(' ').
insert(1, @@i.to_s).join(' ')
end
@@i = 0
puts @@prev # Last call doesn't get shown.
end
@@prev = curr
end
end

View File

@ -1,7 +1,7 @@
# Makes a class delegatable, allowing calls
# on it to be forwarded to a worker class
# able to perform the appropriate task.
module Treat::Entities::Entity::Delegatable
module Treat::Entities::Abilities::Delegatable
# Add preset methods to an entity class.
def add_presets(group)
@ -10,25 +10,27 @@ module Treat::Entities::Entity::Delegatable
return unless opt
self.class_eval do
group.presets.each do |preset|
define_method(preset) do |worker=nil, options={}|
return get(preset) if has?(preset)
options = {opt => preset}.merge(options)
m = group.method
send(m, worker, options)
f = unset(m)
features[preset] = f if f
end
group.presets.each do |preset|
define_method(preset) do |worker=nil, options={}|
return get(preset) if has?(preset)
options = {opt => preset}.merge(options)
m = group.method
send(m, worker, options)
f = unset(m)
features[preset] = f if f
end
end
end
end
# Add the workers to perform a task on an entity class.
def add_workers(group)
self.class_eval do
task = group.method
add_presets(group)
define_method(task) do |worker=nil, options={}|
if worker.is_a?(Hash)
options, worker =
@ -62,7 +64,7 @@ module Treat::Entities::Entity::Delegatable
worker_not_found(worker, group)
end
worker = group.const_get(worker.to_s.cc.intern)
worker = group.const_get(cc(worker.to_s).intern)
result = worker.send(group.method, entity, options)
if group.type == :annotator && result
@ -88,32 +90,40 @@ module Treat::Entities::Entity::Delegatable
# Get the default worker for that language
# inside the given group.
def find_worker_for_language(language, group)
lang = Treat.languages[language]
cat = group.to_s.split('::')[2].downcase.intern
group = group.mn.ucc.intern
group = ucc(cl(group)).intern
if lang.nil?
raise Treat::Exception,
"No configuration file loaded for language #{language}."
end
workers = lang.workers
if !workers.respond_to?(cat) ||
!workers[cat].respond_to?(group)
workers = Treat.languages.agnostic.workers
end
if !workers.respond_to?(cat) ||
!workers[cat].respond_to?(group)
raise Treat::Exception,
"No #{group} is/are available for the " +
"#{language.to_s.capitalize} language."
end
workers[cat][group].first
end
# Return an error message and suggest possible typos.
def worker_not_found(worker, group)
"Worker with name '#{worker}' couldn't be "+
"found in group #{group}." + Treat::Helpers::Help.
did_you_mean?(group.list.map { |c| c.ucc }, worker)
def worker_not_found(klass, group)
"Algorithm '#{ucc(cl(klass))}' couldn't be "+
"found in group #{group}." + did_you_mean?(
group.list.map { |c| ucc(c) }, ucc(klass))
end
end

View File

@ -1,7 +1,7 @@
module Treat::Entities::Entity::Exportable
module Treat::Entities::Abilities::Exportable
def export(problem)
ds = Treat::Learning::DataSet.new(problem)
ds = Treat::Core::DataSet.new(problem)
each_entity(problem.question.target) do |e|
ds << e
end

View File

@ -1,4 +1,4 @@
module Treat::Entities::Entity::Iterable
module Treat::Entities::Abilities::Iterable
# Yields each entity of any of the supplied
# types in the children tree of this Entity.
@ -6,12 +6,12 @@ module Treat::Entities::Entity::Iterable
# #each. It does not yield the top element being
# recursed.
#
# This function NEEDS to be ported to C. #FIXME
# This function NEEDS to be ported to C.
def each_entity(*types)
types = [:entity] if types.size == 0
f = false
types.each do |t2|
if is_a?(Treat::Entities.const_get(t2.cc))
if is_a?(Treat::Entities.const_get(cc(t2)))
f = true; break
end
end
@ -57,7 +57,7 @@ module Treat::Entities::Entity::Iterable
def ancestor_with_type(type)
return unless has_parent?
ancestor = @parent
type_klass = Treat::Entities.const_get(type.cc)
type_klass = Treat::Entities.const_get(cc(type))
while not ancestor.is_a?(type_klass)
return nil unless (ancestor && ancestor.has_parent?)
ancestor = ancestor.parent
@ -94,17 +94,25 @@ module Treat::Entities::Entity::Iterable
end
# Number of children that have a given feature.
# Second variable to allow for passing value to check for.
def num_children_with_feature(feature, value = nil, recursive = true)
def num_children_with_feature(feature)
i = 0
m = method(recursive ? :each_entity : :each)
m.call do |c|
next unless c.has?(feature)
i += (value == nil ? 1 :
(c.get(feature) == value ? 1 : 0))
each do |c|
i += 1 if c.has?(feature)
end
i
end
# Return the first element in the array, warning if it is
# not the only one in the array. Used for magic methods: e.g.,
# if the magic method "word" is called on a sentence with many
# words, Treat will return the first word, but warn the user.
def first_but_warn(array, type)
if array.size > 1
warn "Warning: requested one #{type}, but" +
" there are many #{type}s in this entity."
end
array[0]
end
end
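
A sketch of the iteration helpers above on a small sentence; it assumes the treat gem plus tokenizer and tagger workers for English.

# Sketch: walk the entity tree built from one sentence.
s = Treat::Entities::Sentence.build('The clocks were striking thirteen.')
s.do(:tokenize, :tag)

s.each_entity(:word) { |w| print w.to_s, ' ' }   # yields each word
puts s.num_children_with_feature(:tag)           # children carrying a :tag
w = s.word                                       # first word (magic method)
puts w.ancestor_with_type(:sentence) == s        # => true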

View File

@ -1,20 +1,27 @@
module Treat::Entities::Entity::Magical
module Treat::Entities::Abilities::Magical
# Parse "magic methods", which allow the following
# syntaxes to be used (where 'word' can be replaced
# by any entity type, e.g. token, zone, etc.):
#
# - each_word : iterate over each children of type word.
# - words: return an array of children words.
# - each_word : iterate over each entity of type word.
# - words: return an array of words in the entity.
# - word: return the first word in the entity.
# - word_count: return the number of words in the entity.
# - words_with_*(value) (where * is an arbitrary feature):
# return the words that have the given feature set to value.
# - words_with_*(value) (where * is an arbitrary feature):
# return the words that have the given feature.
# - word_with_*(value) : return the first word with
# the feature specified by * in value.
#
# Also provides magical methods for types of words:
#
# - each_noun: iterate over each noun in the entity.
# - nouns: return an array of nouns in the entity.
# - noun: return the first noun in the entity.
# - noun_count: return the number of nouns in the entity.
# - nouns_with_*(value): return the nouns that have the given feature.
# - noun_with_*(value): return the first noun with the given feature.
#
# Also provides magical methods for types of words (each_noun,
# nouns, noun_count, nouns_with_*(value) noun_with_*(value), etc.)
# For this to be used, the words in the text must have been
# tokenized and categorized in the first place.
def magic(sym, *args)
# Cache this for performance.
@ -73,21 +80,9 @@ module Treat::Entities::Entity::Magical
elsif method =~ /^frequency_in_#{@@entities_regexp}$/
frequency_in($1.intern)
else
return :no_magic # :-(
return :no_magic
end
end
# Return the first element in the array, warning if it is
# not the only one in the array. Used for magic methods: e.g.,
# if the magic method "word" is called on a sentence with many
# words, Treat will return the first word, but warn the user.
def first_but_warn(array, type)
if array.size > 1
warn "Warning: requested one #{type}, but" +
" there are many #{type}s in this entity."
end
array[0]
end
end
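
Concretely, the parsing above enables calls like the following. This is a sketch: it assumes tokenizer, tagger and categorizer workers are available for English, that :category is the task name exposed by the categorizers group, and that category feature values are lowercase strings.

# Sketch of a few magic methods on a processed sentence.
s = Treat::Entities::Sentence.build(
  'The quick brown fox jumps over the lazy dog.')
s.do(:tokenize, :tag, :category)

puts s.word_count                            # => 9
s.each_word { |w| print w.to_s, ' ' }        # iterate over words
puts s.noun_count                            # words categorized as nouns
p s.words_with_category('noun').map(&:to_s)  # e.g. ["fox", "dog"]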

View File

@ -0,0 +1,46 @@
# Registers occurrences of textual values inside
# all child entities. Useful for calculating frequencies.
module Treat::Entities::Abilities::Registrable
# Registers a token in the @registry hash.
def register(entity)
unless @registry
@count = 0
@registry = {
:value => {},
:position => {},
:type => {},
:id => {}
}
end
if entity.is_a?(Treat::Entities::Token) ||
entity.is_a?(Treat::Entities::Phrase)
val = entity.to_s.downcase
@registry[:value][val] ||= 0
@registry[:value][val] += 1
end
@registry[:id][entity.id] = true
@registry[:type][entity.type] ||= 0
@registry[:type][entity.type] += 1
@registry[:position][entity.id] = @count
@count += 1
@parent.register(entity) if has_parent?
end
# Backtrack up the tree to find a token registry,
# by default the one in the root node of any entity.
def registry(type = nil)
if has_parent? &&
type != self.type
@parent.registry(type)
else
@registry
end
end
end

View File

@ -1,22 +1,18 @@
# Gives entities the ability to be converted
# to string representations (#to_string, #to_s,
# #to_str, #inspect, #print_tree).
module Treat::Entities::Entity::Stringable
# Returns the entity's true string value.
def to_string; @value.dup; end
# Returns an array of the childrens' string
# values, found by calling #to_s on them.
def to_a; @children.map { |c| c.to_s }; end
alias :to_ary :to_a
module Treat::Entities::Abilities::Stringable
# Return the entity's true string value in
# plain text format. Non-terminal entities
# will normally have an empty value.
def to_string; @value; end
# Returns the entity's string value by
# imploding the value of all terminal
# entities in the subtree of that entity.
def to_s
has_children? ? implode.strip : @value.dup
@value != '' ? @value : implode.strip
end
# #to_str is the same as #to_s.
@ -28,10 +24,12 @@ module Treat::Entities::Entity::Stringable
def short_value(max_length = 30)
s = to_s
words = s.split(' ')
return s if (s.length < max_length) ||
!(words[0..2] && words[-2..-1])
words[0..2].join(' ') + ' [...] ' +
words[-2..-1].join(' ')
if s.length < max_length
s
else
words[0..2].join(' ') + ' [...] ' +
words[-2..-1].join(' ')
end
end
# Print out an ASCII representation of the tree.
@ -40,32 +38,33 @@ module Treat::Entities::Entity::Stringable
# Return an informative string representation
# of the entity.
def inspect
name = self.class.mn
s = "#{name} (#{@id.to_s})"
s = "#{cl(self.class)} (#{@id.to_s})"
if caller_method(2) == :inspect
@id.to_s
else
edges = []
@edges.each do |edge|
edges <<
"#{edge.target}#{edge.type}"
dependencies = []
@dependencies.each do |dependency|
dependencies <<
"#{dependency.target}#{dependency.type}"
end
s += " --- #{short_value.inspect}" +
" --- #{@features.inspect} " +
" --- #{edges.inspect} "
" --- #{dependencies.inspect} "
end
s
end
# Helper method to implode the string value of the subtree.
def implode(value = "")
def implode
return @value.dup if !has_children?
value = ''
each do |child|
if child.is_a?(Treat::Entities::Section)
value << "\n\n"
value += "\n\n"
end
if child.is_a?(Treat::Entities::Token) || child.value != ''
@ -73,14 +72,14 @@ module Treat::Entities::Entity::Stringable
child.is_a?(Treat::Entities::Enclitic)
value.strip!
end
value << child.to_s + ' '
value += child.to_s + ' '
else
child.implode(value)
value += child.implode
end
if child.is_a?(Treat::Entities::Title) ||
child.is_a?(Treat::Entities::Paragraph)
value << "\n\n"
value += "\n\n"
end
end

View File

@ -0,0 +1,36 @@
module Treat::Entities
# Represents a collection of texts.
class Collection < Entity
# Initialize the collection with a folder
# containing the texts of the collection.
def initialize(folder = nil, id = nil)
super('', id)
if folder && !FileTest.directory?(folder)
FileUtils.mkdir(folder)
end
set :folder, folder if folder
i = folder + '/.index'
set :index, i if FileTest.directory?(i)
end
# Works like the default <<, but if the
# file being added is a collection or a
# document, then copy that collection or
# document into this collection's folder.
def <<(entities, copy = true)
unless entities.is_a? Array
entities = [entities]
end
entities.each do |entity|
if [:document, :collection].
include?(entity.type) && copy &&
@features[:folder] != nil
entity = entity.copy_into(self)
end
end
super(entities)
end
end
end
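
A brief sketch of the copy-on-add behaviour described above; the folder and file names are placeholders, and the document file is assumed to exist locally.

# Sketch: documents added to a folder-backed collection get copied into it.
c = Treat::Entities::Collection.build('corpus')    # creates ./corpus if missing
d = Treat::Entities::Document.build('sample.txt')  # an existing local file
c << d            # copies sample.txt into ./corpus, then adds the document
# c.<<(other, false) would add an entity without the copy step
puts c.folder     # the collection's backing folder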

10
lib/treat/entities/document.rb Executable file
View File

@ -0,0 +1,10 @@
module Treat::Entities
# Represents a document.
class Document < Entity
# Initialize a document with a file name.
def initialize(file = nil, id = nil)
super('', id)
set :file, file
end
end
end

View File

@ -1,101 +0,0 @@
module Treat::Entities
# * Collection and document classes * #
# Represents a collection.
class Collection < Entity; end
# Represents a document.
class Document < Entity; end
# * Sections and related classes * #
# Represents a section.
class Section < Entity; end
# Represents a page of text.
class Page < Section; end
# Represents a block of text
class Block < Section; end
# Represents a list.
class List < Section; end
# * Zones and related classes * #
# Represents a zone of text.
class Zone < Entity; end
# Represents a title, subtitle,
# logical header of a text.
class Title < Zone; end
# Represents a paragraph (group
# of sentences and/or phrases).
class Paragraph < Zone; end
# * Groups and related classes * #
# Represents a group of tokens.
class Group < Entity; end
# Represents a group of words
# with a sentence ender (.!?)
class Sentence < Group; end
# Represents a group of words,
# with no sentence ender.
class Phrase < Group; end
# Represents a non-linguistic
# fragment (e.g. stray symbols).
class Fragment < Group; end
# * Tokens and related classes* #
# Represents a terminal element
# (leaf) in the text structure.
class Token < Entity; end
# Represents a word. Strictly,
# this is /^[[:alpha:]\-']+$/.
class Word < Token; end
# Represents an enclitic.
# Strictly, this is any of
# 'll 'm 're 's 't or 've.
class Enclitic < Token; end
# Represents a number. Strictly,
# this is /^#?([0-9]+)(\.[0-9]+)?$/.
class Number < Token
def to_i; to_s.to_i; end
def to_f; to_s.to_f; end
end
# Represents a punctuation sign.
# Strictly, this is /^[[:punct:]\$]+$/.
class Punctuation < Token; end
# Represents a character that is neither
# a word, an enclitic, a number or a
# punctuation character (e.g. @#$%&*).
class Symbol < Token; end
# Represents a url. This is (imperfectly)
# defined as /^(http|https):\/\/[a-z0-9]
# +([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}
# (([0-9]{1,5})?\/.*)?$/ix
class Url < Token; end
# Represents a valid RFC822 address.
# This is (imperfectly) defined as
# /.+\@.+\..+/ (fixme maybe?)
class Email < Token; end
# Represents a token whose type
# cannot be identified.
class Unknown; end
end

View File

@ -1,106 +1,94 @@
module Treat::Entities
# Basic tree structure.
module Abilities; end
# Require abilities.
p = Treat.paths.lib +
'treat/entities/abilities/*.rb'
Dir.glob(p).each { |f| require f }
require 'birch'
# The Entity class extends a basic tree structure
# (written in C for optimal speed) and represents
# any form of textual entity in a processing task
# (this could be a collection of documents, a
# single document, a single paragraph, etc.)
#
# Classes that extend Entity provide the concrete
# behavior corresponding to the relevant entity type.
# See entities.rb for a full list and description of
# the different entity types in the document model.
class Entity < ::Birch::Tree
# A symbol representing the lowercase
# version of the class name. This is
# the only attribute that the Entity
# class adds to the Birch::Tree class.
class Entity < Treat::Core::Node
# A Symbol representing the lowercase
# version of the class name.
attr_accessor :type
# Autoload all the classes in /abilities.
path = File.expand_path(__FILE__)
patt = File.dirname(path) + '/entity/*.rb'
Dir.glob(patt).each { |f| require f }
# Implements support for #register, #registry.
include Registrable
# Implement support for #self.call_worker, etc.
extend Delegatable
# Implements support for #register,
# #registry, and #contains_* methods.
include Abilities::Registrable
# Implement support for #self.print_debug, etc.
extend Debuggable
# Implement support for #self.add_workers
extend Abilities::Delegatable
# Implement support for #self.build and #self.from_*
extend Buildable
# Implement support for #self.print_debug and
# #self.invalid_call_msg
extend Abilities::Debuggable
# Implement support for #apply (previously #do).
include Applicable
# Implement support for #self.build
# and #self.from_*
extend Abilities::Buildable
# Implement support for #frequency, #frequency_in,
# #frequency_of, #position, #position_from_end, etc.
include Countable
# Implement support for #do.
include Abilities::Chainable
# Implement support for over 100 #magic methods!
include Magical
# Implement support for #frequency,
# #frequency_in_parent and #position_in_parent.
include Abilities::Countable
# Implement support for #magic.
include Abilities::Magical
# Implement support for #to_s, #inspect, etc.
include Stringable
include Abilities::Stringable
# Implement support for #check_has and others.
include Checkable
# Implement support for #check_has
# and #check_hasnt_children?
include Abilities::Checkable
# Implement support for #each_entity, as well as
# #entities_with_type, #ancestors_with_type,
# #entities_with_feature, #entities_with_category, etc.
include Iterable
# #entities_with_feature, #entities_with_category.
include Abilities::Iterable
# Implement support for #export, allowing to export
# a data set row from the receiving entity.
include Exportable
# Implement support for #export to export
# a line of a data set based on a classification.
include Abilities::Exportable
# Implement support for #copy_into.
include Abilities::Copyable
# Implement support for #self.compare_with
extend Comparable
extend Abilities::Comparable
# Initialize the entity with its value and
# (optionally) a unique identifier. By default,
# the object_id will be used as id.
def initialize(value = '', id = nil)
id ||= object_id; super(value, id)
id ||= object_id
super(value, id)
@type = :entity if self == Entity
@type ||= self.class.mn.ucc.intern
@type ||= ucc(cl(self.class)).intern
end
# Add an entity to the current entity.
# Registers the entity in the root node
# token registry if the entity is a leaf.
# Unsets the parent node's value; in order
# to keep the tree clean, only the leaf
# values are stored.
#
# Takes in a single entity or an array of
# entities. Returns the first child supplied.
# If a string or a number is supplied, it
# is converted to an entity beforehand.
#
# @see Treat::Registrable
def <<(entities, clear_parent = true)
entities = (entities.is_a?(::String) ||
entities.is_a?(::Numeric)) ?
entities.to_entity : entities
entities = entities.is_a?(::Array) ?
entities : [entities]
# Register each entity in this node.
entities.each { |e| register(e) }
# Pass to the <<() method in Birch.
unless entities.is_a? Array
entities = [entities]
end
entities.each do |entity|
register(entity)
end
super(entities)
# Unset the parent value if necessary.
@parent.value = '' if has_parent?
# Return the first child.
return entities[0]
entities[0]
end
# Catch missing methods to support method-like
# access to features (e.g. entity.category
# instead of entity.features[:category]) and to
@ -110,30 +98,33 @@ module Treat::Entities
# or can't be parsed, raises an exception.
#
# Also catches the "empty" method call (e.g.
# Word('hello') or Word 'hello') as syntactic
# word('hello') or word 'hello') as syntactic
# sugar for the #self.build method.
def method_missing(sym, *args, &block)
return self.build(*args) if sym == nil
return @features[sym] if @features.has_key?(sym)
result = magic(sym, *args, &block)
return result unless result == :no_magic
begin; super(sym, *args, &block)
rescue NoMethodError; invalid_call(sym); end
if !@features.has_key?(sym)
r = magic(sym, *args, &block)
return r unless r == :no_magic
begin
super(sym, *args, &block)
rescue NoMethodError
raise Treat::Exception,
if Treat::Workers.lookup(sym)
msg = "Method #{sym} cannot " +
"be called on a #{type}."
else
msg = "Method #{sym} does not exist."
msg += did_you_mean?(
Treat::Workers.methods, sym)
end
end
else
@features[sym]
end
end
# Raises a Treat::Exception saying that the
# method called was invalid, and that the
# requested method does not exist. Also
# provides suggestions for misspellings.
def invalid_call(sym)
msg = Treat::Workers.lookup(sym) ?
"Method #{sym} can't be called on a #{type}." :
"Method #{sym} is not defined by Treat." +
Treat::Helpers::Help.did_you_mean?(
Treat::Workers.methods, sym)
raise Treat::Exception, msg
end
end
end
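
A short usage sketch of the behaviour documented in this class; it assumes the treat gem is installed and loaded, and the :category feature and its value are purely illustrative.

require 'treat'

phrase = Treat::Entities::Phrase.new
word   = Treat::Entities::Word.new('hello')

# <<() registers the child in the registry and
# returns the first child that was added.
phrase << word

# Features become readable as methods through method_missing,
# so word.category is shorthand for word.features[:category].
word.features[:category] = :noun
word.category   # => :noun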

View File

@ -1,86 +0,0 @@
# When Treat.debug is set to true, each call to
# #call_worker will result in a debug message being
# printed by the #print_debug function.
module Treat::Entities::Entity::Debuggable
# Previous state and counter.
@@prev, @@i = nil, 0
# Explains what Treat is currently doing.
# Fixme: last call will never get shown.
def print_debug(entity, task, worker, group, options)
# Get a list of the worker's targets.
targets = group.targets.map(&:to_s)
# List the worker's targets as either
# a single target or an and/or form
# (since it would be too costly to
# actually determine what target types
# were processed at runtime for each call).
t = targets.size == 1 ? targets[0] : targets[
0..-2].join(', ') + ' and/or ' + targets[-1]
# Add genitive for annotations (sing./plural)
genitive = targets.size > 1 ? 'their' : 'its'
# Set up an empty string and humanize task name.
doing, human_task = '', task.to_s.gsub('_', ' ')
# Base is "{task}-ed {a(n)|N} {target(s)}"
if [:transformer, :computer].include?(group.type)
tt = human_task
tt = tt[0..-2] if tt[-1] == 'e'
ed = tt[-1] == 'd' ? '' : 'ed'
doing = "#{tt.capitalize}#{ed} #{t}"
# Base is "Annotated {a(n)|N} {target(s)}"
elsif group.type == :annotator
if group.preset_option
opt = options[group.preset_option]
form = opt.to_s.gsub('_', ' ')
human_task[-1] = ''
human_task = form + ' ' + human_task
end
doing = "Annotated #{t} with " +
"#{genitive} #{human_task}"
end
# Form is '{base} in format {worker}'.
if group.to_s.index('Formatters')
curr = doing + ' in format ' + worker.to_s
# Form is '{base} using {worker}'.
else
curr = doing + ' using ' + worker.to_s.gsub('_', ' ')
end
# Remove any double pluralization that may happen.
curr.gsub!('ss', 's') unless curr.index('class')
# Accumulate repeated tasks.
@@i += 1 if curr == @@prev
# Change tasks, so output.
if curr != @@prev && @@prev
# Pluralize entity names if necessary.
if @@i > 1
Treat.core.entities.list.each do |e|
@@prev.gsub!(e.to_s, e.to_s + 's')
end
@@prev.gsub!('its', 'their')
@@prev = @@prev.split(' ').
insert(1, @@i.to_s).join(' ')
# Add determiner if singular.
else
@@prev = @@prev.split(' ').
insert(1, 'a').join(' ')
end
# Reset counter.
@@i = 0
# Write to stdout.
puts @@prev + '.'
end
@@prev = curr
end
end
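
As a standalone illustration of the message shape described in the comments above (this is not Treat code; the task, target and worker names are invented, and the accumulation and pluralization steps are left out):

def debug_message(task, targets, worker)
  # Single target, or "t1, t2 and/or t3" for several targets.
  if targets.size == 1
    t = targets[0]
  else
    t = targets[0..-2].join(', ') + ' and/or ' + targets[-1]
  end
  genitive = targets.size > 1 ? 'their' : 'its'
  "Annotated #{t} with #{genitive} " +
  "#{task.to_s.gsub('_', ' ')} using #{worker.to_s.gsub('_', ' ')}"
end

debug_message(:tag, ['word'], :lingua)
# => "Annotated word with its tag using lingua"
debug_message(:name_tag, ['phrase', 'sentence'], :stanford)
# => "Annotated phrase and/or sentence with their name tag using stanford"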

View File

@ -1,36 +0,0 @@
# Registers the entities occurring in the subtree of
# a node as children are added. Also registers text
# occurrences for word groups and tokens (n-grams).
module Treat::Entities::Entity::Registrable
# Registers a token or phrase in the registry.
# The registry keeps track of children by id,
# by entity type, and also keeps the position
# of the entity in its parent entity.
def register(entity)
unless @registry
@count, @registry = 0,
{id: {}, value: {}, position:{}, type: {}}
end
if entity.is_a?(Treat::Entities::Token) ||
entity.is_a?(Treat::Entities::Group)
val = entity.to_s.downcase
@registry[:value][val] ||= 0
@registry[:value][val] += 1
end
@registry[:id][entity.id] = true
@registry[:type][entity.type] ||= 0
@registry[:type][entity.type] += 1
@registry[:position][entity.id] = @count
@count += 1
@parent.register(entity) if has_parent?
end
# Backtrack up the tree to find a token registry,
# by default the one in the root node of the tree.
def registry(type = nil)
(has_parent? && type != self.type) ?
@parent.registry(type) : @registry
end
end
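
A hedged sketch of what the registry above ends up tracking once a few children are added; it assumes a Treat version where this module is still present (the file is removed in this changeset) and uses invented word values.

require 'treat'

phrase = Treat::Entities::Phrase.new
phrase << Treat::Entities::Word.new('Hello')
phrase << Treat::Entities::Word.new('hello')
phrase << Treat::Entities::Word.new('world')

reg = phrase.registry
reg[:value]['hello']   # => 2 (token values are counted case-insensitively)
reg[:type][:word]      # => 3 (one counter per entity type)
reg[:position].size    # => 3 (insertion position, keyed by entity id)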

18
lib/treat/entities/group.rb Executable file
View File

@ -0,0 +1,18 @@
module Treat::Entities
# Represents a group of tokens.
class Group < Entity; end
# Represents a group of words
# with a sentence ender (.!?)
class Sentence < Group; end
# Represents a group of words,
# with no sentence ender.
class Phrase < Group; end
# Represents a non-linguistic
# fragment (e.g. stray symbols).
class Fragment < Group; end
end
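
Illustrative only, assuming the treat gem is loaded: how these group types relate to Entity and its #type attribute.

require 'treat'

s = Treat::Entities::Sentence.new
s.type                                                # => :sentence
Treat::Entities::Sentence < Treat::Entities::Group    # => true
Treat::Entities::Fragment < Treat::Entities::Group    # => true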

View File

@ -0,0 +1,13 @@
module Treat::Entities
# Represents a section.
class Section < Entity; end
# Represents a page of text.
class Page < Section; end
# Represents a block of text
class Block < Section; end
# Represents a list.
class List < Section; end
end

Some files were not shown because too many files have changed in this diff.