Compare commits

...

254 Commits

Author SHA1 Message Date
Louis Mullie 8c6450a0d7 Merge pull request #120 from indentlabs/master
Use S3's static-public-assets bucket instead of louismullie.com
2017-05-05 18:31:43 -04:00
Louis Mullie d6a163f8f7 Make nokogiri a required dependency, #121 2017-04-22 16:15:35 -04:00
Andrew Brown 58911ce969 Add \n after post-install output 2016-05-24 14:04:48 -05:00
Andrew Brown 7b6e8fe6a8 Unnecessary whitespace removal 2016-05-24 13:20:17 -05:00
Andrew Brown f1f8c010a6 Point treat assets to mirror on s3 2016-05-24 13:19:51 -05:00
Louis Mullie a86c3036cc Merge pull request #110 from agarie/update-travis
Update .travis.yml for Ruby 2+
2015-12-07 21:17:40 -05:00
Louis Mullie e6c1f7c3a4 Merge pull request #114 from ojak/103_parsing_issues
Fixed typo in phrase tag that causes Adverbial Phrases to parse improperly
2015-12-07 21:17:12 -05:00
Oliver Jakubiec 4dfaf83a14 Fixed typo in phrase tag that causes Adverbial Phrases to parse improperly 2015-12-07 17:36:11 -08:00
Carlos Agarie eed75d3ba7 Update .travis.yml for Ruby 2+ 2015-05-28 18:42:37 -03:00
Louis Mullie 3dc184277d Merge pull request #92 from jbranchaud/update-post-install-note
Put quotes around gem name in post-install message.
2015-04-10 15:21:51 -04:00
Louis Mullie e218791c88 Merge pull request #106 from trosborn/master
abbreviates adjective and adverb to work properly with rwordnet
2015-04-07 18:12:35 -04:00
Thomas Osborn 2d6891ab9c abbreviates adjective and adverb to work properly with rwordnet 2015-04-07 15:00:38 -07:00
Louis Mullie f61adc8187 Merge pull request #105 from trosborn/master
fixes errors caused by wordnet 1.0.0
2015-04-06 16:34:48 -04:00
Thomas Osborn cb7ae61cc5 fixes errors caused by wordnet 1.0.0 2015-04-05 21:27:08 -07:00
Louis Mullie 394edb8fc4 Merge pull request #101 from ojak/master
Added enclitic for 'd and fixed a tag typo.
2015-03-21 16:43:02 -04:00
Oliver Jakubiec 721215b64a Added enclitic for 'd and enclitic examples for clarity 2015-03-20 13:12:21 -07:00
Oliver Jakubiec 90fdcd8c9e Fixed typo in comma punctuation tag 2015-02-27 17:08:15 -08:00
Louis Mullie 3992f2c177 Merge pull request #95 from ojak/master
Fixed typo that caused n't to be tagged as a Word rather than an Enclitic
2015-02-26 14:12:17 -05:00
Oliver Jakubiec 159f3b266e Fixed typo that caused n't to be tagged as a Word rather than an Enclitics 2015-02-26 10:56:03 -08:00
jbranchaud b7c6ae452b Put quotes around gem name in post-install message.
The post-install message for this gem instructs the user to run `require
treat` from irb, however, the actual, functioning line of code involves
quotes, namely, `require 'treat'`.
2015-02-14 00:21:39 -06:00
Louis Mullie e753389911 Merge pull request #85 from nicklink483/rspec_file_fix
rspec file fix
2014-08-30 19:48:52 -04:00
Dominic Muller 1b783faf77 fixed line in .rspec file to invoke correct formatter and not cause a TravisCI bug 2014-08-25 23:13:29 -07:00
Louis Mullie febc47a905 Bump version 2014-07-17 14:18:12 -04:00
Louis Mullie 3969478f7f Fix https://github.com/louismullie/bind-it/issues/4 2014-07-17 14:17:29 -04:00
Louis Mullie fc77587390 Bump version for next release 2014-05-28 13:29:36 -04:00
Louis Mullie eb0ec83053 Merge pull request #79 from smileart/master
Fixed file path detection bug in build method polymorphism
2014-05-20 18:10:04 -04:00
Smile @rT f8e5f8392a Fixed file path detection bug in build method polymorphism
(when string was recognised as path cause of  "xml", "yml", etc. content in it)
Refactored file detection duplication in few places
Fixed xml file format detection (f variable assigned but never used)
Deleted some trailing spaces in the refactored file
2014-05-20 14:20:24 +03:00
Louis Mullie c813e72c2c Merge pull request #78 from dalton/master
Revert "Considerably simplify all time annotators."
2014-05-17 22:59:06 -04:00
Adam Dalton 316a6cec04 Revert "Considerably simplify all time annotators."
This reverts commit 01c3e50ef0.
2014-05-16 12:08:54 -04:00
Louis Mullie d45d82b3c4 Merge pull request #75 from adammcarthur/master
update code climate badge in readme
2014-05-07 09:37:39 -04:00
Adam McArthur 78c13c280f update code climate badge in readme 2014-05-07 18:18:57 +10:00
Louis Mullie 690157af8b Merge pull request #70 from chris-at-thewebfellas/master
possible fix for issue #68 and #66 and #63
2014-02-18 19:26:51 -05:00
Chris Anderton a8ce6b2f18 change newlines to try and allow rdoc to run without errors 2014-02-18 23:40:09 +00:00
Rob Anderton f7bd0c35d7 tweak multi-line should expectations to be consistent with other specs 2014-01-28 12:08:44 +00:00
Rob Anderton 65235fb935 fix language detection and tweak tests 2014-01-28 11:56:03 +00:00
Rob Anderton 0ed8a37761 add enclitics test, fix multi-line should expectations 2014-01-28 11:38:34 +00:00
Rob Anderton 8f82086bb1 fix collection test 2014-01-28 10:55:02 +00:00
Chris Anderton d9b912f24d naive attempt to fix the implode method on stringable 2014-01-23 00:57:32 +00:00
Louis Mullie 10e6612a06 Bump version 2013-11-23 16:54:41 -05:00
Louis Mullie 5fb11131c7 Fix require path on Ruby 2.0 2013-11-23 16:54:29 -05:00
Louis Mullie 281934d572 Fix #61 2013-11-23 16:29:17 -05:00
Louis Mullie c9bf91562b Bump 2013-06-29 14:23:25 -04:00
Louis Mullie 643937b231 Cheap fix for #45 2013-06-29 14:22:50 -04:00
Louis Mullie fff57a4526 Fix issues #54 and #57 2013-06-27 17:59:35 -04:00
Louis Mullie 49e99ccd0d Merge pull request #58 from geoffharcourt/fix_rubygems_source_declaration
Fix Rubygems source in Gemfile
2013-06-22 11:45:52 -07:00
Geoff Harcourt 136a7cdbb3 Fix Rubygems source in Gemfile
The use of source :rubygems has been deprecated, and causes the system
to issue a warning when bundle is used with the gem.
2013-06-20 15:59:41 -04:00
Louis Mullie a8448f8103 Merge pull request #55 from wstrinz/master
Update Ocropus calls to work with v 0.7
2013-06-05 13:13:28 -07:00
= 91e3571699 fix spelling for doc 2013-06-05 10:02:38 -05:00
= 6464bb64a0 updated old ocropus commands 2013-06-05 01:10:03 -05:00
= 5c984a1dbb updated old ocropus commands 2013-06-05 00:59:06 -05:00
Louis Mullie 20e3aa1bb7 Update README for v2.0.5 2013-06-02 21:51:59 -04:00
Louis Mullie e483b764e4 Merge branch 'master' of github.com:louismullie/treat
Conflicts:
	Gemfile
2013-06-02 21:48:25 -04:00
Louis Mullie 3a367f7c56 Add list of development gems. 2013-06-02 21:44:52 -04:00
Louis Mullie c99dfac530 Bump version for next release. 2013-06-02 21:44:25 -04:00
Louis Mullie ee98b19960 Fix "humongous" bug (closes #45). 2013-06-02 21:44:18 -04:00
Louis Mullie bf6d951cf7 Fix XML deserialization bug (closes #43). 2013-06-02 21:44:01 -04:00
Louis Mullie 1b28d3640d Comment out development stuff. 2013-06-02 21:43:30 -04:00
Louis Mullie 727a307af0 Add basic support for OpenNLP. 2013-06-02 21:42:56 -04:00
Louis Mullie cf8acd2b64 Merge pull request #51 from Erol/general-document-reader-using-yomu
Add a general document reader class using Yomu
2013-05-01 18:05:12 -07:00
Erol Fornoles 1aa4df9566 Add a general document reader class using Yomu. 2013-05-01 13:03:51 +08:00
Louis Mullie 89cf494cba Merge pull request #47 from smileart/master
Little patch to german.rb to fix issue #39
2013-04-05 04:23:15 -07:00
Serge Bedzhik e2ef8ea748 Little patch to german.rb to fix issue
It fixes issue — https://github.com/louismullie/treat/issues/39 with invalid multibyte char (US-ASCII)
2013-04-04 11:03:39 +03:00
Louis Mullie 038d62b12b Add logo to README. 2013-01-26 14:40:40 -05:00
Louis Mullie 767e1f7f45 Version 2.0.4. 2013-01-23 10:48:33 -05:00
Louis Mullie 4cbcf03d43 Fix parser spec. 2013-01-09 12:53:09 -05:00
Louis Mullie 5e4189218c A few fixes to Stanford parser. 2013-01-06 21:58:57 -05:00
Louis Mullie 482ed30d76 Update spec to reflect that sentences now need to be tokenized before parsing. 2013-01-06 21:58:30 -05:00
Louis Mullie b9ee19e276 Fix spec due to change in gem dependency behaviour. 2013-01-06 21:58:13 -05:00
Louis Mullie fe7f97afd7 Completely refactor parser without using pipeline. 2013-01-06 21:50:45 -05:00
Louis Mullie e135ffa466 Bind Stanford Core NLP on load. 2013-01-06 21:50:27 -05:00
Louis Mullie a755272934 Forgot some devel stuff in there. 2013-01-06 19:31:39 -05:00
Louis Mullie 8b8a769d74 Put all specs under the Treat::Specs module. 2013-01-06 19:26:36 -05:00
Louis Mullie e2b813ca21 Add support for kronic time extractor. 2013-01-06 19:17:51 -05:00
Louis Mullie f72c8aec54 Add support for kronic time extractor. 2013-01-06 19:17:27 -05:00
Louis Mullie 6a025ca4d6 Add support for kronic time extractor. 2013-01-06 19:17:20 -05:00
Louis Mullie 1f69c3f089 Add post-install message. 2013-01-06 19:17:11 -05:00
Louis Mullie 4b74913bcd Cosmetic changes. 2013-01-06 19:16:57 -05:00
Louis Mullie a34714cb99 Fix bug and make sure to load Maxent class. 2013-01-06 19:16:46 -05:00
Louis Mullie b37ecdb695 Make sure that the class is loaded. 2013-01-06 19:14:41 -05:00
Louis Mullie 01c3e50ef0 Considerably simplify all time annotators. 2013-01-06 19:14:20 -05:00
Louis Mullie ba5a2d78ad Update specs for time annotators. 2013-01-06 19:12:00 -05:00
Louis Mullie c6ceac22d6 Bump version number. 2013-01-05 16:31:03 -05:00
Louis Mullie c18d0073f7 Add Levensthein and Jaro-Winkler string distance algorithms. 2013-01-05 16:30:55 -05:00
Louis Mullie 056ef129d6 Use instance_exec pattern for builder. 2013-01-04 08:45:29 -05:00
Louis Mullie 6e644db50a Fix variable scoping. 2013-01-03 22:04:28 -05:00
Louis Mullie d351108ad0 Switch back to ruby-readability now that JRuby support was added. 2013-01-03 22:04:14 -05:00
Louis Mullie cbca3d8d37 Add builder. 2013-01-03 22:03:45 -05:00
Louis Mullie ebf8a16c2a Move topic_words spec to agnostic language, where it belongs. 2013-01-02 21:49:18 -05:00
Louis Mullie 683861dfd1 Refactorings and bug fixes in Stanford tools. 2013-01-02 21:25:40 -05:00
Louis Mullie 2cded62abf Add FIXME note. 2013-01-02 21:06:57 -05:00
Louis Mullie e32f44f0a0 Update Linguistics loader to reflect change in API. 2013-01-02 21:06:42 -05:00
Louis Mullie e87574c03d Add loggability dependency for linguistics. 2013-01-02 21:06:30 -05:00
Louis Mullie 6cf37a2926 Remove unused dependency. 2013-01-02 21:05:57 -05:00
Louis Mullie ff80d8ab2c Coding style. 2013-01-02 21:05:32 -05:00
Louis Mullie fdb5abb748 Suppress warning with new gem API. 2013-01-02 21:05:26 -05:00
Louis Mullie 931a66ec40 Fixes for change in Ruby Linguistics API. 2013-01-02 21:05:06 -05:00
Louis Mullie 644248504d Update gem to reflect new Linguistics API. 2013-01-02 20:37:09 -05:00
Louis Mullie 0731d33024 Update to reflect new API in gem. 2013-01-02 20:34:49 -05:00
Louis Mullie e8b12c5d03 Change development paths in comment. 2013-01-02 20:27:09 -05:00
Louis Mullie cc2b8d6679 Tentative wrapper for OpenNLP MaxEnt tokenizer. 2013-01-02 20:26:08 -05:00
Louis Mullie e284aee7eb Add tentative configuration options. 2013-01-02 20:25:43 -05:00
Louis Mullie 8209447625 Include DSL in agnostic spec. 2013-01-02 20:25:15 -05:00
Louis Mullie 201ac69775 Coding style. 2013-01-02 20:24:57 -05:00
Louis Mullie c42ee58bcd Clarify condition. 2013-01-02 20:24:47 -05:00
Louis Mullie 970a6248d5 Refactor and simplify some methods. 2013-01-02 20:24:28 -05:00
Louis Mullie 7fb10de5cc Refactor stanford loader. 2013-01-02 20:24:16 -05:00
Louis Mullie 577a063165 Fix typo. 2013-01-02 20:23:37 -05:00
Louis Mullie 9fa4ee9c1e Add note on implementation. 2013-01-02 20:23:27 -05:00
Louis Mullie fd0733113a Add 'nt as an enclitic. 2012-12-18 02:52:20 -05:00
Louis Mullie 50ac0f67e0 Fix user-specified option precedence in mongo unserialization. 2012-12-18 02:52:13 -05:00
Louis Mullie c2532d6cd2 Bump version. 2012-12-15 06:02:48 -05:00
Louis Mullie 0abeb6ba14 Fix typo causing syntax error. 2012-12-15 06:02:42 -05:00
Louis Mullie d38387265b Add in FastImage for readability. 2012-12-15 06:02:35 -05:00
Louis Mullie abf16b2b3e Update agnostic specs. 2012-12-15 05:21:17 -05:00
Louis Mullie 20b49cc471 Update the specs. 2012-12-15 05:21:02 -05:00
Louis Mullie 3295529e97 Change ruby-readability to jruby-* for compat. 2012-12-15 05:20:38 -05:00
Louis Mullie 5087b53623 Use jruby-readability, add images support. 2012-12-15 05:20:28 -05:00
Louis Mullie e149c46bde Directly call from class. 2012-12-15 05:20:16 -05:00
Louis Mullie 9836dabe87 Bump version number. 2012-12-13 21:23:15 -05:00
Louis Mullie 02ccf9942f Make nym option strings. 2012-12-13 03:28:22 -05:00
Louis Mullie c2a7a33fae Add in rest of specs. 2012-12-12 09:42:02 -05:00
Louis Mullie e9dd1a9298 Move language spec to agnostic. 2012-12-12 09:41:52 -05:00
Louis Mullie 3484baa904 Fix typo causing syntax error. 2012-12-12 09:41:37 -05:00
Louis Mullie 2b8e269380 Fix spec helper. 2012-12-12 02:54:31 -05:00
Louis Mullie 28669ec6c6 Make specs simpler and sweeter. 2012-12-12 02:50:15 -05:00
Louis Mullie 5daa79c2f0 Reflect change in folder from core to learning. 2012-12-12 02:35:00 -05:00
Louis Mullie 72eb214650 Delete duplicate file. 2012-12-12 02:34:48 -05:00
Louis Mullie 9f89a006a4 Add a list of keywords for the learning DSL. 2012-12-12 02:34:43 -05:00
Louis Mullie ec3465d24d Considerably simplify DSL code. 2012-12-12 02:34:33 -05:00
Louis Mullie b76a5c2c15 Rename folder to make consistent with /lib. 2012-12-12 02:34:17 -05:00
Louis Mullie 3e2a898155 Support groups and fragments. 2012-12-12 02:33:54 -05:00
Louis Mullie 7ef79deaa5 Test the :nym option for validity. 2012-12-12 02:33:43 -05:00
Louis Mullie e13ff4dcab Support fragments. 2012-12-12 02:33:30 -05:00
Louis Mullie 796ae59d77 Make consistent with other serializers by returning location of stored file. 2012-12-12 02:33:15 -05:00
Louis Mullie 9f54b6daba Fix bug due to change in language variable type. 2012-12-12 02:33:00 -05:00
Louis Mullie 66f0cadc57 Improve language spec. 2012-12-11 19:23:19 -05:00
Louis Mullie a7b4a7eea8 Fix specs in accordance with 6c5f31fd5e19d45f6e7f5f6393a02459b01eabb9. 2012-12-11 19:22:59 -05:00
Louis Mullie 2d9ef825c2 Cosmetic change. 2012-12-11 19:20:04 -05:00
Louis Mullie 0e06750ede Bump version to 2.0.0. 2012-12-05 15:24:06 -05:00
Louis Mullie 2c6c4ec964 Add contributors. 2012-12-04 17:49:04 -05:00
Louis Mullie 4d60ceb6ea Remove examples; moved to quick tour. 2012-12-04 02:17:45 -05:00
Louis Mullie b57d047ea8 Fix typo in README. 2012-12-04 02:16:07 -05:00
Louis Mullie 10abcb7e20 Simplify README even more. 2012-12-04 02:13:08 -05:00
Louis Mullie 1ab18f42c3 Format README. 2012-12-04 02:06:05 -05:00
Louis Mullie b6be6e7e1d Begin release notes for 2.0.0rc1. 2012-12-04 02:04:39 -05:00
Louis Mullie 4577430e84 Bump version number. 2012-12-04 02:04:29 -05:00
Louis Mullie 6c46ebf707 Update README. 2012-12-04 02:02:32 -05:00
Louis Mullie 01125b8a38 Cheating is bad. 2012-12-04 01:57:16 -05:00
Louis Mullie a7991bd009 Update spec examples after fixing worker for underscores. 2012-12-04 01:57:02 -05:00
Louis Mullie 6cc94ca6c5 Manually run specs to comply with RSpec update. 2012-12-04 01:56:44 -05:00
Louis Mullie 6f443b3087 Fix typo raising SyntaxError. 2012-12-04 01:56:18 -05:00
Louis Mullie 5b22b0d3b8 Input option may be a symbol. 2012-12-04 01:56:06 -05:00
Louis Mullie 7c1b597619 Fix YAML require issue. 2012-12-04 01:55:09 -05:00
Louis Mullie 2394de746d Images are in HTML format after conversion. 2012-12-03 23:58:59 -05:00
Louis Mullie 7f83e78a04 Ensure correctness when entity is smaller than minimum detection threshold. 2012-12-03 23:58:59 -05:00
Louis Mullie 1c69d6558e Allow to pass in a mix of entities and strings. 2012-12-03 23:58:59 -05:00
Louis Mullie da096bd15e Refactor and improve doc. 2012-12-03 23:58:59 -05:00
Louis Mullie 1e47f41510 Re-work some formatting and option issues. 2012-12-03 23:58:59 -05:00
Louis Mullie d3e52fb36a Add Learning DSL in a safer way. 2012-12-03 23:58:59 -05:00
Louis Mullie b5528a34f1 Make taggers recursive within token groups. 2012-12-03 23:58:59 -05:00
Louis Mullie e4664586fe Add suitable plugins for French. 2012-12-03 23:58:59 -05:00
Louis Mullie a49d1d97a0 Rename phrase to group, add group in entity list. 2012-12-03 23:58:58 -05:00
Louis Mullie fbd5e48fae Merge pull request #34 from LeFnord/master
removed Psych
2012-11-26 08:07:27 -08:00
LeFnord ee3ed723bc changed french and german encoding to utf-8 (should be standard) 2012-11-26 16:30:21 +01:00
LeFnord 476962e126 removed psych dependencies, cause in 1.9.x default 2012-11-26 14:48:04 +01:00
Louis Mullie 57e3c88d9e Default format to txt. 2012-11-25 03:23:19 -05:00
Louis Mullie bb4910c43d Fix problem with order of function calls. 2012-11-25 03:23:10 -05:00
Louis Mullie a1d09cbb99 Remove underscores in composite words. 2012-11-25 03:22:57 -05:00
Louis Mullie c8020c90e6 Put first_but_warn along with code that uses it. 2012-11-25 03:22:43 -05:00
Louis Mullie 5ac537db5a Downcase value for expected behaviour. 2012-11-25 03:22:15 -05:00
Louis Mullie 2162339419 Fix mess with DSL. 2012-11-25 03:21:51 -05:00
Louis Mullie be96c54064 Make scalpel default segmenter. 2012-11-25 03:21:38 -05:00
Louis Mullie a116cc8a5b Don't skip linear and svm as they are now working. 2012-10-31 23:26:09 -04:00
Louis Mullie 12cac5c769 Remove now unused labels feature. 2012-10-31 23:26:08 -04:00
Louis Mullie 1eb9341f77 Call class method, not instance method. 2012-10-31 23:26:08 -04:00
Louis Mullie 22b2afdddf Generalize sentence/phrase to group. 2012-10-31 23:26:08 -04:00
Louis Mullie 01d8f656b2 Remove labels from everywhere. 2012-10-31 23:26:08 -04:00
Louis Mullie 9af52ec798 Allow transparent casting when adding strings/numbers. 2012-10-31 23:26:08 -04:00
Louis Mullie 30196bb699 Generalize taggers and categorizers to target groups. 2012-10-31 23:26:08 -04:00
Louis Mullie f2edcf8118 Merge pull request #31 from LeFnord/master
Added stop words for several languages.
2012-10-29 05:34:22 -07:00
LeFnord b20836bc04 added umlaute to german stop words 2012-10-29 10:57:43 +01:00
LeFnord 3fa0e8593b corrected typo 2012-10-29 10:36:55 +01:00
LeFnord cd044f1c59 added stop words 2012-10-29 10:28:19 +01:00
Louis Mullie 5144f43c63 Update to reflect removal of labels option. 2012-10-28 02:18:43 -04:00
Louis Mullie f0d63fea20 Silence library output. 2012-10-28 02:18:31 -04:00
Louis Mullie 7f93decac7 Make sure silence_stdout returns the result of yield. 2012-10-28 02:18:25 -04:00
Louis Mullie f6862357e6 Completely remove need for labels when creating question. 2012-10-28 02:05:20 -04:00
Louis Mullie 19813f6a14 Remove unnecessary check. 2012-10-28 02:01:04 -04:00
Louis Mullie 295e22417d Almost finalize liblinear adapter. 2012-10-28 02:00:56 -04:00
Louis Mullie 9fb2c9a127 Stylistic conventions. 2012-10-28 01:52:42 -04:00
Louis Mullie e5f8da4201 #17: Modify the including class, not Object. 2012-10-28 01:52:29 -04:00
Louis Mullie 386d56de70 Fix failing specs. 2012-10-28 01:42:07 -04:00
Louis Mullie 24e02802b9 Merge branch 'master' of github.com:louismullie/treat 2012-10-28 01:35:21 -04:00
Louis Mullie 18fce305ed Add documentation on config options. 2012-10-28 01:35:08 -04:00
Louis Mullie 49cfd6a6f7 Add tf-idf-similarity requirement. 2012-10-28 01:34:38 -04:00
Louis Mullie 5c1ca9d9f1 Move Levensthein to a similarity annotator. 2012-10-28 01:34:26 -04:00
Louis Mullie f7f53b0f7e Add precision/recall example. 2012-10-28 01:34:07 -04:00
Louis Mullie a6b42ca20e Add SVN classification example. 2012-10-28 01:33:59 -04:00
Louis Mullie d3f6df6266 Add in similarity annotators. 2012-10-28 01:33:32 -04:00
Louis Mullie b20a5dd412 Merge pull request #30 from darkphantum/master
Fix bug in tf_idf worker
2012-10-27 08:48:58 -07:00
darkphantum d09dd391b7 fix bug in td_idf worker 2012-10-26 19:08:35 -07:00
Louis Mullie a8170397ff Add message for deprecation of old DSL syntax. 2012-10-26 13:36:37 -04:00
Louis Mullie c182ca8a42 Merge pull request #29 from darkphantum/master
Allow Document entity creation from Raw String
2012-10-25 19:05:41 -07:00
darkphantum 8b8a1c93f3 edit err message related to prev commit 2012-10-25 18:53:33 -07:00
darkphantum ba7d0995d6 Allow Document Entity creation from string 2012-10-25 15:16:57 -07:00
Louis Mullie c7aa48e130 Take care of external output silencing issues. 2012-10-24 01:47:21 -04:00
Louis Mullie 22f17a5b4c Don't use DSL in specs. 2012-10-24 01:47:07 -04:00
Louis Mullie 61d683603c Finalize the implementation of MLPs with FANN. 2012-10-24 01:28:00 -04:00
Louis Mullie 3c925e579f Switch to ruby-fann for neural networks. 2012-10-24 01:27:51 -04:00
Louis Mullie f01f050f9d Add example of ID3 classification. 2012-10-24 01:07:03 -04:00
Louis Mullie afefa10e87 Fix and finish SVM implementation. 2012-10-24 01:06:02 -04:00
Louis Mullie d2a41c0dea Add example of classification with new DSL. 2012-10-24 00:42:44 -04:00
Louis Mullie 651c8490f2 Add proxy for calling Treat methods on lists. 2012-10-24 00:36:40 -04:00
Louis Mullie 1355122e2f Cleanup; pretty final implementation. 2012-10-24 00:35:10 -04:00
Louis Mullie 65311ceda9 Fix bugs occuring due to new return type. 2012-10-24 00:34:42 -04:00
Louis Mullie 0240847ce1 Cleanup. 2012-10-24 00:34:30 -04:00
Louis Mullie 3ae63233c2 Add to_a and to_ary methods (should move elsewhere eventually). 2012-10-24 00:34:20 -04:00
Louis Mullie 86db70f1e8 Cleanup and fix bug. 2012-10-24 00:33:12 -04:00
Louis Mullie 2c7aff0323 Allow to build entities from lists of entities. 2012-10-24 00:33:01 -04:00
Louis Mullie 98b39ef541 A List is going to be a section for now. 2012-10-24 00:32:43 -04:00
Louis Mullie f4975270b2 #17: Use lowercase builder method names. 2012-10-24 00:32:24 -04:00
Louis Mullie e73463c0d0 Format list of dependencies. 2012-10-24 00:30:51 -04:00
Louis Mullie e995da0e71 Reduce complexity in configure! 2012-10-21 19:24:02 -04:00
Louis Mullie aa99e71058 Remove dependency status.
Need to rely on a legacy version of rspec for now.
2012-10-21 19:17:15 -04:00
Louis Mullie 8baa98930a Fix typo. 2012-10-21 19:12:27 -04:00
Louis Mullie 7997a5d6b7 Remove sourcify gem. 2012-10-21 19:11:48 -04:00
Louis Mullie c0e98efb76 Add models to gemspec. 2012-10-21 19:10:01 -04:00
Louis Mullie be72e2499a Attempt to make temp dir if it doesn't exist. 2012-10-21 19:09:46 -04:00
Louis Mullie 2495d9cf05 Punctuation doesn't really have a language (?) 2012-10-21 19:09:25 -04:00
Louis Mullie e588051558 Remove trailing slash before downloading. 2012-10-21 19:09:00 -04:00
Louis Mullie 25689e5ef1 Don't crash if a language pack gem can't be installed. 2012-10-21 19:08:36 -04:00
Louis Mullie decb377e55 Add SQ tag for phrases. 2012-10-21 19:08:13 -04:00
Louis Mullie 5535c03141 Put this before loading library in case that fails. 2012-10-21 17:20:12 -04:00
Louis Mullie 7a902900b2 Improve documentation. 2012-10-21 17:00:17 -04:00
Louis Mullie 69d7c9ff8a Prettify a bit. 2012-10-21 17:00:07 -04:00
Louis Mullie 56902944c1 Ignore benchmark results. 2012-10-21 16:59:51 -04:00
Louis Mullie a713441546 Explicitly call categorization for clarity. 2012-10-21 16:43:35 -04:00
Louis Mullie d43e6be5c8 Isolate import behaviour in module. 2012-10-21 16:43:25 -04:00
Louis Mullie e8baae7513 Use -able convention for all mixins. 2012-10-21 16:43:12 -04:00
Louis Mullie f2c22b42f4 Delete empty class. 2012-10-21 16:42:25 -04:00
Louis Mullie 5a750834ae Add is_mixin? functionality on String (does it end with -able?) 2012-10-21 16:42:17 -04:00
Louis Mullie 363eaca919 Make things cleaner. 2012-10-21 16:42:01 -04:00
Louis Mullie 06913afa8a Add to_struct method to Hash helper. 2012-10-21 16:41:14 -04:00
Louis Mullie 9e755535c6 Global lookup method makes more sense on Workers. 2012-10-21 16:40:44 -04:00
Louis Mullie 4dd56750fb Add a fucking space. 2012-10-21 16:40:00 -04:00
Louis Mullie 544a2a115e Paths configuration does not need super. 2012-10-21 16:39:54 -04:00
Louis Mullie 19a59eaae8 Add some documentation and error checking to feel safer at night. 2012-10-21 16:39:41 -04:00
Louis Mullie 233894a5d8 Complete cleanup of the main config module. 2012-10-21 16:39:29 -04:00
Louis Mullie a34fca4e88 Don't reuse for border cases. 2012-10-20 22:18:05 -04:00
Louis Mullie a36bd07777 Improve code clarity in this file. 2012-10-20 22:17:50 -04:00
Louis Mullie 13175e8f25 Add propert determiner. 2012-10-20 22:17:37 -04:00
Louis Mullie 7b30c26686 Merge pull request #27 from louismullie/config-refactoring
Forgot to rename helper method here.
2012-10-20 18:47:32 -07:00
Louis Mullie 502b21797f Merge pull request #26 from louismullie/config-refactoring
Config refactoring
2012-10-20 18:46:11 -07:00
Louis Mullie e79c2dbdb8 Update README.md 2012-10-15 19:05:41 -03:00
130 changed files with 3935 additions and 2664 deletions

1
.gitignore vendored
View File

@ -13,4 +13,5 @@
*.yaml
spec/sandbox.rb
coverage/*
benchmark/*
TODO

2
.rspec
View File

@ -1,2 +1,2 @@
--format s -c
--format d -c
--order rand

View File

@ -1,11 +1,18 @@
language: ruby
rvm:
- 1.9.2
- 1.9.3
- 2.0
- 2.1
- 2.2
before_install:
- export "JAVA_HOME=/usr/lib/jvm/java-6-openjdk-i386/"
before_script:
before_script:
- sudo apt-get install antiword
- sudo apt-get install poppler-utils
- rake treat:install[travis] --trace
script: rake treat:spec --trace
script: rake treat:spec --trace

35
.treat Normal file
View File

@ -0,0 +1,35 @@
# A boolean value indicating whether to silence
# the output of external libraries (e.g. Stanford
# tools, Enju, LDA, Ruby-FANN, Schiphol).
Treat.core.verbosity.silence = false
# A boolean value indicating whether to explain
# the steps that Treat is performing.
Treat.core.verbosity.debug = true
# A boolean value indicating whether Treat should
# try to detect the language of newly input text.
Treat.core.language.detect = false
# A string representing the language to default
# to when detection is off.
Treat.core.language.default = 'english'
# A symbol representing the finest level at which
# language detection should be performed if language
# detection is turned on.
Treat.core.language.detect_at = :document
# The directory containing executables and JAR files.
Treat.paths.bin = '##_INSTALLER_BIN_PATH_##'
# The directory containing trained models
Treat.paths.models = '##_INSTALLER_MODELS_PATH_##'
# Mongo database configuration.
Treat.databases.mongo.db = 'your_database'
Treat.databases.mongo.host = 'localhost'
Treat.databases.mongo.port = '27017'
# Include the DSL by default.
include Treat::Core::DSL

57
Gemfile
View File

@ -1,48 +1,45 @@
source :rubygems
source 'https://rubygems.org'
gemspec
gem 'birch'
gem 'schiphol'
gem 'sourcify'
gem 'yomu'
gem 'ruby-readability'
gem 'nokogiri'
group :test do
gem 'rspec', '2.9.0'
gem 'rspec'
gem 'rake'
gem 'terminal-table'
gem 'simplecov'
end
=begin
gem 'nokogiri'
gem 'psych'
gem 'mongoid'
gem 'mongo'
gem 'bson_ext'
gem 'zip'
gem 'ferret'
gem 'lda-ruby'
gem 'stanford-core-nlp'
gem 'linguistics'
gem 'ruby-readability'
gem 'whatlanguage'
gem 'chronic'
gem 'nickel'
gem 'engtagger'
gem 'open-nlp'
gem 'stanford-core-nlp'
gem 'rwordnet'
gem 'scalpel'
gem 'fastimage'
gem 'decisiontree'
gem 'rb-libsvm'
gem 'ai4r'
gem 'whatlanguage'
gem 'zip'
gem 'nickel'
gem 'tactful_tokenizer'
gem 'srx-english'
gem 'punkt-segmenter'
gem 'chronic'
gem 'uea-stemmer'
gem 'rbtagger'
gem 'ruby-stemmer'
gem 'punkt-segmenter'
gem 'tactful_tokenizer'
gem 'nickel'
gem 'rwordnet'
gem 'uea-stemmer'
gem 'engtagger'
gem 'activesupport'
gem 'srx-english'
gem 'scalpel'
=end
# english?
gem 'rb-libsvm'
gem 'tomz-liblinear-ruby-swig'
gem 'ruby-fann'
gem 'fuzzy-string-match'
gem 'levenshtein-ffi'
gem 'tf-idf-similarity'
gem 'kronic'
=end

View File

@ -1,4 +1,4 @@
Treat - Text Retrieval, Extraction and Annotation Toolkit, v. 1.1.2
Treat - Text Retrieval, Extraction and Annotation Toolkit, v. 2.0.0
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
@ -15,7 +15,7 @@ along with this program. If not, see <http://www.gnu.org/licenses/>.
Author: Louis-Antoine Mullie (louis.mullie@gmail.com). Copyright 2011-12.
Non-trivial amount of code has been incorporated and modified from other libraries:
A non-trivial amount of code has been incorporated and modified from other libraries:
- formatters/readers/odt.rb - Mark Watson (GPL license)
- processors/tokenizers/tactful.rb - Matthew Bunday (GPL license)

View File

@ -1,35 +1,43 @@
[![Build Status](https://secure.travis-ci.org/louismullie/treat.png)](http://travis-ci.org/#!/louismullie/treat)
[![Dependency Status](https://gemnasium.com/louismullie/treat.png)](https://gemnasium.com/louismullie/treat)
[![Code Climate](https://codeclimate.com/badge.png)](https://codeclimate.com/github/louismullie/treat)
[![Code Climate](https://codeclimate.com/github/louismullie/treat.png)](https://codeclimate.com/github/louismullie/treat)
Treat is a toolkit for natural language processing and computational linguistics in Ruby. The Treat project aims to build a language- and algorithm- agnostic NLP framework for Ruby with support for tasks such as document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition.
![Treat Logo](http://www.louismullie.com/treat/treat-logo.jpg)
**Current features**
**New in v2.0.5: [OpenNLP integration](https://github.com/louismullie/treat/commit/727a307af0c64747619531c3aa355535edbf4632) and [Yomu support](https://github.com/louismullie/treat/commit/e483b764e4847e48b39e91a77af8a8baa1a1d056)**
Treat is a toolkit for natural language processing and computational linguistics in Ruby. The Treat project aims to build a language- and algorithm- agnostic NLP framework for Ruby with support for tasks such as document retrieval, text chunking, segmentation and tokenization, natural language parsing, part-of-speech tagging, keyword extraction and named entity recognition. Learn more by taking a [quick tour](https://github.com/louismullie/treat/wiki/Quick-Tour) or by reading the [manual](https://github.com/louismullie/treat/wiki/Manual).
**Features**
* Text extractors for PDF, HTML, XML, Word, AbiWord, OpenOffice and image formats (Ocropus).
* Text retrieval with indexation and full-text search (Ferret).
* Text chunkers, sentence segmenters, tokenizers, and parsers (Stanford & Enju).
* Word inflectors, including stemmers, conjugators, declensors, and number inflection.
* Lexical resources (WordNet interface, several POS taggers for English).
* Language, date/time, topic words (LDA) and keyword (TF*IDF) extraction.
* Word inflectors, including stemmers, conjugators, declensors, and number inflection.
* Serialization of annotated entities to YAML, XML or to MongoDB.
* Visualization in ASCII tree, directed graph (DOT) and tag-bracketed (standoff) formats.
* Linguistic resources, including language detection and tag alignments for several treebanks.
* Machine learning (decision tree, multilayer perceptron, linear, support vector machines).
* Machine learning (decision tree, multilayer perceptron, LIBLINEAR, LIBSVM).
* Text retrieval with indexation and full-text search (Ferret).
<br>
**Contributing**
**Resources**
I am actively seeking developers that can help maintain and expand this project. You can find a list of ideas for contributing to the project [here](https://github.com/louismullie/treat/wiki/Contributing).
* Read the [latest documentation](http://rubydoc.info/github/louismullie/treat/frames).
* See how to [install Treat](https://github.com/louismullie/treat/wiki/Installation).
* Learn how to [use Treat](https://github.com/louismullie/treat/wiki/Manual).
* Help out by [contributing to the project](https://github.com/louismullie/treat/wiki/Contributing).
* View a list of [papers](https://github.com/louismullie/treat/wiki/Papers) about tools included in this toolkit.
* Open an [issue](https://github.com/louismullie/treat/issues).
<br>
**Authors**
Lead developper: @louismullie [[Twitter](https://twitter.com/LouisMullie)]
Contributors:
- @bdigital
- @automatedtendencies
- @LeFnord
- @darkphantum
- @whistlerbrk
- @smileart
- @erol
**License**
This software is released under the [GPL License](https://github.com/louismullie/treat/wiki/License-Information) and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.
This software is released under the [GPL License](https://github.com/louismullie/treat/wiki/License-Information) and includes software released under the GPL, Ruby, Apache 2.0 and MIT licenses.

View File

@ -48,4 +48,9 @@ Treat - Text Retrieval, Extraction and Annotation Toolkit
* Added LIBSVM and LIBLINEAR classifier support.
* Added support for serialization of documents and data sets to MongoDB.
* Added specs for most of the core classes.
* Several bug fixes.
* Several bug fixes.
2.0.0rc1
* MAJOR CHANGE: the old DSL is no longer supported. A new DSL style using
lowercase keywords is now used and must be required explicitly.

View File

@ -40,20 +40,8 @@ namespace :treat do
task :spec, [:language] do |t, args|
require_relative 'spec/helper'
Treat::Specs::Helper.start_coverage
Treat::Specs::Helper.run_core_specs
Treat::Specs::Helper.run_examples_as(
'spec', args.language)
end
# Runs worker benchmarks for all languages (by
# default), or for a specific language (if supplied).
# Also outputs an HTML table
# Syntax: rake treat:benchmark (all languages)
# - OR - rake treat:benchmark[language]
task :benchmark, [:language] do |t, args|
require_relative 'spec/helper'
Treat::Specs::Helper.run_examples_as(
'benchmark', args.language)
Treat::Specs::Helper.run_library_specs
Treat::Specs::Helper.run_language_specs(args.language)
end
end

View File

@ -18,5 +18,6 @@ module Treat
require_relative 'treat/exception'
require_relative 'treat/autoload'
require_relative 'treat/modules'
require_relative 'treat/builder'
end

View File

@ -14,21 +14,31 @@ module Treat::Autoload
# Loads all the files for the base
# module in the appropriate order.
def self.included(base)
# Get the parts of module name.
bits = base.to_s.split('::')
# Singularize the module name.
w = bits[-1].downcase
n = (w[-3..-1] == 'ies' ?
(w[0..-4] + 'y') : (w[-1] ==
's' ? w[0...-1] : w)) + '.rb'
# Get the module's directory.
d = File.dirname(File.
expand_path(__FILE__))[0..-6] +
bits.join('/').downcase + '/'
# Require base class if exists.
require d + n if File.readable?(d + n)
# Require all other files in dir.
Dir.glob("#{d}*.rb").each { |f| require f }
m = self.get_module_name(base)
d = self.get_module_path(m)
n = self.singularize(m) + '.rb'
f, p = File.join(d, n), "#{d}/*.rb"
require f if File.readable?(f)
Dir.glob(p).each { |f| require f }
end
# Returns the path to a module's dir.
def self.get_module_path(name)
file = File.expand_path(__FILE__)
dirs = File.dirname(file).split('/')
File.join(*dirs[0..-1], name)
end
# Return the downcased form of the
# module's last name (e.g. "entities").
def self.get_module_name(mod)
mod.to_s.split('::')[-1].downcase
end
# Helper method to singularize words.
def self.singularize(w)
if w[-3..-1] == 'ies'; w[0..-4] + 'y'
else; (w[-1] == 's' ? w[0..-2] : w); end
end
end

6
lib/treat/builder.rb Normal file
View File

@ -0,0 +1,6 @@
class Treat::Builder
include Treat::Core::DSL
def initialize(&block)
instance_exec(&block)
end
end

View File

@ -3,71 +3,36 @@
# the /config folder.
module Treat::Config
# Require configurable mix in.
require_relative 'importable'
# Make all configuration importable.
extend Treat::Config::Importable
# Core configuration options for entities.
class Treat::Config::Entities; end
# Configuration for paths to models, binaries,
# temporary storage and file downloads.
class Treat::Config::Paths; end
# Configuration for all Treat workers.
class Treat::Config::Workers; end
# Helpful linguistic options.
class Treat::Config::Linguistics; end
# Supported workers for each language.
class Treat::Config::Languages; end
# Configuration options for external libraries.
class Treat::Config::Libraries; end
class Treat::Config::Workers; end
# Configuration options for database
# connectivity (host, port, etc.)
class Treat::Config::Databases; end
# Configuration options for Treat core.
class Treat::Config::Core; end
# Require autolodable mix in.
require_relative 'configurable'
# Store all the configuration in self.config
class << self; attr_accessor :config; end
# Setup a proxy on the main Treat module to
# make configuration options directly accessible,
# using e.g. Treat.paths.tmp = '...'
Treat.module_eval do
# Handle all missing methods as conf options.
# Instead, should dynamically define them. FIXME.
def self.method_missing(sym, *args, &block)
super(sym, *args, &block) if sym == :to_ary
Treat::Config.config[sym]
end
end
# Main function; loads all configuration options.
def self.configure!
config = {}
Treat::Config.constants.each do |const|
unless const == :Configurable
klass = Treat::Config.const_get(const)
klass.class_eval do
extend Treat::Config::Configurable
end
k = const.to_s.downcase.intern
klass.configure!
config[k] = klass.config
end
end
self.config = self.hash_to_struct(config)
end
# * Helper methods * #
# Convert a hash to nested structs.
def self.hash_to_struct(hash)
return hash if hash.keys.
select { |k| !k.is_a?(Symbol) }.size > 0
struct = Struct.new(*hash.keys).new(*hash.values)
hash.each do |key, value|
if value.is_a?(Hash)
struct[key] = self.hash_to_struct(value)
end
end; return struct
end
end

View File

@ -1,10 +1,29 @@
# Provide default functionality to load configuration
# options from flat files into their respective modules.
module Treat::Config::Configurable
# When extended, add the .config property to
# the class that is being operated on.
def self.extended(base)
class << base; attr_accessor :config; end
base.class_eval { self.config = {} }
end
# Provide base functionality to configure
# all modules. The behaviour is as follows:
#
# 1 - Check if a file named data/$CLASS$.rb
# exists; if so, load that file as the base
# configuration, i.e. "Treat.$CLASS$"; e.g.
# "Treat.core"
#
# 2 - Check if a folder named data/$CLASS$
# exists; if so, load each file in that folder
# as a suboption of the main configuration,
# i.e. "Treat.$CLASS$.$FILE$"; e.g. "Treat.workers"
#
# (where $CLASS$ is the lowercase name of
# the concrete class being extended by this.)
def configure!
path = File.dirname(File.expand_path( # FIXME
__FILE__)).split('/')[0..-4].join('/') + '/'
@ -14,15 +33,19 @@ module Treat::Config::Configurable
base_file = main_dir + mod_name + '.rb'
if File.readable?(base_file)
self.config = eval(File.read(base_file))
end
if FileTest.directory?(conf_dir)
config = {}
Dir[conf_dir + '/*'].each do |path|
name = File.basename(path, '.*').intern
config[name] = eval(File.read(path))
end
self.config = config
elsif FileTest.directory?(conf_dir)
self.config = self.from_dir(conf_dir)
else; raise Treat::Exception,
"No config file found for #{mod_name}."
end
end
# * Helper methods for configuraton * #
def from_dir(conf_dir)
Hash[Dir[conf_dir + '/*'].map do |path|
name = File.basename(path, '.*').intern
[name, eval(File.read(path))]
end]
end
end

View File

@ -1,50 +0,0 @@
{acronyms:
['xml', 'html', 'txt', 'odt',
'abw', 'doc', 'yaml', 'uea',
'lda', 'pdf', 'ptb', 'dot',
'ai', 'id3', 'svo', 'mlp',
'svm', 'srx'],
encodings:
{language_to_code: {
arabic: 'UTF-8',
chinese: 'GB18030',
english: 'UTF-8',
french: 'ISO_8859-1',
ferman: 'ISO_8859-1',
hebrew: 'UTF-8'
}},
entities:
{list:
[:entity, :unknown, :email,
:url, :symbol, :sentence,
:punctuation, :number,
:enclitic, :word, :token,
:fragment, :phrase, :paragraph,
:title, :zone, :list, :block,
:page, :section, :collection,
:document],
order:
[:token, :fragment, :phrase,
:sentence, :zone, :section,
:document, :collection]},
language: {
default: :english,
detect: false,
detect_at: :document
},
paths: {
description: {
tmp: 'temporary files',
lib: 'class and module definitions',
bin: 'binary files',
files: 'user-saved files',
models: 'model files',
spec: 'spec test files'
}
},
syntax: { sweetened: false },
verbosity: { debug: false, silence: true}}

View File

@ -4,7 +4,7 @@
'abw', 'doc', 'yaml', 'uea',
'lda', 'pdf', 'ptb', 'dot',
'ai', 'id3', 'svo', 'mlp',
'svm', 'srx'],
'svm', 'srx', 'nlp'],
encodings:
{language_to_code: {
@ -21,13 +21,13 @@
[:entity, :unknown, :email,
:url, :symbol, :sentence,
:punctuation, :number,
:enclitic, :word, :token,
:enclitic, :word, :token, :group,
:fragment, :phrase, :paragraph,
:title, :zone, :list, :block,
:page, :section, :collection,
:document],
order:
[:token, :fragment, :phrase,
[:token, :fragment, :group,
:sentence, :zone, :section,
:document, :collection]},
language: {
@ -45,7 +45,9 @@
spec: 'spec test files'
}
},
learning: {
list: [:data_set, :export, :feature, :tag, :problem, :question]
},
syntax: { sweetened: false },
verbosity: { debug: false, silence: true}

View File

@ -1,21 +1,12 @@
{
dependencies: [
'psych',
'nokogiri',
'ferret',
'bson_ext',
'mongo',
'lda-ruby',
'stanford-core-nlp',
'linguistics',
'ruby-readability',
'whatlanguage',
'chronic',
'nickel',
'decisiontree',
'rb-libsvm',
'ai4r',
'zip'
'ferret', 'bson_ext', 'mongo', 'lda-ruby',
'stanford-core-nlp', 'linguistics',
'ruby-readability', 'whatlanguage',
'chronic', 'kronic', 'nickel', 'decisiontree',
'rb-libsvm', 'ruby-fann', 'zip', 'loggability',
'tf-idf-similarity', 'narray', 'fastimage',
'fuzzy-string-match', 'levenshtein-ffi'
],
workers: {
learners: {
@ -25,7 +16,9 @@
keywords: [:tf_idf],
language: [:what_language],
topic_words: [:lda],
tf_idf: [:native]
tf_idf: [:native],
distance: [:levenshtein],
similarity: [:jaro_winkler, :tf_idf]
},
formatters: {
serializers: [:xml, :yaml, :mongo],

View File

@ -14,7 +14,7 @@
],
workers: {
extractors: {
time: [:chronic, :ruby, :nickel],
time: [:chronic, :kronic, :ruby, :nickel],
topics: [:reuters],
name_tag: [:stanford]
},
@ -32,28 +32,64 @@
},
processors: {
parsers: [:stanford],
segmenters: [:srx, :tactful, :punkt, :stanford, :scalpel],
tokenizers: [:ptb, :stanford, :punkt]
segmenters: [:scalpel, :srx, :tactful, :punkt, :stanford],
tokenizers: [:ptb, :stanford, :punkt, :open_nlp]
}
},
stop_words:
['the', 'of', 'and', 'a', 'to', 'in', 'is',
'you', 'that', 'it', 'he', 'was', 'for', 'on',
'are', 'as', 'with', 'his', 'they', 'I', 'at',
'be', 'this', 'have', 'from', 'or', 'one', 'had',
'by', 'word', 'but', 'not', 'what', 'all', 'were',
'we', 'when', 'your', 'can', 'said', 'there', 'use',
'an', 'each', 'which', 'she', 'do', 'how', 'their',
'if', 'will', 'up', 'other', 'about', 'out', 'many',
'then', 'them', 'these', 'so', 'some', 'her', 'would',
'make', 'like', 'him', 'into', 'time', 'has', 'look',
'two', 'more', 'write', 'go', 'see', 'number', 'no',
'way', 'could', 'people', 'my', 'than', 'first', 'been',
'call', 'who', 'its', 'now', 'find', 'long', 'down',
'day', 'did', 'get', 'come', 'made', 'may', 'part',
'say', 'also', 'new', 'much', 'should', 'still',
'such', 'before', 'after', 'other', 'then', 'over',
'under', 'therefore', 'nonetheless', 'thereafter',
'afterwards', 'here', 'huh', 'hah', "n't", "'t", 'here',
'neither', 'towards']
[
"about",
"also",
"are",
"away",
"because",
"been",
"beside",
"besides",
"between",
"but",
"cannot",
"could",
"did",
"etc",
"even",
"ever",
"every",
"for",
"had",
"have",
"how",
"into",
"isn",
"maybe",
"non",
"nor",
"now",
"should",
"such",
"than",
"that",
"then",
"these",
"this",
"those",
"though",
"too",
"was",
"wasn",
"were",
"what",
"when",
"where",
"which",
"while",
"who",
"whom",
"whose",
"will",
"with",
"would",
"wouldn",
"yes"
]
}

View File

@ -6,13 +6,143 @@
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [],
segmenters: [:scalpel],
tokenizers: [:ptb,:stanford],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
}
},
stop_words:
[
"ailleurs",
"ainsi",
"alors",
"aucun",
"aucune",
"auquel",
"aurai",
"auras",
"aurez",
"aurons",
"auront",
"aussi",
"autre",
"autres",
"aux",
"auxquelles",
"auxquels",
"avaient",
"avais",
"avait",
"avec",
"avez",
"aviez",
"avoir",
"avons",
"celui",
"cependant",
"certaine",
"certaines",
"certains",
"ces",
"cet",
"cette",
"ceux",
"chacun",
"chacune",
"chaque",
"comme",
"constamment",
"davantage",
"depuis",
"des",
"desquelles",
"desquels",
"dessous",
"dessus",
"donc",
"dont",
"duquel",
"egalement",
"elles",
"encore",
"enfin",
"ensuite",
"etaient",
"etais",
"etait",
"etes",
"etiez",
"etions",
"etre",
"eux",
"guere",
"ici",
"ils",
"jamais",
"jusqu",
"laquelle",
"legerement",
"lequel",
"les",
"lesquelles",
"lesquels",
"leur",
"leurs",
"lors",
"lui",
"maintenant",
"mais",
"malgre",
"moi",
"moins",
"notamment",
"parce",
"plupart",
"pourtant",
"presentement",
"presque",
"puis",
"puisque",
"quand",
"quant",
"que",
"quel",
"quelqu",
"quelque",
"quelques",
"qui",
"quoi",
"quoique",
"rien",
"selon",
"serai",
"seras",
"serez",
"serons",
"seront",
"soient",
"soit",
"sommes",
"sont",
"sous",
"suis",
"telle",
"telles",
"tels",
"toi",
"toujours",
"tout",
"toutes",
"tres",
"trop",
"une",
"vos",
"votre",
"vous"
]
}

View File

@ -1,3 +1,5 @@
#encoding: UTF-8
{
dependencies: [
'punkt-segmenter',
@ -6,13 +8,130 @@
],
workers: {
processors: {
segmenters: [:punkt],
tokenizers: [],
segmenters: [:tactful, :punkt, :stanford, :scalpel],
tokenizers: [:stanford, :punkt],
parsers: [:stanford]
},
lexicalizers: {
taggers: [:stanford],
categorizers: [:from_tag]
}
}
}
},
stop_words:
[
"alle",
"allem",
"alles",
"andere",
"anderem",
"anderen",
"anderer",
"anderes",
"auf",
"bei",
"beim",
"bist",
"dadurch",
"dein",
"deine",
"deiner",
"deines",
"deins",
"dem",
"denen",
"der",
"deren",
"des",
"deshalb",
"dessen",
"diese",
"diesem",
"diesen",
"dieser",
"dieses",
"ein",
"eine",
"einem",
"einen",
"einer",
"eines",
"euer",
"euere",
"eueren",
"eueres",
"für",
"haben",
"habt",
"hatte",
"hatten",
"hattest",
"hattet",
"hierzu",
"hinter",
"ich",
"ihr",
"ihre",
"ihren",
"ihrer",
"ihres",
"indem",
"ist",
"jede",
"jedem",
"jeden",
"jeder",
"jedes",
"kann",
"kannst",
"können",
"könnt",
"konnte",
"konnten",
"konntest",
"konntet",
"mehr",
"mein",
"meine",
"meiner",
"meines",
"meins",
"nach",
"neben",
"nicht",
"nichts",
"seid",
"sein",
"seine",
"seiner",
"seines",
"seins",
"sie",
"sind",
"über",
"und",
"uns",
"unser",
"unsere",
"unter",
"vor",
"warst",
"weil",
"wenn",
"werde",
"werden",
"werdet",
"willst",
"wir",
"wird",
"wirst",
"wollen",
"wollt",
"wollte",
"wollten",
"wolltest",
"wolltet",
"zum",
"zur"
]
}

View File

@ -8,5 +8,155 @@
segmenters: [:punkt],
tokenizers: []
}
}
}
},
stop_words:
[
"affinche",
"alcun",
"alcuna",
"alcune",
"alcuni",
"alcuno",
"allora",
"altra",
"altre",
"altri",
"altro",
"anziche",
"certa",
"certe",
"certi",
"certo",
"che",
"chi",
"chiunque",
"comunque",
"con",
"cosa",
"cose",
"cui",
"dagli",
"dai",
"dall",
"dalla",
"dalle",
"darsi",
"degli",
"del",
"dell",
"della",
"delle",
"dello",
"dunque",
"egli",
"eppure",
"esse",
"essi",
"forse",
"gia",
"infatti",
"inoltre",
"invece",
"lui",
"malgrado",
"mediante",
"meno",
"mentre",
"mie",
"miei",
"mio",
"modo",
"molta",
"molte",
"molti",
"molto",
"negli",
"nel",
"nella",
"nelle",
"nessun",
"nessuna",
"nessuno",
"niente",
"noi",
"nostra",
"nostre",
"nostri",
"nostro",
"nulla",
"occorre",
"ogni",
"ognuno",
"oltre",
"oltretutto",
"oppure",
"ovunque",
"ovvio",
"percio",
"pertanto",
"piu",
"piuttosto",
"poca",
"poco",
"poiche",
"propri",
"proprie",
"proprio",
"puo",
"qua",
"qual",
"qualche",
"qualcuna",
"qualcuno",
"quale",
"quali",
"qualunque",
"quando",
"quant",
"quante",
"quanti",
"quanto",
"quantunque",
"quegli",
"quei",
"quest",
"questa",
"queste",
"questi",
"questo",
"qui",
"quindi",
"sebbene",
"sembra",
"sempre",
"senza",
"soltanto",
"stessa",
"stesse",
"stessi",
"stesso",
"sugli",
"sui",
"sul",
"sull",
"sulla",
"sulle",
"suo",
"suoi",
"taluni",
"taluno",
"tanta",
"tanti",
"tanto",
"tra",
"tuo",
"tuoi",
"tutt",
"tutta",
"tutte",
"tutto",
"una",
"uno",
"voi"
]
}

View File

@ -8,5 +8,284 @@
segmenters: [:punkt],
tokenizers: []
}
}
},
stop_words:
[
"abans",
"aca",
"acerca",
"ahora",
"aixo",
"algo",
"algu",
"alguien",
"algun",
"alguna",
"algunas",
"algunes",
"alguno",
"algunos",
"alguns",
"alla",
"alli",
"allo",
"altra",
"altre",
"altres",
"amb",
"amunt",
"antes",
"aquel",
"aquell",
"aquella",
"aquellas",
"aquelles",
"aquellos",
"aquells",
"aquest",
"aquesta",
"aquestes",
"aquests",
"aqui",
"asimismo",
"aun",
"aunque",
"avall",
"cada",
"casi",
"com",
"como",
"con",
"cosas",
"coses",
"cual",
"cuales",
"cualquier",
"cuando",
"damunt",
"darrera",
"davant",
"debe",
"deben",
"deber",
"debia",
"debian",
"decia",
"decian",
"decir",
"deia",
"deien",
"del",
"demasiado",
"des",
"desde",
"despues",
"dicen",
"diciendo",
"dins",
"dir",
"diu",
"diuen",
"doncs",
"ell",
"ellas",
"elles",
"ells",
"els",
"encara",
"entonces",
"ese",
"esos",
"esser",
"esta",
"estan",
"estando",
"estant",
"estar",
"estaria",
"estarian",
"estarien",
"estas",
"estos",
"farien",
"feia",
"feien",
"fent",
"fue",
"fueron",
"gaire",
"gairebe",
"hace",
"hacia",
"hacian",
"haciendo",
"haran",
"hauria",
"haurien",
"hemos",
"hola",
"junto",
"lejos",
"les",
"lloc",
"los",
"menos",
"menys",
"meva",
"mias",
"mio",
"misma",
"mismas",
"mismo",
"mismos",
"molt",
"molta",
"moltes",
"mon",
"mucha",
"mucho",
"muy",
"nadie",
"ningu",
"nomes",
"nosaltres",
"nosotros",
"nostra",
"nostre",
"nuestra",
"nuestras",
"nuestro",
"nuestros",
"nunca",
"otra",
"pasa",
"pasan",
"pasara",
"pasaria",
"passara",
"passaria",
"passen",
"perque",
"poc",
"pocas",
"pocos",
"podem",
"poden",
"podeu",
"podria",
"podrian",
"podrien",
"poques",
"porque",
"potser",
"puc",
"pudieron",
"pudo",
"puede",
"pueden",
"puesto",
"qualsevol",
"quan",
"que",
"queria",
"querian",
"qui",
"quien",
"quienes",
"quiere",
"quieren",
"quin",
"quina",
"quines",
"quins",
"quizas",
"segueent",
"segun",
"sempre",
"seran",
"seria",
"serian",
"seu",
"seva",
"sido",
"siempre",
"siendo",
"siguiente",
"sino",
"sobretodo",
"solamente",
"sovint",
"suya",
"suyas",
"suyo",
"suyos",
"tambe",
"tambien",
"tanmateix",
"tanta",
"tanto",
"tendran",
"tendria",
"tendrian",
"tenen",
"teu",
"teva",
"tiene",
"tienen",
"tindran",
"tindria",
"tindrien",
"toda",
"todavia",
"todo",
"tota",
"totes",
"tras",
"traves",
"tuvieron",
"tuvo",
"tuya",
"tuyas",
"tuyo",
"tuyos",
"unas",
"unes",
"unos",
"uns",
"usaba",
"usaban",
"usada",
"usades",
"usado",
"usan",
"usando",
"usant",
"usar",
"usat",
"usava",
"usaven",
"usen",
"vaig",
"varem",
"varen",
"vareu",
"vegada",
"vegades",
"vez",
"volem",
"volen",
"voleu",
"vora",
"vos",
"vosaltres",
"vosotros",
"vostra",
"vostre",
"voy",
"vuestra",
"vuestras",
"vuestro",
"vuestros",
"vull"
]
}

View File

@ -8,5 +8,282 @@
segmenters: [:punkt],
tokenizers: []
}
}
}
},
stop_words:
[
"atminstone",
"an",
"anda",
"aven",
"aldrig",
"alla",
"alls",
"allt",
"alltid",
"allting",
"alltsa",
"andra",
"annan",
"annars",
"antingen",
"att",
"bakom",
"bland",
"blev",
"bli",
"bliva",
"blivit",
"bort",
"bortom",
"bredvid",
"dar",
"darav",
"darefter",
"darfor",
"dari",
"darigenom",
"darvid",
"dedar",
"definitivt",
"del",
"den",
"dendar",
"denhar",
"denna",
"deras",
"dessa",
"dessutom",
"desto",
"det",
"detta",
"dylik",
"efterat",
"efter",
"eftersom",
"eller",
"emellertid",
"enbart",
"endast",
"enligt",
"ens",
"ensam",
"envar",
"eran",
"etc",
"ett",
"exakt",
"fatt",
"fastan",
"fick",
"fler",
"flera",
"foljande",
"foljde",
"foljer",
"for",
"fore",
"forhoppningsvis",
"formodligen",
"forr",
"forra",
"forutom",
"forvisso",
"fran",
"framfor",
"fullstandigt",
"gang",
"gar",
"gatt",
"ganska",
"gav",
"genom",
"genomgaende",
"ger",
"gick",
"gjorde",
"gjort",
"gor",
"hade",
"har",
"harav",
"har",
"hej",
"hela",
"helst",
"helt",
"hitta",
"hon",
"honom",
"hur",
"huruvida",
"huvudsakligen",
"ibland",
"icke",
"ickedestomindre",
"igen",
"ihop",
"inat",
"ingen",
"ingenstans",
"inget",
"innan",
"innehalla",
"inre",
"inte",
"inuti",
"istaellet",
"kanske",
"klart",
"knappast",
"knappt",
"kom",
"komma",
"kommer",
"kraver",
"kunde",
"kunna",
"lata",
"later",
"lagga",
"langre",
"laet",
"lagd",
"leta",
"letar",
"manga",
"maste",
"med",
"medan",
"medans",
"mellan",
"mest",
"min",
"mindre",
"minst",
"mittemellan",
"motsvarande",
"mycket",
"nagon",
"nagongang",
"nagonsin",
"nagonstans",
"nagonting",
"nagorlunda",
"nagot",
"namligen",
"nar",
"nara",
"nasta",
"nastan",
"nedat",
"nedanfor",
"nerat",
"ner",
"nog",
"normalt",
"nummer",
"nuvarande",
"nytt",
"oavsett",
"och",
"ocksa",
"oppna",
"over",
"overallt",
"ofta",
"okej",
"olika",
"ovanfor",
"ratt",
"redan",
"relativt",
"respektive",
"rimlig",
"rimligen",
"rimligt",
"salunda",
"savida",
"saga",
"sager",
"sakert",
"sand",
"sarskilt",
"satt",
"sak",
"samma",
"samtliga",
"sedd",
"senare",
"senaste",
"ser",
"sig",
"sista",
"sjaelv",
"ska",
"skall",
"skickad",
"skriva",
"skulle",
"snabb",
"snarare",
"snart",
"som",
"somliga",
"speciellt",
"stalla",
"stallet",
"starta",
"strax",
"stundom",
"tackar",
"tanka",
"taga",
"tagen",
"tala",
"tanke",
"tidigare",
"tills",
"tog",
"totalt",
"trolig",
"troligen",
"tvaers",
"tvars",
"tycka",
"tyckte",
"tyvarr",
"understundom",
"upp",
"uppenbarligen",
"uppenbart",
"utan",
"utanfor",
"uteslutande",
"utom",
"var",
"varan",
"vad",
"val",
"varde",
"vanlig",
"vanligen",
"var",
"vare",
"varenda",
"varfor",
"varifran",
"varit",
"varje",
"varken",
"vars",
"vart",
"vem",
"verkligen",
"vidare",
"vilken",
"vill",
"visar",
"visst",
"visste"
]
}

View File

@ -8,5 +8,9 @@
stanford: {
jar_path: nil,
model_path: nil
},
open_nlp: {
jar_path: nil,
model_path: nil
}
}

View File

@ -24,8 +24,9 @@
'Coordinated phrase', ['', '', 'UCP', '', '', 'COORD'],
'Infinitival phrase', ['', '', '', '', '', 'VPinf'],
'Verb phrase', ['', '', 'VP', '', '', ''],
'Inverted yes/no question', ['', '', 'SQ', '', '', ''],
'Wh adjective phrase', ['', '', 'WHADJP', '', '', ''],
'Wh adverb phrase', ['', '', 'WHAVP', '', '', ''],
'Wh adverb phrase', ['', '', 'WHADVP', '', '', ''],
'Wh noun phrase', ['', '', 'WHNP', '', '', ''],
'Wh prepositional phrase', ['', '', 'WHPP', '', '', ''],
'Unknown', ['', '', 'X', '', '', ''],
@ -100,7 +101,7 @@
'Pronoun, reflexive', ['PNX', 'PPL', 'PRP', 'PRF'],
'Pronoun, reflexive, plural', ['PNX', 'PPLS', 'PRP', 'PRF'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WP', 'PWAV'],
'Pronoun, question, subject', ['PNQ', 'WPS', 'WPS', 'PWAV'], # Hack
'Pronoun, question, subject', ['PNQ', 'WPS', 'WPS', 'PWAV'], # FIXME
'Pronoun, question, object', ['PNQ', 'WPO', 'WP', 'PWAV', 'PWAT'],
'Pronoun, existential there', ['EX0', 'EX', 'EX'],
'Pronoun, attributive demonstrative', ['', '', '', 'PDAT'],
@ -181,7 +182,7 @@
'Punctuation, semicolon', ['PUN', '.', '.', '', '', 'PN'],
'Puncutation, colon or ellipsis', ['PUN', ':', ':'],
'Punctuationm, comma', ['PUN', ',', ',', '$,'],
'Punctuation, comma', ['PUN', ',', ',', '$,'],
'Punctuation, dash', ['PUN', '-', '-'],
'Punctuation, dollar sign', ['PUN', '', '$'],
'Punctuation, left bracket', ['PUL', '(', '(', '$('],
@ -324,4 +325,4 @@
['SQ', 'Inverted yes/no question']
]
}
}
}

View File

@ -27,5 +27,13 @@
tf_idf: {
type: :annotator,
targets: [:word]
},
similarity: {
type: :computer,
targets: [:entity]
},
distance: {
type: :computer,
targets: [:entity]
}
}

View File

@ -1,11 +1,12 @@
{
taggers: {
type: :annotator,
targets: [:phrase, :token]
targets: [:group, :token],
recursive: true
},
categorizers: {
type: :annotator,
targets: [:phrase, :token],
targets: [:group, :token],
recursive: true
},
sensers: {
@ -14,5 +15,5 @@
preset_option: :nym,
presets: [:synonyms, :antonyms,
:hyponyms, :hypernyms],
}
}
}

View File

@ -0,0 +1,31 @@
# Mixin that is extended by Treat::Config
# in order to provide a single point of
# access method to trigger the import.
module Treat::Config::Importable
# Import relies on each configuration.
require_relative 'configurable'
# Store all the configuration in self.config
def self.extended(base)
class << base; attr_accessor :config; end
end
# Main function; loads all configuration options.
def import!
config, c = {}, Treat::Config::Configurable
definition = :define_singleton_method
Treat::Config.constants.each do |const|
next if const.to_s.downcase.is_mixin?
klass = Treat::Config.const_get(const)
klass.class_eval { extend c }.configure!
name = const.to_s.downcase.intern
config[name] = klass.config
Treat.send(definition, name) do
Treat::Config.config[name]
end
end
self.config = config.to_struct
end
end

View File

@ -1,9 +1,13 @@
# Generates the following path config options:
# Treat.paths.tmp, Treat.paths.bin, Treat.paths.lib,
# Treat.paths.models, Treat.paths.files, Treat.paths.spec.
class Treat::Config::Paths
# Get the path configuration based on the
# directory structure loaded into Paths.
# Note that this doesn't call super, as
# there is no external config files to load.
def self.configure!
super
root = File.dirname(File.expand_path( # FIXME
__FILE__)).split('/')[0..-4].join('/') + '/'
self.config = Hash[

View File

@ -1,6 +1,8 @@
# Handles all configuration related
# to understanding of part of speech
# and phrasal tags.
class Treat::Config::Tags
# Load and align tags.
# Generate a map of word and phrase tags
# to their syntactic category, keyed by
# tag set.
@ -16,21 +18,20 @@ class Treat::Config::Tags
align_tags(phrase_tags, tag_sets)
self.config[:aligned] = config
end
# * Helper methods for tag set config * #
# Align tag tags in the tag set
def self.align_tags(tags, tag_sets)
wttc = {}
tags.each_slice(2) do |desc, tags|
category = desc.gsub(',', ' ,').
split(' ')[0].downcase
tag_sets.each_with_index do |tag_set, i|
next unless tags[i]
wttc[tags[i]] ||= {}
wttc[tags[i]][tag_set] = category
end
end; return wttc
end
# Helper methods for tag set config.
# Align tag tags in the tag set
def self.align_tags(tags, tag_sets)
wttc = {}
tags.each_slice(2) do |desc, tags|
category = desc.gsub(',', ' ,').
split(' ')[0].downcase
tag_sets.each_with_index do |tag_set, i|
next unless tags[i]
wttc[tags[i]] ||= {}
wttc[tags[i]][tag_set] = category
end
end; return wttc
end
end

View File

@ -1,36 +1,21 @@
module Treat::Core::DSL
# Include DSL on base.
def self.included(base)
self.sweeten_entities(base)
self.sweeten_learning(base)
end
# Map all classes in Treat::Entities to
# a global builder function (Entity, etc.)
def self.sweeten_entities(base, on = true)
Treat.core.entities.list.each do |type|
next if type == :Symbol
kname = type.cc.intern
klass = Treat::Entities.const_get(kname)
Object.class_eval do
define_method(kname) do |val, opts={}|
klass.build(val, opts)
end if on
remove_method(name) if !on
end
end
end
# Map all classes in the Learning module
# to a global builder function (e.g. DataSet).
def self.sweeten_learning(base, on = true)
Treat::Learning.constants.each do |kname|
Object.class_eval do
define_method(kname) do |*args|
Treat::Learning.const_get(kname).new(*args)
end if on
remove_method(name) if !on
# a global builder function (entity, word,
# phrase, punctuation, symbol, list, etc.)
def self.included(base)
def method_missing(sym,*args,&block)
@@entities ||= Treat.core.entities.list
@@learning ||= Treat.core.learning.list
if @@entities.include?(sym)
klass = Treat::Entities.const_get(sym.cc)
return klass.build(*args)
elsif @@learning.include?(sym)
klass = Treat::Learning.const_get(sym.cc)
return klass.new(*args)
else
super(sym,*args,&block)
raise "Uncaught method ended up in Treat DSL."
end
end
end

View File

@ -1,11 +1,11 @@
# A dependency manager for Treat language plugins.
# Usage: Treat::Installer.install('language')
module Treat::Core::Installer
require 'schiphol'
# Address of the server with the files.
Server = 'www.louismullie.com'
Server = 's3.amazonaws.com/static-public-assets'
# Filenames for the Stanford packages.
StanfordPackages = {
@ -20,34 +20,34 @@ module Treat::Core::Installer
:bin => File.absolute_path(Treat.paths.bin),
:models => File.absolute_path(Treat.paths.models)
}
# Install required dependencies and optional
# dependencies for a specific language.
def self.install(language = 'english')
# Require the Rubygem dependency installer.
silence_warnings do
require 'rubygems/dependency_installer'
end
@@installer = Gem::DependencyInstaller.new
if language == 'travis'
install_travis; return
end
l = "#{language.to_s.capitalize} language"
puts "\nTreat Installer, v. #{Treat::VERSION.to_s}\n\n"
begin
title "Installing core dependencies."
install_language_dependencies('agnostic')
title "Installing dependencies for the #{l}.\n"
install_language_dependencies(language)
# If gem is installed only, download models.
begin
Gem::Specification.find_by_name('punkt-segmenter')
@ -73,7 +73,7 @@ module Treat::Core::Installer
end
end
# Minimal install for Travis CI.
def self.install_travis
install_language_dependencies(:agnostic)
@ -81,7 +81,7 @@ module Treat::Core::Installer
download_stanford(:minimal)
download_punkt_models(:english)
end
def self.install_language_dependencies(language)
dependencies = Treat.languages[language].dependencies
@ -92,31 +92,31 @@ module Treat::Core::Installer
end
def self.download_stanford(package = :minimal)
f = StanfordPackages[package]
url = "http://#{Server}/treat/#{f}"
loc = Schiphol.download(url,
loc = Schiphol.download(url,
download_folder: Treat.paths.tmp
)
puts "- Unzipping package ..."
dest = File.join(Treat.paths.tmp, 'stanford')
unzip_stanford(loc, dest)
model_dir = File.join(Paths[:models], 'stanford')
bin_dir = File.join(Paths[:bin], 'stanford')
origin = File.join(Paths[:tmp], 'stanford')
# Mac hidden files fix.
mac_remove = File.join(dest, '__MACOSX')
if File.readable?(mac_remove)
FileUtils.rm_rf(mac_remove)
end
unless File.readable?(bin_dir)
puts "- Creating directory bin/stanford ..."
FileUtils.mkdir_p(bin_dir)
end
unless File.readable?(model_dir)
puts "- Creating directory models/stanford ..."
FileUtils.mkdir_p(model_dir)
@ -127,18 +127,18 @@ module Treat::Core::Installer
Dir.glob(File.join(origin, '*')) do |f|
next if ['.', '..'].include?(f)
if f.index('jar')
FileUtils.cp(f, File.join(Paths[:bin],
FileUtils.cp(f, File.join(Paths[:bin],
'stanford', File.basename(f)))
elsif FileTest.directory?(f)
FileUtils.cp_r(f, model_dir)
end
end
puts "- Cleaning up..."
FileUtils.rm_rf(origin)
'Done.'
end
def self.download_punkt_models(language)
@ -146,7 +146,7 @@ module Treat::Core::Installer
f = "#{language}.yaml"
dest = "#{Treat.paths.models}punkt/"
url = "http://#{Server}/treat/punkt/#{f}"
loc = Schiphol.download(url,
loc = Schiphol.download(url,
download_folder: Treat.paths.tmp
)
unless File.readable?(dest)
@ -156,7 +156,7 @@ module Treat::Core::Installer
puts "- Copying model file to models/punkt ..."
FileUtils.cp(loc, File.join(Paths[:models], 'punkt', f))
puts "- Cleaning up..."
FileUtils.rm_rf(Paths[:tmp] + Server)
@ -181,12 +181,11 @@ module Treat::Core::Installer
begin
puts "Installing #{dependency}...\n"
@@installer.install(dependency)
rescue Exception => error
raise
puts "Couldn't install gem '#{dependency}' " +
"(#{error.message})."
rescue Gem::InstallError => error
puts "Warning: couldn't install " +
"gem '#{dependency}' (#{error.message})."
end
end
# Unzip a file to the destination path.
@ -194,7 +193,7 @@ module Treat::Core::Installer
require 'zip/zip'
f_path = ''
Zip::ZipFile.open(file) do |zip_file|
zip_file.each do |f|
f_path = File.join(destination, f.name)

View File

@ -3,6 +3,7 @@ class Treat::Core::Server
# Refer to http://rack.rubyforge.org/doc/classes/Rack/Server.html
# for possible options to configure.
def initialize(handler = 'thin', options = {})
raise "Implementation not finished."
require 'json'; require 'rack'
@handler, @options = handler.capitalize, options
end

View File

@ -4,6 +4,7 @@ module Treat::Entities
# Represents a collection.
class Collection < Entity; end
# Represents a document.
class Document < Entity; end
@ -18,6 +19,9 @@ module Treat::Entities
# Represents a block of text
class Block < Section; end
# Represents a list.
class List < Section; end
# * Zones and related classes * #
# Represents a zone of text.
@ -31,9 +35,6 @@ module Treat::Entities
# of sentences and/or phrases).
class Paragraph < Zone; end
# Represents a list.
class List < Zone; end
# * Groups and related classes * #
# Represents a group of tokens.

View File

@ -22,7 +22,9 @@ module Treat::Entities
attr_accessor :type
# Autoload all the classes in /abilities.
include Treat::Autoload
path = File.expand_path(__FILE__)
patt = File.dirname(path) + '/entity/*.rb'
Dir.glob(patt).each { |f| require f }
# Implements support for #register, #registry.
include Registrable
@ -82,8 +84,11 @@ module Treat::Entities
#
# Takes in a single entity or an array of
# entities. Returns the first child supplied.
# @see Treat::Registrable
# If a string is
def <<(entities, clear_parent = true)
entities = (entities.is_a?(::String) ||
entities.is_a?(::Numeric)) ?
entities.to_entity : entities
entities = entities.is_a?(::Array) ?
entities : [entities]
# Register each entity in this node.
@ -121,7 +126,7 @@ module Treat::Entities
# requested method does not exist. Also
# provides suggestions for misspellings.
def invalid_call(sym)
msg = Treat::Workers::Category.lookup(sym) ?
msg = Treat::Workers.lookup(sym) ?
"Method #{sym} can't be called on a #{type}." :
"Method #{sym} is not defined by Treat." +
Treat::Helpers::Help.did_you_mean?(

View File

@ -57,7 +57,7 @@ module Treat::Entities::Entity::Applicable
# Get the group of a task.
def get_group(task)
g = Treat::Workers::Category.lookup(task)
g = Treat::Workers.lookup(task)
unless g
raise Treat::Exception,
"Task #{task} does not exist."

View File

@ -15,7 +15,21 @@ module Treat::Entities::Entity::Buildable
PunctRegexp = /^[[:punct:]\$]+$/
UriRegexp = /^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$/ix
EmailRegexp = /.+\@.+\..+/
Enclitics = %w['ll 'm 're 's 't 've]
Enclitics = [
# EXAMPLE:
"'d", # I'd => I would
"'ll", # I'll => I will
"'m", # I'm => I am
"'re", # We're => We are
"'s", # There's => There is
# Let's => Let us
"'t", # 'Twas => Archaic ('Twas the night)
"'ve", # They've => They have
"n't" # Can't => Can not
]
# Accepted formats of serialized files
AcceptedFormats = ['.xml', '.yml', '.yaml', '.mongo']
# Reserved folder names
Reserved = ['.index']
@ -23,23 +37,38 @@ module Treat::Entities::Entity::Buildable
# Build an entity from anything (can be
# a string, numeric,folder, or file name
# representing a raw or serialized file).
def build(file_or_value, options = {})
def build(*args)
# This probably needs some doc.
if args.size == 0
file_or_value = ''
elsif args[0].is_a?(Hash)
file_or_value = args[0]
elsif args.size == 1
if args[0].is_a?(Treat::Entities::Entity)
args[0] = [args[0]]
end
file_or_value = args[0]
else
file_or_value = args
end
fv = file_or_value.to_s
if file_or_value.is_a?(Hash)
if fv == ''; self.new
elsif file_or_value.is_a?(Array)
from_array(file_or_value)
elsif file_or_value.is_a?(Hash)
from_db(file_or_value)
elsif self == Treat::Entities::Document ||
(fv.index('yml') || fv.index('yaml') ||
fv.index('xml') || fv.index('mongo'))
elsif self == Treat::Entities::Document || (is_serialized_file?(fv))
if fv =~ UriRegexp
from_url(fv, options)
from_url(fv)
else
from_file(fv, options)
from_file(fv)
end
elsif self == Treat::Entities::Collection
if FileTest.directory?(fv)
from_folder(fv, options)
from_folder(fv)
else
create_collection(fv)
end
@ -78,8 +107,19 @@ module Treat::Entities::Entity::Buildable
e
end
# Build a document from an array
# of builders.
def from_array(array)
obj = self.new
array.each do |el|
el = el.to_entity unless el.is_a?(Treat::Entities::Entity)
obj << el
end
obj
end
# Build a document from an URL.
def from_url(url, options)
def from_url(url)
unless self ==
Treat::Entities::Document
raise Treat::Exception,
@ -88,8 +128,12 @@ module Treat::Entities::Entity::Buildable
end
begin
folder = Treat.paths.files
if folder[-1] == '/'
folder = folder[0..-2]
end
f = Schiphol.download(url,
download_folder: Treat.paths.files,
download_folder: folder,
show_progress: !Treat.core.verbosity.silence,
rectify_extensions: true,
max_tries: 3)
@ -97,10 +141,8 @@ module Treat::Entities::Entity::Buildable
raise Treat::Exception,
"Couldn't download file at #{url}."
end
options[:default_to] ||= 'html'
e = from_file(f, options)
e = from_file(f,'html')
e.set :url, url.to_s
e
@ -123,7 +165,7 @@ module Treat::Entities::Entity::Buildable
# Build an entity from a folder with documents.
# Folders will be searched recursively.
def from_folder(folder, options)
def from_folder(folder)
return if Reserved.include?(folder)
@ -148,49 +190,44 @@ module Treat::Entities::Entity::Buildable
c = Treat::Entities::Collection.new(folder)
folder += '/' unless folder[-1] == '/'
if !FileTest.directory?(folder)
FileUtils.mkdir(folder)
end
c.set :folder, folder
i = folder + '/.index'
c.set :index, i if FileTest.directory?(i)
Dir[folder + '*'].each do |f|
if FileTest.directory?(f)
c2 = Treat::Entities::Collection.
from_folder(f, options)
from_folder(f)
c.<<(c2, false) if c2
else
c.<<(Treat::Entities::Document.
from_file(f, options), false)
from_file(f), false)
end
end
c
return c
end
# Build a document from a raw or serialized file.
def from_file(file, options)
def from_file(file,def_fmt=nil)
if file.index('yml') ||
file.index('yaml') ||
file.index('xml') ||
file.index('mongo')
from_serialized_file(file, options)
if is_serialized_file?(file)
from_serialized_file(file)
else
fmt = Treat::Workers::Formatters::Readers::Autoselect.
detect_format(file, options[:default_to])
options[:_format] = fmt
from_raw_file(file, options)
fmt = Treat::Workers::Formatters::Readers::Autoselect.detect_format(file,def_fmt)
from_raw_file(file, fmt)
end
end
# Build a document from a raw file.
def from_raw_file(file, options)
def from_raw_file(file, def_fmt='txt')
unless self ==
Treat::Entities::Document
@ -204,7 +241,7 @@ module Treat::Entities::Entity::Buildable
"Path '#{file}' does not "+
"point to a readable file."
end
options = {default_format: def_fmt}
d = Treat::Entities::Document.new
d.set :file, file
d.read(:autoselect, options)
@ -212,34 +249,32 @@ module Treat::Entities::Entity::Buildable
end
# Build an entity from a serialized file.
def from_serialized_file(file, options)
def from_serialized_file(file)
if file.index('mongo')
options[:id] = file.scan( # Consolidate this
/([0-9]+)\.mongo/).first.first
from_db(:mongo, options)
else
unless File.readable?(file)
raise Treat::Exception,
"Path '#{file}' does not "+
"point to a readable file."
end
doc = Treat::Entities::Document.new
doc.set :file, file
format = nil
if file.index('yml') || file.index('yaml')
format = :yaml
elsif file.index('xml')
f = :xml
else
raise Treat::Exception,
"Unreadable serialized format for #{file}."
end
doc.unserialize(format, options)
doc.children[0].set_as_root! # Fix this
doc.children[0]
unless File.readable?(file)
raise Treat::Exception,
"Path '#{file}' does not "+
"point to a readable file."
end
doc = Treat::Entities::Document.new
doc.set :file, file
format = nil
if File.extname(file) == '.yml' ||
File.extname(file) == '.yaml'
format = :yaml
elsif File.extname(file) == '.xml'
format = :xml
else
raise Treat::Exception,
"Unreadable serialized format for #{file}."
end
doc.unserialize(format)
doc.children[0].set_as_root! # Fix this
doc.children[0]
end
def is_serialized_file?(path_to_check)
(AcceptedFormats.include? File.extname(path_to_check)) && (File.file?(path_to_check))
end
def from_db(hash)
@ -258,9 +293,23 @@ module Treat::Entities::Entity::Buildable
# Build any kind of entity from a string.
def anything_from_string(string)
case self.mn.downcase.intern
when :document, :collection
when :document
folder = Treat.paths.files
if folder[-1] == '/'
folder = folder[0..-2]
end
now = Time.now.to_f
doc_file = folder+ "/#{now}.txt"
string.force_encoding('UTF-8')
File.open(doc_file, 'w') do |f|
f.puts string
end
from_raw_file(doc_file)
when :collection
raise Treat::Exception,
"Cannot create a document or " +
"Cannot create a " +
"collection from a string " +
"(need a readable file/folder)."
when :phrase
@ -287,6 +336,7 @@ module Treat::Entities::Entity::Buildable
end
# This should be improved on.
def check_encoding(string)
string.encode("UTF-8", undef: :replace) # Fix
end
@ -346,7 +396,7 @@ module Treat::Entities::Entity::Buildable
end
end
def create_collection(fv)
FileUtils.mkdir(fv)
Treat::Entities::Collection.new(fv)

View File

@ -11,8 +11,8 @@ module Treat::Entities::Entity::Checkable
return @features[feature] if has?(feature)
return send(feature) if do_it
task = caller_method(2) # This is dangerous !
g1 = Treat::Workers::Category.lookup(task)
g2 = Treat::Workers::Category.lookup(feature)
g1 = Treat::Workers.lookup(task)
g2 = Treat::Workers.lookup(feature)
raise Treat::Exception,
"#{g1.type.to_s.capitalize} " +

View File

@ -41,6 +41,7 @@ module Treat::Entities::Entity::Countable
# Returns the frequency of the given value
# in the this entity.
def frequency_of(value)
value = value.downcase
if is_a?(Treat::Entities::Token)
raise Treat::Exception,
"Cannot get the frequency " +

View File

@ -3,67 +3,64 @@
# printed by the #print_debug function.
module Treat::Entities::Entity::Debuggable
@@prev = nil
@@i = 0
# Previous state and counter.
@@prev, @@i = nil, 0
# Explains what Treat is currently doing.
# Fixme: last call will never get shown.
def print_debug(entity, task, worker, group, options)
targs = group.targets.map do |target|
target.to_s
end
# Get a list of the worker's targets.
targets = group.targets.map(&:to_s)
if targs.size == 1
t = targs[0]
else
t = targs[0..-2].join(', ') +
' and/or ' + targs[-1]
end
# List the worker's targets as either
# a single target or an and/or form
# (since it would be too costly to
# actually determine what target types
# were processed at runtime for each call).
t = targets.size == 1 ? targets[0] : targets[
0..-2].join(', ') + ' and/or ' + targets[-1]
# Add genitive for annotations (sing./plural)
genitive = targets.size > 1 ? 'their' : 'its'
# Set up an empty string and humanize task name.
doing, human_task = '', task.to_s.gsub('_', ' ')
genitive = targs.size > 1 ?
'their' : 'its'
doing = ''
human_task = task.to_s.gsub('_', ' ')
if group.type == :transformer ||
group.type == :computer
# Base is "{task}-ed {a(n)|N} {target(s)}"
if [:transformer, :computer].include?(group.type)
tt = human_task
tt = tt[0..-2] if tt[-1] == 'e'
ed = tt[-1] == 'd' ? '' : 'ed'
doing = "#{tt.capitalize}#{ed} #{t}"
# Base is "Annotated {a(n)|N} {target(s)}"
elsif group.type == :annotator
if group.preset_option
opt = options[group.preset_option]
form = opt.to_s.gsub('_', ' ')
human_task[-1] = ''
human_task = form + ' ' + human_task
end
doing = "Annotated #{t} with " +
"#{genitive} #{human_task}"
end
# Form is '{base} in format {worker}'.
if group.to_s.index('Formatters')
curr = doing +
' in format ' +
worker.to_s
curr = doing + ' in format ' + worker.to_s
# Form is '{base} using {worker}'.
else
curr = doing +
' using ' +
worker.to_s.gsub('_', ' ')
curr = doing + ' using ' + worker.to_s.gsub('_', ' ')
end
# Remove any double pluralization that may happen.
curr.gsub!('ss', 's') unless curr.index('class')
curr += '.'
if curr == @@prev
@@i += 1
else
# Accumulate repeated tasks.
@@i += 1 if curr == @@prev
# Change tasks, so output.
if curr != @@prev && @@prev
# Pluralize entity names if necessary.
if @@i > 1
Treat.core.entities.list.each do |e|
@@prev.gsub!(e.to_s, e.to_s + 's')
@ -71,9 +68,15 @@ module Treat::Entities::Entity::Debuggable
@@prev.gsub!('its', 'their')
@@prev = @@prev.split(' ').
insert(1, @@i.to_s).join(' ')
# Add determiner if singular.
else
@@prev = @@prev.split(' ').
insert(1, 'a').join(' ')
end
# Reset counter.
@@i = 0
puts @@prev # Last call doesn't get shown.
# Write to stdout.
puts @@prev + '.'
end
@@prev = curr

View File

@ -88,7 +88,6 @@ module Treat::Entities::Entity::Delegatable
# Get the default worker for that language
# inside the given group.
def find_worker_for_language(language, group)
lang = Treat.languages[language]
cat = group.to_s.split('::')[2].downcase.intern
group = group.mn.ucc.intern
@ -96,31 +95,25 @@ module Treat::Entities::Entity::Delegatable
raise Treat::Exception,
"No configuration file loaded for language #{language}."
end
workers = lang.workers
if !workers.respond_to?(cat) ||
!workers[cat].respond_to?(group)
workers = Treat.languages.agnostic.workers
end
if !workers.respond_to?(cat) ||
!workers[cat].respond_to?(group)
raise Treat::Exception,
"No #{group} is/are available for the " +
"#{language.to_s.capitalize} language."
end
workers[cat][group].first
end
# Return an error message and suggest possible typos.
def worker_not_found(klass, group)
"Algorithm '#{klass.mn.ucc}' couldn't be "+
def worker_not_found(worker, group)
"Worker with name '#{worker}' couldn't be "+
"found in group #{group}." + Treat::Helpers::Help.
did_you_mean?(group.list.map { |c| c.ucc }, klass.ucc)
did_you_mean?(group.list.map { |c| c.ucc }, worker)
end
end

View File

@ -105,18 +105,6 @@ module Treat::Entities::Entity::Iterable
end
i
end
# Return the first element in the array, warning if not
# the only one in the array. Used for magic methods: e.g.,
# the magic method "word" if called on a sentence with many
# words, Treat will return the first word, but warn the user.
def first_but_warn(array, type)
if array.size > 1
warn "Warning: requested one #{type}, but" +
" there are many #{type}s in this entity."
end
array[0]
end
end

View File

@ -78,5 +78,16 @@ module Treat::Entities::Entity::Magical
end
# Return the first element in the array, warning if not
# the only one in the array. Used for magic methods: e.g.,
# the magic method "word" if called on a sentence with many
# words, Treat will return the first word, but warn the user.
def first_but_warn(array, type)
if array.size > 1
warn "Warning: requested one #{type}, but" +
" there are many #{type}s in this entity."
end
array[0]
end
end

View File

@ -6,6 +6,12 @@ module Treat::Entities::Entity::Stringable
# Returns the entity's true string value.
def to_string; @value.dup; end
# Returns an array of the childrens' string
# values, found by calling #to_s on them.
def to_a; @children.map { |c| c.to_s }; end
alias :to_ary :to_a
# Returns the entity's string value by
# imploding the value of all terminal
# entities in the subtree of that entity.
@ -52,16 +58,14 @@ module Treat::Entities::Entity::Stringable
end
# Helper method to implode the string value of the subtree.
def implode
def implode(value = "")
return @value.dup if !has_children?
value = ''
each do |child|
if child.is_a?(Treat::Entities::Section)
value += "\n\n"
value << "\n\n"
end
if child.is_a?(Treat::Entities::Token) || child.value != ''
@ -69,14 +73,14 @@ module Treat::Entities::Entity::Stringable
child.is_a?(Treat::Entities::Enclitic)
value.strip!
end
value += child.to_s + ' '
value << child.to_s + ' '
else
value += child.implode
child.implode(value)
end
if child.is_a?(Treat::Entities::Title) ||
child.is_a?(Treat::Entities::Paragraph)
value += "\n\n"
value << "\n\n"
end
end

View File

@ -1,18 +1,29 @@
# Helper methods to manipulate hashes.
class Treat::Helpers::Hash
# Allow getting the caller method in any context.
Hash.class_eval do
# Mixin to allow conversion of hashes to
# nested structs with the keys as attributes.
module ToStruct
# Converts a hash to nested structs.
def self.hash_to_struct(hash)
return hash if hash.keys.
select { |k| !k.is_a?(Symbol) }.size > 0
struct = Struct.new(*hash.keys).new(*hash.values)
def to_struct
hash = self
symbols = hash.keys.select { |k|
!k.is_a?(Symbol) }.size
return hash if symbols > 0
klass = Struct.new(*hash.keys)
struct = klass.new(*hash.values)
hash.each do |key, value|
if value.is_a?(Hash)
struct[key] = self.hash_to_struct(value)
v = value.to_struct
struct[key] = v
end
end; return struct
end
end
# Include the mixins on the core Hash class.
Hash.class_eval do
include Treat::Helpers::Hash::ToStruct
end
end

View File

@ -32,30 +32,4 @@ class Treat::Helpers::Help
msg
end
# Return the levensthein distance between
# two strings taking into account the costs
# of insertion, deletion, and substitution.
# Used by did_you_mean? to detect typos.
def self.levenshtein(first, other, ins=1, del=1, sub=1)
return nil if first.nil? || other.nil?
dm = []
dm[0] = (0..first.length).collect { |i| i * ins}
fill = [0] * (first.length - 1).abs
for i in 1..other.length
dm[i] = [i * del, fill.flatten]
end
for i in 1..other.length
for j in 1..first.length
dm[i][j] = [
dm[i-1][j-1] +
(first[i-1] ==
other[i-1] ? 0 : sub),
dm[i][j-1] + ins,
dm[i-1][j] + del
].min
end
end
dm[other.length][first.length]
end
end

View File

@ -4,46 +4,40 @@ class Treat::Helpers::Object
# Allow introspection onto what method called
# another one at runtime (useful for debugging).
module CallerMethod
# Pattern to match method from trace.
CMPattern = /^(.+?):(\d+)(?::in `(.*)')?/
# Return the name of the method that
# called the method that calls this method.
def caller_method(n = 3)
at = caller(n).first
/^(.+?):(\d+)(?::in `(.*)')?/ =~ at
CMPattern =~ at
Regexp.last_match[3].
gsub('block in ', '').intern
end
end
# Retrieve the last name of a class/module
# (i.e. the part after the last "::").
module ModuleName
def module_name; self.to_s.split('::')[-1]; end
alias :mn :module_name
end
module Verbosity
# Runs a block of code without warnings.
def silence_warnings(&block)
warn_level = $VERBOSE
$VERBOSE = nil
result = block.call
$VERBOSE = warn_level
warn_level = $VERBOSE; $VERBOSE = nil
result = block.call; $VERBOSE = warn_level
result
end
# Runs a block of code while blocking stdout.
def silence_stdout(log = '/dev/null')
unless Treat.core.verbosity.silence
yield; return
end
old = $stdout.dup
$stdout.reopen(File.new(log, 'w'))
yield
$stdout = old
file, old, ret = File.new(log, 'w'),
$stdout.dup, nil; $stdout.reopen(file)
ret = yield; $stdout = old; return ret
end
end

View File

@ -54,7 +54,7 @@ class Treat::Helpers::String
if @@cc_cache[o_phrase]
return @@cc_cache[o_phrase]
end
if Treat.core.acronyms.include?(phrase)
if Treat.core.acronyms.include?(phrase.downcase)
phrase = phrase.upcase
else
phrase.gsub!(Regex) { |a| a.upcase }
@ -99,12 +99,19 @@ class Treat::Helpers::String
end
# Determines whether module is
# an "-able" mixin kind of thing.
module IsMixin
def is_mixin?; to_s[-4..-1] == 'able'; end
end
# Graft the helpers onto the string module.
String.class_eval do
include Treat::Helpers::String::CamelCaseable
include Treat::Helpers::String::UnCamelCaseable
include Treat::Helpers::String::Escapable
include Treat::Helpers::String::Unescapable
include Treat::Helpers::String::IsMixin
end
# Graft camel casing onto symbols.

View File

@ -1,7 +0,0 @@
# Handles the verbosity for external
# programs (gems, binaries, etc.)
module Treat::Helpers::Verbosity
end

View File

@ -63,7 +63,7 @@ class Treat::Learning::Problem
# all of the features.
def export_features(e, include_answer = true)
features = export(e, @features)
return features unless include_answer
return features if !include_answer
features << (e.has?(@question.name) ?
e.get(@question.name) : @question.default)
features
@ -80,9 +80,11 @@ class Treat::Learning::Problem
def export(entity, exports)
unless @question.target == entity.type
targ, type = @question.target, entity.type
raise Treat::Exception,
"This classification problem targets #{@question.target}s, " +
"but a(n) #{entity.type} was passed to export instead."
"This classification problem targets " +
"#{targ}s, but a(n) #{type} " +
"was passed to export instead."
end
ret = []
exports.each do |export|
@ -116,9 +118,8 @@ class Treat::Learning::Problem
question = Treat::Learning::Question.new(
hash['question']['name'],
hash['question']['target'],
hash['question']['type'],
hash['question']['default'],
hash['question']['labels']
hash['question']['type']
)
features = []
hash['features'].each do |feature|

View File

@ -16,12 +16,9 @@ class Treat::Learning::Question
attr_reader :type
# Default for the answer to the question.
attr_reader :default
# A list of possible answers to the question.
attr_reader :labels
# Initialize the question.
def initialize(name, target,
type = :continuous, default = nil, labels = [])
def initialize(name, target, default = nil, type = :continuous)
unless name.is_a?(Symbol)
raise Treat::Exception,
"Question name should be a symbol."
@ -35,8 +32,8 @@ class Treat::Learning::Question
raise Treat::Exception, "Type should be " +
"continuous or discrete."
end
@name, @target, @type, @default, @labels =
name, target, type, default, labels
@name, @target, @type, @default =
name, target, type, default
end
# Custom comparison operator for questions.
@ -44,8 +41,7 @@ class Treat::Learning::Question
@name == question.name &&
@type == question.type &&
@target == question.target &&
@default == question.default &&
@labels = question.labels
@default == question.default
end
end

View File

@ -0,0 +1,52 @@
class Treat::Loaders::BindIt
# Keep track of whether its loaded or not.
@@loaded = {}
# Load CoreNLP package for a given language.
def self.load(klass, name, language = nil)
return if @@loaded[klass]
language ||= Treat.core.language.default
jar_path = Treat.libraries[name].jar_path ||
Treat.paths.bin + "#{name}/"
model_path = Treat.libraries[name].model_path ||
Treat.paths.models + "#{name}/"
if !File.directory?(jar_path)
raise Treat::Exception, "Looking for #{klass} " +
"library JAR files in #{jar_path}, but it is " +
"not a directory. Please set the config option " +
"Treat.libraries.#{name}.jar_path to a folder " +
"containing the appropriate JAR files."
end
if !File.directory?(model_path)
raise Treat::Exception, "Looking for #{klass} " +
"library model files in #{model_path}, but it " +
"is not a directory. Please set the config option " +
"Treat.libraries.#{name}.model_path to a folder " +
"containing the appropriate JAR files."
end
klass.jar_path = jar_path
klass.model_path = model_path
klass.use language
if Treat.core.verbosity.silence
if Gem.win_platform?
klass.log_file = 'NUL'
else
klass.log_file = '/dev/null'
end
end
klass.bind
@@loaded[klass] = true
end
end

View File

@ -10,14 +10,13 @@ class Treat::Loaders::Linguistics
# to the supplied language; raises an exception
# if there is no such language class registered.
def self.load(language)
silence_warnings do
# Linguistics throws warnings; silence them.
silence_warnings { require 'linguistics' }
code = language.to_s[0..1].upcase
@@languages[language] ||=
::Linguistics.const_get(code)
code = language.to_s[0..1].intern # FIX
unless @@languages[language]
require 'linguistics'
Linguistics.use(code)
@@languages[language] = true
end
return @@languages[language]
code
rescue RuntimeError
raise Treat::Exception,
"Ruby Linguistics does not have a module " +

View File

@ -0,0 +1,11 @@
require_relative 'bind_it'
# A helper class to load the OpenNLP package.
class Treat::Loaders::OpenNLP < Treat::Loaders::BindIt
def self.load(language = nil)
require 'open-nlp'
super(OpenNLP, :open_nlp, language)
end
end

View File

@ -1,24 +1,20 @@
# A helper class to load the CoreNLP package.
class Treat::Loaders::Stanford
# Keep track of whether its loaded or not.
@@loaded = false
require_relative 'bind_it'
# Load CoreNLP package for a given language.
def self.load(language = nil)
return if @@loaded
require 'stanford-core-nlp'
language ||= Treat.core.language.default
StanfordCoreNLP.jar_path =
Treat.libraries.stanford.jar_path ||
Treat.paths.bin + 'stanford/'
StanfordCoreNLP.model_path =
Treat.libraries.stanford.model_path ||
Treat.paths.models + 'stanford/'
StanfordCoreNLP.use(language)
StanfordCoreNLP.log_file = '/dev/null' if
Treat.core.verbosity.silence
StanfordCoreNLP.bind; @@loaded = true
end
# A helper class to load the CoreNLP package.
class Treat::Loaders::Stanford < Treat::Loaders::BindIt
end
def self.load(language = nil)
require 'stanford-core-nlp'
super(StanfordCoreNLP, :stanford, language)
end
def self.find_model(name, language)
language = language.intern
model_file = StanfordCoreNLP::Config::Models[name][language]
model_dir = StanfordCoreNLP::Config::ModelFolders[name]
model_path = Treat.libraries.stanford.model_path ||
File.join(Treat.paths.models, 'stanford')
File.join(model_path, model_dir, model_file)
end
end

View File

@ -1,13 +1,13 @@
module Treat
# Contains common utility/helper functions.
module Helpers; include Autoload; end
# Contains all the configuration options.
module Config; include Autoload; end
# Load all the configuration options.
Treat::Config.configure!
# Contains common utility/helper functions.
module Helpers; include Autoload; end
# Import all the configuration options.
Treat::Config.import!
# Contains classes to load external libraries.
module Loaders; include Autoload; end
@ -20,7 +20,10 @@ module Treat
# Contains all the worker categories.
module Workers; include Autoload; end
# Make all the worker categories.
Treat::Workers.categorize!
# Installs builders on core Ruby objects.
module Proxies; include Autoload; end

View File

@ -0,0 +1,27 @@
module Treat::Proxies
module Array
# Include base proxy functionality.
include Treat::Proxies::Proxy
def method_missing(sym, *args, &block)
if [:do, :apply].include?(sym) ||
Treat::Workers.lookup(sym)
map do |el|
if el.is_a?(Treat::Entities::Entity)
el.send(sym, *args)
else
el.to_entity.send(sym, *args)
end
end
else
super(sym, *args, &block)
end
end
end
# Include Treat methods on numerics.
::Array.class_eval do
include Treat::Proxies::Array
end
end

View File

@ -21,17 +21,18 @@ module Treat::Proxies
!Treat.core.language.detect
if is_a?(Treat::Entities::Symbol) ||
is_a?(Treat::Entities::Number)
is_a?(Treat::Entities::Number) ||
is_a?(Treat::Entities::Punctuation)
return Treat.core.language.default
end
dlvl = Treat.core.language.detect_at
dklass = Treat::Entities.const_get(dlvl.cc)
if self.class.compare_with(
dklass) < 1 && has_parent?
if self.class.compare_with(dklass) < 1
anc = ancestor_with_type(dlvl)
return anc.language if anc
return self.parent.language if has_parent?
end
extractor ||= Treat.workers.

View File

@ -10,15 +10,16 @@ module Treat::Proxies
# object and send the method call to the entity.
def method_missing(sym, *args, &block)
if [:do, :apply].include?(sym) ||
Treat::Workers::Category.lookup(sym)
to_entity.send(sym, *args)
Treat::Workers.lookup(sym)
to_entity.send(sym, *args)
else
super(sym, *args, &block)
end
end
# Create an unknown type of entity by default.
def to_entity(builder = nil)
Treat::Entities::Unknown(self.to_s)
Treat::Entities::Unknown.new(self.to_s)
end
end

View File

@ -1,12 +1,12 @@
module Treat
# The current version of Treat.
VERSION = "1.2.0"
VERSION = '2.1.0'
# Treat requires Ruby >= 1.9.2
if RUBY_VERSION < '1.9.2'
raise "Treat requires Ruby version 1.9.2 " +
"or higher, but current is #{RUBY_VERSION}."
end
end
end

View File

@ -1,51 +1,49 @@
# This module creates all the worker categories
# and the groups within these categories and adds
# the relevant hooks on the appropriate entities.
module Treat::Workers::Category
module Treat::Workers::Categorizable
require_relative 'group'
require_relative 'groupable'
# A lookup table for entity types.
@@lookup = {}
# Find a worker group based on method.
def self.lookup(method)
@@lookup[method]
end
def lookup(method); @@lookup[method]; end
def self.create_categories
def categorize!
Treat.workers.members.each do |cat|
create_category(cat.
capitalize.intern,
load_category_conf(cat))
name = cat.capitalize.intern
conf = load_category_conf(cat)
create_category(name, conf)
end
end
def self.load_category_conf(name)
config = Treat.workers[name]
if config.nil?
def load_category_conf(name)
if !Treat.workers.respond_to?(name)
raise Treat::Exception,
"The configuration file " +
"for #{cat_sym} is missing."
else
Treat.workers[name]
end
config
end
def self.create_category(name, conf)
def create_category(name, conf)
category = Treat::Workers.
const_set(name, Module.new)
conf.each_pair do |group, worker|
name = group.to_s.cc.intern
category.module_eval do
@@methods = []; def methods;
@@methods; end; def groups;
self.constants; end
@@methods = []
def methods; @@methods; end
def groups; self.constants; end
end
self.create_group(name, worker, category)
create_group(name, worker, category)
end
end
def self.create_group(name, conf, category)
def create_group(name, conf, category)
group = category.const_set(name, Module.new)
self.set_group_options(group, conf)
self.bind_group_targets(group)
@ -54,27 +52,9 @@ module Treat::Workers::Category
@@lookup[group.method] = group
end
def self.bind_group_targets(group)
group.targets.each do |entity_type|
entity = Treat::Entities.
const_get(entity_type.cc)
entity.class_eval do
add_workers group
end
end
end
def self.register_group_presets(group, conf)
return unless conf.respond_to? :presets
conf.presets.each do |m|
@@methods << m
@@lookup[m] = group
end
end
def self.set_group_options(group, conf)
def set_group_options(group, conf)
group.module_eval do
extend Treat::Workers::Group
extend Treat::Workers::Groupable
self.type = conf.type
self.targets = conf.targets
if conf.respond_to?(:default)
@ -92,6 +72,22 @@ module Treat::Workers::Category
end
end
self.create_categories
def bind_group_targets(group)
group.targets.each do |entity_type|
entity = Treat::Entities.
const_get(entity_type.cc)
entity.class_eval do
add_workers group
end
end
end
def register_group_presets(group, conf)
return unless conf.respond_to?(:presets)
conf.presets.each do |method|
@@methods << method
@@lookup[method] = group
end
end
end

View File

@ -0,0 +1,35 @@
# The C extension uses char* strings, and so Unicode strings
# will give incorrect distances. Need to provide a pure
# implementation if that's the case (FIX).
class Treat::Workers::Extractors::Distance::Levenshtein
require 'levenshtein'
DefaultOptions = {
ins_cost: 1,
del_cost: 1,
sub_cost: 1
}
@@matcher = nil
# Return the levensthein distance between
# two strings taking into account the costs
# of insertion, deletion, and substitution.
def self.distance(entity, options)
options = DefaultOptions.merge(options)
unless options[:to]
raise Treat::Exception, "Must supply " +
"a string/entity to compare to using " +
"the option :to for this worker."
end
a, b = entity.to_s, options[:to].to_s
Levenshtein.distance(a, b)
end
end

View File

@ -23,19 +23,16 @@ class Treat::Workers::Extractors::Keywords::TfIdf
tf_idfs = tf_idfs.
sort_by {|k,v| v}.reverse
if tf_idfs.size <= options[:number]
return tf_idfs
end
keywords = []
i = 0
max_count = tf_idfs.size < options[:number] ? tf_idfs.size : options[:number]
tf_idfs.each do |word|
w = word[0].to_s
next if keywords.include?(w)
break if i > options[:number]
break if i > max_count
keywords << w
i += 1

View File

@ -1,9 +1,9 @@
# Language detection using a probabilistic algorithm
# that checks for the presence of words with Bloom
# that checks for the presence of words with Bloom
# filters built from dictionaries for each language.
#
# Original paper: Grothoff. 2007. A Quick Introduction to
# Bloom Filters. Department of Computer Sciences, Purdue
# Original paper: Grothoff. 2007. A Quick Introduction to
# Bloom Filters. Department of Computer Sciences, Purdue
# University.
class Treat::Workers::Extractors::Language::WhatLanguage
@ -35,7 +35,7 @@ class Treat::Workers::Extractors::Language::WhatLanguage
options = DefaultOptions.merge(options)
@@detector ||= ::WhatLanguage.new(:possibilities)
@@detector ||= ::WhatLanguage.new(:all)
possibilities = @@detector.process_text(entity.to_s)
lang = {}

View File

@ -1,7 +1,7 @@
# Named entity tag extraction using the Stanford NLP
# Deterministic Coreference Resolver, which implements a
# multi-pass sieve coreference resolution (or anaphora
# resolution) system.
# resolution) system based on conditional random fields.
#
# Original paper: Heeyoung Lee, Yves Peirsman, Angel
# Chang, Nathanael Chambers, Mihai Surdeanu, Dan Jurafsky.
@ -16,32 +16,24 @@ class Treat::Workers::Extractors::NameTag::Stanford
def self.name_tag(entity, options = {})
pp = nil
language = entity.language
Treat::Loaders::Stanford.load(language)
isolated_token = entity.is_a?(Treat::Entities::Token)
tokens = isolated_token ? [entity] : entity.tokens
ms = StanfordCoreNLP::Config::Models[:ner][language]
model_path = Treat.libraries.stanford.model_path ||
(Treat.paths.models + '/stanford/')
ms = model_path + '/' +
StanfordCoreNLP::Config::ModelFolders[:ner] +
ms['3class']
@@classifiers[language] ||=
StanfordCoreNLP::CRFClassifier.
getClassifier(ms)
unless classifier = @@classifiers[language]
model = Treat::Loaders::Stanford.find_model(:ner, language)
unless StanfordCoreNLP.const_defined?('CRFClassifier')
StanfordCoreNLP.load_class('CRFClassifier', 'edu.stanford.nlp.ie.crf')
end
classifier = StanfordCoreNLP::CRFClassifier.getClassifier(model)
@@classifiers[language] = classifier
end
token_list = StanfordCoreNLP.get_list(tokens)
sentence = @@classifiers[language].
classify_sentence(token_list)
sentence = classifier.classify_sentence(token_list)
i = 0
n = 0
sentence.each do |s_token|
tag = s_token.get(:answer).to_s.downcase
@ -49,14 +41,9 @@ class Treat::Workers::Extractors::NameTag::Stanford
return tag if isolated_token
if tag
tokens[i].set :name_tag, tag
n += 1
end
i += 1
end
entity.set :named_entity_count, n
nil
end

View File

@ -0,0 +1,38 @@
# Similarity measure for short strings such as person names.
# C extension won't work for Unicode strings; need to set
# extension to "pure" in that case (FIX).
class Treat::Workers::Extractors::Similarity::JaroWinkler
require 'fuzzystringmatch'
DefaultOptions = {
threshold: 0.7,
implementation: nil
}
@@matcher = nil
def self.similarity(entity, options={})
options = DefaultOptions.merge(options)
unless options[:to]
raise Treat::Exception, "Must supply " +
"a string/entity to compare to using " +
"the option :to for this worker."
end
unless @@matcher
impl = options[:implementation]
impl ||= defined?(JRUBY_VERSION) ? :pure : :native
klass = FuzzyStringMatch::JaroWinkler
@@matcher = klass.create(impl)
end
a, b = entity.to_s, options[:to].to_s
@@matcher.getDistance(a, b)
end
end

View File

@ -0,0 +1,43 @@
# Calculates the TF*IDF score of words.
class Treat::Workers::Extractors::Similarity::TfIdf
require 'tf-idf-similarity'
def self.similarity(entity, options={})
raise 'Not currently implemented.'
unless options[:to] &&
options[:to].type == :document
raise Treat::Exception, 'Must supply ' +
'a document to compare to using ' +
'the option :to for this worker.'
end
unless options[:to].parent_collection &&
entity.parent_collection
raise Treat::Exception, 'The TF*IDF ' +
'similarity algorithm can only be applied ' +
'to documents that are inside collections.'
end
coll = TfIdfSimilarity::Collection.new
entity.each_document do |doc|
tdoc = TfIdfSimilarity::Document.new(doc.to_s)
term_counts = Hash.new(0)
doc.each_word do |word|
val = word.value.downcase
term_counts[val] ||= 0.0
term_counts[val] += 1.0
end
size = term_counts.values.reduce(:+)
tdoc.instance_eval do
@term_counts, @size = term_counts, size
end
coll << tdoc
end
puts coll.similarity_matrix.inspect
end
end

View File

@ -0,0 +1,20 @@
# Time/date extraction using a simple rule-based library.
#
# Supported formats: Today, yesterday, tomorrow,
# last thursday, this thursday, 14 Sep, 14 June 2010.
# Any dates without a year are assumed to be in the past.
class Treat::Workers::Extractors::Time::Kronic
require 'kronic'
require 'date'
# Return the date information contained within
# the entity by parsing it with the 'chronic' gem.
#
# Options: none.
def self.time(entity, options = {})
time = Kronic.parse(entity.to_s)
time.is_a?(DateTime) ? time : nil
end
end

View File

@ -53,9 +53,9 @@ class Treat::Workers::Extractors::TopicWords::LDA
# Run the EM algorithm using random
# starting points
silence_stdout do
lda.em('random')
end
Treat.core.verbosity.silence ?
silence_stdout { lda.em('random') } :
lda.em('random')
# Load the vocabulary.
if options[:vocabulary]

View File

@ -3,7 +3,7 @@ class Treat::Workers::Formatters::Readers::Autoselect
ExtensionRegexp = /^.*?\.([a-zA-Z0-9]{2,5})$/
ImageExtensions = ['gif', 'jpg', 'jpeg', 'png']
DefaultOptions = {
:default_to => 'txt'
:default_to => 'document'
}
# Choose a reader to use.
@ -12,7 +12,9 @@ class Treat::Workers::Formatters::Readers::Autoselect
# - (Symbol) :default_to => format to default to.
def self.read(document, options = {})
options = DefaultOptions.merge(options)
document.read(detect_format(document.file, options[:default_to]))
fmt = detect_format(document.file, options[:default_to])
Treat::Workers::Formatters::Readers.
const_get(fmt.cc).read(document,options)
end
def self.detect_format(filename, default_to = nil)

View File

@ -0,0 +1,17 @@
require 'yomu'
# This class is a wrapper for Yomu.
# Yomu is a library for extracting text and metadata from files and documents
# using the Apache Tika content analysis toolkit.
class Treat::Workers::Formatters::Readers::Document
# Extract the readable text from any document.
#
# Options: none.
def self.read(document, options = {})
yomu = Yomu.new(document.file)
document.value = yomu.text
document.set :format, yomu.mimetype.extensions.first
document
end
end

View File

@ -11,7 +11,8 @@ class Treat::Workers::Formatters::Readers::HTML
# By default, don't backup the original HTML
DefaultOptions = {
:keep_html => false,
:tags => %w[p div h1 h2 h3 ul ol dl dt li]
:tags => %w[p div h1 h2 h3 ul ol dl dt li img],
}
# Read the HTML document and strip it of its markup.
@ -46,6 +47,7 @@ class Treat::Workers::Formatters::Readers::HTML
d = Readability::Document.new(html, options)
document.value = "<h1>#{d.title}</h1>\n" + d.content
document.set :format, 'html'
document.set :images, d.images
end
document

View File

@ -7,8 +7,8 @@
# statistical natural language modeling, and multi-
# lingual capabilities."
#
# Original paper: Google Ocropus Engine: Breuel,
# Thomas M. The Ocropus Open Source OCR System.
# Original paper: Google Ocropus Engine: Breuel,
# Thomas M. The Ocropus Open Source OCR System.
# DFKI and U. Kaiserslautern, Germany.
class Treat::Workers::Formatters::Readers::Image
@ -18,29 +18,31 @@ class Treat::Workers::Formatters::Readers::Image
#
# - (Boolean) :silent => whether to silence Ocropus.
def self.read(document, options = {})
read = lambda do |doc|
self.create_temp_dir do |tmp|
`ocropus book2pages #{tmp}/out #{doc.file}`
`ocropus pages2lines #{tmp}/out`
`ocropus lines2fsts #{tmp}/out`
`ocropus buildhtml #{tmp}/out > #{tmp}/output.html`
doc.set :file, "#{tmp}/output.html"
`ocropus-nlbin -o #{tmp}/out #{doc.file}`
`ocropus-gpageseg #{tmp}/out/????.bin.png --minscale 2`
`ocropus-rpred #{tmp}/out/????/??????.bin.png`
`ocropus-hocr #{tmp}/out/????.bin.png -o #{tmp}/book.html`
doc.set :file, "#{tmp}/book.html"
doc.set :format, :html
doc = doc.read(:html)
doc.set :file, f
doc.set :format, 'image'
end
end
options[:silent] ? silence_stdout {
Treat.core.verbosity.silence ? silence_stdout {
read.call(document) } : read.call(document)
document
end
# Create a dire that gets deleted after execution of the block.
# Create a dir that gets deleted after execution of the block.
def self.create_temp_dir(&block)
if not FileTest.directory?(Treat.paths.tmp)
FileUtils.mkdir(Treat.paths.tmp)
end
dname = Treat.paths.tmp +
"#{Random.rand(10000000).to_s}"
Dir.mkdir(dname)
@ -48,5 +50,5 @@ class Treat::Workers::Formatters::Readers::Image
ensure
FileUtils.rm_rf(dname)
end
end

View File

@ -32,6 +32,9 @@ class Treat::Workers::Formatters::Readers::PDF
# Create a temporary file which is deleted
# after execution of the block.
def self.create_temp_file(ext, value = nil, &block)
if not FileTest.directory?(Treat.paths.tmp)
FileUtils.mkdir(Treat.paths.tmp)
end
fname = Treat.paths.tmp +
"#{Random.rand(10000000).to_s}.#{ext}"
File.open(fname, 'w') do |f|

View File

@ -30,7 +30,7 @@ class Treat::Workers::Formatters::Readers::XML
@@xml_reader ||= StanfordCoreNLP.load(
:tokenize, :ssplit, :cleanxml)
text = StanfordCoreNLP::Text.new(xml)
text = StanfordCoreNLP::Annotation.new(xml)
@@xml_reader.annotate(text)
text.get(:sentences).each do |sentence|

View File

@ -9,18 +9,19 @@ class Treat::Workers::Formatters::Serializers::XML
# - (String) :file => a file to write to.
def self.serialize(entity, options = {})
options[:file] ||= (entity.id.to_s + '.xml')
if options[:indent].nil?
options = options.merge({:indent => 0})
end
indent = options[:indent]
if options[:indent] == 0
enc = entity.to_s.encoding.to_s.downcase
string = "<?xml version=\"1.0\" " +
"encoding=\"#{enc}\" ?>\n<treat>\n"
else
string = ''
end
spaces = ''
options[:indent] = 0
enc = entity.to_s.encoding.to_s.downcase
string = "<?xml version=\"1.0\" " +
"encoding=\"#{enc}\" ?>\n<treat>\n"
val = self.recurse(entity, options)
string += "#{val}\n</treat>"
File.open(options[:file], 'w') do |f|
f.write(string)
end; return options[:file]
end
def self.recurse(entity, options)
spaces, string = '', ''
options[:indent].times { spaces << ' ' }
attributes = " id='#{entity.id}'"
if !entity.features.nil? && entity.features.size != 0
@ -56,27 +57,16 @@ class Treat::Workers::Formatters::Serializers::XML
if entity.has_children?
options[:indent] += 1
entity.children.each do |child|
string =
string +
serialize(child, options)
string += self.recurse(child, options)
end
options[:indent] -= 1
else
string = string + "#{escape(entity.value)}"
string += "#{escape(entity.value)}"
end
unless entity.is_a?(Treat::Entities::Token)
string += "#{spaces}"
end
string += "</#{tag}>\n"
if indent == 0
string += "\n</treat>"
if options[:file]
File.open(options[:file], 'w') do |f|
f.write(string)
end
end
end
options[:file]
end
def self.escape(input)

View File

@ -3,7 +3,7 @@ class Treat::Workers::Formatters::Serializers::YAML
silence_warnings do
# Require the Psych YAML serializer.
require 'psych'
require 'yaml'
end
# Serialize an entity in YAML format.
@ -11,7 +11,7 @@ class Treat::Workers::Formatters::Serializers::YAML
# Options:
# - (String) :file => a file to write to.
def self.serialize(entity, options = {})
yaml = ::Psych.dump(entity)
yaml = ::YAML.dump(entity)
options[:file] ||= (entity.id.to_s + '.yml')
if options[:file]
File.open(options[:file], 'w') do |f|

View File

@ -17,7 +17,7 @@ class Treat::Workers::Formatters::Unserializers::Mongo
@@database ||= Mongo::Connection.
new(Treat.databases.mongo.host).
db(Treat.databases.mongo.db || db)
db(db || Treat.databases.mongo.db)
supertype = Treat::Entities.const_get(
entity.type.to_s.capitalize.intern).superclass.mn.downcase

View File

@ -65,6 +65,7 @@ class Treat::Workers::Formatters::Unserializers::XML
value = v
else
v = v[1..-1].intern if v[0] == ':'
v = ":".intern if v == :''
v = v.to_i if v =~ /^[0-9]*$/
v = v.to_f if v =~ /^[0-9\.]*$/
v = false if v == 'false'

View File

@ -3,7 +3,7 @@ class Treat::Workers::Formatters::Unserializers::YAML
silence_warnings do
# Require the Psych YAML parser.
require 'psych'
require 'yaml'
end
# Require date to revive DateTime.
@ -13,7 +13,7 @@ class Treat::Workers::Formatters::Unserializers::YAML
#
# Options: none.
def self.unserialize(document, options = {})
document << ::Psych.load(
document << ::YAML.load(
File.read(document.file))
document
end

View File

@ -1,11 +1,12 @@
module Treat::Workers::Group
module Treat::Workers::Groupable
# Lazily load the worker classes in the group.
def const_missing(const)
bits = self.ancestors[0].to_s.split('::')
bits.collect! { |bit| bit.ucc }
file = bits.join('/') + "/#{const.ucc}"
if not File.readable?(Treat.paths.lib + "#{file}.rb")
path = Treat.paths.lib + "#{file}.rb"
if not File.readable?(path)
raise Treat::Exception,
"File '#{file}.rb' corresponding to " +
"requested worker #{self}::#{const} " +
@ -14,7 +15,7 @@ module Treat::Workers::Group
require file
if not self.const_defined?(const)
raise Treat::Exception,
"File #{file} does not define " +
"File #{file}.rb does not define " +
"#{self}::#{const}."
end
const_get(const)
@ -69,9 +70,7 @@ module Treat::Workers::Group
# Get constants in this module, excluding by
# default those defined by parent modules.
def const_get(const)
super(const, false)
end
def const_get(const); super(const, false); end
# Modify the extended class.
def self.extended(group)

View File

@ -35,9 +35,9 @@ class Treat::Workers::Inflectors::Cardinalizers::Linguistics
# More specific options when using :type => :ordinal:
def self.cardinal(entity, options = {})
options = DefaultOptions.merge(options)
Treat::Loaders::Linguistics.
load(options[:language]).
numwords(entity.to_s, options)
lang = entity.language
code = Treat::Loaders::Linguistics.load(lang)
entity.to_s.send(code).numwords(options)
end
end

View File

@ -35,13 +35,15 @@ module Treat::Workers::Inflectors::Conjugators::Linguistics
options = Forms[options[:form].to_s] if options[:form]
klass = Treat::Loaders::Linguistics.load(entity.language)
code = Treat::Loaders::Linguistics.load(entity.language)
obj = entity.to_s.send(code)
if options[:mode] == 'infinitive'
silence_warnings { klass.infinitive(entity.to_s) }
obj.infinitive
elsif options[:mode] == 'participle' && options[:tense] == 'present'
silence_warnings { klass.present_participle(entity.to_s) }
obj.present_participle
elsif options[:count] == 'plural' && options.size == 1
silence_warnings { klass.plural_verb(entity.to_s) }
obj.plural_verb
else
raise Treat::Exception,
'This combination of modes, tenses, persons ' +

View File

@ -21,9 +21,9 @@ class Treat::Workers::Inflectors::Declensors::English
'option count ("singular" or "plural").'
end
string = entity.to_s
if options[:count] == 'plural'
if options[:count].to_s == 'plural'
Inflect.plural(string)
elsif options[:count] == 'singular'
elsif options[:count].to_s == 'singular'
Inflect.singular(string)
end
end

View File

@ -17,34 +17,27 @@ class Treat::Workers::Inflectors::Declensors::Linguistics
cat = entity.get(:category)
return if cat && !POS.include?(cat)
unless options[:count]
raise Treat::Exception, 'Must supply ' +
':count option ("singular" or "plural").'
end
klass = Treat::Loaders::
Linguistics.load(entity.language)
string = entity.to_s
if options[:count] == 'plural'
if (entity.has?(:category))
result = ''
silence_warnings do
result = klass.send(
:"plural_#{entity.category}",
string)
end
return result
else
return klass.plural(string)
end
else
unless options[:count].to_s == 'plural'
raise Treat::Exception,
"Ruby Linguistics does not support " +
"singularization of words."
end
lang = entity.language
code = Treat::Loaders::Linguistics.load(lang)
obj = entity.to_s.send(code)
if cat = entity.get(:category)
method = "plural_#{cat}"
obj.send(method)
else; obj.plural; end
end
end

View File

@ -11,11 +11,11 @@ class Treat::Workers::Inflectors::Ordinalizers::Linguistics
# Desribe a number in words in ordinal form, using the
# 'linguistics' gem.
def self.ordinal(number, options = {})
def self.ordinal(entity, options = {})
options = DefaultOptions.merge(options)
klass = Treat::Loaders::
Linguistics.load(options[:language])
klass.ordinate(number.to_s)
lang = entity.language
code = Treat::Loaders::Linguistics.load(lang)
entity.to_s.send(code).ordinate
end
end

View File

@ -12,24 +12,20 @@ class Treat::Workers::Learners::Classifiers::ID3
@@classifiers = {}
def self.classify(entity, options = {})
set = options[:training]
cl = set.problem
if !@@classifiers[cl]
dset = options[:training]
prob = dset.problem
if !@@classifiers[prob]
dec_tree = DecisionTree::ID3Tree.new(
cl.feature_labels.map { |l| l.to_s },
set.items.map { |i| i[:features]},
cl.question.default, cl.question.type)
prob.feature_labels.map { |l| l.to_s },
dset.items.map { |i| i[:features] },
prob.question.default, prob.question.type)
dec_tree.train
@@classifiers[cl] = dec_tree
@@classifiers[prob] = dec_tree
else
dec_tree = @@classifiers[cl]
dec_tree.graph('testingbitch')
dec_tree = @@classifiers[prob]
end
dec_tree.predict(
cl.export_features(entity, false)
)
vect = prob.export_features(entity, false)
dec_tree.predict(vect)
end
end

View File

@ -11,35 +11,23 @@ class Treat::Workers::Learners::Classifiers::Linear
}
def self.classify(entity, options = {})
options = DefaultOptions.merge(options)
set = options[:training]
problem = set.problem
if !@@classifiers[problem]
labels = problem.question.labels
unless labels
raise Treat::Exception,
"LibLinear requires that you provide the possible " +
"labels to assign to classification items when " +
"specifying the question."
end
param = LParameter.new
param.solver_type = options[:solver_type]
param.eps = options[:eps]
bias = options[:bias]
data = set.items.map do |item|
self.array_to_hash(item[:features])
end
prob = LProblem.new(labels, data, bias)
@@classifiers[problem] =
LModel.new(prob, param)
dset = options[:training]
prob, items = dset.problem, dset.items
if !@@classifiers[prob]
lparam = LParameter.new
lparam.solver_type = options[:solver_type]
lparam.eps = options[:eps]
lbls = items.map { |it| it[:features][-1] }
exs = items.map { |it| it[:features][0..-2] }.
map { |ary| self.array_to_hash(ary) }
lprob = LProblem.new(lbls, exs, options[:bias])
model = LModel.new(lprob, lparam)
@@classifiers[prob] = model
end
@@classifiers[problem].predict(
self.array_to_hash(problem.
export_features(entity, false)))
features = prob.export_features(entity, false)
@@classifiers[prob].predict(
self.array_to_hash(features))
end
def self.array_to_hash(array)

View File

@ -1,30 +1,43 @@
# Classification based on a multilayer perceptron.
class Treat::Workers::Learners::Classifiers::MLP
require 'ai4r'
require 'ruby_fann/neural_network'
@@mlps = {}
DefaultOptions = {
num_inputs: 3,
hidden_neurons: [2, 8, 4, 3, 4],
num_outputs: 1,
max_neurons: 1000,
neurons_between_reports: 1,
desired_error: 0.1
}
@@classifiers = {}
def self.classify(entity, options = {})
set = options[:training]
cl = set.problem
if !@@mlps[cl]
net = Ai4r::NeuralNetwork::Backpropagation.new(
[cl.feature_labels.size, 3, 1])
set.items.each do |item|
inputs = item[:features][0..-2]
outputs = [item[:features][-1]]
net.train(inputs, outputs)
options = DefaultOptions.merge(options)
dset = options[:training]
prob, items = dset.problem, dset.items
if !@@classifiers[prob]
fann = RubyFann::Standard.new(options)
inputs = items.map { |it| it[:features][0..-2] }
outputs = items.map { |it| [it[:features][-1]] }
training = silence_stdout do
RubyFann::TrainData.new(inputs:
inputs, desired_outputs: outputs)
end
@@mlps[cl] = net
params = [options[:max_neurons],
options[:neurons_between_reports],
options[:desired_error]]
fann.train_on_data(training, *params)
@@classifiers[prob] = fann
else
net = @@mlps[cl]
fann = @@classifiers[prob]
end
net.eval(cl.export_features(entity, false))[0]
vect = prob.export_features(entity, false)
Treat.core.verbosity.silence ?
silence_stdout { fann.run(vect)[0] } :
fann.run(vect)[0]
end
end

View File

@ -5,7 +5,7 @@ class Treat::Workers::Learners::Classifiers::SVM
@@classifiers = {}
DefaultOptions = {
cache_size: 1,
cache_size: 1, # in MB
eps: 0.001,
c: 10
}
@ -14,35 +14,25 @@ class Treat::Workers::Learners::Classifiers::SVM
# - (Numeric) :eps => tolerance of termination criterion
# - (Numeric) :c => C parameter
def self.classify(entity, options = {})
options = DefaultOptions.merge(options)
set = options[:training]
problem = set.problem
if !@@classifiers[problem]
labels = problem.question.labels
unless labels
raise Treat::Exception,
"LibSVM requires that you provide the possible " +
"labels to assign to classification items when " +
"specifying the question."
end
examples = set.items.map { |item| item[:features] }
prob = Libsvm::Problem.new
prob.set_examples(labels, examples)
param = Libsvm::SvmParameter.new
param.cache_size = options[:cache_size]
param.eps = options[:eps]
param.c = options[:c]
model = Libsvm::Model.train(problem, parameter)
@@classifiers[problem] = model
dset = options[:training]
prob, items = dset.problem, dset.items
if !@@classifiers[prob]
lprob = Libsvm::Problem.new
lparam = Libsvm::SvmParameter.new
lparam.cache_size = options[:cache_size]
lparam.eps = options[:eps]
lparam.c = options[:c]
llabels = items.map { |it| it[:features][-1] }
lexamples = items.map { |it| it[:features][0..-2] }.
map { |ary| Libsvm::Node.features(ary) }
lprob.set_examples(llabels, lexamples)
model = Libsvm::Model.train(lprob, lparam)
@@classifiers[prob] = model
end
features = problem.export_features(entity, false)
@@classifiers[problem].predict(
Libsvm::Node.features(*features))
features = prob.export_features(entity, false)
@@classifiers[prob].predict(
Libsvm::Node.features(features))
end
end

View File

@ -28,8 +28,9 @@ class Treat::Workers::Lexicalizers::Categorizers::FromTag
tag = entity.check_has(:tag)
return 'unknown' if tag.nil? || tag == '' || entity.type == :symbol
return 'sentence' if tag == 'S' || entity.type == :sentence
return 'unknown' if tag.nil? || tag == ''
return 'fragment' if tag == 'F'
return 'sentence' if tag == 'S'
return 'number' if entity.type == :number
return Ptc[entity.to_s] if entity.type == :punctuation

View File

@ -1,62 +1,79 @@
# Sense information (synonyms, antonyms, hypernyms
# and hyponyms) obtained through a Ruby parser that
# accesses Wordnet flat files.
#
# Original paper: George A. Miller (1995). WordNet:
# A Lexical Database for English. Communications of
#
# Original paper: George A. Miller (1995). WordNet:
# A Lexical Database for English. Communications of
# the ACM Vol. 38, No. 11: 39-41.
class Treat::Workers::Lexicalizers::Sensers::Wordnet
# Require the 'wordnet' gem (install as 'rwordnet').
require 'wordnet'
# Patch for bug.
::WordNet.module_eval do
remove_const(:SynsetType)
const_set(:SynsetType,
remove_const(:SYNSET_TYPES)
const_set(:SYNSET_TYPES,
{"n" => "noun", "v" => "verb", "a" => "adj"})
end
# Require an adaptor for Wordnet synsets.
require_relative 'wordnet/synset'
# Noun, adjective and verb indexes.
@@indexes = {}
# Obtain lexical information about a word using the
# ruby 'wordnet' gem.
def self.sense(word, options = nil)
category = word.check_has(:category)
unless options[:nym]
if !options[:nym]
raise Treat::Exception, "You must supply " +
"the :nym option (:synonym, :hypernym, etc.)"
"the :nym option ('synonyms', 'hypernyms', etc.)"
end
if !options[:nym].is_a?(Symbol)
options[:nym] = options[:nym].intern
end
if ![:synonyms, :antonyms,
:hypernyms, :hyponyms].include?(options[:nym])
raise Treat::Exception, "You must supply " +
"a valid :nym option ('synonyms', 'hypernyms', etc.)"
end
unless ['noun', 'adjective', 'verb'].
include?(word.category)
return []
end
cat = category.to_s.capitalize
@@indexes[cat] ||=
::WordNet.const_get(cat + 'Index').instance
lemma = @@indexes[cat].find(word.value.downcase)
cat = abbreviate(category)
lemma = ::WordNet::Lemma.find(word.value.downcase, cat)
return [] if lemma.nil?
synsets = []
lemma.synsets.each do |synset|
synsets <<
synsets <<
Treat::Workers::Lexicalizers::Sensers::Wordnet::Synset.new(synset)
end
((synsets.collect do |ss|
ss.send(options[:nym])
end - [word.value]).flatten).uniq
end - [word.value]).
flatten).uniq.map do |token|
token.gsub('_', ' ')
end
end
end
def self.abbreviate category
if category == 'adjective'
:adj
elsif category == 'adverb'
:adv
else
category.to_sym
end
end
end

View File

@ -40,15 +40,15 @@ class Treat::Workers::Lexicalizers::Taggers::Brill
return pair[1] if isolated_token
end
if entity.is_a?(Treat::Entities::Sentence) ||
(entity.is_a?(Treat::Entities::Phrase) &&
!entity.parent_sentence)
if entity.is_a?(Treat::Entities::Group) &&
!entity.parent_sentence
entity.set :tag_set, :penn
end
return 'S' if entity.is_a?(Treat::Entities::Sentence)
return 'P' if entity.is_a?(Treat::Entities::Phrase)
return 'F' if entity.is_a?(Treat::Entities::Fragment)
return 'G' if entity.is_a?(Treat::Entities::Group)
end
end

View File

@ -61,15 +61,16 @@ class Treat::Workers::Lexicalizers::Taggers::Lingua
end
if entity.is_a?(Treat::Entities::Sentence) ||
(entity.is_a?(Treat::Entities::Phrase) &&
!entity.parent_sentence)
if entity.is_a?(Treat::Entities::Group) &&
!entity.parent_sentence
entity.set :tag_set, :penn
end
return 'S' if entity.is_a?(Treat::Entities::Sentence)
return 'P' if entity.is_a?(Treat::Entities::Phrase)
return 'F' if entity.is_a?(Treat::Entities::Fragment)
return 'G' if entity.is_a?(Treat::Entities::Group)
end
end

View File

@ -1,15 +1,15 @@
# POS tagging using (i) explicit use of both preceding
# and following tag contexts via a dependency network
# representation, (ii) broad use of lexical features,
# including jointly conditioning on multiple consecutive
# words, (iii) effective use of priors in conditional
# loglinear models, and (iv) fine-grained modeling of
# unknown word features.
# POS tagging using a maximum entropy model, with (i)
# explicit use of both preceding and following tag
# contexts via a dependency network representation,
# (ii) broad use of lexical features, including jointly
# conditioning on multiple consecutive words, (iii)
# effective use of priors in conditional loglinear models,
# and (iv) fine-grained modeling of unknown word features.
#
# Original paper: Toutanova, Manning, Klein and Singer.
# 2003. Feature-Rich Part-of-Speech Tagging with a
# Cyclic Dependency Network. In Proceedings of the
# Conference of the North American Chapter of the
# 2003. Feature-Rich Part-of-Speech Tagging with a
# Cyclic Dependency Network. In Proceedings of the
# Conference of the North American Chapter of the
# Association for Computational Linguistics.
class Treat::Workers::Lexicalizers::Taggers::Stanford
@ -25,34 +25,32 @@ class Treat::Workers::Lexicalizers::Taggers::Stanford
def self.tag(entity, options = {})
# Handle tags for sentences and phrases.
if entity.is_a?(Treat::Entities::Sentence) ||
(entity.is_a?(Treat::Entities::Phrase) &&
!entity.parent_sentence)
if entity.is_a?(Treat::Entities::Group) &&
!entity.parent_sentence
tag_set = options[:tag_set]
entity.set :tag_set, tag_set
end
if entity.is_a?(Treat::Entities::Sentence)
return 'S'
elsif entity.is_a?(Treat::Entities::Phrase)
return 'P'
end
return 'S' if entity.is_a?(Treat::Entities::Sentence)
return 'P' if entity.is_a?(Treat::Entities::Phrase)
return 'F' if entity.is_a?(Treat::Entities::Fragment)
return 'G' if entity.is_a?(Treat::Entities::Group)
# Handle options and initialize the tagger.
lang = entity.language
options = get_options(options, lang)
lang = entity.language.intern
init_tagger(lang) unless @@taggers[lang]
tokens, list = get_token_list(entity)
options = get_options(options, lang)
tokens, t_list = get_token_list(entity)
# Do the tagging.
i = 0
isolated_token = entity.is_a?(Treat::Entities::Token)
@@taggers[lang].apply(list).each do |tok|
tokens[i].set :tag, tok.tag
tokens[i].set :tag_set,
options[:tag_set] if isolated_token
@@taggers[lang].apply(t_list).each do |tok|
tokens[i].set(:tag, tok.tag.split('-').first)
tokens[i].set(:tag_set,
options[:tag_set]) if isolated_token
return tok.tag if isolated_token
i += 1
end
@ -61,21 +59,24 @@ class Treat::Workers::Lexicalizers::Taggers::Stanford
# Initialize the tagger for a language.
def self.init_tagger(language)
Treat::Loaders::Stanford.load(language)
model = StanfordCoreNLP::Config::Models[:pos][language]
model_path = Treat.libraries.stanford.model_path ||
Treat.paths.models + 'stanford/'
model = model_path + StanfordCoreNLP::
Config::ModelFolders[:pos] + model
@@taggers[language] ||=
StanfordCoreNLP::MaxentTagger.new(model)
unless @@taggers[language]
Treat::Loaders::Stanford.load(language)
unless StanfordCoreNLP.const_defined?('MaxentTagger')
StanfordCoreNLP.load_class('MaxentTagger',
'edu.stanford.nlp.tagger.maxent')
end
model = Treat::Loaders::Stanford.find_model(:pos,language)
tagger = StanfordCoreNLP::MaxentTagger.new(model)
@@taggers[language] = tagger
end
@@taggers[language]
end
# Handle the options for the tagger.
def self.get_options(options, language)
options = DefaultOptions.merge(options)
if options[:tagger_model]
::StanfordCoreNLP.set_model('pos.model',
StanfordCoreNLP.set_model('pos.model',
options[:tagger_model])
end
options[:tag_set] =

View File

@ -2,16 +2,13 @@ class Treat::Workers::Processors::Chunkers::Autoselect
def self.chunk(entity, options = {})
unless entity.has?(:format)
raise Treat::Exception,
"Must have a format to autoselect chunker."
entity.set :format, 'txt'
end
begin
k = Treat::Workers::Processors::
Chunkers.const_get(entity.format.cc)
k = Treat::Workers::Processors::Chunkers.const_get(entity.format.cc)
k.chunk(entity, options)
rescue Treat::Exception
Treat::Workers::Processors::
Chunkers::TXT.chunk(entity, options)
Treat::Workers::Processors::Chunkers::TXT.chunk(entity, options)
end
end

View File

@ -3,12 +3,9 @@ class Treat::Workers::Processors::Chunkers::HTML
require 'nokogiri'
def self.chunk(entity, options = {})
entity.check_hasnt_children
doc = Nokogiri::HTML(entity.value)
recurse(entity, doc)
self.recurse(entity, doc)
end
def self.recurse(node, html_node, level = 1)
@ -16,7 +13,6 @@ class Treat::Workers::Processors::Chunkers::HTML
html_node.children.each do |child|
next if child.name == 'text'
txt = child.inner_text
if child.name =~ /^h([0-9]{1})$/ ||

View File

@ -12,16 +12,13 @@ class Treat::Workers::Processors::Chunkers::TXT
zones.each do |zone|
zone.strip!
next if zone == ''
c = Treat::Entities::
Zone.from_string(zone)
c = Treat::Entities::Zone.from_string(zone)
if c.type == :title
if current.type == :section
current = current.parent
current = entity << Treat::
Entities::Section.new
current = entity << Treat::Entities::Section.new
else
current = entity << Treat::
Entities::Section.new
current = entity << Treat::Entities::Section.new
end
end
current << c

View File

@ -1,150 +1,88 @@
# Parsing using an interface to a Java implementation
# of probabilistic natural language parsers, both
# optimized PCFG and lexicalized dependency parsers,
# and a lexicalized PCFG parser.
#
# Original paper: Dan Klein and Christopher D.
# Manning. 2003. Accurate Unlexicalized Parsing.
# Proceedings of the 41st Meeting of the Association
# Parsing using an interface to a Java implementation
# of probabilistic natural language parsers, both
# optimized PCFG and lexicalized dependency parsers,
# and a lexicalized PCFG parser.
#
# Original paper: Dan Klein and Christopher D.
# Manning. 2003. Accurate Unlexicalized Parsing.
# Proceedings of the 41st Meeting of the Association
# for Computational Linguistics, pp. 423-430.
class Treat::Workers::Processors::Parsers::Stanford
Pttc = Treat.tags.aligned.phrase_tags_to_category
# Hold one instance of the pipeline per language.
@@parsers = {}
DefaultOptions = {
:parser_model => nil,
:tagger_model => nil
}
DefaultOptions = { model: nil }
# Parse the entity using the Stanford parser.
#
# Options:
#
# - (Boolean) :silent => whether to silence the output
# of the JVM.
# - (String) :log_file => a filename to log output to
# instead of displaying it.
def self.parse(entity, options = {})
entity.check_hasnt_children
val, lang = entity.to_s, entity.language.intern
val = entity.to_s
lang = entity.language
init(lang, options)
Treat::Loaders::Stanford.load(lang)
tag_set = StanfordCoreNLP::Config::TagSets[lang]
text = ::StanfordCoreNLP::Text.new(val)
@@parsers[lang].annotate(text)
text.get(:sentences).each do |s|
if entity.is_a?(Treat::Entities::Sentence) ||
entity.is_a?(Treat::Entities::Phrase)
tag = s.get(:category).to_s
tag_s, tag_opt = *tag.split('-')
tag_s ||= 'S'
entity.set :tag, tag_s
entity.set :tag_opt, tag_opt if tag_opt
recurse(s.get(:tree).children[0], entity, tag_set)
break #######
else
recurse(s.get(:tree), entity, tag_set)
end
list = get_token_list(entity)
entity.remove_all!
model_file = options[:model] ||
StanfordCoreNLP::Config::Models[:parse][lang]
unless @@parsers[lang] && @@parsers[lang][model_file]
model_path = Treat.libraries.stanford.model_path ||
StanfordCoreNLP.model_path
model_folder = StanfordCoreNLP::Config::ModelFolders[:parse]
model = File.join(model_path, model_folder, model_file)
@@parsers[lang] ||= {}
options = StanfordCoreNLP::Options.new
parser = StanfordCoreNLP::LexicalizedParser
.getParserFromFile(model, options)
@@parsers[lang][model_file] = parser
end
parser = @@parsers[lang][model_file]
text = parser.apply(list)
recurse(text.children[0], entity, tag_set)
entity.set :tag_set, tag_set
end
def self.init(lang, options)
return if @@parsers[lang]
def self.recurse(java_node, ruby_node, tag_set)
Treat::Loaders::Stanford.load(lang)
options = DefaultOptions.merge(options)
StanfordCoreNLP.use(lang)
if options[:tagger_model]
::StanfordCoreNLP.set_model(
'pos.model', options[:tagger_model]
)
end
if options[:parser_model]
::StanfordCoreNLP.set_model(
'parser.model', options[:parser_model]
)
end
@@parsers[lang] ||=
::StanfordCoreNLP.load(
:tokenize, :ssplit, :pos, :lemma, :parse
)
end
java_node.children.each do |java_child|
# Helper method which recurses the tree supplied by
# the Stanford parser.
def self.recurse(java_node, ruby_node, tag_set, additional_tags = [])
label = java_child.label
tag = label.get(:category).to_s
if java_node.num_children == 0
label = java_node.label
tag = label.get(:part_of_speech).to_s
tag_s, tag_opt = *tag.split('-')
tag_s ||= ''
ruby_node.value = java_node.value.to_s.strip
ruby_node.set :tag, tag_s
ruby_node.set :tag_opt, tag_opt if tag_opt
ruby_node.set :lemma, label.get(:lemma).to_s
additional_tags.each do |t|
lt = label.get(t)
ruby_node.set t, lt.to_s if lt
end
ruby_node
else
if java_node.num_children == 1 &&
java_node.children[0].num_children == 0
recurse(java_node.children[0],
ruby_node, tag_set, additional_tags)
return
end
java_node.children.each do |java_child|
label = java_child.label
tag = label.get(:category).to_s
tag_s, tag_opt = *tag.split('-')
tag_s ||= ''
if Pttc[tag_s] && Pttc[tag_s][tag_set]
ruby_child = Treat::Entities::Phrase.new
else
l = java_child.children[0].to_s
v = java_child.children[0].value.to_s.strip
# Mhmhmhmhmhm
val = (l == v) ? v : l.split(' ')[-1].gsub(')', '')
ruby_child = Treat::Entities::Token.from_string(val)
end
ruby_child.set :tag, tag_s
ruby_child.set :tag_opt, tag_opt if tag_opt
if Pttc[tag] && Pttc[tag][tag_set]
ruby_child = Treat::Entities::Phrase.new
ruby_child.set :tag, tag
ruby_node << ruby_child
unless java_child.children.empty?
recurse(java_child, ruby_child, tag_set, additional_tags)
recurse(java_child, ruby_child, tag_set)
end
else
val = java_child.children[0].to_s
ruby_child = Treat::Entities::Token.from_string(val)
ruby_child.set :tag, tag
ruby_node << ruby_child
end
end
end
def self.get_token_list(entity)
list = StanfordCoreNLP::ArrayList.new
entity.tokens.each do |token|
list.add(StanfordCoreNLP::Word.new(token.to_s))
end
list
end
end

View File

@ -13,7 +13,7 @@ class Treat::Workers::Processors::Segmenters::Punkt
silence_warnings { require 'punkt-segmenter' }
# Require the YAML parser.
silence_warnings { require 'psych' }
# silence_warnings { require 'psych' }
# Hold one copy of the segmenter per language.
@@segmenters = {}
@ -87,7 +87,7 @@ class Treat::Workers::Processors::Segmenters::Punkt
end
end
t = ::Psych.load(File.read(model))
t = ::YAML.load(File.read(model))
@@segmenters[lang] =
::Punkt::SentenceTokenizer.new(t)

Some files were not shown because too many files have changed in this diff Show More