diff --git a/README.md b/README.md index 18c6ee2..71e552b 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,8 @@ # Kaggle ![](static/images/logos/kaggle-logo-gray-bigger.jpeg) +* [ApacheCN 开源组织](https://github.com/apachecn/organization): https://github.com/apachecn/organization + > **欢迎任何人参与和完善:一个人可以走的很快,但是一群人却可以走的更远** * ApacheCN - Kaggle组队群【686932392】ApacheCN - Kaggle组队群【686932392】 * [Kaggle](https://www.kaggle.com) 是一个流行的数据科学竞赛平台。 @@ -15,6 +17,16 @@ ## [竞赛](https://www.kaggle.com/competitions) +* 【推荐】特征工程全过程: https://www.cnblogs.com/jasonfreak/p/5448385.html + +> train loss 与 test loss 结果分析 + +* train loss 不断下降,test loss不断下降,说明网络仍在学习; +* train loss 不断下降,test loss趋于不变,说明网络过拟合; +* train loss 趋于不变,test loss不断下降,说明数据集100%有问题; +* train loss 趋于不变,test loss趋于不变,说明学习遇到瓶颈,需要减小学习率或批量数目; +* train loss 不断上升,test loss不断上升,说明网络结构设计不当,训练超参数设置不当,数据集经过清洗等问题。 + ``` 机器学习比赛,奖金很高,业界承认分数。 现在我们已经准备好尝试 Kaggle 竞赛了,这些竞赛分成以下几个类别。 @@ -31,6 +43,7 @@ * [**数字识别**](/competitions/getting-started/digit-recognizer) * [**泰坦尼克**](/competitions/getting-started/titanic) * [**房价预测**](/competitions/getting-started/house-price) +* [**nlp-情感分析**](/competitions/getting-started/word2vec-nlp-tutorial) > [第3部分:训练场 Playground](https://www.kaggle.com/competitions?sortBy=deadline&group=all&page=1&pageSize=20&segment=playground) @@ -134,15 +147,3 @@ * 企鹅: 529815144(片刻) 1042658081(那伊抹微笑) 190442212(瑶妹) * **ApacheCN - 学习机器学习群【629470233】ApacheCN - 学习机器学习群【629470233】** * **Kaggle (数据科学竞赛平台) | [ApacheCN(apache中文网)](http://www.apachecn.org/)** - -## [ApacheCN 组织资源](http://www.apachecn.org/) - -> [kaggle: 机器学习竞赛](https://github.com/apachecn/kaggle) - -| 深度学习 | 机器学习 | 大数据 | 运维工具 | -| --- | --- | --- | --- | -| [TensorFlow R1.2 中文文档](http://cwiki.apachecn.org/pages/viewpage.action?pageId=10030122) | [机器学习实战-教学](https://github.com/apachecn/MachineLearning) | [Spark 2.2.0和2.0.2 中文文档](http://spark.apachecn.org/) | [Zeppelin 0.7.2 中文文档](http://cwiki.apachecn.org/pages/viewpage.action?pageId=10030467) | -| [Pytorch 0.3 中文文档 
](http://pytorch.apachecn.org/cn/0.3.0/) | [Sklearn 0.19 中文文档](http://sklearn.apachecn.org/) | [Storm 1.1.0和1.0.1 中文文档](http://storm.apachecn.org/) | [Kibana 5.2 中文文档](http://cwiki.apachecn.org/pages/viewpage.action?pageId=8159377) | -| | [LightGBM 中文文档](http://lightgbm.apachecn.org/cn/latest) | [Kudu 1.4.0 中文文档](http://cwiki.apachecn.org/pages/viewpage.action?pageId=10813594) | | -| | [XGBoost 中文文档](http://xgboost.apachecn.org/cn/latest) | [Elasticsearch 5.4 中文文档](http://cwiki.apachecn.org/pages/viewpage.action?pageId=4260364) | -| | | [Beam 中文文档](http://beam.apachecn.org/) | diff --git a/competitions/getting-started/word2vec-nlp-tutorial/NLP电影预测.ipynb b/competitions/getting-started/word2vec-nlp-tutorial/NLP电影预测.ipynb new file mode 100644 index 0000000..f35be15 --- /dev/null +++ b/competitions/getting-started/word2vec-nlp-tutorial/NLP电影预测.ipynb @@ -0,0 +1,3080 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# **word2vec nlp tutorial**\n", + "\n", + "> Competition description\n", + "\n", + "The labeled data set consists of 50,000 IMDB movie reviews selected for sentiment analysis. Sentiment is binary: an IMDB rating < 5 gives a sentiment score of 0, and a rating ≥ 7 gives a sentiment score of 1. No individual movie has more than 30 reviews. The 25,000-review labeled training set does not contain any of the same movies as the 25,000-review test set. In addition, there are another 50,000 IMDB reviews provided without any rating labels.\n", + "\n", + "> File descriptions\n", + "\n", + "| File | Description |\n", + "| :--- | :--- |\n", + "| labeledTrainData | The labeled training set. The file is tab-delimited and has a header row followed by 25,000 rows containing an id, sentiment, and text for each review. |\n", + "| testData | The test set. The tab-delimited file has a header row followed by 25,000 rows containing an id and text for each review. Your task is to predict the sentiment of each one. |\n", + "| unlabeledTrainData | An extra training set with no labels. The tab-delimited file has a header row followed by 50,000 rows containing an id and text for each review. |\n", + "| sampleSubmission | A comma-delimited sample submission file in the correct format. |\n", + "\n", + "> Data fields\n", + "\n", + "| Field | Description |\n", + "| :--- | :--- |\n", + "| id | Unique ID of each review |\n", + "| sentiment | Sentiment of the review; 1 for positive reviews, 0 for negative reviews |\n", + "| review | Text of the review |\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Competition workflow\n", + "\n", + "Classification problem: predict whether a review is positive or negative\n", + "Common algorithms: k-nearest neighbors (KNN), logistic regression (LogisticRegression), random forest (RandomForest), support vector machine (SVM), xgboost, GBDT\n", + "\n", + "> Steps:\n", + "\n", + "```\n", + "I. 
Data analysis\n", + "1. Download and load the data\n", + "2. Overview: understand what each column means, the data format, and so on\n", + "3. Preliminary analysis with statistics and plots: get a first look at correlations in the data, in preparation for feature engineering and model building\n", + "\n", + "II. Feature engineering\n", + "1. Build features from domain knowledge, common sense, and the data analysis in step I.\n", + "2. Convert the features into types the model can work with (e.g. handle missing values, encode text)\n", + "\n", + "III. Model selection\n", + "1. Determine the learning type from the objective: unsupervised or supervised, classification or regression, and so on.\n", + "2. Compare the scores of the candidate models and take the better-performing ones as base models.\n", + "\n", + "IV. Model ensembling\n", + "\n", + "V. Tune features and model parameters\n", + "1. Adding or modifying features can raise the model's performance ceiling.\n", + "2. Tuning the model parameters brings the model closer to that ceiling.\n", + "```\n", + "\n", + "* * * \n", + "\n", + "* Competition: https://www.kaggle.com/c/word2vec-nlp-tutorial\n", + "* Reference: https://www.cnblogs.com/zhao441354231/p/6056914.html\n", + "* Reference: https://blog.csdn.net/lijingpengchina/article/details/52250765\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## I. Data analysis\n", + "\n", + "### Downloading and loading the data\n", + "\n", + "* Dataset download: https://www.kaggle.com/c/word2vec-nlp-tutorial/data\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "# Import the required packages\n", + "import pandas as pd\n", + "import numpy as np\n", + "from bs4 import BeautifulSoup" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Reading the data" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "data_dir = \"/opt/data/kaggle/getting-started/word2vec-nlp-tutorial\"\n", + "# Load the datasets\n", + "train = pd.read_csv(os.path.join(data_dir, 'labeledTrainData.tsv'), header=0, delimiter=\"\\t\", quoting=3)\n", + "pre = pd.read_csv(os.path.join(data_dir, 'testData.tsv'), header=0, delimiter=\"\\t\", quoting=3)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(25000, 3) \t (25000, 2) \t\n", + "\n", + "RangeIndex: 25000 entries, 0 to 24999\n", + "Data columns (total 3 columns):\n", + "id 25000 non-null object\n", + "sentiment 25000 non-null int64\n", + "review 25000 non-null object\n", + "dtypes: 
int64(1), object(2)\n", + "memory usage: 586.0+ KB\n", + "None \n", + "\n", + "\n", + " ['id' 'sentiment' 'review']\n", + "\n", + " id sentiment review\n", + "0 \"5814_8\" 1 \"With all this stuff going down at the moment ...\n", + "1 \"2381_9\" 1 \"\\\"The Classic War of the Worlds\\\" by Timothy ...\n", + "2 \"7759_3\" 0 \"The film starts with a manager (Nicholas Bell...\n" + ] + } + ], + "source": [ + "print(train.shape, '\\t', pre.shape, '\\t')\n", + "print(train.info(), '\\n')\n", + "print('\\n', train.columns.values)\n", + "print('\\n', train.head(3))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 数据预处理\n", + "\n", + "* 1.去掉html标签\n", + "* 2.移除标点\n", + "* 3.切分成词/token\n", + "* 4.去掉停用词\n", + "* 5.重组为新的句子" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "# def review_to_wordlist(review):\n", + "# '''\n", + "# 把IMDB的评论转成词序列\n", + "# 参考:http://blog.csdn.net/longxinchen_ml/article/details/50629613\n", + "# '''\n", + "# # 去掉HTML标签,拿到内容\n", + "# review_text = BeautifulSoup(review, \"html.parser\").get_text()\n", + "# # 用正则表达式取出符合规范的部分\n", + "# review_text = re.sub(\"[^a-zA-Z]\", \" \", review_text)\n", + "# # 小写化所有的词,并转成词list\n", + "# words = review_text.lower().split()\n", + "# # 返回words\n", + "# return words\n", + "\n", + "\n", + "# 预处理数据\n", + "label = train['sentiment']\n", + "train_data = []\n", + "pre_data = []\n", + "for i in range(len(train['review'])):\n", + " train_data.append(BeautifulSoup(train['review'][i], \"html.parser\").get_text())\n", + "test_data = []\n", + "for i in range(len(pre['review'])):\n", + " pre_data.append(BeautifulSoup(pre['review'][i], \"html.parser\").get_text())" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\"With all this stuff going down at the moment with MJ i've started listening to his music, watching 
the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? 
Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter.\" \n", + "\n", + "\"Naturally in a film who's main themes are of mortality, nostalgia, and loss of innocence it is perhaps not surprising that it is rated more highly by older viewers than younger ones. However there is a craftsmanship and completeness to the film which anyone can enjoy. The pace is steady and constant, the characters full and engaging, the relationships and interactions natural showing that you do not need floods of tears to show emotion, screams to show fear, shouting to show dispute or violence to show anger. Naturally Joyce's short story lends the film a ready made structure as perfect as a polished diamond, but the small changes Huston makes such as the inclusion of the poem fit in neatly. It is truly a masterpiece of tact, subtlety and overwhelming beauty.\"\n" + ] + } + ], + "source": [ + "# Preview the data\n", + "print(train_data[0], '\\n')\n", + "print(pre_data[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Feature processing" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "# Combine the training and test sets so they can be vectorized together\n", + "data_all = train_data + pre_data\n", + "len_train = len(train_data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Raw text cannot be fed to a model directly, so the text has to be converted to vectors. Common approaches include: \n", + "\n", + "1. Bag-of-Words counts \n", + "2. TF-IDF vectors \n", + "3. 
Word2vec向量 " + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['have', 'yourselves', 'itself', \"haven't\", 'y', 'shan', 'because', 'didn', 'it', \"she's\", 'nor', 'once', 'hadn', 'an', 'will', 'in', 'than', 'just', \"doesn't\", 'down', \"mightn't\", 've', 'shouldn', 'before', 'when', 'and', 'won', 'which', \"wouldn't\", 'other', 'are', 'doesn', 'here', 'him', 'why', \"mustn't\", 'theirs', 'ours', 'himself', 'now', 'at', 'but', 'its', 'were', 'whom', 'how', 'again', 'under', 'myself', 'me', 'your', 'then', 'he', 'the', 'who', 'herself', 'off', 'aren', 'each', 'same', 'all', \"that'll\", 'so', 'having', 'that', 'couldn', 'she', 'wasn', 'own', \"shouldn't\", 'by', 'there', 'this', 'we', 'if', 'no', 'doing', 'don', 'ain', \"you've\", 'had', 't', 'into', 'too', 'hasn', 'they', 'few', 'their', 'being', 'mightn', \"you'd\", 'a', 'her', \"couldn't\", 'did', \"you'll\", 'd', 'can', 'been', 'm', 'yours', 'very', 'wouldn', 'i', 'his', 'during', 'through', 'you', 'against', 'be', 'themselves', 'not', 'out', \"don't\", 'is', \"it's\", 'was', 'does', 'ma', 'needn', 'these', 'some', 'on', \"isn't\", 'for', 'further', \"hadn't\", 'isn', 'below', 'more', \"didn't\", 'has', 'up', 'with', 'about', \"weren't\", 'am', 'those', 'where', 'what', 'any', 's', \"you're\", 'do', 'or', 'over', 'weren', 'my', 'until', 'as', 'most', 'only', \"should've\", 'ourselves', \"needn't\", 'haven', 'above', 'such', 'hers', \"shan't\", 'after', 'while', \"wasn't\", 'them', 'between', 'our', 'from', 'yourself', \"aren't\", 'should', 'mustn', \"hasn't\", \"won't\", 'to', 're', 'of', 'both', 'o', 'll']\n" + ] + } + ], + "source": [ + "from nltk.corpus import stopwords\n", + "#英文停止词,set()集合函数消除重复项\n", + "list_stopWords = list(set(stopwords.words('english')))\n", + "print(list_stopWords)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": {}, + "outputs": [], + "source": [ + 
"from gensim import corpora\n", + "\n", + "# bow 模型 \n", + "import re\n", + "texts = [[word for word in re.sub(\"[^a-zA-Z]\", \" \", doc.lower()) if word != \"\" and word not in list_stopWords] for doc in data_all]\n", + "dictionary = corpora.Dictionary(texts)\n", + "# 对每一行的单词,进行统计出现次数\n", + "corpus = [dictionary.doc2bow(text) for text in texts]" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 \n", + "1 b\n", + "2 c\n", + "3 e\n", + "4 f\n", + "[(0, 471), (1, 29), (2, 56), (3, 206), (4, 37), (5, 42), (6, 102), (7, 19), (8, 18), (9, 82), (10, 115), (11, 31), (12, 3), (13, 72), (14, 53), (15, 15), (16, 41), (17, 3), (18, 1)]\n" + ] + } + ], + "source": [ + "for key in dictionary.keys()[0:5]:\n", + " print (key, dictionary[key])\n", + "\n", + "print(corpus[0])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.TF-IDF + Lsi主题模型" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "from gensim import models\n", + "tfidf_model = models.TfidfModel(corpus=corpus, id2word=dictionary, normalize=True) \n", + "# 将整个corpus转为tf-idf格式\n", + "corpus_tfidf = tfidf_model[corpus]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "## lsi 主题模型,作为特征向量\n", + "lsi_model = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=200)\n", + "corpus_lsi = lsi_model[corpus_tfidf]\n", + "\n", + "# 提取数字,转化为numpy的矩阵\n", + "all_x = [[v for k,v in doc] for doc in corpus_lsi]" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[(0, '0.162*\"movie\" + 0.140*\"film\" + 0.102*\"-\" + 0.099*\"good\" + 0.099*\"like\" + 0.098*\"really\" + 0.098*\"bad\" + 0.092*\"one\" + 0.089*\"would\" + 0.088*\"story\"'), 
(1, '0.276*\"bad\" + 0.238*\"movie\" + 0.180*\"worst\" + -0.156*\"-\" + 0.153*\"movies\" + 0.113*\"waste\" + 0.108*\"ever\" + 0.106*\"acting\" + 0.106*\"terrible\" + -0.101*\"film\"'), (2, '-0.667*\"show\" + -0.212*\"episode\" + -0.203*\"series\" + 0.160*\"-\" + 0.153*\"film\" + -0.146*\"season\" + -0.145*\"episodes\" + -0.135*\"tv\" + -0.130*\"shows\" + -0.125*\"funny\"')]\n", + "(50000, 200)\n" + ] + } + ], + "source": [ + "# print(np.shape(corpus_lsi))\n", + "# (50000, 200, 2)\n", + "print(lsi_model.print_topics(3))\n", + "# print(corpus_lsi[0])\n", + "\n", + "import numpy as np\n", + "print(np.shape(all_x))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 3. Word2vec vectors\n", + "\n", + "The objective of a neural network language model is $L = \\sum_{w} \\log p(w \\mid \\mathrm{context}(w))$, that is, the log-probability of the current word $w$ given its context, summed over the words of the corpus. As the formula shows, the core task is computing $p(w \\mid \\mathrm{context}(w))$, and Word2vec provides one way to construct this probability." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "from bs4 import BeautifulSoup\n", + "from nltk.corpus import stopwords\n", + "\n", + "# def show_diff(origin, html, text):\n", + "#     print(origin)\n", + "#     print(\"\\n-----------show diff-----------\\n\")\n", + "#     print(html)\n", + "#     print(\"\\n-----------show diff-----------\\n\")\n", + "#     print(text)\n", + "\n", + "# origin = train['review'][0]\n", + "# html = BeautifulSoup(origin, \"html.parser\").get_text()\n", + "# text = re.sub('[^a-zA-Z]', ' ', html).strip()\n", + "# show_diff(origin, html, text)\n", + "\n", + "stop_words = set(stopwords.words(\"english\"))\n", + "\n", + "def review_to_sentence(review):\n", + "    html = BeautifulSoup(review, \"html.parser\").get_text()\n", + "    text = re.sub('[^a-zA-Z]', ' ', html).strip()\n", + "    words = [word for word in text.lower().split() if word not in stop_words]\n", + "    return words\n", + "\n", + "unlabeled_train = pd.read_csv(os.path.join(data_dir, 'unlabeledTrainData.tsv'), header=0, delimiter=\"\\t\", quoting=3 )\n", + "train_texts = pd.concat([train['review'], 
unlabeled_train['review']])\n", + "sentences = list(map(review_to_sentence, train_texts))" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(75000,)\n" + ] + } + ], + "source": [ + "print(np.shape(sentences))" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "from gensim.models import Word2Vec\n", + "# 模型参数\n", + "num_features = 784 # Word vector dimensionality(原来默认用300维,为了计算CNN, 设置 784维 = 28*28) \n", + "min_word_count = 10 # Minimum word count \n", + "num_workers = 4 # Number of threads to run in parallel\n", + "context = 10 # Context window size \n", + "downsampling = 1e-3 # Downsample setting for frequent words" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "训练模型中...\n", + "训练完成\n", + "CPU times: user 7min 17s, sys: 6.85 s, total: 7min 24s\n", + "Wall time: 3min 1s\n" + ] + } + ], + "source": [ + "%%time\n", + "# 训练模型\n", + "print(\"训练模型中...\")\n", + "# model = Word2Vec(sentences, workers=num_workers, \\\n", + "# size=num_features, min_count=min_word_count, \\\n", + "# window=5, sample=downsampling)\n", + "model = Word2Vec(sentences, size=num_features, window=5)\n", + "print(\"训练完成\")" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/jiangzl/.virtualenvs/python3.6/lib/python3.6/site-packages/ipykernel_launcher.py:1: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", + " \"\"\"Entry point for launching an IPython kernel.\n" + ] + }, + { + "data": { + "text/plain": [ + "(784,)" + ] + }, + "execution_count": 33, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "np.shape(model[\"film\"])" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'kitchen'" + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.doesnt_match(\"man woman child kitchen\".split())" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'berlin'" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.doesnt_match(\"france england germany berlin\".split())" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "'london'" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.doesnt_match(\"paris berlin london austria\".split())" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('men', 0.5552874803543091),\n", + " ('lady', 0.5526503920555115),\n", + " ('woman', 0.49917668104171753),\n", + " ('mans', 0.47213518619537354),\n", + " ('guy', 0.4668915569782257)]" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.most_similar(\"man\", topn=5)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('princess', 0.6967809200286865),\n", + " ('bride', 0.6197341084480286),\n", + " ('latifah', 0.6163896918296814),\n", + " ('goddess', 0.6069524884223938),\n", + " ('showgirl', 0.5752988457679749)]" + ] + }, + "execution_count": 38, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.most_similar(\"queen\", topn=5)" + ] + }, + { + "cell_type": "code", + 
"execution_count": 39, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('terrible', 0.8102608919143677),\n", + " ('horrible', 0.7840115427970886),\n", + " ('dreadful', 0.7728089690208435),\n", + " ('horrid', 0.7526298761367798),\n", + " ('atrocious', 0.7394574284553528)]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.most_similar(\"awful\", topn=5)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[('princess', 0.4752192795276642)]" + ] + }, + "execution_count": 40, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model.wv.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "metadata": {}, + "outputs": [], + "source": [ + "def makeFeatureVec(words, model, num_features):\n", + " '''\n", + " 对段落中的所有词向量进行取平均操作\n", + " '''\n", + " featureVec = np.zeros((num_features,), dtype=\"float32\")\n", + " nwords = 0.\n", + "\n", + " # Index2word包含了词表中的所有词,为了检索速度,保存到set中\n", + " index2word_set = set(model.wv.index2word)\n", + " for word in words:\n", + " if word in index2word_set:\n", + " nwords = nwords + 1.\n", + " featureVec = np.add(featureVec, model[word])\n", + "\n", + " # 取平均\n", + " featureVec = np.divide(featureVec, nwords)\n", + " return featureVec\n", + "\n", + "\n", + "def getAvgFeatureVecs(reviews, model, num_features):\n", + " '''\n", + " 给定一个文本列表,每个文本由一个词列表组成,返回每个文本的词向量加和的平均值\n", + " '''\n", + " counter = 0\n", + " reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype=\"float32\")\n", + "\n", + " for review in reviews:\n", + " if counter % 5000 == 0:\n", + " print(\"Review %d of %d\" % (counter, len(reviews)))\n", + "\n", + " reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)\n", + " counter = counter + 1\n", + "\n", + " return reviewFeatureVecs" + ] 
+ }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 6 µs, sys: 1e+03 ns, total: 7 µs\n", + "Wall time: 11.2 µs\n", + "Review 0 of 25000\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/jiangzl/.virtualenvs/python3.6/lib/python3.6/site-packages/ipykernel_launcher.py:13: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", + " del sys.path[0]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Review 5000 of 25000\n", + "Review 10000 of 25000\n", + "Review 15000 of 25000\n", + "Review 20000 of 25000\n", + "(25000, 784)\n" + ] + } + ], + "source": [ + "%%time\n", + "trainDataVecs = getAvgFeatureVecs(texts[:len_train], model, num_features)\n", + "print(np.shape(trainDataVecs))" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "CPU times: user 6 µs, sys: 2 µs, total: 8 µs\n", + "Wall time: 97 µs\n", + "Review 0 of 25000\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/jiangzl/.virtualenvs/python3.6/lib/python3.6/site-packages/ipykernel_launcher.py:13: DeprecationWarning: Call to deprecated `__getitem__` (Method will be removed in 4.0.0, use self.wv.__getitem__() instead).\n", + " del sys.path[0]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Review 5000 of 25000\n", + "Review 10000 of 25000\n", + "Review 15000 of 25000\n", + "Review 20000 of 25000\n", + "(25000, 784)\n" + ] + } + ], + "source": [ + "%%time\n", + "testDataVecs = getAvgFeatureVecs(texts[len_train:], model, num_features)\n", + "print(np.shape(testDataVecs))" + ] + }, + { + "cell_type": "code", + "execution_count": 92, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + 
"output_type": "stream", + "text": [ + "\n", + "高斯贝叶斯分类器 10折交叉验证得分: \n", + " [0.62715936 0.6181632 0.62577952 0.62458144 0.63289088 0.59956992\n", + " 0.61033216 0.62668192 0.610296 0.60734944]\n", + "\n", + "高斯贝叶斯分类器 10折交叉验证平均得分: \n", + " 0.618280384\n" + ] + } + ], + "source": [ + "from sklearn.naive_bayes import GaussianNB as GNB\n", + "from sklearn.cross_validation import cross_val_score\n", + "\n", + "gnb_model = GNB()\n", + "gnb_model.fit(trainDataVecs, label)\n", + "\n", + "scores = cross_val_score(gnb_model, trainDataVecs, label, cv=10, scoring='roc_auc')\n", + "print(\"\\n高斯贝叶斯分类器 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\n高斯贝叶斯分类器 10折交叉验证平均得分: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 机器学习 - 模型调参" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### KNN 模型训练" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "knn算法 10折交叉验证得分: \n", + " [0.82005056 0.81503776 0.83006976 0.8199152 0.82069568 0.827304\n", + " 0.81693088 0.8250944 0.80150176 0.821496 ]\n", + "\n", + "knn算法 10折交叉验证平均得分: \n", + " 0.8198095999999999\n" + ] + } + ], + "source": [ + "from sklearn.neighbors import KNeighborsClassifier\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "knn_model = KNeighborsClassifier(n_neighbors=5)\n", + "knn_model.fit(all_x[:len_train], label)\n", + "\n", + "scores = cross_val_score(knn_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\nknn算法 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\nknn算法 10折交叉验证平均得分: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 决策树 模型训练" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "决策树 10折交叉验证得分: \n", + " [0.7392 
0.7232 0.7292 0.7236 0.7412 0.7164 0.718 0.724 0.7124 0.7156]\n", + "\n", + "决策树 10折交叉验证平均得分: \n", + " 0.72428\n" + ] + } + ], + "source": [ + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "tree_model = DecisionTreeClassifier()\n", + "tree_model.fit(all_x[:len_train], label)\n", + "\n", + "scores = cross_val_score(tree_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\n决策树 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\n决策树 10折交叉验证平均得分: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 逻辑回归 模型训练" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "逻辑回归 10折交叉验证得分: \n", + " [0.94440064 0.94031744 0.95128192 0.94374784 0.9410656 0.94308864\n", + " 0.94733184 0.948768 0.93660352 0.94612288]\n", + "\n", + "逻辑回归 10折交叉验证平均得分: \n", + " 0.944272832\n" + ] + } + ], + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "lr_model = LogisticRegression(C=0.1, max_iter=100)\n", + "lr_model.fit(all_x[:len_train], label)\n", + "\n", + "scores = cross_val_score(lr_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\n逻辑回归 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\n逻辑回归 10折交叉验证平均得分: \\n\", np.mean(scores))\n", + "\n", + "\n", + "# from sklearn.model_selection import GridSearchCV\n", + "\n", + "# # 设定grid search的参数\n", + "# grid_values = {'C': [1, 15, 30, 50]} \n", + "# \"\"\"\n", + "# penalty: l1 or l2, 用于指定惩罚中使用的标准。\n", + "# \"\"\"\n", + "# model_LR = GridSearchCV(estimator=LR(penalty='l2', dual=True, random_state=0), grid_values, scoring='roc_auc', cv=20)\n", + "# model_LR.fit(train_x, label)\n", + "\n", + "# 输出结果\n", + "# print(model_LR.cv_results_, '\\n', model_LR.best_params_, model_LR.best_score_)" + ] + 
}, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### SVM 模型训练" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "SVM 10折交叉验证得分: \n", + " [0.94539328 0.9421344 0.95242176 0.94563328 0.94189504 0.94408704\n", + " 0.94839424 0.94898688 0.93809024 0.9473792 ]\n", + "\n", + "SVM 10折交叉验证平均得分: \n", + " 0.945441536\n" + ] + } + ], + "source": [ + "from sklearn.svm import SVC\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "# model = SVC(C=4, kernel='rbf')\n", + "svm_model = SVC(kernel='linear', probability=True)\n", + "svm_model.fit(all_x[:len_train], label)\n", + "\n", + "scores = cross_val_score(svm_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\nSVM 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\nSVM 10折交叉验证平均得分: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "code", + "execution_count": 152, + "metadata": {}, + "outputs": [], + "source": [ + "svm_model = SVC(kernel='linear', probability=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### XGBoost 模型训练" + ] + }, + { + "cell_type": "code", + "execution_count": 107, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "SVM 10折交叉验证得分: \n", + " [0.93396416 0.93105024 0.93927616 0.9332768 0.9338336 0.93682432\n", + " 0.93343296 0.93623296 0.92564352 0.93381504]\n", + "\n", + "SVM 10折交叉验证平均得分: \n", + " 0.933734976\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "from xgboost import XGBClassifier\n", + "import numpy as np\n", + "\n", + "xgb_model = XGBClassifier(n_estimators=150, min_samples_leaf=3, max_depth=6)\n", + "\"\"\"\n", + "AttributeError: 'list' object has no attribute 'shape'\n", + "list => np.array\n", + "\"\"\"\n", + "xgb_model.fit(np.array(all_x[:len_train]), label)\n", + "\n", + "scores = 
cross_val_score(xgb_model, np.array(all_x[:len_train]), label, cv=10, scoring='roc_auc')\n", + "print(\"\\nXGB 10折交叉验证得分: \\n\", scores)\n", + "print(\"\\nXGB 10折交叉验证平均得分: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 模型融合" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### bagging: 随机森林 \n", + "\n", + "随机森林效果不好,去掉所有的树模型" + ] + }, + { + "cell_type": "code", + "execution_count": 108, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "随机森林 10折交叉验证得分: \n", + " [0.94539328 0.9421344 0.95242176 0.94563328 0.94189504 0.94408704\n", + " 0.94839424 0.94898688 0.93809024 0.9473792 ]\n", + "\n", + "随机森林 10折交叉验证平均得分: \n", + " 0.945441536\n" + ] + } + ], + "source": [ + "from sklearn.model_selection import GridSearchCV\n", + "from sklearn.ensemble import RandomForestClassifier\n", + "\n", + "# parameters= {'n_estimators': range(10, 101, 10)} \n", + "# gsearch_rf = GridSearchCV(\n", + "# estimator=RandomForestClassifier(max_depth=8, random_state=0),\n", + "# param_grid=parameters, scoring='roc_auc', cv=10)\n", + "\n", + "# gsearch_rf = gsearch_rf.fit(all_x[:len_train], label)\n", + "\n", + "# print(gsearch_rf.cv_results_, '\\n', gsearch_rf.best_params_, '\\t', gsearch_rf.best_score_)\n", + "\"\"\"\n", + "[mean: 0.87486, std: 0.00576, params: {'n_estimators': 10}, mean: 0.88505, std: 0.00611, params: {'n_estimators': 20}, mean: 0.89032, std: 0.00609, params: {'n_estimators': 30}, mean: 0.89246, std: 0.00537, params: {'n_estimators': 40}, mean: 0.89439, std: 0.00528, params: {'n_estimators': 50}, \n", + " mean: 0.89507, std: 0.00607, params: {'n_estimators': 60}, mean: 0.89591, std: 0.00618, params: {'n_estimators': 70}, mean: 0.89634, std: 0.00634, params: {'n_estimators': 80}, mean: 0.89671, std: 0.00607, params: {'n_estimators': 90}, mean: 0.89753, std: 0.00588, params: {'n_estimators': 100}] \n", + "\n", + "{'n_estimators': 100} 
0.89753344\n", + "\"\"\"\n", + "\n", + "rf_model = RandomForestClassifier(n_estimators=100, max_depth=8, random_state=0)\n", + "rf_model.fit(all_x[:len_train], label)\n", + "\n", + "# Fixed: score rf_model here. The original passed svm_model, which is why the\n", + "# recorded output of this cell repeats the SVM scores exactly.\n", + "scores = cross_val_score(rf_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\nRandom forest 10-fold cross-validation scores: \\n\", scores)\n", + "print(\"\\nRandom forest 10-fold cross-validation mean score: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### boosting: AdaBoost" + ] + }, + { + "cell_type": "code", + "execution_count": 89, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "AdaBoost 10-fold cross-validation scores: \n", + " [0.9207616 0.91974976 0.92787136 0.91906176 0.92305856 0.92228416\n", + " 0.9224832 0.92169856 0.91816256 0.92096768]\n", + "\n", + "AdaBoost 10-fold cross-validation mean score: \n", + " 0.9216099200000001\n" + ] + } + ], + "source": [ + "from sklearn.ensemble import AdaBoostClassifier\n", + "from sklearn.tree import DecisionTreeClassifier\n", + "from sklearn.model_selection import cross_val_score\n", + "\n", + "ab_model = AdaBoostClassifier(\n", + "    DecisionTreeClassifier(max_depth=2),\n", + "    n_estimators=600,\n", + "    learning_rate=1)\n", + "\n", + "ab_model.fit(all_x[:len_train], label)\n", + "\n", + "scores = cross_val_score(ab_model, all_x[:len_train], label, cv=10, scoring='roc_auc')\n", + "print(\"\\nAdaBoost 10-fold cross-validation scores: \\n\", scores)\n", + "print(\"\\nAdaBoost 10-fold cross-validation mean score: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### voting: multi-model voting" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": false + }, + "outputs": [], + "source": [ + "from sklearn.model_selection import cross_val_score\n", + "from sklearn.ensemble import VotingClassifier\n", + "\n", + "\"\"\"\n", + "Soft voting raises an error when a classifier cannot output probabilities, because it votes with each classifier's predicted probabilities.\n", + "Every classifier must therefore output probabilities rather than class labels.\n", + "SVC outputs class labels by default (unless probability=True), so it fails; the other classifiers do not have this problem.\n", + "\"\"\"\n", + "\n", + "vot_model = 
VotingClassifier(\n", + "#     estimators=[('lr', lr_model), ('svm', svm_model), ('xgb', xgb_model), ('rf', rf_model), ('ab', ab_model)]\n", + "    estimators=[('xgb', xgb_model), ('rf', rf_model)],\n", + "    # The roc_auc scorer needs predict_proba, which hard voting does not provide;\n", + "    # both estimators listed here support it, so soft voting is used.\n", + "    voting='soft')\n", + "vot_model.fit(np.array(all_x[:len_train]), np.array(label))\n", + "\n", + "scores = cross_val_score(vot_model, np.array(all_x[:len_train]), np.array(label), cv=10, scoring='roc_auc')\n", + "print(\"\\nVoting 10-fold cross-validation scores: \\n\", scores)\n", + "print(\"\\nVoting 10-fold cross-validation mean score: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### stacking: model stacking" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.model_selection import StratifiedKFold\n", + "\n", + "# Quick check of the fold API on a toy label vector\n", + "y = np.array([0, 0, 1, 1])\n", + "skf = list(StratifiedKFold(n_splits=2).split(np.zeros(len(y)), y))\n", + "len(skf)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "'''Base models used in the ensemble'''\n", + "import numpy as np\n", + "# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it\n", + "from sklearn.model_selection import cross_val_score, StratifiedKFold\n", + "\n", + "# Split off the training set and rename the arrays to match the stacking code below\n", + "X = np.array(all_x[:len_train])\n", + "X_predict = np.array(all_x[len_train:])\n", + "y = label.values.astype(int)\n", + "\n", + "# clfs = [LogisticRegression(C=0.1, max_iter=100),\n", + "# xgb.XGBClassifier(max_depth=6, n_estimators=100, num_round = 5),\n", + "# RandomForestClassifier(n_estimators=100, max_depth=6, oob_score=True),\n", + "# GradientBoostingClassifier(learning_rate=0.3, max_depth=6, n_estimators=100)]\n", + "\n", + "clfs = [knn_model, tree_model, lr_model, svm_model, xgb_model, rf_model, ab_model]\n", + "\n", + "\n", + "# Create the n_folds stratified folds\n", + "n_folds = 10\n", + "# Materialise the (train, test) index pairs so len(skf) and repeated iteration both work\n", + "skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))\n", + "\n", + "\n", + "# Allocate zero matrices for the out-of-fold features\n", + "dataset_blend_train = np.zeros((X.shape[0], len(clfs)))\n", + "dataset_blend_test = np.zeros((X_predict.shape[0], len(clfs)))\n", + "\n", + "# Train the base models\n", + "for j, clf in 
enumerate(clfs):\n", + "    '''Train each base model in turn'''\n", + "    # print(j, clf)\n", + "    dataset_blend_test_j = np.zeros((X_predict.shape[0], len(skf)))\n", + "    for i, (train, test) in enumerate(skf):\n", + "        '''Hold out fold i for prediction and train on the remaining folds; the held-out predictions become the new feature for fold i.'''\n", + "        # print(\"Fold\", i)\n", + "        X_train, y_train, X_test, y_test = X[train], y[train], X[test], y[test]\n", + "        clf.fit(X_train, y_train)\n", + "        y_submission = clf.predict_proba(X_test)[:, 1]\n", + "        dataset_blend_train[test, j] = y_submission\n", + "        dataset_blend_test_j[:, i] = clf.predict_proba(X_predict)[:, 1]\n", + "    '''For the test set, use the mean of the k fold models' predictions as the new feature.'''\n", + "    dataset_blend_test[:, j] = dataset_blend_test_j.mean(1)\n", + "\n", + "# Build the second-level model\n", + "from sklearn.linear_model import LogisticRegression\n", + "stacking_model = LogisticRegression(C=0.1, max_iter=100)\n", + "# Fixed: fit on the full label vector y; y_train only holds the last fold's labels\n", + "stacking_model.fit(dataset_blend_train, y)\n", + "\n", + "# Fixed: evaluate the stacking model itself (the original scored ab_model here)\n", + "scores = cross_val_score(stacking_model, dataset_blend_train, y, cv=10, scoring='roc_auc')\n", + "print(\"\\nStacking 10-fold cross-validation scores: \\n\", scores)\n", + "print(\"\\nStacking 10-fold cross-validation mean score: \\n\", np.mean(scores))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Exporting the results" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Saving results...\n", + " id sentiment\n", + "0 \"12311_10\" 1\n", + "1 \"8348_2\" 0\n", + "2 \"5828_4\" 1\n", + "3 \"7186_2\" 1\n", + "4 \"12128_7\" 1\n", + "5 \"2913_8\" 1\n", + "6 \"4396_1\" 0\n", + "7 \"395_2\" 0\n", + "8 \"10616_1\" 0\n", + "9 \"9074_9\" 1\n", + "Done.\n" + ] + }, + { + "data": { + "text/plain": [ + "'\\n1. Submitted the final result to Kaggle: AUC 0.85728, ranked around 300, roughly the 50% level\\n2. 
ngram_range = 3 (trigrams): AUC 0.85924\\n'" + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "test_predicted = np.array(model_NB.predict(corpus_tfidf[len_train:]))\n", + "print('Saving results...')\n", + "\n", + "import os\n", + "root_dir = \"/opt/data/kaggle/getting-started/word2vec-nlp-tutorial\"\n", + " \n", + "submission_df = pd.DataFrame(data={'id': test['id'], 'sentiment': test_predicted})\n", + "print(submission_df.head(10))\n", + "submission_df.to_csv(os.path.join(root_dir, 'submission_br.csv'), index=False)\n", + "\n", + "print('Done.')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using a CNN for text classification: https://zhuanlan.zhihu.com/p/26729228" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tokenize and collect all tokenized reviews" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "import re\n", + "from bs4 import BeautifulSoup\n", + "from nltk.corpus import stopwords\n", + "\n", + "stopwords = set(stopwords.words(\"english\"))\n", + "\n", + "def review_to_sentence(review):\n", + "    # Strip HTML, keep letters only, lowercase, and drop English stop words\n", + "    html = BeautifulSoup(review, \"html.parser\").get_text()\n", + "    text = re.sub('[^a-zA-Z]', ' ', html).strip()\n", + "    words = [word for word in text.lower().split() if word not in stopwords]\n", + "    return words\n", + "\n", + "unlabeled_train = pd.read_csv(os.path.join(data_dir, 'unlabeledTrainData.tsv'), header=0, delimiter=\"\\t\", quoting=3)\n", + "train_texts = pd.concat([train['review'], unlabeled_train['review']], axis=0, ignore_index=True)\n", + "sentences = list(map(review_to_sentence, train_texts))" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((75000,), 219, 84, 25000, 50000)" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": 
"execute_result" + } + ], + "source": [ + "np.shape(sentences), len(sentences[0]), len(sentences[1]), len(train['review']), len(unlabeled_train['review'])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Build a dictionary from the reviews" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from gensim import corpora\n", + "dictionary = corpora.Dictionary(sentences)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "# Register a blank token; its id is used later as the padding index\n", + "dictionary.add_documents([[\" \"]])" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 actual\n", + "1 alone\n", + "2 also\n", + "3 another\n", + "4 anyway\n" + ] + } + ], + "source": [ + "for key in dictionary.keys()[0:5]:\n", + "    print(key, dictionary[key])\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(123350, 123350, dict, dict)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(dictionary.token2id), len(dictionary.id2token), type(dictionary.token2id), type(dictionary.id2token)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(3, 'another', 123349)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "dictionary.token2id[\"another\"], dictionary.id2token[3], dictionary.token2id[\" \"]" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "import torch\n", + "import torch.nn as nn\n", + "import torch.nn.functional as F # activation functions live here\n", + "from torch.autograd import Variable\n", + "import torch.utils.data as Data\n", + "import torchvision # dataset module" + ] + }, + { + 
"cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1416 \n", + " 1416\n" + ] + } + ], + "source": [ + "# Longest tokenized review; every sequence is padded to this length\n", + "max_len = max([len(i) for i in sentences])\n", + "# Id of the blank padding token registered above\n", + "max_index = dictionary.token2id[\" \"]\n", + "# A full-length list of padding ids, sliced to fill short sequences\n", + "max_list = [max_index for x in range(max_len)]\n", + "print(max_len, \"\\n\", len(max_list))" + ] + }, + { + "cell_type": "code", + "execution_count": 362, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Variable containing:\n", + "1.00000e-02 *\n", + " 5.6823 -5.1981\n", + "[torch.FloatTensor of size 1x2]\n", + "\n" + ] + } + ], + "source": [ + "# # prepare_sequence converts a token sequence into a Variable of dictionary indices\n", + "# def prepare_sequence(seq):\n", + "#     idxs = [dictionary.token2id[w] for w in seq]\n", + "#     if len(idxs) < max_len:\n", + "#         idxs = idxs + max_list[len(idxs):]\n", + "#     # print('Dictionary index sequence:', idxs)\n", + "#     tensor = torch.LongTensor(idxs)\n", + "#     return Variable(tensor)\n", + "\n", + "# sentence_in = prepare_sequence(sentences[1383])\n", + "# # word_embeddings = nn.Embedding(len(dictionary.token2id), 5)\n", + "# # word_embeddings(sentence_in)\n", + "# x = cnn(sentence_in)\n", + "# print(x)" + ] + }, + { + "cell_type": "code", + "execution_count": 319, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "1416" + ] + }, + "execution_count": 319, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sentence_in.data.size(0)" + ] + }, + { + "cell_type": "code", + "execution_count": 320, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(['stuff', 'going', 'moment'], '\\n\\n', 1416)" + ] + }, + "execution_count": 320, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# sentences[0][:3], \"\\n\\n\", len(sentence_in)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Split the dataset" + ] + }, + { + "cell_type": "code", + 
"execution_count": 16, + "metadata": {}, + "outputs": [], + "source": [ + "# prepare_sequence maps a token list to dictionary indices and pads it to max_len\n", + "# with the blank-token index; this version returns a plain list, not a Variable\n", + "def prepare_sequence(seq):\n", + "    idxs = [dictionary.token2id[w] for w in seq]\n", + "    if len(idxs) < max_len:\n", + "        idxs = idxs + max_list[len(idxs):]\n", + "#     print('Dictionary index sequence:', idxs)\n", + "    return idxs\n", + "\n", + "\n", + "from sklearn.model_selection import train_test_split\n", + "\n", + "X_train = list(map(prepare_sequence, sentences[:len(train)]))\n", + "X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_train, label.tolist(), test_size=0.2, shuffle=True, random_state=42)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(list, list)" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "type(X_train_d), type(y_train_d)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((20000, 1416), (5000, 1416), (20000,), (5000,))" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.shape(X_train_d), np.shape(X_test_d), np.shape(y_train_d), np.shape(y_test_d)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "([100, 1380, 2330], [509, 58, 14209], [0, 0, 0], [0, 1, 0])" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_train_d[0][:3], X_test_d[0][:3], y_train_d[:3], y_test_d[:3]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [], + "source": [ + "class textCNN(nn.Module):\n", + " \n", + "    def __init__(self, vocab_size, embedding_dim, max_len, n_classes):\n", + "        super(textCNN, self).__init__()\n", + " \n", + "        self.model_name = 'alexnet'\n", + "        self.vocab_size = 
vocab_size\n", + "        self.embedding_dim = embedding_dim\n", + "        self.max_len = max_len\n", + " \n", + "        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)\n", + " \n", + "        self.features = nn.Sequential(\n", + "#             nn.Conv2d(1, 64, kernel_size=11, stride=4, padding=2),\n", + "#             nn.ReLU(inplace=True),\n", + "#             nn.MaxPool2d(kernel_size=3, stride=2),\n", + "\n", + "#             nn.Conv2d(64, 192, kernel_size=5, padding=2),\n", + "#             nn.ReLU(inplace=True),\n", + "#             nn.MaxPool2d(kernel_size=3, stride=2),\n", + "\n", + "#             nn.Conv2d(192, 384, kernel_size=3, padding=1),\n", + "#             nn.ReLU(inplace=True),\n", + "\n", + "#             nn.Conv2d(384, 256, kernel_size=3, padding=1),\n", + "#             nn.ReLU(inplace=True),\n", + "\n", + "#             nn.Conv2d(256, 256, kernel_size=3, padding=1),\n", + "#             nn.ReLU(inplace=True),\n", + "#             nn.MaxPool2d(kernel_size=3, stride=2),\n", + " \n", + "            # Shape comments assume max_len=1416 and embedding_dim=64, as used in this notebook\n", + "            nn.Conv2d(1, 16, 5, 1, 2), # (1, 1416, 64) -> (16, 1416, 64)\n", + "            nn.ReLU(), # activation\n", + "            nn.MaxPool2d(kernel_size=2, stride=2), # 2x2 max-pooling with stride 2 -> (16, 708, 32)\n", + " \n", + "            nn.Conv2d(16, 32, 5, 1, 2), # output shape (32, 708, 32)\n", + "            nn.ReLU(), # activation\n", + "            nn.MaxPool2d(2), # output shape (32, 354, 16)\n", + "\n", + "            nn.Conv2d(32, 64, 5, 1, 2), # output shape (64, 354, 16)\n", + "            nn.ReLU(),\n", + "            nn.MaxPool2d(2), # output shape (64, 177, 8)\n", + " \n", + "            nn.Conv2d(64, 128, 5, 1, 2), # output shape (128, 177, 8)\n", + "            nn.ReLU(),\n", + "            nn.MaxPool2d(2) # output shape (128, 88, 4)\n", + "        )\n", + " \n", + "        self.classifier = nn.Sequential(\n", + "#             nn.Dropout(),\n", + "#             nn.Linear(256 * 6 * 6, 4096),\n", + "#             nn.ReLU(inplace=True),\n", + "\n", + "#             nn.Dropout(),\n", + "#             nn.Linear(4096, 4096),\n", + "#             nn.ReLU(inplace=True),\n", + "\n", + "#             nn.Linear(4096, n_classes),\n", + " \n", + "            nn.Dropout(),\n", + "            nn.Linear(45056, n_classes), # 45056 = 128 * 88 * 4 flattened features\n", + "        )\n", + " \n", + "    def forward(self, x):\n", + "        x = self.word_embeddings(x)\n", + "        x = x.view(len(x), 1, self.max_len, self.embedding_dim)\n", + "        x = self.features(x)\n", + "        x = x.view(x.size(0), -1)\n", + "        output = self.classifier(x)\n", + "        return output # return 
x for visualization\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "textCNN(\n", + " (word_embeddings): Embedding(123350, 64)\n", + " (features): Sequential(\n", + " (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))\n", + " (1): ReLU()\n", + " (2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False)\n", + " (3): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))\n", + " (4): ReLU()\n", + " (5): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False)\n", + " (6): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))\n", + " (7): ReLU()\n", + " (8): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False)\n", + " (9): Conv2d(64, 128, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))\n", + " (10): ReLU()\n", + " (11): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1), ceil_mode=False)\n", + " )\n", + " (classifier): Sequential(\n", + " (0): Dropout(p=0.5)\n", + " (1): Linear(in_features=45056, out_features=2, bias=True)\n", + " )\n", + ")\n" + ] + } + ], + "source": [ + "cnn = textCNN(len(dictionary.token2id), 64, max_len, 2)\n", + "print(cnn) # net architecture\n", + "\n", + "# optimizer = torch.optim.SGD(cnn.parameters(), lr=0.02) # pass in all of the net's parameters and the learning rate\n", + "# lr: optimizer step size\n", + "# weight_decay: also known as L2 regularization (1e-5 means 1 * 10^-5, i.e. 0.00001)\n", + "optimizer = torch.optim.Adam(cnn.parameters(), lr=1e-5, weight_decay=1e-7)\n", + "# When computing the loss, note that the target is NOT 
one-hot 形式的, 而是1D Tensor, (batch,)\n", + "# 但是预测值是2D tensor (batch, n_classes)\n", + "loss_func = nn.CrossEntropyLoss()" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [], + "source": [ + "import time\n", + "import math\n", + "\n", + "def timeSince(since):\n", + " now = time.time()\n", + " s = now - since\n", + " m = math.floor(s / 60)\n", + " s -= m * 60\n", + " return '%dm %ds' % (m, s)\n", + "\n", + "start = time.time()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [], + "source": [ + "tr_x = torch.LongTensor(X_train_d)\n", + "tr_y = torch.LongTensor(y_train_d)\n", + "te_x = torch.LongTensor(X_test_d)\n", + "te_y = torch.LongTensor(y_test_d)\n", + "\n", + "torch_train_dataset = Data.TensorDataset(data_tensor=tr_x, target_tensor=tr_y)\n", + "torch_test_dataset = Data.TensorDataset(data_tensor=te_x, target_tensor=te_y)\n", + "\n", + "BATCH_SIZE = 20 # 批训练的数据个数\n", + "\n", + "# 把 dataset 放入 DataLoader\n", + "train_loader = Data.DataLoader(\n", + " dataset=torch_train_dataset, # torch TensorDataset format\n", + " batch_size=BATCH_SIZE, # 每个 batch 加载多少个样本\n", + " shuffle=True, # 要不要打乱数据 (打乱比较好)\n", + " num_workers=2, # 多线程来读数据\n", + ")\n", + "test_loader = Data.DataLoader(\n", + " dataset=torch_test_dataset, # torch TensorDataset format\n", + " batch_size=BATCH_SIZE, # 每个 batch 加载多少个样本\n", + " shuffle=True, # 要不要打乱数据 (打乱比较好)\n", + " num_workers=2, # 多线程来读数据\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1]\n", + "target_y:\t [0 1 1 1 0 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0]\n", + "0-19 7.60% (101m 42s) logloss=12.09 \t accuracy=0.65 \t loss=0.6844480037689209\n", + "pred_y:\t [1 1 1 1 1 
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [1 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 1 0]\n", + "0-39 15.60% (103m 24s) logloss=15.54 \t accuracy=0.55 \t loss=0.6886069774627686\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 0 0 1]\n", + "0-59 23.60% (106m 31s) logloss=12.09 \t accuracy=0.65 \t loss=0.6793051958084106\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 0 1 0 0 1]\n", + "0-79 31.60% (114m 37s) logloss=22.45 \t accuracy=0.35 \t loss=0.6999103426933289\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 0 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1 0]\n", + "0-99 39.60% (120m 29s) logloss=20.72 \t accuracy=0.40 \t loss=0.7069584131240845\n", + "pred_y:\t [1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1]\n", + "0-119 47.60% (125m 5s) logloss=19.00 \t accuracy=0.45 \t loss=0.7006260752677917\n", + "pred_y:\t [0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 1]\n", + "target_y:\t [0 1 1 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0]\n", + "0-139 55.60% (132m 49s) logloss=25.90 \t accuracy=0.25 \t loss=0.7005825638771057\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0]\n", + "0-159 63.60% (140m 33s) logloss=24.18 \t accuracy=0.30 \t loss=0.699988067150116\n", + "pred_y:\t [1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0]\n", + "target_y:\t [1 1 1 1 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 1]\n", + "0-179 71.60% (148m 14s) logloss=15.54 \t accuracy=0.55 \t loss=0.689541220664978\n", + "pred_y:\t [1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 0 1 0 1 1]\n", + "target_y:\t [0 1 0 0 1 0 0 0 0 1 1 1 0 0 1 1 1 1 0 1]\n", + "0-199 79.60% (154m 37s) logloss=19.00 \t accuracy=0.45 \t loss=0.6907564401626587\n", + "pred_y:\t [1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1]\n", + "target_y:\t [0 0 1 1 0 0 0 1 1 0 0 0 1 1 1 0 1 1 0 0]\n", + "0-219 87.60% (160m 44s) 
logloss=24.18 \t accuracy=0.30 \t loss=0.6987239122390747\n", + "pred_y:\t [1 0 1 1 0 1 1 1 1 1 0 1 1 1 0 0 0 1 1 1]\n", + "target_y:\t [0 1 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 0 1 1]\n", + "0-239 95.60% (167m 26s) logloss=19.00 \t accuracy=0.45 \t loss=0.691254734992981\n", + "pred_y:\t [0 1 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1 1 1 1]\n", + "target_y:\t [1 0 1 0 1 1 0 1 1 0 1 1 1 0 1 1 1 0 0 0]\n", + "0-259 103.60% (174m 38s) logloss=13.82 \t accuracy=0.60 \t loss=0.6896542310714722\n", + "pred_y:\t [0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0]\n", + "target_y:\t [1 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 1]\n", + "0-279 111.60% (181m 56s) logloss=12.09 \t accuracy=0.65 \t loss=0.6853312253952026\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0]\n", + "target_y:\t [1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 0 0 1 1 1]\n", + "0-299 119.60% (188m 49s) logloss=25.90 \t accuracy=0.25 \t loss=0.7102211713790894\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 1 1 1]\n", + "0-319 127.60% (196m 8s) logloss=12.09 \t accuracy=0.65 \t loss=0.6825012564659119\n", + "pred_y:\t [0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 0 0]\n", + "target_y:\t [0 0 0 1 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 0]\n", + "0-339 135.60% (203m 21s) logloss=17.27 \t accuracy=0.50 \t loss=0.693179190158844\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 0]\n", + "target_y:\t [0 1 0 1 1 0 0 1 0 1 0 0 0 0 0 1 1 0 0 1]\n", + "0-359 143.60% (210m 48s) logloss=13.82 \t accuracy=0.60 \t loss=0.6911899447441101\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 1 0 1]\n", + "0-379 151.60% (218m 11s) logloss=10.36 \t accuracy=0.70 \t loss=0.6836111545562744\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 0 0 0 1 0 0 1 1 0 1 0 0 1 1 1 0]\n", + "0-399 159.60% (225m 11s) logloss=17.27 \t accuracy=0.50 \t loss=0.6962472200393677\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 
0]\n", + "target_y:\t [0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 0 1 1]\n", + "0-419 167.60% (232m 36s) logloss=17.27 \t accuracy=0.50 \t loss=0.6901986598968506\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]\n", + "target_y:\t [1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1]\n", + "0-439 175.60% (240m 1s) logloss=8.63 \t accuracy=0.75 \t loss=0.6787502765655518\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 1 0 1]\n", + "0-459 183.60% (247m 31s) logloss=13.82 \t accuracy=0.60 \t loss=0.694130003452301\n", + "pred_y:\t [0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 1 1 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 0]\n", + "0-479 191.60% (254m 5s) logloss=15.54 \t accuracy=0.55 \t loss=0.6915131211280823\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0]\n", + "target_y:\t [0 0 1 1 1 0 0 0 1 1 1 1 1 1 0 1 0 1 0 1]\n", + "0-499 199.60% (260m 32s) logloss=20.72 \t accuracy=0.40 \t loss=0.696361243724823\n", + "pred_y:\t [0 0 0 0 1 1 0 0 0 1 0 0 1 0 0 1 1 1 0 1]\n", + "target_y:\t [0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 1 1]\n", + "0-519 207.60% (267m 39s) logloss=15.54 \t accuracy=0.55 \t loss=0.6885088086128235\n", + "pred_y:\t [0 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1]\n", + "target_y:\t [1 0 0 0 1 0 1 1 1 0 1 0 0 0 0 1 1 0 0 0]\n", + "0-539 215.60% (275m 2s) logloss=19.00 \t accuracy=0.45 \t loss=0.6945030093193054\n", + "pred_y:\t [1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 0]\n", + "target_y:\t [1 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 1 1 1 0]\n", + "0-559 223.60% (281m 12s) logloss=13.82 \t accuracy=0.60 \t loss=0.6861685514450073\n", + "pred_y:\t [1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1]\n", + "target_y:\t [0 1 1 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0]\n", + "0-579 231.60% (287m 12s) logloss=22.45 \t accuracy=0.35 \t loss=0.6945004463195801\n", + "pred_y:\t [0 1 0 0 1 1 1 1 0 1 1 1 0 0 0 1 0 1 0 1]\n", + "target_y:\t [1 0 1 0 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1]\n", + "0-599 239.60% (293m 44s) logloss=19.00 \t 
accuracy=0.45 \t loss=0.6927602887153625\n", + "pred_y:\t [1 1 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 0 1 0]\n", + "target_y:\t [0 0 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 0 0 1]\n", + "0-619 247.60% (300m 38s) logloss=12.09 \t accuracy=0.65 \t loss=0.6841057538986206\n", + "pred_y:\t [1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1]\n", + "target_y:\t [1 1 0 1 0 1 1 1 1 0 0 1 0 0 0 1 0 1 0 0]\n", + "0-639 255.60% (307m 59s) logloss=17.27 \t accuracy=0.50 \t loss=0.6867891550064087\n", + "pred_y:\t [1 0 0 1 0 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0]\n", + "target_y:\t [1 0 1 0 0 1 1 1 0 1 0 0 1 0 0 0 1 1 0 0]\n", + "0-659 263.60% (315m 15s) logloss=20.72 \t accuracy=0.40 \t loss=0.697279691696167\n", + "pred_y:\t [0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1 1 0 0]\n", + "target_y:\t [1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 0 1 1 0]\n", + "0-679 271.60% (322m 12s) logloss=24.18 \t accuracy=0.30 \t loss=0.6973448991775513\n", + "pred_y:\t [1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1]\n", + "target_y:\t [1 1 1 1 0 1 1 0 0 0 1 1 1 1 1 0 0 0 0 0]\n", + "0-699 279.60% (329m 20s) logloss=17.27 \t accuracy=0.50 \t loss=0.693761944770813\n", + "pred_y:\t [0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0]\n", + "target_y:\t [1 0 1 1 0 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1]\n", + "0-719 287.60% (335m 57s) logloss=17.27 \t accuracy=0.50 \t loss=0.6938427686691284\n", + "pred_y:\t [0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0]\n", + "target_y:\t [1 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 1 0]\n", + "0-739 295.60% (342m 53s) logloss=17.27 \t accuracy=0.50 \t loss=0.6820307970046997\n", + "pred_y:\t [0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0 0]\n", + "0-759 303.60% (350m 7s) logloss=15.54 \t accuracy=0.55 \t loss=0.6916964650154114\n", + "pred_y:\t [0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0]\n", + "target_y:\t [1 1 1 0 0 1 1 1 1 0 1 1 1 1 1 1 0 0 0 0]\n", + "0-779 311.60% (357m 5s) logloss=20.72 \t accuracy=0.40 \t loss=0.6954394578933716\n", + "pred_y:\t [0 1 0 1 1 1 0 1 1 1 0 1 0 0 1 0 1 0 1 0]\n", + 
"target_y:\t [0 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 1 0 0 0]\n", + "0-799 319.60% (364m 23s) logloss=20.72 \t accuracy=0.40 \t loss=0.7011473178863525\n", + "pred_y:\t [0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 0 1 0]\n", + "target_y:\t [0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 1 1 1 0 1]\n", + "0-819 327.60% (371m 51s) logloss=19.00 \t accuracy=0.45 \t loss=0.6923590898513794\n", + "pred_y:\t [0 1 1 0 1 0 1 1 1 0 1 1 0 0 1 1 0 1 1 1]\n", + "target_y:\t [0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 1 0 1 1]\n", + "0-839 335.60% (379m 18s) logloss=20.72 \t accuracy=0.40 \t loss=0.7093911170959473\n", + "pred_y:\t [1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 1 0]\n", + "target_y:\t [0 0 0 0 0 1 1 1 0 0 0 1 1 1 1 1 0 1 0 1]\n", + "0-859 343.60% (386m 37s) logloss=17.27 \t accuracy=0.50 \t loss=0.6935914158821106\n", + "pred_y:\t [1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1]\n", + "target_y:\t [1 1 1 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0]\n", + "0-879 351.60% (394m 6s) logloss=17.27 \t accuracy=0.50 \t loss=0.6971246004104614\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 1]\n", + "target_y:\t [1 1 0 1 0 1 1 1 0 0 1 0 0 0 0 1 0 1 0 1]\n", + "0-899 359.60% (401m 57s) logloss=13.82 \t accuracy=0.60 \t loss=0.7006368637084961\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pred_y:\t [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0]\n", + "target_y:\t [1 0 0 0 0 0 1 1 0 1 1 1 0 0 1 0 0 0 0 1]\n", + "0-919 367.60% (409m 9s) logloss=13.82 \t accuracy=0.60 \t loss=0.684188961982727\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 1 1 1 0 0 1 1 0 0 0 1 1 1 0 1 0 0 0]\n", + "0-939 375.60% (416m 38s) logloss=17.27 \t accuracy=0.50 \t loss=0.7021096348762512\n", + "pred_y:\t [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 1 1 0 1 0 0 1 1 1 1 1 1 1 0 0 1]\n", + "0-959 383.60% (424m 14s) logloss=20.72 \t accuracy=0.40 \t loss=0.6950109004974365\n", + "pred_y:\t [0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 1 1 1 1 1 1 0 1 0 
0 0 0 1 0 0]\n", + "0-979 391.60% (431m 21s) logloss=17.27 \t accuracy=0.50 \t loss=0.6897827982902527\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 1]\n", + "target_y:\t [0 0 0 1 1 1 1 0 1 1 1 0 0 1 0 1 0 1 1 0]\n", + "0-999 399.60% (438m 31s) logloss=15.54 \t accuracy=0.55 \t loss=0.6863324642181396\n", + "pred_y:\t [0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 1 1 1 0 0]\n", + "1-69 27.60% (447m 6s) logloss=17.27 \t accuracy=0.50 \t loss=0.6887694597244263\n", + "pred_y:\t [1 0 1 1 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0]\n", + "target_y:\t [1 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0]\n", + "1-89 35.60% (455m 24s) logloss=6.91 \t accuracy=0.80 \t loss=0.6829289197921753\n", + "pred_y:\t [0 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0]\n", + "target_y:\t [0 1 1 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 1 1]\n", + "1-109 43.60% (463m 55s) logloss=13.82 \t accuracy=0.60 \t loss=0.6840181350708008\n", + "pred_y:\t [0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0]\n", + "target_y:\t [0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1]\n", + "1-129 51.60% (472m 25s) logloss=3.45 \t accuracy=0.90 \t loss=0.6634591817855835\n", + "pred_y:\t [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1]\n", + "1-149 59.60% (481m 20s) logloss=15.54 \t accuracy=0.55 \t loss=0.6897100210189819\n", + "pred_y:\t [0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 0]\n", + "1-169 67.60% (490m 7s) logloss=12.09 \t accuracy=0.65 \t loss=0.6846562623977661\n", + "pred_y:\t [0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 0 1 1 0 1 1 1 0 0 1 1 0 1 0 0 0 0 1]\n", + "1-189 75.60% (498m 56s) logloss=12.09 \t accuracy=0.65 \t loss=0.6853525638580322\n", + "pred_y:\t [0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [1 0 1 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1]\n", + "1-209 83.60% (507m 31s) logloss=19.00 \t accuracy=0.45 \t loss=0.7002253532409668\n", + "pred_y:\t [1 
0 1 0 1 0 0 1 0 1 1 0 1 1 1 0 1 0 1 0]\n", + "target_y:\t [1 1 0 1 0 0 0 0 1 1 1 0 0 0 1 0 1 0 1 0]\n", + "1-229 91.60% (516m 28s) logloss=13.82 \t accuracy=0.60 \t loss=0.6863614320755005\n", + "pred_y:\t [0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 0 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 1]\n", + "1-249 99.60% (522m 43s) logloss=19.00 \t accuracy=0.45 \t loss=0.6914201974868774\n", + "pred_y:\t [0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1]\n", + "target_y:\t [1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 1 1 0 1]\n", + "1-269 107.60% (524m 30s) logloss=17.27 \t accuracy=0.50 \t loss=0.6946980953216553\n", + "pred_y:\t [1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1]\n", + "target_y:\t [0 0 0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 1 0 1]\n", + "1-289 115.60% (526m 16s) logloss=19.00 \t accuracy=0.45 \t loss=0.7042781114578247\n", + "pred_y:\t [0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 0 1]\n", + "target_y:\t [1 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 1 0]\n", + "1-309 123.60% (530m 35s) logloss=19.00 \t accuracy=0.45 \t loss=0.6874731183052063\n", + "pred_y:\t [1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1]\n", + "target_y:\t [1 0 1 1 0 0 0 1 0 1 1 1 0 1 0 1 1 0 0 1]\n", + "1-329 131.60% (538m 20s) logloss=12.09 \t accuracy=0.65 \t loss=0.6883570551872253\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [1 0 1 1 0 1 0 1 0 0 0 1 1 0 0 1 0 0 1 0]\n", + "1-349 139.60% (546m 2s) logloss=19.00 \t accuracy=0.45 \t loss=0.6977750062942505\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1]\n", + "target_y:\t [0 0 0 1 0 1 1 0 0 1 1 1 0 0 1 0 0 0 0 0]\n", + "1-369 147.60% (553m 42s) logloss=17.27 \t accuracy=0.50 \t loss=0.6937034130096436\n", + "pred_y:\t [0 0 1 0 0 1 1 0 1 0 1 0 0 1 0 0 1 1 0 1]\n", + "target_y:\t [1 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 1]\n", + "1-389 155.60% (561m 27s) logloss=13.82 \t accuracy=0.60 \t loss=0.691476583480835\n", + "pred_y:\t [0 0 0 1 0 1 0 1 1 0 1 1 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 0 1 0 1 0 0 1 0 0 0 1 0 1 1 1 1]\n", + "1-409 
163.60% (564m 49s) logloss=13.82 \t accuracy=0.60 \t loss=0.6903079152107239\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [0 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0]\n", + "1-429 171.60% (566m 34s) logloss=12.09 \t accuracy=0.65 \t loss=0.6823988556861877\n", + "pred_y:\t [0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0]\n", + "target_y:\t [0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 1]\n", + "1-449 179.60% (568m 21s) logloss=10.36 \t accuracy=0.70 \t loss=0.6680271029472351\n", + "pred_y:\t [1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1]\n", + "1-469 187.60% (575m 33s) logloss=24.18 \t accuracy=0.30 \t loss=0.708787739276886\n", + "pred_y:\t [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1]\n", + "target_y:\t [1 0 0 1 0 1 0 1 0 1 0 1 0 0 0 0 1 0 1 0]\n", + "1-489 195.60% (583m 20s) logloss=15.54 \t accuracy=0.55 \t loss=0.6908186674118042\n", + "pred_y:\t [0 0 0 0 1 1 0 0 1 0 1 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [1 0 0 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 0]\n", + "1-509 203.60% (591m 14s) logloss=15.54 \t accuracy=0.55 \t loss=0.6828701496124268\n", + "pred_y:\t [0 0 1 0 0 0 0 0 1 1 0 0 1 1 0 0 0 1 1 1]\n", + "target_y:\t [1 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0]\n", + "1-529 211.60% (598m 51s) logloss=22.45 \t accuracy=0.35 \t loss=0.7000082731246948\n", + "pred_y:\t [1 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0]\n", + "target_y:\t [1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 1 1 1 0 0]\n", + "1-549 219.60% (606m 39s) logloss=17.27 \t accuracy=0.50 \t loss=0.6983622312545776\n", + "pred_y:\t [1 1 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0 0 1 0]\n", + "target_y:\t [1 0 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 0]\n", + "1-569 227.60% (614m 32s) logloss=20.72 \t accuracy=0.40 \t loss=0.6938793063163757\n", + "pred_y:\t [1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0]\n", + "target_y:\t [1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0]\n", + "1-589 235.60% (622m 28s) logloss=13.82 \t accuracy=0.60 \t loss=0.681753396987915\n", + "pred_y:\t [0 0 1 0 0 0 0 0 
1 0 0 0 1 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 0 1 1 0 0 1 0 1 1 1 1 1 0 1 1 0 0 0]\n", + "1-609 243.60% (626m 59s) logloss=22.45 \t accuracy=0.35 \t loss=0.7026186585426331\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]\n", + "target_y:\t [0 1 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1]\n", + "1-629 251.60% (629m 0s) logloss=17.27 \t accuracy=0.50 \t loss=0.6928611993789673\n", + "pred_y:\t [1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 1 1 0 0 1 1 0 0 0 1 0 0 1 0 1 0 0]\n", + "1-649 259.60% (630m 48s) logloss=12.09 \t accuracy=0.65 \t loss=0.6916486620903015\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 1 0 0 1 0]\n", + "1-669 267.60% (632m 20s) logloss=15.54 \t accuracy=0.55 \t loss=0.6855508685112\n", + "pred_y:\t [0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0]\n", + "target_y:\t [0 1 0 1 1 1 0 1 1 0 0 1 1 0 0 0 0 0 1 1]\n", + "1-689 275.60% (633m 51s) logloss=13.82 \t accuracy=0.60 \t loss=0.6923184990882874\n", + "pred_y:\t [1 0 0 0 0 0 0 1 1 1 0 1 1 1 0 0 1 1 1 1]\n", + "target_y:\t [0 0 0 1 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 1]\n", + "1-709 283.60% (637m 49s) logloss=19.00 \t accuracy=0.45 \t loss=0.6926537752151489\n", + "pred_y:\t [0 0 1 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0]\n", + "target_y:\t [1 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 1 1]\n", + "1-729 291.60% (643m 55s) logloss=17.27 \t accuracy=0.50 \t loss=0.6858220100402832\n", + "pred_y:\t [1 0 1 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 0]\n", + "target_y:\t [1 0 0 0 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 0]\n", + "1-749 299.60% (650m 27s) logloss=15.54 \t accuracy=0.55 \t loss=0.68160480260849\n", + "pred_y:\t [1 0 0 1 1 1 0 0 0 1 1 1 0 1 0 1 0 0 1 0]\n", + "target_y:\t [1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 0 1 1 0 0]\n", + "1-769 307.60% (657m 30s) logloss=17.27 \t accuracy=0.50 \t loss=0.6910548806190491\n", + "pred_y:\t [0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 1 0 1 0 1]\n", + "target_y:\t [0 1 1 0 0 1 1 1 0 0 0 1 0 0 1 0 0 1 0 1]\n", + "1-789 315.60% (662m 
46s) logloss=13.82 \t accuracy=0.60 \t loss=0.6959726810455322\n", + "pred_y:\t [0 1 1 0 0 1 1 0 0 1 1 0 1 1 1 1 1 0 1 1]\n", + "target_y:\t [1 1 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 1 0 0]\n", + "1-809 323.60% (665m 22s) logloss=20.72 \t accuracy=0.40 \t loss=0.6980509161949158\n", + "pred_y:\t [1 1 0 1 1 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0]\n", + "target_y:\t [0 0 1 0 1 0 1 1 0 1 0 0 0 1 1 1 0 1 0 1]\n", + "1-829 331.60% (667m 52s) logloss=22.45 \t accuracy=0.35 \t loss=0.7028436064720154\n", + "pred_y:\t [1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1]\n", + "target_y:\t [1 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0 1 1 1 1]\n", + "1-849 339.60% (670m 21s) logloss=13.82 \t accuracy=0.60 \t loss=0.6839519739151001\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pred_y:\t [1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 1 1 1 1 0 1 1 0 0 1 1 0 0 0 1 1 0 1]\n", + "1-869 347.60% (672m 56s) logloss=17.27 \t accuracy=0.50 \t loss=0.6827632188796997\n", + "pred_y:\t [1 0 0 1 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0]\n", + "target_y:\t [1 0 0 0 1 0 1 1 0 1 1 1 0 1 0 1 1 0 1 0]\n", + "1-889 355.60% (675m 13s) logloss=19.00 \t accuracy=0.45 \t loss=0.6897896528244019\n", + "pred_y:\t [0 0 0 1 1 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0]\n", + "target_y:\t [0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0]\n", + "1-909 363.60% (677m 28s) logloss=15.54 \t accuracy=0.55 \t loss=0.6818998456001282\n", + "pred_y:\t [0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0]\n", + "target_y:\t [0 1 1 1 0 0 1 0 1 0 0 1 0 0 0 1 0 0 0 1]\n", + "1-929 371.60% (680m 31s) logloss=12.09 \t accuracy=0.65 \t loss=0.6927092671394348\n", + "pred_y:\t [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]\n", + "target_y:\t [0 0 0 1 1 0 0 1 1 0 1 1 1 0 1 1 0 0 1 0]\n", + "1-949 379.60% (685m 4s) logloss=19.00 \t accuracy=0.45 \t loss=0.69705730676651\n", + "pred_y:\t [0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 1 0 0]\n", + "target_y:\t [0 0 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0]\n", + "1-969 387.60% (687m 23s) logloss=19.00 \t accuracy=0.45 \t 
loss=0.6987588405609131\n", + "pred_y:\t [0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0]\n", + "target_y:\t [0 1 0 0 1 0 0 0 1 0 1 0 1 1 1 1 0 1 1 1]\n", + "1-989 395.60% (689m 42s) logloss=13.82 \t accuracy=0.60 \t loss=0.6846868991851807\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 0 1 0 1 0 1 0 1 1 0 1 0 1 0 1 0 1 0]\n", + "1-1009 403.60% (692m 1s) logloss=15.54 \t accuracy=0.55 \t loss=0.6894447803497314\n", + "pred_y:\t [0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0]\n", + "target_y:\t [1 1 1 0 0 1 1 1 0 0 1 0 0 0 0 1 1 1 1 1]\n", + "1-1029 411.60% (694m 35s) logloss=20.72 \t accuracy=0.40 \t loss=0.7035297751426697\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0]\n", + "target_y:\t [1 0 1 1 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1 0]\n", + "1-1049 419.60% (696m 57s) logloss=20.72 \t accuracy=0.40 \t loss=0.6977611780166626\n", + "pred_y:\t [1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 1 0 1 1 0]\n", + "target_y:\t [0 0 1 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1]\n", + "2-119 47.60% (699m 40s) logloss=15.54 \t accuracy=0.55 \t loss=0.6885925531387329\n", + "pred_y:\t [0 1 0 0 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 1]\n", + "target_y:\t [0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0]\n", + "2-139 55.60% (702m 1s) logloss=19.00 \t accuracy=0.45 \t loss=0.6926258206367493\n", + "pred_y:\t [1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0]\n", + "target_y:\t [1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 0 0]\n", + "2-159 63.60% (704m 42s) logloss=10.36 \t accuracy=0.70 \t loss=0.6825841069221497\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0]\n", + "target_y:\t [0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 1]\n", + "2-179 71.60% (707m 7s) logloss=15.54 \t accuracy=0.55 \t loss=0.6942364573478699\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0]\n", + "2-199 79.60% (709m 45s) logloss=15.54 \t accuracy=0.55 \t loss=0.6828181147575378\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]\n", + "target_y:\t [1 0 1 0 0 1 
1 1 0 0 1 0 1 1 1 0 1 1 1 0]\n", + "2-219 87.60% (711m 43s) logloss=19.00 \t accuracy=0.45 \t loss=0.697152853012085\n", + "pred_y:\t [1 0 0 1 1 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0]\n", + "target_y:\t [0 1 0 1 1 1 1 1 0 1 0 1 0 0 0 0 1 1 1 0]\n", + "2-239 95.60% (714m 15s) logloss=22.45 \t accuracy=0.35 \t loss=0.6952451467514038\n", + "pred_y:\t [0 0 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0]\n", + "target_y:\t [0 0 1 1 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0]\n", + "2-259 103.60% (716m 52s) logloss=19.00 \t accuracy=0.45 \t loss=0.6940121054649353\n", + "pred_y:\t [0 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1]\n", + "target_y:\t [1 1 1 1 0 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1]\n", + "2-279 111.60% (719m 27s) logloss=17.27 \t accuracy=0.50 \t loss=0.694961667060852\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1]\n", + "target_y:\t [1 0 1 1 1 0 1 0 1 0 1 0 0 0 1 0 1 0 0 0]\n", + "2-299 119.60% (721m 56s) logloss=17.27 \t accuracy=0.50 \t loss=0.6933314204216003\n", + "pred_y:\t [1 1 1 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1]\n", + "target_y:\t [0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0]\n", + "2-319 127.60% (724m 17s) logloss=22.45 \t accuracy=0.35 \t loss=0.7001301646232605\n", + "pred_y:\t [1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0]\n", + "2-339 135.60% (726m 48s) logloss=20.72 \t accuracy=0.40 \t loss=0.6952366828918457\n", + "pred_y:\t [0 1 0 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0]\n", + "target_y:\t [0 1 0 1 0 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0]\n", + "2-359 143.60% (729m 14s) logloss=15.54 \t accuracy=0.55 \t loss=0.690677285194397\n", + "pred_y:\t [0 1 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0]\n", + "target_y:\t [1 0 0 0 0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0]\n", + "2-379 151.60% (731m 24s) logloss=19.00 \t accuracy=0.45 \t loss=0.7013611793518066\n", + "pred_y:\t [1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 1 0 1 0 1]\n", + "target_y:\t [1 0 0 1 1 1 0 1 1 0 0 0 0 1 0 1 0 1 1 1]\n", + "2-399 159.60% (733m 11s) logloss=17.27 \t accuracy=0.50 \t 
loss=0.6919983625411987\n", + "pred_y:\t [0 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 0 0 0 0 1 1 1 1 0 1 0 0 1 0 0 0 0 0]\n", + "2-419 167.60% (735m 6s) logloss=15.54 \t accuracy=0.55 \t loss=0.688313364982605\n", + "pred_y:\t [0 1 0 0 1 1 1 1 0 0 1 1 1 0 1 0 1 0 1 0]\n", + "target_y:\t [0 1 0 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 0 0]\n", + "2-439 175.60% (737m 3s) logloss=12.09 \t accuracy=0.65 \t loss=0.684729814529419\n", + "pred_y:\t [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1]\n", + "target_y:\t [1 0 0 1 1 1 1 0 0 1 1 0 0 0 1 0 1 1 1 0]\n", + "2-459 183.60% (738m 51s) logloss=17.27 \t accuracy=0.50 \t loss=0.6940157413482666\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 0 1 1 0 0 1 0 1 0 1 0 1 0 1 0 0 1 1]\n", + "2-479 191.60% (740m 34s) logloss=19.00 \t accuracy=0.45 \t loss=0.6916437149047852\n", + "pred_y:\t [0 0 0 1 0 0 0 0 1 1 1 0 1 1 0 0 0 0 0 0]\n", + "target_y:\t [1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 0 1 0]\n", + "2-499 199.60% (742m 17s) logloss=22.45 \t accuracy=0.35 \t loss=0.700393795967102\n", + "pred_y:\t [0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1]\n", + "target_y:\t [0 0 0 1 0 0 1 1 1 0 1 1 0 1 0 0 1 0 0 0]\n", + "2-519 207.60% (746m 17s) logloss=12.09 \t accuracy=0.65 \t loss=0.6829585433006287\n", + "pred_y:\t [0 1 0 1 1 0 1 0 1 1 1 0 1 0 0 0 1 1 1 0]\n", + "target_y:\t [1 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 1 1]\n", + "2-539 215.60% (748m 52s) logloss=12.09 \t accuracy=0.65 \t loss=0.6759796738624573\n", + "pred_y:\t [0 0 0 0 0 1 1 1 0 1 0 0 1 1 0 1 0 1 1 1]\n", + "target_y:\t [1 1 1 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1]\n", + "2-559 223.60% (751m 30s) logloss=15.54 \t accuracy=0.55 \t loss=0.6874823570251465\n", + "pred_y:\t [0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 0 1 1 0]\n", + "target_y:\t [1 0 0 0 1 1 1 0 0 1 0 0 1 1 0 1 1 0 1 1]\n", + "2-579 231.60% (754m 2s) logloss=22.45 \t accuracy=0.35 \t loss=0.6899620890617371\n", + "pred_y:\t [0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0]\n", + "target_y:\t [1 1 1 0 1 1 
0 0 0 0 0 1 0 0 1 0 0 1 0 1]\n", + "2-599 239.60% (756m 31s) logloss=24.18 \t accuracy=0.30 \t loss=0.7059974670410156\n", + "pred_y:\t [0 1 0 0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 0 0]\n", + "target_y:\t [1 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0]\n", + "2-619 247.60% (759m 4s) logloss=13.82 \t accuracy=0.60 \t loss=0.6885284781455994\n", + "pred_y:\t [0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0]\n", + "target_y:\t [0 0 1 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 1 0]\n", + "2-639 255.60% (761m 23s) logloss=12.09 \t accuracy=0.65 \t loss=0.6848553419113159\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 1 1 1 0 1 1 0 1 0 1 0 1 1 0 1 0 0 1 0]\n", + "2-659 263.60% (763m 50s) logloss=19.00 \t accuracy=0.45 \t loss=0.6990257501602173\n", + "pred_y:\t [0 0 0 0 1 1 0 0 0 0 1 0 1 1 0 1 0 0 0 0]\n", + "target_y:\t [0 1 0 1 1 0 1 1 0 0 1 0 0 1 1 0 0 1 1 1]\n", + "2-679 271.60% (766m 23s) logloss=19.00 \t accuracy=0.45 \t loss=0.696912407875061\n", + "pred_y:\t [1 0 1 1 1 0 1 1 1 0 1 0 0 0 1 1 1 1 1 0]\n", + "target_y:\t [1 0 0 1 0 1 0 0 0 0 1 0 0 1 0 1 1 1 0 1]\n", + "2-699 279.60% (768m 59s) logloss=17.27 \t accuracy=0.50 \t loss=0.6896503567695618\n", + "pred_y:\t [0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 0]\n", + "target_y:\t [0 1 0 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1]\n", + "2-719 287.60% (771m 32s) logloss=22.45 \t accuracy=0.35 \t loss=0.7048081159591675\n", + "pred_y:\t [0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 1 1 0 0 0 1 1 0 0 1 1 0 1 0 0 0 1 1]\n", + "2-739 295.60% (773m 50s) logloss=19.00 \t accuracy=0.45 \t loss=0.6990195512771606\n", + "pred_y:\t [1 0 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0]\n", + "target_y:\t [1 0 1 1 0 0 0 1 0 1 0 1 1 0 0 0 1 0 0 0]\n", + "2-759 303.60% (776m 10s) logloss=19.00 \t accuracy=0.45 \t loss=0.6988271474838257\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 1 0 1 0 0 0 0 0 0 1 1 0 0 0 1 0 0]\n", + "2-779 311.60% (783m 5s) logloss=6.91 \t accuracy=0.80 \t 
loss=0.6714716553688049\n", + "pred_y:\t [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0]\n", + "target_y:\t [0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 1 1]\n", + "2-799 319.60% (785m 51s) logloss=13.82 \t accuracy=0.60 \t loss=0.6856887340545654\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pred_y:\t [0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 0 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 1]\n", + "2-819 327.60% (788m 46s) logloss=19.00 \t accuracy=0.45 \t loss=0.7051876783370972\n", + "pred_y:\t [1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 0 0 1 1 1 1 0 1 1 0 1 1 1 0 0 1]\n", + "2-839 335.60% (791m 22s) logloss=15.54 \t accuracy=0.55 \t loss=0.6883013844490051\n", + "pred_y:\t [1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0]\n", + "target_y:\t [0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 1 0 1 0]\n", + "2-859 343.60% (793m 51s) logloss=19.00 \t accuracy=0.45 \t loss=0.6974400877952576\n", + "pred_y:\t [0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 1]\n", + "target_y:\t [0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 1]\n", + "2-879 351.60% (796m 54s) logloss=5.18 \t accuracy=0.85 \t loss=0.6511715650558472\n", + "pred_y:\t [1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 0 1 1 0 1 1 0 1 0 0 0 1 0 0 1 1 1 1]\n", + "2-899 359.60% (800m 12s) logloss=17.27 \t accuracy=0.50 \t loss=0.6897764801979065\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0]\n", + "target_y:\t [0 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 1 1]\n", + "2-919 367.60% (803m 27s) logloss=17.27 \t accuracy=0.50 \t loss=0.6790357232093811\n", + "pred_y:\t [0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0]\n", + "target_y:\t [0 0 0 0 0 0 0 1 1 1 0 1 0 0 1 1 1 0 0 0]\n", + "2-939 375.60% (806m 42s) logloss=13.82 \t accuracy=0.60 \t loss=0.6841265559196472\n", + "pred_y:\t [0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0]\n", + "target_y:\t [0 1 1 0 0 0 1 0 0 1 1 1 0 1 1 0 0 0 0 1]\n", + "2-959 383.60% (809m 33s) logloss=20.72 \t accuracy=0.40 \t loss=0.7058143615722656\n", + 
"pred_y:\t [0 0 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1]\n", + "target_y:\t [0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 1 0 1 0 0]\n", + "2-979 391.60% (812m 32s) logloss=19.00 \t accuracy=0.45 \t loss=0.6926060318946838\n", + "pred_y:\t [0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0]\n", + "target_y:\t [0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 1 1 1 0 0]\n", + "2-999 399.60% (814m 52s) logloss=15.54 \t accuracy=0.55 \t loss=0.6884217858314514\n", + "pred_y:\t [1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0]\n", + "target_y:\t [1 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 0]\n", + "2-1019 407.60% (817m 8s) logloss=10.36 \t accuracy=0.70 \t loss=0.6804812550544739\n", + "pred_y:\t [1 1 0 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1]\n", + "target_y:\t [0 1 0 1 1 0 0 1 1 0 0 1 1 1 0 0 1 1 0 1]\n", + "2-1039 415.60% (820m 20s) logloss=13.82 \t accuracy=0.60 \t loss=0.6895269751548767\n", + "pred_y:\t [1 1 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 1]\n", + "target_y:\t [0 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 0 0 0]\n", + "2-1059 423.60% (822m 52s) logloss=15.54 \t accuracy=0.55 \t loss=0.6950637102127075\n", + "pred_y:\t [1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1]\n", + "target_y:\t [0 0 1 1 1 0 1 1 0 0 1 1 1 1 0 0 1 1 1 1]\n", + "2-1079 431.60% (825m 11s) logloss=17.27 \t accuracy=0.50 \t loss=0.6880634427070618\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 0 0 0 1 1 1 1 0 1 1 1 0 0 0 1 1 0 0 0]\n", + "2-1099 439.60% (829m 39s) logloss=19.00 \t accuracy=0.45 \t loss=0.7105134129524231\n", + "pred_y:\t [0 1 1 1 1 1 1 0 1 0 1 1 1 1 0 0 0 1 0 1]\n", + "target_y:\t [1 0 1 0 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1]\n", + "3-169 67.60% (831m 56s) logloss=15.54 \t accuracy=0.55 \t loss=0.7098760604858398\n", + "pred_y:\t [1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1]\n", + "target_y:\t [0 1 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 1 1 0]\n", + "3-189 75.60% (833m 44s) logloss=17.27 \t accuracy=0.50 \t loss=0.682663083076477\n", + "pred_y:\t [1 0 0 0 1 0 1 0 0 1 0 1 1 1 1 0 0 1 0 0]\n", + "target_y:\t [1 1 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1 1 
1 0]\n", + "3-209 83.60% (835m 39s) logloss=12.09 \t accuracy=0.65 \t loss=0.6758695840835571\n", + "pred_y:\t [0 1 1 1 1 0 0 1 1 1 1 1 0 0 1 1 1 1 1 0]\n", + "target_y:\t [1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0]\n", + "3-229 91.60% (837m 55s) logloss=22.45 \t accuracy=0.35 \t loss=0.7095354795455933\n", + "pred_y:\t [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 0 0]\n", + "3-249 99.60% (840m 15s) logloss=13.82 \t accuracy=0.60 \t loss=0.6910253167152405\n", + "pred_y:\t [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0]\n", + "target_y:\t [1 0 0 1 0 1 0 0 0 1 1 0 0 1 1 0 0 0 1 1]\n", + "3-269 107.60% (842m 31s) logloss=12.09 \t accuracy=0.65 \t loss=0.6822283864021301\n", + "pred_y:\t [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0]\n", + "3-289 115.60% (844m 43s) logloss=25.90 \t accuracy=0.25 \t loss=0.7268241047859192\n", + "pred_y:\t [0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]\n", + "target_y:\t [0 0 1 0 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 0]\n", + "3-309 123.60% (847m 3s) logloss=13.82 \t accuracy=0.60 \t loss=0.6838265657424927\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 0 0 1 1 1 1 0 1 1 0 0 0 1 0 0 1 1 0]\n", + "3-329 131.60% (849m 35s) logloss=15.54 \t accuracy=0.55 \t loss=0.6976009607315063\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [0 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 0]\n", + "3-349 139.60% (852m 0s) logloss=15.54 \t accuracy=0.55 \t loss=0.6879168152809143\n", + "pred_y:\t [1 0 0 0 0 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1]\n", + "target_y:\t [0 0 1 0 1 1 0 0 1 0 0 1 1 0 1 0 0 0 0 0]\n", + "3-369 147.60% (854m 19s) logloss=19.00 \t accuracy=0.45 \t loss=0.7046571373939514\n", + "pred_y:\t [0 0 0 1 1 0 1 1 1 1 0 0 0 0 1 1 0 1 1 0]\n", + "target_y:\t [0 0 1 0 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0]\n", + "3-389 155.60% (856m 19s) logloss=15.54 \t accuracy=0.55 \t loss=0.677700400352478\n", + "pred_y:\t [0 
0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 0 0 1]\n", + "target_y:\t [1 1 0 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 0 0]\n", + "3-409 163.60% (858m 10s) logloss=25.90 \t accuracy=0.25 \t loss=0.7202922105789185\n", + "pred_y:\t [1 1 1 1 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0]\n", + "target_y:\t [1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 1 1 0]\n", + "3-429 171.60% (860m 38s) logloss=20.72 \t accuracy=0.40 \t loss=0.6937626600265503\n", + "pred_y:\t [1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1 1 1 1 1]\n", + "target_y:\t [0 1 0 1 0 0 1 1 1 1 0 1 1 0 0 0 0 1 0 0]\n", + "3-449 179.60% (863m 8s) logloss=17.27 \t accuracy=0.50 \t loss=0.6776196956634521\n", + "pred_y:\t [0 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 0 0 1 0 1 1 0 1 0 0 1 1 1 1 1 0 1 0]\n", + "3-469 187.60% (865m 31s) logloss=10.36 \t accuracy=0.70 \t loss=0.6834084987640381\n", + "pred_y:\t [1 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1 1]\n", + "target_y:\t [0 1 0 1 1 1 0 0 0 0 1 1 1 0 0 0 1 0 0 0]\n", + "3-489 195.60% (867m 51s) logloss=22.45 \t accuracy=0.35 \t loss=0.6945611238479614\n", + "pred_y:\t [1 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0]\n", + "target_y:\t [0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0]\n", + "3-509 203.60% (870m 3s) logloss=17.27 \t accuracy=0.50 \t loss=0.6905189752578735\n", + "pred_y:\t [1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0]\n", + "target_y:\t [1 1 1 0 1 1 0 0 1 0 1 0 0 0 1 1 0 0 0 1]\n", + "3-529 211.60% (872m 42s) logloss=13.82 \t accuracy=0.60 \t loss=0.6931790113449097\n", + "pred_y:\t [0 1 0 1 1 0 1 1 1 1 0 1 0 1 0 1 0 0 0 0]\n", + "target_y:\t [0 1 1 1 1 0 0 1 1 0 1 0 1 0 1 0 0 0 0 0]\n", + "3-549 219.60% (874m 53s) logloss=15.54 \t accuracy=0.55 \t loss=0.6936815977096558\n", + "pred_y:\t [0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0]\n", + "target_y:\t [0 0 0 0 1 1 1 0 1 1 1 0 0 1 1 1 1 0 0 1]\n", + "3-569 227.60% (877m 4s) logloss=24.18 \t accuracy=0.30 \t loss=0.7091041803359985\n", + "pred_y:\t [0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 0 0]\n", + "target_y:\t [0 0 0 1 0 1 1 0 1 0 0 0 1 1 0 0 0 0 0 1]\n", + "3-589 
235.60% (880m 4s) logloss=8.63 \t accuracy=0.75 \t loss=0.6674402356147766\n", + "pred_y:\t [0 0 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 1]\n", + "target_y:\t [0 1 0 0 0 0 1 0 1 1 0 1 1 1 1 0 1 0 1 0]\n", + "3-609 243.60% (882m 57s) logloss=17.27 \t accuracy=0.50 \t loss=0.6928107142448425\n", + "pred_y:\t [0 0 0 0 0 1 0 1 0 0 0 1 1 0 0 1 1 0 0 0]\n", + "target_y:\t [0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1]\n", + "3-629 251.60% (885m 15s) logloss=19.00 \t accuracy=0.45 \t loss=0.7027646899223328\n", + "pred_y:\t [0 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1]\n", + "target_y:\t [0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 0 1 1 1 1]\n", + "3-649 259.60% (887m 35s) logloss=12.09 \t accuracy=0.65 \t loss=0.6845759153366089\n", + "pred_y:\t [0 1 0 0 1 0 1 0 0 1 0 1 0 1 1 0 1 1 1 0]\n", + "target_y:\t [1 1 1 1 1 1 0 0 0 1 0 1 1 0 0 1 1 1 1 1]\n", + "3-669 267.60% (890m 7s) logloss=17.27 \t accuracy=0.50 \t loss=0.6864821314811707\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 1 0 0 0 0]\n", + "3-689 275.60% (892m 46s) logloss=25.90 \t accuracy=0.25 \t loss=0.7397301197052002\n", + "pred_y:\t [1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 0]\n", + "target_y:\t [1 0 1 1 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 0]\n", + "3-709 283.60% (895m 49s) logloss=17.27 \t accuracy=0.50 \t loss=0.6964901685714722\n", + "pred_y:\t [1 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 1 1 0]\n", + "target_y:\t [1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 1 0 1 1 1]\n", + "3-729 291.60% (898m 44s) logloss=12.09 \t accuracy=0.65 \t loss=0.6655923128128052\n", + "pred_y:\t [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]\n", + "target_y:\t [0 1 0 1 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 0]\n", + "3-749 299.60% (901m 31s) logloss=15.54 \t accuracy=0.55 \t loss=0.7032931447029114\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "pred_y:\t [1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1]\n", + "target_y:\t [0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0]\n", + "3-769 307.60% (904m 20s) logloss=17.27 \t 
accuracy=0.50 \t loss=0.6945109963417053\n", + "pred_y:\t [0 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1]\n", + "target_y:\t [0 1 1 0 1 1 1 1 1 0 0 0 1 0 0 1 0 0 1 0]\n", + "3-789 315.60% (907m 19s) logloss=13.82 \t accuracy=0.60 \t loss=0.6908108592033386\n", + "pred_y:\t [1 0 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 0 1]\n", + "target_y:\t [1 1 1 0 0 1 0 1 1 0 0 1 0 0 1 0 1 1 1 1]\n", + "3-809 323.60% (909m 46s) logloss=15.54 \t accuracy=0.55 \t loss=0.6973456144332886\n", + "pred_y:\t [0 0 0 1 0 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1]\n", + "target_y:\t [1 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 1 0 1 1]\n", + "3-829 331.60% (912m 25s) logloss=19.00 \t accuracy=0.45 \t loss=0.6773894429206848\n", + "pred_y:\t [0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]\n", + "target_y:\t [1 0 0 0 1 0 1 1 0 1 0 1 1 0 0 1 1 0 0 1]\n", + "3-849 339.60% (914m 47s) logloss=15.54 \t accuracy=0.55 \t loss=0.6820051670074463\n", + "pred_y:\t [0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0]\n", + "target_y:\t [0 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 1]\n", + "3-869 347.60% (920m 18s) logloss=13.82 \t accuracy=0.60 \t loss=0.6797436475753784\n", + "pred_y:\t [0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0]\n", + "target_y:\t [0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1]\n", + "3-889 355.60% (923m 7s) logloss=10.36 \t accuracy=0.70 \t loss=0.6741222143173218\n", + "pred_y:\t [0 0 1 1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 0]\n", + "target_y:\t [0 1 1 0 0 0 1 0 0 0 1 0 1 1 0 1 0 1 1 1]\n", + "3-909 363.60% (925m 22s) logloss=13.82 \t accuracy=0.60 \t loss=0.6847935914993286\n", + "pred_y:\t [1 1 0 1 1 0 1 0 0 0 1 0 1 0 1 0 0 1 1 1]\n", + "target_y:\t [0 0 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 0 1 1]\n", + "3-929 371.60% (927m 38s) logloss=15.54 \t accuracy=0.55 \t loss=0.7047096490859985\n", + "pred_y:\t [0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 1 0]\n", + "target_y:\t [0 0 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1 1 0]\n", + "3-949 379.60% (935m 52s) logloss=17.27 \t accuracy=0.50 \t loss=0.6819921731948853\n", + "pred_y:\t [1 0 0 0 1 1 1 1 1 0 0 0 1 1 0 1 0 0 0 0]\n", + 
"target_y:\t [1 0 0 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 1 0]\n", + "3-969 387.60% (944m 46s) logloss=12.09 \t accuracy=0.65 \t loss=0.6759116053581238\n", + "pred_y:\t [1 0 0 1 1 0 0 1 1 1 1 0 1 0 0 0 1 1 0 1]\n", + "target_y:\t [1 0 0 0 0 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0]\n", + "3-989 395.60% (953m 30s) logloss=19.00 \t accuracy=0.45 \t loss=0.6978853344917297\n", + "pred_y:\t [1 0 1 0 1 1 0 1 0 0 1 0 1 0 0 0 0 0 1 0]\n", + "target_y:\t [0 1 0 1 0 0 0 1 1 1 0 0 1 1 1 1 0 0 0 0]\n", + "3-1009 403.60% (1268m 54s) logloss=22.45 \t accuracy=0.35 \t loss=0.6969307065010071\n", + "pred_y:\t [0 0 0 1 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0]\n", + "target_y:\t [0 1 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 0 1 1]\n", + "3-1029 411.60% (1271m 31s) logloss=22.45 \t accuracy=0.35 \t loss=0.7028256058692932\n", + "pred_y:\t [1 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1]\n", + "target_y:\t [0 1 1 0 1 1 0 1 0 1 0 1 1 1 1 1 0 1 1 0]\n", + "3-1049 419.60% (1273m 47s) logloss=22.45 \t accuracy=0.35 \t loss=0.7021511793136597\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Process Process-20:\n", + "Process Process-19:\n", + "Traceback (most recent call last):\n", + "Traceback (most recent call last):\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py\", line 258, in _bootstrap\n", + " self.run()\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py\", line 258, in _bootstrap\n", + " self.run()\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py\", line 93, in run\n", + " self._target(*self._args, **self._kwargs)\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/process.py\", line 93, in run\n", + " self._target(*self._args, **self._kwargs)\n", + " File \"/Users/jiangzl/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py\", line 50, in _worker_loop\n", + 
" r = index_queue.get()\n", + " File \"/Users/jiangzl/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/utils/data/dataloader.py\", line 50, in _worker_loop\n", + " r = index_queue.get()\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/queues.py\", line 335, in get\n", + " res = self._reader.recv_bytes()\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/queues.py\", line 334, in get\n", + " with self._rlock:\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py\", line 216, in recv_bytes\n", + " buf = self._recv_bytes(maxlength)\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py\", line 407, in _recv_bytes\n", + " buf = self._recv(4)\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/synchronize.py\", line 96, in __enter__\n", + " return self._semlock.__enter__()\n", + " File \"/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/multiprocessing/connection.py\", line 379, in _recv\n", + " chunk = read(handle, remaining)\n", + "KeyboardInterrupt\n", + "KeyboardInterrupt\n" + ] + }, + { + "ename": "KeyboardInterrupt", + "evalue": "", + "output_type": "error", + "traceback": [ + "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", + "\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)", + "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 14\u001b[0m \u001b[0mb_y\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mVariable\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mbatch_y\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# batch y\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 15\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 16\u001b[0;31m \u001b[0mout\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mcnn\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mb_x\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# feed training data x to the net and get its output\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 17\u001b[0m \u001b[0mloss\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mloss_func\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mout\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mb_y\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;31m# compute the loss between the output and the target\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 18\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *input, **kwargs)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_slow_forward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mhook\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_forward_hooks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0mhook_result\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mhook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m
\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, x)\u001b[0m\n\u001b[1;32m 66\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mword_embeddings\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 67\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mview\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmax_len\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0membedding_dim\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 68\u001b[0;31m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfeatures\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 69\u001b[0m \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mview\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m-\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 70\u001b[0m \u001b[0moutput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mclassifier\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py\u001b[0m in 
\u001b[0;36m__call__\u001b[0;34m(self, *input, **kwargs)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_slow_forward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mhook\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_forward_hooks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0mhook_result\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mhook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/nn/modules/container.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 65\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 66\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mmodule\u001b[0m \u001b[0;32min\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_modules\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 67\u001b[0;31m \u001b[0minput\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mmodule\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 68\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 69\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/nn/modules/module.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, *input, **kwargs)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_slow_forward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;32mfor\u001b[0m \u001b[0mhook\u001b[0m \u001b[0;32min\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_forward_hooks\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0mhook_result\u001b[0m \u001b[0;34m=\u001b[0m 
\u001b[0mhook\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;32m~/.virtualenvs/python3.6/lib/python3.6/site-packages/torch/nn/modules/activation.py\u001b[0m in \u001b[0;36mforward\u001b[0;34m(self, input)\u001b[0m\n\u001b[1;32m 41\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 42\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mforward\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0minput\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 43\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mF\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mthreshold\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minput\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mthreshold\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mvalue\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0minplace\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 44\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 45\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__repr__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", + "\u001b[0;31mKeyboardInterrupt\u001b[0m: " + ] + } + ], + "source": [ + "from sklearn.metrics import log_loss\n", + "\n", + "Epoch = 5\n", + "print_every = 20\n", + "max_step = len(X_train_d)/BATCH_SIZE\n", + "\n", + "# Track the loss for plotting\n", + "current_loss = 0\n", + "all_losses = []\n", + "\n", + "for epoch in range(Epoch):\n", + " for step, (batch_x, batch_y) in enumerate(train_loader): # each step the loader yields one mini-batch to learn from\n", + " b_x = Variable(batch_x) # batch x\n", + " b_y = Variable(batch_y) # batch y\n", + " \n", + " out = cnn(b_x) # feed training data x to the net and get its output\n", 
+ " loss = loss_func(out, b_y) # compute the loss between output and target\n", + "\n", + " optimizer.zero_grad() # clear gradients left over from the previous step\n", + " loss.backward() # backpropagate the loss and compute gradients\n", + " optimizer.step() # apply the updates to the net's parameters\n", + "\n", + " current_loss += loss.data[0]\n", + " # print(F.softmax(out), '---', torch.max(F.softmax(out), 1), 'xxx', torch.max(F.softmax(out), 1)[1])\n", + " if step % print_every == print_every-1:\n", + " # softmax gives the class probabilities; max then picks the largest as (probability, class index)\n", + " prediction = torch.max(F.softmax(out, dim=1), 1)[1]\n", + " pred_y = prediction.data.numpy().squeeze()\n", + " target_y = b_y.data.numpy()\n", + " print(\"pred_y:\\t\", pred_y)\n", + " print(\"target_y:\\t\", target_y)\n", + " logloss = log_loss(target_y, pred_y, eps=1e-15)\n", + " accuracy = sum(pred_y == target_y)/len(target_y) # fraction of predictions that match the targets\n", + "\n", + " # total step count\n", + " loop_step = epoch*max_step + step\n", + " total_step = Epoch*max_step\n", + " print('%d-%d %.2f%% (%s) logloss=%.2f \\t accuracy=%.2f \\t loss=%s' % (epoch, loop_step, loop_step/total_step*100, timeSince(start), logloss, accuracy, loss.data[0]))\n", + "\n", + " all_losses.append(current_loss/print_every)\n", + " current_loss = 0\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import log_loss\n", + "\n", + "Epoch = 1\n", + "print_every = 100\n", + "max_step = len(X_train_d)\n", + "\n", + "# Track the loss for plotting\n", + "current_loss = 0\n", + "all_losses = []\n", + "pre_result = []\n", + "rea_result = []\n", + "\n", + "for epoch in range(Epoch):\n", + " # can be optimized into batch processing\n", + " \n", + " # step = len(X_train_d)/BATCH_SIZE (train_loader yields batches of size BATCH_SIZE)\n", + " for step, (x, y) in enumerate(zip(X_train_d[1384:], y_train_d[1384:])):\n", + " b_x = prepare_sequence(x)\n", + " b_y = Variable(torch.LongTensor([y])) # batch y\n", + " \n", + " out = cnn(b_x) # feed training data 
x to the net and get its output\n", + " loss = loss_func(out, b_y) # compute the loss between output and target\n", + "\n", + " optimizer.zero_grad() # clear gradients left over from the previous step\n", + " loss.backward() # backpropagate the loss and compute gradients\n", + " optimizer.step() # apply the updates to the net's parameters\n", + "\n", + " current_loss += loss.data[0]\n", + "\n", + " prediction = torch.max(F.softmax(out, dim=1), 1)[1]\n", + " pred_y = prediction.data.numpy()\n", + " target_y = b_y.data.numpy()\n", + " \n", + "# print('pred---', pred_y)\n", + "# print('target---', target_y)\n", + " \n", + "# if step>2:\n", + "# break\n", + "\n", + " # softmax gives the class probabilities; max then picks the largest as (probability, class index)\n", + " prediction = torch.max(F.softmax(out, dim=1), 1)[1]\n", + " pre_result.append(prediction.data.numpy()[0])\n", + " rea_result.append(b_y.data.numpy()[0])\n", + " \n", + " # print(F.softmax(out), '---', torch.max(F.softmax(out), 1), 'xxx', torch.max(F.softmax(out), 1)[1])\n", + " if step % print_every == print_every-1:\n", + "# print('pred---', pred_y)\n", + "# print('target---', target_y)\n", + " logloss = log_loss(rea_result, pre_result, eps=1e-15)\n", + "# print(\"pre_result: \\t\", pre_result)\n", + "# print(\"rea_result: \\t\", rea_result)\n", + " accuracy = sum(np.array(pre_result) == np.array(rea_result))/len(rea_result) # fraction of predictions that match the targets\n", + " \n", + " # total step count\n", + " loop_step = epoch*max_step + step\n", + " total_step = Epoch*max_step\n", + " \n", + " print('%d-%d %.2f%% (%s) logloss=%.2f \\t accuracy=%.2f \\t loss=%s' % (epoch, loop_step, loop_step/total_step*100, timeSince(start), logloss, accuracy, loss.data[0]))\n", + "\n", + " all_losses.append(current_loss/print_every)\n", + " current_loss = 0\n", + " pre_result = []\n", + " rea_result = []\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "I. Interpreting train loss and test loss\n", + "\n", + "* train loss keeps falling and test loss keeps falling: the network is still learning;\n", + "* train loss keeps falling while test loss levels off: the network is overfitting;\n", + "* train loss levels off while test loss keeps falling: the dataset almost certainly has a problem;\n", + "* train loss levels off and test loss levels off: learning has hit a bottleneck, so reduce the learning rate or the batch size;\n", + "* train loss 
keeps rising and test loss keeps rising: the network architecture is poorly designed, the training hyperparameters are badly chosen, or the dataset needs cleaning.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import matplotlib.pyplot as plt\n", + "import matplotlib.ticker as ticker\n", + "\n", + "plt.figure()\n", + "plt.plot(all_losses)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 234, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 234, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "\n", + "import matplotlib.pyplot as plt\n", + "import matplotlib.ticker as ticker\n", + "\n", + "plt.figure()\n", + "plt.plot(all_losses)" + ] + }, + { + "cell_type": "code", + "execution_count": 235, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(3m 30s) logloss=14.94 \t accuracy=0.57\n" + ] + } + ], + "source": [ + "# Check accuracy on the test set: it is poor, which suggests the data is unevenly sampled\n", + "\n", + "out_t = cnn(Variable(te_x))\n", + "\n", + "# softmax gives the class probabilities; max then picks the largest as (probability, class index)\n", + "prediction_t = torch.max(F.softmax(out_t, dim=1), 1)[1]\n", + "pred_t_y = prediction_t.data.numpy().squeeze()\n", + "target_t_y = Variable(te_y).data.numpy()\n", + "logloss_t = log_loss(target_t_y, pred_t_y, eps=1e-15)\n", + "accuracy_t = sum(pred_t_y == target_t_y)/len(target_t_y) # fraction of predictions that match the targets\n", + "print('(%s) logloss=%.2f \\t accuracy=%.2f' % (timeSince(start), logloss_t, accuracy_t))" + ] + }, + { + "cell_type": "code", + "execution_count": 236, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(3m 49s) logloss=14.25 \t accuracy=0.59\n" + ] + } + ], + "source": [ + "# Check accuracy on the training set as well: it is also poor, which strongly suggests the data was sampled unevenly, so part of it was never learned\n", + "\n", + "out_t = cnn(Variable(tr_x[:10000]))\n", + "\n", + "# softmax gives the class probabilities; max then picks the largest as (probability, class index)\n", + "prediction_t = torch.max(F.softmax(out_t, dim=1), 1)[1]\n", + "pred_t_y = prediction_t.data.numpy().squeeze()\n", + "target_t_y = Variable(tr_y[:10000]).data.numpy()\n", + "logloss_t = log_loss(target_t_y, pred_t_y, eps=1e-15)\n", + "accuracy_t = sum(pred_t_y == target_t_y)/len(target_t_y) # fraction of predictions that match the targets\n", + "print('(%s) logloss=%.2f \\t accuracy=%.2f' % (timeSince(start), logloss_t, accuracy_t))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + 
"outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* * * \n", + "\n", + "Additional material" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load pre-trained word vectors\n", + "\n", + "# gensim >= 1.0 loads word2vec-format files through KeyedVectors\n", + "from gensim.models import KeyedVectors\n", + "\n", + "model = KeyedVectors.load_word2vec_format(\"vector.txt\", binary=False) # C text format\n", + "# model = KeyedVectors.load_word2vec_format(\"vector.bin\", binary=True) # C binary format" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Load Google's word vectors and inspect relationships between words\n", + "\n", + "from gensim.models import KeyedVectors\n", + "model = KeyedVectors.load_word2vec_format(\"GoogleNews-vectors-negative300.bin\", binary=True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test the analogy predictions\n", + "\n", + "print(model.most_similar(positive=[\"woman\", \"king\"], negative=[\"man\"], topn=5))\n", + "print(model.most_similar(positive=[\"biggest\", \"small\"], negative=[\"big\"], topn=5))\n", + "print(model.most_similar(positive=[\"ate\", \"speak\"], negative=[\"eat\"], topn=5))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "\n", + "with open(\"food_words.txt\", \"r\") as infile:\n", + " food_words = infile.readlines()\n", + " \n", + "with open(\"sports_words.txt\", \"r\") as infile:\n", + " sports_words = infile.readlines()\n", + " \n", + "with open(\"weather_words.txt\", \"r\") as infile:\n", + " weather_words = infile.readlines()\n", + " \n", + "def getWordVecs(words):\n", + " vecs = []\n", + " for word in words:\n", + " word = word.replace(\"\\n\", \"\")\n", + " try:\n", + " vecs.append(model[word].reshape((1, 300)))\n", + " except KeyError:\n", + " continue\n", + " \n", + " # numpy.concatenate((a1,a2,...), axis=0) joins several arrays in a single call\n", + " \"\"\"\n", + " 
>>> a=np.array([1,2,3])\n", + " >>> b=np.array([11,22,33])\n", + " >>> c=np.array([44,55,66])\n", + " >>> np.concatenate((a,b,c),axis=0) # axis=0 is the default and can be omitted\n", + " array([ 1, 2, 3, 11, 22, 33, 44, 55, 66]) # 1-D arrays can only be joined along axis 0\n", + " \"\"\"\n", + " vecs = np.concatenate(vecs)\n", + " return np.array(vecs, dtype=\"float\")\n", + "\n", + "food_vecs = getWordVecs(food_words)\n", + "sports_vecs = getWordVecs(sports_words)\n", + "weather_vecs = getWordVecs(weather_words)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Visualize the three word groups with t-SNE and matplotlib\n", + "\n", + "from sklearn.manifold import TSNE\n", + "import matplotlib.pyplot as plt\n", + "\n", + "ts = TSNE(n_components=2)\n", + "reduced_vecs = ts.fit_transform(np.concatenate((food_vecs, sports_vecs, weather_vecs)))\n", + "\n", + "for i in range(len(reduced_vecs)):\n", + " if i < len(food_vecs):\n", + " color = \"b\" # food words\n", + " elif i < len(food_vecs) + len(sports_vecs):\n", + " color = \"r\" # sports words\n", + " else:\n", + " color = \"g\" # weather words\n", + " \n", + " plt.plot(reduced_vecs[i, 0], reduced_vecs[i, 1], marker=\"o\", color=color, markersize=8)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# First, load the data and build the Word2Vec model:\n", + "\n", + "# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20)\n", + "from sklearn.model_selection import train_test_split\n", + "from gensim.models.word2vec import Word2Vec\n", + "\n", + "with open('twitter_data/pos_tweets.txt', 'r') as infile:\n", + " pos_tweets = infile.readlines()\n", + "\n", + "with open('twitter_data/neg_tweets.txt', 'r') as infile:\n", + " neg_tweets = infile.readlines()\n", + "\n", + "# use 1 for positive sentiment, 0 for negative\n", + "y = np.concatenate((np.ones(len(pos_tweets)), np.zeros(len(neg_tweets))))\n", + "\n", + "x_train, x_test, y_train, y_test = 
train_test_split(np.concatenate((pos_tweets, neg_tweets)), y, test_size=0.2)\n", + "# Do some very minor text preprocessing\n", + "\n", + "def cleanText(corpus):\n", + " corpus = [z.lower().replace('\\n', '').split() for z in corpus]\n", + " return corpus\n", + "\n", + "x_train = cleanText(x_train)\n", + "x_test = cleanText(x_test)\n", + "\n", + "n_dim = 300\n", + "# Initialize model and build vocab (the `size` argument is named `vector_size` in gensim >= 4.0)\n", + "imdb_w2v = Word2Vec(size=n_dim, min_count=10)\n", + "imdb_w2v.build_vocab(x_train)\n", + "# Train the model over the training tweets (this may take several minutes)\n", + "imdb_w2v.train(x_train, total_examples=imdb_w2v.corpus_count, epochs=imdb_w2v.epochs)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Next, build the input vector for each tweet: the function below averages the word vectors of all words in the tweet.\n", + "\n", + "def buildWordVector(text, size):\n", + " vec = np.zeros(size).reshape((1, size))\n", + " count = 0.\n", + "\n", + " for word in text:\n", + " try:\n", + " vec += imdb_w2v.wv[word].reshape((1, size))\n", + " count += 1.\n", + " except KeyError: # word not in the vocabulary\n", + " continue\n", + " if count != 0:\n", + " vec /= count\n", + " return vec" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Scaling is part of standardizing the data: we usually transform the dataset to have zero mean (roughly Gaussian), so values above the mean indicate positive sentiment and values below it indicate negative sentiment. Many machine learning models need scaled inputs to work well, especially models with many variables such as text classifiers.\n", + "\n", + "from sklearn.preprocessing import scale\n", + "\n", + "train_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_train])\n", + "train_vecs = scale(train_vecs)\n", + "\n", + "# Train word2vec on test tweets\n", + "imdb_w2v.train(x_test, total_examples=len(x_test), epochs=imdb_w2v.epochs)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Finally, build the test-set vectors and scale them:\n", + "\n", + "# Build test tweet vectors then scale\n", + "test_vecs = np.concatenate([buildWordVector(z, n_dim) for z in x_test])\n", + "test_vecs = scale(test_vecs)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, 
+ "outputs": [], + "source": [ + "\"\"\"\n", + "Next we validate the classifier by computing its prediction accuracy on the test set and its ROC curve. The ROC curve shows how the true positive rate and the false positive rate change as a model parameter is varied; in our case that parameter is the classifier's probability cutoff threshold. In general, the larger the area under the ROC curve (AUC), the better the model performs. More on ROC curves:\n", + "\n", + "(https://en.wikipedia.org/wiki/Receiver_operating_characteristic)\n", + "\n", + "In this example the classifier is logistic regression trained with stochastic gradient descent.\n", + "\"\"\"\n", + "\n", + "# Use classification algorithm (i.e. Stochastic Logistic Regression) on training set, then assess model performance on test set\n", + "\n", + "from sklearn.linear_model import SGDClassifier\n", + "# loss='log' is logistic regression; scikit-learn >= 1.1 names it 'log_loss'\n", + "lr = SGDClassifier(loss='log', penalty='l1')\n", + "lr.fit(train_vecs, y_train)\n", + "print('Test Accuracy: %.2f' % lr.score(test_vecs, y_test))\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Then plot the ROC curve using matplotlib and sklearn.metrics\n", + "\n", + "# Create ROC curve\n", + "from sklearn.metrics import roc_curve, auc\n", + "import matplotlib.pyplot as plt\n", + "\n", + "pred_probas = lr.predict_proba(test_vecs)[:, 1]\n", + "\n", + "fpr, tpr, _ = roc_curve(y_test, pred_probas)\n", + "roc_auc = auc(fpr, tpr)\n", + "\n", + "plt.plot(fpr, tpr, label='area = %.2f' % roc_auc)\n", + "plt.plot([0, 1], [0, 1], 'k--')\n", + "plt.xlim([0.0, 1.0])\n", 
+ "plt.ylim([0.0, 1.05])\n", + "plt.legend(loc='lower right')\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.6.3" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/python/getting-started/digit-recognizer/cnn_pytorch-python3.6.py b/src/python/getting-started/digit-recognizer/cnn_pytorch-python3.6.py index f3278fd..3f7ed41 100644 --- a/src/python/getting-started/digit-recognizer/cnn_pytorch-python3.6.py +++ b/src/python/getting-started/digit-recognizer/cnn_pytorch-python3.6.py @@ -20,7 +20,7 @@ from torch.utils.data import Dataset, DataLoader import os.path # 数据路径 -data_dir = '/media/wsw/B634091A3408DF6D/data/kaggle/datasets/getting-started/digit-recognizer/' +data_dir = '/opt/data/kaggle/getting-started/digit-recognizer/' class CustomedDataSet(Dataset): def __init__(self, train=True):