Deep Learning Text Classification

Benchmarking · 2019. 5. 31. 22:23

1. Title Generation Model (Text Classification)

1.1. Basic Concept

This uses an algorithm typically found in models that evaluate the sentiment of user reviews or comment sentences.

The basic idea is to match the words used in a sentence to a sentiment and separate the sentences through classification.

 

It is also used for purposes similar to image classification.

 

 

Example uses: Sentiment Analysis / Sentence Classification


1.2. Algorithm Overview

Each word is mapped to a one-hot encoding, and the algorithm predicts the closest class by passing the result through a convolutional neural network + max pooling + fully connected layer.
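As a rough sketch of that word-to-vector mapping (my own illustration, not code from the repository linked below), it could look like this in Python:

import numpy as np

# Toy vocabulary built from a single sentence (illustration only)
sentence = "this movie is great".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}   # {'great': 0, 'is': 1, 'movie': 2, 'this': 3}

indices = [vocab[w] for w in sentence]    # [3, 2, 1, 0]
one_hot = np.eye(len(vocab))[indices]     # shape (4, 4): one row per word
print(indices)
print(one_hot)

(The TensorFlow code later in this post actually uses a learned embedding lookup rather than raw one-hot vectors.)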

 

The CNN algorithm is mostly used for image inference, extracting features from images to find similarities.

 

Breaking this down by the convolution performed at each layer, it looks like the following.

 

Sliding the filter over the entire matrix, we obtain the element-wise product sum (the convolution) at each position.
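A minimal NumPy sketch of this sliding element-wise multiply-and-sum (my own illustration, not part of the repository code):

import numpy as np

M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])      # input matrix
F = np.array([[1, 0],
              [0, 1]])         # 2x2 filter

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # multiply the 2x2 window element-wise with the filter and sum the result
        out[i, j] = np.sum(M[i:i + 2, j:j + 2] * F)

print(out)   # [[2. 0.]
             #  [0. 2.]]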


In the paper, a simple CNN with one layer of convolution on top of word vectors is used.

With filter sizes of 2, 3, and 4 and two filters of each size (3 x 2 = 6 filters), convolution is performed over the sentence matrix to produce the feature maps.
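Since a filter of height h slid over a sentence of n words produces a feature map of length n - h + 1, a quick sketch with toy numbers (my own illustration, not from the paper or the repo) gives:

import numpy as np

n, d = 7, 5                        # sentence length, embedding dimension
sentence = np.random.randn(n, d)   # toy sentence matrix

feature_maps = []
for h in (2, 3, 4):                # the three filter heights
    for _ in range(2):             # two filters per height -> 6 filters total
        W = np.random.randn(h, d)
        fmap = [np.sum(sentence[i:i + h] * W) for i in range(n - h + 1)]
        feature_maps.append(fmap)

print(len(feature_maps))               # 6
print([len(f) for f in feature_maps])  # [6, 6, 5, 5, 4, 4]

Max pooling then takes the single largest value from each of the six feature maps.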

 

In short, the effort to take the CNN algorithm originally applied to images and apply it elsewhere produced many papers applying it to NLP as well, and it has been adopted there.


Actual code

Code analysis

 

https://github.com/likejazz/cnn-text-classification-tf

 

1.3. TensorFlow Code Analysis

 

- Training data
- Prerequisite code to run first
- Explanation of the TensorFlow code
- F1 Score
- When studying natural language processing, papers report experimental results with Precision, Recall, and so on.
- These terms come up often but are easy to confuse, so I am writing them down here.

 

                        Experiment result (prediction)
                        True          False
Actual answer   True     a             b
                False    c             d

 

- Precision = a / (a + c)
  - Of the items the model predicted as True, the fraction that are actually True (correct)
- Recall = a / (a + b)
  - Of the items that are actually True, the fraction the model predicted as True
- Accuracy = (a + d) / (a + b + c + d)
  - The fraction of all predictions that match the actual answers
- F1 score = 2 * (Precision * Recall) / (Precision + Recall)
  - A single score combining the Precision and Recall above (see the sketch below)
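The four definitions above can be checked with a small helper function (a sketch of my own, using a, b, c, d exactly as in the table):

def metrics(a, b, c, d):
    # a: predicted True, actually True     b: predicted False, actually True
    # c: predicted True, actually False    d: predicted False, actually False
    precision = a / (a + c)
    recall = a / (a + b)
    accuracy = (a + d) / (a + b + c + d)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

print(metrics(a=8, b=2, c=4, d=6))   # (0.666..., 0.8, 0.7, 0.727...)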

 

 

Description of the TensorFlow model parameters

 

  --embedding_dim EMBEDDING_DIM

                        Dimensionality of character embedding (default: 128)

  --filter_sizes FILTER_SIZES

                        Comma-separated filter sizes (default: '3,4,5')

  --num_filters NUM_FILTERS

                        Number of filters per filter size (default: 128)

  --l2_reg_lambda L2_REG_LAMBDA

                        L2 regularization lambda (default: 0.0)

  --dropout_keep_prob DROPOUT_KEEP_PROB

                        Dropout keep probability (default: 0.5)

  --batch_size BATCH_SIZE

                        Batch Size (default: 64)

  --num_epochs NUM_EPOCHS

                        Number of training epochs (default: 100)

  --evaluate_every EVALUATE_EVERY

                        Evaluate model on dev set after this many steps

                        (default: 100)

  --checkpoint_every CHECKPOINT_EVERY

                        Save model after this many steps (default: 100)

  --allow_soft_placement ALLOW_SOFT_PLACEMENT

                        Allow device soft device placement

  --noallow_soft_placement

  --log_device_placement LOG_DEVICE_PLACEMENT

                        Log placement of ops on devices

  --nolog_device_placement

# text_cnn.py in the repo: the TextCNN model definition
import tensorflow as tf


class TextCNN(object):

    """

    A CNN for text classification.

    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.

    """

    def __init__(

      self, sequence_length, num_classes, vocab_size,

      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):

 

        # Placeholders for input, output and dropout

        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")

        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")

        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

 

        # Keeping track of l2 regularization loss (optional)

        l2_loss = tf.constant(0.0)

 

        # Embedding layer

        with tf.device('/cpu:0'), tf.name_scope("embedding"):

            W = tf.Variable(

                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),

                name="W")

            self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)

            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
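            # shape note: embedded_chars is [batch, sequence_length, embedding_size];
            # expand_dims adds a channel dimension -> [batch, sequence_length, embedding_size, 1] for conv2d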

 

        # Create a convolution + maxpool layer for each filter size

        pooled_outputs = []

        for i, filter_size in enumerate(filter_sizes):

            with tf.name_scope("conv-maxpool-%s" % filter_size):

                # Convolution Layer

                filter_shape = [filter_size, embedding_size, 1, num_filters]

                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")

                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")

                conv = tf.nn.conv2d(

                    self.embedded_chars_expanded,

                    W,

                    strides=[1, 1, 1, 1],

                    padding="VALID",

                    name="conv")

                # Apply nonlinearity

                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")

                # Maxpooling over the outputs

                pooled = tf.nn.max_pool(

                    h,

                    ksize=[1, sequence_length - filter_size + 1, 1, 1],

                    strides=[1, 1, 1, 1],

                    padding='VALID',

                    name="pool")

                pooled_outputs.append(pooled)

 

        # Combine all the pooled features

        num_filters_total = num_filters * len(filter_sizes)

        self.h_pool = tf.concat(pooled_outputs, 3)

        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
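        # shape note: h_pool_flat is [batch, num_filters * len(filter_sizes)], one pooled value per filter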

 

        # Add dropout

        with tf.name_scope("dropout"):

            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

 

        # Final (unnormalized) scores and predictions

        with tf.name_scope("output"):

            W = tf.get_variable(

                "W",

                shape=[num_filters_total, num_classes],

                initializer=tf.contrib.layers.xavier_initializer())

            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")

            l2_loss += tf.nn.l2_loss(W)

            l2_loss += tf.nn.l2_loss(b)

            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")

            self.predictions = tf.argmax(self.scores, 1, name="predictions")

 

        # CalculateMean cross-entropy loss

        with tf.name_scope("loss"):

            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)

            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

 

        # Accuracy

        with tf.name_scope("accuracy"):

            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))

            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

 

 

 

#! /usr/bin/env python
# Training script (train.py in the repo): data loading/preprocessing, model construction, and the training loop

import tensorflow as tf

import numpy as np

import os

import time

import datetime

import data_helpers

from text_cnn import TextCNN

from tensorflow.contrib import learn

from collections import OrderedDict

from cnn_model import CNN

import json

# Parameters

# ==================================================

# Data loading params

tf.flags.DEFINE_float("dev_sample_percentage", .1, "Percentage of the training data to use for validation")

tf.flags.DEFINE_string("positive_data_file", "./data/rt-polaritydata/rt-polarity.pos", "Data source for the positive data.")

tf.flags.DEFINE_string("negative_data_file", "./data/rt-polaritydata/rt-polarity.neg", "Data source for the negative data.")

tf.flags.DEFINE_string("animal_data_file", "./data/rt-polaritydata/rt-polarity.animal", "Data source for the animal data.")

tf.flags.DEFINE_string("data_path", "./data/group_tag4/", "Data source for the data group.")

#tf.flags.DEFINE_string("data_path", "./data/rt-polaritydata/", "Data source for the data group.")

# Model Hyperparameters

tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")

tf.flags.DEFINE_string("filter_sizes", "3,4", "Comma-separated filter sizes (default: '3,4,5')")

tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")

tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")

tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")

# Training parameters

tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")

tf.flags.DEFINE_integer("num_epochs", 200, "Number of training epochs (default: 200)")

tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")

tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")

tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")

# Misc Parameters

tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")

tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")

FLAGS = tf.flags.FLAGS

# FLAGS._parse_flags()

# print("\nParameters:")

# for attr, value in sorted(FLAGS.__flags.items()):

#     print("{}={}".format(attr.upper(), value))

# print("")

def preprocess():

    # Data Preparation

    # ==================================================

    # Load data

    print("Loading data...")

    '''

    x_text, y = data_helpers.load_data_and_labels(FLAGS.airfield, FLAGS.airplanecabin,FLAGS.airportterminal,FLAGS.alcove,FLAGS.alley,

                                                  FLAGS.amphitheater,FLAGS.amusementarcade,FLAGS.amusementpark,FLAGS.apartmentbuilding,FLAGS.aquarium,

                                                  FLAGS.aqueduct,FLAGS.arcade,FLAGS.arch,FLAGS.archaelogical,FLAGS.archive,FLAGS.armybase,FLAGS.artclass,

                                                  FLAGS.artgallery,FLAGS.artistsloft,FLAGS.assemblyline,FLAGS.atelier,FLAGS.atrium,FLAGS.attic,FLAGS.auditorium,

                                                  FLAGS.badlands,FLAGS.bakery)

    '''

    x_text, y = data_helpers.load_data_and_labels_all(FLAGS.data_path)

    #print(x_text)

    # Build vocabulary

    max_document_length = max([len(x.split(" ")) for x in x_text])

    #max_document_length = max([len(x) for x in x_text])

    print("max_document_length",max_document_length)

    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

    x = np.array(list(vocab_processor.fit_transform(x_text)))
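    # x is an int array of shape [num_examples, max_document_length]: word indices, padded to equal length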

    vocab_dict = vocab_processor.vocabulary_._mapping

    sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

    vocabulary = list(list(zip(*sorted_vocab))[0])

    print("Vocabulary : ",vocabulary)

    with open('./vocab.txt', 'w') as f:

        for item in vocabulary:

            f.write("%s\n" % item)

    with open('./diction.txt', 'w') as file:

        file.write(json.dumps(vocab_dict))

        #wordnp = np.column_stack((x, x_train_np))

    #wordnp = np.concatenate((x, x_text),axis=0)

    #np.savetxt('wordidx.txt', (x_train_np), fmt='%s')

    '''

    with open("./vocab.txt", 'wt') as word:

        for i, val in enumerate (x) :

            #data = x_text[i] #

            print(val,x_text[i])

            word.write('%d %s\n'.join(x_text[i]))

    '''

    #f.close()

    # Randomly shuffle data

    np.random.seed(10)

    shuffle_indices = np.random.permutation(np.arange(len(y)))

    print("shuffle_indices",shuffle_indices)

    print("len(y)", len(y))

    x_shuffled = x[shuffle_indices]

    y_shuffled = y[shuffle_indices]

    # Split train/test set

    # TODO: This is very crude, should use cross-validation

    dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
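    # e.g. with dev_sample_percentage = 0.1 and 1,000 examples, dev_sample_index = -100,
    # so the last 100 shuffled examples become the dev set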

    print("dev_sample_index", dev_sample_index)

    x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]

    y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]

    del x, y, x_shuffled, y_shuffled

    print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))

    print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))

    return x_train, y_train, vocab_processor, x_dev, y_dev , max_document_length

def train(x_train, y_train, vocab_processor, x_dev, y_dev,maxlength):

    # Training

    # ==================================================

    with tf.Graph().as_default():

        session_conf = tf.ConfigProto(

          allow_soft_placement=FLAGS.allow_soft_placement,

          log_device_placement=FLAGS.log_device_placement)

        sess = tf.Session(config=session_conf)

        with sess.as_default():

            # Leftover from an experiment with an alternative CNN model (cnn_model.CNN); not used below:
            # def __init__(self, embedding_layer=None, num_words=None, embedding_dim=None,
            #              max_seq_length=100, filter_sizes=[3, 4, 5], feature_maps=[100, 100, 100],
            #              hidden_units=100, dropout_rate=None, nb_classes=None):
            # cnn = CNN(
            #     #embedding_layer=x_train.shape[1],
            #     embedding_dim=FLAGS.embedding_dim,
            #     max_seq_length=maxlength,
            #     filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
            #     #feature_maps=[3,4,5],
            #     #hidden_units=,
            #     dropout_rate=0.1,
            #     nb_classes=y_train.shape[1]
            # )

            cnn = TextCNN(

                sequence_length=x_train.shape[1],

                num_classes=y_train.shape[1],

                vocab_size=len(vocab_processor.vocabulary_),

                embedding_size=FLAGS.embedding_dim,

                filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),

                num_filters=FLAGS.num_filters,

                l2_reg_lambda=FLAGS.l2_reg_lambda)

            # Define Training procedure

            global_step = tf.Variable(0, name="global_step", trainable=False)

            optimizer = tf.train.AdamOptimizer(1e-3)

            grads_and_vars = optimizer.compute_gradients(cnn.loss)

            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

            # Keep track of gradient values and sparsity (optional)

            grad_summaries = []

            for g, v in grads_and_vars:

                if g is not None:

                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)

                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

                    grad_summaries.append(grad_hist_summary)

                    grad_summaries.append(sparsity_summary)

            grad_summaries_merged = tf.summary.merge(grad_summaries)

            # Output directory for models and summaries

            timestamp = str(int(time.time()))

            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))

            print("Writing to {}\n".format(out_dir))

            # Summaries for loss and accuracy

            loss_summary = tf.summary.scalar("loss", cnn.loss)

            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

            # Train Summaries

            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])

            train_summary_dir = os.path.join(out_dir, "summaries", "train")

            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

            # Dev summaries

            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])

            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")

            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it

            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))

            checkpoint_prefix = os.path.join(checkpoint_dir, "model")

            if not os.path.exists(checkpoint_dir):

                os.makedirs(checkpoint_dir)

            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)

            # Write vocabulary

            vocab_processor.save(os.path.join(out_dir, "vocab"))

            # Initialize all variables

            sess.run(tf.global_variables_initializer())

            tf.train.write_graph(sess.graph_def, '.', 'textcnn.pbtxt')

            def train_step(x_batch, y_batch):

                """

                A single training step

                """

                feed_dict = {

                  cnn.input_x: x_batch,

                  cnn.input_y: y_batch,

                  cnn.dropout_keep_prob: FLAGS.dropout_keep_prob

                }

                _, step, summaries, loss, accuracy = sess.run(

                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],

                    feed_dict)

                time_str = datetime.datetime.now().isoformat()

                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                train_summary_writer.add_summary(summaries, step)

            def dev_step(x_batch, y_batch, writer=None):

                """

                Evaluates model on a dev set

                """

                feed_dict = {

                  cnn.input_x: x_batch,

                  cnn.input_y: y_batch,

                  cnn.dropout_keep_prob: 1.0

                }

                step, summaries, loss, accuracy = sess.run(

                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],

                    feed_dict)

                time_str = datetime.datetime.now().isoformat()

                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                if writer:

                    writer.add_summary(summaries, step)

            # e.g. 1000 training examples / batch size ~30 gives roughly 33 batches per epoch

            # Generate batches

            print(len(x_train))

            print(len(y_train))

            batches = data_helpers.batch_iter(

                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
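            # batch_iter (defined in the repo's data_helpers.py) yields mini-batches of (x, y) pairs
            # for FLAGS.num_epochs passes over the training data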

            # Training loop. For each batch...

            for batch in batches:

                x_batch, y_batch = zip(*batch)

                train_step(x_batch, y_batch)

                current_step = tf.train.global_step(sess, global_step)

                if current_step % FLAGS.evaluate_every == 0:

                    print("\nEvaluation:")

                    dev_step(x_dev, y_dev, writer=dev_summary_writer)

                    print("")

                if current_step % FLAGS.checkpoint_every == 0:

                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)

                    print("Saved model checkpoint to {}\n".format(path))

def main(argv=None):

    x_train, y_train, vocab_processor, x_dev, y_dev, maxlength = preprocess()

    train(x_train, y_train, vocab_processor, x_dev, y_dev,maxlength)

if __name__ == '__main__':

    tf.app.run()

 

 

