Deep Learning Text Classification

Benchmarking · 2019. 5. 31. 22:23

1. Title Generation Model (Text Classification)

1.1. Basic Concept

This uses an algorithm typically found in models that evaluate the sentiment of user reviews or comment sentences.

The basic idea is to match the words used in a sentence to a sentiment and separate the sentences through classification.

 

It is also used for purposes similar to image classification.

 

 

Example uses: Sentiment Analysis / Sentence Classification


1.2. Algorithm Overview

Each word is mapped to a one-hot encoding, and the algorithm predicts the closest class by passing the result through a convolutional neural network + max pooling + fully connected layer.
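As a rough sketch of that word-to-vector mapping (my own illustration, not code from the repository linked below), it could look like this in Python:

import numpy as np

# Toy vocabulary built from a single sentence (illustration only)
sentence = "this movie is great".split()
vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}   # {'great': 0, 'is': 1, 'movie': 2, 'this': 3}

indices = [vocab[w] for w in sentence]    # [3, 2, 1, 0]
one_hot = np.eye(len(vocab))[indices]     # shape (4, 4): one row per word
print(indices)
print(one_hot)

(The TensorFlow code later in this post actually uses a learned embedding lookup rather than raw one-hot vectors.)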

 

The CNN algorithm is mostly used for image inference, extracting features from images to find similarities.

 

Breaking this down by the convolution performed at each layer, it looks like the following.

 

Sliding the filter over the entire matrix, we obtain the element-wise product sum (the convolution) at each position.
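A minimal NumPy sketch of this sliding element-wise multiply-and-sum (my own illustration, not part of the repository code):

import numpy as np

M = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])      # input matrix
F = np.array([[1, 0],
              [0, 1]])         # 2x2 filter

out = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        # multiply the 2x2 window element-wise with the filter and sum the result
        out[i, j] = np.sum(M[i:i + 2, j:j + 2] * F)

print(out)   # [[2. 0.]
             #  [0. 2.]]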


In the paper, a simple CNN with one layer of convolution on top of word vectors is used.

With filter sizes of 2, 3, and 4 and two filters of each size (3 x 2 = 6 filters), convolution is performed over the sentence matrix to produce the feature maps.
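Since a filter of height h slid over a sentence of n words produces a feature map of length n - h + 1, a quick sketch with toy numbers (my own illustration, not from the paper or the repo) gives:

import numpy as np

n, d = 7, 5                        # sentence length, embedding dimension
sentence = np.random.randn(n, d)   # toy sentence matrix

feature_maps = []
for h in (2, 3, 4):                # the three filter heights
    for _ in range(2):             # two filters per height -> 6 filters total
        W = np.random.randn(h, d)
        fmap = [np.sum(sentence[i:i + h] * W) for i in range(n - h + 1)]
        feature_maps.append(fmap)

print(len(feature_maps))               # 6
print([len(f) for f in feature_maps])  # [6, 6, 5, 5, 4, 4]

Max pooling then takes the single largest value from each of the six feature maps.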

 

In short, the effort to take the CNN algorithm originally applied to images and apply it elsewhere produced many papers applying it to NLP as well, and it has been adopted there.


Actual code

Code analysis

 

https://github.com/likejazz/cnn-text-classification-tf

 

1.3. TensorFlow Code Analysis

 

- Training data
- Prerequisite code to run first
- Explanation of the TensorFlow code
- F1 Score
- When studying natural language processing, papers report experimental results with Precision, Recall, and so on.
- These terms come up often but are easy to confuse, so I am writing them down here.

 

                        Experiment result (prediction)
                        True          False
Actual answer   True     a             b
                False    c             d

 

- Precision = a / (a + c)
  - Of the items the model predicted as True, the fraction that are actually True (correct)
- Recall = a / (a + b)
  - Of the items that are actually True, the fraction the model predicted as True
- Accuracy = (a + d) / (a + b + c + d)
  - The fraction of all predictions that match the actual answers
- F1 score = 2 * (Precision * Recall) / (Precision + Recall)
  - A single score combining the Precision and Recall above (see the sketch below)
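The four definitions above can be checked with a small helper function (a sketch of my own, using a, b, c, d exactly as in the table):

def metrics(a, b, c, d):
    # a: predicted True, actually True     b: predicted False, actually True
    # c: predicted True, actually False    d: predicted False, actually False
    precision = a / (a + c)
    recall = a / (a + b)
    accuracy = (a + d) / (a + b + c + d)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

print(metrics(a=8, b=2, c=4, d=6))   # (0.666..., 0.8, 0.7, 0.727...)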

 

 

Description of the TensorFlow model parameters

 

  --embedding_dim EMBEDDING_DIM

                        Dimensionality of character embedding (default: 128)

  --filter_sizes FILTER_SIZES

                        Comma-separated filter sizes (default: '3,4,5')

  --num_filters NUM_FILTERS

                        Number of filters per filter size (default: 128)

  --l2_reg_lambda L2_REG_LAMBDA

                        L2 regularization lambda (default: 0.0)

  --dropout_keep_prob DROPOUT_KEEP_PROB

                        Dropout keep probability (default: 0.5)

  --batch_size BATCH_SIZE

                        Batch Size (default: 64)

  --num_epochs NUM_EPOCHS

                        Number of training epochs (default: 100)

  --evaluate_every EVALUATE_EVERY

                        Evaluate model on dev set after this many steps

                        (default: 100)

  --checkpoint_every CHECKPOINT_EVERY

                        Save model after this many steps (default: 100)

  --allow_soft_placement ALLOW_SOFT_PLACEMENT

                        Allow device soft device placement

  --noallow_soft_placement

  --log_device_placement LOG_DEVICE_PLACEMENT

                        Log placement of ops on devices

  --nolog_device_placement

# text_cnn.py in the repo: the TextCNN model definition
import tensorflow as tf


class TextCNN(object):

    """

    A CNN for text classification.

    Uses an embedding layer, followed by a convolutional, max-pooling and softmax layer.

    """

    def __init__(

      self, sequence_length, num_classes, vocab_size,

      embedding_size, filter_sizes, num_filters, l2_reg_lambda=0.0):

 

        # Placeholders for input, output and dropout

        self.input_x = tf.placeholder(tf.int32, [None, sequence_length], name="input_x")

        self.input_y = tf.placeholder(tf.float32, [None, num_classes], name="input_y")

        self.dropout_keep_prob = tf.placeholder(tf.float32, name="dropout_keep_prob")

 

        # Keeping track of l2 regularization loss (optional)

        l2_loss = tf.constant(0.0)

 

        # Embedding layer

        with tf.device('/cpu:0'), tf.name_scope("embedding"):

            W = tf.Variable(

                tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),

                name="W")

            self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)

            self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
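            # shape note: embedded_chars is [batch, sequence_length, embedding_size];
            # expand_dims adds a channel dimension -> [batch, sequence_length, embedding_size, 1] for conv2d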

 

        # Create a convolution + maxpool layer for each filter size

        pooled_outputs = []

        for i, filter_size in enumerate(filter_sizes):

            with tf.name_scope("conv-maxpool-%s" % filter_size):

                # Convolution Layer

                filter_shape = [filter_size, embedding_size, 1, num_filters]

                W = tf.Variable(tf.truncated_normal(filter_shape, stddev=0.1), name="W")

                b = tf.Variable(tf.constant(0.1, shape=[num_filters]), name="b")

                conv = tf.nn.conv2d(

                    self.embedded_chars_expanded,

                    W,

                    strides=[1, 1, 1, 1],

                    padding="VALID",

                    name="conv")

                # Apply nonlinearity

                h = tf.nn.relu(tf.nn.bias_add(conv, b), name="relu")

                # Maxpooling over the outputs

                pooled = tf.nn.max_pool(

                    h,

                    ksize=[1, sequence_length - filter_size + 1, 1, 1],

                    strides=[1, 1, 1, 1],

                    padding='VALID',

                    name="pool")

                pooled_outputs.append(pooled)

 

        # Combine all the pooled features

        num_filters_total = num_filters * len(filter_sizes)

        self.h_pool = tf.concat(pooled_outputs, 3)

        self.h_pool_flat = tf.reshape(self.h_pool, [-1, num_filters_total])
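        # shape note: h_pool_flat is [batch, num_filters * len(filter_sizes)], one pooled value per filter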

 

        # Add dropout

        with tf.name_scope("dropout"):

            self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)

 

        # Final (unnormalized) scores and predictions

        with tf.name_scope("output"):

            W = tf.get_variable(

                "W",

                shape=[num_filters_total, num_classes],

                initializer=tf.contrib.layers.xavier_initializer())

            b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")

            l2_loss += tf.nn.l2_loss(W)

            l2_loss += tf.nn.l2_loss(b)

            self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")

            self.predictions = tf.argmax(self.scores, 1, name="predictions")

 

        # CalculateMean cross-entropy loss

        with tf.name_scope("loss"):

            losses = tf.nn.softmax_cross_entropy_with_logits(logits=self.scores, labels=self.input_y)

            self.loss = tf.reduce_mean(losses) + l2_reg_lambda * l2_loss

 

        # Accuracy

        with tf.name_scope("accuracy"):

            correct_predictions = tf.equal(self.predictions, tf.argmax(self.input_y, 1))

            self.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")

 

 

 

#! /usr/bin/env python
# Training script (train.py in the repo): data loading/preprocessing, model construction, and the training loop

import tensorflow as tf

import numpy as np

import os

import time

import datetime

import data_helpers

from text_cnn import TextCNN

from tensorflow.contrib import learn

from collections import OrderedDict

from cnn_model import CNN

import json

# Parameters

# ==================================================

# Data loading params

tf.flags.DEFINE_float("dev_sample_percentage", .1, "Percentage of the training data to use for validation")

tf.flags.DEFINE_string("positive_data_file", "./data/rt-polaritydata/rt-polarity.pos", "Data source for the positive data.")

tf.flags.DEFINE_string("negative_data_file", "./data/rt-polaritydata/rt-polarity.neg", "Data source for the negative data.")

tf.flags.DEFINE_string("animal_data_file", "./data/rt-polaritydata/rt-polarity.animal", "Data source for the animal data.")

tf.flags.DEFINE_string("data_path", "./data/group_tag4/", "Data source for the data group.")

#tf.flags.DEFINE_string("data_path", "./data/rt-polaritydata/", "Data source for the data group.")

# Model Hyperparameters

tf.flags.DEFINE_integer("embedding_dim", 128, "Dimensionality of character embedding (default: 128)")

tf.flags.DEFINE_string("filter_sizes", "3,4", "Comma-separated filter sizes (default: '3,4,5')")

tf.flags.DEFINE_integer("num_filters", 128, "Number of filters per filter size (default: 128)")

tf.flags.DEFINE_float("dropout_keep_prob", 0.5, "Dropout keep probability (default: 0.5)")

tf.flags.DEFINE_float("l2_reg_lambda", 0.0, "L2 regularization lambda (default: 0.0)")

# Training parameters

tf.flags.DEFINE_integer("batch_size", 64, "Batch Size (default: 64)")

tf.flags.DEFINE_integer("num_epochs", 200, "Number of training epochs (default: 200)")

tf.flags.DEFINE_integer("evaluate_every", 100, "Evaluate model on dev set after this many steps (default: 100)")

tf.flags.DEFINE_integer("checkpoint_every", 100, "Save model after this many steps (default: 100)")

tf.flags.DEFINE_integer("num_checkpoints", 5, "Number of checkpoints to store (default: 5)")

# Misc Parameters

tf.flags.DEFINE_boolean("allow_soft_placement", True, "Allow device soft device placement")

tf.flags.DEFINE_boolean("log_device_placement", False, "Log placement of ops on devices")

FLAGS = tf.flags.FLAGS

# FLAGS._parse_flags()

# print("\nParameters:")

# for attr, value in sorted(FLAGS.__flags.items()):

#     print("{}={}".format(attr.upper(), value))

# print("")

def preprocess():

    # Data Preparation

    # ==================================================

    # Load data

    print("Loading data...")

    '''

    x_text, y = data_helpers.load_data_and_labels(FLAGS.airfield, FLAGS.airplanecabin,FLAGS.airportterminal,FLAGS.alcove,FLAGS.alley,

                                                  FLAGS.amphitheater,FLAGS.amusementarcade,FLAGS.amusementpark,FLAGS.apartmentbuilding,FLAGS.aquarium,

                                                  FLAGS.aqueduct,FLAGS.arcade,FLAGS.arch,FLAGS.archaelogical,FLAGS.archive,FLAGS.armybase,FLAGS.artclass,

                                                  FLAGS.artgallery,FLAGS.artistsloft,FLAGS.assemblyline,FLAGS.atelier,FLAGS.atrium,FLAGS.attic,FLAGS.auditorium,

                                                  FLAGS.badlands,FLAGS.bakery)

    '''

    x_text, y = data_helpers.load_data_and_labels_all(FLAGS.data_path)

    #print(x_text)

    # Build vocabulary

    max_document_length = max([len(x.split(" ")) for x in x_text])

    #max_document_length = max([len(x) for x in x_text])

    print("max_document_length",max_document_length)

    vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)

    x = np.array(list(vocab_processor.fit_transform(x_text)))
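    # x is an int array of shape [num_examples, max_document_length]: word indices, padded to equal length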

    vocab_dict = vocab_processor.vocabulary_._mapping

    sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])

    vocabulary = list(list(zip(*sorted_vocab))[0])

    print("Vocabulary : ",vocabulary)

    with open('./vocab.txt', 'w') as f:

        for item in vocabulary:

            f.write("%s\n" % item)

    with open('./diction.txt', 'w') as file:

        file.write(json.dumps(vocab_dict))

        #wordnp = np.column_stack((x, x_train_np))

    #wordnp = np.concatenate((x, x_text),axis=0)

    #np.savetxt('wordidx.txt', (x_train_np), fmt='%s')

    '''

    with open("./vocab.txt", 'wt') as word:

        for i, val in enumerate (x) :

            #data = x_text[i] #

            print(val,x_text[i])

            word.write('%d %s\n'.join(x_text[i]))

    '''

    #f.close()

    # Randomly shuffle data

    np.random.seed(10)

    shuffle_indices = np.random.permutation(np.arange(len(y)))

    print("shuffle_indices",shuffle_indices)

    print("len(y)", len(y))

    x_shuffled = x[shuffle_indices]

    y_shuffled = y[shuffle_indices]

    # Split train/test set

    # TODO: This is very crude, should use cross-validation

    dev_sample_index = -1 * int(FLAGS.dev_sample_percentage * float(len(y)))
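    # e.g. with dev_sample_percentage = 0.1 and 1,000 examples, dev_sample_index = -100,
    # so the last 100 shuffled examples become the dev set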

    print("dev_sample_index", dev_sample_index)

    x_train, x_dev = x_shuffled[:dev_sample_index], x_shuffled[dev_sample_index:]

    y_train, y_dev = y_shuffled[:dev_sample_index], y_shuffled[dev_sample_index:]

    del x, y, x_shuffled, y_shuffled

    print("Vocabulary Size: {:d}".format(len(vocab_processor.vocabulary_)))

    print("Train/Dev split: {:d}/{:d}".format(len(y_train), len(y_dev)))

    return x_train, y_train, vocab_processor, x_dev, y_dev , max_document_length

def train(x_train, y_train, vocab_processor, x_dev, y_dev,maxlength):

    # Training

    # ==================================================

    with tf.Graph().as_default():

        session_conf = tf.ConfigProto(

          allow_soft_placement=FLAGS.allow_soft_placement,

          log_device_placement=FLAGS.log_device_placement)

        sess = tf.Session(config=session_conf)

        with sess.as_default():

            # Leftover from an experiment with an alternative CNN model (cnn_model.CNN); not used below:
            # def __init__(self, embedding_layer=None, num_words=None, embedding_dim=None,
            #              max_seq_length=100, filter_sizes=[3, 4, 5], feature_maps=[100, 100, 100],
            #              hidden_units=100, dropout_rate=None, nb_classes=None):
            # cnn = CNN(
            #     #embedding_layer=x_train.shape[1],
            #     embedding_dim=FLAGS.embedding_dim,
            #     max_seq_length=maxlength,
            #     filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),
            #     #feature_maps=[3,4,5],
            #     #hidden_units=,
            #     dropout_rate=0.1,
            #     nb_classes=y_train.shape[1]
            # )

            cnn = TextCNN(

                sequence_length=x_train.shape[1],

                num_classes=y_train.shape[1],

                vocab_size=len(vocab_processor.vocabulary_),

                embedding_size=FLAGS.embedding_dim,

                filter_sizes=list(map(int, FLAGS.filter_sizes.split(","))),

                num_filters=FLAGS.num_filters,

                l2_reg_lambda=FLAGS.l2_reg_lambda)

            # Define Training procedure

            global_step = tf.Variable(0, name="global_step", trainable=False)

            optimizer = tf.train.AdamOptimizer(1e-3)

            grads_and_vars = optimizer.compute_gradients(cnn.loss)

            train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

            # Keep track of gradient values and sparsity (optional)

            grad_summaries = []

            for g, v in grads_and_vars:

                if g is not None:

                    grad_hist_summary = tf.summary.histogram("{}/grad/hist".format(v.name), g)

                    sparsity_summary = tf.summary.scalar("{}/grad/sparsity".format(v.name), tf.nn.zero_fraction(g))

                    grad_summaries.append(grad_hist_summary)

                    grad_summaries.append(sparsity_summary)

            grad_summaries_merged = tf.summary.merge(grad_summaries)

            # Output directory for models and summaries

            timestamp = str(int(time.time()))

            out_dir = os.path.abspath(os.path.join(os.path.curdir, "runs", timestamp))

            print("Writing to {}\n".format(out_dir))

            # Summaries for loss and accuracy

            loss_summary = tf.summary.scalar("loss", cnn.loss)

            acc_summary = tf.summary.scalar("accuracy", cnn.accuracy)

            # Train Summaries

            train_summary_op = tf.summary.merge([loss_summary, acc_summary, grad_summaries_merged])

            train_summary_dir = os.path.join(out_dir, "summaries", "train")

            train_summary_writer = tf.summary.FileWriter(train_summary_dir, sess.graph)

            # Dev summaries

            dev_summary_op = tf.summary.merge([loss_summary, acc_summary])

            dev_summary_dir = os.path.join(out_dir, "summaries", "dev")

            dev_summary_writer = tf.summary.FileWriter(dev_summary_dir, sess.graph)

            # Checkpoint directory. Tensorflow assumes this directory already exists so we need to create it

            checkpoint_dir = os.path.abspath(os.path.join(out_dir, "checkpoints"))

            checkpoint_prefix = os.path.join(checkpoint_dir, "model")

            if not os.path.exists(checkpoint_dir):

                os.makedirs(checkpoint_dir)

            saver = tf.train.Saver(tf.global_variables(), max_to_keep=FLAGS.num_checkpoints)

            # Write vocabulary

            vocab_processor.save(os.path.join(out_dir, "vocab"))

            # Initialize all variables

            sess.run(tf.global_variables_initializer())

            tf.train.write_graph(sess.graph_def, '.', 'textcnn.pbtxt')

            def train_step(x_batch, y_batch):

                """

                A single training step

                """

                feed_dict = {

                  cnn.input_x: x_batch,

                  cnn.input_y: y_batch,

                  cnn.dropout_keep_prob: FLAGS.dropout_keep_prob

                }

                _, step, summaries, loss, accuracy = sess.run(

                    [train_op, global_step, train_summary_op, cnn.loss, cnn.accuracy],

                    feed_dict)

                time_str = datetime.datetime.now().isoformat()

                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                train_summary_writer.add_summary(summaries, step)

            def dev_step(x_batch, y_batch, writer=None):

                """

                Evaluates model on a dev set

                """

                feed_dict = {

                  cnn.input_x: x_batch,

                  cnn.input_y: y_batch,

                  cnn.dropout_keep_prob: 1.0

                }

                step, summaries, loss, accuracy = sess.run(

                    [global_step, dev_summary_op, cnn.loss, cnn.accuracy],

                    feed_dict)

                time_str = datetime.datetime.now().isoformat()

                print("{}: step {}, loss {:g}, acc {:g}".format(time_str, step, loss, accuracy))

                if writer:

                    writer.add_summary(summaries, step)

            # e.g. 1000 training examples / batch size ~30 gives roughly 33 batches per epoch

            # Generate batches

            print(len(x_train))

            print(len(y_train))

            batches = data_helpers.batch_iter(

                list(zip(x_train, y_train)), FLAGS.batch_size, FLAGS.num_epochs)
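            # batch_iter (defined in the repo's data_helpers.py) yields mini-batches of (x, y) pairs
            # for FLAGS.num_epochs passes over the training data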

            # Training loop. For each batch...

            for batch in batches:

                x_batch, y_batch = zip(*batch)

                train_step(x_batch, y_batch)

                current_step = tf.train.global_step(sess, global_step)

                if current_step % FLAGS.evaluate_every == 0:

                    print("\nEvaluation:")

                    dev_step(x_dev, y_dev, writer=dev_summary_writer)

                    print("")

                if current_step % FLAGS.checkpoint_every == 0:

                    path = saver.save(sess, checkpoint_prefix, global_step=current_step)

                    print("Saved model checkpoint to {}\n".format(path))

def main(argv=None):

    x_train, y_train, vocab_processor, x_dev, y_dev, maxlength = preprocess()

    train(x_train, y_train, vocab_processor, x_dev, y_dev,maxlength)

if __name__ == '__main__':

    tf.app.run()

 

 

