It is no secret that bad actors are wielding this terrible weapon at a scale never seen before, and our legal systems are simply not fast enough to stop them. Machine learning techniques are certainly one way to fight back.
That said, our focus in this article is not to find the best technique for this challenge; rather, we aim to suggest a way to handle data sets that are too big to fit in memory.
The selected data set is not really that big, but it will serve the purpose.
The data set
We will use a Kaggle data set named Fake and real news dataset. It contains 44,898 texts labeled as true or fake.
The data set consists of two CSV files. If we download it to a directory named kaggle, we can use the following code to split the articles into individual files that we will use during our out-of-core training.
from csv import reader
from os import makedirs, path
from re import sub


class Article:
    def __init__(self, title: str, text: str):
        self.title = title
        self.text = text

    @property
    def clean(self):
        # Keep letters only, lowercase everything, and collapse
        # repeated whitespace into single spaces.
        sanitizer = '[^A-Za-z]+'
        clean = sub(sanitizer, ' ', f'{self.title} {self.text}')
        clean = clean.lower()
        return sub(r'\s+', ' ', clean)


def split_dataset(base_folder: str = 'kaggle'):
    fake_folder = path.join(base_folder, 'fake')
    true_folder = path.join(base_folder, 'true')
    makedirs(fake_folder, exist_ok=True)
    makedirs(true_folder, exist_ok=True)
    for file in ['Fake', 'True']:
        with open(path.join('kaggle', f'{file}.csv'), 'r',
                  encoding='utf-8',
                  errors='ignore') as f:
            csv_reader = reader(f, delimiter=',', quotechar='"')
            _ = next(csv_reader)  # skip the header row
            for i, row in enumerate(csv_reader):
                article_file = path.join(base_folder,
                                         file.lower(),
                                         f'{i:05d}.txt')
                with open(article_file, 'w+') as a:
                    a.write(Article(row[0], row[1]).clean)


split_dataset()
As a sanity check, we can run:
from glob import glob
print('Fake articles:', len(glob(path.join('kaggle', 'fake', '*'))))
# Fake articles: 23481
print('True articles:', len(glob(path.join('kaggle', 'true', '*'))))
# True articles: 21417
Pre-processing
The approach here is to encode each article as an array of word occurrences, a bag-of-words representation in which each position of the array corresponds to one specific word.
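As a toy illustration (the three-word dictionary below is made up for the example; the real one is built from the training data further down), the encoding works like this:

import numpy as np

# Hypothetical three-word dictionary: position 0 counts 'trump',
# position 1 counts 'said', position 2 counts 'news'.
dictionary = np.array(['trump', 'said', 'news'])

encoded = np.zeros(dictionary.size)
for word in 'trump said trump said news'.split(' '):
    index = np.where(dictionary == word)[0]
    if index.size == 1:  # unknown words are ignored
        encoded[index[0]] += 1

print(encoded)  # [2. 2. 1.]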
To begin with, we create a transformer to encode our articles. When the encoder is fitted for the first time, it persists the word dictionary to disk, inside a directory called .dictionaries. Before fitting again, the encoder first checks whether a dictionary with the same name already exists; the refit parameter allows fitting again while ignoring an existing dictionary.
from collections import defaultdict

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# makedirs and path come from the imports in the first snippet.


class ArticleEncoder(BaseEstimator, TransformerMixin):
    def __init__(self,
                 dictionary_name: str,
                 binary: bool = False,
                 top: int = 1000,
                 min_length: int = 2,
                 max_length: int = 100,
                 refit: bool = False):
        makedirs('.dictionaries', exist_ok=True)
        self.dictionary = None
        self.dictionary_name = dictionary_name
        self.dictionary_filename = path.join('.dictionaries',
                                             f'{self.dictionary_name}.npy')
        self.binary = binary
        self.top = top
        self.min_length = min_length
        self.max_length = max_length
        self.refit = refit

    def fit(self, X, y=None):
        if path.exists(self.dictionary_filename) and not self.refit:
            # Reuse the dictionary persisted by a previous fit.
            self.dictionary = np.load(self.dictionary_filename)
        else:
            # Count word occurrences across all articles in X.
            dictionary = defaultdict(int)
            for article_path in X:
                with open(article_path, 'r') as article:
                    for word in article.read().split(' '):
                        dictionary[word] += 1
            descending_dictionary = sorted(dictionary.items(),
                                           key=lambda v: v[1],
                                           reverse=True)
            # Keep the `top` most frequent words within the length bounds.
            self.dictionary = np.array([
                word for (word, occur) in descending_dictionary
                if self.min_length <= len(word) <= self.max_length
            ][:self.top])
            if self.dictionary_name:
                np.save(path.join('.dictionaries', self.dictionary_name),
                        self.dictionary)
        return self

    def transform(self, X):
        return np.array(list(map(self.encode_article, X)))

    def encode_article(self, article_path: str):
        encoded = np.zeros(self.dictionary.size)
        with open(article_path, 'r') as article:
            words = article.read().split(' ')
            for word in words:
                index = np.where(self.dictionary == word)[0]
                if index.size == 1:  # we ignore unknown words
                    if self.binary:
                        encoded[index[0]] = 1
                    else:
                        encoded[index[0]] += 1
        return encoded
The next step is to split our articles into train and test sets:
from sklearn.model_selection import train_test_split

true_articles = glob(path.join('kaggle', 'true', '*'))
fake_articles = glob(path.join('kaggle', 'fake', '*'))

all_articles = np.concatenate((np.array(true_articles),
                               np.array(fake_articles)))
# True articles are labeled 0, fake articles 1.
all_labels = np.concatenate((np.zeros(len(true_articles)),
                             np.ones(len(fake_articles))))

articles_train, articles_test, labels_train, labels_test = \
    train_test_split(all_articles,
                     all_labels,
                     test_size=0.2,
                     random_state=42,
                     stratify=all_labels)
And we fit our encoder using only the training set:
encoder = ArticleEncoder(dictionary_name='dictionary_1000')
encoder.fit(articles_train)
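If everything worked, transforming a few file paths should give one row per article and one column per dictionary word (the shape below assumes the default top=1000):

encoded = encoder.transform(articles_train[:2])
print(encoded.shape)  # expected: (2, 1000)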
Training a model
Because we are assuming our data set is too big to fit in memory all at once, we need a model that can be trained in small batches. A good candidate is the SGDClassifier, which applies stochastic gradient descent to minimize the loss function of a linear model.
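As an aside, scikit-learn also exposes partial_fit for exactly this incremental use case; a minimal sketch with made-up toy batches could look like the code below. In this article we will use warm_start with repeated calls to fit instead, mainly because Pipeline does not expose partial_fit.

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
clf = SGDClassifier(loss='hinge', penalty='l1', alpha=0.01)

for batch in range(10):
    # Made-up mini-batch standing in for 100 encoded articles.
    X_batch = rng.random((100, 1000))
    y_batch = rng.integers(0, 2, size=100).astype(float)
    # classes must be given on the first call to partial_fit.
    clf.partial_fit(X_batch, y_batch, classes=np.array([0.0, 1.0]))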
We create 5 folds out of the training set:
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5).split(articles_train,
                                          labels_train)
This helper function prints a progress bar along with the running accuracy and loss:
def print_epoch_bar(cur_epoch: int, n_epochs: int, acc: list, loss: list):
    pct = 100 * cur_epoch / n_epochs
    h_pct = int(pct / 2)
    acc_ = 100 * sum(acc) / len(acc)
    loss_ = sum(loss) / len(loss)
    print('{} {} {}'.format('{:.2f} % | '.format(pct).rjust(11, ' '),
                            '{}'.format("=" * h_pct + '>').ljust(51, ' '),
                            f'| acc: {acc_:.2f} %, loss: {loss_:.4f}'))
We create some lists to hold the training and validation metrics:
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hinge_loss,
    precision_score,
    recall_score
)

train_accuracy, val_accuracy = [], []
train_f1, val_f1 = [], []
train_hinge, val_hinge = [], []
train_precision, val_precision = [], []
train_recall, val_recall = [], []
Then a big for loop trains one estimator for each of the folds.
import warnings

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

for k, (train, val) in enumerate(kfold):
    print(f'Fold {k}')

    articles_train_ = articles_train[train]
    articles_val_ = articles_train[val]
    labels_train_ = labels_train[train]
    labels_val_ = labels_train[val]

    # train
    print('training')
    print()
    n_epochs = 100
    batch_size = 100
    pipeline = Pipeline([
        # The encoder reloads its cached dictionary on every fit call.
        ('encoder', ArticleEncoder(dictionary_name='dictionary_1000')),
        # warm_start=True makes each fit continue from the previous
        # coefficients, so repeated fits act as mini-batch updates.
        ('sgd_clf', SGDClassifier(warm_start=True,
                                  max_iter=1,
                                  penalty='l1',
                                  alpha=0.01,
                                  loss='hinge'))
    ])
    train_accuracy_ = []
    train_f1_ = []
    train_hinge_ = []
    train_precision_ = []
    train_recall_ = []
    for epoch in range(n_epochs):
        # Sample a random mini-batch of article paths.
        idx = np.random.randint(0, articles_train_.size, batch_size)
        articles = articles_train_[idx]
        labels = labels_train_[idx]
        with warnings.catch_warnings():
            # max_iter=1 triggers a convergence warning on every call.
            warnings.simplefilter("ignore")
            pipeline.fit(articles, labels)
        label_pred = pipeline.predict(articles)
        train_accuracy_.append(accuracy_score(labels, label_pred))
        train_f1_.append(f1_score(labels, label_pred))
        # Note: hinge_loss is computed here on the predicted labels,
        # not on decision_function scores.
        train_hinge_.append(hinge_loss(labels, label_pred))
        train_precision_.append(precision_score(labels, label_pred))
        train_recall_.append(recall_score(labels, label_pred))
        print_epoch_bar(epoch, n_epochs, train_accuracy_, train_hinge_)

    # validate
    print('validating')
    label_pred = pipeline.predict(articles_val_)
    val_accuracy_ = accuracy_score(labels_val_, label_pred)
    val_f1_ = f1_score(labels_val_, label_pred)
    val_hinge_ = hinge_loss(labels_val_, label_pred)
    val_precision_ = precision_score(labels_val_, label_pred)
    val_recall_ = recall_score(labels_val_, label_pred)
    print(', '.join([f'loss: {val_hinge_:04f}',
                     f'precision: {100 * val_precision_:.2f}',
                     f'recall: {100 * val_recall_:.2f}']))
    print()

    val_accuracy.append(val_accuracy_)
    val_f1.append(val_f1_)
    val_hinge.append(val_hinge_)
    val_precision.append(val_precision_)
    val_recall.append(val_recall_)

    train_accuracy.append(np.mean(train_accuracy_))
    train_f1.append(np.mean(train_f1_))
    train_hinge.append(np.mean(train_hinge_))
    train_precision.append(np.mean(train_precision_))
    train_recall.append(np.mean(train_recall_))
If we now print a summary of what we have accomplished so far:
def get_interval(values: list, pct: bool = True):
    avg = np.mean(values) * (100.0 if pct else 1.0)
    std = np.std(values) * (100.0 if pct else 1.0)
    # Report the mean +/- the standard error of the mean.
    formatted = '({:.2f} +/- {:.2f})'.format(avg, std / np.sqrt(len(values)))
    if pct:
        return f'{formatted} %'
    return formatted


print('cross validation result:')
print()
print('Avg. train accuracy: {}'.format(get_interval(train_accuracy)))
print('Avg. train precision: {}'.format(get_interval(train_precision)))
print('Avg. train recall: {}'.format(get_interval(train_recall)))
print('Avg. train f1: {}'.format(get_interval(train_f1)))
print('Avg. train hinge loss: {}'.format(get_interval(train_hinge, pct=False)))
print()
print('Avg. val accuracy: {}'.format(get_interval(val_accuracy)))
print('Avg. val precision: {}'.format(get_interval(val_precision)))
print('Avg. val recall: {}'.format(get_interval(val_recall)))
print('Avg. val f1: {}'.format(get_interval(val_f1)))
print('Avg. val hinge loss: {}'.format(get_interval(val_hinge, pct=False)))
The result is:
cross validation result:
Avg. train accuracy: (94.27 +/- 0.17) %
Avg. train precision: (95.40 +/- 0.18) %
Avg. train recall: (93.84 +/- 0.42) %
Avg. train f1: (94.23 +/- 0.27) %
Avg. train hinge loss: (0.54 +/- 0.00)
Avg. val accuracy: (91.85 +/- 1.66) %
Avg. val precision: (98.94 +/- 0.48) %
Avg. val recall: (85.40 +/- 3.58) %
Avg. val f1: (91.48 +/- 1.86) %
Avg. val hinge loss: (0.56 +/- 0.02)
Not terrible, but the gap between training and validation metrics (especially recall: 93.84 % vs. 85.40 %) shows the model is overfitting.
A few highlights on the code above:
- The parameter warm_start=True is what allows the model to be trained in small batches: each call to fit continues from the previous coefficients instead of re-initializing them.
- Using penalty='l1' gives us Lasso regularization, which is interesting when we have a large number of features (many words) that might not all have the same importance.
- The parameter alpha determines the strength of the regularization constraint.
- The choice loss='hinge' is an attempt to have something like a Support Vector Machine, which is great at finding a safe boundary that separates the classes but is not itself suitable for out-of-core training. The resulting objective is sketched below.
To fight the overfitting, we could try to increase the alpha parameter of the SGDClassifier. However, the low overall accuracy during training already tells us that the capacity of the model is not big enough.
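For instance, a hypothetical way to increase capacity with the tools we already have would be a larger dictionary, since the encoder's top parameter directly controls the number of features (this variation is not something we run in this article):

# Hypothetical: a 5,000-word dictionary instead of the default 1,000.
# refit=True forces the encoder to rebuild and re-persist the dictionary.
bigger_encoder = ArticleEncoder(dictionary_name='dictionary_5000',
                                top=5000,
                                refit=True)
bigger_encoder.fit(articles_train)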
Let’s see how it performs on the test set:
from sklearn.metrics import classification_report

# Note: `pipeline` here is the estimator trained on the last fold.
print('test set result:')
label_pred = pipeline.predict(articles_test)
print(classification_report(labels_test,
                            label_pred,
                            target_names=['True', 'Fake'],
                            digits=4))
And the result is:
              precision    recall  f1-score   support

        True     0.8212    0.9788    0.8931      4284
        Fake     0.9765    0.8056    0.8828      4696

    accuracy                         0.8882      8980
   macro avg     0.8989    0.8922    0.8880      8980
weighted avg     0.9024    0.8882    0.8877      8980
Conclusion
We saw a simple strategy to train a model when the data set doesn't fit in memory all at once. Although it has room for improvement, it might be a good scaffolding for a simpler problem.
This problem would be better addressed with proper NLP techniques, and we might return to it in the future; for now, we can say we have a better feel for the problem's complexity.