python - Dimension mismatch error in CountVectorizer and MultinomialNB
Before I lodge this question, I have thoroughly read more than 15 similar topics on this board, each with somewhat different recommendations, but none of them got it right for me.
OK, so I split my 'spam email' text data (originally in CSV format) into training and test sets, using CountVectorizer and its 'fit_transform' function to fit the vocabulary of the corpus and extract word count features from the text. Then I applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in old releases
from sklearn.naive_bayes import MultinomialNB

# loading data
# data contains 2 columns ('text', 'target')
spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)

# split data
x_train, x_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)

# fit vocabulary and extract word count features
cv = CountVectorizer()
x_traincv = cv.fit_transform(x_train)
x_testcv = cv.fit_transform(x_test)

# learn and predict using MultinomialNB
clfnb = MultinomialNB(alpha=0.1)
clfnb.fit(x_traincv, y_train)

# so far so good, but when I predict on x_testcv
y_pred = clfnb.predict(x_testcv)
# Python throws me an error: dimension mismatch
The suggestions I gleaned from previous question threads were to (1) use only .transform() on x_test, or (2) ascertain whether each row in the original spam data is in string format (yes, they are), or (3) do nothing on x_test. None of them rang a bell, and Python kept giving me the 'dimension mismatch' error. After struggling for 4 hours, I had to succumb to StackOverflow. It would be much appreciated if anyone could enlighten me on this. I just want to know what goes wrong with my code and how to get the dimensions right.
thank you.
Btw, the original data entries look like this:
   text                                               target
0  go until jurong point, crazy.. available           0
1  ok lar... joking wif u oni...                      0
2  free entry in 2 wkly comp win fa cup fina          1
3  u dun hor... u c                                   0
4  nah don't think goes usf, lives aro                0
5  freemsg hey there darling it's been 3 week's n     1
6  winner!! valued network customer have              1
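Your CountVectorizer has already been fitted on the training data. So for the test data, you just want to call transform(), not fit_transform().
Otherwise, if you use fit_transform() again on your test data, you will get different columns based on the unique vocabulary of the test data, and the classifier trained on the training columns can no longer match them. So just fit once, on the training data: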
x_testcv = cv.transform(x_test)
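
To see why the column counts differ, here is a minimal sketch on made-up toy data (the texts, labels, and the name x_test_refit are hypothetical stand-ins for the spam CSV, not taken from the question). It compares the matrix shapes produced by transform() on the fitted vectorizer with those from a vectorizer refitted on the test texts:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the 'text' and 'target' columns
train_texts = [
    "free entry win cash now",
    "ok see you at lunch",
    "win a free prize today",
    "are you coming home",
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam
test_texts = ["free cash prize", "see you at home tonight"]

# Fit the vocabulary once, on the training texts only
cv = CountVectorizer()
x_traincv = cv.fit_transform(train_texts)

# Reuse the fitted vocabulary: same columns as the training matrix
x_testcv = cv.transform(test_texts)

# What the question's code effectively does: a vocabulary built from
# the test texts alone, so the columns no longer line up
x_test_refit = CountVectorizer().fit_transform(test_texts)

print("train:", x_traincv.shape)
print("test via transform():", x_testcv.shape)
print("test via fit_transform():", x_test_refit.shape)

# The classifier expects as many features as it saw during fit()
clfnb = MultinomialNB(alpha=0.1)
clfnb.fit(x_traincv, train_labels)
y_pred = clfnb.predict(x_testcv)
print(y_pred)
```

Words that appear only in the test texts (here "tonight") are simply ignored by transform(), which is what keeps the number of columns equal to what the model was trained on.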