Natural Language Processing (NLP)
A Computer Science field, connected to Artificial Intelligence and Computational Linguistics, that focuses on the interactions between computers and human language, and on a machine's ability to understand (or mimic the understanding of) human language
# Resources
- https://en.wikipedia.org/wiki/Natural_language_processing
- NLP is a branch of data science that consists of systematic processes for analyzing, understanding, and deriving information from text data in a smart and efficient manner. By utilizing NLP and its components, one can organize massive chunks of text data, perform numerous automated tasks and solve a wide range of problems, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition and topic segmentation.
- https://github.com/keon/awesome-nlp
- The most important NLP highlights of 2018
- NLP - Udemy ML
- https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
- https://www.datascience.com/blog/introduction-to-natural-language-processing-lexical-units-learn-data-science-tutorials
- https://github.com/BotCube/awesome-bots
- http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
- Hidden Markov model
- Speech recognition
# Text preparation
- http://nitin-panwar.github.io/Text-prepration-before-Sentiment-analysis/
- Removing numbers
- Removing Urls and Links
- Removing stopwords
- Stemming words
- Suffix-dropping algorithms
- Lemmatisation algorithms
- n-gram analysis
- Removing punctuation
- Stripping whitespace
- Checking for impure characters
- http://thinknook.com/10-ways-to-improve-your-classification-algorithm-performance-2013-01-21/
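- A minimal sketch of the preparation steps listed above using NLTK; the function name, corpus downloads and toy sentence are illustrative, not taken from the linked posts:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)   # one-time corpus downloads
nltk.download("wordnet", quiet=True)

def prepare(text):
    text = re.sub(r"https?://\S+", " ", text)       # remove URLs and links
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)            # remove punctuation
    tokens = text.lower().split()                   # tokenize, strip whitespace
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]  # remove stopwords
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    stems = [stemmer.stem(t) for t in tokens]       # suffix-dropping stemming
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]  # dictionary-based lemmas
    return stems, lemmas

print(prepare("Running 3 experiments, see https://example.com for details!"))
```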
# Feature engineering
- https://www.geeksforgeeks.org/feature-extraction-techniques-nlp/
- https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/
# Bag of words
- A commonly used model in text classification. In the BoW model, a piece of text (a sentence or a document) is represented as a bag (multiset) of words, disregarding grammar and even word order; the frequency of occurrence of each word is used as a feature for training a classifier.
- BoW is different from Word2vec, which we’ll cover next. The main difference is that Word2vec produces one vector per word, whereas BoW produces one number (a word count). Word2vec is great for digging into documents and identifying content and subsets of content. Its vectors represent each word’s context, the n-grams of which it is a part. BoW is good for classifying documents as a whole (see the sketch below).
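- A small sketch of the BoW representation with scikit-learn's CountVectorizer; the toy corpus is illustrative, and get_feature_names_out assumes a recent scikit-learn (>= 1.0):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
]
vectorizer = CountVectorizer()           # grammar and word order are discarded
X = vectorizer.fit_transform(corpus)     # sparse document-term count matrix

print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus
print(X.toarray())                         # per-document word counts (the features)
```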
# tf–idf
- tf-idf
- Term Frequency-Inverse Document Frequency
- tf–idf is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
- The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.
- https://deeplearning4j.org/bagofwords-tf-idf
- http://www.tfidf.com/
- http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py
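- A minimal sketch with scikit-learn's TfidfVectorizer, which combines the term counting and the idf offsetting described above in one step; the toy corpus is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)   # rows: documents, columns: tf-idf weights

# words that appear in many documents get lower weights than rarer, more
# document-specific words
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0].round(2))))
```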
# Word embeddings
- Word embedding is the modern way of representing words as vectors. The aim of word embedding is to map high-dimensional word features into low-dimensional feature vectors while preserving the contextual similarity in the corpus. Word embeddings are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.
- #PAPER Word2Vec: Distributed Representations of Words and Phrases and their Compositionality (Mikolov 2013)
- https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
- Skip-gram model with negative sampling
- https://code.google.com/archive/p/word2vec/
- Paper explained
- http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html
- http://www.deeplearningweekly.com/blog/demystifying-word2vec
- http://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/
- #PAPER Distributed representations of sentences and documents (Le 2014)
- #PAPER GloVe: Global Vectors for Word Representation (Pennington 2014)
- Glove
- GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
- #PAPER sense2vec - A Fast and Accurate Method for Word Sense Disambiguation In Neural Word Embeddings (Trask 2015)
- #PAPER Enriching Word Vectors with Subword Information (Bojanowski 2017)
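- A small sketch of training a skip-gram word2vec model (Mikolov 2013, above) with gensim 4.x; the toy corpus and hyperparameters are illustrative only:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chased", "the", "cat"],
]
model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window around each target word
    min_count=1,      # keep even rare words in this toy corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
    negative=5,       # negative sampling
)
print(model.wv["king"].shape)          # one dense vector per word
print(model.wv.most_similar("king"))   # nearest neighbours in vector space
```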
# Semantics
# Distributional semantics
- General recipe:
- form a word-context matrix of counts (data)
- perform dimensionality reduction (AI/Math and Statistics/SVD) for generalization
- For LSA the context is the document where the word appears.
- For word2vec the context is just the nearby words (within some window) in a document.
- Latent semantic analysis
- Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, for analyzing the relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text.
- A matrix containing word counts per paragraph (rows represent unique words and columns represent each paragraph) is constructed from a large piece of text, and SVD is used to reduce the number of rows while preserving the similarity structure among columns. Words are then compared by taking the cosine of the angle between the two vectors (or the dot product of the normalizations of the two vectors) formed by any two rows. Values close to 1 represent very similar words while values close to 0 represent very dissimilar words.
- http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/
- https://github.com/chrisjmccormick/LSA_Classification/blob/master/runClassification_LSA.py
- http://stackoverflow.com/questions/30590881/python-lsa-with-sklearn
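- A minimal LSA sketch following the recipe above (a tf-idf word-context matrix, then truncated SVD for dimensionality reduction); the toy corpus and number of components are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a cat chased a mouse",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]
X = TfidfVectorizer().fit_transform(docs)   # term-document matrix
lsa = TruncatedSVD(n_components=2)          # SVD keeps the top latent concepts
docs_lsa = lsa.fit_transform(X)             # documents in the concept space

# documents about similar topics end up close in the reduced space (cosine near 1)
print(cosine_similarity(docs_lsa).round(2))
```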
# Topic Modelling
- https://en.wikipedia.org/wiki/Topic_model
- Latent Dirichlet Allocation
- A common topic modeling technique, LDA is based on the premise that each document or piece of text is a mixture of a small number of topics and that each word in a document is attributable to one of the topics.
- http://engineering.flipboard.com/2017/02/storyclustering
- http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html
- http://ahmedbesbes.com/how-to-mine-newsfeed-data-and-extract-interactive-insights-in-python.html
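- A short sketch of LDA with the scikit-learn estimator linked above; the toy corpus and the number of topics are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the football match",
    "the striker scored two goals",
    "the election results were announced",
    "voters went to the polls today",
]
counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)                      # LDA works on raw word counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)                   # per-document topic mixture

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:]]  # top words for each topic
    print(f"topic {k}: {top}")
print(doc_topics.round(2))                          # each row sums to 1
```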
# Neural semantic parsing
# Explicit semantic analysis
- https://en.wikipedia.org/wiki/Explicit_semantic_analysis
- In NLP and information retrieval, explicit semantic analysis (ESA) is a vectorial representation of text (individual words or entire documents) that uses a document corpus as a knowledge base. Specifically, in ESA, a word is represented as a column vector in the tf–idf matrix of the text corpus and a document (string of words) is represented as the centroid of the vectors representing its words. Typically, the text corpus is English Wikipedia, though other corpora including the Open Directory Project have been used.
- Used in Information Retrieval, Document Classification and Semantic Relatedness calculation (i.e. how similar in meaning two words or pieces of text are to each other), ESA is the process of understanding the meaning of a piece of text as a combination of the concepts found in that text.
- Corpus (plural: corpora): a usually large collection of documents that can be used to infer and validate linguistic rules, as well as to do statistical analysis and hypothesis testing.
# Sentiment analysis
- https://en.wikipedia.org/wiki/Sentiment_analysis
- The use of NLP techniques to extract subjective information from a piece of text, e.g. whether an author is being subjective or objective, or positive or negative (see the TextBlob sketch below).
- http://varianceexplained.org/r/trump-tweets/
- http://blog.aylien.com/sentiment-analysis-of-2-2-million-tweets-from-super-bowl-51/
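- A tiny sketch using TextBlob (listed under Code below), assuming TextBlob and its corpora are installed (python -m textblob.download_corpora); the example sentences are illustrative:

```python
from textblob import TextBlob

for text in ["I absolutely love this phone!", "The battery life is terrible."]:
    blob = TextBlob(text)
    # polarity: -1 (negative) .. 1 (positive)
    # subjectivity: 0 (objective) .. 1 (subjective)
    print(text, blob.sentiment.polarity, blob.sentiment.subjectivity)
```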
# Deep learning-based
- Modern Deep Learning Techniques Applied to Natural Language Processing
- https://github.com/brianspiering/awesome-dl4nlp
- Deep Learning in NLP
- https://softwaremill.com/deep-learning-for-nlp/
- http://blog.aylien.com/modeling-documents-generative-adversarial-networks/
- New AI classifier for indicating AI-written text
# CNN-based
- Convolutional Neural Network for Sentence Classification
- http://www.kdnuggets.com/2017/05/deep-learning-extract-knowledge-job-descriptions.html
- How to read: Character level deep learning
- #PAPER Connectionist Temporal Classification (Hannun 2017)
# RNN-based
- RNN for NLP
- http://www.abigailsee.com/2017/04/16/taming-rnns-for-better-summarization.html
- #PAPER RRA: Recurrent Residual Attention for Sequence Learning (Wang 2017)
# Seq2seq
- #PAPER Sequence to Sequence Learning with Neural Networks (Sutskever 2014)
- Sequence-to-sequence models are deep learning models that have achieved a lot of success in tasks like machine translation, text summarization, and image captioning. Google Translate started using such a model in production in late 2016. These models are explained in the two pioneering papers (Sutskever et al., 2014, Cho et al., 2014).
- A sequence-to-sequence model is a model that takes a sequence of items (words, letters, features of an image, etc.) and outputs another sequence of items.
- Under the hood, the model is composed of an encoder and a decoder. The encoder processes each item in the input sequence and compiles the information it captures into a vector (called the context). After processing the entire input sequence, the encoder sends the context over to the decoder, which begins producing the output sequence item by item.
- The context is a vector (an array of numbers, basically) in the case of machine translation. The encoder and decoder tend to both be recurrent neural networks.
- https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
- Encoder-Decoder LSTMs for sequence-to-sequence prediction
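- A minimal sketch of the encoder-decoder idea described above, written here in PyTorch with GRUs as an assumption (not the exact architecture of any of the cited papers); the toy data and dimensions are illustrative:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):
        # src: (batch, src_len) token ids; the final hidden state is the "context"
        _, hidden = self.rnn(self.embed(src))
        return hidden

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt, hidden):
        # tgt: (batch, tgt_len) teacher-forced inputs; hidden: context from encoder
        output, hidden = self.rnn(self.embed(tgt), hidden)
        return self.out(output), hidden

# toy usage with random token ids
enc, dec = Encoder(100, 32, 64), Decoder(100, 32, 64)
src = torch.randint(0, 100, (2, 7))   # batch of 2 source sequences
tgt = torch.randint(0, 100, (2, 5))   # batch of 2 target sequences
logits, _ = dec(tgt, enc(src))        # (2, 5, 100) vocabulary scores per step
```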
# Google Neural Machine Translation
- https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation
- #PAPER Zero-shot translation
- Google Neural Machine Translation (GNMT) is a neural machine translation (NMT) system developed by Google and introduced in November 2016, that uses an artificial neural network to increase fluency and accuracy in Google Translate.
- https://research.googleblog.com/2016/11/zero-shot-translation-with-googles.html
# Transformer-based
See AI/Deep learning/Transformers
# Books
- #BOOK Natural Language Processing with Python (Bird, 2013 OREILLY)
- #BOOK Mastering NLP with Python (Chopra, 2016 PACKT)
- #BOOK An Introduction to Information Retrieval (Manning 2009, CAMBRIDGE)
- #BOOK Text mining with R (Silge, 2020 OREILLY)
# Courses
- #COURSE Neural networks for NLP (Carnegie Mellon)
- #COURSE NLP (Stanford 15)
- #COURSE NLP with Deep Learning (Stanford 16,17)
- #COURSE Natural Language Understanding (Stanford 16)
- #COURSE Deep Learning for NLP (Oxford/Deepmind 17)
- #COURSE YSDA Natural Language Processing course (Yandex)
# Talks
- #TALK Introduction to Natural Language Processing - Cambridge Data Science Bootcamp
- #TALK Rob Romijnders | Using deep learning in natural language processing (PyData)
- #TALK Jeff Abrahamson - WTF am I doing? An introduction to NLP and ANN’s
- #TALK Natural Language Processing with PySpark
- #TALK Feeding Word2vec with tens of billions of items, what could possibly go wrong? (Simon Dollé)
- #TALK Deep Learning for Natural Language Processing (2015)
# Code
- #CODE Arabica - A Python package for exploratory analysis of text data
- #CODE Rubrix - Open-source framework for data-centric NLP. Data annotation and monitoring for enterprise NLP
- #CODE Beir - Heterogeneous Benchmark for Information Retrieval
- #CODE FastText - Library for efficient text classification and representation learning
- #CODE Fairseq - Facebook AI Research Sequence-to-Sequence Toolkit written in Python
- Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks
- #CODE OpenNMT-tf - OpenNMT-tf is a general purpose sequence learning toolkit using TensorFlow 2
- #CODE OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text
- #CODE Textgen RNN
- #CODE Stanford CoreNLP
- #CODE NLTK - NLTK is a leading platform for building Python programs to work with human language data
- #CODE Textblob - TextBlob is a Python library for processing textual data
- #CODE Spacy - Industrial-strength NLP
- #TALK Matthew Honnibal - Designing spaCy: Industrial-strength NLP
- #TALK Patrick Harrison | Modern NLP in Python (SpaCy and gensim for recommendation-reviews analysis)
- https://spacy.io/docs/usage/tutorials
- https://nicschrading.com/project/Intro-to-NLP-with-spaCy/
- http://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/
- #CODE ParlAI - A unified platform for sharing, training and evaluating dialogue models across many tasks
- #CODE Gensim - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora
- #CODE Spark-NLP - State-of-the-art Natural Language Processing library built on top of Apache Spark
# LLMs
- #CODE LangChain - Building applications with LLMs through composability
# Speech
- #CODE PaddleSpeech - Toolkit for tasks in speech and audio, with state-of-the-art and influential models
- #CODE Whisper - Robust Speech Recognition via Large-Scale Weak Supervision
# Web scraping and cleaning
- #CODE Requests (For fetching HTML/XML from web pages)
- #CODE BeautifulSoup (web scraping data parsing)
- #CODE LXML (web scraping data parsing)
- #CODE Dryscrape (web scraping with JavaScript)
- #CODE Selenium (web scraping with JavaScript)
- #CODE Scrapy (web scraping framework)
- https://doc.scrapy.org/en/latest/intro/overview.html
- Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.
- https://medium.com/@kaismh/extracting-data-from-websites-using-scrapy-e1e1e357651a#.j9hrs2scn
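- A minimal spider sketch along the lines of the Scrapy overview linked above; the target site (Scrapy's demo quotes site) and the extracted fields are illustrative:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # yield one structured item (dict) per quote found on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
```

- Run it without a full project via the Scrapy CLI, e.g. `scrapy runspider quotes_spider.py -o quotes.json`.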
- #CODE python-ftfy: fixes text for you
- #CODE Arrow - working with dates and times
- #CODE Beautifier - clean and prettify URLs and email addresses