nlp 1
Because language is so situated, when developing computational models for language processing from a corpus, it’s important to consider who produced the language, in what context, for what purpose.
HOW CAN A USER OF A DATASET KNOW ALL THESE DETAILS?
Because tokenization is run before any other language processing, it needs to be very fast.
If our training corpus contains, say, the words low, new, newer, but not lower, then if the word lower appears in our test corpus, our system will not know what to do with it.
To deal with this unknown word problem…use subwords.
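Sketch of the subword idea, to make it concrete. The notes don't say which subword method is meant; byte-pair encoding (BPE) is assumed here as one common choice. The names `learn_bpe`, `merge_pair`, `segment`, the `_` end-of-word marker, and the toy counts are all made up for illustration.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_counts, num_merges):
    """Learn up to `num_merges` merges from a {word: count} training vocabulary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("_",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): count for symbols, count in vocab.items()}
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word by replaying the learned merges in order."""
    symbols = tuple(word) + ("_",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy training counts (invented): "lower" never appears in training.
merges = learn_bpe({"low": 5, "new": 2, "newer": 6}, num_merges=6)
print(segment("lower", merges))
```

The unseen word still gets segmented into pieces learned from low, new, and newer, so there is no unknown-word failure; in the worst case it falls back to single characters.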
Edit distance gives us a way to quantify these intuitions about string similarity.
The space of all possible edits is enormous, so we can’t search naively.
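Instead of searching the space of edits, the usual approach is dynamic programming over prefixes of the two strings. A minimal sketch, assuming insertions and deletions cost 1 and substitutions cost `sub_cost` (default 1; the notes don't fix the costs, so this is an assumption). `min_edit_distance` is an illustrative name.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance between two strings via a dynamic-programming table."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + step,  # match or substitution
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))              # 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # 8
```

The table has (n+1) x (m+1) cells and each is filled once, so the cost is O(n*m) rather than exponential in the number of possible edit sequences.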
Alignment: Knowing the minimum edit distance is useful for algorithms like finding potential spelling error corrections. But the edit distance algorithm is important in another way; with a small change, it can also provide the minimum cost alignment between two strings.
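One way to read the "small change": fill the same distance table, then walk back from the final cell to the start, recording which operation explains each cell. A hedged sketch, again assuming insert/delete cost 1 and a configurable `sub_cost`; `min_cost_alignment` and the `(source_char, target_char)` pair format (with `None` for a gap) are illustrative choices, not from the notes.

```python
def min_cost_alignment(source, target, sub_cost=1):
    """Return (distance, alignment); alignment is a list of (src_char, tgt_char) pairs."""
    n, m = len(source), len(target)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
                dist[i - 1][j - 1] + step,
            )

    # Backtrace: from the bottom-right cell, follow any move that explains the cell's value.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = 0 if source[i - 1] == target[j - 1] else sub_cost
            if dist[i][j] == dist[i - 1][j - 1] + step:
                pairs.append((source[i - 1], target[j - 1]))  # match or substitution
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((source[i - 1], None))  # deletion: source char aligned to a gap
            i -= 1
        else:
            pairs.append((None, target[j - 1]))  # insertion: gap aligned to target char
            j -= 1
    pairs.reverse()
    return dist[n][m], pairs


cost, alignment = min_cost_alignment("intention", "execution")
for src, tgt in alignment:
    print(src or "*", tgt or "*")
```

Several alignments can share the minimum cost; this sketch returns one of them, preferring diagonal moves when there is a tie.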