๐–๐ก๐š๐ญ ๐ข๐ฌ ๐š๐ง ๐ˆ๐ง๐ฏ๐ž๐ซ๐ญ๐ž๐ ๐ˆ๐ง๐๐ž๐ฑ? ๐‡๐จ๐ฐ ๐๐จ๐ž๐ฌ ๐š ๐’๐ž๐š๐ซ๐œ๐ก ๐„๐ง๐ ๐ข๐ง๐ž ๐ˆ๐ง๐๐ž๐ฑ ๐ƒ๐จ๐œ๐ฎ๐ฆ๐ž๐ง๐ญ๐ฌ? ๐€๐ง๐ ๐ฐ๐ก๐ข๐œ๐ก ๐ญ๐ž๐œ๐ก๐ง๐ข๐ช๐ฎ๐ž ๐ฐ๐š๐ฌ ๐ฎ๐ฌ๐ž๐ ๐›๐ž๐Ÿ๐จ๐ซ๐ž ๐ˆ๐ง๐๐ž๐ฑ๐ข๐ง๐  ๐Ÿ๐จ๐ซ ๐ˆ๐ง๐Ÿ๐จ๐ซ๐ฆ๐š๐ญ๐ข๐จ๐ง ๐‘๐ž๐ญ๐ซ๐ข๐ž๐ฏ๐š๐ฅ?

Before smart search engines, people used a command called Grep.
It searched every word in every file. Line by line. Word by word for information retrieval. It was slow.

As data grew, Grep couldnโ€™t keep up. We needed a faster way.
Thatโ€™s where the Inverted Index came in.

๐–๐ก๐š๐ญ ๐ข๐ฌ ๐š๐ง ๐ˆ๐ง๐ฏ๐ž๐ซ๐ญ๐ž๐ ๐ˆ๐ง๐๐ž๐ฑ?
An inverted index is a list that tells you which words appear in which documents.
Instead of reading every document to find a word, you start with the word and go directly to the documents that contain it.

๐ˆ๐ญ ๐ก๐š๐ฌ ๐ญ๐ฐ๐จ ๐ฉ๐š๐ซ๐ญ๐ฌ:
๐“๐ก๐ž ๐ƒ๐ข๐œ๐ญ๐ข๐จ๐ง๐š๐ซ๐ฒ โ€“ a list of all the words
๐“๐ก๐ž ๐๐จ๐ฌ๐ญ๐ข๐ง๐ ๐ฌ ๐‹๐ข๐ฌ๐ญ โ€“ where each word appears
Letโ€™s see how itโ€™s created.

๐’๐ญ๐ž๐ฉ ๐Ÿ: ๐‚๐จ๐ฅ๐ฅ๐ž๐œ๐ญ ๐ƒ๐จ๐œ๐ฎ๐ฆ๐ž๐ง๐ญ๐ฌ
We start with some text. For example:
๐ƒ๐จ๐œ ๐Ÿ: I did enact Julius Caesar. Brutus killed me.
๐ƒ๐จ๐œ ๐Ÿ: So let it be with Caesar. The noble Brutus told you Caesar was ambitious.

๐’๐ญ๐ž๐ฉ ๐Ÿ: ๐“๐จ๐ค๐ž๐ง๐ข๐ณ๐š๐ญ๐ข๐จ๐ง
We break the text into words (called tokens):
๐ƒ๐จ๐œ ๐Ÿ ๐“๐จ๐ค๐ž๐ง๐ฌ: I, did, enact, Julius, Caesar, Brutus, killed, me
๐ƒ๐จ๐œ ๐Ÿ ๐“๐จ๐ค๐ž๐ง๐ฌ: So, let, it, be, with, Caesar, noble, Brutus, told, you, ambitious

๐’๐ญ๐ž๐ฉ ๐Ÿ‘: ๐๐จ๐ซ๐ฆ๐š๐ฅ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง
We clean the words:
โ†’ Convert to lowercase
โ†’ Remove punctuation
โ†’ Sometimes, simplify words (like โ€œkillingโ€ โ†’ โ€œkillโ€)
Now we have:
brutus, caesar, kill, noble, tell, ambitious...

๐’๐ญ๐ž๐ฉ ๐Ÿ’: ๐€๐ฌ๐ฌ๐ข๐ ๐ง ๐ƒ๐จ๐œ๐ฎ๐ฆ๐ž๐ง๐ญ ๐ˆ๐ƒ๐ฌ
We give each document a number:
Doc 1 โ†’ 1
Doc 2 โ†’ 2

๐’๐ญ๐ž๐ฉ ๐Ÿ“: ๐Œ๐š๐ค๐ž ๐–๐จ๐ซ๐โ€“๐ƒ๐จ๐œ ๐๐š๐ข๐ซ๐ฌ
We link each word to its document:
(โ€œbrutusโ€, 1)
(โ€œcaesarโ€, 1)
(โ€œkilledโ€, 1)
(โ€œbrutusโ€, 2)
(โ€œcaesarโ€, 2)
(โ€œambitiousโ€, 2)

๐’๐ญ๐ž๐ฉ ๐Ÿ”: ๐’๐จ๐ซ๐ญ ๐ญ๐ก๐ž ๐–๐จ๐ซ๐๐ฌ
Now we sort them by word (Alphabetically):
(โ€œambitiousโ€, 2)
(โ€œbrutusโ€, 1)
(โ€œbrutusโ€, 2)
(โ€œcaesarโ€, 1)
(โ€œcaesarโ€, 2)
โ€ฆ
๐’๐ญ๐ž๐ฉ ๐Ÿ•: ๐†๐ซ๐จ๐ฎ๐ฉ ๐›๐ฒ ๐–๐จ๐ซ๐
We group the documents for each word:
ambitious โ†’ [2]
brutus โ†’ [1, 2]
caesar โ†’ [1, 2]
This is the postings list.

๐’๐ญ๐ž๐ฉ ๐Ÿ–: ๐๐ฎ๐ข๐ฅ๐ ๐ญ๐ก๐ž ๐ˆ๐ง๐ฏ๐ž๐ซ๐ญ๐ž๐ ๐ˆ๐ง๐๐ž๐ฑ
We now have:
๐ƒ๐ข๐œ๐ญ๐ข๐จ๐ง๐š๐ซ๐ฒ:
ambitious, brutus, caesar...
๐๐จ๐ฌ๐ญ๐ข๐ง๐ ๐ฌ:
ambitious โ†’ [2]
brutus โ†’ [1, 2]
caesar โ†’ [1, 2]

๐‡๐จ๐ฐ ๐ˆ๐ญโ€™๐ฌ ๐’๐ญ๐จ๐ซ๐ž๐
The dictionary stays in memory (itโ€™s small).
The postings lists are stored on disk (theyโ€™re bigger).
We can use lists or arrays to save space and improve speed.

๐๐ž๐ง๐ž๐Ÿ๐ข๐ญ๐ฌ
โ†’ Inverted index makes search fast.
โ†’ We no longer check every document.
โ†’ We go straight to the ones that matter.

๐™๐™๐™ž๐™จ ๐™ž๐™จ ๐™๐™ค๐™ฌ ๐™จ๐™š๐™–๐™ง๐™˜๐™ ๐™š๐™ฃ๐™œ๐™ž๐™ฃ๐™š๐™จ, ๐™ก๐™ž๐™ ๐™š ๐™‚๐™ค๐™ค๐™œ๐™ก๐™š, ๐™ฌ๐™ค๐™ง๐™  ๐™—๐™š๐™๐™ž๐™ฃ๐™™ ๐™ฉ๐™๐™š ๐™จ๐™˜๐™š๐™ฃ๐™š๐™จ.

Want to understand SEO instead of imitating? ๐˜ฟ๐™ค ๐™๐™ค๐™ก๐™ก๐™ค๐™ฌ!

SemanticSEO InformationRetrieval Indexing


This post was originally shared by on Linkedin.