Showing posts with label Elastic Search. Show all posts

Thursday, August 20, 2020

Elastic Search - Types of Analyzer in the Elastic Search

Do you know how many types of analyzers are available in Elastic Search? Are you looking for details about the analyzers that come with Elastic Search? If so, you have reached the right place. In this article, we will discuss the types of analyzers that are most commonly used in Elastic Search.

What is an Analyzer?

An analyzer is a package of three lower-level building blocks: character filters, a tokenizer, and token filters. They are applied in that order to turn the input text into terms.
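To make the three building blocks concrete, here is a minimal conceptual sketch in plain Python. It is not how Elastic Search implements analysis; the function names (strip_html, whitespace_tokenizer, lowercase_filter, analyze) are made up for illustration and only show the order in which the stages run.

```python
import re
from typing import Callable, List


def strip_html(text: str) -> str:
    """Character filter: works on the raw text before it is split."""
    return re.sub(r"<[^>]+>", "", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: splits the text into individual tokens."""
    return text.split()


def lowercase_filter(tokens: List[str]) -> List[str]:
    """Token filter: changes, adds, or removes tokens after splitting."""
    return [token.lower() for token in tokens]


def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    # 1. zero or more character filters
    for char_filter in char_filters:
        text = char_filter(text)
    # 2. exactly one tokenizer
    tokens = tokenizer(text)
    # 3. zero or more token filters
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>Hello</b> Elastic World",
    [strip_html],
    whitespace_tokenizer,
    [lowercase_filter],
)
print(tokens)  # ['hello', 'elastic', 'world']
```

Each built-in analyzer described in this article can be thought of as a fixed combination of these three stages.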

Types of Analyzer

Here is a list of the analyzers that come with Elastic Search:

  • Standard Analyzer
  • Simple Analyzer
  • Whitespace Analyzer
  • Stop Analyzer
  • Keyword Analyzer
  • Pattern Analyzer
  • Language Analyzers
  • Fingerprint Analyzer

Understanding Analyzers

  • Standard Analyzer

The standard analyzer divides the text into terms on word boundaries. Punctuation is removed and upper case is converted to lowercase. It also supports removing stop words, although this is disabled by default.


Input: "This is a sample example, for STANDARD-Analyzer"

Output:[this, is, a, sample, example, for, standard, analyzer]
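If you want to check such outputs yourself, Elastic Search exposes an _analyze API that returns the tokens produced by a given analyzer. The sketch below only builds the JSON request body in Python; it does not contact a server, and how you send it (curl, Kibana Dev Tools, a client library) is left open.

```python
import json

# Request body for the _analyze API; send it as: POST /_analyze
analyze_request = {
    "analyzer": "standard",
    "text": "This is a sample example, for STANDARD-Analyzer",
}

body = json.dumps(analyze_request, indent=2)
print(body)
```

Replacing "standard" with another built-in analyzer name such as "simple", "whitespace", "stop", "keyword", "pattern", or "fingerprint" lets you compare the analyzers on the same input.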

  • Simple Analyzer

With the Simple Analyzer, the text is divided into separate terms whenever a non-letter character appears. A non-letter character can be a number, hyphen, apostrophe, space, etc. Upper case characters are converted to lowercase.

Input: "My dog's name is Rocky-Hunter"

Output:[my, dog, s, name, is, rocky, hunter]
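The behaviour of the Simple Analyzer can be approximated in a few lines of Python. This is a sketch for intuition only, assuming that "letter" means any Unicode letter; simple_analyze is a hypothetical helper, not an Elastic Search API.

```python
import re
from typing import List


def simple_analyze(text: str) -> List[str]:
    # [^\W\d_]+ matches runs of letters only: word characters minus
    # digits and the underscore. Everything else acts as a separator.
    return re.findall(r"[^\W\d_]+", text.lower())


print(simple_analyze("My dog's name is Rocky-Hunter"))
# ['my', 'dog', 's', 'name', 'is', 'rocky', 'hunter']
```

Note that digits are separators too, so an input such as "route66" would produce only the term "route".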

  • Whitespace Analyzer

The input phrase is divided into terms based on whitespace. It does not lowercase terms.

Input: "Technology-World has articles on ElasticSearch and Artificial-Intelligence etc."

Output:[Technology-World, has, articles, on, ElasticSearch, and, Artificial-Intelligence, etc.]

  • Stop Analyzer

The Stop Analyzer is a form of the Simple Analyzer: the text is divided into separate terms whenever a non-letter character (a number, hyphen, space, etc.) is encountered, and upper case characters are converted to lowercase. Additionally, it removes stop words. Assume that the stop word list includes the words 'the', 'is', and 'of'.

Input: "Gone with the wind is one of my favorite books."

Output:[gone, with, wind, one, my, favorite, books]
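The example above can be reproduced with a small Python sketch: split on non-letters, lowercase, then drop the stop words. The helper stop_analyze and the three-word stop list are assumptions taken from this example, not the list that Elastic Search ships with.

```python
import re
from typing import Iterable, List


def stop_analyze(text: str, stop_words: Iterable[str]) -> List[str]:
    stop = {word.lower() for word in stop_words}
    # Same splitting and lowercasing as the Simple Analyzer.
    terms = re.findall(r"[^\W\d_]+", text.lower())
    # Extra step: filter out the stop words.
    return [term for term in terms if term not in stop]


result = stop_analyze(
    "Gone with the wind is one of my favorite books.",
    ["the", "is", "of"],
)
print(result)
# ['gone', 'with', 'wind', 'one', 'my', 'favorite', 'books']
```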

  • Keyword Analyzer

The input phrase is NOT divided into terms; the output is a single token that is the same as the input phrase.

Input: "Mount Everest is one of the worlds natural wonders"

Output:[Mount Everest is one of the worlds natural wonders]

  • Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. The default regular expression is \W+, which matches all non-word characters. We need to remember that the regular expression matches the term separators in the input phrase, not the terms themselves. Upper case characters are converted to lowercase. It also supports removing stop words, although this is disabled by default.

Input: "My daughter's name is Rita and she is 7 years old"

Output:[my, daughter, s, name, is, rita, and, she, is, 7, years, old]
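Because the pattern describes the separators, the same idea maps directly onto a regular-expression split. The following Python sketch is an approximation: pattern_analyze is a hypothetical helper, and Python's regex engine is not identical to the Java one that Elastic Search uses.

```python
import re
from typing import List


def pattern_analyze(text: str, pattern: str = r"\W+", lowercase: bool = True) -> List[str]:
    if lowercase:
        text = text.lower()
    # The pattern matches what sits BETWEEN terms; empty strings left over
    # at the edges of the input are dropped.
    return [term for term in re.split(pattern, text) if term]


print(pattern_analyze("My daughter's name is Rita and she is 7 years old"))
# ['my', 'daughter', 's', 'name', 'is', 'rita', 'and', 'she', 'is', '7', 'years', 'old']

print(pattern_analyze("red,green;blue", pattern=r"[,;]"))
# ['red', 'green', 'blue']
```

Unlike the Simple Analyzer, the default pattern keeps digits, which is why 7 survives as a term.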

  • Language Analyzers

Elasticsearch provides language-specific analyzers for languages such as English, French, and Hindi.

Here is a sample keyword entry from the Hindi language analyzer; words listed under keywords are excluded from stemming.

e.g. "keywords": ["उदाहरण"]

  • Fingerprint Analyzer

The fingerprint analyzer is used for duplicate detection. The input phrase is converted to lowercase and extended characters are normalized (for example, á becomes a). The terms are sorted, duplicate words are removed, and the result is joined into a single token. It also supports stop words.

Input: "á is a Spanish accents character"

Output:[a accents character is spanish]
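The steps of the fingerprint analyzer can be sketched in Python as follows. This is a simplification under stated assumptions: it splits on whitespace only, folds accents with Unicode decomposition, and the helper names ascii_fold and fingerprint are made up for illustration.

```python
import unicodedata


def ascii_fold(token: str) -> str:
    # Decompose accented characters and drop the combining marks,
    # so that "á" becomes "a".
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fingerprint(text: str, separator: str = " ") -> str:
    terms = [ascii_fold(term) for term in text.lower().split()]
    # Remove duplicates, sort, and join everything into ONE token.
    return separator.join(sorted(set(terms)))


print(fingerprint("á is a Spanish accents character"))
# a accents character is spanish
```

Two inputs that differ only in word order, case, accents, or repeated words produce the same fingerprint, which is what makes it useful for duplicate detection.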

Learn more about Elastic Search here

Monday, August 10, 2020

Elastic Search Concepts - Cluster, Node, Index, Document, Shard and Replica

Are you looking for detailed information about the various concepts used in Elastic Search? Are you also interested in knowing what a Document, a Shard, and a Replica are in Elastic Search? If so, you have reached the right place. In this article, we will go through the important concepts that are most commonly used in Elastic Search.

A. Elastic Search Cluster

A cluster is a collection of nodes. It has a unique name; if we do not provide one, it defaults to elasticsearch. We can create clusters specific to each environment, for example a development cluster, a QA cluster, and a production cluster. A cluster can have more than one node, but it is totally okay to have just one node in a cluster. The cluster provides indexing and search capabilities across all its nodes, i.e. when we search or index data, we do not have to worry about which node the data is indexed on or searched from.

B. The node in Elastic Search

A node is a single server that is part of the cluster and stores data. Like the cluster, each node has a unique name. A node takes part in the cluster's search and indexing capabilities. Unlike index names, node names are not required to be in lower case. We can add as many nodes as we want; there is no fixed limit. If a cluster has more than one node, then each node contains a subset of the data.

C. Index

So, what is an index? As we know, nodes contain indices, and an index is a collection of similar documents. For example, the documents can be customer information or product information. In short, we create an index for each type of document. The index name must be in lowercase. It is used for indexing, searching, updating, and deleting documents within the index. We can create any number of indices in a cluster.

D. Category or Type in Elastic Search

Inside each index, we have a type, which is nothing but a category. We can create multiple categories such as Customer, Product, Vendor, Supplier, Broker, etc. Assume that our index name is customer; then we can create categories such as Individual, Organization, Self Proprietor, etc. Under each category, we can have documents. A type has a name and an associated mapping, and we create a separate mapping for each type in an index. Here is an additional note about types: Elastic Search is built on Lucene, and Lucene has no concept of type or category. The category is stored as _type in the document metadata, and while searching for documents of a particular type, Elastic Search applies a filter on this field. Note that types were deprecated in Elasticsearch 6.x and removed in later versions.

E. Mapping in Elastic Search

The mapping describes fields and their types, e.g. data types such as string (text or keyword), integer, date, geo-point, etc. It also contains details about how each field will be indexed and stored. In many cases we don't have to create the mapping explicitly; Elastic Search infers it from the documents we index, which is called dynamic mapping.
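As an illustration, an explicit mapping could look like the request body built below. This is a sketch that assumes the typeless mapping format of Elasticsearch 7 and later; the field names (name, age, signup_date, location) are hypothetical, while text, integer, date, and geo_point are real field types.

```python
import json

# Hypothetical fields for a "customer" index, for illustration only.
mapping_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "age": {"type": "integer"},
            "signup_date": {"type": "date"},
            "location": {"type": "geo_point"},
        }
    }
}

print(json.dumps(mapping_body, indent=2))
```

A body like this is supplied when the index is created. Fields that are not listed are still accepted and get their types through dynamic mapping.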

F. Document

The document is the base unit of information in Elastic Search. A document contains fields as key/value pairs. A value can be of any data type defined in the mapping, such as string, date, or integer. A document could be a single Customer, Product, Vendor, etc. The document is in JSON format and physically resides in the index which we create. We can store as many documents as we need in a given index.
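For example, a single customer document could be represented as shown below. The field names and values are made up for illustration; the only point is that a document is a JSON object made of key/value pairs.

```python
import json

# Hypothetical customer document; field names are illustrative only.
customer_doc = {
    "name": "Jane Doe",
    "age": 34,
    "signup_date": "2020-08-10",
}

payload = json.dumps(customer_doc)   # what would be sent to Elastic Search
restored = json.loads(payload)       # what a client would read back
print(payload)
```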

G. Shard

A shard is a portion of an index. We can divide an index into multiple pieces, i.e. shards, which helps when we have a large data set to store on disk. If a single physical disk does not have enough capacity, we can divide the index into multiple shards and spread them across nodes. Each shard is a fully functional index in its own right. Before version 7.0, Elasticsearch created five primary shards per index by default (newer versions default to one); we can configure as many shards as we need when creating the index. In short, shards are created to achieve scalability.
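How does a document end up in a particular shard? Conceptually, Elastic Search hashes a routing value (the document id by default) and takes the result modulo the number of primary shards. The sketch below shows that idea in Python; zlib.crc32 is only a stand-in for the hash function Elastic Search really uses, and pick_shard is a hypothetical helper.

```python
import zlib


def pick_shard(routing_value: str, number_of_primary_shards: int) -> int:
    # Stand-in hash: Elastic Search uses its own hash function, not CRC32.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


for doc_id in ["1", "2", "3", "42"]:
    print("document", doc_id, "-> shard", pick_shard(doc_id, 5))
```

Because the shard number depends on the number of primary shards, changing that number would send existing ids to different shards. This is why the number of primary shards is chosen when the index is created.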

H. Replica

A replica is a copy of a primary shard. A replica is never allocated on the same node as its primary shard, so that when one node goes down, another node can take over for recovery. By default, each primary shard gets one replica when an index is created. Assume that we have two nodes and an index with five primary shards: in that case, we will have five primary shards and five replica shards across the two nodes. So replicas help to achieve high availability. An important thing to note about replicas is that search queries can be executed on all replicas in parallel.
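The two-node example can be sketched with a toy allocator in Python. This is not the real Elastic Search allocation algorithm, which also considers things like disk usage and balancing; the function allocate and the node names are assumptions, and the sketch only demonstrates the rule that a replica never shares a node with its primary.

```python
from typing import Dict, List, Tuple


def allocate(nodes: List[str], primaries: int, replicas: int) -> Dict[str, List[Tuple[int, str]]]:
    if replicas >= len(nodes):
        raise ValueError("each copy of a shard needs its own node")
    layout: Dict[str, List[Tuple[int, str]]] = {node: [] for node in nodes}
    for shard in range(primaries):
        # Copy 0 is the primary, copies 1..replicas are its replicas.
        # Shifting by the copy number puts every copy on a different node.
        for copy in range(replicas + 1):
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else "replica"
            layout[node].append((shard, role))
    return layout


layout = allocate(["node-1", "node-2"], primaries=5, replicas=1)
for node, shards in layout.items():
    print(node, shards)
```

With two nodes, five primary shards, and one replica each, every node ends up holding a copy of all five shards, so either node can serve the whole index if the other one goes down.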
