Wednesday, 11 November 2020

Elasticsearch

Introduction
Elasticsearch is not the only tool for full-text search.
Other tools such as Couchbase, Splunk, and Apache Solr also offer this capability.

Keeping Elasticsearch and the Database in Sync
There are two basic approaches
1. Periodically scan the database and push the updates to Elasticsearch
2. If we use Hibernate, add the Hibernate Search annotations to the project
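
The first approach can be sketched roughly as follows. This is only a minimal sketch: the rows, the updated_at column, and the articles index name are hypothetical. It picks the rows changed since the last sync and turns them into a newline-delimited body for the _bulk endpoint, without actually sending anything.

```python
import json
from datetime import datetime


def changed_since(rows, last_sync):
    """Keep only the rows modified after the previous sync run."""
    return [row for row in rows if row["updated_at"] > last_sync]


def build_bulk_body(rows, index_name):
    """Turn database rows into an NDJSON body for the _bulk endpoint."""
    lines = []
    for row in rows:
        # Action line: index (create or overwrite) the document under the row id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(row["id"])}}))
        # Source line: the document itself, without the bookkeeping column.
        lines.append(json.dumps({k: v for k, v in row.items() if k != "updated_at"}))
    # The bulk format expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical rows, standing in for the result of a database query.
rows = [
    {"id": 1, "title": "Hibernate is fun", "updated_at": datetime(2020, 11, 10)},
    {"id": 2, "title": "Spring Data", "updated_at": datetime(2020, 11, 11)},
]
last_sync = datetime(2020, 11, 10, 12, 0)

body = build_bulk_body(changed_since(rows, last_sync), "articles")
print(body)
```

In a real job the body would be POSTed to the cluster and last_sync would be stored after a successful run.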

Another approach, if we use Spring Data Elasticsearch, is to repeat each JPA operation through Spring Data Elasticsearch as well. The explanation is as follows
As far as I understand, Spring-Data-Elasticsearch is focused on accessing Elasticsearch and has no JPA integration whatsoever. That is to say, you can use Spring-Data-JPA, and you can use Spring-Data-Elasticsearch, but they won't communicate with each other. You will have two separate models, which you will update and query separately.
Elasticsearch and Data
The explanation is as follows. In other words, because of shards we cannot see all of the data from a single node.
As a distributed database, your data is partitioned into “shards” which are then allocated to one or more servers.

Because of this sharding, a read or write request to an Elasticsearch cluster requires coordinating between multiple nodes as there is no “global view” of your data on a single server. While this makes Elasticsearch highly scalable, it also makes it much more complex to set up and tune than other popular databases like MongoDB or PostgreSQL, which can run on a single server.

Elastic Stack - Log Management
Moved to the Elastic Stack post

Elastic APM
The explanation is as follows
Elastic APM is an application performance monitoring tool that is built on top of Elastic Search and Kibana, the E and the K of the ELK stack. Implementing Elastic APM is super easy — all you need to do is add the agent jar to your service and set some basic properties. This is done once per service, and that enables distributed tracing for all requests of that service.
Maven
We do it like this
<dependency>
  <groupId>co.elastic.apm</groupId>
  <artifactId>apm-agent-attach</artifactId>
  <!-- version should be compatible with your Elastic instance -->
  <version>${elastic-version}</version>
</dependency>
The explanation is as follows
Properties can be set in one of the following ways:
1. elasticapm.properties in classpath
2. Java System properties
3. Environment variables
Docker
Moved to the Docker and Elasticsearch post

Docker Compose
Moved to the Docker Compose and Elasticsearch post

Cluster Structure
The explanation below distinguishes three kinds of nodes in a cluster: Master Node, Master-Eligible Node, and Data Node. The explanation is as follows
Data nodes hold data and perform data-related operations such as CRUD, search, and aggregations.

A master node in charge of cluster-wide management and configuration actions such as add/remove nodes, create/update/delete index, … A cluster has only one master node at a time. If a master node fails, Master-Eligible Nodes in the cluster elect a new master node from the master-eligible node pool.

Master-eligible node which can be voted to become a new master node when disaster happens with the master node.

In a cluster with only one node, it’s both master node and data node
Index
Moved to the Elasticsearch Index post

Shard
The explanation is as follows.
Simply, the shard is a single instance of Lucene. It stores data and can perform any data-related operations. A shard can be a primary shard or replica shard. Any document in an index belongs to a single primary shard. A replica shard is simply just a copy of a primary shard. It provides redundant copies and helps protect data when problems happen with primary shards. Replica shard also improves read performance, because it can serve read requests like primary shard but you only can perform write requests on the primary shard.

When creating an index, you should specify the number of primary shards. This number is fixed after the index is created, but you can change the number of replica shards by changing index settings.
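
A rough sketch of why the primary shard count is fixed: a document is routed to a shard by hashing its routing value (by default its id) and taking the result modulo the number of primary shards. The code below is a simplified model under that assumption. zlib.crc32 is only a stand-in hash, not the function Elasticsearch uses, and pick_shard is a hypothetical helper name.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Simplified routing: hash the routing value, then take it modulo the shard count."""
    # crc32 is a stand-in; Elasticsearch uses a different hash function internally.
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_primary_shards


doc_ids = ["1", "2", "3", "4", "5", "6"]

# The same ids routed with 3 and then with 4 primary shards.
with_3_shards = {doc_id: pick_shard(doc_id, 3) for doc_id in doc_ids}
with_4_shards = {doc_id: pick_shard(doc_id, 4) for doc_id in doc_ids}

# Documents whose target shard changes when the shard count changes.
moved = [doc_id for doc_id in doc_ids if with_3_shards[doc_id] != with_4_shards[doc_id]]

print(with_3_shards)
print(with_4_shards)
print(moved)
```

With a different shard count, the same id typically lands on a different shard, so existing documents could no longer be found where they were written. Replicas do not take part in this calculation, which is why their number can be changed freely.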
Field Data Types
Every field must have a type. Field types are grouped under the following headings
Common Types
Object and Relational Types
Structured data types
Aggregate data types
Text search types
Document ranking types
Spatial data types
Other types
Arrays
Multi-fields
Text vs Keyword
Among the field types, it is important to know the difference between "text" under the Text Search Types heading and "keyword" under the Common Types heading. The explanation is as follows
A String field can be either mapped to the text or the keyword type of Elasticsearch.

The primary difference between text and a keyword is that a text field will be tokenized while a keyword cannot.

We can use the keyword type when we want to perform filtering or sorting operations on the field.

For instance, let’s assume that we have a String field called body, and let’s say it has the value ‘Hibernate is fun’.

If we choose to treat body as text then we will be able to tokenize it [‘Hibernate’, ‘is’, ‘fun’] and we will be able to perform queries like body: Hibernate.

If we make it a keyword type, a match will only be found if we pass the complete text body: Hibernate is fun (wildcard will work, though: body: Hibernate*).
The explanation is as follows
If the field type is Text, Elasticsearch pre-processes raw data with an Analyzer before saving processed data to an Inverted Index (An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.)
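
The inverted index idea can be sketched in a few lines. This is a toy model, assuming a very simple analyzer (lowercase and split on whitespace); the real analysis chain is richer. The documents and function names are made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Toy analyzer: lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_inverted_index(docs):
    """Map every unique token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


# Hypothetical documents: id -> value of a text field.
docs = {1: "Hibernate is fun", 2: "Elasticsearch is fast"}
index = build_inverted_index(docs)

print(sorted(index["is"]))         # ids of documents containing "is"
print(sorted(index["hibernate"]))  # ids of documents containing "hibernate"
```

A keyword field would instead store the whole value "Hibernate is fun" as a single entry, which is why only the complete text matches it.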
Analyzers ve Normalizers
The explanation is as follows. In other words, an Analyzer is used for a text field, and a Normalizer is used for a keyword field.
Analyzers and normalizers are text analysis operations that are performed on text and keyword respectively, before indexing them and searching for them.

When an analyzer is applied on text, it first tokenizes the text and then applies one or more filters such as a lowercase filter (which converts all the text to lowercase) or a stop word filter (which removes common English stop words such as ‘is’, ‘an’, ‘the’ etc).

Normalizers are similar to analyzers with the difference that normalizers don’t apply a tokenizer.

On a given field we can either apply an analyzer or a normalizer.

To summarize:

Text | Keyword
Is tokenized | Cannot be tokenized
Is analyzed | Can be normalized
Can perform term-based search | Can only match exact text
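
The difference between the two can be sketched as follows. This is a toy model, not the real analysis chain: the analyzer here tokenizes on whitespace, lowercases, and drops a small hypothetical stop word list, while the normalizer only lowercases and keeps the value in one piece.

```python
# Small illustrative stop word list, taken from the examples in the quote above.
STOP_WORDS = {"is", "an", "the"}


def analyze(text):
    """Toy analyzer: tokenize, lowercase, then remove stop words."""
    tokens = [token.lower() for token in text.split()]
    return [token for token in tokens if token not in STOP_WORDS]


def normalize(text):
    """Toy normalizer: lowercase only, no tokenization."""
    return text.lower()


analyzed = analyze("Hibernate is fun")      # what a text field would index
normalized = normalize("Hibernate is fun")  # what a keyword field would index

print(analyzed)
print(normalized)
```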

URI API
Moved to the Elasticsearch URI API post

The Request Body Search
These are queries sent using REST + JSON

Non-Analyzed Queries - Term Level Queries
The explanation is as follows
exists query
Returns documents that contain any indexed value for a field.

fuzzy query
Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.

ids query
Returns documents based on their document IDs.

prefix query
Returns documents that contain a specific prefix in a provided field.

range query
Returns documents that contain terms within a provided range.

regexp query
Returns documents that contain terms matching a regular expression.

term query
Returns documents that contain an exact term in a provided field.

terms query
Returns documents that contain one or more exact terms in a provided field.

terms_set query
Returns documents that contain a minimum number of exact terms in a provided field. You can define the minimum number of matching terms using a field or script.

type query
Returns documents of the specified type.

wildcard query
Returns documents that contain terms matching a wildcard pattern.
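
A few of the term-level queries listed above could look like this as request bodies. This is only a sketch: the field names (age, user.id) and the document ids are hypothetical, and the bodies are built as plain Python dictionaries and printed as JSON rather than sent to a cluster.

```python
import json

# Field names and ids below are hypothetical.
# range query: documents whose age is between 30 and 40 (inclusive).
range_query = {"query": {"range": {"age": {"gte": 30, "lte": 40}}}}

# prefix query: documents whose user.id starts with "ki".
prefix_query = {"query": {"prefix": {"user.id": {"value": "ki"}}}}

# ids query: documents with the given document ids.
ids_query = {"query": {"ids": {"values": ["1", "4", "100"]}}}

for body in (range_query, prefix_query, ids_query):
    print(json.dumps(body))
```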
1. Exists Query
The explanation is as follows
Due to the fact that Elasticsearch is schemaless (or has no strict schema limitation), it is a fairly common situation when different documents have different fields. As a result, it is often useful to know whether a document has a certain field or not.
Example
We do it like this
GET /_search
{
  "query" : {
    "exists" : {
      "field": "<your_field_name>"
    }
  }
}
2. Fuzzy Query
The explanation is as follows. It can be used when the input may contain spelling mistakes. The explanation also covers how it differs from the wildcard query.
Fuzzy search gives relevant results even if you have some typos in your query. It gives end-users some flexibility in terms of searching by allowing some degree of error. The threshold of the error to be allowed can be decided by us.

For instance, here we have set edit distance to 2 (default is also 2 by the way) which means Elasticsearch will match all the words with a maximum of 2 differences to the input. e.g., ‘jab’ will match ‘jane’.

While fuzzy queries allow us to search even when we have misspelled words in our query, wildcard queries allow us to perform pattern-based searches. For instance, a search query with ‘s?ring*’ will match ‘spring’, ‘string’, ‘strings’, etc.

Here ‘*’ indicates zero or more characters and ‘?’ indicates a single character.
The explanation is as follows
Fuzzy searching uses the Damerau-Levenshtein Distance to match terms that are similar in spelling. This is great when your data set has misspelled words.

Use the tilde (~) to find similar terms:

  blow~
This will return results like “blew,” “brow,” and “glow.”

Use the tilde (~) along with a number to specify how big the distance between words can be:

  john~2
This will match, among other things: “jean,” “johns,” “jhon,” and “horn”
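
The distances mentioned above can be checked with a small sketch. The function below computes the restricted Damerau-Levenshtein distance (insert, delete, substitute, and swap of two adjacent characters); it is an illustration of the idea, not the code Elasticsearch runs. The last lines build a fuzzy query body with a hypothetical name field.

```python
import json


def edit_distance(a, b):
    """Restricted Damerau-Levenshtein distance between two strings."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


for word in ("jean", "johns", "jhon", "horn"):
    print(word, edit_distance("john", word))

# A fuzzy query body allowing up to 2 edits; the field name is hypothetical.
fuzzy_query = {"query": {"fuzzy": {"name": {"value": "john", "fuzziness": 2}}}}
print(json.dumps(fuzzy_query))
```

All four words are within distance 2 of "john", which is why john~2 matches them.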
3. Term Query
Moved to the term query post

4. Terms Query
Moved to the terms query post

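5. wildcard Query - Runs a Wildcard Query Against a Single Field
Example
We do it like this. An older Elasticsearch version is used here, which is why the query is wrapped in a filtered query with a filter (the filtered query was removed in Elasticsearch 5.0).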
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "bool": {
          "should": [
            {"query": {"wildcard": {"user.name": {"value": "*mar*"}}}},
            {"query": {"wildcard": {"user.surname": {"value": "*mar*"}}}}
          ]
        }
      }
    }
  }
}
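
On newer Elasticsearch versions, where the filtered query no longer exists, roughly the same search can be expressed with a bool query. The sketch below is an assumed rewrite, not the original author's query: the wildcard clauses sit in the should list of a bool query that is itself placed in filter context.

```python
import json

# Either field may contain "mar"; the field names come from the example above.
wildcard_clauses = [
    {"wildcard": {"user.name": {"value": "*mar*"}}},
    {"wildcard": {"user.surname": {"value": "*mar*"}}},
]

# The outer bool runs the inner bool in filter context, so no scores are computed.
# A bool query with only should clauses requires at least one of them to match.
modern_query = {
    "query": {
        "bool": {
            "filter": [
                {"bool": {"should": wildcard_clauses}}
            ]
        }
    }
}

print(json.dumps(modern_query, indent=2))
```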
Analyzed Queries - Full Text Queries
The explanation is as follows
intervals query
A full text query that allows fine-grained control of the ordering and proximity of matching terms.

match query
The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.

match_bool_prefix query
Creates a bool query that matches each term as a term query, except for the last term, which is matched as a prefix query

match_phrase query
Like the match query but used for matching exact phrases or word proximity matches.

match_phrase_prefix query
Like the match_phrase query, but does a wildcard search on the final word.

multi_match query
The multi-field version of the match query.

common terms query
A more specialized query which gives more preference to uncommon words.

query_string query
Supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string. For expert users only.

simple_query_string query
A simpler, more robust version of the query_string syntax suitable for exposing directly to users.
match Query
Moved to the match query post.

match_phrase Query
The word "phrase" indicates that the terms must appear in the same order. In other words, it matches when all of the complete words are present in the same order. The explanation is as follows
match_phrase query will analyze the input if analyzers are defined for the queried field and find documents matching the following criteria:

- all the terms must appear in the field
- they must have the same order as the input value
Example
Suppose we have an index like this
{ "foo":"I just said hello world" }

{ "foo":"Hello world" }

{ "foo":"World Hello" }
Let the query be as follows. As a result we get only the first and second documents. The third document ("World Hello") is not included in the result, because its terms are in a different order.
{
  "query": {
    "match_phrase": {
      "foo": "Hello World"
    }
  }
}
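
The phrase matching rule can be simulated locally. This is a toy model, assuming an analyzer that lowercases and splits on whitespace: a document matches when the query terms appear next to each other and in the same order.

```python
def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_phrase(field_value, phrase):
    """True if all phrase terms appear consecutively and in order in the field."""
    field_tokens = tokens(field_value)
    phrase_tokens = tokens(phrase)
    n = len(phrase_tokens)
    return any(
        field_tokens[i:i + n] == phrase_tokens
        for i in range(len(field_tokens) - n + 1)
    )


docs = ["I just said hello world", "Hello world", "World Hello"]
results = [matches_phrase(doc, "Hello World") for doc in docs]
print(results)
```

The third document contains both terms, so a plain match query would find it, but the order differs and the phrase check rejects it.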
match_phrase_prefix Query - search as you type
It matches when all of the complete words are present and the last, partial word is a prefix of the word that follows them
Example
The explanation is as follows
Keywords: “puerto r”
It considers “puerto” as exact word that needs to be in the country name, and “r” as prefix for any word after “puerto”. This will match “Puerto Rico”.
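
As a sketch, the request body and the matching rule could look like this. The country_name field is hypothetical, and the local check is a toy model (lowercase, whitespace tokens, non-empty input) of what the query does: every word except the last must match exactly and in order, and the last one is treated as a prefix.

```python
import json

# Request body; country_name is a hypothetical field name.
phrase_prefix_query = {
    "query": {"match_phrase_prefix": {"country_name": {"query": "puerto r"}}}
}
print(json.dumps(phrase_prefix_query))


def tokens(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches_phrase_prefix(field_value, user_input):
    """All words but the last match exactly in order; the last one is a prefix."""
    field_tokens = tokens(field_value)
    *exact, prefix = tokens(user_input)  # assumes user_input is not empty
    n = len(exact)
    for i in range(len(field_tokens) - n):
        if field_tokens[i:i + n] == exact and field_tokens[i + n].startswith(prefix):
            return True
    return False


print(matches_phrase_prefix("Puerto Rico", "puerto r"))
print(matches_phrase_prefix("Costa Rica", "puerto r"))
```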
multi_match Query - Queries Multiple Fields
The explanation is as follows.
Similar to match, but searches multiple fields.
Example needed.
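
As one possible example, a multi_match request body could be built like this. The title and body fields and the search text are hypothetical. The ^2 suffix is the boost syntax, making matches in title count twice as much.

```python
import json

# Search the same text in two hypothetical fields; title is boosted by 2.
multi_match_query = {
    "query": {
        "multi_match": {
            "query": "hibernate search",
            "fields": ["title^2", "body"],
        }
    }
}

print(json.dumps(multi_match_query, indent=2))
```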

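common terms query
The explanation is as follows. (This query has been deprecated since Elasticsearch 7.3 in favor of the match query.)
A more specialized query which gives more preference to uncommon words.
Example needed.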

query_string Query - Queries Multiple Fields and Supports Criteria Such as AND, OR
Moved to the query_string query post

simple_query_string Query
The explanation is as follows.
A simpler, more robust version of the query_string syntax suitable for exposing directly to users.
Example needed.
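
As one possible example, a simple_query_string request body could look like this. The title and body fields and the search text are hypothetical. In this syntax + means AND, | means OR, - negates a term, and double quotes mark a phrase.

```python
import json

# "hibernate search" as a phrase, AND (spring OR jpa), AND NOT solr.
simple_query = {
    "query": {
        "simple_query_string": {
            "query": '"hibernate search" +(spring | jpa) -solr',
            "fields": ["title", "body"],
            "default_operator": "and",
        }
    }
}

print(json.dumps(simple_query, indent=2))
```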

Compound Queries - Compound Query
The explanation is as follows
bool query
The default query for combining multiple leaf or compound query clauses, as must, should, must_not, or filter clauses. The must and should clauses have their scores combined — the more matching clauses, the better — while the must_not and filter clauses are executed in filter context.

boosting query
Return documents which match a positive query, but reduce the score of documents which also match a negative query.

constant_score query
A query which wraps another query, but executes it in filter context. All matching documents are given the same “constant” _score.

dis_max query
A query which accepts multiple queries, and returns any documents which match any of the query clauses. While the bool query combines the scores from all matching queries, the dis_max query uses the score of the single best- matching query clause.

function_score query
Modify the scores returned by the main query with functions to take into account factors like popularity, recency, distance, or custom algorithms implemented with scripting.
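
The scoring difference between bool and dis_max can be sketched with plain arithmetic. This is a simplified model under the assumption that bool adds up the scores of the matching clauses, while dis_max takes the best clause score and adds the other scores multiplied by the tie_breaker parameter. The clause scores are made-up numbers.

```python
def bool_score(clause_scores):
    """Simplified bool scoring: the scores of all matching clauses are added up."""
    return sum(clause_scores)


def dis_max_score(clause_scores, tie_breaker=0.0):
    """Simplified dis_max scoring: best clause, plus tie_breaker times the rest."""
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)


# Made-up scores of two matching clauses for one document.
scores = [1.5, 0.5]

print(bool_score(scores))
print(dis_max_score(scores))
print(dis_max_score(scores, tie_breaker=0.5))
```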
Clauses Used with the bool Query
The explanation is as follows
must
All queries within this clause must match a document in order for ES to return it. Think of this as your AND queries. The query that we used here is the fuzzy query, and it will match any documents that have a name field that matches “john” in a fuzzy way. The extra “fuzziness” parameter tells Elasticsearch that it should be using a Damerau-Levenshtein Distance of 2 to determine the fuzziness.

must_not
Any documents that match the query within this clause will be outside of the result set. This is the NOT or minus (-) operator of the query DSL. In this case, we do a simple match query, looking for documents that contain the term “city.” Using _all as the field name indicates that the term can appear in any of the document’s fields. This is the must_not clause, so matching documents will be excluded.

should
Up until now, we have been dealing with absolutes: must and must_not. Should is not absolute and is equivalent to the OR operator. Elasticsearch will return any documents that match one or more of the queries in the should clause. The first query that we provided looks for documents where the age field is between 30 and 40. The second query does a wildcard search on the surname field, looking for values that start with “K.”

The query contained three different clauses, so Elasticsearch will only return documents that match the criteria in all of them. These queries can be nested, so you can build up very complex queries by specifying a bool query as a must, must_not, should or filter query.
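
The query that the explanation above walks through is not shown, so the body below is a reconstruction from the description and should be read as an assumption. The range bounds are taken as inclusive, and the _all field belongs to older Elasticsearch versions.

```python
import json

bool_query = {
    "query": {
        "bool": {
            # AND: name must fuzzily match "john" within 2 edits.
            "must": [
                {"fuzzy": {"name": {"value": "john", "fuzziness": 2}}}
            ],
            # NOT: exclude documents containing "city" in any field.
            # _all is a field from older Elasticsearch versions.
            "must_not": [
                {"match": {"_all": "city"}}
            ],
            # OR: age between 30 and 40, or surname starting with "K".
            "should": [
                {"range": {"age": {"gte": 30, "lte": 40}}},
                {"wildcard": {"surname": "K*"}},
            ],
        }
    }
}

print(json.dumps(bool_query, indent=2))
```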
