{"id":2788,"date":"2020-04-13T17:12:59","date_gmt":"2020-04-13T11:42:59","guid":{"rendered":"https:\/\/rzpwp.blog\/?p=2788"},"modified":"2025-05-28T07:47:07","modified_gmt":"2025-05-28T02:17:07","slug":"detect-fraud-using-ml-ai-thirdwatch","status":"publish","type":"post","link":"https:\/\/razorpay.com\/blog\/detect-fraud-using-ml-ai-thirdwatch\/","title":{"rendered":"Using Machine Learning to Detect Fraud: Introduction"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">The last couple of decades have seen the rise of e-commerce throughout the world, and both merchants and customers are now able to experience a level of comfort in dealing and shopping that could only be imagined before. For the merchant, this means easier showcasing of goods, 24&#215;7 operation, a chance to expand their global outreach and so much more. Unfortunately, it isn\u2019t just the stores that have evolved; major problems that shop owners faced in the pre-internet era, such as fraud, have evolved too.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Fraud is a much less talked-about facet of e-commerce that has a large impact on the revenue of a business. E-commerce businesses across different industries have seen up to <\/span><span style=\"font-weight: 400;\">40%<\/span><span style=\"font-weight: 400;\"> of their orders turn out to be fraudulent on a regular basis.<\/span><\/p>\n<h2><b>Types of fraud<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">E-commerce fraud occurs on Cash On Delivery (COD) as well as prepaid orders. One common type of fraud is the <em>Return To Origin (RTO)<\/em> fraud, where the customer initiates a return after receiving the product, having either used it temporarily or swapped it for a faulty\/damaged product, or denies ever having received the product. 
<a href=\"https:\/\/razorpay.com\/blog\/online-payment-fraud-and-risk-mitigation\/\">Payment frauds<\/a> related to credit cards, where the customer <a href=\"https:\/\/razorpay.com\/blog\/what-is-a-chargeback\/\">initiates a <em>chargeback<\/em><\/a> on receiving the product and denies having made a purchase with the card in question, are also quite common. Other types of e-commerce fraud include <a href=\"https:\/\/razorpay.com\/learn\/promo-code-fraud-abuse\/\"><em>promo code abuse<\/em><\/a>, where a single customer signs up multiple times on an app to avail discounts using <a href=\"https:\/\/razorpay.com\/learn\/what-is-promo-code\/\">promo codes<\/a>, and <em>account takeover<\/em>, where a fraudster gains access to a customer\u2019s account and purchases multiple items on the customer\u2019s behalf.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"size-large wp-image-2792 aligncenter\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-3-1024x590.png\" alt=\"return to origin orders e-commerce flow\" width=\"1024\" height=\"590\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-3-1024x590.png 1024w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-3-300x173.png 300w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-3-768x442.png 768w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-3-1536x884.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">However, it would be a gross underestimation to think that e-commerce frauds are limited to these types. 
Frauds are ever-evolving and new ways of defrauding come up more often than one would imagine.<\/span><\/p>\n<p><em><strong>Related Read: <a href=\"https:\/\/razorpay.com\/blog\/what-is-fraud-analytics\/\">Fraud Analytics: A Guide to Preventing Financial Fraud<\/a><\/strong><\/em><\/p>\n<h2><b>The data age<\/b><\/h2>\n<p><span style=\"font-weight: 400;\">Traditionally, tech solutions centred on fixed rules. For example, to tackle e-commerce fraud, one of the rules we can create is, \u201cif the mobile number and pin code of the customer don\u2019t seem correct, declare the order fraudulent\u201d, which roughly translates to the following (note that this is just an example and more rigorous checks can be carried out):<\/span><\/p>\n<blockquote>\n<pre>if (no. of digits in mobile number != 10) then\r\n  if (length of pincode != 6 or no. of digits in pincode != 6) then\r\n    reject order;<\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">This seems like a good way to tackle the problem, but it has several issues, the most important being that we don\u2019t really know which rules to build and apply. While active research is being carried out to solve such problems, something like e-commerce fraud is ever-evolving, and hence no fixed set of rules will ever be able to cover all fraud cases.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is where data-based solutions come in. These involve recording and <strong>analyzing data over a period of time and trying to figure out patterns<\/strong> in the data that would provide enough insight to come up with a solution.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Large companies record mind-boggling quantities of data every day, given that we as end-users turn to the internet for much of our daily activities. 
As early as 2017, for instance, in every minute of an average day, Google conducted 3.6 million searches, Skype users made about 154,200 calls, Netflix users streamed 69,444 hours of video and Instagram users posted 46,740 photos. By 2018, over 2.5 quintillion bytes of data were generated each day. As of 2020, it\u2019s estimated that about 1.7 MB of data is generated by every single person on Earth every second, and all of it is being stored. The age of data is upon us.<\/span><\/p>\n<blockquote><p>Implementing rule-based solutions for complex and ever-changing problems such as e-commerce fraud is not feasible and hence data-based solutions are preferred for such problems.<\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">It is therefore no surprise that data-based solutions have become the most popular way of tackling the e-commerce fraud problem. Specifically, <em>Machine Learning<\/em>, a concept closely related to <em>Artificial Intelligence (AI)<\/em>, is employed these days to try to solve this problem.<\/span><\/p>\n<h2><b>The basics of Machine Learning<\/b><\/h2>\n<h3><b>How does ML work<\/b><\/h3>\n<p><span style=\"font-weight: 400;\"><strong>Machine Learning (ML)<\/strong> is a set of algorithms that can actively recognize patterns from large amounts of data and use these patterns to predict a certain parameter &#8211; in this case, whether a given order is fraudulent or not. Machine Learning has been around since the latter part of the 20th century, but only entered mainstream programming in the 2010s.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">On a broad level, ML algorithms can be classified into <em>supervised<\/em> and <em>unsupervised<\/em> algorithms. Both kinds of algorithms require many examples (data records) to learn any useful patterns. The difference is that supervised ML algorithms require a label for each data sample, while unsupervised ones don\u2019t. 
A popular example of a supervised ML problem is rent prediction: we can provide a dataset containing various attributes of each house (area, location, number of rooms, size of rooms, etc.) and label each house with its corresponding rent. <\/span><\/p>\n<p><span style=\"font-weight: 400;\">A supervised ML algorithm can learn how these attributes affect the rent of the house. An example of an unsupervised algorithm is one that learns from user behaviour and gives recommendations based on what users like. Detecting e-commerce fraud is mostly a supervised classification problem, given that a dataset of orders with labels (fraud\/not fraud) would be available.<\/span><\/p>\n<p>To sum it all up,<\/p>\n<blockquote><p>Supervised Machine Learning algorithms require labelled samples to learn from data, while Unsupervised ML algorithms don&#8217;t need labels to learn from data.<\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">Further, this problem is a <strong><em>classification<\/em> problem, in which the output can be one of a predefined set of values called classes<\/strong> (&#8216;fraud&#8217; and &#8216;not fraud&#8217; in this case), as opposed to a <em>regression<\/em> problem, where the output can be any real number in a range (say &#8216;10.0&#8217; to &#8216;100.0&#8217;). <strong>A classification problem in which the number of output classes is equal to 2<\/strong>, as in this case, <strong>is called a <em>binary classification<\/em> problem<\/strong>, as opposed to a <em>multi-class classification<\/em> problem, where the number of output classes is more than 2.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The working of an ML model varies with the algorithm used to implement it; however, all supervised models follow a common pattern. 
The very basic idea of a model is that it would learn a function mapping between the inputs &#8216;X_i&#8217;<\/span><span style=\"font-weight: 400;\"> and the output &#8216;Y&#8217; <\/span><span style=\"font-weight: 400;\">using the examples given to it. The complexity of the function that a model can learn varies according to the algorithm. For instance, an algorithm like Logistic Regression would end up learning a much simpler mapping as compared to a <em>Multi-Layer Perceptron<\/em>. This function is called the <\/span><em>Hypothesis<\/em><span style=\"font-weight: 400;\">. A sample hypothesis for a model operating with two features &#8216;X_1&#8217;<\/span><span style=\"font-weight: 400;\"> and &#8216;X_2&#8217;<\/span><span style=\"font-weight: 400;\">\u00a0can be:<\/span><\/p>\n<blockquote>\n<pre> h(X) = W_0 + W_1*X_1 + W_2*X_2<\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">This is a fairly simple hypothesis that is used by a model like <em>Linear Regression<\/em>. The figure below demonstrates how a simple hypothesis function like a line can do well in a binary classification task if the points from the two classes are already separate. The blue and orange points represent two separate classes. Points on the left side of the hypothesis would be marked as &#8216;blue&#8217; and those on the right as &#8216;orange&#8217;. This would mean that most of the points could be classified correctly using this kind of a hypothesis.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In case the data is more complex, i.e. 
the points from both classes are more &#8220;mixed&#8221; with each other, a more complex hypothesis function will be needed.<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-2794\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/using-ml-for-fraud-hypothesis-2-1024x614.png\" alt=\"using machine learning for fraud hypotheses\" width=\"1024\" height=\"614\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/using-ml-for-fraud-hypothesis-2-1024x614.png 1024w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/using-ml-for-fraud-hypothesis-2-300x180.png 300w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/using-ml-for-fraud-hypothesis-2-768x461.png 768w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/using-ml-for-fraud-hypothesis-2-1536x922.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">At the core of the algorithm is a <em>loss function<\/em>, which tells the model whether it is learning correctly. The objective of the training process is to minimize the loss function as much as possible. For each sample that the model sees, it provides an estimate of the value of &#8216;Y&#8217; for that sample. If the estimate is close to the real value, the loss is small; otherwise it is large, thereby \u2018<em>penalizing<\/em>\u2019 the model. After each evaluation of the loss function, the parameters &#8216;W_0&#8217;<\/span><span style=\"font-weight: 400;\">, &#8216;W_1&#8217;<\/span><span style=\"font-weight: 400;\"> and &#8216;W_2&#8217;<\/span><span style=\"font-weight: 400;\">\u00a0are updated in such a way that the next estimate is closer to the real value. 
Hence, the model learns the mapping between the features and the labels.<\/span><\/p>\n<blockquote><p>Training an ML model involves iteratively updating certain parameters such that a loss function is minimized. The set of parameters that gives the minimum value of the loss function is used to predict the target variable for new samples.<\/p><\/blockquote>\n<p><span style=\"font-weight: 400;\">The general process of building and using a Machine Learning model is simple enough to understand. We gather data (in this case, a dataset of recent orders) with multiple features (for instance, order date, order time, price of the product, information about the product, user account details, etc.) and label each order as <\/span><span style=\"font-weight: 400;\">true<\/span><span style=\"font-weight: 400;\"> if it is fraudulent and <\/span><span style=\"font-weight: 400;\">false<\/span><span style=\"font-weight: 400;\"> if it is not. An ML model is then iteratively trained on this data and tested on a hold-out set (also called the <em>test set<\/em>), which is never shown to the model during training (more on this later). 
If the model performs well on the test set, we can use it to predict the status of future orders.\u00a0<\/span><\/p>\n<p style=\"text-align: center;\"><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-large wp-image-2791\" src=\"http:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-2-1024x397.png\" alt=\"using machine learning for rto orders ecommerce\" width=\"1024\" height=\"397\" srcset=\"https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-2-1024x397.png 1024w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-2-300x116.png 300w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-2-768x298.png 768w, https:\/\/blog.razorpay.in\/wp-content\/uploads\/2020\/04\/image-2-1536x595.png 1536w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Now, when a new order is passed to the model, it can predict whether the label for this new order would be <\/span><span style=\"font-weight: 400;\">true<\/span><span style=\"font-weight: 400;\"> or <\/span><span style=\"font-weight: 400;\">false<\/span><span style=\"font-weight: 400;\">. That said, the model does not output &#8216;true&#8217; or &#8216;false&#8217; directly; it outputs the probability of the order being fraudulent, i.e., &#8216;P(fraud)&#8217;.<\/span><span style=\"font-weight: 400;\">\u00a0It is then up to us to set a cutoff on this probability that works for us. This is explained in more detail in the next post.<\/span><\/p>\n<h3><b>Why use ML for this problem<\/b><\/h3>\n<p><span style=\"font-weight: 400;\">In a nutshell, the reason why ML-based solutions for e-commerce fraud detection are gaining popularity fast is that we as <strong>humans cannot fathom how each factor in the e-commerce ecosystem might be affecting the fraudulence of a particular order<\/strong>. 
We know that there are a lot of factors that might hint at an order being fraudulent: for instance, a user might have placed an abnormally large number of orders in the past few minutes, entered a randomly typed (gibberish) address in the address fields, or skipped the basic information needed for an order to be delivered, which will result in an RTO. We cannot, however, manually evaluate each factor and determine its contribution towards the fraudulence of that order.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To illustrate this point, suppose we take a traditional approach to solving this problem. We would eventually come up with a set of rules that determines whether an order is fraudulent. For example,<\/span><\/p>\n<blockquote>\n<pre><span style=\"font-weight: 400;\">if (<\/span><span style=\"font-weight: 400;\">X_<\/span><span style=\"font-weight: 400;\">1<\/span><span style=\"font-weight: 400;\">) then<\/span>\r\n<span style=\"font-weight: 400;\">   if (<\/span><span style=\"font-weight: 400;\">X_<\/span><span style=\"font-weight: 400;\">2<\/span><span style=\"font-weight: 400;\"> and <\/span><span style=\"font-weight: 400;\">X_<\/span><span style=\"font-weight: 400;\">4<\/span><span style=\"font-weight: 400;\">) then \r\n       \u2026<\/span><\/pre>\n<\/blockquote>\n<p><span style=\"font-weight: 400;\">The full rule is far more complex than we as humans can write in a reasonable amount of time. On the other hand, Machine Learning models can come up with such rules in a very short amount of time and hence reduce the cost, time and manual labour spent on this task.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Another reason is that these rules can be dynamic and can change over time. <strong>Fraudulent users keep changing their tactics to avoid getting caught and novel ways of committing fraud keep coming up from time to time<\/strong>. 
Devoting valuable resources to creating rules, only to change them over time, is cumbersome and wasteful, and ML provides a far more convenient solution.<\/span><\/p>\n<h3 class=\"Heading__StyledHeading-sc-11dvfr8-4 flfjZd\" data-key=\"181\"><span class=\"Heading__Anchor-sc-11dvfr8-3 aNicP\">There\u2019s more to come!<br \/>\n<\/span><\/h3>\n<p data-key=\"184\"><span data-key=\"185\">This is only the first post of a four-part blog series on how Machine Learning can be used to effectively detect fraud in e-commerce. The next instalment of this series will focus on the more technical aspects: which algorithm to choose for this Machine Learning task and which features can be created for detecting e-commerce fraud. Stay tuned for more!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>ML is a set of algorithms that can actively recognize patterns from large amounts of data and use these patterns to predict a certain parameter&#8211;in this case, whether a given order is fraudulent or not. 
<\/p>\n","protected":false},"author":44,"featured_media":2796,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[69],"tags":[57],"class_list":{"0":"post-2788","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-razorpay-stories","8":"tag-technology"},"_links":{"self":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/2788","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/users\/44"}],"replies":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/comments?post=2788"}],"version-history":[{"count":7,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/2788\/revisions"}],"predecessor-version":[{"id":22927,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/posts\/2788\/revisions\/22927"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media\/2796"}],"wp:attachment":[{"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/media?parent=2788"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/categories?post=2788"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/razorpay.com\/blog\/wp-json\/wp\/v2\/tags?post=2788"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}