Jekyll2022-08-17T21:23:52+01:00http://dyanarose.github.io/feed.xmlThese Things HappenDyana RoseReducing Costs by Switching to Cheaper AWS Services2022-06-08T12:11:11+01:002022-06-08T12:11:11+01:00http://dyanarose.github.io/blog/2022/06/08/reducing-costs-by-switching-to-cheaper-aws-services<p>I run a tiny website for tracking item prices in Guild Wars 2, <a href="http://www.gw2roar.com/">gw2roar.com</a>. I don’t run it for the money (it has 1 or 2 visitors per day), it’s simply a fun site that suits my needs.</p>
<p>Because it doesn’t make money, every penny of running cost comes out of my pocket. When I first started the site, it ran entirely on 12-month free tier AWS services, so I didn’t worry about cost; later I took part in an Alexa promotion that gave me a $100 credit against my monthly bill. But all good things come to an end, as they say. After I was notified that the promotion was ending, I started looking at how I could significantly reduce my costs.</p>
<h2 id="step-1-use-the-cost-explorer-to-find-the-high-cost-services">Step 1: Use the Cost Explorer to find the high cost services</h2>
<p>Open up <a href="https://console.aws.amazon.com/cost-management/home">Cost Management</a> and click Cost Explorer in the list on the left.</p>
<p>In this view you can explore the costs of each service you use.</p>
<p>I had two spikes in my cost explorer: ElastiCache and RDS. ElastiCache stores candlestick data per item per day for quick retrieval. RDS runs a Postgres instance that stores the item price data retrieved every 30 minutes from the Guild Wars 2 API.</p>
<h2 id="step-2-imagine-your-site-without-the-high-cost-service">Step 2: Imagine your site without the high cost service</h2>
<p>What if ElastiCache didn’t exist?</p>
<p>Calculating candlestick data on the fly is slow, but the result never changes once a day is done, so I used ElastiCache to speed up loading times.</p>
<p>I needed a quick, distributed cache I could query by item id and time. ElastiCache isn’t the only service that provides that; I could move to nearly any distributed cache. My decision ultimately came down to cost.</p>
<h2 id="step-3-cost-up-the-changes">Step 3: Cost up the changes</h2>
<p>The AWS pricing calculator has become much easier to use over the past few years. Now that you can easily price by service, there’s less room for omitting costs or adding them in by mistake.</p>
<p>A 1 node, t2.micro <a href="https://calculator.aws/#/addService/ElastiCache">ElastiCache cluster costs about $12.41 per month</a>.</p>
<p>What would the price be if I switched to, say, DynamoDB?</p>
<p>DynamoDB is a bit more difficult to price; you need to know:</p>
<ul>
<li>the size of your data</li>
<li>the baseline read/write rates</li>
<li>the peak read/write rates</li>
<li>how long the peaks last</li>
<li>what different options and features mean</li>
</ul>
<p>I used the information from ElastiCache to estimate the size of the data and the web statistics to estimate the read capacity. The write capacity was harder to estimate because it came from two sources.</p>
<p>A visitor could cause a write by loading the data for an item, but that happens at a much lower rate than visits to the site. So the baseline write rate is very low.</p>
<p>At the end of every month a batch job calculates and loads all the data for the month into the cache. Given I knew how many items there were, and that the data needed to be loaded within 15 minutes (the timeout of a Lambda function), I could work out what the peak write rate would be.</p>
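That arithmetic is simple enough to sketch. The item count below is a made-up placeholder, not a figure from the post; the only real constraint is the 15-minute Lambda timeout:

```python
# Rough peak write-rate estimate for the monthly batch load.
# Assumptions (illustrative, not from the post): ~27,000 items,
# one cache record each, all written within the 15-minute Lambda timeout.
items = 27_000
window_seconds = 15 * 60  # Lambda's maximum run time

peak_writes_per_second = items / window_seconds
print(peak_writes_per_second)  # 30.0
```

That peak, alongside the very low baseline, is what goes into the provisioned-capacity and auto-scaling fields of the calculator.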
<p>Did I need:</p>
<ul>
<li><a href="https://aws.amazon.com/blogs/aws/dynamodb-price-reduction-and-new-reserved-capacity-model/">reserved capacity</a>? No, that would drive up the cost and provide no benefit at a low rate of reads/writes.</li>
<li><a href="https://aws.amazon.com/dynamodb/pricing/on-demand/">on-demand capacity</a>? It’s nice in theory to pay for what you use, but in practice it would be more expensive for my site than using <a href="https://aws.amazon.com/dynamodb/pricing/provisioned/">provisioned capacity</a> and auto-scaling.</li>
</ul>
<p>With everything in the calculator, AWS predicted a cost of about $3.20 per month. However, the calculator doesn’t take into account the <a href="https://aws.amazon.com/free/?all-free-tier.sort-by=item.additionalFields.SortRank&all-free-tier.sort-order=asc&awsf.Free%20Tier%20Types=tier%23always-free&awsf.Free%20Tier%20Categories=categories%23databases">“always free” allowances</a>, so switching to DynamoDB would actually bring my costs to $0.</p>
<h2 id="step-4-make-the-changes-and-evaluate">Step 4: Make the changes and evaluate</h2>
<p>A cost estimate is just that: an estimate. The real world can be different.</p>
<p>For example, in my first implementation of the switch to DynamoDB, I put some of the code into the same Lambda function that called RDS. That function was running in the RDS VPC, and it’s quite simple to set up a VPC endpoint for DynamoDB from there. So simple that I missed that the endpoint would cost me <a href="https://aws.amazon.com/privatelink/pricing/">$0.01 per AZ per hour</a>.</p>
<p>Thankfully I saw the costs within the first few hours of creating the endpoint and was able to re-architect my functions.</p>
<p>Watching and evaluating the costs is essential to finding out fast if your estimates were wrong.</p>
<h2 id="step-5-repeat-from-step-1">Step 5: Repeat from Step 1</h2>
<p>Keep hunting for options and reducing costs until you’re satisfied.</p>Dyana RoseI run a tiny website for tracking item prices in Guild Wars 2, gw2roar.com. I don’t run it for the money (it has 1 or 2 visitors per day), it’s simply a fun site that suits my needs.Upserting in Postgres: It’s not just all or nothing2022-06-01T12:11:11+01:002022-06-01T12:11:11+01:00http://dyanarose.github.io/blog/2022/06/01/upserting-in-postgres<p>Upserting in Postgres lets you insert a new value or update an existing value in a single atomic statement. This avoids needing two separate statements, read and update/insert, wrapped in a transaction.</p>
<p>If you want to update only particular fields, and not all fields in the row, you can do that too.</p>
<p>Let’s explore upserting into a table tracking a location’s high temperatures per day.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">IF</span> <span class="k">NOT</span> <span class="k">EXISTS</span> <span class="n">temp_agg</span>
<span class="p">(</span>
<span class="n">id</span> <span class="nb">integer</span> <span class="k">NOT</span> <span class="k">NULL</span><span class="p">,</span>
<span class="k">day</span> <span class="nb">date</span><span class="p">,</span>
<span class="n">high</span> <span class="nb">NUMERIC</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
<span class="k">CONSTRAINT</span> <span class="n">id_day</span> <span class="k">UNIQUE</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Every hour our application polls for the temperature at a set of location endpoints and inserts the current temperature. If a location endpoint is down, unresponsive, or returning “bad” data, it will be skipped and re-polled in the next hourly run.</p>
<p>Our first attempt at writing the temperature <code class="language-plaintext highlighter-rouge">INSERT</code> statement starts out like:</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">,</span> <span class="n">high</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">),</span> <span class="mi">15</span><span class="p">.</span><span class="mi">00</span><span class="p">),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">),</span> <span class="mi">15</span><span class="p">.</span><span class="mi">50</span><span class="p">)</span>
</code></pre></div></div>
<p>This runs and the table now looks like:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>15.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>15.50</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table>
<p>But the next time the application tries to insert data for <code class="language-plaintext highlighter-rouge">2022-06-01</code> the INSERT statement returns the error <code class="language-plaintext highlighter-rouge">duplicate key value violates unique constraint "id_day"</code> because inserting the data would conflict with the constraint that each row have a unique (id, day) pair.</p>
<h2 id="on-conflict">On Conflict…</h2>
<p>The <a href="https://www.postgresql.org/docs/current/sql-insert.html#SQL-ON-CONFLICT">ON CONFLICT</a> clause acts a bit like the “catch” of a “try/catch” statement, applied per row: <code class="language-plaintext highlighter-rouge">try</code> to insert the row, <code class="language-plaintext highlighter-rouge">catch and handle any conflict</code>. In the <code class="language-plaintext highlighter-rouge">ON CONFLICT</code> clause you declare which conflict you are interested in and how you want to deal with it at the row level.</p>
<p>There are two options for using ON CONFLICT: <code class="language-plaintext highlighter-rouge">DO NOTHING</code> and <code class="language-plaintext highlighter-rouge">DO UPDATE</code>.</p>
<h3 id="on-conflict-do-nothing">ON CONFLICT DO NOTHING</h3>
<p><code class="language-plaintext highlighter-rouge">ON CONFLICT ... DO NOTHING</code> inserts each row that doesn’t have a conflict and skips each row that does.</p>
<p>The following SQL has conflicts on ids 1 and 2 on the day ‘2022-06-01’. It <em>ignores</em> those two rows and inserts the row with id 3.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">21</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">NOTHING</span>
</code></pre></div></div>
<p>Result:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>15.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>15.50</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>3</td>
<td>21.00</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table>
<h3 id="on-conflict-do-update">ON CONFLICT DO UPDATE</h3>
<p><code class="language-plaintext highlighter-rouge">ON CONFLICT ... DO UPDATE</code> inserts each row that doesn’t have a conflict and <em>updates</em> each row that does.</p>
<p>In the <code class="language-plaintext highlighter-rouge">DO UPDATE</code> statement you have access to both the existing row data and the new row data, though how to reference them is not obvious. You reference them like so:</p>
<p>Existing data => <code class="language-plaintext highlighter-rouge"><table_name>.<field></code>, e.g. <code class="language-plaintext highlighter-rouge">temp_agg.high</code></p>
<p>If you have aliased the table, for example <code class="language-plaintext highlighter-rouge">INSERT INTO temp_agg as t</code>, then you will reference the existing data as <code class="language-plaintext highlighter-rouge">t.high</code></p>
<p>New data => <code class="language-plaintext highlighter-rouge">EXCLUDED.<field></code>, e.g. <code class="language-plaintext highlighter-rouge">EXCLUDED.high</code></p>
<p>If your insert statement already references a table named <code class="language-plaintext highlighter-rouge">excluded</code> you need to alias it to avoid any naming conflicts.</p>
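These referencing rules are easy to try locally: SQLite borrowed Postgres’s ON CONFLICT syntax, so a minimal sketch needs nothing beyond Python’s standard library (the table and values below mirror the post’s examples, but this is SQLite, not Postgres):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE temp_agg (
        id INTEGER NOT NULL,
        day TEXT,
        high REAL,
        CONSTRAINT id_day UNIQUE (id, day)
    )
""")
conn.execute("INSERT INTO temp_agg VALUES (1, '2022-06-01', 15.00)")

upsert = """
    INSERT INTO temp_agg (id, day, high)
    VALUES (?, ?, ?)
    ON CONFLICT (id, day)
    DO UPDATE SET high = excluded.high     -- excluded.* = the incoming row
    WHERE excluded.high > temp_agg.high    -- temp_agg.* = the existing row
"""
conn.execute(upsert, (1, "2022-06-01", 20.00))  # 20 > 15: the row is updated
conn.execute(upsert, (1, "2022-06-01", 18.00))  # 18 < 20: conflict, no update

print(conn.execute("SELECT high FROM temp_agg").fetchone()[0])  # 20.0
```

The same statement, run against Postgres, behaves identically.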
<p>The result we’re looking for is to update the high temperature for locations 1 and 2 on day <code class="language-plaintext highlighter-rouge">2022-06-01</code> while leaving location 3 unchanged.</p>
<p>The following SQL statements achieve this using different semantics.</p>
<p>In the first example, the conflict rows are only updated if the new (<code class="language-plaintext highlighter-rouge">EXCLUDED</code>) high temperature is greater than the existing high temperature.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">UPDATE</span> <span class="k">SET</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">temp_agg</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span><span class="p">,</span> <span class="n">temp_agg</span><span class="p">.</span><span class="k">day</span><span class="p">)</span>
<span class="k">WHERE</span> <span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span> <span class="o">></span> <span class="n">temp_agg</span><span class="p">.</span><span class="n">high</span>
</code></pre></div></div>
<p>In the second example the <code class="language-plaintext highlighter-rouge">GREATEST</code> function is used to set the high temperature instead of filtering with a <code class="language-plaintext highlighter-rouge">WHERE</code> clause.</p>
<div class="language-sql highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">INSERT</span> <span class="k">INTO</span> <span class="n">temp_agg</span>
<span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">VALUES</span>
<span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">20</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">)),</span>
<span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="mi">18</span><span class="p">.</span><span class="mi">00</span><span class="p">,</span> <span class="nb">DATE</span><span class="p">(</span><span class="s1">'2022-06-01'</span><span class="p">))</span>
<span class="k">ON</span> <span class="n">CONFLICT</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span>
<span class="k">DO</span> <span class="k">UPDATE</span> <span class="k">SET</span> <span class="p">(</span><span class="n">id</span><span class="p">,</span> <span class="n">high</span><span class="p">,</span> <span class="k">day</span><span class="p">)</span> <span class="o">=</span> <span class="p">(</span><span class="n">temp_agg</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">GREATEST</span><span class="p">(</span><span class="n">EXCLUDED</span><span class="p">.</span><span class="n">high</span><span class="p">,</span> <span class="n">temp_agg</span><span class="p">.</span><span class="n">high</span><span class="p">),</span> <span class="n">temp_agg</span><span class="p">.</span><span class="k">day</span><span class="p">)</span>
</code></pre></div></div>
<p>The result is that the high temperatures for locations 1 and 2 on day <code class="language-plaintext highlighter-rouge">2022-06-01</code> have been updated and the high temperature for location 3 has stayed the same.</p>
<p>Result:</p>
<table>
<thead>
<tr>
<th>id</th>
<th>high</th>
<th>day</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>20.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>2</td>
<td>20.00</td>
<td>2022-06-01</td>
</tr>
<tr>
<td>3</td>
<td>21.00</td>
<td>2022-06-01</td>
</tr>
</tbody>
</table>Dyana RoseUpserting in Postgres lets you insert a new value or update an existing value in a single atomic statement. This avoids needing two separate statements, read and update/insert, wrapped in a transaction.Building and Showing2017-11-05T10:57:57+00:002017-11-05T10:57:57+00:00http://dyanarose.github.io/blog/2017/11/05/building-and-showing<p>On and off I’ve been working on a site that displays candlestick charts using data from the Guild Wars 2 trading post, but I’ve never publicly linked to it.</p>
<p>Part of the reason is that it’s not a finished product and I’m not a web designer. But why should that stop me, eh? It’s already been an interesting product involving:</p>
<ul>
<li>a service retrieving the data from the GW2 API</li>
<li>optimising sql to calculate the candlesticks quickly</li>
<li>an ETL service that takes old data, moves it to a better long term storage format, and then populates a cache</li>
<li>a website which must merge cached and uncached data before returning the candlesticks back to the caller.</li>
</ul>
<p>It didn’t start out with 2 services, a website, and a cache though. It started with a program, written in Go, that would call an endpoint once an hour and then save the results in a SQLite database.</p>
<p>At that time I didn’t know what I wanted to do with the data, I just knew I wanted to do <em>something</em>.</p>
<p>Eventually, that something became me answering the question ‘what questions do I want the answers to?’ As it turns out, what I want is to answer yet another question, ‘should I buy, sell, or make this item, and should I do it now.’ Now this question I could answer! But my tools at the time would become problematic.</p>
<p>I chose to display historical data using candlestick charts, as I like seeing how prices move in a given time period. Calculating open, close, min and max using SQLite proved to be an interesting problem. It was possible, though, with some sub-queries and some tradeoffs. For example, I could only get results for one item at a time, and the larger the dataset, the slower the calculation got. But, most importantly, it gave me a place to start. And it worked!</p>
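A stripped-down sketch of that sub-query approach, runnable against SQLite via Python’s standard library (the table, column names, and data here are invented for illustration, not taken from the site):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (item_id INTEGER, ts INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, 100, 10.0), (1, 200, 12.0), (1, 300, 9.0), (1, 400, 11.0)],
)

# Low/high are plain aggregates; open/close need correlated sub-queries
# that fetch the price at the earliest and latest timestamps. Each extra
# sub-query is another scan, which is why this slows down as data grows.
candle = conn.execute("""
    SELECT
        (SELECT price FROM prices p2
          WHERE p2.item_id = p.item_id ORDER BY ts ASC  LIMIT 1) AS open,
        MAX(price)                                               AS high,
        MIN(price)                                               AS low,
        (SELECT price FROM prices p2
          WHERE p2.item_id = p.item_id ORDER BY ts DESC LIMIT 1) AS close
    FROM prices p
    WHERE p.item_id = 1
    GROUP BY p.item_id
""").fetchone()
print(candle)  # (10.0, 12.0, 9.0, 11.0)
```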
<p>Once the requirements started shaking out, infrastructure changes became frequent. I was inside the AWS free tier, but as the SQLite file grew, I started to get worried about EBS storage and keeping things free. So my architecture had to change and I moved to using RDS and PostgreSQL.</p>
<p>Then it started grating on me that the data, once inserted into PostgreSQL, was effectively dead, and it was also filling up my free tier allotment of space in RDS.</p>
<p>So I brought in an ETL process to take the dead data out of PostgreSQL, store it more efficiently, and use it to populate a cache.</p>
<p>And then the cache grew too fast and I started seeing evictions. But a new problem means a new solution. I needed a better way (or even a way) of compressing the data going into the cache. I’ve worked with Protobuf before, so after a bit of a search to see if Avro would be an immediately better fit, I decided to go with compression via Protobuf. I also reworked the structure of the stored data, because even once compressed, the keys were still a major source of bloat. And that worked beautifully. I went from being able to store a few months of data to a few years.</p>
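The key-bloat point is easy to demonstrate with the standard library alone. The records below are invented, and `struct` stands in for Protobuf, but the principle is the same: move the field names out of the data and into the schema:

```python
import json
import struct

# Invented candlestick records: (timestamp, open, close, low, high).
candles = [(1654041600 + i * 3600, 100.0 + i, 101.0 + i, 99.0 + i, 102.0 + i)
           for i in range(24)]

# Self-describing form: every record repeats its field names.
as_json = json.dumps([
    {"timestamp": t, "open": o, "close": c, "low": lo, "high": hi}
    for t, o, c, lo, hi in candles
]).encode()

# Schema-in-code form: fixed-width binary records; the names live in the
# program (Protobuf does the same with numeric field tags instead of names).
packed = b"".join(struct.pack("<Iffff", t, o, c, lo, hi)
                  for t, o, c, lo, hi in candles)

print(len(as_json), len(packed))  # the packed form is several times smaller
```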
<p>And that’s where things stand right now. The UI isn’t much to talk about but it’s clean (sparse some may say) and it’s pretty zippy. Which is a long way from the minutes it used to take to load only a single month of data.</p>
<p><a href="http://www.gw2roar.com">www.gw2roar.com</a></p>
<p>I’m going to keep writing about the choices I made for this project, and continue to make, and why I had to make them: for example, the significant work around keeping both cache and database sizes sane, automating the ETL process, automating deployments, and whatever comes next.</p>Dyana RoseOn and off I’ve been working on a site that displays candlestick charts using data from the Guild Wars 2 trading post, but I’ve never publicly linked to it.graceful shutdown of java apps under docker2017-08-26T13:39:00+01:002017-08-26T13:39:00+01:00http://dyanarose.github.io/blog/2017/08/26/graceful-shutdown-of-java-apps-under-docker<p>I recently ran across a problem when working on a Java application that needed to be allowed to finish processing its current batch before shutting down after receiving a shutdown signal.</p>
<p>This app has a main loop that receives messages off an AWS SQS queue, processes those messages, takes action if required (making a put request to a third party), and then deletes the messages off the queue.</p>
<p>Each action must only ever send a unique record to the API. Because the third party doesn’t expose a unique identifier for a record (though it does provide an endpoint to list all existing records), the application itself must handle the idea of uniqueness.</p>
<p>So, in the case of a double whammy of my app being shut down after taking action, but before deleting the message from the queue, and the third party being slow to update the list of previous requests, I could end up sending duplicate records when the new app starts up.</p>
<h2 id="no-problem-weve-got-sigterm">No problem, we’ve got SIGTERM</h2>
<p>When <code class="language-plaintext highlighter-rouge">docker stop</code> is called on a container, SIGTERM is sent to PID 1, and in Java a <a href="https://docs.oracle.com/javase/8/docs/technotes/guides/lang/hook-design.html">shutdown hook</a> is used to catch the SIGTERM and clean up any resources before finally stopping.</p>
<p>In this case though, it’s not resources that need cleaning up. This is a continuously running application that needs to finish any work currently in process before returning out of the main loop and stopping.</p>
<p>As it happens, Java has a good way of letting a Thread know that the application would like it to shut down before it starts its next iteration of work.</p>
<h2 id="interrupts">Interrupts</h2>
<p>Java’s static <code class="language-plaintext highlighter-rouge">Thread.interrupted()</code> method returns true if the current thread has been interrupted since the last time <code class="language-plaintext highlighter-rouge">Thread.interrupted()</code> was called; checking also clears the interrupt status.
(more on <a href="https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html">Interrupts</a>)</p>
<h3 id="why-are-interrupts-useful">Why are interrupts useful</h3>
<p>In the main loop a condition of <code class="language-plaintext highlighter-rouge">while(!Thread.interrupted())</code> will allow the while block to run to completion, but also prevent the next execution if an interrupt occurred during the previous run, which is exactly what needs to happen to allow messages to complete processing before the app shuts down.</p>
<h4 id="how-to-interrupt-a-thread">How to interrupt a thread</h4>
<p>The short answer is ‘by invoking <code class="language-plaintext highlighter-rouge">Thread.interrupt</code>’ in the shutdown hook.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">threadInterruptedOnShutdown</span><span class="o">(</span><span class="kt">long</span> <span class="n">timeout</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// Setting interrupt doesn't cause the application to wait for the thread to exit.</span>
<span class="c1">// The statements inside the interrupt block in the loop may or may not be executed.</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"threadInterruptedOnShutdown wait "</span> <span class="o">+</span> <span class="n">timeout</span><span class="o">;</span>
<span class="nc">Thread</span> <span class="n">t</span> <span class="o">=</span> <span class="k">new</span> <span class="nc">Thread</span><span class="o">(</span><span class="k">new</span> <span class="nc">RunnableLoop</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">timeout</span><span class="o">));</span>
<span class="n">t</span><span class="o">.</span><span class="na">start</span><span class="o">();</span>
<span class="nc">Runtime</span><span class="o">.</span><span class="na">getRuntime</span><span class="o">().</span><span class="na">addShutdownHook</span><span class="o">(</span><span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-></span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: setting interrupt"</span><span class="o">);</span>
<span class="n">t</span><span class="o">.</span><span class="na">interrupt</span><span class="o">();</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutting down"</span><span class="o">);</span>
<span class="o">}));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Just calling interrupt on a Thread doesn’t give the control needed to ensure the current work is completed. It doesn’t necessarily wait on the Thread to exit before letting the application shut down. There’s no reason you couldn’t write the code to handle this of course, but the <a href="https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html">ExecutorService</a> already provides that for us.</p>
<p>When submitting a Runnable (or Callable) to the ExecutorService, a handle for a Future is returned. Working together, the Future and the ExecutorService give control over interrupting Threads, waiting for them to exit, and if anything fails to exit, providing an opportunity to do any damage control.</p>
<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">private</span> <span class="kd">static</span> <span class="kt">void</span> <span class="nf">futureFromExecutorService</span><span class="o">(</span><span class="kt">long</span> <span class="n">timeout</span><span class="o">)</span> <span class="o">{</span>
<span class="c1">// the executor service submit method allows us to get a handle on the thread</span>
<span class="c1">// via a future and set the interrupt in the shutdown hook</span>
<span class="nc">String</span> <span class="n">name</span> <span class="o">=</span> <span class="s">"futureFromExecutorService wait "</span> <span class="o">+</span> <span class="n">timeout</span><span class="o">;</span>
<span class="nc">ExecutorService</span> <span class="n">service</span> <span class="o">=</span> <span class="nc">Executors</span><span class="o">.</span><span class="na">newSingleThreadExecutor</span><span class="o">();</span>
<span class="nc">Future</span><span class="o"><?></span> <span class="n">app</span> <span class="o">=</span> <span class="n">service</span><span class="o">.</span><span class="na">submit</span><span class="o">(</span><span class="k">new</span> <span class="nc">RunnableLoop</span><span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">timeout</span><span class="o">));</span>
<span class="nc">Runtime</span><span class="o">.</span><span class="na">getRuntime</span><span class="o">().</span><span class="na">addShutdownHook</span><span class="o">(</span><span class="k">new</span> <span class="nc">Thread</span><span class="o">(()</span> <span class="o">-></span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: setting interrupt"</span><span class="o">);</span>
<span class="n">app</span><span class="o">.</span><span class="na">cancel</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdown</span><span class="o">();</span>
<span class="k">try</span> <span class="o">{</span>
<span class="c1">// give the thread time to shutdown. This needs to be comfortably less than the</span>
<span class="c1">// time the docker stop command will wait for a container to terminate on its own</span>
<span class="c1">// before forcibly killing it.</span>
<span class="k">if</span> <span class="o">(!</span><span class="n">service</span><span class="o">.</span><span class="na">awaitTermination</span><span class="o">(</span><span class="mi">7</span><span class="o">,</span> <span class="nc">TimeUnit</span><span class="o">.</span><span class="na">SECONDS</span><span class="o">))</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: did not shutdown in time, forcing service shutdown"</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdownNow</span><span class="o">();</span>
<span class="o">}</span> <span class="k">else</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutdown cleanly"</span><span class="o">);</span>
<span class="o">}</span>
<span class="o">}</span> <span class="k">catch</span> <span class="o">(</span><span class="nc">InterruptedException</span> <span class="n">e</span><span class="o">)</span> <span class="o">{</span>
<span class="nc">System</span><span class="o">.</span><span class="na">out</span><span class="o">.</span><span class="na">println</span><span class="o">(</span><span class="n">name</span> <span class="o">+</span> <span class="s">" thread: shutdown timer interrupted, forcing service shutdown"</span><span class="o">);</span>
<span class="n">service</span><span class="o">.</span><span class="na">shutdownNow</span><span class="o">();</span>
<span class="o">}</span>
<span class="o">}));</span>
<span class="o">}</span>
</code></pre></div></div>
<p>To interrupt the Thread when SIGTERM is received, call <code class="language-plaintext highlighter-rouge">.cancel(true)</code> on the Future inside the shutdown hook. The boolean argument allows the cancel method to interrupt a running Thread; if it is false, a Thread that has already started will not be interrupted.</p>
<p>Once the Future is cancelled, the service can begin shutting down. The ExecutorService has two types of shutdown. <code class="language-plaintext highlighter-rouge">.shutdown()</code> will stop any more work from being submitted to the service while allowing existing work to execute. <code class="language-plaintext highlighter-rouge">.shutdownNow()</code> on the other hand actively attempts to stop all running tasks.</p>
<p>After calling <code class="language-plaintext highlighter-rouge">.shutdown()</code> use the ExecutorService’s <code class="language-plaintext highlighter-rouge">.awaitTermination</code> method to both give the Threads time to finish any current work and also to handle those that do not return in time. Set the timeout argument to be less than that of the <code class="language-plaintext highlighter-rouge">docker stop</code> command so that there will be time to do damage control and attempt a final <code class="language-plaintext highlighter-rouge">.shutdownNow()</code> before docker kills the container for being non-responsive.</p>
<p>Altogether, this allows a continuously running application to respond to shutdown requests in a timely manner while still completing the work that is currently in process.</p>
<p>tl;dr <a href="https://github.com/dyanarose/application-interrupts">application-interrupts</a></p>Dyana RoseI ran across a problem recently when working on a Java application that needed to be allowed to finish processing its current batch before shutting down after the receipt of a shut down signal.so you need to edit a parquet file2017-08-04T11:40:32+01:002017-08-04T11:40:32+01:00http://dyanarose.github.io/blog/2017/08/04/so-you-need-to-edit-a-parquet-file<p>You’ve uncovered a problem in your beautiful parquet files, some piece of data either snuck in, or was calculated incorrectly, or there was just a bug. You know exactly how to correct the data, but how do you update the files?</p>
<p>tl;dr: <a href="https://github.com/dyanarose/parquet-edit-examples">parquet-edit-examples</a></p>
<h3 id="its-all-immutable">It’s all immutable</h3>
<p>The problem we have when we need to edit the data is that our data structures are immutable.</p>
<p>You can add partitions to Parquet files, but you can’t edit the data in place. Spark DataFrames are immutable.</p>
<p>But ultimately we can mutate the data; we just need to accept that we won’t be doing it in place. We will need to recreate the Parquet files, using a combination of schemas and UDFs to correct the bad data.</p>
<h2 id="schemas">Schemas</h2>
<p>Reading in data using a schema gives you a lot of power over the resultant structure of the DataFrame (not to mention it makes reading in JSON files a lot faster, and will allow you to union compatible Parquet files).</p>
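<p>The examples below read with schema objects such as <code class="language-plaintext highlighter-rouge">ColumnDropSchema.schema</code>, which are defined elsewhere in the repo. As a rough sketch of what such an object might look like (the field names and nesting here are illustrative, mirroring the <code class="language-plaintext highlighter-rouge">myField</code>/<code class="language-plaintext highlighter-rouge">myMap</code>/<code class="language-plaintext highlighter-rouge">myStruct</code> columns used below, not the repo’s actual definitions):</p>

```scala
import org.apache.spark.sql.types._

// Illustrative schema object: include only the columns you want to keep.
// Field names and types here are assumptions that mirror the examples below;
// the real ColumnDropSchema lives in the linked repo.
object ColumnDropSchema {
  val schema: StructType = StructType(
    StructField("myField", StringType, nullable = true) ::
    StructField("myMap", MapType(StringType, StringType), nullable = true) ::
    StructField("myStruct", StructType(
      StructField("myField", BooleanType, nullable = true) ::
      StructField("editMe", StringType, nullable = true) :: Nil
    ), nullable = true) ::
    Nil
  )
}
```

<p>Dropping a column is then just a matter of omitting its <code class="language-plaintext highlighter-rouge">StructField</code> from the schema you read with.</p>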
<h4 id="case-1-i-need-to-drop-an-entire-column">Case 1: I need to drop an entire column</h4>
<p>To drop an entire column, read the data in with a schema that doesn’t contain that column. When you write the DataFrame back out, the column will no longer exist.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/ColumnTransform.scala">ColumnTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">ColumnTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="c1">// read in the data with a new schema</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">ColumnDropSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="case-2-i-need-to-drop-full-rows-of-data">Case 2: I need to drop full rows of data</h4>
<p>To drop full rows, read in the data and select the data you want to save into a new DataFrame using a where clause. When you write the new DataFrame it will only have the rows that match the where clause.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/WhereTransform.scala">WhereTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">WhereTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// select only the good data rows</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is null"</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h2 id="user-defined-functions-udfs">User Defined Functions (UDFs)</h2>
<p><a href="https://blog.cloudera.com/blog/2017/02/working-with-udfs-in-apache-spark/">UDFs in Spark</a> are used to apply functions to a row of data. The result of the UDF becomes the field value.</p>
<p>Note that when using UDFs you must alias the resultant column, otherwise it will end up renamed to something like <code class="language-plaintext highlighter-rouge">UDF(fieldName)</code>.</p>
<h4 id="case-3-i-need-to-edit-the-value-of-a-simple-type-string-boolean-">Case 3: I need to edit the value of a simple type (String, Boolean, …)</h4>
<p>To edit a simple type you first need to create a function that takes and returns the same type.</p>
<p>This function is then registered for use as a UDF, and it can then be applied to a field in a select clause.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/SimpleTransform.scala">SimpleTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">SimpleTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// take in a String, return a String</span>
<span class="c1">// cleanFunc takes the String field value and returns the empty string in its place</span>
<span class="c1">// you can interrogate the value and return any String here</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">String</span> <span class="o">=></span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="k">_</span> <span class="k">=></span> <span class="s">""</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// That can be done by selecting out the fields in the same way in both the good and transformed data sets.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myField is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myField"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myField"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="case-4-i-need-to-edit-the-value-of-a-maptype">Case 4: I need to edit the value of a MapType</h4>
<p>MapTypes follow the same pattern as simple types. You write a function that takes a Map of the correct key and value types and returns a Map of the same types.</p>
<p>In the following example, an entire entry in the Map[String,String] is removed from the final data by filtering on the keyset.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/MapTypeTransform.scala">MapTypeTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">MapTypeTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// cleanFunc will simply take the MapType and return an edited Map</span>
<span class="c1">// in this example it removes one member of the map before returning</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">]</span> <span class="o">=></span> <span class="nc">Map</span><span class="o">[</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">])</span> <span class="k">=</span> <span class="o">{</span> <span class="n">m</span> <span class="k">=></span> <span class="nv">m</span><span class="o">.</span><span class="py">filterKeys</span><span class="o">(</span><span class="n">k</span> <span class="k">=></span> <span class="n">k</span> <span class="o">!=</span> <span class="s">"editMe"</span><span class="o">)</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// I do that here by selecting out all the fields.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myMap.editMe is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myMap.editMe is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myMap"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myMap"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myStruct"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h4 id="case-5-i-need-to-change-the-value-of-a-member-of-a-structtype">Case 5: I need to change the value of a member of a StructType</h4>
<p>Working with StructTypes requires an addition to the UDF registration statement. By supplying the schema of the StructType, you are able to manipulate it using a function that takes and returns a Row.</p>
<p>As Rows are immutable, a new Row must be created that has the same field order, type, and number as the schema. But, since the schema of the data is known, it’s relatively easy to reconstruct a new Row with the correct fields.</p>
<p><a href="https://github.com/dyanarose/parquet-edit-examples/blob/master/transform-examples/src/main/scala/com/dlr/transform/transformers/StructTypeTransform.scala">StructTypeTransform.scala</a></p>
<div class="language-scala highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">object</span> <span class="nc">StructTypeTransform</span> <span class="o">{</span>
<span class="k">def</span> <span class="nf">transform</span><span class="o">(</span><span class="n">spark</span><span class="k">:</span> <span class="kt">SparkSession</span><span class="o">,</span> <span class="n">sourcePath</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">destPath</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="nv">originalData</span> <span class="k">=</span> <span class="nv">spark</span><span class="o">.</span><span class="py">read</span><span class="o">.</span><span class="py">schema</span><span class="o">(</span><span class="nv">RawDataSchema</span><span class="o">.</span><span class="py">schema</span><span class="o">).</span><span class="py">parquet</span><span class="o">(</span><span class="n">sourcePath</span><span class="o">)</span>
<span class="c1">// cleanFunc will take the struct as a Row and return a new Row with edited fields</span>
<span class="c1">// note that the ordering and count of the fields must remain the same</span>
<span class="k">def</span> <span class="nf">cleanFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">Row</span> <span class="o">=></span> <span class="kt">Row</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span> <span class="n">r</span> <span class="k">=></span> <span class="nv">RowFactory</span><span class="o">.</span><span class="py">create</span><span class="o">(</span><span class="nv">r</span><span class="o">.</span><span class="py">getAs</span><span class="o">[</span><span class="kt">Boolean</span><span class="o">](</span><span class="mi">0</span><span class="o">),</span> <span class="s">""</span><span class="o">)</span> <span class="o">}</span>
<span class="c1">// register the func as a udf</span>
<span class="c1">// give the UDF a schema or the Row type won't be supported</span>
<span class="k">val</span> <span class="nv">clean</span> <span class="k">=</span> <span class="nf">udf</span><span class="o">(</span><span class="n">cleanFunc</span><span class="o">,</span>
<span class="nc">StructType</span><span class="o">(</span>
<span class="nc">StructField</span><span class="o">(</span><span class="s">"myField"</span><span class="o">,</span> <span class="nc">BooleanType</span><span class="o">,</span> <span class="kc">true</span><span class="o">)</span> <span class="o">::</span>
<span class="nc">StructField</span><span class="o">(</span><span class="s">"editMe"</span><span class="o">,</span> <span class="nc">StringType</span><span class="o">,</span> <span class="kc">true</span><span class="o">)</span> <span class="o">::</span>
<span class="nc">Nil</span>
<span class="o">)</span>
<span class="o">)</span>
<span class="c1">// required for the $ column syntax</span>
<span class="k">import</span> <span class="nn">spark.sqlContext.implicits._</span>
<span class="c1">// if you have data that doesn't need editing, you can separate it out</span>
<span class="c1">// The data will need to be in a form that can be unioned with the edited data</span>
<span class="c1">// I do that here by selecting out all the fields.</span>
<span class="k">val</span> <span class="nv">alreadyGoodData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myStruct.editMe is null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myStruct"</span><span class="o">,</span>
<span class="n">$</span><span class="s">"myMap"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// apply the udf to the fields that need editing</span>
<span class="c1">// selecting out all the data that will be present in the final parquet file</span>
<span class="k">val</span> <span class="nv">transformedData</span> <span class="k">=</span> <span class="nv">originalData</span><span class="o">.</span><span class="py">where</span><span class="o">(</span><span class="s">"myStruct.editMe is not null"</span><span class="o">).</span><span class="py">select</span><span class="o">(</span>
<span class="nc">Seq</span><span class="o">[</span><span class="kt">Column</span><span class="o">](</span>
<span class="n">$</span><span class="s">"myField"</span><span class="o">,</span>
<span class="nf">clean</span><span class="o">(</span><span class="n">$</span><span class="s">"myStruct"</span><span class="o">).</span><span class="py">as</span><span class="o">(</span><span class="s">"myStruct"</span><span class="o">),</span>
<span class="n">$</span><span class="s">"myMap"</span>
<span class="o">)</span><span class="k">:_</span><span class="kt">*</span>
<span class="o">)</span>
<span class="c1">// union the two DataFrames</span>
<span class="k">val</span> <span class="nv">allGoodData</span> <span class="k">=</span> <span class="nv">alreadyGoodData</span><span class="o">.</span><span class="py">union</span><span class="o">(</span><span class="n">transformedData</span><span class="o">)</span>
<span class="c1">// write out the final edited data</span>
<span class="nv">allGoodData</span><span class="o">.</span><span class="py">write</span><span class="o">.</span><span class="py">parquet</span><span class="o">(</span><span class="n">destPath</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
</code></pre></div></div>
<h3 id="finally">Finally</h3>
<p>Always test your transforms before you delete the original data!</p>Dyana RoseYou’ve uncovered a problem in your beautiful parquet files, some piece of data either snuck in, or was calculated incorrectly, or there was just a bug. You know exactly how to correct the data, but how do you update the files?Exploring Spark SQL DataTypes2016-04-09T12:57:06+01:002016-04-09T12:57:06+01:00http://dyanarose.github.io/blog/2016/04/09/exploring-spark-sql-datatypes<p>I’ve been exploring how different DataTypes in Spark SQL are imported from line delimited json to try to understand which DataTypes can be used for a semi-structured data set I’m converting to parquet files. The data won’t all be processed at once and the schema will need to grow, so it’s imperative that the parquet files have schemas that are compatible.</p>
<p>The only one I really can’t get working yet is the CalendarIntervalType.</p>
<p>Looking at the Spark source files <a href="https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala">literals.scala</a> and <a href="https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java">CalendarInterval.java</a>, I would assume that <code class="language-plaintext highlighter-rouge">CalendarInterval.fromString</code> is called with the value. However, I just get nulls back when passing in a value like ‘interval 2 days’, even though that string, when passed directly to <code class="language-plaintext highlighter-rouge">CalendarInterval.fromString</code>, returns a non-null <code class="language-plaintext highlighter-rouge">CalendarInterval</code>.</p>
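<p>For reference, a minimal sketch of the behaviour I expected. Note this calls Spark’s <em>internal</em> <code class="language-plaintext highlighter-rouge">CalendarInterval</code> API as it appears in the source linked above; it is not a public, stable API, and the method may differ or be absent in other Spark versions.</p>

```scala
// Parsing interval strings with Spark's internal CalendarInterval API.
// Per the linked CalendarInterval.java, fromString returns null for
// unparseable input rather than throwing.
import org.apache.spark.unsafe.types.CalendarInterval

val parsed = CalendarInterval.fromString("interval 2 days") // a non-null CalendarInterval
val bad    = CalendarInterval.fromString("interval a")      // null: not a valid interval
```

<p>So the string form itself parses fine when fed to the parser directly; it’s only on the DataFrame read path that the values come back null.</p>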
<p>Source code for the tests is at <a href="https://github.com/dyanarose/dlr-spark">dlr-spark</a>.</p>
<p>Results:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-------------- DecimalType --------------
------- DecimalType Input
{'decimal': 1.2345}
{'decimal': 1}
{'decimal': 234.231}
{'decimal': Infinity}
{'decimal': -Infinity}
{'decimal': NaN}
{'decimal': '1'}
{'decimal': '1.2345'}
{'decimal': null}
------- DecimalType Inferred Schema
root
|-- decimal: string (nullable = true)
+-----------+
| decimal|
+-----------+
| 1.2345|
| 1|
| 234.231|
| "Infinity"|
|"-Infinity"|
| "NaN"|
| 1|
| 1.2345|
| null|
+-----------+
------- DecimalType Set Schema
root
|-- decimal: decimal(6,3) (nullable = true)
+-------+
|decimal|
+-------+
| 1.235|
| 1.000|
|234.231|
| null|
| null|
| null|
| null|
| null|
| null|
+-------+
-------------- BooleanType --------------
------- BooleanType Input
{'boolean': true}
{'boolean': false}
{'boolean': 'false'}
{'boolean': 'true'}
{'boolean': null}
{'boolean': 1}
{'boolean': 0}
{'boolean': '1'}
{'boolean': '0'}
{'boolean': 'a'}
------- BooleanType Inferred Schema
root
|-- boolean: string (nullable = true)
+-------+
|boolean|
+-------+
| true|
| false|
| false|
| true|
| null|
| 1|
| 0|
| 1|
| 0|
| a|
+-------+
------- BooleanType Set Schema
root
|-- boolean: boolean (nullable = true)
+-------+
|boolean|
+-------+
| true|
| false|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+-------+
-------------- ByteType --------------
------- ByteType Input
{'byte': 'a'}
{'byte': 'b'}
{'byte': 1}
{'byte': 0}
{'byte': 5}
{'byte': null}
------- ByteType Inferred Schema
root
|-- byte: string (nullable = true)
+----+
|byte|
+----+
| a|
| b|
| 1|
| 0|
| 5|
|null|
+----+
------- ByteType Set Schema
root
|-- byte: byte (nullable = true)
+----+
|byte|
+----+
|null|
|null|
| 1|
| 0|
| 5|
|null|
+----+
-------------- CalendarIntervalType --------------
------- CalendarIntervalType Input
{'calendarInterval': 'interval 2 days'}
{'calendarInterval': 'interval 1 week'}
{'calendarInterval': 'interval 5 years'}
{'calendarInterval': 'interval 6 months'}
{'calendarInterval': 10}
{'calendarInterval': 'interval a'}
{'calendarInterval': null}
------- CalendarIntervalType Inferred Schema
root
|-- calendarInterval: string (nullable = true)
+-----------------+
| calendarInterval|
+-----------------+
| interval 2 days|
| interval 1 week|
| interval 5 years|
|interval 6 months|
| 10|
| interval a|
| null|
+-----------------+
------- CalendarIntervalType Set Schema
root
|-- calendarInterval: calendarinterval (nullable = true)
+----------------+
|calendarInterval|
+----------------+
| null|
| null|
| null|
| null|
| null|
| null|
| null|
+----------------+
-------------- DateType --------------
------- DateType Input
{'date': '2016-04-24'}
{'date': '0001-01-01'}
{'date': '9999-12-31'}
{'date': '2016-04-24 12:10:01'}
{'date': 1461496201000}
{'date': null}
------- DateType Inferred Schema
root
|-- date: string (nullable = true)
+-------------------+
| date|
+-------------------+
| 2016-04-24|
| 0001-01-01|
| 9999-12-31|
|2016-04-24 12:10:01|
| 1461496201000|
| null|
+-------------------+
------- DateType Set Schema
root
|-- date: date (nullable = true)
+----------+
| date|
+----------+
|2016-04-24|
|0001-01-01|
|9999-12-31|
|2016-04-24|
| null|
| null|
+----------+
-------------- DoubleType --------------
------- DoubleType Input
{'double': 1.23456}
{'double': 1}
{'double': 1.7976931348623157E308}
{'double': -1.7976931348623157E308}
{'double': Infinity}
{'double': -Infinity}
{'double': NaN}
{'double': '1'}
{'double': '1.23456'}
{'double': null}
------- DoubleType Inferred Schema
root
|-- double: string (nullable = true)
+--------------------+
| double|
+--------------------+
| 1.23456|
| 1|
|1.797693134862315...|
|-1.79769313486231...|
| "Infinity"|
| "-Infinity"|
| "NaN"|
| 1|
| 1.23456|
| null|
+--------------------+
------- DoubleType Set Schema
root
|-- double: double (nullable = true)
+--------------------+
| double|
+--------------------+
| 1.23456|
| 1.0|
|1.797693134862315...|
|-1.79769313486231...|
| Infinity|
| -Infinity|
| NaN|
| null|
| null|
| null|
+--------------------+
-------------- FloatType --------------
------- FloatType Input
{'float': 1.23456}
{'float': 1}
{'float': 3.4028235E38}
{'float': -3.4028235E38}
{'float': Infinity}
{'float': -Infinity}
{'float': NaN}
{'float': '1'}
{'float': '1.23456'}
{'float': null}
------- FloatType Inferred Schema
root
|-- float: string (nullable = true)
+-------------+
| float|
+-------------+
| 1.23456|
| 1|
| 3.4028235E38|
|-3.4028235E38|
| "Infinity"|
| "-Infinity"|
| "NaN"|
| 1|
| 1.23456|
| null|
+-------------+
------- FloatType Set Schema
root
|-- float: float (nullable = true)
+-------------+
| float|
+-------------+
| 1.23456|
| 1.0|
| 3.4028235E38|
|-3.4028235E38|
| Infinity|
| -Infinity|
| NaN|
| null|
| null|
| null|
+-------------+
-------------- IntegerType --------------
------- IntegerType Input
{'integer': 1}
{'integer': 2147483647}
{'integer': -2147483648}
{'integer': 2147483648}
{'integer': '1'}
{'integer': 1.23456}
{'integer': '1.23456'}
{'integer': null}
------- IntegerType Inferred Schema
root
|-- integer: string (nullable = true)
+-----------+
| integer|
+-----------+
| 1|
| 2147483647|
|-2147483648|
| 2147483648|
| 1|
| 1.23456|
| 1.23456|
| null|
+-----------+
------- IntegerType Set Schema
root
|-- integer: integer (nullable = true)
+-----------+
| integer|
+-----------+
| 1|
| 2147483647|
|-2147483648|
| null|
| null|
| null|
| null|
| null|
+-----------+
-------------- LongType --------------
------- LongType Input
{'long': 1}
{'long': 9223372036854775807}
{'long': -9223372036854775808}
{'long': '1'}
{'long': 1.23456}
{'long': '1.23456'}
{'long': null}
------- LongType Inferred Schema
root
|-- long: string (nullable = true)
+--------------------+
| long|
+--------------------+
| 1|
| 9223372036854775807|
|-9223372036854775808|
| 1|
| 1.23456|
| 1.23456|
| null|
+--------------------+
------- LongType Set Schema
root
|-- long: long (nullable = true)
+--------------------+
| long|
+--------------------+
| 1|
| 9223372036854775807|
|-9223372036854775808|
| null|
| null|
| null|
| null|
+--------------------+
-------------- MapType --------------
------- MapType Input
{'map': {'a_key': 'a value', 'b_key': 'b value'}}
{'map': {'key': 1, 'key1': null}}
{'map': null}
------- MapType Inferred Schema
root
|-- map: struct (nullable = true)
| |-- a_key: string (nullable = true)
| |-- b_key: string (nullable = true)
| |-- key: long (nullable = true)
| |-- key1: string (nullable = true)
+--------------------+
| map|
+--------------------+
|[a value,b value,...|
| [null,null,1,null]|
| null|
+--------------------+
------- MapType Set Schema
root
|-- map: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
+--------------------+
| map|
+--------------------+
|Map(a_key -> a va...|
|Map(key -> 1, key...|
| null|
+--------------------+
-------------- NullType --------------
------- NullType Input
{'null': null}
{'null': true}
{'null': false}
{'null': 1}
{'null': 0}
{'null': '1'}
{'null': '0'}
{'null': 'a'}
------- NullType Inferred Schema
root
|-- null: string (nullable = true)
+-----+
| null|
+-----+
| null|
| true|
|false|
| 1|
| 0|
| 1|
| 0|
| a|
+-----+
------- NullType Set Schema
root
|-- null: null (nullable = true)
+----+
|null|
+----+
|null|
|null|
|null|
|null|
|null|
|null|
|null|
|null|
+----+
-------------- ShortType --------------
------- ShortType Input
{'short': 0}
{'short': 1}
{'short': 32767}
{'short': -32768}
{'short': 32768}
{'short': 1.23456}
{'short': '0'}
{'short': '1'}
{'short': '1.23456'}
{'short': null}
------- ShortType Inferred Schema
root
|-- short: string (nullable = true)
+-------+
| short|
+-------+
| 0|
| 1|
| 32767|
| -32768|
| 32768|
|1.23456|
| 0|
| 1|
|1.23456|
| null|
+-------+
------- ShortType Set Schema
root
|-- short: short (nullable = true)
+------+
| short|
+------+
| 0|
| 1|
| 32767|
|-32768|
| null|
| null|
| null|
| null|
| null|
| null|
+------+
-------------- TimestampType --------------
------- TimestampType Input
{'timestamp': '2016-04-24'}
{'timestamp': '2016-04-24 12:10:01'}
{'timestamp': 1461496201000}
{'timestamp': '0001-01-01'}
{'timestamp': '9999-12-31'}
{'timestamp': null}
------- TimestampType Inferred Schema
root
|-- timestamp: string (nullable = true)
+-------------------+
| timestamp|
+-------------------+
| 2016-04-24|
|2016-04-24 12:10:01|
| 1461496201000|
| 0001-01-01|
| 9999-12-31|
| null|
+-------------------+
------- TimestampType Set Schema
root
|-- timestamp: timestamp (nullable = true)
+--------------------+
| timestamp|
+--------------------+
|2016-04-24 00:00:...|
|2016-04-24 12:10:...|
|2016-04-24 12:10:...|
|0001-01-01 00:00:...|
|9999-12-31 00:00:...|
| null|
+--------------------+
</code></pre></div></div>Dyana RoseI’ve been exploring how different DataTypes in Spark SQL are imported from line delimited json to try to understand which DataTypes can be used for a semi-structured data set I’m converting to parquet files. The data won’t all be processed at once and the schema will need to grow, so it’s imperative that the parquet files have schemas that are compatible.Preventing Duplication when Creating Relationships in Neo4j2014-07-08T20:50:10+01:002014-07-08T20:50:10+01:00http://dyanarose.github.io/blog/2014/07/08/preventing-duplication-when-creating-relationships-in-neo4j<p>Creating relationships between known nodes using Cypher in Neo4j is simple.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MATCH (p:Person), (e:Episode)
CREATE (p) - [:INTERVIEWED_IN] -> (e)
</code></pre></div></div>
<p>But what if you don’t know if one of the nodes exists? And further, what if you don’t know if the relationship itself already exists?</p>
<ul>
<li>If the node doesn’t exist, I want it to be created.</li>
<li>If the relationship doesn’t exist, I want it to be created.</li>
<li>If both node and relationship exist, then nothing should be changed</li>
</ul>
<p>The simple scenario is of a set of Episode nodes and a set of Person nodes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>CREATE (e:Episode {title: "foo", subtitle: "bar"})
RETURN e
CREATE (p:Person {name: "Lynn Rose"})
RETURN p
</code></pre></div></div>
<p>The Episode nodes are known to exist. The Person nodes may or may not exist.</p>
<p>My first attempts at getting this right in my working database were tragic failures.
For example, the following statement, while working as intended in a new 2.0.1 database, fails and creates duplicate Person nodes in my working database. (Adding a unique constraint on person.name causes the statement to throw an exception rather than create a duplicate.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MATCH (e:Episode {title: "foo"})
CREATE UNIQUE (e) <- [:INTERVIEWED_IN] - (p:Person {name: "Lynn Rose"})
</code></pre></div></div>
<p>As I said, the duplication doesn’t happen in a clean 2.0.1 database, so the problem must be with my working database.</p>
<ol>
<li>The database was originally created on Neo4j 1.9.3 and is now running under Neo4j 2.0.1</li>
<li>The nodes created before the release of 2.x use the old style indexes.</li>
</ol>
<p>But those facts aside, I still need to stop the duplication! So back to the <a href="http://docs.neo4j.org/chunked/stable/cypher-query-lang.html">Cypher documentation</a>.</p>
<p>The Cypher documentation for <a href="http://docs.neo4j.org/chunked/stable/query-create-unique.html">CREATE UNIQUE</a> includes the following in a callout box:</p>
<blockquote>
<p><a href="http://docs.neo4j.org/chunked/stable/query-merge.html">MERGE</a> might be what you want to use instead of CREATE UNIQUE</p>
</blockquote>
<p>It’s MERGE that gives the ability to control what happens when a node is, or isn’t, matched. It does this through the syntax of ON MATCH and ON CREATE.</p>
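<p>As a quick sketch of that syntax (the <code class="language-plaintext highlighter-rouge">created</code> and <code class="language-plaintext highlighter-rouge">lastSeen</code> properties here are purely illustrative, not part of my data model), MERGE lets you set properties differently depending on whether the node was found or had to be created:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MERGE (p:Person {name: "Lynn Rose"})
ON CREATE SET p.created = timestamp()
ON MATCH SET p.lastSeen = timestamp()
RETURN p
</code></pre></div></div>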
<p>Using MERGE with ON CREATE, I can get a handle on an existing Person node to use in the relationship creation, thus preventing duplication of Person nodes.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MATCH (e:Episode {title: "foo"})
MERGE (p:Person {name: "Lynn Rose"})
ON CREATE SET p.firstname = "Lynn", p.surname = "Rose"
MERGE (e) <- [:INTERVIEWED_IN] - (p)
</code></pre></div></div>
<p>But we’ve still got an issue here: this doesn’t necessarily solve the problem of duplicate relationships.</p>
<p>That callout box in the CREATE UNIQUE documentation goes on to say:</p>
<blockquote>
<p>Note however, that MERGE doesn’t give as strong guarantees for relationships being unique.</p>
</blockquote>
<p>So I take from this that I should use MERGE to prevent node duplication, but CREATE UNIQUE should be used to prevent relationship duplication.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>MATCH (e:Episode {title: "foo"})
MERGE (p:Person {name: "Lynn Rose"})
ON CREATE SET p.firstname = "Lynn", p.surname = "Rose"
CREATE UNIQUE (e) <- [:INTERVIEWED_IN] - (p)
</code></pre></div></div>
<p>And here we are.</p>
<ul>
<li>If the node doesn’t exist, it is created using MERGE ON CREATE.</li>
<li>If the relationship doesn’t exist, it is created using CREATE UNIQUE.</li>
<li>If both node and relationship exist, then nothing is changed.</li>
</ul>Dyana RoseCreating relationships between known nodes using Cypher in Neo4j is simple.